This is the main function for creating a document term matrix (DTM) in textmineR.
In most cases, all you need to do is import your documents into R as a character
vector and then run this function to get a document term matrix that is
compatible with the rest of textmineR's functionality and with many other
libraries. CreateDtm is built on top of the excellent text2vec library.
doc_vec: A character vector of documents.

doc_names: A vector of names for your documents. Defaults to
names(doc_vec). If NULL, then doc_names is set to
1:length(doc_vec).

ngram_window: A numeric vector of length 2. The first entry is the minimum
n-gram size; the second entry is the maximum n-gram size. Defaults to
c(1, 1).

stopword_vec: A character vector of stopwords you would like to remove.
Defaults to c(stopwords::stopwords("en"), stopwords::stopwords(source = "smart")).
If you do not want stopwords removed, specify stopword_vec = c().

lower: Whether to coerce all words to lower case. Defaults to TRUE.

remove_punctuation: Whether to convert all non-alphanumeric
characters to spaces. Defaults to TRUE.

remove_numbers: Whether to convert all numbers to spaces. Defaults
to TRUE.

stem_lemma_function: A function to apply to the documents for stemming,
lemmatization, or similar. See examples for usage.

verbose: Whether to show status during vectorization. Defaults to TRUE.

...: Other arguments to be passed to TmParallelApply.
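
As a rough illustration of the ngram_window and stopword_vec arguments, the sketch below builds two DTMs from a made-up two-document corpus. The corpus, the document names, and the custom stopword list are all hypothetical and are not part of the package.

```r
library(textmineR)

# Hypothetical toy corpus; the vector names are used as doc_names by default
docs <- c(doc_a = "The quick brown fox jumps over the lazy dog",
          doc_b = "A quick brown dog outpaces a lazy fox")

# Unigrams and bigrams, with stopword removal switched off
dtm_all <- CreateDtm(doc_vec = docs,
                     ngram_window = c(1, 2),
                     stopword_vec = c(),
                     verbose = FALSE)

# Unigrams only, removing a small custom stopword list instead of the default
dtm_custom <- CreateDtm(doc_vec = docs,
                        stopword_vec = c("the", "a", "over"),
                        verbose = FALSE)

# Compare the two vocabularies
dim(dtm_all)
colnames(dtm_custom)
```

Widening ngram_window enlarges the vocabulary quickly, so it can be worth comparing the column counts of the two matrices before settling on a window for a larger corpus.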
A document term matrix of class dgCMatrix. Rows index documents and
columns index terms. Entry (i, j) is the number of times term j appears
in document i.
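
To make the shape of the return value concrete, here is a minimal sketch that builds a DTM from a hypothetical two-document corpus and reads counts back out of it. The corpus is invented for illustration; the summaries use the Matrix package, which provides the dgCMatrix class.

```r
library(textmineR)
library(Matrix)

# Hypothetical toy corpus
docs <- c(doc_a = "apples and oranges and apples",
          doc_b = "oranges and bananas")

# Keep every word so the counts are easy to check by hand
dtm <- CreateDtm(doc_vec = docs, stopword_vec = c(), verbose = FALSE)

class(dtm)   # sparse matrix class from the Matrix package
dim(dtm)     # number of documents by number of terms

# Entry (i, j): how often term j occurs in document i
apples_in_a <- dtm["doc_a", "apples"]

# Corpus-wide term frequencies: sum over the rows of each column
term_freq <- Matrix::colSums(dtm)
sort(term_freq, decreasing = TRUE)

# Document lengths in tokens: sum over the columns of each row
doc_len <- Matrix::rowSums(dtm)
```

Because the matrix is sparse, column and row sums like these stay cheap even when the vocabulary is large, whereas converting to a dense matrix with as.matrix() may not be.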
The following transformations are applied to stopword_vec as
well as doc_vec:
lower,
remove_punctuation,
remove_numbers
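
The reason for transforming both vectors the same way is that a stopword can only match a token if both have been through the same cleaning. The base R sketch below imitates the three steps; normalize_text and its regular expressions are illustrative assumptions, not the package's internal code.

```r
# Hypothetical helper that imitates the lower, remove_punctuation, and
# remove_numbers steps; the regular expressions are assumptions
normalize_text <- function(x,
                           lower = TRUE,
                           remove_punctuation = TRUE,
                           remove_numbers = TRUE) {
  if (lower) x <- tolower(x)
  if (remove_punctuation) x <- gsub("[^[:alnum:]]", " ", x)
  if (remove_numbers) x <- gsub("[0-9]", " ", x)
  x
}

# Raw stopwords with capitals, an apostrophe, and a digit
stopwords_raw <- c("Don't", "The", "2nd")

# Apply the same cleaning to the stopwords and to a document
stopwords_clean <- normalize_text(stopwords_raw)
doc_clean <- normalize_text("Don't stop the 2nd run!")

stopwords_clean
doc_clean
```

Without the shared cleaning, a stopword such as "Don't" would never match, because the apostrophe and the capital letter are gone from the document by the time tokens are compared.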
See stopwords::stopwords for details on the default value of the
stopword_vec argument.
if (FALSE) {
  # Wrapped in if (FALSE) so the example is not run automatically
  library(textmineR)

  # Load the NIH sample data bundled with textmineR
  data(nih_sample)

  # DTM of unigrams and bigrams
  dtm <- CreateDtm(doc_vec = nih_sample$ABSTRACT_TEXT,
                   doc_names = nih_sample$APPLICATION_ID,
                   ngram_window = c(1, 2))

  # DTM of unigrams with Porter's stemmer applied
  dtm <- CreateDtm(doc_vec = nih_sample$ABSTRACT_TEXT,
                   doc_names = nih_sample$APPLICATION_ID,
                   stem_lemma_function = function(x) SnowballC::wordStem(x, "porter"))
}