This is the main document term matrix creating function for textmineR.
In most cases, all you need to do is import documents as a character vector in R and then
run this function to get a document term matrix that is compatible with the
rest of textmineR's functionality and many other libraries. CreateDtm
is built on top of the excellent text2vec package.
A character vector of documents.
A vector of names for your documents. Defaults to
. If NULL, then doc_names is set to be
A numeric vector of length 2. The first entry is the minimum
n-gram size; the second entry is the maximum n-gram size. Defaults to c(1, 1).
A character vector of stopwords you would like to remove.
Defaults to c(stopwords::stopwords("en"), stopwords::stopwords(source = "smart")).
If you do not want stopwords removed, specify stopword_vec = c().
Do you want all words coerced to lower case? Defaults to TRUE.
Do you want to convert all non-alphanumeric characters to spaces? Defaults to TRUE.
Do you want to convert all numbers to spaces? Defaults
A function that you would like to apply to the documents for stemming, lemmatization, or similar. See examples for usage.
Defaults to TRUE. Do you want to see status during
Other arguments to be passed to TmParallelApply.
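The text-cleaning and n-gram options described above can be approximated in base R. This is a minimal sketch for intuition only, not textmineR's or text2vec's actual implementation: the helper names sketch_clean and sketch_ngrams are hypothetical, and the order of the cleaning steps, the regular expressions, and the "_" delimiter used to join n-grams are all assumptions.

```r
# Hypothetical helper, NOT textmineR's implementation: a base-R approximation
# of the lower-casing, punctuation, and number options.
sketch_clean <- function(x, lower = TRUE, remove_punctuation = TRUE,
                         remove_numbers = TRUE) {
  if (lower) x <- tolower(x)
  # Replace every non-alphanumeric character with a space
  if (remove_punctuation) x <- gsub("[^[:alnum:]]", " ", x)
  # Replace every digit with a space
  if (remove_numbers) x <- gsub("[[:digit:]]", " ", x)
  # Collapse runs of whitespace and trim both ends
  trimws(gsub("\\s+", " ", x))
}

# Hypothetical helper: build n-grams for every size in ngram_window,
# joining tokens with "_" (an assumed delimiter).
sketch_ngrams <- function(tokens, ngram_window = c(1, 1)) {
  out <- character(0)
  for (n in seq(ngram_window[1], ngram_window[2])) {
    if (length(tokens) < n) next
    starts <- seq_len(length(tokens) - n + 1)
    grams <- vapply(
      starts,
      function(s) paste(tokens[s:(s + n - 1)], collapse = "_"),
      character(1)
    )
    out <- c(out, grams)
  }
  out
}

cleaned <- sketch_clean("Text-mining in R, since 2016!")
tokens <- strsplit(cleaned, " ", fixed = TRUE)[[1]]
grams <- sketch_ngrams(tokens, ngram_window = c(1, 2))
print(cleaned)
print(grams)
```

With a window of c(1, 2), the result holds every unigram followed by every bigram, which is why widening the window can grow the number of columns in the resulting matrix quickly.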
A document term matrix of class dgCMatrix. The rows index
documents. The columns index terms. The i, j entries represent the count of
term j appearing in document i.
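To make the layout of the return value concrete, the following sketch builds a tiny sparse matrix by hand with the Matrix package rather than by calling CreateDtm. The document names, terms, and counts are invented for illustration only.

```r
library(Matrix)

# Toy counts: 2 documents by 3 terms, entered as (row, column, value) triplets
toy_dtm <- sparseMatrix(
  i = c(1, 1, 2, 2),
  j = c(1, 2, 2, 3),
  x = c(2, 1, 1, 3),
  dims = c(2, 3),
  dimnames = list(c("doc_a", "doc_b"), c("gene", "protein", "cell"))
)

# Entry (i, j) is the count of term j in document i
cell_in_doc_b <- toy_dtm["doc_b", "cell"]

# Column sums give corpus-wide term frequencies;
# row sums give the number of counted tokens per document
term_totals <- colSums(toy_dtm)
doc_lengths <- rowSums(toy_dtm)

print(class(toy_dtm))
print(term_totals)
print(doc_lengths)
```

Because the matrix is stored sparsely, only the non-zero counts occupy memory, which matters when a corpus has many thousands of distinct terms.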
The following transformations are applied to stopword_vec as
well as doc_vec
See stopwords for details on the default to the stopword_vec argument.
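A short base-R sketch shows why the stopwords need the same treatment as the documents. It assumes simple whitespace tokenization and uses only lower-casing as the shared transformation; it does not call textmineR, and the sample sentence and stopwords are invented.

```r
# Stopwords supplied in mixed case
stopword_vec <- c("The", "AND")

# Tokens from a document that has already been lower-cased
tokens <- strsplit(tolower("The cat AND the hat"), " ", fixed = TRUE)[[1]]

# Without transforming the stopwords, nothing matches the lower-cased tokens
kept_raw <- tokens[!tokens %in% stopword_vec]

# Applying the same lower-casing to the stopwords makes the filter work
kept_clean <- tokens[!tokens %in% tolower(stopword_vec)]

print(kept_raw)
print(kept_clean)
```

If the stopwords were left untransformed, a custom list written in title case or containing punctuation could silently fail to remove anything.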
if (FALSE) {
library(textmineR)

# Load the example data shipped with textmineR
data(nih_sample)

# DTM of unigrams and bigrams
dtm <- CreateDtm(doc_vec = nih_sample$ABSTRACT_TEXT,
                 doc_names = nih_sample$APPLICATION_ID,
                 ngram_window = c(1, 2))

# DTM of unigrams with Porter's stemmer applied
# (requires the SnowballC package)
dtm <- CreateDtm(doc_vec = nih_sample$ABSTRACT_TEXT,
                 doc_names = nih_sample$APPLICATION_ID,
                 stem_lemma_function = function(x) SnowballC::wordStem(x, "porter"))
}