CreateDtm is textmineR's main function for creating a document term matrix. In most cases, you only need to import your documents into R as a character vector and run this function to get a document term matrix that is compatible with the rest of textmineR's functionality and with many other libraries. CreateDtm is built on top of the excellent text2vec library.

CreateDtm(doc_vec, doc_names = names(doc_vec), ngram_window = c(1, 1),
  stopword_vec = c(stopwords::stopwords("en"),
  stopwords::stopwords(source = "smart")), lower = TRUE,
  remove_punctuation = TRUE, remove_numbers = TRUE,
  stem_lemma_function = NULL, verbose = FALSE, ...)



doc_vec: A character vector of documents.

doc_names: A vector of names for your documents. Defaults to names(doc_vec). If NULL, doc_names is set to 1:length(doc_vec).

ngram_window: A numeric vector of length 2. The first entry is the minimum n-gram size; the second entry is the maximum n-gram size. Defaults to c(1, 1).

stopword_vec: A character vector of stopwords you would like to remove. Defaults to c(stopwords::stopwords("en"), stopwords::stopwords(source = "smart")). If you do not want stopwords removed, specify stopword_vec = c().

lower: Do you want all words coerced to lower case? Defaults to TRUE.

remove_punctuation: Do you want to convert all non-alphanumeric characters to spaces? Defaults to TRUE.

remove_numbers: Do you want to convert all numbers to spaces? Defaults to TRUE.

stem_lemma_function: A function that you would like to apply to the documents for stemming, lemmatization, or similar. See examples for usage.

verbose: Do you want to see status during vectorization? Defaults to FALSE.

...: Other arguments to be passed to TmParallelApply.
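The arguments above can be combined in a single call. The following is a minimal sketch, assuming textmineR is installed; the two-document corpus `docs` and the object name `dtm_toy` are hypothetical and chosen only for illustration.

```r
library(textmineR)

# Hypothetical toy corpus; the names of the vector are used as doc_names.
docs <- c(first  = "The cat sat on the mat.",
          second = "The dog chased the cat in 2019!")

# Count unigrams and bigrams, keep stopwords, and keep numbers.
dtm_toy <- CreateDtm(doc_vec = docs,
                     ngram_window = c(1, 2),
                     stopword_vec = c(),
                     remove_numbers = FALSE,
                     verbose = FALSE)

dim(dtm_toy)             # documents x terms
head(colnames(dtm_toy))  # a few of the terms that were kept
```

Passing stopword_vec = c() keeps words such as "the" in the vocabulary, which is useful for checking how the other flags change the set of terms.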


A document term matrix of class dgCMatrix. The rows index documents. The columns index terms. The i, j entries represent the count of term j appearing in document i.
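Because the result is a sparse dgCMatrix, summaries are best computed with the Matrix package's methods. The sketch below uses a hand-built stand-in rather than real CreateDtm output; `dtm_toy` and its terms are hypothetical.

```r
library(Matrix)

# Hand-built stand-in for a CreateDtm result: 2 documents x 3 terms.
# Rows index documents, columns index terms, entries are counts.
dtm_toy <- sparseMatrix(
  i = c(1, 1, 2, 2),
  j = c(1, 2, 2, 3),
  x = c(2, 1, 1, 3),
  dims = c(2, 3),
  dimnames = list(c("doc_1", "doc_2"), c("cat", "dog", "fish"))
)

term_freq <- colSums(dtm_toy)      # total count of each term over all documents
doc_freq  <- colSums(dtm_toy > 0)  # number of documents containing each term
doc_len   <- rowSums(dtm_toy)      # number of counted tokens in each document
```

Entry `dtm_toy["doc_2", "fish"]` holds the count of "fish" in the second document, matching the i, j convention described above.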


The following transformations are applied to stopword_vec as well as doc_vec: lower, remove_punctuation, remove_numbers.

See stopwords::stopwords for details on the default value of the stopword_vec argument.
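To see why the stopwords are cleaned in the same way as the documents, consider a stopword that contains punctuation. The sketch below is a base R approximation of the three transformations, not textmineR's internal code; the helper `clean_text` and its regular expressions are assumptions made for illustration.

```r
# Hypothetical helper that mimics the three flags in base R.
clean_text <- function(x, lower = TRUE, remove_punctuation = TRUE,
                       remove_numbers = TRUE) {
  if (lower) x <- tolower(x)
  # Replace every non-alphanumeric character with a space.
  if (remove_punctuation) x <- gsub("[^[:alnum:]]", " ", x)
  # Replace every digit with a space.
  if (remove_numbers) x <- gsub("[[:digit:]]", " ", x)
  x
}

doc_clean  <- clean_text("Don't")  # the word as it appears in a document
stop_clean <- clean_text("don't")  # the same word as a stopword entry

# After cleaning, both forms agree, so the stopword can still match.
identical(doc_clean, stop_clean)
```

If only the documents were cleaned, a stopword such as "don't" would no longer match any token in the cleaned text.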



library(textmineR)

# Load the nih_sample data set that ships with textmineR
data(nih_sample)

# DTM of unigrams and bigrams
dtm <- CreateDtm(doc_vec = nih_sample$ABSTRACT_TEXT,
                 doc_names = nih_sample$APPLICATION_ID,
                 ngram_window = c(1, 2))

# DTM of unigrams with Porter's stemmer applied
dtm <- CreateDtm(doc_vec = nih_sample$ABSTRACT_TEXT,
                 doc_names = nih_sample$APPLICATION_ID,
                 stem_lemma_function = function(x) SnowballC::wordStem(x, "porter"))