CreateDtm is the main function in textmineR for creating a document term matrix (DTM). In most cases, you only need to import your documents into R as a character vector and pass that vector to this function. The result is a document term matrix that is compatible with the rest of textmineR's functionality and with many other libraries. CreateDtm is built on top of the excellent text2vec library.

CreateDtm(doc_vec, doc_names = names(doc_vec), ngram_window = c(1, 1),
  stopword_vec = c(stopwords::stopwords("en"),
  stopwords::stopwords(source = "smart")), lower = TRUE,
  remove_punctuation = TRUE, remove_numbers = TRUE,
  stem_lemma_function = NULL, verbose = FALSE, ...)
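
As a minimal sketch of the call above, the following builds a DTM from a two-document toy corpus using only the default arguments. The corpus and its names (doc_a, doc_b) are made up for illustration; the example assumes textmineR is installed.

```r
library(textmineR)

# Hypothetical toy corpus. Because the vector is named, the default
# doc_names = names(doc_vec) supplies the document names.
docs <- c(
  doc_a = "The cat sat on the mat.",
  doc_b = "The dog chased the cat around the garden."
)

# All other arguments are left at their defaults
dtm <- CreateDtm(doc_vec = docs)

dim(dtm)       # number of documents, number of terms
rownames(dtm)  # document names
colnames(dtm)  # terms that survived preprocessing
```

Naming the input vector avoids having to pass doc_names separately.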

Arguments

doc_vec

A character vector of documents.

doc_names

A vector of names for your documents. Defaults to names(doc_vec). If NULL, then doc_names is set to be 1:length(doc_vec).

ngram_window

A numeric vector of length 2. The first entry is the minimum n-gram size; the second entry is the maximum n-gram size. Defaults to c(1, 1).
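
To illustrate how ngram_window changes the vocabulary, here is a small sketch comparing a unigram-only DTM with one that also includes bigrams. The toy documents are hypothetical, and the exact form of the n-gram column names is not assumed here; the example only lists whichever columns are new.

```r
library(textmineR)

# Hypothetical toy corpus
docs <- c(
  doc_a = "Topic models summarize large text collections.",
  doc_b = "Large text collections need scalable topic models."
)

# Unigrams only (the default window)
dtm_uni <- CreateDtm(docs, ngram_window = c(1, 1))

# Unigrams and bigrams
dtm_uni_bi <- CreateDtm(docs, ngram_window = c(1, 2))

# The wider window should add columns on top of the unigram columns
ncol(dtm_uni)
ncol(dtm_uni_bi)

# Columns present only in the wider window, i.e. the bigrams
setdiff(colnames(dtm_uni_bi), colnames(dtm_uni))
```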

stopword_vec

A character vector of stopwords you would like to remove. Defaults to c(stopwords::stopwords("en"), stopwords::stopwords(source = "smart")). If you do not want stopwords removed, specify stopword_vec = c().
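
The effect of stopword_vec can be checked by building the same DTM with and without stopword removal. This is a sketch on a hypothetical toy corpus; it relies on "the" being part of the default English stopword lists.

```r
library(textmineR)

# Hypothetical toy corpus
docs <- c(
  doc_a = "The cat sat on the mat.",
  doc_b = "The dog chased the cat around the garden."
)

# Default: common English stopwords are removed
dtm_default <- CreateDtm(docs)

# No stopword removal at all
dtm_all <- CreateDtm(docs, stopword_vec = c())

# Terms that appear only when stopwords are kept
setdiff(colnames(dtm_all), colnames(dtm_default))
```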

lower

Do you want all words coerced to lower case? Defaults to TRUE.

remove_punctuation

Do you want to convert all non-alphanumeric characters to spaces? Defaults to TRUE.

remove_numbers

Do you want to convert all numbers to spaces? Defaults to TRUE.

stem_lemma_function

A function that you would like to apply to the documents for stemming, lemmatization, or similar. See examples for usage.

verbose

Do you want to see status during vectorization? Defaults to FALSE.

...

Other arguments to be passed to TmParallelApply.

Value

A document term matrix of class dgCMatrix. The rows index documents. The columns index terms. The i, j entries represent the count of term j appearing in document i.
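
As a sketch of how the returned matrix can be inspected, the following indexes a single document/term entry and derives term frequencies and document lengths from the counts. The toy corpus is hypothetical; stopword removal is switched off so that every word stays in the vocabulary, and Matrix::colSums and Matrix::rowSums are used because the result is a sparse matrix.

```r
library(textmineR)

# Hypothetical toy corpus
docs <- c(
  doc_a = "apples and oranges and apples",
  doc_b = "oranges and pears"
)

# Keep every word by disabling stopword removal
dtm <- CreateDtm(docs, stopword_vec = c())

class(dtm)               # sparse matrix class from the Matrix package

# Entry (i, j): count of term j in document i
dtm["doc_a", "apples"]

# Corpus-wide frequency of each term (column sums)
term_freq <- Matrix::colSums(dtm)

# Number of counted tokens in each document (row sums)
doc_length <- Matrix::rowSums(dtm)
```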

Note

The following transformations are applied to stopword_vec as well as to doc_vec: lower, remove_punctuation, and remove_numbers.

See stopwords for details on the default value of the stopword_vec argument.
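
One consequence of the note on transformations is that a custom stopword does not have to match the case used in the documents when lower = TRUE. The sketch below, on a hypothetical toy corpus, passes the capitalized stopword "Cat" and checks whether the lower-cased term "cat" is still in the vocabulary.

```r
library(textmineR)

# Hypothetical toy corpus
docs <- c(
  doc_a = "The Cat sat on the mat",
  doc_b = "A cat and a dog"
)

# A single custom stopword, deliberately capitalized
dtm <- CreateDtm(docs, stopword_vec = c("Cat"), lower = TRUE)

# Per the note, the stopword is lower-cased along with the documents,
# so "cat" is expected to be absent from the columns
"cat" %in% colnames(dtm)
"dog" %in% colnames(dtm)
```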

Examples

# NOT RUN {
data(nih_sample)

# DTM of unigrams and bigrams
dtm <- CreateDtm(doc_vec = nih_sample$ABSTRACT_TEXT,
                 doc_names = nih_sample$APPLICATION_ID,
                 ngram_window = c(1, 2))

# DTM of unigrams with Porter's stemmer applied
dtm <- CreateDtm(doc_vec = nih_sample$ABSTRACT_TEXT,
                 doc_names = nih_sample$APPLICATION_ID,
                 stem_lemma_function = function(x) SnowballC::wordStem(x, "porter"))
# }