This is the main function for creating a term co-occurrence matrix (TCM) in textmineR. In most cases, all you need to do is import documents as a character vector in R and then run this function to get a TCM that is compatible with the rest of textmineR's functionality and many other libraries. CreateTcm is built on top of the excellent text2vec library.

CreateTcm(
  doc_vec,
  skipgram_window = Inf,
  ngram_window = c(1, 1),
  stopword_vec = c(stopwords::stopwords("en"), stopwords::stopwords(source = "smart")),
  lower = TRUE,
  remove_punctuation = TRUE,
  remove_numbers = TRUE,
  stem_lemma_function = NULL,
  verbose = FALSE,
  ...
)

Arguments

doc_vec

A character vector of documents.

skipgram_window

An integer window for skip-grams, from 0 to Inf. Defaults to Inf. See 'Details', below.

ngram_window

A numeric vector of length 2. The first entry is the minimum n-gram size; the second entry is the maximum n-gram size. Defaults to c(1, 1). Must be c(1, 1) if skipgram_window is not 0 or Inf.

stopword_vec

A character vector of stopwords you would like to remove. Defaults to c(stopwords::stopwords("en"), stopwords::stopwords(source = "smart")). If you do not want stopwords removed, specify stopword_vec = c().

lower

Do you want all words coerced to lower case? Defaults to TRUE.

remove_punctuation

Do you want to convert all non-alphanumeric characters to spaces? Defaults to TRUE.

remove_numbers

Do you want to convert all numbers to spaces? Defaults to TRUE.

stem_lemma_function

A function that you would like to apply to the documents for stemming, lemmatization, or similar. See examples for usage.

verbose

Do you want to see status during vectorization? Defaults to FALSE.

...

Other arguments to be passed to TmParallelApply.

Value

A term co-occurrence matrix of class dgCMatrix. The rows and columns index terms. The i, j entries represent the number of times term j co-occurs with term i, as determined by skipgram_window (see 'Details').

Details

Setting skipgram_window counts the number of times that term j appears within skipgram_window places of term i. Inf and 0 create somewhat special TCMs. Setting skipgram_window to Inf counts the number of documents in which term j and term i occur together. Setting skipgram_window to 0 counts the number of terms shared by document j and document i. A TCM where skipgram_window is 0 is the only TCM that will be symmetric.
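The two counting rules described above, a finite window and an infinite window, can be sketched in base R. This is only an illustration of the idea, not the text2vec-based code that CreateTcm runs: the helpers count_window and count_documents are hypothetical names, the finite-window helper counts only terms that follow a given term, and how CreateTcm stores its result (for example, whether both triangles of the matrix are filled) is not modeled here.

```r
# Illustrative sketch only; these helpers are hypothetical and are not
# part of textmineR or text2vec.

# Finite window: for each token, count the terms that follow it within
# `window` positions in a single tokenized document.
count_window <- function(tokens, window) {
  vocab <- sort(unique(tokens))
  tcm <- matrix(0L, nrow = length(vocab), ncol = length(vocab),
                dimnames = list(vocab, vocab))
  pos <- seq_along(tokens)
  for (i in pos) {
    # positions after i that still fall inside the window
    for (j in pos[pos > i & pos <= i + window]) {
      tcm[tokens[i], tokens[j]] <- tcm[tokens[i], tokens[j]] + 1L
    }
  }
  tcm
}

# Infinite window: count the documents in which two terms both appear.
# `dtm` is a documents-by-terms count matrix.
count_documents <- function(dtm) {
  present <- (dtm > 0) * 1L   # 1 if the term occurs in the document at all
  crossprod(present)          # t(present) %*% present
}

window_tcm <- count_window(c("a", "b", "c", "a"), window = 2)

dtm <- matrix(c(2, 1, 0,
                0, 1, 1),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("doc1", "doc2"), c("a", "b", "c")))
doc_tcm <- count_documents(dtm)
```

In this toy input, window_tcm["a", "b"] is 1 because "b" follows "a" once within two positions, and doc_tcm["a", "b"] is 1 because only doc1 contains both terms.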

Note

The following transformations are applied to stopword_vec as well as doc_vec: lower, remove_punctuation, and remove_numbers.

See stopwords for details on the default value of the stopword_vec argument.
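To see why the note on stopword transformations matters, consider a stopword that contains punctuation. The sketch below uses base R and a hypothetical helper, normalize, with assumed regular expressions; the exact patterns CreateTcm applies internally may differ.

```r
# Assumed normalization steps; the exact patterns used by CreateTcm may differ.
# `normalize` is a hypothetical helper, not a textmineR function.
normalize <- function(x) {
  x <- tolower(x)                     # mirrors lower = TRUE
  x <- gsub("[^[:alnum:]]", " ", x)   # mirrors remove_punctuation = TRUE
  x <- gsub("[0-9]", " ", x)          # mirrors remove_numbers = TRUE
  x
}

stopword_vec <- c("Don't", "The")
normalized <- normalize(stopword_vec)
normalized
# "don t" "the"
```

Because the documents go through the same steps, a stopword such as "Don't" lines up with the "don t" that appears in the transformed text.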

Examples

if (FALSE) {
data(nih_sample)

# TCM of unigrams and bigrams
tcm <- CreateTcm(doc_vec = nih_sample$ABSTRACT_TEXT,
                 skipgram_window = Inf, 
                 ngram_window = c(1, 2))

# TCM of unigrams and a skip-gram window of 3, applying Porter's word stemmer
tcm <- CreateTcm(doc_vec = nih_sample$ABSTRACT_TEXT,
                 skipgram_window = 3,
                 stem_lemma_function = function(x) SnowballC::wordStem(x, "porter"))
}