R/corpus_functions.R
CreateTcm.Rd
This is the main function for creating a term co-occurrence matrix (TCM) in textmineR.
In most cases, all you need to do is import documents as a character vector in R and then
run this function to get a term co-occurrence matrix that is compatible with the
rest of textmineR's functionality and many other libraries. CreateTcm
is built on top of the excellent text2vec library.
A character vector of documents.
An integer window, from 0 to Inf for
skip-grams. Defaults to Inf. See 'Details', below.
A numeric vector of length 2. The first entry is the minimum
n-gram size; the second entry is the maximum n-gram size. Defaults to
c(1, 1). Must be c(1, 1) if skipgram_window is
not 0 or Inf.
A character vector of stopwords you would like to remove.
Defaults to c(stopwords::stopwords("en"), stopwords::stopwords(source = "smart")).
If you do not want stopwords removed, specify stopword_vec = c().
Do you want all words coerced to lower case? Defaults to TRUE.
Do you want to convert all non-alphanumeric
characters to spaces? Defaults to TRUE.
Do you want to convert all numbers to spaces? Defaults
to TRUE.
A function that you would like to apply to the documents for stemming, lemmatization, or similar. See examples for usage.
Defaults to TRUE. Do you want to see status during
vectorization?
Other arguments to be passed to TmParallelApply.
A term co-occurrence matrix of class dgCMatrix. The i, j entries
represent the co-occurrence count for items i and j, as determined by
skipgram_window. See 'Details', below.
Setting skipgram_window to a positive integer counts the number of times
that term j appears within skipgram_window places of term i.
Inf and 0 create somewhat special TCMs. Setting skipgram_window
to Inf counts the number of documents in which term j
and term i occur together. Setting skipgram_window
to 0 counts the number of terms shared by document j
and document i. A TCM where skipgram_window
is 0 is the only TCM that will be symmetric.
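The two special cases can be illustrated on a toy matrix in base R. This is only a sketch of the counting definitions described above, not of how CreateTcm computes or stores its result; the matrix dtm and the names present, term_by_term, and doc_by_doc are hypothetical and exist only for this illustration.

```r
# Toy document-term counts: 3 documents (rows) by 3 terms (columns).
# These values are made up for illustration.
dtm <- matrix(c(1, 1, 0,
                0, 2, 1,
                1, 0, 1),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("d1", "d2", "d3"),
                              c("apple", "banana", "cherry")))

# 1 if the term occurs in the document at all, 0 otherwise.
present <- (dtm > 0) * 1

# Counting rule described for skipgram_window = Inf:
# entry (i, j) is the number of documents containing both term i and term j.
term_by_term <- crossprod(present)   # same as t(present) %*% present

# Counting rule described for skipgram_window = 0:
# entry (i, j) is the number of terms shared by document i and document j.
doc_by_doc <- tcrossprod(present)    # same as present %*% t(present)

print(term_by_term)
print(doc_by_doc)
```

Note that "banana" appears twice in d2 but contributes only once to each count, because both rules count shared documents or shared terms rather than raw occurrences.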
The following transformations are applied to stopword_vec as
well as doc_vec: lower, remove_punctuation, and remove_numbers.
See stopwords for details on the default value of the
stopword_vec argument.
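Applying the same cleaning steps to both vectors keeps the stopwords in the same form as the cleaned documents. The sketch below shows the idea in base R. The helper normalize_text and its regular expressions are assumptions made for illustration; they are not taken from textmineR's source and may differ from what CreateTcm does internally.

```r
# Hypothetical helper mirroring the three documented options:
# lower, remove_punctuation, remove_numbers.
normalize_text <- function(x) {
  x <- tolower(x)                     # lower: coerce to lower case
  x <- gsub("[^[:alnum:]]", " ", x)   # remove_punctuation: non-alphanumerics to spaces
  x <- gsub("[[:digit:]]", " ", x)    # remove_numbers: digits to spaces
  x
}

docs      <- c("I don't know.", "Top10 results")
stopwords <- c("Don't", "I")

# Both vectors pass through the same function, so they end up in the same form.
clean_docs      <- normalize_text(docs)
clean_stopwords <- normalize_text(stopwords)

print(clean_docs)
print(clean_stopwords)
```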
if (FALSE) {
  library(textmineR)

  data(nih_sample)

  # TCM of unigrams and bigrams
  tcm <- CreateTcm(doc_vec = nih_sample$ABSTRACT_TEXT,
                   skipgram_window = Inf,
                   ngram_window = c(1, 2))

  # TCM of unigrams and a skip-gram window of 3, applying Porter's word stemmer
  tcm <- CreateTcm(doc_vec = nih_sample$ABSTRACT_TEXT,
                   skipgram_window = 3,
                   stem_lemma_function = function(x) SnowballC::wordStem(x, "porter"))
}