CreateTcm.Rd
This is the main term co-occurrence matrix creating function for textmineR. In most cases, all you need to do is import documents as a character vector in R and then run this function to get a term co-occurrence matrix that is compatible with the rest of textmineR's functionality and many other libraries. CreateTcm is built on top of the excellent text2vec library.
CreateTcm(doc_vec, skipgram_window = Inf, ngram_window = c(1, 1),
  stopword_vec = c(stopwords::stopwords("en"),
    stopwords::stopwords(source = "smart")),
  lower = TRUE, remove_punctuation = TRUE, remove_numbers = TRUE,
  stem_lemma_function = NULL, verbose = FALSE, ...)
doc_vec | A character vector of documents. |
---|---|
skipgram_window | An integer window, from 0 to Inf. Defaults to Inf. |
ngram_window | A numeric vector of length 2. The first entry is the minimum n-gram size; the second entry is the maximum n-gram size. Defaults to c(1, 1). |
stopword_vec | A character vector of stopwords you would like to remove. Defaults to c(stopwords::stopwords("en"), stopwords::stopwords(source = "smart")). |
lower | Do you want all words coerced to lower case? Defaults to TRUE. |
remove_punctuation | Do you want to convert all non-alphanumeric characters to spaces? Defaults to TRUE. |
remove_numbers | Do you want to convert all numbers to spaces? Defaults to TRUE. |
stem_lemma_function | A function that you would like to apply to the documents for stemming, lemmatization, or similar. See examples for usage. |
verbose | Defaults to FALSE. |
... | Other arguments to be passed to |
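As a rough illustration of what the ngram_window argument describes, the base R sketch below builds every n-gram between a minimum and maximum size from a tokenized document. The helper make_ngrams and the "_" separator are assumptions made for this example; they are not part of textmineR and may not match how CreateTcm joins tokens internally.

```r
# Hypothetical helper (not a textmineR function): build all n-grams whose
# size lies between ngram_window[1] and ngram_window[2], inclusive.
make_ngrams <- function(tokens, ngram_window = c(1, 1), sep = "_") {
  sizes <- seq(ngram_window[1], ngram_window[2])
  unlist(lapply(sizes, function(n) {
    # A document shorter than n tokens contributes no n-grams of that size.
    if (length(tokens) < n) return(character(0))
    starts <- seq_len(length(tokens) - n + 1)
    vapply(
      starts,
      function(s) paste(tokens[s:(s + n - 1)], collapse = sep),
      character(1)
    )
  }))
}

# Unigrams and bigrams, analogous to ngram_window = c(1, 2)
ngrams <- make_ngrams(c("term", "co", "occurrence"), ngram_window = c(1, 2))
print(ngrams)
```

With ngram_window = c(1, 1) the helper returns the tokens unchanged, which corresponds to the default of unigrams only.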
A term co-occurrence matrix (TCM) of class dgCMatrix. In general, the rows and columns both index terms, and the i, j entry is the count of co-occurrences of term i and term j, as determined by skipgram_window.
Setting skipgram_window counts the number of times that term j appears within skipgram_window places of term i. Inf and 0 create somewhat special TCMs. Setting skipgram_window to Inf counts the number of documents in which term j and term i occur together. Setting skipgram_window to 0 counts the number of terms shared by document j and document i. A TCM where skipgram_window is 0 is the only TCM that will be symmetric.
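The two counting schemes described above can be sketched on a toy corpus in base R. This is only an illustration under stated assumptions: the documents are already tokenized, the helper window_tcm is hypothetical, and it counts only terms that follow term i (a forward window). It is not the code that CreateTcm or text2vec runs.

```r
# Toy corpus; each document is already tokenized (an assumption for brevity).
docs <- list(
  c("a", "b", "c"),
  c("a", "c"),
  c("b", "c", "a", "b")
)
vocab <- sort(unique(unlist(docs)))

# Binary document-by-term incidence matrix: 1 if the term occurs in the document.
incidence <- t(sapply(docs, function(tokens) as.integer(vocab %in% tokens)))
colnames(incidence) <- vocab

# Analogue of skipgram_window = Inf: the number of documents in which
# term i and term j occur together.
tcm_inf <- crossprod(incidence)

# Analogue of a finite skipgram_window: count how often term j appears
# within `window` positions after term i. Hypothetical helper.
window_tcm <- function(docs, vocab, window) {
  tcm <- matrix(0L, length(vocab), length(vocab),
                dimnames = list(vocab, vocab))
  for (tokens in docs) {
    n <- length(tokens)
    for (p in seq_len(n)) {
      last <- min(n, p + window)
      if (last > p) {
        for (q in (p + 1):last) {
          tcm[tokens[p], tokens[q]] <- tcm[tokens[p], tokens[q]] + 1L
        }
      }
    }
  }
  tcm
}

tcm_w1 <- window_tcm(docs, vocab, window = 1)

print(tcm_inf)
print(tcm_w1)
```

In the forward-window sketch, the count for the pair (a, b) differs from the count for (b, a), which shows how a windowed TCM can be asymmetric.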
The following transformations are applied to stopword_vec as well as doc_vec: lower, remove_punctuation, remove_numbers.
See stopwords for details on the default value of the stopword_vec argument.
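To see why the same transformations are applied to both stopword_vec and doc_vec, consider the base R sketch below. The helper normalize_text is hypothetical and only mimics the three flags as they are described in the arguments table; it is not textmineR's internal preprocessing.

```r
# Hypothetical helper mirroring the lower, remove_punctuation, and
# remove_numbers flags; not textmineR's internal code.
normalize_text <- function(x, lower = TRUE, remove_punctuation = TRUE,
                           remove_numbers = TRUE) {
  if (lower) x <- tolower(x)
  # Convert every non-alphanumeric character to a space.
  if (remove_punctuation) x <- gsub("[^[:alnum:]]", " ", x)
  # Convert every digit to a space.
  if (remove_numbers) x <- gsub("[0-9]", " ", x)
  # Collapse the runs of spaces left behind by the substitutions.
  trimws(gsub("\\s+", " ", x))
}

docs <- c("Don't stop: 42 cats!", "The cat's toy")
stopword_vec <- c("don't", "the")

clean_docs <- normalize_text(docs)
clean_stopwords <- normalize_text(stopword_vec)

print(clean_docs)
print(clean_stopwords)
```

Because the stopword "don't" is normalized the same way as the documents, it still lines up with the corresponding text after punctuation is removed; an unnormalized stopword would no longer match.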
if (FALSE) {
data(nih_sample)

# TCM of unigrams and bigrams
tcm <- CreateTcm(doc_vec = nih_sample$ABSTRACT_TEXT,
                 skipgram_window = Inf,
                 ngram_window = c(1, 2))

# TCM of unigrams and a skip-gram window of 3, applying Porter's word stemmer
tcm <- CreateTcm(doc_vec = nih_sample$ABSTRACT_TEXT,
                 skipgram_window = 3,
                 stem_lemma_function = function(x) SnowballC::wordStem(x, "porter"))
}