R/corpus_functions.R
TermDocFreq.Rd
This function takes a document term matrix as input and returns a data frame with columns for term frequency, document frequency, and inverse document frequency.
TermDocFreq(dtm)
dtm: A document term matrix of class dgCMatrix.
Returns a data.frame or tibble with 4 columns.
The first column, term, is a vector of token labels.
The second column, term_freq, is the number of times term appears in the entire corpus.
The third column, doc_freq, is the number of documents in which term appears.
The fourth column, idf, is the log-weighted inverse document frequency of term.
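To make the four columns concrete, here is a minimal sketch that builds them by hand for a toy matrix. It is not the package's own code: the names toy_dtm and toy_freq are hypothetical, and the idf formula log(N / doc_freq) with a natural log is an assumption. That assumption is at least consistent with the idf of 4.61 shown in the example output for terms with doc_freq 1, if the sample corpus has 100 documents.

```r
library(Matrix)

# Toy document term matrix: 3 documents (rows) by 3 terms (columns).
# sparseMatrix() returns a dgCMatrix for numeric values by default.
toy_dtm <- Matrix::sparseMatrix(
  i = c(1, 1, 2, 3, 3),
  j = c(1, 2, 1, 1, 3),
  x = c(2, 1, 1, 3, 1),
  dims = c(3, 3),
  dimnames = list(c("doc1", "doc2", "doc3"), c("gene", "cell", "mouse"))
)

# Hand-built version of the four columns described above.
toy_freq <- data.frame(
  term      = colnames(toy_dtm),
  term_freq = Matrix::colSums(toy_dtm),      # total count of each term
  doc_freq  = Matrix::colSums(toy_dtm > 0),  # documents containing each term
  stringsAsFactors = FALSE
)

# Assumed idf formula: natural log of (number of documents / doc_freq).
toy_freq$idf <- log(nrow(toy_dtm) / toy_freq$doc_freq)

print(toy_freq)
```

Under this formula, "gene" appears in all three documents and gets an idf of 0, while "cell" and "mouse" each appear in one document and get log(3).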
# Load a pre-formatted dtm and topic model
data(nih_sample_dtm)
data(nih_sample_topic_model)
# Get the term frequencies
term_freq_mat <- TermDocFreq(nih_sample_dtm)
str(term_freq_mat)
#> tibble [5,210 × 4] (S3: tbl_df/tbl/data.frame)
#> $ term : chr [1:5210] "folding" "tosuprttedprtmnt" "importation" "hd" ...
#> $ term_freq: num [1:5210] 1 1 1 1 1 1 1 1 1 1 ...
#> $ doc_freq : int [1:5210] 1 1 1 1 1 1 1 1 1 1 ...
#> $ idf : num [1:5210] 4.61 4.61 4.61 4.61 4.61 ...
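A frequency table like this can be used to prune a vocabulary before modeling. The sketch below is illustrative only: it uses a toy matrix and a hand-built table with the same column names rather than calling TermDocFreq, and the names toy_dtm, freq, keep, and pruned_dtm, as well as the thresholds, are hypothetical choices.

```r
library(Matrix)

# Toy 4-document by 4-term matrix standing in for a real dtm.
toy_dtm <- Matrix::sparseMatrix(
  i = c(1, 2, 3, 4, 1, 2, 3, 4),
  j = c(1, 1, 1, 1, 2, 2, 3, 4),
  x = c(1, 2, 1, 1, 3, 1, 1, 2),
  dims = c(4, 4),
  dimnames = list(paste0("doc", 1:4), c("the", "protein", "folding", "hd"))
)

# Stand-in for the table returned by TermDocFreq (same column names).
freq <- data.frame(
  term      = colnames(toy_dtm),
  term_freq = Matrix::colSums(toy_dtm),
  doc_freq  = Matrix::colSums(toy_dtm > 0),
  stringsAsFactors = FALSE
)

# Arbitrary thresholds: keep terms found in at least 2 documents
# but not in every document.
keep <- freq$term[freq$doc_freq >= 2 & freq$doc_freq < nrow(toy_dtm)]

# Subset the columns of the matrix; drop = FALSE keeps it a matrix.
pruned_dtm <- toy_dtm[, keep, drop = FALSE]

print(keep)
print(dim(pruned_dtm))
```

Here "the" is dropped because it appears in every document, and "folding" and "hd" are dropped because each appears in only one.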