A wrapper for RSpectra::svds that returns a nicely-formatted latent semantic analysis topic model.

FitLsaModel(dtm, k, calc_coherence = TRUE, return_all = FALSE, ...)

Arguments

dtm

A document term matrix of class Matrix::dgCMatrix

k

Number of topics

calc_coherence

Whether to calculate the probabilistic coherence of topics after the model is trained. Defaults to TRUE.

return_all

Whether to return all objects produced by RSpectra::svds. Defaults to FALSE.

...

Other arguments to pass to svds through its opts parameter.

Value

Returns a list with a minimum of three objects: phi, theta, and sv. The rows of phi index topics and the columns index tokens. The rows of theta index documents and the columns index topics. sv is a vector of singular values.

Details

Latent semantic analysis (LSA) uses singular value decomposition (SVD) to factor the document term matrix (DTM). In many LSA applications, TF-IDF weights are applied to the DTM before model fitting, although this is not strictly necessary.
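
To make the factorization concrete, the following is a minimal sketch on a toy matrix. It is not this function's implementation: base R's svd() stands in for RSpectra::svds, the toy matrix and the names approx_dtm, d_1 ... d_4 and w_1 ... w_5 are illustrative assumptions, and no coherence is calculated. It only shows how the shapes of sv, theta, and phi described under Value relate to a truncated SVD.

```r
# Toy document term matrix: 4 documents (rows) by 5 terms (columns).
# The values and dimnames are made up for illustration.
dtm <- matrix(
  c(2, 1, 0, 0, 0,
    1, 2, 1, 0, 0,
    0, 0, 1, 2, 1,
    0, 0, 0, 1, 2),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("d_", 1:4), paste0("w_", 1:5))
)

k <- 2 # number of topics to keep

# Assumption: base R svd() is used here in place of RSpectra::svds
s <- svd(dtm, nu = k, nv = k)

sv    <- s$d[seq_len(k)] # the k largest singular values
theta <- s$u             # rows index documents, columns index topics
phi   <- t(s$v)          # rows index topics, columns index tokens

dimnames(theta) <- list(rownames(dtm), paste0("t_", seq_len(k)))
dimnames(phi)   <- list(paste0("t_", seq_len(k)), colnames(dtm))

# Rank-k reconstruction of the original matrix from the three pieces
approx_dtm <- theta %*% diag(sv) %*% phi

dim(theta)
dim(phi)
```

Multiplying the three pieces back together gives the best rank-k approximation of the DTM in the least-squares sense, which is why a small k can still capture most of the structure in the term counts.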

Examples

# Load the package and a pre-formatted dtm
library(textmineR)
data(nih_sample_dtm)

# Convert raw word counts to TF-IDF frequency weights
idf <- log(nrow(nih_sample_dtm) / Matrix::colSums(nih_sample_dtm > 0))

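# Transposing puts terms on the rows, so idf is recycled down each column
# and every term is scaled by its own IDF weight
dtm_tfidf <- Matrix::t(nih_sample_dtm) * idf

# Transpose back so that rows index documents and columns index terms
dtm_tfidf <- Matrix::t(dtm_tfidf)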

# Fit an LSA model
model <- FitLsaModel(dtm = dtm_tfidf, k = 5)

str(model)
#> List of 6
#>  $ sv       : num [1:5] 181 156 150 144 143
#>  $ theta    : num [1:100, 1:5] 0.0213 0.0103 0.0093 0.0198 0.0144 ...
#>   ..- attr(*, "dimnames")=List of 2
#>   .. ..$ : chr [1:100] "8693991" "8693362" "8607498" "8697008" ...
#>   .. ..$ : chr [1:5] "t_1" "t_2" "t_3" "t_4" ...
#>  $ phi      : num [1:5, 1:5210] 0.000263 -0.000897 0.000831 -0.000494 -0.000135 ...
#>   ..- attr(*, "dimnames")=List of 2
#>   .. ..$ : chr [1:5] "t_1" "t_2" "t_3" "t_4" ...
#>   .. ..$ : chr [1:5210] "folding" "tosuprttedprtmnt" "importation" "hd" ...
#>  $ gamma    : num [1:5, 1:5210] 1.45e-06 -5.75e-06 5.54e-06 -3.43e-06 -9.45e-07 ...
#>   ..- attr(*, "dimnames")=List of 2
#>   .. ..$ : chr [1:5] "t_1" "t_2" "t_3" "t_4" ...
#>   .. ..$ : chr [1:5210] "folding" "tosuprttedprtmnt" "importation" "hd" ...
#>  $ coherence: Named num [1:5] 0.937 0.937 0.231 0.268 0.986
#>   ..- attr(*, "names")= chr [1:5] "t_1" "t_2" "t_3" "t_4" ...
#>  $ data     :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
#>   .. ..@ i       : int [1:14073] 94 51 73 13 29 16 89 98 80 63 ...
#>   .. ..@ p       : int [1:5211] 0 1 2 3 4 5 6 7 8 9 ...
#>   .. ..@ Dim     : int [1:2] 100 5210
#>   .. ..@ Dimnames:List of 2
#>   .. .. ..$ : chr [1:100] "8693991" "8693362" "8607498" "8697008" ...
#>   .. .. ..$ : chr [1:5210] "folding" "tosuprttedprtmnt" "importation" "hd" ...
#>   .. ..@ x       : num [1:14073] 4.61 4.61 4.61 4.61 4.61 ...
#>   .. ..@ factors : list()
#>  - attr(*, "class")= chr "lsa_topic_model"