Fit a Latent Dirichlet Allocation topic model

Fit a Latent Dirichlet Allocation topic model using collapsed Gibbs sampling.

FitLdaModel(
  dtm,
  k,
  iterations = NULL,
  burnin = -1,
  alpha = 0.1,
  beta = 0.05,
  optimize_alpha = FALSE,
  calc_likelihood = FALSE,
  calc_coherence = TRUE,
  calc_r2 = FALSE,
  ...
)

Arguments

dtm: A document term matrix or term co-occurrence matrix of class dgCMatrix
k: Integer number of topics
iterations: Integer number of iterations for the Gibbs sampler to run. A future version may include automatic stopping criteria.
burnin: Integer number of burnin iterations. If burnin is greater than -1, the resulting "phi" and "theta" matrices are an average over all iterations greater than burnin.
alpha: Vector of length k for asymmetric or a number for symmetric. This is the prior for topics over documents
beta: Vector of length ncol(dtm) for asymmetric or a number for symmetric. This is the prior for words over topics.
optimize_alpha: Logical. Do you want to optimize alpha every 10 Gibbs iterations? Defaults to FALSE.
calc_likelihood: Do you want to calculate the likelihood every 10 Gibbs iterations? Useful for assessing convergence. Defaults to FALSE.
calc_coherence: Do you want to calculate probabilistic coherence of topics after the model is trained? Defaults to TRUE.
calc_r2: Do you want to calculate R-squared after the model is trained? Defaults to FALSE.
...: Other arguments to be passed to TmParallelApply

Value

Returns an S3 object of class c("LDA", "TopicModel"). DESCRIBE MORE

Details

EXPLAIN IMPLEMENTATION DETAILS

Examples

# load some data
data(nih_sample_dtm)

# fit a model 
set.seed(12345)
m <- FitLdaModel(dtm = nih_sample_dtm[1:20,], k = 5,
                 iterations = 200, burnin = 175)

str(m)
#> List of 7
#>  $ phi      : num [1:5, 1:5210] 5.77e-05 6.69e-05 5.73e-05 6.69e-05 5.28e-05 ...
#>   ..- attr(*, "dimnames")=List of 2
#>   .. ..$ : chr [1:5] "t_1" "t_2" "t_3" "t_4" ...
#>   .. ..$ : chr [1:5210] "folding" "tosuprttedprtmnt" "importation" "hd" ...
#>  $ theta    : num [1:20, 1:5] 0.00043 0.23802 0.111462 0.210652 0.000615 ...
#>   ..- attr(*, "dimnames")=List of 2
#>   .. ..$ : chr [1:20] "8693991" "8693362" "8607498" "8697008" ...
#>   .. ..$ : chr [1:5] "t_1" "t_2" "t_3" "t_4" ...
#>  $ gamma    : num [1:5, 1:5210] 0.188 0.19 0.207 0.206 0.209 ...
#>   ..- attr(*, "dimnames")=List of 2
#>   .. ..$ : chr [1:5] "t_1" "t_2" "t_3" "t_4" ...
#>   .. ..$ : chr [1:5210] "folding" "tosuprttedprtmnt" "importation" "hd" ...
#>  $ data     :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
#>   .. ..@ i       : int [1:2864] 13 16 11 16 18 10 11 11 17 11 ...
#>   .. ..@ p       : int [1:5211] 0 0 0 0 1 1 2 2 2 2 ...
#>   .. ..@ Dim     : int [1:2] 20 5210
#>   .. ..@ Dimnames:List of 2
#>   .. .. ..$ : chr [1:20] "8693991" "8693362" "8607498" "8697008" ...
#>   .. .. ..$ : chr [1:5210] "folding" "tosuprttedprtmnt" "importation" "hd" ...
#>   .. ..@ x       : num [1:2864] 1 1 1 1 1 1 1 1 1 1 ...
#>   .. ..@ factors : list()
#>  $ alpha    : Named num [1:5] 0.1 0.1 0.1 0.1 0.1
#>   ..- attr(*, "names")= chr [1:5] "t_1" "t_2" "t_3" "t_4" ...
#>  $ beta     : Named num [1:5210] 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 ...
#>   ..- attr(*, "names")= chr [1:5210] "folding" "tosuprttedprtmnt" "importation" "hd" ...
#>  $ coherence: Named num [1:5] 0.0817 0.395 0.24 0.0267 0.04
#>   ..- attr(*, "names")= chr [1:5] "t_1" "t_2" "t_3" "t_4" ...
#>  - attr(*, "class")= chr "lda_topic_model"

# predict on held-out documents using gibbs sampling "fold in"
p1 <- predict(m, nih_sample_dtm[21:100,], method = "gibbs",
              iterations = 200, burnin = 175)

# predict on held-out documents using the dot product method
p2 <- predict(m, nih_sample_dtm[21:100,], method = "dot")

# compare the methods
barplot(rbind(p1[1,],p2[1,]), beside = TRUE, col = c("red", "blue"))