This version is a patch. In this version I have
- Updated the URL to a working paper cited in the topic modeling vignette.
This version is a patch. In this version I have
- Fixed a bug in
CalcHellignerDist()
and CalcJSDivergence()
that sometimes caused inputs to be overwritten.
- Fixed some typos in the vignette for topic modeling
- Updated the documentation on
FitCtmModel()
to better explain how to pass control arguments to CTM’s underlying function.
- Enabled return of a
tibble
or data.frame
(instead of only data.frame
) in the following functions: SummarizeTopics
, GetTopTerms
, TermDocFreq
(Thanks to Mattias for the PR)
This version is a patch. In this version I have
- Removed unconditional stripping in MAKEVARs as specified by CRAN
- Improved outputs of
FitLdaModel
This version is a patch. In this version I have
- fixed an error related to the
update.lda_topic_model
method.
- added a method
posterior.lda_topic_model
to sample from the posterior of an LDA topic model.
This version is a patch. In this version I have
- changed some elements of NAMESPACE to pass additional CRAN checks.
- added an update method for the lda_topic_model class. This allows users to add documents to an existing model (and even add new topics) without changing the indices of previously-trained topics. e.g. topic 5 is still topic 5.
- added a vignette for using
tidytext
alongside textmineR
This version is a patch in response to issues revealed by automatic checks upon submission to CRAN plus an additional issue I encountered along the way.
I have * Used the CRAN template for my MIT LICENSE file * Modified the example of the LabelTopics function to speed up run time for that example * Modified vignettes to run in less time * Added a Makevars file to keep compiled code small on Ubuntu.
Please read below for major updates between v2.x.x and v3.x.x
This version significantly changes textmineR.
-
Several functions that were slated for deletion in version 2.1.3 are now gone.
- RecursiveRbind
- Vec2Dtm
- JSD
- HellDist
- GetPhiPrime
- FormatRawLdaOutput
- Files2Vec
- DepluralizeDtm
- CorrectS
- CalcPhiPrime
-
FitLdaModel has changed significantly.
- Now only Gibbs sampling is a supported training method. The Gibbs sampler is no longer wrapping lda::lda_collapsed_gibbs_sampler. It is now native to textmineR. It’s a little slower, but has additional features.
- Asymmetric priors are supported for both alpha and beta.
- There is an option, optimize_alpha, which updates alpha every 10 iterations based on the value of theta at the current iteration.
- The log likelihood of the data given estimates of phi and theta is optionally calculated every 10 iterations.
- Probabilistic coherence is optionally calculated at the time of model fit.
- R-squared is optionally calculated at the time of model fit.
Supported topic models (LDA, LSA, CTM) are now object-oriented, creating their own S3 classes. These classes have their own predict methods, meaning you do not have to do your own math to make predictions for new documents.
A new function SummarizeTopics has been added.
tm is no longer a dependency for stopwords. We now use the stopwords package. The extended result of this is that there is no longer any Java dependency.
Several packages have been moved from “Imports” to “Suggests”. The result is a faster install and lower likelihood of install failure based on packages with system dependencies. (Looking at you, topicmodels!)
Finally, I have changed the textmineR license to the MIT license. Note, however, that some dependencies may have more restrictive licenses. So if you’re looking to use textmineR in a commercial project, you may want to dig deeper into what is/isn’t permissable.
- Deprecating functions that will be removed, renamed, or have significant changes to syntax or functionality in the forthcoming textmineR v3.0.
- Functions slated for deletion:
- RecursiveRbind
- Vec2Dtm
- JSD
- HellDist
- GetPhiPrime
- FormatRawLdaOutput
- Files2Vec
- DepluralizeDtm
- CorrectS
- CalcPhiPrime
- In addition: FitLdaModel is going to change significantly in its functionality and argument calls.
- Deprecated RecursiveRbind - it depended on a deprecated function from the Matrix package. And the replacement offered by Matrix operates recursively, making this function truly superfluous.
- Corrected some code in the vignettes that caused errors on Linux machines.
- Added vignettes for common use cases of textmineR
- Modified averaging for
CalcProbCoherence
- Updated documentation to
CreateTcm
- Back-end changes to CreateTcm in response to new
text2vec
API. Functionality is unchanged.
- Changes to how the package interfaces with Rcpp
- Add
verbose
option to CreateDtm
and CreateTcm
to supress status messages.
- Add function
GetVocabFromDtm
to get text2vec
vocabulary object from a dgCMatrix
document term matrix.
- Patching errors introduced in version 2.0.3
- Patches to
CreateDtm
and CreateTcm
in response to updates to text2vec
.
- More formal update to take advantage of
text2vec
’s latest optimizations to follow.
- Patched
CreateDtm
and CreateTcm
. remove_punctuation now supports non-English characters.
- Patched
TmParallelApply
. Added an option to declare the environment to search for your export list. Default to that argument just searches the local environment. The default should cover ~95% of use cases. (And avoids crash on Windows OS)
- Patched
FitLdaModel
. Use of the ...
argument now allows you to control TmParallelApply
, lda::lda.collapsed.gibbs.sampler
, and topicmodels::LDA
without error.
- Patched
FitCtmModel
where the ...
argument now goes to topicmodels::CTM
’s control
argument.
- Patched
CreateTcm
to return objects of class dgCMatrix
. This allows you to run functions like FitLdaModel
on a TCM.
- Switched from irlba to RSpectra for LSA models because RSpectra’s implementation is much faster.
- Patched CreateDtm and CreateTcm. An error caused stopwords to not be removed
- Vec2Dtm is now deprecated in favor of CreateDtm
- A function, CreateTcm, now exists to create term co-occurrence matrices
- CreateDtm and CreateTcm are implemented with a parallel C++ back end through the text2vec library
- the implementation is much faster! I’ve clocked 2X - 10X speedups, depending on options
- adds external dependencies - C++ compiler and GNU make - and takes away an external dependency - Java.
- now all tokens will be included, regardless of length. (tm’s framework silently dropped all tokens of fewer than 3 characters.)
- Allow generic stemming and stopwords in CreateDtm & CreateTcm
- Now there is only one argument for stopwords, making it clearer how to use custom or non-English stopwords
- Now the stemming argument allows for passing of stem/lemmatization functions.
- Function for fitting correlated topic models
- Function to turn a document term matrix to term co-occurrence matrix
- Allowed LabelTopics to use unigrams, if you want. (n-grams are still better.)
- More robust error checking for CalcTopicModelR2 and CalcLikelihood
- All function arguments use “_“, not”.”.
- CalcPhiPrime replaces (the now deprecated) GetPhiPrime
- Allows you to pass an argument to specify non-uniform probabilities of each document
- Similarly, CalcHellingerDist and CalcJSDivergence replace HellDist and JSD. This is to conform to a naming convention where functions are “verbs”.
- Added modeling capability for latent semantic analysis in FitLsaModel()
- Added CalcProbCoherence() function which replaces ProbCoherence() and can calculate probabilistic coherence for the whole phi matrix.
- Added data from NIH research grants instead of borrowed data from tm
- Removed qcq data
- Added variational em method for FitLdaModel()
- Added function to represent document clustering as a topic model Cluster2TopicModel()
- Add deprecation warning to ProbCoherence
- Allow for arguments of number of cores to be passed to every function that uses implicit parallelziation
- Allow for passing of libraries to TmParallelApply (makes this function truely independent of textmineR)
- For Vec2Dtm ensure that stopwords and custom stopwords are lowercased when lower = TRUE
- Update README example to use model caches