Text embeddings

Text embeddings are particularly hot right now. While textmineR doesn’t (yet) explicitly implement embedding models such as GloVe or word2vec, you can still get embeddings. Text embedding algorithms aren’t conceptually different from topic models; they simply operate on a different matrix. Instead of reducing the dimensions of a document term matrix, text embeddings are obtained by reducing the dimensions of a term co-occurrence matrix. In principle, LDA or LSA can be used in the same way. In that case, the rows of theta are the embedded words, and a phi_prime may be obtained to project documents or new text into the embedding space.

Create a term co-occurrence matrix

The first step in fitting a text embedding model is to create a term co-occurrence matrix or TCM. In a TCM, both columns and rows index tokens. The \((i,j)\) entries of the matrix are a count of the number of times word \(i\) co-occurs with \(j\). However, there are several ways to count co-occurrence. textmineR gives you three.

The most useful way of counting co-occurrence for text embeddings is the skip-gram model. Under the skip-gram model, the count is the number of times word \(j\) appears within a certain window of word \(i\). A skip-gram window of two, for example, would count the number of times word \(j\) occurred in the two words immediately before word \(i\) or the two words immediately after word \(i\). This captures the local context of words. In fact, you can think of a text embedding as a topic model built on the local context of words, whereas a traditional topic model models words in their global, document-level context.
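To make the window counting concrete, here is a minimal sketch on a made-up sentence. The toy document and the window of two are chosen only for illustration and are not part of the NIH example used below; the comments assume CreateTcm's default pre-processing keeps all six tokens (none are common English stopwords).

# a made-up six-token document
library(textmineR)

toy_doc <- "tumor cells resist standard chemotherapy drugs"

# count co-occurrences within a window of two words on either side
toy_tcm <- CreateTcm(doc_vec = toy_doc,
                     skipgram_window = 2,
                     verbose = FALSE)

# "resist" should show co-occurrences with "tumor", "cells", "standard",
# and "chemotherapy", but not with "drugs", which falls outside the window
as.matrix(toy_tcm)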

To read more about the skip-gram model, which was popularized by the embedding model word2vec, see the word2vec literature.

The other types of co-occurrence matrix textmineR provides are both global. One is a count of the number of documents in which words \(i\) and \(j\) co-occur. The other is the number of terms that co-occur between documents \(i\) and \(j\). See help(CreateTcm) for info on these.


# load the NIH data set
library(textmineR)
#> Loading required package: Matrix
#> 
#> Attaching package: 'textmineR'
#> The following object is masked from 'package:Matrix':
#> 
#>     update
#> The following object is masked from 'package:stats':
#> 
#>     update

# load nih_sample data set from textmineR
data(nih_sample)

# First create a TCM using skip grams; we'll use a 10-word window
# most options available on CreateDtm are also available for CreateTcm
tcm <- CreateTcm(doc_vec = nih_sample$ABSTRACT_TEXT,
                 skipgram_window = 10,
                 verbose = FALSE,
                 cpus = 2)
#> 'as(<dgTMatrix>, "dgCMatrix")' is deprecated.
#> Use 'as(., "CsparseMatrix")' instead.
#> See help("Deprecated") and help("Matrix-deprecated").

# a TCM is generally larger than a DTM
dim(tcm)
#> [1] 5210 5210

Fitting a model

Once we have a TCM, we can fit an embedding model with the same procedure we used to fit a topic model. Note that on the same corpus, fitting an embedding may take considerably longer (because of the dimensionality of the matrix) or shorter (because of its sparsity).

# use LDA to get embeddings into probability space
# This will take considerably longer as the TCM matrix has many more rows 
# than your average DTM
embeddings <- FitLdaModel(dtm = tcm,
                          k = 50,
                          iterations = 200,
                          burnin = 180,
                          alpha = 0.1,
                          beta = 0.05,
                          optimize_alpha = TRUE,
                          calc_likelihood = FALSE,
                          calc_coherence = TRUE,
                          calc_r2 = TRUE,
                          cpus = 2)

Interpretation of \(\Phi\) and \(\Theta\)

In the language of text embeddings, \(\Theta\) gives us our tokens embedded in a probability space (because we used LDA; it would be a Euclidean space had we used LSA). \(\Phi\) defines the dimensions of our embedding space. The rows of \(\Phi\) can still be interpreted as topics, but they are topics of local contexts rather than of whole documents.
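For example, we can look up a word's nearest neighbors among the embedded tokens. The sketch below is not part of the original workflow; it uses Hellinger distance, which suits rows of \(\Theta\) because they are probability distributions, and it assumes the word "cancer" is in this sample's vocabulary (it appears among the top terms later on).

# pick a word to inspect; "cancer" is assumed to be in the vocabulary
word <- "cancer"

# Hellinger distance between that word's embedding (a row of theta, a
# probability distribution over the 50 dimensions) and every other row
hellinger <- apply(embeddings$theta, 1, function(x) {
  sqrt(0.5 * sum((sqrt(x) - sqrt(embeddings$theta[word, ]))^2))
})

# the ten nearest neighbors; the word itself comes first with distance zero
head(sort(hellinger), 10)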

Evaluating the model

As it happens, the same evaluation metrics developed for topic modeling also apply here. There are subtle differences in interpretation because we are using a TCM rather than a DTM: co-occurrences relate words to each other, not to documents.

# Get an R-squared for general goodness of fit
embeddings$r2
#> [1] 0.1784392

# Get coherence (relative to the TCM) for goodness of fit
summary(embeddings$coherence)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#> 0.01280 0.06436 0.11249 0.12223 0.17628 0.28940

We will create a summary table as we did with a topic model before.

# Get top terms, no labels because we don't have bigrams
embeddings$top_terms <- GetTopTerms(phi = embeddings$phi,
                                    M = 5)
# Create a summary table, similar to the above
embeddings$summary <- data.frame(topic = rownames(embeddings$phi),
                                 coherence = round(embeddings$coherence, 3),
                                 prevalence = round(colSums(embeddings$theta), 2),
                                 top_terms = apply(embeddings$top_terms, 2, function(x){
                                   paste(x, collapse = ", ")
                                 }),
                                 stringsAsFactors = FALSE)

Here is the table ordered by prevalence. (For an embedding, prevalence is perhaps better thought of as the density of tokens along each embedding dimension.)

embeddings$summary[ order(embeddings$summary$prevalence, decreasing = TRUE) , ][ 1:10 , ]
Summary of top 10 embedding dimensions

topic   coherence   prevalence   top_terms
t_46        0.180       196.39   research, core, program, health, studies
t_49        0.180       195.59   aim, specific, determine, study, models
t_25        0.173       126.55   cells, cell, human, function, mast
t_6         0.142       119.04   based, treatment, clinical, effect, therapeutic
t_8         0.161       118.00   factors, disease, risk, related, early
t_26        0.168       115.82   response, brain, responses, immune, tissue
t_24        0.064       111.65   health, community, wide, women, significant
t_33        0.221       110.72   dependent, sleep, memory, pathways, role
t_17        0.108       110.57   diabetes, intervention, behavior, race, fertility
t_1         0.190       109.65   expression, genetic, ipf, gene, lung

And here is the table ordered by coherence.

embeddings$summary[ order(embeddings$summary$coherence, decreasing = TRUE) , ][ 1:10 , ]
Summary of 10 most coherent embedding dimensions

topic   coherence   prevalence   top_terms
t_27        0.289       100.56   proteins, cutaneous, sand, infection, fly
t_19        0.254        93.29   provided, applicant, project, cancer, health
t_34        0.252       100.06   secondary, ptc, brafv, processes, vegfr
t_5         0.236       102.69   injury, cmybp, blood, release, fragment
t_33        0.221       110.72   dependent, sleep, memory, pathways, role
t_28        0.210        99.53   influenza, cross, vaccine, antigen, protective
t_35        0.197        99.37   ri, gut, fc, microbiome, crc
t_1         0.190       109.65   expression, genetic, ipf, gene, lung
t_11        0.185        90.74   power, force, katp, sarcomere, dependence
t_31        0.180        92.41   developed, capacity, battery, proposed, size

Embedding documents under the model

You can embed whole documents under your model. Doing so effectively turns your embedding into a topic model whose topics come from local contexts instead of global ones. Why might you want to do this? The short answer is that you may have reason to believe an embedding model will give you better topics, especially if you are trying to pick up on more subtle ones. In a later example, we’ll do exactly that to build a document summarizer.

A note on the code below: TCMs may be very sparse, which can cause computational underflow when using the “gibbs” prediction method. As a result, I’m choosing to use the “dot” method.

# Make a DTM from our documents
dtm_embed <- CreateDtm(doc_vec = nih_sample$ABSTRACT_TEXT,
                       doc_names = nih_sample$APPLICATION_ID,
                       ngram_window = c(1,1),
                       verbose = FALSE,
                       cpus = 2)

# keep only tokens that appear more than twice across the corpus
dtm_embed <- dtm_embed[, colSums(dtm_embed) > 2]

# Project the documents into the embedding space
# iterations and burnin would only be needed with method = "gibbs"
embedding_assignments <- predict(embeddings, dtm_embed, method = "dot")
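
The phi_prime mentioned at the start can also be computed explicitly with CalcPhiPrime and used to project the DTM by matrix multiplication. The following is only a rough sketch of that idea under the model fit above, not a replacement for predict(); see help(CalcPhiPrime).

# phi_prime gives P(embedding dimension | word)
phi_prime <- CalcPhiPrime(phi = embeddings$phi,
                          theta = embeddings$theta)

# line up the vocabulary, project the DTM, and re-normalize each row so it
# sums to one (rows become distributions over the embedding dimensions)
manual_assignments <- as.matrix(dtm_embed %*% t(phi_prime[, colnames(dtm_embed)]))
manual_assignments <- manual_assignments / rowSums(manual_assignments)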

Once you’ve embedded your documents, you effectively have a new \(\Theta\). We can use that to evaluate how well the embedding topics fit the documents as a whole by re-calculating R-squared and coherence.

# get a goodness of fit relative to the DTM
embeddings$r2_dtm <- CalcTopicModelR2(dtm = dtm_embed, 
                                      phi = embeddings$phi[,colnames(dtm_embed)], # line up vocabulary
                                      theta = embedding_assignments,
                                      cpus = 2)

embeddings$r2_dtm
#> [1] 0.2192964

# get coherence relative to DTM
embeddings$coherence_dtm <- CalcProbCoherence(phi = embeddings$phi[,colnames(dtm_embed)], # line up vocabulary
                                              dtm = dtm_embed)

summary(embeddings$coherence_dtm)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#> 0.01334 0.06805 0.10131 0.13567 0.19341 0.45255
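
To see how each dimension fares in both views, we might put the TCM-based and DTM-based coherences side by side. This small comparison is a sketch and not part of the summary table built above.

# compare coherence with respect to local contexts (TCM) and whole documents (DTM)
coherence_compare <- data.frame(topic = rownames(embeddings$phi),
                                coherence_tcm = round(embeddings$coherence, 3),
                                coherence_dtm = round(embeddings$coherence_dtm, 3),
                                stringsAsFactors = FALSE)

# dimensions that are most coherent with respect to whole documents
head(coherence_compare[order(coherence_compare$coherence_dtm, decreasing = TRUE), ])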

Where to next?

Text embedding research is only just beginning. I would encourage you to play with these embeddings and develop your own methods.