vignettes/e_doc_summarization.Rmd
e_doc_summarization.Rmd
In this example we’ll use text embeddings and a bit of network analysis to build a basic document summarizer.
Many document summarizers, as the one we’ll build here, do not generate language. Instead, they break a document down into sentences and then use some mechanism to score each sentence for relevance. Sentences with the top scores are returned as the “summary.” For more information on summarization, a good place to start is here.
The summarizer we’ll build is a version of the TextRank algorithm. We will split a document into sentences, create a nearest-neighbor network where sentences are connected to other similar sentences, and rank the sentences according to eigenvector centrality.
We will use a word embedding model, created on a whole corpus, to project the sentences into the embedding space. Once in the embedding space, we will measure similarity between documents using Hellinger distance. Hellinger distance is a metric specifically for probability distributions. Since we’ll use LDA to create embeddings to a probability space, it’s a useful measure.
We’ll use the movie review data set from text2vec
again.
The first thing we need to do is create a TCM and embedding model. We
will skip evaluation such as R-squared, coherence, inspecting top terms,
etc. However, in any real application, I’d strongly suggest evaluating
your models at every step of the way.
library(textmineR)
#> Loading required package: Matrix
#>
#> Attaching package: 'textmineR'
#> The following object is masked from 'package:Matrix':
#>
#> update
#> The following object is masked from 'package:stats':
#>
#> update
# load the data
data(movie_review, package = "text2vec")
# let's take a sample so the demo will run quickly
# note: textmineR is generally quite scaleable, depending on your system
set.seed(123)
s <- sample(1:nrow(movie_review), 200)
movie_review <- movie_review[ s , ]
# let's get those nasty "<br />" symbols out of the way
movie_review$review <- stringr::str_replace_all(movie_review$review, "<br */>", "")
# First create a TCM using skip grams, we'll use a 5-word window
# most options available on CreateDtm are also available for CreateTcm
tcm <- CreateTcm(doc_vec = movie_review$review,
skipgram_window = 10,
verbose = FALSE,
cpus = 2)
#> 'as(<dgTMatrix>, "dgCMatrix")' is deprecated.
#> Use 'as(., "CsparseMatrix")' instead.
#> See help("Deprecated") and help("Matrix-deprecated").
# use LDA to get embeddings into probability space
# This will take considerably longer as the TCM matrix has many more rows
# than a DTM
embeddings <- FitLdaModel(dtm = tcm,
k = 50,
iterations = 200,
burnin = 180,
alpha = 0.1,
beta = 0.05,
optimize_alpha = TRUE,
calc_likelihood = FALSE,
calc_coherence = FALSE,
calc_r2 = FALSE,
cpus = 2)
Let’s use the above embeddings model to create a document summarizer. This will return the three most relevant sentences in each review.
The summarizer works best as a function, as we have many documents to
summarize. The function summarizer
is defined in the next
section. However, let’s look at some key bits of code in detail.
The variable doc
represents a single document, or a
single element of a character vector.
In the code chunk below, we split the document into sentences using
the stringi
package. Then we embed each sentence under the
model built on our whole corpus, above.
# parse it into sentences
sent <- stringi::stri_split_boundaries(doc, type = "sentence")[[ 1 ]]
names(sent) <- seq_along(sent) # so we know index and order
# embed the sentences in the model
e <- CreateDtm(sent, ngram_window = c(1,1), verbose = FALSE, cpus = 2)
# remove any documents with 2 or fewer words
e <- e[ rowSums(e) > 2 , ]
vocab <- intersect(colnames(e), colnames(gamma))
e <- e / rowSums(e)
e <- e[ , vocab ] %*% t(gamma[ , vocab ])
e <- as.matrix(e)
Next, we measure the distance between each of the sentences within the embedding space.
# get the pairwise distances between each embedded sentence
e_dist <- CalcHellingerDist(e)
Since we are using a distance measure whose values fall between \(0\) and \(1\), we can take \(1 - distance\) to get a similarity. We’ll also re-scale it to be between 0 and 100. (The rescaling is just a cautionary measure so that we don’t run into numerical precision issues when performing calculations downstream.)
# turn into a similarity matrix
g <- (1 - e_dist) * 100
If you consider a similarity matrix to be an adjacency matrix, then you have a fully-connected graph. For the sake of potentially faster computation and with the hope of eliminating some noise, we will delete some edges. Going row-by-row, we will keep connections only to the top 3 most similar sentences.
# we don't need sentences connected to themselves
diag(g) <- 0
# turn into a nearest-neighbor graph
g <- apply(g, 1, function(x){
x[ x < sort(x, decreasing = TRUE)[ 3 ] ] <- 0
x
})
# by taking pointwise max, we'll make the matrix symmetric again
g <- pmax(g, t(g))
Using the igraph
package (with its own objects) to
calculate eigenvector centrality. From there, we’ll take the top three
sentences.
The code below puts it all together in a single function. The first few lines vectorize the code, so that we can summarize multiple documents from a single function call.
library(igraph)
#>
#> Attaching package: 'igraph'
#> The following objects are masked from 'package:stats':
#>
#> decompose, spectrum
#> The following object is masked from 'package:base':
#>
#> union
# let's do this in a function
summarizer <- function(doc, gamma) {
# recursive fanciness to handle multiple docs at once
if (length(doc) > 1 )
# use a try statement to catch any weirdness that may arise
return(sapply(doc, function(d) try(summarizer(d, gamma))))
# parse it into sentences
sent <- stringi::stri_split_boundaries(doc, type = "sentence")[[ 1 ]]
names(sent) <- seq_along(sent) # so we know index and order
# embed the sentences in the model
e <- CreateDtm(sent, ngram_window = c(1,1), verbose = FALSE, cpus = 2)
# remove any documents with 2 or fewer words
e <- e[ rowSums(e) > 2 , ]
vocab <- intersect(colnames(e), colnames(gamma))
e <- e / rowSums(e)
e <- e[ , vocab ] %*% t(gamma[ , vocab ])
e <- as.matrix(e)
# get the pairwise distances between each embedded sentence
e_dist <- CalcHellingerDist(e)
# turn into a similarity matrix
g <- (1 - e_dist) * 100
# we don't need sentences connected to themselves
diag(g) <- 0
# turn into a nearest-neighbor graph
g <- apply(g, 1, function(x){
x[ x < sort(x, decreasing = TRUE)[ 3 ] ] <- 0
x
})
# by taking pointwise max, we'll make the matrix symmetric again
g <- pmax(g, t(g))
g <- graph.adjacency(g, mode = "undirected", weighted = TRUE)
# calculate eigenvector centrality
ev <- evcent(g)
# format the result
result <- sent[ names(ev$vector)[ order(ev$vector, decreasing = TRUE)[ 1:3 ] ] ]
result <- result[ order(as.numeric(names(result))) ]
paste(result, collapse = " ")
}
How well did we do? Let’s look at summaries from the first three reviews.
# Let's see the summary of the first couple of reviews
docs <- movie_review$review[ 1:3 ]
names(docs) <- movie_review$id[ 1:3 ]
sums <- summarizer(docs, gamma = embeddings$gamma)
sums
#> 4273_1
#> "And being introduced to some of the most mind numbing shady immoral character of the Twin Peaks.To the Mind numbing almost pedophilia disgusting way the movie seems to romantically tell of the destruction of a Human Life through some random psychedelic phenomena in the Movie Twin Peak:Fire Come Walk with me. Save your self the agony the suspense and watch anything else that at least has the ability to tell a story, rather then seduce you into some kind mental porn movie.I have heard a lot of reviews, rants and raves about how great David Lynch. Because of his ability to define misery and and tragedy and making it into some kind of a wonderful thing. "
#> 7112_4
#> "Now there isn't much to recommend it, other than the inherent camp value of actors being \\\"terrified\\\" by replicas of human skulls. It definitely inspires more laughs than screams, however. Just try not to get the giggles when the wife (who does more than her share of screaming) goes into the greenhouse and is confronted with the ghost of her husband's ex."
#> 1891_3
#> "Documentary about nomadic Persians making a treacherous traverse of massive mountains to get their herds to grass. The spectacular scenery is lost on a small black and white screen, and there is an utter failure to establish any kind of plot line. I loved Nanook of the North and March of the Penguins, but despised this movie, notwithstanding the similarity of the theme. "
Compare that to the whole reviews yourself.