Using tidytext with textmineR

The tidytext package is one of the more popular natural language processing packages in R’s ecosystem. It follows the conventions and syntax of the “tidyverse.”

You may prefer to use tidytext for a couple of reasons. First, tidytext has its own philosophy and syntax for handling text, particularly in the early stages of an analysis, and you may be more familiar or comfortable with this approach. Second, tidytext offers, at least in principle, more flexibility when creating document-term matrices (DTMs) or term co-occurrence matrices (TCMs). This early stage is critical to successful topic modeling.

See Text Mining with R: A Tidy Approach for more details about tidytext.
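
As a small illustration of the tidytext route to a document-term matrix (DTM), the sketch below tokenizes a two-document toy corpus, counts terms per document, and casts the counts to a sparse matrix. The data and the names toy_docs, toy_counts, and toy_dtm are made up for this example; only unnest_tokens(), count(), and cast_sparse() come from tidytext and dplyr.

```r
library(dplyr)
library(tidytext)

# A made-up corpus: one row per document (hypothetical data)
toy_docs <- tibble::tibble(
  id   = c("a", "b"),
  text = c("topic models find topics", "sparse matrices store counts")
)

# One row per (document, word) pair, with a count column n
toy_counts <- toy_docs %>%
  unnest_tokens(output = word, input = text) %>%
  count(id, word)

# Cast the tidy counts to a sparse matrix: documents in rows, terms in columns
toy_dtm <- cast_sparse(toy_counts, id, word, n)

dim(toy_dtm)
```

Each row of toy_dtm is a document and each column a term, which is the layout textmineR's modeling functions take as their dtm argument.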

What follows is a short script combining tidytext with textmineR. Initial data curation and DTM creation are done with tidytext. Topic modeling is done with textmineR, and the outputs are reformatted in the style of tidytext’s “tidiers” for other topic models.

################################################################################
# Example: Using tidytext with textmineR
################################################################################

library(tidytext)
library(textmineR)
#> Loading required package: Matrix
#> 
#> Attaching package: 'textmineR'
#> The following object is masked from 'package:Matrix':
#> 
#>     update
#> The following object is masked from 'package:stats':
#> 
#>     update
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidyr)
#> 
#> Attaching package: 'tidyr'
#> The following object is masked from 'package:Matrix':
#> 
#>     expand

# load documents in a data frame
docs <- textmineR::nih_sample 

# tokenize using tidytext's unnest_tokens
tidy_docs <- docs %>% 
  select(APPLICATION_ID, ABSTRACT_TEXT) %>% 
  unnest_tokens(output = word, 
                input = ABSTRACT_TEXT,
                stopwords = c(stopwords::stopwords("en"), 
                              stopwords::stopwords(source = "smart")),
                token = "ngrams",
                n_min = 1, n = 2) %>% 
  count(APPLICATION_ID, word) %>% 
  filter(n > 1) # keep words/bigrams that occur more than once within a document (a per-document filter, not per-corpus)

# drop tokens that are just numbers
tidy_docs <- tidy_docs %>% 
  filter(!stringr::str_detect(word, "^[0-9]+$"))

# turn a tidy tbl into a sparse dgCMatrix for use in textmineR
d <- tidy_docs %>% 
  cast_sparse(APPLICATION_ID, word, n)


# create a topic model
m <- FitLdaModel(dtm = d, 
                 k = 20,
                 iterations = 200,
                 burnin = 175)


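# below is equivalent to tidy_beta <- tidy(x = m, matrix = "beta")
# note: data.frame() makes column names syntactic by default, so a bigram
# such as "cell line" shows up as "cell.line" in the term column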
tidy_beta <- data.frame(topic = as.integer(stringr::str_replace_all(rownames(m$phi), "t_", "")), 
                        m$phi, 
                        stringsAsFactors = FALSE) %>%
  gather(term, beta, -topic) %>% 
  tibble::as_tibble()

# below is equivalent to tidy_gamma <- tidy(x = m, matrix = "gamma")
# note: topic keeps the column labels of m$theta (e.g. "t_1") rather than integers
tidy_gamma <- data.frame(document = rownames(m$theta),
                         m$theta,
                         stringsAsFactors = FALSE) %>%
  gather(topic, gamma, -document) %>%
  tibble::as_tibble()
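
Once the topic-word probabilities are in tidy form, the usual dplyr verbs apply. The sketch below is one way to pull the highest-probability terms per topic. To stay self-contained it uses a small hand-made matrix, toy_phi, as a stand-in for m$phi; the values and term names are invented for illustration and are not output from the model above.

```r
library(dplyr)
library(tidyr)

# Hypothetical 2-topic, 4-term matrix standing in for m$phi
# (rows are topics, columns are terms, each row sums to 1)
toy_phi <- matrix(
  c(0.5, 0.3, 0.1, 0.1,
    0.1, 0.1, 0.2, 0.6),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("t_1", "t_2"), c("gene", "cell", "patient", "trial"))
)

# Same reshaping as in the script: one row per (topic, term) pair
toy_beta <- data.frame(
  topic = as.integer(sub("t_", "", rownames(toy_phi))),
  toy_phi,
  stringsAsFactors = FALSE
) %>%
  gather(term, beta, -topic) %>%
  tibble::as_tibble()

# Keep the two most probable terms within each topic
top_terms <- toy_beta %>%
  group_by(topic) %>%
  slice_max(beta, n = 2, with_ties = FALSE) %>%
  ungroup()

top_terms
```

With the real model, replacing toy_beta with tidy_beta and raising n should give a compact per-topic summary. slice_max() assumes dplyr 1.0.0 or later.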