Tokenizing on Stopwords

Introduction

Recently, I came across the idea that you can get relevant keywords for word2vec by tokenizing a corpus on stopwords, in addition to standard punctuation (found via). This seemed like a really cool unsupervised way of capturing (hopefully!) relevant phrases. I was intrigued.

A brief note: “tokenizing” refers to splitting a document into words or phrases based on a pre-defined set of rules. The most common approach is to split on spaces and “end-of-sentence” punctuation (e.g., “!”, “?”, “.”). This returns a list of words, or “unigrams,” which can then be re-combined into n-grams, or multiword phrases.
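
For example, here is a minimal sketch using quanteda (the package used throughout this post); the toy sentence is purely illustrative:

library(quanteda)
# split a toy sentence on whitespace and punctuation, keeping only the words
toy <- tokens("We the People of the United States!", what = "word", remove_punct = TRUE)
toy[[1]] # "We" "the" "People" "of" "the" "United" "States"
# re-combine the unigrams into bigrams, joined by "_"
tokens_ngrams(toy, n = 2)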

An R Implementation

The code in the linked repo was all in Python, but, like many political scientists, I’m more comfortable in R. (This is not a statement on the relative merits of each language, simply my own comfort level. Please, no language wars here!). I was curious whether it would be possible to implement this approach in R.

My first thought was to use the amazing quanteda package, and I asked the maintainers about this approach. Ken Benoit had some helpful suggestions, as well as a nice discussion of the quanteda design philosophy. The approach first tokenizes the corpus, removes stopwords, forms phrases from the remaining words, and finally recombines those phrases with the original tokens object (as implemented below). The resulting object can then be used in word2vec or other approaches.

Here, I test the approach on the US presidential inaugural addresses corpus in quanteda. I like this corpus because it’s small enough that computation is quick, and there is enough thematic structure that human inspection can tell whether the results “make sense.”

options(stringsAsFactors = FALSE)
library(quanteda)
## Package version: 1.5.1
## Parallel computing: 2 of 8 threads used.
## See https://quanteda.io for tutorials and examples.
## 
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
## 
##     View
library(data.table)
library(stringr)
# creating text
d <- quanteda::data_corpus_inaugural
tmp_data <- data.table::data.table(
  txts = unname(texts(d)),
  yr = docvars(d, "Year"),
  president = paste(docvars(d, "FirstName"), docvars(d, "President"))
)

# tokenize the text
inaug_tokens <- tokens(x = char_tolower(tmp_data$txts), what = "word",
  remove_punct = TRUE, remove_separators = TRUE)

# remove stopwords, but preserve the initial structure
inaug_tokens2 <- tokens_remove(inaug_tokens, stopwords("english"), padding = TRUE)

# two-word collocations that occur at least twice
cols <- textstat_collocations(inaug_tokens2, size = 2, min_count = 2)
# keep only collocations with z >= 2.58 (roughly p < .01)
cols <- cols[cols$z >= 2.58, ]
# recombine results
combined <- tokens_compound(inaug_tokens, cols)

# first twenty tokens of Obama's first inaugural: 
combined[[56]][1:20]
##  [1] "my"              "fellow_citizens" "i"              
##  [4] "stand"           "here"            "today"          
##  [7] "humbled"         "by"              "the"            
## [10] "task"            "before"          "us"             
## [13] "grateful"        "for"             "the"            
## [16] "trust"           "you"             "have"           
## [19] "bestowed"        "mindful"
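
As a quick sanity check (not part of the original pipeline, just an illustrative snippet), we can count how many multiword tokens the compounding step created; quanteda joins compounds with “_” by default:

# unique token types containing the "_" joiner, i.e., the detected phrases
compound_types <- grep("_", types(combined), value = TRUE, fixed = TRUE)
length(compound_types)
head(compound_types)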

Word2vec

Now that we have these recombined tokens, we can use them in our statistical model of choice. Here, I use word2vec, in line with the idea that inspired this post. I use the reticulate package to call the gensim implementation of word2vec, which I initially discussed here.

set.seed(216)
library(reticulate)
gensim <- import("gensim") # import the gensim library
Word2Vec <- gensim$models$Word2Vec # Extract the Word2Vec model
multiprocessing <- import("multiprocessing") # For parallel processing

# create the word2vec object
basemodel <- Word2Vec(
    workers = 1L, # using 1 core
    window = 5L, # co-occurrence window of size 5
    iter = 10L, # iter = sweeps of SGD through the data; more is better
    sg = 1L, # skip-gram rather than CBOW
    hs = 0L, negative = 1L, # negative sampling with 1 noise word
    size = 25L # only 25 dimensions, since it's a small corpus
)

# drop the quanteda "textNN" document names so gensim gets a plain list
combined_list <- unname(as.list(combined))

basemodel$build_vocab(sentences = combined_list)
basemodel$train(
  sentences = combined_list,
  epochs = basemodel$iter, 
  total_examples = basemodel$corpus_count)
## [1] 725200

And now, we can examine the model output. We’ll need to bring the embedding matrix into R (again, see my gensim tutorial for details), and use cosine similarity to see how well this captures meaningful relationships:

library(Matrix)
embeds <- basemodel$wv$syn0
rownames(embeds) <- basemodel$wv$index2word

# function for cosine similarity: returns the ten tokens closest to vec1
closest_vector <- function(vec1, mat1){
  vec1 <- Matrix(vec1, nrow = 1, ncol = length(vec1))
  mat1 <- Matrix(mat1)
  # squared norms, used to normalize the dot products
  mat_magnitudes <- rowSums(mat1^2)
  vec_magnitudes <- rowSums(vec1^2)
  # cosine similarity = dot product / product of vector norms
  sim <- (t(tcrossprod(vec1, mat1)/
      (sqrt(tcrossprod(vec_magnitudes, mat_magnitudes)))))
  sim2 <- matrix(sim, dimnames = list(rownames(sim)))
  
  # sort by decreasing similarity and return the top ten
  w <- sim2[order(-sim2),,drop = FALSE]
  w[1:10,]
}

closest_vector(embeds["united_states", ], embeds)
##      united_states       constitution            article 
##          1.0000000          0.8954939          0.8729917 
##     accountability general_government        legislative 
##          0.8634663          0.8520080          0.8464613 
##            several              under            defects 
##          0.8458755          0.8290494          0.8141552 
##             policy 
##          0.8125987
closest_vector(embeds["americans", ], embeds)
##  americans      quiet    resolve       live       face     hearts 
##  1.0000000  0.9407097  0.9357236  0.9310781  0.9199575  0.9130146 
## determined      renew   together     voices 
##  0.9118831  0.9114557  0.9087178  0.9079879
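
(As an aside, gensim can perform this nearest-neighbour lookup itself, e.g. basemodel$wv$most_similar("united_states", topn = 10L), which should return comparable rankings; doing it in R simply keeps the embedding matrix available for other analyses.)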

Not bad for an initial analysis!