library(quanteda)

This example shows how quanteda can be used with the text2vec package to replicate its GloVe example.

Read in the corpus

We read in the corpus used in the text2vec vignette, caching a local copy so it is only downloaded once:

temp_file <- "/tmp/wiki.RDS"
if (!file.exists(temp_file)) {
    wiki <- corpus(readtext::readtext("http://mattmahoney.net/dc/text8.zip", verbosity = 0))
    saveRDS(wiki, file = temp_file)
} else {
    wiki <- readRDS(temp_file)
}

Select features

First, we tokenize the corpus and get the names of the features that occur five times or more, trimming the features before constructing the fcm:

wiki_toks <- tokens(wiki)
feats <- dfm(wiki_toks, verbose = TRUE) %>% 
    dfm_trim(min_count = 5) %>%
    featnames()
## Creating a dfm from a tokens input...
##    ... lowercasing
##    ... found 1 document, 253,853 features
##    ... created a 1 x 253,853 sparse dfm
##    ... complete. 
## Elapsed time: 1.38 seconds.
# leave the pads so that non-adjacent words will not become adjacent
wiki_toks <- tokens_select(wiki_toks, feats, padding = TRUE)
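
As a small illustration on a toy sentence (not part of the example data), padding replaces removed tokens with empty strings, so the surviving tokens keep their original distances and the co-occurrence window in the next step will not treat them as adjacent:

toy <- tokens("the quick brown fox jumps over the lazy dog")
# with padding, removed tokens become "" and word distances are preserved
tokens_select(toy, c("quick", "fox", "lazy", "dog"), padding = TRUE)
# without padding, e.g. "quick" and "fox" would become adjacent
tokens_select(toy, c("quick", "fox", "lazy", "dog"), padding = FALSE)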

Construct the feature co-occurrence matrix

wiki_fcm <- fcm(wiki_toks, context = "window", count = "weighted", weights = 1/(1:5), tri = TRUE)
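
As a rough sketch of what the weighted counting does (a toy illustration, not part of the example data): within the default window of 5 tokens, a co-occurrence at distance d is counted with weight 1/d, so immediate neighbours contribute 1, tokens two positions apart contribute 1/2, and so on.

# toy illustration of weighted window counting
toy_fcm <- fcm(tokens("a b c d e f"), context = "window", window = 5,
               count = "weighted", weights = 1/(1:5), tri = TRUE)
toy_fcm
# e.g. the ("a", "b") cell should be 1 (distance 1) and ("a", "c") should be 0.5 (distance 2)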

Fit word embedding model

Fit the GloVe model using text2vec.

library(text2vec)

GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.

GloVe encodes the ratios of word-word co-occurrence probabilities as vector differences, which is thought to capture a crude form of the meaning associated with each word. The training objective of GloVe is to learn word vectors such that their dot product equals the logarithm of the words’ probability of co-occurrence.
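
As a sketch of that objective (illustrative only, not the text2vec implementation), the contribution of a single non-zero co-occurrence count x_ij to the GloVe loss can be written as an R function, where the weighting function f() uses the x_max and alpha parameters to cap the influence of very frequent pairs:

# illustrative sketch of the per-cell GloVe cost, not the text2vec code
glove_cell_cost <- function(w_i, w_j, b_i, b_j, x_ij, x_max = 10, alpha = 0.75) {
    f <- min((x_ij / x_max)^alpha, 1)               # weighting function f(x_ij)
    f * (sum(w_i * w_j) + b_i + b_j - log(x_ij))^2  # squared error in log co-occurrence space
}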

glove <- GlobalVectors$new(word_vectors_size = 50, vocabulary = featnames(wiki_fcm), x_max = 10)
wiki_main <- fit_transform(wiki_fcm, glove, n_iter = 20)
## INFO [2017-12-08 16:40:07] 2017-12-08 16:40:07 - epoch 1, expected cost 0.0832
## INFO [2017-12-08 16:40:09] 2017-12-08 16:40:09 - epoch 2, expected cost 0.0622
## INFO [2017-12-08 16:40:11] 2017-12-08 16:40:11 - epoch 3, expected cost 0.0540
## INFO [2017-12-08 16:40:12] 2017-12-08 16:40:12 - epoch 4, expected cost 0.0502
## INFO [2017-12-08 16:40:14] 2017-12-08 16:40:14 - epoch 5, expected cost 0.0477
## INFO [2017-12-08 16:40:16] 2017-12-08 16:40:16 - epoch 6, expected cost 0.0459
## INFO [2017-12-08 16:40:18] 2017-12-08 16:40:18 - epoch 7, expected cost 0.0446
## INFO [2017-12-08 16:40:19] 2017-12-08 16:40:19 - epoch 8, expected cost 0.0435
## INFO [2017-12-08 16:40:21] 2017-12-08 16:40:21 - epoch 9, expected cost 0.0426
## INFO [2017-12-08 16:40:23] 2017-12-08 16:40:23 - epoch 10, expected cost 0.0419
## INFO [2017-12-08 16:40:25] 2017-12-08 16:40:25 - epoch 11, expected cost 0.0413
## INFO [2017-12-08 16:40:26] 2017-12-08 16:40:26 - epoch 12, expected cost 0.0407
## INFO [2017-12-08 16:40:28] 2017-12-08 16:40:28 - epoch 13, expected cost 0.0403
## INFO [2017-12-08 16:40:30] 2017-12-08 16:40:30 - epoch 14, expected cost 0.0399
## INFO [2017-12-08 16:40:32] 2017-12-08 16:40:32 - epoch 15, expected cost 0.0395
## INFO [2017-12-08 16:40:34] 2017-12-08 16:40:34 - epoch 16, expected cost 0.0392
## INFO [2017-12-08 16:40:35] 2017-12-08 16:40:35 - epoch 17, expected cost 0.0389
## INFO [2017-12-08 16:40:37] 2017-12-08 16:40:37 - epoch 18, expected cost 0.0386
## INFO [2017-12-08 16:40:39] 2017-12-08 16:40:39 - epoch 19, expected cost 0.0384
## INFO [2017-12-08 16:40:41] 2017-12-08 16:40:41 - epoch 20, expected cost 0.0382

Combining the learned word vectors

The model learns two sets of word vectors: main and context. According to the GloVe paper, averaging or summing the two sets of vectors results in a more accurate representation; here we take their sum.

wiki_context <- glove$components
dim(wiki_context)
## [1]    50 71290
wiki_vectors <- wiki_main + t(wiki_context)

Examining term representations

Now we can find the closest word vectors to paris - france + germany:

berlin <-  wiki_vectors["paris", , drop = FALSE] - 
    wiki_vectors["france", , drop = FALSE] + 
    wiki_vectors["germany", , drop = FALSE]

# calculate the similarity
wiki_vector_dfm <- as.dfm(rbind(wiki_vectors, berlin))
wiki_vector_dfm@Dimnames$docs[dim(wiki_vector_dfm)[1]] <- "new_berlin"
cos_sim <- textstat_simil(wiki_vector_dfm, "new_berlin",
                          margin = "documents", method = "cosine")
head(sort(cos_sim[,1], decreasing = TRUE), 5)
## new_berlin      paris     berlin     munich    germany 
##  1.0000000  0.7776876  0.7690844  0.7209527  0.7123541
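
The cosine similarity reported above is just the dot product of two row vectors divided by the product of their norms; a quick check in base R (illustration only) should reproduce, for example, the berlin entry:

# illustration only: cosine similarity computed directly
cosine <- function(x, y) sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
cosine(as.numeric(berlin), wiki_vectors["berlin", ])
# should match the "berlin" value from textstat_simil() above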

Here is another example: london = paris - france + uk + england

london <-  wiki_vectors["paris", , drop = FALSE] - 
    wiki_vectors["france", , drop = FALSE] + 
    wiki_vectors["uk", , drop = FALSE] + 
    wiki_vectors["england", , drop = FALSE] 

wiki_vector_dfm <- as.dfm(rbind(wiki_vectors, london))
wiki_vector_dfm@Dimnames$docs[dim(wiki_vector_dfm)[1]] <- "new_london"
cos_sim <- textstat_simil(wiki_vector_dfm, "new_london",
                          margin = "documents", method = "cosine")
head(sort(cos_sim[,1], decreasing = TRUE), 5)
## new_london         uk    england       york     london 
##  1.0000000  0.7823298  0.7599103  0.7451288  0.7413701