In this vignette, we show how to perform Latent Semantic Analysis (LSA) using the quanteda package, based on Grossman and Frieder’s Information Retrieval: Algorithms and Heuristics.

LSA decomposes a document-feature matrix into a reduced vector space that is assumed to reflect semantic structure.

New documents or queries can be ‘folded-in’ to this constructed latent semantic space for downstream tasks.
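As a rough sketch of the algebra behind these two statements, assume the standard truncated singular value decomposition and a matrix $A$ whose rows are documents and whose columns are features. The symbols $U_k$, $\Sigma_k$, $V_k$, $q$ and $\hat{q}$ are notation introduced here for illustration; they are not names used by quanteda. Keeping the $k$ largest singular values gives the reduced space, and a new row vector of feature counts $q$ is folded in by projecting it onto the feature vectors and rescaling:

```latex
A \approx A_k = U_k \Sigma_k V_k^{\top},
\qquad
\hat{q} = q \, V_k \, \Sigma_k^{-1}
```

Under this convention the rows of $U_k$ are the document coordinates, and $\hat{q}$ lives in the same $k$-dimensional space, so it can be compared with them directly.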

library(quanteda)

# Create a document-feature matrix

txt <- c(d1 = "Shipment of gold damaged in a fire",
         d2 = "Delivery of silver arrived in a silver truck",
         d3 = "Shipment of gold arrived in a truck")

mydfm <- dfm(txt)
mydfm
## Document-feature matrix of: 3 documents, 11 features (36.4% sparse).
## 3 x 11 sparse Matrix of class "dfm"
##     features
## docs shipment of gold damaged in a fire delivery silver arrived truck
##   d1        1  1    1       1  1 1    1        0      0       0     0
##   d2        0  1    0       0  1 1    0        1      2       1     1
##   d3        1  1    1       0  1 1    0        0      0       1     1

# Construct the LSA model

mylsa <- textmodel_lsa(mydfm)
## Warning in fun(A, k, nu, nv, opts, mattype = "dgCMatrix"): all singular
## values are requested, svd() is used instead
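To make the decomposition step more concrete, the following is a minimal sketch in base R, not a description of quanteda's internals. It assumes that the model amounts to a plain singular value decomposition of the raw counts, and it retypes the matrix printed above as a dense matrix `A`. The names `A`, `dec`, `k`, `doc_coords` and `A_k` are illustrative.

```r
# Dense copy of the 3 x 11 document-feature matrix printed above
# (rows are documents, columns are features).
A <- rbind(d1 = c(1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0),
           d2 = c(0, 1, 0, 0, 1, 1, 0, 1, 2, 1, 1),
           d3 = c(1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1))
colnames(A) <- c("shipment", "of", "gold", "damaged", "in", "a", "fire",
                 "delivery", "silver", "arrived", "truck")

# Full SVD: A = U %*% diag(d) %*% t(V)
dec <- svd(A)

# Keep only the first k = 2 latent dimensions.
k <- 2
doc_coords <- dec$u[, 1:k]
rownames(doc_coords) <- rownames(A)

# Rank-k approximation of the original matrix.
A_k <- dec$u[, 1:k] %*% diag(dec$d[1:k]) %*% t(dec$v[, 1:k])
dimnames(A_k) <- dimnames(A)

# The sign of each singular vector is arbitrary, so compare magnitudes.
round(abs(doc_coords), 4)
```

If the assumption holds, `abs(doc_coords)` should agree with `abs(mylsa$docs[, 1:2])`. Individual columns may differ in sign between SVD implementations, which does not affect distances or cosine similarities in the reduced space.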

The new document vector coordinates in the reduced 2-dimensional space are:

mylsa$docs[, 1:2]
##          [,1]       [,2]
## d1 -0.4944666  0.6491758
## d2 -0.6458224 -0.7194469
## d3 -0.5817355  0.2469149

# Apply the constructed LSA model to new data

Now the new unseen document can be represented in the reduced 2-dimensional space.

The unseen query document:

querydfm <- dfm(c("gold silver truck")) %>% dfm_select(pattern = mydfm)
querydfm
## Document-feature matrix of: 1 document, 11 features (72.7% sparse).
## 1 x 11 sparse Matrix of class "dfm"
##        features
## docs    shipment of gold damaged in a fire delivery silver arrived truck
##   text1        0  0    1       0  0 0    0        0      1       0     1

newq <- predict(mylsa, querydfm)
newq$docs_newspace[, 1:2]
## [1] -0.2140026 -0.1820571
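The fold-in step and one possible downstream use can be sketched in base R as follows. This is an illustration under stated assumptions rather than quanteda's implementation. It assumes the query is projected as `q %*% V_k %*% solve(Sigma_k)`, and it ranks documents by cosine similarity, which is one common choice for retrieval. The names `A`, `q`, `q_hat`, `docs_k` and `cosine` are hypothetical.

```r
# Dense copy of the 3 x 11 document-feature matrix from the example
# (rows are documents; columns follow the feature order printed above).
A <- rbind(d1 = c(1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0),
           d2 = c(0, 1, 0, 0, 1, 1, 0, 1, 2, 1, 1),
           d3 = c(1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1))

dec <- svd(A)
k <- 2
V_k <- dec$v[, 1:k]                # feature vectors (11 x 2)
S_k_inv <- diag(1 / dec$d[1:k])    # inverse of the top-k singular values
docs_k <- dec$u[, 1:k]             # document coordinates (3 x 2)

# Query "gold silver truck" as counts over the same 11 features.
q <- c(0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1)

# Fold the query into the latent space.
q_hat <- as.vector(q %*% V_k %*% S_k_inv)

# Cosine similarity between the folded-in query and each document.
cosine <- as.vector(docs_k %*% q_hat) /
  (sqrt(rowSums(docs_k^2)) * sqrt(sum(q_hat^2)))
names(cosine) <- rownames(A)

# Documents ordered from most to least similar to the query.
sort(cosine, decreasing = TRUE)
```

Up to sign, `q_hat` should match the coordinates returned by `predict()` above. With these coordinates, d2 comes out closest to the query (cosine of about 0.99), followed by d3 and then d1.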