These functions compute matrices of distances and similarities between documents or features from a dfm and return a dist object (or a matrix if specific targets are selected). They are fast and robust because they operate directly on the sparse dfm objects.

textstat_dist(x, selection = NULL, margin = c("documents", "features"),
  method = "euclidean", upper = FALSE, diag = FALSE, p = 2)

textstat_simil(x, selection = NULL, margin = c("documents", "features"),
  method = "correlation", upper = FALSE, diag = FALSE)

Arguments

x

a dfm object

selection

character vector of document names or feature labels from x; or a numeric vector or matrix that conforms to x. A "dist" object is returned if selection is NULL; otherwise, a matrix is returned giving the distances to the documents or features identified in the selection.

margin

identifies the margin of the dfm on which similarity or distance will be computed: "documents" for documents or "features" for word/term features.

method

the similarity or distance measure to be used; see Details

upper

whether the upper triangle of the symmetric \(V \times V\) matrix is recorded

diag

whether the diagonal of the distance matrix should be recorded

p

The power of the Minkowski distance.
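
To make the role of p concrete, here is a minimal base R sketch of the standard Minkowski distance. The helper name `minkowski` and the toy vectors are hypothetical and are not part of the package; the sketch assumes the usual definition, in which p = 1 gives the Manhattan distance and p = 2 the Euclidean distance.

```r
# Hypothetical helper (not part of the package): Minkowski distance of order p,
# assuming the standard definition (sum of |x - y|^p, then the p-th root)
minkowski <- function(x, y, p = 2) {
  sum(abs(x - y)^p)^(1 / p)
}

# Toy feature counts for two documents (made-up data)
x <- c(1, 0, 3)
y <- c(4, 4, 3)

d_manhattan <- minkowski(x, y, p = 1)  # |1-4| + |0-4| + |3-3|
d_euclidean <- minkowski(x, y, p = 2)  # sqrt(3^2 + 4^2 + 0^2)
```

Base R's `dist(rbind(x, y), method = "minkowski", p = 2)` gives the same value as `d_euclidean` for these vectors.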

Value

textstat_simil and textstat_dist return dist class objects when selection is NULL; otherwise they return a matrix.

Details

textstat_dist options are: "euclidean" (default), "chisquared", "chisquared2", "hamming", "kullback", "manhattan", "maximum", "canberra", and "minkowski". textstat_simil options are: "correlation" (default), "cosine", "jaccard", "ejaccard", "dice", "edice", "simple matching", "hamann", and "faith".
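
As a rough illustration of what two of the similarity options measure, the following base R sketch computes cosine similarity by hand and shows that Pearson correlation is the cosine of the mean-centered vectors. The helper `cosine` and the toy vectors are hypothetical; the sketch uses the textbook definitions and makes no claim about how the package implements them on sparse objects.

```r
# Hypothetical helper (not part of the package): textbook cosine similarity
cosine <- function(x, y) {
  sum(x * y) / sqrt(sum(x^2) * sum(y^2))
}

# Toy feature counts for two documents (made-up data)
x <- c(2, 0, 1, 3)
y <- c(1, 1, 0, 4)

sim_cos <- cosine(x, y)

# Pearson correlation equals the cosine of the mean-centered vectors
sim_cor <- cosine(x - mean(x), y - mean(y))
```

Cosine similarity is unchanged when a document's counts are multiplied by a constant, so `cosine(x, 2 * x)` is 1.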

Note

If you want to compute similarity on a "normalized" dfm object (controlling for variable document lengths, for methods such as correlation for which different document lengths matter), then wrap the input dfm in dfm_weight(x, "relfreq").
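
The effect of this normalization can be sketched in base R on a small dense matrix. The data below are made up, and dividing each row by its total is assumed to be what relative-frequency weighting amounts to. Euclidean distance is used here because its sensitivity to document length is easy to see.

```r
# Toy document-feature counts (made-up data): doc2 is doc1 repeated three times
counts <- rbind(doc1 = c(2, 1, 1),
                doc2 = c(6, 3, 3))

# Relative frequencies: divide each row by its row total
relfreq <- counts / rowSums(counts)

d_raw  <- as.numeric(dist(counts))   # Euclidean distance on raw counts
d_norm <- as.numeric(dist(relfreq))  # Euclidean distance on proportions
```

On raw counts the two documents look far apart only because one is longer; after normalization their feature proportions are identical and the distance is zero.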

References

The "chisquared" metric is from Legendre, P., & Gallagher, E. D. (2001). "Ecologically meaningful transformations for ordination of species data". Oecologia, 129(2), 271–280. doi.org/10.1007/s004420100716

The "chisquared2" metric is the "Quadratic-Chi" measure from Pele, O., & Werman, M. (2010). "The Quadratic-Chi Histogram Distance Family". In Computer Vision – ECCV 2010 (Vol. 6312, pp. 749–762). Berlin, Heidelberg: Springer. doi.org/10.1007/978-3-642-15552-9_54.

"hamming" is \(\sum{(x \neq y)}\).

"kullback" is the Kullback-Leibler distance, which assumes that \(P(x_i) = 0\) implies \(P(y_i) = 0\); when both \(P(x_i)\) and \(P(y_i)\) equal zero, \(P(x_i) \log(P(x_i)/P(y_i))\) is taken to be zero, its limit value. The formula is: $$\sum_i{P(x_i) \log(P(x_i)/P(y_i))}$$

All other measures are described in the proxy package.
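
The "hamming" and "kullback" formulas above can be sketched in base R as follows. The helper `kl_divergence` and the toy vectors are hypothetical, and converting counts to probabilities by dividing by their sum is an assumption of this sketch rather than a statement about the package internals.

```r
# Toy feature counts for two documents (made-up data)
x <- c(1, 0, 2, 0, 5)
y <- c(1, 3, 2, 0, 4)

# Hamming: number of positions at which the two vectors differ
hamming <- sum(x != y)

# Hypothetical helper (not part of the package): Kullback-Leibler distance.
# Counts are turned into probabilities (an assumption of this sketch), and
# terms with P(x_i) = 0 contribute zero, following the limit convention.
kl_divergence <- function(x, y) {
  p <- x / sum(x)
  q <- y / sum(y)
  keep <- p > 0
  sum(p[keep] * log(p[keep] / q[keep]))
}

a <- c(2, 2, 0)
b <- c(1, 3, 0)
kl_ab <- kl_divergence(a, b)
```

The measure is not symmetric: `kl_divergence(a, b)` and `kl_divergence(b, a)` generally differ, and the result is infinite if some \(P(y_i)\) is zero where \(P(x_i)\) is positive.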

See also

textstat_dist, as.list.dist, dist

Examples

# create a dfm from inaugural addresses from Clinton onwards
presDfm <- dfm(corpus_subset(data_corpus_inaugural, Year > 1990),
               remove = stopwords("english"), stem = TRUE, remove_punct = TRUE)

# distances for documents
(d1 <- textstat_dist(presDfm, margin = "documents"))
#>              1993-Clinton 1997-Clinton 2001-Bush 2005-Bush 2009-Obama
#> 1997-Clinton     58.90671
#> 2001-Bush        52.82045     63.63961
#> 2005-Bush        62.79331     73.38256  54.32311
#> 2009-Obama       51.66237     59.95832  50.70503  62.33779
#> 2013-Obama       51.30302     60.81118  49.03060  57.90509   48.48711
#> 2017-Trump       52.14403     65.85590  48.79549  58.00000   55.65968
#>              2013-Obama
#> 1997-Clinton
#> 2001-Bush
#> 2005-Bush
#> 2009-Obama
#> 2013-Obama
#> 2017-Trump     55.21775
as.matrix(d1)
#>              1993-Clinton 1997-Clinton 2001-Bush 2005-Bush 2009-Obama
#> 1993-Clinton      0.00000     58.90671  52.82045  62.79331   51.66237
#> 1997-Clinton     58.90671      0.00000  63.63961  73.38256   59.95832
#> 2001-Bush        52.82045     63.63961   0.00000  54.32311   50.70503
#> 2005-Bush        62.79331     73.38256  54.32311   0.00000   62.33779
#> 2009-Obama       51.66237     59.95832  50.70503  62.33779    0.00000
#> 2013-Obama       51.30302     60.81118  49.03060  57.90509   48.48711
#> 2017-Trump       52.14403     65.85590  48.79549  58.00000   55.65968
#>              2013-Obama 2017-Trump
#> 1993-Clinton   51.30302   52.14403
#> 1997-Clinton   60.81118   65.85590
#> 2001-Bush      49.03060   48.79549
#> 2005-Bush      57.90509   58.00000
#> 2009-Obama     48.48711   55.65968
#> 2013-Obama      0.00000   55.21775
#> 2017-Trump     55.21775    0.00000

# distances for specific documents
textstat_dist(presDfm, "2017-Trump", margin = "documents")
#>              2017-Trump
#> 2017-Trump      0.00000
#> 1993-Clinton   52.14403
#> 1997-Clinton   65.85590
#> 2001-Bush      48.79549
#> 2005-Bush      58.00000
#> 2009-Obama     55.65968
#> 2013-Obama     55.21775
textstat_dist(presDfm, "2005-Bush", margin = "documents", method = "jaccard")
#>              2005-Bush
#> 2005-Bush    1.0000000
#> 1993-Clinton 0.2216867
#> 1997-Clinton 0.2392503
#> 2001-Bush    0.2591195
#> 2009-Obama   0.2502483
#> 2013-Obama   0.2505353
#> 2017-Trump   0.1852761
(d2 <- textstat_dist(presDfm, c("2009-Obama" , "2013-Obama"), margin = "documents"))
#>              2009-Obama 2013-Obama
#> 2009-Obama      0.00000   48.48711
#> 2013-Obama     48.48711    0.00000
#> 1993-Clinton   51.66237   51.30302
#> 1997-Clinton   59.95832   60.81118
#> 2001-Bush      50.70503   49.03060
#> 2005-Bush      62.33779   57.90509
#> 2017-Trump     55.65968   55.21775
as.list(d1)
#> $`1993-Clinton`
#>    2005-Bush 1997-Clinton    2001-Bush   2017-Trump   2009-Obama   2013-Obama
#>     62.79331     58.90671     52.82045     52.14403     51.66237     51.30302
#>
#> $`1997-Clinton`
#>    2005-Bush   2017-Trump    2001-Bush   2013-Obama   2009-Obama 1993-Clinton
#>     73.38256     65.85590     63.63961     60.81118     59.95832     58.90671
#>
#> $`2001-Bush`
#> 1997-Clinton    2005-Bush 1993-Clinton   2009-Obama   2013-Obama   2017-Trump
#>     63.63961     54.32311     52.82045     50.70503     49.03060     48.79549
#>
#> $`2005-Bush`
#> 1997-Clinton 1993-Clinton   2009-Obama   2017-Trump   2013-Obama    2001-Bush
#>     73.38256     62.79331     62.33779     58.00000     57.90509     54.32311
#>
#> $`2009-Obama`
#>    2005-Bush 1997-Clinton   2017-Trump 1993-Clinton    2001-Bush   2013-Obama
#>     62.33779     59.95832     55.65968     51.66237     50.70503     48.48711
#>
#> $`2013-Obama`
#> 1997-Clinton    2005-Bush   2017-Trump 1993-Clinton    2001-Bush   2009-Obama
#>     60.81118     57.90509     55.21775     51.30302     49.03060     48.48711
#>
#> $`2017-Trump`
#> 1997-Clinton    2005-Bush   2009-Obama   2013-Obama 1993-Clinton    2001-Bush
#>     65.85590     58.00000     55.65968     55.21775     52.14403     48.79549
#>

# similarities for documents
(s1 <- textstat_simil(presDfm, method = "cosine", margin = "documents"))
#>              1993-Clinton 1997-Clinton 2001-Bush 2005-Bush 2009-Obama
#> 1997-Clinton    0.6680262
#> 2001-Bush       0.5358898    0.5912236
#> 2005-Bush       0.5012215    0.5006142 0.5982538
#> 2009-Obama      0.6280946    0.6593018 0.6018113 0.5266249
#> 2013-Obama      0.6265428    0.6466030 0.6193608 0.5867178  0.6815711
#> 2017-Trump      0.5511398    0.5558054 0.5327058 0.5386656  0.5192075
#>              2013-Obama
#> 1997-Clinton
#> 2001-Bush
#> 2005-Bush
#> 2009-Obama
#> 2013-Obama
#> 2017-Trump    0.5160104
as.matrix(s1)
#>              1993-Clinton 1997-Clinton 2001-Bush 2005-Bush 2009-Obama
#> 1993-Clinton    1.0000000    0.6680262 0.5358898 0.5012215  0.6280946
#> 1997-Clinton    0.6680262    1.0000000 0.5912236 0.5006142  0.6593018
#> 2001-Bush       0.5358898    0.5912236 1.0000000 0.5982538  0.6018113
#> 2005-Bush       0.5012215    0.5006142 0.5982538 1.0000000  0.5266249
#> 2009-Obama      0.6280946    0.6593018 0.6018113 0.5266249  1.0000000
#> 2013-Obama      0.6265428    0.6466030 0.6193608 0.5867178  0.6815711
#> 2017-Trump      0.5511398    0.5558054 0.5327058 0.5386656  0.5192075
#>              2013-Obama 2017-Trump
#> 1993-Clinton  0.6265428  0.5511398
#> 1997-Clinton  0.6466030  0.5558054
#> 2001-Bush     0.6193608  0.5327058
#> 2005-Bush     0.5867178  0.5386656
#> 2009-Obama    0.6815711  0.5192075
#> 2013-Obama    1.0000000  0.5160104
#> 2017-Trump    0.5160104  1.0000000
as.list(s1)
#> $`1993-Clinton`
#> 1997-Clinton   2009-Obama   2013-Obama   2017-Trump    2001-Bush    2005-Bush
#>    0.6680262    0.6280946    0.6265428    0.5511398    0.5358898    0.5012215
#>
#> $`1997-Clinton`
#> 1993-Clinton   2009-Obama   2013-Obama    2001-Bush   2017-Trump    2005-Bush
#>    0.6680262    0.6593018    0.6466030    0.5912236    0.5558054    0.5006142
#>
#> $`2001-Bush`
#>   2013-Obama   2009-Obama    2005-Bush 1997-Clinton 1993-Clinton   2017-Trump
#>    0.6193608    0.6018113    0.5982538    0.5912236    0.5358898    0.5327058
#>
#> $`2005-Bush`
#>    2001-Bush   2013-Obama   2017-Trump   2009-Obama 1993-Clinton 1997-Clinton
#>    0.5982538    0.5867178    0.5386656    0.5266249    0.5012215    0.5006142
#>
#> $`2009-Obama`
#>   2013-Obama 1997-Clinton 1993-Clinton    2001-Bush    2005-Bush   2017-Trump
#>    0.6815711    0.6593018    0.6280946    0.6018113    0.5266249    0.5192075
#>
#> $`2013-Obama`
#>   2009-Obama 1997-Clinton 1993-Clinton    2001-Bush    2005-Bush   2017-Trump
#>    0.6815711    0.6466030    0.6265428    0.6193608    0.5867178    0.5160104
#>
#> $`2017-Trump`
#> 1997-Clinton 1993-Clinton    2005-Bush    2001-Bush   2009-Obama   2013-Obama
#>    0.5558054    0.5511398    0.5386656    0.5327058    0.5192075    0.5160104
#>

# similarities for specific documents
textstat_simil(presDfm, "2017-Trump", margin = "documents")
#>              2017-Trump
#> 2017-Trump    1.0000000
#> 1993-Clinton  0.4967910
#> 1997-Clinton  0.4989669
#> 2001-Bush     0.4672634
#> 2005-Bush     0.4739241
#> 2009-Obama    0.4377484
#> 2013-Obama    0.4414144
textstat_simil(presDfm, "2017-Trump", method = "cosine", margin = "documents")
#>              2017-Trump
#> 2017-Trump    1.0000000
#> 1993-Clinton  0.5511398
#> 1997-Clinton  0.5558054
#> 2001-Bush     0.5327058
#> 2005-Bush     0.5386656
#> 2009-Obama    0.5192075
#> 2013-Obama    0.5160104
textstat_simil(presDfm, c("2009-Obama" , "2013-Obama"), margin = "documents")
#>              2009-Obama 2013-Obama
#> 2009-Obama    1.0000000  0.6103693
#> 2013-Obama    0.6103693  1.0000000
#> 1993-Clinton  0.5707623  0.5725041
#> 1997-Clinton  0.6026942  0.5916516
#> 2001-Bush     0.5241995  0.5523905
#> 2005-Bush     0.4330978  0.5137096
#> 2017-Trump    0.4377484  0.4414144

# compute some term similarities
s2 <- textstat_simil(presDfm, c("fair", "health", "terror"), method = "cosine",
                     margin = "features")
head(as.matrix(s2), 10)
#>               fair    health     terror
#> fair     1.0000000 0.7559289 0.15430335
#> health   0.7559289 1.0000000 0.54433105
#> terror   0.1543033 0.5443311 1.00000000
#> fellow   0.4265617 0.7181848 0.67016625
#> citizen  0.6787417 0.7144508 0.49663296
#> today    0.6265515 0.8288497 0.59866609
#> celebr   0.4472136 0.6761234 0.48304589
#> mysteri  0.2672612 0.2357023 0.28867513
#> american 0.5665941 0.7335861 0.67709711
#> renew    0.5041842 0.4850713 0.09901475
as.list(s2, n = 8)
#> $fair
#>   continu    purpos    travel    failur      lead     begin    courag      call
#> 1.0000000 0.9636241 0.9561829 0.9449112 0.9166985 0.9091373 0.9091373 0.8971226
#>