Returns a document-feature matrix with the feature frequencies weighted according to one of several common methods. Shortcut functions that offer finer-grained control are:

  • tf: compute term frequency weights

  • tfidf: compute term frequency-inverse document frequency weights

  • docfreq: compute document frequencies of features

dfm_weight(x, type = c("frequency", "relfreq", "relmaxfreq", "logfreq",
  "tfidf"), weights = NULL)

dfm_smooth(x, smoothing = 1)

Arguments

x

document-feature matrix created by dfm

type

a label of the weight type:

"frequency"

integer feature count (default when a dfm is created)

"relfreq"

each feature count as a proportion of the total feature count in the document (aka relative frequency)

"relmaxfreq"

each feature count as a proportion of the highest feature count in the document

"logfreq"

the base-10 logarithm of 1 + the feature count

"tfidf"

Term-frequency * inverse document frequency. For a full explanation, see, for example, http://nlp.stanford.edu/IR-book/html/htmledition/term-frequency-and-weighting-1.html. This implementation will not return negative values. For finer-grained control, call tfidf directly.

weights

if type is unused, weights can be a named numeric vector whose names correspond to feature labels of the dfm. Each weight is applied as a multiplier to the existing counts of the matching feature. Features not named are assigned a weight of 1.0 (meaning they are left unchanged).

smoothing

constant added to the dfm cells for smoothing, default is 1
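The weighting types and the weights argument described above can be illustrated on a plain matrix. The following is a minimal base R sketch, not the package's implementation: the toy matrix `counts` and the helper names are assumptions made for illustration, and the "logfreq" line follows the description above literally, so the package's own computation may differ in detail.

```r
# Toy count matrix standing in for a dfm (hypothetical data; rows are documents).
counts <- matrix(c(1, 1, 1, 0,
                   1, 1, 2, 1),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(docs = c("text1", "text2"),
                                 features = c("apple", "better", "banana", "much")))

# "relfreq": each count divided by the total count of its document (row).
relfreq <- counts / rowSums(counts)

# "relmaxfreq": each count divided by the largest count in its document (row).
relmaxfreq <- counts / apply(counts, 1, max)

# "logfreq" as described above: base-10 logarithm of 1 + the count.
logfreq <- log10(1 + counts)

# Named weights: multiply the matching feature columns, leave the others at 1.0.
w <- c(apple = 5, banana = 3, much = 0.5)
multipliers <- setNames(rep(1, ncol(counts)), colnames(counts))
multipliers[names(w)] <- w
weighted <- sweep(counts, 2, multipliers, `*`)
```

In this sketch every row of `relfreq` sums to 1, and the largest entry in every row of `relmaxfreq` is 1, which is a quick way to check which normalisation has been applied.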

Value

dfm_weight returns the dfm with weighted values. dfm_smooth returns a dfm whose values have been smoothed by adding the smoothing amount to every cell. Note that smoothing makes every zero cell non-zero, which effectively converts the matrix from sparse to dense format, so the result may exceed available memory depending on the size of your input matrix.
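To see why smoothing removes sparsity, consider a small base R sketch. It uses an ordinary matrix (`counts`, a hypothetical stand-in for a dfm) rather than the package's sparse class, so it only illustrates the arithmetic.

```r
# Toy count matrix standing in for a dfm (hypothetical data).
counts <- matrix(c(1, 1, 1, 0,
                   1, 1, 2, 1),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(docs = c("text1", "text2"),
                                 features = c("apple", "better", "banana", "much")))

# Add the smoothing constant to every cell, including the zeros.
smoothing <- 0.5
smoothed <- counts + smoothing

# Share of cells equal to zero before and after smoothing.
sparsity_before <- mean(counts == 0)
sparsity_after  <- mean(smoothed == 0)
```

Because every cell now holds a non-zero value, a sparse representation no longer saves any storage; for a large dfm the smoothed result needs roughly documents × features cells.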

Note

For finer-grained control, consider calling the convenience functions tf, tfidf, and docfreq directly.
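As a rough guide to what document frequency and tf-idf weighting compute, here is a base R sketch of the textbook formulation from Manning et al., in which the term frequency is multiplied by log10(N / df). This is an illustration under stated assumptions (base-10 logarithm, raw counts as term frequency, a hypothetical toy matrix `counts`), not the code used by tfidf or docfreq.

```r
# Toy count matrix standing in for a dfm (hypothetical data).
counts <- matrix(c(1, 1, 1, 0,
                   1, 1, 2, 1),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(docs = c("text1", "text2"),
                                 features = c("apple", "better", "banana", "much")))

# Document frequency: number of documents in which each feature occurs.
n_docs <- nrow(counts)
doc_freq <- colSums(counts > 0)

# Inverse document frequency; it is never negative because doc_freq <= n_docs.
idf <- log10(n_docs / doc_freq)

# tf-idf: scale each feature column of the counts by that feature's idf.
tf_idf <- sweep(counts, 2, idf, `*`)
```

Features that occur in every document get an idf of 0 and therefore a tf-idf of 0, which is why very common tokens drop out of the top features after tf-idf weighting.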

References

Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Vol. 1. Cambridge: Cambridge University Press, 2008.

See also

tf, tfidf, docfreq

Examples

dtm <- dfm(data_corpus_inaugural)
x <- apply(dtm, 1, function(tf) tf/max(tf))
topfeatures(dtm)
#>   the    of     ,   and     .    to    in     a   our  that
#> 10082  7103  7026  5310  4945  4526  2785  2246  2181  1789
normDtm <- dfm_weight(dtm, "relfreq")
topfeatures(normDtm)
#>       the         ,        of       and         .        to        in       our
#> 3.7910332 2.7639649 2.6821863 2.0782035 1.9594539 1.7643366 1.0695645 0.8731637
#>         a        we
#> 0.8593092 0.7726443
maxTfDtm <- dfm_weight(dtm, type = "relmaxfreq")
topfeatures(maxTfDtm)
#>      the        ,       of      and        .       to       in      our
#> 55.13499 42.22681 39.34995 31.43686 30.76141 26.37869 16.08336 13.97242
#>        a       we
#> 13.38024 13.21974
logTfDtm <- dfm_weight(dtm, type = "logfreq")
topfeatures(logTfDtm)
#>      the        ,       of      and        .       to       in        a
#> 182.1856 174.3182 173.3837 167.1782 164.9945 163.2151 150.4070 143.6032
#>      our     that
#> 140.7424 138.9939
tfidfDtm <- dfm_weight(dtm, type = "tfidf")
topfeatures(tfidfDtm)
#>            -      america        union            "       should constitution
#>     55.80272     52.68044     51.14846     48.02566     42.10689     40.21661
#>     congress      freedom          you      revenue
#>     39.13390     38.31822     35.99430     34.11779
# combine these methods for more complex dfm_weightings, e.g. as in Section 6.4
# of Introduction to Information Retrieval
head(tfidf(dtm, scheme_tf = "log"))
#> Document-feature matrix of: 6 documents, 9,357 features (93.8% sparse).
# apply numeric weights
str <- c("apple is better than banana", "banana banana apple much better")
(mydfm <- dfm(str, remove = stopwords("english")))
#> Document-feature matrix of: 2 documents, 4 features (12.5% sparse).
#> 2 x 4 sparse Matrix of class "dfm"
#>        features
#> docs    apple better banana much
#>   text1     1      1      1    0
#>   text2     1      1      2    1
dfm_weight(mydfm, weights = c(apple = 5, banana = 3, much = 0.5))
#> Document-feature matrix of: 2 documents, 4 features (12.5% sparse).
#> 2 x 4 sparse Matrix of class "dfm"
#>        features
#> docs    feat1 feat2 feat3 feat4
#>   text1     5     1     3   0
#>   text2     5     1     6   0.5
# smooth the dfm
dfm_smooth(mydfm, 0.5)
#> Document-feature matrix of: 2 documents, 4 features (0% sparse).
#> 2 x 4 sparse Matrix of class "dfm"
#>        features
#> docs    apple better banana much
#>   text1   1.5    1.5    1.5  0.5
#>   text2   1.5    1.5    2.5  1.5