Weight the feature frequencies in a dfm

dfm_weight(x, scheme = c("count", "prop", "propmax", "logcount", "boolean",
"augmented", "logave"), weights = NULL, base = 10, K = 0.5)

dfm_smooth(x, smoothing = 1)

## Arguments

x: document-feature matrix created by dfm

scheme: a label of the weight type:

- "count": $$tf_{ij}$$, an integer feature count (default when a dfm is created)
- "prop": the proportion of the feature counts of total feature counts (aka relative frequency), calculated as $$tf_{ij} / \sum_j tf_{ij}$$
- "propmax": the proportion of the feature counts of the highest feature count in a document, $$tf_{ij} / \textrm{max}_j tf_{ij}$$
- "logcount": take the logarithm of 1 + each count, for the given base: $$\textrm{log}_{base}(1 + tf_{ij})$$
- "boolean": recode all non-zero counts as 1
- "augmented": equivalent to $$K + (1 - K) *$$ dfm_weight(x, "propmax")
- "logave": (1 + the log of the counts) divided by (1 + the log of the average count within the document), or $$\frac{1 + \textrm{log}_{base} tf_{ij}}{1 + \textrm{log}_{base}(\sum_j tf_{ij} / N_i)}$$

weights: if scheme is unused, then weights can be a named numeric vector of weights to be applied to the dfm, where the names of the vector correspond to feature labels of the dfm; the weights are applied as multipliers to the existing feature counts for the corresponding named features. Any features not named will be assigned a weight of 1.0 (meaning they will be unchanged).

base: base for the logarithm when scheme is "logcount" or "logave"

K: the K for the augmentation when scheme = "augmented"

smoothing: constant added to the dfm cells for smoothing; the default is 1
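As a sanity check on the formulas above, the simpler schemes can be reproduced on a toy count matrix with base R alone (an illustrative sketch only; quanteda applies the same arithmetic to the sparse dfm):

```r
# toy term-frequency matrix: 2 documents, 3 features
tf <- matrix(c(1, 0, 2,
               3, 1, 0),
             nrow = 2, byrow = TRUE,
             dimnames = list(c("d1", "d2"), c("a", "b", "c")))

prop    <- tf / rowSums(tf)           # "prop":    tf_ij / sum_j tf_ij
propmax <- tf / apply(tf, 1, max)     # "propmax": tf_ij / max_j tf_ij
logcnt  <- log(1 + tf, base = 10)     # "logcount" with the default base 10
bool    <- (tf > 0) * 1               # "boolean": non-zero counts become 1
aug     <- 0.5 + (1 - 0.5) * propmax  # "augmented" with the default K = 0.5
```

Each row of `prop` sums to 1, each row of `propmax` has a maximum of 1, and `aug` is bounded between K and 1, matching the definitions above.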

## Value

dfm_weight returns the dfm with weighted values. Note that because the default weighting scheme is "count", simply calling this function on an unweighted dfm will return the same object. Many users will want the normalized dfm consisting of the proportions of the feature counts within each document, which requires setting scheme = "prop".

dfm_smooth returns a dfm whose values have been smoothed by adding the smoothing amount. Note that this effectively converts the matrix from sparse to dense format, so it may exceed memory limits depending on the size of your input matrix.
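To see why smoothing densifies the matrix, consider a plain base-R sketch (illustrative only, not quanteda's internals): the constant is added to every cell, including the zeros, so no sparse entries remain:

```r
# toy matrix with 3 of 6 cells equal to zero (50% sparse)
m <- matrix(c(1, 0, 0,
              0, 2, 1),
            nrow = 2, byrow = TRUE)

mean(m == 0)   # share of zero cells before smoothing: 0.5
sm <- m + 0.5  # what dfm_smooth(x, 0.5) does cell-wise
mean(sm == 0)  # 0: every former zero now holds 0.5, so the matrix is dense
```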

## References

Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Vol. 1. Cambridge: Cambridge University Press, 2008.

## See also

dfm_tfidf, docfreq

## Examples

my_dfm <- dfm(data_corpus_inaugural)

# compute the "propmax" weights manually, for comparison
x <- apply(my_dfm, 1, function(tf) tf / max(tf))

topfeatures(my_dfm)
#>   the    of     ,   and     .    to    in     a   our  that
#> 10082  7103  7026  5310  4945  4526  2785  2246  2181  1789

norm_dfm <- dfm_weight(my_dfm, "prop")
topfeatures(norm_dfm)
#>       the         ,        of       and         .        to        in       our
#> 3.7910332 2.7639649 2.6821863 2.0782035 1.9594539 1.7643366 1.0695645 0.8731637
#>         a        we
#> 0.8593092 0.7726443

# the default scheme is "count", so the dfm is returned unchanged
max_tf_dfm <- dfm_weight(my_dfm)
topfeatures(max_tf_dfm)
#>   the    of     ,   and     .    to    in     a   our  that
#> 10082  7103  7026  5310  4945  4526  2785  2246  2181  1789

log_tf_dfm <- dfm_weight(my_dfm, scheme = "logcount")
topfeatures(log_tf_dfm)
#>      the        ,       of      and        .       to       in        a
#> 182.1856 174.3182 173.3837 167.1782 164.9945 163.2151 150.4070 143.6032
#>      our     that
#> 140.7424 138.9939

log_ave_dfm <- dfm_weight(my_dfm, scheme = "logave")
topfeatures(log_ave_dfm)
#>       the         ,        of       and         .        to        in         a
#> 121.98599 116.64229 116.06902 111.83340 110.34098 109.25338 100.55961  95.75088
#>       our      that
#>  93.81347  92.88474
# combine these methods for more complex weightings, e.g. as in Section 6.4
# of Introduction to Information Retrieval
head(dfm_tfidf(my_dfm, scheme_tf = "logcount"))
#> Document-feature matrix of: 6 documents, 9,357 features (93.8% sparse).
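The combination can also be written out by hand. The following base-R sketch follows the standard tf-idf recipe with a logcount tf weight (a simplified illustration only, not quanteda's exact dfm_tfidf implementation, which has its own document-frequency defaults):

```r
# toy term-frequency matrix: 2 documents, 3 features
tf <- matrix(c(1, 0, 2,
               3, 1, 0),
             nrow = 2, byrow = TRUE,
             dimnames = list(c("d1", "d2"), c("a", "b", "c")))

ltf   <- log10(1 + tf)            # tf weighted as in scheme_tf = "logcount"
df    <- colSums(tf > 0)          # document frequency of each feature
idf   <- log10(nrow(tf) / df)     # inverse document frequency
tfidf <- sweep(ltf, 2, idf, `*`)  # multiply each column by its idf
```

Feature "a" occurs in every document, so its idf is 0 and its tf-idf weight vanishes: tf-idf rewards features that are frequent within a document but rare across documents.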
# apply numeric weights
str <- c("apple is better than banana", "banana banana apple much better")
(my_dfm <- dfm(str, remove = stopwords("english")))
#> Document-feature matrix of: 2 documents, 4 features (12.5% sparse).
#> 2 x 4 sparse Matrix of class "dfm"
#>        features
#> docs    apple better banana much
#>   text1     1      1      1    0
#>   text2     1      1      2    1

dfm_weight(my_dfm, weights = c(apple = 5, banana = 3, much = 0.5))
#> Document-feature matrix of: 2 documents, 4 features (12.5% sparse).
#> 2 x 4 sparse Matrix of class "dfm"
#>        features
#> docs    apple better banana much
#>   text1     5      1      3  0
#>   text2     5      1      6  0.5
# smooth the dfm
dfm_smooth(my_dfm, 0.5)
#> Document-feature matrix of: 2 documents, 4 features (0% sparse).
#> 2 x 4 sparse Matrix of class "dfm"
#>        features
#> docs    apple better banana much
#>   text1   1.5    1.5    1.5  0.5
#>   text2   1.5    1.5    2.5  1.5