Returns a document-feature matrix reduced in size based on document and term frequency, usually in terms of minimum frequencies, but optionally also in terms of maximum frequencies. Setting a combination of minimum and maximum frequencies will select features based on a range.

dfm_trim(x, min_count = 1, min_docfreq = 1, max_count = NULL,
  max_docfreq = NULL, sparsity = NULL,
  verbose = quanteda_options("verbose"))

Arguments

x

a dfm object

min_count, max_count

minimum/maximum count or percentile frequency of features across all documents, below/above which features will be removed

min_docfreq, max_docfreq

minimum/maximum number or fraction of documents in which a feature appears, below/above which features will be removed

sparsity

equivalent to 1 - min_docfreq, included for comparison with tm

verbose

print messages
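
The count and document-frequency thresholds described above can be sketched in base R on a plain matrix. This is only an illustration of the selection logic, not quanteda's implementation: `trim_sketch` is a hypothetical helper, it uses `Inf` rather than `NULL` for "no maximum", it assumes that a document-frequency value below 1 is a fraction of the number of documents, and it does not handle the percentile form of `min_count`/`max_count`.

```r
# Toy document-feature matrix as a plain base R matrix:
# rows are documents, columns are features, cells are counts.
m <- matrix(
  c(6, 0, 1, 2,
    5, 0, 0, 3,
    4, 1, 0, 0,
    0, 0, 0, 4,
    0, 0, 0, 1),
  nrow = 5, byrow = TRUE,
  dimnames = list(paste0("doc", 1:5), c("the", "rare", "once", "common"))
)

# Hypothetical helper, not part of quanteda.
trim_sketch <- function(m, min_count = 1, min_docfreq = 1,
                        max_count = Inf, max_docfreq = Inf) {
  # Assumption: a value below 1 is a fraction of the number of documents.
  as_doc_count <- function(v) if (v < 1) v * nrow(m) else v

  total_count <- colSums(m)      # occurrences of each feature across all documents
  doc_freq    <- colSums(m > 0)  # number of documents containing each feature

  keep <- total_count >= min_count & total_count <= max_count &
    doc_freq >= as_doc_count(min_docfreq) &
    doc_freq <= as_doc_count(max_docfreq)

  # Only columns (features) are dropped; every document row is kept.
  m[, keep, drop = FALSE]
}

colnames(trim_sketch(m, min_count = 10, min_docfreq = 2))    # "the" "common"
colnames(trim_sketch(m, min_count = 10, min_docfreq = 0.4))  # same: 0.4 * 5 documents = 2
colnames(trim_sketch(m, max_count = 10, max_docfreq = 2))    # "rare" "once"
```

Combining a minimum and a maximum in one call selects a range, because all four conditions are joined with a logical AND.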

Value

A dfm reduced in features (with the same number of documents)

Note

Trimming a dfm object is an operation based on the values in the document-feature matrix. To select subsets of a dfm based on the features themselves (meaning the feature labels from featnames), such as those matching a regular expression, or to remove features matching a stopword list, use dfm_select.
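
The distinction between the two kinds of operation can be illustrated with a plain base R matrix. This is a minimal sketch of the idea only: it does not call dfm_trim or dfm_select, and the matrix, the regular expression, and the `stopwords_sketch` vector are made up for illustration.

```r
# Toy matrix: rows are documents, columns are features, cells are counts.
m <- matrix(
  c(3, 0, 2,
    1, 4, 0),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("doc1", "doc2"), c("the", "tax", "taxes"))
)

# Value-based (the kind of operation trimming performs):
# keep features whose total count is at least 3.
by_value <- m[, colSums(m) >= 3, drop = FALSE]

# Label-based (the kind of operation selection performs):
# keep features whose name matches a regular expression.
by_label <- m[, grepl("^tax", colnames(m)), drop = FALSE]

# Label-based removal against a small stand-in stopword list.
stopwords_sketch <- c("the", "a", "of")
no_stop <- m[, !colnames(m) %in% stopwords_sketch, drop = FALSE]

colnames(by_value)  # "the" "tax"
colnames(by_label)  # "tax" "taxes"
colnames(no_stop)   # "tax" "taxes"
```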

See also

dfm_select, dfm_sample

Examples

(myDfm <- dfm(data_corpus_inaugural[1:5]))
#> Document-feature matrix of: 5 documents, 1,948 features (69.5% sparse).
# keep only words occurring >=10 times and in >=2 documents
dfm_trim(myDfm, min_count = 10, min_docfreq = 2)
#> Document-feature matrix of: 5 documents, 107 features (16.3% sparse).
# keep only words occurring >=10 times and in at least 0.4 of the documents
dfm_trim(myDfm, min_count = 10, min_docfreq = 0.4)
#> Document-feature matrix of: 5 documents, 107 features (16.3% sparse).
# keep only words occurring <=10 times and in <=2 documents
dfm_trim(myDfm, max_count = 10, max_docfreq = 2)
#> Document-feature matrix of: 5 documents, 1,675 features (76.5% sparse).
# keep only words occurring <=10 times and in at most 3/4 of the documents
dfm_trim(myDfm, max_count = 10, max_docfreq = 0.75)
#> Document-feature matrix of: 5 documents, 1,799 features (74% sparse).
# keep only words occurring frequently (top 20%) and in <=2 documents
dfm_trim(myDfm, min_count = 0.8, max_docfreq = 2)
#> Document-feature matrix of: 5 documents, 150 features (63.6% sparse).
# keep only words occurring 5 times in 1000, and in 2 of 5 documents
dfm_trim(myDfm, min_docfreq = 0.4, min_count = 0.005)
#> Document-feature matrix of: 5 documents, 569 features (44.2% sparse).
# NOT RUN {
# compare to removeSparseTerms from the tm package
(myDfmTM <- convert(myDfm, "tm"))
tm::removeSparseTerms(myDfmTM, 0.7)
dfm_trim(myDfm, min_docfreq = 0.3)
dfm_trim(myDfm, sparsity = 0.7)
# }
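
The stated relation between sparsity and min_docfreq (sparsity is equivalent to 1 - min_docfreq) can be checked by hand on a toy matrix in base R. This sketch only illustrates the arithmetic. It does not call quanteda or tm, and whether features lying exactly on a threshold are kept is an assumption here, so the toy data avoids ties.

```r
# Toy matrix: 5 documents, 3 features; cells are counts.
m <- matrix(
  c(1, 1, 1,
    1, 1, 0,
    1, 0, 0,
    1, 0, 0,
    0, 0, 0),
  nrow = 5, byrow = TRUE,
  dimnames = list(paste0("doc", 1:5), c("a", "b", "c"))
)

# Fraction of documents that contain each feature: a = 0.8, b = 0.4, c = 0.2
doc_fraction <- colSums(m > 0) / nrow(m)

# Fraction of documents in which each feature is absent: a = 0.2, b = 0.6, c = 0.8
sparsity <- colSums(m == 0) / nrow(m)

# A document-frequency floor of 0.3 and a sparsity ceiling of 0.7
# select the same features on this data.
keep_by_docfreq  <- names(doc_fraction)[doc_fraction >= 0.3]
keep_by_sparsity <- names(sparsity)[sparsity <= 0.7]

keep_by_docfreq   # "a" "b"
keep_by_sparsity  # "a" "b"
```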