Produces counts and document frequency summaries of the features in a dfm, optionally grouped by a docvars variable or another supplied grouping variable.

textstat_frequency(x, n = NULL, groups = NULL)

Arguments

x

a dfm object

n

(optional) integer specifying the top n features to be returned, within group if groups is specified

groups

either: a character vector containing the names of document variables to be used for grouping; or a factor or object that can be coerced into a factor equal in length or rows to the number of documents. See groups for details.

Value

a data.frame containing the following variables:

feature

(character) the feature

frequency

count of the feature

rank

rank of the feature, where 1 indicates the greatest frequency

docfreq

document frequency of the feature, as a count (the number of documents in which this feature occurred at least once)

group

(only if groups is specified) the label of the group. If the features have been grouped, then all counts, ranks, and document frequencies are within group. If groups is not specified, the group column is omitted from the returned data.frame.
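
To make the returned columns concrete, the following base R sketch reproduces them for a small count matrix. It is not quanteda's implementation: `freq_summary()` is a hypothetical helper, the matrix stands in for a dfm, and breaking rank ties by order of appearance is an assumption chosen to match the printed example output.

```r
# Toy document-feature counts for the documents
# "a a b b c d", "a d d d" and "a a a" (a plain matrix standing in for a dfm).
counts <- rbind(
  text1 = c(a = 2, b = 2, c = 1, d = 1),
  text2 = c(a = 1, b = 0, c = 0, d = 3),
  text3 = c(a = 3, b = 0, c = 0, d = 0)
)

# Hypothetical helper (not part of quanteda): summarise one count matrix.
freq_summary <- function(m) {
  out <- data.frame(
    feature   = colnames(m),
    frequency = unname(colSums(m)),      # total count of each feature
    docfreq   = unname(colSums(m > 0)),  # documents containing the feature
    stringsAsFactors = FALSE
  )
  out <- out[out$frequency > 0, ]        # drop features absent from these documents
  out <- out[order(-out$frequency), ]    # most frequent first; ties keep input order
  # Assumption: ties are ranked in order of appearance.
  out$rank <- rank(-out$frequency, ties.method = "first")
  rownames(out) <- NULL
  out[, c("feature", "frequency", "rank", "docfreq")]
}

# Ungrouped summary over all documents.
ungrouped <- freq_summary(counts)

# Grouped summary: counts, ranks and document frequencies are computed within group.
groups <- c("one", "two", "one")
grouped <- do.call(rbind, lapply(split(seq_len(nrow(counts)), groups), function(idx) {
  res <- freq_summary(counts[idx, , drop = FALSE])
  res$group <- groups[idx[1]]
  res
}))
rownames(grouped) <- NULL

print(ungrouped)
print(grouped)
```

Limiting each group to its top n features, as the n argument does, would amount to calling `head(res, n)` inside the per-group function in this sketch.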

Examples

dfm1 <- dfm(c("a a b b c d", "a d d d", "a a a"))
textstat_frequency(dfm1)
#>   feature frequency rank docfreq
#> 1       a         6    1       3
#> 2       d         4    2       2
#> 3       b         2    3       1
#> 4       c         1    4       1
textstat_frequency(dfm1, groups = c("one", "two", "one"))
#>   feature frequency rank docfreq group
#> 1       a         5    1       2   one
#> 2       b         2    2       1   one
#> 3       c         1    3       1   one
#> 4       d         1    4       1   one
#> 5       d         3    1       1   two
#> 6       a         1    2       1   two
obamadfm <- corpus_subset(data_corpus_inaugural, President == "Obama") %>%
    dfm(remove_punct = TRUE, remove = stopwords("english"))
freq <- textstat_frequency(obamadfm)
head(freq, 10)
#>    feature frequency rank docfreq
#> 1       us        44    1       2
#> 2     must        25    2       2
#> 3      can        20    3       2
#> 4   nation        18    4       2
#> 5   people        18    5       2
#> 6      new        17    6       2
#> 7     time        16    7       2
#> 8    every        15    8       2
#> 9  america        14    9       2
#> 10     now        11   10       2
# plot 20 most frequent words
library("ggplot2")
ggplot(freq[1:20, ], aes(x = reorder(feature, frequency), y = frequency)) +
    geom_point() +
    coord_flip() +
    labs(x = NULL, y = "Frequency")
# plot relative frequencies by group
dfm_weight_pres <- data_corpus_inaugural %>%
    corpus_subset(Year > 2000) %>%
    dfm(remove = stopwords("english"), remove_punct = TRUE) %>%
    dfm_weight(type = "relfreq")

# calculate relative frequency by president
freq_weight <- textstat_frequency(dfm_weight_pres, n = 10, groups = "President")

# plot frequencies
ggplot(data = freq_weight, aes(x = nrow(freq_weight):1, y = frequency)) +
    geom_point() +
    facet_wrap(~ group, scales = "free") +
    coord_flip() +
    scale_x_continuous(breaks = nrow(freq_weight):1,
                       labels = freq_weight$feature) +
    labs(x = NULL, y = "Relative frequency")