Calculate "keyness", a score for features that occur differentially across different categories. Here, the categories are defined by reference to a "target" document index in the dfm, with the reference group consisting of all other documents.

textstat_keyness(x, target = 1L, measure = c("chi2", "exact", "lr", "pmi"),
  sort = TRUE, correction = c("default", "yates", "williams", "none"))

Arguments

x

a dfm containing the features to be examined for keyness

target

the document index (numeric, character or logical) identifying the document forming the "target" for computing keyness; all other documents' feature frequencies will be combined for use as a reference

measure

(signed) association measure to be used for computing keyness. Currently available: "chi2"; "exact" (Fisher's exact test); "lr" for the likelihood ratio; "pmi" for pointwise mutual information.

sort

logical; if TRUE sort features scored in descending order of the measure, otherwise leave in original feature order

correction

if "default", the Yates correction is applied to "chi2", the Williams correction is applied to "lr", and no correction is applied for the "exact" and "pmi" measures. Specifying a value other than "default" overrides these defaults, for instance to apply the Williams correction to the chi2 measure. Specifying a correction for the "exact" and "pmi" measures has no effect and produces a warning.
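The default corrections described above can be illustrated in base R for a single feature. This is a minimal sketch under stated assumptions, not the package's internal code: the counts are hypothetical, the Yates correction is taken from `chisq.test()`, and the Williams correction is written out by hand using the usual textbook formula for a 2 x 2 table.

```r
# Hypothetical counts for a single feature (not from any real corpus)
f_t <- 30    # feature count in the target document
f_r <- 10    # feature count in the combined reference documents
o_t <- 970   # all other tokens in the target
o_r <- 1990  # all other tokens in the reference

# Rows: feature / other tokens; columns: target / reference
tab <- matrix(c(f_t, o_t, f_r, o_r), nrow = 2,
              dimnames = list(c("feature", "other"),
                              c("target", "reference")))
N <- sum(tab)
expected <- outer(rowSums(tab), colSums(tab)) / N

# Sign: positive when the feature is over-represented in the target
sgn <- if (tab[1, 1] > expected[1, 1]) 1 else -1

# Chi-squared without and with the Yates continuity correction
chi2_none  <- unname(chisq.test(tab, correct = FALSE)$statistic)
chi2_yates <- unname(chisq.test(tab, correct = TRUE)$statistic)

# Likelihood ratio G2, then the Williams correction (divide by q)
G2_none <- 2 * sum(tab * log(tab / expected))
q <- 1 + ((N / sum(tab[1, ]) + N / sum(tab[2, ]) - 1) *
          (N / sum(tab[, 1]) + N / sum(tab[, 2]) - 1)) / (6 * N)
G2_williams <- G2_none / q

# Signed statistics, side by side
round(sgn * c(chi2_none = chi2_none, chi2_yates = chi2_yates,
              G2_none = G2_none, G2_williams = G2_williams), 3)
```

Both corrections shrink the statistic slightly; with counts this large the effect is small, and it becomes more noticeable for rare features.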

Value

a data.frame of computed statistics and associated p-values, with one row per scored feature, together with the number of occurrences of that feature in the target and reference groups. For measure = "chi2" the statistic is the chi-squared value, signed positively if the observed value in the target exceeds its expected value; for measure = "exact" it is the estimate of the odds ratio; for measure = "lr" it is the likelihood ratio \(G^2\) statistic; for measure = "pmi" it is the pointwise mutual information statistic.
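For the "exact" and "pmi" statistics, a base R sketch on one hypothetical 2 x 2 table may help fix ideas. The counts are invented, `fisher.test()` stands in for the exact test, and the PMI definition used here (log of observed over expected count in the target) is an assumption about the measure rather than a statement of the package's internal formula.

```r
# Hypothetical 2 x 2 table: rows = feature / all other tokens,
# columns = target / reference
tab <- matrix(c(30, 970, 10, 1990), nrow = 2,
              dimnames = list(c("feature", "other"),
                              c("target", "reference")))

# Fisher's exact test: conditional maximum likelihood estimate
# of the odds ratio, plus the associated p-value
ft <- fisher.test(tab)
odds_ratio <- unname(ft$estimate)
p_exact <- ft$p.value

# Pointwise mutual information: log of observed over expected count
# of the feature in the target
expected_11 <- sum(tab[1, ]) * sum(tab[, 1]) / sum(tab)
pmi <- log(tab[1, 1] / expected_11)

c(odds_ratio = odds_ratio, p = p_exact, pmi = pmi)
```

An odds ratio above 1 and a positive PMI both indicate a feature that is over-represented in the target.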

References

Bondi, Marina, and Mike Scott, eds. 2010. Keyness in Texts. Amsterdam, Philadelphia: John Benjamins.

Stubbs, Michael. 2010. "Three Concepts of Keywords". In Keyness in Texts, Marina Bondi and Mike Scott, eds. pp. 21–42. Amsterdam, Philadelphia: John Benjamins.

Scott, M. & Tribble, C. 2006. Textual Patterns: keyword and corpus analysis in language education. Amsterdam: Benjamins, p. 55.

Dunning, Ted. 1993. "Accurate Methods for the Statistics of Surprise and Coincidence", Computational Linguistics, Vol 19, No. 1, pp. 61-74.

Examples

# compare pre- v. post-war terms using grouping
period <- ifelse(docvars(data_corpus_inaugural, "Year") < 1945, "pre-war", "post-war")
mydfm <- dfm(data_corpus_inaugural, groups = period)
head(mydfm) # make sure 'post-war' is in the first row
#> Document-feature matrix of: 2 documents, 9,357 features (34.9% sparse).
head(result <- textstat_keyness(mydfm), 10)
#>      feature     chi2 p n_target n_reference
#>  1        we 702.0019 0      960         779
#>  2         - 395.1361 0      450         312
#>  3         . 226.6654 0     1804        3141
#>  4       our 187.8329 0      874        1307
#>  5        us 186.0132 0      262         216
#>  6         : 178.1283 0      105          29
#>  7   america 176.6114 0      130          54
#>  8     world 175.1354 0      188         123
#>  9 americans 150.6480 0       67           7
#> 10       new 141.3690 0      150          97
tail(result, 10)
#>           feature       chi2            p n_target n_reference
#> 9348         upon  -51.91189 5.804246e-13       39         332
#> 9349           it  -52.69911 3.888001e-13      257        1132
#> 9350       public  -55.99225 7.271961e-14       11         213
#> 9351       states  -59.13009 1.476597e-14       28         305
#> 9352 constitution  -61.16666 5.218048e-15        6         200
#> 9353           be  -72.21927 0.000000e+00      257        1224
#> 9354       should  -83.10741 0.000000e+00       15         309
#> 9355        which -160.14560 0.000000e+00       95         911
#> 9356           of -179.17847 0.000000e+00     1437        5666
#> 9357          the -299.85716 0.000000e+00     1988        8094
# compare pre- v. post-war terms using logical vector
mydfm2 <- dfm(data_corpus_inaugural)
textstat_keyness(mydfm2, docvars(data_corpus_inaugural, "Year") >= 1945)
#>          feature         chi2            p n_target n_reference
#>   1           we 7.020019e+02 0.000000e+00      960         779
#>   2            - 3.951361e+02 0.000000e+00      450         312
#>   3            . 2.266654e+02 0.000000e+00     1804        3141
#>   4          our 1.878329e+02 0.000000e+00      874        1307
#>   5           us 1.860132e+02 0.000000e+00      262         216
#>   6            : 1.781283e+02 0.000000e+00      105          29
#>   7      america 1.766114e+02 0.000000e+00      130          54
#>   8        world 1.751354e+02 0.000000e+00      188         123
#>   9    americans 1.506480e+02 0.000000e+00       67           7
#>  10          new 1.413690e+02 0.000000e+00      150          97
#>  11      freedom 1.373460e+02 0.000000e+00      121          64
#>  12        today 1.288729e+02 0.000000e+00       76          21
#>  13          let 1.115160e+02 0.000000e+00      100          54
#>  14     together 1.049803e+02 0.000000e+00       64          19
#>  15          you 1.020667e+02 0.000000e+00      116          80
#>  16    america's 9.406895e+01 0.000000e+00       35           0
#>  17         help 9.213749e+01 0.000000e+00       46           8
#>  18         work 9.077929e+01 0.000000e+00       78          40
#>  19            , 6.285815e+01 2.220446e-15     2194        4832
#>  20         know 6.043361e+01 7.660539e-15       63          40
#>  21        earth 6.032691e+01 7.993606e-15       44          18
#>  22      history 5.769898e+01 3.053113e-14       60          38
#>  23          god 5.417331e+01 1.835199e-13       55          34
#>  24         live 5.337023e+01 2.762235e-13       39          16
#>  25         time 5.273752e+01 3.812506e-13      106         110
#>  26           do 5.260523e+01 4.077849e-13      112         120
#>  27          are 5.087598e+01 9.838796e-13      311         503
#>  28      journey 4.930187e+01 2.194578e-12       20           1
#>  29      dignity 4.864118e+01 3.073430e-12       24           4
#>  30         what 4.847609e+01 3.343437e-12       88          86
#>  31      century 4.724865e+01 6.252887e-12       39          19
#>  32         will 4.721042e+01 6.376122e-12      339         572
#>  33      promise 4.641444e+01 9.570789e-12       34          14
#>  34           mr 4.493857e+01 2.033140e-11       25           6
#>  35        build 4.427129e+01 2.858780e-11       21           3
#>  36        thank 4.395178e+01 3.365652e-11       18           1
#>  37        bless 4.001006e+01 2.526580e-10       18           2
#>  38     strength 3.961129e+01 3.098864e-10       45          31
#>  39          for 3.955452e+01 3.190271e-10      421         776
#>  40   challenges 3.938908e+01 3.472338e-10       16           0
#>  41       cannot 3.914476e+01 3.935189e-10       36          20
#>  42        lives 3.798753e+01 7.119839e-10       32          16
#>  43       across 3.750459e+01 9.119797e-10       22           6
#>  44        dream 3.737566e+01 9.743051e-10       17           2
#>  45   generation 3.721599e+01 1.057433e-09       28          12
#>  46       moment 3.711190e+01 1.115412e-09       27          11
#>  47         must 3.708654e+01 1.130010e-09      151         215
#>  48        again 3.687593e+01 1.258907e-09       42          29
#>  49        young 3.583078e+01 2.152211e-09       19           4
#>  50        words 3.485943e+01 3.543892e-09       26          11
#>  51      weapons 3.402523e+01 5.440201e-09       14           0
#>  52      resolve 3.389994e+01 5.802054e-09       17           3
#>  53        faith 3.350180e+01 7.119804e-09       48          40
#>  54        heart 3.323775e+01 8.155199e-09       28          14
#>  55          day 3.314882e+01 8.536828e-09       45          36
#>  56       nation 3.278696e+01 1.028323e-08      123         170
#>  57      friends 3.263119e+01 1.114126e-08       25          11
#>  58     children 3.239506e+01 1.258057e-08       31          18
#>  59      because 3.217534e+01 1.408671e-08       59          58
#>  60       dreams 3.169219e+01 1.806472e-08       16           2
#>  61         your 3.144943e+01 2.047031e-08       61          62
#>  62         role 3.134467e+01 2.160514e-08       13           0
#>  63      peoples 3.086300e+01 2.769008e-08       26          13
#>  64      forward 3.042924e+01 3.462689e-08       24          11
#>  65        unity 3.010270e+01 4.097608e-08       21           8
#>  66      meaning 2.910796e+01 6.845506e-08       15           2
#>  67         vice 2.910796e+01 6.845506e-08       15           2
#>  68    communism 2.866529e+01 8.603180e-08       12           0
#>  69        begin 2.866529e+01 8.603180e-08       12           0
#>  70          yes 2.866529e+01 8.603180e-08       12           0
#>  71         this 2.861672e+01 8.821709e-08      295         540
#>  72         join 2.829580e+01 1.041215e-07       16           4
#>  73    president 2.817897e+01 1.105997e-07       46          42
#>  74      world's 2.783793e+01 1.319142e-07       18           6
#>  75          man 2.765403e+01 1.450692e-07       45          41
#>  76        story 2.737324e+01 1.677360e-07       13           1
#>  77       strong 2.685377e+01 2.194459e-07       32          23
#>  78     families 2.653669e+01 2.585804e-07       14           2
#>  79         back 2.611380e+01 3.218758e-07       22          11
#>  80      believe 2.608212e+01 3.272002e-07       40          35
#>  81         jobs 2.598743e+01 3.436470e-07       11           0
#>  82         seek 2.528140e+01 4.954634e-07       34          27
#>  83    celebrate 2.474832e+01 6.532558e-07       12           1
#>  84         when 2.470883e+01 6.667808e-07       90         123
#>  85        women 2.400582e+01 9.604510e-07       21          11
#>  86      senator 2.331152e+01 1.377720e-06       10           0
#>  87      nuclear 2.331152e+01 1.377720e-06       10           0
#>  88     remember 2.320349e+01 1.457329e-06       17           7
#>  89          ask 2.275731e+01 1.838035e-06       26          18
#>  90         free 2.227345e+01 2.364530e-06       78         105
#>  91       change 2.191881e+01 2.844328e-06       36          33
#>  92        peace 2.187335e+01 2.912502e-06      102         152
#>  93          old 2.157477e+01 3.402983e-06       39          38
#>  94      courage 2.145773e+01 3.617142e-06       23          15
#>  95    centuries 2.144407e+01 3.643011e-06       12           2
#>  96         make 2.126122e+01 4.007586e-06       64          81
#>  97         born 2.105920e+01 4.453103e-06       13           3
#>  98       pledge 2.104609e+01 4.483677e-06       22          14
#>  99         need 2.104505e+01 4.486098e-06       38          37
#> 100       values 2.096503e+01 4.677443e-06       16           7
#> 101    challenge 2.087761e+01 4.895839e-06       14           4
#> 102         hard 2.084299e+01 4.985136e-06       14           5
#> 103        third 2.063821e+01 5.547786e-06        9           0
#> 104         come 2.013522e+01 7.215571e-06       38          38
#> 105         hope 1.957330e+01 9.681281e-06       53          64
#> 106          who 1.942958e+01 1.043780e-05      138         232
#> 107     american 1.909708e+01 1.242349e-05       69          94
#> 108         here 1.894267e+01 1.347059e-05       39          41
#> 109       threat 1.893039e+01 1.355757e-05       11           2
#> 110      tyranny 1.893039e+01 1.355757e-05       11           2
#> 111           go 1.891167e+01 1.369125e-05       26          21
#> 112     nation's 1.876743e+01 1.476673e-05       15           7
#> 113      poverty 1.876743e+01 1.476673e-05       15           7
#> 114       answer 1.862999e+01 1.587044e-05       12           3
#> 115        renew 1.852495e+01 1.676947e-05       13           4
#> 116       voices 1.852495e+01 1.676947e-05       13           4
#> 117         each 1.803028e+01 2.174196e-05       55          70
#> 118        areas 1.796844e+01 2.245982e-05        8           0
#> 119         heal 1.796844e+01 2.245982e-05        8           0
#> 120       define 1.796844e+01 2.245982e-05        8           0
#> 121   compassion 1.796844e+01 2.245982e-05        8           0
#> 122      speaker 1.796844e+01 2.245982e-05        8           0
#> 123        quiet 1.796844e+01 2.245982e-05        8           0
#> 124        don't 1.796844e+01 2.245982e-05        8           0
#> 125        bring 1.791425e+01 2.310843e-05       24          19
#> 126    democracy 1.776756e+01 2.496034e-05       30          28
#> 127          but 1.762136e+01 2.695441e-05      225         429
#> 128        child 1.694883e+01 3.840093e-05        9           1
#> 129     founding 1.694883e+01 3.840093e-05        9           1
#> 130          age 1.688978e+01 3.961438e-05       15           8
#> 131         land 1.682924e+01 4.089821e-05       35          37
#> 132        creed 1.644570e+01 5.006356e-05       10           2
#> 133         bold 1.644570e+01 5.006356e-05       10           2
#> 134       simple 1.639733e+01 5.135748e-05       13           6
#> 135      achieve 1.639733e+01 5.135748e-05       13           6
#> 136      schools 1.631362e+01 5.367665e-05       13           5
#> 137       strive 1.631362e+01 5.367665e-05       13           5
#> 138       friend 1.623771e+01 5.587097e-05       11           3
#> 139      leaders 1.623771e+01 5.587097e-05       11           3
#> 140      learned 1.623771e+01 5.587097e-05       11           3
#> 141        man's 1.623771e+01 5.587097e-05       11           3
#> 142      program 1.623771e+01 5.587097e-05       11           3
#> 143       planet 1.530373e+01 9.153583e-05        7           0
#> 144         sick 1.530373e+01 9.153583e-05        7           0
#> 145 productivity 1.530373e+01 9.153583e-05        7           0
#> 146      decency 1.530373e+01 9.153583e-05        7           0
#> 147      decades 1.530373e+01 9.153583e-05        7           0
#> 148        goals 1.530373e+01 9.153583e-05        7           0
#> 149           \\ 1.530373e+01 9.153583e-05        7           0
#> 150            a 1.494622e+01 1.106196e-04      690        1556
#> 151    factories 1.438799e+01 1.487479e-04        8           1
#> 152     historic 1.438799e+01 1.487479e-04        8           1
#> 153          can 1.416408e+01 1.675387e-04      164         307
#> 154          way 1.411962e+01 1.715452e-04       37          44
#> 155     ceremony 1.399790e+01 1.830146e-04        9           2
#> 156       decent 1.399790e+01 1.830146e-04        9           2
#> 157     everyone 1.399790e+01 1.830146e-04        9           2
#> 158   commitment 1.399790e+01 1.830146e-04        9           2
#> 159         goal 1.399790e+01 1.830146e-04        9           2
#> 160       prayer 1.395366e+01 1.873730e-04       11           4
#> 161         move 1.395366e+01 1.873730e-04       11           4
#> 162        small 1.373881e+01 2.100682e-04       18          14
#> 163         lead 1.325109e+01 2.724220e-04       17          13
#> 164  generations 1.325109e+01 2.724220e-04       17          13
#> 165           is 1.320823e+01 2.787231e-04      458        1004
#> 166       fellow 1.316204e+01 2.856784e-04       45          60
#> 167         life 1.312379e+01 2.915696e-04       56          81
#> 168        human 1.294773e+01 3.203090e-04       41          53
#> 169      instead 1.285562e+01 3.364677e-04       13           8
#> 170       misery 1.264658e+01 3.762541e-04        6           0
#> 171       hunger 1.264658e+01 3.762541e-04        6           0
#> 172        quest 1.264658e+01 3.762541e-04        6           0
#> 173        night 1.264658e+01 3.762541e-04        6           0
#> 174      deepest 1.264658e+01 3.762541e-04        6           0
#> 175  celebration 1.264658e+01 3.762541e-04        6           0
#> 176     civility 1.264658e+01 3.762541e-04        6           0
#> 177      choices 1.264658e+01 3.762541e-04        6           0
#> 178       ensure 1.264658e+01 3.762541e-04        6           0
#> 179  adversaries 1.264658e+01 3.762541e-04        6           0
#> 180     timeless 1.264658e+01 3.762541e-04        6           0
#> 181         bush 1.264658e+01 3.762541e-04        6           0
#> 182      workers 1.264658e+01 3.762541e-04        6           0
#> 183      streets 1.264658e+01 3.762541e-04        6           0
#> 184       budget 1.264658e+01 3.762541e-04        6           0
#> 185      enemies 1.231184e+01 4.500942e-04       12           6
#> 186    sacrifice 1.229513e+01 4.541406e-04       15          11
#> 187          end 1.218465e+01 4.818428e-04       23          23
#> 188       better 1.193654e+01 5.504353e-04       34          42
#> 189     security 1.191654e+01 5.563759e-04       30          35
#> 190      renewal 1.185881e+01 5.738933e-04        7           1
#> 191        basic 1.185881e+01 5.738933e-04        7           1
#> 192        learn 1.175415e+01 6.070759e-04       10           4
#> 193        ideas 1.160081e+01 6.592313e-04        9           3
#> 194          try 1.160081e+01 6.592313e-04        9           3
#> 195        truly 1.160081e+01 6.592313e-04        9           3
#> 196      welcome 1.160081e+01 6.592313e-04        9           3
#> 197         play 1.160081e+01 6.592313e-04        9           3
#> 198        shape 1.159805e+01 6.602100e-04        8           2
#> 199       heroes 1.159805e+01 6.602100e-04        8           2
#> 200      working 1.159805e+01 6.602100e-04        8           2
#>  [ reached getOption("max.print") -- omitted 9157 rows ]
# compare Trump 2017 to other post-war presidents
pwdfm <- dfm(corpus_subset(data_corpus_inaugural, period == "post-war"))
head(textstat_keyness(pwdfm, target = "2017-Trump"), 10)
#>         feature     chi2            p n_target n_reference
#>  1    protected 76.64466 0.000000e+00        5           1
#>  2         will 51.44795 7.351897e-13       40         299
#>  3        while 48.23022 3.790079e-12        6           7
#>  4        obama 47.85727 4.584000e-12        3           0
#>  5        we've 47.85727 4.584000e-12        3           0
#>  6      america 31.45537 2.040775e-08       18         112
#>  7        again 27.81145 1.337322e-07        9          33
#>  8     everyone 27.67876 1.432269e-07        4           5
#>  9         your 26.67898 2.402201e-07       11          50
#> 10 transferring 25.54569 4.320292e-07        2           0
# using the likelihood ratio method
head(textstat_keyness(dfm_smooth(pwdfm), measure = "lr", target = "2017-Trump"), 10)
#>      feature        G2            p n_target n_reference
#>  1      will 24.604106 7.040156e-07       41         317
#>  2   america 14.040255 1.789387e-04       19         130
#>  3      your 10.435140 1.236402e-03       12          68
#>  4     again  9.758516 1.784939e-03       10          51
#>  5     while  9.504990 2.049139e-03        7          25
#>  6  american  8.877690 2.886766e-03       12          76
#>  7 protected  8.820562 2.978550e-03        6          19
#>  8      back  6.853526 8.846653e-03        7          34
#>  9       you  6.713202 9.570175e-03       14         121
#> 10   country  5.821599 1.583055e-02       10          72
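
The way the reference group is formed, by pooling every document other than the target, can be mimicked in base R on a toy count matrix. This is only a rough sketch under stated assumptions: the matrix `counts`, its document and feature names, and the helper `signed_chi2()` are all hypothetical, and the uncorrected chi-squared from `chisq.test()` is used for simplicity rather than the package's own computation.

```r
# Hypothetical 3-document count matrix (rows = documents, columns = features)
counts <- matrix(c(5, 1, 0,
                   1, 4, 2,
                   0, 3, 6),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("d1", "d2", "d3"),
                                 c("alpha", "beta", "gamma")))

# Target document versus all other documents combined
target <- "d1"
n_target <- counts[target, ]
n_reference <- colSums(counts[rownames(counts) != target, , drop = FALSE])
N_t <- sum(n_target)
N_r <- sum(n_reference)

# Hypothetical helper: signed, uncorrected chi-squared for one feature
signed_chi2 <- function(f_t, f_r) {
  tab <- matrix(c(f_t, N_t - f_t, f_r, N_r - f_r), nrow = 2)
  expected_t <- (f_t + f_r) * N_t / (N_t + N_r)
  # Small toy counts trigger an approximation warning; silence it here
  stat <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)
  unname(if (f_t > expected_t) stat else -stat)
}

res <- data.frame(feature = names(n_target),
                  chi2 = unname(mapply(signed_chi2, n_target, n_reference)),
                  n_target = unname(n_target),
                  n_reference = unname(n_reference),
                  stringsAsFactors = FALSE)

# Sort in descending order of the measure, as with sort = TRUE
res <- res[order(res$chi2, decreasing = TRUE), ]
rownames(res) <- NULL
res
```

Features over-represented in the target sort to the top with positive scores, and features under-represented in it sort to the bottom with negative scores, which is the pattern visible in the head() and tail() output above.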