Substitute token types based on vectorized one-to-one matching. Because this function is designed for lemmatization or user-defined stemming, it supports neither multi-word features nor glob or regex patterns. For substitutions involving more complex patterns, use tokens_lookup with exclusive = FALSE.

tokens_replace(x, pattern, replacement = NULL, case_insensitive = TRUE,
  verbose = quanteda_options("verbose"))

Arguments

x

tokens object whose token elements will be replaced

pattern

a character vector or dictionary. See pattern for more details.

replacement

if pattern is a character vector, then replacement must be a character vector of equal length, for a 1:1 match. If pattern is a dictionary, then replacement should not be used.

case_insensitive

ignore case when matching, if TRUE

verbose

print status messages if TRUE

Examples

toks <- tokens(data_corpus_irishbudget2010)

# lemmatization
infle <- c("foci", "focus", "focused", "focuses", "focusing", "focussed", "focusses")
lemma <- rep("focus", length(infle))
toks2 <- tokens_replace(toks, infle, lemma)
kwic(toks2, "focus*")
#>
#> [2010_BUDGET_01_Brian_Lenihan_FF, 1092] . A key feature and |
#> [2010_BUDGET_01_Brian_Lenihan_FF, 5477] , our investment projects will |
#> [2010_BUDGET_02_Richard_Bruton_FG, 2133] budget and see that the |
#> [2010_BUDGET_03_Joan_Burton_LAB, 927] therefore, be the main |
#> [2010_BUDGET_03_Joan_Burton_LAB, 3592] the budget had just one |
#> [2010_BUDGET_03_Joan_Burton_LAB, 4115] county, however, the |
#> [2010_BUDGET_03_Joan_Burton_LAB, 4997] That is too narrow a |
#> [2010_BUDGET_03_Joan_Burton_LAB, 5210] economic revival that has a |
#> [2010_BUDGET_04_Arthur_Morgan_SF, 3141] new jobs. Instead the |
#> [2010_BUDGET_04_Arthur_Morgan_SF, 3721] what should be the main |
#> [2010_BUDGET_04_Arthur_Morgan_SF, 6796] must be completely redrawn to |
#> [2010_BUDGET_05_Brian_Cowen_FF, 3114] . The scheme will also |
#> [2010_BUDGET_05_Brian_Cowen_FF, 3786] to maximise the efficiency and |
#> [2010_BUDGET_05_Brian_Cowen_FF, 4466] place, with a particular |
#> [2010_BUDGET_07_Kieran_ODonnell_FG, 1953] coherent plan which should be |
#> [2010_BUDGET_08_Eamon_Gilmore_LAB, 2628] " More recent studies, |
#>
#> focus | of today's budget is regaining
#> focus | on labour-intensive areas such as
#> focus | has been on the front
#> focus | of policy. The Labour
#> focus | and that was just too
#> focus | of the feature is not
#> focus | . There is a character
#> focus | other than the dream of
#> focus | was on rates of pay
#> focus | of economic recovery, which
#> focus | on the more labour intensive
#> focus | on providing information, via
#> focus | of our investment and ensure
#> focus | on some of the worst
#> focus | on jobs. The Taoiseach
#> focus | on country cases, provide

# stemming
type <- types(toks)
stem <- char_wordstem(type, "porter")
toks3 <- tokens_replace(toks, type, stem, case_insensitive = FALSE)
identical(toks3, tokens_wordstem(toks, "porter"))
#> [1] TRUE
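
The one-to-one matching described above can be illustrated on a plain character vector using only base R. This is a rough sketch of the idea, not quanteda's actual implementation: `replace_types` is a hypothetical helper introduced here for illustration, and it assumes that each token is compared as a whole string against `pattern` (no glob or regex expansion) and that unmatched tokens are left unchanged.

```r
# Hypothetical helper (not part of quanteda): whole-string, 1:1 replacement
# of tokens on a plain character vector, mimicking the documented arguments.
replace_types <- function(tokens, pattern, replacement, case_insensitive = TRUE) {
  # pattern and replacement must pair up element by element
  stopifnot(length(pattern) == length(replacement))

  # optionally fold case before matching; the original tokens are untouched
  key <- if (case_insensitive) tolower(tokens) else tokens
  pat <- if (case_insensitive) tolower(pattern) else pattern

  # match() returns, for each token, the index of its pattern (or NA)
  idx <- match(key, pat)
  hit <- !is.na(idx)

  out <- tokens
  out[hit] <- replacement[idx[hit]]  # unmatched tokens stay as they were
  out
}

result <- replace_types(
  c("The", "foci", "Focused", "budget"),
  pattern     = c("foci", "focused"),
  replacement = c("focus", "focus")
)
print(result)
```

With `case_insensitive = FALSE`, "Focused" in the example would not match the pattern "focused" and would be returned unchanged.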