These functions select or discard tokens from a tokens object. For convenience, tokens_remove and tokens_keep are defined as shortcuts for tokens_select(x, pattern, selection = "remove") and tokens_select(x, pattern, selection = "keep"), respectively. The most common use of tokens_remove is to eliminate stop words from a text or text-based object, while the most common use of tokens_select is to keep only the tokens that match a set of patterns, such as a list of regular expressions or a dictionary.
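As a minimal sketch of the shortcut relationship described above (the document names d1 and d2 and the sample sentences are made up for illustration), the short and long forms should give the same tokens:

```r
library(quanteda)

# Illustrative input; the document names are hypothetical
toks <- tokens(c(d1 = "The quick brown fox", d2 = "A lazy dog sleeps"))

# tokens_remove() is shorthand for selection = "remove"
removed_short <- tokens_remove(toks, c("the", "a"))
removed_long  <- tokens_select(toks, c("the", "a"), selection = "remove")

# tokens_keep() is shorthand for selection = "keep"
kept_short <- tokens_keep(toks, c("quick", "lazy"))
kept_long  <- tokens_select(toks, c("quick", "lazy"), selection = "keep")

# Compare the underlying token lists of each pair
same_removed <- identical(as.list(removed_short), as.list(removed_long))
same_kept    <- identical(as.list(kept_short), as.list(kept_long))
print(c(same_removed, same_kept))
```

Because the shortcuts fix selection themselves, passing selection to tokens_remove or tokens_keep is not allowed.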

tokens_select(x, pattern, selection = c("keep", "remove"),
  valuetype = c("glob", "regex", "fixed"), case_insensitive = TRUE,
  padding = FALSE, window = 0, min_nchar = 1L, max_nchar = 79L,
  verbose = quanteda_options("verbose"))

tokens_remove(x, ...)

tokens_keep(x, ...)

Arguments

x

tokens object whose token elements will be removed or kept

pattern

a character vector, list of character vectors, dictionary, collocations, or dfm. See pattern for details.
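For instance, a dictionary can be supplied directly as pattern. The sketch below assumes a small made-up dictionary with hypothetical keys animal and color; all of its values are used as patterns, and the keys themselves are not matched:

```r
library(quanteda)

# Hypothetical dictionary; "dog*" is a glob pattern (the default valuetype)
dict <- dictionary(list(animal = c("fox", "dog*"), color = "brown"))

# Build a tokens object directly from a list to keep the input explicit
toks <- as.tokens(list(d1 = c("The", "brown", "fox", "chased", "two", "dogs")))

# Keep only tokens matching any dictionary value
kept <- tokens_keep(toks, dict)
print(kept)
```

Here only "brown", "fox", and "dogs" should survive, in their original order.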

selection

whether to "keep" or "remove" the tokens matching pattern

valuetype

the type of pattern matching: "glob" for "glob"-style wildcard expressions; "regex" for regular expressions; or "fixed" for exact matching. See valuetype for details.
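The three matching types can behave quite differently on the same input. The following sketch uses a made-up token vector to contrast them; note that glob and fixed patterns match whole tokens, while a regular expression matches anywhere in a token unless it is anchored:

```r
library(quanteda)

# Illustrative tokens, built from a list to avoid tokenizer effects
toks <- as.tokens(list(d1 = c("run", "runs", "rerun", "r.n")))

# glob: "*" is a wildcard and the pattern must match the whole token
by_glob <- tokens_select(toks, "run*", valuetype = "glob")

# regex: anchored here with ^ and $; "." matches any single character
by_regex <- tokens_select(toks, "^r.n$", valuetype = "regex")

# fixed: exact string comparison, so "." is a literal period
by_fixed <- tokens_select(toks, "r.n", valuetype = "fixed")

print(as.list(by_glob)$d1)   # "run" "runs"
print(as.list(by_regex)$d1)  # "run" "r.n"
print(as.list(by_fixed)$d1)  # "r.n"
```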

case_insensitive

ignore case when matching, if TRUE

padding

if TRUE, leave an empty string where the removed tokens previously existed. This is useful if a positional match is needed between the pre- and post-selected tokens, for instance if a window of adjacency needs to be computed.

window

integer of length 1 or 2; the size of the window of tokens adjacent to pattern that will be selected. The window is symmetric unless a vector of two elements is supplied, in which case the first element will be the token length of the window before pattern, and the second will be the token length of the window after pattern. The default is 0, meaning that only the pattern matched token(s) are selected, with no adjacent terms.

Terms from overlapping windows are never double-counted, but simply returned in the pattern match. This is because tokens_select never redefines the document units; for this, see kwic.
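A small sketch of the window behavior, using a made-up token sequence in which two matches of "x" have overlapping windows. The token "c" sits in both windows but should appear only once in the result:

```r
library(quanteda)

# Illustrative sequence: "c" is adjacent to both occurrences of "x"
toks <- as.tokens(list(d1 = c("a", "b", "x", "c", "x", "d", "e")))

# Symmetric window: one token on each side of every match
sym <- tokens_select(toks, "x", selection = "keep", window = 1)

# Asymmetric window: one token before each match, none after
asym <- tokens_select(toks, "x", selection = "keep", window = c(1, 0))

print(as.list(sym)$d1)   # "b" "x" "c" "x" "d"
print(as.list(asym)$d1)  # "b" "x" "c" "x"
```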

min_nchar, max_nchar

numerics specifying the minimum and maximum length in characters for tokens to be removed or kept; defaults are 1 and 79. (Set max_nchar to NULL for no upper limit.) These are applied after (and hence, in addition to) any selection based on pattern matches.
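As an illustration of the length filter, the sketch below matches every token with the glob pattern "*" and then relies on min_nchar and max_nchar to narrow the result. The token vector is made up, and the expected result assumes the filter is applied on top of the pattern match as described above:

```r
library(quanteda)

# Illustrative tokens with lengths 1, 2, 3, 5, and 9
toks <- as.tokens(list(d1 = c("a", "to", "the", "token", "selection")))

# "*" matches all tokens; the length bounds then restrict what is kept
mid_length <- tokens_select(toks, "*", selection = "keep",
                            min_nchar = 3, max_nchar = 5)

print(as.list(mid_length)$d1)  # "the" "token"
```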

verbose

if TRUE, print messages about how many tokens were selected or removed

...

additional arguments passed by tokens_remove and tokens_keep to tokens_select. Cannot include selection.

Value

a tokens object with tokens selected or removed based on their match to pattern

Examples

## tokens_select with simple examples
toks <- tokens(c("This is a sentence.", "This is a second sentence."),
               remove_punct = TRUE)
tokens_select(toks, c("is", "a", "this"), selection = "keep", padding = FALSE)
#> tokens from 2 documents.
#> text1 :
#> [1] "This" "is" "a"
#>
#> text2 :
#> [1] "This" "is" "a"
#>
tokens_select(toks, c("is", "a", "this"), selection = "keep", padding = TRUE)
#> tokens from 2 documents.
#> text1 :
#> [1] "This" "is" "a" ""
#>
#> text2 :
#> [1] "This" "is" "a" "" ""
#>
tokens_select(toks, c("is", "a", "this"), selection = "remove", padding = FALSE)
#> tokens from 2 documents.
#> text1 :
#> [1] "sentence"
#>
#> text2 :
#> [1] "second" "sentence"
#>
tokens_select(toks, c("is", "a", "this"), selection = "remove", padding = TRUE)
#> tokens from 2 documents.
#> text1 :
#> [1] "" "" "" "sentence"
#>
#> text2 :
#> [1] "" "" "" "second" "sentence"
#>

# how case_insensitive works
tokens_select(toks, c("is", "a", "this"), selection = "remove", case_insensitive = TRUE)
#> tokens from 2 documents.
#> text1 :
#> [1] "sentence"
#>
#> text2 :
#> [1] "second" "sentence"
#>
tokens_select(toks, c("is", "a", "this"), selection = "remove", case_insensitive = FALSE)
#> tokens from 2 documents.
#> text1 :
#> [1] "This" "sentence"
#>
#> text2 :
#> [1] "This" "second" "sentence"
#>

# use window
tokens_select(toks, "second", selection = "keep", window = 1)
#> tokens from 2 documents.
#> text1 :
#> character(0)
#>
#> text2 :
#> [1] "a" "second" "sentence"
#>
tokens_select(toks, "second", selection = "remove", window = 1)
#> tokens from 2 documents.
#> text1 :
#> [1] "This" "is" "a" "sentence"
#>
#> text2 :
#> [1] "This" "is"
#>
tokens_remove(toks, "is", window = c(0, 1))
#> tokens from 2 documents.
#> text1 :
#> [1] "This" "sentence"
#>
#> text2 :
#> [1] "This" "second" "sentence"
#>

# tokens_remove example: remove stopwords
txt <- c(wash1 <- "Fellow citizens, I am again called upon by the voice of my
                   country to execute the functions of its Chief Magistrate.",
         wash2 <- "When the occasion proper for it shall arrive, I shall
                   endeavor to express the high sense I entertain of this
                   distinguished honor.")
tokens_remove(tokens(txt, remove_punct = TRUE), stopwords("english"))
#> tokens from 2 documents.
#> text1 :
#> [1] "Fellow" "citizens" "called" "upon" "voice"
#> [6] "country" "execute" "functions" "Chief" "Magistrate"
#>
#> text2 :
#> [1] "occasion" "proper" "shall" "arrive"
#> [5] "shall" "endeavor" "express" "high"
#> [9] "sense" "entertain" "distinguished" "honor"
#>

# tokens_keep example: keep two-letter words
tokens_keep(tokens(txt, remove_punct = TRUE), "??")
#> tokens from 2 documents.
#> text1 :
#> [1] "am" "by" "of" "my" "to" "of"
#>
#> text2 :
#> [1] "it" "to" "of"
#>