Replace multi-token sequences with a single multi-word, or "compound", token. The resulting compound tokens represent a phrase or multi-word expression, with its elements joined by the concatenator (by default, the "_" character) to form a single "token". This ensures that the sequences will subsequently be processed as single tokens, for instance when constructing a dfm.

tokens_compound(x, pattern, concatenator = "_", valuetype = c("glob",
  "regex", "fixed"), case_insensitive = TRUE, join = TRUE)

Arguments

x

an input tokens object

pattern

a character vector, list of character vectors, dictionary, collocations, or dfm. See pattern for details.

concatenator

the concatenation character that will connect the words making up the multi-word sequences. The default _ is recommended since it will not be removed during normal cleaning and tokenization (while nearly all other punctuation characters, at least those in the Unicode punctuation class [P], will be removed); a short sketch of a non-default concatenator follows this argument list.

valuetype

the type of pattern matching: "glob" for "glob"-style wildcard expressions; "regex" for regular expressions; or "fixed" for exact matching. See valuetype for details.

case_insensitive

logical; if TRUE, ignore case when matching

join

logical; if TRUE, join overlapping compounds
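
A short sketch of a non-default concatenator; the text below is illustrative and not part of the examples that follow.

# join a matched sequence with "+" instead of the default "_"; note that "+"
# may be stripped by later punctuation removal, which is why "_" is the
# recommended default
toks <- tokens("a capital gains tax")
tokens_compound(toks, list(c("capital", "gains", "tax")), concatenator = "+")
# the matched sequence becomes the single token "capital+gains+tax"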

Value

a tokens object in which the token sequences matching pattern have been replaced by compound "tokens" joined by the concatenator

Examples

mytexts <- c("The new law included a capital gains tax, and an inheritance tax.", "New York City has raised taxes: an income tax and inheritance taxes.") mytoks <- tokens(mytexts, remove_punct = TRUE) # for lists of sequence elements myseqs <- list(c("tax"), c("income", "tax"), c("capital", "gains", "tax"), c("inheritance", "tax")) (cw <- tokens_compound(mytoks, myseqs))
#> tokens from 2 documents. #> text1 : #> [1] "The" "new" "law" #> [4] "included" "a" "capital_gains_tax" #> [7] "and" "an" "inheritance_tax" #> #> text2 : #> [1] "New" "York" "City" "has" "raised" #> [6] "taxes" "an" "income_tax" "and" "inheritance" #> [11] "taxes" #>
dfm(cw)
#> Document-feature matrix of: 2 documents, 16 features (40.6% sparse).
#> 2 x 16 sparse Matrix of class "dfm"
#>        features
#> docs    the new law included a and an inheritance york city has raised taxes
#>   text1   1   1   1        1 1   1  1           0    0    0   0      0     0
#>   text2   0   1   0        0 0   1  1           1    1    1   1      1     2
#>        features
#> docs    income_tax capital_gains_tax inheritance_tax
#>   text1          0                 1               1
#>   text2          1                 0               0
# when used as a dictionary for dfm creation
mydict <- dictionary(list(tax = c("tax", "income tax", "capital gains tax",
                                  "inheritance tax*")))
(cw2 <- tokens_compound(mytoks, mydict))
#> tokens from 2 documents.
#> text1 :
#> [1] "The"      "new" "law"
#> [4] "included" "a"   "capital_gains_tax"
#> [7] "and"      "an"  "inheritance_tax"
#>
#> text2 :
#>  [1] "New" "York"       "City"
#>  [4] "has" "raised"     "taxes"
#>  [7] "an"  "income_tax" "and"
#> [10] "inheritance_taxes"
#>
# to pick up "taxes" in the second text, set valuetype = "regex" (cw3 <- tokens_compound(mytoks, mydict, valuetype = "regex"))
#> tokens from 2 documents. #> text1 : #> [1] "The" "new" "law" #> [4] "included" "a" "capital_gains_tax" #> [7] "and" "an" "inheritance_tax" #> #> text2 : #> [1] "New" "York" "City" #> [4] "has" "raised" "taxes" #> [7] "an" "income_tax" "and" #> [10] "inheritance_taxes" #>
# dictionaries w/ glob matches
myDict <- dictionary(list(negative = c("bad* word*", "negative", "awful text"),
                          positive = c("good stuff", "like? th??")))
toks <- tokens(c(txt1 = "I liked this, when we can use bad words, in awful text.",
                 txt2 = "Some damn good stuff, like the text, she likes that too."))
tokens_compound(toks, myDict)
#> tokens from 2 documents.
#> txt1 :
#>  [1] "I"          "liked_this" ","         "when" "we"
#>  [6] "can"        "use"        "bad_words" ","    "in"
#> [11] "awful_text" "."
#>
#> txt2 :
#>  [1] "Some" "damn" "good_stuff" ","   "like"
#>  [6] "the"  "text" ","          "she" "likes_that"
#> [11] "too"  "."
#>
# with collocations
cols <- textstat_collocations(tokens("capital gains taxes are worse than inheritance taxes"),
                              size = 2, min_count = 1)
toks <- tokens("The new law included capital gains taxes and inheritance taxes.")
tokens_compound(toks, cols)
#> tokens from 1 document.
#> text1 :
#> [1] "The"                 "new"                  "law"
#> [4] "included"            "capital_gains_taxes"  "and"
#> [7] "inheritance_taxes"   "."
#>
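
A possible follow-up, sketched under the assumption that dfm_select() is available in your quanteda version: the compounded features can be pulled out of the resulting dfm with a glob pattern.

# keep only the underscore-joined tax compounds from the earlier dfm
dfm_select(dfm(cw), "*_tax*")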