Introduction

quanteda is an R package for managing and analyzing text. This package makes it easy to manage texts in the form of a corpus, defined as a collection of texts that includes document-level variables specific to each text, as well as meta-data for documents and for the collection as a whole. quanteda includes tools to make it easy and fast to manuipulate the texts in a corpus, by performing the most common natural language processing tasks simply and quickly, such as tokenizing, stemming, or forming ngrams. quanteda’s functions for tokenizing texts and forming multiple tokenized documents into a document-feature matrix are both extremely fast and extremely simple to use. quanteda can segment texts easily by words, paragraphs, sentences, or even user-supplied delimiters and tags.

Built on the text processing functions in the stringi package, which is in turn built on C++ implementation of the ICU libraries for Unicode text handling, quanteda pays special attention to fast and correct implementation of Unicode and the handling of text in any character set, following conversion internally to UTF-8.

quanteda is built for efficiency and speed, through its design around three infrastructures: the stringi package for text processing, the data.table package for indexing large documents efficiently, and the Matrix package for sparse matrix objects. If you can fit it into memory, quanteda will handle it quickly. (And eventually, we will make it possible to process objects even larger than available memory.)

quanteda is principally designed to allow users a fast and convenient method to go from a corpus of texts to a selected matrix of documents by features, after defining what the documents and features. The package makes it easy to redefine documents, for instance by splitting them into sentences or paragraphs, or by tags, as well as to group them into larger documents by document variables, or to subset them based on logical conditions or combinations of document variables. The package also implements common NLP feature selection functions, such as removing stopwords and stemming in numerous languages, selecting words found in dictionaries, treating words as equivalent based on a user-defined “thesaurus”, and trimming and weighting features based on document frequency, feature frequency, and related measures such as tf-idf.

Features

Corpus Management Tools

The tools for getting texts into a corpus object include:

  • loading texts from directories of individual files
  • loading texts ``manually’’ by inserting them into a corpus using helper functions
  • managing text encodings and conversions from source files into corpus texts
  • attaching variables to each text that can be used for grouping, reorganizing a corpus, or simply recording additional information to supplement quantitative analyses with non-textual data
  • recording meta-data about the sources and creation details for the corpus.

The tools for working with a corpus include:

  • summarizing the corpus in terms of its language units
  • reshaping the corpus into smaller units or more aggregated units
  • adding to or extracting subsets of a corpus
  • resampling texts of the corpus, for example for use in non-parametric bootstrapping of the texts
  • Easy extraction and saving, as a new data frame or corpus, key words in context (KWIC)

Natural-Language Processing tools

For extracting features from a corpus, quanteda provides the following tools:

  • extraction of word types
  • extraction of word n-grams
  • extraction of dictionary entries from user-defined dictionaries
  • feature selection through
    • stemming
    • random selection
    • document frequency
    • word frequency
  • and a variety of options for cleaning word types, such as capitalization and rules for handling punctuation.

Document-Feature Matrix analysis tools

For analyzing the resulting document-feature matrix created when features are abstracted from a corpus, quanteda provides:

  • scaling methods, such as correspondence analysis, Wordfish, and Wordscores
  • topic models, such as LDA
  • classifiers, such as Naive Bayes or k-nearest neighbour
  • sentiment analysis, using dictionaries

Additional and Planned Features

Additional features of quanteda include:

  • the ability to explore texts using key-words-in-context
  • fast computation of a variety of readability indexes
  • fast computation of a variety of lexical diversity measures
  • quick computation of word or document association measures, for clustering or to compute similarity scores for other purposes
  • a comprehensive suite of descriptive statistics on text such as the number of sentences, words, characters, or syllables per document

Working with other packages

quanteda is hardly unique in providing facilities for working with text – the excellent tm package already provides many of the features we have described. quanteda is designed to complement those packages, as well to simplify the implementation of the text-to-analysis workflow. quanteda corpus structures are simpler objects than in tms, as are the document-feature matrix objects from quanteda, compared to the sparse matrix implementation found in tm. However, there is no need to choose only one package, since we provide translator functions from one matrix or corpus object to the other in quanteda.

Once constructed, a quanteda “dfm”" can be easily passed to other text-analysis packages for additional analysis of topic models or scaling, such as:

  • topic models (including converters for direct use with the topicmodels, LDA, and stm packages)
  • document scaling using quanteda’s own functions for the “wordfish” and “Wordscores” models, and a sparse method for correspondence analysis
  • document classification methods, using (for example) Naive Bayes, k-nearest neighbour, or Support Vector Machines
  • more sophisticated machine learning through a variety of other packages that take matrix or matrix-like inputs.
  • graphical analysis, including word clouds and strip plots for selected themes or words.