This article compares quanteda to alternative R packages for quantitative text analysis (tm, tidytext, corpus, and koRpus) and to the Natural Language Toolkit (NLTK) for Python. Where another package offers an equivalent function, we list the corresponding command.

Note that the comparison is based on the package manuals. If we have overlooked a function, please let us know, either by editing the table and opening a pull request or by contacting the maintainer.

| Function | quanteda | tm | tidytext | corpus | koRpus | NLTK |
|---|---|---|---|---|---|---|
| Create corpus | corpus() | Corpus() | | corpus_frame() | read.corp.custom() | PlaintextCorpusReader() |
| Bind/subset corpora | corpus_subset() | tm_combine(); tm_filter() | | | | |
| Reshape corpus into smaller units | corpus_reshape(); corpus_segment() | | | text_split() | | |
| Take random sample of corpus texts | corpus_sample() | | | | | |
| Keywords-in-context | kwic() | | | text_locate() | | common_contexts() |
| Tokenize texts | tokens() | tokenizer() | unnest_tokens() | text_tokens() | tokenize() | nltk.word_tokenize |
| Stem features | tokens_wordstem() | stemDocument() | | stem_snowball() | treetag() | stem() |
| Define multi-word features | phrase() | | | | | MWETokenizer |
| Create document-feature matrix | dfm() | TermDocumentMatrix() | unnest_tokens() | term_matrix() | | |
| Create a feature co-occurrence matrix | fcm() | | | | | |
| Weight a dfm/fcm | dfm_weight(); fcm_weight() | weightTf(); weightTfIdf() | bind_tf_idf() | | | |
| Create a custom dictionary | dictionary() | | dictionary always a data.frame object | | | SentimentAnalyzer |
| Included dictionaries | Lexicoder | | AFINN, Bing, NRC | AFINN Sentiment dictionary, WordNet-Affect Lexicon | | |
| Apply custom dictionaries | dfm_lookup() | | dplyr::inner_join() | | | SentimentAnalyzer |
| Supported dictionary formats | Wordstat, LIWC, yoshicoder, lexicoder, YAML | | data.frame objects | | | |
| Calculate feature frequencies | textstat_frequency() | findMostFreqTerms() | dplyr::count() | term_stats() | freq.analysis() | FreqDist() |
| Extract collocations | textstat_collocations() | | unnest_tokens(token = "ngrams") | | | collocations() |
| Readability scores | textstat_readability() | | | | readability() | nltk_contrib.readability |
| Lexical diversity | textstat_lexdiv() | | | | various measures | lexical_diversity() |
| Distance/similarity measures | textstat_simil(); textstat_dist() | | | | | |
| Keyness statistics | textstat_keyness() | | | | | |
| Wordcloud | textplot_wordcloud() | | | | | |
| Correspondence Analysis | textmodel_ca() | | | | | |
| Naïve Bayes | textmodel_NB() | | | | | NaiveBayesClassifier |
| Wordscores | textmodel_wordscores() | | | | | |
| Wordfish | textmodel_wordfish() | | | | | |
| Convert dfm to other format | convert() | | cast_tdm() | | | |
| POS-tagging | spacyr package | | parts_of_speech() | | kRp.POS.tags() | nltk.pos_tag |
| Import texts | readtext package | Reader() | | | | read() |
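
To make the NLTK column more concrete, here is a minimal sketch of how two of its entries, MWETokenizer ("Define multi-word features") and FreqDist() ("Calculate feature frequencies"), might be combined. The sample sentence and variable names are made up for illustration, and the text is split on whitespace rather than with nltk.word_tokenize so that no additional NLTK data needs to be downloaded. It assumes the nltk package is installed.

```python
from nltk import FreqDist
from nltk.tokenize import MWETokenizer

# Toy input (an assumption for illustration), split on whitespace so that
# no NLTK tokenizer models have to be downloaded.
text = "the united states and the united kingdom signed the treaty"
tokens = text.split()

# Define multi-word features: each tuple of adjacent tokens is merged
# into a single token joined by the separator.
mwe = MWETokenizer([("united", "states"), ("united", "kingdom")], separator="_")
merged = mwe.tokenize(tokens)

# Calculate feature frequencies over the merged tokens.
freq = FreqDist(merged)
top = freq.most_common(1)

print(merged)
print(top)
```

In quanteda, the rough counterparts would be phrase() for the multi-word features and textstat_frequency() for the counts, as listed in the table.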