Fit a multinomial or Bernoulli Naive Bayes model, given a dfm and some training labels.

textmodel_nb(x, y, smooth = 1, prior = c("uniform", "docfreq", "termfreq"),
  distribution = c("multinomial", "Bernoulli"), ...)

Arguments

x

the dfm on which the model will be fit. Does not need to contain only the training documents.

y

vector of training labels associated with each document in x. (These will be converted to factors if not already factors.) Documents that are not part of the training set can be labelled NA, as document d5 is in the Examples.

smooth

smoothing parameter added to the feature counts of each class; the default of 1 gives add-one (Laplace) smoothing

prior

prior distribution on texts; one of "uniform", "docfreq", or "termfreq". See Prior Distributions below.

distribution

count model for text features, can be multinomial or Bernoulli. To fit a "binary multinomial" model, first convert the dfm to a binary matrix using tf(x, "boolean").

...

more arguments passed through

Value

A list of return values, consisting of (where \(I\) is the total number of documents, \(J\) is the total number of features, and \(k\) is the total number of training classes):

call

original function call

PwGc

\(k \times J\); probability of the word given the class (empirical likelihood)

Pc

\(k\)-length named numeric vector of class prior probabilities

PcGw

\(k \times J\); posterior class probability given the word

Pw

\(J \times 1\); baseline probability of the word

data

list consisting of the \(I \times J\) training dfm x, and the \(I\)-length y training class vector

distribution

the distribution argument

prior

the prior argument

smooth

the value of the smoothing parameter
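The relationship between PwGc, Pc, Pw, and PcGw can be sketched in base R. This is not quanteda's implementation, only a minimal illustration that uses the class-by-feature counts of the four training documents from the Examples section, with smooth = 1 and "docfreq" priors. Treating Pw as the prior-weighted sum of the class likelihoods is an assumption made here; the resulting likelihoods and class posteriors agree with the fitted multinomial model printed in the Examples.

```r
# Class-by-feature counts for training documents d1-d4 of the worked example.
counts <- rbind(
  Y = c(Chinese = 5, Beijing = 1, Shanghai = 1, Macao = 1, Tokyo = 0, Japan = 0),
  N = c(Chinese = 1, Beijing = 0, Shanghai = 0, Macao = 0, Tokyo = 1, Japan = 1)
)
smooth <- 1
Pc <- c(Y = 0.75, N = 0.25)  # "docfreq" priors: 3 of the 4 training documents are Y

# Smoothed multinomial likelihood: add `smooth` to every count, normalise each row.
PwGc <- (counts + smooth) / rowSums(counts + smooth)

# Assumption: the baseline word probability is the prior-weighted sum over classes.
Pw <- colSums(PwGc * Pc)

# Bayes' rule applied feature by feature gives the class posteriors.
PcGw <- t(t(PwGc * Pc) / Pw)

round(PwGc, 4)
round(PcGw, 4)
```

Each column of PcGw sums to 1, since the posterior for a single word is distributed over the k classes.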

Predict Methods

A predict method is also available for a fitted Naive Bayes object; see predict.textmodel_nb_fitted.

Prior Distributions

Prior distributions refer to the prior probabilities assigned to the training classes, and the choice of prior distribution affects the calculation of the fitted probabilities. The default is uniform priors, which sets the unconditional probability of observing any one class to be the same as that of observing any other class.

"Document frequency" means that the class priors will be taken from the relative proportions of the class documents used in the training set. This approach is so common that it is assumed in many examples, such as the worked example from Manning, Raghavan, and Schütze (2008) below. It is not the default in quanteda, however, since there may be nothing informative in the relative numbers of documents used to train a classifier other than the relative availability of the documents. When the training classes are balanced in their number of documents (usually advisable), the empirically computed "docfreq" priors are equivalent to "uniform" priors.

Setting prior to "termfreq" makes the priors equal to the proportions of total feature counts found in the grouped documents in each training class, so that the classes with the largest number of features are assigned the largest priors. If the total count of features in each training class were the same, then "uniform" and "termfreq" would be the same.
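The three choices can be compared with a short base R sketch. It does not call quanteda; it simply applies the definitions above to the four training documents from the Examples section. The variable names (ntoken, label, and the prior_* objects) are hypothetical and chosen for illustration only.

```r
# Token counts and class labels for training documents d1-d4 of the worked example.
ntoken <- c(d1 = 3, d2 = 3, d3 = 2, d4 = 3)
label  <- factor(c("Y", "Y", "Y", "N"), levels = c("Y", "N"))
k <- nlevels(label)

# "uniform": every class receives the same prior probability.
prior_uniform <- setNames(rep(1 / k, k), levels(label))

# "docfreq": share of training documents that belong to each class.
prior_docfreq <- prop.table(table(label))

# "termfreq": share of all feature counts that fall in each class.
prior_termfreq <- tapply(ntoken, label, sum) / sum(ntoken)

prior_uniform   # Y = 1/2,  N = 1/2
prior_docfreq   # Y = 3/4,  N = 1/4
prior_termfreq  # Y = 8/11, N = 3/11
```

Because class Y holds three of the four documents and eight of the eleven tokens, both empirical priors favour Y here, with "docfreq" doing so slightly more strongly than "termfreq".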

References

Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press. https://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf

Jurafsky, D., & Martin, J. H. (2016). Speech and Language Processing. Draft of November 7, 2016. https://web.stanford.edu/~jurafsky/slp3/6.pdf

Examples

## Example from 13.1 of _An Introduction to Information Retrieval_
txt <- c(d1 = "Chinese Beijing Chinese",
         d2 = "Chinese Chinese Shanghai",
         d3 = "Chinese Macao",
         d4 = "Tokyo Japan Chinese",
         d5 = "Chinese Chinese Chinese Tokyo Japan")
trainingset <- dfm(txt, tolower = FALSE)
trainingclass <- factor(c("Y", "Y", "Y", "N", NA), ordered = TRUE)

## replicate IIR p261 prediction for test set (document 5)
(nb.p261 <- textmodel_nb(trainingset, trainingclass, prior = "docfreq"))
#> Fitted Naive Bayes model:
#> Call:
#>     textmodel_nb.dfm(x = trainingset, y = trainingclass, prior = "docfreq")
#>
#>
#> Training classes and priors:
#>    Y    N
#> 0.75 0.25
#>
#>             Likelihoods:          Class Posteriors:
#>                   Y         N         Y         N
#> Chinese  0.42857143 0.2222222 0.8526316 0.1473684
#> Beijing  0.14285714 0.1111111 0.7941176 0.2058824
#> Shanghai 0.14285714 0.1111111 0.7941176 0.2058824
#> Macao    0.14285714 0.1111111 0.7941176 0.2058824
#> Tokyo    0.07142857 0.2222222 0.4909091 0.5090909
#> Japan    0.07142857 0.2222222 0.4909091 0.5090909
#>
predict(nb.p261, newdata = trainingset[5, ])
#> Predicted textmodel of type: Naive Bayes
#>
#>       lp(Y)     lp(N)  Pr(Y)  Pr(N) Predicted
#> d5 -8.10769 -8.906681 0.6898 0.3102         Y
#>
# contrast with other priors
predict(textmodel_nb(trainingset, trainingclass, prior = "uniform"))
#> Predicted textmodel of type: Naive Bayes
#>
#>        lp(Y)     lp(N)  Pr(Y)  Pr(N) Predicted
#> d1 -4.333653 -5.898527 0.8271 0.1729         Y
#> d2 -4.333653 -5.898527 0.8271 0.1729         Y
#> d3 -3.486355 -4.394449 0.7126 0.2874         Y
#> d4 -6.818560 -5.205379 0.1661 0.8339         N
#> d5 -8.513155 -8.213534 0.4257 0.5743         N
#>
predict(textmodel_nb(trainingset, trainingclass, prior = "termfreq"))
#> Predicted textmodel of type: Naive Bayes
#>
#>        lp(Y)     lp(N)  Pr(Y)  Pr(N) Predicted
#> d1 -3.958960 -6.504662 0.9273 0.0727         Y
#> d2 -3.958960 -6.504662 0.9273 0.0727         Y
#> d3 -3.111662 -5.000585 0.8686 0.1314         Y
#> d4 -6.443866 -5.811515 0.3470 0.6530         N
#> d5 -8.138462 -8.819670 0.6640 0.3360         Y
#>
## replicate IIR p264 Bernoulli Naive Bayes
(nb.p261.bern <- textmodel_nb(trainingset, trainingclass,
                              distribution = "Bernoulli", prior = "docfreq"))
#> Fitted Naive Bayes model:
#> Call:
#>     textmodel_nb.dfm(x = trainingset, y = trainingclass, prior = "docfreq",
#>     distribution = "Bernoulli")
#>
#>
#> Training classes and priors:
#>    Y    N
#> 0.75 0.25
#>
#>          Likelihoods:       Class Posteriors:
#>            Y         N         Y         N
#> Chinese  0.8 0.6666667 0.7826087 0.2173913
#> Beijing  0.4 0.3333333 0.7826087 0.2173913
#> Shanghai 0.4 0.3333333 0.7826087 0.2173913
#> Macao    0.4 0.3333333 0.7826087 0.2173913
#> Tokyo    0.2 0.6666667 0.4736842 0.5263158
#> Japan    0.2 0.6666667 0.4736842 0.5263158
#>
predict(nb.p261.bern, newdata = trainingset[5, ])
#> Predicted textmodel of type: Naive Bayes
#>
#>        lp(Y)     lp(N)  Pr(Y)  Pr(N) Predicted
#> d5 -5.262178 -3.819085 0.1911 0.8089         N
#>
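
The two "docfreq" predictions for document d5 above can be checked by hand. The base R sketch below is not part of quanteda; it enters the smoothed likelihoods printed above as fractions and combines them with the class priors. The object names (mult, bern, d5_counts, d5_present, softmax) are hypothetical, and the Bernoulli smoothing formula in the comment is inferred from the printed likelihoods rather than taken from the package source.

```r
# Multinomial likelihoods with smooth = 1:
# (count + 1) / (class token total + number of features).
mult <- rbind(
  Y = c(Chinese = 6 / 14, Tokyo = 1 / 14, Japan = 1 / 14),
  N = c(Chinese = 2 / 9,  Tokyo = 2 / 9,  Japan = 2 / 9)
)
# Bernoulli likelihoods (inferred form):
# (class documents containing the feature + 1) / (class documents + 2).
bern <- rbind(
  Y = c(Chinese = 4 / 5, Beijing = 2 / 5, Shanghai = 2 / 5,
        Macao = 2 / 5, Tokyo = 1 / 5, Japan = 1 / 5),
  N = c(Chinese = 2 / 3, Beijing = 1 / 3, Shanghai = 1 / 3,
        Macao = 1 / 3, Tokyo = 2 / 3, Japan = 2 / 3)
)
prior <- c(Y = 0.75, N = 0.25)

# d5 = "Chinese Chinese Chinese Tokyo Japan"
d5_counts  <- c(Chinese = 3, Tokyo = 1, Japan = 1)
d5_present <- c(Chinese = 1, Beijing = 0, Shanghai = 0,
                Macao = 0, Tokyo = 1, Japan = 1)

# Multinomial: log prior plus count-weighted log likelihoods.
lp_mult <- log(prior) + as.vector(log(mult) %*% d5_counts)

# Bernoulli: present features contribute log(p), absent features log(1 - p).
lp_bern <- log(prior) +
  as.vector(log(bern) %*% d5_present + log(1 - bern) %*% (1 - d5_present))

# Normalise on the log scale to obtain class probabilities.
softmax <- function(lp) exp(lp - max(lp)) / sum(exp(lp - max(lp)))

round(lp_mult, 4); round(softmax(lp_mult), 4)
round(lp_bern, 4); round(softmax(lp_bern), 4)
```

The multinomial model favours Y because d5 repeats "Chinese" three times, whereas the Bernoulli model only registers that "Chinese" is present and penalises Y for the presence of "Tokyo" and "Japan", which flips the prediction to N.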