Naive Bayes classifier. A naive Bayes classifier is a simple probabilistic
classifier based on applying Bayes' theorem with strong (naive) independence
assumptions. Depending on the precise nature of the probability model, naive
Bayes classifiers can be trained very efficiently in a supervised learning
setting.
In spite of their naive design and apparently over-simplified assumptions,
naive Bayes classifiers have worked quite well in many complex real-world
situations and are very popular in Natural Language Processing (NLP).
For a general-purpose naive Bayes classifier without any assumptions
about the underlying distribution of each variable, we don't provide
a learning method to infer the variable distributions from the training data.
Instead, users can fit any appropriate distributions to the data themselves
with the various Distribution classes. Although the
#predict method takes an array of double values as a general form of independent variables,
users are free to use any discrete distributions to model categorical or
ordinal random variables.
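The idea above can be sketched in Python. This is a minimal illustration, not the library's API: the `NaiveBayes` and `Gaussian` classes and the `pdf` interface are assumptions standing in for the Distribution classes; the caller fits one distribution per feature per class and the classifier only combines them.

```python
import math

class NaiveBayes:
    """Minimal general-purpose naive Bayes: the caller supplies, per class,
    one already-fitted distribution for each feature."""

    def __init__(self, priors, distributions):
        # priors[c]: prior probability of class c
        # distributions[c][j]: fitted distribution of feature j given class c,
        # exposing pdf(x) (a pmf works the same way for discrete features)
        self.priors = priors
        self.distributions = distributions

    def predict(self, x):
        # Pick the class with the highest log-posterior. x is an array of
        # doubles, but any feature may be modeled by a discrete distribution.
        best, best_score = None, -math.inf
        for c, prior in enumerate(self.priors):
            score = math.log(prior)
            for j, xj in enumerate(x):
                score += math.log(self.distributions[c][j].pdf(xj))
            if score > best_score:
                best, best_score = c, score
        return best

class Gaussian:
    """Toy continuous distribution for the sketch."""
    def __init__(self, mu, sigma):
        self.mu, self.sigma = mu, sigma
    def pdf(self, x):
        z = (x - self.mu) / self.sigma
        return math.exp(-0.5 * z * z) / (self.sigma * math.sqrt(2 * math.pi))

# One feature, two classes: the user fits the distributions beforehand.
nb = NaiveBayes(
    priors=[0.5, 0.5],
    distributions=[[Gaussian(0.0, 1.0)], [Gaussian(5.0, 1.0)]],
)
```

Because the classifier never looks inside the distribution objects, mixing a Gaussian for one feature with a discrete distribution for another is straightforward.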
For document classification in NLP, there are two major ways to set
up a naive Bayes classifier: the multinomial model and the Bernoulli model. The
multinomial model generates one term from the vocabulary in each position
of the document. The multivariate Bernoulli model, or Bernoulli model,
generates an indicator for each term of the vocabulary, indicating either
the presence or the absence of the term in the document.
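The two document representations can be contrasted in a short sketch (the vocabulary and document are made-up examples):

```python
from collections import Counter

vocabulary = ["chinese", "beijing", "shanghai", "tokyo", "japan"]
doc = "chinese chinese chinese tokyo".split()

# Multinomial model: one term is generated per position, so a document
# is represented by a vector of term counts.
counts = Counter(doc)
multinomial = [counts[t] for t in vocabulary]

# Bernoulli model: one presence/absence indicator per vocabulary term;
# the number of occurrences is discarded.
bernoulli = [1 if t in counts else 0 for t in vocabulary]
```

Here `multinomial` records that "chinese" occurred three times, while `bernoulli` only records that it occurred at all.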
Of the two models, the Bernoulli model is particularly sensitive to noise
features. A Bernoulli naive Bayes classifier requires some form of feature
selection or else its accuracy will be low.
The different generation models imply different estimation strategies and
different classification rules. The Bernoulli model estimates P(t | c) as the
fraction of documents of class c that contain term t. In contrast, the
multinomial model estimates P(t | c) as the fraction of tokens, or fraction of
positions, in documents of class c in which term t occurs. When classifying a
test document, the Bernoulli model uses binary occurrence information,
ignoring the number of occurrences, whereas the multinomial model keeps
track of multiple occurrences. As a result, the Bernoulli model typically
makes many mistakes when classifying long documents. However, the Bernoulli
model has been reported to work better for sentiment analysis.
The models also differ in how non-occurring terms are used in classification.
Non-occurring terms do not affect the classification decision in the multinomial
model, but in the Bernoulli model the probability of non-occurrence is factored
in when computing P(c | d). This is because only the Bernoulli model
models the absence of terms explicitly.
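The Bernoulli estimation and classification rules can be sketched as follows. This is an illustrative implementation, not the library's; the function names are assumptions, and add-one (Laplace) smoothing is assumed for the estimates. Note the `log(1 - p)` term: every vocabulary word contributes to the score, including the absent ones.

```python
import math

def train_bernoulli(docs, labels, vocab):
    # Estimate P(t | c) as the fraction of documents of class c that
    # contain term t, with add-one smoothing: (count + 1) / (n + 2).
    classes = sorted(set(labels))
    cond = {}
    for c in classes:
        class_docs = [set(d) for d, y in zip(docs, labels) if y == c]
        n = len(class_docs)
        cond[c] = {t: (sum(t in d for d in class_docs) + 1) / (n + 2)
                   for t in vocab}
    priors = {c: labels.count(c) / len(labels) for c in classes}
    return priors, cond

def classify_bernoulli(doc, priors, cond, vocab):
    present = set(doc)
    scores = {}
    for c in priors:
        s = math.log(priors[c])
        for t in vocab:
            p = cond[c][t]
            # Absent terms are factored in via the (1 - p) factor;
            # this is where the Bernoulli model differs from the multinomial.
            s += math.log(p) if t in present else math.log(1.0 - p)
        scores[c] = s
    return max(scores, key=scores.get)

# Tiny worked example: class 1 = China-related, class 0 = not.
train = [("chinese beijing chinese", 1),
         ("chinese chinese shanghai", 1),
         ("chinese macao", 1),
         ("tokyo japan chinese", 0)]
docs = [d.split() for d, _ in train]
labels = [y for _, y in train]
vocab = sorted({t for d in docs for t in d})
priors, cond = train_bernoulli(docs, labels, vocab)
label = classify_bernoulli("chinese chinese chinese tokyo japan".split(),
                           priors, cond, vocab)
```

Note that the test document repeats "chinese" three times, yet the Bernoulli model sees only its presence; the absence probabilities for the remaining vocabulary tip the decision.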
A third setting is the Pólya urn model, which simply counts each term
observed in the training data twice instead of once.
See the references for more detail.
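As a sketch of the counting rule just described (the helper name is an assumption), the Pólya urn variant differs from standard multinomial counting only in the increment applied per observed token:

```python
from collections import Counter

def term_counts(doc_tokens, polya_urn=False):
    # Standard multinomial counting adds 1 per observed token; the
    # Pólya urn variant adds each observed token twice instead.
    weight = 2 if polya_urn else 1
    counts = Counter()
    for t in doc_tokens:
        counts[t] += weight
    return counts
```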