**Naive Bayes Classifier** is probably the most widely used text classifier, it’s a *supervised learning* algorithm. It can be used to classify blog posts or news articles into different categories like sports, entertainment and so forth. It can be used to detect spam emails. But most important is that it’s widely implemented in Sentiment analysis. So first of all what is supervised learning? It means that the labeled training dataset is provided along with the input and the respective output. From this training dataset, our algorithm infers the next outcome to a given input.

The basics,

Conditional Probability : It is simply the probability that something will happen, given that something else has happened. It’s a way to handle dependent events. You can check out some examples of conditional probability here.

So from the multiplication rule; (here A and B are dependent events)

P(A ∪ B) = P(A) · P(B|A)

Now from above equation we get the formula for conditional probability;

P(B|A) = P(A ∪ B) / P(A)

Bayes’ theorem : It describes the probability of an event based on the conditions or attributes that might be related to the event,

P(A|B) = [P(B|A) · P(A)] / P(B)

So, our classifier can be written as :

Assume a problem instance to be classified, represented by vector *x = (x1, x2, x3, …. , xn)* representing some *n* attributes. Here *y* is our class variable.

Here we have eliminated the denominator *P(x1, x2, x3, …. , xn) *because* *it doesn’t really contribute to our final solution.

Now to make sure our algorithm holds up good against our datasets, we need to take the following conditions into account.

**The Zero Frequency problem** : Let us consider the case where a given attribute or class never occurs together in the training data, causing the frequency-based probability estimate be zero. This small condition will wipe out the entire information in other probabilities when multiplied (multiplied by zero…duh…!). The simple solution to it is to apply *Laplace estimation* by assuming a uniform distribution over all attributes ie. we simply add a *pseudocount* in all probability estimates such that no probability is ever set to zero.

**Floating Point Underflow** : The probability values can go out of the floating point range hence to avoid this we need take *logarithms* on the probabilities. Accordingly we need to apply logarithmic rules to our classifier.

I have implemented Naive Bayes Classifier in *Python* and you can find it on *Github*. If have any improvements to add or any suggestions let me know in the comments section below.

Refer :

- Naive Bayes – scikit
- SO – A simple explanation of Naive Bayes Classification
- Coursera Video lecture – Machine Learning by Pedro Domingos [Recommended]