Naive Bayes Classifier in Python

The Naive Bayes classifier is one of the most widely used text classifiers, and it is a supervised learning algorithm. It can be used to classify blog posts or news articles into categories such as sports, entertainment and so forth, and it can be used to detect spam emails. Most importantly, it is widely used in sentiment analysis. So first of all, what is supervised learning? It means that we are given a labeled training dataset, in which each input comes with its respective output. From this training dataset, our algorithm infers the outcome for a new input.

The basics :

Conditional Probability : It is simply the probability that something will happen, given that something else has happened. It’s a way to handle dependent events. You can check out some examples of conditional probability here.

So from the multiplication rule (here A and B are dependent events):

P(A ∩ B) = P(A) · P(B|A)

Dividing both sides by P(A) gives the formula for conditional probability:

P(B|A) = P(A ∩ B) / P(A)
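To make the formula concrete, here is a minimal sketch using a fair six-sided die. The events are my own illustration, not from the original post: A is "the roll is even" and B is "the roll is greater than 3".

```python
from fractions import Fraction

# Sample space of a fair six-sided die (illustrative assumption).
outcomes = range(1, 7)
A = {o for o in outcomes if o % 2 == 0}  # even rolls: {2, 4, 6}
B = {o for o in outcomes if o > 3}       # rolls above 3: {4, 5, 6}

def probability(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(outcomes))

p_a = probability(A)            # P(A) = 1/2
p_a_and_b = probability(A & B)  # P(A and B) = 1/3, from the rolls {4, 6}
p_b_given_a = p_a_and_b / p_a   # P(B|A) = P(A and B) / P(A)

print(p_b_given_a)  # 2/3
```

Knowing the roll is even raises the chance that it is above 3 from 1/2 to 2/3, which is exactly what it means for the two events to be dependent.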

Bayes’ theorem : It describes the probability of an event based on prior knowledge of conditions or attributes that might be related to the event:

P(A|B) = [P(B|A) · P(A)] / P(B)
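As a small worked example of the theorem, the sketch below computes the probability that an email is spam given that it contains a particular word. All the numbers are made up for illustration, and `bayes` is a hypothetical helper name.

```python
from fractions import Fraction

def bayes(p_b_given_a, p_a, p_b):
    """Return P(A|B) = P(B|A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b

# Hypothetical numbers, chosen only for illustration.
p_spam = Fraction(1, 5)              # P(spam)
p_word_given_spam = Fraction(1, 2)   # P(word | spam)
p_word_given_ham = Fraction(1, 20)   # P(word | not spam)

# P(word) by the law of total probability.
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

p_spam_given_word = bayes(p_word_given_spam, p_spam, p_word)
print(p_spam_given_word)  # 5/7
```

Seeing the word lifts the spam probability from the prior of 1/5 to 5/7, because the word is ten times more likely in spam than in non-spam under these assumed numbers.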

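So, our classifier can be written as :

Assume a problem instance to be classified, represented by a vector x = (x1, x2, x3, …, xn) of n attributes. Here y is our class variable. Applying Bayes’ theorem gives

P(y | x1, x2, x3, …, xn) = [P(y) · P(x1, x2, x3, …, xn | y)] / P(x1, x2, x3, …, xn)

Under the naive assumption that the attributes are conditionally independent given the class, P(x1, x2, x3, …, xn | y) = P(x1|y) · P(x2|y) · … · P(xn|y), so the predicted class is

ŷ = argmax over y of P(y) · P(x1|y) · P(x2|y) · … · P(xn|y)

Here we have eliminated the denominator P(x1, x2, x3, …, xn) because it is the same for every class y and so does not affect which class scores highest.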
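The decision rule, picking the class y that maximizes P(y) multiplied by each P(xi | y), can be sketched as follows. This is not the code from my repository; the class names, words and probability tables are hypothetical values chosen for illustration, and `classify` is an assumed helper name.

```python
from math import prod

# Hypothetical probability tables for a two-class text problem.
priors = {"sports": 0.5, "politics": 0.5}  # P(y)
likelihoods = {                            # P(x_i | y)
    "sports": {"goal": 0.30, "vote": 0.05, "team": 0.25},
    "politics": {"goal": 0.05, "vote": 0.40, "team": 0.10},
}

def classify(attributes):
    """Return the best class and the unnormalized score of every class.

    Each score is P(y) times the product of P(x_i | y). The shared
    denominator P(x1, ..., xn) is skipped since it is equal for all y.
    """
    scores = {
        y: priors[y] * prod(likelihoods[y][x] for x in attributes)
        for y in priors
    }
    return max(scores, key=scores.get), scores

label, scores = classify(["goal", "team"])
print(label)  # sports
```

The scores are not probabilities, since they do not sum to one, but their ordering is the same as that of the true posteriors, which is all the classifier needs.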

Now to make sure our algorithm holds up well against our datasets, we need to take the following conditions into account.

The Zero Frequency problem : Consider the case where a given attribute value and class never occur together in the training data, causing the frequency-based probability estimate to be zero. This one zero will wipe out all the information in the other probabilities when multiplied (multiplied by zero…duh…!). The simple solution is to apply Laplace estimation, which amounts to assuming a uniform prior over the attribute values, i.e. we simply add a pseudocount to every count so that no probability estimate is ever exactly zero.
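A minimal sketch of Laplace estimation for one class is shown below. The function name `smoothed_likelihoods`, the parameter `alpha` (the pseudocount) and the toy word list are my own assumptions for illustration, and the sketch assumes every observed word is in the vocabulary.

```python
from collections import Counter

def smoothed_likelihoods(words, vocabulary, alpha=1):
    """Estimate P(word | class) with a pseudocount added to every count.

    words: all words observed in the training documents of one class.
    vocabulary: every distinct word seen across all classes.
    """
    counts = Counter(words)
    # Each vocabulary entry gets alpha extra counts, so the denominator
    # grows by alpha * len(vocabulary) and the estimates still sum to 1.
    total = len(words) + alpha * len(vocabulary)
    return {w: (counts[w] + alpha) / total for w in vocabulary}

vocabulary = ["goal", "team", "vote"]
sports_words = ["goal", "goal", "team"]  # "vote" never appears in this class

probs = smoothed_likelihoods(sports_words, vocabulary)
print(probs["vote"])  # small but non-zero
```

Without the pseudocount, "vote" would get probability 0 for this class and any document containing it could never be classified as sports, no matter what its other words say.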

Floating Point Underflow : Multiplying many small probabilities can produce a value too small for floating point to represent. To avoid this, we take logarithms of the probabilities; since log(a · b) = log(a) + log(b), the product in our classifier becomes a sum of logs.
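The sketch below shows the underflow and the log-based fix side by side. The list of 100 probabilities of 1e-5 each is an artificial input chosen so that the true product, 1e-500, is far below what a 64-bit float can hold.

```python
import math

# Artificial input: 100 attribute probabilities of 1e-5 each.
probabilities = [1e-5] * 100

# Naive approach: multiply everything together.
product = 1.0
for p in probabilities:
    product *= p
print(product)  # 0.0, the value has underflowed

# Log-space approach: add the logarithms instead.
log_score = sum(math.log(p) for p in probabilities)
print(log_score)  # a large negative number, still perfectly usable
```

Because the logarithm is an increasing function, comparing log scores between classes picks the same winner as comparing the original products would, so the classifier can work with log P(y) + log P(x1|y) + … + log P(xn|y) throughout.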

I have implemented the Naive Bayes Classifier in Python and you can find it on GitHub. If you have any improvements to add or any suggestions, let me know in the comments section below.

Posted April 23, 2016 · Natural Language Processing · Shirish Kadam

Tags: Naive Bayes Classifier, Python, Supervised Learning

Comments

4 responses to “Naive Bayes Classifier in Python”

Setting up Natural Language Processing Environment with Python – SHIRISH KADAM

October 6, 2016

[…] If you have any problems with installations or some other comments let me know below in the comments section. Meanwhile, you can also check out my other post on Machine Learning Classification algorithm. Naive Bayes Classifier in Python […]

Huy T Nguyen

April 24, 2017

Can you solve the error on line 71? It says record is out of range. Thank you

1. Shirish Kadam

  July 2, 2017

  Hi, what was the input data? How can I recreate the bug?

NLP: Question Classification using Support Vector Machines [spacy][scikit-learn][pandas] – SHIRISH KADAM

July 3, 2017

[…] You can also experiment with a Bayesian Classifier (Refer: Naive Bayes Classifier in Python): […]