Naive Bayes Classifier in Python

Naive Bayes is probably the most widely used text classifier, and it is a supervised learning algorithm. It can be used to classify blog posts or news articles into categories like sports, entertainment and so forth, or to detect spam emails, but most importantly it is widely used in sentiment analysis. So first of all, what is supervised learning? It means that we are given a labeled training dataset: the inputs along with their respective outputs. From this training dataset, our algorithm learns to infer the output for a new, unseen input.

The basics:

Conditional Probability: It is simply the probability that something will happen, given that something else has already happened. It's a way to handle dependent events.

So from the multiplication rule (here A and B are dependent events):

P(A ∩ B) = P(A) · P(B|A)

Rearranging the above equation gives us the formula for conditional probability:

P(B|A) = P(A ∩ B) / P(A)
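
As a quick sanity check, here is a tiny worked example of estimating P(B|A) from counts. The events and numbers are invented purely for illustration:

```python
# Hypothetical counts over 100 emails (numbers invented for illustration)
total = 100
count_A = 40            # emails containing the word "offer"      -> event A
count_A_and_B = 30      # emails with "offer" that are also spam  -> A ∩ B

p_A = count_A / total                  # P(A) = 0.40
p_A_and_B = count_A_and_B / total      # P(A ∩ B) = 0.30

p_B_given_A = p_A_and_B / p_A          # P(B|A) = P(A ∩ B) / P(A)
print(p_B_given_A)                     # 0.75
```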

Bayes’ theorem: It describes the probability of an event based on the conditions or attributes that might be related to the event:

P(A|B) = [P(B|A) · P(A)] / P(B)
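
Plugging in some made-up numbers, Bayes’ theorem lets us flip a conditional probability we know into the one we actually want (again, the figures below are hypothetical):

```python
# Hypothetical spam-filter numbers, purely for illustration
p_spam = 0.2                 # P(A): prior probability that an email is spam
p_word_given_spam = 0.6      # P(B|A): probability "offer" appears, given spam
p_word = 0.25                # P(B): probability "offer" appears in any email

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_spam_given_word = p_word_given_spam * p_spam / p_word
print(p_spam_given_word)     # 0.48
```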

So, our classifier can be written as:

Assume a problem instance to be classified, represented by a vector x = (x1, x2, x3, …, xn) of n attributes. Here y is our class variable. The "naive" part is the assumption that the attributes are conditionally independent given the class, so the joint likelihood factors into a product of per-attribute probabilities:

P(y | x1, x2, …, xn) ∝ P(y) · P(x1|y) · P(x2|y) · … · P(xn|y)

and we pick the class that maximizes this product:

ŷ = argmax over y of P(y) · P(x1|y) · P(x2|y) · … · P(xn|y)

Here we have eliminated the denominator P(x1, x2, x3, …, xn) because it is the same for every class y, so it doesn't affect which class scores highest.
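
To make this concrete, here is a minimal from-scratch sketch of the classifier above. The toy dataset and function names are my own for illustration and are not taken from the repository linked below:

```python
from collections import Counter, defaultdict

def train(documents):
    """documents: a list of (list_of_words, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)      # word_counts[label][word]
    for words, label in documents:
        class_counts[label] += 1
        word_counts[label].update(words)
    return class_counts, word_counts

def classify(words, class_counts, word_counts):
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -1.0
    for label in class_counts:
        score = class_counts[label] / total_docs        # P(y)
        total_words = sum(word_counts[label].values())
        for word in words:
            # P(xi|y): raw frequency estimate (no smoothing yet)
            score *= word_counts[label][word] / total_words
        if score > best_score:
            best_label, best_score = label, score
    return best_label

docs = [("a great game".split(), "sports"),
        ("the election was over".split(), "politics"),
        ("very clean match".split(), "sports")]
model = train(docs)
print(classify("clean game".split(), *model))           # sports
```

Note that a word never seen with a class drives that class's entire score to zero, which is exactly the first problem discussed below.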

Now, to make sure our algorithm holds up well against real datasets, we need to take the following conditions into account.

The Zero Frequency problem: Consider the case where a given attribute value and a class never occur together in the training data, so the frequency-based probability estimate for that pair is zero. This single zero wipes out the information in all the other probabilities when multiplied (multiplied by zero…duh…!). The simple solution is to apply Laplace estimation, i.e. we add a pseudocount to all probability estimates (as if assuming a uniform prior over all attribute values) so that no probability is ever exactly zero.
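
As a sketch (the helper name and the `k` parameter are mine), the smoothed estimate of P(word|label) would replace the raw frequency used in `classify` above:

```python
def word_likelihood(word, label, word_counts, vocab_size, k=1):
    # Laplace-smoothed P(word|label):
    # (count + k) / (total words in class + k * vocabulary size)
    # With k = 1 this is classic add-one smoothing; no estimate is ever zero.
    total_words = sum(word_counts[label].values())
    return (word_counts[label][word] + k) / (total_words + k * vocab_size)
```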

Floating Point Underflow: Multiplying many small probabilities can push the result below the smallest representable floating-point value, so to avoid this we take logarithms of the probabilities. Accordingly, we apply the logarithmic rules to our classifier: products of probabilities become sums of their logs.
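
A sketch in log space, reusing the hypothetical `word_likelihood` helper from above; the product of likelihoods becomes a sum of logarithms, which stays comfortably in range:

```python
import math

def log_score(words, label, class_counts, word_counts, vocab_size):
    total_docs = sum(class_counts.values())
    # log P(y) + sum of log P(xi|y), instead of multiplying raw probabilities
    score = math.log(class_counts[label] / total_docs)
    for word in words:
        score += math.log(word_likelihood(word, label, word_counts, vocab_size))
    return score  # pick the label with the highest log score, as before
```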

I have implemented a Naive Bayes Classifier in Python and you can find it on GitHub. If you have any improvements to add or any suggestions, let me know in the comments section below.


