Setting up Natural Language Processing Environment with Python


In this blog post, I will discuss the Natural Language Processing tools I use in a Linux environment, although most of this also applies to Windows and Mac. So, let’s get started with some prerequisites.
We will use pip, Python’s package installer, to install the various Python modules.

$ sudo apt install python-pip
$ pip install --upgrade pip

So I am going to talk about three NLP tools in Python that I have worked with so far.

  1. Natural Language Toolkit (NLTK) :

    NLTK was written with education in mind, which makes it great to experiment with. Its own website notes that NLTK has been called “a wonderful tool for teaching, and working in, computational linguistics using Python,” and “an amazing library to play with natural language.” To install NLTK we use pip :

    $ sudo pip install -U nltk

    NLTK also comes with its own corpora, which can be downloaded as follows:

    >>> import nltk
    >>> nltk.download()

    We can also interface NLTK with our own corpora. For detailed usage of the NLTK API, refer to its official guide, “Natural Language Processing with Python” by Steven Bird, Ewan Klein, and Edward Loper. I will cover more of NLTK and its API in upcoming posts, but for now we will settle for its installation.

  2. spaCy :

    In the words of Matthew Honnibal (author of spaCy):

    “There’s a real philosophical difference between spaCy and NLTK. spaCy is written to help you get things done. It’s minimal and opinionated. We want to provide you with exactly one way to do it — the right way. In contrast, NLTK was created to support education. Most of what’s there is for demo purposes, to help students explore ideas. spaCy provides very fast and accurate syntactic analysis (the fastest of any library released), and also offers named entity recognition and ready access to word vectors. You can use the default word vectors, or replace them with any you have.

    What really sets it apart, though, is the API. spaCy is the only library that has all of these features together, and allows you to easily hop between these levels of representation. Here’s an example of how that helps. Tutorial: Search Reddit for comments about Google doing something. spaCy also ensures that the annotations are always aligned to the original string, so you can easily print mark-up: Tutorial: Mark all adverbs, particularly for verbs of speech.”

    Benchmarks are published on its official website.
    spaCy is my favourite NLP tool, and I have tried quite a few things with it. In the upcoming posts I will delve into each of its APIs.
    Installation :

    $ sudo pip install -U spacy
    $ sudo python -m
  3. TextBlob :

    It is more of a text processing library than an NLP library. It is simple and lightweight, and it is your go-to library when you have to perform basic NLP operations such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.
    Installation :

    $ sudo pip install -U textblob
    $ sudo python -m textblob.download_corpora

    It can also be used to check word definitions, synonyms, and the similarity between different words. I will do a separate post on TextBlob with WordNet and its synsets.
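Once the three installs above have finished, a quick sanity check can save some head-scratching later. The snippet below is only a minimal sketch: the helper name is_installed is my own, and it merely checks that each module can be found by Python, not that its corpora or models were downloaded.

```python
import importlib.util


def is_installed(module_name):
    """Return True if Python can locate the named top-level module."""
    return importlib.util.find_spec(module_name) is not None


# Module names as used in "import ..." statements for the three tools.
for name in ("nltk", "spacy", "textblob"):
    status = "installed" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

If any of them shows up as missing, re-run the corresponding pip command and check which Python interpreter pip is installing into.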


If you have any problems with the installations, or any other comments, let me know in the comments section below. Meanwhile, you can also check out my other post on a machine learning classification algorithm: Naive Bayes Classifier in Python.

Naive Bayes Classifier in Python


The Naive Bayes classifier is probably the most widely used text classifier. It is a supervised learning algorithm that can be used to classify blog posts or news articles into categories such as sports, entertainment and so forth, or to detect spam emails. Most importantly, it is widely used in sentiment analysis. So first of all, what is supervised learning? It means that we are given a labeled training dataset, in which each input comes with its respective output. From this training dataset, our algorithm infers the outcome for a new input.

The basics:

Conditional Probability : It is simply the probability that something will happen, given that something else has happened. It’s a way to handle dependent events. You can check out some examples of conditional probability here.

So from the multiplication rule (here A and B are dependent events):

P(A ∩ B) = P(A) · P(B|A)

Now from the above equation we get the formula for conditional probability:

P(B|A) = P(A ∩ B) / P(A)
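To make the conditional probability formula concrete, here is a small worked example. The eight "emails" are made-up toy data and the helper name probability is my own; the point is only to show the joint probability of A and B being divided by P(A).

```python
from fractions import Fraction

# Hypothetical toy data: each record is (is_spam, contains_word_offer).
emails = [
    (True, True), (True, True), (True, False),
    (False, True), (False, False), (False, False),
    (False, False), (False, False),
]


def probability(events, predicate):
    """Fraction of records for which the predicate holds."""
    matches = sum(1 for event in events if predicate(event))
    return Fraction(matches, len(events))


# A = "email is spam", B = "email contains the word 'offer'"
p_a = probability(emails, lambda e: e[0])
p_a_and_b = probability(emails, lambda e: e[0] and e[1])

# Conditional probability: P(B|A) = P(A and B) / P(A)
p_b_given_a = p_a_and_b / p_a
print(p_b_given_a)  # 2/3: two of the three spam emails contain "offer"
```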

Bayes’ theorem : It describes the probability of an event based on the conditions or attributes that might be related to the event:

P(A|B) = [P(B|A) · P(A)] / P(B)
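As a quick illustration of Bayes’ theorem, the sketch below plugs in some invented numbers for a spam filter. The probabilities and the function name bayes are assumptions for the sake of the example, not values from any real dataset.

```python
from fractions import Fraction


def bayes(p_b_given_a, p_a, p_b):
    """Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b


# Hypothetical numbers: A = "email is spam", B = "email contains 'offer'".
p_spam = Fraction(1, 5)                  # P(A)
p_offer_given_spam = Fraction(3, 5)      # P(B|A)
p_offer_given_ham = Fraction(1, 20)      # P(B|not A)

# Total probability of seeing the word "offer" at all.
p_offer = p_offer_given_spam * p_spam + p_offer_given_ham * (1 - p_spam)

p_spam_given_offer = bayes(p_offer_given_spam, p_spam, p_offer)
print(p_spam_given_offer)  # 3/4
```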

So, our classifier can be written as :

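Assume a problem instance to be classified, represented by a vector x = (x1, x2, x3, …, xn) of n attributes. Here y is our class variable. Applying Bayes’ theorem and assuming that the attributes are conditionally independent given the class:

P(y | x1, x2, …, xn) ∝ P(y) · P(x1|y) · P(x2|y) · … · P(xn|y)

ŷ = argmax over y of P(y) · P(x1|y) · P(x2|y) · … · P(xn|y)

Here we have eliminated the denominator P(x1, x2, x3, …, xn) because it is the same for every class and therefore doesn’t contribute to our final solution.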
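The decision rule described above (pick the class with the largest prior multiplied by the per-attribute likelihoods) can be sketched in a few lines. This is a bare-bones illustration with made-up training data and my own function names, not the implementation from my Github repository, and it deliberately ignores the two practical problems discussed next.

```python
from collections import Counter, defaultdict


def train(samples):
    """samples: list of (words, label). Returns class counts and per-class word counts."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    for words, label in samples:
        class_counts[label] += 1
        word_counts[label].update(words)
    return class_counts, word_counts


def classify(words, class_counts, word_counts):
    """Return the class y that maximises P(y) * P(x1|y) * ... * P(xn|y)."""
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -1.0
    for label, doc_count in class_counts.items():
        score = doc_count / total_docs              # prior P(y)
        words_in_class = sum(word_counts[label].values())
        for word in words:
            # Raw frequency estimate of P(word | y); zero for unseen words.
            score *= word_counts[label][word] / words_in_class
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy training set.
samples = [
    (["goal", "match", "team"], "sports"),
    (["team", "win", "match"], "sports"),
    (["movie", "actor", "film"], "entertainment"),
    (["film", "award", "actor"], "entertainment"),
]
model = train(samples)
print(classify(["match", "team"], *model))  # sports
```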

Now, to make sure our algorithm holds up well against our datasets, we need to take the following conditions into account.

The Zero Frequency problem : Consider the case where a given attribute value and class never occur together in the training data, causing the frequency-based probability estimate to be zero. This one zero will wipe out all the information in the other probabilities when they are multiplied together (multiplied by zero…duh…!). The simple solution is to apply Laplace estimation, which amounts to assuming a uniform prior over all attribute values, i.e. we simply add a pseudocount to every count so that no probability estimate is ever exactly zero.
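Here is roughly what the pseudocount looks like in code. It is a minimal sketch of add-alpha (Laplace) smoothing; the function and parameter names are my own, and the counts in the usage lines are invented.

```python
def smoothed_probability(count, total, vocabulary_size, alpha=1.0):
    """Laplace (add-alpha) estimate of P(attribute | class).

    count:           times the attribute was seen with this class
    total:           total attribute occurrences seen with this class
    vocabulary_size: number of distinct attribute values
    alpha:           pseudocount added to every attribute value
    """
    return (count + alpha) / (total + alpha * vocabulary_size)


# A word never seen with this class no longer gets probability zero.
print(smoothed_probability(0, 6, 8))
# A word seen twice still scores higher than an unseen one.
print(smoothed_probability(2, 6, 8))
```

Because the same pseudocount is added for every attribute value, the smoothed estimates for a class still sum to one.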

Floating Point Underflow : Multiplying many small probabilities can produce a value too small for the floating point range. To avoid this we take logarithms of the probabilities, which turns the product in our classifier into a sum of log-probabilities.
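The underflow is easy to reproduce. In the sketch below, the hundred identical probabilities are an artificial example chosen to force the problem; with real data the values differ, but the effect is the same.

```python
import math

# Hypothetical example: 100 attributes, each with probability 0.00001.
probabilities = [1e-5] * 100

# Naive product: the true value (1e-500) is far below what a double can hold.
product = 1.0
for p in probabilities:
    product *= p
print(product)  # 0.0

# Sum of logarithms: the same information, comfortably within range.
log_score = sum(math.log(p) for p in probabilities)
print(log_score)  # roughly -1151.3
```

Since the logarithm is monotonic, the class with the largest log score is also the class with the largest product, so the classifier's decision does not change.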

I have implemented a Naive Bayes classifier in Python and you can find it on Github. If you have any improvements to add or any suggestions, let me know in the comments section below.



Flickr Wallpapers daily for Linux Mint & Ubuntu


Linux has very few options for an ‘Automatic Wallpaper Setter’. Variety is at the top. I loved it: it has lots of image sources, effects, a digital clock and daily quotes. You can also set your wallpaper-switching interval. It is really worth checking out. But what I wanted was a simple script that would download images from Flickr daily and set them as my desktop wallpaper. So here is the Python script to download and set wallpapers on Linux Mint and Ubuntu from Flickr based on an image’s interestingness.

So, to get a list of interesting photos from Flickr we will require Flickr API keys (used strictly for non-commercial use). You can get them from here. The API we are going to use for our script is flickr.interestingness.getList; more info about this API can be found here. You can set its arguments as per your requirements. For this script, I am requesting the response in XML format but you can also get it in JSON. When we are done tuning the API request, it will look something like this.

(You need to put your API key there…). The above REST request doesn’t actually return image URLs; it returns the image IDs, server IDs and secret alphanumeric IDs of interesting images for the given date, up to your limit (in this case 5). Now our job is to map these image IDs to their respective static URLs so that it will be convenient for us to download them.
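A request of the kind described above can be assembled with the standard library. Treat this as a sketch: the endpoint and parameter names follow Flickr's public REST documentation as I understand it, the function name is my own, and YOUR_API_KEY is a placeholder you must replace.

```python
from datetime import date, timedelta
from urllib.parse import urlencode

# Assumed Flickr REST endpoint; check the Flickr API docs if it has changed.
FLICKR_REST_ENDPOINT = "https://api.flickr.com/services/rest/"


def build_interestingness_url(api_key, day, per_page=5):
    """Build the flickr.interestingness.getList request URL for one day."""
    params = {
        "method": "flickr.interestingness.getList",
        "api_key": api_key,
        "date": day.isoformat(),   # YYYY-MM-DD
        "per_page": per_page,      # how many photos to return
    }
    return FLICKR_REST_ENDPOINT + "?" + urlencode(params)


# Ask for photos from two days ago, using a placeholder key.
two_days_ago = date.today() - timedelta(days=2)
print(build_interestingness_url("YOUR_API_KEY", two_days_ago))
```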


With the help of the above API, we can construct the source URLs of the images. At the tail of this API, we need to specify the size of the image we want to download. More info about this can be found here.
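Mapping the IDs to a static image URL is plain string formatting. The sketch below assumes the live.staticflickr.com URL pattern from Flickr's documentation (older material uses a farm-based host instead); the function name and the sample IDs are made up.

```python
def build_photo_url(server_id, photo_id, secret, size="b"):
    """Build a static Flickr image URL.

    size is the one-letter size suffix at the tail of the URL;
    "b" is assumed here to mean the large (1024 px) variant.
    """
    return (
        "https://live.staticflickr.com/"
        f"{server_id}/{photo_id}_{secret}_{size}.jpg"
    )


# Hypothetical IDs, just to show the shape of the result.
print(build_photo_url("65535", "1234567890", "abcdef1234"))
```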

Let’s get back to our script. We need to select a date for which to retrieve interesting images. Ideally, this date should be at least two days in the past. The limit on the number of images should be around five, as you will need only one wallpaper daily. We will now generate our API URL and send it as an HTTP request, which returns an XML response that we have to parse (a sample XML response from the Flickr Interesting API can be found here). Parsing XML in Python is quite simple and can be done with ElementTree XML. After parsing the response we save the attributes we need in separate variables and replace any whitespace in the titles with underscores (to avoid any kind of conflict while setting the wallpaper). With the help of these attributes we generate our static URLs and add them to a List[]. We also create a Dictionary{} to maintain a mapping between URLs and image titles {'photo_url': 'photo_title'}.
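The parsing step might look something like the following. The sample XML is hand-written to resemble a Flickr response (the IDs and titles are invented), the static URL pattern is an assumption, and the function name is my own; the real script on Github may differ in the details.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the shape of a Flickr REST response.
SAMPLE_RESPONSE = """<rsp stat="ok">
  <photos page="1" pages="100" perpage="2" total="500">
    <photo id="111" secret="aaa" server="65535" title="Morning Mist" />
    <photo id="222" secret="bbb" server="65535" title="City  Lights at Night" />
  </photos>
</rsp>"""


def parse_photos(xml_text):
    """Return (list of photo URLs, dict mapping each URL to its title)."""
    root = ET.fromstring(xml_text)
    urls = []
    titles_by_url = {}
    for photo in root.iter("photo"):
        # Replace runs of whitespace with a single underscore.
        title = "_".join(photo.get("title", "").split()) or "untitled"
        url = "https://live.staticflickr.com/{}/{}_{}_b.jpg".format(
            photo.get("server"), photo.get("id"), photo.get("secret")
        )
        urls.append(url)
        titles_by_url[url] = title
    return urls, titles_by_url


urls, titles_by_url = parse_photos(SAMPLE_RESPONSE)
for url in urls:
    print(titles_by_url[url], url)
```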

So, we have five image URLs in a list and we can choose any one randomly (if you want) to set as our wallpaper. Once we have our desired image URL we can download the image file and save it in a directory. In this script, I have set the download directory as: ~/Pictures/Flickr/
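Picking one URL and saving the file could be sketched like this. The function names are my own, the directory is created if it is missing, and the download itself is wrapped in a function so that nothing touches the network until you call it.

```python
import os
import random
import urllib.request

DOWNLOAD_DIR = os.path.expanduser("~/Pictures/Flickr")


def wallpaper_path(title, directory=DOWNLOAD_DIR):
    """Create the download directory if needed and return the .jpg path."""
    os.makedirs(directory, exist_ok=True)
    return os.path.join(directory, title + ".jpg")


def download_random(urls, titles_by_url, directory=DOWNLOAD_DIR):
    """Pick one URL at random, download it and return the saved file path."""
    url = random.choice(urls)
    target = wallpaper_path(titles_by_url[url], directory)
    urllib.request.urlretrieve(url, target)
    return target
```

Calling download_random(urls, titles_by_url) with the list and dictionary built earlier would then return the path of the freshly saved image.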

If this download directory doesn’t exist, it will be created so that images can be saved there (in .jpg format). This script is specifically targeted at the Cinnamon desktop environment [Linux Mint 17.X]. Wallpapers can be changed in Cinnamon as follows:

gsettings set org.cinnamon.desktop.background picture-uri file:////absolute_path_of_image_no_spaces

The same can be achieved in Ubuntu with Unity DE as follows:

gsettings set org.gnome.desktop.background picture-uri file:////absolute_path_of_image_no_spaces
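From Python, the two gsettings commands above can be issued with subprocess. This is a sketch with my own function names; it builds the file URI from an absolute path and only runs gsettings when set_wallpaper is actually called.

```python
import subprocess
from urllib.parse import quote

# gsettings schemas taken from the two commands above.
SCHEMAS = {
    "cinnamon": "org.cinnamon.desktop.background",
    "unity": "org.gnome.desktop.background",
}


def wallpaper_command(image_path, desktop="cinnamon"):
    """Build the gsettings command for an absolute image path."""
    if not image_path.startswith("/"):
        raise ValueError("image_path must be absolute")
    uri = "file://" + quote(image_path)
    return ["gsettings", "set", SCHEMAS[desktop], "picture-uri", uri]


def set_wallpaper(image_path, desktop="cinnamon"):
    """Run gsettings; raises CalledProcessError if the command fails."""
    subprocess.run(wallpaper_command(image_path, desktop), check=True)


print(wallpaper_command("/home/user/Pictures/Flickr/Morning_Mist.jpg"))
```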

Now this script sets the wallpaper only if there is no image for that day in the download directory, but you can change that logic according to your preferences. To automate this script, i.e. to execute it whenever you are connected to the internet, you can try Cuttlefish. More info about Cuttlefish can be found at Ubuntu handbook and OMGUbuntu. If you want to do it the hard way, refer to the Ubuntu guide for OnNetworkConnectionRunScript and this post on AskUbuntu.
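The "only once per day" check mentioned above can be done in several ways. The sketch below is one possibility and an assumption on my part: it treats a .jpg file whose modification date is today as "the image for that day". The function name is hypothetical.

```python
import os
from datetime import date, datetime


def has_image_for_day(directory, day=None):
    """Return True if the directory holds a .jpg last modified on the given day."""
    day = day or date.today()
    if not os.path.isdir(directory):
        return False
    for name in os.listdir(directory):
        if not name.lower().endswith(".jpg"):
            continue
        modified = os.path.getmtime(os.path.join(directory, name))
        if datetime.fromtimestamp(modified).date() == day:
            return True
    return False


if has_image_for_day(os.path.expanduser("~/Pictures/Flickr")):
    print("Wallpaper already downloaded today, nothing to do.")
else:
    print("No wallpaper for today yet.")
```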


This script is available on Github; you can modify it as per your needs. If you encounter any bugs, let me know.