NLP: Question Classification using Support Vector Machines [spacy][scikit-learn][pandas]


For the past couple of months I have been working on a Question Answering System, and in my upcoming blog posts I would like to share some of the things I learnt in the process. I haven't yet reached a satisfactory accuracy with the answers fetched by the system, but it is a work in progress: Adam QAS on GitHub.

In this post, we are specifically going to focus on the Question Classification part. The goal is to classify a given input question into predefined categories. This classification will help us in Query Construction / Modelling phases.

[Image: ADAM – Poster]

So, before we begin, let's make sure our environment is all set up: Setting up Natural Language Processing Environment with Python. For the question's language processing we are going to use spaCy, for the machine learning part scikit-learn, and for the data frames I prefer pandas. Note: I am using Python 3.

$ pip3 install -U scikit-learn
$ pip3 install pandas

Now that our environment is all set, we need a training dataset to train our classifier. I am using the dataset from the Cognitive Computation Group at the Department of Computer Science, University of Illinois at Urbana-Champaign: Training set 5 (5,500 labelled questions; for more, visit here).

DESC:manner How did serfdom develop in and then leave Russia ?
ENTY:cremat What films featured the character Popeye Doyle ?
DESC:manner How can I find a list of celebrities ' real names ?
...

Preparing the Training Data for the SVM

For this classifier, we will be using a Linear Support Vector Machine. Let us first identify the features in a question that affect its classification, and then train our classifier on those features.

  1. WH-word: The WH-word in a question holds a lot of information about the intent of the question and what it is basically trying to seek. (What, When, How, Where and so on)
  2. WH-word POS: The part of speech of the WH-word (wh-determiner, wh-pronoun, wh-adverb).
  3. POS of the word next to the WH-word: The part of speech of the word adjacent to the WH-word, i.e. the word at the 1st position in the bigram (the 0th being the WH-word).
  4. Root POS: The part of speech of the word at the root of the dependency parse tree.

Note: We will also extract the WH-bigram (just for reference). A bigram is nothing but two consecutive words; in this case we take the WH-word and the word that follows it. (What is, How many, Where do…)

We have to extract these features from our labelled dataset and store them in a CSV file along with the respective label. This is where spaCy comes into action: it gives us the part of speech and the dependency relation of each token in the question.

import spacy
import csv

clean_old_data()  # project helper that clears old training data from the CSV
en_nlp = spacy.load("en_core_web_md")  # load spaCy's English model

First, we load the English language model and clear the old training data from our CSV file. Then we read the raw labelled data, extract the features for each question, and write these features and labels to a CSV file.

read_input_file(fp, en_nlp)

This function splits each line of the raw data into the question and its respective label and passes them on for further NLP processing.
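
The actual helper lives in the Adam QAS repository; here is a minimal sketch of what read_input_file() might look like, assuming the label-then-question format shown above and keeping only the coarse class before the colon:

def read_input_file(fp, en_nlp):
    # Sketch only: the helper in the repository may differ.
    with open(fp, "r") as f:  # the file encoding may need adjusting for this dataset
        for line in f:
            line = line.strip()
            if not line:
                continue
            # "DESC:manner How did serfdom develop ..." ->
            # label = "DESC:manner", question = "How did serfdom develop ..."
            label, question = line.split(" ", 1)
            coarse_class = label.split(":")[0]  # e.g. "DESC"
            process_question(question, coarse_class, en_nlp)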

def process_question(question, qclass, en_nlp):
    en_doc = en_nlp(question)      # run the question through the spaCy pipeline
    sent = list(en_doc.sents)[0]   # operate only on the first sub question
    wh_bi_gram = []
    root_token = ""                # POS tag of the dependency-tree root
    wh_pos = ""
    wh_nbor_pos = ""
    wh_word = ""
    for token in sent:
        # WDT, WP, WP$ and WRB are the WH-word tags
        if token.tag_ in ("WDT", "WP", "WP$", "WRB"):
            wh_pos = token.tag_
            wh_word = token.text
            wh_bi_gram.append(token.text)
            if token.i + 1 < len(en_doc):   # guard against a WH-word at the very end
                wh_bi_gram.append(en_doc[token.i + 1].text)
                wh_nbor_pos = en_doc[token.i + 1].tag_
        if token.dep_ == "ROOT":
            root_token = token.tag_

    write_each_record_to_csv(question, qclass, wh_pos, wh_word, wh_bi_gram, wh_nbor_pos, root_token)

The above function feeds the question into the NLP pipeline (en_doc = en_nlp(question)) and obtains a Doc object containing linguistic annotations of the question. The Doc also performs sentence boundary detection/segmentation, giving us a list of sentences that act as the decomposed or sub questions (here I only operate on the first sub question). We then iterate over each token in the sentence to get its part-of-speech tag and dependency label. To extract the WH-word we look for the WDT, WP, WP$ and WRB tags, and to extract the root token of the sentence we look for the dependency label ROOT. After writing all the records to the training data CSV file, it looks something like this:

#Question|WH|WH-Bigram|WH-POS|WH-NBOR-POS|Root-POS|Class
How did serfdom develop in and then leave Russia ?|How|How did|WRB|VBD|VB|DESC
What films featured the character Popeye Doyle ?|What|What films|WP|NNS|VBD|ENTY
...
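
The writer helper is also part of the repository; a minimal sketch, assuming the signature used above and the pipe-separated layout shown here (the csv module was imported earlier for exactly this; the file name and header handling are illustrative):

def write_each_record_to_csv(question, qclass, wh_pos, wh_word, wh_bi_gram, wh_nbor_pos, root_token):
    # Sketch only: appends one pipe-separated record matching the header above.
    # Assumes the header row was written when the old data was cleaned.
    with open("qclassifier_trainer.csv", "a", newline="") as f:
        writer = csv.writer(f, delimiter="|")
        writer.writerow([question, wh_word, " ".join(wh_bi_gram),
                         wh_pos, wh_nbor_pos, root_token, qclass])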

Training the SVM and Prediction

from sklearn.svm import LinearSVC
import pandas

I prefer pandas over sklearn.datasets. First, we load our training dataset CSV file into a pandas DataFrame; this data frame holds all the extracted features in column-row fashion. To train our classifier we need to separate the feature columns from the class/label column, so we pop the label column from the data frame and store it separately. Along with that, we also pop some columns we don't need as features.

dta = pandas.read_csv('qclassifier_trainer.csv', sep='|', header=0)  # the file name here is illustrative

y = dta.pop('Class')        # class labels, kept aside for training
dta.pop('#Question')        # the raw question text is not used as a feature
dta.pop('WH-Bigram')        # kept only for reference, not as a feature

X_train = pandas.get_dummies(dta)

Here, the get_dummies() function converts the categorical values into dummy (binary indicator) values. What this means is that if a record looks like the one below, it is converted to its binary form, with 1 meaning the feature value is present in the record and 0 meaning it is absent.

   ...
5. How|WRB|VBD|VB
   ...
#  Why How What When Where ... WRB WDT ... VBD VB VBN VBP ...
   ...
5.  0   1    0    0    0   ...  1   0  ...  1   1  0   0  ...
   ...
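
As a toy illustration of what get_dummies() does (note that pandas actually prefixes each dummy column with the original column name, e.g. WH_How, WH-POS_WRB):

import pandas

df = pandas.DataFrame({'WH': ['How', 'What'], 'WH-POS': ['WRB', 'WP']})
print(pandas.get_dummies(df))
# Produces indicator columns WH_How, WH_What, WH-POS_WP, WH-POS_WRB,
# with a 1 (or True) where the value occurs in the row and 0 (or False) elsewhere.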

In the next phase, we extract the same features from the question we want to classify, put them into a data frame, and get its dummy values. Here is the kind of data frame get_question_predict_features() will return:

qdata_frame = [{'WH': wh_word, 'WH-POS': wh_pos, 'WH-NBOR-POS': wh_nbor_pos, 'Root-POS': root_token}]

en_doc = en_nlp(question_to_predict)
question_data = get_question_predict_features(en_doc)
X_predict = pandas.get_dummies(question_data)
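
get_question_predict_features() mirrors the feature extraction we did for training; a hedged sketch (the version in the repository may differ):

def get_question_predict_features(en_doc):
    # Extract the same WH, POS and root features from the question to classify
    sent = list(en_doc.sents)[0]
    wh_word, wh_pos, wh_nbor_pos, root_token = "", "", "", ""
    for token in sent:
        if token.tag_ in ("WDT", "WP", "WP$", "WRB"):
            wh_word = token.text
            wh_pos = token.tag_
            if token.i + 1 < len(en_doc):
                wh_nbor_pos = en_doc[token.i + 1].tag_
        if token.dep_ == "ROOT":
            root_token = token.tag_
    qdata_frame = [{'WH': wh_word, 'WH-POS': wh_pos,
                    'WH-NBOR-POS': wh_nbor_pos, 'Root-POS': root_token}]
    return pandas.DataFrame(qdata_frame)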

The problem here is that the sizes (numbers of feature columns) of the prediction data frame and the training data frame differ, due to the absence of some features in the prediction data frame. It is obvious that the single question to be classified will be missing a majority of the feature values present in the training dataset of 5,500 questions. So, to equalise the number of columns, we append the feature columns that are present in the training data frame but missing from the prediction data frame, filled with the value 0 (because these features are not present in the question to classify).

from scipy.sparse import csr_matrix


def transform_data_matrix(X_train, X_predict):
    # Union of all feature columns seen in training and prediction
    X_train_columns = list(X_train.columns)
    X_predict_columns = list(X_predict.columns)
    X_trans_columns = list(set(X_train_columns + X_predict_columns))
    # print(X_trans_columns, len(X_trans_columns))

    trans_data_train = {}
    for col in X_trans_columns:
        if col not in X_train:
            # Column only seen at prediction time: all zeros for the training rows
            trans_data_train[col] = [0 for _ in range(len(X_train.index))]
        else:
            trans_data_train[col] = list(X_train[col])

    XT_train = pandas.DataFrame(trans_data_train)
    XT_train = csr_matrix(XT_train)   # sparse representation, since most entries are zero

    trans_data_predict = {}
    for col in X_trans_columns:
        if col not in X_predict:
            # Column only seen in training: zero for the question to classify
            trans_data_predict[col] = 0
        else:
            trans_data_predict[col] = list(X_predict[col])

    XT_predict = pandas.DataFrame(trans_data_predict)
    XT_predict = csr_matrix(XT_predict)

    return XT_train, XT_predict


X_train, X_predict = transform_data_matrix(X_train, X_predict)
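
As an aside (not what the code above does), pandas can also align the prediction columns to the training columns in a single call with reindex(); feature values that never appeared in training carry no information for the classifier anyway, so they can simply be dropped:

def transform_data_matrix_reindex(X_train, X_predict):
    # Alternative sketch: keep exactly the training columns, filling missing ones with 0
    X_predict = X_predict.reindex(columns=X_train.columns, fill_value=0)
    return csr_matrix(X_train), csr_matrix(X_predict)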

Once both data frames have the same set of columns, we classify the question using a Linear Support Vector Machine trained on the training dataset. The LinearSVC model is fitted with the training features and their respective labels, and the fitted object is then used to predict the class of the prediction data. It returns the question class/category.

Note: The DataFrame has mostly zero entries, hence we convert it into a sparse matrix representation; csr_matrix() takes care of that (from scipy.sparse import csr_matrix).

print("Question Class:", support_vector_machine(X_train, y, X_predict))
def support_vector_machine(X_train, y, X_predict):
    lin_clf = LinearSVC()
    lin_clf.fit(X_train, y)
    prediction = lin_clf.predict(X_predict)
    return prediction

You can also experiment with a Bayesian Classifier (Refer: Naive Bayes Classifier in Python):

from sklearn.naive_bayes import GaussianNB


def naive_bayes_classifier(X_train, y, X_predict):
    gnb = GaussianNB()
    # GaussianNB requires dense input, so convert the sparse matrices first
    gnb.fit(X_train.toarray(), y)
    prediction = gnb.predict(X_predict.toarray())
    return prediction

Fork it on GitHub: Adam QAS.



Setting up Natural Language Processing Environment with Python


In this blog post, I will be discussing tools for Natural Language Processing in a Linux environment, although most of them also apply to Windows and Mac. So, let's get started with some prerequisites.
We will use Python's pip package installer to install the various Python modules.

$ sudo apt install python-pip
$ pip install -U pip

So I am going to talk about three NLP tools in Python that I have worked with so far.

Note: It is highly recommended to install these modules in a Virtual Environment. Here is how you do that : Common Python Tools: Using virtualenv, Installing with Pip, and Managing Packages.

  1. Natural Language Toolkit :

    NLTK can be seen as a library written for educational purposes, and hence is great to experiment with. As its own website notes, NLTK has been called “a wonderful tool for teaching, and working in, computational linguistics using Python,” and “an amazing library to play with natural language.” To install NLTK we use pip:

    $ sudo pip install -U nltk

    NLTK also comes with its own corpora, which can be downloaded as follows:

    >>> import nltk
    >>> nltk.download()

    We can also interface NLTK with our own corpora. For detailed usage of the NLTK API, one can refer to its official guide, “Natural Language Processing with Python” by Steven Bird. I will be covering more about NLTK and its API in the upcoming posts, but for now we will settle for its installation.

  2. spaCy :

    In the words of Matthew Honnibal (author of spaCy):

    “There’s a real philosophical difference between spaCy and NLTK. spaCy is written to help you get things done. It’s minimal and opinionated. We want to provide you with exactly one way to do it — the right way. In contrast, NLTK was created to support education. Most of what’s there is for demo purposes, to help students explore ideas. spaCy provides very fast and accurate syntactic analysis (the fastest of any library released), and also offers named entity recognition and ready access to word vectors. You can use the default word vectors, or replace them with any you have.

    What really sets it apart, though, is the API. spaCy is the only library that has all of these features together, and allows you to easily hop between these levels of representation. Here’s an example of how that helps. Tutorial: Search Reddit for comments about Google doing something. spaCy also ensures that the annotations are always aligned to the original string, so you can easily print mark-up: Tutorial: Mark all adverbs, particularly for verbs of speech.”

    Benchmarks are provided on its official website.
    Here are some of the things I have tried with spaCy, and it's my favorite NLP tool. In the upcoming posts I will delve into each of its APIs, so keep an eye out here (spaCy).
    Installation:

    $ sudo pip install -U spacy
    $ sudo python -m spacy download en_core_web_md

    What makes spaCy easy to work with is its well-maintained and well-presented documentation. They have also made some great demos, like displaCy for the dependency parser and named entity recognizer. Check them out here.

  3. TextBlob :

    It is more of a text processing library than an NLP library. It is simple and lightweight, and is your go-to library when you have to perform basic NLP operations such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.
    Installation:

    $ sudo pip install -U textblob
    $ sudo python -m textblob.download_corpora

    You can use it to check word definitions and synonyms, and the similarity between different words. I will be doing an independent post on TextBlob with WordNet and its SynSets; a small example follows below.
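
As a quick, hedged taste of the kind of thing TextBlob makes easy (the sentence and word here are just illustrative):

from textblob import TextBlob, Word

blob = TextBlob("spaCy and TextBlob make natural language processing approachable.")
print(blob.tags)             # part-of-speech tags for each word
print(blob.noun_phrases)     # extracted noun phrases
print(blob.sentiment)        # polarity and subjectivity scores

word = Word("language")
print(word.definitions)      # WordNet definitions
print(word.synsets)          # WordNet synsets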

Check out my other posts on Natural Language Processing.


If you have any problems with the installation or other comments, let me know below in the comments section. Meanwhile, you can also check out my other post on a machine learning classification algorithm: Naive Bayes Classifier in Python.

Naive Bayes Classifier in Python


The Naive Bayes classifier is probably the most widely used text classifier; it is a supervised learning algorithm. It can be used to classify blog posts or news articles into categories like sports and entertainment, or to detect spam emails. Most importantly, it is widely used in sentiment analysis. So first of all, what is supervised learning? It means that a labelled training dataset is provided, containing inputs along with their respective outputs. From this training dataset, our algorithm learns to infer the output for a new input.

The basics:

Conditional Probability: It is simply the probability that something will happen given that something else has already happened. It's a way to handle dependent events. You can check out some examples of conditional probability here.

So from the multiplication rule (here A and B are dependent events):

P(A ∩ B) = P(A) · P(B|A)

Rearranging the above equation gives the formula for conditional probability:

P(B|A) = P(A ∩ B) / P(A)
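
For example (with made-up numbers), if 60% of days are cloudy, P(A) = 0.6, and 30% of days are both cloudy and rainy, P(A ∩ B) = 0.3, then the probability of rain given that it is cloudy is P(B|A) = 0.3 / 0.6 = 0.5.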

Bayes’ theorem: It describes the probability of an event based on conditions or attributes that might be related to the event:

P(A|B) = [P(B|A) · P(A)] / P(B)

So, our classifier can be written as follows.

Assume a problem instance to be classified, represented by a vector x = (x1, x2, x3, …, xn) of n attributes. Here y is our class variable. Then

P(y|x1, x2, …, xn) ∝ P(y) · P(x1|y) · P(x2|y) · … · P(xn|y)

and the predicted class is the y that maximises this expression.

Here we have eliminated the denominator P(x1, x2, …, xn) because it does not depend on the class y and therefore doesn't contribute to the final decision.

Now, to make sure our algorithm holds up well against real datasets, we need to take the following conditions into account.

The Zero Frequency problem: Consider the case where a given attribute value and class never occur together in the training data, making the frequency-based probability estimate zero. Since the probabilities are multiplied together, this single zero wipes out all the information in the other probabilities (multiplied by zero… duh…!). The simple solution is Laplace estimation: we assume a uniform prior over all attribute values, i.e. we add a pseudocount to every probability estimate so that no probability is ever exactly zero.

Floating Point Underflow: Multiplying many small probabilities can underflow the floating-point range, so to avoid this we take logarithms of the probabilities and replace the product with a sum of log-probabilities. Accordingly, we apply the logarithm rules to our classifier; both fixes appear in the sketch below.
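
A minimal sketch of these two fixes for categorical features (an illustration of the idea, not the implementation from my repository):

import math
from collections import Counter, defaultdict

def train_naive_bayes(records, labels):
    # Count class frequencies and per-class attribute-value frequencies
    class_counts = Counter(labels)
    attr_counts = defaultdict(Counter)   # (class, attribute index) -> value counts
    for record, label in zip(records, labels):
        for i, value in enumerate(record):
            attr_counts[(label, i)][value] += 1
    return class_counts, attr_counts

def predict_naive_bayes(record, class_counts, attr_counts, alpha=1.0):
    total = sum(class_counts.values())
    best_class, best_score = None, float("-inf")
    for label, count in class_counts.items():
        # Sum of logs instead of a product, to avoid floating point underflow
        score = math.log(count / total)
        for i, value in enumerate(record):
            counts = attr_counts[(label, i)]
            # Laplace smoothing: add the pseudocount alpha so unseen values
            # never produce a zero probability (len(counts) + 1 stands in for
            # the number of possible values of this attribute)
            num = counts[value] + alpha
            den = sum(counts.values()) + alpha * (len(counts) + 1)
            score += math.log(num / den)
        if score > best_score:
            best_class, best_score = label, score
    return best_class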

I have implemented a Naive Bayes classifier in Python and you can find it on GitHub. If you have any improvements or suggestions, let me know in the comments section below.

