NLP: Question Classification using Support Vector Machines [spacy][scikit-learn][pandas]


For the past couple of months I have been working on a Question Answering System, and in my upcoming blog posts I would like to share some of the things I learnt in the process. I haven't yet reached a satisfactory accuracy with the answers fetched by the system, but it is a work in progress. Adam QAS on GitHub.

In this post, we are specifically going to focus on the Question Classification part. The goal is to classify a given input question into predefined categories. This classification will help us in Query Construction / Modelling phases.


ADAM – Poster

So, before we begin, let's make sure our environment is all set up: Setting up a Natural Language Processing Environment with Python. For the question's language processing part we are going to use spaCy, for the machine learning part scikit-learn, and for the data frames pandas. Note: I am using Python 3.

$ pip3 install -U scikit-learn
$ pip3 install pandas

Now that our environment is all set, we need a training dataset to train our classifier. I am using the dataset from the Cognitive Computation Group at the Department of Computer Science, University of Illinois at Urbana-Champaign: Training set 5 (5,500 labelled questions; for more, visit here).

DESC:manner How did serfdom develop in and then leave Russia ?
ENTY:cremat What films featured the character Popeye Doyle ?
DESC:manner How can I find a list of celebrities ' real names ?

Preparing Training Data for the SVM

For this classifier, we will be using a Linear Support Vector Machine. Now let us identify the features in the question which will affect its classification and train our classifier based on these features.

  1. WH-word: The WH-word in a question holds a lot of information about the intent of the question and what it is basically trying to seek (What, When, How, Where and so on).
  2. WH-word POS: The part of speech of the WH-word (wh-determiner, wh-pronoun, wh-adverb)
  3. POS of the word next to WH-word: The part of speech of the word adjacent to the WH-word, i.e. the word at position 1 in the bigram (position 0 being the WH-word).
  4. Root POS: The part of speech of the word at the root of the dependency parse tree.

Note: We will also extract the WH-bigram (just for reference). A bigram is simply two consecutive words; in this case, the WH-word and the word that follows it (What is, How many, Where do…).

We have to extract these features from our labelled dataset and store them in a CSV file along with the respective label. This is where spaCy comes into action: it enables us to get the part of speech and dependency relation of each token in the question.

import spacy
import csv
en_nlp = spacy.load("en_core_web_md")

First, we load the English language model and clear our CSV file of old training data. Then we read our raw labelled data, extract the features for each question, and store these features and labels in a CSV file.

read_input_file(fp, en_nlp)

This function splits the raw data into the question and its respective label and passes it on for further NLP processing.
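The post does not show `read_input_file` itself; here is a minimal sketch consistent with the label format shown above (the helper name `split_label_and_question` is mine, and `process_question` is the function defined next):

```python
def split_label_and_question(line):
    """Split 'DESC:manner How did ... ?' into (question, coarse class)."""
    label, question = line.strip().split(" ", 1)
    return question, label.split(":")[0]  # keep only the coarse class, e.g. DESC

def read_input_file(fp, en_nlp):
    for line in fp:
        if line.strip():
            question, qclass = split_label_and_question(line)
            process_question(question, qclass, en_nlp)  # defined below
```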

def process_question(question, qclass, en_nlp):
    en_doc = en_nlp(u'' + question)
    sent_list = list(en_doc.sents)
    sent = sent_list[0]
    wh_bi_gram = []
    root_token = ""
    wh_pos = ""
    wh_nbor_pos = ""
    wh_word = ""
    for token in sent:
        # WDT: wh-determiner, WP: wh-pronoun, WP$: possessive wh-pronoun, WRB: wh-adverb
        if token.tag_ in ("WDT", "WP", "WP$", "WRB"):
            wh_pos = token.tag_
            wh_word = token.text
            if token.i + 1 < len(en_doc):  # guard: the WH-word may be the last token
                wh_bi_gram.append(en_doc[token.i + 1].text)
                wh_nbor_pos = en_doc[token.i + 1].tag_
        if token.dep_ == "ROOT":
            root_token = token.tag_

    write_each_record_to_csv(wh_pos, wh_word, wh_bi_gram, wh_nbor_pos, root_token)

The above function feeds the question into the NLP pipeline (en_doc = en_nlp(u'' + question)) and obtains a Doc object containing linguistic annotations of the question. The Doc also performs sentence boundary detection/segmentation, and we obtain the list of sentences, which act as the decomposed sub-questions. (Here I am only operating on the first sub-question.) We then iterate over each token in the sentence to get its part of speech and dependency label. To extract the WH-word we look for the WDT, WP, WP$ and WRB tags, and to extract the root token of the sentence we look for the ROOT dependency label. After writing all the records to the training data CSV file, it looks something like this:

How did serfdom develop in and then leave Russia ?|How|How did|WRB|VBD|VB|DESC
What films featured the character Popeye Doyle ?|What|What films|WP|NNS|VBD|ENTY
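The `write_each_record_to_csv` helper is not shown in the post; here is a sketch that would produce rows like the ones above. Note the assumptions: the original call passes only the five features, so it presumably also threads through the question text and its class by some other means; the explicit seven-argument signature and the file name below are mine.

```python
import csv

def write_each_record_to_csv(question, qclass, wh_word, wh_bi_gram,
                             wh_pos, wh_nbor_pos, root_token):
    # Columns mirror the sample rows:
    # Question|WH|WH-Bigram|WH-POS|WH-NBOR-POS|Root-POS|Class
    with open("qclassifier_trainer.csv", "a", newline="") as fp:
        writer = csv.writer(fp, delimiter="|")
        writer.writerow([question, wh_word,
                         " ".join([wh_word] + wh_bi_gram),
                         wh_pos, wh_nbor_pos, root_token, qclass])
```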

Training the SVM and Prediction

from sklearn.svm import LinearSVC
import pandas

I prefer pandas over sklearn.datasets. First, we load our training dataset CSV file into a pandas DataFrame. This data frame holds all the extracted features in column-row fashion. To train our classifier we need to separate the feature columns from the class/label column, so we pop the label column from the data frame and store it separately. Along with that, we also drop some unnecessary columns.

dta = pandas.read_csv('qclassifier_trainer.csv', sep='|', header=None,
                      names=['Question', 'WH', 'WH-Bigram', 'WH-POS',
                             'WH-NBOR-POS', 'Root-POS', 'Class'])  # file and column names illustrative
y = dta.pop('Class')
dta.pop('Question')  # the raw question text is not used as a feature

X_train = pandas.get_dummies(dta)

Here, the get_dummies() function converts the actual values into dummy or binary values. What this means is that a record like the one below is converted to its binary form, with 1 meaning the feature is present in the record and 0 meaning it is absent.

#  Why How What When Where ... WRB WDT ... VBD VB VBN VBP ...
5.  0   1    0    0    0   ...  1   0  ...  1   1  0   0  ...
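To see this in miniature, here is one-hot encoding applied to a tiny frame of two questions (only two feature columns, for illustration):

```python
import pandas

# Two toy questions with just the WH and WH-POS features
dta = pandas.DataFrame({"WH": ["How", "What"], "WH-POS": ["WRB", "WP"]})
X = pandas.get_dummies(dta)
# Each (column, value) pair becomes its own 0/1 column:
# WH_How, WH_What, WH-POS_WP, WH-POS_WRB
print(sorted(X.columns))
```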

In the next phase, we extract the same features from the question we want to classify into a data frame and get its dummy values. Here is the data frame that get_question_predict_features() will return:

qdata_frame = [{'WH':wh_word, 'WH-POS':wh_pos, 'WH-NBOR-POS':wh_nbor_pos, 'Root-POS':root_token}]
en_doc = en_nlp(u'' + question_to_predict)
question_data = get_question_predict_features(en_doc)
X_predict = pandas.get_dummies(question_data)
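`get_question_predict_features` itself is not shown in the post; a sketch consistent with `process_question` above might look like this (returning a single-row DataFrame so that `get_dummies` can consume it):

```python
import pandas

def get_question_predict_features(en_doc):
    sent = list(en_doc.sents)[0]
    wh_word, wh_pos, wh_nbor_pos, root_token = "", "", "", ""
    for token in sent:
        if token.tag_ in ("WDT", "WP", "WP$", "WRB"):
            wh_word = token.text
            wh_pos = token.tag_
            if token.i + 1 < len(en_doc):  # guard: WH-word may be the last token
                wh_nbor_pos = en_doc[token.i + 1].tag_
        if token.dep_ == "ROOT":
            root_token = token.tag_
    return pandas.DataFrame([{"WH": wh_word, "WH-POS": wh_pos,
                              "WH-NBOR-POS": wh_nbor_pos,
                              "Root-POS": root_token}])
```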

The problem here is that the sizes (numbers of features) of the prediction data frame and the training data frame differ, due to the absence of some features in the prediction data frame. Obviously a single question to be classified will be missing a majority of the features present in the training dataset of 5,500 questions. So, to equate the sizes, we append the feature columns that are present in the training data frame but missing from the prediction data frame, filled with the value 0 (because these features are not present in the question to classify), and vice versa.

from scipy.sparse import csr_matrix

def transform_data_matrix(X_train, X_predict):
    X_train_columns = list(X_train.columns)
    X_predict_columns = list(X_predict.columns)

    # Union of the feature columns of both frames
    X_trans_columns = list(set(X_train_columns + X_predict_columns))

    trans_data_train = {}

    for col in X_trans_columns:
        if col not in X_train:
            # Feature absent from the training frame: fill with zeros
            trans_data_train[col] = [0 for _ in range(len(X_train.index))]
        else:
            trans_data_train[col] = list(X_train[col])

    XT_train = pandas.DataFrame(trans_data_train)
    XT_train = csr_matrix(XT_train)

    trans_data_predict = {}

    for col in X_trans_columns:
        if col not in X_predict:
            # Feature absent from the question to classify: fill with zeros
            trans_data_predict[col] = [0 for _ in range(len(X_predict.index))]
        else:
            trans_data_predict[col] = list(X_predict[col])

    XT_predict = pandas.DataFrame(trans_data_predict)
    XT_predict = csr_matrix(XT_predict)

    return XT_train, XT_predict
X_train, X_predict = transform_data_matrix(X_train, X_predict)
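As a side note, pandas can do this column alignment in one line with `reindex` (a simpler equivalent of the function above; the function name here is mine). Prediction-only columns are dropped rather than unioned in, which is harmless: a model trained without those columns can make no use of them anyway.

```python
import pandas
from scipy.sparse import csr_matrix

def transform_with_reindex(X_train, X_predict):
    # Add training-only columns to the prediction frame (filled with 0)
    # and drop prediction-only columns, so both frames share one column set.
    X_predict = X_predict.reindex(columns=X_train.columns, fill_value=0)
    return csr_matrix(X_train.values), csr_matrix(X_predict.values)
```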

Once both data frames have the same set of feature columns, we classify the question using a Linear Support Vector Machine trained on our dataset. The LinearSVC model is fitted with the training features and their respective labels; this fitted object is then used to predict the class of the prediction data. It returns the question class/category.

Note: The DataFrame here has mostly zero entries, hence we convert it into a sparse matrix representation; csr_matrix() (from scipy.sparse import csr_matrix) takes care of that.

def support_vector_machine(X_train, y, X_predict):
    lin_clf = LinearSVC()
    lin_clf.fit(X_train, y)
    prediction = lin_clf.predict(X_predict)
    return prediction

print("Question Class:", support_vector_machine(X_train, y, X_predict))

You can also experiment with a Bayesian Classifier (Refer: Naive Bayes Classifier in Python):

from sklearn.naive_bayes import GaussianNB

def naive_bayes_classifier(X_train, y, X_predict):
    gnb = GaussianNB()
    # GaussianNB does not accept sparse input, so densify the matrices first
    gnb.fit(X_train.toarray(), y)
    prediction = gnb.predict(X_predict.toarray())
    return prediction

Fork it on GitHub:



The Process of Information Retrieval


A friend of mine published this really great post about Information Retrieval. I have reblogged it here.


Information Retrieval (IR) is the activity of obtaining information from large collections of information sources in response to a need.

The working of the Information Retrieval process is explained below:

  • The process of Information Retrieval starts when a user enters a query into the system through some graphical interface.
  • These user-defined queries are statements of the needed information, for example, queries fired by users in search engines.
  • In IR a single query does not match one right data object; instead it matches several collections of data objects, from which the most relevant documents are taken into consideration for further evaluation.
  • The relevant documents are ranked to find the one most related to the given query.
  • This ranking is the key difference between database searching and Information Retrieval.
  • The query is then sent to the core of the system, which has access to the content management…


A Cognitive study of Lexicons in Natural Language Processing.


What are Lexicons?

A word in any language is made of a root or stem word and an affix. These affixes are usually governed by rules called orthographic rules, which define the spelling rules for word composition in the Morphological Parsing phase. A lexicon is a list of such stem words and affixes, and is a vital requirement for constructing a Morphological Parser. Morphological parsing involves building up or breaking down a structured representation of component morphemes to form a meaningful word or a stem word. It is a necessary phase in spell checking, search-term disambiguation in web search engines, part-of-speech tagging, and machine translation.

A simple lexicon would usually just consist of a list of every possible word and stem + affix combination in the language. But this is an inconvenient approach in real-time applications, as search and retrieval of a specific word become a challenge owing to the unstructured format of the lexicon. If a proper structure is given to the lexicon of stems and affixes, then building a word from it becomes a bit simpler. So, what kind of structure are we talking about here? The most common structure used in morphotactics modelling is the Finite-State Automaton.

Let us look at a simple finite-state model for English nominal inflection:


Finite State Model for English nominal inflection

As shown in this FSM, the regular noun is our stem word and is concatenated with the plural suffix -s,

e.g. regular_noun(bat) + plural_suffix(s) = bats

Now this FSM will fail on some exceptions like foot => feet, mouse => mice, company => companies, etc. This is where orthographic rules come into action: they define the specific spelling rules for a particular stem that is an exception. Accordingly, the FSM can be improved.
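A toy version of this idea, sketched as an exception lexicon plus orthographic rules rather than an explicit state machine (all names here are illustrative):

```python
# Irregular plurals are looked up first, bypassing the regular path
IRREGULAR_PLURALS = {"foot": "feet", "mouse": "mice"}

def pluralize(noun):
    if noun in IRREGULAR_PLURALS:
        return IRREGULAR_PLURALS[noun]
    # Orthographic rule: consonant + 'y' -> 'ies' (company -> companies)
    if noun.endswith("y") and noun[-2:-1] not in "aeiou":
        return noun[:-1] + "ies"
    # Regular path of the FSM: regular_noun + plural_suffix(s)
    return noun + "s"
```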

Cognitive Computing :

“It addresses complex situations that are characterized by ambiguity and uncertainty; in other words, it handles human kinds of problems.”

– Cognitive Computing Consortium

“Cognitive computing refers to systems that learn at scale, reason with purpose and interact with humans naturally. Rather than being explicitly programmed, they learn and reason from their interactions with us and from their experiences with their environment.”

– Dr. John E. Kelly III; Computing, cognition and the future of knowing, IBM.

Or simply, we can say:

“Cognitive computing is the simulation of human thought processes in a computerized model. Cognitive computing involves self-learning systems that use data mining, pattern recognition and natural language processing to mimic the way the human brain works. The goal of cognitive computing is to create automated IT systems that are capable of solving problems without requiring human assistance.”


If cognitive computing is the simulation of human thought process in a computerized model, then this solves most of our ambiguity issues faced in Natural Language Processing. Let us first try to reason how a human mind tries to resolve ambiguity in Morphological Parsing.

How does the human mind construct its mental lexicons?

Let us say I give you a word: ‘cat’. The human brain immediately recognizes that the given word is a noun referring to a cute little animal with fur; it is also able to recall its pronunciation. But sometimes the brain is unable to recognize a given word and recall the information relating to it. For example, if you see the word ‘wug’, your mind might be able to figure out its pronunciation, but it would fail to assign a part of speech or a meaning to it. Yet if I tell you that it is a noun and a small creature, you can use it in a sentence and would know its part of speech, e.g. “I saw a wug today.”

Similarly, for a word like ‘cluvious’, even if you don’t know its meaning, you may be able to infer some information about it, because most English words of this form are adjectives (ambitious, curious, anxious, envious…). This might help you predict its meaning when it occurs in a sentence, for example “You look cluvious today.” From the example sentence, one can easily infer that ‘cluvious’ describes the physical appearance of an entity.

You can even reason about words that you haven’t seen before, like ‘traftful’ and ‘traftless’, and figure out that they are most likely opposites. This is because the given pair resembles many pairs of words in English that have this particular structure and an antonym relationship.

With the observations stated above, one can build a morphological parser with higher efficiency. You can also read my other post on how to set up a Natural Language Processing environment in Python.

Further Reading: