NLP: Question Classification using Support Vector Machines [spacy][scikit-learn][pandas]


For the past couple of months I have been working on a Question Answering System, and in my upcoming blog posts I would like to share some things I learnt in the process. I haven't reached a satisfactory accuracy with the answers fetched by the system yet, but it is a work in progress. You can find Adam QAS on GitHub.

In this post, we are specifically going to focus on the Question Classification part. The goal is to classify a given input question into predefined categories. This classification will help us in Query Construction / Modelling phases.

[Figure: ADAM – Poster]

So, before we begin, let's make sure our environment is all set up: Setting up Natural Language Processing Environment with Python. For the question's language processing part we are going to use spaCy, for the machine learning part scikit-learn, and for the data frames I prefer pandas. Note: I am using Python 3.

$ pip3 install -U scikit-learn
$ pip3 install pandas
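
If spaCy itself and its English model aren't installed yet, something along these lines should work with recent spaCy releases (en_core_web_md is the model loaded later in this post):

$ pip3 install -U spacy
$ python3 -m spacy download en_core_web_md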

Now that our environment is all set, we need a training dataset to train our classifier. I am using the dataset from the Cognitive Computation Group at the Department of Computer Science, University of Illinois at Urbana-Champaign (Training set 5: 5,500 labeled questions; for more, visit here).

DESC:manner How did serfdom develop in and then leave Russia ?
ENTY:cremat What films featured the character Popeye Doyle ?
DESC:manner How can I find a list of celebrities ' real names ?
...

Preparing the Training Data for the SVM

For this classifier, we will be using a Linear Support Vector Machine. Now let us identify the features in the question which will affect its classification and train our classifier based on these features.

  1. WH-word: The WH-word in a question carries a lot of information about the intent of the question and what it is basically trying to seek (What, When, How, Where and so on).
  2. WH-word POS: The part of speech of the WH-word (wh-determiner, wh-pronoun, wh-adverb).
  3. POS of the word next to the WH-word: The part of speech of the word adjacent to the WH-word, i.e. the word at position 1 in the bigram (position 0 being the WH-word).
  4. Root POS: The part of speech of the word at the root of the dependency parse tree.

Note: We will also extract the WH-bigram (just for reference). A bigram is simply two consecutive words; in this case we take the WH-word and the word that follows it (What is, How many, Where do…).

We have to extract these features from our labelled dataset and store them, together with the respective label, in a CSV file. This is where spaCy comes into action: it gives us the part of speech and dependency relation of each token in the question.

import spacy
import csv

# Remove any training records left over from previous runs
clean_old_data()

# Load spaCy's English model
en_nlp = spacy.load("en_core_web_md")

First, we clean our CSV file of old training data and load the English language model. Then we read our raw labelled data, extract the features for each question, and store these features along with the labels in a CSV file.

read_input_file(fp, en_nlp)

This function splits the raw data into the question and its respective label and passes them on for further NLP processing.
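
The function itself isn't reproduced in this post; a minimal sketch, assuming each raw line looks like the labelled examples above (the label followed by a space and the question), could be:

def read_input_file(fp, en_nlp):
    # fp is an open handle to the raw labelled dataset, e.g. "DESC:manner How did ... ?"
    for line in fp:
        line = line.strip()
        if not line:
            continue
        label, question = line.split(" ", 1)
        qclass = label.split(":")[0]  # coarse class, e.g. "DESC" (assumption: the fine class is dropped)
        process_question(question, qclass, en_nlp)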

def process_question(question, qclass, en_nlp):
    en_doc = en_nlp(u'' + question)

    # Operate only on the first sub-question (first sentence of the Doc)
    sent_list = list(en_doc.sents)
    sent = sent_list[0]

    wh_bi_gram = []
    root_token = ""
    wh_pos = ""
    wh_nbor_pos = ""
    wh_word = ""

    for token in sent:
        # WH-words carry one of the tags WDT, WP, WP$ or WRB
        if token.tag_ in ("WDT", "WP", "WP$", "WRB"):
            wh_pos = token.tag_
            wh_word = token.text
            wh_bi_gram.append(token.text)
            # Guard against the WH-word being the last token of the question
            if token.i + 1 < len(en_doc):
                wh_bi_gram.append(en_doc[token.i + 1].text)
                wh_nbor_pos = en_doc[token.i + 1].tag_
        # POS tag of the root of the dependency parse tree
        if token.dep_ == "ROOT":
            root_token = token.tag_

    write_each_record_to_csv(wh_pos, wh_word, wh_bi_gram, wh_nbor_pos, root_token)

The above function feeds the question into the NLP pipeline (en_doc = en_nlp(u'' + question)) and obtains a Doc object containing the linguistic annotations of the question. The Doc also performs sentence boundary detection/segmentation, and we obtain the list of sentences, which act as the decomposed sub-questions (here I operate only on the first one). We then iterate over each token in the sentence to get its part of speech and dependency label. To extract the WH-word we look for the WDT, WP, WP$ and WRB tags, and to extract the root token of the sentence we look for the ROOT dependency label. After writing all the records to the training data CSV file, it looks something like this:

#Question|WH|WH-Bigram|WH-POS|WH-NBOR-POS|Root-POS|Class
How did serfdom develop in and then leave Russia ?|How|How did|WRB|VBD|VB|DESC
What films featured the character Popeye Doyle ?|What|What films|WP|NNS|VBD|ENTY
...
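
write_each_record_to_csv() isn't shown above either; note that the call in process_question doesn't pass the question text and class, which the CSV clearly contains, so in the real project they must reach the function some other way. A minimal sketch that assumes they are passed in explicitly (the file name is also an assumption):

import csv

def write_each_record_to_csv(question, qclass, wh_pos, wh_word, wh_bi_gram, wh_nbor_pos, root_token):
    # Append one pipe-delimited record matching the header shown above
    with open("qclassifier_trainer.csv", "a", newline="") as f:
        writer = csv.writer(f, delimiter="|")
        writer.writerow([question, wh_word, " ".join(wh_bi_gram),
                         wh_pos, wh_nbor_pos, root_token, qclass])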

Training the SVM and Prediction

from sklearn.svm import LinearSVC
import pandas

I prefer pandas over sklearn.datasets. First, we load our training data CSV file into a pandas DataFrame; this data frame holds all the extracted features in column-row fashion. To train our classifier we need to separate the feature columns from the class/label column, so we pop the label column from the data frame and store it separately. Along with that, we also pop the columns we don't use as features.

# Load the training CSV produced earlier (the file name is whatever you saved it as)
dta = pandas.read_csv('qclassifier_trainer.csv', sep='|', header=0)

y = dta.pop('Class')      # labels
dta.pop('#Question')      # drop columns that are not used as features
dta.pop('WH-Bigram')

X_train = pandas.get_dummies(dta)

Here, the get_dummies() function converts the categorical feature values into dummy (binary) indicator values. This means a record like the one below is converted to its binary form, with 1 meaning the feature value is present in the record and 0 meaning it is absent.

   ...
5. How|WRB|VBD|VB
   ...
#  Why How What When Where ... WRB WDT ... VBD VB VBN VBP ...
   ...
5.  0   1    0    0    0   ...  1   0  ...  1   1  0   0  ...
   ...

In the next phase, we extract the same features from the question we want to classify into a data frame and get its dummy values. Here is the data frame that get_question_predict_features() will return:

qdata_frame = [{'WH':wh_word, 'WH-POS':wh_pos, 'WH-NBOR-POS':wh_nbor_pos, 'Root-POS':root_token}]
en_doc = en_nlp(u'' + question_to_predict)
question_data = get_question_predict_features(en_doc)
X_predict = pandas.get_dummies(question_data)
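
get_question_predict_features() isn't listed in the post; a minimal sketch that mirrors the feature extraction in process_question (same WH-word and ROOT heuristics) and returns the data frame shown above could be:

import pandas

def get_question_predict_features(en_doc):
    # Extract the same features used at training time from the question Doc
    sent = list(en_doc.sents)[0]
    wh_word, wh_pos, wh_nbor_pos, root_token = "", "", "", ""
    for token in sent:
        if token.tag_ in ("WDT", "WP", "WP$", "WRB"):
            wh_word = token.text
            wh_pos = token.tag_
            if token.i + 1 < len(en_doc):
                wh_nbor_pos = en_doc[token.i + 1].tag_
        if token.dep_ == "ROOT":
            root_token = token.tag_
    qdata_frame = [{'WH': wh_word, 'WH-POS': wh_pos,
                    'WH-NBOR-POS': wh_nbor_pos, 'Root-POS': root_token}]
    return pandas.DataFrame(qdata_frame)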

The problem here is that the sizes (numbers of feature columns) of the prediction data frame and the training data frame differ, due to the absence of some features in the prediction data frame. It is obvious that a single question to be classified will be missing most of the feature values present in the training dataset of ~5,500 questions. So, to equate the number of columns, we append the missing feature columns from the training data frame to the prediction data frame with a value of 0 (because those features are not present in the question to classify).

from scipy.sparse import csr_matrix

def transform_data_matrix(X_train, X_predict):
    # Union of all feature columns seen in the training and prediction data
    X_train_columns = list(X_train.columns)
    X_predict_columns = list(X_predict.columns)
    X_trans_columns = list(set(X_train_columns + X_predict_columns))

    # Rebuild the training frame with every column, filling missing ones with 0
    trans_data_train = {}
    for col in X_trans_columns:
        if col not in X_train:
            trans_data_train[col] = [0 for i in range(len(X_train.index))]
        else:
            trans_data_train[col] = list(X_train[col])

    XT_train = pandas.DataFrame(trans_data_train)
    XT_train = csr_matrix(XT_train)

    # Do the same for the (single-row) prediction frame
    trans_data_predict = {}
    for col in X_trans_columns:
        if col not in X_predict:
            trans_data_predict[col] = 0
        else:
            trans_data_predict[col] = list(X_predict[col])

    XT_predict = pandas.DataFrame(trans_data_predict)
    XT_predict = csr_matrix(XT_predict)

    return XT_train, XT_predict


X_train, X_predict = transform_data_matrix(X_train, X_predict)
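
As a side note, pandas can do the same column alignment more concisely with reindex(); the following is an equivalent shortcut using only standard pandas and scipy calls (the function name is mine, not part of the project):

from scipy.sparse import csr_matrix

def transform_data_matrix_reindex(X_train, X_predict):
    # Align both frames to the union of columns, filling missing features with 0
    all_columns = X_train.columns.union(X_predict.columns)
    XT_train = csr_matrix(X_train.reindex(columns=all_columns, fill_value=0).values)
    XT_predict = csr_matrix(X_predict.reindex(columns=all_columns, fill_value=0).values)
    return XT_train, XT_predict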

After we have both data frames with the same set of feature columns, we classify the question using a Linear Support Vector Machine trained on the training dataset. The LinearSVC model is fitted with the training features and their respective labels; this fitted model is then used to predict the class of the prediction data, and it returns the question class/category.

Note: The DataFrames here have mostly zero entries, hence we convert them into a sparse matrix representation; csr_matrix() takes care of that (from scipy.sparse import csr_matrix, included at the top of the snippet above).

def support_vector_machine(X_train, y, X_predict):
    lin_clf = LinearSVC()
    lin_clf.fit(X_train, y)
    prediction = lin_clf.predict(X_predict)
    return prediction


print("Question Class:", support_vector_machine(X_train, y, X_predict))
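
Since we only predict a single question here, there is no accuracy figure to look at. A quick sanity check is cross-validation on the training matrix and labels, for example with scikit-learn's cross_val_score (the model_selection module is available in recent scikit-learn versions):

from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# 5-fold cross-validation on the (sparse) training matrix and labels
scores = cross_val_score(LinearSVC(), X_train, y, cv=5)
print("CV accuracy: %0.3f (+/- %0.3f)" % (scores.mean(), scores.std()))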

You can also experiment with a Bayesian Classifier (Refer: Naive Bayes Classifier in Python):

from sklearn.naive_bayes import GaussianNB

def naive_bayes_classifier(X_train, y, X_predict):
    gnb = GaussianNB()
    # GaussianNB expects dense arrays, so convert the sparse matrices first
    gnb.fit(X_train.toarray(), y)
    prediction = gnb.predict(X_predict.toarray())
    return prediction

Fork it on GitHub: Adam QAS.



Why Natural Language Processing can be hard?

Following are my notes for the video lectures of the IIT-K NPTEL NLP course. (Not organized properly yet; will do that soon.)
"Language is the foundation of civilization. It is the glue that holds a people together. It is the first weapon drawn in a conflict." – Arrival (2016)
Problems in NLP :
  • Ambiguity
  • Open Domain
  • Relation between Entities
  • Impractical Goals
Why is language processing hard ?
  • Lexical Ambiguity :
               Input : "Will Will Smith play the role of Will Wise in Will the wise ?"
               Output using spaCy :
               Will – will ( MD ) VERB
               Will – will ( VB ) VERB -> ( NNP ) PROPN
               Smith – smith ( NNP ) PROPN
               play – play ( VB ) VERB
               the – the ( DT ) DET
               role – role ( NN ) NOUN
               of – of ( IN ) ADP
               Will – will ( NNP ) PROPN
               Wise – wise ( NNP ) PROPN
               in – in ( IN ) ADP
               Will – will ( NNP ) PROPN
               the – the ( DT ) DET
               wise – wise ( JJ ) ADJ
               ? – ? ( . ) PUNCT
               Input : "Rose rose to put a rose on her rows of roses."
               Output using spaCy :
               Rose – rose ( NNP ) PROPN
               rose – rise ( VBD ) VERB
               to – to ( TO ) PART
               put – put ( VB ) VERB
               a – a ( DT ) DET
               rose – rose ( NN ) NOUN
               on – on ( IN ) ADP
               her – her ( PRP$ ) ADJ
               rows – row ( NNS ) NOUN
               of – of ( IN ) ADP
               roses – rose ( NNS ) NOUN
               . – . ( . ) PUNCT
  • Structural Ambiguity
               Input : "The man saw the boy with the binoculars"
               Output using spaCy : [dependency parse figure]
               Input : "Flying planes can be dangerous."
               Output using spaCy : [dependency parse figure]
  • Imprecise and Vague
               "It is very warm here" – The condition of being 'warm' isn't associated with a fixed temperature range and can vary from person to person depending upon their threshold for warmth.
               "Q : Did your mom call your sister last night?"
               "A : I'm sure she must have" – Here the answer doesn't convey accurate information and is just a guess, which can be true or false. Hence, it doesn't provide the questioner with useful information.
A classic example to demonstrate ambiguity: "I made her a duck."

[Figure: parse trees for "I made her a duck."]

Catalan number : the number of possible (binary) parse trees a parser can generate for a sentence grows with the Catalan numbers as the sentence gets longer.
  • Non Standard English / Informal English
  • Segmentation Issues
  • Idioms / Phrases / Pop culture references
  • Neologisms / New words
  • New senses of words
  • Named Entities
 
Empirical Laws
Function words : serve as important elements of sentence structure but contribute very little to the lexical meaning of the sentence.
eg. prepositions, pronouns, auxiliary verbs, conjunctions, grammatical articles, particles, etc. [Closed-class words]
Content words : Convey information
Type-Token distinction : the distinction between a concept (the type) and the objects that are particular instances of it (the tokens); distinguishing between similar-looking tokens on the basis of their type. In "Will Will Smith play the role of Will Wise in Will the wise ?" the occurrences of "Will" look identical as tokens but differ in type.
Type/Token ratio : the ratio of the number of different words (types) to the number of running words (tokens) in the corpus. It indicates how often, on average, a new word form appears in the text corpus.
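
As a quick illustration, the type/token ratio of a small text can be computed in a couple of lines of Python (lower-casing everything is an assumption about what counts as the same word form):

tokens = "Will Will Smith play the role of Will Wise in Will the wise ?".lower().split()
types = set(tokens)
print(len(types) / len(tokens))   # type/token ratio, here 9/14 ≈ 0.64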
Heaps' Law : |V| = K * N^β – the number of unique token types |V| grows as a power of the corpus size N; with β typically around 0.5, the vocabulary grows roughly as the square root of the corpus size.
Text Processing
Tokenization : segmentation of a string into words
Sentence segmentation : boundary detection for sentences. This is challenging because punctuation marks like the dot (.), which usually terminate a sentence, can also mark abbreviations or initials, which certainly do not indicate the end of a sentence. A binary classifier can be built to decide between the two possible outcomes, i.e. sentence termination vs. not a sentence termination. To build this sentence terminator we can use a decision tree or some hand-coded rules (cringe).
Word Tokenization : Apostrophe, Named Entities, Abbreviations.
Handling Hyphens : End of the line, lexical, sententially determined
Normalization
Case Folding : Lower case / Upper case
Lemmatization : reducing a word to its dictionary form (lemma) using morphological analysis (stem + affixes).
Stemming : crudely chopping off affixes to reduce a word to its stem, e.g. with Porter's algorithm.
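
A minimal spaCy sketch tying these steps together (tokenization, sentence segmentation and lemmatization); the abbreviation "Dr." is included to show why boundary detection is non-trivial:

import spacy

en_nlp = spacy.load("en_core_web_md")
doc = en_nlp(u"Dr. Smith rose to put a rose on her rows of roses. Rose smiled.")

print([sent.text for sent in doc.sents])               # sentence segmentation
print([(token.text, token.lemma_) for token in doc])   # tokens and their lemmas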

Dependency Parsing in NLP


Syntactic Parsing or Dependency Parsing is the task of recognizing a sentence and assigning a syntactic structure to it. The most widely used syntactic structure is the parse tree, which can be generated using various parsing algorithms. These parse trees are useful in applications like grammar checking and, more importantly, play a critical role in the semantic analysis stage. For example, to answer the question "Who is the point guard for the LA Lakers in the next game ?" we need to figure out its subject, objects and attributes to work out that the user wants the point guard of the LA Lakers specifically for the next game.

Now, the task of syntactic parsing is quite complex because a given sentence can have multiple parse trees, which we call ambiguities. Consider the sentence "Book that flight.", which can form multiple parse trees based on its ambiguous part-of-speech tags unless those ambiguities are resolved. Choosing the correct parse from the multiple possible parses is called syntactic disambiguation. Parsing algorithms like Cocke-Kasami-Younger (CKY), the Earley algorithm or chart parsing use a dynamic programming approach to deal with the ambiguity problem.
In this post, we will actually try to implement a few Syntactic parsers from different libraries:

spaCy :

spaCy's dependency parser provides token properties for navigating the generated dependency parse tree. The dep attribute gives the syntactic dependency relationship between the head token and its child token; the dependency scheme is adopted from ClearNLP. The generated parse tree follows all the properties of a tree: each child token has exactly one head token, while a head token can have multiple children. We can obtain the head token with the token.head property and its children with the token.children property. A subtree of a token can be extracted using the token.subtree property, and the ancestors of a token can be obtained with token.ancestors. To obtain the rightmost and leftmost tokens among a token's syntactic descendants, token.right_edge and token.left_edge can be used. It is also worth mentioning that the neighboring token can be accessed with token.nbor. spaCy doesn't provide an in-built tree representation, although you can use NLTK's tree representation. Here's a code snippet for it:

import spacy
from nltk import Tree

en_nlp = spacy.load("en_core_web_md")


def tok_format(tok):
    # Label each node as Token_POS-tag_Dependency-tag
    return "_".join([tok.orth_, tok.tag_, tok.dep_])


def to_nltk_tree(node):
    # Recursively build an NLTK Tree from a spaCy token and its children
    if node.n_lefts + node.n_rights > 0:
        return Tree(tok_format(node), [to_nltk_tree(child) for child in node.children])
    else:
        return tok_format(node)


command = "Submit debug logs to project lead today at 9:00 AM"
en_doc = en_nlp(u'' + command)

[to_nltk_tree(sent.root).pretty_print() for sent in en_doc.sents]

Here's the output format (Token_POS Tag_Dependency Tag): [tree diagram printed by pretty_print()]
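
Before extracting anything specific, here is a minimal sketch of navigating the parse with the token properties mentioned above (head, children, subtree, left_edge/right_edge); the exact output depends on the model version:

import spacy

en_nlp = spacy.load("en_core_web_md")
doc = en_nlp(u"The man saw the boy with the binoculars")

for token in doc:
    # Each token knows its dependency label, its head and its children
    print(token.text, token.dep_, token.head.text,
          [child.text for child in token.children])

root = [token for token in doc if token.dep_ == "ROOT"][0]
print(list(root.subtree))                          # all tokens in the clause headed by the root
print(root.left_edge.text, root.right_edge.text)   # leftmost / rightmost descendants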

Let's try extracting the head word from a question to understand how dependencies work. A head word in a question can be extracted using various dependency relationships, but for now we will try to extract the nominal subject (nsubj) of the question as the head word. Here's how you can get the subject from the sentence:

from spacy.symbols import nsubj, attr, NOUN, PROPN

head_word = "null"
question = "What films featured the character Popeye Doyle ?"
en_doc = en_nlp(u'' + question)

for sent in en_doc.sents:
    for token in sent:
        # Prefer the nominal subject; fall back to an attribute complement
        if token.dep == nsubj and (token.pos == NOUN or token.pos == PROPN):
            head_word = token.text
        elif token.dep == attr and (token.pos == NOUN or token.pos == PROPN):
            head_word = token.text
    print(question + " (" + head_word + ")")

Here we get the output with the head word "films", which is pretty close; you can improve the accuracy by detecting more dependency relationships and head-word rules:

What films featured the character Popeye Doyle ? (films)

spaCy also has a dependency visualizer, displaCy; here is its output for our input question:

[displaCy visualization of the dependency parse]
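
The screenshot above is from the online demo; in spaCy 2.0 and later, displaCy also ships with the library and can be run locally. A minimal sketch:

import spacy
from spacy import displacy

en_nlp = spacy.load("en_core_web_md")
doc = en_nlp(u"What films featured the character Popeye Doyle ?")

# Serves an interactive dependency visualization at http://localhost:5000
displacy.serve(doc, style="dep")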

To install spaCy, refer to Setting up Natural Language Processing Environment with Python.

(Working on the NLTK version; will update as soon as possible.)
