Dependency Parsing in NLP


Syntactic parsing, or dependency parsing, is the task of recognizing a sentence and assigning a syntactic structure to it. The most widely used syntactic structure is the parse tree, which can be generated using various parsing algorithms. These parse trees are useful in applications like grammar checking and, more importantly, they play a critical role in the semantic analysis stage. For example, to answer the question “Who is the point guard for the LA Lakers in the next game ?” we need to identify its subject, objects, and attributes to work out that the user wants the point guard of the LA Lakers, specifically for the next game.

Now, the task of syntactic parsing is quite complex because a given sentence can have multiple parse trees, which we call ambiguities. Consider the sentence “Book that flight.”, which can form multiple parse trees based on its ambiguous part-of-speech tags unless those ambiguities are resolved. Choosing a correct parse from the multiple possible parses is called syntactic disambiguation. Parsing algorithms like Cocke–Kasami–Younger (CKY), Earley, and chart parsing use a dynamic programming approach to deal with the ambiguity problem.
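To make the ambiguity concrete, here is a minimal sketch using NLTK’s chart parser with a toy grammar. The grammar below is made up purely for illustration; it deliberately lets “Book” and “that” take more than one part of speech so the sentence gets more than one parse.

import nltk

# Contrived toy grammar: "Book" can be a verb or a noun and "that" can be a
# determiner or a noun, so "Book that flight" becomes ambiguous.
toy_grammar = nltk.CFG.fromstring("""
S -> VP | NP
VP -> V NP
NP -> Det Nominal | Nominal
Nominal -> Noun Nominal | Noun
V -> 'Book'
Det -> 'that'
Noun -> 'Book' | 'that' | 'flight'
""")

# The chart parser uses dynamic programming to enumerate every possible parse.
parser = nltk.ChartParser(toy_grammar)
for tree in parser.parse(['Book', 'that', 'flight']):
    print(tree)

Each printed tree is one way of resolving the ambiguity; picking the right one is exactly the disambiguation problem described above.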
In this post, we will actually try out a few syntactic parsers from different libraries:

spaCy :

spaCy’s dependency parser provides token properties to navigate the generated dependency parse tree. The dep_ attribute gives the syntactic dependency relationship between the head token and its child token. The syntactic dependency scheme used is from ClearNLP. The generated parse follows all the properties of a tree: each child token has exactly one head token, although a head token can have multiple children. We can obtain the head token with the token.head property and its children with the token.children property. The subtree of a token can also be extracted using the token.subtree property. Similarly, the ancestors of a token can be obtained with token.ancestors. To obtain the rightmost and leftmost tokens of a token’s syntactic descendants, token.right_edge and token.left_edge can be used. It is also worth mentioning that a neighboring token can be fetched with token.nbor. spaCy doesn’t provide an inbuilt tree representation, although you can use NLTK’s tree representation. Here’s a code snippet for it:

import spacy
from nltk import Tree

en_nlp = spacy.load('en')  # load the English model


def tok_format(tok):
    # Label each node as Token_POS-Tag_Dependency-Tag
    return "_".join([tok.orth_, tok.tag_, tok.dep_])


def to_nltk_tree(node):
    # Recursively convert a spaCy token and its children into an NLTK Tree
    if node.n_lefts + node.n_rights > 0:
        return Tree(tok_format(node), [to_nltk_tree(child) for child in node.children])
    else:
        return tok_format(node)


command = "Submit debug logs to project lead today at 9:00 AM"
en_doc = en_nlp(u'' + command)

[to_nltk_tree(sent.root).pretty_print() for sent in en_doc.sents]

Here’s the output format (Token_POS-Tag_Dependency-Tag): [parse tree printed by pretty_print()]
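Since the attributes mentioned earlier (head, children, subtree, ancestors, left_edge, right_edge) are the main way to navigate the parse, here is a minimal sketch that prints them for every token of the same command, assuming the en_nlp pipeline and en_doc from above:

for token in en_doc:
    # Dependency label, the token's head and its immediate children
    print(token.text, token.dep_, token.head.text,
          [child.text for child in token.children])
    # Leftmost and rightmost tokens among this token's syntactic descendants
    print("  edges:", token.left_edge.text, "...", token.right_edge.text)
    # The full subtree: the token together with all of its descendants
    print("  subtree:", " ".join(t.text for t in token.subtree))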

Let’s try extracting the headword from a question to understand how dependencies work. A headword in a question can be extracted using various dependency relationships, but for now we will try to extract the nominal subject (nsubj) of the question as the headword. Here’s how you can get a subject from the sentence.

from spacy.symbols import nsubj, attr, NOUN, PROPN

head_word = "null"
question = "What films featured the character Popeye Doyle ?"
en_doc = en_nlp(u'' + question)

for sent in en_doc.sents:
    for token in sent:
        # Pick a nominal subject (nsubj) or an attribute (attr) that is a noun
        if token.dep == nsubj and (token.pos == NOUN or token.pos == PROPN):
            head_word = token.text
        elif token.dep == attr and (token.pos == NOUN or token.pos == PROPN):
            head_word = token.text
    print(question + " (" + head_word + ")")

Here we get the output with the headword “films”, which is pretty close, and you can improve its accuracy by detecting more dependency relationships and headword rules:

What films featured the character Popeye Doyle ? (films)
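As a starting point for those extra rules, here is a hedged sketch that also falls back to the direct object (dobj) and to the sentence root when no subject or attribute noun is found. The rule order and the find_head_word helper are just illustrative, not a standard recipe.

from spacy.symbols import nsubj, attr, dobj, NOUN, PROPN


def find_head_word(doc):
    # Illustrative priority: nsubj, then attr, then dobj, then a noun root
    for sent in doc.sents:
        for dep_label in (nsubj, attr, dobj):
            for token in sent:
                if token.dep == dep_label and token.pos in (NOUN, PROPN):
                    return token.text
        if sent.root.pos in (NOUN, PROPN):
            return sent.root.text
    return "null"


print(find_head_word(en_nlp(u"What films featured the character Popeye Doyle ?")))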

spaCy also has a dependency visualizer, displaCy; here is the demo output for our input question:

[displaCy visualization of the dependency parse for the question]
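If you prefer to render the parse locally instead of using the web demo, newer spaCy releases (2.0 and later) bundle displaCy. A minimal sketch, assuming such a version and its small English model are installed:

import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")  # model name assumes spaCy >= 2.0
doc = nlp("What films featured the character Popeye Doyle ?")

# Render the dependency parse as SVG mark-up; displacy.serve(doc, style="dep")
# would start a local web server with the same visualization instead.
svg = displacy.render(doc, style="dep")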

To install spaCy, refer to Setting up Natural Language Processing Environment with Python.

(I am still working on the NLTK section and will update this post as soon as possible.)

Further Reading :


Setting up Natural Language Processing Environment with Python


In this blog post, I will be discussing the tools of Natural Language Processing for a Linux environment, although most of them also apply to Windows and Mac. So, let’s get started with some prerequisites.
We will use Python’s pip package installer to install the various Python modules.

$ sudo apt install python-pip
$ pip install --upgrade pip

So I am going to talk about three NLP tools in Python that I have worked with so far.

Note: It is highly recommended to install these modules in a Virtual Environment. Here is how you do that : Common Python Tools: Using virtualenv, Installing with Pip, and Managing Packages.

  1. Natural Language Toolkit :

    NLTK can be seen as a library written for educational purposes and hence is great to experiment with. As its website itself notes, NLTK has been called “a wonderful tool for teaching, and working in, computational linguistics using Python,” and “an amazing library to play with natural language.” To install NLTK we use pip :

    $ sudo pip install -U nltk

    NLTK also comes with its own corpora, which can be downloaded as follows:

    >>> import nltk
    >>> nltk.download()

    We can also interface NLTK with our own corpora. For detailed usage of the NLTK API, one can refer to its official guide, “Natural Language Processing with Python” by Steven Bird. I will be covering more about NLTK and its API in the upcoming posts, but for now, we will settle for its installation.
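    As a quick taste of the API, here is a minimal sketch that tokenizes and part-of-speech tags a sentence. It assumes the ‘punkt’ and ‘averaged_perceptron_tagger’ packages have been fetched via nltk.download().

    import nltk

    sentence = "Submit debug logs to project lead today at 9:00 AM"

    # Split the sentence into tokens and tag each token with its part of speech
    tokens = nltk.word_tokenize(sentence)
    print(nltk.pos_tag(tokens))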

  2. spaCy :

    In the words of Matthew Honnibal (author of spaCy):

    ” There’s a real philosophical difference between spaCy and NLTK. spaCy is written to help you get things done. It’s minimal and opinionated. We want to provide you with exactly one way to do it — the right way. In contrast, NLTK was created to support education. Most of what’s there is for demo purposes, to help students explore ideas. spaCy provides very fast and accurate syntactic analysis (the fastest of any library released), and also offers named entity recognition and ready access to word vectors. You can use the default word vectors, or replace them with any you have.

    What really sets it apart, though, is the API. spaCy is the only library that has all of these features together, and allows you to easily hop between these levels of representation. Here’s an example of how that helps. Tutorial: Search Reddit for comments about Google doing something . spaCy also ensures that the annotations are always aligned to the original string, so you can easily print mark-up: Tutorial: Mark all adverbs, particularly for verbs of speech . “

    The benchmarks provided on its official website: [benchmark comparison table]
    Here are some of the things I have tried with spaCy, and it’s my favorite NLP tool. In the upcoming posts I will delve into each of its APIs, so keep an eye out here (spaCy).
    Installation :

    $ sudo pip install -U spacy
    $ sudo python -m spacy.en.download

    What makes spaCy easy to work with is its well-maintained and well-presented documentation. They have also made some great demos, like displaCy for the dependency parser and named entity recognizer. Check them out here.

  3. TextBlob :

    It is more of a text processing library than an NLP library. It is simple and lightweight, and it is your go-to library when you have to perform some basic NLP operations such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.
    Installation :

    $ sudo pip install -U textblob
    $ sudo python -m textblob.download_corpora

    You can use it to check word definitions and synonyms, and the similarity between different words. I will be doing an independent post on TextBlob with WordNet and its SynSets.
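    Here is a minimal sketch of that kind of usage. It relies on the WordNet data pulled in by the download_corpora step above.

    from textblob import Word

    # Definitions and WordNet synsets for a single word
    w = Word("octopus")
    print(w.definitions)
    print(w.synsets)

    # Similarity between two words via their first WordNet synsets
    ship = Word("ship").synsets[0]
    boat = Word("boat").synsets[0]
    print(ship.path_similarity(boat))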

Check out my other posts on Natural Language Processing.

Further Reading :

If you have any problems with the installations or any other comments, let me know below in the comments section. Meanwhile, you can also check out my other post on a machine learning classification algorithm: Naive Bayes Classifier in Python

Naive Bayes Classifier in Python


Naive Bayes Classifier is probably the most widely used text classifier; it is a supervised learning algorithm. It can be used to classify blog posts or news articles into different categories like sports, entertainment, and so forth. It can be used to detect spam emails. But most importantly, it is widely used in sentiment analysis. So first of all, what is supervised learning? It means that a labeled training dataset is provided, containing inputs along with their respective outputs. From this training dataset, our algorithm learns to infer the output for a new input.

The basics,

Conditional Probability : It is simply the probability that something will happen, given that something else has happened. It’s a way to handle dependent events. You can check out some examples of conditional probability here.

So from the multiplication rule (here A and B are dependent events):

P(A ∩ B) = P(A) · P(B|A)

Now from the above equation we get the formula for conditional probability:

P(B|A) = P(A ∩ B) / P(A)

Bayes’ theorem : It describes the probability of an event based on the conditions or attributes that might be related to the event,

P(A|B) = [P(B|A) · P(A)] / P(B)
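A tiny worked example with made-up numbers, purely for illustration:

# Made-up numbers: 1% of emails are spam, 80% of spam emails contain the
# word "offer", and 10% of all emails contain the word "offer".
p_spam = 0.01             # P(A)
p_offer_given_spam = 0.8  # P(B|A)
p_offer = 0.10            # P(B)

# Bayes' theorem: P(spam | "offer") = P("offer" | spam) * P(spam) / P("offer")
p_spam_given_offer = p_offer_given_spam * p_spam / p_offer
print(p_spam_given_offer)  # 0.08, i.e. an 8% chance the email is spam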

So, our classifier can be written as :

Assume a problem instance to be classified, represented by a vector x = (x1, x2, x3, …, xn) of n attributes. Here y is our class variable.

ŷ = argmax_y  P(y) · P(x1|y) · P(x2|y) · … · P(xn|y)

Here we have eliminated the denominator P(x1, x2, x3, …, xn) because it is the same for every class y, so it doesn’t affect which class maximizes the expression.

Now to make sure our algorithm holds up good against our datasets, we need to take the following conditions into account.

The Zero Frequency problem : Let us consider the case where a given attribute value and class never occur together in the training data, causing the frequency-based probability estimate to be zero. This single zero will wipe out the information in all the other probabilities when they are multiplied together (multiplied by zero…duh…!). The simple solution is to apply Laplace estimation by assuming a uniform distribution over all attributes, i.e. we simply add a pseudocount to all probability estimates so that no probability is ever set to zero.

Floating Point Underflow : The product of many small probabilities can fall below the floating-point range, so to avoid this we take logarithms of the probabilities and add them instead of multiplying. Accordingly, we need to apply logarithm rules to our classifier.
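Putting both fixes together, here is a minimal sketch of a word-count based Naive Bayes text classifier with Laplace smoothing and log probabilities. It is only an illustration of the ideas above, not the implementation linked below.

import math
from collections import Counter, defaultdict


def train(samples):
    # samples: list of (list_of_words, label) pairs
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, label in samples:
        class_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab


def classify(words, class_counts, word_counts, vocab):
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in class_counts:
        # Log prior plus a sum of log likelihoods avoids floating point underflow
        score = math.log(class_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)  # Laplace smoothing
        for word in words:
            score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Tiny made-up training set
data = [("a great fantastic movie".split(), "pos"),
        ("an awful boring movie".split(), "neg")]
model = train(data)
print(classify("fantastic and great".split(), *model))  # expected: pos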

I have implemented a Naive Bayes Classifier in Python and you can find it on GitHub. If you have any improvements to add or any suggestions, let me know in the comments section below.


Refer :