Why can Natural Language Processing be hard?

Following are my notes for the video lectures of the IIT-K, NPTEL, NLP course. (Not organized properly yet, will do that soon.)
“Language is the foundation of civilization. It is the glue that holds a people together. It is the first weapon drawn in a conflict.” –  Arrival (2016).
Problems in NLP :
  • Ambiguity
  • Open Domain
  • Relation between Entities
  • Impractical Goals
Why is language processing hard ?
  • Lexical Ambiguity :
               Input : “Will Will Smith play the role of Will Wise in Will the wise ?”
               Output using spaCy :
               Will – will ( MD ) VERB
               Will – will ( VB ) VERB –> ( NNP ) PROPN
               Smith – smith ( NNP ) PROPN
               play – play ( VB ) VERB
               the – the ( DT ) DET
               role – role ( NN ) NOUN
               of – of ( IN ) ADP
               Will – will ( NNP ) PROPN
               Wise – wise ( NNP ) PROPN
               in – in ( IN ) ADP
               Will – will ( NNP ) PROPN
               the – the ( DT ) DET
               wise – wise ( JJ ) ADJ
               ? – ? ( . ) PUNCT
               Input : “Rose rose to put a rose on her rows of roses.”
               Output using spaCy :
               Rose – rose ( NNP ) PROPN
               rose – rise ( VBD ) VERB
               to – to ( TO ) PART
               put – put ( VB ) VERB
               a – a ( DT ) DET
               rose – rose ( NN ) NOUN
               on – on ( IN ) ADP
               her – her ( PRP$ ) ADJ
               rows – row ( NNS ) NOUN
               of – of ( IN ) ADP
               roses – rose ( NNS ) NOUN
               . – . ( . ) PUNCT
  • Structural Ambiguity
               Input : “The man saw the boy with the binoculars”
               Output using spaCy :
               Input : “Flying planes can be dangerous.”
               Output using spaCy :


  • Imprecise and Vague
               “It is very warm here” – The condition of being ‘warm’ isn’t associated with a fixed temperature range and can vary from person to person depending upon their threshold for warmth.
               “Q : Did your mom call your sister last night?”
               “A : I’m sure she must have” – Here the answer doesn’t convey accurate information; it is just a guess, which can be true or false. Hence, it doesn’t provide the questioner with useful information.
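A classic example to demonstrate ambiguity : “I made her a duck.” Among other readings, it can mean that I cooked a duck for her, that I cooked the duck she owns, or that I caused her to duck.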


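Catalan number : the number of possible binary parse trees for a sentence grows with its length as the Catalan numbers, so a parser can generate a very large number of parses for a long, ambiguous input.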
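To get a feel for how fast this grows, here is a small sketch of my own (not from the lectures). It assumes we only count binary bracketings of the words and ignore the labels on the nodes, in which case a sentence of n words has C(n-1) bracketings, where C is the Catalan number. The function names are just illustrative.

```python
from math import comb


def catalan(n):
    """n-th Catalan number: C_n = (2n choose n) / (n + 1)."""
    return comb(2 * n, n) // (n + 1)


def binary_parse_count(num_words):
    """Number of distinct binary bracketings of a sentence with num_words words."""
    if num_words < 1:
        raise ValueError("a sentence needs at least one word")
    return catalan(num_words - 1)


# Print the sentence length next to the number of possible bracketings.
for words in (2, 3, 5, 10, 20):
    print(words, binary_parse_count(words))
```

Even a 10-word sentence already has thousands of possible bracketings, which is why a parser cannot simply enumerate them all.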
  • Non Standard English / Informal English
  • Segmentation Issues
  • Idioms / Phrases / Pop culture references
  • Neologisms / New words
  • New senses of words
  • Named Entities
Empirical Laws
Function words : serve as important elements of the structure of sentences and contribute very little to the lexical meaning of the sentences.
e.g. prepositions, pronouns, auxiliary verbs, conjunctions, grammatical articles, particles, etc. [Closed-class words]
Content words : convey the information in a sentence, e.g. nouns, main verbs, adjectives and adverbs. [Open-class words]
Type-Token distinction : the distinction between a concept (the type) and the particular objects that are instances of that concept (the tokens). In a corpus, a type is a distinct word form and a token is one occurrence of it. “Will Will Smith play the role of Will Wise in Will the wise ?”
Here the word form “Will” is a single type that occurs as four tokens, and although the tokens look the same they play different roles (a modal verb and parts of proper names).
Type/Token ratio : the ratio of the number of different words to the number of running words in the corpus. It indicates how often, on average, a new ‘word form’ appears in the text corpus.
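Here is a minimal sketch of the ratio in plain Python. The tokenizer is a deliberately crude assumption of mine (lowercase everything and keep alphabetic runs), so “Rose” and “rose” count as the same type; a real tokenizer would give different numbers.

```python
import re


def tokenize(text):
    """Crude tokenizer (an assumption): lowercase, keep runs of letters."""
    return re.findall(r"[a-z]+", text.lower())


def type_token_ratio(tokens):
    """Number of distinct word forms divided by the number of running words."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


sentence = "Rose rose to put a rose on her rows of roses."
tokens = tokenize(sentence)
print("tokens:", len(tokens))
print("types :", len(set(tokens)))
print("TTR   :", round(type_token_ratio(tokens), 3))
```

A ratio close to 1 means almost every running word is a new word form; large corpora tend to have much lower ratios because frequent words repeat.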
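Heaps’ Law : |V| = KN^n, where |V| is the vocabulary size (number of types), N is the number of tokens in the corpus, and K and n are corpus-dependent constants (n is typically around 0.4 to 0.6).
So the number of unique types grows roughly as the square root of the corpus size.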
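A quick sketch of what the law predicts, next to a helper that measures the real vocabulary growth of a token list. The default values of k and n below are illustrative assumptions, not values fitted to any corpus.

```python
def heaps_vocabulary(num_tokens, k=44.0, n=0.5):
    """Predicted vocabulary size |V| = K * N^n (k and n are assumed constants)."""
    return k * num_tokens ** n


def vocabulary_growth(tokens):
    """Observed vocabulary size after each token, for comparison with the law."""
    seen, sizes = set(), []
    for tok in tokens:
        seen.add(tok)
        sizes.append(len(seen))
    return sizes


# With n = 0.5, a corpus four times larger has only about twice the vocabulary.
small = heaps_vocabulary(1_000_000)
large = heaps_vocabulary(4_000_000)
print(large / small)

print(vocabulary_growth("a b a c b d".split()))
```

To estimate K and n for a real corpus, one would fit a straight line to log |V| against log N using the observed growth curve.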
Text Processing
Tokenization : segmentation of a string into words
Sentence segmentation : boundary detection for a sentence. This task is challenging because a punctuation mark like the dot (.), which usually marks the end of a sentence, can also appear in abbreviations or initials, where it does not indicate the end of a sentence. A binary classifier can be built to decide between the two possible outcomes, i.e. sentence termination and not sentence termination. To build this classifier we can use a decision tree or some hand-coded rules (cringe).
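Here is a rough sketch of the hand-coded-rules version of that binary decision. The abbreviation list and the rules are my own assumptions for illustration; a decision tree trained on labeled data would learn similar tests from features such as the word before the dot and the case of the word after it.

```python
# Small, hypothetical abbreviation list; a real system would need a longer one.
ABBREVIATIONS = {"dr", "mr", "mrs", "ms", "prof", "inc", "etc", "e.g", "i.e", "vs"}


def is_sentence_end(text, i):
    """Binary decision: does the punctuation mark at text[i] end a sentence?"""
    # A mark glued to the next character (3.50, e.g.) is not a boundary.
    if i + 1 < len(text) and not text[i + 1].isspace():
        return False
    if text[i] in "!?":
        return True
    words_before = text[:i].split()
    prev = words_before[-1].lower() if words_before else ""
    if prev in ABBREVIATIONS:
        return False  # "Dr." or "Prof."
    if len(prev) == 1 and prev.isalpha():
        return False  # an initial such as "J."
    rest = text[i + 1:].lstrip()
    # End of text, or the next word does not start in lower case.
    return rest == "" or not rest[0].islower()


def split_sentences(text):
    sentences, start = [], 0
    for i, ch in enumerate(text):
        if ch in ".!?" and is_sentence_end(text, i):
            sentences.append(text[start:i + 1].strip())
            start = i + 1
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


sample = "Dr. J. Smith paid 3.50 dollars. He met Prof. Lee. Did it work? Yes!"
for sentence in split_sentences(sample):
    print(sentence)
```

The rules break easily (for example, an abbreviation that really does end a sentence), which is exactly why a learned classifier is usually preferred.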
Word Tokenization : Apostrophe, Named Entities, Abbreviations.
Handling Hyphens : End of the line, lexical, sententially determined
Case Folding : Lower case / Upper case
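Lemmatization : finding the root / head word (lemma) using morphology (stem words + affixes).
Stemming : reducing a word to its stem by stripping affixes with crude rules, without guaranteeing that the result is a dictionary word (e.g. Porter’s algorithm).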
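To make the difference between a stem and a lemma concrete, here is a sketch of case folding followed by only the first step (step 1a) of Porter’s algorithm, which handles plural endings. This is not the full stemmer; the remaining steps are left out.

```python
def case_fold(word):
    """Case folding: map every word to lower case."""
    return word.lower()


def porter_step_1a(word):
    """Step 1a of Porter's algorithm only: SSES -> SS, IES -> I, SS -> SS, S -> ''."""
    if word.endswith("sses"):
        return word[:-2]
    if word.endswith("ies"):
        return word[:-2]
    if word.endswith("ss"):
        return word
    if word.endswith("s"):
        return word[:-1]
    return word


for word in ("Caresses", "Ponies", "caress", "Cats"):
    print(word, "->", porter_step_1a(case_fold(word)))
```

Note that “ponies” becomes “poni”, which is a stem but not a word; a lemmatizer would return “pony” instead.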

Dependency Parsing in NLP


Syntactic Parsing or Dependency Parsing is the task of recognizing a sentence and assigning a syntactic structure to it. The most widely used syntactic structure is the parse tree, which can be generated using a parsing algorithm. Parse trees are useful in applications like grammar checking and, more importantly, they play a critical role in the semantic analysis stage. For example, to answer the question “Who is the point guard for the LA Lakers in the next game ?” we need to figure out its subject, objects and attributes to understand that the user wants the point guard of the LA Lakers specifically for the next game.

Now the task of syntactic parsing is quite complex because a given sentence can have multiple parse trees; we call these ambiguities. Consider the sentence “Book that flight.”, which can form multiple parse trees based on its ambiguous part-of-speech tags unless these ambiguities are resolved. Choosing the correct parse from the multiple possible parses is called syntactic disambiguation. Parsing algorithms like the Cocke-Kasami-Younger (CKY) algorithm, the Earley algorithm and chart parsing use a dynamic programming approach to deal with the ambiguity problem.
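To show the dynamic programming idea, here is a small CKY sketch that counts parses for “Book that flight”. The grammar and lexicon are a toy of my own, already in Chomsky Normal Form, and are only an assumption for illustration; “book” is listed as both a verb and a noun to mimic the part-of-speech ambiguity.

```python
from collections import defaultdict

# Toy lexicon and CNF rules (hypothetical, for illustration only).
LEXICON = {
    "book": {"Verb", "Noun", "Nominal"},
    "that": {"Det"},
    "flight": {"Noun", "Nominal"},
}
BINARY_RULES = [
    ("S", "Verb", "NP"),          # imperative sentence
    ("VP", "Verb", "NP"),
    ("NP", "Det", "Nominal"),
    ("Nominal", "Nominal", "Noun"),
]


def cky_parse_counts(words):
    """Return {nonterminal: number of parses} for the whole word sequence."""
    n = len(words)
    # chart[i][j] maps a nonterminal to its number of parses over words[i:j].
    chart = [[defaultdict(int) for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(words):
        for tag in LEXICON.get(word.lower(), ()):
            chart[i][i + 1][tag] += 1
    for span in range(2, n + 1):              # width of the span
        for i in range(0, n - span + 1):      # start of the span
            j = i + span
            for k in range(i + 1, j):         # split point
                for parent, left, right in BINARY_RULES:
                    left_count = chart[i][k].get(left, 0)
                    right_count = chart[k][j].get(right, 0)
                    if left_count and right_count:
                        chart[i][j][parent] += left_count * right_count
    return dict(chart[0][n])


counts = cky_parse_counts("Book that flight".split())
print(counts.get("S", 0), "parse(s) as a sentence")
```

The chart keeps both readings of “book” in its first cell, but only the verb reading combines with the noun phrase “that flight” into a full sentence, so the ambiguity is resolved by the structure rather than by guessing the tag up front.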
In this post, we will try out a few syntactic parsers from different libraries:

SpaCy :

The spaCy dependency parser provides token properties to navigate the generated dependency parse tree. The token.dep_ attribute (token.dep for its integer ID) gives the syntactic dependency relationship between a token and its head. The dependency label scheme is taken from ClearNLP. The generated parse tree follows all the properties of a tree: each child token has only one head token, although a head token can have multiple children. We can obtain the head token with the token.head property and its children with the token.children property. The subtree of a token can be extracted using the token.subtree property. Similarly, the ancestors of a token can be obtained with token.ancestors. To obtain the rightmost and leftmost tokens of a token’s syntactic descendants, token.right_edge and token.left_edge can be used. It is also worth mentioning that we can use token.nbor() to get a neighboring token. spaCy doesn’t provide an inbuilt tree representation, although you can use NLTK’s tree representation. Here’s a code snippet for it:

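import spacy
from nltk import Tree

# Assumption: the small English model is installed (python -m spacy download en_core_web_sm).
en_nlp = spacy.load("en_core_web_sm")

def tok_format(tok):
    # Label each node as Token_POSTag_DependencyTag.
    return "_".join([tok.orth_, tok.tag_, tok.dep_])

def to_nltk_tree(node):
    # Tokens with children become NLTK Trees; leaf tokens stay plain strings.
    if node.n_lefts + node.n_rights > 0:
        return Tree(tok_format(node), [to_nltk_tree(child) for child in node.children])
    return tok_format(node)

command = "Submit debug logs to project lead today at 9:00 AM"
en_doc = en_nlp(command)

for sent in en_doc.sents:
    to_nltk_tree(sent.root).pretty_print()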

Here’s the output format (Token_POS Tag_Dependency Tag):

Let’s try extracting the head word from a question to understand how dependency works. A headword in a question can be extracted using various dependency relationships. But for now, we will try to extract the Nominal Subject nsubj from the question as the headword. Here’s how you can get a subject from the sentence.

from spacy.symbols import nsubj, attr, NOUN, PROPN

# en_nlp is the spaCy pipeline used in the previous snippet.
head_word = "null"
question = "What films featured the character Popeye Doyle ?"
en_doc = en_nlp(question)
for sent in en_doc.sents:
    for token in sent:
        # Keep a noun that is the nominal subject (nsubj) or an attribute (attr).
        if token.dep in (nsubj, attr) and token.pos in (NOUN, PROPN):
            head_word = token.text
    print(question + " (" + head_word + ")")

Here we get the output with headword as “films” which is pretty close and you can improve its accuracy by detecting more dependency relationships and headword rules:

What films featured the character Popeye Doyle ? (films)

spaCy also has a dependency visualizer, displaCy. Here is the demo with our input question:



To install spaCy, refer to the post “Setting up Natural Language Processing Environment with Python”.

(Working on the NLTK section; will update as soon as possible.)

Further Reading :

The Process of Information Retrieval


A friend of mine published this really great post about Information Retrieval. I have reblogged it here.


Information Retrieval (IR) is the activity of obtaining information from large collections of information sources in response to a need.

The working of the Information Retrieval process is explained below:

  • The process of Information Retrieval starts when a user enters a query into the system through the graphical interface provided.
  • These user-defined queries are statements of the information needed, for example the queries issued by users in search engines.
  • In IR, a single query does not match one right data object; instead it matches several data objects in the collection, from which the most relevant documents are taken into consideration for further evaluation.
  • The relevant documents are ranked to find the one most related to the given query.
  • This is the key difference between database searching and Information Retrieval.
  • After the query is sent to the core of the system. This part has the access to the content management…
