Why can Natural Language Processing be hard?

Following are my notes for the video lectures of the IIT-K, NPTEL, NLP course. (They are not organized properly yet; I will tidy them up soon.)
“Language is the foundation of civilization. It is the glue that holds a people together. It is the first weapon drawn in a conflict.” –  Arrival (2016).
Problems in NLP :
  • Ambiguity
  • Open Domain
  • Relation between Entities
  • Impractical Goals
Why is language processing hard ?
  • Lexical Ambiguity :
               Input : “Will Will Smith play the role of Will Wise in Will the wise ?”
               Output using spaCy :
               Will – will ( MD ) VERB
               Will – will ( VB ) VERB –> ( NNP ) PROPN
               Smith – smith ( NNP ) PROPN
               play – play ( VB ) VERB
               the – the ( DT ) DET
               role – role ( NN ) NOUN
               of – of ( IN ) ADP
               Will – will ( NNP ) PROPN
               Wise – wise ( NNP ) PROPN
               in – in ( IN ) ADP
               Will – will ( NNP ) PROPN
               the – the ( DT ) DET
               wise – wise ( JJ ) ADJ
               ? – ? ( . ) PUNCT
               Input : “Rose rose to put a rose on her rows of roses.”
               Output using spaCy :
               Rose – rose ( NNP ) PROPN
               rose – rise ( VBD ) VERB
               to – to ( TO ) PART
               put – put ( VB ) VERB
               a – a ( DT ) DET
               rose – rose ( NN ) NOUN
               on – on ( IN ) ADP
               her – her ( PRP$ ) ADJ
               rows – row ( NNS ) NOUN
               of – of ( IN ) ADP
               roses – rose ( NNS ) NOUN
               . – . ( . ) PUNCT
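  • Structural Ambiguity
               Input : “The man saw the boy with the binoculars”
               Two readings : either the man used the binoculars to see the boy, or the boy was the one holding the binoculars.
               Input : “Flying planes can be dangerous.”
               Two readings : either the act of flying planes is dangerous, or planes that are flying are dangerous.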


  • Imprecise and Vague
               “It is very warm here” – The condition of being ‘warm’ isn’t associated with a fixed temperature range and can vary from person to person depending upon their threshold for warmness.
               “Q : Did your mom call your sister last night?”
               “A : I’m sure she must have” – Here the answer doesn’t convey accurate information; it is just a prediction, which can be true or false. Hence, it doesn’t provide the questioner with useful information.
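A classic example to demonstrate ambiguity : “I made her a duck.” It can mean, among other things, that I cooked a duck for her, that I cooked the duck belonging to her, that I created a (toy) duck for her, that I caused her to lower her head quickly, or that I turned her into a duck.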


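Catalan number : The number of possible binary bracketings (parse-tree shapes) of a sentence grows with its length as the Catalan numbers, so a parser may have to consider a very large number of parses for a long input sentence.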
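To get a feel for how fast this grows, here is a minimal Python sketch. It assumes we only count unlabeled binary tree shapes over the words (the helper names `catalan` and `binary_bracketings` are my own); a real grammar would license only a fraction of these shapes.

```python
from math import comb


def catalan(n: int) -> int:
    """Return the n-th Catalan number, C_n = (2n choose n) / (n + 1)."""
    return comb(2 * n, n) // (n + 1)


def binary_bracketings(num_words: int) -> int:
    """Count the binary tree shapes over a sequence of num_words words.

    A sequence of n words has C_(n-1) distinct binary bracketings.
    """
    if num_words < 1:
        raise ValueError("a sentence needs at least one word")
    return catalan(num_words - 1)


# Print the count for a few sentence lengths.
for n in (2, 3, 4, 5, 8):
    print(n, "words:", binary_bracketings(n), "bracketings")
```

Even an eight-word sentence such as “The man saw the boy with the binoculars” already has hundreds of possible bracketings before any grammar is applied.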
  • Non Standard English / Informal English
  • Segmentation Issues
  • Idioms / Phrases / Pop culture references
  • Neologisms / New words
  • New senses of words
  • Named Entities
Empirical Laws
Function words : serve as important elements of the structure of sentences but contribute very little to the lexical meaning of the sentences.
e.g. prepositions, pronouns, auxiliary verbs, conjunctions, grammatical articles, particles, etc. [Closed-class words]
Content words : Convey information
Type-Token distinction : the distinction between a concept (the type) and the particular objects that are instances of that concept (the tokens). In a corpus, a type is a distinct word form and a token is one occurrence of it. “Will Will Smith play the role of Will Wise in Will the wise ?”
Here ‘Will’ occurs four times, i.e. four tokens of a single type, even though the occurrences play different roles (a modal verb in one place, a proper noun in the others).
Type/Token ratio : the ratio of the number of different words to the number of running words in the corpus. It indicates how often, on average, a new ‘word form’ appears in the text corpus.
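As a rough illustration, the ratio can be computed in a few lines of Python. This is only a sketch: the regex tokenizer and the decision to lower-case everything (so that ‘Wise’ and ‘wise’ count as one type) are my own simplifying assumptions.

```python
import re


def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9]+(?:'[a-z]+)?", text.lower())


def type_token_ratio(text):
    """Number of distinct word forms divided by number of running words."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


sentence = "Will Will Smith play the role of Will Wise in Will the wise ?"
tokens = tokenize(sentence)
print(len(tokens), "tokens,", len(set(tokens)), "types")
print("type/token ratio:", round(type_token_ratio(sentence), 3))
```

Without case folding, ‘Wise’ and ‘wise’ would be counted as two different types, so the ratio depends on the preprocessing choices.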
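Heaps’ Law : |V| = K·N^β, where |V| is the vocabulary size (number of types), N is the number of tokens in the corpus, and K and β are corpus-dependent constants (typically 10 ≤ K ≤ 100 and β ≈ 0.4–0.6).
With β ≈ 0.5, the number of unique types grows roughly as the square root of the corpus size.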
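A tiny numeric sketch of this law in Python follows. The function name and the default constants (`k=10.0`, exponent `beta=0.5`) are illustrative assumptions, not values fitted to any real corpus.

```python
def heaps_vocab_size(num_tokens, k=10.0, beta=0.5):
    """Predicted vocabulary size for a corpus of num_tokens tokens."""
    return k * num_tokens ** beta


# With an exponent of 0.5, quadrupling the corpus only doubles the vocabulary.
for n in (10_000, 40_000, 160_000):
    print(n, "tokens ->", round(heaps_vocab_size(n)), "types (predicted)")
```

In practice the two constants would be estimated from the corpus at hand, for example by counting types at several corpus sizes and fitting a line in log-log space.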
Text Processing
Tokenization : segmentation of a string into words
Sentence segmentation : Boundary detection for sentences. This task is challenging because a punctuation mark like the dot (.), which usually marks the end of a sentence, can also mark abbreviations or initials, which do not indicate the end of a sentence. A binary classifier can be built to decide between the two possible outcomes, i.e. sentence termination or not sentence termination. To build this classifier we can use a decision tree or some hand-coded rules (cringe).
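Here is what the hand-coded-rules version might look like as a minimal Python sketch. The abbreviation list and the individual rules are my own assumptions and are far from complete; a decision tree would learn similar tests from labelled data instead.

```python
import re

# Illustrative abbreviation list, stored without the final dot.
ABBREVIATIONS = {"dr", "mr", "mrs", "ms", "prof", "etc", "e.g", "i.e", "vs"}


def is_sentence_end(text, i):
    """Binary decision: does the '.', '?' or '!' at text[i] end a sentence?"""
    if text[i] in "?!":
        return True
    before = re.search(r"(\S+)$", text[:i])
    word = before.group(1).lower() if before else ""
    after = text[i + 1:]
    if word in ABBREVIATIONS:
        return False  # "Dr." or "etc."
    if len(word) == 1 and word.isalpha():
        return False  # an initial such as "J."
    if word and word[-1].isdigit() and after[:1].isdigit():
        return False  # a decimal number such as 3.50
    if not after.strip():
        return True  # the dot is the last thing in the text
    following = re.match(r"\s+(\S)", after)
    if following is None:
        return False  # no whitespace after the dot
    nxt = following.group(1)
    return nxt.isupper() or nxt.isdigit() or nxt in "\"'“‘"


def split_sentences(text):
    """Cut the text after every punctuation mark classified as a boundary."""
    sentences, start = [], 0
    for i, ch in enumerate(text):
        if ch in ".?!" and is_sentence_end(text, i):
            sentences.append(text[start:i + 1].strip())
            start = i + 1
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


example = "Dr. Smith paid $3.50 for tea. J. K. Rowling wrote it. Really?"
for sentence in split_sentences(example):
    print(sentence)
```

The rules are easy to read but brittle: a sentence that really ends in a single letter (“That was I.”) is misclassified, which is exactly why hand-coded rules feel cringe.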
Word Tokenization : Apostrophe, Named Entities, Abbreviations.
Handling Hyphens : End of the line, lexical, sententially determined
Case Folding : Lower case / Upper case
Lemmatization : Finding the root / head word (lemma) using morphology (stems + affixes).
Stemming : Crudely chopping affixes off a word to reduce it to a stem, which need not be a valid lemma (e.g. Porter’s algorithm).
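To make the difference between a stem and a lemma concrete, here is a deliberately crude suffix stripper in Python. It is not Porter’s algorithm, just a toy with a suffix list I made up, combined with case folding.

```python
# Suffixes are tried in this order, longer ones first.
SUFFIXES = ("ingly", "edly", "ing", "ed", "ly", "es", "s")


def crude_stem(word, min_stem=3):
    """Case-fold the word and strip the first matching suffix."""
    word = word.lower()  # case folding
    for suffix in SUFFIXES:
        # Keep at least min_stem characters so short words survive intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


for w in ("Playing", "played", "rows", "roses", "rose", "bus"):
    print(w, "->", crude_stem(w))
```

Note that ‘roses’ is reduced to ‘ros’, which is not a word, whereas a lemmatizer would return ‘rose’. A stemmer also has no way of mapping the verb ‘rose’ to ‘rise’, as spaCy did in the output above.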

A Cognitive study of Lexicons in Natural Language Processing.


What are Lexicons ?

A word in any language is made up of a root or stem word and, optionally, one or more affixes. The spelling changes that occur when a stem and an affix are combined are governed by rules called orthographic rules, which define how a word is spelled when it is composed in the Morphological Parsing phase. A lexicon is a list of such stem words and affixes and is a vital requirement for constructing a Morphological Parser. Morphological parsing involves building up a word from its component morphemes, or breaking a word down into a structured representation of those morphemes. It is a necessary phase in spell checking, search term disambiguation in Web search engines, part-of-speech tagging and machine translation.

A simple lexicon would usually just consist of a list of every possible word and stem + affix combination in the language. But this is an inconvenient approach in real-time applications, since searching for and retrieving a specific word becomes a challenge owing to the unstructured format of the lexicon. If the lexicon of stems and affixes is given a proper structure, then building a word from it becomes somewhat simpler. So, what kind of structure are we talking about here? The most common structure used to model morphotactics is the Finite-State Automaton.

Let us look at a simple finite-state model for English nominal inflection:


Finite State Model for English nominal inflection

As stated in this FSM, the regular noun is our stem word and is concatenated with the plural suffix –s,

e.g. regular_noun(bat) + plural_suffix(s) = bats

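Now this FSM will fail on some exceptions, like foot => feet, mouse => mice, company => companies, etc. Irregular nouns such as foot and mouse have to be listed in the lexicon together with their plural forms, while spelling changes such as company => companies are where orthographic rules come into action: they define the specific spelling rules that apply when a particular stem is combined with an affix. With these additions, the FSM can be improved.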
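The idea can be sketched in Python as a tiny lexicon plus two spelling rules. This is a toy illustration and not a real finite-state transducer: the word lists, rule set and function names are all my own assumptions.

```python
# Toy lexicon: regular stems, plus irregular nouns listed with their plurals.
REGULAR_NOUNS = {"bat", "cat", "fox", "company"}
IRREGULAR_PLURALS = {"foot": "feet", "mouse": "mice"}


def apply_orthographic_rules(stem, suffix="s"):
    """Attach the plural suffix, fixing the spelling where needed."""
    if stem.endswith("y") and len(stem) > 1 and stem[-2] not in "aeiou":
        return stem[:-1] + "ie" + suffix  # company + s -> companies
    if stem.endswith(("s", "x", "z", "ch", "sh")):
        return stem + "e" + suffix  # fox + s -> foxes
    return stem + suffix  # bat + s -> bats


def pluralize(stem):
    """Generation: start -> noun stem -> plural form."""
    if stem in IRREGULAR_PLURALS:
        return IRREGULAR_PLURALS[stem]
    if stem in REGULAR_NOUNS:
        return apply_orthographic_rules(stem)
    raise KeyError(f"{stem!r} is not in the lexicon")


def parse(word):
    """Recognition: return (stem, features), or None if the word is rejected."""
    if word in REGULAR_NOUNS or word in IRREGULAR_PLURALS:
        return (word, "N SG")
    for singular, plural in IRREGULAR_PLURALS.items():
        if word == plural:
            return (singular, "N PL")
    for stem in REGULAR_NOUNS:
        if apply_orthographic_rules(stem) == word:
            return (stem, "N PL")
    return None


for w in ("bats", "companies", "feet", "foots"):
    print(w, "->", parse(w))
```

Because the irregular nouns are looked up before any suffix is attached, ‘foots’ is rejected while ‘feet’ is analysed as the plural of ‘foot’.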

Cognitive Computing :

“It addresses complex situations that are characterized by ambiguity and uncertainty; in other words it handles human kinds of problems.”

– Cognitive Computing Consortium

“Cognitive computing refers to systems that learn at scale, reason with purpose and interact with humans naturally. Rather than being explicitly programmed, they learn and reason from their interactions with us and from their experiences with their environment.”

– Dr. John E. Kelly III; Computing, cognition and the future of knowing, IBM.

Or, simply put:

“Cognitive computing is the simulation of human thought processes in a computerized model. Cognitive computing involves self-learning systems that use data mining, pattern recognition and natural language processing to mimic the way the human brain works. The goal of cognitive computing is to create automated IT systems that are capable of solving problems without requiring human assistance.”


If cognitive computing is the simulation of the human thought process in a computerized model, then it could help resolve many of the ambiguity issues we face in Natural Language Processing. Let us first try to reason about how the human mind resolves ambiguity in Morphological Parsing.

How does the human mind construct its Mental Lexicon?

Let us say I give you a word – ‘cat’. The human brain immediately recognizes that the given word is a noun relating to a cute little animal with fur. It is also able to recall its pronunciation. But sometimes it is unable to recognize the given word and recall all the information relating to it. For example, if you see the word ‘wug’, your mind might be able to figure out its pronunciation, but it would fail to assign a part of speech or a meaning to it. But if I tell you that it is a noun and refers to a small creature, you can use it in a sentence and you would know its part of speech, e.g. “I saw a wug today.”

Similarly, for a word like ‘cluvious’, even if you don’t know its meaning, you may be able to infer some information about it, because most words in English that have this form are adjectives (ambitious, curious, anxious, envious…). This might help you predict its meaning when it occurs in a sentence, for example “You look cluvious today”. From the example sentence, one can reasonably infer that ‘cluvious’ describes the physical appearance of an entity.

You can even reason about words that you haven’t seen before, like ‘traftful’ and ‘traftless’, and figure out that they are most likely opposites. This is because the given pair of words resembles many pairs of words in English that have this particular structure and an antonym relationship.
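These intuitions can be imitated with a few suffix heuristics. The sketch below is only an illustration in Python: the suffix table and the function names are my own assumptions, and such rules will of course misfire on many real words (for example ‘family’ ends in ‘-ly’ but is not an adverb).

```python
# Hypothetical suffix -> part-of-speech hints.
SUFFIX_POS = {
    "ious": "ADJ",
    "ous": "ADJ",
    "ful": "ADJ",
    "less": "ADJ",
    "ness": "NOUN",
    "ly": "ADV",
}


def guess_pos(word):
    """Guess a part of speech for an unseen word from its suffix."""
    word = word.lower()
    # Try longer suffixes first so that "ious" wins over "ous".
    for suffix in sorted(SUFFIX_POS, key=len, reverse=True):
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return SUFFIX_POS[suffix]
    return None  # no hint, as with 'wug'


def likely_antonyms(a, b):
    """True if the two words share a stem and differ only in -ful vs -less."""
    for x, y in ((a, b), (b, a)):
        if x.endswith("ful") and y.endswith("less"):
            stem_x, stem_y = x[:-3], y[:-4]
            if stem_x and stem_x == stem_y:
                return True
    return False


print(guess_pos("cluvious"))
print(guess_pos("wug"))
print(likely_antonyms("traftful", "traftless"))
```

For ‘wug’ no suffix gives a hint, which matches the observation above that we need extra context before we can use the word.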

With the observations stated above, one can build a Morphological Parser with higher efficiency. You can also read my other post on how to set up a Natural Language Processing environment in Python.
