The Process of Information Retrieval


A friend of mine published this really great post about Information Retrieval. I have reblogged it here.

AMIT GUNJAL

Information Retrieval (IR) is the activity of obtaining information from large collections of information sources in response to an information need.


The Information Retrieval process works as follows:

  • The process of Information Retrieval starts when a user enters a query into the system through a graphical interface.
  • These user-defined queries are statements of the information need, for example, the queries entered by users in search engines.
  • In IR, a single query does not match exactly one right data object; instead, it matches several data objects in the collection, from which the most relevant documents are taken for further evaluation.
  • The relevant documents are then ranked to find the ones most related to the given query (a small ranking sketch follows this excerpt).
  • This is the key difference between database searching and Information Retrieval.
  • The query is then sent to the core of the system, which has access to the content management…
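
To make the ranking step concrete, here is a minimal sketch (my own illustration, not part of the original post) of scoring a small document collection against a query with TF-IDF and cosine similarity; it assumes scikit-learn is installed:

# Illustrative ranking of documents against a query (assumes scikit-learn is available).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "Information retrieval finds relevant documents in large collections.",
    "Database systems return exact matches for structured queries.",
    "Search engines rank web pages by relevance to the user query.",
]
query = "how do search engines rank relevant documents"

# Build TF-IDF vectors for the documents and the query in the same vector space.
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)
query_vector = vectorizer.transform([query])

# Score every document against the query and print them from most to least relevant.
scores = cosine_similarity(query_vector, doc_vectors).ravel()
for rank, idx in enumerate(scores.argsort()[::-1], start=1):
    print(rank, round(float(scores[idx]), 3), documents[idx])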


A Cognitive Study of Lexicons in Natural Language Processing


What are Lexicons?

A word in any language is made of a root or stem word and an affix. These affixes are usually governed by rules called orthographic rules, which define the spelling rules for word composition in the Morphological Parsing phase. A lexicon is a list of such stem words and affixes and is a vital requirement for constructing a Morphological Parser. Morphological parsing involves building up or breaking down a structured representation of component morphemes to form a meaningful word or a stem word. It is a necessary phase in spell checking, search-term disambiguation in web search engines, part-of-speech tagging, and machine translation.

A simple lexicon would usually just consist of a list of every possible word and stem + affix combination in the language. But this is an inconvenient approach in real-time applications, as searching for and retrieving a specific word becomes a challenge owing to the unstructured format of the lexicon. If a proper structure is given to the lexicon of stems and affixes, then building a word from it becomes much simpler. So, what kind of structure are we talking about here? The most common structure used in modeling morphotactics is the Finite-State Automaton.

Let us look at a simple finite-state model for English nominal inflection:

[Figure: Finite-state model for English nominal inflection]

As this FSM states, the regular noun is our stem word and is concatenated with the plural suffix -s,

e.g. regular_noun(bat) + plural_suffix(s) = bats

Now, this FSM will fail on exceptions like foot => feet, mouse => mice, company => companies, etc. This is where orthographic rules come into action: they define specific spelling rules for the particular stems that are exceptions. Using these rules, the FSM can be improved.
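
As a rough illustration (my own toy sketch, not a full morphological parser), the FSM above together with a few orthographic exception rules could be approximated in Python like this:

# A toy finite-state-style pluralizer for English nominal inflection.
# The exception table stands in for orthographic rules; real parsers use full FSTs.

IRREGULAR = {"foot": "feet", "mouse": "mice", "goose": "geese"}

def pluralize(stem):
    """Concatenate the plural suffix, applying simple orthographic rules first."""
    if stem in IRREGULAR:                           # lexical exceptions (foot => feet)
        return IRREGULAR[stem]
    if stem.endswith("y") and stem[-2:-1] not in "aeiou":
        return stem[:-1] + "ies"                    # company => companies
    if stem.endswith(("s", "x", "z", "ch", "sh")):
        return stem + "es"                          # fox => foxes
    return stem + "s"                               # regular noun + plural suffix -s

print(pluralize("bat"))      # bats
print(pluralize("company"))  # companies
print(pluralize("mouse"))    # mice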

Cognitive Computing:

“It addresses complex situations that are characterized by ambiguity and uncertainty; in other words it handles human kinds of problems.”

– Cognitive Computing Consortium

“Cognitive computing refers to systems that learn at scale, reason with purpose and interact with humans naturally. Rather than being explicitly programmed, they learn and reason from their interactions with us and from their experiences with their environment.”

– Dr. John E. Kelly III; Computing, cognition and the future of knowing, IBM.

Or, simply, we can say:

“Cognitive computing is the simulation of human thought processes in a computerized model. Cognitive computing involves self-learning systems that use data mining, pattern recognition and natural language processing to mimic the way the human brain works. The goal of cognitive computing is to create automated IT systems that are capable of solving problems without requiring human assistance.”

– TechTarget

If cognitive computing is the simulation of human thought processes in a computerized model, then it would solve most of the ambiguity issues we face in Natural Language Processing. Let us first try to reason about how the human mind resolves ambiguity in Morphological Parsing.

How does the human mind construct its Mental Lexicons?

Let us say I give you a word: ‘cat’. Your brain immediately recognizes that the given word is a noun referring to a cute little animal with fur, and it can also recall its pronunciation. But sometimes it is unable to recognize a given word and recall the information relating to it. For example, if you see the word ‘wug’, your mind might be able to figure out its pronunciation, but it would fail to assign a part of speech or a meaning to it. Yet if I tell you that it is a noun and is a small creature, you can use it in a sentence and you would know its part of speech, e.g. “I saw a wug today.”

Similarly, with a word like ‘cluvious’, even if you don’t know its meaning, you may be able to infer some information about it, because most English words of this form are adjectives (ambitious, curious, anxious, envious…). That might help you predict its meaning when it occurs in a sentence, for example, “You look cluvious today.” From the example sentence, one can easily infer that ‘cluvious’ describes the physical appearance of an entity.

You can even reason about words that you haven’t seen before, like ‘traftful’ and ‘traftless’, and figure out that they are most likely opposites. This is because the given pair resembles many pairs of English words that have this particular structure and an antonym relationship.
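
As a rough sketch of this kind of affix-based guessing (my own illustration, with a made-up hint table rather than a real lexicon), a program can draw similar inferences from word endings alone:

# Guess coarse properties of unseen words from their affixes alone.
SUFFIX_HINTS = {
    "ous": "likely an adjective (curious, anxious, envious)",
    "ful": "likely an adjective; pairs with a '-less' antonym (hopeful/hopeless)",
    "less": "likely an adjective; pairs with a '-ful' antonym (hopeless/hopeful)",
    "ness": "likely a noun derived from an adjective (happiness)",
}

def guess(word):
    for suffix, hint in SUFFIX_HINTS.items():
        if word.endswith(suffix):
            return word + ": " + hint
    return word + ": no affix cue available"

print(guess("cluvious"))   # flagged as a likely adjective
print(guess("traftless"))  # flagged as the '-less' half of a possible antonym pair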

With the observations stated above, one can build a Morphological Parser with higher efficiency. You can also read my other post on how to set up a Natural Language Processing environment in Python.

Further Reading:

Setting up Natural Language Processing Environment with Python


In this blog post, I will be discussing the tools of Natural Language Processing in a Linux environment, although most of this also applies to Windows and Mac. So, let’s get started with some prerequisites.
We will use Python’s pip package installer to install the various Python modules.

$ sudo apt install python-pip
$ pip install --upgrade pip

So I am going to talk about three NLP tools in Python that I have worked with so far.

Note: It is highly recommended to install these modules in a virtual environment. Here is how you do that: Common Python Tools: Using virtualenv, Installing with Pip, and Managing Packages.

  1. Natural Language Toolkit:

    NLTK can be seen as a library written for educational purposes and hence is great to experiment with. As its website itself notes, NLTK has been called “a wonderful tool for teaching, and working in, computational linguistics using Python,” and “an amazing library to play with natural language.” To install NLTK, we use pip:

    $ sudo pip install -U nltk

    NLTK also comes with its own corpora, which can be downloaded as follows:

    >>> import nltk
    >>> nltk.download()

    We can also interface NLTK with our own corpora. For detailed usage of the NLTK API, one can refer to its official guide, “Natural Language Processing with Python” by Steven Bird. I will be covering more about NLTK and its API usage in upcoming posts, but for now, we will settle for its installation.
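
    Once the corpora are downloaded, a quick sanity check might look like the following minimal sketch; it assumes the tokenizer and tagger models were fetched via nltk.download():

    # Minimal NLTK sketch: tokenize a sentence and tag parts of speech.
    import nltk

    sentence = "NLTK makes it easy to experiment with natural language."
    tokens = nltk.word_tokenize(sentence)   # requires the 'punkt' tokenizer models
    tagged = nltk.pos_tag(tokens)           # requires the perceptron tagger models
    print(tagged)                           # e.g. [('NLTK', 'NNP'), ('makes', 'VBZ'), ...]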

  2. spaCy:

    In the words of Matthew Honnibal (author of spaCy):

    “There’s a real philosophical difference between spaCy and NLTK. spaCy is written to help you get things done. It’s minimal and opinionated. We want to provide you with exactly one way to do it — the right way. In contrast, NLTK was created to support education. Most of what’s there is for demo purposes, to help students explore ideas. spaCy provides very fast and accurate syntactic analysis (the fastest of any library released), and also offers named entity recognition and ready access to word vectors. You can use the default word vectors, or replace them with any you have.

    What really sets it apart, though, is the API. spaCy is the only library that has all of these features together, and allows you to easily hop between these levels of representation. Here’s an example of how that helps: Tutorial: Search Reddit for comments about Google doing something. spaCy also ensures that the annotations are always aligned to the original string, so you can easily print mark-up: Tutorial: Mark all adverbs, particularly for verbs of speech.”

    The benchmarks are provided on its official website.
    spaCy is my favorite NLP tool, and I have tried a number of things with it. In the upcoming posts I will delve into each of its APIs, so keep an eye out here (spaCy).
    Installation:

    $ sudo pip install -U spacy
    $ sudo python -m spacy.en.download

    What makes spaCy easy to work with is its well-maintained and well-presented documentation. They have also made some great demos, like displaCy for the dependency parser and the named entity recognizer. Check them out here.
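
    As a quick taste, here is a minimal sketch of what working with spaCy looks like; it assumes the English model installed above loads as ‘en’ (the exact loading call depends on the spaCy version):

    # Minimal spaCy sketch: parse a sentence, then inspect tokens and entities.
    import spacy

    nlp = spacy.load('en')   # English model installed above; call may differ by spaCy version
    doc = nlp(u"Google released a new NLP library in London.")

    for token in doc:
        print(token.text, token.pos_, token.dep_)   # word, part-of-speech tag, dependency label

    for ent in doc.ents:
        print(ent.text, ent.label_)                 # named entities, e.g. Google / ORG, London / GPE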

  3. TextBlob:

    It is more of a text processing library than an NLP library. It is simple and lightweight, and it is your go-to library when you have to perform basic NLP operations such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.
    Installation:

    $ sudo pip install -U textblob
    $ sudo python -m textblob.download_corpora

    TextBlob can be used to check word definitions and synonyms, and the similarity between different words. I will be doing an independent post on TextBlob with WordNet and its SynSets.
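
    For example, here is a minimal sketch (assuming the corpora above were downloaded) that checks tags, sentiment, and WordNet definitions:

    # Minimal TextBlob sketch: POS tags, sentiment, and WordNet definitions.
    from textblob import TextBlob, Word

    blob = TextBlob("TextBlob is a simple and lightweight text processing library.")
    print(blob.tags)         # part-of-speech tags for each word
    print(blob.sentiment)    # Sentiment(polarity=..., subjectivity=...)

    word = Word("library")
    print(word.definitions[:2])   # WordNet definitions
    print(word.synsets[:2])       # WordNet synsets, usable for synonym and similarity checks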

Check out my other posts on Natural Language Processing.

Further Reading:

If you have any problems with the installations, or any other comments, let me know below in the comments section. Meanwhile, you can also check out my other post on a Machine Learning classification algorithm: Naive Bayes Classifier in Python