My last blog post was on 3rd July 17, so it has been a long time since I posted anything here. I do have 5 posts in my drafts that I never got around to completing over the last year, but I hope to publish them soon. On June 17 I moved to Bengaluru, and since then I have been busy trying out a variety of things. I now have a clear idea of what I want to keep doing in the future, and this post is me compiling those things here.
ADAM Question Answering System
What has been done?
Since January I have been working on the ADAM Qas project (5hirish/adam_qas), focusing on improving some basic modules. The project previously relied on the Python wrapper for the official Wikipedia API to fetch and scrape information, and I wasn’t very satisfied with it. It had issues with searching ambiguous terms, and the project needed more fine-grained scraping of the articles. So I removed the Wikipedia Python module and called the Wikipedia APIs directly to search terms and fetch the articles. Then, with the help of XPath queries, I extract and separate the structured information from the unstructured information, which lets me store the tabular data and information cards in the articles as JSON objects. I am planning to write a separate blog post on this and will go into the implementation there.
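As a rough illustration of the XPath step described above, here is a minimal sketch that separates an information card from the body text and stores it as a JSON-ready object. It is not the project's actual code: the sample markup, the "infobox" class name and the function name split_structured are assumptions for the example. It uses the standard library's ElementTree, which supports only a subset of XPath and needs well-formed markup, so real Wikipedia HTML would likely need a more forgiving parser.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical, simplified article markup (well-formed so ElementTree can parse it).
SAMPLE_HTML = """
<div>
  <table class="infobox">
    <tr><th>Born</th><td>14 March 1879</td></tr>
    <tr><th>Fields</th><td>Physics</td></tr>
  </table>
  <p>Albert Einstein was a theoretical physicist.</p>
</div>
"""


def split_structured(html):
    """Separate infobox rows (structured) from paragraphs (unstructured)."""
    root = ET.fromstring(html)

    infobox = {}
    # XPath: every row of the table whose class attribute is exactly "infobox".
    for row in root.findall(".//table[@class='infobox']/tr"):
        key = row.find("th")
        value = row.find("td")
        if key is not None and value is not None:
            infobox["".join(key.itertext()).strip()] = "".join(value.itertext()).strip()

    # XPath: every paragraph anywhere below the root.
    paragraphs = ["".join(p.itertext()).strip() for p in root.findall(".//p")]
    return {"infobox": infobox, "text": paragraphs}


result = split_structured(SAMPLE_HTML)
print(json.dumps(result, indent=2))
```

The returned dictionary keeps the card as key/value pairs and the prose as a list of strings, so each part can be stored and queried separately.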
The next problem was that the project lacked a good storage strategy, so I chose Elasticsearch to store the extracted structured and unstructured data. Working with Elasticsearch was a lot of fun, though I messed up the mappings many times during mapping updates. I also had to use a custom language analyzer because the default one did not perform well at stemming and I wanted a stricter stemmer. More on this is discussed in GitHub Issue #20 Elasticsearch with custom English analyzer.
Using Elasticsearch enabled me to improve and fine-tune my search query with negations and boolean operators, and it also takes care of scoring and ranking. More on this in GitHub Issue #19 Elasticsearch Full-Text search strategies.
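For readers unfamiliar with custom analyzers, the sketch below builds index settings that rebuild the English analyzer from its parts with a swappable stemmer. The analyzer name custom_english, the filter name custom_english_stemmer and the choice of the minimal_english stemmer are assumptions for illustration. The configuration the project actually uses is the one discussed in Issue #20 and may differ.

```python
import json


def build_index_settings(stemmer_language="minimal_english"):
    """Return index settings defining a custom English analyzer.

    The stemmer language is a parameter so different stemmers can be compared;
    "minimal_english" is only an assumed default for this sketch.
    """
    return {
        "settings": {
            "analysis": {
                "filter": {
                    "english_stop": {"type": "stop", "stopwords": "_english_"},
                    "english_possessive_stemmer": {
                        "type": "stemmer",
                        "language": "possessive_english",
                    },
                    "custom_english_stemmer": {
                        "type": "stemmer",
                        "language": stemmer_language,
                    },
                },
                "analyzer": {
                    "custom_english": {
                        "type": "custom",
                        "tokenizer": "standard",
                        # Filters are applied in this order.
                        "filter": [
                            "english_possessive_stemmer",
                            "lowercase",
                            "english_stop",
                            "custom_english_stemmer",
                        ],
                    }
                },
            }
        }
    }


settings_body = build_index_settings()
print(json.dumps(settings_body, indent=2))
```

A body like this would be sent when creating the index, and text fields would then name custom_english as their analyzer in the mapping.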
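To make the idea of negations and boolean operators concrete, here is a small sketch that builds an Elasticsearch bool query in which some terms must match and others must not. The field name content and the helper build_search_query are hypothetical. The query construction used in the project is the one described in Issue #19.

```python
import json


def build_search_query(required_terms, excluded_terms, field="content"):
    """Build a bool query: every required term must match, no excluded term may."""
    return {
        "query": {
            "bool": {
                "must": [{"match": {field: term}} for term in required_terms],
                "must_not": [{"match": {field: term}} for term in excluded_terms],
            }
        }
    }


# Example: articles about the Python language, but not about snakes.
query_body = build_search_query(["python", "language"], ["snake"])
print(json.dumps(query_body, indent=2))
```

Elasticsearch scores and ranks the documents that satisfy the must clauses, while must_not clauses only filter documents out and do not affect the score.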
What’s in the pipeline?
The project is still very immature, but it has a lot of potential and the community on GitHub is starting to notice it. So I think it is better if I discuss the current state of the project and what is next in its pipeline. I believe full transparency is very important for an Open Source project to thrive.
The most pressing issue right now is querying the structured data stored in Elasticsearch. This structured data usually holds the answers to factual questions. I have stumbled across JMESPath, a query language for JSON, and it seems quite promising, so I will pick it up as soon as I have the bandwidth.
Other things I have planned include improving the answer extraction: I am not satisfied with the Vector Space model and am looking at alternatives. I also want to replace the rule-based query constructor with an unsupervised one, which would also let us benefit from the question’s class/category.
In April I discovered Kenneth Reitz’s repo, twitter scraper, but unfortunately it supported only Python 3.6+ because it depended on the requests_html module. So I decided to build my own Twitter scraper purely using XPath queries and release it on PyPI. It extracts each tweet with all of its meta-data, mentions, hashtags, and external links. To demonstrate the usage of this module, I have also added a few examples as Jupyter notebooks, such as a Tweet generator using a Markov Chain and Gensim Topic Modeling using the Latent Dirichlet Allocation model. You can check out the repo at 5hirish/tweet_scrapper.
pip install tweetscrape
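To give a feel for the Markov Chain tweet generator mentioned above, here is a minimal word-level sketch. It is not the notebook from the repo and it does not call the tweetscrape module: the sample tweets are made up, and the function names build_chain and generate are assumptions for this example.

```python
import random
from collections import defaultdict

# Made-up sample tweets standing in for scraped text.
SAMPLE_TWEETS = [
    "python makes scraping tweets fun",
    "python makes topic modeling easy",
    "scraping tweets with xpath is fun",
]


def build_chain(tweets):
    """Map every word to the list of words that follow it in the tweets."""
    chain = defaultdict(list)
    for tweet in tweets:
        words = tweet.split()
        for current, following in zip(words, words[1:]):
            chain[current].append(following)
    return chain


def generate(chain, start, max_words=10, rng=None):
    """Walk the chain from a start word, picking a random successor each step."""
    rng = rng or random.Random()
    words = [start]
    # Stop at the length limit or when the last word has no known successor.
    while len(words) < max_words and words[-1] in chain:
        words.append(rng.choice(chain[words[-1]]))
    return " ".join(words)


chain = build_chain(SAMPLE_TWEETS)
print(generate(chain, "python", rng=random.Random(0)))
```

Because successors are stored with repetition, word pairs that occur more often in the tweets are proportionally more likely to be picked.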
My perspective on Django and Flask
Last year I had the chance to work with Django on a project, and this year I am working with Flask. I was very dismissive of Flask and worshiped Django (exaggerating) until a few months ago, when a friend of mine convinced me to give it a try for at least a day. After a week of playing around with Flask, I have learned to keep an open mind toward things. I loved that Flask is lightweight, very easy to learn, and gives you a lot of freedom, and that you are building your APIs the minute after you set up the project. Setting up a Django project feels like a ritual to me, and I feel guilty if I skip setting something up even when it is unnecessary. Most things in Django are already set up and ready to go, which leaves you very little room to customize. Which framework to use really depends on your use case, and comparing them toe-to-toe is just pointless.
I have been working on a personal project of mine and hope to release a Developer Preview by November, so most of my efforts are dedicated to it, but I’ll try my best to blog more often and squeeze ADAM Qas’s development in between. I am very excited about this project and will post an update here when it’s out.
Recently I have also picked up a new hobby, trekking. I am loving it a lot and I am trying to fit in a trek once a month.