How We Approached The Allen A.I. Challenge on Kaggle

Posted by Rob May on Jan 11, 2016 11:11:00 AM


When I started a machine intelligence company, the first thing a VC said to me was "You can't do it. There aren't enough good data scientists out there, and you won't be able to find any." He was wrong, of course. But I knew that as we went to raise money, VCs would ultimately ask about the team: how do they know the Talla data science team is up to the task? Most VCs like to back teams with Stanford engineering degrees, Harvard MBAs, and a stint at Google or Facebook. I prefer to hire weirdos, and I knew VCs wouldn't accept them on credentials alone, because I'm not hiring the kind of person they want. So one way to get around this question was to take on a task that shows we have a good team, and that is why we decided to participate in the Allen A.I. Challenge on Kaggle. This post is about how we approached that problem.

About the Allen A.I. Challenge

The Allen A.I. Institute was set up by Microsoft co-founder Paul Allen to work on problems in artificial intelligence. The group decided to sponsor a Kaggle challenge, which, if you aren't familiar with Kaggle, means data scientists come to the site to compete for a cash prize. In this case, the prize is $80,000.

The goal of the Allen A.I. challenge is to answer questions from an 8th grade science test as best you can. There are some specific rules of the challenge, which you can read here if you are interested, but I won't go into them, because what this post is really about is how you approach a problem like this. More specifically, given a bunch of multiple choice questions from an 8th grade science test, how do you choose the right answer?

Before we dive into the approaches we have taken, I will point out one particularly difficult piece of this problem that is not as common in Kaggle challenges: the validation set has a bunch of nonsensical questions mixed in with the real ones, and only the real ones are graded.

Our team is the Long Short Term Manatees, a play on the LSTM acronym for Long Short Term Memory, a popular recurrent neural network model used for NLP tasks. I have no idea where Manatees came from, but that's what happens when you hire weirdos, so I just rolled with it.

Why Deep Learning Won't Work On A Problem Like This

If you don't know much about machine learning other than what you read in the tech press, your first thought is probably "just throw deep learning at it."  Right?  Not exactly.

The Allen AI challenge is a difficult use case for deep learning because answering questions depends heavily on the structure of natural language questions, as well as on an accumulated knowledge base. Recently there has been an increased use of 'word vectors' that allow deep learning to perform many NLP tasks effectively. In this case, however, converting question words to vectors is insufficient because the resulting representation does not capture the compositional nature of language. For example, take the question "Which of the following is not a way that plants extract energy from their environment?". The word 'not' changes the meaning of the sentence. "Their environment" is a unit in the sentence that also references the earlier "plants." Learning to parse sentences like these is a learning task in its own right, and manually specified grammars for parsing language are currently hard to beat statistically.
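
Here is a small illustration of that first point, not part of our pipeline: with a plain bag-of-words average of pretrained word vectors, inserting "not" barely moves the sentence representation. It assumes gensim's downloader and the public "glove-wiki-gigaword-50" vectors; any pretrained vectors would show a similar effect.

```python
import numpy as np
import gensim.downloader as api

# Publicly available GloVe vectors, used here only for illustration.
vectors = api.load("glove-wiki-gigaword-50")

def average_vector(sentence):
    """Average the vectors of the in-vocabulary tokens in a sentence."""
    tokens = [t for t in sentence.lower().split() if t in vectors]
    return np.mean([vectors[t] for t in tokens], axis=0)

a = average_vector("which of the following is a way that plants extract energy")
b = average_vector("which of the following is not a way that plants extract energy")

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cosine)  # typically very close to 1.0, even though the two questions ask opposite things
```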

Compounding the difficulty of this task is the problem of using acquired knowledge. A person attempting to answer the questions on the test has an internal model of the facts that the question corresponds to: plants exist in a given environment and extract energy in several ways. People are able to recall information relevant to the question and organize it in a way that helps answer it. Research pairing deep learning with memory is an ongoing effort, but, similar to natural language processing, shortcuts such as data retrieval are often still more efficient. For example, most search engines encode large bodies of facts in a regular, easily processed form rather than trying to extract information from a representation inside a neural network. Taken together, the problems of natural language processing and utilizing preexisting knowledge bases create a difficult challenge that is often easier to tackle with hand-crafted modeling and engineering than with end-to-end deep learning. Maybe someday deep learning will solve this problem, but today is not that day.

Where To Start

The first thing to do in a Kaggle competition like this is to try to match the benchmark using the same approach. In this case, the benchmark approach used basic information retrieval: build a knowledge base of information relevant to the questions you are trying to answer, then use a query- or search-based strategy against it.

In general, using Information Retrieval for Question Answering requires multiple steps. For example, to answer a question like "Which example describes a learned behavior in a dog?", one would have to do the following:

1. Create or extract a knowledge base containing enough information to answer the questions.

2. Index the knowledge base so that relevant documents can be retrieved.

3. From the retrieved documents, select a sentence or phrase containing the answer.

In our case, we used Information Retrieval methods for ranking question-answer pairs. That is, we already have the four possible answers; we just need to determine which of them is most relevant. To do so, we extracted raw text data from Wikipedia and ck12.org to create our knowledge base for answering questions. We then indexed it using an Information Retrieval library called 'whoosh'. After indexing, the text of each question-answer pair is preprocessed into a query to match documents, using BM-25 scoring for the matched documents. The pair with the highest score is our model's guess.
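
Here is a minimal sketch of that scoring setup using whoosh. The schema, the punctuation stripping, and the choice to sum the BM-25 scores of the top hits are illustrative simplifications, not our exact pipeline.

```python
import os
import re

from whoosh import index, scoring
from whoosh.fields import ID, TEXT, Schema
from whoosh.qparser import OrGroup, QueryParser

schema = Schema(doc_id=ID(stored=True), body=TEXT(stored=True))

def build_index(index_dir, documents):
    """Index (doc_id, text) pairs extracted from the knowledge base."""
    os.makedirs(index_dir, exist_ok=True)
    ix = index.create_in(index_dir, schema)
    writer = ix.writer()
    for doc_id, text in documents:
        writer.add_document(doc_id=doc_id, body=text)
    writer.commit()
    return ix

def score_answers(ix, question, answers, limit=10):
    """Score each answer choice by the BM-25 relevance of its question+answer query."""
    scores = {}
    with ix.searcher(weighting=scoring.BM25F()) as searcher:
        parser = QueryParser("body", schema=ix.schema, group=OrGroup)
        for label, answer in answers.items():
            # Strip punctuation so question marks are not parsed as wildcards.
            text = re.sub(r"[^\w\s]", " ", "%s %s" % (question, answer))
            hits = searcher.search(parser.parse(text), limit=limit)
            # Sum the scores of the top matching documents as the pair's score.
            scores[label] = sum(hit.score for hit in hits)
    return max(scores, key=scores.get), scores
```

Calling score_answers with a question and its four answer choices returns the label of the highest-scoring pair, which is the model's guess.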

What worked:

Our in-house preprocessing and tokenizing, with some fine-tuning, make our model better than the competition's baseline.
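
I won't detail the fine-tuned pipeline here, but to give a flavor of the kind of preprocessing involved, a plausible baseline uses whoosh's stock stemming analyzer (lowercasing, stop-word removal, Porter stemming):

```python
# Illustrative only: whoosh's built-in analyzer, not our in-house preprocessing.
from whoosh.analysis import StemmingAnalyzer

analyzer = StemmingAnalyzer()
print([token.text for token in analyzer("Which example describes a learned behavior in a dog?")])
# e.g. ['which', 'exampl', 'describ', 'learn', 'behavior', 'dog']
```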

What didn't work:

We also experimented with adding entire wiki-books content to our knowledge base. It appears that a lot of out-of-domain data makes the scores of question-answer pairs less distinguishable, and performance does not improve.

Semantic Similarity Approach

Our next approach was to try semantic similarity. The core of this approach is to calculate semantic similarities between a question and its answer candidates. The relevance of an answer is then determined by its semantic similarity with the question. Methods for calculating this semantic similarity range from a simple bag-of-words model to a neural network based model, making use of sparse or dense distributed representations of words.

For this competition, we started with a simple bag-of-words model using dense distributed representations of words. These distributed representations were trained using the word2vec toolkit. The corpus used for training was extracted from Wikipedia and ck12.org, and was preprocessed and tokenized to suit the needs of this competition.
Once we have the word representations, the distributed representation of a question or answer is calculated by summing the representations of all of its words (the input having been preprocessed to remove stop words). This vector is then normalized by the length of the question or answer. The similarity between a question-answer pair is then calculated as the dot product of their vector representations.
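
A minimal sketch of that similarity calculation, assuming vectors trained with gensim's word2vec; the stop-word list, tokenizer, and model path here are illustrative stand-ins rather than our actual preprocessing:

```python
import numpy as np
from gensim.models import Word2Vec

# Hypothetical path; in our case the vectors were trained on text from Wikipedia and ck12.org.
model = Word2Vec.load("word2vec_wikipedia_ck12.model")

STOP_WORDS = {"a", "an", "and", "the", "of", "is", "which", "in", "to", "that"}

def sentence_vector(text):
    """Sum the word vectors of the non-stop-word tokens, normalized by token count."""
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
    word_vectors = [model.wv[t] for t in tokens if t in model.wv]
    if not word_vectors:
        return np.zeros(model.vector_size)
    return np.sum(word_vectors, axis=0) / len(word_vectors)

def best_answer(question, answers):
    """Pick the answer whose vector has the largest dot product with the question vector."""
    q_vec = sentence_vector(question)
    scores = {label: float(np.dot(q_vec, sentence_vector(text)))
              for label, text in answers.items()}
    return max(scores, key=scores.get), scores
```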

What worked:

This approach does work much better than randomly guessing the answer. With the random baseline at 25%, we could achieve 36% on the training set using this approach (which is still well below the Lucene-based Wikipedia benchmark at 43.25%).

What didn't work:

This approach didn't work as well as the IR approach, though comparing them wouldn't be fair: this approach uses only the question and answer text, while the IR approach makes use of an entire knowledge base of relevant content.

Combining Approaches

We experimented with combining the two methods above. Here, for a question-answer pair, we add the score obtained using our Information Retrieval method to the corresponding semantic similarity measure. Intuitively, this combines the knowledge-base evidence for a question-answer pair with its semantic similarity, thereby boosting the overall score of the question-answer pair.
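
In code, the combination is as simple as it sounds; the sketch below just adds the two scores for each answer choice and keeps the best total (any rescaling of the two score ranges to comparable magnitudes is left out here):

```python
def combine_scores(ir_scores, sim_scores):
    """ir_scores and sim_scores both map answer labels (e.g. 'A'-'D') to floats."""
    combined = {label: ir_scores[label] + sim_scores[label] for label in ir_scores}
    return max(combined, key=combined.get), combined
```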

This approach does improve the overall leaderboard score and works better than just the IR approach.

We have some other ideas to try, but we are also going into beta with a bunch of customers in January and are up against some real product deadlines, so if we have the time to work more on this, we will be sure to post more about it. In the meantime, if you are interested in NLP and Machine Learning, we would love to talk to you. And if you want to beta test Talla, an intelligent automated assistant that can handle simple recruiting and HR-related tasks for you, let us know.

Topics: Deep Learning, NLP