Dabbling in Natural Language Processing
I am an intern on the technology discovery team at TurboTax. One of my recent assignments has been to research the field of natural language question answering (NLQA) to see whether we can leverage existing technology to answer user questions, improving the customer support Intuit provides while saving support staff hours.
We want to alleviate demand on support representatives by making a smarter system for fielding customer questions.
The goal is to build a system that takes as input a context-free natural language query, e.g. “Where is my W2?”, and spits out an answer like “Your W2 should arrive by {{ Date }}. If you do not receive it by {{ AnotherDate }}, contact your employer.” Success does not necessarily mean fielding every single question; the goal is for what we build to substantially reduce the load on support representatives.
In developing a new system, we should take into consideration the one already in place, which combines a set of reference materials for the most important questions with an online community where both employees and users field questions. To validate the need for a new system, we would need to show a substantial improvement over the existing one.
As it turns out, natural language is infinitely permutable: there is an infinite set of ways to express any given question. But because I doubt we will do any engineering work on natural language directly, we can treat the existing NLP tooling as good enough for our purposes and dive into the meat of the problem.
There are two general ways to approach query answering: searching preexisting documents, and artificial intelligence - in this case, building an answer on the fly. Let’s work through them in order of decreasing complexity.
Artificial intelligence has proven to be a powerful tool. However, natural language understanding is not yet at the level needed to offer customer support on a topic as complicated as taxes. The bleeding edge of the field is training neural networks to recognize cat videos on YouTube - impressive, but fundamentally different from generating coherent, complex language. Even a system like Watson, regarded as one of the most powerful AIs in the world, solves a problem that is an order of magnitude easier than the one at hand. Watson takes clean natural language with context, i.e. the Jeopardy category, and only works really well when the target answer is a single word rather than a complicated idea. For example, when asked in the category “Olympic Oddities” about a particular athlete, Watson responded “What is a leg?” when the real answer should have been “What is missing a leg?” - Watson saw only that the athlete was associated with an oddity related to his leg. In fact, there were numerous clues Watson was not sure about, despite having a multi-terabyte corpus of indexed, often structured, data. The glimmer of hope is that taxes are based on a much smaller set of rules than everything that could be a Jeopardy category. We will discuss rule-based programming for tax understanding in the context of structured search; still, remembering how long my automated theorem proving courseware took to resolve simple proofs about atomic data makes me cautious to even imagine a system that could reason about taxes.
Search is broadly divided into search on structured and search on unstructured data.
Unstructured Search: let’s compute the relevance of a document through keywords, manipulated in various ways.
With regard to search on unstructured data, Singhal’s “Term Weighting Revisited” paper provides a stellar overview of the field. Computers cannot readily reason about unstructured documents; keyword statistics are the main handle we have on them. Normalization gets thrown around a lot: it is a family of techniques for taking noise out of a document so that statistical keyword matching can work well. There is length normalization, normalization of keyword weights, and many other variants.
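To make that concrete, here is a minimal sketch of tf-idf weighting with cosine length normalization - one of the classic schemes in this line of work - applied to a toy corpus; the documents and query are hypothetical:

```python
import math
from collections import Counter

# Toy corpus of unstructured support documents (hypothetical examples).
docs = [
    "your w2 should arrive by mail from your employer",
    "contact your employer if your w2 has not arrived",
    "the 1040 form is due in april",
]

def tf_idf_vectors(docs):
    """Weight each term by tf * idf, then length-normalize each vector."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc.split()))
    vectors = []
    for doc in docs:
        tf = Counter(doc.split())
        vec = {t: f * math.log(n / df[t]) for t, f in tf.items()}
        # Cosine (length) normalization keeps long documents from dominating.
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors

def relevance(query, vectors):
    """Score each document by the summed weights of matched query terms."""
    terms = query.lower().split()
    return [sum(vec.get(t, 0.0) for t in terms) for vec in vectors]

print(relevance("where is my w2", tf_idf_vectors(docs)))
```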
It is easy to imagine that the existing search of AnswerXchange (AX) questions employs a keyword-based search. Any substantial benefit in this direction would come from incorporating new data about the relationships between questions and answers. (Interestingly, Google’s algorithm found its magic down a similar avenue, by combining the keyword dataset with a matrix representing the relative value of web pages based on the quantity and quality of other pages that point to them.) One place TurboTax could find its magic sauce for search is by performing its unstructured searches over the answers as well as the questions.
As it turns out, although users might not know exactly what they are asking, the people who answer their questions do - and they tend to use precise language, to boot. Thus, we can infer the most information about a question by looking at what types of answers it has received. Conveniently, we also have data about how many users found any given answer helpful, which gives us a way to vet the quality of particular answers.
What does the system look like under the hood? The first step is to precompute a matrix relating question and answer keywords. We use a natural language entity extraction tool (NLTK, the Stanford Parser, etc. should work) and compile a list of the 1000 most commonly discussed entities, bigrams, and trigrams in our datasets of questions and answers.
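As a rough sketch of that extraction step, using NLTK’s tokenizer and n-gram helpers as a frequency-based stand-in for full entity extraction (the sample questions are hypothetical):

```python
import nltk
from collections import Counter

nltk.download("punkt", quiet=True)  # one-time fetch of tokenizer models

def top_terms(texts, k=1000):
    """Return the k most common unigrams, bigrams, and trigrams."""
    counts = Counter()
    for text in texts:
        tokens = [t.lower() for t in nltk.word_tokenize(text) if t.isalnum()]
        counts.update(tokens)
        counts.update(" ".join(g) for g in nltk.bigrams(tokens))
        counts.update(" ".join(g) for g in nltk.trigrams(tokens))
    return [term for term, _ in counts.most_common(k)]

# Hypothetical inputs; the real datasets are the question and answer corpora.
questions = ["Where is my W2?", "When should my W2 arrive in the mail?"]
print(top_terms(questions, k=10))
```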
We map each keyword to the set of answers associated with it. The next step is to compute the relationship between each pair of entities across questions and answers: the likelihood that a good answer to a question containing one set of keywords (entities) contains another set of keywords. Although the implementation details are open-ended, one can imagine certain keyword pairs having a negative predictive effect on the quality of an answer for a given entity; this, too, could be encoded in the matrix. Finally, we normalize the matrix and are left with a PageRank-like predictive matrix that models the relationship between entities in the question and entities in the answer.
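A minimal sketch of how that matrix might be computed, assuming we have (question, answer) pairs plus the keyword vocabularies from the previous step; the helpfulness votes mentioned earlier could additionally weight each count:

```python
from collections import defaultdict

def build_relation_matrix(pairs, q_terms, a_terms):
    """Estimate how strongly each question keyword predicts each answer
    keyword, from a corpus of (question_text, answer_text) pairs."""
    counts = defaultdict(lambda: defaultdict(float))
    for question, answer in pairs:
        # Crude substring matching; real keyword spotting would tokenize.
        q_hits = [t for t in q_terms if t in question.lower()]
        a_hits = [t for t in a_terms if t in answer.lower()]
        for qt in q_hits:
            for at in a_hits:
                counts[qt][at] += 1.0
    # Row-normalize so each question keyword maps to a distribution over
    # answer keywords - the PageRank-like predictive matrix.
    matrix = {}
    for qt, row in counts.items():
        total = sum(row.values())
        matrix[qt] = {at: c / total for at, c in row.items()}
    return matrix

pairs = [("Where is my W2?", "Your employer mails your W2 by January 31.")]
print(build_relation_matrix(pairs, ["w2"], ["employer", "january"]))
```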
Finally, instead of doing a simple keyword match on the questions, we augment our search by converting each question into a set of keywords to search for in potential answers. Now we can improve the relevance of our results by leveraging a giant dataset that is right at our fingertips. And by implementing a feedback loop, we can keep improving that relevance over time.
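The query-time side might look like the sketch below, which expands a question into the answer keywords its own keywords predict (the tiny inlined matrix is illustrative):

```python
from collections import defaultdict

def expand_query(question, matrix, top_n=5):
    """Convert a question into answer-side keywords to search for."""
    question = question.lower()
    scores = defaultdict(float)
    for q_term, row in matrix.items():
        if q_term in question:
            for a_term, weight in row.items():
                scores[a_term] += weight
    # Search candidate answers for these terms instead of the raw question.
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

matrix = {"w2": {"employer": 0.6, "january": 0.4}}  # from the previous step
print(expand_query("Where is my W2?", matrix))      # ['employer', 'january']
```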
From a high level, there isn’t all that much magic in the approach to search described above. But it’s a good way of getting things done without any nasty ad hoc stratagems - think of it as unsupervised learning in the computer’s native world.
My research has led me to believe that we cannot do away with unstructured search. However, a promising complement to it is performing search on structured data. The precedents here are things like Freebase, an open database of facts about a huge range of entities; Facebook’s Graph API, which lets users fire off queries like “friends who like ‘Bon Jovi’”; and WolframAlpha, which maps natural language queries to functions and models.
Companies are just now starting to understand structured search. Imagine we have an entity, say George Washington. A structured search engine should know that George Washington is a person. A person has certain properties: an age, a birthday. A person also has certain relationships: parents, children, friends, and institutions to which he or she belongs. You can see how an imprudent builder might try to capture needless amounts of information - and a lot of where programs fall apart is in dealing with the unbounded complexity of the world.
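As a sketch of what a deliberately minimal entity representation might look like (the field names here are illustrative, not a schema proposal):

```python
from dataclasses import dataclass, field

@dataclass
class Entity:
    """A minimal entity: a type, a few properties, and named relationships
    to other entities. Everything else stays out until a query needs it."""
    name: str
    kind: str
    properties: dict = field(default_factory=dict)
    relations: dict = field(default_factory=dict)  # e.g. "parent" -> [Entity]

washington = Entity(
    name="George Washington",
    kind="person",
    properties={"birthday": "1732-02-22"},
)
```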
Thus, the big question with search on structured data is: how do we represent our knowledge? Alternatively, what is the minimal set of information we need to store about an entity to achieve our purpose? A follow-up, then, is how we can effectively convert our information into structured form without enlisting millions of people. (For the general-purpose case, by the way, I suspect a distributed methodology like the one used by reCAPTCHA or Duolingo wouldn’t be a bad idea…)
This is the open question now. Yes, we can perform effective computations on structured data. But how do we structure our data in a way that makes it conducive to those computations? How do we generalize it so that it works with functions that are themselves general enough to be useful? Which relationships are important? Is it even possible? It might take a bit more familiarity with the tax code to answer that.
I do not have the answer to how we should approach structured search in the context of taxes. A prudent approach would be to create a simplistic model that stores information about the most important tax forms: who has to fill them out and when they are due. Perhaps we figure out what belongs in that model by carefully analyzing the incoming requests to TurboTax that are common and simple. How many of these requests presently turn into questions in the forum? How many go to live representatives? Knowing this, we could begin collecting data on which types of questions we should be able to answer most effectively using a combination of rule-based AI and augmented question-answer search.
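A minimal sketch of what such a model could look like; the entries are illustrative seed data, not a vetted schema:

```python
from dataclasses import dataclass

@dataclass
class TaxForm:
    """The smallest set of fields that answers the common
    what/who/when questions about a form."""
    name: str
    description: str
    filed_by: str  # who has to fill it out
    due: str       # when it is due

# Hypothetical seed data; real entries would come from analyzing
# the common, simple incoming requests.
FORMS = {
    "w-2": TaxForm("W-2", "Wage and tax statement",
                   filed_by="employers, with a copy sent to each employee",
                   due="January 31"),
    "1040": TaxForm("1040", "Individual income tax return",
                    filed_by="individual taxpayers",
                    due="April 15"),
}
```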
And, because I promised to get back to the issue of natural language: after the data is structured, we could map a finite set of natural language expressions to certain operations on that data. Over time, we can figure out new ways to augment our collection of structured tax data and the natural language templates for retrieving it.
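One way that template layer might look - the patterns and response wording are hypothetical, and a stub of the structured data is inlined so the example runs on its own:

```python
import re

# Stub of the FORMS table from the previous sketch.
FORMS = {"w-2": {"name": "W-2", "due": "January 31"}}

# Each template pairs a natural language pattern with an operation
# on the structured data.
TEMPLATES = [
    (re.compile(r"where is my (?P<form>[\w-]+)", re.I),
     lambda f: f"Your {f['name']} should reach you by {f['due']}."),
    (re.compile(r"when is (?:the )?(?P<form>[\w-]+) due", re.I),
     lambda f: f"The {f['name']} is due {f['due']}."),
]

def structured_answer(question):
    """Return an answer if a known template matches, else None."""
    for pattern, operation in TEMPLATES:
        match = pattern.search(question)
        if match and match["form"].lower() in FORMS:
            return operation(FORMS[match["form"].lower()])
    return None

print(structured_answer("Where is my W-2?"))
# -> Your W-2 should reach you by January 31.
```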
Conclusion: building smart systems is not easy. But we should definitely keep trying to do a good job at it.
The current state of technology cannot obviate the need for the AnswerXchange community or live phone representatives. However, we aim to increase the number of customers whose issues are handled without ever needing to reach a live representative - perhaps even solving their problems faster in the process. Furthermore, we want a system that improves on the job, recomputing its under-the-hood data based on user feedback about its performance.
The best way to do this is by combining the three techniques explored. A hypothetical new system might start by prompting the user to enter a question. Once the question has been received, we decide whether we can answer it by mapping it to our structured data, a la WolframAlpha. If not, we search the unstructured corpus of question and answer data. If there are still no satisfactory matches, we record the miss to improve the algorithm for next time, and allow the user to either post the question to the community or call a live representative.
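Tying the tiers together, a sketch of the routing logic; the three component functions are stubs standing in for the pieces sketched earlier:

```python
def structured_answer(question):
    """Tier 1: template match against structured tax data (stubbed)."""
    return None

def search_answers(question):
    """Tier 2: matrix-augmented search over the Q&A corpus (stubbed)."""
    return []

def record_miss(question):
    """Feedback loop: log unanswered questions to improve future runs."""
    print(f"logging miss: {question!r}")

def answer(question):
    """Route a question through the three tiers in order."""
    result = structured_answer(question)
    if result is not None:
        return result
    matches = search_answers(question)
    if matches:
        return matches[0]
    record_miss(question)
    return ("We could not find a match - post your question to the "
            "community or call a live representative.")

print(answer("Where is my W2?"))
```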
If we can crack this problem, we can make serious headway on customer support, not just for our company but for the entire industry. And, having spent a few hours in the past week helping customers live, I can tell you: saving support staff from having to field simple questions from angry customers would be a great humanitarian accomplishment indeed.