At Zelros, we believe that many enterprise machine learning applications lack a major component: the interface between the complex data workflow on one side and the business users on the other side, that is to say, a natural way for people to interact with technology.
We believe that one solution to this issue is to invent a new type of intelligent interaction based on natural language and conversation: chatbots. We are not the only ones: hardly a day goes by without an article promoting the benefits of conversational and invisible UIs.
Technically, end-to-end chatbots consist of several building blocks. We won't go into details in this post, but will focus on one crucial component: the "NLP engine".
NLP engines under the hood
The role of the NLP engine is to determine what the user is saying (called the intent in this post). It learns from example sentences (sometimes named utterances), and is then able to classify new sentences among the known intents.
For instance, you can feed the NLP engine several equivalent sentences like "I want to book a taxi to Paris" or "Please drive me to Paris" (by the way, there are tools for that), and it will train a machine learning model able to understand new, similar (but yet unseen) sentences like "let's drive to Paris".
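To make this concrete, here is a minimal sketch of what such an engine does internally: train an intent classifier from labeled example utterances, then classify a new sentence. We use scikit-learn with TF-IDF features purely for illustration; real NLP-as-a-service engines are far more sophisticated (word embeddings, deep learning, etc.), and the utterances and intent names below are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Example utterances, each labeled with its intent
utterances = [
    "I want to book a taxi to Paris",
    "Please drive me to Paris",
    "Book me a cab to the airport",
    "What's the weather like today",
    "Will it rain tomorrow",
    "Is it sunny outside",
]
intents = ["book_taxi", "book_taxi", "book_taxi",
           "weather", "weather", "weather"]

# TF-IDF features + a linear classifier, chained in one pipeline
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                      LogisticRegression())
model.fit(utterances, intents)

# Classify a new, similar (but yet unseen) sentence
print(model.predict(["let's drive to Paris"])[0])  # book_taxi
```

Even this toy model generalizes to the unseen phrasing, because the test sentence shares vocabulary with the training utterances of one intent; the whole point of benchmarking is to measure how well real engines do this on harder inputs.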
A lot of NLP-as-a-service providers now exist for developers: look for example at the “bot developer frameworks and tools” section of this landscape.
At Zelros, we support several of them: our goal is to be agnostic and to be able to plug the right one, depending on the context. We also have our own engine, supporting some custom functionalities (e.g. word2vec, deep learning or enhanced privacy by design).
Given all these existing NLP services, our team quickly raised several questions:
- Is there one service more accurate in my context (e.g. a specific business domain)?
- Are all the languages supported with the same accuracy?
- Are some services more robust to misspellings?
- How does accuracy improve when the engine is trained on more data?
- How should parameters (thresholds, …) be tuned for my context? (Parameter values can differ from one framework to another.)
- How can we check for non-regression when adding utterances or intents, or when the NLP service is upgraded?
- etc.
Nothing is more subjective than the perceived understanding quality of a chatbot. We wanted to find a way to measure it in a more automated, formal, and reproducible manner.
One of our trainees, Benjamin Battino, took part in these discussions last summer and started a project around this topic. Today we are open sourcing it!
Bunt
As of today, the main features of Bunt (our Bot UNderstanding Testbed!) are the following:
- French and English are supported. Other languages can be added through modules
- Several default corpora of testing data are provided (small talk, misspellings, …); other corpora can be added by the user
- 2 metrics are available (accuracy and k-penalized accuracy, where an error counts k times worse than a success); other metrics can be added by the user
- 3 NLP-as-a-service providers are supported: api.ai, luis.ai, and recast.ai. Other providers can be added through plugins
- For providers with hyperparameters to tune (thresholds, …), a basic grid search system is available
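The last two bullet points can be sketched in a few lines. The code below is a hedged illustration, not Bunt's actual implementation: we assume "an error is k times worse than a success" means errors are weighted k times in the denominator, and `classify` stands in for a hypothetical provider call returning an (intent, confidence) pair.

```python
def k_penalized_accuracy(y_true, y_pred, k=2):
    """Accuracy variant where an error counts k times worse than a success."""
    successes = sum(t == p for t, p in zip(y_true, y_pred))
    errors = len(y_true) - successes
    return successes / (successes + k * errors)

def grid_search_threshold(examples, classify, thresholds, k=2):
    """Basic grid search over a confidence threshold.

    examples: list of (sentence, expected_intent) pairs.
    classify: hypothetical provider call, sentence -> (intent, confidence).
    Below the threshold, the prediction falls back to "fallback".
    Returns the (threshold, score) pair maximizing k-penalized accuracy.
    """
    expected = [intent for _, intent in examples]
    best = None
    for th in thresholds:
        predictions = []
        for sentence, _ in examples:
            intent, confidence = classify(sentence)
            predictions.append(intent if confidence >= th else "fallback")
        score = k_penalized_accuracy(expected, predictions, k)
        if best is None or score > best[1]:
            best = (th, score)
    return best
```

With a perfect run, `k_penalized_accuracy` returns 1.0 regardless of k; with errors, higher k drags the score down faster, which is useful when a wrong answer hurts the user experience more than a correct one helps it.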
Using Bunt
Here are some charts you can obtain when using Bunt. As you can see, depending on the context, services achieve different response scores. Bunt is a good way to troubleshoot or assess the services for your particular usage.
Further work
We are happy to open source the work of our trainee Benjamin Battino, and we hope the community will find an interest in this development (the first of its kind, as far as we know…) and contribute to improving it.
Here are some hints of what we could investigate in the future:
- Other cross-validation strategies: we use classical train/test splits, without replacement. Other strategies could also be investigated (would sampling with replacement be closer to a real bot interaction?)
- Only defined intents are tested in the validation strategy. We could also evaluate responses to out-of-intent sentences (fallback)
- To evaluate errors, we only take the chosen intent into account, not its confidence score. Soft decisions could be used for a different metric
- Named entity recognition is out of the current scope. This could be an interesting domain to test
- Other NLP-as-a-service providers could be added, like for example wit.ai or IBM Watson.
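The current validation strategy mentioned above, a per-intent train/test split without replacement, can be sketched as follows. This is an illustrative assumption about the approach, not Bunt's actual code; the corpus layout (a dict mapping intent names to utterance lists) is made up for the example.

```python
import random

def split_per_intent(corpus, test_ratio=0.3, seed=42):
    """Split each intent's utterances into train/test sets without replacement.

    corpus: dict mapping intent name -> list of utterances.
    Every utterance ends up in exactly one of the two sets, so no sentence
    used for training is ever used for evaluation.
    """
    rng = random.Random(seed)
    train, test = {}, {}
    for intent, utterances in corpus.items():
        shuffled = utterances[:]
        rng.shuffle(shuffled)
        n_test = max(1, int(len(shuffled) * test_ratio))
        test[intent] = shuffled[:n_test]
        train[intent] = shuffled[n_test:]
    return train, test
```

Sampling with replacement would instead allow a sentence to appear in both sets (or several times in one), which is the alternative questioned in the first bullet above.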
We hope you will enjoy Bunt (find it on GitHub here). We would love to hear your feedback: reach us on Twitter or on our web site!
Oh, and one last thing: we are hiring several full stack software developers willing to learn and use machine learning for real purposes!