2017 has been a stunning year for all of us here at Zelros.
We started 2017 with a technical blog post that turned out to be very popular in the data science community, and with an innovation prize awarded by our peers.
As 2017 draws to a close, we wanted to wrap up these amazing last 12 months with some of our thoughts on what is currently happening in the AI field.
Here is a review of the four most striking machine learning trends noticed by our product R&D team this year.
Trend 1: ML Frameworks
New machine learning frameworks keep emerging.
They are becoming more and more high level, helping users focus on applications and usage by offloading low-level tasks.
Data scientists must adapt quickly and learn to use several of them to stay up to date. Here are a few examples of what happened in 2017:
- Keras is now part of the core TensorFlow framework (see the sketch after this list)
- PyTorch is quickly gaining popularity amongst top AI researchers, as TensorFlow shows its limitations
- A few days ago, Apple open sourced Turi Create, a framework that simplifies the development of custom machine learning models
- spaCy v2.0 has been released, and is becoming the preferred open-source library for advanced Natural Language Processing in Python
- RIP Theano, once a major deep learning framework
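To illustrate the Keras point, here is a minimal sketch of defining a model directly through TensorFlow, now that Keras ships as part of the core framework. The tiny model itself is purely illustrative, not something from the original announcements.

```python
# Minimal sketch: Keras used through the core TensorFlow package (tf.keras).
# The model below is purely illustrative.
import tensorflow as tf

model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(32, activation='relu', input_shape=(10,)),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()  # no standalone keras import needed anymore
```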
Trend 2: Datasets
There is no machine learning without data. This year, several new datasets have been released, helping data scientists to train and benchmark models for various tasks. Here are a few of them, in the Natural Language Processing field:
- Quora question pairs dataset: over 400,000 pairs of potentially duplicate questions (see the loading sketch after this list). Here is the associated Kaggle challenge (won by the awesome BNP Cardif Lab French team, which we know well at Zelros ;) )
- Google speech commands dataset: 65,000 one-second long utterances of 30 short words, by thousands of different people. Here is the associated Kaggle challenge
- Salesforce AI WikiSQL, a dataset of 80,654 hand-annotated examples of questions and SQL queries distributed across 24,241 tables from Wikipedia
- The Multi-Genre Natural Language Inference (MultiNLI) corpus is a crowd-sourced collection of 433k sentence pairs annotated with textual entailment information (inspired by the historical Stanford SNLI corpus)
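Getting started with such datasets is usually just a few lines of pandas. Here is a minimal sketch for the Quora question pairs data, assuming the train.csv file and column names distributed with the Kaggle competition:

```python
# Minimal sketch: exploring the Quora question pairs dataset.
# Assumes the Kaggle train.csv with columns question1, question2, is_duplicate.
import pandas as pd

pairs = pd.read_csv("train.csv")
print(len(pairs), "question pairs")
print(pairs["is_duplicate"].mean())  # share of duplicate pairs
print(pairs[["question1", "question2", "is_duplicate"]].head())
```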
Trend 3: Transparency
As AI is increasingly used in real-life enterprise processes, and as the new GDPR regulation will soon be enforced in Europe, the need for algorithmic transparency is rising.
In 2017, we saw several contributions around Interpretable Machine Learning. Here is a selection:
- a new R package that makes XGBoost interpretable (see the Python sketch after this list)
- Understanding Black-box Predictions via Influence Functions (best paper award at ICML 2017; here is the code)
- Google launched Facets, an interactive visualization tool, to inspect datasets and better understand them
- O’Reilly wrote a complete and interesting article on interpreting machine learning
- Several research teams are working on this emerging field, at Microsoft for example
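The package mentioned above is for R; as a rough Python analogue, here is a minimal sketch of one basic interpretability step, inspecting the global feature importances of an XGBoost model. The dataset and hyperparameters are illustrative, and this only gives a global view, not the per-prediction explanations some of the works above aim for.

```python
# Minimal sketch: a first step towards interpretability, looking at which
# features drive an XGBoost model globally. Dataset and settings are illustrative.
import xgboost as xgb
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
model = xgb.XGBClassifier(n_estimators=50, max_depth=3)
model.fit(data.data, data.target)

# Rank features by the importance the booster assigns to them.
ranking = sorted(zip(data.feature_names, model.feature_importances_),
                 key=lambda pair: pair[1], reverse=True)
for name, score in ranking[:5]:
    print("{}: {:.3f}".format(name, score))
```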
Trend 4: AutoML
2017 has seen the advent of automated machine learning, which is little by little becoming a commodity.
AutoML automates some parts of the data science process: basic data preparation and feature engineering, model selection, hyperparameter tuning, …
This year, new open-source libraries have been released, like MLBox, and existing ones improved, like auto-sklearn (a usage sketch follows the list below).
New commercial solutions have been launched as well, for example Edge-ml or Prevision.io.
What’s more, existing tools have added AutoML capabilities:
- DataRobot, the pioneer of the AutoML field, has raised another $54 million
- Dataiku added an AutoML feature (btw, check out our integration with this platform)
- H2O.ai launched a new product: Driverless AI
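To make the idea concrete, here is a minimal sketch using auto-sklearn, one of the open-source libraries mentioned above. The toy dataset and the time budget are illustrative choices, not recommendations.

```python
# Minimal sketch: letting auto-sklearn search models and hyperparameters
# on a toy dataset. The time budget below is purely illustrative.
import autosklearn.classification
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=300)  # 5-minute search, adjust to taste
automl.fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, automl.predict(X_test)))
```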
We wish you a happy new year! Stay tuned for an important announcement in the coming weeks ;)
And did we mention that we are hiring data scientists and software engineers?