The Four Maturity Levels of Automated Machine Learning: Towards Domain Specific AutoML

Level 1: Model and hyperparameter optimization

The basic functionality of an AutoML system for structured data can be summarized in the following workflow:

  • Provide a dataset to analyse
  • Provide the name of the target variable to predict
  • Specify the metric to optimize, the features (columns of the dataset) to use as explanatory variables in the model, the models to fit, chosen from a predefined catalog, and the train/validation scheme. A maximum training duration is often required as well
  • Run the AutoML training: the AutoML algorithm handles the cross-validation scheme and fits the selected models (e.g. linear models, random forests, gradient-boosted trees, etc.) to the data. For each model, the hyperparameters are optimized with respect to the given metric. The optimization method is typically a brute-force grid search, though more sophisticated techniques may be used. For instance, auto-sklearn uses a Bayesian approach to explore the hyperparameter space
  • Observe the performance of the different model and hyperparameter combinations tested by the AutoML. The metric provided as input is used, but additional metrics may be computed too
  • Provide a test dataset, or an in-production dataset, and make predictions on it with the best model identified by the AutoML algorithm
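
The training and comparison steps above can be sketched in a few lines of plain Python. This is only a minimal illustration under stated assumptions, not how any particular AutoML product works: the synthetic dataset, the two-model catalog (a k-nearest-neighbours classifier and a one-feature threshold rule), the accuracy metric and names such as `run_automl` are all hypothetical choices made for the example. The search strategy is a brute-force grid search evaluated with k-fold cross-validation.

```python
import random
from statistics import mean

def make_dataset(n=200, seed=0):
    """Synthetic binary dataset: the label is 1 when x0 + x1 > 1."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = (rng.random(), rng.random())
        rows.append((x, int(x[0] + x[1] > 1.0)))
    return rows

def knn_predict(train, x, k):
    """Majority vote among the k nearest training points (squared Euclidean distance)."""
    nearest = sorted(
        train, key=lambda r: (r[0][0] - x[0]) ** 2 + (r[0][1] - x[1]) ** 2
    )[:k]
    votes = sum(label for _, label in nearest)
    return int(votes * 2 > k)

def threshold_predict(train, x, feature):
    """Predict 1 when one chosen feature exceeds its mean on the training data."""
    cut = mean(r[0][feature] for r in train)
    return int(x[feature] > cut)

# Hypothetical model catalog: each entry maps a name to a predictor
# and the grid of hyperparameter settings to try for it.
CATALOG = {
    "knn": (knn_predict, [{"k": k} for k in (1, 5, 15)]),
    "threshold": (threshold_predict, [{"feature": f} for f in (0, 1)]),
}

def accuracy(predict, params, train, valid):
    """Metric to optimize: fraction of validation rows predicted correctly."""
    hits = sum(predict(train, x, **params) == y for x, y in valid)
    return hits / len(valid)

def cross_val_score(predict, params, data, n_folds=5):
    """Average the metric over n_folds train/validation splits."""
    scores = []
    for i in range(n_folds):
        valid = data[i::n_folds]
        train = [r for j, r in enumerate(data) if j % n_folds != i]
        scores.append(accuracy(predict, params, train, valid))
    return mean(scores)

def run_automl(data, catalog=CATALOG, n_folds=5):
    """Grid search over every model and hyperparameter setting; best score first."""
    leaderboard = []
    for name, (predict, grid) in catalog.items():
        for params in grid:
            score = cross_val_score(predict, params, data, n_folds)
            leaderboard.append((score, name, params))
    leaderboard.sort(key=lambda entry: entry[0], reverse=True)
    return leaderboard

leaderboard = run_automl(make_dataset())
best_score, best_model, best_params = leaderboard[0]
for score, name, params in leaderboard:
    print(f"{name:10s} {params!s:18s} accuracy={score:.3f}")
```

The leaderboard plays the role of the "observe the performance" step: every model and hyperparameter combination is listed with its cross-validated score, and the first entry is the one that would be used for predictions on new data.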

Level 2: Integrating the data preprocessing into the AutoML

In Level 2 AutoML, a set of basic preprocessing steps is performed before the ML algorithm search itself. This includes the automatic detection of column types (continuous, categorical, free text, date, etc.), the transformation of all columns into numerical data depending on their type (e.g. encoding of categorical columns, processing or deletion of free-text columns, transformation of date columns), and the handling of missing values with different strategies (e.g. setting missing data to a given value, adding a category that represents missing data, data completion, etc.).
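
As a rough sketch of what these preprocessing steps can look like, the snippet below detects the type of each column, converts it to numbers, and fills in missing values. Everything in it is an assumption made for illustration: the set of missing-value markers, the single `%Y-%m-%d` date format, the three-way type heuristic, the toy table, and the function names `detect_type` and `encode_column`. Free-text columns are not handled here.

```python
from datetime import datetime
from statistics import mean

# Assumed markers for a missing cell; real datasets use many more variants.
MISSING = {"", "NA", "null", None}

def detect_type(values):
    """Guess a column type from its non-missing values (simple heuristic)."""
    present = [v for v in values if v not in MISSING]

    def all_parse(parser):
        try:
            for v in present:
                parser(v)
            return True
        except ValueError:
            return False

    if all_parse(float):
        return "continuous"
    if all_parse(lambda v: datetime.strptime(v, "%Y-%m-%d")):
        return "date"
    return "categorical"

def encode_column(values):
    """Return (detected type, numeric values) with missing cells handled."""
    kind = detect_type(values)
    if kind == "continuous":
        # Strategy 1: replace missing numbers by the column mean.
        fill = mean(float(v) for v in values if v not in MISSING)
        return kind, [float(v) if v not in MISSING else fill for v in values]
    if kind == "date":
        # Dates become day ordinals; missing dates get the rounded mean ordinal.
        ords = [
            datetime.strptime(v, "%Y-%m-%d").toordinal() if v not in MISSING else None
            for v in values
        ]
        fill = round(mean(o for o in ords if o is not None))
        return kind, [o if o is not None else fill for o in ords]
    # Strategy 2: categories become integer codes, with code 0 reserved
    # for a dedicated "missing" category.
    codes = {"<missing>": 0}
    encoded = []
    for v in values:
        key = "<missing>" if v in MISSING else v
        encoded.append(codes.setdefault(key, len(codes)))
    return kind, encoded

table = {
    "age": ["34", "", "51", "27"],
    "city": ["Paris", "Lyon", "NA", "Paris"],
    "signup": ["2020-01-15", "2020-03-01", "", "2020-02-01"],
}
encoded = {name: encode_column(column) for name, column in table.items()}
for name, (kind, values) in encoded.items():
    print(name, kind, values)
```

Integer codes are only one possible encoding for categorical columns; one-hot encoding is a common alternative when the downstream model would otherwise read an order into the codes.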

Level 2.5: More advanced data preprocessing

As experience from data science competitions (e.g. Kaggle competitions) has repeatedly shown, advanced processing of the data is often needed to achieve state-of-the-art performance on a given dataset. Here are a few examples of techniques that generally improve a model:

  • Feature selection, which is sometimes needed to get rid of columns that do not bring valuable information to solve the problem
  • Advanced methods for the encoding of categorical variables such as target encoding, or the use of category embeddings
  • Data compression and dimensionality reduction methods such as PCA or autoencoders
  • Generation of features based on text content processing, using for instance sentiment analysis
  • Generation of new features by crossing multiple variables
  • Automated feature creation (see for example AutoLearn, ExploreKit, or Data Science Machine)
  • Data cleaning, for instance anomaly or outlier detection
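
Two of the techniques in this list, target encoding and feature crossing, are simple enough to sketch directly. The version of target encoding below uses additive smoothing towards the global mean, which is one common variant among several. The `smoothing` parameter, the function names and the toy churn data are assumptions made for the example.

```python
from collections import defaultdict

def target_encode(categories, targets, smoothing=10.0):
    """Smoothed target encoding.

    Each category is replaced by a blend of its own mean target and the
    global mean target. Rare categories are pulled towards the global mean,
    which limits overfitting on categories seen only a few times.
    """
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, y in zip(categories, targets):
        sums[category] += y
        counts[category] += 1
    mapping = {
        c: (sums[c] + smoothing * global_mean) / (counts[c] + smoothing)
        for c in counts
    }
    return [mapping[c] for c in categories], mapping

def cross_features(col_a, col_b):
    """Create a new categorical feature by crossing two columns."""
    return [f"{a}_x_{b}" for a, b in zip(col_a, col_b)]

# Toy data (hypothetical): did the customer churn?
city = ["Paris", "Paris", "Lyon", "Lyon", "Lyon", "Nice"]
plan = ["basic", "pro", "basic", "basic", "pro", "pro"]
churn = [1, 1, 0, 0, 1, 0]

city_encoded, city_mapping = target_encode(city, churn, smoothing=2.0)
city_plan = cross_features(city, plan)
print(city_mapping)
print(city_plan)
```

Because the encoding is computed from the target itself, it should be fitted on the training folds only and then applied to the validation data. Fitting it on the full dataset leaks the target into the features and inflates the validation scores.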

Level 3: Automated search for the ML network architecture itself

In the previous AutoML levels, the hyperparameters of the machine learning algorithms are tuned automatically, but the structure of the ML algorithm itself is fixed rather than discovered.

Level 4: Using domain knowledge to enhance AutoML 💡

At Zelros, we have an ambitious vision of AutoML and we are currently focusing on a new kind of AutoML which takes advantage of domain knowledge, in addition to the techniques discussed in the three previous levels.

Zelros AI