Four machine learning tricks you should have known to win the Data Science Olympics 2019

Zelros AI
3 min read · May 31, 2019


Last Thursday we organized the third edition of the Data Science Olympics. It is the largest machine learning competition in Europe, with almost 1000 data scientists registered and competing in two locations: Paris (Station F) and Berlin (Kühlhaus). Last year's edition was great, but this year was huge!

We will come back to the details of the competition in upcoming posts. But here are the first highlights of this year's technical learnings!

Target encoding

The dataset contained categorical variables. The usual way of handling these features is to convert them into integers. With scikit-learn, this can typically be achieved using preprocessing.OrdinalEncoder() or preprocessing.OneHotEncoder().

Another feature engineering method, which gave good results during this competition, was target encoding. In a nutshell, it consists of replacing each category (e.g. gender = Male) with the average of the target over that category.

However, there is an issue with this method. If you take the average of the target over all the rows of the training set, you introduce a small leak: the features you are creating indirectly use the ground truth. As a refinement, it is better to compute the average over an out-of-fold subset of the data. For example, if you split your data into 5 folds F1, F2, F3, F4, F5:

  • you create target encoding feature of fold F1 by averaging on folds F2, F3, F4, F5
  • you create target encoding feature of fold F2 by averaging on folds F1, F3, F4, F5
  • and so on …
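The out-of-fold scheme above can be sketched as follows (a minimal example with toy data — the column names and values are illustrative, not from the competition dataset):

```python
import pandas as pd
from sklearn.model_selection import KFold

# Toy data: one categorical feature and a binary target
df = pd.DataFrame({
    "gender": ["M", "F", "M", "F", "M", "F", "M", "F", "M", "F"],
    "target": [1, 0, 1, 1, 0, 0, 1, 0, 1, 1],
})

df["gender_te"] = 0.0
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for fit_idx, encode_idx in kf.split(df):
    # Category means computed on the OTHER folds only, to avoid the leak
    means = df.iloc[fit_idx].groupby("gender")["target"].mean()
    df.loc[df.index[encode_idx], "gender_te"] = (
        df.iloc[encode_idx]["gender"].map(means).values
    )
```

Each row's encoded value is thus computed without ever looking at that row's own target.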
*The leaderboard, 20 minutes after the start of the competition*

Class_weight and sample_weight

The 2019 challenge was a multiclass classification problem (4 classes). This is quite common. What was more uncommon is that the evaluation metric was custom: a log loss with a different weight on each class — 1 for class 0, 10 for class 1, 100 for class 2 and 1000 for class 3.

Here is the code of this custom metric:

from sklearn.metrics import log_loss

log_loss(y_true, y_pred, sample_weight=10**y_true)

To make your model aware of this weighting, you should have tuned the class_weight parameter when constructing the classifier.

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(class_weight={0: 1, 1: 10, 2: 100, 3: 1000})

Note that an equivalent way of doing this is to use the sample_weight parameter of the fit() method. In that case you no longer pass a weight per class, but a weight per sample.

clf.fit(X_train, y_train, sample_weight=10**y_train)

Lightgbm and the new scikit-learn gradient boosting

As the competition only lasted 2 hours, you had to be ready to train very fast with very powerful classifiers. This year, the state-of-the-art solution was to use lightgbm's LGBMClassifier.

Note that scikit-learn 0.21 ships with a new gradient boosting flavour inspired by LightGBM: HistGradientBoostingClassifier, which is much faster than the historical scikit-learn gradient boosting. Unfortunately, according to the documentation, class_weight and sample_weight do not seem to be supported yet, so it was not usable for the competition (LGBMClassifier, on the other hand, does support class_weight).

read_csv with bad lines

The training set contained corrupted rows, which made pandas' read_csv() fail to read the dataset.

The solution was to pass the parameter error_bad_lines=False to skip these lines (by default the parameter is set to True, and parsing fails).

If you want to try these tricks, the dataset is available online.

You enjoyed what you just read and want to join our team? Feel free to apply, we are hiring!
