A team from Zelros recently participated in the Engie Datapower Hackathon organized in Paris.
(EDIT: the notebook for this post can be found here.)
Several datasets were provided, coming from different sources: wind turbines, building heating, electricity consumption, and more.
One dataset quickly drew our attention: the largest one, of course!
- 100 MB of data, containing measurements from dozens of industrial sensors
- 2.5 million data points
- 1.5 years of history, with one measurement per hour
Here is the story of our exploration of this dataset.
First reflex
Our first reflex was to look at the shape of each sensor's data, over both a 1-year and a 1-week time range:
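This first look can be sketched as follows. The original file is not public, so synthetic hourly data stands in for it (the sensor names and value ranges are made up; only the shapes match the post):

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend, so the sketch runs anywhere
import matplotlib.pyplot as plt

# Synthetic stand-in for the hackathon file: 1.5 years of hourly
# measurements for a few sensors (column names are illustrative).
idx = pd.date_range("2015-01-01", periods=int(1.5 * 365 * 24), freq="h")
df = pd.DataFrame(
    {f"sensor_{i:02d}": np.random.randn(len(idx)).cumsum() for i in range(4)},
    index=idx,
)

# One sensor, viewed over the full history and over the last week.
fig, axes = plt.subplots(2, 1, figsize=(10, 6))
df["sensor_00"].plot(ax=axes[0], title="1-year view")
df["sensor_00"].loc[idx[-1] - pd.Timedelta(days=7):].plot(ax=axes[1], title="1-week view")
fig.tight_layout()
```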
The data looked very different from one sensor to another. However, our intuition was that some sensors were correlated with each other. This was confirmed when we plotted the correlation heatmap (blue or red means negative or positive correlation, white means no correlation):
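A heatmap like this can be produced in a few lines. Here is a minimal sketch on synthetic data, with two deliberately correlated columns (all names and values are illustrative, not from the original dataset):

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

# Two correlated sensors (e.g. boiler input/output temperature) plus
# one independent sensor, as a toy stand-in for the 64 real sensors.
rng = np.random.default_rng(0)
t_in = rng.normal(60, 5, 1000)
df = pd.DataFrame({
    "temp_in": t_in,
    "temp_out": t_in - 10 + rng.normal(0, 1, 1000),  # tracks temp_in
    "other": rng.normal(0, 1, 1000),                 # unrelated signal
})

corr = df.corr()  # pairwise Pearson correlations, values in [-1, 1]

# "bwr" colormap: blue = negative, white = zero, red = positive.
plt.imshow(corr, cmap="bwr", vmin=-1, vmax=1)
plt.xticks(range(len(corr)), corr.columns, rotation=45)
plt.yticks(range(len(corr)), corr.columns)
plt.colorbar()
```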
Here are, for example, two correlated sensors: the water temperature at the input and at the output of a boiler (the fact that these two measurements are correlated is in line with common sense):
Second reflex
These correlations showed that the data had some structure. Our next question was whether there were strange phenomena (anomalies) in the data that broke this natural structure, and whether it was possible to visualize them.
The challenge in capturing these potential anomalies was the high dimensionality of the data (each hour, a measurement vector from 64 different sensors). To get a sense of the data structure, we decided to use a t-SNE representation (scikit-learn implementation):
A nonlinear dimensionality reduction technique that is particularly well-suited for embedding high-dimensional data into a space of two or three dimensions, which can then be visualized in a scatter plot
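With scikit-learn, this embedding step looks roughly like the following sketch (random data stands in for the real 64-sensor measurements; only the shapes match the post):

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in for the real data: 500 hourly snapshots of 64 sensors.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 64))

# Embed the 64-dimensional measurements into 2D for a scatter plot.
tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=42)
embedding = tsne.fit_transform(X)  # shape (500, 2)
```

On the real dataset (millions of points), one would typically subsample or aggregate before running t-SNE, since its runtime grows quickly with the number of points.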
Here was the result: each point is a 64-dimensional measurement at a given moment:
We were enthusiastic about this result! It revealed "clusters": groups of measurements close to each other!
Going forward
What was behind these clusters? Our first thought was that the clusters corresponded to hours of the day. To check this assumption, we colored the points according to their hour.
This showed that the assumption was wrong: the colors were blended. We also tried coloring by day of the week. Unfortunately, same result: blended colors.
We then colored the points by day of the year. Hurray! That worked: each cluster had roughly a single color! These color-coherent clusters showed that, for t-SNE, nothing is closer to a given observation than an observation in its temporal vicinity. This is also a consequence of the t-SNE algorithm itself, which focuses on preserving small pairwise distances between points, contrary to PCA.
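Coloring the embedding by a temporal label is a one-liner with matplotlib. A minimal sketch, with a fake embedding standing in for the t-SNE output:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

# Fake 2D embedding for 500 hourly snapshots; in the real workflow
# `embedding` would be the output of TSNE(...).fit_transform(X).
rng = np.random.default_rng(0)
idx = pd.date_range("2015-01-01", periods=500, freq="h")
embedding = rng.normal(size=(500, 2))

# One color value per point: hour, weekday, or day of the year.
day_of_year = idx.dayofyear
plt.scatter(embedding[:, 0], embedding[:, 1], c=day_of_year, cmap="viridis", s=5)
plt.colorbar(label="day of year")
```

Swapping `idx.dayofyear` for `idx.hour` or `idx.dayofweek` reproduces the other two coloring attempts described above.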
Spotting anomalies
With this last visualization, we were able to easily identify abnormal points: for example, a blue one in the middle of a white cluster!
We diagnosed several singular observations with this method. For example, here is a zoom on one of these anomalies, where we can clearly see something wrong with the temperature measurement.
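The visual rule "a point whose color differs from its cluster's" can also be automated. This is our own sketch of one possible scoring, not the method used in the post: flag points whose day-of-year is far from the median day-of-year of their nearest neighbors in the embedding (all data here is synthetic):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Stand-ins for the t-SNE embedding and the temporal labels.
rng = np.random.default_rng(0)
embedding = rng.normal(size=(500, 2))
day_of_year = rng.integers(1, 366, size=500)

# For each point, find its 10 nearest neighbors in the 2D embedding
# (kneighbors includes the point itself as the first column).
nn = NearestNeighbors(n_neighbors=11).fit(embedding)
_, neigh = nn.kneighbors(embedding)
neigh_median = np.median(day_of_year[neigh[:, 1:]], axis=1)

# Large score = the point's "color" disagrees with its surroundings.
score = np.abs(day_of_year - neigh_median)
suspects = np.argsort(score)[-10:]  # the 10 most suspicious points
```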
Conclusion
Thanks to the powerful t-SNE visualization, we were able to quickly spot abnormal points in an unsupervised way.
Just imagine what a pain it would have been to identify these anomalies among millions of observations with, for example, classical threshold detection.
Industrial applications are huge: predictive maintenance, for example, or better assessment of infrastructure behavior.
If you liked this story and want to join the Zelros team or learn more about our solutions, contact us!