Data drift: It can come at you from anywhere

A short introduction to data drift

Machine learning (ML) models, at least in their present incarnations, are little more than powerful inductive machines. Their lofty goal is to go from a small example set to a broader inductive generalization.

Naturally, to function successfully (and dependably), they rely on one critical assumption: the input data distribution does not drift too far over time.

If it does, a successful model is successful only at a single point in time. As time passes, the model is asked to generalize to inputs from a distribution that differs from the one it was trained on, and its performance degrades.

So any ML deployment that is to stay successful needs monitoring and timely detection of data drift in its MLOps (or ModelOps) pipeline. Beyond regular MLOps, the concept of data drift also matters for ML explainability, the field of Explainable AI (XAI).

Where exactly a data drift monitoring block should sit in an end-to-end MLOps cycle is nicely visualized in the following article.

Essential guide to Machine Learning Model Monitoring in Production

A lot of articles on this topic discuss the shape and statistical properties of the input distributions. In this article, we will take the example of (one or more) data streams feeding into ML models in the form of time series and visualize those variations to understand the concept clearly. We will also talk about a few not-so-obvious sources of data drift in an industrial context.

Data drift is like a box of chocolates…

You never know what you’re gonna get (next)! Let’s see some obvious and not-so-obvious examples…

Level shift? Obviously

This is an obvious data drift. In many unsupervised model settings, it would also be called an ‘outlier’ or an ‘anomaly’. Note that a drift need not be permanent; it can be short-lived. Both its start and its end should be identified quickly.
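A minimal sketch of how a level shift, including its start and end, might be flagged on a single stream. It is only one possible approach, and everything in it is an assumption made for illustration: the function names, the trailing-window length, the threshold, and the idea that the first window of samples is a drift-free baseline with non-zero spread.

```python
import numpy as np

def detect_level_shift(x, window=50, threshold=3.0):
    """Flag samples where the trailing-window mean departs from the baseline mean.

    Assumption: the first `window` samples are drift-free and not constant.
    A sample is flagged when the mean of the `window` samples ending at it is
    more than `threshold` standard errors away from the baseline mean.
    """
    x = np.asarray(x, dtype=float)
    base = x[:window]
    mu, sigma = base.mean(), base.std(ddof=1)
    # Trailing rolling mean computed from a cumulative sum
    csum = np.cumsum(np.insert(x, 0, 0.0))
    roll = (csum[window:] - csum[:-window]) / window
    z = (roll - mu) / (sigma / np.sqrt(window))
    flags = np.zeros(len(x), dtype=bool)
    flags[window - 1:] = np.abs(z) > threshold
    return flags

def drift_intervals(flags):
    """Return (start, end) index pairs for each run of consecutive flagged samples."""
    padded = np.concatenate(([False], flags, [False])).astype(int)
    edges = np.diff(padded)
    starts = np.flatnonzero(edges == 1)
    ends = np.flatnonzero(edges == -1) - 1
    return list(zip(starts.tolist(), ends.tolist()))
```

Because the window is trailing, the reported start and end lag the true onset and recovery by up to one window length. A shorter window reacts faster but is noisier, so the choice depends on how quickly the drift has to be caught.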

Variance shift? Subtler

This is slightly harder to detect than level shift. See the illustration below.

Variance decrease? That’s a drift too

At a high level, business or process people (or your customer) will tell you that high variance is bad. Yes, they are almost always right. But from a data drift point of view, you have to watch for any change, even a sudden decrease in the variance. Statistically, that is a change in the properties of the distribution too.
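One way to watch for variance changes in both directions is to compare a rolling variance against a baseline and flag a ratio that is either too large or too small. The sketch below assumes the first window of samples is a trustworthy baseline; the function names and the factor of 2 are illustrative choices, not recommendations.

```python
import numpy as np

def variance_ratio(x, window=50):
    """Ratio of the trailing-window variance to the baseline variance.

    Assumption: the first `window` samples form a drift-free baseline.
    Entries before the first full window are NaN.
    """
    x = np.asarray(x, dtype=float)
    base_var = x[:window].var(ddof=1)
    ratios = np.full(len(x), np.nan)
    for end in range(window, len(x) + 1):
        ratios[end - 1] = x[end - window:end].var(ddof=1) / base_var
    return ratios

def flag_variance_drift(ratios, factor=2.0):
    """Flag increases (ratio > factor) as well as decreases (ratio < 1/factor)."""
    with np.errstate(invalid="ignore"):  # NaN entries simply compare as False
        return (ratios > factor) | (ratios < 1.0 / factor)
```

Note that a detector based only on the mean would stay silent here, since the level of the stream does not move.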

Peak shifts inside a period

This one is more subtle and hard to detect. I have tried to illustrate the drift with additional annotations like peak and period. Did you see what really happened here? The data comes in a periodic fashion and there is a peak within each period. The relative position of that peak (within a period) shifted for some time before it came back to normal.

Do you see the problem? Simple statistical properties may not catch this drift. The mean and variance (other moments too) may look similar. How do you catch it then? Interestingly, some simple spectral analysis will indicate a shift. So, you have to bring in some knowledge of signal processing to catch this data drift.
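As one possible reading of "simple spectral analysis": a shift of the peak inside a period leaves that period's mean, variance, and magnitude spectrum unchanged, but it rotates the phase of the fundamental frequency component. The sketch below assumes the period length is known and the stream is already aligned to period boundaries; the function names and the use of the first few periods as a reference are hypothetical choices for illustration.

```python
import numpy as np

def fundamental_phase_per_period(x, period):
    """Split x into whole periods and return the phase (radians) of the
    first harmonic of each period's DFT."""
    x = np.asarray(x, dtype=float)
    n_periods = len(x) // period
    blocks = x[:n_periods * period].reshape(n_periods, period)
    spectrum = np.fft.rfft(blocks, axis=1)
    return np.angle(spectrum[:, 1])

def peak_shift_in_samples(x, period, n_ref=3):
    """Estimate how many samples each period's content moved relative to
    the first `n_ref` periods (assumed to be drift-free)."""
    phase = fundamental_phase_per_period(x, period)
    ref = np.angle(np.mean(np.exp(1j * phase[:n_ref])))   # circular mean
    dphi = np.angle(np.exp(1j * (phase - ref)))            # wrap to (-pi, pi]
    # A delay of d samples multiplies harmonic 1 by exp(-2j*pi*d/period)
    return -dphi * period / (2 * np.pi)

# Usage: a bump at position 10 in each 40-sample period, moved to 18 for a while
period = 40
n = np.arange(period)
template = np.exp(-0.5 * ((n - 10) / 2.0) ** 2)
stream = np.concatenate([template] * 4 + [np.roll(template, 8)] * 3 + [template] * 3)
shifts = peak_shift_in_samples(stream, period)
```

The estimate is only unambiguous for shifts smaller than half a period, and real data would need some tolerance on the shift before raising an alert.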

Contextual data – another way to catch the drift

If you are also monitoring a logging stream with fixed-time indicators, such as a shift start or a machine reset, it can sometimes help you catch these temporal shifts in an intuitive manner. The monitoring setup has to keep the streams tightly time-synchronized, of course. In the following illustration, the anchor events and the relative distance of the peaks from those anchors indicate a drift.
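The anchor idea can be sketched as follows: measure the time from each anchor event to the largest sample before the next anchor, then look for offsets that deviate from the typical one. This assumes the signal and the anchor log share a synchronized time base; the function names and the median-based comparison are illustrative assumptions.

```python
import numpy as np

def peak_offsets_from_anchors(t, x, anchor_times):
    """For each pair of consecutive anchors, return the time elapsed from the
    anchor to the largest sample observed before the next anchor."""
    t = np.asarray(t, dtype=float)
    x = np.asarray(x, dtype=float)
    anchors = np.asarray(anchor_times, dtype=float)
    offsets = []
    for start, stop in zip(anchors[:-1], anchors[1:]):
        mask = (t >= start) & (t < stop)
        if not mask.any():
            offsets.append(np.nan)  # no samples between these two anchors
            continue
        seg_t, seg_x = t[mask], x[mask]
        offsets.append(seg_t[np.argmax(seg_x)] - start)
    return np.array(offsets)

def flag_offset_drift(offsets, tolerance):
    """Flag intervals whose peak offset differs from the median offset by
    more than `tolerance` (same time units as the offsets)."""
    typical = np.nanmedian(offsets)
    with np.errstate(invalid="ignore"):
        return np.abs(offsets - typical) > tolerance
```

Unlike the purely statistical checks, this one reads directly in process terms, for example "the peak now arrives 8 seconds later after each machine reset".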

Phase-shift/delay

This is a killer, in a true sense. For ML models training and inferencing on time-series data, a slight phase delay can generate totally wrong predictions. Basically, the model was trained like "If X1 and X2 were similar then predict Y0, otherwise predict Y1".

Now, look at the situation below and realize what would happen with out-of-phase data streams. The top data stream is fine; the bottom one drifted slightly. It is possible that the underlying data is fine and only its acquisition was delayed, so the stream fell out of sync.
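A common way to check whether two streams have fallen out of sync is to look for the lag that maximizes their cross-correlation. The sketch below is a plain NumPy version under simple assumptions: both streams have the same length and sampling rate, and the true delay is within `max_lag` samples. The function name and sign convention are choices made here for illustration.

```python
import numpy as np

def estimate_lag(reference, other, max_lag):
    """Return the lag (in samples) at which `other` best matches `reference`.

    A positive result means `other` is delayed relative to `reference`.
    Assumes equal-length, equally sampled streams.
    """
    a = np.asarray(reference, dtype=float)
    b = np.asarray(other, dtype=float)
    a = a - a.mean()
    b = b - b.mean()
    lags = np.arange(-max_lag, max_lag + 1)
    scores = []
    for lag in lags:
        if lag >= 0:
            # Compare reference[n] with other[n + lag]
            s = np.dot(a[:len(a) - lag], b[lag:])
        else:
            s = np.dot(a[-lag:], b[:len(b) + lag])
        scores.append(s / (len(a) - abs(lag)))  # normalize by overlap length
    return int(lags[int(np.argmax(scores))])
```

For strongly periodic streams the correlation peaks repeat every period, so `max_lag` should stay below half a period for the estimate to be meaningful. A non-zero lag that appears suddenly points to an acquisition or synchronization problem rather than a change in the process itself.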

Some are drift, some not

In industrial or manufacturing scenarios, process recipes and settings change all the time. There may also be a scheduled changelog. This kind of contextual data needs to be monitored to tell a genuine data drift from an expected change. Here is a visual illustration.
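A small sketch of how a changelog could be used to triage drift alerts: alerts that fall close to a logged recipe or settings change are set aside as expected, and the rest are kept for investigation. The function name, the symmetric tolerance, and the representation of events as plain timestamps are all assumptions made for illustration.

```python
import numpy as np

def split_by_changelog(alert_times, change_times, tolerance):
    """Separate alerts within `tolerance` of a logged change from the rest.

    Returns (explained, unexplained) arrays of alert times. All times are
    assumed to be on the same synchronized clock and in the same units.
    """
    alerts = np.asarray(alert_times, dtype=float)
    changes = np.asarray(change_times, dtype=float)
    if changes.size == 0:
        explained = np.zeros(alerts.shape, dtype=bool)
    else:
        # Distance from each alert to its nearest logged change
        nearest = np.abs(alerts[:, None] - changes[None, :]).min(axis=1)
        explained = nearest <= tolerance
    return alerts[explained], alerts[~explained]
```

An alert explained by the changelog may still call for resetting the monitoring baseline, since the stream has legitimately moved to a new regime.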

Measurement/Sensor drift

This is also one of the hardest to detect and manage. The input data streams and the generative processes may be fine, but there may be drift in the sensor that measures the incoming data and feeds it to the ML model or monitoring system.

A monitoring system (consisting of an ML model and an alert-generating module) may detect a drift and, thinking that the fundamental process has shifted, send a wrong recommendation signal to the core process layer. That would be a misjudgment and a bad recommendation.

This is also an example where a concept drift (the model or the physical dynamics of the sensor itself) can be mistaken for a data drift. That’s an error to be avoided at all costs.

Summary

We discussed the concept of data drift in a time-series sense by showing some visual illustrations. It is obvious that catching and analyzing data drift is not a trivial problem, even in the simplest one-dimensional setting. Additional techniques like signal processing or merging with contextual data may be needed to correctly identify the onset and nature of data drift.

What are the mitigations? There is no single answer; it depends on the industry and the specific application area. The following article goes through some of them systematically. Give it a read.

"My data drifted. What’s next?" How to handle ML model drift in production.
