Hi there, I've been exploring data drift detection and want to test how good deepchecks is at determining how much a given dataset has drifted. My main concern is how to generate drifted data in the first place, and how much to skew it, so that I can check whether deepchecks detects the amount of drift that was applied. Say I have a tabular dataframe like this, where I want to drift just the Age feature. What are the ways to artificially create a drifted dataset from a given dataset? What I've been doing is splitting it into two extreme ranges (e.g. one set with Age < 50 and one with Age >= 50), and then mixing the two sets more and more to create "less" drift. But for tabular data, would something simpler do the trick, such as applying a uniform offset to all the Ages in one dataset, or adding random noise drawn from a normal distribution? What other standard techniques could be used to apply drift in this manner, with a degree that can be varied? Thank you!
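For reference, the three techniques described above (mixing two extreme ranges, a uniform shift, and Gaussian noise) can each be sketched with a single tunable "drift strength" knob. This is a minimal illustration, not deepchecks code; the column name `Age`, the threshold of 50, and the sampling-with-replacement choice in `mix_extremes` are assumptions for the example:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical reference data: the "Age" feature from the question.
ref = pd.DataFrame({"Age": rng.integers(18, 90, size=1000)})

def mix_extremes(df, alpha, threshold=50):
    """Resample so that a fraction `alpha` of rows comes from the
    Age >= threshold side. alpha=1.0 is maximal drift (only the high
    side); smaller alpha mixes the low side back in. Samples with
    replacement so any alpha in [0, 1] is valid."""
    low = df[df["Age"] < threshold]
    high = df[df["Age"] >= threshold]
    n_high = int(alpha * len(df))
    n_low = len(df) - n_high
    return pd.concat([
        high.sample(n=n_high, replace=True, random_state=0),
        low.sample(n=n_low, replace=True, random_state=0),
    ]).reset_index(drop=True)

def shift(df, delta):
    """Uniform additive shift: moves the whole distribution by `delta`."""
    out = df.copy()
    out["Age"] = out["Age"] + delta
    return out

def add_noise(df, sigma):
    """Gaussian noise: widens the distribution without moving its mean much."""
    out = df.copy()
    out["Age"] = out["Age"] + rng.normal(0, sigma, size=len(out))
    return out
```

Each function exposes exactly one knob (`alpha`, `delta`, or `sigma`), so you can sweep it from "no drift" toward "extreme drift" and compare against what the detector reports at each step.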
Hi @LifeBoey

This is a really good question! There are no "absolute" answers here, because drift detection is frequently used as a proxy for something else, for example "How well will my model do on this new data, for which I don't have labels yet?". That means that "how good" a drift detection algorithm is depends on the type of drift you expect to see in real life, and its relation to the quality of your model's predictions.

Having said that, there are a couple of things you can do:

Hope that helped! We'll be glad to hear about the results of your tests and how well the drift detection performed. We'll possibly also open-source some tools to automate this kind of testing in the future, so stay tuned!
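One simple way to sanity-check any drift detector against synthetic drift is to sweep the drift knob and confirm that a distribution distance grows monotonically with it. The sketch below uses a hand-rolled 1-D Wasserstein distance as a library-independent stand-in for a detector's drift score (`wasserstein_1d` is my own helper, not a deepchecks function; the normal reference sample is made up for illustration):

```python
import numpy as np

def wasserstein_1d(a, b):
    """Empirical 1-D Wasserstein-1 distance between two equal-sized
    samples: the mean absolute difference of the sorted values."""
    a, b = np.sort(np.asarray(a)), np.sort(np.asarray(b))
    return float(np.abs(a - b).mean())

rng = np.random.default_rng(1)
ref = rng.normal(40, 12, size=5000)  # hypothetical reference "Age" sample

# Sweep the drift knob (here: an additive shift) and watch the score grow.
for delta in [0, 2, 5, 10, 20]:
    drifted = ref + delta
    print(f"shift={delta:>2}  distance={wasserstein_1d(ref, drifted):.2f}")
```

For a pure additive shift, this distance equals the shift itself, which makes it easy to verify that applied drift and measured drift line up before moving on to subtler drift types.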