Hi there, I've been exploring data drift detection and want to test how good deepchecks is at determining how much a given dataset has drifted. My main concern is how to generate drifted data in the first place, and how much to skew it, so that I can check whether deepchecks detects the amount of drift that was applied. Say I have a tabular dataframe like this, where I want to drift just the Age feature. What are the ways to artificially create a drifted dataset from a given dataset? What I've been doing is splitting it into two extreme ranges (e.g. one set with Age < 50 and one with Age >= 50), and then mixing the two sets more and more to create "less" drift. But for tabular data, would something simpler do the trick, such as applying a uniform offset to all the Ages in one dataset, or adding random noise drawn from a normal distribution? What other standard techniques could be used to apply drift in this manner, with a degree that can be varied? Thank you!
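For reference, the three techniques described above (mixing two extreme ranges, a uniform shift, and Gaussian noise) can each be sketched with a single tunable "drift strength" knob. This is a minimal illustration, not deepchecks code; the column name `Age`, the threshold of 50, and the sampling-with-replacement choice in `mix_extremes` are assumptions for the example:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical reference data: the "Age" feature from the question.
ref = pd.DataFrame({"Age": rng.integers(18, 90, size=1000)})

def mix_extremes(df, alpha, threshold=50):
    """Resample so that a fraction `alpha` of rows comes from the
    Age >= threshold side. alpha=1.0 is maximal drift (only the high
    side); smaller alpha mixes the low side back in. Samples with
    replacement so any alpha in [0, 1] is valid."""
    low = df[df["Age"] < threshold]
    high = df[df["Age"] >= threshold]
    n_high = int(alpha * len(df))
    n_low = len(df) - n_high
    return pd.concat([
        high.sample(n=n_high, replace=True, random_state=0),
        low.sample(n=n_low, replace=True, random_state=0),
    ]).reset_index(drop=True)

def shift(df, delta):
    """Uniform additive shift: moves the whole distribution by `delta`."""
    out = df.copy()
    out["Age"] = out["Age"] + delta
    return out

def add_noise(df, sigma):
    """Gaussian noise: widens the distribution without moving its mean much."""
    out = df.copy()
    out["Age"] = out["Age"] + rng.normal(0, sigma, size=len(out))
    return out
```

Each function exposes exactly one knob (`alpha`, `delta`, or `sigma`), so you can sweep it from "no drift" toward "extreme drift" and compare against what the detector reports at each step.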
Hi @LifeBoey

This is a really good question! There are no "absolute" answers here, because drift detection is frequently used as a proxy for something else, for example "How well will my model do on this new data, for which I don't have labels yet?". That means that "how good" a drift detection algorithm is depends on the type of drift you expect to see in real life, and its relation to the quality of your model's predictions.

Having said that, there are a couple of things you can do:

Hope that helped! We'll be glad to hear about the results of your tests and how well the drift detection performed. We'll possibly also open-source some tools to automate this kind of testing in the future, so stay tuned!
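One simple way to sanity-check any drift detector against synthetic drift is to sweep the drift knob and confirm that a distribution distance grows monotonically with it. The sketch below uses a hand-rolled 1-D Wasserstein distance as a library-independent stand-in for a detector's drift score (`wasserstein_1d` is my own helper, not a deepchecks function; the normal reference sample is made up for illustration):

```python
import numpy as np

def wasserstein_1d(a, b):
    """Empirical 1-D Wasserstein-1 distance between two equal-sized
    samples: the mean absolute difference of the sorted values."""
    a, b = np.sort(np.asarray(a)), np.sort(np.asarray(b))
    return float(np.abs(a - b).mean())

rng = np.random.default_rng(1)
ref = rng.normal(40, 12, size=5000)  # hypothetical reference "Age" sample

# Sweep the drift knob (here: an additive shift) and watch the score grow.
for delta in [0, 2, 5, 10, 20]:
    drifted = ref + delta
    print(f"shift={delta:>2}  distance={wasserstein_1d(ref, drifted):.2f}")
```

For a pure additive shift, this distance equals the shift itself, which makes it easy to verify that applied drift and measured drift line up before moving on to subtler drift types.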