Data drift detection using Autoencoders

Note for dark mode users: some figure titles in this README are hard to read in GitHub Dark Mode.

Definition

In this project, data drift means that a class not seen during training appears while the model is running. It can be an anomaly (an altered object of the class the model was trained on) or a completely new object of a different class.

Approach

Low-dimensional representations (embeddings) produced by autoencoders (AE) are clustered to detect new classes, with the Silhouette coefficient used to choose the number of clusters. Reconstruction error is also used to detect an "anomaly" within a single class. The autoencoder is trained without labels on one (good) class. When a new class is detected, a new AE can be introduced to model the average representation of that class.
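The reconstruction-error part of this approach can be sketched as follows. This is a minimal illustration, not the repository's code: a linear encoder/decoder built from an SVD stands in for a trained AE, the data is synthetic, and the 99th-percentile threshold is an assumption chosen for the example. The function names are hypothetical.

```python
import numpy as np

def fit_linear_autoencoder(x_train, embedding_size):
    """Stand-in for a trained AE: a linear encoder/decoder obtained from SVD."""
    mean = x_train.mean(axis=0)
    _, _, vt = np.linalg.svd(x_train - mean, full_matrices=False)
    return mean, vt[:embedding_size]

def reconstruction_error(x, mean, components):
    """Encode to the embedding space, decode back, return per-sample MSE."""
    embeddings = (x - mean) @ components.T
    reconstructions = embeddings @ components + mean
    return np.mean((x - reconstructions) ** 2, axis=1)

rng = np.random.default_rng(0)

# Synthetic "good" class: samples close to a 4-dimensional subspace of a 64-dimensional space.
basis = rng.normal(size=(4, 64))
good_train = rng.normal(size=(500, 4)) @ basis + 0.05 * rng.normal(size=(500, 64))
good_test = rng.normal(size=(100, 4)) @ basis + 0.05 * rng.normal(size=(100, 64))
# Synthetic anomalies: samples that do not lie near that subspace.
anomalies = rng.normal(size=(100, 64))

mean, components = fit_linear_autoencoder(good_train, embedding_size=4)

# Assumed rule: flag anything whose error exceeds the 99th percentile seen on good training data.
threshold = np.percentile(reconstruction_error(good_train, mean, components), 99)

good_flags = reconstruction_error(good_test, mean, components) > threshold
anomaly_flags = reconstruction_error(anomalies, mean, components) > threshold
print("flagged good samples:", good_flags.mean())
print("flagged anomalies:", anomaly_flags.mean())
```

With a real AE, only `reconstruction_error` would change: it would call the trained model instead of the linear projection.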

Further research

The problem studied here is a small one

It is unclear whether this approach will work on bigger problems (e.g. higher-resolution images). Tiling may help it scale.

Tiling big images to create a kind of attention map that localizes drifts

Images can be tiled and a separate AE trained for each segment of the camera view to localize drifts. This approach could also be used in a federated way to improve accuracy.
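A rough sketch of the tiling idea, under simplifying assumptions: instead of one AE per segment, it takes a single image and its reconstruction and averages the squared error per tile, which gives a coarse map of where a drift is located. The helpers `tile_image` and `tile_error_map` are hypothetical names, not part of the repository.

```python
import numpy as np

def tile_image(image, tile_size):
    """Split an (H, W, C) image into non-overlapping square tiles.

    Returns an array of shape (rows, cols, tile_size, tile_size, C).
    Pixels that do not fill a whole tile at the bottom/right are dropped.
    """
    height, width, channels = image.shape
    rows, cols = height // tile_size, width // tile_size
    cropped = image[: rows * tile_size, : cols * tile_size]
    tiles = cropped.reshape(rows, tile_size, cols, tile_size, channels)
    return tiles.swapaxes(1, 2)

def tile_error_map(image, reconstruction, tile_size):
    """Mean squared reconstruction error per tile, shape (rows, cols)."""
    diff = (image.astype(np.float64) - reconstruction.astype(np.float64)) ** 2
    return tile_image(diff, tile_size).mean(axis=(2, 3, 4))

# Toy example: the "reconstruction" differs from the image only in the top-left tile.
image = np.zeros((8, 8, 1))
reconstruction = image.copy()
reconstruction[0:4, 0:4] = 1.0

error_map = tile_error_map(image, reconstruction, tile_size=4)
print(error_map)
```

For the per-segment variant described above, each tile position would get its own AE, and `tile_image` would supply the training patches for it.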

Camera "problems" should be treated as "style"

Of course, we can say that there is data drift (or an anomaly, if we are talking about a single image) when the camera is moved or the lighting conditions change, but the question can be put differently. We can actually enumerate what can go wrong with a camera: lighting conditions, focus, movement, scratches. All of these can be modelled (e.g. applied to images algorithmically to simulate such conditions), which gives us two advantages:

We can generate data and model drift for such cases.
More interestingly, given that we know such things can happen, can we train a VAEGAN to disentangle style (e.g. camera problems) from features (e.g. objects), and maybe even interpolate the car from the information given? For example: yes, the image is badly lit, but we can restore the bad parts of it, we know with some certainty that it shows the specified object, and we can say it is not an anomaly. That way we would react to anomalies more sensitively.
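The camera problems listed above (lighting, focus, movement, scratches) could be simulated along these lines. This is an illustrative sketch only: it assumes 2-D grayscale images with values in [0, 1], uses very crude models of each effect, and all function names are hypothetical.

```python
import numpy as np

def change_lighting(image, gain):
    """Scale brightness by a gain factor and clip back to [0, 1]."""
    return np.clip(image * gain, 0.0, 1.0)

def shift_camera(image, dx, dy):
    """Translate the image by (dx, dy) pixels, filling uncovered pixels with 0."""
    height, width = image.shape
    shifted = np.zeros_like(image)
    src = image[max(0, -dy): height - max(0, dy), max(0, -dx): width - max(0, dx)]
    top, left = max(0, dy), max(0, dx)
    shifted[top: top + src.shape[0], left: left + src.shape[1]] = src
    return shifted

def defocus(image, radius):
    """Box blur as a crude stand-in for lost focus."""
    padded = np.pad(image, radius, mode="edge")
    size = 2 * radius + 1
    out = np.zeros_like(image, dtype=np.float64)
    for dy in range(size):
        for dx in range(size):
            out += padded[dy: dy + image.shape[0], dx: dx + image.shape[1]]
    return out / size**2

def add_scratch(image):
    """Draw a bright diagonal line across the image."""
    scratched = image.copy()
    idx = np.arange(min(image.shape))
    scratched[idx, idx] = 1.0
    return scratched

rng = np.random.default_rng(0)
image = rng.random((16, 16))

darker = change_lighting(image, gain=0.5)
shifted = shift_camera(image, dx=2, dy=0)
blurred = defocus(image, radius=1)
scratched = add_scratch(image)
```

Applying such transforms to good images yields labelled "style" variations, which is the kind of data the disentanglement idea above would need.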

Remaining questions:

What is the maximum capacity of one encoder to distinguish classes?

Data:

The dataset used for this project was mainly MVTEC data; its Dataloader is in the src folder. The dataset needs to be downloaded separately from the link provided; no registration or additional fee is needed.

Autoencoders:

Autoencoders are stored in the models folder. They are written with pytorch-lightning.

Supported architectures:

Variational Autoencoder
Vanilla Autoencoder
VAEGAN (has not been tested)

Results

This is a compilation of results from the notebooks folder; check the .ipynb files for more details.

Autoencoders:

Two autoencoder models were trained: "big" and "small". The small AE has an embedding size of 8; the big AE has an embedding size of 32.
A Variational Autoencoder with embedding size 32 was trained as well and showed results similar to the big AE. The big AE generalizes better, as can be seen from PCA of the same MVTEC data:

PCA on embeddings for Big AE

PCA on embeddings for Small AE

Original bottle images (AE was trained only on good bottles)

Bottle reconstructions

Original transistor images

Reconstructed transistor images

Clusters:

Even with 32 dimensions, clustering does not work very well. DBSCAN, which was expected to cope with the curse of dimensionality, did not work as well as hoped.
Combining 2-component PCA (explaining around 80% of the variance) with K-means clustering, using the Silhouette coefficient to determine cluster cardinality, worked fine.
One can use a GMM instead of K-means clustering to quantify uncertainty in class cardinality.
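The PCA + K-means + Silhouette combination described above can be sketched with scikit-learn as below. This is not the notebook code: the embeddings are synthetic 32-dimensional points, `estimate_num_classes` is a hypothetical helper, and the candidate range of cluster counts is an assumption. Note that the Silhouette coefficient is undefined for a single cluster, so this search starts at two clusters and cannot by itself confirm that only one class is present.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture

def estimate_num_classes(embeddings, max_clusters=6, seed=0):
    """Project to 2 PCA components, then pick k with the best Silhouette score."""
    projected = PCA(n_components=2, random_state=seed).fit_transform(embeddings)
    scores = {}
    for k in range(2, max_clusters + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(projected)
        scores[k] = silhouette_score(projected, labels)
    best_k = max(scores, key=scores.get)
    return best_k, scores, projected

# Synthetic stand-in for AE embeddings: three well-separated classes in 32 dimensions.
rng = np.random.default_rng(0)
centers = np.zeros((3, 32))
centers[1, 0] = 10.0
centers[2, 1] = 10.0
embeddings = np.vstack([c + 0.5 * rng.normal(size=(100, 32)) for c in centers])

best_k, scores, projected = estimate_num_classes(embeddings)
print("estimated number of classes:", best_k)

# GMM alternative: soft assignments give a per-sample membership probability.
gmm = GaussianMixture(n_components=best_k, random_state=0).fit(projected)
membership = gmm.predict_proba(projected)
```

A drop in the maximum membership probability for incoming samples is one possible signal that they fit none of the known clusters well.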

Dynamic results on MVTEC data:

Setup

Before cloning the repository, download the MVTEC data.

git clone https://github.com/uncleDecart/data_drift
cd data_drift
pip install -r requirements.txt
jupyter-notebook
