
AnoShift

Accepted at NeurIPS 2022 - Datasets and Benchmarks Track

  • Title: AnoShift: A Distribution Shift Benchmark for Unsupervised Anomaly Detection
  • Authors: Marius Dragoi, Elena Burceanu, Emanuela Haller, Andrei Manolache, Florin Brad
  • ArXiv Preprint

💥💥 AD benchmark for both In-Distribution (ID) and Out-Of-Distribution (OOD) Anomaly Detection tasks (full results)

Out-Of-Distribution Anomaly Detection

ROC-AUC ( $\uparrow$ ):

| Method | IID | NEAR | FAR |
| --- | --- | --- | --- |
| OC-SVM [3] | 76.86 | 71.43 | 49.57 |
| IsoForest [10] | 86.09 | 75.26 | 27.16 |
| ECOD [6] | 84.76 | 44.87 | 49.19 |
| COPOD [8] | 85.62 | 54.24 | 50.42 |
| LOF [11] | 91.50 | 79.29 | 34.96 |
| SO-GAAL [1] | 50.48 | 54.55 | 49.35 |
| deepSVDD [2] | 92.67 | 87.00 | 34.53 |
| AE for anomalies [4] | 81.00 | 44.06 | 19.96 |
| LUNAR [9] | 85.75 | 49.03 | 28.19 |
| InternalContrastiveLearning [7] | 84.86 | 52.26 | 22.45 |
| BERT for anomalies [5] | 84.54 | 86.05 | 28.15 |
  • Average results over multiple runs
  • Train data files "[year]_subset.parquet" with year in {2006, 2007, 2008, 2009, 2010}
  • IID test data files "[year]_subset_valid.parquet" with year in {2006, 2007, 2008, 2009, 2010}
  • NEAR test data files "[year]_subset.parquet" with year in {2011, 2012, 2013}
  • FAR test data files "[year]_subset.parquet" with year in {2014, 2015}
  • Results for each split are reported as an average over the performance on each year
  • Scripts for reproducing the results are available in baselines_OOD_setup/ (check the Baselines section for more details).
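The year-to-split assignment above can be sketched with pandas. This is a minimal sketch, not the repository's loader: `DATA_DIR` and the exact directory layout are assumptions; only the `"[year]_subset.parquet"` / `"[year]_subset_valid.parquet"` naming comes from the lists above.

```python
import pandas as pd

# Assumed location of the preprocessed parquet files (adjust as needed).
DATA_DIR = "datasets/Kyoto-2006+/subset"

TRAIN_YEARS = [2006, 2007, 2008, 2009, 2010]  # also the IID test years
NEAR_YEARS = [2011, 2012, 2013]
FAR_YEARS = [2014, 2015]

def split_files(years, suffix="subset"):
    """Build the per-year parquet paths for one split."""
    return [f"{DATA_DIR}/{y}_{suffix}.parquet" for y in years]

def load_years(years, suffix="subset"):
    """Concatenate the per-year parquet files into one DataFrame."""
    return pd.concat(
        (pd.read_parquet(p) for p in split_files(years, suffix)),
        ignore_index=True,
    )

if __name__ == "__main__":
    train = load_years(TRAIN_YEARS)                     # "[year]_subset.parquet"
    iid_test = load_years(TRAIN_YEARS, "subset_valid")  # "[year]_subset_valid.parquet"
    near_test = load_years(NEAR_YEARS)
    far_test = load_years(FAR_YEARS)
```

Per-split scores can then be computed on each year's frame separately and averaged, matching the reporting convention above.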

In-Distribution Anomaly Detection

| Method | ROC-AUC ( $\uparrow$ ) |
| --- | --- |
| OC-SVM [3] | 68.73 |
| IsoForest [10] | 81.27 |
| ECOD [6] | 79.41 |
| COPOD [8] | 80.89 |
| LOF [11] | 87.61 |
| SO-GAAL [1] | 49.90 |
| deepSVDD [2] | 88.24 |
| AE for anomalies [4] | 64.08 |
| LUNAR [9] | 78.53 |
| InternalContrastiveLearning [7] | 66.99 |
| BERT for anomalies [5] | 79.62 |
  • Average results over multiple runs
  • Train data files "[year]_subset.parquet" with year in {2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015}
  • Test data files "[year]_subset_valid.parquet" with year in {2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015}
  • Scripts for reproducing the results are available in baselines_ID_setup/ (check the Baselines section for more details).

AnoShift Protocol

  • We introduce an unsupervised anomaly detection benchmark with data that shifts over time, built over Kyoto-2006+, a traffic dataset for network intrusion detection. This type of data meets the premise of a shifting input distribution: it covers a large time span (from 2006 to 2015), with naturally occurring changes over time. In AnoShift, we split the data into IID, NEAR, and FAR testing splits. We validate the performance degradation over time with diverse models, from masked language models (MLM) to classical Isolation Forest.

AnoShift overview - Kyoto-2006+

  • With all tested baselines, we notice a significant decrease in performance on inliers from the FAR years, suggesting there is something distinctive about those years. We observe a large Jeffreys divergence between FAR and the rest of the years for two features: service type and the number of bytes sent by the source IP.

  • From the OTDD analysis we observe that: first, the inliers from FAR are very distant from the training years; and second, the outliers from FAR are quite close to the training inliers.

  • We propose a BERT model trained with MLM and compare several training regimes: IID, finetuning, and a basic distillation technique, showing that acknowledging the distribution shift leads to better test performance on average.
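The Jeffreys divergence used above to compare feature distributions across year splits is the symmetrized KL divergence. A minimal sketch over discrete feature histograms (e.g. per-year counts of the service-type feature; the smoothing constant `eps` is an assumption to handle empty bins):

```python
import numpy as np

def jeffreys_divergence(p, q, eps=1e-12):
    """Jeffreys divergence J(P, Q) = KL(P || Q) + KL(Q || P)
    between two discrete distributions given as histograms or counts."""
    p = np.asarray(p, dtype=float) + eps  # eps avoids log(0) on empty bins
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

# Identical histograms give ~0; strongly shifted ones give a large value.
same = jeffreys_divergence([10, 20, 30], [10, 20, 30])
shifted = jeffreys_divergence([10, 20, 30], [30, 20, 10])
```

Because it is symmetric in its arguments, the same value is obtained whichever split is treated as the reference.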

Kyoto-2006+ data

  • Original Kyoto-2006+ data available at: https://www.takakura.com/Kyoto_data/ (in AnoShift we used the New data (Nov. 01, 2006 - Dec. 31, 2015))

  • Preprocessed Kyoto-2006+ data available at: https://share.bitdefender.com/s/9D4bBE7H8XTdYDB

  • The result is obtained by applying the preprocessing script data_processor/parse_kyoto_logbins.py to the original data.

  • The preprocessed dataset is provided in pandas parquet format, both as full sets and as subsets of 300k inlier samples, with the same outlier proportions as the original data.

  • In the notebook tutorials, we use the subsets for fast experimentation. In our experiments, the subset results are consistent with the full sets.

  • The label column (18) has value 1 for the inlier class (normal traffic), and -1 (known attack type) or -2 (unknown attack type) for anomalies.
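The label convention above maps naturally to a binary anomaly flag, since both -1 and -2 are anomalies. A minimal sketch (the file path is an assumption, and the label column is addressed by the name "18" here; adjust if your copy stores it differently):

```python
import pandas as pd

def binarize_labels(labels):
    """Map label 1 (inlier) -> 0, and -1 / -2 (anomalies) -> 1."""
    return (pd.Series(labels).astype(int) < 0).astype(int)

if __name__ == "__main__":
    # Hypothetical path to one preprocessed subset file.
    df = pd.read_parquet("datasets/Kyoto-2006+/subset/2006_subset.parquet")
    y = binarize_labels(df["18"])
    print(f"anomaly ratio: {y.mean():.4f}")
```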

Prepare data

```
curl https://share.bitdefender.com/s/9D4bBE7H8XTdYDB/download --output AnoShift.zip
mkdir datasets
mv AnoShift.zip datasets
unzip datasets/AnoShift.zip -d datasets/
rm datasets/AnoShift.zip
```

Prepare environment

  • Create a new conda environment: conda create --name anoshift
  • Activate the new environment: conda activate anoshift
  • Install pip: conda install -c anaconda pip
  • Upgrade pip: pip install --upgrade pip
  • Install dependencies: pip install -r requirements.txt

Baselines

We provide numerous baselines in the baselines_OOD_setup/ directory, which are a good entry point for getting familiar with the protocol:

  • baseline_*.ipynb: isoforest/ocsvm/LOF baselines on AnoShift
  • baseline_deep_svdd/baseline_deepSVDD.py: deepSVDD baseline on AnoShift
  • baseline_BERT_train.ipynb: BERT baseline on AnoShift
  • baseline_InternalContrastiveLearning.py: InternalContrastiveLearning baseline on AnoShift
  • baselines_PyOD.py: ['ecod', 'copod', 'lunar', 'ae', 'so_gaal'] baselines on AnoShift using PyOD
  • iid_finetune_distill_comparison.ipynb: compare the IID, finetune and distillation training strategies for the BERT model, on AnoShift
  • run the notebooks from the root of the project: jupyter-notebook .

If you intend to use AnoShift in the ID setup, please use the code provided in baselines_ID_setup/. You can use either the full set (full_set=1, all ten years) or the years corresponding to our original IID split (full_set=0, first five years); check the usage instructions for each baseline to switch between them.

Please cite this project as:

@article{druagoi2022anoshift,
  title={AnoShift: A Distribution Shift Benchmark for Unsupervised Anomaly Detection},
  author={Dr{\u{a}}goi, Marius and Burceanu, Elena and Haller, Emanuela and Manolache, Andrei and Brad, Florin},
  journal={Neural Information Processing Systems {NeurIPS}, Datasets and Benchmarks Track},
  year={2022}
}