
AnoShift

Accepted at NeurIPS 2022 - Datasets and Benchmarks Track

  • Title: AnoShift: A Distribution Shift Benchmark for Unsupervised Anomaly Detection
  • Authors: Marius Dragoi, Elena Burceanu, Emanuela Haller, Andrei Manolache, Florin Brad
  • ArXiv Preprint

💥💥 AD benchmark for both In-Distribution (ID) and Out-Of-Distribution (OOD) Anomaly Detection tasks (full results)

Out-Of-Distribution Anomaly Detection

ROC-AUC ( $\uparrow$ ):

| Method | IID | NEAR | FAR |
| --- | --- | --- | --- |
| OC-SVM [3] | 76.86 | 71.43 | 49.57 |
| IsoForest [10] | 86.09 | 75.26 | 27.16 |
| ECOD [6] | 84.76 | 44.87 | 49.19 |
| COPOD [8] | 85.62 | 54.24 | 50.42 |
| LOF [11] | 91.50 | 79.29 | 34.96 |
| SO-GAAL [1] | 50.48 | 54.55 | 49.35 |
| deepSVDD [2] | 92.67 | 87.00 | 34.53 |
| AE for anomalies [4] | 81.00 | 44.06 | 19.96 |
| LUNAR [9] | 85.75 | 49.03 | 28.19 |
| InternalContrastiveLearning [7] | 84.86 | 52.26 | 22.45 |
| BERT for anomalies [5] | 84.54 | 86.05 | 28.15 |
  • Average results over multiple runs
  • Train data files "[year]_subset.parquet" with year in {2006, 2007, 2008, 2009, 2010}
  • IID test data files "[year]_subset_valid.parquet" with year in {2006, 2007, 2008, 2009, 2010}
  • NEAR test data files "[year]_subset.parquet" with year in {2011, 2012, 2013}
  • FAR test data files "[year]_subset.parquet" with year in {2014, 2015}
  • Results for each split are reported as an average over the performance on each year
  • Scripts for reproducing the results are available in baselines_OOD_setup/ (check the Baselines section for more details).
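The year-to-split assignment above can be sketched with pandas. This is a minimal sketch, not the repository's loader: `DATA_DIR` and the exact directory layout are assumptions; only the `"[year]_subset.parquet"` / `"[year]_subset_valid.parquet"` naming comes from the lists above.

```python
import pandas as pd

# Assumed location of the preprocessed parquet files (adjust as needed).
DATA_DIR = "datasets/Kyoto-2006+/subset"

TRAIN_YEARS = [2006, 2007, 2008, 2009, 2010]  # also the IID test years
NEAR_YEARS = [2011, 2012, 2013]
FAR_YEARS = [2014, 2015]

def split_files(years, suffix="subset"):
    """Build the per-year parquet paths for one split."""
    return [f"{DATA_DIR}/{y}_{suffix}.parquet" for y in years]

def load_years(years, suffix="subset"):
    """Concatenate the per-year parquet files into one DataFrame."""
    return pd.concat(
        (pd.read_parquet(p) for p in split_files(years, suffix)),
        ignore_index=True,
    )

if __name__ == "__main__":
    train = load_years(TRAIN_YEARS)                     # "[year]_subset.parquet"
    iid_test = load_years(TRAIN_YEARS, "subset_valid")  # "[year]_subset_valid.parquet"
    near_test = load_years(NEAR_YEARS)
    far_test = load_years(FAR_YEARS)
```

Per-split scores can then be computed on each year's frame separately and averaged, matching the reporting convention above.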

In-Distribution Anomaly Detection

| Method | ROC-AUC ( $\uparrow$ ) |
| --- | --- |
| OC-SVM [3] | 68.73 |
| IsoForest [10] | 81.27 |
| ECOD [6] | 79.41 |
| COPOD [8] | 80.89 |
| LOF [11] | 87.61 |
| SO-GAAL [1] | 49.90 |
| deepSVDD [2] | 88.24 |
| AE for anomalies [4] | 64.08 |
| LUNAR [9] | 78.53 |
| InternalContrastiveLearning [7] | 66.99 |
| BERT for anomalies [5] | 79.62 |
  • Average results over multiple runs
  • Train data files "[year]_subset.parquet" with year in {2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015}
  • Test data files "[year]_subset_valid.parquet" with year in {2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015}
  • Scripts for reproducing the results are available in baselines_ID_setup/ (check the Baselines section for more details).

AnoShift Protocol

  • We introduce an unsupervised anomaly detection benchmark with data that shifts over time, built over Kyoto-2006+, a traffic dataset for network intrusion detection. This type of data meets the premise of a shifting input distribution: it covers a large time span (from 2006 to 2015), with naturally occurring changes over time. In AnoShift, we split the data into IID, NEAR, and FAR testing splits. We validate the performance degradation over time with diverse models, from masked language models (MLM) to classical Isolation Forest.

AnoShift overview - Kyoto-2006+

  • With all tested baselines, we notice a significant decrease in performance on inliers from the FAR years, suggesting there is something distinctive about those years. We observe a large Jeffreys divergence between FAR and the rest of the years for two features: service type and the number of bytes sent by the source IP.

  • From the OTDD analysis we observe that: first, the inliers from FAR are very distant from the training years; and second, the outliers from FAR are quite close to the training inliers.

  • We propose a BERT model trained with MLM and compare several training regimes: IID, finetuning, and a basic distillation technique, showing that acknowledging the distribution shift leads to better test performance on average.
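The Jeffreys divergence used above to compare feature distributions across year splits is the symmetrized KL divergence. A minimal sketch over discrete feature histograms (e.g. per-year counts of the service-type feature; the smoothing constant `eps` is an assumption to handle empty bins):

```python
import numpy as np

def jeffreys_divergence(p, q, eps=1e-12):
    """Jeffreys divergence J(P, Q) = KL(P || Q) + KL(Q || P)
    between two discrete distributions given as histograms or counts."""
    p = np.asarray(p, dtype=float) + eps  # eps avoids log(0) on empty bins
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

# Identical histograms give ~0; strongly shifted ones give a large value.
same = jeffreys_divergence([10, 20, 30], [10, 20, 30])
shifted = jeffreys_divergence([10, 20, 30], [30, 20, 10])
```

Because it is symmetric in its arguments, the same value is obtained whichever split is treated as the reference.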

Kyoto-2006+ data

  • Original Kyoto-2006+ data available at: https://www.takakura.com/Kyoto_data/ (in AnoShift we used the New data (Nov. 01, 2006 - Dec. 31, 2015))

  • Preprocessed Kyoto-2006+ data available at: https://share.bitdefender.com/s/9D4bBE7H8XTdYDB

  • The result is obtained by applying the preprocessing script data_processor/parse_kyoto_logbins.py to the original data.

  • The preprocessed dataset is provided in pandas parquet format, both as full sets and as subsets of 300k inlier samples, with the same outlier proportions as the original data.

  • In the notebook tutorials, we use the subsets for fast experimentation. In our experiments, the subset results are consistent with the full sets.

  • The label column (18) has value 1 for the inlier class (normal traffic), and -1 (known attack type) or -2 (unknown attack type) for anomalies.
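The label convention above maps naturally to a binary anomaly flag, since both -1 and -2 are anomalies. A minimal sketch (the file path is an assumption, and the label column is addressed by the name "18" here; adjust if your copy stores it differently):

```python
import pandas as pd

def binarize_labels(labels):
    """Map label 1 (inlier) -> 0, and -1 / -2 (anomalies) -> 1."""
    return (pd.Series(labels).astype(int) < 0).astype(int)

if __name__ == "__main__":
    # Hypothetical path to one preprocessed subset file.
    df = pd.read_parquet("datasets/Kyoto-2006+/subset/2006_subset.parquet")
    y = binarize_labels(df["18"])
    print(f"anomaly ratio: {y.mean():.4f}")
```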

Prepare data

```
curl https://share.bitdefender.com/s/9D4bBE7H8XTdYDB/download --output AnoShift.zip
mkdir datasets
mv AnoShift.zip datasets
unzip datasets/AnoShift.zip -d datasets/
rm datasets/AnoShift.zip
```

Prepare environment

  • Create a new conda environment: conda create --name anoshift
  • Activate the new environment: conda activate anoshift
  • Install pip: conda install -c anaconda pip
  • Upgrade pip: pip install --upgrade pip
  • Install dependencies: pip install -r requirements.txt

Baselines

We provide numerous baselines in the baselines_OOD_setup/ directory, which are a good entry point for getting familiar with the protocol:

  • baseline_*.ipynb: isoforest/ocsvm/LOF baselines on AnoShift
  • baseline_deep_svdd/baseline_deepSVDD.py: deepSVDD baseline on AnoShift
  • baseline_BERT_train.ipynb: BERT baseline on AnoShift
  • baseline_InternalContrastiveLearning.py: InternalContrastiveLearning baseline on AnoShift
  • baselines_PyOD.py: ['ecod', 'copod', 'lunar', 'ae', 'so_gaal'] baselines on AnoShift using PyOD
  • iid_finetune_distill_comparison.ipynb: compare the IID, finetune and distillation training strategies for the BERT model, on AnoShift
  • run the notebooks from the root of the project: jupyter-notebook .

If you intend to use AnoShift in the ID setup, please use the code provided in baselines_ID_setup/. You can use either the full set (full_set=1, all ten years) or the years corresponding to our original IID split (full_set=0, first five years); check the usage instructions for each baseline to switch between them.

Please cite this project as:

@article{druagoi2022anoshift,
  title={AnoShift: A Distribution Shift Benchmark for Unsupervised Anomaly Detection},
  author={Dr{\u{a}}goi, Marius and Burceanu, Elena and Haller, Emanuela and Manolache, Andrei and Brad, Florin},
  journal={Neural Information Processing Systems {NeurIPS}, Datasets and Benchmarks Track},
  year={2022}
}