Skip to content

DRLib/CDR

Repository files navigation

CDR - Interactive Visual Cluster Analysis by Contrastive Dimensionality Reduction

teaser

Environment setup

This project was based on python 3.6 and pytorch 1.7.0. See requirements.txt for all prerequisites, and you can also install them using the following command.

pip install -r requirements.txt

Datasets

Size Dimensionality Clusters Type Link
Animals 10000 512 10 image Kaggle
Anuran calls 7195 22 8 tabular UCI
Banknote 1097 4 2 text UCI
Cifar10 10000 512 10 image Alex Krizhevsky
Cnae9 864 856 9 text UCI
Cats-vs-Dogs 10000 512 2 image Kaggle
Fish 9000 512 9 image Kaggle
Food 3585 512 11 image Kaggle
Har 8240 561 6 tabular UCI
Isolet 1920 617 8 text UCI
ML binary 1000 10 2 tabular Kaggle
MNIST 10000 784 10 image Yann LeCun
Pendigits 8794 16 10 tabular UCI
Retina 10000 50 12 tabular Paper
Satimage 5148 36 6 image UCI
Stanford Dogs 1384 512 7 image Stanford University
Texture 4400 40 11 text KEEL
USPS 7440 256 10 image Kaggle
Weathers 900 512 4 image Kaggle
WiFi 1600 7 4 tabular UCI

For image dataset such as Animals, Cifar10, Cats-vs-Dogs, Fish, Food, Stanford Dogs and Weathers, we use SimCLR to get their 512 dimensional representations.

All the datasets are supported with H5 format (e.g. usps.h5), and we need all the dataset to be stored at data/H5 Data. For image data sets, place all images as 0.jpg,1.jpg,...,n-1.jpg format and put it in the static/images/(dataset name)(e.g. static/images/usps) directory.

Pre-trained model weights

The pre-training model weights on all the above data sets can be found in Google Drive.

Training

To train the model on USPS with a single GPU, check the configuration file configs/CDR.yaml, and try the following command:

python train.py --configs configs/CDR.yaml

Config File

The configuration files can be found under the folder ./configs, and we provide two config files with the format .yaml. We give the guidance of several key parameters in this paper below.

  • n_neighbors(K): It determines the granularity of the local structure to be maintained in low-dimensional space. A too small value will cause one cluster in the high-dimensional space be projected into two low-dimensional clusters, while too large value will aggravate the problem of clustering overlap. The default setting is K = 15.
  • batch_size(B): It determines the number of negative samples. A larger value is better, but it also depends on the data size. We recommend to use B = n/10, where n is the number of instances.
  • temperature(t): It determines the ability of the model upon neighborhood preservation. The smaller the value is, the more strict the model is to maintain the neighborhood, but it also keeps more error neighbors. The default setting is t = 0.15.
  • separate_upper(μ): It determines the intensity of cluster separation. The larger the value is, the higher the cluster separation degree is. The default setting is μ = 0.11.

Load pre-trained model for visualization

To use our pre-trained model, try the following command:

# python vis.py --configs 'configuration file path' --ckpt 'model weights path'

# Example on USPS dataset
python vis.py --configs configs/CDR.yaml --ckpt_path model_weights/usps.pth.tar

Prototype interface

Using our prototype interface for interactive visual clustering analysis, try the following command.

python app.py --config configs/ICDR.yaml

After that, the prototype interface can be found in http://127.0.0.1:5000 .

frontend_07

Cite

@article{xia2022interactive,
  title={Interactive visual cluster analysis by contrastive dimensionality reduction},
  author={Xia, Jiazhi and Huang, Linquan and Lin, Weixing and Zhao, Xin and Wu, Jing and Chen, Yang and Zhao, Ying and Chen, Wei},
  journal={IEEE Transactions on Visualization and Computer Graphics},
  volume={29},
  number={1},
  pages={734--744},
  year={2022},
  publisher={IEEE}
}