This repository contains all the source files required to run DeLUCS, a deep learning clustering algorithm for DNA sequences.

millanp95/DeLUCS

DeLUCS

This repository contains all the source files required to reproduce the results in the original DeLUCS paper (https://doi.org/10.1101/2021.05.13.444008), as well as a detailed guide for running the code.

Computational Pipeline:

1. Build the dataset:

	python build_dp.py --data_path=<PATH_sequence_folder>
  • Input: folders containing the sequences in FASTA format.
  • Output: pickle file of (label, sequence, accession) tuples.
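
To make the expected data layout concrete, here is a minimal sketch of this step. It is not the code in `build_dp.py`: the function names (`read_fasta`, `build_dataset`, `save_dataset`) are hypothetical, and it assumes that each subfolder name serves as the label and that the accession is the first token of each FASTA header.

```python
import pickle
from pathlib import Path


def read_fasta(path):
    """Yield (accession, sequence) for each record in a FASTA file."""
    accession, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if accession is not None:
                    yield accession, "".join(chunks)
                # Assumption: the accession is the first token of the header.
                tokens = line[1:].split()
                accession, chunks = (tokens[0] if tokens else ""), []
            else:
                chunks.append(line.upper())
    if accession is not None:
        yield accession, "".join(chunks)


def build_dataset(data_path):
    """Collect (label, sequence, accession) tuples.

    Assumption: every subfolder of `data_path` is one class, and its
    name is used as the label.
    """
    records = []
    for folder in sorted(Path(data_path).iterdir()):
        if not folder.is_dir():
            continue
        for fasta in sorted(folder.iterdir()):
            if not fasta.is_file():
                continue
            for accession, sequence in read_fasta(fasta):
                records.append((folder.name, sequence, accession))
    return records


def save_dataset(records, out_file):
    """Serialize the list of tuples with pickle."""
    with open(out_file, "wb") as handle:
        pickle.dump(records, handle)
```

Storing plain tuples keeps the pickle readable without importing any custom classes.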

2. Compute the mimic sequences:

	python get_pairs.py --data_path=<PATH_pickle_dataset> --k=6 --modify='mutation' --output=<PATH_output_file> --n_mimics=<n_mimics_per_sequence>
  • Input: pickle file of (label, sequence, accession) tuples.
  • Output: pickle file in the form (pairs, x_test, y_test).
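
The idea of pairing each sequence with mutated "mimics" can be sketched as follows. This is an illustration only, not the logic of `get_pairs.py`: the helper names, the uniform substitution model, and the default mutation rate are assumptions, and plain k-mer count vectors stand in for whatever representation the script actually computes. The sketch covers only the `pairs` element of the output.

```python
import random
from itertools import product


def kmer_index(k):
    """Map every k-mer over ACGT to a position in the count vector."""
    return {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}


def kmer_counts(sequence, k, index=None):
    """Count overlapping k-mers; windows with other symbols are skipped."""
    index = index if index is not None else kmer_index(k)
    counts = [0] * len(index)
    for i in range(len(sequence) - k + 1):
        j = index.get(sequence[i:i + k])
        if j is not None:
            counts[j] += 1
    return counts


def mutate(sequence, rate, rng):
    """Return a copy of `sequence` with random point substitutions.

    Assumption: each base is replaced with probability `rate` by one of
    the other three bases, chosen uniformly.
    """
    bases = "ACGT"
    out = []
    for base in sequence:
        if base in bases and rng.random() < rate:
            out.append(rng.choice([b for b in bases if b != base]))
        else:
            out.append(base)
    return "".join(out)


def make_pairs(sequences, k, n_mimics, rate=1e-2, seed=0):
    """Pair each sequence's k-mer vector with those of its mimics."""
    rng = random.Random(seed)
    index = kmer_index(k)
    pairs = []
    for seq in sequences:
        original = kmer_counts(seq, k, index)
        for _ in range(n_mimics):
            mimic = kmer_counts(mutate(seq, rate, rng), k, index)
            pairs.append((original, mimic))
    return pairs
```

With `k=6` each vector has 4^6 = 4096 entries, and `n_mimics` controls how many pairs are produced per input sequence.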

3. Train the model:

* For training DeLUCS and testing its performance:
	```
	python EvaluateDeLUCS.py --data_dir=<PATH_of_computed_mimics> --out_dir=<OUTPUT_DIR>
	```

	* Input: pickle file with the mimics in the form (pairs, x_test, y_test).
	* Output: confusion matrix.
			<!--* File with the misclassified sequences in the form (accession, true_label, predicted_label)-->

* For testing the performance of a single neural network trained in an unsupervised way (labels must be available):
	```
	python EvaluateSingleRun.py --data_dir=<PATH_of_computed_mimics> --out_dir=<OUTPUT_DIR>
	```
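
Because cluster IDs produced by an unsupervised model are arbitrary, a confusion matrix against the true labels is only readable once clusters have been matched to labels. The sketch below shows one way to do that. It is an assumption about the general technique rather than the code in the evaluation scripts, and the function names are hypothetical. It brute-forces the matching over all permutations, which is only practical for a handful of clusters.

```python
from itertools import permutations


def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true labels, columns are predicted labels."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def best_mapping(matrix):
    """Find the cluster-to-label assignment with the most agreements.

    Returns a tuple `m` where m[c] is the true label assigned to cluster c.
    Brute force over permutations: fine for a few clusters only.
    """
    n = len(matrix)
    return max(
        permutations(range(n)),
        key=lambda perm: sum(matrix[perm[c]][c] for c in range(n)),
    )


def evaluate(y_true, y_pred, n_classes):
    """Return the label-aligned confusion matrix and the accuracy."""
    raw = confusion_matrix(y_true, y_pred, n_classes)
    mapping = best_mapping(raw)
    aligned_pred = [mapping[p] for p in y_pred]
    aligned = confusion_matrix(y_true, aligned_pred, n_classes)
    accuracy = sum(aligned[i][i] for i in range(n_classes)) / len(y_true)
    return aligned, accuracy
```

For larger numbers of clusters, the same matching can be solved in polynomial time with the Hungarian algorithm, for example via `scipy.optimize.linear_sum_assignment`.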

Training on your own data

We recommend using the updated version of the code at https://github.com/Kari-Genomics-Lab for training on your own data.

Citation

If you find DeLUCS useful in your research, please consider citing:

@article{10.1371/journal.pone.0261531,
    doi = {10.1371/journal.pone.0261531},
    author = {Millán Arias, Pablo AND Alipour, Fatemeh AND Hill, Kathleen A. AND Kari, Lila},
    journal = {PLOS ONE},
    publisher = {Public Library of Science},
    title = {DeLUCS: Deep learning for unsupervised clustering of DNA sequences},
    year = {2022},
    month = {01},
    volume = {17},
    url = {https://doi.org/10.1371/journal.pone.0261531},
    pages = {1-25},
    number = {1},
}	
