CASTOR-KRFE

CASTOR-KRFE v1.3 Help file
K-mers based feature identifier for viral genomic classification
Copyright (C) 2023 Dylan Lebatteux, Amine M. Remita, Abdoulaye Banire Diallo
Authors : Dylan Lebatteux, Amine M. Remita
Contact : lebatteux.dylan@courrier.uqam.ca

Description

CASTOR-KRFE is an alignment-free method that identifies a set of k-mers able to discriminate between groups of genomic sequences. Its core is recursive feature elimination with Support Vector Machines (SVM-RFE), a machine learning feature selection method. CASTOR-KRFE identifies an optimal k-mer length k that maximizes classification performance while minimizing the number of features. The extracted set of k-mers can be used to build a prediction model, which can then be used to classify new genomic sequences. A new module has also been included that identifies variations of the discriminative k-mers and their associated information according to the sequence class.
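To make the alignment-free representation concrete, the following is a minimal sketch of how sequences can be turned into k-mer count vectors, which is the kind of feature matrix a selection method such as SVM-RFE operates on. The function names `count_kmers` and `kmer_matrix` are illustrative only and are not taken from the CASTOR-KRFE source code.

```python
from collections import Counter


def count_kmers(sequence, k):
    """Count overlapping k-mers of length k, skipping any that contain
    characters other than A, C, G or T (for example N)."""
    sequence = sequence.upper()
    counts = Counter()
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


def kmer_matrix(sequences, k):
    """Return (vocabulary, rows): the sorted list of k-mers observed in
    any sequence, and one count vector per sequence over that list."""
    per_sequence = [count_kmers(s, k) for s in sequences]
    vocabulary = sorted(set().union(*per_sequence))
    rows = [[c[kmer] for kmer in vocabulary] for c in per_sequence]
    return vocabulary, rows


if __name__ == "__main__":
    vocabulary, rows = kmer_matrix(["ACGT", "ACAC"], 2)
    print(vocabulary)
    print(rows)
```

Each column of the resulting matrix is one candidate feature; feature elimination then keeps only the columns that best separate the classes.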

Required software

Parameters

Parameters to set in configuration_file.ini :

k_min : Minimum k-mer length
k_max : Maximum k-mer length
T : Performance threshold, given as a fraction (T = 0.99 is recommended)
training_fasta : Path of the training FASTA file
testing_fasta : Path of the testing FASTA file
reference_sequence : Path of the reference sequence in GenBank format
k_mers_path : Path of the extracted k-mers file
model_path : Path of the prediction model file
prediction_path : Path of the sequence prediction file
evaluation_mode : Enable evaluation mode during prediction (True/False)
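As an illustration of how these parameters fit together, here is a hedged sketch that reads them with Python's standard `configparser`. The section name `[parameters]`, the example values of k_min and k_max, all file paths, and the `load_parameters` helper are assumptions made for this example; check the configuration_file.ini shipped with the repository for the actual layout.

```python
import configparser

# Hypothetical file content: section name, k values and paths are assumed.
EXAMPLE = """
[parameters]
k_min = 5
k_max = 10
T = 0.99
training_fasta = input/training.fasta
testing_fasta = input/testing.fasta
reference_sequence = input/reference.gb
k_mers_path = output/k_mers.fasta
model_path = output/model.pkl
prediction_path = output/Prediction.csv
evaluation_mode = True
"""


def load_parameters(text):
    """Parse the parameters into typed Python values."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep key case so "T" is not lowercased
    parser.read_string(text)
    section = parser["parameters"]
    params = {
        "k_min": section.getint("k_min"),
        "k_max": section.getint("k_max"),
        "T": section.getfloat("T"),
        "evaluation_mode": section.getboolean("evaluation_mode"),
    }
    for key in ("training_fasta", "testing_fasta", "reference_sequence",
                "k_mers_path", "model_path", "prediction_path"):
        params[key] = section[key]
    # Sanity check assumed for this sketch: the k range must not be empty.
    if params["k_min"] > params["k_max"]:
        raise ValueError("k_min must not exceed k_max")
    return params


if __name__ == "__main__":
    print(load_parameters(EXAMPLE))
```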

Utilization

Set the parameters described in the previous section in configuration_file.ini.
Then run the following command :

$ python main.py configuration_file.ini

Select an option:

1) Extract k-mers | Required parameters: T, k_min, k_max, training_fasta and k_mers_path
2) Fit a model | Required parameters: training_fasta, k_mers_path and model_path
3) Predict sequences | Required parameters: testing_fasta, k_mers_path, model_path, prediction_path and evaluation_mode
4) Motif analyzer | Required parameters: training_fasta, k_mers_path and reference_sequence
5) Exit/Quit

Fasta file format example for n sequences:

>id_sequence_1|target_sequence_1
CTCAACTCAGTTCCACCAGGCTCTGTTGGATCCGAGGGTAAGGGCTCTGTATTTTCCTGC
>id_sequence_2|target_sequence_2
CTCAACTCAGTTCCACCAGGCTCTGTTGGATCCGAGGGTAAGGGCTCTGTATTTTCCTGC
...
...
...
>id_sequence_n-1|target_sequence_n-1
CTCAACTCAGTTCCACCAGGCTCTGTTGGATCCGAGGGTAAGGGCTCTGTATTTTCCTGC
>id_sequence_n|target_sequence_n
CTCAACTCAGTTCCACCAGGCTCTGTTGGATCCGAGGGTAAGGGCTCTGTATTTTCCTGC

The character "|" is used to separate the sequence ID from its target.
The target must be specified in the fasta file for a prediction with evaluation_mode = True.
For more detailed examples, see the data sets in the Data folder.
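The header convention above can be checked before running a prediction. The following is a minimal sketch of a parser that splits each header on "|" into a sequence ID and a target; `parse_fasta` is a hypothetical helper written for this example and is not part of CASTOR-KRFE. A record without a "|" gets a target of `None`, which is the case that would be a problem with evaluation_mode = True.

```python
def _make_record(header, chunks):
    """Split 'id|target' and join the sequence lines of one record."""
    sequence_id, separator, target = header.partition("|")
    target = target.strip() if separator else None
    return sequence_id.strip(), target, "".join(chunks)


def parse_fasta(text):
    """Return a list of (sequence_id, target, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append(_make_record(header, chunks))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append(_make_record(header, chunks))
    return records


if __name__ == "__main__":
    example = ">id_sequence_1|target_sequence_1\nCTCAACTCAG\nTTCCACCAGG\n"
    for sequence_id, target, sequence in parse_fasta(example):
        print(sequence_id, target, len(sequence))
```

Listing the records whose target is `None` is a quick way to find sequences that cannot be evaluated.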

Output

k_mers.fasta : List of the extracted k-mers
model.pkl : Prediction model generated by CASTOR-KRFE
Prediction.csv : Prediction results for the unknown genomic sequences
Signature_location.xlsx : Analysis report associated with a signature

Reference to cite CASTOR-KRFE

Lebatteux, D., Remita, A. M., & Diallo, A. B. (2019). Toward an alignment-free method for feature extraction and accurate classification of viral sequences. Journal of Computational Biology, 26(6), 519-535.

Reference to cite KANALYZER (Option 4: Motif analyzer)

Lebatteux, Dylan, et al. "KANALYZER: a method to identify variations of discriminative k-mers in genomic sequences." 2022 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). IEEE Computer Society, 2022.
