reymond-group/ML-PDGA-anticancer-peptides

Machine Learning Designs Anticancer Peptides

Recurrent Neural Network (RNN) models

The RNN activity and hemolysis classifiers and the prior generative model used in this study were previously developed for the recently published work “Machine Learning Designs Non-Hemolytic Antimicrobial Peptides”, using activity and hemolysis data of ACPs and AMPs extracted from DBAASP (Database of Antimicrobial Activity and Structure of Peptides). In the present study, the RNN prior generative model was fine-tuned by transfer learning (TL) on 53 linear, natural peptides active against HeLa cancer cells, also extracted from DBAASP. TL consisted of a second round of training of the prior model on 37 of the 53 anti-HeLa peptides, namely those already present in the training set originally used for the prior model. The remaining 16 sequences were used as the test set. Training used the negative log-likelihood loss (NLLL) and stochastic gradient descent with a momentum of 0.9 and a learning rate of 0.00001, and was stopped when the NLLL of the test set reached its minimum. Finally, 50,000 peptide sequences were sampled from the fine-tuned generative model.
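The training recipe above can be illustrated with a small, framework-free sketch. It is not the repository's training code: the function names are hypothetical, the momentum update follows one common convention (velocity = momentum × velocity + gradient), and the loss curve in the usage lines is invented purely for illustration.

```python
import math

def sequence_nll(true_token_probs):
    """Negative log-likelihood of one sequence, given the probability the
    model assigned to each correct next token."""
    return -sum(math.log(p) for p in true_token_probs)

def sgd_momentum_step(params, grads, velocity, lr=1e-5, momentum=0.9):
    """One SGD update with momentum (defaults match the values in the text).
    Convention assumed here: v <- momentum * v + g, then p <- p - lr * v."""
    new_velocity = [momentum * v + g for v, g in zip(velocity, grads)]
    new_params = [p - lr * v for p, v in zip(params, new_velocity)]
    return new_params, new_velocity

def best_epoch(test_nll_per_epoch):
    """Index of the epoch at which the test-set NLL is lowest, i.e. the
    checkpoint one would keep when stopping at the minimum."""
    return min(range(len(test_nll_per_epoch)),
               key=test_nll_per_epoch.__getitem__)

# Hypothetical test-set NLL values recorded after each fine-tuning epoch.
test_curve = [2.10, 1.85, 1.72, 1.75, 1.90]
print(best_epoch(test_curve))
```

With only 16 held-out sequences the test NLL can be noisy, so in practice one might keep a checkpoint per epoch and select the best one afterwards rather than stopping at the first increase.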

Peptide Design Genetic Algorithm (PDGA)

The peptide design genetic algorithm (PDGA) was adapted to use the string version of the MinHashed atom-pair fingerprint up to a diameter of four bonds (MAP4) [4]. This version of the MAP4 fingerprint is obtained with the parameter “return_strings = True” and returns the shingles of the encoded molecule. In its fitness function, the MAP4 PDGA scores each generated peptide structure by the Jaccard distance (JD) between the shingles of its MAP4 fingerprint and the shingles of the MAP4 fingerprint of the given query. The MAP4 PDGA was run 10 times in parallel for 12 hours using the anticancer peptide Lasioglossin III as query, an initial population of 100 peptides, a mutation rate and a generation gap of 0.5, linear topology, and excluding non-natural building blocks. The runs generated 715,658 unique sequences.
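As a minimal sketch of the fitness measure described above, the Jaccard distance between two shingle sets can be computed as follows. The function names and the shingle strings are placeholders, not real MAP4 output, and the PDGA code in this repository may compute the score differently.

```python
def jaccard_distance(shingles_a, shingles_b):
    """Jaccard distance between two collections of shingles:
    1 - |A ∩ B| / |A ∪ B|. Two empty sets are treated as identical."""
    a, b = set(shingles_a), set(shingles_b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def rank_by_fitness(query_shingles, population):
    """Sort (name, shingles) pairs so the candidate closest to the query
    (smallest Jaccard distance) comes first."""
    return sorted(population,
                  key=lambda item: jaccard_distance(query_shingles, item[1]))

# Placeholder shingles, for illustration only.
query = {"s1", "s2", "s3"}
population = [("far", {"s7", "s8"}), ("near", {"s2", "s3", "s4"})]
print([name for name, _ in rank_by_fitness(query, population)])
```

A distance of 0 means the two shingle sets are identical and 1 means they share nothing, so a genetic algorithm that minimizes this value drifts toward structures resembling the query.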

Properties calculation

The Levenshtein distance (LD) to the nearest neighbor (NN) in the training and test sets used to build the RNN activity and hemolysis classifiers [1] was calculated with the Levenshtein Python package [5,6]. Helicity was predicted with SPIDER3 [7], and the helicity fraction was calculated as the number of residues predicted to be helical divided by the length of the sequence. The hydrophobic moment was calculated as described by Eisenberg et al. [8] Hemolysis and activity were predicted by the respective classifiers, with the probabilistic outputs converted into binary classes using the threshold that kept false positive predictions below 6% (0.99205756 for the activity classifier and 0.99981695 for the hemolysis classifier).
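The property calculations above can be sketched in plain Python. Several points here are assumptions rather than facts from this repository: the edit distance is a hand-written stand-in for the Levenshtein package, the secondary structure is assumed to arrive as a string in which "H" marks a helical residue, the hydrophobic moment is assumed to use a 100° angle per residue and to be averaged over the sequence length, the hydrophobicity scale is passed in by the caller, and a probability equal to the threshold is counted as positive.

```python
import math

def levenshtein(a, b):
    """Edit distance between two strings (classic dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]

def nearest_neighbor_ld(sequence, reference_set):
    """Smallest Levenshtein distance from a sequence to any reference."""
    return min(levenshtein(sequence, ref) for ref in reference_set)

def helicity_fraction(secondary_structure):
    """Fraction of residues labelled 'H' in a per-residue prediction string."""
    return secondary_structure.count("H") / len(secondary_structure)

def hydrophobic_moment(sequence, scale, angle_deg=100.0):
    """Mean hydrophobic moment per residue for a periodic structure.
    `scale` maps residue letters to hydrophobicity values (caller-supplied)."""
    sin_sum = cos_sum = 0.0
    for i, residue in enumerate(sequence):
        angle = math.radians(i * angle_deg)
        sin_sum += scale[residue] * math.sin(angle)
        cos_sum += scale[residue] * math.cos(angle)
    return math.hypot(sin_sum, cos_sum) / len(sequence)

# Thresholds quoted in the text above.
ACTIVITY_THRESHOLD = 0.99205756
HEMOLYSIS_THRESHOLD = 0.99981695

def is_positive(probability, threshold):
    """Binary call from a classifier probability (>= is an assumption)."""
    return probability >= threshold
```

Because both thresholds are very close to 1, a peptide is labelled active or hemolytic only when the classifier is almost certain, which is what keeps false positive predictions low.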

Peptide sequences selection

The sequences sampled from the fine-tuned generative model (first approach) and generated with the PDGA (second approach) were filtered on multiple criteria. First, to ensure novelty, we selected sequences with LD > 5 from the classifiers' training sets and LD > 4 from the classifiers' test sets. Second, we removed sequences outside the applicability domain of the classifiers. To do so, the minimum LD of every test set compound to the training set was calculated, and the applicability domain of the classifiers was set to the 90% quantile. This led to the exclusion of all generated sequences with an LD of 8 or more to the training set of the classifiers. Only sequences of up to 15 residues were selected to facilitate synthesis. Because of the low percentage of D-amino acids in the training set, sequences containing D-residues (present only in the first approach dataset) were excluded. Since helicity and amphiphilicity often correlate with antimicrobial activity, we selected sequences with a predicted helicity fraction above 0.8 and an Eisenberg hydrophobic moment above 0.3. These thresholds were chosen based on the median values of the active sequences in the training and test sets, 0.83 and 0.31 respectively. The filtered sequences were clustered using the RDKit [9] Butina module with a threshold of 10 and the Levenshtein distance as distance function, and the center of each cluster was picked. The workflow resulted in 14 sequences for the first approach and 22 sequences for the second approach.
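The filtering criteria above, up to but not including the clustering step, can be expressed as a single predicate. This is a sketch under stated assumptions, not the repository's selection script: the `Candidate` record and its field names are hypothetical, the distances and predicted properties are assumed to be precomputed, and lowercase letters are assumed to denote D-residues.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    sequence: str              # one-letter code; lowercase = D-residue (assumed)
    ld_train: int              # LD to nearest neighbor in the training set
    ld_test: int               # LD to nearest neighbor in the test set
    helicity_fraction: float   # predicted fraction of helical residues
    hydrophobic_moment: float  # Eisenberg hydrophobic moment

def passes_filters(c):
    """Apply the selection criteria described in the text, in order."""
    has_d_residue = any(ch.islower() for ch in c.sequence)
    return (
        c.ld_train > 5                  # novelty with respect to the training set
        and c.ld_test > 4               # novelty with respect to the test set
        and c.ld_train < 8              # inside the applicability domain
        and len(c.sequence) <= 15       # short enough to synthesize easily
        and not has_d_residue           # L-amino acids only
        and c.helicity_fraction > 0.8
        and c.hydrophobic_moment > 0.3
    )

# Hypothetical candidates, for illustration only.
candidates = [
    Candidate("KLLKKLLKKLLKKL", 6, 5, 0.90, 0.50),
    Candidate("KLLKKLLKKLLKKL", 8, 9, 0.90, 0.50),  # outside the domain
]
selected = [c for c in candidates if passes_filters(c)]
print(len(selected))
```

Taken together, the novelty and applicability-domain conditions leave only a narrow window of LD 6 or 7 to the training set, which is why the predicate checks both a lower and an upper bound on `ld_train`.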

TMAP visualization

The default version of the MAP4 fingerprint was calculated for the 53 ACPs active against HeLa cancer cells and for the 202 and 152 peptides obtained from the first and the second approach, respectively. The indices generated by the MinHash procedure of the MAP4 calculation were used to create a locality-sensitive hashing (LSH) forest [10] of 32 trees. Then, for each structure, the 20 approximate nearest neighbors (NNs) in the MAP4 feature space were extracted from the LSH forest, and the tree layout was calculated. The LSH forest and the minimum spanning tree layout were calculated using the TMAP open-source code. Finally, Faerun [11] was used to display the resulting layout interactively.

To predict the α-helix percentage, we used SPIDER3.

Required environment installation:

  • conda env create -f environment.yml
  • conda activate aipep
