Code for the paper "Exploring Data-Driven Chemical SMILES Tokenization Approaches to Identify Key Protein-Ligand Binding Moieties", published in Molecular Informatics

boun-tabi/exploring-chemical-words

Exploring Data-Driven Chemical SMILES Tokenization Approaches to Identify Key Protein-Ligand Binding Moieties

About

This repository contains the code for "Exploring Data-Driven Chemical SMILES Tokenization Approaches to Identify Key Protein-Ligand Binding Moieties", published in Molecular Informatics (DOI: https://doi.org/10.1002/minf.202300249). The accompanying materials are available on Zenodo.

Installation

To use the scripts, clone the repository and install the required dependencies:

git clone https://github.com/boun-tabi/exploring-chemical-words.git
cd exploring-chemical-words
pip install -r requirements.txt

Usage

Identifying chemical vocabularies

Train subword tokenization models to identify chemical vocabularies.

python word_identification.py --model_type [model_type] --corpus [corpus_path] --save_name [save_name] --vocab_size [vocab_size]
  • --model_type: Choose from 'bpe', 'unigram', 'wordpiece'
  • --corpus: Filepath of the corpus containing SMILES strings
  • --save_name: Filename for the output vocabulary file
  • --vocab_size: Desired size of the vocabulary
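
To give a feel for what a subword model such as 'bpe' learns from a SMILES corpus, here is a minimal pure-Python sketch of byte-pair-encoding merges. It is an illustration only, not the code in word_identification.py: the function names train_bpe and merge_pair are hypothetical, and it assumes each SMILES string is split into single characters before merging (so a two-letter atom such as Cl starts as two symbols).

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def train_bpe(smiles_list, num_merges):
    """Learn up to `num_merges` merges (hypothetical helper, character-level start).

    Returns the ordered merge list, the sorted set of symbols that remain in
    the corpus after merging, and the tokenized corpus itself.
    """
    # Assumption: every SMILES string starts as a list of single characters.
    corpus = [list(smiles) for smiles in smiles_list]
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the whole corpus.
        pairs = Counter()
        for symbols in corpus:
            pairs.update(zip(symbols, symbols[1:]))
        if not pairs:
            break  # nothing left to merge
        best = pairs.most_common(1)[0][0]
        merges.append(best)
        corpus = [merge_pair(symbols, best) for symbols in corpus]
    vocab = sorted({symbol for symbols in corpus for symbol in symbols})
    return merges, vocab, corpus
```

Calling train_bpe(["CCO", "CCN", "CCC"], 1) merges the most frequent adjacent pair once; increasing the number of merges plays a role similar to raising --vocab_size, since every merge can add one new chemical word.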

Selecting key chemical words

Identify significant words in chemical documents using the specified vocabulary.

python highlighter.py --dataset [dataset_name] --vocabulary [vocabulary_name]
  • --dataset: Specify dataset name (e.g., 'lit_pcba', 'bdb', or others)
  • --vocabulary: Name or path of the vocabulary file
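
This README does not say how highlighter.py scores words. As one plausible illustration, the sketch below ranks chemical words with TF-IDF, treating the tokenized ligands of each target as one "chemical document". The choice of TF-IDF, the function names tfidf_scores and top_words, and the input layout are all assumptions made for this example.

```python
import math
from collections import Counter


def tfidf_scores(documents):
    """Score words per document with TF-IDF (assumed scoring scheme).

    `documents` maps a document name (for example a target) to the list of
    chemical words found in its ligands.
    """
    n_docs = len(documents)
    # Document frequency: in how many documents each word appears.
    doc_freq = Counter()
    for words in documents.values():
        doc_freq.update(set(words))
    scores = {}
    for name, words in documents.items():
        term_freq = Counter(words)
        total = len(words)
        scores[name] = {
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in term_freq.items()
        }
    return scores


def top_words(scores, name, k=3):
    """Return the k highest-scoring words of one document (ties broken alphabetically)."""
    ranked = sorted(scores[name].items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:k]]
```

With this scheme, a word that occurs in every document gets a score of zero, so only words that are specific to a target's binders rank highly.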

Computing chemical vocabulary statistics

Perform a comprehensive analysis of chemical words, deriving key statistics and insights.

python analyzer.py --dataset [dataset_name] --vocabulary [vocabulary_name]
  • --dataset: Choose the dataset (e.g., 'lit_pcba', 'bdb', or others)
  • --vocabulary: Name or path of the vocabulary file
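
The exact statistics reported by analyzer.py are not listed here. The sketch below shows the kind of summary one might compute for a tokenized dataset; the function name vocabulary_statistics and the chosen statistics are hypothetical, and the input is assumed to be a non-empty list of token lists, one per SMILES string.

```python
from collections import Counter
from statistics import mean


def vocabulary_statistics(tokenized_molecules):
    """Summarize the chemical words used in a tokenized dataset (illustrative only)."""
    # Frequency of every distinct chemical word across all molecules.
    counts = Counter(word for molecule in tokenized_molecules for word in molecule)
    return {
        "vocab_size": len(counts),
        # Average length, in characters, over distinct words.
        "mean_word_length": mean(len(word) for word in counts),
        # Average number of words needed to write one molecule.
        "mean_words_per_molecule": mean(len(m) for m in tokenized_molecules),
        "most_common": counts.most_common(3),
    }
```

Comparing such summaries across vocabularies of different sizes or model types is one simple way to see how the tokenization choice changes the resulting chemical words.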

Streamlit app

Launch an interactive Streamlit application illustrating the key chemical words for particular targets along with associated binders and drugs.

streamlit run app.py

Citation

@article{https://doi.org/10.1002/minf.202300249,
    author = {Temizer, Asu Busra and Uludoğan, Gökçe and Özçelik, Rıza and Koulani, Taha and Ozkirimli, Elif and Ulgen, Kutlu O. and Karali, Nilgun and Özgür, Arzucan},
    title = {Exploring Data-Driven Chemical SMILES Tokenization Approaches to Identify Key Protein-Ligand Binding Moieties},
    journal = {Molecular Informatics},
    doi = {https://doi.org/10.1002/minf.202300249},
    url = {https://onlinelibrary.wiley.com/doi/abs/10.1002/minf.202300249},
    eprint = {https://onlinelibrary.wiley.com/doi/pdf/10.1002/minf.202300249},
}
