Exploring Data-Driven Chemical SMILES Tokenization Approaches to Identify Key Protein-Ligand Binding Moieties

About

This repository contains code for "Exploring Data-Driven Chemical SMILES Tokenization Approaches to Identify Key Protein-Ligand Binding Moieties", published in Molecular Informatics. The accompanying materials are available on Zenodo.

Installation

To use the scripts, clone the repository and install the required dependencies:

git clone https://github.com/boun-tabi/exploring-chemical-words.git
cd exploring-chemical-words
pip install -r requirements.txt

Usage

Identifying chemical vocabularies

Train subword tokenization models to identify chemical vocabularies.

python word_identification.py --model_type [model_type] --corpus [corpus_path] --save_name [save_name] --vocab_size [vocab_size]

--model_type: Choose from 'bpe', 'unigram', 'wordpiece'
--corpus: Filepath of the corpus containing SMILES strings
--save_name: Filename for the output vocabulary file
--vocab_size: Desired size of the vocabulary
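
To illustrate what a subword model learns from a SMILES corpus, the sketch below runs a minimal byte-pair-encoding (BPE) merge loop in pure Python. It is not the code in word_identification.py: the function name learn_bpe_vocab and the num_merges parameter are hypothetical, and splitting SMILES into single characters is a simplification (it breaks two-letter atoms such as Cl and Br apart until a merge rejoins them).

```python
from collections import Counter


def learn_bpe_vocab(smiles_list, num_merges):
    """Learn up to `num_merges` BPE merges over character-level SMILES.

    Hypothetical helper for illustration only; returns the set of
    "chemical words" (single characters plus merged subwords).
    """
    # Each SMILES string starts as a tuple of single characters.
    corpus = Counter(tuple(s) for s in smiles_list)
    vocab = {ch for seq in corpus for ch in seq}

    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by sequence frequency.
        pair_counts = Counter()
        for seq, freq in corpus.items():
            for pair in zip(seq, seq[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break

        # Pick the most frequent pair; ties are broken lexicographically
        # so the result is deterministic.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merged = best[0] + best[1]
        vocab.add(merged)

        # Rewrite every sequence with the chosen pair merged.
        new_corpus = Counter()
        for seq, freq in corpus.items():
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and (seq[i], seq[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus

    return vocab
```

Calling learn_bpe_vocab(["CCO", "CCN", "CCC"], num_merges=1) adds the word "CC" to the single-character vocabulary, since the pair (C, C) is the most frequent. In this sketch the vocabulary size is controlled indirectly through the number of merges, whereas the script takes a target size via --vocab_size.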

Selecting key chemical words

Identify significant words in chemical documents using the specified vocabulary.

python highlighter.py --dataset [dataset_name] --vocabulary [vocabulary_name]

--dataset: Specify dataset name (e.g., 'lit_pcba', 'bdb', or others)
--vocabulary: Name or path of the vocabulary file
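
This README does not state how significance is scored. As an illustration only, the sketch below ranks words with TF-IDF, treating each chemical document as the list of chemical words from the ligands of one target. The function name top_words_by_tfidf, the input layout, and the choice of TF-IDF itself are assumptions, not a description of highlighter.py.

```python
import math
from collections import Counter


def top_words_by_tfidf(documents, k=3):
    """Rank the words of each chemical document by TF-IDF.

    `documents` maps a (hypothetical) target name to the list of
    chemical words obtained by tokenizing its ligands.
    """
    n_docs = len(documents)

    # Document frequency: in how many documents each word occurs.
    doc_freq = Counter()
    for words in documents.values():
        doc_freq.update(set(words))

    ranked = {}
    for name, words in documents.items():
        counts = Counter(words)
        total = len(words)
        # Term frequency times inverse document frequency.
        scores = {
            w: (c / total) * math.log(n_docs / doc_freq[w])
            for w, c in counts.items()
        }
        # Highest score first; ties broken alphabetically.
        ranked[name] = sorted(scores, key=lambda w: (-scores[w], w))[:k]
    return ranked
```

Words that occur in every document receive a score of zero, so words specific to one target's ligands rise to the top of that target's ranking.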

Computing chemical vocabulary statistics

Perform a comprehensive analysis of chemical words, deriving key statistics and insights.

python analyzer.py --dataset [dataset_name] --vocabulary [vocabulary_name]

--dataset: Choose the dataset (e.g., 'lit_pcba', 'bdb', or others)
--vocabulary: Name or path of the vocabulary file
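
The statistics reported by analyzer.py are not listed here. The sketch below shows the kind of summary one might compute for a chemical vocabulary: its size, mean word length, and how much of it is used in a tokenized dataset. The function name vocabulary_stats and the chosen statistics are assumptions for illustration, and the vocabulary is assumed to be non-empty.

```python
from collections import Counter
from statistics import mean


def vocabulary_stats(vocabulary, tokenized_molecules):
    """Summarize a chemical vocabulary against a tokenized dataset.

    `vocabulary` is a non-empty list of chemical words;
    `tokenized_molecules` is a list of word lists, one per molecule.
    """
    # How often each word appears across all molecules.
    usage = Counter(w for mol in tokenized_molecules for w in mol)
    used = [w for w in vocabulary if usage[w] > 0]
    return {
        "vocab_size": len(vocabulary),
        "mean_word_length": mean(len(w) for w in vocabulary),
        "used_fraction": len(used) / len(vocabulary),
        "most_common": usage.most_common(1)[0][0] if usage else None,
    }
```

For example, a vocabulary of ["C", "CC", "O", "N"] with molecules tokenized as [["CC", "O"], ["CC", "N"]] has a mean word length of 1.25, with three of its four words in use.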

Streamlit app

Launch an interactive Streamlit application illustrating the key chemical words for particular targets along with associated binders and drugs.

streamlit run app.py

Citation

@article{https://doi.org/10.1002/minf.202300249,
    author = {Temizer, Asu Busra and Uludoğan, Gökçe and Özçelik, Rıza and Koulani, Taha and Ozkirimli, Elif and Ulgen, Kutlu O. and Karali, Nilgun and Özgür, Arzucan},
    title = {Exploring Data-Driven Chemical SMILES Tokenization Approaches to Identify Key Protein-Ligand Binding Moieties},
    journal = {Molecular Informatics},
    doi = {10.1002/minf.202300249},
    url = {https://onlinelibrary.wiley.com/doi/abs/10.1002/minf.202300249},
    eprint = {https://onlinelibrary.wiley.com/doi/pdf/10.1002/minf.202300249},
}
