QSAR for COVID-19

You could consider deploying this repository as a Streamlit app

You can build a similar app by following the approach described at https://github.com/dataprofessor/bioactivity-prediction-app

QSAR for COVID-19

Around 60% of the notebook code, especially the key method for calculating molecular descriptors, is borrowed from https://github.com/dataprofessor, as acknowledged inside the notebooks.

Around 40% of the notebook code is written from scratch, especially the local version. If you need references, consider the following:

The paper describing how to calculate molecular descriptors: https://peerj.com/articles/2322/

The ChEMBL database, https://www.ebi.ac.uk/chembl/, from which the public data are obtained.

scikit-learn, https://scikit-learn.org/stable/, used to build the regression model.

Pre-conditions: Jupyter notebook

Windows, Mac, or Linux will all work as long as you have Jupyter Notebook installed. For people who have never used Jupyter Notebooks, here is a great article to get started: https://link.springer.com/protocol/10.1007/978-1-0716-0150-1_3.

Download and use it

Windows: Download the ZIP file and extract it.

Mac or a Unix-like terminal:

git clone https://github.com/quantaosun/QSAR-COVID-19.git

Then use Jupyter Notebook to run the notebooks in order.

Performance

Although the model fits the training data well, it performs poorly on the test set.

[Figures: results with 100% of the data as the training set; results with 80% as the training set; results on the 20% held out for testing]

What it does

As of January 9th 2022, the pandemic has lasted almost 2 years, with an unprecedented number of people infected. There has also been a great increase in related research data, such as chemical compounds that could potentially inhibit the virus. At the time of writing, around 14,355 bioactivities for small molecules have been recorded in the public ChEMBL database.

A fundamental question, then, is whether we can build a QSAR model from all these data. This is precisely what this project tries to do. Not all bioactivities are comparable to each other, so this QSAR model will not be perfect, but it will give you a sense of the progress of small-molecule inhibitor development against COVID-19.

The QSAR model takes molecular descriptors as independent variables and bioactivities as dependent variables, and fits them with a random forest regression model. It can be built from either public ChEMBL bioactivity data or your own local bioactivity data, for SARS-CoV-2 or, if you change the target, for any other target.
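As a rough illustration of the idea above, here is a minimal sketch that fits a scikit-learn random forest regressor on descriptor-like features and compares the fit on training and test data. The data are synthetic stand-ins (random binary "fingerprint" bits and made-up activity values), and the 80/20 split, the number of trees, and the random seeds are assumptions chosen for the example, not necessarily what the notebooks use.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption, for illustration only):
# 200 "molecules", each described by 50 binary fingerprint bits.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 50))
# Made-up activity values that depend on a few bits plus noise.
y = 5.0 + X[:, 0] + 0.5 * X[:, 1] - 0.8 * X[:, 2] + rng.normal(0, 0.3, size=200)

# Hold out 20% of the molecules for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Descriptors are the independent variables, activities the dependent variable.
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

r2_train = r2_score(y_train, model.predict(X_train))
r2_test = r2_score(y_test, model.predict(X_test))
print(f"R^2 on training set: {r2_train:.2f}")
print(f"R^2 on test set:     {r2_test:.2f}")
```

Comparing the two scores is a quick way to see whether a model that looks good on its own training data also generalizes to molecules it has not seen.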

1. Starting from public data

How to use it

Run 1_public.ipynb, 2_build_public, 3_build_public, and 5_build_all in sequence to build a QSAR model, then run 3_external_prediction and 5_external_prediction to predict unknown molecules.
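If you prefer not to open each notebook by hand, the sequence can be sketched as a small script. This is an alternative to the interactive workflow, not something the repository provides: it assumes the notebooks carry the .ipynb extension, that they run without manual input, and that the `jupyter nbconvert` command-line tool is installed. The helper names are hypothetical.

```python
import subprocess

# Notebook order for the public-data workflow (file extensions assumed).
PUBLIC_BUILD = [
    "1_public.ipynb",
    "2_build_public.ipynb",
    "3_build_public.ipynb",
    "5_build_all.ipynb",
]
PREDICT = ["3_external_prediction.ipynb", "5_external_prediction.ipynb"]


def nbconvert_command(notebook):
    """Build the command that executes one notebook in place."""
    return ["jupyter", "nbconvert", "--to", "notebook",
            "--execute", "--inplace", notebook]


def run_in_order(notebooks, dry_run=True):
    """Run (or, with dry_run=True, only print) the commands in sequence."""
    commands = [nbconvert_command(nb) for nb in notebooks]
    for command in commands:
        if dry_run:
            print(" ".join(command))
        else:
            # check=True stops the sequence at the first failing notebook.
            subprocess.run(command, check=True)
    return commands
```

Calling `run_in_order(PUBLIC_BUILD + PREDICT)` prints the commands without executing anything; pass `dry_run=False` to actually run them.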

Before you can do the external prediction, create a file called "unknow.txt" containing all the SMILES strings you want to predict or validate.

2. Starting from your local data

How to use it

Run 1_local.ipynb, 3_build_local, and 5_build_all in sequence to build a QSAR model, then run 3_external_prediction and 5_external_prediction to predict unknown molecules.

Before running the local version, prepare the following:

Prepare a text file that contains the SMILES strings of all the molecules, one per line. You can obtain them from ChemDraw or by any other means you prefer. Note that SMILES has several variants; the conventional one is used here. Save the file as "structures.txt"; see the attached example.
Prepare another text file that contains all the bioactivities, one per line, in the same order as the molecules in the first file. Save it as "bioactivity.txt"; see the attached example.
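Because the two files are matched purely by line position, it can help to check them before running the notebooks. The sketch below pairs each SMILES string with the value on the same line and fails early if the counts differ. The function names are hypothetical, and it assumes each bioactivity line holds a single plain number.

```python
from pathlib import Path


def read_lines(path):
    """Return the non-empty, stripped lines of a text file."""
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


def load_local_dataset(structures_path="structures.txt",
                       bioactivity_path="bioactivity.txt"):
    """Pair each SMILES string with the activity at the same position."""
    smiles = read_lines(structures_path)
    # Assumption: every bioactivity line is a single numeric value.
    activities = [float(value) for value in read_lines(bioactivity_path)]
    if len(smiles) != len(activities):
        raise ValueError(
            f"{len(smiles)} structures but {len(activities)} bioactivities; "
            "the two files must have the same number of entries"
        )
    return list(zip(smiles, activities))
```

Note that blank lines are skipped, so a blank line in the middle of only one file would shift the pairing; keeping both files free of blank lines avoids this.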

Before you can do the external prediction, create a file called "unknow.txt" containing all the SMILES strings you want to predict or validate.

Note

For the sake of this example, I only used 139 molecules with IC50 values, but thousands of other bioactivities are available, so explore them yourself and see if you can improve the model's performance. Alternatively, you can use Kd, Ki, or any other measure you like as the bioactivity; just remember that you can't build a QSAR model from a mixture of different bioactivity types.
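To make the point about not mixing types concrete, here is a small sketch that rejects a dataset containing more than one bioactivity type and converts concentrations to the negative log scale (IC50 to pIC50, Ki to pKi, and so on). It assumes the values are given in nanomolar, which may not match your data, and the function names are hypothetical.

```python
import math


def nm_to_p_scale(value_nm):
    """Return -log10 of a concentration in molar, given the value in nM."""
    if value_nm <= 0:
        raise ValueError("concentration must be positive")
    # -log10(value_nm * 1e-9) rewritten as 9 - log10(value_nm).
    return 9.0 - math.log10(value_nm)


def require_single_type(activity_types):
    """Return the one activity type present, or raise if types are mixed."""
    unique = set(activity_types)
    if len(unique) != 1:
        raise ValueError(
            f"expected exactly one activity type, found {sorted(unique)}"
        )
    return unique.pop()
```

For example, an IC50 of 1000 nM (1 µM) corresponds to a pIC50 of about 6, and `require_single_type(["IC50", "Ki"])` raises an error rather than letting the two types be modelled together.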

