
Open-source machine learning QSAR model built from public data or your own local data. The model uses molecular descriptors as the independent variables, bioactivity as the dependent variable, and random forest as the regression method.

quantaosun/QSAR-COVID-19


You could consider deploying this repository as an app with Streamlit.

A similar app can be built as described in https://github.com/dataprofessor/bioactivity-prediction-app

QSAR for COVID-19

Around 60% of the notebook code, especially the key method for calculating molecular descriptors, is borrowed from https://github.com/dataprofessor, as credited inside the notebooks.

Around 40% of the notebook code is written from scratch, especially the local version. If you need references, consider the following:

The paper describing how to calculate molecular descriptors: https://peerj.com/articles/2322/

The ChEMBL database, from which the public data is obtained: https://www.ebi.ac.uk/chembl/

scikit-learn, used to build the regression model: https://scikit-learn.org/stable/

Prerequisites: Jupyter Notebook

Windows, Mac, or Linux all work as long as Jupyter Notebook is installed. If you have never used Jupyter Notebooks, this article is a good starting point: https://link.springer.com/protocol/10.1007/978-1-0716-0150-1_3

Download and use it

Windows: Download the ZIP file directly and extract it.

Mac or Linux (terminal):

git clone https://github.com/quantaosun/QSAR-COVID-19.git

Then use Jupyter Notebook to run the notebooks in order.

Performance

The model fits the training data well, but its performance on the held-out test set is poor.

[Figures: results with 100% of the data used as the training set; results with 80% of the data used as the training set; results on the 20% of the data used for testing]
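The gap between training and test performance can be put into numbers by holding out part of the data and scoring both parts separately. The sketch below is only an illustration: it assumes R² as the metric (the text above does not say which metric was used), and `r2_score` and `split_indices` are hypothetical helper names, not functions from the notebooks. The 139 samples match the number of molecules mentioned in the Note section.

```python
import random


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def split_indices(n_samples, test_fraction=0.2, seed=0):
    """Shuffle row indices and hold out `test_fraction` of them for testing."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = max(1, round(n_samples * test_fraction))
    # First n_test shuffled indices form the test set, the rest the training set.
    return indices[n_test:], indices[:n_test]


# 80/20 split of 139 molecules (assumed split ratio, as in the figures).
train_idx, test_idx = split_indices(139)
print(len(train_idx), len(test_idx))
```

A high R² on the training rows together with a much lower R² on the held-out rows is the usual sign of overfitting, which is what the sentence above describes.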

What it does

As of January 9th, 2022, the pandemic has lasted almost two years, with an unprecedented number of people infected. There has also been a great increase in related research data, such as chemical compounds that could potentially inhibit the virus. At the time of writing, around 14,355 small-molecule bioactivities have been recorded in the public ChEMBL database.

A fundamental question, then, is whether we can build a QSAR model from all these data. This is precisely what this project tries to do. Not all bioactivities are comparable to each other, so this QSAR model will not be perfect, but it will give you a sense of the progress in developing small-molecule inhibitors against COVID-19.

The QSAR model takes molecular descriptors as the independent variables and bioactivities as the dependent variable, and fits a random forest regression model to them. It can be built either from public ChEMBL bioactivity data or from your local bioactivity data for SARS-CoV-2 or, if you change the target, for any other target.
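As a rough illustration of the idea, the sketch below fits a scikit-learn `RandomForestRegressor` on a descriptor matrix and predicts activities for new molecules. Everything in it is a placeholder: the descriptor bits and activity values are randomly generated, and the matrix shape, hyperparameters, and variable names are assumptions rather than the settings used in the notebooks.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Placeholder inputs: 139 molecules x 50 binary descriptor bits (X) and one
# synthetic activity value per molecule (y). Real values come from the notebooks.
X = rng.integers(0, 2, size=(139, 50))
y = X[:, :5].sum(axis=1) + rng.normal(0.0, 0.3, size=139)

# Descriptors are the independent variables, bioactivity the dependent variable.
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X, y)

# Predict the activity of three new (here also synthetic) molecules.
X_new = rng.integers(0, 2, size=(3, 50))
predicted = model.predict(X_new)
print(predicted)
```

New molecules must be described with the same descriptors, in the same column order, as the molecules used for training.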

1. Starting from public data

How to use it

Run 1_public.ipynb, 2_build_public, 3_build_public, and 5_build_all in sequence to build a QSAR model, then run 3_external_prediction and 5_external_prediction to predict unknown molecules.

Before running the external prediction, create a file called "unknow.txt" containing all the SMILES strings you want to predict or validate.
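One way to create that file is sketched below. The three SMILES strings are examples only (ethanol, benzene, aspirin), and writing one SMILES per line is an assumption here, so check the notebooks for the exact format they expect.

```python
from pathlib import Path

# Example SMILES only (ethanol, benzene, aspirin); replace with your own molecules.
smiles_to_predict = [
    "CCO",
    "c1ccccc1",
    "CC(=O)Oc1ccccc1C(=O)O",
]

# File name as given in the text; one SMILES per line is an assumption.
output_path = Path("unknow.txt")
output_path.write_text("\n".join(smiles_to_predict) + "\n", encoding="utf-8")
```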

2. Starting from your local data

How to use it

Run 1_local.ipynb, 3_build_local, and 5_build_all in sequence to build a QSAR model, then run 3_external_prediction and 5_external_prediction to predict unknown molecules.

Before running the local version, do the following:

  1. Prepare a text file that contains the SMILES strings of all the molecules. You can obtain them from ChemDraw or by any other means you prefer. Note that there are several variants of SMILES; the conventional one is used here. Put one string per line and save the file as "structures.txt"; see the attached example.
  2. Prepare another text file that contains all the bioactivities, one per line, in the same order as the molecules in the first file. Save it as "bioactivity.txt"; see the attached example.
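Because the two files are matched purely by line position, it can help to check that they line up before running the notebooks. The sketch below is a hypothetical helper, not part of the repository: it assumes numeric bioactivity values, skips blank lines, and uses made-up data written to a temporary folder for the demo.

```python
import tempfile
from pathlib import Path


def read_nonempty_lines(path):
    """Return the stripped, non-empty lines of a text file."""
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


def load_local_dataset(structures_path, bioactivity_path):
    """Pair the SMILES on line i with the bioactivity on line i."""
    smiles = read_nonempty_lines(structures_path)
    activities = [float(value) for value in read_nonempty_lines(bioactivity_path)]
    if len(smiles) != len(activities):
        raise ValueError(
            f"{len(smiles)} structures but {len(activities)} bioactivities"
        )
    return list(zip(smiles, activities))


# Demo with made-up values written to a temporary folder.
with tempfile.TemporaryDirectory() as folder:
    structures = Path(folder) / "structures.txt"
    bioactivity = Path(folder) / "bioactivity.txt"
    structures.write_text("CCO\nc1ccccc1\n", encoding="utf-8")
    bioactivity.write_text("1200\n350\n", encoding="utf-8")
    dataset = load_local_dataset(structures, bioactivity)

print(dataset)
```

A count mismatch raises an error early, which is easier to diagnose than a model silently trained on misaligned molecule and activity pairs.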

Before running the external prediction, create a file called "unknow.txt" containing all the SMILES strings you want to predict or validate.

Note

For the sake of this example, I used only 139 molecules with IC50 values, but thousands of other bioactivities are available, so check them out yourself and see whether you can improve the model performance. Alternatively, you can use Kd, Ki, or any other measure you like as the bioactivity. Just remember that you cannot build a single QSAR model from a mix of different bioactivity types.
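Keeping a dataset to a single bioactivity type can be done with a simple filter, sketched below. The records and their key names (`smiles`, `activity_type`, `value`) are made up for illustration and do not reflect the column names used in the notebooks.

```python
from collections import Counter

# Made-up records for illustration; the key names are hypothetical.
records = [
    {"smiles": "CCO", "activity_type": "IC50", "value": 1200.0},
    {"smiles": "c1ccccc1", "activity_type": "Ki", "value": 85.0},
    {"smiles": "CC(=O)O", "activity_type": "IC50", "value": 430.0},
    {"smiles": "CCN", "activity_type": "Kd", "value": 15.0},
]


def keep_single_type(records, activity_type):
    """Keep only the records measured with the given bioactivity type."""
    return [r for r in records if r["activity_type"] == activity_type]


# Count how many records exist per type, then keep only the IC50 ones.
type_counts = Counter(r["activity_type"] for r in records)
ic50_only = keep_single_type(records, "IC50")
print(type_counts)
print(len(ic50_only))
```

Counting the types first shows how much data each choice would leave for training.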
