PubChem

PubChem18 (used in the paper)

For the PubChem dataset used in the paper run the following commands, which downloads, unzips and deletes the zip-file:

wget -N -r https://cloud.ml.jku.at/s/2ybfLRXWSYb4DZN/download -O pubchem18.zip
unzip pubchem18.zip; rm pubchem18.zip

PubChem23 / up-to-date PubChem

To setup the PubChem pretraining dataset with the latest version call:

python clamp/dataset/prep_pubchem.py --data_dir=./data/pubchem23/

This downloads the PubChem-database and preprocesses it. You may manually delete the raw PubChem Folder (data_dir/ftp.ncbi.nlm.nih.gov) after running this command. Get some coffee, this will take quite some time ;)

For reproducibility, data is also available through

wget -N -r https://cloud.ml.jku.at/s/fi83oGMN2KTbsNQ/download -O pubchem23.zip
unzip pubchem23.zip
rm pubchem23.zip

Satistics

	PubChem23 (2023-02)	PubChem18	PubChem HTS	FSMOL (v1)
# measurements	245,234,081	223,219,241	143,886,653	501,366
# compounds	3,475,417	2,120,811	715,231	240,465
# assays	521,603	21,002	582	5,135
Mean # compounds / assay	462.69	10,628.48	247,227.93	96.32
Median # compounds / assay	2.00	35.00	304,804.00	46.00
% active	2.58	1.51	0.70	46.48
Mean % active per assay	80.57	79.46	1.04	47.17
Median % active per assay	100.00	100.00	0.42	48.84
Source	PubChem 2023	PubChem 2018	PubChem 2018	ChEMBL27
% of assays with only one class	95.54	74.54	1.37	0.00
% density	0.01	0.50	34.57	0.04

Compound and Assay Encodings

To compute the compound encodings as input for your model run

python clamp/dataset/encode_compound.py \
--compounds=./data/pubchem23/compound_names.parquet \
--compound2smiles=./data/pubchem23/compound_smiles.parquet \
--fp_type=morganc+rdkc --fp_size=8096

To compute the assay encodings as input for your model run

python clamp/dataset/encode_assay.py --encoding=clip

or use --encoding=lsa.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

pubchem.md

pubchem.md

PubChem

PubChem18 (used in the paper)

PubChem23 / up-to-date PubChem

Satistics

Compound and Assay Encodings

Files

pubchem.md

Latest commit

History

pubchem.md

File metadata and controls

PubChem

PubChem18 (used in the paper)

PubChem23 / up-to-date PubChem

Satistics

Compound and Assay Encodings