Add pointers to data processing script & ChEMBL dataset update #135

Open
AlanHassen opened this issue Jan 13, 2022 · 2 comments
Labels: enhancement (New feature or request)

Comments

@AlanHassen

The overall question:
Is it possible to describe the preprocessing and data origin for different datasets?

Explanation:
I am currently looking into using ChEMBL via TDC. For reproducibility, however, it is essential to know which version of the dataset is provided here and how it was preprocessed (cleaning and so on). This matters especially because ChEMBL has recurring releases. In the source code (TDC/tdc/utils/load.py), the file is downloaded from "https://dataverse.harvard.edu/api/access/datafile/" without any explanation of where it comes from.

A solution:
Add a "Data origin and preprocessing" section to the documentation.

@kexinhuang12345
Collaborator

Hi Alan, thank you for raising this important point. We have a repo that tracks the preprocessing scripts for the majority of the datasets: https://github.com/kexinhuang12345/data_process. However, it is not cleaned up yet. I agree that good data provenance is important, and we will work towards it by linking to these processing scripts on the website.

As for ChEMBL, unfortunately the processing script seems to be missing. To address that, we plan to ship the most up-to-date ChEMBL version in the coming release and document the ChEMBL version on the website. If you have already used the current data, for now you could refer to it as the TDC version of ChEMBL to make things clear. Hope this helps!
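
One way to make "TDC version" concrete is to record the installed package version next to any results. The following is only a sketch under assumptions: the distribution name `PyTDC` and the helper names `installed_version` and `dataset_label` are illustrative and not taken from this thread.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name: str) -> str:
    """Return the installed version of a distribution, or "unknown" if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return "unknown"


def dataset_label(dataset: str, dist_name: str = "PyTDC") -> str:
    """Build a label such as 'ChEMBL (TDC <version>)' for reports and logs."""
    # "PyTDC" is an assumed distribution name; change it if TDC was
    # installed under a different name in your environment.
    return f"{dataset} (TDC {installed_version(dist_name)})"


if __name__ == "__main__":
    print(dataset_label("ChEMBL"))
```

Storing this label with each experiment keeps the dataset reference unambiguous even when the upstream ChEMBL release is not documented.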

kexinhuang12345 added the enhancement (New feature or request) label on Jan 15, 2022
kexinhuang12345 changed the title from "Is it possible to describe the preprocessing and data origin for different datasets?" to "Add pointers to data processing script & ChEMBL dataset update" on Jan 15, 2022
kexinhuang12345 self-assigned this on Jan 23, 2022
@kexinhuang12345
Collaborator

Hi, ChEMBL V29 is now available as of TDC release 0.3.5. You can load it via:

# Load the ChEMBL V29 molecule-generation dataset (requires TDC >= 0.3.5)
from tdc.generation import MolGen
data = MolGen(name='ChEMBL_V29')
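
For reproducibility, it may also help to fingerprint the molecules that were actually loaded, so that two runs can be checked against each other regardless of how the file was obtained. This is a minimal sketch, not part of TDC: `fingerprint_smiles` is a hypothetical helper, and the idea that the loaded data can be read as a `smiles` column via `data.get_data()` is an assumption not confirmed in this thread.

```python
import hashlib
from typing import Iterable


def fingerprint_smiles(smiles: Iterable[str]) -> str:
    """Return a SHA-256 hex digest over the sorted, de-duplicated strings."""
    digest = hashlib.sha256()
    for s in sorted(set(smiles)):
        digest.update(s.encode("utf-8"))
        digest.update(b"\n")  # separator so ["ab", "c"] and ["a", "bc"] differ
    return digest.hexdigest()


if __name__ == "__main__":
    # Toy input; with TDC one might pass data.get_data()["smiles"] instead
    # (call and column name are assumptions, see the note above).
    print(fingerprint_smiles(["CCO", "c1ccccc1", "CCO"]))
```

Because the input is sorted and de-duplicated first, the digest does not depend on row order or repeated entries, only on the set of strings.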
