Expand cancer cell line and patient related datasets, e.g. DepMap, CCLE, TCGA #216

Open
abearab opened this issue Jan 3, 2024 · 5 comments
Labels: new-dataset (Request new dataset.), new-task (Request new task.)

Comments

@abearab
Contributor

abearab commented Jan 3, 2024

Describe the problem
To enable cancer research, I would like to suggest adding functionality to TDC for working with cancer cell line data. Newer DepMap releases introduce changes that break some current data-collection implementations, e.g. GilbertLabUCSF/CanDI#34, kevinhu/cancer_data#79.

Our previous work, CanDI, is a global cancer data integrator in Python that is used to harmonize and query datasets. Stable data from prior DepMap releases is deposited in Harvard Dataverse. I drafted some scripts to download this older but functional data, and we will update the scripts to work with newer DepMap releases.
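A minimal sketch of one way a loader could tolerate such changes, assuming the breakage comes at least partly from columns being renamed between releases. The alias table, the header spellings in it, and the `read_harmonized` helper are hypothetical and only illustrate the idea; they are not CanDI or TDC code, and the rows are toy values.

```python
import csv
import io

# Hypothetical alias table: canonical name -> header spellings that may appear
# in different releases. These spellings are assumptions for illustration and
# should be checked against the actual release files.
COLUMN_ALIASES = {
    "model_id": ("ModelID", "DepMap_ID"),
    "cell_line_name": ("StrippedCellLineName", "stripped_cell_line_name"),
}


def read_harmonized(handle):
    """Read a CSV and return its rows keyed by canonical column names."""
    reader = csv.DictReader(handle)
    headers = reader.fieldnames or []
    # Resolve each canonical name to whichever alias this file actually uses.
    resolved = {}
    for canonical, aliases in COLUMN_ALIASES.items():
        match = next((alias for alias in aliases if alias in headers), None)
        if match is None:
            raise KeyError(f"no known column for {canonical!r} in {headers}")
        resolved[canonical] = match
    return [{name: row[src] for name, src in resolved.items()} for row in reader]


# Toy rows standing in for an older and a newer release layout.
old_release = io.StringIO("DepMap_ID,stripped_cell_line_name\nID-1,LINE_A\n")
new_release = io.StringIO("ModelID,StrippedCellLineName\nID-1,LINE_A\n")

old_rows = read_harmonized(old_release)
new_rows = read_harmonized(new_release)
```

With this approach, downstream code only sees the canonical names, so supporting a new release layout means adding an alias instead of changing every query.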

TCGA data access can be even harder, although I just saw https://cloud.google.com/life-sciences/docs/resources/public-datasets/tcga

Describe the solution you'd like
A new data collection method would be very beneficial. It would be great to gather structured and harmonized data for cancer cell lines using TDC. You already have a tool for GDSC, so a similar approach for CCLE and DepMap would be very useful. gget is also planning something similar, which could become a joint effort: pachterlab/gget#121 (cc @lauraluebbert).

Additional context
See these links for CanDI's source code (https://github.com/GilbertLabUCSF/CanDI), docs, or manuscript.

This is an example of my analysis using TDC and CanDI – notebook | blog post | GilbertLabUCSF/Decitabine-treatment#5


other related issues: #191

@kexinhuang12345
Collaborator

Thanks for the issue! This sounds interesting. Would it make sense to add this as an additional dataset for the drug response prediction task? https://tdcommons.ai/multi_pred_tasks/drugres/ Or are you thinking of it more as an independent data function, as in https://tdcommons.ai/fct_overview/?

@abearab
Contributor Author

abearab commented Jan 4, 2024

Hi @kexinhuang12345, I think the DepMap and CCLE datasets are multi-modal readouts from different assays performed on cancer cells, and they are (or can be) used in many different tasks. So maybe this could be a "Data Processing" entry under "Data Functions"?

@abearab
Contributor Author

abearab commented Jan 11, 2024

Hi @kexinhuang12345 - quick question. Have you ever thought about including tasks that connect cancer cell lines to cancer patients? e.g. https://github.com/broadinstitute/celligner

@kexinhuang12345
Collaborator

kexinhuang12345 commented Jan 12, 2024 via email

kexinhuang12345 added the new-dataset and new-task labels Jan 13, 2024
@abearab
Contributor Author

abearab commented Jan 16, 2024

Hi @kexinhuang12345,

> Interesting! What is the relevant machine learning task formulation for it?

I think a wide range of ML tasks is possible with the CCLE and DepMap datasets; here are some examples:

> As for the data function/dataset for this cancer cell line data, I was thinking more about it and it seems like it may be a better fit as datasets, since data functions in general need to be applicable to multiple tasks and datasets rather than being dataset-specific.

Agreed.

> The functions that generate the datasets are definitely useful; we should store them in the data generation repo and reuse them, or even build them into the data loader for more diverse usage.

I guess I'm not aware of the "data generation repo". Let me know how I can help in this regard.

> What are your thoughts on this? Also, you mentioned multiple tasks; can you elaborate more on this?

In general, CCLE stands for Cancer Cell Line "Encyclopedia", so conceptually it is a well-established empirical resource for a diverse set of biological questions. Thus, these datasets are widely used for simple query tasks or more advanced ML tasks in the context of cancer cell biology.

> Happy to hop on a call to discuss more. Let me know, thanks!!

I'll send an email right after this, thank you.

abearab changed the title from "Cancer Cell Line "Datasets"" to "Expand cancer cell line and patient related datasets, e.g. DepMap, CCLE, TCGA" Apr 14, 2024
2 participants