CLI support for graph transformation 'pipelines' #195

Open
kamurani opened this issue Jul 18, 2022 · 2 comments
Labels
enhancement New feature or request

Comments

@kamurani
Contributor

I'm aiming to generate protein graphs in bulk and then perform unsupervised clustering on them. I would also like to repeat this process on several different proteomes.

I would also like to apply several intermediate steps to each graph (e.g. select the subgraph within radius r; select a subgraph by an RSA threshold).

So far, I have seen that ProteinGraphDataset takes a list of IDs (either UniProt or PDB accession codes) and downloads the corresponding structures from the PDB or AF2, and that the 'intermediate steps' can be achieved by supplying functions to the graph_transformation_funcs parameter.
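As a rough illustration of the two intermediate steps mentioned above, each step can be written as a function that takes a graph and returns a graph, with its parameters bound via functools.partial so that the steps can be collected in a list. This is only a sketch: the graph here is a plain dict standing in for a real graph object, and keep_within_radius, keep_rsa_at_least and apply_all are hypothetical names, not Graphein functions.

```python
import math
from functools import partial
from typing import Any, Callable, Dict, List

# Stand-in graph type (an assumption for this sketch): node ID -> attributes.
Graph = Dict[str, Dict[str, Any]]


def keep_within_radius(graph: Graph, centre: str, radius: float) -> Graph:
    """Keep nodes whose coordinates lie within `radius` of the centre node."""
    origin = graph[centre]["coords"]
    return {
        node: attrs
        for node, attrs in graph.items()
        if math.dist(attrs["coords"], origin) <= radius
    }


def keep_rsa_at_least(graph: Graph, threshold: float) -> Graph:
    """Keep nodes whose 'rsa' attribute is at least `threshold`."""
    return {
        node: attrs
        for node, attrs in graph.items()
        if attrs.get("rsa", 0.0) >= threshold
    }


def apply_all(graph: Graph, funcs: List[Callable[[Graph], Graph]]) -> Graph:
    """Apply each transformation in order, feeding the result to the next."""
    for func in funcs:
        graph = func(graph)
    return graph


# Parameters are bound up front so every step has the signature graph -> graph.
steps = [
    partial(keep_within_radius, centre="A", radius=5.0),
    partial(keep_rsa_at_least, threshold=0.2),
]
```

Binding parameters this way keeps every step interchangeable, which is the shape a list-of-functions parameter such as graph_transformation_funcs expects.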

However, I would like to use a subset of a proteome (a list of IDs) together with an existing set of .pdb files in a directory, rather than downloading them again. Could there be a more elegant solution for this, along the lines of the existing command line interface?

I was thinking that some sort of 'pipeline' could be written as a CLI command, perhaps by providing:

  • path to a file containing the list of protein IDs
  • path to a directory containing structures (also where new ones will be downloaded if required)
  • which database to use if UniProt IDs are given (e.g. swissprot or AF2)
  • path to a config.yml file for graph construction
  • path to a graph_processing.yml file detailing a list of functions to apply (e.g. subgraph selection)
  • output path for the graphs (with the option to specify a format, e.g. nx.Graph or pyg)
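The options listed above could be sketched with argparse roughly as follows. Everything here is hypothetical: the command name, the flag names, the allowed choices and the defaults are assumptions made for illustration and are not part of the existing graphein CLI.

```python
import argparse
from pathlib import Path
from typing import List


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for a hypothetical bulk graph-construction command."""
    parser = argparse.ArgumentParser(
        prog="graph-pipeline",  # hypothetical command name
        description="Sketch of a bulk protein graph pipeline CLI.",
    )
    parser.add_argument("--ids", type=Path, required=True,
                        help="File with one protein ID per line.")
    parser.add_argument("--structure-dir", type=Path, required=True,
                        help="Directory of structures; downloads also go here.")
    parser.add_argument("--database", choices=["swissprot", "af2"], default="af2",
                        help="Source to use when UniProt IDs are given.")
    parser.add_argument("--config", type=Path,
                        help="Path to config.yml for graph construction.")
    parser.add_argument("--processing", type=Path,
                        help="Path to graph_processing.yml listing transformations.")
    parser.add_argument("--output", type=Path, required=True,
                        help="Directory to write the graphs to.")
    parser.add_argument("--format", choices=["nx", "pyg"], default="nx",
                        help="Output graph format.")
    return parser


def parse_ids(text: str) -> List[str]:
    """Extract protein IDs from file contents, skipping blanks and '#' comments."""
    ids = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            ids.append(line)
    return ids
```

A call such as build_parser().parse_args(["--ids", "ids.txt", "--structure-dir", "structures", "--output", "graphs"]) would then yield a namespace that the rest of the pipeline can consume.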

This is just a naive idea for now, and I haven't fleshed out exactly how it would work. Perhaps 'transformations' could be described in a processing.yml file and parsed in a similar way to the ProteinGraphConfig parser?

I think a framework that lets people script pipelines (like the one I am trying to make) from the command line would make experimentation easier and simpler than writing everything in Python using the low-level functions.

Would appreciate any thoughts on this!

@a-r-j
Owner

a-r-j commented Jul 18, 2022

Hi @cimranm, great suggestion!

To address your immediate problem, I think you can try passing the filenames (without the extension) as the pdb_codes arg in ProteinGraphDataset. The download is only triggered if the files are not found in the DATA_DIR/raw directory, so if you place your PDBs there it should behave the way you want.
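A small helper along these lines could check which of the wanted IDs already have a structure file in the raw directory, before handing the list to the dataset. It is a sketch using only the standard library; split_by_availability is a hypothetical name, and it assumes each file is named after its ID with a .pdb extension.

```python
from pathlib import Path
from typing import List, Tuple


def split_by_availability(raw_dir: Path, wanted: List[str]) -> Tuple[List[str], List[str]]:
    """Split wanted IDs into those with a local .pdb file and those without.

    Assumes files are named '<ID>.pdb' directly inside raw_dir.
    """
    available = {path.stem for path in Path(raw_dir).glob("*.pdb")}
    present = [pid for pid in wanted if pid in available]
    missing = [pid for pid in wanted if pid not in available]
    return present, missing
```

The present list holds filenames without the extension, which is the form suggested above for the PDB code list of ProteinGraphDataset; the missing list shows which structures would still trigger a download.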

With respect to pipelining, I think this would be a great feature (and not too tricky to implement). It should be quite straightforward to write a parser that reads the transformation functions from a YAML file (see: https://stackoverflow.com/questions/67442071/passing-python-functions-from-yaml-file).
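One possible shape for such a parser is sketched below. It assumes the YAML file has already been loaded into a list of dicts (for example with a YAML library's safe loader), and that each entry has a "function" key holding a dotted import path plus an optional "kwargs" mapping. That schema, and the names resolve_callable and build_transforms, are assumptions for illustration only.

```python
import importlib
from functools import partial
from typing import Any, Callable, Dict, List


def resolve_callable(dotted_path: str) -> Callable[..., Any]:
    """Import 'package.module.name' and return the object called 'name'."""
    module_name, _, attr = dotted_path.rpartition(".")
    if not module_name:
        raise ValueError(f"Expected a dotted path, got {dotted_path!r}")
    module = importlib.import_module(module_name)
    func = getattr(module, attr)
    if not callable(func):
        raise TypeError(f"{dotted_path!r} is not callable")
    return func


def build_transforms(spec: List[Dict[str, Any]]) -> List[Callable[..., Any]]:
    """Turn config entries into callables, binding any keyword arguments."""
    transforms = []
    for entry in spec:
        func = resolve_callable(entry["function"])
        kwargs = entry.get("kwargs") or {}
        # Bind parameters now so each transform only needs the graph later.
        transforms.append(partial(func, **kwargs) if kwargs else func)
    return transforms


# Example spec, as it might look after loading a processing file.
# Built-in functions are used here purely so the sketch runs anywhere.
example_spec = [
    {"function": "builtins.sorted", "kwargs": {"reverse": True}},
    {"function": "builtins.len"},
]
```

Since this imports whatever path the file names, a real implementation would probably want to restrict the allowed modules to a known list rather than trusting arbitrary config input.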

I can provide some support and help implement some of this if you're keen to build this feature. I don't have the bandwidth at the moment to pick this up on my own though.

a-r-j added the enhancement (New feature or request) label on Jul 20, 2022
@kamurani
Contributor Author

Sure, I've already built something like this for my own use case, so I would be happy to figure out an elegant way to make it generalisable and add it to the graphein CLI. Will let you know if I get stuck!
