Metagenomic-DeepFRI

This repository is for portfolio purposes only.

For the currently maintained version, go to the Małopolskie Centrum Biotechnologii repository.

About The Project

Do you have thousands of protein sequences with unknown structures, but still want to know their molecular function, biological process, cellular component and enzyme commission as predicted by the DeepFRI Graph Convolutional Network?

This is the right project for this task! Pipeline in a nutshell:

Search for similar target protein sequences using MMseqs2
Align the target protein's contact map to fit your query protein, whose structure is unknown
Run predictions on the query sequence combined with the aligned target contact map, or on the sequence alone if no alignment was found
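
The last step can be sketched as a small decision function. This is only an illustration, not the code in metagenomic_deepfri.py: the function name and the returned dictionary are hypothetical, and pairing the contact-map input with 'GCN' and the sequence-only input with 'CNN' is an assumption based on the model types listed in the Results section.

```python
from typing import Optional, Tuple


def choose_model_input(query_seq: str, aligned_contact_map: Optional[list]) -> Tuple[str, dict]:
    """Pick the model type and its inputs for one query protein (hypothetical helper)."""
    if aligned_contact_map is not None:
        # A similar target structure was found and aligned: use sequence + contact map.
        return "GCN", {"sequence": query_seq, "contact_map": aligned_contact_map}
    # No alignment was found: fall back to the sequence alone.
    return "CNN", {"sequence": query_seq}
```

With this split, a run in which every query is aligned (or none is) produces results for only one model type.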

Built With

Installation

Local

Setup python environment
```
pip install .  
```
Install mmseqs2
```
sudo apt install mmseqs2
```

Install boost libraries

sudo apt-get install libboost-numpy1.71 libboost-python1.71

(optional) Edit CONFIG/FOLDER_STRUCTURE.py to customize your folder structure
```
nano CONFIG/FOLDER_STRUCTURE.py
```
Run the post_setup.py script to create the folder structure according to FOLDER_STRUCTURE.py and to download and unzip the DeepFRI model weights
```
python post_setup.py
```

Docker

Create YOUR_DATA_ROOT directory on your local machine
```
mkdir /YOUR_DATA_ROOT
```
Run the container. -u $(id -u):$(id -g) makes sure that all files created by the pipeline are owned by your host user and stay accessible outside the container
```
docker run -it -u $(id -u):$(id -g) -v /YOUR_DATA_ROOT:/data soliareofastora/metagenomic-deepfri
```
Inside the container, run the post_setup.py script to create the folder structure and unzip the DeepFRI model weights
```
python post_setup.py
```

TL;DR! QUICK START

Upload structure files, for example from PDB, to STRUCTURE_FILES_PATH (paths are defined in CONFIG/FOLDER_STRUCTURE.py)

Create target database

python update_target_mmseqs_database.py --input all

Upload protein sequence .faa files into QUERY_PATH
Run main_pipeline.py.
```
python main_pipeline.py --input all
```
Collect results from FINISHED_PATH

How this pipeline works

Projects, tasks and timestamps

The pipeline is built around the folder structure described in CONFIG/FOLDER_STRUCTURE.py.

Multiple teams can keep separate projects, which are subdirectories inside STRUCTURE_FILES_PATH, QUERY_PATH, WORK_PATH and FINISHED_PATH. Run main_pipeline.py or update_target_mmseqs_database.py with --project_name to control which files a run touches.

If --project_name is not specified, the pipeline uses default as the project name.

A task is a single run of main_pipeline.py. The task name is the timestamp taken at the beginning of the script run. Its path is WORK_PATH / project_name / timestamp. After completion, results are stored in FINISHED_PATH / project_name / timestamp.
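
As a rough sketch, the two task paths can be composed with pathlib. The root directories and the timestamp format below are assumptions made for illustration; the real values are defined in CONFIG/FOLDER_STRUCTURE.py and main_pipeline.py.

```python
from datetime import datetime
from pathlib import Path

# Hypothetical roots; the real ones are defined in CONFIG/FOLDER_STRUCTURE.py.
WORK_PATH = Path("/data/workspace")
FINISHED_PATH = Path("/data/finished")


def task_paths(project_name="default", now=None):
    """Return (work_dir, finished_dir) for a task started at `now`."""
    # Assumed timestamp format, chosen so that folder names sort chronologically.
    timestamp = (now or datetime.now()).strftime("%Y-%m-%d_%H-%M-%S")
    work_dir = WORK_PATH / project_name / timestamp
    finished_dir = FINISHED_PATH / project_name / timestamp
    return work_dir, finished_dir
```

Both directories share the same project name and timestamp, so a finished task can always be traced back to its workspace.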

Target database creation with update_target_mmseqs_database.py works in a similar fashion: it appends new structures to MMSEQS_DATABASES_PATH / project_name and creates a new timestamp folder. The pipeline uses the database that was created most recently. TODO: feature to use a specific target database timestamp instead of name + the newest timestamp.
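
Picking the most recently created database could look like the sketch below. The function name is hypothetical, and it assumes that timestamp folder names sort lexicographically in creation order.

```python
import tempfile
from pathlib import Path
from typing import Optional


def newest_target_db(mmseqs_databases_path: Path, project_name: str) -> Optional[Path]:
    """Return the newest timestamp folder of a project's target database, if any."""
    project_dir = mmseqs_databases_path / project_name
    if not project_dir.is_dir():
        return None
    # Assumption: folder names are timestamps that sort in creation order.
    timestamps = sorted(p for p in project_dir.iterdir() if p.is_dir())
    return timestamps[-1] if timestamps else None


def demo() -> str:
    """Create three fake timestamp folders and return the name that gets picked."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        for ts in ("2022-01-01_00-00-00", "2022-03-01_00-00-00", "2022-02-01_00-00-00"):
            (root / "default" / ts).mkdir(parents=True)
        return newest_target_db(root, "default").name
```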

When main_pipeline.py is run with a new project name, the current state of CONFIG/RUNTIME_PARAMETERS.py is saved in WORK_PATH / project_name / project_config.json and is used in all upcoming tasks in this project.

Similarly, update_target_mmseqs_database.py stores MAX_TARGET_CHAIN_LENGTH inside target_db_config.json.
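
The "save once, reuse afterwards" behaviour can be sketched as follows. The helper name and the parameter values are made up for illustration; only the file name project_config.json and the MAX_TARGET_CHAIN_LENGTH key come from this README.

```python
import json
import tempfile
from pathlib import Path


def load_or_create_config(config_path: Path, current_parameters: dict) -> dict:
    """Snapshot parameters on first use; later calls return the saved snapshot."""
    if config_path.exists():
        # Existing project: the saved snapshot wins over the current parameters.
        return json.loads(config_path.read_text())
    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps(current_parameters, indent=4))
    return dict(current_parameters)


def demo() -> dict:
    """Run two tasks in one project, editing the parameters in between."""
    with tempfile.TemporaryDirectory() as tmp:
        config_path = Path(tmp) / "default" / "project_config.json"
        load_or_create_config(config_path, {"MAX_TARGET_CHAIN_LENGTH": 1000})
        # The second task still sees the value saved by the first one.
        return load_or_create_config(config_path, {"MAX_TARGET_CHAIN_LENGTH": 2500})
```

The practical consequence is that editing CONFIG/RUNTIME_PARAMETERS.py affects new projects only, not projects that already have a saved config.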

Mmseqs2 target database setup

Upload structure files to STRUCTURE_FILES_PATH / your_project_name.

Run update_target_mmseqs_database.py script.

python update_target_mmseqs_database.py --project_name your_project_name

The main feature of this project is its ability to generate a query contact map on the fly, using the results of an mmseqs2 search of the target database for similar protein sequences with known structures. Later, in metagenomic_deepfri.py, the target contact map is aligned to the query and used as input to the DeepFRI GCN (implemented in CPP_lib/load_contact_maps.h).
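
To make the idea concrete, here is a minimal pure-Python sketch of building a contact map from atom positions and transferring it onto a query through an alignment. It is not a port of CPP_lib/load_contact_maps.h: the function names, the 10 angstrom distance threshold and the handling of unaligned residues are all assumptions.

```python
import math


def contact_map(positions, threshold=10.0):
    """Binary contact map: 1 where two residues are closer than `threshold`."""
    n = len(positions)
    return [[1 if math.dist(positions[i], positions[j]) < threshold else 0
             for j in range(n)] for i in range(n)]


def transfer_contact_map(target_map, query_to_target, query_length):
    """Build a query contact map from a target map and an alignment.

    `query_to_target` maps a query residue index to its aligned target
    residue index; unaligned query residues are absent from the dict.
    """
    query_map = [[0] * query_length for _ in range(query_length)]
    for i in range(query_length):
        query_map[i][i] = 1  # every residue is in contact with itself
        # Assumption: neighbouring residues stay connected across alignment gaps.
        if i + 1 < query_length:
            query_map[i][i + 1] = query_map[i + 1][i] = 1
    for qi, ti in query_to_target.items():
        for qj, tj in query_to_target.items():
            if target_map[ti][tj]:
                query_map[qi][qj] = 1
    return query_map
```

Contacts are copied only between pairs of aligned residues, so the quality of the query contact map depends directly on how much of the query the alignment covers.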

The update_target_mmseqs_database.py script searches for structure files, processes them and stores each protein chain's sequence and atom positions inside SEQ_ATOMS_DATASET_PATH / project_name. It also creates an mmseqs2 database in MMSEQS_DATABASES_PATH / project_name. This operation appends new structures to the existing ones.

You can also use the --input DIR_1 FILE_2 ... argument list to parse structures from multiple sources. Paths can be absolute or relative to STRUCTURE_FILES_PATH. Use --input . to parse all structure files inside STRUCTURE_FILES_PATH. Accepted formats are .pdb, .cif and .ent, either raw or gzip-compressed (.gz).
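
A sketch of how such an --input list could be resolved into structure files is shown below. The function names are hypothetical, and searching directories recursively is an assumption; the actual lookup lives in update_target_mmseqs_database.py.

```python
from pathlib import Path

STRUCTURE_EXTENSIONS = (".pdb", ".cif", ".ent")


def is_structure_file(path: Path) -> bool:
    """Accept .pdb, .cif and .ent files, raw or compressed with .gz."""
    name = path.name.lower()
    if name.endswith(".gz"):
        name = name[:-3]  # look at the extension under the compression suffix
    return name.endswith(STRUCTURE_EXTENSIONS)


def find_structure_files(inputs, structure_files_path: Path):
    """Resolve files and directories, absolute or relative to structure_files_path."""
    found = []
    for item in inputs:
        path = Path(item)
        if not path.is_absolute():
            path = structure_files_path / path
        if path.is_dir():
            # Assumption: directories are searched recursively.
            found.extend(p for p in sorted(path.rglob("*"))
                         if p.is_file() and is_structure_file(p))
        elif path.is_file() and is_structure_file(path):
            found.append(path)
    return found
```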

To add another structure file format, edit STRUCTURE_FILES_PARSERS inside update_target_mmseqs_database.py.

target_db_config.json contains MAX_TARGET_CHAIN_LENGTH. This value is copied from CONFIG/RUNTIME_PARAMETERS.py when a new target database is created.

The protein ID is used as a filename. A new protein whose ID already exists in the database is skipped. Use the --overwrite flag to overwrite existing sequences and atom positions. Also use this flag if you want to apply changes to MAX_TARGET_CHAIN_LENGTH inside target_db_config.json.
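
The skip-or-overwrite rule amounts to a simple filter over protein IDs. The helper below is a hypothetical illustration of that rule, not the script's actual code.

```python
def select_structures_to_process(protein_ids, existing_ids, overwrite=False):
    """Return the protein IDs that should be (re)processed."""
    if overwrite:
        # --overwrite: every input structure is processed again.
        return list(protein_ids)
    existing = set(existing_ids)
    # Default: proteins already present in the database are skipped.
    return [pid for pid in protein_ids if pid not in existing]
```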

Running main pipeline

Upload .faa files into QUERY_PATH / your_project_name (default project_name is default)

Run main_pipeline.py

python main_pipeline.py --project_name your_project_name

Upon completion, collect results from FINISHED_PATH / your_project_name / timestamp

The pipeline first looks for a target database named after project_name. If it is missing, the default target database is used instead. To use a different target database, pass its name (the project_name used during database creation) in --target_db_name.
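
The lookup order can be sketched as below. The function is hypothetical, and raising FileNotFoundError when nothing matches is an assumption about error handling, not documented behaviour.

```python
def resolve_target_db_name(project_name, available_dbs, target_db_name=None):
    """Choose the target database: explicit name, then project name, then default."""
    if target_db_name is not None:
        if target_db_name not in available_dbs:
            raise FileNotFoundError(f"target database {target_db_name!r} not found")
        return target_db_name
    if project_name in available_dbs:
        return project_name
    if "default" in available_dbs:
        return "default"
    raise FileNotFoundError("no target database found for this project")
```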

You can use the --input DIR_1 FILE_2 ... argument list to process query .faa files from multiple sources. Paths can be absolute or relative to QUERY_PATH. Use --input . to process all query .faa files inside QUERY_PATH.

--delete_query deletes the source query files from the input paths after they have been copied to the project workspace.

--n_parallel_jobs will divide query protein sequences evenly across all jobs.
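
One way to divide sequences evenly is to give every job the same base share and spread the remainder over the first jobs. This is a sketch of that idea, not necessarily how main_pipeline.py splits its input.

```python
def split_evenly(items, n_jobs):
    """Split items into at most n_jobs contiguous chunks whose sizes differ by at most 1."""
    n_jobs = max(1, min(n_jobs, len(items) or 1))  # never create empty extra jobs
    base, extra = divmod(len(items), n_jobs)
    chunks, start = [], 0
    for job in range(n_jobs):
        size = base + (1 if job < extra else 0)  # first `extra` jobs get one more item
        chunks.append(items[start:start + size])
        start += size
    return chunks
```

For example, ten sequences split across three jobs give chunks of 4, 3 and 3 sequences.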

Results

Finished folder FINISHED_PATH / project_name / timestamp will contain:

query_files/* - directory containing all input query files.
mmseqs2_search_results.m8
alignments.json - results of the alignment search implemented in utils/search_alignments.py
metadata* - files with some useful info
results* - multiple files from DeepFRI, organized by model type ['GCN' / 'CNN'] and mode ['mf', 'bp', 'cc', 'ec'], for a total of 8 files. Results from one model can be missing, which means that either all query protein sequences were aligned or none of them were.
```
mf = molecular_function
bp = biological_process
cc = cellular_component
ec = enzyme_commission
```
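
If you want to inspect mmseqs2_search_results.m8 yourself, a small reader like the one below may help. It assumes the file uses the default MMseqs2 tab-separated layout of 12 columns; if the pipeline writes a custom column set, the names below will not match. Choosing the best hit by bit score is also only an illustration.

```python
import csv

# Assumed default MMseqs2 .m8 column order.
M8_COLUMNS = ("query", "target", "identity", "alignment_length", "mismatches",
              "gap_openings", "q_start", "q_end", "t_start", "t_end",
              "e_value", "bit_score")
INT_COLUMNS = ("alignment_length", "mismatches", "gap_openings",
               "q_start", "q_end", "t_start", "t_end")
FLOAT_COLUMNS = ("identity", "e_value", "bit_score")


def read_m8(lines):
    """Parse .m8 lines into dicts, skipping rows without exactly 12 fields."""
    rows = []
    for fields in csv.reader(lines, delimiter="\t"):
        if len(fields) != len(M8_COLUMNS):
            continue
        row = dict(zip(M8_COLUMNS, fields))
        for name in INT_COLUMNS:
            row[name] = int(row[name])
        for name in FLOAT_COLUMNS:
            row[name] = float(row[name])
        rows.append(row)
    return rows


def best_hit_per_query(rows):
    """Keep the hit with the highest bit score for every query ID."""
    best = {}
    for row in rows:
        current = best.get(row["query"])
        if current is None or row["bit_score"] > current["bit_score"]:
            best[row["query"]] = row
    return best
```

Because read_m8 accepts any iterable of lines, an open file handle can be passed to it directly.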

Contributing

If you have a suggestion that would make this project better, email me or fork the repo and create a pull request.

TODO

main_pipeline.py: add the possibility to use a specific target_database path and timestamp instead of the name only
utils/search_alignments.py: run some runtime tests; maybe chunked sequences will perform better with pathos.multiprocessing
update_target_mmseqs_database.py: add a max_target_chain_length argument and inform the user if this argument differs from the existing target_db_config.json
update_target_mmseqs_database.py: when adding already processed structures to another project, check if they already exist somewhere

Contact

Piotr Kucharski - soliareofastorauj@gmail.com
