Merge pangenome graphs #68

Open
genomesandMGEs opened this issue Oct 6, 2021 · 5 comments

@genomesandMGEs

Hi there,

Is it possible to merge pangenome graphs from independent runs? I know panaroo has that option, and would like to know if it would be possible to do so with ppanggolin.
If not, could you please suggest alternatives for comparing the pangenomes of independent runs?

Thanks!

@axbazin
Member

axbazin commented Oct 6, 2021

Hi,

What are you trying to achieve through this comparison, exactly?
Is it, for example, to compare the gene families and their partitions between both pangenomes, and to know which family is persistent in both pangenomes, which is shell in one and persistent in the other, and so on?

Adelme

@genomesandMGEs
Author

Hey,

Thanks for the (super) quick reply!
Exactly, that's what I was thinking about.

@axbazin
Member

axbazin commented Oct 6, 2021

We do not have anything that directly implements a comparison between two pangenomes (for now); however, you can get that with some file comparisons.
Assuming you have the latest version installed, you can do the following:

First, get all family sequences for both pangenomes:

ppanggolin fasta --prot_families all -p pangenome_1.h5 -o prot_pangenome_1 
ppanggolin fasta --prot_families all -p pangenome_2.h5 -o prot_pangenome_2

Those commands will write a file 'all_protein_families.faa' in the output directory provided with -o.
Then, you can compare this file to the other pangenome:

ppanggolin align -p pangenome_1.h5 --proteins prot_pangenome_2/all_protein_families.faa -o align_prot_pang2_to_pang1
ppanggolin align -p pangenome_2.h5 --proteins prot_pangenome_1/all_protein_families.faa -o align_prot_pang1_to_pang2

You can provide --identity (default is 0.5) and --coverage (default is 0.8) thresholds for the comparison.
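As a rough sketch of how the steps above could be scripted, the Python helper below only assembles the `ppanggolin fasta` and `ppanggolin align` argument lists for one direction of the comparison. It uses only the flags quoted in this thread; the function name `comparison_commands`, the labels, and the output directory naming are illustrative assumptions, not part of ppanggolin.

```python
from pathlib import Path


def comparison_commands(reference_h5, query_h5, ref_label, query_label,
                        identity=0.5, coverage=0.8):
    """Build the commands that project the query pangenome's families
    onto the reference pangenome. Nothing is executed here.

    The default thresholds mirror the defaults quoted in the thread.
    """
    # Hypothetical naming scheme for the output directories.
    prot_dir = Path(f"prot_{query_label}")
    faa = (prot_dir / "all_protein_families.faa").as_posix()

    fasta_cmd = ["ppanggolin", "fasta", "--prot_families", "all",
                 "-p", str(query_h5), "-o", prot_dir.as_posix()]
    align_cmd = ["ppanggolin", "align", "-p", str(reference_h5),
                 "--proteins", faa,
                 "--identity", str(identity),
                 "--coverage", str(coverage),
                 "-o", f"align_{query_label}_to_{ref_label}"]
    return [fasta_cmd, align_cmd]


# Print the commands for projecting pangenome 2 onto pangenome 1.
for cmd in comparison_commands("pangenome_1.h5", "pangenome_2.h5",
                               "pang1", "pang2"):
    print(" ".join(cmd))
```

Each list could then be passed to `subprocess.run(cmd, check=True)`; calling the helper a second time with the pangenomes swapped gives the reverse direction.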
In each of your output directories, 'align_prot_pang2_to_pang1' and 'align_prot_pang1_to_pang2', you will get two files.
The first one, 'proteins_partition_projection.tsv', is tab-separated and looks like this:

[image: example of a 'proteins_partition_projection.tsv' file]

The first column indicates a family id from the faa file, and the second column indicates the partition of the most similar family in the pangenome it was compared to.
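A minimal Python sketch for loading that file could look like the following. It assumes exactly the layout described above (two tab-separated columns, family id then partition) and no header row; if your file has a header or extra columns, adjust accordingly. The function names are hypothetical.

```python
import csv
from collections import Counter


def read_projection(path):
    """Map each input family id (column 1) to the partition of its most
    similar family in the other pangenome (column 2)."""
    projection = {}
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            # Skip blank lines and comment lines (assumption: '#' marks comments).
            if len(row) < 2 or row[0].startswith("#"):
                continue
            projection[row[0]] = row[1]
    return projection


def partition_counts(projection):
    """Count how many input families were projected onto each partition."""
    return Counter(projection.values())
```

Calling `partition_counts(read_projection("align_prot_pang2_to_pang1/proteins_partition_projection.tsv"))` would then summarise how many families of pangenome 2 land on persistent, shell, or cloud families of pangenome 1.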

Alternatively, the 'input_to_pangenome_associations.blast-tab' file is an alignment file with BLAST-like results for the proteins-vs-pangenome alignment, which gives you family ids from both pangenomes directly (there can be multiple hits).
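To collect the family-to-family pairs from that file, something like the Python sketch below might work. It assumes the common BLAST tabular convention (column 1 is the query id, column 2 the target id, column 3 the sequence identity) and no header row; check this against your own file, since the exact column layout is not spelled out in this thread. The function name is hypothetical.

```python
from collections import defaultdict


def read_family_hits(lines, min_identity=0.0):
    """Group hits by query family id.

    `lines` is any iterable of tab-separated text lines, for example an
    open file handle. `min_identity` is compared in whatever unit the
    file uses (fraction or percent).
    """
    hits = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # not a hit line under the assumed layout
        query, target, identity = fields[0], fields[1], float(fields[2])
        if identity >= min_identity:
            hits[query].append((target, identity))
    return dict(hits)
```

Because one query family can have several hits, each value is a list of `(target family id, identity)` pairs rather than a single match.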

By comparing those files with the partitions of the families in their pangenome of origin, you should be able to get what you want, I believe?
If you have any questions or need me to clarify something, do not hesitate to ask!

Adelme

@genomesandMGEs
Author

Hey,

Thanks for the detailed explanation.

So, if I understood correctly, this approach will give you information about the family ids from pangenome 1 that match families in pangenome 2, right? But the classification in the 2nd column only lets you know that a given id is considered 'persistent' in pangenome 2, and it may not be so in pangenome 1?

Also, family ids not listed in column 1 of 'proteins_partition_projection.tsv' will represent families specific to pangenome 1, i.e. those that have no match in pangenome 2?

@axbazin
Member

axbazin commented Oct 6, 2021

Yes, absolutely, you are correct on all of your points.
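Put together, those points suggest a simple cross-tabulation, sketched below in Python. It assumes you already have two dictionaries: `own_partitions`, mapping each family id of pangenome 1 to its partition in pangenome 1 (however you export that), and `projection`, mapping family ids to the partition of their best match in pangenome 2, as read from 'proteins_partition_projection.tsv'. Both names, and the `no_match` label, are illustrative.

```python
from collections import Counter


def cross_tabulate(own_partitions, projection, unmatched_label="no_match"):
    """Count families by (partition in own pangenome, partition of the
    best match in the other pangenome).

    Families absent from `projection` had no match and are counted
    under `unmatched_label`.
    """
    table = Counter()
    for family, own in own_partitions.items():
        table[(own, projection.get(family, unmatched_label))] += 1
    return table


# Toy data, not real results.
own = {"f1": "persistent", "f2": "shell", "f3": "cloud"}
proj = {"f1": "persistent", "f2": "persistent"}
for (own_part, other_part), n in sorted(cross_tabulate(own, proj).items()):
    print(own_part, other_part, n)
```

A pair such as `("shell", "persistent")` then counts the families that are shell in one pangenome but match a persistent family in the other.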

If you want, you can play with the filters available in ppanggolin fasta, which can make your comparison simpler. For example:

ppanggolin fasta --prot_families persistent -p pangenome_1.h5 -o prot_pangenome_1 

to write only the persistent gene families (in a file called 'persistent_protein_families.faa'). You can do this with any partition; the filename will change accordingly.
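If you want to list which families of such a file have no match at all, a small Python sketch along these lines may help. It assumes the output filename follows the pattern `<partition>_protein_families.faa` (confirmed here only for 'all' and 'persistent') and that the family id is the first whitespace-delimited token of each FASTA header; all function names are hypothetical.

```python
def expected_fasta_name(partition):
    """Guess the file written by `ppanggolin fasta --prot_families <partition>`.

    Assumption: the pattern seen for 'all' and 'persistent' holds for
    other partitions too.
    """
    return f"{partition}_protein_families.faa"


def fasta_ids(path):
    """Return the sequence ids of a FASTA file, in file order."""
    ids = []
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                tokens = line[1:].split()
                if tokens:
                    ids.append(tokens[0])
    return ids


def families_without_match(ids, projection):
    """Ids that never appear in column 1 of the projection file, i.e.
    families with no match in the other pangenome."""
    return [family for family in ids if family not in projection]
```

Here `projection` is any mapping or set built from column 1 of 'proteins_partition_projection.tsv'.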

Adelme
