Discrepancy between # of genes in RGPs? #122

Open
cizydorczyk opened this issue Jun 23, 2023 · 1 comment

@cizydorczyk

The plastic_regions.tsv file reports the number of genes in each RGP for each isolate. Summing these counts gives the total number of genes in RGPs for a given sample/isolate.

When I compare this total to the number of genes per isolate written to the fasta file of genes in RGPs, the two numbers do not match.

For example, for a single sample, there are 656 genes reported in RGPs if I sum the 'genes' column in plastic_regions.tsv. When I extract all genes in RGPs using

$ ppanggolin fasta -p pangenome.h5 --output RGP_GENES --genes rgp

and use grep to count the genes whose locus tag corresponds to my sample, I get 772 fasta headers.
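The comparison described above can be sketched in Python. This is only an illustration, not PPanGGOLiN code: the `genes` column is the one named in this issue, while the genome column name (`organism` here), the locus-tag prefix, and any file paths are assumptions to adjust to your own output.

```python
import csv
from collections import defaultdict


def genes_per_genome(tsv_path, genome_col="organism", genes_col="genes"):
    """Sum the per-RGP gene counts for each genome in plastic_regions.tsv.

    genome_col is an assumed column name; check the header of your file.
    """
    totals = defaultdict(int)
    with open(tsv_path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            totals[row[genome_col]] += int(row[genes_col])
    return dict(totals)


def count_fasta_headers(fasta_path, locus_prefix):
    """Count FASTA records whose identifier starts with locus_prefix.

    locus_prefix is a hypothetical per-sample locus-tag prefix.
    """
    count = 0
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">") and line[1:].startswith(locus_prefix):
                count += 1
    return count
```

If the two functions return different numbers for the same genome, the mismatch is in the files themselves rather than in the counting.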

Is there supposed to be a 1 to 1 relationship between what is output when writing a fasta and the numbers reported in the plastic_regions.tsv file?

Note that when I use a Python script to count GFF entries between the coordinates of each RGP, I also do not get the same number as reported in plastic_regions.tsv, although it is much closer (usually off by 1 or 2 genes). That may be an error in my script, but it is accurate for some samples and inaccurate for others. Either way, the discrepancy between the number of fasta entries for a given genome and the number reported in plastic_regions.tsv remains.
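A coordinate-based count of this kind might look like the sketch below. It is not the script used in this issue and not PPanGGOLiN's method. It assumes a GFF3 file with 1-based inclusive coordinates, counts only features that lie entirely inside the region, and counts only `CDS` features by default. Whether boundary genes and non-CDS features (for example tRNA or rRNA) are included are plausible, but unverified, sources of an off-by-1-or-2 difference.

```python
def count_features_in_region(gff_path, contig, start, stop, feature_types=("CDS",)):
    """Count GFF3 features on `contig` that lie entirely within [start, stop].

    GFF3 coordinates are 1-based and inclusive at both ends.
    """
    count = 0
    with open(gff_path) as handle:
        for line in handle:
            if line.startswith("##FASTA"):
                break  # embedded sequence section; no annotations after this
            if line.startswith("#") or not line.strip():
                continue
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 9:
                continue  # not a well-formed feature line
            seqid, _source, ftype, fstart, fend = fields[:5]
            if seqid != contig or ftype not in feature_types:
                continue
            if int(fstart) >= start and int(fend) <= stop:
                count += 1
    return count
```

Passing `feature_types=("CDS", "tRNA", "rRNA")` or widening the region by one gene on each side shows quickly whether either choice accounts for the difference.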

How then does ppanggolin calculate the number of genes reported in the plastic_regions.tsv file?

@ggautreau
Collaborator

Hi @cizydorczyk,

The plastic_regions.tsv file gives the correct count. The problem is in the fasta output: it currently writes every gene from every family that has at least one gene in an RGP, rather than only the genes located in RGPs. We plan to fix this in a forthcoming update.
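A toy example shows why selecting by family gives a larger number than selecting by gene. The gene names, family names, and data structures below are hypothetical and only illustrate the behaviour described above; they do not reflect PPanGGOLiN's internal code.

```python
def genes_in_rgps(rgp_genes):
    """Gene-level selection: only genes physically located in an RGP."""
    return set(rgp_genes)


def genes_in_rgp_families(gene_to_family, rgp_genes):
    """Family-level selection: every gene whose family has at least one gene in an RGP."""
    rgp_families = {gene_to_family[g] for g in rgp_genes}
    return {g for g, fam in gene_to_family.items() if fam in rgp_families}


# Hypothetical toy pangenome with two genomes, A and B.
gene_to_family = {
    "A_001": "fam1",  # located in an RGP of genome A
    "A_002": "fam1",  # same family as A_001, but outside any RGP
    "A_003": "fam2",  # outside any RGP in genome A
    "B_001": "fam2",  # located in an RGP of genome B
    "A_004": "fam3",  # family never found in an RGP
}
rgp_genes = ["A_001", "B_001"]

by_gene = {g for g in genes_in_rgps(rgp_genes) if g.startswith("A_")}
by_family = {g for g in genes_in_rgp_families(gene_to_family, rgp_genes) if g.startswith("A_")}

print(len(by_gene))    # 1
print(len(by_family))  # 3
```

For genome A, only one gene sits in an RGP, yet three genes are selected by family, which is the same direction of mismatch as 656 versus 772 in this issue.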

We appreciate your vigilance!

Best regards.

@ggautreau ggautreau added the bug label Jul 7, 2023
@ggautreau ggautreau self-assigned this Jul 7, 2023
ggautreau added a commit that referenced this issue Mar 26, 2024