Discrepancy between # of genes in RGPs? #122

cizydorczyk · 2023-06-23T22:57:04Z

The plastic_regions.tsv file reports the number of genes in each RGP for each isolate. Summing these values gives the total number of genes in RGPs for a given sample/isolate.

When I compare this total to the number of genes for the same isolate in the FASTA file of genes in RGPs, there seem to be some discrepancies.

For example, for a single sample, there are 656 genes reported in RGPs if I sum the 'genes' column in plastic_regions.tsv. When I extract all genes in RGPs using

$ ppanggolin fasta -p pangenome.h5 --output RGP_GENES --genes rgp

and use grep to get the number of genes with a locus that corresponds to my sample, I get 772 fasta headers.
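The comparison described above can be sketched in Python. This is only an illustrative sketch, not PPanGGOLiN code: the `organism` column name and the assumption that a locus-tag prefix identifies the sample are hypothetical, and only the `genes` column is named in this issue.

```python
import csv
import io
from collections import Counter


def genes_per_genome(tsv_text, genome_col="organism", genes_col="genes"):
    """Sum the per-RGP gene counts for each genome in a plastic_regions.tsv-style table.

    The genome column name is an assumption; adjust it to the real header.
    """
    totals = Counter()
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        totals[row[genome_col]] += int(row[genes_col])
    return dict(totals)


def count_fasta_headers(fasta_text, locus_prefix):
    """Count FASTA headers whose identifier starts with the given locus prefix."""
    return sum(
        1
        for line in fasta_text.splitlines()
        if line.startswith(">") and line[1:].startswith(locus_prefix)
    )
```

If the two numbers for one sample differ, as with 656 versus 772 here, the FASTA file contains entries that are not counted in the table.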

Is there supposed to be a 1 to 1 relationship between what is output when writing a fasta and the numbers reported in the plastic_regions.tsv file?

Note that when I use a Python script to count GFF entries between the coordinates of each RGP, I also do not get the same number as reported in plastic_regions.tsv, but it is much closer (usually off by 1 or 2 genes). This may be an error in my script, although it is accurate for some samples and inaccurate for others. Either way, the discrepancy between the number of FASTA entries for a given genome and the count reported in plastic_regions.tsv remains.
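For reference, a coordinate-based count of this kind might look like the sketch below. It is not the script used above, and it makes assumptions that this thread does not confirm: that only `CDS` features should be counted, that a feature must lie entirely within the region, and that both boundaries are inclusive. Differences in any of these choices could plausibly account for small off-by-one or off-by-two gaps.

```python
def count_features_in_region(gff_text, contig, start, stop, feature_type="CDS"):
    """Count GFF3 features of one type lying entirely within [start, stop] on a contig.

    Coordinates are 1-based and inclusive, as in GFF3. The feature type and the
    'fully contained' rule are assumptions, not PPanGGOLiN's documented behaviour.
    """
    count = 0
    for line in gff_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments, and directives
        fields = line.split("\t")
        if len(fields) < 9:
            continue  # skip anything that is not a 9-column feature line
        seqid, _source, ftype, fstart, fstop = fields[:5]
        if (
            seqid == contig
            and ftype == feature_type
            and int(fstart) >= start
            and int(fstop) <= stop
        ):
            count += 1
    return count
```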

How then does ppanggolin calculate the number of genes reported in the plastic_regions.tsv file?


ggautreau · 2023-07-07T21:08:18Z

Hi @cizydorczyk,

The plastic_regions.tsv file provides the correct count. The problem lies with the FASTA output: instead of writing only the genes located in RGPs, it writes every gene from every gene family that has at least one gene in an RGP. We plan to fix this in a forthcoming update.
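A toy illustration of why selecting by family inflates the count is sketched below. It uses made-up gene and family names and does not reproduce PPanGGOLiN's internals.

```python
def genes_selected_by_family(gene_to_family, rgp_genes):
    """Return every gene whose family has at least one member in an RGP.

    This mimics the over-inclusive selection described above, on toy data.
    """
    rgp_families = {gene_to_family[gene] for gene in rgp_genes}
    return {gene for gene, family in gene_to_family.items() if family in rgp_families}


# Hypothetical data: family f1 has three members, only one of which is in an RGP.
gene_to_family = {"A_1": "f1", "A_2": "f2", "A_3": "f1", "B_1": "f1"}
rgp_genes = {"A_1"}

selected = genes_selected_by_family(gene_to_family, rgp_genes)
in_sample_a = {gene for gene in selected if gene.startswith("A_")}
print(len(rgp_genes), len(in_sample_a))  # RGP genes vs. genes written for sample A
```

In this toy case, sample A has one gene in an RGP, but the family-based selection also picks up `A_3`, a member of the same family located outside any RGP, so the per-sample header count exceeds the table count.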

We appreciate your vigilance!

Best regards.

ggautreau added the bug label Jul 7, 2023

ggautreau self-assigned this Jul 7, 2023

ggautreau added a commit that referenced this issue Mar 26, 2024

Fix issue #122

7b1eb66

JeanMainguy mentioned this issue Mar 26, 2024

Fix RGP fasta outputs #202

Merged
