When I look at the plastic_regions.tsv file, the number of genes in each RGP for each isolate is reported. Summing provides the total number of genes in RGPs for a given sample/isolate.
When I compare this total to the total number of genes per isolate written in the fasta file with genes in rgps, there seem to be some discrepancies.
For example, for a single sample, there are 656 genes reported in RGPs if I sum the 'genes' column in plastic_regions.tsv. When I extract all genes in RGPs using
$ ppanggolin fasta -p pangenome.h5 --output RGP_GENES --genes rgp
and use grep to get the number of genes with a locus that corresponds to my sample, I get 772 fasta headers.
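The comparison described above can be reproduced with a short script. This is only a minimal sketch on toy data: the `genes` column name comes from the issue, while the genome column name (`organism` here), the locus prefix `SA_`, and the function names are assumptions for illustration and may need adjusting to the actual files.

```python
import csv
import io


def sum_rgp_genes(tsv_text, genome, genome_col="organism", genes_col="genes"):
    """Sum the per-RGP gene counts for one genome.

    genome_col is an assumed column name; check the header of your own file.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return sum(int(row[genes_col]) for row in reader if row[genome_col] == genome)


def count_fasta_headers(fasta_text, locus_prefix):
    """Count FASTA headers whose identifier starts with locus_prefix."""
    return sum(
        1
        for line in fasta_text.splitlines()
        if line.startswith(">") and line[1:].startswith(locus_prefix)
    )


# Toy stand-ins for plastic_regions.tsv and the FASTA written by ppanggolin.
tsv_text = (
    "region\torganism\tgenes\n"
    "RGP_1\tsampleA\t3\n"
    "RGP_2\tsampleA\t2\n"
    "RGP_3\tsampleB\t4\n"
)
fasta_text = ">SA_0001\nATG\n>SA_0002\nATG\n>SB_0001\nATG\n"

rgp_total = sum_rgp_genes(tsv_text, "sampleA")
fasta_total = count_fasta_headers(fasta_text, "SA_")
print(f"genes summed from TSV: {rgp_total}, FASTA headers: {fasta_total}")
```

With real files, the two strings would be replaced by the contents of plastic_regions.tsv and the generated FASTA file.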
Is there supposed to be a 1 to 1 relationship between what is output when writing a fasta and the numbers reported in the plastic_regions.tsv file?
Note that when I use a Python script to count GFF entries between the coordinates of RGPs, I also do not get the same number as reported in the plastic_regions.tsv file, but it is much closer (usually off by 1 or 2 genes). This may be an error in the script, though it is accurate for some samples and inaccurate for others. Either way, the discrepancy between the number of fasta entries for a given genome and the count reported in the plastic_regions.tsv file remains.
How then does ppanggolin calculate the number of genes reported in the plastic_regions.tsv file?
The plastic_regions.tsv file provides the correct count. The discrepancy comes from the fasta file, which outputs all the genes across all families that have at least one gene in an RGP, so it also includes genes that are not themselves located in an RGP. We plan to resolve this issue in the forthcoming update.
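To illustrate why selecting by family inflates the count, here is a toy model of the behavior described in this reply. It is not PPanGGOLiN's code: the `Gene` tuple, the gene and family names, and both functions are hypothetical and exist only to contrast the two selection rules.

```python
from collections import namedtuple

# Hypothetical minimal gene record: which genome, which family, inside an RGP or not.
Gene = namedtuple("Gene", ["gene_id", "genome", "family", "in_rgp"])

genes = [
    Gene("A_1", "A", "fam1", True),
    Gene("A_2", "A", "fam2", True),
    Gene("A_3", "A", "fam1", False),  # paralog of A_1 located outside any RGP
    Gene("B_1", "B", "fam3", True),
    Gene("A_4", "A", "fam3", False),  # its family is in an RGP only in genome B
    Gene("A_5", "A", "fam4", False),  # family never found in an RGP
]


def genes_in_rgps(genes, genome):
    """Gene-level selection: only genes that are themselves inside an RGP."""
    return [g.gene_id for g in genes if g.genome == genome and g.in_rgp]


def genes_in_rgp_families(genes, genome):
    """Family-level selection: every gene whose family has any gene in an RGP."""
    rgp_families = {g.family for g in genes if g.in_rgp}
    return [g.gene_id for g in genes if g.genome == genome and g.family in rgp_families]


gene_level = genes_in_rgps(genes, "A")
family_level = genes_in_rgp_families(genes, "A")
print(len(gene_level), len(family_level))
```

In this toy example genome A has 2 genes inside RGPs, but family-level selection returns 4, because it also picks up a paralog outside the RGP and a gene whose family sits in an RGP only in another genome. The same mechanism would explain a FASTA count (772) larger than the TSV sum (656).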