Inquiry about the coding density equation #281

Open
AhmedElsherbini opened this issue Apr 8, 2024 · 4 comments
Labels
help wanted Extra attention is needed

Comments

@AhmedElsherbini

Hi Oliver,

I hope you are doing well.

This is just an inquiry about the coding density equation.

As background: I am comparing two sister species (the first is a successful commensal, the second a rare pathogen). The coding density stats from Bakta show a significant difference of ~2 % between them. I have only a few long-read assemblies (<10 genomes), compared with many short-read assemblies (>80 genomes), for both species.
The thing is, in the long-read assemblies of both species I see no significant difference in genome length, and the same holds for the CDS.
My only hint is that the species with the lower coding density is highly enriched in mobilome (IS elements), which could be affecting this calculation.

Note that the rare species, which has the lower coding density, has 20 pseudogenes on average, compared with 8 in its sister species.

As for unique gene extraction, using the default parameters in Panaroo and PPanGGOLiN, I see ~113 proteins unique to the first species and 80 unique to the second, which is hard for me to relate to the 2 % difference in coding density.

Thank you in advance.

Best,
Ahmed

@oschwengers
Owner

Hi Ahmed,
Of course, short- vs. long-read sequencing can indeed have a strong impact on assembly lengths. Short-read draft assemblies in particular can suffer from too many contig edges, which makes it very difficult to detect all CDS close to, or even spanning, contig edges. The coding density is simply the sum of all bases that are part of an annotated genomic feature, divided by the genome length.
I hope this helps. Otherwise, can you elaborate a bit more?
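
To make the definition above concrete, here is a minimal Python sketch of "bases in annotated features divided by genome length". It is not Bakta's code: the function name `coding_density`, the `(contig_id, start, stop)` tuple layout and the `merge_overlaps` switch are all assumptions made for illustration. Whether overlapping features should be counted once or twice is also an assumption, so the sketch exposes both variants.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical feature layout: (contig_id, start, stop), 1-based and inclusive.
Feature = Tuple[str, int, int]


def coding_density(features: Iterable[Feature], genome_length: int,
                   merge_overlaps: bool = False) -> float:
    """Return annotated bases divided by genome length.

    merge_overlaps=False sums every feature length as-is, so bases shared by
    overlapping features are counted more than once.
    merge_overlaps=True counts each annotated base only once.
    """
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    features = list(features)
    if not merge_overlaps:
        return sum(stop - start + 1 for _, start, stop in features) / genome_length

    # Group intervals per contig, then merge overlapping intervals.
    by_contig: Dict[str, List[Tuple[int, int]]] = {}
    for contig, start, stop in features:
        by_contig.setdefault(contig, []).append((start, stop))

    covered = 0
    for intervals in by_contig.values():
        intervals.sort()
        cur_start, cur_stop = intervals[0]
        for start, stop in intervals[1:]:
            if start <= cur_stop:          # overlaps the current block
                cur_stop = max(cur_stop, stop)
            else:                          # gap: close the current block
                covered += cur_stop - cur_start + 1
                cur_start, cur_stop = start, stop
        covered += cur_stop - cur_start + 1
    return covered / genome_length


# Made-up toy annotation, not real data.
toy = [("c1", 1, 300), ("c1", 251, 400), ("c2", 1, 100)]
naive = coding_density(toy, genome_length=1000)
merged = coding_density(toy, genome_length=1000, merge_overlaps=True)
```

In this toy example the two variants differ only by the 50 bases shared between the two overlapping features on `c1`.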

@oschwengers oschwengers added the help wanted Extra attention is needed label Apr 19, 2024
@AhmedElsherbini
Author

Thank you Oliver for your response.

So summing up all the bases of the annotated genome means (roughly the summed length of all CDS) / (the summed length of all contigs), right?
If we focus on the long-read assemblies, where there is no statistical difference between the two species in CDS number or in total length, my guess is that the difference in coding density could come from longer CDS in the species with the higher density, or from pseudogenes/transposons (which are short anyway) in the less dense species. Does that make sense?

Best,
Ahmed
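
The reasoning in this comment can be written out as a short derivation. It is only a sketch under the CDS-only reading used in the comment; the symbols $n$ (number of CDS), $\bar{\ell}$ (mean CDS length), $G$ (total contig length) and $d$ (coding density) are introduced here for illustration.

$$
d = \frac{\sum_{i=1}^{n} \ell_i}{G} = \frac{n\,\bar{\ell}}{G}
$$

If two species A and B have the same $n$ and the same $G$, the difference in density can only come from the mean CDS length:

$$
\Delta d = d_A - d_B = \frac{n\,(\bar{\ell}_A - \bar{\ell}_B)}{G}
\quad\Longrightarrow\quad
\bar{\ell}_A - \bar{\ell}_B = \frac{\Delta d \cdot G}{n}
$$

As a purely hypothetical example (these numbers are not from the thread), $G = 2.5$ Mb, $n = 2300$ and $\Delta d = 0.02$ would correspond to $0.02 \times 2{,}500{,}000 = 50{,}000$ annotated bases in total, or about 22 bp per CDS on average.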

@oschwengers
Owner

Hi, the coding density is not limited to CDS but comprises all genomic features, e.g. non-coding RNA genes, regulatory elements, DNA motifs, etc.

Regarding your question: I guess in theory yes, that could be, but I'd be rather reluctant to use this kind of statistic. I'd rather directly compare the presence/absence of certain genes, etc.
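
Since the density covers all feature types and not only CDS, one way to see where a difference comes from is to split the annotated bases by feature type. The following is a minimal sketch, not Bakta's implementation: the function name `density_by_type`, the `(feature_type, start, stop)` tuple layout and the type labels are assumptions, and overlapping features are simply summed.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical feature layout: (feature_type, start, stop), 1-based and inclusive.
TypedFeature = Tuple[str, int, int]


def density_by_type(features: Iterable[TypedFeature],
                    genome_length: int) -> Dict[str, float]:
    """Return, per feature type, annotated bases divided by genome length."""
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    bases = defaultdict(int)
    for ftype, start, stop in features:
        bases[ftype] += stop - start + 1   # overlaps are not merged here
    return {ftype: n / genome_length for ftype, n in bases.items()}


# Made-up toy annotation, not real data.
toy = [
    ("cds", 1, 900),
    ("cds", 1001, 1800),
    ("tRNA", 1901, 1975),
    ("ncRNA", 2001, 2100),
]
per_type = density_by_type(toy, genome_length=2500)
total_density = sum(per_type.values())   # density over all feature types
cds_only_density = per_type["cds"]       # density if only CDS were counted
```

Comparing `per_type` between two genomes shows whether a gap in the overall value is driven by CDS or by other feature types.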

@AhmedElsherbini
Author

Absolutely, you are right.

I followed this up with tools like Panaroo and PPanGGOLiN for gene presence/absence comparison. The mobilome (insertion elements) is the main thing enriched in the sister species with the lower coding density.

I just wanted to investigate the cause of this coding density difference, as I get a lot of questions about this 2 % difference.
