Inquiry about the coding density equation #281
Comments
Hi Ahmed,
Thank you, Oliver, for your response. So summing up all the annotated bases of the genome amounts to roughly (the total length of all CDS) / (the total length of all contigs), right? Best,
Hi, the coding density is not limited to CDS but comprises all genomic features, e.g. non-coding RNA genes, regulatory elements, DNA motifs, etc. Regarding your question, I guess in theory yes, that could be, but I'd be rather reluctant to use this kind of statistic. I'd rather directly compare the presence/absence of certain genes, etc.
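To make the definition above concrete, here is a minimal sketch of a coding density computed over all annotated features rather than CDS alone. It is not Bakta's actual implementation: the function name `coding_density`, the `(contig_id, start, stop)` tuple layout with 1-based inclusive coordinates, and the choice to merge overlapping features so that no base is counted twice are all assumptions made for illustration. Features spanning the origin of a circular replicon are not handled.

```python
from collections import defaultdict


def coding_density(features, contig_lengths):
    """Fraction of bases covered by at least one annotated feature.

    features: iterable of (contig_id, start, stop), 1-based inclusive
              (hypothetical layout; any feature type counts, not only CDS).
    contig_lengths: dict mapping contig_id -> contig length in bp.
    """
    by_contig = defaultdict(list)
    for contig, start, stop in features:
        by_contig[contig].append((start, stop))

    covered = 0
    for intervals in by_contig.values():
        intervals.sort()
        cur_start, cur_stop = intervals[0]
        for start, stop in intervals[1:]:
            if start <= cur_stop + 1:
                # Overlapping or directly adjacent: extend the current block.
                cur_stop = max(cur_stop, stop)
            else:
                covered += cur_stop - cur_start + 1
                cur_start, cur_stop = start, stop
        covered += cur_stop - cur_start + 1

    total = sum(contig_lengths.values())
    return covered / total if total else 0.0


# Toy data (invented for illustration): two contigs, four features.
example_lengths = {"c1": 1000, "c2": 500}
example_features = [
    ("c1", 1, 300),     # CDS
    ("c1", 250, 400),   # overlapping CDS, merged with the one above
    ("c1", 901, 1000),  # e.g. an ncRNA gene
    ("c2", 101, 200),   # e.g. a tRNA
]
density = coding_density(example_features, example_lengths)
print(f"coding density: {density:.1%}")
```

With a function like this, one could compute the density once with all features and once with only CDS to see how much of a gap between two genomes comes from non-CDS features or from unannotated sequence.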
Absolutely, you are right. I followed this up with tools like Panaroo and PPanGGOLiN for gene presence/absence comparison. The mobilome (insertion elements) is the main thing enriched in the sister species with the lower coding density. I just wanted to investigate what causes this difference in coding density, as I get a lot of questions about this 2% difference.
Hi Oliver,
I hope you are doing well.
This is just an inquiry about the equation used for coding density.
As background to the question: I am comparing two sister species (the first is a successful commensal and the second is a rare pathogen). The coding density stats from Bakta show a significant difference between them of ~2%. I have only a few long-read sequenced genomes (<10), compared with many short-read assemblies (>80 genomes), for both of them.
The thing is that, in the long-read assemblies, the two species show no significant difference in genome length, and the same holds for CDS.
My only hint is that the species with fewer coding sequences is highly enriched in mobilome (IS elements), which could be something that affects this calculation.
Note also that the rare species, which has the lower coding density, averages 20 pseudogenes, compared with 8 in its relative, the first species.
As for unique genes, using the default parameters in Panaroo and PPanGGOLiN, I see ~113 proteins unique to the first species and 80 unique to the second, which is hard for me to relate to the 2% difference in coding density.
Thank you in advance.
Best,
Ahmed