Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Coverage by region vs. Global dist #184

Open
hlfreund opened this issue Oct 10, 2022 · 8 comments
Open

Coverage by region vs. Global dist #184

hlfreund opened this issue Oct 10, 2022 · 8 comments

Comments

@hlfreund
Copy link

Hi there. I am trying to retain the coverage for genes of interest found in my assembled contigs, and have a region BED file, a thresholds BED file, a global.dist.txt file, and a summary.txt file.

I noticed that for certain genes with multiple gene predictions in the BED file (from Prodigal), the number of entries in the global.dist.txt file describing the distribution of their coverage is less than the number of entries of this predicted gene in the regions & threshold files. For example, a gene with multiple predictions appears 102 times in my BED files, but only 49 times in the global.dist file...

Is there a reason for this? Just trying to understand the output of these files better. Thanks!

@brentp
Copy link
Owner

brentp commented Oct 10, 2022

Hi,
So each contig is a gene? Do the genes have ':' in the names?
Can you show a few of the 49 rows and then a few of the missing IDs?

@hlfreund
Copy link
Author

The gene IDs were not clearly layed out in my BED file, so each gene has been assigned the original contig name rather than an updated gene name -- so it's hard to determine exactly which entries are not included in the distribution file.

I have attached a thresholds output and the distribution text file for a sample (compressed in attachment). If you look for the # of entries for these contig IDs, you can see they do not always match in frequency between the thresholds file and the distribution file.
Archive 2.zip

@brentp
Copy link
Owner

brentp commented Oct 10, 2022

do the missing IDs have 0 coverage?

@hlfreund
Copy link
Author

No, it appears that in the regions.bed file, all of the genes have coverage > 0.

@brentp
Copy link
Owner

brentp commented Oct 12, 2022

Hi,
here's what I see. Your
global.dist file contains 135302 unique regions.
thresholds bed file contains 128605 unique regions.

Now, if I look at a few of the differences:

$ grep  NODE_100006_length_274_cov_2.593607 BDC_D_ATL_HL_1_20_2.mosdepth.global.dist.txt
NODE_100006_length_274_cov_2.593607     10      0.05
NODE_100006_length_274_cov_2.593607     9       0.40
NODE_100006_length_274_cov_2.593607     8       0.43
NODE_100006_length_274_cov_2.593607     7       0.43
NODE_100006_length_274_cov_2.593607     6       0.50
NODE_100006_length_274_cov_2.593607     5       0.55
NODE_100006_length_274_cov_2.593607     4       0.74
NODE_100006_length_274_cov_2.593607     3       0.74
NODE_100006_length_274_cov_2.593607     2       0.77
NODE_100006_length_274_cov_2.593607     1       1.00
NODE_100006_length_274_cov_2.593607     0       1.00

I can see it's very low coverage globally, but I can't see wether your regions are covered. So I don't think I have enough information to debug this further.

Note that your contig ID's must be unique.

@hlfreund
Copy link
Author

hlfreund commented Oct 12, 2022

I can provide you with the regions coverage file if that would be helpful.

I tried to name the individual genes in the BED file I supplied, but it seemed to get gene names from the BAM file I attached, which is why they are not unique between genes -- however, the contig IDs are unique.

The documentation says that the gene names will come from the 4th column from the supplied BED file - but this has not been working for me, which is why the gene/region IDs are not unique.

BDC_D_ATL_HL_1_20_2.regions.bed.zip

@brentp
Copy link
Owner

brentp commented Oct 12, 2022

the names in the first column of the bed output must always come from the chromosome/contig column of the bam.
mosdepth will put the gene name in the 5th column of the output regions bed if it was supplied in the 4th column of the input.

the latest regions.bed file you send has nothing in the 4th column.
If I'm missing something obvious, it would help to have some sample input and expected output so I can see the problem.

@hlfreund
Copy link
Author

hlfreund commented Jan 17, 2023

Hi there, sorry for the long period of radio silence. I am trying to determine the depth of coverage for genes within my specific MAG bins. I aligned my raw reads to my assembled contigs, binned these contigs into MAGs, and functionally annotated them to get the BED file. For some reason, my output files are only including the names of the contigs and not the names of the genes - so I am getting coverage values in the output files, but they are not associated with specific genes as far as I can tell. Could I be looking at the wrong files, or am I doing something incorrectly?

The contig IDs I have are unique, and the BED files I am now using have a unique gene ID in the 4th column.
The zip I've attached includes the BED file for this specific MAG bin I am analyzing, and part of the SAM file I have from aligning my original reads to the assembled contigs from this sample. I have also attached some outputs from the program (MosDepth is still currently running so these are incomplete files, and I did not attach all of the output files).

Ideally the mean coverage per gene results would appear like this (an excerpt from a colleague's output):

k141_26460	3	188	g_1_1	491.11
k141_26460	285	824	g_1_2	487.27
k141_26460	837	1022	g_1_3	922.84
k141_26460	1125	2132	g_1_4	415.99
k141_26460	2218	3099	g_1_5	518.05
k141_26460	3147	4460	g_1_6	402.76
k141_26460	4500	5174	g_1_7	231.73
k141_26460	5418	6188	g_1_8	385.52
k141_26460	6185	6928	g_1_9	498.09
k141_26460	7007	7768	g_1_10	647.01

Sorry for the long message and radio silence, thank you so much for your help.

Archive 3.zip

BDC_D_ATL_HL_1_20_1_bin.6.mosdepth.region.dist.txt
BDC_D_ATL_HL_1_20_1_bin.6.mosdepth.global.dist.txt
BDC_D_ATL_HL_1_20_1_bin.6.mosdepth.summary.txt

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants