Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

amrfinder output #71

Open
maggietsui opened this issue Jul 18, 2023 · 8 comments
Open

amrfinder output #71

maggietsui opened this issue Jul 18, 2023 · 8 comments

Comments

@maggietsui
Copy link

Hello,

I have a question regarding the output from running amrfinder. One of the columns looks something like "+:136". Is this the score for the input region? Also, what does the +/- denote? Thanks.

@andrewdavidsmith
Copy link
Collaborator

@maggietsui What version are you using? I'm not able to reproduce any output from amrfinder that seems like what you describe, but you've also only partly described it. Maybe you could paste in a few lines?

@maggietsui
Copy link
Author

Apologies, I got the commands mixed up--I actually used amrtester not amrfinder. I'm using version 1.1.0 which was installed using conda. I first ran dnmtools format on WGBS bam files, then dnmtools states on the output to produce a ".epiread" file, then dnmtools amrtester with a bedfile of regions to output a ".amr" file. Here are the first few lines:

NC_000001.11	257864	359681	-:2748	0.0411185	+ 
NC_000001.11	359681	361681	-:6	1	+
NC_000001.11	516376	516479	-:10	1	+
NC_000001.11	516479	518479	-:32	1	+
NC_000001.11	1722512	1724512	+:136	0.998533	+
NC_000001.11	1724512	1737251	+:1501	0.914237	+
NC_000001.11	1724838	1745992	-:2038	1	+
NC_000001.11	1745992	1747992	-:92	1	+
NC_000001.11	1921951	2003837	-:8245	0	+

I see that column 4 is the p-value, is column 3 the score?

@andrewdavidsmith
Copy link
Collaborator

Your input is wrong. Can you confirm that you provided the mapped reads, converted with "states" and the exact same reference genome as used when mapping reads?

@maggietsui
Copy link
Author

Here is the exact command I used:

dnmtools amrtester \
	-c /data1/home/mtsui/allele_specific_methylation/ncbi-genomes-2023-04-26/GCF_000001405.26_GRCh38_genomic.fna \
	-o /data1/home/mtsui/allele_specific_methylation/asm_scores/"$patient"_output.amr \
	/data1/home/mtsui/allele_specific_methylation/regions_to_test/"$patient"_ase_assessable_promoter_genebody_sorted.bed /data1/home/mtsui/allele_specific_methylation/epiread/"$patient".epiread

I wasn't able to obtain the exact same reference that these files were aligned with since they were aligned on a different platform long ago...This one was the closest I could find.

@andrewdavidsmith
Copy link
Collaborator

The output formats for amrfinder and amrtester will differ. This issue seems to be about amrtester. In the case of amrtester the output will involve the genomic intervals you give it as input, while amrfinder identifies its own genomic intervals.

But all of these methods need to be able to confirm that the CpG sites within the mapped reads are actually CpG sites. Otherwise they could be just sequencing errors, or somatic variation in the cell sample, or any number of things. So when you run dnmtools states, if you are not using the exact same reference genome, it might give you some pretty arbitrary behaviors when making your epireads.

If you have the original BAM files, there's a good chance you can deduce the reference genome used. If you use samtools view -H it should show you the header with the names of "targets" (chromosomes) and the length of each, in nucleotides. Those lengths should be enough to confirm if the reference was hg38 (likely). But also it would be helpful to use a reference genome with chromosome names rather than the NCBI RefSeq accessions for chromosomes, which I think I see in your example output. I suspect the program would fail if the chromosome names weren't at least the same, so you might have gotten lucky here.

The entries like -:2748 that you see in your output, the - that you see is almost certainly because the input BED format file you are providing has the strand in the wrong column. You seem to have it in the 4th column, but the 4th column of a 6-column BED file is the name of the interval, and the strand should be in the 6th column. Please see this documentation for the definition of BED format https://genome.ucsc.edu/FAQ/FAQformat.html#format1 The "name" must appear if the strand does, and if you decide to use either + or - as the name of your interval, then it will be reported in the output as part of the name (4th column).

@andrewdavidsmith
Copy link
Collaborator

@maggietsui sorry -- I was reading your comments backwards. Thanks, I see you mentioned it's amrtester. I think the +/- issue should be explained by the input format of the BED file. Hopefully we can get this resolved and working for you.

@maggietsui
Copy link
Author

Thank you so much for the pointers--I accidentally had the strand in the 4th column of my bedfiles instead of the name of the region. Now I am getting an output like this:

NC_000001.11	1722512	1724512	AL031282.1_promoter:136	0.998533	+
NC_000001.11	1724512	1737251	AL031282.1_gene:1501	0.914237	+
NC_000001.11	1724838	1745992	SLC35E2A_gene:2038	1	+
NC_000001.11	1745992	1747992	SLC35E2A_promoter:92	1	+
NC_000001.11	1921951	2003837	CFAP74_gene:8245	0	+

Just to confirm, in the 4th column the number after ":" is the score for that region? Thanks.

@andrewdavidsmith
Copy link
Collaborator

I don't think so and can't check right now. I'll try to check out the documentation and update if needed later today. I usually use that for information about the amount of data available for the interval. The "score" you want is in the 5th column and that's the p-value for the null hypothesis of no ASM. The intervals where there is ASM detected will have a low value for this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants