Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

hmr requires symmetric data, fails on lifted over .meth files #43

Open
faulk-lab opened this issue Jun 17, 2023 · 5 comments
Open

hmr requires symmetric data, fails on lifted over .meth files #43

faulk-lab opened this issue Jun 17, 2023 · 5 comments

Comments

@faulk-lab
Copy link

I have perhaps an unusual use case. I am using whole genome methylation data from multiple closely related vertebrates. In order to compare apples-to-apples, I need a common coordinate system so I lifted them all over to human hg38. Now I want to run them through hmr (and dmr and several other of your wonderful tools!). However hmr is complaining that error: input is not symmetric-CpGs: .

My data is inherently symmetric because I specified merged symmetric coverage when I generated the bed file (that I converted to .meth with an awk script). My original data is from nanopore, and counts generated with modkit. It was direct methylation calling, not bisulfite, but once it has been converted to .meth format, I figured it didn't matter anymore.

Anyway, dnmtools hmr monkey.meth works fine, but after I lift it over to hg38, and filtered the lines with dnmtools liftfilter I tried dnmtools hmr monkey.hg38.meth and it fails with the symmetric problem. I wrote a script to remove any lines that "looked symmetric" and still no joy. Ideas?

@faulk-lab
Copy link
Author

Hmm I think my filtering script is not doing a good job of actually removing "symmetric looking" sites. Trying to figure out a way to deal with them. Why is it necessary for dnmtools hmr to require non-symmetric looking input anyway? I imagine it's statistically superior to merge the symmetric sites, but in my case, that's probably not biologically meaningful, in lifted-over data for example.

@andrewdavidsmith
Copy link
Collaborator

The sites need to be CpG sites, or the concept of HMRs wouldn't apply. But you are correct that in theory they don't need to be "symmetric" in the file anymore. There's now only a historical artifactual reason that check exists. But I can't just disable it completely right now. If you can build the tools from source, I can probably make a branch relatively quickly that disables that particular check. Also, if you could point me to any public data sets that are sufficiently similar to your own, I could test it myself. Either way, we can find out if it even makes sense to do that analysis on your data.

@faulk-lab
Copy link
Author

faulk-lab commented Jun 18, 2023

Thanks, I can probably build from source if you make a test branch. The sites are definitely CpG only. That's the only ones I called from the original sequencing run.
Here's a partial dataset I am working with, just chromosome 1 of a macaque that has been lifted over to hg38.
https://drive.google.com/file/d/1VvaaKMgvO1RULYvvYSI5Sv9WmwegcwSu/view?usp=drive_link

@andrewdavidsmith
Copy link
Collaborator

@faulk-lab I just made a branch hmr_warn_nonsymmetric_cpgs (poor choice of branch name) that should do what you are asking for. It includes a flag that will allow the program to continue running even if it sees 2 CpG sites at a distance of 1 from each other. I tested it on a file that had a single such site, and it worked. But it's not really fully tested. If the CpGs are symmetric, it should do the same thing as before. If the CpGs include some at a distance of 1, then it should continue just fine, but I don't know what it will do if they are all like that. It's possible that the statistical model will be impacted. I don't have time right now to fully verify that.

@andrewdavidsmith
Copy link
Collaborator

andrewdavidsmith commented Aug 23, 2023

@faulk-lab @moqri I'm adding information here from another issue opened on the same topic:

Describe the bug
We are using a data CpG count data which is converted from another pipeline (so missing some CpGS) but overall symmetric.
running hmr gives an error but hmr-rep works fine

To Reproduce
source: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM5663172
"beta" file is converted to bed file and lifted to hg38. the file is reformatted and sorted to match meth-count. then running hmr on this data will result in "error: input is not symmetric-CpGs"

However, running hmr-rep with two identical copies of the processed data works fine!

Expected behavior
not clear if all (symmetric) CpGs are required for hmr to work or if there is another issue (e.g. two consecutive CpG coordinate?). not clear why hmr-rep works but hmr doesn't

Thanks for the great work!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants