Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Question: Can i build a HMM, where the distribution depends on a value of a specific position #1080

Open
LauraSkak opened this issue Feb 27, 2024 · 3 comments

Comments

@LauraSkak
Copy link

Hi

This is not a bug, but a question.
I'm in the search for a software that can help me build a HMM that can identify imprinted regions using data produced with dorado and modkit from Oxford Nanopore Technologies.

For each position and each haplotype i have a count modified, count unmodified and a total valid coverage count. In my opinion, there is three/four states; both haplotypes are modified, both haplotypes are unmodified and one haplotype is modified and the other is not. My theory is that i can build a HMM where the emission probability of a given position depends on the coverage of that position. No sure if i should use a binomial or beta distribution here. Is this possible with pomegranate?

Thanks in advance for any help you can provide... I'm a bit lost as to how i should attack this.

Laura

@jmschrei
Copy link
Owner

Hi Laura

It is possible to use a variety of distributions with pomegranate. Binomial distributions are already implemented (https://github.com/jmschrei/pomegranate/blob/master/pomegranate/distributions/bernoulli.py) but these assume a fixed number of total trials. However, you can easily make your own distributions using the existing ones as a template. You just need to fill in the few functions that are there and they can be dropped into any existing model.

My understanding of your specific problem is that you modified and unmodified counts and want to get a probability for each hidden state given both the ratio of these counts and their total value, because higher total counts means higher accuracy, right? Modeling just the ratios is easy -- you can use a beta distribution since the range is restricted to (0, 1) (you'd have to implement that yourself unfortunately) or a Gaussian distribution if you want to approximate it for now. If you want to include both the ratio and the counts you might be better off using binomial distributions but, as I mentioned above, you'd probably need to modify the code to handle variable numbers of total counts.

Sorry that I couldn't be of more help.

@LauraSkak
Copy link
Author

Thank you for your quick response!

Yeah, thats the problem, that the probability of modification in each state should be constant but the emission probability should then depend on how many reads are mapped to that position and how many of those reads are classed as modified.

I'm still quite new in the field, so figuring out how i implement my own distributions is a big ask for me atm... I'm currently trying to implement something myself from scratch, so we'll see how that goes.

Again, thanks for the help!

@jmschrei
Copy link
Owner

I understand, good luck!

If you're new to the field I'd suggest starting with a more basic model, like using Gaussian emissions, and seeing how good you can get that model before running into issues. Then, you'd already have a good baseline set up to demonstrate why the more complicated emissions are necessary. Hopefully by that point you'd be more comfortable with the setting and modeling and more able to write your own emissions and test them in a setting you know already works fairly well (by replacing your Gaussian emissions with them).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants