BGEN support #37

Open
prasunanand opened this issue Jun 19, 2018 · 2 comments
@prasunanand
Member

prasunanand commented Jun 19, 2018

In GEMMA, BGEN support was added in a PR.

However, that code has no tests, and I need tests to validate it before I can port it to faster_lmm_d.

I need to test BGEN files with 500k samples. I believe this would also be a great exercise for testing GPU support.

PS: This thread tracks the implementation of BGEN file support.

@prasunanand
Member Author

prasunanand commented Aug 20, 2018

Might be helpful for future reference.

The following table compares the features of the various formats:

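| Feature | PLINK binary | GEN | BGEN v1.1 | BGEN v1.2 / v1.3 | VCF | BCF |
|---|---|---|---|---|---|---|
| Supports unphased genotype calls | | * | * | | | |
| Supports unphased genotype probabilities | | | | | | |
| Supports NULL/outlier probability (e.g. NULL class from CHIAMO / GenoSNP) | | | | | | |
| Supports non-diploid samples | | | | | | |
| Supports phased data? | | | | | | |
| Supports multi-allelic variants | | | | | | |
| Efficient representation? | | | | | | |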

* Hard-called genotypes are converted to probabilities in GEN and BGEN v1.1.

† By convention, males on the X chromosome are stored as homozygote females in GEN and BGEN v1.1.

‡ At the time of writing, the storage of genotype likelihoods and probabilities for non-diploid samples and/or phased data in VCF/BCF is not fully specified.

Found this on http://www.well.ox.ac.uk/~gav/bgen_format/
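
To make the first two footnotes concrete, here is a minimal Python sketch of how a hard call might be turned into the probability triple that GEN and BGEN v1.1 store. The function names are hypothetical and are not taken from GEMMA or faster_lmm_d. Encoding a missing call as all zeros is an assumption borrowed from the usual GEN convention, not something the table states.

```python
from typing import Optional, Tuple

Probs = Tuple[float, float, float]


def hard_call_to_probs(dosage: Optional[int]) -> Probs:
    """Map a diploid hard call to (P(AA), P(AB), P(BB)).

    `dosage` is the count of the second allele (0, 1 or 2).
    None means "missing" and is encoded as all zeros (assumed convention).
    """
    if dosage is None:
        return (0.0, 0.0, 0.0)
    if dosage not in (0, 1, 2):
        raise ValueError(f"expected 0, 1, 2 or None, got {dosage!r}")
    # Put all probability mass on the called genotype.
    return (
        1.0 if dosage == 0 else 0.0,
        1.0 if dosage == 1 else 0.0,
        1.0 if dosage == 2 else 0.0,
    )


def male_x_to_probs(allele: Optional[int]) -> Probs:
    """Encode a haploid male X call (allele 0 or 1) as a homozygote female."""
    if allele is None:
        return hard_call_to_probs(None)
    if allele not in (0, 1):
        raise ValueError(f"expected 0, 1 or None, got {allele!r}")
    # One copy of the allele is stored as two copies (homozygote).
    return hard_call_to_probs(2 * allele)
```

For example, `hard_call_to_probs(1)` gives `(0.0, 1.0, 0.0)` and `male_x_to_probs(1)` gives `(0.0, 0.0, 1.0)`, which is why a reader of these formats cannot tell a hard call from a fully confident probability.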

@pjotrp
Member

pjotrp commented Aug 21, 2018

It also matters how quickly a file format can be streamed for parallel processing. Binary formats typically do no better than compressed textual data here, so I see a binary format as premature optimization ;).

I suspect that for GEMMA we will end up with our own R/qtl2-based format and convert from one of the above.

Computing probabilities is something we like to control. Also, it is not a great idea for GEMMA to support multiple formats, for maintenance reasons. One type is enough. Conversion will be rapid, so we can pipe it in.
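
As a rough illustration of the streaming idea, the sketch below reads a line-oriented text genotype stream (plain, gzip-compressed, or piped in on stdin) and hands it out in fixed-size batches that could be passed to parallel workers. It is written in Python only for illustration. The one-marker-per-line layout and the names `iter_batches` and `open_genotypes` are assumptions, not part of GEMMA, faster_lmm_d, or R/qtl2.

```python
import gzip
import sys
from itertools import islice
from typing import Iterator, List, TextIO


def iter_batches(stream: TextIO, batch_size: int) -> Iterator[List[str]]:
    """Yield lists of up to `batch_size` lines without loading the whole file."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    lines = iter(stream)
    while True:
        # islice pulls at most batch_size lines from the shared iterator.
        batch = [line.rstrip("\n") for line in islice(lines, batch_size)]
        if not batch:
            return
        yield batch


def open_genotypes(path: str) -> TextIO:
    """Open "-" as stdin, "*.gz" as gzip text, anything else as plain text."""
    if path == "-":
        return sys.stdin
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")
```

With this shape, a converter can write to stdout and the consumer can call `iter_batches(open_genotypes("-"), 1000)`, so the on-disk format and the in-memory format stay decoupled.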
