-
Notifications
You must be signed in to change notification settings - Fork 577
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Updated Python and PyMC, removed TensorFlow, and added PyTorch in conda environment. #8561
base: master
Are you sure you want to change the base?
Conversation
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
6534430
to
558ccaf
Compare
Thanks for your work on this @samuelklee! Testing on both wes and wgs would be ideal. For wgs we can use the gatk-sv reference panel, which is our standard (I can help with this once a docker is ready). For wes, 1kgp would work although it's definitely showing its age. Are the integration test differences large? |
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
OK, I think things are looking good! Updated a bunch of things, including the following:
These and other packages (numpy, scipy, etc.) are all pretty much at the latest available versions for python 3.10. I've also bumped version numbers for our internal python packages. I also made all of the changes to the gCNV code to accommodate any changes introduced by PyMC/Pytensor. For the most part, these were minor renamings of However, there were some more nontrivial changes, including to 1) model priors (since some of the distributions previously used were removed or are now supported differently), 2) the implementation of posterior sampling, 3) some shape/dimshuffle operations, and other things along these lines. Using a single test shard of 20 1kGP WES samples x 1000 intervals, I have verified determinism/reproducibility for DetermineGermlineContigPloidy COHORT/CASE modes, GermlineCNVCaller COHORT/CASE modes, and PostprocessGermlineCNVCalls. Numerical results are also relatively close to those from 4.4.0.0 for all identifiable call and model quantities (albeit far outside any reasonable exact-match thresholds, most likely due to differences in RNG, sampling, and the aforementioned priors). Some remaining TODOs:
|
Github actions tests reported job failures from actions build 7143821808
|
4bf5286
to
ed59372
Compare
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
…ker and conda environments.
413fa8f
to
c12c124
Compare
c12c124
to
90bd880
Compare
Hi @samuelklee, Any updates on this PR? Will this be able to get merged in the foreseeable future? Thanks |
Thanks for the inquiry, @matthdsm! This has indeed been a bit on the back burner since winter break, so perhaps it's time for a kick start. @mwalker174 (or anyone else interested in running scientific tests), a Docker image is available at: us.gcr.io/broad-dsde-methods/broad-gatk-snapshots/gatk-remote-builds@sha256:59310a7e8d635d4c09a6b6e09d188070628a42f7110805eeb26ca044aa1f71a5. Note that I haven't tested this image yet, so please let me know if you encounter any issues. @droazen can you provide a roadmap for the CNN filtering tools? These should be deprecated or otherwise managed before this PR is merged. I am not sure who the major stakeholders are here, but any users out there should feel free to chime in. I will try to do a quick self-review and some minor cleanup in the meantime. My expectation is that the timescale for this to get merged will be on the order of a few weeks to months. But it would be great if we can get it in sooner! |
@samuelklee My plan for the deprecation was to do it concurrently with (or just before) the merge of this PR. As this gets closer to a final merge, we can coordinate on that aspect. |
Just noting here that @gokalpcelik was kind enough to run some tests on this branch and noticed that I probably broke posterior sampling—it appears that I reverted to the non-online, memory intensive implementation that we ran into previously during development. Need to carve out some time to fix this issue and more generally polish things up, but I have other more pressing deadlines at the moment. |
Hi @samuelklee, @droazen, totally understand this is not high priority. Would you be able to estimate a rough timeframe when you might be able to tackle this? It would help our community figure out whether to wait or pursue other mitigation in the meantime. |
Hi @vdauwera! I think we’re still within the “few weeks to months” timeframe quoted above, although we’re definitely pushing into the latter territory 😜 Would be happy to chat more, if you like—would also be curious to hear what mitigation is being considered. Feel free to set up a meeting! |
We are hoping to re-prioritize this and get it over the finish line soon! Thanks to @samuelklee 's heroic efforts we are 90% of the way up the path out of our self-imposed Python hell, but I believe there are still a few stray Python demons pulling at our ankles. |
e2ffcd8
to
afe0a67
Compare
@gokalpcelik I think I fixed this posterior-sampling issue. I haven't yet done my own testing, but let me know if you beat me to it! |
Do we have a link for the updated docker image ? |
Here's one, if anyone would like to try it out! |
Finishing up a tieout using old MalariaGEN Pf7 CNV genotyping runs, and I think things look good from this perspective, at least. This tieout uses a subset of the Pf7 samples containing 300 cohort and 1683 case samples (which were indeed treated as a cohort-case cluster in the original Pf7 CNV genotyping analysis). ~4k genomic bins are covered. We compare this branch against 4.5.0.0, as well as this branch against itself (checking for reproducibility). Costs for this branch ($10.92) and 4.5.0.0 ($10.96) were quite comparable. Note that a small portion of these costs derives from Pf7-specific genotyping steps, which I did not bother to remove from the workflow. Runtime for the ploidy modeling and postprocessing steps were comparable. Interestingly, runtime for the gCNV was ~20-25% longer with this branch than with 4.5.0.0, but memory usage fell by a factor of ~3 (~6GB to ~2GB)! I am not sure if we could recoup the runtime with some more tweaking of the environment (perhaps double checking that optimized BLAS/MKL/etc. packages are properly used, changing environment variables/flags, etc.), but I think the decrease in memory usage is quite nice. Concordance was checked for the following quantities (4.5.0.0 is on the x-axis and this branch is on the y-axis in all plots below):
|
Thanks for sharing these results @samuelklee ! That memory vs runtime tradeoff does not seem at all bad to me. What else needs to be done before we can feel confident about merging this (apart from deprecating the old CNN tools)? Is the posterior sampling issue resolved? |
Yes, that's been resolved! Phew. If anyone wants to start taking an initial review pass, that would be appreciated! There are probably a few very minor cleanups I could do (e.g., removing stray commented code and TODOs), but I don't think that should hold anything up. |
Also, I think @mwalker174 should have the final say on merging, as he's the current gCNV owner. Would be great if he could run any GATK-SV-specific tests and report back on runtime/accuracy! |
@@ -78,12 +78,13 @@ def logsumexp_double_complement(a: np.ndarray, rel_tol: float = 1e-3) -> float: | |||
Returns: | |||
a float scalar | |||
""" | |||
print(a) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whoops, probably should remove this!
@samuelklee It's great to see this inching closer to finally getting merged! I think we should plan on merging it shortly after the upcoming 4.6 release, assuming the remaining testing doesn't reveal any issues. Since 4.6 will contain a number of important bug fixes and user-requested changes, I want to get it out before making a big breaking change to the Python environment. After 4.6 is out, we can meet briefly to discuss how best to handle the deprecation of the legacy CNN tools, and hopefully get this in at last. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for documenting all the API changes. This looks reasonable to me (along with the malaria plots). There are a couple more TODOs that I think are to-done. @droazen is going to help us come up with a CNN deprecation/transition plan so we don't have to turn off tests to merge
@@ -1 +1 @@ | |||
__version__ = '0.8' | |||
__version__ = '0.9' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I honestly didn't know this is here. Does it get compared to anything?
* the theano compilation directory (which is set to <code>$HOME/.theano</code> by default). See theano documentation | ||
* at <a href="https://theano-pymc.readthedocs.io/en/latest/library/config.html"> | ||
* https://theano-pymc.readthedocs.io/en/latest/library/config.html</a>. | ||
* <code>PYTENSOR_FLAGS="base_compiledir=PATH/TO/BASE_COMPILEDIR" gatk DetermineGermlineContigPloidy ...</code>, users can specify |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch.
""" | ||
super().__init__(approx) | ||
assert temperature is not None | ||
self.temperature = temperature | ||
|
||
def apply(self, f): | ||
z = self.input | ||
return self.temperature * self.logq_norm(z) - self.logp_norm(z) | ||
return (self.temperature * self.logq_norm - self.logp_norm)[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm going to have to trust you on this one.
# Unfortunately, this new functionality appears to be somewhat brittle and yields an error in our use case. | ||
# We instead bring the old VarMap class into this module and recreate the old functionality to | ||
# preserve our preexisting interfaces. | ||
# TODO |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the TODO?
log_trans_tcc: types.PytensorTensor3, | ||
log_emission_tc: types.PytensorMatrix, | ||
temperature: pt.scalar): | ||
inv_temperature = pt.reciprocal(temperature) # TODO inv to reciprocal |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this still outstanding?
@@ -295,7 +295,7 @@ jobs: | |||
runs-on: ubuntu-latest | |||
strategy: | |||
matrix: | |||
wdlTest: [ 'RUN_CNV_GERMLINE_COHORT_WDL', 'RUN_CNV_GERMLINE_CASE_WDL', 'RUN_CNV_SOMATIC_WDL', 'RUN_M2_WDL', 'RUN_CNN_WDL', 'RUN_VCF_SITE_LEVEL_FILTERING_WDL' ] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There may already be a WDL for the NVIDIA CNN in the Terra workspace, but this will be part of the CNNScoreVariants Deprecation Plan
@@ -160,7 +160,8 @@ public void testNumericalAccuracy() { | |||
.add(GermlineCNVCaller.CONTIG_PLOIDY_CALLS_DIRECTORY_LONG_NAME, | |||
CONTIG_PLOIDY_CALLS_OUTPUT_DIR.getAbsolutePath()) | |||
.add(StandardArgumentDefinitions.OUTPUT_LONG_NAME, outputDir) | |||
.add(CopyNumberStandardArgument.OUTPUT_PREFIX_LONG_NAME, outputPrefix); | |||
.add(CopyNumberStandardArgument.OUTPUT_PREFIX_LONG_NAME, outputPrefix) | |||
.add(StandardArgumentDefinitions.VERBOSITY_NAME, "DEBUG"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you want to keep this debug verbosity?
{ "numpy", "numpy.__config__.get_info('blas_mkl_info') != {} and numpy.__config__.get_info('lapack_mkl_info') != {}" }, | ||
{ "theano", "'-lmkl_rt' in theano.config.blas.ldflags" }, | ||
{ "tensorflow", "tensorflow.pywrap_tensorflow.IsMklEnabled()" } | ||
// { "numpy", "numpy.__config__.get_info('blas_mkl_info') != {} and numpy.__config__.get_info('lapack_mkl_info') != {}" }, TODO this check might need to be done differently for conda-forge numpy 1.26.2, we remove it for now |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a real TODO?
@droazen When we finally remove CNNScoreVariants, don't forget that you can put an entry in the DeprecatedToolsRegistry with a helpful message about the replacement. |
Very excited that this is almost ready! I'm presently going to run some tests on matched BGE/exome samples to assess runtime performance and results with gatk-sv. |
Copying over some discussion from Slack, with some slight modifications:
I've made some strides in this PR; as of 6b08f3a, I've made enough updates to accommodate API changes so that cohort-mode inference for both GermlineCNVCaller and DetermineGermlineContigPloidy runs successfully under Python 3.10 and PyMC 5.9.0---although note that 5.9.1 has been released in the interim!
However, our work has just begun. Results now produced in the numerical tests mentioned above are quite far off from the original expected results. It remains to be seen whether this is due to the randomness of inference, some slight changes to the model prior that were necessitated by the API changes, or some bugs introduced in other code updates. (Also note that I believe Andrey's PR in item 4 already broke these tests, although the numerical differences were much smaller and more reasonable---but perhaps he can confirm. Also noting here that I think determinism is still currently broken as of this commit---there have been some changes to PyTensor/PyMC seeding so that our previous theano/PyMC3 hack no longer applies.)
So I think the next step is to just go to scientific-level testing and see what the fallout is. Ideally, we'd still get good performance (or perhaps better! at least on the runtime side, hopefully...) and we can just update the numerical tests. But if performance tanks, then we might need to see whether I've introduced any bugs. @mwalker174 @asmirnov239 perhaps you can comment on what might be the appropriate test suite here----1kGP?
I'll also highlight again that this PR will remove TensorFlow and might require that the corresponding CNN implementations be supported by an alternate strategy, at least until the PyTorch implementation goes in.