Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Updated Python and PyMC, removed TensorFlow, and added PyTorch in conda environment. #8561

Draft
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

samuelklee
Copy link
Contributor

@samuelklee samuelklee commented Oct 23, 2023

Copying over some discussion from Slack, with some slight modifications:

I took a quick stab at updating the environment for gCNV. Even taking out TensorFlow (assuming that the CNN will not be supported by this environment), it's a difficult task:

  1. The goal is to update Python from 3.6 to 3.10+, since Terra now requires the latter for officially supported images.
  2. However, gCNV relies on the PyMC3 package. PyMC3 3.1 is currently used in GATK master. 3.1 was released in 2017, not long before our release of gCNV in 2018, but it's very old now.
  3. The latest version of Python that is supported by PyMC3 3.1 in conda is Python 3.6.
  4. @asmirnov239 has a draft PR (Add pytorch to the conda environment #8094) that updates PyMC3 to 3.5 and Python to 3.7, which clearly still falls short of Python 3.10+. This PR also updated some gCNV code to make it compatible with PyMC3 3.5. (It also removed TensorFlow and added PyTorch.)
  5. @asmirnov239 also merged a PR that added tests for numerical reproducibility of GermlineCNVCaller in cohort mode in Added gCNV integration test to detect numerical differences in the outputs. #7889.
  6. The earliest version of PyMC that supports Python 3.10+ is PyMC 4, released in 2022.
  7. However, PyMC 4 introduces API changes, which will also require additional gCNV code changes and numerical testing.
  8. These API changes are because the underlying computational backend for PyMC was updated from Theano (think of this as an old alternative to TensorFlow) to Aesara.
  9. Since then, PyMC 5.9 has been released and the underlying backend has been updated again, from Aesara to PyTensor.
  10. So if we are going to update the environment to support Python 3.10+, it probably makes sense to go all the way to PyMC 5.9.

I've made some strides in this PR; as of 6b08f3a, I've made enough updates to accommodate API changes so that cohort-mode inference for both GermlineCNVCaller and DetermineGermlineContigPloidy runs successfully under Python 3.10 and PyMC 5.9.0---although note that 5.9.1 has been released in the interim!

However, our work has just begun. Results now produced in the numerical tests mentioned above are quite far off from the original expected results. It remains to be seen whether this is due to the randomness of inference, some slight changes to the model prior that were necessitated by the API changes, or some bugs introduced in other code updates. (Also note that I believe Andrey's PR in item 4 already broke these tests, although the numerical differences were much smaller and more reasonable---but perhaps he can confirm. Also noting here that I think determinism is still currently broken as of this commit---there have been some changes to PyTensor/PyMC seeding so that our previous theano/PyMC3 hack no longer applies.)

So I think the next step is to just go to scientific-level testing and see what the fallout is. Ideally, we'd still get good performance (or perhaps better! at least on the runtime side, hopefully...) and we can just update the numerical tests. But if performance tanks, then we might need to see whether I've introduced any bugs. @mwalker174 @asmirnov239 perhaps you can comment on what might be the appropriate test suite here----1kGP?

I'll also highlight again that this PR will remove TensorFlow and might require that the corresponding CNN implementations be supported by an alternate strategy, at least until the PyTorch implementation goes in.

@samuelklee samuelklee marked this pull request as draft October 23, 2023 18:31
@samuelklee samuelklee changed the title Sl python version update Updated Python and PyMC, removed TensorFlow, and added PyTorch in conda environment. Oct 23, 2023
@gatk-bot

This comment was marked as outdated.

@gatk-bot

This comment was marked as outdated.

@gatk-bot

This comment was marked as outdated.

@samuelklee samuelklee force-pushed the sl_python_version_update branch 2 times, most recently from 6534430 to 558ccaf Compare November 9, 2023 20:48
@mwalker174
Copy link
Contributor

Thanks for your work on this @samuelklee! Testing on both wes and wgs would be ideal. For wgs we can use the gatk-sv reference panel, which is our standard (I can help with this once a docker is ready). For wes, 1kgp would work although it's definitely showing its age. Are the integration test differences large?

@gatk-bot

This comment was marked as outdated.

@gatk-bot

This comment was marked as outdated.

@gatk-bot

This comment was marked as outdated.

@gatk-bot

This comment was marked as outdated.

@gatk-bot

This comment was marked as outdated.

@gatk-bot

This comment was marked as outdated.

@gatk-bot

This comment was marked as outdated.

@gatk-bot

This comment was marked as outdated.

@gatk-bot

This comment was marked as outdated.

@gatk-bot

This comment was marked as outdated.

@gatk-bot

This comment was marked as outdated.

@gatk-bot

This comment was marked as outdated.

@gatk-bot

This comment was marked as outdated.

@gatk-bot

This comment was marked as outdated.

@gatk-bot

This comment was marked as outdated.

@samuelklee
Copy link
Contributor Author

samuelklee commented Dec 8, 2023

OK, I think things are looking good! Updated a bunch of things, including the following:

  • conda 23.1.0 -> 23.10.0; in the base Docker, also disabled conda auto-updating and set the solver to the much faster libmamba
  • python 3.6.10 -> 3.10.13
  • pymc 3.1 -> 5.10.0
  • theano 1.0.4 -> pytensor 2.18.1
  • added pytorch 2.1.0
  • removed tensorflow 1.15.0 and other CNN dependencies
  • added libblas-dev to the base Docker; I think MKL versions of all packages are being used, but we should verify!

These and other packages (numpy, scipy, etc.) are all pretty much at the latest available versions for python 3.10. I've also bumped version numbers for our internal python packages.

I also made all of the changes to the gCNV code to accommodate any changes introduced by PyMC/Pytensor. For the most part, these were minor renamings of theano/tt/etc. to pytensor/pt/etc.

However, there were some more nontrivial changes, including to 1) model priors (since some of the distributions previously used were removed or are now supported differently), 2) the implementation of posterior sampling, 3) some shape/dimshuffle operations, and other things along these lines.

Using a single test shard of 20 1kGP WES samples x 1000 intervals, I have verified determinism/reproducibility for DetermineGermlineContigPloidy COHORT/CASE modes, GermlineCNVCaller COHORT/CASE modes, and PostprocessGermlineCNVCalls. Numerical results are also relatively close to those from 4.4.0.0 for all identifiable call and model quantities (albeit far outside any reasonable exact-match thresholds, most likely due to differences in RNG, sampling, and the aforementioned priors).

Some remaining TODOs:

  • Rebuild and push the base Docker. EDIT: Mostly covered by Update the GATK base image to a newer LTS ubuntu release #8610, but this also includes an addition of libblas-dev.
  • Update expected results for integration tests, perhaps add any that might be missing. EDIT: These were generated on WSL Ubuntu 20.04.2, we'll see if things pass on 22.04. Note that changing the ARD priors does change the names of the expected files, since the transform is appended to the corresponding variable name. DetermineGermlineContigPloidy and PostprocessGermlineCNVCalls are missing exact-match tests and should probably have some, but I'll leave that to someone else.
  • Update other python integration tests.
  • Clean up some of the changes to the priors.
  • Clean up some TODO comments that I left to track code changes that might result in changed numerics. I'll try to go through and convert these to PR comments in an initial review pass.
  • Test over multiple shards on WGS and WES. Probably some scientific tests on ~100 samples in both cohort and case mode would do the trick. We should also double check runtime/memory performance (I noted ~1.5x speedups, but didn't measure carefully; I also want to make sure the changes to posterior sampling didn't introduce any memory issues). @mwalker174 will ping you when a Docker is ready! Might be good to loop in Isaac and/or Jack as well.
  • Perhaps add back the fix for 2-interval shards in Number of intervals edge case gCNV fix #8180, which I removed since the required functionality wasn't immediately available in Pytensor. Not sure if this actually broke things though---need to check. (However, I don't actually think this is a very important use case to support...)
  • Delete/deprecate/etc. CNN tools/tests as appropriate. Note that this has to be done concurrently, since we remove Tensorflow. @droazen perhaps I can take a first stab at this in a subsequent commit to this PR once more of the gCNV dust settles and/or has undergone a preliminary review? EDIT: Disabled integration/WDL tests. We should add some deprecation messages to the tools---we can note that they should still work in previous environments but will be untested. I might set up a separate PR for deletion, to be done at the appropriate time (but I call dibs on this, can't have @davidbenjamin overtaking my all-time record for number of lines deleted 😛).

@gatk-bot
Copy link

gatk-bot commented Dec 8, 2023

Github actions tests reported job failures from actions build 7143821808
Failures in the following jobs:

Test Type JDK Job ID Logs
conda 17.0.6+10 7143821808.3 logs

@gatk-bot

This comment was marked as outdated.

@gatk-bot

This comment was marked as outdated.

@matthdsm
Copy link

Hi @samuelklee,

Any updates on this PR? Will this be able to get merged in the foreseeable future?

Thanks
M

@samuelklee
Copy link
Contributor Author

samuelklee commented Feb 12, 2024

Thanks for the inquiry, @matthdsm! This has indeed been a bit on the back burner since winter break, so perhaps it's time for a kick start.

@mwalker174 (or anyone else interested in running scientific tests), a Docker image is available at: us.gcr.io/broad-dsde-methods/broad-gatk-snapshots/gatk-remote-builds@sha256:59310a7e8d635d4c09a6b6e09d188070628a42f7110805eeb26ca044aa1f71a5. Note that I haven't tested this image yet, so please let me know if you encounter any issues.

@droazen can you provide a roadmap for the CNN filtering tools? These should be deprecated or otherwise managed before this PR is merged. I am not sure who the major stakeholders are here, but any users out there should feel free to chime in.

I will try to do a quick self-review and some minor cleanup in the meantime.

My expectation is that the timescale for this to get merged will be on the order of a few weeks to months. But it would be great if we can get it in sooner!

@droazen
Copy link
Collaborator

droazen commented Feb 12, 2024

@samuelklee My plan for the deprecation was to do it concurrently with (or just before) the merge of this PR. As this gets closer to a final merge, we can coordinate on that aspect.

@samuelklee
Copy link
Contributor Author

Just noting here that @gokalpcelik was kind enough to run some tests on this branch and noticed that I probably broke posterior sampling—it appears that I reverted to the non-online, memory intensive implementation that we ran into previously during development. Need to carve out some time to fix this issue and more generally polish things up, but I have other more pressing deadlines at the moment.

@vdauwera
Copy link
Contributor

Hi @samuelklee, @droazen, totally understand this is not high priority. Would you be able to estimate a rough timeframe when you might be able to tackle this? It would help our community figure out whether to wait or pursue other mitigation in the meantime.

@samuelklee
Copy link
Contributor Author

Hi @vdauwera! I think we’re still within the “few weeks to months” timeframe quoted above, although we’re definitely pushing into the latter territory 😜

Would be happy to chat more, if you like—would also be curious to hear what mitigation is being considered. Feel free to set up a meeting!

@droazen
Copy link
Collaborator

droazen commented Mar 22, 2024

We are hoping to re-prioritize this and get it over the finish line soon! Thanks to @samuelklee 's heroic efforts we are 90% of the way up the path out of our self-imposed Python hell, but I believe there are still a few stray Python demons pulling at our ankles.

@samuelklee
Copy link
Contributor Author

Just noting here that @gokalpcelik was kind enough to run some tests on this branch and noticed that I probably broke posterior sampling—it appears that I reverted to the non-online, memory intensive implementation that we ran into previously during development. Need to carve out some time to fix this issue and more generally polish things up, but I have other more pressing deadlines at the moment.

@gokalpcelik I think I fixed this posterior-sampling issue. I haven't yet done my own testing, but let me know if you beat me to it!

@gokalpcelik
Copy link
Contributor

Do we have a link for the updated docker image ?

@samuelklee
Copy link
Contributor Author

samuelklee commented Apr 3, 2024

Here's one, if anyone would like to try it out! us.gcr.io/broad-dsde-methods/broad-gatk-snapshots/gatk@sha256:c405d1d7fc42a008bb11a2f60f95bb65f1c487cf4986850a7d5d7d4bdfc37965

@samuelklee
Copy link
Contributor Author

samuelklee commented Apr 26, 2024

Finishing up a tieout using old MalariaGEN Pf7 CNV genotyping runs, and I think things look good from this perspective, at least.

This tieout uses a subset of the Pf7 samples containing 300 cohort and 1683 case samples (which were indeed treated as a cohort-case cluster in the original Pf7 CNV genotyping analysis). ~4k genomic bins are covered. We compare this branch against 4.5.0.0, as well as this branch against itself (checking for reproducibility).

Costs for this branch ($10.92) and 4.5.0.0 ($10.96) were quite comparable. Note that a small portion of these costs derives from Pf7-specific genotyping steps, which I did not bother to remove from the workflow.

Runtime for the ploidy modeling and postprocessing steps were comparable. Interestingly, runtime for the gCNV was ~20-25% longer with this branch than with 4.5.0.0, but memory usage fell by a factor of ~3 (~6GB to ~2GB)! I am not sure if we could recoup the runtime with some more tweaking of the environment (perhaps double checking that optimized BLAS/MKL/etc. packages are properly used, changing environment variables/flags, etc.), but I think the decrease in memory usage is quite nice.

Concordance was checked for the following quantities (4.5.0.0 is on the x-axis and this branch is on the y-axis in all plots below):

  1. Variational posterior means (mu_*) and standard deviations (std_*) for all analogous variables in the ploidy and gCNV models. There were some slight changes to the gCNV model in this branch (e.g., the functional form of the ARD prior was changed), which means some variables are no longer directly comparable. Furthermore, some variables (such as the bias factors W) are degenerate and cannot be immediately compared. Otherwise, there is good concordance between the remaining variables, e.g.:

image
image
image
image
image
image

  1. ... Will update more later!

@droazen
Copy link
Collaborator

droazen commented Apr 26, 2024

Thanks for sharing these results @samuelklee ! That memory vs runtime tradeoff does not seem at all bad to me.

What else needs to be done before we can feel confident about merging this (apart from deprecating the old CNN tools)? Is the posterior sampling issue resolved?

@samuelklee
Copy link
Contributor Author

Yes, that's been resolved! Phew.

If anyone wants to start taking an initial review pass, that would be appreciated! There are probably a few very minor cleanups I could do (e.g., removing stray commented code and TODOs), but I don't think that should hold anything up.

@samuelklee
Copy link
Contributor Author

Also, I think @mwalker174 should have the final say on merging, as he's the current gCNV owner. Would be great if he could run any GATK-SV-specific tests and report back on runtime/accuracy!

@@ -78,12 +78,13 @@ def logsumexp_double_complement(a: np.ndarray, rel_tol: float = 1e-3) -> float:
Returns:
a float scalar
"""
print(a)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoops, probably should remove this!

@droazen
Copy link
Collaborator

droazen commented May 20, 2024

@samuelklee It's great to see this inching closer to finally getting merged! I think we should plan on merging it shortly after the upcoming 4.6 release, assuming the remaining testing doesn't reveal any issues. Since 4.6 will contain a number of important bug fixes and user-requested changes, I want to get it out before making a big breaking change to the Python environment. After 4.6 is out, we can meet briefly to discuss how best to handle the deprecation of the legacy CNN tools, and hopefully get this in at last.

Copy link
Contributor

@ldgauthier ldgauthier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for documenting all the API changes. This looks reasonable to me (along with the malaria plots). There are a couple more TODOs that I think are to-done. @droazen is going to help us come up with a CNN deprecation/transition plan so we don't have to turn off tests to merge

@@ -1 +1 @@
__version__ = '0.8'
__version__ = '0.9'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I honestly didn't know this is here. Does it get compared to anything?

* the theano compilation directory (which is set to <code>$HOME/.theano</code> by default). See theano documentation
* at <a href="https://theano-pymc.readthedocs.io/en/latest/library/config.html">
* https://theano-pymc.readthedocs.io/en/latest/library/config.html</a>.
* <code>PYTENSOR_FLAGS="base_compiledir=PATH/TO/BASE_COMPILEDIR" gatk DetermineGermlineContigPloidy ...</code>, users can specify
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch.

"""
super().__init__(approx)
assert temperature is not None
self.temperature = temperature

def apply(self, f):
z = self.input
return self.temperature * self.logq_norm(z) - self.logp_norm(z)
return (self.temperature * self.logq_norm - self.logp_norm)[0]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to have to trust you on this one.

# Unfortunately, this new functionality appears to be somewhat brittle and yields an error in our use case.
# We instead bring the old VarMap class into this module and recreate the old functionality to
# preserve our preexisting interfaces.
# TODO
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the TODO?

log_trans_tcc: types.PytensorTensor3,
log_emission_tc: types.PytensorMatrix,
temperature: pt.scalar):
inv_temperature = pt.reciprocal(temperature) # TODO inv to reciprocal
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this still outstanding?

@@ -295,7 +295,7 @@ jobs:
runs-on: ubuntu-latest
strategy:
matrix:
wdlTest: [ 'RUN_CNV_GERMLINE_COHORT_WDL', 'RUN_CNV_GERMLINE_CASE_WDL', 'RUN_CNV_SOMATIC_WDL', 'RUN_M2_WDL', 'RUN_CNN_WDL', 'RUN_VCF_SITE_LEVEL_FILTERING_WDL' ]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There may already be a WDL for the NVIDIA CNN in the Terra workspace, but this will be part of the CNNScoreVariants Deprecation Plan

@@ -160,7 +160,8 @@ public void testNumericalAccuracy() {
.add(GermlineCNVCaller.CONTIG_PLOIDY_CALLS_DIRECTORY_LONG_NAME,
CONTIG_PLOIDY_CALLS_OUTPUT_DIR.getAbsolutePath())
.add(StandardArgumentDefinitions.OUTPUT_LONG_NAME, outputDir)
.add(CopyNumberStandardArgument.OUTPUT_PREFIX_LONG_NAME, outputPrefix);
.add(CopyNumberStandardArgument.OUTPUT_PREFIX_LONG_NAME, outputPrefix)
.add(StandardArgumentDefinitions.VERBOSITY_NAME, "DEBUG");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want to keep this debug verbosity?

{ "numpy", "numpy.__config__.get_info('blas_mkl_info') != {} and numpy.__config__.get_info('lapack_mkl_info') != {}" },
{ "theano", "'-lmkl_rt' in theano.config.blas.ldflags" },
{ "tensorflow", "tensorflow.pywrap_tensorflow.IsMklEnabled()" }
// { "numpy", "numpy.__config__.get_info('blas_mkl_info') != {} and numpy.__config__.get_info('lapack_mkl_info') != {}" }, TODO this check might need to be done differently for conda-forge numpy 1.26.2, we remove it for now
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a real TODO?

@cmnbroad
Copy link
Collaborator

@droazen When we finally remove CNNScoreVariants, don't forget that you can put an entry in the DeprecatedToolsRegistry with a helpful message about the replacement.

@mwalker174
Copy link
Contributor

Very excited that this is almost ready! I'm presently going to run some tests on matched BGE/exome samples to assess runtime performance and results with gatk-sv.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

9 participants