Updated Python and PyMC, removed TensorFlow, and added PyTorch in conda environment. #8561

samuelklee · 2023-10-23T18:30:59Z

Copying over some discussion from Slack, with some slight modifications:

I took a quick stab at updating the environment for gCNV. Even taking out TensorFlow (assuming that the CNN will not be supported by this environment), it's a difficult task:

The goal is to update Python from 3.6 to 3.10+, since Terra now requires the latter for officially supported images.

However, gCNV relies on the PyMC3 package. PyMC3 3.1 is currently used in GATK master. 3.1 was released in 2017, not long before our release of gCNV in 2018, but it's very old now.

The latest version of Python that is supported by PyMC3 3.1 in conda is Python 3.6.

@asmirnov239 has a draft PR (Add pytorch to the conda environment #8094) that updates PyMC3 to 3.5 and Python to 3.7, which clearly still falls short of Python 3.10+. This PR also updated some gCNV code to make it compatible with PyMC3 3.5. (It also removed TensorFlow and added PyTorch.)

@asmirnov239 also merged a PR that added tests for numerical reproducibility of GermlineCNVCaller in cohort mode in Added gCNV integration test to detect numerical differences in the outputs. #7889.

The earliest version of PyMC that supports Python 3.10+ is PyMC 4, released in 2022.

However, PyMC 4 introduces API changes, which will also require additional gCNV code changes and numerical testing.

These API changes are because the underlying computational backend for PyMC was updated from Theano (think of this as an old alternative to TensorFlow) to Aesara.

Since then, PyMC 5.9 has been released and the underlying backend has been updated again, from Aesara to PyTensor.

So if we are going to update the environment to support Python 3.10+, it probably makes sense to go all the way to PyMC 5.9.

I've made some strides in this PR; as of 6b08f3a, I've made enough updates to accommodate API changes so that cohort-mode inference for both GermlineCNVCaller and DetermineGermlineContigPloidy runs successfully under Python 3.10 and PyMC 5.9.0---although note that 5.9.1 has been released in the interim!

However, our work has just begun. Results now produced in the numerical tests mentioned above are quite far off from the original expected results. It remains to be seen whether this is due to the randomness of inference, some slight changes to the model prior that were necessitated by the API changes, or some bugs introduced in other code updates. (Also note that I believe Andrey's PR in item 4 already broke these tests, although the numerical differences were much smaller and more reasonable---but perhaps he can confirm. Also noting here that I think determinism is still currently broken as of this commit---there have been some changes to PyTensor/PyMC seeding so that our previous theano/PyMC3 hack no longer applies.)

So I think the next step is to just go to scientific-level testing and see what the fallout is. Ideally, we'd still get good performance (or perhaps better! at least on the runtime side, hopefully...) and we can just update the numerical tests. But if performance tanks, then we might need to see whether I've introduced any bugs. @mwalker174 @asmirnov239 perhaps you can comment on what might be the appropriate test suite here----1kGP?

I'll also highlight again that this PR will remove TensorFlow and might require that the corresponding CNN implementations be supported by an alternate strategy, at least until the PyTorch implementation goes in.

mwalker174 · 2023-11-09T21:17:07Z

Thanks for your work on this @samuelklee! Testing on both wes and wgs would be ideal. For wgs we can use the gatk-sv reference panel, which is our standard (I can help with this once a docker is ready). For wes, 1kgp would work although it's definitely showing its age. Are the integration test differences large?

samuelklee · 2023-12-08T17:16:20Z

OK, I think things are looking good! Updated a bunch of things, including the following:

conda 23.1.0 -> 23.10.0; in the base Docker, also disabled conda auto-updating and set the solver to the much faster libmamba
python 3.6.10 -> 3.10.13
pymc 3.1 -> 5.10.0
theano 1.0.4 -> pytensor 2.18.1
added pytorch 2.1.0
removed tensorflow 1.15.0 and other CNN dependencies
added libblas-dev to the base Docker; I think MKL versions of all packages are being used, but we should verify!

These and other packages (numpy, scipy, etc.) are all pretty much at the latest available versions for python 3.10. I've also bumped version numbers for our internal python packages.

I also made all of the changes to the gCNV code to accommodate any changes introduced by PyMC/Pytensor. For the most part, these were minor renamings of theano/tt/etc. to pytensor/pt/etc.

However, there were some more nontrivial changes, including to 1) model priors (since some of the distributions previously used were removed or are now supported differently), 2) the implementation of posterior sampling, 3) some shape/dimshuffle operations, and other things along these lines.

Using a single test shard of 20 1kGP WES samples x 1000 intervals, I have verified determinism/reproducibility for DetermineGermlineContigPloidy COHORT/CASE modes, GermlineCNVCaller COHORT/CASE modes, and PostprocessGermlineCNVCalls. Numerical results are also relatively close to those from 4.4.0.0 for all identifiable call and model quantities (albeit far outside any reasonable exact-match thresholds, most likely due to differences in RNG, sampling, and the aforementioned priors).

Some remaining TODOs:

Rebuild and push the base Docker. EDIT: Mostly covered by Update the GATK base image to a newer LTS ubuntu release #8610, but this also includes an addition of libblas-dev.
Update expected results for integration tests, perhaps add any that might be missing. EDIT: These were generated on WSL Ubuntu 20.04.2, we'll see if things pass on 22.04. Note that changing the ARD priors does change the names of the expected files, since the transform is appended to the corresponding variable name. DetermineGermlineContigPloidy and PostprocessGermlineCNVCalls are missing exact-match tests and should probably have some, but I'll leave that to someone else.
Update other python integration tests.
Clean up some of the changes to the priors.
Clean up some TODO comments that I left to track code changes that might result in changed numerics. I'll try to go through and convert these to PR comments in an initial review pass.
Test over multiple shards on WGS and WES. Probably some scientific tests on ~100 samples in both cohort and case mode would do the trick. We should also double check runtime/memory performance (I noted ~1.5x speedups, but didn't measure carefully; I also want to make sure the changes to posterior sampling didn't introduce any memory issues). @mwalker174 will ping you when a Docker is ready! Might be good to loop in Isaac and/or Jack as well.
Perhaps add back the fix for 2-interval shards in Number of intervals edge case gCNV fix #8180, which I removed since the required functionality wasn't immediately available in Pytensor. Not sure if this actually broke things though---need to check. (However, I don't actually think this is a very important use case to support...)
Delete/deprecate/etc. CNN tools/tests as appropriate. Note that this has to be done concurrently, since we remove Tensorflow. @droazen perhaps I can take a first stab at this in a subsequent commit to this PR once more of the gCNV dust settles and/or has undergone a preliminary review? EDIT: Disabled integration/WDL tests. We should add some deprecation messages to the tools---we can note that they should still work in previous environments but will be untested. I might set up a separate PR for deletion, to be done at the appropriate time (but I call dibs on this, can't have @davidbenjamin overtaking my all-time record for number of lines deleted 😛).

gatk-bot · 2023-12-08T17:17:10Z

Github actions tests reported job failures from actions build 7143821808
Failures in the following jobs:

Test Type	JDK	Job ID	Logs
conda	17.0.6+10	7143821808.3	logs

…ker and conda environments.

…Tensor.

matthdsm · 2024-02-12T08:37:00Z

Hi @samuelklee,

Any updates on this PR? Will this be able to get merged in the foreseeable future?

Thanks
M

samuelklee · 2024-02-12T21:23:23Z

Thanks for the inquiry, @matthdsm! This has indeed been a bit on the back burner since winter break, so perhaps it's time for a kick start.

@mwalker174 (or anyone else interested in running scientific tests), a Docker image is available at: us.gcr.io/broad-dsde-methods/broad-gatk-snapshots/gatk-remote-builds@sha256:59310a7e8d635d4c09a6b6e09d188070628a42f7110805eeb26ca044aa1f71a5. Note that I haven't tested this image yet, so please let me know if you encounter any issues.

@droazen can you provide a roadmap for the CNN filtering tools? These should be deprecated or otherwise managed before this PR is merged. I am not sure who the major stakeholders are here, but any users out there should feel free to chime in.

I will try to do a quick self-review and some minor cleanup in the meantime.

My expectation is that the timescale for this to get merged will be on the order of a few weeks to months. But it would be great if we can get it in sooner!

droazen · 2024-02-12T22:27:16Z

@samuelklee My plan for the deprecation was to do it concurrently with (or just before) the merge of this PR. As this gets closer to a final merge, we can coordinate on that aspect.

samuelklee · 2024-03-19T16:13:35Z

Just noting here that @gokalpcelik was kind enough to run some tests on this branch and noticed that I probably broke posterior sampling—it appears that I reverted to the non-online, memory intensive implementation that we ran into previously during development. Need to carve out some time to fix this issue and more generally polish things up, but I have other more pressing deadlines at the moment.

vdauwera · 2024-03-21T23:40:05Z

Hi @samuelklee, @droazen, totally understand this is not high priority. Would you be able to estimate a rough timeframe when you might be able to tackle this? It would help our community figure out whether to wait or pursue other mitigation in the meantime.

samuelklee · 2024-03-22T10:32:17Z

Hi @vdauwera! I think we’re still within the “few weeks to months” timeframe quoted above, although we’re definitely pushing into the latter territory 😜

Would be happy to chat more, if you like—would also be curious to hear what mitigation is being considered. Feel free to set up a meeting!

droazen · 2024-03-22T16:55:22Z

We are hoping to re-prioritize this and get it over the finish line soon! Thanks to @samuelklee 's heroic efforts we are 90% of the way up the path out of our self-imposed Python hell, but I believe there are still a few stray Python demons pulling at our ankles.

samuelklee · 2024-04-03T16:12:06Z

Just noting here that @gokalpcelik was kind enough to run some tests on this branch and noticed that I probably broke posterior sampling—it appears that I reverted to the non-online, memory intensive implementation that we ran into previously during development. Need to carve out some time to fix this issue and more generally polish things up, but I have other more pressing deadlines at the moment.

@gokalpcelik I think I fixed this posterior-sampling issue. I haven't yet done my own testing, but let me know if you beat me to it!

gokalpcelik · 2024-04-03T16:14:22Z

Do we have a link for the updated docker image ?

samuelklee · 2024-04-03T17:29:20Z

Here's one, if anyone would like to try it out! us.gcr.io/broad-dsde-methods/broad-gatk-snapshots/gatk@sha256:c405d1d7fc42a008bb11a2f60f95bb65f1c487cf4986850a7d5d7d4bdfc37965

samuelklee · 2024-04-26T16:15:29Z

Finishing up a tieout using old MalariaGEN Pf7 CNV genotyping runs, and I think things look good from this perspective, at least.

This tieout uses a subset of the Pf7 samples containing 300 cohort and 1683 case samples (which were indeed treated as a cohort-case cluster in the original Pf7 CNV genotyping analysis). ~4k genomic bins are covered. We compare this branch against 4.5.0.0, as well as this branch against itself (checking for reproducibility).

Costs for this branch ($10.92) and 4.5.0.0 ($10.96) were quite comparable. Note that a small portion of these costs derives from Pf7-specific genotyping steps, which I did not bother to remove from the workflow.

Runtime for the ploidy modeling and postprocessing steps were comparable. Interestingly, runtime for the gCNV was ~20-25% longer with this branch than with 4.5.0.0, but memory usage fell by a factor of ~3 (~6GB to ~2GB)! I am not sure if we could recoup the runtime with some more tweaking of the environment (perhaps double checking that optimized BLAS/MKL/etc. packages are properly used, changing environment variables/flags, etc.), but I think the decrease in memory usage is quite nice.

Concordance was checked for the following quantities (4.5.0.0 is on the x-axis and this branch is on the y-axis in all plots below):

Variational posterior means (mu_*) and standard deviations (std_*) for all analogous variables in the ploidy and gCNV models. There were some slight changes to the gCNV model in this branch (e.g., the functional form of the ARD prior was changed), which means some variables are no longer directly comparable. Furthermore, some variables (such as the bias factors W) are degenerate and cannot be immediately compared. Otherwise, there is good concordance between the remaining variables, e.g.:

... Will update more later!

droazen · 2024-04-26T17:21:43Z

Thanks for sharing these results @samuelklee ! That memory vs runtime tradeoff does not seem at all bad to me.

What else needs to be done before we can feel confident about merging this (apart from deprecating the old CNN tools)? Is the posterior sampling issue resolved?

samuelklee · 2024-04-26T18:18:02Z

Yes, that's been resolved! Phew.

If anyone wants to start taking an initial review pass, that would be appreciated! There are probably a few very minor cleanups I could do (e.g., removing stray commented code and TODOs), but I don't think that should hold anything up.

samuelklee · 2024-04-26T18:19:03Z

Also, I think @mwalker174 should have the final say on merging, as he's the current gCNV owner. Would be great if he could run any GATK-SV-specific tests and report back on runtime/accuracy!

samuelklee · 2024-05-16T20:32:50Z

src/main/python/org/broadinstitute/hellbender/gcnvkernel/utils/math.py

@@ -78,12 +78,13 @@ def logsumexp_double_complement(a: np.ndarray, rel_tol: float = 1e-3) -> float:
    Returns:
        a float scalar
    """
+    print(a)


Whoops, probably should remove this!

droazen · 2024-05-20T19:03:11Z

@samuelklee It's great to see this inching closer to finally getting merged! I think we should plan on merging it shortly after the upcoming 4.6 release, assuming the remaining testing doesn't reveal any issues. Since 4.6 will contain a number of important bug fixes and user-requested changes, I want to get it out before making a big breaking change to the Python environment. After 4.6 is out, we can meet briefly to discuss how best to handle the deprecation of the legacy CNN tools, and hopefully get this in at last.

ldgauthier

Thanks for documenting all the API changes. This looks reasonable to me (along with the malaria plots). There are a couple more TODOs that I think are to-done. @droazen is going to help us come up with a CNN deprecation/transition plan so we don't have to turn off tests to merge

ldgauthier · 2024-05-20T18:12:54Z

src/main/python/org/broadinstitute/hellbender/gcnvkernel/_version.py

@@ -1 +1 @@
-__version__ = '0.8'
+__version__ = '0.9'


I honestly didn't know this is here. Does it get compared to anything?

ldgauthier · 2024-05-20T18:13:44Z

src/main/java/org/broadinstitute/hellbender/tools/copynumber/PostprocessGermlineCNVCalls.java

- * the theano compilation directory (which is set to <code>$HOME/.theano</code> by default).  See theano documentation
- * at <a href="https://theano-pymc.readthedocs.io/en/latest/library/config.html">
- *     https://theano-pymc.readthedocs.io/en/latest/library/config.html</a>.
+ * <code>PYTENSOR_FLAGS="base_compiledir=PATH/TO/BASE_COMPILEDIR" gatk DetermineGermlineContigPloidy ...</code>, users can specify


Good catch.

ldgauthier · 2024-05-20T18:14:38Z

src/main/python/org/broadinstitute/hellbender/gcnvkernel/inference/deterministic_annealing.py

        """
        super().__init__(approx)
        assert temperature is not None
        self.temperature = temperature

    def apply(self, f):
-        z = self.input
-        return self.temperature * self.logq_norm(z) - self.logp_norm(z)
+        return (self.temperature * self.logq_norm - self.logp_norm)[0]


I'm going to have to trust you on this one.

ldgauthier · 2024-05-20T18:16:08Z

src/main/python/org/broadinstitute/hellbender/gcnvkernel/io/io_commons.py

+    # Unfortunately, this new functionality appears to be somewhat brittle and yields an error in our use case.
+    # We instead bring the old VarMap class into this module and recreate the old functionality to
+    # preserve our preexisting interfaces.
+    # TODO


What's the TODO?

ldgauthier · 2024-05-20T18:23:59Z

src/main/python/org/broadinstitute/hellbender/gcnvkernel/models/pytensor_hmm.py

+                                        log_trans_tcc: types.PytensorTensor3,
+                                        log_emission_tc: types.PytensorMatrix,
+                                        temperature: pt.scalar):
+        inv_temperature = pt.reciprocal(temperature) # TODO inv to reciprocal


Is this still outstanding?

ldgauthier · 2024-05-20T18:27:46Z

.github/workflows/gatk-tests.yml

@@ -295,7 +295,7 @@ jobs:
    runs-on: ubuntu-latest
    strategy:
      matrix:
-        wdlTest: [ 'RUN_CNV_GERMLINE_COHORT_WDL', 'RUN_CNV_GERMLINE_CASE_WDL', 'RUN_CNV_SOMATIC_WDL', 'RUN_M2_WDL', 'RUN_CNN_WDL', 'RUN_VCF_SITE_LEVEL_FILTERING_WDL' ]


There may already be a WDL for the NVIDIA CNN in the Terra workspace, but this will be part of the CNNScoreVariants Deprecation Plan

ldgauthier · 2024-05-20T18:29:21Z

...st/java/org/broadinstitute/hellbender/tools/copynumber/GermlineCNVCallerIntegrationTest.java

@@ -160,7 +160,8 @@ public void testNumericalAccuracy() {
                .add(GermlineCNVCaller.CONTIG_PLOIDY_CALLS_DIRECTORY_LONG_NAME,
                        CONTIG_PLOIDY_CALLS_OUTPUT_DIR.getAbsolutePath())
                .add(StandardArgumentDefinitions.OUTPUT_LONG_NAME, outputDir)
-                .add(CopyNumberStandardArgument.OUTPUT_PREFIX_LONG_NAME, outputPrefix);
+                .add(CopyNumberStandardArgument.OUTPUT_PREFIX_LONG_NAME, outputPrefix)
+                .add(StandardArgumentDefinitions.VERBOSITY_NAME, "DEBUG");


Do you want to keep this debug verbosity?

ldgauthier · 2024-05-20T19:01:42Z

src/test/java/org/broadinstitute/hellbender/utils/python/PythonEnvironmentIntegrationTest.java

-                { "numpy",          "numpy.__config__.get_info('blas_mkl_info') != {} and numpy.__config__.get_info('lapack_mkl_info') != {}" },
-                { "theano",         "'-lmkl_rt' in theano.config.blas.ldflags" },
-                { "tensorflow",     "tensorflow.pywrap_tensorflow.IsMklEnabled()" }
+                // { "numpy",          "numpy.__config__.get_info('blas_mkl_info') != {} and numpy.__config__.get_info('lapack_mkl_info') != {}" },  TODO this check might need to be done differently for conda-forge numpy 1.26.2, we remove it for now


Is this a real TODO?

cmnbroad · 2024-05-20T19:40:39Z

@droazen When we finally remove CNNScoreVariants, don't forget that you can put an entry in the DeprecatedToolsRegistry with a helpful message about the replacement.

mwalker174 · 2024-05-22T15:33:09Z

Very excited that this is almost ready! I'm presently going to run some tests on matched BGE/exome samples to assess runtime performance and results with gatk-sv.

samuelklee marked this pull request as draft October 23, 2023 18:31

samuelklee changed the title ~~Sl python version update~~ Updated Python and PyMC, removed TensorFlow, and added PyTorch in conda environment. Oct 23, 2023

This comment was marked as outdated.

Sign in to view

samuelklee force-pushed the sl_python_version_update branch 2 times, most recently from 6534430 to 558ccaf Compare November 9, 2023 20:48

This comment was marked as outdated.

Sign in to view

This was referenced Dec 10, 2023

Update the GATK base image to a newer LTS ubuntu release #8610

Merged

GermlineCNVCaller different results with same GATK and different Ubuntu #8619

Closed

samuelklee force-pushed the sl_python_version_update branch from 4bf5286 to ed59372 Compare December 12, 2023 14:55

This comment was marked as outdated.

Sign in to view

samuelklee added 2 commits December 14, 2023 09:30

Updated Python and PyMC, removed TensorFlow, and added PyTorch in Doc…

53de6b2

…ker and conda environments.

Updated gCNV code to account for changes from PyMC3/Theano to PyMC/Py…

8d1cad8

…Tensor.

samuelklee force-pushed the sl_python_version_update branch from 413fa8f to c12c124 Compare December 14, 2023 14:34

samuelklee added 3 commits December 14, 2023 09:36

Updated gCNV integration tests.

b722ac1

Updated gCNV WDL tests.

581f33f

Updated other tests and tools affected by environment changes.

90bd880

samuelklee force-pushed the sl_python_version_update branch from c12c124 to 90bd880 Compare December 14, 2023 14:36

matthdsm mentioned this pull request Feb 12, 2024

GATK4 dependencies bioconda/bioconda-recipes#45758

Open

Reverted posterior sampling to online implementation.

afe0a67

samuelklee force-pushed the sl_python_version_update branch from e2ffcd8 to afe0a67 Compare April 3, 2024 01:07

Updated localDevCondaEnv task in build.gradle.

f7a9760

samuelklee commented May 16, 2024

View reviewed changes

ldgauthier reviewed May 20, 2024

View reviewed changes

		@@ -1 +1 @@
		__version__ = '0.8'
		__version__ = '0.9'

Updated Python and PyMC, removed TensorFlow, and added PyTorch in conda environment. #8561

Are you sure you want to change the base?

Updated Python and PyMC, removed TensorFlow, and added PyTorch in conda environment. #8561

Conversation

samuelklee commented Oct 23, 2023 • edited

This comment was marked as outdated.

This comment was marked as outdated.

This comment was marked as outdated.

mwalker174 commented Nov 9, 2023

This comment was marked as outdated.

This comment was marked as outdated.

This comment was marked as outdated.

This comment was marked as outdated.

This comment was marked as outdated.

This comment was marked as outdated.

This comment was marked as outdated.

This comment was marked as outdated.

This comment was marked as outdated.

This comment was marked as outdated.

This comment was marked as outdated.

This comment was marked as outdated.

This comment was marked as outdated.

This comment was marked as outdated.

This comment was marked as outdated.

samuelklee commented Dec 8, 2023 • edited

gatk-bot commented Dec 8, 2023

This comment was marked as outdated.

This comment was marked as outdated.

matthdsm commented Feb 12, 2024

samuelklee commented Feb 12, 2024 • edited

droazen commented Feb 12, 2024

samuelklee commented Mar 19, 2024

vdauwera commented Mar 21, 2024

samuelklee commented Mar 22, 2024

droazen commented Mar 22, 2024

samuelklee commented Apr 3, 2024

gokalpcelik commented Apr 3, 2024

samuelklee commented Apr 3, 2024 • edited

samuelklee commented Apr 26, 2024 • edited

droazen commented Apr 26, 2024

samuelklee commented Apr 26, 2024

samuelklee commented Apr 26, 2024

Choose a reason for hiding this comment

droazen commented May 20, 2024

ldgauthier left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

cmnbroad commented May 20, 2024

mwalker174 commented May 22, 2024

samuelklee commented Oct 23, 2023 •

edited

samuelklee commented Dec 8, 2023 •

edited

samuelklee commented Feb 12, 2024 •

edited

samuelklee commented Apr 3, 2024 •

edited

samuelklee commented Apr 26, 2024 •

edited