Adds Dockerfile to improve reproducibility #3339
base: main
Conversation
Codecov Report
```
@@            Coverage Diff             @@
##           master    #3339      +/-   ##
==========================================
+ Coverage   93.79%   93.86%   +0.06%
==========================================
  Files         144      144
  Lines       18103    19104    +1001
==========================================
+ Hits        16980    17932     +952
- Misses       1123     1172      +49
```
Continue to review full report at Codecov.
Thanks @Bultako! Do we still need the old docker? Probably not...

No, we do not need it, not only because we no longer use Travis for the CI, but also because I get an error when trying to run it.

@Bultako Yes, please remove!

Ok, done!
```dockerfile
@@ -0,0 +1,29 @@
# gammapy docker container
FROM continuumio/miniconda3:latest
```
Using miniconda inside docker will result in much larger docker images than, say, using Ubuntu 20.04 and installing the system Python (which is 3.8).
The installation process relies on first building a conda environment described in `environment-dev.yml`, so the `conda` command is needed.
Dockerfile (outdated):

```dockerfile
@@ -0,0 +1,29 @@
# gammapy docker container
FROM continuumio/miniconda3:latest
RUN apt-get update && apt-get install -y build-essential python3-pip
```
Why do you need build-essential and pip when using conda?

Also, please use docker image best practices, here cleaning up after using apt: https://docs.docker.com/develop/develop-images/dockerfile_best-practices/#run
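Following that guide, the apt layer could be written as a single `RUN` that removes the package lists in the same layer. A sketch based on the line under review (package names taken from the diff, not a definitive recommendation):

```dockerfile
# Install and clean up in one layer, so the apt cache
# never persists into the final image.
RUN apt-get update \
    && apt-get install -y --no-install-recommends build-essential python3-pip \
    && rm -rf /var/lib/apt/lists/*
```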
You're right, we can remove build-essential, but pip is needed to do a development install, as justified in the comments below.
But pip comes with conda; there is no need to install pip into the system.
Yes, but I get an error when running `pip install -e .` in the docker that does not happen if I first install pip with apt-get.
Then you are installing gammapy into the system, not into the conda env
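One possible way to make pip target the conda env during the build is to route subsequent commands through the env with a `SHELL` instruction. A sketch only, assuming the env name `gammapy` from the diff above:

```dockerfile
# Execute subsequent RUN commands inside the "gammapy" conda env,
# so "pip" resolves to the env's pip instead of the system one.
SHELL ["conda", "run", "-n", "gammapy", "/bin/bash", "-c"]

# development install of gammapy, now into the conda env
RUN pip install -e .
```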
Dockerfile (outdated):

```dockerfile
# create conda env
RUN conda install -c conda-forge mamba
RUN mamba env create -n gammapy -f environment-dev.yml
RUN echo 'source activate gammapy' >> ~/.bashrc
```
These RUN instructions should also be combined into a single RUN statement, with `conda clean --all --yes` called at the end.
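A sketch of the combined form, using exactly the commands from the diff above:

```dockerfile
# Create the conda env in a single layer and drop all package
# caches at the end, so they do not bloat the image.
RUN conda install -c conda-forge mamba \
    && mamba env create -n gammapy -f environment-dev.yml \
    && echo 'source activate gammapy' >> ~/.bashrc \
    && conda clean --all --yes
```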
Ok, done! Thanks!
```dockerfile
ENV PATH /opt/conda/envs/gammapy/bin:$PATH

# install gammapy
RUN pip install -e .
```
If this is for reproducibility, wouldn't it make more sense to just install a specific version of gammapy and specific versions of all needed packages instead of doing a development install?
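A pinned install could look like this sketch; the version number here is only illustrative (taken from the 0.18.2 environment mentioned elsewhere in the thread):

```dockerfile
# Pin an exact gammapy release instead of a development install,
# so rebuilding the image yields the same package versions.
RUN pip install --no-cache-dir gammapy==0.18.2
```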
The main idea is that Dockerhub could automatically trigger the build of an image when we tag a release on GitHub, hence before gammapy is available in conda repositories. We have the dependencies described in the `environment-dev.yml` file, which is also used to build the execution environment in the GitHub Actions CI. Maybe we could avoid installing the dev and test dependencies, but I think adding the complexity of parsing this file to save a few hundred MB out of 3 GB is not worth the effort.
```dockerfile
RUN pip install -e .

# download datasets
RUN gammapy download datasets
```
Why are the datasets needed in the docker image?
Doesn't that just increase the docker image size by a lot?
The datasets are 192 MB out of the 3 GB total image size. We ship the datasets within the docker image together with the Gammapy code and tutorials, to make it possible to play with the tutorials in the shipped JupyterLab. See here.
The image is 3Gb??? That is HUGE
Yes, find some numbers below.

Environment:

```shell
$ du -hs /opt/miniconda3/envs/gammapy-dev
2,1G    /opt/miniconda3/envs/gammapy-dev
```

Code:

```shell
$ du -hs ~/git/gammapy/gammapy
286M    gammapy
```

Datasets:

```shell
$ du -hs gammapy-datasets/
184M    gammapy-datasets/
```
This is interesting, I have never checked what the actual size of the dev environment is. Do you have any idea why it is so large?
Conda envs get huge fast, since they duplicate many system libraries and include e.g. MKL by default. Adding nomkl reduces the conda env size quite a bit.
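As a sketch of what that could look like in a conda environment file (the surrounding entries are illustrative, not the actual contents of `environment-dev.yml`):

```yaml
# Fragment of a conda environment file. Listing nomkl makes the
# solver pick non-MKL (OpenBLAS) builds of numpy/scipy,
# which are much smaller than the MKL variants.
name: gammapy-dev
channels:
  - conda-forge
dependencies:
  - nomkl
  - numpy
  - scipy
```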
Just for info:

```shell
$ du -hs /opt/miniconda3/envs/gammapy-0.18.2
1,2G    /opt/miniconda3/envs/gammapy-0.18.2
```
I think we should discuss a bit what the use case for this docker container should be. If the use case is to provide an exactly reproducible software environment for analysis, the goal should be to make this image as small as possible, so it can be used quickly on clusters and the like.

I don't really see a use case for providing a docker image to run the tutorials; this is an anti-pattern. People wanting to try out gammapy and do the tutorials should not need to do that in a docker container. It adds many additional hurdles and doesn't solve a problem other than "my Python setup is messed up", and the chances to mess things up only increase with docker.

So if the goal is to provide a containerized environment for running analyses reproducibly, no data should be added and only the bare dependencies, e.g. not JupyterLab.
As I have understood it from internal discussions, the use case is mostly archive-oriented, in the sense of archiving an exact replica of the development environment, code, and datasets used at the moment of a release. So the goal is not only to provide reproducible software in the long term, but also the exact environment used to develop it at that time. Sure, you're very welcome to contribute to this discussion internally :)
Just for comparison, a Docker image starting from:

```
❯ cat Dockerfile
FROM python:3.9-slim-buster

RUN apt update && apt install --yes --no-install-recommends \
    gcc \
    libc-dev

RUN useradd -m gammapy
USER gammapy

# regions 0.4 does not provide a pyproject.toml or a wheel and
# needs numpy, cython to be present at install time.
# This is fixed in current master, so this extra command should be removed
# when the next version of regions is released.
ENV PATH="/home/gammapy/.local/bin:${PATH}"
RUN python3 -m pip install --no-cache-dir Cython numpy \
    && pip install --no-cache-dir gammapy \
    && rm -rf /home/gammapy/.cache/pip

CMD ["/bin/bash", "-l"]
```

which could potentially be slightly smaller by using a build stage (no compilers / dev packages) in the final image. And one with all optional dependencies including ipython/jupyterlab, with a build stage to further shrink the image size by not having the build requirements in the runtime, is ~700 MB, which is still a lot but considerably smaller than the 3 GB using conda.

conda is a good tool (most of the time) to get binary packages without too much hassle. But docker provides the reproducibility that makes building the binaries a non-issue; most of the packages come as wheels anyway.

Docker image with all deps and using a build stage:

```dockerfile
FROM python:3.9-slim-buster AS BUILDER

# gcc, libc-dev needed to build regions and sherpa.
# g++, bison, flex, make needed for sherpa.
RUN apt update && apt install --yes --no-install-recommends \
    gcc \
    g++ \
    libc-dev \
    bison \
    flex \
    make

RUN useradd -m gammapy
USER gammapy

# regions 0.4 does not provide a pyproject.toml or a wheel and
# needs numpy, cython to be present at install time.
# This is fixed in current master, so this extra command should be removed
# when the next version of regions is released.
RUN python3 -m pip install --no-cache-dir --no-warn-script-location Cython numpy \
    && pip install --no-cache-dir --no-warn-script-location \
        gammapy \
        ipython \
        jupyterlab \
        matplotlib \
        pandas \
        healpy \
        iminuit \
        sherpa \
        naima \
        emcee \
        corner \
        parfive \
    && rm -rf /home/gammapy/.cache/pip

# second stage to save space
FROM python:3.9-slim-buster

RUN useradd -m gammapy
USER gammapy
WORKDIR /home/gammapy
ENV PATH="/home/gammapy/.local/bin:${PATH}"
COPY --from=BUILDER /home/gammapy/.local /home/gammapy/.local
```

Image sizes:

Running jupyter lab:
The docker image is not a priority now, I will postpone this to after v1.0.
This PR adds a `Dockerfile` so that a docker container may be automatically built and stored in the Gammapy Dockerhub for future tagged releases. In addition, there is also documentation on how to work with this docker container.