
Adds Dockerfile to improve reproducibility #3339

Open · wants to merge 6 commits into main

Conversation

@Bultako (Member) commented May 4, 2021

This PR adds a Dockerfile so that a Docker container can be built automatically and stored on the Gammapy Docker Hub for future tagged releases. It also adds documentation on how to work with this container.
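For context, working with such a container would look roughly like the following; the gammapy/gammapy image name and the tag are placeholders used for illustration, not the final Docker Hub coordinates:

# hypothetical usage sketch; image name and tag are placeholders
docker pull gammapy/gammapy:0.18.2
docker run --rm -it -p 8888:8888 gammapy/gammapy:0.18.2 jupyter lab --no-browser --ip 0.0.0.0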

@codecov codecov bot commented May 4, 2021

Codecov Report

Merging #3339 (a5ae24f) into master (82b514f) will increase coverage by 0.06%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #3339      +/-   ##
==========================================
+ Coverage   93.79%   93.86%   +0.06%     
==========================================
  Files         144      144              
  Lines       18103    19104    +1001     
==========================================
+ Hits        16980    17932     +952     
- Misses       1123     1172      +49     
Impacted Files Coverage Δ
gammapy/makers/safe.py 89.20% <0.00%> (-1.82%) ⬇️
gammapy/maps/core.py 87.71% <0.00%> (+0.02%) ⬆️
gammapy/maps/hpx.py 88.53% <0.00%> (+0.08%) ⬆️
gammapy/maps/wcsmap.py 96.61% <0.00%> (+0.21%) ⬆️
gammapy/maps/wcsnd.py 91.87% <0.00%> (+0.79%) ⬆️
gammapy/modeling/models/spectral.py 97.60% <0.00%> (+0.85%) ⬆️
gammapy/maps/hpxnd.py 96.85% <0.00%> (+1.84%) ⬆️

Continue to review the full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 82b514f...a5ae24f.

@adonath adonath self-assigned this May 4, 2021
@adonath adonath added this to the v0.19 milestone May 4, 2021
@adonath adonath added this to In progress in MAINTAIN via automation May 4, 2021
@adonath (Member) commented May 4, 2021

Thanks @Bultako! Do we still need the old docker? Probably not...

@Bultako (Member, Author) commented May 5, 2021

> Thanks @Bultako! Do we still need the old docker? Probably not...

No, we do not need it: not only do we no longer use Travis for the CI, but I also get an error when trying to run that Dockerfile. Do you agree to remove the dev/docker folder?

@adonath (Member) commented May 5, 2021

@Bultako Yes, please remove!

@Bultako (Member, Author) commented May 5, 2021

Ok, done!

@@ -0,0 +1,29 @@
# gammapy docker container
FROM continuumio/miniconda3:latest
Member:

Using miniconda inside docker will result in much larger docker images than, say, using ubuntu 20.04 and installing the system Python (which is 3.8).
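A minimal sketch of that alternative, assuming the usual Debian/Ubuntu package names (untested here):

# sketch: ubuntu 20.04 base using the system Python 3.8 instead of miniconda
FROM ubuntu:20.04

RUN apt-get update \
    && apt-get install -y --no-install-recommends python3 python3-pip \
    && rm -rf /var/lib/apt/lists/*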

Member Author:

The installation process relies on first building the conda environment described in environment-dev.yml, so the conda command is needed.

Dockerfile Outdated
@@ -0,0 +1,29 @@
# gammapy docker container
FROM continuumio/miniconda3:latest
RUN apt-get update && apt-get install -y build-essential python3-pip
Member:

What do you need build-essential and pip for when using conda?

Also, please follow Docker image best practices, in particular cleaning up after using apt: https://docs.docker.com/develop/develop-images/dockerfile_best-practices/#run
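For reference, the pattern from the linked best practices looks roughly like this (the package list is just the one from the diff above):

# combine update, install and cleanup in one layer so the apt cache
# does not get baked into the image
RUN apt-get update \
    && apt-get install -y --no-install-recommends build-essential python3-pip \
    && rm -rf /var/lib/apt/lists/*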

Member Author:

You're right, we can remove build-essential, but pip is needed to do a development install, as justified in the comments below.

Member:

But pip comes with conda; there is no need to install pip into the system.

Member Author (@Bultako, May 5, 2021):

Yes, but I get an error when running pip install -e . in the docker build, which does not happen if I first install pip with apt-get.

Member:

Then you are installing gammapy into the system Python, not into the conda env.
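One way to make sure the development install ends up in the env rather than in the system Python is to call the env's own pip explicitly; the path below assumes the default /opt/conda prefix of the continuumio/miniconda3 image:

# use the pip that lives inside the gammapy conda env created above
RUN /opt/conda/envs/gammapy/bin/pip install -e .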

Dockerfile Outdated
# create conda env
RUN conda install -c conda-forge mamba
RUN mamba env create -n gammapy -f environment-dev.yml
RUN echo 'source activate gammapy' >> ~/.bashrc
Member:

These RUN commands should also be combined into a single RUN statement, with conda clean --all --yes called at the end.
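Applied to the lines above, that would look roughly like this:

# single layer: install mamba, create the env and clean all caches so that
# the downloaded package tarballs are not kept in the image
RUN conda install -c conda-forge mamba \
    && mamba env create -n gammapy -f environment-dev.yml \
    && echo 'source activate gammapy' >> ~/.bashrc \
    && conda clean --all --yes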

Member Author:

Ok, done! Thanks!

ENV PATH /opt/conda/envs/gammapy/bin:$PATH

# install gammapy
RUN pip install -e .
Member:

If this is for reproducibility, wouldn't it make more sense to just install a specific version of gammapy and specific versions of all needed packages instead of doing a development install?
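A sketch of that approach; the pinned version and the lock file name are only examples:

# install a fixed released version with pinned dependencies instead of the
# working copy; requirements-lock.txt is a hypothetical pin file
COPY requirements-lock.txt .
RUN pip install --no-cache-dir -r requirements-lock.txt \
    && pip install --no-cache-dir gammapy==0.18.2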

Member Author:

The main idea is that Docker Hub could automatically trigger the build of an image when we tag a release on GitHub, hence before gammapy is available in the conda repositories. The dependencies are described in the environment-dev.yml file, which is also used to build the execution environment in the GitHub Actions CI. Maybe we could avoid installing the dev and test dependencies, but I think adding the complexity of parsing this file to save a few hundred MB out of 3 GB is not worth the effort.

RUN pip install -e .

# download datasets
RUN gammapy download datasets
Member:

Why are the datasets needed in the docker image?

Doesn't that just increase the docker image size by a lot?

Member Author:

The datasets are 192 MB out of the 3 GB total image size. We ship the datasets within the docker image together with the Gammapy code and tutorials, to provide the possibility to play with the tutorials in the shipped JupyterLab. See here.
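For the tutorials to find the shipped data, the image also needs to expose the download location through the GAMMAPY_DATA environment variable; the path below is an assumption for illustration, not necessarily what this PR uses:

# download the tutorial datasets into the image and point GAMMAPY_DATA at them
RUN gammapy download datasets
ENV GAMMAPY_DATA=/gammapy-datasets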

Member:

The image is 3 GB??? That is HUGE

Member Author (@Bultako, May 5, 2021):

Yes, find some numbers below.

Environment

$ du -hs /opt/miniconda3/envs/gammapy-dev
2,1G	/opt/miniconda3/envs/gammapy-dev

Code

$ du -hs ~/git/gammapy/gammapy
286M	gammapy

Datasets

$ du -hs gammapy-datasets/
184M	gammapy-datasets/

Member:

This is interesting, I have never checked what the actual size of the dev environment is. Do you have any idea why it is so large?

Member:

Conda envs get huge fast since they duplicate many system libraries and include e.g. MKL by default.

Adding nomkl reduces the size of the conda env quite a bit.
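That could be done by adding nomkl to environment-dev.yml, or directly in the image, roughly like this (untested sketch):

# swap the MKL-based BLAS/LAPACK stack for the much smaller non-MKL builds
RUN mamba install -n gammapy -c conda-forge nomkl \
    && conda clean --all --yes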

Member Author:

Just for info:

$ du -hs /opt/miniconda3/envs/gammapy-0.18.2
1,2G	/opt/miniconda3/envs/gammapy-0.18.2

@maxnoe (Member) commented May 5, 2021

I think we should discuss a bit what the use case for this docker container should be.

If the use case is to provide an exactly reproducible software environment for analysis, the goal should be to make this image as small as possible so it can be used quickly on clusters and the like.

I don't really see a use case for providing a docker image to run tutorials. This is an anti-pattern. People wanting to try out gammapy and do the tutorials should not need to do that in a docker container. It adds many additional hurdles and doesn't solve a problem other than "My python setup is messed up". The chances to "mess up" only increase with docker.

So if the goal is to provide a containerized environment for running analyses reproducibly, no data should be added and only the bare dependencies, e.g. not jupyter lab.

@Bultako (Member, Author) commented May 5, 2021

As I have understood it from internal discussions, the use case is mostly archive-oriented, in the sense of archiving an exact replica of the development environment, code and datasets used at the moment of a release. So the goal is not only to provide reproducible software in the long term, but also the exact environment used to develop it at that time.

Sure, you're very welcome to contribute to this discussion internally :)

@maxnoe (Member) commented May 6, 2021

Just for comparison, a Docker image starting from python:3.9-slim-buster and only installing gammapy is "only" 513 MB:

❯ cat Dockerfile
FROM python:3.9-slim-buster

RUN apt update && apt install --yes --no-install-recommends \
	gcc \
	libc-dev

RUN useradd -m gammapy
USER gammapy

# regions 0.4 does not provide a pyproject.toml or a wheel and
# needs numpy and cython to be present at install time.
# This is fixed in current master, so this extra command should be removed
# when the next version of regions is released.
ENV PATH="/home/gammapy/.local/bin:${PATH}"
RUN python3 -m pip install --no-cache-dir Cython numpy \
	&& pip install --no-cache-dir gammapy \
	&& rm -rf /home/gammapy/.cache/pip

CMD ["/bin/bash", "-l"]

which could potentially be made slightly smaller by using a build stage, so that no compilers / dev packages end up in the final image.

And one with all optional dependencies, including ipython/jupyterlab, and a build stage to further shrink the image by not keeping the build requirements in the runtime is ~700 MB, which is still a lot but considerably smaller than the 3 GB using conda.

conda is a good tool (most of the time) to get binary packages without too much hassle. But docker provides the reproducibility that makes building the binaries a non-issue; most of the packages come as wheels anyway, the only exceptions being regions and sherpa.

Docker image with all deps and using a build stage:

FROM python:3.9-slim-buster AS BUILDER

# gcc, libc-dev needed to build regions and sherpa. g++, bison, flex, make needed for sherpa
RUN apt update && apt install --yes --no-install-recommends \
	gcc \
	g++ \
	libc-dev \
	bison \
	flex \
	make

RUN useradd -m gammapy
USER gammapy

# regions 0.4 does not provide a pyproject.toml or a wheel and
# needs numpy and cython to be present at install time.
# This is fixed in current master, so this extra command should be removed
# when the next version of regions is released.
RUN python3 -m pip install --no-cache-dir --no-warn-script-location Cython numpy \
	&& pip install --no-cache-dir --no-warn-script-location \
		gammapy \
		ipython \
		jupyterlab \
		matplotlib \
		pandas \
		healpy \
		iminuit \
		sherpa \
		naima \
		emcee \
		corner \
		parfive \
	&& rm -rf /home/gammapy/.cache/pip


# second stage to save space
FROM python:3.9-slim-buster

RUN useradd -m gammapy
USER gammapy
WORKDIR /home/gammapy
ENV PATH="/home/gammapy/.local/bin:${PATH}"

COPY --from=BUILDER /home/gammapy/.local /home/gammapy/.local

Image sizes:

gammapy-all                                          latest                 73993f41e023   About a minute ago   708MB
gammapy                                              latest                 32420a39fd95   39 minutes ago       513MB

Running jupyter lab:

docker run --rm -it -p8888:8888 gammapy-all jupyter lab --no-browser --ip 0.0.0.0  

@adonath adonath modified the milestones: v0.19, 1.0 May 18, 2021
@adonath adonath modified the milestones: 1.0rc, 1.0 May 5, 2022
@adonath (Member) commented Sep 20, 2022

The docker image is not a priority now, I will postpone this to after v1.0.

@adonath adonath modified the milestones: 1.0, 1.1 Sep 20, 2022
@registerrier registerrier modified the milestones: 1.1, 2.0 Apr 21, 2023