Overview
This post briefly shows how to create reproducible environments for data science workflows using a combination of Conda and Docker. The basic idea is to create a Conda environment for local development then reliably package the environment in a Docker image for reproducibility during code deployments.
Capture local Conda environment using conda-lock
The key to generating a reproducible development environment is to capture the Conda dependencies using conda-lock. conda-lock
is a lightweight library for generating reproducible lockfiles that recreate environments on multiple operating systems (Linux, Windows, OSX).
# create conda environment
conda create --name my-env python=3.9
# add conda-forge channel (required to install conda-lock)
conda config --env --add channels conda-forge
# install conda packages including conda-lock
conda install <pkgs>
conda install -c conda-forge conda-lock
# export environment to environment.yml
conda env export --from-history > environment.yml
# generate lockfiles for linux, windows, and osx
conda-lock
Create Docker image from Dockerfile specifying the Conda environment
To reproduce the Conda environment in a Docker image, recreate the Conda environment from the lockfile generated in the previous step.
# base image
FROM continuumio/miniconda3
# copy the lockfile into the image
COPY conda-linux-64.lock conda-linux-64.lock
# restore the conda environment from the lockfile
RUN conda create --name my-env --file conda-linux-64.lock
# copy script into image
COPY my-script.py
# run script when container starts
ENTRYPOINT ["conda", "run", "--no-capture-output", "-n", "my-env", "python", "my-script.py"]