Building Large Docker Images, Quickly

Sometimes, you've just got a big codebase. Maybe you want to put that codebase into a Docker image. Sure, you could mount it, but sometimes that's either not an option or would be too much trouble. In those cases, it's nice to be able to cram your big ol' codebase into images quickly.

Overview

To accomplish this, we're going to build three images, each one layered on top of the previous one. Of course, you're welcome to add as many as you want, or even take one away. We'll just use three for demo purposes, and because that's what's worked in my situation.

The first image will be what we call the 'base' image. It's the one that we're going to copy our code into. On top of that, we'll build an image with our dependencies – it's our 'dependency' image. Finally, we've got our 'incremental' image, which just has our code updates. And that's it. So three images: base, dependency, and incremental.

Luckily, the first two images don't have to build quickly. They're the images that we use to prepare to build the incremental image, so that the incremental build can be quick when we need it.

The Base Image

So the first image, the base image, has our codebase and any universal dependencies. We don't put other dependencies in here because we want to be able to use this 'base' image for any type of task that requires our code. For example, if we have JavaScript tests and, say, PHP tests, they'll probably require different dependencies to run. While we may have a huge codebase, we're still trying to stick to the idea that Docker images should be as small as possible.

This image is actually pretty simple to set up. Here's an example of a Dockerfile:

FROM centos:latest

RUN yum -y install patch && yum -y clean all

COPY repo /var/repo

ARG sha=unknown
LABEL sha=$sha

You'll notice that I'm installing the patch utility. That's because we're going to use it in the incremental image later on to apply a diff to our code. If you have any binary files in your codebase, you might want to install git instead, because git can produce and apply binary diffs, which patch can't.
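If you do go the git route, the diff and apply steps might look something like this. It's a minimal sketch, not what my scripts actually do: the function names are hypothetical, and only `git diff --binary` and `git apply` are the real commands doing the work.

```shell
#!/bin/bash

# Hypothetical helper: write a binary-safe patch between two commits.
# Args: <git dir> <old sha> <new sha> <output file>
make_binary_patch() {
    git --git-dir="$1" diff --binary "$2" "$3" > "$4"
}

# Hypothetical helper: apply a patch to a plain directory (no .git needed),
# such as /var/repo inside the image.
# Args: <target dir> <absolute path to the patch file>
apply_binary_patch() {
    (cd "$1" && git apply "$2")
}
```

On the builder you'd call something like `make_binary_patch "$CODE_DIR" "$BASE_VERSION" "$BUILD_VERSION" patch.diff`, and in the Dockerfile the apply step would become `RUN git apply patch.diff` instead of a call to patch.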

Now, at the end there, we're doing something that you don't necessarily see in every Dockerfile. We're including the sha argument so that later on we can generate the diff that we need to apply to get the latest code. We pass the sha to the docker build command as a --build-arg, and the last two lines of the Dockerfile add it as a label on the image itself. You can see an example on Stack Overflow. There are also a few more things we should do when we build this image.
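To make the sha round trip concrete, here's a minimal sketch of the two halves: baking the sha in at build time and reading it back later. The function names are mine and purely illustrative, and it reads the label with docker inspect's `--format` flag rather than jq, which is just an alternative to what I use further down.

```shell
#!/bin/bash

# Hypothetical helper: build an image and pass the sha in as a build arg,
# so the Dockerfile's ARG/LABEL lines can store it on the image.
# Args: <image:tag> <sha> <build context dir>
build_with_sha() {
    docker build --rm -t "$1" --build-arg "sha=$2" "$3"
}

# Hypothetical helper: print the sha label stored on an image.
# Args: <image:tag>
image_sha() {
    docker inspect --format '{{ index .Config.Labels "sha" }}' "$1"
}
```

For example, `build_with_sha base-image:latest "$BUILD_VERSION" .` followed by `image_sha base-image:latest` should print the same sha you passed in.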

We should also avoid copying in parts of the codebase that we don't need in the image. For example, we probably don't need our .git/ folder in our Docker image. There are a couple of ways to accomplish this: we can either keep a bare (--mirror) clone of our repo and check the code out into a separate directory, or we can just delete the .git folder before we copy the code in. I took the first approach, because it allows me to just update my clone and check things out again.

I used a bash script to handle all this, so that I don't have to remember everything and I don't have to worry about accidentally skipping things. Here's what my base image building script looks like, approximately:

#!/bin/bash

# This allows you to pass in an environment variable, but sets a default.
CODE_DIR=${CODE_DIR:-'/var/tmp/repo.git'}

# If the repo already exists, just update it. Otherwise, clone it.
if [ -d "$CODE_DIR" ]; then
    echo "Found an existing git clone... Fetching now to update it..."
    GIT_DIR=$CODE_DIR git fetch -q origin master:master
else
    echo "No clone found. Cloning the entire repo."
    git clone --mirror git@github.com:my/repo.git "$CODE_DIR"
fi

# This grabs the sha we'll be building
BUILD_VERSION=$(GIT_DIR=$CODE_DIR git rev-parse master)

mkdir -p ./repo

# We clean the old directory to make sure it's a 'clean' checkout
rm -rf ./repo/*

# Check out the code
GIT_DIR=$CODE_DIR GIT_WORK_TREE=./repo git checkout $BUILD_VERSION -f

# Build the image
docker build --rm -t base-image:latest --build-arg sha="$BUILD_VERSION" .

docker push base-image:latest

So, it's not super simple (if you want to skip the .git/ folder), but I think you get the idea. Next, we'll move on to the dependencies image.

The Dependencies Image

This image is really only necessary if you have more dependencies to install. In my particular case, I needed to install things like php, sqlite, etc. Next up, I installed the composer dependencies for my project. If you're using something other than PHP, you can install the dependencies through whatever package manager you're using – like bundler or npm.

My Dockerfile for this image looks a lot like this (the particular incantations you use will depend on the flavor of linux you're using, of course):

FROM base-image:latest

RUN yum -y install \
    php7 \
    sqlite \
    sqlite-devel \
    && yum -y clean all

WORKDIR /var/repo

RUN composer update

You'll probably notice in this image, we don't need to include the whole ARG sha=unknown thingy. That's because labels applied to a parent image are automatically passed to child images.

This image doesn't necessarily need to have a bash script to build it, but it all depends on what your dependencies look like. If you need to copy other information in, then you might just want one. In my case, I have one, but it's pretty similar to the previous script, so I won't bother to put it here.

The Incremental Image

Now for the fun part. Our incremental image is what needs to build quickly – and we're set up to do just that. This takes a bit of scripting, but it's not super complicated.

Here's what we're going to do when we build this image:

Update/clone our git repo
Figure out the latest sha for our repo
Generate a diff between our original sha (the one the base image has baked in) and the new sha
When building the image, copy in and apply the diff to our code

To handle all of this, I highly recommend using a shell script. Here's a sample of the script that I'm using to handle the build (with some repetition from the script above):

#!/bin/bash

BASE_IMAGE_NAME=dependency-image
BASE_IMAGE_TAG=latest

# This allows you to pass in an environment variable, but sets a default.
CODE_DIR=${CODE_DIR:-'/var/tmp/repo.git'}

# If the repo already exists, just update it. Otherwise, clone it.
if [ -d "$CODE_DIR" ]; then
    echo "Found an existing git clone... Fetching now to update it..."
    GIT_DIR=$CODE_DIR git fetch -q origin master:master
else
    echo "No clone found. Cloning the entire repo."
    git clone --mirror git@github.com:my/repo.git "$CODE_DIR"
fi

# Get the latest commit sha from GitHub, use jq to parse it (-r prints the raw string)
echo "Fetching the current commit sha from GitHub..."
BUILD_VERSION=$(curl -s "https://github.com/api/v3/repos/my/repo/commits" | jq -r '.[0].sha')

# Generate a diff from the base image
docker pull "$BASE_IMAGE_NAME:$BASE_IMAGE_TAG"
BASE_VERSION=$(docker inspect "$BASE_IMAGE_NAME:$BASE_IMAGE_TAG" | jq '.[0].Config.Labels.sha' | tr -d '"')
GIT_DIR=$CODE_DIR git diff "$BASE_VERSION..$BUILD_VERSION" > patch.diff

# Build the image
docker build --rm -t incremental-image:latest -t "incremental-image:$BUILD_VERSION" --build-arg sha="$BUILD_VERSION" .

# And push both tags!
docker push incremental-image:latest
docker push incremental-image:$BUILD_VERSION

There are a few things of note here: I'm using jq on my machine to parse out JSON results. I'm fetching the latest sha directly from GitHub, but I could just as easily use a local git command to get it. I'm also passing in the --build-arg, just like we did for our base image, so that we can use it in the Dockerfile as an environment variable and to set a new label for the image.
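For the local alternative, a minimal sketch might look like the following. The function name is hypothetical, and it assumes the branch you care about is master unless you say otherwise.

```shell
#!/bin/bash

# Hypothetical helper: print the sha a branch points to in a local
# (bare or mirror) clone.
# Args: <git dir> [branch, defaults to master]
local_branch_sha() {
    git --git-dir="$1" rev-parse --verify "${2:-master}^{commit}"
}
```

You'd use it as `BUILD_VERSION=$(local_branch_sha "$CODE_DIR")`. One possible advantage is that the sha is guaranteed to exist in the clone you're about to diff, whereas a sha from the API could be newer than your last fetch.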

On that note, here's a sample Dockerfile:

FROM dependency-image:latest

ARG sha=unknown
ENV sha=$sha
LABEL sha=$sha

COPY patch.diff /var/repo/patch.diff
RUN patch -p1 < patch.diff

RUN composer update

CMD ["run-my-tests"]

And that's it! In my experience, this is pretty quick to run – it takes me about a minute, which is a lot faster than the 6+ minute build times I was seeing when I built the entire image every time.

Assumptions

I'm making some definite assumptions here. First, I'm assuming that you have a builder where you have git, jq, and docker installed. I'm also assuming that you can build your base and dependency images about once a day without time constraints. I have them build from a cron job at midnight. Throughout the day, as people make commits, I build the incremental image.

Conclusion

This is a fairly straightforward method for quickly building up-to-date images with code baked in. Well, quickly relative to copying in the entire codebase every time.

I don't recommend this method if building your images quickly isn't a priority. In my case, we're trying to build these images and run our tests in under 5 minutes – which meant that a 5 minute image build time obviously wasn't acceptable.