
Building Large Docker Images, Quickly

Sometimes, you've just got a big codebase. Maybe you want to put that codebase into a Docker image. Sure, you could mount it, but sometimes that's either not an option or would be too much trouble. In those cases, it's nice to be able to cram your big ol' codebase into images quickly.

Overview

To accomplish this, we're going to stack 3 images on top of each other, each one adding a layer to the one before it. Of course, you're welcome to add as many as you want, or even take one away. We'll just use 3 for demo purposes – and because that's what's worked in my situation.

The first image will be what we call the 'base' image. It's the one that we're going to copy our code into. On top of that, we'll build an image with our dependencies – it's our 'dependency' image. Finally, we've got our 'incremental' image, which just has our code updates. And that's it. So three images: base, dependency, and incremental.

Luckily, the first two images don't have to build quickly. They're the images that we use to prepare to build the incremental image, so that the incremental build can be quick when we need it.

The Base Image

So the first image, the base image, has our codebase and any universal dependencies. The reason we don't want to put other dependencies in here is because we want to be able to use this 'base' image for any type of task that will require our code. For example, if we have JavaScript tests and, say, PHP tests, they'll probably require different dependencies to run. While we may have a huge codebase, we're still trying to stick to the idea that Docker images should be as small as possible.

This image is actually pretty simple to set up. Here's an example of a Dockerfile:

FROM centos:latest

RUN yum -y install patch && yum -y clean all

COPY repo /var/repo

ARG sha=unknown
LABEL sha=$sha

You'll notice that I'm installing the patch utility. That's because we're going to use it in the incremental image later on to apply a diff to our code. If you have any binary files in your codebase, you might want to install git instead, because git can handle binary diffs (generated with git diff --binary and applied with git apply), where patch can't.

Now, at the end there, we're doing something that you don't necessarily see in every Dockerfile. When we build this image, there are a few more things we should do. We're including the sha argument so that later on we can generate the right diff that we need to apply to get the latest code. We need to pass this in to the docker build command as a --build-arg, and this last bit of the Dockerfile will add that as a label to the image itself. You can see an example on Stack Overflow.
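To make the round trip concrete, here's a minimal sketch of passing the sha in at build time and reading it back off the image afterwards. The run, build_with_sha, and image_sha helpers are names I made up for illustration, and the DRY_RUN switch only exists so you can see the commands without running Docker:

```shell
#!/bin/bash
# Sketch: label an image with the commit it was built from, then read it back.
# Set DRY_RUN=1 to print the docker commands instead of running them.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# build_with_sha IMAGE SHA [CONTEXT]
# The Dockerfile's "ARG sha" receives SHA, and "LABEL sha=$sha" stores it.
build_with_sha() {
    run docker build --rm -t "$1" --build-arg "sha=$2" "${3:-.}"
}

# image_sha IMAGE: print the sha label stored on IMAGE.
image_sha() {
    run docker inspect --format '{{ index .Config.Labels "sha" }}' "$1"
}
```

With DRY_RUN=1 set, build_with_sha base-image:latest abc123 prints the docker build command it would have run. Without it, image_sha base-image:latest should print whatever sha the image was built with, or unknown if no --build-arg was passed.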

We also should avoid copying in parts of the codebase that we don't need in the image. For example, we probably don't need our .git/ folder in our Docker image. There are a couple of ways that we can accomplish this – we can either keep a bare clone of our repo and check the code out into a separate directory, or we can just delete the .git folder before we copy it in. I took the first approach, because it allows me to just update my repo and check things out again.

I used a bash script to handle all this, so that I don't have to remember everything and I don't have to worry about accidentally skipping things. Here's what my base image building script looks like, approximately:

#!/bin/bash

# This allows you to pass in an environment variable, but sets a default.
CODE_DIR=${CODE_DIR:-'/var/tmp/repo.git'}

# If the repo already exists, just update it. Otherwise, clone it.
if [ -d "$CODE_DIR" ]; then
    echo "Found an existing git clone... Fetching now to update it..."
    GIT_DIR=$CODE_DIR git fetch -q origin master:master
else
    echo "No clone found. Cloning the entire repo."
    git clone --mirror git@github.com:my/repo.git $CODE_DIR
fi

# This grabs the sha we'll be building
BUILD_VERSION=$(GIT_DIR=$CODE_DIR git rev-parse master)

# Start from an empty directory so it's a 'clean' checkout
# (rm -rf ./repo/* would leave old dotfiles behind)
rm -rf ./repo
mkdir -p ./repo

# Check out the code
GIT_DIR=$CODE_DIR GIT_WORK_TREE=./repo git checkout $BUILD_VERSION -f

# Build the image, using the current directory (which contains ./repo) as the build context
docker build --rm -t base-image:latest --build-arg sha=$BUILD_VERSION .

docker push base-image:latest

So, it's not super simple (if you want to skip the .git/ folder), but I think you get the idea. Next, we'll move on to the dependencies image.

The Dependencies Image

This image is really only necessary if you have more dependencies to install. In my particular case, I needed to install things like php, sqlite, etc. Next up, I installed the composer dependencies for my project. If you're using something other than PHP, you can install the dependencies through whatever package manager you're using – like bundler or npm.

My Dockerfile for this image looks a lot like this (the particular incantations you use will depend on the flavor of linux you're using, of course):

FROM base-image:latest

RUN yum -y install \
    php7 \
    sqlite \
    sqlite-devel \
    && yum -y clean all

WORKDIR /var/repo

RUN composer update

You'll probably notice in this image, we don't need to include the whole ARG sha=unknown thingy. That's because labels applied to a parent image are automatically passed to child images.

This image doesn't necessarily need to have a bash script to build it, but it all depends on what your dependencies look like. If you need to copy other information in, then you might just want one. In my case, I have one, but it's pretty similar to the previous script, so I won't bother to put it here.

The Incremental Image

Now for the fun part. Our incremental image is what needs to build quickly – and we're set up to do just that. This takes a bit of scripting, but it's not super complicated.

Here's what we're going to do when we build this image:

  1. Update/clone our git repo
  2. Figure out the latest sha for our repo
  3. Generate a diff from our original sha (the one the base image has baked-in) and the new sha
  4. When building the image, copy in and apply the diff to our code
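Step 3 only works if the label we read off the base image is a real commit sha. If the image was built without the --build-arg, the label will be unknown, and if the label is missing entirely, jq prints null. Here's a small guard, as a sketch, that you could run before generating the diff. The is_commit_sha and require_sha names are hypothetical, and I'm assuming full 40-character shas like the ones git rev-parse prints:

```shell
#!/bin/bash
# is_commit_sha VALUE: succeed only for a full 40-character hex sha.
is_commit_sha() {
    [[ "$1" =~ ^[0-9a-f]{40}$ ]]
}

# require_sha NAME VALUE: complain on stderr and fail if VALUE isn't a sha.
require_sha() {
    if ! is_commit_sha "$2"; then
        echo "Refusing to diff: $1 is '$2', not a commit sha" >&2
        return 1
    fi
}
```

In a build script, that would look something like require_sha BASE_VERSION "$BASE_VERSION" || exit 1, placed right before the git diff.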

To handle all of this, I highly recommend using a shell script. Here's a sample of the script that I'm using to handle the build (with some repetition from the script above):

#!/bin/bash

BASE_IMAGE_NAME=dependency-image
BASE_IMAGE_TAG=latest

# This allows you to pass in an environment variable, but sets a default.
CODE_DIR=${CODE_DIR:-'/var/tmp/repo.git'}

# If the repo already exists, just update it. Otherwise, clone it.
if [ -d "$CODE_DIR" ]; then
    echo "Found an existing git clone... Fetching now to update it..."
    GIT_DIR=$CODE_DIR git fetch -q origin master:master
else
    echo "No clone found. Cloning the entire repo."
    git clone --mirror git@github.com:my/repo.git $CODE_DIR
fi

# Get the latest commit sha from GitHub, using jq -r to print it without quotes
# (GitHub Enterprise serves the same API under https://<hostname>/api/v3/)
echo "Fetching the current commit sha from GitHub..."
BUILD_VERSION=$(curl -s "https://api.github.com/repos/my/repo/commits" | jq -r '.[0].sha')

# Generate a diff from the sha baked into the base image
docker pull $BASE_IMAGE_NAME:$BASE_IMAGE_TAG
BASE_VERSION=$(docker inspect $BASE_IMAGE_NAME:$BASE_IMAGE_TAG | jq -r '.[0].Config.Labels.sha')
GIT_DIR=$CODE_DIR git diff $BASE_VERSION..$BUILD_VERSION > patch.diff

# Build the image, using the current directory (which contains patch.diff) as the build context
docker build --rm -t incremental-image:latest -t incremental-image:$BUILD_VERSION --build-arg sha=$BUILD_VERSION .

# And push both tags!
docker push incremental-image:latest
docker push incremental-image:$BUILD_VERSION

There are a few things of note here: I'm using jq on my machine to parse out JSON results. I'm fetching the latest sha directly from GitHub, but I could just as easily use a local git command to get it. I'm also passing in the --build-arg, just like we did for our base image, so that we can use it in the Dockerfile as an environment variable and to set a new label for the image.

On that note, here's a sample Dockerfile:

FROM dependency-image:latest

ARG sha=unknown
ENV sha=$sha
LABEL sha=$sha

COPY patch.diff /var/repo/patch.diff
# git diff prefixes paths with a/ and b/, so strip one leading component
RUN patch -p1 < patch.diff

RUN composer update

CMD ["run-my-tests"]

And that's it! In my experience, this is pretty quick to run – it takes me about a minute, which is a lot faster than the 6+ minute build times I was seeing when I built the entire image every time.

Assumptions

I'm making some definite assumptions here. First, I'm assuming that you have a builder where you have git, jq, and docker installed. I'm also assuming that you can build your base and dependency images about once a day without time constraints. I have them build on a cron job at midnight. Throughout the day, as people make commits, I build the incremental image.

Conclusion

This is a fairly straightforward method to build up-to-date images with code baked in quickly. Well, quickly relative to copying in the entire codebase every time.

I don't recommend this method if building your images quickly isn't a priority. In my case, we're trying to build these images and run our tests in under 5 minutes – which meant that a 5 minute image build time obviously wasn't acceptable.

Alphabetic Filtering with Regex

Last week, I found myself needing to filter things alphabetically, using regex. Basically, this is because PHPUnit lets you filter what tests you run with regex, and we (we being Etsy) have enough tests that we have to split them into many parts to get them to run in a reasonable amount of time.

We've already got some logical splits, like separating unit tests from db-related integration tests and all that jazz. But at some point, you just need to split a test suite into, say, 6 pieces. When the option you have to do this is regex, well, then you just have to split it out by name.

Splitting Tests by Name? That's not Smart.

No, no it's not. But until we have a better solution (which I'll talk about in a later post), this is what we're stuck with. Originally, we just split the alphabet into the number of pieces we required and ran the tests that way. Of course, this doesn't result in even remotely similar runtimes on your test suites.

Anyway, since we're splitting things up by runtime but we're stuck with using regex, we might as well use alphabetic sorting. That'll result in relatively short regular expressions compared to just creating a list of tests.

To figure out where in the alphabet to make the splits for our tests, I downloaded all of our test results for a specific test suite and ran them through a parser that could handle JUnit-style output (XML files with the test results). I converted them into CSVs, and then loaded them into a Google Spreadsheet:

Tests in Google Sheets

This made it trivial to figure out where to split the tests alphabetically to even out the runtimes. The problem was, the places where it made sense to split our tests weren't the easiest places to create an alphabetic split. While it would've been nice if the ranges had been, say, A-Cd or Ce-Fa, instead they decided to be things like A-Api_User_Account_Test or Shop_Listings_O-Transaction_User_B.
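If you'd rather not eyeball a spreadsheet, the same idea can be sketched in a few lines of shell. This isn't what I actually ran; it assumes a CSV with one test_name,seconds pair per line (and no commas in the names), and split_points is just a name I picked:

```shell
#!/bin/bash
# split_points PARTS: read "test_name,seconds" lines on stdin and print the
# test name at which each new chunk should start, so that the chunks have
# roughly equal total runtime. Prints at most PARTS - 1 names.
split_points() {
    local parts="$1"
    # Sort by name, ignoring case, so the chunks are alphabetic ranges.
    LC_ALL=C sort -f -t, -k1,1 | awk -F, -v parts="$parts" '
        { name[NR] = $1; secs[NR] = $2; total += $2 }
        END {
            target = total / parts      # ideal runtime per chunk
            next_cut = target
            for (i = 1; i <= NR; i++) {
                # Start a new chunk once the tests so far fill the current one.
                if (running >= next_cut && cuts < parts - 1) {
                    print name[i]
                    cuts++
                    next_cut += target
                }
                running += secs[i]
            }
        }'
}
```

Something like split_points 6 < results.csv would then print up to five test names, each one marking where the next chunk begins. It's a greedy split, so the chunks won't be perfectly even, but it gets you in the neighborhood.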

It's not easy to turn that into regex, but there is at least a pattern to it. I originally tried creating the regex myself – but quickly got in over my head. After about 100 characters in my regex, my brain was fried.

I decided that it'd be easier, quicker, and less error-prone to write a script that could handle it for me.

Identifying the Pattern

It's really quite simple once you break it down. To find if a String is between A and Bob (not including Bob itself), you need a String that meets the following conditions:

  • Starts with A or a, OR
  • Starts with B or b, AND:
    • The string ends there, OR
    • The second character is A-N or a-n, OR
    • The second character is O or o, AND:
      • The string ends there, OR
      • The third character is A or a

In a normal ol' regular expression, this looks like the following (ignoring all special characters):

^[Aa].*|^([Bb]([A-Na-n]|[Oo]([Aa]|$)|$)).*$
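A pattern like this is easy to get subtly wrong, so it's worth running it against a handful of names. Here's a minimal sketch using bash's built-in regex matching. The before_bob function and the sample names are made up for illustration, and the pattern is a slightly regrouped way of writing the A-to-Bob rule, with the second-letter range spelled out as [A-Na-n]:

```shell
#!/bin/bash
# Matches names that start with A, or that start with B and sort before "Bob".
BEFORE_BOB='^([Aa].*|[Bb]([A-Na-n].*|[Oo]([Aa].*)?)?)$'

# before_bob NAME: succeed if NAME falls in the range A up to (not including) Bob.
before_bob() {
    [[ "$1" =~ $BEFORE_BOB ]]
}

for name in Apple B Bo Boat Bn Bob Bobby Cat; do
    if before_bob "$name"; then
        echo "$name: in range"
    else
        echo "$name: out of range"
    fi
done
```

The first five names should come back in range, and Bob, Bobby, and Cat should come back out of range.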

Now, if we've got something that complicated just to find something up to Bob, you can likely figure out that the rule would get much longer if you have many characters, like Beta_Uploader_Test_Runner.

There's a recognizable pattern, but once again, it's pretty complex and hard for my weak human brain to grok when it gets long. Luckily, this is what computers are very, very good at.

Formulating the Regex

To get a range between two alphabetic options, you generally need 3 different regex rules. Let's say we're looking for the range Super-Whale. First, you need the range from the starting point to the last alphabetic option that still starts with the same letter. So, essentially, you need Super-Sz. The second thing you need is anything that starts with a letter between the first letter of the starting point and the first letter of the end point. So our middle range would be T-V. The last part needs to be W-Whale.

By way of an example, here's a simpler version of the first part – in this case, it's Hey-Hz, color-coded so that you can see which letter applies to which part of the regular expression:

Next up, we're using the same word as if it were the last part. In this case, H-Hey:

Since the middle part is super simple, I won't bother detailing that. With those three elements, we've got our regex range. Of course, there are some more details around edge cases and whatnot, but we'll ignore those for now. It's much simpler for the purposes of blog posts.
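To show how the three parts fit together, here's a rough sketch of a generator in bash. To be clear, this is not the code from my library. It's a simplified illustration that assumes plain letters only, a start and end that begin with different letters, and an end word of at least two letters. All of the function names are made up for the example:

```shell
#!/bin/bash
# Sketch only: handles plain letters A-Z, case-insensitively.
ALPHABET=ABCDEFGHIJKLMNOPQRSTUVWXYZ

# index_of LETTER: position of an uppercase letter in the alphabet (A is 0).
index_of() {
    local before="${ALPHABET%%$1*}"
    printf '%s' "${#before}"
}

# letters LO HI: bracket expression matching alphabet positions LO..HI in
# either case, or nothing at all when the range is empty.
letters() {
    local lo="$1" hi="$2" upper
    [ "$lo" -le "$hi" ] || return 0
    upper="${ALPHABET:$lo:1}"
    [ "$lo" -eq "$hi" ] || upper="${upper}-${ALPHABET:$hi:1}"
    printf '[%s%s]' "$upper" "$(printf '%s' "$upper" | tr 'A-Z' 'a-z')"
}

# at_or_after WORD: pattern for WORD itself and everything sorting after it.
at_or_after() {
    local word="$1" i higher same
    [ -n "$word" ] || { printf '.*'; return 0; }
    i=$(index_of "${word:0:1}")
    higher=$(letters $((i + 1)) 25)
    same="$(letters "$i" "$i")$(at_or_after "${word:1}")"
    if [ -n "$higher" ]; then
        printf '(%s.*|%s)' "$higher" "$same"
    else
        printf '%s' "$same"
    fi
}

# before WORD: pattern for everything sorting strictly before WORD.
before() {
    local word="$1" i lower same=""
    i=$(index_of "${word:0:1}")
    lower=$(letters 0 $((i - 1)))
    if [ "${#word}" -gt 1 ]; then
        same="$(letters "$i" "$i")$(before "${word:1}")"
    fi
    if [ -n "$lower" ] && [ -n "$same" ]; then
        printf '(%s.*|%s)?' "$lower" "$same"
    elif [ -n "$lower" ]; then
        printf '(%s.*)?' "$lower"
    elif [ -n "$same" ]; then
        printf '(%s)?' "$same"
    fi
}

# range_regex START END: anchored pattern for START <= name < END, built from
# the three parts: rest of START's letter, the letters in between, and END's
# letter up to END.
range_regex() {
    local start end s e middle
    start=$(printf '%s' "$1" | tr 'a-z' 'A-Z')
    end=$(printf '%s' "$2" | tr 'a-z' 'A-Z')
    s=$(index_of "${start:0:1}")
    e=$(index_of "${end:0:1}")
    middle=$(letters $((s + 1)) $((e - 1)))
    printf '^(%s%s|' "$(letters "$s" "$s")" "$(at_or_after "${start:1}")"
    if [ -n "$middle" ]; then printf '%s.*|' "$middle"; fi
    printf '%s%s)$' "$(letters "$e" "$e")" "$(before "${end:1}")"
}

range_regex Super Whale; echo
```

You can try the result with something like [[ "Tuna" =~ $(range_regex Super Whale) ]]. For the Hey-Hz example, the part after the H comes from at_or_after EY, which gives ([F-Zf-z].*|[Ee]([Zz].*|[Yy].*)). The nesting gets one level deeper for every letter in the word, which is exactly why I didn't want to write these by hand.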

Doing some Test-Driven Development

I decided that the best way to make this, you know, actually work, was to write a bunch of tests that would cover many of the edge cases that I could hit. I needed to make sure that these would work, and writing a bunch of tests is a good way to do so.

This helped me know exactly what was going wrong, and I wrote more tests as I kept writing the code. For every method that I wrote, I wrote tests to go along with it. If I realized that I had missed any necessary tests, I added them in, too.

Overall, I'd say this significantly increased my development speed, and it definitely helped me be more confident in my code. Tests are good. Don't let anyone tell you otherwise.

The Code on GitHub

Of course, it doesn't make sense to restrict this code to just me. I couldn't find any good libraries to handle this for me, so I wrote it myself. But really, it only makes sense to make this available to a wider audience.

I've still got some issues to work out, and I need to make a Ruby gem out of it, but in the meantime, feel free to play around with the code: https://github.com/russtaylor/alphabetic-regex

I'm really hoping that someone else will find this code to be useful. If anyone has any questions, comments, or suggestions, feel free to let me know!

© 2019 russt
