At work recently, we needed to generate diffs between two directory trees so that we could handle deploys. The diff has to be created after we’ve already generated assets, so we can’t just use git for it, since git diff doesn’t handle files that aren’t tracked by git itself. We looked into using GNU’s diffutils, but it can’t produce a usable patch for binary files.
We investigated other methods for deploying our code, but decided it would still be simplest if there was some way to generate just a ‘patch’ of what had changed.
Luckily, one of the Staff Engineers at Etsy happened to know that rsync had just such an option hiding in its very long man page. Because rsync handles transferring files from one place to another, whether it’s local or remote, it has to figure out the diffs between files anyway. It’s really nice that they’ve exposed it so that you can use the diffs themselves. The option that does this is called ‘Batch Mode’, because you can use it to ‘apply’ a diff on many machines after you’ve distributed the diff file.
Creating the Diff
To create the diff itself, you’ll need to first have two directories containing your folder structure – one with the ‘previous’ version and one with the ‘current’ version. In our case, after we run each deploy, we create a copy of the current directory so that we can use that as our previous version to build our next diff.
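Your rsync command will look a lot like this (the -a flag makes rsync recurse into the directories, and the trailing slashes tell it to sync the contents of each directory rather than the directory itself):
rsync -a --write-batch=diff /deploy/current/ /deploy/previous/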
Running that command will give you two files, diff and diff.sh. The .sh file is a small shell script containing the matching rsync --read-batch command. You can just use it to apply your diff, but you don’t have to. As long as you remember to use the same flags when applying your diff, you’ll be fine. You can also use any filename that you want after the =.
Also, it’s important to note that running this command will update /deploy/previous to the contents of /deploy/current. If you want to keep /deploy/previous as-is so that you can update it later, use --only-write-batch instead of just --write-batch.
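If you’d rather script this step, here’s a minimal sketch of the --only-write-batch variant. The make_batch name and its argument order are made up for illustration, and it assumes -a is the set of flags you want; it isn’t our actual deploy script.

```shell
#!/bin/sh
# Sketch: write a batch file without modifying the "previous" tree.
# make_batch is a hypothetical helper name, not something rsync provides.
make_batch() {
    current=$1    # tree holding the new version
    previous=$2   # tree holding the last deployed version
    batch=$3      # path of the batch file to write

    # Trailing slashes mean "the contents of" each directory.
    # --only-write-batch records the changes but leaves $previous untouched.
    rsync -a --only-write-batch="$batch" "$current/" "$previous/"
}
```

You’d call it as make_batch /deploy/current /deploy/previous /tmp/diff and then distribute /tmp/diff. Because the previous tree isn’t updated, you still need to refresh it yourself once the deploy has gone out.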
Applying the Diff
Next up, you’ll want to distribute your diff to whatever hosts are going to receive it. In our case, we’re uploading it to Google Cloud Storage, where all the hosts can just grab it as necessary.
On each host that’s applying the diff, you’ll want to just run something like the following:
rsync -a --read-batch=/path/to/diff /deploy/directory
Remember, you need to use the same flags when applying your diff as you did when you created your diff.
In our testing, this worked well for applying a diff to many hosts – updating around 400 hosts in just about 1 minute (including downloading the ~30MB diff file to each host).
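For the host side, a wrapper along these lines is one way to make failures visible to whatever is driving the deploy. It’s only a sketch: apply_batch is a hypothetical name, the bucket in the comment is a placeholder, and it assumes the batch was written with -a.

```shell
#!/bin/sh
# Sketch: apply a previously downloaded batch file to a deploy directory.
# apply_batch is a hypothetical helper name.
#
# The download itself might look something like this (placeholder bucket):
#   gsutil cp gs://example-bucket/diff /tmp/diff
apply_batch() {
    batch=$1   # path to the downloaded batch file
    dest=$2    # directory to update

    # Refuse to start if the batch file never arrived.
    if [ ! -r "$batch" ]; then
        echo "batch file not readable: $batch" >&2
        return 1
    fi

    # The function returns rsync's exit status, so callers can detect failures.
    rsync -a --read-batch="$batch" "$dest/"
}
```

A non-zero return from apply_batch /tmp/diff /deploy/directory is the signal to stop and look at that host rather than carry on.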
Caveats
Batch mode expects the directory you’re updating to be identical to the ‘previous’ directory the diff was created against, so this will fail if the diff doesn’t apply cleanly. Essentially, if one of your hosts is a deploy behind, you should make absolutely sure that you know that, and don’t try to update it to the latest version. If you try to anyway, you’ll probably end up with errors in the best case, or a corrupt copy of your code in the worst case. We’re still working on making our scripts handle the potential error cases so that we don’t end up in a corrupt state.
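One possible guard against the "host is a deploy behind" case is to fingerprint the tree before applying anything. This is a sketch of the idea rather than what our scripts do: tree_fingerprint and safe_to_apply are hypothetical names, it assumes GNU find, sort, xargs and sha256sum are available, and it only looks at file paths and contents (not permissions, timestamps or symlinks).

```shell
#!/bin/sh
# Sketch: only apply a batch if the host's tree matches the tree the batch
# was built against. Both function names are hypothetical.

# tree_fingerprint DIR: print one hash covering every regular file's
# relative path and contents under DIR.
tree_fingerprint() {
    (
        cd "$1" || exit 1
        find . -type f -print0 | LC_ALL=C sort -z | xargs -0 sha256sum \
            | sha256sum | cut -d' ' -f1
    )
}

# safe_to_apply DIR EXPECTED: succeed only if DIR's fingerprint is EXPECTED.
safe_to_apply() {
    actual=$(tree_fingerprint "$1") || return 1
    [ "$actual" = "$2" ]
}
```

The build machine would record tree_fingerprint /deploy/previous next to the diff before writing the batch, and each host would run safe_to_apply /deploy/directory with that value before calling rsync. Hashing the whole tree costs time on every host, so whether it’s worth it depends on how large your deploy directory is.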
I hope this is helpful to you! If you’ve got any thoughts, questions, or corrections, drop them in the comments below. I’d love to hear them!