
Optimizing Your Dockerfile

How We Made Our Docker Builds Three Times Faster

Docker has revolutionized application development, deployment, and operation. It has made the microservice architecture that we use at Ginkgo Bioworks a pragmatic possibility. The build cycle for Docker can seem like a burden, but I will show you how you can make your Docker builds as much as three times faster.

Example

For simplicity, we will Dockerize the Tic-Tac-Toe app from the official React Tutorial as our working example. Use create-react-app to seed your repo, and then populate public/index.html, src/index.css, and src/index.js with the code from the tutorial's CodePen. Preface src/index.js with:
import React from 'react';
import ReactDOM from 'react-dom';
import './index.css';
Add a Dockerfile:
FROM node:10.17.0-alpine3.10

COPY . /usr/src/tic-tac-toe
WORKDIR /usr/src/tic-tac-toe
RUN ["yarn", "install"]
RUN ["yarn", "build"]
and a .dockerignore file:
.git
node_modules
Run yarn install to create a yarn.lock file, and finally, run docker build -t tic-tac-toe . to build the Docker image. To view the application, simply run docker run -it -p 3000:3000 tic-tac-toe yarn start and point your browser to http://localhost:3000.

Why is the build so slow?

The first time we build this Docker image, it understandably takes a long time: Docker has to pull the base image specified in the FROM command and then execute each command after that.

Now, let's say that we want to make a small change to the application. For instance, instead of a game of X’s and O’s, we want this version of tic-tac-toe to be a game of +’s and O’s. This requires a change of just two lines of code. Change:
squares[i] = this.state.xIsNext ? 'X' : 'O';
to:
squares[i] = this.state.xIsNext ? '+' : 'O';
and:
status = "Next player: " + (this.state.xIsNext ? 'X' : 'O');
to:
status = "Next player: " + (this.state.xIsNext ? '+' : 'O');
Now, when we run the Docker image build again:
Step 1/5 : FROM node:10.17.0-alpine3.10
 ---> a0708430821e
Step 2/5 : COPY . /usr/src/tic-tac-toe
 ---> 6bed16f0469b
Step 3/5 : WORKDIR /usr/src/tic-tac-toe
 ---> Running in 2e997d17670a
Removing intermediate container 2e997d17670a
 ---> 222f9e8bf80c
Step 4/5 : RUN ["yarn", "install"]
 ---> Running in 50e24a8d25ff
yarn install v1.19.1
[1/4] Resolving packages...
[2/4] Fetching packages...
...
[3/4] Linking dependencies...
...
[4/4] Building fresh packages...
Done in 32.28s.
Removing intermediate container 50e24a8d25ff
 ---> 30499b09debc
Step 5/5 : RUN ["yarn", "build"]
 ---> Running in 96f70bd0e782
yarn run v1.19.1
$ react-scripts build
Creating an optimized production build...
Compiled successfully.
...
Done in 8.35s.
Removing intermediate container 96f70bd0e782
 ---> d62902620d7d
Successfully built d62902620d7d
We only changed two lines of code, but all of the steps in the Dockerfile were rerun, including the expensive yarn install. We shouldn't need to run yarn install every time we change a few lines of code, and with a bit of care and attention applied to our Dockerfile, we won't.
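A quick way to see what Docker actually produced here (more on layers in the next section) is the standard docker history command, which prints one row per layer, newest first:

# Inspect the layers of the image we just built (newest first).
# Each Dockerfile step corresponds to a row; the row for the
# yarn install step should be by far the largest.
docker history tic-tac-toe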

Layers

Docker images are built as layers, where each step in the Dockerfile literally builds a layer atop the previous layer. Each layer is cached in the Docker layer cache, identified in the cache by a unique hash. The hash of a layer is calculated from its own contents as well as the hashes of all of the layers before it. Therefore, if the contents of a layer change (for example, if the contents of the source of a COPY command change), then that layer and all layers subsequent to it become new layers, identified by new hashes. That step and all subsequent steps must therefore be rerun, because those layers will not yet be in the Docker layer cache.

In the case where we changed those two lines of code, we changed the layer from the COPY step, because the source in that COPY command is the entire repo (aside from what's listed in .dockerignore). Changing anything in the repo therefore invalidates nearly every layer generated by the Dockerfile.

If we could sequence COPYing the directory after RUNning the yarn install, we would be freer to alter the application code without incurring the penalty of reRUNning the yarn install step for most builds. yarn install requires exactly two files: package.json and yarn.lock, so we only need to COPY those two files before RUNning the yarn install step. We still need nearly the entire source directory before we can run the yarn build step, so we can sequence the COPY . /usr/src/tic-tac-toe step directly before RUN ["yarn", "build"]. Let's try that. The Dockerfile is now:
FROM node:10.17.0-alpine3.10

COPY package.json /usr/src/tic-tac-toe/package.json
COPY yarn.lock /usr/src/tic-tac-toe/yarn.lock
WORKDIR /usr/src/tic-tac-toe
RUN ["yarn", "install"]

COPY . /usr/src/tic-tac-toe
RUN ["yarn", "build"]
Be sure that node_modules is in the .dockerignore file so that the COPY . /usr/src/tic-tac-toe step does not clobber the node_modules directory generated by the yarn install in the image. Now build this image.

Next, let's say that we want to change the +'s back into X's. Make that change and run the image build again. The output should be something like:
Step 1/7 : FROM node:10.17.0-alpine3.10
 ---> a0708430821e
Step 2/7 : COPY package.json /usr/src/tic-tac-toe/package.json
 ---> Using cache
 ---> 828ab4b9065a
Step 3/7 : COPY yarn.lock /usr/src/tic-tac-toe/yarn.lock
 ---> Using cache
 ---> 683ad066cc76
Step 4/7 : WORKDIR /usr/src/tic-tac-toe
 ---> Using cache
 ---> f6e1da8e37e7
Step 5/7 : RUN ["yarn", "install"]
 ---> Using cache
 ---> 6c249a1631c9
Step 6/7 : COPY . /usr/src/tic-tac-toe
 ---> 6708e001b337
Step 7/7 : RUN ["yarn", "build"]
 ---> Running in 6578b456a5bc
yarn run v1.19.1
$ react-scripts build
Creating an optimized production build...
Compiled successfully.
...
Done in 8.08s.
Removing intermediate container 6578b456a5bc
 ---> 50854e5ce042
Successfully built 50854e5ce042
Notice that, in addition to not pulling the base image in step 1, steps 2 through 5 say "Using cache" -- rather than RUNning the expensive yarn install command again, Docker was able to reuse that layer from the layer cache. package.json and yarn.lock change infrequently compared to other files in the normal course of development, so most builds of this application will take about 8 seconds, rather than the roughly 40 seconds of a full build. We have cut more than three quarters from the build time of this application!
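If you want to verify this on your own machine, a rough way is to time a warm-cache build against one that bypasses the cache; --no-cache is a standard docker build flag (exact timings will of course vary):

# Warm-cache build: steps up through yarn install should say "Using cache".
time docker build -t tic-tac-toe .

# Cold build for comparison: --no-cache ignores the layer cache entirely.
time docker build --no-cache -t tic-tac-toe .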

CI

You may notice that when running the build on an actual CI platform (such as GitLab CI), the caching is ineffective -- Docker reruns all of the steps despite our optimizations. Your CI platform likely distributes work among workers on several machines, perhaps even ephemeral machines. Your pipeline may run on a different machine each time it runs, so there is a good chance that the local Docker cache will not have the layers from previous builds of your image. We must therefore instruct Docker to pull the images into the local cache. Assuming that the Docker image is tagged with the git branch name:
docker pull docker.my-company.com/tic-tac-toe:$BRANCH_NAME || true
The || true is there so that the pipeline does not fail when the docker pull fails, as it will if this happens to be the first time we are building the Docker image for the branch.

Then, we must instruct Docker to consider the image that we just pulled as a cache source for layers:
docker build \
  --cache-from docker.my-company.com/tic-tac-toe:$BRANCH_NAME \
  -t docker.my-company.com/tic-tac-toe:$BRANCH_NAME \
  .
This will work well on the second and subsequent Docker builds of a branch, but the first time the branch is built, all of the steps will still be run. It is likely that package.json and yarn.lock remain the same between the master branch and a feature branch, so we can use the master image as a cache source for those first 5 steps to speed up that first build of the feature branch. We can specify multiple --cache-from sources. The build script will be something like:
docker pull docker.my-company.com/tic-tac-toe:$BRANCH_NAME || true
docker pull docker.my-company.com/tic-tac-toe:master || true
docker build \
  --cache-from docker.my-company.com/tic-tac-toe:$BRANCH_NAME \
  --cache-from docker.my-company.com/tic-tac-toe:master \
  -t docker.my-company.com/tic-tac-toe:$BRANCH_NAME \
  .
docker push docker.my-company.com/tic-tac-toe:$BRANCH_NAME
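For concreteness, here is a minimal sketch of how this script might be wired into a .gitlab-ci.yml. The job name, stage, and registry credentials ($REGISTRY_USER, $REGISTRY_PASSWORD) are assumptions for illustration; $CI_COMMIT_REF_SLUG is GitLab's built-in, URL-safe branch name, standing in for $BRANCH_NAME above:

# A sketch, not a drop-in config -- adjust the registry, stages, and auth to your setup.
build-image:
  stage: build
  image: docker:stable
  services:
    - docker:dind   # Docker-in-Docker, so the job can run docker commands
  script:
    - docker login -u "$REGISTRY_USER" -p "$REGISTRY_PASSWORD" docker.my-company.com
    - docker pull docker.my-company.com/tic-tac-toe:$CI_COMMIT_REF_SLUG || true
    - docker pull docker.my-company.com/tic-tac-toe:master || true
    - >
      docker build
      --cache-from docker.my-company.com/tic-tac-toe:$CI_COMMIT_REF_SLUG
      --cache-from docker.my-company.com/tic-tac-toe:master
      -t docker.my-company.com/tic-tac-toe:$CI_COMMIT_REF_SLUG
      .
    - docker push docker.my-company.com/tic-tac-toe:$CI_COMMIT_REF_SLUG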
Even invoking docker pull twice is usually faster than running yarn install, especially if we get lucky and the local Docker cache happens to have one or both images already.

The key takeaways are:
  • Reduce the likelihood that expensive steps will be invalidated by narrowing the scope of those steps' dependencies.

  • Move the steps that are less likely to be invalidated earlier in the Dockerfile.

  • Ideally, the most expensive steps should be the least likely to be invalidated and should occur earliest in the Dockerfile.

  • Use docker pull and --cache-from in concert to prime a cold Docker cache.

Our React apps at Ginkgo Bioworks are of course much larger and much more complicated than this simple Tic-Tac-Toe example. We applied these techniques to our build pipelines, and now our builds are three times faster than they were before, allowing us to ship cool new features to our users more rapidly!

(Feature photo by frank mckenna on Unsplash)



Posted by Raymond Lam