Docker deployments with zero downtime
Swarm mode with the docker service command, introduced in version 1.12.0, aims to be a good tool for scaling your application, and one of the nice features promised is zero-downtime deployments, which I’m going to try in this post.
UPD: Unfortunately, the latest Docker versions keep forwarding new connections to removed service tasks even after their containers have received SIGTERM and started a graceful shutdown, which makes no-downtime rolling updates impossible, at least without an external load balancer.
There are a few related issues which should fix this (still open); you can subscribe to their notifications to get updates. Meanwhile, let’s get familiar with the deployment process …
Application
First, we need our mission-critical application packed into a Docker image, and it should respond to the SIGTERM signal by shutting down gracefully in a timely manner.
Bootstrap
I’ll use a basic Express.js app generated with express-generator, which will just respond with a plain-text response on its root path (the source code can be found on GitHub).
$ npm install -g express-generator
$ express --view=pug --git -f docker-deploy-test
create : docker-deploy-test
create : docker-deploy-test/package.json
create : docker-deploy-test/app.js
create : docker-deploy-test/.gitignore
create : docker-deploy-test/public
create : docker-deploy-test/public/javascripts
create : docker-deploy-test/public/images
create : docker-deploy-test/public/stylesheets
create : docker-deploy-test/public/stylesheets/style.css
create : docker-deploy-test/routes
create : docker-deploy-test/routes/index.js
create : docker-deploy-test/routes/users.js
create : docker-deploy-test/views
create : docker-deploy-test/views/index.pug
create : docker-deploy-test/views/layout.pug
create : docker-deploy-test/views/error.pug
create : docker-deploy-test/bin
create : docker-deploy-test/bin/www
install dependencies:
$ cd docker-deploy-test && npm install
run the app:
$ DEBUG=docker-deploy-test:* npm start
Install dependencies
$ cd docker-deploy-test && npm install
<long output of installed dependencies tree should be displayed here>
And run the app:
$ DEBUG=docker-deploy-test:* npm start
> [email protected] start /Users/vadim/projects/lostintime/docker-deploy-test
> node ./bin/www
docker-deploy-test:server Listening on port 3000 +0ms
Now on http://localhost:3000/ you can see this nice looking page.
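For reference, the root route serving this page is roughly what express-generator scaffolds in routes/index.js (a sketch shown here for context; the actual handler is in the linked repository):
var express = require('express');
var router = express.Router();

/* GET home page - render the default scaffolded view. */
router.get('/', function (req, res, next) {
  res.render('index', { title: 'Express' });
});

module.exports = router;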
Let’s run some benchmarks on this to get a baseline to compare later results against.
$ ab -c 10 -n 1000 "http://localhost:3000/"
This is ApacheBench, Version 2.3 <$Revision: 1748469 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking localhost (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
Completed 600 requests
Completed 700 requests
Completed 800 requests
Completed 900 requests
Completed 1000 requests
Finished 1000 requests
Server Software:
Server Hostname: localhost
Server Port: 3000
Document Path: /
Document Length: 170 bytes
Concurrency Level: 10
Time taken for tests: 4.438 seconds
Complete requests: 1000
Failed requests: 0
Total transferred: 366000 bytes
HTML transferred: 170000 bytes
Requests per second: 225.31 [#/sec] (mean)
Time per request: 44.382 [ms] (mean)
Time per request: 4.438 [ms] (mean, across all concurrent requests)
Transfer rate: 80.53 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.1 0 1
Processing: 7 44 6.0 42 68
Waiting: 7 44 6.0 42 68
Total: 8 44 6.0 42 68
Percentage of the requests served within a certain time (ms)
50% 42
66% 43
75% 44
80% 47
90% 54
95% 57
98% 61
99% 63
100% 68 (longest request)
To summarize the previous output:
- Longest request: 68ms
- Complete requests: 1000
- Failed requests: 0
Docker image
Building a Docker image for a Node.js application is damn simple: just use the official onbuild
node image, and the Dockerfile
will look like this:
FROM node:6.9-onbuild
CMD ["node", "./bin/www"]
One thing to notice here is the custom CMD
; it is needed because npm doesn’t handle SIGTERM
properly: https://github.com/npm/npm/issues/4603, https://github.com/dickeyxxx/npm-register/issues/43.
Build the image and push it to Docker Hub (or your private registry):
$ docker build -t lostintime/docker-deploy-test:v1 .
< long build process output here>
$ docker images|grep docker-deploy-test
lostintime/docker-deploy-test:v1 latest 88e65257a7ca 10 seconds ago 671 MB
$ docker push lostintime/docker-deploy-test:v1
...
To be sure it works, run the app with Docker (don’t forget to stop the previously running node application; ctrl+c
may help with this), and then stop the container with docker stop
:
$ docker run --rm -p 127.0.0.1:3000:3000 -e "DEBUG=docker-deploy-test:*" lostintime/docker-deploy-test:v1
Thu, 09 Feb 2017 19:03:46 GMT docker-deploy-test:server Listening on port 3000
Thu, 09 Feb 2017 19:03:57 GMT docker-deploy-test:server Got SIGTERM
Thu, 09 Feb 2017 19:03:57 GMT docker-deploy-test:server Server bind closed
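The Got SIGTERM / Server bind closed lines come from a SIGTERM handler in bin/www, roughly along these lines (a minimal sketch assuming the standard express-generator layout; the exact code is in the linked repository):
#!/usr/bin/env node
// Minimal sketch of bin/www with graceful shutdown (not the exact repository code).
var app = require('../app');
var debug = require('debug')('docker-deploy-test:server');
var http = require('http');

var port = process.env.PORT || '3000';
var server = http.createServer(app);

server.listen(port, function () {
  debug('Listening on port ' + port);
});

// On SIGTERM: stop accepting new connections, let in-flight requests finish, then exit.
process.on('SIGTERM', function () {
  debug('Got SIGTERM');
  server.close(function () {
    debug('Server bind closed');
    process.exit(0);
  });
});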
Our benchmark looks pretty similar at this step too:
ab -c 10 -n 1000 "http://localhost:3000/"
...
Time taken for tests: 5.256 seconds
Complete requests: 1000
Failed requests: 0
Total transferred: 366000 bytes
HTML transferred: 170000 bytes
Requests per second: 190.26 [#/sec] (mean)
Time per request: 52.561 [ms] (mean)
Time per request: 5.256 [ms] (mean, across all concurrent requests)
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.0 0 1
Processing: 11 52 6.9 50 88
Waiting: 11 52 6.8 50 87
Total: 12 52 6.9 50 88
....
Docker Service
The next thing to do is to prepare our cluster; if you haven’t done it yet, init the swarm cluster:
$ docker swarm init
Swarm initialized: current node (zzzzzzzzzzzzzzz) is now a manager.
To add a worker to this swarm, run the following command:
docker swarm join \
--token SWMTKN-1-some-long-token-here \
192.168.65.2:2377
To add a manager to this swarm, run 'docker swarm join-token manager' and follow the instructions.
Create network:
$ docker network create deployme --driver overlay
...
Create service and scale to 6 instances:
$ docker service create \
--env "DEBUG=docker-deploy-test:*" \
--name "deployme" \
--endpoint-mode "vip" \
--mode "replicated" \
--replicas 1 \
--update-parallelism 1 \
--update-delay 10s \
--stop-grace-period 5s \
--restart-condition "any" \
--restart-max-attempts 10 \
--publish "3000:3000" \
--network "deployme" \
lostintime/docker-deploy-test:v1
$ docker service ls
ID NAME MODE REPLICAS IMAGE
mdd4g9fnzhxe deployme replicated 1/1 lostintime/docker-deploy-test:v1
$ docker service scale deployme=6
$ docker service ls
ID NAME MODE REPLICAS IMAGE
mdd4g9fnzhxe deployme replicated 6/6 lostintime/docker-deploy-test:v1
Now, let’s try to scale the service down to 2 replicas while putting it under load (run the commands at the same time):
$ ab -c 40 -n 5000 "http://localhost:3000/"
...
Concurrency Level: 40
Time taken for tests: 16.180 seconds
Complete requests: 5000
Failed requests: 0
Total transferred: 1830000 bytes
HTML transferred: 850000 bytes
Requests per second: 309.01 [#/sec] (mean)
Time per request: 129.444 [ms] (mean)
Time per request: 3.236 [ms] (mean, across all concurrent requests)
Transfer rate: 110.45 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 1 2.6 0 39
Processing: 5 128 114.8 111 792
Waiting: 4 127 114.7 110 791
Total: 5 128 114.8 113 792
...
$ docker service scale deployme=2
Pretty good, all requests succeeded; now let’s scale the service back up:
$ docker service scale deployme=6
$ ab -c 40 -n 5000 "http://localhost:3000/"
This is ApacheBench, Version 2.3 <$Revision: 1706008 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking deployme (be patient)
Completed 500 requests
Completed 1000 requests
Completed 1500 requests
Completed 2000 requests
Completed 2500 requests
apr_socket_recv: Connection refused (111)
Total of 2771 requests completed
Oops, it looks like containers are added to the service before the node socket binding is ready, which can probably be fixed with another great feature released with docker 1.12
- HEALTHCHECK
(it can also be added at container build time).
To check that the service is up, we will use a simple curl command: curl --fail http://localhost:3000
(curl is available by default in the node
docker images, at least in the one used for our app; I didn’t check the slim versions).
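For reference, roughly the same check baked in at build time would look like this in the Dockerfile (a sketch; the interval, timeout and retries values mirror the service flags used below):
FROM node:6.9-onbuild
# Mark the container unhealthy when the root path stops responding.
HEALTHCHECK --interval=3s --timeout=2s --retries=5 \
  CMD curl --fail http://localhost:3000 || exit 1
CMD ["node", "./bin/www"]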
Re-create our service with healthcheck instructions:
$ docker service rm deployme
$ docker service create \
--env "DEBUG=docker-deploy-test:*" \
--name "deployme" \
--endpoint-mode "vip" \
--mode "replicated" \
--replicas 1 \
--update-parallelism 1 \
--update-delay 10s \
--stop-grace-period 5s \
--restart-condition "any" \
--restart-max-attempts 10 \
--network "deployme" \
--publish "3000:3000" \
--health-cmd "curl --fail http://localhost:3000" \
--health-interval 3s \
--health-retries 5 \
--health-timeout 2s \
lostintime/docker-deploy-test:v1
And benchmark again:
$ ab -c 40 -n 20000 -l -k "http://localhost:3000/"
...
While scaling it up and down:
$ docker service scale deployme=6
$ docker service scale deployme=2
...
Concurrency Level: 40
Time taken for tests: 53.098 seconds
Complete requests: 20000
Failed requests: 0
Keep-Alive requests: 19973
Total transferred: 7409983 bytes
HTML transferred: 3395410 bytes
Requests per second: 376.66 [#/sec] (mean)
Time per request: 106.196 [ms] (mean)
Time per request: 2.655 [ms] (mean, across all concurrent requests)
Transfer rate: 136.28 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.3 0 8
Processing: 8 106 189.0 94 5136
Waiting: 0 99 53.6 93 942
Total: 8 106 189.0 94 5136
Tadaaa! All requests succeeded.
For the deploy process we will use docker service update, which technically does the same thing: it scales the service down and back up with new options, e.g. --image
.
Create a new version of our app:
$ docker tag lostintime/docker-deploy-test:v1 lostintime/docker-deploy-test:v2
And finally deploy:
$ docker service update --image "lostintime/docker-deploy-test:v2" deployme
While benchmarking:
$ ab -c 40 -n 20000 -l -k "http://deployme:3000/"
...
Concurrency Level: 40
Time taken for tests: 46.566 seconds
Complete requests: 20000
Failed requests: 0
Keep-Alive requests: 19985
Total transferred: 7414435 bytes
HTML transferred: 3397450 bytes
Requests per second: 429.50 [#/sec] (mean)
Time per request: 93.132 [ms] (mean)
Time per request: 2.328 [ms] (mean, across all concurrent requests)
Transfer rate: 155.49 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.5 0 15
Processing: 6 93 157.7 76 5927
Waiting: 0 89 47.3 75 414
Total: 6 93 157.7 76 5927
Conclusion
The health check is a very important part of the deploy process and lets us control exactly when a container is ready to be added to the swarm. Of course you should tune the --health-*
parameters for your requirements, and ideally create a separate endpoint which encapsulates all your healthcheck logic. During a deploy you’ll need at least one container (or more) running at any time that can handle the load, so please also look at the --update-parallelism
, --update-delay
and --stop-grace-period
params in more depth.
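As an illustration, such a dedicated endpoint could look roughly like this in an Express app (a hypothetical /health route, not part of the demo app); the service would then be created with --health-cmd "curl --fail http://localhost:3000/health":
var express = require('express');
var router = express.Router();

// Hypothetical health endpoint: reply 200 only when the app considers itself ready.
router.get('/health', function (req, res) {
  // Put real checks here (database connection, upstream services, etc.).
  res.status(200).send('OK');
});

module.exports = router;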
Useful links
Here are some useful links which I discovered while solving this problem.
- Graceful shutdown in nodejs: http://joseoncode.com/2014/07/21/graceful-shutdown-in-node-dot-js/
- Reducing Deploy Risk With Docker’s New Health Check Instruction: https://blog.newrelic.com/2016/08/24/docker-health-check-instruction/
- Docker healthcheck documentation: https://docs.docker.com/engine/reference/builder/#/healthcheck
PS: sorry for the writing style and the predominance of shell output; this post is just a dirty proof-of-concept set of instructions :).