March 12, 2026
Written by Temps Team
You push a new version and for five to thirty seconds, some users see errors. In-flight requests drop. WebSocket connections break. It happens every deploy, and most teams just accept it.
Here's the thing: zero-downtime deployment is achievable without Kubernetes. You don't need a container orchestration platform, a service mesh, or a dedicated SRE team. You need three things — health checks, connection draining, and an atomic route switch — layered on top of Docker containers you're already running.
This guide walks through why Docker deployments have downtime by default, how to build a zero-downtime pipeline from scratch with Docker Compose and Nginx, and how to skip the DIY work entirely if you'd rather not maintain deployment scripts forever.
TL;DR: Docker's default stop-start cycle creates a 5-30 second gap where requests fail. You can eliminate it with health check gating, connection draining, and blue-green container swaps. Elite engineering teams deploy multiple times per day with a 5% change failure rate. This guide shows both the DIY approach and a one-command alternative.
Docker's default lifecycle creates an unavoidable gap between stopping the old container and starting the new one. According to ITIC, 91% of mid-size and large enterprises report that a single hour of downtime costs over $300,000. Even brief deployment windows — repeated across multiple daily deploys — compound fast.
When you run docker compose up -d --build, Docker builds the new image, stops the old container, removes it, and starts a fresh one. That sequence has three gaps where requests fail:

- The stop gap: Docker sends SIGTERM (then SIGKILL after the grace period), cutting off any in-flight requests on the old container.
- The start gap: between the old container stopping and the new one starting, nothing is listening on the port — connections are refused outright.
- The boot gap: the new container is running, but your application hasn't finished initializing, so early requests hit a process that can't serve them yet.
During these gaps, any request hitting your server gets a 502 Bad Gateway or a connection refused error. That's downtime.
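You can see the gap for yourself with a throwaway probe. This sketch is ours, not part of the deploy tooling — the URL and duration are placeholders. Run it in one terminal while deploying in another; a default stop-start deploy shows up as a run of DOWN lines.

```shell
# Probe a URL once per second and report whether it answered.
# During a default `docker compose up -d --build` you'll see DOWN lines.
probe() {
  url=$1
  seconds=$2
  i=0
  while [ "$i" -lt "$seconds" ]; do
    if curl -sf -m 1 "$url" > /dev/null 2>&1; then
      echo "UP"
    else
      echo "DOWN"
    fi
    i=$((i + 1))
    sleep 1
  done
}

# Example: probe http://localhost:3000/ 60
```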
restart: always Doesn't Help

A common misconception: setting restart: always in your docker-compose.yml gives you zero-downtime deploys. It doesn't. This directive tells Docker to restart the same container when it crashes. It doesn't spin up a new version alongside the old one.
# This does NOT give you zero-downtime deployment
services:
web:
image: myapp:latest
restart: always # Only restarts the SAME container on crash
What you actually need is two containers running simultaneously — the old version serving traffic while the new version boots up and passes health checks. That's a fundamentally different pattern.
docker compose up -d Replaces In-Place

Running docker compose up -d with an updated image does a stop-then-start on the same service. It doesn't create a parallel instance. Even docker compose up -d --scale web=2 won't orchestrate a graceful handoff. You'd end up with two containers behind no load balancer, both receiving traffic with no health gating.
The core problem: Docker Compose is a development tool that happens to work in production. It wasn't designed for zero-downtime deployments. You need to layer your own orchestration on top.
Three strategies dominate zero-downtime deployment, each with different trade-offs in complexity, cost, and risk. According to the 2024 DORA report, elite-performing teams deploy on-demand with a change failure rate of just 5%, while low performers deploy monthly with a 64% failure rate. The strategy you pick affects how fast you can recover from that failure.
Blue-green keeps two identical environments running. "Blue" serves production traffic. "Green" gets the new version. Once green passes health checks, the load balancer switches all traffic in one atomic step.
How it works:

1. Blue serves all production traffic while green sits idle or holds the previous release.
2. Deploy the new version to green and wait for it to pass health checks.
3. Switch the load balancer from blue to green in one atomic step.
4. Keep blue running briefly as an instant rollback target before recycling it.
The advantage is simplicity: one clean switch, one clean rollback. The downside is cost — you're running two full environments at all times. For a single-server Docker setup, this means doubling your container resources permanently.
Rolling updates replace instances one at a time. The old version keeps serving while new instances spin up and pass health checks.
Time 0: [v1] [v1] [v1] ← All running v1
Time 1: [v1] [v1] [v2...] ← One instance boots v2
Time 2: [v1] [v1] [v2 ✓] ← v2 passes check, takes traffic
Time 3: [v1] [v2 ✓] [v2 ✓] ← Second instance upgraded
Time 4: [v2 ✓] [v2 ✓] [v2 ✓] ← Complete, zero dropped requests
Rolling deployment uses about 1.3x resources during the deploy — not 2x permanently. But it requires multiple instances, which makes it less practical on a single server with one container.
Canary sends a small percentage of traffic (say 5%) to the new version first. If error rates stay flat, traffic gradually shifts — 5%, 25%, 50%, 100%. If anything goes wrong, only that small slice of users was affected.
This is the safest approach for high-traffic applications. But it's also the most complex to implement. You need traffic splitting at the load balancer level, per-version metrics collection, and automated promotion logic.
| Strategy | Resource Cost | Rollback Speed | Complexity | Best For |
|---|---|---|---|---|
| Blue-green | 2x always | Instant | Low | Single-server Docker apps |
| Rolling | 1.3x during deploy | Seconds | Medium | Multi-instance clusters |
| Canary | 1.1x during deploy | Seconds | High | High-traffic production |
For most Docker apps running on a single server, blue-green is the practical choice. You can implement it with two containers and an Nginx reload. No cluster required.
Every zero-downtime strategy relies on three mechanisms working together: health check gating, connection draining, and atomic routing. According to New Relic, organizations with full-stack observability experience 71% fewer annual outages. These three ingredients are the foundation of that observability at the deployment layer.
The new container should never receive traffic until it's genuinely ready. A health check endpoint verifies that your application booted, connected to its database, and can serve requests.
// Express.js health check that verifies real dependencies
app.get('/health', async (req, res) => {
try {
await db.query('SELECT 1');
await redis.ping();
res.status(200).json({ status: 'healthy' });
} catch (err) {
res.status(503).json({ status: 'unhealthy', error: err.message });
}
});
A health endpoint that blindly returns 200 defeats the entire purpose. If your app returns "healthy" before the database connection pool is established, the load balancer will route traffic to a container that immediately throws 500 errors.
Docker's built-in HEALTHCHECK instruction helps, but it's not enough on its own. Docker health checks only affect container status — they don't control your load balancer. You need your deployment script to check health before switching traffic.
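The gating loop itself is small. Here's a minimal sketch — the function name and arguments are ours — that polls an arbitrary check command until it succeeds or the retry budget runs out, which is exactly what the deploy script later in this guide does inline:

```shell
# Poll a check command until it succeeds; give up after max_retries attempts.
# Returns 0 once healthy, 1 if the budget is exhausted.
wait_for_health() {
  check_cmd=$1
  max_retries=$2
  interval=$3
  attempt=0
  until eval "$check_cmd" > /dev/null 2>&1; do
    attempt=$((attempt + 1))
    if [ "$attempt" -ge "$max_retries" ]; then
      return 1
    fi
    sleep "$interval"
  done
  return 0
}

# Example gate before switching traffic:
#   wait_for_health "curl -sf http://localhost:8002/health" 30 2 || exit 1
```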
When you remove the old container from the load balancer, don't kill it immediately. In-flight requests — a user mid-checkout, a file upload at 90%, a long-polling connection — need time to complete.
Connection draining means:

- Stop routing new requests to the old container.
- Let in-flight requests finish, up to a reasonable grace period.
- Only then send SIGTERM and stop the container.
Without draining, you'll randomly drop requests during every deploy. Users won't see a full outage, but they'll get sporadic 502 errors that are hard to reproduce and diagnose.
The load balancer needs to flip from old to new in one step. Not gradually, not with a gap — atomically. For Nginx, this is a config reload:
nginx -s reload
Nginx's reload is graceful: it starts new worker processes with the updated config, and old worker processes finish their current requests before exiting. That's atomic routing and connection draining in one operation — but only for the Nginx layer. Your application containers still need their own draining logic.
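One refinement worth borrowing from battle-tested deploy tooling: write the upstream file atomically. A plain `echo >` can in rare cases be observed half-written; writing a temp file and renaming it over the old one makes the swap a single atomic rename. A sketch, using the upstream-file path this guide uses later:

```shell
# Replace the active-upstream file atomically: write a temp file on the same
# filesystem, then mv it over the old one. rename(2) is atomic, so a
# concurrently-reloading Nginx never sees a partial config.
switch_upstream() {
  conf_file=$1
  port=$2
  tmp="${conf_file}.tmp"
  printf 'server 127.0.0.1:%s;\n' "$port" > "$tmp"
  mv "$tmp" "$conf_file"
}

# Example: switch_upstream /etc/nginx/conf.d/active-upstream.conf 8002
```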
Here's a working blue-green deployment pipeline you can implement today on any single server. According to the CNCF annual survey, 82% of container users now run Kubernetes in production, but you don't need to be one of them. This approach uses Docker Compose, Nginx, and a 60-line bash script.
Define two services — web-blue and web-green — so both can run simultaneously during the transition.
# docker-compose.yml
services:
web-blue:
build: .
container_name: app-blue
ports:
- "8001:3000"
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
interval: 5s
timeout: 3s
retries: 3
start_period: 10s
restart: unless-stopped
web-green:
build: .
container_name: app-green
ports:
- "8002:3000"
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
interval: 5s
timeout: 3s
retries: 3
start_period: 10s
restart: unless-stopped
Port 8001 maps to the blue container. Port 8002 maps to green. Nginx sits in front and routes to whichever is currently active.
Create two upstream configs that Nginx can switch between:
# /etc/nginx/conf.d/app.conf
upstream app_backend {
# This file gets overwritten by the deploy script
include /etc/nginx/conf.d/active-upstream.conf;
}
server {
listen 80;
server_name myapp.com;
location / {
proxy_pass http://app_backend;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_connect_timeout 5s;
proxy_read_timeout 30s;
}
}
# /etc/nginx/conf.d/active-upstream.conf
# Points to blue by default
server 127.0.0.1:8001;
This is where the magic happens. The script determines which slot is active, deploys to the idle slot, waits for health checks, switches Nginx, and drains the old container.
#!/bin/bash
set -euo pipefail
# Configuration
HEALTH_ENDPOINT="http://localhost:PORT/health"
MAX_RETRIES=30
RETRY_INTERVAL=2
DRAIN_WAIT=10
UPSTREAM_CONF="/etc/nginx/conf.d/active-upstream.conf"
# Determine which slot is currently active
CURRENT=$(grep -oP ':\K[0-9]+' "$UPSTREAM_CONF")
if [ "$CURRENT" = "8001" ]; then
ACTIVE="blue"
TARGET="green"
TARGET_PORT="8002"
else
ACTIVE="green"
TARGET="blue"
TARGET_PORT="8001"
fi
echo "Active: $ACTIVE | Deploying to: $TARGET (port $TARGET_PORT)"
# Step 1: Build and start the target container
echo "Building and starting $TARGET..."
docker compose up -d --build "web-$TARGET"
# Step 2: Wait for health check
echo "Waiting for health check on port $TARGET_PORT..."
HEALTH_URL="${HEALTH_ENDPOINT/PORT/$TARGET_PORT}"
RETRIES=0
until curl -sf "$HEALTH_URL" > /dev/null 2>&1; do
RETRIES=$((RETRIES + 1))
if [ "$RETRIES" -ge "$MAX_RETRIES" ]; then
echo "ERROR: Health check failed after $MAX_RETRIES attempts"
echo "Rolling back: stopping $TARGET"
docker compose stop "web-$TARGET"
exit 1
fi
echo " Attempt $RETRIES/$MAX_RETRIES..."
sleep "$RETRY_INTERVAL"
done
echo "Health check passed!"
# Step 3: Switch Nginx to the new container
echo "Switching traffic to $TARGET..."
echo "server 127.0.0.1:$TARGET_PORT;" > "$UPSTREAM_CONF"
nginx -t        # validate the config first so a bad switch never goes live
nginx -s reload
# Step 4: Drain connections from old container
echo "Draining connections from $ACTIVE ($DRAIN_WAIT seconds)..."
sleep "$DRAIN_WAIT"
# Step 5: Stop the old container
echo "Stopping $ACTIVE..."
docker compose stop "web-$ACTIVE"
echo "Deploy complete! Active slot: $TARGET"
Save this as deploy.sh, make it executable with chmod +x deploy.sh, and run it every time you push a new version.
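One hardening step worth adding before this script sees a whole team: serialize runs with flock(1) so two simultaneous deploys can't interleave. This is a sketch of ours, not part of deploy.sh — the lock path is arbitrary:

```shell
# Run a command under an exclusive lock; a second concurrent invocation
# aborts instead of interleaving with the first. File descriptor 9 holds
# the lock for the lifetime of the subshell.
LOCK_FILE="${LOCK_FILE:-/tmp/deploy.lock}"

run_locked() {
  (
    flock -n 9 || { echo "Another deploy is in progress; aborting." >&2; exit 1; }
    "$@"
  ) 9>"$LOCK_FILE"
}

# Example: run_locked ./deploy.sh
```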
This DIY approach works, but it has sharp edges you'll discover in production:
- Nginx DNS caching: Nginx resolves upstream hostnames once at startup. If you route to Docker service names instead of fixed localhost ports, use the resolver directive and variables in proxy_pass to force re-resolution.
- Graceful shutdown: docker stop sends SIGTERM, and your app has to handle it. Node needs a process.on('SIGTERM') handler. Python needs signal trapping.
- Image accumulation: every deploy builds a new image, and old ones pile up until the disk fills. Add docker image prune -f to the end of your deploy script.
- Concurrent deploys: if two people run deploy.sh at the same time, you'll corrupt state. Add a lock file.

Every one of these gotchas burns you exactly once. Then you add a fix, the script grows to 150 lines, and you've built yourself a deployment system to maintain forever. Which brings us to the question: is this script worth maintaining?
Database migrations are the trickiest part of zero-downtime deployment. Both GitHub's June 2025 outage and Cloudflare's November 2025 global outage were caused by database changes that cascaded into platform-wide failures. If your new code expects a column that doesn't exist yet — or your old code breaks when a column disappears — your rolling deployment fails.
The safe approach splits every breaking database change into three deploys:
Phase 1: Expand (backward-compatible)
-- Add the new column without removing the old one
ALTER TABLE users ADD COLUMN full_name TEXT;
-- Backfill data
UPDATE users SET full_name = first_name || ' ' || last_name;
Deploy code that writes to both columns but reads from the new one.
Phase 2: Migrate
Deploy code that only uses the new column. Both old and new application versions coexist safely because the old column still exists.
Phase 3: Contract (cleanup)
-- Safe to remove after all instances run the new version
ALTER TABLE users DROP COLUMN first_name;
ALTER TABLE users DROP COLUMN last_name;
During a blue-green or rolling deployment, both v1 and v2 run simultaneously. The expand-and-contract pattern ensures both versions work with the same database schema at every step. Never drop a column in the same deploy that stops using it.
Kubernetes handles zero-downtime deployment beautifully — rolling updates, readiness probes, and preStop hooks are all built in. But 82% of container users running Kubernetes doesn't mean 82% of them need it. For a single Docker app on one server, the overhead is significant.
What Kubernetes requires for zero-downtime deploys:

- A cluster — control plane plus worker nodes — to install, upgrade, and secure
- Deployment manifests with a rolling update strategy configured
- Readiness and liveness probes defined on every pod
- preStop hooks and termination grace periods for connection draining
- An ingress controller to route external traffic
That's a lot of infrastructure for deploying a single application. Kubernetes shines when you're running dozens of services across multiple nodes. For one to five Docker apps on a single server? It's like hiring a crane to hang a picture frame.
The real trap is incremental complexity. You start with a simple deployment, add Kubernetes for rolling updates, then spend weeks learning about PodDisruptionBudgets, NetworkPolicies, and resource quotas. Each piece makes sense individually. Together, they form a system that requires dedicated infrastructure expertise to operate.
Temps runs a Pingora-based reverse proxy that handles health checks, connection draining, and atomic traffic switching out of the box. According to Splunk and Oxford Economics, unplanned downtime costs Global 2000 companies $400 billion per year, and most of that is preventable with proper deployment tooling.
Every git push triggers this pipeline automatically:

1. Build a new container image from your repository.
2. Start the new container alongside the one currently serving traffic.
3. Poll the health check endpoint until it passes.
4. Switch traffic to the new container atomically.
5. Drain connections from the old container, then stop it.
No deploy script. No Nginx config. No blue-green orchestration logic. The same three ingredients — health checks, draining, atomic switching — but managed by the platform instead of your bash scripts.
# That's the entire deploy workflow
git push temps main
If the new container fails health checks after three consecutive attempts, Temps rolls back automatically. The old version keeps running. Users never see the broken version. You get a notification with logs explaining what went wrong.
Remember the sharp edges from the DIY section? Temps absorbs each of them at the platform level: upstream resolution, graceful SIGTERM shutdown with a drain window, cleanup of old images, and locking so concurrent deploys can't collide.
The deploy script you'd maintain? Temps replaces it with a single binary that handles all of this, plus SSL certificates, log aggregation, and error tracking.
Don't assume a successful deploy means everything is fine. Watch your core signals — error rate, request latency, throughput, CPU, and memory — for at least 15 minutes after each deployment:
Whether you're using the DIY approach or a platform, a deploy that passes health checks can still have subtle issues: a slow query, a visual regression, an edge case in a new feature.
According to the Uptime Institute, 80% of operators believe their most recent downtime event was preventable. Most zero-downtime failures come from a handful of common mistakes — and they're all avoidable.
A health endpoint that always returns 200 defeats the purpose of health check gating. Always verify real dependencies:
// Bad: always reports healthy, even when dependencies are down
app.get('/health', (req, res) => res.status(200).json({ status: 'ok' }));
// Good: verifies actual readiness
app.get('/health', async (req, res) => {
  try {
    await db.query('SELECT 1');
    await cache.ping();
    res.status(200).json({ status: 'ok' });
  } catch (err) {
    res.status(503).json({ status: 'unhealthy' });
  }
});
Never drop a column in the same deploy that stops using it. Always use the expand-and-contract pattern across multiple deploys. GitHub and Cloudflare both learned this lesson in 2025.
Small, frequent deploys are easier to roll back and less likely to cause cascading failures. The DORA data consistently shows that elite teams deploy multiple times per day — not once a week.
A deploy that passes health checks can still have subtle issues: a visual regression, a slow database query, an edge case in a new feature. Watch metrics for 15 minutes after every deployment.
If you can't roll back in seconds, you don't have zero-downtime deployment — you have zero-downtime deployment with a single point of failure. Always keep the previous container image cached and test your rollback process regularly.
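In the DIY setup from earlier, rollback is the traffic switch run in reverse — which is exactly why the old container has to stay alive. A sketch using the same upstream-file convention as the deploy script (function name is ours):

```shell
# Point Nginx back at the previous slot's port and reload. Only valid while
# the previous container is still running, so don't stop or prune it until
# you trust the new version.
rollback() {
  conf_file=$1
  previous_port=$2
  printf 'server 127.0.0.1:%s;\n' "$previous_port" > "$conf_file"
  nginx -s reload
}

# Example: rollback /etc/nginx/conf.d/active-upstream.conf 8001
```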
Trust but verify. Don't assume your zero-downtime setup works — prove it with a load test. The 2024 DORA report introduced Deployment Rework Rate as a metric because teams often discover failures only after users report them. A load test during deployment catches problems before users do.
hey is a lightweight HTTP load generator. Install it and run continuous requests while deploying:
# Install hey
go install github.com/rakyll/hey@latest
# In terminal 1: start continuous load test
hey -z 120s -c 10 -q 50 http://myapp.com/health
This runs 10 concurrent workers, each rate-limited to 50 requests per second — about 500 requests per second total — for 120 seconds. Now, in another terminal, trigger your deploy:
# In terminal 2: deploy while load test is running
./deploy.sh # DIY approach
# or
git push temps main # Temps approach
When hey finishes, check the output:
Summary:
Total: 120.0034 secs
Slowest: 0.2345 secs
Fastest: 0.0012 secs
Average: 0.0089 secs
Requests/sec: 499.85
Status code distribution:
[200] 59982 responses
# Zero-downtime: CONFIRMED
# If you see ANY non-200 responses, something is wrong.
If the status code distribution shows only 200 responses, you've achieved zero-downtime deployment. Any 502, 503, or connection errors mean requests were dropped during the switch.
What to investigate if you see dropped requests:

- A health check that passes before the application is truly ready to serve traffic.
- A drain window too short for your longest requests — increase DRAIN_WAIT in your deploy script.
- The old container being stopped before Nginx finished its graceful reload.

Run this test after every change to your deployment pipeline. What works in staging can break in production under different load patterns.
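If the load test runs in CI, you can parse hey's summary and fail the pipeline automatically. A sketch — it assumes hey's status lines look like `[200] 59982 responses`, as in the output above:

```shell
# Fail if hey's status-code distribution contains anything other than 200s.
check_no_errors() {
  if grep -E '^[[:space:]]*\[[0-9]{3}\]' "$1" | grep -vq '\[200\]'; then
    echo "FAIL: non-200 responses seen during deploy"
    return 1
  fi
  echo "PASS: every response was a 200"
}

# Example:
#   hey -z 120s -c 10 -q 50 http://myapp.com/health > hey.log && check_no_errors hey.log
```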
How long does a zero-downtime deploy take?

Deployment time depends primarily on your Docker image build step and application boot time. A typical Node.js or Python app builds in 30-90 seconds with layer caching. Health check verification adds 10-20 seconds. The traffic switch itself is instantaneous. Total time from push to live: usually under two minutes for most Docker applications.

What happens if the new version fails its health checks?

The old container keeps running and serving all traffic — users experience nothing unusual. In the DIY approach, the deploy script stops the new container and exits with an error. With Temps, the platform automatically rolls back after three consecutive health check failures and sends a notification with container logs. Either way, the broken version never receives production traffic.

Can I do blue-green deployment on a single server?

Yes. Blue-green deployment works on a single server by running two containers mapped to different ports, with Nginx routing to the active one. You need enough RAM and CPU for two instances of your app during the brief overlap period. For most web applications, that's an extra 256-512 MB of RAM for 30-60 seconds. No cluster or orchestrator required.

What's the difference between zero-downtime deployment and blue-green deployment?

Zero-downtime deployment is the goal — ensuring users never see errors during a deploy. Blue-green is one strategy for achieving that goal. Rolling deployment and canary deployment are alternative strategies. Blue-green maintains two full environments and switches traffic atomically. Rolling replaces instances one by one. Both achieve zero downtime, but with different resource and complexity trade-offs.

Can I run database migrations with zero downtime?

Yes, but you must follow the expand-and-contract pattern. Never make breaking schema changes in a single deploy. Add new columns first, migrate code, then remove old columns in a separate deploy. This ensures both old and new application versions work with the same schema during the rolling update window.

What does my application need to support this?

The only requirement is a health check endpoint — an HTTP route that returns 200 when your app is ready to serve traffic. Most web frameworks make this trivial (a /health route that checks database connectivity). Beyond that, your app should handle SIGTERM for graceful shutdown. No other changes to your Dockerfile or application code are needed.
Zero-downtime deployment boils down to three principles: don't send traffic to unready containers, let in-flight requests finish, and switch routes atomically. You can implement these yourself with Docker Compose, Nginx, and a bash script. You'll spend a day building it and ongoing time maintaining it as edge cases surface.
Or you can skip the plumbing entirely. Temps wraps all three principles into a single binary — Pingora proxy, health check gating, connection draining, automatic rollback — so that every git push produces a zero-downtime deploy without scripts to maintain.
The DIY approach teaches you exactly what's happening. The platform approach lets you stop thinking about it. Both are valid. The worst option is accepting 502 errors during deploys as normal.
# Install Temps and get zero-downtime deploys by default
curl -fsSL temps.sh/install.sh | bash