May 26, 2026
Written by Temps Team
Last updated May 26, 2026
Zero dropped requests during deployment used to require Kubernetes, a dedicated SRE team, and a week of configuration. In 2026, seven platforms handle it natively — but each uses a fundamentally different mechanism, and the mechanism matters more than the marketing copy.
This guide ranks seven deployment platforms by how well they actually achieve zero downtime, explains the three underlying strategies, and tells you exactly what to require from any platform before trusting it with your production traffic.
TL;DR: Temps and Vercel achieve the cleanest zero-downtime through atomic traffic swaps — old version serves until the new version is fully healthy, then a single flip. Fly.io supports explicit blue-green alongside rolling. Render and Railway use rolling deploys that work well at scale. Kamal is blue-green per host but requires the most manual setup. Coolify's rolling implementation is functional but has the thinnest health-check control of the group.
Before ranking platforms, you need to understand what they're actually doing under the hood. Three strategies cover nearly all deployments — and each has different trade-offs in cost, risk, and recovery speed.
| Strategy | How It Works | Resource Overhead | Rollback Speed | Best For |
|---|---|---|---|---|
| Blue-green | Two identical environments; traffic switches atomically from old to new | 2× always | Instant (one flip back) | Single-server apps, APIs requiring zero risk |
| Immutable + atomic alias | New build deployed as immutable artifact; edge routing alias flipped in one step | Minimal (old artifact kept briefly) | Instant | Serverless, static, edge functions |
| Rolling | Instances replaced one at a time; old version keeps serving during rollover | ~1.3× during deploy | Seconds (scale-in new version) | Multi-instance, stateless services |
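The resource-overhead column in the table above can be reproduced with simple arithmetic. The sketch below is illustrative only: `instances` and `surge` are assumed parameters (three instances and one extra instance during rollover), not settings taken from any platform.

```javascript
// Peak capacity during a deploy, relative to steady state.
// `instances` and `surge` are illustrative parameters, not platform settings.
function peakOverhead(strategy, instances, surge = 1) {
  if (strategy === 'blue-green') {
    // A full second environment runs next to the first.
    return (instances * 2) / instances;
  }
  if (strategy === 'rolling') {
    // Only `surge` extra instances exist at any moment.
    return (instances + surge) / instances;
  }
  throw new Error(`unknown strategy: ${strategy}`);
}

console.log(peakOverhead('blue-green', 3));         // 2
console.log(peakOverhead('rolling', 3).toFixed(2)); // 1.33
```

With three instances and a surge of one, rolling peaks at roughly 1.3× while blue-green always needs 2×. The rolling figure shrinks as the instance count grows.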
Immutable image swap is a variant of blue-green specific to containerized platforms: the new image is built and health-checked before the reverse proxy's upstream pointer is atomically updated. No traffic reaches the new container until it passes health checks.
Each mechanism requires the same three ingredients to work correctly: a real health check (not a fake 200), connection draining (old connections finish before the container stops), and an atomic route switch (no gap between old traffic off and new traffic on).
Mechanism: Atomic upstream swap using a Pingora-based reverse proxy. The new container builds and starts while the old one keeps serving. Temps polls the health endpoint until it returns 200, then updates the Pingora upstream configuration — a hot-reload that requires no process restart and takes effect in microseconds. The old container continues draining in-flight requests before shutdown.
What happens on git push:
# Entire deploy workflow — Temps handles the rest
git push temps main
1. New Docker image builds in isolation
2. New container starts (old container still serving all traffic)
3. Pingora polls /health until 200 returned
4. Upstream pointer atomically updated → new container receives traffic
5. Old container drains (configurable timeout, default 30s)
6. Old container stopped, old image pruned
Health check control: Full. Temps blocks traffic until health check passes. Three consecutive failures trigger automatic rollback — the old version keeps running and you get a notification with container logs.
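The gating logic described above (promote on a 200, roll back after three consecutive failures) can be sketched roughly as follows. This is not Temps' implementation. `healthGate`, `probe`, and the option names are hypothetical, and the sketch assumes failures are counted from the first probe.

```javascript
// Health gate sketch: promote only after a 200, give up after N consecutive failures.
// `probe` is a hypothetical async function that resolves to an HTTP status code.
async function healthGate(probe, { maxConsecutiveFailures = 3, intervalMs = 1000 } = {}) {
  let failures = 0;
  while (failures < maxConsecutiveFailures) {
    let status;
    try {
      status = await probe();
    } catch {
      status = 0; // connection refused, timeout, and similar errors
    }
    if (status === 200) return 'promote';
    failures += 1;
    if (failures < maxConsecutiveFailures) {
      // Wait before the next attempt; the old version keeps serving meanwhile.
      await new Promise((resolve) => setTimeout(resolve, intervalMs));
    }
  }
  return 'rollback';
}
```

A probe can be as small as `async () => (await fetch('http://localhost:3000/health')).status` on Node 18+, where the URL is a placeholder. The key property is that the decision is made before any production traffic reaches the new container.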
Connection draining: Configurable drain timeout per project. Pingora sends no new requests to the old upstream the moment the pointer flips, while existing connections finish naturally.
Rollback: Automatic on health check failure. Manual rollback via temps rollback keeps the previous image cached.
Verdict: The most complete zero-downtime implementation in this list. The Pingora hot-reload eliminates the Nginx reload race condition that plagues DIY setups. No deploy scripts to maintain.
For the full technical deep-dive into how this pattern is implemented, see Zero-Downtime Docker Deployments: Blue-Green Setup, DB Migrations & Verification.
Mechanism: Every Vercel deployment is an immutable, content-addressed artifact. When you push, Vercel builds the new version and assigns it a unique preview URL. Once healthy, a single alias swap routes your production domain to the new deployment — atomically, at Vercel's edge routing layer (an internal routing table flip, not a DNS record change).
What makes it work:
Health check control: Limited. Vercel's health checks run during the build phase, not post-deployment. If your app boots but behaves incorrectly at runtime (bad env var, failed DB connection), traffic still routes to it.
Connection draining: Handled at the edge. Serverless function invocations that were in-flight complete on the old deployment.
Rollback: Instant — re-alias to any previous immutable deployment URL.
Limitation: The immutable model works beautifully for stateless apps and Next.js. For long-running processes or WebSocket-heavy apps, Vercel's serverless model changes the problem entirely.
Verdict: Excellent zero-downtime for the serverless/JAMstack use case. Health check depth is shallower than Temps, but the immutable model means the worst case is a broken new deployment that you can instantly revert.
Mechanism: Fly.io supports explicit blue-green deployment via the --strategy flag. New Machines boot with the new image, pass health checks, then traffic switches. Old Machines are stopped after the drain timeout.
fly deploy --strategy bluegreen
Available strategies:
bluegreen: full blue-green. New Machines boot, are health-checked, traffic switches, then old Machines are stopped.
rolling: the default. Machines are replaced one at a time.
canary: one Machine gets the new version first; if it passes checks, the rest roll out (a single-machine smoke test, not weighted traffic splitting).
immediate: stops old Machines first (causes downtime; don't use in production).
Health check control: Good. Fly.io's [[services.tcp_checks]] and [[services.http_checks]] in fly.toml are evaluated before traffic shifts. The grace period (time before checks start) is configurable.
[[services.http_checks]]
interval = 10000
timeout = 2000
grace_period = "5s"
method = "get"
path = "/health"
protocol = "http"
restart_on_timeout = false
Connection draining: Built in. Fly.io's proxy handles draining before stopping old Machines.
Rollback: fly releases list shows all releases; fly deploy --image <previous-image> rolls back.
Verdict: The most flexible of the managed platforms — you can choose your zero-downtime strategy per deploy. Blue-green on Fly.io is genuinely production-quality. The trade-off is TOML config complexity and regional routing awareness you need to manage yourself.
Mechanism: Render deploys new containers one instance at a time. The old instance keeps serving while the new one starts and passes health checks. Traffic routes to the new instance only after it's healthy.
Health check control: Available via the Render dashboard and render.yaml. Custom health check paths and thresholds are supported.
# render.yaml
services:
- type: web
name: my-app
healthCheckPath: /health
Connection draining: Render drains connections from instances being replaced, with a default 30-second window.
Rollback: Manual via the Render dashboard — redeploy a previous commit.
Limitation: Rolling deploys mean both old and new versions run simultaneously against the same database during the deploy window. If the new version ships a breaking schema change, whichever version does not match the current schema will fail: old instances once the migration has run, or new instances before it does. You must implement the expand-and-contract migration pattern.
Verdict: Solid zero-downtime for multi-instance deployments. Health check integration is straightforward. Not suitable for single-instance deployments where rolling doesn't help (the instance is replaced, not supplemented).
Mechanism: Railway's rolling deploy creates new replicas with the new image, waits for them to become healthy, routes traffic to them, then terminates old replicas. The platform handles the orchestration automatically.
Health check control: Railway supports HTTP health checks configured in the Railway dashboard or railway.toml. The check path, interval, and timeout are configurable.
Connection draining: Railway sends a SIGTERM to old replicas and waits for a configurable drain period before SIGKILL.
Rollback: One-click rollback in the Railway dashboard to any previous deployment.
Limitation: Like Render, simultaneous old/new versions during rolling deploys require migration-safe database schema changes. Railway's health check configuration is less granular than Fly.io's.
Verdict: Good zero-downtime for stateless workloads. The developer experience is polished — rollbacks and deployment history are first-class features. Less control over deployment strategy than Fly.io.
Mechanism: Coolify performs rolling deploys by starting the new container, waiting for it to pass health checks, then stopping the old one. For single-container services, this means a brief overlap window.
Health check control: Coolify reads Docker's HEALTHCHECK instruction from your Dockerfile, or you can configure a health check URL in the Coolify dashboard. The overlap window depends on how quickly your container passes its Docker health check.
HEALTHCHECK --interval=5s --timeout=3s --retries=3 \
CMD curl -f http://localhost:3000/health || exit 1
Connection draining: Limited. Coolify's drain behavior depends on your application's SIGTERM handling and Docker's stop grace period (stop_grace_period in Compose). The platform itself doesn't add a layer on top.
Rollback: Manual — redeploy a previous Git commit or Docker image tag via the Coolify dashboard.
Limitation: Coolify's zero-downtime implementation is the thinnest on this list. Health check failure handling and automatic rollback are less robust than dedicated deployment platforms. Connection draining relies on your own SIGTERM handling rather than platform-level guarantees.
Verdict: Functional zero-downtime for teams self-hosting Coolify who configure Docker health checks carefully. Not the right choice if zero-downtime is a hard requirement without significant additional configuration.
Mechanism: Kamal (Basecamp's deployment tool) implements blue-green deployment using Docker container labels and a Traefik proxy. It boots the new container alongside the old one, waits for health checks, flips the Traefik router, then stops the old container.
kamal deploy
What Kamal actually does:
# Kamal's deploy sequence (simplified)
1. Push new image to registry
2. Pull image on all target servers
3. Boot new container (new slot)
4. Poll health check endpoint
5. Update Traefik labels → traffic shifts to new container
6. Stop old container after drain period
Health check control: Configured in config/deploy.yml. Kamal polls a health check URL before marking the deployment successful.
# config/deploy.yml
healthcheck:
path: /health
port: 3000
max_attempts: 10
interval: 3
Connection draining: Traefik handles draining. The drain window depends on Traefik's configuration and your app's SIGTERM handling.
Rollback: kamal rollback <version> — keeps the previous image on the server for fast rollback.
Limitation: Kamal is a tool, not a managed platform. You're responsible for the servers, the Traefik configuration, the SSH access, the image registry, and debugging failures. Zero-downtime is achievable but requires significant DevOps investment.
One version note: the Traefik-based flow described here matches Kamal 1. Kamal 2 replaced Traefik with its own kamal-proxy, though the per-host sequence is the same. On a single host, the deploy is blue-green (new container boots alongside the old, the proxy switches, the old container stops). Across multiple hosts, the switch does not happen on every host at the same instant, so the rollout behaves like a rolling deploy at the fleet level: both old and new app versions serve traffic simultaneously, and database schema changes must be backward-compatible.
Verdict: Kamal gives you full control over the blue-green mechanism (per host) at the cost of managing everything yourself. Right for teams who want Heroku-like UX on their own servers and are comfortable with the operational overhead.
Zero-downtime marketing copy is easy. Before you trust a platform with your production traffic, verify it actually does these five things:
The platform must poll a health endpoint after the container starts and before routing any traffic. A health check that runs during the build phase (like Vercel's) or that only checks TCP connectivity misses the most common failure mode: an app that boots but can't connect to its database.
Your health endpoint must verify real dependencies:
// Express.js — checks actual readiness, not just process health
app.get('/health', async (req, res) => {
try {
await db.query('SELECT 1'); // database reachable?
await redis.ping(); // cache reachable?
res.status(200).json({ status: 'ok' });
} catch (err) {
res.status(503).json({ status: 'error', detail: err.message });
}
});
Ask the platform: "What happens if my health check returns 503 after the container starts?" The answer should be "the deploy fails and the old version keeps serving."
Long-running requests — file uploads, streaming responses, long-polling connections — need time to complete after traffic stops routing to the old container. A platform with a hardcoded 5-second drain will silently drop requests that take 6 seconds.
The platform must let you configure drain timeout per service. Typical values: 15-30 seconds for web apps, 60+ seconds for services with long-running operations.
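To make the drain window concrete, here is a minimal sketch of what "wait for in-flight requests, but only up to a timeout" looks like. `drain` and `getInFlight` are hypothetical names; a real proxy tracks open connections internally rather than through a callback.

```javascript
// Drain sketch: wait until in-flight requests reach zero or the timeout expires.
// `getInFlight` is a hypothetical callback returning the current open-request count.
async function drain(getInFlight, { timeoutMs = 30000, pollMs = 100 } = {}) {
  const deadline = Date.now() + timeoutMs;
  while (getInFlight() > 0) {
    if (Date.now() >= deadline) {
      // Anything still open at this point is cut off when the container stops.
      return { drained: false, dropped: getInFlight() };
    }
    await new Promise((resolve) => setTimeout(resolve, pollMs));
  }
  return { drained: true, dropped: 0 };
}
```

The `dropped` count is the reason the timeout has to be configurable: a request that outlives the window is lost no matter how cleanly everything else is handled.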
If the new version fails health checks after deployment, the old version must keep running without manual intervention. Platforms that require you to manually redeploy the previous version add a recovery lag during which your users see errors.
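Verify this by asking: "If my new deploy fails its health check at deploy time, what happens?" The answer should be that the old container keeps serving and the deploy is marked failed. Then ask the harder question: "If my new deploy passes the initial health check but the check starts failing 60 seconds later, what happens?" The honest answer is usually "nothing automatic, because the old deploy was already stopped." That case needs monitoring and a fast manual rollback. For the deploy-time failure case, though, the old container must be the safety net.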
The moment traffic stops going to the old container must be the same moment it starts going to the new container. Any gap — even 100ms — causes connection errors. This requires a hot-swap mechanism (Pingora upstream reload, Traefik label update, Nginx reload) rather than stopping the old proxy and starting a new one.
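The idea of a gap-free switch can be shown with a toy in-process router. This is only a sketch of the pattern, not how Pingora, Traefik, or Nginx implement it. `UpstreamRouter` and the addresses are made up for illustration.

```javascript
// Atomic pointer swap sketch. Node runs this code on a single thread, so one
// assignment is the whole switch: every pick() sees the old or the new upstream,
// never an empty value in between.
class UpstreamRouter {
  constructor(initial) {
    this.active = initial;
    this.draining = null;
  }

  // Called once per incoming request to choose a backend.
  pick() {
    return this.active;
  }

  // Point new traffic at `next`; the previous upstream is returned for draining.
  swap(next) {
    this.draining = this.active;
    this.active = next;
    return this.draining;
  }
}

const router = new UpstreamRouter('127.0.0.1:8001');
router.swap('127.0.0.1:8002');
console.log(router.pick()); // 127.0.0.1:8002
```

Contrast this with a stop-then-start sequence, where there is a moment with no active upstream at all. Requests that arrive in that moment are the ones that fail.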
The platform must send SIGTERM to the old container before SIGKILL, and wait for the drain timeout before escalating. Many platforms do this. Your application also needs to handle SIGTERM correctly:
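// Node.js: graceful shutdown on SIGTERM
process.on('SIGTERM', () => {
  // Safety net: exit even if a connection never closes.
  // 10s is an example value; keep it below the platform's drain timeout.
  setTimeout(() => process.exit(1), 10000).unref();
  // Stop accepting new connections; the callback runs once in-flight requests finish.
  server.close(async () => {
    await db.pool.end(); // assumes end() returns a promise, as in pg
    process.exit(0);
  });
});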
Without SIGTERM handling in your app, connection draining at the platform level won't help — your process will drop connections when it exits regardless.
The core mechanism behind the best zero-downtime implementations is an atomic pointer update in the reverse proxy. Here's what it looks like at the Nginx level (the DIY version that Temps and Kamal automate):
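#!/bin/bash
# Atomic upstream swap: the same pattern Temps' Pingora implements
set -euo pipefail
UPSTREAM_CONF="/etc/nginx/conf.d/active-upstream.conf"
NEW_PORT="${1:?usage: $0 <port>}" # 8001 for blue, 8002 for green
# Map the port to slot names (service names web-blue / web-green are assumed)
if [ "$NEW_PORT" = "8001" ]; then SLOT=blue; OLD_SLOT=green; else SLOT=green; OLD_SLOT=blue; fi
# 1. Boot new container on $NEW_PORT
docker compose up -d --build "web-${SLOT}"
# 2. Wait for health check (no traffic yet); give up after 30 attempts
for attempt in $(seq 1 30); do
  curl -sf "http://localhost:${NEW_PORT}/health" > /dev/null && break
  [ "$attempt" -eq 30 ] && { echo "health check never passed; old slot still serving" >&2; exit 1; }
  sleep 2
done
# 3. Atomic swap: update upstream pointer + reload Nginx
# Nginx reload is graceful: new workers get new config,
# old workers finish in-flight requests before exiting
echo "server 127.0.0.1:${NEW_PORT};" > "$UPSTREAM_CONF"
nginx -t # refuse to reload a config that does not parse
nginx -s reload
# 4. Drain: wait for old container's connections to finish
sleep 30
# 5. Stop old container
docker compose stop "web-${OLD_SLOT}"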
The key insight: nginx -s reload is not a restart. New worker processes start with the updated config while old workers continue serving until their current requests complete. The traffic switch takes effect for new connections the moment the reload completes — in-flight requests on old workers are unaffected.
Temps' Pingora implementation replaces this bash script with a hot-reload of the upstream configuration — a pointer swap in memory rather than spawning new OS worker processes the way an Nginx reload does. Kamal uses Traefik's router label update. The pattern is the same across all three: update a pointer atomically, let old workers drain, stop old containers.
For the complete DIY walkthrough including Docker Compose config, the deploy script, and load testing verification, see Zero-Downtime Docker Deployments: Blue-Green Setup, DB Migrations & Verification.
| Platform | Mechanism | Health Check Depth | Auto Rollback | Drain Control | DIY Required |
|---|---|---|---|---|---|
| Temps | Immutable image + atomic Pingora swap | Deep (configurable endpoint) | Yes | Yes (configurable) | None |
| Vercel | Immutable build + atomic alias | Shallow (build-time only) | Yes, at build time (alias not flipped) | Edge-managed | None |
| Fly.io | Blue-green or rolling (configurable) | Good (TOML config) | Manual | Built-in | TOML config |
| Render | Rolling | Good (dashboard/YAML) | Manual | 30s default | Minimal |
| Railway | Rolling | Good (dashboard) | One-click | Configurable | Minimal |
| Coolify | Rolling (Docker HEALTHCHECK) | Basic | Manual | SIGTERM only | Moderate |
| Kamal | Manual blue-green (Traefik) | Configurable | kamal rollback | Traefik config | Significant |
A zero-downtime deployment is a release strategy where new application code reaches production without dropping any in-flight requests or showing errors to users. It requires three mechanisms working together: health check gating (new version receives no traffic until it's ready), connection draining (old version finishes in-flight requests before stopping), and an atomic route switch (no gap between old traffic off and new traffic on). The goal is that users experience no errors, latency spikes, or service interruptions during the deployment window.
Temps and Fly.io both implement blue-green deployments where the new version runs alongside the old one, traffic switches atomically, and the old version continues serving until fully drained. Kamal also implements blue-green but requires significant manual configuration of Traefik and your server infrastructure. Vercel's immutable deployment model achieves the same effect — atomic alias flip to an immutable artifact — without running two containers simultaneously.
Blue-green maintains two full environments and flips traffic in a single atomic step. Zero users see the new version until 100% of traffic switches. Rolling updates replace instances one at a time, so during the deploy window, some requests go to the old version and some to the new version. Blue-green is simpler and provides instant rollback but costs 2× resources. Rolling is more resource-efficient but requires both old and new versions to handle the same requests simultaneously — which means database schema changes must be backward-compatible during the deploy window.
Yes, but you must follow the expand-and-contract pattern. Never drop a column or make a breaking schema change in the same deploy that stops using it. Add new columns first (expand), deploy code that reads from both old and new columns, migrate data, then deploy code that only reads the new column, then remove the old column in a final deploy (contract). During a rolling or blue-green deploy, both old and new application versions run simultaneously and must work with the same database schema. Any migration that breaks either version causes errors during the deploy window.
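The middle of an expand-and-contract migration is where application code has to tolerate both schemas. A minimal sketch of that phase, assuming a hypothetical rename from a `full_name` column to a `display_name` column:

```javascript
// Expand-phase sketch. Column names `full_name` (old) and `display_name` (new)
// are hypothetical; rows are plain objects as a SQL client would return them.

// Read path: prefer the new column, fall back to the old one for unmigrated rows.
function readName(row) {
  return row.display_name ?? row.full_name ?? null;
}

// Write path: populate both columns so either app version can read the row.
function nameColumns(name) {
  return { full_name: name, display_name: name };
}

console.log(readName({ full_name: 'Ada' }));                      // Ada
console.log(readName({ full_name: 'Ada', display_name: 'A.L.' })); // A.L.
```

Only after every row is backfilled and no running version reads `full_name` is it safe to ship the contract step that drops the old column.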
Run a continuous load test with a tool like hey while triggering a deployment. If you see only 200-status responses in the output, your deployment is truly zero-downtime. Any 502, 503, or connection errors indicate dropped requests. Run the test after every change to your deployment configuration — a setup that works in staging can break under production load patterns. For Temps deployments, the built-in metrics dashboard shows request error rates before, during, and after each deploy.
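If hey is not available, the same check can be approximated with a short script. The sketch below is a simpler, sequential probe rather than a concurrent load generator. `probeDuringDeploy` and its options are hypothetical names, and it relies on the global `fetch` in Node 18+ unless a replacement is passed in.

```javascript
// Deploy probe sketch: send requests at a fixed interval and tally the outcomes.
// `fetchFn` defaults to the global fetch (Node 18+); the URL is supplied by the caller.
async function probeDuringDeploy(url, { requests = 100, intervalMs = 50, fetchFn = fetch } = {}) {
  const tally = { ok: 0, failed: 0, statuses: {} };
  for (let i = 0; i < requests; i += 1) {
    let key;
    try {
      const res = await fetchFn(url);
      key = String(res.status);
    } catch {
      key = 'connection_error'; // refused or reset connections count as failures
    }
    tally.statuses[key] = (tally.statuses[key] || 0) + 1;
    if (key === '200') tally.ok += 1;
    else tally.failed += 1;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  return tally;
}
```

Start the probe, trigger the deploy while it runs, and inspect the result: a zero-downtime deploy should end with `failed` equal to 0. Because requests are sent one at a time, this catches gaps in availability but will not reproduce production concurrency.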
Your health check endpoint must verify real application readiness, not just that the process is running. At minimum, check that your database connection pool can execute a query (SELECT 1) and that any required caches or queues are reachable. Return HTTP 200 when ready, HTTP 503 when not. Avoid returning 200 before your application is genuinely ready to serve traffic — a premature 200 causes the platform to route requests to a container that will immediately return 500 errors. The health check endpoint itself should be fast (under 100ms) and should not perform operations that could affect normal traffic.
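One way to keep the endpoint both thorough and fast is to run the dependency checks in parallel and give each a deadline. The following is a sketch under assumptions: `readiness` and the shape of `checks` are hypothetical, and the 100ms default simply mirrors the guideline above.

```javascript
// Readiness sketch: run every dependency check in parallel, each with a deadline.
// `checks` maps a name to a hypothetical function that throws or rejects on failure.
async function readiness(checks, timeoutMs = 100) {
  const withTimeout = (promise) => {
    let timer;
    const deadline = new Promise((_, reject) => {
      timer = setTimeout(() => reject(new Error('timeout')), timeoutMs);
    });
    return Promise.race([promise, deadline]).finally(() => clearTimeout(timer));
  };

  const names = Object.keys(checks);
  const results = await Promise.allSettled(
    // Promise.resolve().then(...) turns synchronous throws into rejections.
    names.map((name) => withTimeout(Promise.resolve().then(() => checks[name]())))
  );
  const failed = names.filter((_, i) => results[i].status === 'rejected');
  return { status: failed.length === 0 ? 200 : 503, failed };
}
```

In an Express handler this might be called as `readiness({ db: () => db.query('SELECT 1'), cache: () => redis.ping() })`, with the returned `status` used as the response code. A slow dependency then produces a 503 instead of a hung health check.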
Zero-downtime deployment is achievable on any of these platforms. The difference is how much you have to configure and maintain.
If you're starting fresh or want zero operational overhead, Temps handles the full zero-downtime pipeline — Pingora hot-reload, health check gating, automatic rollback, connection draining — from a single git push. Vercel matches this for serverless and Next.js workloads. Fly.io gives you the most strategy flexibility. Render and Railway work well with less configuration than Fly.io. Kamal and Coolify make sense if you're already self-hosting and want to stay in full control.
Whatever platform you choose, verify it with a load test: spin up hey, trigger a deploy, and confirm zero non-200 responses. Don't take zero-downtime on faith.
# Install Temps and get zero-downtime deploys from the first push
curl -fsSL temps.sh/install.sh | bash