May 26, 2026
Written by Temps Team
Last updated May 26, 2026
In 2026, the best platforms for zero-downtime deployments are Temps, Vercel, and Fly.io — each with a fundamentally different mechanism, and the mechanism matters more than the marketing copy. Zero dropped requests during deployment used to require Kubernetes, a dedicated SRE team, and a week of configuration. Today it ships in the default workflow on platforms that handle the entire pipeline for you.
This guide ranks seven platforms by how well they actually achieve zero downtime, explains the three underlying strategies, and tells you exactly what to require from any platform before trusting it with your production traffic.
TL;DR: Temps and Vercel achieve the cleanest zero-downtime through atomic traffic swaps — old version serves until the new version is fully healthy, then a single flip. Fly.io supports explicit blue-green alongside rolling. Render and Railway use rolling deploys that work well at scale. Kamal is blue-green per host but requires the most manual setup. Coolify's rolling implementation is functional but has the thinnest health-check control of the group.
For containerized apps, Temps has the most complete zero-downtime implementation: health-check-gated deployment, atomic route table switch, in-process route reload via Pingora (Cloudflare's open-source proxy engine), and automatic rollback when the route table doesn't confirm in time. The old container keeps serving until the new container is routable — verified by the proxy itself — then old containers are torn down afterward, outside the critical path.
For serverless and Next.js, Vercel's immutable deployment model is hard to beat. Every deployment is a content-addressed artifact; the production domain flips to it in one atomic alias swap.
For maximum deployment strategy flexibility, Fly.io lets you choose blue-green, rolling, canary, or immediate per deploy.
The detailed breakdown follows.
Before ranking platforms, you need to understand what they're actually doing under the hood. Three strategies cover nearly all deployments — and each has different trade-offs in cost, risk, and recovery speed.
| Strategy | How It Works | Resource Overhead | Rollback Speed | Best For |
|---|---|---|---|---|
| Blue-green | Two identical environments; traffic switches atomically from old to new | 2× always | Instant (one flip back) | Single-server apps, APIs requiring zero risk |
| Immutable + atomic alias | New build deployed as immutable artifact; edge routing alias flipped in one step | Minimal (old artifact kept briefly) | Instant | Serverless, static, edge functions |
| Rolling | Instances replaced one at a time; old version keeps serving during rollover | ~1.3× during deploy | Seconds (scale-in new version) | Multi-instance, stateless services |
Immutable image swap is a variant of blue-green specific to containerized platforms: the new image is built and health-checked before the reverse proxy's upstream pointer is atomically updated. No traffic reaches the new container until it passes health checks.
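The resource-overhead column in the strategy table can be sanity-checked with a little arithmetic. The sketch below is illustrative only: `peakOverhead`, `instances`, and `surge` are hypothetical names, and it assumes rolling deploys add a fixed number of extra instances at a time. Under that assumption, three instances with one surge instance gives roughly the ~1.3× shown in the table.

```javascript
// Peak capacity during a deploy, relative to steady state.
// `instances` and `surge` are illustrative parameters, not platform settings.
function peakOverhead(strategy, instances, surge = 1) {
  if (instances < 1) throw new RangeError('instances must be >= 1');
  switch (strategy) {
    case 'blue-green':
      // A full second environment runs alongside the first.
      return (instances * 2) / instances;
    case 'rolling':
      // Only `surge` extra instances exist at any moment.
      return (instances + Math.min(surge, instances)) / instances;
    default:
      throw new Error(`unknown strategy: ${strategy}`);
  }
}

console.log(peakOverhead('blue-green', 3)); // 2
console.log(peakOverhead('rolling', 3).toFixed(2)); // 1.33
```

Note that with a single instance, rolling also peaks at 2×, which is why the savings only show up on multi-instance services.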
Each mechanism requires the same three ingredients to work correctly: a real health check (not a fake 200), connection draining (old connections finish before the container stops), and an atomic route switch (no gap between old traffic off and new traffic on).
Mechanism: Temps builds a new container image, starts it, and blocks all traffic until the health check passes. Once the container is healthy, the route table is updated atomically — current_deployment_id is written to the database and an in-process ForceRouteReload is published to the Pingora-based proxy. The proxy confirms the new routes are live before the deployment is marked complete. Old containers are stopped only after route table confirmation, so they never go offline while they're still needed.
What happens on git push:
# Entire deploy workflow — Temps handles the rest
git push temps main
1. New Docker image builds in isolation
2. New container starts (old container still serving all traffic)
3. Pingora polls the health check path until it returns a passing status
4. Route table atomically updated (DB write + in-process ForceRouteReload + PG NOTIFY for workers)
5. Proxy confirms routes are live → deployment marked complete
6. Old containers torn down (outside the route-switch critical path)
Health check control: Full. Temps blocks traffic until the health check path returns a success status (2xx, 3xx) or a valid 4xx (404/405 are accepted — your health path may not exist). The check times out after 300 seconds by default; deployments that fail the health check are marked failed and the route table reverts to the last successful deployment automatically.
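To make the gating rule concrete, here is a minimal sketch of a health-check poll loop. It is not Temps' actual code: `isAcceptedStatus` and `waitUntilHealthy` are hypothetical names, the 2-second polling interval is an assumption, and only the accepted status ranges and the 300-second timeout come from the description above.

```javascript
// Which statuses pass the gate, per the description above:
// 2xx, 3xx, and the two tolerated 4xx codes.
function isAcceptedStatus(status) {
  return (status >= 200 && status < 400) || status === 404 || status === 405;
}

// Poll `probe` (an async function returning an HTTP status) until it passes
// or the deadline is reached. `sleep` and `now` are injectable for testing.
async function waitUntilHealthy(probe, options = {}) {
  const {
    timeoutMs = 300_000, // 300 s default from the text
    intervalMs = 2_000, // assumed polling interval
    sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms)),
    now = Date.now,
  } = options;
  const deadline = now() + timeoutMs;
  while (now() < deadline) {
    try {
      if (isAcceptedStatus(await probe())) return true;
    } catch {
      // Connection refused or similar: the container is not listening yet.
    }
    await sleep(intervalMs);
  }
  return false; // caller marks the deployment failed; old version keeps serving
}
```

A caller might pass `() => fetch(url).then((res) => res.status)` as the probe and only proceed to the route switch when the function resolves to `true`.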
Automatic rollback: Built in. If the route table update doesn't confirm within 60 seconds, current_deployment_id is automatically reverted to the last successful deployment. The old containers are never torn down in the failure path, so rollback is instant.
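The confirm-or-revert behavior can be sketched in a few lines. Everything here is hypothetical: `store` stands in for the database row holding current_deployment_id, `confirm` stands in for the proxy acknowledging the new routes, and `switchRoute` is an illustrative name rather than a Temps API. Only the 60-second window comes from the text.

```javascript
// Write the new pointer, wait for confirmation, and revert if it never arrives.
async function switchRoute(store, newId, confirm, confirmTimeoutMs = 60_000) {
  const previousId = store.currentDeploymentId;
  store.currentDeploymentId = newId; // single pointer write

  let timer;
  const timedOut = new Promise((resolve) => {
    timer = setTimeout(() => resolve(false), confirmTimeoutMs);
  });

  let confirmed = false;
  try {
    confirmed = await Promise.race([confirm(newId), timedOut]);
  } catch {
    confirmed = false; // treat a proxy error like a missed confirmation
  } finally {
    clearTimeout(timer);
  }

  // Old containers are still running at this point, so reverting is just
  // another pointer write.
  if (!confirmed) store.currentDeploymentId = previousId;
  return { ok: Boolean(confirmed), active: store.currentDeploymentId };
}
```

The design point is ordering: teardown of the old containers happens only after this function reports `ok: true`, so the failure path never has to restart anything.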
Promotion across environments: Temps supports promoting a staging deployment to production in under 30 seconds using the same Docker image hash — no rebuild, just a container swap. This is a first-class feature, not a workaround.
Self-hosted: Free. Single Rust binary, Apache 2.0. Temps Cloud (~$6/month, Hetzner cost + 30%) provides the same capability managed, with no per-seat fees and no bandwidth bills.
Verdict: The most complete zero-downtime implementation in this list. Atomic route switch, health-check-gated traffic, automatic rollback on failure, built-in uptime monitoring, error tracking, and session replay — all from a single git push with no additional tooling.
For the full technical deep-dive into how this pattern is implemented, see Zero-Downtime Docker Deployments: Blue-Green Setup, DB Migrations & Verification.
Mechanism: Every Vercel deployment is an immutable, content-addressed artifact. When you push, Vercel builds the new version and assigns it a unique preview URL. Once healthy, a single alias swap routes your production domain to the new deployment — atomically, at Vercel's edge routing layer (an internal routing table flip, not a DNS record change).
What makes it work:
Health check control: Limited. Vercel's health checks run during the build phase, not post-deployment. If your app boots but behaves incorrectly at runtime (bad env var, failed DB connection), traffic still routes to it.
Connection draining: Handled at the edge. Serverless function invocations that were in-flight complete on the old deployment.
Rollback: Instant — re-alias to any previous immutable deployment URL.
Limitation: The immutable model works beautifully for stateless apps and Next.js. For long-running processes or WebSocket-heavy apps, Vercel's serverless model changes the problem entirely.
Verdict: Excellent zero-downtime for the serverless/JAMstack use case. Health check depth is shallower than Temps, but the immutable model means the worst case is a broken new deployment that you can instantly revert.
Mechanism: Fly.io supports explicit blue-green deployment via the --strategy flag. New Machines boot with the new image, pass health checks, then traffic switches. Old Machines are stopped after the drain timeout.
fly deploy --strategy bluegreen
Available strategies:
- bluegreen: full blue-green. New Machines boot and are health-checked, traffic is switched, then old Machines are stopped
- rolling: the default. Machines are replaced one at a time
- canary: one Machine gets the new version first; if it passes checks, the rest roll out (a single-machine smoke test, not weighted traffic splitting)
- immediate: stops old Machines first (causes downtime; don't use in production)

Health check control: Good. Fly.io's [[services.tcp_checks]] and [[services.http_checks]] in fly.toml are evaluated before traffic shifts. The grace period (time before checks start) is configurable.
[[services.http_checks]]
interval = 10000
timeout = 2000
grace_period = "5s"
method = "get"
path = "/health"
protocol = "http"
restart_on_timeout = false
Connection draining: Built in. Fly.io's proxy handles draining before stopping old Machines.
Rollback: fly releases shows all releases; fly deploy --image <previous-image> redeploys an earlier one.
Verdict: The most flexible of the managed platforms — you can choose your zero-downtime strategy per deploy. Blue-green on Fly.io is genuinely production-quality. The trade-off is TOML config complexity and regional routing awareness you need to manage yourself.
Mechanism: Render deploys new containers one instance at a time. The old instance keeps serving while the new one starts and passes health checks. Traffic routes to the new instance only after it's healthy.
Health check control: Available via the Render dashboard and render.yaml. Custom health check paths and thresholds are supported.
# render.yaml
services:
  - type: web
    name: my-app
    healthCheckPath: /health
Connection draining: Render drains connections from instances being replaced, with a default 30-second window.
Rollback: Manual via the Render dashboard — redeploy a previous commit.
Limitation: Rolling deploys mean both old and new versions run simultaneously during the deploy window. If your new version ships a breaking database schema change, whichever version does not match the current schema will fail: old instances break once the migration has run, and new instances break if they start before it. You must implement the expand-and-contract migration pattern.
Verdict: Solid zero-downtime for multi-instance deployments. Health check integration is straightforward. Not suitable for single-instance deployments where rolling doesn't help (the instance is replaced, not supplemented).
Mechanism: Railway's rolling deploy creates new replicas with the new image, waits for them to become healthy, routes traffic to them, then terminates old replicas. The platform handles the orchestration automatically.
Health check control: Railway supports HTTP health checks configured in the Railway dashboard or railway.toml. The check path, interval, and timeout are configurable.
Connection draining: Railway sends a SIGTERM to old replicas and waits for a configurable drain period before SIGKILL.
Rollback: One-click rollback in the Railway dashboard to any previous deployment.
Limitation: Like Render, simultaneous old/new versions during rolling deploys require migration-safe database schema changes. Railway's health check configuration is less granular than Fly.io's.
Verdict: Good zero-downtime for stateless workloads. The developer experience is polished — rollbacks and deployment history are first-class features. Less control over deployment strategy than Fly.io.
Mechanism: Coolify performs rolling deploys by starting the new container, waiting for it to pass health checks, then stopping the old one. For single-container services, this means a brief overlap window.
Health check control: Coolify reads Docker's HEALTHCHECK instruction from your Dockerfile, or you can configure a health check URL in the Coolify dashboard. The overlap window depends on how quickly your container passes its Docker health check.
HEALTHCHECK --interval=5s --timeout=3s --retries=3 \
  CMD curl -f http://localhost:3000/health || exit 1
Connection draining: Limited. Coolify's drain behavior depends on your application's SIGTERM handling and Docker's stop grace period (stop_grace_period in Compose). The platform itself doesn't add a layer on top.
Rollback: Manual — redeploy a previous Git commit or Docker image tag via the Coolify dashboard.
Limitation: Coolify's zero-downtime implementation is the thinnest on this list. Health check failure handling and automatic rollback are less robust than dedicated deployment platforms. Connection draining relies on your own SIGTERM handling rather than platform-level guarantees.
Verdict: Functional zero-downtime for teams self-hosting Coolify who configure Docker health checks carefully. Not the right choice if zero-downtime is a hard requirement without significant additional configuration.
Mechanism: Kamal (Basecamp's deployment tool) implements blue-green deployment using Docker container labels and a Traefik proxy. It boots the new container alongside the old one, waits for health checks, flips the Traefik router, then stops the old container.
kamal deploy
What Kamal actually does:
# Kamal's deploy sequence (simplified)
1. Push new image to registry
2. Pull image on all target servers
3. Boot new container (new slot)
4. Poll health check endpoint
5. Update Traefik labels → traffic shifts to new container
6. Stop old container after drain period
Health check control: Configured in config/deploy.yml. Kamal polls a health check URL before marking the deployment successful.
# config/deploy.yml
healthcheck:
  path: /health
  port: 3000
  max_attempts: 10
  interval: 3
Connection draining: Traefik handles draining. The drain window depends on Traefik's configuration and your app's SIGTERM handling.
Rollback: kamal rollback <version> — keeps the previous image on the server for fast rollback.
Limitation: Kamal is a tool, not a managed platform. You're responsible for the servers, the Traefik configuration, the SSH access, the image registry, and debugging failures. Zero-downtime is achievable but requires significant DevOps investment.
One version note: the Traefik-based flow described here matches Kamal 1. Kamal 2 replaces Traefik with its own kamal-proxy, though the sequence (boot the new container, health-check it, switch traffic, stop the old container) is the same in outline. On a single host, the deploy is blue-green. Across multiple hosts, the switch happens per host rather than as one fleet-wide flip, so old and new app versions can serve traffic simultaneously during the rollout, and database schema changes must be backward-compatible.
Verdict: Kamal gives you full control over the blue-green mechanism (per host) at the cost of managing everything yourself. Right for teams who want Heroku-like UX on their own servers and are comfortable with the operational overhead.
| Feature | Temps | Vercel | Railway |
|---|---|---|---|
| Zero-downtime strategy | Health-check-gated + atomic route switch | Immutable artifact + atomic alias | Rolling replicas |
| Rollback speed | Instant (automatic on failure, or temps deployments rollback) | Instant (re-alias) | One-click dashboard |
| Health checks | Configurable path, 300s timeout, HTTP polling | Build-time only | HTTP, configurable |
| Automatic rollback on failure | Yes — route table auto-reverts | No (alias doesn't flip on build failure) | Manual |
| Self-hostable | Yes — free, Apache 2.0 | No | No |
| Built-in monitoring | Uptime, error tracking, session replay, analytics | External tools required | External tools required |
| Pricing model | ~$6/mo Cloud, or self-host free; no per-seat fees | See pricing page | See pricing page |
| Managed databases | Yes (Postgres, Redis, MongoDB, RustFS) | Partial (Postgres via partners) | Yes |
| Promotion (staging → prod) | Yes, same image hash, <30s | Manual redeploy | Manual redeploy |
Zero-downtime marketing copy is easy. Before you trust a platform with your production traffic, verify it actually does these five things:
The platform must poll a health endpoint after the container starts and before routing any traffic. A health check that runs during the build phase (like Vercel's) or that only checks TCP connectivity misses the most common failure mode: an app that boots but can't connect to its database.
Your health endpoint must verify real dependencies:
// Express.js — checks actual readiness, not just process health
app.get('/health', async (req, res) => {
  try {
    await db.query('SELECT 1'); // database reachable?
    await redis.ping(); // cache reachable?
    res.status(200).json({ status: 'ok' });
  } catch (err) {
    res.status(503).json({ status: 'error', detail: err.message });
  }
});
Ask the platform: "What happens if my health check returns 503 after the container starts?" The answer should be "the deploy fails and the old version keeps serving."
Long-running requests — file uploads, streaming responses, long-polling connections — need time to complete after traffic stops routing to the old container. A platform with a hardcoded 5-second drain will silently drop requests that take 6 seconds.
The platform must let you configure drain timeout per service. Typical values: 15-30 seconds for web apps, 60+ seconds for services with long-running operations.
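On the application side, a drain window amounts to "wait for in-flight requests to reach zero, but not forever." A minimal sketch, assuming you wire `begin()` and `end()` into your own request handling; `DrainGate` is a hypothetical helper, not part of any platform or library:

```javascript
// Tracks in-flight requests and lets shutdown code wait for them to finish.
class DrainGate {
  constructor() {
    this.inFlight = 0;
    this.waiters = [];
  }

  begin() {
    this.inFlight += 1;
  }

  end() {
    this.inFlight -= 1;
    if (this.inFlight === 0) {
      // Wake everyone waiting on the drain.
      this.waiters.splice(0).forEach((resolve) => resolve(true));
    }
  }

  // Resolves true if all requests finished, false if the window ran out.
  drain(timeoutMs) {
    if (this.inFlight === 0) return Promise.resolve(true);
    return new Promise((resolve) => {
      const timer = setTimeout(() => resolve(false), timeoutMs);
      this.waiters.push((value) => {
        clearTimeout(timer);
        resolve(value);
      });
    });
  }
}
```

With a 5-second window, a 6-second upload makes `drain(5_000)` resolve to `false`, which is exactly the silent-drop case described above; a 30-second window would let it finish.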
If the new version fails health checks after deployment, the old version must keep running without manual intervention. Platforms that require you to manually redeploy the previous version add a recovery lag during which your users see errors.
Verify this by asking: "If my new deploy fails its health check, what happens?" The answer should be "the deployment is marked failed and the old version keeps serving automatically."
The moment traffic stops going to the old container must be the same moment it starts going to the new container. Any gap — even 100ms — causes connection errors. This requires a hot-swap mechanism (Pingora upstream reload, Traefik label update, Nginx reload) rather than stopping the old proxy and starting a new one.
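Stripped of proxy specifics, an atomic switch is a single pointer replacement: new requests read the new target, while requests that already picked a target keep it. The sketch below is a toy model with hypothetical names (`RouteTable`, `pick`, `swap`), not how Pingora, Traefik, or Nginx are implemented.

```javascript
class RouteTable {
  #upstream;

  constructor(upstream) {
    this.#upstream = upstream;
  }

  // Called once per incoming request; the request keeps this target
  // until it finishes.
  pick() {
    return this.#upstream;
  }

  // Replace the pointer in one assignment and return the old target
  // so the caller can drain it.
  swap(next) {
    const previous = this.#upstream;
    this.#upstream = next;
    return previous;
  }
}

const table = new RouteTable('127.0.0.1:8001');
const inFlight = table.pick(); // request that started before the deploy
const old = table.swap('127.0.0.1:8002');
console.log(inFlight, table.pick(), old);
// 127.0.0.1:8001 127.0.0.1:8002 127.0.0.1:8001
```

There is no intermediate state in which the table has no upstream, which is the property that stop-then-start proxy restarts lack.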
The platform must send SIGTERM to the old container before SIGKILL, and wait for the drain timeout before escalating. Many platforms do this. Your application also needs to handle SIGTERM correctly:
// Node.js — graceful shutdown on SIGTERM
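process.on('SIGTERM', () => {
  // Stop accepting new connections; the callback runs once in-flight requests finish
  server.close(async () => {
    await db.pool.end(); // wait for the pool to close before exiting
    process.exit(0);
  });
  // Safety net: force exit if draining outlasts the platform's drain window
  setTimeout(() => process.exit(1), 25000).unref();
});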
Without SIGTERM handling in your app, connection draining at the platform level won't help — your process will drop connections when it exits regardless.
The core mechanism behind the best zero-downtime implementations is an atomic pointer update in the reverse proxy. Here's what it looks like at the Nginx level (the DIY version that Temps and Kamal automate):
#!/bin/bash
# Atomic upstream swap: the same pattern Temps' Pingora implements
# Usage: $0 blue|green (the slot that receives the new version)
set -euo pipefail
UPSTREAM_CONF="/etc/nginx/conf.d/active-upstream.conf"
SLOT="${1:-}"
case "$SLOT" in
  blue)  NEW_PORT=8001; OLD_SLOT=green ;;
  green) NEW_PORT=8002; OLD_SLOT=blue ;;
  *) echo "usage: $0 blue|green" >&2; exit 1 ;;
esac
# 1. Boot new container on $NEW_PORT
docker compose up -d --build "web-${SLOT}"
# 2. Wait for health check (no traffic yet)
until curl -sf "http://localhost:${NEW_PORT}/health" > /dev/null; do
  sleep 2
done
# 3. Atomic swap: write the new upstream pointer to a temp file, rename it
#    into place (rename is atomic), validate the config, then reload Nginx.
#    Nginx reload is graceful: new workers get new config,
#    old workers finish in-flight requests before exiting
echo "server 127.0.0.1:${NEW_PORT};" > "${UPSTREAM_CONF}.tmp"
mv "${UPSTREAM_CONF}.tmp" "$UPSTREAM_CONF"
nginx -t
nginx -s reload
# 4. Drain: wait for old container's connections to finish
sleep 30
# 5. Stop old container
docker compose stop "web-${OLD_SLOT}"
The key insight: nginx -s reload is not a restart. New worker processes start with the updated config while old workers continue serving until their current requests complete. The traffic switch takes effect for new connections the moment the reload completes — in-flight requests on old workers are unaffected.
Temps' Pingora implementation replaces this bash script with a route table update: current_deployment_id is atomically written to the database, an in-process ForceRouteReload is published to the proxy (which reloads its upstream table in memory), and the old containers are torn down only after the proxy confirms the new routes are live. Kamal uses Traefik's router label update. The pattern is the same across all three: update a pointer atomically, confirm the proxy sees it, then stop old containers.
For the complete DIY walkthrough including Docker Compose config, the deploy script, and load testing verification, see Zero-Downtime Docker Deployments: Blue-Green Setup, DB Migrations & Verification.
| Platform | Mechanism | Health Check Depth | Auto Rollback | Drain Control | DIY Required |
|---|---|---|---|---|---|
| Temps | Health-check-gated + atomic Pingora route switch | Deep (configurable endpoint, 300s timeout) | Yes (route table auto-reverts) | Yes (configurable) | None |
| Vercel | Immutable build + atomic alias | Shallow (build-time only) | No (alias doesn't flip on build failure) | Edge-managed | None |
| Fly.io | Blue-green or rolling (configurable) | Good (TOML config) | Manual | Built-in | TOML config |
| Render | Rolling | Good (dashboard/YAML) | Manual | 30s default | Minimal |
| Railway | Rolling | Good (dashboard) | Manual (one-click) | Configurable | Minimal |
| Coolify | Rolling (Docker HEALTHCHECK) | Basic | Manual | SIGTERM only | Moderate |
| Kamal | Manual blue-green (Traefik) | Configurable | Manual (kamal rollback) | Traefik config | Significant |
A zero-downtime deployment is a release strategy where new application code reaches production without dropping any in-flight requests or showing errors to users. It requires three mechanisms working together: health check gating (new version receives no traffic until it's ready), connection draining (old version finishes in-flight requests before stopping), and an atomic route switch (no gap between old traffic off and new traffic on). The goal is that users experience no errors, latency spikes, or service interruptions during the deployment window.
Fly.io supports explicit blue-green deployment via the --strategy bluegreen flag. Kamal also implements blue-green per host but requires manual Traefik configuration. Temps uses a health-check-gated atomic route switch that achieves the same outcome — old container keeps serving until new container is routable, then old container stops — without calling it blue-green by name. Vercel's immutable deployment model achieves the same effect through atomic alias flipping to an immutable artifact.
Blue-green maintains two full environments and flips traffic in a single atomic step. Zero users see the new version until 100% of traffic switches. Rolling updates replace instances one at a time, so during the deploy window, some requests go to the old version and some to the new version. Blue-green is simpler and provides instant rollback but costs 2× resources. Rolling is more resource-efficient but requires both old and new versions to handle the same requests simultaneously — which means database schema changes must be backward-compatible during the deploy window.
Yes, but you must follow the expand-and-contract pattern. Never drop a column or make a breaking schema change in the same deploy that stops using it. Add new columns first (expand), deploy code that reads from both old and new columns, migrate data, then deploy code that only reads the new column, then remove the old column in a final deploy (contract). During a rolling or blue-green deploy, both old and new application versions run simultaneously and must work with the same database schema. Any migration that breaks either version causes errors during the deploy window.
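In application code, the middle of expand-and-contract usually looks like "read from either column, write to both." A minimal sketch, assuming a hypothetical rename of a users.name column to users.full_name; the column and function names are illustrative only:

```javascript
// Expand phase: full_name exists but may not be backfilled for every row,
// so prefer the new column and fall back to the old one.
function readDisplayName(row) {
  return row.full_name ?? row.name ?? null;
}

// Transition phase: write both columns so old and new app versions
// see consistent data while they run side by side.
function buildUserWrite(displayName) {
  return { name: displayName, full_name: displayName };
}

console.log(readDisplayName({ name: 'Ada' })); // Ada
console.log(readDisplayName({ name: 'Ada', full_name: 'Ada Lovelace' })); // Ada Lovelace
```

Only after every running version reads exclusively from full_name is it safe to drop the dual write and, in a later deploy, the old column.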
Run a continuous load test with a tool like hey while triggering a deployment. If you see only 200-status responses in the output, your deployment is truly zero-downtime. Any 502, 503, or connection errors indicate dropped requests. Run the test after every change to your deployment configuration — a setup that works in staging can break under production load patterns. For Temps deployments, the built-in metrics dashboard shows request error rates before, during, and after each deploy.
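If you want the same pass/fail signal from a script, a rough stand-in can be written in Node. This is a sketch under assumptions, not a replacement for hey: it sends requests sequentially rather than concurrently, needs Node 18+ for the global `fetch`, and `probe` and `summarize` are hypothetical names.

```javascript
// Count responses by outcome. `results` holds HTTP status numbers, or the
// string 'error' for connection failures (an encoding chosen for this sketch).
function summarize(results) {
  const ok = results.filter((r) => r === 200).length;
  return {
    total: results.length,
    ok,
    failed: results.length - ok,
    zeroDowntime: results.length > 0 && ok === results.length,
  };
}

// Send requests back to back for `durationMs` and summarize the outcomes.
async function probe(url, durationMs) {
  const results = [];
  const end = Date.now() + durationMs;
  while (Date.now() < end) {
    try {
      const res = await fetch(url);
      await res.arrayBuffer(); // consume the body so the connection can be reused
      results.push(res.status);
    } catch {
      results.push('error');
    }
  }
  return summarize(results);
}
```

Start `probe(yourUrl, 60_000)` just before triggering a deploy; any `failed` count above zero means requests were dropped or errored during the switch.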
Your health check endpoint must verify real application readiness, not just that the process is running. At minimum, check that your database connection pool can execute a query (SELECT 1) and that any required caches or queues are reachable. Return HTTP 200 when ready, HTTP 503 when not. Avoid returning 200 before your application is genuinely ready to serve traffic — a premature 200 causes the platform to route requests to a container that will immediately return 500 errors. The health check endpoint itself should be fast (under 100ms) and should not perform operations that could affect normal traffic.
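One way to keep the endpoint fast is to put a deadline on each dependency check, so a hung database connection produces a quick 503 instead of a hung health check. A minimal sketch, with hypothetical helper names (`withTimeout`, `readiness`) and the 100ms budget from the paragraph above as the default:

```javascript
// Reject if `promise` does not settle within `ms`.
function withTimeout(promise, ms, label) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms}ms`)), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// `checks` maps a dependency name to an async function that throws on failure.
async function readiness(checks, ms = 100) {
  const entries = await Promise.all(
    Object.entries(checks).map(async ([name, check]) => {
      try {
        await withTimeout(check(), ms, name);
        return [name, 'ok'];
      } catch {
        return [name, 'failed'];
      }
    }),
  );
  const ready = entries.every(([, state]) => state === 'ok');
  return { status: ready ? 200 : 503, deps: Object.fromEntries(entries) };
}
```

In an Express handler you might call `readiness({ db: () => db.query('SELECT 1'), cache: () => redis.ping() })` and send the returned status and body as the response.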
Yes. Temps supports Next.js with the same git-push workflow. You get the same zero-downtime deployment mechanism plus built-in analytics, error tracking, session replay, and uptime monitoring that Vercel requires separate SaaS tools for. The main difference: Temps is self-hosted (free, Apache 2.0) or available managed at ~$6/month on Temps Cloud, vs Vercel's per-seat pricing and bandwidth fees.
Zero-downtime deployment is achievable on any of these platforms. The difference is how much you have to configure and maintain.
If you're starting fresh or want zero operational overhead, Temps handles the full zero-downtime pipeline — Pingora route switch, health check gating, automatic rollback, built-in observability — from a single git push. Vercel matches this for serverless and Next.js workloads. Fly.io gives you the most strategy flexibility. Render and Railway work well with less configuration than Fly.io. Kamal and Coolify make sense if you're already self-hosting and want to stay in full control.
Whatever platform you choose, verify it with a load test: spin up hey, trigger a deploy, and confirm zero non-200 responses. Don't take zero-downtime on faith.
# Install Temps and get zero-downtime deploys from the first push
curl -fsSL temps.sh/install.sh | bash