March 12, 2026
Written by Temps Team
Last updated March 12, 2026
Scale-to-zero stops containers when they receive no traffic and restarts them automatically on the next HTTP request — cutting idle resource costs by 60–80% for dev, staging, and preview environments without changing how developers access them. For teams running 10–30 non-production environments, that typically means dropping from $100–300/mo to $20–60/mo in idle infrastructure spend.
TL;DR: Scale-to-zero stops idle dev and staging containers after a configurable timeout and restarts them on the next request. According to Flexera, organizations waste an average of 28% of their cloud spend on idle or underused resources. For teams running 10+ preview environments, that's $50–200/mo in pure waste that a simple idle-detection proxy eliminates.
According to Flexera, organizations waste 28% of their cloud spend on idle or underused resources. Preview environments are some of the worst offenders. They run around the clock despite being used for minutes per day during code review.
Here's the math for a typical team:
That's $50–100/mo just for PR previews. Add staging, QA, and demo environments and a mid-size team easily runs 20–30 non-production environments, pushing idle costs to $200–300/mo.
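A rough sketch of that arithmetic, assuming the $5–10 per container per month range quoted later in this article. The helper name `monthly_cost_range` is illustrative, and real pricing varies by provider.

```python
def monthly_cost_range(env_count: int, low: float = 5.0, high: float = 10.0) -> tuple[float, float]:
    """Return the (low, high) monthly cost in dollars for always-on containers.

    The default per-container prices are an assumption taken from this article.
    """
    return env_count * low, env_count * high


if __name__ == "__main__":
    # 10 always-on PR previews
    print(monthly_cost_range(10))  # (50.0, 100.0)
```

Plugging in 30 environments at the upper price gives the $300/mo end of the range.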
Cloud providers charge by the hour. A container that serves zero requests at 3 AM costs the same as one handling 1,000 requests per second at 3 PM. There's no built-in incentive for providers to help you stop paying for idle resources.
Scale-to-zero is a resource management pattern where containers stop completely when they receive no traffic for a defined period. AWS Lambda pioneered serverless scale-to-zero, processing over 1 trillion invocations per month at peak by dynamically scaling functions from zero. The same principle applies to containers.
The lifecycle works in four stages:
Subsequent requests hit the running container normally. It only sleeps again after another idle timeout passes with zero traffic.
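One way to picture this lifecycle is as a small state machine. The state and event names below are illustrative assumptions, not taken from any particular implementation.

```python
from enum import Enum


class State(Enum):
    RUNNING = "running"  # serving traffic normally
    STOPPED = "stopped"  # idle timeout elapsed, container stopped
    WAKING = "waking"    # request held while the container starts


# (current state, event) -> next state; any other pair leaves the state unchanged
TRANSITIONS = {
    (State.RUNNING, "idle_timeout"): State.STOPPED,
    (State.STOPPED, "request"): State.WAKING,
    (State.WAKING, "healthy"): State.RUNNING,
}


def next_state(state: State, event: str) -> State:
    """Apply one event to the current state."""
    return TRANSITIONS.get((state, event), state)
```

A request that reaches a running container does not change the state; it only refreshes the idle clock.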
Serverless (Lambda, Cloud Run) does scale-to-zero natively. But preview environments aren't stateless functions — they're full applications with databases, file systems, and long-running processes. Containerized scale-to-zero gives you the cost savings of serverless with the flexibility of a full runtime.
The wake-on-request pattern uses a reverse proxy to intercept traffic, check container state, and manage the sleep/wake lifecycle. The CNCF's 2024 survey found that 84% of organizations use or evaluate Kubernetes, where similar patterns power Knative's scale-to-zero. But you don't need Kubernetes — the pattern works with plain Docker and a lightweight proxy.
Here's the full request flow:
                  ┌──────────────────┐
HTTP Request ──>  │  Reverse Proxy   │
                  │  (Nginx/Pingora) │
                  └────────┬─────────┘
                           │
                  ┌────────v─────────┐
                  │ Container alive? │
                  └───┬──────────┬───┘
                      │          │
                     YES         NO
                      │          │
                      │   ┌──────v────────────┐
                      │   │ Hold request      │
                      │   │ Start container   │
                      │   │ Wait health check │
                      │   └──────┬────────────┘
                      │          │
                  ┌───v──────────v───────┐
                  │ Forward request      │
                  │ Update last_activity │
                  └──────────────────────┘

Background sweeper (every 30s):
  - Check last_activity for each container
  - If idle > timeout: docker stop
Three things make or break this pattern:
Request buffering. When a container is sleeping, the proxy must hold the incoming request in memory without dropping it. The client sees a slow response, not an error. If the container takes too long to wake, the proxy should return a 503 with a retry header — not hang indefinitely.
Health check timing. Don't forward the buffered request the instant docker start returns. The container process may be running but the application inside isn't ready. Poll a health endpoint (e.g., /healthz) until it returns 200, then forward.
Last-activity tracking. Every proxied request updates a timestamp. A background goroutine or thread sweeps all containers every 30 seconds, stopping any that exceeded their idle timeout. This is cheaper than per-container timers and uses roughly 100 KB of memory regardless of container count.
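The buffering and health-check points can be sketched together. This is a minimal sketch under stated assumptions: the function names, the `/healthz` URL in the usage note, and the 5-second `Retry-After` value are all hypothetical, and a real proxy would hold the connection itself instead of returning a tuple.

```python
import http.client
import time
import urllib.request
from typing import Callable, Optional


def http_probe(url: str, timeout: float = 1.0) -> bool:
    """Return True only if the health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Covers connection refused, timeouts, and non-2xx responses
        return False


def wait_until_ready(probe: Callable[[], bool], wake_timeout: float = 30.0,
                     interval: float = 0.5) -> bool:
    """Poll `probe` until it returns True or `wake_timeout` seconds elapse."""
    deadline = time.monotonic() + wake_timeout
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval)
    return False


def hold_request(probe: Callable[[], bool], wake_timeout: float = 30.0,
                 interval: float = 0.5,
                 retry_after: int = 5) -> Optional[tuple[int, dict[str, str]]]:
    """Return None when the request can be forwarded, else a 503 response."""
    if wait_until_ready(probe, wake_timeout, interval):
        return None
    return 503, {"Retry-After": str(retry_after)}
```

A caller might use `hold_request(lambda: http_probe("http://127.0.0.1:3000/healthz"))` and forward the buffered request only when the result is `None`.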
Yes, you can build this yourself with plain Docker. A DIY solution needs three components: a proxy (Nginx or Traefik), a daemon that manages container lifecycle, and a channel for the proxy to reach the daemon. The Docker API handles 250+ container operations per second on modest hardware, so the start/stop overhead is negligible.
Here's a minimal Python daemon that implements the core logic:
#!/usr/bin/env python3
"""Minimal scale-to-zero daemon for Docker containers."""
import threading
import time

import docker
from flask import Flask, jsonify

client = docker.from_env()
app = Flask(__name__)

# Track last request time per container
last_activity: dict[str, float] = {}

IDLE_TIMEOUT = 300   # Seconds of inactivity before stopping (5 minutes)
WAKE_TIMEOUT = 30    # Max seconds to wait for a container to become ready
SWEEP_INTERVAL = 30  # Seconds between idle sweeps


def get_container(name: str):
    """Return the container with this name, or None if it does not exist."""
    try:
        return client.containers.get(name)
    except docker.errors.NotFound:
        return None


def is_healthy(container) -> bool:
    """Check if container is running and healthy."""
    container.reload()  # Refresh cached state from the Docker API
    if container.status != "running":
        return False
    health = container.attrs.get("State", {}).get("Health")
    if health is None:
        return True  # No healthcheck defined, assume ready
    return health.get("Status") == "healthy"


def wake_container(name: str) -> bool:
    """Start a stopped container and wait for health."""
    container = get_container(name)
    if not container:
        return False
    if container.status == "running":
        last_activity[name] = time.time()
        return True
    container.start()
    deadline = time.time() + WAKE_TIMEOUT
    while time.time() < deadline:
        if is_healthy(container):
            last_activity[name] = time.time()
            return True
        time.sleep(0.5)
    return False


# Nginx auth_request subrequests arrive as GET unless proxy_method
# overrides it, so accept both methods.
@app.route("/wake/<name>", methods=["GET", "POST"])
def handle_wake(name: str):
    """Nginx calls this via auth_request or proxy_pass."""
    if wake_container(name):
        return jsonify({"status": "ready"}), 200
    return jsonify({"status": "timeout"}), 503


@app.route("/activity/<name>", methods=["GET", "POST"])
def handle_activity(name: str):
    """Called on every proxied request to update timestamp."""
    last_activity[name] = time.time()
    return "", 204


def idle_sweeper():
    """Background thread that stops idle containers."""
    while True:
        time.sleep(SWEEP_INTERVAL)
        now = time.time()
        # Iterate over a copy: request handlers mutate the dict concurrently
        for name, last_seen in list(last_activity.items()):
            if now - last_seen > IDLE_TIMEOUT:
                container = get_container(name)
                if container and container.status == "running":
                    print(f"Stopping idle container: {name}")
                    container.stop(timeout=10)
                # pop() tolerates the entry having been removed already
                last_activity.pop(name, None)


if __name__ == "__main__":
    sweeper = threading.Thread(target=idle_sweeper, daemon=True)
    sweeper.start()
    app.run(host="127.0.0.1", port=9090)
The Nginx config uses auth_request to call the daemon before proxying:
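server {
    listen 80;
    server_name ~^(?<container_name>.+)\.preview\.example\.com$;

    # Wake the container before forwarding
    auth_request /internal/wake;

    location /internal/wake {
        internal;
        auth_request off;                    # Do not re-authorize the wake call itself
        proxy_method POST;                   # Subrequests default to GET; the daemon expects POST
        proxy_pass_request_body off;         # The daemon does not need the client body
        proxy_set_header Content-Length "";
        proxy_pass http://127.0.0.1:9090/wake/$container_name;
        proxy_read_timeout 35s;              # Slightly above WAKE_TIMEOUT
    }

    location / {
        # A variable in proxy_pass is resolved at request time, so Nginx
        # also needs a `resolver` directive suited to where it runs.
        proxy_pass http://$container_name:3000;
        proxy_set_header Host $host;

        # Track activity asynchronously
        post_action @track_activity;
    }

    location @track_activity {
        internal;
        auth_request off;                    # Avoid a second wake call after the response
        proxy_method POST;
        proxy_pass_request_body off;
        proxy_set_header Content-Length "";
        proxy_pass http://127.0.0.1:9090/activity/$container_name;
    }
}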
This basic setup works, but it has gaps you'll hit in production:
WebSocket handling. The auth_request fires once on connection upgrade, but WebSocket connections stay open. You need to track WebSocket connection count separately and only consider a container idle when both HTTP requests and WebSocket connections are zero.
Concurrent wake requests. If 10 requests arrive simultaneously for a sleeping container, all 10 trigger a wake. Add a per-container lock so only the first request starts the container; the rest wait on the lock.
Container networking. Stopped containers lose their DNS entry in Docker's internal network. You may need to use docker pause/docker unpause instead if you rely on Docker DNS.
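For the WebSocket gap, one possible approach is a tracker that treats a container as idle only when no sockets are open and the idle timeout has passed. The class and method names below are hypothetical, and the proxy would need to call them when connections open and close.

```python
import threading
import time
from typing import Optional


class ActivityTracker:
    """Tracks HTTP activity and open WebSocket connections for one container."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._open_sockets = 0
        self._last_activity = time.monotonic()

    def request_seen(self, now: Optional[float] = None) -> None:
        """Record an HTTP request."""
        with self._lock:
            self._last_activity = time.monotonic() if now is None else now

    def socket_opened(self) -> None:
        """Record a WebSocket upgrade."""
        with self._lock:
            self._open_sockets += 1

    def socket_closed(self, now: Optional[float] = None) -> None:
        """Record a WebSocket close; this also restarts the idle clock."""
        with self._lock:
            self._open_sockets = max(0, self._open_sockets - 1)
            self._last_activity = time.monotonic() if now is None else now

    def is_idle(self, idle_timeout: float, now: Optional[float] = None) -> bool:
        """Idle means no open sockets and no activity for `idle_timeout` seconds."""
        current = time.monotonic() if now is None else now
        with self._lock:
            return (self._open_sockets == 0
                    and current - self._last_activity > idle_timeout)
```

The optional `now` argument exists only to make the timing testable; production callers can omit it.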
The biggest source of bugs in DIY scale-to-zero isn't the wake logic — it's the concurrent request handling. Without proper locking, you get race conditions where two threads both see "container stopped" and both call docker start, causing errors.
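A minimal sketch of the per-container lock, assuming the caller supplies `is_running` and `start` callables (both hypothetical stand-ins for Docker calls). The first thread to take the lock starts the container; later threads wait, re-check state, and find it already running.

```python
import threading
from collections import defaultdict
from typing import Callable

# Guards creation of per-container locks
_locks_guard = threading.Lock()
_wake_locks: dict[str, threading.Lock] = defaultdict(threading.Lock)


def wake_once(name: str, is_running: Callable[[str], bool],
              start: Callable[[str], None]) -> None:
    """Start container `name` at most once, even under concurrent calls."""
    with _locks_guard:
        lock = _wake_locks[name]
    with lock:
        # Re-check inside the lock: another thread may have started it already
        if not is_running(name):
            start(name)
```

Holding the lock across `start` is deliberate here: waiting callers block until the wake finishes instead of issuing duplicate start calls.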
| Feature | Temps | Vercel | Render |
|---|---|---|---|
| Preview env scale-to-zero | Yes (on_demand: true) | No (always-on) | No (always-on) |
| Idle timeout (configurable) | 60s–86400s | N/A | N/A |
| Wake timeout (configurable) | 5s–120s | N/A | N/A |
| Per-environment config | Yes | No | No |
| Project-level default | Yes (preview_envs_on_demand) | No | No |
| Self-hosted | Yes, free (Apache 2.0) | No | No |
| Proxy layer | Pingora (Cloudflare-built Rust) | Proprietary | Proprietary |
Vercel and Render run preview environments continuously — there's no built-in idle stop for containers. Temps implements scale-to-zero at the proxy layer using Pingora (Cloudflare's open-source Rust proxy), so the decision and the action happen in the same process with no extra network hop.
Temps calls this feature on-demand environments. Set on_demand: true in your environment configuration and containers automatically stop after idle_timeout_seconds of no traffic (default: 300 seconds, range: 60–86400). They wake automatically when the next HTTP request arrives, within wake_timeout_seconds (default: 30 seconds, range: 5–120).
Three parameters control on-demand behavior per environment:
{
"on_demand": true,
"idle_timeout_seconds": 300,
"wake_timeout_seconds": 30
}
- on_demand enables scale-to-zero for this environment
- idle_timeout_seconds sets the seconds of inactivity before containers stop (60–86400)
- wake_timeout_seconds sets the max seconds to wait for wake on the next request (5–120)
Set preview_envs_on_demand: true on a project and every newly auto-created preview environment inherits on-demand mode automatically. You can also set project-level defaults for preview_envs_idle_timeout_seconds (default: 300) and preview_envs_wake_timeout_seconds (default: 30). Existing environments are not affected; only previews created after enabling the flag pick up the new defaults.
This is the key operational win: instead of configuring each preview environment individually, you opt in once at the project level and every future PR preview scales to zero by default.
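To illustrate the inheritance rule, here is a hypothetical sketch of how a new preview environment's settings could be derived from project-level defaults. It is not Temps' actual code; only the field names, defaults, and ranges come from this article, and the function names are invented for the example.

```python
IDLE_RANGE = (60, 86400)  # Allowed idle_timeout_seconds, per the article
WAKE_RANGE = (5, 120)     # Allowed wake_timeout_seconds, per the article


def _check(name: str, value: int, bounds: tuple[int, int]) -> int:
    """Validate that `value` falls inside the inclusive `bounds`."""
    low, high = bounds
    if not low <= value <= high:
        raise ValueError(f"{name} must be between {low} and {high}, got {value}")
    return value


def new_preview_config(project: dict) -> dict:
    """Build the on-demand settings a newly created preview would inherit."""
    return {
        "on_demand": bool(project.get("preview_envs_on_demand", False)),
        "idle_timeout_seconds": _check(
            "idle_timeout_seconds",
            project.get("preview_envs_idle_timeout_seconds", 300),
            IDLE_RANGE,
        ),
        "wake_timeout_seconds": _check(
            "wake_timeout_seconds",
            project.get("preview_envs_wake_timeout_seconds", 30),
            WAKE_RANGE,
        ),
    }
```

Calling `new_preview_config({"preview_envs_on_demand": True})` yields the same three fields shown in the JSON example, filled with the documented defaults.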
Temps uses Pingora (Cloudflare's open-source Rust proxy, the same proxy layer that serves Cloudflare's own traffic) as its reverse proxy. When a request arrives for a sleeping on-demand environment:
Wake time is typically 2–5 seconds because the container image is already pulled and cached — no rebuild required. The proxy and orchestrator live in the same binary, eliminating the network hop between "check state" and "start container."
Enabling on-demand mode across all preview environments typically cuts non-production resource usage by 60–80%. For a team with 20 preview environments at $5–10/container/month, that means the bill drops from $100–200/mo to $20–60/mo — just from enabling a single flag.
Cold start latency is the main tradeoff of scale-to-zero. Google Cloud Run reports median cold starts of 1–3 seconds for optimized containers, but unoptimized ones can take 10–30 seconds. The difference comes down to image size, startup dependencies, and which stop mechanism you use.
Every megabyte of image size adds to pull and extract time. Alpine-based images are typically 5–30 MB compared to 200–800 MB for Debian-based ones.
# Unoptimized: ~850 MB image, ~8-second cold start
FROM node:22
COPY . .
RUN npm install
CMD ["node", "server.js"]

# Optimized: ~45 MB image, ~2-second cold start
FROM node:22-alpine AS builder
WORKDIR /app
COPY package*.json ./
# Install production dependencies only (--omit=dev replaces the deprecated --production)
RUN npm ci --omit=dev
COPY . .

FROM node:22-alpine
WORKDIR /app
COPY --from=builder /app .
CMD ["node", "server.js"]
docker pause freezes container processes using cgroups. Memory stays allocated but CPU drops to zero. docker unpause resumes in under 100 ms — no process startup involved.
The tradeoff: paused containers still consume RAM. For dev environments with 512 MB containers, this is often worth it. For staging environments running 2 GB containers, stop/start with the slower wake time might save more overall.
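That tradeoff can be expressed as a simple policy. This sketch assumes a container object exposing `pause()`, `unpause()`, `stop()`, and `start()`, as the Docker SDK for Python's container objects do; the 1024 MB threshold and the function names are arbitrary assumptions to tune for your host.

```python
def sleep_container(container, memory_mb: int, pause_below_mb: int = 1024) -> str:
    """Pause small containers (fast wake, RAM stays allocated); stop large ones."""
    if memory_mb < pause_below_mb:
        container.pause()
        return "paused"
    container.stop(timeout=10)
    return "stopped"


def wake_container(container, how: str) -> None:
    """Reverse whichever mechanism put the container to sleep."""
    if how == "paused":
        container.unpause()
    else:
        container.start()
```

The caller has to remember the returned mode per container so the matching wake call is used later.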
| Environment | Recommended Timeout | Rationale |
|---|---|---|
| PR previews | 5 minutes | Reviewed once, rarely revisited |
| Development | 10–15 minutes | Active work, short breaks |
| Staging | 30–60 minutes | QA sessions, longer gaps between tests |
| Demo | 15–30 minutes | Client meetings, unpredictable pauses |
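In a DIY daemon, the table could become a per-environment lookup in place of a single global timeout. The keys and the choice of each range's lower bound are assumptions made for this sketch.

```python
# Idle timeouts in seconds, using the lower bound of each recommended range
IDLE_TIMEOUT_SECONDS = {
    "preview": 5 * 60,
    "development": 10 * 60,
    "staging": 30 * 60,
    "demo": 15 * 60,
}
DEFAULT_IDLE_TIMEOUT = 5 * 60


def idle_timeout_for(env_type: str) -> int:
    """Return the idle timeout for an environment type, with a fallback."""
    return IDLE_TIMEOUT_SECONDS.get(env_type, DEFAULT_IDLE_TIMEOUT)
```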
Scale-to-zero is not universal. Here's where it causes more problems than it solves:
Production environments. Any environment where users expect sub-second responses. According to Amazon's research, 100 ms of latency costs 1% in sales. A 2–5 second cold start on the first request after idle is unacceptable for customer-facing applications.
CI/CD pipeline targets. Automated tests and deployment pipelines expect instant responses. A sleeping environment introduces flaky test results because the first request takes abnormally long or times out.
WebSocket-heavy applications. When containers sleep, all WebSocket connections drop. Applications need client-side reconnection logic, and if the app relies on persistent connections (real-time collaboration, live dashboards), the reconnection storm after a wake can overwhelm the freshly started container.
Long-running background jobs. Containers running cron jobs, queue workers, or batch processing should never scale to zero — their "traffic" isn't HTTP-based and they need to be running continuously.
Scale-to-zero shines in environments with bursty, human-driven access patterns, such as dev, staging, preview, and demo environments.
Cold start duration depends on the stop mechanism and container size. Using docker pause/unpause, wake time is under 100 ms. Using docker stop/start, expect 1–5 seconds for optimized containers and up to 30 seconds for large, unoptimized ones. Google Cloud Run benchmarks show median cold starts of 1–3 seconds for optimized images.
Scale-to-zero is not recommended for user-facing production. The cold start penalty creates unacceptable latency for the first visitor after an idle period. Reserve scale-to-zero for dev, staging, preview, and demo environments where occasional 2–5 second delays are acceptable.
All WebSocket connections drop when a container sleeps. Clients receive a close frame (or a TCP reset if the stop is abrupt). Applications need client-side reconnection logic — most WebSocket libraries support automatic reconnect with exponential backoff. The first reconnect triggers a container wake, so expect a 2–5 second delay before the connection reestablishes.
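The reconnection schedule can be sketched as a delay generator. The parameter values are illustrative assumptions; a client would sleep for each yielded delay before its next connection attempt.

```python
import random
from typing import Callable, Iterator


def backoff_delays(base: float = 0.5, factor: float = 2.0, cap: float = 30.0,
                   attempts: int = 6, jitter: float = 0.0,
                   rng: Callable[[], float] = random.random) -> Iterator[float]:
    """Yield exponentially growing delays in seconds, capped at `cap`.

    `jitter` adds up to that fraction of extra random delay so that many
    clients do not reconnect at the same instant after a wake.
    """
    delay = base
    for _ in range(attempts):
        yield min(cap, delay) * (1 + jitter * rng())
        delay *= factor
```

With the defaults and no jitter, the first four delays are 0.5, 1, 2, and 4 seconds, which comfortably spans a 2–5 second wake.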
Savings depend on idle time percentage. Most dev and staging environments sit idle 70–90% of the time. According to Flexera, organizations waste 28% of cloud spend on idle resources. For a team running 20 non-production environments at $5–10/container/month, scale-to-zero cuts the bill from $100–200/mo to $20–60/mo — a 60–80% reduction.
When an environment scales to zero, the application container stops, not the database. Database connections are closed when the container sleeps and re-established on wake. Use connection pooling (PgBouncer, Prisma connection pool) so the reconnection is fast. Database containers themselves should NOT use scale-to-zero because they need to persist data and maintain availability.
Yes, Temps supports scale-to-zero out of the box. Set on_demand: true on any environment, or enable preview_envs_on_demand at the project level so every new PR preview inherits it automatically. Idle timeout defaults to 300 seconds (5 minutes), configurable from 60 seconds to 24 hours. Wake timeout defaults to 30 seconds, configurable from 5 to 120 seconds.
Scale-to-zero is one of those rare optimizations that's pure upside for non-production environments. You save 60–80% on idle resources with a tradeoff that barely matters: a few seconds of cold start latency affecting only the first request after an idle period.
You can build it yourself with Docker, a lightweight proxy, and about 100 lines of daemon code. Or use a platform that handles the proxy buffering, health checking, and lifecycle management out of the box — with self-hosted pricing that doesn't add another SaaS bill.
Temps is Apache 2.0, self-hosted for free, or available on Temps Cloud at ~$6/mo (Hetzner cost + 30%). No per-seat fees, no bandwidth bills.
curl -fsSL temps.sh/install.sh | bash
Set on_demand: true on any environment, configure your idle timeout, and watch your resource usage drop while your environments stay accessible on demand.