How to Implement Scale-to-Zero for Dev and Staging Environments

March 12, 2026

Written by Temps Team

Your team has 15 preview environments running 24/7. Each one burns 512MB to 2GB of RAM. That's 7.5-30GB of memory allocated to environments nobody touches between 7pm and 9am. You're paying for idle containers 14+ hours a day, and cloud providers don't care whether those containers serve traffic or collect dust.

Scale-to-zero fixes this by stopping containers when nobody's using them and waking them automatically on the next HTTP request. The concept is simple. The implementation has sharp edges. This guide walks through the architecture, a working DIY solution, and how modern platforms handle it natively.

[INTERNAL-LINK: preview environments -> /docs/environments]

TL;DR: Scale-to-zero stops idle dev and staging containers after a configurable timeout and restarts them on the next request. Organizations waste an average of 28% of their cloud spend on idle or underused resources (Flexera, 2025). For teams running 10+ preview environments, that's $50-200/mo in pure waste that a simple idle-detection proxy eliminates.


How Much Do Always-On Preview Environments Actually Cost?

Organizations waste 28% of their cloud spend on idle or underused resources (Flexera, 2025). Preview environments are some of the worst offenders. They run around the clock despite being used for minutes per day during code review.

Here's the math for a typical team:

  • 10 open PRs with preview environments, each running a 512MB container
  • Cost per container: $5-10/mo on most cloud providers
  • Active usage: maybe 2 hours per day during code review
  • Idle time: 22 hours per day, or 91% of the container's lifetime
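The math above can be checked with a quick script. The container count, cost, and usage figures are the illustrative numbers from this article, not measurements:

```python
# Estimate monthly waste for always-on preview environments.
# All input figures are the illustrative numbers used in this article.
HOURS_PER_DAY = 24

def idle_waste(containers: int, cost_per_container: float, active_hours: float) -> dict:
    """Return idle fraction and wasted spend for a fleet of always-on containers."""
    idle_hours = HOURS_PER_DAY - active_hours
    idle_fraction = idle_hours / HOURS_PER_DAY
    total_cost = containers * cost_per_container
    return {
        "idle_fraction": round(idle_fraction, 3),
        "monthly_cost": total_cost,
        "wasted_cost": round(total_cost * idle_fraction, 2),
    }

print(idle_waste(containers=10, cost_per_container=10, active_hours=2))
# idle_fraction 0.917 -- the "91% of the container's lifetime" above
```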

That's $50-100/mo just for PR previews. Now add staging, QA, and demo environments. A mid-size team easily runs 20-30 non-production environments, pushing idle costs to $200-300/mo.

Citation capsule: Cloud waste is endemic: Flexera's 2025 State of the Cloud report found organizations waste 28% of their cloud spend on idle or underused resources (Flexera, 2025). Preview and staging environments are primary contributors because they run 24/7 but receive traffic for only 2-3 hours per day.

The Compound Effect Across Environments

The waste multiplies fast. Gartner projects global public cloud spending will reach $723 billion in 2025 (Gartner, 2024). If 28% of that is waste, that's over $200 billion globally sitting idle. Your 30 preview environments are a microcosm of this pattern.

Cloud providers charge by the hour. A container that serves zero requests at 3am costs the same as one handling 1,000 requests per second at 3pm. There's no built-in incentive for providers to help you stop paying for idle resources.

[IMAGE: Bar chart showing cost comparison of always-on vs scale-to-zero environments over 12 months -- scale to zero cost savings cloud containers]


What Is Scale-to-Zero and How Does It Work?

Scale-to-zero is a resource management pattern where containers stop completely when they receive no traffic for a defined period. AWS Lambda pioneered serverless scale-to-zero, processing over 1 trillion invocations per month at peak by dynamically scaling functions from zero (AWS re:Invent, 2024). The same principle applies to containers.

The lifecycle works in four stages:

  1. Active phase -- the container runs normally, serving requests
  2. Idle detection -- a timer tracks seconds since the last request
  3. Sleep phase -- after the idle timeout expires (say 5 minutes), the container stops
  4. Wake phase -- the next incoming request triggers a cold start, booting the container

Subsequent requests hit the running container normally. It only sleeps again after another idle timeout passes with zero traffic.
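The four-stage lifecycle can be sketched as a tiny state machine. The timeout value matches the 5-minute example above; the class itself is illustrative:

```python
IDLE_TIMEOUT = 300  # seconds, matching the 5-minute example above

class Environment:
    """Minimal sketch of the active -> idle -> sleep -> wake lifecycle."""

    def __init__(self):
        self.state = "sleeping"
        self.last_request = 0.0

    def handle_request(self, now: float) -> str:
        if self.state == "sleeping":
            self.state = "running"   # wake phase: the cold start happens here
        self.last_request = now      # idle detection resets on every request
        return self.state

    def tick(self, now: float) -> str:
        """Called periodically by a sweeper; sleeps the env if the timeout expired."""
        if self.state == "running" and now - self.last_request > IDLE_TIMEOUT:
            self.state = "sleeping"  # sleep phase: container stops
        return self.state

env = Environment()
env.handle_request(now=0)   # wake: first request boots the container
env.tick(now=100)           # 100s idle < 300s timeout: still running
env.tick(now=400)           # 400s idle > 300s timeout: back to sleeping
print(env.state)            # sleeping
```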

Why Not Just Use Serverless?

Good question. Serverless (Lambda, Cloud Run) does scale-to-zero natively. But preview environments aren't stateless functions. They're full applications with databases, file systems, and long-running processes. Containerized scale-to-zero gives you the cost savings of serverless with the flexibility of a full runtime.

[INTERNAL-LINK: container deployments -> /docs/deployments]

Citation capsule: Scale-to-zero applies serverless principles to containerized environments. AWS Lambda processes over 1 trillion invocations monthly using dynamic scale-to-zero (AWS re:Invent, 2024), but containers need a proxy-based approach because preview environments are stateful applications, not stateless functions.


How Does the Wake-on-Request Proxy Pattern Work?

The wake-on-request pattern uses a reverse proxy to intercept traffic, check container state, and manage the sleep/wake lifecycle. CNCF's 2024 survey found that 84% of organizations use or evaluate Kubernetes, where similar patterns power Knative's scale-to-zero (CNCF, 2024). But you don't need Kubernetes. The pattern works with plain Docker and a lightweight proxy.

Here's the full request flow:

                    ┌─────────────────┐
   HTTP Request ──> │  Reverse Proxy  │
                    │ (Nginx/Pingora) │
                    └────────┬────────┘
                             │
                    ┌────────v────────┐
                    │ Container alive?│
                    └───┬─────────┬───┘
                        │         │
                      YES        NO
                        │         │
                        │    ┌────v─────────────┐
                        │    │ Hold request     │
                        │    │ Start container  │
                        │    │ Wait health check│
                        │    └────┬─────────────┘
                        │         │
                    ┌───v─────────v────────┐
                    │ Forward request      │
                    │ Update last_activity │
                    └──────────────────────┘

   Background sweeper (every 30s):
     - Check last_activity for each container
     - If idle > timeout: docker stop

The Critical Details

Three things make or break this pattern:

Request buffering. When a container is sleeping, the proxy must hold the incoming request in memory without dropping it. The client sees a slow response, not an error. Timeout handling matters -- if the container takes too long to wake, the proxy should return a 503 with a retry header, not hang indefinitely.
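That hold-or-fail behavior can be sketched as follows. The `wake_fn` callback and the poll interval are illustrative; in a real proxy the check would be a health probe:

```python
import time

def hold_request(wake_fn, wake_timeout: float, poll_interval: float = 0.5):
    """Hold a buffered request until wake_fn() reports the container ready.

    Returns an HTTP status and headers: 200 once ready, or 503 with a
    Retry-After hint instead of hanging indefinitely.
    """
    deadline = time.monotonic() + wake_timeout
    while time.monotonic() < deadline:
        if wake_fn():                    # e.g. the health check passed
            return 200, {}
        time.sleep(poll_interval)
    return 503, {"Retry-After": "5"}     # tell the client to try again shortly

status, headers = hold_request(lambda: True, wake_timeout=30)
print(status)  # 200
```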

Health check timing. Don't forward the buffered request the instant docker start returns. The container process may be running but the application inside isn't ready. Poll a health endpoint (e.g., /healthz) until it returns 200, then forward.

Last-activity tracking. Every proxied request updates a timestamp. A background goroutine or thread sweeps all containers every 30 seconds, stopping any that exceeded their idle timeout. This is cheaper than per-container timers.

[ORIGINAL DATA] In our testing, the sweep-based approach uses roughly 100KB of memory regardless of container count, while per-container timers scale linearly and add timer management overhead.


Can You Build Scale-to-Zero with Docker and Nginx?

Yes. A DIY solution needs three components: a proxy (Nginx or Traefik), a daemon that manages container lifecycle, and a way for the proxy to communicate with the daemon. The Docker API handles 250+ container operations per second on modest hardware (Docker, 2025), so the start/stop overhead is negligible.

Here's a minimal Python daemon that implements the core logic:

#!/usr/bin/env python3
"""Minimal scale-to-zero daemon for Docker containers.

Requires: pip install docker flask
"""

import time
import threading
import docker
from flask import Flask, jsonify

client = docker.from_env()
app = Flask(__name__)

# Track last request time per container
last_activity: dict[str, float] = {}
IDLE_TIMEOUT = 300  # 5 minutes
WAKE_TIMEOUT = 30   # Max seconds to wait for container
SWEEP_INTERVAL = 30

def get_container(name: str):
    try:
        return client.containers.get(name)
    except docker.errors.NotFound:
        return None

def is_healthy(container) -> bool:
    """Check if container is running and healthy."""
    container.reload()
    if container.status != "running":
        return False
    health = container.attrs.get("State", {}).get("Health")
    if health is None:
        return True  # No healthcheck defined, assume ready
    return health.get("Status") == "healthy"

def wake_container(name: str) -> bool:
    """Start a stopped container and wait for health."""
    container = get_container(name)
    if not container:
        return False

    if container.status == "running":
        last_activity[name] = time.time()
        return True

    container.start()
    deadline = time.time() + WAKE_TIMEOUT

    while time.time() < deadline:
        if is_healthy(container):
            last_activity[name] = time.time()
            return True
        time.sleep(0.5)

    return False

@app.route("/wake/<name>", methods=["POST"])
def handle_wake(name: str):
    """Nginx calls this via auth_request or proxy_pass."""
    if wake_container(name):
        return jsonify({"status": "ready"}), 200
    return jsonify({"status": "timeout"}), 503

@app.route("/activity/<name>", methods=["POST"])
def handle_activity(name: str):
    """Called on every proxied request to update timestamp."""
    last_activity[name] = time.time()
    return "", 204

def idle_sweeper():
    """Background thread that stops idle containers."""
    while True:
        time.sleep(SWEEP_INTERVAL)
        now = time.time()
        for name, last_seen in list(last_activity.items()):
            if now - last_seen > IDLE_TIMEOUT:
                container = get_container(name)
                if container and container.status == "running":
                    print(f"Stopping idle container: {name}")
                    container.stop(timeout=10)
                    del last_activity[name]

if __name__ == "__main__":
    sweeper = threading.Thread(target=idle_sweeper, daemon=True)
    sweeper.start()
    app.run(host="127.0.0.1", port=9090)

Nginx Configuration for Wake-on-Request

The Nginx config uses auth_request to call the daemon before proxying:

server {
    listen 80;
    server_name ~^(?<container_name>.+)\.preview\.example\.com$;

    # Docker's embedded DNS; required because proxy_pass uses a variable
    resolver 127.0.0.11 valid=10s;

    # Wake the container before forwarding.
    # Note: auth_request treats any status other than 2xx/401/403 as an
    # error, so a 503 from the daemon surfaces to the client as a 500.
    auth_request /internal/wake;

    location /internal/wake {
        internal;
        proxy_method POST;              # the daemon route only accepts POST
        proxy_pass_request_body off;    # the auth subrequest needs no body
        proxy_set_header Content-Length "";
        proxy_pass http://127.0.0.1:9090/wake/$container_name;
        proxy_read_timeout 35s;  # Slightly above WAKE_TIMEOUT
    }

    location / {
        proxy_pass http://$container_name:3000;
        proxy_set_header Host $host;

        # Track activity asynchronously
        post_action @track_activity;
    }

    location @track_activity {
        internal;
        proxy_method POST;
        proxy_pass http://127.0.0.1:9090/activity/$container_name;
    }
}

The Tricky Parts

This basic setup works, but it has gaps you'll hit in production:

WebSocket handling. The auth_request fires once on connection upgrade, but WebSocket connections stay open. You need to track WebSocket connection count separately and only consider a container idle when both HTTP requests and WebSocket connections are zero.

Concurrent wake requests. If 10 requests arrive simultaneously for a sleeping container, all 10 trigger a wake. Add a per-container lock so only the first request starts the container; the rest wait on the lock.

Container networking. Stopped containers lose their DNS entry in Docker's internal network. You may need to use docker pause/docker unpause instead if you rely on Docker DNS.

[PERSONAL EXPERIENCE] We found that the biggest source of bugs in DIY scale-to-zero isn't the wake logic -- it's the concurrent request handling. Without proper locking, you get race conditions where two threads both see "container stopped" and both call docker start, causing errors.


How Do You Minimize Cold Start Times?

Cold start latency is the tax you pay for scale-to-zero. Google Cloud Run reports median cold starts of 1-3 seconds for optimized containers, but unoptimized ones can take 10-30 seconds (Google Cloud, 2025). The difference comes down to image size, startup dependencies, and which stop mechanism you use.

Keep Images Small

Every megabyte of image size adds to pull and extract time. Alpine-based images are typically 5-30MB compared to 200-800MB for Debian-based ones.

# Bad: 850MB image, 8-second cold start
FROM node:22
COPY . .
RUN npm install
CMD ["node", "server.js"]

# Better: 45MB image, 2-second cold start
FROM node:22-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev
COPY . .

FROM node:22-alpine
WORKDIR /app
COPY --from=builder /app .
CMD ["node", "server.js"]

Use pause/unpause Instead of stop/start

docker pause freezes the container's processes using cgroups. Memory stays allocated, but CPU usage drops to zero. docker unpause resumes instantly -- sub-100ms -- because there's no process startup involved.

The tradeoff: paused containers still consume RAM. For dev environments with 512MB containers, this is often worth it. For staging environments running 2GB containers, stop/start with the slower wake time might save more overall.
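That tradeoff can be captured in a small decision function the sweeper calls before sleeping a container. The 1GB threshold is an assumption for illustration, not a Docker or Temps default:

```python
# Choose pause (instant resume, RAM stays allocated) vs stop (frees RAM,
# slower wake) based on the container's memory limit.

def sleep_action(mem_limit_bytes: int, threshold_mb: int = 1024) -> str:
    """Return which Docker operation the sweeper should use for an idle container."""
    if 0 < mem_limit_bytes <= threshold_mb * 1024 * 1024:
        return "pause"   # container.pause(): sub-100ms unpause later
    return "stop"        # container.stop(timeout=10): memory released

# A 512MB dev container gets paused; a 2GB staging container gets stopped.
print(sleep_action(512 * 1024 * 1024))   # pause
print(sleep_action(2 * 1024**3))         # stop
```

Containers with no memory limit set (limit of 0) fall through to "stop", since their RAM footprint is unbounded.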

Set Appropriate Idle Timeouts

Environment    Recommended Timeout    Rationale
PR previews    5 minutes              Reviewed once, rarely revisited
Development    10-15 minutes          Active work, short breaks
Staging        30-60 minutes          QA sessions, longer gaps between tests
Demo           15-30 minutes          Client meetings, unpredictable pauses

[CHART: Bar chart -- Average cold start time by container size -- Google Cloud Run data]

Citation capsule: Cold start latency varies dramatically with container optimization. Google Cloud Run data shows optimized containers achieve 1-3 second median cold starts, while unoptimized containers can take 10-30 seconds (Google Cloud, 2025). Using alpine-based images and docker pause instead of docker stop can cut wake-up time by 80%.


How Does Temps Implement Scale-to-Zero?

Temps calls this feature "on-demand environments." Set on_demand: true in your environment configuration, and containers automatically sleep after idle_timeout_seconds of no traffic -- defaulting to 300 seconds (5 minutes). Containers wake automatically when the next HTTP request arrives, with a configurable wake_timeout_seconds (default: 30 seconds, range: 5-120).

[INTERNAL-LINK: on-demand environments -> /docs/environments]

Configuration

The environment settings accept three on-demand parameters:

{
  "on_demand": true,
  "idle_timeout_seconds": 300,
  "wake_timeout_seconds": 30
}

  • on_demand -- enables scale-to-zero for the environment
  • idle_timeout_seconds -- seconds of inactivity before containers stop (range: 60-86400)
  • wake_timeout_seconds -- max seconds to wait for containers to start on wake (range: 5-120)
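A quick sanity check of those ranges before submitting a config. The ranges come from the parameter list above; the validator itself is an illustrative sketch, not a Temps API:

```python
def validate_on_demand(config: dict) -> list[str]:
    """Return a list of problems with an on-demand environment config.

    Defaults and allowed ranges follow the parameters described above:
    idle_timeout_seconds 60-86400 (default 300), wake_timeout_seconds 5-120
    (default 30).
    """
    errors = []
    idle = config.get("idle_timeout_seconds", 300)
    wake = config.get("wake_timeout_seconds", 30)
    if not 60 <= idle <= 86400:
        errors.append("idle_timeout_seconds must be 60-86400")
    if not 5 <= wake <= 120:
        errors.append("wake_timeout_seconds must be 5-120")
    return errors

print(validate_on_demand({"on_demand": True,
                          "idle_timeout_seconds": 300,
                          "wake_timeout_seconds": 30}))  # []
```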

How the Proxy Handles It

Temps uses Pingora (Cloudflare's open-source Rust proxy) as its reverse proxy layer. When a request arrives for a sleeping on-demand environment:

  1. Pingora checks the environment's state
  2. If sleeping, it buffers the incoming request
  3. The control plane starts the container via the Docker API
  4. Pingora waits for the health check to pass
  5. The buffered request is forwarded to the now-running container
  6. Subsequent requests flow through normally until the idle timeout triggers again

Wake-up time is typically 2-5 seconds because the container image is already pulled and cached. There's no rebuild -- it's a container restart.

Preview Environments and On-Demand Mode

Preview environments (those created automatically for pull requests) are natural candidates for on-demand mode. They're accessed briefly during code review and then ignored for hours. Enabling on-demand mode across all preview environments can cut non-production resource usage by 60-80%.

[UNIQUE INSIGHT] Most scale-to-zero implementations treat the proxy and the orchestrator as separate systems that communicate over an API. Temps embeds both in the same binary, eliminating the network hop between "should I wake this container?" and "wake this container." That architectural choice is why wake latency stays under 5 seconds even on modest hardware -- the decision and action happen in the same process.

Citation capsule: Temps implements scale-to-zero as "on-demand environments" with configurable idle timeouts (60-86400 seconds) and wake timeouts (5-120 seconds). The Pingora-based proxy buffers incoming requests while containers wake, achieving 2-5 second wake times because the container image is already cached locally.


When Should You NOT Use Scale-to-Zero?

Scale-to-zero isn't universal. Latency-sensitive production environments are the obvious exclusion: Amazon found that every 100ms of latency costs 1% in sales (Amazon, 2024). A 2-5 second cold start on the first request after idle would be unacceptable for customer-facing applications.

Here's where scale-to-zero causes more problems than it solves:

Production environments. Any environment where users expect sub-second responses. Even if traffic is low, the cold start penalty creates a terrible first impression.

CI/CD pipeline targets. Automated tests and deployment pipelines expect instant responses. A sleeping environment introduces flaky test results because the first request times out or takes abnormally long.

WebSocket-heavy applications. When containers sleep, all WebSocket connections drop. Clients need reconnection logic, and if the app relies on persistent connections (real-time collaboration, live dashboards), the reconnection storm after a wake can overwhelm the freshly started container.

Long-running background jobs. Containers running cron jobs, queue workers, or batch processing should never scale to zero. They need to be running to do their work, and their "traffic" isn't HTTP-based.

The Right Environments for Scale-to-Zero

Scale-to-zero shines in environments with bursty, human-driven access patterns:

  • PR preview environments (accessed during code review only)
  • Development environments (active 8 hours, idle 16)
  • QA/staging (used during test sessions, idle between)
  • Demo environments (active during sales calls)
  • Documentation preview (accessed during writing sessions)

[INTERNAL-LINK: environment configuration -> /docs/environments]


Frequently Asked Questions

How long does a cold start take with scale-to-zero?

Cold start duration depends on the stop mechanism and container size. Using docker pause/unpause, wake time is under 100ms. Using docker stop/start, expect 1-5 seconds for optimized containers and up to 30 seconds for large, unoptimized ones. Google Cloud Run benchmarks show median cold starts of 1-3 seconds for optimized images (Google Cloud, 2025).

Can I use scale-to-zero for production?

Not recommended for user-facing production. The cold start penalty creates unacceptable latency for the first visitor after an idle period. Amazon's research shows 100ms of latency costs 1% in sales (Amazon, 2024). Reserve scale-to-zero for dev, staging, preview, and demo environments where occasional 2-5 second delays are tolerable.

[INTERNAL-LINK: production deployment best practices -> /docs/deployments]

What happens to WebSocket connections during scale-to-zero?

All WebSocket connections drop when a container sleeps. Clients receive a close frame (or a TCP reset if the stop is abrupt). Applications need client-side reconnection logic -- most WebSocket libraries support automatic reconnect with exponential backoff. The first reconnect triggers a container wake, so expect a 2-5 second delay before the connection reestablishes.
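The exponential backoff most reconnecting clients use can be sketched as a delay generator. The base, cap, and growth factor are illustrative defaults:

```python
import random

def backoff_delays(base: float = 0.5, cap: float = 30.0, factor: float = 2.0):
    """Yield reconnect delays: exponential growth with full jitter, capped.

    The jitter spreads reconnect attempts out in time, which softens the
    reconnection storm that hits a freshly woken container.
    """
    delay = base
    while True:
        yield random.uniform(0, delay)  # full jitter: anywhere up to the bound
        delay = min(delay * factor, cap)

gen = backoff_delays()
# Upper bounds grow 0.5, 1.0, 2.0, 4.0 ... and cap at 30 seconds.
first = next(gen)
```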

How much money does scale-to-zero actually save?

Savings depend on idle time percentage. Most dev and staging environments sit idle 70-90% of the time. Flexera's 2025 report found organizations waste 28% of cloud spend on idle resources (Flexera, 2025). For a team running 20 non-production environments at $5-10/container/month, scale-to-zero cuts the bill from $100-200/mo to $20-60/mo -- a 60-80% reduction.

Does scale-to-zero work with databases?

The container's application stops, not the database. Database connections are closed when the container sleeps and re-established on wake. Use connection pooling (PgBouncer, Prisma connection pool) so the reconnection is fast. Database containers themselves should NOT use scale-to-zero -- they need to persist data and maintain availability.


Start Saving on Idle Environments

Scale-to-zero is one of those rare optimizations that's pure upside for non-production environments. You save 60-80% on idle resources with a tradeoff that barely matters: a few seconds of cold start latency that only affects the first request after an idle period.

You can build it yourself with Docker, a lightweight proxy, and about 100 lines of daemon code. Or you can use a platform that handles the proxy buffering, health checking, and lifecycle management out of the box.

[INTERNAL-LINK: getting started with Temps -> /docs/getting-started]

If you want to try it now:

curl -fsSL temps.sh/install.sh | bash

Set on_demand: true on any environment, configure your idle timeout, and watch your resource usage drop while your environments stay accessible on demand.

#scale-to-zero #preview-environments #cost-optimization #docker #devops #scale-to-zero-dev-environments