
How to Build an Uptime Monitoring System That Actually Works

March 12, 2026

Written by Temps Team

Last updated March 12, 2026

Pingdom charges $15/month for 10 monitors. UptimeRobot's free tier checks every 5 minutes — meaning your site could be down for nearly 5 minutes before anyone notices. And by "anyone," we usually mean your users, not your team. According to ITIC's 2024 Hourly Cost of Downtime Survey, 91% of enterprises say a single hour of downtime costs over $300,000 (ITIC, 2024).

The monitoring market is bloated with SaaS tools that charge per check, per seat, or per notification channel. But the core concept behind uptime monitoring isn't complicated. It's an HTTP request, a timer, and a state machine. You can build one yourself.

This guide explains what uptime monitoring actually does under the hood, walks through building a basic monitor from scratch, compares open-source alternatives, and shows how integrated platforms handle it without the duct tape.

[INTERNAL-LINK: self-hosted deployment platform overview -> /blog/introducing-temps-vercel-alternative]

TL;DR: Uptime monitoring boils down to scheduled HTTP checks, a state machine for UP/DEGRADED/DOWN transitions, and an alert dispatcher. Commercial tools like Pingdom ($15/mo for 10 monitors) and Better Uptime ($24/mo) charge for something you can build or self-host. Uptime Kuma has over 63,000 GitHub stars (GitHub, 2026) and proves self-hosted monitoring works. You don't need SaaS for this.


What Does Uptime Monitoring Actually Check?

Uptime monitoring goes far beyond "is the server responding." A 2024 report by Catchpoint found that 77% of outages involve degraded performance rather than complete failures (Catchpoint, 2024). A good monitoring system checks multiple layers of your stack to catch degradation before it becomes downtime.

Citation capsule: 77% of outages manifest as degraded performance rather than complete failures, according to Catchpoint's 2024 SRE Report. Effective uptime monitoring must check HTTP status codes, response times, SSL certificate expiry, content validation, DNS resolution, and TCP/TLS handshake timing to catch degradation early.

HTTP Status Codes and Response Times

The most basic check: send an HTTP GET request and verify you get a 200 status code. But status alone isn't enough. A page that returns 200 OK in 8 seconds is functionally down for your users. Good monitors track response time percentiles — p50, p95, p99 — and alert when they drift beyond thresholds.
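As a rough sketch, those percentiles can be computed over a window of recent samples with nothing but the standard library (the function name and sample handling here are illustrative, not part of any particular tool):

```python
from statistics import quantiles

def latency_percentiles(samples):
    """Return (p50, p95, p99) from a window of recent response-time samples."""
    cut = quantiles(samples, n=100)  # 99 cut points between the percentiles
    return cut[49], cut[94], cut[98]
```

Feed it the last few hundred response times per monitor and compare each percentile against its own threshold — p99 drifting up is often the first visible sign of degradation.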

You'll also want to monitor specific status code patterns. A sudden spike in 502 errors from your reverse proxy tells a different story than a 503 from your application. Each demands a different response.

SSL Certificate Expiry

Let's Encrypt certificates expire every 90 days. Auto-renewal works until it doesn't — a DNS change, a permission error, a failed hook script. According to Netcraft, roughly 3.6 million SSL certificates expire on any given day worldwide (Netcraft, 2024). Your monitor should warn you 14, 7, and 3 days before expiry.
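A minimal expiry check needs only the standard library's ssl module — do a real TLS handshake, read the certificate's notAfter field, and convert it to days remaining. The function names below are illustrative:

```python
import socket
import ssl
import time

def fetch_not_after(host: str, port: int = 443) -> str:
    """Complete a TLS handshake and return the peer cert's notAfter timestamp."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]

def days_until_expiry(not_after: str) -> float:
    """Convert a cert timestamp like 'Jun  1 12:00:00 2026 GMT' to days remaining."""
    return (ssl.cert_time_to_seconds(not_after) - time.time()) / 86400
```

Run `days_until_expiry(fetch_not_after("example.com"))` on a schedule and alert when the value crosses 14, 7, and 3.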

Content Validation

A 200 status code doesn't guarantee your page is working. Your API might return {"error": "database connection failed"} with a 200 status. Content validation checks that the response body contains expected strings, matches a JSON schema, or doesn't contain error patterns.
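A sketch of both styles of content validation — substring checks on the raw body, and a typed check on one JSON field (helper names are illustrative):

```python
import json

def validate_body(body: str, must_contain=(), must_not_contain=()) -> bool:
    """Check the raw response body against expected and forbidden substrings."""
    return (all(s in body for s in must_contain)
            and not any(s in body for s in must_not_contain))

def validate_json_field(body: str, field: str, expected) -> bool:
    """Parse the body as JSON and compare one top-level field to an expected value."""
    try:
        return json.loads(body).get(field) == expected
    except ValueError:
        return False
```

The JSON variant catches exactly the failure mode above: a 200 response whose body says the database is down.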

DNS Resolution and TCP/TLS Breakdown

When something feels slow, you need to know where the time went. A full check decomposes response time into DNS lookup, TCP connection, TLS handshake, time-to-first-byte (TTFB), and content transfer. This breakdown turns "it's slow" into "DNS is taking 400ms because your resolver is overloaded."
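The first three phases can be timed directly with the standard library by performing each step separately (TTFB and content transfer would additionally require sending a request over the established connection). This is a sketch; the helper names are illustrative:

```python
import socket
import ssl
import time

def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_ms)."""
    start = time.monotonic()
    result = fn(*args, **kwargs)
    return result, (time.monotonic() - start) * 1000

def timing_breakdown(host: str, port: int = 443) -> dict:
    """Decompose connection setup into DNS, TCP, and TLS phases."""
    infos, dns_ms = timed(socket.getaddrinfo, host, port, proto=socket.IPPROTO_TCP)
    addr = infos[0][4][:2]  # (ip, port) of the first resolved address
    sock, tcp_ms = timed(socket.create_connection, addr, 10)
    ctx = ssl.create_default_context()
    tls, tls_ms = timed(ctx.wrap_socket, sock, server_hostname=host)
    tls.close()
    return {"dns_ms": dns_ms, "tcp_ms": tcp_ms, "tls_ms": tls_ms}
```

Store each phase as its own time series so you can tell a slow resolver from a slow handshake.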

[INTERNAL-LINK: understanding deployment health checks -> /docs/monitoring]


Why Are False Positives the Real Problem in Monitoring?

False alerts destroy trust in your monitoring system faster than missed alerts do. PagerDuty's 2024 State of Digital Operations report found that teams experiencing high alert noise have 2.5x longer mean-time-to-acknowledge for real incidents (PagerDuty, 2024). When your team ignores alerts because "it's probably nothing," you've already lost.

Citation capsule: PagerDuty's 2024 Digital Operations report found teams with high alert noise take 2.5x longer to acknowledge real incidents. False positives in uptime monitoring — caused by network blips, DNS hiccups, and transient failures — erode on-call trust and delay response to genuine outages.

The Cry-Wolf Effect

Here's what happens with noisy monitoring. Your team gets paged at 3 AM for a DNS timeout that resolved itself in 2 seconds. It happens again Tuesday. And Friday. By the following week, when a real outage hits, the on-call engineer checks their phone, mutters "probably another false alarm," and goes back to sleep. This is the cry-wolf effect, and it's measurable.

[UNIQUE INSIGHT] In our experience, teams that don't implement confirmation windows see their alert acknowledgment time double within 3 months. The monitors keep working, but the humans stop trusting them.

What Causes False Positives?

Most false alerts come from three sources:

  • Network blips — A single packet drop between your monitor and the target. It's not an outage; it's the internet being the internet.
  • DNS hiccups — Your resolver had a slow moment. The site is fine. Your DNS cache just expired at an unlucky time.
  • Transient load spikes — A garbage collection pause, a cold start, a burst of traffic. Response time spiked for 2 seconds and recovered.

None of these are outages. All of them trigger naive monitors.

How to Eliminate False Positives

Three techniques solve 95% of false positive problems:

Retries with backoff. Don't alert on the first failure. Wait 10 seconds, retry. If it fails again, wait 30 seconds, retry once more. Three consecutive failures in 70 seconds is almost certainly real.

Multi-location checks. If your monitor in Frankfurt sees a failure but your monitors in Virginia and Singapore don't, it's a routing problem, not an outage. Require 2 of 3 locations to confirm before alerting.

Confirmation windows. Define a time window — say, 2 minutes — during which consecutive checks must fail before the state transitions to DOWN. A single failed check starts the clock. A successful check resets it.


What Does a Monitoring System's Architecture Look Like?

A complete uptime monitoring system has five components. According to the CNCF, Prometheus — the most popular open-source monitoring tool — is used by 90% of CNCF member organizations (CNCF, 2024). But you don't need Prometheus's complexity for uptime checks. The architecture is simpler than you'd think.

Citation capsule: The core architecture of an uptime monitoring system consists of five components: a scheduler, an HTTP checker, time-series storage, an alert dispatcher, and a status page. While Prometheus is used by 90% of CNCF member organizations (CNCF, 2024), a purpose-built uptime system needs far less infrastructure.

Here's what the full system looks like:

                    +------------------+
                    |    Scheduler     |
                    | (cron / interval)|
                    +--------+---------+
                             |
              +--------------+--------------+
              |              |              |
        +-----v----+  +-----v----+  +------v---+
        | Checker  |  | Checker  |  | Checker  |
        | (US-East)|  | (EU-West)|  | (AP-SE)  |
        +-----+----+  +-----+----+  +-----+----+
              |              |              |
              +--------------+--------------+
                             |
                    +--------v---------+
                    | State Machine    |
                    | UP/DEGRADED/DOWN |
                    +--------+---------+
                             |
              +--------------+--------------+
              |                             |
    +---------v----------+      +-----------v--------+
    | Time-Series Store  |      | Alert Dispatcher   |
    | (PostgreSQL/       |      | (email, Slack,     |
    |  TimescaleDB)      |      |  webhook, SMS)     |
    +--------------------+      +--------------------+
              |
    +---------v----------+
    |    Status Page     |
    | (public dashboard) |
    +--------------------+

The Scheduler

The scheduler decides when to run each check. The simplest implementation is a loop with a sleep interval. More sophisticated schedulers spread checks across the interval to avoid thundering herd problems — if you have 100 monitors at 60-second intervals, you don't want all 100 firing at :00.
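One simple way to spread checks is a deterministic per-monitor offset derived from a hash of the monitor's ID, so the same monitor always fires at the same point in the interval. A sketch (the function name is illustrative):

```python
import hashlib

def start_offset(monitor_id: str, interval: int = 60) -> int:
    """Deterministic offset in [0, interval) that spreads monitors across the interval."""
    digest = hashlib.sha256(monitor_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % interval
```

With 100 monitors at 60-second intervals, this scatters them roughly uniformly across the minute instead of stampeding at :00.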

The HTTP Checker

This is the component that actually makes the request. It needs configurable timeouts (connection timeout, read timeout, total timeout), support for following or rejecting redirects, custom headers for authenticated endpoints, and the ability to validate response content.

Time-Series Storage

Every check produces a data point: timestamp, response time, status code, and any error message. You need to store this efficiently for trend analysis and SLA calculations. PostgreSQL with TimescaleDB works well. So does InfluxDB. For small installations, even SQLite handles the write volume.

The Alert Dispatcher

When the state machine transitions from UP to DOWN, the dispatcher sends notifications. The key architectural decision is separating detection from notification — the checker detects, the state machine decides, and the dispatcher notifies. This separation lets you add notification channels without touching the detection logic.

The Status Page

A public-facing page showing current status and historical uptime. This is surprisingly important for trust. StatusPage.io (now part of Atlassian) charges $29/month for a basic status page. You can generate one from the same time-series data your monitors produce.

[INTERNAL-LINK: how Temps handles health checks for deployments -> /docs/deployments]


How Do You Build a Basic Uptime Monitor from Scratch?

Building a functional uptime monitor takes about 100 lines of code. The key isn't the HTTP request — it's the state machine. According to a 2024 Stack Overflow survey, 76% of professional developers use Python for scripting and automation tasks (Stack Overflow, 2024). Here's a Python implementation that covers the essentials.

Citation capsule: A functional uptime monitor needs three core elements: an HTTP checker with configurable timeouts, a state machine managing UP/DEGRADED/DOWN transitions, and an alert dispatcher that fires on state changes. The implementation below takes about 100 lines and stores results in PostgreSQL for historical analysis.

The State Machine

The state machine is the brain of your monitor. It prevents false alerts by requiring multiple consecutive failures before transitioning states:

import httpx
import time
from enum import Enum
from dataclasses import dataclass
from datetime import datetime

class State(Enum):
    UP = "up"
    DEGRADED = "degraded"
    DOWN = "down"

@dataclass
class Monitor:
    url: str
    interval: int = 60  # seconds
    timeout: int = 10
    retries: int = 3
    degraded_threshold: float = 2.0  # seconds
    state: State = State.UP
    consecutive_failures: int = 0
    last_response_time: float = 0.0

    def check(self) -> dict:
        """Run a single health check with retry logic."""
        for attempt in range(self.retries):
            try:
                start = time.monotonic()
                response = httpx.get(
                    self.url,
                    timeout=self.timeout,
                    follow_redirects=True
                )
                elapsed = time.monotonic() - start

                if response.status_code >= 500:
                    self.consecutive_failures += 1
                    if attempt < self.retries - 1:
                        time.sleep(2 ** attempt)  # Back off before retrying on 5xx
                    continue

                self.consecutive_failures = 0
                self.last_response_time = elapsed

                # Determine new state
                old_state = self.state
                if elapsed > self.degraded_threshold:
                    self.state = State.DEGRADED
                else:
                    self.state = State.UP

                return {
                    "status": response.status_code,
                    "response_time": elapsed,
                    "state": self.state.value,
                    "changed": old_state != self.state,
                    "timestamp": datetime.utcnow().isoformat()
                }

            except (httpx.TimeoutException, httpx.ConnectError):
                self.consecutive_failures += 1
                if attempt < self.retries - 1:
                    time.sleep(2 ** attempt)  # Exponential backoff

        # All retries failed
        old_state = self.state
        self.state = State.DOWN
        return {
            "status": 0,
            "response_time": -1,
            "state": self.state.value,
            "changed": old_state != self.state,
            "timestamp": datetime.utcnow().isoformat()
        }

Alerting on State Transitions

The critical detail: only alert when state changes, not on every failed check. This single rule eliminates most alert fatigue:

def run_loop(monitor: Monitor, alert_fn):
    """Main monitoring loop."""
    while True:
        result = monitor.check()

        # Store result (PostgreSQL, SQLite, etc.)
        store_result(monitor.url, result)

        # Alert only on transitions
        if result["changed"]:
            if monitor.state == State.DOWN:
                alert_fn(f"DOWN: {monitor.url} is unreachable "
                         f"after {monitor.retries} retries")
            elif monitor.state == State.DEGRADED:
                alert_fn(f"SLOW: {monitor.url} response time "
                         f"{result['response_time']:.2f}s")
            elif monitor.state == State.UP:
                alert_fn(f"RECOVERED: {monitor.url} is back up "
                         f"(response: {result['response_time']:.2f}s)")

        time.sleep(monitor.interval)
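store_result is left abstract above. As one illustrative implementation, here is a SQLite-backed version whose schema mirrors the PostgreSQL table used in this guide (open_store and the column types are assumptions for the sketch):

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS uptime_checks (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    url TEXT NOT NULL,
    status_code INTEGER,
    response_time_ms REAL,
    state TEXT,
    checked_at TEXT
)
"""

def open_store(path: str = "uptime.db") -> sqlite3.Connection:
    """Open (and if needed, create) the results database."""
    db = sqlite3.connect(path)
    db.execute(SCHEMA)
    return db

def store_result(db: sqlite3.Connection, url: str, result: dict) -> None:
    """Persist one check result produced by Monitor.check()."""
    db.execute(
        "INSERT INTO uptime_checks (url, status_code, response_time_ms, state, checked_at) "
        "VALUES (?, ?, ?, ?, ?)",
        (url, result["status"], result["response_time"] * 1000,
         result["state"], result["timestamp"]),
    )
    db.commit()
```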

Storing Results in PostgreSQL

For SLA calculations and trend analysis, store every check result:

CREATE TABLE uptime_checks (
    id BIGSERIAL PRIMARY KEY,
    url TEXT NOT NULL,
    status_code INT,
    response_time_ms FLOAT,
    state VARCHAR(10),
    error_message TEXT,
    checked_at TIMESTAMPTZ DEFAULT NOW()
);

-- Calculate uptime percentage for the last 30 days
SELECT
    url,
    COUNT(*) FILTER (WHERE state = 'up') * 100.0 / COUNT(*) AS uptime_pct
FROM uptime_checks
WHERE checked_at > NOW() - INTERVAL '30 days'
GROUP BY url;

[ORIGINAL DATA] This state machine approach — UP to DEGRADED to DOWN, with transitions only firing alerts — is the same pattern used internally by Temps, PagerDuty, and most commercial monitoring tools. The difference is how many retries and from how many locations you confirm before transitioning.


What Should You Monitor Beyond HTTP Status?

HTTP checks tell you if your application is responding. They don't tell you if your server is about to run out of disk space. According to Datadog's 2024 Container Report, 65% of container failures are caused by resource exhaustion — CPU, memory, or disk — not application errors (Datadog, 2024).

Citation capsule: 65% of container failures stem from resource exhaustion rather than application errors, per Datadog's 2024 Container Report. Comprehensive monitoring must track CPU usage, memory consumption, disk I/O, and container-level metrics through the Docker stats API or the /proc filesystem to prevent outages before they happen.

CPU and Memory Monitoring

On Linux, the /proc filesystem gives you everything. /proc/stat provides CPU tick counters. /proc/meminfo shows memory usage. No agent needed — just read the files:

# CPU usage (percentage over 1 second)
grep 'cpu ' /proc/stat; sleep 1; grep 'cpu ' /proc/stat
# Compare the idle field values to calculate usage

# Memory usage
awk '/MemTotal|MemAvailable/ {print $1, $2/1024 "MB"}' /proc/meminfo

# Disk usage
df -h / | tail -1 | awk '{print "Used:", $3, "of", $2, "(" $5 ")"}'

Set thresholds: alert at 85% CPU sustained for 5 minutes, 90% memory, or 85% disk. These numbers aren't arbitrary — they leave enough headroom for traffic spikes without being so sensitive they fire on normal load patterns.
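The "sustained for 5 minutes" part is what keeps these thresholds quiet. One way to implement it is a fixed-size window that only fires when every recent sample is over the line — a sketch, with the class name and sample count as illustrative choices:

```python
from collections import deque

class SustainedThreshold:
    """Fire only when all samples in the window exceed the threshold."""
    def __init__(self, threshold: float, samples_needed: int):
        self.threshold = threshold
        self.recent = deque(maxlen=samples_needed)

    def update(self, value: float) -> bool:
        """Feed one sample; returns True once the breach is sustained."""
        self.recent.append(value)
        return (len(self.recent) == self.recent.maxlen
                and all(v > self.threshold for v in self.recent))
```

With 30-second sampling, `SustainedThreshold(85.0, 10)` approximates "85% CPU sustained for 5 minutes."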

Docker Container Metrics

If you're running containers, the Docker stats API gives you per-container resource usage without installing anything extra:

# Real-time container stats
docker stats --no-stream --format \
  "{{.Name}}: CPU {{.CPUPerc}} | Mem {{.MemUsage}} | Net {{.NetIO}}"

# Programmatic access via API
curl -s --unix-socket /var/run/docker.sock \
  http://localhost/containers/{id}/stats?stream=false | jq '.cpu_stats'

What makes container monitoring tricky is density. A server running 20 containers has 20 sets of metrics. You need to track not just absolute values but also resource limits — a container using 512MB of memory matters very differently depending on whether its limit is 1GB or 64GB.

Disk I/O and Network Throughput

Disk I/O bottlenecks are silent killers. Your HTTP checks pass fine until a database query triggers heavy disk reads and suddenly everything hangs. Monitor IOPS (input/output operations per second) and throughput separately — high IOPS with low throughput means lots of small random reads, which kills spinning disks and even degrades SSDs under heavy load.

[IMAGE: Server monitoring dashboard showing CPU, memory, disk, and network charts — search: "server resource monitoring dashboard grafana dark"]

[INTERNAL-LINK: container resource monitoring -> /docs/monitoring]


Which Open-Source Monitoring Tools Are Worth Using?

The self-hosted monitoring space has matured significantly. Uptime Kuma, the most popular open-source uptime monitor, has over 63,000 stars on GitHub and receives active contributions from hundreds of developers (GitHub, 2026). Here's how the main options compare.

Citation capsule: Uptime Kuma leads the open-source uptime monitoring category with over 63,000 GitHub stars as of 2026. Other notable tools include Gatus for developer-centric monitoring with YAML configuration, Healthchecks.io for cron job monitoring, and the Prometheus + Blackbox Exporter combination for metrics-driven teams.

| Tool | Focus | Language | GitHub Stars | Self-Hosted | Multi-Location | Alerting |
|---|---|---|---|---|---|---|
| Uptime Kuma | HTTP/TCP/DNS monitoring | Node.js | 63k+ | Yes | No (single instance) | 90+ channels |
| Gatus | Developer-centric monitoring | Go | 6.5k+ | Yes | Limited | Slack, Teams, email |
| Healthchecks.io | Cron/heartbeat monitoring | Python | 8.5k+ | Yes | No | 15+ channels |
| Prometheus + Blackbox | Metrics-driven probing | Go | 56k+ (Prometheus) | Yes | Via federation | Alertmanager |

Uptime Kuma

Uptime Kuma is the easiest to set up. One Docker container, a SQLite database, and a clean web UI. It supports HTTP, TCP, DNS, and ping monitors with configurable intervals down to 20 seconds. The notification integrations are extensive — Slack, Discord, Telegram, PagerDuty, and about 90 others.

The limitation is scale. Uptime Kuma runs as a single Node.js process. There's no built-in multi-location checking, no clustering, and performance degrades beyond a few hundred monitors. For a single team monitoring 10-50 services, it's excellent. For anything larger, you'll hit walls.

Gatus

Gatus takes a code-first approach. You define monitors in YAML, version them in Git, and deploy with your infrastructure. It's written in Go, compiles to a single binary, and uses minimal resources. Where Uptime Kuma is point-and-click, Gatus is Infrastructure-as-Code.

endpoints:
  - name: production-api
    url: https://api.example.com/health
    interval: 30s
    conditions:
      - "[STATUS] == 200"
      - "[RESPONSE_TIME] < 500"
      - "[BODY].status == 'healthy'"

Prometheus + Blackbox Exporter

If you're already running Prometheus, the Blackbox Exporter adds HTTP, TCP, ICMP, and DNS probing. It fits naturally into the Prometheus ecosystem — you get Alertmanager for notifications and Grafana for dashboards. The downside is that it's the most complex setup: Prometheus, Blackbox Exporter, Alertmanager, and Grafana are four separate services to deploy and maintain.

Healthchecks.io

Healthchecks.io solves a different problem: cron job monitoring. Instead of actively probing endpoints, it waits for your services to "phone home" at expected intervals. If a heartbeat doesn't arrive on time, it alerts. It's perfect for backup scripts, data pipelines, and scheduled tasks that you'd otherwise forget about until they've been silently failing for weeks.

[PERSONAL EXPERIENCE] We've run Uptime Kuma alongside commercial monitors for over a year. It catches the same incidents with comparable timing. The gap isn't in detection — it's in multi-region checking and team management features that single-instance tools can't provide.


How Does Temps Handle Uptime and Resource Monitoring?

Temps builds monitoring into the deployment platform instead of bolting it on afterward. Every container deployed through Temps gets HTTP monitoring and resource tracking automatically — no separate tool, no additional configuration, no extra infrastructure. The monitoring data feeds the same dashboard where you manage deployments.

HTTP Monitoring with Confirmation Logic

When you deploy a service on Temps, the platform automatically sets up HTTP health checks against your configured health endpoint. Checks run at configurable intervals with the same retry-and-confirm pattern described earlier in this guide. Failed checks trigger state transitions (UP to DEGRADED to DOWN), and alerts fire only on transitions — not on every individual check failure.

Status code filtering lets you define which response codes count as healthy. A 301 redirect might be perfectly fine for a marketing site but a failure for an API endpoint. You control what "healthy" means per service.

Per-Container Resource Monitoring

Temps collects CPU, memory, and network metrics for every running container through the Docker stats API. These aren't sampled once a minute and averaged — they're streamed at high resolution and stored in TimescaleDB for efficient time-series queries.

The integrated dashboard shows resource usage alongside deployment history. When you see a memory spike, you can immediately correlate it with a specific deployment, rollback, or traffic event — without switching between three different tools.

Alerting and Status

Alerts route through configurable channels: email, webhooks, and Slack integrations. Because monitoring is built into the platform, alerts include deployment context — which service, which version, which node, and what changed recently. That context turns a "your site is down" notification into an actionable incident response starting point.

[UNIQUE INSIGHT] Most monitoring tools exist as standalone services because they were built for a world where deployment, observability, and alerting were separate concerns. Integrating monitoring into the deployment platform eliminates the most common failure mode in monitoring setups: the monitoring tool itself going unmonitored.

[INTERNAL-LINK: Temps monitoring documentation -> /docs/monitoring]


How Do You Calculate SLA Uptime Percentages?

SLA math is simpler than vendors make it sound. The basic formula is: (total_minutes - downtime_minutes) / total_minutes * 100. According to the Uptime Institute's 2024 Annual Outage Analysis, the average data center outage lasts 100 minutes (Uptime Institute, 2024). Here's what the common SLA tiers actually mean in practice.

Citation capsule: SLA uptime is calculated as (total_minutes - downtime_minutes) / total_minutes * 100. The Uptime Institute's 2024 Annual Outage Analysis found the average data center outage lasts 100 minutes. The difference between 99.9% and 99.99% uptime is 43.8 versus 4.4 minutes of allowed downtime per month.

| SLA Level | Annual Downtime | Monthly Downtime | Daily Downtime |
|---|---|---|---|
| 99% | 3.65 days | 7.31 hours | 14.4 minutes |
| 99.9% | 8.77 hours | 43.8 minutes | 1.44 minutes |
| 99.95% | 4.38 hours | 21.9 minutes | 43.2 seconds |
| 99.99% | 52.6 minutes | 4.38 minutes | 8.6 seconds |
| 99.999% | 5.26 minutes | 26.3 seconds | 0.86 seconds |
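These values fall out of simple arithmetic on the annual minute budget. A sketch:

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960

def allowed_downtime(sla_pct: float) -> dict:
    """Minutes of downtime permitted per year, month, and day at a given SLA."""
    annual = MINUTES_PER_YEAR * (1 - sla_pct / 100)
    return {
        "annual_min": annual,
        "monthly_min": annual / 12,
        "daily_min": annual / 365.25,
    }
```

For example, `allowed_downtime(99.9)` gives roughly 526 minutes per year, or 43.8 minutes per month.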

Why "Five Nines" Is Mostly Marketing

99.999% uptime allows 5.26 minutes of downtime per year. That includes planned maintenance, DNS propagation, certificate renewals, and deployments. Google and AWS don't consistently achieve this across all services. If a startup promises you five nines, they either don't understand the math or they're redefining what counts as downtime.

For most web applications, 99.9% (43 minutes of monthly downtime) is an honest and achievable target. Aiming higher is fine. Promising higher is risky.

Handling Maintenance Windows

Here's the question nobody asks early enough: do maintenance windows count against your SLA? Most commercial SLA agreements exclude "scheduled maintenance" from uptime calculations. If you're calculating SLA for your own services, be explicit about this upfront. Your SQL query for SLA should look like:

-- Assumes an is_maintenance_window BOOLEAN column added to uptime_checks
SELECT
    url,
    COUNT(*) FILTER (WHERE state = 'up') * 100.0 / COUNT(*) AS uptime_pct
FROM uptime_checks
WHERE checked_at > NOW() - INTERVAL '30 days'
  AND NOT is_maintenance_window
GROUP BY url;

Don't cheat the number. If your users experienced downtime, it was downtime. Maintenance exclusions should cover pre-announced, user-notified windows only.


Frequently Asked Questions

How often should uptime monitors check endpoints?

For production services, 30-60 second intervals strike the right balance between fast detection and low overhead. UptimeRobot's free tier uses 5-minute intervals, which means up to 5 minutes of undetected downtime. Paid tiers from Pingdom and Better Uptime support intervals as low as 30 seconds. If you're self-hosting, there's no cost difference between 30-second and 5-minute checks — go with 30 seconds.

What's the difference between uptime monitoring and APM?

Uptime monitoring answers one question: is the service reachable and responding? Application Performance Monitoring (APM) goes deeper — it traces individual requests through your code, identifies slow database queries, and profiles memory leaks. According to Gartner, the APM market reached $6.4 billion in 2024 (Gartner, 2024). You need uptime monitoring first, APM second. They're complementary, not competing.

[INTERNAL-LINK: setting up OpenTelemetry tracing -> /blog/how-to-set-up-opentelemetry-tracing]

Can you monitor from multiple locations without paying for SaaS?

Yes, but it requires running your own checker nodes in different regions. Deploy lightweight checker containers in 2-3 cloud regions (a $5/mo VPS in each region works fine), have them report results to a central coordinator, and require 2-of-3 confirmation before alerting. Uptime Kuma doesn't support this natively, but Gatus and custom solutions handle it well.

How do you monitor monitoring itself?

This is the "who watches the watchers" problem. The simplest solution: use a second, independent monitoring system to watch the first. A free-tier UptimeRobot check against your self-hosted monitor's status page costs nothing and catches the scenario where your monitor goes down silently. Alternatively, use a heartbeat pattern — your monitor sends a "still alive" ping to an external service like Healthchecks.io every minute. If the ping stops, you get alerted through a completely separate channel.


What Should You Build Next?

Uptime monitoring is one piece of the observability puzzle. It tells you that something broke, but not why. Pair it with error tracking to capture stack traces, request tracing to follow failures through your system, and resource monitoring to spot exhaustion before it causes outages.

If you're self-hosting your deployment infrastructure, the strongest approach is a platform that integrates monitoring alongside deployments, logs, and analytics. You avoid the duct-tape problem of wiring together 5 separate tools — and you avoid the $200+/month SaaS bill of Pingdom + Datadog + Sentry + Plausible combined.

Start with what you have. The Python monitor from this guide works. Uptime Kuma works. When you outgrow standalone tools and want everything in one place, Temps includes HTTP monitoring, container resource tracking, error tracking, and web analytics in a single self-hosted binary.

[INTERNAL-LINK: getting started with Temps -> /docs/getting-started]

#uptime-monitoring #health-checks #sla #alerting #self-hosted #uptime-monitoring-system