Temps

Downtime Costs $300K+/Hour — Build Uptime Monitoring That Catches It in Seconds

March 12, 2026 (3mo ago)

Written by Temps Team

Last updated March 12, 2026 (3mo ago)

Back to all posts

Temps

Downtime Costs $300K+/Hour — Build Uptime Monitoring That Catches It in Seconds

March 12, 2026 (3mo ago)

Written by Temps Team

Last updated March 12, 2026 (3mo ago)

Can You Self-Host Uptime Monitoring?

Yes. A self-hosted uptime monitoring system needs three things: an HTTP checker that runs on a schedule, a state machine that prevents false alerts, and a notification dispatcher that fires only on state transitions. That's it. You don't need Pingdom or UptimeRobot — both are managed services that charge for something you can run yourself.

This guide explains what uptime monitoring does under the hood, walks through building a minimal monitor from scratch, and compares the leading self-hosted options. If you want monitoring built into your deployment platform — no separate tool, no additional config — the last section covers how Temps handles it with a dedicated temps-status-page subsystem.

TL;DR: Uptime monitoring is a scheduled HTTP check, a state machine (UP → DEGRADED → DOWN), and an alert dispatcher. Commercial tools like Pingdom charge for something you can build or self-host. Uptime Kuma (63,000+ GitHub stars) proves self-hosted monitoring works at scale. Temps includes HTTP monitoring, a built-in status page, and alert channels (email, Slack, webhook) in the same binary that handles your deployments — no separate monitoring subscription.

Temps vs Pingdom vs UptimeRobot: What Actually Differs

Before building or buying, compare what each option actually provides:

Feature	Temps	Pingdom	UptimeRobot
Self-hostable	Yes (free, Apache 2.0)	No	No (limited free tier)
HTTP checks	Yes	Yes	Yes
Check interval	60s default, configurable	1min (paid)	5min (free), 1min (paid)
Status page included	Yes (built-in)	Extra cost	Yes (limited free)
Alert channels	Email, Slack, webhook	Email, SMS, Slack	Email, SMS, Slack, webhook
Incident management	Yes	Yes	Limited
Integrated with deployments	Yes (auto-created per environment)	No	No
Pricing	Free self-host / ~$6/mo cloud	See pricing page	See pricing page

Temps creates HTTP monitors automatically when you deploy an environment — no separate dashboard to set up, no API keys to rotate between tools.

What Does Uptime Monitoring Actually Check?

Uptime monitoring goes beyond "is the server responding." According to Catchpoint, 77% of outages involve degraded performance rather than complete failures. A good monitoring system checks multiple layers to catch degradation before it becomes downtime.

HTTP Status Codes and Response Times

The most basic check: send an HTTP GET and verify you get a 200 status code. But status alone is not enough. A page that returns 200 OK in 8 seconds is functionally down for your users. Good monitors track response time and alert when it crosses a degraded threshold — not just when the service is fully unreachable.

You also want to monitor specific status code patterns. A spike in 502s from your reverse proxy tells a different story than a 503 from your application. Each demands a different response.

SSL Certificate Expiry

Let's Encrypt certificates expire every 90 days. Auto-renewal works until it doesn't — a DNS change, a permission error, a failed hook script. According to Netcraft, roughly 3.6 million SSL certificates expire on any given day worldwide. Your monitor should warn you 14, 7, and 3 days before expiry.

Content Validation

A 200 status code does not guarantee your page is working. Your API might return {"error": "database connection failed"} with a 200 status. Content validation checks that the response body contains expected strings, matches a JSON schema, or does not contain error patterns.

DNS Resolution and TCP/TLS Breakdown

When something feels slow, you need to know where the time went. A full check decomposes response time into DNS lookup, TCP connection, TLS handshake, time-to-first-byte (TTFB), and content transfer. This breakdown turns "it's slow" into "DNS is taking 400ms because your resolver is overloaded."

Why Are False Positives the Real Problem in Monitoring?

False alerts destroy trust faster than missed alerts. According to PagerDuty's State of Digital Operations report, teams experiencing high alert noise have 2.5x longer mean-time-to-acknowledge for real incidents. When your team ignores alerts because "it's probably nothing," you've already lost.

The Cry-Wolf Effect

Your team gets paged at 3 AM for a DNS timeout that resolved itself in 2 seconds. It happens again Tuesday. And Friday. By the following week, when a real outage hits, the on-call engineer checks their phone, mutters "probably another false alarm," and goes back to sleep. Teams that do not implement confirmation windows see their alert acknowledgment time double within 3 months. The monitors keep working, but the humans stop trusting them.

What Causes False Positives?

Most false alerts come from three sources:

Network blips — A single packet drop between your monitor and the target. Not an outage; just the internet.
DNS hiccups — Your resolver had a slow moment. The site is fine. Your DNS cache just expired at an unlucky time.
Transient load spikes — A garbage collection pause, a cold start, a burst of traffic. Response time spiked for 2 seconds and recovered.

None of these are outages. All of them trigger naive monitors.

How to Eliminate False Positives

Three techniques solve 95% of false positive problems:

Retries with backoff. Do not alert on the first failure. Wait 10 seconds, retry. If it fails again, wait 30 seconds, retry once more. Three consecutive failures in 70 seconds is almost certainly real.

Multi-location checks. If your monitor in Frankfurt sees a failure but your monitors in Virginia and Singapore do not, it is a routing problem, not an outage. Require 2 of 3 locations to confirm before alerting.

Confirmation windows. Define a time window — say, 2 minutes — during which consecutive checks must fail before the state transitions to DOWN. A single failed check starts the clock. A successful check resets it.

What Does a Monitoring System's Architecture Look Like?

A complete uptime monitoring system has five components. According to the CNCF, Prometheus is used by 90% of CNCF member organizations. But you do not need Prometheus's complexity for uptime checks. The architecture is simpler:

                    +------------------+
                    |    Scheduler     |
                    | (cron / interval)|
                    +--------+---------+
                             |
              +--------------+--------------+
              |              |              |
        +-----v----+  +-----v----+  +------v---+
        | Checker  |  | Checker  |  | Checker  |
        | (US-East)|  | (EU-West)|  | (AP-SE)  |
        +-----+----+  +-----+----+  +-----+----+
              |              |              |
              +--------------+--------------+
                             |
                    +--------v---------+
                    | State Machine    |
                    | UP/DEGRADED/DOWN |
                    +--------+---------+
                             |
              +--------------+--------------+
              |                             |
    +---------v----------+      +-----------v--------+
    | Time-Series Store  |      | Alert Dispatcher   |
    | (PostgreSQL/       |      | (email, Slack,     |
    |  TimescaleDB)      |      |  webhook)          |
    +--------------------+      +--------------------+
              |
    +---------v----------+
    |    Status Page     |
    | (public dashboard) |
    +--------------------+

The Scheduler

The scheduler decides when to run each check. The simplest implementation is a loop with a sleep interval. More sophisticated schedulers spread checks across the interval to avoid thundering herd problems — if you have 100 monitors at 60-second intervals, you do not want all 100 firing at :00.

The HTTP Checker

This component actually makes the request. It needs configurable timeouts (connection timeout, read timeout, total timeout), support for following or rejecting redirects, custom headers for authenticated endpoints, and the ability to validate response content.

Time-Series Storage

Every check produces a data point: timestamp, response time, status code, and any error message. You need to store this efficiently for trend analysis and SLA calculations. PostgreSQL with TimescaleDB works well. So does InfluxDB. For small installations, even SQLite handles the write volume.

The Alert Dispatcher

When the state machine transitions from UP to DOWN, the dispatcher sends notifications. The key architectural decision is separating detection from notification — the checker detects, the state machine decides, and the dispatcher notifies. This separation lets you add notification channels without touching detection logic.

The Status Page

A public-facing page showing current status and historical uptime. StatusPage.io (now part of Atlassian) charges separately for this capability. You can generate one from the same time-series data your monitors produce.

How Do You Build a Basic Uptime Monitor from Scratch?

Building a functional uptime monitor takes about 100 lines of code. The key is not the HTTP request — it is the state machine. Here is a Python implementation that covers the essentials:

The State Machine

import httpx
import time
import smtplib
from enum import Enum
from dataclasses import dataclass, field
from datetime import datetime

class State(Enum):
    UP = "up"
    DEGRADED = "degraded"
    DOWN = "down"

@dataclass
class Monitor:
    url: str
    interval: int = 60  # seconds
    timeout: int = 10
    retries: int = 3
    degraded_threshold: float = 2.0  # seconds
    state: State = State.UP
    consecutive_failures: int = 0
    last_response_time: float = 0.0

    def check(self) -> dict:
        """Run a single health check with retry logic."""
        for attempt in range(self.retries):
            try:
                start = time.monotonic()
                response = httpx.get(
                    self.url,
                    timeout=self.timeout,
                    follow_redirects=True
                )
                elapsed = time.monotonic() - start

                if response.status_code >= 500:
                    self.consecutive_failures += 1
                    continue

                self.consecutive_failures = 0
                self.last_response_time = elapsed

                old_state = self.state
                if elapsed > self.degraded_threshold:
                    self.state = State.DEGRADED
                else:
                    self.state = State.UP

                return {
                    "status": response.status_code,
                    "response_time": elapsed,
                    "state": self.state.value,
                    "changed": old_state != self.state,
                    "timestamp": datetime.utcnow().isoformat()
                }

            except (httpx.TimeoutException, httpx.ConnectError):
                self.consecutive_failures += 1
                if attempt < self.retries - 1:
                    time.sleep(2 ** attempt)  # Exponential backoff

        # All retries failed
        old_state = self.state
        self.state = State.DOWN
        return {
            "status": 0,
            "response_time": -1,
            "state": self.state.value,
            "changed": old_state != self.state,
            "timestamp": datetime.utcnow().isoformat()
        }

Alerting on State Transitions

Alert only when state changes, not on every failed check. This single rule eliminates most alert fatigue:

def run_loop(monitor: Monitor, alert_fn):
    """Main monitoring loop."""
    while True:
        result = monitor.check()

        # Store result (PostgreSQL, SQLite, etc.)
        store_result(monitor.url, result)

        # Alert only on transitions
        if result["changed"]:
            if monitor.state == State.DOWN:
                alert_fn(f"DOWN: {monitor.url} is unreachable "
                         f"after {monitor.retries} retries")
            elif monitor.state == State.DEGRADED:
                alert_fn(f"SLOW: {monitor.url} response time "
                         f"{result['response_time']:.2f}s")
            elif monitor.state == State.UP:
                alert_fn(f"RECOVERED: {monitor.url} is back up "
                         f"(response: {result['response_time']:.2f}s)")

        time.sleep(monitor.interval)

Storing Results in PostgreSQL

For SLA calculations and trend analysis, store every check result:

CREATE TABLE uptime_checks (
    id BIGSERIAL PRIMARY KEY,
    url TEXT NOT NULL,
    status_code INT,
    response_time_ms FLOAT,
    state VARCHAR(10),
    error_message TEXT,
    checked_at TIMESTAMPTZ DEFAULT NOW()
);

-- Calculate uptime percentage for the last 30 days
SELECT
    url,
    COUNT(*) FILTER (WHERE state = 'up') * 100.0 / COUNT(*) AS uptime_pct
FROM uptime_checks
WHERE checked_at > NOW() - INTERVAL '30 days'
GROUP BY url;

This state machine approach — UP to DEGRADED to DOWN, with transitions only firing alerts — is the same pattern used internally by Temps, PagerDuty, and most commercial monitoring tools. The difference is how many retries and from how many locations you confirm before transitioning.

What Should You Monitor Beyond HTTP Status?

HTTP checks tell you if your application is responding. They do not tell you if your server is about to run out of disk space. According to Datadog's Container Report, 65% of container failures are caused by resource exhaustion — CPU, memory, or disk — not application errors.

CPU and Memory Monitoring

On Linux, the /proc filesystem gives you everything. /proc/stat provides CPU tick counters. /proc/meminfo shows memory usage:

# CPU usage (percentage over 1 second)
grep 'cpu ' /proc/stat; sleep 1; grep 'cpu ' /proc/stat
# Compare the idle field values to calculate usage

# Memory usage
awk '/MemTotal|MemAvailable/ {print $1, $2/1024 "MB"}' /proc/meminfo

# Disk usage
df -h / | tail -1 | awk '{print "Used:", $3, "of", $2, "(" $5 ")"}'

Set thresholds: alert at 85% CPU sustained for 5 minutes, 90% memory, or 85% disk. These numbers leave enough headroom for traffic spikes without being so sensitive they fire on normal load patterns.

Docker Container Metrics

If you are running containers, the Docker stats API gives you per-container resource usage without installing anything extra:

# Real-time container stats
docker stats --no-stream --format \
  "{{.Name}}: CPU {{.CPUPerc}} | Mem {{.MemUsage}} | Net {{.NetIO}}"

# Programmatic access via API
curl -s --unix-socket /var/run/docker.sock \
  http://localhost/containers/{id}/stats?stream=false | jq '.cpu_stats'

What makes container monitoring tricky is density. A server running 20 containers has 20 sets of metrics. You need to track not just absolute values but also resource limits — a container using 512MB of memory matters very differently depending on whether its limit is 1GB or 64GB.

Disk I/O and Network Throughput

Disk I/O bottlenecks are silent killers. Your HTTP checks pass fine until a database query triggers heavy disk reads and suddenly everything hangs. Monitor IOPS (input/output operations per second) and throughput separately — high IOPS with low throughput means lots of small random reads, which kills spinning disks and degrades SSDs under heavy load.

Which Open-Source Monitoring Tools Are Worth Using?

The self-hosted monitoring space has matured significantly. Uptime Kuma has over 63,000 stars on GitHub and receives active contributions from hundreds of developers. Here is how the main options compare:

Tool	Focus	Language	GitHub Stars	Self-Hosted	Multi-Location	Alerting
Uptime Kuma	HTTP/TCP/DNS monitoring	Node.js	63k+	Yes	No (single instance)	90+ channels
Gatus	Developer-centric monitoring	Go	6.5k+	Yes	Limited	Slack, Teams, email
Healthchecks.io	Cron/heartbeat monitoring	Python	8.5k+	Yes	No	15+ channels
Prometheus + Blackbox	Metrics-driven probing	Go	56k+ (Prometheus)	Yes	Via federation	Alertmanager
Temps	Deployment-integrated monitoring	Rust	—	Yes (free)	Planned	Email, Slack, webhook

Uptime Kuma

Uptime Kuma is the easiest to set up. One Docker container, a SQLite database, and a clean web UI. It supports HTTP, TCP, DNS, and ping monitors with configurable intervals down to 20 seconds. The notification integrations are extensive — Slack, Discord, Telegram, PagerDuty, and about 90 others.

The limitation is scale. Uptime Kuma runs as a single Node.js process. There is no built-in multi-location checking, no clustering, and performance degrades beyond a few hundred monitors. For a single team monitoring 10–50 services, it is excellent. For anything larger, you will hit walls.

Gatus

Gatus takes a code-first approach. You define monitors in YAML, version them in Git, and deploy with your infrastructure. It is written in Go, compiles to a single binary, and uses minimal resources. Where Uptime Kuma is point-and-click, Gatus is Infrastructure-as-Code.

endpoints:
  - name: production-api
    url: https://api.example.com/health
    interval: 30s
    conditions:
      - "[STATUS] == 200"
      - "[RESPONSE_TIME] < 500"
      - "[BODY].status == 'healthy'"

Prometheus + Blackbox Exporter

If you are already running Prometheus, the Blackbox Exporter adds HTTP, TCP, ICMP, and DNS probing. It fits naturally into the Prometheus ecosystem — you get Alertmanager for notifications and Grafana for dashboards. The downside is complexity: Prometheus, Blackbox Exporter, Alertmanager, and Grafana are four separate services to deploy and maintain.

Healthchecks.io

Healthchecks.io solves a different problem: cron job monitoring. Instead of actively probing endpoints, it waits for your services to "phone home" at expected intervals. If a heartbeat does not arrive on time, it alerts. It is perfect for backup scripts, data pipelines, and scheduled tasks that you would otherwise forget about until they have been silently failing for weeks.

How Does Temps Handle Uptime Monitoring?

Temps builds monitoring into the deployment platform instead of bolting it on afterward. The temps-status-page crate is a first-class plugin that runs alongside every deployment — no separate tool, no extra infrastructure, no additional configuration step.

Automatic Monitor Creation

When you deploy a new environment on Temps, the platform automatically creates an HTTP monitor for it. This happens via an EnvironmentCreated job that the StatusPagePlugin processes — you do not configure monitoring separately, it just exists from the moment your first deployment lands.

Each monitor checks at a 60-second default interval (configurable), tracks response_time_ms, and records status as operational, degraded, or down.

HTTP Health Checks with Confirmation Logic

The HealthCheckService performs HTTP checks with retry-and-confirm logic — the same pattern described in the state machine section above. Failed checks are retried before any state transition fires. You can configure a custom check_path per monitor so Temps checks your /health or /api/status endpoint rather than the root path.

Temps respects the distinction between degraded (slow but responding) and down (unreachable), so you get actionable alerts rather than binary up/down noise.

Built-In Status Page

Temps ships a status page plugin (temps-status-page) that serves a public dashboard showing current monitor status, historical uptime, and active incidents. This is the same functionality that services like StatusPage.io charge separately for. Because it reads from the same TimescaleDB tables that store your check results, historical data is available immediately without any additional setup.

The status page includes an IncidentService for creating and managing incidents, and an OutageDetectionService that bridges detected outages to the alarm system automatically.

Alert Channels

Alerts route through configurable channels. Temps has verified implementations for:

Email — via EmailProvider, SMTP-based transactional delivery
Slack — via SlackProvider, webhook-based channel notifications
Webhook — via WebhookProvider, generic HTTP POST to any endpoint

Because monitoring is built into the platform, alerts include deployment context — which service, which version, which node. That context turns a "your site is down" notification into an actionable incident response starting point.

Metric-Based Alerting

Beyond HTTP checks, Temps also has monitoring_alert_rules for container and database metrics. You can define rules like "alert when active Postgres connections exceed 80% of max" or "alert when container memory usage exceeds 90% of limit" — with configurable thresholds, comparators, and silence windows. Alerts fire only after a breach persists for a configurable duration (for_duration_secs), eliminating transient spikes.

Why Integrated Monitoring Beats Standalone Tools

Most monitoring tools exist as standalone services because they were built for a world where deployment, observability, and alerting were separate concerns. The most common failure mode in monitoring setups is the monitoring tool itself going unmonitored. Integrating monitoring into the deployment platform eliminates this problem — Temps monitors your services and its own health checks run as part of the same binary.

Temps is free to self-host (Apache 2.0). Temps Cloud costs approximately $6/month (Hetzner infrastructure cost + 30% margin), with no per-seat fees and no separate monitoring subscription. The same binary that handles your git-push deployments, Pingora-based reverse proxy (built by Cloudflare), and WireGuard mesh networking also runs your uptime monitors and status page.

How Do You Calculate SLA Uptime Percentages?

SLA math is simpler than vendors make it sound. The basic formula is: (total_minutes - downtime_minutes) / total_minutes * 100. According to the Uptime Institute's Annual Outage Analysis, the average data center outage lasts 100 minutes. Here is what the common SLA tiers mean in practice:

SLA Level	Annual Downtime	Monthly Downtime	Daily Downtime
99%	3.65 days	7.31 hours	14.4 minutes
99.9%	8.77 hours	43.8 minutes	1.44 minutes
99.95%	4.38 hours	21.9 minutes	43.2 seconds
99.99%	52.6 minutes	4.38 minutes	8.6 seconds
99.999%	5.26 minutes	26.3 seconds	0.86 seconds

Why "Five Nines" Is Mostly Marketing

99.999% uptime allows 5.26 minutes of downtime per year. That includes planned maintenance, DNS propagation, certificate renewals, and deployments. Google and AWS do not consistently achieve this across all services. For most web applications, 99.9% (43.8 minutes of monthly downtime) is an honest and achievable target.

Handling Maintenance Windows

Do maintenance windows count against your SLA? Most commercial SLA agreements exclude "scheduled maintenance" from uptime calculations. If you are calculating SLA for your own services, be explicit about this upfront:

SELECT
    COUNT(*) FILTER (WHERE state = 'up') * 100.0 /
    COUNT(*) FILTER (WHERE NOT is_maintenance_window) AS uptime_pct
FROM uptime_checks
WHERE checked_at > NOW() - INTERVAL '30 days'
  AND NOT is_maintenance_window;

Do not cheat the number. If your users experienced downtime, it was downtime. Maintenance exclusions should cover pre-announced, user-notified windows only.

Frequently Asked Questions

How often should uptime monitors check endpoints?

For production services, 30–60 second intervals strike the right balance between fast detection and low overhead. UptimeRobot's free tier uses 5-minute intervals, which means up to 5 minutes of undetected downtime. If you are self-hosting, there is no cost difference between 30-second and 5-minute checks — go with 30 seconds. Temps defaults to 60-second intervals.

What's the difference between uptime monitoring and APM?

Uptime monitoring answers one question: is the service reachable and responding? Application Performance Monitoring (APM) goes deeper — it traces individual requests through your code, identifies slow database queries, and profiles memory leaks. According to Gartner, the APM market reached $6.4 billion in 2024. You need uptime monitoring first, APM second. They are complementary, not competing.

Can you monitor from multiple locations without paying for SaaS?

Yes, but it requires running your own checker nodes in different regions. Deploy lightweight checker containers in 2–3 cloud regions (a $5/mo VPS in each region works fine), have them report results to a central coordinator, and require 2-of-3 confirmation before alerting. Uptime Kuma does not support this natively, but Gatus and custom solutions handle it well.

How do you monitor monitoring itself?

The simplest solution: use a second, independent monitoring system to watch the first. A free-tier UptimeRobot check against your self-hosted monitor's status page costs nothing and catches the scenario where your monitor goes down silently. Alternatively, use a heartbeat pattern — your monitor sends a "still alive" ping to an external service like Healthchecks.io every minute. If the ping stops, you get alerted through a completely separate channel.

Is Temps uptime monitoring production-ready?

The temps-status-page crate ships as a compiled plugin in every Temps binary release. Monitors are automatically created for each environment, alerts route through email, Slack, and webhook channels, and the public status page serves from the same TimescaleDB instance used for all other observability data. It is the same binary used in Temps Cloud deployments.

What Should You Build Next?

Uptime monitoring tells you that something broke, but not why. Pair it with error tracking to capture stack traces, request tracing to follow failures through your system, and resource monitoring to spot exhaustion before it causes outages.

If you are self-hosting your deployment infrastructure, the strongest approach is a platform that integrates monitoring alongside deployments, logs, and analytics. You avoid wiring together 5 separate tools — and you avoid the SaaS subscription overhead of Pingdom + Datadog + Sentry + Plausible running in parallel.

Temps includes HTTP monitoring, a built-in status page, container resource tracking, error tracking, and web analytics in a single self-hosted binary. Free to self-host under Apache 2.0. Temps Cloud starts at approximately $6/month on Hetzner infrastructure with no per-seat pricing.

Back to all posts

Can You Self-Host Uptime Monitoring?

TL;DR: Uptime monitoring is a scheduled HTTP check, a state machine (UP → DEGRADED → DOWN), and an alert dispatcher. Commercial tools like Pingdom charge for something you can build or self-host. Uptime Kuma (63,000+ GitHub stars) proves self-hosted monitoring works at scale. Temps includes HTTP monitoring, a built-in status page, and alert channels (email, Slack, webhook) in the same binary that handles your deployments — no separate monitoring subscription.

Temps vs Pingdom vs UptimeRobot: What Actually Differs

Before building or buying, compare what each option actually provides:

Feature	Temps	Pingdom	UptimeRobot
Self-hostable	Yes (free, Apache 2.0)	No	No (limited free tier)
HTTP checks	Yes	Yes	Yes
Check interval	60s default, configurable	1min (paid)	5min (free), 1min (paid)
Status page included	Yes (built-in)	Extra cost	Yes (limited free)
Alert channels	Email, Slack, webhook	Email, SMS, Slack	Email, SMS, Slack, webhook
Incident management	Yes	Yes	Limited
Integrated with deployments	Yes (auto-created per environment)	No	No
Pricing	Free self-host / ~$6/mo cloud	See pricing page	See pricing page

Temps creates HTTP monitors automatically when you deploy an environment — no separate dashboard to set up, no API keys to rotate between tools.

What Does Uptime Monitoring Actually Check?

HTTP Status Codes and Response Times

You also want to monitor specific status code patterns. A spike in 502s from your reverse proxy tells a different story than a 503 from your application. Each demands a different response.

SSL Certificate Expiry

Content Validation

DNS Resolution and TCP/TLS Breakdown

Why Are False Positives the Real Problem in Monitoring?

The Cry-Wolf Effect

What Causes False Positives?

Most false alerts come from three sources:

Network blips — A single packet drop between your monitor and the target. Not an outage; just the internet.
DNS hiccups — Your resolver had a slow moment. The site is fine. Your DNS cache just expired at an unlucky time.
Transient load spikes — A garbage collection pause, a cold start, a burst of traffic. Response time spiked for 2 seconds and recovered.

None of these are outages. All of them trigger naive monitors.

How to Eliminate False Positives

Three techniques solve 95% of false positive problems:

What Does a Monitoring System's Architecture Look Like?

                    +------------------+
                    |    Scheduler     |
                    | (cron / interval)|
                    +--------+---------+
                             |
              +--------------+--------------+
              |              |              |
        +-----v----+  +-----v----+  +------v---+
        | Checker  |  | Checker  |  | Checker  |
        | (US-East)|  | (EU-West)|  | (AP-SE)  |
        +-----+----+  +-----+----+  +-----+----+
              |              |              |
              +--------------+--------------+
                             |
                    +--------v---------+
                    | State Machine    |
                    | UP/DEGRADED/DOWN |
                    +--------+---------+
                             |
              +--------------+--------------+
              |                             |
    +---------v----------+      +-----------v--------+
    | Time-Series Store  |      | Alert Dispatcher   |
    | (PostgreSQL/       |      | (email, Slack,     |
    |  TimescaleDB)      |      |  webhook)          |
    +--------------------+      +--------------------+
              |
    +---------v----------+
    |    Status Page     |
    | (public dashboard) |
    +--------------------+

The Scheduler

The HTTP Checker

Time-Series Storage

The Alert Dispatcher

The Status Page

How Do You Build a Basic Uptime Monitor from Scratch?

Building a functional uptime monitor takes about 100 lines of code. The key is not the HTTP request — it is the state machine. Here is a Python implementation that covers the essentials:

The State Machine

import httpx
import time
import smtplib
from enum import Enum
from dataclasses import dataclass, field
from datetime import datetime

class State(Enum):
    UP = "up"
    DEGRADED = "degraded"
    DOWN = "down"

@dataclass
class Monitor:
    url: str
    interval: int = 60  # seconds
    timeout: int = 10
    retries: int = 3
    degraded_threshold: float = 2.0  # seconds
    state: State = State.UP
    consecutive_failures: int = 0
    last_response_time: float = 0.0

    def check(self) -> dict:
        """Run a single health check with retry logic."""
        for attempt in range(self.retries):
            try:
                start = time.monotonic()
                response = httpx.get(
                    self.url,
                    timeout=self.timeout,
                    follow_redirects=True
                )
                elapsed = time.monotonic() - start

                if response.status_code >= 500:
                    self.consecutive_failures += 1
                    continue

                self.consecutive_failures = 0
                self.last_response_time = elapsed

                old_state = self.state
                if elapsed > self.degraded_threshold:
                    self.state = State.DEGRADED
                else:
                    self.state = State.UP

                return {
                    "status": response.status_code,
                    "response_time": elapsed,
                    "state": self.state.value,
                    "changed": old_state != self.state,
                    "timestamp": datetime.utcnow().isoformat()
                }

            except (httpx.TimeoutException, httpx.ConnectError):
                self.consecutive_failures += 1
                if attempt < self.retries - 1:
                    time.sleep(2 ** attempt)  # Exponential backoff

        # All retries failed
        old_state = self.state
        self.state = State.DOWN
        return {
            "status": 0,
            "response_time": -1,
            "state": self.state.value,
            "changed": old_state != self.state,
            "timestamp": datetime.utcnow().isoformat()
        }

Alerting on State Transitions

Alert only when state changes, not on every failed check. This single rule eliminates most alert fatigue:

def run_loop(monitor: Monitor, alert_fn):
    """Main monitoring loop."""
    while True:
        result = monitor.check()

        # Store result (PostgreSQL, SQLite, etc.)
        store_result(monitor.url, result)

        # Alert only on transitions
        if result["changed"]:
            if monitor.state == State.DOWN:
                alert_fn(f"DOWN: {monitor.url} is unreachable "
                         f"after {monitor.retries} retries")
            elif monitor.state == State.DEGRADED:
                alert_fn(f"SLOW: {monitor.url} response time "
                         f"{result['response_time']:.2f}s")
            elif monitor.state == State.UP:
                alert_fn(f"RECOVERED: {monitor.url} is back up "
                         f"(response: {result['response_time']:.2f}s)")

        time.sleep(monitor.interval)

Storing Results in PostgreSQL

For SLA calculations and trend analysis, store every check result:

CREATE TABLE uptime_checks (
    id BIGSERIAL PRIMARY KEY,
    url TEXT NOT NULL,
    status_code INT,
    response_time_ms FLOAT,
    state VARCHAR(10),
    error_message TEXT,
    checked_at TIMESTAMPTZ DEFAULT NOW()
);

-- Calculate uptime percentage for the last 30 days
SELECT
    url,
    COUNT(*) FILTER (WHERE state = 'up') * 100.0 / COUNT(*) AS uptime_pct
FROM uptime_checks
WHERE checked_at > NOW() - INTERVAL '30 days'
GROUP BY url;

What Should You Monitor Beyond HTTP Status?

CPU and Memory Monitoring

On Linux, the /proc filesystem gives you everything. /proc/stat provides CPU tick counters. /proc/meminfo shows memory usage:

# CPU usage (percentage over 1 second)
grep 'cpu ' /proc/stat; sleep 1; grep 'cpu ' /proc/stat
# Compare the idle field values to calculate usage

# Memory usage
awk '/MemTotal|MemAvailable/ {print $1, $2/1024 "MB"}' /proc/meminfo

# Disk usage
df -h / | tail -1 | awk '{print "Used:", $3, "of", $2, "(" $5 ")"}'

Set thresholds: alert at 85% CPU sustained for 5 minutes, 90% memory, or 85% disk. These numbers leave enough headroom for traffic spikes without being so sensitive they fire on normal load patterns.

Docker Container Metrics

If you are running containers, the Docker stats API gives you per-container resource usage without installing anything extra:

# Real-time container stats
docker stats --no-stream --format \
  "{{.Name}}: CPU {{.CPUPerc}} | Mem {{.MemUsage}} | Net {{.NetIO}}"

# Programmatic access via API
curl -s --unix-socket /var/run/docker.sock \
  http://localhost/containers/{id}/stats?stream=false | jq '.cpu_stats'

Disk I/O and Network Throughput

Which Open-Source Monitoring Tools Are Worth Using?

Tool	Focus	Language	GitHub Stars	Self-Hosted	Multi-Location	Alerting
Uptime Kuma	HTTP/TCP/DNS monitoring	Node.js	63k+	Yes	No (single instance)	90+ channels
Gatus	Developer-centric monitoring	Go	6.5k+	Yes	Limited	Slack, Teams, email
Healthchecks.io	Cron/heartbeat monitoring	Python	8.5k+	Yes	No	15+ channels
Prometheus + Blackbox	Metrics-driven probing	Go	56k+ (Prometheus)	Yes	Via federation	Alertmanager
Temps	Deployment-integrated monitoring	Rust	—	Yes (free)	Planned	Email, Slack, webhook

Uptime Kuma

Gatus

endpoints:
  - name: production-api
    url: https://api.example.com/health
    interval: 30s
    conditions:
      - "[STATUS] == 200"
      - "[RESPONSE_TIME] < 500"
      - "[BODY].status == 'healthy'"

Email — via EmailProvider, SMTP-based transactional delivery
Slack — via SlackProvider, webhook-based channel notifications
Webhook — via WebhookProvider, generic HTTP POST to any endpoint

SLA Level	Annual Downtime	Monthly Downtime	Daily Downtime
99%	3.65 days	7.31 hours	14.4 minutes
99.9%	8.77 hours	43.8 minutes	1.44 minutes
99.95%	4.38 hours	21.9 minutes	43.2 seconds
99.99%	52.6 minutes	4.38 minutes	8.6 seconds
99.999%	5.26 minutes	26.3 seconds	0.86 seconds

Why "Five Nines" Is Mostly Marketing

Handling Maintenance Windows

SELECT
    COUNT(*) FILTER (WHERE state = 'up') * 100.0 /
    COUNT(*) FILTER (WHERE NOT is_maintenance_window) AS uptime_pct
FROM uptime_checks
WHERE checked_at > NOW() - INTERVAL '30 days'
  AND NOT is_maintenance_window;

Do not cheat the number. If your users experienced downtime, it was downtime. Maintenance exclusions should cover pre-announced, user-notified windows only.