March 12, 2026
Written by Temps Team
Last updated March 12, 2026
Yes. A self-hosted uptime monitoring system needs three things: an HTTP checker that runs on a schedule, a state machine that prevents false alerts, and a notification dispatcher that fires only on state transitions. That's it. You don't need Pingdom or UptimeRobot — both are managed services that charge for something you can run yourself.
This guide explains what uptime monitoring does under the hood, walks through building a minimal monitor from scratch, and compares the leading self-hosted options. If you want monitoring built into your deployment platform — no separate tool, no additional config — the last section covers how Temps handles it with a dedicated temps-status-page subsystem.
TL;DR: Uptime monitoring is a scheduled HTTP check, a state machine (UP → DEGRADED → DOWN), and an alert dispatcher. Commercial tools like Pingdom charge for something you can build or self-host. Uptime Kuma (63,000+ GitHub stars) proves self-hosted monitoring works at scale. Temps includes HTTP monitoring, a built-in status page, and alert channels (email, Slack, webhook) in the same binary that handles your deployments — no separate monitoring subscription.
Before building or buying, compare what each option actually provides:
| Feature | Temps | Pingdom | UptimeRobot |
|---|---|---|---|
| Self-hostable | Yes (free, Apache 2.0) | No | No (limited free tier) |
| HTTP checks | Yes | Yes | Yes |
| Check interval | 60s default, configurable | 1min (paid) | 5min (free), 1min (paid) |
| Status page included | Yes (built-in) | Extra cost | Yes (limited free) |
| Alert channels | Email, Slack, webhook | Email, SMS, Slack | Email, SMS, Slack, webhook |
| Incident management | Yes | Yes | Limited |
| Integrated with deployments | Yes (auto-created per environment) | No | No |
| Pricing | Free self-host / ~$6/mo cloud | See pricing page | See pricing page |
Temps creates HTTP monitors automatically when you deploy an environment — no separate dashboard to set up, no API keys to rotate between tools.
Uptime monitoring goes beyond "is the server responding." According to Catchpoint, 77% of outages involve degraded performance rather than complete failures. A good monitoring system checks multiple layers to catch degradation before it becomes downtime.
The most basic check: send an HTTP GET and verify you get a 200 status code. But status alone is not enough. A page that returns 200 OK in 8 seconds is functionally down for your users. Good monitors track response time and alert when it crosses a degraded threshold — not just when the service is fully unreachable.
You also want to monitor specific status code patterns. A spike in 502s from your reverse proxy tells a different story than a 503 from your application. Each demands a different response.
Let's Encrypt certificates expire every 90 days. Auto-renewal works until it doesn't — a DNS change, a permission error, a failed hook script. According to Netcraft, roughly 3.6 million SSL certificates expire on any given day worldwide. Your monitor should warn you 14, 7, and 3 days before expiry.
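A minimal sketch of the expiry check, using only the Python standard library. The function names and the `WARN_DAYS` tuple are illustrative assumptions, not part of any particular tool; `fetch_not_after` needs network access to the host you pass in.

```python
import socket
import ssl
from datetime import datetime
from typing import Optional

WARN_DAYS = (14, 7, 3)  # warning thresholds taken from the text above


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days between `now` and a certificate's notAfter string.

    `not_after` uses the format returned by ssl getpeercert(),
    for example "Jun  1 12:00:00 2030 GMT". `now` must be timezone-aware.
    """
    expires_ts = ssl.cert_time_to_seconds(not_after)
    return (expires_ts - now.timestamp()) / 86400


def warning_level(days_left: float) -> Optional[int]:
    """Return the tightest threshold that has been crossed, or None."""
    crossed = [d for d in WARN_DAYS if days_left <= d]
    return min(crossed) if crossed else None


def fetch_not_after(host: str, port: int = 443, timeout: float = 10.0) -> str:
    """Open a TLS connection and return the peer certificate's notAfter field."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

In a monitor loop you would call `fetch_not_after` once or twice a day per host and alert only when `warning_level` returns a new, smaller value.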
A 200 status code does not guarantee your page is working. Your API might return {"error": "database connection failed"} with a 200 status. Content validation checks that the response body contains expected strings, matches a JSON schema, or does not contain error patterns.
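As a rough sketch, content validation can be a pure function over the response body. The error patterns, the `status` key, and the `healthy` value below are assumptions chosen for illustration; adapt them to whatever your health endpoint returns.

```python
import json

# Illustrative patterns only; extend with the error strings your stack emits
ERROR_PATTERNS = ("database connection failed", "internal server error")


def validate_body(body: str, required_key: str = "status",
                  expected: str = "healthy") -> list[str]:
    """Return a list of problems found in the body (empty list means OK)."""
    problems = []
    lowered = body.lower()
    for pattern in ERROR_PATTERNS:
        if pattern in lowered:
            problems.append(f"error pattern found: {pattern!r}")
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        problems.append("body is not valid JSON")
        return problems
    if not isinstance(payload, dict):
        problems.append("JSON body is not an object")
    elif "error" in payload:
        problems.append(f"error field present: {payload['error']!r}")
    elif payload.get(required_key) != expected:
        problems.append(f"{required_key!r} is not {expected!r}")
    return problems
```

A check that returns 200 but a non-empty problem list should be treated as a failure by the rest of the monitor.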
When something feels slow, you need to know where the time went. A full check decomposes response time into DNS lookup, TCP connection, TLS handshake, time-to-first-byte (TTFB), and content transfer. This breakdown turns "it's slow" into "DNS is taking 400ms because your resolver is overloaded."
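The breakdown can be approximated with plain sockets. This is a simplified sketch: it times DNS, TCP connect, TLS handshake, and time-to-first-byte, and leaves out content transfer. The function names are hypothetical, and `measure_phases` assumes an HTTPS endpoint that answers `GET /`.

```python
import socket
import ssl
import time


def measure_phases(host: str, port: int = 443, timeout: float = 10.0) -> dict:
    """Time DNS, TCP connect, TLS handshake, and TTFB separately (seconds)."""
    t0 = time.monotonic()
    family, socktype, proto, _, addr = socket.getaddrinfo(
        host, port, type=socket.SOCK_STREAM
    )[0]
    t1 = time.monotonic()  # DNS lookup done

    sock = socket.socket(family, socktype, proto)
    sock.settimeout(timeout)
    try:
        sock.connect(addr)
        t2 = time.monotonic()  # TCP connection established
        context = ssl.create_default_context()
        # wrap_socket performs the TLS handshake on an already connected socket
        with context.wrap_socket(sock, server_hostname=host) as tls:
            t3 = time.monotonic()  # TLS handshake done
            request = f"GET / HTTP/1.1\r\nHost: {host}\r\nConnection: close\r\n\r\n"
            tls.sendall(request.encode("ascii"))
            tls.recv(1)  # blocks until the first response byte arrives
            t4 = time.monotonic()
    finally:
        sock.close()
    return {"dns": t1 - t0, "tcp": t2 - t1, "tls": t3 - t2, "ttfb": t4 - t3}


def slowest_phase(timings: dict) -> str:
    """Name of the phase that took the longest."""
    return max(timings, key=timings.get)
```

Storing the four numbers per check, rather than only the total, is what lets you see later which phase moved.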
False alerts destroy trust faster than missed alerts. According to PagerDuty's State of Digital Operations report, teams experiencing high alert noise have 2.5x longer mean-time-to-acknowledge for real incidents. When your team ignores alerts because "it's probably nothing," you've already lost.
Your team gets paged at 3 AM for a DNS timeout that resolved itself in 2 seconds. It happens again Tuesday. And Friday. By the following week, when a real outage hits, the on-call engineer checks their phone, mutters "probably another false alarm," and goes back to sleep. Teams that do not implement confirmation windows see their alert acknowledgment time double within 3 months. The monitors keep working, but the humans stop trusting them.
Most false alerts come from three sources:
None of these are outages. All of them trigger naive monitors.
Three techniques solve 95% of false positive problems:
Retries with backoff. Do not alert on the first failure. Wait 10 seconds, retry. If it fails again, wait 30 seconds, retry once more. Three consecutive failures spread over roughly 70 seconds (the two waits plus the request timeouts) almost certainly indicate a real problem.
Multi-location checks. If your monitor in Frankfurt sees a failure but your monitors in Virginia and Singapore do not, it is a routing problem, not an outage. Require 2 of 3 locations to confirm before alerting.
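The 2-of-3 rule reduces to a small quorum function. A sketch, assuming each location reports a boolean "failed" flag; the location names are just examples.

```python
def confirmed_down(failed_by_location: dict[str, bool], quorum: int = 2) -> bool:
    """True when at least `quorum` locations saw the check fail."""
    failures = sum(1 for failed in failed_by_location.values() if failed)
    return failures >= quorum


# Frankfurt fails alone: treated as a routing problem, no alert
print(confirmed_down({"frankfurt": True, "virginia": False, "singapore": False}))
# Two locations agree: treated as an outage
print(confirmed_down({"frankfurt": True, "virginia": True, "singapore": False}))
```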
Confirmation windows. Define a time window — say, 2 minutes — during which consecutive checks must fail before the state transitions to DOWN. A single failed check starts the clock. A successful check resets it.
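A confirmation window needs only one piece of state: when the current run of failures started. The class below is a minimal sketch with a hypothetical name; the caller passes in the current time so the logic stays easy to test.

```python
from typing import Optional


class ConfirmationWindow:
    """Confirms an outage only after failures persist for `window_seconds`."""

    def __init__(self, window_seconds: float = 120.0):
        self.window_seconds = window_seconds
        self.first_failure_at: Optional[float] = None

    def observe(self, ok: bool, now: float) -> bool:
        """Record one check result; return True once the outage is confirmed."""
        if ok:
            self.first_failure_at = None  # a success resets the clock
            return False
        if self.first_failure_at is None:
            self.first_failure_at = now  # the first failure starts the clock
        return now - self.first_failure_at >= self.window_seconds
```

With 60-second checks and a 2-minute window, a failure at t=0 is confirmed by the failing check at t=120, provided no check in between succeeded.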
A complete uptime monitoring system has five components. According to the CNCF, Prometheus is used by 90% of CNCF member organizations. But you do not need Prometheus's complexity for uptime checks. The architecture is simpler:
```text
                +------------------+
                |    Scheduler     |
                | (cron / interval)|
                +--------+---------+
                         |
          +--------------+--------------+
          |              |              |
    +-----v----+   +-----v----+   +-----v----+
    | Checker  |   | Checker  |   | Checker  |
    | (US-East)|   | (EU-West)|   | (AP-SE)  |
    +-----+----+   +-----+----+   +-----+----+
          |              |              |
          +--------------+--------------+
                         |
                +--------v---------+
                |  State Machine   |
                | UP/DEGRADED/DOWN |
                +--------+---------+
                         |
          +--------------+--------------+
          |                             |
+---------v----------+      +-----------v--------+
| Time-Series Store  |      |  Alert Dispatcher  |
| (PostgreSQL/       |      |  (email, Slack,    |
|  TimescaleDB)      |      |   webhook)         |
+--------------------+      +--------------------+
          |
+---------v----------+
|    Status Page     |
| (public dashboard) |
+--------------------+
```
The scheduler decides when to run each check. The simplest implementation is a loop with a sleep interval. More sophisticated schedulers spread checks across the interval to avoid thundering herd problems — if you have 100 monitors at 60-second intervals, you do not want all 100 firing at :00.
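Two simple ways to spread checks across the interval, sketched with hypothetical helper names: an even split when you know all monitors up front, and a hash-based offset when monitors come and go.

```python
import zlib


def evenly_spread(count: int, interval: float = 60.0) -> list[float]:
    """Start offsets (seconds) that space `count` monitors evenly over the interval."""
    if count <= 0:
        return []
    return [i * interval / count for i in range(count)]


def start_offset(monitor_id: str, interval: int = 60) -> int:
    """Deterministic offset in [0, interval) derived from the monitor's id.

    Stable across restarts, so adding a monitor does not shift the others.
    """
    return zlib.crc32(monitor_id.encode("utf-8")) % interval
```

With 100 monitors at a 60-second interval, `evenly_spread` starts one check every 0.6 seconds instead of firing all 100 at once.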
The checker actually makes the request. It needs configurable timeouts (connection timeout, read timeout, total timeout), support for following or rejecting redirects, custom headers for authenticated endpoints, and the ability to validate response content.
Every check produces a data point: timestamp, response time, status code, and any error message. You need to store this efficiently for trend analysis and SLA calculations. PostgreSQL with TimescaleDB works well. So does InfluxDB. For small installations, even SQLite handles the write volume.
When the state machine transitions from UP to DOWN, the dispatcher sends notifications. The key architectural decision is separating detection from notification — the checker detects, the state machine decides, and the dispatcher notifies. This separation lets you add notification channels without touching detection logic.
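One way to keep that separation, as a minimal sketch: the dispatcher knows nothing about HTTP checks and only receives the old and new state. The `AlertDispatcher` class and the channel names are hypothetical; a channel is any callable that takes a message string.

```python
from typing import Callable

Channel = Callable[[str], None]


class AlertDispatcher:
    """Fans a state transition out to every registered channel."""

    def __init__(self) -> None:
        self._channels: dict[str, Channel] = {}

    def register(self, name: str, send: Channel) -> None:
        self._channels[name] = send

    def dispatch(self, old_state: str, new_state: str, url: str) -> list[str]:
        """Notify all channels on a transition; return the names that succeeded."""
        if old_state == new_state:
            return []  # no transition, no notification
        message = f"{url}: {old_state} -> {new_state}"
        delivered = []
        for name, send in self._channels.items():
            try:
                send(message)
                delivered.append(name)
            except Exception:
                # One failing channel must not block the others
                continue
        return delivered
```

Adding a new channel is then a single `register` call, for example `dispatcher.register("stdout", print)`, with no change to the checker or the state machine.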
The status page is a public-facing page showing current status and historical uptime. StatusPage.io (now part of Atlassian) charges separately for this capability. You can generate one from the same time-series data your monitors produce.
Building a functional uptime monitor takes about 100 lines of code. The key is not the HTTP request — it is the state machine. Here is a Python implementation that covers the essentials:
```python
import time
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum

import httpx


class State(Enum):
    UP = "up"
    DEGRADED = "degraded"
    DOWN = "down"


@dataclass
class Monitor:
    url: str
    interval: int = 60  # seconds between checks
    timeout: int = 10  # seconds per request
    retries: int = 3
    degraded_threshold: float = 2.0  # seconds
    state: State = State.UP
    consecutive_failures: int = 0
    last_response_time: float = 0.0

    def check(self) -> dict:
        """Run a single health check with retry logic."""
        for attempt in range(self.retries):
            try:
                start = time.monotonic()
                response = httpx.get(
                    self.url,
                    timeout=self.timeout,
                    follow_redirects=True,
                )
                elapsed = time.monotonic() - start

                if response.status_code < 500:
                    # Reachable: choose UP or DEGRADED based on latency
                    self.consecutive_failures = 0
                    self.last_response_time = elapsed
                    old_state = self.state
                    if elapsed > self.degraded_threshold:
                        self.state = State.DEGRADED
                    else:
                        self.state = State.UP
                    return {
                        "status": response.status_code,
                        "response_time": elapsed,
                        "state": self.state.value,
                        "changed": old_state != self.state,
                        "timestamp": datetime.now(timezone.utc).isoformat(),
                    }
            except httpx.RequestError:
                # Timeouts, connection errors, and other transport failures
                pass

            # 5xx response or transport error: count it, then back off
            self.consecutive_failures += 1
            if attempt < self.retries - 1:
                time.sleep(2 ** attempt)  # Exponential backoff: 1s, 2s, ...

        # All retries failed
        old_state = self.state
        self.state = State.DOWN
        return {
            "status": 0,
            "response_time": -1,
            "state": self.state.value,
            "changed": old_state != self.state,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        }
```
Alert only when state changes, not on every failed check. This single rule eliminates most alert fatigue:
```python
def run_loop(monitor: Monitor, alert_fn):
    """Main monitoring loop."""
    while True:
        result = monitor.check()

        # Store result (PostgreSQL, SQLite, etc.).
        # store_result is a placeholder for your own persistence function.
        store_result(monitor.url, result)

        # Alert only on transitions
        if result["changed"]:
            if monitor.state == State.DOWN:
                alert_fn(f"DOWN: {monitor.url} is unreachable "
                         f"after {monitor.retries} retries")
            elif monitor.state == State.DEGRADED:
                alert_fn(f"SLOW: {monitor.url} response time "
                         f"{result['response_time']:.2f}s")
            elif monitor.state == State.UP:
                alert_fn(f"RECOVERED: {monitor.url} is back up "
                         f"(response: {result['response_time']:.2f}s)")

        time.sleep(monitor.interval)
```
For SLA calculations and trend analysis, store every check result:
```sql
CREATE TABLE uptime_checks (
    id BIGSERIAL PRIMARY KEY,
    url TEXT NOT NULL,
    status_code INT,
    response_time_ms FLOAT,
    state VARCHAR(10),
    error_message TEXT,
    checked_at TIMESTAMPTZ DEFAULT NOW()
);

-- Calculate uptime percentage for the last 30 days
SELECT
    url,
    COUNT(*) FILTER (WHERE state = 'up') * 100.0 / COUNT(*) AS uptime_pct
FROM uptime_checks
WHERE checked_at > NOW() - INTERVAL '30 days'
GROUP BY url;
```
This state machine approach (UP to DEGRADED to DOWN, with alerts firing only on transitions) is the same pattern used internally by Temps, PagerDuty, and most commercial monitoring tools. The difference is how many retries, and how many locations, must confirm a failure before the state transitions.
HTTP checks tell you if your application is responding. They do not tell you if your server is about to run out of disk space. According to Datadog's Container Report, 65% of container failures are caused by resource exhaustion — CPU, memory, or disk — not application errors.
On Linux, the /proc filesystem gives you everything. /proc/stat provides CPU tick counters. /proc/meminfo shows memory usage:
```bash
# CPU usage (percentage over 1 second)
grep 'cpu ' /proc/stat; sleep 1; grep 'cpu ' /proc/stat
# Compare the idle field values to calculate usage

# Memory usage
awk '/MemTotal|MemAvailable/ {print $1, $2/1024 "MB"}' /proc/meminfo

# Disk usage
df -h / | tail -1 | awk '{print "Used:", $3, "of", $2, "(" $5 ")"}'
```
Set thresholds: alert at 85% CPU sustained for 5 minutes, 90% memory, or 85% disk. These numbers leave enough headroom for traffic spikes without being so sensitive they fire on normal load patterns.
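Turning two `/proc/stat` samples into a percentage, and checking that a threshold is sustained, can be sketched as below. The function names are hypothetical. The parser assumes the standard Linux field order on the aggregate `cpu` line (user, nice, system, idle, iowait, irq, softirq, steal) and counts iowait as idle time.

```python
def parse_cpu_line(line: str) -> tuple[int, int]:
    """Return (idle_ticks, total_ticks) from the aggregate 'cpu' line."""
    fields = [int(value) for value in line.split()[1:]]
    idle = fields[3] + (fields[4] if len(fields) > 4 else 0)  # idle + iowait
    # Guest time is already included in user and nice, so sum only the first 8
    total = sum(fields[:8])
    return idle, total


def cpu_usage_pct(before: str, after: str) -> float:
    """CPU usage between two samples of the 'cpu' line, as a percentage."""
    idle0, total0 = parse_cpu_line(before)
    idle1, total1 = parse_cpu_line(after)
    total_delta = total1 - total0
    if total_delta <= 0:
        return 0.0
    return 100.0 * (1 - (idle1 - idle0) / total_delta)


def sustained_breach(samples: list[float], threshold: float = 85.0) -> bool:
    """True only if every sample in the window is at or above the threshold."""
    return bool(samples) and all(sample >= threshold for sample in samples)
```

Sampling once a minute and passing the last five values to `sustained_breach` gives the "85% CPU sustained for 5 minutes" rule.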
If you are running containers, the Docker stats API gives you per-container resource usage without installing anything extra:
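```bash
# Real-time container stats
docker stats --no-stream --format \
  "{{.Name}}: CPU {{.CPUPerc}} | Mem {{.MemUsage}} | Net {{.NetIO}}"

# Programmatic access via API (replace <container-id> with a container ID or name;
# the URL is quoted so the shell does not interpret the "?")
curl -s --unix-socket /var/run/docker.sock \
  "http://localhost/containers/<container-id>/stats?stream=false" | jq '.cpu_stats'
```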
What makes container monitoring tricky is density. A server running 20 containers has 20 sets of metrics. You need to track not just absolute values but also resource limits — a container using 512MB of memory matters very differently depending on whether its limit is 1GB or 64GB.
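Expressing usage relative to the limit is a one-line calculation once you have the stats payload. This sketch assumes the JSON from the Docker stats endpoint has been parsed into a dict and reads its `memory_stats.usage` and `memory_stats.limit` fields; it uses the raw usage figure and skips the cache adjustment that the `docker stats` CLI applies.

```python
from typing import Optional


def memory_pct_of_limit(stats: dict) -> Optional[float]:
    """Memory usage as a percentage of the container's limit, or None if unknown."""
    memory = stats.get("memory_stats", {})
    usage = memory.get("usage")
    limit = memory.get("limit")
    if usage is None or not limit:
        return None  # stopped container or missing limit information
    return 100.0 * usage / limit


MIB = 1024 ** 2
GIB = 1024 ** 3
# The same 512 MiB reads very differently against different limits
print(memory_pct_of_limit({"memory_stats": {"usage": 512 * MIB, "limit": 1 * GIB}}))
print(memory_pct_of_limit({"memory_stats": {"usage": 512 * MIB, "limit": 64 * GIB}}))
```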
Disk I/O bottlenecks are silent killers. Your HTTP checks pass fine until a database query triggers heavy disk reads and suddenly everything hangs. Monitor IOPS (input/output operations per second) and throughput separately — high IOPS with low throughput means lots of small random reads, which kills spinning disks and degrades SSDs under heavy load.
The self-hosted monitoring space has matured significantly. Uptime Kuma has over 63,000 stars on GitHub and receives active contributions from hundreds of developers. Here is how the main options compare:
| Tool | Focus | Language | GitHub Stars | Self-Hosted | Multi-Location | Alerting |
|---|---|---|---|---|---|---|
| Uptime Kuma | HTTP/TCP/DNS monitoring | Node.js | 63k+ | Yes | No (single instance) | 90+ channels |
| Gatus | Developer-centric monitoring | Go | 6.5k+ | Yes | Limited | Slack, Teams, email |
| Healthchecks.io | Cron/heartbeat monitoring | Python | 8.5k+ | Yes | No | 15+ channels |
| Prometheus + Blackbox | Metrics-driven probing | Go | 56k+ (Prometheus) | Yes | Via federation | Alertmanager |
| Temps | Deployment-integrated monitoring | Rust | — | Yes (free) | Planned | Email, Slack, webhook |
Uptime Kuma is the easiest to set up. One Docker container, a SQLite database, and a clean web UI. It supports HTTP, TCP, DNS, and ping monitors with configurable intervals down to 20 seconds. The notification integrations are extensive — Slack, Discord, Telegram, PagerDuty, and about 90 others.
The limitation is scale. Uptime Kuma runs as a single Node.js process. There is no built-in multi-location checking, no clustering, and performance degrades beyond a few hundred monitors. For a single team monitoring 10–50 services, it is excellent. For anything larger, you will hit walls.
Gatus takes a code-first approach. You define monitors in YAML, version them in Git, and deploy with your infrastructure. It is written in Go, compiles to a single binary, and uses minimal resources. Where Uptime Kuma is point-and-click, Gatus is Infrastructure-as-Code.
```yaml
endpoints:
  - name: production-api
    url: https://api.example.com/health
    interval: 30s
    conditions:
      - "[STATUS] == 200"
      - "[RESPONSE_TIME] < 500"
      - "[BODY].status == 'healthy'"
```
If you are already running Prometheus, the Blackbox Exporter adds HTTP, TCP, ICMP, and DNS probing. It fits naturally into the Prometheus ecosystem — you get Alertmanager for notifications and Grafana for dashboards. The downside is complexity: Prometheus, Blackbox Exporter, Alertmanager, and Grafana are four separate services to deploy and maintain.
Healthchecks.io solves a different problem: cron job monitoring. Instead of actively probing endpoints, it waits for your services to "phone home" at expected intervals. If a heartbeat does not arrive on time, it alerts. It is perfect for backup scripts, data pipelines, and scheduled tasks that you would otherwise forget about until they have been silently failing for weeks.
Temps builds monitoring into the deployment platform instead of bolting it on afterward. The temps-status-page crate is a first-class plugin that runs alongside every deployment — no separate tool, no extra infrastructure, no additional configuration step.
When you deploy a new environment on Temps, the platform automatically creates an HTTP monitor for it. This happens via an EnvironmentCreated job that the StatusPagePlugin processes — you do not configure monitoring separately, it just exists from the moment your first deployment lands.
Each monitor checks at a 60-second default interval (configurable), tracks response_time_ms, and records status as operational, degraded, or down.
The HealthCheckService performs HTTP checks with retry-and-confirm logic — the same pattern described in the state machine section above. Failed checks are retried before any state transition fires. You can configure a custom check_path per monitor so Temps checks your /health or /api/status endpoint rather than the root path.
Temps respects the distinction between degraded (slow but responding) and down (unreachable), so you get actionable alerts rather than binary up/down noise.
Temps ships a status page plugin (temps-status-page) that serves a public dashboard showing current monitor status, historical uptime, and active incidents. This is the same functionality that services like StatusPage.io charge separately for. Because it reads from the same TimescaleDB tables that store your check results, historical data is available immediately without any additional setup.
The status page includes an IncidentService for creating and managing incidents, and an OutageDetectionService that bridges detected outages to the alarm system automatically.
Alerts route through configurable channels. Temps has verified implementations for:
- EmailProvider, SMTP-based transactional delivery
- SlackProvider, webhook-based channel notifications
- WebhookProvider, generic HTTP POST to any endpoint

Because monitoring is built into the platform, alerts include deployment context: which service, which version, which node. That context turns a "your site is down" notification into an actionable incident response starting point.
Beyond HTTP checks, Temps also has monitoring_alert_rules for container and database metrics. You can define rules like "alert when active Postgres connections exceed 80% of max" or "alert when container memory usage exceeds 90% of limit" — with configurable thresholds, comparators, and silence windows. Alerts fire only after a breach persists for a configurable duration (for_duration_secs), eliminating transient spikes.
Most monitoring tools exist as standalone services because they were built for a world where deployment, observability, and alerting were separate concerns. The most common failure mode in monitoring setups is the monitoring tool itself going unmonitored. Integrating monitoring into the deployment platform eliminates this problem — Temps monitors your services and its own health checks run as part of the same binary.
Temps is free to self-host (Apache 2.0). Temps Cloud costs approximately $6/month (Hetzner infrastructure cost + 30% margin), with no per-seat fees and no separate monitoring subscription. The same binary that handles your git-push deployments, the reverse proxy (based on Pingora, the proxy framework built by Cloudflare), and WireGuard mesh networking also runs your uptime monitors and status page.
SLA math is simpler than vendors make it sound. The basic formula is: (total_minutes - downtime_minutes) / total_minutes * 100. According to the Uptime Institute's Annual Outage Analysis, the average data center outage lasts 100 minutes. Here is what the common SLA tiers mean in practice:
| SLA Level | Annual Downtime | Monthly Downtime | Daily Downtime |
|---|---|---|---|
| 99% | 3.65 days | 7.31 hours | 14.4 minutes |
| 99.9% | 8.77 hours | 43.8 minutes | 1.44 minutes |
| 99.95% | 4.38 hours | 21.9 minutes | 43.2 seconds |
| 99.99% | 52.6 minutes | 4.38 minutes | 8.6 seconds |
| 99.999% | 5.26 minutes | 26.3 seconds | 0.86 seconds |
99.999% uptime allows 5.26 minutes of downtime per year. That includes planned maintenance, DNS propagation, certificate renewals, and deployments. Google and AWS do not consistently achieve this across all services. For most web applications, 99.9% (43.8 minutes of monthly downtime) is an honest and achievable target.
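The formula and the table values can be reproduced in a few lines. A sketch with hypothetical function names; the annual figures assume a 365.25-day year, which matches the hour and minute values in the table.

```python
def uptime_pct(total_minutes: float, downtime_minutes: float) -> float:
    """(total_minutes - downtime_minutes) / total_minutes * 100."""
    return (total_minutes - downtime_minutes) / total_minutes * 100


def allowed_downtime_minutes(sla_pct: float, period_minutes: float) -> float:
    """Downtime budget, in minutes, for a given SLA over a given period."""
    return period_minutes * (1 - sla_pct / 100)


MINUTES_PER_YEAR = 365.25 * 24 * 60
MINUTES_PER_MONTH = MINUTES_PER_YEAR / 12

print(round(allowed_downtime_minutes(99.9, MINUTES_PER_MONTH), 1))    # 43.8
print(round(allowed_downtime_minutes(99.999, MINUTES_PER_YEAR), 2))   # 5.26
```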
Do maintenance windows count against your SLA? Most commercial SLA agreements exclude "scheduled maintenance" from uptime calculations. If you are calculating SLA for your own services, be explicit about this upfront:
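```sql
-- Assumes uptime_checks has an additional boolean column, is_maintenance_window.
-- Maintenance rows are excluded from both the numerator and the denominator.
SELECT
    COUNT(*) FILTER (WHERE state = 'up') * 100.0 /
        NULLIF(COUNT(*), 0) AS uptime_pct
FROM uptime_checks
WHERE checked_at > NOW() - INTERVAL '30 days'
  AND NOT is_maintenance_window;
```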
Do not cheat the number. If your users experienced downtime, it was downtime. Maintenance exclusions should cover pre-announced, user-notified windows only.
For production services, 30–60 second intervals strike the right balance between fast detection and low overhead. UptimeRobot's free tier uses 5-minute intervals, which means up to 5 minutes of undetected downtime. If you are self-hosting, there is no cost difference between 30-second and 5-minute checks — go with 30 seconds. Temps defaults to 60-second intervals.
Uptime monitoring answers one question: is the service reachable and responding? Application Performance Monitoring (APM) goes deeper — it traces individual requests through your code, identifies slow database queries, and profiles memory leaks. According to Gartner, the APM market reached $6.4 billion in 2024. You need uptime monitoring first, APM second. They are complementary, not competing.
Yes, but it requires running your own checker nodes in different regions. Deploy lightweight checker containers in 2–3 cloud regions (a $5/mo VPS in each region works fine), have them report results to a central coordinator, and require 2-of-3 confirmation before alerting. Uptime Kuma does not support this natively, but Gatus and custom solutions handle it well.
The simplest solution: use a second, independent monitoring system to watch the first. A free-tier UptimeRobot check against your self-hosted monitor's status page costs nothing and catches the scenario where your monitor goes down silently. Alternatively, use a heartbeat pattern — your monitor sends a "still alive" ping to an external service like Healthchecks.io every minute. If the ping stops, you get alerted through a completely separate channel.
The temps-status-page crate ships as a compiled plugin in every Temps binary release. Monitors are automatically created for each environment, alerts route through email, Slack, and webhook channels, and the public status page serves from the same TimescaleDB instance used for all other observability data. It is the same binary used in Temps Cloud deployments.
Uptime monitoring tells you that something broke, but not why. Pair it with error tracking to capture stack traces, request tracing to follow failures through your system, and resource monitoring to spot exhaustion before it causes outages.
If you are self-hosting your deployment infrastructure, the strongest approach is a platform that integrates monitoring alongside deployments, logs, and analytics. You avoid wiring together 5 separate tools — and you avoid the SaaS subscription overhead of Pingdom + Datadog + Sentry + Plausible running in parallel.
Temps includes HTTP monitoring, a built-in status page, container resource tracking, error tracking, and web analytics in a single self-hosted binary. Free to self-host under Apache 2.0. Temps Cloud starts at approximately $6/month on Hetzner infrastructure with no per-seat pricing.