Written by Temps Team
Last updated March 12, 2026
Your app calls OpenAI for chat, Anthropic for code review, and Gemini for document summarization. That's three SDKs, three billing dashboards, three sets of rate limit logic, and three separate auth flows. When you need to add cost tracking, failover, or swap a model, you're patching the same wrapper code across every service.
This isn't a scaling problem. It's an architecture problem. And it has a well-known solution: an AI gateway.
Enterprise AI spending hit an average of $85,521 per month in 2025, according to CloudZero's State of AI Costs report. Yet most teams can't attribute that spend to specific features, users, or even projects. The multi-provider mess isn't just annoying — it's expensive and invisible.
This guide walks through what an AI gateway is, why the pattern matters even if you're using a single provider, and how to build one yourself. Then we'll look at how Temps ships one out of the box so you don't have to.
TL;DR: An AI gateway is a reverse proxy between your app and LLM providers. It normalizes APIs, tracks costs, handles failover, and eliminates SDK sprawl. According to CloudZero, average enterprise AI spend is $85,521/month, and most teams can't tell you where the money goes. Build one yourself or use Temps, which includes one in the same binary that runs your deployments.
An AI gateway is a reverse proxy that sits between your application code and LLM providers. According to Gartner, over 50% of enterprises will deploy AI gateways by 2027 — up from fewer than 5% in 2023. The pattern centralizes authentication, request routing, usage tracking, and response normalization into a single layer.
Think of it like an API gateway (Kong, Envoy, or AWS API Gateway) but purpose-built for LLM traffic. A traditional API gateway routes HTTP requests. An AI gateway understands tokens, models, streaming chunks, and cost-per-million-token pricing.
Here's the basic flow:
┌─────────────┐     ┌──────────────────┐     ┌───────────────┐
│             │     │                  │     │   OpenAI      │
│  Your App   │────▶│   AI Gateway     │────▶│   Anthropic   │
│  (one SDK)  │◀────│                  │◀────│   Gemini      │
│             │     │  - Auth          │     │   xAI         │
└─────────────┘     │  - Route         │     └───────────────┘
                    │  - Translate     │
                    │  - Log           │
                    │  - Cache         │
                    └──────────────────┘
Your app sends every request in one format — typically OpenAI's /chat/completions schema. The gateway inspects the model field, routes to the correct provider, translates the request into that provider's native API format, translates the response back, and logs everything along the way.
| Concern | API Gateway | AI Gateway |
|---|---|---|
| Routing | URL path or header | Model name |
| Rate limiting | Requests per second | Tokens per minute |
| Cost tracking | Not built in | Per-token pricing by model |
| Response format | Pass-through | Normalize provider-specific schemas |
| Streaming | Standard HTTP streaming | SSE chunk translation between formats |
| Auth | One upstream per route | Multiple provider keys per route |
That last row matters. A single /chat/completions endpoint might fan out to four different providers, each with its own API key, rate limit, and error format. A regular API gateway doesn't handle that.
Even teams using just OpenAI benefit from a gateway layer. Organizations using AI gateways reduce their mean time to detect cost anomalies by 60%, according to Portkey's 2024 AI Infrastructure Survey. The gateway isn't just about multi-provider routing — it's about operational visibility.
Here's what a gateway gives you beyond multi-provider support:
Without a gateway, you see one number on your OpenAI bill: total spend. With a gateway, every request carries metadata — which user triggered it, which feature it belongs to, which environment it ran in. You can finally answer "how much does our chatbot cost per conversation?"
Provider rate limits protect the provider. They don't protect you from a single user burning through your entire quota. A gateway lets you set limits per user, per team, or per feature — before the request ever hits the provider.
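As a rough illustration of the application-level half, here is a minimal in-memory sliding-window limiter keyed by user ID. The 60-requests-per-minute cap and the user_id parameter are illustrative only, and a real gateway would back this with Redis or another shared store:

import time
from collections import defaultdict, deque

REQUESTS_PER_MINUTE = 60  # illustrative cap, not a recommendation
_windows: dict[str, deque] = defaultdict(deque)

def allow_request(user_id: str) -> bool:
    """Return True if this user is under their rolling one-minute budget."""
    now = time.monotonic()
    window = _windows[user_id]
    # Evict timestamps older than the 60-second window.
    while window and now - window[0] > 60:
        window.popleft()
    if len(window) >= REQUESTS_PER_MINUTE:
        return False
    window.append(now)
    return True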
If 500 users ask "what's your refund policy?" within an hour, that's 500 identical API calls. A gateway can cache responses by prompt hash and serve identical answers from cache. Some teams report 30-40% cache hit rates on customer-facing chatbots.
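A sketch of that caching idea, assuming an in-process dict and a SHA-256 hash of the canonicalized request body; a real deployment would use Redis with a TTL and skip caching for high-temperature requests:

import hashlib
import json

_response_cache: dict[str, dict] = {}

def prompt_cache_key(body: dict) -> str:
    # Identical model + messages + sampling params produce the same key.
    canonical = json.dumps(body, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

def cached_response(body: dict) -> dict | None:
    return _response_cache.get(prompt_cache_key(body))

def store_response(body: dict, response: dict) -> None:
    _response_cache[prompt_cache_key(body)] = response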
SOC 2 and HIPAA require logging access to sensitive data. If your app sends customer data to an LLM, you need an immutable record of what was sent, to which model, and when. A gateway creates this audit trail automatically.
Rotating an API key should be a config change, not a deployment. With a gateway, you update the key in one place. Every service that routes through the gateway picks up the new key immediately. No redeployment, no downtime.
OpenAI has had 12 significant outages in the past year. When your primary provider goes down, a gateway can automatically retry the request with a fallback model — say, Anthropic Claude when GPT-4.1 is unavailable. Your users never notice.
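A hedged sketch of that failover path, with an illustrative fallback map and the provider call injected as a callable so the snippet stays self-contained; the error type and model pairing are assumptions, not Temps behavior:

from typing import Awaitable, Callable

class ProviderUnavailableError(Exception):
    """Raised when the upstream provider times out or returns a 5xx/429."""

FALLBACKS = {"gpt-4.1": "claude-sonnet-4-6"}  # illustrative primary-to-substitute pairing

async def complete_with_failover(
    body: dict,
    call_provider: Callable[[dict], Awaitable[dict]],
) -> dict:
    """Try the requested model; on provider failure, retry once with its fallback."""
    try:
        return await call_provider(body)
    except ProviderUnavailableError:
        fallback = FALLBACKS.get(body["model"])
        if fallback is None:
            raise
        return await call_provider({**body, "model": fallback})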
OpenAI's /chat/completions format has become the de facto standard for LLM APIs. Over 75% of AI developer tools now support the OpenAI API format natively, according to a16z's AI Infrastructure report. This convergence makes the "OpenAI-compatible" pattern the smartest foundation for any AI gateway.
Here's the core schema every developer already knows:
{
"model": "gpt-4.1",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain TCP vs UDP."}
],
"temperature": 0.7,
"max_tokens": 1024,
"stream": true
}
Anthropic, Google, xAI, Mistral, and dozens of open-source model hosts all have different native formats. But if your gateway accepts this one schema and translates under the hood, all existing code works. Every OpenAI SDK in Python, TypeScript, Go, or Rust connects by changing a single line: the base URL.
That's the key insight. Don't invent a new API format. Ride the ecosystem.
You could. But consider what happens at scale:
- Each Python service installs the openai and anthropic SDKs
- Each TypeScript service installs @anthropic-ai/sdk and @google/generative-ai

Now multiply by 10 services. That's 20-40 SDK dependencies with different versioning, different auth patterns, and different response shapes. A gateway collapses all of that into one client library and one response format.
Building a minimal AI gateway takes roughly 500-800 lines of code for basic routing, plus another 200-300 for streaming — far more than most teams expect when they start. Here's a walkthrough of the core pieces using Python and FastAPI.
from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse
import httpx
app = FastAPI()
PROVIDER_KEYS = {
"openai": "sk-...",
"anthropic": "sk-ant-...",
"google": "AIza...",
}
def detect_provider(model: str) -> str:
if model.startswith(("gpt-", "o1", "o3", "o4")):
return "openai"
elif model.startswith("claude-"):
return "anthropic"
elif model.startswith("gemini-"):
return "google"
raise ValueError(f"Unknown model: {model}")
@app.post("/v1/chat/completions")
async def chat_completions(request: Request):
body = await request.json()
provider = detect_provider(body["model"])
if body.get("stream"):
return StreamingResponse(
stream_response(provider, body),
media_type="text/event-stream",
)
return await sync_response(provider, body)
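The sync_response helper referenced above isn't shown in this walkthrough. Here is a minimal sketch for the OpenAI case, where the payload needs no translation; the URL map and 120-second timeout are illustrative, and error handling is omitted:

PROVIDER_URLS = {
    "openai": "https://api.openai.com/v1/chat/completions",
    # Anthropic and Google need translated payloads (see the next section).
}

async def sync_response(provider: str, body: dict) -> dict:
    # Forward the request with the provider's key and return the JSON response.
    async with httpx.AsyncClient(timeout=120) as client:
        resp = await client.post(
            PROVIDER_URLS[provider],
            headers={"Authorization": f"Bearer {PROVIDER_KEYS[provider]}"},
            json=body,
        )
        resp.raise_for_status()
        return resp.json()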
That's the skeleton. Now comes the hard part.
OpenAI and Anthropic have fundamentally different request schemas. Here's the translation:
def translate_to_anthropic(body: dict) -> dict:
"""Convert OpenAI chat format to Anthropic Messages API."""
messages = body["messages"]
system = None
converted = []
for msg in messages:
if msg["role"] == "system":
system = msg["content"]
else:
converted.append({
"role": msg["role"],
"content": msg["content"],
})
payload = {
"model": body["model"],
"messages": converted,
"max_tokens": body.get("max_tokens", 4096),
}
if system:
payload["system"] = system
if body.get("temperature") is not None:
payload["temperature"] = body["temperature"]
if body.get("stream"):
payload["stream"] = True
return payload
Notice the differences. Anthropic pulls system out of the messages array and into a top-level field. max_tokens is required (not optional). Temperature ranges differ. Tool calling schemas are different. And we haven't even touched Gemini's format yet, which uses contents instead of messages and has its own parts structure.
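For comparison, a rough sketch of the Gemini direction, following the shape of Google's generateContent request: system messages move to systemInstruction, the assistant role becomes "model", and sampling knobs live under generationConfig. Tool calls and multimodal parts are ignored here:

def translate_to_gemini(body: dict) -> dict:
    """Convert OpenAI chat format to a Gemini generateContent payload (sketch)."""
    contents = []
    system_parts = []
    for msg in body["messages"]:
        if msg["role"] == "system":
            system_parts.append({"text": msg["content"]})
        else:
            contents.append({
                # Gemini names the assistant role "model".
                "role": "model" if msg["role"] == "assistant" else "user",
                "parts": [{"text": msg["content"]}],
            })
    payload = {"contents": contents}
    if system_parts:
        payload["systemInstruction"] = {"parts": system_parts}
    generation_config = {}
    if body.get("temperature") is not None:
        generation_config["temperature"] = body["temperature"]
    if body.get("max_tokens") is not None:
        generation_config["maxOutputTokens"] = body["max_tokens"]
    if generation_config:
        payload["generationConfig"] = generation_config
    return payload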
Streaming is where the complexity explodes. Each provider uses Server-Sent Events, but the chunk format is completely different.
OpenAI sends:
data: {"choices":[{"delta":{"content":"Hello"}}]}
Anthropic sends:
event: content_block_delta
data: {"type":"content_block_delta","delta":{"type":"text_delta","text":"Hello"}}
You need to parse each provider's SSE stream and normalize the chunks back to OpenAI's delta format. Here's a simplified version for Anthropic:
import json

async def normalize_anthropic_stream(response):
    async for line in response.aiter_lines():
        if not line.startswith("data: "):
            continue
        data = json.loads(line[6:])
        if data["type"] == "content_block_delta":
            chunk = {
                "choices": [{
                    "delta": {"content": data["delta"]["text"]},
                    "index": 0,
                }]
            }
            yield f"data: {json.dumps(chunk)}\n\n"
        elif data["type"] == "message_stop":
            yield "data: [DONE]\n\n"
This is simplified. The real implementation needs to handle message_start (for model info and input token counts), content_block_start, tool use blocks, thinking blocks, and error events mid-stream. It also needs to compute output token counts from message_delta usage fields.
MODEL_PRICING = {
"gpt-4.1": {"input": 2.00, "output": 8.00},
"gpt-4.1-mini": {"input": 0.40, "output": 1.60},
"gpt-4.1-nano": {"input": 0.10, "output": 0.40},
"claude-sonnet-4-6": {"input": 3.00, "output": 15.00},
"claude-haiku-4-5": {"input": 0.80, "output": 4.00},
"gemini-2.5-flash": {"input": 0.15, "output": 0.60},
} # Prices per 1M tokens
def calculate_cost(model, input_tokens, output_tokens):
prices = MODEL_PRICING.get(model, {"input": 0, "output": 0})
input_cost = (input_tokens / 1_000_000) * prices["input"]
output_cost = (output_tokens / 1_000_000) * prices["output"]
return round(input_cost + output_cost, 6)
But here's the catch. For streaming responses, you don't get final token counts until the stream ends. You need to accumulate counts across chunks, handle cases where the provider doesn't report them, and fall back to tokenizer-based estimation.
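One way to handle the Anthropic side, sketched as a small accumulator that reads input tokens from the message_start event and the running output count from message_delta events, reusing calculate_cost from above; the tokenizer-based fallback is left out:

class UsageAccumulator:
    """Collects token counts from Anthropic streaming events as they arrive."""

    def __init__(self):
        self.input_tokens = 0
        self.output_tokens = 0

    def observe(self, event: dict) -> None:
        if event["type"] == "message_start":
            self.input_tokens = event["message"]["usage"]["input_tokens"]
        elif event["type"] == "message_delta":
            # Anthropic reports a cumulative output token count on each delta.
            self.output_tokens = event["usage"]["output_tokens"]

    def cost(self, model: str) -> float:
        return calculate_cost(model, self.input_tokens, self.output_tokens)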
What we've built so far handles the happy path. A production-ready gateway also needs retry and failover logic, provider-aware rate limiting, response caching, error-format normalization, audit logging, and secure key management.
You're looking at 2,000-3,000 lines of well-tested code before it's production-ready. Is that worth building when solid open-source options exist?
The open-source AI gateway space has matured quickly. The LLM API management market is projected to reach $3.2 billion by 2027, according to MarketsandMarkets (2024). Here's how the main options compare for self-hosted teams.
| Gateway | Language | Self-Hosted | Streaming | Cost Tracking | Deployment |
|---|---|---|---|---|---|
| LiteLLM | Python | Yes (fully) | Yes | Basic | Separate service + Redis + Postgres |
| Portkey | TypeScript | Limited | Yes | Yes | Cloud-first, self-hosted is enterprise |
| Helicone | TypeScript | Yes | Yes | Yes | Separate service + ClickHouse |
| Kong AI Gateway | Lua/Go | Yes | Yes | Via plugins | Kong cluster + database |
| Cloudflare AI Gateway | N/A | No | Yes | Yes | Cloud-only |
LiteLLM is the most popular open-source option. It supports 100+ models and handles the translation layer well. But it's a Python service that needs its own deployment, its own PostgreSQL database, Redis for caching, and its own monitoring. Startup times are slow, memory usage is high, and you're maintaining yet another service.
Portkey has a great dashboard and analytics. But the self-hosted version requires an enterprise license, and the cloud version means your API requests route through Portkey's servers — a non-starter for teams with data residency requirements.
Helicone focuses on logging and observability rather than acting as a full proxy. The analytics are strong, but it works as a header-based proxy (you add headers to your existing OpenAI calls) rather than a standalone gateway, so you still need provider SDKs in each service.
Kong AI Gateway is enterprise-grade but enterprise-heavy. It requires a full Kong deployment with its own database, configuration management, and operational overhead. According to TrueFoundry, Kong charges over $30 per million requests on their managed tier.
Every option above solves the gateway problem. None of them solve the "another service to deploy and monitor" problem.
Temps includes an AI gateway in the same Rust binary that handles deployments, analytics, and monitoring. No sidecar, no separate process, no additional database. According to Kong, 72% of enterprises plan to increase GenAI spending in 2025, and the teams using Temps don't need to bolt on a separate tool to manage that spend.
The gateway exposes three OpenAI-compatible endpoints:
POST /api/ai/v1/chat/completions → Chat (all providers)
POST /api/ai/v1/embeddings → Embeddings (OpenAI)
GET /api/ai/v1/models → List available models
If your code already uses the OpenAI SDK, you change one line:
import openai
client = openai.OpenAI(
api_key="tk_your_temps_api_key",
base_url="https://your-temps-server.example.com/api/ai/v1",
)
# Routes to Anthropic automatically
response = client.chat.completions.create(
model="claude-sonnet-4-6",
messages=[{"role": "user", "content": "Review this pull request."}],
)
Same SDK. Same types. Same error handling. The only difference is the base URL and API key.
This works identically in TypeScript:
import OpenAI from "openai";
const client = new OpenAI({
apiKey: "tk_your_temps_api_key",
baseURL: "https://your-temps-server.example.com/api/ai/v1",
});
const response = await client.chat.completions.create({
model: "gemini-2.5-flash",
messages: [{ role: "user", content: "Summarize this document." }],
stream: true,
});
for await (const chunk of response) {
process.stdout.write(chunk.choices[0]?.delta?.content || "");
}
| Provider | Models | Streaming | BYOK |
|---|---|---|---|
| Anthropic | Claude Opus 4, Sonnet 4, Haiku 3.5 | Yes | Yes |
| OpenAI | GPT-4.1, o3, o4-mini, GPT-4o, embeddings | Yes | Yes |
| Google | Gemini 2.5 Flash/Pro, 2.0 Flash | Yes | Yes |
| xAI | Grok 3, Grok 3 Mini | Yes | Yes |
Provider keys are encrypted with AES-256-GCM before storage. Or use Bring Your Own Key (BYOK) mode — pass the key per-request via an x-provider-api-key header and it's never stored.
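With the OpenAI Python SDK, that BYOK header can ride along on every call via default_headers; the key values here are placeholders:

import openai

client = openai.OpenAI(
    api_key="tk_your_temps_api_key",
    base_url="https://your-temps-server.example.com/api/ai/v1",
    # BYOK: the provider key travels with the request and is never stored.
    default_headers={"x-provider-api-key": "sk-ant-your-anthropic-key"},
)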
Temps doesn't just proxy requests — it traces them. Every AI call generates OpenTelemetry spans following the GenAI semantic conventions. You see the full conversation in the AI Activity dashboard: system prompt, user messages, assistant responses, tool calls, and thinking blocks.
No additional instrumentation library needed. The gateway produces the traces automatically.
Cost visibility is the primary reason teams adopt AI gateways. According to the Stanford HAI AI Index, AI inference costs are declining 10x per year, but total spend keeps climbing because usage grows faster than prices drop. Knowing where the money goes matters more than the per-token price.
Temps logs every gateway request to a TimescaleDB hypertable with 15 fields of metadata, including the model, provider, token counts, computed cost, and any tags you attach.
Pass tags via headers to slice costs any way you want:
curl https://your-temps.example.com/api/ai/v1/chat/completions \
  -H "Authorization: Bearer tk_your_key" \
  -H "Content-Type: application/json" \
  -H "x-tags: team:platform, feature:code-review, env:production" \
  -d '{
    "model": "claude-sonnet-4-6",
    "messages": [{"role": "user", "content": "Review this diff."}]
  }'
Then query the dashboard: "How much did the code-review feature cost in production last month?" You get an actual number, broken down by model.
The Temps AI analytics dashboard turns that log into spend breakdowns by model, provider, and tag over time.
This is the kind of visibility that standalone billing dashboards from OpenAI or Anthropic simply don't provide. They show you total spend. A gateway shows you why you're spending.
A well-built AI gateway adds under 10ms of overhead per request. The Temps gateway, written in Rust with Axum, typically adds 2-5ms. For context, most LLM API calls take 200-2,000ms depending on the model and token count. The gateway overhead is noise compared to inference time.
Streaming works through the gateway and is the most common mode for chat applications. The gateway translates Server-Sent Events between provider formats — Anthropic's content_block_delta events become OpenAI's delta format transparently. You set "stream": true in your request body and the response streams through exactly like a direct OpenAI call.
An AI gateway gives you two layers of rate limiting. First, application-level limits: cap requests per user, per team, or per feature. Second, provider-aware limits: if OpenAI returns a 429 (rate limited), the gateway can retry with a fallback provider. Temps supports both, plus tag-based rate limiting via the x-tags header.
A load balancer distributes identical requests across identical backends. An AI gateway routes different models to different providers, translates between incompatible API formats, tracks per-token costs, and normalizes streaming chunk formats. They solve fundamentally different problems. You'd put a load balancer in front of multiple AI gateway instances, not use one instead.
You don't need one for a single provider, but you'll wish you had one. Single-provider teams still benefit from cost attribution per feature, key rotation without redeployment, caching of identical prompts, audit logging, and rate limiting at the application level. And when you inevitably add a second provider, you won't need to refactor anything.
The pattern is clear: a reverse proxy between your app and LLM providers eliminates SDK sprawl, centralizes cost tracking, and gives you failover for free. You can build one yourself — the code above is a solid starting point. You can deploy LiteLLM or another open-source option. Or you can use Temps, which bundles the gateway alongside deployments, analytics, and error tracking in one binary.
If you're already running Temps, the AI gateway is built in. Configure your provider keys in the dashboard and start routing. If you're new, installation takes under five minutes:
curl -fsSL temps.sh/install.sh | bash
One binary. One endpoint. Every LLM provider your team needs.