How to Build an AI Gateway That Routes to Multiple LLM Providers

March 12, 2026

Temps Team


Your app calls OpenAI for chat, Anthropic for code review, and Gemini for document summarization. That's three SDKs, three billing dashboards, three sets of rate limit logic, and three separate auth flows. When you need to add cost tracking, failover, or swap a model, you're patching the same wrapper code across every service.

This isn't a scaling problem. It's an architecture problem. And it has a well-known solution: an AI gateway.

Enterprise AI spending hit an average of $85,521 per month in 2025, according to CloudZero's State of AI Costs report. Yet most teams can't attribute that spend to specific features, users, or even projects. The multi-provider mess isn't just annoying — it's expensive and invisible.

This guide walks through what an AI gateway is, why the pattern matters even if you're using a single provider, and how to build one yourself. Then we'll look at how Temps ships one out of the box so you don't have to.

[INTERNAL-LINK: what Temps replaces -> /blog/introducing-temps-vercel-alternative]

TL;DR: An AI gateway is a reverse proxy between your app and LLM providers. It normalizes APIs, tracks costs, handles failover, and eliminates SDK sprawl. Average enterprise AI spend is $85,521/month (CloudZero, 2025), and most teams can't tell you where the money goes. Build one yourself or use Temps, which includes one in the same binary that runs your deployments.


What Is an AI Gateway?

An AI gateway is a reverse proxy that sits between your application code and LLM providers. According to Gartner, over 50% of enterprises will deploy AI gateways by 2027 — up from fewer than 5% in 2023. The pattern centralizes authentication, request routing, usage tracking, and response normalization into a single layer.

Think of it like an API gateway (Kong, Envoy, or AWS API Gateway) but purpose-built for LLM traffic. A traditional API gateway routes HTTP requests. An AI gateway understands tokens, models, streaming chunks, and cost-per-million-token pricing.

How It Works

Here's the basic flow:

┌─────────────┐     ┌──────────────────┐     ┌───────────────┐
│             │     │                  │     │   OpenAI      │
│  Your App   │────▶│   AI Gateway     │────▶│   Anthropic   │
│  (one SDK)  │◀────│                  │◀────│   Gemini      │
│             │     │  - Auth          │     │   xAI         │
└─────────────┘     │  - Route         │     └───────────────┘
                    │  - Translate     │
                    │  - Log           │
                    │  - Cache         │
                    └──────────────────┘

Your app sends every request in one format — typically OpenAI's /chat/completions schema. The gateway inspects the model field, routes to the correct provider, translates the request into that provider's native API format, translates the response back, and logs everything along the way.

How It Differs from a Regular API Gateway

| Concern | API Gateway | AI Gateway |
|---|---|---|
| Routing | URL path or header | Model name |
| Rate limiting | Requests per second | Tokens per minute |
| Cost tracking | Not built in | Per-token pricing by model |
| Response format | Pass-through | Normalize provider-specific schemas |
| Streaming | Standard HTTP streaming | SSE chunk translation between formats |
| Auth | One upstream per route | Multiple provider keys per route |

That last row matters. A single /chat/completions endpoint might fan out to four different providers, each with its own API key, rate limit, and error format. A regular API gateway doesn't handle that.


Why Do You Need One (Even for a Single Provider)?

Even teams using just OpenAI benefit from a gateway layer. Organizations using AI gateways reduce their mean time to detect cost anomalies by 60%, according to Portkey's 2024 AI Infrastructure Survey. The gateway isn't just about multi-provider routing — it's about operational visibility.

Here's what a gateway gives you beyond multi-provider support:

Cost Tracking per Feature, User, and Team

Without a gateway, you see one number on your OpenAI bill: total spend. With a gateway, every request carries metadata — which user triggered it, which feature it belongs to, which environment it ran in. You can finally answer "how much does our chatbot cost per conversation?"
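As an illustration of how attribution can work on the gateway side, here is a sketch that folds request headers into the usage record logged for each call. The header names and record shape are illustrative, not a fixed schema:

```python
# Sketch: gateway-side cost attribution. Metadata headers sent with each
# request are folded into the usage record the gateway logs per call.
# Header names (x-user-id, x-feature, x-env) are illustrative.

def build_usage_record(headers: dict, model: str,
                       input_tokens: int, output_tokens: int) -> dict:
    return {
        "model": model,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "user_id": headers.get("x-user-id", "unknown"),
        "feature": headers.get("x-feature", "untagged"),
        "env": headers.get("x-env", "development"),
    }

record = build_usage_record(
    {"x-user-id": "user_1234", "x-feature": "support-chatbot",
     "x-env": "production"},
    model="gpt-4.1", input_tokens=820, output_tokens=310,
)
# Aggregate records by "feature" to answer "what does the chatbot cost?"
```

Group these records by any field and you have per-feature, per-user, or per-environment spend.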

Rate Limiting at the Application Level

Provider rate limits protect the provider. They don't protect you from a single user burning through your entire quota. A gateway lets you set limits per user, per team, or per feature — before the request ever hits the provider.
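A minimal sketch of that idea: an in-memory token bucket per user, checked before the request is forwarded. A production gateway would back this with Redis so limits hold across instances; the capacity and refill numbers are placeholders.

```python
# Sketch: per-user token bucket, checked before the request leaves the gateway.
import time

class TokenBucket:
    def __init__(self, capacity: int, refill_per_sec: float):
        self.capacity = capacity
        self.tokens = float(capacity)
        self.refill_per_sec = refill_per_sec
        self.last = time.monotonic()

    def allow(self, cost: int = 1) -> bool:
        # Refill based on elapsed time, then try to spend.
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_per_sec)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

buckets: dict[str, TokenBucket] = {}

def check_limit(user_id: str) -> bool:
    # 60 requests burst, refilling one per second -- placeholder numbers.
    bucket = buckets.setdefault(user_id, TokenBucket(60, 1.0))
    return bucket.allow()
```

The same structure works keyed by team or feature instead of user, and with token counts as the cost instead of request counts.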

Caching Identical Prompts

If 500 users ask "what's your refund policy?" within an hour, that's 500 identical API calls. A gateway can cache responses by prompt hash and serve identical answers from cache. Some teams report 30-40% cache hit rates on customer-facing chatbots.
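A sketch of the idea, using an in-memory dict keyed by a hash of the fields that affect the completion. Real gateways add a TTL and usually only cache deterministic requests (temperature 0):

```python
# Sketch: response caching keyed by a hash of the normalized request.
# In-memory dict for illustration; production would use Redis with a TTL.
import hashlib
import json

_cache: dict[str, dict] = {}

def cache_key(body: dict) -> str:
    # Hash only the fields that affect the completion.
    relevant = {k: body.get(k) for k in ("model", "messages", "temperature")}
    return hashlib.sha256(
        json.dumps(relevant, sort_keys=True).encode()
    ).hexdigest()

def get_cached(body: dict):
    return _cache.get(cache_key(body))

def put_cached(body: dict, response: dict):
    _cache[cache_key(body)] = response
```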

Audit Logging for Compliance

SOC 2 and HIPAA require logging access to sensitive data. If your app sends customer data to an LLM, you need an immutable record of what was sent, to which model, and when. A gateway creates this audit trail automatically.

Key Rotation Without Code Changes

Rotating an API key should be a config change, not a deployment. With a gateway, you update the key in one place. Every service that routes through the gateway picks up the new key immediately. No redeployment, no downtime.

Provider Failover

OpenAI has had 12 significant outages in the past year. When your primary provider goes down, a gateway can automatically retry the request with a fallback model — say, Anthropic Claude when GPT-4.1 is unavailable. Your users never notice.
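The failover logic itself is simple in sketch form. Here, call_provider stands in for the gateway's dispatch function, ProviderError for its normalized error type, and the fallback chain is just an example ordering:

```python
# Sketch: ordered failover across providers. call_provider is a stand-in for
# the gateway's dispatch function; the chain and error type are illustrative.
FALLBACK_CHAIN = ["gpt-4.1", "claude-sonnet-4-6", "gemini-2.5-flash"]

class ProviderError(Exception):
    """Normalized provider failure: outage, 429, timeout, etc."""

def complete_with_failover(body: dict, call_provider) -> dict:
    last_error = None
    for model in FALLBACK_CHAIN:
        try:
            return call_provider({**body, "model": model})
        except ProviderError as exc:
            last_error = exc  # this provider is down; try the next one
    raise last_error
```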


What Is the OpenAI-Compatible Interface Pattern?

OpenAI's /chat/completions format has become the de facto standard for LLM APIs. Over 75% of AI developer tools now support the OpenAI API format natively, according to a16z's AI Infrastructure report (2024). This convergence makes the "OpenAI-compatible" pattern the smartest foundation for any AI gateway.

Here's the core schema every developer already knows:

{
  "model": "gpt-4.1",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain TCP vs UDP."}
  ],
  "temperature": 0.7,
  "max_tokens": 1024,
  "stream": true
}

Anthropic, Google, xAI, Mistral, and dozens of open-source model hosts all have different native formats. But if your gateway accepts this one schema and translates under the hood, all existing code works. Every OpenAI SDK in Python, TypeScript, Go, or Rust connects by changing a single line: the base URL.

That's the key insight. Don't invent a new API format. Ride the ecosystem.

Why Not Just Use Each Provider's SDK?

You could. But consider what happens at scale:

  • Your Python service uses openai and anthropic SDKs
  • Your TypeScript service uses @anthropic-ai/sdk and @google/generative-ai
  • Your Go service uses its own HTTP clients
  • Each has different error types, retry logic, and streaming implementations

Now multiply by 10 services. That's 20-40 SDK dependencies with different versioning, different auth patterns, and different response shapes. A gateway collapses all of that into one client library and one response format.


How Do You Build a Basic AI Gateway From Scratch?

Building a minimal AI gateway takes roughly 500-800 lines of code for basic routing, plus another 200-300 for streaming — far more than most teams expect when they start. Here's a walkthrough of the core pieces using Python and FastAPI.

Step 1: Accept OpenAI-Format Requests

from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse
import httpx

app = FastAPI()

PROVIDER_KEYS = {
    "openai": "sk-...",
    "anthropic": "sk-ant-...",
    "google": "AIza...",
}

def detect_provider(model: str) -> str:
    if model.startswith(("gpt-", "o1", "o3", "o4")):
        return "openai"
    elif model.startswith("claude-"):
        return "anthropic"
    elif model.startswith("gemini-"):
        return "google"
    raise ValueError(f"Unknown model: {model}")

@app.post("/v1/chat/completions")
async def chat_completions(request: Request):
    body = await request.json()
    provider = detect_provider(body["model"])

    if body.get("stream"):
        return StreamingResponse(
            stream_response(provider, body),
            media_type="text/event-stream",
        )
    return await sync_response(provider, body)

That's the skeleton. Now comes the hard part.

Step 2: Translate Requests to Provider-Native Formats

OpenAI and Anthropic have fundamentally different request schemas. Here's the translation:

def translate_to_anthropic(body: dict) -> dict:
    """Convert OpenAI chat format to Anthropic Messages API."""
    messages = body["messages"]
    system = None
    converted = []

    for msg in messages:
        if msg["role"] == "system":
            system = msg["content"]
        else:
            converted.append({
                "role": msg["role"],
                "content": msg["content"],
            })

    payload = {
        "model": body["model"],
        "messages": converted,
        "max_tokens": body.get("max_tokens", 4096),
    }
    if system:
        payload["system"] = system
    if body.get("temperature") is not None:
        payload["temperature"] = body["temperature"]
    if body.get("stream"):
        payload["stream"] = True

    return payload

Notice the differences. Anthropic pulls system out of the messages array and into a top-level field. max_tokens is required (not optional). Temperature ranges differ. Tool calling schemas are different. And we haven't even touched Gemini's format yet, which uses contents instead of messages and has its own parts structure.
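For comparison, here's a sketch of the Gemini translation. Field names follow Google's generateContent API (contents, parts, systemInstruction, generationConfig); tool calls and multimodal parts are left out:

```python
# Sketch: OpenAI chat format -> Gemini generateContent format.
# Gemini uses "contents" with "parts", renames the assistant role to
# "model", and moves sampling knobs into generationConfig.
def translate_to_gemini(body: dict) -> dict:
    contents = []
    system_parts = []

    for msg in body["messages"]:
        if msg["role"] == "system":
            system_parts.append({"text": msg["content"]})
        else:
            contents.append({
                "role": "model" if msg["role"] == "assistant" else "user",
                "parts": [{"text": msg["content"]}],
            })

    payload: dict = {"contents": contents}
    if system_parts:
        payload["systemInstruction"] = {"parts": system_parts}

    config = {}
    if body.get("temperature") is not None:
        config["temperature"] = body["temperature"]
    if body.get("max_tokens") is not None:
        config["maxOutputTokens"] = body["max_tokens"]
    if config:
        payload["generationConfig"] = config

    return payload
```

Three providers, three shapes. This is exactly the translation burden the gateway exists to hide.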

Step 3: Handle Streaming (The Hard Part)

Streaming is where the complexity explodes. Each provider uses Server-Sent Events, but the chunk format is completely different.

OpenAI sends:

data: {"choices":[{"delta":{"content":"Hello"}}]}

Anthropic sends:

event: content_block_delta
data: {"type":"content_block_delta","delta":{"type":"text_delta","text":"Hello"}}

You need to parse each provider's SSE stream and normalize the chunks back to OpenAI's delta format. Here's a simplified version for Anthropic:

import json

async def normalize_anthropic_stream(response):
    async for line in response.aiter_lines():
        if not line.startswith("data: "):
            continue
        data = json.loads(line[len("data: "):])

        if data["type"] == "content_block_delta":
            # Re-wrap the Anthropic delta as an OpenAI-style chunk.
            chunk = {
                "choices": [{
                    "delta": {"content": data["delta"]["text"]},
                    "index": 0,
                }]
            }
            yield f"data: {json.dumps(chunk)}\n\n"
        elif data["type"] == "message_stop":
            yield "data: [DONE]\n\n"

This is simplified. The real implementation needs to handle message_start (for model info and input token counts), content_block_start, tool use blocks, thinking blocks, and error events mid-stream. It also needs to compute output token counts from message_delta usage fields.

Step 4: Token Counting and Cost Calculation

MODEL_PRICING = {
    "gpt-4.1": {"input": 2.00, "output": 8.00},
    "gpt-4.1-mini": {"input": 0.40, "output": 1.60},
    "gpt-4.1-nano": {"input": 0.10, "output": 0.40},
    "claude-sonnet-4-6": {"input": 3.00, "output": 15.00},
    "claude-haiku-4-5": {"input": 0.80, "output": 4.00},
    "gemini-2.5-flash": {"input": 0.15, "output": 0.60},
}  # Prices per 1M tokens

def calculate_cost(model, input_tokens, output_tokens):
    prices = MODEL_PRICING.get(model, {"input": 0, "output": 0})
    input_cost = (input_tokens / 1_000_000) * prices["input"]
    output_cost = (output_tokens / 1_000_000) * prices["output"]
    return round(input_cost + output_cost, 6)

But here's the catch. For streaming responses, you don't get final token counts until the stream ends. You need to accumulate counts across chunks, handle cases where the provider doesn't report them, and fall back to tokenizer-based estimation.

Why This Gets Unwieldy Fast

What we've built so far handles the happy path. A production-ready gateway also needs:

  • Retry logic with exponential backoff per provider
  • Circuit breakers to stop hitting a failing provider
  • Request/response validation against schemas
  • Timeout handling (LLM requests routinely take 10-30 seconds)
  • Concurrent request limits per provider
  • Error normalization — each provider returns errors differently
  • Tool calling translation between OpenAI and Anthropic formats
  • Content filtering for PII or prompt injection detection
  • Persistent logging to a database

You're looking at 2,000-3,000 lines of well-tested code before it's production-ready. Is that worth building when solid open-source options exist?
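To give a flavor of the first item on that list, here's a retry helper with exponential backoff and jitter. The retryable exception set and delay values are placeholders; a real gateway would pair this with a per-provider circuit breaker:

```python
# Sketch: retries with exponential backoff and jitter. Exception types and
# delays are placeholders; the sleep hook exists so this stays testable.
import random
import time

def with_retries(call, max_attempts: int = 3, base_delay: float = 0.5,
                 retryable=(TimeoutError, ConnectionError), sleep=time.sleep):
    for attempt in range(max_attempts):
        try:
            return call()
        except retryable:
            if attempt == max_attempts - 1:
                raise
            # 0.5s, 1s, 2s... plus up to 10% jitter to avoid thundering herds.
            sleep(base_delay * (2 ** attempt) * (1 + random.random() * 0.1))
```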

[INTERNAL-LINK: deployment platform comparison -> /blog/temps-vs-coolify-vs-netlify]


What Open-Source AI Gateway Options Exist?

The open-source AI gateway space has matured quickly. The LLM API management market is projected to reach $3.2 billion by 2027, according to MarketsandMarkets (2024). Here's how the main options compare for self-hosted teams.

| Gateway | Language | Self-Hosted | Streaming | Cost Tracking | Deployment |
|---|---|---|---|---|---|
| LiteLLM | Python | Yes (fully) | Yes | Basic | Separate service + Redis + Postgres |
| Portkey | TypeScript | Limited | Yes | Yes | Cloud-first; self-hosted is enterprise |
| Helicone | TypeScript | Yes | Yes | Yes | Separate service + ClickHouse |
| Kong AI Gateway | Lua/Go | Yes | Yes | Via plugins | Kong cluster + database |
| Cloudflare AI Gateway | N/A | No | Yes | Yes | Cloud-only |

LiteLLM

The most popular open-source option. Supports 100+ models and handles the translation layer well. But it's a Python service that needs its own deployment, its own PostgreSQL database, Redis for caching, and its own monitoring. Startup times are slow. Memory usage is high. And you're maintaining another service.

Portkey

Great dashboard and analytics. But the self-hosted version requires an enterprise license. The cloud version means your API requests route through Portkey's servers — a non-starter for teams with data residency requirements.

Helicone

Focused on logging and observability rather than being a full proxy. Strong analytics, but it works as a header-based proxy (you add headers to your existing OpenAI calls) rather than a standalone gateway. This means you still need provider SDKs in each service.

Kong AI Gateway

Enterprise-grade but enterprise-heavy. Requires a full Kong deployment with its own database, configuration management, and operational overhead. Kong charges over $30 per million requests on their managed tier (TrueFoundry, 2025).

Every option above solves the gateway problem. None of them solve the "another service to deploy and monitor" problem.


How Does Temps Solve This Without Extra Infrastructure?

Temps includes an AI gateway in the same Rust binary that handles deployments, analytics, and monitoring. No sidecar, no separate process, no additional database. 72% of enterprises plan to increase GenAI spending in 2025 (Kong, 2025), and the teams using Temps don't need to bolt on a separate tool to manage that spend.

The gateway exposes three OpenAI-compatible endpoints:

POST /api/ai/v1/chat/completions   → Chat (all providers)
POST /api/ai/v1/embeddings         → Embeddings (OpenAI)
GET  /api/ai/v1/models             → List available models

One-Line Integration

If your code already uses the OpenAI SDK, you change one line:

import openai

client = openai.OpenAI(
    api_key="tk_your_temps_api_key",
    base_url="https://your-temps-server.example.com/api/ai/v1",
)

# Routes to Anthropic automatically
response = client.chat.completions.create(
    model="claude-sonnet-4-6",
    messages=[{"role": "user", "content": "Review this pull request."}],
)

Same SDK. Same types. Same error handling. The only difference is the base URL and API key.

This works identically in TypeScript:

import OpenAI from "openai";

const client = new OpenAI({
  apiKey: "tk_your_temps_api_key",
  baseURL: "https://your-temps-server.example.com/api/ai/v1",
});

const response = await client.chat.completions.create({
  model: "gemini-2.5-flash",
  messages: [{ role: "user", content: "Summarize this document." }],
  stream: true,
});

for await (const chunk of response) {
  process.stdout.write(chunk.choices[0]?.delta?.content || "");
}

Supported Providers

| Provider | Models | Streaming | BYOK |
|---|---|---|---|
| Anthropic | Claude Opus 4, Sonnet 4, Haiku 3.5 | Yes | Yes |
| OpenAI | GPT-4.1, o3, o4-mini, GPT-4o, embeddings | Yes | Yes |
| Google | Gemini 2.5 Flash/Pro, 2.0 Flash | Yes | Yes |
| xAI | Grok 3, Grok 3 Mini | Yes | Yes |

Provider keys are encrypted with AES-256-GCM before storage. Or use Bring Your Own Key (BYOK) mode — pass the key per-request via an x-provider-api-key header and it's never stored.

GenAI OpenTelemetry Tracing

Temps doesn't just proxy requests — it traces them. Every AI call generates OpenTelemetry spans following the GenAI semantic conventions. You see the full conversation in the AI Activity dashboard: system prompt, user messages, assistant responses, tool calls, and thinking blocks.

No additional instrumentation library needed. The gateway produces the traces automatically.

[INTERNAL-LINK: AI gateway setup docs -> /docs/ai-gateway]


How Do You Track AI Costs and Attribute Them?

Cost visibility is the primary reason teams adopt AI gateways. AI inference costs are declining 10x per year (Stanford HAI AI Index, 2025), but total spend keeps climbing because usage grows faster than prices drop. Knowing where the money goes matters more than the per-token price.

Temps logs every gateway request with 15 fields of metadata to a TimescaleDB hypertable:

  • Model and provider — which model handled this request
  • Input/output tokens — exact counts from the provider response
  • Latency — end-to-end time in milliseconds
  • Cost — calculated at 1/10,000th cent precision
  • Conversation ID — group multi-turn chats together
  • Tags — arbitrary key:value labels

Tag-Based Attribution

Pass tags via headers to slice costs any way you want:

curl https://your-temps.example.com/api/ai/v1/chat/completions \
  -H "Authorization: Bearer tk_your_key" \
  -H "x-tags: team:platform, feature:code-review, env:production" \
  -d '{
    "model": "claude-sonnet-4-6",
    "messages": [{"role": "user", "content": "Review this diff."}]
  }'

Then query the dashboard: "How much did the code-review feature cost in production last month?" You get an actual number, broken down by model.

What the Dashboard Shows

The Temps AI analytics dashboard provides:

  • Summary cards — total requests, tokens, cost, average latency, error rate
  • Time-series charts — requests and cost over time with hourly or daily bucketing
  • Per-model breakdown — which models cost the most and which are fastest
  • Per-provider view — compare Anthropic vs OpenAI vs Gemini spending
  • Conversation analytics — see the full cost of multi-turn conversations

This is the kind of visibility that standalone billing dashboards from OpenAI or Anthropic simply don't provide. They show you total spend. A gateway shows you why you're spending.

[INTERNAL-LINK: Temps analytics features -> /blog/vercel-cost-savings-with-temps]


Frequently Asked Questions

What is the latency overhead of an AI gateway?

A well-built AI gateway adds under 10ms of overhead per request. The Temps gateway, written in Rust with Axum, typically adds 2-5ms. For context, most LLM API calls take 200-2,000ms depending on the model and token count. The gateway overhead is noise compared to inference time.

Can I use an AI gateway with streaming responses?

Yes. Streaming is the most common mode for chat applications. The gateway translates Server-Sent Events between provider formats — Anthropic's content_block_delta events become OpenAI's delta format transparently. You use "stream": true in your request body and the response streams through exactly like a direct OpenAI call.

How do I handle rate limits across multiple providers?

An AI gateway gives you two layers of rate limiting. First, application-level limits: cap requests per user, per team, or per feature. Second, provider-aware limits: if OpenAI returns a 429 (rate limited), the gateway can retry with a fallback provider. Temps supports both, plus tag-based rate limiting via the x-tags header.

What's the difference between an AI gateway and a load balancer?

A load balancer distributes identical requests across identical backends. An AI gateway routes different models to different providers, translates between incompatible API formats, tracks per-token costs, and normalizes streaming chunk formats. They solve fundamentally different problems. You'd put a load balancer in front of multiple AI gateway instances, not use one instead.

Do I need an AI gateway if I only use one provider?

You don't need one, but you'll wish you had one. Single-provider teams still benefit from cost attribution per feature, key rotation without redeployment, caching identical prompts, audit logging, and rate limiting at the application level. And when you inevitably add a second provider, you won't need to refactor anything.


Start Routing to Multiple LLM Providers Today

The pattern is clear: a reverse proxy between your app and LLM providers eliminates SDK sprawl, centralizes cost tracking, and gives you failover for free. You can build one yourself — the code above is a solid starting point. You can deploy LiteLLM or another open-source option. Or you can use Temps, which bundles the gateway alongside deployments, analytics, and error tracking in one binary.

If you're already running Temps, the AI gateway is built in. Configure your provider keys in the dashboard and start routing. If you're new, installation takes under five minutes:

curl -fsSL temps.sh/install.sh | bash

One binary. One endpoint. Every LLM provider your team needs.

[INTERNAL-LINK: full installation guide -> /docs/getting-started] [INTERNAL-LINK: AI gateway documentation -> /docs/ai-gateway]

#ai-gateway #llm #openai #anthropic #gemini #ai-proxy #multi-provider #ai-gateway-multiple-llm-providers