Written by Temps Team
Last updated March 12, 2026
Your app calls OpenAI for chat, Anthropic for code review, and Gemini for document summarization. That's three SDKs, three billing dashboards, three sets of rate limit logic, and three separate auth flows. When you need to add cost tracking, failover, or swap a model, you're patching the same wrapper code across every service.
This isn't a scaling problem. It's an architecture problem. And it has a well-known solution: an AI gateway.
Enterprise AI spending hit an average of $85,521 per month in 2025, according to CloudZero's State of AI Costs report. Yet most teams can't attribute that spend to specific features, users, or even projects. The multi-provider mess isn't just annoying — it's expensive and invisible.
This guide walks through what an AI gateway is, why the pattern matters even if you're using a single provider, and how to build one yourself. Then we'll look at how Temps ships one out of the box so you don't have to.
TL;DR: An AI gateway is a reverse proxy between your app and LLM providers. It normalizes APIs, tracks costs, handles failover, and eliminates SDK sprawl. According to CloudZero, average enterprise AI spend is $85,521/month, and most teams can't tell you where the money goes. Build one yourself or use Temps, which includes one in the same binary that runs your deployments.
An AI gateway is a reverse proxy that sits between your application code and LLM providers. According to Gartner, over 50% of enterprises will deploy AI gateways by 2027 — up from fewer than 5% in 2023. The pattern centralizes authentication, request routing, usage tracking, and response normalization into a single layer.
Think of it like an API gateway (Kong, Envoy, or AWS API Gateway) but purpose-built for LLM traffic. A traditional API gateway routes HTTP requests. An AI gateway understands tokens, models, streaming chunks, and cost-per-million-token pricing.
Here's the basic flow:
┌─────────────┐     ┌──────────────────┐     ┌───────────────┐
│             │     │                  │     │   OpenAI      │
│  Your App   │────▶│   AI Gateway     │────▶│   Anthropic   │
│  (one SDK)  │◀────│                  │◀────│   Gemini      │
│             │     │  - Auth          │     │   xAI         │
└─────────────┘     │  - Route         │     └───────────────┘
                    │  - Translate     │
                    │  - Log           │
                    │  - Cache         │
                    └──────────────────┘
Your app sends every request in one format — typically OpenAI's /chat/completions schema. The gateway inspects the model field, routes to the correct provider, translates the request into that provider's native API format, translates the response back, and logs everything along the way.
| Concern | API Gateway | AI Gateway |
|---|---|---|
| Routing | URL path or header | Model name |
| Rate limiting | Requests per second | Tokens per minute |
| Cost tracking | Not built in | Per-token pricing by model |
| Response format | Pass-through | Normalize provider-specific schemas |
| Streaming | Standard HTTP streaming | SSE chunk translation between formats |
| Auth | One upstream per route | Multiple provider keys per route |
That last row matters. A single /chat/completions endpoint might fan out to four different providers, each with its own API key, rate limit, and error format. A regular API gateway doesn't handle that.
Even teams using just OpenAI benefit from a gateway layer. Organizations using AI gateways reduce their mean time to detect cost anomalies by 60%, according to Portkey's 2024 AI Infrastructure Survey. The gateway isn't just about multi-provider routing — it's about operational visibility.
Here's what a gateway gives you beyond multi-provider support:
Without a gateway, you see one number on your OpenAI bill: total spend. With a gateway, every request carries metadata — which user triggered it, which feature it belongs to, which environment it ran in. You can finally answer "how much does our chatbot cost per conversation?"
Provider rate limits protect the provider. They don't protect you from a single user burning through your entire quota. A gateway lets you set limits per user, per team, or per feature — before the request ever hits the provider.
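As a rough illustration of the application-level half, here is a minimal in-memory sliding-window limiter keyed by user ID. The 60-requests-per-minute cap and the user_id parameter are illustrative only, and a real gateway would back this with Redis or another shared store:

import time
from collections import defaultdict, deque

REQUESTS_PER_MINUTE = 60  # illustrative cap, not a recommendation
_windows: dict[str, deque] = defaultdict(deque)

def allow_request(user_id: str) -> bool:
    """Return True if this user is under their rolling one-minute budget."""
    now = time.monotonic()
    window = _windows[user_id]
    # Evict timestamps older than the 60-second window.
    while window and now - window[0] > 60:
        window.popleft()
    if len(window) >= REQUESTS_PER_MINUTE:
        return False
    window.append(now)
    return True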
If 500 users ask "what's your refund policy?" within an hour, that's 500 identical API calls. A gateway can cache responses by prompt hash and serve identical answers from cache. Some teams report 30-40% cache hit rates on customer-facing chatbots.
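A sketch of that caching idea, assuming an in-process dict and a SHA-256 hash of the canonicalized request body; a real deployment would use Redis with a TTL and skip caching for high-temperature requests:

import hashlib
import json

_response_cache: dict[str, dict] = {}

def prompt_cache_key(body: dict) -> str:
    # Identical model + messages + sampling params produce the same key.
    canonical = json.dumps(body, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

def cached_response(body: dict) -> dict | None:
    return _response_cache.get(prompt_cache_key(body))

def store_response(body: dict, response: dict) -> None:
    _response_cache[prompt_cache_key(body)] = response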
SOC 2 and HIPAA require logging access to sensitive data. If your app sends customer data to an LLM, you need an immutable record of what was sent, to which model, and when. A gateway creates this audit trail automatically.
Rotating an API key should be a config change, not a deployment. With a gateway, you update the key in one place. Every service that routes through the gateway picks up the new key immediately. No redeployment, no downtime.
OpenAI has had 12 significant outages in the past year. When your primary provider goes down, a gateway can automatically retry the request with a fallback model — say, Anthropic Claude when GPT-4.1 is unavailable. Your users never notice.
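A hedged sketch of that failover path, with an illustrative fallback map and the provider call injected as a callable so the snippet stays self-contained; the error type and model pairing are assumptions, not Temps behavior:

from typing import Awaitable, Callable

class ProviderUnavailableError(Exception):
    """Raised when the upstream provider times out or returns a 5xx/429."""

FALLBACKS = {"gpt-4.1": "claude-sonnet-4-6"}  # illustrative primary-to-substitute pairing

async def complete_with_failover(
    body: dict,
    call_provider: Callable[[dict], Awaitable[dict]],
) -> dict:
    """Try the requested model; on provider failure, retry once with its fallback."""
    try:
        return await call_provider(body)
    except ProviderUnavailableError:
        fallback = FALLBACKS.get(body["model"])
        if fallback is None:
            raise
        return await call_provider({**body, "model": fallback})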
OpenAI's /chat/completions format has become the de facto standard for LLM APIs. Over 75% of AI developer tools now support the OpenAI API format natively, according to a16z's AI Infrastructure report. This convergence makes the "OpenAI-compatible" pattern the smartest foundation for any AI gateway.
Here's the core schema every developer already knows:
{
"model": "gpt-4.1",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain TCP vs UDP."}
],
"temperature": 0.7,
"max_tokens": 1024,
"stream": true
}
Anthropic, Google, xAI, Mistral, and dozens of open-source model hosts all have different native formats. But if your gateway accepts this one schema and translates under the hood, all existing code works. Every OpenAI SDK in Python, TypeScript, Go, or Rust connects by changing a single line: the base URL.
That's the key insight. Don't invent a new API format. Ride the ecosystem.
You could. But consider what happens at scale:
- Each Python service installs the openai and anthropic SDKs
- Each TypeScript service installs @anthropic-ai/sdk and @google/generative-ai

Now multiply by 10 services. That's 20-40 SDK dependencies with different versioning, different auth patterns, and different response shapes. A gateway collapses all of that into one client library and one response format.
Building a minimal AI gateway takes roughly 500-800 lines of code for basic routing, plus another 200-300 for streaming — far more than most teams expect when they start. Here's a walkthrough of the core pieces using Python and FastAPI.
from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse
import httpx
app = FastAPI()
PROVIDER_KEYS = {
"openai": "sk-...",
"anthropic": "sk-ant-...",
"google": "AIza...",
}
def detect_provider(model: str) -> str:
if model.startswith(("gpt-", "o1", "o3", "o4")):
return "openai"
elif model.startswith("claude-"):
return "anthropic"
elif model.startswith("gemini-"):
return "google"
raise ValueError(f"Unknown model: {model}")
@app.post("/v1/chat/completions")
async def chat_completions(request: Request):
body = await request.json()
provider = detect_provider(body["model"])
if body.get("stream"):
return StreamingResponse(
stream_response(provider, body),
media_type="text/event-stream",
)
return await sync_response(provider, body)
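The sync_response helper referenced above isn't shown in this walkthrough. Here is a minimal sketch for the OpenAI case, where the payload needs no translation; the URL map and 120-second timeout are illustrative, and error handling is omitted:

PROVIDER_URLS = {
    "openai": "https://api.openai.com/v1/chat/completions",
    # Anthropic and Google need translated payloads (see the next section).
}

async def sync_response(provider: str, body: dict) -> dict:
    # Forward the request with the provider's key and return the JSON response.
    async with httpx.AsyncClient(timeout=120) as client:
        resp = await client.post(
            PROVIDER_URLS[provider],
            headers={"Authorization": f"Bearer {PROVIDER_KEYS[provider]}"},
            json=body,
        )
        resp.raise_for_status()
        return resp.json()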
That's the skeleton. Now comes the hard part.
OpenAI and Anthropic have fundamentally different request schemas. Here's the translation:
def translate_to_anthropic(body: dict) -> dict:
"""Convert OpenAI chat format to Anthropic Messages API."""
messages = body["messages"]
system = None
converted = []
for msg in messages:
if msg["role"] == "system":
system = msg["content"]
else:
converted.append({
"role": msg["role"],
"content": msg["content"],
})
payload = {
"model": body["model"],
"messages": converted,
"max_tokens": body.get("max_tokens", 4096),
}
if system:
payload["system"] = system
if body.get("temperature") is not None:
payload["temperature"] = body["temperature"]
if body.get("stream"):
payload["stream"] = True
return payload
Notice the differences. Anthropic pulls system out of the messages array and into a top-level field. max_tokens is required (not optional). Temperature ranges differ. Tool calling schemas are different. And we haven't even touched Gemini's format yet, which uses contents instead of messages and has its own parts structure.
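For comparison, a rough sketch of the Gemini direction, following the shape of Google's generateContent request: system messages move to systemInstruction, the assistant role becomes "model", and sampling knobs live under generationConfig. Tool calls and multimodal parts are ignored here:

def translate_to_gemini(body: dict) -> dict:
    """Convert OpenAI chat format to a Gemini generateContent payload (sketch)."""
    contents = []
    system_parts = []
    for msg in body["messages"]:
        if msg["role"] == "system":
            system_parts.append({"text": msg["content"]})
        else:
            contents.append({
                # Gemini names the assistant role "model".
                "role": "model" if msg["role"] == "assistant" else "user",
                "parts": [{"text": msg["content"]}],
            })
    payload = {"contents": contents}
    if system_parts:
        payload["systemInstruction"] = {"parts": system_parts}
    generation_config = {}
    if body.get("temperature") is not None:
        generation_config["temperature"] = body["temperature"]
    if body.get("max_tokens") is not None:
        generation_config["maxOutputTokens"] = body["max_tokens"]
    if generation_config:
        payload["generationConfig"] = generation_config
    return payload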
Streaming is where the complexity explodes. Each provider uses Server-Sent Events, but the chunk format is completely different.
OpenAI sends:
data: {"choices":[{"delta":{"content":"Hello"}}]}
Anthropic sends:
event: content_block_delta
data: {"type":"content_block_delta","delta":{"type":"text_delta","text":"Hello"}}
You need to parse each provider's SSE stream and normalize the chunks back to OpenAI's delta format. Here's a simplified version for Anthropic:
import json

async def normalize_anthropic_stream(response):
    async for line in response.aiter_lines():
        if not line.startswith("data: "):
            continue
        data = json.loads(line[6:])
        if data["type"] == "content_block_delta":
            chunk = {
                "choices": [{
                    "delta": {"content": data["delta"]["text"]},
                    "index": 0,
                }]
            }
            yield f"data: {json.dumps(chunk)}\n\n"
        elif data["type"] == "message_stop":
            yield "data: [DONE]\n\n"
This is simplified. The real implementation needs to handle message_start (for model info and input token counts), content_block_start, tool use blocks, thinking blocks, and error events mid-stream. It also needs to compute output token counts from message_delta usage fields.
MODEL_PRICING = {
"gpt-4.1": {"input": 2.00, "output": 8.00},
"gpt-4.1-mini": {"input": 0.40, "output": 1.60},
"gpt-4.1-nano": {"input": 0.10, "output": 0.40},
"claude-sonnet-4-6": {"input": 3.00, "output": 15.00},
"claude-haiku-4-5": {"input": 0.80, "output": 4.00},
"gemini-2.5-flash": {"input": 0.15, "output": 0.60},
} # Prices per 1M tokens
def calculate_cost(model, input_tokens, output_tokens):
prices = MODEL_PRICING.get(model, {"input": 0, "output": 0})
input_cost = (input_tokens / 1_000_000) * prices["input"]
output_cost = (output_tokens / 1_000_000) * prices["output"]
return round(input_cost + output_cost, 6)
But here's the catch. For streaming responses, you don't get final token counts until the stream ends. You need to accumulate counts across chunks, handle cases where the provider doesn't report them, and fall back to tokenizer-based estimation.
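One way to handle the Anthropic side, sketched as a small accumulator that reads input tokens from the message_start event and the running output count from message_delta events, reusing calculate_cost from above; the tokenizer-based fallback is left out:

class UsageAccumulator:
    """Collects token counts from Anthropic streaming events as they arrive."""

    def __init__(self):
        self.input_tokens = 0
        self.output_tokens = 0

    def observe(self, event: dict) -> None:
        if event["type"] == "message_start":
            self.input_tokens = event["message"]["usage"]["input_tokens"]
        elif event["type"] == "message_delta":
            # Anthropic reports a cumulative output token count on each delta.
            self.output_tokens = event["usage"]["output_tokens"]

    def cost(self, model: str) -> float:
        return calculate_cost(model, self.input_tokens, self.output_tokens)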
What we've built so far handles the happy path. A production-ready gateway also needs retry and failover logic, provider-aware rate limiting, response caching, error-format normalization, audit logging, and secure key management.
You're looking at 2,000-3,000 lines of well-tested code before it's production-ready. Is that worth building when solid open-source options exist?
The open-source AI gateway space has matured quickly. The LLM API management market is projected to reach $3.2 billion by 2027, according to MarketsandMarkets (2024). Here's how the main options compare for self-hosted teams.
| Gateway | Language | Self-Hosted | Streaming | Cost Tracking | Deployment |
|---|---|---|---|---|---|
| LiteLLM | Python | Yes (fully) | Yes | Basic | Separate service + Redis + Postgres |
| Portkey | TypeScript | Limited | Yes | Yes | Cloud-first, self-hosted is enterprise |
| Helicone | TypeScript | Yes | Yes | Yes | Separate service + ClickHouse |
| Kong AI Gateway | Lua/Go | Yes | Yes | Via plugins | Kong cluster + database |
| Cloudflare AI Gateway | N/A | No | Yes | Yes | Cloud-only |
LiteLLM is the most popular open-source option. It supports 100+ models and handles the translation layer well. But it's a Python service that needs its own deployment, its own PostgreSQL database, Redis for caching, and its own monitoring. Startup times are slow, memory usage is high, and you're maintaining yet another service.
Portkey has a great dashboard and analytics. But the self-hosted version requires an enterprise license, and the cloud version means your API requests route through Portkey's servers — a non-starter for teams with data residency requirements.
Helicone focuses on logging and observability rather than acting as a full proxy. The analytics are strong, but it works as a header-based proxy (you add headers to your existing OpenAI calls) rather than a standalone gateway, so you still need provider SDKs in each service.
Kong AI Gateway is enterprise-grade but enterprise-heavy. It requires a full Kong deployment with its own database, configuration management, and operational overhead. According to TrueFoundry, Kong charges over $30 per million requests on their managed tier.
Every option above solves the gateway problem. None of them solve the "another service to deploy and monitor" problem.
Temps includes an AI gateway in the same Rust binary that handles deployments, analytics, and monitoring. No sidecar, no separate process, no additional database. According to Kong, 72% of enterprises plan to increase GenAI spending in 2025, and the teams using Temps don't need to bolt on a separate tool to manage that spend.
The gateway exposes three OpenAI-compatible endpoints:
POST /api/ai/v1/chat/completions → Chat (all providers)
POST /api/ai/v1/embeddings → Embeddings (OpenAI)
GET /api/ai/v1/models → List available models
If your code already uses the OpenAI SDK, you change one line:
import openai
client = openai.OpenAI(
api_key="tk_your_temps_api_key",
base_url="https://your-temps-server.example.com/api/ai/v1",
)
# Routes to Anthropic automatically
response = client.chat.completions.create(
model="claude-sonnet-4-6",
messages=[{"role": "user", "content": "Review this pull request."}],
)
Same SDK. Same types. Same error handling. The only difference is the base URL and API key.
This works identically in TypeScript:
import OpenAI from "openai";
const client = new OpenAI({
apiKey: "tk_your_temps_api_key",
baseURL: "https://your-temps-server.example.com/api/ai/v1",
});
const response = await client.chat.completions.create({
model: "gemini-2.5-flash",
messages: [{ role: "user", content: "Summarize this document." }],
stream: true,
});
for await (const chunk of response) {
process.stdout.write(chunk.choices[0]?.delta?.content || "");
}
| Provider | Models | Streaming | BYOK |
|---|---|---|---|
| Anthropic | Claude Opus 4, Sonnet 4, Haiku 3.5 | Yes | Yes |
| OpenAI | GPT-4.1, o3, o4-mini, GPT-4o, embeddings | Yes | Yes |
| Google | Gemini 2.5 Flash/Pro, 2.0 Flash | Yes | Yes |
| xAI | Grok 3, Grok 3 Mini | Yes | Yes |
Provider keys are encrypted with AES-256-GCM before storage. Or use Bring Your Own Key (BYOK) mode — pass the key per-request via an x-provider-api-key header and it's never stored.
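With the OpenAI Python SDK, that BYOK header can ride along on every call via default_headers; the key values here are placeholders:

import openai

client = openai.OpenAI(
    api_key="tk_your_temps_api_key",
    base_url="https://your-temps-server.example.com/api/ai/v1",
    # BYOK: the provider key travels with the request and is never stored.
    default_headers={"x-provider-api-key": "sk-ant-your-anthropic-key"},
)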
Temps doesn't just proxy requests — it traces them. Every AI call generates OpenTelemetry spans following the GenAI semantic conventions. You see the full conversation in the AI Activity dashboard: system prompt, user messages, assistant responses, tool calls, and thinking blocks.
No additional instrumentation library needed. The gateway produces the traces automatically.
Cost visibility is the primary reason teams adopt AI gateways. According to the Stanford HAI AI Index, AI inference costs are declining 10x per year, but total spend keeps climbing because usage grows faster than prices drop. Knowing where the money goes matters more than the per-token price.
Temps logs every gateway request to a TimescaleDB hypertable with 15 fields of metadata, including the model, provider, token counts, computed cost, and any tags you attach.
Pass tags via headers to slice costs any way you want:
curl https://your-temps.example.com/api/ai/v1/chat/completions \
  -H "Authorization: Bearer tk_your_key" \
  -H "Content-Type: application/json" \
  -H "x-tags: team:platform, feature:code-review, env:production" \
  -d '{
    "model": "claude-sonnet-4-6",
    "messages": [{"role": "user", "content": "Review this diff."}]
  }'
Then query the dashboard: "How much did the code-review feature cost in production last month?" You get an actual number, broken down by model.
The Temps AI analytics dashboard turns that log into spend breakdowns by model, provider, and tag over time.
This is the kind of visibility that standalone billing dashboards from OpenAI or Anthropic simply don't provide. They show you total spend. A gateway shows you why you're spending.
A well-built AI gateway adds under 10ms of overhead per request. The Temps gateway, written in Rust with Axum, typically adds 2-5ms. For context, most LLM API calls take 200-2,000ms depending on the model and token count. The gateway overhead is noise compared to inference time.
Streaming works through the gateway and is the most common mode for chat applications. The gateway translates Server-Sent Events between provider formats — Anthropic's content_block_delta events become OpenAI's delta format transparently. You set "stream": true in your request body and the response streams through exactly like a direct OpenAI call.
An AI gateway gives you two layers of rate limiting. First, application-level limits: cap requests per user, per team, or per feature. Second, provider-aware limits: if OpenAI returns a 429 (rate limited), the gateway can retry with a fallback provider. Temps supports both, plus tag-based rate limiting via the x-tags header.
A load balancer distributes identical requests across identical backends. An AI gateway routes different models to different providers, translates between incompatible API formats, tracks per-token costs, and normalizes streaming chunk formats. They solve fundamentally different problems. You'd put a load balancer in front of multiple AI gateway instances, not use one instead.
You don't need one for a single provider, but you'll wish you had one. Single-provider teams still benefit from cost attribution per feature, key rotation without redeployment, caching of identical prompts, audit logging, and rate limiting at the application level. And when you inevitably add a second provider, you won't need to refactor anything.
The pattern is clear: a reverse proxy between your app and LLM providers eliminates SDK sprawl, centralizes cost tracking, and gives you failover for free. You can build one yourself — the code above is a solid starting point. You can deploy LiteLLM or another open-source option. Or you can use Temps, which bundles the gateway alongside deployments, analytics, and error tracking in one binary.
If you're already running Temps, the AI gateway is built in. Configure your provider keys in the dashboard and start routing. If you're new, installation takes under five minutes:
curl -fsSL temps.sh/install.sh | bash
One binary. One endpoint. Every LLM provider your team needs.