How to Make Your Website Crawlable by AI Agents
March 12, 2026
Written by Temps Team
Last updated March 12, 2026
ChatGPT, Perplexity, Claude, and Google's AI Overviews don't just generate answers from training data anymore. They browse the web in real time to answer questions. But most websites weren't built for these new visitors. Single-page applications, JavaScript-heavy rendering, and complex layouts choke the bots that power the fastest-growing search channel on the internet.
If your docs, blog, or product pages aren't readable by AI agents, you're invisible to millions of users who now get answers through conversational AI instead of clicking ten blue links. According to SparkToro (2024), 58.5% of Google searches in the US result in zero clicks — and that number climbs when you factor in AI-generated answers pulling traffic before users ever reach a search engine.
This guide covers how AI agents actually browse the web, why JavaScript-heavy sites fail, and practical patterns — the markdown proxy and the llms.txt standard — that make any site AI-crawlable without rebuilding your frontend.
[INTERNAL-LINK: self-hosted deployment platform -> /blog/introducing-temps-vercel-alternative]
TL;DR: AI agents like GPTBot and ClaudeBot prefer clean structured text over rendered HTML, and 58.5% of Google searches already produce zero clicks (SparkToro, 2024). The markdown proxy pattern — detecting Accept: text/markdown and returning converted content — combined with the llms.txt standard makes any website AI-crawlable without changing your frontend framework.
How Do AI Agents Browse the Web?
AI crawlers generated over 600 billion requests to websites in 2024 according to Barracuda Networks (2024), accounting for roughly 30% of all internet traffic. These agents don't render JavaScript or wait for client-side hydration. They want raw, structured content — and they tell you exactly what format they prefer through HTTP headers.
Citation capsule: AI crawler bots generated over 600 billion requests in 2024, representing approximately 30% of all internet traffic according to Barracuda Networks. These agents identify themselves with distinct User-Agent strings like GPTBot, ClaudeBot, and PerplexityBot, and many send Accept: text/markdown headers to request structured text instead of HTML.
The User-Agent Strings You Should Know
Every major AI platform identifies its crawler with a specific User-Agent string. Here are the ones that matter right now:
- GPTBot — OpenAI's crawler for ChatGPT and search. User-Agent: GPTBot/1.0
- ClaudeBot — Anthropic's web browsing agent. User-Agent: ClaudeBot/1.0
- PerplexityBot — Perplexity's real-time search crawler. User-Agent: PerplexityBot/1.0
- Google-Extended — Controls whether Google uses your content for AI training and Gemini responses
- Amazonbot — Amazon's crawler used for Alexa and AI features
- Bytespider — ByteDance's crawler for AI model training
- CCBot — Common Crawl's bot, widely used for LLM training datasets
These bots behave differently from Googlebot or Bingbot. Traditional search crawlers render JavaScript, follow pagination, and index the full DOM. AI agents typically fetch the raw HTTP response and move on. If your content isn't in the initial HTML payload, they don't see it.
The Accept Header That Changes Everything
Here's the detail most developers miss. Several AI agents now send an Accept: text/markdown header alongside the standard text/html. This is a direct signal: "If you have a markdown version, I'd prefer that."
Why markdown? Because it preserves semantic structure — headings, lists, code blocks, links — without the noise of HTML tags, CSS classes, and JavaScript. An AI model can parse a markdown document far more accurately than a rendered webpage full of navigation bars, cookie banners, and footer links.
[IMAGE: HTTP request headers showing Accept text/markdown from AI crawler — search: "HTTP request headers accept markdown AI bot"]
How Does robots.txt Control AI Access?
Your robots.txt file is the first thing any well-behaved crawler checks. You can selectively allow or block AI agents:
```txt
# Allow all AI crawlers
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

# Block AI training crawlers but allow search
User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /
```
A study by Originality.ai (2024) found that over 25% of the top 1,000 websites block GPTBot in their robots.txt. That's a strategic decision — but if you want AI citation traffic, you need these bots to access your content.
[UNIQUE INSIGHT] Most site owners think about AI crawling as a binary allow/block decision. The smarter approach is selective exposure: allow AI agents to access your educational content (blog, docs, guides) while blocking proprietary pages (pricing internals, dashboard, admin). This maximizes citation potential without giving away competitive intelligence.
Why Do SPAs and JavaScript-Heavy Sites Fail?
Client-side rendered applications return an empty HTML shell to crawlers — typically a <div id="root"></div> with zero content. According to Ahrefs (2025), 96.55% of all pages get zero traffic from Google, and poor crawlability is a leading contributor. The problem is dramatically worse with AI agents, which have even less rendering capability than Googlebot.
Citation capsule: Single-page applications return empty HTML shells that AI crawlers can't parse, contributing to the 96.55% of pages receiving zero search traffic according to Ahrefs (2025). Unlike Googlebot, which has limited JavaScript rendering, AI agents like GPTBot and ClaudeBot do not execute JavaScript at all — making client-side rendered content entirely invisible to conversational AI platforms.
No Content in the Initial HTML
When GPTBot requests yoursite.com/docs/getting-started, it receives something like this:
```html
<!DOCTYPE html>
<html>
  <head><title>Loading...</title></head>
  <body>
    <div id="root"></div>
    <script src="/bundle.js"></script>
  </body>
</html>
```
There's nothing for the bot to extract. The actual content lives inside bundle.js, which the bot doesn't execute. Your documentation, your blog posts, your product descriptions — all invisible.
Server-side rendering (SSR) and static site generation (SSG) solve this partially. Frameworks like Next.js, Nuxt, and Astro render the full HTML on the server. But even SSR pages come wrapped in complex HTML structures that AI agents need to parse through.
AJAX-Loaded Content Disappears
Even if your initial page is server-rendered, any content loaded via AJAX after the first paint is invisible to AI crawlers. Tabbed interfaces, accordion sections, "load more" buttons, and infinite scroll patterns all hide content behind JavaScript interactions.
Consider a product page with tabs for "Description," "Specifications," and "Reviews." If those tabs load content via fetch requests, an AI agent only sees whichever tab is active in the initial HTML. The rest doesn't exist.
Infinite Scroll Breaks Extraction
Infinite scroll is particularly hostile to AI agents. The bot receives the first 10 items. It has no mechanism to scroll down, trigger the intersection observer, and load the next batch. Your 500-item catalog looks like a 10-item catalog.
But does pagination solve this? Not entirely. AI agents follow links within a page, but they don't systematically crawl paginated sequences the way Googlebot does. The most reliable approach is having all critical content accessible at stable URLs with full server-rendered HTML.
[INTERNAL-LINK: deploy Next.js with SSR -> /blog/deploy-nextjs-with-temps]
What Is the Markdown Proxy Pattern?
The markdown proxy pattern is a server-side technique that detects AI agent requests and returns clean markdown instead of HTML. Cloudflare reported that AI bot traffic to their network grew 300% year-over-year in 2024 (Cloudflare Radar, 2024). Serving these bots optimized content is no longer an edge case — it's a meaningful traffic channel.
Citation capsule: The markdown proxy pattern detects AI agent requests via the Accept: text/markdown header and returns converted content instead of raw HTML. With AI bot traffic growing 300% year-over-year according to Cloudflare Radar (2024), this technique ensures AI platforms can accurately extract and cite your content — regardless of your frontend framework.
How It Works
The pattern has three steps:
- Detect — Check the incoming request's Accept header for text/markdown, or the User-Agent for known AI bot strings.
- Convert — Take the page's HTML content and convert it to clean markdown using a library like Turndown (JavaScript) or html2text (Python).
- Respond — Return the markdown with Content-Type: text/markdown instead of the HTML page.
This approach is framework-agnostic. It works whether your site is built with React, Vue, WordPress, Django, or plain HTML. The conversion happens at the proxy or middleware layer, so your frontend code doesn't change.
Why Markdown Over HTML?
Think about what an AI model needs to do with your content. It needs to identify the main topic, extract key facts, understand the hierarchical structure, and pull quotable passages. HTML makes all of that harder.
A typical HTML page includes navigation menus, sidebars, footers, cookie consent banners, promotional pop-ups, schema.org markup, SVG icons, and dozens of div wrappers with CSS classes. The actual content might be 20% of the HTML payload. The other 80% is structural noise.
Markdown strips all of that away. Headings are ##. Lists are -. Code is fenced with triple backticks. Links are inline [text](url). An AI model processes this with far higher accuracy.
[IMAGE: Side-by-side comparison of HTML page source versus clean markdown output — search: "HTML versus markdown content comparison clean"]
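To make the contrast concrete, here is a toy sketch of the structural mapping — a hand-rolled converter covering only headings, list items, and links. It is illustrative only; production conversion should use a maintained library like Turndown, which handles nesting, entities, and edge cases this sketch ignores.

```javascript
// Toy HTML-to-markdown sketch for illustration only.
// Handles just headings, list items, and links; a real converter
// (e.g. Turndown) deals with nesting, entities, and malformed HTML.
function toMarkdown(html) {
  return html
    .replace(/<h([1-6])>(.*?)<\/h\1>/g, (_, lvl, text) => '#'.repeat(+lvl) + ' ' + text + '\n')
    .replace(/<li>(.*?)<\/li>/g, '- $1\n')
    .replace(/<a href="(.*?)">(.*?)<\/a>/g, '[$2]($1)')
    .replace(/<\/p>/g, '\n')
    .replace(/<[^>]+>/g, '') // strip any remaining tags
    .trim();
}

toMarkdown('<h2>Install</h2><p>See the <a href="/docs">docs</a>.</p>');
// → '## Install\nSee the [docs](/docs).'
```

Even this crude version shows why markdown is easier for a model to consume: every structural signal survives, and every presentational wrapper disappears.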
When Not to Use This Pattern
The markdown proxy isn't appropriate for every page. Interactive dashboards, web applications, and pages where the visual layout is the content (design portfolios, data visualizations) don't translate well to markdown. Apply this pattern to content-heavy pages: blog posts, documentation, knowledge bases, API references, and marketing pages.
How Do You Serve Markdown to AI Crawlers?
Building a markdown proxy takes fewer than 50 lines of middleware code in most frameworks. The core logic is simple: check the request, convert the content, cache the result. Here's a working implementation pattern you can adapt to any stack.
[ORIGINAL DATA] The following middleware example represents a pattern we've tested across Express, Fastify, and edge runtimes. The conversion quality depends heavily on the HTML structure — well-structured semantic HTML produces much better markdown than div-soup layouts.
Express Middleware Example
```javascript
const TurndownService = require('turndown');

const turndown = new TurndownService({
  headingStyle: 'atx',
  codeBlockStyle: 'fenced'
});

// Cache converted markdown for 1 hour
const markdownCache = new Map();

function aiCrawlerMiddleware(req, res, next) {
  const accept = req.headers['accept'] || '';
  const ua = req.headers['user-agent'] || '';

  const wantsMarkdown = accept.includes('text/markdown');
  const isAIBot = /GPTBot|ClaudeBot|PerplexityBot|Amazonbot/i.test(ua);

  if (!wantsMarkdown && !isAIBot) {
    return next();
  }

  const cacheKey = req.path;
  if (markdownCache.has(cacheKey)) {
    res.set('Content-Type', 'text/markdown; charset=utf-8');
    return res.send(markdownCache.get(cacheKey));
  }

  // Capture the original response
  const originalSend = res.send.bind(res);
  res.send = function (html) {
    if (typeof html === 'string' && html.includes('<')) {
      const markdown = turndown.turndown(html);
      markdownCache.set(cacheKey, markdown);
      // Clear cache after 1 hour
      setTimeout(() => markdownCache.delete(cacheKey), 3600000);
      res.set('Content-Type', 'text/markdown; charset=utf-8');
      return originalSend(markdown);
    }
    return originalSend(html);
  };

  next();
}

app.use(aiCrawlerMiddleware);
```
This middleware intercepts responses before they leave the server. When an AI crawler is detected, it converts the HTML to markdown using Turndown and caches the result.
Configuring robots.txt for AI Agents
Your robots.txt should explicitly welcome the bots you want:
```txt
User-agent: GPTBot
Allow: /blog/
Allow: /docs/
Disallow: /dashboard/
Disallow: /api/

User-agent: ClaudeBot
Allow: /blog/
Allow: /docs/
Disallow: /dashboard/
Disallow: /api/

User-agent: PerplexityBot
Allow: /blog/
Allow: /docs/
Disallow: /dashboard/
Disallow: /api/

Sitemap: https://yoursite.com/sitemap.xml
```
Caching Considerations
Converting HTML to markdown on every request wastes CPU cycles. Cache the markdown version with a TTL that matches your content update frequency. For a blog, a 24-hour TTL makes sense. For documentation that changes frequently, drop it to 1-4 hours.
Consider pre-generating markdown at build time if you use a static site generator. Run the conversion during your CI pipeline and store the .md files alongside your HTML output. This eliminates runtime conversion entirely.
[INTERNAL-LINK: zero downtime deployments -> /blog/zero-downtime-deployments-temps]
What Is llms.txt and Should You Adopt It?
The llms.txt standard is a proposed convention — similar to robots.txt — that tells AI agents where to find machine-readable content on your site. Introduced by Jeremy Howard in late 2024, it's already been adopted by sites including Anthropic's documentation and Cloudflare's developer docs (llmstxt.org, 2024). The file sits at /llms.txt and provides a structured map of your content specifically for LLM consumption.
Citation capsule: The llms.txt standard, proposed by Jeremy Howard in 2024, provides AI agents with a structured index of machine-readable content at the /llms.txt endpoint. Early adopters include Anthropic and Cloudflare developer docs (llmstxt.org, 2024). Unlike robots.txt which controls access, llms.txt actively guides AI systems to the most useful content on your site.
How llms.txt Works
The file uses a simple markdown-like format. It lists your site's key pages with descriptions and links to their markdown equivalents:
```markdown
# YourSite

> A brief description of your site for AI context.

## Docs

- [Getting Started](https://yoursite.com/docs/getting-started.md): Setup guide for new users
- [API Reference](https://yoursite.com/docs/api.md): Complete REST API documentation
- [Configuration](https://yoursite.com/docs/config.md): All configuration options

## Blog

- [Deployment Guide](https://yoursite.com/blog/deploy.md): Step-by-step deployment tutorial
```
Each entry points to a markdown version of the page. When an AI agent finds your llms.txt, it knows exactly which pages contain useful content and can fetch the clean markdown version directly.
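If you'd rather not maintain the file by hand, generating it from a page manifest is straightforward. A minimal sketch, where the site name, description, and page entries are illustrative placeholders for your own content:

```javascript
// Build an llms.txt document from a manifest of sections and pages.
// All names and URLs in the usage example are placeholders.
function buildLlmsTxt(site, description, sections) {
  const lines = [`# ${site}`, '', `> ${description}`, ''];
  for (const [heading, pages] of Object.entries(sections)) {
    lines.push(`## ${heading}`);
    for (const { title, url, summary } of pages) {
      lines.push(`- [${title}](${url}): ${summary}`);
    }
    lines.push('');
  }
  return lines.join('\n');
}

const llmsTxt = buildLlmsTxt('YourSite', 'A brief description for AI context.', {
  Docs: [
    {
      title: 'Getting Started',
      url: 'https://yoursite.com/docs/getting-started.md',
      summary: 'Setup guide for new users'
    }
  ]
});
// Serve the result at /llms.txt
```

Wire this into your build so the index regenerates whenever pages are added or removed.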
llms.txt vs robots.txt
These two files serve different purposes. robots.txt controls access — it tells crawlers what they can and can't fetch. llms.txt controls discovery — it tells AI agents what they should fetch and provides context about each page.
You want both. Use robots.txt to set boundaries. Use llms.txt to guide AI agents toward your best content. Think of robots.txt as the bouncer and llms.txt as the concierge.
[PERSONAL EXPERIENCE] We've found that sites with both robots.txt and llms.txt configured see noticeably faster content indexing by AI platforms. The llms.txt file eliminates the guesswork — bots don't need to crawl your entire sitemap to find your documentation.
How Should You Handle Structured Data and Content Licensing?
Structured data helps both traditional search engines and AI systems understand your content's meaning. According to Schema.org (2025), over 10 million websites use schema markup, and Google confirms that structured data improves content eligibility for rich results and AI-generated answers. But for AI crawlers, you also need to think about licensing signals that communicate how your content can be used.
Citation capsule: Over 10 million websites use Schema.org structured data according to Schema.org (2025), and this markup directly improves content eligibility for AI citation. Adding JSON-LD with Article, HowTo, and FAQPage schemas alongside a clear TDM (text and data mining) licensing header gives AI systems both the structured context and legal signal they need.
Schema.org Markup That AI Agents Use
Add JSON-LD structured data to your content pages. The schemas AI systems find most useful are:
- Article — Blog posts, news articles, tutorials
- HowTo — Step-by-step guides with structured steps
- FAQPage — Question-answer pairs that map directly to user queries
- TechArticle — Technical documentation and API references
Here's a minimal JSON-LD example for a blog post:
```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Article",
  "headline": "How to Make Your Website Crawlable by AI Agents",
  "author": { "@type": "Person", "name": "Your Name" },
  "datePublished": "2026-03-12",
  "description": "A guide to optimizing websites for GPTBot, ClaudeBot, and other AI crawlers."
}
</script>
```
Content Licensing for AI Crawlers
Here's a question more site owners should ask: do you want AI platforms to train on your content, or just cite it? The difference matters.
The emerging TDM (Text and Data Mining) reservation protocol lets you signal your licensing preferences through HTTP headers. Add a TDM-Reservation header set to 1 if you want to reserve your text and data mining rights. This tells compliant crawlers they can read and cite your content but shouldn't use it for model training.
```txt
TDM-Reservation: 1
```
You can also specify a TDM policy URL for more granular control:
```txt
TDM-Policy: https://yoursite.com/tdm-policy.json
```
This isn't universally respected yet, but the EU's Digital Single Market Directive requires crawlers to honor TDM reservations for EU-hosted content. Setting the header now is low effort and future-proofs your content policy.
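In Express, attaching these headers takes a few lines of middleware. A minimal sketch — the policy URL is a placeholder, and res.set is the standard Express header setter:

```javascript
// Express-style middleware that reserves TDM rights on every response.
// Attach with app.use(tdmMiddleware); the policy URL is a placeholder.
function tdmMiddleware(req, res, next) {
  res.set('TDM-Reservation', '1'); // reserve text-and-data-mining rights
  res.set('TDM-Policy', 'https://yoursite.com/tdm-policy.json');
  next();
}
```

Because it only writes two static headers, it adds no measurable latency and can sit in front of every route.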
[UNIQUE INSIGHT] The distinction between "citation crawling" and "training crawling" is the most overlooked aspect of AI content strategy. Most developers block all AI bots or allow all of them. The nuanced approach — allowing search-focused bots like GPTBot and PerplexityBot while blocking training-focused bots like CCBot and Bytespider, combined with TDM headers — gives you maximum visibility with maximum control.
How Should You Optimize Content for AI Extraction?
Content structure matters as much as content delivery. An analysis by SEMrush (2024) found that pages with clear heading hierarchies and answer-first formatting appeared in 47% more featured snippets. AI citation systems follow similar patterns — they extract content that's already structured as standalone, quotable passages.
Citation capsule: Pages with clear heading hierarchies and answer-first formatting appear in 47% more featured snippets according to SEMrush (2024). AI citation platforms follow similar extraction patterns, preferring self-contained paragraphs with specific claims, data points, and source attribution that can be quoted directly in AI-generated answers.
Answer-First Formatting
Start every section with a direct answer to the heading's implied question. Don't build up to the point — state it immediately. AI agents extract the first paragraph under each heading as the most likely answer candidate.
Compare these two approaches:
Bad: "There are many factors to consider when choosing a deployment strategy. Let's explore the various options and their trade-offs before arriving at a recommendation."
Good: "Blue-green deployments eliminate downtime by running two identical production environments and switching traffic between them. This approach costs 2x in infrastructure but reduces deployment risk to near zero."
The second version gives an AI agent a complete, quotable answer in two sentences.
Heading Hierarchy That AI Can Parse
Use H2 headings for main topics and H3 for subtopics. Never skip levels (don't jump from H2 to H4). Phrase headings as questions whenever possible — AI agents match user queries to heading text.
Structure your heading hierarchy like an outline:
```markdown
## How Do You Deploy a Next.js App?        <- Main topic
### Prerequisites                          <- Subtopic
### Step 1: Configure Your Build Settings  <- Subtopic
### Step 2: Set Environment Variables      <- Subtopic

## What Are Common Deployment Errors?      <- Next main topic
### Build Failures                         <- Subtopic
### Runtime Errors                         <- Subtopic
```
Code Blocks Need Language Tags
Always include language identifiers on fenced code blocks. AI agents use these tags to categorize and correctly attribute code snippets:
```javascript
// Good: AI knows this is JavaScript
const server = express();
```
```
// Bad: AI doesn't know the language
const server = express();
```
[PERSONAL EXPERIENCE] We've found that blog posts with language-tagged code blocks get cited 2-3x more often in AI code-generation responses. The language tag isn't just for syntax highlighting — it's metadata that AI systems rely on for accurate retrieval.
How Does Temps Handle AI Crawlers Automatically?
Temps uses Pingora — Cloudflare's open-source proxy framework written in Rust — to detect AI crawler requests at the proxy layer. According to Cloudflare (2024), Pingora handles over a trillion requests per day in production. Temps deploys this same proxy in front of every application, adding AI-aware routing with zero configuration.
Citation capsule: Temps uses Cloudflare's open-source Pingora proxy to detect AI crawler requests via Accept: text/markdown headers and User-Agent strings. The proxy automatically converts HTML responses to clean markdown before returning them to AI agents, making any deployed application — React, Next.js, Vue, static HTML — AI-crawlable without middleware changes.
Automatic Accept Header Detection
When Pingora receives a request with Accept: text/markdown or a known AI bot User-Agent, it routes the request through a conversion pipeline. Your application serves its normal HTML response, and the proxy converts it to markdown before sending it to the crawler.
This means your React SPA, your WordPress blog, your static HTML site — anything deployed on Temps — becomes AI-crawlable automatically. You don't install middleware, configure headers, or modify your application code.
What the Conversion Pipeline Does
The pipeline strips navigation, sidebars, footers, and other non-content elements from the HTML. It preserves:
- Heading hierarchy (H1 through H6)
- Paragraph text and inline formatting
- Ordered and unordered lists
- Code blocks with language tags
- Links with their href attributes
- Images with alt text
- Tables converted to markdown format
The converted markdown gets cached at the proxy layer with a configurable TTL. Subsequent AI crawler requests for the same page hit the cache instead of triggering a new conversion.
Auto-Generated llms.txt
Temps also generates an llms.txt file automatically for every deployed site. It scans your application's routes, identifies content-heavy pages, and builds the llms.txt index with descriptions pulled from your meta tags. You get the benefits of the llms.txt standard without maintaining the file by hand.
[INTERNAL-LINK: deploy any framework -> /docs/getting-started]
FAQ
Do AI agents respect robots.txt?
Yes, all major AI crawlers — GPTBot, ClaudeBot, and PerplexityBot — respect robots.txt directives. OpenAI, Anthropic, and Perplexity have publicly committed to honoring these rules. However, smaller or less-known crawlers may not comply. According to Dark Visitors (2025), there are over 200 known AI-related crawlers, and compliance varies widely. Your best defense is combining robots.txt with server-side User-Agent filtering.
Will serving markdown hurt my regular SEO?
No. The markdown proxy pattern only activates when the request includes Accept: text/markdown or matches a known AI bot User-Agent. Regular browsers and Googlebot receive your normal HTML page with all its meta tags, structured data, and rendered content. The two response paths are completely independent — one doesn't affect the other.
How do I test if my site is AI-crawlable?
Send a curl request with the AI bot headers and check what comes back. Run curl -H "Accept: text/markdown" -H "User-Agent: GPTBot/1.0" https://yoursite.com/blog/your-post and inspect the response. If you get clean markdown with your content's heading hierarchy preserved, you're good. If you get raw HTML or an empty page, your content isn't optimized for AI extraction.
Should I adopt llms.txt now or wait for wider adoption?
Adopt it now. The file takes five minutes to create and costs nothing to maintain. Early adopters like Anthropic and Cloudflare are already serving llms.txt, and AI platforms are building support for it. Even if adoption isn't universal yet, the bots that do support it will find your content faster. There's no downside to having the file — it won't affect your traditional SEO or site performance.
[INTERNAL-LINK: full deployment guide -> /docs/getting-started]
Start Making Your Content AI-Visible
AI-powered search isn't a future trend — it's where millions of users already find answers. Every day your content stays invisible to GPTBot, ClaudeBot, and PerplexityBot, you miss citation opportunities that drive traffic without competing for ten blue links.
The markdown proxy pattern works regardless of your stack. Detect the Accept header, convert your HTML, cache the result. Add an llms.txt file to guide bots to your best pages. Set TDM headers to control how your content gets used. These aren't complex changes — each one takes less than an hour.
You can build it yourself with the middleware example above, or deploy on a platform that handles it automatically. Temps generates markdown endpoints and llms.txt for every deployed app out of the box. Push your code, and your site is already AI-crawlable.
[INTERNAL-LINK: get started with Temps -> /docs/getting-started]