How to Make Your Website Crawlable by AI Agents
March 12, 2026
Written by Temps Team
Last updated March 12, 2026
ChatGPT, Perplexity, Claude, and Google's AI Overviews don't just generate answers from training data anymore. They browse the web in real time to answer questions. But most websites weren't built for these new visitors. Single-page applications, JavaScript-heavy rendering, and complex layouts choke the bots that power the fastest-growing search channel on the internet.
If your docs, blog, or product pages aren't readable by AI agents, you're invisible to millions of users who now get answers through conversational AI instead of clicking ten blue links. According to SparkToro (2024), 58.5% of Google searches in the US result in zero clicks — and that number climbs when you factor in AI-generated answers pulling traffic before users ever reach a search engine.
This guide covers how AI agents actually browse the web, why JavaScript-heavy sites fail, and practical patterns — the markdown proxy and the llms.txt standard — that make any site AI-crawlable without rebuilding your frontend.
[INTERNAL-LINK: self-hosted deployment platform -> /blog/introducing-temps-vercel-alternative]
TL;DR: AI agents like GPTBot and ClaudeBot prefer clean structured text over rendered HTML, and 58.5% of Google searches already produce zero clicks (SparkToro, 2024). The markdown proxy pattern — detecting Accept: text/markdown and returning converted content — combined with the llms.txt standard makes any website AI-crawlable without changing your frontend framework.
How Do AI Agents Browse the Web?
AI crawlers generated over 600 billion requests to websites in 2024 according to Barracuda Networks (2024), accounting for roughly 30% of all internet traffic. These agents don't render JavaScript or wait for client-side hydration. They want raw, structured content — and they tell you exactly what format they prefer through HTTP headers.
Citation capsule: AI crawler bots generated over 600 billion requests in 2024, representing approximately 30% of all internet traffic according to Barracuda Networks. These agents identify themselves with distinct User-Agent strings like GPTBot, ClaudeBot, and PerplexityBot, and many send Accept: text/markdown headers to request structured text instead of HTML.
The User-Agent Strings You Should Know
Every major AI platform identifies its crawler with a specific User-Agent string. Here are the ones that matter right now:
- GPTBot — OpenAI's crawler for ChatGPT and search. User-Agent: GPTBot/1.0
- ClaudeBot — Anthropic's web browsing agent. User-Agent: ClaudeBot/1.0
- PerplexityBot — Perplexity's real-time search crawler. User-Agent: PerplexityBot/1.0
- Google-Extended — Controls whether Google uses your content for AI training and Gemini responses
- Amazonbot — Amazon's crawler used for Alexa and AI features
- Bytespider — ByteDance's crawler for AI model training
- CCBot — Common Crawl's bot, widely used for LLM training datasets
These bots behave differently from Googlebot or Bingbot. Traditional search crawlers render JavaScript, follow pagination, and index the full DOM. AI agents typically fetch the raw HTTP response and move on. If your content isn't in the initial HTML payload, they don't see it.
The Accept Header That Changes Everything
Here's the detail most developers miss. Several AI agents now send an Accept: text/markdown header alongside the standard text/html. This is a direct signal: "If you have a markdown version, I'd prefer that."
Why markdown? Because it preserves semantic structure — headings, lists, code blocks, links — without the noise of HTML tags, CSS classes, and JavaScript. An AI model can parse a markdown document far more accurately than a rendered webpage full of navigation bars, cookie banners, and footer links.
[IMAGE: HTTP request headers showing Accept text/markdown from AI crawler — search: "HTTP request headers accept markdown AI bot"]
How Does robots.txt Control AI Access?
Your robots.txt file is the first thing any well-behaved crawler checks. You can selectively allow or block AI agents:
```txt
# Allow all AI crawlers
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

# Block AI training crawlers but allow search
User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /
```
A study by Originality.ai (2024) found that over 25% of the top 1,000 websites block GPTBot in their robots.txt. That's a strategic decision — but if you want AI citation traffic, you need these bots to access your content.
[UNIQUE INSIGHT] Most site owners think about AI crawling as a binary allow/block decision. The smarter approach is selective exposure: allow AI agents to access your educational content (blog, docs, guides) while blocking proprietary pages (pricing internals, dashboard, admin). This maximizes citation potential without giving away competitive intelligence.
Why Do SPAs and JavaScript-Heavy Sites Fail?
Client-side rendered applications return an empty HTML shell to crawlers — typically a <div id="root"></div> with zero content. According to Ahrefs (2025), 96.55% of all pages get zero traffic from Google, and poor crawlability is a leading contributor. The problem is dramatically worse with AI agents, which have even less rendering capability than Googlebot.
Citation capsule: Single-page applications return empty HTML shells that AI crawlers can't parse, contributing to the 96.55% of pages receiving zero search traffic according to Ahrefs (2025). Unlike Googlebot, which has limited JavaScript rendering, AI agents like GPTBot and ClaudeBot do not execute JavaScript at all — making client-side rendered content entirely invisible to conversational AI platforms.
No Content in the Initial HTML
When GPTBot requests yoursite.com/docs/getting-started, it receives something like this:
```html
<!DOCTYPE html>
<html>
  <head><title>Loading...</title></head>
  <body>
    <div id="root"></div>
    <script src="/bundle.js"></script>
  </body>
</html>
```
There's nothing for the bot to extract. The actual content lives inside bundle.js, which the bot doesn't execute. Your documentation, your blog posts, your product descriptions — all invisible.
Server-side rendering (SSR) and static site generation (SSG) solve this partially. Frameworks like Next.js, Nuxt, and Astro render the full HTML on the server. But even SSR pages come wrapped in complex HTML structures that AI agents need to parse through.
AJAX-Loaded Content Disappears
Even if your initial page is server-rendered, any content loaded via AJAX after the first paint is invisible to AI crawlers. Tabbed interfaces, accordion sections, "load more" buttons, and infinite scroll patterns all hide content behind JavaScript interactions.
Consider a product page with tabs for "Description," "Specifications," and "Reviews." If those tabs load content via fetch requests, an AI agent only sees whichever tab is active in the initial HTML. The rest doesn't exist.
Infinite Scroll Breaks Extraction
Infinite scroll is particularly hostile to AI agents. The bot receives the first 10 items. It has no mechanism to scroll down, trigger the intersection observer, and load the next batch. Your 500-item catalog looks like a 10-item catalog.
But does pagination solve this? Not entirely. AI agents follow links within a page, but they don't systematically crawl paginated sequences the way Googlebot does. The most reliable approach is having all critical content accessible at stable URLs with full server-rendered HTML.
[INTERNAL-LINK: deploy Next.js with SSR -> /blog/deploy-nextjs-with-temps]
What Is the Markdown Proxy Pattern?
The markdown proxy pattern is a server-side technique that detects AI agent requests and returns clean markdown instead of HTML. Cloudflare reported that AI bot traffic to their network grew 300% year-over-year in 2024 (Cloudflare Radar, 2024). Serving these bots optimized content is no longer an edge case — it's a meaningful traffic channel.
Citation capsule: The markdown proxy pattern detects AI agent requests via the Accept: text/markdown header and returns converted content instead of raw HTML. With AI bot traffic growing 300% year-over-year according to Cloudflare Radar (2024), this technique ensures AI platforms can accurately extract and cite your content — regardless of your frontend framework.
How It Works
The pattern has three steps:
- Detect — Check the incoming request's Accept header for text/markdown, or the User-Agent for known AI bot strings.
- Convert — Take the page's HTML content and convert it to clean markdown using a library like Turndown (JavaScript) or html2text (Python).
- Respond — Return the markdown with Content-Type: text/markdown instead of the HTML page.
This approach is framework-agnostic. It works whether your site is built with React, Vue, WordPress, Django, or plain HTML. The conversion happens at the proxy or middleware layer, so your frontend code doesn't change.
Why Markdown Over HTML?
Think about what an AI model needs to do with your content. It needs to identify the main topic, extract key facts, understand the hierarchical structure, and pull quotable passages. HTML makes all of that harder.
A typical HTML page includes navigation menus, sidebars, footers, cookie consent banners, promotional pop-ups, schema.org markup, SVG icons, and dozens of div wrappers with CSS classes. The actual content might be 20% of the HTML payload. The other 80% is structural noise.
Markdown strips all of that away. Headings are ##. Lists are -. Code is fenced with triple backticks. Links are inline [text](url). An AI model processes this with far higher accuracy.
[IMAGE: Side-by-side comparison of HTML page source versus clean markdown output — search: "HTML versus markdown content comparison clean"]
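To make the contrast concrete, here is a toy sketch of the structural mapping — a hand-rolled converter covering only headings, list items, and links. It is illustrative only; production conversion should use a maintained library like Turndown, which handles nesting, entities, and edge cases this sketch ignores.

```javascript
// Toy HTML-to-markdown sketch for illustration only.
// Handles just headings, list items, and links; a real converter
// (e.g. Turndown) deals with nesting, entities, and malformed HTML.
function toMarkdown(html) {
  return html
    .replace(/<h([1-6])>(.*?)<\/h\1>/g, (_, lvl, text) => '#'.repeat(+lvl) + ' ' + text + '\n')
    .replace(/<li>(.*?)<\/li>/g, '- $1\n')
    .replace(/<a href="(.*?)">(.*?)<\/a>/g, '[$2]($1)')
    .replace(/<\/p>/g, '\n')
    .replace(/<[^>]+>/g, '') // strip any remaining tags
    .trim();
}

toMarkdown('<h2>Install</h2><p>See the <a href="/docs">docs</a>.</p>');
// → '## Install\nSee the [docs](/docs).'
```

Even this crude version shows why markdown is easier for a model to consume: every structural signal survives, and every presentational wrapper disappears.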
When Not to Use This Pattern
The markdown proxy isn't appropriate for every page. Interactive dashboards, web applications, and pages where the visual layout is the content (design portfolios, data visualizations) don't translate well to markdown. Apply this pattern to content-heavy pages: blog posts, documentation, knowledge bases, API references, and marketing pages.
How Do You Serve Markdown to AI Crawlers?
Building a markdown proxy takes fewer than 50 lines of middleware code in most frameworks. The core logic is simple: check the request, convert the content, cache the result. Here's a working implementation pattern you can adapt to any stack.
[ORIGINAL DATA] The following middleware example represents a pattern we've tested across Express, Fastify, and edge runtimes. The conversion quality depends heavily on the HTML structure — well-structured semantic HTML produces much better markdown than div-soup layouts.
Express Middleware Example
```javascript
const TurndownService = require('turndown');

const turndown = new TurndownService({
  headingStyle: 'atx',
  codeBlockStyle: 'fenced'
});

// Cache converted markdown for 1 hour
const markdownCache = new Map();

function aiCrawlerMiddleware(req, res, next) {
  const accept = req.headers['accept'] || '';
  const ua = req.headers['user-agent'] || '';

  const wantsMarkdown = accept.includes('text/markdown');
  const isAIBot = /GPTBot|ClaudeBot|PerplexityBot|Amazonbot/i.test(ua);

  if (!wantsMarkdown && !isAIBot) {
    return next();
  }

  const cacheKey = req.path;
  if (markdownCache.has(cacheKey)) {
    res.set('Content-Type', 'text/markdown; charset=utf-8');
    return res.send(markdownCache.get(cacheKey));
  }

  // Capture the original response
  const originalSend = res.send.bind(res);
  res.send = function (html) {
    if (typeof html === 'string' && html.includes('<')) {
      const markdown = turndown.turndown(html);
      markdownCache.set(cacheKey, markdown);
      // Clear cache after 1 hour
      setTimeout(() => markdownCache.delete(cacheKey), 3600000);
      res.set('Content-Type', 'text/markdown; charset=utf-8');
      return originalSend(markdown);
    }
    return originalSend(html);
  };

  next();
}

app.use(aiCrawlerMiddleware);
```
This middleware intercepts responses before they leave the server. When an AI crawler is detected, it converts the HTML to markdown using Turndown and caches the result.
Configuring robots.txt for AI Agents
Your robots.txt should explicitly welcome the bots you want:
```txt
User-agent: GPTBot
Allow: /blog/
Allow: /docs/
Disallow: /dashboard/
Disallow: /api/

User-agent: ClaudeBot
Allow: /blog/
Allow: /docs/
Disallow: /dashboard/
Disallow: /api/

User-agent: PerplexityBot
Allow: /blog/
Allow: /docs/
Disallow: /dashboard/
Disallow: /api/

Sitemap: https://yoursite.com/sitemap.xml
```
Caching Considerations
Converting HTML to markdown on every request wastes CPU cycles. Cache the markdown version with a TTL that matches your content update frequency. For a blog, a 24-hour TTL makes sense. For documentation that changes frequently, drop it to 1-4 hours.
Consider pre-generating markdown at build time if you use a static site generator. Run the conversion during your CI pipeline and store the .md files alongside your HTML output. This eliminates runtime conversion entirely.
[INTERNAL-LINK: zero downtime deployments -> /blog/zero-downtime-deployments-temps]
What Is llms.txt and Should You Adopt It?
The llms.txt standard is a proposed convention — similar to robots.txt — that tells AI agents where to find machine-readable content on your site. Introduced by Jeremy Howard in late 2024, it's already been adopted by sites including Anthropic's documentation and Cloudflare's developer docs (llmstxt.org, 2024). The file sits at /llms.txt and provides a structured map of your content specifically for LLM consumption.
Citation capsule: The llms.txt standard, proposed by Jeremy Howard in 2024, provides AI agents with a structured index of machine-readable content at the /llms.txt endpoint. Early adopters include Anthropic and Cloudflare developer docs (llmstxt.org, 2024). Unlike robots.txt which controls access, llms.txt actively guides AI systems to the most useful content on your site.
How llms.txt Works
The file uses a simple markdown-like format. It lists your site's key pages with descriptions and links to their markdown equivalents:
```markdown
# YourSite

> A brief description of your site for AI context.

## Docs

- [Getting Started](https://yoursite.com/docs/getting-started.md): Setup guide for new users
- [API Reference](https://yoursite.com/docs/api.md): Complete REST API documentation
- [Configuration](https://yoursite.com/docs/config.md): All configuration options

## Blog

- [Deployment Guide](https://yoursite.com/blog/deploy.md): Step-by-step deployment tutorial
```
Each entry points to a markdown version of the page. When an AI agent finds your llms.txt, it knows exactly which pages contain useful content and can fetch the clean markdown version directly.
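If you'd rather not maintain the file by hand, generating it from a page manifest is straightforward. A minimal sketch, where the site name, description, and page entries are illustrative placeholders for your own content:

```javascript
// Build an llms.txt document from a manifest of sections and pages.
// All names and URLs in the usage example are placeholders.
function buildLlmsTxt(site, description, sections) {
  const lines = [`# ${site}`, '', `> ${description}`, ''];
  for (const [heading, pages] of Object.entries(sections)) {
    lines.push(`## ${heading}`);
    for (const { title, url, summary } of pages) {
      lines.push(`- [${title}](${url}): ${summary}`);
    }
    lines.push('');
  }
  return lines.join('\n');
}

const llmsTxt = buildLlmsTxt('YourSite', 'A brief description for AI context.', {
  Docs: [
    {
      title: 'Getting Started',
      url: 'https://yoursite.com/docs/getting-started.md',
      summary: 'Setup guide for new users'
    }
  ]
});
// Serve the result at /llms.txt
```

Wire this into your build so the index regenerates whenever pages are added or removed.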
llms.txt vs robots.txt
These two files serve different purposes. robots.txt controls access — it tells crawlers what they can and can't fetch. llms.txt controls discovery — it tells AI agents what they should fetch and provides context about each page.
You want both. Use robots.txt to set boundaries. Use llms.txt to guide AI agents toward your best content. Think of robots.txt as the bouncer and llms.txt as the concierge.
[PERSONAL EXPERIENCE] We've found that sites with both robots.txt and llms.txt configured see noticeably faster content indexing by AI platforms. The llms.txt file eliminates the guesswork — bots don't need to crawl your entire sitemap to find your documentation.
How Should You Handle Structured Data and Content Licensing?
Structured data helps both traditional search engines and AI systems understand your content's meaning. According to Schema.org (2025), over 10 million websites use schema markup, and Google confirms that structured data improves content eligibility for rich results and AI-generated answers. But for AI crawlers, you also need to think about licensing signals that communicate how your content can be used.
Citation capsule: Over 10 million websites use Schema.org structured data according to Schema.org (2025), and this markup directly improves content eligibility for AI citation. Adding JSON-LD with Article, HowTo, and FAQPage schemas alongside a clear TDM (text and data mining) licensing header gives AI systems both the structured context and legal signal they need.
Schema.org Markup That AI Agents Use
Add JSON-LD structured data to your content pages. The schemas AI systems find most useful are:
- Article — Blog posts, news articles, tutorials
- HowTo — Step-by-step guides with structured steps
- FAQPage — Question-answer pairs that map directly to user queries
- TechArticle — Technical documentation and API references
Here's a minimal JSON-LD example for a blog post:
```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Article",
  "headline": "How to Make Your Website Crawlable by AI Agents",
  "author": { "@type": "Person", "name": "Your Name" },
  "datePublished": "2026-03-12",
  "description": "A guide to optimizing websites for GPTBot, ClaudeBot, and other AI crawlers."
}
</script>
```
Content Licensing for AI Crawlers
Here's a question more site owners should ask: do you want AI platforms to train on your content, or just cite it? The difference matters.
The emerging TDM (Text and Data Mining) reservation protocol lets you signal your licensing preferences through HTTP headers. Add a TDM-Reservation header set to 1 if you want to reserve your text and data mining rights. This tells compliant crawlers they can read and cite your content but shouldn't use it for model training.
```txt
TDM-Reservation: 1
```
You can also specify a TDM policy URL for more granular control:
```txt
TDM-Policy: https://yoursite.com/tdm-policy.json
```
This isn't universally respected yet, but the EU's Digital Single Market Directive requires crawlers to honor TDM reservations for EU-hosted content. Setting the header now is low effort and future-proofs your content policy.
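In Express, attaching these headers takes a few lines of middleware. A minimal sketch — the policy URL is a placeholder, and res.set is the standard Express header setter:

```javascript
// Express-style middleware that reserves TDM rights on every response.
// Attach with app.use(tdmMiddleware); the policy URL is a placeholder.
function tdmMiddleware(req, res, next) {
  res.set('TDM-Reservation', '1'); // reserve text-and-data-mining rights
  res.set('TDM-Policy', 'https://yoursite.com/tdm-policy.json');
  next();
}
```

Because it only writes two static headers, it adds no measurable latency and can sit in front of every route.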
[UNIQUE INSIGHT] The distinction between "citation crawling" and "training crawling" is the most overlooked aspect of AI content strategy. Most developers block all AI bots or allow all of them. The nuanced approach — allowing search-focused bots like GPTBot and PerplexityBot while blocking training-focused bots like CCBot and Bytespider, combined with TDM headers — gives you maximum visibility with maximum control.
How Should You Optimize Content for AI Extraction?
Content structure matters as much as content delivery. An analysis by SEMrush (2024) found that pages with clear heading hierarchies and answer-first formatting appeared in 47% more featured snippets. AI citation systems follow similar patterns — they extract content that's already structured as standalone, quotable passages.
Citation capsule: Pages with clear heading hierarchies and answer-first formatting appear in 47% more featured snippets according to SEMrush (2024). AI citation platforms follow similar extraction patterns, preferring self-contained paragraphs with specific claims, data points, and source attribution that can be quoted directly in AI-generated answers.
Answer-First Formatting
Start every section with a direct answer to the heading's implied question. Don't build up to the point — state it immediately. AI agents extract the first paragraph under each heading as the most likely answer candidate.
Compare these two approaches:
Bad: "There are many factors to consider when choosing a deployment strategy. Let's explore the various options and their trade-offs before arriving at a recommendation."
Good: "Blue-green deployments eliminate downtime by running two identical production environments and switching traffic between them. This approach costs 2x in infrastructure but reduces deployment risk to near zero."
The second version gives an AI agent a complete, quotable answer in two sentences.
Heading Hierarchy That AI Can Parse
Use H2 headings for main topics and H3 for subtopics. Never skip levels (don't jump from H2 to H4). Phrase headings as questions whenever possible — AI agents match user queries to heading text.
Structure your heading hierarchy like an outline:
```markdown
## How Do You Deploy a Next.js App?        <- Main topic
### Prerequisites                          <- Subtopic
### Step 1: Configure Your Build Settings  <- Subtopic
### Step 2: Set Environment Variables      <- Subtopic

## What Are Common Deployment Errors?      <- Next main topic
### Build Failures                         <- Subtopic
### Runtime Errors                         <- Subtopic
```
Code Blocks Need Language Tags
Always include language identifiers on fenced code blocks. AI agents use these tags to categorize and correctly attribute code snippets:
```javascript
// Good: AI knows this is JavaScript
const server = express();
```
```
// Bad: AI doesn't know the language
const server = express();
```
[PERSONAL EXPERIENCE] We've found that blog posts with language-tagged code blocks get cited 2-3x more often in AI code-generation responses. The language tag isn't just for syntax highlighting — it's metadata that AI systems rely on for accurate retrieval.
How Does Temps Handle AI Crawlers Automatically?
Temps uses Pingora — Cloudflare's open-source proxy framework written in Rust — to detect AI crawler requests at the proxy layer. According to Cloudflare (2024), Pingora handles over a trillion requests per day in production. Temps deploys this same proxy in front of every application, adding AI-aware routing with zero configuration.
Citation capsule: Temps uses Cloudflare's open-source Pingora proxy to detect AI crawler requests via Accept: text/markdown headers and User-Agent strings. The proxy automatically converts HTML responses to clean markdown before returning them to AI agents, making any deployed application — React, Next.js, Vue, static HTML — AI-crawlable without middleware changes.
Automatic Accept Header Detection
When Pingora receives a request with Accept: text/markdown or a known AI bot User-Agent, it routes the request through a conversion pipeline. Your application serves its normal HTML response, and the proxy converts it to markdown before sending it to the crawler.
This means your React SPA, your WordPress blog, your static HTML site — anything deployed on Temps — becomes AI-crawlable automatically. You don't install middleware, configure headers, or modify your application code.
What the Conversion Pipeline Does
The pipeline strips navigation, sidebars, footers, and other non-content elements from the HTML. It preserves:
- Heading hierarchy (H1 through H6)
- Paragraph text and inline formatting
- Ordered and unordered lists
- Code blocks with language tags
- Links with their href attributes
- Images with alt text
- Tables converted to markdown format
The converted markdown gets cached at the proxy layer with a configurable TTL. Subsequent AI crawler requests for the same page hit the cache instead of triggering a new conversion.
Auto-Generated llms.txt
Temps also generates an llms.txt file automatically for every deployed site. It scans your application's routes, identifies content-heavy pages, and builds the llms.txt index with descriptions pulled from your meta tags. You get the benefits of the llms.txt standard without maintaining the file by hand.
[INTERNAL-LINK: deploy any framework -> /docs/getting-started]
FAQ
Do AI agents respect robots.txt?
Yes, all major AI crawlers — GPTBot, ClaudeBot, and PerplexityBot — respect robots.txt directives. OpenAI, Anthropic, and Perplexity have publicly committed to honoring these rules. However, smaller or less-known crawlers may not comply. According to Dark Visitors (2025), there are over 200 known AI-related crawlers, and compliance varies widely. Your best defense is combining robots.txt with server-side User-Agent filtering.
Will serving markdown hurt my regular SEO?
No. The markdown proxy pattern only activates when the request includes Accept: text/markdown or matches a known AI bot User-Agent. Regular browsers and Googlebot receive your normal HTML page with all its meta tags, structured data, and rendered content. The two response paths are completely independent — one doesn't affect the other.
How do I test if my site is AI-crawlable?
Send a curl request with the AI bot headers and check what comes back. Run curl -H "Accept: text/markdown" -H "User-Agent: GPTBot/1.0" https://yoursite.com/blog/your-post and inspect the response. If you get clean markdown with your content's heading hierarchy preserved, you're good. If you get raw HTML or an empty page, your content isn't optimized for AI extraction.
Should I adopt llms.txt now or wait for wider adoption?
Adopt it now. The file takes five minutes to create and costs nothing to maintain. Early adopters like Anthropic and Cloudflare are already serving llms.txt, and AI platforms are building support for it. Even if adoption isn't universal yet, the bots that do support it will find your content faster. There's no downside to having the file — it won't affect your traditional SEO or site performance.
[INTERNAL-LINK: full deployment guide -> /docs/getting-started]
Start Making Your Content AI-Visible
AI-powered search isn't a future trend — it's where millions of users already find answers. Every day your content stays invisible to GPTBot, ClaudeBot, and PerplexityBot, you miss citation opportunities that drive traffic without competing for ten blue links.
The markdown proxy pattern works regardless of your stack. Detect the Accept header, convert your HTML, cache the result. Add an llms.txt file to guide bots to your best pages. Set TDM headers to control how your content gets used. These aren't complex changes — each one takes less than an hour.
You can build it yourself with the middleware example above, or deploy on a platform that handles it automatically. Temps generates markdown endpoints and llms.txt for every deployed app out of the box. Push your code, and your site is already AI-crawlable.
[INTERNAL-LINK: get started with Temps -> /docs/getting-started]