

Technical Checklist: How to Prepare Your Website for AI Indexing

Complete technical checklist for preparing your website for AI bot indexing: robots.txt, llms.txt, JSON-LD, Schema.org, sitemap.xml, loading speed, full table of all AI bots and user agents.

Vladislav Puchkov
Founder of GEO Scout, GEO optimization expert

Technical website preparation is the foundation of GEO optimization. Without it, expert content and external mentions lose effectiveness: AI systems simply will not be able to correctly read and index your website.

Complete table of AI bots and user agents

The first thing to know is which bots are trying to access your website. Here is the complete table of AI user agents as of March 2026.

| Bot | Company | AI product | User-Agent | Purpose |
|---|---|---|---|---|
| GPTBot | OpenAI | ChatGPT | GPTBot/1.0 | Indexing for ChatGPT with web search |
| ChatGPT-User | OpenAI | ChatGPT | ChatGPT-User | ChatGPT requests when browsing the web |
| OAI-SearchBot | OpenAI | ChatGPT Search | OAI-SearchBot/1.0 | OpenAI search index |
| ClaudeBot | Anthropic | Claude | ClaudeBot/1.0 | Indexing for Claude |
| PerplexityBot | Perplexity | Perplexity | PerplexityBot | Perplexity web search |
| Google-Extended | Google | Gemini | Google-Extended | Data for Gemini training |
| Googlebot | Google | AI Overview, AI Mode | Googlebot | Unified bot for search and AI |
| YandexBot | Yandex | Alice / Neurosearch | YandexBot/3.0 | Unified bot for search and AI |
| Bytespider | ByteDance | Doubao / TikTok AI | Bytespider | Indexing for ByteDance AI |
| CCBot | Common Crawl | Multiple LLMs | CCBot/2.0 | Data for model training |
| Amazonbot | Amazon | Alexa / Amazon AI | Amazonbot | Indexing for Amazon AI services |
| AppleBot-Extended | Apple | Apple Intelligence | AppleBot-Extended | Data for Apple AI |
| cohere-ai | Cohere | Command, Embed | cohere-ai | Indexing for Cohere AI |
| DeepSeekBot | DeepSeek | DeepSeek | DeepSeekBot | Indexing for DeepSeek |
| Meta-ExternalAgent | Meta | Meta AI | Meta-ExternalAgent/1.0 | Indexing for Meta AI |

1. robots.txt: access control for AI bots

What to check

Open your-site.com/robots.txt and check if there are any blocks for AI bots.

Problematic configurations

# BAD: blocks all AI bots
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

If you see rules like these, AI systems cannot get up-to-date data from your website: ChatGPT, Claude, and Perplexity will rely exclusively on third-party sources.

Recommended configuration

# Allow AI bots access to public pages
User-agent: GPTBot
Allow: /
Disallow: /admin/
Disallow: /api/
Disallow: /account/

User-agent: ClaudeBot
Allow: /
Disallow: /admin/
Disallow: /api/
Disallow: /account/

User-agent: PerplexityBot
Allow: /
Disallow: /admin/
Disallow: /api/
Disallow: /account/

User-agent: Google-Extended
Allow: /

User-agent: CCBot
Allow: /

What to block

  • /admin/, /api/, /account/ — internal sections
  • Pages with user personal data
  • Internal tools and dashboards
  • Duplicate content (print versions, AMP pages without canonical)

What NOT to block

  • Homepage, service and product pages
  • Blog and expert articles
  • FAQ sections
  • About page
  • Case studies and portfolio
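These rules are easy to verify programmatically with Python's built-in robots.txt parser. The sketch below is illustrative (the helper name and bot list are my own, mirroring the table above); it reports which AI user agents may fetch a given page:

```python
# Check which AI user agents a robots.txt allows to fetch a page.
# Minimal sketch using only the standard library.
from urllib.robotparser import RobotFileParser

AI_BOTS = ["GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended", "CCBot"]

def check_ai_access(robots_txt: str, page_url: str) -> dict:
    """Return {user_agent: allowed} given the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {bot: parser.can_fetch(bot, page_url) for bot in AI_BOTS}

# The "BAD" configuration above blocks GPTBot site-wide:
rules = "User-agent: GPTBot\nDisallow: /\n"
access = check_ai_access(rules, "https://example.com/blog/post")
# access["GPTBot"] is False; bots without a matching rule default to allowed
```

In production you would fetch your-site.com/robots.txt first (for example with RobotFileParser.set_url plus read) instead of passing the text inline.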

2. llms.txt: instructions for AI

What it is

llms.txt is a file at the root of the website that provides AI systems with structured information about the company and the site. It is not a formal web standard yet, but a convention gaining traction in the AI community.

File format

# Company Name
 
> Brief company description in one sentence.
 
## About
 
Extended description: what it does, for whom, key advantages.
 
## Key pages
 
- [Product/Service](https://example.com/product): Description
- [Pricing](https://example.com/pricing): Description
- [About](https://example.com/about): Description
- [Blog](https://example.com/blog): Description
- [FAQ](https://example.com/faq): Description
 
## Contacts
 
- Website: https://example.com
- Email: info@example.com
- Phone: +1 (xxx) xxx-xx-xx

Practical example

# GEO Scout
 
> Brand visibility monitoring platform across 9 AI providers.
 
## About
 
GEO Scout is a full-cycle GEO (Generative Engine Optimization) platform.
Daily monitoring of brand presence in ChatGPT, Claude, DeepSeek,
Gemini, Google AI Mode, Google AI Overview, Grok, Perplexity, and Yandex with Alice.
 
## Key pages
 
- [Home](https://geoscout.pro): Platform overview and features
- [Pricing](https://geoscout.pro/pricing): Plans and prices
- [Blog](https://geoscout.pro/blog): Expert articles on GEO
- [Ratings](https://geoscout.pro/ratings): Public AI brand visibility ratings
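Since llms.txt is plain Markdown, it can be generated from data you already maintain. A minimal sketch, assuming nothing beyond the format shown above (the helper and its field names are illustrative, not part of any spec):

```python
# Assemble an llms.txt body from structured company data.
# Output follows the file format shown above.
def build_llms_txt(name, summary, about, pages, contacts):
    lines = [f"# {name}", "", f"> {summary}", "", "## About", "", about, ""]
    lines += ["## Key pages", ""]
    lines += [f"- [{title}]({url}): {desc}" for title, url, desc in pages]
    lines += ["", "## Contacts", ""]
    lines += [f"- {label}: {value}" for label, value in contacts]
    return "\n".join(lines) + "\n"

doc = build_llms_txt(
    name="Example Co",
    summary="One-sentence company description.",
    about="What it does, for whom, and key advantages.",
    pages=[("Home", "https://example.com", "Overview"),
           ("Pricing", "https://example.com/pricing", "Plans")],
    contacts=[("Website", "https://example.com"),
              ("Email", "info@example.com")],
)
```

Regenerating the file from source data on each deploy keeps it in sync with the site instead of rotting as a hand-edited document.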

3. JSON-LD / Schema.org: structured data

Structured data helps AI systems accurately understand page content. This is a critical factor for GEO — AI is more likely to cite data it can unambiguously interpret.

Priority markup types

| Type | Where to use | Impact on AI |
|---|---|---|
| Organization | Homepage, About page | AI gets basic brand information |
| Product | Product pages | AI can recommend specific products |
| Service | Service pages | AI understands what you offer |
| FAQPage | FAQ sections, articles with FAQ | AI extracts ready-made answers to questions |
| Article | Blog, expert articles | AI evaluates authorship and expertise |
| HowTo | Guides, instructions | AI cites step-by-step instructions |
| Review / AggregateRating | Product pages, reviews | AI conveys ratings and opinions |
| LocalBusiness | Contacts, branch listings | AI recommends in local queries |
| BreadcrumbList | All pages | AI understands site structure |
| SoftwareApplication | SaaS products | AI correctly classifies the product |

Example: Organization

{
  "@context": "https://schema.org",
  "@type": "Organization",
  "name": "Company Name",
  "url": "https://example.com",
  "logo": "https://example.com/logo.png",
  "description": "Brief description with key facts",
  "foundingDate": "2023",
  "numberOfEmployees": {
    "@type": "QuantitativeValue",
    "value": 50
  },
  "sameAs": [
    "https://t.me/company",
    "https://vk.com/company"
  ]
}

Example: FAQPage

{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {
      "@type": "Question",
      "name": "Customer question?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Specific answer with facts and figures."
      }
    }
  ]
}
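Rather than hand-writing JSON-LD, it is safer to serialize it from a data structure, which rules out syntax errors in the markup. A sketch (the variable names are illustrative):

```python
# Serialize an FAQPage object and wrap it in the <script> tag that
# belongs in the page <head>. json.dumps guarantees valid JSON, so
# broken markup cannot reach the page.
import json

faq = {
    "@context": "https://schema.org",
    "@type": "FAQPage",
    "mainEntity": [
        {
            "@type": "Question",
            "name": "Customer question?",
            "acceptedAnswer": {
                "@type": "Answer",
                "text": "Specific answer with facts and figures.",
            },
        }
    ],
}

snippet = (
    '<script type="application/ld+json">\n'
    + json.dumps(faq, ensure_ascii=False, indent=2)
    + "\n</script>"
)
```

Whatever generates the markup, validate the result (for example with Google's Rich Results Test) before relying on it.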

4. Sitemap.xml: map for AI bots

Basic requirements

  • File accessible at your-site.com/sitemap.xml
  • Specified in robots.txt: Sitemap: https://your-site.com/sitemap.xml
  • Contains all public pages that AI should see
  • <lastmod> tags are current (not static dates)
  • Size does not exceed 50,000 URLs (for large sites — sitemap index)

Page priority for AI

Not all pages are equally important for AI indexing. Prioritize:

  1. Homepage
  2. Service/product pages
  3. FAQ sections
  4. Expert articles and guides
  5. About page
  6. Case studies with data
  7. Pricing pages

Common mistakes

  • Sitemap contains pages blocked in robots.txt
  • Outdated <lastmod> (AI systems with web search prefer fresh content)
  • Missing sitemap index for sites with 10,000+ pages
  • Broken URLs in sitemap

5. Loading speed and Core Web Vitals

AI bots, like search crawlers, prefer fast websites. Additionally, some AI systems (Perplexity, Google AI) show page previews — a slow site creates a bad impression.

Target metrics

| Metric | Good | Acceptable | Poor |
|---|---|---|---|
| LCP (Largest Contentful Paint) | < 2.5 sec | 2.5-4.0 sec | > 4.0 sec |
| INP (Interaction to Next Paint) | < 200 ms | 200-500 ms | > 500 ms |
| CLS (Cumulative Layout Shift) | < 0.1 | 0.1-0.25 | > 0.25 |
| TTFB (Time to First Byte) | < 800 ms | 800-1800 ms | > 1800 ms |

Quick optimizations

  • Enable compression (gzip/brotli)
  • Configure static asset caching
  • Optimize images (WebP/AVIF, lazy loading)
  • Minify CSS and JavaScript
  • Use a CDN
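The thresholds in the table translate directly into a small rating helper, useful when processing raw PageSpeed API numbers in bulk (the helper itself is a sketch of mine; the values follow the table above):

```python
# Bucket raw Core Web Vitals values into good / acceptable / poor
# using the thresholds from the table above.
THRESHOLDS = {
    "LCP": (2.5, 4.0),      # seconds
    "INP": (200, 500),      # milliseconds
    "CLS": (0.1, 0.25),     # unitless score
    "TTFB": (800, 1800),    # milliseconds
}

def rate(metric: str, value: float) -> str:
    good, acceptable = THRESHOLDS[metric]
    if value < good:
        return "good"
    return "acceptable" if value <= acceptable else "poor"
```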

The website GEO audit in GEO Scout automatically checks loading speed and Core Web Vitals via the PageSpeed API.


6. Mobile optimization

70%+ of requests to AI assistants come from mobile devices (especially voice queries to Alice). If your website performs poorly on mobile, an AI recommendation will lead to a negative user experience.

Requirements

  • Responsive design (not a separate m.site)
  • Text readable without zooming
  • Buttons and links with adequate touch targets (minimum 44x44 px)
  • Forms adapted for mobile input
  • No horizontal scrolling

7. Meta tags and content markup

Title and description

AI systems use meta tags for quick assessment of page content.

<title>Short, specific title with brand — up to 60 characters</title>
<meta name="description" content="Description with key facts and figures.
  Specifics instead of generic phrases. Up to 160 characters.">

Canonical URL

Mandatory for all pages. AI bots may index multiple versions of the same page (http/https, www/non-www, with/without parameters). Canonical specifies the primary version.

<link rel="canonical" href="https://example.com/page">

Open Graph and Twitter Cards

AI systems that work with social data (Grok) consider OG tags. Fill in:

<meta property="og:title" content="Title">
<meta property="og:description" content="Description">
<meta property="og:image" content="Image URL">
<meta property="og:type" content="website">

H1-H3 headings

Heading hierarchy is critical for AI — neural networks use it to understand content structure:

  • H1 — one per page, contains the main topic
  • H2 — main sections (AI often cites content by H2)
  • H3 — subsections with specifics
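These rules can be linted automatically. The sketch below uses Python's built-in HTML parser; the class, its limits, and the checks are my own minimal take on the recommendations above, not a production validator:

```python
# Lint a page's <title>, meta description, and H1 count against
# the limits recommended above. Standard library only.
from html.parser import HTMLParser

class MetaLint(HTMLParser):
    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self.h1_count = 0
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "h1":
            self.h1_count += 1
        elif tag == "meta" and attrs.get("name") == "description":
            self.description = attrs.get("content", "")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

    def problems(self):
        issues = []
        if len(self.title) > 60:
            issues.append("title longer than 60 characters")
        if len(self.description) > 160:
            issues.append("description longer than 160 characters")
        if self.h1_count != 1:
            issues.append(f"expected exactly one H1, found {self.h1_count}")
        return issues

lint = MetaLint()
lint.feed("<html><head><title>Short title</title>"
          '<meta name="description" content="Concise description.">'
          "</head><body><h1>Topic</h1><h1>Second</h1></body></html>")
# lint.problems() flags the duplicate H1
```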

Automated audit: what GEO Scout checks

GEO Scout includes an automatic website GEO audit that checks all technical factors from this checklist:

  • robots.txt — accessibility for AI bots
  • Schema.org — presence and correctness of JSON-LD markup
  • PageSpeed — Core Web Vitals and loading speed
  • Meta tags — title, description, canonical, OG
  • Mobile adaptation — responsive and touch-friendly
  • SSL — presence and validity of certificate
  • Sitemap — presence and freshness

Audit results automatically flow into the Command Center, where AI prioritizes technical tasks by their impact on visibility in neural networks. Technical issues with high impact (for example, GPTBot blocking in robots.txt) receive maximum priority.


Checklist: website technical readiness for AI

robots.txt

  • robots.txt file exists and is accessible
  • GPTBot is not blocked
  • ClaudeBot is not blocked
  • PerplexityBot is not blocked
  • Google-Extended is not blocked
  • Internal sections are closed (/admin, /api, /account)
  • Path to sitemap.xml is specified

llms.txt

  • llms.txt file is created and placed at the site root
  • Contains a brief company description
  • Key pages with links are listed
  • Length does not exceed 500 words

Structured data (JSON-LD)

  • Organization — on homepage and About page
  • Product / Service — on product and service pages
  • FAQPage — on FAQ section and in articles
  • Article — on expert articles (with author specified)
  • BreadcrumbList — on all pages
  • Markup is valid (verify via Google Rich Results Test)

Sitemap.xml

  • File exists and is accessible
  • Contains all public pages
  • <lastmod> dates are current
  • No broken URLs
  • Specified in robots.txt

Speed and performance

  • LCP < 2.5 seconds
  • INP < 200 ms
  • CLS < 0.1
  • Compression enabled (gzip/brotli)
  • Images optimized (WebP/AVIF)
  • Static caching configured

Mobile optimization

  • Responsive design
  • Text readable without zoom
  • Touch targets >= 44x44 px
  • No horizontal scrolling

Meta tags and markup

  • Title on every page (unique, up to 60 characters)
  • Description on every page (with facts, up to 160 characters)
  • Canonical URL on every page
  • Open Graph tags filled in
  • H1-H3 hierarchy correct (one H1 per page)
  • SSL certificate valid

Frequently asked questions

Which AI bots index websites?
The main AI bots: GPTBot (OpenAI/ChatGPT), ClaudeBot (Anthropic/Claude), PerplexityBot (Perplexity), Google-Extended (Gemini), Bytespider (ByteDance), CCBot (Common Crawl, used for model training), Amazonbot (Amazon/Alexa), FacebookBot, AppleBot-Extended (Apple Intelligence), cohere-ai. Yandex does not have a separate AI bot — data for Alice is taken from the main YandexBot index.
What is llms.txt and does my website need it?
llms.txt is a standard proposed for providing LLM systems with structured information about a website. The file is placed at the root of the site (example.com/llms.txt) and contains a brief company description, key pages, and context for AI. It is analogous to robots.txt, but for helping AI understand the site, not for access control. In 2026, support is not yet widespread, but Perplexity and other AI systems are starting to consider it.
Should I allow AI bots to index my website?
Yes, if you want AI to recommend your brand. By default, most AI bots have access to public pages. But if your robots.txt blocks GPTBot or ClaudeBot — AI systems will not be able to get up-to-date information from your website and will rely only on third-party sources that you do not control.
Which Schema.org types are most important for AI?
Priority types: Organization (company information), Product/Service (product descriptions), FAQPage (Q&A — actively used by AI to form answers), Article (expert articles), Review/AggregateRating (reviews and ratings), HowTo (step-by-step instructions), LocalBusiness (for local businesses). FAQ markup is especially important — AI frequently extracts ready-made answers from it.
Does website speed affect AI visibility?
Yes. AI bots, like search crawlers, prefer fast-loading pages. A slow website may not be fully indexed. Additionally, Perplexity and Google AI Mode show page previews — slow loading degrades the user experience. Core Web Vitals in the green zone is a basic requirement.
How can I check if AI is indexing my website?
Verification methods: 1) Check server logs for requests from GPTBot, ClaudeBot, PerplexityBot. 2) Ask AI directly about your company and evaluate the relevance of the data. 3) Use the GEO audit in GEO Scout — it checks robots.txt, structured data, and other technical factors. 4) Check robots.txt for AI user-agent blocks.
How is technical GEO different from technical SEO?
Technical SEO focuses on accessibility for search crawlers (Googlebot, YandexBot). Technical GEO adds to this: accessibility for AI bots (GPTBot, ClaudeBot), the llms.txt file, extended Schema.org markup for machine understanding, content structure optimization for citation. Many requirements overlap — good technical SEO covers 60-70% of technical GEO.