robots.txt for AI Crawlers: GPTBot, Google-Extended, PerplexityBot, and Search Access

AI crawler rules are now part of technical SEO. The common mistake is treating every bot the same. Some user agents fetch pages for search or answer retrieval, some exist as training controls, and ordinary Googlebot and Bingbot access still determines traditional search visibility.

The quick answer: use robots.txt to block private or non-public paths, but be careful before blocking AI-related crawlers if your goal is to be cited in AI answers. Separate training-control decisions from search-indexing decisions.

BaseToolbox's robots.txt generator helps create Allow and Disallow rules, but the policy choice is still yours.

Start With the Goal

Before editing robots.txt, decide what you want:

Goal	Likely policy
Keep admin, staging, or internal paths out	Disallow those paths for all crawlers.
Stay visible in Google Search	Do not block Googlebot from public pages.
Appear in AI answers and citations	Be cautious about blocking AI answer/search crawlers.
Reduce training use	Review each provider's training or AI-specific controls.
Protect private content	Do not rely only on robots.txt; require authentication.

robots.txt is a crawler instruction, not an access control system. Private data should not be publicly reachable just because bots are asked not to crawl it.

Common AI-Related User Agents

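As of June 30, 2026, common AI-related user agents include OpenAI's GPTBot, ChatGPT-User, and OAI-SearchBot; Perplexity's PerplexityBot; and Google's Google-Extended. Google-Extended is a robots.txt control token rather than a separate crawler, so blocking it does not change Googlebot's access for Search. Provider names and behavior can change, so check official documentation before publishing a final policy.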

Example Policy Patterns

A conservative public-site pattern usually keeps public content crawlable and blocks private paths:

User-agent: *
Disallow: /admin/
Disallow: /account/
Disallow: /checkout/

Sitemap: https://example.com/sitemap.xml

If you decide to block a specific AI crawler, write the rule explicitly:

User-agent: GPTBot
Disallow: /

Do not copy bot-blocking lists blindly. A rule that sounds privacy-friendly can also reduce the chance that AI systems discover or cite your public guides.
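One way to keep such a policy deliberate is to generate the file from a short, reviewable list instead of pasting rules by hand. The sketch below is illustrative only: the function name, the path list, and the choice to block GPTBot and Google-Extended are assumptions for the example, not a recommendation. Verify each provider's current documentation before treating any token as a training control.

```python
# Hypothetical policy inputs. Which agents count as "training-related"
# is an assumption to confirm against each provider's documentation.
PRIVATE_PATHS = ["/admin/", "/account/", "/checkout/"]
BLOCKED_AGENTS = ["GPTBot", "Google-Extended"]
SITEMAP_URL = "https://example.com/sitemap.xml"


def render_robots_txt(private_paths, blocked_agents, sitemap_url):
    """Build robots.txt text: one group per blocked agent, then a default group."""
    lines = []
    for agent in blocked_agents:
        # Each blocked agent gets its own explicit group.
        lines += [f"User-agent: {agent}", "Disallow: /", ""]
    # Every other crawler falls under the default group.
    lines.append("User-agent: *")
    lines += [f"Disallow: {path}" for path in private_paths]
    lines += ["", f"Sitemap: {sitemap_url}"]
    return "\n".join(lines) + "\n"


robots_txt = render_robots_txt(PRIVATE_PATHS, BLOCKED_AGENTS, SITEMAP_URL)
print(robots_txt)
```

Agents that are not named, such as Googlebot, OAI-SearchBot, or PerplexityBot, fall under the default group in this sketch, so public pages stay crawlable for them while the private paths stay disallowed.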

What robots.txt Cannot Do

robots.txt cannot:

Hide a URL from users who already know it
Protect private documents
Remove a page that is already indexed
Control every AI system on the internet
Replace authentication, noindex, headers, or deletion

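Use authentication for private areas. Use noindex for pages that should not appear in search results, and leave those pages crawlable: a crawler has to fetch a page to see its noindex directive, so a page that is also disallowed in robots.txt can stay indexed. Use robots.txt mainly for crawl guidance.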
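As a rough illustration of the noindex option, the directive can be sent as an X-Robots-Tag response header instead of a meta tag. The helper below is a minimal sketch: the function name and the path prefixes are hypothetical, and how you attach the header depends on your server or framework.

```python
# Hypothetical path prefixes for pages that should stay out of search
# results. These paths must remain crawlable so the header can be seen.
NOINDEX_PREFIXES = ("/search/", "/print/")


def extra_headers(path):
    """Return extra response headers for a request path."""
    if path.startswith(NOINDEX_PREFIXES):
        # X-Robots-Tag carries the same directives as a robots meta tag.
        return {"X-Robots-Tag": "noindex"}
    return {}


print(extra_headers("/search/?q=robots"))
print(extra_headers("/blog/robots-txt-ai-crawlers"))
```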

A Practical Review Checklist

Before publishing an AI crawler policy, review it with three people or roles: SEO, legal/privacy, and engineering. SEO should confirm that public pages remain crawlable. Legal or privacy should decide whether training controls are needed. Engineering should confirm private routes are protected by authentication, not only by robots.txt.

Also test the final file with the exact user-agent blocks you intended. A single broad Disallow: / under the wrong user agent can remove more access than expected.
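A lightweight way to run that test is to parse the file and compare the result with a list of expectations. The sketch below uses Python's standard urllib.robotparser; the policy text and the expectation list are hypothetical examples. Python's parser is simpler than production crawlers (for instance, it does not implement wildcard path patterns), so treat a passing check as a sanity test, not as proof of how each bot will behave.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy under review.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /admin/
Disallow: /account/
Disallow: /checkout/
"""

# (user agent, URL path, should it be allowed?) Hypothetical expectations.
EXPECTATIONS = [
    ("Googlebot", "/blog/post", True),
    ("Googlebot", "/admin/", False),
    ("PerplexityBot", "/blog/post", True),
    ("GPTBot", "/blog/post", False),
]


def check_policy(robots_txt, expectations):
    """Return the expectations that the parsed robots.txt does not satisfy."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    failures = []
    for agent, path, expected in expectations:
        allowed = parser.can_fetch(agent, path)
        if allowed != expected:
            failures.append((agent, path, expected, allowed))
    return failures


failures = check_policy(ROBOTS_TXT, EXPECTATIONS)
print(failures or "policy matches expectations")
```

Keeping the expectation list next to the file makes the intent reviewable: if someone later moves Disallow: / under User-agent: *, the Googlebot rows fail immediately.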

For content teams, keep a short changelog beside major robots.txt edits. If traffic, indexing, or AI referral visibility changes later, you will know which crawler policy changed and when.

After a policy change, monitor more than one metric. Check crawl logs when available, Search Console indexing, sitemap discovery, and analytics referrals over several weeks. AI answer visibility can move slowly, and a one-day change in referral traffic is not enough to prove that a crawler rule helped or hurt.
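For the crawl-log part of that monitoring, a simple count of requests per AI user agent is often enough to see whether a rule change had any effect. The sketch below is an assumption-heavy example: the log lines are invented, the format only resembles a combined access log, and user-agent strings can be spoofed, so read the numbers as a rough signal. Google-Extended is left out because it is a robots.txt token, not a user agent that shows up in requests.

```python
from collections import Counter

# Crawler names mentioned in this article; extend as providers change.
AI_AGENTS = ("GPTBot", "ChatGPT-User", "OAI-SearchBot", "PerplexityBot")

# Hypothetical access-log lines in a combined-log-like format.
SAMPLE_LOG = [
    '203.0.113.5 - - [01/Jul/2026:10:00:00 +0000] "GET /blog/ HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.9 - - [01/Jul/2026:10:00:05 +0000] "GET /tools/ HTTP/1.1" 200 734 "-" "Mozilla/5.0 (compatible; PerplexityBot/1.0)"',
    '198.51.100.7 - - [01/Jul/2026:10:00:09 +0000] "GET /blog/ HTTP/1.1" 200 512 "-" "Mozilla/5.0 (Windows NT 10.0) Firefox/128.0"',
]


def count_ai_hits(lines, agents=AI_AGENTS):
    """Count log lines that mention each known AI crawler name."""
    counts = Counter()
    for line in lines:
        lowered = line.lower()
        for agent in agents:
            if agent.lower() in lowered:
                counts[agent] += 1
                break  # attribute each request to at most one agent
    return counts


print(count_ai_hits(SAMPLE_LOG))
```

Comparing these counts for the weeks before and after a robots.txt edit gives the changelog entry something concrete to point at.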

For a static utility site, the safest default is usually simple: keep public tools, blog posts, privacy pages, and help content crawlable; block only internal, duplicate, generated, or non-public paths. That keeps the site easy for search engines and answer systems to understand without pretending robots.txt is a security layer.

Quick Answer

For generative engine optimization (GEO) and AI visibility, do not block AI-related crawlers by default; block them only when you intentionally want to limit that access. Keep public, helpful content crawlable, block private paths, and review each provider's current bot documentation before deciding.

FAQ

Does allowing AI crawlers guarantee ChatGPT or Perplexity citations?

No. Allowing access only makes citation possible. Content quality, authority, freshness, structure, and search visibility still matter.

Should I block Google-Extended?

That is a business decision. Review Google's documentation and decide whether the control matches your content and AI training policy. Do not confuse it with blocking Googlebot from Search.

Is robots.txt enough for confidential content?

No. Confidential content should require authentication or be removed from public access. robots.txt is not a security boundary.

