robots.txt

From IndieWeb

robots.txt is a file used to inform web crawlers what parts of a site should or should not be crawled.

Because this file is only advisory and bots can choose to ignore it, it is not a guaranteed way to keep crawlers away. In general, though, crawlers from major search engines will respect it and won't publicly index the parts of your site you disallow.


Examples

The following examples may be copied and pasted into a plain-text robots.txt file placed at the root of your domain.

A brief example that blocks everything inside a particular top-level directory ("/wiki/"):

User-agent: *
Disallow: /wiki/
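You can test how such rules are interpreted using Python's standard urllib.robotparser module; a minimal sketch using the "/wiki/" example above (the bot name and URLs here are placeholders):

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
# Parse the example rules directly instead of fetching them from a live site
rp.parse("""
User-agent: *
Disallow: /wiki/
""".splitlines())

# Paths under /wiki/ are disallowed for every user agent; others are allowed
print(rp.can_fetch("SomeBot", "https://example.com/wiki/Main_Page"))  # False
print(rp.can_fetch("SomeBot", "https://example.com/about"))           # True
```

This only checks how a well-behaved parser reads the file; it says nothing about bots that ignore robots.txt entirely.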

Note that Google seems to ignore the "*" User-agent and must be specifically disallowed:

User-agent: Googlebot
Disallow: /wiki/

You may want to entirely block some particularly abusive bots:

User-agent: AhrefsBot
Disallow: /

Directives to disallow GPTBot: https://platform.openai.com/docs/gptbot/disallowing-gptbot

User-agent: GPTBot
Disallow: /

Directive to disallow ChatGPT: https://platform.openai.com/docs/plugins/bot

User-agent: ChatGPT-User
Disallow: /

Directive to disallow use by Google Bard and Vertex AI generative APIs [1]:

User-agent: Google-Extended
Disallow: /
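The groups above can be combined into a single robots.txt file. Per the Robots Exclusion Protocol, consecutive User-agent lines share the rules that follow them, so one possible combined file (using the same bots and paths as the examples above) is:

```
User-agent: GPTBot
User-agent: ChatGPT-User
User-agent: Google-Extended
Disallow: /

User-agent: AhrefsBot
Disallow: /

User-agent: *
Disallow: /wiki/
```

A crawler uses the most specific group that matches its user agent, falling back to the "*" group only if no named group matches.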
