robots txt

From IndieWeb

robots.txt is a file used to inform web crawlers what parts of a site should or should not be crawled.

Because this file is only a suggestion and bots can choose to ignore it, it is not a guaranteed way of keeping crawlers away. Generally, though, crawlers from the big search engines will respect it and won't publicly index your site if you declare it so.

Example command names:

Examples

The following examples may be copied and pasted into a plain text file named robots.txt and placed at the root of your domain.

Brief example to block anything inside a particular top level directory "/wiki/":

User-agent: *
Disallow: /wiki/
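
To check how a rule like this is interpreted, the following minimal sketch uses Python's standard-library urllib.robotparser to evaluate the two lines above. The helper name is_allowed, the agent name "ExampleBot", and the example.com URLs are assumptions made for illustration. Real crawlers may interpret rules differently from Python's parser.

```python
from urllib.robotparser import RobotFileParser

# The same rules as the example above, held in a string for testing.
RULES = """\
User-agent: *
Disallow: /wiki/
"""


def is_allowed(rules: str, user_agent: str, url: str) -> bool:
    """Return True if Python's parser says user_agent may fetch url.

    is_allowed is a hypothetical helper, not part of any library.
    """
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(rules.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # A page under /wiki/ is disallowed for every agent.
    print(is_allowed(RULES, "ExampleBot", "https://example.com/wiki/Main_Page"))
    # Pages outside /wiki/ remain allowed.
    print(is_allowed(RULES, "ExampleBot", "https://example.com/about"))
```

Parsing from a string keeps the check offline, so a draft robots.txt can be tested before it is published.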

Note that Google seems to ignore the "*" User-agent, so Googlebot must be specifically disallowed:

User-agent: Googlebot
Disallow: /wiki/

You may want to entirely block some particularly abusive bots:

User-agent: AhrefsBot
Disallow: /

Directives to disallow GPTBot: https://platform.openai.com/docs/gptbot/disallowing-gptbot

User-agent: GPTBot
Disallow: /

Directive to disallow ChatGPT: https://platform.openai.com/docs/plugins/bot

User-agent: ChatGPT-User
Disallow: /

Directive to disallow use for Google Bard and Vertex AI generative APIs [1]:

User-agent: Google-Extended
Disallow: /
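
When several of these per-bot blocks are combined into one file, it can help to confirm which agents end up blocked. The sketch below again assumes Python's standard-library urllib.robotparser. The helper blocked_agents, the agent "ExampleBot", and the example.com URL are hypothetical names chosen for illustration, and the result only reflects how Python's parser reads the file, not how each bot actually behaves.

```python
from urllib.robotparser import RobotFileParser

# The per-bot blocks from the examples above, combined into one file.
# A blank line separates each group.
BLOCKLIST_RULES = """\
User-agent: AhrefsBot
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: Google-Extended
Disallow: /
"""


def blocked_agents(rules: str, agents: list[str], url: str) -> list[str]:
    """Return the agents that Python's parser considers blocked from url.

    blocked_agents is a hypothetical helper, not part of any library.
    """
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]


if __name__ == "__main__":
    agents = ["AhrefsBot", "GPTBot", "ChatGPT-User", "Google-Extended", "ExampleBot"]
    # "ExampleBot" has no matching group and there is no "*" group,
    # so it is not expected to appear in the blocked list.
    print(blocked_agents(BLOCKLIST_RULES, agents, "https://example.com/"))
```

Because the file has no "*" group, any agent that is not named explicitly stays allowed, which is why a blocklist like this needs updating as new bots appear.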

More examples:

See Also