common crawl

From IndieWeb

This article is a stub. You can help the IndieWeb wiki by expanding it.

Common Crawl is an open repository of web crawl data that is widely used for web analysis and for training generative text (AI) models.

How to Remove Content from Common Crawl

You can request content to be removed from the Common Crawl dataset by emailing their team directly.
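
An email request covers pages that are already in the dataset. To keep future crawls from collecting a site, a complementary step is a robots.txt rule aimed at Common Crawl's crawler, which identifies itself with the user agent CCBot. The sketch below is not from this page: it uses Python's standard urllib.robotparser to check whether a given robots.txt body would block CCBot. The function name allows_ccbot, the sample rules, and the example.com URL are illustrative assumptions.

```python
from urllib.robotparser import RobotFileParser


def allows_ccbot(robots_txt: str, url: str = "https://example.com/") -> bool:
    """Return True if the given robots.txt text lets CCBot fetch `url`."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("CCBot", url)


# Hypothetical robots.txt that blocks CCBot while allowing other crawlers.
SAMPLE_ROBOTS = """\
User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""

if __name__ == "__main__":
    print(allows_ccbot(SAMPLE_ROBOTS))
```

To check a live site, the same parser could instead be pointed at the site's own robots.txt with set_url() and read(). A rule like this only affects future crawls and does not remove pages already collected.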

IndieWeb Examples

capjamesg requested removal of his personal website from Common Crawl.

See Also

Internet Archive
See https://microformats.org/wiki/any23 for microformats-related work on the parser used by Common Crawl.
