common crawl
This article is a stub. You can help the IndieWeb wiki by expanding it.
Common Crawl is an open repository of web crawl data, widely used for web analysis and for training generative text (AI) models.
How to Remove Content from Common Crawl
You can request content to be removed from the Common Crawl dataset by emailing their team directly.
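Separately from an email removal request, a site owner may want to confirm that their robots.txt asks Common Crawl's crawler to stay away in future crawls. The following is a minimal sketch, assuming the crawler identifies itself with the user-agent token CCBot and honors robots.txt; the robots.txt content, the `allows` helper, and the example.com URL are hypothetical and only for illustration. It does not remove anything already in the dataset.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration: block CCBot, allow everyone else.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def allows(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# example.com stands in for your own site.
ccbot_allowed = allows(ROBOTS_TXT, "CCBot", "https://example.com/notes/1")
other_allowed = allows(ROBOTS_TXT, "ExampleBot", "https://example.com/notes/1")
print("CCBot allowed:", ccbot_allowed)
print("Other crawler allowed:", other_allowed)
```

To check a live site instead, the same parser can load the file with `set_url()` and `read()` rather than parsing a string.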
IndieWeb Examples
- capjamesg requested removal of his personal website from Common Crawl.
See Also
- Internet Archive
- See https://microformats.org/wiki/any23 for microformats-related work on the parser used by Common Crawl