A crawler is a program that systematically browses the web.
When receiving a webmention, the simplest case is fetching the mentioning resource. There are additional resources one may want to fetch and cache for a more robust experience:
- other people introduced into the discussion through this source
When fetching and caching a contact, the simplest case is fetching their homepage. There are additional resources one may want to fetch and cache for a more robust experience:
- PGP key
- u-url, rel=me
- identity consolidation
- auto-contextualize nickname upon syndication
- recommended subscriptions
- rel=next, rel=prev
- backfilling a subscription
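Backfilling via rel=prev can be sketched as follows. This is a minimal illustration, not a reference implementation: the names `RelLinkParser` and `backfill` are hypothetical, and `fetch` is assumed to be any callable that takes a URL and returns HTML text (or None on failure), so no particular HTTP library is implied.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class RelLinkParser(HTMLParser):
    """Record the first href seen for each rel value on <a> and <link> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.rels = {}

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attrs = dict(attrs)
        href = attrs.get("href")
        if not href:
            return
        # rel is a space-separated list; resolve href against the page URL.
        for rel in (attrs.get("rel") or "").lower().split():
            self.rels.setdefault(rel, urljoin(self.base_url, href))


def backfill(feed_url, fetch, max_pages=10):
    """Follow rel=prev from feed_url and return [(url, html), ...], newest first.

    `fetch` is a hypothetical callable: URL in, HTML text (or None) out.
    """
    pages = []
    seen = set()
    url = feed_url
    # Stop on a missing link, a loop, a failed fetch, or the page limit.
    while url and url not in seen and len(pages) < max_pages:
        seen.add(url)
        html = fetch(url)
        if html is None:
            break
        pages.append((url, html))
        parser = RelLinkParser(url)
        parser.feed(html)
        url = parser.rels.get("prev")
    return pages
```

The `seen` set and `max_pages` limit are there because nothing guarantees that a site's rel=prev chain terminates.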
Combinations are possible: for example, a recommendation to subscribe to a feed found on a distant but related profile, or photos from related profiles used to choose the appropriate photo for a mention.
A crawler can be used to fetch one's identity graph by following all rel=me links.
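A rough sketch of such a rel=me crawl, under stated assumptions: `RelMeParser` and `crawl_identity` are hypothetical names, and `fetch` is assumed to be any callable that takes a URL and returns HTML text (or None on failure). The sketch only collects outgoing rel=me links; it does not check that the linked profiles link back.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class RelMeParser(HTMLParser):
    """Collect hrefs of <a> and <link> elements whose rel includes "me"."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attrs = dict(attrs)
        rels = (attrs.get("rel") or "").lower().split()
        href = attrs.get("href")
        if "me" in rels and href:
            self.links.append(urljoin(self.base_url, href))


def crawl_identity(start_url, fetch, max_pages=50):
    """Breadth-first walk over rel=me links.

    Returns a dict mapping each fetched URL to the rel=me URLs found on it.
    `fetch` is a hypothetical callable: URL in, HTML text (or None) out.
    """
    seen = {start_url}
    queue = deque([start_url])
    graph = {}
    while queue and len(graph) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # unreachable profile: skip it, keep crawling
        parser = RelMeParser(url)
        parser.feed(html)
        graph[url] = parser.links
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return graph
```

Passing `fetch` in as a parameter keeps the traversal separate from HTTP concerns such as caching, timeouts and rate limiting.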
2018 Summer Crawl
Angelo Gladding wrote a basic crawler to crawl the indieweb, starting with known h-cards previously found in the indiemap crawl. Individual identities were consolidated, rel=me links were followed, and PageRank was used to approximate the "primary" profile.
When stumbling upon a tag, whether it's an actual hashtag in a note or a u-category associated with an h-card, fetching and caching the resource referenced by the tag can provide a contextual cue as to its meaning.
A crawler can be used to fetch a tag's meaning by following all rel=alternate links.
You can then assign some or all of the URLs discovered this way as your own rel=alternate links.
The resulting concept graph could serve as the basis for a decentralized chat.
Crawlers usually accompany their requests with a descriptive User-Agent header. This value can then be matched in a robots.txt file to suggest access control. However, a crawler can always send a typical browser User-Agent to pass as a normal user, so the User-Agent header should not be relied upon for accuracy.
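A minimal sketch of the well-behaved side of this, using Python's standard `urllib.robotparser`. The User-Agent string, the helper names `allowed` and `build_request`, and the example URLs are all assumptions made for illustration. The check is purely advisory: it happens in the crawler, and nothing stops a client from skipping it.

```python
from urllib.request import Request
from urllib.robotparser import RobotFileParser

# Hypothetical descriptive User-Agent: product token plus a contact URL.
USER_AGENT = "ExampleCrawler/0.1 (+https://crawler.example/about)"


def allowed(robots_txt, url, user_agent=USER_AGENT):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def build_request(url, user_agent=USER_AGENT):
    """Build a request that identifies the crawler via the User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})
```

For example, given a robots.txt with a `User-agent: ExampleCrawler` group that disallows `/private/`, `allowed()` returns False for URLs under that path and True elsewhere on the site.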