discovery-algorithms

This article is a stub. You can help the IndieWeb wiki by expanding it.

Discovery algorithms start with a URL, such as a home page or post permalink, and determine information about that URL, such as authorship, page name, publication date, or an alternate feed file. They often make use of follow-your-nose techniques, sometimes also called autodiscovery.

On the IndieWeb, you are your URL and your URL is you. On a well-designed IndieWeb site, humans can discover all the information you choose to share about yourself.

This page is for documenting and brainstorming ways to discover that same information automatically by following a combination of semantic markup and algorithms.

If you’re looking for a user-centric definition of discovery, see:

discovery

Algorithms

The following algorithms are fairly well-developed, mature, and well implemented:

Profile Info

Typical IndieWeb sites have profile and often contact information right there on the home page.

Best publishing practice: mark it all up with hCard, especially your hyperlinks to other profiles with class="u-url url" and rel="me". This will create a representative hCard which can then be used by other sites. Use this tool to check how your hCard gets parsed.

Examples in the wild:

Best parsing practice: when given an IndieWeb URL that represents a person, parse it for a representative hCard and use that hCard for information about that person. Details:

https://microformats.org/wiki/representative-hcard-parsing
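
The representative h-card steps linked above can be sketched over already-parsed microformats2 JSON. This is a minimal sketch, not a reference implementation: the function name `representative_hcard` and the sample data are hypothetical, only top-level items are inspected, and URLs are compared as exact strings without normalization.

```python
def representative_hcard(parsed, page_url):
    """Return the representative h-card from parsed mf2 JSON, or None."""
    cards = [item for item in parsed.get("items", [])
             if "h-card" in item.get("type", [])]
    rel_me = set(parsed.get("rels", {}).get("me", []))

    def prop(card, name):
        # mf2 JSON stores every property as a list of values.
        return card.get("properties", {}).get(name, [])

    # Step 1: an h-card whose uid and url both match the page URL.
    for card in cards:
        if page_url in prop(card, "uid") and page_url in prop(card, "url"):
            return card
    # Step 2: an h-card with a url value that is also a rel=me link.
    for card in cards:
        if any(isinstance(u, str) and u in rel_me for u in prop(card, "url")):
            return card
    # Step 3: exactly one h-card on the page, and its url matches the page URL.
    if len(cards) == 1 and page_url in prop(cards[0], "url"):
        return cards[0]
    return None


# Hypothetical parse result for https://example.com/
sample = {
    "items": [
        {"type": ["h-card"],
         "properties": {"name": ["Example Person"],
                        "url": ["https://example.com/",
                                "https://social.example/@me"]}},
        {"type": ["h-card"],
         "properties": {"name": ["Friend"],
                        "url": ["https://friend.example/"]}},
    ],
    "rels": {"me": ["https://social.example/@me"]},
}

card = representative_hcard(sample, "https://example.com/")
print(card["properties"]["name"][0])
```

In this sample neither card has a uid, so the first card is selected by the rel=me step.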

Separate Contact Page

Some IndieWeb sites have contact information on a separate page from the home page, but linked from the home page. While such links may be obvious to humans, e.g. with text labels like "Contact" or "About", it's not at all obvious to parsers where to find more information.

Examples of sites with separate contact/about pages:

https://adactio.com/ - contact information at: https://adactio.com/contact/
https://voxpelli.com/ - about information at: https://voxpelli.com/about (linked to with rel-me from front page)
https://kartikprabhu.com/ - about information at: https://kartikprabhu.com/about (linked with rel-me from homepage)
Jacky Alciné: At https://jacky.wtf, every page is marked up with rel=author on an h-card that points to /about, where my representative h-card lives. I picked this approach to align with the HTML5 spec. From there, my contact page is linked at /contact.
...

Proposal:

With a new 'rel' value, e.g. rel="contact-info", parsers could discover such links, follow them, and then parse the destination for a representative hCard.

Updates

Most IndieWeb sites have updates on their home page, a stream of updates as it were, for example:

https://tantek.com
https://adactio.com
Jacky Alciné: I keep a few small h-feeds on my homepage that point to my microblog, linklog and articles. They're referenced by anchors, so any capable parser could use that for a feed.

Best publishing practice: mark up your posts with hAtom (and ideally microformats2 h-entry as well).

Best parsing practice: parse the given URL for hAtom (and preferably microformats2 h-entry).

Others have a simple introduction/contact page as their home page, and provide updates at another URL, for example:

... old adactio.com
... any current examples? Maybe folks that have only an IndieWeb contact home page and then link to a separate page for their blog?

Proposal:

With a new 'rel' value, e.g. rel="updates", parsers could discover links to separate "journal", "updates", "notes", "stream" pages and then parse those for hAtom and h-entry.

Post Information

To discover information about a post on a post permalink page:

Parse the page for h-entry and retrieve post name, summary, URL, contents, and author accordingly.

Authorship

Main article: authorship

Summary: (see main authorship page for full algorithm)

Parse the page for an h-entry and look for its p-author property.
- If the author is also an h-card, parse that for more information about the author such as name, photo, logo, URL, etc.
If there is no p-author, then look for a rel-author link.
- Follow it, and retrieve the representative h-card from the destination for author information.
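
The summary above could be sketched roughly as follows, working on already-parsed microformats2 JSON. This covers only the p-author and rel-author cases from the summary and is not the full algorithm on the authorship page. The function name `find_author`, the returned tuple shapes, and the sample data are assumptions made for illustration.

```python
def find_author(parsed):
    """Return ("h-card", card), ("value", text), ("rel-author", url), or None."""
    entries = [item for item in parsed.get("items", [])
               if "h-entry" in item.get("type", [])]
    if entries:
        authors = entries[0].get("properties", {}).get("author", [])
        for author in authors:
            # An embedded h-card carries name, photo, url, and so on.
            if isinstance(author, dict) and "h-card" in author.get("type", []):
                return ("h-card", author)
            # A plain string may be a name or a URL; the caller decides.
            if isinstance(author, str):
                return ("value", author)
    # No p-author: fall back to a rel-author link, which the caller would
    # fetch and then parse for a representative h-card.
    rel_author = parsed.get("rels", {}).get("author", [])
    if rel_author:
        return ("rel-author", rel_author[0])
    return None


# Hypothetical parse result for a post permalink with no p-author.
post = {
    "items": [{"type": ["h-entry"],
               "properties": {"name": ["Hello"]}}],
    "rels": {"author": ["https://example.com/about"]},
}
print(find_author(post))
```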

Post Name

Assuming a post permalink page, see:

page-name-discovery

Publication Date

If the h-entry has a dt-published property, use that. This is the only method that has been implemented.
Otherwise, the following sources may be used to infer it (brainstorming):
- Check the URL path. If the start (or somewhere near the start?) of the URL path of the permalink is:
  - /YYYY/MM/DD/ = publication date ISO YYYY-MM-DD
    - Sites using this: aaronparecki.com
  - /YYYY/DDD/ = publication date ISO ordinal YYYY-DDD
    - Sites using this: tantek.com
- ...
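
The URL-path fallback brainstormed above might look something like this. It is a sketch under stated assumptions: the function name `date_from_permalink` is hypothetical, the date must appear at the very start of the path (the looser "somewhere near the start" variant is not handled), and the example URLs are made up.

```python
import re
from datetime import date, timedelta
from urllib.parse import urlparse

# /YYYY/MM/DD/ at the start of the path
CALENDAR = re.compile(r"^/(\d{4})/(\d{2})/(\d{2})(?:/|$)")
# /YYYY/DDD/ (ISO ordinal date) at the start of the path
ORDINAL = re.compile(r"^/(\d{4})/(\d{3})(?:/|$)")


def date_from_permalink(url):
    """Infer a publication date from a permalink path, or return None."""
    path = urlparse(url).path
    try:
        m = CALENDAR.match(path)
        if m:
            return date(int(m[1]), int(m[2]), int(m[3]))
        m = ORDINAL.match(path)
        if m:
            year, day_of_year = int(m[1]), int(m[2])
            inferred = date(year, 1, 1) + timedelta(days=day_of_year - 1)
            # Reject day 000 and days past the end of the year.
            return inferred if inferred.year == year else None
    except ValueError:
        # Impossible dates such as /2013/13/40/ are not treated as dates.
        return None
    return None


print(date_from_permalink("https://example.com/2013/11/01/some-post"))  # 2013-11-01
print(date_from_permalink("https://example.com/2013/305/t1/some-post"))  # 2013-11-01
print(date_from_permalink("https://example.com/notes/some-post"))  # None
```

A date inferred this way is only a guess, so a consumer would presumably prefer dt-published whenever it is present.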

POSSE copies

To find the POSSE copies of an original post, on that original post's permalink page:

Look for rel=syndication links
Treat their destinations as POSSE'd copies of the original post
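
As an illustration of the two steps above, here is a minimal sketch using only the Python standard library. The class and function names (`RelLinkParser`, `syndication_links`) and the sample markup are hypothetical, and only rel=syndication on `<a>` and `<link>` elements is considered.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class RelLinkParser(HTMLParser):
    """Collect href values of <a> and <link> elements with a given rel."""

    def __init__(self, base_url, rel):
        super().__init__()
        self.base_url = base_url
        self.rel = rel
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attrs = dict(attrs)
        # rel is a space-separated list of case-insensitive tokens.
        rels = (attrs.get("rel") or "").lower().split()
        href = attrs.get("href")
        if self.rel in rels and href:
            # Resolve relative hrefs against the permalink URL.
            self.links.append(urljoin(self.base_url, href))


def syndication_links(html, base_url):
    parser = RelLinkParser(base_url, "syndication")
    parser.feed(html)
    return parser.links


sample_html = """
<article class="h-entry">
  <p class="e-content">Hello world</p>
  <a rel="syndication" href="https://social.example/@me/123">Also on social.example</a>
  <a href="/about">About</a>
</article>
"""
print(syndication_links(sample_html, "https://example.com/2013/305/t1"))
```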

Original post

To find an original post from a POSSE'd copy, on that POSSE'd copy's permalink page:

Follow original-post-discovery

Real Time Updates

... rel=hub ... on link to PuSH hub for your updates.

Public Keys

For an improved Salmon key-discovery flow (that is, not using DRY-violating XRD files and web-breaking email-like identifiers), we need to expose public keys somehow.

Public key exchange was also discussed at the IndieWeb dinner on November 1st, 2013, in the context of exchanging keys in person using QR codes. Potentially the QR code could simply be a URL pointing to the user's website, where a public key could be extracted from microformats markup.

Help or About

HTML has the rel="help" value, but it's not clear that it conveys the "about" kind of resources you link to.

IndieWeb Documentation

For example:

Tom Morris has documented the IndieWeb support on his site on a separate page.

Alternates Discovery

There are a bunch of discovery methods for alternate versions of content, in case you want to interoperate with sites and systems that use them.

Feed File Discovery

... rel=alternate ... type=...+xml ...

If you have a tool or service that wants to support autodiscovering feed files, or if you publish a feed file (like Atom or RSS) and want to enable autodiscovery, you can add a <link rel=alternate> element as follows:

(needs complete code example)
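
Pending a complete example, here is a minimal sketch of both sides. The sample markup shows `<link rel="alternate">` elements a publisher might put in the page head, and the parser shows how a consumer might discover them. The names `FeedLinkParser` and `discover_feeds`, the example.com URLs, and the choice to accept only the Atom and RSS media types are assumptions, not an established algorithm.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Media types commonly used for Atom and RSS feed files.
FEED_TYPES = {"application/atom+xml", "application/rss+xml"}


class FeedLinkParser(HTMLParser):
    """Collect (media type, URL) pairs from <link rel="alternate"> elements."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attrs = dict(attrs)
        rels = (attrs.get("rel") or "").lower().split()
        media_type = (attrs.get("type") or "").strip().lower()
        href = attrs.get("href")
        if "alternate" in rels and media_type in FEED_TYPES and href:
            self.feeds.append((media_type, urljoin(self.base_url, href)))


def discover_feeds(html, base_url):
    parser = FeedLinkParser(base_url)
    parser.feed(html)
    return parser.feeds


# What a publisher might include in the <head> of their home page.
sample_page = """
<head>
  <link rel="alternate" type="application/atom+xml" title="Updates" href="/feed.atom">
  <link rel="alternate" type="application/rss+xml" href="https://example.com/feed.rss">
  <link rel="stylesheet" href="/style.css">
</head>
"""
for media_type, url in discover_feeds(sample_page, "https://example.com/"):
    print(media_type, url)
```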

This section is a stub. You can help the IndieWeb wiki by expanding it.

Developer Examples

Examples that developers, e.g. folks at least familiar with manipulating and editing XML files, are publishing or can use:

Chris Aldrich has a following page of people he's subscribed to, along with subscribable OPML files that allow others to discover those people too.
tw2113 has published an OPML file of feeds he subscribes to over at OPML
Share Your OPML - A service by Dave Winer for sharing your OPML file and aggregating it with others' files to help uncover popular websites. See also http://scripting.com/2016/10/12/areYouReadyToShareYourOpml.html; the resulting combined data set can be found at http://feedbase.io/

Libraries

Link rel discovery

Main article: link rel parser

Many discovery algorithms depend on link rel discovery via HTTP Link headers, HTML <link> tags, or both. These libraries can help with parsing and extracting both:

link_rel_parser - PHP http_rels($h) & head_http_rels($url) - HTTP header string parser for RFC 5988 Link: rels (including X-Pingback), plus a function to curl a HEAD request and parse it all in one.
link_rel_parser - Ruby Link header (RFC 5988) parser (port of link-rel-parser-php)
phpish/link_header - PHP Link header (RFC 5988) parser
PEAR: HTTP2 - PHP Link header (RFC 5988) parser (documentation)
http-link-header - Haskell Link header (RFC 5988) parser
ex_http_link - Elixir Link header (RFC 5988) parser
indieweb - Rust library that provides logic for fetching and handling link-rel values from headers and HTML
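
To show roughly what these libraries do with an HTTP Link header, here is a deliberately simplified sketch. It is not a complete RFC 5988 parser: it assumes that parameter values never contain `<`, it reads only the rel parameter, and it leaves relative targets unresolved. The function name `parse_link_header` and the sample header are hypothetical.

```python
import re

# A link value: <target> followed by its parameters, up to the next "<".
LINK_RE = re.compile(r"<([^>]*)>([^<]*)")
# The rel parameter, either quoted or as a bare token.
REL_RE = re.compile(r';\s*rel\s*=\s*(?:"([^"]*)"|([^\s";,]+))', re.IGNORECASE)


def parse_link_header(value):
    """Map each rel value to the list of targets that carry it."""
    rels = {}
    for target, params in LINK_RE.findall(value):
        match = REL_RE.search(params)
        if not match:
            continue
        rel_value = match.group(1) if match.group(1) is not None else match.group(2)
        # A single rel parameter may hold several space-separated values.
        for rel in rel_value.lower().split():
            rels.setdefault(rel, []).append(target)
    return rels


header = (
    '<https://hub.example/>; rel="hub", '
    '<https://example.com/feed.atom>; rel="self alternate"; type="application/atom+xml", '
    "</about>; rel=author"
)
print(parse_link_header(header))
```

For production use, one of the libraries listed above is likely a better choice, since they handle the parts of the header grammar this sketch skips.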