deduplication


Deduplication (de-duplication, AKA deduping/de-duping) is the process of comparing responses (and sometimes posts) to determine whether they are exactly or essentially the same, keeping only the earliest or most canonical version, and optionally keeping track of alternative URLs such as syndicated copies.

How to deduplicate responses

Replies and other responses are often duplicated in different places, e.g. via backfeed of POSSEd replies by Bridgy. Ideally, recipients should try to de-dupe webmention sources, preferring an original post (see below). Getting this perfect is hard, but getting close is fairly easy (see one IRC discussion and another) by doing both of the following:

  1. Preferring original replies
  2. Comparing an incoming reply (etc) to existing replies based on:
    • u-uid
    • u-url
    • u-syndication (also compare to u-url, and vice versa)
    • other u-in-reply-to links in the incoming reply
    • full text, after stripping HTML tags and probably ignoring whitespace differences
    • text prefix, after also stripping leading @username, RT/MT, trailing ..., etc.
    • edit distance, longest common subsequence, or other fuzzy match
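
The comparisons listed above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: it assumes each reply has already been parsed into a plain dict with hypothetical keys `uid`, `url`, `syndication`, and `content`, and the names `is_duplicate`, `fuzzy_threshold`, and `min_prefix` (along with their default values) are made up for this example. It covers only the URL and text comparisons, not the u-in-reply-to check, and it uses `difflib.SequenceMatcher` as one possible fuzzy match.

```python
import re
from difflib import SequenceMatcher
from html import unescape

TAG_RE = re.compile(r"<[^>]+>")
# Leading "RT", "MT", and "@username" tokens (optionally followed by a colon).
LEADING_RE = re.compile(r"^(?:(?:RT|MT)\b:?\s*|@\w+:?\s*)+", re.IGNORECASE)
# Trailing "..." or a single ellipsis character.
TRAILING_RE = re.compile(r"(?:\.{3}|…)\s*$")


def normalize_text(html):
    """Strip HTML tags, decode entities, collapse whitespace, lowercase."""
    text = unescape(TAG_RE.sub(" ", html))
    return " ".join(text.split()).lower()


def strip_decorations(text):
    """Remove leading @username / RT / MT and a trailing ellipsis."""
    text = LEADING_RE.sub("", text)
    return TRAILING_RE.sub("", text).strip()


def urls_of(reply):
    """Collect u-uid, u-url and u-syndication values into one set."""
    urls = {reply.get("uid"), reply.get("url"), *reply.get("syndication", [])}
    urls.discard(None)
    # Assumption: a trailing slash is not significant.
    return {u.rstrip("/") for u in urls}


def is_duplicate(incoming, existing, fuzzy_threshold=0.9, min_prefix=20):
    # 1. Any shared URL among uid / url / syndication (in either direction).
    if urls_of(incoming) & urls_of(existing):
        return True
    # 2. Full text, ignoring tags and whitespace differences.
    a = normalize_text(incoming.get("content", ""))
    b = normalize_text(existing.get("content", ""))
    if not a or not b:
        return False
    if a == b:
        return True
    # 3. Text prefix, after stripping @username, RT/MT and trailing "...".
    a, b = strip_decorations(a), strip_decorations(b)
    if not a or not b:
        return False
    short, long_ = sorted((a, b), key=len)
    if len(short) >= min_prefix and long_.startswith(short):
        return True
    # 4. Fuzzy match as a last resort.
    return SequenceMatcher(None, a, b).ratio() >= fuzzy_threshold
```

When two replies match, the receiver would still need to decide which one to keep (preferring the original over a syndicated copy); that choice is left out of this sketch. The threshold values are guesses and would need tuning against real examples such as those listed under the challenges below.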

Responses challenges

Examples / challenges for de-duping (use these as source material to check any de-duping approaches / algorithms)

  • comments on https://waterpigs.co.uk/notes/4Y38Ts/
  • security / identification / preventing hijacking. An attacker could overwrite or delete an existing webmention by sending a new one from their own site that claims the same u-url. To prevent this, receivers can compare the source domain as well as u-uid, u-url, etc., and only treat two webmentions as duplicates if both match.
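
The hijacking check described above could look something like this. It is a sketch under stated assumptions: each stored webmention is a dict with a hypothetical `source` key (the webmention source URL) and a `url` key (the u-url parsed from that source), and `may_replace` is an illustrative name, not part of any webmention library.

```python
from urllib.parse import urlsplit


def same_host(url_a, url_b):
    """True if both URLs have the same (case-insensitive) hostname."""
    host_a = urlsplit(url_a).hostname
    host_b = urlsplit(url_b).hostname
    return host_a is not None and host_a == host_b


def may_replace(existing, incoming):
    """Treat `incoming` as an update of `existing` only if the claimed
    u-url matches AND the webmention came from the same source domain."""
    return (
        incoming["url"] == existing["url"]
        and same_host(incoming["source"], existing["source"])
    )
```

With this check, a webmention sent from another domain that merely claims an existing u-url is stored as a separate response (or rejected) rather than overwriting the original.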

IndieWeb Examples

Kyle Mahan

Kyle Mahan has de-duplicated comments on his site since at least 2015-06.

Aaron Parecki

Aaron Parecki has de-duplicated comments on his site since 2017-09-01, with a partially working implementation since ~2016.

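[Screenshot: syndicated copy of a webmention comment (2017-09-aaronpk-syndicated-copy-of-webmention-comment.png)]

[Screenshot: syndicated copy of a webmention reply (2017-09-aaronpk-syndicated-copy-of-webmention-reply.png)]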

Silo Examples

Twitter

  • Twitter: ~24hr(?) dedupe. In the web compose UI, if you enter the same text as one of your tweets from roughly the past 24 hours and attempt to "Tweet", Twitter won't post it and instead shows the error message "You have already sent this Tweet.". The 24-hour window is an educated guess: duplicates were tested at intervals of minutes and of years.
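
For comparison, a time-windowed exact-text check of this general kind can be sketched as below. This is not Twitter's implementation, which is not public; the class name `RecentPostFilter`, the whitespace normalization, and the 24-hour default are all assumptions made for illustration.

```python
from datetime import datetime, timedelta


class RecentPostFilter:
    """Reject a post whose text matches one accepted within `window`."""

    def __init__(self, window=timedelta(hours=24)):
        self.window = window
        self._last_accepted = {}  # normalized text -> time last accepted

    def try_post(self, text, now):
        # Assumption: whitespace differences do not make a post "new".
        key = " ".join(text.split())
        last = self._last_accepted.get(key)
        if last is not None and now - last < self.window:
            return False  # duplicate within the window: refuse to post
        self._last_accepted[key] = now
        return True
```

Passing `now` explicitly (for example `datetime.now()` in real use) keeps the sketch easy to test with fixed timestamps.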
