sanitize

From IndieWeb


sanitize, specifically "sanitizing HTML", "sanitizing for (display inside) HTML", or "sanitization", is a common operation performed by sites that display content from external sources, including user entry. It is often used to standardize appearance, to prevent broken code in the external content from breaking the site itself, and most importantly to prevent code injection attacks from malicious sources.

Comments (and thus comments-presentation) are one particularly obvious area where some degree of sanitizing is essential.

IndieWeb Examples

(this section is a stub)

Software Examples

  • Idno/Known currently aggressively strips tags.
    • Details? All HTML tags?
  • Mastodon does HTML sanitization, with very minimal HTML allowlists of:
    • elements: a, br, p, span
    • class names (mf2 or other semantics): h-*, p-*, u-*, dt-*, e-*, mention, hashtag, ellipsis, invisible
    • and soon:
      • elements: del, pre, blockquote, code, b, strong, u, i, em, ul, ol, li
      • transforms: h1-h6 tags to <p><strong>contents</strong></p>
  • WordPress has a small allowlist of tags:
    • a, abbr, acronym, b, blockquote, cite, code, pre, del, em, i, q, strike, strong.
    • It also has allowlists of attributes within those tags.
  • Pleroma optionally lets users author in Markdown, BBCode, or raw HTML. They don't document an explicit list of allowed HTML tags though.
  • Friendica lets users author in BBCode, but doesn't document an explicit list of allowed HTML tags either.
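
As an illustration of the element-allowlist style used in the examples above, here is a minimal sketch in Python built on the standard library's html.parser. It uses the current Mastodon element list (a, br, p, span). The names ElementAllowlist, sanitize_elements, and DROP_WITH_CONTENT are hypothetical, and dropping the contents of script and style elements is an assumption of this sketch, not something the listed projects document. All attributes are discarded, and unbalanced tags are not repaired, so this is not a complete sanitizer.

```python
from html import escape
from html.parser import HTMLParser

# Element allowlist taken from the Mastodon entry above.
ALLOWED_ELEMENTS = {"a", "br", "p", "span"}
VOID_ELEMENTS = {"br"}  # elements that never get a closing tag
# Assumption of this sketch: these elements are dropped together with their contents.
DROP_WITH_CONTENT = {"script", "style"}


class ElementAllowlist(HTMLParser):
    """Re-emits allowlisted tags without attributes and escapes all text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a dropped element

    def handle_starttag(self, tag, attrs):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth += 1
        elif tag in ALLOWED_ELEMENTS and not self.skip_depth:
            self.out.append(f"<{tag}>")  # attributes are discarded here

    def handle_endtag(self, tag):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in ALLOWED_ELEMENTS and tag not in VOID_ELEMENTS and not self.skip_depth:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data))  # text is always re-escaped


def sanitize_elements(html_text):
    """Return html_text with only allowlisted elements kept."""
    parser = ElementAllowlist()
    parser.feed(html_text)
    parser.close()  # flush any buffered text
    return "".join(parser.out)


print(sanitize_elements("<p>Hi <b>there</b><script>alert(1)</script><br/></p>"))
```

Tags that are not on the allowlist are removed while their text is kept, so the b element above disappears but "there" survives.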

Service Examples

  • micro.blog has a list of Allowed HTML tags and their attributes:
    • elements: a, audio, b, blockquote, br, code, div, em, i, img, li, ol, p, pre, source, span, strong, ul, video
    • attributes on each:
      • a: href, title, class
      • span: style, class
      • img: src, style, class, width, height, alt
      • audio: src, controls
      • video: src, controls, width, height, preload, poster, alt, playsinline, style, class
      • source: src, type
    • style attribute allowed properties:
      • width, height, max-width, max-height, min-width, min-height, border
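
The per-element attribute allowlist and the style property allowlist above could be applied roughly as follows. This is a sketch under stated assumptions, not micro.blog's implementation: the function names filter_style, filter_attrs, and render_start_tag are hypothetical, attributes are expected as (name, value) pairs in the form Python's html.parser passes to handle_starttag, and style values are split naively on ";". It does not validate CSS values or URL schemes in href and src, which a real sanitizer would also need to check.

```python
from html import escape

# Per-element attribute allowlist, copied from the micro.blog entry above.
ALLOWED_ATTRS = {
    "a": {"href", "title", "class"},
    "span": {"style", "class"},
    "img": {"src", "style", "class", "width", "height", "alt"},
    "audio": {"src", "controls"},
    "video": {"src", "controls", "width", "height", "preload", "poster",
              "alt", "playsinline", "style", "class"},
    "source": {"src", "type"},
}
# Allowed properties inside a style attribute, also from the entry above.
ALLOWED_STYLE_PROPS = {"width", "height", "max-width", "max-height",
                       "min-width", "min-height", "border"}


def filter_style(value):
    """Keep only allowlisted CSS declarations (naive split on ';')."""
    kept = []
    for declaration in value.split(";"):
        prop, sep, val = declaration.partition(":")
        prop, val = prop.strip().lower(), val.strip()
        if sep and val and prop in ALLOWED_STYLE_PROPS:
            kept.append(f"{prop}: {val}")
    return "; ".join(kept)


def filter_attrs(tag, attrs):
    """Return the (name, value) pairs that are allowlisted for this element."""
    allowed = ALLOWED_ATTRS.get(tag, set())
    kept = []
    for name, value in attrs:
        if name not in allowed:
            continue
        if name == "style":
            value = filter_style(value or "")
            if not value:
                continue  # nothing allowlisted was left in the style
        kept.append((name, value))
    return kept


def render_start_tag(tag, attrs):
    """Serialize a start tag using only the filtered attributes."""
    parts = [tag]
    for name, value in filter_attrs(tag, attrs):
        # Boolean attributes such as "controls" arrive with value None.
        parts.append(name if value is None else f'{name}="{escape(value, quote=True)}"')
    return "<" + " ".join(parts) + ">"


print(render_start_tag("img", [("src", "a.png"), ("onerror", "alert(1)"),
                               ("style", "width: 10px; position: fixed")]))
```

Elements that have no entry in ALLOWED_ATTRS (for example p or div) end up with no attributes at all.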

Approaches

There are numerous approaches to sanitizing or filtering HTML, e.g.:

  • only allow plaintext
  • allow only allowlisted HTML tags
  • allow all HTML but strip <script> etc.
  • ...


User Experience

Literal text

Users typically expect that whatever they type into a comment box will be shown literally, e.g. if a user types in:

if b<a and a>c then you do not know whether b>c

They expect to see that text in the comment.

What they might see instead, because overly aggressive sanitizing (e.g. a regex) treats part of the text as HTML and removes it:

if bc then you do not know whether b>c
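
The difference between stripping and escaping can be reproduced with a short Python sketch. The function names strip_tags_naively and show_literally are hypothetical, and the regex is only an assumed example of the kind of pattern that causes this problem.

```python
import re
from html import escape

comment = "if b<a and a>c then you do not know whether b>c"


def strip_tags_naively(text):
    """Remove anything that looks like a tag; this also eats '<a and a>'."""
    return re.sub(r"<[^>]*>", "", text)


def show_literally(text):
    """Escape HTML special characters so the browser displays the text as typed."""
    return escape(text)


print(strip_tags_naively(comment))
# if bc then you do not know whether b>c
print(show_literally(comment))
# if b&lt;a and a&gt;c then you do not know whether b&gt;c
```

Escaping keeps every character the user typed, so the browser renders the original sentence, whereas the regex silently deletes part of it.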

Some markup

Other sites, e.g. Flickr, permit users to enter a few tags so that they can add explicit hyperlinks and some text formatting to their comments.

See Also