sanitize


sanitize, specifically "sanitizing HTML", "sanitizing for (display inside) HTML", or "sanitization", is a common operation performed by any site that displays content from external sources, including user input. It is often used to standardize appearance, to prevent broken code in the external content from breaking the site itself, and, most importantly, to prevent code injection attacks from malicious sources.

Comments (and thus comments-presentation) are one particularly obvious area where some degree of sanitizing is essential.

IndieWeb Examples

(this section is a stub)

Software Examples

Known

Idno/Known currently aggressively strips tags.

  • Details needed!
  • Does it strip all HTML tags, or only some? Which tags does it allow?

Mastodon

Mastodon sanitizes HTML using very minimal allowlists:

  • elements: a, br, p, span
  • class names (mf2 or other semantics): h-*, p-*, u-*, dt-*, e-*, mention, hashtag, ellipsis, invisible
  • and soon:
    • elements: del, pre, blockquote, code, b, strong, u, i, em, ul, ol, li
    • transforms: h1-h6 tags to <p><strong>contents</strong></p>
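An allowlist of this shape can be sketched with Python's standard-library HTML parser. This is only an illustration of the technique, not Mastodon's actual implementation: the names `AllowlistSanitizer` and `sanitize` are hypothetical, the rule of keeping only `http(s)` links on `a` is an added assumption, and the sketch does not rebalance unmatched tags or drop the text inside disallowed elements such as `script` (that text is emitted escaped, as inert plain text).

```python
import re
from html import escape
from html.parser import HTMLParser

# Allowlists taken from the list above; everything else is dropped.
ALLOWED_ELEMENTS = {"a", "br", "p", "span"}
HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
VOID_ELEMENTS = {"br"}
ALLOWED_CLASS = re.compile(
    r"^(?:(?:h|p|u|dt|e)-.+|mention|hashtag|ellipsis|invisible)$"
)


class AllowlistSanitizer(HTMLParser):
    """Hypothetical sanitizer: re-emits only allowlisted tags and classes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS:
            # Transform headings instead of dropping them.
            self.out.append("<p><strong>")
            return
        if tag not in ALLOWED_ELEMENTS:
            return  # drop the tag itself, keep its text content
        kept = []
        for name, value in attrs:
            value = value or ""
            if name == "class":
                classes = [c for c in value.split() if ALLOWED_CLASS.match(c)]
                if classes:
                    kept.append(("class", " ".join(classes)))
            elif name == "href" and tag == "a":
                # Assumption: only plain http(s) links are kept.
                if value.startswith(("http://", "https://")):
                    kept.append(("href", value))
        attr_text = "".join(f' {n}="{escape(v, quote=True)}"' for n, v in kept)
        self.out.append(f"<{tag}{attr_text}>")

    def handle_endtag(self, tag):
        if tag in HEADINGS:
            self.out.append("</strong></p>")
        elif tag in ALLOWED_ELEMENTS and tag not in VOID_ELEMENTS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Text is always escaped so it can never turn back into markup.
        self.out.append(escape(data, quote=False))


def sanitize(html):
    parser = AllowlistSanitizer()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return "".join(parser.out)
```

With this sketch, `sanitize('<h2>Title</h2><p class="h-card evil">Hi <b>there</b></p>')` keeps the `h-card` class, drops the unknown class and the `b` tag, and turns the heading into a bold paragraph.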

WordPress

WordPress has a small allowlist of tags:

  • a, abbr, acronym, b, blockquote, cite, code, pre, del, em, i, q, strike, strong.
  • It also has allowlists of attributes within those tags.

Pleroma

Pleroma optionally lets users author in Markdown, BBCode, or raw HTML. It does not document an explicit list of allowed HTML tags, though.

Friendica

Friendica lets users author in BBCode, but doesn't document an explicit list of allowed HTML tags either.

Service Examples

Flickr

Flickr permits user entry of a few tags so that users may add explicit hyperlinks and some amount of text formatting to their comments.

  • Which tags are allowed?

micro.blog

micro.blog has a list of Allowed HTML tags and their attributes:

  • elements: a, audio, b, blockquote, br, code, div, em, i, img, li, ol, p, pre, source, span, strong, ul, video
  • attributes on each:
    • a: href, title, class
    • span: style, class
    • img: src, style, class, width, height, alt
    • audio: src, controls
    • video: src, controls, width, height, preload, poster, alt, playsinline, style, class
    • source: src, type
  • style attribute allowed properties:
    • width, height, max-width, max-height, min-width, min-height, border
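Filtering a `style` attribute down to an allowed set of properties could look roughly like the sketch below. It is not micro.blog's code: `filter_style` is a hypothetical helper, it splits declarations naively on `;` (which breaks on values containing semicolons), and it does not validate property values at all, which a real sanitizer would need to do.

```python
# Property allowlist taken from the list above.
ALLOWED_PROPERTIES = {
    "width", "height", "max-width", "max-height",
    "min-width", "min-height", "border",
}


def filter_style(style):
    """Keep only allowlisted CSS declarations from a style attribute value."""
    kept = []
    for declaration in style.split(";"):
        name, sep, value = declaration.partition(":")
        name = name.strip().lower()
        value = value.strip()
        # Skip malformed declarations and properties not on the allowlist.
        if sep and value and name in ALLOWED_PROPERTIES:
            kept.append(f"{name}: {value}")
    return "; ".join(kept)


print(filter_style("width: 100px; position: fixed; BORDER:1px solid red"))
# prints: width: 100px; border: 1px solid red
```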

Approaches

There are numerous approaches to sanitizing or filtering HTML, for example:

  • only allow plain text
  • allow only allowlisted HTML tags
  • allow all HTML but strip <script> etc.
  • ...
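The first and third approaches can be contrasted in a few lines of Python. This is a rough sketch with hypothetical function names: `as_plain_text` escapes everything so no markup survives, while `strip_script_tags` is a deliberately naive denylist that shows why "allow everything except `<script>`" is fragile, since script can also run from event-handler attributes.

```python
import re
from html import escape


def as_plain_text(untrusted):
    """Plain-text approach: escape everything, so the input renders literally."""
    return escape(untrusted)


def strip_script_tags(untrusted):
    """Naive denylist approach: remove <script> elements and nothing else."""
    return re.sub(
        r"<script\b.*?>.*?</script\s*>", "", untrusted,
        flags=re.IGNORECASE | re.DOTALL,
    )


payload = '<script>x</script><img src=x onerror=alert(1)>'

print(as_plain_text("<b>hi</b>"))
# prints: &lt;b&gt;hi&lt;/b&gt;

print(strip_script_tags(payload))
# prints: <img src=x onerror=alert(1)>
# The onerror handler survives, so the denylist did not prevent injection.
```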

User Experience

Literal text

Users typically expect that whatever they type into a comment box will be shown literally, e.g. if a user types in:

if b<a and a>c then you do not know whether b>c

They expect to see that text in the comment.

What they might see instead, due to overly aggressive sanitizing (e.g. a regex that thinks it is removing HTML):

if bc then you do not know whether b>c
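The difference between the two outcomes above can be reproduced with a short Python sketch. The function names are hypothetical; the point is that escaping the text for HTML preserves what the user typed (the browser renders `&lt;` as `<`), whereas a tag-stripping regex silently deletes part of it.

```python
import re
from html import escape

comment = "if b<a and a>c then you do not know whether b>c"


def naive_strip(text):
    """Overly aggressive: treats anything between < and > as a tag."""
    return re.sub(r"<[^>]*>", "", text)


def escape_literal(text):
    """Escapes &, < and > so the text displays exactly as typed."""
    return escape(text, quote=False)


print(naive_strip(comment))
# prints: if bc then you do not know whether b>c

print(escape_literal(comment))
# prints: if b&lt;a and a&gt;c then you do not know whether b&gt;c
```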

See Also