From IndieWeb

Sanitize, specifically "sanitizing HTML", "sanitizing for (display inside) HTML", or "sanitization", is a common operation performed by any site that displays content from external sources, including user entry.

Comments (and thus comments-presentation) are one particularly obvious area where some degree of sanitizing is essential.

IndieWeb Examples

  • ... who does what when they receive comments via webmention or Bridgy?
  • Idno currently aggressively strips tags.
  • WordPress only allows a small whitelist of tags: a, abbr, acronym, b, blockquote, cite, code, pre, del, em, i, q, strike, strong. It also whitelists attributes within those tags.


There are numerous approaches to sanitizing / filtering HTML, e.g.:

  • only allow plaintext
  • allow only whitelisted HTML tags
  • allow all HTML but strip <script> etc.
  • ...
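
As a rough illustration of the "allow only whitelisted HTML tags" approach, here is a minimal Python sketch built on the standard library's html.parser. The names ALLOWED, SAFE_SCHEMES, DROP_CONTENT, WhitelistSanitizer and sanitize are hypothetical, chosen for this example, and the whitelist itself is an assumption rather than any particular site's policy. It does not repair unbalanced tags, and a real site would more likely rely on a maintained sanitizing library.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist for illustration: tag name -> allowed attribute names.
ALLOWED = {
    "a": {"href"}, "b": set(), "i": set(), "em": set(),
    "strong": set(), "code": set(), "blockquote": set(),
}
# Assumed set of URL schemes considered safe for href values.
SAFE_SCHEMES = ("http://", "https://", "mailto:")
# Elements whose text content is dropped entirely, not just the tags.
DROP_CONTENT = {"script", "style"}


class WhitelistSanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED:
            return  # unknown tag: drop the tag, keep its text content
        kept = []
        for name, value in attrs:
            if name not in ALLOWED[tag] or value is None:
                continue
            if name == "href" and not value.strip().lower().startswith(SAFE_SCHEMES):
                continue  # e.g. javascript: URLs
            kept.append(f' {name}="{escape(value, quote=True)}"')
        self.out.append(f"<{tag}{''.join(kept)}>")

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif not self.skip_depth and tag in ALLOWED:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            # Text is re-escaped so stray < and > are shown literally.
            self.out.append(escape(data, quote=False))


def sanitize(markup):
    parser = WhitelistSanitizer()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)


print(sanitize('<b>hi</b><script>alert(1)</script>'))
print(sanitize('<a href="javascript:alert(1)" onclick="x()">x</a>'))
```

With this sketch, the first call keeps the bold text and drops the script element together with its content; the second keeps the link text but removes both the javascript: href and the onclick attribute.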

User Experience

Literal text

Users typically expect that whatever they type into a comment box will be shown literally, e.g. if a user types in:

if b<a and a>c then you do not know whether b>c

They expect to see that text in the comment.

What they might see instead, when overly aggressive sanitizing (e.g. a regex) mistakes "<a and a>" for an HTML tag and removes it:

if bc then you do not know whether b>c
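
The difference can be reproduced with a few lines of Python. This is only a sketch: the regex below is an assumed stand-in for whatever tag-stripping pattern a site might use, and html.escape is the standard-library way to escape text for display inside HTML.

```python
import html
import re

comment = "if b<a and a>c then you do not know whether b>c"

# Naive tag stripping: the regex treats "<a and a>" as an HTML tag and deletes it.
stripped = re.sub(r"<[^>]*>", "", comment)

# Escaping instead: every character survives and the browser displays it literally.
escaped = html.escape(comment)

print(stripped)  # if bc then you do not know whether b>c
print(escaped)   # if b&lt;a and a&gt;c then you do not know whether b&gt;c
```

Escaping keeps the user's text intact, which is why it is usually the safer default when comments are meant to be plain text.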

Some markup

Other sites, e.g. Flickr, permit user entry of a few tags so that users may add explicit hyperlinks and some text formatting to their comments.

See Also