sanitize, specifically "sanitizing HTML", "sanitizing for (display inside) HTML", or "sanitization", is a common operation performed by any site that displays content from external sources, including user-entered content.
- ... who does what when they receive comments via webmention or Bridgy?
- Idno currently aggressively strips tags.
- WordPress allows only a small whitelist of tags: a, abbr, acronym, b, blockquote, cite, code, pre, del, em, i, q, strike, strong. It also whitelists attributes within those tags.
There are numerous approaches to sanitizing / filtering HTML, e.g.:
- only allow plaintext
- allow only whitelisted HTML tags
- allow all HTML but strip <script> etc.
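As a rough illustration of the first two approaches, here is a minimal sketch using only the Python standard library. The names `as_plain_text`, `sanitize`, `ALLOWED_TAGS`, `ALLOWED_ATTRS` and `SAFE_SCHEMES` are hypothetical, and the whitelist is an arbitrary subset chosen for the example, not any particular site's policy. A production site would more likely rely on a maintained sanitization library than on hand-rolled code like this.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist for illustration only (an assumption, not a real site's policy).
ALLOWED_TAGS = {"a", "b", "blockquote", "code", "em", "i", "strong"}
ALLOWED_ATTRS = {"a": {"href"}}           # attributes kept per tag
SAFE_SCHEMES = ("http://", "https://")    # anything else (e.g. javascript:) is dropped
DROP_CONTENT = {"script", "style"}        # remove these tags together with their content


def as_plain_text(text):
    """Approach 1: treat everything as plain text and escape it for HTML display."""
    return escape(text)


class WhitelistSanitizer(HTMLParser):
    """Approach 2: keep whitelisted tags and attributes, escape all text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self._skip = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self._skip += 1
            return
        if tag not in ALLOWED_TAGS:
            return  # drop the tag itself, keep its text content
        kept = []
        for name, value in attrs:
            allowed = name in ALLOWED_ATTRS.get(tag, ())
            if allowed and value and value.lower().startswith(SAFE_SCHEMES):
                kept.append(f' {name}="{escape(value)}"')
        self.out.append(f"<{tag}{''.join(kept)}>")

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self._skip = max(0, self._skip - 1)
        elif tag in ALLOWED_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self._skip:
            self.out.append(escape(data))


def sanitize(html_text):
    parser = WhitelistSanitizer()
    parser.feed(html_text)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.out)
```

For example, `sanitize('<a href="javascript:alert(1)" onclick="x()">link</a>')` keeps the link text but drops both attributes. Note that this sketch does not balance unclosed tags, which a real sanitizer would also need to handle.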
Here are some additional resources:
- https://github.com/microcosm-cc/bluemonday - a sanitization library
- https://developers.google.com/caja/ (also https://code.google.com/p/google-caja/ ? ) - "the most insanely thorough sanitizer" - Kevin Marks
- http://wpbtips.wordpress.com/2010/05/23/html-allowed-in-comments-2/ - WordPress's minimal HTML tag whitelist. Notably, it allows links but not images.
Users typically expect that whatever they type into a comment box will be shown literally, e.g. if a user types in:
if b<a and a>c then you do not know whether b>c
They expect to see that text in the comment.
What they might see instead, due to overly aggressive sanitizing (e.g. a regex that thinks it is removing HTML):
if bc then you do not know whether b>c
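The mangled output above can be reproduced with a naive tag-stripping regex. The sketch below is only an illustration: the pattern `<[^>]*>` is an assumed example of such a regex, and the function names are hypothetical. It also shows that escaping the text instead preserves what the user typed once a browser renders it.

```python
import re
from html import escape

comment = "if b<a and a>c then you do not know whether b>c"


def naive_strip(text):
    # Assumed example of an overly aggressive regex: removes anything
    # that looks like <...>, including "<a and a>" in ordinary prose.
    return re.sub(r"<[^>]*>", "", text)


def escape_for_html(text):
    # Escapes <, > and & (and quotes) so the browser shows the text literally.
    return escape(text)


print(naive_strip(comment))
print(escape_for_html(comment))
```

The first call silently deletes `<a and a>`, while the second produces markup that displays the original comment unchanged.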
Other sites, e.g. Flickr, permit user entry of a few tags so that users may add explicit hyperlinks and some text formatting to their comments.