sanitize
sanitize, specifically "sanitizing HTML", "sanitizing for (display inside) HTML", or "sanitization", is a common operation performed by any site that displays content from external sources, including user entry. It is often used to standardize appearance, to prevent broken code in the external content from breaking the site itself, and most importantly to prevent code injection attacks from malicious sources.
Comments (and thus comments-presentation) are one particularly obvious area where some degree of sanitizing is essential.
IndieWeb Examples
(this section is a stub)
- ... who does what when they receive comments via webmention or Bridgy?
Software Examples
- Idno/Known currently aggressively strips tags.
- Details? All HTML tags?
- Mastodon does HTML sanitization, with very minimal HTML allowlists of:
- WordPress has a small allowlist of tags:
a, abbr, acronym, b, blockquote, cite, code, pre, del, em, i, q, strike, strong
- It also has allowlists of attributes within those tags.
- Pleroma optionally lets users author in Markdown, BBCode, or raw HTML. It doesn't document an explicit list of allowed HTML tags, though.
- Friendica lets users author in BBCode, but doesn't document an explicit list of allowed HTML tags either.
Service Examples
- micro.blog has a list of Allowed HTML tags and their attributes:
- elements:
a, audio, b, blockquote, br, code, div, em, i, img, li, ol, p, pre, source, span, strong, ul, video
- attributes on each:
- a:
href, title, class
- span:
style, class
- img:
src, style, class, width, height, alt
- audio:
src, controls
- video:
src, controls, width, height, preload, poster, alt, playsinline, style, class
- source:
src, type
- style attribute allowed properties:
width, height, max-width, max-height, min-width, min-height, border
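An allowlist like the micro.blog one above maps naturally onto a lookup table. The following Python sketch transcribes the per-element attributes and the allowed style properties from that list; the names ALLOWED_ATTRIBUTES, ALLOWED_STYLE_PROPERTIES, filter_attributes, and filter_style are hypothetical and chosen for illustration only. It is not micro.blog's implementation, and the style parsing is deliberately naive (it splits on ";" and does not validate property values).

```python
# Per-element attribute allowlist, transcribed from the list above.
ALLOWED_ATTRIBUTES = {
    "a": {"href", "title", "class"},
    "span": {"style", "class"},
    "img": {"src", "style", "class", "width", "height", "alt"},
    "audio": {"src", "controls"},
    "video": {"src", "controls", "width", "height", "preload", "poster",
              "alt", "playsinline", "style", "class"},
    "source": {"src", "type"},
}

# Properties permitted inside a style attribute, also from the list above.
ALLOWED_STYLE_PROPERTIES = {"width", "height", "max-width", "max-height",
                            "min-width", "min-height", "border"}


def filter_attributes(tag, attrs):
    """Keep only the attributes allowlisted for this tag (hypothetical helper)."""
    allowed = ALLOWED_ATTRIBUTES.get(tag, set())
    return {name: value for name, value in attrs.items() if name in allowed}


def filter_style(style):
    """Keep only allowlisted declarations; values are NOT validated here."""
    kept = []
    for declaration in style.split(";"):
        prop, sep, value = declaration.partition(":")
        if sep and prop.strip().lower() in ALLOWED_STYLE_PROPERTIES:
            kept.append(f"{prop.strip().lower()}: {value.strip()}")
    return "; ".join(kept)
```

For example, filter_attributes("a", {"href": "https://example.com/", "onclick": "x()"}) keeps only href, and filter_style("width: 10px; position: fixed") keeps only the width declaration. A real sanitizer would also need to check attribute values such as URLs.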
Approaches
There are several approaches to sanitizing or filtering HTML, e.g.:
- only allow plaintext
- allow only allowlisted HTML tags
- allow all HTML but strip <script> etc.
- ...
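As an illustration of the allowlist approach in the list above, here is a minimal Python sketch built on the standard library's html.parser. The allowlist itself (ALLOWED_TAGS, ALLOWED_ATTRS, SAFE_SCHEMES) and the names AllowlistSanitizer and sanitize are assumptions made for this example, not any particular site's policy. It drops unknown tags while keeping their text, drops script and style elements together with their contents, and escapes all remaining text. For production use, a maintained library such as those listed in the resources is a safer choice.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical allowlist for illustration only.
ALLOWED_TAGS = {"a", "b", "blockquote", "code", "em", "i", "strong"}
ALLOWED_ATTRS = {"a": {"href"}}
SAFE_SCHEMES = ("http://", "https://", "mailto:")
SKIP_CONTENT = {"script", "style"}  # removed together with their contents


class AllowlistSanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED_TAGS:
            return  # unknown tag: drop the tag, keep any text inside it
        kept = []
        for name, value in attrs:
            if value is None or name not in ALLOWED_ATTRS.get(tag, set()):
                continue
            # Only keep links whose scheme is explicitly allowlisted.
            if name == "href" and not value.strip().lower().startswith(SAFE_SCHEMES):
                continue
            kept.append(f' {name}="{escape(value, quote=True)}"')
        self.out.append(f"<{tag}{''.join(kept)}>")

    def handle_endtag(self, tag):
        if tag in SKIP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif not self.skip_depth and tag in ALLOWED_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def sanitize(html_text):
    parser = AllowlistSanitizer()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

This sketch does not balance tags (an unclosed allowed tag stays unclosed in the output), so a real implementation would need to close or reject unbalanced markup before embedding the result in a page.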
Here are some additional resources:
- https://github.com/microcosm-cc/bluemonday - a sanitization library
- http://pythonhosted.org/feedparser/html-sanitization.html
- https://web.archive.org/web/20080826033749/http://diveintomark.org/archives/2003/06/12/how_to_consume_rss_safely
- https://developers.google.com/caja/ (also https://code.google.com/p/google-caja/ ? ) - "the most insanely thorough sanitizer" - Kevin Marks
- http://wpbtips.wordpress.com/2010/05/23/html-allowed-in-comments-2/ - WordPress's minimal HTML tag allowlist. Notably, it allows links but not images.
- https://www.npmjs.org/package/sanitize-html
User Experience
Literal text
Users typically expect that whatever they type into a comment box will be shown literally, e.g. if a user types in:
if b<a and a>c then you do not know whether b>c
They expect to see that text in the comment.
What they might see instead, because overly aggressive sanitizing (e.g. a regex) mistakes "<a and a>" for an HTML tag and removes it:
if bc then you do not know whether b>c
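The difference between stripping and escaping can be reproduced with a short Python sketch. The function names naive_strip_tags and escape_for_html are hypothetical; the regex stands in for the kind of overly aggressive filter described above, and html.escape is the standard library function for escaping text for display inside HTML.

```python
import re
from html import escape

comment = "if b<a and a>c then you do not know whether b>c"


def naive_strip_tags(text):
    # Overly aggressive: treats anything between "<" and ">" as a tag.
    return re.sub(r"<[^>]*>", "", text)


def escape_for_html(text):
    # Escapes &, <, > and quotes so a browser renders the text literally.
    return escape(text)


print(naive_strip_tags(comment))
# if bc then you do not know whether b>c
print(escape_for_html(comment))
# if b&lt;a and a&gt;c then you do not know whether b&gt;c
```

When the escaped string is embedded in a page, the browser displays exactly what the user typed.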
Some markup
Other sites, e.g. Flickr, permit user entry of a few tags so that users may add explicit hyperlinks and some text formatting to their comments.