sanitize
sanitize, specifically "sanitizing HTML", "sanitizing for (display inside) HTML", or "sanitization", is a common operation performed by any site that displays content from external sources, including user entry. It is often used to standardize appearance, to prevent broken code in the external content from breaking the site itself, and most importantly to prevent code injection attacks from malicious sources.
Comments (and thus comments-presentation) are one particularly obvious area where some degree of sanitizing is essential.
IndieWeb Examples
(this section is a stub)
- ... who does what when they receive comments via webmention or Bridgy?
Software Examples
Known
Idno/Known currently aggressively strips tags.
- Details needed!
- All HTML tags? Which HTML tags? Which does it allow?
Mastodon
Mastodon does HTML sanitization, with very minimal HTML allowlists of:
- elements: a, br, p, span
- class names (mf2 or other semantics): h-*, p-*, u-*, dt-*, e-*, mention, hashtag, ellipsis, invisible
- and soon:
  - elements: del, pre, blockquote, code, b, strong, u, i, em, ul, ol, li
  - transforms: h1-h6 tags to <p><strong>contents</strong></p>
WordPress
WordPress has a small allowlist of tags: a, abbr, acronym, b, blockquote, cite, code, pre, del, em, i, q, strike, strong.
- It also has allowlists of attributes within those tags.
Pleroma
Pleroma optionally lets users author in Markdown, BBCode, or raw HTML. It doesn't document an explicit list of allowed HTML tags, though.
Friendica
Friendica lets users author in BBCode, but doesn't document an explicit list of allowed HTML tags either.
Service Examples
Flickr
Flickr permits user entry of a few tags so that users may add explicit hyperlinks and some text formatting to their comments.
- Which tags are allowed?
micro.blog
micro.blog has a list of Allowed HTML tags and their attributes:
- elements: a, audio, b, blockquote, br, code, div, em, i, img, li, ol, p, pre, source, span, strong, ul, video
- attributes on each:
  - a: href, title, class
  - span: style, class
  - img: src, style, class, width, height, alt
  - audio: src, controls
  - video: src, controls, width, height, preload, poster, alt, playsinline, style, class
  - source: src, type
- style attribute allowed properties: width, height, max-width, max-height, min-width, min-height, border
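An allowlist like the micro.blog one above maps naturally onto plain data plus a small filter. The following is a minimal sketch in Python, not micro.blog's actual implementation: the element, attribute, and style property names are transcribed from the list above, while the dict layout and the function names filter_style and filter_attributes are hypothetical. It only filters by name; it does not validate attribute values or URLs, which a real sanitizer would also need to do.

```python
# Names transcribed from the micro.blog list above; the data layout and the
# function names are hypothetical, chosen only for this illustration.
ALLOWED_ELEMENTS = {
    "a", "audio", "b", "blockquote", "br", "code", "div", "em", "i", "img",
    "li", "ol", "p", "pre", "source", "span", "strong", "ul", "video",
}
ALLOWED_ATTRIBUTES = {
    "a": {"href", "title", "class"},
    "span": {"style", "class"},
    "img": {"src", "style", "class", "width", "height", "alt"},
    "audio": {"src", "controls"},
    "video": {"src", "controls", "width", "height", "preload", "poster",
              "alt", "playsinline", "style", "class"},
    "source": {"src", "type"},
}
ALLOWED_STYLE_PROPERTIES = {
    "width", "height", "max-width", "max-height",
    "min-width", "min-height", "border",
}


def filter_style(style_value):
    """Keep only declarations whose property name is allowlisted."""
    kept = []
    for declaration in style_value.split(";"):
        name, sep, value = declaration.partition(":")
        name = name.strip().lower()
        value = value.strip()
        if sep and value and name in ALLOWED_STYLE_PROPERTIES:
            kept.append(f"{name}: {value}")
    return "; ".join(kept)


def filter_attributes(tag, attrs):
    """Return the subset of attrs (a dict) that the allowlist permits on tag."""
    if tag not in ALLOWED_ELEMENTS:
        return {}
    allowed = ALLOWED_ATTRIBUTES.get(tag, set())
    result = {}
    for name, value in attrs.items():
        if name not in allowed:
            continue
        if name == "style":
            value = filter_style(value)
            if not value:
                continue  # nothing allowlisted survived, drop the attribute
        result[name] = value
    return result
```

For example, filter_attributes("img", {"src": "cat.jpg", "onerror": "alert(1)"}) keeps only the src attribute, and filter_style drops any declaration such as position: fixed that is not in the property allowlist.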
Approaches
There are numerous approaches to sanitizing / filtering HTML, for example:
- only allow plain text
- allow only allowlisted HTML tags
- allow all HTML but strip <script> etc.
- ...
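As an illustration of the "allow only allowlisted HTML tags" approach, here is a minimal sketch using only the Python standard library. The allowlist (ALLOWED), the scheme list (SAFE_SCHEMES), and the class and function names are all hypothetical and do not reflect any of the projects listed on this page. It is a sketch under those assumptions, not a vetted sanitizer: it does not rebalance unclosed tags, and a maintained library such as those in the resources below is likely a safer choice for production use.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical allowlist for illustration: tag name -> allowed attribute names.
ALLOWED = {
    "a": {"href"},
    "p": set(),
    "br": set(),
    "em": set(),
    "strong": set(),
}
VOID_TAGS = {"br"}  # tags that never get a closing tag
# Assumption: only absolute links with these schemes are kept;
# relative URLs are dropped too in this sketch.
SAFE_SCHEMES = ("http://", "https://", "mailto:")


class AllowlistSanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED:
            return  # drop the tag itself; its text content is still kept
        kept = []
        for name, value in attrs:
            if name not in ALLOWED[tag] or value is None:
                continue
            if name == "href" and not value.strip().lower().startswith(SAFE_SCHEMES):
                continue  # rejects javascript: and other unexpected schemes
            kept.append(f' {name}="{escape(value, quote=True)}"')
        self.out.append(f"<{tag}{''.join(kept)}>")

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth == 0 and tag in ALLOWED and tag not in VOID_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            # Text is re-escaped so it can never be read back as markup.
            self.out.append(escape(data, quote=False))


def sanitize(html_text):
    parser = AllowlistSanitizer()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


print(sanitize('<p onclick="x()">Hi <a href="javascript:alert(1)">there</a>'
               '<script>alert(1)</script></p>'))
# prints: <p>Hi <a>there</a></p>
```

The design choice here is to rebuild the output from parsed events rather than to edit the input string: anything the sanitizer does not explicitly recognize is either dropped (tags, attributes, script content) or escaped (text), so unknown constructs fail closed.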
Here are some additional resources:
- https://github.com/microcosm-cc/bluemonday - a sanitization library
- http://pythonhosted.org/feedparser/html-sanitization.html
- https://web.archive.org/web/20080826033749/http://diveintomark.org/archives/2003/06/12/how_to_consume_rss_safely
- https://developers.google.com/caja/ (also https://code.google.com/p/google-caja/ ? ) - "the most insanely thorough sanitizer" - Kevin Marks
- http://wpbtips.wordpress.com/2010/05/23/html-allowed-in-comments-2/ - WordPress's minimal HTML tag allowlist. notably, it allows links, but not images.
- https://www.npmjs.org/package/sanitize-html
User Experience
Literal text
Users typically expect that whatever they type into a comment box will be shown literally, e.g. if a user types in:
if b<a and a>c then you do not know whether b>c
They expect to see that text in the comment.
What they might see instead, if overly aggressive sanitizing (e.g. a regex) mistakes the text between < and > for an HTML tag:
if bc then you do not know whether b>c
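The difference can be reproduced in a few lines of Python. This is a sketch for illustration only: the function names strip_tags_naively and escape_for_html are hypothetical, and the regex is merely one plausible example of an over-aggressive tag stripper. Escaping the text for display inside HTML, rather than trying to remove tags from it, preserves what the user typed.

```python
import re
from html import escape

comment = "if b<a and a>c then you do not know whether b>c"


def strip_tags_naively(text):
    # Over-aggressive: treats anything between "<" and ">" as a tag.
    return re.sub(r"<[^>]*>", "", text)


def escape_for_html(text):
    # Escapes &, < and > so a browser renders the text literally.
    return escape(text, quote=False)


print(strip_tags_naively(comment))
# prints: if bc then you do not know whether b>c

print(escape_for_html(comment))
# prints: if b&lt;a and a&gt;c then you do not know whether b&gt;c
```

When the escaped string is placed in an HTML page, the browser displays the original comment exactly as the user entered it.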