What does HackerNews think of bluemonday?

bluemonday: a fast golang HTML sanitizer (inspired by the OWASP Java HTML Sanitizer) to scrub user generated content of XSS

Language: Go

#84 in Go
#76 in Go
#10 in HTML
#54 in Security
I'm on the receiving end of donations from sourcegraph using this. It's around $10 per month from that single donation and is for the only Go HTML sanitizer, which you use when you have user generated / untrusted input that you need to display as HTML. https://github.com/microcosm-cc/bluemonday

For me the library has been good enough for my own use for a very very long time. I mostly neglect it unless there's some critical issue (which there hasn't been for a very long time — but clearly infosec world knows this lib exists as when I do make changes with crap commit messages I get emails asking what the implications are). I don't improve it at all as my time is better spent on my day job.

I've often thought that there's room for improvement such as a DOM style sanitizer to validate HTML input rather than just a SAX style sanitizer, perhaps formatting of output in addition to sanitising input, transformation rules to allow a safe embedly type thing, etc.

When I got the initial donation I was surprised, first ever bit of support for open source software I'd written (as this was not written on company dime).

Even at $10 per month it is motivating enough to think someone values it. If it accrues into something significant then I may actually feel motivated to improve it rather than just support it.

Interesting is that I'd regard this OSS lib as successful by usage as sourcegraph says 803 projects use it — but given that a fair number of those are things like Hugo which in turn have thousands of forks and many more thousands of instances of use, well it does appear that this sanitizer is to safe HTML in Go what https://xkcd.com/2347/ illustrates. Originally it was written for my own use and it's now used by virtually everything in the Go world that makes a website.

Perhaps people don't know that this and libraries like it exist though? Perhaps they import some web framework and this came with it? Well, for that awareness thank you to thanks.dev

Of the models I've seen so far for supporting individuals who create OSS code this one stands out by highlighting dependencies. Likely the solution is a blend of things... a commission type system for rare engineers and skills (i.e. https://words.filippo.io/pay-maintainers/ ) and thanks.dev for the long-tail of those who have done things and you want to use it long-term. There are also companies who create OSS software, and if you value their work and don't want what they produce to go behind Enterprise differentiation then perhaps pay for their services too.

My thoughts as a maintainer of a HTML sanitizer in Go https://github.com/microcosm-cc/bluemonday which is also available to Python via https://github.com/ColdHeat/pybluemonday

1. Sanitizing is not difficult, defining the policy/config is.

This is difficult as your need is not someone else's. First glance of this proposal is that this needs a lot more work to cover people's needs. It's good enough, but will have a lot of edges and will need to evolve.

2. If you allow a blocklist then it will be less secure.

Because people will use that by default as it's easier to say "I don't want " than it is to say "I only accept ". The problem here is that a blocklist requires the person writing the config to cover every scenario to be safe and unless they're a security engineer they will not... and if they were a security engineer they'd use the allowlist. Blocklists seldom deliver good security, allowlists deliver great security.

3. Provide sane defaults.

Most engineers simply do not know what is safe or not. I ship a policy in bluemonday for user generated content... it is safe by default and good enough for most people, and it can be taken and extended due to the way the API is structured so can cover other scenarios as a foundation policy.
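A minimal sketch of point 2, using only the standard library. The tag lists and function names are hypothetical, and this is not how bluemonday is implemented; it only illustrates why a blocklist fails open for anything its author did not think of, while an allowlist fails closed.

```go
package main

import (
	"fmt"
	"strings"
)

// blockedTags is a hypothetical blocklist: any tag NOT named here is accepted.
var blockedTags = map[string]bool{"script": true, "iframe": true}

// allowedTags is a hypothetical allowlist: any tag NOT named here is rejected.
var allowedTags = map[string]bool{"p": true, "b": true, "i": true, "a": true}

// blocklistAccepts fails open: unknown tags pass.
func blocklistAccepts(tag string) bool {
	return !blockedTags[strings.ToLower(tag)]
}

// allowlistAccepts fails closed: unknown tags are refused.
func allowlistAccepts(tag string) bool {
	return allowedTags[strings.ToLower(tag)]
}

func main() {
	// "object" and "svg" were never considered by the blocklist author,
	// so the blocklist lets them through while the allowlist does not.
	for _, tag := range []string{"p", "script", "object", "svg"} {
		fmt.Printf("%-7s blocklist=%v allowlist=%v\n",
			tag, blocklistAccepts(tag), allowlistAccepts(tag))
	}
}
```

The blocklist author has to enumerate every dangerous element to be safe; the allowlist author only has to enumerate what they actually need.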

-----

I think the proposal in general: specify a standard for a sanitization API has merit. But mostly it has merit if it specifies a standard for defining sanitization policies/configuration, allowing them to be portable across different languages and systems.

The one I wrote is very heavily inspired by https://github.com/owasp/java-html-sanitizer which is the OWASP project one maintained by Mike Samuel. When I did my research before writing the Go one, this was far and away the best way to construct the policy/config and I already saw that this perspective was more valuable than whether it's a token based parser (GIGO but low memory) or a DOM builder (more memory)... no-one cares about the internals, they care about expressing what safe means to them.

Author here.

Recently I was looking for a way to sanitize user generated HTML of malicious things like JavaScript.

Solutions like bleach, html_sanitizer, and lxml's Cleaner all work but I found that their performance on complicated HTML snippets was lacking because they needed to rely on html5lib for parsing HTML5. And completely normal content would get mangled without using html5lib.

I ended up writing these Python bindings to the bluemonday library[1]. It seems to perform much better than existing Python solutions for the same problem[2]. I suspect because more of the work can be done in native code instead of having to pass an XML tree around.

Hoping that this is useful to someone else but also looking for any feedback. Especially about how the bindings were written.

[1] https://github.com/microcosm-cc/bluemonday

[2] https://github.com/ColdHeat/pybluemonday#performance

Out of curiosity, I fed the attacks to the Go bluemonday library https://github.com/microcosm-cc/bluemonday

It seemed to have no issues whatsoever (click Run): https://play.golang.org/p/oBYXo9MDusr https://play.golang.org/p/KzD1Ug-KKiB

It is all open source.

The only thing that isn't is the very old puppet script that managed the deploy as it was rather bespoke for our setup.

To my own regret I focused far too much on making it an effective platform rather than an easy install, so that bit might feel gnarly but at least the errors are sane and guide you.

The Go API and PostgreSQL schemas: https://github.com/microcosm-cc/microcosm

The Django frontend (the client is nothing but a thin client over the API) https://github.com/microcosm-cc/microweb

The Bootstrap derived theme for the Django frontend https://github.com/microcosm-cc/microweb-bootstrap

And then other miscellaneous things:

The Go HTML sanitizer for user generated content https://github.com/microcosm-cc/bluemonday

Our legal policies for forums on this platform (expensive to produce, but perfectly fitting a forum platform with minimal exposure for the platform owner/admin and minimal but some liability for a forum owner) https://github.com/microcosm-cc/legal

There is also a newer thing, I was (/am?) intending to replace the Django layer with a Go frontend and templating, and then moving the API into this, such that the forum could become a single binary install and thus gain a new lease of life: https://github.com/buro9/microcosm . Once in a while I chip away on that.

This didn't work so well for one of my projects: https://github.com/microcosm-cc/bluemonday

I chose to look at this one as the README is quite descriptive and offers examples, and it is reasonably well structured.

It's a Go HTML sanitizer.

The suggested topics included "go" but none of the others I have now tagged it with.

It did include things like "html-element" and "data-uri". I can see how it did this from word prevalence, but these were far too specific to examples and documentation, and did not describe the project.

It feels as if the word count should be weighted towards the early part of the README, perhaps no farther in than the 3rd heading.

When I see anything that touches web coding practices for Go I always look up an area that I know best: sanitization.

I wrote https://github.com/microcosm-cc/bluemonday which is a pure Go HTML sanitizer inspired by https://github.com/owasp/java-html-sanitizer .

The key things to understand about HTML sanitizers:

* They must be whitelist based

* They must be aware of context

* You must sanitize ALL user input even if you don't think you're going to render it on a web page.

The book linked to in the article does not seem to understand any of the above.

The section on sanitization has the equivalent of "string replace" as the primary recommendation. Elsewhere in the XSS section a focus is on escaping content before it is rendered.

Sanitization needs to know not to run on

 blocks, and to escape HTML entities automatically, and to understand what links are safe and which are not.

XSS can be really interesting and quite targeted. It can be that a user-agent contains the XSS, because the target may not be the person reading the page but the admin looking at a web page of their web server logs through an analytics program on the same domain.

The bluemonday package I wrote can deal with all of these things but that isn't the point, the point is that this is an area I know and the book falls way short of a decent standard for creating a secure and safe web application. And if it falls short in this area (the first 2 chapters) then I would assume that it falls short in all areas.

Exactly.

Whitelist only trusted schemes, do not wait to blacklist untrusted.

I wrote the Go HTML sanitizer: https://github.com/microcosm-cc/bluemonday and have a rule for user generated (untrusted) content that basically does whitelist just the things that one can trust: https://github.com/microcosm-cc/bluemonday/blob/master/helpe...

That states that URIs must be:

1. Parseable

2. Relative

3. Or one of: mailto http https

4. And that I will add rel="nofollow" to external links, and additionally I'll add rel="noopener" if the link has a target="_blank" attribute

Oh, and I do not trust Data URIs either.
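The URL rules above can be sketched with the standard library alone. This is an illustration under stated assumptions, not bluemonday's actual code: `urlAllowed` and `relFor` are hypothetical names, and "external" is simplified to "the URL has a host".

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// allowedSchemes mirrors rule 3: only these schemes are trusted.
var allowedSchemes = map[string]bool{"mailto": true, "http": true, "https": true}

// urlAllowed applies rules 1 to 3. Anything not explicitly allowed,
// including data: and javascript: URIs, is rejected.
func urlAllowed(raw string) bool {
	u, err := url.Parse(strings.TrimSpace(raw))
	if err != nil {
		return false // rule 1: must be parseable
	}
	if u.Scheme == "" {
		return true // rule 2: relative URLs carry no scheme
	}
	return allowedSchemes[strings.ToLower(u.Scheme)] // rule 3
}

// relFor sketches rule 4. It treats any URL with a host as external,
// which is a simplifying assumption made for this example.
func relFor(raw string, targetBlank bool) string {
	u, err := url.Parse(strings.TrimSpace(raw))
	if err != nil || u.Host == "" {
		return ""
	}
	rel := "nofollow"
	if targetBlank {
		rel += " noopener"
	}
	return rel
}

func main() {
	for _, raw := range []string{
		"/about",
		"https://example.com/",
		"mailto:someone@example.com",
		"javascript:alert(1)",
		"data:text/html;base64,PHNjcmlwdD4=",
	} {
		fmt.Printf("%-40q allowed=%v\n", raw, urlAllowed(raw))
	}
}
```

Because the check is an allowlist, there is no need to enumerate dangerous schemes: `javascript:`, `data:`, `vbscript:` and anything invented later are all rejected by the same final line.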

The lack of example code in the security section should be a worry to all.

It isn't hard to prevent SQL injection if you use parameterized SQL statements rather than string concatenation, and whilst examples of this are trivial they shouldn't be skipped.

In the XSS section it mentions filtering and checking inputs, but does not mention sanitization and does not give any examples. In the aversion to use any non-standard package it also does not mention https://github.com/microcosm-cc/bluemonday or anything similar (I am the author of bluemonday - a HTML sanitizer in the same vein as the OWASP Java HTML Sanitizer).

There is some sample code, in the Filtering section, but this only demonstrates the use of a fixed range of inputs in determining a filter, and then a regexp to check the input matches a format.

Beyond the security, where the theory is at least known even if a demonstration on how to implement it is lacking... the entire guide misses out on demonstrating templates to their fullest, and specifically using a map of functions to provide the equivalent of Django's TemplateTags.

In fact, the missing link for most Go developers who are building web apps, and for those coming from outside Go, is template tags. Most Go devs I know (who seem more systems focused) don't even realise that this stuff exists: https://golang.org/src/text/template/examplefunc_test.go

I use Markdown for user content, this is passed through a Go library I wrote to strip out iframes, embeds, etc... https://github.com/microcosm-cc/bluemonday and then as a post-processing task once I trust the content, I find the links that I know how to handle (YouTube, Bikely, etc) and embed third party content in iframes.

This is basically a way to do the equivalent of Twitter cards, it respects the JavaScript and web security model, but does mean that the iframes contain http content on a page that is https

Where I'm trying to get to is to have all iframes, etc be https
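The post-processing step described above might look something like the following sketch. It is an assumption-laden illustration rather than the author's code: `embedFor` is a hypothetical helper, only one provider is handled, and the 11-character video ID pattern is an assumed shape used here to avoid interpolating arbitrary input into markup.

```go
package main

import (
	"fmt"
	"net/url"
	"regexp"
	"strings"
)

// videoID is an assumed shape for a YouTube video ID (11 URL-safe characters).
var videoID = regexp.MustCompile(`^[A-Za-z0-9_-]{11}$`)

// embedFor returns iframe markup for links it knows how to handle.
// Unknown hosts and malformed links return ok == false, so the caller
// leaves the original (already sanitized) link untouched.
func embedFor(link string) (markup string, ok bool) {
	u, err := url.Parse(link)
	if err != nil {
		return "", false
	}
	switch strings.ToLower(u.Hostname()) {
	case "youtube.com", "www.youtube.com":
		id := u.Query().Get("v")
		if u.Path != "/watch" || !videoID.MatchString(id) {
			return "", false
		}
		// The embed URL is always emitted as https, whatever the input scheme.
		return `<iframe src="https://www.youtube.com/embed/` + id + `"></iframe>`, true
	}
	return "", false
}

func main() {
	markup, ok := embedFor("http://www.youtube.com/watch?v=dQw4w9WgXcQ")
	fmt.Println(ok, markup)
}
```

The order matters: sanitize first so that no user-supplied iframe survives, then generate iframes only from an allowlist of providers and validated identifiers.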

For the example given, you probably don't need Negroni, Controller or Render.

Mux is also possibly surplus, but does tidy up the extraction of values from routes a little. But to be totally honest the given example isn't complex enough to start letting Mux shine (multiple values on URLs, a single app serving both http and https).

Blackfriday (for Markdown) is necessary if you want to convert your text input into HTML, though there is a missing dependency here which is Bluemonday.

Bluemonday ( https://github.com/microcosm-cc/bluemonday ) is a HTML sanitizer and ensures that XSS supplied via the Markdown box is stripped before rendering - I wrote that, it's based on the whitelisting approach as demonstrated by the OWASP Java HTML Sanitizer. Blackfriday even recommends you clean your untrusted inputs: https://github.com/russross/blackfriday#sanitize-untrusted-c...

SQLite3 is required, or at least some SQL provider is going to be required if the tutorial wants to fully demonstrate actually saving content and using one of the established interfaces (database/sql) to do so.

So you could reduce this to three deps: Blackfriday, Bluemonday, SQLite3 and stick to core/stdlib http for the actual "web app in Go" stuff.

The other deps, it is debatable whether or not they are essential, certainly the larger web apps will find that they are simplified and easier to read and work with by using some combination of Mux, Negroni and Controller depending on your needs. Render I have not used and do not have a strong opinion on.

There's room for improvement in a lot of the packages out there, and I'm still surprised by what feels like obvious omissions.

An example of how even a core package could do with improving: "image" doesn't support all forms of progressive and/or interlaced PNG, any version of animated PNG, or WebP. The lack of progressive PNG means that you can't quite rely on it for user-uploaded images, and image/webp seems to be getting more traction and handling those appears to be growing in importance. Without these things, the other packages that rely on "image" to provide image processing can only be relied upon to work for JPG and GIF.

An example of an omission I found myself filling recently: a HTML sanitizer. I wanted something like the OWASP Java HTML sanitizer https://www.owasp.org/index.php/OWASP_Java_HTML_Sanitizer_Pr... , but could only find some very simplistic sanitizer packages (little to no ability to define the policy, simplistic URL checking, etc). I ended up writing my own package to help fill the gap https://github.com/microcosm-cc/bluemonday

You can tell I'm dealing with user generated/supplied content. Each domain still feels almost there (in having a toolbox of packages that do everything you need and are rock solid), but just shy of being there today.