Closed Bug 783958 Opened 12 years ago Closed 11 years ago

b2g email client needs its HTML security implementation reviewed

Categories

(Firefox OS Graveyard :: General, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: asuth, Assigned: freddy)

References

Details

Attachments

(1 file)

The e-mail client may receive HTML e-mails that are full of nefarious badness.  We have implemented a defense-in-depth strategy that is believed to be safe.  It needs to be reviewed.

Our strategy is derived from this discussion on the webapi list:
https://groups.google.com/d/topic/mozilla.dev.webapi/wDFM_T9v7Tc/discussion

With particular focus on Henri Sivonen's response:
https://groups.google.com/d/msg/mozilla.dev.webapi/wDFM_T9v7Tc/Nr9Df4FUwuwJ

The key idea is that sandboxed iframes are great but insufficient, because they currently have no means of preventing information leakage from displaying images, fetching external resources, or prefetching external resources.  (Content policies protect Thunderbird from this.)  We could push for a standardized mechanism along these lines.


The pipeline is generally:

1) HTML is sanitized using bleach.js: we use document.implementation.createHTMLDocument to create a document "loaded as data", cram the HTML in via innerHTML, and then traverse the DOM tree, removing everything that is not covered by a whitelist of tags, attributes (both global and per-tag), and CSS style rules (based on whitelisted properties).  Notes (a minimal sketch of the walk follows these notes):

- Our whitelist does not include <script> tags or any on* attributes.

- hrefs are not whitelisted; instead a helper function transfers the values (currently, iff they are http/https) onto an inert attribute we made up.

- srcs are not whitelisted; instead a helper function a) transfers http/https references onto an external-image attribute we made up, b) transfers cid references onto an embedded-image attribute we made up, and c) discards everything else.

- We currently don't whitelist any CSS properties that could include a url() reference (including shorthands like 'background'), although we'd want to support this along similar lines to our handling of srcs.
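
For illustration, the walk is roughly along these lines.  This is a simplified sketch, not the actual code; the whitelists and the 'ext-href' attribute name are made up for the example, and the real lists and logic live in htmlchew.js and bleach.js:

  // Simplified sketch only -- not the actual htmlchew.js/bleach.js code.
  var ALLOWED_TAGS  = { P: true, DIV: true, SPAN: true, A: true, B: true, I: true, IMG: true };
  var ALLOWED_ATTRS = { class: true, title: true, alt: true };

  function sanitizeHtml(dirtyHtml) {
    // Documents created this way are "loaded as data": no scripts run and no
    // external resources are fetched while we parse and walk the tree.
    var doc = document.implementation.createHTMLDocument('');
    doc.body.innerHTML = dirtyHtml;
    sanitizeNode(doc.body);
    return doc.body.innerHTML;
  }

  function sanitizeNode(node) {
    Array.prototype.slice.call(node.children).forEach(function (child) {
      if (!ALLOWED_TAGS[child.tagName]) {
        child.parentNode.removeChild(child);  // not whitelisted: remove outright
        return;
      }
      var link = null;
      Array.prototype.slice.call(child.attributes).forEach(function (attr) {
        if (attr.name === 'href' && /^https?:/i.test(attr.value)) {
          link = attr.value;  // remember http/https links before stripping
        }
        if (!ALLOWED_ATTRS[attr.name]) {
          child.removeAttribute(attr.name);
        }
      });
      if (link) {
        child.setAttribute('ext-href', link);  // inert, made-up attribute
      }
      sanitizeNode(child);
    });
  }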


2) The sanitized HTML is inserted into an iframe with sandbox="allow-same-origin" set on it.  We specify "allow-same-origin" because we need to be able to reach into the DOM tree to perform fix-ups.  Since "allow-scripts" is not set, this is believed safe.  Sandbox flags are being implemented on Bug 341604 and have not yet landed.

We are currently populating the iframe using iframe.contentDocument.open/write/close after setting the sandbox attribute.  The spec is clear that a navigation event must take place after the sandbox flag is set for the protection to take place.  It's not 100% clear that this covers our current mechanism.
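
Concretely, the population looks roughly like this ('container' and 'sanitizedHtml' are placeholder names for this sketch; the real code is in iframe-shims.js):

  // Simplified sketch of the population mechanism; see iframe-shims.js for the real code.
  var iframe = document.createElement('iframe');
  // allow-same-origin so we can reach into the DOM for fix-ups; allow-scripts is deliberately absent.
  iframe.setAttribute('sandbox', 'allow-same-origin');
  container.appendChild(iframe);                 // 'container' = wherever the message gets displayed
  iframe.contentDocument.open();
  iframe.contentDocument.write(sanitizedHtml);   // output of step 1
  iframe.contentDocument.close();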

Additionally, our app should be protected by a CSP that prevents all scripts except those from 'self', which should carry over to our same-origin iframe.  If we end up implementing the CSP sandbox directive on Bug 671389 that should also constrain the iframe (and potentially avoid any navigation semantics issues).  Note that I do not believe the CSP stuff has happened for our app yet.
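
For illustration only (both the exact policy and how it would be applied to the packaged app are assumptions at this point), the relevant part is a script-src restricted to 'self', along the lines of:

  Content-Security-Policy: default-src *; script-src 'self'; object-src 'none'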


3) If the user requests that embedded or external images be displayed, we traverse the DOM tree and transfer the contents of our inert made-up attributes onto the "src" attributes of images as appropriate.  For the embedded-image case, we map the cid reference to the Blob in which we stored the image and then use window.URL.createObjectURL.
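
A rough sketch of that fix-up (the 'cid-src'/'ext-src' attribute names and the partsByContentId lookup are placeholders for the example, not the actual names):

  // Illustrative only; attribute names and the Blob lookup are placeholders.
  function showImages(iframeDoc, partsByContentId) {
    var imgs = iframeDoc.querySelectorAll('img[cid-src], img[ext-src]');
    Array.prototype.forEach.call(imgs, function (img) {
      var cid = img.getAttribute('cid-src');
      if (cid && partsByContentId[cid]) {
        // Embedded image: map the cid reference back to the Blob we stored for that part.
        img.src = window.URL.createObjectURL(partsByContentId[cid]);
      } else if (img.hasAttribute('ext-src')) {
        // External image: only restored once the user explicitly asks for remote content.
        img.src = img.getAttribute('ext-src');
      }
    });
  }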

The tentative plan for clicking on links is to add an event listener for clicks, check whether the user clicked on an 'a', consult the inert place we stashed the href, and then prompt the user with "do you want to browse to [the actual link you clicked on]?".  This has not been UX-approved yet, but I think it is a reasonable way to a) help deal with potential phishing issues and our lack of a display-the-URL-on-hover implementation, and b) confirm that the user actually wants to trigger the browser, and for that specific URL, especially since it is easy to misinterpret a scroll or just fat-finger the wrong link.
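
Something along these lines (a sketch only; 'ext-href', the confirm prompt, and openLinkInBrowser are placeholders, and the UX is not final):

  // Sketch of the tentative plan; the prompt UX and browser dispatch are not final.
  iframeDoc.addEventListener('click', function (event) {
    var node = event.target;
    while (node && node.tagName !== 'A') {
      node = node.parentNode;
    }
    if (!node) {
      return;
    }
    event.preventDefault();
    var url = node.getAttribute('ext-href');  // the inert place we stashed the href
    if (url && window.confirm('Do you want to browse to ' + url + '?')) {
      openLinkInBrowser(url);  // hypothetical helper that hands the URL off to the browser
    }
  });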


## Implementation Details / Source ##

- bleach.js fork used:
https://github.com/asutherland/bleach.js

- bleach.js' actual JS logic:
https://github.com/asutherland/bleach.js/blob/master/lib/bleach.js

- bleach.js' unit tests (which use a different whitelist):
https://github.com/asutherland/bleach.js/tree/master/test

- our HTML sanitization whitelist and logic with lots of comments:
https://github.com/mozilla-b2g/gaia-email-libs-and-more/blob/master/data/lib/rdimap/imapclient/htmlchew.js

- our email lib tests related to HTML and sanitization:
https://github.com/mozilla-b2g/gaia-email-libs-and-more/blob/master/test/unit/test_mail_html.js
https://github.com/mozilla-b2g/gaia-email-libs-and-more/blob/master/test/unit/test_imap_mime.js

- our iframe creation logic, also with lots of comments:
https://github.com/mozilla-b2g/gaia/blob/master/apps/email/js/iframe-shims.js
Note: This implementation has already landed in gaia and is hooked-up to IMAP accounts, but not ActiveSync accounts.  Because MozTCPSocket has not landed, a custom b2g build is currently required in order to be exposed to the implementation or any risks it might have.
Assignee: nobody → ptheriault
Thanks for heaps of info, Andrew. I'll digest and get back to you asap.
The need to specify "allow-same-origin" seems like a potential issue with the iframe sandbox implementation.  Would it help if it were possible to "allow-same-origin" in one direction only, so that we could reach into the iframe but without the risk that it could reach out?

For example "allow-same-origin-in" and "allow-same-origin-out", or an up/down or inner/outer suffix to the same effect.
(In reply to Fred Andrews from comment #3)
> The need to specify "allow-same-origin" seems like an potential issue with
> the iframe sandbox implementation.  Would it help if it were possible to
> "allow-same-origin" in one direction only, so that we could reach into the
> iframe but without the risk that it could reach out?

The tighter restriction certainly would not harm us, but I'm not sure how useful that would be in this case or in general.  As origins go, e-mails are like data URIs.  To know the URL is to know its contents and be able to manipulate them.

"allow-same-origin-in" strikes me as an attempt to create a chrome-content barrier where one origin is more privileged than the other.  On the web proper, all origins (apart from some TLD sub-domain isues) are equal and it's a security hole to behave otherwise.

Do you know if we are giving anything up by using "allow-same-origin" (but not allowing script)?  Given the inert nature of the document, it seems like the primary risk would be that the HTTP requests related to (remote) image fetches could leak some type of information, such as the referrer.  (Link clicks are not currently implemented, but may be dispatched to the browser in such a way that no referrer is generated.)  And indeed, when I just ran b2g-desktop with gaia built with DEBUG=1 (so served off of http:// instead of as a packaged zip), the referrer did show up in the logs.  However, when I ran without DEBUG and therefore as a packaged app, no referrer goes over the wire to the webserver.  It's not clear if the latter behavior is intended or a side effect of the different protocol causing it to be filtered out.

It's not clear that the referrer constitutes an information leak, at least as long as we are sure to keep private data after the # in the URL if we start maintaining URL state.  One might even argue that it is appropriate for the referrer to indicate the e-mail application in use, since the user-agent will instead represent the browsing platform, and since it could also help webmasters track down the developers of the e-mail client should it start doing crazy things to their server.
(In reply to Andrew Sutherland (:asuth) from comment #4)
> (In reply to Fred Andrews from comment #3)
> > The need to specify "allow-same-origin" seems like an potential issue with
> > the iframe sandbox implementation.  Would it help if it were possible to
> > "allow-same-origin" in one direction only, so that we could reach into the
> > iframe but without the risk that it could reach out?
> 
> The tighter restriction certainly would not harm us, but I'm not sure how
> useful that would be in this case or in general.  As origins go, e-mails are
> like data URI's.  To know the URL is to know its contents and be able to
> manipulate them.
>
> "allow-same-origin-in" strikes me as an attempt to create a chrome-content
> barrier where one origin is more privileged than the other.  On the web
> proper, all origins (apart from some TLD sub-domain isues) are equal and
> it's a security hole to behave otherwise.

Yes, this was the intention.  Apps will be on the web and there is often a need for them to be more privileged than the content they frame.

Perhaps some future enhancements to <iframe mozbrowser> would make it usable by the email app.  It seems that <iframe mozbrowser> may support injection by the parent in the future, as this would be handy for other apps.  All of the restrictions the email client needs would be useful options for iframe mozbrowser or sandbox, so perhaps someday this could be used to simplify the email client.

> Do you know if we are giving anything up by using "allow-same-origin" (but
> not allowing script)?

It seems ok because the HTML has been sanitized so well.

...

> It's not clear that the referrer constitutes an information leak, at least
> as long as we are sure to keep private data after the # in the URL if we
> start maintaining URL state.  One might even argue that it is appropriate
> that the referrer indicate the e-mail application in use since the
> user-agent will instead represent the browsing platform, since it could also
> help webmasters track down the developers of the e-mail client should it
> start doing crazy things to their server.

It would seem more appropriate for the User-Agent to indicate the application rather than for it to be exposed in the 'referer', if this is even an issue.

There is a need for an app and/or mozbrowser to be able to change the user-agent string, to limit fingerprinting, and perhaps there is a case to be made for the email client app to modify the default user agent too.
Apologies for the delay in getting through this, Andrew. I am working on this as a priority at the moment.
Should have commented on this bug a long time ago: I finished reviewing this code and didn't find any issues.  I performed limited testing by using a modified version of the email app and injecting malicious content, and didn't find any issues that way either.  Given the complexity of the problem I think it would be good to develop a custom fuzzer to exercise the sanitization list thoroughly, but I haven't had a chance to do this.
Paul, we (well, mainly Vivien :) are working to move the e-mail back end into a worker thread.  I had originally been targeting a background page rather than a worker, and it turns out our workers are in fact incapable of doing DOM stuff.

As such, we find ourselves needing to use some type of HTML and CSS parser on the worker thread.  Are there any pre-existing security-reviewed implementations out there that you or the security team know of that we can use?  If there aren't, do you have any preferences as to what we do?  For example, SAX parser versus DOM-style?  Mozilla's dom.js implementation, existing node implementation, other?
Flags: needinfo?(ptheriault)
Mozilla's Thimble uses slowparse by Atul: https://github.com/mozilla/slowparse
Thanks, Josh!  slowparse looks pretty sweet in terms of implementation simplicity: it's webby rather than requiring a bunch of node shims, it has tests, etc.  Atul, is there any reason we would not want to use slowparse?
(In reply to Andrew Sutherland (:asuth) from comment #8)
> Paul, we (well, mainly Vivien :) are working to move the e-mail back end
> into a worker thread.  I had originally been targeting a background page
> rather than a worker, and it turns out our workers are in fact incapable of
> doing DOM stuff.
> 
> As such, we find ourselves need to use some type of HTML and CSS parser on
> the worker thread.  Are there any pre-existing security-reviewed
> implementations out there that you or the security team know of that we can
> use?  If there aren't, do you have any preferences as to what we do?  For
> example, SAX parser versus DOM-style?  Mozilla's dom.js implementation,
> existing node implementation, other?

I'm not aware of any pre-existing solutions. What is the timeline for this change?
Flags: needinfo?(ptheriault)
(In reply to Paul Theriault [:pauljt] from comment #11)
> I'm not aware of any pre-existing solutions. What is the timeline for this
> change?

Migrating the e-mail back-end to live in a worker (bug 814257) is part of the push to make the e-mail app be more responsive/appear faster/etc.  I think the dream is that we would uplift all of this stuff to v1.0.1 in the extremely short term (this week/ASAP/unreasonably soon/etc.), but a v1.1 uplift may be more realistic depending on what release management thinks of the risk profile over the various massive changes going on.

My sincere apologies for creating this situation by not realizing when I implemented the sanitizer that it was going to be a potentially major problem/change for migrating to a worker instead of a background page.  I was betting pretty hard on the background page thing being an option or our workers drastically improving.
Is there any way we could keep the sanitizer in the main thread, in addition to the worker (can you still get the performance increase)?

I am concerned about using a new parser: any differences in behavior between the worker parser and the parser in the browser are potential security vulnerabilities.  As an aside, I am aware of at least a few security issues on webmaker, though I'm not sure whether they relate specifically to the parser.
Hi Asuth and b2g folks!

So, while I would be totally flattered if you used slowparse, there are a number of things that might not be awesome for your needs, as the parser was created primarily to make HTML easier to learn and manipulate--not to sanitize. Among the potential drawbacks are:

* Slowparse uses the DOM API to create an actual DOM tree as it parses through the document. This part of the code is abstracted away into something called a "DOM Builder" [1], so it's possible to make an alternative implementation that doesn't use a page's DOM API directly (e.g. for use in a web worker). Currently, the direct use of the DOM API results in a few bugs [2], although none of them are security vulnerabilities.

* Slowparse has never been security reviewed; it also doesn't actually sanitize HTML. A separate extension to Slowparse called "forbidJS" [3] can be used to report errors for the most common uses of JS via script tags, javascript URLs, and event-handling attributes, but these errors are only intended to let users know as early as possible that their JS won't work when published: the JS isn't actually stripped from the HTML--that's done on the server-side, via Bleach.

* Slowparse isn't actually a fully-featured HTML5 parser. There are a number of HTML5 "shortcuts" that it's oblivious to [4].

All that said, patches are welcome. I also think that the Popcorn team has been investigating ways to sanitize HTML using JavaScript, at the very least via nodejs, so you might want to consult with them too.

[1] http://mozilla.github.com/slowparse/#section-93
[2] https://github.com/mozilla/slowparse/issues/51
[3] http://mozilla.github.com/slowparse/?tree-inspectors.js#section-7
[4] https://github.com/mozilla/slowparse/issues/14
As a status update, Vivien (:vingtetun) has been porting a worker-thread SAX parser and adapting our bleach.js implementation to use it.
As an update, we had an off-thread discussion where many people thought we should surface a WebAPI for this stuff, but no pure JS implementations were immediately available.  Accordingly, Vivien went with John Resig's "Pure Javascript HTML Parser", available at ejohn.org/blog/pure-javascript-html-parser/.  It was only labeled as MPL licensed (although it was derived from a tri-licensed, Apache 2.0-including upstream), but I asked John and he has kindly allowed us to use it under Apache 2.0, which is our preferred license for Gaia.

There were some potential security bugs in the implementation that I believe I've resolved:

- The HTML parser seemed to be under the incorrect impression that you could escape double-quotes with backslashes inside attributes.  This was a problem because the sanitizer was up-converting single-quoted attributes and unquoted attributes to become double-quoted attributes.  The up-conversion seems reasonable to me, although I don't strongly care, so I've fixed the code to properly escape attributes and added some unit tests to our bleach.js.

- A greedy .* match was used for script and style tags, so if there was more than one script/style tag in a document, the match would swallow everything in between (illustrated in the sketch after this list).

- The dynamically built regexp for script/style was case-sensitive, which is unsafe, so the same over-matching could happen.
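
To illustrate the over-matching pitfall (this is not the parser's actual code, just a demonstration of the pattern and the fix):

  // Illustration of the over-matching pitfall, not the parser's actual regexps.
  var html = '<script>a()</script><p>hi</p><script>b()</script>';
  var greedy = /<script[^>]*>([\s\S]*)<\/script>/;    // greedy and case-sensitive
  greedy.exec(html)[1];
  //  -> 'a()</script><p>hi</p><script>b()'  -- the <p> between the two blocks is swallowed
  var fixed = /<script[^>]*>([\s\S]*?)<\/script>/gi;  // non-greedy, case-insensitive
  //  each script block now matches separately, and <SCRIPT ...> variants terminate correctly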


I've also made the parser/sanitizer strip and ignore namespace prefixes on tags for the time being.  This was done for consistency with the DOM-based implementation (which ignored 'prefix' and 'namespaceURI') rather than out of any security concern.  Since we operate on a whitelist, namespacing would just cause valid things to become invalid.  If you were to whitelist 'xmlns' as an attribute, though, things would get more exciting, including requiring CDATA parsing, which the parser does not support.

I still have a bit more to do on cleaning up the parser/sanitizer.  I'll ping here again when I am done hacking on it and review/audit can begin.  I have created a pull request at https://github.com/asutherland/bleach.js/pull/1 which will be what people get pointed at.
Okay, the HTML sanitizer passes both its bleach.js tests and our gaia-email-libs-and-more tests.  Paul, the pull request is here if you could take a look or find someone who can take a look:
https://github.com/asutherland/bleach.js/pull/1

Barring major complications, this will likely land on gaia/master early next week where it will bake for a little before being uplifted to v1.0.1 and v1.1.
Flags: needinfo?(ptheriault)
Thanks for the update, Andrew.  I'll take a look.  I am concerned about the parser quirks, but I will have a look at them in the context of the other mitigating controls.
Flags: needinfo?(ptheriault)
I've been able to focus more on the CSS parsing side of the house now, and we are going to need to do something more sophisticated than the naive '}', '{', ';' splitting Vivien did as an initial stop-gap implementation.  (Which is not really a surprise; I should have called that out in my previous statement that I did not think we were good to go with CSS.)

Specifically, this CSS string is able to cause us to pass through background-image, which is undesirable from an information-leakage perspective:

body { /* } */ background-image: url(http://example.com/EVIL.png); /* { color: red; */ }
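
For illustration, here is one way a comment-unaware splitter (not necessarily the exact stop-gap code) gets confused by that string:

  // Illustration of why comment-unaware splitting fails; not the actual stop-gap code.
  var css = 'body { /* } */ background-image: url(http://example.com/EVIL.png); /* { color: red; */ }';
  // A splitter that looks for the first '}' to end the rule stops inside the comment:
  css.slice(css.indexOf('{') + 1, css.indexOf('}'));
  //  -> ' /* '  -- the background-image declaration never lands inside a rule body it inspects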


Given that CSS explicitly calls for error recovery that involves parsing matched characters and escapes, I think continuing with the regexp implementation is a bad idea.

It looks like the best option is Tab Atkins' "Standards-Based CSS Parser" at https://github.com/tabatkins/css-parser and which he introduces in a blog post at http://www.xanthir.com/blog/b4Ju0.  Tab is the editor of the CSS3 spec, and the parser is designed to match the spec at the potential cost of speed/performance.  Since Gecko ideally conforms to the spec, this seems like the best way to ensure that what the sanitizer sees is what the Gecko CSS parser would see if it read the same thing.

Note that the implementation is currently very limited in terms of unit tests, so we would be relying on the correctness of its translation of the spec at the current time.  Because the parser does not include integrated stringification, we would likely want to run the lexer in location-storing mode and slightly modify the parser to operate in a somewhat streaming fashion that allows us to emit or not emit the verbatim text provided as appropriate.

The long-term performance plan would be to build up some serious nefarious test coverage before we would consider moving to a parser that might in fact deviate from spec.


The other strong option would appear to be the caja CSS parser:
http://code.google.com/p/google-caja/source/browse/trunk/src/com/google/caja/plugin/csslexer.js
http://code.google.com/p/google-caja/source/browse/trunk/src/com/google/caja/plugin/cssparser.js

caja is intended to do things like this, and even has a unit test case for error recovery, but does not have overwhelming coverage at the unit level, at least:
http://code.google.com/p/google-caja/source/browse/trunk/tests/com/google/caja/plugin/cssparser_test.js#236


There are a number of other pure JS parsers out there too.  I have done a brief skim of the readmes and the unit test directories and none of them jump out at me as advertising any specific level of spec compliance or having specific error-recovery unit tests, but it was indeed a very brief skim!
http://www.glazman.org/JSCSSP/
https://github.com/NV/CSSOM
https://github.com/visionmedia/css-parse
https://github.com/css/gonzales
As an update, we went with the CSS3 Syntax spec-derived implementation at https://github.com/tabatkins/css-parser.  This landed very late on 4/3 in bug 814257.  As per previous discussion with :pauljt, where I believe consensus was reached on this, we believe the protections provided by CSP and the iframe sandbox, both of which prevent the execution of JS in the iframe context, are keeping us safe from malicious code.  The fallout of a failure of the sanitizer is expected to be limited to information leakage (that a message is being displayed), namely from URI references that would be actively fetched (images) or pre-fetched (appropriately marked links, DNS?).

We continue to err on the side of removing HTML attributes and CSS declarations that *might* contain such references.  Bug 857914 has been filed on not erring so much about that.
Assignee: ptheriault → fbraun
Depends on: 897524
The current setup is very cumbersome to set up.  htmlchew.js doesn't really work as well in a browser as it's supposed to... where would I get define(), for example? :/
Depends on: 899070
This part of the review is done. Please address bug 899070 for a secreview+ :)
I am attaching the test files I produced during my tests.  Feel free to play with them. :)
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED