Closed Bug 108153 Opened 23 years ago Closed 23 years ago

HTML sanitizer

Categories

(MailNews Core :: MIME, enhancement, P2)

enhancement

Tracking

(Not tracked)

VERIFIED FIXED
mozilla1.0

People

(Reporter: BenB, Assigned: BenB)

References

(Blocks 1 open bug, )

Details

(Keywords: qawanted, Whiteboard: Done, waiting for review)

Attachments

(4 files, 1 obsolete file)

(Create a stream converter that) cleans up HTML documents. Removes everything from the doc that is not explicitly allowed. This helps a lot with security. Users are e.g. HTML mail or security fanatics browsing the web.
Depends on: 18427
No longer depends on: 18427
Blocks: 18427
Summary: HTML santinizer → HTML sanitizer
Blocks: 31907
Depends on: 125881
Good news! I have a rudimentary version working. It strips all tags and attributes, which are not explicitly allowed, from the document. That way, you can have the advantages of HTML (expressive and mostly unambiguous syntax, in contrast to plaintext) without most of the disadvantages (large security and privacy problems). In fact, I think that this feature is what makes HTML mail a useable (and possibly preferable) option. THe code might also be used to an ultra-secure (and uncomfortable) browser, but I don't have this implemented yet. My current implementation is, however, relatively dumb about what can be allowed and disallowed - you can only (dis)allow certain tags and attributes, but not where they may appear. (This makes the implementation much more simple.) E.g. it is impossible to disallow a certain element being a child of another element or to disallow multiple <head> nodes. Currently, you can't even disallow text directly inside an element, e.g. the <head> node (I might fix that case, however). Disallowed tags and attributes will just not be output to the result. This is in line with the HTML convention of just ignoring unknown tags. (You should be able to remove the <applet>/</applet> tags an still get a reasonable document.) I also hooked this up with Mailnews' libmime. I added it as new option to the View|Body menu introduced in bug 30888. The current implementation takes a (Mozilla Gecko-)parsed HTML document and outputs HTML source to a string. This means that we have to parse the result again. I am aware of the performance implications and I wrote a longer explanation in the source as comments. This blocks bug 54184, because I want to make this the default in Beonex Communicator. I don't request review yet, because I want to clean up the source more and refactor it more. Maybe, I'll be done later this day.
Blocks: Beonex
Status: NEW → ASSIGNED
Target Milestone: --- → mozilla0.9.9
This bug exposes bug 125983.
This is a screenshot once of the message with the sender's HTML (as rendered currently by Mozilla) (left) and once ran through the sanitizer (if enabled) (right). The sanitizer pref was sat to allow tables and some presentational formatting like <b>. You could also disallow tables, in which case everything would be rendered in in a single colunm (in the order it appears in the HTML source). Note that the left original has the much-hated horizontal scrollbar, while the right sanitized version doesn't.
Depends on: 126082
This is the current state of my work. It still has some rough edges (e.g. a leak), which I need to clean up. It is needed for the patch in bug 30888, however. There, you can find the rest of the fix for this bug (namely, the Mailnews part).
Attached file Test app, version 1
This is a test application for the sink. You can copy it over htmlparser/tests/outsinks/Convert.cpp. Then, invoke TestOutput with an HTML file as param. Don't forget to set LD_LIBRARY_PATH and MOZILLA_FIVE_HOME.
There's a tiny error in the diff: in mozSanitizingHTMLSerializer.cpp, line 24, there a question mark after the comment. Please remove it.
Blocks: 30888
akk pointed me to the DTD classes (e.g. htmlparser/src/nsWebFormedDTD.*) and said that I might be able to use them to restrict the allowed tags/attributes more flexible and cleanly. That hint is much appreciated, but I guess it's too late for that now - I'd like to get that in for 1.0 and have no time to complete rewrite the content/ part of this bug, considering that the current solution works.
I filled in the destruction or clean up mem. I am now requesting review. akk, can you do that, please? The code is not pretty - I am not proud of the implementation. But maybe it will be rewritten later with DTDs anyways. And the code seems to do its job for now and sctaches a big itch (solution/workaround for bug 28327). To see the code in action, you need either the test app (see above) or the non-overlapping part of the oatch for bug 30888 (which contains part of this bug) (take the "everything" patch there and manually delete everything content/ from the patch before applying it, because it contains an older version of the patch here).
Attachment #70305 - Attachment is obsolete: true
Keywords: patch, review
Priority: -- → P2
Whiteboard: Done, waiting for review
Target Milestone: mozilla0.9.9 → mozilla1.0
Attached patch Mac changesSplinter Review
This is needed in order to build the patch (version 2) on Mac
ben: can you confirm this is a networking bug?
In its initial constellation, it would have had a network component. This component is not yet implemented yet, though, and will be moved to a new bug. Moving to Mailnews for the lack of a better bugzilla Component. I suggest the QA is the same as bug 30888.
Component: Networking → MIME
Product: Browser → MailNews
Status: ASSIGNED → RESOLVED
Closed: 23 years ago
Resolution: --- → FIXED
Fix checked into trunk. Yay, finally! Thanks to all reviewers and all others involved. This is not yet in the 1.0 branch. I still would like to check it into the branch. If not, it would not be in any Mozilla 1.0 releases, but only trunk nightlies and 1.1alpha. Please test the feature and try to circumvent it, i.e. get arbitary HTML rendered although the new options are active. If so, please file a new bug against Mailnews/MIME, owner me, and mention it here. Some documentation can be found at <http://www.beonex.com/temp/index.html>. When I have ftp access to my private server again, I will move it to <http://www.bucksch.com/1/projects/mozilla/108153>.
+qawanted - I'm probably not the best person to QA this, both for time and expertise reasons.
Keywords: qawanted
Blocks: 138249
No longer blocks: 138249
I have verified the mailnews portion of this bug when verifying 30888. Is there someone who can verifiy the browser portion, I don't see a preference or menu item for sanitizing browser pages as seen in comment #3's attachment.
Esther, I didn't implement the browser option (yet). It was originally an idea, but I didn't have the time/mood/use-case for it. I'll track that in another bug (I will reference it here), if I will implement it one day. So, if you verified the Mailnews part, you verified it all already ;-). The screenshot in comment 3 is Mailnews.
Your right, sorry I didn't see that. Then I'll verify this. Note, I see in bug 18427 that you make a statement that since this bug is fixed there is nothing left to do in 18427. However a bug marked a dup of 18427 (the dup is 30896) is still being worked on by sspitzer and ssu after the fixed date for this bug. These bugs are getting complex and hard to follow. Would you say that 30896 is not a dup of 18427. If so, then I will change the dup status of 30896 and then verify 18427 per testing I did for bug 30888.
hm? Bug 30896 is no longer a dup of bug 18427, see bug 30896 comment 9 and following.
verified
Status: RESOLVED → VERIFIED
QA Contact: benc → esther
How to enable this (bug 108153's) fix, as far as I can tell (it's not in the UI, AFAIK): The code says: + aPref is a long string, which holds an exhaustive list of allowed tags + and attributes. All other tags and attributes will be removed. + + aPref has the format + "html head body ul ol li a(href,name,title) img(src,alt,title) #text" + i.e. + - tags are separated by whitespace + - the attribute list follows the tag directly in brackets + - the attributes are separated by commas. preference setting mailnews.display.html_sanitizer.allowed_tags (see URL: about:config) is html head title body p br div(lang,title) h1 h2 h3 h4 h5 h6 ul ol li(value,start,compact) dl dt dd blockquote(type,cite) pre noscript noframes strong em sub sup span(lang,title) acronym(title) abbr(title) del(title,cite,datetime) ins(title,cite,datetime) q(cite) a(href,name,title) img(alt,title,longdesc,src) base(href) area(alt) applet(alt) object(alt) var samp dfn address kbd code cite s strike tt b i table(align) caption tr(align,valign) td(rowspan,colspan,align,valign) th(rowspan,colspan,align,valign) by default. I'm deleting div(...), src from img, applet(...), and object(...). As an attempted partial workaround for bug 28327. Not sure about the safety of base and area, among other tags. Still doesn't block ReadNotify, I think. Dox such as this are needed in http://www.mozilla.org/unix/customizing.html#prefs; akkana%netscape.com, can you add some dox for this pref there? (OT: akkana, some mention of where the prefs file is for the poor M$ users and/or on URL: about:config would be great too. Yeah I know its in /unix/, but now it's referenced in the Release Notes for all platforms...)
> akkana%netscape.com, can you add some dox for this pref there? Okay, I've now read through bug 28327 (thanks for the pointer!) But I'm not sure I understand your setting. Can you explain why div has to be blocked, but span doesn't? I understand removing applet and object (Ben, I'm surprised those are normally included -- is there a reason for that?) and I thought removing img entirely was part of the whole point of sanitized html? (I can't tell, because I block images in mail anyway, so the fact that I never see them in sanitized html doesn't tell me anything.) What about area -- if img is gone, does area make any sense? I also wonder about base -- anybody know? Seems like removing it from the list should be fairly safe. Anyway, I'm quite happy to document it, but I'd appreciate it if someone who's been following the issue longer than I have would provide a sensible comment so I don't have to make one up. And we might as well get a good list to suggest, or perhaps a couple of good lists for varying levels of security. > akkana, some mention of where the prefs file is for the poor M$ user Someone who actually is a MS user should mail me that info as a contribution. Do "find files on disk" or whatever they're calling it now, and look for prefs.js, and you'll probably find it. Then write that up in a way that will make sense to a typical MS user, and mail it to me, and I'll check it in (with credit). A Linux user writing guesses at how MS users might find their prefs file is probably not the best approach to documentation. :-) (I will write up something on about:config.)
> How to enable this (bug 108153's) fix ... (it's not in the UI, AFAIK): It is. Menu|Message|Body|Simple HTML (Checked in as part of bug 30888, see comment 1.) You cannot customize (via the UI) the list of HTML tags to allow, though. > I'm deleting div(...), src from img, applet(...), and object(...). > As an attempted partial workaround for bug 28327. You don't need to do that. By default, it doesn't allow anything that causes network activity (IIRC). If you can live with (or want) the reduced HTML, this bug completely fixes bug 28327, unless I missed something. > some mention of where the prefs file is for the poor M$ user uhm, release notes? It's probably explained somewhere under www.mozilla.org's end-user doc section as well. Maybe just link it. Hey akk, nice to talk to you again, > I understand removing applet and object (Ben, I'm surprised those > are normally included -- is there a reason for that?) Unless I'm missing something, only the |alt| attribute of <object>, <applet> etc. is allowed and <param> is not allowed at all, so the tags are intentionally crippled so much that only the alternative text remains, the active content is disfunctional (at least it should be). > I thought removing img entirely was part of the whole point of > sanitized html? Not quite. The point was mainly to get rid of any privacy and security threats that HTML mail poses. External images are privacy threats, images embedded in the mail are not and thus allowed. Getting rid of the formatting annoyance of HTML was also a goal. Images are (or at least can be) important content, so I allowed them, when possible. You could argue that images might have offending content, but then again, attached images will be shown inline with plaintext as well, so we are no different here. You can remove the <img> tag from the pref, if it bothers you, but attached images will still display. <political type="offending">In short, the goal was to turn HTML back into what is was supposed to be: description of content and structure, with the presentation left to the reader. I.e. the superiour text format (if both parties' software can deal with it properly) ;-).</political> > I also wonder about base -- anybody know? Seems like removing > it from the list should be fairly safe. I don't see that it would cause any harm, and removing it may break relative links in <a href="">s. > Anyway, I'm quite happy to document it, but I'd appreciate it if someone > who's been following the issue longer than I have would provide a sensible > comment so I don't have to make one up. There is <http://www.bucksch.org/1/projects/mozilla/108153>. Not sure, if it's what you're looking for. > perhaps a couple of good lists for varying levels of security. Good idea.
I have junk mail that produces the assert: ASSERTION: all the skipped content tokens did not get handled: 'mSkippedContent.GetSize() == 0', file CNavDTD.cpp, line 979 but only if I view it as Simple HTML and not Original HTML. Would that be a bug here?
Product: MailNews → Core
Product: Core → MailNews Core
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: