Closed Bug 108153 Opened 18 years ago Closed 18 years ago

HTML sanitizer

Categories

(MailNews Core :: MIME, enhancement, P2)

enhancement

Tracking

(Not tracked)

VERIFIED FIXED
mozilla1.0

People

(Reporter: BenB, Assigned: BenB)

References

(Blocks 1 open bug, )

Details

(Keywords: qawanted, Whiteboard: Done, waiting for review)

Attachments

(4 files, 1 obsolete file)

(Create a stream converter that) cleans up HTML documents.

Removes everything from the doc that is not explicitly allowed.

This helps a lot with security. Users are e.g. HTML mail or security fanatics
browsing the web.
Depends on: 18427
No longer depends on: 18427
Blocks: 18427
Summary: HTML santinizer → HTML sanitizer
Blocks: 31907
Depends on: 125881
Good news! I have a rudimentary version working. It strips all tags and
attributes, which are not explicitly allowed, from the document.

That way, you can have the advantages of HTML (expressive and mostly unambiguous
syntax, in contrast to plaintext) without most of the disadvantages (large
security and privacy problems). In fact, I think that this feature is what makes
HTML mail a useable (and possibly preferable) option.
THe code might also be used to an ultra-secure (and uncomfortable) browser, but
I don't have this implemented yet.

My current implementation is, however, relatively dumb about what can be allowed
and disallowed - you can only (dis)allow certain tags and attributes, but not
where they may appear. (This makes the implementation much more simple.) E.g. it
is impossible to disallow a certain element being a child of another element or
to disallow multiple <head> nodes.
Currently, you can't even disallow text directly inside an element, e.g. the
<head> node (I might fix that case, however).

Disallowed tags and attributes will just not be output to the result. This is in
line with the HTML convention of just ignoring unknown tags. (You should be able
to remove the <applet>/</applet> tags an still get a reasonable document.)

I also hooked this up with Mailnews' libmime. I added it as new option to the
View|Body menu introduced in bug 30888.

The current implementation takes a (Mozilla Gecko-)parsed HTML document and
outputs HTML source to a string. This means that we have to parse the result
again. I am aware of the performance implications and I wrote a longer
explanation in the source as comments.

This blocks bug 54182, because I want to make this the default in Beonex
Communicator.

I don't request review yet, because I want to clean up the source more and
refactor it more. Maybe, I'll be done later this day.
Blocks: Beonex
Status: NEW → ASSIGNED
Target Milestone: --- → mozilla0.9.9
This bug exposes bug 125983.
This is a screenshot once of the message with the sender's HTML (as rendered
currently by Mozilla) (left) and once ran through the sanitizer (if enabled)
(right). The sanitizer pref was sat to allow tables and some presentational
formatting like <b>. You could also disallow tables, in which case everything
would be rendered in  in a single colunm (in the order it appears in the HTML
source).
Note that the left original has the much-hated horizontal scrollbar, while the
right sanitized version doesn't.
Depends on: 126082
This is the current state of my work. It still has some rough edges (e.g. a
leak), which I need to clean up. It is needed for the patch in bug 30888,
however. There, you can find the rest of the fix for this bug (namely, the
Mailnews part).
Attached file Test app, version 1
This is a test application for the sink. You can copy it over
htmlparser/tests/outsinks/Convert.cpp. Then, invoke TestOutput with an HTML
file as param. Don't forget to set LD_LIBRARY_PATH and MOZILLA_FIVE_HOME.
There's a tiny error in the diff: in mozSanitizingHTMLSerializer.cpp, line 24,
there a question mark after the comment. Please remove it.
Blocks: 30888
akk pointed me to the DTD classes (e.g. htmlparser/src/nsWebFormedDTD.*) and
said that I might be able to use them to restrict the allowed tags/attributes
more flexible and cleanly. That hint is much appreciated, but I guess it's too
late for that now - I'd like to get that in for 1.0 and have no time to complete
rewrite the content/ part of this bug, considering that the current solution works.
I filled in the destruction or clean up mem.

I am now requesting review. akk, can you do that, please?

The code is not pretty - I am not proud of the implementation. But maybe it
will be rewritten later with DTDs anyways. And the code seems to do its job for
now and sctaches a big itch (solution/workaround for bug 28327).

To see the code in action, you need either the test app (see above) or the
non-overlapping part of the oatch for bug 30888 (which contains part of this
bug) (take the "everything" patch there and manually delete everything content/
from the patch before applying it, because it contains an older version of the
patch here).
Attachment #70305 - Attachment is obsolete: true
Keywords: patch, review
Priority: -- → P2
Whiteboard: Done, waiting for review
Target Milestone: mozilla0.9.9 → mozilla1.0
Attached patch Mac changesSplinter Review
This is needed in order to build the patch (version 2) on Mac
ben: can you confirm this is a networking bug?
In its initial constellation, it would have had a network component. This
component is not yet implemented yet, though, and will be moved to a new bug.
Moving to Mailnews for the lack of a better bugzilla Component. I suggest the QA
is the same as bug 30888.
Component: Networking → MIME
Product: Browser → MailNews
Status: ASSIGNED → RESOLVED
Closed: 18 years ago
Resolution: --- → FIXED
Fix checked into trunk. Yay, finally!

Thanks to all reviewers and all others involved.

This is not yet in the 1.0 branch. I still would like to check it into the
branch. If not, it would not be in any Mozilla 1.0 releases, but only trunk
nightlies and 1.1alpha.

Please test the feature and try to circumvent it, i.e. get arbitary HTML
rendered although the new options are active. If so, please file a new bug
against Mailnews/MIME, owner me, and mention it here.

Some documentation can be found at <http://www.beonex.com/temp/index.html>. When
I have ftp access to my private server again, I will move it to
<http://www.bucksch.com/1/projects/mozilla/108153>.
+qawanted - I'm probably not the best person to QA this, both for time and
expertise reasons.
Keywords: qawanted
Blocks: 138249
No longer blocks: 138249
I have verified the mailnews portion of this bug when verifying 30888.  Is there
someone who can verifiy the browser portion, I don't see a preference or menu
item for sanitizing browser pages as seen in comment #3's attachment.
Esther, I didn't implement the browser option (yet). It was originally an idea,
but I didn't have the time/mood/use-case for it. I'll track that in another bug
(I will reference it here), if I will implement it one day. So, if you verified
the Mailnews part, you verified it all already ;-). The screenshot in comment 3
is Mailnews.
Your right, sorry I didn't see that. Then I'll verify this.   Note, I see in bug
18427 that you make a statement that since this bug is fixed there is nothing
left to do in 18427.  However a bug marked a dup of 18427 (the dup is 30896) is
still being worked on by sspitzer and ssu after the fixed date for this bug.
These bugs are getting complex and hard to follow.  Would you say that 30896 is
not a dup of 18427.  If so, then I will change the dup status of 30896 and then
verify 18427 per testing I did for bug 30888.  
hm? Bug 30896 is no longer a dup of bug 18427, see bug 30896 comment 9 and
following.
verified 
Status: RESOLVED → VERIFIED
QA Contact: benc → esther
How to enable this (bug 108153's) fix, as far as I can tell (it's not in the UI,
AFAIK):
The code says:
+   aPref is a long string, which holds an exhaustive list of allowed tags
+   and attributes. All other tags and attributes will be removed.
+
+   aPref has the format
+   "html head body ul ol li a(href,name,title) img(src,alt,title) #text"
+   i.e.
+   - tags are separated by whitespace
+   - the attribute list follows the tag directly in brackets
+   - the attributes are separated by commas.

preference setting mailnews.display.html_sanitizer.allowed_tags (see URL:
about:config) 

is

html head title body p br div(lang,title) h1 h2 h3 h4 h5 h6 ul ol
li(value,start,compact) dl dt dd blockquote(type,cite) pre noscript noframes
strong em sub sup span(lang,title) acronym(title) abbr(title)
del(title,cite,datetime) ins(title,cite,datetime) q(cite) a(href,name,title)
img(alt,title,longdesc,src) base(href) area(alt) applet(alt) object(alt) var
samp dfn address kbd code cite s strike tt b i table(align) caption
tr(align,valign) td(rowspan,colspan,align,valign) th(rowspan,colspan,align,valign) 

by default.

I'm deleting div(...),  src from img, applet(...), and object(...).  As an
attempted partial workaround for bug 28327.
Not sure about the safety of base and area, among other tags.  Still doesn't
block ReadNotify, I think.

Dox such as this are needed in http://www.mozilla.org/unix/customizing.html#prefs; 

akkana%netscape.com, can you add some dox for this pref there? 
(OT: akkana, some mention of where the prefs file is for the poor M$ users
and/or on URL: about:config would be great too.  Yeah I know its in /unix/, but
now it's referenced in the Release Notes for all platforms...)
> akkana%netscape.com, can you add some dox for this pref there?

Okay, I've now read through bug 28327 (thanks for the pointer!)  But I'm not
sure I understand your setting.  Can you explain why div has to be blocked, but
span doesn't?  I understand removing applet and object (Ben, I'm surprised those
are normally included -- is there a reason for that?) and I thought removing img
entirely was part of the whole point of sanitized html?  (I can't tell, because
I block images in mail anyway, so the fact that I never see them in sanitized
html doesn't tell me anything.)  What about area -- if img is gone, does area
make any sense?  I also wonder about base -- anybody know?  Seems like removing
it from the list should be fairly safe.

Anyway, I'm quite happy to document it, but I'd appreciate it if someone who's
been following the issue longer than I have would provide a sensible comment so
I don't have to make one up.  And we might as well get a good list to suggest,
or perhaps a couple of good lists for varying levels of security.

> akkana, some mention of where the prefs file is for the poor M$ user

Someone who actually is a MS user should mail me that info as a contribution. 
Do "find files on disk" or whatever they're calling it now, and look for
prefs.js, and you'll probably find it.  Then write that up in a way that will
make sense to a typical MS user, and mail it to me, and I'll check it in (with
credit).  A Linux user writing guesses at how MS users might find their prefs
file is probably not the best approach to documentation. :-)
(I will write up something on about:config.)
> How to enable this (bug 108153's) fix ... (it's not in the UI, AFAIK):

It is. Menu|Message|Body|Simple HTML

(Checked in as part of bug 30888, see comment 1.)

You cannot customize (via the UI) the list of HTML tags to allow, though.

> I'm deleting div(...),  src from img, applet(...), and object(...).
> As an attempted partial workaround for bug 28327.

You don't need to do that. By default, it doesn't allow anything that causes
network activity (IIRC). If you can live with (or want) the reduced HTML, this
bug completely fixes bug 28327, unless I missed something.

> some mention of where the prefs file is for the poor M$ user

uhm, release notes? It's probably explained somewhere under www.mozilla.org's
end-user doc section as well.
Maybe just link it.


Hey akk, nice to talk to you again,

> I understand removing applet and object (Ben, I'm surprised those
> are normally included -- is there a reason for that?)

Unless I'm missing something, only the |alt| attribute of <object>, <applet>
etc. is allowed and <param> is not allowed at all, so the tags are intentionally
crippled so much that only the alternative text remains, the active content is
disfunctional (at least it should be).

> I thought removing img entirely was part of the whole point of
> sanitized html?

Not quite. The point was mainly to get rid of any privacy and security threats
that HTML mail poses. External images are privacy threats, images embedded in
the mail are not and thus allowed.

Getting rid of the formatting annoyance of HTML was also a goal. Images are (or
at least can be) important content, so I allowed them, when possible. You could
argue that images might have offending content, but then again, attached images
will be shown inline with plaintext as well, so we are no different here. You
can remove the <img> tag from the pref, if it bothers you, but attached images
will still display.

<political type="offending">In short, the goal was to turn HTML back into what
is was supposed to be: description of content and structure, with the
presentation left to the reader. I.e. the superiour text format (if both
parties' software can deal with it properly) ;-).</political>

> I also wonder about base -- anybody know?  Seems like removing
> it from the list should be fairly safe.

I don't see that it would cause any harm, and removing it may break relative
links in <a href="">s.

> Anyway, I'm quite happy to document it, but I'd appreciate it if someone
> who's been following the issue longer than I have would provide a sensible
> comment so I don't have to make one up.

There is <http://www.bucksch.org/1/projects/mozilla/108153>. Not sure, if it's
what you're looking for.

> perhaps a couple of good lists for varying levels of security.

Good idea.
I have junk mail that produces the assert:
ASSERTION: all the skipped content tokens did not get handled:
'mSkippedContent.GetSize() == 0', file CNavDTD.cpp, line 979
but only if I view it as Simple HTML and not Original HTML.  Would that be a bug
here?
Product: MailNews → Core
Product: Core → MailNews Core
You need to log in before you can comment on or make changes to this bug.