Closed
Bug 108153
Opened 23 years ago
Closed 23 years ago
HTML sanitizer
Categories
(MailNews Core :: MIME, enhancement, P2)
MailNews Core
MIME
Tracking
(Not tracked)
VERIFIED
FIXED
mozilla1.0
People
(Reporter: BenB, Assigned: BenB)
References
(Blocks 1 open bug, )
Details
(Keywords: qawanted, Whiteboard: Done, waiting for review)
Attachments
(4 files, 1 obsolete file)
129.41 KB,
image/png
|
Details | |
6.03 KB,
text/plain
|
Details | |
42.85 KB,
patch
|
Details | Diff | Splinter Review | |
3.84 KB,
patch
|
Details | Diff | Splinter Review |
(Create a stream converter that) cleans up HTML documents.
Removes everything from the doc that is not explicitly allowed.
This helps a lot with security. Users are e.g. HTML mail or security fanatics
browsing the web.
Updated•23 years ago
|
Summary: HTML santinizer → HTML sanitizer
Assignee | ||
Comment 1•23 years ago
•
|
||
Good news! I have a rudimentary version working. It strips all tags and
attributes, which are not explicitly allowed, from the document.
That way, you can have the advantages of HTML (expressive and mostly unambiguous
syntax, in contrast to plaintext) without most of the disadvantages (large
security and privacy problems). In fact, I think that this feature is what makes
HTML mail a useable (and possibly preferable) option.
THe code might also be used to an ultra-secure (and uncomfortable) browser, but
I don't have this implemented yet.
My current implementation is, however, relatively dumb about what can be allowed
and disallowed - you can only (dis)allow certain tags and attributes, but not
where they may appear. (This makes the implementation much more simple.) E.g. it
is impossible to disallow a certain element being a child of another element or
to disallow multiple <head> nodes.
Currently, you can't even disallow text directly inside an element, e.g. the
<head> node (I might fix that case, however).
Disallowed tags and attributes will just not be output to the result. This is in
line with the HTML convention of just ignoring unknown tags. (You should be able
to remove the <applet>/</applet> tags an still get a reasonable document.)
I also hooked this up with Mailnews' libmime. I added it as new option to the
View|Body menu introduced in bug 30888.
The current implementation takes a (Mozilla Gecko-)parsed HTML document and
outputs HTML source to a string. This means that we have to parse the result
again. I am aware of the performance implications and I wrote a longer
explanation in the source as comments.
This blocks bug 54184, because I want to make this the default in Beonex
Communicator.
I don't request review yet, because I want to clean up the source more and
refactor it more. Maybe, I'll be done later this day.
Assignee | ||
Comment 2•23 years ago
|
||
This bug exposes bug 125983.
Assignee | ||
Comment 3•23 years ago
|
||
This is a screenshot once of the message with the sender's HTML (as rendered
currently by Mozilla) (left) and once ran through the sanitizer (if enabled)
(right). The sanitizer pref was sat to allow tables and some presentational
formatting like <b>. You could also disallow tables, in which case everything
would be rendered in in a single colunm (in the order it appears in the HTML
source).
Note that the left original has the much-hated horizontal scrollbar, while the
right sanitized version doesn't.
Assignee | ||
Comment 4•23 years ago
|
||
This is the current state of my work. It still has some rough edges (e.g. a
leak), which I need to clean up. It is needed for the patch in bug 30888,
however. There, you can find the rest of the fix for this bug (namely, the
Mailnews part).
Assignee | ||
Comment 5•23 years ago
|
||
This is a test application for the sink. You can copy it over
htmlparser/tests/outsinks/Convert.cpp. Then, invoke TestOutput with an HTML
file as param. Don't forget to set LD_LIBRARY_PATH and MOZILLA_FIVE_HOME.
Assignee | ||
Comment 6•23 years ago
|
||
There's a tiny error in the diff: in mozSanitizingHTMLSerializer.cpp, line 24,
there a question mark after the comment. Please remove it.
Assignee | ||
Comment 7•23 years ago
|
||
akk pointed me to the DTD classes (e.g. htmlparser/src/nsWebFormedDTD.*) and
said that I might be able to use them to restrict the allowed tags/attributes
more flexible and cleanly. That hint is much appreciated, but I guess it's too
late for that now - I'd like to get that in for 1.0 and have no time to complete
rewrite the content/ part of this bug, considering that the current solution works.
Assignee | ||
Comment 8•23 years ago
|
||
I filled in the destruction or clean up mem.
I am now requesting review. akk, can you do that, please?
The code is not pretty - I am not proud of the implementation. But maybe it
will be rewritten later with DTDs anyways. And the code seems to do its job for
now and sctaches a big itch (solution/workaround for bug 28327).
To see the code in action, you need either the test app (see above) or the
non-overlapping part of the oatch for bug 30888 (which contains part of this
bug) (take the "everything" patch there and manually delete everything content/
from the patch before applying it, because it contains an older version of the
patch here).
Attachment #70305 -
Attachment is obsolete: true
Assignee | ||
Updated•23 years ago
|
Comment 9•23 years ago
|
||
This is needed in order to build the patch (version 2) on Mac
Comment 10•23 years ago
|
||
ben: can you confirm this is a networking bug?
Assignee | ||
Comment 11•23 years ago
|
||
In its initial constellation, it would have had a network component. This
component is not yet implemented yet, though, and will be moved to a new bug.
Moving to Mailnews for the lack of a better bugzilla Component. I suggest the QA
is the same as bug 30888.
Component: Networking → MIME
Product: Browser → MailNews
Assignee | ||
Updated•23 years ago
|
Status: ASSIGNED → RESOLVED
Closed: 23 years ago
Resolution: --- → FIXED
Assignee | ||
Comment 12•23 years ago
|
||
Fix checked into trunk. Yay, finally!
Thanks to all reviewers and all others involved.
This is not yet in the 1.0 branch. I still would like to check it into the
branch. If not, it would not be in any Mozilla 1.0 releases, but only trunk
nightlies and 1.1alpha.
Please test the feature and try to circumvent it, i.e. get arbitary HTML
rendered although the new options are active. If so, please file a new bug
against Mailnews/MIME, owner me, and mention it here.
Some documentation can be found at <http://www.beonex.com/temp/index.html>. When
I have ftp access to my private server again, I will move it to
<http://www.bucksch.com/1/projects/mozilla/108153>.
Comment 13•23 years ago
|
||
+qawanted - I'm probably not the best person to QA this, both for time and
expertise reasons.
Keywords: qawanted
Comment 14•22 years ago
|
||
I have verified the mailnews portion of this bug when verifying 30888. Is there
someone who can verifiy the browser portion, I don't see a preference or menu
item for sanitizing browser pages as seen in comment #3's attachment.
Assignee | ||
Comment 15•22 years ago
|
||
Esther, I didn't implement the browser option (yet). It was originally an idea,
but I didn't have the time/mood/use-case for it. I'll track that in another bug
(I will reference it here), if I will implement it one day. So, if you verified
the Mailnews part, you verified it all already ;-). The screenshot in comment 3
is Mailnews.
Comment 16•22 years ago
|
||
Your right, sorry I didn't see that. Then I'll verify this. Note, I see in bug
18427 that you make a statement that since this bug is fixed there is nothing
left to do in 18427. However a bug marked a dup of 18427 (the dup is 30896) is
still being worked on by sspitzer and ssu after the fixed date for this bug.
These bugs are getting complex and hard to follow. Would you say that 30896 is
not a dup of 18427. If so, then I will change the dup status of 30896 and then
verify 18427 per testing I did for bug 30888.
Assignee | ||
Comment 17•22 years ago
|
||
Comment 19•21 years ago
|
||
How to enable this (bug 108153's) fix, as far as I can tell (it's not in the UI,
AFAIK):
The code says:
+ aPref is a long string, which holds an exhaustive list of allowed tags
+ and attributes. All other tags and attributes will be removed.
+
+ aPref has the format
+ "html head body ul ol li a(href,name,title) img(src,alt,title) #text"
+ i.e.
+ - tags are separated by whitespace
+ - the attribute list follows the tag directly in brackets
+ - the attributes are separated by commas.
preference setting mailnews.display.html_sanitizer.allowed_tags (see URL:
about:config)
is
html head title body p br div(lang,title) h1 h2 h3 h4 h5 h6 ul ol
li(value,start,compact) dl dt dd blockquote(type,cite) pre noscript noframes
strong em sub sup span(lang,title) acronym(title) abbr(title)
del(title,cite,datetime) ins(title,cite,datetime) q(cite) a(href,name,title)
img(alt,title,longdesc,src) base(href) area(alt) applet(alt) object(alt) var
samp dfn address kbd code cite s strike tt b i table(align) caption
tr(align,valign) td(rowspan,colspan,align,valign) th(rowspan,colspan,align,valign)
by default.
I'm deleting div(...), src from img, applet(...), and object(...). As an
attempted partial workaround for bug 28327.
Not sure about the safety of base and area, among other tags. Still doesn't
block ReadNotify, I think.
Dox such as this are needed in http://www.mozilla.org/unix/customizing.html#prefs;
akkana%netscape.com, can you add some dox for this pref there?
(OT: akkana, some mention of where the prefs file is for the poor M$ users
and/or on URL: about:config would be great too. Yeah I know its in /unix/, but
now it's referenced in the Release Notes for all platforms...)
Comment 20•21 years ago
|
||
> akkana%netscape.com, can you add some dox for this pref there?
Okay, I've now read through bug 28327 (thanks for the pointer!) But I'm not
sure I understand your setting. Can you explain why div has to be blocked, but
span doesn't? I understand removing applet and object (Ben, I'm surprised those
are normally included -- is there a reason for that?) and I thought removing img
entirely was part of the whole point of sanitized html? (I can't tell, because
I block images in mail anyway, so the fact that I never see them in sanitized
html doesn't tell me anything.) What about area -- if img is gone, does area
make any sense? I also wonder about base -- anybody know? Seems like removing
it from the list should be fairly safe.
Anyway, I'm quite happy to document it, but I'd appreciate it if someone who's
been following the issue longer than I have would provide a sensible comment so
I don't have to make one up. And we might as well get a good list to suggest,
or perhaps a couple of good lists for varying levels of security.
> akkana, some mention of where the prefs file is for the poor M$ user
Someone who actually is a MS user should mail me that info as a contribution.
Do "find files on disk" or whatever they're calling it now, and look for
prefs.js, and you'll probably find it. Then write that up in a way that will
make sense to a typical MS user, and mail it to me, and I'll check it in (with
credit). A Linux user writing guesses at how MS users might find their prefs
file is probably not the best approach to documentation. :-)
(I will write up something on about:config.)
Assignee | ||
Comment 21•21 years ago
|
||
> How to enable this (bug 108153's) fix ... (it's not in the UI, AFAIK):
It is. Menu|Message|Body|Simple HTML
(Checked in as part of bug 30888, see comment 1.)
You cannot customize (via the UI) the list of HTML tags to allow, though.
> I'm deleting div(...), src from img, applet(...), and object(...).
> As an attempted partial workaround for bug 28327.
You don't need to do that. By default, it doesn't allow anything that causes
network activity (IIRC). If you can live with (or want) the reduced HTML, this
bug completely fixes bug 28327, unless I missed something.
> some mention of where the prefs file is for the poor M$ user
uhm, release notes? It's probably explained somewhere under www.mozilla.org's
end-user doc section as well.
Maybe just link it.
Hey akk, nice to talk to you again,
> I understand removing applet and object (Ben, I'm surprised those
> are normally included -- is there a reason for that?)
Unless I'm missing something, only the |alt| attribute of <object>, <applet>
etc. is allowed and <param> is not allowed at all, so the tags are intentionally
crippled so much that only the alternative text remains, the active content is
disfunctional (at least it should be).
> I thought removing img entirely was part of the whole point of
> sanitized html?
Not quite. The point was mainly to get rid of any privacy and security threats
that HTML mail poses. External images are privacy threats, images embedded in
the mail are not and thus allowed.
Getting rid of the formatting annoyance of HTML was also a goal. Images are (or
at least can be) important content, so I allowed them, when possible. You could
argue that images might have offending content, but then again, attached images
will be shown inline with plaintext as well, so we are no different here. You
can remove the <img> tag from the pref, if it bothers you, but attached images
will still display.
<political type="offending">In short, the goal was to turn HTML back into what
is was supposed to be: description of content and structure, with the
presentation left to the reader. I.e. the superiour text format (if both
parties' software can deal with it properly) ;-).</political>
> I also wonder about base -- anybody know? Seems like removing
> it from the list should be fairly safe.
I don't see that it would cause any harm, and removing it may break relative
links in <a href="">s.
> Anyway, I'm quite happy to document it, but I'd appreciate it if someone
> who's been following the issue longer than I have would provide a sensible
> comment so I don't have to make one up.
There is <http://www.bucksch.org/1/projects/mozilla/108153>. Not sure, if it's
what you're looking for.
> perhaps a couple of good lists for varying levels of security.
Good idea.
I have junk mail that produces the assert:
ASSERTION: all the skipped content tokens did not get handled:
'mSkippedContent.GetSize() == 0', file CNavDTD.cpp, line 979
but only if I view it as Simple HTML and not Original HTML. Would that be a bug
here?
Updated•20 years ago
|
Product: MailNews → Core
Updated•16 years ago
|
Product: Core → MailNews Core
You need to log in
before you can comment on or make changes to this bug.
Description
•