Closed Bug 340554 Opened 18 years ago Closed 18 years ago

Provide sanitized HTML content

Categories

(Firefox Graveyard :: RSS Discovery and Preview, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED
Firefox 2 beta2

People

(Reporter: sayrer, Assigned: sayrer)

References

Details

(Keywords: fixed1.8.1, fixed1.8.1.2)

Attachments

(5 files, 14 obsolete files)

104.42 KB, image/png
Details
103.64 KB, image/png
Details
298.79 KB, image/png
Details
61.39 KB, patch
Details | Diff | Splinter Review
1.40 KB, patch
sayrer
: review+
Details | Diff | Splinter Review
Right now, only text and full HTML are available
Status: NEW → ASSIGNED
Assignee: nobody → sayrer
Status: ASSIGNED → NEW
Status: NEW → ASSIGNED
Whiteboard: [swag: 3d]
OS: Linux → All
Hardware: PC → All
Target Milestone: --- → Firefox 2 beta1
Flags: blocking-firefox2?
Flags: blocking-firefox2? → blocking-firefox2+
Whiteboard: [swag: 3d] → [swag: 2d]
Target Milestone: Firefox 2 beta1 → Firefox 2 beta2
Attachment #230449 - Attachment is obsolete: true
Attachment #230480 - Attachment is obsolete: true
same as previous, but up to date
Attachment #230692 - Attachment is obsolete: true
still need to get HTML attributes
Attachment #230693 - Attachment is obsolete: true
Attached patch XHTML, HTML sanitized. (obsolete) — Splinter Review
still need to check security for URI attribute values
Attachment #230705 - Attachment is obsolete: true
Attachment #230748 - Attachment is obsolete: true
Attached patch HTML, XHTML, sanitized (obsolete) — Splinter Review
This patch provides an extremely conservative sanitizer for the feed display, because the risk a vulnerability is high there.

There is probably a better way to do the whitelist functions... let me know if you want me to change them.
Attachment #230775 - Attachment is obsolete: true
Attachment #230796 - Flags: review?(bugmail)
Attached patch HTML, XHTML, sanitized (obsolete) — Splinter Review
Attachment #230796 - Attachment is obsolete: true
Attachment #230796 - Flags: review?(bugmail)
Attachment #230799 - Flags: review?(bugmail)
The feeds look *way* better with this patch... I think we really want this
*** Bug 343749 has been marked as a duplicate of this bug. ***
(In reply to comment #9)
> Created an attachment (id=230799) [edit]
> HTML, XHTML, sanitized

I think <bdo> and <var> are missing from the white list.
Whiteboard: [swag: 2d] → [has patch][needs review ben]
Attachment #230799 - Flags: superreview?(jst)
Attachment #230799 - Flags: review?(bugs)
Whiteboard: [has patch][needs review ben] → [has patch][needs review ben, sicking][needs SR jst]
Attached image apple photocast
Blocks: 340555
Comment on attachment 230799 [details] [diff] [review]
HTML, XHTML, sanitized

FeedWriter parts look ok to me
Attachment #230799 - Flags: review?(bugs) → review+
Comment on attachment 230799 [details] [diff] [review]
HTML, XHTML, sanitized

+// these two lists are used by the sanitizing fragment serializer
+static PRBool
+IsOnWhitelist(nsIAtom *aName)
+{
+  if (aName == nsHTMLAtoms::a ||
+      aName == nsHTMLAtoms::abbr ||
+      aName == nsHTMLAtoms::acronym ||
[...]
+      aName == nsHTMLAtoms::ul) 

and

+static PRBool
+IsAttrOnWhitelist(nsIAtom *aName)
+{
+  if (aName == nsHTMLAtoms::accept ||
+      aName == nsHTMLAtoms::acceptcharset ||
+      aName == nsHTMLAtoms::accesskey ||
[...]
[...]
+      aName == nsHTMLAtoms::width) 

Both those functions do a *lot* of pointer compares, and they're called a *lot* it seems when using the feed viewer. Might be worth while to build two hashes for those lists and do a hash lookup rather than all those pointer compares in those functions.

- nsHTMLParanoidFragmentSink::AddAttributes():

+  // use this to check for safe URIs in the few attributes that allow them
+  nsresult rv;
+  nsCOMPtr<nsIScriptSecurityManager>
+    secMan(do_GetService(NS_SCRIPTSECURITYMANAGER_CONTRACTID, &rv));

You can avoid this service lookup for every element with attributes by using nsContentUtils::GetSecurityManager() which returns a cached pointer to the security manager w/o doing a lookup every time.

+  NS_ENSURE_SUCCESS(rv, rv);
+  nsCOMPtr<nsIURI> baseURI = aContent->GetBaseURI();

Maybe leave the decalaration of baseURI here, but move the call to aContent->GetBaseURI() into the code where we actually need it (and only call it once per element, i.e. null-check baseURI before calling it in the loop). That way we'd save the GetBaseURI() call in most cases, and the call isn't *that* cheap.

Both of the above apply to nsXHTMLParanoidFragmentSink::AddAttributes() as well.

- In nsXHTMLParanoidFragmentSink::HandleStartElement():

+    if (nameSpaceID == kNameSpaceID_XMLNS ||
+        nameSpaceID == kNameSpaceID_XML ||
+        IsAttrOnWhitelist(name)) {
+        allowedAttrs.AppendElement((void*)aAtts[i]);
+        allowedAttrs.AppendElement((void*)aAtts[i + 1]);
+    }

Nit: Only indent the two lines in the body of the if statement two spaces, like the rest of the code around it.

+  PRUint32 allowedCount = allowedAttrs.Count();
+  for (PRUint32 i=0; i < aAttsCount; i++) {
+    // 
+    if (i < allowedCount)

Did you mean to write something after the '//'? :)

- In nsXHTMLParanoidFragmentSink::HandleEndElement():

+  if (mSkipLevel != 0) {
+    --mSkipLevel;
+    return NS_OK;
+  } else if (!IsOnWhitelist(name)) {
+    return NS_OK;
+  }

Remove the else-after-return.

- In nsScriptableUnescapeHTML::ParseFragment():

+  // stop scripts
+  nsCOMPtr<nsIScriptLoader> loader;
+  PRBool scripts_enabled = PR_FALSE;
+  if (document) {
+      loader = document->GetScriptLoader();
+      if (loader) {
+          loader->GetEnabled(&scripts_enabled);
+      }
+  }
+  if (scripts_enabled) {
+      loader->SetEnabled(PR_FALSE);
+  }

Nit: Stick to two space indentation.

+  // Wrap things in a div for parsing, but it won't show up
+  // in the fragment. XXX, not quite working yet

What part isn't quite working yet? Anything we care about before landing this?

sr=jst with that addressed.
Attachment #230799 - Flags: superreview?(jst) → superreview+
Attached patch address JST's comments (obsolete) — Splinter Review
Attachment #230799 - Attachment is obsolete: true
Attachment #230799 - Flags: review?(bugmail)
Attached patch address JST's comments (obsolete) — Splinter Review
Attachment #232371 - Attachment is obsolete: true
Attached patch address JST's comments (obsolete) — Splinter Review
Attachment #232372 - Attachment is obsolete: true
Attachment #232373 - Flags: review?(bugmail)
(In reply to comment #16
> 
> What part isn't quite working yet? Anything we care about before landing this?
> 
> sr=jst with that addressed.

Relative URIs in HTML content don't always work, because of a few small problems with my code, and some bugs in the HTMLFragmentSink superclass. Relative URIs in HTML content don't work in RSS but they are specified in Atom. It is still really rare, so I can't justify it as a blocker, but I will file a follow-on to get it.

Whiteboard: [has patch][needs review ben, sicking][needs SR jst] → [has patch][needs review sicking]
Comment on attachment 232373 [details] [diff] [review]
address JST's comments

bz says: peterv, mrbkap, or sicking needs to review this.

Let's see if mrbkap has time.
Attachment #232373 - Flags: review?(mrbkap)
Comment on attachment 232373 [details] [diff] [review]
address JST's comments

>Index: content/base/src/nsContentSink.h

>+// these two lists are used by the sanitizing fragment serializers
>+static nsIAtom** const kDefaultAllowedTags [] = {
>+  &nsHTMLAtoms::a,
>+  &nsHTMLAtoms::abbr,
>+  &nsHTMLAtoms::acronym,
...
It would be good if we could share this list with the mailnews list of allowed tags. Especially given that that list is able to restrict different attributes on different tags. That way we can do things like allowing object tags with alt attributes which isn't super important, but nice to have.

In general it's a bad idea to duplicate security info like this if we for example need to remove some element or attribute from the whitelist.

Additionally, it seems like a waste to have to sinks that do filtering like this, both this new sink and the existing mozSanitizingSerializer
Comment on attachment 232373 [details] [diff] [review]
address JST's comments

>Index: content/html/document/src/nsHTMLFragmentContentSink.cpp

>+class nsHTMLParanoidFragmentSink : public nsHTMLFragmentContentSink
>+  // Use nsTHashTable as a hash set for our whitelists
>+  nsTHashtable<nsISupportsHashKey>* mAllowedTags;
>+  nsTHashtable<nsISupportsHashKey>* mAllowedAttributes;

Why are these members pointers? Just do

nsTHashtable<nsISupportsHashKey> mAllowedTags;
nsTHashtable<nsISupportsHashKey> mAllowedAttributes;

That way you won't need to |new| and |delete| them. Also, you shouldn't need to explicitly call Clear() before they are deleted, the dtor should take care of that.
(In reply to comment #22)
>
> It would be good if we could share this list with the mailnews list of allowed
> tags.
>
> Additionally, it seems like a waste to have to sinks that do filtering like
> this, both this new sink and the existing mozSanitizingSerializer

Agree in principle, but reusing or adapting mozSanitizingSerializer didn't seem practical given the time constraints. It doesn't do XML, it doesn't use the security manager anywhere, and it strips things I would like to preserve. I would take a bug on unifying all three classes for 1.9.

Comment on attachment 232373 [details] [diff] [review]
address JST's comments

General nit: code in content/ tends to always put braces on if statements, even if they're not generally required.

>Index: browser/components/feeds/src/FeedWriter.js
>+          body.setAttribute("xml:base", summary.base.spec);

It looks like you actually want body.setAttributeNS here.

>Index: content/base/src/nsContentSink.h

>+static
>+PRBool IsAttrURI(nsIAtom *aName)
>+{
>+  if (aName == nsHTMLAtoms::action ||
>+      aName == nsHTMLAtoms::href ||
>+      aName == nsHTMLAtoms::src ||
>+      aName == nsHTMLAtoms::longdesc ||
>+      aName == nsHTMLAtoms::usemap ||
>+      aName == nsHTMLAtoms::cite) 
>+    return PR_TRUE;
>+  return PR_FALSE;
>+}

Style nit: I'd prefer |return (aName == ...)| instead of the explicit if/else. Or, at the very least, braces around the then-statements in the if.

>Index: content/html/document/src/nsHTMLFragmentContentSink.cpp

>+/*
>+ The following HTML tags are allowed (all others are stripped): a,
>+ abbr, acronym, address, area, b, big, blockquote, br, button,
>+ caption, center, cite, code, col, colgroup, dd, del, dfn, dir, div,
>+ ...

Here and below you repeat this huge comment. Given the big list in nsContentSink.h, I don't think the comment is necessary. Instead, you can just refer to the lists in nsContentSink.h here and not have to change 3 places every time the list changes.

>+  nsresult nameFromType(const nsHTMLTag aTag,
>+                        nsIAtom **aResult);
>+
>+  nsresult nameFromNode(const nsIParserNode& aNode,
>+                        nsIAtom **aResult);

Uber-nit: C++ style prefers capitalizing the first letter of method names.

>+nsHTMLParanoidFragmentSink::nsHTMLParanoidFragmentSink():
>+nsHTMLFragmentContentSink(PR_TRUE), mSkipLevel(0)

Uber-uber-nit: Indent this second line a couple of spaces.

>+  nsresult rv;
>+  PRUint32 size = NS_ARRAY_LENGTH(kDefaultAllowedTags);
>+  mAllowedTags = new nsTHashtable<nsISupportsHashKey>();

Is it worth it to make this a class-static member var and initialize on first-use or startup and delete it on shutdown? If we're going to make a lot of them, it seems like it'd be worth it.

>+nsHTMLParanoidFragmentSink::nameFromType(const nsHTMLTag aTag,
>+                                         nsIAtom **aResult) {
>+nsHTMLParanoidFragmentSink::nameFromNode(const nsIParserNode& aNode,
>+                                         nsIAtom **aResult) {


Nit: The opening brace of functions goes on the next line's 0th column, as you do below.

>+  nsresult rv;
>+  eHTMLTags type = (eHTMLTags)aNode.GetNodeType();
>+  
>+  *aResult = nsnull;
>+  if ((type == eHTMLTag_userdefined)) {

Nit: The extra parens around the condition aren't needed (and are dangerous if you accidentally had typed = instead of ==)!

>+  // bail if it's a script or style, or we're already inside one of those
>+  eHTMLTags type = (eHTMLTags)aNode.GetNodeType();
>+  if (mSkipLevel != 0 || type == eHTMLTag_script || type == eHTMLTag_style) {
>+    ++mSkipLevel;
>+    return rv;

Since this is an HTML content sink, this will never happen. Both the HTML tokenizer and the DTD enforce that the only children of script and style tags are text nodes. I think that means that you can make this member a PRBool or something.

>+  rv = nsHTMLFragmentContentSink::OpenContainer(aNode);
>+  return rv;
...
>+  rv = nsHTMLFragmentContentSink::CloseContainer(aTag);
>+  return rv;

Nit: just |return nsHTMLFragment ...|, no need to assign into rv first.

>+NS_IMETHODIMP
>+nsHTMLParanoidFragmentSink::AddLeaf(const nsIParserNode& aNode)

You're checking this into the 1.8 branch, right? If so, then you'll need to specially check for <script> and <style> here and collect the skipped content, like what nsHTMLFragmentContentSink does. Fortunately, you only have to do this on the 1.8 branch.

>+  rv = nsHTMLFragmentContentSink::AddLeaf(aNode);
>+
>+  return rv;

Ditto about the extra assignment.

>Index: content/xml/document/src/nsXMLContentSink.h
>+nsXHTMLParanoidFragmentSink::nsXHTMLParanoidFragmentSink():
>+nsXMLFragmentContentSink(PR_FALSE), mSkipLevel(0)

Same nit as above.

>+  mAllowedTags = new nsTHashtable<nsISupportsHashKey>();

Same question about sharing these hash tables as above.

>+  nsVoidArray allowedAttrs;

Can you use an nsTArray<PRUnichar *> here?

>+  // let the ok attributes through
>+  PRUint32 allowedCount = allowedAttrs.Count();
>+  for (PRUint32 i=0; i < count; i++) {

Can you pass a pointer to the first element of the array to the base class and avoid this loop?

>+  // It's an allowed element, so let's scrub the attributes

Err... Doesn't nsXMLContentSink::HandleStartElement use AddAttributes to add the attributes, meaning you don't need to do this again? Oh, I see: This loop scrubs which attributes you allow, and that one scrubs the values. Why not just do both things at the same time in AddAttributes?

>Index: parser/htmlparser/public/nsIFragmentContentSink.h

>+nsresult
>+NS_NewXHTMLParanoidFragmentSink(nsIFragmentContentSink** aInstancePtrResult);

Add a comment on what this does here?

>Index: toolkit/components/feeds/src/nsScriptableUnescapeHTML.cpp


>+  // in the fragment. XXX, not quite working yet

What's not quite working yet?

>+  } else {
>+    // HTML
>+    tagStack.AppendElement(ToNewUnicode(NS_LITERAL_CSTRING(HTML_BODY_TAG)));
>+    if (aBaseURI) {
>+      base.Append(NS_LITERAL_CSTRING((HTML_BODY_TAG)));
>+      base.Append(NS_LITERAL_CSTRING(" href=\""));
>+      aBaseURI->GetSpec(spec);
>+      base = base + spec;
>+      base.Append(NS_LITERAL_CSTRING("\""));
>+      tagStack.AppendElement(ToNewUnicode(base));
>+    }

You do a bunch of work here to append stuff to the tagStack, but then you don't use it. Why are you using the aAllContent version of the HTML fragment sink anyway?
(In reply to comment #25)
> 
> >+  nsresult rv;
> >+  PRUint32 size = NS_ARRAY_LENGTH(kDefaultAllowedTags);
> >+  mAllowedTags = new nsTHashtable<nsISupportsHashKey>();
> 
> Is it worth it to make this a class-static member var and initialize on
> first-use or startup and delete it on shutdown? If we're going to make a lot 
> of them, it seems like it'd be worth it.

Maybe, but what if we want to make this class configurable? I guess it would be better to add an instance hashtable for that purpose when we need it.

> >+  // It's an allowed element, so let's scrub the attributes
> 
> Err... Doesn't nsXMLContentSink::HandleStartElement use AddAttributes to add
> the attributes, meaning you don't need to do this again? Oh, I see: This loop
> scrubs which attributes you allow, and that one scrubs the values. Why not just
> do both things at the same time in AddAttributes?

The sinks do (and new implementations could add) things with the attribute array before they get to AddAttributes, so I thought it best to split it up. Don't feel strongly about that, though.

> What's not quite working yet?

relative URIs (see above), but that needs to be tested again, now that you saw my setAttribute[NS] mistake. I'm guessing it will work with XHTML, but I still need to do AddBaseTagInfo for HTML (which looks like it might be buggy in HTMLFragmentSink... looks like it doesn't do it for anchor elements?)
Attached patch address mrbkap's comments (obsolete) — Splinter Review
the relative URIs work now, I needed to handle base tags specially
Attachment #232373 - Attachment is obsolete: true
Attachment #233326 - Flags: review?(mrbkap)
Attachment #232373 - Flags: review?(mrbkap)
Attachment #232373 - Flags: review?(bugmail)
Comment on attachment 233326 [details] [diff] [review]
address mrbkap's comments

>+  return nsXMLFragmentContentSink::AddAttributes(&allowedAttrs[0], aContent);

Here and elsewhere, use allowedAttrs.Elements(). r=mrbkap with that.
Attachment #233326 - Flags: review?(mrbkap) → review+
Attached patch add ifdefs for the branch (obsolete) — Splinter Review
CollectSkippedInput on the branch
Attachment #233326 - Attachment is obsolete: true
(In reply to comment #29)
> Created an attachment (id=233335) [edit]

I've stated it before, but nobody responded. <var> is still missing. It's not really important (pretty much like <samp>), but there's also no good reason for leaving it out.

http://www.w3.org/TR/REC-html32#phrase
http://www.w3.org/TR/xhtml2/mod-text.html#sec_9.14.

<bdo> is missing, too. It looks important -- albeit I've never used it, because I only know LTR languages.

http://www.w3.org/TR/html401/struct/dirlang.html#h-8.2.4
(In reply to comment #30)
> 
> I've stated it before, but nobody responded. <var> is still missing.

I filed bug 348447 for the missing elements.
Status: ASSIGNED → RESOLVED
Closed: 18 years ago
Resolution: --- → FIXED
Whiteboard: [has patch][needs review sicking]
Attachment #233335 - Flags: approval1.8.1?
(In reply to comment #29)
> Created an attachment (id=233335) [edit]

<summary type="html"> (e.g. attachment 228290 [details]) is still not working, is it? Should I file a new bug for this?
(In reply to comment #32)
> <summary type="html"> (e.g. attachment 228290 [details] [edit]) is still not working, is it?
> Should I file a new bug for this?
> 

The patch in bug 340555 fixes that problem.
Questions from triage today - this is a fairly big patch can you comment on:

a) What this is needed for
b) Level of risk
c) Types of manual and automated tests run to verify code

Also - would we need a patch for 340555 as well?

(In reply to comment #34)
> Questions from triage today - this is a fairly big patch can you comment on:
> 
> a) What this is needed for

This code is needed to coherently display several types of feeds. Right now, all markup is stripped from the entry content. Feeds that depend heavily on basic formatting like line breaks and img tags look broken. Some very popular feeds like flickr.com feeds won't display any meaningful content with the current code, which Ben and I went with due to unresolved security issues with the feed preview (executing with chrome privs).

The markup has to be sanitized so that untrusted feed content does not interfere with our interface and privileged operations. This is also a requirement for API users. 

> b) Level of risk

The level of risk is low for parts of the code not related to this feature. I had to make the AddAttributes method virtual on the content sinks, but I think that's the only non-feed code path I touched.

> c) Types of manual and automated tests run to verify code

I don't know of any automated test harness I can run on it yet. I have been using it for daily surfing for a while. I have used it on feeds known to contain security issues, I have used it on feeds that mix HTML and XHTML entries, I have tested it against feeds that contain chrome links, javascript links, data links, script elements, style elements, malformed HTML, SVG, and many other items.

> 
> Also - would we need a patch for 340555 as well?

We could take the patch that's in 340555 right now. It would fix the problems related to this bug (none of which are critical for b2), though it wouldn't close 340555.
Comment on attachment 233335 [details] [diff] [review]
add ifdefs for the branch

a=drivers for the MOZILLA_1_8_BRANCH.

>We could take the patch 
>that's in 340555 right 
>now. It would fix the 

Yeah, let's split that bit off and get it reviewed and landed so that this piece lands with no known issues.
Robert, can you s
Attachment #233335 - Flags: approval1.8.1? → approval1.8.1+
Attachment #233335 - Attachment is obsolete: true
Keywords: fixed1.8.1
Attachment #249386 - Flags: review?(sayrer) → review+
Attachment #249386 - Flags: approval1.8.1.2?
Comment on attachment 249386 [details] [diff] [review]
how about assigning to the rv that you check?

Approved for 1.8 branch, a=jay for drivers.
Attachment #249386 - Flags: approval1.8.1.2? → approval1.8.1.2+
Checking in nsHTMLFragmentContentSink.cpp;
/cvsroot/mozilla/content/html/document/src/nsHTMLFragmentContentSink.cpp,v  <--  nsHTMLFragmentContentSink.cpp
new revision: 1.102.4.6; previous revision: 1.102.4.5
done
Keywords: fixed1.8.1.2
This is duplicating mozSanitizingHTMLSerializer . It is used by Thunderbird View | Message Body As | Simple HTML since 5+ years.

That's a serializer, but it could be turned into a sink without too much effort, I think.  is quite paranoid, going further than this here. The implementation is a bit different (e.g. takes the allowed tags and attributes as string, because they're preffed, which was not possible with Atoms, because there was no tag name -> atom mapping), and it's surely not pretty, but it tries to be paranoid ("defense in depth") by removing javascript: and data: links altogether and even finding them in content.
http://mxr.mozilla.org/seamonkey/source/content/base/public/mozISanitizingSerializer.h
is an intro to the class. Most of the text at the bottom actually talks about how much I wanted this to be a dynamically hookable sink, not a serializer, and that it can be turned into a sink. (Ignore that "design" sentence ;)
Depends on: 476310
Product: Firefox → Firefox Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: