Last Comment Bug 482909 - [HTML5] Re-implement HTML sanitizer using the HTML5 parser
: [HTML5] Re-implement HTML sanitizer using the HTML5 parser
Status: RESOLVED FIXED
:
Product: Core
Classification: Components
Component: HTML: Parser (show other bugs)
: Other Branch
: All All
: P3 normal with 2 votes (vote)
: mozilla8
Assigned To: Henri Sivonen (:hsivonen)
:
: Andrew Overholt [:overholt]
Mentors:
: 554130 (view as bug list)
Depends on: 675437 675492 675916 675925 677847 750436
Blocks: 507452 563890 613912 650094 596182
  Show dependency treegraph
 
Reported: 2009-03-12 00:43 PDT by Henri Sivonen (:hsivonen)
Modified: 2013-09-04 16:57 PDT (History)
20 users (show)
hsivonen: in‑testsuite+
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---


Attachments
sanitizer mochitests (21.08 KB, patch)
2009-08-04 16:26 PDT, Jonathan Griffin (:jgriffin)
no flags Details | Diff | Splinter Review
Backup of WIP (112.35 KB, patch)
2011-01-31 06:07 PST, Henri Sivonen (:hsivonen)
no flags Details | Diff | Splinter Review
Current WIP (126.47 KB, patch)
2011-02-03 07:39 PST, Henri Sivonen (:hsivonen)
no flags Details | Diff | Splinter Review
Allow Content MathML (125.81 KB, patch)
2011-02-07 04:55 PST, Henri Sivonen (:hsivonen)
no flags Details | Diff | Splinter Review
The new sanitizer (131.87 KB, patch)
2011-02-15 05:27 PST, Henri Sivonen (:hsivonen)
no flags Details | Diff | Splinter Review
Part 1: Atoms (9.66 KB, patch)
2011-03-24 06:22 PDT, Henri Sivonen (:hsivonen)
ehsan: review+
Details | Diff | Splinter Review
Part 2: New sanitizer and removal of the old one (123.83 KB, patch)
2011-03-24 06:23 PDT, Henri Sivonen (:hsivonen)
ehsan: review+
Details | Diff | Splinter Review
Part 2: New sanitizer and removal of the old one, v2 (124.17 KB, patch)
2011-03-28 05:14 PDT, Henri Sivonen (:hsivonen)
no flags Details | Diff | Splinter Review
Part 2: New sanitizer and removal of the old one, v3 (124.12 KB, patch)
2011-03-31 06:07 PDT, Henri Sivonen (:hsivonen)
no flags Details | Diff | Splinter Review
Part 2: New sanitizer and removal of the old one, v4 (123.34 KB, patch)
2011-04-15 08:49 PDT, Henri Sivonen (:hsivonen)
no flags Details | Diff | Splinter Review
New hunks in v4 for reference (22.85 KB, patch)
2011-04-15 08:50 PDT, Henri Sivonen (:hsivonen)
no flags Details | Diff | Splinter Review
Part 2: New sanitizer and removal of the old one, v5 (123.53 KB, patch)
2011-05-09 09:15 PDT, Henri Sivonen (:hsivonen)
no flags Details | Diff | Splinter Review
Part 2: New sanitizer and removal of the old one, v6 (122.04 KB, patch)
2011-06-29 05:06 PDT, Henri Sivonen (:hsivonen)
bzbarsky: review-
Details | Diff | Splinter Review
Part 2: New sanitizer and removal of the old one, v7 (125.91 KB, patch)
2011-07-18 08:21 PDT, Henri Sivonen (:hsivonen)
no flags Details | Diff | Splinter Review
Diff from v6 to v7 (13.05 KB, patch)
2011-07-18 08:22 PDT, Henri Sivonen (:hsivonen)
no flags Details | Diff | Splinter Review
Part 2: New sanitizer and removal of the old one, v8 (128.36 KB, patch)
2011-07-28 07:19 PDT, Henri Sivonen (:hsivonen)
bzbarsky: review+
Details | Diff | Splinter Review
Diff from v7 to v8 (14.27 KB, patch)
2011-07-28 07:19 PDT, Henri Sivonen (:hsivonen)
no flags Details | Diff | Splinter Review

Description Henri Sivonen (:hsivonen) 2009-03-12 00:43:48 PDT
Feed parsing uses a sanitizing content sink. Need to clone that capability using the HTML5 parser.
Comment 1 Jonathan Griffin (:jgriffin) 2009-07-29 15:26:41 PDT
I can add the html5lib tests for the sanitizer as soon as this is implemented.
Comment 2 Jonathan Griffin (:jgriffin) 2009-08-04 16:26:52 PDT
Created attachment 392613 [details] [diff] [review]
sanitizer mochitests

Attached are the html5lib sanitizer tests, converted to mochitest using the tokenizer test harness.  As such, they require the tokenizer/parser subharness updates attached to bug 373864 in order to run.

In converting these tests to the format used by the tokenizer test harness, I assumed that expected output like:

   <bgsound src=\"javascript:alert('XSS');\"></bgsound>

meant that character output should be emitted like:

   <bgsound src=\"javascript:alert('XSS');\"/>

rather than emitting an element.  If this interpretation is wrong, let me know.

We currently fail most of these, and I don't know if there any intention to make all these tests pass, but I thought it might be of some use to be able to run them, especially as many of the tests deal with potential exploits.
Comment 3 Henri Sivonen (:hsivonen) 2011-01-31 06:07:44 PST
Created attachment 508386 [details] [diff] [review]
Backup of WIP

I've migrated the old HTML sanitizer code over to the new design, but figuring out what the Right Thing for SVG and MathML is has turned out to be harder than I expected.
Comment 4 Henri Sivonen (:hsivonen) 2011-02-03 07:39:21 PST
Created attachment 509425 [details] [diff] [review]
Current WIP
Comment 5 Henri Sivonen (:hsivonen) 2011-02-07 04:55:09 PST
Created attachment 510251 [details] [diff] [review]
Allow Content MathML
Comment 6 Henri Sivonen (:hsivonen) 2011-02-15 05:27:47 PST
Created attachment 512463 [details] [diff] [review]
The new sanitizer

If this passes the tryserver, let's move to review. Some notes:
 * Content MathML is whitelisted per ehsan's email to dev-platform.
 * SVG fonts aren't whitelisted, since we are actively not supporting them rather than just not having gotten to them yet.
 * The controls attribute is added to HTML video and audio to make them playable with scripts removed. (Matters for feeds in particular.)
 * Since the plan is to remove the html5.enable pref post-Gecko 2.0 anyway, this patch uses the new HTML parser unconditionally instead of checking the pref.
 * Per IRC discussion with bz, SVG different-document references aren't treated as dangerous URLs.
Comment 7 :Ehsan Akhgari 2011-02-15 16:06:02 PST
Is this urgent?  I'm trying to devote as much time as I can to blockers, and this is a huge patch, so I'd rather review post-Gecko2.0.  Would that be fine?
Comment 8 Henri Sivonen (:hsivonen) 2011-02-15 23:25:48 PST
(In reply to comment #7)
> Is this urgent?  I'm trying to devote as much time as I can to blockers, and
> this is a huge patch, so I'd rather review post-Gecko2.0.  Would that be fine?

This isn't urgent. I wasn't expecting anything earlier than post-Gecko2.0, so that's fine.
Comment 9 :Ehsan Akhgari 2011-03-21 14:31:13 PDT
Comment on attachment 512463 [details] [diff] [review]
The new sanitizer

This patch is extremely large, and it's hard to review.  As the first comment, it would be wonderful if you could split it up into smaller chunks.  I'm just going to go through it and comment on stuff as I encounter them, but this would probably need several rounds of review.

The tl;dr version of my initial review is that I like both the approach and the code very much.  Much of the below comments can be described as nits.  Also, please get bz to also look at this, I'm sure he can provide you with a lot of good comments as well.

>   /**
>    * Get the next node in the pre-order tree traversal of the DOM.  If
>    * aRoot is non-null, then it must be an ancestor of |this|
>    * (possibly equal to |this|) and only nodes that are descendants of
>    * aRoot, not including aRoot itself, will be returned.  Returns
>    * null if there are no more nodes to traverse.
>+   * If aSkipChildren is true, the next node is computed as if |this|
>+   * had no children.
>    */
>-  nsIContent* GetNextNode(const nsINode* aRoot = nsnull) const
>+  nsIContent* GetNextNode(const nsINode* aRoot = nsnull,
>+                          const PRBool aSkipChildren = PR_FALSE) const

I don't really like this, as the new boolean flag basically changes the semantics of the function.  Why can't you add a new method to implement your desired semantics?

>+class nsTreeSanitizer {
>+
>+    NS_INLINE_DECL_REFCOUNTING(nsTreeSanitizer)
>+
>+  public:
>+    nsTreeSanitizer(PRBool aAllowStyles, PRBool aAllowComments);
>+    ~nsTreeSanitizer();
>+    static void InitializeStatics();
>+    static void ReleaseStatics();
>+
>+    /**
>+     * Sanitizer a DOM fragment.
>+     */
>+    void Sanitize(nsIContent* aFragment);

I love how simple the API is!  :-)

>diff --git a/content/base/src/nsContentUtils.cpp b/content/base/src/nsContentUtils.cpp
>--- a/content/base/src/nsContentUtils.cpp
>+++ b/content/base/src/nsContentUtils.cpp
>@@ -178,16 +178,17 @@ static NS_DEFINE_CID(kXTFServiceCID, NS_
> #include "nsLayoutUtils.h"
> #include "nsFrameManager.h"
> #include "BasicLayers.h"
> #include "nsFocusManager.h"
> #include "nsTextEditorState.h"
> #include "nsIPluginHost.h"
> #include "nsICategoryManager.h"
> #include "nsAHtml5FragmentParser.h"
>+#include "nsTreeSanitizer.h"

This seems unnecessary.

>+nsTreeSanitizer::nsTreeSanitizer(PRBool aAllowStyles, PRBool aAllowComments)
>+ : mAllowStyles(aAllowStyles)
>+ , mAllowComments(aAllowComments)
>+{
>+  if (!sElementsHTML) {
>+    // Initialize lazily to avoid having to initialize at all if the user
>+    // doesn't paste HTML or load feeds.
>+    InitializeStatics();
>+  }
>+  nsresult rv;
>+  mNullPrincipal = do_CreateInstance(NS_NULLPRINCIPAL_CONTRACTID, &rv);

Please drop the unused |rv| variable.

>+}
>+
>+nsTreeSanitizer::~nsTreeSanitizer()
>+{
>+}

Ditto for the destructor.

>+PRBool
>+nsTreeSanitizer::MustPrune(PRInt32 aNamespace, nsIAtom* aLocal)
>+{
>+  // To avoid attacks where a MathML script becomes something that gets
>+  // serialized in a way that it parses back as an HTML script, let's just
>+  // drop these regardless of namespace
>+  if (nsGkAtoms::script == aLocal) {

What are the cases which you're worried about here?  Is it possible to just fix those cases here?

>+PRBool
>+nsTreeSanitizer::SanitizeStyleSheet(const nsAString& aOriginal,
>+                                    nsAString& aSanitized,
>+                                    nsIDocument* aDocument,
>+                                    nsIURI* aBaseURI)
>+{
>+  nsresult rv;
>+  aSanitized.Truncate();
>+  // aSanitized will hold the permitted CSS text.
>+  // We use a white-listing approach, so we explicitly allow
>+  // the CSS style and font-face rule types.  We also clear
>+  // -moz-binding CSS properties.

I don't think that this comment is accurate.  We're actually blacklisting, are we not?

>+void
>+nsTreeSanitizer::SanitizeAttributes(nsIContent* aElement,

>+      // Allow underscore to cater to the MCE editor library.
>+      // Allow data-* on SVG and MathML, too, as a forward-compat measure.

This kind of scares me.  Is this forward-compat measure actually needed right now?

>+      if (*localStr == '_' || (attrLocal->GetLength() > 5 && localStr[0] == 'd'
>+          && localStr[1] == 'a' && localStr[2] == 't' && localStr[3] == 'a'
>+          && localStr[4] == '-')) {
>+        continue;
>+      }
>+      // else not allowed
>+    } else if (kNameSpaceID_XML == attrNs) {
>+    } else if (aAllowXLink && kNameSpaceID_XLink == attrNs) {

I don't know much about these two cases.

>+void
>+nsTreeSanitizer::InitializeStatics()
>+{
>+  NS_PRECONDITION(!sElementsHTML, "Initializing a second time.");
>+
>+  sElementsHTML = new nsTHashtable<nsISupportsHashKey> ();
>+  sElementsHTML->Init(80);
>+  for (PRUint32 i = 0; kElementsHTML[i]; i++) {
>+    sElementsHTML->PutEntry(*kElementsHTML[i]);
>+  }

Hmm, why initing the hashtable with size 80 when you know the exact size?

>+void
>+nsTreeSanitizer::ReleaseStatics()
>+{
>+  if (sElementsHTML) {
>+    delete sElementsHTML;
>+    sElementsHTML = nsnull;
>+  }

You could make this a bit more compact by not null-checking, as null pointers are safe to delete.

>+//
>+// Thanks to Mark Pilgrim and Sam Ruby for the initial whitelist
>+//
>+nsIAtom** const nsTreeSanitizer::kElementsHTML[] = {

I haven't actually read this list, as I'm assuming the HTML ones haven't changed, and I can't judge the correctness of the SVG/MathML ones.

>+  // The old code created a new parser every time. This is inefficient.
>+  // However, the target document is not required to be an HTML document,
>+  // So avoid using the cached parser of aDocument. Once bug 596182 is fixed,
>+  // use the global parser here.
>+  nsCOMPtr<nsIParser> parser = nsHtml5Module::NewHtml5Parser();
>+  nsAHtml5FragmentParser* asFragmentParser =
>+      static_cast<nsAHtml5FragmentParser*> (parser.get());
>+  nsCOMPtr<nsIDOMDocumentFragment> frag;
>+  NS_NewDocumentFragment(getter_AddRefs(frag),
>+                         aTargetDocument->NodeInfoManager());
>+  nsCOMPtr<nsIContent> fragment = do_QueryInterface(frag);
>+  asFragmentParser->ParseHtml5Fragment(aFragStr,
>+                                      fragment,
>+                                      aContextLocalName ?
>+                                          aContextLocalName : nsGkAtoms::body,
>+                                      kNameSpaceID_XHTML,
>+                                      PR_FALSE,
>+                                      PR_TRUE);

I'm just assuming that all of the above is correct, since I don't know the interface to the HTML5 parser as good as you do.  :-)

>+  if (!aTrustedInput) {
>+    nsRefPtr<nsTreeSanitizer> sanitizer =
>+        new nsTreeSanitizer(!!aContextLocalName, !aContextLocalName);
>+    sanitizer->Sanitize(fragment);
>   }

Can't the sanitizer object be allocated on the stack?

>+  <div id="ss" contenteditable="true"></div>
>+  <div id="tt" contenteditable="true"></div>
>+  <div id="uu" contenteditable="true"></div>
>+  <div id="vv" contenteditable="true"></div>
>+  <div id="ww" contenteditable="true"></div>
>+  <div id="xx" contenteditable="true"></div>

Can you please add designMode equivalent versions for these tests too?  Just to make sure that we're using the same underlying code for both modes.

The tests look great.  Sorry if the test itself sucked a bit, I tried to make it as simple to add new tests as possible... :/

>diff --git a/toolkit/components/feeds/src/nsScriptableUnescapeHTML.cpp b/toolkit/components/feeds/src/nsScriptableUnescapeHTML.cpp

I don't know the code in this component very much, so I just skimmed over it to see if I see anything obviously wrong, and it looked good to me.
Comment 10 Henri Sivonen (:hsivonen) 2011-03-24 06:22:39 PDT
Created attachment 521471 [details] [diff] [review]
Part 1: Atoms
Comment 11 Henri Sivonen (:hsivonen) 2011-03-24 06:23:38 PDT
Created attachment 521473 [details] [diff] [review]
Part 2: New sanitizer and removal of the old one

(In reply to comment #9)
> Comment on attachment 512463 [details] [diff] [review]
> The new sanitizer
> 
> This patch is extremely large, and it's hard to review.  As the first comment,
> it would be wonderful if you could split it up into smaller chunks.

I've split out the nsGkAtoms part, so that the rest is easier to back out even if other atom changes land. However, other than that, I don't see a reasonable way to split that would actually make review any easier. :-( Did you have a particular split in mind?

I rebased the patch to the tip of m-c. I also backported one null check that was in the next patch in my queue.

> The tl;dr version of my initial review is that I like both the approach and the
> code very much. 

Thanks.

> Much of the below comments can be described as nits.  Also,
> please get bz to also look at this, I'm sure he can provide you with a lot of
> good comments as well.

I was planning on asking bz to sr.

> >   /**
> >    * Get the next node in the pre-order tree traversal of the DOM.  If
> >    * aRoot is non-null, then it must be an ancestor of |this|
> >    * (possibly equal to |this|) and only nodes that are descendants of
> >    * aRoot, not including aRoot itself, will be returned.  Returns
> >    * null if there are no more nodes to traverse.
> >+   * If aSkipChildren is true, the next node is computed as if |this|
> >+   * had no children.
> >    */
> >-  nsIContent* GetNextNode(const nsINode* aRoot = nsnull) const
> >+  nsIContent* GetNextNode(const nsINode* aRoot = nsnull,
> >+                          const PRBool aSkipChildren = PR_FALSE) const
> 
> I don't really like this, as the new boolean flag basically changes the
> semantics of the function.  Why can't you add a new method to implement your
> desired semantics?

I did that first, but then I felt bad about copypasted duplicate code.

Changed back to two methods.

> >diff --git a/content/base/src/nsContentUtils.cpp b/content/base/src/nsContentUtils.cpp
> >--- a/content/base/src/nsContentUtils.cpp
> >+++ b/content/base/src/nsContentUtils.cpp
> >@@ -178,16 +178,17 @@ static NS_DEFINE_CID(kXTFServiceCID, NS_
> > #include "nsLayoutUtils.h"
> > #include "nsFrameManager.h"
> > #include "BasicLayers.h"
> > #include "nsFocusManager.h"
> > #include "nsTextEditorState.h"
> > #include "nsIPluginHost.h"
> > #include "nsICategoryManager.h"
> > #include "nsAHtml5FragmentParser.h"
> >+#include "nsTreeSanitizer.h"
> 
> This seems unnecessary.

Removed.

> >+nsTreeSanitizer::nsTreeSanitizer(PRBool aAllowStyles, PRBool aAllowComments)
> >+ : mAllowStyles(aAllowStyles)
> >+ , mAllowComments(aAllowComments)
> >+{
> >+  if (!sElementsHTML) {
> >+    // Initialize lazily to avoid having to initialize at all if the user
> >+    // doesn't paste HTML or load feeds.
> >+    InitializeStatics();
> >+  }
> >+  nsresult rv;
> >+  mNullPrincipal = do_CreateInstance(NS_NULLPRINCIPAL_CONTRACTID, &rv);
> 
> Please drop the unused |rv| variable.

Ah. Autocomplete failed to tell me the second argument has a default value. Dropped.

> >+}
> >+
> >+nsTreeSanitizer::~nsTreeSanitizer()
> >+{
> >+}
> 
> Ditto for the destructor.

Not dropped. Dropping it breaks the build by causing complaints about the destructor of nsCOMPtr<nsIPrincipal> mNullPrincipal. I don't understand C++ templates well enough to say why this happens.

> >+PRBool
> >+nsTreeSanitizer::MustPrune(PRInt32 aNamespace, nsIAtom* aLocal)
> >+{
> >+  // To avoid attacks where a MathML script becomes something that gets
> >+  // serialized in a way that it parses back as an HTML script, let's just
> >+  // drop these regardless of namespace
> >+  if (nsGkAtoms::script == aLocal) {
> 
> What are the cases which you're worried about here?  Is it possible to just fix
> those cases here?

I'm worried about the case where the sanitizer considers <script> in the MathML namespace harmless but when the node is serialized and reparsed, the tag appears in a context where it becomes an HTML <script>.

That's why <script> in any namespace is treated as dangerous. <script> is invalid MathML anyway.

Clarified the comment.

> >+PRBool
> >+nsTreeSanitizer::SanitizeStyleSheet(const nsAString& aOriginal,
> >+                                    nsAString& aSanitized,
> >+                                    nsIDocument* aDocument,
> >+                                    nsIURI* aBaseURI)
> >+{
> >+  nsresult rv;
> >+  aSanitized.Truncate();
> >+  // aSanitized will hold the permitted CSS text.
> >+  // We use a white-listing approach, so we explicitly allow
> >+  // the CSS style and font-face rule types.  We also clear
> >+  // -moz-binding CSS properties.
> 
> I don't think that this comment is accurate.  We're actually blacklisting, are
> we not?

It's your comment that I've just moved to a different file. :-) 

But I think the code is indeed blacklisting, so changed the comment.

> >+void
> >+nsTreeSanitizer::SanitizeAttributes(nsIContent* aElement,
> 
> >+      // Allow underscore to cater to the MCE editor library.
> >+      // Allow data-* on SVG and MathML, too, as a forward-compat measure.
> 
> This kind of scares me.  Is this forward-compat measure actually needed right
> now?

Do you expect SVG and MathML to actually allocate data-* attributes to something dangerous in the future instead of using them in a way that's consistent with HTML?

> >+void
> >+nsTreeSanitizer::InitializeStatics()
> >+{
> >+  NS_PRECONDITION(!sElementsHTML, "Initializing a second time.");
> >+
> >+  sElementsHTML = new nsTHashtable<nsISupportsHashKey> ();
> >+  sElementsHTML->Init(80);
> >+  for (PRUint32 i = 0; kElementsHTML[i]; i++) {
> >+    sElementsHTML->PutEntry(*kElementsHTML[i]);
> >+  }
> 
> Hmm, why initing the hashtable with size 80 when you know the exact size?

Copied from the old code. Hard-coded better numbers. NS_ARRAY_LENGTH doesn't 
work in this case.

> >+void
> >+nsTreeSanitizer::ReleaseStatics()
> >+{
> >+  if (sElementsHTML) {
> >+    delete sElementsHTML;
> >+    sElementsHTML = nsnull;
> >+  }
> 
> You could make this a bit more compact by not null-checking, as null pointers
> are safe to delete.

Changed.

> >+  if (!aTrustedInput) {
> >+    nsRefPtr<nsTreeSanitizer> sanitizer =
> >+        new nsTreeSanitizer(!!aContextLocalName, !aContextLocalName);
> >+    sanitizer->Sanitize(fragment);
> >   }
> 
> Can't the sanitizer object be allocated on the stack?

Changed it to NS_STACK_CLASS.

> >+  <div id="ss" contenteditable="true"></div>
> >+  <div id="tt" contenteditable="true"></div>
> >+  <div id="uu" contenteditable="true"></div>
> >+  <div id="vv" contenteditable="true"></div>
> >+  <div id="ww" contenteditable="true"></div>
> >+  <div id="xx" contenteditable="true"></div>
> 
> Can you please add designMode equivalent versions for these tests too?  Just to
> make sure that we're using the same underlying code for both modes.

Added.

Thanks.
Comment 12 :Ehsan Akhgari 2011-03-24 13:51:09 PDT
Comment on attachment 521471 [details] [diff] [review]
Part 1: Atoms

r=me.
Comment 13 :Ehsan Akhgari 2011-03-24 14:17:32 PDT
Comment on attachment 521473 [details] [diff] [review]
Part 2: New sanitizer and removal of the old one

(In reply to comment #11)
> I've split out the nsGkAtoms part, so that the rest is easier to back out even
> if other atom changes land. However, other than that, I don't see a reasonable
> way to split that would actually make review any easier. :-( Did you have a
> particular split in mind?

Hmm, well, this might be easier said than done, but it was really helpful if we have a first part which just moved code around, and a second one (or several) which changed the moved code, so that the actual changes made to the code were more apparent.  But thinking about it agian, it might be very hard to do this now that you've written the code.  So, nevermind!

> > Much of the below comments can be described as nits.  Also,
> > please get bz to also look at this, I'm sure he can provide you with a lot of
> > good comments as well.
> 
> I was planning on asking bz to sr.

Maybe ask someone else for sr?  I'd _really_ like bz to take a look at this.

> > >   /**
> > >    * Get the next node in the pre-order tree traversal of the DOM.  If
> > >    * aRoot is non-null, then it must be an ancestor of |this|
> > >    * (possibly equal to |this|) and only nodes that are descendants of
> > >    * aRoot, not including aRoot itself, will be returned.  Returns
> > >    * null if there are no more nodes to traverse.
> > >+   * If aSkipChildren is true, the next node is computed as if |this|
> > >+   * had no children.
> > >    */
> > >-  nsIContent* GetNextNode(const nsINode* aRoot = nsnull) const
> > >+  nsIContent* GetNextNode(const nsINode* aRoot = nsnull,
> > >+                          const PRBool aSkipChildren = PR_FALSE) const
> > 
> > I don't really like this, as the new boolean flag basically changes the
> > semantics of the function.  Why can't you add a new method to implement your
> > desired semantics?
> 
> I did that first, but then I felt bad about copypasted duplicate code.

You don't need to duplicate code like that.  You can move the code to a private helper, which is parametrized, and just call it from the two public methods in the correct way.  What I didn't like about the function was its signature, not its body.

> > >+}
> > >+
> > >+nsTreeSanitizer::~nsTreeSanitizer()
> > >+{
> > >+}
> > 
> > Ditto for the destructor.
> 
> Not dropped. Dropping it breaks the build by causing complaints about the
> destructor of nsCOMPtr<nsIPrincipal> mNullPrincipal. I don't understand C++
> templates well enough to say why this happens.

Hmm.  Well, that's kind of surprising.  I thought that the default destructor just calls the destructor on base classes (which you don't have) and members, which would mean nsCOMPtr<nsIPrincipal>::~nsCOMPtr.  Can you quote the compiler error to satisfy my curiosity, please?

> > >+  // aSanitized will hold the permitted CSS text.
> > >+  // We use a white-listing approach, so we explicitly allow
> > >+  // the CSS style and font-face rule types.  We also clear
> > >+  // -moz-binding CSS properties.
> > 
> > I don't think that this comment is accurate.  We're actually blacklisting, are
> > we not?
> 
> It's your comment that I've just moved to a different file. :-) 

That was why it sounded familar!  ;-)  Thanks for fixing it.

> > >+void
> > >+nsTreeSanitizer::SanitizeAttributes(nsIContent* aElement,
> > 
> > >+      // Allow underscore to cater to the MCE editor library.
> > >+      // Allow data-* on SVG and MathML, too, as a forward-compat measure.
> > 
> > This kind of scares me.  Is this forward-compat measure actually needed right
> > now?
> 
> Do you expect SVG and MathML to actually allocate data-* attributes to
> something dangerous in the future instead of using them in a way that's
> consistent with HTML?

Not really, but I don't feel confident with leaving this in the code and later on forgetting to remove them when somebody adds a data-pwn attribute to SVG which runs code, for example.  Just a marginal improvement, as I don't believe that the data-pwn scenario is likely.

> > >+void
> > >+nsTreeSanitizer::InitializeStatics()
> > >+{
> > >+  NS_PRECONDITION(!sElementsHTML, "Initializing a second time.");
> > >+
> > >+  sElementsHTML = new nsTHashtable<nsISupportsHashKey> ();
> > >+  sElementsHTML->Init(80);
> > >+  for (PRUint32 i = 0; kElementsHTML[i]; i++) {
> > >+    sElementsHTML->PutEntry(*kElementsHTML[i]);
> > >+  }
> > 
> > Hmm, why initing the hashtable with size 80 when you know the exact size?
> 
> Copied from the old code. Hard-coded better numbers. NS_ARRAY_LENGTH doesn't 
> work in this case.

That's because kElementsHTML and friends are not declared as arrays, they're just pointers looking like arrays.  I think a much easier approach is to just move them to an anonymous namespace inside nsTreeSanitizer.cpp (instead of being a nsTreeSanitizer static private member), convert them to real arrays, and then just use NS_ARRAY_LENGTH, which guarantees that we'll always have the right number.

> > >+  if (!aTrustedInput) {
> > >+    nsRefPtr<nsTreeSanitizer> sanitizer =
> > >+        new nsTreeSanitizer(!!aContextLocalName, !aContextLocalName);
> > >+    sanitizer->Sanitize(fragment);
> > >   }
> > 
> > Can't the sanitizer object be allocated on the stack?
> 
> Changed it to NS_STACK_CLASS.

Perfect.  /me hates operator new.

> > >+  <div id="ss" contenteditable="true"></div>
> > >+  <div id="tt" contenteditable="true"></div>
> > >+  <div id="uu" contenteditable="true"></div>
> > >+  <div id="vv" contenteditable="true"></div>
> > >+  <div id="ww" contenteditable="true"></div>
> > >+  <div id="xx" contenteditable="true"></div>
> > 
> > Can you please add designMode equivalent versions for these tests too?  Just to
> > make sure that we're using the same underlying code for both modes.
> 
> Added.

Perfect!
Comment 14 Henri Sivonen (:hsivonen) 2011-03-28 05:14:16 PDT
Created attachment 522337 [details] [diff] [review]
Part 2: New sanitizer and removal of the old one, v2

(In reply to comment #13)
> > > Much of the below comments can be described as nits.  Also,
> > > please get bz to also look at this, I'm sure he can provide you with a lot of
> > > good comments as well.
> > 
> > I was planning on asking bz to sr.
> 
> Maybe ask someone else for sr?  I'd _really_ like bz to take a look at this.

OK. Requesting r from bz?

> > > >   /**
> > > >    * Get the next node in the pre-order tree traversal of the DOM.  If
> > > >    * aRoot is non-null, then it must be an ancestor of |this|
> > > >    * (possibly equal to |this|) and only nodes that are descendants of
> > > >    * aRoot, not including aRoot itself, will be returned.  Returns
> > > >    * null if there are no more nodes to traverse.
> > > >+   * If aSkipChildren is true, the next node is computed as if |this|
> > > >+   * had no children.
> > > >    */
> > > >-  nsIContent* GetNextNode(const nsINode* aRoot = nsnull) const
> > > >+  nsIContent* GetNextNode(const nsINode* aRoot = nsnull,
> > > >+                          const PRBool aSkipChildren = PR_FALSE) const
> > > 
> > > I don't really like this, as the new boolean flag basically changes the
> > > semantics of the function.  Why can't you add a new method to implement your
> > > desired semantics?
> > 
> > I did that first, but then I felt bad about copypasted duplicate code.
> 
> You don't need to duplicate code like that.  You can move the code to a private
> helper, which is parametrized, and just call it from the two public methods in
> the correct way.  What I didn't like about the function was its signature, not
> its body.

OK. Changed.

> > > >+}
> > > >+
> > > >+nsTreeSanitizer::~nsTreeSanitizer()
> > > >+{
> > > >+}
> > > 
> > > Ditto for the destructor.
> > 
> > Not dropped. Dropping it breaks the build by causing complaints about the
> > destructor of nsCOMPtr<nsIPrincipal> mNullPrincipal. I don't understand C++
> > templates well enough to say why this happens.
> 
> Hmm.  Well, that's kind of surprising.  I thought that the default destructor
> just calls the destructor on base classes (which you don't have) and members,
> which would mean nsCOMPtr<nsIPrincipal>::~nsCOMPtr.  Can you quote the compiler
> error to satisfy my curiosity, please?

In file included from ../../../dist/include/nsComponentManagerUtils.h:46,
                 from ../../../dist/include/nsIComponentManager.h:164,
                 from /opt/Projects/mozilla-html5/toolkit/components/feeds/nsScriptableUnescapeHTML.cpp:38:
../../../dist/include/nsCOMPtr.h: In destructor ‘nsCOMPtr<T>::~nsCOMPtr() [with T = nsIPrincipal]’:
../../../dist/include/nsTreeSanitizer.h:44:   instantiated from here
../../../dist/include/nsCOMPtr.h:531: error: invalid static_cast from type ‘nsIPrincipal*’ to type ‘nsISupports*’
../../../dist/include/nsCOMPtr.h:533: error: invalid use of incomplete type ‘struct nsIPrincipal’
../../../dist/include/nsNodeInfoManager.h:53: error: forward declaration of ‘struct nsIPrincipal’
make[6]: *** [nsScriptableUnescapeHTML.o] Error 1

> > > >+void
> > > >+nsTreeSanitizer::SanitizeAttributes(nsIContent* aElement,
> > > 
> > > >+      // Allow underscore to cater to the MCE editor library.
> > > >+      // Allow data-* on SVG and MathML, too, as a forward-compat measure.
> > > 
> > > This kind of scares me.  Is this forward-compat measure actually needed right
> > > now?
> > 
> > Do you expect SVG and MathML to actually allocate data-* attributes to
> > something dangerous in the future instead of using them in a way that's
> > consistent with HTML?
> 
> Not really, but I don't feel confident with leaving this in the code and later
> on forgetting to remove them when somebody adds a data-pwn attribute to SVG
> which runs code, for example.  Just a marginal improvement, as I don't believe
> that the data-pwn scenario is likely.

I'm worried that if data-* is removed from SVG but not from HTML, people will work around the situation in unfortunate ways. I'm leaving this in for bz's review still.

> > > >+void
> > > >+nsTreeSanitizer::InitializeStatics()
> > > >+{
> > > >+  NS_PRECONDITION(!sElementsHTML, "Initializing a second time.");
> > > >+
> > > >+  sElementsHTML = new nsTHashtable<nsISupportsHashKey> ();
> > > >+  sElementsHTML->Init(80);
> > > >+  for (PRUint32 i = 0; kElementsHTML[i]; i++) {
> > > >+    sElementsHTML->PutEntry(*kElementsHTML[i]);
> > > >+  }
> > > 
> > > Hmm, why initing the hashtable with size 80 when you know the exact size?
> > 
> > Copied from the old code. Hard-coded better numbers. NS_ARRAY_LENGTH doesn't 
> > work in this case.
> 
> That's because kElementsHTML and friends are not declared as arrays, they're
> just pointers looking like arrays.  I think a much easier approach is to just
> move them to an anonymous namespace inside nsTreeSanitizer.cpp (instead of
> being a nsTreeSanitizer static private member), convert them to real arrays,
> and then just use NS_ARRAY_LENGTH, which guarantees that we'll always have the
> right number.

Done.

Thank you!
Comment 15 :Ehsan Akhgari 2011-03-28 12:54:17 PDT
(In reply to comment #14)
> In file included from ../../../dist/include/nsComponentManagerUtils.h:46,
>                  from ../../../dist/include/nsIComponentManager.h:164,
>                  from
> /opt/Projects/mozilla-html5/toolkit/components/feeds/nsScriptableUnescapeHTML.cpp:38:
> ../../../dist/include/nsCOMPtr.h: In destructor ‘nsCOMPtr<T>::~nsCOMPtr() [with
> T = nsIPrincipal]’:
> ../../../dist/include/nsTreeSanitizer.h:44:   instantiated from here
> ../../../dist/include/nsCOMPtr.h:531: error: invalid static_cast from type
> ‘nsIPrincipal*’ to type ‘nsISupports*’
> ../../../dist/include/nsCOMPtr.h:533: error: invalid use of incomplete type
> ‘struct nsIPrincipal’
> ../../../dist/include/nsNodeInfoManager.h:53: error: forward declaration of
> ‘struct nsIPrincipal’
> make[6]: *** [nsScriptableUnescapeHTML.o] Error 1

Hmm, that's coming from <http://mxr.mozilla.org/mozilla-central/source/xpcom/glue/nsTraceRefcnt.h#80> I bet.  I think that code should use NS_ISUPPORTS_CAST, which works around the ambiguous possible cast path.  Can you please file a bug about that?

And you should #include "nsIPrincipal.h" which should make this compile.
Comment 16 Henri Sivonen (:hsivonen) 2011-03-31 06:07:52 PDT
Created attachment 523290 [details] [diff] [review]
Part 2: New sanitizer and removal of the old one, v3

(In reply to comment #15)
> Hmm, that's coming from
> <http://mxr.mozilla.org/mozilla-central/source/xpcom/glue/nsTraceRefcnt.h#80> I
> bet.  I think that code should use NS_ISUPPORTS_CAST, which works around the
> ambiguous possible cast path.  Can you please file a bug about that?

Filed bug 646812.

> And you should #include "nsIPrincipal.h" which should make this compile.

Included in this revision of the patch. (And removed the empty destructor from the .cpp.)

(No other changes since the previous patch.)
Comment 17 Henri Sivonen (:hsivonen) 2011-04-01 00:30:41 PDT
Landed the atoms (part 1) into cedar:
http://hg.mozilla.org/projects/cedar/rev/cd195b9da57e
Comment 18 :Ehsan Akhgari 2011-04-01 14:01:43 PDT
First part landed: http://hg.mozilla.org/mozilla-central/rev/92ee9853e654
Comment 19 Henri Sivonen (:hsivonen) 2011-04-15 08:49:41 PDT
Created attachment 526269 [details] [diff] [review]
Part 2: New sanitizer and removal of the old one, v4

This patch incorporates the fix for bug 649566.
Comment 20 Henri Sivonen (:hsivonen) 2011-04-15 08:50:42 PDT
Created attachment 526270 [details] [diff] [review]
New hunks in v4 for reference

This patch shows the hunks that are in v4 but weren't in the same form in v3.
Comment 21 Boris Zbarsky [:bz] (still a bit busy) 2011-04-21 09:50:06 PDT
I have to second the "I wish this huge patch came in bite-sized chunks" sentiment of comment 9.  :(

Reading the patch now, though.
Comment 22 Ben Bucksch (:BenB) 2011-04-23 01:20:43 PDT
FYI, the mozSanitizer has the following features:
- being paranoid
- removing all HTML tags and attributes (!) that are not in a whitelist.
- the whitelist is configurable, by the caller and the user via pref.
  Each use of the class passes in a string with the allowed tags and their
  atttributes.
  In some circumstances, more things should be allowed than in others.
  For example, I do not allow <font> in the mail sanitizing.
- Anything that smells like javascript is forbidden
  - links that contain javascript:, data: etc. are stripped.
  - even content text nodes are surveyed for such things, to capture
    things like external entities being javascript: URLs, and the like.

The whole idea of the mozSanitizer was to protect against most bugs in Gecko, *not* to sit on top of the Gecko protections.

I only mention this because you may want to ensure that you have the same features.
Comment 23 Henri Sivonen (:hsivonen) 2011-04-25 23:51:14 PDT
(In reply to comment #22)
> FYI, the mozSanitizer has the following features:

This bug is not about replacing mozSanitizingSerializer. This is about replacing the paranoid fragment sinks.

> - removing all HTML tags and attributes (!) that are not in a whitelist.

The patch here does that.

> - the whitelist is configurable, by the caller and the user via pref.
>   Each use of the class passes in a string with the allowed tags and their
>   atttributes.
>   In some circumstances, more things should be allowed than in others.
>   For example, I do not allow <font> in the mail sanitizing.

The whitelist here is not configurable.

> - Anything that smells like javascript is forbidden
>   - links that contain javascript:, data: etc. are stripped.

JavaScript links are stripped here, too.

>   - even content text nodes are surveyed for such things, to capture
>     things like external entities being javascript: URLs, and the like.

Not done here.

> The whole idea of the mozSanitizer was to protect against most bugs in Gecko,
> *not* to sit on top of the Gecko protections.

That seems like a different goal. The goal here is to avoid XSS in feeds and in cross-origin paste while assuming that the rest of Gecko behaves as it should.

> I only mention this because you may want to ensure that you have the same
> features.

Thanks. Having the same features as mozSanitizingSerializer is outside the scope of this bug.
Comment 24 Henri Sivonen (:hsivonen) 2011-05-09 09:15:52 PDT
Created attachment 531064 [details] [diff] [review]
Part 2: New sanitizer and removal of the old one, v5

More CSS deCOM unrot.
Comment 25 Henri Sivonen (:hsivonen) 2011-06-29 05:06:45 PDT
Created attachment 542777 [details] [diff] [review]
Part 2: New sanitizer and removal of the old one, v6

Rebasing to trunk. The difference is that MOZ_MATHML doesn't exist anymore.

bz, what's the outlook for review? It would be nice to get this landed relatively soon after the start of the Firefox 8 cycle.
Comment 26 Boris Zbarsky [:bz] (still a bit busy) 2011-06-30 00:04:46 PDT
> bz, what's the outlook for review?

Near the top of my priority list, but big...

Within a week, I hope.
Comment 27 Boris Zbarsky [:bz] (still a bit busy) 2011-07-15 21:01:07 PDT
Comment on attachment 542777 [details] [diff] [review]
Part 2: New sanitizer and removal of the old one, v6

>+++ b/content/base/public/nsINode.h
>+    return GetNextNodeImpl(aRoot, PR_FALSE);

This change worries me a little bit: this is _very_ hot code.  I'd hope that it all gets inlined and then the compiler will just optimize away the branch because once you inline the value of aSkipChildren is a known constant, but could you please verify that this is indeed what happens?

>+++ b/content/base/public/nsTreeSanitizer.h
>+    nsTreeSanitizer(PRBool aAllowStyles, PRBool aAllowComments);

Please document what those arguments mean.

>+     * Sanitizer a DOM fragment.

s/sanitizer/sanitize/

>+    void Sanitize(nsIContent* aFragment);

But this only really takes document fragments so far.  Why not just enforce that on the type level by making this argument nsIDOMDocumentFragment?  Then you could QI at the top of this method instead of having QIs at all the callsites.

If you do keep this nsIContent* for some reason, the restriction about only really allowing document fragments should be clearly documented.

>+    nsCOMPtr<nsIPrincipal> mNullPrincipal;

Is there a reason to not just have a nullprincipal in the statics, as long as you have statics?


>+    PRBool MustFlatten(PRInt32 aNamespace, nsIAtom* aLocal);
>+    PRBool MustPrune(PRInt32 aNamespace, nsIAtom* aLocal);
>+    PRBool IsURL(nsIAtom*** aURLs, nsIAtom* aLocalName);
>+    void SanitizeAttributes(nsIContent* aElement,
....
>+    void SanitizeURL(nsIContent* aElement,
....
>+    PRBool SanitizeStyleRule(mozilla::css::StyleRule *aRule,
>+                             nsAutoString &aRuleText);
>+    PRBool SanitizeStyleSheet(const nsAString& aOriginal,

These all need to be documented.  I should have asked for that earlier instead of trying to figure out what they do by reading the code....  Document what the arguments mean, and what the return value means (and fo some of these what the function does; e.g. document that SanitizeStyleRule does not mutate aRule).

>+    static nsTHashtable<nsISupportsHashKey>* sElementsHTML;
>+    static nsTHashtable<nsISupportsHashKey>* sAttributesHTML;
>+    static nsTHashtable<nsISupportsHashKey>* sElementsSVG;
>+    static nsTHashtable<nsISupportsHashKey>* sAttributesSVG;
>+    static nsTHashtable<nsISupportsHashKey>* sElementsMathML;
>+    static nsTHashtable<nsISupportsHashKey>* sAttributesMathML;

Document that these are storage for the whitelists, please.

>+++ b/content/base/src/nsTreeSanitizer.cpp
>+nsTreeSanitizer::SanitizeAttributes(nsIContent* aElement,

This function is assuming that as long as you loop in reverse order from GetAttrCount() - 1 to 0, getting the name and then calling UnsetAttr won't mess up the iteration.

There's nothing that actually guarantees that, and I'd really rather not depend on it.  Can we avoid making this assumption, please?

>+  // If we've got HTML audio or video, add the controls attribute, because
>+  // otherwise the content is unplayable with scripts removed.
>+  if (aURLs == kURLAttributesHTML) {

Why not just use aElement->IsHTML(nsGkAtoms::whatever)?  If this is meant as an optimization, it seems like a premature one: the resulting code is more confusing and I doubt the performance difference is noticeable.

If you always know you're passing an element to SanitizeAttributes, make aElement an Element, not an nsIContent, please.  Similar for any other methods where you know you have an Element (e.g. SanitizeURL).

>+nsTreeSanitizer::Sanitize(nsIContent* aFragment) {
>+  nsIContent* next;

Why not declare that in the places where you actually use it?  That would make sure it doesn't accidentally get used from a previous iteration or anything.

>+  while (node) {
>+      if (nsGkAtoms::style == localName) {

Lack of the need for a namespace check here is because MustPrune removes <style> elements in other namespaces, right?  Worth documenting.  And moving the assert that this is XHTML or SVG up to before this test.

>+      if (MustFlatten(ns, localName)) {

The code here assumes that no mutation events can fire on this subtree while it's running, yes?  Is this an assumption we can make because we control where the document fragments are coming from?  If so, that needs to be documented somewhere; as the code stands passing in user-provided document fragments to this code is NOT a good move.

Same for the SetNodeTextContent bit above, actually.

>+        nsresult rv;
>+        while ((child = node->GetFirstChild())) {
>+          parent->InsertBefore(child, node, &rv);

I think that if NS_FAILED(rv) you should just break out of the loop and move on with parent->RemoveChild(node).

r- for now, because I want to see those comments and sort out the attribute loop thing.
Comment 28 Henri Sivonen (:hsivonen) 2011-07-18 08:21:55 PDT
Created attachment 546540 [details] [diff] [review]
Part 2: New sanitizer and removal of the old one, v7

(In reply to comment #27)
> Comment on attachment 542777 [details] [diff] [review] [review]
> Part 2: New sanitizer and removal of the old one, v6
> 
> >+++ b/content/base/public/nsINode.h
> >+    return GetNextNodeImpl(aRoot, PR_FALSE);
> 
> This change worries me a little bit: this is _very_ hot code.

Going back and forth here. :-( See comment 11 and comment 13.

> I'd hope that
> it all gets inlined and then the compiler will just optimize away the branch
> because once you inline the value of aSkipChildren is a known constant, but
> could you please verify that this is indeed what happens?

Is there any other way to verify except to disassemble opt builds on all tier 1 platforms and to examine the disassembly? I don't even have a build environment for Windows and Android and I'm not quite sure what the correct verification procedure is even for Linux.

(I'd be OK with going back to having some copypasted code in two method variants to make sure we know it compiles as to distinct variants without a branch in a hot loop.)

> >+++ b/content/base/public/nsTreeSanitizer.h
> >+    nsTreeSanitizer(PRBool aAllowStyles, PRBool aAllowComments);
> 
> Please document what those arguments mean.

Fixed.

While documenting this, I noticed that the sanitizer breaks parts of Microdata. Not fixed in this patch, yet.

> >+     * Sanitizer a DOM fragment.
> 
> s/sanitizer/sanitize/

Fixed.

> >+    void Sanitize(nsIContent* aFragment);
> 
> But this only really takes document fragments so far.  Why not just enforce
> that on the type level by making this argument nsIDOMDocumentFragment?  Then
> you could QI at the top of this method instead of having QIs at all the
> callsites.

The callsites need to QI to nsIContent* in practice anyway, because the fragment parsing API wants nsIContent* in order to support parsing directly to a context element in the innerHTML setter case, so enforcing the fragment using types instead of the assertions that are in the patch currently would mean we'd be left with a tiny bit less performant opt builds. And additional QI here doesn't matter much one way or the other for perf, but I've tried to get into the habit of avoiding QI since QIs all over the app do add up. I also tend to want to avoid a lot of QI, because it makes the code verbose, but OTOH, the assertions are just as verbose.

> If you do keep this nsIContent* for some reason, the restriction about only
> really allowing document fragments should be clearly documented.

For now, I fixed this in documentation on the theory that nsIContent is nicer for internal code than any nsIDOM* stuff that doesn't inherit from nsIContent and needs explicit QI even for passing as what intuitively would be superclasses.

> >+    nsCOMPtr<nsIPrincipal> mNullPrincipal;
> 
> Is there a reason to not just have a nullprincipal in the statics, as long
> as you have statics?

I cargo-culted this from existing code assuming that if it was OK to hold the null principal in a static, we'd already have an app-wide static for it.

Shouldn't we have an app-wide static for it then? In nsContentUtils maybe.

> >+    PRBool MustFlatten(PRInt32 aNamespace, nsIAtom* aLocal);
> >+    PRBool MustPrune(PRInt32 aNamespace, nsIAtom* aLocal);
> >+    PRBool IsURL(nsIAtom*** aURLs, nsIAtom* aLocalName);
> >+    void SanitizeAttributes(nsIContent* aElement,
> ....
> >+    void SanitizeURL(nsIContent* aElement,
> ....
> >+    PRBool SanitizeStyleRule(mozilla::css::StyleRule *aRule,
> >+                             nsAutoString &aRuleText);
> >+    PRBool SanitizeStyleSheet(const nsAString& aOriginal,
> 
> These all need to be documented.

Documented, though while doing so, I realized I don't understand the usage of SanitizeStyleRule, so I pinged Ehsan about the code I copied.

> >+    static nsTHashtable<nsISupportsHashKey>* sElementsHTML;
> >+    static nsTHashtable<nsISupportsHashKey>* sAttributesHTML;
> >+    static nsTHashtable<nsISupportsHashKey>* sElementsSVG;
> >+    static nsTHashtable<nsISupportsHashKey>* sAttributesSVG;
> >+    static nsTHashtable<nsISupportsHashKey>* sElementsMathML;
> >+    static nsTHashtable<nsISupportsHashKey>* sAttributesMathML;
> 
> Document that these are storage for the whitelists, please.

Documented.

> >+++ b/content/base/src/nsTreeSanitizer.cpp
> >+nsTreeSanitizer::SanitizeAttributes(nsIContent* aElement,
> 
> This function is assuming that as long as you loop in reverse order from
> GetAttrCount() - 1 to 0, getting the name and then calling UnsetAttr won't
> mess up the iteration.
> 
> There's nothing that actually guarantees that, and I'd really rather not
> depend on it.  Can we avoid making this assumption, please?

How do you suggest the avoidance be implemented?

> >+  // If we've got HTML audio or video, add the controls attribute, because
> >+  // otherwise the content is unplayable with scripts removed.
> >+  if (aURLs == kURLAttributesHTML) {
> 
> Why not just use aElement->IsHTML(nsGkAtoms::whatever)?  If this is meant as
> an optimization, it seems like a premature one: the resulting code is more
> confusing and I doubt the performance difference is noticeable.

Removed the premature optimization.

> If you always know you're passing an element to SanitizeAttributes, make
> aElement an Element, not an nsIContent, please.  Similar for any other
> methods where you know you have an Element (e.g. SanitizeURL).

Changed (though I don't see this change having much value).

> >+nsTreeSanitizer::Sanitize(nsIContent* aFragment) {
> >+  nsIContent* next;
> 
> Why not declare that in the places where you actually use it?  That would
> make sure it doesn't accidentally get used from a previous iteration or
> anything.

Changed.

> >+  while (node) {
> >+      if (nsGkAtoms::style == localName) {
> 
> Lack of the need for a namespace check here is because MustPrune removes
> <style> elements in other namespaces, right?  Worth documenting.

Yes, documented.

> And moving
> the assert that this is XHTML or SVG up to before this test.

Moved the assertion right to the top of the inside of the if block--not before it, since it would be wrong before it.

> >+      if (MustFlatten(ns, localName)) {
> 
> The code here assumes that no mutation events can fire on this subtree while
> it's running, yes?  Is this an assumption we can make because we control
> where the document fragments are coming from?  If so, that needs to be
> documented somewhere; as the code stands passing in user-provided document
> fragments to this code is NOT a good move.
> 
> Same for the SetNodeTextContent bit above, actually.

I assumed that mutation events don't fire on nodes disconnected from the tree and a DocumentFragment is always disconnected from the tree.

Is it incorrect to assume that mutation events can't fire on disconnected nodes?

Also, since the fragments come from the parser, the <script> doesn't get run and onfoo attributes aren't supported for mutation events, so I can't think of a way for hostile mutation event handlers to get attached.

> >+        nsresult rv;
> >+        while ((child = node->GetFirstChild())) {
> >+          parent->InsertBefore(child, node, &rv);
> 
> I think that if NS_FAILED(rv) you should just break out of the loop and move
> on with parent->RemoveChild(node).

Done.

> r- for now, because I want to see those comments and sort out the attribute
> loop thing.

Also, this should probably preserve <meta itemprop> and <link itemprop> if we're whitelisting Microdata at all.
Comment 29 Henri Sivonen (:hsivonen) 2011-07-18 08:22:35 PDT
Created attachment 546541 [details] [diff] [review]
Diff from v6 to v7

Thank you for the review so far.
Comment 30 Henri Sivonen (:hsivonen) 2011-07-28 07:19:03 PDT
Created attachment 549102 [details] [diff] [review]
Part 2: New sanitizer and removal of the old one, v8

This iteration:
 * Preserves Microdata better (Microdata is still at risk of getting corrupted if it occurs on <object>, for example.)
 * Fixes some errors in the test cases.
 * Inspects all attributes again after removing one attribute to avoid relying on a non-existing guarantee of no attribute reshuffling upon removal.
Comment 31 Henri Sivonen (:hsivonen) 2011-07-28 07:19:36 PDT
Created attachment 549103 [details] [diff] [review]
Diff from v7 to v8
Comment 32 Boris Zbarsky [:bz] (still a bit busy) 2011-07-28 08:29:52 PDT
Ugh.  I'm sorry I didn't respond to your questions in comment 28.  I wasn't cced on the bug.  :(

> Is there any other way to verify except to disassemble opt builds on all tier 1
> platforms and to examine the disassembly?

Just doing it on one platform is probably good enough.  But I think I've convinced myself that if the compiler is not doing this optimization it's just silly, so let's not worry about it.

> The callsites need to QI to nsIContent* in practice anyway

Ah, ok.

> I cargo-culted this from existing code assuming that if it was OK to hold the
> null principal in a static, we'd already have an app-wide static for it.

There's not "the" null principal.  Each null principal object is a nonce principal.  In general, you want different nonces for things that have different security contexts.

But your particular usage of mNullPrincipal is very limited; in particular you never create any objects that have it as their principal.  So making yours static should be fine.  I would be opposed to having a static null principal available on content utils, though moving the whole "check whether this uri can be loaded by a null principal" API to content utils would be ok by me.

> How do you suggest the avoidance be implemented?

The simplest thing, probably, is to build an array of attr names on the stack up front and then go through it instead of going through the "live" attributes.

Or go through attributes and make a list of the ones to remove and then remove them...

I'll take a look at the solution you actually picked.

> Changed (though I don't see this change having much value).

At some point I hope to have GetAttr/SetAttr faster on Element than on nsIContent.

> I assumed that mutation events don't fire on nodes disconnected from the tree

This assumption is wrong, unfortunately.

> Also, since the fragments come from the parser, the <script> doesn't get run
> and onfoo attributes aren't supported for mutation events, so I can't think
> of a way for hostile mutation event handlers to get attached.

Right, that's the "control where the document fragments are coming from".  I'm ok with that if that limitation is very clearly spelled out in the API documentation.
Comment 33 Boris Zbarsky [:bz] (still a bit busy) 2011-07-28 08:49:56 PDT
Comment on attachment 549102 [details] [diff] [review]
Part 2: New sanitizer and removal of the old one, v8

>+++ b/content/base/public/nsTreeSanitizer.h
>+     * Sanitizes a disconnected DOM fragment. The argument must be a of type

s/a of/of/

and add documentation about it having to come directly from the parser, please.

>+     * @param aElement the element whole attributes to sanitize

s/whole attributes to sanitize/whose attributes should be sanitized/

>+    void SanitizeAttributes(mozilla::dom::Element* aElement,

Worth documenting aAllowStyle too.

>+     * @param aSanitize the reserialization without 'binding'

aSanitized.  Should mention that this is only set if true is returned.

>+++ b/content/base/src/nsTreeSanitizer.cpp
>+nsTreeSanitizer::MustPrune(PRInt32 aNamespace,
>+    if ((nsGkAtoms::meta == aLocal || nsGkAtoms::link == aLocal) &&
>+        !(aElement->HasAttr(kNameSpaceID_None, nsGkAtoms::itemprop) ||
>+          aElement->HasAttr(kNameSpaceID_None, nsGkAtoms::itemscope))) {

So with this change, something like:

  <link rel="stylesheet" itemprop href="something">

will no longer get pruned, right?  I guess the SanitizeAttributes changes make
this safe; please add a comment here pointing to that code.

>+      if (!((attrLocal == nsGkAtoms::rel &&
>+             aElement->IsHTML(nsGkAtoms::link)) ||
>+            (attrLocal == nsGkAtoms::name &&
>+             aElement->IsHTML(nsGkAtoms::meta))) &&
>+          aAllowed->GetEntry(attrLocal)) {

I think this would be more readable as:

      if (aAllowed->GetEntry(attrLocal) &&
          !(attrLocal == nsGkAtoms::rel &&
            aElement->IsHTML(nsGkAtoms::link)) &&
	  !(attrLocal == nsGkAtoms::name &&
            aElement->IsHTML(nsGkAtoms::meta))) {

Decrementing ac and restarting the loop seems fine in most cases.  It can
potentially make things O(N^2) in number of attributes, though... are we ok
with that?

r=me modulo the above and comment 32.
Comment 34 Henri Sivonen (:hsivonen) 2011-07-29 04:34:46 PDT
(In reply to comment #32)
> Ugh.  I'm sorry I didn't respond to your questions in comment 28.  I wasn't
> cced on the bug.  :(
> 
> > Is there any other way to verify except to disassemble opt builds on all tier 1
> > platforms and to examine the disassembly?
> 
> Just doing it on one platform is probably good enough.  But I think I've
> convinced myself that if the compiler is not doing this optimization it's
> just silly, so let's not worry about it.

OK. I left it like it was.

> > I cargo-culted this from existing code assuming that if it was OK to hold the
> > null principal in a static, we'd already have an app-wide static for it.
> 
> There's not "the" null principal.  Each null principal object is a nonce
> principal.  In general, you want different nonces for things that have
> different security contexts.
> 
> But your particular usage of mNullPrincipal is very limited; in particular
> you never create any objects that have it as their principal.  So making
> yours static should be fine.

I made it static.

> > Changed (though I don't see this change having much value).
> 
> At some point I hope to have GetAttr/SetAttr faster on Element than on
> nsIContent.

OK.

> > I assumed that mutation events don't fire on nodes disconnected from the tree
> 
> This assumption is wrong, unfortunately.

:-(

> > Also, since the fragments come from the parser, the <script> doesn't get run
> > and onfoo attributes aren't supported for mutation events, so I can't think
> > of a way for hostile mutation event handlers to get attached.
> 
> Right, that's the "control where the document fragments are coming from". 
> I'm ok with that if that limitation is very clearly spelled out in the API
> documentation.

Documented.

(In reply to comment #33)
> Comment on attachment 549102 [details] [diff] [review] [review]
> Part 2: New sanitizer and removal of the old one, v8
> 
> >+++ b/content/base/public/nsTreeSanitizer.h
> >+     * Sanitizes a disconnected DOM fragment. The argument must be a of type
> 
> s/a of/of/

Fixed.

> and add documentation about it having to come directly from the parser,
> please.

Fixed.

> >+     * @param aElement the element whole attributes to sanitize
> 
> s/whole attributes to sanitize/whose attributes should be sanitized/

Fixed.

> >+    void SanitizeAttributes(mozilla::dom::Element* aElement,
> 
> Worth documenting aAllowStyle too.

Fixed.

> >+     * @param aSanitize the reserialization without 'binding'
> 
> aSanitized.  Should mention that this is only set if true is returned.

Fixed.

> >+++ b/content/base/src/nsTreeSanitizer.cpp
> >+nsTreeSanitizer::MustPrune(PRInt32 aNamespace,
> >+    if ((nsGkAtoms::meta == aLocal || nsGkAtoms::link == aLocal) &&
> >+        !(aElement->HasAttr(kNameSpaceID_None, nsGkAtoms::itemprop) ||
> >+          aElement->HasAttr(kNameSpaceID_None, nsGkAtoms::itemscope))) {
> 
> So with this change, something like:
> 
>   <link rel="stylesheet" itemprop href="something">
> 
> will no longer get pruned, right?

Right.

> I guess the SanitizeAttributes changes
> make
> this safe; please add a comment here pointing to that code.

Fixed.

> >+      if (!((attrLocal == nsGkAtoms::rel &&
> >+             aElement->IsHTML(nsGkAtoms::link)) ||
> >+            (attrLocal == nsGkAtoms::name &&
> >+             aElement->IsHTML(nsGkAtoms::meta))) &&
> >+          aAllowed->GetEntry(attrLocal)) {
> 
> I think this would be more readable as:
> 
>       if (aAllowed->GetEntry(attrLocal) &&
>           !(attrLocal == nsGkAtoms::rel &&
>             aElement->IsHTML(nsGkAtoms::link)) &&
> 	  !(attrLocal == nsGkAtoms::name &&
>             aElement->IsHTML(nsGkAtoms::meta))) {

Fixed.

> Decrementing ac and restarting the loop seems fine in most cases.  It can
> potentially make things O(N^2) in number of attributes, though... are we ok
> with that?

I think that's OK. It's not a problem if there are only a few attributes. It's not a problem if all the attributes are kept. It's not a problem if all the attributes are removed. So I'm not worried about accidental triggering of bad behavior. And for malicious DoS, there are easier ways to bother Firefox users (that is, ways that don't involve baiting the user to subscribe to a feed or to copy and paste stuff). 

> r=me modulo the above and comment 32.

Thank you.

I'll land with the fixes once inbound reopens.
Comment 35 Henri Sivonen (:hsivonen) 2011-07-29 04:50:40 PDT
I sure hope this stays in the tree:
http://hg.mozilla.org/integration/mozilla-inbound/rev/8fb752f5e1fa
Comment 37 Sunil 2012-04-14 12:17:51 PDT
A question about comment #32

The following question was asked and responded to:
'
> I assumed that mutation events don't fire on nodes disconnected from the tree

This assumption is wrong, unfortunately.
'

Can someone give an example where disconnected nodes can fire Mutation events. 

(Sorry if this is not the right forum for this question. I have asked similar question on StackOverflow (http://stackoverflow.com/questions/10155161/firefox-dom-mutation-events-for-disconnected-nodes) and where ever I get the answer, I'll update the other thread).

Thanks.
Comment 38 Boris Zbarsky [:bz] (still a bit busy) 2012-04-14 18:01:05 PDT
> Can someone give an example where disconnected nodes can fire Mutation events. 

Any mutation to disconnected nodes will fire mutation events.  You just have to have the listeners somewhere on the nodes' parent chain.
Comment 39 Henri Sivonen (:hsivonen) 2012-09-26 22:49:41 PDT
*** Bug 554130 has been marked as a duplicate of this bug. ***

Note You need to log in before you can comment on or make changes to this bug.