Last Comment Bug 651072 - Support HTML parsing in XMLHttpRequest per XMLHttpRequest Level 2
: Support HTML parsing in XMLHttpRequest per XMLHttpRequest Level 2
Status: RESOLVED FIXED
: dev-doc-complete
Product: Core
Classification: Components
Component: DOM (show other bugs)
: Trunk
: All All
: -- normal (vote)
: mozilla11
Assigned To: Henri Sivonen (:hsivonen)
:
Mentors:
Depends on: 692434 704284 704760 704761 705072 705216
Blocks: 102699 482911
  Show dependency treegraph
 
Reported: 2011-04-19 04:07 PDT by Henri Sivonen (:hsivonen)
Modified: 2012-05-03 07:49 PDT (History)
16 users (show)
hsivonen: in‑testsuite+
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---


Attachments
First attempt at implementation (24.87 KB, patch)
2011-09-23 06:29 PDT, Henri Sivonen (:hsivonen)
no flags Details | Diff | Review
Support HTML in XHR, with tests (39.52 KB, patch)
2011-10-10 03:04 PDT, Henri Sivonen (:hsivonen)
no flags Details | Diff | Review
Support HTML in XHR, with tests and with sync XHR support (41.02 KB, patch)
2011-11-01 08:05 PDT, Henri Sivonen (:hsivonen)
no flags Details | Diff | Review
Support HTML in XHR, with tests and with sync XHR support and non-support for multipart (45.30 KB, patch)
2011-11-11 08:35 PST, Henri Sivonen (:hsivonen)
bugs: review+
Details | Diff | Review
Make extension updater treat text/html responses as errors (1.33 KB, patch)
2011-11-11 08:38 PST, Henri Sivonen (:hsivonen)
dtownsend: review+
Details | Diff | Review
Address review comments (14.68 KB, patch)
2011-11-15 05:06 PST, Henri Sivonen (:hsivonen)
bugs: review+
Details | Diff | Review
Deal with null mResponseXML after error (2.26 KB, patch)
2011-11-17 05:51 PST, Henri Sivonen (:hsivonen)
no flags Details | Diff | Review
Supplemental patch to fix the problems that lead to backout and to unsupport the synchronous mode (12.00 KB, patch)
2011-11-20 23:33 PST, Henri Sivonen (:hsivonen)
bugs: review+
Details | Diff | Review

Description Henri Sivonen (:hsivonen) 2011-04-19 04:07:25 PDT
Add support for parsing text/html into a DOM in XMLHttpRequest per the XMLHttpRequest Level 2 spec:
http://dev.w3.org/2006/webapi/XMLHttpRequest-2/#document-response-entity-body
Comment 1 Henri Sivonen (:hsivonen) 2011-09-23 06:29:00 PDT
Created attachment 562026 [details] [diff] [review]
First attempt at implementation

This is going to need a lot of testing...
Comment 2 Henri Sivonen (:hsivonen) 2011-10-10 03:04:41 PDT
Created attachment 565887 [details] [diff] [review]
Support HTML in XHR, with tests
Comment 3 Henri Sivonen (:hsivonen) 2011-11-01 08:05:19 PDT
Created attachment 571005 [details] [diff] [review]
Support HTML in XHR, with tests and with sync XHR support
Comment 4 Henri Sivonen (:hsivonen) 2011-11-11 08:35:08 PST
Created attachment 573822 [details] [diff] [review]
Support HTML in XHR, with tests and with sync XHR support and non-support for multipart

As discussed earlier, HTML is not supported for multipart responses, because the current multipart infrastructure doesn't work with a parser that doesn't finish synchronously when OnStopRequest fires and it's unclear if there's demand for multipart support.
Comment 5 Henri Sivonen (:hsivonen) 2011-11-11 08:38:34 PST
Created attachment 573824 [details] [diff] [review]
Make extension updater treat text/html responses as errors

Currently, the extension updater relies on error pages getting a null responseXML. When HTML parsing is supported, text/html error pages result in a non-null responseXML. However, it's not OK to check the XHR status field for the success code 200, because the extension manager also does requests whose responses don't expose a status code (I did check if this was due to security concerns or due to the channel not being an HTTP channel).
Comment 6 Dave Townsend [:mossop] 2011-11-11 09:15:42 PST
(In reply to Henri Sivonen (:hsivonen) from comment #5)
> Created attachment 573824 [details] [diff] [review] [diff] [details] [review]
> Make extension updater treat text/html responses as errors
> 
> Currently, the extension updater relies on error pages getting a null
> responseXML. When HTML parsing is supported, text/html error pages result in
> a non-null responseXML. However, it's not OK to check the XHR status field
> for the success code 200, because the extension manager also does requests
> whose responses don't expose a status code (I did check if this was due to
> security concerns or due to the channel not being an HTTP channel).

Actually mostly what we expect for non-XML pages is that the responseXML has the namespace XMLURI_PARSE_ERROR. What is the content of a failed html parse? Can we just turn off html parsing in this case? Html is not a valid response here.
Comment 7 Henri Sivonen (:hsivonen) 2011-11-11 09:45:02 PST
(In reply to Dave Townsend (:Mossop) from comment #6)
> Actually mostly what we expect for non-XML pages is that the responseXML has
> the namespace XMLURI_PARSE_ERROR.

With text/html responses, the root element with have the http://www.w3.org/1999/xhtml namespace.

> What is the content of a failed html parse?

HTML parsing can't fail (except for Out of Memory), so if the response is of type text/html, responseXML will be non-null.

> Can we just turn off html parsing in this case?

I tried to make the extension manager override the MIME type to application/xml, but browser_manualupdates.js failed anyway. I didn't investigate further, though maybe I should.
Comment 8 Henri Sivonen (:hsivonen) 2011-11-11 09:46:31 PST
(In reply to Henri Sivonen (:hsivonen) from comment #7)
> with have 

will have
Comment 9 Henri Sivonen (:hsivonen) 2011-11-11 09:52:05 PST
(In reply to Henri Sivonen (:hsivonen) from comment #7)
> I tried to make the extension manager override the MIME type to
> application/xml, but browser_manualupdates.js failed anyway. I didn't
> investigate further, though maybe I should.

Looking at the source, it seems like there has to be a channel type that is involved in browser_manualupdates.js and that really honor nsIChannel::SetContentType().
Comment 10 Henri Sivonen (:hsivonen) 2011-11-11 09:54:44 PST
(In reply to Henri Sivonen (:hsivonen) from comment #9)
> that really honor nsIChannel::SetContentType().

that *does not*

I can't type English correctly today, it seems.
Comment 11 Dave Townsend [:mossop] 2011-11-11 10:05:12 PST
Comment on attachment 573824 [details] [diff] [review]
Make extension updater treat text/html responses as errors

(In reply to Henri Sivonen (:hsivonen) from comment #10)
> (In reply to Henri Sivonen (:hsivonen) from comment #9)
> > that really honor nsIChannel::SetContentType().
> 
> that *does not*
> 
> I can't type English correctly today, it seems.

Strange, everything I can see is using http there. That said I've realised which bit of the code this is now (more caffeine helped!) and it might actually be ok to allow http in this case, but I'd probably need to test that a bit better. This is fine for now.
Comment 12 Olli Pettay [:smaug] 2011-11-11 12:03:21 PST
Comment on attachment 573822 [details] [diff] [review]
Support HTML in XHR, with tests and with sync XHR support and non-support for multipart

Btw, I filed https://bugzilla.mozilla.org/show_bug.cgi?id=701787
But as of now, it doesn't affect to this bug.

>   static nsresult CreateDocument(const nsAString& aNamespaceURI, 
>                                  const nsAString& aQualifiedName, 
>                                  nsIDOMDocumentType* aDoctype,
>                                  nsIURI* aDocumentURI,
>                                  nsIURI* aBaseURI,
>                                  nsIPrincipal* aPrincipal,
>                                  nsIScriptGlobalObject* aScriptObject,
>                                  bool aSVGDocument,
>-                                 nsIDOMDocument** aResult);
>+                                 nsIDOMDocument** aResult,
>+                                 bool aForceHtml = false);


ok, now it is happening... Adding another "force this document".
Could you perhaps replace 'bool aSVGDocument' with some enum.
It could have values none, html, svg for now.

IIRC, there aren't really that many nsContentUtils::CreateDocument calls in the tree, so the change should
be pretty small.

>-    mLoadLengthComputable(false), mLoadTotal(0),
>+    mLoadLengthComputable(false),
>+    mIsHtml(false),
>+    mLoadTotal(0),
Shouldn't you initialize also mWarnAboutMultipartHtml

>+  if (mWarnAboutMultipartHtml) {
>+    mWarnAboutMultipartHtml = false;
>+    nsContentUtils::ReportToConsole(nsContentUtils::eDOM_PROPERTIES,
>+                                    "HTMLMultipartXHRWarning",
>+                                    nsnull,
>+                                    0,
>+                                    nsnull, // Response URL not kept around
>+                                    EmptyString(),
>+                                    0,
>+                                    0,
>+                                    nsIScriptError::warningFlag,
>+                                    "DOM Events",
Is there some better string for this than "DOM Events".



>@@ -2041,49 +2087,67 @@ nsXMLHttpRequest::OnStopRequest(nsIReque
>     }
>   }
> 
>   channel->SetNotificationCallbacks(nsnull);
>   mNotificationCallbacks = nsnull;
>   mChannelEventSink = nsnull;
>   mProgressEventSink = nsnull;
> 
>-  mState &= ~XML_HTTP_REQUEST_SYNCLOOPING;
>-
>   if (NS_FAILED(status)) {
>     // This can happen if the server is unreachable. Other possible
>     // reasons are that the user leaves the page or hits the ESC key.
> 
>     mErrorLoad = true;
>     mResponseXML = nsnull;
>   }
> 
>-  NS_ASSERTION(!parser || parser->IsParserEnabled(),
>-               "Parser blocked somehow?");
>-
>   // If we're uninitialized at this point, we encountered an error
>   // earlier and listeners have already been notified. Also we do
>   // not want to do this if we already completed.
>   if (mState & (XML_HTTP_REQUEST_UNSENT |
>                 XML_HTTP_REQUEST_DONE)) {
>+    mState &= ~XML_HTTP_REQUEST_SYNCLOOPING;
>     return NS_OK;
>   }
This change makes the method a bit more error prone.
It is easier to forget to remove XML_HTTP_REQUEST_SYNCLOOPING.
So, add a comment to be careful with XML_HTTP_REQUEST_SYNCLOOPING handling.

>+class nsXHRParseEndListener : public nsIDOMEventListener
>+{
>+public:
>+  NS_DECL_ISUPPORTS
>+  nsresult HandleEvent(nsIDOMEvent *event) {
NS_IMETHOD HandleEvent
and { should be in the next line.


>+    mXHR->ChangeStateToDone();
>+    mXHR = nsnull;
>+    return NS_OK;
>+  }
>+  nsXHRParseEndListener(nsXMLHttpRequest* aXHR) : mXHR(aXHR) {}
>+  virtual ~nsXHRParseEndListener() {}
>+private:
>+  nsRefPtr<nsXMLHttpRequest> mXHR;
>+};

So, XHR keeps the document alive, and document keeps event listener manager alive,
and event listener manager keeps nsXHRParseEndListener alive which keeps XHR alive...
Could you have a raw reference to XHR here, and XHR could also have a raw reference to this listener, 
and if either one is being deleted, the variables would be set to null.
(I wonder why the patch doesn't leak right now, but I assume it could leak in some cases)





>@@ -694,17 +694,17 @@ nsHTMLDocument::StartDocumentLoad(const 
>   
>   if (loadAsHtml5 && !viewSource &&
>       (!(contentType.EqualsLiteral("text/html") || plainText) &&
>       aCommand && !nsCRT::strcmp(aCommand, "view"))) {
>     loadAsHtml5 = false;
>   }
>   
>   // TODO: Proper about:blank treatment is bug 543435
>-  if (loadAsHtml5 && !viewSource) {
>+  if (loadAsHtml5 && aCommand && !nsCRT::strcmp(aCommand, "view")) {
Could you explain this change.
Comment 13 Henri Sivonen (:hsivonen) 2011-11-15 05:06:55 PST
Created attachment 574575 [details] [diff] [review]
Address review comments

(In reply to Olli Pettay [:smaug] from comment #12)
> ok, now it is happening... Adding another "force this document".
> Could you perhaps replace 'bool aSVGDocument' with some enum.
> It could have values none, html, svg for now.

I added an enum, but I used a more descriptive value instead of "none". :-)

> >-    mLoadLengthComputable(false), mLoadTotal(0),
> >+    mLoadLengthComputable(false),
> >+    mIsHtml(false),
> >+    mLoadTotal(0),
> Shouldn't you initialize also mWarnAboutMultipartHtml

Yes.

> >+  if (mWarnAboutMultipartHtml) {
> >+    mWarnAboutMultipartHtml = false;
> >+    nsContentUtils::ReportToConsole(nsContentUtils::eDOM_PROPERTIES,
> >+                                    "HTMLMultipartXHRWarning",
> >+                                    nsnull,
> >+                                    0,
> >+                                    nsnull, // Response URL not kept around
> >+                                    EmptyString(),
> >+                                    0,
> >+                                    0,
> >+                                    nsIScriptError::warningFlag,
> >+                                    "DOM Events",
> Is there some better string for this than "DOM Events".

I don't know. I don't know what consumes that string and what strings it recognizes.

> So, add a comment to be careful with XML_HTTP_REQUEST_SYNCLOOPING handling.

Added.

> >+class nsXHRParseEndListener : public nsIDOMEventListener
> >+{
> >+public:
> >+  NS_DECL_ISUPPORTS
> >+  nsresult HandleEvent(nsIDOMEvent *event) {
> NS_IMETHOD HandleEvent
> and { should be in the next line.

OK.

> >+    mXHR->ChangeStateToDone();
> >+    mXHR = nsnull;
> >+    return NS_OK;
> >+  }
> >+  nsXHRParseEndListener(nsXMLHttpRequest* aXHR) : mXHR(aXHR) {}
> >+  virtual ~nsXHRParseEndListener() {}
> >+private:
> >+  nsRefPtr<nsXMLHttpRequest> mXHR;
> >+};
> 
> So, XHR keeps the document alive, and document keeps event listener manager
> alive,
> and event listener manager keeps nsXHRParseEndListener alive which keeps XHR
> alive...
> Could you have a raw reference to XHR here, and XHR could also have a raw
> reference to this listener, 
> and if either one is being deleted, the variables would be set to null.

Wouldn't raw references be more error-prone than having an owning reference and a guarantee that the event will fire, so the reference will get nulled out?

> (I wonder why the patch doesn't leak right now, but I assume it could leak
> in some cases)

When the listener is added, it's clear that the parser has gotten far enough that DOMContentLoaded *will* fire. And nsXHRParseEndListener drops its XHR reference after the event has been handled.

> >-  if (loadAsHtml5 && !viewSource) {
> >+  if (loadAsHtml5 && aCommand && !nsCRT::strcmp(aCommand, "view")) {
> Could you explain this change.

Now that this line of code can see a broader range of parser commands, it's important to limit the special case to loading about:blank to a docshell. Otherwise, if about:blank is loaded via XHR, the old parser would be activated, which wouldn't be good.
Comment 14 Olli Pettay [:smaug] 2011-11-15 13:29:49 PST
Comment on attachment 574575 [details] [diff] [review]
Address review comments

> class nsXHRParseEndListener : public nsIDOMEventListener
> {
> public:
>   NS_DECL_ISUPPORTS
>-  nsresult HandleEvent(nsIDOMEvent *event) {
>+  NS_IMETHOD HandleEvent(nsIDOMEvent *event)
>+  {
>+    NS_ASSERTION(mXHR, "This event is supposed to fire exactly once.");
>     mXHR->ChangeStateToDone();
>-    mXHR = nsnull;
>+    mXHR = nsnull; // Break the reference cycle
>     return NS_OK;
>   }
>   nsXHRParseEndListener(nsXMLHttpRequest* aXHR) : mXHR(aXHR) {}
>   virtual ~nsXHRParseEndListener() {}
> private:
>   nsRefPtr<nsXMLHttpRequest> mXHR;
This feels still risky. Could you use nsWeakPtr. (XHR happens to extend nsSupportsWeakReference already)
Comment 15 Henri Sivonen (:hsivonen) 2011-11-15 23:45:20 PST
Thanks for the reviews. Landed with nsWeakPtr.
https://hg.mozilla.org/integration/mozilla-inbound/rev/75ace50df008
https://hg.mozilla.org/integration/mozilla-inbound/rev/5135cc4e6a36
Comment 16 Matt Brubeck (:mbrubeck) 2011-11-16 12:13:08 PST
Backed out 75ace50df008 in an attempt to fix test_XHRSendData.html crashes after a bad merge from mozilla-central to inbound:
https://hg.mozilla.org/integration/mozilla-inbound/rev/d0c677daedff

Example crash:
https://tbpl.mozilla.org/php/getParsedLog.php?id=7432104&tree=Mozilla-Inbound
Comment 17 :Ms2ger 2011-11-16 14:40:19 PST
The problem is (in the redirect test)

  if (NS_FAILED(status)) {
    // This can happen if the server is unreachable. Other possible
    // reasons are that the user leaves the page or hits the ESC key.

    mErrorLoad = true;
    mResponseXML = nsnull;
  }

and then

    nsCOMPtr<nsIDOMEventTarget> eventTarget = do_QueryInterface(mResponseXML);
    nsEventListenerManager* manager = eventTarget->GetListenerManager(true);

where status is NS_ERROR_NOT_AVAILABLE. (Which comes from nsInputStreamPump::mStatus.)

Haven't investigated further.
Comment 18 Marco Bonardo [::mak] 2011-11-17 02:53:59 PST
https://hg.mozilla.org/mozilla-central/rev/5135cc4e6a36
Comment 19 Henri Sivonen (:hsivonen) 2011-11-17 05:51:46 PST
Created attachment 575157 [details] [diff] [review]
Deal with null mResponseXML after error

(In reply to Ms2ger from comment #17)
> The problem is (in the redirect test)
> 
>   if (NS_FAILED(status)) {
>     // This can happen if the server is unreachable. Other possible
>     // reasons are that the user leaves the page or hits the ESC key.
> 
>     mErrorLoad = true;
>     mResponseXML = nsnull;
>   }
> 
> and then
> 
>     nsCOMPtr<nsIDOMEventTarget> eventTarget =
> do_QueryInterface(mResponseXML);
>     nsEventListenerManager* manager = eventTarget->GetListenerManager(true);

I changed this part to deal with null mResponseXML, but...

> where status is NS_ERROR_NOT_AVAILABLE. (Which comes from
> nsInputStreamPump::mStatus.)
> 
> Haven't investigated further.

The error still propagates via mErrorLoad in ways that tests don't expect.

What did your patches change to make the NS_ERROR_NOT_AVAILABLE occur where it didn't before?
Comment 20 Henri Sivonen (:hsivonen) 2011-11-17 07:33:50 PST
(In reply to Henri Sivonen (:hsivonen) from comment #19)
> What did your patches change to make the NS_ERROR_NOT_AVAILABLE occur where
> it didn't before?

I think I eyeballed through all the patches that came from the m-c side of the merge, but I didn't see anything that'd explain why something now says NS_ERROR_NOT_AVAILABLE but didn't when I pushed to try before the merge.
Comment 21 Henri Sivonen (:hsivonen) 2011-11-18 04:53:57 PST
Ms2ger, whatever is making this fail is the same thing that made you see the "cannot SetMetaDataElement" warning a couple of days ago. That's where the bad rv starts propagating.

Right about here:
http://mxr.mozilla.org/mozilla-central/source/content/html/document/src/nsHTMLDocument.cpp#947

Do you recall which change made you see that warning?
Comment 22 Henri Sivonen (:hsivonen) 2011-11-20 23:33:45 PST
Created attachment 575813 [details] [diff] [review]
Supplemental patch to fix the problems that lead to backout and to unsupport the synchronous mode

This supplemental patch:
 1) Deals with null mResponseXML after error
 2) Avoids propagating nsresult from the cache when a late write to the cache entry fails and the error gets overwritten in the XML case anyway.
 3) Makes the synchronous mode behave as if HTML parsing support had never been added (except for a console warning).
Comment 23 Olli Pettay [:smaug] 2011-11-21 03:04:20 PST
Comment on attachment 575813 [details] [diff] [review]
Supplemental patch to fix the problems that lead to backout and to unsupport the synchronous mode


>@@ -733,16 +734,30 @@ nsXMLHttpRequest::GetResponseXML(nsIDOMD
>                                     nsnull, // Response URL not kept around
>                                     EmptyString(),
>                                     0,
>                                     0,
>                                     nsIScriptError::warningFlag,
>                                     "DOM Events",
>                                     mOwner->WindowID());
>   }
>+  if (mWarnAboutSyncHtml) {
>+    mWarnAboutSyncHtml = false;
>+    nsContentUtils::ReportToConsole(nsContentUtils::eDOM_PROPERTIES,
>+                                    "HTMLSyncXHRWarning",
>+                                    nsnull,
>+                                    0,
>+                                    nsnull, // Response URL not kept around
>+                                    EmptyString(),
>+                                    0,
>+                                    0,
>+                                    nsIScriptError::warningFlag,
>+                                    "DOM Events",
>+                                    mOwner->WindowID());
>+  }

"DOM Events" is just wrong. Just use "DOM". It is used on some case.
Fix also the other error reporting.
Comment 24 Henri Sivonen (:hsivonen) 2011-11-21 03:43:11 PST
(In reply to Olli Pettay [:smaug] from comment #23)
> "DOM Events" is just wrong. Just use "DOM". It is used on some case.
> Fix also the other error reporting.

Thanks. Landed with s/DOM Events/DOM/:
https://hg.mozilla.org/integration/mozilla-inbound/rev/0c4d3b7be17e
Comment 25 Matt Brubeck (:mbrubeck) 2011-11-21 09:07:55 PST
https://hg.mozilla.org/mozilla-central/rev/0c4d3b7be17e
Comment 26 Henri Sivonen (:hsivonen) 2011-11-23 05:55:08 PST
Notes for documentation:

 * HTML parsing is not supported in the synchronous mode (because the synchronous mode fits badly into the architecture of Gecko and the Web platform and we don't want to let people use it more than what the legacy requires)
 
 * HTML support when responseType == "" should be considered to be at risk. If supporting HTML in the default mode causes too much trouble with legacy uses of XHR (particularly because error pages with text/html response bodies now get a non-null responseXML), we might have to disable HTML support for responseType == ""

 * HTML support in the asynchronous mode with responseType == "document" can be safely be treated as a feature that will appear in Firefox 11 (unless something totally unforeseeable happens).

 * If there is no declared encoding for incoming text/html content (on the HTTP level, by using a BOM or in a <meta> tag within the first 1024 bytes), the default encoding is UTF-8.
Comment 27 Henri Sivonen (:hsivonen) 2011-11-24 05:02:35 PST
Drafted some initial docs: https://developer.mozilla.org/en/HTML_in_XMLHttpRequest

The documentation assumes that the fix for bug 705072 will land.
Comment 28 Ben Bucksch (:BenB) 2011-11-24 20:26:22 PST
This is triggering for HTTP server errors with HTML error pages. :-(

The caller is expecting XML or JSON, but you're parsing the HTML error page now. I don't think it's a good idea to parse the HTML at all unless the caller requests it, e.g. by calling .responseHTML or similar. XHR is mostly a protocol, but many servers respond with an HTML page when they are confused/broken, but most of the time, this error HTML page is useless for the calling app. Parsing it automatically is misguided, in most cases.
Comment 29 Henri Sivonen (:hsivonen) 2011-11-24 22:40:18 PST
BenB, bug 705072 limits HTML parsing to responseType == "document". Legacy usage is resposeType == "". (The spec changed accordingly, too, yesterday.)
Comment 30 Ben Bucksch (:BenB) 2011-11-25 04:13:49 PST
That's good, thanks. This also fixes us (bug 705216) and avoids that normal XMLHttpRequests are not affected. Good, thanks.
Comment 31 Eric Shepherd [:sheppy] 2012-05-03 07:49:11 PDT
(In reply to Henri Sivonen (:hsivonen) from comment #27)
> Drafted some initial docs:
> https://developer.mozilla.org/en/HTML_in_XMLHttpRequest
> 
> The documentation assumes that the fix for bug 705072 will land.

That documentation has been overhauled and cleaned up by lots of people; it should be good.

Note You need to log in before you can comment on or make changes to this bug.