Open Bug 1044332 Opened 10 years ago Updated 2 years ago

Stop treating XHTML as XML

Categories

(Core :: DOM: Core & HTML, defect)

x86
macOS
defect

People

(Reporter: ehsan.akhgari, Unassigned)

Let's agree that this new XHTML k00l tek is not going anywhere.  We are the only engine shipping semi-proper XHTML support, and it is still costing us after years, see bug 1036987 as evidence.  I worry that with the relatively young UA string of the Firefox OS browser, it may keep running into poorly tested content like this.

Johnny, Boris, do you agree that it's time to stop treating XHTML as XML?
Flags: needinfo?(jst)
Flags: needinfo?(bzbarsky)
Henri should weigh in as well!
Flags: needinfo?(hsivonen)
Er... WebKit/Blink treat XHTML as XML.  They just don't throw away what they've already parsed when they hit a parse error...  Is that all that's going on with bug 1036987, or are we getting served different content?
Flags: needinfo?(bzbarsky)
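
A minimal sketch of the distinction drawn above, between discarding everything on a well-formedness error and keeping what was parsed before it. This uses Python's standard-library `xml.etree.ElementTree.XMLPullParser`, not any browser code; the function name `parse_keep_partial` and the sample markup are hypothetical, chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# Not well-formed: the second <p> is never closed before </body>.
BROKEN = "<html><body><p>one</p><p>two</body></html>"

def parse_keep_partial(markup):
    """Return (start tags seen before the error, the error or None)."""
    parser = ET.XMLPullParser(events=("start",))
    # feed() queues a parse error instead of raising it immediately;
    # the error surfaces from read_events() after the earlier events.
    parser.feed(markup)
    seen, error = [], None
    try:
        for _event, elem in parser.read_events():
            seen.append(elem.tag)
    except ET.ParseError as exc:
        error = exc
    return seen, error

seen, error = parse_keep_partial(BROKEN)
print("kept before error:", seen)
print("error:", error)
```

The parser is still strict: the error is reported and parsing stops. The only difference from a discard-everything policy is that the caller keeps the content seen up to that point.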
Hmm, it's hard to tell...
It's certainly time to stop throwing strict XML parsing errors.
Flags: needinfo?(jst)
I thought I had replied to this, but I guess I failed to submit...

(In reply to :Ehsan Akhgari (not reading bugmail, needinfo? me!) from comment #0)
> Let's agree that this new XHTML k00l tek is not going anywhere. 

Agreed.

> We are the only engine shipping semi-proper XHTML support,

This is not true. Trident, Blink and WebKit do, too. Presto did, too, before it got abandoned.

> and it is still costing us after years, see bug 1036987 as evidence.

The cost here arises from the foolish notion that "mobile" would work without text/html. We should evangelize Google to give us a post-iPhone mobile UI instead of giving us a pre-iPhone mobile UI in Gmail.

> Johnny, Boris, do you agree that it's time to stop treating XHTML as XML?

I think we should not stop treating application/xhtml+xml as XML. Doing so would break something else. The application/xhtml+xml legacy that requires XML parsing is tiny compared to the text/html legacy, but it is there. It's not exactly clear to me if the IE team added application/xhtml+xml support in IE9 as a matter of late XHTML public relations or as a matter of compatibility, but they added it and gained compatibility.

So far, the breakage from current behavior is isolated, albeit affecting an important app. Even though the application/xhtml+xml legacy is tiny compared to the text/html legacy, that legacy might intersect with something important, too. (Google Maps was what forced some other engines to add XSLT after Trident and Gecko...)

Let's not forget that this stuff isn't just about parsing:

 * There are XML DOM behaviors. Changing those to HTML behaviors may break something.
 * There's the opportunity to apply XSLT. Unless we manage to get rid of XSLT first, that's a potential problem.
 * Since image/svg+xml exists, just getting rid of XML parsing for application/xhtml+xml wouldn't let us get rid of loading XML into a docshell.

I'd rather see e.g. Blink bear the cost of trying something as radical as getting rid of XSLT or XML parsing instead of us bearing the cost of discovering the impact of such feature removals from the platform.

(In reply to Johnny Stenback  (:jst, jst@mozilla.com) from comment #4)
> It's certainly time to stop throwing strict XML parsing errors.

I think this would make sense given ample developer time. As you know, rewriting our XML-to-docshell load path has been on my todo list since the Summit at Whistler (the later one of the two), but there have always been more important things to work on than XML.

If someone had the time to do this, I think the right way to go about this would be:

 1) Fork the HTML tokenizer Java code.
 2) Resurrect Anne's XML-ER/XML5 spec.
 3) Tweak the forked tokenizer code until it matches Anne's spec.
 4) Tweak the translator to translate the new XML-ER/XML5 parser to C++.
 5) Implement https://wiki.mozilla.org/Platform/XML_Rewrite with the new parser.
 6) Strategic bonus: Release the Java version to disrupt XML on the server side at the same time.
Flags: needinfo?(hsivonen)
Component: DOM → DOM: Core & HTML

It is likely that some people are relying on XHTML being parsed as XML with the fail-safe error behavior to reduce the risk of XSS and certain kinds of exfiltration, e.g. exfiltration of CSP nonce attributes on scripts. For example, people may be relying on the behavior of <noscript> being sane in XHTML to prevent things like https://www.acunetix.com/blog/web-security-zone/mutation-xss-in-google-search/. Therefore, regressing from strict XML parsing would likely add security vulnerabilities to sites.

If people want HTML5-style error recovery, they can easily get it by using the text/html content type.
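
As a rough illustration of the two error models being contrasted here, the sketch below runs the same malformed markup through a strict XML parser and a lenient HTML tokenizer from Python's standard library. It is an analogy only: `html.parser.HTMLParser` is a forgiving tokenizer, not an HTML5 tree builder, and the names `TagCollector`, `parse_as_xml`, and `parse_as_html` are hypothetical.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Valid as tag soup, but not well-formed XML: neither <p> is closed.
MARKUP = "<html><body><p>one<p>two</body></html>"

class TagCollector(HTMLParser):
    """Records every start tag the lenient tokenizer reports."""
    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

def parse_as_xml(markup):
    # Strict model: the first well-formedness error is fatal.
    try:
        ET.fromstring(markup)
        return "ok"
    except ET.ParseError as exc:
        return f"fatal: {exc}"

def parse_as_html(markup):
    # Lenient model: keep going and report whatever tags were found.
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags

print(parse_as_xml(MARKUP))
print(parse_as_html(MARKUP))
```

The strict path fails closed on the mismatched end tag, which is the property the comment above argues some sites depend on; the lenient path accepts the same bytes without complaint.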

Severity: normal → S3