Closed Bug 680532 Opened 13 years ago Closed 13 years ago

DOCTYPE internal subset is not serialized correctly for text/html

Categories

(Core :: DOM: Core & HTML, defect)

Platform: x86, All
Type: defect
Priority: Not set
Severity: minor

Tracking

()

RESOLVED INVALID

People

(Reporter: theimp, Unassigned)

Details

If you include an internal subset in the SGML DOCTYPE declaration of a text/html document, it is parsed improperly. The parser treats the leftover text as anonymous CDATA, which implies that the HEAD has been implicitly closed and the BODY has been implicitly opened. As a result, everything in the head is ignored, the DOM tree for the whole document is broken, and the mistaken CDATA is rendered.

Example:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd" 
[<!ATTLIST div replace CDATA #FIXED "off">]>
<html>
 <head>
  <title>Replace</title>
 </head>
 <body>
  <div replace="off">
   example
  </div>
 </body>
</html>

-----
Would render as:
-----

]> example

-----

Based on my testing, this seems to be because a greater-than character ">" is treated as the end of the DOCTYPE no matter where it appears. This is probably a special compatibility parsing "feature", since this problem is not exhibited when parsing application/xhtml+xml.
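
A minimal Python sketch of that guess follows. It only imitates the observed behaviour (the DOCTYPE ends at the first ">", with no bracket tracking); the helper name `split_doctype_at_first_gt` is hypothetical and nothing here is taken from Gecko or WebKit.

```python
import re


def split_doctype_at_first_gt(markup):
    """Return (doctype, rest), ending the DOCTYPE at the first '>' seen.

    Hypothetical helper that imitates the observed behaviour only; it is
    not taken from any browser's tokenizer.
    """
    match = re.search(r"<!doctype", markup, re.IGNORECASE)
    if match is None:
        return "", markup
    end = markup.find(">", match.start())
    if end == -1:
        return markup[match.start():], ""
    return markup[match.start():end + 1], markup[end + 1:]


sample = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" '
    '"http://www.w3.org/TR/html4/strict.dtd"\n'
    '[<!ATTLIST div replace CDATA #FIXED "off">]>\n'
    "<html>"
)
doctype, rest = split_doctype_at_first_gt(sample)
print(repr(rest))  # the leftover text starts with "]>"
```

With the example document, the ">" that closes the ATTLIST declaration is the first one found, so "]>" is left over as ordinary text, which matches the rendering shown above.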

Might be related to, or a duplicate of, Bug 191340.

I have a terrible feeling that this will be WONTFIXED, because in the current draft internal subsets won't be valid html5 even in compatibility mode. But it has always been valid HTML (all versions) and XHTML (all versions, even when delivered as text/html).

For the record, this is probably not a regression, because it hasn't worked at least as far back as ff3.15. Also, anything based on WebKit (all versions of Chrome, Safari, new versions of Epiphany, etc.) makes the same mistake for text/html but works for application/xhtml+xml, which is why I'm guessing that it's a deliberate compatibility quirk. All versions of Internet Explorer get it wrong in all possible modes (typical).

Opera always gets it right, though! (And have done for quite a while.)
We are correct per the HTML specification. HTML is not, and has never been, SGML. Opera will ship a correct version in the near future, if they haven't yet.
Status: UNCONFIRMED → RESOLVED
Closed: 13 years ago
Resolution: --- → INVALID
> HTML5 provides a new way to use custom attributes

Thank you, I know, but that's not any good to me in this case.

> We are correct per the HTML specification.

Sorry, but how do you conclude that? I have never seen such a parsing rule. Believe me, I've looked hard for one. Which HTML specification, even?

> HTML is not, and has never been, SGML.

I'm sorry, that's positively incorrect. HTML has always been an SGML application, and both the W3C and the IETF before them have always said as much.

http://www.w3.org/TR/html4/intro/sgmltut.html
http://www.rfc-editor.org/rfc/rfc1866.txt

etc. There are literally dozens of references to such statements, across dozens of standards.

XHTML can be SGML-compatible, but isn't SGML in fact; it's XML.

HTML 5 won't be, exactly; it's HTML5, not HTML, and can be either HTML (SGML) or XML-compatible, but again, need not be.

Perhaps, I'll allow, HTML is not, and never has been, properly parsed and rendered as SGML by user agents, and user agents are not, and never have been, full SGML parsers. But that's not the same thing at all. Saying that HTML is not SGML is like saying that XHTML is not XML.

> Opera will ship a correct version in the near future, if they haven't yet.

Why would they do that? And where is their bug on that behavior?

***

The W3C has been providing specific examples of this usage since at least 1999, and specifically present it as an example as part of several specifications (including SVG and XHTML2, which are not SGML but do or can have a DOCTYPE).

***

Never mind; this seems to already have been WONTFIX'd: Bugs 9071
> HTML 5 won't be, exactly; it's HTML5, not HTML
HTML5 *is* HTML these days, the HTML5 spec defines parsing of any HTML document, including HTML4.

> Why would they do that? 
Because they have to follow the spec (and they are very eager contributors to it).

> And where is their bug on that behavior?
Their bugs aren't public.
I'm not deliberately trying to be offensive; but I'd much rather that it was said "We'd rather not obey that part of the specification" (optionally: "because it causes such-and-such a problem"), and be marked RESOLVED WONTFIX than "this is not a bug because [vague specification with no references] says so", with countless contradictions, and marked RESOLVED INVALID.

> the HTML5 spec defines parsing of any HTML document, including HTML4.

Not any valid HTML document, as you can see by running the above test code through a validator (or reading the specs).

So, HTML5 defines the parsing of any HTML document, incompatibly with any non-HTML5 documents' own parsing specification. Or, not any HTML document, but only a specific subset of HTML documents.

Ergo, HTML5 is *not* HTML. It is HTML5, and HTML5 user agents must be compatible in certain ways with pre-HTML5 HTML documents; but specifically not in all ways (such as this). QED.

(It also violates almost every other specification it touches, from EcmaScript to MIME to the W3C's own DOM and Forms specifications, and more. But that's a complaint for another day.)

> Because they have to follow the spec (and they are very eager contributors to it).

HTML5 is *years* away from standardization.

This bug has existed in the codebases of all major and most minor browsers for tens of years.

> Their bugs aren't public

Technically true; however, anyone can apply to have access to their bug database.

***

As I've said, it seems to be a duplicate of another WONTFIX'd bug.
(In reply to theimp from comment #3)
> > HTML5 provides a new way to use custom attributes
> 
> Thank you, I know, but that's not any good to me in this case.
> 
> > We are correct per the HTML specification.
> 
> Sorry, but how do you conclude that? I have never seen such a parsing rule.
> Believe me, I've looked hard for one. Which HTML specification, even?

<whatwg.org/html/#parsing>. You can follow the tokenization and tree building algorithms and will get the same result.
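
As a rough illustration of the tree building step (a simplified sketch, not the spec's algorithm verbatim and not Gecko's code): once the tokenizer has emitted the DOCTYPE at the first ">", the leftover "]" arrives as a non-whitespace character token in the "before html" insertion mode, and each mode's "anything else" branch falls through to the next. The table and the function name `implied_steps` are illustrative assumptions.

```python
# Insertion mode -> (action implied by a non-whitespace character token,
# next insertion mode). Simplified from the "anything else" branches of the
# tree construction rules; all other token handling is omitted.
ANYTHING_ELSE = {
    "before html": ("open <html>", "before head"),
    "before head": ("open <head>", "in head"),
    "in head": ("close <head>", "after head"),
    "after head": ("open <body>", "in body"),
}


def implied_steps(mode="before html"):
    """List what a stray text character implies, starting from `mode`."""
    steps = []
    while mode in ANYTHING_ELSE:
        action, mode = ANYTHING_ELSE[mode]
        steps.append(action)
    # Once "in body" is reached, the character is inserted as body text.
    steps.append("insert text into <body>")
    return steps


for step in implied_steps():
    print(step)
```

Under these assumptions the stray "]>" opens and closes the head and opens the body before the document's own <html> and <head> tags are ever seen.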

> > HTML is not, and has never been, SGML.
> 
> I'm sorry, that's positively incorrect. HTML has always been an SGML
> application, and both the W3C and the IETF before them have always said as
> much.
> 
> http://www.w3.org/TR/html4/intro/sgmltut.html
> http://www.rfc-editor.org/rfc/rfc1866.txt
> 
> etc. There are literally dozens of references to such statements, across
> dozens of standards.

I know there are documents that claim that HTML is SGML. I also know there are documents that claim there are living beings on Mars. Both of these statements are positively incorrect.

> XHTML can be SGML-compatible, but isn't SGML in fact; it's XML.

I don't think I mentioned XML.

> HTML 5 won't be, exactly; it's HTML5, not HTML, and can be either HTML
> (SGML) or XML-compatible, but again, need not be.

There are no versions in HTML, and there never have been.

> Perhaps, I'll allow, HTML is not, and never has been, properly parsed and
> rendered as SGML by user agents, and that user agents do not, and never
> have, been full SGML parsers. But that's not the same thing at all. Saying
> that HTML is not SGML is like saying that XHTML is not XML.

XHTML is XML. That doesn't make HTML SGML.

> > Opera will ship a correct version in the near future, if they haven't yet.
> 
> Why would they do that? And where is their bug on that behavior?

In order to follow the specification, and be more compatible with contemporary web pages. If you're interested in their progress, the project goes under the name "Ragnarök".

> The W3C has been providing specific examples of this usage since at least
> 1999, and specifically present it as an example as part of several
> specifications (including SVG and XHTML2, which are not SGML but do or can
> have a DOCTYPE).

The W3C has, indeed, published a lot of nonsense.

(In reply to theimp from comment #5)
> I'm not deliberately trying to be offensive; but I'd much rather that it was
> said "We'd rather not obey that part of the specification" (optionally:
> "because it causes such-and-such a problem"), and be marked RESOLVED WONTFIX
> than "this is not a bug because [vague specification with no references]
> says so", with countless contradictions, and marked RESOLVED INVALID.

We don't obey the HTML4 specification, sure. We've never done. In this area, we've never even attempted. What it says is irrelevant. The HTML specification I linked to above clearly describes the correct behaviour here, which is the behaviour we implement.

> > the HTML5 spec defines parsing of any HTML document, including HTML4.
> 
> Not any valid HTML document, as you can see by running the above test code
> through a validator (or reading the specs).

Validators do, indeed, support a lot of misconceptions about how HTML actually works.

> So, HTML5 defines the parsing of any HTML document, incompatibly with any
> non-HTML5 documents' own parsing specification. Or, not any HTML document,
> but only a specific subset of HTML documents.

Not just for any HTML document, for any stream of bytes sent as text/html. The reason its definition doesn't match the claims of previous specifications is that these previous documents were rather dry science fiction.

> Ergo, HTML5 is *not* HTML. It is HTML5, and HTML5 user agents must be
> compatible in certain ways with pre-HTML5 HTML documents; but specifically
> not in all ways (such as this). QED.

HTML5 is HTML. As I've said before, HTML doesn't have versions.

> (It also violates almost every other specification it touches, from
> EcmaScript to MIME to the W3C's own DOM and Forms specifications, and more.
> But that's a complaint for another day.)

Indeed. That's because these specifications are, at least partially, science fiction as well.

> > Because they have to follow the spec (and they are very eager contributors to it).
> 
> HTML5 is *years* away from standardization.

That is irrelevant. It is, for all intents and purposes, significantly more stable, correct, and useful than the fiction the W3C called HTML4.

> This bug has existed in the codebases of all major and most minor browsers
> for tens of years.

Because it isn't a bug.

> > Their bugs aren't public
> 
> Technically true; however, anyone can apply to have access to their bug
> database.

Feel free to do that, then.

I hope this clarifies some matters.
Please accept my apologies.

Here's what I worked out, eventually (I can't find any official source that explains these consequences of HTML5 adoption by Mozilla):

***

The Mozilla HTML processing engine (primarily Gecko), circa early 2006(?), abandoned goals to parse or render any HTML other than in accordance with the HTML5 specification.

The Mozilla project does NOT support, no longer intends to support, and will no longer attempt to support, any of the following HTML standards:
Open Mobile Alliance XHTML Mobile Profile 1.0, 1.1, 1.2 [XHTML-MP]
W3C HTML 3.2 [HTML32]
W3C HTML 4.0, 4.01 [HTML4]
W3C XHTML™ 1.0, 1.1 The Extensible HyperText Markup Language [XHTML]
W3C XHTML™ Basic 1.0, 1.1 [XHTMLBASIC]

Nor any of the following proposed, draft, experimental, informal, obsolete, or retired standards:
(Non-standard) HTML Tags
(Non-standard) HTML 3.0
(Non-standard) HTML+
(Non-standard) Compact HTML
IETF Hypertext Markup Language - 2.0 [RFC1866]
IETF Form-based File Upload in HTML [RFC1867]
IETF A Proposed Extension to HTML : Client-Side Image Maps [RFC1980]
IETF Internationalization of the Hypertext Markup Language [RFC2070]
IETF HTML Tables [RFC1942]
ISO/IEC 15445:2000
W3C XHTML™ 2.0 [XHTML2]

Or any other standard other than the (currently draft) HTML5 standard(s), which includes a standard for how to render non-HTML5 markup (including non-HTML5 HTML).

***

A free suggestion (no offense intended): perhaps you might like to quote the above text or something similar when someone mentions a non-HTML5 version of HTML? It might stop some tepid response posts.

Other standards, such as W3C CSS, look like they're supported; I need more data. Others still, such as EcmaScript, look like they are in the same position as HTML: not supported where they conflict with a superior specification (HTML). Yet others, such as W3C XHTML™ 1.1, seem to be supported coincidentally; they work, but if they don't, they won't be fixed (I need more data to verify this). Still other standards, such as Associating Style Sheets with XML documents 1.0, are "supported" inconsistently with their specifications, and will not be implemented fully or correctly, for reasons other than complying with a superior specification (for example, the specification is self-contradictory or unimplementable, or resources are insufficient).

An up-to-date reference would be ideal.

I think I'll head over to Bug 74263 or Bug 194907 and look at getting this (or something like it) published somewhere.

I might even do so myself.

Sorry.
> (I can't find any official source that explains these consequences of HTML5 adoption by Mozilla)

You can't find it, because there isn't such a source, I think.

> ... abandoned goals to parse or render any HTML other than in accordance with the HTML5 specification

I don't think that's the case. See for example bug 915, comment 324 up to comment 347. A sane patch would be accepted.

> W3C CSS
> EcmaScript

Implementing (and developing) these standards is day-to-day business. 

> XHTML1.1

From http://www.w3.org/TR/html5/introduction.html#history-1 :
... the scope of the HTML5 specification include what had previously been specified in three separate documents: HTML4, XHTML1, and DOM2 HTML.


(Note that these are my personal, irrelevant opinions and I've no relationship to the Project other than using and supporting it.)
> XHTML1.1

http://www.w3.org/TR/html5/introduction.html#html-vs-xhtml :
This specification defines version 5 of the XHTML syntax, known as "XHTML5".