Closed Bug 48351 Opened 20 years ago Closed 20 years ago

XHTML served as text/html uses HTML parser.

Categories

(Core :: DOM: HTML Parser, defect, P2, critical)

VERIFIED FIXED

People

(Reporter: ian, Assigned: rickg)

Details

(Keywords: dataloss, testcase, xhtml, Whiteboard: [nsbeta3+][PDTP2])

Attachments

(5 files)

This is a hot potato. Quick pass it on, pass it on, pass it on. ;-)
[obscure Douglas Adams reference there...]

We currently pass anything given as text/html to the HTML parser, which then
decides which DTD to use -- Strict for Strict HTML4 and XHTML, and quirky
otherwise (and Transitional in certain cases too).

However, this is BAD because not all XHTML Strict is valid HTML4 Strict!

Example:

   <p> <a name="a" /> text </p> <p> Hello. </p>

Using the HTML Strict DTD, we think that is all one paragraph because the </p>
is not well formed!
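
For comparison, here is a minimal Python sketch of how an XML parser reads the same markup. Python's standard ElementTree merely stands in for an XML parser here (it is not Mozilla's parser), and the wrapping <body> root and the summarize helper are assumptions added so that the fragment forms a single well-formed document.

```python
import xml.etree.ElementTree as ET

# The fragment from the report, wrapped in a hypothetical <body> root so that
# it is one well-formed XML document.
FRAGMENT = '<body><p> <a name="a" /> text </p> <p> Hello. </p></body>'


def summarize(xml_text):
    """Return (tag, child count, flattened text) for each top-level child."""
    root = ET.fromstring(xml_text)
    return [
        (child.tag, len(child), "".join(child.itertext()).strip())
        for child in root
    ]


if __name__ == "__main__":
    for entry in summarize(FRAGMENT):
        print(entry)
```

Read as XML, <a name="a" /> is a complete empty element, so the parser reports two sibling paragraphs rather than one.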

Now, we have two options as I see it:

   1. Tell the WG to stop being silly and treat all text/html as HTML.
      This entails NOT using the Strict DTD for XHTML DOCTYPEs.
   2. When the HTML parser detects an XHTML doctype, switch to the XML
      parser and start over.

Note that we should not do anything based on the presence of the <?xml PI, since
that is (a) optional for XHTML, (b) does not indicate that a document is XHTML,
and (c) is valid (albeit officially meaningless) in an HTML document.
Nominating for nsbeta3. This *cripples* XHTML adoption.
A suggestion made in
http://lists.w3.org/Archives/Public/www-html/2000Jul/0085.html
was that you look for "xmlns" as the beginning of an attribute on the HTML
element.  If xmlns is present, then parse as XML.
Completely valid XHTML doesn't (can't!) have a namespace attribute. It is only
included in the attached test cases to get around bug 48445.

Also, looking for 'xmlns' is likely to mean a lot more of the document has to be
parsed before being redirected to the XML parser.

According to the XHTML spec, valid XHTML documents *must* have a DOCTYPE. That
solves our problems IMHO since we can just use the DOCTYPE sniffer to do the
work. The spec only says XHTML can be sent as text/html, not random XML with
interspersed XHTML content.
*** Bug 26022 has been marked as a duplicate of this bug. ***
Ian - I now realize that you're right that valid XHTML is not required to have
the xmlns attribute - but it is allowed.  See
http://www.w3.org/TR/REC-xml#sec-attr-defaults
In reference to attachment #2:

Actually it's the HTMLTokenizer that treats <strong /> as <strong> rather than
<strong></strong>, and then the Strict DTD plays its role (throwing away content!).
The current HTMLTokenizer's behavior is correct. As Ian mentioned, this document
should be dealt with by the XML parser. Therefore, in this case, even though the
document's MIME type is HTML, someone higher up (netlib, I suppose) should
provide the parser with an XML content sink and the text/xml MIME type.
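
The tokenizer behavior described above can be illustrated with a toy sketch. This is not the HTMLTokenizer code: the regular expression and the html_style_tokens name are hypothetical, and the sketch only recognizes simple tags. It shows that a tokenizer which ignores the trailing "/" emits a start tag for <strong /> and never sees a matching end tag.

```python
import re

# Hypothetical, simplified tag pattern: optional leading "/", a tag name,
# then everything up to ">" (attributes and any trailing "/").
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)([^>]*)>")


def html_style_tokens(markup):
    """Tokenize tags HTML-style: a trailing "/" inside the tag is ignored."""
    tokens = []
    for match in TAG.finditer(markup):
        closing, name, _rest = match.groups()
        tokens.append(("end" if closing else "start", name.lower()))
    return tokens


if __name__ == "__main__":
    print(html_style_tokens("<p><strong />text</p>"))
```

The token list contains a start token for strong but no end token, so the element stays open and the DTD has to decide what to do with the content that follows.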

Reassigning bug to gagan.
Assignee: harishd → gagan
David: Oops, I missed that part of the DTD. Right, it is allowed. This does 
not, however, change the validity of this bug. ;-)

Harish and I just discussed this. Unfortunately, the parser/content sink 
apparently cannot change half way, so this probably has to go down to Necko.
David, is it possible to adapt your DOCTYPE parsing for Necko purposes, thus
moving the entire strict/transitional/quirks issue up a level?

If so, is this the right way to do things?

 text/xml      -> always XML
 text/html     -> check DOCTYPE. [1]
                  If XHTML DOCTYPE -> XML
                  If HTML 4.01 or 4.0 Strict DOCTYPE -> HTML Strict
                  If other HTML 4.01 DOCTYPE -> HTML Transitional [2]
                  If other RECOGNISED DOCTYPE -> HTML Quirks (CNavDTD)
                  If FPI contains 'Transitional' or 'Frameset' -> HTML Transitional
                  ELSE -> HTML Strict
 text/xul      -> XML
 anything else -> Neither XML nor HTML. (image, text/plain, RTF...)

Notes:
  [1]: do NOT use the presence of a <?xml processing instruction for anything.
  [2]: Frameset and Transitional DTDs are internally the same, right?

Gagan?
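
The proposed dispatch can be sketched in Python as follows. This is an illustration under stated assumptions, not Necko or parser code: the choose_parser name, the prefix tests used to recognize XHTML and HTML 4.01 public identifiers, and the short RECOGNISED_LEGACY set are all placeholders. The proposal does not say what happens when there is no DOCTYPE, so the sketch requires a public identifier string.

```python
XML = "XML"
STRICT = "HTML Strict"
TRANSITIONAL = "HTML Transitional"
QUIRKS = "HTML Quirks (CNavDTD)"
OTHER = "neither XML nor HTML"

# Public identifiers (FPIs) of the HTML 4.0 and 4.01 Strict DTDs.
HTML4_STRICT = {"-//W3C//DTD HTML 4.01//EN", "-//W3C//DTD HTML 4.0//EN"}

# Placeholder for the "other RECOGNISED DOCTYPE" list; the real list is not
# given in this bug, so these two entries are only examples.
RECOGNISED_LEGACY = {"-//W3C//DTD HTML 3.2 Final//EN", "-//IETF//DTD HTML 2.0//EN"}


def choose_parser(content_type, fpi):
    """Apply the proposed rules in order; fpi is the DOCTYPE public identifier."""
    if content_type in ("text/xml", "text/xul"):
        return XML
    if content_type != "text/html":
        return OTHER
    if fpi.startswith("-//W3C//DTD XHTML "):  # assumed test for "XHTML DOCTYPE"
        return XML
    if fpi in HTML4_STRICT:
        return STRICT
    if fpi.startswith("-//W3C//DTD HTML 4.01 "):  # other HTML 4.01 DOCTYPEs
        return TRANSITIONAL
    if fpi in RECOGNISED_LEGACY:
        return QUIRKS
    if "Transitional" in fpi or "Frameset" in fpi:
        return TRANSITIONAL
    return STRICT


if __name__ == "__main__":
    print(choose_parser("text/html", "-//W3C//DTD XHTML 1.0 Strict//EN"))
```

The rule order matters: the Strict identifiers are matched exactly before the "other HTML 4.01" prefix test, and the generic 'Transitional'/'Frameset' substring test only applies to identifiers that no earlier rule claimed.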
This bug is actually the cause of most of the XHTML bugs that are filed (and
promptly marked INVALID in most cases, so they don't show up as dups anywhere).
No longer blocks: html4.01
Keywords: mostfreq
*** Bug 48617 has been marked as a duplicate of this bug. ***
I believe that stuff happens with the URILoader ->mscott
Assignee: gagan → mscott
Does XML loading go through nsParser?
*** Bug 38088 has been marked as a duplicate of this bug. ***
Wow, I am so confused by this bug... Are we saying that we expect necko to parse
the actual content and attempt to guess whether it is really text/html or
text/xhtml?

This sounds like a real layout/parser thing to me.

In either case, necko's responsibility is to determine the content type of the
incoming data from the server for http. They look at the content type header and
they look at the file extension, and there may be some sniffing code in there too.
If that's the code that needs to be changed to sniff for xhtml, then this bug
probably belongs back in gagan's group.

Let me know if I'm understanding what you need done correctly.
What is the normal path for an XML load?
*** Bug 47958 has been marked as a duplicate of this bug. ***
A forward compatibility issue:
The current XHTML working drafts indicate that there are going to be many different
doctype declarations for XHTML documents even though the tags are in the same
XHTML namespace. It seems to me that the one thing all these doctype declarations
have in common is the string "XHTML" in the third field of the FPI.
http://www.w3.org/TR/xhtml-building/conformance.html#s_conform_naming_rules
With one HTML parser mode:

text/xml,
application/xml --> XML

text/html --> Look for doctype:

              No doctype --> CNavDTD + quirk layout
              Doctype with incorrect syntax --> CNavDTD + quirk layout
              "XHTML" is a substring of the third field of the FPI --> XML
              HTML 4.x Strict --> CNavDTD + std layout
              HTML 4.x Transitional with the URI --> CNavDTD + std layout
              HTML 4.x Transitional without the URI --> CNavDTD + quirk layout
              Other known quirky doctype --> CNavDTD + quirk layout
              Other doctype --> CNavDTD + std layout
              
(Std layout should probably be used with HTML 4.01 Frameset.)
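
The "third field of the FPI" test can be sketched like this. It is a rough illustration, not the DOCTYPE sniffer: the regular expression, the helper names, and the route function are hypothetical. The pattern only accepts a double-quoted PUBLIC identifier, so anything else (including a SYSTEM-only doctype) is lumped in with "no doctype / incorrect syntax", and the remaining HTML rules are collapsed into a single result.

```python
import re

# Hypothetical pattern: <!DOCTYPE name PUBLIC "public identifier" ...
DOCTYPE_PUBLIC = re.compile(r'<!DOCTYPE\s+\w+\s+PUBLIC\s+"([^"]*)"', re.IGNORECASE)


def public_identifier(head):
    """Return the FPI found in the start of a document, or None."""
    match = DOCTYPE_PUBLIC.search(head)
    return match.group(1) if match else None


def third_field_mentions_xhtml(fpi):
    """FPI fields are separated by "//"; the third one names the DTD."""
    fields = fpi.split("//")
    return len(fields) >= 3 and "XHTML" in fields[2]


def route(head):
    fpi = public_identifier(head)
    if fpi is None:
        return "CNavDTD + quirk layout"  # no doctype, or syntax not recognized
    if third_field_mentions_xhtml(fpi):
        return "XML"
    return "CNavDTD (layout chosen by the remaining rules)"


if __name__ == "__main__":
    doc = '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "...">'
    print(route(doc))
```

Matching on the third field rather than on a fixed list of identifiers is what makes the test forward compatible: a future identifier such as one for a new XHTML module set would still be routed to the XML parser.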

Why shouldn't the <?xml?> declaration be used in detection? To prevent sending 
arbitrary XML as text/html?

Strictly conforming XHTML *must* have the xmlns attribute in the root element so I 
suppose it would be OK to try to detect it.
http://www.w3.org/TR/xhtml1/#docconf
*** Bug 46963 has been marked as a duplicate of this bug. ***
Marking [nsbeta3+] like the two other bugs that came out of Monday's meeting,
as per Eric Krock on those two bugs.

Eric -- hope that's ok...
Blocks: 50071
Whiteboard: [nsbeta3+]
This kind of sniffing happens down in necko, as they are the ones that determine
the incoming content type, not the uriloader. Back to gagan.

Looks like http's use of the content type field isn't enough for this bug. They
want to sniff the stream and override the content type header if the doctype
contains the string "xhtml".
Assignee: mscott → gagan
Vidur, is bug 50071 that I filed on you for using the XML codepath in this 
situation a DUP of this bug? If so, please close bug 50071 as a DUP. (I can't 
seem to find the bug I filed on Harish--maybe it's closed already?)

Adding Harish & Vidur to cc: list.
OK, so I understand the part mscott sez, but I need to know what the detection
criteria are so that we can put them in. If you want to see how we do this currently
(pretty plain vanilla checks) see
http://lxr.mozilla.org/seamonkey/source/netwerk/streamconv/converters/nsUnknownDecoder.cpp#229
Yesterday I posted to n.p.m.xml to get opinions about suitable detection criteria. No one 
has replied to the post yet.
Ian and I just sent a message to the HTML WG about detection criteria:
http://lists.w3.org/Archives/Member/w3c-html-wg/2000JulSep/0410.html (for those
with access).

I can't read n.p.m.xml since news.mozilla.org is down.
The HTML WG has changed its mind:
http://lists.w3.org/Archives/Member/w3c-html-wg/2000JulSep/0522.html
so I guess this bug is invalid now...
That archive is for members only. Are you allowed to summarize the situation to 
outsiders?

Forcing text/xml would disallow graceful degradation in legacy browsers, which would be a 
huge disincentive for authors.
This shouldn't be a gagan bug. Reassigning to RickG since it's tied to the 
removal of Strict DTD.

I'm guessing we'll get a lot of grumbling from developers if we send XHTML 
delivered as text/html through the HTML codepath. It's definitely the easiest 
for us to implement, though. 

Rick, I guess this makes bug 50071 invalid and the only thing to be done is to 
make sure that anything delivered as text/html goes through the transitional 
codepath.
Assignee: gagan → rickg
Ok -- Having discussed this with Vidur, there are 2 things to do:
1) disable strictDTD
2) cause XHTML delivered as text/html to use the transitional DTD (CNavDTD).
Severity: major → critical
Status: NEW → ASSIGNED
Priority: P3 → P2
Target Milestone: --- → M18
3) Evangelize the use of the appropriate mime-type on XHTML
I don't know the motivations behind the decision, but to me it doesn't make sense to not 
parse XHTML with an XML parser if one is available. What's the point in using XHTML at 
all if it gets handled as HTML 4 in new browsers, too? Is the intent that every site 
implements client detection and sends the content with a different content type for 
browsers that are known to support XHTML as text/xml? Is there any chance the WG 
could be persuaded to reconsider its decision?
IIRC there's a method available for the browser to indicate its capabilities, so
it needn't be client sniffing and assumptions made based on that. But yes, that
seems to be the case: find out what a browser is capable of, then send content
tailored to that client. Alternatively, just send the latest of the latest, and
pray. Though in most cases it'll be a matter of knowing your public/market and
adjusting to them.
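
The capability mechanism alluded to above is presumably the HTTP Accept request header. A minimal server-side sketch, assuming that is what was meant: parse_accept and pick_content_type are hypothetical helper names, text/xml is used as the XML type only because that is the type discussed in this bug, and wildcard entries such as */* are deliberately not counted as XML support.

```python
def parse_accept(header):
    """Parse an Accept header value into {media_type: q-value}."""
    prefs = {}
    for item in header.split(","):
        parts = [part.strip() for part in item.split(";")]
        if not parts[0]:
            continue
        q = 1.0  # q defaults to 1 when it is not given
        for param in parts[1:]:
            if param.startswith("q="):
                try:
                    q = float(param[2:])
                except ValueError:
                    q = 0.0  # treat an unparsable q-value as "not acceptable"
        prefs[parts[0].lower()] = q
    return prefs


def pick_content_type(accept_header, xml_type="text/xml"):
    """Serve the XML type only if the client lists it explicitly and
    ranks it at least as high as text/html; otherwise fall back."""
    prefs = parse_accept(accept_header)
    xml_q = prefs.get(xml_type, 0.0)
    if xml_q > 0.0 and xml_q >= prefs.get("text/html", 0.0):
        return xml_type
    return "text/html"


if __name__ == "__main__":
    print(pick_content_type("text/xml, text/html;q=0.5"))
```

Falling back to text/html whenever the XML type is not listed explicitly keeps legacy browsers on the HTML codepath, which is the graceful degradation concern raised earlier in this bug.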

Btw, this isn't as bad as it sounds. Once it becomes commonly known that certain
doctypes will trigger standards layout on Mozilla, people will use that. Of
course, they'll also create lots and lots of broken XHTML (for the same
reason(s) they're creating broken HTML 4 strict currently) but that's Somebody
Else's Problem :-/
PDT agrees P2 after talking to rickg
Whiteboard: [nsbeta3+] → [nsbeta3+][PDTP2]
*** Bug 53181 has been marked as a duplicate of this bug. ***
XHTML documents are now routed through the navdtd.
Status: ASSIGNED → RESOLVED
Closed: 20 years ago
Resolution: --- → FIXED
verified
Status: RESOLVED → VERIFIED
"text/xhtml" is not a valid MIME type. Right now it seems the official MIME type
for XHTML will be "application/xhtml+xml". See
http://www.ietf.org/internet-drafts/draft-baker-xhtml-media-reg-00.txt.
Why is this marked as fixed?  Based on comments here and bug 199165 it sounds more like wontfix.