Closed Bug 48351 Opened 20 years ago Closed 20 years ago
XHTML served as text/html uses HTML parser
605 bytes, text/xml
605 bytes, text/html
605 bytes, text/plain
605 bytes, text/xhtml
605 bytes, text/sgml
This is a hot potato. Quick pass it on, pass it on, pass it on. ;-) [obscure Douglas Adams reference there...]
We currently pass anything given as text/html to the HTML parser, which then decides which DTD to use -- Strict for Strict HTML4 and XHTML, and quirky otherwise (and Transitional in certain cases too). However, this is BAD because not all XHTML Strict is valid HTML4 Strict! Example:
    <p> <a name="a" /> text </p>
    <p> Hello. </p>
Using the HTML Strict DTD, we think that is all one paragraph because the </p> is not well formed!
Now, we have two options as I see it:
1. Tell the WG to stop being silly and treat all text/html as HTML. This entails NOT using the Strict DTD for XHTML DOCTYPEs.
2. When the HTML parser detects an XHTML doctype, switch to the XML parser and start over.
Note that we should not do anything based on the presence of the <?xml PI, since that PI (a) is optional for XHTML, (b) does not indicate that a document is XHTML, and (c) is valid (albeit officially meaningless) in an HTML document.
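For comparison, here is a minimal sketch of how an XML parser reads the example above. It uses Python's standard xml.etree.ElementTree, not any Mozilla code, and the wrapping <body> element is an assumption added only because XML requires a single root. Under XML rules the <a name="a" /> is an empty element, so the fragment contains two separate paragraphs.

```python
import xml.etree.ElementTree as ET

# The fragment from the report, wrapped in a <body> element (an assumption
# made only so that the snippet has a single root, as XML requires).
SOURCE = '<body><p> <a name="a" /> text </p> <p> Hello. </p></body>'

root = ET.fromstring(SOURCE)
paragraphs = root.findall("p")      # direct <p> children of the root
anchor = paragraphs[0].find("a")    # the empty <a/> in the first paragraph

print(len(paragraphs))                  # number of <p> elements seen
print(len(anchor), repr(anchor.text))   # child count and text inside <a>
print(repr(anchor.tail))                # text that follows the empty <a/>
```

The text " text " ends up after the anchor (its tail), not inside it, which is the reading the XHTML author intended.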
Nominating for nsbeta3. This *cripples* XHTML adoption.
A suggestion made in http://lists.w3.org/Archives/Public/www-html/2000Jul/0085.html was that you look for "xmlns" as the beginning of an attribute on the HTML element. If xmlns is present, then parse as XML.
Completely valid XHTML doesn't (can't!) have a namespace attribute. It is only included in the attached test cases to get around bug 48445. Also, looking for 'xmlns' is likely to mean a lot more of the document has to be parsed before being redirected to the XML parser. According to the XHTML spec, valid XHTML documents *must* have a DOCTYPE. That solves our problems IMHO since we can just use the DOCTYPE sniffer to do the work. The spec only says XHTML can be sent as text/html, not random XML with interspersed XHTML content.
*** Bug 26022 has been marked as a duplicate of this bug. ***
Ian - I now realize that you're right that valid XHTML is not required to have the xmlns attribute - but it is allowed. See http://www.w3.org/TR/REC-xml#sec-attr-defaults
In reference to attachment #2: Actually it's the HTMLTokenizer that treats <strong /> as <strong> rather than <strong></strong>, and then the Strict DTD plays its role (throwing away content!). The current HTMLTokenizer's behavior is correct. As Ian mentioned, this document should be dealt with by the XML parser. Therefore, in this case, even though the document's mime type is html, someone higher up (netlib, I suppose) should provide the parser with an XML content sink and the text/xml mime type. Reassigning bug to gagan.
Assignee: harishd → gagan
David: Oops, I missed that part of the DTD. Right, it is allowed. This does not, however, change the validity of this bug. ;-)
Harish and I just discussed this. Unfortunately, the parser/content sink apparently cannot change half way, so this probably has to go down to Necko. David, is it possible to adapt your DOCTYPE parsing for Necko purposes, thus moving the entire strict/transitional/quirks issue up a level? If so, is this the right way we should be doing things?:
text/xml -> always XML
text/html -> check DOCTYPE.
    If XHTML DOCTYPE -> XML
    If HTML 4.01 or 4.0 Strict DOCTYPE -> HTML Strict
    If other HTML 4.01 DOCTYPE -> HTML Transitional
    If other RECOGNISED DOCTYPE -> HTML Quirks (CNavDTD)
    If FPI contains 'Transitional' or 'Frameset' -> HTML Transitional
    ELSE -> HTML Strict
text/xul -> XML
anything else -> Not XML nor HTML. (image, text/plain, RTF...)
Notes:
    : do NOT use the presence of a <?xml processing instruction for anything.
    : Frameset and Transitional DTDs are internally the same, right?
Gagan?
This bug is actually the cause of most of the XHTML bugs that are filed (and promptly marked INVALID in most cases, so they don't show up as dups anywhere).
*** Bug 48617 has been marked as a duplicate of this bug. ***
I believe that stuff happens with the URILoader ->mscott
Assignee: gagan → mscott
Does XML loading go through nsParser?
*** Bug 38088 has been marked as a duplicate of this bug. ***
wow I am so confused by this bug.....are we saying that we expect necko to parse the actual content and attempt to guess whether it is really text/html or text/xhtml??? This sounds like a real layout/parser thing to me. In either case, necko's responsibility is to determine the content type of the incoming data from the server for http. They look at the content type header and they look at the file extension and there may be some sniffing code in there too. If that's code that needs to be changed to sniff for xhtml then this bug probably belongs back in gagan's group. Let me know if I'm understanding what you need done correctly.
What is the normal path for an XML load?
*** Bug 47958 has been marked as a duplicate of this bug. ***
A forward compatibility issue: The current XHTML working drafts indicate that there are going to be a lot of different doctype declarations for XHTML documents even though the tags are in the same XHTML namespace. It seems to me that the one thing all these doctype declarations have in common is the string "XHTML" in the third field of the FPI. http://www.w3.org/TR/xhtml-building/conformance.html#s_conform_naming_rules
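A minimal sketch of that check, in Python for illustration only. The function names fpi_third_field and looks_like_xhtml are hypothetical, and the sketch assumes the FPI fields are separated by "//", as in the W3C public identifiers.

```python
def fpi_third_field(fpi: str) -> str:
    """Return the third '//'-separated field of a formal public identifier.

    Hypothetical helper: returns an empty string when the FPI has fewer
    than three fields.
    """
    fields = fpi.split("//")
    return fields[2] if len(fields) >= 3 else ""


def looks_like_xhtml(fpi: str) -> bool:
    """True when the string "XHTML" occurs in the third field of the FPI."""
    return "XHTML" in fpi_third_field(fpi)


print(fpi_third_field("-//W3C//DTD XHTML 1.0 Strict//EN"))
print(looks_like_xhtml("-//W3C//DTD XHTML 1.0 Strict//EN"))
print(looks_like_xhtml("-//W3C//DTD HTML 4.01//EN"))
```

Matching on the third field rather than the whole declaration avoids enumerating every future XHTML doctype by name.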
With one HTML parser mode:
text/xml, application/xml --> XML
text/html --> Look for doctype:
    No doctype --> CNavDTD + quirk layout
    Doctype with incorrect syntax --> CNavDTD + quirk layout
    "XHTML" is a substring of the third field of the FPI --> XML
    HTML 4.x Strict --> CNavDTD + std layout
    HTML 4.x Transitional with the URI --> CNavDTD + std layout
    HTML 4.x Transitional without the URI --> CNavDTD + quirk layout
    Other known quirky doctype --> CNavDTD + quirk layout
    Other doctype --> CNavDTD + std layout
(Std layout should probably be used with HTML 4.01 Frameset.)
Why shouldn't the <?xml?> declaration be used in detection? To prevent sending arbitrary XML as text/html? Strictly conforming XHTML *must* have the xmlns attribute in the root element so I suppose it would be OK to try to detect it. http://www.w3.org/TR/xhtml1/#docconf
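The table above could be sketched roughly as follows. This is an illustration in Python, not the Mozilla implementation. The name choose_handling, the regular expression, and the contents of KNOWN_QUIRKY_FPIS are assumptions. Anything that is not a PUBLIC doctype with a quoted FPI is treated as "incorrect syntax", which is a simplification.

```python
import re
from typing import Optional, Tuple

# Assumption: only <!DOCTYPE name PUBLIC "fpi" ["uri"]> counts as well formed.
DOCTYPE_RE = re.compile(
    r'<!DOCTYPE\s+\w+\s+PUBLIC\s+"([^"]*)"(?:\s+"([^"]*)")?\s*>',
    re.IGNORECASE,
)

# Hypothetical placeholder; the comment does not list the quirky doctypes.
KNOWN_QUIRKY_FPIS = {"-//W3C//DTD HTML 3.2 Final//EN"}


def choose_handling(content_type: str,
                    doctype: Optional[str]) -> Tuple[str, Optional[str]]:
    """Return (parser, layout mode) following the proposed table."""
    if content_type in ("text/xml", "application/xml"):
        return ("XML", None)
    if content_type != "text/html":
        return ("other", None)          # not covered by the table
    if doctype is None:
        return ("CNavDTD", "quirk")     # no doctype
    match = DOCTYPE_RE.fullmatch(doctype.strip())
    if match is None:
        return ("CNavDTD", "quirk")     # doctype with incorrect syntax
    fpi, uri = match.group(1), match.group(2)
    fields = fpi.split("//")
    third = fields[2] if len(fields) >= 3 else ""
    if "XHTML" in third:
        return ("XML", None)
    if third.startswith("DTD HTML 4"):
        if "Transitional" in third:
            # Transitional: std layout only when the URI is present.
            return ("CNavDTD", "std" if uri else "quirk")
        return ("CNavDTD", "std")       # Strict, and Frameset per the note
    if fpi in KNOWN_QUIRKY_FPIS:
        return ("CNavDTD", "quirk")
    return ("CNavDTD", "std")           # other doctype


XHTML_STRICT = ('<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
                '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">')
print(choose_handling("text/html", XHTML_STRICT))
print(choose_handling("text/html", None))
```

Returning the parser and the layout mode as a pair keeps the two decisions separate, matching the "CNavDTD + layout" wording of the table.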
*** Bug 46963 has been marked as a duplicate of this bug. ***
Marking [nsbeta3+] like the two other bugs that came out of Monday's meeting, as per Eric Krock on those two bugs. Eric -- hope that's ok...
This kind of sniffing happens down in necko as they are the ones that determine the incoming content type, not the uriloader. Back to gagan. Looks like http's use of the content type field isn't enough for this bug. They want to sniff the stream and override the content type header if the doc type contains the string "xhtml".
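A rough sketch of that kind of stream sniffing, in Python for illustration only. It is not how nsUnknownDecoder or any other necko code works. The name effective_content_type and the SNIFF_LIMIT value are hypothetical.

```python
import re

SNIFF_LIMIT = 1024  # hypothetical: how many leading bytes to inspect

# Matches a complete doctype declaration anywhere in the inspected prefix.
_DOCTYPE_RE = re.compile(rb"<!DOCTYPE[^>]*>", re.IGNORECASE)


def effective_content_type(header_type: str, body_prefix: bytes) -> str:
    """Override text/html with text/xml when the doctype mentions "xhtml".

    Content types other than text/html are returned unchanged.
    """
    if header_type != "text/html":
        return header_type
    match = _DOCTYPE_RE.search(body_prefix[:SNIFF_LIMIT])
    if match and b"xhtml" in match.group(0).lower():
        return "text/xml"
    return header_type


prefix = (b'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
          b'"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">\n<html>')
print(effective_content_type("text/html", prefix))
```

Bounding the search to a fixed prefix keeps the check cheap, at the cost of missing a doctype that appears after a long leading comment.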
Assignee: mscott → gagan
OK so I understand the part mscott sez, but I need to know what is the detection criteria so that we can put that in? If you want to see how we do this currently (pretty plain vanilla checks) see http://lxr.mozilla.org/seamonkey/source/netwerk/streamconv/converters/nsUnknownDecoder.cpp#229
Yesterday I posted to n.p.m.xml to get opinions about suitable detection criteria. No one has replied to the post, yet.
Ian and I just sent a message to the HTML WG about detection criteria: http://lists.w3.org/Archives/Member/w3c-html-wg/2000JulSep/0410.html (for those with access). I can't read n.p.m.xml since news.mozilla.org is down.
The HTML WG has changed its mind: http://lists.w3.org/Archives/Member/w3c-html-wg/2000JulSep/0522.html so I guess this bug is invalid now...
That archive is for members only. Are you allowed to summarize the situation to outsiders? Forcing text/xml would disallow graceful degrading in legacy browsers which would be a huge disincentive for authors.
The message has now been reposted publicly: http://lists.w3.org/Archives/Public/www-html/2000Sep/0024.html
This shouldn't be a gagan bug. Reassigning to RickG since it's tied to the removal of Strict DTD. I'm guessing we'll get a lot of grumbling from developers by sending XHTML delivered as text/html through the HTML codepath. It's definitely the easiest for us to implement, though. Rick, I guess this makes bug 50071 invalid and the only thing to be done is to make sure that anything delivered as text/html goes through the transitional codepath.
Assignee: gagan → rickg
Ok -- Having discussed this with Vidur, there are 2 things to do: 1) disable strictDTD 2) cause XHTML delivered as text/html to use the transitional DTD (CNavDTD).
Severity: major → critical
Status: NEW → ASSIGNED
Priority: P3 → P2
Target Milestone: --- → M18
3) Evangelize the use of the appropriate mime-type on XHTML
I don't know the motivations behind the decision, but to me it doesn't make sense to not parse XHTML with an XML parser if one is available. What's the point in using XHTML at all if it gets handled as HTML 4 in new browsers, too? Is the intent that every site implements client detection and sends the content with a different content type for browsers that are known to support XHTML as text/xml? Is there any chance the WG could be persuaded to reconsider its decision?
IIRC there's a method available for the browser to indicate its capabilities, so it needn't be client sniffing and assumptions made based on that. But yes, that seems to be the case: find out what a browser is capable of, then send content tailored to that client. Alternatively, just send the latest of the latest, and pray. Though in most cases it'll be a matter of knowing your public/market and adjusting to them. Btw, this isn't as bad as it sounds. Once it becomes commonly known that certain doctypes will trigger standards layout on Mozilla, people will use that. Of course, they'll also create lots and lots of broken XHTML (for the same reason(s) they're creating broken HTML 4 strict currently) but that's Somebody Else's Problem :-/
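One such method is the HTTP Accept request header. The following is a minimal server-side sketch in Python, assuming the server chooses between text/html and application/xhtml+xml. The function name pick_content_type is hypothetical, and the parsing is deliberately simplified (no wildcard subtype matching such as text/*).

```python
def pick_content_type(accept_header: str) -> str:
    """Choose a content type for an XHTML document from an Accept header.

    Falls back to text/html unless the client explicitly lists
    application/xhtml+xml with a non-zero q-value that is at least as
    high as its preference for text/html.
    """
    prefs = {}
    for item in accept_header.split(","):
        parts = [p.strip() for p in item.split(";")]
        media = parts[0].lower()
        if not media:
            continue
        q = 1.0  # default quality when no q parameter is given
        for param in parts[1:]:
            if param.startswith("q="):
                try:
                    q = float(param[2:])
                except ValueError:
                    q = 0.0  # treat an unparsable q-value as "not accepted"
        prefs[media] = q

    xhtml_q = prefs.get("application/xhtml+xml", 0.0)
    html_q = prefs.get("text/html", prefs.get("*/*", 0.0))
    if xhtml_q > 0 and xhtml_q >= html_q:
        return "application/xhtml+xml"
    return "text/html"


print(pick_content_type("application/xhtml+xml,text/html;q=0.9"))
print(pick_content_type("text/html,*/*;q=0.8"))
```

A client that sends only */* gets text/html, so legacy browsers still degrade gracefully.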
PDT agrees P2 after talking to rickg
Whiteboard: [nsbeta3+] → [nsbeta3+][PDTP2]
*** Bug 53181 has been marked as a duplicate of this bug. ***
XHTML documents are now routed through the navdtd.
Status: ASSIGNED → RESOLVED
Closed: 20 years ago
Resolution: --- → FIXED
Status: RESOLVED → VERIFIED
"text/xhtml" is not a valid MIME type. Right now it seems the official MIME type for XHTML will be "application/xhtml+xml". See http://www.ietf.org/internet-drafts/draft-baker-xhtml-media-reg-00.txt.
Why is this marked as fixed? Based on comments here and bug 199165 it sounds more like wontfix.