Closed
Bug 48351
Opened 24 years ago
Closed 24 years ago
XHTML served as text/html uses HTML parser.
Categories
(Core :: DOM: HTML Parser, defect, P2)
Tracking
()
VERIFIED
FIXED
M18
People
(Reporter: ian, Assigned: rickg)
References
()
Details
(Keywords: dataloss, testcase, xhtml, Whiteboard: [nsbeta3+][PDTP2])
Attachments
(5 files)
This is a hot potato. Quick, pass it on, pass it on, pass it on. ;-) [obscure Douglas Adams reference there...]

We currently pass anything given as text/html to the HTML parser, which then decides which DTD to use -- Strict for Strict HTML4 and XHTML, and quirky otherwise (and Transitional in certain cases too).

However, this is BAD because not all XHTML Strict is valid HTML4 Strict! Example:

   <p> <a name="a" /> text </p>
   <p> Hello. </p>

Using the HTML Strict DTD, we think that is all one paragraph because the </p> is not well formed!

Now, we have two options as I see it:

1. Tell the WG to stop being silly and treat all text/html as HTML. This entails NOT using the Strict DTD for XHTML DOCTYPEs.
2. When the HTML parser detects an XHTML doctype, switch to the XML parser and start over.

Note that we should not do anything based on the presence of the <?xml PI, since that (a) is optional for XHTML, (b) does not indicate that a document is XHTML, and (c) is valid (albeit officially meaningless) in an HTML document.
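To make the difference concrete, here is a minimal Python sketch that reads the example both ways. `xml_view` uses the standard-library XML parser. `toy_html_view` is a hypothetical toy model, not Mozilla's tokenizer or Strict DTD: it assumes only that the trailing "/" is ignored and that an end tag which does not match the innermost open element is dropped.

```python
import re
import xml.etree.ElementTree as ET

SNIPPET = '<p> <a name="a" /> text </p> <p> Hello. </p>'

def xml_view(fragment):
    """Parse as XML: '<a ... />' is an empty element, so each <p> closes normally."""
    root = ET.fromstring("<body>" + fragment + "</body>")
    return [child.tag for child in root]

def toy_html_view(fragment):
    """Toy model (assumption): '<a ... />' opens <a>, and an end tag that does
    not match the innermost open element is ignored."""
    stack = []       # currently open elements
    top_level = []   # elements opened while nothing else was open
    for closing, name in re.findall(r"<(/?)([a-zA-Z]+)[^>]*>", fragment):
        if closing:
            if stack and stack[-1] == name:
                stack.pop()
        else:
            if not stack:
                top_level.append(name)
            stack.append(name)
    return top_level, stack

print(xml_view(SNIPPET))       # two sibling paragraphs
print(toy_html_view(SNIPPET))  # one top-level paragraph, <p> and <a> left open
```

Under those assumptions the XML reading yields two sibling paragraphs, while the toy HTML reading never closes the first one.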
Reporter
Comment 1•24 years ago
Nominating for nsbeta3. This *cripples* XHTML adoption.
A suggestion made in http://lists.w3.org/Archives/Public/www-html/2000Jul/0085.html was that you look for "xmlns" as the beginning of an attribute on the HTML element. If xmlns is present, then parse as XML.
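A rough sketch of that check in Python, for illustration only. The function name and the regular expressions are hypothetical, and the matching is deliberately naive (it does not handle a ">" inside an attribute value).

```python
import re

# Hypothetical helper patterns: the start tag of the root <html> element,
# and an attribute whose name begins with "xmlns" (xmlns or xmlns:prefix).
_HTML_START = re.compile(r"<html\b([^>]*)>", re.IGNORECASE)
_XMLNS_ATTR = re.compile(r"(?:^|\s)xmlns(?::[\w.-]+)?\s*=")

def root_declares_xmlns(prefix: str) -> bool:
    """Return True if the first <html> start tag in `prefix` carries an xmlns attribute."""
    match = _HTML_START.search(prefix)
    return bool(match and _XMLNS_ATTR.search(match.group(1)))

print(root_declares_xmlns('<html xmlns="http://www.w3.org/1999/xhtml">'))
print(root_declares_xmlns('<html lang="en">'))
```

Note that this has to read as far as the root element's start tag, which is later in the document than the DOCTYPE.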
Reporter
Comment 8•24 years ago
Completely valid XHTML doesn't (can't!) have a namespace attribute. It is only included in the attached test cases to get around bug 48445. Also, looking for 'xmlns' is likely to mean a lot more of the document has to be parsed before being redirected to the XML parser. According to the XHTML spec, valid XHTML documents *must* have a DOCTYPE. That solves our problems IMHO since we can just use the DOCTYPE sniffer to do the work. The spec only says XHTML can be sent as text/html, not random XML with interspersed XHTML content.
Ian - I now realize that you're right that valid XHTML is not required to have the xmlns attribute - but it is allowed. See http://www.w3.org/TR/REC-xml#sec-attr-defaults
Comment 11•24 years ago
In reference to attachment #2: Actually it's the HTMLTokenizer that treats <strong /> as <strong> rather than <strong></strong>, and then the Strict DTD plays its role (throwing away content!). The current HTMLTokenizer's behavior is correct. As Ian mentioned, this document should be handled by the XML parser. Therefore, in this case, even though the document's MIME type is HTML, someone higher up (netlib, I suppose) should provide the parser with an XML content sink and the text/xml MIME type. Reassigning bug to gagan.
Assignee: harishd → gagan
Reporter
Comment 12•24 years ago
David: Oops, I missed that part of the DTD. Right, it is allowed. This does not, however, change the validity of this bug. ;-)

Harish and I just discussed this. Unfortunately, the parser/content sink apparently cannot change half way, so this probably has to go down to Necko. David, is it possible to adapt your DOCTYPE parsing for Necko purposes, thus moving the entire strict/transitional/quirks issue up a level? If so, is this the right way we should be doing things?:

   text/xml -> always XML
   text/html -> check DOCTYPE. [1]
      If XHTML DOCTYPE -> XML
      If HTML 4.01 or 4.0 Strict DOCTYPE -> HTML Strict
      If other HTML 4.01 DOCTYPE -> HTML Transitional [2]
      If other RECOGNISED DOCTYPE -> HTML Quirks (CNavDTD)
      If FPI contains 'Transitional' or 'Frameset' -> HTML Transitional
      ELSE -> HTML Strict
   text/xul -> XML
   anything else -> Not XML nor HTML. (image, text/plain, RTF...)

Notes:
[1]: do NOT use the presence of a <?xml processing instruction for anything.
[2]: Frameset and Transitional DTDs are internally the same, right?

Gagan?
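The table above could be sketched in Python roughly as follows. This is only an illustration of the proposal, not code from Necko or the parser. Assumptions: an "XHTML DOCTYPE" is detected by the substring "XHTML" in the FPI, the set of recognised quirky DOCTYPEs is a placeholder, and the caller always supplies an FPI string (the table does not say what happens when there is no DOCTYPE at all).

```python
# FPIs of the HTML 4.01 and 4.0 Strict DOCTYPEs.
STRICT_FPIS = {"-//W3C//DTD HTML 4.01//EN", "-//W3C//DTD HTML 4.0//EN"}

# Placeholder (assumption): the comment does not enumerate the recognised DOCTYPEs.
RECOGNISED_QUIRKY_FPIS = {
    "-//W3C//DTD HTML 3.2 Final//EN",
    "-//IETF//DTD HTML 2.0//EN",
}

def choose_parser(mime: str, fpi: str) -> str:
    """Map a MIME type and DOCTYPE public identifier to a parser mode."""
    if mime in ("text/xml", "text/xul"):
        return "XML"
    if mime != "text/html":
        return "neither"                      # image, text/plain, RTF...
    if "XHTML" in fpi:                        # assumption: how XHTML is recognised
        return "XML"
    if fpi in STRICT_FPIS:
        return "HTML Strict"
    if fpi.startswith("-//W3C//DTD HTML 4.01 "):
        return "HTML Transitional"            # other HTML 4.01 DOCTYPEs
    if fpi in RECOGNISED_QUIRKY_FPIS:
        return "HTML Quirks"
    if "Transitional" in fpi or "Frameset" in fpi:
        return "HTML Transitional"
    return "HTML Strict"

print(choose_parser("text/html", "-//W3C//DTD XHTML 1.0 Strict//EN"))
```

The order of the checks matters: the XHTML test has to come before the 'Transitional'/'Frameset' test, since XHTML 1.0 Transitional would otherwise land in the HTML Transitional branch.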
Keywords: xhtml
Reporter
Comment 13•24 years ago
This bug is actually the cause of most of the XHTML bugs that are filed (and promptly marked INVALID in most cases, so they don't show up as dups anywhere).
Reporter
Comment 14•24 years ago
*** Bug 48617 has been marked as a duplicate of this bug. ***
Comment 15•24 years ago
I believe that stuff happens with the URILoader ->mscott
Assignee: gagan → mscott
Does XML loading go through nsParser?
*** Bug 38088 has been marked as a duplicate of this bug. ***
Comment 18•24 years ago
Wow, I am so confused by this bug... Are we saying that we expect Necko to parse the actual content and attempt to guess whether it is really text/html or text/xhtml??? This sounds like a real layout/parser thing to me. In either case, Necko's responsibility is to determine the content type of the incoming data from the server for HTTP. They look at the content type header and they look at the file extension, and there may be some sniffing code in there too. If that's code that needs to be changed to sniff for XHTML, then this bug probably belongs back in gagan's group. Let me know if I'm understanding what you need done correctly.
What is the normal path for an XML load?
Comment 20•24 years ago
*** Bug 47958 has been marked as a duplicate of this bug. ***
Comment 21•24 years ago
A forward compatibility issue: The current XHTML working drafts indicate that there is going to be a lot of different doctype declarations for XHTML documents even though the tags are in the same XHTML namespace. It seems to me that the one thing all these doctype declarations have in common is the string "XHTML" in the third field of the FPI. http://www.w3.org/TR/xhtml-building/conformance.html#s_conform_naming_rules
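A small Python sketch of that test, for illustration. The function names and the DOCTYPE regular expression are hypothetical; the only thing taken from the comment is the rule itself: split the FPI on "//" and look for "XHTML" in the third field.

```python
import re
from typing import Optional

# Hypothetical pattern: <!DOCTYPE name PUBLIC "fpi" ...> with either quote style.
_DOCTYPE_PUBLIC = re.compile(
    r'<!DOCTYPE\s+\S+\s+PUBLIC\s+(["\'])(.*?)\1', re.IGNORECASE | re.DOTALL
)

def public_identifier(prefix: str) -> Optional[str]:
    """Extract the FPI from the first DOCTYPE declaration in `prefix`, if any."""
    match = _DOCTYPE_PUBLIC.search(prefix)
    return match.group(2) if match else None

def third_fpi_field(fpi: str) -> Optional[str]:
    """FPI fields are separated by '//'; return the third one, if present."""
    fields = fpi.split("//")
    return fields[2] if len(fields) > 2 else None

def looks_like_xhtml(prefix: str) -> bool:
    fpi = public_identifier(prefix)
    field = third_fpi_field(fpi) if fpi else None
    return field is not None and "XHTML" in field

doc = ('<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
       '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">')
print(third_fpi_field(public_identifier(doc)))  # DTD XHTML 1.0 Strict
print(looks_like_xhtml(doc))                    # True
```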
Comment 22•24 years ago
With one HTML parser mode:

   text/xml, application/xml --> XML
   text/html --> Look for doctype:
      No doctype --> CNavDTD + quirk layout
      Doctype with incorrect syntax --> CNavDTD + quirk layout
      "XHTML" is a substring of the third field of the FPI --> XML
      HTML 4.x Strict --> CNavDTD + std layout
      HTML 4.x Transitional with the URI --> CNavDTD + std layout
      HTML 4.x Transitional without the URI --> CNavDTD + quirk layout
      Other known quirky doctype --> CNavDTD + quirk layout
      Other doctype --> CNavDTD + std layout

(Std layout should probably be used with HTML 4.01 Frameset.)

Why shouldn't the <?xml?> declaration be used in detection? To prevent sending arbitrary XML as text/html?

Strictly conforming XHTML *must* have the xmlns attribute in the root element so I suppose it would be OK to try to detect it. http://www.w3.org/TR/xhtml1/#docconf
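For comparison, the same table as a hedged Python sketch that returns a (parser, layout) pair. The `Doctype` class, the `classify` function, and the `KNOWN_QUIRKY` set are hypothetical names introduced here, and the quirky set is a placeholder since the comment does not list its members. HTML 4.01 Frameset is given std layout, following the "probably" in the parenthetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Doctype:
    fpi: str                          # public identifier
    system_uri: Optional[str] = None  # the URI after the FPI, if given
    well_formed: bool = True          # False for a doctype with incorrect syntax

# Placeholder (assumption): the comment does not say which doctypes are known quirky.
KNOWN_QUIRKY = {"-//W3C//DTD HTML 3.2 Final//EN"}

def classify(mime: str, doctype: Optional[Doctype]) -> Tuple[str, Optional[str]]:
    """Return (parser, layout mode); layout is None when the XML parser is used."""
    if mime in ("text/xml", "application/xml"):
        return ("XML", None)
    if mime != "text/html":
        raise ValueError("MIME type is outside the table")
    if doctype is None or not doctype.well_formed:
        return ("CNavDTD", "quirk")
    fields = doctype.fpi.split("//")
    third = fields[2] if len(fields) > 2 else ""
    if "XHTML" in third:
        return ("XML", None)
    if third.startswith("DTD HTML 4."):
        if "Transitional" in third and not doctype.system_uri:
            return ("CNavDTD", "quirk")
        return ("CNavDTD", "std")     # Strict, Transitional with URI, Frameset
    if doctype.fpi in KNOWN_QUIRKY:
        return ("CNavDTD", "quirk")
    return ("CNavDTD", "std")

print(classify("text/html", Doctype("-//W3C//DTD HTML 4.01 Transitional//EN")))
```

Unlike the earlier proposal, this one keeps a single HTML parser and varies only the layout mode.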
Reporter
Comment 23•24 years ago
*** Bug 46963 has been marked as a duplicate of this bug. ***
Reporter
Comment 24•24 years ago
Marking [nsbeta3+] like the two other bugs that came out of Monday's meeting, as per Eric Krock on those two bugs. Eric -- hope that's ok...
Blocks: 50071
Whiteboard: [nsbeta3+]
Comment 25•24 years ago
This kind of sniffing happens down in Necko, as they are the ones that determine the incoming content type, not the uriloader. Back to gagan. Looks like HTTP's use of the content type field isn't enough for this bug. They want to sniff the stream and override the content type header if the doctype contains the string "xhtml".
Assignee: mscott → gagan
Comment 26•24 years ago
Vidur, is bug 50071, which I filed on you for using the XML codepath in this situation, a DUP of this bug? If so, please close bug 50071 as a DUP. (I can't seem to find the bug I filed on Harish -- maybe it's closed already?) Adding Harish & Vidur to cc: list.
Comment 27•24 years ago
OK, so I understand the part mscott sez, but I need to know what the detection criteria are so that we can put them in. If you want to see how we do this currently (pretty plain vanilla checks) see http://lxr.mozilla.org/seamonkey/source/netwerk/streamconv/converters/nsUnknownDecoder.cpp#229
Comment 28•24 years ago
Yesterday I posted to n.p.m.xml to get opinions about suitable detection criteria. No one has replied to the post, yet.
Ian and I just sent a message to the HTML WG about detection criteria: http://lists.w3.org/Archives/Member/w3c-html-wg/2000JulSep/0410.html (for those with access). I can't read n.p.m.xml since news.mozilla.org is down.
The HTML WG has changed its mind: http://lists.w3.org/Archives/Member/w3c-html-wg/2000JulSep/0522.html so I guess this bug is invalid now...
Comment 31•24 years ago
That archive is for members only. Are you allowed to summarize the situation to outsiders? Forcing text/xml would disallow graceful degrading in legacy browsers which would be a huge disincentive for authors.
The message has now been reposted publicly: http://lists.w3.org/Archives/Public/www-html/2000Sep/0024.html
Comment 33•24 years ago
This shouldn't be a gagan bug. Reassigning to RickG since it's tied to the removal of Strict DTD. I'm guessing we'll get a lot of grumbling from developers by sending XHTML delivered as text/html through the HTML codepath. It's definitely the easiest for us to implement, though. Rick, I guess this makes bug 50071 invalid and the only thing to be done is to make sure that anything delivered as text/html goes through the transitional codepath.
Assignee: gagan → rickg
Assignee
Comment 34•24 years ago
Ok -- Having discussed this with Vidur, there are 2 things to do: 1) disable strictDTD 2) cause XHTML delivered as text/html to use the transitional DTD (CNavDTD).
Severity: major → critical
Status: NEW → ASSIGNED
Priority: P3 → P2
Target Milestone: --- → M18
Comment 35•24 years ago
3) Evangelize the use of the appropriate mime-type on XHTML
Comment 36•24 years ago
I don't know the motivations behind the decision, but to me it doesn't make sense not to parse XHTML with an XML parser if one is available. What's the point in using XHTML at all if it gets handled as HTML 4 in new browsers, too? Is the intent that every site implements client detection and sends the content as text/xml to browsers that are known to support XHTML? Is there any chance the WG could be persuaded to reconsider its decision?
Comment 37•24 years ago
IIRC there's a method available for the browser to indicate its capabilities, so it needn't be client sniffing and assumptions made based on that. But yes, that seems to be the case: find out what a browser is capable of, then send content tailored to that client. Alternatively, just send the latest of the latest, and pray. Though in most cases it'll be a matter of knowing your public/market and adjusting to them. Btw, this isn't as bad as it sounds. Once it becomes commonly known that certain doctypes will trigger standards layout on Mozilla, people will use that. Of course, they'll also create lots and lots of broken XHTML (for the same reason(s) they're creating broken HTML 4 strict currently) but that's Somebody Else's Problem :-/
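The comment does not name the method. Assuming it means the HTTP Accept request header, server-side selection might look like this rough Python sketch. `pick_content_type` is a hypothetical helper; it only checks whether the XML type is listed with a non-zero q-value, ignores wildcards such as */*, and does not rank alternatives.

```python
def pick_content_type(accept_header: str, xml_type: str = "text/xml") -> str:
    """Serve `xml_type` only if the client lists it explicitly with q > 0."""
    for item in accept_header.split(","):
        parts = [p.strip() for p in item.split(";")]
        media_type = parts[0].lower()
        q = 1.0                                 # default quality when no q is given
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0                     # treat a malformed q as "not accepted"
        if media_type == xml_type and q > 0:
            return xml_type
    return "text/html"                          # fallback for legacy browsers

print(pick_content_type("text/html, text/xml;q=0.9"))
print(pick_content_type("text/html, */*;q=0.1"))
```

Falling back to text/html keeps the graceful degradation in legacy browsers that comment 31 was concerned about.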
Comment 38•24 years ago
PDT agrees P2 after talking to rickg
Whiteboard: [nsbeta3+] → [nsbeta3+][PDTP2]
Comment 39•24 years ago
*** Bug 53181 has been marked as a duplicate of this bug. ***
Assignee
Comment 40•24 years ago
XHTML documents are now routed through the navdtd.
Status: ASSIGNED → RESOLVED
Closed: 24 years ago
Resolution: --- → FIXED
Comment 42•24 years ago
"text/xhtml" is not a valid MIME type. Right now it seems the official MIME type for XHTML will be "application/xhtml+xml". See http://www.ietf.org/internet-drafts/draft-baker-xhtml-media-reg-00.txt.
Comment 43•18 years ago
Why is this marked as fixed? Based on comments here and bug 199165 it sounds more like wontfix.