Closed Bug 48351 Opened 24 years ago Closed 24 years ago

XHTML served as text/html uses HTML parser.

Tracking


Status:

VERIFIED FIXED

Milestone:

M18

People

(Reporter: ian, Assigned: rickg)

References

(URL: http://www.avmaria.com/activitylog.html)

Details

(Keywords: dataloss, testcase, xhtml, Whiteboard: [nsbeta3+][PDTP2])

Attachments

(5 files)

Test case, delivered as text/xml 24 years ago Hixie (not reading bugmail) 605 bytes, text/xml
Test case, delivered as text/html 24 years ago Hixie (not reading bugmail) 605 bytes, text/html
Test case, delivered as text/plain 24 years ago Hixie (not reading bugmail) 605 bytes, text/plain
Test case, delivered as text/xhtml (not a valid mime type?) 24 years ago Hixie (not reading bugmail) 605 bytes, text/xhtml
Test case, delivered as text/sgml 24 years ago Hixie (not reading bugmail) 605 bytes, text/sgml

Hixie (not reading bugmail)

Reporter

Description

•

24 years ago

This is a hot potato. Quick pass it on, pass it on, pass it on. ;-)
[obscure Douglas Adams reference there...]

We currently pass anything given as text/html to the HTML parser, which then
decides which DTD to use -- Strict for Strict HTML4 and XHTML, and quirky
otherwise (and Transitional in certain cases too).

However, this is BAD because not all XHTML Strict is valid HTML4 Strict!

Example:

   <p> <a name="a" /> text </p> <p> Hello. </p>

Using the HTML Strict DTD, we think that is all one paragraph because the </p>
is not well formed!

Now, we have two options as I see it:

   1. Tell the WG to stop being silly and treat all text/html as HTML.
      This entails NOT using the Strict DTD for XHTML DOCTYPEs.
   2. When the HTML parser detects an XHTML doctype, switch to the XML
      parser and start over.

Note that we should not do anything based on the presence of the <?xml PI, since
that is (a) optional for XHTML, (b) does not indicate that a document is XHTML,
and (c) is valid (albeit officially meaningless) in an HTML document.
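For comparison, here is a minimal sketch of how an XML parser reads the example above. It uses Python's standard `xml.etree.ElementTree`, not Mozilla's parser, and the wrapping `<div>` root is an assumption added only so the fragment forms a single well-formed document.

```python
import xml.etree.ElementTree as ET

# The example from the description, wrapped in a hypothetical <div> root
# so that it is one well-formed XML document.
SOURCE = '<div><p> <a name="a" /> text </p> <p> Hello. </p></div>'

root = ET.fromstring(SOURCE)
paragraphs = root.findall("p")

# An XML parser closes <a name="a" /> immediately, so the anchor has no
# children and no text of its own; " text " follows it inside the first <p>.
anchor = paragraphs[0].find("a")

print("paragraphs:", len(paragraphs))
print("anchor children:", len(anchor))
print("anchor text:", anchor.text)
print("text after anchor:", repr(anchor.tail))
```

Under XML rules the two paragraphs stay separate, which is the behaviour option 2 would give for XHTML documents.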

Hixie (not reading bugmail)

Reporter

Comment 1

•

24 years ago

Nominating for nsbeta3. This *cripples* XHTML adoption.

Blocks: html4.01

Keywords: 4xp, correctness, dataloss, nsbeta3, testcase

Hixie (not reading bugmail)

Reporter

Comment 2

•

24 years ago

Attached file Test case, delivered as text/xml

Hixie (not reading bugmail)

Reporter

Comment 3

•

24 years ago

Attached file Test case, delivered as text/html

Hixie (not reading bugmail)

Reporter

Comment 4

•

24 years ago

Attached file Test case, delivered as text/plain

Hixie (not reading bugmail)

Reporter

Comment 5

•

24 years ago

Attached file Test case, delivered as text/xhtml (not a valid mime type?)

Hixie (not reading bugmail)

Reporter

Comment 6

•

24 years ago

Attached file Test case, delivered as text/sgml

Hixie (not reading bugmail)

Reporter

Updated

•

24 years ago

URL: http://www.avmaria.com/activitylog.html

David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla)

Comment 7

•

24 years ago

A suggestion made in
http://lists.w3.org/Archives/Public/www-html/2000Jul/0085.html
was that you look for "xmlns" as the beginning of an attribute on the HTML
element.  If xmlns is present, then parse as XML.
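A rough sketch of the suggested check, assuming the start of the document is available as a string. The function name and the regular expressions are illustrative only; a real implementation would work on the tokenizer's output, and this version can be fooled by a `>` or the text `xmlns=` inside an attribute value.

```python
import re

# Start tag of the root <html> element; the element name is matched
# case-insensitively because the document arrives labelled as text/html.
HTML_START_TAG = re.compile(r"<html\b([^>]*)>", re.IGNORECASE)

# An attribute whose name begins with "xmlns" (xmlns or xmlns:prefix).
XMLNS_ATTR = re.compile(r"(?:^|\s)xmlns(?::[\w.-]+)?\s*=")


def has_xmlns_on_html(head: str) -> bool:
    """Return True if the <html> start tag carries an xmlns attribute."""
    match = HTML_START_TAG.search(head)
    if match is None:
        return False
    return XMLNS_ATTR.search(match.group(1)) is not None


if __name__ == "__main__":
    print(has_xmlns_on_html('<html xmlns="http://www.w3.org/1999/xhtml">'))
    print(has_xmlns_on_html('<html lang="en">'))
```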

Hixie (not reading bugmail)

Reporter

Comment 8

•

24 years ago

Completely valid XHTML doesn't (can't!) have a namespace attribute. It is only
included in the attached test cases to get around bug 48445.

Also, looking for 'xmlns' is likely to mean a lot more of the document has to be
parsed before being redirected to the XML parser.

According to the XHTML spec, valid XHTML documents *must* have a DOCTYPE. That
solves our problems IMHO since we can just use the DOCTYPE sniffer to do the
work. The spec only says XHTML can be sent as text/html, not random XML with
interspersed XHTML content.

Hixie (not reading bugmail)

Reporter

Comment 9

•

24 years ago

*** Bug 26022 has been marked as a duplicate of this bug. ***

David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla)

Comment 10

•

24 years ago

Ian - I now realize that you're right that valid XHTML is not required to have
the xmlns attribute - but it is allowed.  See
http://www.w3.org/TR/REC-xml#sec-attr-defaults

harishd

Comment 11

•

24 years ago

In reference to attachment #2:

Actually it's the HTMLTokenizer that treats <strong /> as <strong> rather than
<strong></strong>, and then the Strict DTD plays its role (throwing away content!).
The current HTMLTokenizer's behavior is correct. As Ian mentioned, this document
should be dealt with by the XML parser. Therefore, in this case, even though the
document's mime type is html, someone higher up (netlib, I suppose) should
provide the parser with an XML content sink and the text/xml mime type.

Reassigning bug to gagan.

Assignee: harishd → gagan

Hixie (not reading bugmail)

Reporter

Comment 12

•

24 years ago

David: Oops, I missed that part of the DTD. Right, it is allowed. This does 
not, however, change the validity of this bug. ;-)

Harish and I just discussed this. Unfortunately, the parser/content sink 
apparently cannot change half way, so this probably has to go down to Necko.
David, is it possible to adapt your DOCTYPE parsing for Necko purposes, thus
moving the entire strict/transitional/quirks issue up a level?

If so, is this the right way we should be doing things?:

  text/xml      -> always XML
  text/html     -> check DOCTYPE. [1]
                   If XHTML DOCTYPE -> XML
                   If HTML 4.01 or 4.0 Strict DOCTYPE -> HTML Strict
                   If other HTML 4.01 DOCTYPE -> HTML Transitional [2]
                   If other RECOGNISED DOCTYPE -> HTML Quirks (CNavDTD)
                   If FPI contains 'Transitional' or 'Frameset' -> HTML Transitional
                   ELSE -> HTML Strict
  text/xul      -> XML
  anything else -> Not XML nor HTML. (image, text/plain, RTF...)

Notes:
  [1]: do NOT use the presence of a <?xml processing instruction for anything.
  [2]: Frameset and Transitional DTDs are internally the same, right?

Gagan?

David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla)

Updated

•

24 years ago

Keywords: xhtml

Hixie (not reading bugmail)

Reporter

Comment 13

•

24 years ago

This bug is actually the cause of most of the XHTML bugs that are filed (and
promptly marked INVALID in most cases, so they don't show up as dups anywhere).

No longer blocks: html4.01

Keywords: mostfreq

Hixie (not reading bugmail)

Reporter

Comment 14

•

24 years ago

*** Bug 48617 has been marked as a duplicate of this bug. ***

Gagan

Comment 15

•

24 years ago

I believe that stuff happens with the URILoader ->mscott

Assignee: gagan → mscott

David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla)

Comment 16

•

24 years ago

Does XML loading go through nsParser?

David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla)

Comment 17

•

24 years ago

*** Bug 38088 has been marked as a duplicate of this bug. ***

Scott MacGregor

Comment 18

•

24 years ago

wow I am so confused by this bug.....are we saying that we expect necko to parse
the actual content and attempt to guess whether it is really text/html or
text/xhtml???

This sounds like a real layout/parser thing to me.

In either case, necko's responsibility is to determine the content type of the
incoming data from the server for http. They look at the content type header and
they look at the file extension and there may be some sniffing code in there too.
If that's code that needs to be changed to sniff for xhtml then this bug probably
belongs back in gagan's group.

Let me know if I'm understanding what you need done correctly.

David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla)

Comment 19

•

24 years ago

What is the normal path for an XML load?

Henri Sivonen (:hsivonen)

Comment 20

•

24 years ago

*** Bug 47958 has been marked as a duplicate of this bug. ***

Henri Sivonen (:hsivonen)

Comment 21

•

24 years ago

A forward compatibility issue:
The current XHTML working drafts indicate that there are going to be many different
doctype declarations for XHTML documents, even though the tags are in the same
XHTML namespace. It seems to me that the one thing all these doctype declarations
have in common is the string "XHTML" in the third field of the FPI.
http://www.w3.org/TR/xhtml-building/conformance.html#s_conform_naming_rules
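The third-field check can be sketched as follows, assuming the FPI has already been extracted from the doctype declaration as a string. The helper names are hypothetical and nothing here is taken from Mozilla's DOCTYPE sniffer.

```python
def fpi_third_field(fpi: str) -> str:
    """Return the third "//"-separated field of a formal public identifier.

    For "-//W3C//DTD XHTML 1.0 Strict//EN" the fields are
    "-", "W3C", "DTD XHTML 1.0 Strict" and "EN".
    """
    fields = fpi.split("//")
    return fields[2] if len(fields) > 2 else ""


def is_xhtml_fpi(fpi: str) -> bool:
    """True if the string "XHTML" appears in the third field of the FPI."""
    return "XHTML" in fpi_third_field(fpi)


if __name__ == "__main__":
    for fpi in ("-//W3C//DTD XHTML 1.0 Strict//EN", "-//W3C//DTD HTML 4.01//EN"):
        print(fpi, "->", is_xhtml_fpi(fpi))
```

Checking only the third field means an owner name that happens to contain "XHTML" does not trigger the XML path.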

Henri Sivonen (:hsivonen)

Comment 22

•

24 years ago

With one HTML parser mode:

text/xml,
application/xml --> XML

text/html --> Look for doctype:

              No doctype --> CNavDTD + quirk layout
              Doctype with incorrect syntax --> CNavDTD + quirk layout
              "XHTML" is a substring of the third field of the FPI --> XML
              HTML 4.x Strict --> CNavDTD + std layout
              HTML 4.x Transitional with the URI --> CNavDTD + std layout
              HTML 4.x Transitional without the URI --> CNavDTD + quirk layout
              Other known quirky doctype --> CNavDTD + quirk layout
              Other doctype --> CNavDTD + std layout
              
(Std layout should probably be used with HTML 4.01 Frameset.)
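The table above could be sketched roughly as below for the text/html case. This is an illustration under several assumptions, not the parser's actual logic: the function name, the regular expression, and the example "known quirky" set are all hypothetical, and any doctype that does not match the simple double-quoted PUBLIC form is treated as having incorrect syntax.

```python
import re

XML = "XML"
STD = "CNavDTD + std layout"
QUIRKS = "CNavDTD + quirk layout"

# Simplified doctype pattern: PUBLIC form with double-quoted literals only.
DOCTYPE_RE = re.compile(
    r'<!DOCTYPE\s+\w+\s+PUBLIC\s+"([^"]*)"(?:\s+"([^"]*)")?\s*>',
    re.IGNORECASE,
)

STRICT_FPIS = frozenset({
    "-//W3C//DTD HTML 4.0//EN",
    "-//W3C//DTD HTML 4.01//EN",
})
TRANSITIONAL_FPIS = frozenset({
    "-//W3C//DTD HTML 4.0 Transitional//EN",
    "-//W3C//DTD HTML 4.01 Transitional//EN",
})
# Placeholder only: the comment does not say which doctypes count as quirky.
EXAMPLE_QUIRKY_FPIS = frozenset({"-//W3C//DTD HTML 3.2 Final//EN"})


def choose_mode(head: str, known_quirky=EXAMPLE_QUIRKY_FPIS) -> str:
    """Pick a parser/layout mode for a text/html document from its doctype."""
    match = DOCTYPE_RE.search(head)
    if match is None:
        # No doctype, or a doctype this sketch cannot parse.
        return QUIRKS
    fpi, uri = match.group(1), match.group(2)
    fields = fpi.split("//")
    if len(fields) > 2 and "XHTML" in fields[2]:
        return XML
    if fpi in STRICT_FPIS:
        return STD
    if fpi in TRANSITIONAL_FPIS:
        # Transitional gets standard layout only when the URI is present.
        return STD if uri else QUIRKS
    if fpi in known_quirky:
        return QUIRKS
    return STD
```

With these rules an HTML 4.01 Frameset doctype falls through to the "other doctype" case and gets standard layout, in line with the note above.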

Why shouldn't the <?xml?> declaration be used in detection? To prevent sending 
arbitrary XML as text/html?

Strictly conforming XHTML *must* have the xmlns attribute in the root element so I 
suppose it would be OK to try to detect it.
http://www.w3.org/TR/xhtml1/#docconf

Hixie (not reading bugmail)

Reporter

Comment 23

•

24 years ago

*** Bug 46963 has been marked as a duplicate of this bug. ***

Hixie (not reading bugmail)

Reporter

Comment 24

•

24 years ago

Marking [nsbeta3+] like the two other bugs that came out of Monday's meeting,
as per Eric Krock on those two bugs.

Eric -- hope that's ok...

Blocks: 50071

Whiteboard: [nsbeta3+]

Scott MacGregor

Comment 25

•

24 years ago

This kind of sniffing happens down in necko as they are the ones that determine
the incoming content type, not the uriloader. back to gagan.

Looks like http's use of the content type field isn't enough for this bug. They
want to sniff the stream and override the content type header if the doc type
contains the string "xhtml".
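As a sketch of what that override might look like at the network layer, assuming the first bytes of the response are available before the content type is handed on. The function name and the 1024-byte window are assumptions; this is not necko code.

```python
def sniff_content_type(declared_type: str, first_bytes: bytes) -> str:
    """Return text/xml for text/html data whose doctype mentions "xhtml".

    Any other declared type, or text/html without such a doctype, is
    returned unchanged.
    """
    media_type = declared_type.split(";")[0].strip().lower()
    if media_type != "text/html":
        return declared_type

    # Only look at the start of the stream; latin-1 never fails to decode.
    head = first_bytes[:1024].decode("latin-1").lower()
    start = head.find("<!doctype")
    if start == -1:
        return declared_type

    end = head.find(">", start)
    doctype = head[start:end] if end != -1 else head[start:]
    return "text/xml" if "xhtml" in doctype else declared_type
```

The check is deliberately limited to the doctype declaration, so the word "xhtml" appearing elsewhere in the page does not change the type.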

Assignee: mscott → gagan

ekrock's old account (dead)

Comment 26

•

24 years ago

Vidur, is bug 50071 that I filed on you for using the XML codepath in this
situation a DUP of this bug? If so, please close bug 50071 as a DUP. (I can't
seem to find the bug I filed on Harish--maybe it's closed already?)

Adding Harish & Vidur to cc: list.

Gagan

Comment 27

•

24 years ago

OK so I understand the part mscott sez, but I need to know what the detection
criteria are so that we can put that in. If you want to see how we do this currently
(pretty plain vanilla checks) see
http://lxr.mozilla.org/seamonkey/source/netwerk/streamconv/converters/nsUnknownDecoder.cpp#229

Henri Sivonen (:hsivonen)

Comment 28

•

24 years ago

Yesterday I posted to n.p.m.xml to get opinions about suitable detection criteria. No one 
has replied to the post, yet.

David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla)

Comment 29

•

24 years ago

Ian and I just sent a message to the HTML WG about detection criteria:
http://lists.w3.org/Archives/Member/w3c-html-wg/2000JulSep/0410.html (for those
with access).

I can't read n.p.m.xml since news.mozilla.org is down.

David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla)

Comment 30

•

24 years ago

The HTML WG has changed its mind:
http://lists.w3.org/Archives/Member/w3c-html-wg/2000JulSep/0522.html
so I guess this bug is invalid now...

Henri Sivonen (:hsivonen)

Comment 31

•

24 years ago

That archive is for members only. Are you allowed to summarize the situation to 
outsiders?

Forcing text/xml would disallow graceful degrading in legacy browsers which would be a 
huge disincentive for authors.

David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla)

Comment 32

•

24 years ago

The message has now been reposted publicly:
http://lists.w3.org/Archives/Public/www-html/2000Sep/0024.html

vidur (gone)

Comment 33

•

24 years ago

This shouldn't be a gagan bug. Reassigning to RickG since it's tied to the 
removal of Strict DTD.

I'm guessing we'll get a lot of grumbling from developers by sending XHTML 
delivered as text/html through the HTML codepath. It's definitely the easiest 
for us to implement, though. 

Rick, I guess this makes bug 50071 invalid and the only thing to be done is to 
make sure that anything delivered as text/html goes through the transitional 
codepath.

Assignee: gagan → rickg

rickg

Assignee

Comment 34

•

24 years ago

Ok -- Having discussed this with Vidur, there are 2 things to do:
1) disable strictDTD
2) cause XHTML delivered as text/html to use the transitional DTD (CNavDTD).

Severity: major → critical

Status: NEW → ASSIGNED

Priority: P3 → P2

Target Milestone: --- → M18

Peter "jag" Annema

Comment 35

•

24 years ago

3) Evangelize the use of the appropriate mime-type on XHTML

Henri Sivonen (:hsivonen)

Comment 36

•

24 years ago

I don't know the motivations behind the decision, but to me it doesn't make sense to not 
parse XHTML with an XML parser if one is available. What's the point in using XHTML at 
all if it gets handled as HTML 4 in new browsers, too? Is the intent that every site 
implements client detection and sends the content with a different content type for 
browsers that are known to support XHTML as text/xml? Is there any chance the WG
could be persuaded to reconsider its decision?

Peter "jag" Annema

Comment 37

•

24 years ago

IIRC there's a method available for the browser to indicate its capabilities, so
it needn't be client sniffing and assumptions made based on that. But yes, that
seems to be the case: find out what a browser is capable of, then send content
tailored to that client. Alternatively, just send the latest of the latest, and
pray. Though in most cases it'll be a matter of knowing your public/market and
adjusting to them.
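One way a server could do this, assuming the method meant here is the HTTP Accept request header (the comment does not name it), is sketched below. The function names are hypothetical, and the media types are the ones discussed in this bug.

```python
def parse_accept(header: str) -> dict:
    """Map each media range in an Accept header to its q-value (default 1.0)."""
    prefs = {}
    for item in header.split(","):
        parts = [part.strip() for part in item.split(";")]
        media = parts[0].lower()
        if not media:
            continue
        quality = 1.0
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0
        prefs[media] = quality
    return prefs


def choose_xhtml_type(accept_header: str) -> str:
    """Serve XHTML as text/xml only if the client lists it explicitly."""
    prefs = parse_accept(accept_header)
    xml_q = prefs.get("text/xml", 0.0)
    html_q = prefs.get("text/html", prefs.get("text/*", prefs.get("*/*", 0.0)))
    return "text/xml" if xml_q > 0 and xml_q >= html_q else "text/html"
```

A client that sends only `*/*`, or no Accept header at all, gets text/html, so legacy browsers still degrade gracefully.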

Btw, this isn't as bad as it sounds. Once it becomes commonly known that certain
doctypes will trigger standards layout on Mozilla, people will use that. Of
course, they'll also create lots and lots of broken XHTML (for the same
reason(s) they're creating broken HTML 4 strict currently) but that's Somebody
Else's Problem :-/

Phil Peterson

Comment 38

•

24 years ago

PDT agrees P2 after talking to rickg

Whiteboard: [nsbeta3+] → [nsbeta3+][PDTP2]

Martin Horwath

Comment 39

•

24 years ago

*** Bug 53181 has been marked as a duplicate of this bug. ***

rickg

Assignee

Comment 40

•

24 years ago

XHTML documents are now routed through the navdtd.

Status: ASSIGNED → RESOLVED

Closed: 24 years ago

Resolution: --- → FIXED

Jan Carpenter

Comment 41

•

24 years ago

verified

Status: RESOLVED → VERIFIED

Robin Lionheart

Comment 42

•

24 years ago

"text/xhtml" is not a valid MIME type. Right now it seems the official MIME type
for XHTML will be "application/xhtml+xml". See
http://www.ietf.org/internet-drafts/draft-baker-xhtml-media-reg-00.txt.

Jesse Ruderman

Comment 43

•

18 years ago

Why is this marked as fixed?  Based on comments here and bug 199165 it sounds more like wontfix.
