Closed Bug 34476 Opened 24 years ago Closed 24 years ago

system identifier in doctype should trigger standard mode

Categories

(Core :: DOM: HTML Parser, defect, P3)

x86
Linux
defect

RESOLVED INVALID

People

(Reporter: dbaron, Assigned: rickg)

DESCRIPTION:  A system identifier in a DOCTYPE should trigger strict mode. 
MacIE5 has modes like Mozilla's, and it uses the presence of a SystemID to cause
strict mode (at least in the form <!DOCTYPE RootElem PUBLIC PublicID
SystemID>).  Since MacIE5 is already released, most pages where this causes any
problems should be changed in the near future.  See
http://www.deja.com/[ST_rn=ps]/getdoc.xp?AN=604424748&fmt=text for a quick
description of MacIE's algorithm.

I think any doctype that has a systemID in the form:
<!DOCTYPE HTML PUBLIC PublicID SystemID>
<!DOCTYPE HTML SYSTEM SystemID>
or has an internal subset:
<!DOCTYPE HTML (PUBLIC PublicID SystemID? | SYSTEM SystemID) [ Internal-SS ]>
should trigger strict mode (the latter two cases are very rare, and the first is
the one I'm sure MacIE does).

This would leave only DOCTYPEs of the form
<!DOCTYPE HTML PUBLIC PublicID>
to be treated with the current algorithm.  (These are the vast majority of
DOCTYPEs on the web.)

For information on the syntax of DOCTYPEs in XML (which is a subset of SGML, of
which HTML is an application), see:
http://www.w3.org/TR/REC-xml#dt-doctype
http://www.w3.org/TR/REC-xml#NT-ExternalID

Note that SGML does not require the SystemID for PUBLIC doctypes.

I think Ian may be able to provide steps to reproduce more easily than I can...
David: please explain why systemID should be treated as strict.
A SystemID should lead to strict mode because:
 * it's an easy way for page authors to control strict mode without affecting
validity
 * it's what MacIE does, and therefore it shouldn't cause too many problems
(since the pages where it causes problems will see them with MacIE first)
Target Milestone: --- → M17
Fixed in my tree
Status: NEW → ASSIGNED
Landed fixes. Read code in nsParser.cpp to learn more.
Status: ASSIGNED → RESOLVED
Closed: 24 years ago
Resolution: --- → FIXED
Reopening.  Looking at code in CParserContext.cpp, you're looking for the
strings "PublicID" and "SystemID", not the concepts.

I might have a chance to write a patch for what I meant in C++ this weekend
(hopefully), so we can avoid the difficulties of translating C++ to English and
back to C++ again.  What I'd like to do is just:
 1) check for a proper XML declaration (if so, strict)
 2) parse the DOCTYPE based on the SGML spec (and be quirks if it doesn't fit
the spec or if there isn't one at all)
 3) check for the presence of a SystemID or an internal subset (strict if so)
 4) then do the logic on the PublicID
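The four steps above might be sketched roughly as follows. This is only an illustration of the ordering, not the patch itself: DoctypeInfo, Mode, ModeFromPublicId and DetermineMode are hypothetical names, the DOCTYPE is assumed to be parsed already, and the PublicID logic is a stand-in that treats only the HTML 4.0 and 4.01 strict identifiers as strict.

```cpp
#include <string>

// Hypothetical summary of an already-parsed DOCTYPE (the outcome of step 2).
struct DoctypeInfo {
  bool hasPublicId = false;
  std::string publicId;
  bool hasSystemId = false;
  bool hasInternalSubset = false;
};

enum class Mode { Quirks, Strict };

// Stand-in for step 4. Assumption: only the HTML 4.0/4.01 strict public
// identifiers select strict mode; the real PublicID logic is more involved.
Mode ModeFromPublicId(const std::string& publicId) {
  const bool strictDtd = publicId == "-//W3C//DTD HTML 4.0//EN" ||
                         publicId == "-//W3C//DTD HTML 4.01//EN";
  return strictDtd ? Mode::Strict : Mode::Quirks;
}

// doctype is null when the document has no DOCTYPE or it does not fit the spec.
Mode DetermineMode(bool hasXmlDeclaration, const DoctypeInfo* doctype) {
  // 1) a proper XML declaration means strict
  if (hasXmlDeclaration) return Mode::Strict;
  // 2) a missing or malformed DOCTYPE means quirks
  if (doctype == nullptr) return Mode::Quirks;
  // 3) a SystemID or an internal subset means strict
  if (doctype->hasSystemId || doctype->hasInternalSubset) return Mode::Strict;
  // 4) otherwise fall back to the PublicID logic
  if (doctype->hasPublicId) return ModeFromPublicId(doctype->publicId);
  // Assumption: a DOCTYPE with no identifiers at all is treated as quirks.
  return Mode::Quirks;
}
```

With this ordering, an HTML 4.01 Transitional doctype is quirks without a SystemID and strict with one, which is the behaviour this bug asks for.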
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
I think the current mode determination is too complicated, and three (or 
more) modes are also too complicated. They will only confuse web designers. 
(Help! Which mode are we in?)
We had better follow the MacIE5 strategy.
If a document has a system identifier, an XML PI, an XHTML DTD, or an ISO HTML 
DTD, trigger strict mode.
Otherwise, trigger quirks mode.
That's simple and reasonable. (I will accept that HTML4 Strict without a system 
identifier triggers quirks mode.)
Modes in the parsing engine are independent of modes the users will see. To 
users, there will be 3 modes: STRICT, non-strict and quirks. Independent of 
that, there's html3, html4, xml and xhtml. That's the world we're in.

David -- I'd like to talk to you offline about ID's. I accept that my algorithm 
is a hack regarding these -- but I haven't done enough research to determine the 
right thing to do. I suspect you know, and can help me get it right.

I'm closing this bug -- and I'll open a new one regarding ID's. 
Status: REOPENED → RESOLVED
Closed: 24 years ago → 24 years ago
Resolution: --- → FIXED
Reopening since I can't find a new bug regarding IDs (if there is one, please 
tell me the bug number).
The system identifier check code still doesn't work at all.

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
 "http://www.w3.org/TR/html4/loose.dtd">
 - both handled in quirks.

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN">
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
 "http://www.w3.org/TR/html4/strict.dtd">
 - both handled in strict.

If this behaviour is your intent, the following code is redundant.
    //one last thing: look for a URI that specifies the strict.dtd
    theStartPos+=6;
    theCount=theEnd-theStartPos;
    theSubIndex=theBuffer.Find("STRICT.DTD",PR_TRUE,theStartPos,theCount);
    if(0<theSubIndex) {
      //Since we found it, regardless of what's in the descr-text, kick into strict mode.
      mParseMode=eParseMode_strict;
      mDocType=eHTML4Text;
    }

If this behaviour is not your intent, it should be fixed.

Also, the following code should be removed since 
<!DOCTYPE HTML PUBLIC PublicID SystemID>
is NOT a real doctype.

    else {
      PRInt32 thePos=theBuffer.Find("HTML",PR_TRUE,1,50);
      if(kNotFound!=thePos) {
        mDocType=eHTML4Text;
        PRInt32 theIDPos=theBuffer.Find("PublicID",thePos);
        if(kNotFound==theIDPos)
          theIDPos=theBuffer.Find("SystemID",thePos);
        mParseMode=(kNotFound==theIDPos) ? eParseMode_quirks : eParseMode_strict;
      }
    }
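To illustrate the difference between searching for the literal strings and detecting the concept, here is a minimal sketch of a check for whether a DOCTYPE actually carries a system identifier. It is not the code in CParserContext.cpp: HasSystemIdentifier and its helpers are hypothetical, it uses std::string rather than the parser's own string class, it assumes the keywords are case-insensitive, and it ignores comments and internal subsets.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Advances i past any whitespace.
static void SkipSpace(const std::string& s, std::size_t& i) {
  while (i < s.size() && std::isspace(static_cast<unsigned char>(s[i]))) ++i;
}

// Reads an unquoted token (root element name or keyword) starting at i.
static std::string ReadName(const std::string& s, std::size_t& i) {
  SkipSpace(s, i);
  const std::size_t start = i;
  while (i < s.size() && !std::isspace(static_cast<unsigned char>(s[i])) &&
         s[i] != '>' && s[i] != '[') {
    ++i;
  }
  return s.substr(start, i - start);
}

// Reads a quoted literal ("..." or '...'); returns false if none is present.
static bool ReadLiteral(const std::string& s, std::size_t& i, std::string& out) {
  SkipSpace(s, i);
  if (i >= s.size() || (s[i] != '"' && s[i] != '\'')) return false;
  const char quote = s[i];
  const std::size_t close = s.find(quote, i + 1);
  if (close == std::string::npos) return false;
  out = s.substr(i + 1, close - i - 1);
  i = close + 1;
  return true;
}

static std::string Upper(std::string s) {
  for (char& c : s) c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
  return s;
}

// True if decl is a DOCTYPE declaration that contains a system identifier.
bool HasSystemIdentifier(const std::string& decl) {
  const std::string prefix = "<!DOCTYPE";
  if (decl.size() < prefix.size() ||
      Upper(decl.substr(0, prefix.size())) != prefix) {
    return false;
  }
  std::size_t i = prefix.size();
  if (ReadName(decl, i).empty()) return false;  // root element name
  const std::string keyword = Upper(ReadName(decl, i));
  std::string literal;
  if (keyword == "PUBLIC") {
    if (!ReadLiteral(decl, i, literal)) return false;  // public identifier
    return ReadLiteral(decl, i, literal);              // optional system identifier
  }
  if (keyword == "SYSTEM") return ReadLiteral(decl, i, literal);
  return false;
}
```

Under this sketch the placeholder form <!DOCTYPE HTML PUBLIC PublicID SystemID> is rejected, because the identifiers are not quoted literals, while a Transitional doctype followed by the loose.dtd URI is reported as having a system identifier.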
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Future versions of HTML may include only strict doctypes, and their URIs may 
not include the string "strict.dtd". For example, XHTML 1.1 no longer includes 
the transitional doctypes, and its URI does not contain the string "strict.dtd". 
Of course, this is not a problem there, since XHTML documents always trigger 
strict mode. But the point is that HTML 5.0 (or later) may go the same way.

My suggestion (and possibly David's) is that doctypes with a URI always 
trigger strict mode.
Web authors will be encouraged to use a horrible hack if we do not provide a 
way to get strict mode with the Transitional DTD. That is, they will use the 
strict doctypes only to trigger strict mode, while the actual document body is 
transitional.
The following doctype does not solve the problem, since it is invalid:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
 "http://www.w3.org/TR/html4/strict.dtd">

W3C defines only the following forms:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
 "http://www.w3.org/TR/html4/strict.dtd">
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
 "http://www.w3.org/TR/html4/loose.dtd">
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Frameset//EN"
 "http://www.w3.org/TR/html4/frameset.dtd">

And the URI can be omitted per the SGML spec:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN">
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Frameset//EN">

But the public identifier and the URI must match.
It is strange that valid doctypes trigger quirks mode while invalid doctypes 
trigger strict mode.
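The matching rule can be written down as a small lookup over the three HTML 4.01 pairs listed above. This is a sketch only: DoctypePair, kHtml401Pairs and IdentifiersMatch are hypothetical names, and the table deliberately covers nothing beyond those three pairs.

```cpp
#include <string>

// One public identifier together with the URI defined for it.
struct DoctypePair {
  const char* publicId;
  const char* systemId;
};

// The three HTML 4.01 forms quoted in this comment.
static const DoctypePair kHtml401Pairs[] = {
  {"-//W3C//DTD HTML 4.01//EN", "http://www.w3.org/TR/html4/strict.dtd"},
  {"-//W3C//DTD HTML 4.01 Transitional//EN", "http://www.w3.org/TR/html4/loose.dtd"},
  {"-//W3C//DTD HTML 4.01 Frameset//EN", "http://www.w3.org/TR/html4/frameset.dtd"},
};

// True only if publicId is one of the known identifiers and systemId is the
// URI paired with it. Unknown public identifiers are reported as not matching.
bool IdentifiersMatch(const std::string& publicId, const std::string& systemId) {
  for (const DoctypePair& pair : kHtml401Pairs) {
    if (publicId == pair.publicId) return systemId == pair.systemId;
  }
  return false;
}
```

The invalid combination shown earlier (the Transitional public identifier with the strict.dtd URI) fails this check, while each of the three defined forms passes.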
(aside: HTML 5 == XHTML 1. It is almost certain that there will not be any more
SGML-based versions of the HTML language.)
Let's clear some of this up. We only have 2 options at this time for HTML (or 
XHTML served as text/html): compatible-mode and strict-mode. Transitional 
documents really should be handled as a variant of strict, but the strict-mode 
system isn't ready to do that just yet. So anything that is loose or 
transitional is handled in compatible mode.

The suggestion that we treat all documents with a URI as strict is patently 
wrong and will not be implemented.

Other DTD notes:

This is treated as strict:
 <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN"   
   "http://www.w3.org/TR/REC-html40/strict.dtd">

This is treated as compatible (because we don't have transitional):
 <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"      
   "http://www.w3.org/TR/html4/loose.dtd">

This is treated as compatible:
  <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Frameset//EN"   
    "http://www.w3.org/TR/html4/frameset.dtd">

I don't see any other problems in this bug, so I'm closing it.
Please open a new bug (as this one is getting difficult to follow) for new 
problems.
Status: REOPENED → RESOLVED
Closed: 24 years ago → 24 years ago
Resolution: --- → INVALID
Hmm..  I thought the idea of 3 modes (where the parser would be lenient in the
middle one, but layout would be strict) was a good idea, and the intent of this
bug was that a systemID would trigger the middle mode.  I have this fully
implemented in the DOCTYPE handling code that I've written, except whatever
deals with the modes doesn't handle returning eDTDMode_Standard (or whatever it
is) correctly, so it just acts like quirks.

I think we should allow authors a way to trigger strict mode in *layout* without
having to use the strict DTD.  The three-mode solution, along with fixing this
bug, is a good way to do this.
Summary: system identifier in doctype should trigger strict mode → system identifier in doctype should trigger standard mode
rickg:  How does this relate to your fix for bug 29417?  There, you 
indicated that you had a more intelligent DOCTYPE based detection mechanism, 
but that it was not ready to be enabled by default.  Should that bug still be 
considered FIXED?

Also, in bug 34135, you indicated that you had a mechanism for controlling 
layout based on a META tag.  Is that mechanism ready to become "official"?

Finally, it'd be nice to have a bit more explanation for why treating all 
documents with a URI as strict (for purposes of layout) is patently wrong.  I've 
not seen a justification for that decision, at least in this bug.  From the 
points made here, it seems to be as clean a solution as could be hoped for 
without a lot of work.

If this bug does remain INVALID, what is the correct issue for addressing the 
problem of writing a page conforming to the Transitional DTD which needs to be 
laid out in non-quirks mode?  I suspect there are far more Transitional 
documents being written than Strict ones (I know this to be true of the 
companies I've worked with), and that this situation will persist for some time 
yet.  It seems like a workaround is called for, if a real solution is not 
feasible given the time constraints.
updated qa contact.
QA Contact: janc → bsharma
QA Contact: bsharma → moied