Closed Bug 55264 Opened 24 years ago Closed 23 years ago

[DOCTYPE] Documents with unknown DOCTYPE should be displayed in strict mode

Categories

(Core :: DOM: HTML Parser, defect, P1)

VERIFIED FIXED
mozilla0.9.5

People

(Reporter: ekrock, Assigned: dbaron)

Attachments

(4 files)

Proposal for DOCTYPE handling hived off from bug 42525; see that bug for 
discussion and background.

(I currently have no position on this proposal; I'm opening this bug report so 
discussion can take place within this report.)

*** This bug has been marked as a duplicate of 55263 ***
Status: NEW → RESOLVED
Closed: 24 years ago
Resolution: --- → DUPLICATE
Oops... I should have read more closely.  Sorry.  I *thought* I read the titles 
and they were the same.

Status: RESOLVED → REOPENED
Resolution: DUPLICATE → ---
Reassigned to Rickg/Parser. Nominating for rtm, like bug 42525.

Assignee: clayton → rickg
Status: REOPENED → NEW
Component: Layout → Parser
Keywords: rtm
OS: Windows NT → All
Hardware: PC → All
Are there any existing, currently "unknown" doctypes? Are there any real-world
documents that use such an existing, currently "unknown" doctype? Or is this a
pure forward compatibility issue?
This is a forward compatibility issue in the hypothetical cases that W3C
publishes HTML 4.02 with some minor corrections or that ISO comes up with a yet
unknown doctype.

As for existing "unknown" doctypes, one example is the DTD Metrius
Presentational doctype declaration used at Motorola's site. It is
debatable what an HTML browser should do with such a doctype.
How common is it for page authors or page editing software to accidentally 
introduce a typo in the DOCTYPE when editing markup? If it's common, we could 
create a real problem for product usability by applying strict layout to every 
page on the web with a corrupted DOCTYPE statement.

Another concern: have we looked at the DOCTYPE produced by every known authoring 
tool? We already know that Netscape Composer output an invalid DOCTYPE and had 
to build in awareness of that; I'm concerned about the risk that we may overlook 
bad DOCTYPEs output by other tools.

Finally, could someone use a spider to scan the web and generate a list of every 
unique DOCTYPE it found for HTML? *That* would be the way to make sure we 
weren't going to cause a lot of problems with this. Marking qawanted for this.
Keywords: qawanted
Marking [DOCTYPE] for easy searching.
Summary: Documents with unknown DOCTYPE should be displayed in strict mode → [DOCTYPE] Documents with unknown DOCTYPE should be displayed in strict mode
[rtm-]. The proposal may well have been meritorious, but unfortunately, no one
inside or outside of Netscape was able to do sufficient analysis of the
potential of this bug to cause regressions in time for us to commit this and
implement and accept a patch in time for RTM. Marking Future. Some will argue
that if we don't do this for RTM, we can't do it ever, but actually that's not
necessarily true. Depending upon the speed of market uptake of Netscape 6, the
length of time before the first point release, and further analysis of how many
existing web pages actually use unknown DOCTYPEs and would be affected by a
change either way, it may still make sense to consider this for the first point
release.
Whiteboard: [rtm-]
QA Contact: petersen → janc
I just wrote a perl script that would traverse dmoz.org and identify the
DOCTYPEs on each site. My computer isn't in a situation to run this over the
whole of dmoz, but I'll attach the script so that if someone wants to run it
over a larger subset of the web, they can do so (It should run on any system
that can support perl and wget).

The results for the first 100 sites (alphabetically by category on dmoz) are as
follows:

  7  <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
  4  <!doctype html public "-//w3c//dtd html 4.0 transitional//en">
  2  <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
     "http://www.w3.org/TR/REC-html40/loose.dtd">
  1  <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
  1  <!DOCTYPE HTML PUBLIC '-//W3C//DTD HTML 3.2//EN'>
  1  <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
     "http://www.w3.org/TR/1999/REC-html401-19991224/loose.dtd">
 85  NO DOCTYPE

Since only 15 of these have doctypes at all, this is clearly not a large enough
sample, but the script can keep going indefinitely...
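
For anyone who wants to reproduce the tallying step without the Perl script, here is a minimal C++ sketch of the idea. It is not the attached script: the function names ExtractDoctype and TallyDoctypes are hypothetical, fetching the pages is left out, and a '>' inside a quoted identifier or an internal subset would cut the declaration short.

```cpp
#include <algorithm>
#include <cctype>
#include <map>
#include <string>
#include <vector>

// Hypothetical helper: return the first <!DOCTYPE ...> declaration found in a
// page, or "NO DOCTYPE" if there is none. The search for "<!doctype" ignores
// case; the declaration text itself is returned unchanged.
std::string ExtractDoctype(const std::string& html) {
  std::string lower(html);
  std::transform(lower.begin(), lower.end(), lower.begin(), [](unsigned char c) {
    return static_cast<char>(std::tolower(c));
  });
  const std::string::size_type start = lower.find("<!doctype");
  if (start == std::string::npos) return "NO DOCTYPE";
  const std::string::size_type end = html.find('>', start);
  if (end == std::string::npos) return "NO DOCTYPE";  // unterminated declaration
  return html.substr(start, end - start + 1);
}

// Count how many pages use each distinct declaration.
std::map<std::string, int> TallyDoctypes(const std::vector<std::string>& pages) {
  std::map<std::string, int> counts;
  for (const std::string& page : pages) ++counts[ExtractDoctype(page)];
  return counts;
}
```

Declarations that differ only in case are counted separately here, which matches the way the results above are reported.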

Suggested tweaks to the script if anyone wants to enhance it:

- Some sort of parallelizability (eg run 100 threads, or at least open 100 http
connections)
- The ability to save its status before terminating and then resume afterwards
 from the same place
- Pick the META GENERATOR tag out of the file if one exists, so that we can
identify what popular editors are putting in the doctype
- Some kind of crawling from the sites themselves (perhaps to a depth of 2 or
3), rather than only going to the single page linked from dmoz. The current
process doesn't even attempt to find the actual frames in a frameset, let alone
linked pages.
- Although wget knows that the file is text/html, the script doesn't retrieve
this information. If dmoz links to a plaintext file that has <!doctype in the
middle somewhere, you'll get a potentially very strange result...

If I get a chance to address any of these issues, I'll post updated versions of
the script in this bug. In the meantime, if anyone else wants to work on it (or
just to run it for a larger number of sites), go for it.

(attachment to follow...)
I'd just like to add something (I've already posted this as a bug which was
marked "invalid", and as a comment on another thread).

The subject of this bug is one of the things I feel is most lacking in Mozilla.
There *must* be a way to enable standards-compliant rendering with a
transitional doctype. Having a "quirks mode" must be of some use to many people
(not for me, though), but there must be a way to go around it without having to
go to strict mode.

Why this? Take for example one vastly used HTML authoring tool: Dreamweaver.
Pages generated by it have (questionably) transitional content, and they look
very good when shown in IE. But, if you look at them with Mozilla, you get the
same effect as NS4, which is not as good as IE's.

One example: tables with colored borders (both outline and inside the table).
Dreamweaver does this by setting a cell spacing higher than 0, and setting a
background color to the table. IE shows them great (even without a DOCTYPE
attached), because as far as I can tell it is not using a "quirks mode". But
looking in Mozilla with quirks mode it goes back to the same behaviour NS4 had:
showing the spacing between the cells as a white background instead of the table
background, ruining the effect.

Using a strict DTD makes Mozilla show them correctly, but it breaks almost the
whole page as Dreamweaver uses transitional syntax.

I'm not a huge fan of Dreamweaver myself (actually I hate it most of the time),
and I'm not a web designer (I do web programming and don't work with it), but
the designers here use it most of the time. And I can say that, because of
this, results are much better with IE than with Mozilla.

If anyone is interested in seeing the example in action, follow:
http://www.geocities.com/mvanzin/mozilla/table-strict.htm
http://www.geocities.com/mvanzin/mozilla/table-loose.htm

Same HTML, different DTDs (both with the DTD URL), and different results in
Mozilla. I'm not sure whether my intentions in the second table are correct,
but the first one (the case I described above) should render equally in both
cases (by equally I mean equal to the strict mode one).

Hope this adds some food for thought.
(BTW, I'm adding myself to the CC list.)
This bug is not about transitional doctypes. This is about unknown doctypes.
Let's keep the bug focused.

(It is possible to activate the standards layout mode for transitional
documents. See: http://www.hut.fi/u/hsivonen/doctype.html)
Oops, sorry (the original bug, 42525, was about transitional. I think I did not
check the summary when going to this new thread).

BTW, thanks for the link. (I think that 4.0 transitional should also fire
standards mode, but...)
The issue about table backgrounds is, I think, covered by bug 4510.
Hmm, maybe I can add something to the discussion then. :-)

I'll try to base my comments on my previous comment (using the fact that many 
people use WYSIWYG HTML tools today, and they are mainly focused on producing 
nice output in IE, while maintaining an acceptable result in NS4).

I'd say that using strict mode for unknown doctypes would be a very bad idea in 
this case. If you go around, you will see very few pages using strict layout, 
and when they do they generally use the correct doctype declaration. The same 
cannot be said about the much more common case of transitional doctypes. Most of 
the pages use transitional syntax, and many of them do not contain a doctype 
declaration.

That would leave us with three options: transitional with standards mode, 
transitional with quirks mode and old HTML 3 mode.

First one is the best in my opinion. I changed the doctype in documents at work 
today (using Henri's tip) and the results were better, but with minor quirks I 
will be watching more closely tomorrow.

The second one should be OK too, but then we fall back to my rants above. :-) 
It would not be my preference to use this mode, but...

The third one is out of the question, I think. By doing so we would mostly 
ignore CSS, and the results would be horrible.

This is a pretty tricky topic, but I think that it should, at least at this 
time, be resolved based on today's trends on HTML authoring. A nice idea would 
be (I think this was already suggested in some way): make one of them the 
default, and create an invisible preference (only editable going to the prefs 
file) to change it. That way, Mozilla can easily adapt to any decision made, 
and when people decide to change it then no problems would arise.
Remember that, by definition, "unknown" doctypes *excludes* all doctypes for the
most popular authoring tools (because, as described in this bug, we would have
to search out and identify what authoring tools use before we could fix this,
and then they become "known").

The aim here is, I think, that we should do something like the following:

1) Identify the doctypes that are widely used on the net, including *at least*
the ones used by popular authoring tools.
2) Decide what to do with these popular doctypes on a case-by-case basis, and
encode that knowledge (the majority of these will probably require quirks-mode
enabled, with a few exceptions such as XHTML, all STRICT variations and 4.0
Transitional with URL).
3) Treat doctypes that are *still* unknown as if they were strict.

The rationale here is that we need to render the web as it is now (for
compatibility) but we also need to render HTML5, XHTML 2, SOMEFUTUREML 7.6 etc
which we don't know about yet. The ones we *do* know about, we can decide what
to do with, but the ones we *don't* are ones that haven't been invented yet...
and those will *certainly* require strict handling.
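
The three steps above can be sketched roughly as a lookup table with a strict default. This is only an illustration of the proposal, not parser code: the names LayoutMode, KnownDoctypes and ChooseMode are hypothetical, and the table entries and the modes assigned to them are assumptions chosen for the example.

```cpp
#include <map>
#include <string>

enum class LayoutMode { Quirks, Strict };

// Steps 1 and 2: public IDs that have been identified and decided case by
// case. These entries are illustrative assumptions, not a real list.
const std::map<std::string, LayoutMode>& KnownDoctypes() {
  static const std::map<std::string, LayoutMode> table = {
      {"-//W3C//DTD HTML 3.2 Final//EN", LayoutMode::Quirks},
      {"-//W3C//DTD HTML 4.0 Transitional//EN", LayoutMode::Quirks},
      {"-//W3C//DTD HTML 4.01//EN", LayoutMode::Strict},
  };
  return table;
}

// publicId is null when the page has no DOCTYPE at all.
LayoutMode ChooseMode(const std::string* publicId) {
  if (publicId == nullptr) return LayoutMode::Quirks;  // missing DOCTYPE
  const auto& table = KnownDoctypes();
  const auto it = table.find(*publicId);
  // Step 3: anything still unknown is treated as strict.
  return it != table.end() ? it->second : LayoutMode::Strict;
}
```

With this shape, supporting a newly discovered authoring-tool doctype is a one-line addition to the table, while a doctype nobody has seen yet gets strict handling by default.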
I agree on the point that future (thus yet unknown) DTD's should be treated as 
strict mode. But I think that doing such a move *now* would break more things 
than fix them. At least until we get more standards compliance from everyone 
(browsers and authoring tools).

Bug 55916 has an example of such behaviour. If a new version of a popular tool 
is released and then includes a new DTD declaration, what will be the result? 
Maybe it will still output transitional syntax for better backward 
compatibility, but the DTD could be declared in a way that fools the doctype 
parser so it would think it was better to use strict parsing... and we would be 
breaking the rendering again.

Nominating for mozilla 0.9.  I think we need to fix this for mozilla 1.0 and we
need to get a bit of testing in before that happens.  I am willing to fix this.
Keywords: rtm → mozilla0.9
updated qa contact.
QA Contact: janc → bsharma
Because of bug 55916, I propose we WONTFIX this bug. If we want to encourage 
standards support, we should be encouraging XHTML, and we already do all of 
text/xml in standards mode.
Keywords: qawanted
Whiteboard: [rtm-] → WONTFIX?
That's a silly reason.  We should just add HotMeTaL's DOCTYPE to our list of
quirks doctypes.  This is crucial for future-compatibility, since AFAICT XHTML
won't be able to be sent as text/xml for a long time in the future since there
are still non-supporting browsers around today.
As I see it it's six of one and half a dozen of the other. We're going to get
as many people writing new text/html pages with DOCTYPEs we don't recognise 
and wanting strict layout as we are people publishing old pages with silly 
DOCTYPEs with typos or other weird things and expecting a compatible rendering.

XHTML2 is not going to be backwards compatible with XHTML1 or HTML4, and the
latest version of XHTML, 1.1, is already treated in strict mode:

   http://www.bath.ac.uk/%7Epy8ieh/cgi/compat-test.pl?DOCTYPE=%3C%21DOCTYPE+html+PUBLIC+%22-%2F%2FW3C%2F%2FDTD+XHTML+1.1%2F%2FEN%22++%22http%3A%2F%2Fwww.w3.org%2FTR%2Fxhtml11%2FDTD%2Fxhtml11.dtd%22%3E&MODE=full

So the only likely forward compatibility problem is already covered.
I've said this before, but I really think we should do some type of
doctype-crawl of the web to find out what's actually out there. Perhaps we could
ask the ODP/dmoz people, or google, or altavista, or somebody with a big
database of web pages whether they have any exhaustive lists of the DOCTYPEs
found on pages in their catalogs.

Then we can look at pretty much all existing doctypes on a case by case basis -
and hard-code this knowledge for these doctypes - making it much easier to say
"if nobody's EVER used it before, then it's almost guaranteed to want standard
rendering".
> We're going to get as many people writing new text/html pages with DOCTYPEs
> we don't recognise and wanting strict layout as we are people publishing old
> pages with silly DOCTYPEs with typos or other weird things and expecting a
> compatible rendering.

I disagree. If there's an error in your page as fundamental as a bad DOCTYPE, 
then all bets are (or should be) off as to how a browser will render it. I'm 
with David on this one -- forward compatibility is more important than backward 
compatibility.
Keywords: mozilla0.9 → mozilla0.9.2
> If there's an error in your page as fundamental as a bad DOCTYPE, 
> then all bets are (or should be) off as to how a browser will render it. 

I agree completely.
However, do note one important thing about the doctype if there is ever a
"fix" for this bug.

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 ...etc
is the exact same as
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 ...etc
since html markup is case insensitive.

Thus the casing on the word "HTML" (and only HTML) is not relevant and should
thus not yield an invalid doctype.

For "proof", compare with the corresponding XHTML (which is case sensitive):
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 ...
At least this is how I have interpreted the difference between upper and lower
case in HTML/XHTML declarations.
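
A small C++ sketch of the case rule described above, under the assumption that only the document type name (the word after "<!DOCTYPE") is compared this way. The helper names EqualsIgnoreCase and IsHtmlDoctypeName are hypothetical and are not taken from the parser.

```cpp
#include <cctype>
#include <string>

// Compare two ASCII strings while ignoring case.
bool EqualsIgnoreCase(const std::string& a, const std::string& b) {
  if (a.size() != b.size()) return false;
  for (std::string::size_type i = 0; i < a.size(); ++i) {
    if (std::tolower(static_cast<unsigned char>(a[i])) !=
        std::tolower(static_cast<unsigned char>(b[i]))) {
      return false;
    }
  }
  return true;
}

// For text/html, "HTML" and "html" name the same document type.
// For XML-based XHTML, names are case sensitive, so only "html" matches.
bool IsHtmlDoctypeName(const std::string& name, bool isXml) {
  return isXml ? name == "html" : EqualsIgnoreCase(name, "html");
}
```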
I'm taking this bug, P1/critical (for standards support)/0.9.5.  See my comments
on bug 60511.
Assignee: rickg → dbaron
Severity: normal → critical
Priority: P3 → P1
Target Milestone: --- → mozilla0.9.5
Status: NEW → ASSIGNED
I still need to go through the code that I removed a little more carefully,
since the code I added is an updated version of the old patch I had on bug
44340, and the code I am replacing may have changed more than I noticed since then.

I also need to do a good bit of testing...
Whiteboard: WONTFIX?
I filed bug 98218, which exists with or without my patch.  I'll attach a
much-improved patch shortly (although it still uses obsolete string code).
Oops, I just noticed the bad formatting in ParsePS, and fixed it in my tree.
What impact will this bug have on the thousands of websites out there that 
have no doctype specified? What are the consequences for rendering and for how 
websites appear to customers?
No impact -- pages with no DOCTYPE will still be rendered in quirks mode, just
as now. This bug is only about *unknown* DOCTYPEs, not missing ones.
David: I like your changes a lot. The only thing that I didn't like is inlining
DetermineHTMLParseMode(). What's the reason behind it?
The reason I made it |inline| was that it's only used once -- this may as well
all get compiled into one big function (it should be slightly more efficient
that way), but I'd rather not *think* about it as one big function.  Of course,
changing |inline| to |static| would probably be only a negligible slowdown
(assuming the compiler is even capable of inlining the function), and it doesn't
really matter to me.
In the past Metrius has used an FPI like this: "-//Metrius//DTD Metrius
Presentational//EN" on pages of their clients. I suggest including it on the
list of quirky doctypes. Otherwise, bug 22274 will occur on Motorola's site.

BTW, in DetermineHTMLParseMode() there's a call like this
aBuffer.InsertWithConversion(
        "<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.01 Transitional//EN\">\n",
        0);
What's the purpose of that one? Is it used when Editor creates a new doc?
I updated my proposal at
  http://www.people.fas.harvard.edu/~dbaron/mozilla/doctypes
to reflect one bit of current practice -- that we *are* using strict
mode for HTML 4.01 transitional and frameset doctypes when a system
identifier is present.  (I also fixed some escaping errors in some of
the links.)
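
As a rough sketch of the rule described in the paragraph above: the mode for the HTML 4.01 Transitional and Frameset public IDs depends on whether a system identifier is present. The names LayoutMode and ModeFor are hypothetical, and the quirks result for the case without a system identifier is an assumption of this sketch, as is the strict fallthrough for every other public ID.

```cpp
#include <string>

enum class LayoutMode { Quirks, Strict };

// One rule only: HTML 4.01 Transitional and Frameset select strict mode when
// a system identifier accompanies the public ID, and (assumed here) quirks
// mode when it does not. All other public IDs fall through to strict, which
// follows the proposal for unknown doctypes and ignores the quirks list.
LayoutMode ModeFor(const std::string& publicId, bool hasSystemId) {
  const bool loose401 =
      publicId == "-//W3C//DTD HTML 4.01 Transitional//EN" ||
      publicId == "-//W3C//DTD HTML 4.01 Frameset//EN";
  if (loose401) {
    return hasSystemId ? LayoutMode::Strict : LayoutMode::Quirks;
  }
  return LayoutMode::Strict;
}
```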

I also updated both the proposal and the code in my tree (a
one line change, removing the line noting that doctype -- not worth
posting a new patch) when I discovered that the old code treated
"-//IETF//DTD HTML i18n//EN" as a strict-mode public ID, and we haven't
had any problems caused by that.

Finally, I updated both the proposal and the code for the Metrius
doctype mentioned above (as eQuirks, not eQuirks3).

I verified that I was not changing the behavior on any of the tests
listed in that page (other than the one mentioned above that I changed)
where I was not expecting to change the behavior.  The only changes
were:

 * on the 2nd, 3rd, and 4th items in the strict mode list (system
   identifier only, neither system nor public identifier, and internal
   subset), which I think are safe changes.

 * on the public ID "-//SoftQuad Software//DTD HoTMetaL PRO
   6.0::19990601::extensions to HTML 4.0//EN" in the quirks list, which
   is bug 55916

So I think the patch is tested well enough that it's ready for checkin
early in a milestone.  I suspect we'll get a few reports of obscure
doctypes that my search-for-doctypes missed.  I plan to post to
n.p.m.layout and n.p.m.seamonkey, and also email some Netscape tech
evangelism folks so they know to be aware of the change.

So I think the patch is ready for review.  It's just the patch attached
above, with the one formatting change in ParsePS, and the one doctype
declaration mentioned above removed from the list of quirky doctypes,
and the Metrius one mentioned above added (as eQuirks).  I expect we'll
have to add a few more public IDs to the list in the coming weeks, but
that's why I want to check it in early in the milestone cycle.

(To respond to Henri Sivonen's comment:  I'm not sure what that
InsertWithConversion is there for, but it was there and I didn't want to
remove it for fear I'd break something.)
Comment on attachment 48191
much improved patch

Change inline to static. With that r=harishd
Attachment #48191 - Flags: review+
harishd: Why do you think it is better for that function to be static rather than
inline? I'm trying to learn the "tricks" of the trade, and can't quite understand
this particular request. Thanks in advance for any explanation! :-)
Not to speak for Harish, but generally 'static' is used to prevent a method from
being exported from a module. It is useful, often necessary, if you want to make
sure that you don't end up with several different global functions with the same
name clashing at link time. 'inline' will likewise prevent the method from being
exported, so I think inline is fine here.
inline functions, especially large functions, can introduce code bloat, which in
turn can hurt performance. I would therefore prefer inlining smaller
functions. On the other hand, the inline keyword does not force the compiler to
inline a function (I think). It leaves the decision to the compiler. I,
personally, don't like guessing compilers' actions :-)
But dbaron pointed out that the function is only called once, which means
inlining it is zero bloat (how many times can you duplicate something to produce
a single call to it?)

The only potential disadvantage is if somebody else doesn't realize the overhead
and makes another call to the same function. Perhaps that could be avoided by
arranging somehow to make sure the code isn't visible anywhere else, or by
having a sufficiently big comment everywhere it's used saying CHANGE INLINE TO
STATIC ON THE DECLARATION OF THIS IF YOU DUPLICATE THIS CODE OR USE IT ANYWHERE
ELSE.

Hopefully a good compiler would inline any function that's only called once and
isn't visible elsewhere anyway, right?
Comment on attachment 48191
much improved patch

A question: Does ParsePS do the right thing if we don't see the end of a comment in the buffer passed in? It doesn't return kNotFound and my impression is that we wind up not skipping the comment.

If you are doing the right thing, sr=vidur.
Attachment #48191 - Flags: superreview+
Yes, what it does is not skip the comment, which will then lead to a parsing
error when we look for whatever should have followed it and find "--" instead. 
The idea of ParsePS is to consume as much as possible while matching the PS
production, but not to consume something that would be a partial match.  Then it
will fall back to quirks mode since it can't parse the whole DOCTYPE declaration
out of the initial buffer with which it is called -- that's the disadvantage of
doing the mode detection in an initial first pass rather than while consuming
the page (which I think would be better, and it wouldn't be too late).  Doing
the DOCTYPE parsing in the main pass requires more changes to the parser than I
want to make now.
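
The behaviour described above (consume whitespace and complete comments, but leave a partial comment alone) can be sketched as follows. This is not the ParsePS code from the patch: the name SkipPS, the std::string interface, and the treatment of PS as "whitespace or a -- ... -- comment" are assumptions made for illustration.

```cpp
#include <cctype>
#include <string>

// Skip a run of "PS" (whitespace and complete "-- ... --" comments) starting
// at pos, which must not exceed buf.size(). Returns the index of the first
// character that is not part of a complete match. An unterminated comment is
// left unconsumed, so the caller sees "--" where the next token should be.
std::string::size_type SkipPS(const std::string& buf,
                              std::string::size_type pos) {
  for (;;) {
    while (pos < buf.size() &&
           std::isspace(static_cast<unsigned char>(buf[pos]))) {
      ++pos;
    }
    if (buf.compare(pos, 2, "--") != 0) return pos;  // no comment starts here
    const std::string::size_type close = buf.find("--", pos + 2);
    if (close == std::string::npos) return pos;  // partial comment: stop
    pos = close + 2;  // step past the closing "--" and look for more PS
  }
}
```

A caller that next expects a keyword or a quoted identifier then fails on the leftover "--", which corresponds to the fallback to quirks mode described above.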
Fix checked in 2001-09-08 11:37 PDT.
Status: ASSIGNED → RESOLVED
Closed: 24 years ago → 23 years ago
Resolution: --- → FIXED
QA Contact: bsharma → moied
*** Bug 91038 has been marked as a duplicate of this bug. ***
verified
Status: RESOLVED → VERIFIED