Closed Bug 55264 Opened 24 years ago Closed 23 years ago

[DOCTYPE] Documents with unknown DOCTYPE should be displayed in strict mode

Categories

(Core :: DOM: HTML Parser, defect, P1)

Tracking

()

VERIFIED FIXED
mozilla0.9.5

People

(Reporter: ekrock, Assigned: dbaron)

Attachments

(4 files)

Proposal for DOCTYPE handling hived off from bug 42525; see that bug for discussion and background. (I currently have no position on this proposal; I'm opening this bug report so discussion can take place within this report.)
*** This bug has been marked as a duplicate of 55263 ***
Status: NEW → RESOLVED
Closed: 24 years ago
Resolution: --- → DUPLICATE
Oops... I should have read more closely. Sorry. I *thought* I read the titles and they were the same.
Status: RESOLVED → REOPENED
Resolution: DUPLICATE → ---
Reassigned to Rickg/Parser. Nominating for rtm, like bug 42525.
Assignee: clayton → rickg
Status: REOPENED → NEW
Component: Layout → Parser
Keywords: rtm
OS: Windows NT → All
Hardware: PC → All
Are there any existing, currently "unknown" doctypes? Are there any real-world documents that use such an existing, currently "unknown" doctype? Or is this a pure forward compatibility issue?
This is a forward compatibility issue in the hypothetical cases that W3C publishes HTML 4.02 with some minor corrections or that ISO comes up with an as yet unknown doctype. As for existing "unknown" doctypes, one example is the DTD Metrius Presentational doctype declaration used at Motorola's site. It is debatable what an HTML browser should do with such a doctype.
How common is it for page authors or page editing software to accidentally introduce a typo in the DOCTYPE when editing markup? If it's common, we could create a real problem for product usability by applying strict layout to every page on the web with a corrupted DOCTYPE statement. Another concern: have we looked at the DOCTYPE produced by every known authoring tool? We already know that Netscape Composer output an invalid DOCTYPE and had to build in awareness of that; I'm concerned about the risk that we may overlook bad DOCTYPEs output by other tools. Finally, could someone use a spider to scan the web and generate a list of every unique DOCTYPE it found for HTML? *That* would be the way to make sure we weren't going to cause a lot of problems with this. Marking qawanted for this.
Keywords: qawanted
Marking [DOCTYPE] for easy searching.
Summary: Documents with unknown DOCTYPE should be displayed in strict mode → [DOCTYPE] Documents with unknown DOCTYPE should be displayed in strict mode
[rtm-]. The proposal may well have been meritorious, but unfortunately, no one inside or outside of Netscape was able to do sufficient analysis of the potential of this bug to cause regressions in time for us to commit this and implement and accept a patch in time for RTM. Marking Future. Some will argue that if we don't do this for RTM, we can't do it ever, but actually that's not necessarily true. Depending upon the speed of market uptake of Netscape 6, the length of time before the first point release, and further analysis of how many existing web pages actually use unknown DOCTYPEs and would be affected by a change either way, it may still make sense to consider this for the first point release.
Whiteboard: [rtm-]
QA Contact: petersen → janc
I just wrote a perl script that would traverse dmoz.org and identify the DOCTYPEs on each site. My computer isn't in a situation to run this over the whole of dmoz, but I'll attach the script so that if someone wants to run it over a larger subset of the web, they can do so (it should run on any system that can support perl and wget). The results for the first 100 sites (alphabetically by category on dmoz) are as follows:

 7 <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
 4 <!doctype html public "-//w3c//dtd html 4.0 transitional//en">
 2 <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
 1 <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
 1 <!DOCTYPE HTML PUBLIC '-//W3C//DTD HTML 3.2//EN'>
 1 <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/1999/REC-html401-19991224/loose.dtd">
85 NO DOCTYPE

Since only 15 of these have doctypes at all, this is clearly not a large enough sample, but the script can keep going indefinitely...

Suggested tweaks to the script if anyone wants to enhance it:
- Some sort of parallelizability (e.g. run 100 threads, or at least open 100 http connections)
- The ability to save its status before terminating and then resume afterwards from the same place
- Pick the META GENERATOR tag out of the file if one exists, so that we can identify what popular editors are putting in the doctype
- Some kind of crawling from the sites themselves (perhaps to a depth of 2 or 3), rather than only going to the single page linked from dmoz. The current process doesn't even attempt to find the actual frames in a frameset, let alone linked pages.
- Although wget knows that the file is text/html, the script doesn't retrieve this information. If dmoz links to a plaintext file that has <!doctype in the middle somewhere, you'll get a potentially very strange result...

If I get a chance to address any of these issues, I'll post updated versions of the script in this bug. In the meantime, if anyone else wants to work on it (or just to run it for a larger number of sites), go for it. (attachment to follow...)
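For anyone who wants to try the same kind of tally without perl and wget, the counting step can be sketched in a few lines of Python. This is not the attached script: the names `first_doctype` and `tally_doctypes` are made up for illustration, fetching the pages is left out, and the regular expression assumes the declaration contains no ">" (so no internal subset). Like the perl script, it will happily pick up a <!doctype that appears in the middle of a plaintext file.

```python
import re
from collections import Counter

# Matches the first <!DOCTYPE ...> declaration, ignoring case.
# Simplification: assumes there is no '>' inside the declaration.
DOCTYPE_RE = re.compile(r"<!doctype\b[^>]*>", re.IGNORECASE)


def first_doctype(page_source):
    """Return the first DOCTYPE declaration verbatim, or 'NO DOCTYPE'."""
    match = DOCTYPE_RE.search(page_source)
    return match.group(0) if match else "NO DOCTYPE"


def tally_doctypes(pages):
    """Count how often each distinct declaration appears across pages."""
    return Counter(first_doctype(source) for source in pages)
```

Declarations are counted verbatim, so differences in case or quoting show up as separate entries, as they do in the results listed above.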
I'd just like to add something (I've already posted this as a bug which was marked "invalid", and as a comment on another thread). The subject of this bug is one of the things I feel is most lacking in Mozilla. There *must* be a way to enable standards-compliant rendering with a transitional doctype. Having a "quirks mode" must be of some use to many people (not to me, though), but there must be a way to get around it without having to go to strict mode.

Why? Take for example one widely used HTML authoring tool: Dreamweaver. Pages generated by it have (questionably) transitional content, and they look very good when shown in IE. But if you look at them with Mozilla, you get the same effect as NS4, which is not as good as IE's. One example: tables with colored borders (both outline and inside the table). Dreamweaver does this by setting a cell spacing higher than 0 and setting a background color on the table. IE shows them well (even without a DOCTYPE attached), because as far as I can tell it is not using a "quirks mode". But Mozilla in quirks mode goes back to the same behaviour NS4 had: showing the spacing between the cells as a white background instead of the table background, ruining the effect. Using a strict DTD makes Mozilla show them correctly, but it breaks almost the whole page, as Dreamweaver uses transitional syntax.

I'm not a huge fan of Dreamweaver myself (actually I hate it most of the time), and I'm not a web designer (I do web programming), but the designers here use it most of the time. And I can say that, because of this, results are much better with IE than with Mozilla.

If anyone is interested in seeing the example in action, follow:
http://www.geocities.com/mvanzin/mozilla/table-strict.htm
http://www.geocities.com/mvanzin/mozilla/table-loose.htm
Same HTML, different DTDs (both with the DTD URL), and different results in Mozilla.

I'm not sure whether my intentions in the second table are correct, but the first one (the case I described above) should render equally in both cases (by equally I mean equal to the strict mode one). Hope this adds some food for thought. (BTW, I'm adding myself to the CC list.)
This bug is not about transitional doctypes. This is about unknown doctypes. Let's keep the bug focused. (It is possible to activate the standards layout mode for transitional documents. See: http://www.hut.fi/u/hsivonen/doctype.html)
Oops, sorry (the original bug, 42525, was about transitional. I think I did not check the summary when going to this new thread). BTW, thanks for the link. (I think that 4.0 transitional should also fire standards mode, but...)
The issue about table backgrounds is, I think, covered by bug 4510.
Hmm, maybe I can add something to the discussion then. :-) I'll try to base my comments on my previous comment (using the fact that many people use WYSIWYG HTML tools today, and they are mainly focused on producing nice output in IE, while maintaining an acceptable result in NS4). I'd say that using strict mode for unknown doctypes would be a very bad idea in this case. If you look around, you will see very few pages using strict layout, and when they do they generally use the correct doctype declaration. The same cannot be said about the much more common case of transitional doctypes. Most pages use transitional syntax, and many of them do not contain a doctype declaration. That would leave us with three options: transitional with standards mode, transitional with quirks mode and old HTML 3 mode.

The first one is the best in my opinion. I changed the doctype in documents at work today (using Henri's tip) and the results were better, but with minor quirks I will be watching more closely tomorrow.

The second one should be OK too, but then we fall back to my rants above. :-) It would not be my preference to use this mode, but...

The third one is out of the question, I think. By doing so we would mostly ignore CSS, and the results would be horrible.

This is a pretty tricky topic, but I think that it should, at least at this time, be resolved based on today's trends in HTML authoring. A nice idea would be (I think this was already suggested in some way): make one of them the default, and create an invisible preference (only editable by going to the prefs file) to change it. That way, Mozilla can easily adapt to any decision made, and if people later decide to change it, no problems would arise.
Remember that, by definition, "unknown" doctypes *excludes* all doctypes for the most popular authoring tools (because, as described in this bug, we would have to search out and identify what authoring tools use before we could fix this, and then they become "known"). The aim here is, I think, that we should do something like the following:

1) Identify the doctypes that are widely used on the net, including *at least* the ones used by popular authoring tools.
2) Decide what to do with these popular doctypes on a case-by-case basis, and encode that knowledge (the majority of these will probably require quirks-mode enabled, with a few exceptions such as XHTML, all STRICT variations and 4.0 Transitional with URL).
3) Treat doctypes that are *still* unknown as if they were strict.

The rationale here is that we need to render the web as it is now (for compatibility) but we also need to render HTML5, XHTML 2, SOMEFUTUREML 7.6 etc which we don't know about yet. The ones we *do* know about, we can decide what to do with, but the ones we *don't* are ones that haven't been invented yet... and those will *certainly* require strict handling.
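The three steps can be sketched as a lookup table with a strict default. The Python below is only an illustration of the idea, not Mozilla's code: the function name `layout_mode` and the handful of table entries are assumptions picked for the example, and a real list would come out of the doctype survey discussed in this bug.

```python
QUIRKS = "quirks"
STRICT = "strict"

# Steps 1 and 2: public identifiers known to be in use, each with a
# case-by-case decision. These entries are illustrative, not a real list.
KNOWN_PUBLIC_IDS = {
    "-//W3C//DTD HTML 3.2 Final//EN": QUIRKS,
    "-//W3C//DTD HTML 4.01//EN": STRICT,
    "-//W3C//DTD XHTML 1.1//EN": STRICT,
}


def layout_mode(public_id):
    """Pick a layout mode from a doctype's public identifier.

    public_id is None when the page has no doctype at all; that case is
    outside this proposal and stays in quirks mode.
    """
    if public_id is None:
        return QUIRKS
    # Step 3: anything not in the table is treated as strict.
    return KNOWN_PUBLIC_IDS.get(public_id, STRICT)
```

With this shape, handling a newly discovered authoring-tool doctype means adding one table entry, while a doctype nobody has seen before falls through to strict.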
I agree on the point that future (thus yet unknown) DTDs should be treated as strict mode. But I think that making such a move *now* would break more things than it fixes, at least until we get more standards compliance from everyone (browsers and authoring tools). Bug 55916 has an example of such behaviour. If a new version of a popular tool is released and includes a new DTD declaration, what will be the result? Maybe it will still output transitional syntax for better backward compatibility, but the DTD could be declared in a way that fools the doctype parser into thinking it was better to use strict parsing... and we would be breaking the rendering again.
Nominating for mozilla 0.9. I think we need to fix this for mozilla 1.0 and we need to get a bit of testing in before that happens. I am willing to fix this.
Keywords: rtm → mozilla0.9
updated qa contact.
QA Contact: janc → bsharma
Because of bug 55916, I propose we WONTFIX this bug. If we want to encourage standards support, we should be encouraging XHTML, and we already do all of text/xml in standards mode.
Keywords: qawanted
Whiteboard: [rtm-] → WONTFIX?
That's a silly reason. We should just add HotMeTaL's DOCTYPE to our list of quirks doctypes. This is crucial for future-compatibility, since AFAICT XHTML won't be able to be sent as text/xml for a long time in the future since there are still non-supporting browsers around today.
As I see it, it's six of one and half a dozen of the other. We're going to get as many people writing new text/html pages with DOCTYPEs we don't recognise and wanting strict layout as we are people publishing old pages with silly DOCTYPEs with typos or other weird things and expecting a compatible rendering. XHTML2 is not going to be backwards compatible with XHTML1 or HTML4, and the latest version of XHTML, 1.1, is already treated in strict mode:
http://www.bath.ac.uk/%7Epy8ieh/cgi/compat-test.pl?DOCTYPE=%3C%21DOCTYPE+html+PUBLIC+%22-%2F%2FW3C%2F%2FDTD+XHTML+1.1%2F%2FEN%22++%22http%3A%2F%2Fwww.w3.org%2FTR%2Fxhtml11%2FDTD%2Fxhtml11.dtd%22%3E&MODE=full
So the only likely forward compatibility problem is already covered.
I've said this before, but I really think we should do some type of doctype-crawl of the web to find out what's actually out there. Perhaps we could ask the ODP/dmoz people, or google, or altavista, or somebody with a big database of web pages whether they have any exhaustive lists of the DOCTYPEs found on pages in their catalogs. Then we can look at pretty much all existing doctypes on a case by case basis - and hard-code this knowledge for these doctypes - making it much easier to say "if nobody's EVER used it before, then it's almost guaranteed to want standard rendering".
> We're going to get as many people writing new text/html pages with DOCTYPEs
> we don't recognise and wanting strict layout as we are people publishing old
> pages with silly DOCTYPEs with typos or other weird things and expecting a
> compatible rendering.

I disagree. If there's an error in your page as fundamental as a bad DOCTYPE, then all bets are (or should be) off as to how a browser will render it. I'm with David on this one -- forward compatibility is more important than backward compatibility.
Keywords: mozilla0.9 → mozilla0.9.2
> If there's an error in your page as fundamental as a bad DOCTYPE, then all
> bets are (or should be) off as to how a browser will render it.

I agree completely. However, do notice one important thing about the doctype if this is ever going to be a "fix" for this bug.
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 ...etc
is exactly the same as
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 ...etc
since HTML markup is case insensitive. Thus the casing of the word "HTML" (and only HTML) is not relevant and should not yield an invalid doctype. For "proof", compare with the corresponding XHTML (which is case sensitive):
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 ...
At least this is how I have interpreted the difference between upper & lower case in HTML/XHTML declarations.
I'm taking this bug, P1/critical (for standards support)/0.9.5. See my comments on bug 60511.
Assignee: rickg → dbaron
Severity: normal → critical
Priority: P3 → P1
Target Milestone: --- → mozilla0.9.5
Status: NEW → ASSIGNED
I still need to go through the code that I removed a little more carefully, since the code I added is an updated version of the old patch I had on bug 44340, and the code I am replacing may have changed more than I noticed since then. I also need to do a good bit of testing...
Whiteboard: WONTFIX?
I filed bug 98218, which exists with or without my patch. I'll attach a much-improved patch shortly (although it still uses obsolete string code).
Oops, I just noticed the bad formatting in ParsePS, and fixed it in my tree.
What impact will this bug have on the thousands of websites out there that have no doctype specified? What are the consequences for rendering and for how websites appear to customers?
No impact -- pages with no DOCTYPE will be rendered in quirks mode, just as now. This bug is only about *unknown* DOCTYPEs, not missing ones.
David: I like your changes a lot. The only thing that I didn't like is inlining DetermineHTMLParseMode(). What's the reason behind it?
The reason I made it |inline| was that it's only used once -- this may as well all get compiled into one big function (it should be slightly more efficient that way), but I'd rather not *think* about it as one big function. Of course, changing |inline| to |static| would probably be only a negligible slowdown (assuming the compiler is even capable of inlining the function), and it doesn't really matter to me.
In the past Metrius has used an FPI like this: "-//Metrius//DTD Metrius Presentational//EN" on pages of their clients. I suggest including it on the list of quirky doctypes. Otherwise, bug 22274 will occur on Motorola's site.

BTW, in DetermineHTMLParseMode() there's a call like this:
    aBuffer.InsertWithConversion( "<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.01 Transitional//EN\">\n", 0);
What's the purpose of that one? Is it used when Editor creates a new doc?
I updated my proposal at http://www.people.fas.harvard.edu/~dbaron/mozilla/doctypes to reflect one bit of current practice -- that we *are* using strict mode for HTML 4.01 transitional and frameset doctypes when a system identifier is present. (I also fixed some escaping errors in some of the links.) I also updated both the proposal and the code in my tree (a one line change, removing the line noting that doctype -- not worth posting a new patch) when I discovered that the old code treated "-//IETF//DTD HTML i18n//EN" as a strict-mode public ID, and we haven't had any problems caused by that. Finally, I updated both the proposal and the code for the Metrius doctype mentioned above (as eQuirks, not eQuirks3).

I verified that I was not changing the behavior on any of the tests listed in that page (other than the one mentioned above that I changed) where I was not expecting to change the behavior. The only changes were:
* on the 2nd, 3rd, and 4th items in the strict mode list (system identifier only, neither system nor public identifier, and internal subset), which I think are safe changes.
* on the public ID "-//SoftQuad Software//DTD HoTMetaL PRO 6.0::19990601::extensions to HTML 4.0//EN" in the quirks list, which is bug 55916.

So I think the patch is tested well enough that it's ready for checkin early in a milestone. I suspect we'll get a few reports of obscure doctypes that my search-for-doctypes missed. I plan to post to n.p.m.layout and n.p.m.seamonkey, and also email some Netscape tech evangelism folks so they know to be aware of the change.

So I think the patch is ready for review. It's just the patch attached above, with the one formatting change in ParsePS, the one doctype declaration mentioned above removed from the list of quirky doctypes, and the Metrius one mentioned above added (as eQuirks). I expect we'll have to add a few more public IDs to the list in the coming weeks, but that's why I want to check it in early in the milestone cycle.
(To respond to Henri Sivonen's comment: I'm not sure what that InsertWithConversion is there for, but it was there and I didn't want to remove it for fear I'd break something.)
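To make the cases in the previous comment easier to follow, here is a rough Python sketch of the decision once a doctype declaration has been parsed into its parts. It is not the patch: `mode_for_doctype` and its parameters are hypothetical names, the two sets hold only the public IDs mentioned in this bug, and both the order of the checks and the assumption that HTML 4.01 transitional/frameset *without* a system identifier stays in quirks mode are guesses drawn from the wording above. A page with no doctype at all is a separate case and stays in quirks mode.

```python
# Public IDs whose mode depends on whether a system identifier is present.
STRICT_ONLY_WITH_SYSTEM_ID = {
    "-//W3C//DTD HTML 4.01 Transitional//EN",
    "-//W3C//DTD HTML 4.01 Frameset//EN",
}

# Public IDs mentioned in this bug as belonging on the quirks list.
QUIRKS_PUBLIC_IDS = {
    "-//Metrius//DTD Metrius Presentational//EN",
    "-//SoftQuad Software//DTD HoTMetaL PRO 6.0::19990601"
    "::extensions to HTML 4.0//EN",
}


def mode_for_doctype(public_id=None, system_id=None, has_internal_subset=False):
    """Return 'strict' or 'quirks' for an already-parsed doctype declaration."""
    if has_internal_subset:
        return "strict"  # internal subset
    if public_id is None:
        return "strict"  # system identifier only, or neither identifier
    if public_id in STRICT_ONLY_WITH_SYSTEM_ID:
        # Assumed: quirks when the system identifier is missing.
        return "strict" if system_id is not None else "quirks"
    if public_id in QUIRKS_PUBLIC_IDS:
        return "quirks"
    return "strict"  # unknown public ID
```

For example, `mode_for_doctype("-//IETF//DTD HTML i18n//EN")` gives "strict" because that public ID is on neither list, which matches the behaviour described for it above.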
Comment on attachment 48191: much improved patch
Change inline to static. With that, r=harishd
Attachment #48191 - Flags: review+
harishd: Why do you think it is better for that function to be static rather than inline? I'm trying to learn the "tricks" of the trade, and can't quite understand this particular request. Thanks in advance for any explanation! :-)
Not to speak for Harish, but generally 'static' is used to prevent a method from being exported from a module. It is useful, often necessary, if you want to make sure that you don't end up with several different global functions with the same name clashing at link time. 'inline' will likewise prevent the method from being exported, so I think inline is fine here.
Inline functions, especially large ones, can introduce code bloat, which in turn can hurt performance. I would therefore prefer inlining smaller functions. On the other hand, the inline keyword does not force the compiler to inline a function (I think). It leaves that to the compiler's discretion. I, personally, prefer not to guess at compilers' actions :-)
But dbaron pointed out that the function is only called once, which means inlining it is zero bloat (how many times can you duplicate something to produce a single call to it?) The only potential disadvantage is if somebody else doesn't realize the overhead and makes another call to the same function. Perhaps that could be avoided by arranging somehow to make sure the code isn't visible anywhere else, or by having a sufficiently big comment everywhere it's used saying CHANGE INLINE TO STATIC ON THE DECLARATION OF THIS IF YOU DUPLICATE THIS CODE OR USE IT ANYWHERE ELSE. Hopefully a good compiler would inline any function that's only called once and isn't visible elsewhere anyway, right?
Comment on attachment 48191: much improved patch
A question: Does ParsePS do the right thing if we don't see the end of a comment in the buffer passed in? It doesn't return kNotFound and my impression is that we wind up not skipping the comment. If you are doing the right thing, sr=vidur.
Attachment #48191 - Flags: superreview+
Yes, what it does is not skip the comment, which will then lead to a parsing error when we look for whatever should have followed it and find "--" instead. The idea of ParsePS is to consume as much as possible while matching the PS production, but not to consume something that would be a partial match. Then it will fall back to quirks mode since it can't parse the whole DOCTYPE declaration out of the initial buffer with which it is called -- that's the disadvantage of doing the mode detection in an initial first pass rather than while consuming the page (which I think would be better, and it wouldn't be too late). Doing the DOCTYPE parsing in the main pass requires more changes to the parser than I want to make now.
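The "consume as much as possible, but never a partial match" behaviour can be illustrated with a small Python sketch. This is not the C++ ParsePS from the patch: `parse_ps` is a stand-in written for this example, and it assumes a PS is a run of whitespace and complete "--...--" comments.

```python
WHITESPACE = " \t\r\n"


def parse_ps(buffer, index):
    """Return the index just past the longest run of whitespace and
    complete '--...--' comments that starts at index."""
    while index < len(buffer):
        if buffer[index] in WHITESPACE:
            index += 1
        elif buffer.startswith("--", index):
            end = buffer.find("--", index + 2)
            if end == -1:
                break  # comment not closed in this buffer: leave it alone
            index = end + 2
        else:
            break
    return index
```

When the comment is not closed, the returned index still points at the opening "--", so whatever the caller tries to match next fails there, which is the parsing error (and the resulting fallback to quirks mode) described above.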
Fix checked in 2001-09-08 11:37 PDT.
Status: ASSIGNED → RESOLVED
Closed: 24 years ago → 23 years ago
Resolution: --- → FIXED
QA Contact: bsharma → moied
*** Bug 91038 has been marked as a duplicate of this bug. ***
verified
Status: RESOLVED → VERIFIED