Bugzilla

Comment 9

•

24 years ago

I just wrote a perl script that would traverse dmoz.org and identify the
DOCTYPEs on each site. My computer isn't in a situation to run this over the
whole of dmoz, but I'll attach the script so that if someone wants to run it
over a larger subset of the web, they can do so (It should run on any system
that can support perl and wget).

The results for the first 100 sites (alphabetically by category on dmoz) are as
follows:

 7  <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
 4  <!doctype html public "-//w3c//dtd html 4.0 transitional//en">
 2  <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
    "http://www.w3.org/TR/REC-html40/loose.dtd">
 1  <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
 1  <!DOCTYPE HTML PUBLIC '-//W3C//DTD HTML 3.2//EN'>
 1  <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
    "http://www.w3.org/TR/1999/REC-html401-19991224/loose.dtd">
85  NO DOCTYPE

Since only 15 of these have doctypes at all, this is clearly not a large enough
sample, but the script can keep going indefinitely...
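
For anyone who wants to reproduce a tally like the one above without the attached script, here is a minimal C++ sketch. It is not the Perl script from this bug; the function names (extractDoctype, tallyDoctypes) are made up for illustration, and it assumes the pages have already been downloaded into strings. It keeps the original letter case, so upper- and lower-case variants are counted separately, as in the list above.

```cpp
#include <algorithm>
#include <cctype>
#include <map>
#include <string>
#include <vector>

// Lowercase copy of s (ASCII only), used for case-insensitive searching.
static std::string toLowerAscii(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
        return static_cast<char>(std::tolower(c));
    });
    return s;
}

// Return the first "<!DOCTYPE ...>" declaration found in the page, or the
// string "NO DOCTYPE". Runs of whitespace (including line breaks) collapse
// to a single space so that wrapped declarations are tallied together.
// Simplification: the first '>' after the keyword ends the declaration.
std::string extractDoctype(const std::string& html) {
    const std::string lower = toLowerAscii(html);
    const std::size_t start = lower.find("<!doctype");
    if (start == std::string::npos) return "NO DOCTYPE";
    const std::size_t end = html.find('>', start);
    if (end == std::string::npos) return "NO DOCTYPE";

    std::string out;
    bool lastWasSpace = false;
    for (std::size_t i = start; i <= end; ++i) {
        const unsigned char c = static_cast<unsigned char>(html[i]);
        if (std::isspace(c)) {
            if (!lastWasSpace) out.push_back(' ');
            lastWasSpace = true;
        } else {
            out.push_back(static_cast<char>(c));
            lastWasSpace = false;
        }
    }
    return out;
}

// Count how often each declaration occurs across a set of pages.
std::map<std::string, int> tallyDoctypes(const std::vector<std::string>& pages) {
    std::map<std::string, int> counts;
    for (const std::string& page : pages) ++counts[extractDoctype(page)];
    return counts;
}
```

Like the script, this sketch does not check the content type, so a plaintext file containing "<!doctype" somewhere in the middle would still be counted.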

Suggested tweaks to the script if anyone wants to enhance it:

- Some form of parallelism (e.g. run 100 threads, or at least open 100 HTTP
connections)
- The ability to save its status before terminating and then resume afterwards
from the same place
- Pick the META GENERATOR tag out of the file if one exists, so that we can
identify what popular editors are putting in the doctype
- Some kind of crawling from the sites themselves (perhaps to a depth of 2 or
3), rather than only going to the single page linked from dmoz. The current
process doesn't even attempt to find the actual frames in a frameset, let alone
linked pages.
- Although wget knows that the file is text/html, the script doesn't retrieve
this information. If dmoz links to a plaintext file that has <!doctype in the
middle somewhere, you'll get a potentially very strange result...

If I get a chance to address any of these issues, I'll post updated versions of
the script in this bug. In the meantime, if anyone else wants to work on it (or
just to run it for a larger number of sites), go for it.

(attachment to follow...)

Comment 10

•

24 years ago

Attached file perl script to find doctypes of sites trawled from dmoz — Details

Henri Sivonen (:hsivonen)

Comment 11

•

24 years ago

I'd just like to add something (I've already posted this as a bug which was
marked "invalid", and as a comment on another thread).

The subject of this bug is one of the things I feel is most lacking in Mozilla.
There *must* be a way to enable standards-compliant rendering with a
transitional doctype. Having a "quirks mode" must be of some use to many people
(not for me, though), but there must be a way to go around it without having to
go to strict mode.

Why this? Take for example one widely used HTML authoring tool: Dreamweaver.
Pages generated by it have (questionably) transitional content, and they look
very good when shown in IE. But, if you look at them with Mozilla, you get the
same effect as NS4, which is not as good as IE's.

One example: tables with colored borders (both outline and inside the table).
Dreamweaver does this by setting a cell spacing higher than 0, and setting a
background color to the table. IE shows them great (even without a DOCTYPE
attached), because as far as I can tell it is not using a "quirks mode". But
looking in Mozilla with quirks mode it goes back to the same behaviour NS4 had:
showing the spacing between the cells as a white background instead of the table
background, ruining the effect.

Using a strict DTD makes Mozilla show them correctly, but it breaks almost the
whole page as Dreamweaver uses transitional syntax.

I'm not a huge fan of Dreamweaver myself (actually I hate it most of the time),
and I'm not a web designer (I do web programming, not design), but the
designers here use it most of the time. And I can say that, because of this,
results are much better with IE than with Mozilla.

If anyone is interested in seeing the example in action, follow:
http://www.geocities.com/mvanzin/mozilla/table-strict.htm
http://www.geocities.com/mvanzin/mozilla/table-loose.htm

Same HTML, different DTDs (both with the DTD URL), and different results in
Mozilla. I'm not sure whether my intentions in the second table are correct,
but the first one (case I described above) should render equally in both cases
(by equally I mean equal to the strict mode one).

Hope this adds some food for thought.
(BTW, I'm adding myself to the CC list.)

Comment 12

•

24 years ago

This bug is not about transitional doctypes. This is about unknown doctypes.
Let's keep the bug focused.

(It is possible to activate the standards layout mode for transitional
documents. See: http://www.hut.fi/u/hsivonen/doctype.html)

David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla)

Comment 13

•

24 years ago

Oops, sorry (the original bug, 42525, was about transitional. I think I did not
check the summary when going to this new thread).

BTW, thanks for the link. (I think that 4.0 transitional should also fire
standards mode, but...)

Assignee

Comment 14

•

24 years ago

The issue about table backgrounds is, I think, covered by bug 4510.

Comment 15

•

24 years ago

Hmm, maybe I can add something to the discussion then. :-)

I'll try to base my comments on my previous comment (using the fact that many
people use WYSIWYG HTML tools today, and they are mainly focused on producing
nice output in IE, while maintaining an acceptable result in NS4).

I'd say that using strict mode for unknown doctypes would be a very bad idea in 
this case. If you go around, you will see very few pages using strict layout, 
and when they do they generally use the correct doctype declaration. The same 
cannot be said about the much more common case of transitional doctypes. Most of 
the pages use transitional syntax, and many of them do not contain a doctype 
declaration.

That would leave us with three options: transitional with standards mode, 
transitional with quirks mode and old HTML 3 mode.

First one is the best in my opinion. I changed the doctype in documents at work 
today (using Henri's tip) and the results were better, but with minor quirks I 
will be watching more closely tomorrow.

Second one should be ok also, but then we fall back to my rants above. :-) It
would not be my preference to use this mode, but...

Third one is out of the question, I think. Doing so would mean mostly ignoring
CSS, and the results would be horrible.

This is a pretty tricky topic, but I think that it should, at least at this 
time, be resolved based on today's trends on HTML authoring. A nice idea would 
be (I think this was already suggested in some way): make one of them the
default, and create an invisible preference (only editable by going to the
prefs file) to change it. That way, Mozilla can easily adapt to any decision
made, and if people later decide to change it, no problems would arise.

Comment 16

•

24 years ago

Remember that, by definition, "unknown" doctypes *excludes* all doctypes for the
most popular authoring tools (because, as described in this bug, we would have
to search out and identify what authoring tools use before we could fix this,
and then they become "known").

The aim here is, I think, that we should do something like the following:

1) Identify the doctypes that are widely used on the net, including *at least*
the ones used by popular authoring tools.
2) Decide what to do with these popular doctypes on a case-by-case basis, and
encode that knowledge (the majority of these will probably require quirks-mode
enabled, with a few exceptions such as XHTML, all STRICT variations and 4.0
Transitional with URL).
3) Treat doctypes that are *still* unknown as if they were strict.

The rationale here is that we need to render the web as it is now (for
compatibility) but we also need to render HTML5, XHTML 2, SOMEFUTUREML 7.6 etc
which we don't know about yet. The ones we *do* know about, we can decide what
to do with, but the ones we *don't* are ones that haven't been invented yet...
and those will *certainly* require strict handling.
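
The three steps above can be sketched as a lookup with a strict default. This is only an illustration of the proposal, not Mozilla's code: the names (LayoutMode, knownDoctypes, chooseLayoutMode) are invented, the table entries are examples rather than a surveyed list, and it assumes that pages with no DOCTYPE at all stay in quirks mode.

```cpp
#include <map>
#include <string>

enum class LayoutMode { Quirks, Strict };

// Steps 1 and 2: public identifiers we have looked at, each with a decision.
// Illustrative entries only; a real list would come from surveying the web.
static const std::map<std::string, LayoutMode>& knownDoctypes() {
    static const std::map<std::string, LayoutMode> table = {
        {"-//W3C//DTD HTML 3.2 Final//EN", LayoutMode::Quirks},
        {"-//W3C//DTD HTML 4.0 Transitional//EN", LayoutMode::Quirks},
        {"-//W3C//DTD HTML 4.0//EN", LayoutMode::Strict},
        {"-//W3C//DTD XHTML 1.0 Strict//EN", LayoutMode::Strict},
    };
    return table;
}

// hasDoctype == false models a page with no DOCTYPE declaration at all.
LayoutMode chooseLayoutMode(bool hasDoctype, const std::string& publicId) {
    if (!hasDoctype) return LayoutMode::Quirks;  // assumption: missing => quirks
    const auto it = knownDoctypes().find(publicId);
    if (it != knownDoctypes().end()) return it->second;  // known: per-case
    return LayoutMode::Strict;  // step 3: still unknown => treat as strict
}
```

With this shape, a future public identifier that nobody has listed falls through to strict handling, while anything found in a survey gets an explicit entry.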

David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla)

Comment 17

•

24 years ago

I agree on the point that future (thus yet unknown) DTD's should be treated as 
strict mode. But I think that doing such a move *now* would break more things 
than fix them. At least until we get more standards compliance from everyone 
(browsers and authoring tools).

bug 55916 has an example of such a behaviour. If a new version of a popular tool 
is released and then includes a new DTD declaration, what will be the result? 
Maybe it will still output transitional syntax for better backward 
compatibility, but the DTD could be declared in a way that fools the doctype 
parser so it would think it was better to use strict parsing... and we would be 
breaking the rendering again.

Assignee

Comment 18

•

24 years ago

Nominating for mozilla 0.9.  I think we need to fix this for mozilla 1.0 and we
need to get a bit of testing in before that happens.  I am willing to fix this.

Keywords: rtm → mozilla0.9

David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla)

Assignee

Updated

•

24 years ago

Blocks: 60511

bsharma

Comment 19

•

24 years ago

updated qa contact.

QA Contact: janc → bsharma

David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla)

Comment 20

•

23 years ago

Because of bug 55916, I propose we WONTFIX this bug. If we want to encourage 
standards support, we should be encouraging XHTML, and we already do all of 
text/xml in standards mode.

Keywords: qawanted

Whiteboard: [rtm-] → WONTFIX?

Assignee

Comment 21

•

23 years ago

That's a silly reason.  We should just add HotMeTaL's DOCTYPE to our list of
quirks doctypes.  This is crucial for future-compatibility, since AFAICT XHTML
won't be able to be sent as text/xml for a long time in the future since there
are still non-supporting browsers around today.

Comment 22

•

23 years ago

As I see it it's six of one and half a dozen of the other. We're going to get
as many people writing new text/html pages with DOCTYPEs we don't recognise 
and wanting strict layout as we are people publishing old pages with silly 
DOCTYPEs with typos or other weird things and expecting a compatible rendering.

XHTML2 is not going to be backwards compatible with XHTML1 or HTML4, and the
latest version of XHTML, 1.1, is already treated in strict mode:

   http://www.bath.ac.uk/%7Epy8ieh/cgi/compat-test.pl?DOCTYPE=%3C%21DOCTYPE+html+PUBLIC+%22-%2F%2FW3C%2F%2FDTD+XHTML+1.1%2F%2FEN%22++%22http%3A%2F%2Fwww.w3.org%2FTR%2Fxhtml11%2FDTD%2Fxhtml11.dtd%22%3E&MODE=full

So the only likely forward compatibility problem is already covered.

Matthew T (active 1999-2002)

Comment 23

•

23 years ago

I've said this before, but I really think we should do some type of
doctype-crawl of the web to find out what's actually out there. Perhaps we could
ask the ODP/dmoz people, or google, or altavista, or somebody with a big
database of web pages whether they have any exhaustive lists of the DOCTYPEs
found on pages in their catalogs.

Then we can look at pretty much all existing doctypes on a case by case basis -
and hard-code this knowledge for these doctypes - making it much easier to say
"if nobody's EVER used it before, then it's almost guaranteed to want standard
rendering".

Comment 24

•

23 years ago

> We're going to get as many people writing new text/html pages with DOCTYPEs
> we don't recognise and wanting strict layout as we are people publishing old
> pages with silly DOCTYPEs with typos or other weird things and expecting a
> compatible rendering.

I disagree. If there's an error in your page as fundamental as a bad DOCTYPE, 
then all bets are (or should be) off as to how a browser will render it. I'm 
with David on this one -- forward compatibility is more important than backward 
compatibility.

Keywords: mozilla0.9 → mozilla0.9.2

Stefan Huszics

Comment 25

•

23 years ago

> If there's an error in your page as fundamental as a bad DOCTYPE,
> then all bets are (or should be) off as to how a browser will render it.

I agree completely.
However, do notice one important thing about the doctype if there is ever to be
a "fix" for this bug.

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 ...etc
is the exact same as
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 ...etc
since html markup is case insensitive.

Thus the casing on the word "HTML" (and only HTML) is not relevant and should
thus not yield an invalid doctype.

For "proof", compare with the corresponding XHTML (which is case sensitive):
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 ...
At least this is how I have interpreted the difference between upper & lower
case in HTML/XHTML declarations.
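
To illustrate the point about the casing of "HTML", here is a small C++ sketch that pulls out the name following the DOCTYPE keyword and compares it without regard to case. The helper names (doctypeName, isHtmlName) are hypothetical, and this only shows the comparison itself; it is not taken from the parser.

```cpp
#include <cctype>
#include <string>

// Extract the name that follows "<!DOCTYPE" (e.g. "HTML" or "html").
// Returns an empty string if the text does not start with a DOCTYPE keyword.
std::string doctypeName(const std::string& decl) {
    const std::string keyword = "<!doctype";
    if (decl.size() < keyword.size()) return "";
    for (std::size_t i = 0; i < keyword.size(); ++i) {
        if (std::tolower(static_cast<unsigned char>(decl[i])) != keyword[i]) {
            return "";
        }
    }
    // Skip whitespace between the keyword and the name.
    std::size_t pos = keyword.size();
    while (pos < decl.size() &&
           std::isspace(static_cast<unsigned char>(decl[pos]))) {
        ++pos;
    }
    // The name runs until the next whitespace or the closing '>'.
    std::size_t end = pos;
    while (end < decl.size() && decl[end] != '>' &&
           !std::isspace(static_cast<unsigned char>(decl[end]))) {
        ++end;
    }
    return decl.substr(pos, end - pos);
}

// True if the name is "html" in any mix of upper and lower case.
bool isHtmlName(const std::string& name) {
    const std::string expected = "html";
    if (name.size() != expected.size()) return false;
    for (std::size_t i = 0; i < expected.size(); ++i) {
        if (std::tolower(static_cast<unsigned char>(name[i])) != expected[i]) {
            return false;
        }
    }
    return true;
}
```

Under this comparison, "HTML" and "html" are accepted alike, so the two HTML 4.01 declarations above would be handled identically.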

David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla)

Assignee

Comment 26

•

23 years ago

I'm taking this bug, P1/critical (for standards support)/0.9.5.  See my comments
on bug 60511.

Assignee: rickg → dbaron

Severity: normal → critical

Priority: P3 → P1

Target Milestone: --- → mozilla0.9.5

David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla)

Assignee

Updated

•

23 years ago

Status: NEW → ASSIGNED

David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla)

Assignee

Comment 27

•

23 years ago

Attached patch preliminary patch — Details — Splinter Review

David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla)

Assignee

Comment 28

•

23 years ago

I still need to go through the code that I removed a little more carefully,
since the code I added is an updated version of the old patch I had on bug
44340, and the code I am replacing may have changed more than I noticed since then.

I also need to do a good bit of testing...

Whiteboard: WONTFIX?

David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla)

Assignee

Comment 29

•

23 years ago

I filed bug 98218, which exists with or without my patch.  I'll attach a
much-improved patch shortly (although it still uses obsolete string code).

David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla)

Assignee

Updated

•

23 years ago

Blocks: 44340, 55916, 61901

David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla)

Assignee

Comment 30

•

23 years ago

Attached patch much improved patch — Details — Splinter Review

David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla)

Assignee

Comment 31

•

23 years ago

Attached file the new code within the above patch (easier to read than the diff) — Details

David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla)

Assignee

Comment 32

•

23 years ago

Oops, I just noticed the bad formatting in ParsePS, and fixed it in my tree.

Markus Hübner

Comment 33

•

23 years ago

What impact will this bug have on the thousands of websites out there that have
no doctype specified? What are the consequences for rendering, and for how
those websites appear to customers?

David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla)

Comment 34

•

23 years ago

No impact -- pages with no DOCTYPE will be rendered in quirks mode, just as
now. This bug is only about *unknown* DOCTYPEs, not missing DOCTYPEs.

Assignee

Comment 35

•

23 years ago

What Ian said.

harishd

Comment 36

•

23 years ago

David: I like your changes a lot. The only thing that I didn't like is inlining
DetermineHTMLParseMode(). What's the reason behind it?

David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla)

Assignee

Comment 37

•

23 years ago

The reason I made it |inline| was that it's only used once -- this may as well
all get compiled into one big function (it should be slightly more efficient
that way), but I'd rather not *think* about it as one big function.  Of course,
changing |inline| to |static| would probably be only a negligible slowdown
(assuming the compiler is even capable of inlining the function), and it doesn't
really matter to me.

Henri Sivonen (:hsivonen)

Comment 38

•

23 years ago

In the past Metrius has used an FPI like this: "-//Metrius//DTD Metrius
Presentational//EN" on pages of their clients. I suggest including it on the
list of quirky doctypes. Otherwise, bug 22274 will occur on Motorola's site.

BTW, in DetermineHTMLParseMode() there's a call like this
aBuffer.InsertWithConversion(
        "<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.01 Transitional//EN\">\n",
        0);
What's the purpose of that one? Is it used when Editor creates a new doc?

David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla)

Assignee

Comment 39

•

23 years ago

I updated my proposal at
  http://www.people.fas.harvard.edu/~dbaron/mozilla/doctypes
to reflect one bit of current practice -- that we *are* using strict
mode for HTML 4.01 transitional and frameset doctypes when a system
identifier is present.  (I also fixed some escaping errors in some of
the links.)

I also updated both the proposal and the code in my tree (a
one line change, removing the line noting that doctype -- not worth
posting a new patch) when I discovered that the old code treated
"-//IETF//DTD HTML i18n//EN" as a strict-mode public ID, and we haven't
had any problems caused by that.

Finally, I updated both the proposal and the code for the Metrius
doctype mentioned above (as eQuirks, not eQuirks3).

I verified that I was not changing the behavior on any of the tests
listed in that page (other than the one mentioned above that I changed)
where I was not expecting to change the behavior.  The only changes
were:

 * on the 2nd, 3rd, and 4th items in the strict mode list (system
   identifier only, neither system nor public identifier, and internal
   subset), which I think are safe changes.

 * on the public ID "-//SoftQuad Software//DTD HoTMetaL PRO
   6.0::19990601::extensions to HTML 4.0//EN" in the quirks list, which
   is bug 55916

So I think the patch is tested well enough that it's ready for checkin
early in a milestone.  I suspect we'll get a few reports of obscure
doctypes that my search-for-doctypes missed.  I plan to post to
n.p.m.layout and n.p.m.seamonkey, and also email some Netscape tech
evangelism folks so they know to be aware of the change.

So I think the patch is ready for review.  It's just the patch attached
above, with the one formatting change in ParsePS, and the one doctype
declaration mentioned above removed from the list of quirky doctypes,
and the Metrius one mentioned above added (as eQuirks).  I expect we'll
have to add a few more public IDs to the list in the coming weeks, but
that's why I want to check it in early in the milestone cycle.

(To respond to Henri Sivonen's comment:  I'm not sure what that
InsertWithConversion is there for, but it was there and I didn't want to
remove it for fear I'd break something.)
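
For readers who have not opened the patch, the decisions described in this comment can be sketched roughly as follows. This is not the code from the attachment: the type and function names (ParseMode, DoctypeInfo, classifyDoctype) are invented, the eQuirks/eQuirks3 distinction is left out, and the quirks result for HTML 4.01 Transitional or Frameset without a system identifier is an assumption inferred from the wording above.

```cpp
#include <string>

enum class ParseMode { Quirks, Strict };

// Hypothetical summary of a parsed DOCTYPE; empty strings mean "absent".
struct DoctypeInfo {
    bool present;            // false when the page has no DOCTYPE at all
    std::string publicId;    // e.g. "-//W3C//DTD HTML 4.01 Transitional//EN"
    std::string systemId;    // e.g. a DTD URL
    bool hasInternalSubset;  // true for <!DOCTYPE ... [ ... ]>
};

ParseMode classifyDoctype(const DoctypeInfo& d) {
    if (!d.present) return ParseMode::Quirks;          // no DOCTYPE: as before
    if (d.hasInternalSubset) return ParseMode::Strict;
    if (d.publicId.empty()) return ParseMode::Strict;  // system-only or neither

    // HTML 4.01 Transitional and Frameset: strict when a system identifier
    // is present (assumed quirks otherwise).
    if (d.publicId == "-//W3C//DTD HTML 4.01 Transitional//EN" ||
        d.publicId == "-//W3C//DTD HTML 4.01 Frameset//EN") {
        return d.systemId.empty() ? ParseMode::Quirks : ParseMode::Strict;
    }

    // Two of the public IDs named in this bug as belonging on the quirks list.
    if (d.publicId == "-//SoftQuad Software//DTD HoTMetaL PRO "
                      "6.0::19990601::extensions to HTML 4.0//EN" ||
        d.publicId == "-//Metrius//DTD Metrius Presentational//EN") {
        return ParseMode::Quirks;
    }

    return ParseMode::Strict;  // anything not on the list is treated as strict
}
```

A real list of quirky public IDs would be much longer; the two shown are only the ones discussed in this bug.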

harishd

Comment 40

•

23 years ago

Comment on attachment 48191 [details] [diff] [review]
much improved patch

Change inline to static. With that r=harishd

Attachment #48191 - Flags: review+

Comment 41

•

23 years ago

harishd: Why do you think it is better for that function to be static rather than
inline? I'm trying to learn the "tricks" of the trade, and can't quite understand
this particular request. Thanks in advance for any explanation! :-)

Marc Attinasi

Comment 42

•

23 years ago

Not to speak for Harish, but generally 'static' is used to prevent a method from
being exported from a module. It is useful, often necessary, if you want to make
sure that you don't end up with several different global functions with the same
name clashing at link time. 'inline' will likewise prevent the method from being
exported, so I think inline is fine here.

harishd

Comment 43

•

23 years ago

Inline functions, especially large functions, can introduce code bloat, which
in turn can hurt performance. I would therefore prefer inlining only smaller
functions. On the other hand, the inline keyword does not force the compiler to
inline a function (I think). It leaves the discretion to the compiler. I,
personally, don't prefer guessing compilers' actions :-)
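
For anyone following the static versus inline question, here is a tiny C++ sketch of the two spellings side by side. The helper names are hypothetical and have nothing to do with DetermineHTMLParseMode() itself; the comments state only what the language guarantees.

```cpp
#include <string>

// 'static' at namespace scope gives the function internal linkage: the name
// is private to this translation unit, so it cannot clash at link time with
// a same-named helper in another file.
static bool startsWithDoctypeStatic(const std::string& s) {
    return s.compare(0, 9, "<!DOCTYPE") == 0;
}

// 'inline' allows identical definitions to appear in several translation
// units. It is only a request for call-site expansion; the compiler is free
// to emit an ordinary call instead.
inline bool startsWithDoctypeInline(const std::string& s) {
    return s.compare(0, 9, "<!DOCTYPE") == 0;
}
```

Both versions behave the same when called; the difference is in linkage and in what the compiler is asked to do, which is why the choice here is mostly a matter of taste for a function used once.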