Closed Bug 154304 Opened 22 years ago Closed 14 years ago

nested <dl>'s inconsistently parse, 1 byte difference (scanner is confused)

Categories: Core :: DOM: HTML Parser, defect (x86, Linux; priority not set; severity normal)
Status: RESOLVED WORKSFORME
People: Reporter: ftobin+bugzilla, Unassigned
Attachments: 9 obsolete files

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.0.0) Gecko/20020606
BuildID:    2002060614

This is quite difficult to explain (Mozilla is acting *very* inconsistently, and
I'm trying to be exacting), so please bear with me.

Note: I'm going to attach the pages which have problems, instead of giving a
URL, because Mozilla is acting differently depending on extensions, and the
problems are seen most vividly if loaded from the local HD.

Each file I'm going to attach has a list of two-level nested <dl>'s.  You'll
notice that the files are exactly the same, except for the absence of exactly 3
'class' attributes on some elements in the second page.  Note that there is no
CSS at all in these pages, either inline or linked.  If the files are named,
say, 'out2' and 'out3' (no extension), then Mozilla renders these files
differently.  However, if 'out2' is renamed to 'out2.html', it is suddenly
rendered the same as 'out3'.

Reproducible: Always
Steps to Reproduce:
1.  Save the two attached files to the local disk, and name them 'out2', and 'out3'.

2.  Diff 'out2' and 'out3' to verify the differences are minimal:
$ diff out2 out3   
11,13c11,13
< <dt class="date">Date</dt>
< <dd class="date">2002-06-17</dd>
< <dt class="picked_up">Picked Up</dt>
---
> <dt>Date</dt>
> <dd>2002-06-17</dd>
> <dt>Picked Up</dt>

3.  View each file in Mozilla.  Note that the <dl> inside the first item (the
one with Date: 2002-06-17) is rendered differently in each case.

4.  Rename out2 to out2.html, and note that it is now rendered the same as out3.

Actual Results:  The files render differently, even though the differences in
the files are strictly non-presentational.

Expected Results:  The files should render the same, specifically as 'out3' is
rendered.

Each file is XHTML 1.1 conformant.

It seems that there is a variety of small, similar, should-be-non-affecting
changes that I can make to out2 to get it to render the same as out3, so
isolating this change is extremely difficult for me.  I have merely picked one
small change that should obviously not affect the rendering.
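The claim that 'out2' and 'out3' differ only by three 'class' attributes can also be checked mechanically.  What follows is a minimal Python sketch, assuming both files are saved locally under the names used above; the helper names and the regular expression are illustrative only and are not part of any tool mentioned in this report.

```python
import re

# Matches one class="..." attribute together with its leading whitespace.
CLASS_ATTR = re.compile(r'\s+class="[^"]*"')


def strip_class_attrs(markup: str) -> str:
    """Return the markup with every class="..." attribute removed."""
    return CLASS_ATTR.sub("", markup)


def differ_only_by_class(a: str, b: str) -> bool:
    """True if both documents are identical once class attributes are dropped."""
    return strip_class_attrs(a) == strip_class_attrs(b)


def compare_files(path_a: str, path_b: str) -> bool:
    """Read two files from disk and compare them, ignoring class attributes."""
    with open(path_a, encoding="utf-8") as fa, open(path_b, encoding="utf-8") as fb:
        return differ_only_by_class(fa.read(), fb.read())
```

Calling compare_files("out2", "out3") should return True if the diff shown in step 2 is the complete set of differences.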
Attached file out2 (with the class attributes) (obsolete)
Attached file out3 (without class attributes) (obsolete)
I'm going to have to replace the out2 file I've uploaded, and give a different
content-type for it, because when it was uploaded Bugzilla modified the data
inside the file (e.g., removed the XML declaration), and said modification
magically 'corrects' the problem.  I'll use octet-stream so that Bugzilla
doesn't dare touch it.
Attachment #89207 - Attachment is obsolete: true
Attached image out2 screenshot (obsolete)
bz: When it doesn't have an extension, is it going to come into the parser with
a text/xml MIME type (from the unknown decoder) or with a blank mime type?
Attached image out3 screenshot (obsolete)
Note: this screenshot is identical to 'out2' being renamed to 'out2.html'.
Comment on attachment 89209 [details]
out2 (with the class statements, renders incorrectly)

Changing this one to text/xml.
Attachment #89209 - Attachment mime type: application/octet-stream → text/xml
Mmm, never mind, it's not really XHTML.
It doesn't have an xmlns attribute on the root element, so it can't usefully be
served as text/xml.  (Validating XHTML with an SGML parser is a bit of a joke, too.)
In any case, I do see the bug when I download it to a local disk, and my initial
guess would be that it's a Parser bug, both since the parser is the main thing
that does evil things with MIME types and because the problems seem like the
result of misparsing (perhaps even due to packet sizes, though).
Assignee: attinasi → harishd
Component: Layout → Parser
QA Contact: petersen → moied
Status: UNCONFIRMED → NEW
Ever confirmed: true
db: Out of curiosity, what validator did you use to check it?  Something online
or something in your head? :)
Component: Parser → Layout
The unknown decoder would flag that one (luckily) as text/html, because it does
not flag anything as text/xml.  We should correct that...

And yes, this is most likely a parser bug, and most likely a dup of our other
nested <dl> bugs.
Component: Layout → Parser
Whiteboard: DUPEME
Oh, I'm not disagreeing about the validity anymore, I was just wondering how you
picked up on it, since the validator most people probably use, w3.org's, didn't.

As a side note, adding the xmlns attribute to make it XHTML 1.1 doesn't fix the
problem.
It's a matter of working with this stuff and trying to implement it... leads to
a certain memorization of the spec.  ;)
I've been able to isolate a trigger of the bug.

The attachment I'm making now, 'out4', renders incorrectly (the same as
'out2').  The diff from 'out3' (which renders correctly) to 'out4' is:

11c11
< <dt>Date</dt>
---
> <dt>Date 1234</dt>

As you can see, the only addition is 5 characters.  Note that 'out4' is 2038
bytes.  I've repeatedly found that removing 1 byte from anywhere in the page,
bringing the size down to 2037 bytes, causes the bug to *not* be triggered.

For example, the 'out5' I will be attaching next is 2037 bytes.  It renders
correctly (the same as 'out3').  The diff from 'out4' (correcty) to 'out5'
(incorrect) is:

11c11
< <dt>Date 1234</dt>
---
> <dt>Date 123</dt>

The bug is most definitely triggered by moving from 2037 to 2038 bytes,
anywhere in the page.
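To probe a size threshold like this one systematically, test files of exact byte lengths can be generated from a single base document.  The following is a minimal Python sketch under that assumption; the function names, the filler byte, and the choice of marker are all hypothetical, and each resulting file would still have to be loaded in the browser by hand.

```python
def pad_to_size(data: bytes, target_size: int, marker: bytes, filler: bytes = b"x") -> bytes:
    """Return a copy of data grown to exactly target_size bytes.

    Filler bytes are inserted immediately after the first occurrence of marker,
    so the padding lands inside a text node chosen by the caller.
    """
    if len(filler) != 1:
        raise ValueError("filler must be exactly one byte")
    missing = target_size - len(data)
    if missing < 0:
        raise ValueError("document is already larger than target_size")
    pos = data.find(marker)
    if pos < 0:
        raise ValueError("marker not found in document")
    cut = pos + len(marker)
    return data[:cut] + filler * missing + data[cut:]


def size_variants(data: bytes, sizes, marker: bytes) -> dict:
    """Map each requested size to a padded copy of data of that size."""
    return {size: pad_to_size(data, size, marker) for size in sizes}
```

For example, starting from a copy of 'out3' and using b"Date" as the marker, size_variants(base, range(2030, 2045), b"Date") would yield one document per size around the 2037/2038 boundary reported above.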
Attached file out5 (2037 bytes, correctly rendered) (obsolete)
This file is referenced in the comment attached to the 'out4' attachment.
Ack, I need to correct what I said earlier in comment #17:

I said:

> The diff from 'out4' (correcty) to 'out5' (incorrect) is:

I reversed the situation; it should be:

  The diff from 'out4' (incorrect) to 'out5' (correct) is:
Summary: nested <dl>'s inconsistently indent, depending on things it shouldn't → nested <dl>'s inconsistently indent, depending 1 byte difference
Attached image out4 dom tree screenshot (obsolete)
The issue definitely seems to come down to a parsing issue.

This is a screenshot of inspecting the DOM tree of 'out4'.  The DL highlighted
in this screenshot and the upcoming screenshot for 'out5' represents the DL
that is not being indented properly.  As you'll note, the highlighted DL for
'out4' is incorrectly located in the document.  The highlighted DL in the
'out5' DOM screenshot shows it being in the DOM correctly.
Attached image out5 dom tree screenshot (obsolete)
This screenshot of the 'out5' DOM tree is mentioned in comment #20.
What are the _exact_ steps to reproduce here?  When I save "out4" as "test.html"
and open in Mozilla, it renders fine...
First of all, I'm changing the summary to reflect the fact that this is a
parsing, not rendering issue.  The rendering issue is due to an incorrect DOM.

The sample files out4 and out5 now BOTH render correctly in Mozilla 1.1.  I'm
not sure when they suddenly started working.

However, I have NEW example files which, again, differ by one byte, yet produce
different DOMs in 1.1.  I shan't bother with the screenshots or dom tree
attachments anymore; they were probably just confusing.
Summary: nested <dl>'s inconsistently indent, depending 1 byte difference → nested <dl>'s inconsistently parse, 1 byte difference
Attachment #89208 - Attachment is obsolete: true
Attachment #89209 - Attachment is obsolete: true
Attachment #89210 - Attachment is obsolete: true
Attachment #89211 - Attachment is obsolete: true
Attachment #89275 - Attachment is obsolete: true
Attachment #89276 - Attachment is obsolete: true
Attachment #89280 - Attachment is obsolete: true
Attachment #89279 - Attachment is obsolete: true
Frank, could you get me those files that are showing the problem?  I'm
investigating <dl> parsing as it is, and I'd like to see how my changes affect
this bug.... (you're right that screenshots are not necessary as long as the
files show the problem).
I was hoping to attach them, but due to a combination of bug #179290 and bug
#87404 I can't.  So, I'll provide urls and hope that there aren't any cr/nl
issues.  I recommend downloading them locally, and then viewing them in Mozilla
(if they are delivered as text/html the bug does not present itself).

http://www.neverending.org/~ftobin/tmp/out6 is a valid, 1053-byte XHTML 1.1
document and is parsed into a DOM correctly.

http://www.neverending.org/~ftobin/tmp/out7 is a valid, 1054-byte XHTML 1.1
document and is parsed into a DOM *incorrectly*.  It really doesn't matter where
the extra byte happens to be; I just put a ! in the last dd.
Here are my new agent/build specs:

User-Agent:       Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1) Gecko/20020912
Build Identifier: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1) Gecko/20020912
FYI, it should be clear from the rendering difference between the two files
where the DOM trees differ.  Basically, the <dd> after the "Linux Security
Newsletters" <dt> is closed prematurely, and the <dl> that is supposed to be a
child of said <dd> becomes its succeeding sibling instead.
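The two tree shapes described here can be told apart by asking which element is the parent of each nested <dl>.  Below is a minimal Python sketch using the standard library's xml.etree.ElementTree; the two sample strings are hypothetical stand-ins written for illustration (they are not the contents of 'out6' or 'out7', and the second is only a serialization of the misparsed shape, not actual Gecko output).

```python
import xml.etree.ElementTree as ET


def dl_parent_tags(markup: str) -> list:
    """Return the tag name of the parent of every non-root <dl> element."""
    root = ET.fromstring(markup)
    parents = []
    for parent in root.iter():
        for child in parent:
            if child.tag == "dl":
                parents.append(parent.tag)
    return parents


# Expected shape: the inner <dl> is a child of the <dd>.
EXPECTED = (
    "<dl><dt>Linux Security Newsletters</dt>"
    "<dd>text<dl><dt>Date</dt><dd>2002-06-17</dd></dl></dd></dl>"
)

# Shape described in the comment above: the <dd> closes early and the
# inner <dl> ends up as its following sibling.
MISPARSED = (
    "<dl><dt>Linux Security Newsletters</dt>"
    "<dd>text</dd><dl><dt>Date</dt><dd>2002-06-17</dd></dl></dl>"
)

print(dl_parent_tags(EXPECTED))
print(dl_parent_tags(MISPARSED))
```

In the expected tree the nested <dl> reports <dd> as its parent, while in the misparsed shape it reports the outer <dl>.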
Tried saving those two files as .xml, .html, .xhtml.  They render correctly (and
identically) in all three cases.  Linux build 2002-11-01-21 here.
Interesting; I just noticed that if you have an extension on the filename, then
it renders correctly.  But if you don't, it doesn't render correctly.

For example, have out7 be just plain 'out7', with no extension; it renders
incorrectly (at least for me).
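The dependence on the filename fits the fact that an extension-based lookup has nothing to go on for a name like 'out7', so the content type has to be decided some other way (in Mozilla's case, by the unknown decoder discussed earlier in this thread).  As a rough illustration only, here is how Python's standard mimetypes module treats the same names; this is not the code Mozilla uses.

```python
import mimetypes


def guessed_type(filename: str):
    """Return the MIME type guessed from the filename alone, or None."""
    content_type, _encoding = mimetypes.guess_type(filename)
    return content_type


for name in ("out7", "out7.html"):
    print(name, "->", guessed_type(name))
```

The extension-less name yields no type at all, whereas 'out7.html' maps to text/html.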
OK, I see that with a linux trunk 2002-11-01-21 build.  I get the following
warnings when loading out7 as "out7" and not as "out7.html":

WARNING: NS_ENSURE_TRUE(NS_SUCCEEDED(result)) failed, file
/home/bzbarsky/mozilla/debug/mozilla/htmlparser/src/nsHTMLTokens.cpp, line 343
WARNING: NS_ENSURE_TRUE(NS_SUCCEEDED(result)) failed, file
/home/bzbarsky/mozilla/debug/mozilla/htmlparser/src/nsHTMLTokenizer.cpp, line 801

The second warning is a corollary of the first.  The first happens because
FillBuffer() is returning an end-of-file error, because the scanner's
mInputStream is null!  Which seems very very wrong...
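To make the suspected failure mode concrete, here is a toy Python model of a scanner that refills its buffer in fixed-size chunks and reports end-of-file when it has no input stream.  Everything in it is hypothetical: the class, the chunk size, and the idea that the stream is lost after the first refill are assumptions chosen for illustration, and none of it is taken from nsHTMLTokens.cpp or nsHTMLTokenizer.cpp.  It only shows how a premature end-of-file could make the token stream depend on a one-byte change in document size.

```python
import io


class ToyScanner:
    """Toy scanner that pulls fixed-size chunks from a stream on demand."""

    def __init__(self, text, chunk_size, lose_stream_after_first_fill=False):
        self.stream = io.StringIO(text)
        self.chunk_size = chunk_size
        self.lose_stream = lose_stream_after_first_fill
        self.buffer = ""

    def fill_buffer(self) -> bool:
        """Append the next chunk to the buffer; return False on end of input."""
        if self.stream is None:
            # No stream to read from, so this looks like end-of-file.
            return False
        chunk = self.stream.read(self.chunk_size)
        self.buffer += chunk
        if self.lose_stream:
            self.stream = None  # simulate the stream going missing
        return bool(chunk)


def tokenize(scanner):
    """Split the input into ("tag", ...) and ("text", ...) tokens."""
    tokens = []
    while scanner.buffer or scanner.fill_buffer():
        if scanner.buffer[0] == "<":
            end = scanner.buffer.find(">")
            while end < 0:
                if not scanner.fill_buffer():
                    # Premature EOF inside a tag: the fragment degrades to text.
                    tokens.append(("text", scanner.buffer))
                    scanner.buffer = ""
                    return tokens
                end = scanner.buffer.find(">")
            tokens.append(("tag", scanner.buffer[: end + 1]))
            scanner.buffer = scanner.buffer[end + 1:]
        else:
            end = scanner.buffer.find("<")
            if end < 0:
                end = len(scanner.buffer)
            tokens.append(("text", scanner.buffer[:end]))
            scanner.buffer = scanner.buffer[end:]
    return tokens
```

With a chunk size of 19, a 19-character document such as "<dl><dt>a</dt></dl>" tokenizes the same way whether or not the stream is lost, but the 20-character "<dl><dt>a!</dt></dl>" loses its final closing tag when the stream goes missing after the first refill.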
Summary: nested <dl>'s inconsistently parse, 1 byte difference → nested <dl>'s inconsistently parse, 1 byte difference (scanner is confused)
Assignee: harishd → nobody
QA Contact: moied → parser
I don't see a problem here anymore.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → WORKSFORME
Whiteboard: DUPEME