Tags that are not closed by the end of the document do strange things.

RESOLVED FIXED in Future

Status

()

Core
HTML: Parser
--
minor
RESOLVED FIXED
17 years ago
13 years ago

People

(Reporter: Eric Mao, Assigned: mrbkap)

Tracking

({testcase})

Trunk
Future
x86
All
testcase
Points:
---

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(5 attachments, 1 obsolete attachment)

(Reporter)

Description

17 years ago
I created a simple HTML file:
  <html> <body bgcolor="white" "bogus extra stuff""""> </body> </html>

I opened the file up in Mozilla 0.8 under Win2K, and went to View Page Source. 
I expected to see the same HTML code that I typed in, despite the fact that the
HTML code contained errors.  Instead, some of the double-quotes got cleaned up:
  <html> <body bgcolor="white" bogus extra stuff> </body> </html>

Comment 1

17 years ago
Same in here, under Linux. The question is whether it's a bug: a html file
should look like that: <TAG OPT1="value1"> and not <TAG "value1"> ...

Comment 2

17 years ago
sending to parser for triage
Assignee: ben → harishd
Component: XP Apps: GUI Features → Parser
QA Contact: sairuh → janc

Comment 3

17 years ago
Confirming as per user comment, setting OS=All.
Status: UNCONFIRMED → NEW
Ever confirmed: true
OS: Windows 2000 → All
View source should be showing the source as it really is -- otherwise its
usefulness for debugging web pages is not very great (and why does anyone ever
look at view source otherwise?).

Comment 5

17 years ago
because we want to see what content mozilla decided was worth parsing ;-b 
otherwise, we'd use telnet to get the real content :) *sigh* this is at least 
normal severity.
Severity: minor → normal
Keywords: dataloss

Updated

17 years ago
QA Contact: janc → bsharma

Comment 6

17 years ago
I appreciate that this is truly annoying, but can it actually alter the 
semantics of content so that developers are misled rather than just 
inconvenienced?  Futuring for now, will reconsider given a compelling test case. 
Target Milestone: --- → Future
Created attachment 27905 [details]
Testcase (icky HTML, I know)
The attached testcase sucks as HTML goes.  Nevertheless...

Moving the closing quote to before </html> will make things work OK.  But the
fact remains that view source just silently lost half the source of that page,
and more importantly has lost all indications that an error of any kind has occured.

I'm not completely sure that it's the same bug, but it seems related...
Blocks: 57724
No longer blocks: 57724

*** This bug has been marked as a duplicate of 57724 ***
Status: NEW → RESOLVED
Last Resolved: 17 years ago
Resolution: --- → DUPLICATE

Updated

17 years ago
QA Contact: bsharma → moied
Reopening 57724 dependencies for independent resolution.
Status: RESOLVED → REOPENED
Depends on: 57724
Keywords: testcase
Resolution: DUPLICATE → ---

Comment 11

16 years ago
Created attachment 82208 [details]
Single quotes on event handler morphed to double-quotes

Outer single quotes are converted to double quotes both in "View->Source" and
via the "Save Page As..." dialogue.  In the latter case, the saved page will
not work because this change breaks the nested quotes.
Roland, I cannot reproduce the error you describe when viewing the source of
that attachment (linux trunk build 2002-05-01-21).  I see single quotes...

The "save as" issue is a separate issue (with the "web page, complete" mode
only) and should be filed as a separate bug.

Comment 13

15 years ago
By the definitions on <http://bugzilla.mozilla.org/bug_status.html#severity> and
<http://bugzilla.mozilla.org/enter_bug.cgi?format=guided>, crashing and dataloss
bugs are of critical or possibly higher severity.  Only changing open bugs to
minimize unnecessary spam.  Keywords to trigger this would be crash, topcrash,
topcrash+, zt4newcrash, dataloss.
Severity: normal → critical
Severity: critical → minor
Keywords: dataloss
Please actually read bugs before changing the severity, ok?

Comment 15

14 years ago
See also comment #3 of this bug: 223753

Comment 16

14 years ago
*** Bug 233609 has been marked as a duplicate of this bug. ***

Comment 17

14 years ago
This "cleaning up" of the source before interpretation can also cause security
issues.

Eg my website allows users to add certain simple HTML tags to their posts, but
not others. If, however, a user enters this:

<B ONCLICK="window.open('http://www.badsite.com')" 

then mozilla will automatically append the closing > and render the next part of
the website's own content with this onclick link.


Admittedly, it's my fault in this case for not having a good enough regexp for
filtering out bad tags (which I've now fixed), but I do wonder whether Mozilla
should be displaying "what an attacker means" rather than "what the designer said". 



The following was the vulnerable html cleanup code I had used. I've simplified
$allowed for clarity.
	
$allowed='\s*\/?\s*(b|i|u|s|pre|tt|ul|ol|li|p|)\s*';
$memo=preg_replace("/<((?!($allowed>))[^<>]*)>/is", "&lt;\\1&gt;", $memo);

(Assignee)

Comment 18

13 years ago
--> me since I'm working on this.
Assignee: harishd → mrbkap
Status: REOPENED → NEW
(Assignee)

Comment 19

13 years ago
Fixing summary to reflect the bug that this is covering (our awful handling of
unclosed tags and the end of the document).

For the record: start tags disappear altogether, end tags get duplicated.
Summary: view source makes double-quotes disappear → Tags that are not closed by the end of the document do strange things.
(Assignee)

Comment 20

13 years ago
bz, rbs, do you have any more information on the stuff here? If you know anyone
who may, please add them to the CC list.

My explanation for this bug turned into a 300 word mini-essay on end-of-file
handling at the tokenizer level, so instead of filling up this bug with it, I'll
attach it (it's also available at:
http://www.owlnet.rice.edu/~mrbkap/bug70828explain.html).

My explanation is pretty long and comprehensive, but the short of it is that I
have a partial patch with a couple of problems left to fix. I'm not sure about
some parts of the patch (see the explanation). Any suggestions on my explanation
or solution for this are welcome!

Also, I can generalize my explanation and put it up as parser documentation (on
EOF handling in the tokenizer) somewhere (since htmlparser documentation is
terribly lacking), I just need to know what format/where to put it.
(Assignee)

Comment 21

13 years ago
Created attachment 160805 [details]
Explanation of this bug.

I'm attaching this to the bug so that if/when I take the explanation down from
the internet, people looking into parser code can still find it here (and don't
receive a 404).
(Assignee)

Comment 22

13 years ago
Created attachment 161187 [details] [diff] [review]
work in progress

This is a partial fix for this bug. I'm still unsure as to the
nsScanner::IsIncremental()/mIsFinalChunk decision and there are a couple of
(easier) bugs that I'd like to fix in this code before I spend more time on
this bug.
(Assignee)

Comment 23

13 years ago
Comment on attachment 161187 [details] [diff] [review]
work in progress

I've decided to go with the existing code's idea and fix this (for the most
part) with nsScanner::IsIncremental(). I'll post a new patch later this week.
Attachment #161187 - Attachment is obsolete: true
(Assignee)

Comment 24

13 years ago
Created attachment 162547 [details] [diff] [review]
patch v2

This patch fixes this bug. Basically, because of document.write(),
mIsFinalChunk is useless to us (this is the reason it isn't really used
anywhere in the parser).  Basically, I went through and each place we can
receive EOF, and added a check to see if the document was out of data.

This also fixes the bug where the document: "<script>foo;</script " would show
up as "<script>foo;</script</script" in view-source.

I'm not sure if I was just unobservant, but my debug build seems to be viewing
source very slowly (1/2 to 1/4 of a second lag on slashdot.org). I'm unsure if
it's my patch, because when I locally backed it out, I still saw the lag. If
someone could test and see if this does adversely view-source performance, it
would be great.
(Assignee)

Comment 25

13 years ago
Comment on attachment 162547 [details] [diff] [review]
patch v2

rbs, could you take a look at this?

An additional note: the changes to nsScanner::ReadTagIdentifier are because if
we ran out of data before reaching an invalid character, we wouldn't end up
appending what we had already.
Attachment #162547 - Flags: review?(rbs)
Note that testing performance in a debug build is very pointless....

Comment 27

13 years ago
Comment on attachment 162547 [details] [diff] [review]
patch v2

I think you are better off at this point to find out how to run the parser
regression tests. They will add another degree of confidence. This patch is
invasive and hard to assess. I am witholding my r= pending what you see from
the regression tests.
Attachment #162547 - Flags: review?(rbs)
(Assignee)

Comment 28

13 years ago
Comment on attachment 162547 [details] [diff] [review]
patch v2

Rerequesting r= after running the parser regression tests.

The only two testcases to change were quote002.html and quote003.html, both of
which exhibit the problem that this bug is trying to fix (unclosed start tag at
the end of the document gets dropped).
Attachment #162547 - Flags: review?(rbs)

Comment 29

13 years ago
Comment on attachment 162547 [details] [diff] [review]
patch v2

r=rbs
Attachment #162547 - Flags: review?(rbs) → review+
(Assignee)

Comment 30

13 years ago
Comment on attachment 162547 [details] [diff] [review]
patch v2

bz, asking you for sr since I'm touching nsScanner::ReadTagIdentifier. again.
Attachment #162547 - Flags: superreview?(bzbarsky)
(Assignee)

Comment 31

13 years ago
Comment on attachment 162547 [details] [diff] [review]
patch v2

This patch misses a case: the document |<a b| will lose the 'b'. I have a fix
in my tree, but don't have time to attach it now. The patch is pretty much the
same, except for an extra check in CAttributeToken::Consume().
(Assignee)

Comment 32

13 years ago
Created attachment 163144 [details] [diff] [review]
patch v2

This makes us not lose the last attribute in an unclosed tag.
(Assignee)

Comment 33

13 years ago
Comment on attachment 163144 [details] [diff] [review]
patch v2

This should be the final patch. The differences between this patch are all in
CAttributeToken::Consume(). I also fixed some typos and cleaned up some
comments.
Attachment #163144 - Flags: superreview?(bzbarsky)
(Assignee)

Updated

13 years ago
Attachment #162547 - Flags: superreview?(bzbarsky)
(Assignee)

Comment 34

13 years ago
Comment on attachment 163144 [details] [diff] [review]
patch v2

In an attempt to get this in faster. I'm playing with sr's. Trying roc first.

rbs, I'm asking for r= on this one (for the attribute changes).
Attachment #163144 - Flags: superreview?(roc)
Attachment #163144 - Flags: superreview?(bzbarsky)
Attachment #163144 - Flags: review?(rbs)

Comment 35

13 years ago
Comment on attachment 163144 [details] [diff] [review]
patch v2

r=rbs
Attachment #163144 - Flags: review?(rbs) → review+
Attachment #163144 - Flags: superreview?(roc) → superreview+
(Assignee)

Comment 36

13 years ago
Fix checked in (albeit with the wrong bug number in the comment).
Status: NEW → RESOLVED
Last Resolved: 17 years ago13 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.