Closed Bug 214476 (SGMLComment) Opened 21 years ago Closed 15 years ago

Mozilla interprets a -- (two dashes in a row) inside of a comment or an include improperly

Categories

(Core :: DOM: HTML Parser, defect, P2)

defect

Tracking

()

RESOLVED FIXED

People

(Reporter: bugzilla-emails-bugs, Unassigned)

References

()

Details

(Keywords: html5, Whiteboard: [fixed by the HTML5 parser])

Attachments

(1 file, 3 obsolete files)

User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.5a) Gecko/20030626 Mozilla Firebird/0.6
Build Identifier: 

While updating our top level page, we found out that if a double hyphen (--) is
used within a ssi style comment or ssi that it is misinterpreted by mozilla and
the close may not be read properly, leaving parts or all of the page not displayed.
Example:
<!-- V. 3: August 2000 by University Relations; designer--developer: Bennet
George -->
The double dash between designer and developer caused the comment to not close
properly.
For the time being we found that we can avoid it simply by using a single
hyphen. The problem does not occur in opera, msie, or most of the other browsers
we tested with and does comply to XML standards.

I do not remember if we had spaces on either side of the hyphens or not.

Reproducible: Always

Steps to Reproduce:
1.
2.
3.

Actual Results:  
Browser displayed only a small portion of the page

Expected Results:  
Browser should have interpreted the --> as the end of the comment, it did not
See http://www.w3.org/TR/html4/intro/sgmltut.html#h-3.2.4
I guess this bug is INVALID.
*** Bug 214475 has been marked as a duplicate of this bug. ***
This is invalid.  No -- are permitted within a comment.
Status: UNCONFIRMED → RESOLVED
Closed: 21 years ago
Resolution: --- → INVALID
In SGML, "--" starts a comment and "--" ends a comment.  HTML just uses SGML
comments, with "<!" signalling the starts of SGML markup and ">" signalling its
end.  Therefore:

<!-- -- --> text -->

Has two comments; one containing the string " " and one containing the string 
"> text ".

Now if Mozilla is put in quirks mode, we do backwards compatible comment parsing
(read "broken comment parsing just like old browsers").  But it sounds like the
site in question put Mozilla in standards mode.
Bill, not the difference between what you said and how comment parsing actually
works (it's subtle, but important: "--" is the comment delimiter, not just "not
allowed inside a comment").
BZ, maybe before you CC me yet again on bugs that I don't want to be on, you
should take time to take task within the W3C HTML group:

"A common error is to include a string of hyphens ("---") within a comment.
Authors should avoid putting two or more adjacent hyphens inside comments."

Stop CCing me to be pedantic when I'm quoting a spec, or I swear I'll just stop
donating time to Mozilla to triage bugs.  I didn't start writing HTML yesterday.
Status: RESOLVED → VERIFIED
*** Bug 269104 has been marked as a duplicate of this bug. ***
*** Bug 271854 has been marked as a duplicate of this bug. ***
*** Bug 271860 has been marked as a duplicate of this bug. ***
*** Bug 288610 has been marked as a duplicate of this bug. ***
*** Bug 294614 has been marked as a duplicate of this bug. ***
Alias: SGMLComment
*** Bug 294796 has been marked as a duplicate of this bug. ***
*** Bug 307747 has been marked as a duplicate of this bug. ***
*** Bug 318411 has been marked as a duplicate of this bug. ***
*** Bug 320933 has been marked as a duplicate of this bug. ***
*** Bug 321472 has been marked as a duplicate of this bug. ***
*** Bug 332516 has been marked as a duplicate of this bug. ***
*** Bug 338810 has been marked as a duplicate of this bug. ***
*** Bug 340975 has been marked as a duplicate of this bug. ***
*** Bug 341443 has been marked as a duplicate of this bug. ***
Invalid or not, I'm not sure I understand why Firefox sometimes will render these rogue comments, sometimes will treat them as commenting out entire sections, etc.  Regardless of the number of "-" between the start and end of a comment, shouldn't they always not render and not affect anything else?  With the number of duplicates, obviously lots of people use this technique to visually block off sections of code.
> I'm not sure I understand

Please read the whole bug, esp. comment 4.
The following perfectly valid web page is made invalid by having two hyphen characters in a row inside an HTML comment.

Enter the following web page code into the HTML validator under "Validate by Direct Input" at the following address:

http://validator.w3.org/


<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<title>HTML Comments Display on Web pages</title>
</head>
<body>

<!--

This entire comment -- will show in web browser

-->

<p>This line is supposed to be first visible text on web page.</p>

<p>The page works perfectly if the hyphen is split with even a space or removed completely.</p>

</body>
</html>

*** Bug 351419 has been marked as a duplicate of this bug. ***
*** Bug 352664 has been marked as a duplicate of this bug. ***
So if I'm a web developer and I need to allow -- inside comments, what am I to do? FFox renders this directly even with transitional mode, which strikes me as the wrong thing. Or is there some browser-specific way to force "quirks mode" ?
> So if I'm a web developer and I need to allow -- inside comments

You can't do that in HTML, if the HTML spec is actually followed.

> Or is there some browser-specific way to force "quirks mode" ?

This is well-documented at http://developer.mozilla.org/en/docs/Mozilla%27s_DOCTYPE_sniffing

I should note that that page is linked directly off http://developer.mozilla.org/en/docs/Mozilla's_Quirks_Mode, which is the first Google hit for "mozilla quirks".
*** Bug 359014 has been marked as a duplicate of this bug. ***
*** Bug 362663 has been marked as a duplicate of this bug. ***
FWIW, per HTML5 this is a bug in Firefox.
But -- is still not allowed inside a comment, right? Just the error handling is different (http://www.whatwg.org/specs/web-apps/current-work/#bogus)?
Right. (Though you would not end up in the "bogus comment state".) And also, that doesn't make it less of a bug :-)
Reopening per Anne's comment.
Severity: major → normal
Status: VERIFIED → UNCONFIRMED
Resolution: INVALID → ---
Assignee: harishd → nobody
Status: UNCONFIRMED → NEW
Ever confirmed: true
QA Contact: dsirnapalli → parser
If this is because -- is not allowed between <!-- and -->, then it sounds like the specification is inadequate. It defies common sense to use something like that for delimiting when you know full well that it's also used for comments and that -- is commonly used in texts. Use a less common sequence of characters for that purpose, ergo <!-- -#- -->. Don't make the job of a web developer more a pain because of ridiculously short-sighted standards.
Keywords: html5
SGML comments are ridiculous and problematic. They confuse authors and break pages. HTML 5 recognises this, and no longer requires browsers to parse comments in SGML format. They are now parsed in a way compatible with all browsers except Firefox. Firefox seems to be insisting on hanging on to the SGML comments, even though others realised they are stupid, and protested against their inclusion in Acid 2.

HTML is not SGML, and never has been (even though it was originally supposed to be) - this won't work anywhere, even though it is valid SGML:
http://virtuelvis.com/download/162/evilml.html

SGML comments were removed from Acid 2 because they are stupid. They were removed from HTML 5 because they are stupid. It's time to remove them from Firefox, and stop breaking pages like the one in this report.

For those wondering why some patterns work, and others leave bits on the page:
http://www.howtocreate.co.uk/SGMLComments.html
If Tarquin is right and SGML comments were removed from HTML5, then Firefox should not treat documents with a valid HTML5 Doctype and SGML comment deliminators as such. For older documents, or non-valid HTML5 documents, the gotcha-but-correct behaviour should be maintained.
Flags: blocking1.9.1?
Note that this is specified in the parsing section of HTML 5 (the language definition tells authors not to include -- inside comments, but the error handling stage of parsing will allow it):
http://www.w3.org/html/wg/html5/#comment3

"U+002D HYPHEN-MINUS (-) 
Parse error. Append a U+002D HYPHEN-MINUS (-) character to the comment token's data. Stay in the comment end state.
...
Anything else 
Parse error. Append two U+002D HYPHEN-MINUS (-) characters and the input character to the comment token's data. Switch to the comment state."
Wouldn't hold the release for this, but I think we should get this on the list for 1.9.1. 

blocking1.9.1-
wanted1.9.1+, P2.  

Anyone want to volunteer here?
Flags: wanted1.9.1+
Flags: blocking1.9.1?
Flags: blocking1.9.1-
Priority: -- → P2
I'll take this.
Assignee: nobody → mrbkap
"For older documents, or non-valid HTML5 documents, the gotcha-but-correct behaviour should be maintained."

This will not help anyone. The broken pages (including the one in this report) will not get fixed. Firefox will remain incompatible with all other browsers. SGML comments were removed from HTML because they are stupid in all cases, not sometimes-stupid. Having comments that nobody understands in HTML 4 standards mode, but not in quirks mode or HTML 5 mode, is beyond confusing.

They should be removed in all modes in order to be compatible with existing Web Pages, other browsers, and author expectations, while providing a consistent response to all doctypes.
Attached patch Implement HTML5 comments, v1 (obsolete) β€” β€” Splinter Review
If we're going to replace our comment parsing, we might as well implement HTML5. This patch is a straightforward implementation of the part of the state machine that consumes comments. I haven't tested it very thoroughly (in particular, I need to ensure that the behavior is consistent across packet boundaries) but as far as I can tell, it follows the spec word for word.
Blake, what test framework do we use to test something like this?
We have parser mochitests.
So, the parser mochitests work, but require a bunch of manual verification. In particular, my patch doesn't affect where we put the comments in the DOM and all of the interesting test cases in our mochitests hit this problem.
Doesn't your patch affect where the comment terminates and therefore what Element nodes end up in the document?
Sorry, yes. I meant that given the testcase |<!-- comment -->|, our resulting DOM looks like:
HTML
  HEAD
    <!-- comment -->
  BODY

where the tests want
<!-- comment -->
HTML
  HEAD
  BODY

and fixing that seems beyond the scope of this bug (unless people say otherwise).
Sure.  I was assuming we'd add the tests from this bug and/or duplicates to parser/htmlparser/tests/mochitest/regressions.txt or some such.  At least that's where I've been adding the parser tests... ;)
Attached patch Implement HTML5 comments, v2 (obsolete) β€” β€” Splinter Review
I made the state machine a little less jumpy and started adding tests. Unfortunately, the tests I've added here all fail. I don't understand how we're serializing these comments.
Attachment #326714 - Attachment is obsolete: true
Attached patch Implement HTML5 comments, v2.5 (obsolete) β€” β€” Splinter Review
This adds a bunch of tests, and I'm pretty sure that I've implemented the spec faithfully. This is ready for review.
Attachment #327975 - Attachment is obsolete: true
Attachment #328128 - Flags: review?(jonas)
One thing I've noticed is in the testcase: |<title>foo <!-- bar| the resulting content model is:

html
  head
    title
      "foo "
    <!--  bar -->

I'll fix that.
Here's the interesting part of the interdiff:

diff --git a/parser/htmlparser/src/nsHTMLTokens.cpp b/parser/htmlparser/src/nsHTMLTokens.cpp
--- a/parser/htmlparser/src/nsHTMLTokens.cpp
+++ b/parser/htmlparser/src/nsHTMLTokens.cpp
@@ -900,6 +900,10 @@ CTextToken::ConsumeParsedCharacterData(P
 
         consumer.AppendSourceTo(theContent.writable());
         mNewlineCount += consumer.GetNewlineCount();
+
+        // If we successfully consumed a comment, end the title after the
+        // comment.
+        aScanner.CurrentPosition(altEndPos);
         continue;
       }
     }
Attachment #328128 - Attachment is obsolete: true
Attachment #329653 - Flags: review?(jonas)
Attachment #328128 - Flags: review?(jonas)
Blake, what is the status of this patch? Is it still good to review?
(In reply to comment #72)
> Blake, what is the status of this patch? Is it still good to review?

Yeah, the only question that needs answering before review is whether we want to do this at all.
Assuming that this makes us follow the HTML5 algorithm I think we should try it.

The only reason not to do it would be if we think the HTML5 parser is going to land pretty soon anyway...
I was the one opening bug 484036. This bug unfortunately didn't show up when I searched, I'm sorry for that.
Just to add to this discussion, putting the url of an IDN-domain between those comment tags triggers this error too. I run several forums based off vBulletin and this application puts the forum's url wrapped in a comment in the footer. One of my forums is using an IDN-domain and vBulletin puts that domain name in punycode format in the footer (www.xn--something-xyz.com) and breaks the page. Not allowing certain valid domains within a comment seems a little far fetched so I hope this problem will get resolved. Other browsers renders this correctly.
> Not allowing certain valid domains within a comment seems a little far fetched

Er, say what?  The two languages are unrelated!  You wouldn't complain if a url that contains "*/" ended a CSS comment, would you?

This bug should be fixed for compat, but said compat is just a workaround for people sticking things with the comment end delimiter in them inside comments...
Not exactly sure what you mean. I was referring to comment #3 where someone claims it's forbidden to have -- enclosed in comment tags. The people inventing punycode and making IDN domains a standard either didn't know this or completely ignored it. In any case it's the users of Firefox that has to pay the price which, after reading this thread, seems to have been forgotten. Standards seem to be more important for Mozilla than user experience. If I have misunderstood this I sincerely apologize.
> or completely ignored it

The latter.  Or assumed people would properly escape their stuff inside comments, of course.
In reply to comment #80: If you want to put */ inside a C comment without ending the comment, you have to alter it somehow. Add a space between the star and slash, maybe. Similarly, you can't put a punycode URL, containing two dashes in the middle of text, inside an HTML4 comment -- you have to alter it somehow. If you want the punycode URL to be human-readable, you can replace two dashes by four (but the opposite conversion will have to be done when copying it to the URL bar); or if you want it to be machine-readable, I think you can replace the two dashes by %2D%2D (correct me, someone, if I'm wrong in thinking that such an "escaped" URL, when copied to the URL bar, will be correctly interpreted, first by having each %2D replaced by a dash, then by punycode interpretation). _Four_ dashes in sequence are allowed within an HTML comment, even in HTML4; and %2D%2D means -- in URL syntax but not in HTML comment syntax.

IIUC, HTML5 comment syntax differs from HTML4 comment syntax; but I don't know the details. The above paragraph is about HTML4 but this bug is (IIUC, now that it has been reopened, assigned, and given the "html5" keyword) about altering Gecko to handle HTML5 comments correctly.
Your example is not quite correct though.  */ is the terminating marker for a C comment - nobody is wanting to put --> inside a HTML comment.  The problem is putting just part of the comment terminator inside a comment terminates it.  It would be like being unable to use a single * inside a C comment.

The same argument applies to comment #79.  I wouldn't complain if a URL containing */ ended a CSS comment, but I could complain if a URL containing * ended it.  Likewise I'm not complaining that "-->" ends a HTML comment, but rather that "--" does.

I'm not aware of any other language that uses one character sequence to start a comment, but two or more different sequences to terminate that same comment.
You can choose whether you're wrong according to HTML4.01 and SGML:

http://htmlhelp.com/reference/wilbur/misc/comment.html

or according to HTML5:

http://dev.w3.org/html5/spec/Overview.html#comments

but unfortunately that's just the way HTML comments work.  You can't put -- inside them.
Adam, in SGML and HTML the comment terminator sequence is "--", and can only be used inside an SGML markup declaration.  SGML markup declaration start with "<!" and end with ">".  Here's an example from the HTML4 DTD at http://www.w3.org/TR/REC-html40/sgml/dtd.html :

<!ATTLIST Q
  %attrs;                              -- %coreattrs, %i18n, %events --
  cite        %URI;          #IMPLIED  -- URI for source document or msg --
  >

Here the comments are "%coreattrs, %i18n, %events" and "URI for source document or msg" but the rest of the text is not comment and in fact is the declaration for the allowed attributes on the Q element.

I realize this is a bit more complex than the way comments work in C, but that's life.

Note that I already said all this in this bug in comment 4, almost 6 years ago...

In any case, this bug is now about ignoring HTML4 and implementing the HTML5 definition of comments, which does indeed start with "<!--" and end with "-->", so I'm really not sure what all the discussion is about at this point.
I've noticed that -- followed by any sequence of characters and a > within a conditional comment really messes things up. Surely a conditional comment needs to be an exception to this rule, as it may contain a script with valid code. I realise conditional comments aren't best practice coding, but they do exist and they do get parsed by browsers.

Here's an example:

<!--[if lt IE 7]>
<script language="javascript">
var i=1;
i--;
var string=">";
alert("this displays as inline text in firefox");
</script>
<![endif]-->
This is starting to get ridiculous... 

#86 ... perhaps you can use eval and unescape to avoid having -- and > characters in IE only code?
Conditional comments are "parsed by browsers" as just comments, with no special treatment, except for IE.  Seriously, if you want to shoot yourself in the foot with HTML you can.

And I still don't nderstand what the discussion is about, since the plan is to change behavior here...  Can people just shut up and let the bug be until it's fixed?
#86 Use Conditional compilation instead. Problem solved!
I submitted duplicate bug 500110 for this, after three failed searches.

Boris' comment "... I already said all this ... almost six years ago" (#86) is distressing.  Six years!  By the time HTML5 is official *and* the browser manufacturers adopt it *and* older browsers like Firefox 3.0.11 disappear from the Internet, I may have retired.

I'm in the camp that uses dashes (and others) in comments to visually break my code into sections.  A solution that looks good and seems to work well is to replace the non-allowed dashes with character 196 (Alt-196 in Windows).
Γ£
but what I am curious about is, are these valid HTML?  Am I to understand that a space is required?
<!-----> (one - in the middle. in firefox, this is not a usable comment because it has an odd number of -'s.)
<!--+--> (one + in the middle, should be same problem as above, but lexically analyzed differently.  I am not sure, but I think in firefox this may not be a usable comment.)
<!--->->--> (->-> in the middle.  in firefox this is likely not be a usable comment because it has 3 -'s.)
<!------> (two - in the middle. in firefox, this is a usable comment because it has an even number of -'s.)

What I have noticed in the past about the firefox lexer is that it just simply pairs off -'s, which may be the wrong way to lexically analyze them. 

I was thinking of the following algorithm:
char0=0
char1=0
char2=0
match "<!--" in sequence
char0=get character
while (char0=get character != EOF) {
    if (char0=='-') {
        char1=get character
        if (char1=='-') {
            char2=get character
            if (char2=='>') {
                //found end of comment!
                break;
            }
        }
    }
}

but then I realized that this is the logic error that only recognizes double -'s!
what I really need here is the software equivelant of the digital electronics shift register.  it is similar to a deque, where you push on one side and pop on the other.  If I could examine the contents of the whole deque, that would be wonderful.  then I could see if the --> was coming down the pipe, and I could keep it always 3 or 4 characters full of characters that I newly got.
got it?  

what you would need to implement this in C++ is something like the STL to do this the easiest way.  there is already a deque class.  unfortunately, I have already stuffed my STL book in a box for moving. an iterator should provide the necessary means of iterating across the data elements of the deque.  and they are easy to make.
Jim Michaels
Well the way I look at it is the tag name is "!--" in the same way as the tag name might be "img".  Obviously <imgblah> is not valid (you need a space after the tag name), so I would expect <!-----> to be something completely different to a comment.  <!-- - --> would be a comment with an uneven number of dashes.
Jim, please read comment 85 (and then comment 4).  Those explain exactly how SGML and HTML4 work here (and yes, they require simply pairing off "--").

Nte that a comment containing "--" is invalid HTML4 and it says so right in the HTML4 specification.

HTML5 changes the specified behavior here, which is why this bug is still (or rather again) open.
Comment on attachment 329653 [details] [diff] [review]
Implement HTML5 comments, v2.7

Hopefully all of this code is going away soon, so no need to muck around with it at this point.

If that plan changes please rerequest review
Depends on: html5-parsing
Not sure about implementation in swallowing, but I think you use the big state machines...  while in comments you could break this into three states:

IN_COMMENTS:
  if ( nextchar == "-" )
  {
     state = IN_COMMENTS_ONEDASH;
     break;
  }
  // else swallow the rest;
IN_COMMENTS_ONEDASH:
  if ( nextchar == "-" )
  {
     state = IN_COMMENTS_TWODASH;
     break;
  }
  else
  {
     state = IN_COMMENTS;
     break
  }
IN_COMMENTS_TWODASH:
  if ( nextchar == ">" )
  {
     state = DATA; // or return to whatever state we were in
     break;
  }
  else if ( nextchar == "-" )
  {
    state = IN_COMMENTS_TWODASH; // cause now we have two dashes again
  }
  else
  {
    state = IN_COMMENTS;
  }
  // maybe some error here because of dashes and no end to comments


That should work for the state engines and you can keep track of the errors.
(In reply to comment #105)

http://www.w3.org/TR/2010/WD-html5-20100304/syntax.html#comments (just counting "<!--" for comment block)

well, in https://hg.mozilla.org/mozilla-central/file/e9312d05488f/parser/html/javasrc/Tokenizer.java it's already this way, except for the allowing space after the two dashes (not sure if that is whats being used).

it breaks from comments in case of <!--> or <!---> which seems fine, but it breaks when <!-- -- >... which in html5 is incorrect if the comment block should end in "-->".

If white-space (or "!") is encountered such as in COMMENT_END it should do what it does in "default" and not go to COMMENT_END_SPACE (or BANG) like it does.  Other than that and other tokenizers, it appears it should work.
Assignee: mrbkap → nobody
Status: ASSIGNED → RESOLVED
Closed: 21 years ago15 years ago
Resolution: --- → FIXED
Whiteboard: [fixed by the HTML5 parser]
I thought <!-- and --> delimited comments. At least that is how it works with ie. (Yes I understand that ie doesn't follow all standards).
James, you thought wrong, in the case of HTML4.  See comment 4.

Note that as a result of IE's behavior and people thinking it's correct the standard got changed...
Also note that this bug has been marked FIXED, which means that this bug is fixed in the code that will become the next version of firefox.
Given Firefox 4 about to ship with an html5-compliant parser that may lead to even more sites "broken" in Firefox 3.6, which will have a significant number of users for quite some time. It may be worth taking Blake's patch or something like it on the 1.9.2. branch.
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: