Closed Bug 474670 Opened 15 years ago Closed 14 years ago

Unterminated HTML entities sometimes rendered

Categories

(Core :: DOM: HTML Parser, defect)

x86
Windows 2000
defect
Not set
major

Tracking

RESOLVED FIXED

People

(Reporter: volkmarkostka, Unassigned)

Details

(Keywords: regression, Whiteboard: [fixed by the HTML5 parser])

User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.9.2a1pre) Gecko/20090121 Minefield/3.2a1pre
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.9.2a1pre) Gecko/20090121 Minefield/3.2a1pre

If you put a string like "<a href="http://go/here?id=1&lang=0">here</a>" inside a textarea the &lang unterminated entity is rendered. The output is "<a href="http://go/here?id=1<=0">here</a>".

This also happens if you use the HTML editor of the "WebDeveloper Toolbar".

First described here: http://forums.mozillazine.org/viewtopic.php?f=25&t=1043535

This seriously impacts CMS systems.

Reproducible: Always

Steps to Reproduce:
Load this HTML page:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
<title>Untitled Document</title>
</head>

<body>
<form action="test.html">
<textarea name="text" cols="80"><a href="http://go/here?id=1&lang=0">here</a></textarea>
</form>
</body>
</html>

Actual Results:  
<a href="http://go/here?id=1<=0">here</a>

Expected Results:  
<a href="http://go/here?id=1&lang=0">here</a>
This has changed in the last few months of 2005.
Keywords: regression
Product: Firefox → Core
QA Contact: general → general
Version: unspecified → Trunk
2005? That old?
Yeah. Regression range is 
http://bonsai.mozilla.org/cvsquery.cgi?module=PhoenixTinderbox&date=explicit&mindate=2005-11-03+03%3A00&maxdate=2005-11-03+14%3A00
I think it could be bug 312104.
Blocks: 312104
Status: UNCONFIRMED → NEW
Component: General → HTML: Parser
Ever confirmed: true
QA Contact: general → parser
I claim this is a dup of bug 155047.  That makes it surprising that there's a regression range in 2005, though.
Yep. It sounds very much like bug 155047, but the context seems different.
How is the context different?  Both bugs are about the parsing of <a href>.
See Bug 278404 (DUP'ed to Bug 155047) for Boris Zbarsky's explanation of character entities in HTML (as based on SGML), written for Dan (the opener of Bug 278404) and for stupid me.

The history looks to be:
1) Initially, it worked as designed (as Bug 155047 and many DUPs say).
2) It was broken between 2004-11-10 and 2004-12-19 (perhaps by bug 88952)
   for a character entity delimited by a space in <textarea>.
   => Bug 312104 was opened and fixed in 2005.
   Note:
   Since its Target Milestone is mozilla1.9alpha1, Bug 312104 still occurs on Sm 1.x.
This bug's actual result is evidence that Bug 312104 was fixed correctly.
To comment 6:
The difference in the context is that the link in the textarea is displayed as text, not rendered as a link.
For me, the question is whether the content of a textarea is part of the DOM or not. If it is not, the content should not be interpreted at all.
Per HTML5, textarea is an RCDATA element, and RCDATA elements can contain character references.
That being said, the HTML5 parser will not treat "&lang" (without a semicolon) as a named character reference.
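
As a rough illustration outside the browser, Python's standard-library html.unescape is documented to follow the HTML5 rules for character references in text, so it can stand in for the behaviour described here. This is only a sketch: it is not Gecko's parser, and the variable names are made up for the example.

```python
import html

# Markup from the original report, as it would appear inside the textarea.
snippet = '<a href="http://go/here?id=1&lang=0">here</a>'

# "&lang" has no semicolon and is not one of the legacy names that are
# accepted without one, so the text is left exactly as typed.
unterminated = html.unescape(snippet)

# With the semicolon, "&lang;" is a named character reference and is replaced
# by a single character.
terminated = html.unescape("id=1&lang;=0")

print(unterminated)
print(terminated)
```

Under these rules the first string comes back unchanged, which matches the expected result in the report.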
Depends on: html5-parsing
Additional info for trunk:
If HTML5 is enabled then the display is correct.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Whiteboard: [fixed by the HTML5 parser]
Apologies for bumping this old bug, but I didn't want to log a new bug as I am pretty confident the issue I am seeing is a regression of this one. 

The code in the original submission seems to be clear of this problem, but I can still reproduce the bug using different entities, for example &lt instead of &lang. 

There appears to be some confusion about how this should work (there are several older bugs here about it : 222193, 155047) - apparently in ye olde days it was valid SGML, but this appears to now be at odds with the HTML5 spec ( http://www.w3.org/TR/html5/syntax.html#character-references ) which states: "The name must be one that is terminated by a U+003B SEMICOLON character (;)", which seems pretty unequivocal to me :)

The following code reproduces it for me: 

<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
<title>Untitled Document</title>
</head>
<body>
<form action="test.html">
<textarea name="text" cols="80"><a href="http://go/here?id=1&lt=0">here</a></textarea>
</form>
</body>
</html>

Actual results: 
<a href="http://go/here?id=1<=0">here</a>

Expected results:
<a href="http://go/here?id=1&lt=0">here</a>
(In reply to David Harrison from comment #12)
> this appears to now be at odds with the HTML5
> spec ( http://www.w3.org/TR/html5/syntax.html#character-references ) which
> states: "The name must be one that is terminated by a U+003B SEMICOLON
> character (;)", which seems pretty unequivocal to me :)

That text states a requirement for writing HTML, not for consuming it. The rules for consuming character references are at http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html#consume-a-character-reference

When <a href="http://go/here?id=1&lt=0">here</a> occurs as markup, "http://go/here?id=1&lt=0" is part of an attribute value and &lt= does not tokenize as a character reference.

The rules for attribute values and non-attribute value text are different. In your example, <a href="http://go/here?id=1&lt=0">here</a> is text in a textarea, so the attribute value rules don't apply, so &lt tokenizes as a character reference.
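
The difference between the two contexts can be sketched in a few lines of Python. This is a simplified model, not the tokenizer from the spec or from Gecko: it handles named references only, the function name decode_named_refs is hypothetical, and it borrows the name table from Python's html.entities.html5. The one rule it tries to capture is that, inside an attribute value, a reference without a semicolon is left alone when the next character is "=" or an ASCII alphanumeric.

```python
import re
from html.entities import html5  # keys such as "lt", "lt;" and "lang;"

# An ampersand followed by letters or digits and an optional semicolon.
_CANDIDATE = re.compile(r"&([A-Za-z0-9]+;?)")

def decode_named_refs(text: str, in_attribute: bool) -> str:
    """Decode named character references (simplified, hypothetical helper)."""
    def replace(match: re.Match) -> str:
        name = match.group(1)
        # Try the longest prefix of the candidate that is a known name.
        for end in range(len(name), 0, -1):
            prefix = name[:end]
            if prefix not in html5:
                continue
            rest = name[end:]
            following = rest[:1] or text[match.end():match.end() + 1]
            # Attribute-only exception: no semicolon, then "=" or alphanumeric.
            if (in_attribute and not prefix.endswith(";")
                    and (following == "="
                         or (following.isascii() and following.isalnum()))):
                return match.group(0)
            return html5[prefix] + rest
        return match.group(0)  # not a known name, leave untouched
    return _CANDIDATE.sub(replace, text)

url = "http://go/here?id=1&lt=0"
print(decode_named_refs(url, in_attribute=True))   # attribute value rules
print(decode_named_refs(url, in_attribute=False))  # textarea or body text rules
```

With in_attribute=True the URL is returned as written; with in_attribute=False the "&lt" is decoded, which is the result reported in comment 12.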

In textarea text, you need to escape & as &amp;.
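
For a CMS that generates the page on the server, the escaping can be done before the text is written into the textarea. A minimal sketch, assuming a Python backend and using only the standard-library html module (the variable names are illustrative):

```python
import html

# The text the user should see and edit inside the textarea.
link = '<a href="http://go/here?id=1&lt=0">here</a>'

# Escape "&", "<" and ">" so the parser cannot mistake "&lt" for a reference.
# quote=False leaves the double quotes alone, which is fine for element content.
escaped = html.escape(link, quote=False)

page = f'<textarea name="text" cols="80">{escaped}</textarea>'

# Decoding with HTML5 rules, as a browser would for textarea content,
# gives back the original text.
round_trip = html.unescape(escaped)

print(page)
print(round_trip == link)
```

The escaped form contains "&amp;lt=0", so the decoded textarea value is the literal "&lt=0" the author intended.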
Henri, 

Thanks for the clarification on consuming/writing. I'll have a read of that now. 

I should also point out that the conversion of &lt also takes place outside of TEXTAREAs - e.g.,

<!DOCTYPE html> 
<html>
<body>
Here's a test &lt
</body>
</html>
(In reply to David Harrison from comment #14)
> I should also point out that the conversion of &lt also takes place outside
> of TEXTAREAs - e.g.,
> 
> <!DOCTYPE html> 
> <html>
> <body>
> Here's a test &lt
> </body>
> </html>

That's expected, since &lt is outside an attribute value here, too.