Open
Bug 436006
Opened 16 years ago
Updated 7 months ago
"ASSERTION: Not a UTF-8 string" with <a name="%e2g">
Categories
(Core :: DOM: HTML Parser, defect)
Tracking
()
NEW
People
(Reporter: jruderman, Unassigned)
References
()
Details
(Keywords: assertion, testcase)
Attachments
(4 files)
Loading the testcase triggers: ###!!! ASSERTION: Not a UTF-8 string. This code should only be used for converting from known UTF-8 strings.: 'Error', file ../../../dist/include/string/nsUTF8Utils.h, line 596 ###!!! ASSERTION: not a UTF8 string: 'Error', file ../../dist/include/string/nsUTF8Utils.h, line 130 ###!!! ASSERTION: Input wasn't UTF8 or incorrect length was calculated: 'Error', file /Users/jruderman/central/mozilla/xpcom/string/src/nsReadableUtils.cpp, line 287 I found this bug by simply loading random web sites and looking for interesting assertion failures.
Reporter | ||
Comment 1•16 years ago
|
||
Reporter | ||
Comment 2•16 years ago
|
||
The stack trace blames HTMLContentSink::AddAttributes.
Comment 3•16 years ago
|
||
Yeah, we nsUnescape @name on <html:a>. Which could introduce all sorts of random bytes...
Flags: wanted1.9.0.x?
Updated•16 years ago
|
Flags: wanted1.9.1?
Flags: wanted1.9.0.x?
Flags: wanted1.9.0.x+
Comment 5•15 years ago
|
||
So I was going to ask why we assume that @name is in UTF-8, rather than the document encoding, and found the answer in bug 87966 comment 5: the HTML spec requires it (http://www.w3.org/TR/html4/appendix/notes.html#non-ascii-chars). So strictly |<a name=%e2g>| is invalid, but surely the code in nsHTMLContentSink::AddAttributes should at least check for non-UTF8 @name, and maybe even try falling back to the page encoding? What do other UAs do?
Comment 6•15 years ago
|
||
Clarification: when I say "non-UTF8" I mean "non-UTF8 when unescaped".
Comment 7•15 years ago
|
||
I believe that other UAs treat it as UTF-8, but it's worth experimenting.
Comment 8•15 years ago
|
||
As far as I can tell, no other UA unescapes non-ASCII characters in name attributes, ever, in any encoding. I tested Safari 3.2.3 on Mac, and IE7, IE8, Opera9.52 and Opera9.64 on Windows. Given that, I suggest that we should test whether the name is valid UTF-8 after unescaping, and leave it as escaped if not.
Comment 9•15 years ago
|
||
The trouble with that suggestion seems to be that it breaks non-UTF-8 hrefs to non-UTF-8 anchors.
Comment 10•15 years ago
|
||
This testcase incorporates Jesse's "named anchors" bookmarklet from squarefree.com to display the anchor names. I have a patch that makes all the anchors and links to them work as expected, which I will attach after more testing (e.g. with escaped codepoints that are valid in neither UTF-8 nor the page encoding)
Assignee: nobody → smontagu
Updated•15 years ago
|
Attachment #381731 -
Attachment mime type: text/html → text/html; charset=iso-8859-1
Comment 11•15 years ago
|
||
Simon, how would that testcase fare if we instead just implemented http://www.whatwg.org/specs/web-apps/current-work/multipage/history.html#scroll-to-fragid ?
Comment 12•15 years ago
|
||
If I am interpreting that whatwg reference correctly, then the other UAs mentioned in comment 8 are conforming to it and we are not, at least with respect to parsing the name attributes. With respect to following the hrefs, IE's behaviour seems so be the most conformant: basically the links should only work (i.e. scroll to the anchors), when the link encoding is the same as the anchor encoding (ignoring NCRs as a separate can of worms for the time being). I think that for us to implement it would mean cutting out the whole block in nsHTMLContentSink::AddAttributes that unescapes the name attribute, and modifying nsPresShell::GoToAnchor and its callers to search for id and name attributes separately with different input, and changing the fallback if UTF-8 conversion fails to use the escaped string rather than attempting to decode the unescaped string in the document charset. Does that sound right? Those callers also assert as they stand, by the way, but we can prevent that by calling IsUTF8 before conversion instead of attempting to convert all input as UTF-8 and falling back if the conversion fails.
Comment 13•15 years ago
|
||
Comment 14•15 years ago
|
||
> Does that sound right?
Yes. Would that work with your first testcase?
Updated•15 years ago
|
Attachment #382008 -
Attachment mime type: text/html → text/html; charset=iso-8859-1
Comment 15•15 years ago
|
||
(In reply to comment #14) > Would that work with your first testcase? What are you asking exactly? The two testcases are the same, except for the prose describing the expected results, and except that I forgot to specify the charset in the second one until this moment ;-)
Comment 16•15 years ago
|
||
I guess my question was really two questions: 1) Does the HTML5 spec give the expected behavior on the first testcase? 2) If not, is that because of a case that we think should actually work, or just because the testcase is exhaustive in terms of testing things that we make work right now?
Comment 17•15 years ago
|
||
So after reading a bunch of related bugs that Boris pointed me to, I would say that in the long term we probably want to do what HTML5 says, but making any single change in that direction would be like pushing over a domino, and we can't really do it without considering where every other domino would fall. In the short term, we might want to take a patch as described in comment 10, which makes anchors as in http://mtv3.fi (from bug 496336) not be totally broken, and is, I think, internally consistent with what we do already.
Comment 18•15 years ago
|
||
That said, I'm not sure that there is a "short term" at this point -- we probably want to do the "long term" fix by the next release, and it's much too late to get the "short term" fix in for 1.9.1.
Yeah, I don't think any changes here would be safe enough for 1.9.1. And since it's not flagged as a regression (is this correct) I doubt that it's worth taking on a branch. If this is indeed a regression I think we should try to get it into 1.9.1.x though.
Updated•15 years ago
|
Flags: wanted1.9.1? → wanted1.9.1.x?
Comment 20•15 years ago
|
||
It's not a regression: the code that causes the assertion was checked in back in 2001 in bug 87996.
Updated•15 years ago
|
Flags: wanted1.9.1.x?
Flags: wanted1.9.0.x+
Flags: blocking1.9.2?
Comment 21•15 years ago
|
||
Not holding 1.9.2 for this.
Flags: wanted-next+
Flags: blocking1.9.2?
Flags: blocking1.9.2-
Reporter | ||
Comment 22•15 years ago
|
||
WFM on trunk.
Comment 23•15 years ago
|
||
I still see the assertions on trunk (OS X and Linux)
Reporter | ||
Comment 24•14 years ago
|
||
I only see the assertions if I set html5.enable to false.
Comment 25•14 years ago
|
||
That's because the html5 parser knocked over one of the dominoes mentioned in comment 17 but not the others. So now we don't assert, but we're not web-compatible either.
Reporter | ||
Comment 26•14 years ago
|
||
Now I get the assertions with the default setting (html5.enable true).
Comment 27•14 years ago
|
||
(In reply to comment #26) > Now I get the assertions with the default setting (html5.enable true). That's because I carefully copied and pasted the old bogus code to the HTML5 side in order to keep fragment refs working in the Gecko 2.0 timeframe.
Comment 28•2 years ago
|
||
The bug assignee is inactive on Bugzilla, so the assignee is being reset.
Assignee: smontagu → nobody
Updated•2 years ago
|
Severity: normal → S3
You need to log in
before you can comment on or make changes to this bug.
Description
•