Last Comment Bug 436006 - "ASSERTION: Not a UTF-8 string" with <a name="%e2g">
: "ASSERTION: Not a UTF-8 string" with <a name="%e2g">
Status: NEW
: assertion, testcase
Product: Core
Classification: Components
Component: HTML: Parser (show other bugs)
: Trunk
: x86 All
: -- normal with 1 vote (vote)
: ---
Assigned To: Simon Montagu :smontagu
: Andrew Overholt [:overholt]
: 496336 (view as bug list)
Depends on: 483660
  Show dependency treegraph
Reported: 2008-05-27 20:11 PDT by Jesse Ruderman
Modified: 2013-10-15 19:34 PDT (History)
14 users (show)
jst: wanted‑next+
jst: blocking1.9.2-
See Also:
Crash Signature:
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---

testcase (85 bytes, text/html)
2008-05-27 20:11 PDT, Jesse Ruderman
no flags Details
stack trace (6.21 KB, text/plain)
2008-05-27 20:15 PDT, Jesse Ruderman
no flags Details
Testcase (12.03 KB, text/html; charset=iso-8859-1)
2009-06-05 00:58 PDT, Simon Montagu :smontagu
no flags Details
Attempt to express the behaviour described on in the testcase (12.61 KB, text/html; charset=iso-8859-1)
2009-06-07 01:50 PDT, Simon Montagu :smontagu
no flags Details

Description Jesse Ruderman 2008-05-27 20:11:14 PDT
Created attachment 322719 [details]

Loading the testcase triggers:

###!!! ASSERTION: Not a UTF-8 string. This code should only be used for converting from known UTF-8 strings.: 'Error', file ../../../dist/include/string/nsUTF8Utils.h, line 596

###!!! ASSERTION: not a UTF8 string: 'Error', file ../../dist/include/string/nsUTF8Utils.h, line 130

###!!! ASSERTION: Input wasn't UTF8 or incorrect length was calculated: 'Error', file /Users/jruderman/central/mozilla/xpcom/string/src/nsReadableUtils.cpp, line 287

I found this bug by simply loading random web sites and looking for interesting assertion failures.
Comment 1 Jesse Ruderman 2008-05-27 20:15:07 PDT
Created attachment 322720 [details]
stack trace
Comment 2 Jesse Ruderman 2008-05-27 20:15:45 PDT
The stack trace blames HTMLContentSink::AddAttributes.
Comment 3 Boris Zbarsky [:bz] (still a bit busy) 2008-05-27 22:27:34 PDT
Yeah, we nsUnescape @name on <html:a>.  Which could introduce all sorts of random bytes...
Comment 4 Simon Montagu :smontagu 2009-06-04 07:54:09 PDT
*** Bug 496336 has been marked as a duplicate of this bug. ***
Comment 5 Simon Montagu :smontagu 2009-06-04 08:17:08 PDT
So I was going to ask why we assume that @name is in UTF-8, rather than the document encoding, and found the answer in bug 87966 comment 5: the HTML spec requires it (

So strictly |<a name=%e2g>| is invalid, but surely the code in nsHTMLContentSink::AddAttributes should at least check for non-UTF8 @name, and maybe even try falling back to the page encoding?

What do other UAs do?
Comment 6 Simon Montagu :smontagu 2009-06-04 08:18:23 PDT
Clarification: when I say "non-UTF8" I mean "non-UTF8 when unescaped".
Comment 7 Boris Zbarsky [:bz] (still a bit busy) 2009-06-04 08:41:59 PDT
I believe that other UAs treat it as UTF-8, but it's worth experimenting.
Comment 8 Simon Montagu :smontagu 2009-06-04 23:11:48 PDT
As far as I can tell, no other UA unescapes non-ASCII characters in name attributes, ever, in any encoding. I tested Safari 3.2.3 on Mac, and IE7, IE8, Opera9.52 and Opera9.64 on Windows.

Given that, I suggest that we should test whether the name is valid UTF-8 after unescaping, and leave it as escaped if not.
Comment 9 Simon Montagu :smontagu 2009-06-04 23:36:13 PDT
The trouble with that suggestion seems to be that it breaks non-UTF-8 hrefs to non-UTF-8 anchors.
Comment 10 Simon Montagu :smontagu 2009-06-05 00:58:20 PDT
Created attachment 381731 [details]

This testcase incorporates Jesse's "named anchors" bookmarklet from to display the anchor names.

I have a patch that makes all the anchors and links to them work as expected, which I will attach after more testing (e.g. with escaped codepoints that are valid in neither UTF-8 nor the page encoding)
Comment 11 Boris Zbarsky [:bz] (still a bit busy) 2009-06-05 08:54:23 PDT
Simon, how would that testcase fare if we instead just implemented ?
Comment 12 Simon Montagu :smontagu 2009-06-07 01:16:30 PDT
If I am interpreting that whatwg reference correctly, then the other UAs mentioned in comment 8 are conforming to it and we are not, at least with respect to parsing the name attributes.

With respect to following the hrefs, IE's behaviour seems so be the most conformant: basically the links should only work (i.e. scroll to the anchors), when the link encoding is the same as the anchor encoding (ignoring NCRs as a separate can of worms for the time being).

I think that for us to implement it would mean cutting out the whole block in nsHTMLContentSink::AddAttributes that unescapes the name attribute, and modifying nsPresShell::GoToAnchor and its callers to search for id and name attributes separately with different input, and changing the fallback if UTF-8 conversion fails to use the escaped string rather than attempting to decode the unescaped string in the document charset. Does that sound right?

Those callers also assert as they stand, by the way, but we can prevent that by calling IsUTF8 before conversion instead of attempting to convert all input as UTF-8 and falling back if the conversion fails.
Comment 13 Simon Montagu :smontagu 2009-06-07 01:50:02 PDT
Created attachment 382008 [details]
Attempt to express the behaviour described on in the testcase
Comment 14 Boris Zbarsky [:bz] (still a bit busy) 2009-06-07 06:26:00 PDT
> Does that sound right?

Yes.  Would that work with your first testcase?
Comment 15 Simon Montagu :smontagu 2009-06-07 07:20:57 PDT
(In reply to comment #14)
> Would that work with your first testcase?

What are you asking exactly? The two testcases are the same, except for the prose describing the expected results, and except that I forgot to specify the charset in the second one until this moment ;-)
Comment 16 Boris Zbarsky [:bz] (still a bit busy) 2009-06-07 07:30:46 PDT
I guess my question was really two questions:

1)  Does the HTML5 spec give the expected behavior on the first testcase?
2)  If not, is that because of a case that we think should actually work, or just
    because the testcase is exhaustive in terms of testing things that we make
    work right now?
Comment 17 Simon Montagu :smontagu 2009-06-09 11:17:06 PDT
So after reading a bunch of related bugs that Boris pointed me to, I would say that in the long term we probably want to do what HTML5 says, but making any single change in that direction would be like pushing over a domino, and we can't really do it without considering where every other domino would fall.

In the short term, we might want to take a patch as described in comment 10, which makes anchors as in (from bug 496336) not be totally broken, and is, I think, internally consistent with what we do already.
Comment 18 Simon Montagu :smontagu 2009-06-09 11:54:19 PDT
That said, I'm not sure that there is a "short term" at this point -- we probably want to do the "long term" fix by the next release, and it's much too late to get the "short term" fix in for 1.9.1.
Comment 19 Jonas Sicking (:sicking) No longer reading bugmail consistently 2009-06-09 14:16:52 PDT
Yeah, I don't think any changes here would be safe enough for 1.9.1. And since it's not flagged as a regression (is this correct) I doubt that it's worth taking on a branch.

If this is indeed a regression I think we should try to get it into 1.9.1.x though.
Comment 20 Simon Montagu :smontagu 2009-06-09 15:03:44 PDT
It's not a regression: the code that causes the assertion was checked in back in 2001 in bug 87996.
Comment 21 Johnny Stenback (:jst, 2009-07-28 15:05:01 PDT
Not holding 1.9.2 for this.
Comment 22 Jesse Ruderman 2009-08-25 20:16:18 PDT
WFM on trunk.
Comment 23 Simon Montagu :smontagu 2009-08-26 11:46:20 PDT
I still see the assertions on trunk (OS X and Linux)
Comment 24 Jesse Ruderman 2010-06-03 01:10:03 PDT
I only see the assertions if I set html5.enable to false.
Comment 25 Boris Zbarsky [:bz] (still a bit busy) 2010-07-27 13:42:08 PDT
That's because the html5 parser knocked over one of the dominoes mentioned in comment 17 but not the others.  So now we don't assert, but we're not web-compatible either.
Comment 26 Jesse Ruderman 2010-11-18 17:08:51 PST
Now I get the assertions with the default setting (html5.enable true).
Comment 27 Henri Sivonen (:hsivonen) 2010-11-19 00:51:04 PST
(In reply to comment #26)
> Now I get the assertions with the default setting (html5.enable true).

That's because I carefully copied and pasted the old bogus code to the HTML5 side in order to keep fragment refs working in the Gecko 2.0 timeframe.

Note You need to log in before you can comment on or make changes to this bug.