Open Bug 436006 Opened 16 years ago Updated 7 months ago

"ASSERTION: Not a UTF-8 string" with <a name="%e2g">

Categories

(Core :: DOM: HTML Parser, defect)

x86
All
defect

Tracking

()

People

(Reporter: jruderman, Unassigned)

References

()

Details

(Keywords: assertion, testcase)

Attachments

(4 files)

Attached file testcase
Loading the testcase triggers:

###!!! ASSERTION: Not a UTF-8 string. This code should only be used for converting from known UTF-8 strings.: 'Error', file ../../../dist/include/string/nsUTF8Utils.h, line 596

###!!! ASSERTION: not a UTF8 string: 'Error', file ../../dist/include/string/nsUTF8Utils.h, line 130

###!!! ASSERTION: Input wasn't UTF8 or incorrect length was calculated: 'Error', file /Users/jruderman/central/mozilla/xpcom/string/src/nsReadableUtils.cpp, line 287

I found this bug by simply loading random web sites and looking for interesting assertion failures.
Attached file stack trace
The stack trace blames HTMLContentSink::AddAttributes.
Yeah, we nsUnescape @name on <html:a>.  Which could introduce all sorts of random bytes...
Flags: wanted1.9.0.x?
Flags: wanted1.9.1?
Flags: wanted1.9.0.x?
Flags: wanted1.9.0.x+
OS: Mac OS X → All
So I was going to ask why we assume that @name is in UTF-8, rather than the document encoding, and found the answer in bug 87966 comment 5: the HTML spec requires it (http://www.w3.org/TR/html4/appendix/notes.html#non-ascii-chars).

So strictly |<a name=%e2g>| is invalid, but surely the code in nsHTMLContentSink::AddAttributes should at least check for non-UTF8 @name, and maybe even try falling back to the page encoding?

What do other UAs do?
Clarification: when I say "non-UTF8" I mean "non-UTF8 when unescaped".
I believe that other UAs treat it as UTF-8, but it's worth experimenting.
As far as I can tell, no other UA unescapes non-ASCII characters in name attributes, ever, in any encoding. I tested Safari 3.2.3 on Mac, and IE7, IE8, Opera9.52 and Opera9.64 on Windows.

Given that, I suggest that we should test whether the name is valid UTF-8 after unescaping, and leave it as escaped if not.
The trouble with that suggestion seems to be that it breaks non-UTF-8 hrefs to non-UTF-8 anchors.
Attached file Testcase
This testcase incorporates Jesse's "named anchors" bookmarklet from squarefree.com to display the anchor names.

I have a patch that makes all the anchors and links to them work as expected, which I will attach after more testing (e.g. with escaped codepoints that are valid in neither UTF-8 nor the page encoding)
Assignee: nobody → smontagu
Attachment #381731 - Attachment mime type: text/html → text/html; charset=iso-8859-1
Simon, how would that testcase fare if we instead just implemented http://www.whatwg.org/specs/web-apps/current-work/multipage/history.html#scroll-to-fragid ?
Depends on: 483660
If I am interpreting that whatwg reference correctly, then the other UAs mentioned in comment 8 are conforming to it and we are not, at least with respect to parsing the name attributes.

With respect to following the hrefs, IE's behaviour seems so be the most conformant: basically the links should only work (i.e. scroll to the anchors), when the link encoding is the same as the anchor encoding (ignoring NCRs as a separate can of worms for the time being).

I think that for us to implement it would mean cutting out the whole block in nsHTMLContentSink::AddAttributes that unescapes the name attribute, and modifying nsPresShell::GoToAnchor and its callers to search for id and name attributes separately with different input, and changing the fallback if UTF-8 conversion fails to use the escaped string rather than attempting to decode the unescaped string in the document charset. Does that sound right?

Those callers also assert as they stand, by the way, but we can prevent that by calling IsUTF8 before conversion instead of attempting to convert all input as UTF-8 and falling back if the conversion fails.
> Does that sound right?

Yes.  Would that work with your first testcase?
Attachment #382008 - Attachment mime type: text/html → text/html; charset=iso-8859-1
(In reply to comment #14)
> Would that work with your first testcase?

What are you asking exactly? The two testcases are the same, except for the prose describing the expected results, and except that I forgot to specify the charset in the second one until this moment ;-)
I guess my question was really two questions:

1)  Does the HTML5 spec give the expected behavior on the first testcase?
2)  If not, is that because of a case that we think should actually work, or just
    because the testcase is exhaustive in terms of testing things that we make
    work right now?
So after reading a bunch of related bugs that Boris pointed me to, I would say that in the long term we probably want to do what HTML5 says, but making any single change in that direction would be like pushing over a domino, and we can't really do it without considering where every other domino would fall.

In the short term, we might want to take a patch as described in comment 10, which makes anchors as in http://mtv3.fi (from bug 496336) not be totally broken, and is, I think, internally consistent with what we do already.
That said, I'm not sure that there is a "short term" at this point -- we probably want to do the "long term" fix by the next release, and it's much too late to get the "short term" fix in for 1.9.1.
Yeah, I don't think any changes here would be safe enough for 1.9.1. And since it's not flagged as a regression (is this correct) I doubt that it's worth taking on a branch.

If this is indeed a regression I think we should try to get it into 1.9.1.x though.
Flags: wanted1.9.1? → wanted1.9.1.x?
It's not a regression: the code that causes the assertion was checked in back in 2001 in bug 87996.
Flags: wanted1.9.1.x?
Flags: wanted1.9.0.x+
Flags: blocking1.9.2?
Not holding 1.9.2 for this.
Flags: wanted-next+
Flags: blocking1.9.2?
Flags: blocking1.9.2-
WFM on trunk.
I still see the assertions on trunk (OS X and Linux)
I only see the assertions if I set html5.enable to false.
That's because the html5 parser knocked over one of the dominoes mentioned in comment 17 but not the others.  So now we don't assert, but we're not web-compatible either.
Now I get the assertions with the default setting (html5.enable true).
(In reply to comment #26)
> Now I get the assertions with the default setting (html5.enable true).

That's because I carefully copied and pasted the old bogus code to the HTML5 side in order to keep fragment refs working in the Gecko 2.0 timeframe.

The bug assignee is inactive on Bugzilla, so the assignee is being reset.

Assignee: smontagu → nobody
Severity: normal → S3
See Also: → 582361
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: