2008-03-07-04 Linux trunk nightly: Acid3 test 68 reports that we handle unpaired halves of a surrogate pair wrongly. Hexdumping the report page, I get "input was ED A1 A3text" and "output was EF BF BDext" (although that could very well have gone through the bug cycle a few times and differ from the original text). Is this in the right component?
After staring at this for a while, I'm pretty sure HTML parser is not the right component, but I don't know what is.
Escaped data of the variables are as follows (see Bug 415391 comment #2):

(SeaMonkey 1.1.7)
> before : length=5  escape(before) == %uD863text
> after  : length=3  escape(after)  == ext

(Fx trunk 2008-01-27-04 build)
> before : length=5  escape(before) == %uD863text
> after  : length=4  escape(after)  == %uFFFDext

Acid3 test 68 expects %uFFFDtext, if conversion to %uFFFD is performed. It looks like %uD863 plus the following 't' was converted to a single %uFFFD on trunk (and simply cut off on the branch). As written in comment #0, %uD863 (U+D863) in UTF-8 is ED A1 A3. So were the two UTF-16 code units 'U+D863' + 't' together converted to U+FFFD (or cut)?
Created attachment 309688 [details] [diff] [review]
Patch

This fixes the problem for me.

The Unicode spec is unclear, or at least not very explicit, about exactly how many code units should be consumed for all the various types of invalid code unit sequences; the most precise text is this:

"For example, in UTF-8 every code unit of the form 110xxxxx₂ must be followed by a code unit of the form 10xxxxxx₂. A sequence such as 110xxxxx₂ 0xxxxxxx₂ is ill-formed and must never be generated. When faced with this ill-formed code unit sequence while transforming or interpreting text, a conformant process must treat the first code unit 110xxxxx₂ as an illegally terminated code unit sequence—for example, by signaling an error, filtering the code unit out, or representing the code unit with a marker such as U+FFFD replacement character."

So we know what should happen for a two-unit UTF-8 sequence and can, I think, reasonably extrapolate the same behavior to the corresponding UTF-16 situation, but we don't really know what should happen for, say, an invalid three- or four-unit UTF-8 sequence, or a UTF-8 sequence that's too short. I suppose that's not relevant to this particular bug, but it's something I'd bring up with the Unicode people if I were involved with them at all -- they really should clarify exactly how many code units each of the various possible error conditions consumes when a replacement character is used.

If we had a few months for this to bake I'd say land it and see what breaks, but we couldn't reasonably take this now and expect to be able to sniff out all the possible regressions and inconsistencies this might introduce before release. Consider this, then, a gift to self-builders and something to consider for a future release.
To Jeff Walden: Can the phenomenon of Bug 421546 be explained by this bug's cause? After the "Test 68 fails" error message, the first character of every word in the following error messages is lost.
Created attachment 321467 [details] [diff] [review]
Patch with tests

A closer look suggests I missed one spot in the last patch, but that's not really a huge surprise -- as far as I can tell after searching through mazes of twisty little string headers and pondering arcane traits usage, the spot I missed is used in exactly one location in the entire tree: to determine the accesskey specified by that attribute on menus (crazy, but it seems to be true):

http://mxr.mozilla.org/mozilla/search?string=NextChar&case=on&find=&findi=&filter=&hitlimit=&tree=mozilla
http://bonsai.mozilla.org/cvsblame.cgi?file=/mozilla/layout/xul/base/src/nsMenuBarFrame.cpp&rev=1.172&mark=250#235

I haven't tested this understanding, but if I'm right we could drop the second NextChar and replace it with the one everything else uses. That can be a followup change, if it can be made.
Created attachment 321477 [details] [diff] [review]
Alternate patch that removes the

Actually, the extra change is so small I'm just going to fold it in here. The patch builds fine, the build seems to work okay (although I'll probably want an independent verification that accesskeys still work, since they're less than useful on OS X, sadly), and it means less code to change to fix this bug at the same time.
Comment on attachment 321477 [details] [diff] [review]
Alternate patch that removes the

>+    // Unicode replacement character 0xFFFD. Note that the
>+    // pointer to the next character points to the second 16-bit
>+    // value, not beyond it, as per Unicode 5.0.0 Chapter 3 C10,
>+    // only the first code unit of an illegal sequence must be
>+    // treated as an illegally terminated code unit sequence.

You might want to cite D91 as well in these comments.

>diff --git a/xpcom/tests/TestEncoding.cpp b/xpcom/tests/TestEncoding.cpp

It might also be worth testing:
 * a pair of surrogates in the correct order
 * a pair of surrogates in backwards order (should yield two U+FFFD characters)

r=dbaron
Created attachment 321980 [details] [diff] [review]
Final patch

This is what I'll be committing whenever the tree gets around to opening. I added the D91 citation as well as the two other tests, and I did a non-debug (more to the point, non-libxul) build to make sure everything worked okay there. It's a pity libxul screws up access to so many functions.

I had the original patch at the top of my patch queue, atop another set of changes to C++ unit tests, so I had to do some patch folding, splitting, and reordering to get something that applies directly to tip. Mercurial made that surprisingly difficult (it seems like it should have been able to three-way-merge for the win), but I suspect folding, targeted qrefs, and qnew -f will come more readily to my fingers in the future, if nothing else.
Fixed in mozilla-central.
Verified fixed using Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US; rv:1.9.1b1pre) Gecko/20080910043000 Minefield/3.1b1pre and the Acid3 tests.
Comment on attachment 321980 [details] [diff] [review] Final patch Approved for 220.127.116.11, a=dveditz
Landed on CVS trunk (for 18.104.22.168), 2009-05-06 18:33 -0700.
Verified for 22.214.171.124 with Mozilla/5.0 (X11; U; Linux i686; en-US; rv:126.96.36.199pre) Gecko/2009051404 GranParadiso/3.0.11pre.
(In reply to comment #3)

A followup on processing ill-formed code units during decoding: as it turns out, the Unicode people were working on clarifying some aspects of this, in UTR #36 and Unicode 5.1.0 (finished around April 2008, a month after I wrote comment 3), at the time I made the comment:

http://unicode.org/reports/tr36/#Ill-Formed_Subsequences

However, they were apparently unwilling to go as far as I had wished (in particular, to define exactly how many code units to consume for a sequence of more than two code units where more than the first, but not all, of the code units appear correct), as illustrated by this paragraph:

"Although a UTF-8 conversion process is required to never consume well-formed subsequences as part of its error handling for ill-formed subsequences, such a process is not otherwise constrained in how it deals with any ill-formed subsequence itself. An ill-formed subsequence consisting of more than one code unit could be treated as a single error or as multiple errors. For example, in processing the UTF-8 code unit sequence <F0 80 80 41>, the only requirement on a converter is that the <41> be processed and correctly interpreted as <U+0041>. The converter could return <U+FFFD, U+0041>, handling <F0 80 80> as a single error, or <U+FFFD, U+FFFD, U+FFFD, U+0041>, handling each byte of <F0 80 80> as a separate error, or could take other approaches to signalling <F0 80 80> as an ill-formed code unit subsequence."

http://www.unicode.org/versions/Unicode5.1.0/#Conformance_Changes

(Incidentally, I discovered this while looking through intl/ testcases and seeing one whose name sounded curiously indicative of addressing some of the problems I noted here -- and lo and behold, it was! I think this is an excellent argument for giving automated tests descriptive names, and I may end up using this comment that way in the future.)