Closed Bug 421576 Opened 16 years ago Closed 16 years ago

Unpaired surrogate handled wrongly (Acid3 #68)


(Core :: XPCOM, defect)

Not set





(Reporter: aguertin+bugzilla, Assigned: Waldo)


(Blocks 1 open bug)


(Keywords: verified1.9.0.11)


(1 file, 3 obsolete files)

2008-03-07-04 linux trunk nightly

Acid3 test 68 reports that we handle unpaired halves of a surrogate pair wrongly.
Hexdumping the report page, I get
"input was ED A1 A3text"
"output was EF BF BDext"

(although that very well could have gone through the bug cycle a few times and be different from the original text)

Is this in the right component?
Blocks: 421546
After staring at this for a while, I'm pretty sure HTML parser is not the right component, but I don't know what is.
Component: HTML: Parser → General
QA Contact: parser → general
Escaped data of variables are as follows. (See Bug 415391 Comment #2)
>(Seamonkey 1.1.7)
> before : length=5 escape(before) == %uD863text
>  after : length=3 escape(after)  == ext
>(Fx trunk 2008-01-27-04 build).
> before : length=5 escape(before) == %uD863text
>  after : length=4 escape(after)  == %uFFFDext
> Acid3 test 68 expects %uFFFDtext, if conversion to %uFFFD is executed.

It looks that %uD863t is converted to %uFFFD (is cut off when Branch).
As written in Comment #0, %uD863(U+D863) in UTF-8 is ED A1 A3.
3bytes of 'U+D863'+'t' in UTF-16 was converted to U+FFFD(or cut)? 
Component: General → XPCOM
OS: Linux → All
QA Contact: general → xpcom
Target Milestone: --- → Future
Attached patch Patch (obsolete) — Splinter Review
This fixes the problem for me.  The Unicode spec is unclear or at least not very explicit about exactly how many code units should be consumed for all the various types of invalid code unit sequences; the most precise text is this:

"For example, in UTF-8 every code unit of the form 110xxxx2 must be followed 
by a code unit of the form 10xxxxxx2. A sequence such as 110xxxxx2 0xxxxxxx2 
is ill-formed and must never be generated. When faced with this ill-formed 
code unit sequence while transforming or interpreting text, a conformant pro- 
cess must treat the first code unit 110xxxxx2 as an illegally terminated code
unit sequence—for example, by signaling an error, filtering the code unit out,
or representing the code unit with a marker such as U+FFFD replacement 

So we know what should happen for a two-unit UTF-8 sequence and can, I think, reasonably extrapolate the same behavior for the corresponding UTF-16 situation, but we don't really know what should happen, say, for a three- or four-unit UTF-8 sequence that's invalid or a UTF-8 sequence that's too short.  I suppose that's not relevant to this particular bug, but it'd be something I'd bring up with the Unicode people if I were involved with them at all -- they really should clarify exactly how many code units all the various possible error conditions consume, in the case where a replacement character is used.

If we had a few months for this to bake I'd say land and see what breaks, but we couldn't reasonably take this now and expect to be able to sniff out all possible regressions and potential inconsistencies this might introduce before release.  Consider this, then, as a gift to self-builders and something to consider for a future release.
Assignee: nobody → jwalden+bmo
To Jeff Walden:
Can phenomenon of Bug 421546 be explained by this bug's cause?
 - After error message of "Test 68 fails", 
   first character of any word in following error messages is lost
Attached patch Patch with tests (obsolete) — Splinter Review
A closer look suggests I missed one spot in the last patch, but that's not really a huge surprise -- as far as I can tell after searching through mazes of twisty little string headers and pondering arcane traits-usage, the spot I missed is used in exactly one location in the entire tree, to determine the accesskey specified by that attribute on menus (crazy, but it seems to be true):

I haven't tested this understanding, but if I'm right we could drop the second NextChar and replace it with the one everything else uses.  That can be a followup change, if it can be made.
Attachment #309688 - Attachment is obsolete: true
Attachment #321467 - Flags: review?(dbaron)
Actually, the extra change is so small I'm just going to fold it in here.  Patch builds fine, build seems to work okay (although I'll probably want to get an independent verification that accesskeys still work, since they're less than useful on OS X sadly)), and less code to change to fix this bug at the same time.
Attachment #321467 - Attachment is obsolete: true
Attachment #321477 - Flags: review?(dbaron)
Attachment #321467 - Flags: review?(dbaron)
Comment on attachment 321477 [details] [diff] [review]
Alternate patch that removes the 

>+            // Unicode replacement character 0xFFFD.  Note that the
>+            // pointer to the next character points to the second 16-bit
>+            // value, not beyond it, as per Unicode 5.0.0 Chapter 3 C10,
>+            // only the first code unit of an illegal sequence must be
>+            // treated as an illegally terminated code unit sequence.

You might want to cite D91 as well in these comments.

>diff --git a/xpcom/tests/TestEncoding.cpp b/xpcom/tests/TestEncoding.cpp

It might also be worth testing:
 * a pair of surrogates in the correct order
 * a pair of surrogates in backwards order (should yield two U+FFFD characters)

Attachment #321477 - Flags: review?(dbaron) → review+
Attached patch Final patchSplinter Review
This is what I'll be committing whenever the tree gets around to opening.  I added note of D91 as well as the two other tests, and I did a build in a non-debug (non-libxul more likely) build to make sure everything worked okay there.  It's a pity libxul screws up access to so many functions.

I had the original patch at the top of my patch queue atop another set of changes to C++ unit tests, so I had to do some patch folding, splitting, and reordering to get something that applies directly to tip.  Mercurial made that surprisingly difficult (seems like it should have been able to three-way-merge for the win), but I suspect folding, targeted qrefs, and qnew -f will come more readily to my fingers in the future, if nothing else.
Attachment #321477 - Attachment is obsolete: true
Attachment #321980 - Flags: review+
Fixt in mozilla-central.
Closed: 16 years ago
Flags: in-testsuite+
Resolution: --- → FIXED
Target Milestone: Future → mozilla2.0
verified fixed using  Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US; rv:1.9.1b1pre) Gecko/20080910043000 Minefield/3.1b1pre and the acid3 tests
Blocks: 396127
Comment on attachment 321980 [details] [diff] [review]
Final patch

Approved for, a=dveditz
Attachment #321980 - Flags: approval1.9.0.11+
Blocks: 490513
Blocks: 489041
Landed on CVS trunk (for, 2009-05-06 18:33 -0700.
Keywords: fixed1.9.0.11
Verified for with Mozilla/5.0 (X11; U; Linux i686; en-US; rv: Gecko/2009051404 GranParadiso/3.0.11pre.
(In reply to comment #3)
A followup on processing ill-formed code units during decoding:

As it turns out the Unicode people were working on clarifying some aspects of this at the time I made the comment in UTR# 36 and Unicode 5.1.0 (finished around April 2008, a month after I wrote comment 3):

However, it seems they were apparently unwilling to define as far as I had wished (in particular, exactly how many characters to consume in the case of a >2 code unit sequence where more than the first, but not all, code units appear correct), as illustrated by this paragraph:

"Although a UTF-8 conversion process is required to never consume well-formed subsequences as part of its error handling for ill-formed subsequences, such a process is not otherwise constrained in how it deals with any ill-formed subsequence itself. An ill-formed subsequence consisting of more than one code unit could be treated as a single error or as multiple errors. For example, in processing the UTF-8 code unit sequence <F0 80 80 41>, the only requirement on a converter is that the <41> be processed and correctly interpreted as <U+0041>. The converter could return <U+FFFD, U+0041>, handling <F0 80 80> as a single error, or <U+FFFD, U+FFFD, U+FFFD, U+0041>, handling each byte of <F0 80 80> as a separate error, or could take other approaches to signalling <F0 80 80> as an ill-formed code unit subsequence. "

(Incidentally, I discovered this while looking through intl/ testcases and seeing one with a name that sounded curiously indicative of addressing some of the problems I noted here, and lo and behold, it was!  I think this is an excellent argument for giving automated tests descriptive names, and I may end up using this comment in that manner in the future.)
You need to log in before you can comment on or make changes to this bug.