Closed Bug 421576 Opened 17 years ago Closed 17 years ago

Unpaired surrogate handled wrongly (Acid3 #68)

Tracking

()

Status:

VERIFIED FIXED

Milestone:

mozilla2.0

People

(Reporter: aguertin+bugzilla, Assigned: Waldo)

References

(Blocks 1 open bug)

Details

(Keywords: verified1.9.0.11)

Attachments

(1 file, 3 obsolete files)

Patch 17 years ago Jeff Walden [:Waldo] 3.98 KB, patch		Details \| Diff \| Splinter Review
Patch with tests 17 years ago Jeff Walden [:Waldo] 8.68 KB, patch		Details \| Diff \| Splinter Review
Alternate patch that removes the 17 years ago Jeff Walden [:Waldo] 10.75 KB, patch	dbaron : review+	Details \| Diff \| Splinter Review
Final patch 17 years ago Jeff Walden [:Waldo] 15.82 KB, patch	Waldo : review+ dveditz : approval1.9.0.11+	Details \| Diff \| Splinter Review

dolphinling

Reporter

Description

•

17 years ago

2008-03-07-04 linux trunk nightly Acid3 test 68 reports that we handle unpaired halves of a surrogate pair wrongly. Hexdumping the report page, I get "input was ED A1 A3text" "output was EF BF BDext" (although that very well could have gone through the bug cycle a few times and be different from the original text) Is this in the right component?

dolphinling

Reporter

Updated

•

17 years ago

Blocks: 421546

WADA:World Anti-bad-Duping Agency

Updated

•

17 years ago

Blocks: 415391

dolphinling

Reporter

Comment 1

•

17 years ago

After staring at this for a while, I'm pretty sure HTML parser is not the right component, but I don't know what is.

Component: HTML: Parser → General

QA Contact: parser → general

WADA:World Anti-bad-Duping Agency

Comment 2

•

17 years ago

Escaped data of variables are as follows. (See Bug 415391 Comment #2) >(Seamonkey 1.1.7) > before : length=5 escape(before) == %uD863text > after : length=3 escape(after) == ext >(Fx trunk 2008-01-27-04 build). > before : length=5 escape(before) == %uD863text > after : length=4 escape(after) == %uFFFDext > Acid3 test 68 expects %uFFFDtext, if conversion to %uFFFD is executed. It looks that %uD863t is converted to %uFFFD (is cut off when Branch). As written in Comment #0, %uD863(U+D863) in UTF-8 is ED A1 A3. 3bytes of 'U+D863'+'t' in UTF-16 was converted to U+FFFD(or cut)?

Jeff Walden [:Waldo]

Assignee

Updated

•

17 years ago

Component: General → XPCOM

OS: Linux → All

QA Contact: general → xpcom

Target Milestone: --- → Future

Jeff Walden [:Waldo]

Assignee

Comment 3

•

17 years ago

Attached patch Patch (obsolete) — Details — Splinter Review

This fixes the problem for me. The Unicode spec is unclear or at least not very explicit about exactly how many code units should be consumed for all the various types of invalid code unit sequences; the most precise text is this: "For example, in UTF-8 every code unit of the form 110xxxx2 must be followed by a code unit of the form 10xxxxxx2. A sequence such as 110xxxxx2 0xxxxxxx2 is ill-formed and must never be generated. When faced with this ill-formed code unit sequence while transforming or interpreting text, a conformant pro- cess must treat the first code unit 110xxxxx2 as an illegally terminated code unit sequence—for example, by signaling an error, filtering the code unit out, or representing the code unit with a marker such as U+FFFD replacement character." So we know what should happen for a two-unit UTF-8 sequence and can, I think, reasonably extrapolate the same behavior for the corresponding UTF-16 situation, but we don't really know what should happen, say, for a three- or four-unit UTF-8 sequence that's invalid or a UTF-8 sequence that's too short. I suppose that's not relevant to this particular bug, but it'd be something I'd bring up with the Unicode people if I were involved with them at all -- they really should clarify exactly how many code units all the various possible error conditions consume, in the case where a replacement character is used. If we had a few months for this to bake I'd say land and see what breaks, but we couldn't reasonably take this now and expect to be able to sniff out all possible regressions and potential inconsistencies this might introduce before release. Consider this, then, as a gift to self-builders and something to consider for a future release.

Assignee: nobody → jwalden+bmo

Status: NEW → ASSIGNED

WADA:World Anti-bad-Duping Agency

Comment 4

•

17 years ago

To Jeff Walden: Can phenomenon of Bug 421546 be explained by this bug's cause? - After error message of "Test 68 fails", first character of any word in following error messages is lost

WADA:World Anti-bad-Duping Agency

Updated

•

17 years ago

Blocks: 423228

Jeff Walden [:Waldo]

Assignee

Comment 5

•

17 years ago

Attached patch Patch with tests (obsolete) — Details — Splinter Review

A closer look suggests I missed one spot in the last patch, but that's not really a huge surprise -- as far as I can tell after searching through mazes of twisty little string headers and pondering arcane traits-usage, the spot I missed is used in exactly one location in the entire tree, to determine the accesskey specified by that attribute on menus (crazy, but it seems to be true): http://mxr.mozilla.org/mozilla/search?string=NextChar&case=on&find=&findi=&filter=&hitlimit=&tree=mozilla http://bonsai.mozilla.org/cvsblame.cgi?file=/mozilla/layout/xul/base/src/nsMenuBarFrame.cpp&rev=1.172&mark=250#235 I haven't tested this understanding, but if I'm right we could drop the second NextChar and replace it with the one everything else uses. That can be a followup change, if it can be made.

Attachment #309688 - Attachment is obsolete: true

Attachment #321467 - Flags: review?(dbaron)

Jeff Walden [:Waldo]

Assignee

Comment 6

•

17 years ago

Attached patch Alternate patch that removes the (obsolete) — Details — Splinter Review

Actually, the extra change is so small I'm just going to fold it in here. Patch builds fine, build seems to work okay (although I'll probably want to get an independent verification that accesskeys still work, since they're less than useful on OS X sadly)), and less code to change to fix this bug at the same time.

Attachment #321467 - Attachment is obsolete: true

Attachment #321477 - Flags: review?(dbaron)

Attachment #321467 - Flags: review?(dbaron)

David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla)

Comment 7

•

17 years ago

Comment on attachment 321477 [details] [diff] [review] Alternate patch that removes the >+ // Unicode replacement character 0xFFFD. Note that the >+ // pointer to the next character points to the second 16-bit >+ // value, not beyond it, as per Unicode 5.0.0 Chapter 3 C10, >+ // only the first code unit of an illegal sequence must be >+ // treated as an illegally terminated code unit sequence. You might want to cite D91 as well in these comments. >diff --git a/xpcom/tests/TestEncoding.cpp b/xpcom/tests/TestEncoding.cpp It might also be worth testing: * a pair of surrogates in the correct order * a pair of surrogates in backwards order (should yield two U+FFFD characters) r=dbaron

Attachment #321477 - Flags: review?(dbaron) → review+

Jeff Walden [:Waldo]

Assignee

Comment 8

•

17 years ago

Attached patch Final patch — Details — Splinter Review

This is what I'll be committing whenever the tree gets around to opening. I added note of D91 as well as the two other tests, and I did a build in a non-debug (non-libxul more likely) build to make sure everything worked okay there. It's a pity libxul screws up access to so many functions. I had the original patch at the top of my patch queue atop another set of changes to C++ unit tests, so I had to do some patch folding, splitting, and reordering to get something that applies directly to tip. Mercurial made that surprisingly difficult (seems like it should have been able to three-way-merge for the win), but I suspect folding, targeted qrefs, and qnew -f will come more readily to my fingers in the future, if nothing else.

Attachment #321477 - Attachment is obsolete: true

Attachment #321980 - Flags: review+

Jeff Walden [:Waldo]

Assignee

Comment 9

•

17 years ago

Fixt in mozilla-central.

Status: ASSIGNED → RESOLVED

Closed: 17 years ago

Flags: in-testsuite+

Resolution: --- → FIXED

Target Milestone: Future → mozilla2.0

Carsten Book [:Tomcat]

Comment 12

•

16 years ago

verified fixed using Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US; rv:1.9.1b1pre) Gecko/20080910043000 Minefield/3.1b1pre and the acid3 tests

Status: RESOLVED → VERIFIED

Simon Montagu :smontagu

Updated

•

16 years ago

Blocks: 396127

Daniel Veditz [:dveditz]

Comment 13

•

16 years ago

Comment on attachment 321980 [details] [diff] [review] Final patch Approved for 1.9.0.11, a=dveditz

Attachment #321980 - Flags: approval1.9.0.11+

Daniel Veditz [:dveditz]

Updated

•

16 years ago

Blocks: 490513

Daniel Veditz [:dveditz]

Updated

•

16 years ago

Blocks: 489041

David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla)

Comment 14

•

16 years ago

Landed on CVS trunk (for 1.9.0.11), 2009-05-06 18:33 -0700.

Keywords: fixed1.9.0.11

Al Billings [:abillings - ex-MoCo]

Comment 15

•

16 years ago

Verified for 1.9.0.11 with Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.11pre) Gecko/2009051404 GranParadiso/3.0.11pre.

Keywords: fixed1.9.0.11 → verified1.9.0.11

Jeff Walden [:Waldo]

Assignee

Comment 16

•

16 years ago

(In reply to comment #3) A followup on processing ill-formed code units during decoding: As it turns out the Unicode people were working on clarifying some aspects of this at the time I made the comment in UTR# 36 and Unicode 5.1.0 (finished around April 2008, a month after I wrote comment 3): http://unicode.org/reports/tr36/#Ill-Formed_Subsequences However, it seems they were apparently unwilling to define as far as I had wished (in particular, exactly how many characters to consume in the case of a >2 code unit sequence where more than the first, but not all, code units appear correct), as illustrated by this paragraph: "Although a UTF-8 conversion process is required to never consume well-formed subsequences as part of its error handling for ill-formed subsequences, such a process is not otherwise constrained in how it deals with any ill-formed subsequence itself. An ill-formed subsequence consisting of more than one code unit could be treated as a single error or as multiple errors. For example, in processing the UTF-8 code unit sequence <F0 80 80 41>, the only requirement on a converter is that the <41> be processed and correctly interpreted as <U+0041>. The converter could return <U+FFFD, U+0041>, handling <F0 80 80> as a single error, or <U+FFFD, U+FFFD, U+FFFD, U+0041>, handling each byte of <F0 80 80> as a separate error, or could take other approaches to signalling <F0 80 80> as an ill-formed code unit subsequence. " http://www.unicode.org/versions/Unicode5.1.0/#Conformance_Changes (Incidentally, I discovered this while looking through intl/ testcases and seeing one with a name that sounded curiously indicative of addressing some of the problems I noted here, and lo and behold, it was! I think this is an excellent argument for giving automated tests descriptive names, and I may end up using this comment in that manner in the future.)

You need to log in before you can comment on or make changes to this bug.