Closed Bug 481308 Opened 16 years ago Closed 16 years ago

ByteArray.toString() behaves differently for BOM-less UTF8

Categories

(Tamarin Graveyard :: Virtual Machine, defect)

defect
Not set
normal

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: stejohns, Assigned: daumling)

References

Details

Attachments

(4 obsolete files)

Prior to the new strings, ByteArray.toString (and probably also FileClass::read) processed a data stream without a BOM by calling newString, which implicitly treated the data as UTF8. The new code treats such data as Latin1, which is an incompatible change. We need to mimic what the old code did, including how it handled non-UTF8 data fed via this route.
The patch first checks for a valid UTF-8 sequence; if that fails, it creates a simple Latin-1 string. The fix favors memory over performance by checking for the validity of the UTF-8 sequence before attempting to create a string.
Attachment #365365 - Flags: review?(stejohns)
Assigned to myself.
Status: NEW → ASSIGNED
Attachment #365365 - Flags: review?(stejohns) → review+
Comment on attachment 365365 [details] [diff] [review] Test for UTF-8 first, then Latin-1 Sounds right -- a similar fix is needed in FileClass too, I think.
More or less a cut&paste operation from the ByteArrayGlue fix. The ByteArrayGlus fix is part of the new patch.
Attachment #365365 - Attachment is obsolete: true
Attachment #365474 - Flags: review?(stejohns)
Actually, I think a more sophisticated fix is needed, to mimic the existing behavior, which seems to convert as much as possible, then does something odd. A snippet of code: var b : ByteArray = new ByteArray(); b[0]=0xE4; // 19968 b[1]=0xB8; b[2]=0x80; b[3]=0xE4; // 20108 b[4]=0xBA; b[5]=0x8C; b[6]=0xE4; // 19977 b[7]=0xB8; b[8]=0x89; var bs:String = b.toString(); trace("valid utf8 sequence:",bs); for (var i = 0; i < bs.length; ++i) trace(" cc=",bs.charCodeAt(i)); var b2 : ByteArray = new ByteArray(); b2[0]=0xE4; // 19968 b2[1]=0xB8; b2[2]=0x80; b2[3]=0xE4; // bad sequence b2[4]=0xE4; b2[5]=0xE4; var b2s:String = b2.toString(); trace("partially invalid utf8 sequence:",b2s); for (var i = 0; i < b2s.length; ++i) trace(" cc=",b2s.charCodeAt(i)); var b3 : ByteArray = new ByteArray(); b3[3]=0xE4; // bad sequence b3[4]=0xE4; b3[5]=0xE4; var b3s:String = b3.toString(); trace("completely invalid utf8 sequence:",b3s); for (var i = 0; i < b3s.length; ++i) trace(" cc=",b3s.charCodeAt(i)); run in Flash10, this yields valid utf8 sequence: ??? (high-order chars don't paste in here well, but the char codes are accurate) cc= 19968 cc= 20108 cc= 19977 partially invalid utf8 sequence: ?äää cc= 19968 cc= 228 cc= 228 cc= 228 completely invalid utf8 sequence: cc= 0 cc= 0 cc= 0 cc= 228 cc= 228 cc= 228 which is wacky, but behavior we need to preserve (for ByteArray anyway)...
Flags: in-testsuite?
This patch is a little bigger, because it fixes two other bugs along the way. 1. UnicodeUtils::Utf8ToUtf16() has an additional strict flag; if set to true, the method returns -1 on a malformed UTF-8 sequence; if false, the old behavior remains, where invalid UTF-sequences were stored as characters. 2. String::createUTF8() has this new strict flag, and defaults to true. 3. AvmCore::findInternedString() uses strict == false to be backwards compatible 4. AvmCore::newStringUtf8() has an additional strict flag, which defaults to false. 5. FileClass and ByteArrayGlue use AvmCore::newStringUtf8() with strict explicitly set to false to maintain backwards compatibility. 6. Fix for bug #473995: The headers say that the character buffer must not be null. There is an assert if the buffer is null. All create methods use a new static method checkForCachedStrings() that checks if a cached string can be returned. 7. Fix for bug #473998 along the way: header documentation does not mention out-of-memory anymore. 8. Changed primitive data in UnicodeUtils to be C99 compatible (except for uint8* for UTF-8 strings). There are three failing test cases on Win (not tested on other platforms): as3/DescribeType/describeType.abc : test 0 FAILED! expected <type base="Class" isDynamic="true" isFinal="true" isStatic="true" name="Foo2"> as3/DescribeType/describeType.abc : test 1 FAILED! expected <type base="Foo" isDynamic="false" isFinal="false" isStatic="false" name="Foo2"> ecma3/Date/e15_9_5_35_1.abc : Fri Apr 3 17:00:00 GMT-0700 1970.getHours() = 17 FAILED! expected: 16 The last may be related to the upcoming DST change; I cannot explain why the other two fail.
Attachment #365474 - Attachment is obsolete: true
Attachment #365999 - Flags: review?(stejohns)
Attachment #365474 - Flags: review?(stejohns)
Blocks: 473995
Blocks: 473998
Comment on attachment 365999 [details] [diff] [review] Combined patch with fixes for 473995 and 473998 (1) I suspect that the vast majority of callers to newStringUTF8 will know at compile time whether they need strict mode or not, so I think it might make sense to have newStringUTF8 always be strict, and add newStringUTF8Nonstrict() -- this would save an argument-push, and also make the oddball nonstrict case more obvious in the code (and harder to call by accident). (2) eliminating Utf8Count is a good call, but I wonder if it makes sense to provide it as an inline wrapper to Utf8toUtf16? (3) do I need to push this for you?
Attachment #365999 - Flags: review?(stejohns) → review+
1) This sounds reasonable. I was just afraid that switching newStringUTF8 to strict mode could break existing SWFs, although highly unlikely. Maybe we should wait until we have a version update. 2) I left the inline in so possible users do not stumble. It has the same effect as forcing the user to replace their calls to Utf8Count, but is less annoying. 3) Yes, please - I still do not feel confident enough.
(In reply to comment #8) > 1) This sounds reasonable. I was just afraid that switching newStringUTF8 to > strict mode could break existing SWFs, although highly unlikely. Maybe we > should wait until we have a version update. Not sure if that's a "yes, let's change to two distinct" or "no, let's keep it with the optional argument" -- if the former, what name do we want for nonstrict? > 2) I left the inline in so possible users do not stumble. It has the same > effect as forcing the user to replace their calls to Utf8Count, but is less > annoying. Nice :-) > 3) Yes, please - I still do not feel confident enough. Will do so as soon as I hear from you re: #1 above.
Sorry - I was unclear. Let's keep it with an optional argument for now to avoid introducing yet another API that people would have to change to.
Okey doke. I'll do my best to get this pushed today.
pushed to redux as changeset: 1575:9467136e0443
Status: ASSIGNED → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Unfortunately, fix is still not complete: in String::createUTF8, we call _analyzeUtf8(), which is still *always* strict -- and fails for bad input. We need a nonstrict version of _analyzeUtf8().
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
1) Rewrote _analyzeUtf8() using UnicodeUtils::Utf8ToUtf16() as a template to ensure identical behavior. 2) Changed all String constructors to also accept NULL buffers, and to treat them as zero-length strings; changed header comments. 3) changed checkForCachedStrings() to checkForTinyStrings() and made is a static String member; returns a zero-length string of the correct width if len == 0. 4) Fixed bug in createUTF8() that led to a memory overwrite for invalid UTF8 sequences (code did not check for return value of 8 when calling UnicodeUtils::Utf8ToUcs4). 5) Fixed an old, undiscovered bug in UnicodeUtils::Utf8ToUtf16() that did not recognize some out-of-range UTF-8 sequences as invalid. Note: this fix may break existing code! If a ByteArray contained such an out-of-range sequence (like e.g. FB BF BF BF BF BF for a character value of 0x3FFFF), the sequence was recognized as a correct sequence, but of a length of 4 bytes. 6) Added test cases to as3/ShellClasses/ByteArray.as to check for invalid, partial, or out-of-range UTF8 sequences. The fix for 5) may be problematic. Although it fixes a really obvious bug, the fix changes the way a ByteArray (or the contents of a file) is converted to Unicode when assuming UTF-8. Please let me know if the fix (and the test case) should be removed again.
Attachment #365999 - Attachment is obsolete: true
Attachment #366530 - Flags: review?(stejohns)
Typo: in 4), the return value to check for is 0, not 8.
Comment on attachment 366530 [details] [diff] [review] Fix as requested, plus additional fixes nice, even an updated acceptance test! :-) review+, but I'm going to test in Flash/AIR as well before pushing.
Attachment #366530 - Flags: review?(stejohns) → review+
Attachment #366530 - Attachment is obsolete: true
Comment on attachment 366530 [details] [diff] [review] Fix as requested, plus additional fixes pushed to redux as changeset: 1578:3371474ae6c8
Flags: in-testsuite?
I think this is (finally) fixed, closing.
Status: REOPENED → RESOLVED
Closed: 16 years ago16 years ago
Resolution: --- → FIXED
Status: RESOLVED → VERIFIED
Flags: in-testsuite+
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: