Closed
Bug 481308
Opened 16 years ago
Closed 16 years ago
ByteArray.toString() behaves differently for BOM-less UTF8
Categories
(Tamarin Graveyard :: Virtual Machine, defect)
Tamarin Graveyard
Virtual Machine
Tracking
(Not tracked)
VERIFIED
FIXED
People
(Reporter: stejohns, Assigned: daumling)
References
Details
Attachments
(4 obsolete files)
Prior to the new strings, ByteArray.toString (and probably also FileClass::read) processed a data stream without a BOM by calling newString, which implicitly treated the data as UTF8. The new code treats such data as Latin1, which is an incompatible change. We need to mimic what the old code did, including how it handled non-UTF8 data fed via this route.
Assignee | ||
Comment 1•16 years ago
|
||
The patch first checks for a valid UTF-8 sequence; if that fails, it creates a simple Latin-1 string. The fix favors memory over performance by checking for the validity of the UTF-8 sequence before attempting to create a string.
Attachment #365365 -
Flags: review?(stejohns)
Reporter | ||
Updated•16 years ago
|
Attachment #365365 -
Flags: review?(stejohns) → review+
Reporter | ||
Comment 3•16 years ago
|
||
Comment on attachment 365365 [details] [diff] [review]
Test for UTF-8 first, then Latin-1
Sounds right -- a similar fix is needed in FileClass too, I think.
Assignee | ||
Comment 4•16 years ago
|
||
More or less a cut&paste operation from the ByteArrayGlue fix. The ByteArrayGlus fix is part of the new patch.
Attachment #365365 -
Attachment is obsolete: true
Attachment #365474 -
Flags: review?(stejohns)
Reporter | ||
Comment 5•16 years ago
|
||
Actually, I think a more sophisticated fix is needed, to mimic the existing behavior, which seems to convert as much as possible, then does something odd. A snippet of code:
var b : ByteArray = new ByteArray();
b[0]=0xE4; // 19968
b[1]=0xB8;
b[2]=0x80;
b[3]=0xE4; // 20108
b[4]=0xBA;
b[5]=0x8C;
b[6]=0xE4; // 19977
b[7]=0xB8;
b[8]=0x89;
var bs:String = b.toString();
trace("valid utf8 sequence:",bs);
for (var i = 0; i < bs.length; ++i)
trace(" cc=",bs.charCodeAt(i));
var b2 : ByteArray = new ByteArray();
b2[0]=0xE4; // 19968
b2[1]=0xB8;
b2[2]=0x80;
b2[3]=0xE4; // bad sequence
b2[4]=0xE4;
b2[5]=0xE4;
var b2s:String = b2.toString();
trace("partially invalid utf8 sequence:",b2s);
for (var i = 0; i < b2s.length; ++i)
trace(" cc=",b2s.charCodeAt(i));
var b3 : ByteArray = new ByteArray();
b3[3]=0xE4; // bad sequence
b3[4]=0xE4;
b3[5]=0xE4;
var b3s:String = b3.toString();
trace("completely invalid utf8 sequence:",b3s);
for (var i = 0; i < b3s.length; ++i)
trace(" cc=",b3s.charCodeAt(i));
run in Flash10, this yields
valid utf8 sequence: ??? (high-order chars don't paste in here well, but the char codes are accurate)
cc= 19968
cc= 20108
cc= 19977
partially invalid utf8 sequence: ?äää
cc= 19968
cc= 228
cc= 228
cc= 228
completely invalid utf8 sequence:
cc= 0
cc= 0
cc= 0
cc= 228
cc= 228
cc= 228
which is wacky, but behavior we need to preserve (for ByteArray anyway)...
Updated•16 years ago
|
Flags: in-testsuite?
Assignee | ||
Comment 6•16 years ago
|
||
This patch is a little bigger, because it fixes two other bugs along the way.
1. UnicodeUtils::Utf8ToUtf16() has an additional strict flag; if set to true, the method returns -1 on a malformed UTF-8 sequence; if false, the old behavior remains, where invalid UTF-sequences were stored as characters.
2. String::createUTF8() has this new strict flag, and defaults to true.
3. AvmCore::findInternedString() uses strict == false to be backwards compatible
4. AvmCore::newStringUtf8() has an additional strict flag, which defaults to false.
5. FileClass and ByteArrayGlue use AvmCore::newStringUtf8() with strict explicitly set to false to maintain backwards compatibility.
6. Fix for bug #473995: The headers say that the character buffer must not be null. There is an assert if the buffer is null. All create methods use a new static method checkForCachedStrings() that checks if a cached string can be returned.
7. Fix for bug #473998 along the way: header documentation does not mention out-of-memory anymore.
8. Changed primitive data in UnicodeUtils to be C99 compatible (except for uint8* for UTF-8 strings).
There are three failing test cases on Win (not tested on other platforms):
as3/DescribeType/describeType.abc : test 0 FAILED! expected <type base="Class" isDynamic="true" isFinal="true" isStatic="true" name="Foo2">
as3/DescribeType/describeType.abc : test 1 FAILED! expected <type base="Foo" isDynamic="false" isFinal="false" isStatic="false" name="Foo2">
ecma3/Date/e15_9_5_35_1.abc : Fri Apr 3 17:00:00 GMT-0700 1970.getHours() = 17 FAILED! expected: 16
The last may be related to the upcoming DST change; I cannot explain why the other two fail.
Attachment #365474 -
Attachment is obsolete: true
Attachment #365999 -
Flags: review?(stejohns)
Attachment #365474 -
Flags: review?(stejohns)
Reporter | ||
Comment 7•16 years ago
|
||
Comment on attachment 365999 [details] [diff] [review]
Combined patch with fixes for 473995 and 473998
(1) I suspect that the vast majority of callers to newStringUTF8 will know at compile time whether they need strict mode or not, so I think it might make sense to have newStringUTF8 always be strict, and add newStringUTF8Nonstrict() -- this would save an argument-push, and also make the oddball nonstrict case more obvious in the code (and harder to call by accident).
(2) eliminating Utf8Count is a good call, but I wonder if it makes sense to provide it as an inline wrapper to Utf8toUtf16?
(3) do I need to push this for you?
Attachment #365999 -
Flags: review?(stejohns) → review+
Assignee | ||
Comment 8•16 years ago
|
||
1) This sounds reasonable. I was just afraid that switching newStringUTF8 to strict mode could break existing SWFs, although highly unlikely. Maybe we should wait until we have a version update.
2) I left the inline in so possible users do not stumble. It has the same effect as forcing the user to replace their calls to Utf8Count, but is less annoying.
3) Yes, please - I still do not feel confident enough.
Reporter | ||
Comment 9•16 years ago
|
||
(In reply to comment #8)
> 1) This sounds reasonable. I was just afraid that switching newStringUTF8 to
> strict mode could break existing SWFs, although highly unlikely. Maybe we
> should wait until we have a version update.
Not sure if that's a "yes, let's change to two distinct" or "no, let's keep it with the optional argument" -- if the former, what name do we want for nonstrict?
> 2) I left the inline in so possible users do not stumble. It has the same
> effect as forcing the user to replace their calls to Utf8Count, but is less
> annoying.
Nice :-)
> 3) Yes, please - I still do not feel confident enough.
Will do so as soon as I hear from you re: #1 above.
Assignee | ||
Comment 10•16 years ago
|
||
Sorry - I was unclear. Let's keep it with an optional argument for now to avoid introducing yet another API that people would have to change to.
Reporter | ||
Comment 11•16 years ago
|
||
Okey doke. I'll do my best to get this pushed today.
Reporter | ||
Comment 12•16 years ago
|
||
pushed to redux as changeset: 1575:9467136e0443
Status: ASSIGNED → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Reporter | ||
Comment 13•16 years ago
|
||
Unfortunately, fix is still not complete: in String::createUTF8, we call _analyzeUtf8(), which is still *always* strict -- and fails for bad input. We need a nonstrict version of _analyzeUtf8().
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Assignee | ||
Comment 14•16 years ago
|
||
1) Rewrote _analyzeUtf8() using UnicodeUtils::Utf8ToUtf16() as a template to ensure identical behavior.
2) Changed all String constructors to also accept NULL buffers, and to treat them as zero-length strings; changed header comments.
3) changed checkForCachedStrings() to checkForTinyStrings() and made is a static String member; returns a zero-length string of the correct width if len == 0.
4) Fixed bug in createUTF8() that led to a memory overwrite for invalid UTF8 sequences (code did not check for return value of 8 when calling UnicodeUtils::Utf8ToUcs4).
5) Fixed an old, undiscovered bug in UnicodeUtils::Utf8ToUtf16() that did not recognize some out-of-range UTF-8 sequences as invalid. Note: this fix may break existing code! If a ByteArray contained such an out-of-range sequence (like e.g. FB BF BF BF BF BF for a character value of 0x3FFFF), the sequence was recognized as a correct sequence, but of a length of 4 bytes.
6) Added test cases to as3/ShellClasses/ByteArray.as to check for invalid, partial, or out-of-range UTF8 sequences.
The fix for 5) may be problematic. Although it fixes a really obvious bug, the fix changes the way a ByteArray (or the contents of a file) is converted to Unicode when assuming UTF-8. Please let me know if the fix (and the test case) should be removed again.
Attachment #365999 -
Attachment is obsolete: true
Attachment #366530 -
Flags: review?(stejohns)
Assignee | ||
Comment 15•16 years ago
|
||
Typo: in 4), the return value to check for is 0, not 8.
Reporter | ||
Comment 16•16 years ago
|
||
Comment on attachment 366530 [details] [diff] [review]
Fix as requested, plus additional fixes
nice, even an updated acceptance test! :-)
review+, but I'm going to test in Flash/AIR as well before pushing.
Attachment #366530 -
Flags: review?(stejohns) → review+
Reporter | ||
Updated•16 years ago
|
Attachment #366530 -
Attachment is obsolete: true
Reporter | ||
Comment 17•16 years ago
|
||
Comment on attachment 366530 [details] [diff] [review]
Fix as requested, plus additional fixes
pushed to redux as changeset: 1578:3371474ae6c8
Updated•16 years ago
|
Flags: in-testsuite?
Reporter | ||
Comment 18•16 years ago
|
||
I think this is (finally) fixed, closing.
Status: REOPENED → RESOLVED
Closed: 16 years ago → 16 years ago
Resolution: --- → FIXED
Updated•15 years ago
|
Status: RESOLVED → VERIFIED
Flags: in-testsuite+
You need to log in
before you can comment on or make changes to this bug.
Description
•