ByteArray.toString() behaves differently for BOM-less UTF8

VERIFIED FIXED

Status

VERIFIED FIXED
10 years ago
9 years ago

People

(Reporter: stejohns, Assigned: daumling)

Tracking

unspecified
Dependency tree / graph
Bug Flags:
in-testsuite +

Details

Attachments

(4 obsolete attachments)

(Reporter)

Description

10 years ago
Prior to the new strings, ByteArray.toString (and probably also FileClass::read) processed a data stream without a BOM by calling newString, which implicitly treated the data as UTF8. The new code treats such data as Latin1, which is an incompatible change. We need to mimic what the old code did, including how it handled non-UTF8 data fed via this route.
(Assignee)

Comment 1

10 years ago
Created attachment 365365 [details] [diff] [review]
Test for UTF-8 first, then Latin-1

The patch first checks for a valid UTF-8 sequence; if that fails, it creates a simple Latin-1 string. The fix favors memory over performance by checking for the validity of the UTF-8 sequence before attempting to create a string.
Attachment #365365 - Flags: review?(stejohns)
(Assignee)

Comment 2

10 years ago
Assigned to myself.
Status: NEW → ASSIGNED
(Reporter)

Updated

10 years ago
Attachment #365365 - Flags: review?(stejohns) → review+
(Reporter)

Comment 3

10 years ago
Comment on attachment 365365 [details] [diff] [review]
Test for UTF-8 first, then Latin-1

Sounds right -- a similar fix is needed in FileClass too, I think.
(Assignee)

Comment 4

10 years ago
Created attachment 365474 [details] [diff] [review]
Same patch for both ByteArray and File classes

More or less a cut&paste operation from the ByteArrayGlue fix. The ByteArrayGlus fix is part of the new patch.
Attachment #365365 - Attachment is obsolete: true
Attachment #365474 - Flags: review?(stejohns)
(Reporter)

Comment 5

10 years ago
Actually, I think a more sophisticated fix is needed, to mimic the existing behavior, which seems to convert as much as possible, then does something odd. A snippet of code:

				var b : ByteArray = new ByteArray();
				b[0]=0xE4; // 19968
				b[1]=0xB8;
				b[2]=0x80;
				b[3]=0xE4; // 20108
				b[4]=0xBA; 
				b[5]=0x8C;
				b[6]=0xE4; // 19977
				b[7]=0xB8;
				b[8]=0x89;
				var bs:String = b.toString();
				trace("valid utf8 sequence:",bs);
				for (var i = 0; i < bs.length; ++i)
					trace("  cc=",bs.charCodeAt(i));
				var b2 : ByteArray = new ByteArray();
				b2[0]=0xE4; // 19968
				b2[1]=0xB8;
				b2[2]=0x80;
				b2[3]=0xE4; // bad sequence
				b2[4]=0xE4;
				b2[5]=0xE4;
				var b2s:String = b2.toString();
				trace("partially invalid utf8 sequence:",b2s);
				for (var i = 0; i < b2s.length; ++i)
					trace("  cc=",b2s.charCodeAt(i));
				var b3 : ByteArray = new ByteArray();
				b3[3]=0xE4; // bad sequence
				b3[4]=0xE4;
				b3[5]=0xE4;
				var b3s:String = b3.toString();
				trace("completely invalid utf8 sequence:",b3s);
				for (var i = 0; i < b3s.length; ++i)
					trace("  cc=",b3s.charCodeAt(i));


run in Flash10, this yields 

valid utf8 sequence: ??? (high-order chars don't paste in here well, but the char codes are accurate)
  cc= 19968
  cc= 20108
  cc= 19977
partially invalid utf8 sequence: ?äää
  cc= 19968
  cc= 228
  cc= 228
  cc= 228
completely invalid utf8 sequence: 
  cc= 0
  cc= 0
  cc= 0
  cc= 228
  cc= 228
  cc= 228


which is wacky, but behavior we need to preserve (for ByteArray anyway)...

Updated

10 years ago
Flags: in-testsuite?
(Assignee)

Comment 6

10 years ago
Created attachment 365999 [details] [diff] [review]
Combined patch with fixes for 473995 and 473998

This patch is a little bigger, because it fixes two other bugs along the way.

1. UnicodeUtils::Utf8ToUtf16() has an additional strict flag; if set to true, the method returns -1 on a malformed UTF-8 sequence; if false, the old behavior remains, where invalid UTF-sequences were stored as characters.
2. String::createUTF8() has this new strict flag, and defaults to true.
3. AvmCore::findInternedString() uses strict == false to be backwards compatible
4. AvmCore::newStringUtf8() has an additional strict flag, which defaults to false.
5. FileClass and ByteArrayGlue use AvmCore::newStringUtf8() with strict explicitly set to false to maintain backwards compatibility.
6. Fix for bug #473995: The headers say that the character buffer must not be null. There is an assert if the buffer is null. All create methods use a new static method checkForCachedStrings() that checks if a cached string can be returned.
7. Fix for bug #473998 along the way: header documentation does not mention out-of-memory anymore.
8. Changed primitive data in UnicodeUtils to be C99 compatible (except for uint8* for UTF-8 strings).

There are three failing test cases on Win (not tested on other platforms):

  as3/DescribeType/describeType.abc : test 0 FAILED! expected <type base="Class" isDynamic="true" isFinal="true" isStatic="true" name="Foo2">
  as3/DescribeType/describeType.abc : test 1 FAILED! expected <type base="Foo" isDynamic="false" isFinal="false" isStatic="false" name="Foo2">
  ecma3/Date/e15_9_5_35_1.abc : Fri Apr 3 17:00:00 GMT-0700 1970.getHours() = 17 FAILED! expected: 16
  
The last may be related to the upcoming DST change; I cannot explain why the other two fail.
Attachment #365474 - Attachment is obsolete: true
Attachment #365999 - Flags: review?(stejohns)
Attachment #365474 - Flags: review?(stejohns)
(Assignee)

Updated

10 years ago
Blocks: 473995
(Assignee)

Updated

10 years ago
Blocks: 473998
(Reporter)

Comment 7

10 years ago
Comment on attachment 365999 [details] [diff] [review]
Combined patch with fixes for 473995 and 473998

(1) I suspect that the vast majority of callers to newStringUTF8 will know at compile time whether they need strict mode or not, so I think it might make sense to have newStringUTF8 always be strict, and add newStringUTF8Nonstrict() -- this would save an argument-push, and also make the oddball nonstrict case more obvious in the code (and harder to call by accident).

(2) eliminating Utf8Count is a good call, but I wonder if it makes sense to provide it as an inline wrapper to Utf8toUtf16? 

(3) do I need to push this for you?
Attachment #365999 - Flags: review?(stejohns) → review+
(Assignee)

Comment 8

10 years ago
1) This sounds reasonable. I was just afraid that switching newStringUTF8 to strict mode could break existing SWFs, although highly unlikely. Maybe we should wait until we have a version update.

2) I left the inline in so possible users do not stumble. It has the same effect as forcing the user to replace their calls to Utf8Count, but is less annoying.

3) Yes, please - I still do not feel confident enough.
(Reporter)

Comment 9

10 years ago
(In reply to comment #8)
> 1) This sounds reasonable. I was just afraid that switching newStringUTF8 to
> strict mode could break existing SWFs, although highly unlikely. Maybe we
> should wait until we have a version update.

Not sure if that's a "yes, let's change to two distinct" or "no, let's keep it with the optional argument" -- if the former, what name do we want for nonstrict?

> 2) I left the inline in so possible users do not stumble. It has the same
> effect as forcing the user to replace their calls to Utf8Count, but is less
> annoying.

Nice :-)

> 3) Yes, please - I still do not feel confident enough.

Will do so as soon as I hear from you re: #1 above.
(Assignee)

Comment 10

10 years ago
Sorry - I was unclear. Let's keep it with an optional argument for now to avoid introducing yet another API that people would have to change to.
(Reporter)

Comment 11

10 years ago
Okey doke. I'll do my best to get this pushed today.
(Reporter)

Comment 12

10 years ago
pushed to redux as changeset:   1575:9467136e0443
Status: ASSIGNED → RESOLVED
Last Resolved: 10 years ago
Resolution: --- → FIXED
(Reporter)

Comment 13

10 years ago
Unfortunately, fix is still not complete: in String::createUTF8, we call _analyzeUtf8(), which is still *always* strict -- and fails for bad input. We need a nonstrict version of _analyzeUtf8().
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(Assignee)

Comment 14

10 years ago
Created attachment 366530 [details] [diff] [review]
Fix as requested, plus additional fixes

1) Rewrote _analyzeUtf8() using UnicodeUtils::Utf8ToUtf16() as a template to ensure identical behavior.

2) Changed all String constructors to also accept NULL buffers, and to treat them as zero-length strings; changed header comments.

3) changed checkForCachedStrings() to checkForTinyStrings() and made is a static String member; returns a zero-length string of the correct width if len == 0.

4) Fixed bug in createUTF8() that led to a memory overwrite for invalid UTF8 sequences (code did not check for return value of 8 when calling UnicodeUtils::Utf8ToUcs4).

5) Fixed an old, undiscovered bug in UnicodeUtils::Utf8ToUtf16() that did not recognize some out-of-range UTF-8 sequences as invalid. Note: this fix may break existing code! If a ByteArray contained such an out-of-range sequence (like e.g. FB BF BF BF BF BF for a character value of 0x3FFFF), the sequence was recognized as a correct sequence, but of a length of 4 bytes.

6) Added test cases to as3/ShellClasses/ByteArray.as to check for invalid, partial, or out-of-range UTF8 sequences.

The fix for 5) may be problematic. Although it fixes a really obvious bug, the fix changes the way a ByteArray (or the contents of a file) is converted to Unicode when assuming UTF-8. Please let me know if the fix (and the test case) should be removed again.
Attachment #365999 - Attachment is obsolete: true
Attachment #366530 - Flags: review?(stejohns)
(Assignee)

Comment 15

10 years ago
Typo: in 4), the return value to check for is 0, not 8.
(Reporter)

Comment 16

10 years ago
Comment on attachment 366530 [details] [diff] [review]
Fix as requested, plus additional fixes

nice, even an updated acceptance test! :-)

review+, but I'm going to test in Flash/AIR as well before pushing.
Attachment #366530 - Flags: review?(stejohns) → review+
(Reporter)

Updated

10 years ago
Attachment #366530 - Attachment is obsolete: true
(Reporter)

Comment 17

10 years ago
Comment on attachment 366530 [details] [diff] [review]
Fix as requested, plus additional fixes

pushed to redux as changeset:   1578:3371474ae6c8

Updated

10 years ago
Flags: in-testsuite?
(Reporter)

Comment 18

10 years ago
I think this is (finally) fixed, closing.
Status: REOPENED → RESOLVED
Last Resolved: 10 years ago10 years ago
Resolution: --- → FIXED

Updated

9 years ago
Status: RESOLVED → VERIFIED
Flags: in-testsuite+
You need to log in before you can comment on or make changes to this bug.