Closed Bug 465506 Opened 16 years ago Closed 16 years ago

Integrate the TT String class into Tamarin-Redux

Categories

(Tamarin Graveyard :: Virtual Machine, defect, P3)

defect

Tracking

(Not tracked)

VERIFIED FIXED
flash10.1

People

(Reporter: daumling, Assigned: daumling)

References

Details

Attachments

(9 files, 7 obsolete files)

56.12 KB, text/plain
Details
295.23 KB, patch
stejohns
: review+
Details | Diff | Splinter Review
1.41 KB, patch
stejohns
: review+
Details | Diff | Splinter Review
1.02 KB, patch
daumling
: review+
Details | Diff | Splinter Review
122.98 KB, patch
daumling
: review+
edwsmith
: review+
Details | Diff | Splinter Review
16.92 KB, text/plain
daumling
: review+
Details
23.09 KB, patch
daumling
: review+
Details | Diff | Splinter Review
510 bytes, patch
edwsmith
: review+
Details | Diff | Splinter Review
48.78 KB, patch
stejohns
: review+
Details | Diff | Splinter Review
Integration task for the new String class into Tamarin-Redux. Changes will be marked with a #define AVMPLUS_NEW_STRINGS; that #define needs to be uncommented in avmbuild.h. After the integratrino completes successfully, these #defines can be removed and/or optimized.
Assigning the bug to myself.
Status: NEW → ASSIGNED
Assignee: nobody → daumling
Flags: flashplayer-qrb+
Priority: -- → P3
Target Milestone: --- → flash10.x
It would be good to have this done by early December so that it could go into the December TR->TC release, probably in 3rd week of December.
Here is the patch. It is huge, but most changes are trivial. The XML classes have quite a few drastic changes, though. I have assigned Steven as the reviewer, but I suspect that you may want to split code review among a few more people. The code passes all acceptance and performance tests on Win and Mac. Other platforms have not been tested.

These are the numbers, with the executable with new strings (avm) and the old executable (avm2), tested on XP and a 2.6 GHz dual-core notebook CPU. You can see more or less drastic improvements in memory usage, and mostly also a nice performance improvement. I have already changed the code that creates the ABC image strings to using static string references.

test                                     avm    avm2     %sp  metric

sunspider/string-fasta.as               1.7M    2.3M    35.3  memory
sunspider/string-unpack-code.as         3.1M    3.6M    16.1  memory
sunspider/string-validate-input.as      2.3M    2.8M    21.7  memory
sunspider/as3/string-fasta.as           3.7M    3.9M     5.4  memory
sunspider/as3/string-unpack-code.as     3.3M    4.5M    36.4  memory
sunspider/as3/string-validate-input.as  2.3M    3.2M    39.1  memory

test                                     avm    avm2     %sp  metric

sunspider/string-fasta.as               100     105    -4.8    time
sunspider/string-unpack-code.as         234     250    -6.4    time
sunspider/string-validate-input.as       93      97    -4.1    time
sunspider/as3/string-fasta.as            73      69     5.8    time
sunspider/as3/string-unpack-code.as     242     245    -1.2    time
sunspider/as3/string-validate-input.as   81      88    -8.0    time
Attachment #349410 - Flags: review?(stejohns)
I haven't looked at the patch, but suggest that Werner should review the XML changes.
any theories about the as3/string-fasta slowdown?
Looks like the warning about an initialized, but not used local variable (4189) is suppressed in the VC2008 projects. This is a difference to the build scripts, and should be fixed. I have removed that warning and built all four configs without problems (after removing that single one).

The build logs still have this issue at the beginning:
cl : Command line warning D9002 : ignoring unknown option '--version'
cl : Command line error D8003 : missing source filename

This looks like a build script issue, however.

No real idea about the slowdown in sunspider/as3/string-fasta.as yet, but I will investigate. Interesting: the non-AS3 test runs at the same speed as before.
Attachment #349410 - Attachment is obsolete: true
Attachment #349442 - Flags: review?(stejohns)
Attachment #349410 - Flags: review?(stejohns)
FYI: I might not have time to review this today, and I'm on vacation all next week. If the review needs to be done sooner than that, we may need to find someone else to do so. If that delay is acceptable, feel free to remind me in a week in case I forget :-)
OK, managed time to do a first-pass review: on the whole, looks great, but I have enough comments and questions that it will probably need a second pass. In no particular order: 

Question: if AVMPLUS_UTF32_SUPPORT is defined, is a wchar* assumed to point to UTF16 or UCS2? (ie will surrogate pairs be recognized?) Some code (eg String::isSpace) treats a wchar as though it's equivalent to a Unicode Code Point, which isn't quite right. (erhaps the "wchar" type name needs to go away entirely? Having a 16-bit character type handy seems like an accident waiting to happen if we want full Unicode 4.x+ support... maybe we should add a "utf16_t" type for pointer-to-UTF16 strings, and a "unichar_t" type (== uint32) for things that return code points.

String::compare has a comment about being unable to use memcmp() due to the ecma3/String/localeCompare_rt test relying on the difference in char values, but I think that may be a bogus test... I don't think the JS spec, nor other JS engines, guarantee this.

Agreed that the E4X changes should probably be looked at by Werner -- I didn't see anything obviously wrong but it's tricky code.

I'm a little unclear about the "start/end AVMPLUS_NEW_STRINGS" comments — are they meant to be long-lived? Are they present just as code markers? I think I’d like to see them removed before pushing.

There are various places that have "delete" operators commented out -- I'd prefer we remove them entirely if this is no longer legal.

I see a few checks for NULL returned from GC::Alloc(), with "TODO: out of memory error" commented in -- I think that with upcoming changes for better out-of-memory handing, GC::Alloc() will be guaranteed never to return NULL (we might throw an exception, or simply call exit(), but won't return NULL) -- but we should check with Tommy Reilly on that.

AVMPLUS_MINBUILD is a concept that hasn't migrated from Tamarin-Tracing; we can lose all those ifdefs (maybe someday we'll revisit that, but not just yet).

Re: all the Unicode character properties, we might want to consider using a two-stage table for lookup -- though I've never used it, it sounds like a clever and efficient technique (see http://smallcode.weblogs.us/2008/06/25/multi-stage/ for a good explanation, though others are readily available)

Why does StringObject.h need to include NamespaceSet.h?

The commented-out #define of AVMPLUS_UTF32_SUPPORT should probably be moved to avmbuild.h, with a descriptive comment.

StringIndexer is beautiful, was it in TT or is it new to this port?

StringBuilder is actually unused and slated to be removed -- thanks for updating it anyway though :-)

A few style notes: while we have no official coding guidelines (we’re working on it), there are a few styles in the new code that I think need modification to conform to existing coding approaches elsewhere in Tamarin:

-- preferred style in Tamarin is no space between function name and opening paren, and no space inside parens:

    int x = foo(1); // good
    int x = foo (1); // bad
    int x = foo( 1 ); // bad
    int x = foo ( 1 ); // bad

likewise,

    SomeClass foo(args); // good
    SomeClass foo (args); // bad

same story for arrays:

    char foo[2]; // good
    a = b[3]; // good
    char foo [2]; // bad
    a = b [3]; // bad


(Yes, the E4X code violates this but pretty much all new code is being written this way)

-- when doing comparions against a constant, put the constant on the right, not the left:
    
    if (i == 0) // good
    if (0 == i) // bad
    if (i == kSomeConstant) // good
    if (kSomeConstant == i) // bad
    AvmAssert(p != NULL); // good
    AvmAssert(NULL != p); // bad

yes, the constant-on-left prevents accidental assignment, but we prefer to solve that problem thru appropriate compiler warnings, rather than compromising readability.
a wchar is always UCS-2 (no surrogate pair recognition). If I added that capability, the change would break backwards compatibiliy. However, the createUTF16() call does recognize surrogate pairs. There is no toUTF32String() call yet, which would probably be implemented as combining surrogate pairs.

I can quickly change String::compare() back to memcmp(), and I think that we should. ECMA-262 specifically mentions that the return value is implementation dependent.

The start/end AVMPLUS_NEW_STRINGS are a temporary marker to quickly identify changes. It makes it easier to spot the source of problems, and they should go away ASAP, i.e. in the next patch round.

We can debate whether delete should be legal. I would say NO, because strings may be used as the master of dependent strings. This is also true for the current implementation.

The out-of-memory problem needs to be handled inside MMgc. The current code does check, but actually, it transfers the problem from the origin to the caller. No calling code checks a created string for being NULL.

The inclusion of NamespaceSet.h is probably a leftover and can be removed.

StringIndexer is new - and extremely useful :)

I can take a stab at removing StringBuilder once the patch has landed (as well as modifying MathUtils). I will create follow-up bugs for this.

And I can take care of the style issues as wellas of the other smaller edit that Steven mentioned (although my personal taste is different ;). This will probably have to wait till next week, though.

I will check email this weekend in case anybody wants to integrate the patch to tamarin-redux and has questions.
Blocks: 466271
Blocks: 466272
String documentation is at https://wiki.mozilla.org/Tamarin:Strings. There are no links to that page yet.
(In reply to comment #9)

> The out-of-memory problem needs to be handled inside MMgc. The current code
> does check, but actually, it transfers the problem from the origin to the
> caller. No calling code checks a created string for being NULL.

Stevens point was that even the string code need not check, as the test for NULL will always fail.  But Tommy gets the last word on this.
(In reply to comment #9)
> a wchar is always UCS-2 (no surrogate pair recognition). If I added that
> capability, the change would break backwards compatibiliy. However, the
> createUTF16() call does recognize surrogate pairs. There is no toUTF32String()
> call yet, which would probably be implemented as combining surrogate pairs.
> 
> I can quickly change String::compare() back to memcmp(), and I think that we
> should. ECMA-262 specifically mentions that the return value is implementation
> dependent.

This would be good.  There needs to be a clear separation between the memory management concerns (strings are arrays of 8/16/32bit ints) and the Unicode concerns (strings represent encoded text).
There's a subtle issue I hadn't considered before: currently, String.length actually returns the length of the string in UTF16 characters, not Unicode code points. So a string that contains (say) a single Japanese characters that requires two words in UTF16 format will return length == 2, even though it's a String that contains only a single Unicode character. I'm guessing we will need to continue this (mis)behavior for backwards compatibility with existing code.
(In reply to comment #13)
> There's a subtle issue I hadn't considered before: currently, String.length
> actually returns the length of the string in UTF16 characters, not Unicode code
> points. So a string that contains (say) a single Japanese characters that
> requires two words in UTF16 format will return length == 2, even though it's a
> String that contains only a single Unicode character. I'm guessing we will need
> to continue this (mis)behavior for backwards compatibility with existing code.

UTF-16 is according to the ECMAScript specification, so the "(mis)" is probably not warranted.  We're going to have to deal with all the issues we tried to deal with for ES4 if we're to allow full Unicode characters in ActionScript.
True, the "mis" on my part just shows my pro-full-Unicode bias... existing behavior is correct and as-designed, just IMHO a regrettable design decision that we're now stuck with.
You can at some point always switch to 32-bit strings, which would support full Unicode without surrogate pairs. But that would probably break some code that relied on surrogate pairs. Code memory footprint would increase by 1-2K, and string memory would mainly increase if strings had to be 32 bits because of the code points it contained. Some minor adjustments in string creation would needed to allow code to combine surrogate pairs. Again, that would probably be a subject to a future AVM/ActionScript version, or an ActionScript compile option, or the like.
Comment on attachment 349442 [details] [diff] [review]
Updated patch without VC2008 compiler warning

Let's go ahead and land it -- my understanding is that the various minor issues brought up in comments will be addressed in a subsequent patch.
Attachment #349442 - Flags: review?(stejohns) → review+
This autoexp.dat works for VC8 and VC9. It contains visualizers for the new String class as well as for a few other classes - makes working with strings much easier. Supports raw view and address view, and resolves NULL.
Attached patch Updated, 64-bit savvy patch (obsolete) — Splinter Review
This patch passes acceptance and performance tests on Win32 and Win64 (Win64 has one test failure - see below). I have also applied numerous small edits according to Steven's comments above. The changes are all minimal, and do not warrant a new review IMHO, so please feel free to land the patch.

1) Moved the #define AVMPLUS_UTF32_SUPPORT to avmbuild.h and renamed it to FEATURE_UTF32_SUPPORT.

2) Removed AvmCore::freeString(). It was only used in LIR.cpp, and I removed the string deletion calls there. Removed all direct deletes of Strings.

3) Remvoed checks for NULL returned from GC allocations.

4) Removed #ifdef AVMPLUS_MINBUILD.

5) Removed #include "NamespaceSet.h" from StringObject.h.

6) Cosmetic edits according to Steven's comments:
   - Removed AVMPLUS_NEW_STRINGS comments.
   - Changed foo (n) to foo(n) and bar [x] to bar[x].
   - Reverted if (constant == expression) to if (expression == constant).
 
Additional edits:

7) Added delete calls to all places where UTF8String and UTF16String instances were created. These classes are just character data buffers, so it is OK to delete them after use.

8) Removed RegExpObject::Utf16ToUtf8Index() and Utf8ToUtf16Index() and inlined the respective String method calls.
 
Open issues:

- String::compare() still does not use memcmp() in order to not break the ecma3/String/localeCompare_rt test.

- Unicode table lookup is still the old algorithm - no changes here.

Win64 test failure: This may be a VC9 bug. The test below fails only for release builds; it passed in all other configs, or if -Dinterp is active. If I compile StringObject.cpp with the optimizer set to /Od or /o1, the test passes. Very strange.

Executing 1 tests against vm: ../../platform/win32/x64/Release/avmplus_64.exe
current configuration: x86-win-avm2-release
0 running ecma3/GlobalObject/e15_1_2_5_1.as
   unescape( %u1F ) = %u   FAILED! expected: %u1F
   unescape( %u4A ) = %u4A FAILED! expected: %u
   unescape( %u9F ) = %u   FAILED! expected: %u9F
   unescape( %uCA ) = %uCA FAILED! expected: %u
   FAILED passes:527 fails:4 unexpected passes: 0 expected failures: 0

FAILURES:
  ecma3/GlobalObject/e15_1_2_5_1.abc : unescape( %u1F ) = %u   FAILED! expected: %u1F
  ecma3/GlobalObject/e15_1_2_5_1.abc : unescape( %u4A ) = %u4A FAILED! expected: %u
  ecma3/GlobalObject/e15_1_2_5_1.abc : unescape( %u9F ) = %u   FAILED! expected: %u9F
  ecma3/GlobalObject/e15_1_2_5_1.abc : unescape( %uCA ) = %uCA FAILED! expected: %u

test run FAILED!
Tests complete at 2008-12-04 06:38:24.545969
Start Date: 2008-12-04 06:38:24.170969
End Date  : 2008-12-04 06:38:24.545969
Test Time : 0:00:00.375000
passes               : 527
failures             : 4
Attachment #349442 - Attachment is obsolete: true
There are still failures on 64-bit Mac (both Debug and Release):

  ecma3/Array/e15_4__1.abc : 
    var myarr = new Array(); myarr[Math.pow(2,32)-2]='hi'; 
    myarr.length = 0 FAILED! expected: 4294967295
  ecma3/Array/e15_4__1.abc : 
    var myarr = new Array(); myarr[Math.pow(2,32)-3]='hi';
    myarr.length = 0 FAILED! expected: 4294967294
  ecma3/Array/e15_4__1.abc : 
    var myarr = new Array(); myarr[Math.pow(2,31)-2]='hi'; 
    myarr.length = 0 FAILED! expected: 2147483647
  ecma3/Array/e15_4__1.abc :
    var myarr = new Array(); myarr[Math.pow(2,31)-1]='hi';
    myarr.length = 0 FAILED! expected: 2147483648
  ecma3/Array/e15_4__1.abc :
    var myarr = new Array(); myarr[Math.pow(2,31)]='hi';
    myarr.length = 0 FAILED! expected: 2147483649
  ecma3/Array/e15_4__1.abc :
    var myarr = new Array(); myarr[Math.pow(2,30)-2]='hi';
    myarr.length = 0 FAILED! expected: 1073741823

These failures are probably an interpreter bug (https://bugzilla.mozilla.org/show_bug.cgi?id=465754) but it's worrisome that they don't show up on Windows 64-bit, because they ought to.

(There's no JIT on Mac to compare to at this time.)

Despite those failures I'm going to land this.
tamarin-redux changeset 1189:94c3ee7f7730
Status: ASSIGNED → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Fails to build on winmo; have backed out the change; will run it through the sandbox.  Reopening until that's resolved.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
This patch contains the fix for the failed Array tests. Please see StringObject::getIntAtom() and StringObject::parsewIndex() for dewtails. The problem was that getIntAtom() set the TSTR_NONUM_BIT, because the number (0xfffffffe) was > 28 bits. and that cause parseIndex() to think that the string did not contain a valid index. Now, there are two different parse flags, one for getIntAtom(), and one for parseIndex(). This has also lead to a performance increase (getIntAtom() is called very often).

I also removed the manual deletion of UTF8String and UTF16String instances after use. The test sunspider/string-unpack-code was 15% slower than the old version with manual deletion, and it switched to be 5% faster when not deleting the strings, but leaving the deletion to MMgc. Memory used is 3.6M for the old code, 3.1M without deletion, and 2.4M with deletion, but since the code is temporary (it will go away after I am done with the optimizations in MathUtils, see bug #466272), I think that we can live with just 16% less memory than 50% less memory for now.

FYI: The sunspider/string-unpack-code test tests parseInt() and Number.toString().

This test fails with Debug_Debugger and Release_Debugger, but it also fails with the old string code:

as3/sampling/Callback.abc : Callback: callback with parameters throws exception =  FAILED! expected: ArgumentError: Error #1063

The numbers are still dancing around a little, but string-fasta.as has improved. It is now faster than before.

test                                                                avm                   avm2
                                                       min :   max     avg       min :   max     avg   %diff  metric
                                                   -----------------------   -----------------------   -----
sunspider/string-fasta.as                          [    93 :   110]   97.0   [    93 :   110]  103.0    -5.8    time
sunspider/string-unpack-code.as                    [   234 :   250]  243.8   [   234 :   250]  246.8    -1.2    time
sunspider/string-validate-input.as                 [    94 :   109]   97.0   [    94 :   125]  103.2    -6.0    time
sunspider/as3/string-fasta.as                      [    62 :    78]   68.8   [    62 :    79]   72.0    -4.4    time
sunspider/as3/string-unpack-code.as                [   254 :   260]  257.0   [   250 :   252]  251.2     2.3    time
sunspider/as3/string-validate-input.as             [    79 :    94]   85.0   [    79 :    94]   88.0    -3.4    time
Attachment #351375 - Attachment is obsolete: true
(In reply to comment #22)
> Fails to build on winmo; have backed out the change; will run it through the
> sandbox.  Reopening until that's resolved.

Appears to be that wchar has to be qualified as avmplus::wchar in the prototype for wmain() in avmshell.cpp.
Attachment #351528 - Flags: review?(stejohns)
Comment on attachment 351528 [details] [diff] [review]
Fixed failed Array tests, performance improvements

Lars requested that you review the updated patch.
This file compiles with GCC (I goto'ed over a variable declaration). I have also added the fix for WinMo that Lars applied, as far as I understood (change wchar to avmplus::wchar in the wmain declaration in AvmShell.cpp)

I could not build AVM using the Xcode 3.1 projects - the #defines for Nanojit were apparently wrong. Also, I got several warning that the offsetof macro may have problems - is offsetof a safe macro to use cross-platform?

Steven, would you review the patch?
Attachment #351528 - Attachment is obsolete: true
Attachment #351563 - Flags: review?(stejohns)
Attachment #351528 - Flags: review?(stejohns)
> Steven, would you review the patch?

Will do by end of day (hopefully sooner)
Comment on attachment 351563 [details] [diff] [review]
Fixed GCC build issues (and hopefully, WinMo issues)

So does this pass sanities and build properly everywhere now? If so, I'm ready to push it.
Attachment #351563 - Flags: review?(stejohns) → review+
pushed to redux as changeset:   1207:fe2aaeaf8674
(In reply to comment #28)
> (From update of attachment 351563 [details] [diff] [review])
> So does this pass sanities and build properly everywhere now? If so, I'm ready
> to push it.

Looking at the latest results (http://tamarin-builds.mozilla.org/tamarin-redux/waterfall) it builds properly everywhere but it fails many acceptance tests.  There are the following non-sampler failures:

Win64 Release:
  ecma3/GlobalObject/e15_1_2_5_1.abc :
   unescape( %u1F ) = %u�� FAILED! expected: %u1F
  ecma3/GlobalObject/e15_1_2_5_1.abc :
   unescape( %u4A ) = %u4A FAILED! expected: %u��
  ecma3/GlobalObject/e15_1_2_5_1.abc :
   unescape( %u9F ) = %u�� FAILED! expected: %u9F
  ecma3/GlobalObject/e15_1_2_5_1.abc :
   unescape( %uCA ) = %uCA FAILED! expected: %u��

Win64 forcemir: hangs, killed by the test harness

Win64 release-debugger: three unescape issues as above; crashes in sampler.

Win64 debug and debug-debugger: 1800 new asserts each (one per test).

Mac-ppc-10.5 -Dforcemir and -Dinterp, plus three of the macppc-10.4 builds:

  ecma3/Array/splice1.abc :
    exhaustive splice w/no optional args 1 = false FAILED! expected: true
  ecma3/Array/splice2.abc :
    exhaustive splice w/2 optional args 1 = false FAILED! expected: true

Mac-ppc-10.4 release-debugger:
 ecma3/Expressions/e11_4_6.abc : 
    +( String.fromCharCode(0x000C) +  99 ) = 0 FAILED! expected: 99

I suggest we back this out and go through the process of getting Michael set up with mozilla commit privileges, or at least figure out some way for Michael to use our try server.
I agree - it would make life easier. This is a huge patch, and it appears to be tricky.

The alternative would be to leave it in to make the patches smaller. I will work on the problems today, so there is good chance that we will be done by EOD today - with help from Lars to validate patches ;) D'oh - I thought I had fixed the issues.
I have investigated the 1800 asserts. This is a new assert in MMgc::ChangeSize, which Tommy checked in Friday. Strings have nothing to do with this problem.

The problem is active if AVMPLUS_MIR is defined, which is not the case for Win-32 builds, but for Win-64 builds. MIR uses an AvmGrowableBuffer. This buffer has an initial raw size of 0x38 bytes. It then grows by 16K to 0x4038 bytes. MIR shrinks it (via pool->codeBuffer->decommitUnused()) to 0x1038 bytes. During GCAlloc::Finalize(), the allocator thinks that the size is only 0x38 bytes, so there must be a tracking issue here. Ignoring the assert leads to a crash, which is most likely the problem of the hangs and crashes that Lars observed.

To repro the problem in Win32, just comment out line 149:

//#    define FEATURE_NANOJIT

This enables MIR. Then, start avmplus with a non-existent file name and Boom.
The unescape test failures appear to be a VC9 compiler optimizer problem. This is the method in question:

Stringp String::concatStrings(Stringp leftStr, Stringp rightStr)
{
    if (leftStr == NULL || leftStr->m_length == 0)
        return rightStr;
    if (rightStr == NULL || rightStr->m_length == 0)
        return leftStr;

    String* s = leftStr->append(rightStr->getData(), rightStr->length(), 
                                rightStr->getWidth());
    return s;
}

If I put a #pragma around this function that disables speed optimization:

#pragma optimize ("t", off)
...
#pragma optimize ("", on)

The test runs fine. Also, which is even stranger, if I change the method's first line to:

    if (leftStr == NULL)

The problem goes away as well. I've looked at the assembly listings; the code is almost identical. If I disable speed optimization for the method, the call to append() is a real call; if speed optimization is enabled, it is a direct jump.

I am not the expert on the AMD64 opcode set, so I am out of options. Any suggestions?
from your description it sounds like it's attempting tail-call optimization.  maybe possible to disable just that optimization w/out disabling the other speed optimizations?
This (small) patch turns off speed optimization for String::concatStrings() for Win64. The only difference in code is that the attempted tail call for String::append() is replaced with a real call, so it should not really impact performance. The patch fixes the failed test ecma3/GlobalObject/e15_1_2_5_1.as. Please verify the fix on another Win64 machine.
Attachment #351896 - Flags: review?(edwsmith)
investigating win64 asserts...
Blocks: 468472
Attachment #351896 - Flags: review?(edwsmith) → review+
Subtle but critical bug in the hashing functions: some using an unsigned to accumulate the hash, and some a signed int, leading to different hashcodes for the same string depending on which you called. This could generate multiple interned strings for the same sequence of codepoints, causing amusing failures.
Attachment #352017 - Flags: review?(daumling)
Attachment #352017 - Flags: review?(daumling) → review+
pushed hashcode fix to redux as changeset:   1214:725a18d123c2
Attached patch Patch to revise creation APIs (obsolete) — Splinter Review
Further changes to the new String class to fix some outright bugs, and to clarify encoding more carefully. This is a much more sweeping change than I originally envisioned; please review carefully... my concern is that the generic, overloaded names we are using "create(), append()" makes it too easy to choose the wrong one by mistake (eg Latin1 when you meant UTF8 or vice versa), with consequences that might not show up on typical dev systems. So:

-- _analyzeUtf16 had a bug:

	else if (ch < 0xD800 && ch > 0xDBFF)

should have been

	else if (ch < 0xD800 || ch > 0xDBFF)

so we were never incrementing w16.

-- String::createUTF8 checked for desiredWidth == k16 in order to use cachedChars; should have used desiredWidth == k8.

-- String::createUTF16 calculated desiredWidth incorrectly for kAuto (k8 and k16 were flopped).

-- String::createUTF16 handled surrogates incorrectly (<= should have been >=), but only for FEATURE_UTF32_SUPPORT.

-- added a special exception for String::createUTF16; this call needs to be able to create an illegal UTF16 sequence when called by String.fromCharCode; various sanities require roundtripping of codes in the surrogate-pair range. I special-cased to allow illegal strings of len=1 to continue, but am not entirely happy with this fix... suggestions are welcome. (Note that we didn't encounter this as a problem previously because String.fromCharCode was using the now-gone String::create() method to create a string of raw 16-bit codes, so we may encounter further issues with this change. Thoughts?)

-- Renamed String::create() [8-bit strings] to String::createLatin1(), since that's the encoding it expected, as opposed to UTF8.

-- Eliminated String::create() [16-bit strings] in favor of String::createUTF16, since we almost never want to be able to create illegal UTF16 strings. This is a change I'm a little unsure about; what I found was that a fair chunk of code (even internal to the VM) that claimed to be creating a "UTF16" string was in fact creating a string of 16-bit codes that may or may not be valid UTF16.

-- added String::createEndianUTF16() to allow for reading UTF16 in a specific endianness. This allowed us to make String::getData() private (except for some friend class access) to reduce possible misbehavior by clients code.

-- all of the String::createXXX() methods now allow you to omit the length parameter; if you do, we assume the input string is null-terminated and calculate it. 

-- Updated ByteArrayGlue and FileClass to use createEndianUTF16().

-- renamed the String::append() calls to be appendUTF16() or appendLatin1(), as appropriate. (Note, there is no appendUTF8() call)

-- Remove AvmCore::newString calls and replaced usage with the appropriate String::createXXX() call. This turns out to be useful because a large number of the calls are using string literals which are provably Latin1 (not UTF8), thus the creation is simpler and faster since we don't have to analyze the "utf8" string.

-- renamed AvmCore::internAlloc() to AvmCore::internAllocUtf16() to emphasize the expected encoding.

-- collapsed some redundant code in RegExpObject.

-- added comment to StringOutputStream and StringBuffer pointing out the return string is always UTF8 encoded.

-- removed unnecessary "using namespace MMgc" from StringObject.h.

-- Eliminated StringBuilder class entirely (it was slated for demolition anyway, so I removed it rather than fix it)
Attachment #352254 - Flags: review?(daumling)
Comment on attachment 352254 [details] [diff] [review]
Patch to revise creation APIs

The patch looks good, and should be landed - again, it is a huge patch. The comments below do not address issues that would prevent the patch from being landed, and the bug fixes that the patch contains are sorely needed :)

- there are many calls to String::createLatin1(character constant). We should add a new creator function createLatin1Constant() that is an alias to createLatin1(core, const char*, -1, String::k8, true). This avoids the unnecessary copying of the character data.

- also, there are several calls like s = String::concatStrings(s, String::createLatin1(character constant)). These should be replaced with s = s->appendLatin1(character constant), like in MethodClosure.cpp:113.

- StringClass.cpp:49-57 still uses newString(), but is commented out.

- The sequence core->internString(String::createLatin1(character constant)) should be replaced with core->constantString(character constant), like in XMLObject.cpp:2191

- Maybe AvmCore::constantString() should be renamed to utf8Constant() or the like.

- Although it is desirable to make getData() private, I am not sure if we don't limit ourselves too much here. There will be other cases where someone needs a string to be filled with data. As an example, I was thinking of changing the MathUtil methods that fill character buffers to a sequence like

  1) create a String with a sufficient size
  2) Fill the string
  3) Call a new String::shrink(n) method to adjust the length and number of available in-place concat space
  
For MathUtils, this would reduce stack space used (now, the buffers are created on the stack), and avoid having to copy the final buffer.
Attachment #352254 - Flags: review?(daumling) → review+
After the patch is landed #466271 (remove StringBuilder class) could be closed as fixed.
I'll get it landed today and then address these in a followup patch.
Edwin has expressed concerns in bug 468472 that merit further discussion, so I will hold off landing this until we can resolve that.
After some offline discussion with Edwin and Michael, I think this patch needs further tweaking... I won't be landing this as-is, and hope to have a revised version by end of day.
What about fixing the bugs at least?
working on it now :-)
OK, trying this again... patch is similar to previous one, but address most of Michael's suggestions and most of Edwin's concerns. Most of the Unicode-heavy creation APIs are now moved back into AvmCore, which is intended to provide the stable long-term API that I can work with. 

In several cases, the underlying Unicode implementation still lives in String (ie AvmCore is just a wrapper), but isn't substantially more intertwined than it was before I made any changes, so I think we're no worse off in that regard than we were before.

Some of the new API calls are a bit wordy, but IMHO it's a reasonable price to pay for clarity of the code, which was sorely lacking before.

This patch has only been tested in Mac but other platforms will be tested prior to push, if approved.
Attachment #352254 - Attachment is obsolete: true
Attachment #352458 - Flags: review?(daumling)
Comment on attachment 352458 [details] [diff] [review]
Patch #2 for creation APIs

I like wordiness - always better to see what a function does by looking at its name rather than looking at its documentation. These changes will probably increase the burden on users of Tamarin code, though.

I have a few nits, but these are cosmetic, so I have added them to bug #468472.
Attachment #352458 - Flags: review?(daumling) → review+
Just going to reiterate the current bugs that are in this patch (as noted earlier in comment #30). I am going to push a patch to the testconfig.txt to expect these failures so that we can get the build system back under control.

Win64 Release_Debugger:
ecma3/GlobalObject/e15_1_2_5_1.abc : unescape( %6F ) = %�� FAILED! expected: o
as3/sampling/SimpleSampling.abc  avmplus crash: exception 0xC0000005 occurred


PPC 10.4/5 (Release-MIR (-Dforcemir) | Release-LIR (-Ojit) | Release –Dinterp | Release)
ecma3/Array/splice1.abc : exhaustive splice w/no optional args 1 = false FAILED! expected: true
ecma3/Array/splice2.abc : exhaustive splice w/2 optional args 1 = false FAILED! expected: true

PPC 10.4/5 Release_Debugger
ecma3/Expressions/e11_4_6.abc : +( String.fromCharCode(0x000C) +  99 ) = 0 FAILED! expected: 99

PPC 10.4 Release_Debugger
ecma3/Array/e15_4_4_10.abc : exhaustive slice test 3 = false FAILED! expected: true
Looks like the patch https://bugzilla.mozilla.org/attachment.cgi?id=352458 is missing necessary updates to the following files (removal of StringBuilder):

M shell/shell_toplevel.abc
M shell/shell_toplevel.cpp
M shell/shell_toplevel.h

Also the patch appears to have problems compiling on windows:
VectorClass.cpp(82) : warning C4242: 'initializing' : conversion from 'avmplus::utf32_t' to 'const avmplus::wchar', possible loss of data
(In reply to comment #50)
> Looks like the patch https://bugzilla.mozilla.org/attachment.cgi?id=352458 is
> missing necessary updates to the following files (removal of StringBuilder):

correct, I generally omit changes to autogenerated files in my patches to make diffs easier.

> Also the patch appears to have problems compiling on windows:
> VectorClass.cpp(82) : warning C4242: 'initializing' : conversion from
> 'avmplus::utf32_t' to 'const avmplus::wchar', possible loss of data

Yep, as stated in my comment, only tested on Mac (but testing and fixing elsewhere will of course be done prior to pushing)
Comment on attachment 352458 [details] [diff] [review]
Patch #2 for creation APIs

adding Edwin, as he had concerns about previous patch
Attachment #352458 - Flags: review?(edwsmith)
(In reply to comment #49)
> Just going to reiterate the current bugs that are in this patch (as noted
> earlier in comment #30). I am going to push a patch to the testconfig.txt to
> expect these failures so that we can get the build system back under control.

Thanks for keeping us on top of this, Brent. We'll get them under control as soon as possible.
Attachment #352458 - Flags: review?(edwsmith) → review+
(In reply to comment #49)
> Win64 Release_Debugger:
> ecma3/GlobalObject/e15_1_2_5_1.abc : unescape( %6F ) = %�� FAILED! expected: o
> as3/sampling/SimpleSampling.abc  avmplus crash: exception 0xC0000005 occurred

The Win64 failure is intermittent -- it fails each time but not identically each time.
It appears that the Win64 compiler bug that hit concatStrings() is also affecting append() -- disabling optimize-for-speed for it heals this.
(In reply to comment #55)
> It appears that the Win64 compiler bug that hit concatStrings() is also
> affecting append() -- disabling optimize-for-speed for it heals this.

Whoops, spoke too soon, ignore that. The problem actually appears to be JIT-related; it does not happen for me with -Dinterp on, so this may not be String-related as much as String-tickled?
pushed to redux as changeset:   1225:6085ffd9e912
Blocks: 469358
Grab bag o' String fixes:

-- the XMLParser was rewritten to add a new "endtag" field to XMLTag, rather than passing the node name with '/' prepended, to save on String manipulations. It turns out a lot of code downstream depended on the old behavior, so I removed endTag and reverted to the old approach rather than rewriting ticklish legacy code.

-- changed the AvmCore and StringBuffer String APIs that dealt with UTF8 to expect char* rather than utf8_t*; again, lots of client code expects that all text is char*, thus using utf8_t* would require lots of pointless casts. (The String APIs for UTF8 remain unchanged.)

-- UTF16String::length() was incorrectly returning a length that included the null-termination on the end. Now it doesn't.

-- grouped all the newStringXXX and internStringXXX calls together in AvmCore and added better comments. Uniformly allow len to be omitted to imply null termination. added AvmCore::internStringLatin1() because it proved useful.
Attachment #352814 - Flags: review?(edwsmith)
Blocks: 469669
Blocks: 469670
Blocks: 469671
Attachment #352814 - Flags: review?(daumling)
Attachment #352814 - Flags: review?(daumling) → review+
Comment on attachment 352814 [details]
Patch #3 for creation APIs

Looks good. I am not 100% sure about the changes to the handling of the end tag (hard to see in this diff), but I am pretty sure that the changes are OK.
(In reply to comment #59)
> (From update of attachment 352814 [details])
> Looks good. I am not 100% sure about the changes to the handling of the end tag
> (hard to see in this diff), but I am pretty sure that the changes are OK.

The acceptance tests still pass, and Flash code that previously failed now does not, so I think it's OK. Probably slightly less efficient than the previous code, but we can improve that in a subsequent pass if necessary.
pushed to redux as changeset:   1229:0b2ee1612a02
Blocks: 469616
Attachment #352814 - Flags: review?(edwsmith)
Patches to fix problematic string collections:

-- Force String::getData() to go thru a stack-based temporary (StringData) so that the intermediate strings for concats, etc can be found by GC and not collected.

-- restructure the append() method to pass the string further down the chain; due to tail-call optimization in some compilers the upstream stack reference was being lost and allowed GC to collect a string in use.

-- make String::toUTF8String() and String::toUTF16String private; now you must use StUTF8/16String to access these, to ensure there's a valid stack reference to the UTF string in question.

-- fixed an unrelated 64-bit bug in Sampler.cpp that was crashing SimpleSampling.abc
Attachment #353298 - Flags: review?(lhansen)
Attachment #353298 - Flags: review?(daumling)
Comment on attachment 353298 [details] [diff] [review]
Patch #4: fix string disposal errors

It would probably be even better if the classes UTF8String and UTF16String were removed altogether and merged into stUTF8String andStUTF16STring. The member variables would go into the stack-based instance, and the character buffer would be dynamic, there would not be an interior pointer to character data at all.
Attachment #353298 - Flags: review?(daumling) → review+
(In reply to comment #63)
> (From update of attachment 353298 [details] [diff] [review])
> It would probably be even better if the classes UTF8String and UTF16String were
> removed altogether and merged into stUTF8String andStUTF16STring. The member
> variables would go into the stack-based instance, and the character buffer
> would be dynamic, there would not be an interior pointer to character data at
> all.

Fine with me -- do you want to take that on as a followup task?
pushed patch #4 as changeset:   1232:2ea77b90aea6
Attachment #353298 - Flags: review?(lhansen)
StringData::str needs to be made volatile; GCC/PPC is outsmarting us.
Attachment #353531 - Flags: review?(edwsmith)
Attachment #353531 - Flags: review?(edwsmith) → review+
Further revisions to the StUTF8/16String classes: the existing classes allowed you to extract the underlying UTF8/16String pointer, which would have been disastrous if FEATURE_PREFER_STRING_PERFORMANCE was not defined: the dtor would delete the pointer immediately, but someone else might have grabbed it. Instead, we now just provide wrappers for c_str() and length() functions. 

As a result, I made String::toUTF8/16String() public again, but added a warning that most callers probably want to use the "St" classes instead, and that the caller has to be cautious about lifetimes.

Also moved the utf8-indexing code out of UTF8String, into the new StIndexedUTF8String class; RegExpObject was the only bit of code that needs to use this so it doesn't make sense to allocate extra space in every UTF8String to deal with this edge case.

Also fixed a typo in the UTF8String ctor.
Attachment #353572 - Flags: review?(daumling)
Attachment #353572 - Flags: review?(lhansen)
Comment on attachment 353572 [details]
Patch #6: fix dangerous StUTF8/16String design

Looking good!
Attachment #353572 - Flags: review?(daumling) → review+
Attachment #353572 - Flags: review?(lhansen) → review+
There's a serious bug in AbcParser.cpp: when reading in the string constant pool, we set the "constant" flag. The problem with this is that some clients (eg Flash) can later unload the ABC, meaning we have an interned string that now points to an unloaded section of memory. Pointing the the ABC section is certainly desirable for memory savings, but is not safe to do unless we have a way to flush all strings when ABC data is unloaded... for the time being I'm going to offer a patch that passes false for this flag; we can consider a proper optimization later.
core/AbcParser.cpp: do not create pool strings as constant strings, as some clients unload ABC.

core/StringObject.cpp: rename toUTF8String/toUTF16String as getUTF8String/getUTF16String to create a deliberate API incompatibility: most clients calling these functions should really be using StUTF8/16String utility classes instead.
Attachment #353872 - Flags: review?(lhansen)
Attachment #353872 - Flags: review?(lhansen) → review+
Comment on attachment 353872 [details] [diff] [review]
Patch #7: abc strigs must not be constant

pushed to redux as changeset:   1246:fdce7566c1d4
Attachment #353872 - Attachment is obsolete: true
Attachment #353572 - Attachment is obsolete: true
The AbcParser optimization is actually quite simple (I had it in TT): In the PoolObject dtor, walk thru your constant strings, and call makeDynamic() on each of them. That call creates a dynamic buffer and copies the string data over.
I like it! I'll try it out.
(In reply to comment #72)
> The AbcParser optimization is actually quite simple (I had it in TT): In the
> PoolObject dtor, walk thru your constant strings, and call makeDynamic() on
> each of them. That call creates a dynamic buffer and copies the string data
> over.

Actually, I don't see this in the most recent TT tip, but it seems simple enough to implement. I'll try it and see how it works.
OK, so I implemented this (and added a way for makeDynamic to only affect the strings in a given memory range, which was a worthwhile optimization)... but Im not going to offer it as a patch yet: after talking with Lars, there's a likelihood that WORD_CODE builds will flush ABC data entirely in the name of space savings, so this could all be moot. Let's revisit this after the new year.
(Posting for Michael Daumling)

New version of String without interior pointers:
 
1) Replaced the kDirect string type with kDynamic, and removed the STR_DYNAMIC flag.
2) Dissolved the union that held various data elements and replaced it with m_master and m_buffer.
3) Removed the macro that kept interior pointers of the stack.
4) Removed StringData and charAt(StringData...).
5) Merged UTF8String/StUTF8String and UTF16String/StUTF16String, addeds dynamic string buffer, eliminating interior pointers.
 
Optimiations:
 
1) StringIndexer now stores a getter function during construction, eliminating one switch/case per index access.
2) String data buffers are allocated without flags, which should speed up GC when scanning for pointers (before, the entire String instance was scanned for pointers, including the data).
3) m_master doubles as the storage location for a scanned integer (if type != kDynamic), speeding up getIntAtom() and parseIndex().
Attachment #356020 - Flags: review?(stejohns)
Attachment #356020 - Flags: review?(stejohns) → review+
pushed to redux as changeset:   1266:e85901fc6734
Status: REOPENED → RESOLVED
Closed: 16 years ago16 years ago
Resolution: --- → FIXED
Resolved fixed engineering / work item that has been pushed.  Setting status to verified.
Status: RESOLVED → VERIFIED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: