Closed Bug 411163 Opened 13 years ago Closed 12 years ago

TT: strings need a faster charAt() function

Categories

(Tamarin Graveyard :: Tracing Virtual Machine, defect, P1)

x86
Windows XP
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: edwsmith, Assigned: daumling)

Details

Attachments

(2 files, 8 obsolete files)

many pieces of code depend on having O(1) charAt() and charCodeAt().

in sunspider: crypto-md5, crypto-sha1 use strings as lookup tables.

in esc (and probably any other parser), lex-scan.es uses charCodeAt(i++)

ideas:
  * since symbol strings aren't allocated as Strings anymore, allocate Strings as utf16 instead of utf8
  * detect constant-width utf8 strings and add 2-bit scaling factor to String class
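A minimal sketch of the second idea, under a hypothetical header layout (all names here are illustrative, not the Tamarin API): once a string is known to use a constant number of bytes per character, charCodeAt() reduces to a multiply and a load.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical fixed-width string header: 'width' is the per-character
// byte count (1, 2, or 4) detected when the string is created.
struct FixedWidthString {
    const uint8_t* bytes;   // raw character data
    int32_t length;         // character count, not byte count
    uint8_t width;          // bytes per character: 1, 2, or 4
};

// O(1) charCodeAt: with a constant width, the byte offset of character
// i is simply i * width -- no scanning through variable-length UTF-8.
inline uint32_t charCodeAt(const FixedWidthString& s, int32_t i) {
    const uint8_t* p = s.bytes + static_cast<size_t>(i) * s.width;
    switch (s.width) {
    case 1:  return p[0];
    case 2:  return p[0] | (uint32_t(p[1]) << 8);   // little-endian units
    default: return p[0] | (uint32_t(p[1]) << 8) |
                    (uint32_t(p[2]) << 16) | (uint32_t(p[3]) << 24);
    }
}

// "abc" stored one byte per character, and the same text widened to two
static const uint8_t kNarrowData[] = { 'a', 'b', 'c' };
static const uint8_t kWideData[]   = { 'a', 0, 'b', 0, 'c', 0 };
static const FixedWidthString kNarrow = { kNarrowData, 3, 1 };
static const FixedWidthString kWide   = { kWideData,   3, 2 };
```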
pardon, what's a "symbol string"?
most of the strings in the string constant pool in a typical abc are symbol names.   Only a subset of entries in the constant pool ever see the light of day in user code as a String value.

In TC, symbol tables (MultinameHashtable) are indexed by String and Namespace objects.  In TT, symbol tables (SymTable) are indexed by references pointing into the abc -- String objects aren't allocated for them.  This saves a bunch of RAM for symbol tables.  (Namespace objects still are allocated, but maybe don't need to be).

Since symbol-name strings aren't allocated as String values in TT by default there is less of a penalty for representing strings as uint16[] vs uint8[].
IMHO, we should avoid using utf16 in any new code; when utf8 isn't appropriate we should go straight to UCS4.
Steven, can you clarify comment 3?  charAt() and charCodeAt() have to behave as though the string were represented using 16-bit code units.  Why go to UCS-4?
IIRC charAt and charCodeAt have to behave on a per-code-point basis, which is > 16 bits for Unicode 4.0 or later (which is critical for proper support of various Asian languages). The Unicode code point space encompasses roughly 1,100,000 code points, which means that a 32-bit size is the most practical when working on a per-character basis. UTF-16 is now problematic IMHO because it is easy to make a mistake that will work OK for most Western languages but cause a problem for things beyond the Basic Multilingual Plane.
According to the ES3 spec and current JS practice, charAt operates on 16-bit code units, not unicode codepoints.
I stand corrected. 

how might this affect ES4, which is planning on moving to Unicode 4.0+ IIRC?
I don't know... this is definitely an unfortunate edge of the ES3 spec because at that point unicode was only the basic plane. Let's ask the ES4 spec-masters what the current spec says and what it should say.
it's doubly problematic because it implies that all index-based methods (length, index, substr... etc) have to pretend it's utf16, which is really the wrong thing going forward. I suppose we could add an entire parallel set of APIs using code points (e.g. unilength, unicharCodeAt, unicharAt... ugh) but that would be pretty horrible.
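To make the 16-bit code-unit behavior concrete, here is a small C++11 illustration (the particular character is just an example): a single non-BMP code point occupies two UTF-16 code units, and ES3's charAt()/charCodeAt() see each unit separately.

```cpp
#include <cassert>

// U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the Basic Multilingual
// Plane, so a 16-bit representation stores it as the surrogate pair
// 0xD834 0xDD1E -- which is exactly what ES3 charCodeAt() reports.
static const char16_t kClef[] = u"\U0001D11E";

// length in 16-bit code units, excluding the terminator: 2, not 1
static const int kClefUnits = sizeof(kClef) / sizeof(char16_t) - 1;
```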
ES4 is trying to allow full Unicode support, at the expense of interoperation:

http://wiki.ecmascript.org/doku.php?id=proposals:update_unicode

Some details remain to be worked out:

http://bugs.ecmascript.org/ticket/213 (read the latest substantive comment!)
http://bugs.ecmascript.org/ticket/148
http://bugs.ecmascript.org/ticket/37

The real issue is #213. If we wish to allow implementations to support 32-bit Unicode per spec, we cannot allow surrogate pairs to be merged.

TT could take a chance and try to track ES4 as proposed, but if we end up specifying anything else, it will have to change. In the mean time, and for a long time on the web, ES3 compatibility matters more.

See http://weblogs.mozillazine.org/roc/archives/2008/01/string_theory.html for a timely blog post from roc.

My advice is to conform to ES3 for now and optimize under the hood.

/be
(In reply to comment #0)
> in sunspider: crypto-md5, crypto-sha1 use strings as lookup tables.

Sigh. I hugely favour UTF-8 but abusing strings for random-access lookup tables makes it pretty hard to use UTF-8. The only way to optimize it would be to attach a full array of UTF-16 index -> UTF-8 index mappings to any string on which charAt was used. Maybe that wouldn't be so bad...
Sounds like a job for intelligent caching, since I'm guessing that you'll rarely need to use charAt on more than a handful of strings at a time...
The random lookup-table case is a pretty heinous case to optimize for: if people are typically manually iterating through a string you can cache the last lookup location for O(1) access to adjacent items.
It might be simple enough to just lazily create the mapping array on the first charAt to the string and discard all the mapping arrays at each GC. Of course you'd have to invalidate the array when strings change.

> The random lookup-table case is a pretty heinous case to optimize for

Welcome to the Web! I wonder how typical SunSpider's usage is.
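The lazily-built mapping array described above could look something like the sketch below (BMP-only for brevity — a 4-byte sequence would need two map entries for its surrogate pair — and the names are illustrative):

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// On the first random-access charAt, walk the UTF-8 buffer once and
// record the starting byte offset of every character; later lookups
// become O(1) array reads instead of O(n) scans from the front.
std::vector<uint32_t> buildIndexMap(const std::string& utf8) {
    std::vector<uint32_t> map;
    uint32_t i = 0;
    while (i < utf8.size()) {
        map.push_back(i);                        // byte where char starts
        uint8_t lead = static_cast<uint8_t>(utf8[i]);
        if      (lead < 0x80) i += 1;            // ASCII
        else if (lead < 0xE0) i += 2;            // 2-byte sequence
        else if (lead < 0xF0) i += 3;            // 3-byte sequence
        else                  i += 4;            // 4-byte (non-BMP) sequence
    }
    return map;
}
```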
jorendorff mentions

> Also, in ECMAScript Edition 4, it is proposed
> that strings will be indexable, sort of like in
> Python:
>
> "hello world"[6] == "w"

That would really stuff up UTF-8 representation. It basically forces implementations to use UCS-4 if you decree that JS characters are Unicode characters, or UTF-16 if you decree that they're UTF-16 code units. (The latter case is really bad though because it would let people create invalid surrogate pairs or else give you weird error cases to check and handle.) I really strongly encourage the ES4 people to revisit that decision.
roc: indexing is no different from charAt, except for the lack of an override opportunity (String.prototype.charAt can be replaced without affecting indexing). It's supported not only in SpiderMonkey, but also (IIRC) in Opera and WebKit. OTOH, it is not widely used on the Web.

Neither is charAt (more below), but it is used (along with charCodeAt), sometimes for want of a byte vector type in JS (which is coming, see

http://bugs.ecmascript.org/ticket/234

and feel free to comment.

Iterators are coming too, and while the exact interface evolved a fair bit from JS1.7 to cover all cases and use the type system, you can mock up a useful string iterator now:

js> String.prototype.__iterator__ = function (keys) {
    for (let i=0; i<this.length; i++)
        yield keys ? i : this[i];
}
function (keys) {
    for (let i = 0; i < this.length; i++) {
        yield keys ? i : this[i];
    }
}
js> for (i in "hello") print(i);
0
1
2
3
4
js> for each (i in "hello") print(i);
h
e
l
l
o

I'll follow up to make sure string iteration is as sweet as possible in JS2/ES4, but we can't eliminate charAt or indexing (backward compatibility). The random access benchmarks are not representative of much, though: real string processing in JS uses RegExps and indexOf more than charAt.

None of this changes the challenge to implementors: support the Web as it is, with Java-inspired charAt/charCodeAt and "UCS-2", and optimize to compete at the often somewhat to very bogus benchmarks.

/be
> roc: indexing is no different from charAt

Read-only indexing, yes, and is amenable to various optimizations that might make it tractable for non-UTF-16 representations. Writeable indexing is much worse though, I don't see any reasonable way to optimize that.
JS strings are immutable -- properties reflected as indexed single-string elements appear to be read-only. Same for JS2/ES4 slice syntax -- slices are rvalues.

/be
Component: Virtual Machine → Tracing Virtual Machine
QA Contact: vm → tracing-vm
Note, a weak partial fix (for single-byte string only) is in the patch for https://bugzilla.mozilla.org/show_bug.cgi?id=416924 but we probably need a more robust solution.
See also bug 416411.

/be
Priority: -- → P1
Is UTF-8 really the way to go for TT?

Was the decision made in order to save memory? If so, do we have statistics about how much memory is saved in light of various locales? Looking at the most common character sets, it may very well be true that we have an increase in memory instead of a decrease.

For CJK languages, this is true, because all characters at or above 0x0800 need three bytes in UTF-8, which would increase the memory needs by 50%. For Russian, the memory size should be about the same, since Cyrillic is in the 0x0400 range, which needs 2 bytes. For most Western languages, there is a 50% memory savings for ASCII, but there is a 100% penalty for characters between 0x80 and 0xFF on the Latin-1 code page.

There is still the problem of non-BMP characters that need to be encoded as UTF-16 surrogate pairs (which need 4 bytes, as do their UTF-8 counterparts). It is nice to have non-BMP characters as single entities, but this behavior may very well break the Web, where UTF-16 surrogate pairs are expected, as Brendan et al. already pointed out. IMHO, a 32-bit implementation of strings, although desirable, is not an option for now.

So what if the String implementation would either be 8-bit or 16-bit, depending on the character set? If the character set is Latin-1 (or a UTF-8 buffer looks like it only contains characters between 0x00 and 0xFF), the string uses 8-bit characters, while it uses 16-bit characters for all other character sets. We may also want to add a String creator that simply accepts 8-bit strings as-is, leaving the decision whether the string is valid Latin-1 to the implementer.

We cannot afford to use UTF-8. Many scripts use charAt() and charCodeAt(), and there is no practical solution to make these routines as fast as if they were based on a string with fixed-width characters.
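The 8-bit-or-16-bit selection rule sketched above amounts to a single scan over the decoded characters (names hypothetical; a 32-bit case is included for completeness):

```cpp
#include <cassert>
#include <cstdint>

// Pick the narrowest fixed width that can hold every character:
// 1 byte for Latin-1 (<= 0xFF), 2 bytes for the BMP (<= 0xFFFF),
// 4 bytes as soon as any non-BMP code point appears.
inline int pickWidth(const uint32_t* codePoints, int n) {
    int width = 1;
    for (int i = 0; i < n; i++) {
        if (codePoints[i] > 0xFFFF) return 4;   // non-BMP forces 32-bit
        if (codePoints[i] > 0xFF)   width = 2;  // beyond Latin-1 needs 16-bit
    }
    return width;
}

static const uint32_t kLatin[]    = { 0x41, 0xE9 };    // 'A', e-acute
static const uint32_t kCyrillic[] = { 0x41, 0x0430 };  // 'A', Cyrillic a
static const uint32_t kAstral[]   = { 0x1D11E };       // non-BMP
```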
The UTF8 size results in bug 416411 may be of interest, although we were measuring the DOM, not JS. Perhaps we should get Dave Mandelin (or someone else) to instrument Spidermonkey.
> Was the decision made in order to save memory?

Yes. The gotcha here is that there are two distinct kinds of string we encounter: 

(1) those that are used by the language itself and/or the DOM; these are almost always in the ASCII range, thus UTF8 is an ideal format.

(2) those that are used for arbitrary user data; as you point out, this has different memory usage implications depending on what range of Unicode is likely to be dominant, and horrible performance characteristics for random-access.

Regarding UTF-16 and non-BMP characters, I think that using a 16-bit character size at this point would be a big problem. Web usage and/or ES4 may continue to constrain character indices to 16-bit surrogate pair indices (though I fervently hope not), but not all users of TT will (in particular, future versions of Flash will require efficient access to the entire Unicode range). 

So my proposal would be to allow the internal representation of String to be (generally) a UCS-4 encoding, but with strings in the ASCII range represented in a single-byte (UTF-8) format for efficiency.

So let us use fixed-width strings, but with three different widths: 8, 16, and 32.

There would be three creation methods on AvmCore: newString8(), newString16(), newString32(). An 8-bit string is NOT UTF-8, but covers the first 256 Unicode characters. Concatenating two strings results in a new string of the wider of the two widths.

Then, we have three getter methods on the String class: getAs8(), getAs16(), and getAs32(). These methods return the string itself, convert it to a wider format, or return NULL if conversion to a narrower format fails. Again, NO UTF-8, but conversion of non-BMP 32-bit characters to UTF-16 surrogates and back.

This way, the String class has fast indexing and other operations while maintaining the optimum string format and saving memory. Implementers can choose to use only a single width; if an implementer wants 32-bit characters only, so be it :)

I guess that there are more methods to think of (like String::getWidth(), or String::getAsUtf8() if really needed), but let this comment serve as an initial idea.
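Two rules from this proposal can be sketched in a few lines (enum values 1/2/4 as bytes per character; the function names are mine, not the patch's): concatenation takes the wider operand width, and a getAs-style narrowing succeeds only if every character fits.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

enum Width { k8 = 1, k16 = 2, k32 = 4 };   // bytes per character

// Concatenating two strings yields the wider of the two widths.
inline Width concatWidth(Width left, Width right) {
    return static_cast<Width>(std::max<int>(left, right));
}

// getAs8/16/32-style narrowing check: conversion to a narrower width
// can only succeed when every character fits; otherwise the proposal
// has the getter return NULL.
inline bool fitsWidth(const uint32_t* chars, int n, Width w) {
    const uint32_t limit =
        (w == k8) ? 0xFFu : (w == k16) ? 0xFFFFu : 0xFFFFFFFFu;
    for (int i = 0; i < n; i++)
        if (chars[i] > limit) return false;
    return true;
}

static const uint32_t kSample[] = { 0x48, 0x0416 };  // 'H', Cyrillic Zhe
```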
Good idea. Let's make it so.
Three string implementations sounds like a bit of work, and it makes further optimizations more work.

SpiderMonkey strings have O(1) charAt and charCodeAt, and optimized append and substring operations.  We'd like to port those into TT.  Could we try that first and see how folks feel about it?

They won't address the non-BMP/ES4 concerns; they still hold 16-bit code units.
We have pretty solid evidence that there are enough 8-bit strings in the world that 8-bit optimizations are worth the pain (both in Flash and in JS; Dave Mandelin's data, I think).

It's reasonable to do 16-bit only as a way of wading into the shallow end of TT, get a reviewable patch up, and some perf numbers. But for something to land in TT, I think the minimum is 8/16.

Then we could add 32bit in a second phase.

waddya think
I should clarify that I never did any measurements on JS strings, just on Mozilla browser strings (although some of these can be shared with Spidermonkey). I found a memory savings of 1% for UTF-8 over UTF-16, so at least on the Mozilla side, we concluded the memory savings weren't worth any significant trouble. (This is true for even CJK, see http://blog.mozilla.com/dmandelin/2008/02/14/wtf-16/). 
I saw that TT strings have been assigned to Moh and Jim; nevertheless, I'd like to volunteer for the implementation if Moh and Jim agree.
OK, I'll assign the bug to myself. There will be a Wiki page at http://wiki.mozilla.org/Tamarin:Plans#Shared_work_items.
Status: NEW → ASSIGNED
Michael, please talk to Jim Blandy before proceeding. We would like to integrate the string optimizations from Spidermonkey as part of this, to facilitate actionmonkey.
see bug 427959 for optimizing concat by appending to a buffer whose refcount == 1
Assignee: nobody → daumling
Status: ASSIGNED → NEW
Who will review this patch?
Simply copy the numbers (not the entire lines) from a Windows console into the spreadsheet column.
This is the first shot at new strings. The attached diff adds seven files to Win VC 2005 and Mac XCode projects:

StringObjectFW.cpp
StringObjectFW.h
StringClassFW.cpp
StringUtils.cpp
StringUtils.h
StringRange.cpp
StringRange.h

There is a new, temporary #define TT_NEW_STRINGS in avmbuild.h. If defined, TT is compiled using the new strings; if undefined, the old strings are restored. This makes it possible to use the old strings on platforms where the new strings have not been tested yet.

The new strings pass the acceptance tests on Win VC 2005 and Mac. On Win, all SunSpider tests run three times faster. The very slow string-validate-input tests run 15-17 times faster. Even with these tests excluded, SunSpider runs up to 20% faster.

I would need some help to verify the implementation on platforms and compilers other than VC 2005 and Mac XCode.

SpiderMonkey finalizers are planned, but not implemented yet. Also, I have not done much optimization yet, so there is still room for more. RegExps will run slower, because strings have to be translated back to UTF-8, so the next big project would be to change pcre to use the String::charAt() API.

I've attached SunSpider.xls, which compares old and new performance results. On Windows, just copy and paste the number columns from the Console into the spreadsheet.
Status: NEW → ASSIGNED
I'll be happy to review it.
Attachment #325584 - Flags: review?(stejohns)
Nice work!

comments (note that none of these would prevent a submit, except perhaps the _narrow32_16 glitch):

in String::createDirect, there's the comment "No kZero here - we do not store pointers, just data", but you still allocate with kContainsPointers. is this right?

it looks like there may be a merge glitch in _narrow32_16:
 
		// do you really want a double while? 
		while (len-- > 0)
		while (len-- > 0)
		{
	

would it make more sense to have String::Width declared as
		enum Width
		{
			kAuto	= 0,	// only used in APIs
			k8	= 1,
			k16	= 2,
			k32	= 4,
		};
so that _copyBuffers (and others) could avoid the "+1" magic on dstWidth when calculating memcpy ranges? I thought it was just a bug at first. If we're going to rely on enum values, why not just have them be the ones we need?

can String_private__charAt make use of core->cachedChars (the way that String_private_c2s does)?

String::substr has a comment that I suspect is stale:
	start = (int32_t)_clampIndexInt (start, m_length); 
	// ClampIndex takes a double (not int32_t or uint) for first parm...
(note that start and m_length are both ints)

The String finalizer stuff might be problematic on some mobile platforms (Symbian?) that restrict writable global data. Can the finalizer table be const? Do we even need user-defined finalizers? (I'm trying to think of a use case)

Also it looks like String_dynamicFinalizer, String_concatFinalizer, and String_dependentFinalizer are all the same... and also redundant to code in String::finalize itself. What am I missing here?

String::append is declared in the .h file but is entirely commented out in the .cpp file.

There are some TODO comments that I'd like a little more info on... ie are they things that definitely need more work?

AvmCore::writeUnichar // TODO: change to something else than UniChar?
PoolObject::_getPoolString // TODO: make thread safe
String::concatStrings // TODO: make thread safe
String_private_c2s // TODO: add 8-bit strings if desired
String::Finalizer finalizers // TODO: Move the table to global data section.

tiny quibble: there are a few naked "int" and such sprinkled in the code; we're trying to move to the C99 types (int32_t) for new code. ditto for "signed char" vs "int8_t".

tiny quibble: there are various chunks of commented-out code (eg initAvm) that look like they are not likely to ever be resurrected; we generally prefer to avoid checking in commented-out code unless it's being preserved as a negative example. (ditto the various chunks marked TO BE REMOVED)

did you mean to turn on AVMPLUS_VERBOSE for all builds? (I presume this was an accidental inclusion)
User-defined finalizers are a SpiderMonkey API feature. See

http://developer.mozilla.org/en/docs/JS_AddExternalStringFinalizer

/be
> in String::createDirect, there's the comment "No kZero here - we do not store
> pointers, just data", but you still allocate with kContainsPointers. is this
> right?

I am getting an assert inside MMgc if I don't include that bit. No idea what is going on here - maybe someone can comment?

> it looks there may be a merge glitch in _narrow32_16:

Oh yes - will be fixed.

> would it make more sense to have String::Width declared as...

I'll check - not a bad idea at all :)

> can String_private__charAt make use of core->cachedChars (the way that
> String_private_c2s does)?

StringUtils::create() takes care of this.

> String::substr has a comment that I suspect is stale:...

I've copied the code from the old string implementation. I want to revisit that code once the patch has landed, and make sure that the code conforms to ECMA-262.

> The String finalizer stuff might be problematic on some mobile platforms
> (Symbian?) that restrict writable global data. Can the finalizer table be
> const? Do we even need user-defined finalizers? (I'm trying to think of a use
> case)

Brendan already pointed out that finalizers are a SM feature.

> Also it looks like String_dynamicFinalizer, String_concatFinalizer, and
> String_dependentFinalizer are all the same... and also redundant to code in
> String::finalize itself. What am I missing here?

True - these finalizers are left-overs from an earlier step. I can remove these. A better place to store finalizers would be in the AvmCore structure. I can move the data there if you like.

> String::append is declared in the .h file but is entirely commented out in the
> .cpp file.

I want to resurrect that code once I start optimizing the TT code base. append() can be handy in cases where string constants are appended during the preparation of messages. This will save some memory as well as allocations.
 
> There are some TODO comments that I'd like a little more info on... ie are they things that definitely need more work?

Yes - I have marked all places that need locks for later multithreading support.

> tiny quibble: there are a few naked "int" and such sprinkled in the code; we're trying to move to the C99 types (int32_t) for new code. ditto for "signed char" vs "int8_t".

Will fix.

> tiny quibble: there are various chunks of commented-out code (eg initAvm) that
> look like they are not likely to ever be resurrected; we generally prefer to
> avoid checking in commented-out code unless it's being preserved as a negative
> example. (ditto the various chunks marked TO BE REMOVED)

TO BE REMOVED is stuff that I want to remove from the code base once new strings are integrated. I can remove the remaining stuff for now.

> did you mean to turn on AVMPLUS_VERBOSE for all builds? (I presume this was an
accidental inclusion)

Probably - I'll check.

I'll submit an updated patch tomorrow.
Do we need our String class to support SpiderMonkey's "external" feature, or can we have hosts use MMgc for string memory instead? If we really need it, it could be implemented outside TT using a presweep hook and a host-managed table of external strings.
Okey doke, I'll hold off doing test builds and such till I get the updated patch -- you already have it working on VC2005 and XCode, so other targets should only be trivial issues for me to deal with.

Re: custom finalizers, I didn't realize that SM offered this... but that brings up the question of what SM clients use them for. Is it a requirement for TT to support them if it wants to be used in place of SM? Or is it a legacy API that could be deprecated? (I'm not morally opposed to them, just want to ensure they are a definite requirement for SM compatibility) If they are necessary then moving them to AvmCore would probably be a better solution.

Re: the MMgc assert, I dunno -- Tommy is the go-to guy on MMgc mysteries :-)
Let's not worry about spidermonkey strings yet... we'd definitely like to avoid the finalizer penalty in the common case.
(In reply to comment #41)
> Do we need our String class to support spidermonkey's "external" feature or can
> we have host's use MMgc for string memory instead?    If we really need it
> could be implemented outside TT using a presweep hook and a host managed table
> of external strings.
> 

Someone weigh in if they feel differently. I think this string impl can require MMgc and even depend on features/behavior of MMgc. Frankly, I see strings (and arrays and hashtables, etc.) as *features* of the memory manager; any opportunity to optimize them together should be fully exploited.
This patch implements all of Steven's comments - a second pair of eyes is priceless!

I have commented out the code that handles string finalizers for now (using TTSTR_FINALIZERS) until an agreement has been reached about whether this should be handled inside the String class or not. I believe that it should - how else could we call an external finalizer function inside String::finalize()? We still need to agree upon how to pass in the JSContext for SM, but we can leave that discussion for later.
Attachment #325584 - Attachment is obsolete: true
Attachment #325584 - Flags: review?(stejohns)
Excellent! I'll re-review all of this -- if I review+ it do you want me to push the patch as well?
Please do so - I doubt that I have write access, and I love to blame any problems on someone else anyway :))
Looks like only a few trivial fixes are necessary.

Also, I had to put back in the GCContainsPointers bit since I'm still getting the assert -- we should ask Tommy about it.

I also enabled the use of memcmp for string compares (we should have been doing that in the old code too - I didn't know we weren't)
pushed as changeset:   417:275e2895d4c3

leaving open for now -- not sure if Michael has additional work he wants to do.
I'll use this bug to push optimizations not related to pcre.

Also, I'd like to remove the #define TT_NEW_STRINGS as soon as the new strings run stable on all platforms. Currently, I am spreading this #define all over the code, and that is ugly.
Right now you have to use kContainsPointers if you want to participate in DRC.  I'll file a bug.
Add this text to your autoexp.dat to have a visualizer for new Strings. Very handy if you use VC 2005 (it will probably also work with VC 2008). Instructions inside. Also contains ref count visualizers for RCObject and AvmPlusScriptableObject.
re: remove the #define TT_NEW_STRINGS

agreed. they seem to be running just fine. let's give it a day or three for issues to shake out. 

btw, I found a potential bug in _getPoolString; if you ask for an interned string but the string in the table is non-interned, you get the non-interned string. A fix is included in my suggested patch for https://bugzilla.mozilla.org/show_bug.cgi?id=440242
I'll fix that. I have another one: String::finalize() just nulls the pointers of dependent strings instead of doing WBRC() calls with NULL; this may lead to MMgc not being able to free dependent strings. I'll attach a patch tomorrow.
Attached patch Updated patch (obsolete) — Splinter Review
Lots of fixes:

- Rewrote concat()
- made sure that WBRC(NULL) is called to release pointers
- substr() makes sure that no kDependent hierarchies are built
- String::tryDelete() deletes this if RefCount() == 0, and StringDataUTF8 and others make use of that feature to offload GC
- faster UTF/string index conversion routines as part of StringDataUTF8
- changed fields in m_bitsAndFlags
- made Pointers union public, and added fast nextch() routine

Passes all acceptance tests (well, 6 errors, but the same ones as without the new strings: typeof NaN != Number, and the class of new Function() is null instead of Function)
Attachment #325585 - Attachment is obsolete: true
Attachment #325612 - Attachment is obsolete: true
Attachment #326293 - Flags: review?(stejohns)
The m_bitsAndFlags field has changed - this updated visualizer adapts the new structure.

String display is e.g.:

[8/direct] "abcdefgh" - 8 bits, direct buffer
{8/static} "abcdefgh" - 8 bits, static constant, interned
Attachment #326294 - Flags: review?(stejohns)
Attachment #326294 - Flags: review?(stejohns) → review+
Comment on attachment 326293 [details] [diff] [review]
Updated patch

Nice. I'll get it pushed later today.

Any significant impact on performance, positive or negative?
Attachment #326293 - Flags: review?(stejohns) → review+
FYI, it looks like the patch you attached has a few files with CRLF and some with LF-only, causing patch to barf on non-Windows systems. I can work around it but probably good to normalize line endings before diffing.
Sorry 'bout that - I was not sure if that was a problem.

Performance gains are hard to tell - the SunSpider numbers are so small that most of the tests show substantial run-to-run variance; the resolution of the timer is too low. Some tests seem to run faster, though. Repeated substrs should be faster now, as will be concats. But there is still room for improvement! RegExps, especially, are still slow.

Would it be possible to increase the number of iterations in the SunSpider tests?
Yeah, unfortunately HG seems to be a bit daft about dealing with line-ending differences. One would think that such problems would be easy to solve, but alas, apparently not. (Rumor has it there are ways to configure HG to improve this, but why it can't just be set to ignore line-endings by default when diffing common file types, I dunno...)

Re: increasing number of iterations, yeah, that might well be useful. Brent Baker and Dan Schaeffer do most of the maintenance of those tests, cc'ing them.
hg diff has a few options, one of which, IIRC, deals with ignoring white space things. Maybe you can find what you need there?
AFAIK that option affects spaces/tabs but not line endings.

The only builtin solution I'm aware of is http://www.selenic.com/mercurial/wiki/index.cgi/EncodeDecodeFilter which strikes me as unnecessarily complex; IMHO being able to ignore line endings in diffs (and normalize on commits without requiring external utilities) should be built-in. 
Line endings should be canonicalized to \n in the repo. This is not just a diff or display issue. Graydon testifies that it's surprisingly hard for VCS to guess what the line ending should be, just by looking at bytes in the file.

/be
I hate to admit it - but I failed to test the latest patch on the Mac, plus a bug somehow crept back into the patch that had to be fixed.

This patch should work now - and it appears to have consistent line endings (no wonder - I used a Mac to create the diff :)
Attachment #325771 - Attachment is obsolete: true
Attachment #326293 - Attachment is obsolete: true
Attachment #326352 - Flags: review?(stejohns)
Comment on attachment 326352 [details] [diff] [review]
Patch that fixes GCC warnings and a bug

pushed as changeset:   432:7a7fa16cdee6
Attachment #326352 - Flags: review?(stejohns) → review+
Once the algorithmic aspects of the new code are done, please let me know so we can start looking at SSE2 opportunities.

- moh 
Attached patch Fix doubles charAt() performance (obsolete) — Splinter Review
This fix makes sure that StringUtils::create() returns cached characters for codepoints < 128, which was not the case before. The C++ wrapper for charAt() calls this method.

A cosmetic fix in StringObjectFW.cpp.

And the addition of CYGWIN_NT-6.0 to runtests.py for Vista support.
Attachment #326677 - Flags: review?(stejohns)
Attachment #326677 - Flags: review?(stejohns) → review+
pushed as changeset:   438:34e27ae30694
- removed the kConcat and kDynamic string types with associated data structures like StringBuf
- changed concatenate() to create a kDirect string with extra chars at the end if rightStr cannot be appended in-place; a #define TTSTR_EXTRA_MAX can be platform-tweaked.
- Added a check to see if rightStr already exists in leftStr's extended buffer (e.g. if leftStr == "123ABC" and len == 3, and rightStr == "AB", then a kDependent string would be created without copying the data)
- substr(): When this was kDependent, only start got updated, but not end when resolving the master string
- removed flatten(), collectTree(), and getUsedChars() (the latter being integrated into concatenate())
- the whole thing is a little more transparent now
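The in-place-append optimization in this patch relies on over-allocating the result of a copy-producing concat. A sketch of one possible growth rule (the constant stands in for TTSTR_EXTRA_MAX; the actual policy in the patch may differ):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

static const int32_t kExtraMax = 256;  // stand-in for TTSTR_EXTRA_MAX

// When a concat cannot append into the left operand's buffer, allocate
// the result with slack so the *next* append is likely to fit in place.
// Growing by half the needed size (capped) makes a chain of appends
// cost amortized O(1) copies instead of a full copy per append.
inline int32_t concatCapacity(int32_t leftLen, int32_t rightLen) {
    const int32_t needed = leftLen + rightLen;
    return needed + std::min(needed / 2, kExtraMax);
}
```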
Attachment #326352 - Attachment is obsolete: true
Attachment #326677 - Attachment is obsolete: true
Attachment #328353 - Flags: review?(stejohns)
Needed after patch is landed
Attachment #326294 - Attachment is obsolete: true
Comment on attachment 328353 [details] [diff] [review]
Recursion-free version of strings

I love it when big chunks of code just go away... :-)
Attachment #328353 - Flags: review?(stejohns) → review+
pushed as changeset:   481:a9a4b1502555
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED