411163 - TT: strings need a faster charAt() function

Reporter

Description

•

17 years ago

many pieces of code depend on having O(1) charAt() and charCodeAt(). in sunspider: crypto-md5, crypto-sha1 use strings as lookup tables. in esc (and probably any other parser), lex-scan.es uses charCodeAt(i++) ideas: * since symbol strings aren't allocated as Strings anymore, allocate Strings as utf16 instead of utf8 * detect constant-width utf8 strings and add 2-bit scaling factor to String class

Benjamin Smedberg

Comment 1

•

17 years ago

pardon, what's a "symbol string"?

Edwin Smith

Reporter

Comment 2

•

17 years ago

most of the strings in the string constant pool in a typical abc are symbol names. Only a subset of entries in the constant pool ever see the light of day in user code as a String value. In TC, symbol tables (MultinameHashtable) are indexed by String and Namespace objects. In TT, symbol tables (SymTable) are indexed by references pointing into the abc -- String objects aren't allocated for them. This saves a bunch of RAM for symbol tables. (Namespace objects still are allocated, but maybe don't need to be). Since symbol-name strings aren't allocated as String values in TT by default there is less of a penalty for representing strings as uint16[] vs uint8[].

Steven Johnson

Comment 3

•

17 years ago

IMHO, we should avoid using utf16 in any new code; when utf8 isn't appropriate we should go straight to UCS4.

Jason Orendorff [:jorendorff]

Comment 4

•

17 years ago

Steven, can you clarify comment 3? charAt() and charCodeAt() have to behave as though the string were represented using 16-bit code units. Why go to UCS-4?

Steven Johnson

Comment 5

•

17 years ago

IIRC charAt and charCodeAt have to behave on a per-code-point basis, which is > 16 bits for Unicode 4.0 or later (which is critical for proper support of various Asian languages). A Unicode codepoint encompasses ~1000000 characters which means that a 32-bit size is the most practical when working on per-character basis. UTF16 is now problematic IMHO because it is easy to make a mistake that will work OK for most Western languages but cause a problem for things beyond the Basic Multilingual Plane.

Benjamin Smedberg

Comment 6

•

17 years ago

According to the ES3 spec and current JS practice, charAt operates on 16-bit code units, not unicode codepoints.

Steven Johnson

Comment 7

•

17 years ago

I stand corrected. how might this affect ES4, which is planning on moving to Unicode 4.0+ IIRC?

Benjamin Smedberg

Comment 8

•

17 years ago

I don't know... this is definitely an unfortunate edge of the ES3 spec because at that point unicode was only the basic plane. Let's ask the ES4 spec-masters what the current spec says and what it should say.

Steven Johnson

Comment 9

•

17 years ago

it's double problematic because it implies that all index-based methods (length, index, substr... etc) have to pretend it's utf16, which is really the wrong thing going forward. I suppose we could add an entire parallel set of apis using codepoints (eg, unilength, unicharCodeAt, unicharAt.... ugh) but that would be pretty horrible.

Brendan Eich [:brendan]

Comment 10

•

17 years ago

ES4 is trying to allow full Unicode support, at the expense of interoperation: http://wiki.ecmascript.org/doku.php?id=proposals:update_unicode Some details remain to be worked out: http://bugs.ecmascript.org/ticket/213 (read the latest substantive comment!) http://bugs.ecmascript.org/ticket/148 http://bugs.ecmascript.org/ticket/37 The real issue is #213. If we wish to allow implementations to support 32-bit Unicode per spec, we cannot allow surrogate pairs to be merged. TT could take a chance and try to track ES4 as proposed, but if we end up specifying anything else, it will have to change. In the mean time, and for a long time on the web, ES3 compatibility matters more. See http://weblogs.mozillazine.org/roc/archives/2008/01/string_theory.html for a timely blog post from roc. My advice is to conform to ES3 for now and optimize under the hood. /be

Robert O'Callahan (:roc) (email my personal email if necessary)

Comment 11

•

17 years ago

(In reply to comment #0) > in sunspider: crypto-md5, crypto-sha1 use strings as lookup tables. Sigh. I hugely favour UTF-8 but abusing strings for random-access lookup tables makes it pretty hard to use UTF-8. The only way to optimize it would be to attach a full array of UTF-16 index -> UTF-8 index mappings to any string on which charAt was used. Maybe that wouldn't be so bad...

Steven Johnson

Comment 12

•

17 years ago

Sounds like a job for intelligent caching, since I'm guessing that you''ll rarely need to use charAt one more than a handful of strings at a time...

Benjamin Smedberg

Comment 13

•

17 years ago

The random lookup-table case is a pretty heinous case to optimize for: if people are typically manually iterating through a string you can cache the last lookup location for O(1) access to adjacent items.

Robert O'Callahan (:roc) (email my personal email if necessary)

Comment 14

•

17 years ago

It might be simple enough to just lazily create the mapping array on the first charAt to the string and discard all the mapping arrays at each GC. Of course you'd have to invalidate the array when strings change. > The random lookup-table case is a pretty heinous case to optimize for Welcome to the Web! I wonder how typical SunSpider's usage is.

Robert O'Callahan (:roc) (email my personal email if necessary)

Comment 15

•

17 years ago

jorendorff mentions > Also, in ECMAScript Edition 4, it is proposed > that strings will be indexable, sort of like in > Python: > > "hello world"[6] == "w" That would really stuff up UTF-8 representation. It basically forces implementations to use UCS-4 if you decree that JS characters are Unicode characters, or UTF-16 if you decree that they're UTF-16 code units. (The latter case is really bad though because it would let people create invalid surrogate pairs or else give you weird error cases to check and handle.) I really strongly encourage the ES4 people to revisit that decision.

Brendan Eich [:brendan]

Comment 16

•

17 years ago

roc: indexing is no different from charAt, except for the lack of an override opportunity (String.prototype.charAt can be replaced without affecting indexing). It's in not only SpiderMonkey, but also (IIRC) Opera and Webkit. OTOH, it is not widely used on the Web. Neither is charAt (more below), but it is used (along with charCodeAt), sometimes for want of a byte vector type in JS (which is coming, see http://bugs.ecmascript.org/ticket/234 and feel free to comment. Iterators are coming too, and while the exact interface evolved a fair bit from JS1.7 to cover all cases and use the type system, you can mock up a useful string iterator now: js> String.prototype.__iterator__ = function (keys) { for (let i=0; i<this.length; i++) yield keys ? i : this[i]; } function (keys) { for (let i = 0; i < this.length; i++) { yield keys ? i : this[i]; } } js> for (i in "hello") print(i); 0 1 2 3 4 js> for each (i in "hello") print(i); h e l l o I'll follow up to make sure string iteration is as sweet as possible in JS2/ES4, but we can't eliminate charAt or indexing (backward compatibility). The random access benchmarks are not representative of much, though: real string processing in JS uses RegExps and indexOf more than charAt. None of this changes the challenge to implementors: support the Web as it is, with Java-inspired charAt/charCodeAt and "UCS-2", and optimize to compete at the often somewhat to very bogus benchmarks. /be

Robert O'Callahan (:roc) (email my personal email if necessary)

Comment 17

•

17 years ago

> roc: indexing is no different from charAt Read-only indexing, yes, and is amenable to various optimizations that might make it tractable for non-UTF-16 representations. Writeable indexing is much worse though, I don't see any reasonable way to optimize that.

Brendan Eich [:brendan]

Comment 18

•

17 years ago

JS strings are immutable -- properties reflected as indexed single-string elements appear to be read-only. Same for JS2/ES4 slice syntax -- slices are rvalues. /be

Robert O'Callahan (:roc) (email my personal email if necessary)

Comment 19

•

17 years ago

Gah, sorry for being stupid.

Edwin Smith

Reporter

Updated

•

17 years ago

Component: Virtual Machine → Tracing Virtual Machine

QA Contact: vm → tracing-vm

Steven Johnson

Comment 20

•

17 years ago

Note, a weak partial fix (for single-byte string only) is in the patch for https://bugzilla.mozilla.org/show_bug.cgi?id=416924 but we probably need a more robust solution.

Brendan Eich [:brendan]

Comment 21

•

17 years ago

Updated

•

17 years ago

Priority: -- → P1

Michael Daumling

Assignee

Comment 22

•

17 years ago

Is UTF-8 really the way to go for TT? Was the decision made in order to save memory? If so, do we have statistics about how much memory is saved in light of various locales? Looking at the most common character sets, it may very well be true that we have an increase in memory instead of a decrease. For CJK languages, this is true, because all characters above 0x0FFF need three bytes, which would increase the memory needs by 50%. For Russian, the memory size should be about he same, since Cyrillic is in the 0x400 range, which needs 2 bytes. For most Western languages, there is a 50% memory savings for ASCII, but there is a 100% penalty for characters between 0x80 and 0xFF on the Latin-1 code page. There is still the problem of non-BMP characters that need to be encoded into UTF-16 surrogate pairs (which need 4 bytes as well as their UTF-8 counterpart). It is nice to have non-BMP character as one entity, but this behavior may very well break the Web, where UTF-16 surrogate pairs are expected, as Brendan et al already pointed out. IMHO, a 32-bit implementation of strings, although desirable, is not an option for now. So what if the String implementation would either be 8-bit or 16-bit, depending on the character set? If the character set is Latin-1 (or a UTF-8 buffer looks like it only contains characters between 0x00 and 0xFF), the string uses 8-bit characters, while it uses 16-bit characters for all other character sets. We may also want to add a String creator that simply accepts 8-bit strings as-is, leaving the decision whether the string is valid Latin-1 to the implementer. We cannot afford to use UTF-8. Many scripts use charAt() and charCodeAt(), and there is no practical solution to make these routines as fast as if they were based on a string with fixed-width characters.

Robert O'Callahan (:roc) (email my personal email if necessary)

Comment 23

•

17 years ago

The UTF8 size results in bug 416411 may be of interest, although we were measuring the DOM, not JS. Perhaps we should get Dave Mandelin (or someone else) to instrument Spidermonkey.

Steven Johnson

Comment 24

•

17 years ago

> Was the decision made in order to save memory? Yes. The gotcha here is that there are two distinct kinds of string we encounter: (1) those that are used by the language itself and/or the DOM; these are almost always in the ASCII range, thus UTF8 is an ideal format. (2) those that are used for arbitrary user data; as you point out, this has different memory usage implications depending on what range of Unicode is likely to be dominant, and horrible performance characteristics for random-access. Regarding UTF-16 and non-BMP characters, I think that using a 16-bit character size at this point would be a big problem. Web usage and/or ES4 may continue to constrain character indices to 16-bit surrogate pair indices (though I fervently hope not), but not all users of TT will (in particular, future versions of Flash will require efficient access to the entire Unicode range). So my proposal would be to allow the internal representation of String be (generally) a UCS4 encoding, but with strings in the ASCII range represented in a single-byte (UTF8) format for efficiency.

Michael Daumling

Assignee

Comment 25

•

17 years ago

So let us use fixed-width strings, but with three different widths: 8, 16, and 32. There would be three creation methods at AvmCore: newString8(), newString16(), newString32(). An 8-bit string is NOT UTF-8, but covers the first 256 Unicode characters. Concatenating two strings, results in a new string of the bigger of the widths. Then, we have three getter methods() at the String class: getAs8(), getAs16(), and getAs32(). These methods return the string, itself, convert to a wider format, or return NULL if conversion to a narrower format fails. Again NO UTF-8, but conversion of non-BMP 32-bit characters to UTF-16 surrogates and back. This way, the String class has fast indexing and other operations while maintaining the optimum string format, and saving memory. Implementers can choose to only use a single width; if an implementer wants 32-bit characters only, so be it :) I guess that there are more methods to think of (like String::getWidth(), or String::getAsUtf8() if really needed), but let this comment serve as an initial idea.

Steven Johnson

Comment 26

•

17 years ago

Good idea. Let's make it so.

Jim Blandy :jimb

Comment 27

•

17 years ago

Three string implementations sounds like a bit of work, and it makes further optimizations more work. SpiderMonkey strings have O(1) charAt and charCodeAt, and optimized append and substring operations. We'd like to port those into TT. Could we try that first and see how folks feel about it? They won't address the non-BMP/ES4 concerns; they still hold 16-bit code units.

Edwin Smith

Reporter

Comment 28

•

17 years ago

We have pretty solid evidence that there's enough 8bit strings in the world that 8-bit optimizations are worth the pain. (both in Flash and in JS, Dave Mandelin's data, i think) It's reasonable to do 16bit only as a way of wading into the shallow end of TT, get a reviewable patch up, and some perf numbers. But for something to land in TT, I think the minimum 8/16. Then we could add 32bit in a second phase. waddya think

David Mandelin [:dmandelin]

Comment 29

•

17 years ago

I should clarify that I never did any measurements on JS strings, just on Mozilla browser strings (although some of these can be shared with Spidermonkey). I found a memory savings of 1% for UTF-8 over UTF-16, so at least on the Mozilla side, we concluded the memory savings weren't worth any significant trouble. (This is true for even CJK, see http://blog.mozilla.com/dmandelin/2008/02/14/wtf-16/).

Michael Daumling

Assignee

Comment 30

•

17 years ago

I saw that TT strings have been assigned to Moh and Jim; nevertheless, I'd like to volunteer for the implementation if Moh and Jim agree.

Michael Daumling

Assignee

Comment 31

•

17 years ago

OK, I'll assign the bug to myself. There will be a Wiki page at http://wiki.mozilla.org/Tamarin:Plans#Shared_work_items.

Status: NEW → ASSIGNED

Benjamin Smedberg

Comment 32

•

17 years ago

Michael, please talk to Jim Blandy before proceeding. We would like to integrate the string optimizations from Spidermonkey as part of this, to facilitate actionmonkey.

Edwin Smith

Reporter

Comment 33

•

17 years ago

see bug 427959 for optimizing concat by appending to a buffer whose refcount == 1

Edwin Smith

Reporter

Updated

•

17 years ago

Assignee: nobody → daumling

Status: ASSIGNED → NEW

Michael Daumling

Assignee

Comment 34

•

17 years ago

Attached patch Patch implements new fixed-width strings (obsolete) — Details — Splinter Review

Who will review this patch?

Michael Daumling

Assignee

Comment 35

•

17 years ago

Attached file Excel spreadsheet for comparing SunSpider tests (obsolete) — Details

Simply copy the numbers (not the entire lines) from a Windowsa console into the spreadsheet column.

Michael Daumling

Assignee

Comment 36

•

17 years ago

This is the first shot at new strings. The attached diff adds seven files to Win VC 2005 and Mac XCode projects: StringObjectFW.cpp StringObjectFW.h StringClassFW.cpp StringUtils.cpp StringUtils.h StringRange.cpp StringRange.h There is a new, temporary #define TT_NEW_STRINGS in avmbuild.h. If defined, TT is compiled using the new strings; if undefined, the old strings are restored. This makes it possible to use the old strings on platforms where the new strings have not been tested yet. The new strings pass the acceptance tests on Win VC 2005 and Mac. On Win, all SunSpider tests run three times faster. The very slow string-validate-input tests run 15-17 times faster. Even with these tests excluded, SunSpider runs up to 20% faster. I would need some help to verify the implementation on platforms and compilers other than VC 2005 and Mac XCode. SpiderMonkey finalizers are planned, but not implemented yet. Also, I have not done much optimization yet, so there is still room for more. RegExps will run slower, because strings have to be translated back to UTF-8, so the next big project would be to change pcre to use the String::charAt() API. I've attached SunSpider.xls, which compares old and new performance results. On Windows, just copy and paste the number columns from the Console into the spreadsheet.

Status: NEW → ASSIGNED

Steven Johnson

Comment 37

•

17 years ago

I'll be happy to review it.

Steven Johnson

Updated

•

17 years ago

Attachment #325584 - Flags: review?(stejohns)

Steven Johnson

Comment 38

•

17 years ago

Nice work! comments: (note that none of these would prevent a submit, except perhaps _narrow_32_16) in String::createDirect, there's the comment "No kZero here - we do not store pointers, just data", but you still allocate with kContainsPointers. is this right? it looks there may be a merge glitch in _narrow32_16: // do you really want a double while? while (len-- > 0) while (len-- > 0) { would it make more sense to have String::Width declared as enum Width { kAuto = 0, // only used in APIs k8 = 1, k16 = 2, k32 = 4, }; so that _copyBuffers (and others) could avoid the "+1" magic on dstWidth when calculating memcpy ranges? I thought it was just a bug at first. If we're going to rely on enum values, why not just have them be the ones we need? can String_private__charAt make use of core->cachedChars (the way that String_private_c2s does)? String::substr has a comment that I suspect is stale: start = (int32_t)_clampIndexInt (start, m_length); // ClampIndex takes a double (not int32_t or uint) for first parm... (note that start and m_length are both ints) The String finalizer stuff might be problematic on some mobile platforms (Symbian?) that restrict writable global data. Can the finalizer table be const? Do we even need user-defined finalizers? (I'm trying to think of a use case) Also it looks like String_dynamicFinalizer, String_concatFinalizer, and String_dependentFinalizer are all the same... and also redundant to code in String::finalize itself. What am I missing here? String::append is declared in the .h file but is entirely commented out in the .cpp file. There are some TODO comments that I'd like a little more info on... ie are they things that definitely need more work? AvmCore::writeUnichar // TODO: change to something else than UniChar? PoolObject::_getPoolString // TODO: make thread safe String::concatStrings // TODO: make thread safe String_private_c2s // TODO: add 8-bit strings if desired String::Finalizer finalizers // TODO: Move the table to global data section. tiny quibble: there are a few naked "int" and such sprinkled in the code; we're trying to move to the C99 types (int32_t) for new code. ditto for "signed char" vs "int8_t". tiny quibble: there are various chunks of commented-out code (eg initAvm) that look like they are not likely to ever be resurrected; we generally prefer to avoid checking in commented-out code unless it's being preserved as a negative example. (ditto the various chunks marked TO BE REMOVED) did you mean to turn on AVMPLUS_VERBOSE for all builds? (I presume this was an accidental inclusion)

Brendan Eich [:brendan]

Comment 39

•

17 years ago

User-defined finalizers are a SpiderMonkey API feature. See http://developer.mozilla.org/en/docs/JS_AddExternalStringFinalizer /be

Michael Daumling

Assignee

Comment 40

•

17 years ago

>in String::createDirect, there's the comment "No kZero here - we do not store pointers, just data", but you still allocate with kContainsPointers. is this right? I am getting an assert inside MMgc if I don't include that bit. No idea what is going on here - maybe soneome can comment? > it looks there may be a merge glitch in _narrow32_16: Oh yes - will be fixed. > would it make more sense to have String::Width declared as... I'll check - not a bad idea at all :) > can String_private__charAt make use of core->cachedChars (the way that String_private_c2s does)? StringUtils::create() takes care of this. > String::substr has a comment that I suspect is stale:... I've copied the code from the old stirng implementation. I want to revisit that code once the patch has been landed, and make sure that the code conforms to ECMA-262. > The String finalizer stuff might be problematic on some mobile platforms (Symbian?) that restrict writable global data. Can the finalizer table be const? Do we even need user-defined finalizers? (I'm trying to think of a use case) Brendan already pointed out that finalizers are a SM feature. > Also it looks like String_dynamicFinalizer, String_concatFinalizer, and String_dependentFinalizer are all the same... and also redundant to code in String::finalize itself. What am I missing here? True - these finalizers are left-overs from an earlier step. I can remove these. A better place to store finalizers would be in the AvmCore structure. I can move the data there if you like. > String::append is declared in the .h file but is entirely commented out in the .cpp file. I want to resurrect that code once I start optimizing the TT code base. append() can be handy in cases where string constants are appended during the preparation of messages. This will save some memory footprint as well as memory usage. > There are some TODO comments that I'd like a little more info on... ie are they things that definitely need more work? Yes - I have marked all places that need slocks for later multithreading support. > tiny quibble: there are a few naked "int" and such sprinkled in the code; we're trying to move to the C99 types (int32_t) for new code. ditto for "signed char" vs "int8_t". Will fix. > tiny quibble: there are various chunks of commented-out code (eg initAvm) that look like they are not likely to ever be resurrected; we generally prefer to avoid checking in commented-out code unless it's being preserved as a negative example. (ditto the various chunks marked TO BE REMOVED) TO BE REMOVED is stuff that I want to remove from the code base once new strings are integrated. I can remove the remaining stuff for now. > did you mean to turn on AVMPLUS_VERBOSE for all builds? (I presume this was an accidental inclusion) Probably - I'll check. I'll submit an updated patch tomorrow.

Tommy Reilly

Comment 41

•

17 years ago

Do we need our String class to support spidermonkey's "external" feature or can we have host's use MMgc for string memory instead? If we really need it could be implemented outside TT using a presweep hook and a host managed table of external strings.

Steven Johnson

Comment 42

•

17 years ago

Okey doke, I'll hold off doing test builds and such till I get the updated patch -- you already have it working on VC2005 and XCode, so other targets should only be trivial issues for me to deal with. Re: custom finalizers, I didn't realize that SM offered this... but that brings up the question of what SM clients use them for. Is it a requirement for TT to support them if it wants to be used in place of SM? Or is it a legacy API that could be deprecated? (I'm not morally opposed to them, just want to ensure they are a definite requirement for SM compatibility) If they are necessary then moving them to AvmCore would probably be a better solution. Re: the MMgc assert, I dunno -- Tommy is the go-to guy on MMgc mysteries :-)

Benjamin Smedberg

Comment 43

•

17 years ago

Let's not worry about spidermonkey strings yet... we'd definitely like to avoid the finalizer penalty in the common case.

Edwin Smith

Reporter

Comment 44

•

17 years ago

(In reply to comment #41) > Do we need our String class to support spidermonkey's "external" feature or can > we have host's use MMgc for string memory instead? If we really need it > could be implemented outside TT using a presweep hook and a host managed table > of external strings. > Someone weigh in if they feel differently. I think this string impl can require mmgc and even depend on features/behavior of mmgc. frankly i see strings (and arrays and hashtables, etc) as *features* of the memory manager, any opportunities of optimizing them together should be fully exploited.

Michael Daumling

Assignee

Comment 45

•

17 years ago

Attached patch Updated patch following Steven's comments (obsolete) — Details — Splinter Review

This patch implements all of Steven's comments - a second pair of eyes is priceless! I have commented out the code that handles string finalizers for now (using TTSTR_FINALIZERS) until an agreement has reached about whether this should be handled inside the String class or not. I believe that it should - how can we call an external finalizer function inside String::finalizer() in any other way? We still need to agree upon how to pass in the JSContext for SM, but we can leave that discussion to a later time.

Attachment #325584 - Attachment is obsolete: true

Attachment #325584 - Flags: review?(stejohns)

Steven Johnson

Comment 46

•

17 years ago

Excellent! I'll re-review all of this -- if I review+ it do you want me to push the patch as well?

Michael Daumling

Assignee

Comment 47

•

17 years ago

Please do so - I doubt that I have write access, and I love to blame any problems to someone else anyway :))

Steven Johnson

Comment 48

•

17 years ago

Looks like only a few trivial fixes are necessary. Also, I had to put back in the GCContainsPointers bit since I'm still getting the assert -- we should ask Tommy about it. I also enabled the use of memcmp for string-compare (we should have been doing that in the old code too -- didn't know we weren't)

Steven Johnson

Comment 49

•

17 years ago

pushed as changeset: 417:275e2895d4c3 leaving open for now -- not sure if Michael has additional work he wants to do.

Michael Daumling

Assignee

Comment 50

•

17 years ago

I'll use this bug to push optimizations not related to pcre. Also, I'd like to remove the #define TT_NEW_STRINGS as soon as the new strings run stable on all platforms. Currently, I am spreading this #define all over the code, and that is ugly.

Tommy Reilly

Comment 51

•

17 years ago

Right now you have to use kContainsPointers if you want to participate in DRC. I'll file a bug.

Michael Daumling

Assignee

Comment 52

•

17 years ago

Attached file Extension to VC++ autoexp.dat to display Strings (obsolete) — Details

Add this text to your autoexp.dat to have a visualizer for new Strings. Very handy if you use VC 2005 (it will probably also work with VC 2008). Instructions inside. Also contains ref count visualizers for RCObject and AvmPlusScriptableObject.

Steven Johnson

Comment 53

•

17 years ago

re: remove the #define TT_NEW_STRINGS agreed. they seem to be running just fine. let's give it a day or three for issues to shake out. btw, I found a potential bug in _getPoolString; if you ask for an interned string but the string in the table is non-interned, you get the non-interned string. A fix is included in my suggested patch for https://bugzilla.mozilla.org/show_bug.cgi?id=440242

Michael Daumling

Assignee

Comment 54

•

17 years ago

I'll fix that. I have another one: String::finalize() just nulls the pointers of dependent strings instead of doing WBRC() calls with NULL; this may lead to MMgc not being able to free dependent strings. I'l attach a patch tomorrow.

Michael Daumling

Assignee

Comment 55

•

17 years ago

Attached patch Updated patch (obsolete) — Details — Splinter Review

Lots of fixes: - Rewrote concat() - made sure that WBRC(NULL) is called to release pointers - substr() makes sure that no kDependent hierarchies are built - String::tryDelete() delet4es this if RefCount() == 0, and StringDataUTF8 and others make use of that feature to offload GC - faster UTF/string index conversion routines as part of StringDataUTF8 - changed fields in m_bitsAndFlags - made Pointers union public, and added fast nextch() routine Passes all acceptance tests (well, 6 errors, but the same ones as without the new strings: typeofNaN != Number, and class of new Function() is null instead of Function)

Attachment #325585 - Attachment is obsolete: true

Attachment #325612 - Attachment is obsolete: true

Attachment #326293 - Flags: review?(stejohns)

Michael Daumling

Assignee

Comment 56

•

17 years ago

Attached file Updated VC 2005 visualizer to reflect the new field structure (obsolete) — Details

The m_bitsAndFlags field has changed - this updated visualizer adapts the new structure. String display is e.g.: [8/direct] "abcdefgh" - 8 bits, direct buffer {8/static} "abcdefgh" - 8 bits, static constant, interned

Attachment #326294 - Flags: review?(stejohns)

Steven Johnson

Updated

•

17 years ago

Attachment #326294 - Flags: review?(stejohns) → review+

Steven Johnson

Comment 57

•

17 years ago

Comment on attachment 326293 [details] [diff] [review] Updated patch Nice. I'll get it pushed later today. Any significant impact on performance, positive or negative?

Attachment #326293 - Flags: review?(stejohns) → review+

Steven Johnson

Comment 58

•

17 years ago

FYI, it looks like the patch you attached has a few files with CRLF and some with LF-only, causing patch to barf on non-Windows systems. I can work around it but probably good to normalize line endings before diffing.

Michael Daumling

Assignee

Comment 59

•

17 years ago

Sorry 'bout that - I was not sure if that was a problem. Performance gains are hard to tell - the SunSpider numbers are so small that most of the tests have substantial differences. The resolution of the timer is too low. Some tests seem to run faster, though. Repeated substrs should be faster now, as will be concats. But there is still room for improvement! Especially RegExps are still slow. Would it be possible to increase the number of iterations in the SunSpider tests?

Steven Johnson

Comment 60

•

17 years ago

Yeah, unfortunately HG seems to be a bit daft about dealing with line-ending differences. One would think that such problems would be easy to solve, but alas, apparently not. (Rumor has it there are ways to configure HG to improve this, but why it can't just be set to ignore line-endings by default when diffing common file types, I dunno...) Re: increasing number of iterations, yeah, that might well be useful. Brent Baker and Dan Schaeffer do most of the maintenance of those tests, cc'ing them.

Dirkjan Ochtman (:djc)

Comment 61

•

17 years ago

hg diff has a few options, one of which, IIRC, deals with ignoring white space things. Maybe you can find what you need there?

Steven Johnson

Comment 62

•

17 years ago

AFAIK that option affects spaces/tabs but not line endings. The only builtin solution I'm aware of is http://www.selenic.com/mercurial/wiki/index.cgi/EncodeDecodeFilter which strikes me as unnecessarily complex; IMHO being able to ignore line endings in diffs (and normalize on commits without requiring external utilities) should be built-in.

Brendan Eich [:brendan]

Comment 63

•

17 years ago

Line endings should be canonicalized to \n in the repo. This is not just a diff or display issue. Graydon testifies that it's surprisingly hard for VCS to guess what the line ending should be, just by looking at bytes in the file. /be

Michael Daumling

Assignee

Comment 64

•

17 years ago

Attached patch Patch that fixes GCC warnings and a bug (obsolete) — Details — Splinter Review

I hate to admit it - but I failed to test the latest patch on the Mac, plus a bug somehow crept back into the patch that had to be fixed. This patch should work now - and it appears to have consistent line endings (no wonder - I used a Mac the create the diff :)

Attachment #325771 - Attachment is obsolete: true

Attachment #326293 - Attachment is obsolete: true

Attachment #326352 - Flags: review?(stejohns)

Steven Johnson

Comment 65

•

17 years ago

Comment on attachment 326352 [details] [diff] [review] Patch that fixes GCC warnings and a bug pushed as changeset: 432:7a7fa16cdee6

Attachment #326352 - Flags: review?(stejohns) → review+

Moh Haghighat

Comment 66

•

17 years ago

Once the algorithmic aspects of the new code is done, please let me know so we can start looking at SSE2 opportunities. - moh

Michael Daumling

Assignee

Comment 67

•

17 years ago

Attached patch Fix doubles charAt() performance (obsolete) — Details — Splinter Review

This fix makes sure that StringUtils::create() returns cached characters for codepoints < 128, which was not the case before. The C++ wrapper for charAt() calls this method. A cosmetic fix in StringObjectFW.cpp. And the addition of CYGWIN_NT-6.0 to runtests.py for Vista support.

Attachment #326677 - Flags: review?(stejohns)

Steven Johnson

Updated

•

17 years ago

Attachment #326677 - Flags: review?(stejohns) → review+

Steven Johnson

Comment 68

•

17 years ago

pushed as changeset: 438:34e27ae30694

Michael Daumling

Assignee

Comment 69

•

17 years ago

Attached patch Recursion-free version of strings — Details — Splinter Review

- removed the kConcat and kDynamic string types with associated data structures like StringBuf - changed concatenate() to create a kDirect string with extra chars at the end if rightStr cannot be appended in-place; a #define TTSTR_EXTRA_MAX can be platform-tweaked. - Added a check to see if rightStr already exists in leftStr's extended buffer (e.g. if leftStr == "123ABC" and len == 3, and rightStr == "AB", then a kDependent string would be created without copying the data) - substr(): When this was kDependent, only start got updated, but not end when resolving the master string - removed flatten(), collectTree(), and getUsedChars() (the latter being integrated into concatenate()) - the whole thing is a little more transparent now

Attachment #326352 - Attachment is obsolete: true

Attachment #326677 - Attachment is obsolete: true

Attachment #328353 - Flags: review?(stejohns)

Michael Daumling

Assignee

Comment 70

•

17 years ago

Attached file Updated VC++ visualizer for reduced string types — Details

Needed after patch is landed

Attachment #326294 - Attachment is obsolete: true

Steven Johnson

Comment 71

•

17 years ago

Comment on attachment 328353 [details] [diff] [review] Recursion-free version of strings I love it when big chunks of code just go away... :-)

Attachment #328353 - Flags: review?(stejohns) → review+

Steven Johnson

Comment 72

•

17 years ago

pushed as changeset: 481:a9a4b1502555

Status: ASSIGNED → RESOLVED

Closed: 17 years ago

Resolution: --- → FIXED

Patch implements new fixed-width strings 17 years ago Michael Daumling 238.98 KB, patch		Details \| Diff \| Splinter Review
Excel spreadsheet for comparing SunSpider tests 17 years ago Michael Daumling 24.50 KB, application/octet-stream		Details
Updated patch following Steven's comments 17 years ago Michael Daumling 234.31 KB, patch		Details \| Diff \| Splinter Review
Extension to VC++ autoexp.dat to display Strings 17 years ago Michael Daumling 2.26 KB, text/plain		Details
Updated patch 17 years ago Michael Daumling 87.82 KB, patch	stejohns : review+	Details \| Diff \| Splinter Review
Updated VC 2005 visualizer to reflect the new field structure 17 years ago Michael Daumling 2.36 KB, text/plain	stejohns : review+	Details
Patch that fixes GCC warnings and a bug 17 years ago Michael Daumling 86.42 KB, patch	stejohns : review+	Details \| Diff \| Splinter Review
Fix doubles charAt() performance 17 years ago Michael Daumling 3.52 KB, patch	stejohns : review+	Details \| Diff \| Splinter Review
Recursion-free version of strings 17 years ago Michael Daumling 29.79 KB, patch	stejohns : review+	Details \| Diff \| Splinter Review
Updated VC++ visualizer for reduced string types 17 years ago Michael Daumling 2.01 KB, text/plain		Details