TT: strings need a faster charAt() function

RESOLVED FIXED

Status

defect
P1
normal
RESOLVED FIXED
12 years ago
11 years ago

People

(Reporter: edwsmith, Assigned: daumling)

Tracking

Details

Attachments

(2 attachments, 8 obsolete attachments)

Reporter

Description

12 years ago
many pieces of code depend on having O(1) charAt() and charCodeAt().

in sunspider: crypto-md5, crypto-sha1 use strings as lookup tables.

in esc (and probably any other parser), lex-scan.es uses charCodeAt(i++)

ideas:
  * since symbol strings aren't allocated as Strings anymore, allocate Strings as utf16 instead of utf8
  * detect constant-width utf8 strings and add 2-bit scaling factor to String class

Comment 1

12 years ago
pardon, what's a "symbol string"?
Reporter

Comment 2

12 years ago
most of the strings in the string constant pool in a typical abc are symbol names.   Only a subset of entries in the constant pool ever see the light of day in user code as a String value.

In TC, symbol tables (MultinameHashtable) are indexed by String and Namespace objects.  In TT, symbol tables (SymTable) are indexed by references pointing into the abc -- String objects aren't allocated for them.  This saves a bunch of RAM for symbol tables.  (Namespace objects still are allocated, but maybe don't need to be).

Since symbol-name strings aren't allocated as String values in TT by default there is less of a penalty for representing strings as uint16[] vs uint8[].

Comment 3

12 years ago
IMHO, we should avoid using utf16 in any new code; when utf8 isn't appropriate we should go straight to UCS4.
Steven, can you clarify comment 3?  charAt() and charCodeAt() have to behave as though the string were represented using 16-bit code units.  Why go to UCS-4?

Comment 5

12 years ago
IIRC charAt and charCodeAt have to behave on a per-code-point basis, which is > 16 bits for Unicode 4.0 or later (which is critical for proper support of various Asian languages). A Unicode codepoint encompasses ~1000000 characters which means that a 32-bit size is the most practical when working on per-character basis. UTF16 is now problematic IMHO because it is easy to make a mistake that will work OK for most Western languages but cause a problem for things beyond the Basic Multilingual Plane.

Comment 6

12 years ago
According to the ES3 spec and current JS practice, charAt operates on 16-bit code units, not unicode codepoints.

Comment 7

12 years ago
I stand corrected. 

how might this affect ES4, which is planning on moving to Unicode 4.0+ IIRC?

Comment 8

12 years ago
I don't know... this is definitely an unfortunate edge of the ES3 spec because at that point unicode was only the basic plane. Let's ask the ES4 spec-masters what the current spec says and what it should say.

Comment 9

12 years ago
it's double problematic because it implies that all index-based methods (length, index, substr... etc) have to pretend it's utf16, which is really the wrong thing going forward. I suppose we could add an entire parallel set of apis using codepoints (eg, unilength, unicharCodeAt, unicharAt.... ugh) but that would be pretty horrible.
ES4 is trying to allow full Unicode support, at the expense of interoperation:

http://wiki.ecmascript.org/doku.php?id=proposals:update_unicode

Some details remain to be worked out:

http://bugs.ecmascript.org/ticket/213 (read the latest substantive comment!)
http://bugs.ecmascript.org/ticket/148
http://bugs.ecmascript.org/ticket/37

The real issue is #213. If we wish to allow implementations to support 32-bit Unicode per spec, we cannot allow surrogate pairs to be merged.

TT could take a chance and try to track ES4 as proposed, but if we end up specifying anything else, it will have to change. In the mean time, and for a long time on the web, ES3 compatibility matters more.

See http://weblogs.mozillazine.org/roc/archives/2008/01/string_theory.html for a timely blog post from roc.

My advice is to conform to ES3 for now and optimize under the hood.

/be
(In reply to comment #0)
> in sunspider: crypto-md5, crypto-sha1 use strings as lookup tables.

Sigh. I hugely favour UTF-8 but abusing strings for random-access lookup tables makes it pretty hard to use UTF-8. The only way to optimize it would be to attach a full array of UTF-16 index -> UTF-8 index mappings to any string on which charAt was used. Maybe that wouldn't be so bad...

Comment 12

12 years ago
Sounds like a job for intelligent caching, since I'm guessing that you''ll rarely need to use charAt one more than a handful of strings at a time...

Comment 13

12 years ago
The random lookup-table case is a pretty heinous case to optimize for: if people are typically manually iterating through a string you can cache the last lookup location for O(1) access to adjacent items.
It might be simple enough to just lazily create the mapping array on the first charAt to the string and discard all the mapping arrays at each GC. Of course you'd have to invalidate the array when strings change.

> The random lookup-table case is a pretty heinous case to optimize for

Welcome to the Web! I wonder how typical SunSpider's usage is.
jorendorff mentions

> Also, in ECMAScript Edition 4, it is proposed
> that strings will be indexable, sort of like in
> Python:
>
> "hello world"[6] == "w"

That would really stuff up UTF-8 representation. It basically forces implementations to use UCS-4 if you decree that JS characters are Unicode characters, or UTF-16 if you decree that they're UTF-16 code units. (The latter case is really bad though because it would let people create invalid surrogate pairs or else give you weird error cases to check and handle.) I really strongly encourage the ES4 people to revisit that decision.
roc: indexing is no different from charAt, except for the lack of an override opportunity (String.prototype.charAt can be replaced without affecting indexing). It's in not only SpiderMonkey, but also (IIRC) Opera and Webkit. OTOH, it is not widely used on the Web.

Neither is charAt (more below), but it is used (along with charCodeAt), sometimes for want of a byte vector type in JS (which is coming, see

http://bugs.ecmascript.org/ticket/234

and feel free to comment.

Iterators are coming too, and while the exact interface evolved a fair bit from JS1.7 to cover all cases and use the type system, you can mock up a useful string iterator now:

js> String.prototype.__iterator__ = function (keys) {
    for (let i=0; i<this.length; i++)
        yield keys ? i : this[i];
}
function (keys) {
    for (let i = 0; i < this.length; i++) {
        yield keys ? i : this[i];
    }
}
js> for (i in "hello") print(i);
0
1
2
3
4
js> for each (i in "hello") print(i);
h
e
l
l
o

I'll follow up to make sure string iteration is as sweet as possible in JS2/ES4, but we can't eliminate charAt or indexing (backward compatibility). The random access benchmarks are not representative of much, though: real string processing in JS uses RegExps and indexOf more than charAt.

None of this changes the challenge to implementors: support the Web as it is, with Java-inspired charAt/charCodeAt and "UCS-2", and optimize to compete at the often somewhat to very bogus benchmarks.

/be
> roc: indexing is no different from charAt

Read-only indexing, yes, and is amenable to various optimizations that might make it tractable for non-UTF-16 representations. Writeable indexing is much worse though, I don't see any reasonable way to optimize that.
JS strings are immutable -- properties reflected as indexed single-string elements appear to be read-only. Same for JS2/ES4 slice syntax -- slices are rvalues.

/be
Reporter

Updated

12 years ago
Component: Virtual Machine → Tracing Virtual Machine
QA Contact: vm → tracing-vm

Comment 20

12 years ago
Note, a weak partial fix (for single-byte string only) is in the patch for https://bugzilla.mozilla.org/show_bug.cgi?id=416924 but we probably need a more robust solution.
See also bug 416411.

/be
Reporter

Updated

11 years ago
Priority: -- → P1
Assignee

Comment 22

11 years ago
Is UTF-8 really the way to go for TT?

Was the decision made in order to save memory? If so, do we have statistics about how much memory is saved in light of various locales? Looking at the most common character sets, it may very well be true that we have an increase in memory instead of a decrease.

For CJK languages, this is true, because all characters above 0x0FFF need three bytes, which would increase the memory needs by 50%. For Russian, the memory size should be about he same, since Cyrillic is in the 0x400 range, which needs 2 bytes. For most Western languages, there is a 50% memory savings for ASCII, but there is a 100% penalty for characters between 0x80 and 0xFF on the Latin-1 code page.

There is still the problem of non-BMP characters that need to be encoded into UTF-16 surrogate pairs (which need 4 bytes as well as their UTF-8 counterpart). It is nice to have non-BMP character as one entity, but this behavior may very well break the Web, where UTF-16 surrogate pairs are expected, as Brendan et al already pointed out. IMHO, a 32-bit implementation of strings, although desirable, is not an option for now.

So what if the String implementation would either be 8-bit or 16-bit, depending on the character set? If the character set is Latin-1 (or a UTF-8 buffer looks like it only contains characters between 0x00 and 0xFF), the string uses 8-bit characters, while it uses 16-bit characters for all other character sets. We may also want to add a String creator that simply accepts 8-bit strings as-is, leaving the decision whether the string is valid Latin-1 to the implementer.

We cannot afford to use UTF-8. Many scripts use charAt() and charCodeAt(), and there is no practical solution to make these routines as fast as if they were based on a string with fixed-width characters.
The UTF8 size results in bug 416411 may be of interest, although we were measuring the DOM, not JS. Perhaps we should get Dave Mandelin (or someone else) to instrument Spidermonkey.

Comment 24

11 years ago
> Was the decision made in order to save memory?

Yes. The gotcha here is that there are two distinct kinds of string we encounter: 

(1) those that are used by the language itself and/or the DOM; these are almost always in the ASCII range, thus UTF8 is an ideal format.

(2) those that are used for arbitrary user data; as you point out, this has different memory usage implications depending on what range of Unicode is likely to be dominant, and horrible performance characteristics for random-access.

Regarding UTF-16 and non-BMP characters, I think that using a 16-bit character size at this point would be a big problem. Web usage and/or ES4 may continue to constrain character indices to 16-bit surrogate pair indices (though I fervently hope not), but not all users of TT will (in particular, future versions of Flash will require efficient access to the entire Unicode range). 

So my proposal would be to allow the internal representation of String be (generally) a UCS4 encoding, but with strings in the ASCII range represented in a single-byte (UTF8) format for efficiency.

Assignee

Comment 25

11 years ago
So let us use fixed-width strings, but with three different widths: 8, 16, and 32.

There would be three creation methods at AvmCore: newString8(), newString16(), newString32(). An 8-bit string is NOT UTF-8, but covers the first 256 Unicode characters. Concatenating two strings, results in a new string of the bigger of the widths.

Then, we have three getter methods() at the String class: getAs8(), getAs16(), and getAs32(). These methods return the string, itself, convert to a wider format, or return NULL if conversion to a narrower format fails. Again NO UTF-8, but conversion of non-BMP 32-bit characters to UTF-16 surrogates and back.

This way, the String class has fast indexing and other operations while maintaining the optimum string format, and saving memory. Implementers can choose to only use a single width; if an implementer wants 32-bit characters only, so be it :)

I guess that there are more methods to think of (like String::getWidth(), or String::getAsUtf8() if really needed), but let this comment serve as an initial idea.

Comment 26

11 years ago
Good idea. Let's make it so.

Comment 27

11 years ago
Three string implementations sounds like a bit of work, and it makes further optimizations more work.

SpiderMonkey strings have O(1) charAt and charCodeAt, and optimized append and substring operations.  We'd like to port those into TT.  Could we try that first and see how folks feel about it?

They won't address the non-BMP/ES4 concerns; they still hold 16-bit code units.
Reporter

Comment 28

11 years ago
We have pretty solid evidence that there's enough 8bit strings in the world that 8-bit optimizations are worth the pain. (both in Flash and in JS, Dave Mandelin's data, i think)

It's reasonable to do 16bit only as a way of wading into the shallow end of TT, get a reviewable patch up, and some perf numbers.  But for something to land in TT, I think the minimum 8/16.  

Then we could add 32bit in a second phase.

waddya think
I should clarify that I never did any measurements on JS strings, just on Mozilla browser strings (although some of these can be shared with Spidermonkey). I found a memory savings of 1% for UTF-8 over UTF-16, so at least on the Mozilla side, we concluded the memory savings weren't worth any significant trouble. (This is true for even CJK, see http://blog.mozilla.com/dmandelin/2008/02/14/wtf-16/). 
Assignee

Comment 30

11 years ago
I saw that TT strings have been assigned to Moh and Jim; nevertheless, I'd like to volunteer for the implementation if Moh and Jim agree.
Assignee

Comment 31

11 years ago
OK, I'll assign the bug to myself. There will be a Wiki page at http://wiki.mozilla.org/Tamarin:Plans#Shared_work_items.
Status: NEW → ASSIGNED

Comment 32

11 years ago
Michael, please talk to Jim Blandy before proceeding. We would like to integrate the string optimizations from Spidermonkey as part of this, to facilitate actionmonkey.
Reporter

Comment 33

11 years ago
see bug 427959 for optimizing concat by appending to a buffer whose refcount == 1
Reporter

Updated

11 years ago
Assignee: nobody → daumling
Status: ASSIGNED → NEW
Assignee

Comment 34

11 years ago
Who will review this patch?
Assignee

Comment 35

11 years ago
Simply copy the numbers (not the entire lines) from a Windowsa console into the spreadsheet column.
Assignee

Comment 36

11 years ago
This is the first shot at new strings. The attached diff adds seven files to Win VC 2005 and Mac XCode projects:

StringObjectFW.cpp
StringObjectFW.h
StringClassFW.cpp
StringUtils.cpp
StringUtils.h
StringRange.cpp
StringRange.h

There is a new, temporary #define TT_NEW_STRINGS in avmbuild.h. If defined, TT is compiled using the new strings; if undefined, the old strings are restored. This makes it possible to use the old strings on platforms where the new strings have not been tested yet.

The new strings pass the acceptance tests on Win VC 2005 and Mac. On Win, all SunSpider tests run three times faster. The very slow string-validate-input tests run 15-17 times faster. Even with these tests excluded, SunSpider runs up to 20% faster.

I would need some help to verify the implementation on platforms and compilers other than VC 2005 and Mac XCode.

SpiderMonkey finalizers are planned, but not implemented yet. Also, I have not done much optimization yet, so there is still room for more. RegExps will run slower, because strings have to be translated back to UTF-8, so the next big project would be to change pcre to use the String::charAt() API.

I've attached SunSpider.xls, which compares old and new performance results. On Windows, just copy and paste the number columns from the Console into the spreadsheet.
Status: NEW → ASSIGNED

Comment 37

11 years ago
I'll be happy to review it.

Updated

11 years ago
Attachment #325584 - Flags: review?(stejohns)

Comment 38

11 years ago
Nice work!

comments: (note that none of these would prevent a submit, except perhaps _narrow_32_16)

in String::createDirect, there's the comment "No kZero here - we do not store pointers, just data", but you still allocate with kContainsPointers. is this right?

it looks there may be a merge glitch in _narrow32_16:
 
		// do you really want a double while? 
		while (len-- > 0)
		while (len-- > 0)
		{
	

would it make more sense to have String::Width declared as
		enum Width
		{
			kAuto	= 0,	// only used in APIs
			k8	= 1,
			k16	= 2,
			k32	= 4,
		};
so that _copyBuffers (and others) could avoid the "+1" magic on dstWidth when calculating memcpy ranges? I thought it was just a bug at first. If we're going to rely on enum values, why not just have them be the ones we need?

can String_private__charAt make use of core->cachedChars (the way that String_private_c2s does)?

String::substr has a comment that I suspect is stale:
	start = (int32_t)_clampIndexInt (start, m_length); 
	// ClampIndex takes a double (not int32_t or uint) for first parm...
(note that start and m_length are both ints)

The String finalizer stuff might be problematic on some mobile platforms (Symbian?) that restrict writable global data. Can the finalizer table be const? Do we even need user-defined finalizers? (I'm trying to think of a use case)

Also it looks like String_dynamicFinalizer, String_concatFinalizer, and String_dependentFinalizer are all the same... and also redundant to code in String::finalize itself. What am I missing here?

String::append is declared in the .h file but is entirely commented out in the .cpp file.

There are some TODO comments that I'd like a little more info on... ie are they things that definitely need more work?

AvmCore::writeUnichar // TODO: change to something else than UniChar?
PoolObject::_getPoolString // TODO: make thread safe
String::concatStrings // TODO: make thread safe
String_private_c2s // TODO: add 8-bit strings if desired
String::Finalizer finalizers // TODO: Move the table to global data section.

tiny quibble: there are a few naked "int" and such sprinkled in the code; we're trying to move to the C99 types (int32_t) for new code. ditto for "signed char" vs "int8_t".

tiny quibble: there are various chunks of commented-out code (eg initAvm) that look like they are not likely to ever be resurrected; we generally prefer to avoid checking in commented-out code unless it's being preserved as a negative example. (ditto the various chunks marked TO BE REMOVED)

did you mean to turn on AVMPLUS_VERBOSE for all builds? (I presume this was an accidental inclusion)
User-defined finalizers are a SpiderMonkey API feature. See

http://developer.mozilla.org/en/docs/JS_AddExternalStringFinalizer

/be
Assignee

Comment 40

11 years ago
>in String::createDirect, there's the comment "No kZero here - we do not store
pointers, just data", but you still allocate with kContainsPointers. is this
right?

I am getting an assert inside MMgc if I don't include that bit. No idea what is going on here - maybe soneome can comment?

> it looks there may be a merge glitch in _narrow32_16:

Oh yes - will be fixed.

> would it make more sense to have String::Width declared as...

I'll check - not a bad idea at all :)

> can String_private__charAt make use of core->cachedChars (the way that
String_private_c2s does)?

StringUtils::create() takes care of this.

> String::substr has a comment that I suspect is stale:...

I've copied the code from the old stirng implementation. I want to revisit that code once the patch has been landed, and make sure that the code conforms to ECMA-262.

> The String finalizer stuff might be problematic on some mobile platforms
(Symbian?) that restrict writable global data. Can the finalizer table be
const? Do we even need user-defined finalizers? (I'm trying to think of a use
case)

Brendan already pointed out that finalizers are a SM feature.

> Also it looks like String_dynamicFinalizer, String_concatFinalizer, and
String_dependentFinalizer are all the same... and also redundant to code in
String::finalize itself. What am I missing here?

True - these finalizers are left-overs from an earlier step. I can remove these. A better place to store finalizers would be in the AvmCore structure. I can move the data there if you like.

> String::append is declared in the .h file but is entirely commented out in the
.cpp file.

I want to resurrect that code once I start optimizing the TT code base. append() can be handy in cases where string constants are appended during the preparation of messages. This will save some memory footprint as well as memory usage.
 
> There are some TODO comments that I'd like a little more info on... ie are they things that definitely need more work?

Yes - I have marked all places that need slocks for later multithreading support.

> tiny quibble: there are a few naked "int" and such sprinkled in the code; we're trying to move to the C99 types (int32_t) for new code. ditto for "signed char" vs "int8_t".

Will fix.

> tiny quibble: there are various chunks of commented-out code (eg initAvm) that
look like they are not likely to ever be resurrected; we generally prefer to
avoid checking in commented-out code unless it's being preserved as a negative
example. (ditto the various chunks marked TO BE REMOVED)

TO BE REMOVED is stuff that I want to remove from the code base once new strings are integrated. I can remove the remaining stuff for now.

> did you mean to turn on AVMPLUS_VERBOSE for all builds? (I presume this was an
accidental inclusion)

Probably - I'll check.

I'll submit an updated patch tomorrow.

Comment 41

11 years ago
Do we need our String class to support spidermonkey's "external" feature or can
we have host's use MMgc for string memory instead?    If we really need it
could be implemented outside TT using a presweep hook and a host managed table
of external strings.

Comment 42

11 years ago
Okey doke, I'll hold off doing test builds and such till I get the updated patch -- you already have it working on VC2005 and XCode, so other targets should only be trivial issues for me to deal with.

Re: custom finalizers, I didn't realize that SM offered this... but that brings up the question of what SM clients use them for. Is it a requirement for TT to support them if it wants to be used in place of SM? Or is it a legacy API that could be deprecated? (I'm not morally opposed to them, just want to ensure they are a definite requirement for SM compatibility) If they are necessary then moving them to AvmCore would probably be a better solution.

Re: the MMgc assert, I dunno -- Tommy is the go-to guy on MMgc mysteries :-)

Comment 43

11 years ago
Let's not worry about spidermonkey strings yet... we'd definitely like to avoid the finalizer penalty in the common case.
Reporter

Comment 44

11 years ago
(In reply to comment #41)
> Do we need our String class to support spidermonkey's "external" feature or can
> we have host's use MMgc for string memory instead?    If we really need it
> could be implemented outside TT using a presweep hook and a host managed table
> of external strings.
> 

Someone weigh in if they feel differently.  I think this string impl can require mmgc and even depend on features/behavior of mmgc.  frankly i see strings (and arrays and hashtables, etc) as *features* of the memory manager, any opportunities of optimizing them together should be fully exploited.
Assignee

Comment 45

11 years ago
This patch implements all of Steven's comments - a second pair of eyes is priceless!

I have commented out the code that handles string finalizers for now (using TTSTR_FINALIZERS) until an agreement has reached about whether this should be handled inside the String class or not. I believe that it should - how can we call an external finalizer function inside String::finalizer() in any other way? We still need to agree upon how to pass in the JSContext for SM, but we can leave that discussion to a later time.
Attachment #325584 - Attachment is obsolete: true
Attachment #325584 - Flags: review?(stejohns)

Comment 46

11 years ago
Excellent! I'll re-review all of this -- if I review+ it do you want me to push the patch as well?
Assignee

Comment 47

11 years ago
Please do so - I doubt that I have write access, and I love to blame any problems to someone else anyway :))

Comment 48

11 years ago
Looks like only a few trivial fixes are necessary.

Also, I had to put back in the GCContainsPointers bit since I'm still getting the assert -- we should ask Tommy about it.

I also enabled the use of memcmp for string-compare (we should have been doing that in the old code too -- didn't know we weren't)

Comment 49

11 years ago
pushed as changeset:   417:275e2895d4c3

leaving open for now -- not sure if Michael has additional work he wants to do.
Assignee

Comment 50

11 years ago
I'll use this bug to push optimizations not related to pcre.

Also, I'd like to remove the #define TT_NEW_STRINGS as soon as the new strings run stable on all platforms. Currently, I am spreading this #define all over the code, and that is ugly.

Comment 51

11 years ago
Right now you have to use kContainsPointers if you want to participate in DRC.  I'll file a bug.
Assignee

Comment 52

11 years ago
Add this text to your autoexp.dat to have a visualizer for new Strings. Very handy if you use VC 2005 (it will probably also work with VC 2008). Instructions inside. Also contains ref count visualizers for RCObject and AvmPlusScriptableObject.

Comment 53

11 years ago
re: remove the #define TT_NEW_STRINGS

agreed. they seem to be running just fine. let's give it a day or three for issues to shake out. 

btw, I found a potential bug in _getPoolString; if you ask for an interned string but the string in the table is non-interned, you get the non-interned string. A fix is included in my suggested patch for https://bugzilla.mozilla.org/show_bug.cgi?id=440242
Assignee

Comment 54

11 years ago
I'll fix that. I have another one: String::finalize() just nulls the pointers of dependent strings instead of doing WBRC() calls with NULL; this may lead to MMgc not being able to free dependent strings. I'l attach a patch tomorrow.
Assignee

Comment 55

11 years ago
Posted patch Updated patch (obsolete) — Splinter Review
Lots of fixes:

- Rewrote concat()
- made sure that WBRC(NULL) is called to release pointers
- substr() makes sure that no kDependent hierarchies are built
- String::tryDelete() delet4es this if RefCount() == 0, and StringDataUTF8 and others make use of that feature to offload GC
- faster UTF/string index conversion routines as part of StringDataUTF8
- changed fields in m_bitsAndFlags
- made Pointers union public, and added fast nextch() routine

Passes all acceptance tests (well, 6 errors, but the same ones as without the new strings: typeofNaN != Number, and class of new Function() is null instead of Function)
Attachment #325585 - Attachment is obsolete: true
Attachment #325612 - Attachment is obsolete: true
Attachment #326293 - Flags: review?(stejohns)
Assignee

Comment 56

11 years ago
The m_bitsAndFlags field has changed - this updated visualizer adapts the new structure.

String display is e.g.:

[8/direct] "abcdefgh" - 8 bits, direct buffer
{8/static} "abcdefgh" - 8 bits, static constant, interned
Attachment #326294 - Flags: review?(stejohns)

Updated

11 years ago
Attachment #326294 - Flags: review?(stejohns) → review+

Comment 57

11 years ago
Comment on attachment 326293 [details] [diff] [review]
Updated patch

Nice. I'll get it pushed later today.

Any significant impact on performance, positive or negative?
Attachment #326293 - Flags: review?(stejohns) → review+

Comment 58

11 years ago
FYI, it looks like the patch you attached has a few files with CRLF and some with LF-only, causing patch to barf on non-Windows systems. I can work around it but probably good to normalize line endings before diffing.
Assignee

Comment 59

11 years ago
Sorry 'bout that - I was not sure if that was a problem.

Performance gains are hard to tell - the SunSpider numbers are so small that most of the tests have substantial differences. The resolution of the timer is too low. Some tests seem to run faster, though. Repeated substrs should be faster now, as will be concats. But there is still room for improvement! Especially RegExps are still slow.

Would it be possible to increase the number of iterations in the SunSpider tests?

Comment 60

11 years ago
Yeah, unfortunately HG seems to be a bit daft about dealing with line-ending differences. One would think that such problems would be easy to solve, but alas, apparently not. (Rumor has it there are ways to configure HG to improve this, but why it can't just be set to ignore line-endings by default when diffing common file types, I dunno...)

Re: increasing number of iterations, yeah, that might well be useful. Brent Baker and Dan Schaeffer do most of the maintenance of those tests, cc'ing them.
hg diff has a few options, one of which, IIRC, deals with ignoring white space things. Maybe you can find what you need there?

Comment 62

11 years ago
AFAIK that option affects spaces/tabs but not line endings.

The only builtin solution I'm aware of is http://www.selenic.com/mercurial/wiki/index.cgi/EncodeDecodeFilter which strikes me as unnecessarily complex; IMHO being able to ignore line endings in diffs (and normalize on commits without requiring external utilities) should be built-in. 
Line endings should be canonicalized to \n in the repo. This is not just a diff or display issue. Graydon testifies that it's surprisingly hard for VCS to guess what the line ending should be, just by looking at bytes in the file.

/be
Assignee

Comment 64

11 years ago
I hate to admit it - but I failed to test the latest patch on the Mac, plus a bug somehow crept back into the patch that had to be fixed.

This patch should work now - and it appears to have consistent line endings (no wonder - I used a Mac the create the diff :)
Attachment #325771 - Attachment is obsolete: true
Attachment #326293 - Attachment is obsolete: true
Attachment #326352 - Flags: review?(stejohns)

Comment 65

11 years ago
Comment on attachment 326352 [details] [diff] [review]
Patch that fixes GCC warnings and a bug

pushed as changeset:   432:7a7fa16cdee6
Attachment #326352 - Flags: review?(stejohns) → review+

Comment 66

11 years ago
Once the algorithmic aspects of the new code is done, please let me know so we can start looking at SSE2 opportunities.

- moh 
Assignee

Comment 67

11 years ago
This fix makes sure that StringUtils::create() returns cached characters for codepoints < 128, which was not the case before. The C++ wrapper for charAt() calls this method.

A cosmetic fix in StringObjectFW.cpp.

And the addition of CYGWIN_NT-6.0 to runtests.py for Vista support.
Attachment #326677 - Flags: review?(stejohns)

Updated

11 years ago
Attachment #326677 - Flags: review?(stejohns) → review+

Comment 68

11 years ago
pushed as changeset:   438:34e27ae30694
Assignee

Comment 69

11 years ago
- removed the kConcat and kDynamic string types with associated data structures like StringBuf
- changed concatenate() to create a kDirect string with extra chars at the end if rightStr cannot be appended in-place; a #define TTSTR_EXTRA_MAX can be platform-tweaked.
- Added a check to see if rightStr already exists in leftStr's extended buffer (e.g. if leftStr == "123ABC" and len == 3, and rightStr == "AB", then a kDependent string would be created without copying the data)
- substr(): When this was kDependent, only start got updated, but not end when resolving the master string
- removed flatten(), collectTree(), and getUsedChars() (the latter being integrated into concatenate())
- the whole thing is a little more transparent now
Attachment #326352 - Attachment is obsolete: true
Attachment #326677 - Attachment is obsolete: true
Attachment #328353 - Flags: review?(stejohns)
Assignee

Comment 70

11 years ago
Needed after patch is landed
Attachment #326294 - Attachment is obsolete: true

Comment 71

11 years ago
Comment on attachment 328353 [details] [diff] [review]
Recursion-free version of strings

I love it when big chunks of code just go away... :-)
Attachment #328353 - Flags: review?(stejohns) → review+

Comment 72

11 years ago
pushed as changeset:   481:a9a4b1502555
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.