Closed Bug 851064 Opened 12 years ago Closed 11 years ago

Much slower than V8 on RiaBench run-length encoder test due to repeated charAt on a concatenation result rope

Categories

(Core :: JavaScript Engine, defect)

x86
macOS
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla24

People

(Reporter: bzbarsky, Assigned: h4writer)

References

(Depends on 1 open bug, )

Details

Attachments

(2 files, 4 obsolete files)

STEPS TO REPRODUCE:

1)  Load http://www.timo-ernst.net/misc/riabench-start/runlengthencode/javascript/
2)  Wait.
3)  Observe the number of millisecods.
4)  Repeat in Chrome.

ACTUAL RESULTS: In my case, we're about 6x slower.

EXPECTED RESULTS: Not 6x slower.  ;)

I'm working on a shell testcase.
Attached file Shell testcase
So...  if I focus on the main thread, not the GC thread:

1) We spend about 15% of the time doing type-analysis.  
2) We spend about 70% of our time doing a stubs::SlowCall to js_str_charAt, most of it
   under JSRope::flattenInternal doing memmove.  Not surprising, since the testcase does
   compression by concatenating substrings then indexes right back into the result.

Luke, I know you're probably not doing string guts anymore, but do you know who is?
Flags: needinfo?(luke)
Summary: Much slower than V8 on RiaBench run-length encoder test → Much slower than V8 on RiaBench run-length encoder test due to repeated charAt on a concatenation result rope
FWIW, with a BC build I get 280 ms, vs 350 with a m-c build. Haven't looked into it but it probably avoids (1).
That's fairly plausible, yeah.  What sort of times do you get with d8?  ;)
(In reply to Jan de Mooij [:jandem] from comment #3)
> FWIW, with a BC build I get 280 ms, vs 350 with a m-c build. Haven't looked
> into it but it probably avoids (1).

Sorry, nevermind. For the shell testcase it's only 235 vs 248. My nightly build is from two weeks ago, maybe that's it...
(In reply to Boris Zbarsky (:bz) from comment #4)
> That's fairly plausible, yeah.  What sort of times do you get with d8?  ;)

40 ms.
Eddy and Tom have both learned about how the strings work recently.  Perhaps they'd be interested in looking in to see what the usage pattern is here that is causing all the flattening?

Based on the 6x factor it would appear that v8 is avoiding flattening.  If these are ropes with few and very long segments, it would make sense for charAt to do a tree-walk instead of flattening.  This would be dangerous to do in general because many ropes have many nodes, with a few chars each, in which case traversal would be much slower.  I filed bug 615290 with a possible solution to this issue.
Flags: needinfo?(luke)
I will try to look at this after tomorrow.
The usage pattern is simple.  

		var numOfRedundances = getNumOfRedundances(i, text);
                if (numOfRedundances > 2){
			text = compress(i, numOfRedundances, text);
                }

where getNumOfRedundances(pos, text) leads off with:

	if (pos+1 < text.length){
		var charToCheck = text.charAt(pos);

and compress() does:

  compressedString = text.substr(0, startpoint) + separator + (len+1) + separator +
                     text.substr(restIndex, text.length);
  return compressedString;
Attached patch wip (obsolete) — Splinter Review
So I had the following idea, when we try to look up a char inside a rope, first try to find out on which side it is. When that side happens to be linear string just look it up and skip the flattening. This turns out to be slower for JM+ION, I assume, because we never actually ion compile and miss the inlined code for charAt in JM. So I decided to try this on the new baseline jit. And luckily enough we compile more and this is a nice win. Sadly the benchmark is slower from the start in baseline so the net result is even slower.
Hannes do you have a good idea what we can do to optimize this? I think you told me that my patch didn't actually help, even with baseline.
The problem here is how the code is written. If the creators had just used the old string to find the places and put the result into a string they are concatting there wouldn't have been a problem. Now with this benchmark they are adding strings to the input string, but everytime we concat/charat/... we have to flatten it. Now your patch allows one level deep charAt. That doesn't help a lot, since we still have to flatten (i.e. copy all into an array) when the concat happens, or the indexOf ... Or just the next iteration. I.e. at max you've removed half of the copies. But I don't think you did, since there are other functions that are still ask to flatten it.

I think we need to add support for Ropes to these function, but we need to find a way to make concat smarter. Atleast the following operation:

bfdkjqsmfjqmkmlkjqfdmljdkjsqmfjksm

    |   |
bfdkjqsmfjqmkmlkjqfdmljdkjsqmfjksm

    |   |
bfdkj   fjqmkmlkjqfdmljdkjsqmfjksm

     +ddd+
bfdkj     fjqmkmlkjqfdmljdkjsqmfjksm

bfdkjdddfjqmkmlkjqfdmljdkjsqmfjksm

Without needing to copy the whole string, but only a small part.

I.e. only using ropes, never flattening would make all depending functions slower, since they need to follow all ropes. And there are a lot.
I.e. flatten every time would make all depending functions the fastest, but every iteration we need to flatten the string.
Can you tell whether v8 supports indexOf etc on ropes directly?

> I.e. only using ropes, never flattening would make all depending functions
> slower, since they need to follow all ropes. And there are a lot.
> I.e. flatten every time would make all depending functions the fastest, but
> every iteration we need to flatten the string.

In bug 615290 I was proposing a middle-ground where we maintain in rope nodes (which have a free word) some extra data (perhaps the pair (number of child nodes, number of times accessed)) that would allow us to make a heuristic decision about whether to flatten on operations like indexOf or not.  With this, we should hopefully be able to achieve amortized O(n) copying on examples like this.
Attached patch WIP (obsolete) — Splinter Review
I just noticed we don't need an advanced algorithm for the ratio ropes string length. Actually to improve the situation it is only needed to allow max. a single ropes per string.

for (var pos = 0; pos < text.length; pos++)
   text = text.substr(0, pos) + "DATA" + text.substr(pos, 0)

This improves the above situation. Since the rope is always between 0 and pos. So this patch enables us to only flatten the part text.substr(0, pos) every time. Instead of the whole text.
We improve from 700ms to 350ms on the benchmark. (v8 is 162ms). (In other words 2x slower than v8, possible because our strings are twice as big?)

But we are still doing too much. If the part "text.substr(0, pos)" could be over-allocated. That way when flattening we only need to copy the rhs of the rope ... That would be a drastic improvement. I see some code in jsstr that is suppose to do this, but I can't get to understand how to enable it ... 

(Note: this sometime crashes in the rope logic added to MCharCodeAt by evilpie. Still trying to understand why.)
Attachment #725507 - Attachment is obsolete: true
(In reply to Hannes Verschore [:h4writer] from comment #15)
> Created attachment 754211 [details] [diff] [review]
> WIP
> 
> But we are still doing too much. If the part "text.substr(0, pos)" could be
> over-allocated. That way when flattening we only need to copy the rhs of the
> rope ... That would be a drastic improvement. I see some code in jsstr that
> is suppose to do this, but I can't get to understand how to enable it ... 

I discovered how this over-allocating / extensible optimization works.

while (...) {
   s += "a"
   s.flatten
}

This feature was optimized for one rope. Resulting in bad performance for:

while (...) {
   s += "a" + "a" // two holes, but extensible is actually possible
   s.flatten
}

So I made it possible to also use extensible strings for this. This lowers our score to 159ms (--no-ion --no-baseline). So on par with chrome.

Gonna polish the bug and try to locate and fix that bug.
Assignee: general → hv1989
Attached patch WIP 2.0 (obsolete) — Splinter Review
Got IM working and passes tests:

chrome: 180ms
before patch: 750ms
with patch: 17ms
(Timings are on testcase in this patch, but doubling the text)
Attachment #754211 - Attachment is obsolete: true
(In reply to Hannes Verschore [:h4writer] from comment #17)
> (Timings are on testcase in this patch, but doubling the text)
* (Timings are on testcase provided in this bug report, but doubling the text)
Attached patch Patch (obsolete) — Splinter Review
@Evilpies: I didn't use the codegen for multiple reasons: (1) It had a fault in it, (2) c++ is easier to maintain (3) The big slowdown isn't in the jit to c++ call. Feel free to add again as follow-up bug ;))
Attachment #754281 - Attachment is obsolete: true
Attachment #754445 - Flags: review?(evilpies)
Comment on attachment 754445 [details] [diff] [review]
Patch

Review of attachment 754445 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/ion/VMFunctions.cpp
@@ +366,5 @@
>  {
>      JS_ASSERT(index >= 0 &&
>                static_cast<uint32_t>(index) < str->length());
>  
> +    const jschar *chars;

nit hoist under the comment

@@ +379,5 @@
> +     * }
> +     */
> +    if (str->isRope()) {
> +        JSRope *rope = &str->asRope();
> +        if (uint32_t(index) < rope->leftChild()->length()) {

I think this only really helps when they left or right child is already flat? Or is that not the case? Seems like in this case leftChild would be a rope and rightChild would be a dependent string?

::: js/src/jsstr.cpp
@@ +597,5 @@
> +            begin -= rope->leftChild()->length();
> +            return js_NewDependentString(cx, str, size_t(begin), size_t(len));
> +        }
> +
> +        /* Substring goes over rope. Return a rope as substring. */

This doesn't parse very well. Something lik: Requested substring is between the two childs. Create a rope of substrings for both childs.

@@ +601,5 @@
> +        /* Substring goes over rope. Return a rope as substring. */
> +        JS_ASSERT (begin < rope->leftChild()->length() &&
> +                   begin + len > rope->leftChild()->length());
> +
> +        size_t lhs_length = rope->leftChild()->length() - begin;

I think we usually go for camelCase.

@@ +604,5 @@
> +
> +        size_t lhs_length = rope->leftChild()->length() - begin;
> +        size_t rhs_length = size_t(begin) + size_t(len) - rope->leftChild()->length();
> +
> +        RootedString lhs(cx, NULL);

Move js_NewDependentString in the constructor.

@@ +617,5 @@
> +
> +        return JSRope::new_<CanGC>(cx, lhs, rhs, len);
> +    }
> +
> +    return js_NewDependentString(cx, str, size_t(begin), size_t(len));

Unnecessary casts.

::: js/src/vm/String-inl.h
@@ +401,5 @@
>  }
>  
>  inline JSLinearString *
>  js::StaticStrings::getUnitStringForElement(JSContext *cx, JSString *str, size_t index)
>  {

We should make CharCodeAt from VMFunctions.cpp reusable. We could probably use this in str_charCodeAt then as well.

@@ +415,5 @@
> +     * }
> +     */
> +    if (str->isRope()) {
> +        JSRope *rope = &str->asRope();
> +        if (index < rope->leftChild()->length()) {

Would it make sense to check if they are linear?

::: js/src/vm/String.cpp
@@ +210,5 @@
> +     * happens before flattening.
> +     */
> +    while (true) {
> +        JSRope *rope = this;
> +        while (rope->leftChild()->isRope())

This only ever flattens the most leaf rope node, so I don't see how we hoist anything? This probably only works well with two level deep ropes.

@@ +212,5 @@
> +    while (true) {
> +        JSRope *rope = this;
> +        while (rope->leftChild()->isRope())
> +            rope = &rope->leftChild()->asRope();
> +        if (rope->leftChild()->isExtensible() && rope != this) {

This only really makes sense when we can actually fit the right child in the extensible part, right?

@@ +213,5 @@
> +        JSRope *rope = this;
> +        while (rope->leftChild()->isRope())
> +            rope = &rope->leftChild()->asRope();
> +        if (rope->leftChild()->isExtensible() && rope != this) {
> +            rope->flatten(maybecx); /* transform JSRope in JSStringExtensible */

The name is JSExtensibleString. This also drops a null check.
Btw very nice find! I hope we can do this without regressing something that depends on aggressive flattening.
I'm a little worried about the loop added to flattenInternal.  For one thing, it seems to exhibit O(n^2) behavior (where n is the left-most path length).  For another, it seems like it could, in some common circumstances involve twice the amount of copying.

Now, IIUC, all we really want to do is generalize the existing extensible optimization which tries to extend the buffer of the immediate left child to instead query the left-*most* child.  Does that sound right to you?  If so, then we can just tweak the existing extensible path to stop hard-coding "this->leftChild" and instead use a computed "leftMostChild".
Attached patch Patch v2Splinter Review
Attachment #754445 - Attachment is obsolete: true
Attachment #754445 - Flags: review?(evilpies)
Attachment #755219 - Flags: review?(evilpies)
Attachment #755219 - Flags: feedback?(luke)
Comment on attachment 755219 [details] [diff] [review]
Patch v2

Review of attachment 755219 [details] [diff] [review]:
-----------------------------------------------------------------

Nice job digging to flatten :)

::: js/src/jsstr.cpp
@@ +570,5 @@
>      return true;
>  }
>  
> +static JSString *
> +CreateDependentString(JSContext *cx, JSString *str, size_t begin, size_t len)

It would be better if, for regularity's sake, all "create dependent string" paths used the same heuristic, not just the one path you're optimizing here.  Thus, I'd suggest:
 - moving this function to be right below js_NewDependentString and making it extern js::CreateDependentString
 - change uses of js_NewDependentString to js::CreateDependentString
 - rename js_NewDependentString to be a non-extern static CreateDependentStringImpl

::: js/src/vm/String.cpp
@@ +206,5 @@
>  
> +    /* Find the left most string, containing the first string. */
> +    JSRope *leftMost = this;
> +    while (leftMost->leftChild()->isRope())
> +        leftMost = &leftMost->leftChild()->asRope();

I'd rename leftMost to leftMostRope.  Also, you might want to check the generated code to see if "leftMost->leftChild" is being GVN'ed into a register or re-loaded every time.  If it isn't, I'd add a local: JSString* leftMostRopeLeftChild = leftMostRope->leftChild();

@@ +214,3 @@
>          size_t capacity = left.capacity();
>          if (capacity >= wholeLength) {
> +            /* Prep. all ropes leading to the leftMost string. */

Instead of just saying "prepare", perhaps you could be more specific and say "Simulate a left-most traversal from the root to leftMost->leftChild() via first_visit_node".

@@ +214,4 @@
>          size_t capacity = left.capacity();
>          if (capacity >= wholeLength) {
> +            /* Prep. all ropes leading to the leftMost string. */
> +            while (str != leftMost) {

Can you add JS_ASSERT(str->isRope()); ?

@@ +221,5 @@
> +                }
> +                JSString *child = str->d.u1.left;
> +                str->d.u1.chars = left.chars();
> +                child->d.s.u3.parent = str;          /* Return to this when 'left' done, */
> +                child->d.lengthAndFlags = 0x200;     /* but goto visit_right_child. */

These comments aren't really necessary, given that we've said we're emulating first_visit_node above (also, the meaning of 'left' here is different than in first_visit_node).
Attachment #755219 - Flags: feedback?(luke) → feedback+
(In reply to Luke Wagner [:luke] from comment #24)
> ::: js/src/jsstr.cpp
> @@ +570,5 @@
> >      return true;
> >  }
> >  
> > +static JSString *
> > +CreateDependentString(JSContext *cx, JSString *str, size_t begin, size_t len)
> 
> It would be better if, for regularity's sake, all "create dependent string"
> paths used the same heuristic, not just the one path you're optimizing here.
> Thus, I'd suggest:
>  - moving this function to be right below js_NewDependentString and making
> it extern js::CreateDependentString
>  - change uses of js_NewDependentString to js::CreateDependentString
>  - rename js_NewDependentString to be a non-extern static
> CreateDependentStringImpl

I'm really hesitant for this change. 
1) CreateDependentString is only known to improvement for substrings
2) I can image places where we want the old behaviour instead of this
3) Shouldn't a non static functions names not start with lower case
4) Can we just remove the function js_NewDependentString??

So I have a counterproposal:

1) s/CreateDependentString/DoSubstr/ and only use that function for str_substr and str_substring

 OR

2) If you still think we should have this behaviour for everywhere we currently call js_NewDependentString

- JSLinearString *js_NewDependentString => JSLinearString *js_NewDependentStringImpl
- JSString *CreateDependentString => JSString *js_NewDependentString
- Test this very very very carefull to not regress.
Re (1) and (2): well, it's only *known* to improve substrings in the one particular workload you are looking at ;).  If there is a general problem, I'd think it show up with substring just as well as any other use (although maybe there is a semantic aspect of substring I'm missing?).

Re (3), I wasn't proposing anything lowercase so "yes"; js_Blah is the old pre-C++ naming convention for extern-but-not-JS_PUBLIC_API functions and we should rename them all to js::Blah.

Re (4), since you call it from several places inside CreateDependentString, I thought you'd still want to keep it around (as a static function in jsstr.cpp).

So, looking at the rope handling in CreateDependentString again, I'm a bit worried that the "overlaps left and right" case could be worse in some cases (perhaps you were too).  Is this case necessary to optimize the workload you're looking at?  If not, I think it'd be fine to just have this case fall back to flattening.

Another thought: could we generalize your "fully in left" and "fully in right" cases so that they iterate down the left and right branches until they find the smallest enclosing string?  Given that we're about to flatten, this seems to strictly win.
> the "overlaps left and right" case could be worse in some cases

That is indeed my worry. During rope creation we are flattening lhs and rhs of the rope. If we flatten the constructed rope directly afterwards, we will be slower. Since rhs gets copied twice.

> Is this case necessary to optimize the workload you're looking at?

Yes this case is needed for the performance improvement. This creates the separation point between the old string (that we don't want to take in the flattening every time) and the new parts (that needs to get flatten every time) 

> Another thought: could we generalize your "fully in left" and "fully in right" 

That is possible, but I'm not sure if that is a good idea. The flattening is done to make sure all operations afterwards are possible in constant time. If we keep all ropes on the original string, every str.substr(i, 1) will take possible O(n) time (n being the depth of the rope. And ropes are not balanced, so in worst case the length of the string). Instead of only the first flattening operation taking O(n) and all str.substr(i,1) being constant...
(In reply to Hannes Verschore [:h4writer] from comment #27)
Ah, good point about the potentially bad case of repeated O(n) traversal.  This is a general problem with anything we consider doing on ropes and I think we can fix it.  This is the subject of bug 615290, but, roughly, we have a free word in ropes that we could use to store the number of child nodes and/or a number-of-times-we've-traversed-this-rope.  Using these, we should be able to optimize the cases where we win big while avoiding the pathological O(n^2) traversals.

> That is indeed my worry. During rope creation we are flattening lhs and rhs
> of the rope. If we flatten the constructed rope directly afterwards, we will
> be slower. Since rhs gets copied twice.
> 
> > Is this case necessary to optimize the workload you're looking at?
> 
> Yes this case is needed for the performance improvement.

Can you do some measurements of this, to see what the worst case is?  Seems like we need to consider this case even if it's only used for substr().

There is another long-standing idea to allow a new type of non-linear string that was like a dependent string except that the base could be a rope.  When flattening such a "dependent rope", the flattening algo would only flatten the portion of the rope that was dependend upon (in essence, achieve the same effect as you have here without needing to create the 2 intermediary dependent strings).  I haven't thought this through, though, so maybe there be dragons ;)  However, this has come up on several occasions so I think it'd be generally helpful; I thought I filed a bug but I've lost it.
About bug 615290. That is indeed the "pro" way to solve it. But is also the most involved fix. After finding some good trade-offs, we have to support it everywhere. Also in the jit functions adding a lot of complexity. And I don't have a real overview on how big the gains would be. Looks like a project for somebody that can focus on it for some decent time ... 
=> I think allowing max. 1 level deep ropes, makes sure we don't have our potential O(n) performance problems related with ropes, and still have the gains for the case mentioned here.

Worst case:

for (var j = 0; j < 20000; j++) {
  var b = ""
  for (var i = 0; i < 1000; i++) {
    b += "t";
  }
  b.substr(0,b.length-1).toLowerCase()
}

I notice the performance drop from 300ms to 330ms.
As an extra note I want to add that it is really easy to mitigate the performance problem if we encounter it. We only need to know which function is causing us to flatten the result of a substring and need to add support for 1-level deep rope in that function. (In this case toLowerCase())

About the dependent rope idea. Does that mean:

rope = JSRope (JSLinearString("test"), JSLinearString("test"))
dependent = JSDependentRope (rope, 0, 6)

dependent.flatten() would do:
(1)
rope = JSRope (JSLinearString("testte"), JSLinearString("st"))
dependent = JSDependent (rope.leftChild(), 0, 6)

or would it do:
(2)
rope = JSLinearString("testtest")
dependent = JSDependent (rope, 0, 6)

(2) wouldn't solve anything at all
(1) could solve the issue, but we need to be aware that this can also introduce performance issues. What if after flattening the dependent rope, the rope itself needs to get flatten. Then we will have to copy the contents of the dependent rope twice.
Comment on attachment 755219 [details] [diff] [review]
Patch v2

Review of attachment 755219 [details] [diff] [review]:
-----------------------------------------------------------------

r+, but please address Luke's commments. Still misses tests for substring with strings depending on the left, right and both sides of the rope. Also test for the the charCodeAt optimization with ropes.

::: js/src/ion/VMFunctions.cpp
@@ +364,5 @@
>  }
>  
>  bool
>  CharCodeAt(JSContext *cx, HandleString str, int32_t index, uint32_t *code)
>  {

In theory you could make code uint8_t and simplifiy the code. Don't do it if that is bad for the generated code.

::: js/src/vm/String.h
@@ +892,5 @@
>      return NULL;
>  }
>  
> +JS_ALWAYS_INLINE bool
> +JSString::getChar(JSContext *cx, int32_t index, jschar *code)

make this size_t index. But keep the index < length() assert.
Attachment #755219 - Flags: review?(evilpies) → review+
Thanks for backing out. Strange, I'll investigate why.
Small oversight wrt barriers :(.
Try: https://tbpl.mozilla.org/?tree=Try&rev=b2203b20d5ee
https://hg.mozilla.org/mozilla-central/rev/ce3ac61c379c
Status: NEW → RESOLVED
Closed: 11 years ago
Flags: in-testsuite+
Resolution: --- → FIXED
Target Milestone: --- → mozilla24
(In reply to Hannes Verschore [:h4writer] from comment #29)
I'm still a little concerned that we have a substr()-specific dependent-string-creation routine when there is nothing special about substr() other than it shows up on a benchmark.  I would really prefer we use the same heuristic everywhere unless we can find a more fundamental reason why we shouldn't.

> That is indeed the "pro" way to solve it. But is also the most involved fix.
> After finding some good trade-offs, we have to support it everywhere. 

We wouldn't have to "support" it everywhere; only where we wish to take advantage of it to  avoid flattening.  The only thing that would have to change in the jits immediately is the concat path (to accumulate number of child nodes, if we ended up wanting to do that).

I'm fine not doing this now, I'm just pointing out what I think is a real fix to the underlying problem if we see these problems continue to show up in the future.

> About the dependent rope idea. Does that mean:
[...]
> (2) wouldn't solve anything at all
> (1) could solve the issue, but we need to be aware that this can also
> introduce performance issues. What if after flattening the dependent rope,
> the rope itself needs to get flatten. Then we will have to copy the contents
> of the dependent rope twice.

I meant (1).  Yes, there is a chance we'd do worse, but if you compare it to what we do now (effectively, we speculate that if you need to flatten one small part of a rope you will later flatten the whole rope, so we flatten everything), this seems like the preferable speculation.
Depends on: 880214
(In reply to Luke Wagner [:luke] from comment #37)
> (In reply to Hannes Verschore [:h4writer] from comment #29)
> I'm still a little concerned that we have a substr()-specific
> dependent-string-creation routine when there is nothing special about
> substr() other than it shows up on a benchmark.  I would really prefer we
> use the same heuristic everywhere unless we can find a more fundamental
> reason why we shouldn't.

Oh sorry, I thought you agreed with my reasoning. I opened a follow-up bug on this to implement for all dependent-string-creation and to measure there is indeed not a reason this shouldn't be done for all.
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: