Closed Bug 604905 Opened 14 years ago Closed 14 years ago

Significant javascript performance regression since ff4b6

Categories

(Core :: JavaScript Engine, defect)

Platform: x86
OS: macOS
Type: defect
Priority: Not set
Severity: normal

Tracking


RESOLVED FIXED
Tracking Status
blocking2.0 --- betaN+

People

(Reporter: cers, Assigned: dvander)

References

(Blocks 2 open bugs)

Details

(Whiteboard: fixed-in-tracemonkey)

Attachments

(3 files)

User-Agent:       Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0b8pre) Gecko/20101016 Firefox/4.0b8pre
Build Identifier: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0b8pre) Gecko/20101016 Firefox/4.0b8pre

Testing out an image processing algorithm I've recently ported from C, I noticed a significant difference between the performance in b6 and recent nightlies. There appears to be a slowdown in the order of 1000% in the nightlies compared to b6.

Here are representative scores from b6 and Gecko/20101016 Firefox/4.0b8pre using the same machine and profile.

b6:
(10x10) = 2ms
(50x50) = 3ms
(100x100) = 3ms
(500x500) = 91ms
(1000x1000) = 415ms

b8pre:
(10x10) = 3ms
(50x50) = 7ms
(100x100) = 23ms
(500x500) = 912ms
(1000x1000) = 5108ms

Reproducible: Always

Steps to Reproduce:
1. Go to http://www.geeksbynature.dk/projects/distance/test.html
2. Compare result with those from beta6
Actual Results:  
Large slowdown.

Expected Results:  
Slight speedup.
I'm seeing the exact same issue on Windows XP: Mozilla/5.0 (Windows NT 5.1; rv:2.0b8pre) Gecko/20101016 Firefox/4.0b8pre ID:20101016082013

~B
Status: UNCONFIRMED → NEW
Ever confirmed: true
Regression range in tracemonkey tree.
(On Windows build)
Fast:20100909043955(f5f47e8fbc15)
Slow:20100910044908(a1f43f4ef565)

Pushlog:
http://hg.mozilla.org/tracemonkey/pushloghtml?fromchange=f5f47e8fbc15&tochange=a1f43f4ef565
The code causing the slowness seems to be line 140 of distance.js:

row_copy = this.data.slice(this.lut[r], this.lut[r+1]);

In my environment this line takes about 2.4 seconds (in the 1000x1000 case).
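For anyone wanting to time that row-copy pattern in isolation, a rough micro-benchmark might look like the following. This is a hypothetical harness: the array contents and sizes are invented, and `data` only stands in for `this.data` from distance.js.

```javascript
// Hypothetical harness approximating the row copy blamed above; the data
// here is invented and only mirrors the shape of the code in distance.js.
var cols = 1000, rows = 1000;
var data = new Array(cols * rows);
for (var i = 0; i < data.length; i++) data[i] = i % 256;

var start = Date.now();
var row_copy;
for (var r = 0; r < rows; r++) {
  // One slice per row, as in line 140 of distance.js.
  row_copy = data.slice(r * cols, (r + 1) * cols);
}
console.log("row copies took " + (Date.now() - start) + "ms");
```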
Regression range in TM:
Fast:
http://hg.mozilla.org/tracemonkey/rev/f5f47e8fbc15
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0b6pre) Gecko/20100909 Firefox/4.0b6pre ID:20100909043955
Slow:
http://hg.mozilla.org/tracemonkey/rev/a1f43f4ef565
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0b6pre) Gecko/20100910 Firefox/4.0b6pre ID:20100910044908
Pushlog in TM
http://hg.mozilla.org/tracemonkey/pushloghtml?fromchange=f5f47e8fbc15&tochange=a1f43f4ef565
In a local build, the following changeset causes the performance issue:
1b55ec0c7aee	David Anderson — Fix various bugs in tracer integration (bug 593532, r=dmandelin).
Blocks: 593532
dvander, can you take a look at this regression? looks like your changes are blamed.
So the upshot here is twofold:

1)  -j -m is a major slowdown over -j on this testcase.
2)  -j is a slowdown over _interp_ on this testcase.

Both seem like problems.  #1 may be addressed by heuristics tweaking, but #2 seems like we should fix it anyway (maybe in a separate bug?).

Data:

mozilla% foreach args ( "" "-j" "-m" "-m -j" )
foreach? echo "args: $args"
foreach? ./js/src/opt-obj/js $args ~/test.js
foreach? end
args: 
(10x10) = 0ms
(50x50) = 6ms
(100x100) = 22ms
(500x500) = 566ms
(1000x1000) = 2330ms
args: -j
(10x10) = 2ms
(50x50) = 2ms
(100x100) = 4ms
(500x500) = 59ms
(1000x1000) = 233ms
args: -m
(10x10) = 0ms
(50x50) = 5ms
(100x100) = 27ms
(500x500) = 687ms
(1000x1000) = 2641ms
args: -m -j
(10x10) = 2ms
(50x50) = 4ms
(100x100) = 15ms
(500x500) = 485ms
(1000x1000) = 1585ms
Er, I meant:

2)  -m is a slowdown over interp on this testcase

of course.
I did a shark profile of the 1000x1000 -m case:

  22% in mjit-generated code.
  26% under js_AtomizeString (called from stubs::GetElem)
  17% under js_ValueToString (called from stubs::GetElem)
  13% under TypedArrayTemplate<double>::obj_getProperty
   9% in stubs::GetElem itself
   8% under stubs::Interrupt (doing gc)
   2% under stubs::SetElem
   2% under stubs::Equal
That string/atom gunk is under js_InternNonIntElementId, and it happens because rref.isDouble().  The doubles' values are things like 218191, 216191, etc.

For each value we hit the string gunk several times (at least 5; I stopped counting after that).  This seems like problem #3 here...

Problem #2 seems like bug 594247 might help (but why are we slower than interp?  and why are those values doubles?).

For problem #1 I'd love to try the patch in bug 580468, but I can't find a tracemonkey revision it actually applies to.
Depends on: 580468, 594247
In fact, if I change the testcase to use regular arrays instead of Float64 ones (by just throwing in the try blocks there), I get these numbers:

mozilla% foreach args ( "" "-j" "-m" "-m -j" )
foreach? echo "args: $args"
foreach? ./js/src/opt-obj/js $args ~/test.js
foreach? end
args: 
(10x10) = 0ms
(50x50) = 5ms
(100x100) = 19ms
(500x500) = 437ms
(1000x1000) = 1740ms
args: -j
(10x10) = 2ms
(50x50) = 3ms
(100x100) = 3ms
(500x500) = 56ms
(1000x1000) = 268ms
args: -m
(10x10) = 1ms
(50x50) = 1ms
(100x100) = 4ms
(500x500) = 84ms
(1000x1000) = 356ms
args: -m -j
(10x10) = 2ms
(50x50) = 1ms
(100x100) = 3ms
(500x500) = 28ms
(1000x1000) = 130ms

Reporter, are you possibly using values out of those arrays as array indices themselves?
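To make that question concrete, here is a sketch of the pattern being asked about (all names and values here are invented): reads from a Float64Array always produce doubles, so using them directly as indices gives GETELEM a double-valued key, whereas a Uint32Array read, or an explicit `|0` coercion, yields an int32 key.

```javascript
// Invented illustration of double-valued vs int32-valued array indices.
var lutF64 = new Float64Array([0, 10, 20]);
var lutU32 = new Uint32Array([0, 10, 20]);
var data = [];
for (var i = 0; i < 30; i++) data[i] = i * 2;

var slow = data[lutF64[1]];        // key is a double (10.0)
var fast = data[lutU32[1]];        // key is an int32 (10)
var coerced = data[lutF64[1] | 0]; // double coerced to int32 before use

console.log(slow, fast, coerced); // all three read data[10]
```

All three lookups return the same value; the difference is only in which element-lookup path the engine takes for the key.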
fwiw, the v8 shell times (using untyped arrays) are:

(10x10) = 1ms
(50x50) = 1ms
(100x100) = 2ms
(500x500) = 53ms
(1000x1000) = 191ms

So -m -j beats it, but -m alone is 2x slower.  For -m alone on untyped arrays, 91% of the time is in mjit-generated code.
Which may be another (4th?) bug we need here.
Note: tracer also handles double args to JSOP_GETELEM via the "convert it to string" slow path, as far as I can tell.  So presumably tracer is NOT seeing doubles there for some reason???
(In reply to comment #11)
> In fact, if I change the testcase to use regular arrays instead of Float64 ones
> 
[snip]
> Reporter, are you possibly using values out of those arrays as array indices
> themselves?

Well spotted - I've changed the relevant arrays to Uint32Array and the scores are much improved. I am still seeing a ~100% slowdown though, compared to the original b6 results and ~400% slowdown compared with the new version in b6.

@@ -32,7 +32,7 @@
 function ImgPUInt32(im, c, r) {
     try {
         this.data = Float64Array(im);
-        this.lut = Float64Array(r+1);
+        this.lut = Uint32Array(r+1);
     } catch (e) {
         this.data = im;
         this.lut = Array(r+1);
@@ -103,9 +103,9 @@
     var MAX = -1 >>> 1;
 
     try {
-        s = Float64Array(cols);
-        t = Float64Array(cols);
-        row_copy = Float64Array(cols);
+        s = Uint32Array(cols);
+        t = Uint32Array(cols);
+        row_copy = Uint32Array(cols);
     } catch (e) {
         s = Array(cols);
         t = Array(cols);
(In reply to comment #15)

> @@ -32,7 +32,7 @@
>  function ImgPUInt32(im, c, r) {
>      try {
>          this.data = Float64Array(im);
> -        this.lut = Float64Array(r+1);
> +        this.lut = Uint32Array(r+1);
>      } catch (e) {
>          this.data = im;
>          this.lut = Array(r+1);
> @@ -103,9 +103,9 @@
>      var MAX = -1 >>> 1;
> 
>      try {
> -        s = Float64Array(cols);
> -        t = Float64Array(cols);
> -        row_copy = Float64Array(cols);
> +        s = Uint32Array(cols);
> +        t = Uint32Array(cols);
> +        row_copy = Uint32Array(cols);
>      } catch (e) {
>          s = Array(cols);
>          t = Array(cols);

Might in fact make sense to keep row_copy as Float64 - it's not used as index, and float seems faster. (Also, the data copied into it using that diff would be coming from a Float64Array, but changing both to Uint32Array was slightly slower)
If I make the changes suggested in the combination of comment 15 and comment 16, then I get these numbers:

args: 
(10x10) = 0ms
(50x50) = 7ms
(100x100) = 21ms
(500x500) = 538ms
(1000x1000) = 2183ms
args: -j
(10x10) = 2ms
(50x50) = 1ms
(100x100) = 2ms
(500x500) = 38ms
(1000x1000) = 131ms
args: -m
(10x10) = 1ms
(50x50) = 3ms
(100x100) = 11ms
(500x500) = 252ms
(1000x1000) = 1027ms
args: -j -m
(10x10) = 2ms
(50x50) = 3ms
(100x100) = 10ms
(500x500) = 180ms
(1000x1000) = 729ms

A profile of the -m case shows that the double-to-atom stuff is all gone; what remains seems to be mostly bug 594247.

One interesting thing here is that I just did profiles for the -j case for both the original code (where we take 230ms) and the new code (where we take 130ms).  The former definitely shows some double-to-atom conversions, totaling about 25% of the time.  There are also stub calls to js_GetPropertyByName, etc.  So it's doing things similar to what -m is doing in that case, I think.  The question is why it's still so much faster with -j.

Given comment 10 paragraph 2, and this code in the testcase:

  var tmp = ( (u*u - (s[q]*s[q]) + this.data[u+img_row] -
               this.data[s[q]+img_row]) / (2*(u - s[q])) );

I wonder whether tracejit is just managing to CSE the property accesses...  But I wouldn't have thought those would be CSE-eable...
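If the tracejit really is CSEing those reads, hoisting them by hand should approximate what it gets for free. A sketch, with names borrowed from the expression above but a hypothetical function and invented sample values:

```javascript
// Sketch of manually hoisting the repeated reads from the expression in
// the comment above. The wrapper function and sample data are hypothetical.
function tmpHoisted(data, s, q, u, img_row) {
  var sq = s[q];                // read s[q] once instead of three times
  var du = data[u + img_row];   // read each element access once
  var dsq = data[sq + img_row];
  return (u * u - sq * sq + du - dsq) / (2 * (u - sq));
}

// Tiny usage example with invented values:
var data = [0, 1, 4, 9, 16, 25, 36, 49];
var s = [0, 2];
console.log(tmpHoisted(data, s, 1, 5, 0)); // → 7
```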
And in fact, if I just count calls to js_AtomizeString while running the 1000x1000 case of the original code, I get 9367611 under methodjit and 777501 under tracejit.  Not sure whether this is also worth investigating.

dvander, do you want me to try to file separate bugs blocking this one for the various things that might be issues here, or just shut up and let you take a look too?
(In reply to comment #10)
> Problem #2 seems like bug 594247 might help (but why are we slower than interp?
>  and why are those values doubles?).

I would guess bug 594247 is the major malfunction here. It will probably be a bunch easier to figure out what else might be hurting us after we get that fixed.

> For problem #1 I'd love to try the patch in bug 580468, but I can't find a
> tracemonkey revision it actually applies to.

We're going to land that one to TM very soon.
Agreed.  Let's get those two bugs fixed (which might well just fix the -j -m case) and then remeasure.
blocking2.0: ? → betaN+
Depends on: 608479
Depends on: 608480
Blocks: 467263
Depends on: 608490
With the test case from comment #21, and bug 594247 applied:

dvander@dvander:~/mozilla/tracemonkey/js/src$ ./Opt32/js -m -j -p 2k.js 
(10x10) = 1ms
(50x50) = 2ms
(100x100) = 3ms
(500x500) = 64ms
(1000x1000) = 259ms

The original test case:
dvander@dvander:~/mozilla/tracemonkey/js/src$ ./Opt32/js -m -j -p k.js 
(10x10) = 1ms
(50x50) = 3ms
(100x100) = 9ms
(500x500) = 265ms
(1000x1000) = 1086ms

dvander@dvander:~/mozilla/tracemonkey/js/src$ ./Opt32/js -j k.js 
(10x10) = 2ms
(50x50) = 1ms
(100x100) = 2ms
(500x500) = 51ms
(1000x1000) = 230ms

I'm hesitant to block on bug 608479, since adding an IC to convert double indices to integers is a little complicated, and an unusual use case (which here appears to have been accidental).
David, I think the main question here is why there are so many more double-to-string conversions with -m than with -j on the original testcase, and whether we can easily fix that.
Attached patch fixSplinter Review
Thanks, good point. The atomization only happens in GETELEM, and only because GETELEM, unlike other key-is-value opcodes, does not call ValueFitsInInt32. That seems worth fixing.

That brings the slow version down to:
dvander@dvander:~/mozilla/tracemonkey/js/src$ ./Opt32/js -m bz.js
(10x10) = 1ms
(50x50) = 2ms
(100x100) = 5ms
(500x500) = 139ms
(1000x1000) = 558ms

Which is a big improvement - but obviously, way better to use an integer array in the first place, which will IC.
Assignee: general → dvander
Status: NEW → ASSIGNED
Attachment #489755 - Flags: review?(lw)
Comment on attachment 489755 [details] [diff] [review]
fix

Because ValueFitsInInt32 will only be picking up the double case, and I can't see us adding any other value types to the set of "fits in int32", it looks like you could write:

} else if (v.isDouble() && JSDOUBLE_IS_INT32(v.toDouble(), &i) &&
           INT_FITS_IN_JSID(i)) {
    id = INT_TO_JSID(i);
}

r+ either way.
Attachment #489755 - Flags: review?(lw) → review+
Yep, I can confirm that with the attached patch -m on the first testcase ends up 3x faster (and in fact ends up about as fast as on the second testcase).

If I also apply the patch in bug 594247, then JM ends up within a factor of 2.5 of TM on the testcases here.

Best yet, with the patch for bug 608480 I see -j -m -p being about as good as just -j on the attached testcases (and much faster than -j -m), for total win.  ;)
Blocks: 612930
http://hg.mozilla.org/mozilla-central/rev/9bff531727b2
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED