Closed Bug 604905 Opened 14 years ago Closed 14 years ago

Significant javascript performance regression since ff4b6

Categories

(Core :: JavaScript Engine, defect)

Platform: x86
OS: macOS
Type: defect
Priority: Not set
Severity: normal

Tracking


RESOLVED FIXED
Tracking Status
blocking2.0 --- betaN+

People

(Reporter: cers, Assigned: dvander)

References

(Blocks 2 open bugs)

Details

(Whiteboard: fixed-in-tracemonkey)

Attachments

(3 files)

User-Agent:       Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0b8pre) Gecko/20101016 Firefox/4.0b8pre
Build Identifier: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0b8pre) Gecko/20101016 Firefox/4.0b8pre

Testing out an image processing algorithm I've recently ported from C, I noticed a significant difference between the performance in b6 and recent nightlies. There appears to be a slowdown in the order of 1000% in the nightlies compared to b6.

Here are representative scores from b6 and Gecko/20101016 Firefox/4.0b8pre using the same machine and profile.

b6:
(10x10) = 2ms
(50x50) = 3ms
(100x100) = 3ms
(500x500) = 91ms
(1000x1000) = 415ms

b8pre:
(10x10) = 3ms
(50x50) = 7ms
(100x100) = 23ms
(500x500) = 912ms
(1000x1000) = 5108ms

Reproducible: Always

Steps to Reproduce:
1. Go to http://www.geeksbynature.dk/projects/distance/test.html
2. Compare result with those from beta6
Actual Results:  
Large slowdown.

Expected Results:  
Slight speedup.
I'm seeing the exact same issue on Windows XP: Mozilla/5.0 (Windows NT 5.1; rv:2.0b8pre) Gecko/20101016 Firefox/4.0b8pre ID:20101016082013

~B
Status: UNCONFIRMED → NEW
Ever confirmed: true
Regression range in tracemonkey tree.
(On Windows build)
Fast:20100909043955(f5f47e8fbc15)
Slow:20100910044908(a1f43f4ef565)

Pushlog:
http://hg.mozilla.org/tracemonkey/pushloghtml?fromchange=f5f47e8fbc15&tochange=a1f43f4ef565
The code causing the slowness seems to be line 140 of distance.js:

row_copy = this.data.slice(this.lut[r], this.lut[r+1]);

In my environment this line takes about 2.4 seconds (in the 1000x1000 case).
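For anyone wanting to time that row-copy pattern in isolation, a rough micro-benchmark might look like the following. This is a hypothetical harness: the array contents and sizes are invented, and `data` only stands in for `this.data` from distance.js.

```javascript
// Hypothetical harness approximating the row copy blamed above; the data
// here is invented and only mirrors the shape of the code in distance.js.
var cols = 1000, rows = 1000;
var data = new Array(cols * rows);
for (var i = 0; i < data.length; i++) data[i] = i % 256;

var start = Date.now();
var row_copy;
for (var r = 0; r < rows; r++) {
  // One slice per row, as in line 140 of distance.js.
  row_copy = data.slice(r * cols, (r + 1) * cols);
}
console.log("row copies took " + (Date.now() - start) + "ms");
```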
Regression range in TM:
Fast:
http://hg.mozilla.org/tracemonkey/rev/f5f47e8fbc15
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0b6pre) Gecko/20100909 Firefox/4.0b6pre ID:20100909043955
Slow:
http://hg.mozilla.org/tracemonkey/rev/a1f43f4ef565
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0b6pre) Gecko/20100910 Firefox/4.0b6pre ID:20100910044908
Pushlog in TM
http://hg.mozilla.org/tracemonkey/pushloghtml?fromchange=f5f47e8fbc15&tochange=a1f43f4ef565
In a local build, the following changeset causes the performance issue:
1b55ec0c7aee	David Anderson — Fix various bugs in tracer integration (bug 593532, r=dmandelin).
Blocks: 593532
dvander, can you take a look at this regression? looks like your changes are blamed.
So the upshot here is twofold:

1)  -j -m is a major slowdown over -j on this testcase.
2)  -j is a slowdown over _interp_ on this testcase.

Both seem like problems.  #1 may be addressed by heuristics tweaking, but #2 seems like we should fix it anyway (maybe in a separate bug?).

Data:

mozilla% foreach args ( "" "-j" "-m" "-m -j" )
foreach? echo "args: $args"
foreach? ./js/src/opt-obj/js $args ~/test.js
foreach? end
args: 
(10x10) = 0ms
(50x50) = 6ms
(100x100) = 22ms
(500x500) = 566ms
(1000x1000) = 2330ms
args: -j
(10x10) = 2ms
(50x50) = 2ms
(100x100) = 4ms
(500x500) = 59ms
(1000x1000) = 233ms
args: -m
(10x10) = 0ms
(50x50) = 5ms
(100x100) = 27ms
(500x500) = 687ms
(1000x1000) = 2641ms
args: -m -j
(10x10) = 2ms
(50x50) = 4ms
(100x100) = 15ms
(500x500) = 485ms
(1000x1000) = 1585ms
Er, I meant:

2)  -m is a slowdown over interp on this testcase

of course.
I did a shark profile of the 1000x1000 -m case:

  22% in mjit-generated code.
  26% under js_AtomizeString (called from stubs::GetElem)
  17% under js_ValueToString (called from stubs::GetElem)
  13% under TypedArrayTemplate<double>::obj_getProperty
   9% in stubs::GetElem itself
   8% under stubs::Interrupt (doing gc)
   2% under stubs::SetElem
   2% under stubs::Equal
That string/atom gunk is under js_InternNonIntElementId, and it happens because rref.isDouble().  The doubles' values are things like 218191, 216191, etc.

For each value we hit the string gunk several times (at least 5; I stopped counting after that).  This seems like problem #3 here...

Problem #2 seems like bug 594247 might help (but why are we slower than interp?  and why are those values doubles?).

For problem #1 I'd love to try the patch in bug 580468, but I can't find a tracemonkey revision it actually applies to.
Depends on: 580468, 594247
In fact, if I change the testcase to use regular arrays instead of Float64 ones (by just throwing in the try blocks there), I get these numbers:

mozilla% foreach args ( "" "-j" "-m" "-m -j" )
foreach? echo "args: $args"
foreach? ./js/src/opt-obj/js $args ~/test.js
foreach? end
args: 
(10x10) = 0ms
(50x50) = 5ms
(100x100) = 19ms
(500x500) = 437ms
(1000x1000) = 1740ms
args: -j
(10x10) = 2ms
(50x50) = 3ms
(100x100) = 3ms
(500x500) = 56ms
(1000x1000) = 268ms
args: -m
(10x10) = 1ms
(50x50) = 1ms
(100x100) = 4ms
(500x500) = 84ms
(1000x1000) = 356ms
args: -m -j
(10x10) = 2ms
(50x50) = 1ms
(100x100) = 3ms
(500x500) = 28ms
(1000x1000) = 130ms

Reporter, are you possibly using values out of those arrays as array indices themselves?
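To make that question concrete, here is a sketch of the pattern being asked about (all names and values here are invented): reads from a Float64Array always produce doubles, so using them directly as indices gives GETELEM a double-valued key, whereas a Uint32Array read, or an explicit `|0` coercion, yields an int32 key.

```javascript
// Invented illustration of double-valued vs int32-valued array indices.
var lutF64 = new Float64Array([0, 10, 20]);
var lutU32 = new Uint32Array([0, 10, 20]);
var data = [];
for (var i = 0; i < 30; i++) data[i] = i * 2;

var slow = data[lutF64[1]];        // key is a double (10.0)
var fast = data[lutU32[1]];        // key is an int32 (10)
var coerced = data[lutF64[1] | 0]; // double coerced to int32 before use

console.log(slow, fast, coerced); // all three read data[10]
```

All three lookups return the same value; the difference is only in which element-lookup path the engine takes for the key.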
fwiw, the v8 shell times (using untyped arrays) are:

(10x10) = 1ms
(50x50) = 1ms
(100x100) = 2ms
(500x500) = 53ms
(1000x1000) = 191ms

So -m -j beats it, but -m alone is 2x slower.  For -m alone on untyped arrays, 91% of the time is in mjit-generated code.
Which may be another (4th?) bug we need here.
Note: tracer also handles double args to JSOP_GETELEM via the "convert it to string" slow path, as far as I can tell.  So presumably tracer is NOT seeing doubles there for some reason???
(In reply to comment #11)
> In fact, if I change the testcase to use regular arrays instead of Float64 ones
> 
[snip]
> Reporter, are you possibly using values out of those arrays as array indices
> themselves?

Well spotted - I've changed the relevant arrays to Uint32Array and the scores are much improved. I am still seeing a ~100% slowdown though, compared to the original b6 results and ~400% slowdown compared with the new version in b6.

@@ -32,7 +32,7 @@
 function ImgPUInt32(im, c, r) {
     try {
         this.data = Float64Array(im);
-        this.lut = Float64Array(r+1);
+        this.lut = Uint32Array(r+1);
     } catch (e) {
         this.data = im;
         this.lut = Array(r+1);
@@ -103,9 +103,9 @@
     var MAX = -1 >>> 1;
 
     try {
-        s = Float64Array(cols);
-        t = Float64Array(cols);
-        row_copy = Float64Array(cols);
+        s = Uint32Array(cols);
+        t = Uint32Array(cols);
+        row_copy = Uint32Array(cols);
     } catch (e) {
         s = Array(cols);
         t = Array(cols);
(In reply to comment #15)

> @@ -32,7 +32,7 @@
>  function ImgPUInt32(im, c, r) {
>      try {
>          this.data = Float64Array(im);
> -        this.lut = Float64Array(r+1);
> +        this.lut = Uint32Array(r+1);
>      } catch (e) {
>          this.data = im;
>          this.lut = Array(r+1);
> @@ -103,9 +103,9 @@
>      var MAX = -1 >>> 1;
> 
>      try {
> -        s = Float64Array(cols);
> -        t = Float64Array(cols);
> -        row_copy = Float64Array(cols);
> +        s = Uint32Array(cols);
> +        t = Uint32Array(cols);
> +        row_copy = Uint32Array(cols);
>      } catch (e) {
>          s = Array(cols);
>          t = Array(cols);

Might in fact make sense to keep row_copy as Float64 - it's not used as index, and float seems faster. (Also, the data copied into it using that diff would be coming from a Float64Array, but changing both to Uint32Array was slightly slower)
If I make the changes suggested in the combination of comment 15 and comment 16, then I get these numbers:

args: 
(10x10) = 0ms
(50x50) = 7ms
(100x100) = 21ms
(500x500) = 538ms
(1000x1000) = 2183ms
args: -j
(10x10) = 2ms
(50x50) = 1ms
(100x100) = 2ms
(500x500) = 38ms
(1000x1000) = 131ms
args: -m
(10x10) = 1ms
(50x50) = 3ms
(100x100) = 11ms
(500x500) = 252ms
(1000x1000) = 1027ms
args: -j -m
(10x10) = 2ms
(50x50) = 3ms
(100x100) = 10ms
(500x500) = 180ms
(1000x1000) = 729ms

A profile of the -m case shows that the double-to-atom stuff is all gone; what remains seems to be mostly bug 594247.

One interesting thing here is that I just did profiles for the -j case for both the original code (where we take 230ms) and the new code (where we take 130ms).  The former definitely shows some double-to-atom conversions, totaling about 25% of the time.  There are also stub calls to js_GetPropertyByName, etc.  So it's doing things similar to what -m is doing in that case, I think.  The question is why it's still so much faster with -j.

Given comment 10 paragraph 2, and this code in the testcase:

  var tmp = ( (u*u - (s[q]*s[q]) + this.data[u+img_row] -
               this.data[s[q]+img_row]) / (2*(u - s[q])) );

I wonder whether tracejit is just managing to CSE the property accesses...  But I wouldn't have thought those would be CSE-eable...
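If the tracejit really is CSEing those reads, hoisting them by hand should approximate what it gets for free. A sketch, with names borrowed from the expression above but a hypothetical function and invented sample values:

```javascript
// Sketch of manually hoisting the repeated reads from the expression in
// the comment above. The wrapper function and sample data are hypothetical.
function tmpHoisted(data, s, q, u, img_row) {
  var sq = s[q];                // read s[q] once instead of three times
  var du = data[u + img_row];   // read each element access once
  var dsq = data[sq + img_row];
  return (u * u - sq * sq + du - dsq) / (2 * (u - sq));
}

// Tiny usage example with invented values:
var data = [0, 1, 4, 9, 16, 25, 36, 49];
var s = [0, 2];
console.log(tmpHoisted(data, s, 1, 5, 0)); // → 7
```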
And in fact, if I just count calls to js_AtomizeString while running the 1000x1000 case of the original code, I get 9367611 under methodjit and 777501 under tracejit.  Not sure whether this is also worth investigating.

dvander, do you want me to try to file separate bugs blocking this one for the various things that might be issues here, or just shut up and let you take a look too?
(In reply to comment #10)
> Problem #2 seems like bug 594247 might help (but why are we slower than interp?
>  and why are those values doubles?).

I would guess bug 594247 is the major malfunction here. It will probably be a bunch easier to figure out what else might be hurting us after we get that fixed.

> For problem #1 I'd love to try the patch in bug 580468, but I can't find a
> tracemonkey revision it actually applies to.

We're going to land that one to TM very soon.
Agreed.  Let's get those two bugs fixed (which might well just fix the -j -m case) and then remeasure.
blocking2.0: ? → betaN+
Depends on: 608479
Depends on: 608480
Blocks: 467263
Depends on: 608490
With the test case from comment #21, and bug 594247 applied:

dvander@dvander:~/mozilla/tracemonkey/js/src$ ./Opt32/js -m -j -p 2k.js 
(10x10) = 1ms
(50x50) = 2ms
(100x100) = 3ms
(500x500) = 64ms
(1000x1000) = 259ms

The original test case:
dvander@dvander:~/mozilla/tracemonkey/js/src$ ./Opt32/js -m -j -p k.js 
(10x10) = 1ms
(50x50) = 3ms
(100x100) = 9ms
(500x500) = 265ms
(1000x1000) = 1086ms

dvander@dvander:~/mozilla/tracemonkey/js/src$ ./Opt32/js -j k.js 
(10x10) = 2ms
(50x50) = 1ms
(100x100) = 2ms
(500x500) = 51ms
(1000x1000) = 230ms

I'm hesitant to block on bug 608479, since adding an IC to convert double indices to integers is a little complicated, and an unusual use case (which here appears to have been accidental).
David, I think the main question here is why there are so many more double-to-string conversions with -m than with -j on the original testcase, and whether we can easily fix that.
Attached patch fixSplinter Review
Thanks, good point. The atomization only happens in GETELEM, and only because GETELEM, unlike other key-is-value opcodes, does not call ValueFitsInInt32. That seems worth fixing.

That brings the slow version down to:
dvander@dvander:~/mozilla/tracemonkey/js/src$ ./Opt32/js -m bz.js
(10x10) = 1ms
(50x50) = 2ms
(100x100) = 5ms
(500x500) = 139ms
(1000x1000) = 558ms

Which is a big improvement - but obviously, way better to use an integer array in the first place, which will IC.
Assignee: general → dvander
Status: NEW → ASSIGNED
Attachment #489755 - Flags: review?(lw)
Comment on attachment 489755 [details] [diff] [review]
fix

Because ValueFitsInInt32 will only be picking up the double case, and I can't see us adding any other value types to the set of "fits in int32", it looks like you could write:

} else if (v.isDouble() && JSDOUBLE_IS_INT32(v.toDouble(), &i) &&
           INT_FITS_IN_JSID(i)) {
    id = INT_TO_JSID(i);
}

r+ either way.
Attachment #489755 - Flags: review?(lw) → review+
Yep, I can confirm that with the attached patch -m on the first testcase ends up 3x faster (and in fact ends up about as fast as on the second testcase).

If I also apply the patch in bug 594247, then JM ends up within a factor of 2.5 of TM on the testcases here.

Best yet, with the patch for bug 608480 I see -j -m -p being about as good as just -j on the attached testcases (and much faster than -j -m), for total win.  ;)
Blocks: 612930
http://hg.mozilla.org/mozilla-central/rev/9bff531727b2
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED