Closed Bug 526173 Opened 10 years ago Closed 10 years ago

3.5x regression in indexOf performance

Categories

(Core :: JavaScript Engine, defect)

x86
macOS
defect
Not set

Tracking

()

RESOLVED FIXED
Tracking Status
status1.9.2 --- beta5-fixed

People

(Reporter: bzbarsky, Assigned: luke)

References

Details

(Whiteboard: fixed-in-tracemonkey)

Attachments

(2 files, 5 obsolete files)

See the attached testcase.  Apart from the fact that the indexOf tests are a lot slower than the lastIndexOf ones for some reason (and have been going back to Fx 3.5 at least), there's been a recent 3.5x regression on the indexOf tests.

The m-c checkin range for that regression (using nightlies) is http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=def6be3d8b6b&tochange=149c3820e8a8 and there's a nice 100-changeset tracemonkey merge in there.  indexOf got rewritten in bug 511777, which I'm fingering as the most likely culprit here.  Shark just shows all the time spent in str_indexOf (yay inlining) and doesn't give me reasonable line listings for some reason (boo!).
Flags: blocking1.9.2?
Attached file Testcase (obsolete) —
Attached file Same testcase without the shark gunk (obsolete) —
And I bet the relevant change was this:

-    if (textlen - i >= 512 && (jsuint)(patlen - 2) <= BMH_PATLEN_MAX - 2) {
+    if (textlen >= 512 && patlen <= sBMHPatLenMax) {

In particular this changed from doing the simple walk along the string for patlen == 1 to using BMH.  This is an obvious loss, since BMH does more work and can't possibly win on patlen == 1.

If I change the condition to also test patlen > 1, things get much faster, though still about 10-20% slower than Fx 3.5.
Attachment #409878 - Attachment is obsolete: true
Ok, indexOf being a lot slower than lastIndexOf was somewhat due to pilot error.  With that fixed, indexOf is only about 50% slower than lastIndexOf, not many times slower....

Still might be worth a separate bug on that.  It looks like our indexOf has about the performance of Safari's lastIndexOf, while their indexOf is about 3x faster.
Attachment #409879 - Attachment is obsolete: true
blocking2.0: --- → ?
(In reply to comment #3)

First of all, many thanks for finding the weakened BMH condition!  I don't know what I was thinking when I left that out.  On that line of though, it is also worth some quick measurements to see if there is a higher k where we don't do BMH for patlen < k.

> Ok, indexOf being a lot slower than lastIndexOf was somewhat due to pilot
> error.  With that fixed, indexOf is only about 50% slower than lastIndexOf, not
> many times slower....

Hmm, I ran your tests in a tracemonkey tip js shell (with the BMH thinko fixed) and here's what I see:

(in ms)
testIndexOfx: 63
testLastIndexOfx: 50
testIndexOfDash: 874
testLastIndexOfDash: 1587

Now, if I read the construction of testStrGiant, the '-' is equidistant from the beginning and end of the string, hence |testStrGiant.indexOf("-")| should be the same speed as |testStrGiant.lastIndexOf("-")|.  However, indexOf is faster; I suspect because of the changes in bug 511777.  Finding 'x' from the beginning and end, on the other hand, is not a symmetric task because 'x' is 24 characters from the beginning and 9 characters from the end.  That indexOf is only 26% slower is, I believe, also a positive result of bug 511777.

> It looks like our indexOf has
> about the performance of Safari's lastIndexOf, while their indexOf is about 3x
> faster.

Wow.  Hmm, and we are tracing those loops too.  Are you quite sure of these results?
Someone confirm the head-to-head with Safari and dig into the generated machine code. It's time to get down and dirty.

/be
> worth some quick measurements to see if there is a higher k 

Yep.  ;)

> Hmm, I ran your tests in a tracemonkey tip js shell (with the BMH thinko
> fixed)

Interesting.  I was running in tracemonkey tip opt browser.  I just tried this again (with the patlen > 1 added, and in shell I see:

testIndexOfDash: 874
testLastIndexOfDash: 1587

In browser I see:

testIndexOfDash: 2594
testLastIndexOfDash: 1720

In Safari (browser) I see:

testIndexOfDash: 906
testLastIndexOfDash: 2219

This is very consistent.  Without the BMH fix, browser is at about 6000 for testIndexOfDash, while shell is at 8600 or so.

Sounds like we have a separate bug here...  Can you try browser on your end to confirm?

> the '-' is equidistant from the beginning and end of the string

+-1, yes.  That was the goal, at least.

> Finding 'x' from the beginning and end, on the other hand, is not a symmetric
> task

Right.  I was mostly looking at the '-' case.

> Are you quite sure of these results?

See above.
Hmm.  I just did more testing, and now shell is consistently coming up with 1500ms or so for testIndexOfDash (???)
Flags: blocking1.9.2? → blocking1.9.2+
(In reply to comment #4)
> (In reply to comment #3)
> On that line of though, it is also
> worth some quick measurements to see if there is a higher k where we don't do
> BMH for patlen < k.

I ran a matrix of pattern lengths and BMH threshold values.  Each row runs with a different value of k in the abovementioned |(patlen - k) < sBMHPatLenMax|, starting at 0.  Each column uses a different length pattern, starting at 1.  You can see a diagonal starting at k = 2, patlen = 1 and fading away around patlen = 11.  Hopefully this remains formatted:

267  134  91   68   56   47   42   35   34   30   28   26   24   24   23   20   22   
268  134  91   69   55   47   41   36   33   30   28   26   23   24   24   20   22   
29   134  90   69   56   47   41   36   33   30   28   26   24   24   23   20   21   
30   28   91   69   55   48   41   35   33   31   28   26   23   24   23   20   21   
29   29   28   69   56   47   41   35   33   31   28   26   24   24   23   19   22   
29   29   28   29   56   47   41   35   34   30   28   26   23   24   24   19   22   
29   29   28   29   28   47   41   36   33   30   28   26   23   24   23   20   22   
29   29   29   28   29   29   41   35   33   31   28   26   23   24   23   20   21   
28   29   29   28   29   29   28   36   33   30   28   26   24   24   23   19   22   
29   29   29   28   30   28   29   29   33   30   29   25   24   24   23   19   22   
29   28   30   28   29   29   28   29   28   31   28   26   23   24   23   19   22   
29   29   29   29   29   29   28   29   29   28   28   26   23   24   23   20   22   
29   28   29   28   29   29   28   29   29   29   28   26   23   24   24   19   22   
29   29   29   28   29   28   29   29   28   29   29   29   23   24   23   20   21   
30   29   28   29   29   28   29   29   30   28   29   29   29   24   23   20   21   
29   28   29   29   28   29   29   29   28   29   29   29   29   29   23   19   22   
30   28   29   29   29   28   29   29   29   29   29   29   29   29   29   19   22   
28   29   28   29   28   29   29   28   29   29   28   29   29   29   28   29   22

Clearly the numbers are machine specific (Ubuntu w/ Core 2 Duo @ 2.67Ghz, btw), but it does clearly suggest a k higher than 2.

Now back to the "why is the browser slower" hunt.
Confound it!  One column too many.
(In reply to comment #6)
> Sounds like we have a separate bug here...  Can you try browser on your end to
> confirm?

I get pretty much the same relative performance as the js shell in a browser build:

testIndexOfDash: 871
testLastIndexOfDash: 1582
Interesting.  I wonder what the difference is...  I'm just using --enable-optimize --disable-debug; pretty basic Firefox opt build.
Attached patch fix the BMH bug (obsolete) — Splinter Review
Well, fix the glaring problem for now.
Assignee: general → lw
Status: NEW → ASSIGNED
Attachment #410036 - Flags: review?(jwalden+bmo)
Explicit cast to jsuint might be nice there.  And maybe a comment.
Attached patch with comments and cast (obsolete) — Splinter Review
Attachment #410036 - Attachment is obsolete: true
Attachment #410041 - Flags: review?(jwalden+bmo)
Attachment #410036 - Flags: review?(jwalden+bmo)
Comment on attachment 410041 [details] [diff] [review]
with comments and cast

We try not to overparenthesize subtraction against <= and so on (gal's Pascal or Oberon mistrained fingers excluded :-P).

/be
(In reply to comment #8)
Split off to bug 526348.
I ran a try-server-built version of your benchmark and indeed there is a horrendous regression from testIndexOfDash being 2x faster to < 2x slower!  I will shark this.
So looking at the disassembly with Shark data, it seems Mac gcc doesn't think that 't' and 'p0', the hottest variables of the unrolled loop in StringMatch, deserve registers.  Slowness ensues.
Hmm.  I just tried making those |register| with no effect, but it could well just be ignoring me....
Yeehaa!

  register int test_integer asm ("EBX");

Puts performance back to Linux-level for both shell and browser!
We used to have to do this kind of shit in the '80s (with the Portable C Compiler). To quote from the Wicked Witch of the West: "What a world!"

/be
Attached patch fix both problems (obsolete) — Splinter Review
I tried to make the register/asm hack apply as narrowly as possible (__APPLE__ && __GNUC__ && __i386__); anyone with portability experience, please let me know if I could do better.  With this, OS X 32-bit browser and shell have much better indexOf performance than 3.5 and are on par with Ubuntu.
Attachment #410041 - Attachment is obsolete: true
Attachment #410377 - Flags: review?(jwalden+bmo)
Attachment #410041 - Flags: review?(jwalden+bmo)
Attached patch more fasterSplinter Review
So WebKit has a hard-coded loop for length-one patterns.  This should be slower than an unrolled loop (extra comparison and branch) and on 64-bit OS X, the unrolled loop wins.  However, register pressure on 32-bit x86 (both OS X and Linux, no idea for Windows) slows down the unrolled loop due to spilling.  testIndexOfDash spends enough time in the loop body for this to make a difference (~60ms).

Comparing the result against 1.9.1, I also noticed that testLastIndexOfDash, which is syntactically identical, has dropped from 1141ms to 1511ms on tip.  Looking at it under shark, all the time is spent in the 1 loop at the end.  I played to forcing things into registers, but to no avail.  Tweaking the logic to use pointers and a cached pat[0] recovers old performance almost exactly, so that is also in the patch.

End result for 32-bit OS X browser:
                     FF3.5.4  WebKit nightly  tip    patch
textIndexOfx         80       74              263    63
testLastIndexOfx     45       86              43     45
testIndexOfDash      1876     803             5288   787
testLastIndexOfDash  1141     1916            1510   1143
Attachment #410377 - Attachment is obsolete: true
Attachment #410612 - Flags: review?(jwalden+bmo)
Attachment #410377 - Flags: review?(jwalden+bmo)
Attachment #410612 - Flags: review?(jwalden+bmo) → review+
http://hg.mozilla.org/tracemonkey/rev/f8188f1c00f0
Whiteboard: fixed-in-tracemonkey
http://hg.mozilla.org/mozilla-central/rev/f8188f1c00f0
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
These bugs landed after b4 was cut. Moving flag out.
blocking2.0: ? → ---
You need to log in before you can comment on or make changes to this bug.