Closed Bug 480494 Opened 15 years ago Closed 15 years ago

[TM] Dromaeo: Drop in crypto/string processing performance

Categories

(Core :: JavaScript Engine, defect, P2)

x86
Windows XP
defect

Tracking


RESOLVED WONTFIX

People

(Reporter: rflazaro, Assigned: dmandelin)

References


Details

(Keywords: perf, regression)

Attachments

(2 files)

User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2a1pre) Gecko/20090208 Minefield/3.2a1pre
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2a1pre) Gecko/20090208 Minefield/3.2a1pre

There is a "significant" drop in crypto/string processing performance. I am adding crypto because it involves the MD5 and SHA1 hashing it might be a string only issue.

See http://dromaeo.com/?id=61195,61196 for the comparison

I only have the 2/4 and 2/8 nightlies; I don't have time to narrow it down right now.
Firefox 3.2a1pre ID:20090204032912
Firefox 3.2a1pre ID:20090208034355


Reproducible: Always

Steps to Reproduce:
1. Run http://dromaeo.com/?sunspider-crypto-aes|sunspider-crypto-md5|v8-crypto|sunspider-crypto-sha1 using Firefox 3.2a1pre ID:20090204032912
2. Run http://dromaeo.com/?sunspider-crypto-aes|sunspider-crypto-md5|v8-crypto|sunspider-crypto-sha1 using Firefox 3.2a1pre ID:20090208034355
3. Compare results
Actual Results:  
The Firefox 3.2a1pre ID:20090208034355 build is slower.

Expected Results:  
The Firefox 3.2a1pre ID:20090208034355 build should exhibit the same or better performance than the previous build.

The changeset range is between http://hg.mozilla.org/mozilla-central/rev/4ada519b5c97 and http://hg.mozilla.org/mozilla-central/rev/688c44602a55

I don't have the time to narrow it down right now, but I suspect it's a string-related change.
Component: General → JavaScript Engine
Product: Firefox → Core
Assignee: nobody → general
QA Contact: general → general
I think I've narrowed the range to:

2/4 Firefox 3.2a1pre ID:20090204032912   http://dromaeo.com/?id=61195	
2/5 Firefox 3.2a1pre ID:20090205033258   http://dromaeo.com/?id=61512
Comparison                               http://dromaeo.com/?id=61195,61512

Checkin range is http://hg.mozilla.org/mozilla-central/pushloghtml?startdate=2%2F4%2F09&enddate=2%2F5%2F09

There are a lot of checkins in that range, and I don't see any string- or loop-specific changes. But there is a "big" TM merge.
Keywords: perf, regression
Status: UNCONFIRMED → NEW
Ever confirmed: true
Flags: blocking1.9.1?
Summary: Dromaeo: Drop in crypto/string processing performance → [TM] Dromaeo: Drop in crypto/string processing performance
Related to bug 481922 ?

/be
I did not see bug 481922 improving any of the string tests I included.

I expanded my test set to include other regressions which involve strings and loops.

Builds tested
2/4 Firefox 3.2a1pre ID:20090204032912
2/5 Firefox 3.2a1pre ID:20090205033258
3/9 Firefox 3.2a1pre ID:20090309032542

Test: http://dromaeo.com/?sunspider-crypto-aes|sunspider-string-fasta|sunspider-access-fannkuch|sunspider-crypto-md5|sunspider-access-nsieve|sunspider-crypto-sha1
Comparison: http://dromaeo.com/?id=61947,61948,61950

My original report detected slowdowns of about 50% in MD5 Hashing and SHA1 Hashing.

Looking back I also see a slowdown for Prime Number Computation in my original regression range.

My current test set shows slowdowns of about 50% for AES Encryption/Decryption, DNA Sequence Alignment, and Fannkuch. These are recent regressions; I have no regression range for these.
Including TM build 20090309021534 in the comparison (test 61951) shows that, except for DNA Sequence Alignment, we are almost back to pre-regression performance.

http://dromaeo.com/?id=61947,61948,61950,61951

Waiting for merge to m-c... =)
Flags: blocking1.9.1? → blocking1.9.1+
I just ran through this myself, comparing the two changesets you reported with this morning's tip, which is 2bfbc7fe3867. I will call them R0, R1, and Rt. Notable differences:

           R0       R1      Rt

DNA       169      171     131
MD5      1278      518    1100
Prime     170      155     196
SHA1     1003      571     969

As of today, it seems that we may have:

- a new regression in DNA: 170 -> 130
- possible mild regressions in MD5, SHA1
- improvement in Prime
Detailed stats for DNA, R1 vs Rt (btw, "DNA" actually means string-fasta) are below. Ignore the non-JS time; it is irrelevant. I think Dromaeo runs everything for 20 seconds, so the key thing here is the ratio of interpreted time to native time rather than the absolute numbers.

# Rt (current, slower)
david-mandelins-macbook-pro:scripts dmandelin$ python binlog.py
Activity duration summary:

interpret     7982.501978
monitor         12.656783
record          15.870025
compile         12.821750
execute          7.658347
native       12166.841877
-------------------------
Subtotal     20198.350760
Non-JS       21191.393634
=========================
Total        41389.744394
Estimated speedup:  1.90x

Reasons for transitions to interpret:
start         5928.450675 ms     3326  1782.456607 us
abort          861.947463 ms      812  1061.511654 us
timeoutExit    557.293237 ms        8 69661.654602 us
backoff        519.918843 ms     1206   431.110152 us
cold            93.904430 ms      687   136.687672 us
extendOuter     15.672669 ms     1510    10.379251 us
extend           3.275303 ms      174    18.823582 us
mismatchExit     1.740493 ms     1465     1.188050 us
monitor          0.240087 ms      398     0.603234 us
extendInner      0.039302 ms       11     3.572893 us
deepBailExit     0.019476 ms        2     9.737955 us


# R1, faster
david-mandelins-macbook-pro:scripts dmandelin$ python binlog.py
Activity duration summary:

interpret     4974.553315
monitor         31.304646
record          25.471654
compile         19.401827
execute         38.807475
native       14909.712752
-------------------------
Subtotal     19999.251669
Non-JS       12888.297141
=========================
Total        32887.548810
Estimated speedup:  2.11x

Reasons for transitions to interpret:
timeoutExit   3070.217471 ms     7584   404.828253 us
extend         965.701927 ms     1440   670.626338 us
start          580.653700 ms     4158   139.647355 us
abort          125.226376 ms      756   165.643354 us
extendOuter     91.527501 ms     9663     9.471955 us
monitor         90.505205 ms     1251    72.346287 us
cold            37.329944 ms     4448     8.392523 us
mismatchExit    12.970392 ms     1910     6.790781 us
deepBailExit     0.420799 ms       17    24.752861 us
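
(Working out the ratio those dumps imply: Rt spends roughly 7983/12167 ≈ 0.66 ms interpreting for every ms of native execution, versus 4975/14910 ≈ 0.33 for R1, i.e. the slower build spends about twice as much relative time in the interpreter.)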
On string-fasta we're getting burned by blacklisting. The stats in my last comment subtly show this, but it was much easier to see with the visualization. At some points we spend over a second in the interpreter without a break.

The regression most likely came from the new, more aggressive blacklisting. I can make it go away by setting BL_ATTEMPTS to 200.

The attachment contains the aborts for my run. There are many--mostly jQuery stuff.
OK, I now have a more definitive diagnosis of Dromaeo string-fasta ("DNA Sequence Alignment"). This is the key function: 

function fastaRepeat(n, seq) {
  var seqi = 0, lenOut = 60;
  while (n>0) {
*** if (n<lenOut) lenOut = n;
    if (seqi + lenOut < seq.length) {
      ret = seq.substring(seqi, seqi+lenOut);
      seqi += lenOut;
    } else {
      var s = seq.substring(seqi);
      seqi = lenOut - s.length;
      ret = s + seq.substring(0, seqi);
    }
    n -= lenOut;
  }
}

In the standard SS version, this function is called once. We compile 3 traces for this loop at the beginning, and then run natively until the function is done. Note that for the line marked ***, we always take the same branch (bypassing the 'then' statement), except on the last iteration of the loop, where we execute the 'then' statement.

If we call the function more than once (either due to adding a loop to the SS version, or just calling it a lot like Dromaeo does), bad things start happening:

1. On the second call, we have to record the 3 traces again, because the typemap is now different. The first time, it was OOISIIB and OOISIIS:B, while the second time (in Dromaeo), it is OOISIIBII:S, where the colon separates the stack part from the global part. That slot is for the local variable 's'. Why it would be different, I don't know. This seems like a bug to me.

2. We will take *** enough times with the new trace to try to extend the tree from that exit. But that extension trace exits the loop without our noticing, and then exits the function, and we abort at the LeaveFrame (because that would take us back to a frame not incorporated into this trace). 

3. Eventually the aborts occur enough times (3) to cause blacklisting. The blacklisting is applied to the root of the trace tree, and it makes it so that js_MonitorLoopEdge is not called for that trace tree ever again. So now we have a tree with good compiled traces, but we can't call it.  

I think each one of these events represents a defect, although I'm sure 3 is the easiest to solve in the short term:

1': In reality the types on the second call are identical to those on the first call. We should figure out how to decide type compatibility more accurately.

2': I don't see why we have to abort the trace. We could consider simply stopping the trace at the frame boundary and then trying to start a new trace at that point.

3': Here we are blacklisting a tree, in the sense of never recording or tracing it again, simply because one of its branches couldn't be compiled. Also, my previous studies indicated that the LeaveFrame abort condition isn't very "harmful", in the sense that valid traces can often be formed for headers that get that condition. 

    Because entering and leaving traces repeatedly is costly, and often leads to sub-interpreter performance, there is some merit to the idea of blacklisting a tree if one or more of its common paths can't be compiled. But a LeaveFrame path, it would seem, cannot be a common path. Thus, the extension point should be marked so that we don't try to record from it again, but the tree itself should not be blacklisted.
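
To make the call pattern concrete, here is a hypothetical driver (not the actual SunSpider or Dromaeo harness code; the sequence and counts are made up) contrasting the two usages described above:

// Hypothetical driver, not the real harnesses; seq and the counts are
// purely illustrative.
var seq = new Array(1024).join("ACGT");   // ~4 KB source string

// SunSpider-style: a single call. Three traces are recorded for the loop
// at the start, and the rest of the call runs natively.
fastaRepeat(25000, seq);

// Dromaeo-style: the function is called over and over in a timed loop.
// The second call sees a different typemap for the local 's', re-records
// the traces, and the extension trace grown from the *** exit aborts at
// LeaveFrame until the whole tree gets blacklisted.
for (var i = 0; i < 100; i++)
  fastaRepeat(25000, seq);
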
Attached patch "patch" (Splinter Review)
This patch only blacklists the tree if we abort on a primary fragment (the first trace we record). This causes a significant (50ms) perf loss in the shell on SS. I am not sure where.
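
Roughly, the rule the patch implements, sketched here in JS as pseudocode for the C++ recorder-abort path (the names are illustrative, not the actual jstracer structures):

// Pseudocode-in-JS for the rule described above; field names are made up.
function onRecordingAbort(fragment, tree) {
  if (fragment === tree.primaryFragment) {
    // Aborting while recording the first trace of the tree: give up on the
    // whole loop and stop monitoring its loop edge.
    tree.blacklisted = true;
  } else {
    // Aborting on a branch/extension trace: stop retrying just that exit,
    // but keep the already-compiled traces in the tree usable.
    fragment.noRetry = true;
  }
}
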
I can't measure a slowdown when I use the sunspider harness:

TEST                   COMPARISON            FROM                 TO             DETAILS

=============================================================================

** TOTAL **:           -                 1053.2ms +/- 0.1%   1053.9ms +/- 0.1% 

=============================================================================

  3d:                  *1.004x as slow*   156.1ms +/- 0.2%    156.8ms +/- 0.4%     significant
    cube:              ??                  44.7ms +/- 0.6%     44.9ms +/- 0.7%     not conclusive: might be *1.004x as slow*
    morph:             ??                  30.8ms +/- 0.4%     30.9ms +/- 0.7%     not conclusive: might be *1.003x as slow*
    raytrace:          *1.005x as slow*    80.6ms +/- 0.2%     81.0ms +/- 0.3%     significant

  access:              -                  138.9ms +/- 0.2%    138.8ms +/- 0.2% 
    binary-trees:      -                   40.1ms +/- 0.3%     40.1ms +/- 0.3% 
    fannkuch:          -                   58.3ms +/- 0.2%     58.2ms +/- 0.2% 
    nbody:             -                   26.3ms +/- 0.5%     26.2ms +/- 0.5% 
    nsieve:            ??                  14.2ms +/- 0.9%     14.3ms +/- 0.9%     not conclusive: might be *1.007x as slow*

  bitops:              ??                  36.0ms +/- 0.8%     36.1ms +/- 0.7%     not conclusive: might be *1.003x as slow*
    3bit-bits-in-byte: -                    1.6ms +/- 8.8%      1.5ms +/- 9.4% 
    bits-in-byte:      1.010x as fast       8.0ms +/- 0.7%      8.0ms +/- 0.7%     significant
    bitwise-and:       ??                   2.5ms +/- 5.7%      2.6ms +/- 5.5%     not conclusive: might be *1.032x as slow*
    nsieve-bits:       *1.008x as slow*    23.9ms +/- 0.5%     24.1ms +/- 0.4%     significant

  controlflow:         -                   33.0ms +/- 0.5%     33.0ms +/- 0.3% 
    recursive:         -                   33.0ms +/- 0.5%     33.0ms +/- 0.3% 

  crypto:              -                   63.6ms +/- 0.4%     63.6ms +/- 0.3% 
    aes:               -                   36.5ms +/- 0.7%     36.3ms +/- 0.4% 
    md5:               *1.008x as slow*    20.1ms +/- 0.4%     20.2ms +/- 0.6%     significant
    sha1:              *1.006x as slow*     7.0ms +/- 0.0%      7.0ms +/- 0.8%     significant

  date:                1.004x as fast     179.1ms +/- 0.2%    178.3ms +/- 0.2%     significant
    format-tofte:      1.009x as fast      70.8ms +/- 0.3%     70.2ms +/- 0.2%     significant
    format-xparb:      -                  108.3ms +/- 0.3%    108.2ms +/- 0.2% 

  math:                -                   39.0ms +/- 0.8%     38.8ms +/- 0.5% 
    cordic:            -                   19.1ms +/- 0.6%     19.0ms +/- 0.6% 
    partial-sums:      -                   13.7ms +/- 1.6%     13.5ms +/- 1.1% 
    spectral-norm:     ??                   6.2ms +/- 1.9%      6.2ms +/- 2.0%     not conclusive: might be *1.006x as slow*

  regexp:              ??                  45.1ms +/- 0.4%     45.2ms +/- 0.3%     not conclusive: might be *1.002x as slow*
    dna:               ??                  45.1ms +/- 0.4%     45.2ms +/- 0.3%     not conclusive: might be *1.002x as slow*

  string:              ??                 362.5ms +/- 0.1%    363.2ms +/- 0.2%     not conclusive: might be *1.002x as slow*
    base64:            ??                  17.2ms +/- 0.6%     17.2ms +/- 0.7%     not conclusive: might be *1.003x as slow*
    fasta:             *1.005x as slow*    75.4ms +/- 0.2%     75.7ms +/- 0.2%     significant
    tagcloud:          ??                  99.5ms +/- 0.2%     99.8ms +/- 0.4%     not conclusive: might be *1.003x as slow*
    unpack-code:       -                  139.2ms +/- 0.2%    139.3ms +/- 0.3% 
    validate-input:    -                   31.2ms +/- 0.5%     31.2ms +/- 0.5%
> 2. We will take *** enough times with the new trace to try to extend the tree
> from that exit. But that extension trace exits the loop without our noticing,
> and then exits the function, and we abort at the LeaveFrame (because that would
> take us back to a frame not incorporated into this trace).

This is a BUG. Can we reduce this to a smaller testcase or do we know where and how we walk out of the loop without noticing?
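
Something like the following might be the shape of a reduced testcase, going by the diagnosis above (a branch taken only at the very end of the loop, whose extension trace runs off the end of the loop and then out of the function); whether it actually reproduces the LeaveFrame abort would need to be checked against the tracer:

// Hypothetical reduction (unverified): the 'then' branch is only taken on
// the last couple of iterations, so an extension trace recorded from that
// exit walks out of the loop and then out of the function.
function leaveLoop(n) {
  var last = 0;
  while (n > 0) {
    if (n < 3)
      last = n;     // taken only near the end of the loop
    n -= 1;
  }
  return last;
}
for (var i = 0; i < 50; i++)
  leaveLoop(1000);
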
Without the patch:

recorder: started(353), aborted(108), completed(520), different header(0), trees trashed(1), slot promoted(0), unstable loop variable(93), breaks(9), returns(3), unstableInnerCalls(60)
monitor: triggered(23292), exits(23292), type mismatch(0), global mismatch(12)

With the patch:

recorder: started(352), aborted(108), completed(519), different header(0), trees trashed(1), slot promoted(0), unstable loop variable(92), breaks(9), returns(3), unstableInnerCalls(60)
monitor: triggered(23292), exits(23292), type mismatch(0), global mismatch(12)

Bench shows a 5ms slowdown. The SunSpider test suite shows no slowdown. The difference in the stats is minimal and unstable. In other words, the bench result difference might just be noise.

Aborting the trace is still a bug, but this patch could go in as an additional fix. Let's fix the underlying bug first, though.
I am tempted to convert the abort into an assert. We don't hit it in trace-tests.
Depends on: 482800
https://bugzilla.mozilla.org/show_bug.cgi?id=482800 contains a suggested fix for the fastaRepeat issue.
Attachment #366853 - Flags: review?(graydon)
I'm hesitant to mark this r+. It is OK in the sense that the diagnosis is correct and the fix appears to change the diagnosed problem into the proposed new state of affairs.

However, I'm concerned that we're just re-opening ourselves up to a "too many aborted compilations on a non-compilable leaf" scenario. The current behavior is "intentional" in the sense that falling back to interpreting guarantees a lower bound of "no slower than the interpreter".

We have often seen cases where constant aborts make the JIT's presence *slow down* the browser relative to interpreter mode. Blacklisting tree roots aggressively saves us from these cases (mostly: it still doesn't save us from the auxiliary cost of entering and leaving traces too often).

I believe a more desirable fix would do the following:
  - backoff on a per-fragment basis, not moving to root but also not ignoring leaves
  - blacklist on a per-fragment basis when attempts *for that frag* hits limit
  - only patch over LOOP opcodes in cases where we're looking at a root (afaict the only 
    reason the current code insists on blacklisting at the roots)
(In reply to comment #17)
> I believe a more desirable fix would do the following:
>   - backoff on a per-fragment basis, not moving to root but also not ignoring leaves
>   - blacklist on a per-fragment basis when attempts *for that frag* hits limit
>   - only patch over LOOP opcodes in cases where we're looking at a root (afaict the only
>     reason the current code insists on blacklisting at the roots)

This seems dicey to me. If you blacklist just a non-root fragment or two, you can easily get in the situation of going on and off trace repeatedly through the loop, which as you point out hurts perf a lot. And blacklisting tree roots *does* save us from many of those (but possibly not all--I think inner and outer loops can still interact to create that problem). Not that we can't try it--sadly, it seems impossible to get away from empirical testing in blacklisting algorithm design, at least with what we know so far.

But maybe we should use a hybrid solution--backoff/blacklist at both the fragment and tree level. At the fragment level, we back off if we get to a situation where the fragment doesn't compile now, but we have reason to believe it will later. We blacklist if we think it will never compile. At the tree level, I'm not sure backoff makes sense, but we can blacklist if the level of "badness" (i.e., falling off trace too quickly) is too high. We could measure the number of iterations per trace execution for the first 20 or 100 or so tree entries after recording has stabilized, and if it's too low, blacklist that tree.

Eventually I think we will also want to think harder about inner/outer loop interaction wrt blacklisting.
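
To make the hybrid idea concrete, here is a sketch in JS, used purely as pseudocode for the engine-side bookkeeping; the counters, names, and thresholds are all made up for illustration and are not the real tracer implementation:

var MAX_FRAG_ATTEMPTS   = 3;    // per-fragment recording attempts
var PROBATION_ENTRIES   = 20;   // tree entries to sample after recording
var MIN_ITERS_PER_ENTRY = 8;    // below this, the tree is "bad"

function noteFragmentAbort(frag) {
  // Back off/blacklist only the failing fragment, not the whole tree.
  if (++frag.attempts >= MAX_FRAG_ATTEMPTS)
    frag.blacklisted = true;
}

function noteTreeEntry(tree, itersThisEntry) {
  // Tree-level blacklisting keyed on "badness": falling off trace too
  // quickly over the first N entries after recording has stabilized.
  tree.entries += 1;
  tree.iterations += itersThisEntry;
  if (tree.entries === PROBATION_ENTRIES &&
      tree.iterations / tree.entries < MIN_ITERS_PER_ENTRY)
    tree.blacklisted = true;    // e.g. patch over the LOOP opcode at the root
}
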
Assignee: general → dmandelin
I don't want to be holding this up; gal, do you have any feelings on dmandelin's or my comments here, either way? Or do you want to charge ahead with the patch as-is?
The underlying bug is fixed. We no longer leave the loop and abort there, so I don't think the patch is necessary for that particular benchmark, which makes it less critical to get it in. You still seem to have concerns.
sayrer: please unblock (maybe verify first that the slowdown is gone?)
Priority: -- → P2
We still have problems here. In fact, we've gone backwards from my last test on at least 4 benchmarks. Performance seems to be unstable on these. Extending my table from before, with Ru = tip of 4/2:

           R0       R1      Rt      Ru

fasta     169      171     131     123  ("DNA Sequence Alignment")
MD5      1278      518    1100     871
Prime     170      155     196     149
SHA1     1003      571     969     676

The visualization for fasta looks pretty much like the one I did before, so I think the same causes still apply. I will recheck.
OK, some real progress on fasta, finally. The problem is too many GCs. A GC during this test takes about 0.3 seconds and gets run about every 0.7 seconds, so this really kills our throughput. Runs that get 2 GCs are even slower. I don't know how these GCs are getting triggered yet--I'm going to dtrace it.
The GC is getting triggered in js_NewGCThing by the rt->gcMallocBytes >= rt->gcMaxMallocBytes condition. Thus, the program is allocating about 32 MB per 0.4 s of mutator run time. After that, we do GC for 0.3 seconds, which is both a big pause time and a poor mutator throughput quotient.

Running the GC that often doesn't seem that bad, as long as the GC can be made to run faster. I'm not sure why it's taking so long. I'm also not sure how much garbage is or is not being recovered. The problem could also be more related to the string implementation code.

Perf probably dropped off when I made JS_realloc'd bytes for new memory objects count toward gcMallocBytes. This is necessary, because some apps GC way too late and start thrashing without it.
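
For reference, the allocation pressure comes from the string work in the benchmark's inner loop; a stripped-down sketch of the pattern (sizes and counts are made up):

// Every iteration builds fresh strings via substring/concat, so string
// memory piles up and keeps tripping the
// rt->gcMallocBytes >= rt->gcMaxMallocBytes check in js_NewGCThing.
var seq = new Array(4096).join("ACGT");   // ~16 KB source string
var ret;
for (var i = 0; i < 500000; i++) {
  var s = seq.substring(i % seq.length);  // new string object
  ret = s + seq.substring(0, 60);         // concatenation allocates again
}
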
Blocks: 481103
Depends on: 487546
Still grinding through the revision history. In searching for the regressions, I first ended up finding a perf improvement. Changeset 81a29bf64b16, for bug 475474, produced a dramatic improvement in MD5 and SHA1, and smaller improvements in fasta (DNA Sequence Alignment) and access-nsieve (Primes). I don't know the exact details of that bug, but given that it makes the JIT handle global objects more flexibly/precisely, it's not surprising that it would help in-browser perf.

About half of this perf gain was lost by changeset 569acf636d50, so the next step is to track that down. After that, the perf was eventually regained, and then about 10-20% was lost again.
The next interesting event was a severe regression in MD5 and SHA1, accompanied by a smaller regression in Primes. This happened at changeset 569acf636d50 for bug 469625, which I believe introduced the arrayProtoHasElements flag. The perf regression was detected and filed as bug 481251 and fixed around March 3.
I think I have the complete story on MD5 (and nearly complete on SHA1):

long ago:       MD5: 33, SHA1: 240

81a29bf64b16:   MD5: 1000, SHA1: 780
                This was the tracking-multiple-globals patch.

569acf636d50:   MD5: 400, SHA1: 400
                This fixed a bug where the tracer generated incorrect code for
                arrays with holes when an array prototype element had been set.
                This first solution was somewhat crude, relying on a global
                flag that deoptimized all array accesses once the jit thought
                an array prototype element had been set (and the jit used a
                very conservative check, so things got deoptimized quickly).

8761d29774e4:   MD5: 900, SHA1: 700
                This replaced the mechanism from the last bug. The new code
                uses guards to verify that the shapes of the objects on the 
                prototype chain are the same as they were during trace
                recording. Thus, we don't get widespread deoptimization, but
                we do have moderately expensive checks for all sparse array
                accesses, so we didn't get back all the perf we had before
                fixing the bug.
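
For the record, a minimal illustration (not taken from the benchmark source) of the pattern those two changesets deal with:

// Why indexed prototype elements matter for arrays with holes.
var a = [1, , 3];          // array with a hole at index 1
a[1];                      // undefined: no own element at index 1

Array.prototype[1] = 42;   // once an indexed element exists on the prototype,
a[1];                      // 42: a hole read must consult the prototype chain,
                           // so traced array reads either get globally
                           // deoptimized (569acf636d50) or carry shape guards
                           // on the prototype chain (8761d29774e4).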
There are sparse arrays in our crypto? :-(
(In reply to comment #28)
> There are sparse arrays in our crypto? :-(

Dense arrays, with holes. <Homer S. voice> Love that SunSpider, don't hate -- love! :-P.

/be
Tracking multiple globals has nothing to do with crypto, either -- that one is due to Dromaeo's harness. But it was a good fix.

/be
Depends on: 491022
I think this is WONTFIX. The maximum perf was due to incorrect semantics. We can probably optimize this further using raw arrays or other tricks, but the performance loss has been diagnosed and happens by design.
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → WONTFIX
Comment on attachment 366853 [details] [diff] [review]
patch

Long dead, clearing out queue.
Attachment #366853 - Flags: review?(graydon) → review-