Performance on JSMIPS 2x slower than Chrome

RESOLVED WORKSFORME

Status

Type: defect
Priority: --
Severity: minor
Status: RESOLVED WORKSFORME
Opened: 8 years ago
Last modified: 6 years ago

People

(Reporter: gkrichar, Unassigned)

Tracking

Firefox Tracking Flags

(Not tracked)

Attachments

(3 attachments, 2 obsolete attachments)

Reporter

Description

8 years ago
User-Agent:       Mozilla/5.0 (X11; Linux i686 on x86_64; rv:2.0.1) Gecko/20100101 Firefox/4.0.1
Build Identifier: Mozilla/5.0 (X11; Linux i686 on x86_64; rv:2.0.1) Gecko/20100101 Firefox/4.0.1

After a presentation on the benchmarking system I'm making, during which I made an off-hand comment that one of the pages I measure runs 2x slower on Firefox than on Chrome, I was asked to file a bug report for the performance issue. Please note that Chrome is the outlier here, not Firefox, as other browsers (e.g. Opera) are closer to (and slower than) Firefox.

The attached file is a benchmark version of JSMIPS, a MIPS simulator, running UNIX dc, generated by JSBench. It includes a harness to run various configurations and produce statistically meaningful results. You can skip the harness entirely by running e.g. jsmipsdc/ure.html, or even jsmipsdc/ure.js in a non-browser JS environment, so long as console.log is available. Results from my system are included in the "results" directory.

Although I never intended for JSMIPS to be Chrome-savvy, Chrome displays impressive performance on it all the way back to Chrome 5, the earliest version I happen to have available.

It is possible, although I believe unlikely, that the problem lies in JSBench, and not JSMIPS. If that's the case, I'd like to know as well :)

If you have any questions about JSMIPS or JSBench, feel free to ask.

Reproducible: Always

Steps to Reproduce:
1. Run jsbench-jsmipsdc-wharness-2011-06-02/index.html (the benchmark harness) on Firefox 4.0.1 and Chrome 11
2. Observe consistently 2x disparate results

Actual Results:  
Chrome's result: 401.76ms ± 0.30%
Firefox's result: 1198.71ms ± 0.78%

Expected Results:  
You tell me ;)

JSMIPS is a MIPS simulator in JavaScript. It is a performance nightmare, with bizarre math intrinsics, terrifying use of eval() for JITting to JavaScript, virtual memory management through sparse arrays, a vt100 terminal implemented in JS, and a trace-mutilating while()-switch loop as a goto alternative in generated code. Performing well on it is impressive; performing as well as Chrome does is almost suspicious.

The referenced web page is UNIX dc with JSMIPS. As it checks the time and does timeouts to avoid stalling the browser, it is not a useful benchmark, but might be useful for understanding how JSMIPS works.
Reporter

Comment 1

8 years ago
Do I have to set some options to run it in Chrome 11? I get:

Unsafe JavaScript attempt to access frame with URL file:///C:/things/jsmips/jsbench-jsmipsdc-wharness-2011-06-02/index.html from frame with URL file:///C:/things/jsmips/jsbench-jsmipsdc-wharness-2011-06-02/jsmipsdc/urv.html. Domains, protocols and ports must match.
Uncaught TypeError: Cannot read property 's330589ac96669f3b6f94f90c2dc2bb4430fe774a_0' of undefined

Or do we need to get it on a web server to run in Chrome?
Reporter

Comment 3

8 years ago
I'm sorry, I forgot to mention that. Yes, due to how the benchmark frame communicates with the harness, it needs to be run from a web server in Chrome. If you don't care about the harness (which is somewhat unnecessary for JSMIPS anyway, since the benchmark runs long enough that its runtime isn't very variable), you can run jsmipsdc/ure.html directly; it has no interframe communication when not run through the harness.
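For anyone reproducing this locally, any static file server sidesteps the file:// cross-frame restriction; a minimal sketch, assuming Python 3 is available (purely illustrative — not part of the original setup):

```shell
# Serve the benchmark directory over HTTP so the harness and benchmark
# frames share an origin (required by Chrome's same-origin checks).
cd jsbench-jsmipsdc-wharness-2011-06-02
python3 -m http.server 8000
# then open http://localhost:8000/index.html in each browser
```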
I put it up at http://10.250.2.131/~sfink/mips/ 

For a single run, it's easier to use http://10.250.2.131/~sfink/mips/jsmipsdc/ure.html

I get pretty high variance from a single run -- 700ms to 1000ms. Fortunately, Chrome smokes us enough for it to not matter much! :)
Blocks: WebJSPerf
Status: UNCONFIRMED → NEW
Ever confirmed: true
Reporter

Comment 5

8 years ago
Curiously, I'm seeing greater variance on Mac OS X than on Debian. But in both cases the standard deviation is low enough overall that I don't think the occasional outlier is skewing the results.
I attempted to run jsmipsdc/ure.js with a JS shell, and it demanded window and setTimeout and other bits. Eventually, when I kept giving it what it asked for, it complained about 'jsmips is not defined'.
Status: NEW → UNCONFIRMED
Ever confirmed: false
Reporter

Comment 7

8 years ago
Yeah, looks like I broke out-of-browser execution ... it's actually easy to fix: just move the "// function repository" block after the immediately following block (the one that detects whether it's running in a browser). I'll attach the modified file too.

On the SpiderMonkey shell it won't display any output since console isn't available; you can replace the console.log bit with print(). I just didn't want to use print() in general, since that's a function with very different behavior in browsers :). Unsurprisingly, the shell result for me is effectively identical to the in-browser result.
Reporter

Comment 8

8 years ago
Reporter

Comment 9

8 years ago
Sorry, I uploaded the wrong ure.js last time ...
Attachment #537047 - Attachment is obsolete: true
Thanks for that shell testcase!  It's actually a good thing that the shell result is the same; often enough it's not.

A profile of shell running |-a -m -j -p| shows that 33% of the benchmark time is spent under an eval() call (11% under js::Parser and 20% under js_EmitTree, with a huge hotspot at 14% in UpdateJumpTargets).  Another 45% is under TraceRecorder::MonitorRecording.  2% in js::Interpret.  5% running on trace, and some mjit compilation.

If I do just |-m -a|, then we run about 2x as fast (still about 20% slower than Chrome).  The profile is now 70% parsing/compilation for eval calls.  UpdateJumpTargets is 30% of total time.

I'll do a type inference run, but I doubt it will be very different since the hotspot is before we run any code...
Status: UNCONFIRMED → NEW
Ever confirmed: true
Ah, just read comment 0 carefully.  So the trace suck is in fact expected...  I wonder whether our heuristics can pick up that while()-switch thing usefully.

So I just did some timings on the type inference branch (the ones without -n match m-c pretty closely and have about ±3 ms of noise):

Interp: 380 ms
JM: 375 ms
JM+TI: 647 ms
V8 shell: 302 ms

Note that interp number!
Hmm, I get slightly different numbers (not as big a difference between options/engines):

js: 450
js -m: 405
js -m -n: 452
d8: 375

Boris, are you running x86 or x64?  Does Shark work for you on the TI branch?  Would be good to know where that extra TI cost is coming from.  Would also be good to get the benchmark spending more time in the MIPS interpreter loop; it only runs through the switch about 15k times.  Not that the performance story will look great then, since we use a lookup switch for the loop and end up doing a linear scan of all the switch cases (eep!).
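For what it's worth, a generator-side workaround for the lookup-switch scan (a sketch of the idea, not something JSMIPS does) is to remap the sparse case addresses to dense indices so the dispatch becomes an array access:

```javascript
// Sketch: dispatch through a dense array of basic-block functions rather
// than a sparse switch on 0x40000000-style addresses (hypothetical program).
// Each block returns the index of the next block, or -1 to halt.
var blocks = [
  function (st) { st.r += 1; return 1; },      // was case 0x40000000
  function (st) { st.r *= 2; return 2; },      // was case 0x40000004
  function (st) { return st.r < 8 ? 0 : -1; }  // was case 0x40000008 (branch)
];
function run(st) {
  var pc = 0;
  while (pc >= 0) pc = blocks[pc](st);
  return st.r;
}
run({ r: 0 }); // → 14
```

This avoids the linear case scan at the cost of one indirect call per block, which may or may not be a win depending on the engine.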
Status: NEW → UNCONFIRMED
Ever confirmed: false
Reporter

Comment 13

8 years ago
Just to be perfectly clear, this is NOT a conventional interpreter while(true) switch(mem[pc]) { case instruction: } type loop. It actually compiles the MIPS code into JS code (this is part of the eval you're seeing :) ), and each case in the switch statement is a codepoint in the original program, e.g.:

while (true) {
    switch (pc) {
        case 0x40000000:
            // code for instruction at 0x40000000
            // note: no break

        case 0x40000004:
            // code for instruction at 0x40000004

        case 0x40000008: // a jmp instruction
            pc = /* jump target */;
            break;
        ...
    }
}

This is still very trace-hostile, but it's not quite the same as what you probably expect when I say "while()-switch loop". The actual case to jump to is not evaluated very often.
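A runnable toy with the same shape (invented instruction stream, not real JSMIPS output), for anyone who wants to poke at how the engines handle it:

```javascript
// Toy: fall-through cases stand in for straight-line MIPS code; the
// "branch" case resets pc and breaks back out to the switch dispatch.
function toy() {
  var pc = 0x40000000, r = 0, iterations = 0;
  while (true) {
    switch (pc) {
      case 0x40000000:
        r += 1; // addi-style instruction, falls through
      case 0x40000004:
        r += 2; // falls through
      case 0x40000008: // branch instruction
        if (++iterations < 3) { pc = 0x40000004; break; }
        return r;
    }
  }
}
toy(); // → 7
```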
Reporter

Comment 14

8 years ago
Oh, another major eval source is that it does the initial load of the recorded scripts with a global eval alias. I don't /think/ that those evals should account for anywhere near 33% of the time though.
> Boris, are you running x86 or x64?

x64 on Mac.  I just compared the jaegermonkey branch 32-bit numbers and they're like so:

Interp: 370ms
JM: 360ms
JM+TI: 570ms

This is rev 11714be33655 on the jaegermonkey branch.

> Does Shark work for you on the TI branch? 

Yes, absolutely.  Looks like JM+TI has a very different structure for the code flow (e.g. way more JaegerInterpoline nesting) but the bottom-up profile shows that we're spending the following percentages of total time in (not under) functions that look TI-related:

  10% analyzeTypesBytecode
   6% analyzeSSA
   3% analyzeBytecode
   3% analyzeLifetimes
   2% js::types::TypeSet::add

In general, about 20% of the total time is under js::mjit::Compiler::checkAnalysis, and another 10% or so under ScopeNameCompiler::updateTypes (coming off ic::Name, of course).  That 30% is about right for the JM vs JM+TI difference I see...
Status: UNCONFIRMED → NEW
Ever confirmed: true
OK, I can get similar numbers now.  Some comments:

- We're probably doing analysis/compilation too eagerly.  We analyze about 15 times as much code as in SS but only run JS for twice as long.  This is most of the difference I think.

- Within the analyses we're quadratic in a few places.  The SSA analysis should use a hash instead of a vector for keeping track of the branches it is in.

- We're also quadratic keeping track of the different initializer objects in scripts.  The outer script uses a different initializer for every Date object created (about 1000 of them), each of which has a different getTime function.  Each initializer and each getTime function gets a new type object, and this ends up slowing down execution and analysis vs. the original program.  It would be good if the construction of these tracked objects followed along with the original program's behavior better (each Date is a 'new' object for the same function, and all delegate getTime to their shared prototype).
Reporter

Comment 17

8 years ago
Hmmm, looking at this replay, I'm surprised that it did create separate getTime functions. It seems there must be a bug there, as any single function, even if it is used across multiple instances, should only have one replay function ... I'll see if I can quick-fix that and see how that changes things.
I filed bug 661826 on UpdateJumpTargets.  Sadly, it's old code that was last touched by Brendan in prehistory.  Luckily, the CVS checkin comment is extensive and informative; whoever wants to look at it should read that comment.
Depends on: 661826

Comment 19

8 years ago
cdleary has worked on UpdateJumpTargets before I believe.
Reporter

Comment 21

8 years ago
This new version fixes the function explosion problem; there is now only one getTime. I also used a longer execution, since somebody asked.

The changes:
 * getTime is only one function
 * Due to some unrelated improvements I was making, it uses fewer local variables, so it shouldn't bang on the stack so much. (Also, things were renamed; see jsmipsdcfib/urem.js or jsmipsdcfib/usrem.js now.)
 * All versions that don't use the DOM should work in the shell now.
Reporter

Comment 22

8 years ago
Oh, I also included the results for this version. The brief:

Firefox 4.0.1: 3287.7ms ± 2.92%
Chrome r77788: 721.79ms ± 1.02%
Opera 11.10: 3150.25ms ± 0.45%

This version has also made the variance go even more crazy on Firefox, with the standard deviation at 15% of the mean. This is a pretty effective trace-killer, with the TM time about 4x the interpreter's.
Hmm, it looks to me like the function explosion is still there.  In all the .js files there are about 10000 sequences that look like:

o14 = {};
o0.returns.push(o14);
o15 = function() { return arguments.callee.returns[arguments.callee.inst++]; };
o15.returns = [];
o15.inst = 0;
o14.getTime = o15;
o14 = null;
o15.returns.push(1307113301821);
o15 = null;

Not sure how hard this would be, but to match the way the original webpage constructed these objects better it would be great if this looked vaguely like:

// preamble
o0 = function() { return o0.returns[o0.inst++]; }; // Date
o0.returns = [];
o0.inst = 0;
o1 = function() { return o1.returns[o1.inst++]; }; // Date.prototype.getTime
o1.returns = [];
o1.inst = 0;
o0.prototype.getTime = o1;

// for each date created
o0.returns.push(new o0());

// for each call to getTime
o1.returns.push(1307113301821)

A smaller concern is that calling the native getTime on a Date object will be *much* faster than doing a scripted call which executes 'arguments.callee.returns[arguments.callee.inst++]'.  We make a new object every time a script accesses non-integer properties of its arguments, and then do a bunch of ridiculously slow stuff to get to those properties.  It would not be hard to make this code run like lightning, but then we'd be optimizing for the replay harness and not the original webpage.  This only executes 10000 times on the page (costing about 5ms) but might be a problem on pages that interact more with the DOM or other tracked objects.
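For reference, a closure-based variant of the replay stub (a sketch; JSBench's real generator may differ) avoids touching the arguments object entirely:

```javascript
// Sketch: the replay function closes over its own state instead of
// reaching through arguments.callee, sidestepping the arguments-object
// allocation described above.
function makeReplay() {
  var returns = [];
  var inst = 0;
  function replay() { return returns[inst++]; }
  replay.returns = returns; // harness pushes recorded values here
  return replay;
}
var getTime = makeReplay();
getTime.returns.push(1307113301821);
getTime(); // → 1307113301821
```

This trades the per-call cost for keeping `returns` alive via the closure — which is exactly the memory concern the reporter raises in comment 25.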
Attachment #537048 - Attachment mime type: application/octet-stream → text/plain
Attachment #537030 - Attachment mime type: application/octet-stream → application/x-bzip2
Attachment #537162 - Attachment mime type: application/octet-stream → application/x-bzip2
Reporter

Comment 24

8 years ago
Urkh, sorry about that, my bugfix was a bugnotfix. Turns out the problem is a bit more insidious than I'd realized. Still fixable though, please hold ...
Reporter

Comment 25

8 years ago
OK, I fixed the bug in my fix of the bug. Note that proper prototype reconstruction is a bit further down the line; it's simply not an easy problem for a system with JSBench's design to solve, so there is no equivalent of o0.prototype.getTime = ...;. However, this version does (actually) have only one getTime replay function. Proper prototype recreation is on the TODO list, but pretty distant for now; iframes first :)

Regarding "We make a new object every time a script accesses non-integer properties of its arguments, and then do a bunch of ridiculously slow stuff to get to those properties": I actually changed it to use arguments.callee just recently, to avoid keeping a permanent hold on a local variable. I can certainly switch it back; it just costs a bit more memory, since that variable has to be kept permanently. Now that there aren't so many functions, that's perhaps not such a big deal.

My results on this version:

Firefox 4.0.1: 3133.09ms ± 2.45%
Chrome r77788: 639.9ms ± 0.24%
Opera 11.10: 3021.6ms ± 0.12%
Attachment #537162 - Attachment is obsolete: true
(In reply to Gregor Richards from comment #25)
> My results on this version:
> 
> Firefox 4.0.1: 3133.09ms ± 2.45%
> Chrome r77788: 639.9ms ± 0.24%
> Opera 11.10: 3021.6ms ± 0.12%

I just tried this version with the latest Nightly (I had to use --disable-web-security for Chrome to run the benchmark locally).

Firefox Nightly 26 :  690.7 ms
Chrome 31          :  691.9 ms
Safari 6.0         :  760.6 ms
Opera 12.16        : 2023.8 ms

We came a really long way and are about the fastest now, so closing this.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → WORKSFORME