Closed Bug 933369 Opened 6 years ago Closed 6 years ago

x86 emulator much slower than in Chrome

Categories

(Core :: JavaScript Engine: JIT, defect)

x86
macOS
defect
Not set

Tracking

()

RESOLVED FIXED
mozilla28

People

(Reporter: bzbarsky, Assigned: jandem)

References

()

Details

(Whiteboard: [good first verify])

Attachments

(2 files)

Steps to reproduce:

1) Load http://copy.sh/v24/
2) Click the "Linux 2.6 (4.9 MB)" button.
3) Wait for the OS to boot and the kIPS counter to stabilize.

ACTUAL RESULTS: 15s to boot and stabilizes at about 7200kIPS for me.

EXPECTED RESULTS: Chrome boots this in 11s and stabilizes at about 22000kIPS.

ADDITIONAL INFORMATION: I profiled the steady state after boot is done.  What I see is that nearly half the CPU time is being spent under js::jit::DoBindNameFallback.  This is doing lookupGeneric stuff, etc.  The other half of the time seems to be spent in jitcode.
Flags: needinfo?(jdemooij)
Sigh, I created a shell version of this but it's way slower because it's hitting a ton of bailouts. For some reason the browser doesn't hit this but I will investigate this first.
(In reply to Jan de Mooij [:jandem] from comment #1)
> Sigh, I created a shell version of this but it's way slower because it's
> hitting a ton of bailouts. For some reason the browser doesn't hit this but
> I will investigate this first.

OK, the problem is that Load/Store TypedArrayElementStatic don't handle a double index well (it will always bail). After fixing that the shell gets much faster, about as fast as the browser I think. Now I also see the DoBindNameFallback problem bz noticed.
IonMonkey can compile try-catch now, but when we have to catch an exception we bail out to Baseline, and when this happens > 10 times we forbid further Ion compilation because bailouts are slow and we don't want to bail all the time.

Now, in this case the heuristic is bad because although we catch exceptions, this function also loops a lot and Baseline does not have a BINDNAME IC so it's better to run it in IonMonkey. Removing the ForbidCompilation call improves the "speed" in the shell after boot is done from 4500 to 24000 and we spend almost all time in JIT code. It may help less in the browser, but it should be a good win.

So we need a better heuristic to detect functions that (1) catch exceptions and (2) don't do much else.

We also need a Baseline IC for BINDNAME to be more robust and we should fix the shell-only problem in comment 2. Will file bugs for these problems.
Flags: needinfo?(jdemooij)
See comment 3 - the problem was that we catch > 10 exceptions and invalidate the IonScript in that case.

This patch instead resets the script's use count whenever we catch an exception in Baseline or Ion code, so that we catch (no pun intended) the bad cases but can still enter Ion code if the script has a hot loop.
Assignee: nobody → jdemooij
Status: NEW → ASSIGNED
Attachment #825923 - Flags: review?(bhackett1024)
Attachment #825923 - Flags: review?(bhackett1024) → review+
Depends on: 933798
The good news: I downloaded a Try build with the patch in this bug and bug 933798, on OS X I get the following numbers in kIPS:

FF before:  6900
Chrome:    21000
FF after:  29000

Boot is still a bit slower I think, but the difference is really small (about a second or so).

The bad news: the shell-only problem I mentioned earlier seemed shell-only because I was testing a 32-bit shell. Load/Store TypedArrayElementStatic are not used on 64-bit and hence the 64-bit browser on OS X doesn't have this problem.

So on 32-bit this will be way slower than the 64-bit builds we're testing on OS X :( I will post a patch to fix that problem as well.
Fix Load/Store TypedArrayElementStatic to not bailout if the input is a double, but convert the input to int32. I can also do this in IonBuilder to match the other instructions; let me know if you think that's better.

Also, two questions:

(1) Why are these ops x86 only? Looking at codegen I don't see anything that we couldn't implement the same way on x64/ARM.

(2) Why do these ops not use separate bounds checks? I bet we could eliminate many of these :(
Attachment #826055 - Flags: review?(bhackett1024)
Flags: needinfo?(bhackett1024)
Fwiw, you can run a 32-bit browser on OSX from the command line like so:

 arch -i386 /path/to/.app/Contents/MacOS/firefox -profile /tmp/make-up-a-name

for testing purposes if you're starting with a try build.  If your build is local, of course, you probably didn't build the 32-bit bits.
(In reply to Jan de Mooij [:jandem] from comment #6)
> (1) Why are these ops x86 only? Looking at codegen I don't see anything that
> we couldn't implement the same way on x64/ARM.

These ops are (or are supposed to be) doing the same thing that asm.js load/stores do on the same architecture.  x64 asm.js loads/stores use signals to handle OOB accesses, and I don't have a way to test ARM changes.

> (2) Why do these ops not use separate bounds checks? I bet we could
> eliminate many of these :(

Again, asm.js doesn't try to eliminate bounds checks because that could lead to invalidation.  I wanted to integrate these ops with our bounds check elimination analyses but the higher priority was to get parity between asm.js code and normal Ion code and that goal has gotten seriously sidetracked.  i.e. I'd like to use the backtracking allocator on super hot code in Ion, and to get *that* working without regressing ss we need to stage multiple Ion compilations at different levels of optimization, and to get *that* working I'd like to have zero-cost off thread IonBuilder, and *that* has opened up a whole can of worms.
Attachment #826055 - Flags: review?(bhackett1024) → review+
(In reply to Brian Hackett (:bhackett) from comment #8)
> These ops are (or are supposed to be) doing the same thing that asm.js
> load/stores do on the same architecture.  x64 asm.js loads/stores use
> signals to handle OOB accesses, and I don't have a way to test ARM changes.

Well, if these ops with the bounds checks are faster on x86 they should also be faster on x64 no? Even though asm.js/Odin does something totally different.

(In reply to Boris Zbarsky [:bz] from comment #7)
> Fwiw, you can run a 32-bit browser on OSX from the command line like so:

Hm I used to run 32-bit browser builds on OS X but they started crashing after I upgraded to 10.8. But I should try that again :)
(In reply to Jan de Mooij [:jandem] from comment #9)
> Well, if these ops with the bounds checks are faster on x86 they should also
> be faster on x64 no? Even though asm.js/Odin does something totally
> different.

Well, probably, though that is again getting sidetracked from the original point of these changes.
Flags: needinfo?(bhackett1024)
(In reply to Jan de Mooij [:jandem] from comment #6)
> Created attachment 826055 [details] [diff] [review]
> Part 2 - Fix Load/Store TypedArrayElementStatic type policy
...
> Also, two questions:
> 
> (1) Why are these ops x86 only? Looking at codegen I don't see anything that
> we couldn't implement the same way on x64/ARM.

See bug 864214 for some discussion.  The ARM code was written, but did not appear to improve performance.  Without optimization of the bounds checks this change often degrades performance.  For example, the bounds check is currently emitted even when the index is a constant.

The asm.js operations can not bail out so the bounds checking can not be hoisted as an exception, whereas the JIT compiler can.

The asm.js operations have been optimized a little by the use of range info to optimize away the bounds check in some cases, and this could also benefit the JIT paths but the JIT paths might be better optimized in other ways.
https://hg.mozilla.org/integration/mozilla-inbound/rev/a21e15f1e484
https://hg.mozilla.org/integration/mozilla-inbound/rev/a705d4de9193

(In reply to Douglas Crosher [:dougc] from comment #11)
> See bug 864214 for some discussion.  The ARM code was written, but did not
> appear to improve performance.  Without optimization of the bounds checks
> this change often degrades performance.  For example, the bounds check is
> currently emitted even when the index is a constant.

Yeah, I will file a bug on this soon and see how much bounds check hoisting for these instructions helps on x86..
https://hg.mozilla.org/mozilla-central/rev/a21e15f1e484
https://hg.mozilla.org/mozilla-central/rev/a705d4de9193
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla28
On the OpenBSD, Nightly took 165s for me, while Chrome took 125s. Don't know if this is something important. But Nightly was faster than Chrome on the KolibriOS (50s x 60s).
Whiteboard: [good first verify]
You need to log in before you can comment on or make changes to this bug.