1330539 - JITted code seems to be causing a lot of icache and ITLB misses

Reporter

Description

•

9 years ago

While doing microarchitecture analysis profiling for issues with display lists on GSuites, I noticed that the block of jit-ed code that VTune can't identify reports -extremely- poor CPI rates, spending an average of 10 cycles on retiring an instruction in one case I was looking at. (Firefox as a whole manages to spend about 1-1.5 cycles on retiring an instruction, so that's a 6-10x perf difference with our native code.) The biggest contributors seem to be instruction cache misses and ITLB misses, suggesting that we're not doing a good job of getting code locality for our jit-ed code. From what I'm seeing there is room for serious speedups here, and I'm wondering whether we're performing worse relative to Chrome on 'real life' websites than on benchmarks because it's much easier to get good code locality on a benchmark than on a big thing like Google suites.
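To make the CPI comparison above concrete, here is a minimal sketch of the arithmetic. The counter values are made up for illustration and are not measurements from this bug; they are only chosen to land inside the rough ranges quoted above.

```python
def cpi(cycles: int, instructions_retired: int) -> float:
    """Cycles per instruction retired; lower is better."""
    if instructions_retired <= 0:
        raise ValueError("need at least one retired instruction")
    return cycles / instructions_retired


# Hypothetical counter readings (assumptions, not profiler output).
jit_cpi = cpi(cycles=10_000_000, instructions_retired=1_000_000)
native_cpi = cpi(cycles=1_250_000, instructions_retired=1_000_000)

# How many times more cycles the JIT code spends per retired instruction.
slowdown = jit_cpi / native_cpi
```

With these assumed inputs the JIT code sits at a CPI of 10 against 1.25 for native code, i.e. inside the 6-10x gap described above.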

Kan-Ru Chen [:kanru] (UTC+9)

Comment 1

•

9 years ago

If I remember correctly, the JIT engine intentionally randomizes the allocated page for jit-ed code on Windows. That might explain the poor code locality. OK, I found this: http://searchfox.org/mozilla-central/rev/a712d69adb9b2588f88aff678216b2be94d3719c/js/src/jit/ExecutableAllocatorWin.cpp#51

Bas Schouten (:bas.schouten)

Reporter

Comment 2

•

9 years ago

So interestingly I ran a similar analysis on Octane; we score a CPI retired of about 0.5 there, which is actually really good. That supports the hypothesis as a possible explanation for the discrepancy between benchmarks and the real world.

Jan de Mooij [:jandem]

Comment 3

•

9 years ago

Someone should try to reproduce this and check what kind of JIT code it is. I wonder if we're entering IC stubs here, for instance. (We need to measure whether this helps at all, but one idea is to make the ExecutableAllocator per-zone instead of per-runtime, so code that belongs to a single tab is more likely to be allocated in the same pool.)
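A toy model of the per-zone idea, as a sketch only: it is not SpiderMonkey's ExecutableAllocator, and the zone names, stub size, and 4 KiB page size are all assumptions. It counts how many distinct pages end up holding one tab's code when several tabs interleave allocations in a single shared bump allocator versus one bump allocator per zone.

```python
from collections import defaultdict

PAGE_SIZE = 4096  # assumed page size in bytes


def pages_touched_per_zone(allocs, per_zone_pools):
    """Count the distinct pages holding each zone's code.

    allocs: list of (zone, size_in_bytes) in allocation order.
    per_zone_pools: if True, each zone bump-allocates from its own
    region; if False, all zones share a single bump allocator.
    """
    next_free = defaultdict(int)  # bump pointer per pool
    pages = defaultdict(set)      # zone -> set of (pool, page index)
    for zone, size in allocs:
        pool = zone if per_zone_pools else "shared"
        start = next_free[pool]
        end = start + size
        next_free[pool] = end
        # Record every page that this allocation overlaps.
        for page in range(start // PAGE_SIZE, (end - 1) // PAGE_SIZE + 1):
            pages[zone].add((pool, page))
    return {zone: len(p) for zone, p in pages.items()}


# Four hypothetical tabs take turns compiling 512-byte stubs, 64 stubs each.
allocs = [(f"tab{i % 4}", 512) for i in range(256)]

shared = pages_touched_per_zone(allocs, per_zone_pools=False)
separate = pages_touched_per_zone(allocs, per_zone_pools=True)
```

In this idealized interleaving each tab's code is spread over four times as many pages with the shared allocator, which is the kind of ITLB footprint difference the suggestion is aiming at. Real allocation patterns are far less regular, so this would still need to be measured.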

Bas Schouten (:bas.schouten)

Reporter

Comment 4

•

9 years ago

(In reply to Jan de Mooij [:jandem] from comment #3)
> Someone should try to reproduce this and check what kind of JIT code it is.
> I wonder if we're entering IC stubs here for instance.
>
> (We need to measure if this helps a bit, but one idea is to make the
> ExecutableAllocator per-zone instead of per-runtime, so code that belongs to
> a single tab is more likely be allocated in the same pool.)

It should be noted that we explored this a little further: when we -disable- ion, we consistently get better CPI retired rates, i.e. enabling ion seems to make us less icache friendly. Sean had a potential explanation for this but I won't pretend I entirely understood it :-).

Bas Schouten (:bas.schouten)

Reporter

Comment 5

•

9 years ago

I ran some more tests on a different workload (scrolling through slides). Here I'm seeing similar 'differences' between us and Chrome. Both have much better CPI retired rates, although ours is -better- than Chrome's. However, we are mostly being stalled on the front-end (ITLB, icache, etc.), whereas Chrome is mostly being stalled on the back-end (specifically cache misses). Their overall performance appears better.

Bas Schouten (:bas.schouten)

Reporter

Updated

•

9 years ago

Summary: On a glance JITted code seems to be cause a lot of icache and ITLB misses → On a glance JITted code seems to be causing a lot of icache and ITLB misses

(no longer active)

Comment 6

•

9 years ago

Bas and I tested this on OSX using Instruments to get a sense of whether Bas' results are reproducible there, and it seems that they are. My test case was bug 1326346. In Firefox, for JIT code I get CPIs that are typically between 2 and 4, but there are also a lot of call stacks with CPIs much higher than that. However, the JIT code in Chrome for this same test case has CPIs around 1.x, which is around what we'd get from native code. I also tested Octane in both browsers, and we're both getting CPIs around 1.x there. We also tested some other metrics such as icache misses per instruction, L2 misses per instruction, etc., and in everything we tested the results were consistent with "we're slower in GSuites but not in benchmarks". Another interesting data point was that I was testing both Chrome and Firefox 64-bit but Bas' Firefox tests were done in an x86 build, and we're both getting consistent results across the two architectures.

(no longer active)

Comment 7

•

9 years ago

Julian, Dan, Tim, do you guys happen to have relevant experience with code locality issues that may help us with this problem? Thanks!

Bas Schouten (:bas.schouten)

Reporter

Updated

•

9 years ago

Summary: On a glance JITted code seems to be causing a lot of icache and ITLB misses → JITted code seems to be causing a lot of icache and ITLB misses

Hannes Verschore [:h4writer]

Updated

•

9 years ago

Priority: -- → P3

Timothy B. Terriberry (:derf)

Comment 8

•

9 years ago

Is there any way to measure the size of the generated code (e.g., with and without ion)? icache misses may also indicate poor branch predictability, but I assume that would have shown up in VTune/Instruments.

Bas Schouten (:bas.schouten)

Reporter

Comment 9

•

9 years ago

(In reply to Timothy B. Terriberry (:derf) from comment #8)
> Is there any way to measure the size of the generated code (e.g., with and
> without ion)?
>
> icache misses may also indicate poor branch predictability, but I assume
> that would have shown up in VTune/Instruments.

Yeah, VTune does indicate some branch resteers, but not much.

Julian Seward [:jseward]

Comment 10

•

9 years ago

Somewhat tangentially: is there any machinery in SpiderMonkey that will pass, at JIT-time, information that connects JIT-created code ranges to source names? I vaguely seem to remember such a thing existing for passing names to GDB.

Bas Schouten (:bas.schouten)

Reporter

Comment 11

•

9 years ago

(In reply to Julian Seward [:jseward] from comment #10)
> Somewhat tangentially: is there any machinery in SpiderMonkey that will
> pass, at JIT-time, information that connects JIT-created code ranges
> to source names? I vaguely seem to remember such a thing existing for
> passing names to GDB.

I think there is; there used to be VTune integration (it doesn't work anymore, but it would have required such machinery).

Jim Blandy :jimb

Comment 12

•

9 years ago

It would be interesting to hack that JIT -> code mapping to indicate, instead of source code positions, the role of the code: main line, inline cache fast path, slow path, and so on. Then VTune would tell you exactly which category of Ion-generated code was slow.
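A rough sketch of what such a role-based mapping could look like on the analysis side, assuming the JIT can report address ranges tagged with a role. The class, function names, ranges, and sample addresses below are all hypothetical; this is not an existing SpiderMonkey or VTune interface.

```python
import bisect
from collections import Counter


class CodeRangeMap:
    """Maps non-overlapping [start, end) address ranges to a role label."""

    def __init__(self):
        self._starts = []  # sorted range start addresses
        self._ranges = []  # parallel list of (start, end, role)

    def add(self, start, end, role):
        idx = bisect.bisect_left(self._starts, start)
        self._starts.insert(idx, start)
        self._ranges.insert(idx, (start, end, role))

    def role_of(self, addr):
        # Find the last range whose start is <= addr, then bounds-check it.
        idx = bisect.bisect_right(self._starts, addr) - 1
        if idx >= 0:
            start, end, role = self._ranges[idx]
            if start <= addr < end:
                return role
        return "unknown"


def samples_by_role(code_map, sample_addrs):
    """Aggregate profiler sample addresses per role."""
    return Counter(code_map.role_of(a) for a in sample_addrs)


# Hypothetical ranges and sample addresses, for illustration only.
code_map = CodeRangeMap()
code_map.add(0x1000, 0x1800, "main line")
code_map.add(0x1800, 0x1900, "IC fast path")
code_map.add(0x2000, 0x2400, "slow path")

hits = samples_by_role(code_map, [0x1004, 0x1810, 0x1820, 0x2100, 0x3000])
```

Bucketing samples this way would show which category of generated code the stalls land in, without needing source positions at all.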

(Away)

Comment 13

•

9 years ago

(In reply to Kan-Ru Chen [:kanru] (UTC+8) from comment #1)
> If I remember correctly, the JIT engine intentionally randomizes the
> allocated page for jit-ed code on Windows. That might explain the poor code
> locality.
>
> OK, I found this:
> http://searchfox.org/mozilla-central/rev/a712d69adb9b2588f88aff678216b2be94d3719c/js/src/jit/ExecutableAllocatorWin.cpp#51

Has anyone tried a quick test with the randomization code disabled?

Sean Stangl [:sstangl]

Updated

•

9 years ago

Depends on: 1332466

Bas Schouten (:bas.schouten)

Reporter

Comment 14

•

9 years ago

(In reply to David Major [:dmajor] from comment #13)
> (In reply to Kan-Ru Chen [:kanru] (UTC+8) from comment #1)
> > If I remember correctly, the JIT engine intentionally randomizes the
> > allocated page for jit-ed code on Windows. That might explain the poor code
> > locality.
> >
> > OK, I found this:
> > http://searchfox.org/mozilla-central/rev/a712d69adb9b2588f88aff678216b2be94d3719c/js/src/jit/ExecutableAllocatorWin.cpp#51
>
> Has anyone tried a quick test with the randomization code disabled?

I'm willing to run some comparative tests if someone has instructions on how to disable it.

(Away)

Comment 15

•

9 years ago

(In reply to Bas Schouten (:bas.schouten) from comment #14)
> (In reply to David Major [:dmajor] from comment #13)
> > Has anyone tried a quick test with the randomization code disabled?
>
> I'm willing to run some comparative tests if someone has instructions on how
> to disable it?

Try stubbing out ExecutableAllocator::computeRandomAllocationAddress() to just "return nullptr".

Bas Schouten (:bas.schouten)

Reporter

Comment 16

•

9 years ago

(In reply to David Major [:dmajor] from comment #15)
> (In reply to Bas Schouten (:bas.schouten) from comment #14)
> > (In reply to David Major [:dmajor] from comment #13)
> > > Has anyone tried a quick test with the randomization code disabled?
> >
> > I'm willing to run some comparative tests if someone has instructions on how
> > to disable it?
>
> Try stubbing out ExecutableAllocator::computeRandomAllocationAddress() to
> just "return nullptr".

There's no noticeable difference.

Another note I should make: I went and tested this with the discrete GPU on, which basically means there's a lot more free space on the DRAM bus. In this situation, on the Google Slides deck, Firefox appears to outperform Chrome, and Edge in turn outperforms us. Our CPIs are consistently around 1.7 (regardless of the random allocation addresses), which is bad, but not as terrible as I saw them before, and Chrome's are very bad, around 5.7x; Chrome is seen in this situation completely filling up the system memory bandwidth.

My guess is that this is because we're saving memory bandwidth by using the discrete GPU (since there's a fair amount of drawing involved here), while Chrome draws most things in software and still uses the regular bus.

I probably need to find a somewhat better Google Docs workload that involves more JS and less drawing.

Bas Schouten (:bas.schouten)

Reporter

Comment 17

•

9 years ago

I've done more investigating using the same test case Ehsan used, hoping I'd get more JS usage and less painting. This seems to have somewhat worked: results on Windows are consistent with what Ehsan was seeing, with our CPIs at ~2 and Chrome's around 1.3 during the document loading. Again we seem largely front-end bound, whereas in Chrome's case the CPI seems to be limited by branch mispredictions. Our front-end limitations here are not as clearly icache and ITLB misses, though, and I'm afraid what exactly the causes are is beyond my current understanding of how the microarchitecture works. Being able to see which JS function calls are responsible is likely going to be at least somewhat helpful.

Steve Fink [:sfink] [:s:]

Comment 18

•

9 years ago

(In reply to Bas Schouten (:bas.schouten) from comment #16)
> Another note I should make, I went and tested this with the discrete GPU on,
> this basically means there's a lot more free space on the DRAM bus. In this
> situation, on the Google Slide deck, firefox appears to outperform Chrome,
> and Edge in turn outperforms us. In this situation, our CPIs are
> consistently around 1.7 (regardless of the random allocation addresses),
> which is bad, but not as terrible as I saw them before, and Chrome's are
> very bad, around 5.7x, Chrome is seen in this situation completely filling
> up the system memory bandwidth.
>
> My guess is that this is because we're saving the memory bandwidth by using
> the discrete GPU (since there's a fair amount of drawing involved here),
> while chrome draws most things in software and still uses the regular bus.

Whoa! Wait, so let me see if I'm understanding correctly. You are saying that the poor CPIs are due to memory bus interference from graphics code? As in, the number of icache misses is about the same with integrated vs discrete GPU, but the cost of misses is higher when graphics code is hogging the memory bus?

Or is it a space issue in a lower level of cache that is unified between instructions and data, and that is raising the cost of icache misses because instructions are getting evicted?

Do you know if this graphics memory traffic is causing issues in non-JIT code? Is there a way to compare the CPI of non-JIT non-graphics code between the two scenarios?

Bas Schouten (:bas.schouten)

Reporter

Comment 19

•

9 years ago

(In reply to Steve Fink [:sfink] [:s:] from comment #18)
> (In reply to Bas Schouten (:bas.schouten) from comment #16)
> > Another note I should make, I went and tested this with the discrete GPU on,
> > this basically means there's a lot more free space on the DRAM bus. In this
> > situation, on the Google Slide deck, firefox appears to outperform Chrome,
> > and Edge in turn outperforms us. In this situation, our CPIs are
> > consistently around 1.7 (regardless of the random allocation addresses),
> > which is bad, but not as terrible as I saw them before, and Chrome's are
> > very bad, around 5.7x, Chrome is seen in this situation completely filling
> > up the system memory bandwidth.
> >
> > My guess is that this is because we're saving the memory bandwidth by using
> > the discrete GPU (since there's a fair amount of drawing involved here),
> > while chrome draws most things in software and still uses the regular bus.
>
> Whoa! Wait, so let me see if I'm understanding correctly. You are saying
> that the poor CPIs are due to memory bus interference from graphics code? As
> in, the number of icache misses is about the same with integrated vs
> discrete GPU, but the cost of misses is higher when graphics code is hogging
> the memory bus?
>
> Or is it a space issue in a lower level of cache that is unified between
> instructions and data, and that is raising the cost of icache misses because
> instructions are getting evicted?
>
> Do you know if this graphics memory traffic is causing issues in non-JIT
> code? Is there a way to compare the CPI of non-JIT non-graphics code between
> the two scenarios?

Well, I described my observations there but I think I accidentally hinted at some incorrect conclusions: the previous workload where I saw our disastrous CPIs was -not- the Google slide deck.

1. With the discrete GPU, performance on Google Slides gets a bit better, and we benefit from it more than Chrome does, which makes sense. There's not a lot of JS in there, but CPIs for the JS I do see are better. This might be due to better memory bandwidth.
2. On Google Slides scrolling, icache misses don't seem to be a large part of the JS issue. This might be a different JS workload.
3. On the test case Ehsan pointed at and looked at, I see our CPIs being roughly 1.5-2.5x that of Chrome, consistently, on either the discrete or the integrated GPU.
4. Using the integrated GPU has a noticeable effect on system memory bandwidth consumption. JS code seems like it -might- be a little bit more affected by a throttled memory bus than our other code, but I don't have sufficient data to say for sure. It's probably worth trying to do some profiling while artificially throttling the memory bandwidth somewhat, if only to understand how Firefox responds to that.

BMO Automation

Updated

•

3 years ago

Severity: normal → S3

Mayank Bansal

Updated

•

2 years ago

Updated

•

1 year ago

Blocks: sm-js-perf