Closed Bug 1744943 Opened 2 years ago Closed 2 years ago

Individually created stubs de-randomize code, affecting icache hit rates

Categories

(Core :: JavaScript: WebAssembly, defect, P2)

x86_64
Unspecified
defect

Tracking


RESOLVED FIXED
97 Branch
Tracking Status
firefox-esr91 --- unaffected
firefox96 --- wontfix
firefox97 --- fixed

People

(Reporter: lth, Assigned: lth)

References

(Blocks 2 open bugs, Regression)

Details

(Keywords: regression)

Attachments

(1 file)

Consider the benchmark suite at https://github.com/lars-t-hansen/call-ubench and compare the two cases, fib_random_indirect_intermodule.js and fib_random_indirect_intermodule_initelems.js. Both run a doubly-recursive fib(40) in which every call is cross-module to a randomly selected target from a set of eight functions; in total there are two tables and 16 copies of fib. The only difference between the two tests is how the call tables are initialized: the former uses the Table.set operation from JS, storing individual elements (and thus creating stubs one at a time), while the latter uses elem segments, which bulk-create stubs.

On my Xeon system, the former takes 45% longer to run than the latter: 3260ms vs 2250ms, under controlled conditions (release build, --wasm-compiler=ion, and numactl). This result is very stable.

A similar discrepancy does not exist for the corresponding non-"random" tests (i.e., where the call target is a fixed table element and the table is quite small).

If this is a microarchitectural effect, it would almost have to be related to the icache. When stubs are created individually, we allocate an individual page for each one, so every stub ends up at the start of a code page; with enough stubs and pages, the icache could be very busy indeed.

This effect would not be unique to the indirect stubs, but would be felt to greater or lesser degree in all individually created stubs.

Perf says it's an icache problem: 640K icache misses in the fast case vs 304M in the slow case.

Valgrind agrees. Fast case (bulk-created stubs):

==404317== I1  misses:           790,350
==404317== LLi misses:            28,238
==404317== LL refs:            1,252,894  (    1,036,924 rd   +       215,970 wr)

vs the slow case (individually created stubs):

==404528== I1  misses:       392,008,538
==404528== LLi misses:            29,374
==404528== LL refs:          392,474,669  (  392,257,294 rd   +       217,375 wr)

The difference in I1 misses accounts almost entirely for the difference in LL refs: 392.0M − 0.8M ≈ 391.2M extra I1 misses, vs 392.5M − 1.3M ≈ 391.2M extra LL refs. LL misses remain near zero in both cases.

Adding some simple padding:

--- a/js/src/wasm/WasmCode.cpp
+++ b/js/src/wasm/WasmCode.cpp
@@ -884,6 +884,15 @@ bool LazyStubTier::createManyIndirectStu
   WasmMacroAssembler masm(alloc);
   AutoCreatedBy acb(masm, "LazyStubTier::createManyIndirectStubs");
 
+  if (targets.length() == 1) {
+    static uint8_t zeroes[64] = {};
+    static uint32_t counter = 0;
+
+    counter++;
+    for (uint32_t i = 0; i < (counter & 7); i++) {
+      masm.appendRawCode(zeroes, sizeof(zeroes));
+    }
+  }
+
   CodeRangeVector codeRanges;
   for (const auto& target : targets) {
     MOZ_ASSERT(!lookupIndirectStub(target.functionIdx, target.tls));

restores the miss rate and LL ref rate to what it should be:

==406122== I1  misses:           802,313
==406122== LLi misses:            29,383
==406122== LL refs:            1,270,550  (    1,052,031 rd   +       218,519 wr)

and running time drops back down to 2200ms as it should.

Summary: call_indirect much slower with individually created stubs than bulk-created stubs → Individually created stubs de-randomize code, affecting icache hit rates
Blocks: 1744984

In the slow case, we allocate two 64KB stub segments, one for each module, and each stub is placed at the start of its own 4KB page. The L1 icache is supposedly 32KB, 8-way set associative per core, with a 64B line size (cf. WikiChip; this is a Broadwell system), i.e., 512B per set and 64 sets. Suppose bits 11..6 of the address are used to select the set. If we place code at the beginning of a 4KB page, those bits are always zero, i.e., every stub selects the same set, namely set zero. This is consistent with the symptoms observed.

If I change the benchmark a little and stop using four of the eight functions, without changing code placement, the running time drops to 2400ms.

If I use all the functions but change the code placement so that code is allocated to different sets (see comment 2), the running time drops to 2200ms.

No doubt we could validate this even further, but it seems to me that the problem here is that non-random code placement of many distinct functions destroys the effectiveness of the icache.

When allocating individual stubs to a page, add a randomish amount of
padding before the stub code to avoid thrashing the icache, which will
happen when all the stubs are mapped to the same set. See bug for
a more detailed analysis.

Pushed by lhansen@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/85d024b6a4c8
Randomize stub placement to avoid icache thrashing. r=jseward
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → 97 Branch
Regressed by: 1742053

Set release status flags based on info from the regressing bug 1742053

