Closed Bug 1744943 Opened 2 years ago Closed 2 years ago

Individually created stubs de-randomize code, affecting icache hit rates

Categories

(Core :: JavaScript: WebAssembly, defect, P2)

x86_64
Unspecified
defect

Tracking


RESOLVED FIXED
97 Branch
Tracking Status
firefox-esr91 --- unaffected
firefox96 --- wontfix
firefox97 --- fixed

People

(Reporter: lth, Assigned: lth)

References

(Blocks 2 open bugs, Regression)

Details

(Keywords: regression)

Attachments

(1 file)

Consider the benchmark suite at https://github.com/lars-t-hansen/call-ubench and compare the two cases, fib_random_indirect_intermodule.js and fib_random_indirect_intermodule_initelems.js. Both run a doubly-recursive fib(40) in which every call is cross-module to a randomly selected target from a set of eight functions; in total there are two tables and 16 copies of fib. The only difference between the two tests is how the call tables are initialized: the former uses the Table.set operation from JS, storing individual elements (and thus creating stubs one at a time), while the latter uses elem segments, which bulk-create stubs.

On my Xeon system, the former takes 45% longer to run than the latter: 3260ms vs 2250ms, under controlled conditions (release build, --wasm-compiler=ion, and numactl). This result is very stable.

A similar discrepancy does not exist for the corresponding non-"random" tests (i.e., where the call target is a fixed table element and the table is quite small).

If this is a microarchitectural effect, it would almost have to be related to the icache. When stubs are created individually, we allocate an individual page for each one, so every stub ends up at the start of a code page; with enough stubs and pages, the icache could be very busy indeed.

This effect would not be unique to the indirect stubs, but would be felt to greater or lesser degree in all individually created stubs.

Perf says it's an icache problem: 640K icache misses in the fast case vs 304M in the slow case.

Valgrind agrees. Fast case (bulk-created stubs):

==404317== I1  misses:           790,350
==404317== LLi misses:            28,238
==404317== LL refs:            1,252,894  (    1,036,924 rd   +       215,970 wr)

vs the slow case (individually created stubs):

==404528== I1  misses:       392,008,538
==404528== LLi misses:            29,374
==404528== LL refs:          392,474,669  (  392,257,294 rd   +       217,375 wr)

The difference in I1 misses accounts almost entirely for the difference in LL refs: 392.0M − 0.8M ≈ 391.2M extra I1 misses, vs 392.5M − 1.3M ≈ 391.2M extra LL refs. LL misses remain near zero in both cases.

Adding some simple padding:

--- a/js/src/wasm/WasmCode.cpp
+++ b/js/src/wasm/WasmCode.cpp
@@ -884,6 +884,15 @@ bool LazyStubTier::createManyIndirectStu
   WasmMacroAssembler masm(alloc);
   AutoCreatedBy acb(masm, "LazyStubTier::createManyIndirectStubs");
 
+  if (targets.length() == 1) {
+    static uint8_t zeroes[64] = {};
+    static uint32_t counter = 0;
+
+    counter++;
+    for (uint32_t i = 0; i < (counter & 7); i++) {
+      masm.appendRawCode(zeroes, sizeof(zeroes));
+    }
+  }
+
   CodeRangeVector codeRanges;
   for (const auto& target : targets) {
     MOZ_ASSERT(!lookupIndirectStub(target.functionIdx, target.tls));

restores the miss rate and LL ref rate to what it should be:

==406122== I1  misses:           802,313
==406122== LLi misses:            29,383
==406122== LL refs:            1,270,550  (    1,052,031 rd   +       218,519 wr)

and running time drops back down to 2200ms as it should.

Summary: call_indirect much slower with individually created stubs than bulk-created stubs → Individually created stubs de-randomize code, affecting icache hit rates
Blocks: 1744984

In the slow case, we allocate two 64KB stub segments, one for each module, and each stub is placed at the start of its own 4KB page. The L1 icache is supposedly 32KB, 8-way set associative per core, with a 64B line size (cf. WikiChip; this is a Broadwell system), i.e., 512B per set and 64 sets. Suppose bits 11..6 of the address are used to select the set. If we place code at the beginning of a 4KB page, those bits are always zero, i.e., every stub selects the same set, namely set zero. This is consistent with the symptoms observed.

If I change the benchmark a little and stop using four of the eight functions, without changing code placement, the running time drops to 2400ms.

If I use all the functions but change the code placement so that code is allocated to different sets (see comment 2), the running time drops to 2200ms.

No doubt we could validate this even further, but it seems to me that the problem here is that non-random code placement of many distinct functions destroys the effectiveness of the icache.

When allocating individual stubs to a page, add a randomish amount of
padding before the stub code to avoid thrashing the icache, which will
happen when all the stubs are mapped to the same set. See bug for
a more detailed analysis.

Pushed by lhansen@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/85d024b6a4c8
Randomize stub placement to avoid icache thrashing. r=jseward
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → 97 Branch
Regressed by: 1742053

Set release status flags based on info from the regressing bug 1742053

