1649109 - Indirect calls are half as fast in the browser as in the JS shell (Linux SECCOMP issue)

Reporter

Description

•

4 years ago

STR:
Clone github.com/lars-t-hansen/embenchen.
In embenchen/wasm-micro, run make (requires wabt).
In same directory, run python3 -m http.server 8000.
In a browser, load localhost:8000/wasm-micro.html
Click the button and note the times that are shown. Here are mine; they are roughly the same in FF79 Nightly and FF76 (Fedora 30):

call external: 730
call internal: 714
call direct: 196
fib 37: 182.8

Modify Makefile to point to a SpiderMonkey release shell.
Run make bench and note the times; mine are roughly:

external: 402.09716796875
internal: 334.18408203125
direct: 188.264892578125
fib 37: 177.9

This is totally consistent: inter-module ("external") and intra-module ("internal") indirect calls are much slower in the browser than in the shell. I don't think this was regressed by what Dmitry's been doing lately; it looks older than that. (It might also not be a bug, or not a bug in the browser, but it warrants investigation.)
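
To put numbers on "much slower", here is a minimal sketch that divides the browser times by the shell times reported above. The dictionary names and the `slowdown` helper are made up for illustration; the values are copied from this comment, and lower times are assumed to be better.

```python
# Times copied from the report above (Xeon, Fedora 30); lower is assumed better.
browser = {"external": 730.0, "internal": 714.0, "direct": 196.0, "fib 37": 182.8}
shell = {
    "external": 402.09716796875,
    "internal": 334.18408203125,
    "direct": 188.264892578125,
    "fib 37": 177.9,
}


def slowdown(browser_times, shell_times):
    """Return the browser/shell time ratio per benchmark (1.0 means parity)."""
    return {name: browser_times[name] / shell_times[name] for name in shell_times}


ratios = slowdown(browser, shell)
for name, ratio in ratios.items():
    print(f"{name:>8}: {ratio:.2f}x")
```

With these inputs the indirect-call cases come out at roughly 1.8x and 2.1x, while direct calls and fib stay within a few percent of the shell.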

Fedora Core 30, octacore Xeon.

Lars T Hansen [:lth]

Reporter

Comment 1

•

4 years ago

This could be an artifact of the Xeon; could it be related to bug 1646663? Times from a MacBook Pro, browser (Nightly):

call external: 283
call internal: 266
call direct: 138
fib 37: 145.2

and release shell:

external: 277.555908203125
internal: 282.273193359375
direct: 138.475830078125
fib 37: 145.8

Lars T Hansen [:lth]

Reporter

Comment 2

•

4 years ago

Adding the alignment fix from bug 1646663 does not help here, but it could be a similar problem. I'm going to change the title of this bug to something less alarmist and downgrade it a bit, since Xeon is not exactly consumer hardware and this is probably an artifact of the test being a microbenchmark. We should still look into it to find the hot code paths and see if those tell us something.

Assignee: lhansen → nobody

Status: ASSIGNED → NEW

Priority: P2 → P3

Summary: Indirect calls are half as fast in the browser as in the JS shell → Indirect calls are half as fast in the browser as in the JS shell (Xeon)

Julian Seward [:jseward]

Comment 3

•

4 years ago

IIUC from bug 1646663, this is a dual-processor Xeon (two physical packages), right? I wonder if this is somehow a cross-chip-traffic performance cliff. I've had a single-chip Xeon for about three years and never saw anything like this, although I haven't looked. If you keep the work all on one chip, do you still see the effect?

Lars T Hansen [:lth]

Reporter

Comment 4

•

4 years ago

Yes, two physical packages as you say. I was also wondering about that, even though it doesn't really fit the symptoms -- the problem appears in Nightly and Release and in a locally built browser too without any other tabs. Anyway, I tried numactl to control the affinity for the locally built browser, and at least that doesn't seem to do anything (numactl -C 1 ./mach run).

However, ./mach run --disable-e10s brings the numbers back down to where the shell is... This is a little frightening maybe? It's at least worth investigating further -- something seems to be going on.

Lars T Hansen [:lth]

Reporter

Updated

•

4 years ago

Summary: Indirect calls are half as fast in the browser as in the JS shell (Xeon) → Indirect calls are half as fast in the browser as in the JS shell (Xeon e10s issue)

Lars T Hansen [:lth]

Reporter

Comment 5

•

4 years ago

•

Edited

Hypothesizing broadly, if e10s is involved then we could be seeing artifacts of functionality such as:

signal handling
profiling (even polling/checking to see if profiling is enabled)
debugging (ditto)
realm switching

The next test would be to repeat the experiment on Windows to see if this is some kind of OS artifact.

Lars T Hansen [:lth]

Reporter

Comment 6

•

4 years ago

Not a problem on Windows 10, so will assume this is specific to Linux. What we need now is some way of factoring out the Xeon...

Summary: Indirect calls are half as fast in the browser as in the JS shell (Xeon e10s issue) → Indirect calls are half as fast in the browser as in the JS shell (Linux Xeon e10s issue)

Lars T Hansen [:lth]

Reporter

Comment 7

•

4 years ago

•

Edited

Same problem on my ancient AMD FX-4100 (ca 2011) with Ubuntu 18. Running with MOZ_FORCE_DISABLE_E10S=1 doubles the perf of the indirect calls. So this is a Linux e10s problem, not a Xeon problem.

Nika, does e10s have performance problems on Linux that it does not have on other platforms?

Flags: needinfo?(nika)

Summary: Indirect calls are half as fast in the browser as in the JS shell (Linux Xeon e10s issue) → Indirect calls are half as fast in the browser as in the JS shell (Linux e10s issue)

Lars T Hansen [:lth]

Reporter

Comment 8

•

4 years ago

As another experiment, I inlined the wasm bytes for the test files in the test program and compiled them with 'new WebAssembly.Module' instead of with 'WebAssembly.compileStreaming', in case there was something going on there. This made no difference. So now I think I need some instruction traces with and without e10s.

Lars T Hansen [:lth]

Reporter

Comment 9

•

4 years ago

I have a simplified test case that exhibits the same problem (will upload by and by). I have instruction traces of the benchmark loop and the indirect callee (not all iterations, just one). The traces are identical and the instructions are at the same offsets within memory pages. The traces are also pretty clean; there are places where I would have chosen more efficient instructions, but nothing unexpected is happening. In particular we never call out to any runtime code in the loop, and the call just sets up the callee's context and makes an indirect call.

Lars T Hansen [:lth]

Reporter

Comment 10

•

4 years ago

Normally I would now expect that this is the result of some process-wide spectre mitigation such as turning off an indirect branch target predictor (after all the core of the loop is an indirect call and the content process under e10s should be in some type of seccomp context), but the fact that this is a problem also for the older AMD chip makes me a little skeptical of that. I would have thought that it would not have that ability. But I need to find docs.

Nika Layzell [:nika] (ni? for response)

Comment 11

•

4 years ago

(In reply to Lars T Hansen [:lth] from comment #7)

Nika, does e10s have performance problems on Linux that it does not have on other platforms?

I can't think of anything specific like that off the top of my head, but that doesn't mean it doesn't exist. It might be interesting to try running with MOZ_DISABLE_CONTENT_SANDBOX=1 to see if this is related to the way we handle the content process sandbox on Linux. You could also try turning off the javascript.options.spectre.* prefs to see if this is related to any spectre mitigations.

Flags: needinfo?(nika)

Lars T Hansen [:lth]

Reporter

Comment 12

•

4 years ago

Thanks. MOZ_DISABLE_CONTENT_SANDBOX=1 does indeed remove the problem.

Luke Wagner [:luke]

•

4 years ago

Assignee: nobody → lhansen

Severity: S3 → S2

Status: NEW → ASSIGNED

Summary: Indirect calls are half as fast in the browser as in the JS shell (Linux e10s issue) → Indirect calls are half as fast in the browser as in the JS shell (Linux sandbox issue)

Lars T Hansen [:lth]

Reporter

Comment 16

•

4 years ago

Confirmed this on the old AMD system too: disabling the Linux sandbox restores performance to the expected level.

Lars T Hansen [:lth]

Reporter

Comment 17

•

4 years ago

On my Xeon, we have /sys/devices/system/cpu/vulnerabilities/spectre_v2 showing Mitigation: Full generic retpoline, IBPB: conditional, IBRS_FW, STIBP: conditional, RSB filling in the shell. Per that article that means seccomp will enable IBPB and STIBP by default. I have CONFIG_SECCOMP=y in the kernel config so that all hangs together. (Also I have CONFIG_HAVE_ARCH_SECCOMP_FILTER=y, CONFIG_SECCOMP_FILTER=y.)
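
For comparing machines, the spectre_v2 line can be split into named components. This is a rough sketch only: `parse_spectre_v2` is a hypothetical helper, and it assumes the comma-separated layout quoted above (it also accepts semicolons, since the exact formatting may differ between kernel versions).

```python
import re

SPECTRE_V2_PATH = "/sys/devices/system/cpu/vulnerabilities/spectre_v2"


def parse_spectre_v2(text):
    """Split the sysfs spectre_v2 line into {name: setting}.

    Items without a "name: value" form (e.g. "IBRS_FW") map to None.
    """
    fields = {}
    for item in re.split(r"[,;]", text.strip()):
        item = item.strip()
        if not item:
            continue
        name, sep, value = item.partition(":")
        fields[name.strip()] = value.strip() if sep else None
    return fields


# The string reported for the Xeon in this comment.
sample = ("Mitigation: Full generic retpoline, IBPB: conditional, "
          "IBRS_FW, STIBP: conditional, RSB filling")
info = parse_spectre_v2(sample)
print(info["IBPB"], info["STIBP"])
```

On a Linux machine the same function could be fed the contents of the file at `SPECTRE_V2_PATH` instead of the sample string.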

Current kernel docs here: https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/spectre.html. These docs say about the seccomp setting that "... all seccomp threads will enable the mitigation unless they explicitly opt out." The implication is that both IBPB and STIBP are opt-out-able. Exactly how to opt-out is not well documented but the seccomp system call has a flag, SECCOMP_FILTER_FLAG_SPEC_ALLOW, that does just that.

The following patch restores performance in the browser to be on par with what's in the shell:

diff --git a/security/sandbox/linux/Sandbox.cpp b/security/sandbox/linux/Sandbox.cpp
--- a/security/sandbox/linux/Sandbox.cpp
+++ b/security/sandbox/linux/Sandbox.cpp
@@ -212,6 +212,10 @@ static void InstallSigSysHandler(void) {
  * @see SandboxInfo
  * @see BroadcastSetThreadSandbox
  */
+#ifndef SECCOMP_FILTER_FLAG_SPEC_ALLOW
+#  define SECCOMP_FILTER_FLAG_SPEC_ALLOW 4
+#endif
+
 static bool MOZ_MUST_USE InstallSyscallFilter(const sock_fprog* aProg,
                                               bool aUseTSync) {
   if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)) {
@@ -224,7 +228,7 @@ static bool MOZ_MUST_USE InstallSyscallF
 
   if (aUseTSync) {
     if (syscall(__NR_seccomp, SECCOMP_SET_MODE_FILTER,
-                SECCOMP_FILTER_FLAG_TSYNC, aProg) != 0) {
+                SECCOMP_FILTER_FLAG_TSYNC | SECCOMP_FILTER_FLAG_SPEC_ALLOW, aProg) != 0) {
       SANDBOX_LOG_ERROR("thread-synchronized seccomp failed: %s",
                         strerror(errno));
       MOZ_CRASH("seccomp+tsync failed, but kernel supports tsync");

The question is what to do next. Presumably this gets kicked upstairs to the Sandbox people somehow, if we can argue that this is safe (enough). The definition of the constant actually may want to be upstreamed because the ..._FLAG_TSYNC is defined in code from Google (security/sandbox/chromium/sandbox/linux/system_headers/linux_seccomp.h).
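
As a small illustration of what the patch changes, the sketch below shows the flag word passed to the seccomp call before and after. The numeric values are the ones in the Linux `<linux/seccomp.h>` header (the patch's fallback define of 4 agrees with them); `describe_seccomp_flags` is a hypothetical helper for printing, not part of any sandbox code.

```python
from typing import List

# Flag values as defined in the Linux UAPI header <linux/seccomp.h>.
SECCOMP_FILTER_FLAG_TSYNC = 1 << 0       # synchronize the filter across threads
SECCOMP_FILTER_FLAG_LOG = 1 << 1
SECCOMP_FILTER_FLAG_SPEC_ALLOW = 1 << 2  # same value as the patch's fallback define (4)

_FLAG_NAMES = {
    SECCOMP_FILTER_FLAG_TSYNC: "TSYNC",
    SECCOMP_FILTER_FLAG_LOG: "LOG",
    SECCOMP_FILTER_FLAG_SPEC_ALLOW: "SPEC_ALLOW",
}


def describe_seccomp_flags(flags: int) -> List[str]:
    """Name the known bits set in a seccomp filter flag word."""
    names = [name for bit, name in sorted(_FLAG_NAMES.items()) if flags & bit]
    unknown = flags & ~sum(_FLAG_NAMES)  # bits this sketch does not know about
    if unknown:
        names.append(hex(unknown))
    return names


before = SECCOMP_FILTER_FLAG_TSYNC
after = SECCOMP_FILTER_FLAG_TSYNC | SECCOMP_FILTER_FLAG_SPEC_ALLOW
print(describe_seccomp_flags(before), describe_seccomp_flags(after))
```

The only difference is the extra SPEC_ALLOW bit, which is what asks the kernel not to force the speculation mitigations on for the filtered threads.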

Lars T Hansen [:lth]

Reporter

Updated

•

4 years ago

Comment 18

•

4 years ago

Filed a sandbox bug. Luke, feel free to weigh in there re options here.

Lars T Hansen [:lth]

Reporter

Comment 19

•

4 years ago

Other people are running into the problem: https://github.com/opencontainers/runc/issues/2430

Luke Wagner [:luke]

Comment 20

•

4 years ago

Great investigation Lars.

Lars T Hansen [:lth]

Reporter

Comment 21

•

4 years ago

The conclusion in bug 1649696 is that we have to wait for fission, and even then it's a little bit open what we'll do. I'll unassign myself from this but we'll keep it open for now.

Assignee: lhansen → nobody

Status: ASSIGNED → NEW

Summary: Indirect calls are half as fast in the browser as in the JS shell (Linux sandbox issue) → Indirect calls are half as fast in the browser as in the JS shell (Linux SECCOMP issue)

Lars T Hansen [:lth]

•

11 months ago

:rhunt, could you check to see if the priority and severity settings still make sense for this bug? We've also set the performance impact to low based on the priority.

Performance Impact: ? → low

Flags: needinfo?(rhunt)

Ryan Hunt [:rhunt] (on leave until early May)

Comment 23

•

10 months ago

Sorry for the delay. This is an important issue; I'd say the performance impact is probably higher than 'low', since indirect calls are very important to wasm performance. However, this bug is P5 because there is nothing for us to do here: we're blocked on removing spectre mitigations, and will then need to re-triage to see whether it's possible to fix this.

Flags: needinfo?(rhunt)

Marco Castelluccio [:marco]

Comment 24

•

6 months ago

I guess the bug should be retriaged now that spectre mitigations have been relaxed.

Severity: S2 → --

Performance Impact: low → ?

Priority: P5 → --

Alex Thayer [:alexical] (she/her)

Comment 25

•

6 months ago

I think this is actually performance impact none, unless someone has a real world site that is perceptibly affected by this.

Performance Impact: ? → none

Steven DeTar [:sdetar]

Updated

•

5 months ago

Severity: -- → S2

Priority: -- → P2

Ryan Hunt [:rhunt] (on leave until early May)

Comment 26

•

5 months ago

According to bug 1649696 comment 16, this now only affects older Linux kernels (<5.16). Ubuntu 23 and Ubuntu Desktop 22 LTS no longer have such kernels and so are not affected. I believe that this lowers the severity.
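
A quick way to check a given machine against that threshold is sketched below. The helper names are hypothetical and the 5.16 cutoff is simply the one quoted in this comment; distribution kernels may carry backported changes, so the version number is only a hint.

```python
import re


def kernel_version(release):
    """Extract (major, minor) from a release string such as "5.15.0-91-generic"."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def may_be_affected(release, fixed_in=(5, 16)):
    """True if the kernel is older than the cutoff cited in this comment."""
    return kernel_version(release) < fixed_in


for release in ("5.15.0-91-generic", "5.16.0", "6.2.0-39-generic"):
    print(release, may_be_affected(release))
```

On a live system the release string would typically come from `platform.release()` in the standard library.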

Ryan Hunt [:rhunt] (on leave until early May)

Updated

•

5 months ago

Severity: S2 → S3

Ryan Hunt [:rhunt] (on leave until early May)

Updated

•

5 months ago

Priority: P2 → P3

Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧

Comment 27

•

3 months ago

Fixed in bug 1649696.

Not that it matters anymore, but it seems that the cause of this was not STIBP, but rather SSBD (speculative store bypass disable). This is more than a little counterintuitive given what the benchmarks were testing, but I tried them separately in bug 1649696 comment #26.

Also, if I understand correctly, x86 CPUs since about 2018/2019 (?) have efficient system-wide STIBP or equivalent, so opting into it is actually a no-op there. (On Linux, cat /sys/devices/system/cpu/vulnerabilities/spectre_v2 and look for STIBP: always-on (AMD) or Enhanced / Automatic IBRS (Intel).)
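
For anyone wanting to check whether SSBD is applied to a particular process, Linux exposes `Seccomp` and `Speculation_Store_Bypass` fields in `/proc/<pid>/status`. The following is a minimal sketch that picks them out; `speculation_status` is a hypothetical helper, and the example text is illustrative rather than captured from a real content process.

```python
def speculation_status(status_text):
    """Pick the Seccomp and Speculation_Store_Bypass fields out of /proc/<pid>/status text."""
    wanted = ("Seccomp", "Speculation_Store_Bypass")
    found = {}
    for line in status_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key in wanted:
            found[key] = value.strip()
    return found


# Illustrative input only, shaped like the relevant lines of a status file.
example = (
    "Name:\tfirefox\n"
    "Seccomp:\t2\n"
    "Speculation_Store_Bypass:\tthread force mitigated\n"
)
print(speculation_status(example))
```

A `Seccomp` value of 2 means filter mode; reading the real file for a content process pid would show what the kernel actually applied there.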

Status: NEW → RESOLVED

Closed: 3 months ago

Resolution: --- → FIXED