Closed Bug 1649109 Opened 4 years ago Closed 3 months ago

Indirect calls are half as fast in the browser as in the JS shell (Linux SECCOMP issue)

Categories

(Core :: JavaScript: WebAssembly, defect, P3)

x86_64
Linux

RESOLVED FIXED
Performance Impact none

People

(Reporter: lth, Unassigned)

References

(Depends on 1 open bug)

Details

STR:
Clone github.com/lars-t-hansen/embenchen.
In embenchen/wasm-micro, run make (requires wabt).
In same directory, run python3 -m http.server 8000.
In a browser, load localhost:8000/wasm-micro.html
Click the button and note the times that are shown. Mine are below; they are roughly the same in FF79 Nightly and FF76 (Fedora 30):

call external: 730
call internal: 714
call direct: 196
fib 37: 182.8

Modify Makefile to point to a SpiderMonkey release shell.
Run make bench, note the times, mine are roughly:

external: 402.09716796875
internal: 334.18408203125
direct: 188.264892578125
fib 37: 177.9

This is entirely consistent across runs: inter-module ("external") and intra-module ("internal") indirect calls are much slower in the browser than in the shell. I don't think this was regressed by what Dmitry's been doing lately; it looks older than that. (It might also not be a bug, or not a bug in the browser, but it warrants investigation.)

Fedora Core 30, octacore Xeon.

This could be an artifact of the Xeon. Could it be related to bug 1646663? Times from MacBook Pro, browser (Nightly):

call external: 283
call internal: 266
call direct: 138
fib 37: 145.2

and release shell:

external: 277.555908203125
internal: 282.273193359375
direct: 138.475830078125
fib 37: 145.8

Adding the alignment fix from bug 1646663 does not help here, though it could be a similar problem. I'm going to change the title of this bug to something less alarmist and downgrade it a bit, since Xeon is not exactly consumer hardware and this is probably an artifact of this being a microbenchmark. We should still try to look into it to find the hot code paths and see if those tell us something.

Assignee: lhansen → nobody
Status: ASSIGNED → NEW
Priority: P2 → P3
Summary: Indirect calls are half as fast in the browser as in the JS shell → Indirect calls are half as fast in the browser as in the JS shell (Xeon)

IIUC from bug 1646663, this is a dual-processor Xeon (two physical packages), right? I wonder if this is somehow a cross-chip-traffic performance cliff. I've had a single-chip Xeon for about three years and never saw anything like this, although I haven't looked. If you keep the work all on one chip, do you still see the effect?

Yes, two physical packages as you say. I was also wondering about that, even though it doesn't really fit the symptoms -- the problem appears in Nightly and Release and in a locally built browser too without any other tabs. Anyway, I tried numactl to control the affinity for the locally built browser, and at least that doesn't seem to do anything (numactl -C 1 ./mach run).

However, ./mach run --disable-e10s brings the numbers back down to where the shell is... This is a little frightening maybe? It's at least worth investigating further -- something seems to be going on.

Summary: Indirect calls are half as fast in the browser as in the JS shell (Xeon) → Indirect calls are half as fast in the browser as in the JS shell (Xeon e10s issue)

Hypothesizing broadly, if e10s is involved then we could be seeing artifacts of functionality such as:

  • signal handling
  • profiling (even polling/checking to see if profiling is enabled)
  • debugging (ditto)
  • realm switching

The next test would be to repeat the experiment on Windows to see if this is some kind of OS artifact.

Not a problem on Windows 10, so will assume this is specific to Linux. What we need now is some way of factoring out the Xeon...

Summary: Indirect calls are half as fast in the browser as in the JS shell (Xeon e10s issue) → Indirect calls are half as fast in the browser as in the JS shell (Linux Xeon e10s issue)

Same problem on my ancient AMD FX-4100 (ca 2011) with Ubuntu 18. Running with MOZ_FORCE_DISABLE_E10S=1 doubles the perf of the indirect calls. So this is a Linux e10s problem, not a Xeon problem.

Nika, does e10s have performance problems on Linux that it does not have on other platforms?

Flags: needinfo?(nika)
Summary: Indirect calls are half as fast in the browser as in the JS shell (Linux Xeon e10s issue) → Indirect calls are half as fast in the browser as in the JS shell (Linux e10s issue)

As another experiment, I inlined the wasm bytes for the test files in the test program and compiled them with 'new WebAssembly.Module' instead of with 'WebAssembly.compileStreaming', in case there was something going on there. This made no difference. So now I think I need some instruction traces with and without e10s.

I have a simplified test case that exhibits the same problem (will upload by and by). I have instruction traces of the benchmark loop and the indirect callee (not all iterations, just one). The traces are identical and instructions are on the same offsets within memory pages. The traces are also pretty clean; there are places where I would have chosen more efficient instructions, but nothing unexpected is happening. In particular we never call out to any runtime code in the loop, and the call just sets up the callee's context and makes an indirect call.

Normally I would now expect that this is the result of some process-wide spectre mitigation such as turning off an indirect branch target predictor (after all the core of the loop is an indirect call and the content process under e10s should be in some type of seccomp context), but the fact that this is a problem also for the older AMD chip makes me a little skeptical of that. I would have thought that it would not have that ability. But I need to find docs.

(In reply to Lars T Hansen [:lth] from comment #7)

> Nika, does e10s have performance problems on Linux that it does not have on other platforms?

I can't think of anything specific like that off the top of my head, but that doesn't mean it doesn't exist. It might be interesting to try running with MOZ_DISABLE_CONTENT_SANDBOX=1 to see if this is related to the way we handle the content process sandbox on Linux. You could also try turning off the javascript.options.spectre.* prefs to see if this is related to any spectre mitigations.

Flags: needinfo?(nika)

Thanks. MOZ_DISABLE_CONTENT_SANDBOX=1 does indeed remove the problem.

Ooh, I have a theory: STIBP sounds like it could have exactly this kind of impact on indirect branch prediction performance and, reading this article, it sounds like there is a kernel setting where STIBP is force-enabled for any process with seccomp filters! Could you test whether spectre_v2=seccomp (from the article) is set?

Actually, from this comment, it looks like maybe spectre_v2_user may contain seccomp by default, which means that we're paying a rather high price for using seccomp across the browser on most Linux systems. It's not clear, but it may be possible to disable STIBP via prctl (specifically PR_SET_SPECULATION_CTRL). Lars, maybe you could play with this to see if it fixes perf? If so, then we may want to consider doing this in release, either now or once we land Fission.
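For anyone who wants to experiment with this, here is a minimal sketch of how the per-thread speculation state could be inspected through prctl(PR_GET_SPECULATION_CTRL). The helper names DescribeSpecCtrl and QuerySpecCtrl are made up for illustration, and the fallback constants are assumed to match <linux/prctl.h> on systems with older headers. It only reads state; whether opting out with PR_SET_SPECULATION_CTRL is possible depends on how the kernel was configured.

```cpp
#include <cassert>
#include <string>
#ifdef __linux__
#  include <sys/prctl.h>
#endif

// Fallbacks for older headers (assumed to match <linux/prctl.h>).
#ifndef PR_GET_SPECULATION_CTRL
#  define PR_GET_SPECULATION_CTRL 52
#endif
#ifndef PR_SPEC_STORE_BYPASS
#  define PR_SPEC_STORE_BYPASS 0
#endif
#ifndef PR_SPEC_INDIRECT_BRANCH
#  define PR_SPEC_INDIRECT_BRANCH 1
#endif
#ifndef PR_SPEC_PRCTL
#  define PR_SPEC_NOT_AFFECTED 0
#  define PR_SPEC_PRCTL (1UL << 0)
#  define PR_SPEC_ENABLE (1UL << 1)
#  define PR_SPEC_DISABLE (1UL << 2)
#  define PR_SPEC_FORCE_DISABLE (1UL << 3)
#endif

// Hypothetical helper: turn a PR_GET_SPECULATION_CTRL result into text.
// In the kernel's terms, "speculation disabled" means the mitigation is on.
std::string DescribeSpecCtrl(long state) {
  if (state < 0) return "unavailable";
  if (state == PR_SPEC_NOT_AFFECTED) return "not affected";
  std::string text;
  if (state & PR_SPEC_FORCE_DISABLE) {
    text = "speculation force-disabled (mitigation on)";
  } else if (state & PR_SPEC_DISABLE) {
    text = "speculation disabled (mitigation on)";
  } else if (state & PR_SPEC_ENABLE) {
    text = "speculation enabled (mitigation off)";
  } else {
    text = "unknown";
  }
  if (state & PR_SPEC_PRCTL) text += ", per-task control available";
  return text;
}

// Hypothetical helper: query the calling thread. Returns -1 when the
// kernel (or platform) does not support the request.
long QuerySpecCtrl(unsigned long which) {
  assert(which == PR_SPEC_STORE_BYPASS || which == PR_SPEC_INDIRECT_BRANCH);
#ifdef __linux__
  return prctl(PR_GET_SPECULATION_CTRL, which, 0UL, 0UL, 0UL);
#else
  return -1;
#endif
}
```

Calling DescribeSpecCtrl(QuerySpecCtrl(PR_SPEC_INDIRECT_BRANCH)) from a thread inside the content process and again from the shell would show whether the two run with different indirect-branch settings; PR_SPEC_STORE_BYPASS can be queried the same way.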

Yeah, STIBP is what I had in mind in comment 10, and the Linux kernel docs point in the same direction as the lwn article. (I need to verify that this is also what causes the problem on my old AMD, which of course could have received some microcode update via a Linux or Windows install to mitigate Spectre, so it's plausible. And there's another experiment we could run where we replace only the indirect call instruction with a direct call instruction and observe whether that removes the slowdown.) Time's a little short this week and I'm out for the next three, but it's worth looking into this further I think.
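The direct-versus-indirect experiment can also be approximated outside wasm. The following is only a native sketch of the idea, not the embenchen benchmark: the names AddOne, AddTwo, RunDirect and RunIndirect are invented here, and the noinline attribute assumes GCC or Clang. It times the same loop once with a direct call and once through a function-pointer table whose index is read at run time.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

#if defined(__GNUC__)
#  define SKETCH_NOINLINE __attribute__((noinline))
#else
#  define SKETCH_NOINLINE
#endif

using std::uint64_t;
using Callee = uint64_t (*)(uint64_t);

// Tiny callees, kept out of line so that a real call is emitted.
SKETCH_NOINLINE uint64_t AddOne(uint64_t x) { return x + 1; }
SKETCH_NOINLINE uint64_t AddTwo(uint64_t x) { return x + 2; }

struct Timing {
  uint64_t sum;   // result of the loop, so the work cannot be discarded
  double millis;  // wall-clock time for the loop
};

// Runs body(acc) iters times and reports the accumulated value and time.
template <typename Body>
Timing Measure(uint64_t iters, Body body) {
  assert(iters > 0);
  const auto start = std::chrono::steady_clock::now();
  uint64_t acc = 0;
  for (uint64_t i = 0; i < iters; i++) acc = body(acc);
  const auto stop = std::chrono::steady_clock::now();
  return {acc, std::chrono::duration<double, std::milli>(stop - start).count()};
}

// Direct call: the target is fixed at compile time.
Timing RunDirect(uint64_t iters) {
  return Measure(iters, [](uint64_t acc) { return AddOne(acc); });
}

// Indirect call: the volatile index is re-read on every iteration, so the
// compiler has to emit a call through the table.
Timing RunIndirect(uint64_t iters) {
  static Callee table[2] = {AddOne, AddTwo};
  volatile unsigned index = 0;
  return Measure(iters, [&](uint64_t acc) { return table[index](acc); });
}
```

Comparing RunDirect(n).millis with RunIndirect(n).millis for a large n, inside and outside a seccomp-filtered process, would show whether only the indirect variant slows down.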

Assignee: nobody → lhansen
Severity: S3 → S2
Status: NEW → ASSIGNED
Summary: Indirect calls are half as fast in the browser as in the JS shell (Linux e10s issue) → Indirect calls are half as fast in the browser as in the JS shell (Linux sandbox issue)

Confirmed this on the old AMD system too: disabling the Linux sandbox restores performance to the expected level.

On my Xeon, /sys/devices/system/cpu/vulnerabilities/spectre_v2 shows "Mitigation: Full generic retpoline, IBPB: conditional, IBRS_FW, STIBP: conditional, RSB filling" in the shell. Per that article, that means seccomp will enable IBPB and STIBP by default. I have CONFIG_SECCOMP=y in the kernel config, so that all hangs together. (I also have CONFIG_HAVE_ARCH_SECCOMP_FILTER=y and CONFIG_SECCOMP_FILTER=y.)
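To compare machines without eyeballing that line, a small sketch like the following could pull individual fields out of it. MitigationField and ReadSpectreV2Line are hypothetical helper names, and the "KEY: value, KEY: value" layout is assumed from the output quoted above rather than from any documented format.

```cpp
#include <cassert>
#include <fstream>
#include <string>

// Hypothetical helper: return the value of a "KEY: value" field from a
// comma-separated mitigation line, or "" when the key is not present.
std::string MitigationField(const std::string& line, const std::string& key) {
  assert(!key.empty());
  const std::string needle = key + ": ";
  std::string::size_type pos = 0;
  while (pos < line.size()) {
    std::string::size_type end = line.find(',', pos);
    if (end == std::string::npos) end = line.size();
    // Skip the blank that follows each comma.
    const std::string::size_type start = line.find_first_not_of(' ', pos);
    if (start != std::string::npos && start < end &&
        line.compare(start, needle.size(), needle) == 0) {
      return line.substr(start + needle.size(), end - start - needle.size());
    }
    pos = end + 1;
  }
  return "";
}

// Hypothetical helper: read the first line of the sysfs file, or "" if it
// cannot be opened (for example on a non-Linux system).
std::string ReadSpectreV2Line() {
  std::ifstream in("/sys/devices/system/cpu/vulnerabilities/spectre_v2");
  std::string line;
  std::getline(in, line);
  return line;
}
```

MitigationField(ReadSpectreV2Line(), "STIBP") then gives the STIBP mode on the current machine, and an empty string where the kernel does not report one.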

Current kernel docs here: https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/spectre.html. These docs say about the seccomp setting that "... all seccomp threads will enable the mitigation unless they explicitly opt out." The implication is that a thread can opt out of both IBPB and STIBP. Exactly how to opt out is not well documented, but the seccomp system call has a flag, SECCOMP_FILTER_FLAG_SPEC_ALLOW, that does just that.

The following patch restores performance in the browser to be on par with what's in the shell:

diff --git a/security/sandbox/linux/Sandbox.cpp b/security/sandbox/linux/Sandbox.cpp
--- a/security/sandbox/linux/Sandbox.cpp
+++ b/security/sandbox/linux/Sandbox.cpp
@@ -212,6 +212,10 @@ static void InstallSigSysHandler(void) {
  * @see SandboxInfo
  * @see BroadcastSetThreadSandbox
  */
+#ifndef SECCOMP_FILTER_FLAG_SPEC_ALLOW
+#  define SECCOMP_FILTER_FLAG_SPEC_ALLOW 4
+#endif
+
 static bool MOZ_MUST_USE InstallSyscallFilter(const sock_fprog* aProg,
                                               bool aUseTSync) {
   if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)) {
@@ -224,7 +228,7 @@ static bool MOZ_MUST_USE InstallSyscallF
 
   if (aUseTSync) {
     if (syscall(__NR_seccomp, SECCOMP_SET_MODE_FILTER,
-                SECCOMP_FILTER_FLAG_TSYNC, aProg) != 0) {
+                SECCOMP_FILTER_FLAG_TSYNC | SECCOMP_FILTER_FLAG_SPEC_ALLOW, aProg) != 0) {
       SANDBOX_LOG_ERROR("thread-synchronized seccomp failed: %s",
                         strerror(errno));
       MOZ_CRASH("seccomp+tsync failed, but kernel supports tsync");

The question is what to do next. Presumably this gets kicked upstairs to the Sandbox people somehow, if we can argue that this is safe (enough). The definition of the constant actually may want to be upstreamed because the ..._FLAG_TSYNC is defined in code from Google (security/sandbox/chromium/sandbox/linux/system_headers/linux_seccomp.h).

See Also: → 1649696

Filed a sandbox bug. Luke, feel free to weigh in there re options here.

Other people are running into the problem: https://github.com/opencontainers/runc/issues/2430

Great investigation Lars.

The conclusion in bug 1649696 is that we have to wait for Fission, and even then it's a little bit open what we'll do. I'll unassign myself from this but we'll keep it open for now.

Assignee: lhansen → nobody
Status: ASSIGNED → NEW
Summary: Indirect calls are half as fast in the browser as in the JS shell (Linux sandbox issue) → Indirect calls are half as fast in the browser as in the JS shell (Linux SECCOMP issue)
See Also: → 1340235
Priority: P3 → P5
QA Whiteboard: qa-not-actionable
Performance Impact: --- → ?

:rhunt, could you check to see if the priority and severity settings still make sense for this bug? We've also set the performance impact to low based on the priority.

Performance Impact: ? → low
Flags: needinfo?(rhunt)

Sorry for the delay. This is an important issue; I'd say the performance impact is probably higher than 'low', since indirect calls are very important to wasm performance. However, this bug is P5 because there is nothing for us to do here: we're blocked on removing Spectre mitigations, and will then need to re-triage to see if it's possible to fix this.

Flags: needinfo?(rhunt)

I guess the bug should be retriaged now that Spectre mitigations have been relaxed.

Severity: S2 → --
Performance Impact: low → ?
Priority: P5 → --

I think this is actually performance impact none, unless someone has a real world site that is perceptibly affected by this.

Performance Impact: ? → none
Severity: -- → S2
Priority: -- → P2

According to bug 1649696 comment 16, this now affects only older Linux kernels (<5.16). Ubuntu 23 and Ubuntu Desktop 22 LTS no longer ship such kernels, so they are not affected. I believe that this lowers the severity.
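A rough way to tell which side of that cutoff a machine is on is to compare the running kernel release against 5.16. The sketch below uses invented helper names (ParseKernelRelease, ReleasePredates516, CurrentKernelRelease) and encodes only the version number quoted above; boot parameters such as spectre_v2_user can change the behaviour in either direction, so treat the answer as a hint.

```cpp
#include <cassert>
#include <cstdio>
#include <string>
#include <sys/utsname.h>

struct KernelVersion {
  int major_num;
  int minor_num;
};

// Hypothetical helper: parse the leading "MAJOR.MINOR" of a release string
// such as "5.15.0-91-generic". Returns false if the string does not match.
bool ParseKernelRelease(const std::string& release, KernelVersion* out) {
  assert(out != nullptr);
  int major_num = 0;
  int minor_num = 0;
  if (std::sscanf(release.c_str(), "%d.%d", &major_num, &minor_num) != 2) {
    return false;
  }
  out->major_num = major_num;
  out->minor_num = minor_num;
  return true;
}

// True when the release is older than 5.16, the cutoff mentioned in the
// comment above. Unparseable strings are reported as "not older".
bool ReleasePredates516(const std::string& release) {
  KernelVersion v{};
  if (!ParseKernelRelease(release, &v)) return false;
  return v.major_num < 5 || (v.major_num == 5 && v.minor_num < 16);
}

// Release string of the running kernel as reported by uname(2), or "".
std::string CurrentKernelRelease() {
  struct utsname info;
  if (uname(&info) != 0) return "";
  return info.release;
}
```

ReleasePredates516(CurrentKernelRelease()) returning true suggests the machine may still see the slowdown described in this bug.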

Severity: S2 → S3
Priority: P2 → P3

Fixed in bug 1649696.

Not that it matters anymore, but it seems that the cause of this was not STIBP, but rather SSBD (speculative store bypass disable). This is more than a little counterintuitive given what the benchmarks were testing, but I tried them separately in bug 1649696 comment #26.

Also, if I understand correctly, x86 CPUs since about 2018/2019 (?) have efficient system-wide STIBP or equivalent, so opting into it is actually a no-op there. (On Linux, cat /sys/devices/system/cpu/vulnerabilities/spectre_v2 and look for STIBP: always-on (AMD) or Enhanced / Automatic IBRS (Intel).)

Status: NEW → RESOLVED
Closed: 3 months ago
Resolution: --- → FIXED