Closed Bug 1284912 Opened 8 years ago Closed 8 years ago

2.88% ts_paint (linux64) regression on push 23140396a80eb27ff586c41fdc1cad62c875c9b1 (Tue Jul 5 2016)

Categories

(Core :: Security: Process Sandboxing, defect)

50 Branch
defect
Not set
normal

Tracking


RESOLVED WONTFIX
Tracking Status
firefox50 --- unaffected
firefox51 --- unaffected
firefox52 --- disabled
firefox53 --- affected

People

(Reporter: rwood, Unassigned)

References

Details

(Keywords: perf, regression, talos-regression, Whiteboard: sblc2)

Talos has detected a Firefox performance regression from push 23140396a80eb27ff586c41fdc1cad62c875c9b1. Since you are the author of one of the patches included in that push, we need your help to address this regression.

Summary of tests that regressed:

  ts_paint linux64 opt e10s - 2.88% worse

You can find links to graphs and comparison views for each of the above tests at: https://treeherder.mozilla.org/perf.html#/alerts?id=1708

On the page above you can see an alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a treeherder page showing the Talos jobs in a pushlog format.

To learn more about the regressing test(s), please see: https://wiki.mozilla.org/Buildbot/Talos/Tests

For information on reproducing and debugging the regression, either on try or locally, see: https://wiki.mozilla.org/Buildbot/Talos/Running

*** Please let us know your plans within 3 business days, or the offending patch(es) will be backed out! ***

Our wiki page outlines the common responses and expectations: https://wiki.mozilla.org/Buildbot/Talos/RegressionBugsHandling
Enabling seccomp will cause every system call from the content process to get filtered and run through a decision module, so I guess it's not too surprising there's an impact on performance.

The question is if this is the expected amount. Some googling suggests seccomp-bpf overhead per system call to be on the order of 25% ish for simple filters (most of ours just do Allow()), so having a 3% total slowdown does sound like it may be more than expected?
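A quick way to sanity-check those per-syscall numbers on a given machine is a standalone micro-benchmark like the following minimal sketch (not Mozilla code): it installs a trivial allow-everything filter and times a raw getpid() loop before and after, since even a one-instruction filter is evaluated for every syscall the process makes.

// Minimal sketch (not Mozilla code): time a raw syscall loop before and after
// installing a trivial allow-everything seccomp-bpf filter, to estimate the
// per-syscall filtering overhead on a given machine/kernel.
// Build: g++ -O2 -o seccomp_overhead seccomp_overhead.cpp
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <chrono>
#include <cstdio>

static double TimeSyscalls(long iterations) {
  auto start = std::chrono::steady_clock::now();
  for (long i = 0; i < iterations; ++i) {
    syscall(SYS_getpid);  // call through syscall() so libc can't cache the result
  }
  std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;
  return elapsed.count();
}

int main() {
  const long kIters = 1000000;
  double before = TimeSyscalls(kIters);

  // The simplest possible filter: allow everything. Even this one-instruction
  // program runs for every syscall the process makes after installation.
  sock_filter filter[] = {
      BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
  };
  sock_fprog prog = {(unsigned short)(sizeof(filter) / sizeof(filter[0])),
                     filter};
  if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0 ||
      prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) != 0) {
    std::perror("prctl");
    return 1;
  }

  double after = TimeSyscalls(kIters);
  std::printf("no filter: %.3fs  with filter: %.3fs  overhead: %.1f%%\n",
              before, after, 100.0 * (after - before) / before);
  return 0;
}

The per-syscall overhead measured this way is an upper bound on what a real workload sees, since a page load spends most of its time doing work other than entering the kernel.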
(In reply to Gian-Carlo Pascutto [:gcp] from comment #2)
> The question is if this is the expected amount. Some googling suggests
> seccomp-bpf overhead per system call to be on the order of 25% ish for
> simple filters (most of ours just do Allow()), so having a 3% total slowdown
> does sound like it may be more than expected?

It's not total slowdown. It's total slowdown on one test out of quite a few, and 3% is barely over our noise level in general (though I trust the report that this specific regression shows enough statistical significance to flag it as such, or else typically it would not have been reported - but that's a guess).
(In reply to Avi Halachmi (:avih) from comment #3)
> It's not total slowdown. It's total slowdown on one test out of quite a few,
> and 3% is barely over our noise level in general (though I trust the report
> that this specific regression shows enough statistical significance to flag
> it as such, or else typically it would not have been reported - but that's a
> guess).

The change in distribution is very obvious if you look at the graph:

https://treeherder.mozilla.org/perf.html#/graphs?series=%5Bmozilla-inbound,619ab4c1115bc476b1b460a67288d28bdc272579,1,1%5D&selected=%5Bmozilla-inbound,619ab4c1115bc476b1b460a67288d28bdc272579,34103,31282178,1%5D
(In reply to Gian-Carlo Pascutto [:gcp] from comment #2)
> The question is if this is the expected amount. Some googling suggests
> seccomp-bpf overhead per system call to be on the order of 25% ish for
> simple filters (most of ours just do Allow())

Note that there's one filter program applied to all syscalls, and the one the Chromium code generates is not exactly simple: it checks that the call is for the correct architecture, then uses a binary decision tree to dispatch on the syscall number (see http://searchfox.org/mozilla-central/search?q=AssembleJumpTable ), and *then* does the compiled version of the bpf_dsl::Allow() or other action.

To see a disassembly of the generated BPF program, run with MOZ_SANDBOX_VERBOSE set in the environment.  It looks like the fastest paths to an “allow” result in the current desktop content policy are about 12 BPF instructions long.
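To make that structure concrete, here is a hand-written illustrative filter (not the actual Chromium-generated program, just the same shape): an architecture check, a dispatch on the syscall number, and then the action.

// Illustrative only -- hand-written, not the generated policy. Shape matches
// the description above: arch check, dispatch on syscall number, then action.
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <sys/syscall.h>

static sock_filter gExampleFilter[] = {
    // Verify the syscall was made for the architecture the policy targets.
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(seccomp_data, arch)),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, /*jt*/ 1, /*jf*/ 0),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
    // Load the syscall number and dispatch. The generated code uses a binary
    // decision tree here instead of a linear chain, so the depth grows with
    // log2(number of distinct cases) rather than linearly.
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(seccomp_data, nr)),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_read,  /*jt*/ 2, /*jf*/ 0),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_write, /*jt*/ 1, /*jf*/ 0),
    // Anything not matched above: trap to the SIGSYS handler.
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_TRAP),
    // The compiled equivalent of bpf_dsl::Allow().
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

With the arch check, a few levels of the dispatch tree, and the return, a fast allow path of roughly a dozen instructions is consistent with the disassembly described above, even for a policy with many syscall cases.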
Flags: needinfo?(julian.r.hector)
Whiteboard: sblc2
I would like to add that I also found out this is the root cause of a 7.66% tabpaint e10s regression on linux64:
https://treeherder.mozilla.org/perf.html#/graphs?series=%5Bmozilla-inbound,ceb25e8fd81f15ed82526f7e28bcfd1f5f6d49ea,1,1%5D&series=%5Bfx-team,ceb25e8fd81f15ed82526f7e28bcfd1f5f6d49ea,1,1%5D

this is really just a LOT noisier than before, and it seems fairly clear that the change occurred with the same push.
(In reply to Robert Wood [:rwood] from comment #0)
> *** Please let us know your plans within 3 business days, or the offending
> patch(es) will be backed out! ***

The linux sandbox is currently restricted to nightly builds and won't roll out for a while. This bug blocks our next milestone. We don't need to back bug 742434 out unless something shows up that's seriously debilitating for our nightly audience.
is there a bug to track when this will go from nightly -> aurora,etc.?
Flags: needinfo?(jmathies)
(In reply to Joel Maher ( :jmaher ) from comment #8)
> is there a bug to track when this will go from nightly -> aurora,etc.?

Not that I'm aware of. We have multiple milestones to go before we consider rolling out.
Flags: needinfo?(jmathies)
ok, I will ping in this bug once per release cycle to know when I should expect this to roll downstream.
I just did some testing but couldn't really tell a difference; maybe my system is not configured the same way as the Talos machines.

However, Bug 1259508 may be responsible for this 2.88% regression. It causes a fork/execve of a binary, which needs to happen before seccomp is enabled. This action is related to audio and is usually performed on demand, but due to the patch landed in Bug 1259508, it now happens every time a new content process is started.

This could also be the cause of the tabpaint test regression.
thanks Julian for taking a look at this.

Any performance numbers will probably be quite different locally vs. in our automation environment.  I would say roughly 25% of the regressions in automation can be easily reproduced locally.

A few tips for investigating the regressions:
* use try server, collect at least 6 data points
* the tip of trunk changes often (for example, the ts_paint test was modified yesterday!!); remember when working on this to collect additional data from the base revision you are testing against on try/locally.  This is usually done with the retrigger job action on treeherder (click on the job letter, hit the 'r' key)
* use the compare feature in perfherder: https://treeherder.mozilla.org/perf.html#/compare
* we have spsProfiling options on try server; this is a good way to use the built-in sampling profiler to collect data before/after a patch to get an idea of where time is being spent.

and of course, just ask for help in the bug or on irc :)
So this is a wontfix for 50 (now Aurora), right?
Flags: needinfo?(jmathies)
(In reply to Andrew Overholt [:overholt] from comment #13)
> So this is a wontfix for 50 (now Aurora), right?

seccomp is restricted to Nightly so this isn't riding the trains.
Flags: needinfo?(jmathies)
tracking on Firefox 51 for now- will adjust as needed if we don't ride the trains in v.51
Blocks: 1291351
No longer blocks: 1280481
:jimm, will we be riding the trains with this?  or will we need to track this on Firefox 52?
Flags: needinfo?(jmathies)
(In reply to Joel Maher ( :jmaher) from comment #16)
> :jimm, will we be riding the trains with this?  or will we need to track
> this on Firefox 52?

Linux sandbox isn't riding in 51.
Flags: needinfo?(jmathies)
Blocks: 1302124
No longer blocks: 1291351
Hi :jmaher,
Per comment #17, does that mean we won't fix in 51?
Flags: needinfo?(jmaher)
Because 51 is now Aurora it is no longer affected. The Linux sandbox is enabled based on Nightly flags.
Flags: needinfo?(jmaher)
If this is Nightly-only, will it ever go down the trains?  I would prefer to track this on the latest version and, once it rides the trains, leave it alone.
You can assume for now that it will ride in 52.
great, we are all set then
:gcp, can you confirm this is riding the 52 train?
Flags: needinfo?(gpascutto)
(In reply to Joel Maher ( :jmaher) from comment #23)
> :gcp, can you confirm this is riding the 52 train?

It's currently only enabled on Nightly still: https://dxr.mozilla.org/mozilla-aurora/rev/88cbb2726969ee5a4204ae9df3025c39c864de15/old-configure.in#4046

I think shipping the patches that caused this doesn't gain much without also adding the stuff that caused bug 1310119 and has at least one serious functional regression (no printing, bug 1309205).

So it depends on how quickly the latter can be fixed and how we feel about the perf vs. security impact.
Flags: needinfo?(gpascutto)
I took another look at this and tested my theory on try. It turns out that the early initialization of Cubeb doesn't cause the regression; at least, I don't see an improvement if that early init is removed.

Here is the comparison:
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=c22b272ce18ebf6191bbae12292e5edc9b0caf9e&newProject=try&newRevision=04de1ff6a949c6d42cb393607c1a1bd61dc2c17a&framework=1&showOnlyImportant=0

I don't think the seccomp filter itself causes the regression, rather the setup code found here:
http://searchfox.org/mozilla-central/rev/4012e61cc38758ffa1f7ce26039173f67b048853/security/sandbox/linux/Sandbox.cpp#461

But at the same time, I don't know the best way to test how much of a performance impact the policy compilation causes.
I am assuming that the compilation may be the cause of the regression because, as far as I know, ts_paint measures start-up time and that code is triggered during startup.
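One rough way to put a number on that would be to time the compile-and-install path in the content process and log it. The sketch below is hypothetical instrumentation: InstallSyscallFilter() is a stand-in for whatever the real policy compilation + filter installation code in Sandbox.cpp is, not an actual function name from the tree.

// Hypothetical instrumentation sketch -- InstallSyscallFilter() is a stand-in
// for the real policy compilation + installation path in Sandbox.cpp, not an
// actual function name from the tree.
#include <chrono>
#include <cstdio>

void InstallSyscallFilter();  // stand-in for the real setup code

void TimedInstallSyscallFilter() {
  auto start = std::chrono::steady_clock::now();
  InstallSyscallFilter();
  std::chrono::duration<double, std::milli> elapsed =
      std::chrono::steady_clock::now() - start;
  // If this is in the tens of milliseconds per content process launch, it
  // could plausibly show up in a startup-time test like ts_paint.
  std::fprintf(stderr, "seccomp policy compile+install: %.2f ms\n",
               elapsed.count());
}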
I did a bunch of pushes of the following style: try: -b o -p linux64,linux -u none -t all --rebuild-talos 10

This is inbound from last week (2938e1861662) versus the same revision with the filesystem broker patches backed out (bug 1289718) as e41eb4be5706. 
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=e41eb4be5706&newProject=try&newRevision=2938e1861662&framework=1&showOnlyImportant=0&showOnlyConfident=1

This is the above branch with the broker patched out (e41eb4be5706) versus the same branch with all of the content sandbox stuff backed out (40e55de76d0f):
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=40e55de76d0f&newProject=try&newRevision=e41eb4be5706&framework=1&showOnlyImportant=0&showOnlyConfident=1

The pairs are displayed in order of original landing (so reverse order of backing out).

The last pair mimics what this bug should have seen (no content sandbox vs basic-seccomp-but-no-filesystem-broker), but as you can see the results are a bit different: except for glterrain, the regression is only for a few tests on the order of 1%-2%. Notably, ts_paint looks like this now:
linux64: 1504.45 ± 0.75% → 1501.82 ± 0.76%, -0.18%, confidence 0.54 (low), 11/11 runs

So that one is gone.

tabpaint is still there, though:
linux64: 69.90 ± 3.59% → 72.30 ± 2.07%, +3.43%, confidence 2.72 (low), 11/11 runs

But the exact regression amount on that one is tricky because of the high noise. glterrain is probably real: we've already seen the sandbox is prone to break WebGL so it's reasonable to suspect a lot of driver syscalls are now having some overhead.

I think that regression (1-2% max) is something we're willing to ship.

The regressions in the 2nd pair are more serious. They're as high as 3% on top of the first one. Some of the subtests are very clear:

tp5o indiatimes.com opt e10s: 363.82 ± 0.95% → 393.27 ± 3.08%, +8.10%, confidence 7.76 (high), 11/11 runs

I think that running one of these locally with logging will reveal what is going on.
thanks for the great details on this!
(In reply to Gian-Carlo Pascutto [:gcp] from comment #27)
> I think that running one of these locally with logging will reveal what is
> going on.

Curiously, I see zero activity whatsoever in the file broker while these tests run. I'm still digging in.
This isn't riding the 52 train, correct?
Flags: needinfo?(gpascutto)
Content Sandboxing is disabled on anything not Nightly.

We might make a call to let seccomp filtering (<2% regression) ride the trains so we have some advance warning of any compatibility impact.

File access brokering (regression unclear, none locally, up to 9% on talos) needs a fix for printing before it can be considered. But I'm working on this and if I get a patch quickly we may consider uplifting it.
Flags: needinfo?(gpascutto)
I spent time on a talos linux64 loaner investigating this. Specifically, I've been running

./mach talos-test --suite tp5o

with a tp5o.manifest containing only indiatimes.com. That's based off of:

https://treeherder.mozilla.org/perf.html#/comparesubtest?originalProject=try&originalRevision=e41eb4be5706&newProject=try&newRevision=2938e1861662&originalSignature=80984697abf1f1ff2b058e2d9f0b351fd9d12ad9&newSignature=80984697abf1f1ff2b058e2d9f0b351fd9d12ad9&framework=1

showing that to be a consistent 8% regression with file brokering.

I see no significant performance difference between sandboxing on/off, file brokering on/off, or compiling the sandboxing code out entirely. My results are consistently of the form:

12:17:12     INFO -  PERFHERDER_DATA: {"framework": {"name": "talos"}, "suites": [{"lowerIsBetter": true, "subtests": [{"name": "http:", "lowerIsBetter": true, "alertThreshold": 2.0, "replicates": [638.0, 438.0, 414.0, 407.0, 402.0, 390.0, 381.0, 417.0, 393.0, 375.0, 420.0, 519.0, 398.0, 426.0, 387.0, 385.0, 416.0, 393.0, 400.0, 424.0, 396.0, 380.0, 472.0, 384.0, 402.0], "value": 397.0, "unit": "ms"}], "name": "tp5o", "alertThreshold": 2.0}, {"subtests": [{"name": "Main_RSS", "value": 219456748.30769232}, {"name": "Private Bytes", "value": 572639846.4}, {"name": "XRes", "value": 1419659.6}, {"name": "responsiveness", "value": 0.9654507564067291}], "name": "tp5o"}]}

i.e. about 400ms average. Referring to the previous treeherder comparison, that's the "slow" side of results.

I also went back to 3e9a2031152f, which is the commit before file brokering: same result. Going back to c676d55b6b00 (just before seccomp at all) failed to build and I didn't manage to work around the breakage easily.

The only useful thing that came out of this investigation is the realization that Talos is running on Ubuntu 12.04, using Linux 3.2 kernels. These are old. They're old enough that they're missing a bunch of seccomp improvements and optimizations, most notably:

http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=8f577cadf7181243d336be9aba40c1bcc02c4c54

the JIT for seccomp-bpf. So this means that any performance regression on Talos from seccomp would be rather exaggerated. That is, if I could see the regression. But I don't.
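For reference, here is a quick check of whether a given machine could JIT seccomp filters at all. This is a sketch built on the assumption that the JIT is gated by the usual net.core.bpf_jit_enable sysctl in addition to requiring a kernel with the commit linked above (3.16 or later); Talos's 3.2 kernels fall back to interpreting every filter.

// Sketch, based on the assumptions above: report the kernel version and the
// bpf_jit_enable sysctl. A 3.2 kernel interprets seccomp filters regardless
// of the sysctl value.
#include <sys/utsname.h>
#include <cstdio>

int main() {
  utsname u{};
  uname(&u);
  std::printf("kernel: %s\n", u.release);  // Talos machines report 3.2.x here

  std::FILE* f = std::fopen("/proc/sys/net/core/bpf_jit_enable", "r");
  if (!f) {
    std::printf("bpf_jit_enable sysctl not present (no BPF JIT in this kernel config)\n");
    return 0;
  }
  int enabled = 0;
  if (std::fscanf(f, "%d", &enabled) == 1) {
    std::printf("bpf_jit_enable = %d (0 means filters are interpreted)\n", enabled);
  }
  std::fclose(f);
  return 0;
}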
Pushes based on current m-i, 6 talos iterations:

Regular vs Seccomp-bpf:

https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=4392a3797247&newProject=try&newRevision=e2124c2277e7&framework=1&showOnlyConfident=1

Regular vs Seccomp-bpf+file broker:

https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=4392a3797247&newProject=try&newRevision=d36030c58be5&framework=1&showOnlyConfident=1

Note the variance between the previous tries, i.e. the 9% tp5o regression on indiatimes.com is suddenly only a 5% one, although nothing in the sandbox changed! And again, from the logging I see that the file broker isn't actually activated at all by this test... I don't understand why Talos has so much variance here.

Summarizing the investigation so far:

The maximum regression is <3%, and it occurs only on a limited number of tests, with maybe a bit more on one or two that use the GPU (which is expected). As explained in the previous comment, even this is likely a severe overestimation for the majority of our users (Ubuntu 14.04 and Debian 8 have >3.16 kernels with the seccomp JIT, unlike Talos). Given the limited regression, the likelihood of a substantially smaller regression for real users, and the security benefits, I believe we should ship with this.

I will unblock the seccomp sandbox for Firefox 52 and enable it on Aurora.
:gcp, thank you for your detailed work on this.  As to why you had trouble reproducing the issue on a loaner, I have no answers- usually you can.  In this case, knowing that we cannot reproduce it easily on a loaner, that the OS on our test machines has a very old kernel (i.e. very limited exposure), and that this might not be that severe given the noise or future changes to Firefox, your assessment of accepting this is right on.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WONTFIX
(In reply to Joel Maher ( :jmaher) from comment #34)
> :gcp, thank you for your detailed work on this.  As to why you had trouble
> reproducing the issue on a loaner, I have no answers- usually you can. 

Note that I was looking for an 8-9% regression with very high confidence from previous pushes, while my new Talos push landed on <5% with medium confidence (on the same tp5o indiatimes test with the same sandbox settings). It's possible that with enough runs of that test, and careful averaging, I'd have managed to confirm a smaller regression with even less confidence. That wouldn't have helped me pinpoint where it came from, though; that would've been much easier if the regression were reproducibly the 9% I was looking for.

But the conclusion is: it seems very unlikely the browser actually became 9% slower on that site; what is more likely is that it became 1-2% slower and the rest is confusion due to (surprisingly high?) noise between the test runs.