Closed Bug 1434927 Opened 2 years ago Closed 2 years ago

2.03 - 13.17% cpstartup content-process-startup / sessionrestore / sessionrestore_many_windows / ts_paint / ts_paint_heavy (linux64) regression on push a415b43fc1d26ff89cd3d9fd4bed95611febbabe (Thu Feb 1 2018)

Categories

(Core :: Security: Process Sandboxing, defect, P1)

Unspecified
Linux
defect

Tracking

()

RESOLVED WORKSFORME
Tracking Status
firefox-esr52 --- unaffected
firefox58 --- unaffected
firefox59 --- unaffected
firefox60 --- fix-optional

People

(Reporter: igoldan, Assigned: jld)

References

(Blocks 1 open bug)

Details

(Keywords: perf, regression, talos-regression)

Attachments

(1 file)

Talos has detected a Firefox performance regression from push:

https://hg.mozilla.org/integration/autoland/pushloghtml?changeset=a415b43fc1d26ff89cd3d9fd4bed95611febbabe

As author of one of the patches included in that push, we need your help to address this regression.

Regressions:

 13%  sessionrestore_many_windows linux64 pgo e10s     843.04 -> 954.08
 10%  sessionrestore_many_windows linux64 opt e10s     914.00 -> 1,007.42
  5%  sessionrestore linux64 opt e10s                  278.83 -> 291.50
  3%  sessionrestore linux64 pgo e10s                  261.54 -> 269.83
  3%  ts_paint linux64 opt e10s                        449.25 -> 462.75
  3%  ts_paint linux64 pgo e10s                        421.75 -> 432.58
  2%  cpstartup content-process-startup linux64 pgo e10s196.46 -> 200.53
  2%  ts_paint_heavy linux64 opt e10s                  460.33 -> 469.67


You can find links to graphs and comparison views for each of the above tests at: https://treeherder.mozilla.org/perf.html#/alerts?id=11374

On the page above you can see an alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a treeherder page showing the Talos jobs in a pushlog format.

To learn more about the regressing test(s), please see: https://wiki.mozilla.org/Buildbot/Talos/Tests

For information on reproducing and debugging the regression, either on try or locally, see: https://wiki.mozilla.org/Buildbot/Talos/Running

*** Please let us know your plans within 3 business days, or the offending patch(es) will be backed out! ***

Our wiki page outlines the common responses and expectations: https://wiki.mozilla.org/Buildbot/Talos/RegressionBugsHandling
Component: Untriaged → Security: Process Sandboxing
Product: Firefox → Core
:jld We got some big performance regressions since bug 1430949 landed. Can you please take a look over them? Is this something we can fix or should we rather backout/accept?
Flags: needinfo?(jld)
I tried reproducing locally; the time difference has the wrong sign and isn't significant.  So now I'm trying Try.

gcp suggested on IRC that this could just be a side-effect of how I'm detecting remote X11; if so, that would make this much easier to deal with than if the overhead is really from creating new network namespaces.
Note that try uses really old Linux kernels, so if this codepath in the kernel has been optimized in the last decade or so that optimization won't be on try.
actually we just upgraded our linux test machines to new hardware and OS- As of January 25 we now run ubuntu 16.04.
Oh, that's great news! (And bad news for this bug perhaps :-)
So it definitely is the network namespace: https://treeherder.mozilla.org/perf.html#/compare?originalProject=mozilla-central&newProject=try&newRevision=a925596dcb8eb445b6fa882e1fddb56032c9f801&framework=1&showOnlyConfident=1&selectedTimeRange=86400

The diff for that experiment is https://hg.mozilla.org/try/rev/8962bfa9ea5c4de3e67efcd1ea77590dbd6089d7 and the only changed line is:

+      flags |= CLONE_NEWNET;

Specifically, the “before”/“fast” here is using user namespaces and chroot()ing, so the network namespace is the only difference.  The one other possible confounder here is that X11 in the content process will use a named socket instead of the abstract namespace, but it doesn't seem very plausible that that would affect performance; on the other hand, it would be relatively easy to test this with symbol interposition.

(As for kernels: unprivileged user namespaces are only 5 years old.  And regular tests have had a modernish kernel since the beginning of TaskCluster, even though the container userland wasn't updated until recently, but Talos may be different.)

I'd also been thinking that it might be a slow path that was optimized sometime between 4.4 and 4.14, and that's why I can't reproduce it locally, but browsing through the commits in that range that touch net/core or net/unix and mention namespaces in the commit message hasn't turned up anything useful.

(I'm assuming, here, that the Ubuntu 16.04 in question was initially installed as the .0 or .1 patchlevel and then (maybe) updated, in which case it'd have a 4.4 kernel rather than the newer kernel you'd get if you start with a newer patchlevel; the diagram at https://wiki.ubuntu.com/Kernel/LTSEnablementStack#Kernel.2FSupport.A16.04.x_Ubuntu_Kernel_Support makes this a little clearer.)


So, the regressions:
* cpstartup, 3-4 ms and significant
** single-window sessionrestore, maybe in the same 3-4 ms area but higher noise / lower significance
* ts_paint, 10-15 ms and significant-ish
* sessionrestore_many_windows, 80-100 ms and *very* significant

If this were the result of some fixed overhead in creating a new content process, I'd expect that all of {cpstartup, sessionrestore, ts_paint} would be about the same and sessionrestore_many_windows would be ~3x that, but that doesn't quite fit the observations.  On the other hand, if there were some ongoing cost with something like fd passing, I'd expect performance to suffer across the board, not just on these startup-focused tests.

If I could reproduce this locally I'd try to `perf record` the test and profile the kernel, but I don't know if perf even works on CI (I don't need CPU performance counters, just time profiling, but it might be restricted for security reasons), and then I'd have to figure out how to get the test harness to run it.  I might be better off trying to see if it'll reproduce locally inside a VM.
Assignee: nobody → jld
Flags: needinfo?(jld)
> The one other possible confounder here is that X11 in the content process will use a named socket instead of the abstract namespace, but it doesn't seem very plausible that that would affect performance; on the other hand, it would be relatively easy to test this with symbol interposition.

I ran that experiment, and that's not it.  It really is something about the network namespace itself.

I was going to try to `perf record` a test run, but the “one-click loaner” option doesn't work for Talos (it complains that the `features` key in the YAML file shouldn't exist and immediately exits).  I thought maybe I could try wedging it into the test script — this is a bad idea because it would record the entire test suite and probably a bunch of environment setup, but it'd be someplace to start at least — but that doesn't work because perf(1) isn't installed.

gcp tried to reproduce this — he has hardware with a similar Ubuntu version installed (16.04, kernel 4.4.0-112 vs. the outdated 4.4.0-66 on the Talos host, but in theory the differences there are just bug fixes) — and found a small difference, maybe 15-25 ms, that might be statistically significant.  This is better than what I got, but it's much smaller than the Talos result, and it may be difficult trying to identify the samples corresponding to that difference in a profile.

At this point there isn't going to be a “fix”, but I think I'd have a stronger case for accepting the regression if I had more solid evidence for how the CI test systems differ from the ones where we can't reproduce it.


It would be interesting to see whether this reproduces on a test host that's up-to-date with OS patches; and, if it does, whether upgrading the hardware enablement packages (as described at https://wiki.ubuntu.com/Kernel/LTSEnablementStack) changes anything.  I don't know how the OS configuration management for these hosts is handled so I don't know how much work that would be.
Priority: -- → P1
I did some more investigating locally.  There is an effect I can observe that's the right order of magnitude, but I'm not sure how meaningful it is.  I'm comparing these:

* perf stat sh -c 'for i in `seq 10`; do unshare -U clang --version >/dev/null; done'
* perf stat sh -c 'for i in `seq 10`; do unshare -Un clang --version >/dev/null; done'

The former takes about 48ms per clang invocation, both in task-clock time (CPU used by the command) and in elapsed time.  The latter takes ~3ms more in task clock time, but *50ms* more elapsed time per process.

However, this happens only when the `unshare`s are run serially; if I run them in parallel the effect of the network namespace is negligible.  The amount of computation done by the payload command also matters (probably due to cache contention, and maybe also synchronization contention in the kernel); with /bin/true, the overheads for task-clock and elapsed time are 0.6ms and 5ms respectively.

So I did some more system-wide profiling with perf.  Creating the network namespace is charged to the process's CPU usage, but it also needs to be destroyed; that happens asynchronously on a kernel worker thread.   But that still doesn't account for all of the time difference, even ignoring the async-ness — that part of the profile has ~3x as many samples as creating the namespaces, so ~12ms total vs. observed 50.

Looking at the kernel source, cleanup_net holds net_mutex for most of the work, and that's also taken when creating a new network namespace, so that starts to explain why rapid sequential create/destroy cycles would block on the cleanup work.  There's also rtnl_lock(), which seems to have to do with interface/route changes and is another point of contention for net namespace creation/destruction, but it's held for a shorter interval in cleanup_net than net_mutex is.

There's also a call to synchronize_rcu(), which waits until all current RCU readers have finished (in case they were reading parts of the destroyed network stack), with net_mutex held.  Now, `clang --version` (which is mostly FS access, stat and readlink) seems to be spending a certain amount of time in a function named __d_lookup_rcu, but (1) still not enough time to explain all this, (2) probably not all at once, and (3) necessarily *not* while `unshare` is unsharing.  But there could still be something else relevant about RCU.

If nothing else, I can use `strace -T` to confirm that the extra elapsed time is consumed by the unshare() syscall, and it doesn't occur during the first unshare, and it doesn't occur at all if `sleep 0.1` is inserted in the loop, all of which supports the hypothesis that it's blocking on the last iteration's cleanup_net.


But this gives me something actionable: if we can avoid having content processes exit immediately before starting other content processes, this should more or less go away.  The test harness running the benchmark repeatedly in a loop is one possibility; this could be tested by inserting a short sleep between runs.  Alternately, if browser startup for some reason winds up creating a short-lived content process that exits before the processes that render actual content have started, then that seems bad for performance in its own right.
And here's a 7.5% reduction in sessionrestore_many_windows opt (1,009.70 ± 1.28% → 934.30 ± 1.86%), from just inserting `time.sleep(0.5)` into the big loop in talos.ttest:

https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=1a403fd61c4d9dce9ece683e534752b49ff5ab32&newProject=try&newRevision=cf25c2061ee7e1b3499364c2b024ef8230f3770c&framework=1&showOnlyConfident=1

To be more scientific I should repeat this experiment without CLONE_NEWNET, to see if there's anything else going on that's affected by a little extra resting time.  This would also show the real regression vs. before bug 1430949.  Comparing the numbers I have now with comment #0, it looks like 10-15 ms, which matches what I'd expect for creating 3-4 network namespaces.  I think we can live with that.
There's nothing significant when I test the time.sleep without network namespaces.

The effect of network namespaces in the presence of the time.sleep patch is mostly insignificant:

https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=d4026f57bc21ab016d5a8a61e434f9ce3711c7cc&newProject=try&newRevision=cf25c2061ee7e1b3499364c2b024ef8230f3770c&framework=1

There's a result of “medium” significance for cpstartup, after 30 runs of both revisions, with size ~3ms (1.5%).  Again, I think we can accept that.
Attached patch talos-delay.diffSplinter Review
For reference, this is the Talos patch that makes this “regression” go away.  The sleep time could probably be lowered to something closer to 100ms if we actually wanted to do something like this.
Should I file a followup bug against Talos to allow adding a small delay between test runs, like I did on Try here (but less hackily), or should we just accept that some of these numbers might (on Linux, at least) include time spent by the OS cleaning up the previous run?
Flags: needinfo?(jmaher)
yes, please file a followup bug.
Flags: needinfo?(jmaher)
:jld Have you already filed the follow up bug?
Flags: needinfo?(jld)
See Also: → 1446215
Finally filed the followup bug.
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → WORKSFORME
Flags: needinfo?(jld)
You need to log in before you can comment on or make changes to this bug.