Closed Bug 1152930 Opened 9 years ago Closed 9 years ago

10-20% performance regression in invoking JS callbacks

Categories

(Core :: JavaScript Engine, defect)

x86
macOS
defect
Not set
normal

Tracking

RESOLVED WORKSFORME
Tracking Status
e10s ? ---
firefox40 --- affected

People

(Reporter: bzbarsky, Unassigned)

References

(Blocks 2 open bugs)

Details

(Keywords: perf, regression)

See testcase at http://mozilla.pettay.fi/moztests/events/event_speed_3.html

This got 10% slower from the patches in bug 674779.  The actual call-into-JS bits got 20% slower.

This might not be a huge worry as long as it's nightly-only, I guess...
Blocks: 1152875
Blocks: 840201
Oh, that doesn't sound good.
Blocks: 1136927
Well, we would like to let this ride the trains, eventually, so it is a worry for me.
Unfortunately, for the time being, I have no clear idea how we could optimize that feature any further.
Yeah, to be clear, the getrusage calls are taking up all the time.  :(

Do we store that data on a persistent basis over time?  If not, can we only collect it when about:compartments is loaded or something?
We do store data on a persistent basis over time, to perform slow add-on detection (and hopefully one of these days slow webpage detection). Now, one way to decrease the impact would be to use some form of sampling for jank detection (CPOW detection uses an almost unrelated codepath and should have no observable impact on performance in the absence of CPOW).

I see the following options:
1. Turn sampling on/off every few milliseconds.
2. Turn sampling on/off every few events.
3. Use an OMT clock for updating a cached `getrusage` every few milliseconds.

None of these is entirely satisfying: 1 and 3 promise to be a hornet's nest wrt power usage, while 2 makes the performance impact harder to predict.
I'd be tempted to go with 2, because it's simple, and the worst that can happen is that it doesn't improve performance.
Does wall-clock time in processing a given script call differ a lot from the CPU time? If not, then you could rdtsc (since for jank you really care about time away and not just CPU consumption; bad paging behaviour is still jank-causing) around the script calls and accumulate that way. There could be some error if things migrated across cores during execution, but I think small enough that you could ignore for the primary "what tab is melting my laptop?" use case.
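
A minimal sketch of that accumulate-around-script-calls idea (an assumption on my part; `ScriptTimer` and `sTotalCycles` are illustrative names, not Gecko APIs):

```cpp
// Sketch only: accumulate TSC deltas around calls into JS instead of asking
// the kernel for rusage on every callback. Two register reads per call.
#include <cstdint>
#if defined(_MSC_VER)
#  include <intrin.h>    // __rdtsc
#else
#  include <x86intrin.h> // __rdtsc
#endif

static uint64_t sTotalCycles = 0; // hypothetical accumulator

class ScriptTimer {
  uint64_t mStart;
public:
  ScriptTimer() : mStart(__rdtsc()) {}
  ~ScriptTimer() { sTotalCycles += __rdtsc() - mStart; }
};

void CallIntoJS(/* callback args */) {
  ScriptTimer timer; // one rdtsc on entry, one on exit; no syscalls
  // ... invoke the JS callback here ...
}
```
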
(In reply to David Rajchenbach-Teller [:Yoric] (use "needinfo") from comment #5)
> 3. Use an OMT clock for updating a cached `getrusage` every few milliseconds.
> 
> None is entirely satisfying. 1. and 3. promise to be a hornet's nest wrt
> power usage, while 2. makes the performance impact harder to predict.

For 3, could you perhaps tie it to whether or not a refresh driver is active? I believe we already have infrastructure so that when there are no refresh driver consumers (which I assume means simply that we're not actively painting), we stop ticking the refresh driver to reduce power consumption.
(In reply to Mike Shaver (:shaver -- probably not reading bugmail closely) from comment #7)
> Does wall-clock time in processing a given script call differ a lot from the
> CPU time?

In a heavily-loaded or heavily-swapping system the difference can be considerable, with wallclock time causing lots of false positives. And, of course, we can have hours of wallclock time for nanoseconds of CPU time if the user closes the laptop lid and the computer goes to sleep for the night. That's the reason I went with getrusage & co.

> If not, then you could rdtsc (since for jank you really care about
> time away and not just CPU consumption; bad paging behaviour is still
> jank-causing) around the script calls and accumulate that way. There could
> be some error if things migrated across cores during execution, but I think
> small enough that you could ignore for the primary "what tab is melting my
> laptop?" use case.

Is rdtsc much faster than getrusage?

(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #8)
> For 3, could you perhaps tie it to whether or not a refresh driver is
> active? I believe we already have infrastructure so that when there are no
> refresh driver consumers (which I assume means simply that we're not
> actively painting), we stop ticking the refresh driver to reduce power
> consumption.

Mmmh... We would be missing causes of slowdowns, but maybe only when they are not visible, so that may be ok. However, I'm a bit nervous about such a change: it would introduce a new thread that spins continuously (gated on the refresh driver) just to update atomics for consumption by SpiderMonkey. I'd rather find something simpler.
rdtsc is massively faster than getrusage. It's an inline instruction rather than a system call.
On OS X you could also try task_info instead; you'd need to get the Mach task_t, but that should be doable. I don't know if it would be faster, but it's plausible since it goes directly to Mach instead of through the syscall layer.

http://web.mit.edu/darwin/src/modules/xnu/osfmk/man/task_info.html
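
For reference, a hedged sketch of the per-thread variant (thread_info with THREAD_BASIC_INFO, which reports user/system CPU time for the calling thread); the wrapper name is made up:

```cpp
#include <cstdint>
#include <mach/mach.h>

// Returns the calling thread's user+system CPU time in microseconds,
// going straight to Mach rather than through the BSD syscall layer.
static bool GetThreadCpuTimeMicros(uint64_t* aOut) {
  thread_basic_info_data_t info;
  mach_msg_type_number_t count = THREAD_BASIC_INFO_COUNT;
  thread_act_t self = mach_thread_self();
  kern_return_t kr =
      thread_info(self, THREAD_BASIC_INFO, (thread_info_t)&info, &count);
  mach_port_deallocate(mach_task_self(), self); // mach_thread_self() adds a ref
  if (kr != KERN_SUCCESS) {
    return false;
  }
  *aOut = (uint64_t(info.user_time.seconds) + info.system_time.seconds) *
              1000000 +
          info.user_time.microseconds + info.system_time.microseconds;
  return true;
}
```
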
(In reply to Mike Shaver (:shaver -- probably not reading bugmail closely) from comment #11)
> On OS X you could also try task_info instead; you'd need to get the Mach
> task_t, but that should be doable. I don't know if it would be faster, but
> it's plausible since it goes directly to Mach instead of through the syscall
> layer.
> 
> http://web.mit.edu/darwin/src/modules/xnu/osfmk/man/task_info.html

That's bug 1142457.
Oh. In that case, what does getrusage tell you that thread_info doesn't?
(In reply to Mike Shaver (:shaver -- probably not reading bugmail closely) from comment #10)
> rdtsc is massively faster than getrusage. It's an inline instruction rather
> than a system call.

It's also much trickier to use correctly.

> This ability in Windows 7 to resolve the QPC timer service dynamically based on the
> current hardware is the reason Windows application developers are advised to stay away
> from using rdtsc – unless you absolutely know what you are doing – and to use QPC instead.
> Coding an inline rdtsc instruction is still going to be faster than calling QPC to access
> the TSC, but I hope I have made clear that using rdtsc directly is not for the faint of
> heart.

From here: http://performancebydesign.blogspot.co.il/2012/03/high-resolution-clocks-and-timers-for.html - Very good article
Yeah, rdtsc isn't great if you're concerned about power cycling changing the resolution (though I'm not sure of the semantics there; if it scales with CPU freq then that might in fact be what you want to measure for "CPU impact"). mach_absolute_time is the QPC equiv on Mac, roughly, but I think it's also a syscall.

But if we're already getting the task thread info, I wonder if we can calculate the getrusage requirement from that rather than having to make the system call.
On Mac OS X, we use the task thread info (as of today). On Linux/Android, we use getrusage. On Windows, we use GetThreadTimes.
Note: If the task thread info is not the same as getrusage, I'm interested in hearing about it, as the documentation didn't seem to indicate so.
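
For the record, a rough sketch of the non-Mac backends as I understand them (wrapper name hypothetical; on Linux, RUSAGE_THREAD needs _GNU_SOURCE and kernel 2.6.26+):

```cpp
#include <cstdint>
#if defined(_WIN32)
#  include <windows.h>
#else
#  include <sys/resource.h> // getrusage, RUSAGE_THREAD (with _GNU_SOURCE)
#endif

// Per-thread user+system CPU time, in microseconds.
static bool GetThreadCpuTimeMicros(uint64_t* aOut) {
#if defined(_WIN32)
  FILETIME creation, exit_, kernel, user;
  if (!GetThreadTimes(GetCurrentThread(), &creation, &exit_, &kernel, &user)) {
    return false;
  }
  ULARGE_INTEGER k, u;
  k.LowPart = kernel.dwLowDateTime; k.HighPart = kernel.dwHighDateTime;
  u.LowPart = user.dwLowDateTime;   u.HighPart = user.dwHighDateTime;
  *aOut = (k.QuadPart + u.QuadPart) / 10; // FILETIME is in 100ns units
  return true;
#else
  struct rusage usage;
  if (getrusage(RUSAGE_THREAD, &usage) != 0) {
    return false;
  }
  *aOut = (uint64_t(usage.ru_utime.tv_sec) + usage.ru_stime.tv_sec) * 1000000 +
          usage.ru_utime.tv_usec + usage.ru_stime.tv_usec;
  return true;
#endif
}
```
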
Oh. I understood from bz's comment that it was getrusage, and the bug is against OS X.

On Linux you might try times(2). It shouldn't have to do as much work as getrusage, so you might win some perf there.
I'm not convinced by the idea of using `times`, as it gives us per-process data, which would result in very poor precision in today's 60-ish-thread Firefox.

By the way, bz, a few questions:
1. What's the gravity of this regression? i.e. how much sleep should I lose on it?
2. What was your protocol?
Flags: needinfo?(bzbarsky)
> 1. What's the gravity of this regression?

Good question.  This mostly affects situations with cheap calls into JS: callback functions that don't do much.  This happens mostly with event handlers and promise callbacks.  I ran into it doing profiles for bug 1152875, where it was showing up _quite_ noticeably.  And it was definitely getrusage, on OSX.

I don't think it's a problem to ship this temporarily, but longer-term I do think we should see what we can do here.

> 2. What was your protocol?

For which?  Just loading the url in comment 0 with a build with/without the CPU accounting patches shows the regression on that testcase.  Getting useful numbers out of the testcases in bug 1152875 is a bit harder for various reasons (e.g. we're actively landing other patches to optimize other things in that testcase).
Flags: needinfo?(bzbarsky)
Oh, and my measurements were before bug 1142457 landed, so it may be worth remeasuring...
I assume you ran this within the Gecko profiler, right?
No, I used Instruments for the profiling.
So, after much brainstorming inside the perf team, here's the current plan:
- let perf monitoring ride the trains with jank monitoring OFF by default, ON only when actively used (i.e., initially when about:performance is open), and CPOW monitoring ON by default;
- similarly, let AddonWatcher and Telemetry reporting of slow add-ons ride the trains with jank monitoring OFF by default;
- wait until we have improved the performance of jank monitoring before turning it ON by default.

To improve performance, I will try to introduce a degree of sampling to activate jank monitoring randomly, for 1 event every N events.
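
A minimal sketch of what that sampling could look like (names, and the choice of a random rather than strictly periodic scheme, are illustrative only):

```cpp
#include <cstdlib>

static const unsigned kSamplingInterval = 16; // hypothetical N

// Random rather than strictly every N-th event, to avoid resonating with
// workloads that are themselves periodic in the event loop.
static bool ShouldMonitorThisEvent() {
  return (rand() % kSamplingInterval) == 0;
}

void ProcessEvent(/* event */) {
  const bool monitored = ShouldMonitorThisEvent();
  if (monitored) { /* getrusage()/thread_info() before dispatch */ }
  // ... dispatch the event ...
  if (monitored) { /* same call after dispatch; record the delta */ }
}
```
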
I am thinking of the following strategy that might improve performance.

1. Use per-thread `getrusage`/`thread_info`/`GetThreadTimes` exactly twice per event, when entering/leaving the XPConnect event loop, to get the total resource usage during the execution of that loop.
2. Use `rdtsc` whenever we enter/exit a compartment, to determine which fraction of that cost should be assigned to each compartment.

This should be much faster than having a few hundred calls to `getrusage` per iteration. However, we will get skewed results if the thread is moved across CPUs during the execution of an event loop. It may also be skewed by background thread activity, if this varies wildly during the execution of an event. Since our measurements last at least 10 seconds (and possibly as long as a Firefox session), I imagine that both skews will average out.
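
A hedged sketch of that two-level scheme, under the assumptions above (all names hypothetical; two thread-time syscalls per event loop iteration, plus one `rdtsc` per compartment boundary):

```cpp
#include <cstdint>
#include <unordered_map>
#if defined(_MSC_VER)
#  include <intrin.h>
#else
#  include <x86intrin.h>
#endif

struct CompartmentStats { uint64_t cycles = 0; uint64_t cpuTimeUs = 0; };

static std::unordered_map<void*, CompartmentStats> sStats; // keyed by compartment
static void* sCurrent = nullptr;
static uint64_t sEnterTsc = 0;

void OnEnterCompartment(void* compartment) {
  sCurrent = compartment;
  sEnterTsc = __rdtsc(); // cheap: no syscall at compartment boundaries
}

void OnLeaveCompartment() {
  sStats[sCurrent].cycles += __rdtsc() - sEnterTsc;
  sCurrent = nullptr;
}

// Called when leaving the XPConnect event loop, with the CPU time obtained
// from the two per-event getrusage/thread_info/GetThreadTimes calls.
void DistributeEventCost(uint64_t eventCpuTimeUs) {
  uint64_t totalCycles = 0;
  for (auto& entry : sStats) totalCycles += entry.second.cycles;
  if (totalCycles == 0) return;
  for (auto& entry : sStats) {
    // Charge each compartment a share proportional to its observed cycles.
    entry.second.cpuTimeUs += eventCpuTimeUs * entry.second.cycles / totalCycles;
    entry.second.cycles = 0; // reset for the next event
  }
}
```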

Boris, Avi, Mike, what do you think about this? Can you think of any way to [in]validate the approach?
Flags: needinfo?(shaver)
Flags: needinfo?(bzbarsky)
Flags: needinfo?(avihpit)
(In reply to David Rajchenbach-Teller [:Yoric] (away June 22 - July 7th, use "needinfo") from comment #25)
> Boris, Avi, Mike, what do you think about this? Can you think of any way to
> [in]validate the approach?

I don't think I understand it enough to comment on the approach itself, but as for validating it, assuming prototyping it is not too much hassle, maybe implement a prototype of the new approach as well and run both approaches through similar loads and compare their results?
Flags: needinfo?(avihpit)
Apart from vague worries about making entering/exiting compartments slower (in particular exiting; what would that need to do in addition to calling rdtsc?), we could try this.  I agree that it would be interesting to see how this approach compares to our existing one in terms of both performance and the output it produces.
Flags: needinfo?(bzbarsky)
If I just use `rdtsc`, entering/exiting compartments should actually be much faster (pending verification). Leaving the event loop should be a bit slower, as I need to walk the list of compartments that have been visited, but not by much.

Now, if I try to make sure that `rdtsc` is called on the right CPU, this may slow things down.
I don't know enough about how modern processors schedule their pipelines to be sure, but amortized over a number of event loop iterations I can't imagine that rdtsc drift due to clock frequency changes or similar would really distort the data. This is intended as a rough diagnostic tool, after all, not a substitute for measurement of specific cases or profiling, right?
Flags: needinfo?(shaver)
On older CPUs, the TSC might run at a lower frequency or halt completely on lower power states. I think that would have a much bigger impact than drift between cores. IIRC the TSC will also be synchronized whenever the CPU executes a serializing instruction (like cpuid), but my memory is hazy on this.
The good news is that generally if you're worried about CPU usage, the CPU isn't being allowed to drop into low-power states. Having it step down while we're actively executing an event seems unlikely to me, but it might be that older CPUs have issues with this approach; we could warn if the CPU is older than Sandy Bridge or whatever if we're worried about noise reports from it.

(We used to have problems with CPUs stepping down between subtests of Sunspider back in the day, but I think this would measure that case properly.)
(In reply to Mike Shaver (:shaver -- probably not reading bugmail closely) from comment #31)
> The good news is that generally if you're worried about CPU usage, the CPU
> isn't being allowed to drop into low-power states. Having it step down while
> we're actively executing an event seems unlikely to me, but it might be that
> older CPUs have issues with this approach; we could warn if the CPU is older
> than Sandy Bridge or whatever if we're worried about noise reports from it.

My observations say otherwise. My test case is scrolling in Firefox on Windows. When choosing the "balanced" power mode, where the CPU changes its state according to load, you'll typically get a worse (less consistent) scroll experience than when Windows is set to "high performance".

I've been monitoring various CPU power values, and they do get to lower states more than I'd have expected them to.

I've observed this on recent hardware such as Haswell and Broadwell ULV CPUs in Windows 8.1.
Is it dropped between events, or during events? I guess it'd be hard to tell.

Well hrmph.
Still, at worst a lower power state will increment the counter *less*, so it will look like the event took less time than it actually did - which shouldn't be a problem for anything that finishes executing while the CPU is still in a lower power state. It won't cause us to over-report, which I assume is the main concern.

New CPUs should all have constant+nonstop or invariant TSC, so even if they aggressively power down it shouldn't affect the measurements.
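
If older hardware is the concern, a hedged sketch of a runtime check for an invariant TSC (CPUID leaf 0x80000007, EDX bit 8); GCC/Clang shown, MSVC would use __cpuidex from <intrin.h> instead:

```cpp
#include <cpuid.h>

// True if the TSC ticks at a constant rate regardless of P-/C-states,
// i.e. raw rdtsc deltas are safe to interpret as elapsed time.
static bool HasInvariantTsc() {
  unsigned eax, ebx, ecx, edx;
  if (!__get_cpuid(0x80000007, &eax, &ebx, &ecx, &edx)) {
    return false; // leaf unsupported: assume the worst
  }
  return (edx & (1u << 8)) != 0;
}
```
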
(In reply to Mike Shaver (:shaver -- probably not reading bugmail closely) from comment #33)
> Is it dropped between events, or during events? I guess it'd be hard to tell.
> 
> Well hrmph.

I never tried to monitor it at such resolution. But the fact is that the CPU frequency can fluctuate even when you'd expect it not to - with the "balanced" power mode, which is the default on all laptops I examined (even if it goes by different names - like "HP Recommended" on some laptops).

(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #34)
> Still, at worst a lower power state will increment the counter *less*, so it
> will look like the event took less time than it actually did - which
> shouldn't be a problem for anything that finishes executing while the CPU is
> still in a lower power state. It won't cause us to over-report, which I
> assume is the main concern.

Wouldn't under-reporting be equally undesirable?

> New CPUs should all have constant+nonstop or invariant TSC, so even if they
> aggressively power down it shouldn't affect the measurements.

Not sure I follow this one...
(In reply to Avi Halachmi (:avih) from comment #35)
> Wouldn't under-reporting be equally undesirable?

I won't claim to speak for David, but I personally think it's *less* undesirable than over-reporting. If something doing a small amount of work takes longer in real time because the CPU needs to power up, that's not very useful information. Even if we wanted to do something to keep the CPU in a higher power state to be more responsive, the code that ran wasn't responsible for being slow.

In contrast, *not* reporting something that took a lot of clock cycles would be a problem - but that should cause the CPU to power up and run the TSC at full speed.

> > New CPUs should all have constant+nonstop or invariant TSC, so even if they
> > aggressively power down it shouldn't affect the measurements.
> 
> Not sure I follow this one...

On newer CPUs, the TSC increments at a constant rate regardless of the power state. So the TSC reflects the real time spent executing something.
There's an overall question here too: do we want to measure wall-clock time or "CPU impact"? My original goal with the CPU accounting was the latter, in which case a given unit of time at a higher clock should "cost more" than at a lower clock. (Note that my original patches didn't do that, and instead measured "scheduled wall-clock time".) That wants performance counters rather than time, I guess.
Very good question (wallclock vs cpu impact vs maybe-something-else). Let's try to look at it backwards.

Overall, I think we're interested in identifying code which impacts user experience negatively, or which performs worse than expected.

I can imagine three interesting "views":
- "micro lags" which mostly affect animation smoothness.
- longer lags which hang the browser/content for some duration (200+ ms ?)
- code which impacts power consumption even without affecting UX.

For the first two items, I'd think we're interested in wallclock. The first is measured in relation to the monitor refresh rate, and the second is absolute duration.

The third, however, is not necessarily wallclock. E.g. some stuff could take longer but run in a lower power state, which doesn't necessarily impact power draw negatively (even if there's a theory that shorter bursts of high power draw may use less power overall).

Not saying we should cover all of those, but I do think all three are interesting to monitor. However, as far as user experience goes, it seems to me that wallclock would be a good target.
Unfortunately, measuring (or especially aggregating) individual event-loop turns doesn't tell you much about the cause of longer lags in many cases, because the issue is a number of events that are in line before the one that will cause a visible response to the most recent input event. Even when it's a single type of event, 100 turns at 0.2 ms each is pretty different from 1 turn of 20 ms when it comes to scroll lag.

To diagnose jank (micro or macro), I think you need to measure the time between an input event being queued and the paint subsequent to it, and then figure out how to characterize the intervening time (circular log of event types?).

What's being done here will let you go from "Firefox is using 55% CPU, WTF" to "ugh, facebook.com is churning", similar to the Chrome task manager. That's valuable, and being able to get to this data from dev tools would be even more so -- you could charge CPU time to specific setTimeout types, in some cases at least.

(Maybe none of that is still true, and the major causes of jank are single outsized events; I last dug into this seriously something like five years ago, but then I was mostly seeing sequences of events that cumulatively took us away from things.)
Right now, the focus is on determining what causes us to skip frames. We have many cases in which some processing (possibly async) takes place on the main thread, blocking repaint, without being triggered directly by an input event. For instance, AdBlock Plus causes some jank while it decides whether stuff needs to be loaded. That's the kind of thing we manage to catch with PerformanceStats/about:performance.

Finding out what's causing delay between an input event and a paint is clearly useful, too, but I want to keep this for a v2.

The problem with wallclock is that it's very sensitive to system load, so attempting to browse while compiling mozilla-central (for instance) would probably trigger nothing but false positives. For this reason, I'm trying to use CPU time rather than wallclock time. I'm computing the ~total CPU time for each event, then using the number of cycles to approximate the proportion of that time spent in each add-on/webpage (e.g., if an event consumed 8 ms of CPU time and one compartment accounted for 75% of the cycles observed during it, that compartment would be charged 6 ms).
tracking-e10s: --- → ?
Flags: needinfo?(cweiner)
bz, once you return from PTO: now that bug 1181175 has landed, it would be great if you could re-run the tests on your side and see whether you find an improvement.  I'll also do it on my side.
We haven't been able to reproduce.
Status: NEW → RESOLVED
Closed: 9 years ago
Flags: needinfo?(cweiner)
Resolution: --- → WORKSFORME