Closed Bug 1152930 Opened 9 years ago Closed 9 years ago

10-20% performance regression in invoking JS callbacks

Categories

(Core :: JavaScript Engine, defect)

x86
macOS
defect
Not set
normal

Tracking

RESOLVED WORKSFORME
Tracking Status
e10s ? ---
firefox40 --- affected

People

(Reporter: bzbarsky, Unassigned)

References

(Blocks 2 open bugs)

Details

(Keywords: perf, regression)

See testcase at http://mozilla.pettay.fi/moztests/events/event_speed_3.html

This got 10% slower from the patches in bug 674779.  The actual call-into-JS bits got 20% slower.

This might not be a huge worry as long as it's nightly-only, I guess...
Blocks: 1152875
Blocks: 840201
Oh, that doesn't sound good.
Blocks: 1136927
Well, we would like to let this ride the trains, eventually, so it is a worry for me.
Unfortunately, for the time being, I have no clear idea how we could optimize that feature any further.
Yeah, to be clear, the getrusage calls are taking up all the time.  :(

Do we store that data on a persistent basis over time?  If not, can we only collect it when about:compartments is loaded or something?
We do store data on a persistent basis over time, to perform slow add-on detection (and hopefully one of these days slow webpage detection). Now, one way to decrease the impact would be to use some form of sampling for jank detection (CPOW detection uses an almost unrelated codepath and should have no observable impact on performance in the absence of CPOW).

I see the following options:
1. Turn sampling on/off every few milliseconds.
2. Turn sampling on/off every few events.
3. Use an OMT clock for updating a cached `getrusage` every few milliseconds.

None of these is entirely satisfying: 1 and 3 promise to be a hornet's nest wrt power usage, while 2 makes the performance impact harder to predict.
I'd be tempted to go with 2, because it's simple, and the worst that can happen is that it doesn't improve performance.
Does wall-clock time in processing a given script call differ a lot from the CPU time? If not, then you could rdtsc (since for jank you really care about time away and not just CPU consumption; bad paging behaviour is still jank-causing) around the script calls and accumulate that way. There could be some error if things migrated across cores during execution, but I think small enough that you could ignore for the primary "what tab is melting my laptop?" use case.
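
A minimal sketch of that accumulate-around-script-calls idea (an assumption on my part; `ScriptTimer` and `sTotalCycles` are illustrative names, not Gecko APIs):

```cpp
// Sketch only: accumulate TSC deltas around calls into JS instead of asking
// the kernel for rusage on every callback. Two register reads per call.
#include <cstdint>
#if defined(_MSC_VER)
#  include <intrin.h>    // __rdtsc
#else
#  include <x86intrin.h> // __rdtsc
#endif

static uint64_t sTotalCycles = 0; // hypothetical accumulator

class ScriptTimer {
  uint64_t mStart;
public:
  ScriptTimer() : mStart(__rdtsc()) {}
  ~ScriptTimer() { sTotalCycles += __rdtsc() - mStart; }
};

void CallIntoJS(/* callback args */) {
  ScriptTimer timer; // one rdtsc on entry, one on exit; no syscalls
  // ... invoke the JS callback here ...
}
```
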
(In reply to David Rajchenbach-Teller [:Yoric] (use "needinfo") from comment #5)
> 3. Use an OMT clock for updating a cached `getrusage` every few milliseconds.
> 
> None is entirely satisfying. 1. and 3. promise to be a hornet's nest wrt
> power usage, while 2. makes the performance impact harder to predict.

For 3, could you perhaps tie it to whether or not a refresh driver is active? I believe we already have infrastructure so that when there are no refresh driver consumers (which I assume means simply that we're not actively painting), we stop ticking the refresh driver to reduce power consumption.
(In reply to Mike Shaver (:shaver -- probably not reading bugmail closely) from comment #7)
> Does wall-clock time in processing a given script call differ a lot from the
> CPU time?

In a heavily-loaded or heavily-swapping system the difference can be considerable, with wallclock time causing lots of false positives. And, of course, we can have hours of wallclock time for nanoseconds of CPU time if the user closes the laptop lid and the computer goes to sleep for the night. That's the reason I went with getrusage & co.

> If not, then you could rdtsc (since for jank you really care about
> time away and not just CPU consumption; bad paging behaviour is still
> jank-causing) around the script calls and accumulate that way. There could
> be some error if things migrated across cores during execution, but I think
> small enough that you could ignore for the primary "what tab is melting my
> laptop?" use case.

Is rdtsc much faster than getrusage?

(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #8)
> For 3, could you perhaps tie it to whether or not a refresh driver is
> active? I believe we already have infrastructure so that when there are no
> refresh driver consumers (which I assume means simply that we're not
> actively painting), we stop ticking the refresh driver to reduce power
> consumption.

Mmmh... We would be missing causes of slowdowns, but maybe only when they are not visible, so that may be ok. However, I'm a bit nervous about such a change: it would introduce a new thread that spins continuously (gated on the refresh driver) just to update atomics for consumption by SpiderMonkey. I'd rather find something simpler.
rdtsc is massively faster than getrusage. It's an inline instruction rather than a system call.
On OS X you could also try task_info instead; you'd need to get the Mach task_t, but that should be doable. I don't know if it would be faster, but it's plausible since it goes directly to Mach instead of through the syscall layer.

http://web.mit.edu/darwin/src/modules/xnu/osfmk/man/task_info.html
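
For reference, a hedged sketch of the per-thread variant (thread_info with THREAD_BASIC_INFO, which reports user/system CPU time for the calling thread); the wrapper name is made up:

```cpp
#include <cstdint>
#include <mach/mach.h>

// Returns the calling thread's user+system CPU time in microseconds,
// going straight to Mach rather than through the BSD syscall layer.
static bool GetThreadCpuTimeMicros(uint64_t* aOut) {
  thread_basic_info_data_t info;
  mach_msg_type_number_t count = THREAD_BASIC_INFO_COUNT;
  thread_act_t self = mach_thread_self();
  kern_return_t kr =
      thread_info(self, THREAD_BASIC_INFO, (thread_info_t)&info, &count);
  mach_port_deallocate(mach_task_self(), self); // mach_thread_self() adds a ref
  if (kr != KERN_SUCCESS) {
    return false;
  }
  *aOut = (uint64_t(info.user_time.seconds) + info.system_time.seconds) *
              1000000 +
          info.user_time.microseconds + info.system_time.microseconds;
  return true;
}
```
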
(In reply to Mike Shaver (:shaver -- probably not reading bugmail closely) from comment #11)
> On OS X you could also try task_info instead; you'd need to get the Mach
> task_t, but that should be doable. I don't know if it would be faster, but
> it's plausible since it goes directly to Mach instead of through the syscall
> layer.
> 
> http://web.mit.edu/darwin/src/modules/xnu/osfmk/man/task_info.html

That's bug 1142457.
Oh. In that case, what does getrusage tell you that thread_info doesn't?
(In reply to Mike Shaver (:shaver -- probably not reading bugmail closely) from comment #10)
> rdtsc is massively faster than getrusage. It's an inline instruction rather
> than a system call.

It's also much trickier to use correctly.

> This ability in Windows 7 to resolve the QPC timer service dynamically based on the
> current hardware is the reason Windows application developers are advised to stay away
> from using rdtsc – unless you absolutely know what you are doing – and to use QPC instead.
> Coding an inline rdtsc instruction is still going to be faster than calling QPC to access
> the TSC, but I hope I have made clear that using rdtsc directly is not for the faint of
> heart.

From here: http://performancebydesign.blogspot.co.il/2012/03/high-resolution-clocks-and-timers-for.html - Very good article
Yeah, rdtsc isn't great if you're concerned about power cycling changing the resolution (though I'm not sure of the semantics there; if it scales with CPU freq then that might in fact be what you want to measure for "CPU impact"). mach_absolute_time is the QPC equiv on Mac, roughly, but I think it's also a syscall.

But if we're already getting the task thread info, I wonder if we can calculate the getrusage requirement from that rather than having to make the system call.
On Mac OS X, we use the task thread info (as of today). On Linux/Android, we use getrusage. On Windows, we use GetThreadTimes.
Note: If the task thread info is not the same as getrusage, I'm interested in hearing about it, as the documentation didn't seem to indicate so.
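
For the record, a rough sketch of the non-Mac backends as I understand them (wrapper name hypothetical; on Linux, RUSAGE_THREAD needs _GNU_SOURCE and kernel 2.6.26+):

```cpp
#include <cstdint>
#if defined(_WIN32)
#  include <windows.h>
#else
#  include <sys/resource.h> // getrusage, RUSAGE_THREAD (with _GNU_SOURCE)
#endif

// Per-thread user+system CPU time, in microseconds.
static bool GetThreadCpuTimeMicros(uint64_t* aOut) {
#if defined(_WIN32)
  FILETIME creation, exit_, kernel, user;
  if (!GetThreadTimes(GetCurrentThread(), &creation, &exit_, &kernel, &user)) {
    return false;
  }
  ULARGE_INTEGER k, u;
  k.LowPart = kernel.dwLowDateTime; k.HighPart = kernel.dwHighDateTime;
  u.LowPart = user.dwLowDateTime;   u.HighPart = user.dwHighDateTime;
  *aOut = (k.QuadPart + u.QuadPart) / 10; // FILETIME is in 100ns units
  return true;
#else
  struct rusage usage;
  if (getrusage(RUSAGE_THREAD, &usage) != 0) {
    return false;
  }
  *aOut = (uint64_t(usage.ru_utime.tv_sec) + usage.ru_stime.tv_sec) * 1000000 +
          usage.ru_utime.tv_usec + usage.ru_stime.tv_usec;
  return true;
#endif
}
```
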
Oh. I understood from bz's comment that it was getrusage, and the bug is against OS X.

On Linux you might try times(2). It shouldn't have to do as much work as getrusage, so you might win some perf there.
I'm not convinced by the idea of using `times`, as it gives us per-process data, which would result in very poor precision in today's 60-ish-thread Firefox.

By the way, bz, a few questions:
1. What's the gravity of this regression? i.e. how much sleep should I lose on it?
2. What was your protocol?
Flags: needinfo?(bzbarsky)
> 1. What's the gravity of this regression?

Good question.  This mostly affects situations with cheap calls into JS: callback functions that don't do much.  This happens mostly with event handlers and promise callbacks.  I ran into it doing profiles for bug 1152875, where it was showing up _quite_ noticeably.  And it was definitely getrusage, on OSX.

I don't think it's a problem to ship this temporarily, but longer-term I do think we should see what we can do here.

> 2. What was your protocol?

For which?  Just loading the url in comment 0 with a build with/without the CPU accounting patches shows the regression on that testcase.  Getting useful numbers out of the testcases in bug 1152875 is a bit harder for various reasons (e.g. we're actively landing other patches to optimize other things in that testcase).
Flags: needinfo?(bzbarsky)
Oh, and my measurements were before bug 1142457 landed, so it may be worth remeasuring...
I assume you ran this within the Gecko profiler, right?
No, I used Instruments for the profiling.
So, after much brainstorming inside the perf team, here's the current plan:
- let perf monitoring ride the trains with jank monitoring OFF by default, ON only when actively used (i.e., initially when about:performance is open), and CPOW monitoring ON by default;
- similarly, let AddonWatcher and Telemetry reporting of slow add-ons ride the trains with jank monitoring OFF by default;
- wait until we have improved the performance of jank monitoring before turning it ON by default.

To improve performance, I will try to introduce a degree of sampling to activate jank monitoring randomly, for 1 event every N events.
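
A minimal sketch of what that sampling could look like (names, and the choice of a random rather than strictly periodic scheme, are illustrative only):

```cpp
#include <cstdlib>

static const unsigned kSamplingInterval = 16; // hypothetical N

// Random rather than strictly every N-th event, to avoid resonating with
// workloads that are themselves periodic in the event loop.
static bool ShouldMonitorThisEvent() {
  return (rand() % kSamplingInterval) == 0;
}

void ProcessEvent(/* event */) {
  const bool monitored = ShouldMonitorThisEvent();
  if (monitored) { /* getrusage()/thread_info() before dispatch */ }
  // ... dispatch the event ...
  if (monitored) { /* same call after dispatch; record the delta */ }
}
```
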
I am thinking of the following strategy that might improve performance.

1. Use per-thread `getrusage`/`thread_info`/`GetThreadTimes` exactly twice per event, when entering/leaving the XPConnect event loop, to get the total resource usage during the execution of that loop.
2. Use `rdtsc` whenever we enter/exit a compartment, to determine which fraction of that cost should be assigned to each compartment.

This should be much faster than having a few hundred calls to `getrusage` per iteration. However, we will get skewed results if the thread is moved across CPUs during the execution of an event loop. It may also be skewed by background thread activity, if this varies wildly during the execution of an event. Since our measurements last at least 10 seconds (and possibly as long as a Firefox session), I imagine that both skews will average out.
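
A hedged sketch of that two-level scheme, under the assumptions above (all names hypothetical; two thread-time syscalls per event loop iteration, plus one `rdtsc` per compartment boundary):

```cpp
#include <cstdint>
#include <unordered_map>
#if defined(_MSC_VER)
#  include <intrin.h>
#else
#  include <x86intrin.h>
#endif

struct CompartmentStats { uint64_t cycles = 0; uint64_t cpuTimeUs = 0; };

static std::unordered_map<void*, CompartmentStats> sStats; // keyed by compartment
static void* sCurrent = nullptr;
static uint64_t sEnterTsc = 0;

void OnEnterCompartment(void* compartment) {
  sCurrent = compartment;
  sEnterTsc = __rdtsc(); // cheap: no syscall at compartment boundaries
}

void OnLeaveCompartment() {
  sStats[sCurrent].cycles += __rdtsc() - sEnterTsc;
  sCurrent = nullptr;
}

// Called when leaving the XPConnect event loop, with the CPU time obtained
// from the two per-event getrusage/thread_info/GetThreadTimes calls.
void DistributeEventCost(uint64_t eventCpuTimeUs) {
  uint64_t totalCycles = 0;
  for (auto& entry : sStats) totalCycles += entry.second.cycles;
  if (totalCycles == 0) return;
  for (auto& entry : sStats) {
    // Charge each compartment a share proportional to its observed cycles.
    entry.second.cpuTimeUs += eventCpuTimeUs * entry.second.cycles / totalCycles;
    entry.second.cycles = 0; // reset for the next event
  }
}
```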

Boris, Avi, Mike, what do you think about this? Can you think of any way to [in]validate the approach?
Flags: needinfo?(shaver)
Flags: needinfo?(bzbarsky)
Flags: needinfo?(avihpit)
(In reply to David Rajchenbach-Teller [:Yoric] (away June 22 - July 7th, use "needinfo") from comment #25)
> Boris, Avi, Mike, what do you think about this? Can you think of any way to
> [in]validate the approach?

I don't think I understand it enough to comment on the approach itself, but as for validating it, assuming prototyping it is not too much hassle, maybe implement a prototype of the new approach as well and run both approaches through similar loads and compare their results?
Flags: needinfo?(avihpit)
Apart from vague worries about making entering/exiting compartments slower (in particular exiting; what would that need to do in addition to calling rdtsc?), we could try this.  I agree that it would be interesting to see how this approach compares to our existing one in terms of both performance and the output it produces.
Flags: needinfo?(bzbarsky)
If I just use `rdtsc`, entering/exiting compartments should actually be much faster (pending verification). Leaving the event loop should be a bit slower, as I need to walk the list of compartments that have been visited, but not by much.

Now, if I try to make sure that `rdtsc` is called on the right CPU, this may slow things down.
I don't know enough about how modern processors schedule their pipelines to be sure, but amortized over a number of event loop iterations I can't imagine that rdtsc drift due to clock frequency changes or similar would really distort the data. This is intended as a rough diagnostic tool, after all, not a substitute for measurement of specific cases or profiling, right?
Flags: needinfo?(shaver)
On older CPUs, the TSC might run at a lower frequency or halt completely on lower power states. I think that would have a much bigger impact than drift between cores. IIRC the TSC will also be synchronized whenever the CPU executes a serializing instruction (like cpuid), but my memory is hazy on this.
The good news is that generally if you're worried about CPU usage, the CPU isn't being allowed to drop into low-power states. Having it step down while we're actively executing an event seems unlikely to me, but it might be that older CPUs have issues with this approach; we could warn if the CPU is older than Sandy Bridge or whatever if we're worried about noise reports from it.

(We used to have problems with CPUs stepping down between subtests of Sunspider back in the day, but I think this would measure that case properly.)
(In reply to Mike Shaver (:shaver -- probably not reading bugmail closely) from comment #31)
> The good news is that generally if you're worried about CPU usage, the CPU
> isn't being allowed to drop into low-power states. Having it step down while
> we're actively executing an event seems unlikely to me, but it might be that
> older CPUs have issues with this approach; we could warn if the CPU is older
> than Sandy Bridge or whatever if we're worried about noise reports from it.

My observations say otherwise. My test case is scrolling in Firefox on Windows. When choosing the "balanced" power mode, where the CPU changes its state according to load, you'll typically get a worse (less consistent) scroll experience than when Windows is set to "high performance".

I've been monitoring various CPU power values, and they do get to lower states more than I'd have expected them to.

I've observed this on recent hardware such as Haswell and Broadwell ULV CPUs in Windows 8.1.
Is it dropped between events, or during events? I guess it'd be hard to tell.

Well hrmph.
Still, at worst a lower power state will increment the counter *less*, so it will look like the event took less time than it actually did - which shouldn't be a problem for anything that finishes executing while the CPU is still in a lower power state. It won't cause us to over-report, which I assume is the main concern.

New CPUs should all have constant+nonstop or invariant TSC, so even if they aggressively power down it shouldn't affect the measurements.
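
If older hardware is the concern, a hedged sketch of a runtime check for an invariant TSC (CPUID leaf 0x80000007, EDX bit 8); GCC/Clang shown, MSVC would use __cpuidex from <intrin.h> instead:

```cpp
#include <cpuid.h>

// True if the TSC ticks at a constant rate regardless of P-/C-states,
// i.e. raw rdtsc deltas are safe to interpret as elapsed time.
static bool HasInvariantTsc() {
  unsigned eax, ebx, ecx, edx;
  if (!__get_cpuid(0x80000007, &eax, &ebx, &ecx, &edx)) {
    return false; // leaf unsupported: assume the worst
  }
  return (edx & (1u << 8)) != 0;
}
```
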
(In reply to Mike Shaver (:shaver -- probably not reading bugmail closely) from comment #33)
> Is it dropped between events, or during events? I guess it'd be hard to tell.
> 
> Well hrmph.

I never tried to monitor it at such resolution. But the fact is that the CPU frequency can fluctuate even when you'd expect it not to - with the "balanced" power mode, which is the default on all laptops I examined (even if it goes by different names - like "HP Recommended" on some laptops).

(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #34)
> Still, at worst a lower power state will increment the counter *less*, so it
> will look like the event took less time than it actually did - which
> shouldn't be a problem for anything that finishes executing while the CPU is
> still in a lower power state. It won't cause us to over-report, which I
> assume is the main concern.

Wouldn't under-reporting be equally undesirable?

> New CPUs should all have constant+nonstop or invariant TSC, so even if they
> aggressively power down it shouldn't affect the measurements.

Not sure I follow this one...
(In reply to Avi Halachmi (:avih) from comment #35)
> Wouldn't under-reporting be equally undesirable?

I won't claim to speak for David, but I personally think it's *less* undesirable than over-reporting. If something doing a small amount of work takes longer in real time because the CPU needs to power up, that's not very useful information. Even if we wanted to do something to keep the CPU in a higher power state to be more responsive, the code that ran wasn't responsible for being slow.

In contrast, *not* reporting something that took a lot of clock cycles would be a problem - but that should cause the CPU to power up and run the TSC at full speed.

> > New CPUs should all have constant+nonstop or invariant TSC, so even if they
> > aggressively power down it shouldn't affect the measurements.
> 
> Not sure I follow this one...

On newer CPUs, the TSC increments at a constant rate regardless of the power state. So the TSC reflects the real time spent executing something.
There's an overall question here too: do we want to measure wall-clock time or "CPU impact"? My original goal with the CPU accounting was the latter, in which case a given unit of time at a higher clock should "cost more" than at a lower clock. (Note that my original patches didn't do that, and instead measured "scheduled wall-clock time".) That wants performance counters rather than time, I guess.
Very good question (wallclock vs cpu impact vs maybe-something-else). Let's try to look at it backwards.

Overall, I think we're interested in identifying code which impacts user experience negatively, or which performs worse than expected.

I can imagine three interesting "views":
- "micro lags" which mostly affect animation smoothness.
- longer lags which hang the browser/content for some duration (200+ ms ?)
- code which impacts power consumption even without affecting UX.

For the first two items, I'd think we're interested in wallclock. The first is measured in relation to the monitor refresh rate, and the second is absolute duration.

The third, however, is not necessarily wallclock. E.g. some stuff could take longer but run in a lower power state, which doesn't necessarily impact power draw negatively (even if there's a theory that shorter bursts of high power draw may use less power overall).

Not saying we should cover all of those, but I do think all three are interesting to monitor. However, as far as user experience goes, it seems to me that wallclock would be a good target.
Unfortunately, measuring (or especially aggregating) individual event-loop turns doesn't tell you much about the cause of longer lags in many cases, because the issue is a number of events that are in line before the one that will cause a visible response to the most recent input event. Even when it's a single type of event, 100 turns at 0.2 ms each is pretty different from 1 turn of 20 ms when it comes to scroll lag.

To diagnose jank (micro or macro), I think you need to measure the time between an input event being queued and the paint subsequent to it, and then figure out how to characterize the intervening time (circular log of event types?).

What's being done here will let you go from "Firefox is using 55% CPU, WTF" to "ugh, facebook.com is churning", similar to the Chrome task manager. That's valuable, and being able to get to this data from dev tools would be even more so -- you could charge CPU time to specific setTimeout types, in some cases at least.

(Maybe none of that is still true, and the major causes of jank are single outsized events; I last dug into this seriously something like five years ago, but then I was mostly seeing sequences of events that cumulatively took us away from things.)
Right now, the focus is on determining what causes us to skip frames. We have many cases in which some processing (possibly async) takes place on the main thread, blocking repaint, without being triggered directly by an input event. For instance, AdBlock Plus causes some jank while it decides whether stuff needs to be loaded. That's the kind of thing we manage to catch with PerformanceStats/about:performance.

Finding out what's causing delay between an input event and a paint is clearly useful, too, but I want to keep this for a v2.

The problem with wallclock is that it's very sensitive to system load, so attempting to browse while compiling mozilla-central (for instance) would probably trigger nothing but false positives. For this reason, I'm trying to use CPU time rather than wallclock time. I'm computing the ~total CPU time for each event, then using the number of cycles to approximate the proportion of that time spent in each add-on/webpage (e.g., if an event consumed 8 ms of CPU time and one compartment accounted for 75% of the cycles observed during it, that compartment would be charged 6 ms).
tracking-e10s: --- → ?
Flags: needinfo?(cweiner)
bz, once you return from PTO: now that bug 1181175 has landed, it would be great if you could re-run the tests on your side and see whether you find an improvement.  I'll also do it on my side.
We haven't been able to reproduce.
Status: NEW → RESOLVED
Closed: 9 years ago
Flags: needinfo?(cweiner)
Resolution: --- → WORKSFORME