Closed Bug 1192216 Opened 9 years ago Closed 6 years ago

Decide whether we should keep RDTSC for PerformanceMonitoring

Categories

(Toolkit :: Performance Monitoring, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

Status: RESOLVED WONTFIX

People

(Reporter: Yoric, Unassigned)

Once bug 1181175 has landed, we need to determine 1/ whether the RDTSC-based monitoring is fast enough; 2/ whether it discards too much data.

For 1/, we should keep an eye on Talos and on the benchmark from bug 1152930.
For 2/, we should keep an eye on PERF_MONITORING_TEST_CPU_RESCHEDULING_PROPORTION_MOVED. If the proportion is very low, we can simplify RDTSC by removing the same-CPU check. If the proportion is very high, RDTSC might discard too much data.
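For context, the same-CPU check amounts to reading the TSC together with a per-CPU identifier and discarding any interval during which the thread was rescheduled. Below is a minimal sketch of the idea, not the actual tree code; the `__rdtscp` intrinsic and treating the TSC_AUX value as a CPU identifier are assumptions about the platform:

```cpp
#include <cstdint>
#include <x86intrin.h>  // __rdtscp on GCC/Clang; MSVC provides it in <intrin.h>

struct TscSample {
  uint64_t ticks;     // raw timestamp counter
  unsigned int aux;   // IA32_TSC_AUX; the OS typically keys this to the CPU
};

static TscSample ReadTsc() {
  TscSample sample;
  sample.ticks = __rdtscp(&sample.aux);
  return sample;
}

// Returns true and fills *aElapsed only if the thread stayed on the same CPU,
// i.e. the two TSC readings are comparable. Discarded intervals are what
// PERF_MONITORING_TEST_CPU_RESCHEDULING_PROPORTION_MOVED counts as "moved".
static bool ElapsedOnSameCpu(const TscSample& aStart, const TscSample& aEnd,
                             uint64_t* aElapsed) {
  if (aStart.aux != aEnd.aux) {
    return false;  // rescheduled to another CPU: drop this measurement
  }
  *aElapsed = aEnd.ticks - aStart.ticks;
  return true;
}
```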
Protocol for http://mozilla.pettay.fi/moztests/events/event_speed_3.html (the benchmark of bug 1152930):
- run the benchmark without tweaking Firefox Nightly (this has Perf Monitoring enabled);
- run the benchmark with the following tweak (to deactivate Perf Monitoring):
  // from a chrome-privileged scope, e.g. the Browser Console
  let PerformanceStats = Cu.import("resource://gre/modules/PerformanceStats.jsm", {});
  PerformanceStats.performanceStatsService.isMonitoringJank = false;
According to my tests, I cannot see a difference in the speed of http://mozilla.pettay.fi/moztests/events/event_speed_3.html between runs with and without jank monitoring. This is using the latest Nightly on Mac OS X.
If you have a moment, bz, could you confirm?
Flags: needinfo?(bzbarsky)
I'm sorry for the lag here; I'm still digging out from vacation.  :(

Going to try to get to this on Tuesday (Monday is a holiday).
OK, so I did the following tests:

1)  Check out hg revision 4cb5a8a692e5 (this is from before bug 1181175 landed).
2)  Run http://mozilla.pettay.fi/moztests/events/event_speed_3.html normally.
3)  Run it with the isMonitoringJank = false bit from comment 1.
4)  Check out hg revision defee685fedd (from after bug 1181175).
5)  Run the test again normally.
6)  Run it with isMonitoringJank = false.
7)  Pull m-c tip.
8)  Run the test again normally.
9)  Run it with isMonitoringJank = false.

In all cases the versions with jank monitoring scored the same as the versions without, as far as I can tell, and the tests after bug 1181175 were a bit slower (about 5ms, on a baseline of 80ms) than the tests before.  I have no idea what to make of all that.  :(
Flags: needinfo?(bzbarsky)
That's disappointing. Barring any bug, post bug 1181175, any test without jank monitoring should be a glorified noop: a few constructors/destructors on the stack, a few `if`s and a few dereferences, but that's it.

I'll look further into it. Can you confirm that the benchmark is executed in a single tick (but several microticks) of the event loop?
Flags: needinfo?(bzbarsky)
Oh, I think I found a bug in the code: we fail to bail out early in some functions when jank monitoring is deactivated.
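For the record, the intended fast path is just a flag check at the top of the RAII helper. A rough sketch, not the actual `AutoStopwatch` code (names are illustrative):

```cpp
// Rough shape of the intended early bailout: when jank monitoring is off,
// constructing and destroying the stopwatch should cost only a flag test.
class AutoStopwatchSketch {
 public:
  explicit AutoStopwatchSketch(bool aIsMonitoringJank)
    : mActive(aIsMonitoringJank) {
    if (!mActive) {
      return;  // bail out before taking any timestamps (no RDTSC, no getrusage)
    }
    // ... record start timestamps here ...
  }

  ~AutoStopwatchSketch() {
    if (!mActive) {
      return;  // symmetric bailout on destruction
    }
    // ... record end timestamps and commit the interval here ...
  }

 private:
  const bool mActive;
};
```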
Not sure whether we mean the same things by "tick" and "microtick", but the way the benchmark works is that it runs a bunch of tasks off the main event loop.  Each of those tasks does a loop that calls dispatchEvent a number of times and measures the time that takes.  Then it averages together the times from all the tasks.

dispatchEvent does perform a microtask checkpoint when it's done, I think...
Flags: needinfo?(bzbarsky)
But I'm not 100% sure about that last.
Let me clarify: do you know if XPConnect::AfterProcessNextTask is called at the end of the test or at each call to dispatchEvent?
Flags: needinfo?(bzbarsky)
After the fix for bug 1179909 there is no such thing... What's the relevant function in the new world?
Flags: needinfo?(bzbarsky)
Sorry, I meant `AfterProcessTask`, which I believe is the relevant function in the new world.
Flags: needinfo?(bzbarsky)
Ah, XPCJSRuntime::AfterProcessTask?

That is _not_ called after every event dispatch, just once we return to the event loop.
Flags: needinfo?(bzbarsky)
Ok, so this means that we only call `getrusage` twice, which should make things much faster. I wonder what's going wrong. I'll try and profile the test early next week.
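For reference, the per-task CPU-time reading boils down to a pair of `getrusage` calls, one when the task starts and one in `AfterProcessTask`, and the difference is the CPU time charged to that turn of the event loop. A minimal sketch, not the actual `nsPerformanceStats` code (note that `RUSAGE_THREAD` is Linux-only; other platforms need `RUSAGE_SELF` or a native API):

```cpp
#include <cstdint>
#include <sys/resource.h>
#include <sys/time.h>

// Convert a struct timeval to microseconds.
static uint64_t ToMicroseconds(const timeval& aTv) {
  return uint64_t(aTv.tv_sec) * 1000000 + uint64_t(aTv.tv_usec);
}

// Read the user + system CPU time consumed so far, in microseconds.
// Returns false if getrusage fails.
static bool GetCpuTimeMicroseconds(uint64_t* aResult) {
#ifdef RUSAGE_THREAD
  const int who = RUSAGE_THREAD;  // Linux: current thread only
#else
  const int who = RUSAGE_SELF;    // elsewhere: whole process (coarser)
#endif
  rusage usage;
  if (getrusage(who, &usage) != 0) {
    return false;
  }
  *aResult = ToMicroseconds(usage.ru_utime) + ToMicroseconds(usage.ru_stime);
  return true;
}
```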
Just profiled on my Mac, with the stopwatch enabled, using Instruments and a sampling rate of 40µs.

Of the ~500ms of the test (I assume that's CPU time):
- 0.8ms were spent in `AutoStopwatch`;
- 0.2ms were spent in `nsPerformanceStats` (which should not be covered by the test but could interfere accidentally);
- 0.1ms in `Stopwatches::commit()`;
- 0.1ms in `Stopwatches::reset()`;
- 0ms on resizing storage.


If I deactivate all monitoring:
- 0.5ms were spent in `Stopwatches::reset()`, which is a bit surprising;
- 0.3ms were spent in `Stopwatches::commit()`, which is also surprising;
- 0.2ms were spent in `AutoStopwatch`;
- 0.2ms were spent in `nsPerformanceStats`.


Benchmark results are very similar. While I am surprised that the time spent in `Stopwatches::*` increases when we deactivate monitoring, nothing I see shows activity that should be measurable in any way.

That's with a single tab. Total time spent on `AutoStopwatch` increases a bit with two tabs open.
So far, I have no explanation for any slowdown. To make matters worse, the test does not seem reliable on my machine, as the average value seems to jump between ~100ms and ~300ms on the same instance of Firefox, with no major background task running that I know of.

Boris, in bug 1152930, how did you determine that the issue was caused by `getrusage`? If we can start from that, we might be able to get some progress.
Flags: needinfo?(bzbarsky)
So, I did another round of benchmarking. I ran each benchmark 5 times, and I didn't get much fluctuation in each version this time. I kept jank monitoring on in all cases.

* Nightly before bug 674779: 73ms.
* Nightly after bug 674779: 70ms.
* Nightly before bug 1181175: 101ms.
* Nightly after bug 1181175: 102ms.

So, while I see a huge increase over the last 6 months, neither this benchmarking nor my profiling in comment 15 can confirm that it is correlated with my patches. Am I missing something?
> Boris, in bug 1152930, how did you determine that the issue was caused by `getrusage`?

Profiling in Instruments.

So I just tried that again, with rev ab01fd91bc02 and rev 66945c5c4a40 (after and before bug 674779 respectively, I believe).  And now the profile shows very little getrusage in it.  :(  I have no idea why...

The huge increase you see may be from async stack stuff, if you're measuring in nightly.  Worth checking how the numbers look if you toggle "javascript.options.asyncstack" to false in about:config.
Flags: needinfo?(bzbarsky)
If I toggle off javascript.options.asyncstack, I shave off ~20ms. That is still ~10ms slower than around bug 674779, but not as dramatic.

So, at this stage, I am tempted to write this off as a false alarm. Do you concur, or did I forget something?
Flags: needinfo?(bzbarsky)
I know I saw a regression back when I filed bug 1152930.  I'm not seeing one now, and I have no idea why.  I agree that it's not clear what we can do with that, if anything.  :(
Flags: needinfo?(bzbarsky)
Well, performance has never been an exact science. I'll take today's measurements as a good sign, and we can always come back to this later if we observe another regression.
Depends on: 1206377
David, the AutoStopwatch ctor/dtor time I'm seeing in bug 1254240 accounts for about 1/6 of the main-thread time and about 1/3 of the time under JS::Call. Is that this bug, or should I file a separate one?
Flags: needinfo?(dteller)
I don't think that it is related to RDTSC, so I'd say new bug. That sounds pretty bad, though.
Flags: needinfo?(dteller)
about:performance is being redesigned; mass closing the bugs related to parts of the current about:performance page that we are not keeping.

Our goals with the redesign are to reduce the overhead caused by having the page opened, increase the reliability of the displayed information, and make the offered information actionable for most users. The back-end work is being tracked in bug 1419681.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → WONTFIX