Closed Bug 679427 Opened 8 years ago Closed 8 years ago
eventtracer: make responsiveness threshold a variable/parameter
Currently, the EventTracer uses a 50ms responsiveness threshold. We'd like the threshold to be a variable that can be set through a browser parameter to expand the testing options.
jlebar is out, but I have a question. Do we want to change everywhere I'm currently using 50ms, or just the threshold? (That is, should we be sending events every X ms, or just sending them every 50ms but printing data that's > x ms?)
In an ideal world, you'd send a new event as soon as the previous one returns, right? Otherwise, a 1ms pause followed by a 98ms pause would be recorded as a 1ms pause followed by a 49ms pause. (Or maybe the 98ms pause would be completely ignored, since the sequence of events was [1ms] [98ms] [notice <50ms has passed, insert sentinel into event loop]?) If sending one event right after the previous one finishes would spin the event loop too much, then what's the smallest delay we can have? 1ms? 5ms?
This entire approach is just sampling. I made a judgement call to fire sampling events at the frequency we wanted to detect. I don't know if that's the right call or not.
> This entire approach is just sampling. Meaning that you insert a sentinel every X ms, regardless of when the last sentinel finished? Does this mean that a 500ms delay is counted as 500ms, 450ms, 400ms, and so on?
No, there's only ever one tracing event in play at one time. We'll fire an event, wait for it, and then sleep for the remainder of the interval if it took < 50ms. If it took >50ms we fire another event right away. You can see the code here, it's pretty simple: http://mxr.mozilla.org/mozilla-central/source/toolkit/xre/EventTracer.cpp#118
Okay, that's sensible. I think that if possible, the sampling interval should be on the order of the slop you're willing to accept in the measurements, since sampling every Xms means that you may under-report a delay by up to Xms. So I think the interval should be a lot smaller than 50ms, if that doesn't cause problems of its own.
Justin and I discussed this on IRC. He thinks that if we want to detect 50ms periods of unresponsiveness, we probably want to be sampling at like 200Hz (so that the minimum sampling error is on the order of 5ms). My only worry is that sampling that heavily will slow down the application, so I proposed that we make both numbers configurable and we can run some tests.
This patch lets you set both the threshold (below which we won't report an unresponsive period) and the measurement interval (how often we send events) via environment variables. I pushed this patch to try, a link to the try builds should show up here when they're finished.
Try run for 9b9d49395ee7 is complete. Detailed breakdown of the results available here: http://tbpl.allizom.org/?tree=Try&usebuildbot=1&rev=9b9d49395ee7 Results (out of 4 total builds): success: 4 Builds available at http://firstname.lastname@example.org
I'd like to see this checked in. The defaults should be set to ping time 10ms, threshold 20 ms - as determined through test runs.
Okay, I've adjusted those constants.
Attachment #559440 - Flags: review?(justin.lebar+bug)
Attachment #556629 - Attachment is obsolete: true
Comment on attachment 559440 [details] [diff] [review] make event loop responsiveness threshold and interval configurable via environment variable > + * Set MOZ_INSTRUMENT_EVENT_LOOP_INTERVAL in the environment to an > + * integer number of milliseconds to change the maximum sampling frequency. > + * The default is at most every 10 milliseconds. Nit: Can you be more concrete about what this setting does? Maybe "We will not sample twice within LOOP_INTERVAL milliseconds." > + //TODO: only wait up to a maximum of interval, return > // early if that threshold is exceeded and dump a stack trace > // or do something else useful. Nit: This is a comma splice. Maybe s/,/;
Attachment #559440 - Flags: review?(justin.lebar+bug) → review+
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
I wonder if we can achieve something like this without inserting events into the native event loop. Can we tell when we run for so long that we don't go check the native event loop for more than X ms? The main problem with the approach in this bug is that we can't assign blame for a hiccup. By the time we run the inserted native event, we've already forgotten about everything which has run since the last time. Happy to move this discussion to a new bug, if it's a reasonable idea.
BenWa's already got a proof of concept of this hooked up to his SPS profiler in bug 653703.
You need to log in before you can comment on or make changes to this bug.