Closed Bug 1234618 Opened 10 years ago Closed 10 years ago

BHR doesn't seem to handle the idle case correctly

Categories

(Toolkit :: Telemetry, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla46
Tracking Status
firefox44 --- fixed
firefox45 --- fixed
firefox46 --- fixed
b2g-v2.5 --- fixed

People

(Reporter: mconley, Unassigned)

References

Details

Attachments

(1 file)

Noticed this while investigating bug 1223780. It seems that, regardless of whether or not e10s is enabled, we report microhangs for the main process if a caret is blinking and the process is idle. STR: 1) Open up a Firefox profile, and install chutten's statuser add-on[1] - this will add a toolbar item that will display how many chrome hangs have been reported to Telemetry. 2) Open about:home, and select the search input so that the caret is blinking. 3) Don't touch the mouse or keyboard. Just let the process go idle. ER: No hangs should be repoted. AR: Hangs are reported and the count goes up in the toolbar item pretty rapidly. jrmuizel has been working with me on this. We suspect the BHR is not handling the idle case correctly. I'll let him elaborate, since I think he's got more of the problem swapped into his head. [1]: https://github.com/chutten/statuser
Flags: needinfo?(jmuizelaar)
It looks when we call NotifyActivity after waiting for a long time we'll store mIntervalNow in currentThread->mInterval. Then we wake the Hang Monitor. It updates the mIntervalNow to the current time. Let's say 150 ms in the future We then compare the new mIntervalNow to currentThread->mInterval and make the assumption that we've been hanging for 150ms. It's not clear to me how this is supposed to be working.
Flags: needinfo?(nchen)
Does this happen on all platforms? In a nutshell, the hang monitor thread is supposed to keep track of updates to currentThread->mInterval, and expects mInterval to be updated regularly. When it sees that mInterval hasn't been updated for longer than the timeout value, it records the thread as hanging. When mInterval is finally updated again and the thread was hanging before, it calculates and records the hang duration. This mechanism is disabled if the thread called NotifyWait before the NotifyActivity call. Also, mIntervalNow is not the absolute current time. We add a time delta to mIntervalNow each time the hang monitor runs, provided the time delta is smaller than a certain value. However, if the system was sleeping, or if the hang monitor thread didn't get run frequently enough, we keep mIntervalNow the same, and let it get out-of-sync with the absolute time.
Flags: needinfo?(nchen)
(In reply to Jim Chen [:jchen] [:darchons] from comment #2) > Does this happen on all platforms? > This mechanism is disabled if the thread called NotifyWait before the > NotifyActivity call. Good point. If it is limited to specific platforms, we should audit their message pumps to ensure that we're not missing any of these calls. I previously found a missing call on the Mac in bug 1132305.
(In reply to Jim Chen [:jchen] [:darchons] from comment #2) > In a nutshell, the hang monitor thread is supposed to keep track of updates > to currentThread->mInterval, and expects mInterval to be updated regularly. Except it's not updated regularly. A bunch of threads have their wait time set to PR_INTERVAL_NO_TIMEOUT and so we divide this by 4 and set waitTime to that. This causes us to wait a very long but not PR_INTERVAL_NO_TIMEOUT amount of time and so we still update mIntervalNow and think that we've been hung for a long time.
Flags: needinfo?(jmuizelaar)
Attachment #8701195 - Flags: review?(nchen)
Comment on attachment 8701195 [details] [diff] [review] Fix misdetecting hangs Review of attachment 8701195 [details] [diff] [review]: ----------------------------------------------------------------- LGTM
Attachment #8701195 - Flags: review?(nchen) → review+
Flags: needinfo?(vladan.bugzilla)
Comment on attachment 8701195 [details] [diff] [review] Fix misdetecting hangs Approval Request Comment [Feature/regressing bug #]: The original BHR implementation [User impact if declined]: We have more inaccurate BHR reports. Not immediately visible. [Describe test coverage new/current, TreeHerder]: None [Risks and why]: Should be pretty low. The safety of the patch is pretty obvious from inspection.
Attachment #8701195 - Flags: approval-mozilla-beta?
Attachment #8701195 - Flags: approval-mozilla-aurora?
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla46
Thanks for fixing this, let's get it uplifted :)
Flags: needinfo?(vladan.bugzilla)
Comment on attachment 8701195 [details] [diff] [review] Fix misdetecting hangs Helps get more accurate telemetry data, seems like a good idea. Beta44+, Aurora45+
Attachment #8701195 - Flags: approval-mozilla-beta?
Attachment #8701195 - Flags: approval-mozilla-beta+
Attachment #8701195 - Flags: approval-mozilla-aurora?
Attachment #8701195 - Flags: approval-mozilla-aurora+
Wes, I think something might have gone wrong with the landing on beta. The patch was landed on the branch GECKO430b7_2015112614_RELBRANCH, which I don't think is in use. Can you confirm this is true and re-land if so?
Flags: needinfo?(wkocher)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: