BackgroundHangManager causes many wakeups
Categories: Core :: XPCOM, defect
People: Reporter: rvitillo, Assigned: florian
References: Blocks 1 open bug, Regressed 1 open bug
Keywords: power
Attachments: 2 files
Comment 14 • 3 years ago (Assignee)
(In reply to Benjamin Smedberg from comment #5)
> On Windows it should be possible to achieve this using a waitable timer:
> - have one timer object per tracked thread
> - when an event starts, call SetWaitableTimer
> - when an event is finished, call CancelWaitableTimer
> - the monitoring thread calls WaitForMultipleObjects with all of the timers
> In this case I believe the monitoring thread wakes up in the event of a thread hang and will never wake up otherwise.

I think it should be possible to implement something like this on all platforms using nsITimer. This will cause the timer thread to wake up, but waking up at a time when something is actually happening seems less of a problem to me than waking up the BHMgr Monitor thread multiple times when nothing else is happening.
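The scheme quoted above can be sketched in portable form. This is a hypothetical illustration only — Python's threading.Condition stands in for Win32 waitable timers, and the names (HangWatcher, event_started, etc.) are invented, not the actual BackgroundHangManager API. The point it demonstrates is the same: the monitor sleeps until a per-event deadline and only wakes if the tracked event overruns its hang budget.

```python
import threading
import time

# Hypothetical sketch (NOT the real BHR code) of the "wake only on hang"
# idea from comment 14, using a condition variable instead of Win32
# waitable timers.
class HangWatcher:
    def __init__(self, hang_budget=0.128):
        self._cond = threading.Condition()
        self._deadline = None          # None means "no event running"
        self._budget = hang_budget

    def event_started(self):           # analogue of SetWaitableTimer
        with self._cond:
            self._deadline = time.monotonic() + self._budget
            self._cond.notify()        # re-arm the monitor's wait

    def event_finished(self):          # analogue of CancelWaitableTimer
        with self._cond:
            self._deadline = None
            self._cond.notify()

    def wait_for_hang(self, timeout):
        """Block until an event overruns its budget (a hang) or `timeout`."""
        stop = time.monotonic() + timeout
        with self._cond:
            while True:
                now = time.monotonic()
                if now >= stop:
                    return False       # shutdown: no hang observed
                if self._deadline is None:
                    # Idle: sleep until shutdown, no periodic polling.
                    self._cond.wait(stop - now)
                elif now >= self._deadline:
                    return True        # event still running past its budget
                else:
                    # Sleep until the deadline; wakes early if the event
                    # finishes (event_finished notifies us).
                    self._cond.wait(self._deadline - now)

watcher = HangWatcher()

def worker():
    watcher.event_started()
    time.sleep(0.3)                    # simulated hang: exceeds 128ms budget
    watcher.event_finished()

t = threading.Thread(target=worker)
t.start()
hung = watcher.wait_for_hang(timeout=2.0)
t.join()
print("hang detected" if hung else "no hang")  # → hang detected
```

The key property is in `wait_for_hang`: when no event is running, the monitor blocks indefinitely instead of polling, which is exactly the wakeup reduction this bug is about.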
Comment 18 • 2 years ago
We're backing this out from m-c temporarily to see if it is the culprit for bug 1786242 and bug 1784591. We'll re-land it if the problem doesn't go away.
https://hg.mozilla.org/mozilla-central/rev/0c0bee0fee0853b836ea353cc7cbb5052ca22e02
Comment 20 • 2 years ago
Confirmed that this backout resolved the timeouts in bug 1784591. Will go back to the reporter of bug 1786242 to see if it resolves their issues too. Reopening this bug too to reflect the backout. That said, I don't think this needs to be backed out from Beta since BHR is Nightly-only, but I'm happy to do so if we're concerned about it being there still.
Comment 21 • 2 years ago
There are interesting details in bug 1786242 comment 34. As Kirk Steuber already identified, there seems to be a connection to the privacy.resistFingerprinting=true preference being set. In combination with this patch, the failure seems to happen regularly for tests like e.g. browser/base/content/test/general/browser_documentnavigation.js.

To make the reproduction steps even easier I've created a small Marionette test which replicates the hang pretty easily in both headful and headless mode - I don't even need a shippable Windows build anymore!
```python
from marionette_harness import MarionetteTestCase

class TestMinimizedTestCase(MarionetteTestCase):
    def test_startup_hang(self):
        self.marionette.enforce_gecko_prefs({"privacy.resistFingerprinting": True})
        self.marionette.quit(in_app=True)
        self.marionette.start_session()
```
Simply save that Python script to a file and run the following command:

./mach marionette-test -vv --gecko-log - --setpref="remote.log.truncate=true" %path-to-file% --run-until-failure

A proper Gecko profile is still hard to get due to startup deadlocks when profiling threads (bug 1782125). I'll keep trying and maybe at some point I'll get lucky...
Comment 22 • 2 years ago
(In reply to Henrik Skupin (:whimboo) [⌚️UTC+2] from comment #21)
> There are interesting details in bug 1786242 comment 34. As Kirk Steuber already identified, there seems to be a connection to the privacy.resistFingerprinting=true preference being set. In combination with this patch, the failure seems to happen regularly for tests like e.g. browser/base/content/test/general/browser_documentnavigation.js.

I was able to narrow the problem down further. When fingerprinting resistance is enabled, gfxPlatform::ForceSoftwareVsync() returns true. Changing that function to always return false avoids the problem related to privacy.resistFingerprinting=true. Changing it to always return true introduces the problem independent of fingerprinting resistance. With the change backed out, always returning true or false both work fine.
```cpp
static int LayoutFrameRateFromPrefs() {
  auto val = StaticPrefs::layout_frame_rate();
  if (StaticPrefs::privacy_resistFingerprinting()) {
    val = 60;
  }
  return val;
}

/* static */
bool gfxPlatform::ForceSoftwareVsync() {
  return LayoutFrameRateFromPrefs() > 0;
}
```
Comment 23 • 2 years ago (Assignee)
(In reply to Ryan VanderMeulen [:RyanVM] from comment #20)
> Confirmed that this backout resolved the timeouts in bug 1784591.

I've figured out enough to have a fixed version of the patch.

I noticed that the browser_canvas_rfp_exclusion.js test reproduced the problem more often, so I used it to test locally. Doug did too, and had noticed that removing the mTimer->InitHighResolutionWithCallback call in NotifyActivity "fixed" the bug. However, keeping the timer initialization and making the callback an empty function didn't help.

I managed to get a profile of the failing test https://share.firefox.dev/3RYNFLp and noticed we were adding many "0.000ms" BackgroundHangThread_timer timers. I investigated and it turned out these timers were actually TimeDuration::Forever (I think the 0.000ms value displayed in the profiler is an artifact of converting TimeDuration to a uint32_t to serialize into the profile).

I then noticed that the code before the patch had handling for the special TimeDuration::Forever value. Adding just a check for that special value before initializing the timers is enough to make the tests green.

Try run on a recent mozilla-central reproducing the bug 1784591 failures: https://treeherder.mozilla.org/jobs?repo=try&tier=1%2C2%2C3&revision=795a878aef594f484570c35cfe6be849b1a71bbc

Green try run with the new version of the patch: https://treeherder.mozilla.org/jobs?repo=try&tier=1%2C2%2C3&revision=564bea2fa434b33db59432c0b88f01bb80606fa8

I have done some more debugging on my local build to understand which threads were involved. In the parent process, 4 threads are registered with the hang monitor: the main thread (hang after 128ms, perma-hang after 8s), the compositor thread (hang after 128ms, perma-hang after 2s), the WindowsVsyncThread (Forever for both times), and (later, when running the browser_canvas_rfp_exclusion.js test) the SoftwareVsyncThread (Forever for both times). (Profile showing all of this: https://share.firefox.dev/3LrDU5W) This matches what comment 22 said about software vsync being involved.

Hang monitoring for these vsync threads is added at https://searchfox.org/mozilla-central/rev/6f9fdc5c3b164a46004c98a5baaf55b05e2ad329/ipc/chromium/src/base/message_pump_default.cc#28-30, but I'm not sure that makes sense. Maybe we should skip it entirely if we are not actually monitoring hangs for these threads.
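The fix described above — checking for the special TimeDuration::Forever value before arming the per-thread timers — can be illustrated with a small sketch. All names here are hypothetical (this is not the actual C++ patch), and Python's math.inf stands in for TimeDuration::Forever:

```python
import math

FOREVER = math.inf  # stand-in for mozilla::TimeDuration::Forever()

def schedule_hang_timer(timeout_ms, arm_timer):
    """Only arm a wakeup timer for threads with a finite hang budget.

    Threads registered with a Forever timeout (like the vsync threads in
    comment 23) must not arm a timer at all; naively converting Forever to
    an integer delay would otherwise produce a bogus value such as the
    "0.000ms" timers seen in the profile.
    """
    if math.isinf(timeout_ms):
        return False  # nothing to monitor: skip the timer entirely
    arm_timer(int(timeout_ms))
    return True

armed = []
schedule_hang_timer(128, armed.append)       # main thread: timer armed
schedule_hang_timer(FOREVER, armed.append)   # vsync thread: skipped
print(armed)  # → [128]
```

The design point is that the infinite-budget case is handled before any unit conversion happens, so the sentinel value never degrades into a real (and wrong) timer delay.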
Comment 26 • 2 years ago
The patch landed in nightly and beta is affected.

:florian, is this bug important enough to require an uplift?
- If yes, please nominate the patch for beta approval. Also, don't forget to request an uplift for the patches in the regressions caused by this fix.
- If no, please set status-firefox106 to wontfix.

For more information, please visit the auto_nag documentation.
Comment 27 • 2 years ago (Assignee)
I'm not sure if the right status is wontfix or disabled (given that BHR only runs on Nightly), but we definitely don't need to uplift this patch.