Closed Bug 965392 Opened 10 years ago Closed 2 years ago

BackgroundHangManager causes many wakeups

Categories

(Core :: XPCOM, defect)

x86
All
defect
Not set
normal

Tracking

()

RESOLVED FIXED
107 Branch
Tracking Status
firefox105 --- fixed
firefox106 --- disabled
firefox107 --- fixed

People

(Reporter: rvitillo, Assigned: florian)

References

(Blocks 1 open bug, Regressed 1 open bug, )

Details

(Keywords: power)

Attachments

(2 files)

Firefox seems to consume more energy while idling on google.com compared to other browsers (see URL and the attached plot).

There is a timer running every 250ms which accounts for some of the wakeups:
var b = (0, Date.now)(), m = function() {
    var c = (0, Date.now)(), d = c - b - a;
    0 > d && (d = 0);
    e[f] = d;
    f = (f + 1) % 20;
    1 == g && (b = c, h = window.setTimeout(m, a))
};
h = window.setTimeout(m, a)

On OSX Chrome seems to have about 4-7 idle wakeups/s while FF about 10-15.
The following are some of the wakeups I traced during a 60 seconds window:
241 nsGlobalWindow::TimerCallback()
83  nsJSEnvironment::CCTimerFired()
69  nsSocketTransportService::Poll()
60  Watchdog::Sleep() (returning)
Depends on: 965348
No longer depends on: 965348
Depends on: 965348
The main culprit for the wakeups seems to be BackgroundHangManager::RunMonitorThread(). Disabling the BgHangManager thread brings the number of wakeups down to about 8 on the Google homepage.

The issue can be reproduced with the following testcase:

<!DOCTYPE html>
    <head>
        <script>
            function foo(){
                window.setTimeout(foo, 250);
            }
            window.setTimeout(foo, 250);
        </script>
    </head>
</html>
Summary: High power consumption on Google while idling → BackgroundHangManager causes many wakeups
Component: General → XPCOM
Product: Firefox → Core
Is it possible to schedule a thread to wake up in N seconds (and subsequently un-schedule that) so that this thread will only wake up if there is in fact a hang that needs to be reported?
Flags: needinfo?(nchen)
Thanks for tracking this down, Roberto. I have some questions,

* How are these measurements made?
* How is a "wake up" defined? A physical core goes from suspended to running? A thread context switches from sleeping to running?
* What is the number of wake ups before you disabled the BgHangManager thread?
* How does the number of wake ups translate to power consumption? What's the power consumption before and after you disabled the BgHangManager thread?

Thanks!


(In reply to Benjamin Smedberg  [:bsmedberg] from comment #3)
> Is it possible to schedule a thread to wake up in N seconds (and
> subsequently un-schedule that) so that this thread will only wake up if
> there is in fact a hang that needs to be reported?

To avoid threads racing to schedule/unschedule the monitor thread, I think all scheduling should be done inside the monitor thread. Unfortunately that means the monitor thread may end up waking up more. There may be some optimizations we can do here though.
Flags: needinfo?(nchen) → needinfo?(rvitillo)
On Windows it should be possible to achieve this using a waitable timer:

* have one timer object per tracked thread
* when an event starts, call SetWaitableTimer
* when an event is finished, call CancelWaitableTimer

* the monitoring thread calls WaitForMultipleObjects with all of the timers

In this case I believe the monitoring thread wakes up in the event of a thread hang and will never wake up otherwise.

I really don't know whether other OSes have similar mechanisms.
(In reply to Jim Chen [:jchen :nchen] from comment #4)
> Thanks for tracking this down, Roberto. I have some questions,
> 
> * How are these measurements made?

With Power Gadget, powertop on Linux and Activity Monitor On OSX.

> * How is a "wake up" defined? A physical core goes from suspended to
> running? A thread context switches from sleeping to running?

A wakeup is a change of the CPU’s C-state from C(N!=0) to C0. If process A is responsible for the wakeup (e.g. timer firing) then that wakeup is attributed to A.

> * What is the number of wakeups before you disabled the BgHangManager
> thread?

The number of wakeups ranged from 10 to 15 on Activity Monitor.

> * How does the number of wake ups translate to power consumption? 

That depends on the frequency of the wakeups and the amount of work performed by the process. There is no hard and fast rule. Usually reducing the number of wakeups allows the CPU to stay longer in deep power saving c-states.

> What's the power consumption before and after you disabled the BgHangManager thread?

Without the BgHangManager thread the energy impact (Activity Monitor) is about 0.9 while with the BgHangManager it’s about 1.17.
Flags: needinfo?(rvitillo)
OS: Mac OS X → All
Whiteboard: [Power]
I just measured the test case in comment 2 using NSPR_LOG_MODULES=TimerFirings:4 and some post-processing:

> 645 counts:
> ( 1)    234 (36.3%, 36.3%): 1939817216[109040000]: [42049]    fn timer (ONE_SHOT   250 ms): [content] file:///Users/njn/a.html:5:16
> ( 2)    125 (19.4%, 55.7%): 1939817216[109040000]: [42049]    fn timer (SLACK      250 ms): CCTimerFired
> ( 3)     98 (15.2%, 70.9%): 1939817216[100445640]: [42048]    fn timer (SLACK      250 ms): CCTimerFired
> ( 4)     39 ( 6.0%, 76.9%): 1939817216[100445640]: [42048]    fn timer (SLACK      100 ms): LayerActivityTracker
> ( 5)     17 ( 2.6%, 79.5%): 272842752[10e895fc0]: [42048]    fn timer (ONE_SHOT  5000 ms): dom::workers::DummyCallback(2)
> ( 6)     12 ( 1.9%, 81.4%): 1939817216[100445640]: [42048]    fn timer (ONE_SHOT   250 ms): PresShell::sPaintSuppressionCallback
> ( 7)      9 ( 1.4%, 82.8%): 1939817216[100445640]: [42048]    fn timer (SLACK     1000 ms): nsIDocument::SelectorCache
> ( 8)      7 ( 1.1%, 83.9%): 1939817216[100445640]: [42048]    fn timer (ONE_SHOT   100 ms): InterSliceGCTimerFired
> ( 9)      6 ( 0.9%, 84.8%): 1939817216[100445640]: [42048]    fn timer (SLACK     1000 ms): TileExpiry
> (10)      5 ( 0.8%, 85.6%): 1939817216[109040000]: [42049]    fn timer (SLACK      100 ms): LayerActivityTracker
> (11)      4 ( 0.6%, 86.2%): 1939817216[100445640]: [42048] iface timer (SLACK    15000 ms): 11d6cf1c0
> (12)      4 ( 0.6%, 86.8%): 1939817216[100445640]: [42048]    fn timer (SLACK    10000 ms): GradientCache
> (13)      4 ( 0.6%, 87.4%): 1939817216[100445640]: [42048]    fn timer (SLACK       32 ms): ICCTimerFired
> (14)      3 ( 0.5%, 87.9%): 1939817216[100445640]: [42048]    fn timer (SLACK    10000 ms): imgCacheExpirationTracker
> (15)      3 ( 0.5%, 88.4%): 1939817216[109040000]: [42049]    fn timer (SLACK    10000 ms): RuleProcessorCache::ExpirationTracker
> (16)      3 ( 0.5%, 88.8%): 1939817216[100445640]: [42048]    fn timer (SLACK    10000 ms): RuleProcessorCache::ExpirationTracker
> (17)      3 ( 0.5%, 89.3%): 1939817216[100445640]: [42048]    fn timer (SLACK    10000 ms): FrameTextRunCache
> (18)      2 ( 0.3%, 89.6%): 1939817216[100445640]: [42048]    fn timer (ONE_SHOT   450 ms): ScrollbarActivity::FadeBeginTimerFired
> (19)      2 ( 0.3%, 89.9%): 1939817216[100445640]: [42048] iface timer (ONE_SHOT  5000 ms): 113897c00
> (20)      2 ( 0.3%, 90.2%): 1939817216[100445640]: [42048]    fn timer (ONE_SHOT     0 ms): [content] chrome://browser/content/browser.js:12135:8

I don't see any sign of the BackgroundHangManager. Roberto, am I missing something? How did you determine that the BackgroundHangManager was significant here?
Flags: needinfo?(rvitillo)
The last time I checked this was well over a year ago so I am not surprised that things might have changed. Given your report I say we can close this bug.
Flags: needinfo?(rvitillo)
Also, please see Comment 6 on how I attributed the wakeups to the BgHangManager, i.e. I measured the energy impact with and without BgHangManager.
I remeasured the test case from comment 2 on Mac with and without the BHM enabled, and I couldn't see any notable differences in power consumption or wakeup frequency.

I also instrumented the loop in BackgroundHangManager::RunMonitorThread(), which comment 2 says is the cause of the original problem. It triggered a few times on start-up but I couldn't get it to trigger again after that.

BHM is currently disabled on Beta and Release, and this bug is cited as the reason:

> For now, we should be getting enough data from Nightly/Aurora channels, and the hang monitor has
> non-trivial power consumption in certain idle cases, so it makes sense to disable it on Release
> and Beta.

This bug is the source of the power consumption claim. I don't know if the "enough data" part is still true.

Either way, it seems like this bug can be closed. Whether or not there should be a follow-up bug to re-enable the BHM for Beta and maybe Release, I don't know. jchen, any thoughts?
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → WORKSFORME
NI'ing jchen for the above comment
Flags: needinfo?(nchen)
(In reply to Nicholas Nethercote [:njn] from comment #10)
> 
> Either way, it seems like this bug can be closed. Whether or not there
> should be a follow-up bug to re-enable the BHM for Beta and maybe Release, I
> don't know. jchen, any thoughts?

Closing this bug sounds reasonable. BHM is actually enabled in 1% of Beta builds, and I think we probably won't enable it for all of Beta or Release just because we won't be able to process all the data.
Flags: needinfo?(nchen)
Thank you for the confirmation.

(In reply to Benjamin Smedberg from comment #5)

On Windows it should be possible to achieve this using a waitable timer:

  • have one timer object per tracked thread

  • when an event starts, call SetWaitableTimer

  • when an event is finished, call CancelWaitableTimer

  • the monitoring thread calls WaitForMultipleObjects with all of the timers

In this case I believe the monitoring thread wakes up in the event of a
thread hang and will never wake up otherwise.

I think it should be possible to implement something like this on all platforms using nsITimer. This will cause the Timer thread to wakeup, but waking up at the time something is actually happening seems less of a problem to me than waking up the BHMgr Monitor thread multiple times when nothing else is happening.

Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
Blocks: power-usage
No longer blocks: power
Whiteboard: [Power]
Assignee: nobody → florian
Pushed by fqueze@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/9ef37c6da56b
Use a timer instead of a condvar to run the BHMgr Monitor thread, r=dthayer.
Status: REOPENED → RESOLVED
Closed: 9 years ago2 years ago
Resolution: --- → FIXED
Target Milestone: --- → 105 Branch

We're backing this out from m-c temporarily to see if it is the culprit for bug 1786242 and bug 1784591. We'll re-land it if the problem doesn't go away.
https://hg.mozilla.org/mozilla-central/rev/0c0bee0fee0853b836ea353cc7cbb5052ca22e02

Confirmed that this backout resolved the timeouts in bug 1784591. Will go back to the reporter of bug 1786242 to see if it resolves their issues too. Reopening this bug too to reflect the backout. That said, I don't think this needs to be backed out from Beta since BHR is Nightly-only, but I'm happy to do so if we're concerned about it being there still.

Status: RESOLVED → REOPENED
Flags: needinfo?(florian)
Resolution: FIXED → ---
Regressions: 1784591

There are interesting details in bug 1786242 comment 34. As Kirk Steuber already identified before there seems to be a connection to the privacy.resistFingerprinting=true preference set. In combination with this patch the failure seem to regularly happen for tests like eg. browser/base/content/test/general/browser_documentnavigation.js.

To make the reproduction steps even easier I've created a small Marionette test which replicates the hang pretty easily in both headful and headless mode - I don't even need a shippable Windows build anymore!

from marionette_harness import MarionetteTestCase

class TestMinimizedTestCase(MarionetteTestCase):

    def test_startup_hang(self):
        self.marionette.enforce_gecko_prefs({"privacy.resistFingerprinting": True})
        self.marionette.quit(in_app=True)
        self.marionette.start_session()

Simply save that Python script to a file and run the following command:

./mach marionette-test -vv --gecko-log - --setpref="remote.log.truncate=true" %path-to-file% --run-until-failure

Getting a proper Gecko profile is still hard to get due to startup deadlocks when profiling threads (bug 1782125). I'll continue trying and maybe at some point I'm lucky...

See Also: → 1788691

(In reply to Henrik Skupin (:whimboo) [⌚️UTC+2] from comment #21)

There are interesting details in bug 1786242 comment 34. As Kirk Steuber already identified before there seems to be a connection to the privacy.resistFingerprinting=true preference set. In combination with this patch the failure seem to regularly happen for tests like eg. browser/base/content/test/general/browser_documentnavigation.js.

I was able to narrow the problem further down. When fingerprinting resistance is enabled gfxPlatform::ForceSoftwareVsync() returns true. Changing that function to always return false avoids the problem related to privacy.resistFingerprinting=true. Changing that function to always return true introduces the problem independent of fingerprinting resistance. With the change backed out, always returning true or false both work fine.

https://searchfox.org/mozilla-central/rev/25ec642ed33ca83f25e88ec0f9f27e8ad8a29e24/gfx/thebes/gfxPlatform.cpp#3180-3191

static int LayoutFrameRateFromPrefs() {
  auto val = StaticPrefs::layout_frame_rate();
  if (StaticPrefs::privacy_resistFingerprinting()) {
    val = 60;
  }
  return val;
}

/* static */
bool gfxPlatform::ForceSoftwareVsync() {
  return LayoutFrameRateFromPrefs() > 0;
}

(In reply to Ryan VanderMeulen [:RyanVM] from comment #20)

Confirmed that this backout resolved the timeouts in bug 1784591.

I've figured out enough to be able to have a fixed version of the patch.

I noticed that the browser_canvas_rfp_exclusion.js test reproduced more often, so I used it to test locally. Doug did too, and had noticed that removing the mTimer->InitHighResolutionWithCallback call in NotifyActivity "fixed" the bug. However, keeping the timer initialization and making the callback an empty function didn't help.

I managed to get a profile of the failing test https://share.firefox.dev/3RYNFLp and noticed we were adding many "0.000ms" BackgroundHangThread_timer timers. I investigated that, and it turned out these timers were actually TimeDuration::Forever (I think the 0.000ms value displayed in the profiler is an artifact of converting TimeDuration to a uint32_t to serialize in the profile).

I then noticed that the code before the patch had handling of the special TimeDuration::Forever value. Adding just a check for that special value before initializing the timers is enough to make the tests green.
Try run on a recent mozilla-central reproducing the bug 1784591 failures: https://treeherder.mozilla.org/jobs?repo=try&tier=1%2C2%2C3&revision=795a878aef594f484570c35cfe6be849b1a71bbc
Green try run with the new version of the patch: https://treeherder.mozilla.org/jobs?repo=try&tier=1%2C2%2C3&revision=564bea2fa434b33db59432c0b88f01bb80606fa8

I have done some more debugging on my local build to understand which threads were involved. On the parent process, 4 threads are registered with the hang monitor: the main thread (hang after 128ms, perma hang after 8s), the compositor thread (hang after 128ms, perma hang after 2s), the WindowsVsyncThread (Forever for both times) and (later when running the browser_canvas_rfp_exclusion.js test) the SoftwareVsyncThread (Forever for both times). (Profile showing all of this: https://share.firefox.dev/3LrDU5W) This matches what comment 22 said about software vsync being involved.

Hang monitoring for these vsync threads is added at https://searchfox.org/mozilla-central/rev/6f9fdc5c3b164a46004c98a5baaf55b05e2ad329/ipc/chromium/src/base/message_pump_default.cc#28-30, but I'm not sure that makes sense. Maybe we should skip that entirely if we are not actually monitoring hangs for these threads.

Flags: needinfo?(florian)
Pushed by fqueze@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/96644f389a58
Use a timer instead of a condvar to run the BHMgr Monitor thread, r=dthayer.
Status: REOPENED → RESOLVED
Closed: 2 years ago2 years ago
Resolution: --- → FIXED

The patch landed in nightly and beta is affected.
:florian, is this bug important enough to require an uplift?

  • If yes, please nominate the patch for beta approval.Also, don't forget to request an uplift for the patches in the regressions caused by this fix.
  • If no, please set status-firefox106 to wontfix.

For more information, please visit auto_nag documentation.

Flags: needinfo?(florian)

I'm not sure if the right status is wontfix or disabled (given that BHR only runs on Nightly), but we definitely don't need to uplift this patch.

Flags: needinfo?(florian)
Regressions: 1791678
Regressions: 1793417
Target Milestone: 105 Branch → 107 Branch
Regressions: 1817498
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: