BackgroundHangManager causes many wakeups
Categories: Core :: XPCOM, defect
People: Reporter: rvitillo, Assigned: florian
References: Blocks 1 open bug, Regressed 1 open bug
Keywords: power
Attachments: 2 files
Comment 14 • 3 years ago (Assignee)
(In reply to Benjamin Smedberg from comment #5)
> On Windows it should be possible to achieve this using a waitable timer:
> - have one timer object per tracked thread
> - when an event starts, call SetWaitableTimer
> - when an event is finished, call CancelWaitableTimer
> - the monitoring thread calls WaitForMultipleObjects with all of the timers
> In this case I believe the monitoring thread wakes up in the event of a thread hang and will never wake up otherwise.

I think it should be possible to implement something like this on all platforms using nsITimer. This will cause the timer thread to wake up, but waking up at a time when something is actually happening seems less of a problem to me than waking up the BHMgr Monitor thread multiple times when nothing else is happening.
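The scheme quoted above can be sketched in portable form. This is a hypothetical illustration only — Python's threading.Condition stands in for Win32 waitable timers, and the names (HangWatcher, event_started, etc.) are invented, not the actual BackgroundHangManager API. The point it demonstrates is the same: the monitor sleeps until a per-event deadline and only wakes if the tracked event overruns its hang budget.

```python
import threading
import time

# Hypothetical sketch (NOT the real BHR code) of the "wake only on hang"
# idea from comment 14, using a condition variable instead of Win32
# waitable timers.
class HangWatcher:
    def __init__(self, hang_budget=0.128):
        self._cond = threading.Condition()
        self._deadline = None          # None means "no event running"
        self._budget = hang_budget

    def event_started(self):           # analogue of SetWaitableTimer
        with self._cond:
            self._deadline = time.monotonic() + self._budget
            self._cond.notify()        # re-arm the monitor's wait

    def event_finished(self):          # analogue of CancelWaitableTimer
        with self._cond:
            self._deadline = None
            self._cond.notify()

    def wait_for_hang(self, timeout):
        """Block until an event overruns its budget (a hang) or `timeout`."""
        stop = time.monotonic() + timeout
        with self._cond:
            while True:
                now = time.monotonic()
                if now >= stop:
                    return False       # shutdown: no hang observed
                if self._deadline is None:
                    # Idle: sleep until shutdown, no periodic polling.
                    self._cond.wait(stop - now)
                elif now >= self._deadline:
                    return True        # event still running past its budget
                else:
                    # Sleep until the deadline; wakes early if the event
                    # finishes (event_finished notifies us).
                    self._cond.wait(self._deadline - now)

watcher = HangWatcher()

def worker():
    watcher.event_started()
    time.sleep(0.3)                    # simulated hang: exceeds 128ms budget
    watcher.event_finished()

t = threading.Thread(target=worker)
t.start()
hung = watcher.wait_for_hang(timeout=2.0)
t.join()
print("hang detected" if hung else "no hang")  # → hang detected
```

The key property is in `wait_for_hang`: when no event is running, the monitor blocks indefinitely instead of polling, which is exactly the wakeup reduction this bug is about.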
Comment 18 • 2 years ago
We're backing this out from m-c temporarily to see if it is the culprit for bug 1786242 and bug 1784591. We'll re-land it if the problem doesn't go away.
https://hg.mozilla.org/mozilla-central/rev/0c0bee0fee0853b836ea353cc7cbb5052ca22e02
Comment 20 • 2 years ago
Confirmed that this backout resolved the timeouts in bug 1784591. Will go back to the reporter of bug 1786242 to see if it resolves their issues too. Reopening this bug too to reflect the backout. That said, I don't think this needs to be backed out from Beta since BHR is Nightly-only, but I'm happy to do so if we're concerned about it being there still.
Comment 21 • 2 years ago
There are interesting details in bug 1786242 comment 34. As Kirk Steuber already identified, there seems to be a connection to the privacy.resistFingerprinting=true preference being set. In combination with this patch, the failure seems to happen regularly for tests like e.g. browser/base/content/test/general/browser_documentnavigation.js.

To make the reproduction steps even easier I've created a small Marionette test which replicates the hang pretty easily in both headful and headless mode - I don't even need a shippable Windows build anymore!
```python
from marionette_harness import MarionetteTestCase

class TestMinimizedTestCase(MarionetteTestCase):
    def test_startup_hang(self):
        self.marionette.enforce_gecko_prefs({"privacy.resistFingerprinting": True})
        self.marionette.quit(in_app=True)
        self.marionette.start_session()
```
Simply save that Python script to a file and run the following command:

./mach marionette-test -vv --gecko-log - --setpref="remote.log.truncate=true" %path-to-file% --run-until-failure

A proper Gecko profile is still hard to get due to startup deadlocks when profiling threads (bug 1782125). I'll keep trying and maybe at some point I'll get lucky...
Comment 22 • 2 years ago
(In reply to Henrik Skupin (:whimboo) [⌚️UTC+2] from comment #21)
> There are interesting details in bug 1786242 comment 34. As Kirk Steuber already identified, there seems to be a connection to the privacy.resistFingerprinting=true preference being set. In combination with this patch, the failure seems to happen regularly for tests like e.g. browser/base/content/test/general/browser_documentnavigation.js.

I was able to narrow the problem down further. When fingerprinting resistance is enabled, gfxPlatform::ForceSoftwareVsync() returns true. Changing that function to always return false avoids the problem related to privacy.resistFingerprinting=true. Changing it to always return true introduces the problem independent of fingerprinting resistance. With the change backed out, always returning true or false both work fine.
```cpp
static int LayoutFrameRateFromPrefs() {
  auto val = StaticPrefs::layout_frame_rate();
  if (StaticPrefs::privacy_resistFingerprinting()) {
    val = 60;
  }
  return val;
}

/* static */
bool gfxPlatform::ForceSoftwareVsync() {
  return LayoutFrameRateFromPrefs() > 0;
}
```
Comment 23 • 2 years ago (Assignee)
(In reply to Ryan VanderMeulen [:RyanVM] from comment #20)
> Confirmed that this backout resolved the timeouts in bug 1784591.

I've figured out enough to have a fixed version of the patch.

I noticed that the browser_canvas_rfp_exclusion.js test reproduced the problem more often, so I used it to test locally. Doug did too, and had noticed that removing the mTimer->InitHighResolutionWithCallback call in NotifyActivity "fixed" the bug. However, keeping the timer initialization and making the callback an empty function didn't help.

I managed to get a profile of the failing test https://share.firefox.dev/3RYNFLp and noticed we were adding many "0.000ms" BackgroundHangThread_timer timers. I investigated and it turned out these timers were actually TimeDuration::Forever (I think the 0.000ms value displayed in the profiler is an artifact of converting TimeDuration to a uint32_t to serialize into the profile).

I then noticed that the code before the patch had handling for the special TimeDuration::Forever value. Adding just a check for that special value before initializing the timers is enough to make the tests green.

Try run on a recent mozilla-central reproducing the bug 1784591 failures: https://treeherder.mozilla.org/jobs?repo=try&tier=1%2C2%2C3&revision=795a878aef594f484570c35cfe6be849b1a71bbc

Green try run with the new version of the patch: https://treeherder.mozilla.org/jobs?repo=try&tier=1%2C2%2C3&revision=564bea2fa434b33db59432c0b88f01bb80606fa8

I have done some more debugging on my local build to understand which threads were involved. In the parent process, 4 threads are registered with the hang monitor: the main thread (hang after 128ms, perma-hang after 8s), the compositor thread (hang after 128ms, perma-hang after 2s), the WindowsVsyncThread (Forever for both times), and (later, when running the browser_canvas_rfp_exclusion.js test) the SoftwareVsyncThread (Forever for both times). (Profile showing all of this: https://share.firefox.dev/3LrDU5W) This matches what comment 22 said about software vsync being involved.

Hang monitoring for these vsync threads is added at https://searchfox.org/mozilla-central/rev/6f9fdc5c3b164a46004c98a5baaf55b05e2ad329/ipc/chromium/src/base/message_pump_default.cc#28-30, but I'm not sure that makes sense. Maybe we should skip it entirely if we are not actually monitoring hangs for these threads.
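The fix described above — checking for the special TimeDuration::Forever value before arming the per-thread timers — can be illustrated with a small sketch. All names here are hypothetical (this is not the actual C++ patch), and Python's math.inf stands in for TimeDuration::Forever:

```python
import math

FOREVER = math.inf  # stand-in for mozilla::TimeDuration::Forever()

def schedule_hang_timer(timeout_ms, arm_timer):
    """Only arm a wakeup timer for threads with a finite hang budget.

    Threads registered with a Forever timeout (like the vsync threads in
    comment 23) must not arm a timer at all; naively converting Forever to
    an integer delay would otherwise produce a bogus value such as the
    "0.000ms" timers seen in the profile.
    """
    if math.isinf(timeout_ms):
        return False  # nothing to monitor: skip the timer entirely
    arm_timer(int(timeout_ms))
    return True

armed = []
schedule_hang_timer(128, armed.append)       # main thread: timer armed
schedule_hang_timer(FOREVER, armed.append)   # vsync thread: skipped
print(armed)  # → [128]
```

The design point is that the infinite-budget case is handled before any unit conversion happens, so the sentinel value never degrades into a real (and wrong) timer delay.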
Comment 26 • 2 years ago
The patch landed in nightly and beta is affected.

:florian, is this bug important enough to require an uplift?
- If yes, please nominate the patch for beta approval. Also, don't forget to request an uplift for the patches in the regressions caused by this fix.
- If no, please set status-firefox106 to wontfix.

For more information, please visit the auto_nag documentation.
Comment 27 • 2 years ago (Assignee)
I'm not sure if the right status is wontfix or disabled (given that BHR only runs on Nightly), but we definitely don't need to uplift this patch.