Closed Bug 1173140 Opened 9 years ago Closed 9 years ago

[Shinano][Aries] Battery life regression

Categories

(Firefox OS Graveyard :: General, defect)

ARM
Gonk (Firefox OS)
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: gerard-majax, Unassigned)

References

Details

I cannot be more precise, but with master from a couple of days I have noticed that the battery does not last as much as my builds from ~10-15 days ago. That's with mostly the very same usage. I would say the battery life went from 4 full days to 2.5/3.
Flags: needinfo?(jhylands)
Jhylands, is there anything we can do for the aries device for battery life performance/regression range testing?
I don't know what an aries device is, but I don't have one, and don't have a harness for one, so there isn't much I can do to help.

I haven't noticed any major regressions on the flame with my automated testing.
Flags: needinfo?(jhylands)
(In reply to Jon Hylands [:jhylands] from comment #2)
...
> I haven't noticed any major regressions on the flame with my automated
> testing.

I experience a drop in battery lifetime since I moved from v18D_nightly to c18D_nightly_v2 and Build 20150615010203 (commit 1bf2da10)
bug 1178869 might be indicating that it's not just a Aries issue.
Here are some more reports :bug 1137066, bug 1135487, bug 1135466, bug 1133510, bug 1130725, bug 1121029, 
Some people have been experiencing battery loss in 2.0 : bug 1097083, bug 1095846
(In reply to Naoki Hirata :nhirata (please use needinfo instead of cc) from comment #4)
> bug 1178869 might be indicating that it's not just a Aries issue.

Interesting, since the build there would more or less match the timeframe I had in mind when doing my first report.
With build efbea75178d2	Zibi Braniecki — Bug 1163245 - [Battery][Gonk] Implementation of battery discharging remaining time. r=dhylands

            NAME   PID WAKEUPS      RUNTIME NEW
             b2g   208    1000      2210 ms  no
          (Nuwa)   334       1         0 ms  no
      Homescreen  1166      28        33 ms  no
  Find My Device  2966      42        37 ms  no
          Camera  3126     181        74 ms  no
        Calendar  3886     289       821 ms  no

~ 3 hrs of time span.  Looks like email got killed.

Gecko inbound : dfeb48f0d492
            NAME   PID WAKEUPS      RUNTIME NEW
             b2g   209    1099      2497 ms  no
          (Nuwa)   445       4         1 ms  no
      Homescreen  1136      29         6 ms  no
  Find My Device  3315      36         6 ms  no
        Calendar  5532      83       206 ms  no
~ 3 hrs as well.  Not sure if it makes that much of an impact, there seems to be a regression with Calendar app?

http://hg.mozilla.org/integration/b2g-inbound/pushloghtml?fromchange=dfeb48f0d492&tochange=efbea75178d2

still collecting data.
w/ yesterday's build : dfeb48f0d492

            NAME   PID WAKEUPS      RUNTIME NEW
             b2g   207     821      2712 ms  no
          (Nuwa)   451       2       168 ms  no
      Homescreen  3326      22         4 ms  no
        Calendar  4148     106       186 ms  no
 (Preallocated a 17526     116       859 ms yes

I made a small script to make sure I get 3 hrs.
I'm not convinced that this method would help us get the information we want to figure out the battery drain issue...
gecko rev : d017bfb96f62af16f6a9af6eadf105c959ac81e7

            NAME   PID WAKEUPS      RUNTIME NEW
             b2g   209     474      1377 ms  no
          (Nuwa)   446       0         0 ms  no
  Find My Device  3675      57        13 ms  no
      Homescreen  4239      58        14 ms  no
        Calendar  4760     380       834 ms  no
 (Preallocated a  4768       4         1 ms  no
gecko rev: 9ea351d4ded09b85a026413d8f0613430dc6e335
             b2g   208     902      4617 ms  no
          (Nuwa)   380       4         7 ms  no
  Find My Device  1735      49        15 ms  no
      Homescreen  3252      43        10 ms  no
        Calendar  4166     226       407 ms  no
 (Preallocated a 12180      84       403 ms yes

Note: the difference here is I also had launched the usage app.  I'm not sure if that affected the b2g app to run more cycles...
Gabriele, I'm not sure if the power comsumption by app is visible with the trace.sh tool.  Perhaps I'm looking at it wrong.  I posted some numbers here as comparisons.  How do I tell if something is too far out of bounds?  Since these are all in ms, it doesn't seem completely off to the point where it would highly drain the battery...  Could I get some help here?

Does this also mean that we need some API in gecko around battery like Android SDK has?  I think we need a little more tooling around battery/power consumption...
Flags: needinfo?(gsvelto)
This looks like the same issue I'm bisecting in bug 1130725. The most relevant number when running my scripts is the number of wakeups. In comment 7 you're seeing 1000+ wakeups while the phone is idle (I'm assuming you did all the testing with the screen off and waiting for the phone to settle down after flashing it), that's a lot for a phone that's in standby. I'd expect an order of magnitude less wakeups in a 3h period for a phone that's in standby.
Flags: needinfo?(gsvelto)
Thanks for the information.  I wasn't sure if 5 minutes was enough for a good sample size; I was afraid of getting a lot of noise and variation.

Yes.  After launching different apps that I was thinking might be suspect, I then set the phone off and then let it settle a little bit before I ran my small script.

I suspect that usage app running the background and the email app running in the background may inflate the b2g wakeup cycles.  On my dogfood device, I noticed that I ran out of battery power faster; I had pluotsorbet open in the background. ( ~ 2 days and just sleeping )

Build ID               20150619225606
Gaia Revision          4c06ed88ddccaba8dc941e5006bd2a9e57306f07
Gaia Date              2015-06-19 22:17:18
Gecko Revision         7c1a6b1151a1539186b950a144387e2d7f378d1b

I suspect that there's two battery life regressions, that there are some apps that we might want to investigate and some best practices that we might want to mention to make awareness of having people save battery life.
I've dug further into the kernel trace and with a completely idle phone (freshly flashed, no network connection and all applications in their original, unconfigured state) I see this pattern repeating every 100ms:

          <idle>-0     [000] d.h4 13887.660015: sched_wakeup: comm=Timer pid=2128 prio=120 success=1 target_cpu=000
           Timer-2128  [000] d..5 13887.660145: sched_stat_runtime: comm=Timer pid=2128 runtime=136198 [ns] vruntime=1100766428723 [ns]
           Timer-2128  [000] d..5 13887.660158: sched_wakeup: comm=b2g pid=2085 prio=120 success=1 target_cpu=000
           Timer-2128  [000] d..3 13887.660223: sched_stat_runtime: comm=Timer pid=2128 runtime=77292 [ns] vruntime=1100766506015 [ns]
             b2g-2085  [000] d..4 13887.660339: sched_stat_runtime: comm=b2g pid=2085 runtime=116614 [ns] vruntime=1100766756273 [ns]
             b2g-2085  [000] d..4 13887.660351: sched_wakeup: comm=BgHangManager pid=2122 prio=121 success=1 target_cpu=000
             b2g-2085  [000] d..4 13887.660482: sched_stat_runtime: comm=b2g pid=2085 runtime=142188 [ns] vruntime=1100766898461 [ns]
             b2g-2085  [000] d..4 13887.660495: sched_wakeup: comm=Timer pid=2128 prio=120 success=1 target_cpu=000
             b2g-2085  [000] d..3 13887.660826: sched_stat_runtime: comm=b2g pid=2085 runtime=343489 [ns] vruntime=1100767241950 [ns]
   BgHangManager-2122  [000] d..3 13887.660900: sched_stat_runtime: comm=BgHangManager pid=2122 runtime=74636 [ns] vruntime=1100766534306 [ns]
           Timer-2128  [000] d..3 13887.660962: sched_stat_runtime: comm=Timer pid=2128 runtime=63073 [ns] vruntime=1100766569088 [ns]

I read it as follows: after an idle interval the timer thread in the main process is waking up, waking up the main process' main thread (probably by posting a task on the main thread's task queue) which in turn wakes up the BgHangManager thread. I'm not familiar with this code yet but I'll try digging further to figure out what's going on.
Gabriele, look at https://bugzilla.mozilla.org/show_bug.cgi?id=970358#c0 . Could it be that we have a misconfigured build?
(In reply to Fabrice Desré [:fabrice] from comment #16)
> Gabriele, look at https://bugzilla.mozilla.org/show_bug.cgi?id=970358#c0 .
> Could it be that we have a misconfigured build?

Thanks for the tip, I have an engineering build on my device though so I might be chasing the wrong issue. I'll try getting a trace with a plain user build.
(In reply to Gabriele Svelto [:gsvelto] from comment #17)
> (In reply to Fabrice Desré [:fabrice] from comment #16)
> > Gabriele, look at https://bugzilla.mozilla.org/show_bug.cgi?id=970358#c0 .
> > Could it be that we have a misconfigured build?
> 
> Thanks for the tip, I have an engineering build on my device though so I
> might be chasing the wrong issue. I'll try getting a trace with a plain user
> build.

I'm not sure a VARIANT=userdebug or user build will have this flagged toggled.
It's keyed on the GRE_MILESTONE, which is 42.0a1 currently on trunk:

http://mxr.mozilla.org/mozilla-central/source/configure.in#3516
I've just tried a local build for my Flame with VARIANT=user and the hang monitor is enabled, objdir-gecko/xpcom/threads/backend.mk contains this line:

DEFINES += -DMOZ_ENABLE_BACKGROUND_HANG_MONITOR=1 -DOS_POSIX=1 -DOS_LINUX=1

objdir-gecko/config/autoconf.mk has GRE_MILESTONE set to 41.0a1 so that might be the cause. Can anybody check on a Z3C using the regular foxfooding builds too? Run the following commands:

adb shell 'echo 2048 > /sys/kernel/debug/tracing/buffer_size_kb'
adb shell 'echo "" > /sys/kernel/debug/tracing/trace'
adb shell 'echo 1 > /sys/kernel/debug/tracing/events/sched/sched_stat_runtime/enable'
adb shell 'echo 1 > /sys/kernel/debug/tracing/events/sched/sched_wakeup/enable'
adb shell 'echo 1 > /sys/kernel/debug/tracing/events/sched/sched_wakeup_new/enable'
adb shell 'echo 1 > /sys/kernel/debug/tracing/tracing_enabled'

Wait a few seconds and then

adb shell 'cat /sys/kernel/debug/tracing/trace' | grep BgHangManager

If you get any output then the background hang manager is definitely running and draining the battery.

Don't forget to run this to stop tracing when you're done:

adb shell 'echo 0 > /sys/kernel/debug/tracing/tracing_enabled'
That's definitively happening on mine. Let's disable that on all b2g device builds.
Gabriele, while this is bad, I don't see any recent change around the BackgroundHandMonitor, so I'm moved that to a new bug : bug 1180533
(In reply to Fabrice Desré [:fabrice] from comment #22)
> Gabriele, while this is bad, I don't see any recent change around the
> BackgroundHandMonitor, so I'm moved that to a new bug : bug 1180533

I don't think this is an issue with the BackgroundHangMonitor per se but rather with how it is used. Looking at the code the way it works is that once a background thread that uses it starts working on a task it's supposed to call BackgroundHangMonitor::NotifyActivity() and keep calling it to prove it's not stuck. Once it is done the thread should call BackgroundHangMonitor::NotifyWait() which stops the hang monitor and waits for the next activity notification. It seems to me that there's a thread somewhere that's never calling BackgroundHangMonitor::NotifyWait() but rather keeps calling BackgroundHangMonitor::NotifyActivity() thus prompting the hang monitor to wake up continuously. I'll try running a bisection because I feel that just disabling the hang monitor is masking the real issue.
Quick update, I prepared a build with a ~1 month old version of gecko and I can still see the BgHangManager wakeups but they're not as frequent as with recent versions. I've got the feeling these may have become worse over time so I'll take this chance to open a bug related to that. At this stage I suspect that other similar issues might have been caused by this (e.g. bug 1130725) with varying degrees of badness depending on the hang manager's clients behavior.
See Also: → 1180677
Gabriele, as we spoke on IRC, here is a feedback around this issue. I did a build just after Fabrice's patch landed, and I have dogfooded this build since. What I could notice so far is a slight improvement. I think somehow placing calls does trigger something that starts to eat my battery again.

I remember that in the past we dot Audio channels leaking, especially in the callscreen app, and that this was already triggering battery life regression.
Flags: needinfo?(gsvelto)
There hasn't been many changes to the callscreen lately so I don't think we should have problems with leaking Audio contexts. That being said since bug 1094764 landed we might just use that in the callscreen and dialer to ensure that we always stop playing audio when we don't need it anymore. I'll file a bug and attach it here.
Filed bug 1183622 for the audio context part.
Flags: needinfo?(gsvelto)
See Also: → 1183622
Situation seems to be better these days. Tracking slowdown in bug 1189004.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → WORKSFORME
You need to log in before you can comment on or make changes to this bug.