Closed Bug 1064927 Opened 5 years ago Closed 2 years ago

Power button refuses to respond under heavy CPU(/memory?) load

Categories

(Firefox OS Graveyard :: General, defect)

All
Gonk (Firefox OS)
defect
Not set

Tracking

(b2g-v2.1 unaffected, b2g-v2.2 affected)

RESOLVED WONTFIX

People

(Reporter: cwiiis, Unassigned)

References

Details

(Keywords: regression)

Attachments

(4 files)

I find that when the phone is under heavy load (which, due to some recent Gecko changes, appears to be a lot of the time, but is also easily triggered by the music indexer), it becomes impossible to unlock the phone. The power button does nothing, though it sometimes responds after (literally) minutes.

A work-around if you're by a micro-USB cable/power source is to plug the phone in, which will successfully wake it up, after which you can interact with it. Locking the phone again will likely still manifest the issue (after which you'll have to unplug and plug back in again to recover).

I experience this on a ZTE Open C, and have previously experienced this on a Geeksphone Revolution and the Flame.

I'm filing this under lockscreen, but I imagine the issue is deeper than that.
Gregor pointed out that there may be a memory leak in the system app, which would explain why this is so much harder to reproduce after a fresh boot. I'll try to get a memory profile when my phone is exhibiting this.
Flags: needinfo?(chrislord.net)
Summary: Power button refuses to respond under heavy CPU load → Power button refuses to respond under heavy CPU(/memory?) load
I've seen this too, although I'm not sure whether the device is actually under load or not.
qawanted to see if this can be reproduced. Please make sure to capture a memory profile if you are able to reproduce.
Keywords: qawanted
I've seen this on the Tako, usually after several hours of use. So it is not device specific, or even specific to JB (Tako is KK).
Note that as usual, the screen turned on when I plugged the phone in (though it refused to turn on at all before that), and also, a process died when I ran the memory reporting tool, so there's a good chance that there was even less memory free before doing the report.

It does indeed look like the device (an Open C, 512mb ram) has very little memory free in this profile. There also seem to be a lot of file descriptors open - I've noticed a message about not being able to open /dev/something/power or something like that before when pressing the power button (and having nothing happen), but plugging in the device and running the profile appears to have temporarily fixed it, so I can't reproduce until it gets into that state again.

If I'm around my laptop when it happens again, I'll try to gather more information. It's looking quite likely to be some kind of leak, but I'm not particularly experienced at reading these memory reports. I'll see about gathering a report after rebooting too.
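For anyone wanting to check the file descriptor theory, here is a minimal sketch of counting open descriptors per process from a Linux-style /proc tree. It assumes a Python interpreter that can read that tree, which a stock device will not have, so in practice the same listing would have to be gathered over adb. The function names `count_open_fds` and `top_fd_users` are made up for illustration.

```python
import os


def count_open_fds(proc_root="/proc"):
    """Return {pid: open descriptor count} for every readable process."""
    counts = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            counts[int(entry)] = len(os.listdir(fd_dir))
        except OSError:
            # The process exited, or we lack permission to list its fds.
            continue
    return counts


def top_fd_users(counts, limit=5):
    """Return the (pid, count) pairs with the most open descriptors."""
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:limit]


if __name__ == "__main__" and os.path.isdir("/proc"):
    for pid, n in top_fd_users(count_open_fds()):
        print(pid, n)
```

A count that keeps growing between two snapshots of the same process would be a hint that descriptors are leaking, though it would not say which ones.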
Flags: needinfo?(chrislord.net)
I also notice a zram section, that's the compressed memory we use on 128mb devices, right? Should that be present on a 512mb device?
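One way to answer the zram question on a given device is to look at the contents of /proc/swaps, which lists active swap devices with their size and usage in KiB. The sketch below only parses text in that format; the sample input and the `/dev/block/zram0` path are hypothetical and not taken from this device, and the function name is made up.

```python
def zram_swap_devices(swaps_text):
    """Return (name, size_kib, used_kib) for each zram-backed swap device."""
    devices = []
    # The first line of /proc/swaps is a column header, so skip it.
    for line in swaps_text.splitlines()[1:]:
        fields = line.split()
        if len(fields) < 4:
            continue  # ignore blank or malformed lines
        name, _kind, size_kib, used_kib = fields[:4]
        if "zram" in name:
            devices.append((name, int(size_kib), int(used_kib)))
    return devices


# Hypothetical sample in the /proc/swaps layout, for illustration only.
SAMPLE = (
    "Filename\t\t\t\tType\t\tSize\tUsed\tPriority\n"
    "/dev/block/zram0                        partition\t196604\t51200\t5\n"
)

if __name__ == "__main__":
    print(zram_swap_devices(SAMPLE))
```

An empty result would suggest zram swap is not active, in which case the zram section in the report would need another explanation.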
Also worth noting, my phone is extremely unresponsive in this state - there is a very large lag before touches are acknowledged, though the frame-rate when scrolling begins is ok. This would indicate some kind of contention in the content process, and that the compositor is unaffected.
Heh, that said, scrolling in the browser is very janky - it scrolls smoothly for a few pixels, then there's a pause, then smooth again. I assume this corresponds with content updates, and I posit the jank is caused by memory i/o contention?
This is a logcat captured shortly after the phone refused to unlock. Again, plugging it in gets it (temporarily) out of that bad state and I'm not sure if this is a useful log or not. I'm having difficulty reproducing this state while the phone is plugged in.
Chris, can you upload a memory report right after rebooting, so we can compare with a relatively clean memory profile?
See Also: → 1062331
I already had a memory report from a Flame shortly after unlocking the homescreen after a reboot while investigating another bug. From comment 10, just thought it might help.
Removing qawanted; it looks like Chris and Mason have provided data. Also, this is not flagged as a blocker at the moment.
Keywords: qawanted → ---
Possible dupe of bug 1068268?

[Blocking Requested - why for this release]: Makes phone unusable after about a day (depending on usage/memory configuration).
blocking-b2g: --- → 2.1?
Component: Gaia::System::Lockscreen → General
I was seeing what I thought was the same issue on an Open C, on 2.1 and 2.0. Since then that has degraded to the point that it refuses to turn on or off most of the time, even from a dead stop after having the battery removed. This leads me to think this was a hardware issue with the button itself and probably unrelated. I mention it in case something similar is going on that might be confused with the CPU load issue.
So I've not seen this problem since bug 1068268 landed on master, having had my phone on for over 24 hours. I don't think it's a Gaia change that's caused this; I recall only updating Gecko and not Gaia and seeing the problem go away. I strongly suspect that bug, but I'll verify by flashing a build without it tomorrow and using my phone for a day or two (if it gets that far).
I would be *very* surprised if this was a result of bug 1068268. I've seen the issue with the power button not responding intermittently before any of the multi-layer-apz changes landed.
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #16)
> I would be *very* surprised if this was a result of bug 1068268. I've seen
> the issue with the power button not responding intermittently before any of
> the multi-layer-apz changes landed.

I've seen it before, but not nearly to the same extent - For this bug, I see it consistently happen after using the phone for around half a day or so, and the phone will literally either not respond at all, or respond after several minutes (literally minutes, not a few seconds). This is also coupled with a gradual slow-down of the phone, where the frame-rate starts getting janky, launching apps gets slower, etc. - this appears to be fixed on master now, and also appeared to be caused by a memory leak, introduced at a very similar or later time to the multi-layer-apz stuff landing.
(In reply to Chris Lord [:cwiiis] from comment #15)
> I'll verify by flashing a
> build without it tomorrow and using my phone for a day or two (if it gets
> that far).

How has it gone for the past few days, Chris?  Do you still think this needs investigation/to block 2.1?
Flags: needinfo?(chrislord.net)
So FWIW this seems significantly improved (but not completely gone) on my build from this week.  I haven't seen a delay of more than a second or so when pressing power.
(In reply to Andrew Overholt [:overholt] from comment #20)
> (In reply to Chris Lord [:cwiiis] from comment #15)
> > I'll verify by flashing a
> > build without it tomorrow and using my phone for a day or two (if it gets
> > that far).
> 
> How has it gone for the past few days, Chris?  Do you still think this needs
> investigation/to block 2.1?

This is basically fixed in 2.2 - I've had it not respond for a few seconds, but never more than 2 or 3, as opposed to the minutes that I was getting, and this is after several days of use.

I don't know what the situation is in 2.1 though. I need to find a time when I can flash my daily phone, but I've been busy outside of work recently and can't afford to be carrying a broken phone around right now :) (there's no use in flashing a dev phone, it requires frequent, long-term usage to surface 'quickly')

I strongly suspect this was a leak though, and bug 1068268 is a fixed leak in the right time-frame that has been uplifted to Aurora, so this may well be fixed.

I'll leave the needinfo to remind me to check on this at a more convenient time.
Please re-nom when you get a chance to test and determine it's still an issue.  Thanks, Chris!
blocking-b2g: 2.1? → ---
I don't know if this is an issue with 2.1 as I dogfood on master (so I can test features I'm developing and keep an eye on platform), but this is no longer fixed in 2.2 :(

Even after closing all applications, once this bug has happened, the whole phone is permanently unresponsive - typing on the keyboard becomes pretty painful, scrolling becomes less responsive, checkerboarding happens more frequently - all until you reboot the phone.

Is anyone dogfooding 2.1 that uses the phone extensively? (i.e. lots of browsing, texting, calls, music playing, etc.) This triggers pretty consistently and easily for me in 2.2 just by listening to music and using facebook/twitter a reasonable amount all day.
Flags: needinfo?(chrislord.net)
I think it'd be good, if it's at all possible, to have someone on QA reproduce this, first on 2.2 to confirm the issue and STR, then on 2.1 to see if it's still an issue there.

Note that I'm testing on an Open C, which has 512mb of RAM and may manifest the issue sooner than a 1gb Flame - I would suggest configuring the Flame with the same amount of memory.
Keywords: qawanted
It may be worth grabbing the about:memory log when this happens to see if there's some runaway memory allocation. $B2G_DIR/tools/get_about_memory.py
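A rough sketch of wrapping that script so a report can be grabbed quickly when the phone gets into the bad state. It assumes `B2G_DIR` is set in the environment and that the script is executable as-is; no flags are passed, since none are named here. The helper names are made up.

```python
import os
import subprocess


def about_memory_script(b2g_dir):
    """Build the path to the report tool inside a B2G checkout."""
    return os.path.join(b2g_dir, "tools", "get_about_memory.py")


def capture_report(b2g_dir):
    """Run the tool with no arguments and return its exit status."""
    return subprocess.call([about_memory_script(b2g_dir)])


if __name__ == "__main__" and "B2G_DIR" in os.environ:
    status = capture_report(os.environ["B2G_DIR"])
    print("get_about_memory.py exited with status", status)
```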
I believe I'm getting the issue which you are seeing on Flame 2.2 KK. I'm able to bring the phone to its knees on the lockscreen by taking video with the lockscreen camera, then using the home button to exit, and repeating a few times. Not much works at this point, including the power button, home button, etc. This was seen with memory set at 319mb. I tried memory set to 512mb but the bug does not seem to occur and performance is fine.

Tested with Shallow Flash on 319mb using Engineering builds
See 2.2 build info below.

STR:
1. Enable lockscreen passcode (to keep yourself on the lockscreen after using the camera)
2. Lock the phone then tap the power button again to get to the lockscreen.
3. Go to the lockscreen camera and start taking a video.
4. While it's recording, tap the home button to go back to the lock screen.

Repeat steps 3 and 4 about 4-5 times to bring the Flame to its knees.

This bug repro's on Flame KK builds: Flame 2.2 KK

Repro Rate: 4/4

Environmental Variables:
Device: Flame 2.2 KK
BuildID: 20141031062029
Gaia: a07994714f0552f89801d6097982308d8b0a1ee1
Gecko: 21fbf1e35090
Version: 36.0a1 (2.2) 
Firmware Version: v188
User Agent: Mozilla/5.0 (Mobile; rv:36.0) Gecko/36.0 Firefox/36.0

-----------------------------------------------------------------
-----------------------------------------------------------------

This bug does NOT repro on Flame kk build: Flame 2.1 KK

Actual Result: No performance hits are seen on 2.1 KK 319mb

Repro Rate: 0/7

Environmental Variables:
Device: Flame 2.1 KK
BuildID: 20141031025654
Gaia: 224dfde17af943b583aa0a97936343c7267c7996
Gecko: 12a56ce89cb9
Version: 34.0 (2.1) 
Firmware Version: v188
User Agent: Mozilla/5.0 (Mobile; rv:34.0) Gecko/34.0 Firefox/34.0
QA Whiteboard: [QAnalyst-Triage?]
Flags: needinfo?(jmitchell)
Keywords: qawanted → regression
QA Contact: croesch
I'll try and get a mem report attached soon.
QA Whiteboard: [QAnalyst-Triage?] → [QAnalyst-Triage+]
Flags: needinfo?(jmitchell)
Bug 1048024 prevents us from getting an about-memory report.  The issue has been rewritten many times but is always duped to that bug.
Flags: needinfo?(jmitchell)
Flags: needinfo?(bugmail.mozilla)
We fixed that months ago.  When was the last time you did a full flash?
Attached file About Memory report
It only works after a full flash, which is itself something that is not always done.

Attaching the requested about-memory report for Cody as he has left for the day.
Is your usual work flow latest base image + shallow flash?  Sounds like we need to get the base image upgraded then.  I'll deal with that.
Flags: needinfo?(jmitchell)
I looked through the memory dump and there's nothing there that jumps out at me as being obviously wrong. The most suspicious thing I found was the 75.26 MB used up by kgsl-memory/b2g/egl_image under "system". That seems high to me but I don't know what normal is. Kyle, do you know if that value is abnormally high?
Flags: needinfo?(bugmail.mozilla) → needinfo?(khuey)
A memory report I just pulled from my Nexus 5 has about 70 MB of egl_image, so that doesn't seem super-high to me.  There's a kgsl file that about:memory produces (if you don't feed the --no-kgsl-something-or-other flag to it) that breaks down those allocations.
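To put the two egl_image figures side by side, here is a small sketch of the comparison. The 70 MB reference is the approximate Nexus 5 value quoted above, so the result is only as precise as that estimate; the function name is made up.

```python
def relative_excess(observed_mb, reference_mb):
    """Return how far observed_mb sits above reference_mb, as a fraction."""
    return (observed_mb - reference_mb) / reference_mb


if __name__ == "__main__":
    # 75.26 MB from the attached report versus roughly 70 MB on the Nexus 5.
    print("{:.1%}".format(relative_excess(75.26, 70.0)))  # prints 7.5%
```

A difference of that size between two different devices is small enough that it does not single out egl_image as the leak.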
Flags: needinfo?(khuey)
ni JMercado to confirm comment 32.
Flags: needinfo?(jmercado)
Hi Kyle,
Our normal workflow is usually base image plus shallow flash because we cannot perform regression windows when doing full flash.  Even though we aren't always working on regression windows it is our default method.
Flags: needinfo?(jmercado)
Firefox OS is not being worked on
Status: REOPENED → RESOLVED
Closed: 5 years ago → 2 years ago
Resolution: --- → WONTFIX