Closed Bug 1171827 Opened 9 years ago Closed 9 years ago

Gecko hangs, power button unreactive, adb still alive

Categories

(Firefox OS Graveyard :: General, defect)

ARM
Gonk (Firefox OS)
defect
Not set
normal

Tracking

(blocking-b2g:2.5+, b2g-master fixed)

RESOLVED FIXED
FxOS-S1 (26Jun)
blocking-b2g 2.5+
Tracking Status
b2g-master --- fixed

People

(Reporter: dholbert, Assigned: xeonchen)

References

Details

(Keywords: dogfood, regression)

Attachments

(5 files, 1 obsolete file)

I don't have STR, just reporting a pretty serious issue with Aries (aka Spark, Xperia Z3C) device that I encountered today.

Around 7:50pm, I powered off my phone (or so I thought) to see a show.  At around 9:30 (intermission), I took my phone out of my pocket, and discovered that it was quite warm and unresponsive.  The screen was dark. I tried holding the power button down to turn it on, or off, or force-reboot it. This had no effect.

I tried to look for a way to pull the battery or force-reboot via some other way; I couldn't find one, so I just didn't have a working phone for the rest of the night.

Now that I'm home, I tried "adb logcat" and lo and behold, the phone is indeed still powered on & responsive to adb; but unresponsive to power-button presses.

I'll attach my adb logcat.

EXPECTED RESULTS: Should be a way to reboot the phone & return it to a working state when it's hanged. (Maybe there is a way? I'd love to know how, for the next time this happens and I don't have a computer with adb anywhere nearby...)
BTW: I think the last thing I'd done before trying to power off my phone (& triggering this bug) was sending & receiving some SMS messages, and then silencing the ringer.
Summary: [Aries] Phone sometimes just goes unresponsive instead of powering off; unclear how to get it back to a working state without connecting to a computer (?) → [Aries] Phone sometimes just goes unresponsive instead of powering off; unclear how to get it back to a working state without connecting to a computer with adb (?)
Here's adb logcat after I got home. Unfortunately it doesn't go back all the way to the time where I tried to power off my phone (which was around 7:50pm).
drs tells me in IRC that there's an orange pin in the panel on the bottom-left side of the phone (next to the SIM slot) which I can press to reboot it -- I tried that, & that successfully powered down the phone.

--> clarifying summary.
blocking-b2g: --- → spark?
Keywords: dogfood
Summary: [Aries] Phone sometimes just goes unresponsive instead of powering off; unclear how to get it back to a working state without connecting to a computer with adb (?) → [Aries] Phone sometimes just goes unresponsive instead of powering off (have to forcibly power off using orange pin next to sim slot)
See Also: → 1170836
(also, apparently holding power-button + volume-up for a few seconds works to force-power-off the phone.)
Comment on attachment 8615836 [details]
output of "adb logcat -v time" after I got home

Alex, Dave, do you make anything of this? Any idea who to pass this along to?
Flags: needinfo?(lissyx+mozillians)
Flags: needinfo?(dhylands)
I have never had this issue in three months of dogfooding with each build averaging an uptime of more than five days.
And it just happened time with a build from yesterday. So probably a regression of the last week.
See Also: → 1171943
(In reply to Alexandre LISSY :gerard-majax from comment #7)
> And it just happened time with a build from yesterday.

(I think this was a typo and by "time" you meant "to me"? i.e. did you hit this as well?)
Yes. Happened already three times today. That's a clearly serious and recent regression. ...
Naoki, is there anything we can do to escalate this more?
Flags: needinfo?(nhirata.bugzilla)
blocking-b2g: spark? → spark+
I was unable to reproduce this issue on an Aries build from today.  Leaving the steps-wanted tag for others to give it a try.  

One important thing to note is that until task cluster gets a more user friendly index (see bug 1133074) Aries only issues will be very hard for us to do regression windows on.  

Environmental Variables:
Device: Aries 3.0
BuildID: 20150605020614
Gaia: 65369b217faac7d70c1a29100c4208c6d1db16e3
Gecko: 0496b5b3e9ef
Gonk: 3af1ede0d0956cfbf9c549df7cd9a6807a9efdf2
Version: 41.0a1 (3.0) 
Firmware Version: D5803_23.1.A.1.28_NCB.ftf
User Agent: Mozilla/5.0 (Mobile; rv:41.0) Gecko/41.0 Firefox/41.0
Flags: needinfo?(pbylenga)
Flags: needinfo?(ktucker)
(In reply to Daniel Holbert [:dholbert] from comment #0)
> EXPECTED RESULTS: Should be a way to reboot the phone & return it to a
> working state when it's hanged. (Maybe there is a way? I'd love to know how,
> for the next time this happens and I don't have a computer with adb anywhere
> nearby...)

There is a hard reset button:
http://sony-xperia.mobi/eng/sony-xperia-z3-compact/283-how-to-reset-sony-xperia-z3-compact-restart-button-on-sony-xperia-z3-compact.html

The log seems to have spew from the location service, so there isn't really much useful information there.

When I do try to power off, I see the following in the log:

> 13:38:18.751  320/ 320       PowerManagerService  I   Call to virtual nsresult mozilla::dom::power::PowerManagerService::PowerOff(). The JS stack is:
> 13:38:18.751  320/ 320       PowerManagerService  I   0 sm_actualPowerOff(isReboot = false) ["app://system.gaiamobile.org/js/sleep_menu.js":451]
> 13:38:18.751  320/ 320       PowerManagerService  I       this = [object Object]
> 13:38:18.751  320/ 320       PowerManagerService  I   1 nextAnimation(e = [object AnimationEvent]) ["app://system.gaiamobile.org/js/sleep_menu.js":432]
> 13:38:18.751  320/ 320       PowerManagerService  I       this = [object HTMLDivElement]
> 13:38:18.751  320/ 320       PowerManagerService  I   

I currently have an eng build on the phone, so it may behave differently with different builds
Flags: needinfo?(dhylands)
Flags: needinfo?(lissyx+mozillians)
This is a recent *and* nasty regression, potentially harmful for the device given the heat.
(In reply to Dave Hylands [:dhylands] from comment #12)
> There is a hard reset button:

Yeah, I discovered that later, per comment 3 / comment 4.  That makes me less concerned than I was (but still concerned) about this bug.

> The log seems to have spew from the location service, so there isn't really
> much useful information there.

Darn. :-/ I was afraid of that.
Flags: needinfo?(ktucker)
dholbert, can you do a logshake instead?  That gives more logging information.  you should be able to press both volume buttons to trigger the log shake to occur.
Flags: needinfo?(nhirata.bugzilla)
QA Contact: nhirata.bugzilla
Flags: needinfo?(nhirata.bugzilla)
QA Contact: nhirata.bugzilla
(In reply to Naoki Hirata :nhirata (please use needinfo instead of cc) from comment #15)
> dholbert, can you do a logshake instead?

Sure, I can try that next time this happens (if it happens to me again; only happened once so far).
I have been using my device since with updated master and a local revert for bug 1166320 that broke USB Mass Storage. No idea if it's related or not, but no spurious crash so far.
I hit this bug [or something like it] earlier today. In this case I hadn't tried to power off the phone; I simply set it down for ~30 minutes.  When I picked it up to use it, it was unresponsive.

As before, the last thing I'd done was to send a text. Not sure if that's related.

I held down both volume buttons for a few seconds (per comment 15); not sure if that triggered a logshake or anything, because nothing obviously happened. I ended up rebooting the phone after that because I needed to use it.
(In reply to Naoki Hirata :nhirata (please use needinfo instead of cc) from comment #15)
> dholbert, can you do a logshake instead?

(Clarifying this: is this actually possible, when the phone is locked up? How do I actually submit the logshake data, when the phone's screen won't power on until I reboot?  From my brief testing of logshake, it looks like its "tap to share" entries in the notification tray are gone after I reboot my phone. So I think if I tried this, it might capture logs, but they'd be gone after I reboot. Do the logs get saved locally in a known location, where I can pull them off? [and do I need to pull them off before rebooting?])
(In reply to Daniel Holbert [:dholbert] from comment #19)
> (In reply to Naoki Hirata :nhirata (please use needinfo instead of cc) from
> comment #15)
> > dholbert, can you do a logshake instead?
> 
> (Clarifying this: is this actually possible, when the phone is locked up?
> How do I actually submit the logshake data, when the phone's screen won't
> power on until I reboot?  From my brief testing of logshake, it looks like
> its "tap to share" entries in the notification tray are gone after I reboot
> my phone.

That should not be the case.

> So I think if I tried this, it might capture logs, but they'd be
> gone after I reboot. Do the logs get saved locally in a known location,
> where I can pull them off? [and do I need to pull them off before
> rebooting?])

Yes, it's documented on MDN we store on sdcard, in the "logs" directory, with a new directory named after the date and time.
(Ah, thanks - I do see the "logs" directory now, with datestamped directories from my various trial-logshakes.)
(In reply to Alexandre LISSY :gerard-majax from comment #20)
> (In reply to Daniel Holbert [:dholbert] from comment #19)

[...]

> > (Clarifying this: is this actually possible, when the phone is locked up?
> > How do I actually submit the logshake data, when the phone's screen won't
> > power on until I reboot?  From my brief testing of logshake, it looks like
> > its "tap to share" entries in the notification tray are gone after I reboot
> > my phone.
> 
> That should not be the case.

Tracked in bug 1172316
Not sure why, but this has not reproduced since I've rebased and/or did a backout of bug 1166320 ...
Unable to reproduce on latest Spark 3.0

After several attempts over the span of an hour, I am unable to replicate the results seen in comment 0. Also tried holding down a number of hardware button combinations while turning off and on phone, but was unable to get device into desired state.

Leaving steps-wanted in case someone else can reproduce.

Device: Aries 3.0
Build ID: 20150608104950
Gaia: ea27c4ed5b6083c9e21d233d4804372ac4d5d353
Gecko: 4700d1cdf489
Gonk: 3af1ede0d0956cfbf9c549df7cd9a6807a9efdf2
Version: 41.0a1 (3.0)
Firmware Version: D5803_23.1.A.1.28_NCB.ftf
User Agent: Mozilla/5.0 (Mobile; rv:41.0) Gecko/41.0 Firefox/41.0
Dholbert, I think Alexandre has answered your question.  If the system is completely unresponsive, then yes, you are right; logshake won't happen.  Having said that, if there's a chance that logshake does work, then I wanted to try it out.

Could you zip the logs and place it somewhere, please?  Please be aware of : bug 1159969
[not recommending placing in bugzilla or at least set the privacy button if you do]

I haven't reproduce the issue yet; still trying.
Flags: needinfo?(nhirata.bugzilla) → needinfo?(dholbert)
Thanks. Unfortunately, though -- looking at my logs directory, it looks like my most recent instance of this bug (comment 18) unfortunately did not result in any logshake directory being created. (None of my logshake directories have a timestamp around that period of time.)

If it happens again, I'll try again to trigger a logshake and post the logs somewhere safe.
Flags: needinfo?(dholbert)
QA Whiteboard: [QAnalyst-Triage+]
Flags: needinfo?(pbylenga)
Daniel, can you make sure you can verify if it still happens on a more recent build? I have not been able to hit this anymore since my saturday's build
Flags: needinfo?(dholbert)
Yup, I'll definitely comment if I hit this again. Saturday was the last time that I hit this, too.
Flags: needinfo?(dholbert)
I hit the same bug about 14 hours ago (see bug 1173684), on a build from yesterday:

Build ID               20150610030200
Gaia Revision          9844b7c9d4429a041cff7201f1a63d0264e155ff
Gaia Date              2015-06-09 20:49:53
Gecko Revision         https://hg.mozilla.org/mozilla-central/rev/5787d784485e
Gecko Version          41.0a1
Device Name            aries
Firmware(Release)      4.4.2
Firmware(Incremental)  eng.worker.20150605.140045
Firmware Date          Fri Jun  5 14:00:54 UTC 2015
Bootloader             s1
I have a build from yesterday, so far it's fine:
 - Gecko 6aa1662524c362bb34d55aa993b26d60275ce62b
 - Gaia 27e4c1af0be302150a5f6e428666e0e3daf2a4ea
I had the same thing just now

gaia: git commit e3eaf72c (2015-06-10 03:13:33)
gecko: can't find commit hash, but build-id says 20150610165040
(In reply to Frederik Braun [:freddyb] from comment #32)
> gecko: can't find commit hash, but build-id says 20150610165040

SourceStamp=95afddf894e3 in /system/b2g/application.ini
I cannot be 100% sure, but it looks like I got into the same state on my Flame that was flashed with a build from june 9th. Device was unresponsive to power button, screen constantly black even after plugging in USB cable. Charging LED turned to white, and I did not tried to check adb status. Device did not exposed the classic QCOM USB interfaces that are exposed after a very bad crash like kernel panic, so I guess the system was still alive.

Removed battery, rebooted, and the device is in good shape, with more than 25% of battery. So it might not be a Z3c-specific issue ...
(In reply to Alexandre LISSY :gerard-majax from comment #34)
> I cannot be 100% sure, but it looks like I got into the same state on my
> Flame that was flashed with a build from june 9th. Device was unresponsive
> to power button, screen constantly black even after plugging in USB cable.
> Charging LED turned to white, and I did not tried to check adb status.
> Device did not exposed the classic QCOM USB interfaces that are exposed
> after a very bad crash like kernel panic, so I guess the system was still
> alive.
> 
> Removed battery, rebooted, and the device is in good shape, with more than
> 25% of battery. So it might not be a Z3c-specific issue ...

Gecko 4b0c32375e944b198193c042cf5cb57769b9c265
Gaia eeaacf041d025f53849b80b85de3736ba399782c
Got no reproduce on my Z3 Compact until today with a build from this morning:
 Gecko f5cca86bcb4f4a2f263225903a25f9cb64b47519
 Gaia 14c0c8eba35f2c838d0cef68cf3b1bf4d29678d5
Taking the regression window.  Due to the facts that:
* Many people encountered it
* Nobody has a single clue on what could trigger the regression,

I'll proceed the same way as in bug 1122119 comment 65 (and following), I'll leave some phones electrically plugged and I'll wait until the black screen appears. I have 6 flames and 1 aries. If comment 34 is correct and it does happen on Flame, it won't take much time to get a regression window. 

Based on comment 6, I started with a range from May 29th to June 10th on Flame. With the single aries I have, I'll start with a build from June 5th and go back in daily builds once I stop seeing the black screen.

Clearing steps-wanted.
Keywords: steps-wanted
QA Contact: jlorenzo
I also experienced this with the OTA build from yesterday (Gaia 682697b, build id 20150611132111).

In past instanced, the phone has unplugged and unattended at the time of failure. In one case, airplane mode was on, in the other case, both cellular and wifi were enabled.

During the most recent failure, an initial adb logcat was successful (which I stupidly didn't save), and the phone mounted as MTP per usual. Upon attempting a second adb logcat, the phone became unresponsive, even after replugging USB, and had to be rebooted with power+volup.
(In reply to Alexandre LISSY :gerard-majax from comment #36)
> Got no reproduce on my Z3 Compact until today with a build from this morning:
>  Gecko f5cca86bcb4f4a2f263225903a25f9cb64b47519
>  Gaia 14c0c8eba35f2c838d0cef68cf3b1bf4d29678d5

And reproduced again after a couple of hours. This time I was checking my bugmail while eating along the Loire. So I'm confident the device was not overheating, and I was in the middle of disabling the data connection when it happened, the device was thus frozen displaying the window.
I'm currently testing a build based on:
 Gecko ee56f8bd591a50d6d3c0528e772f737ee031c147
 Gaia bdc3410972358afb7ecc8f60d92b88aaf6c96f97

With also the HWComposer patch from bug 1171201. Flashed this about twelve hours ago, no issue so far.
Happened again. This time I connected my device to my laptop:
 - adb works
 - adb root works
 - nothing in dmesg
 - logcat was silent probably since the lockup

But the more interesting:
 - events were still working (getevent; press touches)
 - adb shell stop b2g && adb shell start b2g, and Gecko lives again

So the device itself seems good. It was not hot.
Sotaro, Justin, one of you told me there may be graphical issues tied with this. A couple of hours before my device exposed the bug again, I have noticed some: pulling the utility tray from the statusbar resulted in the wrong animation around the middle of the process. It's hard to describe, but it was like it was jumping two or three times while I was still pulling the tray to the bottom of the screen.

Now that Gecko has restarted for some minutes, I don't reproduce this behavior at all. Might be a long shot, but just in case ... I have the configuration mentionned in comment 40.
Flags: needinfo?(sotaro.ikeda.g)
Flags: needinfo?(jdarcangelo)
(In reply to Alexandre LISSY :gerard-majax from comment #41)
> Happened again. This time I connected my device to my laptop:
>  - adb works
>  - adb root works
>  - nothing in dmesg
>  - logcat was silent probably since the lockup
> 
> But the more interesting:
>  - events were still working (getevent; press touches)
>  - adb shell stop b2g && adb shell start b2g, and Gecko lives again
> 
> So the device itself seems good. It was not hot.

My Flame is also exposing the behavior.
Attached file gdb backtrace on Flame
Backtrace in comment 44 makes me suspecting bug 1115480.
Blocks: 1115480
Flags: needinfo?(xeonchen)
The backtrace shows it cannot register the service for some reason, and causes endless recursive call.
Would you try this patch to see if it fixes the problem?

However, this registration is not expected to be ran currently, did you turn "dom.presentation.enabled" on?
Flags: needinfo?(xeonchen) → needinfo?(lissyx+mozillians)
No, I have not enabled the pref. I cannot test the patch for now ...
Flags: needinfo?(lissyx+mozillians) → needinfo?(jlorenzo)
Gary, is this patch low-risk? It looks like it to me. If so, we'd like to pull it into one of our build snapshots before it lands on m-c.

Naoki, as stated above, we should pull this into our next build snapshot unless Gary says otherwise.
Flags: needinfo?(xeonchen)
Flags: needinfo?(nhirata.bugzilla)
The code itself is low-risk, but what I worry about is the root cause (|ERROR_SERVICE_NOT_RUNNING| in the callback).
Flags: needinfo?(xeonchen)
This bug happened to me once yesterday and once today, but volup+voldown didn't trigger any logs to be captured, unfortunately (confirming my suspicion in comment 26). I'm glad that it seems like we've got a potential patch, though!

(In reply to Alexandre LISSY :gerard-majax from comment #47)
> No, I have not enabled the pref

(I haven't enabled the pref discussed here either, BTW.)
Gary, I rebuilt an uptodate Gecko from scratch on my Flame, and after sitting on my desk with WiFi enabled (but not connected) and data connected, it crashed with this backtrace when connecting gdb ...
Flags: needinfo?(xeonchen)
Temporary workaround by disabling MDNS.
Pref "dom.presentation.enabled" has nothing to do with this issue. For anyone who wants to workaround this issue, please use attachment 8622375 [details] [diff] [review].
(In reply to Shih-Chiang Chien [:schien] (UTC+8) (use ni? plz) from comment #52)
> Created attachment 8622375 [details] [diff] [review]
> [workaround] disable MDNS device discovery
> 
> Temporary workaround by disabling MDNS.

This should land on master, not only on any spark-specific snapshot: any device running master will be badly hit!
Attachment #8622375 - Attachment is obsolete: true
Flags: needinfo?(xeonchen)
Attachment #8622407 - Flags: review?(schien)
Attachment #8622407 - Flags: review?(schien) → review+
We need to land the workaround patch (attachment 8622407 [details] [diff] [review]) first for unblocking people. Leave this bug open for resolving the crash issue.
Whiteboard: [leave open]
(In reply to Shih-Chiang Chien [:schien] (UTC+8) (use ni? plz) from comment #56)
> We need to land the workaround patch (attachment 8622407 [details] [diff] [review]
> [review]) first for unblocking people. Leave this bug open for resolving the
> crash issue.
Summary: [Aries] Phone sometimes just goes unresponsive instead of powering off (have to forcibly power off using orange pin next to sim slot) → Phone sometimes just goes unresponsive instead of powering off (have to forcibly power off using orange pin next to sim slot)
(In reply to Shih-Chiang Chien [:schien] (UTC+8) (use ni? plz) from comment #56)
> We need to land the workaround patch (attachment 8622407 [details] [diff] [review]
> [review]) first for unblocking people. Leave this bug open for resolving the
> crash issue.

Would you mind opening a follow up and mark this as RESOLVED:FIXED ? This will ease tracking for Spark.
Summary: Phone sometimes just goes unresponsive instead of powering off (have to forcibly power off using orange pin next to sim slot) → Gecko hangs, power button unreactive, adb still alive
Assignee: nobody → xeonchen
(In reply to Alexandre LISSY :gerard-majax from comment #42)
> Sotaro, Justin, one of you told me there may be graphical issues tied with
> this. A couple of hours before my device exposed the bug again, I have
> noticed some: pulling the utility tray from the statusbar resulted in the
> wrong animation around the middle of the process. It's hard to describe, but
> it was like it was jumping two or three times while I was still pulling the
> tray to the bottom of the screen.
> 
> Now that Gecko has restarted for some minutes, I don't reproduce this
> behavior at all. Might be a long shot, but just in case ... I have the
> configuration mentionned in comment 40.

I did not talk about it. Justin seemed to talk to you.
Flags: needinfo?(sotaro.ikeda.g)
(In reply to Alexandre LISSY :gerard-majax from comment #58)
> (In reply to Shih-Chiang Chien [:schien] (UTC+8) (use ni? plz) from comment
> #56)
> > We need to land the workaround patch (attachment 8622407 [details] [diff] [review]
> > [review]) first for unblocking people. Leave this bug open for resolving the
> > crash issue.
> 
> Would you mind opening a follow up and mark this as RESOLVED:FIXED ? This
> will ease tracking for Spark.

Sure we can do that for project management reason, file bug 1175387 for follow-up.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Whiteboard: [leave open]
Clearing the NI as the issue has been worked around. I'll keep watching what's going on my dogfood device.
Flags: needinfo?(jlorenzo)
> $ adb shell cat /proc/uptime
> 158837.76 12761.27

So that's nearly two full days of uptime. Without any issue.
Flags: needinfo?(nhirata.bugzilla)
blocking-b2g: spark+ → 2.5+
Target Milestone: --- → FxOS-S1 (26Jun)
Blocks: 1172343
Flags: needinfo?(jdarcangelo)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: