Closed Bug 1179219 Opened 9 years ago Closed 9 years ago

Phone randomly crashing and keeping a crashed state ; trigger a QHSUSB__BULK peripheral on a computer

Categories

(Firefox OS Graveyard :: GonkIntegration, defect)

ARM
Gonk (Firefox OS)
defect
Not set
normal

Tracking

(b2g-v2.5 affected, b2g-master affected)

RESOLVED WONTFIX
Tracking Status
b2g-v2.5 --- affected
b2g-master --- affected

People

(Reporter: clement.lefevre, Unassigned)

References

Details

(Keywords: crash, foxfood, regression)

Attachments

(5 files)

On Flame, on master, the phone currently crashes at random.
When the crash occurs, the screen stays black if the phone was sleeping, or turns fully white if the screen was on. The next comment will contain a video of the screen suddenly turning white when it happens. The phone cannot even be restarted with the power button; the battery has to be removed and reinserted before it will start again.
Moreover, if the phone is connected to a computer when it happens and the computer notifies you of connected USB devices, the device, which is usually named "Firefox OS", shows up as "QHSUSB__BULK" and keeps disconnecting and reconnecting, as you will see in the second attached video.

I tried to capture logcat while the crash happens, but it shows nothing useful: adb simply stops when it happens. The same goes for /proc/kmsg.
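Since adb drops the moment the crash happens, one way to keep the very last lines is to stream the output to disk on the host and flush after every line. The following is a minimal Python sketch, not a tested procedure for this bug: stream_to_file is a hypothetical helper name, and it assumes adb is on the PATH. Each line is prefixed with the host clock, because the device clock itself may jump.

```python
import os
import subprocess
import time


def stream_to_file(cmd, path):
    """Run cmd and append each output line to path, syncing after every line."""
    with open(path, "ab") as out:
        proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                                stderr=subprocess.STDOUT)
        for line in proc.stdout:
            # Host-side timestamp: independent of any time change on the device.
            stamp = time.strftime("%Y-%m-%d %H:%M:%S ").encode()
            out.write(stamp + line)
            out.flush()
            os.fsync(out.fileno())  # make sure the line reaches the disk
        return proc.wait()


# Example (assumes a connected device; run one process per stream):
#   stream_to_file(["adb", "logcat", "-v", "threadtime"], "logcat.txt")
#   stream_to_file(["adb", "shell", "cat", "/proc/kmsg"], "kmsg.txt")
```

Reading /proc/kmsg over adb normally requires a root shell, so the second command may return nothing on a user build.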

The crash used to be very occasional, but at my current workplace, where network reception is *very* bad (switching between no network and 0/1 bars, rarely 4/5), it happens very often: several times a day. This is a real problem when the phone is used as an everyday device.
Entering or leaving the metro can also trigger the crash.

Information about my device on the nightly build:
Build ID               20150619010205
Build Type             user
Gaia Revision          a0df9c367a68764bdcf2e2e1c4d27f0d6ee165b8
Gaia Date              2015-06-18 18:49:14
Gecko Revision         https://hg.mozilla.org/mozilla-central/rev/2694ff2ace6a
Gecko Version          41.0a1
Device ID              flame
Firmware(Release)      4.4.2
Firmware(Incremental)  eng.cltbld.20150212.043653
Firmware Date          Thu Feb 12 04:37:04 EST 2015
Bootloader             L1TC000118D0

From what I have heard, someone using 2.1 builds seems to have the same issue; I can ask him to add details here if relevant.
Even though we don't get exactly the same USB devices as in bug 1122119, we're facing a kernel panic. Gabriele, do you have any idea of a race condition in the kernel that might be related to low cell coverage?

I'm not sure 2.1 has this issue under low cell coverage. I dogfooded that version until last month and took the Paris subway many times, mostly with no coverage, and I never encountered this bug.
Flags: needinfo?(gsvelto)
Blocks: 1154072
Interesting. Is it possible this is linked to the crash-during-call bug we had on the Open C? I know that the buggy code is not limited to the Open C's modem, so maybe ...

Sadly we have not been able to get last_kmsg working (see bug 1025265)
Flags: needinfo?(clement.lefevre)
I see random crashes too, and the phone restarts when I press the power button for ~10 sec.
I can usually still connect with adb shell and do a restart, but I can't confirm that right now as I can't reproduce.
It started at roughly the same time I experienced issues with the battery (bug 1178869)
(In reply to Alexandre LISSY :gerard-majax from comment #4)
> Interesting. Is it possible this is linked to the crash during call bug we
> had on Open C ? I know that the buggy code is not limited to the Open C's
> modem so maybe ...
> 
> Sadly we have not been able to get last_kmsg working (see bug 1025265)

I don't know if this is related: the bug described here happens while no action at all is being performed on the phone. Most of the time it occurs while the phone is sleeping; for this video I deliberately disabled screen sleep so that the crash could be seen.

It never happened during a phone call as far as I can remember.

(In reply to Mark Trompell from comment #5)
> I see random crashes too, and phone restarts when pressing powerbutton for
> ~10 sec.
> I can usually still connect with adb shell and do a restart. But can't
> confirm right now as I can't reproduce.
> It startet roughly the same time I experienced issues with the battery (bug
> 1178869)

Next time I'll try pressing the power button for ~10s, but I don't think it will work: in my case, I had adb logcat running while waiting for the crash. When the crash happens, adb logcat stops instantly, and after that every adb command to the phone fails.

Alexandre's guess was that the modem was going into a special state. I don't know how to check this, though, as kmsg and logcat gave pretty much no information about it.
Flags: needinfo?(clement.lefevre)
(In reply to Clément Lefèvre from comment #6)
> (In reply to Alexandre LISSY :gerard-majax from comment #4)
> > Interesting. Is it possible this is linked to the crash during call bug we
> > had on Open C ? I know that the buggy code is not limited to the Open C's
> > modem so maybe ...
> > 
> > Sadly we have not been able to get last_kmsg working (see bug 1025265)
> 
> I don't know if this can be related: The described bug here is happening
> while no actions at all are performed on the phone. Most of the time it
> occurs while the phone is sleeping, here I disabled screen sleeping as a
> choice for this video to see it happening and show it.

Does it never occur with the screen lock disabled? If so, I'd tend to agree that it's totally unrelated, and it may be exposing issues around cgroups like the ones Gabriele already had fun playing with :)

> 
> It never happened during a phone call as far as I can remember.
> 
> (In reply to Mark Trompell from comment #5)
> > I see random crashes too, and phone restarts when pressing powerbutton for
> > ~10 sec.
> > I can usually still connect with adb shell and do a restart. But can't
> > confirm right now as I can't reproduce.
> > It startet roughly the same time I experienced issues with the battery (bug
> > 1178869)
> 
> I'll try next time to press the powerbutton for ~10s but I don't think it
> will work: in my case, I was having adb logcat running while waiting for the
> crash. When the crash happen, adb logcat is stopping instantly, and then any
> adb command to the phone is failing
> 
> Alexandre's guess was that the modem was going into a special state. I don't
> know how to check this though as kmsg and logcat were giving pretty much no
> informations about this.

That's the fun part: I have no idea how we can expose this.
Flags: needinfo?(clement.lefevre)
(In reply to Alexandre LISSY :gerard-majax from comment #7)
> 
> It never ever occurs with the screenlock disabled ? If yes, then I'd tend to
> agree that it's totally unrelated and it may be exposing issues around
> cgroups like what Gabriele already had fun to play with :)
> 
> That's the fun part: I have no idea how we can expose this.

Yes, I tried with the screen lock disabled, and it reproduces too.
Flags: needinfo?(clement.lefevre)
(In reply to Clément Lefèvre from comment #8)
> (In reply to Alexandre LISSY :gerard-majax from comment #7)
> > 
> > It never ever occurs with the screenlock disabled ? If yes, then I'd tend to
> > agree that it's totally unrelated and it may be exposing issues around
> > cgroups like what Gabriele already had fun to play with :)
> > 
> > That's the fun part: I have no idea how we can expose this.
> 
> Yes, I tried with disabled screenlock, and it's reproducing too.

Then there's still a chance it's the same bug as on the Open C's modem, even if the triggering conditions look different: the hardware is similar.
To help find which factors matter for reproducing it, can you give more information about:
* Wi-Fi, have you ever been able to repro with Wi-Fi off?
* USB, same question with a device not plugged at all.
* Do you have an SD card in your device?

Thanks!
Flags: needinfo?(clement.lefevre)
(In reply to Johan Lorenzo [:jlorenzo] (QA) from comment #10)
> In order to help to find what factors might help to repro, can you give more
> infomation about:
> * Wi-Fi, have you ever been able to repro with Wi-Fi off?
> * USB, same question with a device not plugged at all.
> * Do you have an SD card in your device?
> 
> Thanks!

So, after testing for the missing information: this device has never contained an SD card.
The crash happens whether Wi-Fi is enabled or not, and whether the phone is plugged into a computer or not.
Flags: needinfo?(clement.lefevre)
A few times in the past, maybe two or three, I noticed that when I tried to turn on my phone after it had been on standby for a while, it didn't turn on at all and I was forced to pull the battery to get it working again. IIRC holding the power button down for a long time also forced the phone to reboot. I never bothered to check what caused this before this report, but if it happens again I'll double-check. In my case this happens on my main phone, a Flame I use for dogfooding with both SIMs populated, running the latest base build plus the current master.
Flags: needinfo?(gsvelto)
(In reply to Gabriele Svelto [:gsvelto] from comment #12)
> I noticed in the past, maybe two or three times, that when I tried turning
> on my phone after it being on standby for a while the phone just didn't turn
> on at all and I was forced to pull the battery to have it working again.
> IIRC keeping the power button pushed for a long time also forced the phone
> to reboot. I've never bothered to check what caused this before this but
> report but if it happens again I'll double-check it. In my case this is
> happening on my main phone which is a Flame I use for dogfooding with both
> SIMs populated, running the latest base build plus the current master.

Judging by how frequently I reproduce it at my workplace on this Flame, it looks related to very bad network conditions/reception, with the phone moving easily between no network, poor quality and good quality.
One guess: I mainly reproduce this in a building with bad reception in a city like Paris, where there are several cells around, all with erratic reception quality because of the building, so the phone jumps from cell to cell and this causes some issue.

This is still just a random guess. Gabriele, do you maybe have an idea about this, or a way to check it? Or maybe I'm just mistaken and this has nothing to do with it.
Flags: needinfo?(gsvelto)
(In reply to Clément Lefèvre from comment #13)

[...]

> 
> Looking at how frequently I'm reproducing it at my work place on this Flame,
> it looks like very bad network conditions/reception, moving easily between
> no network, a bad quality and a good quality.
> A guess can be that, as I mainly reproduce this in a building with bad
> reception in a city like Paris, but as there are several cells around, all
> with random reception quality because of the building, the phone jump from
> cell to cell and this is causing some issues.
> 

Sadly, I have a Flame that I carry with me and that has never shown this kind of behavior, even though I'm often on high-speed trains, where network connectivity is as bad as you describe. That does not mean it's not what triggers it in your case.
(In reply to Gabriele Svelto [:gsvelto] from comment #12)
> I noticed in the past, maybe two or three times, that when I tried turning
> on my phone after it being on standby for a while the phone just didn't turn
> on at all and I was forced to pull the battery to have it working again.
> IIRC keeping the power button pushed for a long time also forced the phone
> to reboot. I've never bothered to check what caused this before this but
> report but if it happens again I'll double-check it. In my case this is
> happening on my main phone which is a Flame I use for dogfooding with both
> SIMs populated, running the latest base build plus the current master.

Are you still able to connect with adb when that happens and reboot from there, like I am?
Gathering a RIL log might be helpful if this is network-related; it can be done from the prefs, see here:

https://wiki.mozilla.org/B2G/QA/Tips_And_Tricks#RIL_Debugging
Flags: needinfo?(gsvelto)
These logs are from just before the crash, with RIL debugging enabled through the UI as described on the wiki page.
(In reply to Gabriele Svelto [:gsvelto] from comment #16)
> Gathering a RIL log might be helpful if this is network-related, it can be
> done from the prefs, see here:
> 
> https://wiki.mozilla.org/B2G/QA/Tips_And_Tricks#RIL_Debugging

These logs are from just before the crash, with RIL debugging enabled by modifying the config files, following the procedure described on the wiki page.

NI to you, Gabriele: can you see anything useful in those logs?
Flags: needinfo?(gsvelto)
I can't see anything that should crash the phone, but I'm not super-familiar with the RIL's internals. Hsin-Yi, can you see something suspicious in those logs? What are the chances of something network-related being able to take down the phone like this?
Flags: needinfo?(gsvelto) → needinfo?(htsai)
I encountered the same behavior as Clément at work, where network reception is *very* bad too. Since I changed the date & time setting from automatic to manual, I haven't had the issue anymore (for 2 weeks now). Could it be related to bug 1110010?
(In reply to Gabriele Svelto [:gsvelto] from comment #19)
> I can't see anything that should crash the phone but I'm not super-familiar
> with the RIL's internals. Hsin-Yi, can you see something suspicious in those
> logs? What are the chances of something network-related to be able to take
> down the phone like this?

Let me ask for Edgar's help :P
Flags: needinfo?(htsai) → needinfo?(echen)
Both logs show the device losing the network and receiving a NITZ update just before the crash.
And after receiving the NITZ update, the system time is set back slightly.
log_RIL_graphical_debug.txt: 07-03 13:08:43.802 -> 07-03 13:08:43.055
log_RIL_config_files_debug.txt: 07-06 15:03:53.869 -> 07-06 15:03:53.051

Not sure if this behaviour has any connection with the crash?

log_RIL_graphical_debug.txt:
> 07-03 13:07:55.241   209   636 I Gecko   : RIL Worker: [0] Handling parcel as UNSOLICITED_NITZ_TIME_RECEIVED
> 07-03 13:07:55.241   209   636 I Gecko   : RIL Worker: [0] DateTimeZone string 15/07/03,11:07:55+08,00
> 07-03 13:08:43.752   209   636 I Gecko   : RIL Worker: [0] Handling parcel as UNSOLICITED_NITZ_TIME_RECEIVED
> 07-03 13:08:43.752   209   636 I Gecko   : RIL Worker: [0] DateTimeZone string 15/07/03,11:08:43+08,00
> 07-03 13:08:43.772   209   636 I Gecko   : RIL Worker: [0] Handling parcel as REQUEST_VOICE_REGISTRATION_STATE
> 07-03 13:08:43.772   209   636 I Gecko   : RIL Worker: [0] Received voiceRegistrationState network info.
> 07-03 13:08:43.782   209   636 I Gecko   : RIL Worker: [0] Still missing some more network info, not notifying main thread.
> 07-03 13:08:43.782   209   636 I Gecko   : RIL Worker: [0] voice registration state: 12,,,0,,,,0,,,,,,0,
> 07-03 13:08:43.782   209   636 I Gecko   : RIL Worker: [0] Queuing voiceRegistrationState network info message: {"regState":12,"state":"searching","connected":false,"roaming":false,"emergencyCallsOnly":true,"cell":{"gsmLocationAreaCode":-1,"gsmCellId":-1},"radioTech":0,"type":null,"rilMessageType":"voiceregistrationstatechange"}
> 07-03 13:08:43.055   209   209 I GeckoDump: [system] [TimeCore][1209905.923] handling moztimechange
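
For reference, the DateTimeZone strings in the excerpt above can be decoded with a short Python sketch. It assumes the string follows the Android RIL NITZ convention (UTC time, then a signed offset counted in quarter hours, then an optional DST field); parse_nitz is a hypothetical helper name, not something taken from Gecko.

```python
from datetime import datetime, timedelta, timezone


def parse_nitz(s):
    """Parse 'yy/mm/dd,hh:mm:ss(+/-)tz[,dst]' into (utc, local) datetimes.

    Assumption: the time is UTC and tz is a signed count of quarter hours.
    """
    date_part, rest = s.split(",", 1)
    time_tz = rest.split(",")[0]          # drop the optional DST field
    sign_pos = max(time_tz.rfind("+"), time_tz.rfind("-"))
    clock, tz = time_tz[:sign_pos], time_tz[sign_pos:]
    utc = datetime.strptime(date_part + " " + clock, "%y/%m/%d %H:%M:%S")
    utc = utc.replace(tzinfo=timezone.utc)
    offset = timedelta(minutes=15 * int(tz))  # quarter hours to minutes
    return utc, utc.astimezone(timezone(offset))


utc, local = parse_nitz("15/07/03,11:07:55+08,00")
print(utc.isoformat(), local.isoformat())
```

Under that assumption, "+08" is a +2 h offset, so 11:07:55 UTC becomes 13:07:55 local, which matches the logcat timestamp on the same line.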

log_RIL_config_files_debug.txt:
> 07-06 15:03:53.819   211   625 I Gecko   : RIL Worker: [0] Handling parcel as UNSOLICITED_NITZ_TIME_RECEIVED
> 07-06 15:03:53.829   211   625 I Gecko   : RIL Worker: [0] DateTimeZone string 15/07/06,13:03:53+08,00
> 07-06 15:03:53.051   211   211 I GeckoDump: [system] [TimeCore][1476015.942] handling moztimechange
> 07-06 15:03:53.051   211   625 I Gecko   : RIL Worker: [0] Handling parcel as REQUEST_VOICE_REGISTRATION_STATE
> 07-06 15:03:53.051   211   625 I Gecko   : RIL Worker: [0] Received voiceRegistrationState network info.
> 07-06 15:03:53.051   211   625 I Gecko   : RIL Worker: [0] Still missing some more network info, not notifying main thread.
> 07-06 15:03:53.051   211   625 I Gecko   : RIL Worker: [0] voice registration state: 12,,,0,,,,0,,,,,,0,
> 07-06 15:03:53.051   211   625 I Gecko   : RIL Worker: [0] Queuing voiceRegistrationState network info message: {"regState":12,"state":"searching","connected":false,"roaming":false,"emergencyCallsOnly":true,"cell":{"gsmLocationAreaCode":-1,"gsmCellId":-1},"radioTech":0,"type":null,"rilMessageType":"voiceregistrationstatechange"}
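
The backward time steps noted above can be located mechanically in a full logcat capture. This is a minimal Python sketch, assuming logcat "threadtime" style timestamps (MM-DD HH:MM:SS.mmm) as in the excerpts; find_backward_jumps is a hypothetical helper, and the year is an assumption because logcat does not print it.

```python
import re
from datetime import datetime, timedelta

# Matches a threadtime-style stamp such as "07-03 13:08:43.782".
STAMP = re.compile(r"(\d\d-\d\d \d\d:\d\d:\d\d\.\d{3})")


def find_backward_jumps(lines, year=2015):
    """Return (line_number, step_back) for each line older than the previous one."""
    jumps = []
    prev = None
    for number, line in enumerate(lines, 1):
        match = STAMP.search(line)
        if not match:
            continue  # skip lines without a timestamp
        cur = datetime.strptime(f"{year}-{match.group(1)}",
                                "%Y-%m-%d %H:%M:%S.%f")
        if prev is not None and cur < prev:
            jumps.append((number, prev - cur))
        prev = cur
    return jumps


sample = [
    "07-03 13:08:43.782   209   636 I Gecko   : RIL Worker: [0] Queuing ...",
    "07-03 13:08:43.055   209   209 I GeckoDump: [system] [TimeCore] ...",
]
for number, step in find_backward_jumps(sample):
    print(f"line {number}: clock went back by {step.total_seconds():.3f}s")
```

Note that logcat output from different threads can be slightly out of order even without a clock change, so very small steps are not necessarily a time update.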
Flags: needinfo?(echen)
(In reply to Mark Trompell from comment #5)
> I see random crashes too, and phone restarts when pressing powerbutton for
> ~10 sec.
> I can usually still connect with adb shell and do a restart. But can't
> confirm right now as I can't reproduce.
> It startet roughly the same time I experienced issues with the battery (bug
> 1178869)

So, I double checked about this and:

- Yes, pressing the power button for ~10-15s restarts the phone (though maybe not in exactly the same way as removing the battery: it felt like it needed more time to recover the time from the network, but maybe I was just unlucky).

- No, it doesn't allow me to connect to the phone with adb, especially since the phone presents itself to the computer in a different state and as a different device.
I see different behaviour, so I created Bug 1179679.
I can still connect via adb, and the issue is independent of the date-time setting.
It may still be the same root cause, maybe even connected to the battery issues some people experience, such as a process going wild and the OOM killer killing the wrong one.
Now that bug 1154072 is done, can you give a try to this new base image? Make sure you backup your data of course :)
Flags: needinfo?(clement.lefevre)
(In reply to Alexandre LISSY :gerard-majax from comment #25)
> Now that bug 1154072 is done, can you give a try to this new base image?
> Make sure you backup your data of course :)

So, after flashing the new v18D v3, the crash still happens quickly under the same conditions.

Attached is a new logcat gathered with this new base image, in case it offers more information.
Flags: needinfo?(clement.lefevre)
Keywords: foxfood
QA Whiteboard: [foxfood-triage]
I recently noticed what is probably a side effect: if a recurring alarm is set, then after one of those crashes the alarm still appears as set when you check, but it will not ring anymore unless you unset it and set it again.

This is currently the most common behavior. Some time ago, I noticed the alarm ringing on reboot, and then not ringing anymore either.

In any case, the crash seems to have side effects on alarms.
[Blocking Requested - why for this release]:
blocking-b2g: --- → 2.5?
Gregor, can you please comment on the nomination and block accordingly? In the triage session we do not have the right audience to block this.
Flags: needinfo?(anygregor)
I don't know this part of the code. We need Hsin-Yi to weigh in on how bad this random crash is.
Flags: needinfo?(anygregor) → needinfo?(htsai)
(In reply to Gregor Wagner [:gwagner] from comment #30)
> I don't know this part of the code. We need Hsin-Yi to weight in about how
> bad this random crash is.

I couldn't see any obvious crash-related messages from the RIL, so I can't comment further.
Flags: needinfo?(htsai)
As Hsin-Yi can't find anything, removing the nomination. Please renominate if it is reproduced.

Thanks
blocking-b2g: 2.5? → ---
WONTFIX as unreproducible; feel free to provide a better scenario.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → WONTFIX
This is a kernel panic, and we had a bug about it a while ago (bug 1130035). At first we were able to reproduce it reliably by visiting the website mentioned in that bug, then we were no longer able to repro. I still see this from time to time on Flame and doubt that we will ever get steps any better than what we had in bug 1130035.
Flags: needinfo?(jmercado)
Keywords: steps-wanted
Flags: needinfo?(jmercado)