Closed Bug 1121374 Opened 9 years ago Closed 9 years ago

Devices crash during test runs and can no longer be found via ADB

Categories

(Firefox OS Graveyard :: Infrastructure, defect)

ARM
Gonk (Firefox OS)
defect
Not set
critical

Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 1122119

People

(Reporter: RobertC, Assigned: gmealer)

References

Details

We have seen a lot of "device not found" errors lately, and some of those errors seem to be caused by the device crashing in the previous test run with the following stacktrace: http://pastebin.mozilla.org/8214824

I looked over the nodes for a pattern, but could not find one. Below you will find information from some of the nodes and the jobs in which the devices seem to have crashed.

node: 28.2
job: flame-kk-319.mozilla-central.ui.graphics.smoke #58
job link: http://jenkins1.qa.scl3.mozilla.com/job/flame-kk-319.mozilla-central.ui.graphics.smoke/58/console
job run time: 02:46:13 -> 03:10:01
last test running: TestCalendar.test_calendar_flick

node: 15.1
job: flame-kk-319.b2g-inbound.ui.functional.smoke #2014
job link: http://jenkins1.qa.scl3.mozilla.com/job/flame-kk-319.b2g-inbound.ui.functional.smoke/2014/console
job run time: 14:33:28 -> 15:07:23
last test running: TestSetupAndSendIMAPEmail.test_setup_and_send_imap_email

node: 23.2
job: flame-kk-319.b2g-inbound.ui.functional.sanity #6027
job link: http://jenkins1.qa.scl3.mozilla.com/job/flame-kk-319.b2g-inbound.ui.functional.sanity/6027/console
job run time: 16:53:24 -> 17:06:11
last test running: TestSms.test_sms_send

The time after the start of the job at which the device crashes varies, as do the last test that ran and the job that fails.
I think we need more data before we can identify a pattern. Just three examples is probably not enough.
Bebe noticed an increase in jobs being aborted by the build timeout plugin, which was set to abort after 5 minutes of inactivity. I could imagine an abort leaving the device in an unknown state (although I wouldn't really expect it to go offline, and it should recover). A recently released version of the build timeout plugin may have made this detection more effective. I understand Bebe is going to disable these timeouts to see if it helps with stability.
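For context, this is roughly what such an inactivity abort does; a minimal illustrative sketch only, not the build timeout plugin's code or our actual job configuration (the command at the bottom is a placeholder):

# Illustrative only: an "abort after N minutes of no console output" watchdog.
import subprocess
import threading
import time

INACTIVITY_LIMIT = 5 * 60  # seconds of silence before aborting

def run_with_inactivity_timeout(cmd):
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, text=True)
    last_output = [time.time()]

    def pump():
        for line in proc.stdout:
            print(line, end="")
            last_output[0] = time.time()

    threading.Thread(target=pump, daemon=True).start()

    while proc.poll() is None:
        if time.time() - last_output[0] > INACTIVITY_LIMIT:
            proc.kill()  # abort; note the device may be left mid-test
            raise RuntimeError("no output for %d seconds" % INACTIVITY_LIMIT)
        time.sleep(1)
    return proc.returncode

# run_with_inactivity_timeout(["./run_ui_tests.sh"])  # placeholder command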
Looks like more and more devices are crashing because of this issue
See Bug 1121501:

b2g-20.1 
http://jenkins1.qa.scl3.mozilla.com/job/flame-kk-319.b2g-inbound.ui.functional.smoke/2070/console 
  mozdevice.devicemanager.DMError: bad status for device e472d8d7: offline

b2g-23.1 
http://jenkins1.qa.scl3.mozilla.com/job/flame-kk-319.b2g-inbound.ui.functional.smoke/2042/console

b2g-09.1

http://jenkins1.qa.scl3.mozilla.com/job/flame-kk-319.mozilla-central.ui.functional.smoke/259/console

are hitting the same issue.


Geo, we need to investigate these devices and see why and how this is reproducing.
Flags: needinfo?(gmealer)
(In reply to Dave Hunt (:davehunt) from comment https://bugzilla.mozilla.org/show_bug.cgi?id=1121501#c7)
> It looks like this often happens in the e-mail or SMS tests? Perhaps this is
> a genuine crasher. Can we replicate it locally, or try with an older build?
> Is it possible that we need the v18D-1 base build?

I ran the email tests and can't replicate the crash.
Gaia-Rev        e018c9de2f28927e5ab0d12baeb9d463ad7ff883
Gecko-Rev       https://hg.mozilla.org/integration/b2g-inbound/rev/8741b8bde94f
Build-ID        20150114044721
Version         38.0a1
Device-Name     flame
FW-Release      4.4.2
FW-Incremental  eng.cltbld.20150114.083952
FW-Date         Wed Jan 14 08:40:02 EST 2015
Bootloader      L1TC10011880
(In reply to Florin Strugariu [:Bebe] from comment #3)

> Geo, we need to investigate these devices and see why and how this is
> reproducing.

Ugh, I'm not even sure how to get started with trying to reproduce that. Any tips from historical fixes?
Flags: needinfo?(gmealer)
Flags: needinfo?(florin.strugariu)
Geo let's start by logging the current state of the phones and maybe try to get a logcat from them

I will run some more tests and see if I can reproduce the issue.
Flags: needinfo?(florin.strugariu)
OK, I'll have a better report forthcoming, but I'm finding a lot of devices off and at the battery prompt. More concerning: after booting them, USB debugging is disabled.

Are we putting a poison build on these? What are the various things that could cause USB debugging to be flipped to disabled? I know I personally flipped many of these to ADB on Monday.
As far as I know we are running the same setup as always... from the console log, the devices are running the usual tests until, for some reason during the email/SMS tests, something goes bad and we can't find the device anymore...
OK, I've reset a number of machines again and verified ADB access.

Before I list them, the "Disabled" under USB Debugging is a red herring. Something about our test suite is making it appear that way, but ADB isn't really disabled.

I was able to verify this running the sanity suite locally on 25-2's phone, which was sufficient to make "Disabled" appear. I used the same build and Gaia revision as the last set of sanity tests that ran on that phone. This happened even though the Bluetooth test didn't run for lack of an installed module (we should add it to setup.py, maybe?).

I did go through the same tests manually and couldn't make it happen. Either there's a subtle difference between what I did and what the test did, or it's the "reset to known state" intermission. I suspect the latter, and it's probably not user-facing. That said, I did try killing b2g to restart it, and the problem didn't occur; a sketch of that kind of restart is below.
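For reference, a minimal sketch of a manual b2g kill/restart over adb of the kind mentioned above. The serial is a placeholder, and the harness's own "reset to known state" step may do more than this:

import subprocess

SERIAL = "0123456789ABCDEF"  # placeholder; use the node's ANDROID_SERIAL

def restart_b2g(serial=SERIAL):
    # "stop"/"start" control the b2g service on Gonk, restarting Gecko/Gaia
    subprocess.check_call(["adb", "-s", serial, "shell", "stop", "b2g"])
    subprocess.check_call(["adb", "-s", serial, "shell", "start", "b2g"])

if __name__ == "__main__":
    restart_b2g()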

These are the phones I touched:

Were off when I found them:

9
13
16.2
17.1
17.2
18.1
23.1
24.2
26.2
28.2

20.1 was so far off that it wouldn't respond to the power button. Removing and replacing the battery reset it, and it booted from there.

All phones were charged to 100%. I did check version, and it was a 3.0.0 prerelease, with a range of different build dates. USB Storage was disabled (I sanity-checked it just in case we were shutting down ADB that way).

Of the other phones mentioned above:

28.2 was idle but on. It was, however, showing a "must be online to do that" type error, and the SIM had a !.
23.2 was idle but on. It was sitting on the contacts screen.
15.1 was actually running tests by the time I looked at it, so couldn't have been *that* non-viable.

Of these, these devices overlapped with Monday's bug 1120307, so I know they were reset recently:

16.2
18.1
24.2
28.2

I'm keeping this bug open for more investigation. In particular, I think we should look at test runs and configuration for the four devices above, focusing on things between Monday and today as a range, and see if we can figure out a commonality.
Assignee: nobody → gmealer
I just bricked my device, and it came back up after 10 minutes or so with the crash report. (usually there is no crash report, but this time for some reason it showed up) the link is:
https://crash-stats.mozilla.com/report/index/d23813bd-fb9a-41be-a919-30c522150115

Just in case it helps.
(In reply to No-Jun Park [:njpark] from comment #11)
> I just bricked my device, and it came back up after 10 minutes or so with
> the crash report. (usually there is no crash report, but this time for some
> reason it showed up) the link is:
> https://crash-stats.mozilla.com/report/index/d23813bd-fb9a-41be-a919-
> 30c522150115
> 
> Just in case it helps.

In my case the crash happens (intermittently) when python marionette restarts the device.
Also ran tests all morning and could not reproduce the failure.
Looks like b2g-28.1 is down
http://jenkins1.qa.scl3.mozilla.com/computer/b2g-28.1/

I will mark the node as offline :(
Severity: normal → critical
I'll follow up briefly since I'm going to have to turn on more devices today.

But you need to look further than just show_version to check the device. If you'd run adb on the machine, you'd have seen two devices there as expected. 

Instead, the job failed because 28.1 had reverted to the generic serial "0123456789ABCDEF" rather than what we have in ANDROID_SERIAL.

I'm going to temporarily change the job spec for 28.1 to the generic serial, since the other device has a real number. But I'll also file a bug, since that shouldn't happen in the first place.

NI'ing so that you see this. I really recommend you use the listb2g() shell function I sent in email to check, rather than running show_version. At the very least, a show_version failure should be double-checked (a pass is likely legit).
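For anyone without that email, a rough sketch of that kind of check (this is not the actual listb2g() function, whose contents aren't pasted here): confirm the serial in ANDROID_SERIAL shows up in `adb devices` with state "device" rather than "offline".

import os
import subprocess

def expected_device_present():
    serial = os.environ.get("ANDROID_SERIAL")
    if not serial:
        return False
    out = subprocess.check_output(["adb", "devices"], text=True)
    for line in out.splitlines()[1:]:  # skip the "List of devices attached" header
        fields = line.split()
        if len(fields) == 2 and fields[0] == serial and fields[1] == "device":
            return True
    return False

if __name__ == "__main__":
    print("expected device present:", expected_device_present())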
Flags: needinfo?(florin.strugariu)
Filed a bug on b2g-28.1, since I couldn't get the config change to take, and more work might be necessary there anyway.

Today's list of restarts is:

16.1
18.1
21.1
26.1
26.2
27.2

18.1 and 26.2 are repeats from yesterday, making them the current best candidates for investigation.

Right now, my best guess is that the following scenario is happening:

1) Something crashes or freezes the phone to the point that USB charging is disabled.
2) Phone runs out of power, turning itself off.
3) Having turned off, and without whatever 1) was in the way, powered-off charging kicks in.

Alternatively, something is simply shutting down the phone and not turning it back on, but I can't imagine what would do that spontaneously.

Am going to continue to investigate. I've brought 21.1 back online as well.
See Also: → 1122292
More info:

First, thought I'd found a pattern:

Killed 18.1: http://jenkins1.qa.scl3.mozilla.com/job/flame-kk-319.b2g-inbound.perf.gaia/3700/console

Was good at 9:22:44 after test data push
Bad as of 10:03:57 after making node modules

Why the 40-minute gulf between making node modules and the next line?

Killed 26.2: http://jenkins1.qa.scl3.mozilla.com/job/flame-kk-319.b2g-inbound.perf.gaia/3670/console

Good at 20:15:03 after test data push
Bad as of 20:47:01 after making node modules

Same 40-minute gulf.

Killed 16.1: http://jenkins1.qa.scl3.mozilla.com/job/flame-kk-319.b2g-inbound.perf.gaia/3697/console 08:42:08 -> 08:48:16, same bracketing

No 40-minute gulf.

Unfortunately, the next one I checked wasn't an inbound perf job:

Killed 21.1: http://jenkins1.qa.scl3.mozilla.com/job/flame-kk-319.b2g-inbound.ui.functional.smoke/2090/console

Good at 3:45:06 (testing 3gp video)
Bad at 3:46:16

With the crash theory in mind, I investigated whether crash reports were left on the device. No such luck. This was curious, though, on 21.1:

root@flame:/data/b2g/mozilla/Crash Reports # ls -l
-rw------- root     root           10 2015-01-15 03:45 InstallTime20150115003933
drwx------ root     root              2015-01-15 03:45 events

The contents of InstallTime were an epoch timestamp for 03:45:10. That's when the crash happened.

So, I went back to 16.1:

root@flame:/data/b2g/mozilla/Crash Reports # ls -l
-rw------- root     root           10 2015-01-15 15:09 InstallTime20150115054735
drwx------ root     root              2015-01-15 15:09 events

Contents: 1421334571 == Thu Jan 15 07:09:31 PST 2015

It is a little curious that the InstallTime contents are a clean 12 hours off from the file timestamp, but either way that doesn't correspond with the 08:42:08 -> 08:48:16 window the log shows.

Still, I think the next step might be to see what makes InstallTime* update itself, since at least on 21.1 that's one heck of a coincidence.
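For anyone repeating this check, here's a rough sketch of pulling the InstallTime* contents over adb and converting the epoch value to local time. The serial below is a placeholder; the path matches the listings above:

import datetime
import subprocess

SERIAL = "0123456789ABCDEF"  # placeholder; use the node's ANDROID_SERIAL
CRASH_DIR = "/data/b2g/mozilla/Crash Reports"

def adb_shell(cmd):
    # adb passes the string to the device shell, so quote the path (it has a space)
    return subprocess.check_output(["adb", "-s", SERIAL, "shell", cmd], text=True)

for name in adb_shell('ls "%s"' % CRASH_DIR).split():
    if name.startswith("InstallTime"):
        epoch = int(adb_shell('cat "%s/%s"' % (CRASH_DIR, name)).strip())
        print(name, "->", datetime.datetime.fromtimestamp(epoch))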
Just realized I misread the file timestamp on 16.1, which is 3:09PM. So that's 8 hours off, which makes more sense for a phone set to GMT.
Killed 23.1: http://jenkins1.qa.scl3.mozilla.com/job/flame-kk.ui.adhoc/593/console

Good at 10:58:59 (possibly 10:57:42, depends on whether locator failed due to crash), bad at 11:00:17, after testing a11y callscreen visibility.

Killed 24.2: http://jenkins1.qa.scl3.mozilla.com/job/flame-kk-319.b2g-inbound.ui.functional.sanity/6152/console

Good at 10:01:43, bad at 10:02:57, after testing sms send or bluetooth settings (probably former).

Those are both radio tests, so there's that.

That said, I just watched 24.2's InstallTime* timestamp change during the test, during an otherwise clean run. I think it's the kill/restart of b2g that does it, so no luck there. 

b2g-16's non-corresponding time must be because the perf tests don't do a restart as part of setup, so that really was the install time.
(In reply to No-Jun Park [:njpark] from comment #12)

> In my case the crash happens (intermittently) when python marionette
> restarts the device.

That aligns with the InstallTime* timestamps I'm finding, when they haven't been clobbered by a later test run like 24.2's was.
With v188-1, the crash happens as well, although the recovery time seems to be shorter in v188-1 (3-4 minutes or so).
When rebooted, it asked me to go through the FTE again.

Crash log in v188-1:
https://crash-stats.mozilla.com/report/index/f5ea907e-4277-4dcf-b6c5-a70f02150116
Sometimes it does not recover from crash, and marionette gives out this message:

Most recent errors/exceptions are:
01-16 01:44:36.673 E/ACDB-LOADER( 9029): Error: ACDB ANC returned = -101-16 01:44:36.673 E/ACDB-LOADER( 9029): Error: ACDB ANC returned = -101-16 01:44:36.673 E/ACDB-LOADER( 9029): Error: ACDB ANC returned = -101-16 01:44:36.673 E/ACDB-LOADER( 9029): Error: ACDB ANC returned = -101-16 01:44:36.673 E/ACDB-LOADER( 9029): Error: ACDB ANC returned = -1
The B2G process has restarted after crashing during  the tests so Marionette can't respond due to either a Gecko, Gaia or Marionette error. Above, the 5 most recent errors are listed. Check logcat for all errors if these errors are not the cause of the failure.
(In reply to No-Jun Park [:njpark] from comment #24)
> Sometimes it does not recover from crash, and marionette gives out this
> message:
> 
> Most recent errors/exceptions are:
> 01-16 01:44:36.673 E/ACDB-LOADER( 9029): Error: ACDB ANC returned = -101-16
> 01:44:36.673 E/ACDB-LOADER( 9029): Error: ACDB ANC returned = -101-16
> 01:44:36.673 E/ACDB-LOADER( 9029): Error: ACDB ANC returned = -101-16
> 01:44:36.673 E/ACDB-LOADER( 9029): Error: ACDB ANC returned = -101-16
> 01:44:36.673 E/ACDB-LOADER( 9029): Error: ACDB ANC returned = -1
> The B2G process has restarted after crashing during  the tests so Marionette
> can't respond due to either a Gecko, Gaia or Marionette error. Above, the 5
> most recent errors are listed. Check logcat for all errors if these errors
> are not the cause of the failure.

What I'd also really like to know is whether the phone powers off or stops charging if you leave it at the crash. If you can make one happen before you go to bed or something, that'd be really useful.
Just checked logcat for 23.1. It cuts off cold at 11:00:09, with no particular errors showing above it.

The system was not in the middle of a restart, afaict--that happened somewhat prior to 10:59, but the homescreen had been found by Marionette and it looks like it was in the middle of a11y test code.

Last few entries:

01-15 11:00:09.295  7773  7773 W Communications:     at reportMissingEntity (app://communications.gaiamobile.org/shared/js/l10n.js:1513:8)
01-15 11:00:09.425  7289  7289 I Gecko   : 1421348409439	Marionette	DEBUG	Got request: findElement, data: {"to":"0","sessionId":"{a1eb7cbf-c070-4705-b3ae-61d2effc8d8c}","name":"findElement","parameters":{"using":"css selector","value":"div.keypad-key[data-value=\"9\"]"}}, id: null
01-15 11:00:09.435  7289  7289 I Gecko   : 1421348409446	Marionette	INFO	sendToClient: {"from":"0","sessionId":"{3f28d6f7-52cb-440e-8a4c-a6cb6d1e1bca}","value":{"ELEMENT":"{ac878fb4-74b0-442d-a523-a63cc52cfee1}"}}, {4b3bf113-13cf-44bc-a967-6f517f9e72af}, {4b3bf113-13cf-44bc-a967-6f517f9e72af}
01-15 11:00:09.445  7289  7289 I Gecko   : 1421348409456	Marionette	DEBUG	Got request: executeAsyncScript, data: {"to":"0","sessionId":"{a1eb7cbf-c070-4705-b3ae-61d2effc8d8c}","name":"executeAsyncScript","parameters":{"scriptTimeout":null,"specialPowers":true,"script":"return Accessibility.click.apply(Accessibility, arguments)","newSandbox":true,"args":[{"ELEMENT":"{ac878fb4-74b0-442d-a523-a63cc52cfee1}"}],"filename":"gaia_test.py","line":491}}, id: null
01-15 11:00:09.535  7289  7289 I Gecko   : 1421348409545	Marionette	INFO	sendToClient: {"from":"0","sessionId":"{3f28d6f7-52cb-440e-8a4c-a6cb6d1e1bca}","value":null}, {af36e5ee-5d3a-40bc-a10d-5b6c11403574}, {af36e5ee-5d3a-40bc-a10d-5b6c11403574}
01-15 11:00:09.795  7289  7289 I Gecko   : 1421348409801	Marionette	DEBUG	Got request: findElement, data: {"to":"0","sessionId":"{a1eb7cbf-c070-4705-b3ae-61d2effc8d8c}","name":"findElement","parameters":{"using":"css selector","value":"div.keypad-key[data-value=\"7\"]"}}, id: null

I'll look in other ones tomorrow--I'm out of time for today.
(In reply to Geo Mealer [:geo] from comment #19)
> Was good at 9:22:44 after test data push
> Bad as of 10:03:57 after making node modules
> 
> Why the 40-minute gulf between making node modules and the next line?

The perf builds don't flush anything to console during this time. A successful build could be quiet for up to two hours.
http://jenkins1.qa.scl3.mozilla.com/computer/b2g-07.1/ is down again
Flags: needinfo?(florin.strugariu)
I had to put some more devices down today.


Also investigated No-Jun's failure, and it looks like a different critical issue; see Bug 1122503.

Tested locally with multiple builds and with both flashing scripts (Jenkins and Tw) and could not reproduce the failure.
I had to put down some nodes this morning :(

Take a look at http://jenkins1.qa.scl3.mozilla.com/computer/ to see the disabled nodes.
Depends on: 1122119
I disabled 2 more nodes:
20.1 - adb devices returns 1 device
24.1 - adb devices returns 0 devices

Can someone try and restart them? We are running out of nodes to run the tests on.
Flags: needinfo?(pmathur)
Flags: needinfo?(gmealer)
Restarted:

4
13
15
17-1
18-1
18-2
19-1
20-1
22-2
23-1
24-1
24-2
25-2
26-2
28-2

22-2 came up in recovery mode, so I need to manually reflash it.
28-1 was taken offline, but looked viable to me. I'm checking it further.

I brought all nodes but those two (and 25-1, which is a perf node Dave disabled) back online in Jenkins. By the time you read this, I'll probably have fixed those and brought them online too. I'll post an update only if I don't.
Flags: needinfo?(gmealer)
I brought 22-2 back up. 28-1 was the one with the Android serial number issue, which I have not solved yet.

I also took 26-1 down for post-mortem on bug 1122119, and am keeping it for a day so people can respond without us having blown away the logs.
Flags: needinfo?(pmathur)
23.1 crashed in http://jenkins1.qa.scl3.mozilla.com/view/UI/job/flame-kk-319.mozilla-central.ui.functional.smoke/275/console

I took the node offline, but it needs to be restarted.
Flags: needinfo?(gmealer)
Have continued to bring nodes up on a daily basis. I'm going to quit posting the list, as I think it's not useful for isolation like I'd hoped. I'm also going to start bringing up nodes twice a day (when I come in and before I go home).
Flags: needinfo?(gmealer)
Upstream bug is fixed via backout at this point. A bunch of phones were still down this morning, but that's not surprising given the backout didn't happen until today. Once I see that no phones are crashing, I'll resolve-fix this.
I'm closing this out. We're not white-screening anymore, which is what this bug really applied to.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Marking this bug as a duplicate of bug 1122119 as per comment 38.
Resolution: FIXED → DUPLICATE