Closed Bug 1121374 Opened 9 years ago Closed 9 years ago

Devices crash during test runs and can no longer be found via ADB

Categories

(Firefox OS Graveyard :: Infrastructure, defect)

ARM
Gonk (Firefox OS)
defect
Not set
critical

Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 1122119

People

(Reporter: RobertC, Assigned: gmealer)

References

Details

We have seen a lot of "device not found" errors lately, and some of those errors seem to be caused by the device crashing in the previous test run with the following stacktrace: http://pastebin.mozilla.org/8214824

I looked over the nodes for a pattern, but could not find one. Below you will find information from some of the nodes and the jobs in which the devices seem to have crashed.

node: 28.2
job: flame-kk-319.mozilla-central.ui.graphics.smoke #58
job link: http://jenkins1.qa.scl3.mozilla.com/job/flame-kk-319.mozilla-central.ui.graphics.smoke/58/console
job run time: 02:46:13 -> 03:10:01
last test running: TestCalendar.test_calendar_flick

node: 15.1
job: flame-kk-319.b2g-inbound.ui.functional.smoke #2014
job link: http://jenkins1.qa.scl3.mozilla.com/job/flame-kk-319.b2g-inbound.ui.functional.smoke/2014/console
job run time: 14:33:28 -> 15:07:23
last test running: TestSetupAndSendIMAPEmail.test_setup_and_send_imap_email

node: 23.2
job: flame-kk-319.b2g-inbound.ui.functional.sanity #6027
job link: http://jenkins1.qa.scl3.mozilla.com/job/flame-kk-319.b2g-inbound.ui.functional.sanity/6027/console
job run time: 16:53:24 -> 17:06:11
last test running: TestSms.test_sms_send

The time after the start of the job at which the device crashes varies, as do the last test that ran and the job that fails.
I think we need more data before we can identify a pattern. Just three examples is probably not enough.
Bebe noticed an increase in jobs being aborted by the build timeout plugin, which was set to abort after 5 minutes of inactivity. I could imagine an abort leaving the device in an unknown state (although I wouldn't really expect it to go offline, and it should recover). A recently released version of the build timeout plugin may have made this detection more effective. I understand Bebe is going to disable these timeouts to see if it helps with stability.
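For context, this is roughly what such an inactivity abort does; a minimal illustrative sketch only, not the build timeout plugin's code or our actual job configuration (the command at the bottom is a placeholder):

# Illustrative only: an "abort after N minutes of no console output" watchdog.
import subprocess
import threading
import time

INACTIVITY_LIMIT = 5 * 60  # seconds of silence before aborting

def run_with_inactivity_timeout(cmd):
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, text=True)
    last_output = [time.time()]

    def pump():
        for line in proc.stdout:
            print(line, end="")
            last_output[0] = time.time()

    threading.Thread(target=pump, daemon=True).start()

    while proc.poll() is None:
        if time.time() - last_output[0] > INACTIVITY_LIMIT:
            proc.kill()  # abort; note the device may be left mid-test
            raise RuntimeError("no output for %d seconds" % INACTIVITY_LIMIT)
        time.sleep(1)
    return proc.returncode

# run_with_inactivity_timeout(["./run_ui_tests.sh"])  # placeholder command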
Looks like more and more devices are crashing because of this issue
See Bug 1121501:

b2g-20.1 
http://jenkins1.qa.scl3.mozilla.com/job/flame-kk-319.b2g-inbound.ui.functional.smoke/2070/console 
  mozdevice.devicemanager.DMError: bad status for device e472d8d7: offline

b2g-23.1 
http://jenkins1.qa.scl3.mozilla.com/job/flame-kk-319.b2g-inbound.ui.functional.smoke/2042/console

b2g-09.1

http://jenkins1.qa.scl3.mozilla.com/job/flame-kk-319.mozilla-central.ui.functional.smoke/259/console

are hitting the same issue.


Geo, we need to investigate these devices and see why and how this is reproducing.
Flags: needinfo?(gmealer)
(In reply to Dave Hunt (:davehunt) from comment https://bugzilla.mozilla.org/show_bug.cgi?id=1121501#c7)
> It looks like this often happens in the e-mail or SMS tests? Perhaps this is
> a genuine crasher. Can we replicate it locally, or try with an older build?
> Is it possible that we need the v18D-1 base build?

I ran the email tests and can't replicate the crash.
Gaia-Rev        e018c9de2f28927e5ab0d12baeb9d463ad7ff883
Gecko-Rev       https://hg.mozilla.org/integration/b2g-inbound/rev/8741b8bde94f
Build-ID        20150114044721
Version         38.0a1
Device-Name     flame
FW-Release      4.4.2
FW-Incremental  eng.cltbld.20150114.083952
FW-Date         Wed Jan 14 08:40:02 EST 2015
Bootloader      L1TC10011880
(In reply to Florin Strugariu [:Bebe] from comment #3)

> Geo, we need to investigate these devices and see why and how this is
> reproducing.

Ugh, I'm not even sure how to get started with trying to reproduce that. Any tips from historical fixes?
Flags: needinfo?(gmealer)
Flags: needinfo?(florin.strugariu)
Geo let's start by logging the current state of the phones and maybe try to get a logcat from them

I will run some more tests and see if I can reproduce the issue.
Flags: needinfo?(florin.strugariu)
OK, I'll have a better report forthcoming, but I'm finding a lot of devices off and at the battery prompt. More concerning: after booting them, USB debugging is disabled.

Are we putting a poison build on these? What are the various things that could cause USB debugging to be flipped to disabled? I know I personally flipped many of these to ADB on Monday.
As far as I know we are running the same setup as always... from the console log, the devices are running the usual tests until, for some reason during the email/SMS tests, something goes bad and we can't find the device anymore...
OK, I've reset a number of machines again and verified ADB access.

Before I list them, the "Disabled" under USB Debugging is a red herring. Something about our test suite is making it appear that way, but ADB isn't really disabled.

I was able to verify this running the sanity suite locally on 25-2's phone, which was sufficient to make "Disabled" appear. I used the same build and Gaia revision as the last set of sanity tests that ran on that phone. This happened even though the Bluetooth test didn't run for lack of an installed module (we should add it to setup.py, maybe?).

I did go through the same tests manually and couldn't make it happen. Either there's a subtle difference between what I did and what the test did, or it's the "reset to known state" intermission. I suspect the latter, and it's probably not user-facing. That said, I did try killing b2g to restart it, and the problem didn't occur; a sketch of that kind of restart is below.
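For reference, a minimal sketch of a manual b2g kill/restart over adb of the kind mentioned above. The serial is a placeholder, and the harness's own "reset to known state" step may do more than this:

import subprocess

SERIAL = "0123456789ABCDEF"  # placeholder; use the node's ANDROID_SERIAL

def restart_b2g(serial=SERIAL):
    # "stop"/"start" control the b2g service on Gonk, restarting Gecko/Gaia
    subprocess.check_call(["adb", "-s", serial, "shell", "stop", "b2g"])
    subprocess.check_call(["adb", "-s", serial, "shell", "start", "b2g"])

if __name__ == "__main__":
    restart_b2g()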

These are the phones I touched:

Were off when I found them:

9
13
16.2
17.1
17.2
18.1
23.1
24.2
26.2
28.2

20.1 was so far off that it wouldn't respond to the power button. Removing and replacing the battery reset it, and it booted from there.

All phones were charged to 100%. I did check version, and it was a 3.0.0 prerelease, with a range of different build dates. USB Storage was disabled (I sanity-checked it just in case we were shutting down ADB that way).

Of the other phones mentioned above:

28.2 was idle but on. It was, however, showing a "must be online to do that" type error, and the SIM had a !.
23.2 was idle but on. It was sitting on the contacts screen.
15.1 was actually running tests by the time I looked at it, so couldn't have been *that* non-viable.

Of these, these devices overlapped with Monday's bug 1120307, so I know they were reset recently:

16.2
18.1
24.2
28.2

I'm keeping this bug open for more investigation. In particular, I think we should look at test runs and configuration for the four devices above, focusing on things between Monday and today as a range, and see if we can figure out a commonality.
Assignee: nobody → gmealer
I just bricked my device, and it came back up after 10 minutes or so with the crash report. (usually there is no crash report, but this time for some reason it showed up) the link is:
https://crash-stats.mozilla.com/report/index/d23813bd-fb9a-41be-a919-30c522150115

Just in case it helps.
(In reply to No-Jun Park [:njpark] from comment #11)
> I just bricked my device, and it came back up after 10 minutes or so with
> the crash report. (usually there is no crash report, but this time for some
> reason it showed up) the link is:
> https://crash-stats.mozilla.com/report/index/d23813bd-fb9a-41be-a919-
> 30c522150115
> 
> Just in case it helps.

In my case the crash happens (intermittently) when python marionette restarts the device.
Also ran tests all morning and could not reproduce the failure.
Looks like b2g-28.1 is down
http://jenkins1.qa.scl3.mozilla.com/computer/b2g-28.1/

I will mark the node as offline :(
Severity: normal → critical
I'll follow up briefly since I'm going to have to turn on more devices today.

But you need to look further than just show_version to check the device. If you'd run adb on the machine, you'd have seen two devices there as expected. 

Instead, the job failed because 28.1 had reverted to the generic serial "0123456789ABCDEF" rather than what we have in ANDROID_SERIAL.

I'm going to temporarily change the job spec for 28.1 to the generic serial, since the other device has a real number. But I'll also file a bug, since that shouldn't happen in the first place.

NI'ing so that you see this. I really recommend you use the listb2g() shell function I sent in email to check, rather than running show_version. At the very least, a show_version failure should be double-checked (a pass is likely legit).
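For anyone without that email, a rough sketch of that kind of check (this is not the actual listb2g() function, whose contents aren't pasted here): confirm the serial in ANDROID_SERIAL shows up in `adb devices` with state "device" rather than "offline".

import os
import subprocess

def expected_device_present():
    serial = os.environ.get("ANDROID_SERIAL")
    if not serial:
        return False
    out = subprocess.check_output(["adb", "devices"], text=True)
    for line in out.splitlines()[1:]:  # skip the "List of devices attached" header
        fields = line.split()
        if len(fields) == 2 and fields[0] == serial and fields[1] == "device":
            return True
    return False

if __name__ == "__main__":
    print("expected device present:", expected_device_present())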
Flags: needinfo?(florin.strugariu)
Filed a bug on b2g-28.1, since I couldn't get the config change to take, and more work might be necessary there anyway.

Today's list of restarts is:

16.1
18.1
21.1
26.1
26.2
27.2

18.1 and 26.2 are repeats from yesterday, making them the current best candidates for investigation.

Right now, my best guess is that the following scenario is happening:

1) Something crashes or freezes the phone to the point that USB charging is disabled.
2) Phone runs out of power, turning itself off.
3) Having turned off, and without whatever 1) was in the way, powered-off charging kicks in.

Alternatively, something is simply shutting down the phone and not turning it back on, but I can't imagine what would do that spontaneously.

Am going to continue to investigate. I've brought 21.1 back online as well.
See Also: → 1122292
More info:

First, thought I'd found a pattern:

Killed 18.1: http://jenkins1.qa.scl3.mozilla.com/job/flame-kk-319.b2g-inbound.perf.gaia/3700/console

Was good at 9:22:44 after test data push
Bad as of 10:03:57 after making node modules

Why the 40-minute gulf between making node modules and the next line?

Killed 26.2: http://jenkins1.qa.scl3.mozilla.com/job/flame-kk-319.b2g-inbound.perf.gaia/3670/console

Good at 20:15:03 after test data push
Bad as of 20:47:01 after making node modules

Same 40-minute gulf.

Killed 16.1: http://jenkins1.qa.scl3.mozilla.com/job/flame-kk-319.b2g-inbound.perf.gaia/3697/console 08:42:08 -> 08:48:16, same bracketing

No 40-minute gulf.

Unfortunately, the next one I checked wasn't an inbound perf job:

Killed 21.1: http://jenkins1.qa.scl3.mozilla.com/job/flame-kk-319.b2g-inbound.ui.functional.smoke/2090/console

Good at 3:45:06 (testing 3gp video)
Bad at 3:46:16

With the crash theory in mind, I investigated whether crash reports were left on the device. No such luck. This was curious, though, on 21.1:

root@flame:/data/b2g/mozilla/Crash Reports # ls -l
-rw------- root     root           10 2015-01-15 03:45 InstallTime20150115003933
drwx------ root     root              2015-01-15 03:45 events

The contents of InstallTime were an epoch timestamp for 03:45:10. That's when the crash happened.

So, I went back to 16.1:

root@flame:/data/b2g/mozilla/Crash Reports # ls -l
-rw------- root     root           10 2015-01-15 15:09 InstallTime20150115054735
drwx------ root     root              2015-01-15 15:09 events

Contents: 1421334571 == Thu Jan 15 07:09:31 PST 2015

It is a little curious that the InstallTime contents are a clean 12 hours off from the file timestamp, but either way that doesn't correspond with the 08:42:08 -> 08:48:16 window the log shows.

Still, I think the next step might be to see what makes InstallTime* update itself, since at least on 21.1 that's one heck of a coincidence.
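For anyone repeating this check, here's a rough sketch of pulling the InstallTime* contents over adb and converting the epoch value to local time. The serial below is a placeholder; the path matches the listings above:

import datetime
import subprocess

SERIAL = "0123456789ABCDEF"  # placeholder; use the node's ANDROID_SERIAL
CRASH_DIR = "/data/b2g/mozilla/Crash Reports"

def adb_shell(cmd):
    # adb passes the string to the device shell, so quote the path (it has a space)
    return subprocess.check_output(["adb", "-s", SERIAL, "shell", cmd], text=True)

for name in adb_shell('ls "%s"' % CRASH_DIR).split():
    if name.startswith("InstallTime"):
        epoch = int(adb_shell('cat "%s/%s"' % (CRASH_DIR, name)).strip())
        print(name, "->", datetime.datetime.fromtimestamp(epoch))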
Just realized I misread the file timestamp on 16.1, which is 3:09PM. So that's 8 hours off, which makes more sense for a phone set to GMT.
Killed 23.1: http://jenkins1.qa.scl3.mozilla.com/job/flame-kk.ui.adhoc/593/console

Good at 10:58:59 (possibly 10:57:42, depends on whether locator failed due to crash), bad at 11:00:17, after testing a11y callscreen visibility.

Killed 24.2: http://jenkins1.qa.scl3.mozilla.com/job/flame-kk-319.b2g-inbound.ui.functional.sanity/6152/console

Good at 10:01:43, bad at 10:02:57, after testing sms send or bluetooth settings (probably former).

Those are both radio tests, so there's that.

That said, I just watched 24.2's InstallTime* timestamp change during the test, during an otherwise clean run. I think it's the kill/restart of b2g that does it, so no luck there. 

b2g-16's non-corresponding time must be because the perf tests don't do a restart as part of setup, so that really was the install time.
(In reply to No-Jun Park [:njpark] from comment #12)

> In my case the crash happens (intermittently) when python marionette
> restarts the device.

That aligns with the InstallTime* timestamps I'm finding, when they haven't been clobbered by a later test run like 24.2's was.
With v188-1, the crash happens as well, although the recovery time seems to be shorter in v188-1 (3-4 minutes or so).
When rebooted, it asked me to go through the FTE again.

Crash log in v188-1:
https://crash-stats.mozilla.com/report/index/f5ea907e-4277-4dcf-b6c5-a70f02150116
Sometimes it does not recover from crash, and marionette gives out this message:

Most recent errors/exceptions are:
01-16 01:44:36.673 E/ACDB-LOADER( 9029): Error: ACDB ANC returned = -101-16 01:44:36.673 E/ACDB-LOADER( 9029): Error: ACDB ANC returned = -101-16 01:44:36.673 E/ACDB-LOADER( 9029): Error: ACDB ANC returned = -101-16 01:44:36.673 E/ACDB-LOADER( 9029): Error: ACDB ANC returned = -101-16 01:44:36.673 E/ACDB-LOADER( 9029): Error: ACDB ANC returned = -1
The B2G process has restarted after crashing during  the tests so Marionette can't respond due to either a Gecko, Gaia or Marionette error. Above, the 5 most recent errors are listed. Check logcat for all errors if these errors are not the cause of the failure.
(In reply to No-Jun Park [:njpark] from comment #24)
> Sometimes it does not recover from crash, and marionette gives out this
> message:
> 
> Most recent errors/exceptions are:
> 01-16 01:44:36.673 E/ACDB-LOADER( 9029): Error: ACDB ANC returned = -101-16
> 01:44:36.673 E/ACDB-LOADER( 9029): Error: ACDB ANC returned = -101-16
> 01:44:36.673 E/ACDB-LOADER( 9029): Error: ACDB ANC returned = -101-16
> 01:44:36.673 E/ACDB-LOADER( 9029): Error: ACDB ANC returned = -101-16
> 01:44:36.673 E/ACDB-LOADER( 9029): Error: ACDB ANC returned = -1
> The B2G process has restarted after crashing during  the tests so Marionette
> can't respond due to either a Gecko, Gaia or Marionette error. Above, the 5
> most recent errors are listed. Check logcat for all errors if these errors
> are not the cause of the failure.

What I'd also really like to know is whether the phone powers off or stops charging if you leave it at the crash. If you can make one happen before you go to bed or something, that'd be really useful.
Just checked logcat for 23.1. It cuts off cold at 11:00:09, with no particular errors showing above it.

The system was not in the middle of a restart, afaict--that happened somewhat prior to 10:59, but the homescreen had been found by Marionette and it looks like it was in the middle of a11y test code.

Last few entries:

01-15 11:00:09.295  7773  7773 W Communications:     at reportMissingEntity (app://communications.gaiamobile.org/shared/js/l10n.js:1513:8)
01-15 11:00:09.425  7289  7289 I Gecko   : 1421348409439	Marionette	DEBUG	Got request: findElement, data: {"to":"0","sessionId":"{a1eb7cbf-c070-4705-b3ae-61d2effc8d8c}","name":"findElement","parameters":{"using":"css selector","value":"div.keypad-key[data-value=\"9\"]"}}, id: null
01-15 11:00:09.435  7289  7289 I Gecko   : 1421348409446	Marionette	INFO	sendToClient: {"from":"0","sessionId":"{3f28d6f7-52cb-440e-8a4c-a6cb6d1e1bca}","value":{"ELEMENT":"{ac878fb4-74b0-442d-a523-a63cc52cfee1}"}}, {4b3bf113-13cf-44bc-a967-6f517f9e72af}, {4b3bf113-13cf-44bc-a967-6f517f9e72af}
01-15 11:00:09.445  7289  7289 I Gecko   : 1421348409456	Marionette	DEBUG	Got request: executeAsyncScript, data: {"to":"0","sessionId":"{a1eb7cbf-c070-4705-b3ae-61d2effc8d8c}","name":"executeAsyncScript","parameters":{"scriptTimeout":null,"specialPowers":true,"script":"return Accessibility.click.apply(Accessibility, arguments)","newSandbox":true,"args":[{"ELEMENT":"{ac878fb4-74b0-442d-a523-a63cc52cfee1}"}],"filename":"gaia_test.py","line":491}}, id: null
01-15 11:00:09.535  7289  7289 I Gecko   : 1421348409545	Marionette	INFO	sendToClient: {"from":"0","sessionId":"{3f28d6f7-52cb-440e-8a4c-a6cb6d1e1bca}","value":null}, {af36e5ee-5d3a-40bc-a10d-5b6c11403574}, {af36e5ee-5d3a-40bc-a10d-5b6c11403574}
01-15 11:00:09.795  7289  7289 I Gecko   : 1421348409801	Marionette	DEBUG	Got request: findElement, data: {"to":"0","sessionId":"{a1eb7cbf-c070-4705-b3ae-61d2effc8d8c}","name":"findElement","parameters":{"using":"css selector","value":"div.keypad-key[data-value=\"7\"]"}}, id: null

I'll look in other ones tomorrow--I'm out of time for today.
(In reply to Geo Mealer [:geo] from comment #19)
> Was good at 9:22:44 after test data push
> Bad as of 10:03:57 after making node modules
> 
> Why the 40-minute gulf between making node modules and the next line?

The perf builds don't flush anything to console during this time. A successful build could be quiet for up to two hours.
http://jenkins1.qa.scl3.mozilla.com/computer/b2g-07.1/ is down again
Flags: needinfo?(florin.strugariu)
I had to put some more devices down today.


Also investigated No-Jun's failure, and it looks like a different critical issue; see Bug 1122503.

Tested locally with multiple builds and with both flashing scripts (Jenkins and Tw) and could not reproduce the failure.
I had to put down some nodes this morning :(

Take a look at http://jenkins1.qa.scl3.mozilla.com/computer/ to see the disabled nodes.
Depends on: 1122119
I disabled 2 more nodes:
20.1 - adb devices returns 1 device
24.1 - adb devices returns 0 devices

Can someone try and restart them? We are running out of nodes to run the tests on.
Flags: needinfo?(pmathur)
Flags: needinfo?(gmealer)
Restarted:

4
13
15
17-1
18-1
18-2
19-1
20-1
22-2
23-1
24-1
24-2
25-2
26-2
28-2

22-2 came up in recovery mode, so I need to manually reflash it.
28-1 was taken offline, but looked viable to me. I'm checking it further.

I brought all nodes but those two (and 25-1, which is a perf node Dave disabled) back online in Jenkins. By the time you read this, I'll probably have fixed those and brought them online too. I'll post an update only if I don't.
Flags: needinfo?(gmealer)
I brought 22-2 back up. 28-1 was the one with the Android serial number issue, which I have not solved yet.

I also took 26-1 down for post-mortem on bug 1122119, and am keeping it for a day so people can respond without us having blown away the logs.
Flags: needinfo?(pmathur)
23.1 crashed in http://jenkins1.qa.scl3.mozilla.com/view/UI/job/flame-kk-319.mozilla-central.ui.functional.smoke/275/console

I took the node offline, but it needs to be restarted.
Flags: needinfo?(gmealer)
Have continued to bring nodes up on a daily basis. I'm going to quit posting the list, as I think it's not useful for isolation like I'd hoped. I'm also going to start bringing up nodes twice a day (when I come in and before I go home).
Flags: needinfo?(gmealer)
Upstream bug is fixed via backout at this point. A bunch of phones were still down this morning, but that's not surprising given the backout didn't happen until today. Once I see that no phones are crashing, I'll resolve-fix this.
I'm closing this out. We're not white-screening anymore, which is what this bug really applied to.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Marking this bug as a duplicate of bug 1122119 as per comment 38.
Resolution: FIXED → DUPLICATE