Closed Bug 895176 Opened 12 years ago Closed 9 years ago

Frequent Android 4.0 retries: "Remote Device Error: Unhandled exception in cleanupDevice"

Categories

(Release Engineering :: General, defect)

Hardware: x86
OS: Android
Type: defect
Priority: Not set
Severity: normal
Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: gbrown, Assigned: gbrown)

Details

There are frequent Android 4.0 retries, on various tests, and likely with more than one cause. A significant set of such retries occurs in the cleanupDevice step: https://tbpl.mozilla.org/php/getParsedLog.php?id=25393834&tree=Mozilla-Central

13:22:28 INFO - 07/17/2013 13:22:28: INFO: Calling PDU powercycle for panda-0761, panda-relay-068.p8.releng.scl1.mozilla.com:2:5
13:22:29 INFO - 07/17/2013 13:22:29: INFO: Waiting for device to come back...
13:23:59 INFO - 07/17/2013 13:23:59: INFO: Try 1
13:25:11 INFO - 07/17/2013 13:25:11: INFO: devroot None
13:26:41 INFO - 07/17/2013 13:26:41: INFO: Try 2
13:27:46 INFO - 07/17/2013 13:27:46: INFO: devroot None
13:29:16 INFO - 07/17/2013 13:29:16: INFO: Try 3
13:30:21 INFO - 07/17/2013 13:30:21: INFO: devroot None
13:31:51 INFO - 07/17/2013 13:31:51: INFO: /builds/panda-0761/error.flg
13:32:21 INFO - 07/17/2013 13:32:21: INFO: verifyDevice: failing to cleanup device
13:32:21 INFO - reconnecting socket
13:32:21 INFO - removing file: /mnt/sdcard/writetest
13:32:21 INFO - reconnecting socket
13:32:21 INFO - Could not connect; sleeping for 5 seconds.
13:32:21 INFO - reconnecting socket
13:32:21 INFO - Could not connect; sleeping for 10 seconds.
13:32:21 INFO - reconnecting socket
13:32:21 INFO - Could not connect; sleeping for 15 seconds.
13:32:21 INFO - reconnecting socket
13:32:21 INFO - Could not connect; sleeping for 20 seconds.
13:32:21 INFO - reconnecting socket
13:32:21 INFO - reconnecting socket
13:32:21 INFO - Could not connect; sleeping for 5 seconds.
13:32:21 INFO - reconnecting socket
13:32:21 INFO - Could not connect; sleeping for 10 seconds.
13:32:21 INFO - reconnecting socket
13:32:21 INFO - Could not connect; sleeping for 15 seconds.
13:32:21 INFO - reconnecting socket
13:32:21 INFO - Could not connect; sleeping for 20 seconds.
13:32:21 INFO - reconnecting socket
13:32:21 INFO - reconnecting socket
13:32:21 INFO - Could not connect; sleeping for 5 seconds.
13:32:21 INFO - reconnecting socket
13:32:21 INFO - Could not connect; sleeping for 10 seconds.
13:32:21 INFO - reconnecting socket
13:32:21 INFO - Could not connect; sleeping for 15 seconds.
13:32:21 INFO - reconnecting socket
13:32:21 INFO - Could not connect; sleeping for 20 seconds.
13:32:21 INFO - reconnecting socket
13:32:21 INFO - Remote Device Error: Unhandled exception in cleanupDevice
13:32:21 ERROR - Return code: 1
13:32:21 CRITICAL - Preparing to abort run due to failed verify check.
13:32:21 INFO - Request 'http://mobile-imaging-008.p8.releng.scl1.mozilla.com/api/request/246533/' deleted on cleanup
13:32:21 FATAL - Dieing due to failing verification
13:32:21 FATAL - Exiting -1
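The trailing "reconnecting socket ... Could not connect; sleeping for N seconds" lines are the SUT connection retry loop exhausting its back-off before the unhandled exception is raised. For illustration only, a minimal sketch of that kind of back-off loop (hypothetical names and port; not the actual devicemanagerSUT code) could look like:

import socket
import time

def connect_with_backoff(host, port=20701, delays=(5, 10, 15, 20)):
    # Hypothetical sketch of the retry pattern visible in the log above;
    # the port number is an assumption and the real devicemanagerSUT code
    # differs in detail.
    for delay in delays:
        try:
            return socket.create_connection((host, port), timeout=60)
        except socket.error:
            print("Could not connect; sleeping for %d seconds." % delay)
            time.sleep(delay)
    # Last attempt: let the exception propagate so the caller can flag the
    # device (e.g. write error.flg) and abort the job instead of hanging.
    return socket.create_connection((host, port), timeout=60)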
Soooo... if I mentally look for and splice logs together, this is what I see:

MOZPOOL (req 246533):
2013-07-17T13:19:20 statemachine entering state finding_device
2013-07-17T13:19:20 statemachine Finding device.
2013-07-17T13:19:20 statemachine Assigning device panda-0761.
2013-07-17T13:19:20 statemachine Request succeeded.
2013-07-17T13:19:20 statemachine entering state contacting_lifeguard
2013-07-17T13:19:20 statemachine entering state pending
2013-07-17T13:21:30 statemachine entering state ready
2013-07-17T13:32:21 statemachine Request closed.
2013-07-17T13:32:21 statemachine entering state closed

LIFEGUARD -- from device log:
2013-07-17T12:36:27 statemachine entering state ready
2013-07-17T13:06:39 sut connecting to SUT agent
2013-07-17T13:16:40 sut connecting to SUT agent
2013-07-17T13:19:20 statemachine entering state pc_power_cycling
2013-07-17T13:19:20 statemachine initiating power cycle
2013-07-17T13:19:22 statemachine entering state pc_rebooting
2013-07-17T13:21:30 statemachine entering state pc_pinging
2013-07-17T13:21:30 statemachine entering state operation_complete
2013-07-17T13:21:30 statemachine sending imaging result 'complete' to Mozpool
2013-07-17T13:21:30 statemachine entering state ready
2013-07-17T13:34:38 statemachine entering state pc_power_cycling
2013-07-17T13:34:38 statemachine initiating power cycle
2013-07-17T13:34:40 statemachine entering state pc_rebooting
2013-07-17T13:36:41 statemachine entering state pc_pinging
2013-07-17T13:36:41 statemachine entering state operation_complete
2013-07-17T13:36:42 statemachine sending imaging result 'complete' to Mozpool
2013-07-17T13:36:42 statemachine entering state ready

From job log:
...
13:19:20 INFO - Got request, url=http://mobile-imaging-008.p8.releng.scl1.mozilla.com/api/request/246533/
13:19:20 INFO - Waiting for request 'ready' stage. Current state: 'contacting_lifeguard'
13:20:20 INFO - Waiting for request 'ready' stage. Current state: 'pending'
13:21:20 INFO - Waiting for request 'ready' stage. Current state: 'pending'
13:22:20 INFO - Running command: ['/tools/buildbot/bin/python', '/builds/sut_tools/verify.py']
...
13:22:25 INFO - 07/17/2013 13:22:25: INFO: INFO: Got expected SUTAgent version '1.17'
13:22:26 INFO - 07/17/2013 13:22:26: INFO: Uninstalling org.mozilla.fennec_aurora...
13:22:27 INFO - 07/17/2013 13:22:27: INFO: Uninstalling org.mozilla.roboexample.test...
13:22:28 INFO - 07/17/2013 13:22:28: INFO: Calling PDU powercycle for panda-0761, panda-relay-068.p8.releng.scl1.mozilla.com:2:5
13:22:29 INFO - 07/17/2013 13:22:29: INFO: Waiting for device to come back...
...
13:29:16 INFO - 07/17/2013 13:29:16: INFO: Try 3
13:30:21 INFO - 07/17/2013 13:30:21: INFO: devroot None
13:31:51 INFO - 07/17/2013 13:31:51: INFO: /builds/panda-0761/error.flg
13:32:21 INFO - 07/17/2013 13:32:21: INFO: verifyDevice: failing to cleanup device
...
13:32:21 INFO - Remote Device Error: Unhandled exception in cleanupDevice
13:32:21 ERROR - Return code: 1
13:32:21 CRITICAL - Preparing to abort run due to failed verify check.
13:32:21 INFO - Request 'http://mobile-imaging-008.p8.releng.scl1.mozilla.com/api/request/246533/' deleted on cleanup
13:32:21 FATAL - Dieing due to failing verification
13:32:21 FATAL - Exiting -1

=============

Given all that, I can confirm it's not a mozpool issue (like I first thought it was).

Possible thoughts: "something" is going wrong when trying to uninstall robocop. We need to force a reboot after each uninstall again (I recall we explicitly removed that).
I'm open to ideas/investigation, but I can't see an obvious issue (other than the poor error message for this state in cleanupDevice).
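For context, the job log above shows mozharness polling the mozpool request until it reaches 'ready' before running verify.py. A rough sketch of that polling loop, assuming the request URL format from the log and a 'status/' endpoint returning a JSON 'state' field (both the endpoint suffix and the field name are assumptions for illustration):

import json
import time
import urllib2  # Python 2, matching the era of this tooling

def wait_for_ready(request_url, poll_interval=60, max_polls=10):
    # request_url looks like
    # http://mobile-imaging-008.p8.releng.scl1.mozilla.com/api/request/246533/
    # The 'status/' suffix and 'state' key are assumptions for illustration.
    for _ in range(max_polls):
        status = json.load(urllib2.urlopen(request_url + 'status/'))
        state = status.get('state')
        if state == 'ready':
            return True
        print("Waiting for request 'ready' stage. Current state: '%s'" % state)
        time.sleep(poll_interval)
    return False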
Product: mozilla.org → Release Engineering
FYI, I see from the log:

13:22:28 INFO - 07/17/2013 13:22:28: INFO: Calling PDU powercycle for panda-0761, panda-relay-068.p8.releng.scl1.mozilla.com:2:5

But the script is also making API calls to mozpool. These two don't mix. If mozpool is going to be used to maintain state on a board, it should also be used to request power cycles via the lifeguard API.
(In reply to Jake Watkins [:dividehex] from comment #2)
> FYI, I see from the log:
> 13:22:28 INFO - 07/17/2013 13:22:28: INFO: Calling PDU powercycle for
> panda-0761, panda-relay-068.p8.releng.scl1.mozilla.com:2:5
>
> But the script is also making API calls to mozpool. These two don't mix.
> If mozpool is going to be used to maintain state on a board, it should
> also be used to request power cycles via the lifeguard API.

I've been told that while it is bad practice, it won't really hurt anything.

The state of affairs as I understand it:
- mozharness requests a device from mozpool
- mozpool sees the device is ready and hands it to foopy/mozharness
- mozharness runs verify.py, making sure the device is "clean" and in a usable state:
-- cleans remnants of old fennec installs
-- cleans remnants of old test runs
-- verifies SUTAgent is the proper (automation expected) version
--- updates if required
-- reboots after uninstall of fennec, and after update of SUTAgent (if done)
--- reboots happen via the relay board, since it is shared code that has no knowledge of which devices are in mozpool and which are not
- ...
- mozharness releases the mozpool request when the job is done.

I've been told that while a request is out, mozpool won't care that a device goes down during the loan duration. This code assumes that is true.

It's not really easy, without severe code duplication and hacks, to have our panda reboot code contact mozpool for mozpool-managed devices, until/unless one of the following is true:
- We have a command we can call into mozpool to allow it to relay-reboot locked-out devices as well as on-request devices.
- We have talos pandas being managed by mozpool, rather than locked out.

Given the above, I'd be happy to hear where any of my assumptions are wrong, and/or why the current state of mozpool-managed devices being rebooted outside of mozpool itself is a bad idea (when you acknowledge that they are within a mozpool request at the time). I also acknowledge that not using mozpool for the reboots, when we use it for the request itself, is sub-optimal, but it is the best sanity we can achieve until we manage all devices with mozpool, imo.
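For reference, the failure in comment 0 corresponds to the "reboot after uninstall" step of the sequence above: the PDU power cycle fires, the device never reports a devroot, and verify.py writes error.flg and fails. A rough sketch of that portion of the flow (the function and helper names here are hypothetical, not the real sut_tools code):

import os
import time

def cleanup_device(dm, device_name, pdu_powercycle, max_tries=3):
    # Hypothetical sketch of the cleanup/verify flow seen in the log; `dm` is
    # a device-manager-like object and `pdu_powercycle` a callable that
    # toggles the relay. Neither matches the real sut_tools API exactly.
    for package in ('org.mozilla.fennec_aurora', 'org.mozilla.roboexample.test'):
        dm.uninstall(package)

    # Reboot via the PDU/relay, then wait for SUTAgent to answer again.
    pdu_powercycle(device_name)
    print("Waiting for device to come back...")
    for attempt in range(1, max_tries + 1):
        print("Try %d" % attempt)
        devroot = dm.get_devroot()  # None until the agent is reachable
        print("devroot %s" % devroot)
        if devroot is not None:
            return True
        time.sleep(90)

    # Device never answered: flag it so the job retries on another device.
    error_flag = os.path.join('/builds', device_name, 'error.flg')
    with open(error_flag, 'w') as f:
        f.write('Unable to connect to device after cleanup\n')
    print("verifyDevice: failing to cleanup device")
    return False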
What we discussed was that foopies shouldn't be talking to the pandas at all, and my understanding was that you were dividing the chassis into locked_out pandas (managed by foopies) and mozpool pandas (with relays untouched by foopies).

To the question of the harm in double-teaming the relays: I indicated that both foopies and mozpool do retries, so it probably won't cause too many failures. That doesn't mean it won't cause failures, and some of those failures are cumulative (failed pandas).

The contended resource is the relays, not the panda. When you have a panda out, you're free to do with it as you please, but you do not have free rein over its relay. In fact, you should never access a relay in a chassis with at least one mozpool-managed panda.

You can make a bmm API call to power-cycle a board. If you need to reboot a mozpool-managed device, that's the right approach. To do otherwise is not "sane" - it's wrong, and it will cause failures.

https://github.com/mozilla/mozpool/blob/master/API.txt#L160

/api/device/{id}/power-cycle/
* POST to initiate a power-cycle of this device. The POST body is a JSON object, with optional keys `pxe_config` and `boot_config`. If `pxe_config` is specified, then the device is configured to boot with that PXE config; otherwise, the device boots from its internal storage. If `boot_config` is supplied (as a string), it is stored for later use by the device via `/api/device/{id}/bootconfig/`.

This is a bit OT for the original purpose of this bug - but it sounds like we've uncovered another part of the panda automation that's doing the wrong thing, so this probably deserves its own bug.
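For illustration, a minimal sketch of making that call from the foopy-side Python tooling, using the endpoint and JSON body described in the API excerpt above (the server host below is a placeholder and the helper name is made up):

import json
import urllib2  # Python 2, matching the era of this tooling

def power_cycle(bmm_host, device_id):
    # Endpoint and body follow the API.txt excerpt quoted above; we boot from
    # internal storage, so no pxe_config/boot_config keys are sent.
    url = 'http://%s/api/device/%s/power-cycle/' % (bmm_host, device_id)
    request = urllib2.Request(url, data=json.dumps({}),
                              headers={'Content-Type': 'application/json'})
    return urllib2.urlopen(request).read()

# e.g. power_cycle('mobile-imaging-008.p8.releng.scl1.mozilla.com', 'panda-0761')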
I'm pretty sure these concerns only relate to Pandas.
Assignee: nobody → gbrown
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → WONTFIX
Component: General Automation → General