Closed
Bug 895176
Opened 12 years ago
Closed 9 years ago
Frequent Android 4.0 retries: "Remote Device Error: Unhandled exception in cleanupDevice"
Categories
(Release Engineering :: General, defect)
Tracking
(Not tracked)
RESOLVED
WONTFIX
People
(Reporter: gbrown, Assigned: gbrown)
Details
There are frequent Android 4.0 retries on various tests, likely with more than one cause. A significant set of these retries occurs in the cleanupDevice step:
https://tbpl.mozilla.org/php/getParsedLog.php?id=25393834&tree=Mozilla-Central
13:22:28 INFO - 07/17/2013 13:22:28: INFO: Calling PDU powercycle for panda-0761, panda-relay-068.p8.releng.scl1.mozilla.com:2:5
13:22:29 INFO - 07/17/2013 13:22:29: INFO: Waiting for device to come back...
13:23:59 INFO - 07/17/2013 13:23:59: INFO: Try 1
13:25:11 INFO - 07/17/2013 13:25:11: INFO: devroot None
13:26:41 INFO - 07/17/2013 13:26:41: INFO: Try 2
13:27:46 INFO - 07/17/2013 13:27:46: INFO: devroot None
13:29:16 INFO - 07/17/2013 13:29:16: INFO: Try 3
13:30:21 INFO - 07/17/2013 13:30:21: INFO: devroot None
13:31:51 INFO - 07/17/2013 13:31:51: INFO: /builds/panda-0761/error.flg
13:32:21 INFO - 07/17/2013 13:32:21: INFO: verifyDevice: failing to cleanup device
13:32:21 INFO - reconnecting socket
13:32:21 INFO - removing file: /mnt/sdcard/writetest
13:32:21 INFO - reconnecting socket
13:32:21 INFO - Could not connect; sleeping for 5 seconds.
13:32:21 INFO - reconnecting socket
13:32:21 INFO - Could not connect; sleeping for 10 seconds.
13:32:21 INFO - reconnecting socket
13:32:21 INFO - Could not connect; sleeping for 15 seconds.
13:32:21 INFO - reconnecting socket
13:32:21 INFO - Could not connect; sleeping for 20 seconds.
13:32:21 INFO - reconnecting socket
13:32:21 INFO - reconnecting socket
13:32:21 INFO - Could not connect; sleeping for 5 seconds.
13:32:21 INFO - reconnecting socket
13:32:21 INFO - Could not connect; sleeping for 10 seconds.
13:32:21 INFO - reconnecting socket
13:32:21 INFO - Could not connect; sleeping for 15 seconds.
13:32:21 INFO - reconnecting socket
13:32:21 INFO - Could not connect; sleeping for 20 seconds.
13:32:21 INFO - reconnecting socket
13:32:21 INFO - reconnecting socket
13:32:21 INFO - Could not connect; sleeping for 5 seconds.
13:32:21 INFO - reconnecting socket
13:32:21 INFO - Could not connect; sleeping for 10 seconds.
13:32:21 INFO - reconnecting socket
13:32:21 INFO - Could not connect; sleeping for 15 seconds.
13:32:21 INFO - reconnecting socket
13:32:21 INFO - Could not connect; sleeping for 20 seconds.
13:32:21 INFO - reconnecting socket
13:32:21 INFO - Remote Device Error: Unhandled exception in cleanupDevice
13:32:21 ERROR - Return code: 1
13:32:21 CRITICAL - Preparing to abort run due to failed verify check.
13:32:21 INFO - Request 'http://mobile-imaging-008.p8.releng.scl1.mozilla.com/api/request/246533/' deleted on cleanup
13:32:21 FATAL - Dieing due to failing verification
13:32:21 FATAL - Exiting -1
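The repeated "reconnecting socket" / "Could not connect" lines above suggest an escalating-backoff retry loop around the SUT agent socket: each round sleeps 5, 10, 15, then 20 seconds between attempts, and the whole round repeats three times before the unhandled-exception failure. A minimal sketch of that pattern (the function and parameter names here are illustrative, not the actual devicemanagerSUT code):

import socket
import time

def reconnect_with_backoff(host, port, rounds=3, tries_per_round=4, step=5):
    # Retry a SUT agent connection with escalating sleeps (5, 10, 15, 20s),
    # repeating the whole cycle a few times, as the log above suggests.
    for _ in range(rounds):
        for attempt in range(1, tries_per_round + 1):
            print("reconnecting socket")
            try:
                return socket.create_connection((host, port), timeout=30)
            except socket.error:
                delay = step * attempt
                print("Could not connect; sleeping for %d seconds." % delay)
                time.sleep(delay)
    raise RuntimeError("Remote Device Error: Unhandled exception in cleanupDevice")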
Comment 1•12 years ago
So... splicing the relevant logs together, this is what I see:
MOZPOOL(req 246533) -
2013-07-17T13:19:20 statemachine entering state finding_device
2013-07-17T13:19:20 statemachine Finding device.
2013-07-17T13:19:20 statemachine Assigning device panda-0761.
2013-07-17T13:19:20 statemachine Request succeeded.
2013-07-17T13:19:20 statemachine entering state contacting_lifeguard
2013-07-17T13:19:20 statemachine entering state pending
2013-07-17T13:21:30 statemachine entering state ready
2013-07-17T13:32:21 statemachine Request closed.
2013-07-17T13:32:21 statemachine entering state closed
LIFEGUARD -- From device log
2013-07-17T12:36:27 statemachine entering state ready
2013-07-17T13:06:39 sut connecting to SUT agent
2013-07-17T13:16:40 sut connecting to SUT agent
2013-07-17T13:19:20 statemachine entering state pc_power_cycling
2013-07-17T13:19:20 statemachine initiating power cycle
2013-07-17T13:19:22 statemachine entering state pc_rebooting
2013-07-17T13:21:30 statemachine entering state pc_pinging
2013-07-17T13:21:30 statemachine entering state operation_complete
2013-07-17T13:21:30 statemachine sending imaging result 'complete' to Mozpool
2013-07-17T13:21:30 statemachine entering state ready
2013-07-17T13:34:38 statemachine entering state pc_power_cycling
2013-07-17T13:34:38 statemachine initiating power cycle
2013-07-17T13:34:40 statemachine entering state pc_rebooting
2013-07-17T13:36:41 statemachine entering state pc_pinging
2013-07-17T13:36:41 statemachine entering state operation_complete
2013-07-17T13:36:42 statemachine sending imaging result 'complete' to Mozpool
2013-07-17T13:36:42 statemachine entering state ready
From Job Log:
...
13:19:20 INFO - Got request, url=http://mobile-imaging-008.p8.releng.scl1.mozilla.com/api/request/246533/
13:19:20 INFO - Waiting for request 'ready' stage. Current state: 'contacting_lifeguard'
13:20:20 INFO - Waiting for request 'ready' stage. Current state: 'pending'
13:21:20 INFO - Waiting for request 'ready' stage. Current state: 'pending'
13:22:20 INFO - Running command: ['/tools/buildbot/bin/python', '/builds/sut_tools/verify.py']
...
13:22:25 INFO - 07/17/2013 13:22:25: INFO: INFO: Got expected SUTAgent version '1.17'
13:22:26 INFO - 07/17/2013 13:22:26: INFO: Uninstalling org.mozilla.fennec_aurora...
13:22:27 INFO - 07/17/2013 13:22:27: INFO: Uninstalling org.mozilla.roboexample.test...
13:22:28 INFO - 07/17/2013 13:22:28: INFO: Calling PDU powercycle for panda-0761, panda-relay-068.p8.releng.scl1.mozilla.com:2:5
13:22:29 INFO - 07/17/2013 13:22:29: INFO: Waiting for device to come back...
...
13:29:16 INFO - 07/17/2013 13:29:16: INFO: Try 3
13:30:21 INFO - 07/17/2013 13:30:21: INFO: devroot None
13:31:51 INFO - 07/17/2013 13:31:51: INFO: /builds/panda-0761/error.flg
13:32:21 INFO - 07/17/2013 13:32:21: INFO: verifyDevice: failing to cleanup device
...
13:32:21 INFO - Remote Device Error: Unhandled exception in cleanupDevice
13:32:21 ERROR - Return code: 1
13:32:21 CRITICAL - Preparing to abort run due to failed verify check.
13:32:21 INFO - Request 'http://mobile-imaging-008.p8.releng.scl1.mozilla.com/api/request/246533/' deleted on cleanup
13:32:21 FATAL - Dieing due to failing verification
13:32:21 FATAL - Exiting -1
=============
Given all that, I can confirm it's not a mozpool issue (as I first thought it was).
Possible thoughts: "something" is going wrong when trying to uninstall Robocop, and we may need to force a reboot after each uninstall again (I recall we explicitly removed that). I'm open to ideas/investigation, but I can't see an obvious issue (other than the poor error message for this state in cleanupDevice).
Updated•12 years ago
Product: mozilla.org → Release Engineering
Comment 2•12 years ago
FYI, I see from the log:
13:22:28 INFO - 07/17/2013 13:22:28: INFO: Calling PDU powercycle for panda-0761, panda-relay-068.p8.releng.scl1.mozilla.com:2:5
But the script is also making API calls to mozpool. These two don't mix. If mozpool is going to be used to maintain state on a board, it should also be used to request power cycles, via the lifeguard API.
Comment 3•12 years ago
(In reply to Jake Watkins [:dividehex] from comment #2)
> FYI, I see from the log:
> 13:22:28 INFO - 07/17/2013 13:22:28: INFO: Calling PDU powercycle for
> panda-0761, panda-relay-068.p8.releng.scl1.mozilla.com:2:5
>
> But the script is also making API calls to mozpool. These two don't mix.
> If mozpool is going to be used to maintain state on a board, it should
> also be used to request power cycles, via the lifeguard API.
I've been told that while it is bad practice, it won't hurt anything really.
The state of affairs as I understand it:
- mozharness requests device from mozpool
- mozpool sees device is ready, hands device to foopy/mozharness
- mozharness runs verify.py, making sure the device is "clean" and in a usable state.
-- this cleans remnants of old fennec installs
-- cleans remnants of old test runs
-- verifies SUTAgent is the proper (automation expected) version
--- updates if required
-- reboots after uninstall of fennec, and after update of SUTAgent (if done)
--- Reboots happen via the relay board, since this is shared code that has no knowledge of which devices are in mozpool vs. not in mozpool
- ...
- mozharness releases mozpool request when job is done.
I've been told that mozpool, while a request is out, won't care that the device goes down during the loan duration; this code assumes that is true. (A rough sketch of the sequence appears below.)
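A sketch of that verify/cleanup sequence, with hypothetical stand-in helpers; the real logic lives in mozharness and sut_tools/verify.py, and nothing here is the actual code:

# Hypothetical stand-ins for the real sut_tools helpers.
def uninstall_app(device, app): print("Uninstalling %s..." % app)
def get_sut_version(device): return "1.17"
def update_sut_agent(device): print("Updating SUTAgent...")
def pdu_powercycle(relay_host, relay_port):
    print("Calling PDU powercycle via %s:%s" % (relay_host, relay_port))
def wait_for_device(device, tries=3): return True

def verify_device(device, relay_host, relay_port):
    # Clean remnants of old fennec installs and test runs.
    for app in ("org.mozilla.fennec_aurora", "org.mozilla.roboexample.test"):
        uninstall_app(device, app)
    # Verify the SUT agent version; update (and reboot) if required.
    if get_sut_version(device) != "1.17":
        update_sut_agent(device)
    # Reboot via the relay board -- shared code with no knowledge of
    # which devices are mozpool-managed; the practice questioned below.
    pdu_powercycle(relay_host, relay_port)
    # Wait for the device to come back; fail the run if it never does.
    if not wait_for_device(device, tries=3):
        raise RuntimeError("verifyDevice: failing to cleanup device")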
It's not easy, without severe code duplication and hacks, to have our panda reboot code contact mozpool for mozpool-managed devices, until/unless one of the following is true:
- We have a command we can call into mozpool to allow it to relay-reboot locked-out devices as well as on-request devices.
- We have talos pandas being managed by mozpool, rather than locked out.
Given the above, I'd be happy to hear where any of my assumptions are wrong, and/or why rebooting mozpool-managed devices outside of mozpool itself is a bad idea (acknowledging that they are within a mozpool request at the time).
I also acknowledge that not using mozpool for the reboots when we use it for the request itself is sub-optimal, but it is the best sanity we can achieve until we manage all devices with mozpool, imo.
Comment 4•12 years ago
What we discussed was that foopies shouldn't be talking to the pandas at all, and my understanding was that you were dividing the chassis into locked_out pandas (managed by foopies) and mozpool pandas (with relays untouched by foopies). To the question of the harm in double-teaming the relays, I indicated that both foopies and mozpool do retries, so it probably won't cause too many failures. That doesn't mean it won't cause failures, and some of those failures are cumulative (failed pandas).
The contended resource is the relays, not the pandas. When you have a panda out, you're free to do with it as you please, but you do not have free rein over its relay. In fact, you should never access a relay in a chassis with at least one mozpool-managed panda.
You can make a BMM API call to power-cycle a board. If you need to reboot a mozpool-managed device, that's the right approach. To do otherwise is not "sane" - it's wrong, and it will cause failures.
https://github.com/mozilla/mozpool/blob/master/API.txt#L160
/api/device/{id}/power-cycle/
* POST to initiate a power-cycle of this device. The POST body is a JSON object,
with optional keys `pxe_config` and `boot_config`. If `pxe_config` is
specified, then the device is configured to boot with that PXE config;
otherwise, the device boots from its internal storage. If `boot_config` is
supplied (as a string), it is stored for later use by the device via
`/api/device/{id}/bootconfig/`.
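A minimal sketch of that call (Python 2, matching the era of this automation; the imaging-server hostname is taken from the logs above, and the payload keys from API.txt):

import json
import urllib2

MOZPOOL = "http://mobile-imaging-008.p8.releng.scl1.mozilla.com"

def power_cycle(device, boot_config=None):
    # POST to the documented endpoint; with no pxe_config in the body,
    # the device boots from its internal storage.
    url = "%s/api/device/%s/power-cycle/" % (MOZPOOL, device)
    body = {}
    if boot_config is not None:
        body["boot_config"] = boot_config  # stored for later use, per API.txt
    req = urllib2.Request(url, json.dumps(body),
                          {"Content-Type": "application/json"})
    return urllib2.urlopen(req).read()

# e.g. power_cycle("panda-0761")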
This is a bit OT for the original purpose of this bug - but it sounds like we've uncovered another part of the panda automation that's doing the wrong thing, so this probably deserves its own bug.
Comment 5•9 years ago
I'm pretty sure these concerns only related to pandas.
Assignee: nobody → gbrown
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → WONTFIX
Updated•7 years ago
Component: General Automation → General