Closed
Bug 878880
Opened 12 years ago
Closed 12 years ago
The mozpool UI shows failed_device_busy
Categories
(Testing Graveyard :: Mozpool, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: armenzg, Assigned: dustin)
References
Details
Attachments
(1 file, 1 obsolete file)
4.76 KB, patch | kmoir: review+
dustin mentioned that it might be a UI issue but this might be different:
207882 panda-0062 - foopy106.p10.releng.scl1.mozilla.com failed_device_busy 2013-06-03 14:50:51 mobile-imaging-010.p10.releng.scl1.mozilla.com
| Assignee | ||
Comment 1•12 years ago
|
||
What's the issue here? Did the request *not* fail because the device is busy?
Comment 2•12 years ago
|
||
I'm running a lot of tests in staging today
On the pandas I'm using, [61,69) and [75,82), mozpool reports device busy and the request fails about 95% of the time.
http://mobile-imaging-001.p1.releng.scl1.mozilla.com/ui/mozpool.html
I'm not sure what's happening here or why it's failing so often, so if you have suggestions they would be welcome :-)
| Assignee | ||
Comment 3•12 years ago
|
||
Request 208048:
> request.208048 INFO - [2013-06-03 13:46:16,693] entering state finding_device
> request.208048 INFO - [2013-06-03 13:46:16,711] Finding device.
> request.208048 INFO - [2013-06-03 13:46:16,785] Assigning device panda-0067.
> request.208048 INFO - [2013-06-03 13:46:16,813] Request succeeded.
> request.208048 INFO - [2013-06-03 13:46:16,824] entering state contacting_lifeguard
> 127.0.0.1:58947 - - [03/Jun/2013 13:46:16] "HTTP/1.1 POST /api/device/panda-0067/request/" - 200 OK
> device.panda-0067 INFO - [2013-06-03 13:46:16,923] entering state pc_power_cycling
> device.panda-0067 INFO - [2013-06-03 13:46:16,995] starting SUT reboot
> 127.0.0.1:58949 - - [03/Jun/2013 13:46:17] "HTTP/1.1 POST /api/device/panda-0067/event/please_power_cycle/" - 200 OK
> request.208048 INFO - [2013-06-03 13:46:17,020] entering state pending
> device.panda-0067 INFO - [2013-06-03 13:46:17,170] entering state pc_sut_rebooting
> device INFO - [2013-06-03 13:46:29,224] handling timeout on panda-0067
> device.panda-0067 INFO - [2013-06-03 13:46:29,237] entering state sut_verifying
> sut.cli ERROR - [2013-06-03 13:46:50,297] Exception initiating DeviceManager!: Remote Device Error: unable to connect to panda-0067.p10.releng.scl1.mozilla.com after 1 attempts
> device INFO - [2013-06-03 13:47:19,236] handling timeout on panda-0067
> sut.cli ERROR - [2013-06-03 13:47:40,305] Exception initiating DeviceManager!: Remote Device Error: unable to connect to panda-0067.p10.releng.scl1.mozilla.com after 1 attempts
> device INFO - [2013-06-03 13:48:09,297] handling timeout on panda-0067
> sut.cli ERROR - [2013-06-03 13:48:12,379] Exception initiating DeviceManager!: Remote Device Error: unable to connect to panda-0067.p10.releng.scl1.mozilla.com after 1 attempts
> device INFO - [2013-06-03 13:48:59,350] handling timeout on panda-0067
> sut.cli ERROR - [2013-06-03 13:49:02,429] Exception initiating DeviceManager!: Remote Device Error: unable to connect to panda-0067.p10.releng.scl1.mozilla.com after 1 attempts
> device INFO - [2013-06-03 13:49:49,707] handling timeout on panda-0067
> sut.cli ERROR - [2013-06-03 13:49:53,227] Exception initiating DeviceManager!: Remote Device Error: unable to connect to panda-0067.p10.releng.scl1.mozilla.com after 1 attempts
> device INFO - [2013-06-03 13:50:39,448] handling timeout on panda-0067
> sut.cli ERROR - [2013-06-03 13:50:42,523] Exception initiating DeviceManager!: Remote Device Error: unable to connect to panda-0067.p10.releng.scl1.mozilla.com after 1 attempts
> device INFO - [2013-06-03 13:51:29,507] handling timeout on panda-0067
> sut.cli ERROR - [2013-06-03 13:51:32,591] Exception initiating DeviceManager!: Remote Device Error: unable to connect to panda-0067.p10.releng.scl1.mozilla.com after 1 attempts
> device INFO - [2013-06-03 13:52:19,578] handling timeout on panda-0067
> sut.cli ERROR - [2013-06-03 13:52:22,651] Exception initiating DeviceManager!: Remote Device Error: unable to connect to panda-0067.p10.releng.scl1.mozilla.com after 1 attempts
> device INFO - [2013-06-03 13:53:09,574] handling timeout on panda-0067
> sut.cli ERROR - [2013-06-03 13:53:12,655] Exception initiating DeviceManager!: Remote Device Error: unable to connect to panda-0067.p10.releng.scl1.mozilla.com after 1 attempts
> device INFO - [2013-06-03 13:53:59,663] handling timeout on panda-0067
> sut.cli ERROR - [2013-06-03 13:54:02,732] Exception initiating DeviceManager!: Remote Device Error: unable to connect to panda-0067.p10.releng.scl1.mozilla.com after 1 attempts
> device INFO - [2013-06-03 13:54:49,757] handling timeout on panda-0067
> device.panda-0067 INFO - [2013-06-03 13:54:49,788] entering state sut_verify_power_cycle
> device.panda-0067 INFO - [2013-06-03 13:54:52,011] entering state sut_verifying
> sut.cli ERROR - [2013-06-03 13:54:55,071] Exception initiating DeviceManager!: Remote Device Error: unable to connect to panda-0067.p10.releng.scl1.mozilla.com after 1 attempts
> device INFO - [2013-06-03 13:55:39,756] handling timeout on panda-0067
> sut.cli ERROR - [2013-06-03 13:56:00,825] Exception initiating DeviceManager!: Remote Device Error: unable to connect to panda-0067.p10.releng.scl1.mozilla.com after 1 attempts
> device INFO - [2013-06-03 13:56:29,773] handling timeout on panda-0067
> sut.cli ERROR - [2013-06-03 13:56:32,878] Exception initiating DeviceManager!: Remote Device Error: unable to connect to panda-0067.p10.releng.scl1.mozilla.com after 1 attempts
> device INFO - [2013-06-03 13:57:19,825] handling timeout on panda-0067
> sut.cli ERROR - [2013-06-03 13:57:19,897] Exception initiating DeviceManager!: Remote Device Error: unable to connect to panda-0067.p10.releng.scl1.mozilla.com after 1 attempts
> request.208048 INFO - [2013-06-03 13:57:53,476] entering state finding_device
> request.208048 INFO - [2013-06-03 13:57:53,499] Finding device.
> request.208048 WARNING - [2013-06-03 13:57:53,625] Request failed!
In summary, mozpool requested that lifeguard reboot the device. Lifeguard issued a SUT reboot, then tried to verify it via SUT, and it didn't come back. 10 minutes later, the request's pending state timed out, and it re-entered the finding_device state. At that point, the device was busy (still trying to reboot), hence the error.
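For reference, the gap between the request entering pending and re-entering finding_device can be checked from the two timestamps in the log above. This is only a small sketch using the standard library; the helper name `parse_log_time` is made up for illustration and is not part of Mozpool.

```python
from datetime import datetime

# Timestamp layout used in the log excerpt, e.g. "2013-06-03 13:46:17,020".
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def parse_log_time(stamp):
    """Hypothetical helper: turn a log timestamp string into a datetime."""
    return datetime.strptime(stamp, LOG_TIME_FORMAT)


# Both values are copied from the request.208048 lines in the log.
entered_pending = parse_log_time("2013-06-03 13:46:17,020")
reentered_finding_device = parse_log_time("2013-06-03 13:57:53,476")

elapsed = reentered_finding_device - entered_pending
elapsed_seconds = elapsed.total_seconds()

# Roughly eleven and a half minutes, in line with the "10 minutes" estimate.
print(elapsed_seconds / 60.0)
```

The device never left its reboot/verify cycle during that window, which is why it was still busy when the request looked for it again.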
I think that, for requests for a specific device, the fix is to just remove the pending timeout, and let the request's expiration be the limit. For requests that can be fulfilled by multiple devices, it makes more sense to re-enter finding_device in hopes of finding a different device. Do you want to take a crack at that patch? The pending state is defined in a class in mozpool/mozpool/requestmachine.py. I can do it, if you'd prefer - just trying to broaden the bus factor of Mozpool.
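Roughly, the behavior I have in mind looks like the sketch below. This is not the code in requestmachine.py: the `Request` class, the `next_state_on_pending_timeout` function, the `"any"` marker for multi-device requests, and the `"expired"` state name are all assumptions made up for illustration. Only the `pending` and `finding_device` state names come from the log above.

```python
class Request(object):
    """Hypothetical stand-in for a Mozpool request (not the real class)."""

    def __init__(self, requested_device, expired=False):
        # Assumption: "any" marks a request that any device can fulfil;
        # anything else names one specific device, e.g. "panda-0067".
        self.requested_device = requested_device
        self.expired = expired


def next_state_on_pending_timeout(request):
    """Pick the state to enter when the pending timeout fires."""
    if request.expired:
        # The request's own expiration is the hard limit in every case.
        return "expired"
    if request.requested_device == "any":
        # Another device might be free, so go looking again.
        return "finding_device"
    # A specific device was requested: retrying finding_device would only
    # find that same device busy, so keep waiting in pending instead.
    return "pending"


print(next_state_on_pending_timeout(Request("panda-0067")))
print(next_state_on_pending_timeout(Request("any")))
```

The point of the split is that a specific-device request never hits the busy check against its own device while lifeguard is still working on it.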
The other question is, assuming this is the story with all of the devices, why are 95% of them failing the SUT verification after a SUT reboot?
Comment 4•12 years ago
|
||
I'll look at crafting a mozpool patch for this issue. Maybe the rack of pandas I'm using is wonky. I'm going to exclude some more and see if this can get the failure rate down. Thanks Dustin!
Comment 5•12 years ago
|
||
So I looked at this. Not quite sure if this is right but I would guess that you are suggesting to comment out line 250 in mozpool/mozpool/requestmachine.py
#TIMEOUT = 60
and then the pending state will be allowed to last longer than the 60 seconds specified.
>>For requests that can be fulfilled by multiple devices, it makes more sense to re-enter finding_device in hopes of finding a different device.
I don't think this applies in the case of Android pandas. With our current setup we get the name of a device that's idle and attached to the buildbot master, set up the virtualenv, etc., and then contact mozpool to make sure the device is in a good state. We can't go back and request a different device at that point; the job will just get retriggered when it fails to get that specific device from mozpool.
| Assignee | ||
Comment 6•12 years ago
|
||
Yes, that's basically what I'm suggesting, but the need to support multi-device requests makes it a bit harder. You're right that we aren't currently using such requests for android pandas, but Mozpool supports such requests so we need to make sure they work correctly.
Comment 7•12 years ago
|
||
Patch to fix the issue with the panda_android devices getting failed_device_busy much of the time. Not sure how to address the issue of retrying with multiple devices... but this isn't a use case that's blocking me now :-)
Attachment #759469 -
Flags: review?(dustin)
| Assignee | ||
Comment 8•12 years ago
|
||
Comment on attachment 759469 [details] [diff] [review]
patch
Let me see if I can come up with a more general solution.
Attachment #759469 -
Flags: review?(dustin) → review-
| Assignee | ||
Comment 9•12 years ago
|
||
How's this look? The two integration tests test both sides of this functionality.
Assignee: nobody → dustin
Attachment #759469 -
Attachment is obsolete: true
Attachment #759821 -
Flags: review?(kmoir)
Comment 10•12 years ago
|
||
Attachment #759821 -
Flags: review?(kmoir) → review+
| Assignee | ||
Updated•12 years ago
|
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
| Assignee | ||
Comment 11•12 years ago
|
||
Do we need to push a new version for this soon?
Comment 12•12 years ago
|
||
Yes please Dustin, I would really appreciate it. Right now my test results are quite orange due to this issue (so many failed_device_busy)
| Assignee | ||
Comment 13•12 years ago
|
||
Comment on attachment 759821 [details] [diff] [review]
bug878880.patch
Review of attachment 759821 [details] [diff] [review]:
-----------------------------------------------------------------
::: mozpool/mozpool/requestmachine.py
@@ +276,4 @@
> if self.machine.increment_counter(self.state_name) < self.PERMANENT_FAILURE_COUNT:
> self.machine.goto_state(pending)
> else:
> + self.logger.warning('THERE')
I blame the reviewer as much as myself for missing these debug prints ;)
Updated•9 years ago
|
Product: Testing → Testing Graveyard