Closed
Bug 864488
Opened 11 years ago
Closed 11 years ago
sut_verifying runs before sut_rebooting is done, causes pandas to fail verifying sdcard
Categories
(Testing Graveyard :: Mozpool, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: kmoir, Assigned: dustin)
References
Details
Attachments
(1 file, 1 obsolete file)
1.58 KB,
patch
|
mcote
:
review+
|
Details | Diff | Splinter Review |
I have a problem when requesting pandas with an android image via mozpool that I noticed when running tests for bug 829211. In summary, the first time I request the device it's fine, but the second time it loses its android image. My understanding is that if the existing image on the device is the same as the requested image, it shouldn't change. Here are examples of api calls that I make via the command line [cltbld@foopy106 ~]$ curl --data '{"assignee": "foopy106.p10.releng.scl1.mozilla.com", "duration": 600, "image": "android"}' 'http://mobile-imaging-010.p10.releng.scl1.mozilla.com/api/device/panda-0064/request/' {"request": {"environment": "any", "requested_device": "panda-0064", "boot_config": "{}", "assignee": "foopy106.p10.releng.scl1.mozilla.com", "url": "http://mobile-imaging-010.p10.releng.scl1.mozilla.com/api/request/206889/", "image": "android", "expires": "2013-04-22T19:50:00", "id": 206889, "assigned_device": "panda-0064"}} So the panda is assigned and has the correct image. The next time the same panda is requested. (Previous request has expired or device returned) [cltbld@foopy106 ~]$ curl --data '{"assignee": "foopy106.p10.releng.scl1.mozilla.com", "duration": 600, "image": "android"}' 'http://mobile-imaging-010.p10.releng.scl1.mozilla.com/api/device/panda-0064/request/' {"request": {"environment": "any", "requested_device": "panda-0064", "boot_config": "{}", "assignee": "foopy106.p10.releng.scl1.mozilla.com", "url": "http://mobile-imaging-010.p10.releng.scl1.mozilla.com/api/request/206890/", "image": "android", "expires": "2013-04-22T20:00:41", "id": 206890, "assigned_device": "panda-0064"}} If you then look at mozpool ui, the device no longer has the android image associated with it. One thing I noticed is that you look at the boot_config for the device when it's requested via mozpool api it's null. If you image the device through mozpool via the ui, the default boot_config value is "". panda freshly imaged with android image { "environment": "cedar", "relay_info": "panda-relay-005.p10.releng.scl1.mozilla.com:bank2:relay1", "name": "panda-0062", "image": "android", "request_id": null, "fqdn": "panda-0062.p10.releng.scl1.mozilla.com", "comments": "bug 860028", "inventory_id": 7021, "state": "ready", "imaging_server": "mobile-imaging-010.p10.releng.scl1.mozilla.com", "mac_address": "2e6083154e01", "id": 2341, "boot_config": "" }, panda requested via api has null image after second request { "environment": "cedar", "relay_info": "panda-relay-005.p10.releng.scl1.mozilla.com:bank2:relay3", "name": "panda-0064", "image": null, "request_id": null, "fqdn": "panda-0064.p10.releng.scl1.mozilla.com", "comments": "bug 860028", "inventory_id": 7023, "state": "ready", "imaging_server": "mobile-imaging-010.p10.releng.scl1.mozilla.com", "mac_address": "0e6069284e01", "id": 2343, "boot_config": null }, Is this a bug in mozpool or in there a problem with the API I'm using?
Reporter | ||
Updated•11 years ago
|
Flags: needinfo?(dustin)
Assignee | ||
Comment 1•11 years ago
|
||
So, the first request succeeded, per the logs: http://mobile-imaging-010.p10.releng.scl1.mozilla.com/ui/log.html?request=206889 2013-04-22T12:40:00 statemachine entering state finding_device 2013-04-22T12:40:00 statemachine Finding device. 2013-04-22T12:40:00 statemachine Assigning device panda-0064. 2013-04-22T12:40:00 statemachine Request succeeded. 2013-04-22T12:40:00 statemachine entering state contacting_lifeguard 2013-04-22T12:40:00 statemachine entering state pending 2013-04-22T12:47:16 statemachine entering state ready 2013-04-22T12:50:05 statemachine Request expired. 2013-04-22T12:50:05 statemachine entering state expired The second request failed, though: http://mobile-imaging-010.p10.releng.scl1.mozilla.com/ui/log.html?request=206890 2013-04-22T12:50:41 statemachine entering state finding_device 2013-04-22T12:50:41 statemachine Finding device. 2013-04-22T12:50:41 statemachine Assigning device panda-0064. 2013-04-22T12:50:41 statemachine Request succeeded. 2013-04-22T12:50:42 statemachine entering state contacting_lifeguard 2013-04-22T12:50:42 statemachine entering state pending 2013-04-22T12:53:15 statemachine entering state finding_device 2013-04-22T12:53:15 statemachine Finding device. 2013-04-22T12:53:15 statemachine Request failed! 2013-04-22T12:53:15 statemachine Requested device None is busy. 2013-04-22T12:53:15 statemachine entering state device_busy 2013-04-22T13:00:45 statemachine Request expired. 2013-04-22T13:00:45 statemachine entering state expired And checking the device logs, you'll see that it failed in such a way that the image on the device was unspecified. 2013-04-22T12:50:42 statemachine entering state pc_power_cycling 2013-04-22T12:50:42 statemachine starting SUT reboot 2013-04-22T12:50:42 sut starting reboot 2013-04-22T12:50:42 statemachine entering state sut_verifying 2013-04-22T12:50:42 sut connecting to SUT agent 2013-04-22T12:50:42 statemachine entering state sut_sdcard_verifying 2013-04-22T12:50:42 sut verifying SD card 2013-04-22T12:50:58 sut verifying SD card 2013-04-22T12:52:05 sut verifying SD card 2013-04-22T12:53:15 statemachine entering state failed_sut_sdcard_verifying 2013-04-22T12:53:15 statemachine device has failed: failed_sut_sdcard_verifying 2013-04-22T12:53:15 statemachine sending imaging result 'failed-bad-device' to Mozpool I think the error of your way here is not monitoring the request state to see if it ends up at 'ready'. Just seeing the device assigned is not enough (that happens immediately as long as the device isn't assigned to some other request).
No longer blocks: 829211
Status: NEW → RESOLVED
Closed: 11 years ago
Flags: needinfo?(dustin)
Resolution: --- → WORKSFORME
Reporter | ||
Comment 2•11 years ago
|
||
Dustin, thanks for the comments but I still see a problem. I tried this again with the following steps image device with android image via mozpool ui wait for device to appear as ready http://mobile-imaging-010.p10.releng.scl1.mozilla.com/api/device/panda-0064/state/ {"state": "ready"} Then request it via the api on the command line [cltbld@foopy106 ~]$ curl --data '{"assignee": "foopy106.p10.releng.scl1.mozilla.com", "duration": 600, "image": "android"}' 'http://mobile-imaging-010.p10.releng.scl1.mozilla.com/api/device/panda-0064/request/' {"request": {"environment": "any", "requested_device": "panda-0064", "boot_config": "{}", "assignee": "foopy106.p10.releng.scl1.mozilla.com", "url": "http://mobile-imaging-010.p10.releng.scl1.mozilla.com/api/request/206909/", "image": "android", "expires": "2013-04-23T02:05:07", "id": 206909, "assigned_device": "panda-0064"}}[cltbld@foopy106 ~]$ http://mobile-imaging-010.p10.releng.scl1.mozilla.com/ui/log.html?request=206909 2013-04-22T18:55:07 statemachine entering state finding_device 2013-04-22T18:55:07 statemachine Finding device. 2013-04-22T18:55:07 statemachine Assigning device panda-0064. 2013-04-22T18:55:07 statemachine Request succeeded. 2013-04-22T18:55:07 statemachine entering state contacting_lifeguard 2013-04-22T18:55:08 statemachine entering state pending 2013-04-22T18:57:45 statemachine entering state finding_device 2013-04-22T18:57:46 statemachine Finding device. 2013-04-22T18:57:48 statemachine Request failed! 2013-04-22T18:57:48 statemachine Requested device None is busy. 2013-04-22T18:57:49 statemachine entering state device_busy 2013-04-22T19:05:08 statemachine Request expired. 2013-04-22T19:05:08 statemachine entering state expired And now http://mobile-imaging-010.p10.releng.scl1.mozilla.com/api/device/panda-0064/state/ shows {"state": "ready"} but again the android image isn't associated with that device anymore. { "environment": "cedar", "relay_info": "panda-relay-005.p10.releng.scl1.mozilla.com:bank2:relay3", "name": "panda-0064", "image": null, "request_id": null, "fqdn": "panda-0064.p10.releng.scl1.mozilla.com", "comments": "bug 860028", "inventory_id": 7023, "state": "ready", "imaging_server": "mobile-imaging-010.p10.releng.scl1.mozilla.com", "mac_address": "0e6069284e01", "id": 2343, "boot_config": null }, This happens on every single panda I try. So I don't understand how this can be an issue with the device not being in a ready state, because it does seem to be in a ready state, but doesn't have an image associated with it anymore after I request a specific panda that used to have an android image associated with it.
Assignee | ||
Comment 3•11 years ago
|
||
Your manual image did the following: 2013-04-22T18:37:36 statemachine setting next image to android, boot config to 2013-04-22T18:37:36 statemachine entering state start_pxe_boot 2013-04-22T18:37:36 statemachine entering state pxe_power_cycling 2013-04-22T18:37:36 statemachine setting PXE config to panda-android.2 2013-04-22T18:37:36 statemachine initiating power cycle 2013-04-22T18:37:39 statemachine entering state pxe_booting 2013-04-22T18:38:54 syslog mobile-init.sh starting. 2013-04-22T18:38:54 syslog Reading boot args. 2013-04-22T18:38:54 syslog second stage URL found: http://10.12.137.33/scripts/android-second-stage.sh 2013-04-22T18:38:54 syslog syslog/imaging server IP found: 10.12.137.33 2013-04-22T18:38:54 syslog Submitting lifeguard event at http://10.12.137.33/api/device/panda-0064/event/mobile_init_started/ 2013-04-22T18:38:55 statemachine entering state mobile_init_started 2013-04-22T18:38:55 syslog wget success: --2013-04-22 18:38:55-- http://10.12.137.33/scripts/android-second-stage.sh#012Connecting to 10.12.137.33:80... connected.#012HTTP request sent, awaiting response... 200 OK#012Length: 2370 (2.3K) [application/x-sh]#012Saving to: `/opt/scripts/second-stage.sh'#012#012 0K .. 100% 37.0M=0s#012#0122013-04-22 18:38:55 (37.0 MB/s) - `/opt/scripts/second-stage.sh' saved [2370/2370] 2013-04-22T18:38:55 syslog Executing second-stage.sh 2013-04-22T18:38:55 syslog beginning android-second-stage.sh 2013-04-22T18:38:55 syslog Submitting lifeguard event at http://10.12.137.33/api/device/panda-0064/event/android_downloading/ 2013-04-22T18:38:55 statemachine entering state android_downloading 2013-04-22T18:38:55 syslog got android image panda-android.2 2013-04-22T18:38:55 syslog fetching http://10.12.137.33/artifacts/panda-android.2/system.tar.bz2 2013-04-22T18:39:08 syslog fetching http://10.12.137.33/artifacts/panda-android.2/userdata.tar.bz2 2013-04-22T18:39:09 syslog fetching http://10.12.137.33/artifacts/panda-android.2/boot.tar.bz2 2013-04-22T18:39:10 syslog formatting system partition 2013-04-22T18:39:11 syslog formatting cache partition 2013-04-22T18:39:12 syslog formatting userdata partition 2013-04-22T18:39:14 syslog formatting media partition 2013-04-22T18:39:15 syslog mounting boot partition 2013-04-22T18:39:15 syslog mounting system partition 2013-04-22T18:39:15 syslog mounting userdata partition 2013-04-22T18:39:15 syslog Submitting lifeguard event at http://10.12.137.33/api/device/panda-0064/event/android_extracting/ 2013-04-22T18:39:15 statemachine entering state android_extracting 2013-04-22T18:39:15 syslog extracting boot artifact 2013-04-22T18:39:18 syslog extracting system artifact 2013-04-22T18:40:59 syslog extracting userdata artifact 2013-04-22T18:41:13 syslog unmounting partitions 2013-04-22T18:41:18 syslog imaging complete. Rebooting 2013-04-22T18:41:18 syslog Submitting lifeguard event at http://10.12.137.33/api/device/panda-0064/event/android_rebooting/ 2013-04-22T18:41:18 statemachine entering state sut_verifying 2013-04-22T18:41:18 sut connecting to SUT agent 2013-04-22T18:42:10 sut connecting to SUT agent 2013-04-22T18:43:00 sut connecting to SUT agent 2013-04-22T18:43:50 sut connecting to SUT agent 2013-04-22T18:44:40 sut connecting to SUT agent 2013-04-22T18:44:41 statemachine entering state sut_sdcard_verifying 2013-04-22T18:44:41 sut verifying SD card 2013-04-22T18:44:41 statemachine entering state operation_complete 2013-04-22T18:44:41 statemachine entering state ready 2013-04-22T18:44:42 sut connecting to SUT agent 2013-04-22T18:54:51 sut connecting to SUT agent at which point the device was ready, with image=android, but not part of a request. Then, you made request 206909, and it power-cycled the device and tried to run the SUT checks: 2013-04-22T18:55:08 statemachine entering state pc_power_cycling 2013-04-22T18:55:08 statemachine starting SUT reboot 2013-04-22T18:55:08 sut starting reboot 2013-04-22T18:55:08 statemachine entering state sut_verifying 2013-04-22T18:55:08 sut connecting to SUT agent 2013-04-22T18:55:08 statemachine entering state sut_sdcard_verifying 2013-04-22T18:55:08 sut verifying SD card 2013-04-22T18:55:24 sut verifying SD card 2013-04-22T18:56:31 sut verifying SD card 2013-04-22T18:57:42 statemachine entering state failed_sut_sdcard_verifying 2013-04-22T18:57:43 statemachine device has failed: failed_sut_sdcard_verifying 2013-04-22T18:57:44 statemachine sending imaging result 'failed-bad-device' to Mozpool 2013-04-22T18:57:45 statemachine entering state start_self_test and the device failed its sut_sdcard_verify, so the request failed. This is pretty much exactly what I showed in comment 1. Note that, at the end of your example, you looked at the device's status, not the request status. The device was ready at that point, having passed its self-test. The request was in state 'device_busy', which in version 3.0.1 and before is how Mozpool says "bad device". Now, it's certainly reasonable to ask why the sdcard verify fails, but that's a Panda/Android question. Mozpool is handling the situation correctly, based on the information it has.
Reporter | ||
Comment 4•11 years ago
|
||
Okay thanks Dustin for the explanation. This morning I imaged about ten pandas and then requested all of them. Only three of them went into the ready state. I didn't think that that many pandas could be faulty in the range I'm using for testing, but this seems to be the case.
Assignee | ||
Comment 5•11 years ago
|
||
Well, that suggests there's something wrong with the Android image that's causing it to fail the SUT sdcard validation. That's beyond my capabilities, but certainly there are folks around who could track it down. The code is at https://github.com/djmitche/mozpool/blob/master/mozpool/bmm/sut.py#L27. Looking in /var/log/mozpool.log, i see sut.cli ERROR - [2013-04-23 06:11:18,287] Exception while checking SD card! sut.cli ERROR - [2013-04-23 06:11:18,287] Traceback (most recent call last): File "/opt/mozpool/frontend/lib/python2.7/site-packages/mozpool/bmm/sut.py", line 36, in check_sdcard dm = DeviceManagerSUT(device_fqdn) File "/opt/mozpool/frontend/lib/python2.7/site-packages/mozdevice/devicemanagerSUT.py", line 37, in __init__ self.getDeviceRoot() File "/opt/mozpool/frontend/lib/python2.7/site-packages/mozdevice/devicemanagerSUT.py", line 736, in getDeviceRoot data = self._runCmds([{ 'cmd': 'testroot' }]) File "/opt/mozpool/frontend/lib/python2.7/site-packages/mozdevice/devicemanagerSUT.py", line 141, in _runCmds self._sendCmds(cmdlist, outputfile, timeout, retryLimit=retryLimit) File "/opt/mozpool/frontend/lib/python2.7/site-packages/mozdevice/devicemanagerSUT.py", line 132, in _sendCmds raise DMError("Remote Device Error: unable to connect to %s after %s attempts" % (self.host, retryLimit)) DMError: Remote Device Error: unable to connect to panda-0064.p10.releng.scl1.mozilla.com after 5 attempts which suggests that the SUT agent has failed at that point.
Comment 6•11 years ago
|
||
(In reply to Dustin J. Mitchell [:dustin] from comment #3) > Then, you made request 206909, and it power-cycled the device and tried to > run the SUT checks: > > 2013-04-22T18:55:08 statemachine entering state pc_power_cycling > 2013-04-22T18:55:08 statemachine starting SUT reboot > 2013-04-22T18:55:08 sut starting reboot > 2013-04-22T18:55:08 statemachine entering state sut_verifying > 2013-04-22T18:55:08 sut connecting to SUT agent > 2013-04-22T18:55:08 statemachine entering state sut_sdcard_verifying > 2013-04-22T18:55:08 sut verifying SD card > 2013-04-22T18:55:24 sut verifying SD card > 2013-04-22T18:56:31 sut verifying SD card > 2013-04-22T18:57:42 statemachine entering state failed_sut_sdcard_verifying > 2013-04-22T18:57:43 statemachine device has failed: > failed_sut_sdcard_verifying I want to point out here that mozpool is requesting a reboot, verifing the sut connection and then checking the sdcard via SUT all within the same second (18:55:08). There is no delay in checks to allow the panda board to reboot. Seems like a mozpool bug to me.
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
Assignee | ||
Comment 7•11 years ago
|
||
Oh, interesting - so the sut_verifying succeeds *before* the reboot, and then it doesn't retry the sdcard enough times for the reboot to complete. Mark, what do you think? Should we add a 10s wait state?
Assignee | ||
Updated•11 years ago
|
Summary: problem requesting pandas with android image → sut_verifying runs before sut_rebooting is done, causes pandas to fail verifying sdcard
Comment 8•11 years ago
|
||
Huh, yeah, looks like the board is still preparing to reboot when the sut_verification state happens, so it looks like the board is back up when it hasn't yet rebooted. So yeah, I guess a little wait state is what we need... in general I don't like arbitrary sleeps, but I'm not sure what else to do here except something more complicated like checking uptime or something.
Assignee | ||
Updated•11 years ago
|
Assignee: nobody → dustin
Assignee | ||
Comment 9•11 years ago
|
||
Attachment #740958 -
Flags: review?(mcote)
Comment 10•11 years ago
|
||
Comment on attachment 740958 [details] [diff] [review] bug864488.patch Review of attachment 740958 [details] [diff] [review]: ----------------------------------------------------------------- ::: mozpool/lifeguard/devicemachine.py @@ +371,5 @@ > + > + TIMEOUT = 10 > + > + def on_timeout(self): > + self.machine.goto_state(pc_pinging) Uh, shouldn't we be going to sut_verifying?
Assignee | ||
Comment 11•11 years ago
|
||
eep, yes!
Assignee | ||
Updated•11 years ago
|
Attachment #740958 -
Flags: review?(mcote) → review-
Assignee | ||
Comment 12•11 years ago
|
||
That worked fine in my test environment, with fake pandas..
Attachment #740958 -
Attachment is obsolete: true
Attachment #741006 -
Flags: review?(mcote)
Updated•11 years ago
|
Attachment #741006 -
Flags: review?(mcote) → review+
Assignee | ||
Updated•11 years ago
|
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Resolution: --- → FIXED
Reporter | ||
Comment 13•11 years ago
|
||
When can another version of mozpool be deployed to include this change? Right now I can't test mozpool queries for my android mozpool/mozharness bug. (I have hacked my scripts to connect to the pandas directly). So a new version of mozpool on the servers would be much appreciated :-)
Updated•8 years ago
|
Product: Testing → Testing Graveyard
You need to log in
before you can comment on or make changes to this bug.
Description
•