Closed Bug 864488 Opened 8 years ago Closed 8 years ago

sut_verifying runs before sut_rebooting is done, causes pandas to fail verifying sdcard

Categories

(Testing Graveyard :: Mozpool, defect)

x86
macOS
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: kmoir, Assigned: dustin)

References

Details

Attachments

(1 file, 1 obsolete file)

I have a problem when requesting pandas with an android image via mozpool
that I noticed when running tests for bug 829211. In summary, the first time I request the device it's fine, but the second time it loses its android image.  My understanding is that if the existing image on the device is the same as the requested image, it shouldn't change.

Here are examples of api calls that I make via the command line

[cltbld@foopy106 ~]$ curl --data '{"assignee": "foopy106.p10.releng.scl1.mozilla.com", "duration": 600, "image": "android"}' 'http://mobile-imaging-010.p10.releng.scl1.mozilla.com/api/device/panda-0064/request/'
{"request": {"environment": "any", "requested_device": "panda-0064", "boot_config": "{}", "assignee": "foopy106.p10.releng.scl1.mozilla.com", "url": "http://mobile-imaging-010.p10.releng.scl1.mozilla.com/api/request/206889/", "image": "android", "expires": "2013-04-22T19:50:00", "id": 206889, "assigned_device": "panda-0064"}}

So the panda is assigned and has the correct image. The next time the same panda is requested.  (Previous request has expired or device returned)

[cltbld@foopy106 ~]$ curl --data '{"assignee": "foopy106.p10.releng.scl1.mozilla.com", "duration": 600, "image": "android"}' 'http://mobile-imaging-010.p10.releng.scl1.mozilla.com/api/device/panda-0064/request/'
{"request": {"environment": "any", "requested_device": "panda-0064", "boot_config": "{}", "assignee": "foopy106.p10.releng.scl1.mozilla.com", "url": "http://mobile-imaging-010.p10.releng.scl1.mozilla.com/api/request/206890/", "image": "android", "expires": "2013-04-22T20:00:41", "id": 206890, "assigned_device": "panda-0064"}}

If you then look at mozpool ui, the device no longer has the android image associated with it.  One thing I noticed is that you look at the boot_config for the device when it's requested via mozpool api it's null.  If you image the device through mozpool via the ui, the default boot_config value is "".

panda freshly imaged with android image
{
      "environment": "cedar",
      "relay_info": "panda-relay-005.p10.releng.scl1.mozilla.com:bank2:relay1",
      "name": "panda-0062",
      "image": "android",
      "request_id": null,
      "fqdn": "panda-0062.p10.releng.scl1.mozilla.com",
      "comments": "bug 860028",
      "inventory_id": 7021,
      "state": "ready",
      "imaging_server": "mobile-imaging-010.p10.releng.scl1.mozilla.com",
      "mac_address": "2e6083154e01",
      "id": 2341,
      "boot_config": ""
    },
panda requested via api has null image after second request
    {
      "environment": "cedar",
      "relay_info": "panda-relay-005.p10.releng.scl1.mozilla.com:bank2:relay3",
      "name": "panda-0064",
      "image": null,
      "request_id": null,
      "fqdn": "panda-0064.p10.releng.scl1.mozilla.com",
      "comments": "bug 860028",
      "inventory_id": 7023,
      "state": "ready",
      "imaging_server": "mobile-imaging-010.p10.releng.scl1.mozilla.com",
      "mac_address": "0e6069284e01",
      "id": 2343,
      "boot_config": null
    },

Is this a bug in mozpool or in there a problem with the API I'm using?
Flags: needinfo?(dustin)
Blocks: 829211
So, the first request succeeded, per the logs:

http://mobile-imaging-010.p10.releng.scl1.mozilla.com/ui/log.html?request=206889

2013-04-22T12:40:00 statemachine entering state finding_device
2013-04-22T12:40:00 statemachine Finding device.
2013-04-22T12:40:00 statemachine Assigning device panda-0064.
2013-04-22T12:40:00 statemachine Request succeeded.
2013-04-22T12:40:00 statemachine entering state contacting_lifeguard
2013-04-22T12:40:00 statemachine entering state pending
2013-04-22T12:47:16 statemachine entering state ready
2013-04-22T12:50:05 statemachine Request expired.
2013-04-22T12:50:05 statemachine entering state expired

The second request failed, though:

http://mobile-imaging-010.p10.releng.scl1.mozilla.com/ui/log.html?request=206890
2013-04-22T12:50:41 statemachine entering state finding_device
2013-04-22T12:50:41 statemachine Finding device.
2013-04-22T12:50:41 statemachine Assigning device panda-0064.
2013-04-22T12:50:41 statemachine Request succeeded.
2013-04-22T12:50:42 statemachine entering state contacting_lifeguard
2013-04-22T12:50:42 statemachine entering state pending
2013-04-22T12:53:15 statemachine entering state finding_device
2013-04-22T12:53:15 statemachine Finding device.
2013-04-22T12:53:15 statemachine Request failed!
2013-04-22T12:53:15 statemachine Requested device None is busy.
2013-04-22T12:53:15 statemachine entering state device_busy
2013-04-22T13:00:45 statemachine Request expired.
2013-04-22T13:00:45 statemachine entering state expired

And checking the device logs, you'll see that it failed in such a way that the image on the device was unspecified.

2013-04-22T12:50:42 statemachine entering state pc_power_cycling
2013-04-22T12:50:42 statemachine starting SUT reboot
2013-04-22T12:50:42 sut starting reboot
2013-04-22T12:50:42 statemachine entering state sut_verifying
2013-04-22T12:50:42 sut connecting to SUT agent
2013-04-22T12:50:42 statemachine entering state sut_sdcard_verifying
2013-04-22T12:50:42 sut verifying SD card
2013-04-22T12:50:58 sut verifying SD card
2013-04-22T12:52:05 sut verifying SD card
2013-04-22T12:53:15 statemachine entering state failed_sut_sdcard_verifying
2013-04-22T12:53:15 statemachine device has failed: failed_sut_sdcard_verifying
2013-04-22T12:53:15 statemachine sending imaging result 'failed-bad-device' to Mozpool

I think the error of your way here is not monitoring the request state to see if it ends up at 'ready'.  Just seeing the device assigned is not enough (that happens immediately as long as the device isn't assigned to some other request).
No longer blocks: 829211
Status: NEW → RESOLVED
Closed: 8 years ago
Flags: needinfo?(dustin)
Resolution: --- → WORKSFORME
Blocks: 829211
Dustin, thanks for the comments but I still see a problem.

I tried this again with the following steps
image device with android image via mozpool ui

wait for device to appear as ready
http://mobile-imaging-010.p10.releng.scl1.mozilla.com/api/device/panda-0064/state/
{"state": "ready"}

Then request it via the api on the command line

[cltbld@foopy106 ~]$ curl --data '{"assignee": "foopy106.p10.releng.scl1.mozilla.com", "duration": 600, "image": "android"}' 'http://mobile-imaging-010.p10.releng.scl1.mozilla.com/api/device/panda-0064/request/'
{"request": {"environment": "any", "requested_device": "panda-0064", "boot_config": "{}", "assignee": "foopy106.p10.releng.scl1.mozilla.com", "url": "http://mobile-imaging-010.p10.releng.scl1.mozilla.com/api/request/206909/", "image": "android", "expires": "2013-04-23T02:05:07", "id": 206909, "assigned_device": "panda-0064"}}[cltbld@foopy106 ~]$ 


http://mobile-imaging-010.p10.releng.scl1.mozilla.com/ui/log.html?request=206909

2013-04-22T18:55:07 statemachine entering state finding_device
2013-04-22T18:55:07 statemachine Finding device.
2013-04-22T18:55:07 statemachine Assigning device panda-0064.
2013-04-22T18:55:07 statemachine Request succeeded.
2013-04-22T18:55:07 statemachine entering state contacting_lifeguard
2013-04-22T18:55:08 statemachine entering state pending
2013-04-22T18:57:45 statemachine entering state finding_device
2013-04-22T18:57:46 statemachine Finding device.
2013-04-22T18:57:48 statemachine Request failed!
2013-04-22T18:57:48 statemachine Requested device None is busy.
2013-04-22T18:57:49 statemachine entering state device_busy
2013-04-22T19:05:08 statemachine Request expired.
2013-04-22T19:05:08 statemachine entering state expired

And now http://mobile-imaging-010.p10.releng.scl1.mozilla.com/api/device/panda-0064/state/
shows {"state": "ready"}

but again the android image isn't associated with that device anymore.   

{
      "environment": "cedar",
      "relay_info": "panda-relay-005.p10.releng.scl1.mozilla.com:bank2:relay3",
      "name": "panda-0064",
      "image": null,
      "request_id": null,
      "fqdn": "panda-0064.p10.releng.scl1.mozilla.com",
      "comments": "bug 860028",
      "inventory_id": 7023,
      "state": "ready",
      "imaging_server": "mobile-imaging-010.p10.releng.scl1.mozilla.com",
      "mac_address": "0e6069284e01",
      "id": 2343,
      "boot_config": null
    },

This happens on every single panda I try. So I don't understand how this can be an issue with the device not being in a ready state, because it does seem to be in a ready state, but doesn't have an image associated with it anymore after I request a specific panda that used to have an android image associated with it.
Your manual image did the following:

2013-04-22T18:37:36 statemachine setting next image to android, boot config to
2013-04-22T18:37:36 statemachine entering state start_pxe_boot
2013-04-22T18:37:36 statemachine entering state pxe_power_cycling
2013-04-22T18:37:36 statemachine setting PXE config to panda-android.2
2013-04-22T18:37:36 statemachine initiating power cycle
2013-04-22T18:37:39 statemachine entering state pxe_booting
2013-04-22T18:38:54 syslog mobile-init.sh starting.
2013-04-22T18:38:54 syslog Reading boot args.
2013-04-22T18:38:54 syslog second stage URL found: http://10.12.137.33/scripts/android-second-stage.sh
2013-04-22T18:38:54 syslog syslog/imaging server IP found: 10.12.137.33
2013-04-22T18:38:54 syslog Submitting lifeguard event at http://10.12.137.33/api/device/panda-0064/event/mobile_init_started/
2013-04-22T18:38:55 statemachine entering state mobile_init_started
2013-04-22T18:38:55 syslog wget success: --2013-04-22 18:38:55-- http://10.12.137.33/scripts/android-second-stage.sh#012Connecting to 10.12.137.33:80... connected.#012HTTP request sent, awaiting response... 200 OK#012Length: 2370 (2.3K) [application/x-sh]#012Saving to: `/opt/scripts/second-stage.sh'#012#012 0K .. 100% 37.0M=0s#012#0122013-04-22 18:38:55 (37.0 MB/s) - `/opt/scripts/second-stage.sh' saved [2370/2370]
2013-04-22T18:38:55 syslog Executing second-stage.sh
2013-04-22T18:38:55 syslog beginning android-second-stage.sh
2013-04-22T18:38:55 syslog Submitting lifeguard event at http://10.12.137.33/api/device/panda-0064/event/android_downloading/
2013-04-22T18:38:55 statemachine entering state android_downloading
2013-04-22T18:38:55 syslog got android image panda-android.2
2013-04-22T18:38:55 syslog fetching http://10.12.137.33/artifacts/panda-android.2/system.tar.bz2
2013-04-22T18:39:08 syslog fetching http://10.12.137.33/artifacts/panda-android.2/userdata.tar.bz2
2013-04-22T18:39:09 syslog fetching http://10.12.137.33/artifacts/panda-android.2/boot.tar.bz2
2013-04-22T18:39:10 syslog formatting system partition
2013-04-22T18:39:11 syslog formatting cache partition
2013-04-22T18:39:12 syslog formatting userdata partition
2013-04-22T18:39:14 syslog formatting media partition
2013-04-22T18:39:15 syslog mounting boot partition
2013-04-22T18:39:15 syslog mounting system partition
2013-04-22T18:39:15 syslog mounting userdata partition
2013-04-22T18:39:15 syslog Submitting lifeguard event at http://10.12.137.33/api/device/panda-0064/event/android_extracting/
2013-04-22T18:39:15 statemachine entering state android_extracting
2013-04-22T18:39:15 syslog extracting boot artifact
2013-04-22T18:39:18 syslog extracting system artifact
2013-04-22T18:40:59 syslog extracting userdata artifact
2013-04-22T18:41:13 syslog unmounting partitions
2013-04-22T18:41:18 syslog imaging complete. Rebooting
2013-04-22T18:41:18 syslog Submitting lifeguard event at http://10.12.137.33/api/device/panda-0064/event/android_rebooting/
2013-04-22T18:41:18 statemachine entering state sut_verifying
2013-04-22T18:41:18 sut connecting to SUT agent
2013-04-22T18:42:10 sut connecting to SUT agent
2013-04-22T18:43:00 sut connecting to SUT agent
2013-04-22T18:43:50 sut connecting to SUT agent
2013-04-22T18:44:40 sut connecting to SUT agent
2013-04-22T18:44:41 statemachine entering state sut_sdcard_verifying
2013-04-22T18:44:41 sut verifying SD card
2013-04-22T18:44:41 statemachine entering state operation_complete
2013-04-22T18:44:41 statemachine entering state ready
2013-04-22T18:44:42 sut connecting to SUT agent
2013-04-22T18:54:51 sut connecting to SUT agent

at which point the device was ready, with image=android, but not part of a request.

Then, you made request 206909, and it power-cycled the device and tried to run the SUT checks:

2013-04-22T18:55:08 statemachine entering state pc_power_cycling
2013-04-22T18:55:08 statemachine starting SUT reboot
2013-04-22T18:55:08 sut starting reboot
2013-04-22T18:55:08 statemachine entering state sut_verifying
2013-04-22T18:55:08 sut connecting to SUT agent
2013-04-22T18:55:08 statemachine entering state sut_sdcard_verifying
2013-04-22T18:55:08 sut verifying SD card
2013-04-22T18:55:24 sut verifying SD card
2013-04-22T18:56:31 sut verifying SD card
2013-04-22T18:57:42 statemachine entering state failed_sut_sdcard_verifying
2013-04-22T18:57:43 statemachine device has failed: failed_sut_sdcard_verifying
2013-04-22T18:57:44 statemachine sending imaging result 'failed-bad-device' to Mozpool
2013-04-22T18:57:45 statemachine entering state start_self_test

and the device failed its sut_sdcard_verify, so the request failed.  This is pretty much exactly what I showed in comment 1.

Note that, at the end of your example, you looked at the device's status, not the request status.  The device was ready at that point, having passed its self-test.  The request was in state 'device_busy', which in version 3.0.1 and before is how Mozpool says "bad device".

Now, it's certainly reasonable to ask why the sdcard verify fails, but that's a Panda/Android question.  Mozpool is handling the situation correctly, based on the information it has.
Okay thanks Dustin for the explanation. This morning I imaged about ten pandas and then requested all of them.  Only three of them went into the ready state.  I didn't think that that many pandas could be faulty in the range I'm using for testing, but this seems to be the case.
Well, that suggests there's something wrong with the Android image that's causing it to fail the SUT sdcard validation.  That's beyond my capabilities, but certainly there are folks around who could track it down.  The code is at https://github.com/djmitche/mozpool/blob/master/mozpool/bmm/sut.py#L27.  Looking in /var/log/mozpool.log, i see

sut.cli ERROR - [2013-04-23 06:11:18,287] Exception while checking SD card!
sut.cli ERROR - [2013-04-23 06:11:18,287] Traceback (most recent call last):
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/mozpool/bmm/sut.py", line 36, in check_sdcard
    dm = DeviceManagerSUT(device_fqdn)
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/mozdevice/devicemanagerSUT.py", line 37, in __init__
    self.getDeviceRoot()
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/mozdevice/devicemanagerSUT.py", line 736, in getDeviceRoot
    data = self._runCmds([{ 'cmd': 'testroot' }])
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/mozdevice/devicemanagerSUT.py", line 141, in _runCmds
    self._sendCmds(cmdlist, outputfile, timeout, retryLimit=retryLimit)
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/mozdevice/devicemanagerSUT.py", line 132, in _sendCmds
    raise DMError("Remote Device Error: unable to connect to %s after %s attempts" % (self.host, retryLimit))
DMError: Remote Device Error: unable to connect to panda-0064.p10.releng.scl1.mozilla.com after 5 attempts

which suggests that the SUT agent has failed at that point.
(In reply to Dustin J. Mitchell [:dustin] from comment #3)

> Then, you made request 206909, and it power-cycled the device and tried to
> run the SUT checks:
> 
> 2013-04-22T18:55:08 statemachine entering state pc_power_cycling
> 2013-04-22T18:55:08 statemachine starting SUT reboot
> 2013-04-22T18:55:08 sut starting reboot
> 2013-04-22T18:55:08 statemachine entering state sut_verifying
> 2013-04-22T18:55:08 sut connecting to SUT agent
> 2013-04-22T18:55:08 statemachine entering state sut_sdcard_verifying
> 2013-04-22T18:55:08 sut verifying SD card
> 2013-04-22T18:55:24 sut verifying SD card
> 2013-04-22T18:56:31 sut verifying SD card
> 2013-04-22T18:57:42 statemachine entering state failed_sut_sdcard_verifying
> 2013-04-22T18:57:43 statemachine device has failed:
> failed_sut_sdcard_verifying


I want to point out here that mozpool is requesting a reboot, verifing the sut connection and then checking the sdcard via SUT all within the same second (18:55:08).  There is no delay in checks to allow the panda board to reboot.

Seems like a mozpool bug to me.
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
Oh, interesting - so the sut_verifying succeeds *before* the reboot, and then it doesn't retry the sdcard enough times for the reboot to complete.

Mark, what do you think?  Should we add a 10s wait state?
Summary: problem requesting pandas with android image → sut_verifying runs before sut_rebooting is done, causes pandas to fail verifying sdcard
Huh, yeah, looks like the board is still preparing to reboot when the sut_verification state happens, so it looks like the board is back up when it hasn't yet rebooted.  So yeah, I guess a little wait state is what we need... in general I don't like arbitrary sleeps, but I'm not sure what else to do here except something more complicated like checking uptime or something.
Assignee: nobody → dustin
Attached patch bug864488.patch (obsolete) — Splinter Review
Attachment #740958 - Flags: review?(mcote)
Comment on attachment 740958 [details] [diff] [review]
bug864488.patch

Review of attachment 740958 [details] [diff] [review]:
-----------------------------------------------------------------

::: mozpool/lifeguard/devicemachine.py
@@ +371,5 @@
> +
> +    TIMEOUT = 10
> +
> +    def on_timeout(self):
> +        self.machine.goto_state(pc_pinging)

Uh, shouldn't we be going to sut_verifying?
eep, yes!
Attachment #740958 - Flags: review?(mcote) → review-
That worked fine in my test environment, with fake pandas..
Attachment #740958 - Attachment is obsolete: true
Attachment #741006 - Flags: review?(mcote)
Attachment #741006 - Flags: review?(mcote) → review+
Status: REOPENED → RESOLVED
Closed: 8 years ago8 years ago
Resolution: --- → FIXED
When can another version of mozpool be deployed to include this change?  Right now I can't test mozpool queries for my android mozpool/mozharness bug. (I have hacked my scripts to connect to the pandas directly).  So a new version of mozpool on the servers would be much appreciated :-)
Product: Testing → Testing Graveyard
You need to log in before you can comment on or make changes to this bug.