Bug 819492 (releng-panda-mozpool) - opened 13 years ago, closed 13 years ago

[tracking bug] panda-mozpool imaging issues

Categories

(Release Engineering :: General, defect, P2)

(x86, macOS)

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: armenzg, Assigned: armenzg)

Attachments

(1 file)

Mozpool is a new piece in our systems, and we will find issues: we're dealing with mobile devices (which are not the most reliable), and mozpool still has to be put to the test against edge cases. To make tracking the issues easier, we will have to improve the releng scripts to fail gracefully and re-trigger where needed. I will associate an etherpad with this bug and spin issues out into their own bugs once they're clear. For now this bug is dedicated to the pandas, since they are the only thing we're using mozpool for.
Attached patch: panda168 log
SCENARIO A) "ERROR - INFRA-ERROR: Request did not become ready in time"
* log: [1]
* flow:
** a device is requested (we have a self.request_url)
** we wait for the device to be ready: http://hg.mozilla.org/build/mozharness/file/default/scripts/b2g_panda.py#l84
** the device is not ready in time
** we set the job to RETRY
** we do not delete the request
** we do not stop buildbot
** proposed fix: TBD (see the sketch after this comment)

[1] https://tbpl.mozilla.org/php/getParsedLog.php?id=17683047&tree=Cedar&full=1#error0
[2] http://hg.mozilla.org/build/mozharness/file/default/scripts/b2g_panda.py#l103

I think instead of printing a message and setting a status we would benefit from passing in a function to call. Right now we call this [2]:

  self._retry_sleep(blah, blah, error_message="INFRA-ERROR: Request did not become ready in time", tbpl_status=TBPL_RETRY)

I'm proposing that we call this instead:

  self._retry_sleep(blah, blah, once_exceeded_retries=self.deal_with_it)

and have _retry_sleep call that function once the retries are exhausted:

  def deal_with_it(self):
      # try to close the request;
      # if it does not close, output a message and then
      # call /builds/manage_buildslave.sh stop WithProperties("slavename"),
      # or similar

What do you think?

#########################
FTR I think it would have gotten into a working state *50 minutes* later. There are 40 minutes wasted between "Imaging complete. Rebooting" and "entering state b2g_pinging":
http://mobile-imaging-001.p1.releng.scl1.mozilla.com/ui/log.html?device=panda-0168

2012-12-06T13:21:36 statemachine writing pxe_config panda-b2g.2, boot config {"b2gbase": "http://pvtbuilds.pvt.build.mozilla.org//pub/mozilla.org/b2g/tinderbox-builds/cedar-panda/20121206153204"} to db
2012-12-06T13:21:36 statemachine entering state start_pxe_boot
2012-12-06T13:21:36 statemachine entering state pxe_power_cycling
2012-12-06T13:21:36 bmm setting PXE config to 'panda-b2g.2'
...
2012-12-06T13:24:16 syslog Imaging complete. Rebooting
2012-12-06T14:03:30 statemachine entering state b2g_pinging
* it pings 5 times and then tries to recover by starting again from "pxe_power_cycling"
2012-12-06T14:05:11 statemachine entering state pxe_power_cycling
...
2012-12-06T14:10:10 statemachine entering state ready
2012-12-06T14:11:20 statemachine in ready state but not assigned to a request; moving to free state
2012-12-06T14:11:20 statemachine entering state free
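As a rough illustration of the once_exceeded_retries idea, a minimal sketch; the _retry_sleep signature, the close_request() helper, and the error()/buildbot_status() methods are assumed here and may not match the real mozharness code:

  import subprocess
  import time


  class PandaRetrySketch(object):
      # Hypothetical mixin showing the callback-based retry proposal.

      def _retry_sleep(self, check_fn, max_retries, sleep_time,
                       error_message=None, tbpl_status=None,
                       once_exceeded_retries=None):
          # Poll check_fn until it succeeds or the retries run out.
          for _ in range(max_retries):
              if check_fn():
                  return True
              time.sleep(sleep_time)
          # Retries exhausted: report, then run the cleanup callback
          # instead of only printing a message and setting a status.
          if error_message:
              self.error(error_message)
          if tbpl_status:
              self.buildbot_status(tbpl_status)
          if once_exceeded_retries:
              once_exceeded_retries()
          return False

      def deal_with_it(self):
          # Try to close the mozpool request; if that fails, stop the
          # buildslave so it does not pick up more jobs on a bad device.
          try:
              self.close_request()  # hypothetical helper around the mozpool API
          except Exception as e:
              self.error("could not close request %s: %s" % (self.request_url, e))
              subprocess.call(["/builds/manage_buildslave.sh", "stop"])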
Depends on: 819552
Mozpool should be the one to report a failed request in most conditions. The timeout in the releng stuff should only be a backup for a hung mozpool (which hopefully won't happen again, but...), and it should log accordingly ("something's seriously wrong with mozpool!" rather than "this device didn't work"). Depending on the nature of the request, 50m may be a bit short.

Note that the 40 minute wait above was due to bug 817762 and should not occur in normal operations. b2g_rebooting -> b2g_pinging should always take 120s (long enough for u-boot to come up and finish, which could otherwise give a false positive for the ping check).
Hm, I don't think I have enough information here to say. What happens when we close the request while it's imaging? I'm concerned with this scenario:
* image a panda
* it times out, we close the request, we retry in buildbot
* the 2nd buildbot job tries to image a panda; repeat until all pandas are in a bad imaging state

Two more points:
* 120s sounds like a VERY short request for a mobile device.
* I'd be willing to wait up to an hour for these things to come back, with the caveat that if we have to wait above a certain amount of time, we should file a bug.
(In reply to Aki Sasaki [:aki] from comment #4)
> * image a panda
> * it times out, we close the request, we retry in buildbot
> * 2nd buildbot job tries to image a panda, repeat til all pandas are in a
> bad imaging state

That's exactly what I want to avoid, and why Mozpool's timeouts should be primary.

> * 120s sounds like a VERY short request for a mobile device.

It's not a general timeout - it's specific to that one state transition, and it's pretty conservatively long, actually. It just exists to make sure lifeguard waits before it starts pinging. Also, this isn't a timeout where we're waiting for some event to happen in less than 120s -- lifeguard *always* waits 120s between the reboot and the first ping. If you'd like to look at this in more detail, have a look at the code (mozpool/lifeguard/devicemachine.py) and at the state machine in its entirety.

> * I'd be willing to wait up to an hour for these things to come back, with
> the caveat that if we have to wait above a certain amount of time, we should
> file a bug.

This sounds OK, and yes, it should shout loudly, since it means Mozpool's fallbacks aren't working or something more serious is wrong. It may also mean that it's time for Mozpool to start pre-imaging devices!

As long as Buildbot is careful to close the request after the 1h timeout, I think this will avoid the first scenario you describe. In the case of a pathologically slow reimage, Buildbot will give up after 1h and close the request. Mozpool won't re-allocate that device immediately, but once it reaches the "ready" state Mozpool will see that it's unassigned and potentially re-deploy it (hopefully without as slow an imaging process). A bug or alert of some sort would be good in that case, just to have a look, but at least it wouldn't eat the entire pool.
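To make the 1h backup timeout concrete, a rough sketch of what the Buildbot/mozharness side could do; the request "state" field, the "/return/" endpoint, and the use of the requests library are assumptions and may not match how mozpool/mozpoolclient actually exposes this:

  import time

  import requests  # assumed; the real scripts may use their own HTTP helpers

  MAX_WAIT = 60 * 60     # 1h backup timeout; mozpool's own timeouts should fire first
  POLL_INTERVAL = 60


  def wait_for_ready_or_close(request_url):
      # Poll the mozpool request until the device becomes ready. If the
      # backup timeout is hit, close the request so the device is handed
      # back to the pool instead of staying tied to a dead job.
      deadline = time.time() + MAX_WAIT
      while time.time() < deadline:
          state = requests.get(request_url).json().get("state")  # assumed field
          if state == "ready":
              return True
          time.sleep(POLL_INTERVAL)
      # Shout loudly: reaching this point means mozpool's fallbacks did not work.
      print("INFRA-ERROR: mozpool request never became ready; closing it")
      requests.post(request_url.rstrip("/") + "/return/")  # assumed endpoint
      return False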
B) "INFO - ERROR: Unable to properly connect to SUT Port on device."
https://tbpl.mozilla.org/php/getParsedLog.php?id=17716620&tree=Cedar&full=1#error0
* we get a device but we fail to connect through the SUT agent:
  http://hg.mozilla.org/build/tools/file/e893acdceea9/sut_tools/check_b2g_application_ini.py#l9
* proposed fix:
** trigger a tbpl retry
** the error code was 1 but the job showed up as green
** ask devicemanagerSUT.py to produce more output [1]
* root cause:
** undetermined
** the SUT agent does not come back up?
** the B2G pandas are unreliable?

[1] http://hg.mozilla.org/build/tools/file/e893acdceea9/sut_tools/mozdevice/devicemanagerSUT.py#l146

What happened with panda-0166?
################################
Job #17
* it got imaged well with http://pvtbuilds.pvt.build.mozilla.org//pub/mozilla.org/b2g/tinderbox-builds/cedar-panda/20121207115323
* it failed to connect to the SUT agent
* the request was closed
* in between jobs it got imaged with Jake's image (through the UI)

Job #20
* it got imaged well with http://pvtbuilds.pvt.build.mozilla.org//pub/mozilla.org/b2g/tinderbox-builds/cedar-panda/20121207144935
* it managed to connect to the SUT agent

Regardless of that, I cannot manually telnet to the board.
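Since manual telnet to the board also fails, a raw socket check against the SUT port could help separate "the agent never came up" from "devicemanagerSUT problem". A minimal sketch; the port (20701, the usual SUT command port), the prompt read, and the hostname in the example are assumptions:

  import socket


  def can_reach_sut(host, port=20701, timeout=30):
      # Returns True if something is listening on the SUT port and sends
      # an initial prompt; this only tells us the agent is up, not healthy.
      try:
          s = socket.create_connection((host, port), timeout=timeout)
      except (socket.error, socket.timeout):
          return False
      try:
          s.settimeout(timeout)
          return bool(s.recv(1024))  # the agent normally greets with a prompt
      except (socket.error, socket.timeout):
          return False
      finally:
          s.close()

  # e.g. can_reach_sut("panda-0166.p1.releng.scl1.mozilla.com")  # hostname illustrative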
Bug 815785 adds SUT checks to the post-image checks in lifeguard (and will reduce the need for the 120s pause I mentioned in comment 3, since u-boot doesn't open the SUT port).
See Also: → 816345
bug 815785 is to make sure that we can come back up with a working SUT agent.
Depends on: 815785
Depends on: 823656
Priority: -- → P2
Let's fix new failures as they appear.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → WORKSFORME
Product: mozilla.org → Release Engineering
Component: General Automation → General