Closed
Bug 959929
Opened 11 years ago
Closed 9 years ago
Android verify.py requests to mozpool should retry on some request responses
Categories
(Release Engineering :: General, defect)
Tracking
(Not tracked)
RESOLVED
WONTFIX
People
(Reporter: Callek, Assigned: Callek)
References
Details
So, in Bug 913870 I noticed at least one result (in c#258) that was an expected state during periodic device checking. Of course on thinking about it this state appeared *after* we had a successful device request. :dividehex, can you first try and enumerate all "things are going well" states that happen automatically for me? Second can you identify if there is a problem in our collective logic here? first, lifeguard log: 2014-01-14T17:42:50 statemachine entering state sut_verifying 2014-01-14T17:42:50 sut connecting to SUT agent 2014-01-14T17:43:39 sut connecting to SUT agent 2014-01-14T17:44:29 sut connecting to SUT agent 2014-01-14T17:45:19 sut connecting to SUT agent 2014-01-14T17:46:09 sut connecting to SUT agent 2014-01-14T17:46:09 statemachine entering state sut_sdcard_verifying 2014-01-14T17:46:10 sut verifying SD card 2014-01-14T17:46:10 statemachine entering state operation_complete 2014-01-14T17:46:11 statemachine sending imaging result 'complete' to Mozpool 2014-01-14T17:46:11 statemachine entering state ready 2014-01-14T17:56:20 sut connecting to SUT agent 2014-01-14T18:06:30 sut connecting to SUT agent next job log: 17:39:14 INFO - retry: Calling <built-in function remove> with args: ('tmpfile_stdout',), kwargs: {}, attempt #1 17:39:14 INFO - Got request, url=http://mobile-imaging-007.p7.releng.scl1.mozilla.com/api/request/787764/ 17:39:14 INFO - Waiting for request 'ready' stage. Current state: 'contacting_lifeguard' 17:40:14 INFO - Waiting for request 'ready' stage. Current state: 'pending' 17:41:14 INFO - Waiting for request 'ready' stage. Current state: 'pending' 17:42:15 INFO - Waiting for request 'ready' stage. Current state: 'pending' 17:43:15 INFO - Waiting for request 'ready' stage. Current state: 'pending' 17:44:15 INFO - Waiting for request 'ready' stage. Current state: 'pending' 17:45:15 INFO - Waiting for request 'ready' stage. Current state: 'pending' 17:46:15 INFO - Running command: ['/tools/buildbot/bin/python', '/builds/sut_tools/verify.py'] 17:46:15 INFO - Copy/paste: /tools/buildbot/bin/python /builds/sut_tools/verify.py 17:46:15 INFO - Using env: {'HOME': '/home/cltbld', .... 17:46:15 INFO - '_': '/tools/buildbot/bin/python2.7'} 17:46:15 INFO - 01/14/2014 17:46:15: DEBUG: updateSUT: Using device 'panda-0657' found in env variable 17:46:15 INFO - 01/14/2014 17:46:15: INFO: INFO: Using device 'panda-0657' found in env variable 17:46:15 INFO - 01/14/2014 17:46:15: DEBUG: calling [nslookup panda-0657] 17:46:15 INFO - 01/14/2014 17:46:15: DEBUG: calling [ps -x -U cltbld] ... 17:46:15 INFO - 01/14/2014 17:46:15: DEBUG: Request GET http://mobile-imaging/api/device/panda-0657/state/... 17:46:15 INFO - 01/14/2014 17:46:15: INFO: Starting new HTTP connection (1): mobile-imaging 17:46:15 INFO - 01/14/2014 17:46:15: DEBUG: "GET /api/device/panda-0657/state/ HTTP/1.1" 200 33 17:46:15 INFO - 01/14/2014 17:46:15: DEBUG: Status code: 200 17:46:15 INFO - 01/14/2014 17:46:15: ERROR: Mozpool state is 'sut_sdcard_verifying' 17:46:15 INFO - 01/14/2014 17:46:15: INFO: Mozpool knows about device, but claims we're not safe to continue 17:46:15 ERROR - Return code: 1 17:46:15 CRITICAL - Preparing to abort run due to failed verify check. 17:46:15 INFO - Request 'http://mobile-imaging-007.p7.releng.scl1.mozilla.com/api/request/787764/' deleted on cleanup 17:46:15 FATAL - Dieing due to failing verification 17:46:15 FATAL - Running post_fatal callback...
Flags: needinfo?(jwatkins)
Comment 2•11 years ago
|
||
The important two lines are 2014-01-14T17:46:09 statemachine entering state sut_sdcard_verifying 2014-01-14T17:46:11 statemachine entering state ready and 17:46:15 INFO - 01/14/2014 17:46:15: ERROR: Mozpool state is 'sut_sdcard_verifying' and this is caused by caching of states within mozpool with a TTL of 5 seconds. We added this caching when we started monitoring states with nagios, since nagios will check each device separately, and frequently. Callek will file another bug to see if we can make this method return immediate data instead. As for enumerating "OK" states: based on API.txt, there are four defined states, plus the failed_* states. I would recommend treating these states specifically, and treating anything else as unknown, with a retry after a reasonable time. Checking the device's state outside of a request is mucking about with mozpool's internals, so I think the right approach is to just wait patiently when you see something unknown.
Updated•9 years ago
|
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → WONTFIX
Updated•6 years ago
|
Component: General Automation → General
You need to log in
before you can comment on or make changes to this bug.
Description
•