Closed Bug 959929 Opened 11 years ago Closed 9 years ago

Android verify.py requests to mozpool should retry on some request responses

Categories

(Release Engineering :: General, defect)

x86_64
Windows 7
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: Callek, Assigned: Callek)

References

Details

So, in Bug 913870 I noticed at least one result (in c#258) that was an expected state during periodic device checking.  Of course on thinking about it this state appeared *after* we had a successful device request.

:dividehex, can you first try and enumerate all "things are going well" states that happen automatically for me?

Second can you identify if there is a problem in our collective logic here?

first, lifeguard log:
2014-01-14T17:42:50 statemachine entering state sut_verifying
2014-01-14T17:42:50 sut connecting to SUT agent
2014-01-14T17:43:39 sut connecting to SUT agent
2014-01-14T17:44:29 sut connecting to SUT agent
2014-01-14T17:45:19 sut connecting to SUT agent
2014-01-14T17:46:09 sut connecting to SUT agent
2014-01-14T17:46:09 statemachine entering state sut_sdcard_verifying
2014-01-14T17:46:10 sut verifying SD card
2014-01-14T17:46:10 statemachine entering state operation_complete
2014-01-14T17:46:11 statemachine sending imaging result 'complete' to Mozpool
2014-01-14T17:46:11 statemachine entering state ready
2014-01-14T17:56:20 sut connecting to SUT agent
2014-01-14T18:06:30 sut connecting to SUT agent

next job log:
17:39:14     INFO - retry: Calling <built-in function remove> with args: ('tmpfile_stdout',), kwargs: {}, attempt #1
17:39:14     INFO - Got request, url=http://mobile-imaging-007.p7.releng.scl1.mozilla.com/api/request/787764/
17:39:14     INFO - Waiting for request 'ready' stage.  Current state: 'contacting_lifeguard'
17:40:14     INFO - Waiting for request 'ready' stage.  Current state: 'pending'
17:41:14     INFO - Waiting for request 'ready' stage.  Current state: 'pending'
17:42:15     INFO - Waiting for request 'ready' stage.  Current state: 'pending'
17:43:15     INFO - Waiting for request 'ready' stage.  Current state: 'pending'
17:44:15     INFO - Waiting for request 'ready' stage.  Current state: 'pending'
17:45:15     INFO - Waiting for request 'ready' stage.  Current state: 'pending'
17:46:15     INFO - Running command: ['/tools/buildbot/bin/python', '/builds/sut_tools/verify.py']
17:46:15     INFO - Copy/paste: /tools/buildbot/bin/python /builds/sut_tools/verify.py
17:46:15     INFO - Using env: {'HOME': '/home/cltbld',
....
17:46:15     INFO -  '_': '/tools/buildbot/bin/python2.7'}
17:46:15     INFO -  01/14/2014 17:46:15: DEBUG: updateSUT: Using device 'panda-0657' found in env variable
17:46:15     INFO -  01/14/2014 17:46:15: INFO: INFO: Using device 'panda-0657' found in env variable
17:46:15     INFO -  01/14/2014 17:46:15: DEBUG: calling [nslookup panda-0657]
17:46:15     INFO -  01/14/2014 17:46:15: DEBUG: calling [ps -x -U cltbld]
...
17:46:15     INFO -  01/14/2014 17:46:15: DEBUG: Request GET http://mobile-imaging/api/device/panda-0657/state/...
17:46:15     INFO -  01/14/2014 17:46:15: INFO: Starting new HTTP connection (1): mobile-imaging
17:46:15     INFO -  01/14/2014 17:46:15: DEBUG: "GET /api/device/panda-0657/state/ HTTP/1.1" 200 33
17:46:15     INFO -  01/14/2014 17:46:15: DEBUG: Status code: 200
17:46:15     INFO -  01/14/2014 17:46:15: ERROR: Mozpool state is 'sut_sdcard_verifying'
17:46:15     INFO -  01/14/2014 17:46:15: INFO: Mozpool knows about device, but claims we're not safe to continue
17:46:15    ERROR - Return code: 1
17:46:15 CRITICAL - Preparing to abort run due to failed verify check.
17:46:15     INFO - Request 'http://mobile-imaging-007.p7.releng.scl1.mozilla.com/api/request/787764/' deleted on cleanup
17:46:15    FATAL - Dieing due to failing verification
17:46:15    FATAL - Running post_fatal callback...
Flags: needinfo?(jwatkins)
Blocks: 913870
302 to :dustin
Flags: needinfo?(jwatkins)
Blocks: 962161
The important two lines are

2014-01-14T17:46:09 statemachine entering state sut_sdcard_verifying
2014-01-14T17:46:11 statemachine entering state ready

and

17:46:15     INFO -  01/14/2014 17:46:15: ERROR: Mozpool state is 'sut_sdcard_verifying'

and this is caused by caching of states within mozpool with a TTL of 5 seconds.  We added this caching when we started monitoring states with nagios, since nagios will check each device separately, and frequently.  Callek will file another bug to see if we can make this method return immediate data instead.

As for enumerating "OK" states: based on API.txt, there are four defined states, plus the failed_* states.  I would recommend treating these states specifically, and treating anything else as unknown, with a retry after a reasonable time.  Checking the device's state outside of a request is mucking about with mozpool's internals, so I think the right approach is to just wait patiently when you see something unknown.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → WONTFIX
Component: General Automation → General
You need to log in before you can comment on or make changes to this bug.