Closed Bug 962161 Opened 11 years ago Closed 11 years ago

Device state (for individual devices) should not be cached

Categories

(Testing Graveyard :: Mozpool, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: Callek, Assigned: dustin)

References

Details

Attachments

(1 file)

So in discussing Bug 959929 with :dustin it was decided to split out the caching issue to its own bug (I'm leaving initial c#0 intact below) Basically device state has a 5 second cache, which *might* be related to the fact that "all devices" return of state has a cache as well (the latter is important, the former likely not) :dustin promised to look into the former and if the 5 second cache on individual devices is indeed needed, we (releng) can add a 6 second sleep around retries (and successful results) as well, for added sanity. +++ This bug was initially created as a clone of Bug #959929 +++ So, in Bug 913870 I noticed at least one result (in c#258) that was an expected state during periodic device checking. Of course on thinking about it this state appeared *after* we had a successful device request. :dividehex, can you first try and enumerate all "things are going well" states that happen automatically for me? Second can you identify if there is a problem in our collective logic here? first, lifeguard log: 2014-01-14T17:42:50 statemachine entering state sut_verifying 2014-01-14T17:42:50 sut connecting to SUT agent 2014-01-14T17:43:39 sut connecting to SUT agent 2014-01-14T17:44:29 sut connecting to SUT agent 2014-01-14T17:45:19 sut connecting to SUT agent 2014-01-14T17:46:09 sut connecting to SUT agent 2014-01-14T17:46:09 statemachine entering state sut_sdcard_verifying 2014-01-14T17:46:10 sut verifying SD card 2014-01-14T17:46:10 statemachine entering state operation_complete 2014-01-14T17:46:11 statemachine sending imaging result 'complete' to Mozpool 2014-01-14T17:46:11 statemachine entering state ready 2014-01-14T17:56:20 sut connecting to SUT agent 2014-01-14T18:06:30 sut connecting to SUT agent next job log: 17:39:14 INFO - retry: Calling <built-in function remove> with args: ('tmpfile_stdout',), kwargs: {}, attempt #1 17:39:14 INFO - Got request, url=http://mobile-imaging-007.p7.releng.scl1.mozilla.com/api/request/787764/ 17:39:14 INFO - Waiting for request 'ready' stage. Current state: 'contacting_lifeguard' 17:40:14 INFO - Waiting for request 'ready' stage. Current state: 'pending' 17:41:14 INFO - Waiting for request 'ready' stage. Current state: 'pending' 17:42:15 INFO - Waiting for request 'ready' stage. Current state: 'pending' 17:43:15 INFO - Waiting for request 'ready' stage. Current state: 'pending' 17:44:15 INFO - Waiting for request 'ready' stage. Current state: 'pending' 17:45:15 INFO - Waiting for request 'ready' stage. Current state: 'pending' 17:46:15 INFO - Running command: ['/tools/buildbot/bin/python', '/builds/sut_tools/verify.py'] 17:46:15 INFO - Copy/paste: /tools/buildbot/bin/python /builds/sut_tools/verify.py 17:46:15 INFO - Using env: {'HOME': '/home/cltbld', .... 17:46:15 INFO - '_': '/tools/buildbot/bin/python2.7'} 17:46:15 INFO - 01/14/2014 17:46:15: DEBUG: updateSUT: Using device 'panda-0657' found in env variable 17:46:15 INFO - 01/14/2014 17:46:15: INFO: INFO: Using device 'panda-0657' found in env variable 17:46:15 INFO - 01/14/2014 17:46:15: DEBUG: calling [nslookup panda-0657] 17:46:15 INFO - 01/14/2014 17:46:15: DEBUG: calling [ps -x -U cltbld] ... 17:46:15 INFO - 01/14/2014 17:46:15: DEBUG: Request GET http://mobile-imaging/api/device/panda-0657/state/... 17:46:15 INFO - 01/14/2014 17:46:15: INFO: Starting new HTTP connection (1): mobile-imaging 17:46:15 INFO - 01/14/2014 17:46:15: DEBUG: "GET /api/device/panda-0657/state/ HTTP/1.1" 200 33 17:46:15 INFO - 01/14/2014 17:46:15: DEBUG: Status code: 200 17:46:15 INFO - 01/14/2014 17:46:15: ERROR: Mozpool state is 'sut_sdcard_verifying' 17:46:15 INFO - 01/14/2014 17:46:15: INFO: Mozpool knows about device, but claims we're not safe to continue 17:46:15 ERROR - Return code: 1 17:46:15 CRITICAL - Preparing to abort run due to failed verify check. 17:46:15 INFO - Request 'http://mobile-imaging-007.p7.releng.scl1.mozilla.com/api/request/787764/' deleted on cleanup 17:46:15 FATAL - Dieing due to failing verification 17:46:15 FATAL - Running post_fatal callback...
I looked a bit harder while you were filing this. Nagios uses the /state endpoint to get the state for each device. We implement that by fetching the state of *all* devices, and caching that information for 5s. Probably the best fix here is to have nagios pass a URL argument indicating that it doesn't need up-to-the-moment data (?cached=ok), and by default return non-cached data.
And to be fair, this *is* documented: /api/device/{id}/state/ * GET returns a JSON response similar to `/api/device/{id}/status/`, but without the `logs` key. This value may be slightly (on the order of seconds) out of date due to caching. It is intended for use by monitoring tools like Nagios.
I just modified nagios to check with ?cache=1, so that will be in place by the time this fix lands.
Attached patch bug962161.patchSplinter Review
Attachment #8363141 - Flags: review?(jwatkins)
Comment on attachment 8363141 [details] [diff] [review] bug962161.patch I like this solution. Don't forget to update nagios when we push the next upgrade. :-)
Attachment #8363141 - Flags: review?(jwatkins) → review+
Nagios is already updated, but thanks :)
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Product: Testing → Testing Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: