Closed
Bug 962161
Opened 11 years ago
Closed 11 years ago
Device state (for individual devices) should not be cached
Categories
(Testing Graveyard :: Mozpool, defect)
Testing Graveyard
Mozpool
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: Callek, Assigned: dustin)
References
Details
Attachments
(1 file)
4.60 KB,
patch
|
dividehex
:
review+
|
Details | Diff | Splinter Review |
So in discussing Bug 959929 with :dustin it was decided to split out the caching issue to its own bug (I'm leaving initial c#0 intact below)
Basically device state has a 5 second cache, which *might* be related to the fact that "all devices" return of state has a cache as well (the latter is important, the former likely not)
:dustin promised to look into the former and if the 5 second cache on individual devices is indeed needed, we (releng) can add a 6 second sleep around retries (and successful results) as well, for added sanity.
+++ This bug was initially created as a clone of Bug #959929 +++
So, in Bug 913870 I noticed at least one result (in c#258) that was an expected state during periodic device checking. Of course on thinking about it this state appeared *after* we had a successful device request.
:dividehex, can you first try and enumerate all "things are going well" states that happen automatically for me?
Second can you identify if there is a problem in our collective logic here?
first, lifeguard log:
2014-01-14T17:42:50 statemachine entering state sut_verifying
2014-01-14T17:42:50 sut connecting to SUT agent
2014-01-14T17:43:39 sut connecting to SUT agent
2014-01-14T17:44:29 sut connecting to SUT agent
2014-01-14T17:45:19 sut connecting to SUT agent
2014-01-14T17:46:09 sut connecting to SUT agent
2014-01-14T17:46:09 statemachine entering state sut_sdcard_verifying
2014-01-14T17:46:10 sut verifying SD card
2014-01-14T17:46:10 statemachine entering state operation_complete
2014-01-14T17:46:11 statemachine sending imaging result 'complete' to Mozpool
2014-01-14T17:46:11 statemachine entering state ready
2014-01-14T17:56:20 sut connecting to SUT agent
2014-01-14T18:06:30 sut connecting to SUT agent
next job log:
17:39:14 INFO - retry: Calling <built-in function remove> with args: ('tmpfile_stdout',), kwargs: {}, attempt #1
17:39:14 INFO - Got request, url=http://mobile-imaging-007.p7.releng.scl1.mozilla.com/api/request/787764/
17:39:14 INFO - Waiting for request 'ready' stage. Current state: 'contacting_lifeguard'
17:40:14 INFO - Waiting for request 'ready' stage. Current state: 'pending'
17:41:14 INFO - Waiting for request 'ready' stage. Current state: 'pending'
17:42:15 INFO - Waiting for request 'ready' stage. Current state: 'pending'
17:43:15 INFO - Waiting for request 'ready' stage. Current state: 'pending'
17:44:15 INFO - Waiting for request 'ready' stage. Current state: 'pending'
17:45:15 INFO - Waiting for request 'ready' stage. Current state: 'pending'
17:46:15 INFO - Running command: ['/tools/buildbot/bin/python', '/builds/sut_tools/verify.py']
17:46:15 INFO - Copy/paste: /tools/buildbot/bin/python /builds/sut_tools/verify.py
17:46:15 INFO - Using env: {'HOME': '/home/cltbld',
....
17:46:15 INFO - '_': '/tools/buildbot/bin/python2.7'}
17:46:15 INFO - 01/14/2014 17:46:15: DEBUG: updateSUT: Using device 'panda-0657' found in env variable
17:46:15 INFO - 01/14/2014 17:46:15: INFO: INFO: Using device 'panda-0657' found in env variable
17:46:15 INFO - 01/14/2014 17:46:15: DEBUG: calling [nslookup panda-0657]
17:46:15 INFO - 01/14/2014 17:46:15: DEBUG: calling [ps -x -U cltbld]
...
17:46:15 INFO - 01/14/2014 17:46:15: DEBUG: Request GET http://mobile-imaging/api/device/panda-0657/state/...
17:46:15 INFO - 01/14/2014 17:46:15: INFO: Starting new HTTP connection (1): mobile-imaging
17:46:15 INFO - 01/14/2014 17:46:15: DEBUG: "GET /api/device/panda-0657/state/ HTTP/1.1" 200 33
17:46:15 INFO - 01/14/2014 17:46:15: DEBUG: Status code: 200
17:46:15 INFO - 01/14/2014 17:46:15: ERROR: Mozpool state is 'sut_sdcard_verifying'
17:46:15 INFO - 01/14/2014 17:46:15: INFO: Mozpool knows about device, but claims we're not safe to continue
17:46:15 ERROR - Return code: 1
17:46:15 CRITICAL - Preparing to abort run due to failed verify check.
17:46:15 INFO - Request 'http://mobile-imaging-007.p7.releng.scl1.mozilla.com/api/request/787764/' deleted on cleanup
17:46:15 FATAL - Dieing due to failing verification
17:46:15 FATAL - Running post_fatal callback...
Assignee | ||
Comment 1•11 years ago
|
||
I looked a bit harder while you were filing this. Nagios uses the /state endpoint to get the state for each device. We implement that by fetching the state of *all* devices, and caching that information for 5s.
Probably the best fix here is to have nagios pass a URL argument indicating that it doesn't need up-to-the-moment data (?cached=ok), and by default return non-cached data.
Assignee | ||
Comment 2•11 years ago
|
||
And to be fair, this *is* documented:
/api/device/{id}/state/
* GET returns a JSON response similar to `/api/device/{id}/status/`, but
without the `logs` key. This value may be slightly (on the order of
seconds) out of date due to caching. It is intended for use by monitoring
tools like Nagios.
Assignee | ||
Comment 3•11 years ago
|
||
I just modified nagios to check with ?cache=1, so that will be in place by the time this fix lands.
Assignee | ||
Comment 4•11 years ago
|
||
Attachment #8363141 -
Flags: review?(jwatkins)
Comment 5•11 years ago
|
||
Comment on attachment 8363141 [details] [diff] [review]
bug962161.patch
I like this solution. Don't forget to update nagios when we push the next upgrade. :-)
Attachment #8363141 -
Flags: review?(jwatkins) → review+
Assignee | ||
Comment 6•11 years ago
|
||
Nagios is already updated, but thanks :)
Assignee | ||
Updated•11 years ago
|
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Updated•9 years ago
|
Product: Testing → Testing Graveyard
You need to log in
before you can comment on or make changes to this bug.
Description
•