Closed Bug 962161 Opened 11 years ago Closed 11 years ago

Device state (for individual devices) should not be cached

Tracking

(Not tracked)

Status:

RESOLVED FIXED

People

(Reporter: Callek, Assigned: dustin)

References

Details

Attachments

(1 file)

bug962161.patch 11 years ago Dustin J. Mitchell [:dustin] (he/him) 4.60 KB, patch	dividehex : review+	Details \| Diff \| Splinter Review

Justin Wood (:Callek)

Reporter

Description

•

11 years ago

So in discussing Bug 959929 with :dustin it was decided to split out the caching issue to its own bug (I'm leaving initial c#0 intact below) Basically device state has a 5 second cache, which *might* be related to the fact that "all devices" return of state has a cache as well (the latter is important, the former likely not) :dustin promised to look into the former and if the 5 second cache on individual devices is indeed needed, we (releng) can add a 6 second sleep around retries (and successful results) as well, for added sanity. +++ This bug was initially created as a clone of Bug #959929 +++ So, in Bug 913870 I noticed at least one result (in c#258) that was an expected state during periodic device checking. Of course on thinking about it this state appeared *after* we had a successful device request. :dividehex, can you first try and enumerate all "things are going well" states that happen automatically for me? Second can you identify if there is a problem in our collective logic here? first, lifeguard log: 2014-01-14T17:42:50 statemachine entering state sut_verifying 2014-01-14T17:42:50 sut connecting to SUT agent 2014-01-14T17:43:39 sut connecting to SUT agent 2014-01-14T17:44:29 sut connecting to SUT agent 2014-01-14T17:45:19 sut connecting to SUT agent 2014-01-14T17:46:09 sut connecting to SUT agent 2014-01-14T17:46:09 statemachine entering state sut_sdcard_verifying 2014-01-14T17:46:10 sut verifying SD card 2014-01-14T17:46:10 statemachine entering state operation_complete 2014-01-14T17:46:11 statemachine sending imaging result 'complete' to Mozpool 2014-01-14T17:46:11 statemachine entering state ready 2014-01-14T17:56:20 sut connecting to SUT agent 2014-01-14T18:06:30 sut connecting to SUT agent next job log: 17:39:14 INFO - retry: Calling <built-in function remove> with args: ('tmpfile_stdout',), kwargs: {}, attempt #1 17:39:14 INFO - Got request, url=http://mobile-imaging-007.p7.releng.scl1.mozilla.com/api/request/787764/ 17:39:14 INFO - Waiting for request 'ready' stage. Current state: 'contacting_lifeguard' 17:40:14 INFO - Waiting for request 'ready' stage. Current state: 'pending' 17:41:14 INFO - Waiting for request 'ready' stage. Current state: 'pending' 17:42:15 INFO - Waiting for request 'ready' stage. Current state: 'pending' 17:43:15 INFO - Waiting for request 'ready' stage. Current state: 'pending' 17:44:15 INFO - Waiting for request 'ready' stage. Current state: 'pending' 17:45:15 INFO - Waiting for request 'ready' stage. Current state: 'pending' 17:46:15 INFO - Running command: ['/tools/buildbot/bin/python', '/builds/sut_tools/verify.py'] 17:46:15 INFO - Copy/paste: /tools/buildbot/bin/python /builds/sut_tools/verify.py 17:46:15 INFO - Using env: {'HOME': '/home/cltbld', .... 17:46:15 INFO - '_': '/tools/buildbot/bin/python2.7'} 17:46:15 INFO - 01/14/2014 17:46:15: DEBUG: updateSUT: Using device 'panda-0657' found in env variable 17:46:15 INFO - 01/14/2014 17:46:15: INFO: INFO: Using device 'panda-0657' found in env variable 17:46:15 INFO - 01/14/2014 17:46:15: DEBUG: calling [nslookup panda-0657] 17:46:15 INFO - 01/14/2014 17:46:15: DEBUG: calling [ps -x -U cltbld] ... 17:46:15 INFO - 01/14/2014 17:46:15: DEBUG: Request GET http://mobile-imaging/api/device/panda-0657/state/... 17:46:15 INFO - 01/14/2014 17:46:15: INFO: Starting new HTTP connection (1): mobile-imaging 17:46:15 INFO - 01/14/2014 17:46:15: DEBUG: "GET /api/device/panda-0657/state/ HTTP/1.1" 200 33 17:46:15 INFO - 01/14/2014 17:46:15: DEBUG: Status code: 200 17:46:15 INFO - 01/14/2014 17:46:15: ERROR: Mozpool state is 'sut_sdcard_verifying' 17:46:15 INFO - 01/14/2014 17:46:15: INFO: Mozpool knows about device, but claims we're not safe to continue 17:46:15 ERROR - Return code: 1 17:46:15 CRITICAL - Preparing to abort run due to failed verify check. 17:46:15 INFO - Request 'http://mobile-imaging-007.p7.releng.scl1.mozilla.com/api/request/787764/' deleted on cleanup 17:46:15 FATAL - Dieing due to failing verification 17:46:15 FATAL - Running post_fatal callback...

Dustin J. Mitchell [:dustin] (he/him)

Assignee

Comment 1

•

11 years ago

I looked a bit harder while you were filing this. Nagios uses the /state endpoint to get the state for each device. We implement that by fetching the state of *all* devices, and caching that information for 5s. Probably the best fix here is to have nagios pass a URL argument indicating that it doesn't need up-to-the-moment data (?cached=ok), and by default return non-cached data.

Dustin J. Mitchell [:dustin] (he/him)

Assignee

Comment 2

•

11 years ago

And to be fair, this *is* documented: /api/device/{id}/state/ * GET returns a JSON response similar to `/api/device/{id}/status/`, but without the `logs` key. This value may be slightly (on the order of seconds) out of date due to caching. It is intended for use by monitoring tools like Nagios.

Dustin J. Mitchell [:dustin] (he/him)

Assignee

Comment 3

•

11 years ago

I just modified nagios to check with ?cache=1, so that will be in place by the time this fix lands.

Dustin J. Mitchell [:dustin] (he/him)

Assignee

Comment 4

•

11 years ago

Attached patch bug962161.patch — Details — Splinter Review

Attachment #8363141 - Flags: review?(jwatkins)

Jake Watkins [:dividehex]

Comment 5

•

11 years ago

Comment on attachment 8363141 [details] [diff] [review] bug962161.patch I like this solution. Don't forget to update nagios when we push the next upgrade. :-)

Attachment #8363141 - Flags: review?(jwatkins) → review+

Dustin J. Mitchell [:dustin] (he/him)

Assignee

Comment 6

•

11 years ago

Nagios is already updated, but thanks :)

Dustin J. Mitchell [:dustin] (he/him)

Assignee

Updated

•

11 years ago

Status: NEW → RESOLVED

Closed: 11 years ago

Resolution: --- → FIXED

Nobody; OK to take it and work on it

Updated

•

9 years ago

Product: Testing → Testing Graveyard

You need to log in before you can comment on or make changes to this bug.

Bugzilla

Device state (for individual devices) should not be cached

Categories

(Testing Graveyard :: Mozpool, defect)

Tracking

(Not tracked)

People

(Reporter: Callek, Assigned: dustin)

References

Details

Crash Data

Security

(public)

User Story

Attachments

(1 file)

Description

Comment 1

Comment 2

Comment 3

Comment 4

Comment 5

Comment 6

Updated

Updated

Attachment

General

Description

File Name

Content Type