Closed Bug 905616 Opened 12 years ago Closed 12 years ago

Add redis health check to redis01.build.scl1

Categories

(mozilla.org Graveyard :: Server Operations, task)

x86
All
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: nthomas, Assigned: ashish)

Attachments

(3 files, 2 obsolete files)

There's a check on the redis process count, but nothing to check that redis is actually responsive. Please add a health check for responsiveness. If there's nothing already in the nrpe toolbox then let's talk about writing something that telnets in or something.
AMO/SUMO redis have checks that do a TCP check against the corresponding redis port. It's trivial to set that up. Let me know whether that works and I can set that up in a jiffy :)
Assignee: server-ops → ashish
Does that just try to open a socket? Bug 735252 indicates that may not be enough for our case.
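For reference, a plain TCP check amounts to roughly the sketch below. This is not the actual AMO/SUMO Nagios configuration; the function name `tcp_port_open` and the default timeout are hypothetical. It only illustrates the concern: a successful connect shows the port accepts connections, not that redis answers commands.

```python
import socket


def tcp_port_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened.

    Hypothetical helper: a True result says nothing about whether the
    server behind the port actually answers commands.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return False
```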
Blocks: 926246
(In reply to Nick Thomas [:nthomas] from comment #2)
> Does that just try to open a socket ? Bug 735252 indicates that may not be
> enough for our case.

How else can we help here?
A pointer to the code + config for what AMO/SUMO does would be great.
Needinfo'ing Jeremy, Jason and Ricky.
Flags: needinfo?(rrosario)
Flags: needinfo?(oremj)
Flags: needinfo?(jthomas)
And Ashish to see what the SUMO/AMO nagios configs are.
Flags: needinfo?(ashish)
On SUMO, we have the services monitor page: https://support.mozilla.org/services/monitor
For redis, it just connects and calls the EXISTS command, since that is cheap: http://redis.io/commands/exists
That should return 0 or 1 and not blow up.
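A minimal sketch of the "returns 0 or 1 and does not blow up" test, assuming the check reads the raw reply off the wire. Redis sends integer replies as `:<n>\r\n`; the helper names `parse_integer_reply` and `exists_reply_ok` are hypothetical and are not taken from the SUMO code.

```python
def parse_integer_reply(raw):
    """Parse a Redis integer reply such as b':1\\r\\n' and return the int.

    Raises ValueError for anything else (for example an error reply
    starting with '-'), so the caller can report the check as failed.
    """
    if not raw.startswith(b":") or not raw.endswith(b"\r\n"):
        raise ValueError("unexpected reply: %r" % raw)
    return int(raw[1:-2])


def exists_reply_ok(raw):
    """EXISTS on a single key should come back as 0 or 1."""
    try:
        return parse_integer_reply(raw) in (0, 1)
    except ValueError:
        return False
```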
Flags: needinfo?(rrosario)
Ok. We don't already have a predictable key to call EXISTS on, so I suggest
* do a 'SET nagios:<timestamp> "nagios woz here" EX 1', then do an EXISTS on that
* use PING, look for PONG response
Here's the zamboni redis check: https://github.com/mozilla/zamboni/blob/master/apps/amo/monitors.py#L153
It just runs the info command and returns OK if it succeeds.
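The PING option can be sketched with nothing but a socket, assuming the server speaks the standard Redis request protocol (an array of bulk strings). The names `encode_command` and `ping` are hypothetical; this is an illustration, not the script attached to this bug.

```python
import socket


def encode_command(*args):
    """Encode a command as a Redis protocol array of bulk strings."""
    out = [b"*" + str(len(args)).encode("ascii") + b"\r\n"]
    for arg in args:
        data = str(arg).encode("utf-8")
        out.append(b"$" + str(len(data)).encode("ascii") + b"\r\n")
        out.append(data + b"\r\n")
    return b"".join(out)


def ping(host, port, timeout=5.0):
    """Send PING and return True only if the reply is +PONG."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(encode_command("PING"))
        reply = sock.recv(64)
    return reply == b"+PONG\r\n"
```

The same `encode_command` helper would also cover the SET-then-EXISTS option, since it accepts any number of arguments.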
Flags: needinfo?(oremj)
Flags: needinfo?(jthomas)
Hal, Nick: It seems we don't have a ready-made script for this. The AMO one is part of the monitor webapp that they use for their checks. If someone can whip up a script, we'd be happy to hook it up to nagios. CC'ing Rob to see if this is something he can pick up; I'm not sure he has the time.
Flags: needinfo?(nthomas)
Flags: needinfo?(hwine)
Flags: needinfo?(ashish)
I should be able to whip this up soon. Can someone point me at a dev redis instance that I can use for testing?
I set up a local instance of redis to play around. I wrote a check script that should be configurable to do the simple checks we need. Attaching it now.
Attached file test_redis.py (obsolete) —
Simple linear redis check script.
Attachment #8338808 - Attachment mime type: text/x-python-script → text/plain
Attached file test_redis.py
Updated with proper exit(0) and output text.
Attachment #8338808 - Attachment is obsolete: true
Rob - cool. Thanks!
Nick - does this look like it'll do the job?
Shyam, is this to be an IT plugin, or releng only? If the latter, we'll drop the script in http://hg.mozilla.org/build/nagios-tools/ to get it wrapped for NRPE.
Flags: needinfo?(hwine) → needinfo?(shyam)
Hal, I'll let Ashish decide. I think we can use it in other places too. I don't see why it can't be shared...
Flags: needinfo?(shyam)
This ran fine against redis01.build.mozilla.org.

>     'statement': "set nagios:%s foo" % this_second,
>     'response' : 'OK'

It would be good to set an expiry on this, given the cleanup doesn't get run if EXISTS fails, e.g.
    "SETEX nagios:%s 60 foo" % this_second,
for a 60 second expiry. We can't use the SET ... EX form because our redis doesn't have support for it.
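A sketch of what the statement list could look like with the expiry applied. Only the SETEX statement and its 'OK' response come from the comment above; the function name `build_statements`, the tuple layout, the expected strings for EXISTS and DEL, and the DEL cleanup step itself are assumptions for illustration.

```python
import time


def build_statements(now=None):
    """Return (statement, expected_response) pairs for one check run.

    The key embeds the current second, and SETEX gives it a 60 second
    expiry so it disappears even if the later steps never run.
    """
    this_second = int(time.time() if now is None else now)
    key = "nagios:%s" % this_second
    return [
        ("SETEX %s 60 foo" % key, "OK"),  # write with a 60 second expiry
        ("EXISTS %s" % key, "1"),         # the key should now be present
        ("DEL %s" % key, "1"),            # assumed cleanup step
    ]
```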
Flags: needinfo?(nthomas)
(In reply to Shyam Mani [:fox2mike] from comment #16)
> Hal, I'll let Ashish decide. I think we can use it in other places too. I
> don't see why it can't be shared...

I would have this script shared so that it can be used for other redis instances as well.
Here is a new version of the check script that allows a sleep interval to be set so that we can confirm that keys have expired.
Attached file test_redis_with_sleep_and_optparse.py (obsolete) —
Added optparse to pass in host and port via -H and -P respectively
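For illustration, option handling along these lines could look like the sketch below. It uses argparse rather than the optparse module the attachment uses, and only -H and -P come from this bug; the long option names, the defaults, and the `-s/--sleep` spelling for the sleep interval are assumptions.

```python
import argparse


def parse_args(argv=None):
    """Parse host, port and sleep interval for the redis check."""
    parser = argparse.ArgumentParser(description="Check that redis responds.")
    parser.add_argument("-H", "--host", default="localhost",
                        help="redis host to check")
    parser.add_argument("-P", "--port", type=int, default=6379,
                        help="redis port (6379 is the redis default)")
    # Hypothetical flag: seconds to wait before confirming a key expired.
    parser.add_argument("-s", "--sleep", type=float, default=0.0,
                        help="seconds to sleep before checking expiry")
    return parser.parse_args(argv)
```

With these assumed names, a run against a hypothetical host would be invoked as `test_redis.py -H redis.example.com -P 6379 -s 2`.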
:nthomas Can you verify the script in Comment 20? If this looks good, I shall import it into NRPE/Nagios. Thanks!
Flags: needinfo?(nthomas)
It works fine. I would suggest these though:
* making the Debug variable default to off and have a -v argument to swap that
* s/set/SET/g in the statements definitions
* for debugging, when the output doesn't match the expected response print out the actual response
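The first and third suggestions can be sketched as follows. The names `build_parser` and `report` and the message wording are hypothetical, and argparse is used here as a stand-in for the attachment's optparse.

```python
import argparse


def build_parser():
    """Debug output is off by default; -v switches it on."""
    parser = argparse.ArgumentParser()
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="print debugging output")
    return parser


def report(statement, expected, actual, verbose=False):
    """Return (ok, message) for one statement.

    On a mismatch the actual response is only included in verbose mode.
    """
    if actual == expected:
        return True, ""
    message = "unexpected response to %r" % statement
    if verbose:
        message += ": expected %r, got %r" % (expected, actual)
    return False, message
```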
Flags: needinfo?(nthomas)
Added requested features from nthomas
Attachment #8339283 - Attachment is obsolete: true
Works fine against redis01.build.mozilla.org. All set to go ahead with installing and using this?
Installed this: https://nagios.mozilla.org/releng-scl3/cgi-bin/extinfo.cgi?type=2&host=redis01.build.scl1.mozilla.com&service=redis

As a last thought, could the plugin have a timeout, since it isn't run via NRPE? Nagios' timeout is much longer (close to 180s) and it would be nice to have "-t <timeout seconds>" in the plugin itself.
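One way such a timeout could be wired in, as a sketch: pass the "-t" value down to the socket calls and map a timeout to a CRITICAL result. The exit codes are the standard Nagios plugin ones; the function name `run_check`, the callable interface, and the message texts are assumptions, and parsing of "-t" itself is not shown.

```python
import socket

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def run_check(check, timeout):
    """Run check(timeout) and map the outcome to (exit_code, message).

    `check` is any callable that takes the timeout in seconds, applies
    it to its socket operations, and returns True when redis is healthy.
    """
    try:
        healthy = check(timeout)
    except socket.timeout:
        return CRITICAL, "CRITICAL: no response within %s seconds" % timeout
    except OSError as exc:
        return CRITICAL, "CRITICAL: %s" % exc
    if healthy:
        return OK, "OK: redis responded"
    return CRITICAL, "CRITICAL: unexpected response"
```

A caller would print the message and pass the code to `sys.exit()`, so that Nagios sees a result well before its own, much longer, timeout.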
Status: NEW → ASSIGNED
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard