Closed Bug 905616 Opened 6 years ago Closed 6 years ago

Add redis health check to


( Graveyard :: Server Operations, task)




(Reporter: nthomas, Assigned: ashish)




(3 files, 2 obsolete files)

There's a check on the redis process count, but nothing to check that redis is actually responsive. Please add a health check for responsiveness.

If there's nothing already in the nrpe toolbox then let's talk about writing something that telnets in or something.
AMO/SUMO redis have checks that do a TCP check against the corresponding redis port. It's trivial to set that up. Let me know whether that works and I can set that up in a jiffy :)
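For reference, a TCP check of this kind amounts to opening a socket to the redis port. The following is a minimal sketch only, not the AMO/SUMO check itself; the function name `tcp_check` and its parameters are hypothetical. It confirms that the port accepts connections, not that redis answers commands.

```python
import socket


def tcp_check(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection completes the TCP handshake and nothing more;
        # no redis command is sent.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False
```

Something like `tcp_check("localhost", 6379)` would report True as long as a process is listening on the port, even if that process is wedged.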
Assignee: server-ops → ashish
Does that just try to open a socket? Bug 735252 indicates that may not be enough for our case.
Blocks: 926246
(In reply to Nick Thomas [:nthomas] from comment #2)
> Does that just try to open a socket ? Bug 735252 indicates that may not be
> enough for our case.

How else can we help here?
A pointer to the code + config for what AMO/SUMO does would be great.
Needinfo'ing Jeremy, Jason and Ricky.
Flags: needinfo?(rrosario)
Flags: needinfo?(oremj)
Flags: needinfo?(jthomas)
And Ashish to see what the SUMO/AMO nagios configs are.
Flags: needinfo?(ashish)
On SUMO, we have the services monitor page:

For redis, it just connects and calls the EXISTS command, since that is cheap:

That should return 0 or 1 and not blow up.
Flags: needinfo?(rrosario)
Ok. We don't already have a predictable key to call EXISTS on, so I suggest:
* do a 'SET nagios:<timestamp> "nagios woz here" EX 1', then do an EXISTS on that
* use PING, look for PONG response
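Both suggestions can be written as plain exchanges over redis's wire protocol (RESP). This is a rough sketch under assumptions: the helper names `encode_command` and `run_checks` are hypothetical, the host and port defaults are placeholders, and the `SET ... EX` form only works on servers that support it.

```python
import socket
import time


def encode_command(*args):
    """Encode a redis command as a RESP array of bulk strings."""
    out = [b"*%d\r\n" % len(args)]
    for arg in args:
        data = str(arg).encode("utf-8")
        out.append(b"$%d\r\n" % len(data) + data + b"\r\n")
    return b"".join(out)


def run_checks(host="localhost", port=6379, timeout=5.0):
    """Send PING, then SET with a 1s expiry and EXISTS; True if all replies match."""
    key = "nagios:%d" % int(time.time())
    steps = [
        (("PING",), b"+PONG"),
        (("SET", key, "nagios woz here", "EX", 1), b"+OK"),
        (("EXISTS", key), b":1"),
    ]
    with socket.create_connection((host, port), timeout=timeout) as sock:
        reader = sock.makefile("rb")
        for command, expected in steps:
            sock.sendall(encode_command(*command))
            # All three replies (and any error reply) fit on a single line.
            if reader.readline().rstrip(b"\r\n") != expected:
                return False
    return True
```

The key expires on its own after one second, so nothing is left behind if a later step fails.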
Here's the zamboni redis check:

It just runs the info command and returns OK if it succeeds.
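The zamboni code is not reproduced in this bug, so the following is only a sketch of the same idea: treat a parseable INFO reply as a sign of health. The function names are hypothetical, and the input is assumed to be the text body of the INFO reply, already read from the server.

```python
def parse_info(payload):
    """Parse the text body of an INFO reply into a dict of strings."""
    info = {}
    for line in payload.splitlines():
        line = line.strip()
        # Skip blank lines and "# Section" headers.
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(":")
        if sep:
            info[key] = value
    return info


def info_check(payload):
    """Return a Nagios-style (exit_code, message) pair for an INFO body."""
    info = parse_info(payload)
    if not info:
        return 2, "CRITICAL: empty or unparseable INFO reply"
    return 0, "OK: redis_version=%s" % info.get("redis_version", "unknown")
```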
Flags: needinfo?(oremj)
Flags: needinfo?(jthomas)
Hal, Nick:

Seems like we don't have a ready-made script for this. The AMO one is part of the monitor webapp that they use for their checks.

If someone can whip up a script, we'd be happy to hook it up to nagios. 

CC'ing Rob to see if this is something he can pick up; I'm not sure he has the time.
Flags: needinfo?(nthomas)
Flags: needinfo?(hwine)
Flags: needinfo?(ashish)
I should be able to whip this up soon.

Can someone point me at a dev redis instance that I can use for testing?
I set up a local instance of redis to play around with. I wrote a check script that should be configurable to do the simple checks we need. Attaching it now.
Attached file (obsolete) —
Simple linear redis check script.
Attachment #8338808 - Attachment mime type: text/x-python-script → text/plain
Attached file
Updated with proper exit(0) and output text.
Attachment #8338808 - Attachment is obsolete: true
Rob - cool. Thanks!

Nick - does this look like it'll do the job?

Shyam, is this to be an IT plugin, or releng only? If the latter, we'll drop the script in to get it wrapped for NRPE.
Flags: needinfo?(hwine) → needinfo?(shyam)
Hal, I'll let Ashish decide. I think we can use it in other places too. I don't see why it can't be shared...
Flags: needinfo?(shyam)
This ran fine against

>        'statement': "set nagios:%s foo" % this_second,
>        'response' : 'OK'

Would be good to set an expiry on this, given the cleanup doesn't get run if EXISTS fails, e.g.
    "SETEX nagios:%s 60 foo" % this_second,
for a 60 second expiry. We can't use the SET ... EX form because our version of redis doesn't support it.
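A sketch of what the statement list might look like with SETEX in place of the plain set. Only the 'statement'/'response' dict shape and the SETEX line come from this bug; the function name, the use of DEL for cleanup, and the expected replies for EXISTS and DEL (reply text with the protocol's type prefix stripped) are assumptions.

```python
import time


def build_statements(now=None, ttl=60):
    """Return the ordered statements to send and the reply expected for each."""
    this_second = int(time.time() if now is None else now)
    key = "nagios:%s" % this_second
    return [
        # SETEX sets the value and the expiry in one step, so the key
        # disappears by itself even if the cleanup below never runs.
        {"statement": "SETEX %s %d foo" % (key, ttl), "response": "OK"},
        {"statement": "EXISTS %s" % key, "response": "1"},
        # Assumed cleanup step; DEL replies with the number of keys removed.
        {"statement": "DEL %s" % key, "response": "1"},
    ]
```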
Flags: needinfo?(nthomas)
(In reply to Shyam Mani [:fox2mike] from comment #16)
> Hal, I'll let Ashish decide. I think we can use it in other places too. I
> don't see why it can't be shared...

I would have this script shared so that it can be used for other redis instances as well.
Here is a new version of the check script that allows a sleep interval to be set, so that we can confirm that keys actually expire.
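The expiry confirmation could look roughly like this. It is a sketch, not the attached script: `confirm_expiry` and its parameters are hypothetical names, and `execute` stands for any callable that sends one command string to redis and returns the reply as a string.

```python
import time


def confirm_expiry(execute, key, ttl=1, sleep_interval=2, sleep=time.sleep):
    """SETEX a key, wait past its TTL, and confirm EXISTS reports it gone."""
    if execute("SETEX %s %d foo" % (key, ttl)) != "OK":
        return False
    # The key should be visible immediately after it is set.
    if execute("EXISTS %s" % key) != "1":
        return False
    # sleep_interval must be longer than ttl for the final check to pass.
    sleep(sleep_interval)
    return execute("EXISTS %s" % key) == "0"
```

Passing `sleep` as a parameter keeps the function testable without real waiting.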
Attached file (obsolete) —
Added optparse to pass in host and port via -H and -P, respectively.
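A minimal sketch of the option handling, assuming the standard-library optparse module. Only the -H and -P flags are from this bug; the long option names, the defaults and the function name are assumptions.

```python
from optparse import OptionParser


def parse_args(argv):
    """Parse -H/-P from a list of arguments (without the program name)."""
    parser = OptionParser(usage="usage: %prog [-H host] [-P port]")
    parser.add_option("-H", "--host", dest="host", default="localhost",
                      help="redis host to check (default: %default)")
    parser.add_option("-P", "--port", dest="port", type="int", default=6379,
                      help="redis port to check (default: %default)")
    options, _positional = parser.parse_args(argv)
    return options
```

Called as `parse_args(sys.argv[1:])`, this yields `options.host` as a string and `options.port` as an int.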
:nthomas Can you verify the script in Comment 20? If this looks good, I shall import it into NRPE/Nagios. Thanks!
Flags: needinfo?(nthomas)
It works fine. I would suggest these though:
* making the Debug variable default to off and have a -v argument to swap that
* s/set/SET/g in the statements definitions
* for debugging, when the output doesn't match the expected response print out the actual response
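One way the first and third suggestions might fit together, as a sketch only. Only the -v flag is from this bug; the names `parse_verbose` and `check_response` and the long option name are hypothetical.

```python
from optparse import OptionParser


def parse_verbose(argv):
    """Return True only when -v is given; debugging is off by default."""
    parser = OptionParser()
    parser.add_option("-v", "--verbose", dest="debug", action="store_true",
                      default=False, help="print debugging output")
    options, _positional = parser.parse_args(argv)
    return options.debug


def check_response(statement, expected, actual, debug=False):
    """Compare a reply with the expected one; show the actual reply on mismatch."""
    ok = actual == expected
    if debug and not ok:
        print("DEBUG: %r expected %r but got %r" % (statement, expected, actual))
    return ok
```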
Flags: needinfo?(nthomas)
Added requested features from nthomas
Attachment #8339283 - Attachment is obsolete: true
Works fine against All set to go ahead with installing and using this?
Installed this:

As a last thought, could the plugin have a timeout, since it isn't run via NRPE? Nagios' timeout is much longer (close to 180s) and it would be nice to have "-t <timeout seconds>" in the plugin itself.
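A plugin-side timeout could be as simple as a socket timeout fed from the -t value. This is a sketch with hypothetical names, not the installed plugin; note that the timeout here applies to the connect and to each read separately rather than to the whole run.

```python
import socket

OK, CRITICAL = 0, 2  # standard Nagios plugin exit codes


def ping_with_timeout(host, port, timeout_seconds):
    """PING redis, giving up if the connect or the read exceeds the timeout."""
    try:
        with socket.create_connection((host, port),
                                      timeout=timeout_seconds) as sock:
            sock.sendall(b"PING\r\n")  # redis also accepts inline commands
            reply = sock.recv(64)
    except socket.timeout:
        return CRITICAL, "CRITICAL: no reply within %ss" % timeout_seconds
    except OSError as exc:
        return CRITICAL, "CRITICAL: %s" % exc
    if reply.startswith(b"+PONG"):
        return OK, "OK: PONG"
    return CRITICAL, "CRITICAL: unexpected reply %r" % reply
```

The returned code would be passed to `sys.exit()` so that Nagios sees a CRITICAL result well before its own, much longer timeout.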
Closed: 6 years ago
Resolution: --- → FIXED
Product: → Graveyard