Closed Bug 905616 Opened 6 years ago Closed 6 years ago

Add redis health check to redis01.build.scl1

Categories

(mozilla.org Graveyard :: Server Operations, task)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: nthomas, Assigned: ashish)

Attachments

(3 files, 2 obsolete files)

There's a check on the redis process count, but nothing to check that redis is actually responsive. Please add a health check for responsiveness.

If there's nothing already in the nrpe toolbox then let's talk about writing something that telnets in or something.
AMO/SUMO redis have checks that do a TCP check against the corresponding redis port. It's trivial to set that up. Let me know whether that works and I can set that up in a jiffy :)
Assignee: server-ops → ashish
Does that just try to open a socket? Bug 735252 indicates that may not be enough for our case.
Blocks: 926246
(In reply to Nick Thomas [:nthomas] from comment #2)
> Does that just try to open a socket? Bug 735252 indicates that may not be
> enough for our case.

How else can we help here?
A pointer to the code + config for what AMO/SUMO does would be great.
Needinfo'ing Jeremy, Jason and Ricky.
Flags: needinfo?(rrosario)
Flags: needinfo?(oremj)
Flags: needinfo?(jthomas)
And Ashish to see what the SUMO/AMO nagios configs are.
Flags: needinfo?(ashish)
On SUMO, we have the services monitor page:
https://support.mozilla.org/services/monitor

For redis, it just connects and calls the EXISTS command, since that is cheap:
http://redis.io/commands/exists

That should return 0 or 1 and not blow up.
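
A minimal sketch of that kind of EXISTS probe over a raw socket, assuming the Redis inline command form and an integer reply (`:0` or `:1`). This is not the SUMO monitor code; the names `read_line` and `check_exists` and the key `nagios:probe` are made up for illustration.

```python
import socket


def read_line(sock):
    """Read from the socket until a CRLF-terminated reply line has arrived."""
    buf = b""
    while not buf.endswith(b"\r\n"):
        chunk = sock.recv(1)
        if not chunk:
            raise ConnectionError("connection closed before a full reply arrived")
        buf += chunk
    return buf


def check_exists(sock, key="nagios:probe"):
    """Send EXISTS for key and return 0 or 1; raise ValueError otherwise."""
    sock.sendall("EXISTS {}\r\n".format(key).encode("ascii"))
    reply = read_line(sock).rstrip(b"\r\n")
    if reply in (b":0", b":1"):
        return int(reply[1:].decode("ascii"))
    # Anything else (for example an error reply) counts as "blew up".
    raise ValueError("unexpected reply to EXISTS: %r" % reply)
```

Against a live server the socket would come from something like `socket.create_connection((host, port), timeout=5)`.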
Flags: needinfo?(rrosario)
Ok. We don't already have a predictable key to call EXISTS on, so I suggest:
* do a 'SET nagios:<timestamp> "nagios woz here" EX 1', then do an EXISTS on that
* use PING, look for PONG response
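
A minimal sketch of the PING option, assuming a plain socket and the inline command form. Unlike a bare TCP connect, it only passes if redis actually answers. The function names `ping_on` and `redis_responds` and the 5 second default timeout are illustrative assumptions.

```python
import socket


def ping_on(sock):
    """Send PING on a connected socket; True only if the reply is +PONG."""
    sock.sendall(b"PING\r\n")
    with sock.makefile("rb") as reader:
        return reader.readline() == b"+PONG\r\n"


def redis_responds(host, port, timeout=5.0):
    """Connect, PING, and report whether redis answered in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            return ping_on(sock)
    except OSError:
        # Covers refused connections and socket timeouts alike.
        return False
```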
Here's the zamboni redis check: https://github.com/mozilla/zamboni/blob/master/apps/amo/monitors.py#L153

It just runs the info command and returns OK if it succeeds.
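
For comparison, a rough sketch of an INFO-based check done over a raw socket. This is not the zamboni code; it assumes INFO answers with a single bulk reply (`$<length>`, then the payload), and the name `info_ok` is hypothetical.

```python
def info_ok(sock):
    """Send INFO and return True if a complete bulk reply comes back."""
    sock.sendall(b"INFO\r\n")
    with sock.makefile("rb") as reader:
        header = reader.readline()
        if not header.startswith(b"$"):
            return False  # error reply, closed connection, or unexpected type
        try:
            length = int(header[1:].decode("ascii").strip())
        except ValueError:
            return False
        if length < 0:
            return False
        # Read exactly the advertised number of payload bytes.
        payload = reader.read(length)
        return len(payload) == length
```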
Flags: needinfo?(oremj)
Flags: needinfo?(jthomas)
Hal, Nick:

Seems like we don't have a ready-made script for this. The AMO one is part of their monitor webapp that they use to check.

If someone can whip up a script, we'd be happy to hook it up to nagios.

CC'ing Rob to see if this is something he can pick up; I'm not sure he has the time.
Flags: needinfo?(nthomas)
Flags: needinfo?(hwine)
Flags: needinfo?(ashish)
I should be able to whip this up soon.

Can someone point me at a dev redis instance that I can use for testing?
I set up a local instance of redis to play around with. I wrote a check script that should be configurable to do the simple checks we need. Attaching it now.
Attached file test_redis.py (obsolete)
Simple linear redis check script.
Attachment #8338808 - Attachment mime type: text/x-python-script → text/plain
Attached file test_redis.py
Updated with proper exit(0) and output text.
Attachment #8338808 - Attachment is obsolete: true
Rob - cool. Thanks!

Nick - does this look like it'll do the job?

Shyam, is this to be an IT plugin, or releng only? If the latter, we'll drop the script in http://hg.mozilla.org/build/nagios-tools/ to get it wrapped for NRPE.
Flags: needinfo?(hwine) → needinfo?(shyam)
Hal, I'll let Ashish decide. I think we can use it in other places too. I don't see why it can't be shared...
Flags: needinfo?(shyam)
This ran fine against redis01.build.mozilla.org.

>        'statement': "set nagios:%s foo" % this_second,
>        'response' : 'OK'

Would be good to set an expiry on this, given the cleanup doesn't get run if EXISTS fails. eg
    "SETEX nagios:%s 60 foo" % this_second,
for a 60 second expiry. We can't use the 'SET ... EX' form because our redis doesn't have support for it.
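
A rough sketch of how a linear statement table with the SETEX change might look. Only the `statement`/`response` keys, `this_second`, and the SETEX line come from the comments here; `build_statements`, `run_statements`, the EXISTS and DEL entries, and the raw-socket transport are assumptions for illustration, not the attached script.

```python
import time


def build_statements(now=None, ttl=60):
    """Build the ordered probe statements for one check run."""
    this_second = int(time.time() if now is None else now)
    key = "nagios:%s" % this_second
    return [
        # SETEX sets the value and its expiry in one step, so the key
        # goes away on its own even if a later statement fails.
        {"statement": "SETEX %s %d foo" % (key, ttl), "response": "OK"},
        {"statement": "EXISTS %s" % key, "response": "1"},
        {"statement": "DEL %s" % key, "response": "1"},
    ]


def run_statements(sock, statements):
    """Send each statement in order; stop at the first unexpected reply."""
    with sock.makefile("rb") as reader:
        for step in statements:
            sock.sendall((step["statement"] + "\r\n").encode("ascii"))
            line = reader.readline().decode("ascii", "replace").rstrip("\r\n")
            # Drop the one-character reply type marker ("+" or ":").
            if line[1:] != step["response"]:
                return False, "%s: expected %r, got %r" % (
                    step["statement"], step["response"], line)
    return True, "all statements returned the expected response"
```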
Flags: needinfo?(nthomas)
(In reply to Shyam Mani [:fox2mike] from comment #16)
> Hal, I'll let Ashish decide. I think we can use it in other places too. I
> don't see why it can't be shared...

I would have this script shared so that it can be used for other redis instances as well.
Here is a new version of the check script that allows a sleep interval to be set so that we can confirm that keys have expired.
Attached file test_redis_with_sleep_and_optparse.py (obsolete)
Added optparse to pass in host and port via -H and -P respectively.
:nthomas Can you verify the script in Comment 20? If this looks good, I shall import it into NRPE/Nagios. Thanks!
Flags: needinfo?(nthomas)
It works fine. I would suggest these though:
* making the Debug variable default to off and have a -v argument to swap that
* s/set/SET/g in the statements definitions
* for debugging, when the output doesn't match the expected response, print out the actual response
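
A small sketch of what the first and third suggestions could look like with optparse. The `-H`, `-P` and `-v` flags come from this bug; the function names, the `localhost` default, and the message wording are assumptions (6379 is the stock redis port).

```python
from optparse import OptionParser


def parse_options(argv):
    """Parse -H/-P/-v from an argument list (without the program name)."""
    parser = OptionParser(usage="%prog -H HOST [-P PORT] [-v]")
    parser.add_option("-H", dest="host", default="localhost",
                      help="redis host to check")
    parser.add_option("-P", dest="port", type="int", default=6379,
                      help="redis port to check")
    # Debug output stays off unless -v is given.
    parser.add_option("-v", dest="debug", action="store_true", default=False,
                      help="print debug output")
    options, _args = parser.parse_args(argv)
    return options


def describe_mismatch(statement, expected, actual):
    """Build a failure message that includes the response actually received."""
    return "CRITICAL: %s expected %r but got %r" % (statement, expected, actual)
```

For example, `parse_options(["-H", "redis01.build.mozilla.org", "-v"])` would yield that host, the default port, and debug switched on.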
Flags: needinfo?(nthomas)
Added requested features from nthomas
Attachment #8339283 - Attachment is obsolete: true
Works fine against redis01.build.mozilla.org. All set to go ahead with installing and using this?
Installed this:

https://nagios.mozilla.org/releng-scl3/cgi-bin/extinfo.cgi?type=2&host=redis01.build.scl1.mozilla.com&service=redis

As a last thought, could the plugin have a timeout, since it isn't run via NRPE? Nagios' timeout is much longer (close to 180s) and it would be nice to have "-t <timeout seconds>" in the plugin itself.
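
One way such a timeout could be sketched, assuming the value of a hypothetical `-t` option is passed in as `timeout` and applied as a socket timeout. Note this bounds each socket operation rather than the whole check; the function name and messages are illustrative. Exit code 2 is the Nagios CRITICAL status.

```python
import socket


def ping_with_timeout(sock, timeout):
    """PING a connected socket; return a (nagios_exit_code, message) pair."""
    sock.settimeout(timeout)
    try:
        sock.sendall(b"PING\r\n")
        reply = b""
        while not reply.endswith(b"\r\n"):
            chunk = sock.recv(64)
            if not chunk:
                break  # peer closed the connection
            reply += chunk
    except socket.timeout:
        return 2, "CRITICAL: no reply from redis within %.1fs" % timeout
    except OSError as exc:
        return 2, "CRITICAL: %s" % exc
    if reply == b"+PONG\r\n":
        return 0, "OK: redis answered PING"
    return 2, "CRITICAL: unexpected reply %r" % reply
```

A caller would print the message and pass the code to `sys.exit()`.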
Status: NEW → ASSIGNED
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard