Closed Bug 375312 Opened 18 years ago Closed 18 years ago

Need AMO specific Netscaler health check

Categories

(addons.mozilla.org Graveyard :: Administration, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mrz, Assigned: clouserw)

Details

Current Netscaler health checks are based on an HTTP HEAD check and take a pretty liberal view of what's a good response code:

nslb02> show monitor http-moz
1) Name.......: http-moz  Type......: HTTP  State....ENABLED
Standard parameters:
  Interval.........: 5 sec            Retries...........: 3
  Response timeout.: 2 sec            Down time.........: 30 sec
  Reverse..........: NO               Transparent.......: NO
  Secure...........: NO               LRTM..............: ENABLED
  Action...........: Not applicable   Deviation.........: 0 min
  Destination IP...: Bound service    Destination port..: Bound service
Special parameters:
  HTTP request.....:" HEAD /"
  Custom headers...:""
  Response codes...: 200 301-302 401 403-404
Done

This generally catches Apache failures or high load issues but doesn't do anything to verify AMO's working correctly. I'd like an AMO specific health check that does a better job of doing so. See me if you need more details.
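To illustrate why this monitor is so permissive, here is a minimal Python sketch of how a "Response codes" list like the one above could be matched against a reply. It is not the Netscaler's implementation; the function names are hypothetical, and it assumes a token such as 301-302 means an inclusive range.

```python
from typing import Set


def parse_response_codes(spec: str) -> Set[int]:
    """Expand a spec such as '200 301-302 401 403-404' into a set of codes."""
    accepted: Set[int] = set()
    for token in spec.split():
        if "-" in token:
            # Assumption: 'low-high' denotes an inclusive range of status codes.
            low, high = token.split("-", 1)
            accepted.update(range(int(low), int(high) + 1))
        else:
            accepted.add(int(token))
    return accepted


# The list shown in the monitor configuration above.
ACCEPTED = parse_response_codes("200 301-302 401 403-404")


def is_healthy(status: int) -> bool:
    """Return True when the status code would count as a good response."""
    return status in ACCEPTED
```

Under these assumptions a 404 or 403 from "HEAD /" still marks the server as up, so only a refused connection, a timeout, or a code outside the list (for example a 500) would take it out of rotation.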
Component: Add-ons → Maintenance Scripts
QA Contact: add-ons → maintenance
Component: Maintenance Scripts → Add-ons
QA Contact: maintenance → add-ons
What are the limits on the health check? Not sure why the nagios health check isn't sufficient for this, I confess.
Nagios is doing a simple string check. I think it's more valuable to test AMO and make sure everything about it is working.
In the long term that's true, but in the short term having crashing servers removed from the rotation will greatly reduce the impact of that problem on the integrity of the system. (It's currently associated in a strong-but-circumstantial way with data being corrupted or lost when updating add-ons, and session loss, at least.)
I currently have a string check in place on the Netscaler that expects to see the same string Nagios does, "Recommended Add-ons". This will catch servers when php stops working. I'm concerned that any content update that would remove that would take down all the AMO backend servers (it's happened for other updates and Nagios goes crazy). I've left it off several for now. I'd like a non-changeable text string that I can use for monitoring, or specific URL under AMO that is static (the current check is on "/en-US/firefox/"). I may be able to do a string check on <head> or <title>, but if not, can you add an HTML comment or some other text string that will -never- change?
Here's a rough draft of a low-bandwidth monitoring page. Anything we should add or remove? http://remora.stage.mozilla.com/services/monitor.php As far as a text string on the front page, "All rights reserved." is a classic and shouldn't change.
Assignee: nobody → clouserw
As far as a nagios check for that page, an easy one is if "FAILED" shows up anywhere on that page, something is wrong. Since we got a second memcache server today, you can check if the # of servers is 2 also. If it would be easier to serve this as text/plain or xml, or just have the page blank unless there is an error, just let me know. Wil
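The "FAILED anywhere on the page" rule could be sketched roughly as follows. This is only an illustration in Python, not the actual Nagios check used for AMO; the function name is hypothetical and the page body is assumed to have been fetched already. The return values follow the standard Nagios plugin exit codes (0 for OK, 2 for CRITICAL).

```python
from typing import Tuple

OK, CRITICAL = 0, 2  # standard Nagios plugin exit codes


def check_monitor_body(body: str) -> Tuple[int, str]:
    """Return (exit code, message) for an already-fetched monitor page body."""
    # Collect every line that reports a failure so the alert says what broke.
    failed = [line.strip() for line in body.splitlines() if "FAILED" in line]
    if failed:
        return CRITICAL, "CRITICAL - " + "; ".join(failed)
    return OK, "OK - no FAILED entries on monitor page"
```

A wrapper script would print the message and exit with the returned code.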
Neat page. If there's any error on that page, can you return a non-200? It's easier to do a response code check than a string check.
Page is updated. Will throw a 500 error if anything fails.
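For readers who want the shape of that behaviour, here is a minimal sketch in Python. It is not the monitor.php source; the function name and check labels are hypothetical. Each check is a callable returning True or False, and a single failure (or an exception) turns the whole response into a 500.

```python
from typing import Callable, Dict, Tuple


def run_monitor(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, str]:
    """Run every check and return (HTTP status, plain-text report)."""
    lines = []
    all_ok = True
    for label, check in checks.items():
        try:
            ok = bool(check())
        except Exception:
            # A check that raises is treated the same as a check that fails.
            ok = False
        all_ok = all_ok and ok
        lines.append(f"{label}: {'success' if ok else 'FAILED'}")
    # 200 only when every check passed; otherwise 500 so a load balancer or
    # monitoring system can rely on the status code instead of parsing text.
    return (200 if all_ok else 500), "\n".join(lines)
```

Running all checks before deciding the status, rather than stopping at the first failure, keeps the report complete for whoever reads the page.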
Is this in production yet?
No, it's just in trunk (on staging). If you're happy with it, it's ready to go with the next push.
I am but I haven't seen a failure case to test. Check that later today?
Yes that works great! Right now I'm getting:

Connect to MAIN database (10.2.70.20): success
Select MAIN database (remora): success
Connect to SHADOW database (10.2.70.20): success
Select SHADOW database (remora): success
Memcache is installed: success
Memcache is configured: success
Memcache server (localhost) is responding: success
At least 2 memcache servers? (1): FAILED

which is great for Nagios but not so good for the Netscaler to use, since a node can still run with one memcache server. I'm not sure if that's of any value vs. the GET /en-US/firefox/ health check I'm doing now (matching on a 200). That's probably good enough?
Yes, I think using the "GET /en-US/firefox/" for the netscaler and monitor.php for nagios would be great.
https://addons.mozilla.org/services/monitor.php is live. I'm calling this one resolved.
Status: NEW → RESOLVED
Closed: 18 years ago
Resolution: --- → FIXED
Component: Add-ons → Administration
QA Contact: add-ons → administration
Product: addons.mozilla.org → addons.mozilla.org Graveyard