Closed
Bug 375312
Opened 18 years ago
Closed 18 years ago
Need AMO specific Netscaler health check
Categories
(addons.mozilla.org Graveyard :: Administration, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: mrz, Assigned: clouserw)
Details
Current Netscaler health checks are based on an HTTP HEAD check and take a pretty liberal view of what counts as a good response code:
nslb02> show monitor http-moz
1) Name.......: http-moz Type......: HTTP State....ENABLED
Standard parameters:
Interval.........: 5 sec Retries...........: 3
Response timeout.: 2 sec Down time.........: 30 sec
Reverse..........: NO Transparent.......: NO
Secure...........: NO LRTM..............: ENABLED
Action...........: Not applicable Deviation.........: 0 min
Destination IP...: Bound service Destination port..: Bound service
Special parameters:
HTTP request.....:" HEAD /"
Custom headers...:""
Response codes...:
200 301-302 401 403-404
Done
This generally catches Apache failures or high load issues but doesn't do anything to verify AMO's working correctly.
I'd like an AMO specific health check that does a better job of doing so.
See me if you need more details.
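To make the "liberal view" concrete, here is a minimal Python sketch of how the response-code list in the monitor output above could be interpreted. The function names `parse_codes` and `is_healthy` are made up for illustration and are not Netscaler APIs; the sketch only assumes that a range such as `301-302` is inclusive.

```python
def parse_codes(spec: str) -> set[int]:
    """Expand a spec such as '200 301-302 401 403-404' into a set of ints."""
    codes: set[int] = set()
    for token in spec.split():
        if "-" in token:
            # Assumption: ranges are inclusive on both ends.
            low, high = token.split("-")
            codes.update(range(int(low), int(high) + 1))
        else:
            codes.add(int(token))
    return codes


# The response codes listed for the http-moz monitor.
ACCEPTED = parse_codes("200 301-302 401 403-404")


def is_healthy(status: int) -> bool:
    """A server counts as 'up' if its status code is in the accepted set."""
    return status in ACCEPTED
```

Under this reading, a server that answers `HEAD /` with a 404 still counts as up, which is why the check says little about whether AMO itself is working.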
Updated•18 years ago
Component: Add-ons → Maintenance Scripts
QA Contact: add-ons → maintenance
Updated•18 years ago
Component: Maintenance Scripts → Add-ons
QA Contact: maintenance → add-ons
Comment 1•18 years ago
What are the limits on the health check? I confess I'm not sure why the Nagios health check isn't sufficient for this.
Comment 2 (Reporter)•18 years ago
Nagios is doing a simple string check. I think it's more valuable to test AMO and make sure everything about it is working.
Comment 3•18 years ago
In the long term that's true, but in the short term having crashing servers removed from the rotation will greatly reduce the impact of that problem on the integrity of the system. (It's currently associated in a strong-but-circumstantial way with data being corrupted or lost when updating add-ons, and session loss, at least.)
Comment 4 (Reporter)•18 years ago
I currently have a string check in place on the Netscaler that expects to see the same string Nagios does, "Recommended Add-ons". This will catch servers when PHP stops working.
I'm concerned that any content update that removes that string would take down all the AMO backend servers (it's happened for other updates, and Nagios goes crazy). I've left it off several for now.
I'd like a non-changeable text string that I can use for monitoring, or specific URL under AMO that is static (the current check is on "/en-US/firefox/"). I may be able to do a string check on <head> or <title>, but if not, can you add an HTML comment or some other text string that will -never- change?
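A minimal Python sketch of the kind of check being asked for, assuming a dedicated marker in the page. The marker text `<!-- amo-health-check -->` and the function name `body_is_healthy` are hypothetical; nothing in this bug says AMO emits such a comment.

```python
# Hypothetical marker: an HTML comment reserved for monitoring, so that
# ordinary content edits never touch it.
MARKER = "<!-- amo-health-check -->"


def body_is_healthy(status: int, body: str, marker: str = MARKER) -> bool:
    """Healthy only if the page returned 200 and still contains the marker."""
    return status == 200 and marker in body
```

Matching on visible copy such as "Recommended Add-ons" works the same way mechanically, but it fails for every server at once as soon as that copy is edited, which is the concern above.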
Comment 5 (Assignee)•18 years ago
Here's a rough draft of a low-bandwidth monitoring page. Anything we should add or remove?
http://remora.stage.mozilla.com/services/monitor.php
As far as a text string on the front page, "All rights reserved." is a classic and shouldn't change.
Assignee: nobody → clouserw
Comment 6 (Assignee)•18 years ago
As far as a Nagios check for that page, an easy one is if "FAILED" shows up anywhere on that page, something is wrong.
Since we got a second memcache server today, you can also check that the number of servers is 2.
If it would be easier to serve this as text/plain or xml, or just have the page blank unless there is an error, just let me know.
Wil
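The "FAILED anywhere on the page" rule is simple enough to sketch in a few lines of Python. This is only an illustration, not the check that was deployed; `nagios_status` is a made-up name, and the return values follow the standard Nagios plugin exit codes (0 for OK, 2 for CRITICAL).

```python
OK = 0        # Nagios plugin exit code for OK
CRITICAL = 2  # Nagios plugin exit code for CRITICAL


def nagios_status(page_text: str) -> int:
    """Return CRITICAL if any line of the monitor page reports FAILED."""
    return CRITICAL if "FAILED" in page_text else OK
```

A real plugin would fetch the page first and also handle timeouts and connection errors; those parts are left out here.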
Comment 7 (Reporter)•18 years ago
Neat page. If there's any error on that page, can you return a non-200? It's easier to do a response code check than a string check.
Comment 8 (Assignee)•18 years ago
Page is updated. Will throw a 500 error if anything fails.
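For readers who want to see the shape of that behavior, here is a rough Python sketch of a monitor page that runs a list of checks and reports 500 if any of them fails. It is not the monitor.php source, which is PHP and is not shown in this bug; `run_checks` and the check labels are assumptions for illustration.

```python
from typing import Callable


def run_checks(checks: list[tuple[str, Callable[[], bool]]]) -> tuple[int, str]:
    """Run each check and return (HTTP status, report body).

    The status is 200 only if every check passes; otherwise it is 500.
    A check that raises is treated as a failure.
    """
    lines = []
    all_ok = True
    for label, check in checks:
        try:
            ok = bool(check())
        except Exception:
            ok = False
        all_ok = all_ok and ok
        lines.append(f"{label}: {'success' if ok else 'FAILED'}")
    return (200 if all_ok else 500), "\n".join(lines)
```

With this shape, a load balancer or Nagios only needs to look at the status code, while the body stays readable for a person debugging the failure.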
Comment 9 (Reporter)•18 years ago
Is this in production yet?
Comment 10 (Assignee)•18 years ago
No, it's just in trunk (on staging). If you're happy with it, it's ready to go with the next push.
Comment 11 (Reporter)•18 years ago
I am but I haven't seen a failure case to test. Check that later today?
Comment 12 (Reporter)•18 years ago
Yes that works great!
Right now I'm getting:
Connect to MAIN database (10.2.70.20): success
Select MAIN database (remora): success
Connect to SHADOW database (10.2.70.20): success
Select SHADOW database (remora): success
Memcache is installed: success
Memcache is configured: success
Memcache server (localhost) is responding: success
At least 2 memcache servers? (1): FAILED
which is great for Nagios but not so good for the Netscaler to use, since a node can still run with one memcache server. I'm not sure if that's of any value vs. the GET /en-US/firefox/ health check I'm doing now (matching on a 200). That's probably good enough?
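One way to picture the mismatch is to separate failures that should alert from failures that should pull a node out of rotation. The Python sketch below parses a report in the format shown above; it is purely illustrative and was not implemented in this bug. The `ADVISORY` list and both function names are assumptions.

```python
# Assumption: checks whose failure should alert Nagios but should not
# remove the node from the load balancer rotation.
ADVISORY = ("At least 2 memcache servers?",)


def failed_checks(report: str) -> list[str]:
    """Return the labels of all report lines that end in ': FAILED'."""
    return [
        line.rsplit(":", 1)[0]
        for line in report.splitlines()
        if line.rstrip().endswith(": FAILED")
    ]


def node_should_serve(report: str) -> bool:
    """True if every failure, if any, is only an advisory one."""
    return all(label.startswith(ADVISORY) for label in failed_checks(report))
```

For the report above, `node_should_serve` returns True even though the page as a whole reports a failure, which matches the point that the node can still run with one memcache server.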
Comment 13 (Assignee)•18 years ago
Yes, I think using the "GET /en-US/firefox/" for the Netscaler and monitor.php for Nagios would be great.
Comment 14 (Assignee)•18 years ago
https://addons.mozilla.org/services/monitor.php is live. I'm calling this one resolved.
Status: NEW → RESOLVED
Closed: 18 years ago
Resolution: --- → FIXED
Updated•16 years ago
Component: Add-ons → Administration
QA Contact: add-ons → administration
Updated•9 years ago
Product: addons.mozilla.org → addons.mozilla.org Graveyard