Closed Bug 1189379 Opened 10 years ago Closed 9 years ago

add an endpoint to balrog that doesn't eat 500 errors

Categories

(Release Engineering Graveyard :: Applications: Balrog (backend), defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 1248751

People

(Reporter: bhearsum, Unassigned)

Details

Attachments

(1 file)

11:45 < arr> bhearsum: I think that https://nagios.mozilla.org/phx1/cgi-bin/status.cgi?host=aus4.mozilla.org is the right thing, but it claims that http has been fine for almost 4 days 11:45 < bhearsum> arr: https://mana.mozilla.org/wiki/display/websites/aus4.mozilla.org 11:46 < bhearsum> arr: unfortunately, the client expects aus to always return a 200... 11:46 < bhearsum> so we eat 500s and return empty xml instead 11:46 < bhearsum> we talked about adding a special endpoint for monitoring at one ponit... 11:47 < bhearsum> newrelic noticed the error rate spike, for what it's worth 11:48 < arr> bhearsum: okay, ouch, monitoring something that always says 200 is hard :} 11:48 < bhearsum> yeah 11:48 < catlee> wiiifiiii 11:48 < bhearsum> arr: we can add a special endpoint though... 11:48 < bhearsum> which URL do we monitor right now? 11:49 < arr> bhearsum: looks like update.xml 11:49 < arr> let me pull the specific URL out of the config 11:49 < arr> bhearsum: /update/3/Firefox/14.0a1/20120222174716/WINNT_x86-msvc/en-US/nightly/Windows%205.1/default/default/update.xml 11:50 < catlee> could add a /healthcheck endpoint or something
When this is ready, let me know, and I can make the nagios changes. Also, the MOC would like some runbook documentation to go along with this check, aka: error a - page team 1 error b - restart this any error look in this log I'm not sure if we have anything like that already.
Pretty simple: just do something trivial with the database, and don't eat any exceptions that might be raised. In addition to the unit tests, I gave this a try in Vagrant. When mysql was running it returned "All is well.", when it wasn't it threw an ISE 500.
Assignee: nobody → bhearsum
Status: NEW → ASSIGNED
Attachment #8641221 - Flags: review?(nthomas)
Comment on attachment 8641221 [details] [diff] [review] add healthcheck endpoint Converting to a f+, I'd like to add a bit more checking for the db-has-gone-AWOL case. >+class HealthCheckView(MethodView): >+ def __init__(self, *args, **kwargs): >+ self.log = logging.getLogger(self.__class__.__name__) >+ MethodView.__init__(self, *args, **kwargs) >+ >+ def get(self): >+ # Perform a database operation to make sure that connection is working fine. >+ # This will bubble an Exception if something goes wrong. Otherwise, a normal >+ # response will be returned. >+ dbo.rules.countRules() Please add a dbo.releases.countReleases(), and maybe a dbo.permissions.countAllUsers(), so that we have all the vital tables covered. I think we would have missed the issue from bug 1189289 with only a rules check. >+ return make_response("All is well.") >diff --git a/requirements/dev.txt b/requirements/dev.txt > pyflakes > ordereddict==1.1 >+six Not being used ? Thinking more generally, I'm wondering if we can exercise more of the app in the nagios checks. What if we support healthcheck=1 as a query arg, and check for that in generic() and raise instead of eating the exception.
Attachment #8641221 - Flags: review?(nthomas) → feedback+
(In reply to Nick Thomas [:nthomas] from comment #3) > >+ return make_response("All is well.") > >diff --git a/requirements/dev.txt b/requirements/dev.txt > > pyflakes > > ordereddict==1.1 > >+six > > Not being used ? I think it's indirectly required by nose...I'll double check though. I was having a lot of python env issues yesterday. > Thinking more generally, I'm wondering if we can exercise more of the app in > the nagios checks. What if we support healthcheck=1 as a query arg, and > check for that in generic() and raise instead of eating the exception. That's a neat idea, let's do it! Are you thinking this is a full on replacement for /healthcheck, or something we can do in addition to it?
The http/s check doesn't take extra arguments like that (and I doubt you want to write an https check from scratch), so I'm not sure how you would trigger that "this is a nagios check" unless you're just doing that for a specific path?
(In reply to Amy Rich [:arr] [:arich] from comment #5) > The http/s check doesn't take extra arguments like that (and I doubt you > want to write an https check from scratch), so I'm not sure how you would > trigger that "this is a nagios check" unless you're just doing that for a > specific path? Nick's suggestion (if I understood correctly) would be that we add ?healthcheck=1 to the URLs, so we'd be testing URLs like this: /update/3/Firefox/14.0a1/20120222174716/WINNT_x86-msvc/en-US/nightly/Windows%205.1/default/default/update.xml?healthcheck=1
Doesn't look like I'll be getting back to this anytime soon...
Assignee: bhearsum → nobody
Status: ASSIGNED → NEW
We're going to end up doing something very similar to this as part of the CloudOps migration.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → DUPLICATE
Product: Release Engineering → Release Engineering Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: