Closed
Bug 1189379
Opened 10 years ago
Closed 9 years ago
add an endpoint to balrog that doesn't eat 500 errors
Categories
(Release Engineering Graveyard :: Applications: Balrog (backend), defect)
Release Engineering Graveyard
Applications: Balrog (backend)
Tracking
(Not tracked)
RESOLVED
DUPLICATE
of bug 1248751
People
(Reporter: bhearsum, Unassigned)
Details
Attachments
(1 file)
4.36 KB,
patch
|
nthomas
:
feedback+
|
Details | Diff | Splinter Review |
11:45 < arr> bhearsum: I think that https://nagios.mozilla.org/phx1/cgi-bin/status.cgi?host=aus4.mozilla.org is the right thing, but it claims that http has been fine for almost 4 days
11:45 < bhearsum> arr: https://mana.mozilla.org/wiki/display/websites/aus4.mozilla.org
11:46 < bhearsum> arr: unfortunately, the client expects aus to always return a 200...
11:46 < bhearsum> so we eat 500s and return empty xml instead
11:46 < bhearsum> we talked about adding a special endpoint for monitoring at one ponit...
11:47 < bhearsum> newrelic noticed the error rate spike, for what it's worth
11:48 < arr> bhearsum: okay, ouch, monitoring something that always says 200 is hard :}
11:48 < bhearsum> yeah
11:48 < catlee> wiiifiiii
11:48 < bhearsum> arr: we can add a special endpoint though...
11:48 < bhearsum> which URL do we monitor right now?
11:49 < arr> bhearsum: looks like update.xml
11:49 < arr> let me pull the specific URL out of the config
11:49 < arr> bhearsum: /update/3/Firefox/14.0a1/20120222174716/WINNT_x86-msvc/en-US/nightly/Windows%205.1/default/default/update.xml
11:50 < catlee> could add a /healthcheck endpoint or something
Comment 1•10 years ago
|
||
When this is ready, let me know, and I can make the nagios changes.
Also, the MOC would like some runbook documentation to go along with this check, aka:
error a - page team 1
error b - restart this
any error look in this log
I'm not sure if we have anything like that already.
Reporter | ||
Comment 2•10 years ago
|
||
Pretty simple: just do something trivial with the database, and don't eat any exceptions that might be raised. In addition to the unit tests, I gave this a try in Vagrant. When mysql was running it returned "All is well.", when it wasn't it threw an ISE 500.
Comment 3•10 years ago
|
||
Comment on attachment 8641221 [details] [diff] [review]
add healthcheck endpoint
Converting to a f+, I'd like to add a bit more checking for the db-has-gone-AWOL case.
>+class HealthCheckView(MethodView):
>+ def __init__(self, *args, **kwargs):
>+ self.log = logging.getLogger(self.__class__.__name__)
>+ MethodView.__init__(self, *args, **kwargs)
>+
>+ def get(self):
>+ # Perform a database operation to make sure that connection is working fine.
>+ # This will bubble an Exception if something goes wrong. Otherwise, a normal
>+ # response will be returned.
>+ dbo.rules.countRules()
Please add a dbo.releases.countReleases(), and maybe a dbo.permissions.countAllUsers(), so that we have all the vital tables covered. I think we would have missed the issue from bug 1189289 with only a rules check.
>+ return make_response("All is well.")
>diff --git a/requirements/dev.txt b/requirements/dev.txt
> pyflakes
> ordereddict==1.1
>+six
Not being used ?
Thinking more generally, I'm wondering if we can exercise more of the app in the nagios checks. What if we support healthcheck=1 as a query arg, and check for that in generic() and raise instead of eating the exception.
Attachment #8641221 -
Flags: review?(nthomas) → feedback+
Reporter | ||
Comment 4•10 years ago
|
||
(In reply to Nick Thomas [:nthomas] from comment #3)
> >+ return make_response("All is well.")
> >diff --git a/requirements/dev.txt b/requirements/dev.txt
> > pyflakes
> > ordereddict==1.1
> >+six
>
> Not being used ?
I think it's indirectly required by nose...I'll double check though. I was having a lot of python env issues yesterday.
> Thinking more generally, I'm wondering if we can exercise more of the app in
> the nagios checks. What if we support healthcheck=1 as a query arg, and
> check for that in generic() and raise instead of eating the exception.
That's a neat idea, let's do it! Are you thinking this is a full on replacement for /healthcheck, or something we can do in addition to it?
Comment 5•10 years ago
|
||
The http/s check doesn't take extra arguments like that (and I doubt you want to write an https check from scratch), so I'm not sure how you would trigger that "this is a nagios check" unless you're just doing that for a specific path?
Reporter | ||
Comment 6•10 years ago
|
||
(In reply to Amy Rich [:arr] [:arich] from comment #5)
> The http/s check doesn't take extra arguments like that (and I doubt you
> want to write an https check from scratch), so I'm not sure how you would
> trigger that "this is a nagios check" unless you're just doing that for a
> specific path?
Nick's suggestion (if I understood correctly) would be that we add ?healthcheck=1 to the URLs, so we'd be testing URLs like this:
/update/3/Firefox/14.0a1/20120222174716/WINNT_x86-msvc/en-US/nightly/Windows%205.1/default/default/update.xml?healthcheck=1
Reporter | ||
Comment 7•9 years ago
|
||
Doesn't look like I'll be getting back to this anytime soon...
Assignee: bhearsum → nobody
Status: ASSIGNED → NEW
Reporter | ||
Comment 8•9 years ago
|
||
We're going to end up doing something very similar to this as part of the CloudOps migration.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → DUPLICATE
Updated•5 years ago
|
Product: Release Engineering → Release Engineering Graveyard
You need to log in
before you can comment on or make changes to this bug.
Description
•