Closed Bug 1189379 Opened 10 years ago Closed 9 years ago

add an endpoint to balrog that doesn't eat 500 errors

Tracking

(Not tracked)

Status:

RESOLVED DUPLICATE of bug 1248751

People

(Reporter: bhearsum, Unassigned)

Details

Attachments

(1 file)

add healthcheck endpoint 10 years ago bhearsum@mozilla.com (:bhearsum) 4.36 KB, patch	nthomas : feedback+	Details \| Diff \| Splinter Review

bhearsum@mozilla.com (:bhearsum)

Reporter

Description

•

10 years ago

11:45 < arr> bhearsum: I think that https://nagios.mozilla.org/phx1/cgi-bin/status.cgi?host=aus4.mozilla.org is the right thing, but it claims that http has been fine for almost 4 days 11:45 < bhearsum> arr: https://mana.mozilla.org/wiki/display/websites/aus4.mozilla.org 11:46 < bhearsum> arr: unfortunately, the client expects aus to always return a 200... 11:46 < bhearsum> so we eat 500s and return empty xml instead 11:46 < bhearsum> we talked about adding a special endpoint for monitoring at one ponit... 11:47 < bhearsum> newrelic noticed the error rate spike, for what it's worth 11:48 < arr> bhearsum: okay, ouch, monitoring something that always says 200 is hard :} 11:48 < bhearsum> yeah 11:48 < catlee> wiiifiiii 11:48 < bhearsum> arr: we can add a special endpoint though... 11:48 < bhearsum> which URL do we monitor right now? 11:49 < arr> bhearsum: looks like update.xml 11:49 < arr> let me pull the specific URL out of the config 11:49 < arr> bhearsum: /update/3/Firefox/14.0a1/20120222174716/WINNT_x86-msvc/en-US/nightly/Windows%205.1/default/default/update.xml 11:50 < catlee> could add a /healthcheck endpoint or something

Amy Rich [:arr] [:arich]

Comment 1

•

10 years ago

When this is ready, let me know, and I can make the nagios changes. Also, the MOC would like some runbook documentation to go along with this check, aka: error a - page team 1 error b - restart this any error look in this log I'm not sure if we have anything like that already.

bhearsum@mozilla.com (:bhearsum)

Reporter

Comment 2

•

10 years ago

Attached patch add healthcheck endpoint — Details — Splinter Review

Pretty simple: just do something trivial with the database, and don't eat any exceptions that might be raised. In addition to the unit tests, I gave this a try in Vagrant. When mysql was running it returned "All is well.", when it wasn't it threw an ISE 500.

Assignee: nobody → bhearsum

Status: NEW → ASSIGNED

Attachment #8641221 - Flags: review?(nthomas)

Nick Thomas [:nthomas] (UTC+12)

Comment 3

•

10 years ago

Comment on attachment 8641221 [details] [diff] [review] add healthcheck endpoint Converting to a f+, I'd like to add a bit more checking for the db-has-gone-AWOL case. >+class HealthCheckView(MethodView): >+ def __init__(self, *args, **kwargs): >+ self.log = logging.getLogger(self.__class__.__name__) >+ MethodView.__init__(self, *args, **kwargs) >+ >+ def get(self): >+ # Perform a database operation to make sure that connection is working fine. >+ # This will bubble an Exception if something goes wrong. Otherwise, a normal >+ # response will be returned. >+ dbo.rules.countRules() Please add a dbo.releases.countReleases(), and maybe a dbo.permissions.countAllUsers(), so that we have all the vital tables covered. I think we would have missed the issue from bug 1189289 with only a rules check. >+ return make_response("All is well.") >diff --git a/requirements/dev.txt b/requirements/dev.txt > pyflakes > ordereddict==1.1 >+six Not being used ? Thinking more generally, I'm wondering if we can exercise more of the app in the nagios checks. What if we support healthcheck=1 as a query arg, and check for that in generic() and raise instead of eating the exception.

Attachment #8641221 - Flags: review?(nthomas) → feedback+

bhearsum@mozilla.com (:bhearsum)

Reporter

Comment 4

•

10 years ago

(In reply to Nick Thomas [:nthomas] from comment #3) > >+ return make_response("All is well.") > >diff --git a/requirements/dev.txt b/requirements/dev.txt > > pyflakes > > ordereddict==1.1 > >+six > > Not being used ? I think it's indirectly required by nose...I'll double check though. I was having a lot of python env issues yesterday. > Thinking more generally, I'm wondering if we can exercise more of the app in > the nagios checks. What if we support healthcheck=1 as a query arg, and > check for that in generic() and raise instead of eating the exception. That's a neat idea, let's do it! Are you thinking this is a full on replacement for /healthcheck, or something we can do in addition to it?

Amy Rich [:arr] [:arich]

Comment 5

•

10 years ago

The http/s check doesn't take extra arguments like that (and I doubt you want to write an https check from scratch), so I'm not sure how you would trigger that "this is a nagios check" unless you're just doing that for a specific path?

bhearsum@mozilla.com (:bhearsum)

Reporter

Comment 6

•

10 years ago

(In reply to Amy Rich [:arr] [:arich] from comment #5) > The http/s check doesn't take extra arguments like that (and I doubt you > want to write an https check from scratch), so I'm not sure how you would > trigger that "this is a nagios check" unless you're just doing that for a > specific path? Nick's suggestion (if I understood correctly) would be that we add ?healthcheck=1 to the URLs, so we'd be testing URLs like this: /update/3/Firefox/14.0a1/20120222174716/WINNT_x86-msvc/en-US/nightly/Windows%205.1/default/default/update.xml?healthcheck=1

bhearsum@mozilla.com (:bhearsum)

Reporter

Comment 7

•

9 years ago

Doesn't look like I'll be getting back to this anytime soon...

Assignee: bhearsum → nobody

Status: ASSIGNED → NEW

bhearsum@mozilla.com (:bhearsum)

Reporter

Comment 8

•

9 years ago

We're going to end up doing something very similar to this as part of the CloudOps migration.

Status: NEW → RESOLVED

Closed: 9 years ago

Resolution: --- → DUPLICATE

BMO Automation

Updated

•

5 years ago

Product: Release Engineering → Release Engineering Graveyard

You need to log in before you can comment on or make changes to this bug.

Bugzilla

add an endpoint to balrog that doesn't eat 500 errors

Categories

(Release Engineering Graveyard :: Applications: Balrog (backend), defect)

Tracking

(Not tracked)

People

(Reporter: bhearsum, Unassigned)

References

Details

Crash Data

Security

(public)

User Story

Attachments

(1 file)

Description

Comment 1

Comment 2

Comment 3

Comment 4

Comment 5

Comment 6

Comment 7

Comment 8

Updated

Attachment

General

Description

File Name

Content Type