Closed Bug 734394 Opened 12 years ago Closed 11 years ago

send mail about server side errors on balrog servers

Categories

(Release Engineering :: General, defect, P3)

x86_64
Linux
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bhearsum, Assigned: bhearsum)

References

Details

(Whiteboard: [balrog])

Attachments

(1 file, 1 obsolete file)

It would be great to have logs from ISE (HTTP 500) errors on the aus4-dev machines e-mailed to us. Ideally, the information from both the web server log and the application log would be included.
Assignee: server-ops-releng → server-ops
Component: Server Operations: RelEng → Server Operations
QA Contact: arich → phong
Assignee: server-ops → bburton
Status: NEW → ASSIGNED
Django-based sites use a plugin that sends any 500 errors from the app via email, and I think they send them to Arecibo as well?

The 500s are what is also dumped to the Apache error log

We don't currently have a solution for sending Apache logs outside of what your application logs.

So I'd suggest you see if you can adapt the code in playdoh that sends 500s via email and include that in your app. I'm not sure where the code is, but the devs usually refer to these error reports as tracebacks.
Hmmm, we're not using playdoh for this, but I'll see what I can figure out on the application side. Thanks for your input here!
Assignee: bburton → nobody
Component: Server Operations → Release Engineering: Automation (General)
QA Contact: phong → catlee
Summary: send mail about server side errors on aus4-dev machines → send mail about server side errors on balrog servers
Priority: -- → P3
Whiteboard: [balrog]
http://flask.pocoo.org/docs/errorhandling/ talks about sending mail on error from within the application.
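For reference, the in-application mail approach can be sketched with the standard library's logging.handlers.SMTPHandler. This is only a minimal sketch: the mail host, addresses, and subject below are hypothetical placeholders, not values from any Balrog config.

```python
import logging
from logging.handlers import SMTPHandler


def make_mail_handler(mailhost, fromaddr, toaddrs, subject="Balrog server error"):
    """Build a log handler that e-mails records at ERROR level and above."""
    handler = SMTPHandler(
        mailhost=mailhost,
        fromaddr=fromaddr,
        toaddrs=toaddrs,
        subject=subject,
    )
    # Only errors and worse should generate mail.
    handler.setLevel(logging.ERROR)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s in %(module)s: %(message)s")
    )
    return handler


# Hypothetical values for illustration only. Creating the handler does not
# open an SMTP connection; that happens when a record is emitted.
handler = make_mail_handler("localhost", "balrog@example.com", ["releng@example.com"])
```

In a Flask app the handler would typically be attached with app.logger.addHandler(handler), so unhandled exceptions logged by Flask get mailed out.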
Someone suggested that IT already has a thing for this called Arecibo.
Blocks: balrog-nightly
No longer blocks: balrog
I asked around about Arecibo yesterday and was told it got replaced with Sentry (https://errormill.mozilla.org/). I've got an initial setup for Balrog done there. It needs a bunch of tweaking still, and I need to do more testing on the code side.
Assignee: nobody → bhearsum
Quick braindump, because it might be a week or two until I get back to this:
* We have an initial project setup on errormill.mozilla.org for the admin dev app
* We still need a project for the non-admin dev app
* I have a partial patch for the code in https://github.com/bhearsum/balrog/compare/master...sentry
** It still needs to make sentry logging optional
** Needs more testing, too
* We'll need IT to update the configs on the dev apps when this gets deployed.
Catlee mentioned that we should find out what happens if Sentry goes down. Does the application hang? Does it use up fds? Will it cause a performance impact?
(In reply to Ben Hearsum [:bhearsum] from comment #7)
> Catlee mentioned that we should find out what happens if Sentry goes down.
> Does the application hang? Does it use up fds? Will it cause a performance
> impact?

I can't find much information on this, but I got some response on IRC:
15:37 < davidcramer_> bhearsum it doesnt retry
15:37 < davidcramer_> bhearsum it just fails immediately and then waits to try to send a future message for at least N seconds
15:37 < davidcramer_> and N is exponential IIRC
15:37 < davidcramer_> for each consecutive failure
15:37 < bhearsum> ahhhh, ok
15:37 < davidcramer_> granted, tahts per thread
15:37 < davidcramer_> so it could still fail a bunch
15:37 < davidcramer_> i think the default itmeout is something like 5s
15:37 < davidcramer_> but it might be lower

I'm going to do some testing on my own, too.
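To make the behaviour described on IRC concrete, here is a rough sketch of "fail immediately, then wait an exponentially growing interval before the next attempt". It is not the Sentry client's actual code; the class name, base delay, and cap are assumptions made up for illustration. Per the note above, each thread would hold its own state.

```python
import time


class BackoffState:
    """Tracks consecutive send failures and when the next attempt is allowed."""

    def __init__(self, base_delay=1.0, max_delay=60.0, clock=time.monotonic):
        self.base_delay = base_delay  # assumed first wait, in seconds
        self.max_delay = max_delay    # assumed upper bound on the wait
        self.clock = clock
        self.failures = 0
        self.retry_after = 0.0

    def should_try(self):
        """True once the current back-off window has passed."""
        return self.clock() >= self.retry_after

    def record_failure(self):
        """Double the wait for each consecutive failure, up to max_delay."""
        self.failures += 1
        delay = min(self.base_delay * 2 ** (self.failures - 1), self.max_delay)
        self.retry_after = self.clock() + delay
        return delay

    def record_success(self):
        """A successful send resets the back-off."""
        self.failures = 0
        self.retry_after = 0.0
```

A caller would check should_try() before sending, drop the message if it returns False, and call record_failure() or record_success() after each attempt.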
OK, I did a bunch of testing locally with Vaurien and I don't think Sentry is going to cause any issues in production.

I configured the Sentry DSN to use the threaded http transport, which means it doesn't block requests from returning the error to the http client while it sends the error. It defaults to a timeout of 1 second (which seems like it might be a little low, to be honest), after which it gives up on sending the error. So, we could eat up a maximum of (processes * threads * requests handled per second) fds. We have 4 processes and 2 threads in the dev env, which makes me pretty sure that we'll be fine. At the very least, I think we have enough positive data to go ahead and use this in the dev environment. Hopefully we can flush out other potential problems there.
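The fd estimate above works out roughly as follows. The 4 processes, 2 threads, and 1 second timeout are the numbers from this comment; the 10 requests per second per thread is a made-up figure purely for illustration, and the calculation assumes the worst case where every request errors while Sentry is unreachable.

```python
import math


def max_pending_fds(processes, threads, requests_per_second, timeout_seconds=1.0):
    """Upper bound on sockets held open while sends to Sentry time out.

    Assumes every request fails and each failed send holds one fd for the
    full timeout before giving up.
    """
    return math.ceil(processes * threads * requests_per_second * timeout_seconds)


# 4 processes and 2 threads as in the dev environment; 10 req/s is hypothetical.
worst_case = max_pending_fds(processes=4, threads=2, requests_per_second=10)
print(worst_case)  # 80
```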
Attached patch add sentry support (obsolete) — Splinter Review
I did some local tests with the wsgi app, ended up with things like: https://errormill.mozilla.org/Releng/Balrog/group/15007/.
Attachment #720018 - Flags: review?(catlee)
Comment on attachment 720018 [details] [diff] [review]
add sentry support

Catlee pointed out to me that all of the request headers are sent to Sentry, including Authorization. I checked the dev servers and Authorization is forwarded to them. We'll need to do something in the code to make sure we never send that to Sentry, else we expose everyone's LDAP passwords....
Attachment #720018 - Attachment is obsolete: true
Attachment #720018 - Flags: review?(catlee)
Same patch as before, except there's now a Processor that redacts the information in the Authorization header. Here's an exception that was sent with this code: https://errormill.mozilla.org/Releng/Balrog/group/15016/
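For anyone reading along without the patch open, the redaction idea looks roughly like the sketch below. This is not the attached patch: it is a standalone function, and the event layout (a "request" dict with a "headers" dict) and the mask string are assumptions for illustration rather than the Sentry client's actual data format.

```python
import copy

REDACTED = "********"  # hypothetical mask value


def redact_authorization(event):
    """Return a copy of the event with any Authorization header value masked."""
    cleaned = copy.deepcopy(event)
    headers = cleaned.get("request", {}).get("headers", {})
    for name in headers:
        # Header names are case-insensitive, so compare in lower case.
        if name.lower() == "authorization":
            headers[name] = REDACTED
    return cleaned


# Hypothetical event for illustration.
event = {
    "message": "boom",
    "request": {"headers": {"Authorization": "Basic c2VjcmV0", "Host": "aus4-dev"}},
}
safe = redact_authorization(event)
```

The original event is left untouched, so the application can still use the real header while only the sanitized copy is sent out.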
Attachment #720683 - Flags: review?(catlee)
Attachment #720683 - Flags: review?(catlee) → review+
Commit pushed to master at https://github.com/mozilla/balrog

https://github.com/mozilla/balrog/commit/cc3f52078864a200cdaa2b7c5b3ffbf104af8b4d
bug 734394: send mail about server side errors on balrog servers. r=catlee
Comment on attachment 720683 [details] [diff] [review]
sanitize the authorization header

I landed this. It's a no-op until bug 849245 is fixed, which will get sentry_dsn added to the config files. Leaving this bug open until that happens, so we can verify.
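The "no-op until sentry_dsn is in the config" behaviour amounts to something like the following sketch. Only the sentry_dsn option name comes from this bug; the "logging" section name, the helper name, and the example DSN are assumptions for illustration, not the real config layout.

```python
import configparser


def get_sentry_dsn(config_text, section="logging"):
    """Return the configured sentry_dsn, or None if it is missing or empty."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    # fallback=None covers both a missing section and a missing option.
    dsn = parser.get(section, "sentry_dsn", fallback=None)
    return dsn or None


# Hypothetical config snippets for illustration.
without_dsn = "[logging]\nlogfile = /tmp/balrog.log\n"
with_dsn = "[logging]\nsentry_dsn = https://key:secret@sentry.example.com/1\n"

for text in (without_dsn, with_dsn):
    dsn = get_sentry_dsn(text)
    if dsn is None:
        print("Sentry logging disabled")
    else:
        print("Sentry logging enabled")
```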
Attachment #720683 - Flags: checked-in+
The Jenkins run is green: https://jenkins.mozilla.org/job/Balrog/147/
With the config update done this is mostly working. The admin app is already throwing errors: https://errormill.mozilla.org/Releng/Balrog/group/15042/

However, the non-admin app doesn't seem to be working. I did some local modifications on the dev servers to force exceptions and nothing showed up on the Sentry ui. I ran tcpdump and it showed some traffic to errormill.mozilla.org, and I verified that the dsn is correct...so it seems like it's a problem on the server side. I'm pinging webdev to try to get more info.
The admin app seemed to stop working yesterday too. Today, all the data seems to be filled in, which makes me think that the server was just particularly busy and took a while to process the events.
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering
Component: General Automation → General