Closed
Bug 734394
Opened 12 years ago
Closed 11 years ago
send mail about server side errors on balrog servers
Categories
(Release Engineering :: General, defect, P3)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: bhearsum, Assigned: bhearsum)
References
Details
(Whiteboard: [balrog])
Attachments
(1 file, 1 obsolete file)
211.35 KB, patch (catlee: review+, bhearsum: checked-in+)
It would be great to have logs from HTTP 500 (internal server error) responses on the aus4-dev machines e-mailed to us. Ideally, the information from both the web server log and the application log would be included.
Updated•12 years ago
Assignee: server-ops-releng → server-ops
Component: Server Operations: RelEng → Server Operations
QA Contact: arich → phong
Updated•12 years ago
Assignee: server-ops → bburton
Status: NEW → ASSIGNED
Comment 1•12 years ago
Django-based sites use a plugin that sends any 500 errors from the app via email, and I think they send them to Arecibo as well. The 500s are also what gets dumped to the Apache error log. We don't currently have a solution for sending Apache logs beyond what your application logs, so I'd suggest you see if you can adapt the code in playdoh that sends 500s via email and include that in your app. I'm not sure where the code is, but the devs usually refer to these reports as tracebacks.
Comment 2•12 years ago (Assignee)
Hmmm, we're not using playdoh for this, but I'll see what I can figure out on the application side. Thanks for your input here!
Assignee: bburton → nobody
Component: Server Operations → Release Engineering: Automation (General)
QA Contact: phong → catlee
Summary: send mail about server side errors on aus4-dev machines → send mail about server side errors on balrog servers
Updated•12 years ago
Priority: -- → P3
Whiteboard: [balrog]
Comment 3•12 years ago (Assignee)
http://flask.pocoo.org/docs/errorhandling/ talks about sending mail on error from within the application.
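As a rough illustration of mailing errors from inside the application, the sketch below attaches the standard library's `logging.handlers.SMTPHandler` to a logger so that records at ERROR level and above are e-mailed. The mail host, addresses, subject, logger name, and the `build_error_mail_handler` helper are all placeholders assumed for this example; none of it is taken from Balrog's code.

```python
import logging
from logging.handlers import SMTPHandler


def build_error_mail_handler(mailhost, fromaddr, toaddrs, subject):
    """Return a handler that e-mails log records at ERROR level or above."""
    handler = SMTPHandler(
        mailhost=mailhost,
        fromaddr=fromaddr,
        toaddrs=toaddrs,
        subject=subject,
    )
    # Only errors and worse should generate mail.
    handler.setLevel(logging.ERROR)
    handler.setFormatter(logging.Formatter(
        "%(asctime)s %(levelname)s in %(module)s:%(lineno)d\n%(message)s"
    ))
    return handler


# Hypothetical values; no connection is made until a record is emitted.
mail_handler = build_error_mail_handler(
    mailhost="localhost",
    fromaddr="balrog-errors@example.com",
    toaddrs=["release-team@example.com"],
    subject="Balrog server error",
)
app_logger = logging.getLogger("balrog-example")
app_logger.addHandler(mail_handler)
```

In a Flask application the same handler would typically be attached to `app.logger` rather than to a standalone logger.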
Comment 4•11 years ago (Assignee)
Someone suggested that IT already has a thing for this called Arecibo.
Comment 5•11 years ago (Assignee)
I asked around about Arecibo yesterday and was told it got replaced with Sentry (https://errormill.mozilla.org/). I've got an initial setup for Balrog done there. It needs a bunch of tweaking still, and I need to do more testing on the code side.
Assignee: nobody → bhearsum
Comment 6•11 years ago (Assignee)
Quick braindump, because it might be a week or two until I get back to this:
* We have an initial project setup on errormill.mozilla.org for the admin dev app
* We still need a project for the non-admin dev app
* I have a partial patch for the code in https://github.com/bhearsum/balrog/compare/master...sentry
** It still needs to make sentry logging optional
** Needs more testing, too
* We'll need IT to update the configs on the dev apps when this gets deployed.
Comment 7•11 years ago (Assignee)
Catlee mentioned that we should find out what happens if Sentry goes down. Does the application hang? Does it use up fds? Will it cause a performance impact?
Comment 8•11 years ago (Assignee)
(In reply to Ben Hearsum [:bhearsum] from comment #7)
> Catlee mentioned that we should find out what happens if Sentry goes down.
> Does the application hang? Does it use up fds? Will it cause a performance
> impact?
I can't find much information on this, but I got some response on IRC:
15:37 < davidcramer_> bhearsum it doesnt retry
15:37 < davidcramer_> bhearsum it just fails immediately and then waits to try to send a future message for at least N seconds
15:37 < davidcramer_> and N is exponential IIRC
15:37 < davidcramer_> for each consecutive failure
15:37 < bhearsum> ahhhh, ok
15:37 < davidcramer_> granted, tahts per thread
15:37 < davidcramer_> so it could still fail a bunch
15:37 < davidcramer_> i think the default itmeout is something like 5s
15:37 < davidcramer_> but it might be lower
I'm going to do some testing on my own, too.
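The behaviour described on IRC (fail immediately, then hold off for a delay that grows with each consecutive failure) can be sketched roughly as below. This is not the Sentry client's actual code: the `SendBackoff` class, the base delay, and the cap are assumptions made for illustration, and the clock is injectable only so the example can be followed without sleeping.

```python
import time


class SendBackoff:
    """Tracks consecutive send failures and decides when to try again."""

    def __init__(self, base_delay=1.0, max_delay=60.0, clock=time.monotonic):
        self.base_delay = base_delay   # assumed value, in seconds
        self.max_delay = max_delay     # assumed cap, in seconds
        self.clock = clock
        self.failures = 0
        self.retry_at = 0.0

    def should_try(self):
        """True once the current hold-off period has elapsed."""
        return self.clock() >= self.retry_at

    def record_failure(self):
        """Double the hold-off for each consecutive failure, up to the cap."""
        self.failures += 1
        delay = min(self.base_delay * 2 ** (self.failures - 1), self.max_delay)
        self.retry_at = self.clock() + delay

    def record_success(self):
        """A successful send clears the failure streak."""
        self.failures = 0
        self.retry_at = 0.0


# Walk through three consecutive failures using a fake clock.
now = [100.0]
backoff = SendBackoff(clock=lambda: now[0])
backoff.record_failure()   # hold off 1 second
backoff.record_failure()   # hold off 2 seconds
backoff.record_failure()   # hold off 4 seconds
```

As the IRC log notes, state like this would be kept per thread, so several threads could each fail once before any of them starts holding off.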
Comment 9•11 years ago (Assignee)
OK, I did a bunch of testing locally with Vaurien and I don't think Sentry is going to cause any issues in production. I configured the Sentry DSN to use the threaded HTTP transport, which means it doesn't block requests from returning the error to the HTTP client while it sends the error. It defaults to a timeout of 1 second (which seems like it might be a little low, to be honest), after which it gives up on sending the error. So, at worst we could have (processes * threads * requests handled per second) fds in use at once. We have 4 processes and 2 threads in the dev env, which makes me pretty sure that we'll be fine. At the very least, I think we have enough positive data to go ahead and use this in the dev environment. Hopefully we can flush out other potential problems there.
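To put a number on that worst case, here is a small sketch of the arithmetic. The process and thread counts come from the comment above; the request rate is a made-up figure (the bug does not give one), and the helper name `max_fds_in_flight` is hypothetical. It assumes every request fails, each failure opens one connection to Sentry, and each connection is held for the full timeout.

```python
import math


def max_fds_in_flight(processes, threads, requests_per_second, timeout_seconds=1.0):
    """Upper bound on sockets held open while sends to Sentry time out.

    requests_per_second is the rate handled by a single thread.
    """
    return math.ceil(processes * threads * requests_per_second * timeout_seconds)


# 4 processes and 2 threads as in the dev environment;
# 10 requests per second per thread is an assumed figure.
worst_case = max_fds_in_flight(processes=4, threads=2, requests_per_second=10)
print(worst_case)  # 80
```

Even with every request failing, the bound stays far below typical per-process file descriptor limits at this scale, which is consistent with the conclusion in the comment.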
Comment 10•11 years ago (Assignee)
I did some local tests with the wsgi app, ended up with things like: https://errormill.mozilla.org/Releng/Balrog/group/15007/.
Attachment #720018 -
Flags: review?(catlee)
Comment 11•11 years ago (Assignee)
Comment on attachment 720018 (add sentry support)
Catlee pointed out to me that all of the request headers are sent to Sentry, including Authorization. I checked the dev servers and Authorization is forwarded to them. We'll need to do something in the code to make sure we never send that to Sentry, else we expose everyone's LDAP passwords....
Attachment #720018 -
Attachment is obsolete: true
Attachment #720018 -
Flags: review?(catlee)
Comment 12•11 years ago (Assignee)
Same patch as before, except there's now a Processor that redacts the information in the Authorization header. Here's an exception that was sent with this code: https://errormill.mozilla.org/Releng/Balrog/group/15016/
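The redaction idea can be sketched as follows. This is not the Processor from the patch and does not use the Sentry client's API; the event layout (a `request` dict holding `headers`), the mask string, and the `redact_sensitive_headers` helper are assumptions chosen to keep the example self-contained.

```python
import copy

REDACTED = "********"
SENSITIVE_HEADERS = {"authorization"}  # compared case-insensitively


def redact_sensitive_headers(event):
    """Return a copy of the event with sensitive request headers masked."""
    cleaned = copy.deepcopy(event)
    headers = cleaned.get("request", {}).get("headers", {})
    for name in headers:
        if name.lower() in SENSITIVE_HEADERS:
            headers[name] = REDACTED
    return cleaned


# Hypothetical event, shaped only for this example.
event = {
    "message": "Internal Server Error",
    "request": {
        "url": "https://example.com/api",
        "headers": {
            "Authorization": "Basic dXNlcjpwYXNz",
            "Accept": "application/json",
        },
    },
}
safe_event = redact_sensitive_headers(event)
```

Working on a deep copy keeps the original request data intact for the application while ensuring the credentials never reach the copy that is sent out.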
Attachment #720683 -
Flags: review?(catlee)
Updated•11 years ago
Attachment #720683 -
Flags: review?(catlee) → review+
Comment 13•11 years ago
Commit pushed to master at https://github.com/mozilla/balrog
https://github.com/mozilla/balrog/commit/cc3f52078864a200cdaa2b7c5b3ffbf104af8b4d
bug 734394: send mail about server side errors on balrog servers. r=catlee
Comment 14•11 years ago (Assignee)
Comment on attachment 720683 (sanitize the authorization header)
I landed this. It's a no-op until bug 849245 is fixed, which will get sentry_dsn added to the config files. Leaving this bug open until that happens, so we can verify.
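One way the "no-op until sentry_dsn is configured" behaviour could look is sketched below. Only the option name `sentry_dsn` comes from this bug; the `[logging]` section name, the INI format, the example DSN, and the `read_sentry_dsn` helper are assumptions for illustration and may not match Balrog's real config layout.

```python
import configparser


def read_sentry_dsn(config_text, section="logging", option="sentry_dsn"):
    """Return the configured DSN, or None when Sentry reporting is not set up."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    # fallback covers both a missing section and a missing option.
    dsn = parser.get(section, option, fallback="").strip()
    return dsn or None


configured = "[logging]\nsentry_dsn = https://public:secret@sentry.example.com/1\n"
unconfigured = "[logging]\n"

for text in (configured, unconfigured):
    dsn = read_sentry_dsn(text)
    if dsn is None:
        print("Sentry reporting disabled")
    else:
        print("Sentry reporting enabled")
```

Treating a missing or empty option as "disabled" means the code can ship before the config change does, which matches the rollout order described in the comment.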
Attachment #720683 -
Flags: checked-in+
Comment 15•11 years ago (Assignee)
Jenkins run is green, and here: https://jenkins.mozilla.org/job/Balrog/147/
Comment 16•11 years ago (Assignee)
With the config update done this is mostly working. The admin app is already throwing errors: https://errormill.mozilla.org/Releng/Balrog/group/15042/ However, the non-admin app doesn't seem to be working. I made some local modifications on the dev servers to force exceptions and nothing showed up in the Sentry UI. I ran tcpdump and it showed some traffic to errormill.mozilla.org, and I verified that the DSN is correct, so it seems like it's a problem on the server side. I'm pinging webdev to try to get more info.
Comment 17•11 years ago (Assignee)
The admin app seemed to stop working yesterday too. Today, all the data seems to be filled in, which makes me think that the server was just particularly busy and took a while to process the events.
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Updated•11 years ago
Product: mozilla.org → Release Engineering
Updated•6 years ago
Component: General Automation → General