Closed
Bug 1191080
Opened 10 years ago
Closed 10 years ago
Fix the way we use the New Relic Python agent with gunicorn
Categories
(Tree Management :: Treeherder, defect, P1)
Tree Management
Treeherder
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: mdoglio, Assigned: emorley)
References
Details
Attachments
(2 files)
It looks like all the exceptions raised in the api are not reported to newrelic.
I tried to generate a 500 with the following url:
https://treeherder.mozilla.org/api/project/mozilla-inbound/artifact/?name=buildapi&type=json&job_id__in=12475518,12475512,12475511&count=a
but I can't see any trace of it on newrelic.
| Assignee | ||
Comment 1•10 years ago
|
||
I was looking at the New Relic docs, and Django exception reporting is supported:
https://docs.newrelic.com/docs/agents/python-agent/supported-features/instrumented-python-packages#web-frameworks
The setup docs do seem to generally suggest using the wrapper scripts, rather than manually adding in calls to the agent API:
https://docs.newrelic.com/docs/agents/python-agent/hosting-mechanisms/python-agent-gunicorn#wrapper-script
...there is even a specific wrapper script for Django.
Mauro: Do you remember why we're not using the wrapper script, vs initialising the agent in webapp/wsgi.py?
Also, when initialising the agent manually, the docs do say to make sure that `agent.initialize()` occurs before anything else is imported:
https://docs.newrelic.com/docs/agents/python-agent/installation-configuration/python-agent-integration#manual-integration
However we don't do that:
https://github.com/mozilla/treeherder/blob/fad76032a5a3cca48ac564262cd45c7288fdd3d8/treeherder/webapp/wsgi.py#L31-L37
In addition, the docs say to use WSGIApplicationWrapper() rather than the wsgi_application() we're using at present (the latter is for when you're using it as a decorator):
https://docs.newrelic.com/docs/agents/python-agent/installation-configuration/python-agent-integration#wsgi-application
This seems to be confirmed by the example they give here:
https://docs.newrelic.com/docs/agents/python-agent/customization-extension/python-instrumentation-api#WSGIApplicationWrapper
We should either use one of the wrapper scripts, or fix our manual agent initialisation to see if either of these fixes this bug.
| Assignee | ||
Updated•10 years ago
|
Assignee: nobody → emorley
Status: NEW → ASSIGNED
| Reporter | ||
Comment 2•10 years ago
|
||
I don't remember why we initialize the nr agent in the wsgi application vs the wrapper script :(
| Assignee | ||
Comment 3•10 years ago
|
||
Hmm though the Heroku Procfile uses the wrapper form already, but exceptions don't show up there either :-(
I also tried making the fixes mentioned in comment 2, to wgsi.py, and pushing to heroku, but that didn't help.
| Assignee | ||
Comment 4•10 years ago
|
||
I enabled debug logging on Heroku and have attached the resultant log along with things I've tried to a New Relic support request:
https://support.newrelic.com/requests/156680
| Assignee | ||
Comment 5•10 years ago
|
||
My initial ticket:
> We're using the New Relic Python agent (v2.52.0.40) with gunicorn (v19.3.0),
> Django (v1.7.9) and Python 2.7.9.
>
> Exceptions in our Celery tasks appear fine in New Relic, however Django
> exceptions during web transactions (eg those that result in an HTTP 500) are not
> being displayed in New Relic. This is occurring on both our production instance,
> and our prototype Heroku instance (we're in the process of moving to Heroku).
>
> For example, the ValueError exception behind this HTTP 500 does not appear on
> https://rpm.newrelic.com/accounts/677903/applications/7385291/traced_errors
> (even though we have no exception filters set):
> https://treeherder-heroku.herokuapp.com/api/project/mozilla-inbound/artifact/?name=buildapi&type=json&job_id__in=12475518,12475512,12475511&count=a
>
> The production instance is relying on wsgi.py agent initialisation: https://gith
> ub.com/mozilla/treeherder/blob/master/treeherder/webapp/wsgi.py#L31-L37
>
> Whereas the Heroku instance also uses newrelic-admin run-program in addition:
> https://github.com/mozilla/treeherder/blob/master/Procfile#L5
>
> Now I realise the production initialisation is currently suboptimal (the init
> call should be before some of the other imports), however I've tried a branch
> that fixes this:
> https://github.com/mozilla/treeherder/compare/fix-newrelic-wsgi-init?expand=1
>
> As well as tried a branch that removes the manual wsgi.py initialisation, in
> case it was interfering with newrelic-admin run-program:
> https://github.com/mozilla/treeherder/compare/use-newrelic-gunicorn-wrapper?expand=1
>
> However neither of these helped.
>
> I've attached a copy of the debug agent log output on Heroku, which shows the
> exception occurring (search for `ValueError`), but that was not showing on
> https://rpm.newrelic.com/accounts/677903/applications/7385291/traced_errors
>
> Many thanks :-)
Their first response:
> Hi Ed,
>
> Our Python agent will report any unhandled exceptions at the point where they
> are being intercepted or any API calls to record_exception().
>
> Exceptions will be reported if they are unhandled and propagate up to WSGI
> application object that our instrumentation wrapper is applied to. For specific
> web frameworks, we will also intercept exceptions at point that framework
> converts them into a 500 error response. It appears that this exception was
> passed along until it became a 500 error response which is when we recorded the
> stack trace.
>
> If you wish to track this exception at an earlier juncture you should use our
> agents API to record the exception as it was presented.
>
> I hope this helps you to understand the reason why the first portion of the
> exception was not raised.
>
> If you have any other concerns or issues, please feel free to contact us.
My reply:
> Hi Walden , thank you for the quick reply.
>
> Unfortunately the problem here is that the exception isn't being reported to New
> Relic at all, rather than it being reported to New Relic but with a stack trace
> I'm not happy with.
>
> Thanks,
Their latest response:
> Hey Ed,
>
> I do understand your concern, and we are taking a closer look at the exact
> details of what happens when an error bubbles up to a 500 exception for you.
>
> Once I have more information I will get back in touch.
>
> Thanks for your patience.
| Assignee | ||
Comment 6•10 years ago
|
||
This unfortunately doesn't this bug, or answer the question of whether we should instead be using the wrapper scripts (which incidentally don't fix the bug either), but if we're going to initialise the agent manually, then at least let's do it the way the docs recommend.
Attachment #8646311 -
Flags: review?(mdoglio)
| Assignee | ||
Updated•10 years ago
|
Attachment #8646311 -
Attachment description: Initialise the New Relic agent correctlyThis unfortunately doesn't fix bug 1191080 itself, or answer the question of whether we should instead be using the wrapper scripts (which incidentally don't fi → Initialise the New Relic agent correctly
| Reporter | ||
Updated•10 years ago
|
Attachment #8646311 -
Flags: review?(mdoglio) → review+
Comment 7•10 years ago
|
||
Commit pushed to master at https://github.com/mozilla/treeherder
https://github.com/mozilla/treeherder/commit/9e152c6732019cb45121da19cde6ea7e15cb2cef
Bug 1191080 - Initialise the New Relic agent correctly
The agent is supposed to be initialised before any other imports. In
addition, we're supposed to use WSGIApplicationWrapper() if we're not
using it as a decorator.
This unfortunately doesn't fix bug 1191080 itself, or answer the
question of whether we should instead be using the wrapper scripts
(which incidentally don't fix the bug either), but if we're going to
initialise the agent manually, then at least let's do it the way the
docs recommend.
See:
https://docs.newrelic.com/docs/agents/python-agent/installation-configuration/python-agent-integration#wsgi-application
https://docs.newrelic.com/docs/agents/python-agent/customization-extension/python-instrumentation-api#WSGIApplicationWrapper
| Assignee | ||
Updated•10 years ago
|
Attachment #8646311 -
Flags: checkin+
| Reporter | ||
Comment 8•10 years ago
|
||
Attachment #8649826 -
Flags: review?(mdoglio)
| Reporter | ||
Updated•10 years ago
|
Attachment #8649826 -
Flags: review?(mdoglio) → review+
Comment 9•10 years ago
|
||
Commit pushed to master at https://github.com/mozilla/treeherder
https://github.com/mozilla/treeherder/commit/e401d3d26fab92aa339ecb3d6405899c9d56f88d
Bug 1191080 - Use newrelic-admin run-program rather than manual init
| Assignee | ||
Comment 10•10 years ago
|
||
Comment on attachment 8649826 [details] [review]
PR 884
These changes aside, New Relic is still not reporting exceptions in web transactions, I'll chase up the ticket with them soon.
Attachment #8649826 -
Flags: checkin+
| Assignee | ||
Comment 11•10 years ago
|
||
So New Relic's latest reply (since comment 5), is a bit frustrating again, since it's asking us to do things we've already done.
> Can you please provide us with more information about this error message. It appears the error is happening outside of a transaction we are recording since there is not New Relic associated code in your stack trace. It also appears as if multiple applications are logging to the same file.
>
> One of the best solutions would be if you could separate logging for this application into it's own log file, then send us a set of logs (including a crash as previously shown) following these steps:
>
> The easiest way to get the Python agent to write to a log is to edit the newrelic.ini and uncomment the line:
> log_file = /tmp/newrelic-python-agent.log
> * Then set the log_level setting to debug.
> * If you are using the newrelic-admin run-program wrapper script, please add NEW_RELIC_STARTUP_DEBUG=true to the beginning of your launch command.
> * Restart your app server, exercise your app for five minutes, capture the debug log, and send it to us for review. Additional data from the NEW_RELIC_STARTUP_DEBUG=true environment variable will be sent to stdout. Please also capture and send us this data.
> Once you've finished gathering logs for us, don't forget to reset the log_level to info and restart your app server again to pick up the change.
> Please also let us know which time zone the logs were collected in.
| Assignee | ||
Comment 12•10 years ago
|
||
My reply sent now:
> Hi <redacted>
>
> I set the following:
> `NEW_RELIC_STARTUP_DEBUG=true`
> `NEW_RELIC_LOG_LEVEL=debug`
>
> And used `NEW_RELIC_LOG` to save the agent log separately (attached as `python-agent.log`).
>
> I also switched off the other worker nodes whilst capturing the app logs (attached as `app.log`), so there is less noise.
>
> The timezone in the logs is UTC.
>
> It's worth noting we are using the Django REST framework (v3.1.3 ; http://www.django-rest-framework.org) along with this `exception_handler()`:
> https://github.com/mozilla/treeherder/blob/7c15aeb4ced90369029af287d1cf8a4eb60969bd/treeherder/webapp/api/exceptions.py#L18
>
> Note also since the original post (and thus for the new logs here) we've switched from initialising the agent manually in wgsi.py, to using the `newrelic-admin run-program` wrapper script (in https://github.com/mozilla/treeherder/commit/e401d3d26fab92aa339ecb3d6405899c9d56f88d).
>
> The Heroku Procfile entry used to start gunicorn is:
> `newrelic-admin run-program gunicorn treeherder.webapp.wsgi:application --log-file - --timeout 29 --max-requests 2000`
>
> Let me know if you need any more details.
>
> Many thanks,
| Assignee | ||
Comment 13•10 years ago
|
||
Reply from New Relic:
> Thanks so much for the detailed report. Now that we've had a chance to see the exception_handler() snippet, it makes sense why an error was getting logged, but you weren't seeing it in the New Relic UI.
>
> The way that the Python agent works is that it is designed to capture and report unhandled exceptions. It doesn't capture handled exceptions. In your case, you are handling the exception yourself, logging the traceback, and generating your own 500 error code response, so there isn't an exception for the agent to capture.
>
> Typically, an unhandled exception results in a 500 error code response. So, it may look like the agent captures all errors that occur when a 500 response is sent. However, it's the unhandled exception that triggers the recording of the exception, not the 500 response code itself.
>
> As you've experienced, it's possible for an application to send a 500 error code response without any exception being captured at all.
>
> If you would like to both a) handle the exception, and b) record the exception in New Relic, I suggest using our agent API inside your exception handler to record it:
>
> newrelic.agent.record_exception(exc=None, value=None, tb=None)
> You can read more about how to record exceptions here: https://docs.newrelic.com/docs/agents/python-agent/customization-extension/python-transaction-api#record_exception
Mauro, I see you added `treeherder/webapp/api/exceptions.py` - is the above expected?
It seems like they could still be handling django rest framework better for this?
Flags: needinfo?(mdoglio)
| Assignee | ||
Comment 14•10 years ago
|
||
(In reply to Ed Morley [:emorley] from comment #13)
> It seems like they could still be handling django rest framework better for
> this?
I've tested with the custom exception_handler() removed (https://github.com/mozilla/treeherder/compare/test-rm-custom-exception-handler) and the ValueError exception now does appear in New Relic:
https://rpm.newrelic.com/accounts/677903/applications/7385291/traced_errors/70b0cd-fdad297d-473a-11e5-97ea-b82a72d22a14
So this is specific to our custom exception_handler() rather than the default django rest framework behaviour.
| Reporter | ||
Comment 15•10 years ago
|
||
I wrote the custom exception handler to convert a couple of common exceptions DatasetNotFoundError and ObjectNotFoundException to something meaningful in terms of http response code. The alternative is to remove the custom handler and either let the exception propagate as it is or catch it in the view code and convert it where appropriate. I'm ok with both, although I would be in favor of the latter.
Flags: needinfo?(mdoglio)
| Assignee | ||
Comment 16•10 years ago
|
||
(In reply to Mauro Doglio [:mdoglio] from comment #15)
> I wrote the custom exception handler to convert a couple of common
> exceptions DatasetNotFoundError and ObjectNotFoundException to something
> meaningful in terms of http response code. The alternative is to remove the
> custom handler and either let the exception propagate as it is or catch it
> in the view code and convert it where appropriate. I'm ok with both,
> although I would be in favor of the latter.
Would you mind taking a look at this, since you have a solution in mind?
I've also commented on the New Relic ticket asking if they would be up for adding specific support for the Django rest framework (ie hooking into exception_handler and grabbing the exceptions from there).
Assignee: emorley → nobody
Status: ASSIGNED → NEW
| Assignee | ||
Comment 17•10 years ago
|
||
Actually, I'll file a new bug for the exception_handler changes, since patches landed here.
Assignee: nobody → emorley
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Summary: Errors in web transactions are not reported to newrelic → Fix the way we use the New Relic Python agent with gunicorn
| Assignee | ||
Updated•10 years ago
|
Component: Treeherder: API → Treeherder
| Assignee | ||
Comment 18•10 years ago
|
||
(In reply to Ed Morley [:emorley] from comment #17)
> Actually, I'll file a new bug for the exception_handler changes, since
> patches landed here.
Bug 1201086.
You need to log in
before you can comment on or make changes to this bug.
Description
•