Closed Bug 1387543 Opened 7 years ago Closed 1 year ago

Set up Papertrail alerts for errors that don't appear in New Relic

Categories

(Tree Management :: Treeherder: Infrastructure, enhancement, P3)

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: emorley, Unassigned)

Certain failure modes, by their nature, can never be reported in New Relic, but they do still appear in Papertrail. For example:
* Heroku killing a dyno for consuming too much RAM (aka Error R15) - which is actually happening for cycle_data right now (bug ).
* Hitting hard Celery time limits (e.g. bug 1387536).

We can set up Papertrail alerts (emailed to treeherder-internal@) that list these errors so we can actually know they are occurring.

The Papertrail docs have some suggested search terms for Heroku errors:
http://help.papertrailapp.com/kb/hosting-services/heroku/#4-add-heroku-searches-optional

...however my experimentation with them so far has found they result in many false positives, such as:

"""
Aug 03 05:09:08 treeherder-prod app/worker_log_parser.1: [2017-08-03 04:09:08,586: WARNING/Worker-278] Data collector is not contactable. This can be because of a network issue or because of the data collector being restarted. In the event that contact cannot be made after a period of time then please report this problem to New Relic support for further investigation. The error raised was ConnectionError(ProtocolError('Connection aborted.', BadStatusLine("''",)),). 
"""

...so I'll need to try out some variations before sending to treeherder-internal@, to keep the noise down.
I've filed a New Relic ticket for the data collector noisy warning:
https://support.newrelic.com/tickets/256725/edit
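
To illustrate the false positive: one of the suggested Heroku search terms is "error R", and a case-insensitive match on it also hits "error raised" in the New Relic warning. Here is a rough Python sketch of that, assuming the search behaves like a case-insensitive substring match (an assumption about Papertrail, not something verified here). The R15 sample line and the stricter regex are purely illustrative; the regex is not Papertrail search syntax.

```python
import re

# Log line quoted earlier in this bug (message body abbreviated here).
NEW_RELIC_WARNING = (
    "Aug 03 05:09:08 treeherder-prod app/worker_log_parser.1: "
    "[2017-08-03 04:09:08,586: WARNING/Worker-278] Data collector is not "
    "contactable. The error raised was ConnectionError(...)."
)

# Illustrative sample only: the timestamp and dyno name are made up.
HEROKU_R15 = (
    "Aug 03 05:10:00 treeherder-prod heroku/scheduler.1234: "
    "Error R15 (Memory quota vastly exceeded)"
)


def loose_match(line, term="error R"):
    """Assumed search behaviour: case-insensitive substring match."""
    return term.lower() in line.lower()


# Hypothetical stricter pattern: "Error R" must be followed by digits.
STRICT_PATTERN = re.compile(r"\bError R\d+\b")


def strict_match(line):
    """Match only Heroku-style runtime error codes such as 'Error R15'."""
    return STRICT_PATTERN.search(line) is not None


for name, line in [("New Relic warning", NEW_RELIC_WARNING), ("Heroku R15", HEROKU_R15)]:
    print(f"{name}: loose={loose_match(line)} strict={strict_match(line)}")
```

The loose term matches both lines, while the stricter pattern matches only the Heroku line, so any search variation that requires the numeric code after "R" should avoid this particular source of noise.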

```
Title: Noisy "Data collector is not contactable" Python agent warnings 

Hi!

We're using the New Relic Python agent v2.90.0.75 on several Heroku apps, which report to a standalone New Relic account.

These apps send their logs to Papertrail, where we have alerts set up that attempt to match against various Heroku error strings (such as a dyno being killed due to excessive RAM usage - something that New Relic understandably can't detect itself). The search terms for these alerts come from:
http://help.papertrailapp.com/kb/hosting-services/heroku/

However, multiple times a day these alerts generate false-positive log matches of the form:

"""
Aug 05 10:05:45 treeherder-stage app/worker_log_parser.7: [2017-08-05 09:05:45,561: WARNING/Worker-158] Data collector is not contactable. This can be because of a network issue or because of the data collector being restarted. In the event that contact cannot be made after a period of time then please report this problem to New Relic support for further investigation. The error raised was ConnectionError(ProtocolError('Connection aborted.', BadStatusLine("''",)),).
"""

This is due to the "error R" alert term matching against "error raised" in that message.

It would be helpful if these warning messages could be made less noisy, for example by making the Python client output a warning only once the second or third submission has failed, rather than immediately after the first failure.

Alternatively, making the collector more reliable, or removing the term "error" from the message output would prevent us seeing so many noisy alerts.

Many thanks!
```
Their reply:

"""
Thank you for reaching out to us. Generally the error you are seeing is due to temporary network instabilities and I can see how it could add to the verbosity of your Heroku logs. Especially when parsing for specific elements for other alerting paradigms.

The best option to mitigate the appearance of these messages would be to set the NEW_RELIC_LOG_LEVEL environment variable to either critical or error. Since this message is logged at a warn level you should not see them at these verbosities. Please let us know if this continues to be problematic with this change.
"""


My reply to that:

"""
We can try setting `NEW_RELIC_LOG_LEVEL` to critical or error, however that will presumably suppress other important warnings.

I think in this case perhaps `warning` isn't the right level for this log output, and perhaps it should be `info` or `debug` in the "retrying but haven't hit max retries yet" case?
"""
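
For reference, the behaviour I'm suggesting looks roughly like the sketch below: intermediate failures are logged at `debug`, and a `warning` is emitted only once the final retry has failed. This is not the New Relic agent's actual code; `submit_with_retries`, `send` and `max_attempts` are hypothetical names used only for illustration.

```python
import logging


def submit_with_retries(send, max_attempts=3, logger=None):
    """Call send() up to max_attempts times; warn only on the final failure.

    Hypothetical helper, not part of the New Relic Python agent.
    """
    logger = logger or logging.getLogger("collector_sketch")
    for attempt in range(1, max_attempts + 1):
        try:
            return send()
        except ConnectionError as exc:
            # Earlier failures are likely transient, so keep them at DEBUG;
            # only the last attempt is worth a WARNING.
            level = logging.WARNING if attempt == max_attempts else logging.DEBUG
            logger.log(
                level,
                "Data collector not contactable (attempt %d of %d): %r",
                attempt,
                max_attempts,
                exc,
            )
    return None
```

With this shape, a single dropped connection followed by a successful retry produces no output at the default `warning` level, so alerts would only fire when the collector is unreachable for several attempts in a row.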
Their reply:

"""
I spoke with our Python devs and we hear what you are saying. We've put in a ticket to update the wording of that message from 'error raised' to 'exception raised', removing the term 'error' which should hopefully quiet the noise a bit for you. I don't have a timeline for when that will be updated, but we'll let you know when it is. As usual, always great to hear your thoughts on how we can make things better. 
"""
(In reply to Ed Morley [:emorley] from comment #3)
> Their reply:
> 
> """
> I spoke with our Python devs and we hear what you are saying. We've put in a
> ticket to update the wording of that message from 'error raised' to
> 'exception raised', removing the term 'error' which should hopefully quiet
> the noise a bit for you. 

This change is in the latest release:
https://github.com/edmorley/newrelic-python-agent/commit/c7067d55f0ea7a646e9ef066dfde7adcafc5bf33#diff-7d14db7f763b7a6a97639da1fb45f2f4L484

Which we've updated to in:
https://github.com/mozilla/treeherder/pull/2758
Assignee: emorley → nobody
Status: ASSIGNED → NEW
Priority: P1 → P3

cleaning up old bugs

Status: NEW → RESOLVED
Closed: 1 year ago
Resolution: --- → WONTFIX