Open Bug 1152896 (treeherder-nr-exceptions) Opened 7 years ago Updated 3 years ago

[Meta] Drive the New Relic exception rate down

Categories

(Tree Management :: Treeherder: Infrastructure, defect, P3)

Tracking

(Not tracked)

People

(Reporter: emorley, Unassigned)

References

(Depends on 3 open bugs)

Details

(Keywords: meta)

In the last 7 days:

parse-log
celery.exceptions:Retry
14,486 occurrences (max 10 retries per log)
https://rpm.newrelic.com/accounts/677903/applications/4180461/traced_errors/3531807536/similar_errors?original_error_id=3531807536
-> Mainly HTTPError(), one SSLError('The read operation timed out',)
...we should see if we can get these to display more intelligently: they are the individual retries (so we don't want to suppress them), but they are currently all lumped together rather than split by reason.
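One way the retries could be split by reason, sketched under assumptions: derive a grouping key from the underlying exception rather than from the Retry wrapper. The helper name `retry_reason` and the stand-in exception classes are hypothetical and not taken from the Treeherder code; how such a key would be handed to New Relic is deliberately left out.

```python
from collections import Counter


def retry_reason(exc):
    """Return a 'module:ClassName' key for the exception that caused a retry.

    Hypothetical helper: it mirrors the 'module:Class' naming that
    New Relic shows in the list above.
    """
    cls = type(exc)
    return "{}:{}".format(cls.__module__, cls.__name__)


# Stand-in exceptions, for illustration only.
class HTTPError(Exception):
    pass


class SSLError(Exception):
    pass


causes = [
    HTTPError("HTTP Error 404: Not Found"),
    SSLError("The read operation timed out"),
    HTTPError("HTTP Error 404: Not Found"),
]

# Count retries per underlying cause instead of one lumped 'Retry' bucket.
counts = Counter(retry_reason(exc) for exc in causes)
```

Grouping on the cause keeps every retry visible while still separating the HTTP failures from the SSL timeouts.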

parse-log
urllib2:HTTPError
1,496 occurrences
https://rpm.newrelic.com/accounts/677903/applications/4180461/traced_errors/3531874326/similar_errors?original_error_id=3531874326
-> all: HTTP Error 404: Not Found
...I've landed an improvement to the logging in bug 1152769 which will give us the info we need to track these down.

fetch-buildapi-running
ssl:SSLError
66 occurrences
https://rpm.newrelic.com/accounts/677903/applications/4180461/traced_errors/3531187577/similar_errors?original_error_id=3531187577
-> Combination of "The read operation timed out" and "_ssl.c:495: The handshake operation timed out"

submit-bug-comment
celery.exceptions:Retry
50 occurrences
https://rpm.newrelic.com/accounts/677903/applications/4180461/traced_errors/3519054823/similar_errors?original_error_id=3519054823
-> all: 401 Client Error: Authorization Required
...presume bug 1142258

fetch-buildapi-pending
ssl:SSLError
49 occurrences
https://rpm.newrelic.com/accounts/677903/applications/4180461/traced_errors/3531904278/similar_errors?original_error_id=3531904278
...same as the fetch-buildapi-running one above, but for pending

fetch-buildapi-build4h
ssl:SSLError
42 occurrences
https://rpm.newrelic.com/accounts/677903/applications/4180461/traced_errors/3531769620/similar_errors?original_error_id=3531769620
...ditto but for build4h

parse-json-log
celery.exceptions:Retry
20 occurrences
https://rpm.newrelic.com/accounts/677903/applications/4180461/traced_errors/3528554684/similar_errors?original_error_id=3528554684
-> all: MemoryError() (apart from one timeout similar to the above)
...bug 1152742.

process-objects
exceptions:TypeError
17 occurrences
https://rpm.newrelic.com/accounts/677903/applications/4180461/traced_errors/3531829249/similar_errors?original_error_id=3531829249
-> 'NoneType' object has no attribute '__getitem__'
...need a bug filed.
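For reference, this is the message Python 2 reports when code indexes a value that is None (Python 3 words it as "'NoneType' object is not subscriptable"). A minimal sketch of the failure mode and a guard follows; the `job` payload and the `get_result` helper are hypothetical and not taken from the process-objects code.

```python
def get_result(job):
    """Return job['result'], or None when the job payload itself is missing."""
    if job is None:
        return None
    return job.get("result")


# Indexing a None value directly is what raises the TypeError.
missing = None
try:
    missing["result"]
except TypeError as exc:
    message = str(exc)
```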

fetch-buildapi-build4h
treeherder.etl.mixins:CollectionNotLoadedException
9 occurrences
https://rpm.newrelic.com/accounts/677903/applications/4180461/traced_errors/3528253581/similar_errors?original_error_id=3528253581
-> eg "[try] Error posting data to objectstore: <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"><html><head><meta http-equiv="Content-Type" content="text/html;charset=utf-8"><title>Service Unavailable</title><style type="text/css">body, p, h1 { font-family: Verdana, Arial, Helvetica, sans-serif;}h2 { font-family: Arial, Helvetica, sans-serif; color: #b10b29;}</style></head><body><h2>Service Unavailable</h2><p>The service is temporarily unavailable. Please try again later.</p></body></html>"
...needs a bug filed.

parse-log
exceptions:UnicodeDecodeError
6 occurrences
https://rpm.newrelic.com/accounts/677903/applications/4180461/traced_errors/3527093001/similar_errors?original_error_id=3527093001
-> "'utf8' codec can't decode byte 0xe0 in position 156: invalid continuation byte"
...bug 1091759
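For reference, this error arises when a byte such as 0xe0, which starts a multi-byte UTF-8 sequence, is followed by a byte that is not a continuation byte, typically because the log line is in another encoding. A sketch of strict versus lenient decoding follows; the sample bytes are made up, and whether substituting a replacement character is acceptable for log parsing is an assumption.

```python
raw = b"caf\xe0 ok"  # made-up sample: 0xe0 followed by a space

# Strict decoding raises on the stray 0xe0 byte.
try:
    raw.decode("utf-8")
    strict_error = None
except UnicodeDecodeError as exc:
    strict_error = exc

# errors="replace" substitutes U+FFFD for the undecodable byte instead of raising.
lenient = raw.decode("utf-8", errors="replace")
```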

submit-bug-comment
requests.exceptions:HTTPError
4 occurrences
https://rpm.newrelic.com/accounts/677903/applications/4180461/traced_errors/3519108386/similar_errors?original_error_id=3519108386
...is the counterpart to the submit-bug-comment retry one above.

fetch-buildapi-running
urllib2:URLError
2 occurrences
https://rpm.newrelic.com/accounts/677903/applications/4180461/traced_errors/3529398646/similar_errors?original_error_id=3529398646
-> "<urlopen error timed out>" and "<urlopen error _ssl.c:495: The handshake operation timed out>"

(and a few other single-occurrence exceptions that are likely just infra blips)
Depends on: 1154248
Depends on: 1154249
No longer depends on: 1154249
Depends on: 1155647
Depends on: 1155702
Depends on: 1159934
No longer depends on: 1154248
Depends on: 1165335
Depends on: 1205049
Depends on: 1220418, 1220427, 1213939
Depends on: 1224931
Depends on: 1268676
The New Relic exception rate on production is higher than normal at the moment. The KeyError exceptions should now be fixed on master, but there are still lots of others. Please could everyone take a look and see if anything obvious stands out? (I see exceptions there relating to auto-classification, perfherder, and possibly some fallout from the refdata datasource changes.)

See:
https://rpm.newrelic.com/accounts/677903/applications/4180461/traced_errors
(It defaults to 30 minutes; best to switch to the 24 hour or 3 day view to catch the periodic tasks.)

Many thanks! :-)
Depends on: 1265188
The main autoclassification intermittent is also fixed on master.

(fwiw I think the filterable_errors view, i.e. the Error Analytics panel, is a big improvement.)
Depends on: 1272532
Depends on: 1277499
Depends on: 1233164
Depends on: 1277506
Depends on: 1277575
Depends on: 1281808
Depends on: 1281809
Depends on: 1281810
Depends on: 1283413
Depends on: 1283505
Depends on: 1283856
Depends on: 1283859
Depends on: 1283146
Depends on: 1284360
Depends on: 1284418
Depends on: 1284429
Depends on: 1284432
Depends on: 1287111
Depends on: 1287113
Depends on: 1287930
Depends on: 1288202
Depends on: 1289354
Depends on: 1289404
Depends on: 1295536
Depends on: 1300789
Depends on: 1301698
Depends on: 1301700
Depends on: 1301702
Depends on: 1306580
Depends on: 1308122
Depends on: 1308123
Depends on: 1308166
Alias: treeherder-nr-exceptions
Depends on: 1310053
Depends on: 1311974
Depends on: 1311976
Depends on: 1311977
Depends on: 1311980
Depends on: 1311982
Depends on: 1368982, 1368984, 1368985
Depends on: 1368988
Depends on: 1368989
Depends on: 1368991
Depends on: 1380450
Depends on: 1416001
Depends on: 1490741
Depends on: 1431085
Depends on: 1514412
Keywords: meta