Open
Bug 1152896
(treeherder-nr-exceptions)
Opened 9 years ago
Updated 5 years ago
[Meta] Drive the New Relic exception rate down
Categories
(Tree Management :: Treeherder: Infrastructure, defect, P3)
Tree Management
Treeherder: Infrastructure
Tracking
(Not tracked)
NEW
People
(Reporter: emorley, Unassigned)
References
(Depends on 3 open bugs)
Details
(Keywords: meta)
Reporter • Updated 9 years ago
Reporter • Comment 1 • 9 years ago
In the last 7 days:

parse-log celery.exceptions:Retry, 14,486 occurrences (max 10 retries per log)
https://rpm.newrelic.com/accounts/677903/applications/4180461/traced_errors/3531807536/similar_errors?original_error_id=3531807536
-> Mainly HTTPError(), one SSLError('The read operation timed out',)
...we should see if we can get these to display more intelligently. They are the individual retries (so we don't want to suppress them), but they are currently all lumped together rather than split by reason.

parse-log urllib2:HTTPError, 1,496 occurrences
https://rpm.newrelic.com/accounts/677903/applications/4180461/traced_errors/3531874326/similar_errors?original_error_id=3531874326
-> all: HTTP Error 404: Not Found
...I've landed an improvement to the logging in bug 1152769 which will give us the info we need to track these down.

fetch-buildapi-running ssl:SSLError, 66 occurrences
https://rpm.newrelic.com/accounts/677903/applications/4180461/traced_errors/3531187577/similar_errors?original_error_id=3531187577
-> Combination of "The read operation timed out" and "_ssl.c:495: The handshake operation timed out"

submit-bug-comment celery.exceptions:Retry, 50 occurrences
https://rpm.newrelic.com/accounts/677903/applications/4180461/traced_errors/3519054823/similar_errors?original_error_id=3519054823
-> all: 401 Client Error: Authorization Required
...presumably bug 1142258

fetch-buildapi-pending ssl:SSLError, 49 occurrences
https://rpm.newrelic.com/accounts/677903/applications/4180461/traced_errors/3531904278/similar_errors?original_error_id=3531904278
...same as the fetch-buildapi-running one above, but for pending

fetch-buildapi-build4h ssl:SSLError, 42 occurrences
https://rpm.newrelic.com/accounts/677903/applications/4180461/traced_errors/3531769620/similar_errors?original_error_id=3531769620
...ditto, but for build4h

parse-json-log celery.exceptions:Retry, 20 occurrences
https://rpm.newrelic.com/accounts/677903/applications/4180461/traced_errors/3528554684/similar_errors?original_error_id=3528554684
-> all: MemoryError() (apart from one timeout similar to the above)
...bug 1152742.

process-objects exceptions:TypeError, 17 occurrences
https://rpm.newrelic.com/accounts/677903/applications/4180461/traced_errors/3531829249/similar_errors?original_error_id=3531829249
-> 'NoneType' object has no attribute '__getitem__'
...needs a bug filed.

fetch-buildapi-build4h treeherder.etl.mixins:CollectionNotLoadedException, 9 occurrences
https://rpm.newrelic.com/accounts/677903/applications/4180461/traced_errors/3528253581/similar_errors?original_error_id=3528253581
-> e.g. "[try] Error posting data to objectstore: <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"><html><head><meta http-equiv="Content-Type" content="text/html;charset=utf-8"><title>Service Unavailable</title><style type="text/css">body, p, h1 { font-family: Verdana, Arial, Helvetica, sans-serif;}h2 { font-family: Arial, Helvetica, sans-serif; color: #b10b29;}</style></head><body><h2>Service Unavailable</h2><p>The service is temporarily unavailable. Please try again later.</p></body></html>"
...needs a bug filed.

parse-log exceptions:UnicodeDecodeError, 6 occurrences
https://rpm.newrelic.com/accounts/677903/applications/4180461/traced_errors/3527093001/similar_errors?original_error_id=3527093001
-> "'utf8' codec can't decode byte 0xe0 in position 156: invalid continuation byte"
...bug 1091759

submit-bug-comment requests.exceptions:HTTPError, 4 occurrences
https://rpm.newrelic.com/accounts/677903/applications/4180461/traced_errors/3519108386/similar_errors?original_error_id=3519108386
...the counterpart to the submit-bug-comment retry one above.

fetch-buildapi-running urllib2:URLError, 2 occurrences
https://rpm.newrelic.com/accounts/677903/applications/4180461/traced_errors/3529398646/similar_errors?original_error_id=3529398646
-> "<urlopen error timed out>" and "<urlopen error _ssl.c:495: The handshake operation timed out>"

(There were also a few other single-occurrence exceptions that are likely just infra-blip related.)
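Several entries above (the lumped-together parse-log retries, the 404s, the 401s on submit-bug-comment, the SSL and handshake timeouts, and the MemoryError retries) come down to two questions: what was the underlying reason, and is another attempt worth making? A minimal, stdlib-only sketch of that classification follows. `classify_failure` and `count_by_reason` are hypothetical names, not Treeherder, Celery or New Relic APIs, and the split between permanent and transient errors is an assumption.

```python
import socket
import ssl
import urllib.error
from collections import Counter


def classify_failure(exc):
    """Hypothetical helper: return (reason, retryable) for a task exception.

    `reason` is a short, stable label that could be used to split retries
    by cause; `retryable` says whether another attempt might succeed.
    """
    # HTTPError subclasses URLError, so it must be checked first.
    if isinstance(exc, urllib.error.HTTPError):
        # 4xx (e.g. 404 Not Found, 401 Authorization Required) will not
        # change on retry; 5xx (e.g. 503 Service Unavailable) might.
        return "http-%d" % exc.code, exc.code >= 500
    if isinstance(exc, socket.timeout):
        # Read timeouts such as "The read operation timed out".
        return "timeout", True
    if isinstance(exc, ssl.SSLError):
        # Assumed transient, e.g. "The handshake operation timed out".
        return "ssl-error", True
    if isinstance(exc, urllib.error.URLError):
        # Other network-level failures, e.g. "<urlopen error timed out>".
        return "url-error", True
    if isinstance(exc, MemoryError):
        # Re-parsing the same oversized log is unlikely to fit next time.
        return "memory-error", False
    # Anything else: label by type name and do not retry.
    return type(exc).__name__, False


def count_by_reason(excs):
    """Tally exceptions by reason instead of lumping them together."""
    return Counter(classify_failure(exc)[0] for exc in excs)
```

Under the same assumptions, a task could skip retrying when `retryable` is false, and pass a reason-specific exception as `exc` to Celery's `Task.retry()`; whether New Relic then groups the retries by that reason is untested.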
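For the parse-log UnicodeDecodeError entry, one common mitigation is to decode log bytes with a non-strict error handler, so that a single invalid byte (such as 0xe0 not followed by a valid continuation byte) does not abort the whole parse. This is a hedged sketch using Python's standard `errors="replace"` codec handler; the helper names are hypothetical, and it is not necessarily what bug 1091759 ended up doing.

```python
def decode_log_line(raw):
    """Hypothetical helper: decode one raw log line as UTF-8.

    Invalid byte sequences become U+FFFD instead of raising
    UnicodeDecodeError.
    """
    return raw.decode("utf-8", errors="replace")


def iter_log_lines(data):
    """Hypothetical helper: yield decoded lines from a raw log blob."""
    for raw in data.splitlines():
        yield decode_log_line(raw)
```

`errors="backslashreplace"` is an alternative that keeps the offending byte visible (as `\xe0`) rather than replacing it with U+FFFD.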
Reporter • Updated 9 years ago
Reporter • Comment 2 • 8 years ago
The New Relic exception rate on production is higher than normal at the moment. The KeyError exceptions should now be fixed on master, but there are still lots of others. Please could everyone take a look and see if anything obvious stands out? (I see exceptions relating to auto-classification, Perfherder, and possibly fallout from the refdata datasource changes.)

See: https://rpm.newrelic.com/accounts/677903/applications/4180461/traced_errors
(It defaults to a 30-minute window; it's best to switch to the 24-hour or 3-day view to catch the periodic tasks.)

Many thanks! :-)
Comment 3 • 8 years ago
The main autoclassification intermittent is also fixed on master. (FWIW, I think the filterable_errors view, i.e. the Error Analytics panel, is a big improvement.)
Reporter • Updated 8 years ago
Alias: treeherder-nr-exceptions
Reporter • Updated 7 years ago