Closed Bug 1232776 Opened 9 years ago Closed 8 years ago

Treeherder's own resultset ingestion is causing 17 million HTTP 429 responses/week

Categories

(Tree Management :: Treeherder: API, defect, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: emorley, Assigned: emorley)

Attachments

(1 file)

47 bytes, text/x-github-pull-request
emorley
: review+
Treeherder prod has returned 16.5 million HTTP 429 responses, compared to 8 million HTTP 200s, in the last 7 days:

https://insights.newrelic.com/accounts/677903/explorer?eventType=Transaction&timerange=week&filters=%255B%257B%2522key%2522%253A%2522appName%2522%252C%2522value%2522%253A%2522treeherder-prod%2522%257D%255D&facet=response.status

These are coming from Treeherder's resultset ingestion, e.g.:
https://rpm.newrelic.com/accounts/677903/applications/4180461/filterable_errors#/show/4e9295-d92ce55a-a36b-11e5-bb1e-b82a72d22a14/stack_trace?top_facet=transactionUiName&bottom_facet=host&primary_facet=error.class&_k=0e2tld

There are a few issues:
1) treeherder-client doesn't honour HTTP 429s, so once we hit the rate limit it carries on submitting, which only makes things worse.

2) The "fetch missing pushlog" task then compounds the problem: if the main pushlog task hits a 429, we schedule many "fetch missing" tasks, increasing load further.

3) We still see lots of junk revisions due to bad data (as in bug 1090289).
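
To illustrate what "honouring" a 429 in issue 1 could look like, here is a minimal sketch, not the actual treeherder-client code. It assumes a hypothetical `send()` callable that returns `(status_code, headers, body)`, reads the standard `Retry-After` header when it carries a number of seconds, and otherwise falls back to exponential backoff. The names `RateLimited`, `retry_delay` and `send_with_backoff`, and the default delays, are made up for this example.

```python
import time


class RateLimited(Exception):
    """Raised when the server still answers HTTP 429 after all retries."""


def retry_delay(headers, attempt, base=1.0, cap=60.0):
    """Return the seconds to wait before retrying (attempt is 0-based).

    Prefers a numeric Retry-After header; the HTTP-date form is not
    handled in this sketch and falls through to exponential backoff.
    Assumes `headers` is a plain dict keyed by "Retry-After".
    """
    value = headers.get("Retry-After")
    if value is not None:
        try:
            return min(cap, max(0.0, float(value)))
        except ValueError:
            pass
    return min(cap, base * 2 ** attempt)


def send_with_backoff(send, max_retries=5, sleep=time.sleep):
    """Call send() until it returns something other than HTTP 429.

    `send` is a hypothetical callable returning (status, headers, body).
    `sleep` is injectable so the waiting can be stubbed out in tests.
    """
    for attempt in range(max_retries + 1):
        status, headers, body = send()
        if status != 429:
            return status, body
        if attempt < max_retries:
            sleep(retry_delay(headers, attempt))
    raise RateLimited("still throttled after %d retries" % max_retries)
```

A caller that catches `RateLimited` could also treat it as "try again later" rather than as missing data, which is one way to avoid scheduling extra follow-up tasks as described in issue 2.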
Meant to say:
* These are what are filling up the logs/disk in bug 1229020
* Not sure why this suddenly started a few weeks ago. Have we broken something, or were we just already very close to the rate limit?
Depends on: 1232781
Attached file PR
This temporarily increases the rate limit on /resultset/ from 220 -> 400, until we can address the root causes.
Attachment #8698659 - Flags: review?(emorley)
Attachment #8698659 - Flags: review?(emorley) → review+
Commit pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/d994fd8194dbacb66dd3258e030f05fe94ebf0af
Bug 1232776 - Temporarily increase rate limit on /resultset/ endpoint

For probably not-so-great reasons, we seem to be hitting the limits of
what we can submit in production. Until we've addressed the root causes,
let's temporarily increase things, since the rate limiting is causing
more problems than it's solving.
Depends on: 1234241
The dependent bugs have resolved the issue; there's more to do in bug 1191934 et al., but this bug can be closed for now.
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED