Treeherder's own resultset ingestion is causing 17 million HTTP 429 responses/week

RESOLVED FIXED

Status

Product: Tree Management
Component: Treeherder: API
Priority: P1
Severity: normal
Status: RESOLVED FIXED
Reported: 2 years ago
Modified: 2 years ago

People

(Reporter: emorley, Assigned: emorley)

Tracking

(Depends on: 1 bug)

Details

Attachments

(1 attachment)

(Assignee)

Description

2 years ago
In the last 7 days, treeherder prod has returned 16.5 million HTTP 429 responses, compared to 8 million HTTP 200s:

https://insights.newrelic.com/accounts/677903/explorer?eventType=Transaction&timerange=week&filters=%255B%257B%2522key%2522%253A%2522appName%2522%252C%2522value%2522%253A%2522treeherder-prod%2522%257D%255D&facet=response.status

These are coming from Treeherder's resultset ingestion:

e.g.:
https://rpm.newrelic.com/accounts/677903/applications/4180461/filterable_errors#/show/4e9295-d92ce55a-a36b-11e5-bb1e-b82a72d22a14/stack_trace?top_facet=transactionUiName&bottom_facet=host&primary_facet=error.class&_k=0e2tld

There are a few issues:
1) treeherder-client doesn't honour HTTP 429s, so once we hit the rate limit, the situation just gets worse

2) The "fetch missing pushlog" task then makes things even worse: if the main pushlog task hits a 429, we schedule many "fetch missing" tasks, increasing load further

3) We still see lots of junk revisions due to bad data (a la bug 1090289)
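
For issue 1, a minimal sketch of what honouring a 429 could look like on the client side. This is not treeherder-client's actual code: `send_with_backoff`, `retry_delay` and `RateLimited` are hypothetical names, `send` stands in for whatever performs the request, and it assumes the server may send a `Retry-After` header given in seconds.

```python
import time


class RateLimited(Exception):
    """Raised when the server still answers HTTP 429 after all attempts."""


def retry_delay(headers, attempt, base=1.0, cap=60.0):
    """Return the number of seconds to wait before the next attempt.

    Prefers the Retry-After header (seconds form, assumed to be present
    on some responses); otherwise uses capped exponential backoff.
    """
    value = headers.get("Retry-After")
    if value is not None:
        try:
            return max(0.0, min(float(value), cap))
        except ValueError:
            pass  # the HTTP-date form of Retry-After is not handled here
    return min(base * (2 ** attempt), cap)


def send_with_backoff(send, max_attempts=5, sleep=time.sleep):
    """Call send() until it returns a non-429 status or attempts run out.

    `send` is a hypothetical callable returning (status_code, headers).
    """
    for attempt in range(max_attempts):
        status, headers = send()
        if status != 429:
            return status
        if attempt < max_attempts - 1:
            # Back off instead of immediately adding to the load.
            sleep(retry_delay(headers, attempt))
    raise RateLimited("still rate limited after %d attempts" % max_attempts)
```

Under the same assumptions, a caller that catches `RateLimited` could give up for that run rather than scheduling follow-up "fetch missing" tasks, which would avoid the extra load described in issue 2.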
(Assignee)

Comment 1

2 years ago
Meant to say:
* These are what are filling up the logs/disk in bug 1229020
* Not sure why this suddenly started a few weeks ago. Have we broken something, or were we just really close to the rate limit?
(Assignee)

Updated

2 years ago
Depends on: 1232781
Created attachment 8698659
PR

This temporarily increases the rate limit on /resultset/ from 220 -> 400, until we can address the root causes.
Attachment #8698659 - Flags: review?(emorley)
(Assignee)

Updated

2 years ago
Attachment #8698659 - Flags: review?(emorley) → review+

Comment 3

2 years ago
Commit pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/d994fd8194dbacb66dd3258e030f05fe94ebf0af
Bug 1232776 - Temporarily increase rate limit on /resultset/ endpoint

For probably not-so-great reasons, we seem to be hitting the limits of
what we can submit in production. Until we've addressed the root causes,
let's temporarily increase things, since the rate limiting is causing
more problems than it's solving.
(Assignee)

Updated

2 years ago
Depends on: 1234241
(Assignee)

Comment 4

2 years ago
The dependent bugs have resolved the issue; there's more to do in bug 1191934 et al., but this bug can be closed for now.
Status: ASSIGNED → RESOLVED
Last Resolved: 2 years ago
Resolution: --- → FIXED