Closed Bug 1501682 Opened 6 years ago Closed 6 years ago

Slow API response times due to higher number of requests to /jobs/ API than usual

Categories

(Tree Management :: Treeherder: Frontend, defect, P1)

defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: aryx, Assigned: camd)

References

(Blocks 1 open bug)

Details

Attachments

(3 files)

Today treeherder often feels slow when loaded. When I loaded 50 pushes (1 in each tab), many tabs took dozen of seconds until they showed the favicon. Some XHR actions like removing bug annotations from the annotations tab took several seconds in a way that I considered it broken. A few requests take very long, e.g. for https://treeherder.mozilla.org/api/project/autoland/jobs/?return_type=list&count=2000&push_id=394869&last_modified__gt=2018-10-24T14%3A30%3A43.289 Blocked: 1 ms DNS resolution: 1 ms Connecting: 0 ms TLS setup: 0 ms Sending: 0 ms Waiting: 10738 ms Receiving: 0 ms Similar ones before and after are in the range 200-500ms, few 4000+ ms.
Looking at New Relic, a fair bit of time is spent waiting for a free web dyno to serve the request ("request queueing"): https://screenshots.firefox.com/lrUPiyKhrbsPSaeQ/rpm.newrelic.com (https://rpm.newrelic.com/accounts/677903/applications/14179757?tw%5Bend%5D=1540394332&tw%5Bstart%5D=1540383532) The number of web dynos we have has not changed for some time. This means that either: * more people are using Treeherder, or are having multiple tabs open, or are having more pushes displayed at once * the quantity or type of requests that the UI is now making has changed (eg if there was a bug with polling frequency or ...) * the requests are somehow each taking longer to process, therefore hogging the workers for longer (eg DB slower to respond, API code regression, Heroku noisy neighbour, ...) Looking at the web transactions view for the last 3 hours: * 82% of the wall time is consumed by JobsViewSet.list (followed by PushViewSet.list at 12%) * looking instead at request throughput, JobsViewSet.list is again the leader at 1.1k/min, followed by PushViewSet.list at 169/min (https://rpm.newrelic.com/accounts/677903/applications/14179757/transactions?tw%5Bend%5D=1540394481&tw%5Bstart%5D=1540383681#id=&sort_by=throughput) Looking at JobsViewSet.list, the approx time-spent breakdown is: * Python code within JobsViewSet.list: 50% * MySQL job select: 24% * WSGI bits: 10% * other DB: ~8% Looking at New Relic Insights, a significant proportion (60+%) of API requests are coming from Windows User Agents, which is unexpected: https://insights.newrelic.com/accounts/677903/explorer/events?eventType=Transaction&duration=43200000&facet=user_agent Are the sheriffs using Windows machines? I wonder as well, how many of these requests come from Treeherder tabs that have been left open and forgotten about - and that have hundreds of pushes displayed/continually polling? Perhaps we could: * set a limit on the max number of pushes displayed at the same time (similar to bug 1107667) * have the polling automatically stop polling (or reduce frequency) for older pushesthat are now complete, or else tabs that have been open for days? (eg with a "Data refresh paused, refresh page to resume") Cameron, thoughts?
Flags: needinfo?(cdawson)
Thinking about the 1.1k/min request throughput, that's higher than I remember in the past, but still not a huge amount given the way we poll for pushes. ie I believe we poll once a minute for every visible push, and each push is queried separately. As such 1.1k/min means that 1100 pushes are open across all Treeherder tabs. And if users are using the "view next 50" pushes button, that only means 20 tabs. I suppose the real fix for this is just to not load each push's results separately.
So, one change that would fall into Ed's analysis here is that we now let each push poll for its own jobs. We used to do one http request for ALL pushes on the page and get any new jobs, then split the jobs up to each push and distribute them accordingly. Now, each push does its own polling/http request. This would account for the increase in requests. This new approach simplified some of the code, and I had made the assumption that it wouldn't make any difference in performance. I was wrong. :) I'll work to remedy this right away.
Heh, Ed and I were writing our comments at the same time. :) Agreed.
Flags: needinfo?(cdawson)
Assignee: nobody → cdawson
Status: NEW → ASSIGNED
(In reply to Cameron Dawson [:camd] from comment #3) > We used to do one http request for ALL pushes on the page Ah I didn't realise that was how we used to do it :-)
Commit pushed to master at https://github.com/mozilla/treeherder https://github.com/mozilla/treeherder/commit/d5b85be8968fea6d3dce6c7a50737c2e1e7934b1 Bug 1501682 - Poll for new jobs on all pushes at once Instead of polling for new jobs on each push, this will poll for new jobs on all loaded pushes in one http call, then distribute them to each Push.
This has been deployed to production. Please reopen or create a new bug if you notice more issue in this area.
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Blocks: 1450040
Component: Treeherder → Treeherder: Frontend
Priority: -- → P1
Summary: slow loading of both the treeherder page and some XHRs → Slow API response times due to higher number of requests to /jobs/ API than usual
This looks to be greatly improved after the change here (see screenshot). Previously there were between 900 and 1600 requests/min to JobsViewSet.list, now there are only 350-500. I've scaled the prod Heroku web dynos back down from 6 to 4.
Attached image Deployments overview
This can actually be seen on the deployments overview page (which shows the total requests throughput for the whole app). That deployment is: https://rpm.newrelic.com/accounts/677903/applications/14179757/deployments/18001385 Which occurred on 16th Oct - the day bug 1499551 was filed, which means the fix here should hopefully help with that too.
Attachment #9020370 - Attachment filename: Screenshot_2018-10-26 Deployments - treeherder-prod - New Relic.png → Deployments overview
Attachment #9020370 - Attachment description: Screenshot_2018-10-26 Deployments - treeherder-prod - New Relic.png → Deployments overview
Attachment #9020370 - Attachment filename: Deployments overview → Deployments-overview.png
Blocks: 1499551
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: