Closed Bug 1504569 Opened 7 years ago Closed 7 years ago

Periodic spikes in Treeherder response times until Heroku web dynos restarted

Categories

(Tree Management :: Treeherder: Infrastructure, defect, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: apavel, Assigned: emorley)

References

(Blocks 1 open bug)

Details

Attachments

(4 files)

When trying to reload treeherder pages we receive the following: https://irccloud.mozilla.com/file/seayO0m2/image.png Sometimes there is an error message before "Error to retrieve push data", sometimes it just directly displays the heroku related error.
Ed, Cameron, can you please take a look?
Flags: needinfo?(emorley)
Flags: needinfo?(cdawson)
Trees are closed for this, except for try.
Severity: normal → blocker
I've escalated to the developers so far, with :wlach responding, although he is no longer part of the project, to take a look at the issues
Not seeing any obvious cause from the logs. New relic doesn't see any obvious problem, no errors or tasks that are obviously taking too long. Heroku console is showing memory errors. As an interim fix, I swapped out the 2x dynos out for "performance-m" dynos. They have 2.5gb of memory as opposed to the 2x's (which we were using) 1gb. They are also 5x more expensive... but hopefully this will keep things going until someone from the treeherder team can have a look tomorrow. So far the response rate seems to have gone down to normal.
We've tested and seems that treeherder is back to normal for now. Changing the importance to major and opening the trees based on the above comment. If this reoccurs, we'll closed the trees and update the bug. Thanks William for helping out with this. Regards!
Severity: blocker → major
Flags: needinfo?(emorley)
From what I've found so far: * starting right at the time of the Heroku daily restart event marker on the Heroku metrics page(21:54 UTC), the average response time graph spikes * breaking down response time by host on New Relic shows all the slow requests to be served by just one dyno, `web.2` * inspecting the logs for requests served by that dyno (see below) shows even static requests to be served really slowly, suggesting an issue with that dyno/dyno host specifically * I believe Will's bumping of dyno size from P2 to Pm worked because it forced the dyno to be recreated (likely on another host) - however using the "restart all dynos" option would do this too * this "response time graph cliff" after a daily restart also occurred a few weeks ago tl;dr: It seems like it might be a Heroku issue, so we should file a ticket. Excerpt of web.2 dyno Papertrail logs: Nov 04 21:54:03 treeherder-prod heroku/web.2: Cycling Nov 04 21:54:03 treeherder-prod heroku/web.2: State changed from up to starting Nov 04 21:54:11 treeherder-prod heroku/web.2: Starting process with command `newrelic-admin run-program gunicorn treeherder.config.wsgi:application --timeout 20` Nov 04 21:54:14 treeherder-prod app/web.2: [2018-11-04 21:54:14 +0000] [4] [INFO] Starting gunicorn 19.9.0 Nov 04 21:54:14 treeherder-prod app/web.2: [2018-11-04 21:54:14 +0000] [4] [INFO] Listening at: http://0.0.0.0:8496 (4) Nov 04 21:54:14 treeherder-prod app/web.2: [2018-11-04 21:54:14 +0000] [4] [INFO] Using worker: sync Nov 04 21:54:14 treeherder-prod app/web.2: [2018-11-04 21:54:14 +0000] [13] [INFO] Booting worker with pid: 13 Nov 04 21:54:14 treeherder-prod app/web.2: [2018-11-04 21:54:14 +0000] [14] [INFO] Booting worker with pid: 14 Nov 04 21:54:14 treeherder-prod app/web.2: [2018-11-04 21:54:14 +0000] [15] [INFO] Booting worker with pid: 15 Nov 04 21:54:14 treeherder-prod app/web.2: [2018-11-04 21:54:14 +0000] [16] [INFO] Booting worker with pid: 16 Nov 04 21:54:14 treeherder-prod heroku/web.2: State changed from starting to up ... Nov 04 21:54:22 treeherder-prod heroku/web.2: source=web.2 dyno=heroku.47819539.4b58aac2-d87a-4f3d-92c8-72ed1822eec5 sample#memory_total=259.16MB sample#memory_rss=255.14MB sample#memory_cache=4.02MB sample#memory_swap=0.00MB sample#memory_pgpgin=160581pages sample#memory_pgpgout=94236pages sample#memory_quota=1024.00MB ... Nov 04 21:54:32 treeherder-prod heroku/router: at=info method=GET path="/revision.txt" host=treeherder.mozilla.org request_id=1ad752f6-561f-43b4-b628-9862782e8b75 fwd="156.67.252.9" dyno=web.2 connect=1200ms service=11497ms status=304 bytes=380 protocol=https Nov 04 21:54:32 treeherder-prod app/web.2: 10.101.253.141 - - [04/Nov/2018:21:54:31 +0000] "GET /revision.txt HTTP/1.1" 304 0 "https://treeherder.mozilla.org/" "Mozilla/5.0 (X11; Linux x86_64; rv:65.0) Gecko/20100101 Firefox/65.0" ... Nov 05 02:14:01 treeherder-prod heroku/web.2: State changed from up to down
Looks like something similar happened again (see screenshot), though this time it: * did not start around the same time as a daily dyno restart * did not require a restart on our side to fix ...so this case more looks like an app issue? I'll look more tomorrow.
Thanks for taking this, Ed!
Flags: needinfo?(cdawson)
Meant to say last night I also: * opened a Heroku ticket: https://help.heroku.com/tickets/648011 (PM me for sharing link if you'd like to view) * reduced dyno type back down from Performance-M to P2
Assignee: nobody → emorley
Status: NEW → ASSIGNED
Priority: -- → P1
Summary: Treeherder pages fail to load with heroku related error → Periodic spikes in Treeherder response times until Heroku web dynos restarted
The replies to the Heroku ticket so far haven't been overly confidence-filling (I can't tell if they've misunderstood the issue being reported, or whether these issues are seen by them as "expected" - plus the tone of the reply was also frustratingly defensive).
Depends on: 1504990
Blocks: 1504990
No longer depends on: 1504990
A later reply on the Heroku ticket: """ Looking into this further I still don't have a specific answer but I do have some possibilities; The host your Dyno was on from Nov 4 21:43 until Nov 5 02:14 was not starved for most resources, Dyno load was only around 5, 50% idle the vast majority of the time, memory usage was barely even 1/3rd used by both resident and cached memory, however there was another Dyno on the host that was transferring a large amount of data for that entire period. Our metrics do show some minorly elevated latency on networking related processes, NTP sync for example reported a few spikes in jitter but I'm only talking milliseconds here, nothing that indicates network throughput hung up nor was excessively constrained for a significant amount of time. That being said, Heroku automation requires little throughput to external sources, and very little throughput by comparison to even a handful of HTTP requests. It's reasonable to think that an application with a modicum of throughput and various request/response sizes, assets, etc. could quickly become encumbered with a single Dyno with a persistent transfer of data. This behavior doesn't appear to be abusive, but I'm going to get a second pair of eyes on that in addition to seeing if the usage was higher than it should be. I recall a network interface change being made a year or two ago that improved virtual interface priority amongst Dynos fairly significantly, so it's been a while since I've seen this situation be a possible problem. But, this is about the only metric I see that would explain a host-caused issue. Additionally, the host was taken out of service a handful of hours after your restarted the Dyno and were moved off it. It's difficult for me to tell if this was specifically due to the aforementioned high throughput Dyno, or if something else happened later. There was a single health check failure in the 03:00 hour but it recovered, and it wasn't until about 09:00 UTC that health failures became repeated, and consistent enough for the host to be shut down. It's likely that you saw the tip of the failure, but for it to have caused such an outsized amount of issues for you seems strange. In these situations, it's impossible for us to properly spot check other Dynos on the same host primarily for privileged access reasons. While we can and do look at metrics at a high level, we don't dive into app specific metrics (errors being emitted, etc.) without permission from the app owner. The host was eventually taken out of service so we have that data point, but given that health failures only started occurring after you moved your Dyno off it, it's difficult to know if these two incidents are related are not. """ I think for now we'll have to chalk this up to bad luck/edge case and see whether this occurs again in the future.
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED

This seems to be happening again, i got the message 3 times. no errors shown in the console.

Flags: needinfo?(aryx.bugmail)

NewRelic says response times were below 600ms at that time. No idea what it was. Please ask in #treeherder if it occurs again while the people there are active.

Flags: needinfo?(aryx.bugmail)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: