Closed Bug 1504569 Opened 7 years ago Closed 7 years ago

Periodic spikes in Treeherder response times until Heroku web dynos restarted

Tracking

(Not tracked)

Status:

RESOLVED FIXED

People

(Reporter: apavel, Assigned: emorley)

References

(Blocks 1 open bug)

Details

Attachments

(4 files)

Screenshot of Heroku metrics page 7 years ago Ed Morley [:emorley] 168.98 KB, image/png		Details
Response time grouped by web dyno name 7 years ago Ed Morley [:emorley] 138.56 KB, image/png		Details
Per dyno New Relic metrics 7 years ago Ed Morley [:emorley] 95.38 KB, image/png		Details
2nd occurence: Response time grouped by web dyno name 7 years ago Ed Morley [:emorley] 58.44 KB, image/png		Details

Andreea Pavel [:apavel]

Reporter

Description

•

7 years ago

When trying to reload treeherder pages we receive the following: https://irccloud.mozilla.com/file/seayO0m2/image.png Sometimes there is an error message before "Error to retrieve push data", sometimes it just directly displays the heroku related error.

Andreea Pavel [:apavel]

Reporter

Comment 1

•

7 years ago

Ed, Cameron, can you please take a look?

Flags: needinfo?(emorley)

Flags: needinfo?(cdawson)

Andreea Pavel [:apavel]

Reporter

Comment 2

•

7 years ago

Trees are closed for this, except for try.

Severity: normal → blocker

Wes Kocher (:KWierso) (Not reading bugmail; email directly if needed)

Comment 3

•

7 years ago

FWIW, https://treeherder.allizom.org/#/jobs?repo=mozilla-inbound seems to load just fine.

Mike Conley (:mconley) (:⚙️)

Comment 4

•

7 years ago

Possibly related, I'm getting a 503 when fetching this resource as application/json: https://treeherder.mozilla.org/api/project/try/jobs/?count=2000&result_set_id=399187&job_type_name=test-macosx64/opt-mochitest-browser-screenshots-e10s&exclusion_profile=false

Justin Lazaro [:jlaz] (use needinfo)

Comment 5

•

7 years ago

I've escalated to the developers so far, with :wlach responding, although he is no longer part of the project, to take a look at the issues

William Lachance (:wlach)

Comment 7

•

7 years ago

Not seeing any obvious cause from the logs. New relic doesn't see any obvious problem, no errors or tasks that are obviously taking too long. Heroku console is showing memory errors. As an interim fix, I swapped out the 2x dynos out for "performance-m" dynos. They have 2.5gb of memory as opposed to the 2x's (which we were using) 1gb. They are also 5x more expensive... but hopefully this will keep things going until someone from the treeherder team can have a look tomorrow. So far the response rate seems to have gone down to normal.

Andreea Pavel [:apavel]

Reporter

Comment 8

•

7 years ago

We've tested and seems that treeherder is back to normal for now. Changing the importance to major and opening the trees based on the above comment. If this reoccurs, we'll closed the trees and update the bug. Thanks William for helping out with this. Regards!

Severity: blocker → major

Ed Morley [:emorley]

Assignee

Comment 9

•

7 years ago

Attached image Screenshot of Heroku metrics page — Details

Flags: needinfo?(emorley)

Ed Morley [:emorley]

Assignee

Comment 10

•

7 years ago

Attached image Response time grouped by web dyno name — Details

Ed Morley [:emorley]

Assignee

Comment 11

•

7 years ago

Attached image Per dyno New Relic metrics — Details

Ed Morley [:emorley]

Assignee

Comment 12

•

7 years ago

From what I've found so far: * starting right at the time of the Heroku daily restart event marker on the Heroku metrics page(21:54 UTC), the average response time graph spikes * breaking down response time by host on New Relic shows all the slow requests to be served by just one dyno, `web.2` * inspecting the logs for requests served by that dyno (see below) shows even static requests to be served really slowly, suggesting an issue with that dyno/dyno host specifically * I believe Will's bumping of dyno size from P2 to Pm worked because it forced the dyno to be recreated (likely on another host) - however using the "restart all dynos" option would do this too * this "response time graph cliff" after a daily restart also occurred a few weeks ago tl;dr: It seems like it might be a Heroku issue, so we should file a ticket. Excerpt of web.2 dyno Papertrail logs: Nov 04 21:54:03 treeherder-prod heroku/web.2: Cycling Nov 04 21:54:03 treeherder-prod heroku/web.2: State changed from up to starting Nov 04 21:54:11 treeherder-prod heroku/web.2: Starting process with command `newrelic-admin run-program gunicorn treeherder.config.wsgi:application --timeout 20` Nov 04 21:54:14 treeherder-prod app/web.2: [2018-11-04 21:54:14 +0000] [4] [INFO] Starting gunicorn 19.9.0 Nov 04 21:54:14 treeherder-prod app/web.2: [2018-11-04 21:54:14 +0000] [4] [INFO] Listening at: http://0.0.0.0:8496 (4) Nov 04 21:54:14 treeherder-prod app/web.2: [2018-11-04 21:54:14 +0000] [4] [INFO] Using worker: sync Nov 04 21:54:14 treeherder-prod app/web.2: [2018-11-04 21:54:14 +0000] [13] [INFO] Booting worker with pid: 13 Nov 04 21:54:14 treeherder-prod app/web.2: [2018-11-04 21:54:14 +0000] [14] [INFO] Booting worker with pid: 14 Nov 04 21:54:14 treeherder-prod app/web.2: [2018-11-04 21:54:14 +0000] [15] [INFO] Booting worker with pid: 15 Nov 04 21:54:14 treeherder-prod app/web.2: [2018-11-04 21:54:14 +0000] [16] [INFO] Booting worker with pid: 16 Nov 04 21:54:14 treeherder-prod heroku/web.2: State changed from starting to up ... Nov 04 21:54:22 treeherder-prod heroku/web.2: source=web.2 dyno=heroku.47819539.4b58aac2-d87a-4f3d-92c8-72ed1822eec5 sample#memory_total=259.16MB sample#memory_rss=255.14MB sample#memory_cache=4.02MB sample#memory_swap=0.00MB sample#memory_pgpgin=160581pages sample#memory_pgpgout=94236pages sample#memory_quota=1024.00MB ... Nov 04 21:54:32 treeherder-prod heroku/router: at=info method=GET path="/revision.txt" host=treeherder.mozilla.org request_id=1ad752f6-561f-43b4-b628-9862782e8b75 fwd="156.67.252.9" dyno=web.2 connect=1200ms service=11497ms status=304 bytes=380 protocol=https Nov 04 21:54:32 treeherder-prod app/web.2: 10.101.253.141 - - [04/Nov/2018:21:54:31 +0000] "GET /revision.txt HTTP/1.1" 304 0 "https://treeherder.mozilla.org/" "Mozilla/5.0 (X11; Linux x86_64; rv:65.0) Gecko/20100101 Firefox/65.0" ... Nov 05 02:14:01 treeherder-prod heroku/web.2: State changed from up to down

Ed Morley [:emorley]

Assignee

Comment 14

•

7 years ago

Attached image 2nd occurence: Response time grouped by web dyno name — Details

Looks like something similar happened again (see screenshot), though this time it: * did not start around the same time as a daily dyno restart * did not require a restart on our side to fix ...so this case more looks like an app issue? I'll look more tomorrow.

Cameron Dawson [:camd]

Comment 15

•

7 years ago

Thanks for taking this, Ed!

Flags: needinfo?(cdawson)

Ed Morley [:emorley]

Assignee

Comment 16

•

7 years ago

Meant to say last night I also: * opened a Heroku ticket: https://help.heroku.com/tickets/648011 (PM me for sharing link if you'd like to view) * reduced dyno type back down from Performance-M to P2

Ed Morley [:emorley]

Assignee

Updated

•

7 years ago

Assignee: nobody → emorley

Status: NEW → ASSIGNED

Priority: -- → P1

Summary: Treeherder pages fail to load with heroku related error → Periodic spikes in Treeherder response times until Heroku web dynos restarted

Ed Morley [:emorley]

Assignee

Comment 17

•

7 years ago

The replies to the Heroku ticket so far haven't been overly confidence-filling (I can't tell if they've misunderstood the issue being reported, or whether these issues are seen by them as "expected" - plus the tone of the reply was also frustratingly defensive).

Ed Morley [:emorley]

Assignee

Updated

•

7 years ago

Depends on: 1504990

Ed Morley [:emorley]

Assignee

Updated

•

7 years ago

Blocks: 1504990

No longer depends on: 1504990

Ed Morley [:emorley]

Assignee

Comment 18

•

7 years ago

A later reply on the Heroku ticket: """ Looking into this further I still don't have a specific answer but I do have some possibilities; The host your Dyno was on from Nov 4 21:43 until Nov 5 02:14 was not starved for most resources, Dyno load was only around 5, 50% idle the vast majority of the time, memory usage was barely even 1/3rd used by both resident and cached memory, however there was another Dyno on the host that was transferring a large amount of data for that entire period. Our metrics do show some minorly elevated latency on networking related processes, NTP sync for example reported a few spikes in jitter but I'm only talking milliseconds here, nothing that indicates network throughput hung up nor was excessively constrained for a significant amount of time. That being said, Heroku automation requires little throughput to external sources, and very little throughput by comparison to even a handful of HTTP requests. It's reasonable to think that an application with a modicum of throughput and various request/response sizes, assets, etc. could quickly become encumbered with a single Dyno with a persistent transfer of data. This behavior doesn't appear to be abusive, but I'm going to get a second pair of eyes on that in addition to seeing if the usage was higher than it should be. I recall a network interface change being made a year or two ago that improved virtual interface priority amongst Dynos fairly significantly, so it's been a while since I've seen this situation be a possible problem. But, this is about the only metric I see that would explain a host-caused issue. Additionally, the host was taken out of service a handful of hours after your restarted the Dyno and were moved off it. It's difficult for me to tell if this was specifically due to the aforementioned high throughput Dyno, or if something else happened later. There was a single health check failure in the 03:00 hour but it recovered, and it wasn't until about 09:00 UTC that health failures became repeated, and consistent enough for the host to be shut down. It's likely that you saw the tip of the failure, but for it to have caused such an outsized amount of issues for you seems strange. In these situations, it's impossible for us to properly spot check other Dynos on the same host primarily for privileged access reasons. While we can and do look at metrics at a high level, we don't dive into app specific metrics (errors being emitted, etc.) without permission from the app owner. The host was eventually taken out of service so we have that data point, but given that health failures only started occurring after you moved your Dyno off it, it's difficult to know if these two incidents are related are not. """ I think for now we'll have to chalk this up to bad luck/edge case and see whether this occurs again in the future.

Status: ASSIGNED → RESOLVED

Closed: 7 years ago

Resolution: --- → FIXED

Andreea Pavel [:apavel]

Reporter

Comment 19

•

7 years ago

This seems to be happening again, i got the message 3 times. no errors shown in the console.

Flags: needinfo?(aryx.bugmail)

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Comment 20

•

7 years ago

NewRelic says response times were below 600ms at that time. No idea what it was. Please ask in #treeherder if it occurs again while the people there are active.

Flags: needinfo?(aryx.bugmail)

You need to log in before you can comment on or make changes to this bug.