daily traffic spikes in request queueing portion of web transaction time on aus4

RESOLVED FIXED

Status

Product: Infrastructure & Operations Graveyard
Component: WebOps: Product Delivery
Priority: --
Severity: critical
Status: RESOLVED FIXED
Reported: 3 years ago
Last modified: a year ago

People

(Reporter: bhearsum, Assigned: fox2mike)

Details

(Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/956])

(Reporter)

Description

3 years ago
I happened to be on New Relic this morning and noticed that every day this week we've had lengthy spikes in web transaction time on the aus4 cluster. Looking back over a month of data, <25ms response times were the norm; we're now seeing spikes of nearly 250ms.

They appear to start at midnight (I'm guessing that's UTC, but I'm honestly not sure): request queueing climbs to around 50ms, worsens to ~200ms around 4am, begins to improve around 7am, and finally levels off back below 25ms around noon.

The first occurrence appears to be on April 13th. Since that's a Monday, the regression window may go back as far as Friday afternoon; weekend traffic may not be high enough to trigger whatever condition we're hitting.
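For reference: New Relic derives the "request queueing" slice of web transaction time from a timestamp header (X-Request-Start or X-Queue-Start) that the front end stamps on each request; the app tier subtracts that stamp from the time the request actually reaches it. A minimal Python sketch of the arithmetic, with the "t=<microseconds>" header format assumed for illustration and possibly differing per deployment:

    # Sketch of how request-queueing time is derived from the front-end
    # timestamp header. Header name and "t=<microseconds>" format are
    # assumptions for illustration; they may differ per deployment.
    import time

    def queue_time_ms(headers):
        """Milliseconds spent queued between the front end and the app."""
        raw = headers.get("X-Request-Start") or headers.get("X-Queue-Start")
        if raw is None:
            return None
        stamp_us = float(raw.lstrip("t="))  # e.g. "t=1429142400123456"
        return max(0.0, (time.time() - stamp_us / 1e6) * 1000.0)

    # Example: a request stamped 180 ms ago shows ~180 ms of queueing.
    hdrs = {"X-Request-Start": "t=%d" % ((time.time() - 0.18) * 1e6)}
    print("queue time: %.0f ms" % queue_time_ms(hdrs))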

Updated

3 years ago
Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/956]

Comment 1

3 years ago
No obviously relevant Puppet changes landed on the 13th, or on the 9th/10th of the previous week.

Apache logs indicate restarts at the following times (all UTC):

Sun April 5 around 3:00-3:30
Sun April 12 around 3:00-4:00
Wed April 15 around 14:07 (graceful)


According to New Relic, many of the nodes experience a spike, but the worst by far is aus5.webapp.phx1. I don't know if it's somehow different from the others or if it's just a coincidence; I'm only looking at those three occurrences. So far I haven't spotted any obvious differences (CPU, RAM, package versions, etc.). It's possible that the spikes on the other nodes are some sort of residual effect from a problem with aus5... or maybe aus5 is a red herring entirely.
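A rough Python sketch of the kind of log scan used to find those restarts, assuming a standard Apache 2.x error_log; the path and exact marker strings are the usual Apache ones, but are assumptions about these hosts:

    # Scan an Apache 2.x error_log for restart events. Log path and marker
    # strings are assumptions; adjust for the hosts in question.
    import re

    RESTART_MARKERS = (
        "caught SIGTERM",              # full stop/restart
        "Graceful restart requested",  # apachectl graceful
        "resuming normal operations",  # logged after any (re)start
    )

    def find_restarts(path="/var/log/httpd/error_log"):
        hits = []
        with open(path, errors="replace") as log:
            for line in log:
                if any(m in line for m in RESTART_MARKERS):
                    # Apache prefixes lines like [Wed Apr 15 14:07:01 2015]
                    ts = re.match(r"\[([^\]]+)\]", line)
                    hits.append((ts.group(1) if ts else "?", line.strip()))
        return hits

    for when, line in find_restarts():
        print(when, "|", line)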
(Reporter)

Comment 2

3 years ago
Good catch, I didn't think to dig into the specific servers.

An interesting data point is that aus6 was idle (presumably pulled out of the pool) until sometime yesterday. Now that it's back in the pool I don't see a similar load spike starting this morning.
(Reporter)

Comment 3

3 years ago
Looks like I spoke too soon. We had another spike that started around 5am Pacific this morning and appeared to get bad enough to take down the service; QE is reporting problems testing updates for 37.0.2. I'm raising this to critical because it's now impacting our ability to ship releases.
Severity: normal → critical
(Assignee)

Comment 4

3 years ago
Ben,

We've gone through and completely restarted Apache on all the web heads. New Relic seems happier right now; would it be possible for you to confirm the same?

It seems there was a deployment yesterday (bug 1154278), and as part of the process there was a graceful restart (which apparently didn't do enough). It would be nice to know what the changes were for this push as well.
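For context on the distinction: a graceful restart (apachectl graceful) lets each worker finish its in-flight request before being replaced, while apachectl restart replaces workers immediately. A sketch of the restart-everything step, with hypothetical host names and passwordless ssh/sudo assumed:

    # Restart Apache across the web heads. Host names are hypothetical and
    # passwordless ssh/sudo is assumed: "graceful" swaps workers only after
    # they finish their current request, "restart" swaps them immediately.
    import subprocess

    WEB_HEADS = ["aus%d.webapp.phx1" % n for n in range(1, 7)]  # hypothetical

    def restart_apache(host, graceful=False):
        mode = "graceful" if graceful else "restart"
        return subprocess.call(["ssh", host, "sudo apachectl " + mode])

    for host in WEB_HEADS:
        rc = restart_apache(host)  # full restart, as described above
        print("%s: %s" % (host, "ok" if rc == 0 else "exit %d" % rc))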
Assignee: server-ops-webops → smani
(Reporter)

Comment 5

3 years ago
(In reply to Shyam Mani [:fox2mike] from comment #4)
> Ben,
> 
> We've gone through and completely restarted Apache on all the web heads.
> New Relic seems happier right now; would it be possible for you to confirm
> the same?

I've relayed this to QE; I'll let you know if they have any more issues. Thank you!

> It seems there was a deployment yesterday (bug 1154278), and as part of
> the process there was a graceful restart (which apparently didn't do
> enough). It would be nice to know what the changes were for this push as
> well.

That push only had a single change to the admin interface. I think it's unlikely it affected anything, especially given that these load spikes started on Monday. Anything is possible, of course.

Comment 6

3 years ago
Data point: no increase in transaction time this morning (Friday, April 17th). This might be due to lower traffic; it might be due to the full restart. Just noting.
(Assignee)

Comment 7

3 years ago
Closing out for now; C's point in comment #6 is valid.
Status: NEW → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED

Comment 8

3 years ago
As a followup: so far this morning (April 20th), there was a brief uptick in response time to 92.5 ms around 5 AM PDT, but most of the time it's been below 25 ms. This is much better than last week.
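A toy Python sketch of the check being described, flagging samples above the ~25 ms baseline; the data points are invented for illustration and would come from New Relic in practice:

    # Flag response-time samples above the ~25 ms baseline. Samples are
    # invented for illustration; real values would come from New Relic.
    BASELINE_MS = 25.0

    samples = [  # (time label, avg web transaction time in ms)
        ("04:30", 18.2), ("05:00", 92.5), ("05:30", 31.0), ("06:00", 21.4),
    ]

    for label, ms in samples:
        if ms > BASELINE_MS:
            print("%s: %.1f ms (%.1fx baseline)" % (label, ms, ms / BASELINE_MS))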
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard