Closed Bug 1154727 Opened 9 years ago Closed 9 years ago

daily traffic spikes in request queueing portion of web transaction time on aus4

Categories

(Infrastructure & Operations Graveyard :: WebOps: Product Delivery, task)

Platform: x86_64 Linux
Type: task
Priority: Not set
Severity: critical

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bhearsum, Assigned: fox2mike)

Details

(Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/956] )

I happened to be on New Relic this morning and noticed that every day this week we've had lengthy daily spikes of web transaction time on the aus4 cluster. Looking back over a month of data, it looks like <25ms response time was the norm, and we're now seeing spikes of nearly 250ms.

They appear to start at midnight (I'm guessing that's UTC, but I'm honestly not sure), when request queuing climbs to around 50ms; it worsens to ~200ms around 4am, begins to improve around 7am, and finally levels off back below 25ms around noon.

The first occurrence appears to be on April 13th. Since that's a Monday, the regression window may go back as far as Friday afternoon; weekend traffic may not be high enough to trigger whatever condition we're hitting.
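
For anyone who wants to eyeball the same windows, here's a rough sketch of pulling the queue-time series out of New Relic's REST API (the API key, application id, and metric/value names below are placeholder assumptions, not our actual config):

# Sketch: fetch hourly request-queueing averages for one day from New Relic's
# REST API v2. All identifiers below are placeholders, not the real aus4 setup.
import requests

API_KEY = "REDACTED"   # assumed: a New Relic REST API key
APP_ID = 123456        # assumed: the aus4 application id in New Relic

resp = requests.get(
    "https://api.newrelic.com/v2/applications/%d/metrics/data.json" % APP_ID,
    headers={"X-Api-Key": API_KEY},
    params={
        "names[]": "WebFrontend/QueueTime",   # assumed metric name for request queueing
        "values[]": "average_response_time",  # assumed value field
        "from": "2015-04-13T00:00:00+00:00",
        "to": "2015-04-13T23:59:59+00:00",
        "period": 3600,                       # one-hour buckets (assumed to be seconds)
    },
)
resp.raise_for_status()

for slice_ in resp.json()["metric_data"]["metrics"][0]["timeslices"]:
    print(slice_["from"], slice_["values"].get("average_response_time"))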
Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/956]
No obviously relevant puppet changes on the 13th or the 9th/10th the previous week.

Apache logs indicate restarts on: (all times UTC)

Sun April 5 around 3:00-3:30
Sun April 12 around 3:00-4:00
Wed April 15 around 14:07 - graceful
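
For reference, a rough sketch of the kind of log scan that turns up the restarts listed above; the log path is an assumption and the notice wording varies a little between httpd versions:

# Sketch: pull restart-related notices (graceful restarts, shutdowns, startups)
# out of an Apache error log. Path is assumed; adjust per host.
import re

RESTART_MARKERS = (
    "Graceful restart requested",     # graceful restart
    "caught SIGTERM, shutting down",  # full stop
    "resuming normal operations",     # (re)start finished
)

with open("/var/log/httpd/error_log") as log:  # assumed path
    for line in log:
        if any(marker in line for marker in RESTART_MARKERS):
            # error_log lines begin with a bracketed timestamp,
            # e.g. [Wed Apr 15 14:07:01 2015]
            stamp = re.match(r"\[([^\]]+)\]", line)
            print(stamp.group(1) if stamp else line.strip())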


According to New Relic, many of the nodes experience a spike, but the worst by far is aus5.webapp.phx1. I don't know if it's somehow different from the others or if it's just coincidence; I'm only looking at those 3 occurrences. So far I haven't spotted any obvious differences (CPU, RAM, package versions, etc.). It's possible that the spikes on the other nodes are some sort of residual effect from a problem with aus5... or maybe aus5 is a red herring entirely.
Good catch; I didn't think to dig into the specific servers.

An interesting data point is that aus6 was idle (presumably pulled out of the pool) until sometime yesterday. Now that it's back in the pool I don't see a similar load spike starting this morning.
Looks like I spoke too soon. We had another spike that started around 5am Pacific this morning and appeared to get bad enough to take down the service - QE is reporting problems testing updates for 37.0.2. I'm raising this to critical because this is now impacting our ability to ship releases.
Severity: normal → critical
Ben,

We've gone through and completely restarted apache on all the web heads, and New Relic seems happier right now. Would it be possible for you to confirm the same?
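
To be clear, "completely restarted" here means a full stop/start of httpd on each web head rather than another graceful restart; a rough sketch of that kind of sweep, with the hostnames and service command as assumptions rather than the actual WebOps procedure:

# Sketch (hypothetical, not the real runbook): full httpd restart across the
# web heads over ssh. Hostnames and the service command are assumed.
import subprocess

WEB_HEADS = ["aus5.webapp.phx1.example.com"]  # assumed inventory; add the rest of the pool

for host in WEB_HEADS:
    # 'service httpd restart' does a full stop/start, unlike 'apachectl graceful'
    subprocess.check_call(["ssh", host, "sudo service httpd restart"])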

Seems like there was a deployment done yesterday (bug 1154278) and as part of the process, there was a graceful restart (which apparently didn't do enough I guess). Would be nice to know what the changes were for this push as well.
Assignee: server-ops-webops → smani
(In reply to Shyam Mani [:fox2mike] from comment #4)
> Ben,
> 
> We've gone through and completely restarted apache on all the web heads, and
> New Relic seems happier right now. Would it be possible for you to confirm
> the same?

I've relayed this to QE; I'll let you know if they have any more issues. Thank you!

> Seems like there was a deployment done yesterday (bug 1154278) and as part
> of the process, there was a graceful restart (which apparently didn't do
> enough I guess). Would be nice to know what the changes were for this push
> as well.

That push only had a single change to the admin interface. I think it's unlikely it affected anything, especially given that these load spikes started on Monday. Anything is possible, of course.
Data point: no increase in transaction time this morning (Friday, April 17th). This might be due to lower traffic; it might be due to the full reboot. Just noting it for the record.
Closing out for now; C's point in comment #6 is valid.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
As a followup: so far this morning (April 20th), there was a brief uptick in response time to 92.5 ms around 5 AM PDT, but most of the time it's been below 25 ms. This is much better than last week.
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard