Closed
Bug 1154727
Opened 9 years ago
Closed 9 years ago
daily traffic spikes in request queueing portion of web transaction time on aus4
Categories
(Infrastructure & Operations Graveyard :: WebOps: Product Delivery, task)
Infrastructure & Operations Graveyard
WebOps: Product Delivery
x86_64
Linux
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: bhearsum, Assigned: fox2mike)
Details
(Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/956] )
I happened to be on New Relic this morning and noticed that every day this week we've had lengthy daily spikes of web transaction time on the aus4 cluster. Looking back over a month of data, <25ms response time was the norm, but we're now seeing spikes of nearly 250ms. They appear to start at midnight (I'm guessing that's UTC, but I'm honestly not sure), when request queueing climbs to around 50ms; they worsen to ~200ms around 4am, begin to improve around 7am, and finally level off back below 25ms around noon. The first occurrence appears to be on April 13th. Since that's a Monday, the regression window may go back as far as Friday afternoon, since weekend traffic may not be high enough to trigger whatever condition we're hitting.
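For context, New Relic's "request queueing" metric is typically derived from an `X-Request-Start` (or `X-Queue-Start`) header that the front-end load balancer stamps on each request; the app-side agent subtracts that timestamp from the moment the application starts processing. A minimal sketch of the arithmetic, using made-up illustrative timestamps (the header format shown, `t=<microseconds since epoch>`, is one of the formats New Relic accepts):

```shell
# Illustrative values only: a request stamped by the load balancer, then
# picked up by the app 250,000 microseconds later.
start_us=1428900000000000   # from "X-Request-Start: t=1428900000000000"
app_us=1428900000250000     # time the app began handling the request

# Queue time in milliseconds; a healthy aus4 node sat below 25, the spikes
# described above approached 250.
echo $(( (app_us - start_us) / 1000 ))
```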
Comment 1•9 years ago
No obviously relevant puppet changes on the 13th, or on the 9th/10th the previous week.

Apache logs indicate restarts on (all times UTC):
- Sun April 5, around 3:00-3:30
- Sun April 12, around 3:00-4:00
- Wed April 15, around 14:07 (graceful)

According to New Relic, many of the nodes experience a spike, but the worst by far is aus5.webapp.phx1. I don't know if it's somehow different from the others or if it's just coincidence; I'm only looking at those 3 occurrences. So far I haven't spotted any obvious differences (CPU, RAM, package versions, etc.). It's possible that the spikes on the other nodes are some sort of residual effect from a problem with aus5... or maybe aus5 is a red herring entirely.
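The restart times above can be recovered from Apache's error log, which writes marker lines on every (re)start. A sketch against a simulated log excerpt (the real path would be something like `/var/log/httpd/error_log`, and the exact line format varies by Apache version):

```shell
# Simulated error_log excerpt, with timestamps matching the restarts listed
# above; Apache 2.2-style notice lines are assumed.
cat > /tmp/error_log.sample <<'EOF'
[Sun Apr 12 03:12:44 2015] [notice] caught SIGTERM, shutting down
[Sun Apr 12 03:14:02 2015] [notice] Apache/2.2.15 (Unix) configured -- resuming normal operations
[Wed Apr 15 14:07:31 2015] [notice] Graceful restart requested, doing restart
EOF

# "resuming normal operations" marks a full (re)start; "Graceful restart
# requested" marks a graceful one.
grep -E 'resuming normal operations|Graceful restart requested' /tmp/error_log.sample
```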
Reporter
Comment 2•9 years ago
Good catch, I didn't think to dig into the specific servers. An interesting data point: aus6 was idle (presumably pulled out of the pool) until sometime yesterday. Now that it's back in the pool, I don't see a similar load spike starting this morning.
Reporter
Comment 3•9 years ago
Looks like I spoke too soon. We had another spike that started around 5am Pacific this morning, and it appears to have gotten bad enough to take down the service: QE is reporting problems testing updates for 37.0.2. I'm raising this to critical because it is now impacting our ability to ship releases.
Severity: normal → critical
Assignee
Comment 4•9 years ago
Ben,

We've gone through and completely restarted Apache on all the web heads, and New Relic seems happier right now. Would it be possible for you to confirm the same?

It seems there was a deployment yesterday (bug 1154278), and as part of that process there was a graceful restart (which apparently wasn't enough). It would be nice to know what the changes were in that push as well.
Assignee: server-ops-webops → smani
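The distinction between the two restart styles matters here: a graceful restart lets child processes finish their in-flight requests, so misbehaving children can survive it, while a full stop/start discards all per-process state. A sketch of the commands involved (shown as comments; actual invocations depend on the distro's init setup):

```shell
# Graceful restart: the parent re-reads its config, and each child exits
# only after finishing its current request. Long-lived or leaky children
# can linger past the restart:
#   apachectl -k graceful
#
# Full restart: every child is stopped and respawned, discarding any
# accumulated per-process state (memory bloat, stale caches, etc.):
#   apachectl -k stop && apachectl -k start
```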
Reporter
Comment 5•9 years ago
(In reply to Shyam Mani [:fox2mike] from comment #4)
> Ben,
>
> We've gone through and completely restarted apache on all the web heads, New
> Relic seems happier right now, would it be possible for you to confirm the
> same?

I've relayed this to QE; I'll let you know if they have any more issues. Thank you!

> Seems like there was a deployment done yesterday (bug 1154278) and as part
> of the process, there was a graceful restart (which apparently didn't do
> enough I guess). Would be nice to know what the changes were for this push
> as well.

That push only had a single change, to the admin interface. I think it's unlikely it affected anything, especially given that these load spikes started on Monday. Anything is possible, of course.
Comment 6•9 years ago
Data point: no increase in transaction time this morning (Friday, April 17th). This might be due to lower traffic, or it might be due to the full restart. Just noting it here.
Assignee
Comment 7•9 years ago
Closing out for now; C's point in comment #6 is valid.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Comment 8•9 years ago
As a followup: so far this morning (April 20th), there was a brief uptick in response time to 92.5 ms around 5 AM PDT, but most of the time it's been below 25 ms. This is much better than last week.
Updated•8 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard