Closed Bug 1154727 Opened 9 years ago Closed 9 years ago

daily traffic spikes in request queueing portion of web transaction time on aus4

Tracking

(Not tracked)

Status:

RESOLVED FIXED

People

(Reporter: bhearsum, Assigned: fox2mike)

Details

(Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/956] )

bhearsum@mozilla.com (:bhearsum)

Reporter

Description

•

9 years ago

I happened to be on New Relic this morning and noticed that every day this week we've had lengthy daily spikes of web transaction time on the aus4 cluster. Looking back over a month of data, it looks like <25ms response time was the norm, and we're now seeing spikes of nearly 250ms.

They appear to start at midnight (I'm guessing that's UTC, but I'm honestly not sure), where we get to around 50ms of request queuing, worsen around 4am at ~200ms, begin to improve around 7am, and finally level off back below 25ms around noon.

The first occurrence appears to be on April 13th. Since that's a Monday, the regression window may go back as far as Friday afternoon, since weekend traffic may not be high enough to trigger whatever condition we're hitting.
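
For anyone who wants to double-check the regression window, here is a minimal sketch of the date arithmetic. The year 2015 is an assumption (the report gives no year), and `previous_weekday` is a hypothetical helper written for illustration, not part of any tooling mentioned here.

```python
from datetime import date, timedelta

# Assumption: the spikes were first seen on April 13th, 2015 (year not stated in the report).
first_spike = date(2015, 4, 13)

def previous_weekday(day, weekday):
    """Return the latest date strictly before `day` that falls on `weekday` (Monday=0 ... Sunday=6)."""
    delta = (day.weekday() - weekday - 1) % 7 + 1
    return day - timedelta(days=delta)

# Friday is weekday 4; the regression window may open as early as that afternoon.
window_start = previous_weekday(first_spike, 4)
print(first_spike.strftime("%A"), window_start.isoformat())
```

Under that assumption the window runs from Friday April 10th through Monday April 13th, so any change deployed in that range would be a candidate.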

:kanban

Updated

•

9 years ago

Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/956]

Jake Maul [:jakem]

Comment 1

•

9 years ago

No obviously relevant puppet changes on the 13th or the 9th/10th the previous week.

Apache logs indicate restarts on: (all times UTC)

Sun April 5 around 3:00-3:30
Sun April 12 around 3:00-4:00
Wed April 15 around 14:07 - graceful
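
A rough sketch of how a list like the one above could be pulled out of an error log. The message fragments in `MARKERS` and the bracketed timestamp prefix are assumptions based on typical Apache 2.2-style error logs; the exact strings vary by Apache version and MPM, so adjust them to match the real logs. `restart_events` is a hypothetical helper, not an existing tool.

```python
import re

# Assumed message fragments; verify against the actual error log before relying on them.
MARKERS = {
    "Graceful restart requested": "graceful restart",
    "SIGHUP received": "restart",
    "caught SIGTERM, shutting down": "stop",
    "resuming normal operations": "start",
}

# Assumed line prefix: "[Wed Apr 15 14:07:02 2015] ..."
TIMESTAMP = re.compile(r"^\[([^\]]+)\]")

def restart_events(lines):
    """Return (timestamp, kind) pairs for lines that look like restart activity."""
    events = []
    for line in lines:
        match = TIMESTAMP.match(line)
        if not match:
            continue  # skip lines without a leading bracketed timestamp
        for fragment, kind in MARKERS.items():
            if fragment in line:
                events.append((match.group(1), kind))
                break
    return events
```

Passing an open log file works, since the function only iterates over lines: `restart_events(open(path))`, where `path` is wherever the error log lives on the web head.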


According to New Relic, many of the nodes experience a spike, but the worst by far is aus5.webapp.phx1. I don't know if it's somehow different from the others or if it's just coincidence; I'm only looking at those 3 occurrences. So far I haven't spotted any obvious differences (CPU, RAM, package version, etc). It's possible that the spikes on the other nodes are some sort of residual effect from a problem with aus5... or maybe aus5 is a red herring entirely.

bhearsum@mozilla.com (:bhearsum)

Reporter

Comment 2

•

9 years ago

Good catch, I didn't think to dig into the specific servers.

An interesting data point is that aus6 was idle (presumably pulled out of the pool) until sometime yesterday. Now that it's back in the pool I don't see a similar load spike starting this morning.

bhearsum@mozilla.com (:bhearsum)

Reporter

Comment 3

•

9 years ago

Looks like I spoke too soon. We had another spike that started around 5am Pacific this morning and appeared to get bad enough to take down the service - QE is reporting problems testing updates for 37.0.2. I'm raising this to critical because this is now impacting our ability to ship releases.

Severity: normal → critical

Shyam Mani [:fox2mike]

Assignee

Comment 4

•

9 years ago

Ben,

We've gone through and completely restarted apache on all the web heads, New Relic seems happier right now, would it be possible for you to confirm the same? 

Seems like there was a deployment done yesterday (bug 1154278) and as part of the process, there was a graceful restart (which apparently didn't do enough I guess). Would be nice to know what the changes were for this push as well.

Assignee: server-ops-webops → smani

bhearsum@mozilla.com (:bhearsum)

Reporter

Comment 5

•

9 years ago

(In reply to Shyam Mani [:fox2mike] from comment #4)
> Ben,
> 
> We've gone through and completely restarted apache on all the web heads, New
> Relic seems happier right now, would it be possible for you to confirm the
> same? 

I've relayed this to QE, I'll let you know if they have any more issues, thank you!

> Seems like there was a deployment done yesterday (bug 1154278) and as part
> of the process, there was a graceful restart (which apparently didn't do
> enough I guess). Would be nice to know what the changes were for this push
> as well.

That push only had a single change to the admin interface. I think it's unlikely it affected anything, especially given that these load spikes started on Monday. Anything is possible, of course.

C. Liang [:cyliang]

Comment 6

•

9 years ago

Data point: no increase in transaction time this morning (Friday, April 17th). This might be due to lower traffic; it might be due to the full reboot. Just noting it.

Shyam Mani [:fox2mike]

Assignee

Comment 7

•

9 years ago

Closing out for now; C's point in comment #6 is valid.

Status: NEW → RESOLVED

Closed: 9 years ago

Resolution: --- → FIXED

C. Liang [:cyliang]

Comment 8

•

9 years ago

As a followup: so far this morning (April 20th), there was a brief uptick in response time to 92.5 ms around 5 AM PDT, but most of the time it's been below 25 ms. This is much better in comparison to last week.

Nobody; OK to take it and work on it

Updated

•

8 years ago

Product: Infrastructure & Operations → Infrastructure & Operations Graveyard


Categories

(Infrastructure & Operations Graveyard :: WebOps: Product Delivery, task)
