Closed Bug 1092242 Opened 10 years ago Closed 8 years ago

Unexplained spikes in latency resulting in timeouts and tree closures

Categories

(Infrastructure & Operations Graveyard :: NetOps, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: coop, Assigned: jbarnell)

We've had a number of tree closures this week caused by many build/test machines timing out while trying to download required artifacts. Bug 1091707 is one such manifestation. Bug 1039849 might be another.
The smokeping graphs show a number of these events. Here's one for use1:
http://netops2.private.scl3.mozilla.com/smokeping/sm.cgi?displaymode=n;start=2014-10-30%2007:30;end=2014-10-31%2010:00;target=Datacenters.RELENG-SCL3.nagios1-releng-use1~admin1b.private.releng.scl3.mozilla.com
...and one for scl3:
http://netops2.private.scl3.mozilla.com/smokeping/sm.cgi?displaymode=n;start=2014-10-30%2007:30;end=2014-10-31%2010:00;target=Datacenters.RELENG-SCL3.fw1-scl3~admin1b.private.releng.scl3.mozilla.com
If I look at the releng-scl3 smokeping graphs in aggregate, it's not limited to any one system, i.e. it seems pervasive:
http://netops2.private.scl3.mozilla.com/smokeping/sm.cgi?target=Datacenters.RELENG-SCL3
Can netops provide any insight on what might be causing these events based on the timings? Is there any monitoring we can put in place to figure out why they're happening? Could we be saturating our network at those times and, in essence, DoS-ing ourselves?
I'm taking a look ...
Assignee: network-operations → jbarnell
We've upgraded the firewalls, which has provided some level of relief on this.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard