Closed Bug 937322 Opened 11 years ago Closed 11 years ago

RFO needed for mtv1 downtime on Sat Nov 9, 2013

Categories

(Infrastructure & Operations Graveyard :: NetOps: Office Other, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: Callek, Unassigned)

Details

(Whiteboard: [reit-ops] [closed-trees])

It looks like all or most hosts in *.build.mtv1.mozilla.com went down on Nov 9th, 2013.

Came down on/around: Date/Time: 11-09-2013 07:35:50 (PT) 
Came up on/around: Date/Time: 11-09-2013 08:44:50 (PT)

The first nagios alert I see is pdu1.df201-4.build.mtv1.mozilla.com (the down line above)

The first up nagios alert I see is mw32-ix-slave01.build.mtv1.mozilla.com
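
For reference, the length of the window between the two alert timestamps above can be worked out with a short sketch like the following. It assumes both timestamps are in the same timezone (PT, as stated) and uses only the values quoted in this report.

```python
from datetime import datetime

# Timestamps copied from the down/up lines above (both Pacific Time).
FMT = "%m-%d-%Y %H:%M:%S"
down = datetime.strptime("11-09-2013 07:35:50", FMT)
up = datetime.strptime("11-09-2013 08:44:50", FMT)

# Elapsed time between the first "down" alert and the first "up" alert.
outage = up - down
print(outage)  # 1:09:00
```

So the alerts span roughly 1 hour 9 minutes, going by the first down and first up alerts only.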

This affected at least:
foopies, tegras, KVM, PDUs in the releng network, Windows machines, puppetmasters

I suspect either a power event or network event, but I saw no bugmail/e-mail on said outage, hence this bug.

I'm not sure who is on point for this investigation/issue, so I'm sending two needinfos out based on a best guess. Please feel free to redirect as warranted.
Flags: needinfo?(avillarde)
Flags: needinfo?(adam)
Just for added info: :dustin suggested in a private message that this may have been only a nagios event and NOT a real event...

Looking at foopy28 specifically (one of the affected hosts) in /var/log/messages I see:

[root@foopy28.build.mtv1.mozilla.com ~]# uptime
 14:07:46 up 356 days,  4:22,  1 user,  load average: 0.21, 0.12, 0.08

so it was not power, and:

Nov  9 07:57:52 foopy28 collectd[13668]: write_graphite plugin: Connecting to graphite-relay.private.scl3.mozilla.com:2003 failed. The last error was: Connection timed out

so mtv1->scl3 connectivity was acting up, and:

Nov  9 08:33:28 foopy28 puppet-agent[23185]: Finished catalog run in 19.06 seconds
Nov  9 08:44:48 foopy28 collectd[13668]: write_graphite plugin: Successfully connected to graphite-relay.private.scl3.mozilla.com:2003.

so it was able to run puppet at ~08:33 and reconnect to scl3 at 08:44:48, right at the end of the outage window (the first up alert was at 08:44:50).
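
The same check could be repeated on other affected hosts with a rough sketch like the one below. It only looks at the collectd write_graphite lines quoted above; the names `LOG_LINES`, `parse_ts`, and `graphite_window` are illustrative, the year is an assumption taken from the bug summary (syslog timestamps carry none), and the wrapped log lines have been rejoined. The first failure it reports is only the earliest one present in the excerpt, not necessarily the start of the outage.

```python
from datetime import datetime

# Log lines copied from the foopy28 excerpts above (wrapped lines rejoined).
LOG_LINES = [
    "Nov  9 07:57:52 foopy28 collectd[13668]: write_graphite plugin: "
    "Connecting to graphite-relay.private.scl3.mozilla.com:2003 failed. "
    "The last error was: Connection timed out",
    "Nov  9 08:33:28 foopy28 puppet-agent[23185]: "
    "Finished catalog run in 19.06 seconds",
    "Nov  9 08:44:48 foopy28 collectd[13668]: write_graphite plugin: "
    "Successfully connected to graphite-relay.private.scl3.mozilla.com:2003.",
]

YEAR = 2013  # assumption: syslog lines have no year, so take it from the bug


def parse_ts(line):
    """Parse the leading 'Mon DD HH:MM:SS' syslog timestamp of a line."""
    month, day, clock = line.split()[:3]
    return datetime.strptime(f"{YEAR} {month} {day} {clock}", "%Y %b %d %H:%M:%S")


def graphite_window(lines):
    """Return (first write_graphite failure, first success after it)."""
    first_fail = None
    for line in lines:
        if "write_graphite" not in line:
            continue
        ts = parse_ts(line)
        if "failed" in line and first_fail is None:
            first_fail = ts
        elif "Successfully connected" in line and first_fail is not None:
            return first_fail, ts
    return first_fail, None


fail, ok = graphite_window(LOG_LINES)
print(fail, ok, ok - fail)
```

On the excerpt above this gives a failure at 07:57:52 and a reconnect at 08:44:48, which falls inside the 07:35:50 to 08:44:50 window reported by nagios.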

hope that helps
Assignee: server-ops → network-operations
Component: Server Operations → NetOps: Office Other
Product: mozilla.org → Infrastructure & Operations
QA Contact: shyam → adam
Whiteboard: [reit-ops] [closed-trees]
A routing issue on fw1.mtv1 caused MTV1 traffic to SCL3 to get lost. A configuration change was made to keep this from happening again.
Status: NEW → RESOLVED
Closed: 11 years ago
Flags: needinfo?(adam)
Resolution: --- → FIXED
Flags: needinfo?(avillarde)
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard