Closed Bug 679053 Opened 14 years ago Closed 14 years ago

KVM outage took buildbot-master4/6/11/12 offline

Categories

(Infrastructure & Operations :: RelOps: General, task)

x86
All
task
Not set
critical

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: joduinn, Assigned: bkero)

Details

Problem already being worked by IT; filing this bug to track resolution.
fyi: first nagios alert was at 10:48 PDT.

10:48 < nagios-sjc1> [99] buildbot-master04.build.scl1 is DOWN: PING CRITICAL - Packet loss = 100%
10:48 < nagios-sjc1> [00] buildbot-master06.build.scl1 is DOWN: PING CRITICAL - Packet loss = 100%
10:48 < nagios-sjc1> [01] buildbot-master11.build.scl1 is DOWN: PING CRITICAL - Packet loss = 100%
10:48 < nagios-sjc1> [02] buildbot-master12.build.scl1 is DOWN: PING CRITICAL - Packet loss = 100%
10:48 < nagios-sjc1> [03] dev-master01.build.scl1 is DOWN: PING CRITICAL - Packet loss = 100%
10:48 < nagios-sjc1> [06] ganglia1.build.scl1 is DOWN: PING CRITICAL - Packet loss = 100%
10:49 < nagios-sjc1> [07] slavealloc.build.scl1 is DOWN: PING CRITICAL - Packet loss = 100%

...and we appear to be back working again now:

11:24:48 < arr> I think that's everything
Networking problem; after I added the interfaces back to the correct bridges, Nagios verified that they came back online.
So far it looks like the casualties are some tests for mozilla-inbound 9d083bbd47f5d202d69331cf073860f7d53f1b72. I have retriggered these tests.
Also fx-team 46cd0808d1c539b3068e871c88932f13fc21c552
(In reply to Ben Kero [:bkero] from comment #2)
> networking problem, after I added the interfaces back to the correct bridges
> nagios verified that they came back online

Did the machines go down, or just lose networking?
The physical machines' networking was fine. The mapping of VMs to VLAN bridges was lost, which is the part that I restored manually. The only VM to be rebooted was slavealloc.
Ok, I have retriggered any failed jobs I could find for the affected pushes. Leaving bug open for IT investigation.
I think the best analogy is that these machines had their switches' uplink ports unplugged briefly - so the machines didn't lose "link", but packets to/from them were bitbucketed for a while. Whether that killed builds' connections depends on the exact duration of "a while", and whether there was traffic on the connection at that time: a totally quiet TCP connection would not have noticed any disruption. slavealloc and ganglia were rebooted because that was the easier way to fix them; the rest were fixed up manually.
Today (8/15) build machines hosted on kvm2.infra.scl1 experienced an unexpected network outage after the mapping of tap interfaces to Ethernet bridges was lost. The outage occurred when a script that re-adds the network interfaces after a network change failed due to an offline secondary node (kvm1). kvm1 and kvm3 were being reintegrated into the cluster after disks were added to reduce IOwait.

Service began being restored at 10:51, and all hosts were back online by 11:24.

Resolution: tap interfaces were added back into the correct bridges, and Nagios verified that the hosts came back online. ganglia1.build.scl1 and slavealloc.build.scl1 were the only hosts that were rebooted.

Documentation pertaining to the correct method of re-adding a new node has been added to the internal IT wiki. If there are any further questions or concerns, please feel free to contact bkero@mozilla.com.
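For readers unfamiliar with the fix described above, re-attaching lost tap interfaces to their bridges can be sketched as a small shell script. This is a minimal illustration, not the actual script from the incident: the tap and bridge names (tap0, br-vlan75, ...) are hypothetical, and it assumes classic bridge-utils (`brctl`) style tooling. With DRY_RUN=1 it only prints the commands it would run.

```shell
#!/bin/sh
# Sketch: re-attach KVM tap interfaces to their VLAN bridges after the
# tap->bridge mapping is lost. All names here are hypothetical examples.

DRY_RUN=${DRY_RUN:-1}   # default to printing commands instead of running them

readd_taps() {
    # Each argument is a "tap:bridge" pair.
    for pair in "$@"; do
        tap=${pair%%:*}       # text before the first ':'
        bridge=${pair#*:}     # text after the first ':'
        if [ "$DRY_RUN" = "1" ]; then
            echo "brctl addif $bridge $tap"
        else
            # Requires root and an existing bridge; brctl is from bridge-utils.
            brctl addif "$bridge" "$tap"
        fi
    done
}

# Hypothetical mapping for the affected VMs.
readd_taps tap0:br-vlan75 tap1:br-vlan75 tap2:br-vlan76
```

A real version would derive the mapping from the VM definitions rather than hard-coding it, and verify each interface afterwards (e.g. by checking the bridge's member list) before letting monitoring confirm the hosts are reachable.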
Assignee: server-ops-releng → bkero
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations