Bug 679053 — KVM outage took buildbot-master4/6/11/12 offline
Opened 14 years ago · Closed 14 years ago
Status: RESOLVED FIXED
Categories: Infrastructure & Operations :: RelOps: General, task
Tracking: (Not tracked)
People: Reporter: joduinn, Assigned: bkero
Problem already being worked by IT; filing this bug to track resolution.
Comment 1 (Reporter) • 14 years ago
FYI: the first Nagios alert was at 10:48 PDT.
10:48 < nagios-sjc1> [99] buildbot-master04.build.scl1 is DOWN: PING CRITICAL - Packet loss = 100%
10:48 < nagios-sjc1> [00] buildbot-master06.build.scl1 is DOWN: PING CRITICAL - Packet loss = 100%
10:48 < nagios-sjc1> [01] buildbot-master11.build.scl1 is DOWN: PING CRITICAL - Packet loss = 100%
10:48 < nagios-sjc1> [02] buildbot-master12.build.scl1 is DOWN: PING CRITICAL - Packet loss = 100%
10:48 < nagios-sjc1> [03] dev-master01.build.scl1 is DOWN: PING CRITICAL - Packet loss = 100%
10:48 < nagios-sjc1> [06] ganglia1.build.scl1 is DOWN: PING CRITICAL - Packet loss = 100%
10:49 < nagios-sjc1> [07] slavealloc.build.scl1 is DOWN: PING CRITICAL - Packet loss = 100%
...and we appear to be back working again now:
11:24:48 < arr> I think that's everything
Comment 2 (Assignee) • 14 years ago
Networking problem. After I added the interfaces back to the correct bridges, Nagios verified that they came back online.
Comment 3 • 14 years ago
So far it looks like the casualties are some tests for mozilla-inbound 9d083bbd47f5d202d69331cf073860f7d53f1b72. I have retriggered these tests.
Comment 4 • 14 years ago
Also fx-team 46cd0808d1c539b3068e871c88932f13fc21c552
Comment 5 • 14 years ago
(In reply to Ben Kero [:bkero] from comment #2)
> networking problem, after I added the interfaces back to the correct bridges
> nagios verified that they came back online
Did the machines go down, or just lose networking?
Comment 6 (Assignee) • 14 years ago
The physical machines' networking was fine. The mapping of VMs to vlan bridges was lost, which is the part that I restored manually. The only VM to be rebooted was slavealloc.
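The manual restore described above can be sketched as follows. This is illustrative only, assuming a Linux KVM host with bridge-utils; the interface and bridge names (tap0, br-vlan) are hypothetical, and the commands require root on the affected host, so they are not meant to be run here.

```shell
# Hypothetical restore on the KVM host (names are illustrative).
brctl show                    # inspect the current tap-to-bridge mapping
brctl addif br-vlan tap0      # re-add a VM's tap interface to its vlan bridge

# Modern iproute2 equivalent of the addif step:
# ip link set dev tap0 master br-vlan
```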
Comment 7 • 14 years ago
OK, I have retriggered all failed jobs I could find for the affected pushes. Leaving the bug open for IT investigation.
Comment 8 • 14 years ago
I think the best analogy is that these machines had their switches' uplink ports unplugged briefly - so the machines didn't lose "link", but packets to/from them were bitbucketed for a while. Whether that killed builds' connections depends on the exact duration of "a while", and whether there was traffic on the connection at that time: a totally quiet TCP connection would not have noticed any disruption.
slavealloc and ganglia were rebooted because that was the easier way to fix them; the rest were fixed up manually.
Comment 9 • 14 years ago
Today (8/15), build machines hosted on kvm2.infra.scl1 experienced an unexpected network outage after the mapping of tap interfaces to Ethernet bridges was lost. The outage occurred when a script that re-adds the network interfaces after a network change failed due to an offline secondary node (kvm1). kvm1 and kvm3 were being reintegrated into the cluster after disks were added to reduce I/O wait. Service restoration began at 10:51, and all hosts were back online by 11:24.
Resolution: tap interfaces were added back into the correct bridges, and Nagios verified that the hosts came back online. ganglia1.build.scl1 and slavealloc.build.scl1 were the only hosts that were rebooted.
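A lightweight sanity check could catch this failure mode (a tap interface with no bridge master) before Nagios sees packet loss. This is a hedged sketch, not part of the actual fix; the tap* naming convention and sysfs layout are assumptions about the KVM hosts.

```shell
#!/bin/sh
# Hypothetical check: warn about tap interfaces that have lost their
# bridge master, the failure mode behind this outage.
check_taps() {
  missing=0
  for dev in /sys/class/net/tap*; do
    [ -e "$dev" ] || continue          # no tap devices on this host
    # Enslaved interfaces expose a "master" symlink in sysfs.
    if [ ! -e "$dev/master" ]; then
      echo "WARNING: $(basename "$dev") is not attached to any bridge" >&2
      missing=$((missing + 1))
    fi
  done
  echo "check complete: $missing detached tap interface(s)"
}
check_taps
```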
Documentation pertaining to the correct method of re-adding a new node has been added to the internal IT wiki.
If there are any further questions or concerns, please feel free to contact bkero@mozilla.com.
Updated • 14 years ago
Assignee: server-ops-releng → bkero
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Updated • 12 years ago
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations