Closed Bug 493181 Opened 15 years ago Closed 15 years ago

RelEng ESX outage

Categories

(mozilla.org Graveyard :: Server Operations, task)

Platform: x86, All
Type: task
Priority: Not set
Severity: blocker

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: nthomas, Assigned: chizu)

Details

At this point bm-vmware03 and 12 are "not responding" in the VI, along with 16 VMs "disconnected". Nagios is reporting all sorts of timeouts for the ESX hosts and ping failures for the VMs.

Nagios is also reporting that DHCP is down in the build network:
 v74boris:DHCP - bm-admin01 is CRITICAL: CRITICAL: No DHCPOFFERs were received.
which earns this blocker status.
Assignee: server-ops → thardcastle
VI says bm-vmware08 and 13 are gone now too. Can't raise any of the three ESX hosts with an ssh request.

Partial VM list (doesn't include bm-vmware08):
moz2-linux-slave02, 04, 08, 17  (4/22 = 18% production build pool degradation)
moz2-win32-slave01, 02, 03, 07, 11, 12, 18, 19, 20, 22, 23 (10/26 = 38% prod. degr.)
try-linux-slave03, 04, 05  (2/8 = 25% prod. degr.)
try-win32-slave04, 09      (2/8 = 25% prod. degr.)

xr-linux-tbox
production-patrocles
production-crazyhorse

staging-1.9-master
staging-master

staging-opsi
test-mgmt
test-winslave
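
For reference, the degradation percentages quoted in the list above can be reproduced with a short sketch. It uses the down/total counts exactly as given in the comment rather than recounting the listed hosts, and the dictionary keys and the helper name degradation_pct are illustrative only, not part of any RelEng tooling.

```python
# Down/total counts as quoted in the comment (not recomputed from the host lists).
pools = {
    "moz2-linux-slave": (4, 22),
    "moz2-win32-slave": (10, 26),
    "try-linux-slave": (2, 8),
    "try-win32-slave": (2, 8),
}


def degradation_pct(down, total):
    """Share of a pool that is unavailable, rounded to a whole percent."""
    return round(100 * down / total)


# Hypothetical report structure: pool name -> percent degraded.
report = {name: degradation_pct(down, total) for name, (down, total) in pools.items()}

for name, pct in report.items():
    print(f"{name}: {pct}% degraded")
```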
Summary: ESX outage → RelEng ESX outage
Assignee: thardcastle → server-ops
Summary: RelEng ESX outage → ESX outage
Assignee: server-ops → thardcastle
(fixing lost change to summary after bugzilla mid-air)
Summary: ESX outage → RelEng ESX outage
dhcpd died shortly after bm-admin01 was rebooted. I missed the first two pages, so leases began to expire around 1:00AM.

justdave restarted dhcpd and the ESX hosts are back on the network.
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard