bm82 and releng-puppet (scl3) are down

RESOLVED FIXED

Status

Product: mozilla.org Graveyard
Component: Server Operations
Priority: --
Severity: major
Status: RESOLVED FIXED
Opened: 4 years ago
Last modified: 3 years ago

People

(Reporter: bhearsum, Assigned: gcox)

(Reporter)

Description

4 years ago
10:30 < nagios-releng> Fri 07:30:37 PST [4026] buildbot-master82.srv.releng.scl3.mozilla.com is DOWN :PING CRITICAL - Packet loss = 100%
10:30 < nagios-releng> Fri 07:30:57 PST [4027] releng-puppet2.srv.releng.scl3.mozilla.com is DOWN :PING CRITICAL - Packet loss = 100%

Nothing else seems to be down; still looking into it. Trees are still open.
(Reporter)

Comment 1

4 years ago
These are both on the same ESX cluster according to inventory, so I think something is wrong there...

I also just saw a few more go:
10:39 < nagios-releng> Fri 07:39:48 PST [4031] admin1b.private.releng.scl3.mozilla.com is DOWN :CRITICAL - Host Unreachable (10.26.75.7)
10:40 < nagios-releng> Fri 07:40:57 PST [4033] buildbot-master84.srv.releng.scl3.mozilla.com is DOWN :PING CRITICAL - Packet loss = 100%
10:40 < nagios-releng> Fri 07:40:58 PST [4035] ns2.private.releng.scl3.mozilla.com is DOWN :CRITICAL - Host Unreachable (10.26.75.41)

Raising to blocker so we can try to fix this before the trees get closed because of it.
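
The correlation step above (pull the hosts out of the nagios alerts, then check whether they share an ESX cluster) can be sketched roughly as follows. This is only an illustration: the regex is inferred from the alert lines quoted in this bug, and the inventory mapping and the cluster name "esx-cluster-a" are hypothetical stand-ins for a real inventory lookup. Hosts not present in the mapping are reported as "unknown".

```python
import re
from collections import defaultdict

# Pattern inferred from the alert lines quoted in this bug (an assumption,
# not an official nagios format specification).
ALERT_RE = re.compile(
    r"\[(?P<id>\d+)\]\s+(?P<host>\S+) is (?P<state>[A-Z]+) :(?P<detail>.*)$"
)

def parse_alert(line):
    """Return a dict describing one alert line, or None if it does not match."""
    m = ALERT_RE.search(line)
    if m is None:
        return None
    return {
        "id": int(m["id"]),
        "host": m["host"],
        "state": m["state"],
        "detail": m["detail"].strip(),
    }

def group_by_cluster(hosts, inventory):
    """Group hosts by cluster name; hosts missing from inventory go under 'unknown'."""
    groups = defaultdict(list)
    for host in hosts:
        groups[inventory.get(host, "unknown")].append(host)
    return dict(groups)

ALERTS = [
    "10:30 < nagios-releng> Fri 07:30:37 PST [4026] buildbot-master82.srv.releng.scl3.mozilla.com is DOWN :PING CRITICAL - Packet loss = 100%",
    "10:30 < nagios-releng> Fri 07:30:57 PST [4027] releng-puppet2.srv.releng.scl3.mozilla.com is DOWN :PING CRITICAL - Packet loss = 100%",
    "10:39 < nagios-releng> Fri 07:39:48 PST [4031] admin1b.private.releng.scl3.mozilla.com is DOWN :CRITICAL - Host Unreachable (10.26.75.7)",
    "10:40 < nagios-releng> Fri 07:40:57 PST [4033] buildbot-master84.srv.releng.scl3.mozilla.com is DOWN :PING CRITICAL - Packet loss = 100%",
    "10:40 < nagios-releng> Fri 07:40:58 PST [4035] ns2.private.releng.scl3.mozilla.com is DOWN :CRITICAL - Host Unreachable (10.26.75.41)",
]

# Hypothetical inventory: the cluster name is made up for illustration.
HYPOTHETICAL_INVENTORY = {
    "buildbot-master82.srv.releng.scl3.mozilla.com": "esx-cluster-a",
    "releng-puppet2.srv.releng.scl3.mozilla.com": "esx-cluster-a",
}

parsed = [parse_alert(line) for line in ALERTS]
down_hosts = [a["host"] for a in parsed if a and a["state"] == "DOWN"]
by_cluster = group_by_cluster(down_hosts, HYPOTHETICAL_INVENTORY)

for cluster, hosts in sorted(by_cluster.items()):
    print(f"{cluster}: {len(hosts)} host(s) down")
```

If most of the down hosts land in a single cluster, that points at the cluster rather than at the individual VMs.
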
Assignee: nobody → server-ops
Severity: critical → blocker
Component: Buildduty → Server Operations
Product: Release Engineering → mozilla.org
QA Contact: armenzg → shyam
Version: unspecified → other
We are looking into this now.

Updated

4 years ago
Assignee: server-ops → rwatson
(Assignee)

Comment 3

4 years ago
We are moving our ESX boxes around in scl3, from one rack area to another. I just brought two of them back up after their physical move. I put some test boxes on them, they looked fine, so I put them back in rotation.

*speculation* It would appear that some releng VLAN wasn't trunked in properly.  So when the cluster rebalanced releng VMs onto the 'new' hosts, they got cut off.

Sorry for the trouble.  They're back in maintenance mode while I go look for the exact root cause.
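
The failure mode speculated above (a releng VLAN not trunked to the moved hosts) is the kind of thing a pre-rotation check could catch. A minimal sketch, assuming the required VLANs per cluster and the VLANs actually trunked to each host are available as plain sets. All host and VLAN names here are hypothetical and not taken from the real scl3 configuration.

```python
def missing_vlans(required_vlans, host_trunks):
    """Return {host: sorted list of required VLANs not trunked to that host}.

    Hosts that carry every required VLAN are omitted from the result.
    """
    required = set(required_vlans)
    gaps = {}
    for host, trunked in host_trunks.items():
        missing = required - set(trunked)
        if missing:
            gaps[host] = sorted(missing)
    return gaps

# Hypothetical example data (names invented for illustration).
required = {"releng-srv", "releng-private"}
host_trunks = {
    "esx-old-1": {"releng-srv", "releng-private", "corp"},
    "esx-moved-1": {"corp"},
}

gaps = missing_vlans(required, host_trunks)
for host, vlans in sorted(gaps.items()):
    print(f"{host}: missing {', '.join(vlans)}")
```

A host with any entry in the result would stay in maintenance mode, since VMs rebalanced onto it could lose connectivity on the missing VLANs even though test boxes on other VLANs look fine.
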
Assignee: rwatson → gcox
Severity: blocker → major
Status: NEW → ASSIGNED
(Assignee)

Updated

4 years ago
Duplicate of this bug: 960784
(Assignee)

Comment 5

4 years ago
Found the knowledge and cabling gap; opened bug 961068 to get the right cabling into place.
In the meantime, I've left the moved ESX hosts in maintenance mode and paused doing more physical moves.
Status: ASSIGNED → RESOLVED
Last Resolved: 4 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard