Closed Bug 1533792 Opened 7 years ago Closed 7 years ago

[MDC1] moon-chassis-4 connection issues

Categories

(Infrastructure & Operations :: RelOps: Hardware, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dhouse, Assigned: dhouse)

Details

We have seen connection drops (reported by nagios) for the moon-chassis-4 iLO starting yesterday.

Yesterday, I asked QTS to check the iLO network cable connection for this moonshot's management module.
They found no problems and reported that the cable has solid connections at both ends.

Today the problem has repeated intermittently.

I opened another case with QTS (REQ0273399) asking them to reseat the module, verify that no secondary network cable is in place, and confirm that the health LED is green.

QTS found no problems. They reseated the module. However, nagios found it down again twice since then (once just now).

I updated the case with QTS and asked them to replace the network cable.

QTS replaced the network cable and the problem did not repeat over the weekend. So, I think it was a bad cable.

Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED

Reopening; CIDuty found that the workers on this chassis are all unavailable. They show as on, but there is no response to ping.

Maybe we broke all the workers with the troubleshooting for the iLO.
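The "on, but no response to ping" check described above could be repeated across a chassis with a small sweep script. The following is only a sketch under assumptions: the function names `ping_once` and `sweep` are hypothetical, the `ping` flags assume Linux iputils (`-c` count, `-W` timeout in seconds), and the host list would have to come from your own inventory, since this bug names only T-W1064-MS-164.

```python
import subprocess
from typing import Callable, Dict, Iterable, List


def ping_once(host: str, timeout_s: int = 1) -> bool:
    """Return True if a single ICMP echo to `host` gets a reply.

    Assumes the Linux iputils `ping` binary is on PATH; flags differ on
    other platforms.
    """
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def sweep(
    hosts: Iterable[str],
    probe: Callable[[str], bool] = ping_once,
) -> Dict[str, List[str]]:
    """Split hosts into those that answered the probe and those that did not."""
    report: Dict[str, List[str]] = {"up": [], "down": []}
    for host in hosts:
        # The probe is injectable so the sweep can be exercised without a network.
        report["up" if probe(host) else "down"].append(host)
    return report
```

Calling `sweep(["T-W1064-MS-164"])` would return a dictionary with that host listed under either `"up"` or `"down"`. A result where every cartridge on one chassis lands in `"down"` would point at the shared chassis switches rather than at individual workers.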

Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Summary: [MDC1] moon-chassis-4 ilo connection dropping → [MDC1] moon-chassis-4 connection issues

I powered off, and then back on, the switches (A and B) for moon-chassis-4 and that recovered the network connection for the cartridges.

Status: REOPENED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED

I've just checked the Windows machines on moon-chassis-4: 17 are not available on Taskcluster, 12 are available and have taken jobs, and 1 is in quarantine.

Status: RESOLVED → REOPENED
Resolution: FIXED → ---

I've checked the Windows machines again; now only one is not available (T-W1064-MS-164). I have started a reimage on it and will keep monitoring it.

I checked T-W1064-MS-164 and all the other machines on chassis 4, and everything is fine. I will close the ticket. If the problem recurs, we will reopen it.

Status: REOPENED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED