Closed Bug 1472510 Opened 7 years ago Closed 6 years ago

[MDC2] t-yosemite-r7-189 is unreachable

Categories

(Infrastructure & Operations :: RelOps: Hardware, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: zfay, Assigned: dhouse)

References

Details

(Whiteboard: REQ0264016 )

Attachments

(1 file)

55 bytes, text/x-github-pull-request
dhouse
: checked-in+
Have tried to ssh into it to try and reimage it, but can't connect at all.
Blocks: 1472043
Worker is still awol. Could you dig deeper into it dhouse?
Flags: needinfo?(dhouse)
no logs in papertrail: https://papertrailapp.com/groups/1141234?filter=t-yosemite-r7-189
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-osx-1010/workers/mdc1/t-yosemite-r7-189

Powered off by snmp, 30s then powered back on, and still no response to ping/ssh.

```
# snmpget -v 2c -c public pdu1.gc132.ops.releng.mdc2.mozilla.com 1.3.6.1.4.1.1718.3.2.3.1.5.2.1.3
iso.3.6.1.4.1.1718.3.2.3.1.5.2.1.3 = INTEGER: 1
# snmpset -v 2c -c secret pdu1.gc132.ops.releng.mdc2.mozilla.com 1.3.6.1.4.1.1718.3.2.3.1.11.2.1.3 i 2
iso.3.6.1.4.1.1718.3.2.3.1.11.2.1.3 = INTEGER: 2
# sleep 30; snmpset -v 2c -c secret pdu1.gc132.ops.releng.mdc2.mozilla.com 1.3.6.1.4.1.1718.3.2.3.1.11.2.1.3 i ^C
# snmpget -v 2c -c public pdu1.gc132.ops.releng.mdc2.mozilla.com 1.3.6.1.4.1.1718.3.2.3.1.5.2.1.3
iso.3.6.1.4.1.1718.3.2.3.1.5.2.1.3 = INTEGER: 0
# sleep 30; snmpset -v 2c -c secret pdu1.gc132.ops.releng.mdc2.mozilla.com 1.3.6.1.4.1.1718.3.2.3.1.11.2.1.3 i 1
iso.3.6.1.4.1.1718.3.2.3.1.11.2.1.3 = INTEGER: 1
# snmpget -v 2c -c public pdu1.gc132.ops.releng.mdc2.mozilla.com 1.3.6.1.4.1.1718.3.2.3.1.5.2.1.3
iso.3.6.1.4.1.1718.3.2.3.1.5.2.1.3 = INTEGER: 1
```

```
[dhouse@rejh2.srv.releng.mdc1.mozilla.com relops-infra]$ dc=2;h=189;ping -c1 t-yosemite-r7-$h.test.releng.mdc$dc.mozilla.com; ssh -vv root@t-yosemite-r7-$h.test.releng.mdc$dc.mozilla.com
PING t-yosemite-r7-189.test.releng.mdc2.mozilla.com (10.51.56.80) 56(84) bytes of data.

--- t-yosemite-r7-189.test.releng.mdc2.mozilla.com ping statistics ---
1 packets transmitted, 0 received, 100% packet loss, time 10000ms

OpenSSH_6.6, OpenSSL 1.0.1e-fips 11 Feb 2013
debug1: Reading configuration data /home/dhouse/.ssh/config
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: /etc/ssh/ssh_config line 4: Applying options for *
debug2: ssh_connect: needpriv 0
debug1: Connecting to t-yosemite-r7-189.test.releng.mdc2.mozilla.com [10.51.56.80] port 22.
debug1: connect to address 10.51.56.80 port 22: Connection timed out
ssh: connect to host t-yosemite-r7-189.test.releng.mdc2.mozilla.com port 22: Connection timed out
```
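For reference, the manual power cycle in the transcript above could be wrapped in a small script. This is only a sketch: the two OIDs and the control values (2 to switch the outlet off, 1 to switch it on) are copied from the transcript, while `PDU_HOST`, `READ_COMMUNITY`, `WRITE_COMMUNITY`, `OFF_SECONDS`, `DRY_RUN`, and the function names are hypothetical and introduced here for illustration. By default it runs in dry-run mode and only prints the commands it would send.

```shell
#!/bin/sh
# Sketch only: power-cycle one PDU outlet over SNMP, following the manual steps above.
# The two OIDs and the control values (2 = off, 1 = on) are copied from the transcript.
# All variable and function names below are hypothetical, chosen for this sketch.

PDU_HOST="${PDU_HOST:-pdu1.gc132.ops.releng.mdc2.mozilla.com}"
READ_COMMUNITY="${READ_COMMUNITY:-public}"
WRITE_COMMUNITY="${WRITE_COMMUNITY:-CHANGE_ME}"   # placeholder, not a real credential
STATUS_OID="1.3.6.1.4.1.1718.3.2.3.1.5.2.1.3"     # outlet status, read with snmpget
CONTROL_OID="1.3.6.1.4.1.1718.3.2.3.1.11.2.1.3"   # outlet control, written with snmpset
OFF_SECONDS="${OFF_SECONDS:-30}"
DRY_RUN="${DRY_RUN:-1}"                           # default: print commands, send nothing

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

outlet_status() {
    run snmpget -v 2c -c "$READ_COMMUNITY" "$PDU_HOST" "$STATUS_OID"
}

outlet_set() {
    run snmpset -v 2c -c "$WRITE_COMMUNITY" "$PDU_HOST" "$CONTROL_OID" i "$1"
}

power_cycle() {
    outlet_set 2                # switch the outlet off
    run sleep "$OFF_SECONDS"    # leave it off for a while
    outlet_status               # the transcript shows 0 here while the outlet is off
    outlet_set 1                # switch the outlet back on
    outlet_status               # the transcript shows 1 here once power is back
}
```

Calling `power_cycle` with the defaults prints the five commands in order; setting `DRY_RUN=0` and a real write community would actually send them, so that mode should be used with care.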
Flags: needinfo?(dhouse)
Please physically check this machine. It has become unresponsive again. It looks like we've reimaged it about once a month and it keeps falling over :( The old bug for it: https://bugzilla.mozilla.org/show_bug.cgi?id=1472683
Assignee: relops → server-ops-dcops
Component: RelOps: General → DCOps
QA Contact: klibby → cshields
Summary: t-yosemite-r7-189 is unreachable → [MDC2] t-yosemite-r7-189 is unreachable
host had a video issue - reimaged and back online.
Assignee: server-ops-dcops → rchilds
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Status: RESOLVED → REOPENED
Depends on: 1512019
Resolution: FIXED → ---
Assignee: rchilds → server-ops-dcops
Status: REOPENED → NEW
QTS reimaged this machine on Dec 29th.

I've tried to log on to the machine. It seems to be broken; I received the following message:
Stdio forwarding request failed: Session open refused by peer
ssh_exchange_identification: Connection closed by remote host

I've tried to reboot it via roller, but it didn't help.

https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-osx-1010/workers/mdc2/t-yosemite-r7-189

Dave, can you please check the machine?
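The ping-then-ssh check used earlier in this bug could be wrapped in a small helper so that anyone triaging the worker runs the same test. This is a rough sketch, not an existing tool: the hostname pattern is taken from the earlier one-liner (`dc=2;h=189`), while `worker_fqdn` and `check_worker` are hypothetical names, and it assumes key-based root ssh access from the jump host.

```shell
#!/bin/sh
# Sketch only: rebuild the worker's FQDN from datacenter and host number,
# then try one ping and one non-interactive ssh connection.
# worker_fqdn and check_worker are hypothetical helper names.

worker_fqdn() {
    # $1 = datacenter number (e.g. 2), $2 = host number (e.g. 189)
    echo "t-yosemite-r7-$2.test.releng.mdc$1.mozilla.com"
}

check_worker() {
    host="$(worker_fqdn "$1" "$2")"
    if ping -c 1 "$host" >/dev/null 2>&1; then
        echo "$host: ping ok"
    else
        echo "$host: no ping response"
    fi
    # BatchMode avoids password prompts; ConnectTimeout bounds the wait in seconds.
    if ssh -o BatchMode=yes -o ConnectTimeout=10 "root@$host" true 2>/dev/null; then
        echo "$host: ssh ok"
    else
        echo "$host: ssh failed"
    fi
}
```

For this machine the call would be `check_worker 2 189`.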

Flags: needinfo?(dhouse)

I'll ask QTS to check it. I cannot access it either.

QTS found this machine has no display and was not able to revive it. They tried powering it off and leaving it off for a short time. They tried resetting the SMC and nvram, and they tried re-connecting all cables. Still there is no display.

So let's leave this machine in quarantine. It is off warranty and so we will likely de-comm it.

Assignee: server-ops-dcops → dhouse
Component: DCOps → RelOps: Hardware
Flags: needinfo?(dhouse)
QA Contact: cshields

(In reply to Van Le [:van] from comment #4)

host had a video issue - reimaged and back online.

Van, is there anything more you would suggest we try with this one? Or are there possible fixes for the video issues? QTS found this machine has a video issue again. They tried reconnecting cables and doing the smc/nvram resets at boot but it doesn't power-on video to the monitor on the crash cart.

Flags: needinfo?(vle)

unfortunately no. i'll have them ship it to us and we can bring it to the Apple store.

Whiteboard: REQ0264016

(In reply to Van Le [:van] from comment #11)

unfortunately no. i'll have them ship it to us and we can bring it to the Apple store.

Shoot, I'm sorry this one went off warranty last summer. So, I think we just need to de-comm it?

So, I think we just need to de-comm it?

i'll decomm on my end. can you remove it from monitoring and anything else on your end?

Flags: needinfo?(vle) → needinfo?(dhouse)
Attached file GitHub Pull Request
Flags: needinfo?(dhouse)
Attachment #9040190 - Flags: checked-in+

I've removed this machine from releng nagios and roller.

Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED