Closed Bug 1472683 Opened 7 years ago Closed 6 years ago

t-yosemite-r7-189.test.releng.mdc2.mozilla.com. is unreachable

Categories

(Infrastructure & Operations :: DCOps, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: relops-bug-generator, Assigned: van)

References

Details

(Whiteboard: REQ0245333, REQ0235357 ,REQ0235524 )

Reboot t-yosemite-r7-189.test.releng.mdc2.mozilla.com. 10.51.56.80
Requested by mozilla-auth0/ad|Mozilla-LDAP|dhouse
Relops controller action failed:
2018-07-02T13:32:39.633531 ssh_reboot -l roller -i ssh.key TimeoutExpired
2018-07-02T13:32:39.635769 ipmi ipmi_reset KeyError
2018-07-02T13:32:39.637727 ipmi ipmi_cycle KeyError
2018-07-02T13:32:45.658640 snmp_reboot pdu1.gc132.ops.releng.mdc2.mozilla.com ba3 CalledProcessError
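For anyone triaging these generated reports in bulk, here is a minimal sketch of how the failure message above could be split into one record per attempted reboot method. It assumes only the layout visible in this message (timestamp, method plus arguments, then an exception name); `Attempt` and `parse_attempts` are hypothetical names, not part of the Relops controller.

```python
import re
from typing import List, NamedTuple


class Attempt(NamedTuple):
    timestamp: str
    method: str
    args: str
    error: str


# Timestamp with fractional seconds, as printed in the controller message.
_TS = r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+"
# One attempt: timestamp, command text, then a final word (the exception name)
# that is followed either by the next timestamp or by the end of the message.
_ATTEMPT = re.compile(rf"({_TS})\s+(.*?)\s+(\w+)(?=\s+{_TS}|\s*$)")


def parse_attempts(message: str) -> List[Attempt]:
    """Return one Attempt per reboot method listed in the message."""
    attempts = []
    for ts, body, error in _ATTEMPT.findall(message):
        # The first word is the method; anything after it is its arguments.
        method, _, args = body.partition(" ")
        attempts.append(Attempt(ts, method, args, error))
    return attempts
```

Read this way, the report shows ssh timing out, both ipmi actions failing with a KeyError, and the snmp reboot against the pdu failing last. The parser works whether the message arrives on one line or several.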
I tried a manual snmp call to the pdu and I get a timeout:
```
[root@roller-dev1 ~]# snmpget -v 2c -c communitystring pdu1.gc132.ops.releng.mdc2.mozilla.com 1.3.6.1.4.1.1718.3.2.3.1.11.2.1.3
Timeout: No Response from pdu1.gc132.ops.releng.mdc2.mozilla.com.
[dhouse@roller1.srv.releng.mdc1.mozilla.com ~]$ snmpset -v 2c -c communitystring pdu1.gc132.ops.releng.mdc2.mozilla.com 1.3.6.1.4.1.1718.3.2.3.1.11.2.1.3 i 3
Timeout: No Response from pdu1.gc132.ops.releng.mdc2.mozilla.com
```
However, with the public community string, snmp works:
```
[dhouse@roller1.srv.releng.mdc1.mozilla.com ~]$ snmpget -v 2c -c public pdu1.gc132.ops.releng.mdc2.mozilla.com 1.3.6.1.4.1.1718.3.2.3.1.11.2.1.3
iso.3.6.1.4.1.1718.3.2.3.1.11.2.1.3 = INTEGER: 0
```
I also tried from roller-dev1.srv.releng.mdc2 to see if I could reach it from within mdc2, but I get the same results.
I tested a reboot of another machine on this same pdu, t-yosemite-r7-187, and I get the same failure: https://bugzilla.mozilla.org/show_bug.cgi?id=1472713 So I think the snmp community string on this pdu may not match.
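The reasoning above (public reads answer, the private community times out, so the pdu is reachable and the string is probably wrong) can be sketched as a small check. This is only an illustration: `snmp_answers` and `diagnose` are hypothetical helpers, the injectable `run` parameter exists just so the logic can be exercised without a pdu, and treating a non-zero `snmpget` exit status as "no answer" is an assumption about the local net-snmp tools.

```python
import subprocess


def snmp_answers(host, community, oid, run=subprocess.run, timeout_s=15):
    """Return True if an SNMP v2c GET with this community gets a reply."""
    cmd = ["snmpget", "-v", "2c", "-c", community, host, oid]
    try:
        result = run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return False
    # Assumption: snmpget exits non-zero on "Timeout: No Response".
    return result.returncode == 0


def diagnose(host, oid, private_community, run=subprocess.run):
    """Compare the public and private communities against the same OID."""
    public_ok = snmp_answers(host, "public", oid, run)
    private_ok = snmp_answers(host, private_community, oid, run)
    if private_ok:
        return "private community accepted"
    if public_ok:
        # The agent is reachable, so a silent timeout points at the string.
        return "pdu reachable, private community ignored: likely mismatch"
    return "no snmp response at all: check network path or pdu"
```

With SNMP v2c an agent normally drops requests carrying an unknown community string rather than returning an error, which is why a mismatch shows up as a timeout instead of an authentication failure.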
See Also: → 1472713
Reboot t-yosemite-r7-189.test.releng.mdc2.mozilla.com. 10.51.56.80
Requested by mozilla-auth0/ad|Mozilla-LDAP|dhouse
Relops controller action failed:
2018-07-02T17:26:24.271370 ssh_reboot -l roller -i ssh.key TimeoutExpired
2018-07-02T17:26:24.276580 ipmi ipmi_reset KeyError
2018-07-02T17:26:24.279591 ipmi ipmi_cycle KeyError
This machine needs to be physically netbooted/reimaged. The snmp issue is fixed, but this machine and the 187 machine are not responding to ping after the power is cycled or turned on through snmp. I confirmed this by turning the power off and then back on, and waiting; I don't see them become ping-able.
```
# snmpget -v 2c -c public pdu1.gc132.ops.releng.mdc2.mozilla.com 1.3.6.1.4.1.1718.3.2.3.1.5.2.1.2
iso.3.6.1.4.1.1718.3.2.3.1.5.2.1.2 = INTEGER: 1
# snmpget -v 2c -c public pdu1.gc132.ops.releng.mdc2.mozilla.com 1.3.6.1.4.1.1718.3.2.3.1.5.2.1.3
iso.3.6.1.4.1.1718.3.2.3.1.5.2.1.3 = INTEGER: 1
```
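A note on reading these OIDs, written as a sketch rather than a reference. The two base OIDs are copied from the commands in this bug. The mapping from the controller's outlet id `ba3` to the `.2.1.3` suffix (letter, letter, number, with a=1, b=2) is an inference from this bug only and may not hold for other pdus. All function names here are hypothetical.

```python
# Base OIDs as used in this bug: ...1.5 is read for outlet status,
# ...1.11 is written to control the outlet.
STATUS_BASE = "1.3.6.1.4.1.1718.3.2.3.1.5"
CONTROL_BASE = "1.3.6.1.4.1.1718.3.2.3.1.11"


def outlet_index(outlet_id: str) -> str:
    """Map an id like 'ba3' to '2.1.3' (ASSUMED scheme: a=1, b=2, ...)."""
    first, second, number = outlet_id[0], outlet_id[1], outlet_id[2:]
    to_num = lambda ch: ord(ch.lower()) - ord("a") + 1
    return f"{to_num(first)}.{to_num(second)}.{int(number)}"


def status_oid(outlet_id: str) -> str:
    return f"{STATUS_BASE}.{outlet_index(outlet_id)}"


def control_oid(outlet_id: str) -> str:
    return f"{CONTROL_BASE}.{outlet_index(outlet_id)}"


def parse_integer(line: str) -> int:
    """Extract N from a reply line such as 'iso.3.6... = INTEGER: N'."""
    _, sep, value = line.rpartition("INTEGER:")
    if not sep:
        raise ValueError(f"no INTEGER value in: {line!r}")
    return int(value.strip())
```

A status of 1 only says the outlet is powered; as the ping results show, it says nothing about whether the machine behind it booted.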
power cycling didn't work, opened REQ0235357 with QTS for remote hands.
Assignee: server-ops-dcops → vle
Whiteboard: REQ0235357
Reboot t-yosemite-r7-189.test.releng.mdc2.mozilla.com. 10.51.56.80
Requested by mozilla-auth0/ad|Mozilla-LDAP|zfay
Relops controller action failed:
2018-07-03T10:52:16.701682 ssh_reboot -l roller -i ssh.key TimeoutExpired
2018-07-03T10:52:16.703472 ipmi ipmi_reset KeyError
2018-07-03T10:52:16.704844 ipmi ipmi_cycle KeyError
opened REQ0235524 with QTS to reimage mac mini
Whiteboard: REQ0235357 → REQ0235357 ,REQ0235524
Reboot t-yosemite-r7-189.test.releng.mdc2.mozilla.com. 10.51.56.80
Requested by mozilla-auth0/ad|Mozilla-LDAP|zfay
Relops controller action failed:
2018-07-08T12:46:19.987782 ssh_reboot -l roller -i ssh.key TimeoutExpired
2018-07-08T12:46:19.989639 ipmi ipmi_reset KeyError
2018-07-08T12:46:19.991406 ipmi ipmi_cycle KeyError
QTS reimaged the mini, but it looks like there is still an issue. Will check next time I'm on site.
bad cable, back online.
[vle@admin2a.private.mdc1 ~]$ fping !$
fping t-yosemite-r7-189.test.releng.mdc2.mozilla.com
t-yosemite-r7-189.test.releng.mdc2.mozilla.com is alive
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
I'm seeing problems with the machine. Could you check it again? It kept crashing and rebooting. I tried shutting off the power for it and bringing it back on, but now it hasn't come back online (no ping/ssh, no logs).
```
# snmpget -v 2c -c public pdu1.gc132.ops.releng.mdc2.mozilla.com 1.3.6.1.4.1.1718.3.2.3.1.5.2.1.3
iso.3.6.1.4.1.1718.3.2.3.1.5.2.1.3 = INTEGER: 1
# snmpset -v 2c -c secret pdu1.gc132.ops.releng.mdc2.mozilla.com 1.3.6.1.4.1.1718.3.2.3.1.11.2.1.3 i 2
iso.3.6.1.4.1.1718.3.2.3.1.11.2.1.3 = INTEGER: 2
# snmpget -v 2c -c public pdu1.gc132.ops.releng.mdc2.mozilla.com 1.3.6.1.4.1.1718.3.2.3.1.5.2.1.3
iso.3.6.1.4.1.1718.3.2.3.1.5.2.1.3 = INTEGER: 0
# snmpset -v 2c -c secret pdu1.gc132.ops.releng.mdc2.mozilla.com 1.3.6.1.4.1.1718.3.2.3.1.11.2.1.3 i 1
iso.3.6.1.4.1.1718.3.2.3.1.11.2.1.3 = INTEGER: 1
# snmpget -v 2c -c public pdu1.gc132.ops.releng.mdc2.mozilla.com 1.3.6.1.4.1.1718.3.2.3.1.5.2.1.3
iso.3.6.1.4.1.1718.3.2.3.1.5.2.1.3 = INTEGER: 1
[dhouse@rejh2.srv.releng.mdc1.mozilla.com ~]$ ping t-yosemite-r7-189.test.releng.mdc2.mozilla.com
PING t-yosemite-r7-189.test.releng.mdc2.mozilla.com (10.51.56.80) 56(84) bytes of data.
^C
--- t-yosemite-r7-189.test.releng.mdc2.mozilla.com ping statistics ---
81 packets transmitted, 0 received, 100% packet loss, time 80458ms
```
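The manual sequence above (write 2 to the control OID, confirm status 0, write 1, confirm status 1) can be sketched as a small routine. The value meanings are read off this transcript only: a write of 2 is followed by status 0 and a write of 1 by status 1. The `get` and `set_` callables are hypothetical stand-ins for whatever wraps `snmpget` and `snmpset`, so the sketch runs without a pdu.

```python
import time

STATUS_OID = "1.3.6.1.4.1.1718.3.2.3.1.5.2.1.3"
CONTROL_OID = "1.3.6.1.4.1.1718.3.2.3.1.11.2.1.3"
ACTION_ON, ACTION_OFF = 1, 2   # values written in the transcript
STATUS_OFF, STATUS_ON = 0, 1   # values read back in the transcript


def wait_for_status(get, wanted, attempts=10, delay=1.0, sleep=time.sleep):
    """Poll the status OID until it reports `wanted` or attempts run out."""
    for _ in range(attempts):
        if get(STATUS_OID) == wanted:
            return True
        sleep(delay)
    return False


def power_cycle(get, set_, sleep=time.sleep):
    """Turn the outlet off, confirm, turn it on, confirm. True on success."""
    set_(CONTROL_OID, ACTION_OFF)
    if not wait_for_status(get, STATUS_OFF, sleep=sleep):
        return False  # outlet never reported off; do not continue
    set_(CONTROL_OID, ACTION_ON)
    return wait_for_status(get, STATUS_ON, sleep=sleep)
```

A successful return only confirms that the outlet reports power again. It still takes a separate ping or ssh check to know whether the machine came up, which is the step that fails here.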
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Hey Van! This one has been missing from taskcluster for quite a while, and any attempt to ssh into it just hangs. It also doesn't show up in papertrail.
Flags: needinfo?(vle)
opened REQ0245333 with QTS for a reimage.
Flags: needinfo?(vle)
Whiteboard: REQ0235357 ,REQ0235524 → REQ0245333, REQ0235357 ,REQ0235524
back online.
vle@DESKTOP-3HK51T3:~$ fping t-yosemite-r7-189.test.releng.mdc2.mozilla.com
t-yosemite-r7-189.test.releng.mdc2.mozilla.com is alive
Status: REOPENED → RESOLVED
Closed: 7 years ago → 6 years ago
Resolution: --- → FIXED