Closed Bug 915243 Opened 11 years ago Closed 11 years ago

Please investigate bld-lion-r5-040

Categories

(Infrastructure & Operations :: DCOps, task)

x86_64
macOS
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: coop, Assigned: arich)

This slave is pingable, and ssh and VNC seem to be running, but I cannot connect via either protocol. I'm guessing this is due to an aborted re-image.

Can someone from DCOps please investigate, and kick off a new re-image if required?
Assignee: server-ops-dcops → eramirez
colo-trip: --- → scl3
Vinh was having trouble netbooting this machine, and I'm wondering if something broke the firewall rules for contacting install.build.releng.scl3.mozilla.com.

The symptom was that he would netboot and would get the flashing globe, then the sad folder (both of the disks were just replaced).  He tried swapping the cable, network ports, etc.  When he put it on the test vlan, he got DS to answer (but install didn't continue to work because it was on the wrong vlan).

The address is offered and acked in the DHCP logs on admin1a/b.
A tcpdump on install.build.releng.scl3.mozilla.com only shows the BOOTP transaction, nothing after that:

15:22:36.199304 IP 10.26.52.1.bootps > install.build.releng.scl3.mozilla.com.bootps: BOOTP/DHCP, Request from 3c:07:54:72:50:27 (oui Unknown), length 292
15:22:36.272694 IP 10.26.52.1.bootps > install.build.releng.scl3.mozilla.com.bootps: BOOTP/DHCP, Request from 3c:07:54:72:50:27 (oui Unknown), length 304
15:22:36.285950 IP 10.26.52.1.bootps > install.build.releng.scl3.mozilla.com.bootps: BOOTP/DHCP, Request from 3c:07:54:72:50:27 (oui Unknown), length 293
15:22:37.177240 IP 10.26.52.1.bootps > install.build.releng.scl3.mozilla.com.bootps: BOOTP/DHCP, Request from 3c:07:54:72:50:27 (oui Unknown), length 312
15:22:36.286195 IP install.build.releng.scl3.mozilla.com.bootps > bld-lion-r5-040.try.releng.scl3.mozilla.com.bootpc: BOOTP/DHCP, Reply, length 316
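
To check a longer capture for the same pattern (BOOTP only, nothing after it), something like the following rough sketch could help. It assumes the tcpdump one-line output format shown above; the names `summarize_bootp` and `only_bootp` are made up for illustration and are not part of any existing tooling.

```python
import re
from collections import Counter

# Matches tcpdump one-line BOOTP/DHCP summaries in the format seen in this capture.
# Assumption: other tcpdump versions or flags may format these lines differently.
LINE_RE = re.compile(
    r"^(?P<ts>\d{2}:\d{2}:\d{2}\.\d+) IP (?P<src>\S+) > (?P<dst>\S+): "
    r"BOOTP/DHCP, (?P<kind>Request|Reply)"
)


def summarize_bootp(lines):
    """Count BOOTP requests, BOOTP replies, and all other non-empty lines."""
    counts = Counter({"Request": 0, "Reply": 0, "other": 0})
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines in pasted captures
        match = LINE_RE.match(line)
        counts[match.group("kind") if match else "other"] += 1
    return counts


def only_bootp(lines):
    """True when the capture holds BOOTP traffic and nothing else."""
    counts = summarize_bootp(lines)
    return counts["other"] == 0 and (counts["Request"] + counts["Reply"]) > 0
```

If `only_bootp` returns True for a capture taken on the install server, the client got its DHCP answer but never came back for the boot image, which matches the symptom described here. It says nothing about where the follow-up traffic is being dropped.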

I also tried rebooting the install machine just in case it was something non-obvious wedged there (same result).

He started a recovery install over the internet, and that seems to be functioning just fine (and the machine gets the proper IP).

We don't currently have another machine to test in this VLAN, but I suspect that would confirm/deny the problem. Some more in-depth troubleshooting with netops is probably required.

Dustin, can you take a look at this Friday?
Assignee: eramirez → dustin
I have the mac mini hooked up to a spider kvm for remote console.  10.22.0.155

If you need to do a PDU reboot (mini is connected to a temp. outlet):
pdu1b.r101-22.ops.releng.scl3.mozilla.com 
Outlet - BB9

Ping me if you need the generic login credentials for the mac mini.
When I had a look at this a while back, it was failing to puppetize because set_hostname.sh had set its hostname to NXDOMAIN.

Also, I power-cycled the mini, and the spider went away.
Well, I tried blessing and rebooting, and I can't access the machine anymore via spider, ssh, or VNC.  However, I did see a good deal of traffic - the same kind of traffic that I saw when r4-mini-001 successfully imaged.  I tried power-cycling and saw the same traffic, but still no image in the spider ("No signal"), no ssh, and no VNC.  I don't see any activity in DeployStudio.

I'll wait an hour or so and see what happens.
Now nothing.  No video, no port 22, no port 5900.  From here in NY, this machine is a black hole.  If the spider is telling the truth about the video output, then this machine needs more hardware TLC.  Otherwise, it needs some help reimaging with something other than a spider.
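
For anyone repeating the "no port 22, no port 5900" check from a remote site, a minimal sketch of the probe is below. It uses only Python's standard `socket` module; the function names `port_open` and `probe` and the 3-second timeout are illustrative assumptions, not an existing releng tool.

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def probe(host, ports=(22, 5900), timeout=3.0):
    """Check the ssh (22) and VNC (5900) ports and return {port: reachable}."""
    return {port: port_open(host, port, timeout) for port in ports}
```

For example, `probe("bld-lion-r5-040.try.releng.scl3.mozilla.com")` would return a dict such as `{22: False, 5900: False}` in the state described here. A successful connect only shows that something is listening; it does not show that a login would work, which was the situation in comment 0.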

Probably the best way to learn more here is to try re-imaging another system in this VLAN, as Amy suggested.  That will at least rule out network issues.
Assignee: dustin → arich
I attached a monitor and the host is sitting at the login prompt. Did it reimage properly?


[vle@admin1b.private.scl3 ~]$ fping !$
fping bld-lion-r5-040.try.releng.scl3.mozilla.com
bld-lion-r5-040.try.releng.scl3.mozilla.com is alive
[vle@admin1b.private.scl3 ~]$ ssh !$
ssh bld-lion-r5-040.try.releng.scl3.mozilla.com
The authenticity of host 'bld-lion-r5-040.try.releng.scl3.mozilla.com (10.26.64.60)' can't be established.
RSA key fingerprint is 71:2c:bb:c2:7c:ba:4a:90:3d:6c:6e:a5:25:5e:ab:02.
Are you sure you want to continue connecting (yes/no)?
We have another mini that needs netbooting but has failed to connect to DeployStudio. It is on the same VLAN (vlan264).

Bug 918082 - please run hardware diagnostics on bld-lion-r5-035
No, no known password gets me into -040.

I think we need to find a known-good mini in the try VLAN (or move one to that VLAN) and netboot that to verify whether this is a network or host issue.  The common thread with -040 and -035 is that they had new disks.
Ah, the password that was set when doing the recovery install in comment 0 gets me in.  So -040 hasn't successfully netbooted.
Depends on: 918486
Host reimaged successfully.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Product: mozilla.org → Infrastructure & Operations