Closed Bug 1235414 Opened 10 years ago Closed 10 years ago

bld-lion-r5-056 is unreachable

Categories

(Infrastructure & Operations :: DCOps, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: slaveapi, Assigned: van)

References

Details

(Whiteboard: host keeps powering off)

No description provided.
Host continues to power off even after a fan swap. Possibly the logic board is failing. Should we decomm this mini?
colo-trip: --- → scl3
QA Contact: jbarnell
Whiteboard: host keeps powering off
Please fix with parts from the r5 test minis we just decommed.
:arr, how do you want to proceed with this? can we just do a 1:1 swap in inventory by updating the hostname and then reimaging? swapping the logic board would change the MAC address anyway, as the NIC is embedded.
Flags: needinfo?(arich)
If we're going to do that, we want to note it in the notes field in inventory. This is probably the easiest route, but it means that we're going to conflate the service histories of the two machines, which is generally undesirable (and why we don't re-use hostnames). What's the difference in work between swapping the logic board and just slotting the decommed machine in as bld-lion-r5-056? Either way we'll have to modify inventory and likely clean out DeployStudio so it doesn't try to install it as a test machine based on the old profile.
Flags: needinfo?(arich)
i broke an inventory entry for a host when editing the dhcp scope and am now unsure how to fix it, as it won't let me get back to that page: https://inventory.mozilla.org/en-US/systems/show/6797/ i'll wait until sheeri or mpressman is back to see if they can revert my changes or edit it from the back end. can you edit this old host to replace 056 ( https://inventory.mozilla.org/en-US/systems/show/6760/ ) and i'll replace it tomorrow?
Trying to figure out what went wrong here... There are two entries that mention bld-lion-r5-056 already: https://inventory.mozilla.org/systems/show/6797/ and https://inventory.mozilla.org/systems/show/5460/ I'm assuming that 5460 is the original? I've decommissioned it using invtool to regain the IP. I've also taken the switch and PDU info from 5460 to add to https://inventory.mozilla.org/en-US/systems/show/6760/ and filled out the OS, allocation, etc. You'll need to fill in the rack location info since there was none for the original host. I went back and modified the SREG info for 6797 to include the right A/PTR record, created a DHCP scope key, and enabled DHCP. I re-added the CNAME for bld-lion-r5-056.build.mozilla.org. I rebooted the DS server since it had been up for 147 days and was reeeeally slow. I looked for the MAC of 6797 to see if there was conflicting database info that would install the wrong OS, and I found an old entry. I then moved the existing bld-lion-r5-056 entry aside, renamed the one for the repurposed machine, and moved it to the correct group. If you run into problems trying to netboot it, let me know. Nagios should just work since it's based on hostname/IP. Please look over all of the inventory info to make sure it's correct, add the necessary missing info, and try netbooting the machine.
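The A/PTR and CNAME changes described above can be sanity-checked from any machine that uses the same resolvers. The following is only a minimal sketch, not the tooling used in this bug: `forward_lookup`, `reverse_lookup`, and `check_forward_reverse` are hypothetical helper names, and it assumes IPv4 only. It resolves a name (following any CNAME), then checks that the PTR record for each address points back at the canonical name.

```python
import socket
from typing import Callable, List, Tuple


def forward_lookup(hostname: str) -> Tuple[str, List[str]]:
    """Resolve a name; return (canonical name, IPv4 addresses)."""
    canonical, _aliases, addresses = socket.gethostbyname_ex(hostname)
    return canonical, addresses


def reverse_lookup(address: str) -> str:
    """Return the primary name that the PTR record for this address gives."""
    name, _aliases, _addresses = socket.gethostbyaddr(address)
    return name


def _normalize(name: str) -> str:
    # DNS names are case-insensitive and may carry a trailing root dot.
    return name.rstrip(".").lower()


def check_forward_reverse(
    hostname: str,
    forward: Callable[[str], Tuple[str, List[str]]] = forward_lookup,
    reverse: Callable[[str], str] = reverse_lookup,
) -> List[Tuple[str, str, bool]]:
    """For each address of hostname, report (address, PTR name, matches).

    'matches' is True when the PTR name equals the canonical name from the
    forward lookup, i.e. the A and PTR records agree with each other.
    """
    canonical, addresses = forward(hostname)
    results = []
    for address in addresses:
        ptr_name = reverse(address)
        results.append(
            (address, ptr_name, _normalize(ptr_name) == _normalize(canonical))
        )
    return results
```

Calling `check_forward_reverse("bld-lion-r5-056.build.mozilla.org")` would exercise the CNAME as well, since the forward lookup follows it to the canonical name. The lookups are passed in as parameters so the comparison logic can be tried without live DNS.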
I could see that most of the info for 6797 was missing by looking at https://inventory.mozilla.org/en-US/systems/edit/6797/ I presumed it was a decommissioned host that you were trying to swap in to fix bld-lion-r5-056. I've fixed the entry for https://inventory.mozilla.org/en-US/systems/show/6797/ by marking it as decommissioned again using invtool. That programmatically deletes the DHCP scope, so it wipes out any issue there may have been. I made note of the information in the entry before I decommed it, in case you need it: switch: switch1.r401-9.ops.releng.scl3.mozilla.net:1/0/23 rack: 401-9 scl3 15.10
thanks amy, but it looks like something is wrong with the DS server? i've tried to pxeboot the host but it fails and goes right into the OS. i've also tried using the bless command, and i see the gray "?" folder before it goes back to booting into the OS.
Flags: needinfo?(arich)
I can ping the host, but can't log in. I don't see any record of it trying to contact the DS host.
Flags: needinfo?(arich)
One possibility... Did you wipe out the disks and reformat them? The test and build hosts don't use the same configuration. Ideally it should just overwrite and fix that, but it might not.
reimaging worked today. please let me know of any other issues.
    vans-MacBook-Pro:~ vle$ fping bld-lion-r5-056.build.releng.scl3.mozilla.com
    bld-lion-r5-056.build.releng.scl3.mozilla.com is alive
    vans-MacBook-Pro:~ vle$ ssh bld-lion-r5-056.build.releng.scl3.mozilla.com
    The authenticity of host 'bld-lion-r5-056.build.releng.scl3.mozilla.com (10.26.52.76)' can't be established.
    RSA key fingerprint is 25:58:5a:e5:b3:ae:37:a3:b5:c7:75:91:80:b7:3e:d4.
    Are you sure you want to continue connecting (yes/no)?
Assignee: server-ops-dcops → vle
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
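The manual fping and ssh check in the final comment could be scripted for a post-reimage verification. This is a rough sketch under stated assumptions, not something used in this bug: `port_is_open` is a hypothetical helper, and it tests only whether the SSH port accepts a TCP connection. It does not send ICMP the way fping does, and it does not verify the host key.

```python
import socket


def port_is_open(host: str, port: int = 22, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the name and connects; closing the
        # socket right away is enough to show the service is listening.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

For example, `port_is_open("bld-lion-r5-056.build.releng.scl3.mozilla.com")` would return True once sshd on the reimaged host is accepting connections.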