Closed Bug 1050618 Opened 11 years ago Closed 11 years ago

sea-hp-linux64-2 isn't responding to pings nor ssh

Categories

(Infrastructure & Operations :: DCOps, task)

x86
Windows Vista
task
Not set
normal

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: ewong, Unassigned)

Details

Attachments

(1 file)

sea-hp-linux64-2 seems to be down. It hasn't responded to pings or ssh connection attempts for over ten minutes.

[ewong@jump1.community.scl3 ~]$ ping sea-hp-linux64-2
PING sea-hp-linux64-2.community.scl3.mozilla.com (63.245.223.113) 56(84) bytes of data.
From jump1.community.scl3.mozilla.com (63.245.223.8) icmp_seq=2 Destination Host Unreachable
From jump1.community.scl3.mozilla.com (63.245.223.8) icmp_seq=3 Destination Host Unreachable
From jump1.community.scl3.mozilla.com (63.245.223.8) icmp_seq=4 Destination Host Unreachable
[ewong@jump1.community.scl3 ~]$ ssh -l seabld sea-hp-linux64-2
ssh: connect to host sea-hp-linux64-2 port 22: No route to host
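For anyone who wants to script the same ssh reachability check rather than run it by hand, here is a minimal sketch in Python. It only tests whether a TCP connection to port 22 can be opened, which is not the same as an ICMP ping. The function name `tcp_port_open` and its `connect` parameter are hypothetical names made up for this example; they are not part of any tooling mentioned in this bug.

```python
import socket
from typing import Callable


def tcp_port_open(host: str, port: int = 22, timeout: float = 3.0,
                  connect: Callable = socket.create_connection) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    `connect` is injectable (hypothetical parameter) so the check can be
    exercised without touching the network.
    """
    try:
        conn = connect((host, port), timeout=timeout)
    except OSError:
        # Covers "No route to host", refused connections, timeouts
        # and name-resolution failures.
        return False
    conn.close()
    return True


if __name__ == "__main__":
    host = "sea-hp-linux64-2"
    state = "reachable" if tcp_port_open(host) else "NOT reachable"
    print(f"{host} port 22 is {state}")
```

A failed check only says that port 22 could not be reached from wherever the script runs; it does not say why the host is down.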
colo-trip: --- → scl3
for some odd reason the raid array lost its config. i reconfigured the raid 0 and the host is back online.

[vle@jump1.community.scl3 ~]$ fping sea-hp-linux64-2
sea-hp-linux64-2 is alive
[vle@jump1.community.scl3 ~]$ ssh !$
ssh sea-hp-linux64-2
The authenticity of host 'sea-hp-linux64-2 (63.245.223.113)' can't be established.
RSA key fingerprint is 4d:12:40:64:47:8b:0f:7e:43:81:02:82:79:a7:d6:2f.
Are you sure you want to continue connecting (yes/no)?
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
(In reply to Van Le [:van] from comment #1)
> for some odd reason the raid array lost it's config. i reconfigured the raid
> 0 and the host is back online.
>
> [vle@jump1.community.scl3 ~]$ fping sea-hp-linux64-2
> sea-hp-linux64-2 is alive
> [vle@jump1.community.scl3 ~]$ ssh !$
> ssh sea-hp-linux64-2
> The authenticity of host 'sea-hp-linux64-2 (63.245.223.113)' can't be
> established.
> RSA key fingerprint is 4d:12:40:64:47:8b:0f:7e:43:81:02:82:79:a7:d6:2f.
> Are you sure you want to continue connecting (yes/no)?

Hmm it's back down. :( I don't like the sound of "lost its config".
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
heya :ewong, for some reason the system is alerting that the controller firmware is not up to date. i've tried to contact HP support but it looks like our support for this host has expired. this is the first time i've seen this error so i've rebooted the host a few times to make sure it comes back with the RAID config (it does).

so now I'm not sure if it's the OS causing this issue or if we do indeed need newer firmware. can you open a ticket with the SA team to see if they can grab an updated firmware from another source or check syslogs to see what's causing the host to crash if the problem reoccurs? currently the host is back online.
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Resolution: --- → FIXED
Whiteboard: out of warranty
(In reply to Van Le [:van] from comment #3)
> heya :ewong, for some reason the system is alerting that the controller
> firmware is not up to date. i've tried to contact HP support but it looks
> like our support for this host has expired. this is the first time ive seen
> this error so ive rebooted the host a few times to make sure it comes back
> with the RAID config (it does).
>
> so now I'm not sure if it's the OS causing this issue or if we do indeed
> newer firmware. can you open a ticket with the SA team to see if they can
> grab an updated firmware from another source or check syslogs to see what's
> causing the host to crash if the problem reoccurs? currently the host is
> back online.

Thanks Van! Pardon my ignorance, but what's the SA team?
hrm, what's an "SA Team"? vague memory said it was the same class of machines we use in moco releng for builders. I'm not sure where/what could be faulty here, but if this is a matter of "we can certainly fix by replacing hardware piece X" I could foresee us doing that, depending on the effort that causes on both sides of the fence, and of course the out-of-pocket cost to the SeaMonkey project.

Can we possibly get more info on how to proceed with this host? Thanks
Flags: needinfo?(vle)
Flags: needinfo?(dustin)
>Pardon my ignorance, but what's the SA team?

old habits die hard... i was checking who the sysadmin on call was in inventory when i was updating this bug and instead of SRE/MOC, i wrote SA, since inventory calls them sysadmins.

>Can we possibly get more info on how to proceed with this host?

i've emailed our vendor and will get you a quote on the warranty renewal for this host.
Flags: needinfo?(vle)
Flags: needinfo?(dustin)
Ok, so this host is down *again* :/ :van did we ever get a reply from vendor?
Status: RESOLVED → REOPENED
Flags: needinfo?(vle)
Resolution: FIXED → ---
>:van did we ever get a reply from vendor?

he hasn't replied, so i've emailed him for an update.
Flags: needinfo?(vle)
Attached file CAREPAQ.pdf
warranty renewal for host.
it appears this drive is dead and will no longer spin up. i've opened Case ID 4649185564 with HP for a replacement drive. They're out of 250GB drives so will send us a 500GB drive instead.
drive came in, swapped the bad 250GB for a 500GB. will MOC be rebuilding this system or is this something that :callek can handle? warranty was $149 for the year.
Hey Van,

This needs to be reimaged in a pretty annoying roundabout way: https://bugzilla.mozilla.org/show_bug.cgi?id=740633#c7 (and on) has the full steps necessary.

The previous machines were all imaged as centOS6.2, however iirc moco's pxeboot stuff does 6.5 now. Either way, 6.5 || 6.2 is *OK* here, as long as you follow the imaging process for moco machines (e.g. https://mana.mozilla.org/wiki/display/DC/How+To+Reimage+Releng+iX+and+HP+Linux+Machines )
Whiteboard: out of warranty
(In reply to Van Le [:van] from comment #11)
> drive came in, swapped the bad 250GB for a 500GB. will MOC be rebuilding
> this system or is something that :callek can handle? warranty was $149 for
> the year.

So to be explicit, I can't handle this, so it is dcops/moc/whatever that will need to do it, since it's hands-on stuff. While we're at it, can I get an ETA [for our planning purposes], knowing this is relatively low priority among other work? (we have 11 other working systems, and seamonkey is itself low priority)
Flags: needinfo?(vle)
I won't be in the data center tomorrow and I'm out for training early next week so I'll try to get to it by next Thursday.
Flags: needinfo?(vle)
:callek, I have training this whole week so I won't be able to take a look at this until Monday.
(In reply to Van Le [:van] from comment #15)
> :callek, I have training this whole week so I wont be able to take a look at
> this until Monday.

monday is a US holiday :-) but thanks for the update. don't strain yourself over getting to this.
per manual checking and IRC chat, we want 6.2 for this host. there is an option for 6.5 but that's used for "server" class machines for moco releng right now, and not slaves. we want as much matching as possible (otherwise we risk odd other issues). My comment above was more a "I forget if slaves in moco releng use 6.5, if they do it is fine" not a "please use latest"
i reimaged the host again with 6.2 but i had to catch a flight to phx1. i'll be back friday to do the static ip config.
BTW: Why use RAID on those systems at all when there is only one HDD? Or is the RAID controller always involved and you cannot disable it in BIOS?
host is back online. the trick is to not use a puppet password.

[vle@jump1.community.scl3 ~]$ fping sea-hp-linux64-2.community.scl3.mozilla.com
sea-hp-linux64-2.community.scl3.mozilla.com is alive
[vle@jump1.community.scl3 ~]$ ssh !$
ssh sea-hp-linux64-2.community.scl3.mozilla.com
The authenticity of host 'sea-hp-linux64-2.community.scl3.mozilla.com (63.245.223.113)' can't be established.
RSA key fingerprint is bc:60:0e:4f:5c:2b:37:df:10:c9:cf:9d:dd:60:39:ac.
Are you sure you want to continue connecting (yes/no)?
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Resolution: --- → FIXED
Sorry but this reimage is not a PuppetAgain CENTOS6.2 host, there is no relevant puppetAgain setup in it, (e.g. no puppetize.sh, yum repos incorrectly setup, etc)
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
try now.
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Resolution: --- → FIXED
Verified, this is imaged properly now. :Van, what did you do differently compared to previous attempts, in case we ever have to do this again?
Status: RESOLVED → VERIFIED
Flags: needinfo?(vle)
the problem i ran into was that i couldn't log into the host: i couldn't boot it into single user mode (no splash and no GRUB screen, so i'm either missing it or didn't figure it out) and i couldn't get the leased IP information to ssh in, and the host never boots up completely. observium was giving me the community IP instead of the DHCP IP when i did an ARP lookup, and i don't have a complete understanding of how releng kickstarts their servers, so i wasn't sure which server to log into to grab the DHCP lease.

my lazy/do-more workaround was to image the host with no puppet password, since it then boots up completely; log in and grab the leased DHCP IP. Then it's back to the normal reimage process - reboot, reimage with wrong puppet password, then ssh in with the leased IP and reconfigure the network.
Flags: needinfo?(vle)
Product: mozilla.org → Infrastructure & Operations