sea-hp-linux64-2 isn't responding to pings or ssh

Status: VERIFIED FIXED
Product: Infrastructure & Operations
Component: DCOps
Reported: 3 years ago
Modified: 3 years ago
Reporter: ewong
Assignee: Unassigned
Attachments: 1 (71.11 KB, application/pdf)

(Reporter)

Description

3 years ago
sea-hp-linux64-2 seems to be down.  It hasn't responded to pings or ssh
connection attempts for over ten minutes.

[ewong@jump1.community.scl3 ~]$ ping sea-hp-linux64-2
PING sea-hp-linux64-2.community.scl3.mozilla.com (63.245.223.113) 56(84) bytes of data.
From jump1.community.scl3.mozilla.com (63.245.223.8) icmp_seq=2 Destination Host Unreachable
From jump1.community.scl3.mozilla.com (63.245.223.8) icmp_seq=3 Destination Host Unreachable
From jump1.community.scl3.mozilla.com (63.245.223.8) icmp_seq=4 Destination Host Unreachable

[ewong@jump1.community.scl3 ~]$ ssh -l seabld sea-hp-linux64-2
ssh: connect to host sea-hp-linux64-2 port 22: No route to host
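
The manual ssh check above can be approximated in a script. The following is a minimal sketch and not something taken from this bug: the function name `check_tcp_port` and its result labels are made up for illustration, and it only tests whether TCP port 22 accepts a connection, which is a weaker check than a full ssh login.

```python
import socket


def check_tcp_port(host: str, port: int = 22, timeout: float = 5.0) -> str:
    """Classify a TCP connection attempt to host:port.

    Returns one of 'open', 'refused', 'timeout', or 'unreachable'.
    These labels are illustrative, not a standard.
    """
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered, but nothing is listening on the port.
        return "refused"
    except socket.timeout:
        # No answer within the timeout.
        return "timeout"
    except OSError:
        # Any other OS-level failure, including "No route to host"
        # and name resolution errors.
        return "unreachable"
```

Run from the jump host, a call such as `check_tcp_port("sea-hp-linux64-2")` would be expected to land in the last branch for a failure like the "No route to host" error shown above, while a healthy host with sshd running should report `open`.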

Updated

3 years ago
colo-trip: --- → scl3

Comment 1

3 years ago
for some odd reason the RAID array lost its config. I reconfigured the RAID 0 and the host is back online.

[vle@jump1.community.scl3 ~]$ fping sea-hp-linux64-2
sea-hp-linux64-2 is alive
[vle@jump1.community.scl3 ~]$ ssh !$
ssh sea-hp-linux64-2
The authenticity of host 'sea-hp-linux64-2 (63.245.223.113)' can't be established.
RSA key fingerprint is 4d:12:40:64:47:8b:0f:7e:43:81:02:82:79:a7:d6:2f.
Are you sure you want to continue connecting (yes/no)?
Status: NEW → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED
(Reporter)

Comment 2

3 years ago
(In reply to Van Le [:van] from comment #1)
> for some odd reason the RAID array lost its config. I reconfigured the RAID
> 0 and the host is back online.
> 
> [vle@jump1.community.scl3 ~]$ fping sea-hp-linux64-2
> sea-hp-linux64-2 is alive
> [vle@jump1.community.scl3 ~]$ ssh !$
> ssh sea-hp-linux64-2
> The authenticity of host 'sea-hp-linux64-2 (63.245.223.113)' can't be
> established.
> RSA key fingerprint is 4d:12:40:64:47:8b:0f:7e:43:81:02:82:79:a7:d6:2f.
> Are you sure you want to continue connecting (yes/no)?

Hmm it's back down. :(

I don't like the sound of "lost its config".
Status: RESOLVED → REOPENED
Resolution: FIXED → ---

Comment 3

3 years ago
heya :ewong, for some reason the system is alerting that the controller firmware is not up to date. I've tried to contact HP support but it looks like our support for this host has expired. this is the first time I've seen this error so I've rebooted the host a few times to make sure it comes back with the RAID config (it does).

so now I'm not sure if it's the OS causing this issue or if we do indeed need newer firmware. can you open a ticket with the SA team to see if they can grab an updated firmware from another source or check syslogs to see what's causing the host to crash if the problem reoccurs? currently the host is back online.
Status: REOPENED → RESOLVED
Last Resolved: 3 years ago → 3 years ago
Resolution: --- → FIXED
Whiteboard: out of warranty
(Reporter)

Comment 4

3 years ago
(In reply to Van Le [:van] from comment #3)
> heya :ewong, for some reason the system is alerting that the controller
> firmware is not up to date. I've tried to contact HP support but it looks
> like our support for this host has expired. this is the first time I've seen
> this error so I've rebooted the host a few times to make sure it comes back
> with the RAID config (it does).
> 
> so now I'm not sure if it's the OS causing this issue or if we do indeed
> need newer firmware. can you open a ticket with the SA team to see if they
> can grab an updated firmware from another source or check syslogs to see
> what's causing the host to crash if the problem reoccurs? currently the host
> is back online.

Thanks Van!  Pardon my ignorance, but what's the SA team?

Comment 5

3 years ago
hrm, what's an "SA Team"? my vague memory says this is the same class of machine we use in moco releng for builders.

I'm not sure where/what could be faulty here, but if this is a matter of "we can certainly fix by replacing hardware piece X" I could foresee us doing that, depending on the effort that causes on both sides of the fence, and of course the out-of-pocket cost to the SeaMonkey project.

Can we possibly get more info on how to proceed with this host?

Thanks

Updated

3 years ago
Flags: needinfo?(vle)
Flags: needinfo?(dustin)

Comment 6

3 years ago
>Pardon my ignorance, but what's the SA team?

old habits die hard... i was checking who the sysadmin on call was in inventory when i was updating this bug and instead of SRE/MOC, i wrote SA, since inventory calls them sysadmins.

>Can we possibly get more info on how to proceed with this host?

i've emailed our vendor and will get you a quote on the warranty renewal for this host.
Flags: needinfo?(vle)

Updated

3 years ago
Flags: needinfo?(dustin)

Comment 7

3 years ago
Ok, so this host is down *again* :/  :van did we ever get a reply from the vendor?
Status: RESOLVED → REOPENED
Flags: needinfo?(vle)
Resolution: FIXED → ---

Comment 8

3 years ago
>:van did we ever get a reply from vendor?

he hasn't replied, so I've emailed him for an update.
Flags: needinfo?(vle)

Comment 9

3 years ago
Created attachment 8476263 [details]
CAREPAQ.pdf

warranty renewal for host.

Comment 10

3 years ago
it appears this drive is dead and will no longer spin up. I've opened Case ID 4649185564 with HP for a replacement drive. They're out of 250GB drives so they will send us a 500GB drive instead.

Comment 11

3 years ago
drive came in, swapped the bad 250GB for a 500GB. will MOC be rebuilding this system or is it something that :callek can handle? warranty was $149 for the year.

Comment 12

3 years ago
Hey Van

This needs to be reimaged in a pretty annoying roundabout way:

https://bugzilla.mozilla.org/show_bug.cgi?id=740633#c7 (and on) has the full steps necessary.

The previous machines were all imaged as CentOS 6.2, however iirc moco's pxeboot stuff does 6.5 now. Either way, 6.5 || 6.2 is *OK* here, as long as you follow the imaging process for moco machines (e.g. https://mana.mozilla.org/wiki/display/DC/How+To+Reimage+Releng+iX+and+HP+Linux+Machines )

Updated

3 years ago
Whiteboard: out of warranty

Comment 13

3 years ago
(In reply to Van Le [:van] from comment #11)
> drive came in, swapped the bad 250GB for a 500GB. will MOC be rebuilding
> this system or is it something that :callek can handle? warranty was $149
> for the year.

So to be explicit, I can't handle this, so it's dcops/moc/whatever that will need to do it, since it's hands-on stuff.

While we're at it can I get an ETA [for our planning purposes], knowing this is relatively low priority among other work. (we have 11 other working systems, and seamonkey is itself low priority)
Flags: needinfo?(vle)

Comment 14

3 years ago
I won't be in the data center tomorrow and I'm out for training early next week so I'll try to get to it by next Thursday.
Flags: needinfo?(vle)

Comment 15

3 years ago
:callek, I have training this whole week so I won't be able to take a look at this until Monday.

Comment 16

3 years ago
(In reply to Van Le [:van] from comment #15)
> :callek, I have training this whole week so I won't be able to take a look
> at this until Monday.

Monday is a US holiday :-) but thanks for the update. don't strain yourself over getting to this.

Comment 17

3 years ago
per manual checking and IRC chat, we want 6.2 for this host. there is an option for 6.5 but that's used for "server" class machines for moco releng right now, and not slaves. we want as much matching as possible (otherwise we risk odd other issues).

My comment above was more a "I forget if slaves in moco releng use 6.5, if they do it is fine" not a "please use latest"

Comment 18

3 years ago
I reimaged the host again with 6.2 but I had to catch a flight to phx1. I'll be back Friday to do the static IP config.

Comment 19

3 years ago
BTW: Why use RAID on those systems at all when there is only one HDD? Or is the RAID controller always involved and you cannot disable it in BIOS?

Comment 20

3 years ago
host is back online. the trick is to not use a puppet password.

[vle@jump1.community.scl3 ~]$ fping sea-hp-linux64-2.community.scl3.mozilla.com
sea-hp-linux64-2.community.scl3.mozilla.com is alive
[vle@jump1.community.scl3 ~]$ ssh !$
ssh sea-hp-linux64-2.community.scl3.mozilla.com
The authenticity of host 'sea-hp-linux64-2.community.scl3.mozilla.com (63.245.223.113)' can't be established.
RSA key fingerprint is bc:60:0e:4f:5c:2b:37:df:10:c9:cf:9d:dd:60:39:ac.
Are you sure you want to continue connecting (yes/no)?
Status: REOPENED → RESOLVED
Last Resolved: 3 years ago → 3 years ago
Resolution: --- → FIXED

Comment 21

3 years ago
Sorry but this reimage is not a PuppetAgain CentOS 6.2 host; there is no relevant PuppetAgain setup in it (e.g. no puppetize.sh, yum repos incorrectly set up, etc.)
Status: RESOLVED → REOPENED
Resolution: FIXED → ---

Comment 22

3 years ago
try now.
Status: REOPENED → RESOLVED
Last Resolved: 3 years ago → 3 years ago
Resolution: --- → FIXED

Comment 23

3 years ago
Verified, this is imaged properly now.

:Van, what did you do differently compared to previous attempts, in case we ever have to do this again?
Status: RESOLVED → VERIFIED
Flags: needinfo?(vle)

Comment 24

3 years ago
the problem I ran into was that I couldn't log into the host: I couldn't boot it into single user mode (no splash and no GRUB screen, so I'm either missing it or didn't figure it out), I couldn't get the leased IP information to ssh in, and the host never boots up completely.

observium was giving me the community IP instead of the DHCP IP when I did an ARP lookup, and I don't have a complete understanding of how releng kickstarts their servers, so I wasn't sure which server to log into to grab the DHCP lease.

my lazy/do-more workaround was to image the host with no puppet password, since it then boots up completely, and log in and grab the leased DHCP IP. then it's back to the normal reimage process: reboot, reimage with the wrong puppet password, then ssh in with the leased IP and reconfigure the network.
Flags: needinfo?(vle)
Product: mozilla.org → Infrastructure & Operations
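
The DHCP lease lookup that comment 24 worked around could, on a DHCP server that keeps a text lease file, be done by searching for the host's MAC address. The following is only a sketch under a loud assumption: it assumes an ISC dhcpd-style `dhcpd.leases` file, and this bug does not say which DHCP server releng uses or where it runs. The function name `find_lease_ip` and every address in `SAMPLE` are made up for illustration.

```python
import re
from typing import Optional

# Assumption: ISC dhcpd-style lease blocks, e.g.
#   lease 10.0.0.21 { ... hardware ethernet 00:11:22:33:44:55; ... }
LEASE_BLOCK = re.compile(r"lease\s+(\S+)\s*\{(.*?)\}", re.DOTALL)
MAC_LINE = re.compile(r"hardware\s+ethernet\s+([0-9A-Fa-f:]+)\s*;")

# Made-up sample data, not taken from any real server.
SAMPLE = """
lease 10.0.0.21 {
  hardware ethernet 00:11:22:33:44:55;
}
lease 10.0.0.37 {
  hardware ethernet 00:11:22:33:44:55;
}
lease 10.0.0.40 {
  hardware ethernet aa:bb:cc:dd:ee:ff;
}
"""


def find_lease_ip(leases_text: str, mac: str) -> Optional[str]:
    """Return the IP of the last lease block recorded for `mac`, or None."""
    wanted = mac.lower()
    found = None
    for ip, body in LEASE_BLOCK.findall(leases_text):
        match = MAC_LINE.search(body)
        if match and match.group(1).lower() == wanted:
            # The lease file is append-only, so a later block for the
            # same MAC supersedes an earlier one.
            found = ip
    return found
```

Calling `find_lease_ip(SAMPLE, "00:11:22:33:44:55")` picks the later of the two matching blocks. The simple regular expression does not handle lease blocks whose quoted strings contain a closing brace, so a real lease file may need a more careful parser.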