1050618 - sea-hp-linux64-2 isn't responding to pings nor ssh

Reporter

Description

•

11 years ago

sea-hp-linux64-2 seems to be down. It hasn't responded to any pings or ssh connection attempts for over ten minutes.

[ewong@jump1.community.scl3 ~]$ ping sea-hp-linux64-2
PING sea-hp-linux64-2.community.scl3.mozilla.com (63.245.223.113) 56(84) bytes of data.
From jump1.community.scl3.mozilla.com (63.245.223.8) icmp_seq=2 Destination Host Unreachable
From jump1.community.scl3.mozilla.com (63.245.223.8) icmp_seq=3 Destination Host Unreachable
From jump1.community.scl3.mozilla.com (63.245.223.8) icmp_seq=4 Destination Host Unreachable
[ewong@jump1.community.scl3 ~]$ ssh -l seabld sea-hp-linux64-2
ssh: connect to host sea-hp-linux64-2 port 22: No route to host
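For triage, the ping transcript above can be summarized mechanically. The following is a minimal sketch, assuming Linux iputils-style ping output; the function name `summarize_ping` is hypothetical and not part of any tooling mentioned in this bug. Note that the "Destination Host Unreachable" errors come from the jump host itself (63.245.223.8), which typically suggests the target never answered ARP on the local network, as opposed to a routing problem further away.

```python
import re


def summarize_ping(output: str) -> dict:
    """Count echo replies and 'unreachable' errors in Linux ping output."""
    # Successful replies look like: "64 bytes from host (ip): icmp_seq=1 ttl=64 ..."
    replies = len(re.findall(r"bytes from .*icmp_seq=\d+", output))
    # Locally generated errors look like the lines in the report above.
    unreachable = len(
        re.findall(r"icmp_seq=\d+ Destination Host Unreachable", output)
    )
    if replies:
        state = "up"
    elif unreachable:
        state = "unreachable"
    else:
        state = "no-response"
    return {"replies": replies, "unreachable": unreachable, "state": state}


# Transcript copied from the report above.
SAMPLE = """\
PING sea-hp-linux64-2.community.scl3.mozilla.com (63.245.223.113) 56(84) bytes of data.
From jump1.community.scl3.mozilla.com (63.245.223.8) icmp_seq=2 Destination Host Unreachable
From jump1.community.scl3.mozilla.com (63.245.223.8) icmp_seq=3 Destination Host Unreachable
From jump1.community.scl3.mozilla.com (63.245.223.8) icmp_seq=4 Destination Host Unreachable
"""

if __name__ == "__main__":
    print(summarize_ping(SAMPLE))
```

A summary with zero replies and a nonzero unreachable count is consistent with a host that is powered off or hung before its network stack comes up, which fits the later finding of a failed drive.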

Vinh Hua [:vinh]

Updated

•

11 years ago

colo-trip: --- → scl3

Van Le [:van]

Comment 1

•

11 years ago

for some odd reason the raid array lost its config. i reconfigured the raid 0 and the host is back online.

[vle@jump1.community.scl3 ~]$ fping sea-hp-linux64-2
sea-hp-linux64-2 is alive
[vle@jump1.community.scl3 ~]$ ssh !$
ssh sea-hp-linux64-2
The authenticity of host 'sea-hp-linux64-2 (63.245.223.113)' can't be established.
RSA key fingerprint is 4d:12:40:64:47:8b:0f:7e:43:81:02:82:79:a7:d6:2f.
Are you sure you want to continue connecting (yes/no)?

Status: NEW → RESOLVED

Closed: 11 years ago

Resolution: --- → FIXED

Edmund Wong (:ewong)

Reporter

Comment 2

•

11 years ago

(In reply to Van Le [:van] from comment #1)
> for some odd reason the raid array lost its config. i reconfigured the raid
> 0 and the host is back online.
>
> [vle@jump1.community.scl3 ~]$ fping sea-hp-linux64-2
> sea-hp-linux64-2 is alive
> [vle@jump1.community.scl3 ~]$ ssh !$
> ssh sea-hp-linux64-2
> The authenticity of host 'sea-hp-linux64-2 (63.245.223.113)' can't be
> established.
> RSA key fingerprint is 4d:12:40:64:47:8b:0f:7e:43:81:02:82:79:a7:d6:2f.
> Are you sure you want to continue connecting (yes/no)?

Hmm, it's back down. :( I don't like the sound of "lost its config".

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

Van Le [:van]

Comment 3

•

11 years ago

heya :ewong, for some reason the system is alerting that the controller firmware is not up to date. i've tried to contact HP support but it looks like our support for this host has expired. this is the first time i've seen this error so i've rebooted the host a few times to make sure it comes back with the RAID config (it does).

so now I'm not sure if it's the OS causing this issue or if we do indeed need newer firmware. can you open a ticket with the SA team to see if they can grab an updated firmware from another source or check syslogs to see what's causing the host to crash if the problem reoccurs? currently the host is back online.

Status: REOPENED → RESOLVED

Closed: 11 years ago → 11 years ago

Resolution: --- → FIXED

Whiteboard: out of warranty

Edmund Wong (:ewong)

Reporter

Comment 4

•

11 years ago

(In reply to Van Le [:van] from comment #3)
> heya :ewong, for some reason the system is alerting that the controller
> firmware is not up to date. i've tried to contact HP support but it looks
> like our support for this host has expired. this is the first time i've seen
> this error so i've rebooted the host a few times to make sure it comes back
> with the RAID config (it does).
>
> so now I'm not sure if it's the OS causing this issue or if we do indeed
> need newer firmware. can you open a ticket with the SA team to see if they
> can grab an updated firmware from another source or check syslogs to see
> what's causing the host to crash if the problem reoccurs? currently the host
> is back online.

Thanks Van! Pardon my ignorance, but what's the SA team?

Justin Wood (:Callek)

Comment 5

•

11 years ago

hrm, what's an "SA Team"? vague memory said it was the same class of machines we use in moco releng for builders. I'm not sure where/what could be faulty here, but if this is a matter of "we can certainly fix by replacing hardware piece X" I could foresee us doing that, depending on the effort that causes on both sides of the fence, and of course the out-of-pocket cost to the SeaMonkey project.

Can we possibly get more info on how to proceed with this host? Thanks

Justin Wood (:Callek)

Updated

•

11 years ago

Flags: needinfo?(vle)

Flags: needinfo?(dustin)

Van Le [:van]

Comment 6

•

11 years ago

> Pardon my ignorance, but what's the SA team?

old habits die hard... i was checking who the sysadmin on call was in inventory when i was updating this bug and instead of SRE/MOC, i wrote SA, since inventory calls them sysadmins.

> Can we possibly get more info on how to proceed with this host?

i've emailed our vendor and will get you a quote on the warranty renewal for this host.

Flags: needinfo?(vle)

Justin Wood (:Callek)

Updated

•

11 years ago

Flags: needinfo?(dustin)

Justin Wood (:Callek)

Comment 7

•

11 years ago

Ok, so this host is down *again* :/ :van did we ever get a reply from vendor?

Status: RESOLVED → REOPENED

Flags: needinfo?(vle)

Resolution: FIXED → ---

Van Le [:van]

Comment 8

•

11 years ago

> :van did we ever get a reply from vendor?

he hasn't so i've emailed him for an update.

Flags: needinfo?(vle)

Van Le [:van]

Comment 9

•

11 years ago

Attached file CAREPAQ.pdf — Details

warranty renewal for host.

Van Le [:van]

Comment 10

•

11 years ago

it appears this drive is dead and will no longer spin up. i've opened Case ID 4649185564 with HP for a replacement drive. They're out of 250GB drives so they will send us a 500GB drive instead.

Van Le [:van]

Comment 11

•

11 years ago

drive came in, swapped the bad 250GB for a 500GB. will MOC be rebuilding this system or is it something that :callek can handle? warranty was $149 for the year.

Justin Wood (:Callek)

Comment 12

•

11 years ago

Hey Van,

This needs to be reimaged in a pretty annoying roundabout way: https://bugzilla.mozilla.org/show_bug.cgi?id=740633#c7 (and on) has the full steps necessary.

The previous machines were all imaged as CentOS 6.2, however iirc moco's pxeboot stuff does 6.5 now. Either way, 6.5 || 6.2 is *OK* here, as long as you follow the imaging process for moco machines (e.g. https://mana.mozilla.org/wiki/display/DC/How+To+Reimage+Releng+iX+and+HP+Linux+Machines )

Justin Wood (:Callek)

Updated

•

11 years ago

Whiteboard: out of warranty

Justin Wood (:Callek)

Comment 13

•

11 years ago

(In reply to Van Le [:van] from comment #11)
> drive came in, swapped the bad 250GB for a 500GB. will MOC be rebuilding
> this system or is it something that :callek can handle? warranty was $149
> for the year.

So to be explicit, I can't handle it, so it is dcops/moc/whatever that will need to do this since it's hands-on stuff.

While we're at it, can I get an ETA [for our planning purposes], knowing this is relatively low priority among other work? (we have 11 other working systems, and seamonkey is itself low priority)

Flags: needinfo?(vle)

Van Le [:van]

Comment 14

•

11 years ago

I won't be in the data center tomorrow and I'm out for training early next week so I'll try to get to it by next Thursday.

Flags: needinfo?(vle)

Van Le [:van]

Comment 15

•

11 years ago

:callek, I have training this whole week so I won't be able to take a look at this until Monday.

Justin Wood (:Callek)

Comment 16

•

11 years ago

(In reply to Van Le [:van] from comment #15)
> :callek, I have training this whole week so I won't be able to take a look at
> this until Monday.

monday is a US holiday :-) but thanks for the update. don't strain yourself over getting to this.

Justin Wood (:Callek)

Comment 17

•

11 years ago

per manual checking and IRC chat, we want 6.2 for this host. there is an option for 6.5 but that's used for "server" class machines for moco releng right now, and not slaves. we want as much matching as possible (otherwise we risk odd other issues).

My comment above was more a "I forget if slaves in moco releng use 6.5, if they do it is fine", not a "please use latest".

Van Le [:van]

Comment 18

•

11 years ago

i reimaged the host again with 6.2 but i had to catch a flight to phx1. i'll be back friday to do the static ip config.

Frank Wein [:mcsmurf]

Comment 19

•

11 years ago

BTW: Why use RAID on those systems at all when there is only one HDD? Or is the RAID controller always involved and you cannot disable it in BIOS?

Van Le [:van]

Comment 20

•

11 years ago

host is back online. the trick is to not use a puppet password.

[vle@jump1.community.scl3 ~]$ fping sea-hp-linux64-2.community.scl3.mozilla.com
sea-hp-linux64-2.community.scl3.mozilla.com is alive
[vle@jump1.community.scl3 ~]$ ssh !$
ssh sea-hp-linux64-2.community.scl3.mozilla.com
The authenticity of host 'sea-hp-linux64-2.community.scl3.mozilla.com (63.245.223.113)' can't be established.
RSA key fingerprint is bc:60:0e:4f:5c:2b:37:df:10:c9:cf:9d:dd:60:39:ac.
Are you sure you want to continue connecting (yes/no)?

Status: REOPENED → RESOLVED

Closed: 11 years ago → 11 years ago

Resolution: --- → FIXED
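The RSA fingerprint shown in comment 20 differs from the one in comment 1, which is what one would expect after a reimage, since a fresh install normally generates new SSH host keys. As a minimal sketch of how such a comparison could be scripted (the helper names below are hypothetical, and only the colon-separated MD5-style format seen in this bug is handled):

```python
import re

# 16 colon-separated hex byte pairs, as printed by older OpenSSH clients.
_MD5_FP = re.compile(r"^(?:[0-9a-f]{2}:){15}[0-9a-f]{2}$")


def normalize_fingerprint(fp: str) -> str:
    """Lower-case the fingerprint and strip whitespace, a trailing '.', and any 'MD5:' prefix."""
    fp = fp.strip().rstrip(".").lower()
    if fp.startswith("md5:"):
        fp = fp[len("md5:"):]
    if not _MD5_FP.match(fp):
        raise ValueError(f"not an MD5-style fingerprint: {fp!r}")
    return fp


def host_key_changed(old: str, new: str) -> bool:
    """Return True when the two fingerprints identify different host keys."""
    return normalize_fingerprint(old) != normalize_fingerprint(new)


# Fingerprints copied from comment 1 and comment 20 of this bug.
BEFORE = "4d:12:40:64:47:8b:0f:7e:43:81:02:82:79:a7:d6:2f."
AFTER = "bc:60:0e:4f:5c:2b:37:df:10:c9:cf:9d:dd:60:39:ac."

if __name__ == "__main__":
    print(host_key_changed(BEFORE, AFTER))
```

A changed key after a known reimage is expected; a changed key with no such explanation would deserve a closer look before answering "yes" at the prompt.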

Justin Wood (:Callek)

Comment 21

•

11 years ago

Sorry, but this reimage is not a PuppetAgain CentOS 6.2 host; there is no relevant PuppetAgain setup in it (e.g. no puppetize.sh, yum repos incorrectly set up, etc).

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

Van Le [:van]

Comment 22

•

11 years ago

try now.

Status: REOPENED → RESOLVED

Closed: 11 years ago → 11 years ago

Resolution: --- → FIXED

Justin Wood (:Callek)

Comment 23

•

11 years ago

Verified, this is imaged properly now.

:Van, what did you do differently compared to previous attempts, in case we ever have to do this again?

Status: RESOLVED → VERIFIED

Flags: needinfo?(vle)

Van Le [:van]

Comment 24

•

11 years ago

the problem i ran into was that i couldn't log into the host: i couldn't boot it into single user mode (no splash and no GRUB screen, so i'm either missing it or didn't figure it out), i couldn't get the leased IP information to ssh in, and the host never boots up completely. observium was giving me the community IP instead of the DHCP IP when i did an ARP lookup, and i don't have a complete understanding of how releng kickstarts their servers, so i wasn't sure which server to log into to grab the DHCP lease.

my lazy/do-more workaround was to image the host with no puppet password, since it then boots up completely, and to log in and grab the leased DHCP IP. then it's back to the normal reimage process: reboot, reimage with wrong puppet password, then ssh in with the leased IP and reconfigure the network.

Flags: needinfo?(vle)
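The sticking point described in comment 24 was finding the DHCP address leased to the freshly imaged host. If one does have access to the DHCP server, the lease can be looked up by MAC address. The following is only a sketch under stated assumptions: it assumes an ISC dhcpd-style leases file, which this bug does not confirm is what releng used, and the function name, MAC addresses, and IP addresses are all hypothetical placeholders.

```python
import re

# Assumption: ISC dhcpd lease blocks of the form
#   lease <ip> { ... hardware ethernet <mac>; ... }
# Nested braces inside a lease block are not handled by this simple pattern.
LEASE_BLOCK = re.compile(r"lease\s+(\S+)\s*\{(.*?)\}", re.DOTALL)
MAC_LINE = re.compile(r"hardware\s+ethernet\s+([0-9A-Fa-f:]+)\s*;")


def find_leases_for_mac(leases_text: str, mac: str) -> list:
    """Return every leased IP recorded for `mac`, in file order (oldest first)."""
    wanted = mac.lower()
    hits = []
    for ip, body in LEASE_BLOCK.findall(leases_text):
        match = MAC_LINE.search(body)
        if match and match.group(1).lower() == wanted:
            hits.append(ip)
    return hits


# Hypothetical sample data using documentation-reserved addresses.
SAMPLE_LEASES = """\
lease 192.0.2.10 {
  binding state free;
  hardware ethernet 00:00:5e:00:53:01;
}
lease 192.0.2.11 {
  binding state active;
  hardware ethernet 00:00:5e:00:53:02;
}
lease 192.0.2.12 {
  binding state active;
  hardware ethernet 00:00:5E:00:53:01;
}
"""

if __name__ == "__main__":
    ips = find_leases_for_mac(SAMPLE_LEASES, "00:00:5e:00:53:01")
    # dhcpd appends to the leases file, so the last entry is the most recent.
    print(ips[-1] if ips else "no lease found")
```

With a lookup like this, the intermediate "image with no puppet password" pass might be avoidable, though that depends on who can read the lease data in the environment in question.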

Nobody; OK to take it and work on it

Updated

•

10 years ago

Product: mozilla.org → Infrastructure & Operations