Closed
Bug 1050618
Opened 11 years ago
Closed 11 years ago
sea-hp-linux64-2 isn't responding to pings nor ssh
Categories
(Infrastructure & Operations :: DCOps, task)
Tracking
(Not tracked)
VERIFIED
FIXED
People
(Reporter: ewong, Unassigned)
Details
Attachments
(1 file)
71.11 KB, application/pdf
sea-hp-linux64-2 seems to be down. It hasn't responded to any pings or
ssh connections for over ten minutes.
[ewong@jump1.community.scl3 ~]$ ping sea-hp-linux64-2
PING sea-hp-linux64-2.community.scl3.mozilla.com (63.245.223.113) 56(84) bytes of data.
From jump1.community.scl3.mozilla.com (63.245.223.8) icmp_seq=2 Destination Host Unreachable
From jump1.community.scl3.mozilla.com (63.245.223.8) icmp_seq=3 Destination Host Unreachable
From jump1.community.scl3.mozilla.com (63.245.223.8) icmp_seq=4 Destination Host Unreachable
[ewong@jump1.community.scl3 ~]$ ssh -l seabld sea-hp-linux64-2
ssh: connect to host sea-hp-linux64-2 port 22: No route to host
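Both failures above ("Destination Host Unreachable", "No route to host") point at the host being down entirely rather than sshd misbehaving. The same kind of reachability check can be scripted for monitoring; a minimal sketch (the helper name and defaults are mine, not part of any existing tooling):

```python
import socket

def ssh_reachable(host, port=22, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts alike.
        return False
```

A True result only proves the TCP handshake completed; a hung box can still accept connections, so this complements rather than replaces a real health check.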
Updated•11 years ago
colo-trip: --- → scl3
Comment 1•11 years ago
For some odd reason the RAID array lost its config. I reconfigured the RAID 0 and the host is back online.
[vle@jump1.community.scl3 ~]$ fping sea-hp-linux64-2
sea-hp-linux64-2 is alive
[vle@jump1.community.scl3 ~]$ ssh !$
ssh sea-hp-linux64-2
The authenticity of host 'sea-hp-linux64-2 (63.245.223.113)' can't be established.
RSA key fingerprint is 4d:12:40:64:47:8b:0f:7e:43:81:02:82:79:a7:d6:2f.
Are you sure you want to continue connecting (yes/no)?
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Reporter
Comment 2•11 years ago
(In reply to Van Le [:van] from comment #1)
> For some odd reason the RAID array lost its config. I reconfigured the RAID
> 0 and the host is back online.
>
> [vle@jump1.community.scl3 ~]$ fping sea-hp-linux64-2
> sea-hp-linux64-2 is alive
> [vle@jump1.community.scl3 ~]$ ssh !$
> ssh sea-hp-linux64-2
> The authenticity of host 'sea-hp-linux64-2 (63.245.223.113)' can't be
> established.
> RSA key fingerprint is 4d:12:40:64:47:8b:0f:7e:43:81:02:82:79:a7:d6:2f.
> Are you sure you want to continue connecting (yes/no)?
Hmm it's back down. :(
I don't like the sound of "lost its config".
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 3•11 years ago
Heya :ewong, for some reason the system is alerting that the controller firmware is not up to date. I've tried to contact HP support, but it looks like our support for this host has expired. This is the first time I've seen this error, so I've rebooted the host a few times to make sure it comes back with the RAID config (it does).
So now I'm not sure if it's the OS causing this issue or if we do indeed need newer firmware. Can you open a ticket with the SA team to see if they can grab updated firmware from another source, or check syslogs to see what's causing the host to crash if the problem reoccurs? Currently the host is back online.
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Resolution: --- → FIXED
Whiteboard: out of warranty
Reporter
Comment 4•11 years ago
(In reply to Van Le [:van] from comment #3)
> Heya :ewong, for some reason the system is alerting that the controller
> firmware is not up to date. I've tried to contact HP support, but it looks
> like our support for this host has expired. This is the first time I've seen
> this error, so I've rebooted the host a few times to make sure it comes back
> with the RAID config (it does).
>
> So now I'm not sure if it's the OS causing this issue or if we do indeed
> need newer firmware. Can you open a ticket with the SA team to see if they
> can grab updated firmware from another source, or check syslogs to see
> what's causing the host to crash if the problem reoccurs? Currently the
> host is back online.
Thanks Van! Pardon my ignorance, but what's the SA team?
Comment 5•11 years ago
Hrm, what's an "SA team"? Vague memory says this is the same class of machine we use in moco releng for builders.
I'm not sure where/what could be faulty here, but if this is a matter of "we can certainly fix it by replacing hardware piece X" I could foresee us doing that, depending on the effort that causes on both sides of the fence, and of course the out-of-pocket cost to the SeaMonkey project.
Can we possibly get more info on how to proceed with this host?
Thanks
Updated•11 years ago
Flags: needinfo?(vle)
Flags: needinfo?(dustin)
Comment 6•11 years ago
>Pardon my ignorance, but what's the SA team?
Old habits die hard... I was checking who the sysadmin on call was in inventory when I was updating this bug, and instead of SRE/MOC I wrote SA, since inventory calls them sysadmins.
>Can we possibly get more info on how to proceed with this host?
I've emailed our vendor and will get you a quote on the warranty renewal for this host.
Flags: needinfo?(vle)
Updated•11 years ago
Flags: needinfo?(dustin)
Comment 7•11 years ago
Ok, so this host is down *again* :/ :van, did we ever get a reply from the vendor?
Status: RESOLVED → REOPENED
Flags: needinfo?(vle)
Resolution: FIXED → ---
Comment 8•11 years ago
>:van, did we ever get a reply from the vendor?
He hasn't, so I've emailed him for an update.
Flags: needinfo?(vle)
Comment 9•11 years ago
Warranty renewal quote for host (attached).
Comment 10•11 years ago
It appears this drive is dead and will no longer spin up. I've opened Case ID 4649185564 with HP for a replacement drive. They're out of 250GB drives, so they'll send us a 500GB drive instead.
Comment 11•11 years ago
The drive came in; I swapped the bad 250GB for a 500GB. Will MOC be rebuilding this system, or is it something that :callek can handle? The warranty was $149 for the year.
Comment 12•11 years ago
Hey Van
This needs to be reimaged in a pretty annoying, roundabout way:
https://bugzilla.mozilla.org/show_bug.cgi?id=740633#c7 (and on) has the full steps necessary.
The previous machines were all imaged as CentOS 6.2; however, IIRC moco's pxeboot stuff does 6.5 now. Either way, 6.5 || 6.2 is *OK* here, as long as you follow the imaging process for moco machines (e.g. https://mana.mozilla.org/wiki/display/DC/How+To+Reimage+Releng+iX+and+HP+Linux+Machines).
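One quick way to confirm which image actually landed on the box after a reimage is the release string in /etc/redhat-release. A small sketch of checking it (the parsing helper is mine, not part of any imaging tooling):

```python
import re

def centos_version(release_line):
    """Extract (major, minor) from a /etc/redhat-release style line,
    e.g. 'CentOS release 6.2 (Final)' -> (6, 2)."""
    m = re.search(r"release\s+(\d+)\.(\d+)", release_line)
    if not m:
        raise ValueError("unrecognized release line: %r" % release_line)
    return int(m.group(1)), int(m.group(2))

# On the freshly imaged host this would be something like:
#   with open("/etc/redhat-release") as f:
#       print(centos_version(f.read()))  # expect (6, 2) for a 6.2 image
```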
Updated•11 years ago
Whiteboard: out of warranty
Comment 13•11 years ago
(In reply to Van Le [:van] from comment #11)
> The drive came in; I swapped the bad 250GB for a 500GB. Will MOC be
> rebuilding this system, or is it something that :callek can handle? The
> warranty was $149 for the year.
So to be explicit: I can't handle this, so it's dcops/moc/whoever that will need to do it, since it's hands-on stuff.
While we're at it, can I get an ETA [for our planning purposes], knowing this is relatively low priority among other work? (We have 11 other working systems, and SeaMonkey is itself low priority.)
Flags: needinfo?(vle)
Comment 14•11 years ago
I won't be in the data center tomorrow and I'm out for training early next week, so I'll try to get to it by next Thursday.
Flags: needinfo?(vle)
Comment 15•11 years ago
:callek, I have training this whole week so I won't be able to take a look at this until Monday.
Comment 16•11 years ago
(In reply to Van Le [:van] from comment #15)
> :callek, I have training this whole week so I won't be able to take a look
> at this until Monday.
Monday is a US holiday :-) but thanks for the update. Don't strain yourself over getting to this.
Comment 17•11 years ago
Per manual checking and IRC chat, we want 6.2 for this host. There is an option for 6.5, but that's used for "server" class machines in moco releng right now, not slaves. We want as much matching as possible (otherwise we risk other odd issues).
My comment above was more "I forget if slaves in moco releng use 6.5; if they do, it is fine", not "please use latest".
Comment 18•11 years ago
I reimaged the host again with 6.2, but I had to catch a flight to phx1. I'll be back Friday to do the static IP config.
Comment 19•11 years ago
BTW: Why use RAID on those systems at all when there is only one HDD? Or is the RAID controller always involved and you cannot disable it in BIOS?
Comment 20•11 years ago
Host is back online. The trick is to not use a puppet password.
[vle@jump1.community.scl3 ~]$ fping sea-hp-linux64-2.community.scl3.mozilla.com
sea-hp-linux64-2.community.scl3.mozilla.com is alive
[vle@jump1.community.scl3 ~]$ ssh !$
ssh sea-hp-linux64-2.community.scl3.mozilla.com
The authenticity of host 'sea-hp-linux64-2.community.scl3.mozilla.com (63.245.223.113)' can't be established.
RSA key fingerprint is bc:60:0e:4f:5c:2b:37:df:10:c9:cf:9d:dd:60:39:ac.
Are you sure you want to continue connecting (yes/no)?
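The authenticity prompt shows up again (and the fingerprint differs from the one in comment 1) because the reimage generated a fresh host key. For reference, the colon-separated fingerprint that this vintage of OpenSSH prints is just the MD5 digest of the base64-decoded public-key blob; a sketch (the sample key bytes in the test are made up, and the helper name is mine):

```python
import base64
import hashlib

def md5_fingerprint(pubkey_b64):
    """Legacy OpenSSH-style fingerprint: MD5 of the decoded key blob,
    rendered as colon-separated hex pairs."""
    digest = hashlib.md5(base64.b64decode(pubkey_b64)).hexdigest()
    return ":".join(digest[i:i + 2] for i in range(0, len(digest), 2))

# Feeding this the base64 field of /etc/ssh/ssh_host_rsa_key.pub on the host
# would reproduce a string shaped like bc:60:0e:4f:...
```

Comparing that against the fingerprint ssh prints is the usual way to confirm you are talking to the freshly imaged box and not something spoofing its IP.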
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Resolution: --- → FIXED
Comment 21•11 years ago
Sorry, but this reimage is not a PuppetAgain CentOS 6.2 host; there is no relevant PuppetAgain setup on it (e.g. no puppetize.sh, yum repos incorrectly set up, etc.).
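A "is this actually a PuppetAgain image" sanity check can be scripted as a marker-file scan. The sketch below is mine and the marker paths are assumptions (the comment only says puppetize.sh is missing and the yum repos are wrong, without giving paths):

```python
import os

# Marker paths are illustrative guesses, not the canonical PuppetAgain layout.
EXPECTED_MARKERS = ("root/puppetize.sh", "etc/yum.repos.d")

def missing_markers(root, markers=EXPECTED_MARKERS):
    """Return the expected PuppetAgain markers that are absent under root."""
    return [m for m in markers if not os.path.exists(os.path.join(root, m))]

# missing_markers("/") on a properly imaged host would ideally return [].
```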
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 22•11 years ago
Try now.
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Resolution: --- → FIXED
Comment 23•11 years ago
Verified, this is imaged properly now.
:Van, what did you do differently compared to previous attempts, in case we ever have to do this again?
Status: RESOLVED → VERIFIED
Flags: needinfo?(vle)
Comment 24•11 years ago
The problem I ran into was that I couldn't log into the host: I couldn't boot it into single-user mode (no splash and no GRUB screen, so I'm either missing it or didn't figure it out), I couldn't get the leased IP information to ssh in, and the host never boots up completely.
Observium was giving me the community IP instead of the DHCP IP when I did an ARP lookup, and I don't have a complete understanding of how releng kickstarts their servers, so I wasn't sure which server to log into to grab the DHCP lease.
My lazy/do-more workaround was to image the host with no puppet password, since then it boots up completely; log in and grab the leased DHCP IP. Then it's back to the normal reimage process: reboot, reimage with the wrong puppet password, then ssh in with the leased IP and reconfigure the network.
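The workaround above boils down to: boot a password-less image so the host comes up, read the DHCP lease it took, redo the normal reimage, then ssh to that leased IP. Grabbing the lease on the client side can be sketched like this (assuming a standard ISC dhclient leases file, e.g. /var/lib/dhclient/dhclient.leases; the path and helper name are assumptions):

```python
import re

def latest_leased_ip(leases_text):
    """Return the fixed-address from the last lease entry in an ISC dhclient
    leases file (entries are appended, so the last one is current), or None
    if no lease is recorded."""
    addrs = re.findall(r"fixed-address\s+([0-9.]+);", leases_text)
    return addrs[-1] if addrs else None

# e.g. on the host itself (lease file path varies by distro):
#   with open("/var/lib/dhclient/dhclient.leases") as f:
#       print(latest_leased_ip(f.read()))
```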
Flags: needinfo?(vle)
Updated•10 years ago
Product: mozilla.org → Infrastructure & Operations