Closed
Bug 1058385
Opened 11 years ago
Closed 10 years ago
sea-hp-linux64-* have a HD/RAID firmware mismatch
Categories
(Infrastructure & Operations :: DCOps, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: ewong, Unassigned)
References
Details
(Whiteboard: Case ID is 4649699191/4649842904/4650124724/4650242010)
Attachments
(3 files)
This machine isn't responding to pings or ssh:
[ewong@jump1.community.scl3 tmp]$ ssh -l seabld sea-hp-linux64-9
ssh: connect to host sea-hp-linux64-9 port 22: No route to host
[ewong@jump1.community.scl3 tmp]$ ping sea-hp-linux64-9
PING sea-hp-linux64-9.community.scl3.mozilla.com (63.245.223.120) 56(84) bytes of data.
From jump1.community.scl3.mozilla.com (63.245.223.8) icmp_seq=2 Destination Host Unreachable
From jump1.community.scl3.mozilla.com (63.245.223.8) icmp_seq=3 Destination Host Unreachable
From jump1.community.scl3.mozilla.com (63.245.223.8) icmp_seq=4 Destination Host Unreachable
I hope it's not going the way of its sibling (sea-hp-linux64-2).
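For reference, the whole pool can be checked in one shot with fping (the same pattern van uses later in this bug; the 2..13 range is an assumption covering the hosts named here):
fping sea-hp-linux64-{2..13}.community.scl3.mozilla.com   # prints "<host> is alive" or an error per host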
Comment 1•11 years ago
p.s. we love you DCOps.
Updated•11 years ago
colo-trip: --- → scl3
Updated•11 years ago
Summary: sea-hp-linux64-9 is down → sea-hp-linux64-{4,5,6,7,9,10,11} are down
Comment 2•11 years ago
They aren't happy at all. I guess they're upset that their sibling -2
is still sick.
Updated•11 years ago
Summary: sea-hp-linux64-{4,5,6,7,9,10,11} are down → sea-hp-linux64-{4,5,6,7,9,10} are down
Updated•11 years ago
Summary: sea-hp-linux64-{4,5,6,7,9,10} are down → sea-hp-linux64-{4,5,6,7,9,10,12} are down
Updated•11 years ago
Summary: sea-hp-linux64-{4,5,6,7,9,10,12} are down → sea-hp-linux64-{4,5,6,7,8,9,10,12} are down
Updated•11 years ago
Summary: sea-hp-linux64-{4,5,6,7,8,9,10,12} are down → sea-hp-linux64-{3,4,5,6,7,8,9,10,12} are down
Updated•11 years ago
Summary: sea-hp-linux64-{3,4,5,6,7,8,9,10,12} are down → sea-hp-linux64-{3,4,5,6,7,8,9,10,11,12} are down
Comment 3•11 years ago
One active Linux64 left. -13.
The rest have gone AWOL.
Updated•11 years ago
Summary: sea-hp-linux64-{3,4,5,6,7,8,9,10,11,12} are down → sea-hp-linux64-{3,4,5,6,7,8,9,10,11,12,13} are down
Comment 4•11 years ago
awol_list = [x for x in range(3, 13)]  # hosts 3 through 12
awol_list.append(13)
free_list = []
len(awol_list)  # == 11
len(free_list)  # == 0
And don't forget about sea-hp-linux64-2 (from bug 1050618).
w0ts0n from #it mentioned that a randomly selected host (sea-hp-linux64-9) has "Disk Error". -2 also had a disk error; it's currently being readied for imaging (so not yet deployed).
Comment 5•11 years ago
raid controller/drives complaining of firmware mismatch
Comment 6•11 years ago
raid controller complaining of drive being moved over the holiday.
Comment 7•11 years ago
all these hosts are complaining of a firmware mismatch and that the drive has been moved (they haven't been moved). i have attached both screenshots. since it's the same error for all of these hosts, i don't think it's a hardware issue but rather a software issue. please open a bug with the MOC/SRE team to look further into this issue if needed.
sea-hp-linux64-3.community.scl3.mozilla.com is alive
sea-hp-linux64-4.community.scl3.mozilla.com is alive
sea-hp-linux64-5.community.scl3.mozilla.com is alive
sea-hp-linux64-6.community.scl3.mozilla.com is alive
sea-hp-linux64-7.community.scl3.mozilla.com is alive
sea-hp-linux64-8.community.scl3.mozilla.com is alive
sea-hp-linux64-9.community.scl3.mozilla.com is alive
sea-hp-linux64-10.community.scl3.mozilla.com is alive
sea-hp-linux64-11.community.scl3.mozilla.com is alive
sea-hp-linux64-12.community.scl3.mozilla.com is alive
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Comment 8•11 years ago
Instead of opening a new bug with no history, I'm going to reopen this and move it.
MOC: can we get any assistance/insight into how to diagnose/treat this issue, so the symptom can be avoided?
We can take any of the hosts out of production circulation so you can do some non-destructive prodding. (If you need destructive prodding, we'll need to reimage the complicated way after you're done, similar to what we are doing in bug 1050618, described in c#12.)
Assignee: server-ops-dcops → nobody
Status: RESOLVED → REOPENED
Component: Server Operations: DCOps → Server Operations: MOC
Resolution: FIXED → ---
Summary: sea-hp-linux64-{3,4,5,6,7,8,9,10,11,12,13} are down → sea-hp-linux64-* have a HD/RAID firmware mismatch
Comment 9•11 years ago
sea-hp-linux64-{3,7,9} are back down as of now.... presumably the same issue.
Comment 10•11 years ago
(In reply to Justin Wood (:Callek) from comment #9)
> sea-hp-linux64-{3,7,9} are back down as of now.... presumably the same issue.
+ {11,12}
Comment 11•11 years ago
So, as per IRC, it would be nice to know what updated the drives' firmware(s).
The issue right now is that there's a mismatch between the controller's firmware and the drives'.
The controller's firmware is older than the drives', so the controller, in turn, cannot detect the drives.
We'll need to network-boot these machines into sysrescue, then update the controller's firmware and reboot.
This should hopefully fix the issue; however, it's possible that the RAID information (i.e. logical drives, etc.) might go missing, and we'll need to recreate the logical drives, which may mean reinstalls (technically we should be able to redo the logical drives without wiping the data, but it all depends on the RAID configuration).
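For reference, a hedged sketch of how the controller and logical-drive state could be inspected from sysrescue before recreating anything, assuming HP's hpacucli utility is available there (the slot number is hypothetical; the bug doesn't record the exact commands used):
hpacucli ctrl all show status      # controller status and firmware state
hpacucli ctrl slot=0 ld all show   # logical drives the controller still knows about
hpacucli ctrl slot=0 pd all show   # physical drives the controller can see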
Updated•11 years ago
Assignee: nobody → afernandez
Comment 12•11 years ago
sea-hp-linux64-3.community.scl3.mozilla.com is back online.
For more information: it seems there are NO known (public) firmware updates for the RAID controller in question. So far, for the above host, that controller is: HP Smart Array B110i SATA RAID Controller.
The only firmware updates related to that controller reference the actual drives (which may or may not be the drives these servers currently have).
We will be checking the other servers by the end of the day, but at this time it seems there's no "permanent" solution besides perhaps replacing the drives with "correctly supported" ones; that's for further discussion.
Comment 13•11 years ago
Busy day; however, the affected hosts have been online (for a while):
sea-hp-linux64-3.community.scl3.mozilla.com 17:45:23 up 1:01, 0 users, load average: 0.00, 0.00, 0.00
sea-hp-linux64-7.community.scl3.mozilla.com 17:45:45 up 1:32, 0 users, load average: 0.00, 0.00, 0.00
sea-hp-linux64-9.community.scl3.mozilla.com 17:46:24 up 54 min, 0 users, load average: 0.00, 0.00, 0.00
sea-hp-linux64-11.community.scl3.mozilla.com 17:46:37 up 1:36, 0 users, load average: 0.00, 0.00, 0.00
apparently the following was missed =\ (there was a fire alarm!):
sea-hp-linux64-12.community.scl3.mozilla.com 17:52:42 up 0 min, 0 users, load average: 0.47, 0.13, 0.04
Leaving bug open to "permanently" fix the issue.
Comment 14•10 years ago
Just want to mention that sea-hp-linux64-{5,6,9,10,11,13} are down, most likely due to the same issue.
Comment 15•10 years ago
(In reply to Edmund Wong (:ewong) from comment #14)
> Just want to mention that sea-hp-linux64-{5,6,9,10,11,13} are down, most
> likely due to the same issue.
+ {3, 8}
Comment 16•10 years ago
Current list of AWOL slaves:
[3, 5, 6, 7, 8, 9, 10, 11, 13]
Comment 17•10 years ago
Van, can we recover these again? Currently the *only* host up is -4 (so -3 and {5..13} are down).
Flags: needinfo?(vle)
Comment 18•10 years ago
err -2 is down too...
Comment 19•10 years ago
I'm on PTO until Thursday. I can take a look at these machines then if it's not super urgent. In the future, for hands-on requests, please open a bug and drop it in the DCOps queue so we can track it. You can add it as a dependent of this parent bug if needed.
Flags: needinfo?(vle)
Comment 20•10 years ago
Heya Vinh, per derek you should "be there" (I'm assuming scl3), so if you feel comfortable it would be great if you could recover some/all of these hosts.
If the issue is like before, they're probably booting unable to find their drive, which merely needs a "rebuild the raid0" since they are single-drive hosts. That has sufficed to recover them in the past.
If you are unable/uncomfortable doing this, I understand and we can defer to Thursday for Van, but we are down to 1 up at the moment, with releases scheduled to build tonight.
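For the record, a hedged sketch of what the "rebuild the raid0" step could look like with the same hpacucli utility, if it works against this controller (the slot and drive address are hypothetical; recovery here was done hands-on, and the exact steps aren't recorded in this bug):
hpacucli ctrl slot=0 pd all show                           # find the single drive's address
hpacucli ctrl slot=0 create type=ld drives=1I:1:1 raid=0   # recreate the one-drive RAID0 logical drive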
Flags: needinfo?(vhua)
Comment 21•10 years ago
(In reply to Justin Wood (:Callek) from comment #20)
> scheduled to build tonight.
*scheduled to build this week
Comment 22•10 years ago
sea-hp-linux64-2.community.scl3.mozilla.com is alive
sea-hp-linux64-3.community.scl3.mozilla.com is alive
sea-hp-linux64-5.community.scl3.mozilla.com is alive
sea-hp-linux64-6.community.scl3.mozilla.com is alive
sea-hp-linux64-7.community.scl3.mozilla.com is alive
sea-hp-linux64-8.community.scl3.mozilla.com is alive
sea-hp-linux64-9.community.scl3.mozilla.com is alive
sea-hp-linux64-10.community.scl3.mozilla.com is alive
sea-hp-linux64-11.community.scl3.mozilla.com is alive
sea-hp-linux64-13.community.scl3.mozilla.com is alive
Flags: needinfo?(vhua)
Comment 23•10 years ago
At this moment, the following are not allowing SSH or pings:
sea-hp-linux64-{2,4,7,8,9,10,11,12,13}
OS: Windows Vista → Linux
Comment 24•10 years ago
(In reply to Edmund Wong (:ewong) from comment #23)
> At this moment, the following are not allowing SSH or pings:
>
> sea-hp-linux64-{2,4,7,8,9,10,11,12,13}
Of note, we renewed the -3 warranty to try and catch this issue on a warrantied box and correct it.
-2, however, already had its warranty renewed and was given a different disk drive than these other hosts; if it is hitting this issue as well, it can be used as a substitute to try and figure out the root cause/driver we need.
Flags: needinfo?(vle)
Comment 25•10 years ago
looks like -2 went down with the same errors. i'm not sure when i'll be able to hit up HP as i'm pretty busy this week and next week, but i'll try to fit it in when i have free cycles. i've left the needinfo in place to remind myself.
in the meantime, i've brought the other hosts back online (except -2, so we can work with HP).
Comment 26•10 years ago
sea-hp-linux64-{5,7,8,9,10,11} are back down again.
Comment 27•10 years ago
what username/password can i use to log in to these hosts to troubleshoot/run scripts with HP?
cltbld isn't working.
Flags: needinfo?(vle)
Comment 28•10 years ago
[Monday, October 20, 2014 2:47 PM] -- Venkatesh A says:
Also, Van, I understand that there is no dedicated firmware file for the storage controller to update
[Monday, October 20, 2014 2:47 PM] -- Venkatesh A says:
However, you need to use the latest SPP utility on the server to update the latest server BIOS, which also will update the firmware of the controller.
[Monday, October 20, 2014 2:48 PM] -- Van Le says:
what's the SPP utility?
[Monday, October 20, 2014 2:48 PM] -- Van Le says:
how do i get that utility?
[Monday, October 20, 2014 2:49 PM] -- Venkatesh A says:
I shall provide you the URL to download the same
[Monday, October 20, 2014 2:49 PM] -- Van Le says:
thanks
[Monday, October 20, 2014 2:49 PM] -- Venkatesh A says:
Please find the below URL to download the latest SPP utility :
http://h17007.www1.hp.com/us/en/enterprise/servers/products/service_pack/spp/index.aspx
can a MOC engineer try to update the firmware and let me know if this resolves the issue? it might not if the server BIOS update doesn't contain an update for the embedded controller.
Whiteboard: Case ID is 4649699191
Comment 29•10 years ago
update:
sea-hp-linux64-{3,5,7,8,9,10,11,12,13} are down.
Comment 30•10 years ago
Updating:
Aside from sea-hp-linux64-4, the rest are down;
that is, sea-hp-linux64-{2,3,5,6,7,8,9,10,11,12,13}.
Can something be done to them? We're doing the beta release and
it's worrisome that we're down to one slave.
Thanks!
Flags: needinfo?(afernandez)
Comment 32•10 years ago
Flags: needinfo?(vle)
Comment 33•10 years ago
(In reply to Van Le [:van] from comment #32)
The n-i was so that we can get the ones you're not working on with HP back up.
Flags: needinfo?(vle)
Comment 34•10 years ago
oops, i didn't realize it cleared my n-info when i attached the carepaq for aj. i'm downloading the 4GB firmware now and we'll work on these hosts. i'll get them online for you today if the firmware doesn't resolve the raid issue.
Comment 35•10 years ago
HP_Service_Pack_for_ProLiant_2014.09.0_792934_001_spp_2014.09.0-SPP2014090.2014_0827.10.iso uploaded into /tmp of admin1a.
please update the firmware and let me know if you still need me to fix the others.
Comment 36•10 years ago
van, our issue is that we have just 1 available right now, so we need them all brought back up ~now since we have a beta release we're trying to finish (and possibly a chemspill ala Firefox 33.0.1).
We want to install that firmware on at least 1 or 2 and see if that fixes it, but *we* don't have access to admin1a to do so, nor do *we* have the ability to recover if something goes wrong with it.
Comment 37•10 years ago
> van, our issue is that we have just 1 available right now, so we need them all brought back up ~now since we have a beta release we're trying to finish (and possibly a chemspill ala Firefox 33.0.1)
vans-MacBook-Pro:~ vle$ fping sea-hp-linux64-{2..13}.community.scl3.mozilla.com
sea-hp-linux64-2.community.scl3.mozilla.com is alive
sea-hp-linux64-3.community.scl3.mozilla.com is alive
sea-hp-linux64-4.community.scl3.mozilla.com is alive
sea-hp-linux64-5.community.scl3.mozilla.com is alive
sea-hp-linux64-6.community.scl3.mozilla.com is alive
sea-hp-linux64-7.community.scl3.mozilla.com is alive
sea-hp-linux64-8.community.scl3.mozilla.com is alive
sea-hp-linux64-9.community.scl3.mozilla.com is alive
sea-hp-linux64-10.community.scl3.mozilla.com is alive
sea-hp-linux64-11.community.scl3.mozilla.com is alive
sea-hp-linux64-12.community.scl3.mozilla.com is alive
sea-hp-linux64-13.community.scl3.mozilla.com is alive
Flags: needinfo?(vle)
Comment 38•10 years ago
Attempted to reach someone on IRC before working on this, but couldn't.
Rebooted sea-hp-linux64-2.community.scl3.mozilla.com and am currently checking to see if any firmware(s) get updated.
Flags: needinfo?(afernandez)
Comment 39•10 years ago
sea-hp-linux64-2.community.scl3.mozilla.com back online.
The SPP applied updates; however, the automatic (hands-off) process is NOT verbose at all, so I don't have a list of what was actually updated or from what version to what version. A check of the IML log shows that the BIOS was updated to version 07/01/2013.
Let's have the server run for a few days (or just over the weekend), and if no issues arise, we can then apply the updates on the rest of the servers.
On a different note, it seems that when these servers boot, they do not reach the login prompt (or at least this server doesn't).
They get stuck in the init process under:
"Starting runner"
Comment 40•10 years ago
Van:
Linux64[2: NC, 3: B, 4: NC, 5: NC, 6: B, 7: B, 8: B, 9: B, 10: B, 11: B, 12: B, 13: NC],
NC == "not connected to buildbot"
B == "working fine"
Of note is 2, which is sea-hp-linux64-2, and is currently not pinging.
Can you peek to see if this is still the same firmware issue? Of note, this is the host HP already replaced a drive for, and it is one of the two hosts we have renewed the warranty on (the other being -3, which we recently bought the extra warranty for).
Flags: needinfo?(vle)
Comment 42•10 years ago
can check for you on thursday or friday, i'm in phx at the moment.
Comment 43•10 years ago
So, as per comment 39, which states that "sea-hp-linux64-2.community.scl3.mozilla.com" was upgraded with the SPP ISO: it's experiencing the same issue as before.
Comment 44•10 years ago
the update didn't update the firmware on the raid controller, as it is still on v1.38.
Flags: needinfo?(vle)
Comment 45•10 years ago
i'm not sure if HP has another update. i'll check with them when i have some free cycles next week.
>Linux64[2: NC, 3: B, 4: NC, 5: NC, 6: B, 7: B, 8: B, 9: B, 10: B, 11: B, 12: B, 13: NC],
i brought up 2, 3, and 13.
Flags: needinfo?(vle)
Updated•10 years ago
Flags: needinfo?(vle)
Comment 47•10 years ago
sea-hp-linux64-{5,6,7,8,9,10} are back down.
Comment 48•10 years ago
hp came on site and i had them take a look at this issue. the SPP software suite doesnt update the firmware of the controller, it updates the iLO firmware and motherboard BIOS. the tech informed me that the 100 series of servers are their low end servers and the onboard embedded controller is outsourced so they dont develop the drivers for them.
however, the tech had a USB that updated the firmware of the hard drive (hp g0 to hp g9) so i asked her to test it on -5.
i'm not sure if this will resolve the issue since we RMA'd a drive in -2 but if it does, we can open a ticket for them to come back and update the firmware for all the drives.
i'll take a look at the remaining down hosts sometime in the next couple of days.
Comment 49•10 years ago
:van, no luck it seems:
[Callek@jump1.community.scl3 ~]$ ssh root@sea-hp-linux64-5
ssh: connect to host sea-hp-linux64-5 port 22: No route to host
NOTE: at this moment *only* -3 is up, and all the rest are down. (I'm *assuming* the -5 above was not a typo and was meant to be -3)
Comment 50•10 years ago
not a typo, it was -5 and its drive that was updated with the latest firmware. same issue though, so i guess i'll contact HP again to see if they have anything else for us to try, but per :aj in c#12, there's no known firmware for the individual embedded raid controller.
in the meantime, i've brought back the other hosts.
Comment 51•10 years ago
they're asking us to downgrade the BIOS one version to see if that resolves the issue, as they do not have any firmware for this particular RAID controller. i'll work on -2 when i get a chance.
[Wednesday, November 05, 2014 10:43 AM] -- Arijeet S says:
We need to downgrade the BIOS to one level and then see if the error persist
[Wednesday, November 05, 2014 10:44 AM] -- Van Le says:
ok do you have a link for the bios?
[Wednesday, November 05, 2014 10:45 AM] -- Arijeet S says:
As I see the BIOS also is still not the latest
[Wednesday, November 05, 2014 10:46 AM] -- Arijeet S says:
I apologize it is the latest
[Wednesday, November 05, 2014 10:46 AM] -- Arijeet S says:
The previous version is 2012.12.04 (4 Jan 2013)
[Wednesday, November 05, 2014 10:47 AM] -- Arijeet S says:
* RECOMMENDED * Online ROM Flash Component for Windows - HP ProLiant ML110 G7/DL120 G7 (J01) Servers
http://h20565.www2.hp.com/portal/site/hpsc/template.PAGE/public/psi/swdDetails/?sp4ts.oid=5075937&spf_p.tpst=swdMain&spf_p.prp_swdMain=wsrp-navigationalState%3Didx%253D2%257CswItem%253DMTX_8fb5c65ba0a7403ca1e54638e6%257CswEnvOID%253D4064%257CitemLocale%253D%257CswLang%253D%257Cmode%253D4%257Caction%253DdriverDocument&javax.portlet.begCacheTok=com.vignette.cachetoken&javax.portlet.endCacheTok=com.vignette.cachetoken
Flags: needinfo?(vle)
Whiteboard: Case ID is 4649699191 → Case ID is 4649699191/4649842904
Updated•10 years ago
Flags: needinfo?(vle)
Comment 52•10 years ago
The link in the previous comment was for Windows.
* RECOMMENDED * Online ROM Flash Component for Linux - HP ProLiant ML110 G7/DL120 G7 (J01) Servers
Click: Obtain software
Type: BIOS (Entitlement Required) - System ROM
Version: 2012.12.04 (4 Jan 2013)
http://h20565.www2.hp.com/portal/site/hpsc/template.PAGE/public/psi/swdDetails/?sp4ts.oid=5075937&spf_p.tpst=swdMain&spf_p.prp_swdMain=wsrp-navigationalState%3Didx%253D2%257CswItem%253DMTX_fe867643b9b5488681e4e70cf6%257CswEnvOID%253D4103%257CitemLocale%253D%257CswLang%253D%257Cmode%253D4%257Caction%253DdriverDocument&javax.portlet.begCacheTok=com.vignette.cachetoken&javax.portlet.endCacheTok=com.vignette.cachetoken
Comment 53•10 years ago
i brought -6 up with the downgraded ROM Version. please let me know when it goes down again.
sea-hp-linux64-6.community.scl3.mozilla.com is alive
System ROM J01 12/04/2012
Comment 54•10 years ago
5 & 6 are up.
sea-hp-linux64-{2,3,4,7,8,9,10,11,12} are down.
Comment 55•10 years ago
brought the hosts back online and upgraded -7 to System ROM J01 12/04/2012. these take a while to upgrade so if -6 and -7 stay stable, we'll upgrade the rest.
Flags: needinfo?(vle)
Updated•10 years ago
Flags: needinfo?(vle)
Comment 56•10 years ago
All the HP systems are down, and I got a bunch of emails with "Repeated Puppet Failures on sea-hp-linux64-{5,6}.community.scl3.mozilla.com"
with the following in the message body:
IP: 63.245.223.116
Comment 57•10 years ago
(In reply to Edmund Wong (:ewong) from comment #56)
> All the HP systems are down and a I got a bunch of emails with "Repeated
> Puppet Failures on sea-hp-linux64-{5,6}.community.scl3.mozilla.com.
Currently thinking the 5,6 issue is a separate bug/issue. Will follow up here if not.
Comment 58•10 years ago
(In reply to Justin Wood (:Callek) from comment #57)
> (In reply to Edmund Wong (:ewong) from comment #56)
> > All the HP systems are down and a I got a bunch of emails with "Repeated
> > Puppet Failures on sea-hp-linux64-{5,6}.community.scl3.mozilla.com.
>
> Currently thinking the 5,6 issue is a seperate bug/issue. Will followup here
> if not.
Ok... it was a puppet change. Anyway, Callek fixed that, and now both -5 and -6 are up. The others are down.
Comment 59•10 years ago
All hosts are up.
sea-hp-linux64-2.community.scl3.mozilla.com is alive
sea-hp-linux64-3.community.scl3.mozilla.com is alive
sea-hp-linux64-4.community.scl3.mozilla.com is alive
sea-hp-linux64-5.community.scl3.mozilla.com is alive
sea-hp-linux64-6.community.scl3.mozilla.com is alive
sea-hp-linux64-7.community.scl3.mozilla.com is alive
sea-hp-linux64-8.community.scl3.mozilla.com is alive
sea-hp-linux64-9.community.scl3.mozilla.com is alive
sea-hp-linux64-10.community.scl3.mozilla.com is alive
sea-hp-linux64-11.community.scl3.mozilla.com is alive
sea-hp-linux64-12.community.scl3.mozilla.com is alive
sea-hp-linux64-13.community.scl3.mozilla.com is alive
Updated•10 years ago
Assignee: afernandez → nobody
Comment 60•10 years ago
oh fun. All except -6 are down.
Comment 61•10 years ago
Hosts are back online.
sea-hp-linux64-2.community.scl3.mozilla.com is alive
sea-hp-linux64-3.community.scl3.mozilla.com is alive
sea-hp-linux64-4.community.scl3.mozilla.com is alive
sea-hp-linux64-5.community.scl3.mozilla.com is alive
sea-hp-linux64-6.community.scl3.mozilla.com is alive
sea-hp-linux64-7.community.scl3.mozilla.com is alive
sea-hp-linux64-8.community.scl3.mozilla.com is alive
sea-hp-linux64-9.community.scl3.mozilla.com is alive
sea-hp-linux64-10.community.scl3.mozilla.com is alive
sea-hp-linux64-11.community.scl3.mozilla.com is alive
sea-hp-linux64-12.community.scl3.mozilla.com is alive
sea-hp-linux64-13.community.scl3.mozilla.com is alive
Comment 63•10 years ago
we'll just grab this bug until we get a chance to upgrade all the firmware on the hosts.
Assignee: nobody → server-ops-dcops
Component: Server Operations: MOC → Server Operations: DCOps
Updated•10 years ago
Product: mozilla.org → Infrastructure & Operations
Comment 64•10 years ago
sea-hp-linux64-{5,6,7,8,9,12,13} are all down,
while 3, 4 and 11 aren't down... they just aren't registering properly with the master
(a different issue).
Comment 65•10 years ago
i'll need to contact HP again, as i've upgraded a bunch of hosts and they're all still failing.
in the meantime, i've brought the down sea-hp-linux hosts back up.
vans-MacBook-Pro:~ vle$ fping sea-hp-linux64-{5,6,7,8,9,12,13}.community.scl3.mozilla.com
sea-hp-linux64-5.community.scl3.mozilla.com is alive
sea-hp-linux64-6.community.scl3.mozilla.com is alive
sea-hp-linux64-7.community.scl3.mozilla.com is alive
sea-hp-linux64-8.community.scl3.mozilla.com is alive
sea-hp-linux64-9.community.scl3.mozilla.com is alive
sea-hp-linux64-12.community.scl3.mozilla.com is alive
sea-hp-linux64-13.community.scl3.mozilla.com is alive
Comment 66•10 years ago
HP still can't find the root cause of this issue. They've asked that I download the SmartCD ISO, run their diagnostics, and escalate my findings to their tier-2 team. Will do this when -2 and -3 go down (as these are the hosts under warranty) or when I have some extra cycles.
[Monday, December 08, 2014 12:08 PM] -- Van Le says:
any update?
[Monday, December 08, 2014 12:09 PM] -- Avinash B says:
Yes, at this point we will try to run offline diagnostics on maybe any 2 servers that have had this issue before
[Monday, December 08, 2014 12:10 PM] -- Avinash B says:
Once that is done we think it will be better if we elevate this case to the level two team.
[Monday, December 08, 2014 12:10 PM] -- Avinash B says:
So you would need to use the smart start CD for the server, I am not sure if you have it.
[Monday, December 08, 2014 12:11 PM] -- Avinash B says:
This is the link to get the iso.
http://h20564.www2.hp.com/hpsc/swd/public/detail?sp4ts.oid=5075938&swItemId=MTX_fa4e107ffbbd4ef394e57dd739&swEnvOid=4064#tab-history
[Monday, December 08, 2014 12:12 PM] -- Avinash B says:
The SPP does not give option to run diagnostics.
[Monday, December 08, 2014 12:12 PM] -- Van Le says:
ok
[Monday, December 08, 2014 12:13 PM] -- Van Le says:
does it print out a report or what do i need to do to get you the information?
[Monday, December 08, 2014 12:13 PM] -- Avinash B says:
Once you boot from the iso, you will see a few options,
like Install, Maintenance, Reboot.
Select Maintenance
[Monday, December 08, 2014 12:14 PM] -- Avinash B says:
You will have to run the insight diagnostics.
[Monday, December 08, 2014 12:14 PM] -- Avinash B says:
1. Select Survey logs.
2. Change View level to Advance.
3. Change category level to all.
4. Save the generated report.
[Monday, December 08, 2014 12:14 PM] -- Avinash B says:
it will be an HTML file, a few hundred KB in size.
[Monday, December 08, 2014 12:15 PM] -- Avinash B says:
The other one will be to run the Diagnostic that will scan all the hardware components.
[Monday, December 08, 2014 12:15 PM] -- Avinash B says:
Save both the reports
Whiteboard: Case ID is 4649699191/4649842904 → Case ID is 4649699191/4649842904/4650124724
Comment 67•10 years ago
Currently sea-hp-linux64-{3,4,5,6,8,9,10,11,12,13} are down. So I guess -3
can be used to run the tests.
Comment 68•10 years ago
(In reply to Edmund Wong (:ewong) from comment #67)
> Currently sea-hp-linux64-{3,4,5,6,8,9,10,11,12,13} are down. So I guess -3
> can be used to run the tests.
err was wrong. -3 is back up.
Comment 69•10 years ago
> Currently sea-hp-linux64-{3,4,5,6,8,9,10,11,12,13} are down. So I guess -3
> can be used to run the tests.
>err was wrong. -3 is back up.
-2 and -3 are the ones on warranty; when those go down, i'll run the tests for HP. in the meantime, i've brought back the other hosts.
sea-hp-linux64-2.community.scl3.mozilla.com is alive
sea-hp-linux64-3.community.scl3.mozilla.com is alive
sea-hp-linux64-4.community.scl3.mozilla.com is alive
sea-hp-linux64-5.community.scl3.mozilla.com is alive
sea-hp-linux64-6.community.scl3.mozilla.com is alive
sea-hp-linux64-7.community.scl3.mozilla.com is alive
sea-hp-linux64-8.community.scl3.mozilla.com is alive
sea-hp-linux64-9.community.scl3.mozilla.com is alive
sea-hp-linux64-10.community.scl3.mozilla.com is alive
sea-hp-linux64-11.community.scl3.mozilla.com is alive
sea-hp-linux64-12.community.scl3.mozilla.com is alive
sea-hp-linux64-13.community.scl3.mozilla.com is alive
Comment 70•10 years ago
(In reply to Van Le [:van] from comment #69)
> > Currently sea-hp-linux64-{3,4,5,6,8,9,10,11,12,13} are down. So I guess -3
> > can be used to run the tests.
>
> >err was wrong. -3 is back up.
>
> -2 and -3 are the ones on warranty, when those go down ill run the tests for
> HP. in the meantime ive brought back the other hosts.
Currently only -2 and -5 are up. The rest are down.
Van, I guess you can take -3 and do what needs to be done. :)
Comment 71•10 years ago
i ran the advanced diagnostics and provided HP with the report. will update the bug as soon as i get more information. in the meantime, i've brought your hosts back online.
[Monday, December 22, 2014 11:35 AM] -- Venkatesh A says:
Please allow me some time while I update the case notes, so that we can elevate the case to the next level of engineering
[Monday, December 22, 2014 11:36 AM] -- Venkatesh A says:
Yes, one of our engineers would contact you on the provided number or email, preferably Van
[Monday, December 22, 2014 11:36 AM] -- Venkatesh A says:
No we shall close this chat session as of now
vans-MacBook-Pro:~ vle$ fping sea-hp-linux64-{2..13}.community.scl3.mozilla.com
sea-hp-linux64-2.community.scl3.mozilla.com is alive
sea-hp-linux64-3.community.scl3.mozilla.com is alive
sea-hp-linux64-4.community.scl3.mozilla.com is alive
sea-hp-linux64-5.community.scl3.mozilla.com is alive
sea-hp-linux64-6.community.scl3.mozilla.com is alive
sea-hp-linux64-7.community.scl3.mozilla.com is alive
sea-hp-linux64-8.community.scl3.mozilla.com is alive
sea-hp-linux64-9.community.scl3.mozilla.com is alive
sea-hp-linux64-10.community.scl3.mozilla.com is alive
sea-hp-linux64-11.community.scl3.mozilla.com is alive
sea-hp-linux64-12.community.scl3.mozilla.com is alive
sea-hp-linux64-13.community.scl3.mozilla.com is alive
Whiteboard: Case ID is 4649699191/4649842904/4650124724 → Case ID is 4649699191/4649842904/4650124724/4650242010
Comment 72•10 years ago
HP would like us to try and use their driver for the RAID controller and if it crashes, generate the CFG2HTML for them.
The B110i controller driver can be found here.
http://h20564.www2.hp.com/hpsc/swd/public/detail?sp4ts.oid=3958195&swItemId=MTX_c9800386db8146efb747deee41&swEnvOid=4103#tab3
This is the file that you will need to use
File name: kmod-hpahcisr-1.2.6-18.rhel6u2.x86_64.rpm (136 KB)
NOTE: These drivers are for RHEL 6 but will work with CentOS 6 as well.
Generate the CFG2HTML:
For Linux, type the following at your console:
1. Download the script via FTP:
- ftp ftp.usa.hp.com
- at the user prompt enter: iss
- at the password prompt enter: tools4AL
- don't set the connection type to binary, as otherwise the script might not work.
- then type: get cfg2html-linux124HP (the transfer will start, as you will see)
- type: quit to leave FTP
Note: If the script is opened in any Windows file editor, exit the program without saving the file.
2. Make the script executable on your Linux server: chmod +x cfg2html-linux124cHP
3. Run the script by typing ./cfg2html-linux124cHP
4. All output is stored together in the file {hostname}.tar (as stated during execution of the script).
5. Collect the resultant {hostname}.tar file
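Condensed, the same steps as a hedged shell session (commands and script names reproduced from HP's instructions above, including their two spellings of the script name):
$ ftp ftp.usa.hp.com        # user: iss, password: tools4AL; leave the transfer type as-is (not binary)
ftp> get cfg2html-linux124HP
ftp> quit
$ chmod +x cfg2html-linux124cHP
$ ./cfg2html-linux124cHP    # collects all output into {hostname}.tar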
Comment 73•10 years ago
spoke to linda regarding this issue. since aj is no longer available, can an experienced tier 2 poke at this?
Assignee: server-ops-dcops → nobody
Component: DCOps → MOC: Problems
Comment 74•10 years ago
I hit:
> [root@sea-hp-linux64-2.community.scl3.mozilla.com ~]# rpm -Uvh kmod-hpahcisr-1.2.6-18.rhel6u2.x86_64.rpm
> Preparing... ########################################### [100%]
> #######################################################
> # Hpahcisr is currently not controlling any storage. #
> # Loading this driver could displace the current #
> # storage driver causing filesystem corruption. #
> # Exiting! #
> #######################################################
> error: %pre(kmod-hpahcisr-1.2.6-18.rhel6u2.x86_64) scriptlet failed, exit status 1
> error: install: %pre scriptlet failed (2), skipping kmod-hpahcisr-1.2.6-18.rhel6u2
This same problem was encountered in Bug 779487 and all the troubleshooting and diagnosing went nowhere. Eventually the machines were switched to AHCI. Is that a viable option in this case as well?
Comment 75•10 years ago
:ashish, thanks for the update. i've updated my case with HP.
Comment 76•10 years ago
\o/ Let's try this.
I'm sad we missed that, because van, dustin, arr, and I are all over that bug, these are the exact same hosts, and that bug hit almost the exact same issue.
Does making that change require a reimage or lose any data?
Comment 77•10 years ago
arr changed the settings and reimaged in bug 779487. from my experience, we've always had to reimage after changing from raid to ahci, as remapping the bits isn't worth it and could still cause issues later on. i would defer to our sres though; perhaps there's a good way to do it.
Comment 78•10 years ago
If a reimage is needed, that is fine; it just needs more coordination and more hands-on work. (Since they can't be reimaged in place the way we need, I'll need to input a puppet deploypass myself once they are back in their proper home.)
Comment 79•10 years ago
Reading through the old bug I'm pretty sure a reimage is needed. I personally don't have prior experience dealing with this VLAN, so any input would be valuable!
Comment 80•10 years ago
:ashish (and :van)
Prior setup work from the initial imaging was done in https://bugzilla.mozilla.org/show_bug.cgi?id=740633#c7 (and on).
n-i to both of you so we can get a rough ETA, and then I can coordinate better once you two coordinate on your end, so we can get rid of this issue once and for all.
Flags: needinfo?(vle)
Flags: needinfo?(ashish)
Comment 81•10 years ago
i don't think ashish needs to be involved anymore, as you and i can coordinate and get this working, since it requires physically moving some cables. i'll work on a host monday and try to get the process down (+document). i remember there was a hiccup because the host doesn't boot all the way, so i don't get a login prompt (the procedure requires you to input an incorrect puppet pw).
Assignee: nobody → server-ops-dcops
Component: MOC: Problems → DCOps
Flags: needinfo?(ashish)
Comment 82•10 years ago
pleasant surprise: it seems changing the controller mode to AHCI doesn't require a reimage. can you take a look at -7, -8, -9, and -10?
Flags: needinfo?(bugspam.Callek)
Comment 83•10 years ago
So, -7, -8, -9, -10 are all connected to buildbot right now, and at least -7 has had successful jobs; I'm going to call this a success.
Once the others fail, or enough time passes, we'll do them:
ToDo:
-2, -3, -4, -5, -6, -11, -12, -13
Flags: needinfo?(bugspam.Callek)
Comment 84•10 years ago
i noticed 5 and 11 were down. i went ahead and did a graceful shutdown on the others and fixed them all.
Status: REOPENED → RESOLVED
Closed: 11 years ago → 10 years ago
Flags: needinfo?(vle)
Resolution: --- → FIXED