Closed Bug 1058385 Opened 6 years ago Closed 5 years ago

sea-hp-linux64-* have a HD/RAID firmware mismatch

Categories

(Infrastructure & Operations :: DCOps, task)

x86
Linux
task
Not set

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: ewong, Unassigned)

References

Details

(Whiteboard: Case ID is 4649699191/4649842904/4650124724/4650242010)

Attachments

(3 files)

This machine isn't responding to pings nor ssh

[ewong@jump1.community.scl3 tmp]$ ssh -l seabld sea-hp-linux64-9
ssh: connect to host sea-hp-linux64-9 port 22: No route to host
[ewong@jump1.community.scl3 tmp]$ ping sea-hp-linux64-9
PING sea-hp-linux64-9.community.scl3.mozilla.com (63.245.223.120) 56(84) bytes of data.
From jump1.community.scl3.mozilla.com (63.245.223.8) icmp_seq=2 Destination Host Unreachable
From jump1.community.scl3.mozilla.com (63.245.223.8) icmp_seq=3 Destination Host Unreachable
From jump1.community.scl3.mozilla.com (63.245.223.8) icmp_seq=4 Destination Host Unreachable

I hope it's not going the way of its sibling (sea-hp-linux64-2).
p.s. we love you DCOps.
colo-trip: --- → scl3
Summary: sea-hp-linux64-9 is down → sea-hp-linux64-{4,5,6,7,9,10,11} are down
They aren't happy at all.  I guess they're upset that their sibling -2
is still sick.
Summary: sea-hp-linux64-{4,5,6,7,9,10,11} are down → sea-hp-linux64-{4,5,6,7,9,10} are down
Summary: sea-hp-linux64-{4,5,6,7,9,10} are down → sea-hp-linux64-{4,5,6,7,9,10,12} are down
Summary: sea-hp-linux64-{4,5,6,7,9,10,12} are down → sea-hp-linux64-{4,5,6,7,8,9,10,12} are down
Summary: sea-hp-linux64-{4,5,6,7,8,9,10,12} are down → sea-hp-linux64-{3,4,5,6,7,8,9,10,12} are down
Summary: sea-hp-linux64-{3,4,5,6,7,8,9,10,12} are down → sea-hp-linux64-{3,4,5,6,7,8,9,10,11,12} are down
One active Linux64 left.  -13.

The rest have gone AWOL.
Summary: sea-hp-linux64-{3,4,5,6,7,8,9,10,11,12} are down → sea-hp-linux64-{3,4,5,6,7,8,9,10,11,12,13} are down
awol_list = list(range(3, 13))  # hosts -3 through -12
awol_list.append(13)
free_list = []  # nothing left in the pool

assert len(awol_list) == 11
assert len(free_list) == 0

And don't forget about sea-hp-linux64-2 (from bug 1050618).

w0ts0n from #it mentioned that a randomly selected host (sea-hp-linux64-9)
has "Disk Error".  -2 also had disk error.  It's currently being readied for
imaging (so not yet deployed).
Attached image firmware mismatch.png
raid controller/drives complaining of firmware mismatch
raid controller complaining of drive being moved over the holiday.
All of these hosts are complaining of a firmware mismatch and of the drive having been moved (the drives haven't been moved). I have attached both screenshots. Since it's the same error on all of these hosts, I don't think it's a hardware issue but rather a software issue. Please open a bug with the MOC/SRE team to look further into this issue if needed.

sea-hp-linux64-3.community.scl3.mozilla.com is alive
sea-hp-linux64-4.community.scl3.mozilla.com is alive
sea-hp-linux64-5.community.scl3.mozilla.com is alive
sea-hp-linux64-6.community.scl3.mozilla.com is alive
sea-hp-linux64-7.community.scl3.mozilla.com is alive
sea-hp-linux64-8.community.scl3.mozilla.com is alive
sea-hp-linux64-9.community.scl3.mozilla.com is alive
sea-hp-linux64-10.community.scl3.mozilla.com is alive
sea-hp-linux64-11.community.scl3.mozilla.com is alive
sea-hp-linux64-12.community.scl3.mozilla.com is alive
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Instead of opening a new bug with no history, I'm going to reopen this and move it.

MOC: can we get any assistance/insight into how to diagnose/treat this issue, so the symptom can be avoided?

We can take any of the hosts out of production circulation so you can do some non-destructive prodding. (If you need destructive prodding, we'll need to reimage the complicated way after you're done, similar to what we are doing in Bug 1050618, described in c#12.)
Assignee: server-ops-dcops → nobody
Status: RESOLVED → REOPENED
Component: Server Operations: DCOps → Server Operations: MOC
Resolution: FIXED → ---
Summary: sea-hp-linux64-{3,4,5,6,7,8,9,10,11,12,13} are down → sea-hp-linux64-* have a HD/RAID firmware mismatch
sea-hp-linux64-{3,7,9} are back down as of now.... presumably the same issue.
(In reply to Justin Wood (:Callek) from comment #9)
> sea-hp-linux64-{3,7,9} are back down as of now.... presumably the same issue.

+ {11,12}
So as per IRC, it would be nice to know what updated the drives' firmware.
The issue right now is a mismatch between the controller's firmware and the drives'.
The controller's firmware is older than the drives', so the controller cannot detect the drives.

We'll need to boot these machines from the network into sysrescue, then update the controller's firmware and reboot.
This should hopefully fix the issue. However, it's possible that the RAID information (i.e. logical drives etc.) might go missing and we'll need to recreate the logical drives, which may mean reinstalls (technically we should be able to redo the logical drives without wiping the data, but it all depends on the RAID configuration).
Assignee: nobody → afernandez
sea-hp-linux64-3.community.scl3.mozilla.com is back online.

For more information, seems there are NO known (public) firmware updates for the RAID controller in question. So far, for the above host, that controller is: HP Smart Array B110i SATA RAID Controller

The only firmware packages related to that controller reference the actual drives (which may or may not be the drives these servers currently have).

We will be checking the other servers by the end of the day, but at this time it seems there's no "permanent" solution besides perhaps replacing the drives with "correctly supported" ones, and that's for further discussion.
Busy day, however, the affected hosts have been online (for a while);
sea-hp-linux64-3.community.scl3.mozilla.com 17:45:23 up  1:01,  0 users,  load average: 0.00, 0.00, 0.00
sea-hp-linux64-7.community.scl3.mozilla.com 17:45:45 up  1:32,  0 users,  load average: 0.00, 0.00, 0.00
sea-hp-linux64-9.community.scl3.mozilla.com  17:46:24 up 54 min,  0 users,  load average: 0.00, 0.00, 0.00
sea-hp-linux64-11.community.scl3.mozilla.com  17:46:37 up  1:36,  0 users,  load average: 0.00, 0.00, 0.00
Apparently the following was missed =\ (there was a fire alarm!):
sea-hp-linux64-12.community.scl3.mozilla.com  17:52:42 up 0 min,  0 users,  load average: 0.47, 0.13, 0.04
Leaving bug open to "permanently" fix the issue.
Just want to mention that sea-hp-linux64-{5,6,9,10,11,13} are down, most likely due to the same issue.
(In reply to Edmund Wong (:ewong) from comment #14)
> Just want to mention that sea-hp-linux64-{5,6,9,10,11,13} are down, most
> likely due to the same issue.

+ {3, 8}
Current list of AWOL slaves:

[3, 5, 6, 7, 8, 9, 10, 11, 13]
Van, can we recover these again? Currently the *only* host up is -4 (so 3 and {5..13} are down).
Flags: needinfo?(vle)
err -2 is down too...
Depends on: 1078900
I'm on PTO until Thursday. I can take a look at these machines then if it's not super urgent. In the future, for hands on requests, please open a bug and drop it in the DCOPs queue so we can track it. You can add it as a dependent of this parent bug if needed.
Flags: needinfo?(vle)
Heya Vinh, per derek you should "be there" (I'm assuming scl3) so if you feel comfortable would be great if you can recover some/all of these hosts.

If the issue is like before, they probably are booting unable to find their drive, which is merely a "rebuild the raid0" since they are single drive hosts. That has sufficed to recover in the past.

If you are unable/uncomfortable doing this, I understand and we can defer to Thursday for Van, but we are down to 1 up at the moment, with releases scheduled to build tonight.
Flags: needinfo?(vhua)
(In reply to Justin Wood (:Callek) from comment #20)
> scheduled to build tonight.

*scheduled to build this week
sea-hp-linux64-2.community.scl3.mozilla.com is alive
sea-hp-linux64-3.community.scl3.mozilla.com is alive
sea-hp-linux64-5.community.scl3.mozilla.com is alive
sea-hp-linux64-6.community.scl3.mozilla.com is alive
sea-hp-linux64-7.community.scl3.mozilla.com is alive
sea-hp-linux64-8.community.scl3.mozilla.com is alive
sea-hp-linux64-9.community.scl3.mozilla.com is alive
sea-hp-linux64-10.community.scl3.mozilla.com is alive
sea-hp-linux64-11.community.scl3.mozilla.com is alive
sea-hp-linux64-13.community.scl3.mozilla.com is alive
Flags: needinfo?(vhua)
At this moment, the following are not allowing SSH or pings:

sea-hp-linux64-{2,4,7,8,9,10,11,12,13}
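
For anyone scripting against these lists: the {...} shorthand used throughout this bug can be expanded into full host names mechanically. Below is a minimal Python sketch. expand_hosts is a hypothetical helper (not an existing tool), and the default domain is an assumption taken from the fping output elsewhere in this bug. It handles both the comma form and the {2..13} range form.

```python
import re
from typing import List

# Assumption: the domain matches the one shown in the fping output in this bug.
DOMAIN = "community.scl3.mozilla.com"


def expand_hosts(pattern: str, domain: str = DOMAIN) -> List[str]:
    """Expand 'prefix-{2,4,7}' or 'prefix-{2..13}' into fully qualified names."""
    match = re.fullmatch(r"(.*)\{([^{}]*)\}(.*)", pattern)
    if match is None:
        # No braces: treat the pattern as a single host name.
        return [f"{pattern}.{domain}"]
    prefix, body, suffix = match.groups()
    numbers: List[int] = []
    for part in body.split(","):
        part = part.strip()
        if ".." in part:
            # Range form, inclusive on both ends (as in shell brace expansion).
            start, end = part.split("..")
            numbers.extend(range(int(start), int(end) + 1))
        else:
            numbers.append(int(part))
    return [f"{prefix}{n}{suffix}.{domain}" for n in numbers]


if __name__ == "__main__":
    for host in expand_hosts("sea-hp-linux64-{2,4,7,8,9,10,11,12,13}"):
        print(host)
```

The resulting names can be fed straight to fping or ssh without retyping the list.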
OS: Windows Vista → Linux
(In reply to Edmund Wong (:ewong) from comment #23)
> At this moment, the following are not allowing SSH or pings:
> 
> sea-hp-linux64-{2,4,7,8,9,10,11,12,13}

Of note, we renewed the -3 warranty to try and catch this issue on a warrantied box and correct it.

-2 however was already warranty renewed and given a different disk drive than these other hosts, and if it is hitting this issue as well can be used as a substitute to try and figure out the root cause/driver we need.
Flags: needinfo?(vle)
looks like -2 went down with same errors. i'm not sure when i'll be able to hit up HP as I'm pretty busy this week and next week  but i'll try to fit it in when I have free cycles. i've left the needinfo in check to remind myself.

in the mean time, ive brought the other hosts back online (except -2 so we can work with HP).
sea-hp-linux64-{5,7,8,9,10,11} are back down again.
what username/password can i use to log in to these hosts to troubleshoot/run scripts with HP?

cltbld isnt working.
Flags: needinfo?(vle)
[Monday, October 20, 2014 2:47 PM] -- Venkatesh A says:
Also, Van, I understand that there is no dedicated firmware file for the storage controller to update
[Monday, October 20, 2014 2:47 PM] -- Venkatesh A says:
However, you need to use the latest SPP utility on the server to update the latest server BIOS, which will also update the firmware of the controller.
[Monday, October 20, 2014 2:48 PM] -- Van Le says:
what's the SPP utility?
[Monday, October 20, 2014 2:48 PM] -- Van Le says:
how do i get that utility?
[Monday, October 20, 2014 2:49 PM] -- Venkatesh A says:
I shall provide you the URL to download the same
[Monday, October 20, 2014 2:49 PM] -- Van Le says:
thanks
[Monday, October 20, 2014 2:49 PM] -- Venkatesh A says:
Please find the below URL to download the latest SPP utility :
http://h17007.www1.hp.com/us/en/enterprise/servers/products/service_pack/spp/index.aspx


can a MOC engineer try to update the firmware and let me know if this resolves the issue? it might not if the server BIOS update doesn't contain an update for the embedded controller.
Whiteboard: Case ID is 4649699191
update:

sea-hp-linux64-{3,5,7,8,9,10,11,12,13} are down.
Updating:

Aside from sea-hp-linux64-4, the rest are down,
that is sea-hp-linux64-{2,3,5,6,7,8,9,10,11,12,13}.

Can something be done to them? We're doing the beta release and
it's worrisome that we're down to one slave.

Thanks!
Flags: needinfo?(afernandez)
(adding van, in case :aj is unable to do this)
Flags: needinfo?(vle)
Attached file 2m21210363 carepaq
Flags: needinfo?(vle)
(In reply to Van Le [:van] from comment #32)

The n-i was so that we can get the ones you're not working with HP on back up.
Flags: needinfo?(vle)
oops i didn't realize it cleared my n-info when i attached the carepaq for aj. im downloading the 4gb firmware now and we'll work on these hosts. i'll get them online for you today if the firmware doesn't resolve the raid issue.
HP_Service_Pack_for_ProLiant_2014.09.0_792934_001_spp_2014.09.0-SPP2014090.2014_0827.10.iso uploaded into /tmp of admin1a. 

please update firmware and let me know if you still need me to fix the others.
van, our issue is that we have just 1 available right now, so we need them all brought back up ~now since we have a beta release we're trying to finish. (and possibly a chemspill ala Firefox 33.0.1)

We want to install that firmware on at least 1 or 2 and see if that fixes it, but *we* don't have access to admin1a to do so, nor do *we* have the ability to recover if something goes wrong with it.
> van, our issue is that we have just 1 available right now, so we need them all brought back up ~now since we have a beta release we're trying to finish. (and possibly a chemspill ala Firefox 33.0.1)


vans-MacBook-Pro:~ vle$ fping sea-hp-linux64-{2..13}.community.scl3.mozilla.com
sea-hp-linux64-2.community.scl3.mozilla.com is alive
sea-hp-linux64-3.community.scl3.mozilla.com is alive
sea-hp-linux64-4.community.scl3.mozilla.com is alive
sea-hp-linux64-5.community.scl3.mozilla.com is alive
sea-hp-linux64-6.community.scl3.mozilla.com is alive
sea-hp-linux64-7.community.scl3.mozilla.com is alive
sea-hp-linux64-8.community.scl3.mozilla.com is alive
sea-hp-linux64-9.community.scl3.mozilla.com is alive
sea-hp-linux64-10.community.scl3.mozilla.com is alive
sea-hp-linux64-11.community.scl3.mozilla.com is alive
sea-hp-linux64-12.community.scl3.mozilla.com is alive
sea-hp-linux64-13.community.scl3.mozilla.com is alive
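
Since these fping results get pasted into this bug so often, here is a rough Python sketch for turning such a paste into a list of down hosts. It relies only on the "is alive" lines shown above. The helper names and the assumed fleet of -2 through -13 (from the {2..13} in the command) are mine, not part of any existing tooling.

```python
import re
from typing import Iterable, List, Set

# Matches lines such as
# "sea-hp-linux64-9.community.scl3.mozilla.com is alive"
ALIVE_RE = re.compile(r"^sea-hp-linux64-(\d+)\.\S+ is alive$")


def alive_numbers(fping_output: str) -> Set[int]:
    """Return the host numbers that fping reported as alive."""
    alive: Set[int] = set()
    for line in fping_output.splitlines():
        match = ALIVE_RE.match(line.strip())
        if match:
            alive.add(int(match.group(1)))
    return alive


def down_numbers(fping_output: str, fleet: Iterable[int] = range(2, 14)) -> List[int]:
    """Return the host numbers in the fleet with no 'is alive' line.

    Assumption: the fleet is sea-hp-linux64-2 through -13.
    """
    return sorted(set(fleet) - alive_numbers(fping_output))


if __name__ == "__main__":
    sample = (
        "sea-hp-linux64-2.community.scl3.mozilla.com is alive\n"
        "sea-hp-linux64-4.community.scl3.mozilla.com is alive\n"
    )
    print(down_numbers(sample))
```

Any line that does not match the "is alive" pattern (shell prompts, blank lines) is ignored, so a raw terminal paste can be passed in as-is.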
Flags: needinfo?(vle)
Attempted to reach someone on irc before attempting to work on this but couldn't.
Rebooted sea-hp-linux64-2.community.scl3.mozilla.com and currently checking to see any firmware(s) getting updated.
Flags: needinfo?(afernandez)
sea-hp-linux64-2.community.scl3.mozilla.com back online.
The SPP applied updates; however, the automatic (hands-off) process is NOT verbose at all, so I don't have a list of what was actually updated or from what version to what version. A check in the IML log shows that the BIOS was updated to version 07/01/2013.

Let's have the server run for a few days (or just over the weekend) and if no issues arise, we could then apply the updates on the rest of the servers.

On a different note, seems when these servers boot, they do not reach the login prompt (or at least this server doesn't).
They get stuck in the init process under;
"Starting runner"
Van:

Linux64[2: NC, 3: B, 4: NC, 5: NC, 6: B, 7: B, 8: B, 9: B, 10: B, 11: B, 12: B, 13: NC],

NC == "not connected to buildbot"

B == "working fine"

Of note is 2, which is sea-hp-linux64-2, and is currently not pinging.

Can you peek to see if this is still the same firmware issue? Of note, this is the host HP already replaced a drive for, and it is one of the two hosts we have renewed the warranty on (the other being -3, which we recently bought the extra warranty for).
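
As an aside, the status line above is regular enough to parse if anyone wants to track these over time. A small Python sketch follows. parse_status and not_connected are hypothetical names, and the only state codes assumed are the two defined above (NC and B).

```python
import re
from typing import Dict, List


def parse_status(line: str) -> Dict[int, str]:
    """Parse 'Linux64[2: NC, 3: B, ...]' into {host_number: state}."""
    inner = re.search(r"\[(.*?)\]", line)
    if inner is None:
        raise ValueError("no [...] block found in status line")
    status: Dict[int, str] = {}
    for entry in inner.group(1).split(","):
        host, state = entry.split(":")
        status[int(host)] = state.strip()
    return status


def not_connected(status: Dict[int, str]) -> List[int]:
    """Hosts marked NC, i.e. not connected to buildbot."""
    return sorted(n for n, state in status.items() if state == "NC")


if __name__ == "__main__":
    line = ("Linux64[2: NC, 3: B, 4: NC, 5: NC, 6: B, 7: B, "
            "8: B, 9: B, 10: B, 11: B, 12: B, 13: NC],")
    print(not_connected(parse_status(line)))
```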
Flags: needinfo?(vle)
Duplicate of this bug: 1078900
can check for you on thursday or friday, i'm in phx at the moment.
So as per Comment 39, which states that "sea-hp-linux64-2.community.scl3.mozilla.com" was upgraded with the SPP ISO, it's experiencing the same issue as before.
The update didn't update the firmware on the RAID controller, as it is still on v1.38.
Flags: needinfo?(vle)
i'm not sure if HP has another update. ill check with them when i have some free cycles next week.

>Linux64[2: NC, 3: B, 4: NC, 5: NC, 6: B, 7: B, 8: B, 9: B, 10: B, 11: B, 12: B, 13: NC],

i brought up 2,3, and 13.
Flags: needinfo?(vle)
4 and 5 online as well.
Flags: needinfo?(vle)
Flags: needinfo?(vle)
sea-hp-linux64-{5,6,7,8,9,10} are back down.
hp came on site and i had them take a look at this issue. the SPP software suite doesnt update the firmware of the controller, it updates the iLO firmware and motherboard BIOS. the tech informed me that the 100 series of servers are their low end servers and the onboard embedded controller is outsourced so they dont develop the drivers for them.

however, the tech had a USB that updated the firmware of the hard drive (hp g0 to hp g9) so i asked her to test it on -5. 

i'm not sure if this will resolve the issue since we RMA'd a drive in -2 but if it does, we can open a ticket for them to come back and update the firmware for all the drives.

i'll take a look at the remaining down hosts sometime in the next couple of days.
:van, no luck it seems:

[Callek@jump1.community.scl3 ~]$ ssh root@sea-hp-linux64-5
ssh: connect to host sea-hp-linux64-5 port 22: No route to host

NOTE: at this moment *only* -3 is up, and all the rest are down.  (I'm *assuming* the -5 above was not a typo and was meant to be -3)
not a typo, it was -5, and it was its drive that was updated with the latest firmware. same issue though, so i guess i'll contact HP again to see if they have anything else for us to try, but per :aj in c#12, there's no known firmware for the individual embedded raid controller.

in the meantime, i've brought back the other hosts.
they're asking us to downgrade the BIOS by one version to see if that resolves the issue, as they do not have any firmware for this particular RAID controller. i'll work on -2 when i get a chance.

[Wednesday, November 05, 2014 10:43 AM] -- Arijeet S says:
We need to downgrade the BIOS to one level and then see if the error persist
[Wednesday, November 05, 2014 10:44 AM] -- Van Le says:
ok do you have a link for the bios?
[Wednesday, November 05, 2014 10:45 AM] -- Arijeet S says:
As I see the BIOS also is still not the latest
[Wednesday, November 05, 2014 10:46 AM] -- Arijeet S says:
I apologize it is the latest
[Wednesday, November 05, 2014 10:46 AM] -- Arijeet S says:
The previous version is 2012.12.04 (4 Jan 2013)
[Wednesday, November 05, 2014 10:47 AM] -- Arijeet S says:
* RECOMMENDED * Online ROM Flash Component for Windows - HP ProLiant ML110 G7/DL120 G7 (J01) Servers

http://h20565.www2.hp.com/portal/site/hpsc/template.PAGE/public/psi/swdDetails/?sp4ts.oid=5075937&spf_p.tpst=swdMain&spf_p.prp_swdMain=wsrp-navigationalState%3Didx%253D2%257CswItem%253DMTX_8fb5c65ba0a7403ca1e54638e6%257CswEnvOID%253D4064%257CitemLocale%253D%257CswLang%253D%257Cmode%253D4%257Caction%253DdriverDocument&javax.portlet.begCacheTok=com.vignette.cachetoken&javax.portlet.endCacheTok=com.vignette.cachetoken
Flags: needinfo?(vle)
Whiteboard: Case ID is 4649699191 → Case ID is 4649699191/4649842904
Flags: needinfo?(vle)
i brought -6 up with the downgraded ROM Version. please let me know when it goes down again.

sea-hp-linux64-6.community.scl3.mozilla.com is alive
System ROM 	J01 12/04/2012
5 & 6 are up.

sea-hp-linux64-{2,3,4,7,8,9,10,11,12} are down.
brought the hosts back online and upgraded -7 to System ROM J01 12/04/2012. these take a while to upgrade so if -6 and -7 stay stable, we'll upgrade the rest.
Flags: needinfo?(vle)
Flags: needinfo?(vle)
All the HP systems are down and I got a bunch of emails with "Repeated Puppet Failures on sea-hp-linux64-{5,6}.community.scl3.mozilla.com"

with the following in the message body:

IP: 63.245.223.116
(In reply to Edmund Wong (:ewong) from comment #56)
> All the HP systems are down and a I got a bunch of emails with "Repeated
> Puppet Failures on sea-hp-linux64-{5,6}.community.scl3.mozilla.com.

Currently thinking the 5,6 issue is a separate bug/issue. Will follow up here if not.
(In reply to Justin Wood (:Callek) from comment #57)
> (In reply to Edmund Wong (:ewong) from comment #56)
> > All the HP systems are down and a I got a bunch of emails with "Repeated
> > Puppet Failures on sea-hp-linux64-{5,6}.community.scl3.mozilla.com.
> 
> Currently thinking the 5,6 issue is a seperate bug/issue. Will followup here
> if not.

Ok.. it was a puppet change.  Anyway, Callek fixed that and now, both -5 and -6
are up.  The others are down.
All hosts are up.

sea-hp-linux64-2.community.scl3.mozilla.com is alive
sea-hp-linux64-3.community.scl3.mozilla.com is alive
sea-hp-linux64-4.community.scl3.mozilla.com is alive
sea-hp-linux64-5.community.scl3.mozilla.com is alive
sea-hp-linux64-6.community.scl3.mozilla.com is alive
sea-hp-linux64-7.community.scl3.mozilla.com is alive
sea-hp-linux64-8.community.scl3.mozilla.com is alive
sea-hp-linux64-9.community.scl3.mozilla.com is alive
sea-hp-linux64-10.community.scl3.mozilla.com is alive
sea-hp-linux64-11.community.scl3.mozilla.com is alive
sea-hp-linux64-12.community.scl3.mozilla.com is alive
sea-hp-linux64-13.community.scl3.mozilla.com is alive
Assignee: afernandez → nobody
oh fun.  All except -6 are down.
Hosts are back online.

sea-hp-linux64-2.community.scl3.mozilla.com is alive
sea-hp-linux64-3.community.scl3.mozilla.com is alive
sea-hp-linux64-4.community.scl3.mozilla.com is alive
sea-hp-linux64-5.community.scl3.mozilla.com is alive
sea-hp-linux64-6.community.scl3.mozilla.com is alive
sea-hp-linux64-7.community.scl3.mozilla.com is alive
sea-hp-linux64-8.community.scl3.mozilla.com is alive
sea-hp-linux64-9.community.scl3.mozilla.com is alive
sea-hp-linux64-10.community.scl3.mozilla.com is alive
sea-hp-linux64-11.community.scl3.mozilla.com is alive
sea-hp-linux64-12.community.scl3.mozilla.com is alive
sea-hp-linux64-13.community.scl3.mozilla.com is alive
upgraded -8 to System ROM J01 12/04/2012.
Flags: needinfo?(vle)
we'll just grab this bug until we get a chance to upgrade all the firmware on the hosts.
Assignee: nobody → server-ops-dcops
Component: Server Operations: MOC → Server Operations: DCOps
Product: mozilla.org → Infrastructure & Operations
Sea-hp-linux64-{5,6,7,8,9,12,13} are all down.

while 3, 4 and 11 aren't down... they aren't registering properly with the master.
(different issue)
ill need to contact HP again as ive upgraded a bunch of hosts and they're all still failing.

in the meantime, ive brought back up the down sea-hp-linux hosts.
 
vans-MacBook-Pro:~ vle$ fping sea-hp-linux64-{5,6,7,8,9,12,13}.community.scl3.mozilla.com
sea-hp-linux64-5.community.scl3.mozilla.com is alive
sea-hp-linux64-6.community.scl3.mozilla.com is alive
sea-hp-linux64-7.community.scl3.mozilla.com is alive
sea-hp-linux64-8.community.scl3.mozilla.com is alive
sea-hp-linux64-9.community.scl3.mozilla.com is alive
sea-hp-linux64-12.community.scl3.mozilla.com is alive
sea-hp-linux64-13.community.scl3.mozilla.com is alive
HP still can't find the root cause for this issue. They've asked that I download the SmartStart CD ISO, run their diagnostics, and escalate my findings to their tier 2 team. Will do this when -2 and -3 go down, as these are the hosts under warranty, or when I have some extra cycles.


[Monday, December 08, 2014 12:08 PM] -- Van Le says:
any update?
[Monday, December 08, 2014 12:09 PM] -- Avinash B says:
Yes at this point, we will try to run offline diagnostics on may be any 2 servers that have had this issue before
[Monday, December 08, 2014 12:10 PM] -- Avinash B says:
Once that is done we think it will be better if we elevate this case to the level two team.
[Monday, December 08, 2014 12:10 PM] -- Avinash B says:
So you would need to use the smart start CD for the server, I am not sure if you have it.
[Monday, December 08, 2014 12:11 PM] -- Avinash B says:
This is the link to get the iso.

http://h20564.www2.hp.com/hpsc/swd/public/detail?sp4ts.oid=5075938&swItemId=MTX_fa4e107ffbbd4ef394e57dd739&swEnvOid=4064#tab-history
[Monday, December 08, 2014 12:12 PM] -- Avinash B says:
The SPP does not give option to run diagnostics.
[Monday, December 08, 2014 12:12 PM] -- Van Le says:
ok
[Monday, December 08, 2014 12:13 PM] -- Van Le says:
does it print out a report or what do i need to do to get you the information?
[Monday, December 08, 2014 12:13 PM] -- Avinash B says:
Once you boot from the iso, you will see a few options.
Like Install Maintenance, Reboot.
Select maintenance
[Monday, December 08, 2014 12:14 PM] -- Avinash B says:
You will have to run the insight diagnostics.
[Monday, December 08, 2014 12:14 PM] -- Avinash B says:
1. Select Survey logs.

2. Change View level to Advance.

3. Change category level to all.

4. Save the generated report.
[Monday, December 08, 2014 12:14 PM] -- Avinash B says:
it will be a html file, few hundred KB in size.
[Monday, December 08, 2014 12:15 PM] -- Avinash B says:
The other one will be to run the Diagnostic that will scan all the hard ware components.
[Monday, December 08, 2014 12:15 PM] -- Avinash B says:
Save both the reports
Whiteboard: Case ID is 4649699191/4649842904 → Case ID is 4649699191/4649842904/4650124724
Currently sea-hp-linux64-{3,4,5,6,8,9,10,11,12,13} are down.  So I guess -3 
can be used to run the tests.
(In reply to Edmund Wong (:ewong) from comment #67)
> Currently sea-hp-linux64-{3,4,5,6,8,9,10,11,12,13} are down.  So I guess -3 
> can be used to run the tests.

err  was wrong. -3 is back up.
> Currently sea-hp-linux64-{3,4,5,6,8,9,10,11,12,13} are down.  So I guess -3 
> can be used to run the tests.

>err  was wrong. -3 is back up.

-2 and -3 are the ones on warranty, when those go down ill run the tests for HP. in the meantime ive brought back the other hosts.

sea-hp-linux64-2.community.scl3.mozilla.com is alive
sea-hp-linux64-3.community.scl3.mozilla.com is alive
sea-hp-linux64-4.community.scl3.mozilla.com is alive
sea-hp-linux64-5.community.scl3.mozilla.com is alive
sea-hp-linux64-6.community.scl3.mozilla.com is alive
sea-hp-linux64-7.community.scl3.mozilla.com is alive
sea-hp-linux64-8.community.scl3.mozilla.com is alive
sea-hp-linux64-9.community.scl3.mozilla.com is alive
sea-hp-linux64-10.community.scl3.mozilla.com is alive
sea-hp-linux64-11.community.scl3.mozilla.com is alive
sea-hp-linux64-12.community.scl3.mozilla.com is alive
sea-hp-linux64-13.community.scl3.mozilla.com is alive
(In reply to Van Le [:van] from comment #69)
> > Currently sea-hp-linux64-{3,4,5,6,8,9,10,11,12,13} are down.  So I guess -3 
> > can be used to run the tests.
> 
> >err  was wrong. -3 is back up.
> 
> -2 and -3 are the ones on warranty, when those go down ill run the tests for
> HP. in the meantime ive brought back the other hosts.

Currently only -2 and -5 are up.  The rest are down.

Van, I guess you can take -3 and do what needs to be done. :)
i ran advanced diagnostic and provided HP with report. will update bug as soon as i get more information. in the meantime, ive brought your hosts back online.

[Monday, December 22, 2014 11:35 AM] -- Venkatesh A says:
Please allow me some time while I update the case notes and so that we shall elevate the case to next level of engineering
[Monday, December 22, 2014 11:36 AM] -- Venkatesh A says:
Yes, one of our engineers would contact you on the provided number or email, preferably, Van
[Monday, December 22, 2014 11:36 AM] -- Venkatesh A says:
No we shall close this chat session as of now

vans-MacBook-Pro:~ vle$ fping sea-hp-linux64-{2..13}.community.scl3.mozilla.com
sea-hp-linux64-2.community.scl3.mozilla.com is alive
sea-hp-linux64-3.community.scl3.mozilla.com is alive
sea-hp-linux64-4.community.scl3.mozilla.com is alive
sea-hp-linux64-5.community.scl3.mozilla.com is alive
sea-hp-linux64-6.community.scl3.mozilla.com is alive
sea-hp-linux64-7.community.scl3.mozilla.com is alive
sea-hp-linux64-8.community.scl3.mozilla.com is alive
sea-hp-linux64-9.community.scl3.mozilla.com is alive
sea-hp-linux64-10.community.scl3.mozilla.com is alive
sea-hp-linux64-11.community.scl3.mozilla.com is alive
sea-hp-linux64-12.community.scl3.mozilla.com is alive
sea-hp-linux64-13.community.scl3.mozilla.com is alive
Whiteboard: Case ID is 4649699191/4649842904/4650124724 → Case ID is 4649699191/4649842904/4650124724/4650242010
HP would like us to try and use their driver for the RAID controller and if it crashes, generate the CFG2HTML for them.

The B110i controller driver can be found here.
 
http://h20564.www2.hp.com/hpsc/swd/public/detail?sp4ts.oid=3958195&swItemId=MTX_c9800386db8146efb747deee41&swEnvOid=4103#tab3
 
This is the file that you will need to use
 
File name:           kmod-hpahcisr-1.2.6-18.rhel6u2.x86_64.rpm (136 KB)
 
NOTE: These driver are for RHEL 6 but will work with CentOS 6 as well.

Generate the CFG2HTML:

1. For LINUX, type the following at your console:
    - ftp ftp.usa.hp.com
    - at the user prompt enter: iss
    - at the password prompt enter: tools4AL
    - don't set the connection type to binary, as otherwise the script might not work.
    - then type: get cfg2html-linux124HP (transfer will start as you will see)
    - type: quit to leave FTP
   Note: If the script is opened in any Windows file editor, exit the program without saving the file.
2. Make the script executable on your LINUX server: chmod +x cfg2html-linux124cHP
3. Run the script by typing ./cfg2html-linux124cHP
4. All output is stored all together in the file {hostname}.tar (as stated during execution of the script).
5. Collect the resultant {hostname}.tar file
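
If we end up doing this on more than one host, steps 2 to 5 above could be wrapped roughly like the Python sketch below. This is only an illustration under assumptions: run_cfg2html is a hypothetical helper, the script path is passed in rather than hard-coded (the instructions above spell the file name two different ways), and it assumes {hostname} means the name returned by the system's hostname lookup. The FTP download in step 1 is not covered.

```python
import socket
import stat
import subprocess
from pathlib import Path


def run_cfg2html(script: Path, workdir: Path) -> Path:
    """Make the downloaded script executable, run it, and return the
    path where the {hostname}.tar report is expected to appear."""
    # Step 2: add execute bits for user, group and other.
    mode = script.stat().st_mode
    script.chmod(mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)

    # Step 3: run the script from the working directory; raise if it fails.
    subprocess.run([str(script.resolve())], cwd=workdir, check=True)

    # Steps 4 and 5: the instructions say output lands in {hostname}.tar.
    # Assumption: {hostname} is what socket.gethostname() returns.
    return workdir / f"{socket.gethostname()}.tar"


if __name__ == "__main__":
    import sys

    # Usage: pass the path of the downloaded script as the only argument.
    report = run_cfg2html(Path(sys.argv[1]), Path.cwd())
    print(f"expected report: {report} (exists: {report.exists()})")
```

The function only returns the expected report path; whether the tar file is actually there still needs checking before sending it to HP.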
spoke to linda regarding this issue. since aj is no longer available, can an experienced tier 2 poke at this?
Assignee: server-ops-dcops → nobody
Component: DCOps → MOC: Problems
I hit:

> [root@sea-hp-linux64-2.community.scl3.mozilla.com ~]# rpm -Uvh kmod-hpahcisr-1.2.6-18.rhel6u2.x86_64.rpm
> Preparing...                ########################################### [100%]
>      #######################################################
>      # Hpahcisr is currently not controlling any storage.  #
>      # Loading this driver could displace the current      #
>      # storage driver causing filesystem corruption.       #
>      # Exiting!                                            #
>      #######################################################
> error: %pre(kmod-hpahcisr-1.2.6-18.rhel6u2.x86_64) scriptlet failed, exit status 1
> error:   install: %pre scriptlet failed (2), skipping kmod-hpahcisr-1.2.6-18.rhel6u2

This same problem was encountered in Bug 779487 and all the troubleshooting and diagnosing went nowhere. Eventually the machines were switched to AHCI. Is that a viable option in this case as well?
:ashish, thanks for update. i've updated my case with HP.
\o/ Let's try this.

I'm sad we missed that because myself, van, dustin and arr are all over that bug, and these are the exact same hosts, where that bug hit almost the exact same issue.

Does doing that change require a reimage or lose any data?
arr changed the settings and reimaged it in bug 779487. from my experience, we've always had to reimage after changing from raid to ahci, as remapping the bits isn't worth it and could still cause issues later on. i would defer to our SREs though, perhaps there's a good way to do it.
If a reimage is needed, that is fine; it just needs more coordination and more hands-on work (since they can't be reimaged in place the way we need, and I'll need to input a puppet deploypass myself once they are back in their proper home).
Reading through the old bug I'm pretty sure a reimage is needed. I personally don't have prior experience dealing with this VLAN, so any input would be valuable!
:ashish (and :van)

Prior work to setup in initial imaging was done in https://bugzilla.mozilla.org/show_bug.cgi?id=740633#c7 (and on)

n-i to both of you so we can get a rough eta and then I can coord better, once you two coordinate on your end. So we can get rid of this issue once and for all.
Flags: needinfo?(vle)
Flags: needinfo?(ashish)
i dont think ashish needs to be involved anymore as you and i can coordinate and get this working since it requires physically moving some cables. ill work on a host monday and try to get the process down (+document). i remember there was a hiccup because the host doesnt boot all the way so i dont get a login prompt (procedure requires you input incorrect puppet pw).
Assignee: nobody → server-ops-dcops
Component: MOC: Problems → DCOps
Flags: needinfo?(ashish)
pleasant surprise, it seems changing the controller mode to AHCI doesnt require a reimage. can you take a look at -7, -8, -9, and -10?
Flags: needinfo?(bugspam.Callek)
So, -7, -8, -9, -10 are all connected to buildbot right now, and at least -7 has had successful jobs, I'm going to call this a success right now.

Once the others fail, or time passes we'll do them:

ToDo:

-2, -3, -4, -5, -6, -11, -12, -13
Flags: needinfo?(bugspam.Callek)
i noticed 5 and 11 were down. i went ahead and did a graceful shutdown on the others and fixed them all.
Status: REOPENED → RESOLVED
Closed: 6 years ago → 5 years ago
Flags: needinfo?(vle)
Resolution: --- → FIXED