Closed Bug 773390 Opened 8 years ago Closed 8 years ago

Review socorro1.stage.db.phx1.mozilla.com for hardware/driver issues

Categories

(mozilla.org Graveyard :: Server Operations, task, major)

x86
macOS
task
Not set
major

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bburton, Assigned: ericz)

References

Details

(Whiteboard: phx1)

While reviewing the logs on socorro1.stage.db.phx1.mozilla.com for errors related to clients getting disconnected from Postgres in Socorro staging (see bug 771218), I noticed the following in dmesg:

[root@socorro1.stage.db.phx1 log]# dmesg | grep fail
cciss 0000:03:00.0: vpd r/w failed.  This is likely a firmware bug on this device.  Contact the card vendor for a firmware update.
cciss 0000:03:00.0: vpd r/w failed.  This is likely a firmware bug on this device.  Contact the card vendor for a firmware update.

There doesn't seem to be anything corresponding in /var/log/messages, and I'm unsure where the HP agent stores its logs.

Can this host be reviewed for hardware faults, driver updates, OS updates, etc, that should be applied?
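
For reference, the dmesg check above could be widened to catch firmware complaints that do not contain the word "fail". The following is only a minimal sketch: the pattern list and the function name `firmware_complaints` are assumptions made for illustration, and the sample text reuses messages quoted in this bug plus one placeholder line.

```python
import re

# Assumed patterns, chosen to match the kinds of messages quoted in this bug.
FIRMWARE_PATTERNS = [
    re.compile(r"\[Firmware Bug\]"),
    re.compile(r"firmware bug", re.IGNORECASE),
    re.compile(r"Broken BIOS detected"),
    re.compile(r"\bfail(ed|ure)?\b", re.IGNORECASE),
]


def firmware_complaints(dmesg_text):
    """Return (line, count) pairs for matching lines, in first-seen order."""
    counts = {}
    for line in dmesg_text.splitlines():
        line = line.strip()
        if line and any(p.search(line) for p in FIRMWARE_PATTERNS):
            counts[line] = counts.get(line, 0) + 1
    return list(counts.items())


# Sample input: two real messages from this bug and one placeholder line.
SAMPLE = """\
cciss 0000:03:00.0: vpd r/w failed.  This is likely a firmware bug on this device.  Contact the card vendor for a firmware update.
cciss 0000:03:00.0: vpd r/w failed.  This is likely a firmware bug on this device.  Contact the card vendor for a firmware update.
placeholder: an unrelated kernel message that should be ignored
[Firmware Bug]: the BIOS has corrupted hw-PMU resources (MSR 38d is 330)
"""

for message, count in firmware_complaints(SAMPLE):
    print(f"{count}x {message}")
```

On a live host the text would come from the output of `dmesg` rather than from `SAMPLE`; collapsing repeats keeps a duplicated message like the cciss one from hiding other complaints.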
There are 252 updates and likely BIOS and RAID firmware upgrades for this host.  I can apply them if we can spare one reboot's worth of downtime.  Can we do that during the day?
go for it!
Assignee: server-ops → eziegenhorn
The updates have been applied.  It is complaining about the BIOS in a different way now:

CPU0: Intel(R) Xeon(R) CPU           L5520  @ 2.27GHz stepping 05
Performance Events: PEBS fmt1+, erratum AAJ80 worked around, Nehalem events, Broken BIOS detected, complain to your hardware vendor.
[Firmware Bug]: the BIOS has corrupted hw-PMU resources (MSR 38d is 330)

which I will contact HP about.
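
For anyone reading the message above, the reported value can be decoded by hand. This is a sketch under stated assumptions: that MSR 0x38d is IA32_FIXED_CTR_CTRL (the fixed-function performance counter control register in Intel's documentation), that the kernel prints the value in hex, that each fixed counter owns a 4-bit field, and that this CPU has three fixed counters. The function and field names are made up for illustration.

```python
# Assumption: MSR 0x38d is IA32_FIXED_CTR_CTRL, with one 4-bit field per counter.
FIXED_CTR_CTRL_MSR = 0x38D
BITS_PER_COUNTER = 4


def decode_fixed_ctr_ctrl(value, num_counters=3):
    """Split the register value into per-counter control flags."""
    fields = []
    for i in range(num_counters):
        nibble = (value >> (i * BITS_PER_COUNTER)) & 0xF
        fields.append({
            "counter": i,
            "os": bool(nibble & 0x1),          # count while in ring 0
            "usr": bool(nibble & 0x2),         # count while in user mode
            "any_thread": bool(nibble & 0x4),  # count for any logical thread
            "pmi": bool(nibble & 0x8),         # raise an interrupt on overflow
        })
    return fields


# Value taken from the dmesg line: "MSR 38d is 330".
decoded = decode_fixed_ctr_ctrl(0x330)
for field in decoded:
    print(field)
```

Under those assumptions, 0x330 means fixed counters 1 and 2 were already enabled for both kernel and user mode when Linux started, while counter 0 was off. As far as I understand, the kernel expects these enable bits to be clear at boot, which is why it blames the BIOS.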
Status: NEW → ASSIGNED
Thank you eric!
HP wants to have someone open up this box and physically flip a switch on the board.  I'd guess it'll be < 1 hour of downtime.  Please comment here with a good time to do this, and then I can send the directions to DC Ops.
ericz: as soon as you like.  Just let us know in IRC in #breakpad.  Thanks!
DC Ops, HP wants someone to clear the nvram on socorro1.stage.db.phx1.  The directions are below.  Can you guys do that or do I have to talk to I/O?

Clear nv ram:

Power off the server
Locate the system maintenance switch on the system board
Move switch number 6 to the on position
Power on the server
Leave the server running for 30 seconds
Power the server back off
Move the switch 6 back to the off position
Power on the server
Assignee: eziegenhorn → server-ops
Component: Server Operations → Server Operations: DCOps
QA Contact: phong → dmoore
Putting my name back on this so it doesn't page.
Assignee: server-ops → eziegenhorn
:ericz, we won't be out in phx1 until next week. Let us know if you need it before then and we'll have to work with IO remote hands.
:laura, want to comment on that time frame?  My best guess is that the server will stay running until then.
Eric Z: I'm sure it will stay running, but presumably still dropping connections like chaff.  If remote hands are easy, let's go ahead and do that.
Whiteboard: phx1
:van Can you get IO to run through the above steps to clear the NVRAM?  I can help coordinate if that'd help and you tell me how (I think I wasn't authorized when I first tried).
yup, let me call them and see if i have the authority.

van
:ericz, i contacted io. they pulled the blade but they do not want to open it and fiddle with anything inside for fear of breaking something else. they apologized, reinserted the blade and powered it back on. hopefully the cold restart fixed the issue; otherwise we'll have to wait until the next phx1 trip (next week) to perform the proper maintenance.

van
I wouldn't recommend asking I/O to do this. They are only performing basic work for us (hardware replacement, power cycling a server etc.).
For the task described in comment 7, knowing that's also a blade, we should get HP support onsite and do it.
The server has "HP HW Maintenance Onsite Support" contract active.
Any reason why HP is not doing this, Eric?
I'll ask HP to do it.
Assignee: eziegenhorn → server-ops
Component: Server Operations: DCOps → Server Operations
QA Contact: dmoore → phong
Assignee: server-ops → eziegenhorn
HP Case # 4678197910 was opened for this, but the upshot is that this is a RHEL issue that has no impact on the operation of the system, so we need to take no action.  There are no impacting issues and everything has been upgraded, so I'm closing this bug.
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard