Closed Bug 773390 Opened 8 years ago Closed 8 years ago
socorro1.stage.db.phx1.mozilla.com for hardware/driver issues
In reviewing the logs on socorro1.stage.db.phx1.mozilla.com for errors related to clients getting disconnected from Postgres in Socorro staging (see bug 771218), I noticed the following in dmesg:

[firstname.lastname@example.org log]# dmesg | grep fail
cciss 0000:03:00.0: vpd r/w failed. This is likely a firmware bug on this device. Contact the card vendor for a firmware update.
cciss 0000:03:00.0: vpd r/w failed. This is likely a firmware bug on this device. Contact the card vendor for a firmware update.

There doesn't seem to be anything corresponding in /var/log/messages, and I'm unsure where the HP agent stores its logs.

Can this host be reviewed for hardware faults, driver updates, OS updates, etc., that should be applied?
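For anyone repeating this check on other hosts, here is a minimal sketch of the same filtering in Python. It groups repeated kernel warnings so duplicates like the two cciss lines above show up once with a count. The function name and the pattern list are illustrative assumptions, not anything from Socorro or the HP tooling, and the patterns are not an exhaustive list of kernel messages.

```python
import re
from collections import Counter

# Illustrative patterns only (an assumption); extend for your own hosts.
WARNING_RE = re.compile(r"fail|firmware bug|broken bios", re.IGNORECASE)
# Optional leading dmesg timestamp such as "[   12.345678] ".
TIMESTAMP_RE = re.compile(r"^\[\s*\d+\.\d+\]\s*")


def summarize_kernel_warnings(dmesg_text):
    """Count distinct warning lines in dmesg-style text, ignoring timestamps."""
    counts = Counter()
    for raw in dmesg_text.splitlines():
        line = TIMESTAMP_RE.sub("", raw.strip())
        if line and WARNING_RE.search(line):
            counts[line] += 1
    return counts


# Sample input copied from the comment above, plus one unrelated line.
sample = (
    "cciss 0000:03:00.0: vpd r/w failed. This is likely a firmware bug on this device.\n"
    "[   12.5] cciss 0000:03:00.0: vpd r/w failed. This is likely a firmware bug on this device.\n"
    "usb 1-1: new high-speed USB device\n"
)

summary = summarize_kernel_warnings(sample)
for line, n in summary.most_common():
    print(f"{n}x {line}")
```

On a live host, the text would come from the output of dmesg rather than the sample string.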
There are 252 updates and likely BIOS and RAID firmware upgrades for this host. I can apply them if we can spare one reboot's worth of downtime. Can we do that during the day?
go for it!
The updates have been applied. It is complaining about the BIOS in a different way now:

CPU0: Intel(R) Xeon(R) CPU L5520 @ 2.27GHz stepping 05
Performance Events: PEBS fmt1+, erratum AAJ80 worked around, Nehalem events, Broken BIOS detected, complain to your hardware vendor.
[Firmware Bug]: the BIOS has corrupted hw-PMU resources (MSR 38d is 330)

I will contact HP about this.
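As a rough aid to reading the "MSR 38d is 330" message, the sketch below decodes the value, assuming both numbers are hexadecimal and that MSR 0x38D is IA32_FIXED_CTR_CTRL as described in the Intel SDM (one 4-bit control field per fixed-function counter). The function name and the default of three fixed counters are assumptions for illustration.

```python
def decode_fixed_ctr_ctrl(value, num_counters=3):
    """Split an IA32_FIXED_CTR_CTRL value into per-counter control bits.

    Assumed layout (Intel SDM): for counter i, bits 4*i .. 4*i+3 are
    OS enable, USR enable, AnyThread, and PMI-on-overflow.
    """
    fields = []
    for i in range(num_counters):
        nibble = (value >> (4 * i)) & 0xF
        fields.append({
            "counter": i,
            "os": bool(nibble & 0x1),
            "usr": bool(nibble & 0x2),
            "any_thread": bool(nibble & 0x4),
            "pmi": bool(nibble & 0x8),
        })
    return fields


decoded = decode_fixed_ctr_ctrl(0x330)
for f in decoded:
    state = "enabled" if (f["os"] or f["usr"]) else "disabled"
    print(f"fixed counter {f['counter']}: {state} (os={f['os']}, usr={f['usr']})")
```

Under those assumptions, 0x330 means fixed counters 1 and 2 were already enabled when the kernel took over, which would be consistent with the firmware using them.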
Status: NEW → ASSIGNED
Thank you eric!
HP wants to have someone open up this box and physically flip a switch on the board. I'd guess it'll be < 1 hour of downtime. Please comment here when a good time to do this is, and then I can send the directions to DC Ops.
ericz: as soon as you like. Just let us know in IRC in #breakpad. Thanks!
DC Ops, HP wants someone to clear the NVRAM on socorro1.stage.db.phx1. The directions are below. Can you guys do that or do I have to talk to I/O?

Clear NVRAM:
1. Power off the server
2. Locate the system maintenance switch on the system board
3. Move switch number 6 to the on position
4. Power on the server
5. Leave the server running for 30 seconds
6. Power the server back off
7. Move switch 6 back to the off position
8. Power on the server
Assignee: eziegenhorn → server-ops
Component: Server Operations → Server Operations: DCOps
QA Contact: phong → dmoore
Putting my name back on this so it doesn't page.
Assignee: server-ops → eziegenhorn
:ericz, we won't be out in phx1 until next week. Let us know if you need it before then and we'll have to work with IO remote hands.
:laura, want to comment on that time frame? My best guess is that the server will stay running until then.
Eric Z: I'm sure it will stay running, but presumably still dropping connections like chaff. If remote hands are easy, let's go ahead and do that.
:van Can you get IO to run through the above steps to clear the NVRAM? I can help coordinate if that'd help and you tell me how (I think I wasn't authorized when I first tried).
yup, let me call them and see if i have the authority. van
:ericz, i contacted io. they pulled the blade but they do not want to open the blade and fiddle with anything inside for fear of breaking anything else. they apologized and reinserted the blade and powered it back on. hopefully the cold restart fixed the issue otherwise we'll have to wait until the next phx1 trip (next week) to perform the proper maintenance. van
I wouldn't recommend asking I/O to do this. They are only performing basic work for us (hardware replacement, power cycling a server etc.). For the task described in comment 7, knowing that's also a blade, we should get HP support onsite and do it. The server has "HP HW Maintenance Onsite Support" contract active. Any reason why HP is not doing this, Eric?
I'll ask HP to do it.
Assignee: eziegenhorn → server-ops
Component: Server Operations: DCOps → Server Operations
QA Contact: dmoore → phong
HP Case # 4678197910 was opened for this, but the upshot is that this is a RHEL issue with no impact on the operation of the system, so no action is needed. Since there are no impacting issues and everything has been upgraded, I'm closing this bug.
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard