Closed
Bug 1019834
Opened 11 years ago
Closed 11 years ago
Unexplained lockups/reboots of production servers
Categories
(mozilla.org Graveyard :: Server Operations, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: tmary, Assigned: ericz)
This bug exists to discuss and track the investigation into why some of the production servers lock up and need a reboot to become functional again.
--
Assignee
Comment 1•11 years ago
For what it's worth, both of the recent node* server crashes had unstable tsc clocksource messages such as:
Jun 3 16:03:07 node2.admin.plum.metrics.phx1.mozilla.com kernel: Clocksource tsc unstable (delta = -51539585144 ns). Enable clocksource failover by adding clocksource_failover kernel parameter.
Jun 3 16:03:37 node2.admin.plum.metrics.phx1.mozilla.com kernel: __ratelimit: 16 callbacks suppressed
Jun 3 16:03:37 node2.admin.plum.metrics.phx1.mozilla.com kernel: __ratelimit: 19 callbacks suppressed
Jun 3 16:03:37 node2.admin.plum.metrics.phx1.mozilla.com kernel: __ratelimit: 19 callbacks suppressed
Jun 3 16:03:38 node2.admin.plum.metrics.phx1.mozilla.com kernel: Clocksource tsc unstable (delta = -25769792099 ns). Enable clocksource failover by adding clocksource_failover kernel parameter.
Jun 3 16:03:38 node2.admin.plum.metrics.phx1.mozilla.com kernel: hpsa 0000:0c:00.0: Abort request on C0:B0:T0:L0
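A minimal sketch, assuming syslog lines shaped like the excerpt above, of how the "Clocksource tsc unstable" messages could be pulled out of a log to compare hosts. The function name and the sample list are illustrative only and are not part of any tooling mentioned in this bug.

```python
import re

# Matches the kernel message format shown in the excerpt above.
# Assumption: lines start with a "Mon D HH:MM:SS host kernel:" syslog prefix.
TSC_UNSTABLE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+[\d:]+)\s+(?P<host>\S+)\s+kernel:\s+"
    r"Clocksource tsc unstable \(delta = (?P<delta_ns>-?\d+) ns\)"
)


def find_tsc_unstable(lines):
    """Return (timestamp, host, delta in seconds) for each matching line."""
    hits = []
    for line in lines:
        match = TSC_UNSTABLE.match(line)
        if match:
            delta_s = int(match.group("delta_ns")) / 1e9
            hits.append((match.group("stamp"), match.group("host"), delta_s))
    return hits


# Two lines copied from the excerpt: one that should match, one that should not.
sample = [
    "Jun 3 16:03:07 node2.admin.plum.metrics.phx1.mozilla.com kernel: "
    "Clocksource tsc unstable (delta = -51539585144 ns). Enable clocksource "
    "failover by adding clocksource_failover kernel parameter.",
    "Jun 3 16:03:37 node2.admin.plum.metrics.phx1.mozilla.com kernel: "
    "__ratelimit: 16 callbacks suppressed",
]

for stamp, host, delta_s in find_tsc_unstable(sample):
    print(f"{stamp} {host} delta={delta_s:.3f}s")
```

Converting the delta to seconds makes the size of the jump easier to read when comparing messages across hosts.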
Assignee
Comment 2•11 years ago
The firmware was quite old on both of these:
Updating: P410i Slot: 0 from [3.50] to [6.40]
New Version: 07/02/2013
Current Version: 09/30/2010
I upgraded the firmware and the packages on both. The packages were more recent than the firmware, but there were still >50 packages on node3 and >100 packages on node2 to upgrade.
Assignee
Comment 3•11 years ago
Now that the firmware and packages are up to date, if either of these crashes again, we can call HP and have a reasonable chance of getting help. It's good to have that prerequisite out of the way. I'll take this bug and keep an eye on these for a week or so, then we can close it if there are no more problems. The tsc clocksource error returned after the first reboot but has not appeared since the upgrades, which is a good sign.
Assignee: server-ops → eziegenhorn
Reporter
Comment 4•11 years ago
node26.peach.metrics.scl3.mozilla.com crashed/rebooted on 02-Jun
node22.peach.metrics.scl3.mozilla.com crashed/rebooted on 26-May
--
Summary: Unexplained lockups of production servers → Unexplained lockups/reboots of production servers
Assignee
Comment 5•11 years ago
Both node22 and node26.peach.metrics.scl3 need software and firmware upgrades. When is a good time for those, :tmary?
Flags: needinfo?(tmeyarivan)
Reporter
Comment 6•11 years ago
The latest "Lights Out Management" firmware upgrade (v 1.51) seems relevant, given that many of the nodes rebooted due to an NMI.
http://goo.gl/ex2exl
--
Flags: needinfo?(tmeyarivan)
Comment 7•11 years ago
Pushing ilo4 v1.51 out
pir@wedge> svn diff
Index: manifests/init.pp
===================================================================
--- manifests/init.pp (revision 89792)
+++ manifests/init.pp (working copy)
@@ -105,10 +105,10 @@
'iLO 4': {
# Patch iLo 4 to the most recent version.
- if ($::hp_ilo_4_firmware_version < 1.40) {
- # This provides iLo 4 version 1.40
- $ilofile_path = "${hpbase}/ilo4/CP020341.scexe"
- $ilofile_md5 = '9bfa908206cead1ce817c4cb94016a5e'
+ if ($::hp_ilo_4_firmware_version < 1.51) {
+ # This provides iLo 4 version 1.51
+ $ilofile_path = "${hpbase}/ilo4/CP023646.scexe"
+ $ilofile_md5 = 'cd1dd21720f214e15d56c37e15e94366'
}
}
pir@wedge> svn ci -m "pushing ilo4 v 1.51 bug 1019834"
Sending manifests/init.pp
Transmitting file data .
Committed revision 89794.
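The diff above records an MD5 digest next to each firmware file. How the Puppet module uses `$ilofile_md5` is not shown in this bug, so the following is only a sketch, under that caveat, of checking a downloaded file against a recorded digest before running it. The function name and the path in the usage comment are hypothetical.

```python
import hashlib


def md5_matches(path, expected_md5, chunk_size=1 << 20):
    """Return True if the file at `path` has the expected MD5 hex digest."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # Read in chunks so large firmware images are not loaded into memory at once.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_md5.lower()


# Usage (hypothetical local path; the digest is the one from the diff above):
# md5_matches("/tmp/CP023646.scexe", "cd1dd21720f214e15d56c37e15e94366")
```

MD5 is used here only because it is what the manifest records; it guards against a corrupted or mismatched download, not against deliberate tampering.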
:tmary, :ericz asked when would be a good time to do updates on those machines (other firmware and software), and I don't see an answer to his question.
Flags: needinfo?(tmeyarivan)
Reporter
Updated•11 years ago
Flags: needinfo?(tmeyarivan)
Comment 8•11 years ago
There are updated BIOS images available for these machines but I can't download them since we still don't have HP accounts linked to our service paks.
Reporter
Comment 9•11 years ago
All available firmware upgrades have been applied to *.peach hosts (using /usr/local/sbin/install_hp_firmware.sh)
--
Reporter
Comment 10•11 years ago
node1.peach crashed with the frequently observed '<0>Kernel panic - not syncing: An NMI occurred, please see the Integrated Management Log for details.' message.
https://access.redhat.com/solutions/707563 suggests engaging HP support to determine whether there are underlying hardware issues.
--
Depends on: 1035616
Assignee
Comment 11•11 years ago
node22.peach.metrics.scl3.mozilla.com has been up for 45 days
node26.peach.metrics.scl3.mozilla.com has been up for 34 days
node1.peach.metrics.scl3.mozilla.com has been up for 44 days
HP replaced parts on node1. Additionally, we've regained access to the latest BIOS firmware and pushed it out. I don't see any other action to take at this point.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Reporter
Comment 12•11 years ago
Re node26: the power management configuration was changed to "OS control mode" (it has been stable since that change, even under heavy load).
--
Updated•10 years ago
Product: mozilla.org → mozilla.org Graveyard