Closed Bug 1019834 Opened 11 years ago Closed 11 years ago

Unexplained lockups/reboots of production servers

Categories

(mozilla.org Graveyard :: Server Operations, task)

x86
macOS
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: tmary, Assigned: ericz)


This bug exists to discuss and track the investigation into why some of the production servers lock up and need a reboot to become functional again.
For what it's worth, both of the recent node* server crashes had unstable tsc clocksource messages such as:

Jun 3 16:03:07 node2.admin.plum.metrics.phx1.mozilla.com kernel: Clocksource tsc unstable (delta = -51539585144 ns). Enable clocksource failover by adding clocksource_failover kernel parameter.
Jun 3 16:03:37 node2.admin.plum.metrics.phx1.mozilla.com kernel: __ratelimit: 16 callbacks suppressed
Jun 3 16:03:37 node2.admin.plum.metrics.phx1.mozilla.com kernel: __ratelimit: 19 callbacks suppressed
Jun 3 16:03:37 node2.admin.plum.metrics.phx1.mozilla.com kernel: __ratelimit: 19 callbacks suppressed
Jun 3 16:03:38 node2.admin.plum.metrics.phx1.mozilla.com kernel: Clocksource tsc unstable (delta = -25769792099 ns). Enable clocksource failover by adding clocksource_failover kernel parameter.
Jun 3 16:03:38 node2.admin.plum.metrics.phx1.mozilla.com kernel: hpsa 0000:0c:00.0: Abort request on C0:B0:T0:L0
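To check other hosts for the same symptom, the messages quoted above can be pulled out of a syslog file with a small script. This is a minimal sketch, not something used in this investigation: the function name `tsc_unstable_events` is hypothetical, and the regular expression assumes the exact syslog layout shown in the quoted lines.

```python
import re

# Matches the "Clocksource tsc unstable" kernel message in the syslog
# layout quoted in this bug (assumption: other hosts log the same format).
TSC_UNSTABLE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) (?P<host>\S+) kernel: "
    r"Clocksource tsc unstable \(delta = (?P<delta_ns>-?\d+) ns\)"
)


def tsc_unstable_events(lines):
    """Yield (timestamp, host, delta in seconds) for each matching line."""
    for line in lines:
        match = TSC_UNSTABLE.match(line)
        if match:
            yield (
                match.group("stamp"),
                match.group("host"),
                int(match.group("delta_ns")) / 1e9,  # nanoseconds to seconds
            )


sample = [
    "Jun 3 16:03:07 node2.admin.plum.metrics.phx1.mozilla.com kernel: "
    "Clocksource tsc unstable (delta = -51539585144 ns). Enable clocksource "
    "failover by adding clocksource_failover kernel parameter.",
    "Jun 3 16:03:37 node2.admin.plum.metrics.phx1.mozilla.com kernel: "
    "__ratelimit: 16 callbacks suppressed",
]

events = list(tsc_unstable_events(sample))
for stamp, host, delta_s in events:
    print(f"{stamp} {host} delta={delta_s:.2f}s")
```

Converting the delta to seconds makes the size of the jump easier to read: the first message above corresponds to roughly -51.5 seconds.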
The firmware was quite old on both of these:

Updating: P410i Slot: 0 from [3.50] to [6.40]
New Version: 07/02/2013
Current Version: 09/30/2010

I upgraded the firmware and the OS packages. The packages were more recent than the firmware, but there were still >50 packages to upgrade on node3 and >100 on node2.
Now that the firmware and packages are up to date, if either of these crashes again, we can call HP and have a reasonable chance of getting help. It's good to have that prerequisite out of the way. I'll take this bug and keep an eye on these for a week or so, then we can close it if there are no more problems. The tsc clocksource error returned after the first reboot but has not appeared since the upgrades, which is a good sign.
Assignee: server-ops → eziegenhorn
node26.peach.metrics.scl3.mozilla.com crashed/rebooted on 02-Jun
node22.peach.metrics.scl3.mozilla.com crashed/rebooted on 26-May
Summary: Unexplained lockups of production servers → Unexplained lockups/reboots of production servers
Depends on: 1020561
Both node22 and node26.peach.metrics.scl3 need software and firmware upgrades. When is a good time for those, :tmary?
Flags: needinfo?(tmeyarivan)
The latest "Lights Out Management" firmware upgrade (v1.51) seems relevant, given that many of the nodes rebooted due to an NMI: http://goo.gl/ex2exl
Flags: needinfo?(tmeyarivan)
Pushing ilo4 v1.51 out

pir@wedge> svn diff
Index: manifests/init.pp
===================================================================
--- manifests/init.pp (revision 89792)
+++ manifests/init.pp (working copy)
@@ -105,10 +105,10 @@
     'iLO 4': {
       # Patch iLo 4 to the most recent version.
-      if ($::hp_ilo_4_firmware_version < 1.40) {
-        # This provides iLo 4 version 1.40
-        $ilofile_path = "${hpbase}/ilo4/CP020341.scexe"
-        $ilofile_md5 = '9bfa908206cead1ce817c4cb94016a5e'
+      if ($::hp_ilo_4_firmware_version < 1.51) {
+        # This provides iLo 4 version 1.51
+        $ilofile_path = "${hpbase}/ilo4/CP023646.scexe"
+        $ilofile_md5 = 'cd1dd21720f214e15d56c37e15e94366'
       }
     }
pir@wedge> svn ci -m "pushing ilo4 v 1.51 bug 1019834"
Sending manifests/init.pp
Transmitting file data .
Committed revision 89794.

:tmary, :ericz asked when would be a good time to do updates on those machines (other firmware and software), and I don't see an answer to his question...
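The manifest change above does two things: it gates the upgrade on the installed iLO 4 version, and it pins the firmware file to an MD5 digest. The sketch below shows the same two checks in Python for illustration only. It is not how the Puppet module works internally; the helper names `parse_version`, `needs_upgrade` and `md5_matches` are hypothetical. It also assumes versions are dotted integers compared component by component, which can differ from the manifest's numeric `<` comparison when minor versions are written with different widths.

```python
import hashlib
import os
import tempfile


def parse_version(text):
    """Turn a dotted version such as '1.51' into a tuple of ints: (1, 51)."""
    return tuple(int(part) for part in text.strip().split("."))


def needs_upgrade(installed, target):
    """True when the installed version is older than the target version."""
    return parse_version(installed) < parse_version(target)


def md5_matches(path, expected_md5, chunk_size=1 << 20):
    """Hash the file in chunks and compare with the expected hex digest."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_md5.lower()


# Usage with a throwaway file standing in for the firmware image
# (the real .scexe file is not available here).
payload = b"not a real firmware image"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    demo_path = tmp.name

expected = hashlib.md5(payload).hexdigest()
if needs_upgrade("1.40", "1.51") and md5_matches(demo_path, expected):
    print("would install firmware from", demo_path)
os.remove(demo_path)
```

Checking the digest before running the installer guards against a truncated or stale firmware file being pushed to every host.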
Flags: needinfo?(tmeyarivan)
Flags: needinfo?(tmeyarivan)
There are updated BIOS images available for these machines but I can't download them since we still don't have HP accounts linked to our service paks.
All available firmware upgrades have been applied to *.peach hosts (using /usr/local/sbin/install_hp_firmware.sh).
node1.peach crashed with the frequently observed message '<0>Kernel panic - not syncing: An NMI occurred, please see the Integrated Management Log for details.' https://access.redhat.com/solutions/707563 suggests engaging HP support to figure out whether there are underlying hardware issues, etc.
Depends on: 1035616
Depends on: 1035897
node22.peach.metrics.scl3.mozilla.com has been up for 45 days
node26.peach.metrics.scl3.mozilla.com has been up for 34 days
node1.peach.metrics.scl3.mozilla.com has been up for 44 days

HP replaced parts on node1. Additionally, we've regained access to the latest BIOS firmware and pushed it out. I don't see any other action to take at this point.
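Uptime figures like the ones above can be read on each Linux host from /proc/uptime, whose first field is the number of seconds since boot. The following is a minimal sketch under that assumption; the function names `uptime_days` and `read_uptime_days` are hypothetical and this is not necessarily how the numbers in this comment were collected.

```python
SECONDS_PER_DAY = 86400


def uptime_days(proc_uptime_text):
    """Convert /proc/uptime content to whole days since boot.

    The first whitespace-separated field is the uptime in seconds.
    """
    seconds = float(proc_uptime_text.split()[0])
    return int(seconds // SECONDS_PER_DAY)


def read_uptime_days(path="/proc/uptime"):
    """Read the uptime file on the local host and return whole days."""
    with open(path) as handle:
        return uptime_days(handle.read())


# Example with a literal string in the /proc/uptime format.
print(uptime_days("3888000.00 7000000.00"), "days")
```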
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Re node26: the power management configuration was changed to "OS control mode". It has been stable since that change, even under heavy load.
Product: mozilla.org → mozilla.org Graveyard