1019834 - Unexplained lockups/reboots of production servers

Reporter

Description

•

11 years ago

This bug exists to discuss and track progress of the investigation into why some of the production servers lock up and need a reboot to become functional again.

Eric Ziegenhorn :ericz

Assignee

Comment 1

•

11 years ago

For what it's worth, both of the recent node* server crashes had unstable tsc clocksource messages such as:

Jun 3 16:03:07 node2.admin.plum.metrics.phx1.mozilla.com kernel: Clocksource tsc unstable (delta = -51539585144 ns). Enable clocksource failover by adding clocksource_failover kernel parameter.
Jun 3 16:03:37 node2.admin.plum.metrics.phx1.mozilla.com kernel: __ratelimit: 16 callbacks suppressed
Jun 3 16:03:37 node2.admin.plum.metrics.phx1.mozilla.com kernel: __ratelimit: 19 callbacks suppressed
Jun 3 16:03:37 node2.admin.plum.metrics.phx1.mozilla.com kernel: __ratelimit: 19 callbacks suppressed
Jun 3 16:03:38 node2.admin.plum.metrics.phx1.mozilla.com kernel: Clocksource tsc unstable (delta = -25769792099 ns). Enable clocksource failover by adding clocksource_failover kernel parameter.
Jun 3 16:03:38 node2.admin.plum.metrics.phx1.mozilla.com kernel: hpsa 0000:0c:00.0: Abort request on C0:B0:T0:L0
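For checking other hosts for the same symptom, here is a minimal Python sketch that pulls the "Clocksource tsc unstable" entries out of syslog lines. It assumes the syslog prefix layout seen in the sample above (month, day, time, host, then "kernel:"); the function name find_unstable_tsc is hypothetical and not part of any existing tooling.

```python
import re

# Matches the kernel message quoted in this comment. The prefix layout
# (month, day, time, host) is an assumption taken from the sample lines.
CLOCKSOURCE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+kernel:\s+"
    r"Clocksource tsc unstable \(delta = (?P<delta_ns>-?\d+) ns\)"
)


def find_unstable_tsc(lines):
    """Return (timestamp, host, delta_seconds) for each matching log line."""
    hits = []
    for line in lines:
        match = CLOCKSOURCE_RE.match(line)
        if match:
            # The kernel reports the delta in nanoseconds; convert to seconds.
            delta_seconds = int(match.group("delta_ns")) / 1e9
            hits.append((match.group("stamp"), match.group("host"), delta_seconds))
    return hits
```

Feeding it an open log file (for example the lines of /var/log/messages, if that is where kernel messages land on these hosts) yields one tuple per unstable-clocksource event, which makes it easy to see whether the message precedes each lockup.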

Eric Ziegenhorn :ericz

Assignee

Comment 2

•

11 years ago

The firmware was quite old on both of these:

Updating: P410i Slot: 0 from [3.50] to [6.40]
New Version: 07/02/2013
Current Version: 09/30/2010

I upgraded them, including the packages, which were more recent, but there were still >50 packages on node3 and >100 packages on node2 to upgrade.

Eric Ziegenhorn :ericz

Assignee

Comment 3

•

11 years ago

Now that the firmware and packages are up to date, if either of these crashes again, we can call HP and have a reasonable chance of getting help. It's good to have that prerequisite out of the way. I'll take this bug and keep an eye on these for a week or so, then we can close it if there are no more problems. The tsc clocksource error returned after the first reboot but not since the upgrades, which is a good sign.

Assignee: server-ops → eziegenhorn

T [:tmary] Meyarivan

Reporter

Comment 4

•

11 years ago

node26.peach.metrics.scl3.mozilla.com crashed/rebooted on 02-Jun
node22.peach.metrics.scl3.mozilla.com crashed/rebooted on 26-May

Summary: Unexplained lockups of production servers → Unexplained lockups/reboots of production servers

T [:tmary] Meyarivan

Reporter

Updated

•

11 years ago

Depends on: 1020561

Eric Ziegenhorn :ericz

Assignee

Comment 5

•

11 years ago

Both node22 and node26.peach.metrics.scl3 need software and firmware upgrades. When is a good time for those, :tmary?

Flags: needinfo?(tmeyarivan)

T [:tmary] Meyarivan

Reporter

Comment 6

•

11 years ago

The latest "Lights Out Management" firmware upgrade (v 1.51) seems relevant, given that many of the nodes rebooted due to an NMI: http://goo.gl/ex2exl

Flags: needinfo?(tmeyarivan)

Peter Radcliffe [:pir]

Comment 7

•

11 years ago

Pushing ilo4 v1.51 out

pir@wedge> svn diff
Index: manifests/init.pp
===================================================================
--- manifests/init.pp (revision 89792)
+++ manifests/init.pp (working copy)
@@ -105,10 +105,10 @@
   'iLO 4': {
     # Patch iLo 4 to the most recent version.
-    if ($::hp_ilo_4_firmware_version < 1.40) {
-      # This provides iLo 4 version 1.40
-      $ilofile_path = "${hpbase}/ilo4/CP020341.scexe"
-      $ilofile_md5 = '9bfa908206cead1ce817c4cb94016a5e'
+    if ($::hp_ilo_4_firmware_version < 1.51) {
+      # This provides iLo 4 version 1.51
+      $ilofile_path = "${hpbase}/ilo4/CP023646.scexe"
+      $ilofile_md5 = 'cd1dd21720f214e15d56c37e15e94366'
     }
   }
pir@wedge> svn ci -m "pushing ilo4 v 1.51 bug 1019834"
Sending manifests/init.pp
Transmitting file data .
Committed revision 89794.

:tmary, :ericz asked when would be a good time to do updates on those machines (other firmware and software) and I don't see an answer to his question...
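As an illustration only, the gate in the manifest change above (update when the installed iLO 4 firmware is older than 1.51, then trust the image only if its MD5 matches) can be sketched in Python. This is not how the Puppet module works internally; the helper names parse_version, needs_update and md5_matches are hypothetical. The sketch compares versions as tuples of integers, an assumption made here because a plain numeric comparison would order a hypothetical 1.9 after 1.10.

```python
import hashlib


def parse_version(text):
    """Turn a dotted version such as '1.51' into a tuple of ints: (1, 51)."""
    return tuple(int(part) for part in text.strip().split("."))


def needs_update(current, target="1.51"):
    """True when the installed version is older than the target version."""
    return parse_version(current) < parse_version(target)


def md5_matches(path, expected_md5):
    """Hash the file in chunks and compare against the expected hex digest."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_md5.lower()
```

With these helpers, needs_update("1.40") is true and needs_update("1.51") is false, which mirrors the condition in the diff; md5_matches would be called with the path of the downloaded .scexe file and the checksum recorded in the manifest.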

Flags: needinfo?(tmeyarivan)

T [:tmary] Meyarivan

Reporter

Updated

•

11 years ago

Flags: needinfo?(tmeyarivan)

Peter Radcliffe [:pir]

Comment 8

•

11 years ago

There are updated BIOS images available for these machines but I can't download them since we still don't have HP accounts linked to our service paks.

T [:tmary] Meyarivan

Reporter

Comment 9

•

11 years ago

All available firmware upgrades have been applied to *.peach hosts (using /usr/local/sbin/install_hp_firmware.sh).

T [:tmary] Meyarivan

Reporter

Comment 10

•

11 years ago

node1.peach crashed with the frequently observed message '<0>Kernel panic - not syncing: An NMI occurred, please see the Integrated Management Log for details.' https://access.redhat.com/solutions/707563 suggests engaging HP support to figure out whether there are underlying hardware issues.

Depends on: 1035616

Eric Ziegenhorn :ericz

Assignee

Updated

•

11 years ago

Depends on: 1035897

Eric Ziegenhorn :ericz

Assignee

Comment 11

•

11 years ago

node22.peach.metrics.scl3.mozilla.com has been up for 45 days
node26.peach.metrics.scl3.mozilla.com has been up for 34 days
node1.peach.metrics.scl3.mozilla.com has been up for 44 days

HP replaced parts on node1. Additionally, we've regained access to the latest BIOS firmware and pushed it out. I don't see any other action to take at this point.

Status: NEW → RESOLVED

Closed: 11 years ago

Resolution: --- → FIXED

T [:tmary] Meyarivan

Reporter

Comment 12

•

11 years ago

Re node26, the power management configuration was changed to "OS control mode" (it has been stable since that change, even under heavy load).

Nobody; OK to take it and work on it

Updated

•

10 years ago

Product: mozilla.org → mozilla.org Graveyard

Categories

(mozilla.org Graveyard :: Server Operations, task)

Tracking

(Not tracked)

People

(Reporter: tmary, Assigned: ericz)

Security

(public)
