Bug 661420 - Upgrade to RHEL 5.7 + upgrade HP Firmware to prevent hangs on the DL 360 machines
Status: RESOLVED FIXED
Product: mozilla.org Graveyard
Classification: Graveyard
Component: Server Operations
: other
: x86 Other
: -- major
: ---
Assigned To: Ben Kero [:bkero]
: matthew zeier [:mrz]
Mentors:
Duplicates: 660927 673928
Depends on:
Blocks:
Reported: 2011-06-01 16:36 PDT by Dumitru Gherman [:dumitru]
Modified: 2015-03-12 08:17 PDT
thereallove: needs‑downtime+
See Also:
QA Whiteboard:
Iteration: ---
Points: ---


Description Dumitru Gherman [:dumitru] 2011-06-01 16:36:19 PDT
This bug tracks the firmware upgrades for the RAID controllers that Nagios is alerting on.
Current list is here:

https://mana.mozilla.org/wiki/display/~dgherman@mozilla.com/HP+RAID+controllers+that+need+firmware+upgrades#
Comment 2 Dumitru Gherman [:dumitru] 2011-06-02 14:57:46 PDT
*** Bug 660927 has been marked as a duplicate of this bug. ***
Comment 3 Dumitru Gherman [:dumitru] 2011-06-17 14:40:23 PDT
CC'ing Ravi on this to see when we can squeeze in a ringring reboot.
Comment 4 Shyam Mani [:fox2mike] 2011-07-26 00:10:33 PDT
This is a blocker since machines are hard locking because of a combination of this + RHEL kernel weirdness. Assigning to current oncall.
Comment 5 Shyam Mani [:fox2mike] 2011-07-26 00:39:01 PDT
To summarize, the plan of action (POA) on the crashing machines is as follows:

1) Upgrade to RHEL 5.7, reboot.
2) Upgrade the HP Firmware for the RAID controller.

Dave, correct me here if needed.

Ashish, 1) should be simple enough: either run yum -y update and reboot, or log in to boris as root, forward keys, and run rebootrhel <hostname> and let that script do the magic.

Phong should have updated you with instructions for 2.
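
As a rough sketch of step 1 above, the two options could be written out as command lists in Python. This is only an illustration, not the procedure actually used: reaching the host over ssh as root is an assumption, upgrade_plan and render are hypothetical helper names, and rebootrhel <hostname> is taken as written in this comment. Step 2 is left out because its instructions are not part of this bug.

```python
import shlex


def upgrade_plan(hostname, use_reboot_script=False):
    """Return the commands for step 1 (RHEL update + reboot) as argv lists.

    Nothing is executed here; the lists are only built for review.
    """
    if use_reboot_script:
        # Option B: run from boris as root with keys forwarded.
        # The `rebootrhel <hostname>` form is quoted from the comment;
        # what the script does internally is not known here.
        return [["rebootrhel", hostname]]
    # Option A: plain update followed by a reboot.
    # ASSUMPTION: the host is reached over ssh as root.
    target = f"root@{hostname}"
    return [
        ["ssh", target, "yum -y update"],
        ["ssh", target, "reboot"],
    ]


def render(plan):
    """Format each argv list as one shell-safe command line."""
    return [" ".join(shlex.quote(arg) for arg in argv) for argv in plan]


if __name__ == "__main__":
    # Dry run: print the commands instead of running them.
    for line in render(upgrade_plan("dm-peep01")):
        print(line)
```

Building the commands as lists and printing them first keeps this a dry run, which suits hosts that need a scheduled downtime before anything is rebooted.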
Comment 6 Ashish Vijayaram [:ashish] 2011-07-26 03:20:15 PDT
boris and dm-peep01 upgraded. The remaining machines in the mana page are production hosts and we'll have to schedule downtime with corresponding teams.
Comment 7 Justin Dow [:jabba] 2011-07-26 05:28:09 PDT
pm-web02 and pm-web03 can be done one at a time, since they are part of a redundant cluster.

The three socorro hosts on there are mostly used for dev work and could be done whenever with a quick heads-up announcement in #breakpad. However, they are all running RHEL 6, so I'm not sure the kernel issue applies to them. I'd still push to do the firmware upgrades though.
Comment 8 Justin Dow [:jabba] 2011-07-26 05:29:15 PDT
I mean the *5* socorro hosts :)

In fact, now would be a good time to do those, since everyone who might use them is probably asleep.
Comment 9 Ashish Vijayaram [:ashish] 2011-07-26 06:15:45 PDT
Added a list of the controller firmware versions known so far to https://mana.mozilla.org/wiki/display/SYSADMIN/Hardware+Issues (scroll to the end)
Comment 10 Zandr Milewski [:zandr] 2011-07-26 12:44:45 PDT
cc: coop so he can see details for the downtime announcement.
Comment 11 Chris Cooper [:coop] [away until Aug 29] 2011-07-26 14:14:33 PDT
The downtime has been announced:

http://groups.google.com/group/mozilla.dev.planning/browse_thread/thread/036a7b39b6059a7d#

As of now, we are a GO for the downtime tonight at 6pm PDT. If I hear otherwise, I'll update this bug.

I'll be responsible for closing trees on the releng side.
Comment 12 Ben Kero [:bkero] 2011-07-26 15:39:42 PDT
All machines have been taken care of besides:

tm-c01-master01
dm-svn01
dm-svn02
dm-webtools04
ringring.mv


These machines are slated to be taken care of during tonight's scheduled downtime.
Comment 13 Ben Kero [:bkero] 2011-07-27 02:47:55 PDT
All systems upgraded besides tm-c01-master01, which is a database for a whole lot of high profile sites (also a single point of failure). We'll need to coordinate a downtime window for this bad boy.
Comment 14 Justin Dow [:jabba] 2011-07-27 06:58:20 PDT
*** Bug 673928 has been marked as a duplicate of this bug. ***
Comment 15 Corey Shields [:cshields] 2011-07-28 03:36:08 PDT
(In reply to comment #13)
> All systems upgraded besides tm-c01-master01, which is a database for a
> whole lot of high profile sites (also a single point of failure). We'll
> need to coordinate a downtime window for this bad boy.

tm-c01-master01 (and tm-c01-slave01) were brought up to date this morning during an outage on tm-c01-slave01.
Comment 16 Justin Dow [:jabba] 2011-07-28 04:50:04 PDT
That was the last one on the list.
