This bug tracks the upgrades needed to clear the RAID controller alerts that Nagios is reporting.
Current list is here:
*** Bug 660927 has been marked as a duplicate of this bug. ***
CC'ing Ravi on this to see when we can squeeze in a ringring reboot.
This is a blocker since machines are hard locking because of a combination of this + RHEL kernel weirdness. Assigning to current oncall.
To summarize, the plan of action (POA) on the crashing machines is as follows:
1) Upgrade to RHEL 5.7, reboot.
2) Upgrade the HP Firmware for the RAID controller.
Dave, correct me if it's needed here.
Ashish, 1) should be simple enough: either run yum -y update and reboot, or log in to boris as root, forward keys, run rebootrhel <hostname>, and let that script do the magic.
Phong should have updated you with instructions for 2.
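For reference, step 1 above could be sketched as a small Python helper that only builds the commands and does not run them. This is a rough sketch under assumptions: the `ssh root@<hostname>` form is illustrative and not taken from this bug, and `rebootrhel` is treated as an opaque script whose internals are not described here.

```python
import shlex


def step1_commands(hostname, use_rebootrhel=False):
    """Return the commands for step 1 (RHEL update + reboot) as argv lists.

    Nothing is executed here; the caller decides how to run them.
    """
    if use_rebootrhel:
        # Meant to be run on boris as root with keys forwarded.
        return [["rebootrhel", hostname]]
    # Assumption: direct root ssh access to the target host.
    return [
        ["ssh", f"root@{hostname}", "yum -y update"],
        ["ssh", f"root@{hostname}", "reboot"],
    ]


def render(commands):
    """Render argv lists as shell-quoted strings for review before running."""
    return [" ".join(shlex.quote(arg) for arg in argv) for argv in commands]
```

Printing `render(step1_commands("dm-peep01"))` gives the lines to review before anything touches a host. The firmware upgrade in step 2 is left out on purpose, since its instructions are not in this bug.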
boris and dm-peep01 upgraded. The remaining machines in the mana page are production hosts and we'll have to schedule downtime with corresponding teams.
pm-web02 and pm-web03 can be done one at a time, since they are part of a redundant cluster.
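Doing the redundant pair one at a time could look roughly like the Python sketch below. The `upgrade` and `is_healthy` callables are hypothetical placeholders (no such helpers are named in this bug); the point is only that the second host is not touched until the first is confirmed back.

```python
def rolling_upgrade(hosts, upgrade, is_healthy):
    """Upgrade hosts one at a time, stopping at the first unhealthy host.

    Returns (hosts_completed, failed_host); failed_host is None on success.
    """
    done = []
    for host in hosts:
        upgrade(host)
        if not is_healthy(host):
            # Stop here so the remaining cluster members keep serving.
            return done, host
        done.append(host)
    return done, None
```

For example, `rolling_upgrade(["pm-web02", "pm-web03"], upgrade, is_healthy)` would leave pm-web03 alone if pm-web02 did not come back cleanly.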
The three socorro hosts on there are mostly used for dev work and could be done whenever with a quick heads-up announcement in #breakpad. However, they are all running RHEL 6, so I'm not sure the issue affects them. I'd still push to do the firmware upgrades though.
I mean the *5* socorro hosts :)
In fact, now would be a good time to do those, since everyone who might use them is probably asleep.
Added a list of the controller firmware versions known so far to https://mana.mozilla.org/wiki/display/SYSADMIN/Hardware+Issues (scroll to the end)
cc: coop so he can see details for the downtime announcement.
The downtime has been announced:
As of now, we are a GO for the downtime tonight at 6pm PDT. If I hear otherwise, I'll update this bug.
I'll be responsible for closing trees on the releng side.
All machines have been taken care of besides:
These machines are slated to be taken care of tonight during the scheduled downtime.
All systems upgraded besides tm-c01-master01, which is a database for a whole lot of high-profile sites (also a single point of failure). We'll need to coordinate a downtime window for this bad boy.
*** Bug 673928 has been marked as a duplicate of this bug. ***
(In reply to comment #13)
> all systems upgraded beside stm-c01-master01, which is a database for a
> whole lot of high profile sites (also a single point of failure). We'll
> need to coordinate a downtime window for this bad boy.
tm-c01-master01 (and tm-c01-slave01) were brought up to date this morning during an outage on tm-c01-slave01.
That was the last one on the list.