Closed Bug 1710437 Opened 4 years ago Closed 3 years ago

hg* host package updates

Categories

(Developer Services :: General, task)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dhouse, Assigned: dhouse)

Details

Applying kernel+package updates to the hg* hosts (including a reboot).

Discussed in a Google doc (https://docs.google.com/document/d/1lVhzN_iE0OE3H4cEsTPhvEXPlQ3C8CP5AjLfp3QZjHM/edit#heading=h.7iw6jaybm7rp) and over email.

I found that there were no kernel upgrades waiting on hgssh{2,3}; both were at 3.10.0-862.9.1. I rebooted hgssh3 to verify, and it came back up into the same kernel (/etc/grub2.cfg already showed this before the reboot, but I tested a reboot anyway).

I'm applying the latest updates to hgssh3 now and will reboot it (downtiming it in Nagios this time; Nagios alerts went to Slack during the first reboot). New kernel: 3.10.0-1160.25.1.

hgweb1 shows the same 3.10.0-1160.25.1 ready to apply at reboot (it is the default menu option in /etc/grub2.cfg), and sudo yum-wrapper update reports no updates available (auto-update would already have applied them).
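
For reference, a minimal way to check which kernel is queued for the next boot and whether package updates are pending (a sketch using standard CentOS 7 tooling; yum-wrapper is our local wrapper around yum, plain yum behaves the same for a dry check):

sudo grubby --default-kernel        # kernel grub will boot next, e.g. /boot/vmlinuz-3.10.0-1160.25.1.el7.x86_64
grep -m1 '^menuentry' /etc/grub2.cfg    # or read the first menu entry from the generated grub config
rpm -q kernel                        # installed kernel packages
uname -r                             # kernel currently running
sudo yum check-update                # pending updates; exit status 100 means updates are available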

hgssh3 updates completed and I'm rebooting it.

I've removed hgweb1 from hgssh1:/repo/hg/pushdataaggregator_groups (waiting ~10m for the replication pool to rebalance).
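
For illustration, removing a host from the aggregator groups file could look like the following (a sketch only; I'm assuming the file references hosts by name, one entry per line -- verify the actual format before editing):

# On hgssh1: back up, then drop hgweb1 from the aggregator groups file.
sudo cp /repo/hg/pushdataaggregator_groups /repo/hg/pushdataaggregator_groups.bak
sudo sed -i '/hgweb1/d' /repo/hg/pushdataaggregator_groups
grep hgweb1 /repo/hg/pushdataaggregator_groups || echo "hgweb1 removed"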

hgssh3 has two warnings in nagios after the reboot:

WARNING 0002: POST Error: 1705-Slot X Drive Array - Please replace Cache Module Super-Cap. Caching will be enabled once Super-Cap has been replaced and charged.

NTP WARNING: Server has the LI_ALARM bit set, Offset -6e-05 secs (WARNING)

The drive-cache warning suggests we may need to check or replace the cache module's super-capacitor.

I'm applying updates on hgssh2 now.

hgweb1 has been out of the replication pool for >15m. I'm draining it (10.48.74.51) from the load-balancer pool.

hgssh2 updates completed (rebooting now; updated downtime until :45)

hgssh2 rebooted and shows the new kernel.
No POST drive-cache warning on it, but it shows "OK 0008: Firmware flashed (iLO 4 2.73)".

hgweb1 is fully drained from the load-balancer: no connections shown in https://zlb1.external.ops.mdc1.mozilla.com:9090/apps/zxtm/?section=Draining, and on hgweb1, sudo netstat -plan|grep :80|grep ESTAB showed no established connections.
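
For reference, the drain check above can be looped until the node is idle (a sketch; assumes the HTTP traffic is on port 80 as in the netstat check above):

# Loop until there are no ESTABLISHED connections to :80, then it's safe to reboot.
while sudo netstat -plan | grep :80 | grep -q ESTAB; do
    echo "$(date): still draining..."
    sleep 30
done
echo "no established connections on :80; safe to reboot"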
I'm rebooting hgweb1 (confirmed new kernel in /etc/grub2.cfg, 3.10.0-1160.25.1).

Re: discussion with :gcox in https://chat.mozilla.org/#/room/#vcs:mozilla.org,
I'm applying the firmware update to hgssh3 to see whether it resolves the POST error (drive cache?):
[dhouse@hgssh3.dmz.mdc1 ~]$ sudo /usr/local/sbin/install_hp_firmware.sh

The hgssh3 firmware update completed:

[dhouse@hgssh3.dmz.mdc1 ~]$ sudo /usr/local/sbin/install_hp_firmware.sh

DO NOT answer yes to any of the reboot questions individual updates
ask you. You can end up rebooting during the next update and bricking
the machine. Install all the updates and then reboot manually.
Press return to continue.

=== Running :/usr/lib/i386-linux-gnu/firmware-ilo4-2.74-1.1/./hpsetup

FLASH_iLO4 v1.17 for Linux (Sep 30 2015)
(C) Copyright 2002, 2015 Hewlett-Packard Enterprise Development Company, L.P.
Firmware image: ilo4_274.bin
Current iLO 4 firmware version  2.73; Serial number ILOUSE310XFY1      

Component XML file: CP043715.xml
CP043715.xml reports firmware version 2.74
This operation will update the firmware on the
iLO 4 in this server with version 2.74.
Continue (y/N)?y
Current firmware is  2.73 (Feb 11 2020 00:00:00)
Firmware image is 0x1001b1c(16784156) bytes
Committing to flash part...        
******** DO NOT INTERRUPT! ********
Flashing is underway... 100 percent programmed. /     
Succeeded.
***** iLO 4 reboot in progress (may take more than 60 seconds.)
***** Please ignore console messages, if any.
iLO 4 reboot completed.

REMINDER. The login warning will remain until puppet has run after a reboot.

[dhouse@hgssh3.dmz.mdc1 ~]$ 

I'll reboot it as a partial test (and to check the POST message).

hgweb1 rebooted and looks good: Nagios alerts recovered, it is restored into the load-balancer and vcs-replication pools, and connections are active.
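
For reference, a quick way to confirm a restored node is serving traffic again (a sketch; the port list and the localhost spot-check are assumptions about the hgweb setup):

# Count established connections on the web ports; a non-zero count means the
# load-balancer is sending traffic to the node again.
sudo netstat -plan | grep -E ':(80|443) ' | grep -c ESTAB
# Spot-check that httpd answers locally (hypothetical local check).
curl -sI http://localhost/ | head -n1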
hgweb2 is drained and removed from vcs-replication. It drained quickly and shows no connections, no pending updates, and the new kernel ready to apply at reboot (from 3.10.0-1160.11.1 to 3.10.0-1160.25.1).

hgweb2 shows the correct kernel, 3.10.0-1160.25.1, after the reboot.
I'm restoring it to vcs-replication, waiting a few minutes, and then restoring it to the load-balancer pool.

hgweb2 is restored to the load-balancer.
I'm draining hgweb3; I've removed it from the vcs-replicator pool and downtimed it.
hgweb3 shows kernel 3.10.0-1062.9.1 active, with 3.10.0-1160.25.1 queued.

hgweb3 drained. Rebooting.

hgweb3 came back with the correct kernel, 3.10.0-1160.25.1, and I've restored it to the vcs-replicator pool (waiting 5m and then restoring it to the load-balancer).

hgweb4 is draining from the load-balancer and has been removed from the vcs-replicator pool.

Rebooted hgweb4 (no connections left, downtimed in Nagios, no pending updates, new kernel 3.10.0-862.9.1 -> 3.10.0-1160.25.1 ready to apply at reboot).
It came up after the reboot with the correct kernel and is restored to the vcs-replicator pool (waiting 5m, then restoring to the load-balancer).

hgweb4 restored into the load-balancer pool and shows connections

hgssh1's update (packages, and kernel 3.10.0-862.3.3 -> 3.10.0-1160.25.1) is scheduled during Thursday's TCW, after the pulse update, after 15:30 UTC.

Following discussion in https://chat.mozilla.org/#/room/#sheriffs:mozilla.org and https://app.slack.com/client/T027LFU12/CPA5S0H4H, the pulse update completed and tasks+pushes are going through correctly again.
The firewall work is underway; when it is complete (still aiming for 15:30 UTC, in ~9m), I'll update and reboot hgssh1.

On hgssh1, I applied the updates and rebooted.

hgssh1 shows the new kernel:

[dhouse@hgssh1.dmz.mdc1 ~]$ uname -srn
Linux hgssh1.dmz.mdc1.mozilla.com 3.10.0-1160.25.1.el7.x86_64
[dhouse@hgssh1.dmz.mdc1 ~]$ cat /etc/redhat-release 
CentOS Linux release 7.9.2009 (Core)

Pushes and replication recovered and were fine after the upgrade.
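
As a final smoke test (a sketch; any small public repo on hg.mozilla.org would do, the repo and path below are just an example), one can verify HTTP serving with a clone and pull:

# Clone a small public repo and pull again to confirm the hgweb nodes are serving correctly.
hg clone https://hg.mozilla.org/hgcustom/version-control-tools /tmp/vct-smoketest
hg -R /tmp/vct-smoketest pull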

Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED