hg* host package updates
Categories
(Developer Services :: General, task)
Tracking
(Not tracked)
People
(Reporter: dhouse, Assigned: dhouse)
Details
Applying kernel+package updates to the hg* hosts (including a reboot).
Discussed in a Google doc (https://docs.google.com/document/d/1lVhzN_iE0OE3H4cEsTPhvEXPlQ3C8CP5AjLfp3QZjHM/edit#heading=h.7iw6jaybm7rp) and by email.
I found that there were no kernel upgrades waiting on hgssh{3,2}; both were at 3.10.0-862.9.1. I rebooted hgssh3 to verify, and it came up into the same kernel (grub2.cfg already showed this before the reboot, but I still tested a reboot).
I'm getting the latest updates onto hgssh3 now and will reboot, downtiming it in nagios this time (nagios alerts went to slack during the first reboot). New kernel: 3.10.0-1160.25.1
hgweb1 shows the same 3.10.0-1160.25.1 ready to apply at reboot (default menu option in /etc/grub2.cfg), and sudo yum-wrapper update reports no updates available (autoupdate would already have applied them).
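The "kernel ready to apply at reboot" check above can be sketched roughly as follows. This is only an illustration: it assumes grubby is installed (as on stock CentOS 7) and that grubby --default-kernel prints a path of the form /boot/vmlinuz-<release>. The function name reboot_needed is hypothetical and is not something present on the hg hosts.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds (exit 0) when the default boot kernel
# differs from the running one, i.e. a reboot would change kernels.
#   $1: running release, as printed by `uname -r`
#   $2: default kernel path, as printed by `grubby --default-kernel`
reboot_needed() {
    local running="$1"
    # Strip the directory and the "vmlinuz-" prefix to get the bare release.
    local default_release="${2##*/vmlinuz-}"
    [ "$running" != "$default_release" ]
}

# Example use on a host (assumes grubby is available):
#   if reboot_needed "$(uname -r)" "$(sudo grubby --default-kernel)"; then
#       echo "new kernel queued for next boot"
#   fi
```

Taking both values as arguments keeps the comparison separate from the commands that produce them, so it can be tried with sample strings before running anything on a production host.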
hgssh3 updates completed and I'm rebooting it.
I've removed hgweb1 from the hgssh1:/repo/hg/pushdataaggregator_groups (waiting ~10m for the replication pool to rebalance)
hgssh3 has two warnings in nagios after the reboot:
WARNING 0002: POST Error: 1705-Slot X Drive Array - Please replace Cache Module Super-Cap. Caching will be enabled once Super-Cap has been replaced and charged.
NTP WARNING: Server has the LI_ALARM bit set, Offset -6e-05 secs (WARNING)
The drive cache warning suggests we may need to check or replace the cache module's Super-Cap.
I'm applying updates on hgssh2 now.
hgweb1 has been out of the replication pool for >15m. I'm draining it (10.48.74.51) from the load-balancer pool.
hgssh2 updates completed (rebooting now; updated downtime until :45)
hgssh2 rebooted and shows the new kernel.
There is no POST drive cache warning on it, but it shows "OK 0008: Firmware flashed (iLO 4 2.73)".
hgweb1 is fully drained from the load-balancer: no connections are shown in https://zlb1.external.ops.mdc1.mozilla.com:9090/apps/zxtm/?section=Draining, and on hgweb1, sudo netstat -plan|grep :80|grep ESTAB showed no connections.
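For reference, a slightly stricter version of the netstat check could look like the sketch below. It is a hedged variant rather than the command actually used here: a plain grep :80 also matches local ports such as :8080 and outbound connections to a remote port 80, so this version only counts rows whose local-address column ends in :80. The function name count_established_http is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `netstat -plan` style output on stdin and prints
# the number of ESTABLISHED TCP connections whose local port is exactly 80.
# Expected columns for TCP rows:
#   Proto Recv-Q Send-Q Local-Address Foreign-Address State PID/Program
count_established_http() {
    awk '$1 ~ /^tcp/ && $4 ~ /:80$/ && $6 == "ESTABLISHED" { n++ } END { print n + 0 }'
}

# Example use on a host:
#   sudo netstat -plan | count_established_http
```

A result of 0 corresponds to the "no connections" state that was waited for before rebooting.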
I'm rebooting hgweb1 (confirmed new kernel in /etc/grub2.cfg, 3.10.0-1160.25.1).
re: discussion with :gcox in https://chat.mozilla.org/#/room/#vcs:mozilla.org,
I'm applying the firmware update to hgssh3 to see if that resolves the POST error (drive cache?)
[dhouse@hgssh3.dmz.mdc1 ~]$ sudo /usr/local/sbin/install_hp_firmware.sh
Comment 10•4 years ago
the hgssh3 firmware update completed
[dhouse@hgssh3.dmz.mdc1 ~]$ sudo /usr/local/sbin/install_hp_firmware.sh
DO NOT answer yes to any of the reboot questions individual updates
ask you. You can end up rebooting during the next update and bricking
the machine. Install all the updates and then reboot manually.
Press return to continue.
=== Running :/usr/lib/i386-linux-gnu/firmware-ilo4-2.74-1.1/./hpsetup
FLASH_iLO4 v1.17 for Linux (Sep 30 2015)
(C) Copyright 2002, 2015 Hewlett-Packard Enterprise Development Company, L.P.
Firmware image: ilo4_274.bin
Current iLO 4 firmware version 2.73; Serial number ILOUSE310XFY1
Component XML file: CP043715.xml
CP043715.xml reports firmware version 2.74
This operation will update the firmware on the
iLO 4 in this server with version 2.74.
Continue (y/N)?y
Current firmware is 2.73 (Feb 11 2020 00:00:00)
Firmware image is 0x1001b1c(16784156) bytes
Committing to flash part...
******** DO NOT INTERRUPT! ********
Flashing is underway... 100 percent programmed. /
Succeeded.
***** iLO 4 reboot in progress (may take more than 60 seconds.)
***** Please ignore console messages, if any.
iLO 4 reboot completed.
REMINDER. The login warning will remain until puppet has run after a reboot.
[dhouse@hgssh3.dmz.mdc1 ~]$
I'll reboot it as a partial test (and for the POST message check)
Comment 11•4 years ago
hgweb1 rebooted and looks good: nagios alerts recovered, restored into the load-balancer and vcs-replication pools, and connections are active
hgweb2 is drained and removed from vcs-replication. It drained quickly and shows no connections, no pending updates, and the new kernel ready to apply at reboot (from 3.10.0-1160.11.1 to 3.10.0-1160.25.1).
Comment 12•4 years ago
hgweb2 shows correct kernel, 3.10.0-1160.25.1, after reboot
restoring to vcs-replication, waiting a few minutes, and then restoring to load-balancer pool
Comment 13•4 years ago
hgweb2 is restored to the load-balancer.
I'm draining hgweb3, and I've removed it from the vcs-replicator pool and downtimed it.
hgweb3 shows kernel 3.10.0-1062.9.1 active, 3.10.0-1160.25.1 queued.
Comment 14•4 years ago
hgweb3 drained. rebooting
Comment 15•4 years ago
hgweb3 came back with the correct kernel, 3.10.0-1160.25.1, and I've restored it in the vcs replicator pool (waiting 5m and then restoring in the load balancer)
Comment 16•4 years ago
hgweb4 draining from load-balancer and removed from vcs-replicator pool.
Comment 17•4 years ago
rebooted hgweb4: no connections left, downtimed in nagios, no updates, new kernel is ready for reboot 3.10.0-862.9.1 -> 3.10.0-1160.25.1
came up after reboot, correct kernel. restored in vcs replicator pool (waiting 5m then load-balancer restore)
Comment 18•4 years ago
hgweb4 restored into the load-balancer pool and shows connections
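The per-host sequence followed for hgweb1 through hgweb4 above can be summarized as a dry-run checklist. The sketch below only prints the steps and runs nothing; the function name print_reboot_plan is hypothetical, and the exact ordering of the nagios downtime step is an approximation of what the comments describe.

```shell
#!/usr/bin/env bash
# Hypothetical dry-run helper: prints the per-host sequence followed in this
# bug for the hgweb hosts. It only echoes the steps; it executes nothing.
#   $1: host name
#   $2: kernel version expected after the reboot
print_reboot_plan() {
    local host="$1" kernel="$2"
    cat <<EOF
1. remove $host from hgssh1:/repo/hg/pushdataaggregator_groups and wait for the replication pool to rebalance
2. drain $host from the load-balancer pool
3. confirm no ESTABLISHED connections remain on $host port 80
4. downtime $host in nagios
5. confirm no package updates are pending and $kernel is the default boot kernel
6. reboot $host
7. verify $host is running $kernel
8. restore $host to the vcs replication pool and wait about 5 minutes
9. restore $host to the load-balancer pool and confirm connections return
EOF
}

# Example:
#   print_reboot_plan hgweb3 3.10.0-1160.25.1
```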
Comment 19•4 years ago
hgssh1's update (packages, and kernel 3.10.0-862.3.3 -> 3.10.0-1160.25.1) is scheduled during the TCW on Thursday, after the pulse update, after 15:30 UTC
Comment 20•3 years ago
following discussion in https://chat.mozilla.org/#/room/#sheriffs:mozilla.org and https://app.slack.com/client/T027LFU12/CPA5S0H4H, the pulse update part completed and tasks+pushes are going through correctly again.
the firewall work is being done, and when that is complete (still aiming for 15:30 UTC, in 9m) I'll update and reboot hgssh1
Comment 21•3 years ago
on hgssh1, I applied updates and rebooted
Comment 22•3 years ago
hgssh1 shows the new kernel:
[dhouse@hgssh1.dmz.mdc1 ~]$ uname -srn
Linux hgssh1.dmz.mdc1.mozilla.com 3.10.0-1160.25.1.el7.x86_64
[dhouse@hgssh1.dmz.mdc1 ~]$ cat /etc/redhat-release
CentOS Linux release 7.9.2009 (Core)
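The post-reboot verification shown above could be scripted along these lines. This is a minimal sketch, assuming the uname -srn output format seen in this comment (kernel name, node name, kernel release); the function name kernel_is is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: given one line of `uname -srn` output and an expected
# kernel version, succeeds when the reported release is that version,
# optionally followed by a dot-separated suffix such as ".el7.x86_64".
#   $1: a line like "Linux <node name> <kernel release>"
#   $2: expected version, e.g. 3.10.0-1160.25.1
kernel_is() {
    local line="$1" expected="$2" release
    release="$(printf '%s\n' "$line" | awk '{ print $3 }')"
    case "$release" in
        # Requiring a "." after the version avoids treating 25.10 as 25.1.
        "$expected"|"$expected".*) return 0 ;;
        *)                         return 1 ;;
    esac
}

# Example use on a host:
#   kernel_is "$(uname -srn)" 3.10.0-1160.25.1 && echo "kernel OK"
```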
Comment 23•3 years ago
pushes and replication recovered and were fine after the upgrade.