Closed Bug 734728 Opened 12 years ago Closed 12 years ago

scl1 ganeti cluster having issues

Categories

(mozilla.org Graveyard :: Server Operations, task)

x86
macOS
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: nthomas, Assigned: bkero)

Details

bm12 lost its head a bit before 00:45 Sunday and started using a lot of CPU resources. It was impacting other VMs on kvm1 in SCL1, so bkero restarted the VM; buildbot and associated jobs restarted automatically on boot.

Since then we have
 buildbot-master12.build.scl1:Command Queue is UNKNOWN: Unhandled exception
 buildbot-master12.build.scl1:Pulse Queue is UNKNOWN: Unhandled exception
from nagios, but the commands defined in /etc/nagios/nrpe.d work fine at the prompt (as cltbld and root). 
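For the record, this is roughly how the two paths were compared (the check name and plugin path below are placeholders; the real ones live in the nrpe.d config):

# on bm12 itself: see what nrpe would run, then run it by hand
grep -h '^command\[' /etc/nagios/nrpe.d/*.cfg
sudo -u cltbld /path/to/check_command_queue.sh      # placeholder path, taken from the matching command[] line
# from the nagios host: exercise the same check through nrpe
/usr/lib64/nagios/plugins/check_nrpe -H buildbot-master12.build.scl1.mozilla.com -c check_command_queue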

And these in the system log:
# grep lockup /var/log/messages
Mar 11 03:32:25 buildbot-master12 kernel: BUG: soft lockup - CPU#0 stuck for 32s! [swapper:0]
Mar 11 03:32:25 buildbot-master12 kernel: BUG: soft lockup - CPU#1 stuck for 32s! [makewhatis:10916]
Mar 11 03:34:24 buildbot-master12 kernel: BUG: soft lockup - CPU#0 stuck for 15s! [swapper:0]
Mar 11 03:34:25 buildbot-master12 kernel: BUG: soft lockup - CPU#1 stuck for 15s! [makewhatis:6656]
plus register info and call traces. 

The buildbot master is disabled in slavealloc and stopped pending investigation.
Summary: buildbot-master12 → buildbot-master12 is unwell
buildbot-master04 has lockups going back to 9:42pm on Saturday

Mar 10 21:42:00 buildbot-master04 kernel: BUG: soft lockup - CPU#1 stuck for 10s! [swapper:0]
Mar 10 21:42:04 buildbot-master04 kernel: BUG: soft lockup - CPU#0 stuck for 10s! [buildbot:14011]
Mar 10 21:42:23 buildbot-master04 kernel: BUG: soft lockup - CPU#1 stuck for 11s! [swapper:0]
Mar 10 21:42:23 buildbot-master04 kernel: BUG: soft lockup - CPU#0 stuck for 12s! [swapper:0]
Mar 10 21:46:02 buildbot-master04 kernel: BUG: soft lockup - CPU#0 stuck for 33s! [swapper:0]
Mar 10 21:46:02 buildbot-master04 kernel: BUG: soft lockup - CPU#1 stuck for 33s! [buildbot:13355]

and if I look harder at buildbot-master12, its lockups go back to the same time too (82 events in total). slavealloc is also on this KVM host.

The common element is kvm1.infra.scl1, over to Server Ops: Releng.
Assignee: nobody → server-ops-releng
Severity: normal → major
Component: Release Engineering → Server Operations: RelEng
QA Contact: release → arich
Summary: buildbot-master12 is unwell → kvm1.infra.scl1 unwell ?
The ganeti cluster in scl1 is having issues beyond our ability to diagnose.  SSH to the master node timed out a couple of times before succeeding, trying to do a cluster-verify times out, nodes have errors in their logs, etc.  Please page someone with ganeti knowledge to look at this ASAP, since all critical releng infra is on this cluster.
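For whoever picks this up, the usual first-pass ganeti health checks, run from the master node, look roughly like this (exact output varies with the ganeti version; cluster-verify is the one that was timing out here):

gnt-cluster getmaster    # confirm which node believes it is the master
gnt-cluster verify       # full cluster consistency check
gnt-node list            # per-node memory/disk totals and instance counts
gnt-instance list        # instance states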
Assignee: server-ops-releng → server-ops
Severity: major → blocker
Component: Server Operations: RelEng → Server Operations
QA Contact: arich → phong
Summary: kvm1.infra.scl1 unwell ? → scl1 ganeti cluster having issues
Assignee: server-ops → dgherman
Assignee: dgherman → bkero
Closed the trees around 09:45.
We have been looking at this all morning and have yet to identify the cause of the issue.  Investigation continues.
(In reply to Nick Thomas [:nthomas] from comment #0)
> Since then we have
>  buildbot-master12.build.scl1:Command Queue is UNKNOWN: Unhandled exception
>  buildbot-master12.build.scl1:Pulse Queue is UNKNOWN: Unhandled exception
> from nagios, but the commands defined in /etc/nagios/nrpe.d work fine at the
> prompt (as cltbld and root). 

This is a bug for RelEng to fix, related to permission modes on some directories; bug 734764 tracks that.
Trees reopened at 14:30.
tl;dr: All VMs are back online.  There should be no continued unavailability.

The issue has been contained.  I was unable to determine at an application level what had caused the issues, but from a system standpoint this is what happened.

The master node (kvm2) was having problems communicating with other nodes. TCP connections were being accepted, but the SSL handshake (OpenSSL) was timing out. As a result, the master was unable to communicate with the rest of the nodes in the cluster.

Meanwhile, on kvm1, the buildbot-master12 process was consuming an immense amount of CPU time.

On kvm3, another VM had run away and raised the loadavg above 50, eventually making the host completely unresponsive, including its out-of-band management. In response, we have migrated all of kvm3's VMs (including scl1's nagios VM) to other nodes in the cluster.

I am currently en route to scl1 to power cycle kvm3 and bring it back online manually.

The fact that (seemingly) one virtual machine could have this devastating effect on the cluster disturbs me. Once I've taken care of kvm3 and had a chance to investigate how this happened, I would like to determine what can be done to prevent it from happening again.
To add to that:

All of the vms have been distributed off of kvm3 and are on the other hosts for the time being, but having too many buildbot masters on one host often means that we'll see performance problems, so be cognizant of that (kvm4 is a bit loaded with kvm3 down right now):

kvm1.infra.scl1.mozilla.com buildbot-master12.build.scl1.mozilla.com
kvm1.infra.scl1.mozilla.com buildbot-master4

kvm2.infra.scl1.mozilla.com buildbot-master16.build.scl1.mozilla.com
kvm2.infra.scl1.mozilla.com buildbot-master6.build.scl1.mozilla.com

kvm4.infra.scl1.mozilla.com buildbot-master11.build.scl1.mozilla.com
kvm4.infra.scl1.mozilla.com buildbot-master13.build.scl1.mozilla.com
kvm4.infra.scl1.mozilla.com buildbot-master14.build.scl1.mozilla.com
kvm4.infra.scl1.mozilla.com buildbot-master15.build.scl1.mozilla.com
kvm4.infra.scl1.mozilla.com buildbot-master17.build.scl1.mozilla.com
kvm4.infra.scl1.mozilla.com buildbot-master18.build.scl1.mozilla.com

kvm5.infra.scl1.mozilla.com buildbot-master25.build.scl1.mozilla.com

kvm6.infra.scl1.mozilla.com buildbot-master21.build.scl1.mozilla.com
kvm6.infra.scl1.mozilla.com buildbot-master23.build.scl1.mozilla.com
kvm6.infra.scl1.mozilla.com buildbot-master24.build.scl1.mozilla.com
We also added "acpi=off noapic" to /etc/grub.conf on some vms based on Google searches, but it's not clear that this will prevent future issues.
We experienced additional issues today with buildbot-master04.  I have added "acpi=off noapic" to /etc/grub.conf and moved bm4 off to a different kvm host in case we're just seeing CPU overload (though other kvm nodes are handling many more vms than kvm1; kvm4 in particular currently has most of kvm3's vms on it in addition to its own).
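For reference, this just means appending the two options to the kernel line(s) in the guest's /etc/grub.conf; the version and root device below are placeholders, not the exact values on bm04:

# in the guest's /etc/grub.conf, append the options to each kernel line:
kernel /vmlinuz-<version> ro root=<root-device> acpi=off noapic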
A note which has not been explicitly called out: we've only ever experienced issues with runaway vms (pegging a CPU) on kvm1 and kvm3.  The machines were bought and added to the cluster as pairs (kvm1/2, kvm3/4, kvm5/6), so kvm1 and kvm3 should not share any characteristics that the other nodes do not.

After the last update, I resynced the secondary disks back to kvm3 and moved some of the less critical vms back onto that node.  

Here's how things stand now.  Note that kvm4 is still doing the lion's share out of all 6 nodes because I didn't want to try to move any of the buildbot masters while they were in use.

Instance                                         Primary_node                Secondary_Nodes

slavealloc.build.scl1.mozilla.com                kvm1.infra.scl1.mozilla.com kvm2.infra.scl1.mozilla.com
releng-mirror01.build.scl1.mozilla.com           kvm1.infra.scl1.mozilla.com kvm2.infra.scl1.mozilla.com
buildbot-master12.build.scl1.mozilla.com         kvm1.infra.scl1.mozilla.com kvm2.infra.scl1.mozilla.com
scl-production-puppet-new.build.scl1.mozilla.com kvm1.infra.scl1.mozilla.com kvm2.infra.scl1.mozilla.com
redis01.build.scl1.mozilla.com                   kvm1.infra.scl1.mozilla.com kvm5.infra.scl1.mozilla.com

ns1.infra.scl1.mozilla.com                       kvm2.infra.scl1.mozilla.com kvm1.infra.scl1.mozilla.com
dc01.winbuild.scl1.mozilla.com                   kvm2.infra.scl1.mozilla.com kvm1.infra.scl1.mozilla.com
wds01.winbuild.scl1.mozilla.com                  kvm2.infra.scl1.mozilla.com kvm1.infra.scl1.mozilla.com
buildbot-master6.build.scl1.mozilla.com          kvm2.infra.scl1.mozilla.com kvm1.infra.scl1.mozilla.com
buildbot-master16.build.scl1.mozilla.com         kvm2.infra.scl1.mozilla.com kvm1.infra.scl1.mozilla.com
autoland-staging01.build.scl1.mozilla.com        kvm2.infra.scl1.mozilla.com kvm1.infra.scl1.mozilla.com
talos-addon-master1.amotest.scl1.mozilla.com     kvm2.infra.scl1.mozilla.com kvm1.infra.scl1.mozilla.com
linux-hgwriter-slave03.build.scl1.mozilla.com    kvm2.infra.scl1.mozilla.com kvm1.infra.scl1.mozilla.com

arr-test.build.scl1.mozilla.com                  kvm3.infra.scl1.mozilla.com kvm4.infra.scl1.mozilla.com
master-puppet1.build.scl1.mozilla.com            kvm3.infra.scl1.mozilla.com kvm4.infra.scl1.mozilla.com
arr-client-test.build.scl1.mozilla.com           kvm3.infra.scl1.mozilla.com kvm4.infra.scl1.mozilla.com
autoland-staging02.build.scl1.mozilla.com        kvm3.infra.scl1.mozilla.com kvm4.infra.scl1.mozilla.com

ns2.infra.scl1.mozilla.com                       kvm4.infra.scl1.mozilla.com kvm3.infra.scl1.mozilla.com
admin1.infra.scl1.mozilla.com                    kvm4.infra.scl1.mozilla.com kvm3.infra.scl1.mozilla.com
dc02.winbuild.scl1.mozilla.com                   kvm4.infra.scl1.mozilla.com kvm3.infra.scl1.mozilla.com
ganglia1.build.scl1.mozilla.com                  kvm4.infra.scl1.mozilla.com kvm3.infra.scl1.mozilla.com
buildbot-master11.build.scl1.mozilla.com         kvm4.infra.scl1.mozilla.com kvm3.infra.scl1.mozilla.com
buildbot-master13.build.scl1.mozilla.com         kvm4.infra.scl1.mozilla.com kvm3.infra.scl1.mozilla.com
buildbot-master14.build.scl1.mozilla.com         kvm4.infra.scl1.mozilla.com kvm3.infra.scl1.mozilla.com
buildbot-master17.build.scl1.mozilla.com         kvm4.infra.scl1.mozilla.com kvm3.infra.scl1.mozilla.com
buildbot-master15.build.scl1.mozilla.com         kvm4.infra.scl1.mozilla.com kvm5.infra.scl1.mozilla.com
buildbot-master18.build.scl1.mozilla.com         kvm4.infra.scl1.mozilla.com kvm5.infra.scl1.mozilla.com
linux-hgwriter-slave04.build.scl1.mozilla.com    kvm4.infra.scl1.mozilla.com kvm6.infra.scl1.mozilla.com

buildbot-master4                                 kvm5.infra.scl1.mozilla.com kvm1.infra.scl1.mozilla.com
buildapi01.build.scl1.mozilla.com                kvm5.infra.scl1.mozilla.com kvm3.infra.scl1.mozilla.com
ganetiwebmanager1                                kvm5.infra.scl1.mozilla.com kvm6.infra.scl1.mozilla.com
puppet1.infra.scl1.mozilla.com                   kvm5.infra.scl1.mozilla.com kvm6.infra.scl1.mozilla.com
signing2.build.scl1.mozilla.com                  kvm5.infra.scl1.mozilla.com kvm6.infra.scl1.mozilla.com
releng-puppet1.build.scl1.mozilla.com            kvm5.infra.scl1.mozilla.com kvm6.infra.scl1.mozilla.com
buildbot-master25.build.scl1.mozilla.com         kvm5.infra.scl1.mozilla.com kvm6.infra.scl1.mozilla.com

signing1.build.scl1.mozilla.com                  kvm6.infra.scl1.mozilla.com kvm2.infra.scl1.mozilla.com
dev-master01.build.scl1.mozilla.com              kvm6.infra.scl1.mozilla.com kvm3.infra.scl1.mozilla.com
rabbit1-dev.build.scl1.mozilla.com               kvm6.infra.scl1.mozilla.com kvm5.infra.scl1.mozilla.com
buildbot-master21.build.scl1.mozilla.com         kvm6.infra.scl1.mozilla.com kvm5.infra.scl1.mozilla.com
buildbot-master23.build.scl1.mozilla.com         kvm6.infra.scl1.mozilla.com kvm5.infra.scl1.mozilla.com
buildbot-master24.build.scl1.mozilla.com         kvm6.infra.scl1.mozilla.com kvm5.infra.scl1.mozilla.com
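(For anyone wanting to regenerate a placement listing like the one above, it comes straight off the ganeti master with something along these lines; field names per ganeti 2.x.)

gnt-instance list -o name,pnode,snodes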
I spent most of this afternoon looking at this with Ben Kero; ultimately we were not able to get useful debugging information from the process state on kvm3, nor were we able to conclusively identify the root cause. We do have a theory on one possible cause.

Based on the state of the machine prior to a reset, it behaved as if it had exhausted all available memory and/or swap. The principal indication of this is that ssh accepts a connection (indicating the kernel is running and able to accept) but no ssh banner is displayed because userland is hosed.
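That symptom is easy to confirm from another host; with any nc flavor that supports a read timeout, something like the following completes the TCP handshake but never prints the usual SSH-2.0 banner when userland is wedged:

nc -w 10 kvm3.infra.scl1.mozilla.com 22     # a healthy sshd prints "SSH-2.0-OpenSSH_..." immediately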

Based on this behavior I have a hypothesis about what may have happened.

It seems there is a cron job that is run as root every minute[1], connecting to a sqlite3 database and printing some information. Based on the Python documentation[2], the library will obtain an exclusive lock on the database as part of the connection procedure. Any other processes which want access (even read-only) to the database will be blocked until the lock is released.

Something could have happened on the underlying disk blocking I/O, causing an unrecoverable lock on the database. As time went on, the processes could stack up until all system resources (primarily memory) were exhausted.

I was unable to find any data showing historical memory usage on autoland-staging02. If the data were available it might confirm this theory.

At a minimum I suggest serializing this cron job or giving it a timeout shorter than its execution interval (a sketch of one way to do that follows the footnotes below).

[1] */1 * * * * source /home/autoland/autoland-env/bin/activate && python /var/www/cgi-bin/py.cgi > /var/www/html/test.html
[2] http://docs.python.org/library/sqlite3.html, sqlite3.connect
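One low-risk way to do both (serialize and bound the runtime) without touching py.cgi is to wrap the existing crontab entry. This is only a sketch; it assumes util-linux flock and GNU coreutils timeout are available on that VM, and the lock-file path is made up:

*/1 * * * * flock -n /var/lock/autoland-report.lock timeout 55 bash -c 'source /home/autoland/autoland-env/bin/activate && python /var/www/cgi-bin/py.cgi > /var/www/html/test.html'

flock -n makes overlapping runs exit immediately instead of stacking, and timeout 55 keeps any single run under the one-minute interval. Inside py.cgi, sqlite3.connect() also accepts a timeout argument (seconds to wait on a locked database), which would bound the wait at the application level.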
buildbot-master12 got re-enabled at about 1650, but philor noticed that compile jobs were failing. 

I looked into one instance. The buildbot slave tries to send a chunk of progress log but the master doesn't respond, so the slave gives up the job & connection. (Releng: for kicks on windows we don't kill the existing process tree). To the master it looks like a second host with the same hostname wants to make a connection. This was time-correlated with these entries in the system log:
Mar 12 19:47:16 buildbot-master12 kernel: BUG: soft lockup - CPU#1 stuck for 20s! [swapper:0]
Mar 12 19:47:16 buildbot-master12 kernel: BUG: soft lockup - CPU#0 stuck for 32s! [swapper:0]

buildbot-master12 is disabled again.
Further info, 
* the slave ends up inoperative because the master is trying to boot dupe connections all the time (really I think it has a stale connection the other side gave up on)
* this master had a great deal of free ram available and didn't need to be doing any swapping
* there were no nagios alerts about kvm hosts all evening
* based on comment #11 it's on kvm1
More issues have developed since the above:
* at 1:49 buildbot alerts on buildbot-master12 PING failure
* ganglia data stops at ~ 1:45, when the load spiked from ~0 to 10
* at 1:57 slavealloc alerts about http_expect "Service Check Timed Out" and 'Ganglia IO' being 'cpu_wio is 99.70'
* at 2:21 kvm1.infra.scl1:DRBD goes critical, socket timeout after 30 seconds

Hosted VM breakdown (per comment #11):
* slavealloc - not a tree closer but would like this back to be able to control slaves, ganglia reports slow increase in load
* releng-mirror01 - can be turned off, not in use, load at 80 and increasing linearly
* buildbot-master12 - can be turned off, can do without it
* scl-production-puppet - would close tree if it goes, acting normally so far
* redis01 - closes tree because data on completed jobs will be missing from tbpl, was fine until 2:55a then load increased rapidly to 25 and nonresponsive, CPU mostly in wait state
(In reply to Nick Thomas [:nthomas] from comment #15)
> * redis01 - closes tree because data on completed jobs will be missing from
> tbpl, was fine until 2:55a then load increased rapidly to 25 and
> nonresponsive, CPU mostly in wait state

For reasons I don't understand the redis process is still working fine, so the trees are open.
also autoland-staging02 appears to be idle at the moment.
Based on the above, I'm still very much leaning towards an issue with nodes kvm1 and kvm3.  To that end, I've migrated almost everything off of them.  Since redis01 refuses to migrate without a failover (shutdown and then boot on the new node), catlee asked that I not move it.

If kvm2 gets too overloaded (CPU), I suggest we shut down linux-hgwriter-slave03 and releng-mirror01.  If kvm4 gets too overloaded, shut down linux-hgwriter-slave04.


Instance                                         Primary_node                Secondary_Nodes


redis01.build.scl1.mozilla.com                   kvm1.infra.scl1.mozilla.com kvm5.infra.scl1.mozilla.com

ns1.infra.scl1.mozilla.com                       kvm2.infra.scl1.mozilla.com kvm1.infra.scl1.mozilla.com
dc01.winbuild.scl1.mozilla.com                   kvm2.infra.scl1.mozilla.com kvm1.infra.scl1.mozilla.com
wds01.winbuild.scl1.mozilla.com                  kvm2.infra.scl1.mozilla.com kvm1.infra.scl1.mozilla.com
slavealloc.build.scl1.mozilla.com                kvm2.infra.scl1.mozilla.com kvm1.infra.scl1.mozilla.com
releng-mirror01.build.scl1.mozilla.com           kvm2.infra.scl1.mozilla.com kvm1.infra.scl1.mozilla.com
buildbot-master6.build.scl1.mozilla.com          kvm2.infra.scl1.mozilla.com kvm1.infra.scl1.mozilla.com
buildbot-master12.build.scl1.mozilla.com         kvm2.infra.scl1.mozilla.com kvm1.infra.scl1.mozilla.com
buildbot-master16.build.scl1.mozilla.com         kvm2.infra.scl1.mozilla.com kvm1.infra.scl1.mozilla.com
talos-addon-master1.amotest.scl1.mozilla.com     kvm2.infra.scl1.mozilla.com kvm1.infra.scl1.mozilla.com
linux-hgwriter-slave03.build.scl1.mozilla.com    kvm2.infra.scl1.mozilla.com kvm1.infra.scl1.mozilla.com
autoland-staging01.build.scl1.mozilla.com        kvm2.infra.scl1.mozilla.com kvm6.infra.scl1.mozilla.com

arr-test.build.scl1.mozilla.com                  kvm3.infra.scl1.mozilla.com kvm4.infra.scl1.mozilla.com
arr-client-test.build.scl1.mozilla.com           kvm3.infra.scl1.mozilla.com kvm4.infra.scl1.mozilla.com

ns2.infra.scl1.mozilla.com                       kvm4.infra.scl1.mozilla.com kvm3.infra.scl1.mozilla.com
admin1.infra.scl1.mozilla.com                    kvm4.infra.scl1.mozilla.com kvm3.infra.scl1.mozilla.com
master-puppet1.build.scl1.mozilla.com            kvm4.infra.scl1.mozilla.com kvm3.infra.scl1.mozilla.com
dc02.winbuild.scl1.mozilla.com                   kvm4.infra.scl1.mozilla.com kvm3.infra.scl1.mozilla.com
ganglia1.build.scl1.mozilla.com                  kvm4.infra.scl1.mozilla.com kvm3.infra.scl1.mozilla.com
buildbot-master11.build.scl1.mozilla.com         kvm4.infra.scl1.mozilla.com kvm3.infra.scl1.mozilla.com
buildbot-master13.build.scl1.mozilla.com         kvm4.infra.scl1.mozilla.com kvm3.infra.scl1.mozilla.com
buildbot-master14.build.scl1.mozilla.com         kvm4.infra.scl1.mozilla.com kvm3.infra.scl1.mozilla.com
buildbot-master17.build.scl1.mozilla.com         kvm4.infra.scl1.mozilla.com kvm3.infra.scl1.mozilla.com
buildbot-master15.build.scl1.mozilla.com         kvm4.infra.scl1.mozilla.com kvm5.infra.scl1.mozilla.com
buildbot-master18.build.scl1.mozilla.com         kvm4.infra.scl1.mozilla.com kvm5.infra.scl1.mozilla.com
autoland-staging02.build.scl1.mozilla.com        kvm4.infra.scl1.mozilla.com kvm6.infra.scl1.mozilla.com
linux-hgwriter-slave04.build.scl1.mozilla.com    kvm4.infra.scl1.mozilla.com kvm6.infra.scl1.mozilla.com

buildbot-master4                                 kvm5.infra.scl1.mozilla.com kvm1.infra.scl1.mozilla.com
scl-production-puppet-new.build.scl1.mozilla.com kvm5.infra.scl1.mozilla.com kvm2.infra.scl1.mozilla.com
buildapi01.build.scl1.mozilla.com                kvm5.infra.scl1.mozilla.com kvm3.infra.scl1.mozilla.com
ganetiwebmanager1                                kvm5.infra.scl1.mozilla.com kvm6.infra.scl1.mozilla.com
puppet1.infra.scl1.mozilla.com                   kvm5.infra.scl1.mozilla.com kvm6.infra.scl1.mozilla.com
signing2.build.scl1.mozilla.com                  kvm5.infra.scl1.mozilla.com kvm6.infra.scl1.mozilla.com
releng-puppet1.build.scl1.mozilla.com            kvm5.infra.scl1.mozilla.com kvm6.infra.scl1.mozilla.com
buildbot-master25.build.scl1.mozilla.com         kvm5.infra.scl1.mozilla.com kvm6.infra.scl1.mozilla.com

signing1.build.scl1.mozilla.com                  kvm6.infra.scl1.mozilla.com kvm2.infra.scl1.mozilla.com
dev-master01.build.scl1.mozilla.com              kvm6.infra.scl1.mozilla.com kvm3.infra.scl1.mozilla.com
rabbit1-dev.build.scl1.mozilla.com               kvm6.infra.scl1.mozilla.com kvm5.infra.scl1.mozilla.com
buildbot-master21.build.scl1.mozilla.com         kvm6.infra.scl1.mozilla.com kvm5.infra.scl1.mozilla.com
buildbot-master23.build.scl1.mozilla.com         kvm6.infra.scl1.mozilla.com kvm5.infra.scl1.mozilla.com
buildbot-master24.build.scl1.mozilla.com         kvm6.infra.scl1.mozilla.com kvm5.infra.scl1.mozilla.com


As an aside, many of the vms and the ganeti nodes themselves have been up over 200 days.  So if a node gets failed over, it does an automatic fsck and may take longer than expected to come back online.
We just had another occurrence of the same symptoms on kvm4 (which had not shown any issues prior to now).
Try, mozilla-inbound and mozilla-central are closed because of the test backlog that we're accumulating. We're currently down one buildbot master for Mac, Linux and Windows tests, and one master for try compiles.
I was under the impression we had built out sufficient master capacity to handle losing one in each silo - either for rolling-reboots, or failures such as this one.  If that's no longer the case, we should talk elsewhere about why and how to fix it (and why master-capacity forecasting is so bad).
The trees are reopened now. bkero has the details on the infra.
dustin, lerxst, and I evacuated two nodes and spent some time replicating and diagnosing this problem on the evacuated nodes.

We can reproduce the problems by creating a test virtual machine, then artificially starving it of memory so that it digs into swap (a rough sketch of the repro is below).  We estimate that eliminating drbd on hosts that can be recreated easily using puppet, and disabling swap inside these virtual machines, should prevent this problem from happening again.

I am looking into some new kernel parameters on the hosts, as well as upgraded versions of qemu-kvm and drbd.
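The repro is roughly the following (run inside a small test guest; python is already on these images since the autoland cron uses it, and the 400 MB figure matches dustin's note below):

# inside a ~256 MB test guest: allocate ~400 MB so the guest is forced deep into swap
python -c "x = 'a' * (400 * 1024 * 1024); import time; time.sleep(600)"
# on the kvm node, watch swap/IO and the VM's CPU while the guest thrashes
vmstat 5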
Status update:

I think we're all ragged and nobody has a full understanding of what's up, but we have a rough replication strategy and two nodes (kvm1 and kvm4) to play with.  The replication involves touching 400M of memory in a 256M VM, thereby swapping.  The symptoms are downright weird - kvm1 pegged itself so badly that we couldn't do anything but run 'ssh kvm1 killall kvm'.  kvm4 started hanging commands like 'cat', 'ls', 'ps', even over non-disk-backed areas like /sys, and refused to show anything but 0% CPU for any process.

I suspect that disabling swap for all VMs would help here, since it means that a VM using too much RAM will just OOM, rather than causing IO mayhem on the node.  Disabling drbd might help, too, but that has much worse implications (no live migration, no failover from a failed node, although bear in mind we're using raid10 here so disk failure isn't a huge risk).  Just disabling swap still leaves us vulnerable to application-initiated IO mayhem, so I'd like to get to the bottom of this.
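For the swap part, the change inside a guest is small; a sketch (double-check /etc/fstab before the next reboot):

swapoff -a                                            # stop swapping immediately
sed -i.bak '/[[:space:]]swap[[:space:]]/s/^/#/' /etc/fstab   # keep it off across reboots
sysctl -w vm.swappiness=0                             # milder alternative: keep swap but strongly discourage it (add to /etc/sysctl.conf to persist)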

Ben has a list of things to try:

 try hpet=disable noapic in the kernel
 upgrade the kernel
 add the qemu-kvm ppa
 upgrade from 0.12.3 to 0.14.1 (rough sketch below)
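On the qemu-kvm upgrade, something along these lines on a node (assuming these hosts are on Ubuntu 10.04, given the stock 0.12.3 version and the PPA mention; the PPA name is a placeholder for whichever archive actually carries 0.14.x):

kvm --version                               # confirm the running qemu-kvm version (0.12.3 today)
add-apt-repository ppa:EXAMPLE/qemu-kvm     # placeholder PPA name
apt-get update && apt-get install qemu-kvm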

for now, I think we should leave these two nodes vacant until we can nail this down.

Another option is to start kicking new nodes as RHEL6 like the other clusters, and cold-migrating VMs to them one by one.
This impacted last night's FF12.0b2 release. 

I agree with comment #24 - if these systems are not reliable, we need to get RelEng-related production systems off them ASAP. Debugging this issue can continue later, but stopping the ongoing disruption to production feels like a priority. 

How quickly can we move any production RelEng VMs off to new/different systems?  As this is already disrupting production, if an emergency tree closure / release schedule disruption makes this move quicker, we should do that.
joduinn,

Here's a list of those hosts.

autoland-staging01.build.scl1.mozilla.com
autoland-staging02.build.scl1.mozilla.com
buildapi01.build.scl1.mozilla.com
buildbot-master04.build.scl1.mozilla.com
buildbot-master06.build.scl1.mozilla.com
buildbot-master11.build.scl1.mozilla.com
buildbot-master12.build.scl1.mozilla.com
buildbot-master13.build.scl1.mozilla.com
buildbot-master14.build.scl1.mozilla.com
buildbot-master15.build.scl1.mozilla.com
buildbot-master16.build.scl1.mozilla.com
buildbot-master17.build.scl1.mozilla.com
buildbot-master18.build.scl1.mozilla.com
buildbot-master21.build.scl1.mozilla.com
buildbot-master23.build.scl1.mozilla.com
buildbot-master24.build.scl1.mozilla.com
buildbot-master25.build.scl1.mozilla.com
dev-master01.build.scl1.mozilla.com
ganglia1.build.scl1.mozilla.com
linux-hgwriter-slave03.build.scl1.mozilla.com
linux-hgwriter-slave04.build.scl1.mozilla.com
master-puppet1.build.scl1.mozilla.com
rabbit1-dev.build.scl1.mozilla.com
redis01.build.scl1.mozilla.com
releng-mirror01.build.scl1.mozilla.com
releng-puppet1.build.scl1.mozilla.com
scl-production-puppet-new.build.scl1.mozilla.com
signing1.build.scl1.mozilla.com
signing2.build.scl1.mozilla.com
slavealloc.build.scl1.mozilla.com
talos-addon-master1.amotest.scl1.mozilla.com
I talked more with bkero this morning about kvm in scl1.  Yesterday he performed a qemu upgrade on kvm nodes 1 and 4 and he has been unable to reproduce the issues that brought down the nodes/cluster.  He was able to reproduce this behavior after a simple reboot of the node (before patching), so we think we may have it licked *fingers crossed*.

This patch should *also* fix the issue we've seen with live migration stalling out and not transferring any of the RAM to the new node when a VM is live-migrated to its secondary.

Because of the nature of the software upgrade, all of the vms will need to be failed over (rebooted, not live-migrated) to their secondary node to clear off the unpatched primary node for the patch.  Most of the vms are in some way redundant, but there are one or two that may require scheduling a tree closure (RelEng to determine which, if any).
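(In ganeti terms, the two per-instance operations are the following; the instance name is just an example pulled from the tables in this bug.)

gnt-instance failover buildbot-master13.build.scl1.mozilla.com   # shut down on the primary, start on the secondary (a reboot for the VM)
gnt-instance migrate buildbot-master13.build.scl1.mozilla.com    # live migration, no reboot; usable again once both nodes run the patched qemu-kvm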

Obviously the sooner we do this the better (to avoid another meltdown), so if we need a tree closure, we should start scheduling that now.  If we start to have another node melt down before an outage is scheduled, we will force a tree closure at that point to migrate hosts off of the ailing node.


Once all of the nodes are patched, we will additionally need to migrate some vms around to rebalance load for better performance, but hopefully this will be using live migration and not require any more reboots.

Here's the order in which I suggest fixing the nodes.  

Secondary_Nodes             Primary_node                Instance

fixing kvm3 (all will be moving onto a patched node (kvm4))

kvm4.infra.scl1.mozilla.com kvm3.infra.scl1.mozilla.com admin1.infra.scl1.mozilla.com
kvm4.infra.scl1.mozilla.com kvm3.infra.scl1.mozilla.com buildbot-master13.build.scl1.mozilla.com
kvm4.infra.scl1.mozilla.com kvm3.infra.scl1.mozilla.com buildbot-master14.build.scl1.mozilla.com
kvm4.infra.scl1.mozilla.com kvm3.infra.scl1.mozilla.com buildbot-master17.build.scl1.mozilla.com
kvm4.infra.scl1.mozilla.com kvm3.infra.scl1.mozilla.com dc02.winbuild.scl1.mozilla.com
kvm4.infra.scl1.mozilla.com kvm3.infra.scl1.mozilla.com ganglia1.build.scl1.mozilla.com
kvm4.infra.scl1.mozilla.com kvm3.infra.scl1.mozilla.com master-puppet1.build.scl1.mozilla.com
kvm4.infra.scl1.mozilla.com kvm3.infra.scl1.mozilla.com ns2.infra.scl1.mozilla.com

fixing kvm2 (all but one will be moving onto a patched node (kvm1))

kvm1.infra.scl1.mozilla.com kvm2.infra.scl1.mozilla.com buildbot-master12.build.scl1.mozilla.com
kvm1.infra.scl1.mozilla.com kvm2.infra.scl1.mozilla.com buildbot-master16.build.scl1.mozilla.com
kvm1.infra.scl1.mozilla.com kvm2.infra.scl1.mozilla.com buildbot-master6.build.scl1.mozilla.com
kvm1.infra.scl1.mozilla.com kvm2.infra.scl1.mozilla.com dc01.winbuild.scl1.mozilla.com
kvm1.infra.scl1.mozilla.com kvm2.infra.scl1.mozilla.com ns1.infra.scl1.mozilla.com
kvm1.infra.scl1.mozilla.com kvm2.infra.scl1.mozilla.com slavealloc.build.scl1.mozilla.com
kvm1.infra.scl1.mozilla.com kvm2.infra.scl1.mozilla.com talos-addon-master1.amotest.scl1.mozilla.com
kvm1.infra.scl1.mozilla.com kvm2.infra.scl1.mozilla.com wds01.winbuild.scl1.mozilla.com
kvm4.infra.scl1.mozilla.com kvm2.infra.scl1.mozilla.com buildbot-master11.build.scl1.mozilla.com

kvm6.infra.scl1.mozilla.com kvm2.infra.scl1.mozilla.com autoland-staging01.build.scl1.mozilla.com

fixing kvm5 (half will be moving onto a patched node but half will be moving onto an unpatched node)

kvm1.infra.scl1.mozilla.com kvm5.infra.scl1.mozilla.com buildbot-master4
kvm1.infra.scl1.mozilla.com kvm5.infra.scl1.mozilla.com redis01.build.scl1.mozilla.com
kvm2.infra.scl1.mozilla.com kvm5.infra.scl1.mozilla.com scl-production-puppet-new.build.scl1.mozilla.com
kvm3.infra.scl1.mozilla.com kvm5.infra.scl1.mozilla.com buildapi01.build.scl1.mozilla.com
kvm4.infra.scl1.mozilla.com kvm5.infra.scl1.mozilla.com buildbot-master15.build.scl1.mozilla.com
kvm4.infra.scl1.mozilla.com kvm5.infra.scl1.mozilla.com buildbot-master18.build.scl1.mozilla.com

kvm6.infra.scl1.mozilla.com kvm5.infra.scl1.mozilla.com buildbot-master25.build.scl1.mozilla.com
kvm6.infra.scl1.mozilla.com kvm5.infra.scl1.mozilla.com ganetiwebmanager1
kvm6.infra.scl1.mozilla.com kvm5.infra.scl1.mozilla.com puppet1.infra.scl1.mozilla.com
kvm6.infra.scl1.mozilla.com kvm5.infra.scl1.mozilla.com rabbit1-dev.build.scl1.mozilla.com
kvm6.infra.scl1.mozilla.com kvm5.infra.scl1.mozilla.com releng-puppet1.build.scl1.mozilla.com
kvm6.infra.scl1.mozilla.com kvm5.infra.scl1.mozilla.com signing2.build.scl1.mozilla.com

fixing kvm6 (all will be moving onto a patched node)

kvm4.infra.scl1.mozilla.com kvm6.infra.scl1.mozilla.com autoland-staging02.build.scl1.mozilla.com
kvm5.infra.scl1.mozilla.com kvm6.infra.scl1.mozilla.com buildbot-master21.build.scl1.mozilla.com
kvm5.infra.scl1.mozilla.com kvm6.infra.scl1.mozilla.com buildbot-master23.build.scl1.mozilla.com
kvm5.infra.scl1.mozilla.com kvm6.infra.scl1.mozilla.com buildbot-master24.build.scl1.mozilla.com
kvm2.infra.scl1.mozilla.com kvm6.infra.scl1.mozilla.com signing1.build.scl1.mozilla.com
kvm3.infra.scl1.mozilla.com kvm6.infra.scl1.mozilla.com dev-master01.build.scl1.mozilla.com

I chose to do kvm5 before kvm6 because that way fewer buildbot vms will need to be rebooted and migrated twice.
looks fine to me. kvm2/3 look like they can be evacuated without a downtime given enough lead time to move slaves off the buildbot masters there.
(In reply to Amy Rich [:arich] [:arr] from comment #27)
> I talked more with bkero this morning about kvm in scl1.  Yesterday he
> performed a qemu upgrade on kvm nodes 1 and 4 and he has been unable to
> reproduce the issues that brought down the nodes/cluster.  He was able to
> reproduce this behavior after a simple reboot of the node (before patching),
> so we think we may have it licked *fingers crossed*.
>
> This patch should *also* fix the issue we've seen with live migration
> stalling out and not transferring any of the RAM to the new node when a host
> is live migrated to its secondary
Keeping fingers crossed, but very cool to hear.


> Because of the nature of the software upgrade, all of the vms will need to
> be failed over (rebooted, not live migration) to their secondary node to
> clear off the unpatched primary node for the patch.  Most of the vms are in
> some way redundant, but there are one or two that may require scheduling a
> tree closure (releng to determine which if any).
> 
> Obviously the sooner we do this the better (to avoid another melt down), so
> if we need a tree closure, we should start scheduling that now.  If we start
> to have another node melt down before an outage is scheduled, we will force
> a tree closure at that point to migrate hosts off of the ailing node.
Yep, let's get this done. I've flagged "needs-treeclosure" for buildduty to track.

To ask for a tree closure, can you give us approximately:
1) how soon you'd be ready to start this
2) how long you'd need to complete your work
2a) we'll factor in the extra time needed for getting RelEng systems running again before we can reopen trees.



(In reply to Chris AtLee [:catlee] from comment #28)
> looks fine to me. kvm2/3 look like they can be evacuated without a downtime
> given enough lead time to move slaves off the buildbot masters there.
From IRC, the work on kvm1, 2, and 3 is already in progress. This will give us extra comfort that the fix is really working, and reduces our risk exposure until the tree closure.
Flags: needs-treeclosure?
Just an FYI, we are working on this today.  

kvm3 is patched.
kvm2 has one more vm to migrate off of it before we can patch.
kvm6 is in the process of being evacuated.
kvm5 will wait till next week because of redis01.
joduinn: I have yet to be informed that this will actually require a tree closure.  My discussion with coop/catlee (buildduty) has indicated that we may be able to do this without closing anything.
(In reply to Amy Rich [:arich] [:arr] from comment #31)
> joduinn: I have yet to be informed that this will actually require a tree
> closure.  My discussion with coop/catlee (buildduty) has indicated that we
> may be able to do this without closing anything.

arr: yep, understood that tree closure may not be required. Extra kudos if all this vm-shuffling can be done without needing a tree closure. Note that if a tree closure *is* needed, we'll do it to solve this disruptive bug.

Thanks again.
all nodes but kvm5 have been patched and rebooted.

Next week (or sooner if forced by issues), we need to move/reboot the following vms off of kvm6:

buildapi01.build.scl1.mozilla.com
buildbot-master25.build.scl1.mozilla.com
redis01.build.scl1.mozilla.com
signing2.build.scl1.mozilla.com

And then patch kvm6.

We can then live migrate some of the hosts back onto it (no downtime required).
that should say patch kvm5, not patch kvm6
The remainder of the hosts were failed over this morning, and kvm5 has been patched. 

I'm going to open a separate bug to discuss how to balance the kvm nodes so that we don't have vms that are redundant for each other on the same nodes.
Severity: blocker → normal
(In reply to Amy Rich [:arich] [:arr] from comment #35)
> The remainder of the hosts were failed over this morning, and kvm5 has been
> patched. 

So are we done here then? Can this bug be closed?
The nodes have all been patched.  I opened a new bug to rebalance, so, yes.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard