scl1 ganeti cluster having issues

Status: RESOLVED FIXED (mozilla.org Graveyard :: Server Operations)
Opened 6 years ago; last updated 3 years ago
Reporter: nthomas; Assignee: bkero
Version: other; Hardware: x86, Mac OS X
Bug Flags: needs-treeclosure ?

Description (Reporter, 6 years ago)
bm12 lost its head a bit before 00:45 Sunday and started using a lot of CPU resources. It was impacting other VMs on kvm1 in SCL1, so bkero restarted the VM; buildbot and associated jobs restarted automatically on boot.

Since then nagios has been reporting:
 buildbot-master12.build.scl1:Command Queue is UNKNOWN: Unhandled exception
 buildbot-master12.build.scl1:Pulse Queue is UNKNOWN: Unhandled exception
but the commands defined in /etc/nagios/nrpe.d work fine at the prompt (as cltbld and root).

And these in the system log:
# grep lockup /var/log/messages
Mar 11 03:32:25 buildbot-master12 kernel: BUG: soft lockup - CPU#0 stuck for 32s! [swapper:0]
Mar 11 03:32:25 buildbot-master12 kernel: BUG: soft lockup - CPU#1 stuck for 32s! [makewhatis:10916]
Mar 11 03:34:24 buildbot-master12 kernel: BUG: soft lockup - CPU#0 stuck for 15s! [swapper:0]
Mar 11 03:34:25 buildbot-master12 kernel: BUG: soft lockup - CPU#1 stuck for 15s! [makewhatis:6656]
plus register info and call traces. 

The buildbot master is disabled in slavealloc and stopped pending investigation.
Updated (Reporter, 6 years ago)
Summary: buildbot-master12 → buildbot-master12 is unwell
Comment 1 (Reporter, 6 years ago)
buildbot-master04 has lockups going back to 9:42pm on Saturday:

Mar 10 21:42:00 buildbot-master04 kernel: BUG: soft lockup - CPU#1 stuck for 10s! [swapper:0]
Mar 10 21:42:04 buildbot-master04 kernel: BUG: soft lockup - CPU#0 stuck for 10s! [buildbot:14011]
Mar 10 21:42:23 buildbot-master04 kernel: BUG: soft lockup - CPU#1 stuck for 11s! [swapper:0]
Mar 10 21:42:23 buildbot-master04 kernel: BUG: soft lockup - CPU#0 stuck for 12s! [swapper:0]
Mar 10 21:46:02 buildbot-master04 kernel: BUG: soft lockup - CPU#0 stuck for 33s! [swapper:0]
Mar 10 21:46:02 buildbot-master04 kernel: BUG: soft lockup - CPU#1 stuck for 33s! [buildbot:13355]

and if I look harder at buildbot-master12, they go back to the same time too, 82 events in total. slavealloc is also on this KVM host.

The common element is kvm1.infra.scl1, over to Server Ops: Releng.
Assignee: nobody → server-ops-releng
Severity: normal → major
Component: Release Engineering → Server Operations: RelEng
QA Contact: release → arich
Summary: buildbot-master12 is unwell → kvm1.infra.scl1 unwell ?
The ganeti cluster in scl1 is having issues beyond our ability to diagnose.  SSH to the master node timed out a couple of times before succeeding, trying to do a cluster-verify times out, nodes have errors in the logs, etc.  Please page someone with ganeti knowledge to look at this ASAP, since all critical releng infra is on this cluster.
Assignee: server-ops-releng → server-ops
Severity: major → blocker
Component: Server Operations: RelEng → Server Operations
QA Contact: arich → phong
Summary: kvm1.infra.scl1 unwell ? → scl1 ganeti cluster having issues
Assignee: server-ops → dgherman
Assignee: dgherman → bkero
Closed the trees around 09:45.
We have been looking at this all morning and have yet to identify the cause of the issue.  Investigation continues.
Comment 5 (Reporter, 6 years ago)
(In reply to Nick Thomas [:nthomas] from comment #0)
> Since then we have
>  buildbot-master12.build.scl1:Command Queue is UNKNOWN: Unhandled exception
>  buildbot-master12.build.scl1:Pulse Queue is UNKNOWN: Unhandled exception
> from nagios, but the commands defined in /etc/nagios/nrpe.d work fine at the
> prompt (as cltbld and root). 

This is a bug for RelEng to fix, related to modes on some directories; bug 734764 covers that.
Trees reopened at 14:30.
Comment 7 (Assignee, 6 years ago)
tl;dr: All VMs are back online.  There should be no continued unavailability.

The issue has been contained.  I was unable to determine at the application level what caused the issues, but from a system standpoint this is what happened.

The master node (kvm2) was having problems communicating with other nodes.  The TCP connections were being accepted, although OpenSSL was timing out.  The result of this is that the master was unable to communicate with the rest of the nodes in the cluster.

Meanwhile on kvm1, the buildbot-master12 process was consuming immense CPU time.

On kvm3, another VM had run away and raised the loadavg above 50, eventually making the host completely unresponsive, including out-of-band management.  In response to this, we have migrated all VMs (including scl1's nagios VM) to other nodes in the cluster.

I am currently en route to scl1 to power cycle kvm3 and bring it back online manually.

The fact that (seemingly) one virtual machine could have this devastating an effect on the cluster disturbs me.  Once I've taken care of kvm3 and have a chance to investigate how this happened, I would like to determine what can be done to prevent it from happening again.
To add to that:

All of the vms have been distributed off of kvm3 and are on the other hosts for the time being. Having too many buildbot masters on one host often means we'll see performance problems, so be cognizant of that (kvm4 is a bit loaded with kvm3 down right now):

kvm1.infra.scl1.mozilla.com buildbot-master12.build.scl1.mozilla.com
kvm1.infra.scl1.mozilla.com buildbot-master4

kvm2.infra.scl1.mozilla.com buildbot-master16.build.scl1.mozilla.com
kvm2.infra.scl1.mozilla.com buildbot-master6.build.scl1.mozilla.com

kvm4.infra.scl1.mozilla.com buildbot-master11.build.scl1.mozilla.com
kvm4.infra.scl1.mozilla.com buildbot-master13.build.scl1.mozilla.com
kvm4.infra.scl1.mozilla.com buildbot-master14.build.scl1.mozilla.com
kvm4.infra.scl1.mozilla.com buildbot-master15.build.scl1.mozilla.com
kvm4.infra.scl1.mozilla.com buildbot-master17.build.scl1.mozilla.com
kvm4.infra.scl1.mozilla.com buildbot-master18.build.scl1.mozilla.com

kvm5.infra.scl1.mozilla.com buildbot-master25.build.scl1.mozilla.com

kvm6.infra.scl1.mozilla.com buildbot-master21.build.scl1.mozilla.com
kvm6.infra.scl1.mozilla.com buildbot-master23.build.scl1.mozilla.com
kvm6.infra.scl1.mozilla.com buildbot-master24.build.scl1.mozilla.com
We also added "acpi=off noapic" to /etc/grub.conf on some vms based on Google searches, but it's not clear that this will solve future issues.
We experienced additional issues today with buildbot-master04.  I have added "acpi=off noapic" to /etc/grub.conf and moved bm4 off to a different kvm host in case we're just seeing CPU overload (but other kvm nodes are handling many more vms than kvm1; specifically, kvm4 has most of kvm3's vms on it in addition to its own right now).
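For reference, those flags go on the kernel line of /etc/grub.conf. A sketch of the resulting line, where the kernel version and root device are placeholders rather than values from these hosts:

```
# /etc/grub.conf -- kernel line with the workaround appended (placeholders)
kernel /vmlinuz-2.6.32 ro root=/dev/vda1 acpi=off noapic
```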
A note which has not been explicitly called out: we've only ever experienced issues with runaway vms (pegging a CPU) on kvm1 and kvm3.  The machines were bought and added to the cluster as sets (kvm1/2, kvm3/4, kvm5/6), so kvm1 and kvm3 should not share any characteristics that the others do not.

After the last update, I resynced the secondary disks back to kvm3 and moved some of the less critical vms back onto that node.  

Here's how things stand now.  Note that kvm4 is still doing the lion's share out of all 6 nodes, because I didn't want to try to move any of the buildbot masters while they were in use.

Instance                                         Primary_node                Secondary_Nodes

slavealloc.build.scl1.mozilla.com                kvm1.infra.scl1.mozilla.com kvm2.infra.scl1.mozilla.com
releng-mirror01.build.scl1.mozilla.com           kvm1.infra.scl1.mozilla.com kvm2.infra.scl1.mozilla.com
buildbot-master12.build.scl1.mozilla.com         kvm1.infra.scl1.mozilla.com kvm2.infra.scl1.mozilla.com
scl-production-puppet-new.build.scl1.mozilla.com kvm1.infra.scl1.mozilla.com kvm2.infra.scl1.mozilla.com
redis01.build.scl1.mozilla.com                   kvm1.infra.scl1.mozilla.com kvm5.infra.scl1.mozilla.com

ns1.infra.scl1.mozilla.com                       kvm2.infra.scl1.mozilla.com kvm1.infra.scl1.mozilla.com
dc01.winbuild.scl1.mozilla.com                   kvm2.infra.scl1.mozilla.com kvm1.infra.scl1.mozilla.com
wds01.winbuild.scl1.mozilla.com                  kvm2.infra.scl1.mozilla.com kvm1.infra.scl1.mozilla.com
buildbot-master6.build.scl1.mozilla.com          kvm2.infra.scl1.mozilla.com kvm1.infra.scl1.mozilla.com
buildbot-master16.build.scl1.mozilla.com         kvm2.infra.scl1.mozilla.com kvm1.infra.scl1.mozilla.com
autoland-staging01.build.scl1.mozilla.com        kvm2.infra.scl1.mozilla.com kvm1.infra.scl1.mozilla.com
talos-addon-master1.amotest.scl1.mozilla.com     kvm2.infra.scl1.mozilla.com kvm1.infra.scl1.mozilla.com
linux-hgwriter-slave03.build.scl1.mozilla.com    kvm2.infra.scl1.mozilla.com kvm1.infra.scl1.mozilla.com

arr-test.build.scl1.mozilla.com                  kvm3.infra.scl1.mozilla.com kvm4.infra.scl1.mozilla.com
master-puppet1.build.scl1.mozilla.com            kvm3.infra.scl1.mozilla.com kvm4.infra.scl1.mozilla.com
arr-client-test.build.scl1.mozilla.com           kvm3.infra.scl1.mozilla.com kvm4.infra.scl1.mozilla.com
autoland-staging02.build.scl1.mozilla.com        kvm3.infra.scl1.mozilla.com kvm4.infra.scl1.mozilla.com

ns2.infra.scl1.mozilla.com                       kvm4.infra.scl1.mozilla.com kvm3.infra.scl1.mozilla.com
admin1.infra.scl1.mozilla.com                    kvm4.infra.scl1.mozilla.com kvm3.infra.scl1.mozilla.com
dc02.winbuild.scl1.mozilla.com                   kvm4.infra.scl1.mozilla.com kvm3.infra.scl1.mozilla.com
ganglia1.build.scl1.mozilla.com                  kvm4.infra.scl1.mozilla.com kvm3.infra.scl1.mozilla.com
buildbot-master11.build.scl1.mozilla.com         kvm4.infra.scl1.mozilla.com kvm3.infra.scl1.mozilla.com
buildbot-master13.build.scl1.mozilla.com         kvm4.infra.scl1.mozilla.com kvm3.infra.scl1.mozilla.com
buildbot-master14.build.scl1.mozilla.com         kvm4.infra.scl1.mozilla.com kvm3.infra.scl1.mozilla.com
buildbot-master17.build.scl1.mozilla.com         kvm4.infra.scl1.mozilla.com kvm3.infra.scl1.mozilla.com
buildbot-master15.build.scl1.mozilla.com         kvm4.infra.scl1.mozilla.com kvm5.infra.scl1.mozilla.com
buildbot-master18.build.scl1.mozilla.com         kvm4.infra.scl1.mozilla.com kvm5.infra.scl1.mozilla.com
linux-hgwriter-slave04.build.scl1.mozilla.com    kvm4.infra.scl1.mozilla.com kvm6.infra.scl1.mozilla.com

buildbot-master4                                 kvm5.infra.scl1.mozilla.com kvm1.infra.scl1.mozilla.com
buildapi01.build.scl1.mozilla.com                kvm5.infra.scl1.mozilla.com kvm3.infra.scl1.mozilla.com
ganetiwebmanager1                                kvm5.infra.scl1.mozilla.com kvm6.infra.scl1.mozilla.com
puppet1.infra.scl1.mozilla.com                   kvm5.infra.scl1.mozilla.com kvm6.infra.scl1.mozilla.com
signing2.build.scl1.mozilla.com                  kvm5.infra.scl1.mozilla.com kvm6.infra.scl1.mozilla.com
releng-puppet1.build.scl1.mozilla.com            kvm5.infra.scl1.mozilla.com kvm6.infra.scl1.mozilla.com
buildbot-master25.build.scl1.mozilla.com         kvm5.infra.scl1.mozilla.com kvm6.infra.scl1.mozilla.com

signing1.build.scl1.mozilla.com                  kvm6.infra.scl1.mozilla.com kvm2.infra.scl1.mozilla.com
dev-master01.build.scl1.mozilla.com              kvm6.infra.scl1.mozilla.com kvm3.infra.scl1.mozilla.com
rabbit1-dev.build.scl1.mozilla.com               kvm6.infra.scl1.mozilla.com kvm5.infra.scl1.mozilla.com
buildbot-master21.build.scl1.mozilla.com         kvm6.infra.scl1.mozilla.com kvm5.infra.scl1.mozilla.com
buildbot-master23.build.scl1.mozilla.com         kvm6.infra.scl1.mozilla.com kvm5.infra.scl1.mozilla.com
buildbot-master24.build.scl1.mozilla.com         kvm6.infra.scl1.mozilla.com kvm5.infra.scl1.mozilla.com
I spent most of this afternoon looking at this with Ben Kero; ultimately we were not able to get useful debugging information from the process state on kvm3, nor were we able to conclusively identify the root cause. We do have a theory on one possible cause.

Based on the state of the machine prior to a reset, it behaved as if it had exhausted all available memory and/or swap. The principal indication of this is that ssh accepts a connection (indicating the kernel is running and able to accept) but no ssh banner is displayed, because userland is hosed.

Based on this behavior I have a hypothesis about what may have happened.

It seems there is a cron job that is run as root every minute[1], connecting to a sqlite3 database and printing some information. Based on the Python documentation[2], the library will obtain an exclusive lock on the database as part of the connection procedure. Any other processes that want access (even read-only) to the database will be blocked until the lock is released.

Something could have happened on the underlying disk blocking I/O, causing an unrecoverable lock on the database. As time went on, the processes could stack up until all system resources (primarily memory) were exhausted.

I was unable to find any data showing historical memory usage on autoland-staging02. If the data were available it might confirm this theory.

At a minimum I suggest serializing this cron job or establishing a timeout shorter than the execution interval.

[1] */1 * * * * source /home/autoland/autoland-env/bin/activate && python /var/www/cgi-bin/py.cgi > /var/www/html/test.html
[2] http://docs.python.org/library/sqlite3.html, sqlite3.connect
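The serialization suggested here can be sketched with flock(1) from util-linux. This is an illustrative sketch, not the actual crontab change: the lock path and messages are invented, and in the real crontab the whole job command would be wrapped in `flock -n`.

```shell
# Sketch: guard the every-minute job with a non-blocking flock so a stuck
# run cannot stack up behind itself.  Lock path is illustrative.
LOCK=$(mktemp /tmp/autoland-report.XXXXXX)

# The outer flock plays the role of a run that is still holding the lock;
# the inner flock is the next minute's cron firing while the first runs.
out=$(flock -n "$LOCK" -c "
  echo 'lock acquired: generating report'
  flock -n '$LOCK' -c 'echo should-not-run' || echo 'lock busy: skipping this run'
")
echo "$out"
```

Because the second flock exits non-zero instead of blocking, an overlapping invocation is skipped rather than queued behind a stuck one.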
Comment 13 (Reporter, 6 years ago)
buildbot-master12 got re-enabled at about 1650, but philor noticed that compile jobs were failing. 

I looked into one instance. The buildbot slave tries to send a chunk of progress log but the master doesn't respond, so the slave gives up the job & connection. (Releng: for kicks on windows we don't kill the existing process tree). To the master it looks like a second host with the same hostname wants to make a connection. This was time-correlated with these entries in the system log:
Mar 12 19:47:16 buildbot-master12 kernel: BUG: soft lockup - CPU#1 stuck for 20s! [swapper:0]
Mar 12 19:47:16 buildbot-master12 kernel: BUG: soft lockup - CPU#0 stuck for 32s! [swapper:0]

buildbot-master12 is disabled again.
Comment 14 (Reporter, 6 years ago)
Further info:
* the slave ends up inoperative because the master is trying to boot dupe connections all the time (really I think it has a stale connection the other side gave up on)
* this master had a great deal of free ram available and didn't need to be doing any swapping
* there were no nagios alerts about kvm hosts all evening
* based on comment #11 it's on kvm1
Comment 15 (Reporter, 6 years ago)
More issues have developed since the above:
* at 1:49 buildbot alerts on buildbot-master12 PING failure
* ganglia data stops at ~ 1:45, when the load spiked from ~0 to 10
* at 1:57 slavealloc alerts about http_expect "Service Check Timed Out" and 'Ganglia IO' being 'cpu_wio is 99.70'
* at 2:21 kvm1.infra.scl1:DRBD goes critical, socket timeout after 30 seconds

Hosted VM breakdown (per comment #11):
* slavealloc - not a tree closer but would like this back to be able to control slaves, ganglia reports slow increase in load
* releng-mirror01 - can be turned off, not in use, load at 80 and increasing linearly
* buildbot-master12 - can be turned off, can do without it
* scl-production-puppet - would close tree if it goes, acting normally so far
* redis01 - closes tree because data on completed jobs will be missing from tbpl, was fine until 2:55a then load increased rapidly to 25 and nonresponsive, CPU mostly in wait state
Comment 16 (Reporter, 6 years ago)
(In reply to Nick Thomas [:nthomas] from comment #15)
> * redis01 - closes tree because data on completed jobs will be missing from
> tbpl, was fine until 2:55a then load increased rapidly to 25 and
> nonresponsive, CPU mostly in wait state

For reasons I don't understand the redis process is still working fine, so the trees are open.
Also, autoland-staging02 appears to be idle at the moment.
Based on the above, I'm still very much leaning towards an issue with nodes kvm1 and kvm3.  To that end, I've migrated almost everything off of them.  Since redis01 is refusing to migrate without a failover (shutdown and then boot on the new node), catlee asked that I not move it.

If kvm2 gets too overloaded (CPU), I suggest we shut down linux-hgwriter-slave03 and releng-mirror01.  If kvm4 gets too overloaded, shut down linux-hgwriter-slave04.


Instance                                         Primary_node                Secondary_Nodes


redis01.build.scl1.mozilla.com                   kvm1.infra.scl1.mozilla.com kvm5.infra.scl1.mozilla.com

ns1.infra.scl1.mozilla.com                       kvm2.infra.scl1.mozilla.com kvm1.infra.scl1.mozilla.com
dc01.winbuild.scl1.mozilla.com                   kvm2.infra.scl1.mozilla.com kvm1.infra.scl1.mozilla.com
wds01.winbuild.scl1.mozilla.com                  kvm2.infra.scl1.mozilla.com kvm1.infra.scl1.mozilla.com
slavealloc.build.scl1.mozilla.com                kvm2.infra.scl1.mozilla.com kvm1.infra.scl1.mozilla.com
releng-mirror01.build.scl1.mozilla.com           kvm2.infra.scl1.mozilla.com kvm1.infra.scl1.mozilla.com
buildbot-master6.build.scl1.mozilla.com          kvm2.infra.scl1.mozilla.com kvm1.infra.scl1.mozilla.com
buildbot-master12.build.scl1.mozilla.com         kvm2.infra.scl1.mozilla.com kvm1.infra.scl1.mozilla.com
buildbot-master16.build.scl1.mozilla.com         kvm2.infra.scl1.mozilla.com kvm1.infra.scl1.mozilla.com
talos-addon-master1.amotest.scl1.mozilla.com     kvm2.infra.scl1.mozilla.com kvm1.infra.scl1.mozilla.com
linux-hgwriter-slave03.build.scl1.mozilla.com    kvm2.infra.scl1.mozilla.com kvm1.infra.scl1.mozilla.com
autoland-staging01.build.scl1.mozilla.com        kvm2.infra.scl1.mozilla.com kvm6.infra.scl1.mozilla.com

arr-test.build.scl1.mozilla.com                  kvm3.infra.scl1.mozilla.com kvm4.infra.scl1.mozilla.com
arr-client-test.build.scl1.mozilla.com           kvm3.infra.scl1.mozilla.com kvm4.infra.scl1.mozilla.com

ns2.infra.scl1.mozilla.com                       kvm4.infra.scl1.mozilla.com kvm3.infra.scl1.mozilla.com
admin1.infra.scl1.mozilla.com                    kvm4.infra.scl1.mozilla.com kvm3.infra.scl1.mozilla.com
master-puppet1.build.scl1.mozilla.com            kvm4.infra.scl1.mozilla.com kvm3.infra.scl1.mozilla.com
dc02.winbuild.scl1.mozilla.com                   kvm4.infra.scl1.mozilla.com kvm3.infra.scl1.mozilla.com
ganglia1.build.scl1.mozilla.com                  kvm4.infra.scl1.mozilla.com kvm3.infra.scl1.mozilla.com
buildbot-master11.build.scl1.mozilla.com         kvm4.infra.scl1.mozilla.com kvm3.infra.scl1.mozilla.com
buildbot-master13.build.scl1.mozilla.com         kvm4.infra.scl1.mozilla.com kvm3.infra.scl1.mozilla.com
buildbot-master14.build.scl1.mozilla.com         kvm4.infra.scl1.mozilla.com kvm3.infra.scl1.mozilla.com
buildbot-master17.build.scl1.mozilla.com         kvm4.infra.scl1.mozilla.com kvm3.infra.scl1.mozilla.com
buildbot-master15.build.scl1.mozilla.com         kvm4.infra.scl1.mozilla.com kvm5.infra.scl1.mozilla.com
buildbot-master18.build.scl1.mozilla.com         kvm4.infra.scl1.mozilla.com kvm5.infra.scl1.mozilla.com
autoland-staging02.build.scl1.mozilla.com        kvm4.infra.scl1.mozilla.com kvm6.infra.scl1.mozilla.com
linux-hgwriter-slave04.build.scl1.mozilla.com    kvm4.infra.scl1.mozilla.com kvm6.infra.scl1.mozilla.com

buildbot-master4                                 kvm5.infra.scl1.mozilla.com kvm1.infra.scl1.mozilla.com
scl-production-puppet-new.build.scl1.mozilla.com kvm5.infra.scl1.mozilla.com kvm2.infra.scl1.mozilla.com
buildapi01.build.scl1.mozilla.com                kvm5.infra.scl1.mozilla.com kvm3.infra.scl1.mozilla.com
ganetiwebmanager1                                kvm5.infra.scl1.mozilla.com kvm6.infra.scl1.mozilla.com
puppet1.infra.scl1.mozilla.com                   kvm5.infra.scl1.mozilla.com kvm6.infra.scl1.mozilla.com
signing2.build.scl1.mozilla.com                  kvm5.infra.scl1.mozilla.com kvm6.infra.scl1.mozilla.com
releng-puppet1.build.scl1.mozilla.com            kvm5.infra.scl1.mozilla.com kvm6.infra.scl1.mozilla.com
buildbot-master25.build.scl1.mozilla.com         kvm5.infra.scl1.mozilla.com kvm6.infra.scl1.mozilla.com

signing1.build.scl1.mozilla.com                  kvm6.infra.scl1.mozilla.com kvm2.infra.scl1.mozilla.com
dev-master01.build.scl1.mozilla.com              kvm6.infra.scl1.mozilla.com kvm3.infra.scl1.mozilla.com
rabbit1-dev.build.scl1.mozilla.com               kvm6.infra.scl1.mozilla.com kvm5.infra.scl1.mozilla.com
buildbot-master21.build.scl1.mozilla.com         kvm6.infra.scl1.mozilla.com kvm5.infra.scl1.mozilla.com
buildbot-master23.build.scl1.mozilla.com         kvm6.infra.scl1.mozilla.com kvm5.infra.scl1.mozilla.com
buildbot-master24.build.scl1.mozilla.com         kvm6.infra.scl1.mozilla.com kvm5.infra.scl1.mozilla.com


As an aside, many of the vms and the ganeti nodes themselves have been up over 200 days.  So if a node gets failed over, it does an automatic fsck and may take longer than expected to come back online.
We just had another occurrence of the same symptoms on kvm4 (which had not shown any issues prior to now).
Comment 20 (Reporter, 6 years ago)
Try, mozilla-inbound and mozilla-central are closed because of the test backlog that we're getting. Currently down 1 buildbot master for mac, linux and windows test, and one master for try compiles.
I was under the impression we had built out sufficient master capacity to handle losing one in each silo - either for rolling-reboots, or failures such as this one.  If that's no longer the case, we should talk elsewhere about why and how to fix it (and why master-capacity forecasting is so bad).
Comment 22 (Reporter, 6 years ago)
The trees are reopened now. bkero has the details on the infra.
Comment 23 (Assignee, 6 years ago)
dustin, lerxst, and I evacuated two nodes and spent some time replicating and diagnosing this problem on the evacuated nodes.

We can reproduce the problems by creating a test virtual machine, then artificially starving it of memory and digging into swap.  We estimate that eliminating drbd on hosts that can be recreated easily using puppet, and disabling swap inside these virtual machines, should prevent this problem from happening again.

I am looking into some new kernel parameters on hosts, and upgraded versions of qemu-kvm and drbd.
Status update:

I think we're all ragged and nobody has a full understanding of what's up, but we have a rough replication strategy and two nodes (kvm1 and kvm4) to play with.  The replication involves touching 400M of memory in a 256M VM, thereby swapping.  The symptoms are downright weird - kvm1 pegged itself so badly that we couldn't do anything but run 'ssh kvm1 killall kvm'.  kvm4 started hanging commands like 'cat', 'ls', 'ps', even over non-disk-backed areas like /sys, and refused to show anything but 0% CPU for any process.
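The reproduction step above (touching more memory than the guest has) can be sketched at harmless scale. This stand-in builds a ~40 MB string in the shell's own memory, rather than the 400 MB that was used to push a 256 MB guest into swap:

```shell
# Allocate and touch ~40 MB in this shell process.  tr rewrites the NUL
# bytes from /dev/zero into a byte the shell can store in a variable.
hog=$(head -c 40000000 /dev/zero | tr '\0' 'a')
echo "touched ${#hog} bytes"
```

Scaled up past the guest's RAM, the same pattern forces the kernel into swap, which is what triggered the pathological host-side behavior here.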

I suspect that disabling swap for all VMs would help here, since it means that a VM using too much RAM will just OOM, rather than causing IO mayhem on the node.  Disabling drbd might help too, but that has much worse implications (no live migration, no failover from a failed node; although bear in mind we're using raid10 here, so disk failure isn't a huge risk).  Just disabling swap still leaves us vulnerable to application-initiated IO mayhem, so I'd like to get to the bottom of this.
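Disabling guest swap amounts to `swapoff -a` at runtime plus removing the swap entry from /etc/fstab so it stays off across reboots. Since swapoff needs root, this sketch demonstrates only the fstab edit, on a throwaway copy; the device names are invented:

```shell
# Work on a throwaway copy; on a real guest this would be /etc/fstab,
# preceded by a root-only 'swapoff -a'.
cat > /tmp/fstab.demo <<'EOF'
/dev/vda1 /    ext4 defaults 0 1
/dev/vda2 swap swap defaults 0 0
EOF

# Comment out any swap line so the device is never swapped on at boot
sed -i '/ swap / s/^/#/' /tmp/fstab.demo
```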

Ben has a list of things to try:

 try hpet=disable noapic in the kernel
 upgrade the kernel
 add the qemu-kvm ppa
 upgrade from 0.12.3 to 0.14.1

For now, I think we should leave these two nodes vacant until we can nail this down.

Another option is to start kicking new nodes as RHEL6 like the other clusters, and cold-migrating VMs to them one by one.
This impacted last night's FF12.0b2 release.

I agree with comment #24 - if these systems are not reliable, we need to get RelEng-related production systems off them ASAP. Debugging this issue can continue later, but stopping the ongoing disruption to production feels like a priority.

How quickly can we move any production RelEng VMs off to new/different systems?  As this is already disrupting production, if an emergency tree closure / release schedule disruption makes this move quicker, we should do that.
Comment 26 (Assignee, 6 years ago)
joduinn,

Here's a list of those hosts.

autoland-staging01.build.scl1.mozilla.com
autoland-staging02.build.scl1.mozilla.com
buildapi01.build.scl1.mozilla.com
buildbot-master04.build.scl1.mozilla.com
buildbot-master06.build.scl1.mozilla.com
buildbot-master11.build.scl1.mozilla.com
buildbot-master12.build.scl1.mozilla.com
buildbot-master13.build.scl1.mozilla.com
buildbot-master14.build.scl1.mozilla.com
buildbot-master15.build.scl1.mozilla.com
buildbot-master16.build.scl1.mozilla.com
buildbot-master17.build.scl1.mozilla.com
buildbot-master18.build.scl1.mozilla.com
buildbot-master21.build.scl1.mozilla.com
buildbot-master23.build.scl1.mozilla.com
buildbot-master24.build.scl1.mozilla.com
buildbot-master25.build.scl1.mozilla.com
dev-master01.build.scl1.mozilla.com
ganglia1.build.scl1.mozilla.com
linux-hgwriter-slave03.build.scl1.mozilla.com
linux-hgwriter-slave04.build.scl1.mozilla.com
master-puppet1.build.scl1.mozilla.com
rabbit1-dev.build.scl1.mozilla.com
redis01.build.scl1.mozilla.com
releng-mirror01.build.scl1.mozilla.com
releng-puppet1.build.scl1.mozilla.com
scl-production-puppet-new.build.scl1.mozilla.com
signing1.build.scl1.mozilla.com
signing2.build.scl1.mozilla.com
slavealloc.build.scl1.mozilla.com
talos-addon-master1.amotest.scl1.mozilla.com
I talked more with bkero this morning about kvm in scl1.  Yesterday he performed a qemu upgrade on kvm nodes 1 and 4 and he has been unable to reproduce the issues that brought down the nodes/cluster.  He was able to reproduce this behavior after a simple reboot of the node (before patching), so we think we may have it licked *fingers crossed*.

This patch should *also* fix the issue we've seen with live migration stalling out and not transferring any of the RAM to the new node when a host is live-migrated to its secondary.

Because of the nature of the software upgrade, all of the vms will need to be failed over (rebooted, not live-migrated) to their secondary node to clear off the unpatched primary node for the patch.  Most of the vms are in some way redundant, but there are one or two that may require scheduling a tree closure (releng to determine which, if any).

Obviously the sooner we do this the better (to avoid another meltdown), so if we need a tree closure, we should start scheduling that now.  If we start to have another node melt down before an outage is scheduled, we will force a tree closure at that point to migrate hosts off of the ailing node.


Once all of the nodes are patched, we will additionally need to migrate some vms around to rebalance load for better performance, but hopefully this will be using live migration and not require any more reboots.
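The two kinds of moves being distinguished here correspond to two Ganeti commands, run from the cluster master node. The instance name below is just one example from the tables in this bug, and this is an illustrative sketch of the command shapes, not the exact procedure that was run:

```shell
# Reboot-style move onto the secondary node ("failed over" above);
# the guest shuts down and boots on the other side of its DRBD pair:
gnt-instance failover buildbot-master13.build.scl1.mozilla.com

# Live migration, no guest reboot (usable for rebalancing once both
# sides of the pair are patched):
gnt-instance migrate buildbot-master13.build.scl1.mozilla.com
```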

Here's the order in which I suggest fixing the nodes.  

Secondary_Nodes             Primary_node                Instance

fixing kvm3 (all will be moving onto a patched node (kvm4))

kvm4.infra.scl1.mozilla.com kvm3.infra.scl1.mozilla.com admin1.infra.scl1.mozilla.com
kvm4.infra.scl1.mozilla.com kvm3.infra.scl1.mozilla.com buildbot-master13.build.scl1.mozilla.com
kvm4.infra.scl1.mozilla.com kvm3.infra.scl1.mozilla.com buildbot-master14.build.scl1.mozilla.com
kvm4.infra.scl1.mozilla.com kvm3.infra.scl1.mozilla.com buildbot-master17.build.scl1.mozilla.com
kvm4.infra.scl1.mozilla.com kvm3.infra.scl1.mozilla.com dc02.winbuild.scl1.mozilla.com
kvm4.infra.scl1.mozilla.com kvm3.infra.scl1.mozilla.com ganglia1.build.scl1.mozilla.com
kvm4.infra.scl1.mozilla.com kvm3.infra.scl1.mozilla.com master-puppet1.build.scl1.mozilla.com
kvm4.infra.scl1.mozilla.com kvm3.infra.scl1.mozilla.com ns2.infra.scl1.mozilla.com

fixing kvm2 (all but one will be moving onto a patched node (kvm1))

kvm1.infra.scl1.mozilla.com kvm2.infra.scl1.mozilla.com buildbot-master12.build.scl1.mozilla.com
kvm1.infra.scl1.mozilla.com kvm2.infra.scl1.mozilla.com buildbot-master16.build.scl1.mozilla.com
kvm1.infra.scl1.mozilla.com kvm2.infra.scl1.mozilla.com buildbot-master6.build.scl1.mozilla.com
kvm1.infra.scl1.mozilla.com kvm2.infra.scl1.mozilla.com dc01.winbuild.scl1.mozilla.com
kvm1.infra.scl1.mozilla.com kvm2.infra.scl1.mozilla.com ns1.infra.scl1.mozilla.com
kvm1.infra.scl1.mozilla.com kvm2.infra.scl1.mozilla.com slavealloc.build.scl1.mozilla.com
kvm1.infra.scl1.mozilla.com kvm2.infra.scl1.mozilla.com talos-addon-master1.amotest.scl1.mozilla.com
kvm1.infra.scl1.mozilla.com kvm2.infra.scl1.mozilla.com wds01.winbuild.scl1.mozilla.com
kvm4.infra.scl1.mozilla.com kvm2.infra.scl1.mozilla.com buildbot-master11.build.scl1.mozilla.com

kvm6.infra.scl1.mozilla.com kvm2.infra.scl1.mozilla.com autoland-staging01.build.scl1.mozilla.com

fixing kvm5 (half will be moving onto a patched node but half will be moving onto an unpatched node)

kvm1.infra.scl1.mozilla.com kvm5.infra.scl1.mozilla.com buildbot-master4
kvm1.infra.scl1.mozilla.com kvm5.infra.scl1.mozilla.com redis01.build.scl1.mozilla.com
kvm2.infra.scl1.mozilla.com kvm5.infra.scl1.mozilla.com scl-production-puppet-new.build.scl1.mozilla.com
kvm3.infra.scl1.mozilla.com kvm5.infra.scl1.mozilla.com buildapi01.build.scl1.mozilla.com
kvm4.infra.scl1.mozilla.com kvm5.infra.scl1.mozilla.com buildbot-master15.build.scl1.mozilla.com
kvm4.infra.scl1.mozilla.com kvm5.infra.scl1.mozilla.com buildbot-master18.build.scl1.mozilla.com

kvm6.infra.scl1.mozilla.com kvm5.infra.scl1.mozilla.com buildbot-master25.build.scl1.mozilla.com
kvm6.infra.scl1.mozilla.com kvm5.infra.scl1.mozilla.com ganetiwebmanager1
kvm6.infra.scl1.mozilla.com kvm5.infra.scl1.mozilla.com puppet1.infra.scl1.mozilla.com
kvm6.infra.scl1.mozilla.com kvm5.infra.scl1.mozilla.com rabbit1-dev.build.scl1.mozilla.com
kvm6.infra.scl1.mozilla.com kvm5.infra.scl1.mozilla.com releng-puppet1.build.scl1.mozilla.com
kvm6.infra.scl1.mozilla.com kvm5.infra.scl1.mozilla.com signing2.build.scl1.mozilla.com

fixing kvm6 (all will be moving onto a patched node)

kvm4.infra.scl1.mozilla.com kvm6.infra.scl1.mozilla.com autoland-staging02.build.scl1.mozilla.com
kvm5.infra.scl1.mozilla.com kvm6.infra.scl1.mozilla.com buildbot-master21.build.scl1.mozilla.com
kvm5.infra.scl1.mozilla.com kvm6.infra.scl1.mozilla.com buildbot-master23.build.scl1.mozilla.com
kvm5.infra.scl1.mozilla.com kvm6.infra.scl1.mozilla.com buildbot-master24.build.scl1.mozilla.com
kvm2.infra.scl1.mozilla.com kvm6.infra.scl1.mozilla.com signing1.build.scl1.mozilla.com
kvm3.infra.scl1.mozilla.com kvm6.infra.scl1.mozilla.com dev-master01.build.scl1.mozilla.com

I chose to do kvm5 before kvm6 because that means fewer buildbot vms will need to be rebooted twice in order to migrate them.
Looks fine to me. kvm2/3 look like they can be evacuated without a downtime, given enough lead time to move slaves off the buildbot masters there.
(In reply to Amy Rich [:arich] [:arr] from comment #27)
> I talked more with bkero this morning about kvm in scl1.  Yesterday he
> performed a qemu upgrade on kvm nodes 1 and 4 and he has been unable to
> reproduce the issues that brought down the nodes/cluster.  He was able to
> reproduce this behavior after a simple reboot of the node (before patching),
> so we think we may have it licked *fingers crossed*.
>
> This patch should *also* fix the issue we've seen with live migration
> stalling out and not transferring any of the RAM to the new node when a host
> is live migrated to its secondary
Keeping fingers crossed, but very cool to hear.


> Because of the nature of the software upgrade, all of the vms will need to
> be failed over (rebooted, not live migration) to their secondary node to
> clear off the unpatched primary node for the patch.  Most of the vms are in
> some way redundant, but there are one or two that may require scheduling a
> tree closure (releng to determine which if any).
> 
> Obviously the sooner we do this the better (to avoid another melt down), so
> if we need a tree closure, we should start scheduling that now.  If we start
> to have another node melt down before an outage is scheduled, we will force
> a tree closure at that point to migrate hosts off of the ailing node.
Yep, let's get this done. I've flagged "needs-treeclosure" for buildduty to track.

To ask for a tree closure, can you give us approx:
1) how soon you'd be ready to start this
2) how long you'd need to complete your work
2a) we'll factor in the extra time needed for getting RelEng systems running again before we can reopen trees.



(In reply to Chris AtLee [:catlee] from comment #28)
> looks fine to me. kvm2/3 look like they can be evacuated without a downtime
> given enough lead time to move slaves off the buildbot masters there.
From irc, the work on kvm1,2,3 is already in progress. This will give us extra comfort that the fix is really working, and reduces our risk-exposure until the tree-closure.
Flags: needs-treeclosure?
Just an FYI, we are working on this today.  

kvm3 is patched.
kvm2 has one more vm to migrate off of it before we can patch.
kvm6 is in the process of being evacuated.
kvm5 will wait till next week because of redis01.
joduinn: I have yet to be informed that this will actually require a tree closure.  My discussion with coop/catlee (buildduty) has indicated that we may be able to do this without closing anything.
(In reply to Amy Rich [:arich] [:arr] from comment #31)
> joduinn: I have yet to be informed that this will actually require a tree
> closure.  My discussion with coop/catlee (buildduty) has indicated that we
> may be able to do this without closing anything.

arr: yep, understood that tree closure may not be required. Extra kudos if all this vm-shuffling can be done without needing a tree closure. Note that if a tree closure *is* needed, we'll do it to solve this disruptive bug.

Thanks again.
All nodes but kvm5 have been patched and rebooted.

Next week (or sooner if forced by issues), we need to move/reboot the following vms off of kvm6:

buildapi01.build.scl1.mozilla.com
buildbot-master25.build.scl1.mozilla.com
redis01.build.scl1.mozilla.com
signing2.build.scl1.mozilla.com

And then patch kvm6.

We can then live migrate some of the hosts back onto it (no downtime required).
That should say patch kvm5, not patch kvm6.
The remainder of the hosts were failed over this morning, and kvm5 has been patched. 

I'm going to open a separate bug to discuss how to balance the kvm nodes so that we don't have vms that are redundant for each other on the same nodes.
Severity: blocker → normal
(In reply to Amy Rich [:arich] [:arr] from comment #35)
> The remainder of the hosts were failed over this morning, and kvm5 has been
> patched. 

So are we done here then? Can this bug be closed?
The nodes have all been patched.  I opened a new bug to rebalance, so, yes.
Status: NEW → RESOLVED
Last Resolved: 6 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard