Closed Bug 1108844 Opened 10 years ago Closed 10 years ago

panda high pending - 200 pandas disconnected from buildbot after reconfig

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

Hardware: ARM
OS: Android
Type: task
Priority: Not set
Severity: critical

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jlund, Unassigned)

References

Details

Looks like on the 5th we had ~200 pandas stop running jobs: https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slavetype.html?class=test&type=panda

Now we have a high pending count for pandas, with many slaves waiting over an hour. I am not sure why yet, but since split-apk landed on the 5th, it is by far the most likely culprit.

Investigating a few pandas, you can see in twistd.log that the last few lines involve a reconfig and the killing of a job:

4749 2014-12-05 12:59:08-0800 [Broker,client] in dir /builds/panda-0200/test/. (timeout 2400 secs) (maxTime 14400 secs)
4750 2014-12-05 12:59:08-0800 [Broker,client] watching logfiles {}
4751 2014-12-05 12:59:08-0800 [Broker,client] argv: ['/tools/buildbot/bin/python', 'scripts/scripts/android_panda.py', '--cfg', 'android/android_panda_releng.py', '--mochitest-suite', 'mochitest-gl', '--blob-upload-branch', 'b2g-inbound', '--download-symbols', 'ondemand']
4752 2014-12-05 12:59:08-0800 [Broker,client] environment: {'SHELL': '/bin/sh', 'MOZ_HIDE_RESULTS_TABLE': '1', 'SHLVL': '4', 'PYTHONPATH': '/builds/sut_tools', 'OLDPWD': '/home/cltbld', 'PROPERTIES_FILE': '/builds/panda-0200/test/buildprops.json', 'SUT_NAME': 'panda-0200', 'PWD': '/builds/panda-0200/test', 'LOGNAME': 'cltbld', 'USER': 'cltbld', 'PATH': '/usr/local/bin:/usr/local/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/home/cltbld/bin', 'HOME': '/home/cltbld', 'SUT_IP': '10.26.129.95', '_': '/tools/buildbot/bin/python2.7'}
4753 2014-12-05 12:59:08-0800 [Broker,client] using PTY: False
4754 2014-12-05 12:59:12-0800 [-] Initiating shutdown because /builds/panda-0200/shutdown.stamp was touched
4755 2014-12-05 12:59:12-0800 [-] Telling the master we want to shutdown after any running builds are finished
4756 2014-12-05 13:00:01-0800 [Broker,client] removing old builder Android 4.0 Panda fx-team opt test robocop-1
4757 2014-12-05 13:00:01-0800 [Broker,client] removing old builder Android 4.0 Panda fx-team opt test robocop-3
4758 2014-12-05 13:00:01-0800 [Broker,client] removing old builder Android 4.0 Panda fx-team opt test robocop-2
4759 2014-12-05 13:00:01-0800 [Broker,client] removing old builder Android 4.0 Panda fx-team opt test robocop-5
... etc ...
5200 2014-12-05 13:00:02-0800 [Broker,client] removing old builder Android 4.0 Panda mozilla-central debug test jsreftest-3
5201 2014-12-05 13:00:02-0800 [Broker,client] removing old builder Android 4.0 Panda mozilla-central debug test jsreftest-2
5202 2014-12-05 13:00:02-0800 [Broker,client] removing old builder Android 4.0 Panda b2g-inbound opt test mochitest-gl
5203 2014-12-05 13:00:02-0800 [Broker,client] stopCommand: halting current command <buildslave.commands.shell.SlaveShellCommand instance at 0x22745a8>
5204 2014-12-05 13:00:02-0800 [Broker,client] command interrupted, attempting to kill
5205 2014-12-05 13:00:02-0800 [Broker,client] trying to kill process group 26163
5206 2014-12-05 13:00:02-0800 [Broker,client] signal 9 sent successfully
5207 2014-12-05 13:00:02-0800 [Broker,client] removing old builder Android 4.0 Panda maple opt test crashtest
5208 2014-12-05 13:00:02-0800 [Broker,client] removing old builder Android 4.0 Panda mozilla-central talos remote-troboprovider
5209 2014-12-05 13:00:02-0800 [Broker,client] removing old builder Android 4.0 Panda try talos remote-trobocheck2
... etc ...
5378 2014-12-05 13:00:02-0800 [Broker,client] removing old builder Android 4.0 Panda b2g-inbound opt test mochitest-7
5379 2014-12-05 13:00:02-0800 [Broker,client] removing old builder Android 4.0 Panda b2g-inbound talos remote-tspaint
5380 2014-12-05 13:00:02-0800 [Broker,client] I have a leftover directory 'talos-data' that is not being used by the buildmaster: you can delete it now
5381 2014-12-05 13:00:02-0800 [-] command finished with signal 9, exit code None, elapsedTime: 54.229844
5382 2014-12-05 13:00:02-0800 [-] SlaveBuilder.commandComplete None
5383 2014-12-05 13:00:02-0800 [-] but we weren't running, quitting silently
5384 2014-12-05 13:00:43-0800 [Broker,client] SlaveBuilder.remote_print(Android 4.0 armv7 API 10+ fx-team debug test mochitest-7): message from master: attached
5385 2014-12-05 13:00:43-0800 [Broker,client] SlaveBuilder.remote_print(Android 4.0 armv7 API 10+ mozilla-central debug test mochitest-8): message from master: attached
... etc ...
5992 2014-12-05 13:00:43-0800 [Broker,client] SlaveBuilder.remote_print(Android 4.0 armv7 API 10+ cypress talos remote-troboprovider): message from master: attached
5993 2014-12-05 13:00:43-0800 [Broker,client] SlaveBuilder.remote_print(Android 4.0 armv7 API 10+ jamun opt test mochitest-8): message from master: attached
5994 2014-12-05 13:00:43-0800 [Broker,client] SlaveBuilder.remote_print(Android 4.0 armv7 API 10+ larch opt test mochitest-7): message from master: attached
5995 2014-12-06 12:46:06-0800 [Broker,client] I have a leftover directory 'talos-data' that is not being used by the buildmaster: you can delete it now

Wrt watcher.log, I'm not noticing anything obvious. This lines up with my reconfig at ~13:00 to enable split-apk: https://hg.mozilla.org/build/buildbot-configs/rev/ff066fc73a76

So here's what I think happened: the reconfig killed off a bunch of builders (the pre-split single armv7 builds/tests), and the running panda test jobs subsequently got lost in the cosmos. Then the pandas themselves got put into a lost state.

Where do we go from here? Without knowing in detail what happened, I'd imagine a simple reboot of these pandas is the best course of action.
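For anyone following along, this triage can be scripted roughly as below. It's only a sketch: the twistd.log path and the match strings are assumptions lifted from the excerpt above (a panda that rode out the reconfig cleanly shouldn't end its log on this pattern), and it is not a supported tool.

#!/usr/bin/env python
# check_panda_log.py - rough triage sketch; assumed log layout, not a supported tool.
# Flags a panda as "likely lost" if the tail of its twistd.log contains the reconfig
# pattern seen above: shutdown.stamp touched, builders removed, and the running
# command killed with signal 9.
import sys

# Strings taken from the log excerpt above; adjust if your buildslave logs differ.
SHUTDOWN = "Initiating shutdown because"
REMOVED = "removing old builder"
KILLED = "command finished with signal 9"

def looks_lost(log_path, tail_lines=2000):
    with open(log_path) as f:
        tail = f.readlines()[-tail_lines:]
    saw = {SHUTDOWN: False, REMOVED: False, KILLED: False}
    for line in tail:
        for marker in saw:
            if marker in line:
                saw[marker] = True
    return all(saw.values())

if __name__ == "__main__":
    # Usage (hypothetical paths): python check_panda_log.py /builds/panda-*/test/twistd.log
    for path in sys.argv[1:]:
        state = "LIKELY LOST" if looks_lost(path) else "ok"
        print("%s: %s" % (path, state))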
Yeah, be nice if that would work.

https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=test&type=panda&name=panda-0253 is one of the ones I tried to reboot back to life. The two "400 Client Error: Bad Request" results were the nasty surprise: although slaveapi is now back to successfully filing problem tracking bugs for other platforms, for pandas it still fails, and the "unreachable" came from it failing once I filed the tracking bug for it.
(In reply to Phil Ringnalda (:philor) from comment #1)
> Yeah, be nice if that would work.
>
> https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=test&type=panda&name=panda-0253
> is one of the ones I tried to reboot back to life. The two "400 Client Error:
> Bad Request" results were the nasty surprise: although slaveapi is now back to
> successfully filing problem tracking bugs for other platforms, for pandas it
> still fails, and the "unreachable" came from it failing once I filed the
> tracking bug for it.

Hmm :( I should note that I was asked to change the split-apk boundary to api-9-10 and api-11+ (it was api-9 and api-10+), which means we will need to reconfig again tomorrow for the new builders. I'll coordinate with buildduty and the sheriffs on how to handle this bug alongside tomorrow's inbound reconfig.
Looks like that reconfig didn't actually happen until today, because we *like* tempting the fates by only doing reconfigs on Friday. Based on the state right now, I'd say we probably lost another 199 today, bringing the total to 420 dead and 107 alive. That's not enough to even survive a Friday night: we've got 51 pending (and by comparison, 0 pending for 10.8). So without a fix, we're looking at a tree closure Monday morning as soon as someone notices. Unless I'm wrong about today's 199, this stays critical until 7:30am Pacific Monday at the very latest, and it'll be a blocker then.
Severity: normal → critical
OS: Mac OS X → Android
Hardware: x86 → ARM
(In reply to Phil Ringnalda (:philor) from comment #3)
> Looks like that reconfig didn't actually happen until today, because we
> *like* tempting the fates by only doing reconfigs on Friday.

I will be taking a look at this tomorrow.
Blocks: 1111163
FTR - coop ended up doing a mass reboot of the long-broken pandas via slaveapi. philor mentioned issues with this in comment 1, so it may not work for all of them, but it is better than nothing. It looks like slave health is already reporting many of the 'long broken' slaves as working again; hopefully many correct themselves. Will report back later.
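For reference, the mass reboot boils down to a loop like the sketch below. This is only illustrative: the slaveapi base URL and the /slaves/<name>/actions/reboot path are assumptions rather than something verified against our deployment, and per comment 1 some pandas will still answer with "400 Client Error", so the loop just records failures instead of retrying.

#!/usr/bin/env python
# mass_reboot_pandas.py - illustrative sketch only; the endpoint path and base URL
# are assumptions, not verified against the slaveapi instance used here.
import requests

# Hypothetical slaveapi base URL; substitute the real one for your deployment.
SLAVEAPI = "https://slaveapi.pub.build.mozilla.org/slaves"

def reboot(slave):
    """Ask slaveapi to reboot one slave; return (slave, ok, detail)."""
    url = "%s/%s/actions/reboot" % (SLAVEAPI, slave)  # assumed endpoint layout
    try:
        resp = requests.post(url, timeout=60)
        resp.raise_for_status()
        return (slave, True, resp.status_code)
    except requests.RequestException as e:
        # e.g. the "400 Client Error: Bad Request" philor saw for pandas
        return (slave, False, str(e))

if __name__ == "__main__":
    # Hypothetical list; in practice this came from the slave health "long broken" report.
    pandas = ["panda-%04d" % n for n in (200, 253)]
    for slave in pandas:
        name, ok, detail = reboot(slave)
        print("%s: %s (%s)" % (name, "rebooted" if ok else "FAILED", detail))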
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard