Closed Bug 994321 Opened 11 years ago Closed 11 years ago

10.5 hour wait time for mtnlion try test builds due to around 50 mtnlion slaves breaking last night

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jlund, Unassigned)

Details

No description provided.
Summary: 10.5 hour wait time for 10.8 try test builds due to around 50 10.8 slaves braking last night → 10.5 hour wait time for mtnlion try test builds due to around 50 mtnlion slaves breaking last night
last job from one of the test slaves that stopped taking jobs:

00:01:54 INFO - Calling ['/builds/slave/talos-slave/test/build/venv/bin/python', '-u', '/builds/slave/talos-slave/test/build/tests/mochitest/runtests.py', '--appname=/builds/slave/talos-slave/test/build/application/FirefoxNightly.app/Contents/MacOS/firefox', '--utility-path=tests/bin', '--extra-profile-file=tests/bin/plugins', '--symbols-path=https://ftp-ssl.mozilla.org/pub/mozilla.org/firefox/try-builds/mwoodrow@mozilla.com-1061354610d7/try-macosx64/firefox-31.0a1.en-US.mac.crashreporter-symbols.zip', '--certificate-path=tests/certs', '--autorun', '--close-when-done', '--console-level=INFO', '--quiet', '--chrome'] with output_timeout 1000
00:01:54 INFO - /builds/slave/talos-slave/test/build/venv/lib/python2.7/site-packages/mozrunner/utils.py:19: UserWarning: Module manifestparser was already imported from /builds/slave/talos-slave/test/build/tests/mochitest/manifestparser.py, but /builds/slave/talos-slave/test/build/venv/lib/python2.7/site-packages is being added to sys.path
00:01:54 INFO - import pkg_resources
00:01:55 INFO - MochitestServer : launching [u'/builds/slave/talos-slave/test/build/tests/bin/xpcshell', '-g', '/builds/slave/talos-slave/test/build/application/FirefoxNightly.app/Contents/MacOS', '-v', '170', '-f', '/builds/slave/talos-slave/test/build/tests/mochitest/httpd.js', '-e', "const _PROFILE_PATH = '/var/folders/vl/6t1nwr3x54v2b5fs6p8qnxjc00000w/T/tmp1ugbYm'; const _SERVER_PORT = '8888'; const _SERVER_ADDR = '127.0.0.1'; const _TEST_PREFIX = undefined; const _DISPLAY_RESULTS = false;", '-f', './server.js']
00:01:55 INFO - runtests.py | Server pid: 1259
00:01:56 INFO - runtests.py | Websocket server pid: 1260
00:01:56 INFO - Warning: test_bug357450.js from manifest /builds/slave/talos-slave/test/build/tests/mochitest/chrome/content/base/test/chrome.ini is not a valid test
00:01:56 INFO - Warning: test_bug380418.html^headers^ from manifest /builds/slave/talos-slave/test/build/tests/mochitest/chrome/content/base/test/chrome/chrome.ini is not a valid test
00:01:56 INFO - Warning: test_operator_app_install.js from manifest /builds/slave/talos-slave/test/build/tests/mochitest/chrome/dom/apps/tests/chrome.ini is not a valid test
00:01:56 INFO - Warning: test_bug336682.js from manifest /builds/slave/talos-slave/test/build/tests/mochitest/chrome/dom/events/test/chrome.ini is not a valid test
00:01:56 INFO - Warning: test_offsets.css from manifest /builds/slave/talos-slave/test/build/tests/mochitest/chrome/dom/tests/mochitest/general/chrome.ini is not a valid test
00:01:56 INFO - Warning: test_offsets.js from manifest /builds/slave/talos-slave/test/build/tests/mochitest/chrome/dom/tests/mochitest/general/chrome.ini is not a valid test
00:01:56 INFO - Warning: test_bug883784.jsm from manifest /builds/slave/talos-slave/test/build/tests/mochitest/chrome/dom/workers/test/chrome.ini is not a valid test
00:01:56 INFO - Warning: test_bug467669.css from manifest /builds/slave/talos-slave/test/build/tests/mochitest/chrome/layout/inspector/tests/chrome/chrome.ini is not a valid test
00:01:56 INFO - Warning: test_bug695639.css from manifest /builds/slave/talos-slave/test/build/tests/mochitest/chrome/layout/inspector/tests/chrome/chrome.ini is not a valid test
00:01:56 INFO - Warning: test_bug708874.css from manifest /builds/slave/talos-slave/test/build/tests/mochitest/chrome/layout/inspector/tests/chrome/chrome.ini is not a valid test
00:01:56 INFO - Warning: test_bug727834.css from manifest /builds/slave/talos-slave/test/build/tests/mochitest/chrome/layout/inspector/tests/chrome/chrome.ini is not a valid test
00:01:56 INFO - runtests.py | Running tests: start.
00:01:56 INFO - TEST-INFO | certutil: exit 0
00:01:56 INFO - TEST-INFO | certutil: exit 0
00:01:56 INFO - TEST-INFO | certutil: exit 0
00:01:56 INFO - TEST-INFO | certutil: exit 0
00:01:56 INFO - TEST-INFO | certutil: exit 0
00:01:56 INFO - TEST-INFO | certutil: exit 0
00:01:56 INFO - pk12util: PKCS12 IMPORT SUCCESSFUL
00:01:56 INFO - TEST-INFO | pk2util: exit 0
00:01:56 INFO - TEST-INFO | certutil: exit 0
00:01:56 INFO - INFO | runtests.py | SSL tunnel pid: 1269
00:01:56 INFO - INFO | runtests.py | Application pid: 1270
00:02:04 INFO - Apr 9 00:02:04 talos-mtnlion-r5-038.test.releng.scl3.mozilla.com firefox[1270] <Error>: clip: empty path.
00:02:04 INFO - Apr 9 00:02:04 talos-mtnlion-r5-038.test.releng.scl3.mozilla.com firefox[1270] <Error>: clip: empty path.
00:02:04 INFO - Apr 9 00:02:04 talos-mtnlion-r5-038.test.releng.scl3.mozilla.com firefox[1270] <Error>: clip: empty path.
The log in comment 1 was from talos-mtnlion-r5-038. slavealloc says it's still enabled, but I cannot reach it:

jlund@Hastings163:~ > ping talos-mtnlion-r5-038.test.releng.scl3.mozilla.com
PING talos-mtnlion-r5-038.test.releng.scl3.mozilla.com (10.26.56.58): 56 data bytes
Request timeout for icmp_seq 0
Request timeout for icmp_seq 1
Request timeout for icmp_seq 2
^C
--- talos-mtnlion-r5-038.test.releng.scl3.mozilla.com ping statistics ---
4 packets transmitted, 0 packets received, 100.0% packet loss
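The manual ping above could be repeated across a whole pool with a small script. The following is only a sketch, not what was run for this bug: it assumes a TCP connect to the SSH port (22) is an acceptable stand-in for ICMP ping, and the function names `is_reachable` and `unreachable_hosts` are made up for illustration.

```python
import socket


def is_reachable(host, port=22, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Assumption: an open SSH port is used as the liveness signal instead of
    ICMP ping, which normally needs elevated privileges or a subprocess.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and DNS lookup failures.
        return False


def unreachable_hosts(hosts, port=22, timeout=3.0):
    """Return the hosts from the given list that did not accept a connection."""
    return [h for h in hosts if not is_reachable(h, port, timeout)]
```

Calling `unreachable_hosts` with the slave hostnames would give the list of machines to look at first. A host that answers on port 22 can still be wedged in other ways, so this only narrows the search.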
I can reach the next slave in the list that went down: talos-mtnlion-r5-050. buildbot.tac reports the slave is disabled, but slavealloc says it's enabled. twistd.log just has the last command as its last line of output: 'desktop_unittest.py --suite reftest', but its last job log points to the same failure as well:

00:02:34 INFO - Apr 9 00:02:34 talos-mtnlion-r5-050.test.releng.scl3.mozilla.com firefox-bin[1255] <Error>: clip: empty path.
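Since both slaves end their last job log on the same message, one way to check other job logs for it is to count the 'clip: empty path.' lines per host. This is a rough sketch: the regular expression is inferred only from the two log excerpts quoted in this bug, and `clip_errors_by_host` is a hypothetical helper name.

```python
import re
from collections import Counter

# Matches syslog-style fragments such as:
#   "<host> firefox[1270] <Error>: clip: empty path."
# The single capture group is the hostname in front of the process name.
CLIP_ERROR = re.compile(r"(\S+) [\w.-]+\[\d+\] <Error>: clip: empty path\.")


def clip_errors_by_host(log_text):
    """Count 'clip: empty path.' error lines per reporting host."""
    return Counter(CLIP_ERROR.findall(log_text))
```

Feeding it the text of each slave's last job log would show which hosts hit the error and how often.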
Mac kernel issues involving 'clip:' logs:
https://bugzilla.mozilla.org/show_bug.cgi?id=536444#c5
Not sure if this is related or why we lost so many mtnlion machines.
callek discovered that slaveapi wasn't able to reboot slaves due to the new requirement of 'https' in our configs. Rebooting broken slaves now. It looks like a cset caused '<Error>: clip: empty path.' on a bunch of slaves, which did not let them reboot.
Still cannot manually (ssh) reach almost all mtnlion machines. slaveapi is still having issues, so I cannot reboot via that either:
https://callek.pastebin.mozilla.org/4780984
mtnlion machines were hung hard from a bad cset in try:
https://tbpl.mozilla.org/?tree=Try&rev=1061354610d7

buildbot set each of these builds to RETRY, causing this build to eat away at many of our mtnlion machines. Many reboots failed to recover even after pdu request. Here is what slaveapi filed after failed reboots:

994373 talos-mtnlion-r5-067 is unreachable 16:16:03
994375 talos-mtnlion-r5-085 is unreachable 16:16:29
994378 talos-mtnlion-r5-003 is unreachable 16:16:56
994379 talos-mtnlion-r5-063 is unreachable 16:17:10
994380 talos-mtnlion-r5-084 is unreachable 16:17:19
994381 talos-mtnlion-r5-079 is unreachable 16:17:47
994382 talos-mtnlion-r5-075 is unreachable 16:18:00
994383 talos-mtnlion-r5-069 is unreachable 16:18:02
994385 talos-mtnlion-r5-004 is unreachable 16:18:15
994387 talos-mtnlion-r5-081 is unreachable 16:18:28
994389 talos-mtnlion-r5-064 is unreachable 16:18:56
994390 talos-mtnlion-r5-060 is unreachable 16:21:09
994393 talos-mtnlion-r5-066 is unreachable 16:24:56
994396 talos-mtnlion-r5-008 is unreachable 16:25:13
994398 talos-mtnlion-r5-077 is unreachable 16:25:35
994399 talos-mtnlion-r5-002 is unreachable 16:25:58
994403 talos-mtnlion-r5-041 is unreachable 16:26:51
994404 talos-mtnlion-r5-073 is unreachable 16:27:04
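For follow-up, the slaveapi-filed entries above can be turned into structured records so the hostnames can be fed to other tooling. A minimal sketch, assuming each entry has the shape "<bug number> <slave name> is unreachable <HH:MM:SS>" exactly as pasted here; `parse_unreachable` is a hypothetical name, not part of slaveapi.

```python
import re

# One entry looks like: "994373 talos-mtnlion-r5-067 is unreachable 16:16:03"
ENTRY = re.compile(
    r"(\d+)\s+(talos-mtnlion-r5-\d+) is unreachable\s+(\d{2}:\d{2}:\d{2})"
)


def parse_unreachable(text):
    """Return (bug_number, slave_name, filed_time) tuples found in text.

    Works whether the entries are on separate lines or run together
    on a single line, since it searches the text as a whole.
    """
    return [(int(bug), host, when) for bug, host, when in ENTRY.findall(text)]
```

The slave names can then be pulled out with `[host for _, host, _ in parse_unreachable(text)]` and checked again after the reboots.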
Thanks Jordan. It looks like Van has resolved these, and I've begun rebooting them. So far looks good. Will update here again once reboots have finished.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard