Closed Bug 1011207 Opened 8 years ago Closed 8 years ago
cancelled 2.3 mochitest jobs put ix slaves into weird state (and so need rebooting)
edited irc convo:

<RyanVM|sheriffduty> has something changed recently that might affect Android 2.3 tests?
<RyanVM|sheriffduty> we're seeing intermittent "Timed out while waiting for server startup." on multiple trees
<RyanVM|sheriffduty> https://tbpl.mozilla.org/php/getParsedLog.php?id=39753901&tree=Mozilla-Inbound
<jlund> RyanVM|sheriffduty: there were some changes from kmoir the other day IIRC
<jlund> http://hg.mozilla.org/build/buildbot-configs/rev/97e94672cbdc maybe
<RyanVM|sheriffduty> oddly, it seems to be quite recent (i.e. the last couple hours)
<kmoir> I enabled more 2.3 tests on Tuesday, there haven't been changes since then
<RyanVM|sheriffduty> 12:43:11 INFO - !!! could not start server on port 8854: [Exception... "Component returned failure code: 0x80004005 (NS_ERROR_FAILURE) [nsIServerSocket.init]" nsresult: "0x80004005 (NS_ERROR_FAILURE)" location: "JS frame :: /builds/slave/talos-slave/test/build/tests/reftest/reftest/components/httpd.js :: <TOP_LEVEL> :: line 552" data: no]
<RyanVM|sheriffduty> kmoir: actually, appears to be limited to a couple slaves
<kmoir> okay
<RyanVM|sheriffduty> https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?name=talos-linux64-ix-027
<RyanVM|sheriffduty> https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?name=talos-linux64-ix-029
<RyanVM|sheriffduty> https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?name=talos-linux64-ix-034
<RyanVM|sheriffduty> https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?name=talos-linux64-ix-073
<RyanVM|sheriffduty> https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?name=talos-linux64-ix-076
<RyanVM|sheriffduty> https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?name=talos-linux64-ix-108
<RyanVM|sheriffduty> seems to be all of them so far
<RyanVM|sheriffduty> kmoir: so some seem to go green after managing to take a non-2.3 job
<RyanVM|sheriffduty> so I'm going to try rebooting the ones that haven't yet
<kmoir> okay
<RyanVM|sheriffduty> kmoir: interestingly, they all seem to trace back to a canceled Try run
<jlund> http://hg.mozilla.org/users/jlund_mozilla.com/mozharness/rev/1e2163e5222e
<kmoir> RyanVM|sheriffduty: huh, that's strange
<jlund> mshal: I just looked at the logs again. build_properties.json wasn't the issue
<RyanVM|sheriffduty> kmoir: all canceled around 11:58am
<RyanVM|sheriffduty> kmoir: I blame kats https://tbpl.mozilla.org/?tree=Try&jobname=Android%202.3&rev=ea344577ab5b
<kmoir> ah
<jlund> yup looks like it
<RyanVM|sheriffduty> kmoir: might still want to investigate, though
<RyanVM|sheriffduty> looks like every single one of those canceled jobs got the slave into a funky state
<kmoir> yeah it's strange the slaves got into a weird state
<RyanVM|sheriffduty> talos-linux64-ix-046 survived somehow
<RyanVM|sheriffduty> looks like it was only the ones that were running mochitests when they got canceled
<RyanVM|sheriffduty> reftest/crashtest slaves recovered OK
<RyanVM|sheriffduty> turns the rest of the investigation over to kmoir
Hardware: x86 → x86_64
Summary: cancelled 2.3 mochitest jobs put ix slaves into weird state → cancelled 2.3 mochitest jobs put ix slaves into weird state (and so need rebooting)
I was able to reproduce this problem on my dev-master. If I cancel a job, this happens in the next run and the tests don't run:

13:08:25 INFO - Running pre-action listener: _resource_record_pre_action
13:08:25 INFO - Running main action method: start_emulators
13:08:25 INFO - Let's kill every process called compiz
13:08:25 INFO - Killing pid 3734.
13:08:25 INFO - Attempting to establish symlink for /builds/slave/talos-slave/test/build/libGL.so
13:08:25 INFO - Symlinking /builds/slave/talos-slave/test/build/libGL.so -> /usr/lib/x86_64-linux-gnu/mesa/libGL.so.1
13:08:25 INFO - mkdir: /builds/slave/talos-slave/test/build
13:08:25 INFO - Attempt #1 to launch emulators...
13:08:25 INFO - Created temp file /tmp/tmpRqqRs8.
13:08:25 INFO - Trying to start the emulator with this command: emulator -avd test-1 -debug init,console,gles,memcheck,adbserver,adbclient,adb,avd_config,socket -port 5554 -qemu -m 1024 -cpu cortex-a9
13:08:25 INFO - Sleeping 10 seconds
13:08:35 INFO - Attempt #1 to redirect ports: (5554, 20701, 20700)
13:08:35 INFO - test-1: 5554; sut port: 20701/20700
13:08:35 INFO - Checking emulator test-1
13:08:35 INFO - Attempt #1 to connect to SUT on port 20701
13:08:35 INFO - Connected to port 20701
13:08:35 INFO - Trying again after EOF
13:08:35 INFO - Sleeping 30 seconds
13:09:05 INFO - Attempt #2 to connect to SUT on port 20701
13:09:05 INFO - Connected to port 20701
13:09:05 INFO - Trying again after EOF
13:09:05 INFO - Sleeping 30 seconds
13:09:35 INFO - Attempt #3 to connect to SUT on port 20701
13:09:35 INFO - Connected to port 20701
13:09:35 INFO - Trying again after EOF
13:09:35 INFO - Sleeping 30 seconds
13:10:05 INFO - Attempt #4 to connect to SUT on port 20701
13:10:05 INFO - Connected to port 20701
13:10:05 INFO - SUT response: $>
13:10:05 INFO - Attempt #1 to connect to emulator on port 5554
13:10:05 INFO - Connected to port 5554
13:10:05 INFO - Android Console: type 'help' for a list of commands

Looking at the slave, I see port 20701 held open by some xpcshell tests:

[root@talos-linux64-ix-005 talos-slave]# netstat -tupn
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        1      0 10.26.56.234:57587      22.214.171.124:80       CLOSE_WAIT  2621/ubuntu-geoip-p
tcp        1      0 127.0.0.1:45887         127.0.0.1:20701         CLOSE_WAIT  3886/xpcshell
tcp        0      0 10.26.56.234:45918      10.22.75.39:2003        ESTABLISHED 1244/collectd
tcp        0    224 10.26.56.234:22         10.22.248.198:54381     ESTABLISHED 2759/0
[root@talos-linux64-ix-005 talos-slave]# ps -ef | grep 3886
cltbld    3886     1  0 12:56 ?        00:00:02 /builds/slave/talos-slave/test/build/hostutils/bin/xpcshell -g /builds/slave/talos-slave/test/build/hostutils/xre -v 170 -f /builds/slave/talos-slave/test/build/hostutils/bin/components/httpd.js -e const _PROFILE_PATH = '/tmp/tmp1mD5Lr'; const _SERVER_PORT = '8854'; const _SERVER_ADDR = '10.0.2.2'; const _TEST_PREFIX = undefined; const _DISPLAY_RESULTS = false; -f /builds/slave/talos-slave/test/build/tests/mochitest/server.js

If I reboot, this process disappears:

[root@talos-linux64-ix-005 talos-slave]# reboot
Broadcast message from root@talos-linux64-ix-005 (/dev/pts/0) at 14:01 ...
The system is going down for reboot NOW!
[root@talos-linux64-ix-005 talos-slave]# Connection to talos-linux64-ix-005 closed by remote host.
Connection to talos-linux64-ix-005 closed.
mozillas-MacBook-Pro-2:~ kmoir$ ssh -l root talos-linux64-ix-005
Last login: Fri May 30 10:33:10 2014 from 10.22.248.198
Unauthorized access prohibited
[root@talos-linux64-ix-005 ~]# netstat -tupn
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 10.26.56.234:50998      10.22.75.39:2003        ESTABLISHED 1231/collectd
tcp        0      0 10.26.56.234:22         10.22.248.198:58766     ESTABLISHED 2264/2

So I'll test a patch to the scripts to kill this process when it starts a new job, or perhaps reboot the slave entirely.
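The "could not start server" exception above is just a bind failure: the stale xpcshell from the cancelled job is still holding the port, so the next job's httpd.js can't listen on it. A minimal sketch of how a harness could detect that condition up front (the `port_is_free` helper is hypothetical, not part of mozharness):

```python
import socket

def port_is_free(port, host="127.0.0.1"):
    """Return True if we can bind (host, port), i.e. no stale
    server from a previous job is still holding it."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.bind((host, port))
        return True
    except OSError:
        # EADDRINUSE (or similar): a leftover process owns the port
        return False
    finally:
        s.close()
```

A check like this before starting the test web server would turn the vague "Timed out while waiting for server startup" into an immediate, actionable failure.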
Stopping the xpcshell process before we start the emulators seems to work in staging.
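The fix follows the same pattern the script already uses for compiz ("Let's kill every process called compiz" in the log above). A rough sketch of that kill-by-name step, assuming a POSIX `ps`; this is an illustration, not the actual mozharness helper:

```python
import os
import signal
import subprocess

def kill_processes_by_name(name):
    """SIGTERM every process whose command line contains `name`,
    skipping ourselves. Returns the list of pids signalled."""
    out = subprocess.check_output(["ps", "-e", "-o", "pid=,args="], text=True)
    killed = []
    for line in out.splitlines():
        pid_str, _, args = line.strip().partition(" ")
        if name in args and int(pid_str) != os.getpid():
            try:
                os.kill(int(pid_str), signal.SIGTERM)
                killed.append(int(pid_str))
            except OSError:
                pass  # already exited, or not ours to kill
    return killed
```

Run against the stuck slave above, `kill_processes_by_name("xpcshell")` would have reaped pid 3886 and freed port 20701 without a full reboot.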
Attachment #8432571 - Flags: review?(aki)
Attachment #8432571 - Flags: review?(aki) → review+
Sheriffs, please reopen if you see this issue again. It's been merged to the production branch. I couldn't reproduce it again on my dev-master after applying this fix.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Thank you for figuring this out! :-)
In production with reconfig on 2014-06-03 00:53 PT
Component: Platform Support → Buildduty
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard
You need to log in before you can comment on or make changes to this bug.