Closed
Bug 1011207
Opened 11 years ago
Closed 11 years ago
cancelled 2.3 mochitest jobs put ix slaves into weird state (and so need rebooting)
Categories
(Infrastructure & Operations Graveyard :: CIDuty, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: kmoir, Assigned: kmoir)
Attachments
(1 file)
1.07 KB, patch | mozilla: review+
edited irc convo
RyanVM|sheriffduty has something changed recently that might affect Android 2.3 tests?
RyanVM|sheriffduty we're seeing intermittent "Timed out while waiting for server startup." on multiple trees
RyanVM|sheriffduty https://tbpl.mozilla.org/php/getParsedLog.php?id=39753901&tree=Mozilla-Inbound
jlund RyanVM|sheriffduty: there was some changes from kmoir the other day IIRC
jlund http://hg.mozilla.org/build/buildbot-configs/rev/97e94672cbdc maybe
RyanVM|sheriffduty oddly, it seems to be quite recent (i.e. the last couple hours)
kmoir I enabled more 2.3 tests on Tuesday, haven't been changes since then
RyanVM|sheriffduty 12:43:11 INFO - !!! could not start server on port 8854: [Exception... "Component returned failure code: 0x80004005 (NS_ERROR_FAILURE) [nsIServerSocket.init]" nsresult: "0x80004005 (NS_ERROR_FAILURE)" location: "JS frame :: /builds/slave/talos-slave/test/build/tests/reftest/reftest/components/httpd.js :: <TOP_LEVEL> :: line 552" data: no]
RyanVM|sheriffduty kmoir: actually, appears to be limited to a couple slaves
kmoir okay
RyanVM|sheriffduty https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?name=talos-linux64-ix-027
RyanVM|sheriffduty https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?name=talos-linux64-ix-029
RyanVM|sheriffduty https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?name=talos-linux64-ix-034
RyanVM|sheriffduty https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?name=talos-linux64-ix-073
RyanVM|sheriffduty https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?name=talos-linux64-ix-076
RyanVM|sheriffduty https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?name=talos-linux64-ix-108
RyanVM|sheriffduty seems to be all of them so far
bhearsum aki: yeah
RyanVM|sheriffduty kmoir: so some seem to go green after managing to take a non-2.3 job
RyanVM|sheriffduty so I'm going to try rebooting the ones that haven't yet
kmoir okay
RyanVM|sheriffduty kmoir: interestingly, they all seem to trace back to a canceled Try run
jlund http://hg.mozilla.org/users/jlund_mozilla.com/mozharness/rev/1e2163e5222e
kmoir RyanVM|sheriffduty: huh, that's strange
jlund mshal: I just looked at the logs again. build_properties.json wasn't the issue
RyanVM|sheriffduty kmoir: all canceled around 11:58am
RyanVM|sheriffduty kmoir: I blame kats https://tbpl.mozilla.org/?tree=Try&jobname=Android%202.3&rev=ea344577ab5b
kmoir ah
jlund yup looks like it
RyanVM|sheriffduty kmoir: might still want to investigate, though
RyanVM|sheriffduty looks like every single one of those canceled jobs got the slave into a funky state
kmoir yeah it's strange the slaves got into a weird state
RyanVM|sheriffduty talos-linux64-ix-046 survived somehow
RyanVM|sheriffduty looks like it was only the ones that were running mochitests when they got canceled
RyanVM|sheriffduty reftest/crashtest slaves recovered OK
RyanVM|sheriffduty turns the rest of the investigation over to kmoir
Assignee
Updated•11 years ago
Assignee: nobody → kmoir
Updated•11 years ago
Hardware: x86 → x86_64
Summary: cancelled 2.3 mochitest jobs put ix slaves into weird state → cancelled 2.3 mochitest jobs put ix slaves into weird state (and so need rebooting)
Updated•11 years ago
OS: Mac OS X → Linux
Assignee
Comment 1•11 years ago
I was able to reproduce this problem on my dev-master.
If I cancel a job, the following happens in the next run and the tests don't run:
13:08:25 INFO - Running pre-action listener: _resource_record_pre_action
13:08:25 INFO - Running main action method: start_emulators
13:08:25 INFO - Let's kill every process called compiz
13:08:25 INFO - Killing pid 3734.
13:08:25 INFO - Attempting to establish symlink for /builds/slave/talos-slave/test/build/libGL.so
13:08:25 INFO - Symlinking /builds/slave/talos-slave/test/build/libGL.so -> /usr/lib/x86_64-linux-gnu/mesa/libGL.so.1
13:08:25 INFO - mkdir: /builds/slave/talos-slave/test/build
13:08:25 INFO - Attempt #1 to launch emulators...
13:08:25 INFO - Created temp file /tmp/tmpRqqRs8.
13:08:25 INFO - Trying to start the emulator with this command: emulator -avd test-1 -debug init,console,gles,memcheck,adbserver,adbclient,adb,avd_config,socket -port 5554 -qemu -m 1024 -cpu cortex-a9
13:08:25 INFO - Sleeping 10 seconds
13:08:35 INFO - Attempt #1 to redirect ports: (5554, 20701, 20700)
13:08:35 INFO - test-1: 5554; sut port: 20701/20700
13:08:35 INFO - Checking emulator test-1
13:08:35 INFO - Attempt #1 to connect to SUT on port 20701
13:08:35 INFO - Connected to port 20701
13:08:35 INFO - Trying again after EOF
13:08:35 INFO - Sleeping 30 seconds
13:09:05 INFO - Attempt #2 to connect to SUT on port 20701
13:09:05 INFO - Connected to port 20701
13:09:05 INFO - Trying again after EOF
13:09:05 INFO - Sleeping 30 seconds
13:09:35 INFO - Attempt #3 to connect to SUT on port 20701
13:09:35 INFO - Connected to port 20701
13:09:35 INFO - Trying again after EOF
13:09:35 INFO - Sleeping 30 seconds
13:10:05 INFO - Attempt #4 to connect to SUT on port 20701
13:10:05 INFO - Connected to port 20701
13:10:05 INFO - SUT response: $>
13:10:05 INFO - Attempt #1 to connect to emulator on port 5554
13:10:05 INFO - Connected to port 5554
13:10:05 INFO - Android Console: type 'help' for a list of commands
Looking at the slave, I see a connection to port 20701 held open (CLOSE_WAIT) by an orphaned xpcshell process, the mochitest HTTP server:
[root@talos-linux64-ix-005 talos-slave]# netstat -tupn
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 1 0 10.26.56.234:57587 91.189.89.144:80 CLOSE_WAIT 2621/ubuntu-geoip-p
tcp 1 0 127.0.0.1:45887 127.0.0.1:20701 CLOSE_WAIT 3886/xpcshell
tcp 0 0 10.26.56.234:45918 10.22.75.39:2003 ESTABLISHED 1244/collectd
tcp 0 224 10.26.56.234:22 10.22.248.198:54381 ESTABLISHED 2759/0
[root@talos-linux64-ix-005 talos-slave]# ps -ef | grep 3886
cltbld 3886 1 0 12:56 ? 00:00:02 /builds/slave/talos-slave/test/build/hostutils/bin/xpcshell -g /builds/slave/talos-slave/test/build/hostutils/xre -v 170 -f /builds/slave/talos-slave/test/build/hostutils/bin/components/httpd.js -e const _PROFILE_PATH = '/tmp/tmp1mD5Lr'; const _SERVER_PORT = '8854'; const _SERVER_ADDR = '10.0.2.2'; const _TEST_PREFIX = undefined; const _DISPLAY_RESULTS = false; -f /builds/slave/talos-slave/test/build/tests/mochitest/server.js
If I reboot, this process disappears:
[root@talos-linux64-ix-005 talos-slave]# reboot
Broadcast message from root@talos-linux64-ix-005
(/dev/pts/0) at 14:01 ...
The system is going down for reboot NOW!
[root@talos-linux64-ix-005 talos-slave]# Connection to talos-linux64-ix-005 closed by remote host.
Connection to talos-linux64-ix-005 closed.
mozillas-MacBook-Pro-2:~ kmoir$ ssh -l root talos-linux64-ix-005
Last login: Fri May 30 10:33:10 2014 from 10.22.248.198
Unauthorized access prohibited
[root@talos-linux64-ix-005 ~]# netstat -tupn
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 10.26.56.234:50998 10.22.75.39:2003 ESTABLISHED 1231/collectd
tcp 0 0 10.26.56.234:22 10.22.248.198:58766 ESTABLISHED 2264/2
So I'll test a patch to the scripts that kills this process when a new job starts, or perhaps reboots the slave entirely.
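The netstat check above could be automated along these lines. This is only a minimal sketch, not anything from the bug's patch: the function name `stale_connections` and the `SAMPLE` text (trimmed from the output above) are illustrative assumptions, and it assumes the column layout that `netstat -tupn` printed on this slave.

```python
def stale_connections(netstat_output, port, state="CLOSE_WAIT"):
    """Return (pid, program) pairs for TCP connections to `port` in `state`.

    Hypothetical helper: parses text in the `netstat -tupn` layout shown
    in the comment above (proto, recv-q, send-q, local, foreign, state,
    pid/program).
    """
    found = []
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip headers and anything that is not a complete TCP row.
        if len(fields) < 7 or not fields[0].startswith("tcp"):
            continue
        foreign, conn_state, owner = fields[4], fields[5], fields[6]
        if conn_state != state or not foreign.endswith(":%d" % port):
            continue
        pid, _, program = owner.partition("/")
        if pid.isdigit():
            found.append((int(pid), program))
    return found


# Rows copied from the netstat output in this comment.
SAMPLE = """\
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 1 0 10.26.56.234:57587 91.189.89.144:80 CLOSE_WAIT 2621/ubuntu-geoip-p
tcp 1 0 127.0.0.1:45887 127.0.0.1:20701 CLOSE_WAIT 3886/xpcshell
tcp 0 0 10.26.56.234:45918 10.22.75.39:2003 ESTABLISHED 1244/collectd
"""

print(stale_connections(SAMPLE, 20701))
```

Run against the sample rows, this reports the xpcshell process (pid 3886) as the owner of the stale connection to the SUT port.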
Assignee
Comment 2•11 years ago
Stopping the xpcshell process before we start the emulators seems to work in staging.
Attachment #8432571 - Flags: review?(aki)
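For readers without access to the attachment, the idea of stopping leftover xpcshell processes before the emulators start might look roughly like the sketch below. It is not the reviewed patch: `pids_by_name` and `kill_leftovers` are hypothetical names, and the approach (list processes with `ps`, then signal matches) is an assumption modelled on the existing "Let's kill every process called compiz" step visible in the log.

```python
import os
import signal
import subprocess


def pids_by_name(ps_output, name):
    """Return PIDs whose command name equals `name`.

    Expects headerless "PID COMMAND" lines, as printed by
    `ps -e -o pid= -o comm=`.
    """
    pids = []
    for line in ps_output.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[0].isdigit() and parts[1].strip() == name:
            pids.append(int(parts[0]))
    return pids


def kill_leftovers(name="xpcshell", kill=os.kill):
    """Send SIGKILL to every process called `name`; return the PIDs signalled."""
    ps_output = subprocess.check_output(
        ["ps", "-e", "-o", "pid=", "-o", "comm="]
    ).decode()
    killed = []
    for pid in pids_by_name(ps_output, name):
        try:
            kill(pid, signal.SIGKILL)
            killed.append(pid)
        except OSError:
            # The process already exited or is not ours to signal; skip it.
            pass
    return killed


# Illustrative process listing (pids taken from the logs in this bug).
SAMPLE_PS = "  3734 compiz\n  3886 xpcshell\n  1244 collectd\n"
print(pids_by_name(SAMPLE_PS, "xpcshell"))
```

Calling `kill_leftovers()` at the start of a job, before the emulator launch, would clear a server left behind by a cancelled run. Matching on the exact command name keeps the sketch from touching unrelated processes.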
Updated•11 years ago
Attachment #8432571 - Flags: review?(aki) → review+
Assignee
Comment 3•11 years ago
Sheriffs, please reopen if you see this issue again. It's been merged to the production branch. I couldn't reproduce it again on my dev-master after applying this fix.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Comment 4•11 years ago
Thank you for figuring this out! :-)
Comment 5•11 years ago
In production with reconfig on 2014-06-03 00:53 PT
Updated•7 years ago
Component: Platform Support → Buildduty
Product: Release Engineering → Infrastructure & Operations
Updated•5 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard