Closed Bug 1011207 Opened 8 years ago Closed 8 years ago

cancelled 2.3 mochitest jobs put ix slaves into weird state (and so need rebooting)

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

x86_64
Linux
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: kmoir, Assigned: kmoir)

Attachments

(1 file)

edited irc convo

RyanVM|sheriffduty	has something changed recently that might affect Android 2.3 tests?
	RyanVM|sheriffduty	we're seeing intermittent "Timed out while waiting for server startup." on multiple trees
	RyanVM|sheriffduty	https://tbpl.mozilla.org/php/getParsedLog.php?id=39753901&tree=Mozilla-Inbound
	jlund	RyanVM|sheriffduty: there was some changes from kmoir the other day IIRC
	jlund	http://hg.mozilla.org/build/buildbot-configs/rev/97e94672cbdc maybe
RyanVM|sheriffduty	oddly, it seems to be quite recent (i.e. the last couple hours)
	kmoir	I enabled more 2.3 tests on Tuesday, haven't been changes since then
	RyanVM|sheriffduty	12:43:11 INFO - !!! could not start server on port 8854: [Exception... "Component returned failure code: 0x80004005 (NS_ERROR_FAILURE) [nsIServerSocket.init]" nsresult: "0x80004005 (NS_ERROR_FAILURE)" location: "JS frame :: /builds/slave/talos-slave/test/build/tests/reftest/reftest/components/httpd.js :: <TOP_LEVEL> :: line 552" data: no]
	RyanVM|sheriffduty	kmoir: actually, appears to be limited to a couple slaves
	kmoir	okay
	
	RyanVM|sheriffduty	https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?name=talos-linux64-ix-027
	RyanVM|sheriffduty	https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?name=talos-linux64-ix-029
	RyanVM|sheriffduty	https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?name=talos-linux64-ix-034
	RyanVM|sheriffduty	https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?name=talos-linux64-ix-073
	RyanVM|sheriffduty	https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?name=talos-linux64-ix-076
	RyanVM|sheriffduty	https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?name=talos-linux64-ix-108
	RyanVM|sheriffduty	seems to be all of them so far
	bhearsum	aki: yeah
	RyanVM|sheriffduty	kmoir: so some seem to go green after managing to take a non-2.3 job
	RyanVM|sheriffduty	so I'm going to try rebooting the ones that haven't yet
	kmoir	okay	
	RyanVM|sheriffduty	kmoir: interestingly, they all seem to trace back to a canceled Try run
	jlund	http://hg.mozilla.org/users/jlund_mozilla.com/mozharness/rev/1e2163e5222e
	kmoir	RyanVM|sheriffduty: huh, that's strange
	jlund	mshal: I just looked at the logs again. build_properties.json wasn't the issue
	RyanVM|sheriffduty	kmoir: all canceled around 11:58am
	RyanVM|sheriffduty	kmoir: I blame kats https://tbpl.mozilla.org/?tree=Try&jobname=Android%202.3&rev=ea344577ab5b
	kmoir	ah
	jlund	yup looks like it
	RyanVM|sheriffduty	kmoir: might still want to investigate, though
	RyanVM|sheriffduty	looks like every single one of those canceled jobs got the slave into a funky state
	kmoir	yeah it's strange the slaves got into a weird state
	RyanVM|sheriffduty	talos-linux64-ix-046 survived somehow
	RyanVM|sheriffduty	looks like it was only the ones that were running mochitests when they got canceled
	RyanVM|sheriffduty	reftest/crashtest slaves recovered OK
	RyanVM|sheriffduty	turns the rest of the investigation over to kmoir
Assignee: nobody → kmoir
Hardware: x86 → x86_64
Summary: cancelled 2.3 mochitest jobs put ix slaves into weird state → cancelled 2.3 mochitest jobs put ix slaves into weird state (and so need rebooting)
OS: Mac OS X → Linux
Depends on: 1018212
I was able to reproduce this problem on my dev-master

If I cancel a job, the next run logs the following and the tests never run:

13:08:25     INFO - Running pre-action listener: _resource_record_pre_action
13:08:25     INFO - Running main action method: start_emulators
13:08:25     INFO - Let's kill every process called compiz
13:08:25     INFO - Killing pid 3734.
13:08:25     INFO - Attempting to establish symlink for /builds/slave/talos-slave/test/build/libGL.so
13:08:25     INFO - Symlinking /builds/slave/talos-slave/test/build/libGL.so -> /usr/lib/x86_64-linux-gnu/mesa/libGL.so.1
13:08:25     INFO - mkdir: /builds/slave/talos-slave/test/build
13:08:25     INFO - Attempt #1 to launch emulators...
13:08:25     INFO - Created temp file /tmp/tmpRqqRs8.
13:08:25     INFO - Trying to start the emulator with this command: emulator -avd test-1 -debug init,console,gles,memcheck,adbserver,adbclient,adb,avd_config,socket -port 5554 -qemu -m 1024 -cpu cortex-a9
13:08:25     INFO - Sleeping 10 seconds
13:08:35     INFO -   Attempt #1 to redirect ports: (5554, 20701, 20700)
13:08:35     INFO - test-1: 5554; sut port: 20701/20700
13:08:35     INFO - Checking emulator test-1
13:08:35     INFO -   Attempt #1 to connect to SUT on port 20701
13:08:35     INFO - Connected to port 20701
13:08:35     INFO - Trying again after EOF
13:08:35     INFO - Sleeping 30 seconds
13:09:05     INFO -   Attempt #2 to connect to SUT on port 20701
13:09:05     INFO - Connected to port 20701
13:09:05     INFO - Trying again after EOF
13:09:05     INFO - Sleeping 30 seconds
13:09:35     INFO -   Attempt #3 to connect to SUT on port 20701
13:09:35     INFO - Connected to port 20701
13:09:35     INFO - Trying again after EOF
13:09:35     INFO - Sleeping 30 seconds
13:10:05     INFO -   Attempt #4 to connect to SUT on port 20701
13:10:05     INFO - Connected to port 20701
13:10:05     INFO - SUT response: $>
13:10:05     INFO -   Attempt #1 to connect to emulator on port 5554
13:10:05     INFO - Connected to port 5554
13:10:05     INFO - Android Console: type 'help' for a list of commands

Looking at the slave, I see a connection to port 20701 still held open (in CLOSE_WAIT) by a leftover xpcshell process, which turns out to be the mochitest HTTP server:

[root@talos-linux64-ix-005 talos-slave]# netstat -tupn
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        1      0 10.26.56.234:57587      91.189.89.144:80        CLOSE_WAIT  2621/ubuntu-geoip-p
tcp        1      0 127.0.0.1:45887         127.0.0.1:20701         CLOSE_WAIT  3886/xpcshell   
tcp        0      0 10.26.56.234:45918      10.22.75.39:2003        ESTABLISHED 1244/collectd   
tcp        0    224 10.26.56.234:22         10.22.248.198:54381     ESTABLISHED 2759/0          
[root@talos-linux64-ix-005 talos-slave]# ps -ef | grep 3886
cltbld    3886     1  0 12:56 ?        00:00:02 /builds/slave/talos-slave/test/build/hostutils/bin/xpcshell -g /builds/slave/talos-slave/test/build/hostutils/xre -v 170 -f /builds/slave/talos-slave/test/build/hostutils/bin/components/httpd.js -e const _PROFILE_PATH = '/tmp/tmp1mD5Lr'; const _SERVER_PORT = '8854'; const _SERVER_ADDR = '10.0.2.2'; const _TEST_PREFIX = undefined; const _DISPLAY_RESULTS = false; -f /builds/slave/talos-slave/test/build/tests/mochitest/server.js
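
The check above can also be scripted. The following is a minimal sketch, not part of the attached patch, that parses `netstat -tupn` output and lists the processes holding a connection to a given port in CLOSE_WAIT. The names `Conn`, `parse_netstat`, and `stale_holders` are hypothetical, and `SAMPLE` is simply the output pasted above.

```python
from typing import List, NamedTuple


class Conn(NamedTuple):
    local: str
    foreign: str
    state: str
    pid: int
    program: str


def parse_netstat(output: str) -> List[Conn]:
    """Parse the tcp rows of `netstat -tupn` output (hypothetical helper)."""
    conns = []
    for line in output.splitlines():
        fields = line.split()
        # Expected row: proto recv-q send-q local foreign state pid/program
        if len(fields) < 7 or not fields[0].startswith("tcp"):
            continue
        pid, sep, program = fields[6].partition("/")
        if not sep or not pid.isdigit():
            continue  # e.g. "-" when netstat cannot see the owning process
        conns.append(Conn(fields[3], fields[4], fields[5], int(pid), program))
    return conns


def stale_holders(output: str, port: int, state: str = "CLOSE_WAIT") -> List[Conn]:
    """Return connections in `state` whose local or foreign end is `port`."""
    suffix = ":%d" % port
    return [
        c for c in parse_netstat(output)
        if c.state == state
        and (c.local.endswith(suffix) or c.foreign.endswith(suffix))
    ]


# Output copied from the slave session above.
SAMPLE = """\
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        1      0 10.26.56.234:57587      91.189.89.144:80        CLOSE_WAIT  2621/ubuntu-geoip-p
tcp        1      0 127.0.0.1:45887         127.0.0.1:20701         CLOSE_WAIT  3886/xpcshell
tcp        0      0 10.26.56.234:45918      10.22.75.39:2003        ESTABLISHED 1244/collectd
tcp        0    224 10.26.56.234:22         10.22.248.198:54381     ESTABLISHED 2759/0
"""

if __name__ == "__main__":
    for conn in stale_holders(SAMPLE, 20701):
        print(conn.pid, conn.program, conn.foreign, conn.state)
```

Run against the sample, this reports only pid 3886 (xpcshell) for port 20701, matching what the manual inspection found.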


If I reboot, this process disappears

[root@talos-linux64-ix-005 talos-slave]# reboot

Broadcast message from root@talos-linux64-ix-005
	(/dev/pts/0) at 14:01 ...

The system is going down for reboot NOW!
[root@talos-linux64-ix-005 talos-slave]# Connection to talos-linux64-ix-005 closed by remote host.
Connection to talos-linux64-ix-005 closed.
mozillas-MacBook-Pro-2:~ kmoir$ ssh -l root talos-linux64-ix-005 
Last login: Fri May 30 10:33:10 2014 from 10.22.248.198
Unauthorized access prohibited
[root@talos-linux64-ix-005 ~]# netstat -tupn
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 10.26.56.234:50998      10.22.75.39:2003        ESTABLISHED 1231/collectd   
tcp        0      0 10.26.56.234:22         10.22.248.198:58766     ESTABLISHED 2264/2          

So I'll test a patch to the scripts that kills this process when a new job starts, or perhaps reboots the slave entirely.
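
For illustration only, here is a rough sketch of the first option (killing the leftover process before a new job starts). It is not the attached patch and does not use mozharness APIs. The helper names `find_leftover_pids` and `kill_leftovers` are hypothetical, and it assumes the stale server can be recognised by "xpcshell" and "server.js" appearing in its command line, as in the `ps` output above.

```python
import os
import signal
import subprocess
from typing import Iterable, List


def find_leftover_pids(ps_output: str,
                       markers: Iterable[str] = ("xpcshell", "server.js")) -> List[int]:
    """Return pids whose command line contains every marker.

    `ps_output` is expected to have one "PID ARGS" row per line,
    with no header row.
    """
    markers = tuple(markers)
    pids = []
    for line in ps_output.splitlines():
        pid, _, args = line.strip().partition(" ")
        if pid.isdigit() and all(m in args for m in markers):
            pids.append(int(pid))
    return pids


def kill_leftovers(dry_run: bool = True) -> List[int]:
    """Find leftover xpcshell servers and, unless dry_run, signal them."""
    # "-o pid= -o args=" prints pid and full command line without headers.
    out = subprocess.check_output(
        ["ps", "-e", "-o", "pid=", "-o", "args="], universal_newlines=True
    )
    pids = [p for p in find_leftover_pids(out) if p != os.getpid()]
    for pid in pids:
        if dry_run:
            continue
        try:
            # Assumption: SIGTERM is enough; a real fix might need SIGKILL.
            os.kill(pid, signal.SIGTERM)
        except OSError:
            pass  # process already exited or is not ours to signal
    return pids
```

Calling `kill_leftovers(dry_run=False)` at the start of a job, before the emulators are launched, would approximate the approach described here. The default `dry_run=True` only reports the matching pids.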
Attached patch: bug1011207.patch
Stopping the xpcshell process before we start the emulators seems to work in staging.
Attachment #8432571 - Flags: review?(aki)
Attachment #8432571 - Flags: review?(aki) → review+
Sheriffs, please reopen if you see this issue again. The fix has been merged to the production branch. I couldn't reproduce the problem on my dev-master after applying it.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Thank you for figuring this out! :-)
In production with reconfig on 2014-06-03 00:53 PT
Component: Platform Support → Buildduty
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard