1032335 - steeplechase leaves processes running on test timeout

Reporter

Description

10 years ago

If a steeplechase test run times out (e.g. because the second client died early), steeplechase reports a timeout like this:

Exception in thread Client 1:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
    self.run()
  File "/home/mozilla/src/steeplechase/steeplechase/runsteeplechase.py", line 100, in run
    output = dm.shellCheckOutput(cmd, env=env)
  File "/usr/local/lib/python2.7/dist-packages/mozdevice-0.33-py2.7.egg/mozdevice/devicemanager.py", line 390, in shellCheckOutput
    retval = self.shell(cmd, buf, env=env, cwd=cwd, timeout=timeout, root=root)
  File "/usr/local/lib/python2.7/dist-packages/mozdevice-0.33-py2.7.egg/mozdevice/devicemanagerSUT.py", line 323, in shell
    self._sendCmds([{ 'cmd': '%s %s' % (cmd, cmdline) }], outputfile, timeout)
  File "/usr/local/lib/python2.7/dist-packages/mozdevice-0.33-py2.7.egg/mozdevice/devicemanagerSUT.py", line 135, in _sendCmds
    raise err
DMError: Automation Error: Timeout in command exec "MOZ_CRASHREPORTER_NO_REPORT=1,XPCOM_DEBUG_BREAK=warn,DISPLAY=:0" /tmp/tests/steeplechase/app/firefox-tee -no-remote -profile /tmp/tests/steeplechase/profile http://10.252.73.224:42868/index.html

The problem is that this leaves the remote command running in Negatus. Without further cleanup on the client/Negatus side, multiple processes end up running. And since steeplechase reuses the same directory, this probably causes all kinds of problems for future test runs.

Nils Ohlmeier [:drno]

Reporter

Comment 1

10 years ago

I see the following potential solutions/improvements:

1) The code which waits for the other client to join the room on the simplesignaling server could have a timeout and exit with an error, so that steeplechase/Negatus does not have to catch the generic timeout.
2) When steeplechase catches the timeout, could it try to clean up the processes it started?!
3) When invoking a new test run on the client/Negatus side, it could try to execute some cleanup (a.k.a. killall) before starting the new test run. (Note: that only works on the assumption that only one test executes at any given time, which is probably true for several reasons: usage of the /tmp/tests dir, usage of camera and microphone, ...)

I think we should probably implement not just one of these but, to be safe, 1 plus 2 or 3.
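
A minimal sketch of what option 2 could look like, written for Python 3 and not taken from the steeplechase code. ClientThread, CommandTimeout, run_command and cleanup are all hypothetical names; CommandTimeout merely stands in for the DMError seen in the traceback, and cleanup would be whatever call ends up killing the remote process.

```python
import threading


class CommandTimeout(Exception):
    """Hypothetical stand-in for the timeout error raised by the device manager."""


class ClientThread(threading.Thread):
    """Runs one client's command and cleans up if the command times out."""

    def __init__(self, run_command, cleanup):
        super().__init__()
        self.run_command = run_command  # callable that starts the remote command
        self.cleanup = cleanup          # callable that kills what was started
        self.output = None
        self.error = None

    def run(self):
        try:
            self.output = self.run_command()
        except CommandTimeout as err:
            # Record the error instead of letting it escape the thread,
            # then try to remove the processes this client started.
            self.error = err
            self.cleanup()
```

The caller can then inspect thread.error after join() and fail the run, instead of getting an unhandled exception in the thread and a process left behind.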

(not currently active) Ted Mielczarek

Comment 2

10 years ago

We should definitely do 1 and 2. I'm not really wild about 3 as it makes testing on local systems a pain (Talos used to do this, and maybe still does, and it would kill your local browser. :-/) I also wish the SUTAgent had smarter commands for dealing with things like this. I'm starting to suspect what we really want here is to push a Python script to each client that uses mozrunner to launch the browser, since mozrunner can handle timeouts etc. That means we have to have mozbase modules installed on the clients, which is sort of a pain, but not the end of the world. (We could push those down as well, I'm not sure where the right line is there.)
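
A rough sketch of the kind of launcher script that could be pushed to each client. It deliberately does not use mozrunner: Python 3's standard subprocess module stands in for it here, and run_with_timeout is a hypothetical name. The point is only that the client-side script owns the process and kills it itself when the timeout expires.

```python
import subprocess


def run_with_timeout(cmd, timeout):
    """Run cmd and return its exit code, or None if it had to be killed.

    cmd is an argument list (for example the browser binary plus its
    arguments); timeout is in seconds.
    """
    proc = subprocess.Popen(cmd)
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # The process outlived its budget: kill it and reap it so that
        # nothing is left running for the next test run.
        proc.kill()
        proc.wait()
        return None
```

Note that this only kills the direct child. A browser that spawns helper processes would need process-group or job-object handling on top of this, which is one reason a purpose-built runner is attractive.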

Nils Ohlmeier [:drno]

Reporter

Comment 3

10 years ago

(In reply to Ted Mielczarek [:ted.mielczarek] from comment #2)
> We should definitely do 1 and 2. I'm not really wild about 3 as it makes
> testing on local systems a pain (Talos used to do this, and maybe still
> does, and it would kill your local browser. :-/)

Good point. In the light of bug 1036439 we should avoid 3. I'm going to look into 1.

> I also wish the SUTAgent had smarter commands for dealing with things like
> this. I'm starting to suspect what we really want here is to push a Python
> script to each client that uses mozrunner to launch the browser, since
> mozrunner can handle timeouts etc. That means we have to have mozbase
> modules installed on the clients, which is sort of a pain, but not the end
> of the world. (We could push those down as well, I'm not sure where the
> right line is there.)

Actually that sounds like a better plan to me than us extending the bash scripts we wrote for starting Firefox through Negatus. Python would give us better portability as well (over bash). I would prefer to push the required Python modules as well and make it a local Python environment. That avoids having to keep software installation requirements in sync across multiple machines and OSes, which is a pain. And it lets you easily throw any new machine into the mix.

Nils Ohlmeier [:drno]

Reporter

Comment 4

10 years ago

Result from a brainstorming session we had today: #1 should be implemented as two timeouts:

a) a timeout while waiting for "numclients" to become greater than 1
b) a timeout while waiting for the other client to post its "test_loaded" message

Both of these can be found in webharness/harness.js. It would be nice to have tests for this, but we can do that in a separate ticket.
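
To illustrate the logic of the two timeouts: the real code lives in webharness/harness.js and is JavaScript, so the Python 3 sketch below only shows the control flow. The state dictionary, HarnessTimeout, wait_until and wait_for_peer are hypothetical; only the "numclients" and "test_loaded" names come from the harness.

```python
import time


class HarnessTimeout(Exception):
    """Hypothetical error raised when one of the two waits times out."""


def wait_until(predicate, timeout, what, poll=0.05):
    """Poll predicate until it is true or timeout seconds have passed."""
    deadline = time.monotonic() + timeout
    while not predicate():
        if time.monotonic() >= deadline:
            raise HarnessTimeout("timed out waiting for %s" % what)
        time.sleep(poll)


def wait_for_peer(state, join_timeout, load_timeout):
    """Apply timeouts a) and b) in order; state is a hypothetical shared dict."""
    # a) wait for a second client to join the room
    wait_until(lambda: state["numclients"] > 1, join_timeout,
               "second client to join")
    # b) wait for the other client to post its "test_loaded" message
    wait_until(lambda: state["test_loaded"], load_timeout,
               "test_loaded message")
```

Keeping the two waits separate means the error message says which phase failed, which the generic command timeout does not.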

Martijn Wargers (dead)

Updated

10 years ago

Assignee: nobody → martijn.martijn

Syd Polk :sydpolk

Comment 5

10 years ago

So, on Linux and Mac, this problem just leaves firefoxes running with windows open, but future tests still work fine. The only real downside is that eventually the host will run out of memory or get very, very slow. It has to be cleaned up manually, but "killall firefox" does it easily.

On Windows, however, firefox.exe is still running, so the attempt to delete the steeplechase temp firefox directory fails and Negatus hangs. This is a serious impediment to continued testing, as all future tests on this host will fail until the situation is manually addressed. Could we get this looked at with higher priority?

Flags: needinfo?(ted)

Nils Ohlmeier [:drno]

Reporter

Updated

10 years ago

Whiteboard: [webrtc-mochitest],[steeplechase]

Nils Ohlmeier [:drno]

Reporter

Updated

10 years ago

Assignee: martijn.martijn → nobody

rshenthar

Updated

9 years ago

Assignee: nobody → rshenthar

(not currently active) Ted Mielczarek

Comment 6

9 years ago

I don't think I have time to work on this currently, but I'd be happy to offer advice to someone else.

Flags: needinfo?(ted)

Nobody; OK to take it and work on it

Assignee

Updated

7 years ago

Component: New Frameworks → General

BugBot [:suhaib / :marco/ :calixte]

Comment 7

3 years ago

The bug assignee hasn't logged in to Bugzilla in the last 7 months, so the assignee is being reset.

Assignee: rshenthar → nobody

BMO Automation

Updated

2 years ago

Severity: normal → S3

Categories

(Testing :: General, defect)

Tracking

(Not tracked)

People

(Reporter: drno, Unassigned)

Details

(Whiteboard: [webrtc-mochitest],[steeplechase])

Security

(public)
