Open
Bug 1032335
Opened 10 years ago
Updated 2 years ago
steeplechase leaves processes running on test timeout
Categories
(Testing :: General, defect)
Tracking
(Not tracked)
NEW
People
(Reporter: drno, Unassigned)
Details
(Whiteboard: [webrtc-mochitest],[steeplechase])
If a steeplechase test run hangs (e.g. because the second client died early), the command eventually times out on the steeplechase side like this:
Exception in thread Client 1:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
    self.run()
  File "/home/mozilla/src/steeplechase/steeplechase/runsteeplechase.py", line 100, in run
    output = dm.shellCheckOutput(cmd, env=env)
  File "/usr/local/lib/python2.7/dist-packages/mozdevice-0.33-py2.7.egg/mozdevice/devicemanager.py", line 390, in shellCheckOutput
    retval = self.shell(cmd, buf, env=env, cwd=cwd, timeout=timeout, root=root)
  File "/usr/local/lib/python2.7/dist-packages/mozdevice-0.33-py2.7.egg/mozdevice/devicemanagerSUT.py", line 323, in shell
    self._sendCmds([{ 'cmd': '%s %s' % (cmd, cmdline) }], outputfile, timeout)
  File "/usr/local/lib/python2.7/dist-packages/mozdevice-0.33-py2.7.egg/mozdevice/devicemanagerSUT.py", line 135, in _sendCmds
    raise err
DMError: Automation Error: Timeout in command exec "MOZ_CRASHREPORTER_NO_REPORT=1,XPCOM_DEBUG_BREAK=warn,DISPLAY=:0" /tmp/tests/steeplechase/app/firefox-tee -no-remote -profile /tmp/tests/steeplechase/profile http://10.252.73.224:42868/index.html
The problem is that this actually leaves the remote command running in Negatus. So without further cleanup on the client/Negatus side, multiple processes end up running. And because steeplechase reuses the same directory, this probably causes all kinds of problems for future test runs.
Comment 1•10 years ago (Reporter)
I see the following potential solutions/improvements:
1) The code which waits for the other client to join the room on the simplesignaling server could have its own timeout and exit with an error, so that steeplechase/Negatus does not have to catch the generic timeout.
2) When steeplechase catches the timeout, it could try to clean up the processes it started.
3) When invoking a new test run, the client/Negatus side could run some cleanup (e.g. killall) before starting the new test run. (Note: this only works under the assumption that only one test executes at any given time, which is probably true for several reasons: usage of the /tmp/tests dir, usage of camera and microphone, ...)
I think we should probably implement not just one of these but, to be safe, 1 plus either 2 or 3.
Comment 2•10 years ago
We should definitely do 1 and 2. I'm not really wild about 3 as it makes testing on local systems a pain (Talos used to do this, and maybe still does, and it would kill your local browser. :-/)
I also wish the SUTAgent had smarter commands for dealing with things like this. I'm starting to suspect what we really want here is to push a Python script to each client that uses mozrunner to launch the browser, since mozrunner can handle timeouts etc. That means we have to have mozbase modules installed on the clients, which is sort of a pain, but not the end of the world. (We could push those down as well, I'm not sure where the right line is there.)
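As a rough illustration of the pushed-script idea, here is a minimal sketch. It assumes Python 3 on the client and uses only the standard library's subprocess module as a stand-in for mozrunner. The name run_with_timeout and its parameters are hypothetical and not part of steeplechase, Negatus, or mozbase.

```python
import subprocess


def run_with_timeout(cmd, timeout_s):
    """Run cmd and kill it if it is still alive after timeout_s seconds.

    Returns (exit_code, timed_out). Hypothetical helper, for illustration only.
    """
    proc = subprocess.Popen(cmd)
    try:
        return proc.wait(timeout=timeout_s), False
    except subprocess.TimeoutExpired:
        # Only the direct child is killed here. A wrapper script that spawns
        # the browser as a grandchild would need extra handling.
        proc.kill()
        proc.wait()  # reap the child so it does not linger
        return proc.returncode, True
```

With something like this running on the client, the timeout is enforced next to the browser process, so a timeout on the steeplechase side would no longer be the only thing standing between a hung test and a leftover process.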
Comment 3•10 years ago (Reporter)
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #2)
> We should definitely do 1 and 2. I'm not really wild about 3 as it makes
> testing on local systems a pain (Talos used to do this, and maybe still
> does, and it would kill your local browser. :-/)
Good point. In the light of bug 1036439 we should avoid 3.
I'm going to look into 1.
> I also wish the SUTAgent had smarter commands for dealing with things like
> this. I'm starting to suspect what we really want here is to push a Python
> script to each client that uses mozrunner to launch the browser, since
> mozrunner can handle timeouts etc. That means we have to have mozbase
> modules installed on the clients, which is sort of a pain, but not the end
> of the world. (We could push those down as well, I'm not sure where the
> right line is there.)
Actually that sounds like a better plan to me than extending the bash scripts we wrote for starting Firefox through Negatus. Python would also give us better portability than bash.
I would prefer to push the required Python modules as well and make it a local Python environment. That avoids having to keep software installation requirements in sync across multiple machines and OSes, which is a pain. And it allows you to easily throw any new machine into the mix.
Comment 4•10 years ago (Reporter)
Result from a brainstorming session we had today:
#1 should be implemented as two timeouts:
a) a timeout while waiting for "numclients" to get bigger than 1
b) a timeout while waiting for the other client to post its "test_loaded" message
Both of these can be found in webharness/harness.js.
And it would be nice to have tests for this. But we can do that in a separate ticket.
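The logic of the two timeouts can be sketched as follows. The real code lives in webharness/harness.js (JavaScript); this Python version only illustrates the deadline pattern. The names wait_until, WaitTimeout, and the state dictionary are hypothetical.

```python
import time


class WaitTimeout(Exception):
    """Raised when a condition does not become true before the deadline."""


def wait_until(predicate, timeout_s, what, poll_s=0.1):
    """Poll predicate() until it returns true or timeout_s seconds elapse."""
    deadline = time.monotonic() + timeout_s
    while not predicate():
        if time.monotonic() >= deadline:
            raise WaitTimeout("timed out after %.1fs waiting for %s" % (timeout_s, what))
        time.sleep(poll_s)


# Hypothetical usage mirroring (a) and (b); the 30 second values are
# placeholders, not numbers agreed on in this bug:
#   wait_until(lambda: state["numclients"] > 1, 30, "second client to join")
#   wait_until(lambda: state["test_loaded"], 30, "test_loaded from other client")
```

Failing with a specific message at each step would make it clear from the log which of the two waits gave up, instead of surfacing only the generic command timeout.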
Updated•10 years ago
Assignee: nobody → martijn.martijn
Comment 5•10 years ago
So, on Linux and Mac, this problem just leaves Firefox instances running with windows open, but future tests still work fine. The only real downside is that eventually the host will run out of memory or get very, very slow. It has to be cleaned up manually, but "killall firefox" does it easily.
On Windows, however, firefox.exe is still running, so the attempt to delete the steeplechase temp firefox directory fails, and Negatus hangs. This is a serious impediment to continued testing, as all future tests on this host will fail until the situation is manually addressed.
Could we get this looked at with higher priority?
Flags: needinfo?(ted)
Updated•10 years ago (Reporter)
Whiteboard: [webrtc-mochitest],[steeplechase]
Updated•10 years ago (Reporter)
Assignee: martijn.martijn → nobody
Comment 6•9 years ago
I don't think I have time to work on this currently, but I'd be happy to offer advice to someone else.
Flags: needinfo?(ted)
Updated•7 years ago (Assignee)
Component: New Frameworks → General
Comment 7•3 years ago
The bug assignee didn't log in to Bugzilla in the last 7 months, so the assignee is being reset.
Assignee: rshenthar → nobody
Updated•2 years ago
Severity: normal → S3