Mozrunner has two threads in proc_handler.waitForFinish()

RESOLVED FIXED

Status

Testing
Mozbase
RESOLVED FIXED
6 years ago
6 years ago

People

(Reporter: ahal, Assigned: ahal)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(3 attachments, 3 obsolete attachments)

With the patch to bug 661343, a thread is implicitly spawned inside runner.start() to handle output.

This is fine, unless an application (like peptest) also calls self.runner.wait(). Now there are two threads both processing output. This in itself also isn't too bad. The problem occurs after output is finished and proc_handler.proc.wait() is called. In this method a "FINISHED" message is put on a queue, and will be consumed by the first thread to enter wait(). The next thread will timeout and a Queue.Empty exception will get raised.

See:
https://github.com/mozilla/mozbase/blob/eea3f2462c8fb8ec6056af940d2936ec5885d01e/mozrunner/mozrunner/runner.py#L183
https://github.com/mozilla/mozbase/blob/eea3f2462c8fb8ec6056af940d2936ec5885d01e/mozprocess/mozprocess/processhandler.py#L409

This is currently responsible for peptest bustage on windows:
https://tbpl.mozilla.org/?noignore=1

Not sure why it isn't breaking Linux (maybe a timing thing?)
Assignee: nobody → ahalberstadt
Status: NEW → ASSIGNED
Created attachment 645314 [details] [diff] [review]
Ideal patch that would break everything

I think the fix from bug 661343 is actually the right approach. Mozrunner should always read output, whether or not the application calls runner.wait().

I think the ideal fix is to go a step further and completely separate reading output from waiting. These seem like two separate tasks, and I'm not sure why they are tightly coupled.

The above patch achieves this, but alas, it will break everything that currently uses mozrunner/mozprocess. Sigh.
Could this be the reason why we are timing out in Mozmill too on Windows and Linux? See bug 764643.
Created attachment 645322 [details] [diff] [review]
Hackier patch that would break nothing

This patch would fix the problem. I made the default False since that way everything that uses mozrunner and calls runner.wait() won't need to be updated. But it means a change to mozmill will be needed.

I'm fine defaulting it to True, though other consumers of mozrunner may need to update their code in this case.

Anyway, maybe there's a compromise between this patch and the other one.
Attachment #645322 - Flags: review?(jhammel)
See Also: → bug 661343
(In reply to Henrik Skupin (:whimboo) from comment #2)
> Could this be the reason why we are timing out in Mozmill too on Windows and
> Linux? See bug 764643.

The stack trace you pasted for Windows looks like the same problem. This issue doesn't seem to happen for Peptest in Linux, but having two threads doing the same thing is pretty race condition prone, so who knows?
Blocks: 764643

Comment 5

6 years ago
Comment on attachment 645322 [details] [diff] [review]
Hackier patch that would break nothing

This will work.  TBH, I think we're going to have a lot of bugs related to the way we process output in a thread.  Sounds easy, but... That said, I don't have a better solution right now.  We (=someone) should really take a look at the whole mozprocess and mozrunner architecture and see if what we're doing is at all sane.
Attachment #645322 - Flags: review?(jhammel) → review+
IMO, the first patch I posted is the way things should be done. I just don't know how to land it without breaking everything.

This patch also won't help out mozmill at all. If mozmill sets read_output=True, mozmill will have the problem described here and if we leave it to False, mozmill will have the problem described in bug 661343.

The reason is that mozmill uses runner.wait() as it appears after applying the first patch (see https://github.com/mozilla/mozmill/blob/master/mozmill/mozmill/__init__.py#L451). Peptest on the other hand calls runner.wait() immediately after calling runner.start() so we see output immediately.

I'm going to see if I can come up with a better patch.
Created attachment 645402 [details] [diff] [review]
Patch 3

This patch should support both methods of using runner.start/wait (e.g peptest usage vs mozmill usage) as well as fix both the issue described here and the issue described in bug 661343 as well as maintain complete backwards compatibility.

It is a bit bulkier and therefore scarier than the other patch, so I wouldn't check this in until we do thorough testing on several platforms and test harnesses.

Let me know what you think.
Attachment #645402 - Flags: feedback?(jhammel)
Attachment #645402 - Flags: feedback?(ctalbert)

Comment 8

6 years ago
Comment on attachment 645402 [details] [diff] [review]
Patch 3

I'm fine with breaking things upstream (and thus requiring changes to them) if this is the right way to solve this problem. This patch looks pretty seamless and I really like it. If we take this and it requires changes in the other harnesses (peptest, mozmill etc) then we should see those diffs attached here as well and we should bump versions at that time. We can still control the entire ecosystem here so fixing things correctly is more important to me than ensuring we don't change APIs.

I like the approach.
Attachment #645402 - Flags: feedback?(ctalbert) → feedback+
Created attachment 645888 [details] [diff] [review]
Patch 4.0 - Don't worry about backwards compatibility

This seems to fix both issues in both peptest and mozmill (though in mozmill I see a ton of "Window Hidden" messages, is this a known issue?)

I will do testing in windows and mac for various harnesses tomorrow. Note that this will break backwards compatibility as the mozprocess arguments timeout and outputTimeout have been moved from mozrunner.wait() to mozrunner.start()
Attachment #645888 - Flags: review?(jhammel)
It just occurred to me that maybe it would be better to call "processOutput" from within mozprocess.run() instead of mozrunner.start(). Not sure which way is better...

Updated

6 years ago
Attachment #645402 - Flags: feedback?(jhammel)

Comment 11

6 years ago
Comment on attachment 645888 [details] [diff] [review]
Patch 4.0 - Don't worry about backwards compatibility

Looks good.  If we are going to break backwards compatibility, I'd like to coordinate a release, as there may be other things worth taking, like bug 758233
Attachment #645888 - Flags: review?(jhammel) → review+
I'm fine with coordinating a release, as long as we figure out what we need and get it done really quickly. This is currently blocking peptests from running on Windows (and may also fix the OSX issue).

Comment 13

6 years ago
Go ahead and release while I figure out what to do with bug 758233 which will also break backwards compatability
Created attachment 646200 [details] [diff] [review]
Patch 4.1 - Incremental

Sorry for the metric ton of reviews Jeff, but I realized that we should support this use case: https://github.com/mozilla/mozmill/blob/master/mozmill/mozmill/__init__.py#L451

Having a wait that you can't get out of seems like a bad idea. This will also make it more backwards compatible as it is only outputTimeout that will break.
Attachment #645314 - Attachment is obsolete: true
Attachment #645322 - Attachment is obsolete: true
Attachment #645402 - Attachment is obsolete: true
Attachment #646200 - Flags: review?(jhammel)

Comment 15

6 years ago
Is this really only for x86_64 Windows 7 or should that bugzilla flag be changed?
At the very least it's for all Windows versions, but it seems to work on Linux.
https://tbpl.mozilla.org/?noignore=1

Peptest is also busted on OSX 10.7, but it's unclear if it is caused by the same bug or not. In https://bugzilla.mozilla.org/show_bug.cgi?id=774817#c9 you said you aren't sure how we are getting a None there... well having two threads in that function at the same time might be one possible way of getting None :), though that's just a theory.

Comment 17

6 years ago
Comment on attachment 646200 [details] [diff] [review]
Patch 4.1 - Incremental

+            if not getattr(self.process_handler.proc, 'returncode', False):

Can we not check for if the return value of waitForFinish is None? (I honestly don't know).

Other than that, looks great!
Attachment #646200 - Flags: review?(jhammel) → review+
(In reply to Jeff Hammel [:jhammel] from comment #17)
> +            if not getattr(self.process_handler.proc, 'returncode', False):
> 
> Can we not check for if the return value of waitForFinish is None? (I
> honestly don't know).

I could make waitForFinish return True if there was no timeout and False if there was, but it seem a little less explicit. Also, runner.wait() always returns None, and Thread.join(timeout) also always returns None. They both require a second call to see if the process is still running.
master: https://github.com/mozilla/mozbase/commit/3a2acf4a0966ce34286cde843e67bbd78145c054

Needs to be synced to m-c again.
Status: ASSIGNED → RESOLVED
Last Resolved: 6 years ago
Resolution: --- → FIXED
This breaks mozprocess1.py, but the test should be updated as things are done differently now: http://k0s.org:8010/builders/mozbase%20-%20linux/builds/89/steps/shell_5/logs/stdio

I'll have a patch soon.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Created attachment 646620 [details] [diff] [review]
Fixes mozprocess tests

I updated the tests to work with this commit. I also found that there were a few problems with the mozprocess tests. 1) The detection algorithm for Unix was broken. 2) The assertion errors were very confusing, sometimes positive, sometimes negative. I standardized them.
Attachment #646620 - Flags: review?(jhammel)

Comment 22

6 years ago
Comment on attachment 646620 [details] [diff] [review]
Fixes mozprocess tests

+        for line in output.split('\n'):

.splitlines() would be better

+            elif processName in line and not 'defunct' in line:

We should really not be doing this in this test.  I'm tempted just to disable both of these tests until we have a real solution, but this is fine for now.  I'll just file another bug to fix them
Attachment #646620 - Flags: review?(jhammel) → review+
https://github.com/mozilla/mozbase/commit/59a058e64b6fa911969864285cd9a2ec4cfd707d
Status: REOPENED → RESOLVED
Last Resolved: 6 years ago6 years ago
Resolution: --- → FIXED

Comment 24

6 years ago
> +            elif processName in line and not 'defunct' in line:
> 
> We should really not be doing this in this test.  I'm tempted just to
> disable both of these tests until we have a real solution, but this is fine
> for now.  I'll just file another bug to fix them

Ticketed: bug 778267
Andrew, when do you plan to push updated versions of the affected packages to PyPI?
Depends on: 780841
Depends on: 781473
Depends on: 783132
You need to log in before you can comment on or make changes to this bug.