[mozprocess] Reader thread for an invalid process id doesn't finish on join()

RESOLVED FIXED in Firefox 49


3 years ago
3 years ago


(Reporter: vakila, Assigned: whimboo)


Version 3
Mac OS X

Firefox Tracking Flags

(firefox49 fixed)



(2 attachments)



3 years ago
When running firefox-ui-update tests with mach on Mac OS X (10.11), after interrupting with CTRL+C (after hang related to Bug 1276220), the tests seem to hang indefinitely at the step "Removing copy of the application at /var/folders/tmp-path-to/tmpXXXX.application.copy", and FirefoxNightly does not quit. However, afterwards the application copy seems to have been removed.
Ok, so the underlying problem here isn't related to anything in the firefox-ui harness but plainly with mozprocess. So what's happening:

With the landing of the Firefox updater changes for OS X on bug 394984, Firefox is now no longer keeping the process group on purpose to circumvent a couple of shutdown issues. Due to that our Firefox ui update tests are failing now (bug 1276220), and run into an automation timeout: "mozprocess timed out after 300 seconds running". It simply happens because the process id and group is no longer valid, and mozprocess fails to kill the application.

A more depth look into mozprocess showed me that we actually hang in `wait()` while waiting for the reader thread to finish. Given that `kill()` doesn't provide a timeout we actually do never exit this loop.



I don't feel it makes sense to wait for the reader thread while the underlying process is gone. I will check more all the code Julien added on bug 794984 to understand what's best here.
Assignee: nobody → hskupin
Component: Firefox UI Tests → Mozbase
QA Contact: hskupin
Summary: Firefox UI update tests hang on "Removing copy of the application" → [mozprocess] Reader thread for an invalid process id doesn't finish on join()
I was working with Andrew yesterday on further investigation of this problem. As it turns out the noticed hang happens when we try to join both the stdout and stderr reader threads:


The threads themselves do not see that the underlying process is dead, because the streams are still open but only don't receive data at that point. So a join will cause an infinite loop. To fix that we have to add another condition to the while loop in `_read()`, which actually checks the status of the process and `thread.is_alive()` will return False appropriately:


It's still interesting why the hang only happens when a new process group is getting created, but not when the process ends by itself. Maybe that is related to existing zombie processes at this time.

The simplest change we can do is to add a `_detached` property to the process class, so the stream readers can check it for their exit condition. We would have to set it in `kill()` when we know the process doesn't exist anymore.

The downside is that a check for `_detached` will only return a valid response when `kill()` was executed before. It may be good to have a valid response through the whole lifetime of the process. This is something I still want to check before uploading a patch which will fix the hang.
Created attachment 8759161 [details] [diff] [review]
Propagate exception similar to Windows

Ok, so this change might be kinda invasive given that we don't know how consumers of mozprocess actually handle a situation like that. Means if they catch an exception as thrown now. But when would we actually raise such an exception? We don't if the process doesn't exist anymore eg. the application closed itself or it crashed. So it should only be related if something has changed to the process group we started the initial process in. And this happens only if the application is restarted, and requests itself to spawn a new process group.

The only test harness I'm aware of which is capable of handling restarts nowadays is Marionette. Testing the change with our update tests it seems to work fine and we throw an exception which could then be picked up by Treeherder. Keep in mind that for now such a failure is not visible and tests are marked as successful. This is dangerous given that a Firefox instance can remain open and Marionette server is blocking port 2828. So no other tests on that machine can be run unless the instance gets killed manually.

For now I would like to get feedback from Andrew. If he is fine with it I will create a mozreview patch and trigger a try run.
Attachment #8759161 - Flags: feedback?(ahalberstadt)
Comment on attachment 8759161 [details] [diff] [review]
Propagate exception similar to Windows

Review of attachment 8759161 [details] [diff] [review]:

This looks ok to me.. I have a feeling it will cause issues in our tests though. I guess we'll see in the try run.

::: testing/mozbase/mozprocess/mozprocess/processhandler.py
@@ +147,2 @@
>                              if getattr(e, "errno", None) != 3:
> +                                raise OSError("Could not terminate process: %s" % e)

I'd rather just re-raise the original exception here, but if we must raise this, then at least put a traceback.print_exc() before.
Attachment #8759161 - Flags: feedback?(ahalberstadt) → feedback+
Created attachment 8759265 [details]
Bug 1276886 - [mozprocess] Fix hang in output readers when process is in a new process group.

In case of in-process restarts it can happen that the new process gets forked into a new process group.
When that happens we loose the capability to kill the process. To prevent a hang when joining the output
reader threads in wait(), we simply skip that call by passing-through the IO error.

Review commit: https://reviewboard.mozilla.org/r/57246/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/57246/
Attachment #8759265 - Flags: review?(ahalberstadt)
Attachment #8759265 - Flags: review?(ahalberstadt) → review+
Comment on attachment 8759265 [details]
Bug 1276886 - [mozprocess] Fix hang in output readers when process is in a new process group.


r+ assuming this doesn't break anything on try. There's a high probability this will also get raised when tests timeout/crash/etc. So we'll have to keep an eye out for reports like that.
The failures of the try run don't show any relationship to this change. Those are all test failures. There is one crash for reftests happening, and in this case we correctly report to Treeherder:


I'm going to push this to inbound now and will keep an eye out for possible side-effects in case of aborts or timeouts of test runs. How it affects our update tests we will know on Monday, whereby I'm off next week.

Comment 9

3 years ago
Last Resolved: 3 years ago
status-firefox49: --- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla49
You need to log in before you can comment on or make changes to this bug.