Closed Bug 1276886 Opened 8 years ago Closed 8 years ago

[mozprocess] Reader thread for an invalid process id doesn't finish on join()

Tracking

(firefox49 fixed)

Status:

RESOLVED FIXED

Milestone:

mozilla49

Tracking Flags:

Tracking

Status

firefox49

---

fixed

People

(Reporter: vakila, Assigned: whimboo)

References

Details

Attachments

(2 files)

Propagate exception similar to Windows 8 years ago Henrik Skupin [:whimboo][⌚️UTC+1] 1.17 KB, patch	ahal : feedback+	Details \| Diff \| Splinter Review
Bug 1276886 - [mozprocess] Fix hang in output readers when process is in a new process group. 8 years ago Henrik Skupin [:whimboo][⌚️UTC+1] 58 bytes, text/x-review-board-request	ahal : review+	Details

Anjana Vakil (:vakila)

Reporter

Description

•

8 years ago

When running firefox-ui-update tests with mach on Mac OS X (10.11), after interrupting with CTRL+C (after hang related to Bug 1276220), the tests seem to hang indefinitely at the step "Removing copy of the application at /var/folders/tmp-path-to/tmpXXXX.application.copy", and FirefoxNightly does not quit. However, afterwards the application copy seems to have been removed.

Henrik Skupin [:whimboo][⌚️UTC+1]

Assignee

Updated

•

8 years ago

Blocks: 1276220

Henrik Skupin [:whimboo][⌚️UTC+1]

Assignee

Comment 1

•

8 years ago

Ok, so the underlying problem here isn't related to anything in the firefox-ui harness but plainly with mozprocess. So what's happening:

With the landing of the Firefox updater changes for OS X on bug 394984, Firefox is now no longer keeping the process group on purpose to circumvent a couple of shutdown issues. Due to that our Firefox ui update tests are failing now (bug 1276220), and run into an automation timeout: "mozprocess timed out after 300 seconds running". It simply happens because the process id and group is no longer valid, and mozprocess fails to kill the application.

A more depth look into mozprocess showed me that we actually hang in `wait()` while waiting for the reader thread to finish. Given that `kill()` doesn't provide a timeout we actually do never exit this loop.

`wait()`:
https://dxr.mozilla.org/mozilla-central/source/testing/mozbase/mozprocess/mozprocess/processhandler.py#805-813

`kill()`:
https://dxr.mozilla.org/mozilla-central/source/testing/mozbase/mozprocess/mozprocess/processhandler.py#742-745

I don't feel it makes sense to wait for the reader thread while the underlying process is gone. I will check more all the code Julien added on bug 794984 to understand what's best here.

Assignee: nobody → hskupin

Status: NEW → ASSIGNED

Component: Firefox UI Tests → Mozbase

QA Contact: hskupin

Summary: Firefox UI update tests hang on "Removing copy of the application" → [mozprocess] Reader thread for an invalid process id doesn't finish on join()

Henrik Skupin [:whimboo][⌚️UTC+1]

Assignee

Comment 2

•

8 years ago

I was working with Andrew yesterday on further investigation of this problem. As it turns out the noticed hang happens when we try to join both the stdout and stderr reader threads:

https://dxr.mozilla.org/mozilla-central/source/testing/mozbase/mozprocess/mozprocess/processhandler.py#921-924

The threads themselves do not see that the underlying process is dead, because the streams are still open but only don't receive data at that point. So a join will cause an infinite loop. To fix that we have to add another condition to the while loop in `_read()`, which actually checks the status of the process and `thread.is_alive()` will return False appropriately:

https://dxr.mozilla.org/mozilla-central/source/testing/mozbase/mozprocess/mozprocess/processhandler.py#856

It's still interesting why the hang only happens when a new process group is getting created, but not when the process ends by itself. Maybe that is related to existing zombie processes at this time.

The simplest change we can do is to add a `_detached` property to the process class, so the stream readers can check it for their exit condition. We would have to set it in `kill()` when we know the process doesn't exist anymore.

The downside is that a check for `_detached` will only return a valid response when `kill()` was executed before. It may be good to have a valid response through the whole lifetime of the process. This is something I still want to check before uploading a patch which will fix the hang.

Henrik Skupin [:whimboo][⌚️UTC+1]

Assignee

Comment 3

•

8 years ago

Attached patch Propagate exception similar to Windows — Details — Splinter Review

Ok, so this change might be kinda invasive given that we don't know how consumers of mozprocess actually handle a situation like that. Means if they catch an exception as thrown now. But when would we actually raise such an exception? We don't if the process doesn't exist anymore eg. the application closed itself or it crashed. So it should only be related if something has changed to the process group we started the initial process in. And this happens only if the application is restarted, and requests itself to spawn a new process group.

The only test harness I'm aware of which is capable of handling restarts nowadays is Marionette. Testing the change with our update tests it seems to work fine and we throw an exception which could then be picked up by Treeherder. Keep in mind that for now such a failure is not visible and tests are marked as successful. This is dangerous given that a Firefox instance can remain open and Marionette server is blocking port 2828. So no other tests on that machine can be run unless the instance gets killed manually.

For now I would like to get feedback from Andrew. If he is fine with it I will create a mozreview patch and trigger a try run.

Attachment #8759161 - Flags: feedback?(ahalberstadt)

Andrew Halberstadt [:ahal]

Comment 4

•

8 years ago

Comment on attachment 8759161 [details] [diff] [review]
Propagate exception similar to Windows

Review of attachment 8759161 [details] [diff] [review]:
-----------------------------------------------------------------

This looks ok to me.. I have a feeling it will cause issues in our tests though. I guess we'll see in the try run.

::: testing/mozbase/mozprocess/mozprocess/processhandler.py
@@ +147,2 @@
>                              if getattr(e, "errno", None) != 3:
> +                                raise OSError("Could not terminate process: %s" % e)

I'd rather just re-raise the original exception here, but if we must raise this, then at least put a traceback.print_exc() before.

Attachment #8759161 - Flags: feedback?(ahalberstadt) → feedback+

Henrik Skupin [:whimboo][⌚️UTC+1]

Assignee

Comment 5

•

8 years ago

Attached file Bug 1276886 - [mozprocess] Fix hang in output readers when process is in a new process group. — Details

In case of in-process restarts it can happen that the new process gets forked into a new process group.
When that happens we loose the capability to kill the process. To prevent a hang when joining the output
reader threads in wait(), we simply skip that call by passing-through the IO error.

Review commit: https://reviewboard.mozilla.org/r/57246/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/57246/

Attachment #8759265 - Flags: review?(ahalberstadt)

Andrew Halberstadt [:ahal]

Updated

•

8 years ago

Attachment #8759265 - Flags: review?(ahalberstadt) → review+

Andrew Halberstadt [:ahal]

Comment 6

•

8 years ago

Comment on attachment 8759265 [details]
Bug 1276886 - [mozprocess] Fix hang in output readers when process is in a new process group.

https://reviewboard.mozilla.org/r/57246/#review54092

r+ assuming this doesn't break anything on try. There's a high probability this will also get raised when tests timeout/crash/etc. So we'll have to keep an eye out for reports like that.

Henrik Skupin [:whimboo][⌚️UTC+1]

Assignee

Comment 7

•

8 years ago

The failures of the try run don't show any relationship to this change. Those are all test failures. There is one crash for reftests happening, and in this case we correctly report to Treeherder:

https://treeherder.mozilla.org/logviewer.html#?job_id=21899485&repo=try

I'm going to push this to inbound now and will keep an eye out for possible side-effects in case of aborts or timeouts of test runs. How it affects our update tests we will know on Monday, whereby I'm off next week.

Henrik Skupin [:whimboo][⌚️UTC+1]

Assignee

Comment 8

•

8 years ago

The patch got landed via Otto Länd on inbound:
https://hg.mozilla.org/integration/mozilla-inbound/rev/dd6fb8e96ae4b9b2c2150e0515919ca8985b672b

Carsten Book [:Tomcat]

Comment 9

•

8 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/dd6fb8e96ae4

Status: ASSIGNED → RESOLVED

Closed: 8 years ago

status-firefox49: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla49

You need to log in before you can comment on or make changes to this bug.

Bugzilla

Quick Search

[mozprocess] Reader thread for an invalid process id doesn't finish on join()

Categories

(Testing :: Mozbase, defect)

Tracking

(firefox49 fixed)

People

(Reporter: vakila, Assigned: whimboo)

References

Details

Crash Data

Security

(public)

User Story

Attachments

(2 files)

Description

Updated

Comment 1

Comment 2

Comment 3

Comment 4

Comment 5

Updated

Comment 6

Comment 7

Comment 8

Comment 9

Attachment

General

Description

File Name

Content Type