Closed Bug 1433905 Opened 7 years ago Closed 6 years ago

[mozprocess] ProcessHandler.poll() returns invalid process status when process quit itself

Tracking

(firefox64 fixed)

Status:

RESOLVED FIXED

Milestone:

mozilla64

Tracking Flags:

Tracking

Status

firefox64

---

fixed

People

(Reporter: whimboo, Assigned: whimboo)

References

(Blocks 3 open bugs)

Details

Attachments

(6 files, 6 obsolete files)

Bug 1433905 - [mozprocess] poll() always returns None for stopped process until wait() is called. 7 years ago Henrik Skupin [:whimboo][⌚️UTC+2] (away 10/03 - 10/13) 59 bytes, text/x-review-board-request		Details
Bug 1433905 - Bump mozprocess to version 0.26. 7 years ago Henrik Skupin [:whimboo][⌚️UTC+2] (away 10/03 - 10/13) 59 bytes, text/x-review-board-request		Details
Bug 1433905 - [mozprocess] check_for_detached has to return early if proc doesn't exist. 6 years ago Henrik Skupin [:whimboo][⌚️UTC+2] (away 10/03 - 10/13) 59 bytes, text/x-review-board-request		Details
Bug 1433905 - [mozprocess] pid has to be None if no process exists yet. 6 years ago Henrik Skupin [:whimboo][⌚️UTC+2] (away 10/03 - 10/13) 59 bytes, text/x-review-board-request		Details
Bug 1433905 - [mozprocess] Fix broken process location in test_process_output_nonewline. 6 years ago Henrik Skupin [:whimboo][⌚️UTC+2] (away 10/03 - 10/13) 59 bytes, text/x-review-board-request		Details
Bug 1433905 - [mozprocess] Existence of _handle on Windows doesn't mean the process is still alive. 6 years ago Henrik Skupin [:whimboo][⌚️UTC+2] (away 10/03 - 10/13) 59 bytes, text/x-review-board-request		Details
Bug 1433905 - Temporarily disable test_loader_global_sharing.py for opt/pgo builds on Windows. r?gbrown 6 years ago Henrik Skupin [:whimboo][⌚️UTC+2] (away 10/03 - 10/13) 46 bytes, text/x-phabricator-request	gbrown : review+	Details \| Review
Bug 1433905 - [mozprocess] Retrieving pid has to fail with RuntimeError if process hasn't been started yet. r?gbrown 6 years ago Henrik Skupin [:whimboo][⌚️UTC+2] (away 10/03 - 10/13) 46 bytes, text/x-phabricator-request	gbrown : review+	Details \| Review
Bug 1433905 - [mozprocess] "check_for_detached()" has to raise RuntimeError if process hasn't been started yet. r?gbrown 6 years ago Henrik Skupin [:whimboo][⌚️UTC+2] (away 10/03 - 10/13) 46 bytes, text/x-phabricator-request	gbrown : review+	Details \| Review
Bug 1433905 - [mozprocess] Fix broken path to process script for test_process_output_nonewline. r?gbrown 6 years ago Henrik Skupin [:whimboo][⌚️UTC+2] (away 10/03 - 10/13) 46 bytes, text/x-phabricator-request		Details \| Review
Bug 1433905 - [mozprocess] Existence of _handle on Windows doesn't mean the process is still alive. r?gbrown 6 years ago Henrik Skupin [:whimboo][⌚️UTC+2] (away 10/03 - 10/13) 46 bytes, text/x-phabricator-request		Details \| Review
Bug 1433905 - [mozprocess] poll() always returns None for stopped process until wait() is called. r?gbrown 6 years ago Henrik Skupin [:whimboo][⌚️UTC+2] (away 10/03 - 10/13) 46 bytes, text/x-phabricator-request		Details \| Review

Henrik Skupin [:whimboo][⌚️UTC+2] (away 10/03 - 10/13)

Assignee

Description

•

7 years ago

While working on bug 1433873 I found that the poll() method on the ProcessHandler class returns an invalid value for the process status when the process has quit itself. The reason are those two lines: https://dxr.mozilla.org/mozilla-central/rev/e22e2c9eb81686e958a9448ea3d1e8cd766743e2/testing/mozbase/mozprocess/mozprocess/processhandler.py#811-812 Those have been added about 4 years ago by myself on bug 965326. Basically this doesn't work because our `Process` class as inherited from `subprocess.Popen` always has this property set. So once the process got started the poll method of the `ProcessHandler` will return None, even when the application already quit.

Henrik Skupin [:whimboo][⌚️UTC+2] (away 10/03 - 10/13)

Assignee

Comment 1

•

7 years ago

The problem here simply is that `poll` never tries to join the output reader thread. So as long as `wait` is not called, `poll` will always return None. Compared to `wait` it should not try to join the reader thread in a loop, but at least do it once. First try build looks pretty good: https://treeherder.mozilla.org/#/jobs?repo=try&revision=a2cae4bbed129e8cc516d7b5349d3e8ddc560193 I finish the patch and have it up on mozreview soon.

Henrik Skupin [:whimboo][⌚️UTC+2] (away 10/03 - 10/13)

Assignee

Updated

•

7 years ago

Priority: -- → P2

Comment hidden (mozreview-request)

Henrik Skupin [:whimboo][⌚️UTC+2] (away 10/03 - 10/13)

Assignee

Comment 4

•

7 years ago

I pushed another try build as: https://treeherder.mozilla.org/#/jobs?repo=try&revision=9e9c646bcd3dfa8f9ce77e03f1933bcc3bc32acb&selectedJob=159143234

Henrik Skupin [:whimboo][⌚️UTC+2] (away 10/03 - 10/13)

Assignee

Updated

•

7 years ago

Attachment #8946361 - Flags: review?(ahalberstadt)

Henrik Skupin [:whimboo][⌚️UTC+2] (away 10/03 - 10/13)

Assignee

Comment 5

•

7 years ago

Somehow the Marionette unit tests on Android fail now: https://treeherder.mozilla.org/logviewer.html#?job_id=159146838&repo=try [task 2018-01-29T21:07:26.311Z] 21:07:26 INFO - {"thread": "MainThread", "level": "ERROR", "pid": 1006, "source": "Marionette test runner", "time": 1517260046310, "action": "log", "message": "Failure during harness execution", "stack": "Traceback (most recent call last):\n\n File \"/builds/worker/workspace/build/tests/marionette/harness/marionette_harness/runtests.py\", line 94, in cli\n failed = harness_instance.run()\n\n File \"/builds/worker/workspace/build/tests/marionette/harness/marionette_harness/runtests.py\", line 74, in run\n runner.run_tests(tests)\n\n File \"/builds/worker/workspace/build/venv/lib/python2.7/site-packages/marionette_harness/runner/base.py\", line 863, in run_tests\n self.fixture_servers = self.start_fixture_servers()\n\n File \"/builds/worker/workspace/build/venv/lib/python2.7/site-packages/marionette_harness/runner/base.py\", line 964, in start_fixture_servers\n if self.appName == \"fennec\":\n\n File \"/builds/worker/workspace/build/venv/lib/python2.7/site-packages/marionette_harness/runner/base.py\", line 682, in appName\n self._appName = self.capabilities.get('browserName')\n\n File \"/builds/worker/workspace/build/venv/lib/python2.7/site-packages/marionette_harness/runner/base.py\", line 672, in capabilities\n self.marionette.start_session()\n\n File \"/builds/worker/workspace/build/venv/lib/python2.7/site-packages/marionette_driver/decorators.py\", line 26, in _\n return func(*args, **kwargs)\n\n File \"/builds/worker/workspace/build/venv/lib/python2.7/site-packages/marionette_driver/marionette.py\", line 1211, in start_session\n self.start_binary(timeout)\n\n File \"/builds/worker/workspace/build/venv/lib/python2.7/site-packages/marionette_driver/marionette.py\", line 635, in start_binary\n self.instance.start()\n\n File \"/builds/worker/workspace/build/venv/lib/python2.7/site-packages/marionette_driver/geckoinstance.py\", line 415, in start\n self._update_profile(self.profile)\n\n File \"/builds/worker/workspace/build/venv/lib/python2.7/site-packages/marionette_driver/geckoinstance.py\", line 189, in _update_profile\n raise errors.MarionetteException(\"The current profile can only be updated \"\n\nMarionetteException: The current profile can only be updated when the instance is not running\n"} It's something I will have to check tomorrow.

Henrik Skupin [:whimboo][⌚️UTC+2] (away 10/03 - 10/13)

Assignee

Comment 6

•

7 years ago

Here the better formatted traceback from Python: > Traceback (most recent call last): > > File \"/builds/worker/workspace/build/tests/marionette/harness/marionette_harness/runtests.py\", line 94, in cli > failed = harness_instance.run() > > File \"/builds/worker/workspace/build/tests/marionette/harness/marionette_harness/runtests.py\", line 74, in run > runner.run_tests(tests) > > File \"/builds/worker/workspace/build/venv/lib/python2.7/site-packages/marionette_harness/runner/base.py\", line 863, in run_tests > self.fixture_servers = self.start_fixture_servers() > > File \"/builds/worker/workspace/build/venv/lib/python2.7/site-packages/marionette_harness/runner/base.py\", line 964, in start_fixture_servers > if self.appName == \"fennec\": > > File \"/builds/worker/workspace/build/venv/lib/python2.7/site-packages/marionette_harness/runner/base.py\", line 682, in appName > self._appName = self.capabilities.get('browserName') > > File \"/builds/worker/workspace/build/venv/lib/python2.7/site-packages/marionette_harness/runner/base.py\", line 672, in capabilities > self.marionette.start_session() > > File \"/builds/worker/workspace/build/venv/lib/python2.7/site-packages/marionette_driver/decorators.py\", line 26, in _ > return func(*args, **kwargs) > > File \"/builds/worker/workspace/build/venv/lib/python2.7/site-packages/marionette_driver/marionette.py\", line 1211, in start_session > self.start_binary(timeout) > > File \"/builds/worker/workspace/build/venv/lib/python2.7/site-packages/marionette_driver/marionette.py\", line 635, in start_binary > self.instance.start() > > File \"/builds/worker/workspace/build/venv/lib/python2.7/site-packages/marionette_driver/geckoinstance.py\", line 415, in start > self._update_profile(self.profile) > > File \"/builds/worker/workspace/build/venv/lib/python2.7/site-packages/marionette_driver/geckoinstance.py\", line 189, in _update_profile > raise errors.MarionetteException(\"The current profile can only be updated \" > > MarionetteException: The current profile can only be updated when the instance is not running What I don't understand yet is how this is related to this patch but not to bug 1429759, which landed a couple of days ago.

Henrik Skupin [:whimboo][⌚️UTC+2] (away 10/03 - 10/13)

Assignee

Comment 7

•

7 years ago

Actually the problem here is that for Fennec Marionette uses `adb` to install and execute the binary. Sadly `adb` forks itself and leaves behind a zombie process. When Marionette checks for the process status via the mozrunner.returncode (uses mozprocess.poll() internally) we join the reader threads because no more output is coming from this zombie process, and return 0 for a normal exit. But that's not true, because adb is still running. I'm not sure how to continue here for the moment. I will file a mozprocess bug in the hope that we can get better tracking support for forked (detached) processes.

Henrik Skupin [:whimboo][⌚️UTC+2] (away 10/03 - 10/13)

Assignee

Updated

•

7 years ago

Depends on: 1434878

Henrik Skupin [:whimboo][⌚️UTC+2] (away 10/03 - 10/13)

Assignee

Comment 8

•

7 years ago

The command as used on Android to start the application is: /builds/worker/workspace/build/android-sdk-linux/platform-tools/adb -s emulator-5554 shell am start -a android.activity.MAIN -n org.mozilla.fennec_aurora/org.mozilla.gecko.BrowserApp --es args '-no-remote -profile /storage/sdcard/tests/profile -marionette' --es env0 MOZ_CRASHREPORTER=1 --es env1 R_LOG_VERBOSE=1 --es env2 MOZ_HIDE_RESULTS_TABLE=1 --es env3 MOZ_LOG=signaling:3,mtransport:4,DataChannel:4,jsep:4,MediaPipelineFactory:4 --es env4 MOZ_CRASHREPORTER_SHUTDOWN=1 --es env5 R_LOG_DESTINATION=stderr --es env6 MOZ_CRASHREPORTER_NO_REPORT=1 --es env7 NO_EM_RESTART=1 --es env8 MOZ_PROCESS_LOG=/tmp/tmpbGGTxNpidlog --es env9 R_LOG_LEVEL=6 Starting the application actually happens in launch_application: https://dxr.mozilla.org/mozilla-central/rev/0d806b3230fe4767fa70cdee57db87d1e9a5ba49/testing/mozbase/mozdevice/mozdevice/adb_android.py#304 Later in Marionette we try to check the status of Fennec via `wait()` in my patch. But there is no override for that method in device.py of mozrunner, and as such we check the process status of adb on the host. This is bad because adb runs as daemon at this point, and the exit code is 0! Only `is_running` would work because it has custom code: https://dxr.mozilla.org/mozilla-central/rev/0d806b3230fe4767fa70cdee57db87d1e9a5ba49/testing/mozbase/mozrunner/mozrunner/base/device.py#124 So I really wonder why all the time both `wait()` and `poll()` in mozrunner specifically for device instances do not have custom code, and falsely evaluate the host process (adb) for the exit status. Bob and Geoff, do you have some input here?

Flags: needinfo?(gbrown)

Flags: needinfo?(bob)

Geoff Brown [:gbrown]

Comment 9

•

7 years ago

Other than Marionette, all the other emulator unit tests use different code for this sort of thing. Notably, they do not use mozrunner/base/device.py. Also, they use devicemanagerADB.py instead of adb_android.py, although I suspect adb_android.py is superior to devicemanager now, and I've been contemplating porting from devicemanager to adb_android.py. In general, in devicemanager, we might wait for an adb command to complete and might even check for its return code, but that's just to catch things like a crash in adb or a missing adb binary. Usually, we also check the output from the adb command. Launching Firefox is even trickier. "am start" launches Firefox, but doesn't wait for it to complete. "am start -W" will wait for the application to launch, but I'm not sure if that really waits for Firefox to complete. In the past, we waited for Firefox to complete by polling "adb shell ps", but the mobile dev team had some issue with this. Now, we check that Firefox is the top (foreground) activity -- if it isn't, tests are complete and our wait is over: https://dxr.mozilla.org/mozilla-central/rev/0d806b3230fe4767fa70cdee57db87d1e9a5ba49/build/mobile/remoteautomation.py#375-412

Flags: needinfo?(gbrown)

Henrik Skupin [:whimboo][⌚️UTC+2] (away 10/03 - 10/13)

Assignee

Comment 10

•

7 years ago

(In reply to Geoff Brown [:gbrown] from comment #9) > Other than Marionette, all the other emulator unit tests use different code > for this sort of thing. Notably, they do not use mozrunner/base/device.py. > Also, they use devicemanagerADB.py instead of adb_android.py, although I > suspect adb_android.py is superior to devicemanager now, and I've been > contemplating porting from devicemanager to adb_android.py. Oh, I was wrong. FennecRunner doesn't use adb_android and as it looks like only parts from devicemanagerADB, but mostly DroidADB (`self.dm_class = DroidADB`)! Wow, why do we have so many different implementations for Android and adb?! Also the command property seems to get superseeded by mozrunner itself: https://dxr.mozilla.org/mozilla-central/rev/0d806b3230fe4767fa70cdee57db87d1e9a5ba49/testing/mozbase/mozrunner/mozrunner/base/device.py#171 So I assume I should have a look at device.py and DroidADB in what could be possible to improve. > Launching Firefox is even trickier. "am start" launches Firefox, but doesn't > wait for it to complete. "am start -W" will wait for the application to > launch, but I'm not sure if that really waits for Firefox to complete. In > the past, we waited for Firefox to complete by polling "adb shell ps", but > the mobile dev team had some issue with this. Hm, it would be good to know what -W is actually doing. If this causes adb not to start itself as daemon that would be good. But then it really would have to wait until the application has quit. > Now, we check that Firefox is the top (foreground) activity -- if it isn't, > tests are complete and our wait is over: > > https://dxr.mozilla.org/mozilla-central/rev/ > 0d806b3230fe4767fa70cdee57db87d1e9a5ba49/build/mobile/remoteautomation. > py#375-412 Might not be ideal, a command like the following would also check if the app is running in the background (some tests might want to do it in the future) - but well, right now it should be enough. > adb shell dumpsys activity activities | grep 'org.mozilla.firefox_beta/org.mozilla.gecko.BrowserApp' | grep Hist Ok, so it means I will try the following now: 1) Use -W for starting the activity - if it works fine 2) Override the `poll` and `wait` methods for FennecRunner to return the remote process status

Henrik Skupin [:whimboo][⌚️UTC+2] (away 10/03 - 10/13)

Assignee

Comment 11

•

7 years ago

(In reply to Henrik Skupin (:whimboo) from comment #10) > Oh, I was wrong. FennecRunner doesn't use adb_android and as it looks like > only parts from devicemanagerADB, but mostly DroidADB (`self.dm_class = > DroidADB`)! Wow, why do we have so many different implementations for > Android and adb?! While digging more into this I have seen that DroidADB is actually a subclass from DeviceManagerADB and DroidMixin. > 1) Use -W for starting the activity - if it works fine I checked that now, and as we already documented in the code -W only waits until the activity has been started. So it doesn't help me here. > 2) Override the `poll` and `wait` methods for FennecRunner to return the > remote process status Looks like wrappers for both methods are necessary, which would rely on the `getTopActivity()`, and return an appropriate exit code.

Flags: needinfo?(bob)

Henrik Skupin [:whimboo][⌚️UTC+2] (away 10/03 - 10/13)

Assignee

Updated

•

7 years ago

Priority: P2 → P1

Henrik Skupin [:whimboo][⌚️UTC+2] (away 10/03 - 10/13)

Assignee

Updated

•

7 years ago

Blocks: 1434878

No longer depends on: 1434878

Henrik Skupin [:whimboo][⌚️UTC+2] (away 10/03 - 10/13)

Assignee

Updated

•

7 years ago

Depends on: 1438679

Henrik Skupin [:whimboo][⌚️UTC+2] (away 10/03 - 10/13)

Assignee

Updated

•

7 years ago

Blocks: 1432792

Henrik Skupin [:whimboo][⌚️UTC+2] (away 10/03 - 10/13)

Assignee

Updated

•

6 years ago

Blocks: 1464989

Henrik Skupin [:whimboo][⌚️UTC+2] (away 10/03 - 10/13)

Assignee

Comment 12

•

6 years ago

I triggered a try build now that the transition to adb has been finished. It looks all fine on Linux, MacOS, and Android. But there are perma failures for Windows: https://treeherder.mozilla.org/#/jobs?repo=try&revision=9a4572d2d102e8fff0f7490023f830aea7fcb855 > AssertionError: Positive returncode expected, got "None" After some investigation I noticed that we have a bogus process still running detection for Windows: > # If we have a handle, the process is alive > if isWin and getattr(self, '_handle', None): > return None This doesn't make sense and indeed will cause us to always return None if the process got killed by the system, or crashed.

Henrik Skupin [:whimboo][⌚️UTC+2] (away 10/03 - 10/13)

Assignee

Updated

•

6 years ago

Blocks: 1438009

Henrik Skupin [:whimboo][⌚️UTC+2] (away 10/03 - 10/13)

Assignee

Comment 13

•

6 years ago

The latest changes look promising. No failure yet: https://treeherder.mozilla.org/#/jobs?repo=try&revision=a1e4c3114c69b5304e552c561c6f85804a7537a3 I added a couple of more tests for mozprocess to make sure that this change doesn't regress something. A full try build is still outstanding and I will trigger it after uploading the most recent changes.

Comment hidden (mozreview-request)

Henrik Skupin [:whimboo][⌚️UTC+2] (away 10/03 - 10/13)

Assignee

Updated

•

6 years ago

Attachment #8946362 - Attachment is obsolete: true

Henrik Skupin [:whimboo][⌚️UTC+2] (away 10/03 - 10/13)

Assignee

Comment 19

•

6 years ago

The test browser\components\migration\tests\marionette\test_refresh_firefox.py fails on Windows 10 64bit opt only so far. Not sure about Windows 7 because the tests aren't done yet. https://treeherder.mozilla.org/#/jobs?repo=try&revision=a1e4c3114c69b5304e552c561c6f85804a7537a3&selectedJob=180735820 > TEST-UNEXPECTED-ERROR | browser\components\migration\tests\marionette\test_refresh_firefox.py TestFirefoxRefresh.testReset | IOError: Process has been unexpectedly closed (Exit code: 0) (Reason: Requested restart of the application was aborted) Given that all the other still-enabled Marionette restart tests work, I wonder if that is a problem caused by the test code. I will check that tomorrow.

Henrik Skupin [:whimboo][⌚️UTC+2] (away 10/03 - 10/13)

Assignee

Updated

•

6 years ago

Comment 20

•

6 years ago

So the failure with test_refresh_firefox.py is troublesome and I don't know why the Firefox exists in-between the restart. Due to my change on this bug `raise_for_port()` will return early due to the exit code of 0. I don't know why Firefox exists the io completion port in this case, but I feel it's too much work for this bug. As such I created bug 1465715 and will re-enable the early return then. With the above change I can still see a failure in cleaning up the files in test_refresh_firefox.py. I hope that isn't too hard to fix.

Henrik Skupin [:whimboo][⌚️UTC+2] (away 10/03 - 10/13)

Assignee

Comment 21

•

6 years ago

So it's getting more complicated but finally I might know why it's failing. The test `test_refresh_firefox.exe` triggers a refresh of the user profile, and while Firefox is doing that I can see the following in the log: > IO Completion Port unexpectedly closed It means the IO completion port mozprocess is using is gone. Given the exit code of `0` for the tracked Fireefox process my assumption would be that Firefox closes, or forks itself in a very weird way that it escapes the io completion port. As of now mozprocess cannot handle detached processes on Windows yet, and bug 1438830 tracks that work. Gijs or Matt, I'm fairly sure one of you will be able to shed some lights on that problem and could point me to the code which is related to the strange restart. I would like to understand what it does, so that I can create a similar binary to be used for testing of mozprocess. Thanks.

Depends on: 1438830

Flags: needinfo?(gijskruitbosch+bugs)

Flags: needinfo?(MattN+bmo)

:Gijs (he/him)

Comment 22

•

6 years ago

(In reply to Henrik Skupin (:whimboo) from comment #21) > So it's getting more complicated but finally I might know why it's failing. > The test `test_refresh_firefox.exe` Is this just a typo? This is a python test... > triggers a refresh of the user profile, > and while Firefox is doing that I can see the following in the log: > > > IO Completion Port unexpectedly closed > > It means the IO completion port mozprocess is using is gone. Given the exit > code of `0` for the tracked Fireefox process my assumption would be that > Firefox closes, or forks itself in a very weird way that it escapes the io > completion port. As of now mozprocess cannot handle detached processes on > Windows yet, and bug 1438830 tracks that work. > > Gijs or Matt, I'm fairly sure one of you will be able to shed some lights on > that problem and could point me to the code which is related to the strange > restart. I would like to understand what it does, so that I can create a > similar binary to be used for testing of mozprocess. Thanks. I'm not really sure what you need to know here. The reset is at: https://searchfox.org/mozilla-central/source/browser/components/migration/tests/marionette/test_refresh_firefox.py#502-530 which does: self.marionette.restart(clean=False, in_app=True) after setting a bunch of environment variables. Does that help? If not, please be more specific about what you need. I'm afraid I'm not personally familiar with how Marionette implements the .restart() and how that'd relate to what you're doing with mozprocess.

Flags: needinfo?(hskupin)

Flags: needinfo?(gijskruitbosch+bugs)

Flags: needinfo?(MattN+bmo)

Henrik Skupin [:whimboo][⌚️UTC+2] (away 10/03 - 10/13)

Assignee

Comment 23

•

6 years ago

Gijs, what I basically want to know is not how Marionette handles restarts, because I completely implemented that, but what Firefox actually does here. The way how it restarts itself seem to be different to usual restarts which do not trigger those io completion port issues.

Flags: needinfo?(hskupin) → needinfo?(gijskruitbosch+bugs)

:Gijs (he/him)

Comment 24

•

6 years ago

(In reply to Henrik Skupin (:whimboo) from comment #23) > Gijs, what I basically want to know is not how Marionette handles restarts, > because I completely implemented that, but what Firefox actually does here. > The way how it restarts itself seem to be different to usual restarts which > do not trigger those io completion port issues. What is the "IO Completion Port"? We detect the env vars here: https://dxr.mozilla.org/mozilla-central/rev/d8f180ab74921fd07a66d6868914a48e5f9ea797/toolkit/xre/nsAppRunner.cpp#2261-2290 and then switch to a new profile in https://dxr.mozilla.org/mozilla-central/rev/d8f180ab74921fd07a66d6868914a48e5f9ea797/toolkit/xre/nsAppRunner.cpp#2314-2338 Does that help?

Flags: needinfo?(gijskruitbosch+bugs)

Henrik Skupin [:whimboo][⌚️UTC+2] (away 10/03 - 10/13)

Assignee

Comment 25

•

6 years ago

(In reply to :Gijs (he/him) from comment #24) > What is the "IO Completion Port"? It's a feature on Windows. Details can be found at https://msdn.microsoft.com/en-us/library/windows/desktop/aa365198(v=vs.85).aspx. > https://dxr.mozilla.org/mozilla-central/rev/ > d8f180ab74921fd07a66d6868914a48e5f9ea797/toolkit/xre/nsAppRunner.cpp#2261- > 2290 > > https://dxr.mozilla.org/mozilla-central/rev/ > d8f180ab74921fd07a66d6868914a48e5f9ea797/toolkit/xre/nsAppRunner.cpp#2314- > 2338 Both the links do not actually show where the restart of Firefox is triggered, and how this is done. Is that a call to `Services.startup.quit()` and somewhere hidden in one of the called methods?

Flags: needinfo?(gijskruitbosch+bugs)

Matthew N. [:MattN]

Comment 26

•

6 years ago

LaunchChild may be worth looking at as it's where the actual restart happens IIUC. https://dxr.mozilla.org/mozilla-central/rev/d8f180ab74921fd07a66d6868914a48e5f9ea797/toolkit/xre/nsAppRunner.cpp#1815-1816,1845,1852-1858 I also don't really know how I can help since all the relevant code is in the link from comment 22… you can check if the restarts arguments or environment variables change the nsAppRunner code path.

:Gijs (he/him)

Comment 27

•

6 years ago

(In reply to Henrik Skupin (:whimboo) from comment #25) > (In reply to :Gijs (he/him) from comment #24) > > What is the "IO Completion Port"? > > It's a feature on Windows. Details can be found at > https://msdn.microsoft.com/en-us/library/windows/desktop/aa365198(v=vs.85). > aspx. OK. What does mozprocess create these on/for ? The process? The lockfile in the profile? Something else? > > https://dxr.mozilla.org/mozilla-central/rev/ > > d8f180ab74921fd07a66d6868914a48e5f9ea797/toolkit/xre/nsAppRunner.cpp#2261- > > 2290 > > > > https://dxr.mozilla.org/mozilla-central/rev/ > > d8f180ab74921fd07a66d6868914a48e5f9ea797/toolkit/xre/nsAppRunner.cpp#2314- > > 2338 > > Both the links do not actually show where the restart of Firefox is > triggered, and how this is done. Is that a call to `Services.startup.quit()` > and somewhere hidden in one of the called methods? As Matt implied in comment #25, it's the call to the marionette .restart() function quoted in comment #22 which I assume eventually call startup.quit() with restart flags (but I haven't checked) - but in comment #23 you said you implemented it, so I also assume you know better than me how those restarts work and whether they do or do not call startup.quit()... There's nothing magical here. The only times that test will restart Firefox is when it calls marionette's .restart() method. The only difference with regular restarts will be how those environment variables change the restart path, which is why I linked you to the nsAppRunner code where that happens... Do we actually have other (unit?) tests on infra that exercise marionette's .restart() method and don't have this error?

Flags: needinfo?(gijskruitbosch+bugs) → needinfo?(hskupin)

Henrik Skupin [:whimboo][⌚️UTC+2] (away 10/03 - 10/13)

Assignee

Comment 28

•

6 years ago

(In reply to :Gijs (he/him) from comment #27) > OK. What does mozprocess create these on/for ? The process? The lockfile in > the profile? Something else? It's for the process and nothing else as far as I understood. The code is kinda old and I haven't really touched it before. So my knowledge here is most likely equal to yours. > As Matt implied in comment #25, it's the call to the marionette .restart() > function quoted in comment #22 which I assume eventually call startup.quit() > with restart flags (but I haven't checked) - but in comment #23 you said you > implemented it, so I also assume you know better than me how those restarts > work and whether they do or do not call startup.quit()... Yes, it calls startup.quit() in case of in_app restarts. But I don't know what this method actually performs internally, additionally to the parameters we pass-in. So maybe the best would be to create a simplified testcase based on the parameters and environment variables which get set for a reset Firefox activity. Maybe I can reproduce it that way. > Do we actually have other (unit?) tests on infra that exercise marionette's > .restart() method and don't have this error? Yes, we have a ton of in_app restart tests. But when I now checked the full log of the last try push on comment 19 I can see that lots of them are marked as skipped. I can only imagine that something has been changed in Marionette so that the following condition evaluates to true: > if self.marionette.session_capabilities["platformName"] != "windows_nt": > skip("Bug 1363368 - Wrong window handles after in_app restarts") If that is true we would indeed not run any of them across all platforms, and it would explain why we only see it for the refresh Firefox Marionette test. Maybe when fixing this we would also see it for those unit tests. If that is the case sorry for the noise. But thanks for all the feedback which gives me an opportunity to dive into and create a test application for mozprocess unit tests.

Flags: needinfo?(hskupin)

Henrik Skupin [:whimboo][⌚️UTC+2] (away 10/03 - 10/13)

Assignee

Updated

•

6 years ago

Depends on: 1467583

Henrik Skupin [:whimboo][⌚️UTC+2] (away 10/03 - 10/13)

Assignee

Comment 29

•

6 years ago

Now with the in_app restart tests enabled again I did another try push, and I now can also see problems with the Marionette unit tests. So I definitely have an easier starting point now. https://treeherder.mozilla.org/#/jobs?repo=try&revision=914fd980b4070fc7f03d7233947d1aa454420eae&selectedJob=182670952

Henrik Skupin [:whimboo][⌚️UTC+2] (away 10/03 - 10/13)

Assignee

Comment 30

•

6 years ago

Here a new try build given that the dependency has been fixed now: https://treeherder.mozilla.org/#/jobs?repo=try&revision=0741572eb73882aea5016062a1ad6addf2baa6ec&selectedJob=202566667 Interesting the newly added test for poll() after os.kill() fails. I'm not sure yet why, given that os.waitpid() returns with sig as exit code, but ProcessHandler.poll() still returns None if run immediately. Using time.sleep(1) before, makes it all work. I will have to see how to fix this race by hopefully getting away the call to sleep()

Henrik Skupin [:whimboo][⌚️UTC+2] (away 10/03 - 10/13)

Assignee

Comment 31

•

6 years ago

The delay here actually comes from checking for the reader thread to be alive, which gets executed each 0.02s. As such the first call to poll() will return None, unless there is a delay of at least 0.02s before. I will solve this by simply waiting twice that period which should allow the thread to run at least once.

Treeherder Bug Filer

Updated

•

6 years ago

Blocks: 1495667

Henrik Skupin [:whimboo][⌚️UTC+2] (away 10/03 - 10/13)

Assignee

Comment 32

•

6 years ago

It all works fine now except for a single Marionette unit restart test, which I would like to temporary skip for opt/pgo Windows builds only. The necessary work to get it re-enabled is tracked on bug 1495667, and I will work on once bug 1433873 has been fixed. Here the last try build: https://treeherder.mozilla.org/#/jobs?repo=try&revision=328703f839595e664942df5c7c645c3cb2eebd3d&selectedJob=202811804

Henrik Skupin [:whimboo][⌚️UTC+2] (away 10/03 - 10/13)

Assignee

Comment 33

•

6 years ago

Attached file Bug 1433905 - Temporarily disable test_loader_global_sharing.py for opt/pgo builds on Windows. r?gbrown — Details

Marionette currently doesn't handle restarts of Firefox that well, and as such needs a fix on bug 1433873. Sadly this patch cannot be landed until the mozprocess issues on bug 1433905 have been fixed. This patch temporarily disables the test for opt/pgo Windows builds only. Re-enabling is tracked via bug 1495667.

Henrik Skupin [:whimboo][⌚️UTC+2] (away 10/03 - 10/13)

Assignee

Comment 34

•

6 years ago

Attached file Bug 1433905 - [mozprocess] Retrieving pid has to fail with RuntimeError if process hasn't been started yet. r?gbrown — Details

Instead of an AttributeError a RuntimeError has to be thrown if the underlying process hasn't been created yet. Depends on D7392

Henrik Skupin [:whimboo][⌚️UTC+2] (away 10/03 - 10/13)

Assignee

Comment 35

•

6 years ago

Attached file Bug 1433905 - [mozprocess] "check_for_detached()" has to raise RuntimeError if process hasn't been started yet. r?gbrown — Details

Calling "check_for_detached()" doesn't make sense if the process hasn't been started yet, and as such has to raise a RuntimeError. Depends on D7393

Henrik Skupin [:whimboo][⌚️UTC+2] (away 10/03 - 10/13)

Assignee

Comment 36

•

6 years ago

Attached file Bug 1433905 - [mozprocess] Fix broken path to process script for test_process_output_nonewline. r?gbrown — Details

Depends on D7394

Henrik Skupin [:whimboo][⌚️UTC+2] (away 10/03 - 10/13)

Assignee

Comment 37

•

6 years ago

Attached file Bug 1433905 - [mozprocess] Existence of _handle on Windows doesn't mean the process is still alive. r?gbrown — Details

The assumption that when a handle is present for the process handler on Windows doesn't mean that the process is still alive. It could have already been externally killed, crashed, or closed itself. The patch makes sure to check the process exit code, and run clean-up steps if the process is already gone. Depends on D7395

Henrik Skupin [:whimboo][⌚️UTC+2] (away 10/03 - 10/13)

Assignee

Comment 38

•

6 years ago

Attached file Bug 1433905 - [mozprocess] poll() always returns None for stopped process until wait() is called. r?gbrown — Details

If the process quits externally (shutdown by itself or via kill), the poll method still returns None, even with the process not existent anymore. To fix that, the poll method should at least try to join the reader thread once before checking if it is still alive. Otherwise it will continue to run, and never stop. Also the attribute existence check for "returncode" on the process instance has to be removed because this property always exists. Instead a check for the "returncode" property of the ProcessHandler class is necessary. Depends on D7396

Henrik Skupin [:whimboo][⌚️UTC+2] (away 10/03 - 10/13)

Assignee

Comment 39

•

6 years ago

Mn tests for opt builds look good now: https://treeherder.mozilla.org/#/jobs?repo=try&revision=e302eae80ce996d55eff600da02a1ad051df563f But there are problems with pgo builds due to the artifact mode. Here another try build for those, which is still running at the moment: https://treeherder.mozilla.org/#/jobs?repo=try&revision=da14651308ed61c63c0b9d857fa9c836db0e67b5

Henrik Skupin [:whimboo][⌚️UTC+2] (away 10/03 - 10/13)

Assignee

Updated

•

6 years ago

Attachment #8946361 - Attachment is obsolete: true

Henrik Skupin [:whimboo][⌚️UTC+2] (away 10/03 - 10/13)

Assignee

Updated

•

6 years ago

Attachment #8981500 - Attachment is obsolete: true

Henrik Skupin [:whimboo][⌚️UTC+2] (away 10/03 - 10/13)

Assignee

Updated

•

6 years ago

Attachment #8981501 - Attachment is obsolete: true

Henrik Skupin [:whimboo][⌚️UTC+2] (away 10/03 - 10/13)

Assignee

Updated

•

6 years ago

Attachment #8981502 - Attachment is obsolete: true

Henrik Skupin [:whimboo][⌚️UTC+2] (away 10/03 - 10/13)

Assignee

Updated

•

6 years ago

Attachment #8981503 - Attachment is obsolete: true

Geoff Brown [:gbrown]

Comment 40

•

6 years ago

Comment on attachment 9013546 [details] Bug 1433905 - Temporarily disable test_loader_global_sharing.py for opt/pgo builds on Windows. r?gbrown Geoff Brown [:gbrown] has approved the revision.

Attachment #9013546 - Flags: review+

Geoff Brown [:gbrown]

Comment 41

•

6 years ago

Comment on attachment 9013547 [details] Bug 1433905 - [mozprocess] Retrieving pid has to fail with RuntimeError if process hasn't been started yet. r?gbrown Geoff Brown [:gbrown] has approved the revision.

Attachment #9013547 - Flags: review+

Geoff Brown [:gbrown]

Comment 42

•

6 years ago

Comment on attachment 9013548 [details] Bug 1433905 - [mozprocess] "check_for_detached()" has to raise RuntimeError if process hasn't been started yet. r?gbrown Geoff Brown [:gbrown] has approved the revision.

Attachment #9013548 - Flags: review+

Geoff Brown [:gbrown]

Comment 43

•

6 years ago

Sorry, I wasn't able to finish off these reviews today. Will make it a priority tomorrow.

Flags: needinfo?(gbrown)

Geoff Brown [:gbrown]

Updated

•

6 years ago

Flags: needinfo?(gbrown)

Pulsebot

Comment 44

•

6 years ago

Pushed by hskupin@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/051b1c82dac8 Temporarily disable test_loader_global_sharing.py for opt/pgo builds on Windows. r=gbrown https://hg.mozilla.org/integration/autoland/rev/4f05d89da460 [mozprocess] Retrieving pid has to fail with RuntimeError if process hasn't been started yet. r=gbrown https://hg.mozilla.org/integration/autoland/rev/70607539582d [mozprocess] "check_for_detached()" has to raise RuntimeError if process hasn't been started yet. r=gbrown https://hg.mozilla.org/integration/autoland/rev/a9f6938eccfe [mozprocess] Fix broken path to process script for test_process_output_nonewline. r=gbrown https://hg.mozilla.org/integration/autoland/rev/8793e332890e [mozprocess] Existence of _handle on Windows doesn't mean the process is still alive. r=gbrown https://hg.mozilla.org/integration/autoland/rev/21e66bfa0cc4 [mozprocess] poll() always returns None for stopped process until wait() is called. r=gbrown

Natalia Csoregi [:nataliaCs]

Comment 45

•

6 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/051b1c82dac8 https://hg.mozilla.org/mozilla-central/rev/4f05d89da460 https://hg.mozilla.org/mozilla-central/rev/70607539582d https://hg.mozilla.org/mozilla-central/rev/a9f6938eccfe https://hg.mozilla.org/mozilla-central/rev/8793e332890e https://hg.mozilla.org/mozilla-central/rev/21e66bfa0cc4

Status: ASSIGNED → RESOLVED

Closed: 6 years ago

status-firefox64: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla64

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Updated

•

6 years ago

Depends on: 1496759

Henrik Skupin [:whimboo][⌚️UTC+2] (away 10/03 - 10/13)

Assignee

Updated

•

6 years ago

Blocks: 1496759

No longer depends on: 1496759

You need to log in before you can comment on or make changes to this bug.