_handle_socket_failure should not be called when application is shutting down via quit/restart

RESOLVED FIXED in Firefox 64

Status

defect
P1
normal
RESOLVED FIXED
Last year
7 months ago

People

(Reporter: whimboo, Assigned: whimboo)

Tracking

(Blocks 7 bugs)

Version 3
mozilla64
Points:
---

Firefox Tracking Flags

(firefox64 fixed)

Details

Attachments

(2 attachments, 1 obsolete attachment)

As investigated on bug 1397612 we have a very broken behavior for quit and restart in the client, which results in an inappropriate call to `_handle_socket_failure`, and an expected IOError isn't bubbled up to quit/restart anymore. As such specifically restart is not able to handle it, and fails to set the new process id of Firefox which then results in bug 1421289.

What we should do here is the following. Given that every method on the Marionette object which itself calls `_send_message` is affected, we need a property on the Marionette object, which shows when the application is going down. That property can then be used in the `do_process_check` decorator to trigger `_handle_socket_failure`, or to just ignore the exception and let it bubble up.

With this fix we might be able to also unskip all of our existing restart tests as covered on bug 1389103.
Depends on: 1433905
Bug 1433905 won't help us in the short term. So instead of relying on the `returncode` or output of `poll()` of mozprocess, Marionette would have to use a non-blocking call to `wait()` to determine the process status. This should work for the time being.
Depends on: 1434932
The try build shows permanent failures for Windows and Android:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=eb9d53bc4e4b&selectedJob=159974292

I will have to take a look at those tomorrow.
The problem for Android was my change in `raise_for_port()` which used `wait(0)` instead of `is_running()` and caused the same failure with a forked adb process as I already noticed on bug 1433905. It means we cannot escape the connection attempt loop earlier if the process is gone, at least not for now.

The remaining problem with Windows is in the test `test_in_app_restart_with_callback_but_no_shutdown`. The exact failure is:

11:47:19     INFO -  TEST-START | testing\marionette\harness\marionette_harness\tests\unit\test_quit_restart.py TestQuitRestart.test_in_app_restart_with_callback_but_no_shutdown
11:50:19     INFO -  WARNING | IO Completion Port failed to signal process shutdown
11:50:19     INFO -  Parent process 7732 exited with children alive:
11:50:19     INFO -  PIDS: 1896, 4684, 668, 132
11:50:19     INFO -  Attempting to kill them, but no guarantee of success
11:50:19    ERROR -  TEST-UNEXPECTED-FAIL | testing\marionette\harness\marionette_harness\tests\unit\test_quit_restart.py TestQuitRestart.test_in_app_restart_with_callback_but_no_shutdown | AssertionError: "Process still running after restart request" does not match "Process unexpectedly quit without restarting (exit code: 0)"
11:50:19     INFO -  Traceback (most recent call last):
11:50:19     INFO -    File "Z:\task_1517570260\build\venv\lib\site-packages\marionette_harness\marionette_test\testcases.py", line 159, in run
11:50:19     INFO -      testMethod()
11:50:19     INFO -    File "Z:\task_1517570260\build\tests\marionette\tests\testing\marionette\harness\marionette_harness\tests\unit\test_quit_restart.py", line 202, in test_in_app_restart_with_callback_but_no_shutdown
11:50:19     INFO -      self.marionette.restart(in_app=True, callback=lambda: False)
11:50:19     INFO -  TEST-INFO took 179806ms

Somehow the parent process with the pid 7732 exited and left childrens alive.  But this test should not cause Firefox to exit. So I don't understand the failure message of "Process unexpectedly quit without restarting (exit code: 0)" at all.
Comment hidden (mozreview-request)
Depends on: 1438679
Depends on: 1438830
(In reply to Henrik Skupin (:whimboo) from comment #4)
> Somehow the parent process with the pid 7732 exited and left childrens
> alive.  But this test should not cause Firefox to exit. So I don't
> understand the failure message of "Process unexpectedly quit without
> restarting (exit code: 0)" at all.

This actually happens for in_app restarts of Firefox only and is caused by somewhat loosing the Firefox process. I filed bug 
1438830 to get this investigated and fixed.

The workaround on Windows for the time being would be to add an extra `restart()` in tearDown of any test which is using an `in_app` restart. I think that I will see first if bug 1438830 is easy to fix.
The mozrunner case which affects Firefox for Android will hopefully be fixed later today. Then only the Windows case remains.
Ok, so wait() is sometimes broken on Windows and I have to fix bug 1438009 first.
Depends on: 1438009
No longer blocks: 1434926
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Attachment #8971663 - Attachment is obsolete: true
Depends on: 1467583
Blocks: 1469946
Blocks: 1365243

Updated

9 months ago
Blocks: 1495667
No longer blocks: 1461318
Blocks: 1438009
No longer depends on: 1438009
Both methods are poorly handling the phase when the application
is about to quit/restart. Especially when a callback is used
to trigger the application shutdown.

This patch makes sure that in case of an active shutdown the
do_process_check decorator doesn't trigger a socket failure,
and also correctly waits for the application to shutdown, or
being restarted.

Comment 14

9 months ago
Pushed by hskupin@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/4204f9fab5fe
Fix race condition in Marionette client for in_app quit and restart. r=ato

Comment 15

9 months ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/4204f9fab5fe
Status: ASSIGNED → RESOLVED
Closed: 9 months ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla64
Depends on: 1496897
Blocks: 1496759
No longer depends on: 1496759
Per your comment on https://phabricator.services.mozilla.com/D10837,
should this be uplifted to beta?
Flags: needinfo?(hskupin)
This landed for 64, which means it is already on beta now.
Flags: needinfo?(hskupin)
Depends on: 1504940
Blocks: 1489679
You need to log in before you can comment on or make changes to this bug.