Closed Bug 1433873 Opened 6 years ago Closed 5 years ago

_handle_socket_failure should not be called when application is shutting down via quit/restart

Categories

(Remote Protocol :: Marionette, defect, P1)

Version 3
defect

Tracking

(firefox64 fixed)

RESOLVED FIXED
mozilla64
Tracking Status
firefox64 --- fixed

People

(Reporter: whimboo, Assigned: whimboo)

References

(Blocks 3 open bugs)

Details

Attachments

(2 files, 1 obsolete file)

As investigated on bug 1397612 we have a very broken behavior for quit and restart in the client, which results in an inappropriate call to `_handle_socket_failure`, and an expected IOError isn't bubbled up to quit/restart anymore. As such specifically restart is not able to handle it, and fails to set the new process id of Firefox which then results in bug 1421289.

What we should do here is the following. Given that every method on the Marionette object which itself calls `_send_message` is affected, we need a property on the Marionette object, which shows when the application is going down. That property can then be used in the `do_process_check` decorator to trigger `_handle_socket_failure`, or to just ignore the exception and let it bubble up.

With this fix we might be able to also unskip all of our existing restart tests as covered on bug 1389103.
Depends on: 1433905
Bug 1433905 won't help us in the short term. So instead of relying on the `returncode` or output of `poll()` of mozprocess, Marionette would have to use a non-blocking call to `wait()` to determine the process status. This should work for the time being.
Depends on: 1434932
Blocks: 1435006
The try build shows permanent failures for Windows and Android:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=eb9d53bc4e4b&selectedJob=159974292

I will have to take a look at those tomorrow.
Blocks: 1435110
The problem for Android was my change in `raise_for_port()` which used `wait(0)` instead of `is_running()` and caused the same failure with a forked adb process as I already noticed on bug 1433905. It means we cannot escape the connection attempt loop earlier if the process is gone, at least not for now.

The remaining problem with Windows is in the test `test_in_app_restart_with_callback_but_no_shutdown`. The exact failure is:

11:47:19     INFO -  TEST-START | testing\marionette\harness\marionette_harness\tests\unit\test_quit_restart.py TestQuitRestart.test_in_app_restart_with_callback_but_no_shutdown
11:50:19     INFO -  WARNING | IO Completion Port failed to signal process shutdown
11:50:19     INFO -  Parent process 7732 exited with children alive:
11:50:19     INFO -  PIDS: 1896, 4684, 668, 132
11:50:19     INFO -  Attempting to kill them, but no guarantee of success
11:50:19    ERROR -  TEST-UNEXPECTED-FAIL | testing\marionette\harness\marionette_harness\tests\unit\test_quit_restart.py TestQuitRestart.test_in_app_restart_with_callback_but_no_shutdown | AssertionError: "Process still running after restart request" does not match "Process unexpectedly quit without restarting (exit code: 0)"
11:50:19     INFO -  Traceback (most recent call last):
11:50:19     INFO -    File "Z:\task_1517570260\build\venv\lib\site-packages\marionette_harness\marionette_test\testcases.py", line 159, in run
11:50:19     INFO -      testMethod()
11:50:19     INFO -    File "Z:\task_1517570260\build\tests\marionette\tests\testing\marionette\harness\marionette_harness\tests\unit\test_quit_restart.py", line 202, in test_in_app_restart_with_callback_but_no_shutdown
11:50:19     INFO -      self.marionette.restart(in_app=True, callback=lambda: False)
11:50:19     INFO -  TEST-INFO took 179806ms

Somehow the parent process with the pid 7732 exited and left childrens alive.  But this test should not cause Firefox to exit. So I don't understand the failure message of "Process unexpectedly quit without restarting (exit code: 0)" at all.
Blocks: 1325047
Blocks: 1433623
Blocks: 1411189
Blocks: 1434214
Blocks: 1430717
Blocks: 1423500
Blocks: 1258982
Blocks: 1260502
Blocks: 1364304
Blocks: 1394674
Blocks: 1436936
Depends on: 1438679
Depends on: 1438830
(In reply to Henrik Skupin (:whimboo) from comment #4)
> Somehow the parent process with the pid 7732 exited and left childrens
> alive.  But this test should not cause Firefox to exit. So I don't
> understand the failure message of "Process unexpectedly quit without
> restarting (exit code: 0)" at all.

This actually happens for in_app restarts of Firefox only and is caused by somewhat loosing the Firefox process. I filed bug 
1438830 to get this investigated and fixed.

The workaround on Windows for the time being would be to add an extra `restart()` in tearDown of any test which is using an `in_app` restart. I think that I will see first if bug 1438830 is easy to fix.
The mozrunner case which affects Firefox for Android will hopefully be fixed later today. Then only the Windows case remains.
Blocks: 1444600
Ok, so wait() is sometimes broken on Windows and I have to fix bug 1438009 first.
Depends on: 1438009
Blocks: 1397076
Blocks: 1400819
No longer blocks: 1434926
Blocks: 1141335
Blocks: 1326124
Blocks: 1459610
Blocks: 1461318
Attachment #8971663 - Attachment is obsolete: true
Blocks: 1465715
Blocks: 1466869
Depends on: 1467583
Blocks: 1363087
Blocks: 1472227
Blocks: 1472853
Blocks: 1469946
Blocks: 1365243
Blocks: 1495667
No longer blocks: 1461318
Blocks: 1438009
No longer depends on: 1438009
Both methods are poorly handling the phase when the application
is about to quit/restart. Especially when a callback is used
to trigger the application shutdown.

This patch makes sure that in case of an active shutdown the
do_process_check decorator doesn't trigger a socket failure,
and also correctly waits for the application to shutdown, or
being restarted.
Pushed by hskupin@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/4204f9fab5fe
Fix race condition in Marionette client for in_app quit and restart. r=ato
https://hg.mozilla.org/mozilla-central/rev/4204f9fab5fe
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla64
Depends on: 1496897
Blocks: 1496759
No longer depends on: 1496759
Per your comment on https://phabricator.services.mozilla.com/D10837,
should this be uplifted to beta?
Flags: needinfo?(hskupin)
This landed for 64, which means it is already on beta now.
Flags: needinfo?(hskupin)
Depends on: 1504940
Blocks: 1489679
Product: Testing → Remote Protocol
You need to log in before you can comment on or make changes to this bug.