wptrunner doesn't force-stop the browser when a wdspec test hits an external timeout
Categories
(Testing :: web-platform-tests, defect, P2)
Tracking
(firefox-esr91 wontfix, firefox97 wontfix, firefox98 wontfix, firefox99 fixed)
People
(Reporter: whimboo, Assigned: whimboo)
References
(Regression)
Details
(Keywords: regression, Whiteboard: [bidi-m3-mvp], [wptsync upstream])
Attachments
(1 file)
When a wdspec test times out geckodriver checks for the session status for around 70s before shutting down the browser. This is specific code has been added on bug 1705770.
Here an example log:
https://firefox-ci-tc.services.mozilla.com/tasks/BdMN5w-0TAOJK6-cav6QnA/runs/0/logs/live/public/logs/live.log#L3347-3490
[task 2022-02-10T08:03:35.108Z] 08:03:35 INFO - PID 2646 | 1644480215107 webdriver::server DEBUG -> POST /session/9474988c-ab89-4b0f-88f5-f6c2b9385079/execute/sync {"script": "console.log('TEST')", "args": []}
[task 2022-02-10T08:03:35.126Z] 08:03:35 INFO - PID 2646 | 1644480215124 webdriver::server DEBUG <- 200 OK {"value":null}
[task 2022-02-10T08:05:17.673Z] 08:05:17 INFO - TEST-UNEXPECTED-TIMEOUT | /webdriver/tests/bidi/log/entry_added/console.py | expected OK
[task 2022-02-10T08:05:17.673Z] 08:05:17 INFO - TEST-INFO took 105003ms
[task 2022-02-10T08:05:17.692Z] 08:05:17 INFO - PID 2646 | 1644480317690 webdriver::server DEBUG -> GET /status
[task 2022-02-10T08:05:17.694Z] 08:05:17 INFO - PID 2646 | 1644480317691 webdriver::server DEBUG <- 200 OK {"value":{"message":"Session already started","ready":false}}
[task 2022-02-10T08:05:18.695Z] 08:05:18 INFO - PID 2646 | 1644480318693
[..]
[task 2022-02-10T08:06:26.913Z] 08:06:26 INFO - PID 2646 | 1644480386911 webdriver::server DEBUG -> GET /status
[task 2022-02-10T08:06:26.914Z] 08:06:26 INFO - PID 2646 | 1644480386911 webdriver::server DEBUG <- 200 OK {"value":{"message":"Session already started","ready":false}}
[task 2022-02-10T08:06:37.961Z] 08:06:37 WARNING - Forcibly terminating runner process
This is extra time that we should not have to wait for. Maybe the session fixture should trigger closing the browser when a test timed out?
On the other side geckodriver should only check for the session state when it's aware that the command for destroying the session has already been send. Otherwise it doesn't make sense to wait. The new behavior exists since geckodriver 0.30.0.
James, mind having a look at this?
Comment 1•2 years ago
|
||
wptrunner ends up in this state if it's expecting that the WebDriver session will be closed (see [1]). Note that it can't (easily) know if the the session is actually being closed because it's owned by pytest, and that's running in a totally different process. So the actual problem here is that we think we ought to be ending the session but aren't. So the fix is probably to ensure that we actually do end the session in this case.
Updated•2 years ago
|
Comment 2•2 years ago
|
||
Recent regression, should be fixed for the next release
Assignee | ||
Comment 3•2 years ago
|
||
So there is actually a problem when trying to use timeouts with pytest. The notion for web-platform-tests is a timeout per module (file) whereby in pytest timeouts are handled per function. Several plugins like pytest-timeout only work that way and I cannot find a possibility to set a module wide timeout.
Further for the pytestrunner as used by wptrunner we set a timeout via an argument when calling run
but that timeout is actually never used for pytest. This means when a test causes a timeout pytest will not handle that internally (meaning no safe shutdown) but the wptrunner has to handle such an inconsistent and unknown state.
Given that timeouts are handled that differently I don't see that we can easily instruct pytest to handle it internally.
From the wptrunner side the status is unclear and it's not known if some code actually triggered the deletion of the session or not. As such just waiting for the session to have ended is mainly not helpful and we should try to at least delete the session before so a clean shutdown can happen before the processes are force-killed.
James, do I see it right that wptrunner starts geckodriver but sadly we do not have any access to the actual session from that code? Only pytest itself knows about it? If yes, then we won't be able to delete the session, and should directly force-kill the process. Probably we should only do that when a timeout error occurs, and then we could pass in force=true
and with that flag set we won't enter the session alive check.
If you think that this is fine, could you maybe point me to the code where the timeout is handled and the process gets stopped? I have a problem to find that.
Assignee | ||
Updated•2 years ago
|
Assignee | ||
Comment 4•2 years ago
|
||
James will have a look again what would be best here.
Comment 5•2 years ago
|
||
Try something like:
diff --git a/testing/web-platform/tests/tools/wptrunner/wptrunner/testrunner.py b/testing/web-platform/tests/tools/wptrunner/wptrunner/testrunner.py
index 0dac77c4b5533..9344e454a62af 100644
--- a/testing/web-platform/tests/tools/wptrunner/wptrunner/testrunner.py
+++ b/testing/web-platform/tests/tools/wptrunner/wptrunner/testrunner.py
@@ -246,7 +246,7 @@ class _RunnerManagerState(object):
initializing = namedtuple("initializing",
["test", "test_group", "group_metadata", "failure_count"])
running = namedtuple("running", ["test", "test_group", "group_metadata"])
- restarting = namedtuple("restarting", ["test", "test_group", "group_metadata"])
+ restarting = namedtuple("restarting", ["test", "test_group", "group_metadata", "force_stop"])
error = namedtuple("error", [])
stop = namedtuple("stop", [])
@@ -417,7 +417,10 @@ class TestRunnerManager(threading.Thread):
self.logger.debug("Got command: %r" % command)
except IOError:
self.logger.error("Got IOError from poll")
- return RunnerManagerState.restarting(0)
+ return RunnerManagerState.restarting(self.state.test,
+ self.state.test_group,
+ self.state.group_metadata,
+ False)
except Empty:
if (self.debug_info and self.debug_info.interactive and
self.browser.started and not self.browser.is_alive()):
@@ -442,7 +445,10 @@ class TestRunnerManager(threading.Thread):
self.logger.critical("Last test did not complete")
return RunnerManagerState.error()
self.logger.warning("More tests found, but runner process died, restarting")
- return RunnerManagerState.restarting(0)
+ return RunnerManagerState.restarting(self.state.test,
+ self.state.test_group,
+ self.state.group_metadata,
+ False)
else:
f = (dispatch.get(self.state.__class__, {}).get(command) or
dispatch.get(None, {}).get(command))
@@ -549,7 +555,8 @@ class TestRunnerManager(threading.Thread):
self.logger.info("Restarting browser for new test environment")
return RunnerManagerState.restarting(self.state.test,
self.state.test_group,
- self.state.group_metadata)
+ self.state.group_metadata,
+ False)
self.recording.set(["testrunner", "test"] + self.state.test.id.split("/")[1:])
self.logger.test_start(self.state.test.id)
@@ -683,6 +690,7 @@ class TestRunnerManager(threading.Thread):
file_result.status in ("CRASH", "EXTERNAL-TIMEOUT", "INTERNAL-ERROR") or
((subtest_unexpected or is_unexpected) and
self.restart_on_unexpected))
+ force_stop = test.test_type == "wdspec" and file_result.status == "EXTERNAL-TIMEOUT"
self.recording.set(["testrunner", "after-test"])
if (not file_result.status == "CRASH" and
@@ -691,7 +699,7 @@ class TestRunnerManager(threading.Thread):
self.logger.info("Pausing until the browser exits")
self.send_message("wait")
else:
- return self.after_test_end(test, restart_before_next)
+ return self.after_test_end(test, restart_before_next, force_stop=force_stop)
def wait_finished(self, rerun=False):
assert isinstance(self.state, RunnerManagerState.running)
@@ -701,7 +709,7 @@ class TestRunnerManager(threading.Thread):
# post-stop processing
return self.after_test_end(self.state.test, not rerun, force_rerun=rerun)
- def after_test_end(self, test, restart, force_rerun=False):
+ def after_test_end(self, test, restart, force_rerun=False, force_stop=False):
assert isinstance(self.state, RunnerManagerState.running)
# Mixing manual reruns and automatic reruns is confusing; we currently assume
# that as long as we've done at least the automatic run count in total we can
@@ -718,14 +726,14 @@ class TestRunnerManager(threading.Thread):
test_group = self.state.test_group
group_metadata = self.state.group_metadata
if restart:
- return RunnerManagerState.restarting(test, test_group, group_metadata)
+ return RunnerManagerState.restarting(test, test_group, group_metadata, force_stop)
else:
return RunnerManagerState.running(test, test_group, group_metadata)
def restart_runner(self):
"""Stop and restart the TestRunner"""
assert isinstance(self.state, RunnerManagerState.restarting)
- self.stop_runner()
+ self.stop_runner(force=self.state.force_stop)
return RunnerManagerState.initializing(self.state.test, self.state.test_group, self.state.group_metadata, 0)
def log(self, data):
I haven't tried it with a wdspec test, but it at least seems to run other tests, so I don't think it should break everything.
Agreed this code isn't very simple; the loop there is trying to be an explicit state machine using tokens to pass data between different states rather than having everything just call methods directly, but that does make it harder to add more state.
Assignee | ||
Comment 6•2 years ago
|
||
Web-platform tests use a timeout per test module while pytest
supports a timeout for test methods. As long as a file wide
timeout isn't supported wptrunner has to force kill the browser
when a wdspec test runs into an external timeout.
Updated•2 years ago
|
Assignee | ||
Updated•2 years ago
|
Pushed by hskupin@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/21dda7d9394c [wptrunner] Force stopping the browser when wdspec test hits external timeout. r=jgraham
Comment 8•2 years ago
|
||
bugherder |
Assignee | ||
Comment 9•2 years ago
|
||
James, can you please check what's wrong with the upstream sync and why the bot added [wptsync upstream error]
? It would have been nice if the bot had printed the URL of the PR in the comment above. I looked and it's the following:
Comment 10•2 years ago
|
||
I merged the upstream PR. The documentation bug is idenepdent of this change and needs to be investigated. I'm not sure why the bot didn't comment on the bug with the sync link, that is definitely supposed to happen.
Upstream PR merged by None
Created web-platform-tests PR https://github.com/web-platform-tests/wpt/pull/33161 for changes under testing/web-platform/tests
Upstream PR merged by None
Upstream PR merged by None
Description
•