Closed Bug 1754712 Opened 2 years ago Closed 2 years ago

wptrunner doesn't force-stop the browser when a wdspec test hits an external timeout

Categories

(Testing :: web-platform-tests, defect, P2)

Points: 2

Tracking

(firefox-esr91 wontfix, firefox97 wontfix, firefox98 wontfix, firefox99 fixed)

RESOLVED FIXED
99 Branch

People

(Reporter: whimboo, Assigned: whimboo)

References

(Regression)

Details

(Keywords: regression, Whiteboard: [bidi-m3-mvp], [wptsync upstream])

Attachments

(1 file)

When a wdspec test times out, geckodriver polls the session status for around 70s before shutting down the browser. This specific code was added in bug 1705770.

Here is an example log:
https://firefox-ci-tc.services.mozilla.com/tasks/BdMN5w-0TAOJK6-cav6QnA/runs/0/logs/live/public/logs/live.log#L3347-3490

[task 2022-02-10T08:03:35.108Z] 08:03:35     INFO - PID 2646 | 1644480215107	webdriver::server	DEBUG	-> POST /session/9474988c-ab89-4b0f-88f5-f6c2b9385079/execute/sync {"script": "console.log('TEST')", "args": []}
[task 2022-02-10T08:03:35.126Z] 08:03:35     INFO - PID 2646 | 1644480215124	webdriver::server	DEBUG	<- 200 OK {"value":null}
[task 2022-02-10T08:05:17.673Z] 08:05:17     INFO - TEST-UNEXPECTED-TIMEOUT | /webdriver/tests/bidi/log/entry_added/console.py | expected OK
[task 2022-02-10T08:05:17.673Z] 08:05:17     INFO - TEST-INFO took 105003ms
[task 2022-02-10T08:05:17.692Z] 08:05:17     INFO - PID 2646 | 1644480317690	webdriver::server	DEBUG	-> GET /status
[task 2022-02-10T08:05:17.694Z] 08:05:17     INFO - PID 2646 | 1644480317691	webdriver::server	DEBUG	<- 200 OK {"value":{"message":"Session already started","ready":false}}
[task 2022-02-10T08:05:18.695Z] 08:05:18     INFO - PID 2646 | 1644480318693
[..]
[task 2022-02-10T08:06:26.913Z] 08:06:26     INFO - PID 2646 | 1644480386911	webdriver::server	DEBUG	-> GET /status
[task 2022-02-10T08:06:26.914Z] 08:06:26     INFO - PID 2646 | 1644480386911	webdriver::server	DEBUG	<- 200 OK {"value":{"message":"Session already started","ready":false}}
[task 2022-02-10T08:06:37.961Z] 08:06:37  WARNING - Forcibly terminating runner process

This is extra time that we should not have to wait. Maybe the session fixture should trigger closing the browser when a test times out?

On the other hand, geckodriver should only check the session state when it knows that the command for destroying the session has already been sent. Otherwise it doesn't make sense to wait. This behavior exists since geckodriver 0.30.0.
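For illustration, here is a rough Python sketch of the guard being suggested. geckodriver itself is written in Rust, so everything here is hypothetical, including the function name and the delete_session_sent parameter:

import time

import requests

def wait_for_session_end(base_url, delete_session_sent, timeout=70):
    # Only poll GET /status if a "Delete Session" command is actually
    # pending; otherwise there is nothing to wait for and polling for
    # 70s just burns time.
    if not delete_session_sent:
        return
    deadline = time.time() + timeout
    while time.time() < deadline:
        status = requests.get(f"{base_url}/status").json()["value"]
        if status["ready"]:
            # The session is gone; safe to shut down the browser.
            return
        time.sleep(1)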

James, mind having a look at this?

Flags: needinfo?(james)
Blocks: 1723202

wptrunner ends up in this state if it's expecting that the WebDriver session will be closed (see [1]). Note that it can't (easily) know if the session is actually being closed, because the session is owned by pytest, which runs in a totally different process. So the actual problem here is that we think we ought to be ending the session but aren't, and the fix is probably to ensure that we actually do end the session in this case.

[1] https://searchfox.org/mozilla-central/source/testing/web-platform/tests/tools/wptrunner/wptrunner/browsers/firefox.py#877

Flags: needinfo?(james)
Has Regression Range: --- → yes

Recent regression, should be fixed for the next release

Severity: -- → S3
Priority: -- → P2

So there is actually a problem when trying to use timeouts with pytest. Web-platform-tests define a timeout per module (file), whereas pytest handles timeouts per function. Plugins like pytest-timeout only work that way, and I cannot find a way to set a module-wide timeout.
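For example, with pytest-timeout even a module-level mark still applies the timeout to each test function individually; there is no way to cap the whole file. The test names below are just for illustration:

import time

import pytest

# Applies per test *function*: each test below gets its own 60s budget.
# There is no equivalent for "this whole module may take at most 60s".
pytestmark = pytest.mark.timeout(60)

def test_first():
    time.sleep(1)  # may run for up to 60s on its own

def test_second():
    time.sleep(1)  # gets a fresh, independent 60s budget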

Further, for the pytestrunner as used by wptrunner, we pass a timeout as an argument when calling run, but that timeout is never actually forwarded to pytest. This means that when a test times out, pytest does not handle it internally (so no safe shutdown happens), and wptrunner has to deal with an inconsistent, unknown state.

Given how differently timeouts are handled, I don't see how we could easily instruct pytest to handle them internally.

From the wptrunner side the status is unclear: it's not known whether some code actually triggered the deletion of the session. As such, just waiting for the session to end is mostly unhelpful; we should at least try to delete the session first, so a clean shutdown can happen before the processes are force-killed.
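A minimal sketch of that idea: issue a best-effort WebDriver "Delete Session" over plain HTTP before killing anything. The helper name is hypothetical, and it assumes the caller somehow knows the session id (which, as noted below, may be exactly the problem):

import requests

def try_delete_session(base_url, session_id, timeout=5):
    # Best-effort DELETE /session/{session id} (the standard WebDriver
    # end-session command) so the browser can shut down cleanly before
    # we fall back to force-killing the processes.
    try:
        requests.delete(f"{base_url}/session/{session_id}", timeout=timeout)
    except requests.RequestException:
        # The session may already be gone or geckodriver unresponsive;
        # either way we continue with the force-kill path.
        pass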

James, am I right that wptrunner starts geckodriver but sadly has no access to the actual session from that code, and that only pytest itself knows about it? If so, we won't be able to delete the session and should directly force-kill the process. We should probably only do that when a timeout error occurs; we could then pass in force=true, and with that flag set we wouldn't enter the session-alive check.
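A hypothetical sketch of what that force flag could look like on the browser side; the class and attribute names are illustrative, not the actual wptrunner code:

import subprocess
import time

class BrowserStub:
    """Illustrative only; not the real wptrunner browser class."""

    def __init__(self, proc: subprocess.Popen, session_ended):
        self.proc = proc
        self._session_ended = session_ended  # callable returning bool

    def stop(self, force=False):
        if not force:
            # Normal shutdown: wait (bounded) for the WebDriver session
            # to end so the browser can exit cleanly.
            deadline = time.time() + 70
            while time.time() < deadline and not self._session_ended():
                time.sleep(1)
        # With force=True the session-alive check is skipped entirely
        # and the process is terminated right away.
        self.proc.kill()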

If you think that this is fine, could you maybe point me to the code where the timeout is handled and the process gets stopped? I'm having trouble finding it.

Flags: needinfo?(james)
Whiteboard: [webdriver:triage]

James will take another look at what would be best here.

No longer blocks: 1723202
Whiteboard: [webdriver:triage]

Try something like:

diff --git a/testing/web-platform/tests/tools/wptrunner/wptrunner/testrunner.py b/testing/web-platform/tests/tools/wptrunner/wptrunner/testrunner.py
index 0dac77c4b5533..9344e454a62af 100644
--- a/testing/web-platform/tests/tools/wptrunner/wptrunner/testrunner.py
+++ b/testing/web-platform/tests/tools/wptrunner/wptrunner/testrunner.py
@@ -246,7 +246,7 @@ class _RunnerManagerState(object):
     initializing = namedtuple("initializing",
                               ["test", "test_group", "group_metadata", "failure_count"])
     running = namedtuple("running", ["test", "test_group", "group_metadata"])
-    restarting = namedtuple("restarting", ["test", "test_group", "group_metadata"])
+    restarting = namedtuple("restarting", ["test", "test_group", "group_metadata", "force_stop"])
     error = namedtuple("error", [])
     stop = namedtuple("stop", [])
 
@@ -417,7 +417,10 @@ class TestRunnerManager(threading.Thread):
             self.logger.debug("Got command: %r" % command)
         except IOError:
             self.logger.error("Got IOError from poll")
-            return RunnerManagerState.restarting(0)
+            return RunnerManagerState.restarting(self.state.test,
+                                                 self.state.test_group,
+                                                 self.state.group_metadata,
+                                                 False)
         except Empty:
             if (self.debug_info and self.debug_info.interactive and
                 self.browser.started and not self.browser.is_alive()):
@@ -442,7 +445,10 @@ class TestRunnerManager(threading.Thread):
                     self.logger.critical("Last test did not complete")
                     return RunnerManagerState.error()
                 self.logger.warning("More tests found, but runner process died, restarting")
-                return RunnerManagerState.restarting(0)
+                return RunnerManagerState.restarting(self.state.test,
+                                                     self.state.test_group,
+                                                     self.state.group_metadata,
+                                                     False)
         else:
             f = (dispatch.get(self.state.__class__, {}).get(command) or
                  dispatch.get(None, {}).get(command))
@@ -549,7 +555,8 @@ class TestRunnerManager(threading.Thread):
             self.logger.info("Restarting browser for new test environment")
             return RunnerManagerState.restarting(self.state.test,
                                                  self.state.test_group,
-                                                 self.state.group_metadata)
+                                                 self.state.group_metadata,
+                                                 False)
 
         self.recording.set(["testrunner", "test"] + self.state.test.id.split("/")[1:])
         self.logger.test_start(self.state.test.id)
@@ -683,6 +690,7 @@ class TestRunnerManager(threading.Thread):
                                file_result.status in ("CRASH", "EXTERNAL-TIMEOUT", "INTERNAL-ERROR") or
                                ((subtest_unexpected or is_unexpected) and
                                 self.restart_on_unexpected))
+        force_stop = test.test_type == "wdspec" and file_result.status == "EXTERNAL-TIMEOUT"
 
         self.recording.set(["testrunner", "after-test"])
         if (not file_result.status == "CRASH" and
@@ -691,7 +699,7 @@ class TestRunnerManager(threading.Thread):
             self.logger.info("Pausing until the browser exits")
             self.send_message("wait")
         else:
-            return self.after_test_end(test, restart_before_next)
+            return self.after_test_end(test, restart_before_next, force_stop=force_stop)
 
     def wait_finished(self, rerun=False):
         assert isinstance(self.state, RunnerManagerState.running)
@@ -701,7 +709,7 @@ class TestRunnerManager(threading.Thread):
         # post-stop processing
         return self.after_test_end(self.state.test, not rerun, force_rerun=rerun)
 
-    def after_test_end(self, test, restart, force_rerun=False):
+    def after_test_end(self, test, restart, force_rerun=False, force_stop=False):
         assert isinstance(self.state, RunnerManagerState.running)
         # Mixing manual reruns and automatic reruns is confusing; we currently assume
         # that as long as we've done at least the automatic run count in total we can
@@ -718,14 +726,14 @@ class TestRunnerManager(threading.Thread):
             test_group = self.state.test_group
             group_metadata = self.state.group_metadata
         if restart:
-            return RunnerManagerState.restarting(test, test_group, group_metadata)
+            return RunnerManagerState.restarting(test, test_group, group_metadata, force_stop)
         else:
             return RunnerManagerState.running(test, test_group, group_metadata)
 
     def restart_runner(self):
         """Stop and restart the TestRunner"""
         assert isinstance(self.state, RunnerManagerState.restarting)
-        self.stop_runner()
+        self.stop_runner(force=self.state.force_stop)
         return RunnerManagerState.initializing(self.state.test, self.state.test_group, self.state.group_metadata, 0)
 
     def log(self, data):

I haven't tried it with a wdspec test, but it at least seems to run other tests, so I don't think it should break everything.

Agreed, this code isn't very simple; the loop there is trying to be an explicit state machine, using tokens to pass data between the different states rather than having everything just call methods directly, but that does make it harder to add more state.
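Reduced to a minimal, self-contained sketch, that pattern looks like this (state names are illustrative):

from collections import namedtuple

# Each state is a token carrying exactly the data the next transition
# needs; adding data (like force_stop) means threading a new field
# through every place that constructs the token.
Running = namedtuple("Running", ["test"])
Restarting = namedtuple("Restarting", ["test", "force_stop"])
Stop = namedtuple("Stop", [])

def step(state):
    # One transition of the loop: return the next state token instead
    # of calling methods on other objects directly.
    if isinstance(state, Running):
        return Restarting(state.test, force_stop=False)
    if isinstance(state, Restarting):
        return Stop()
    return state

state = Running(test="example")
while not isinstance(state, Stop):
    state = step(state)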

Flags: needinfo?(james)

Web-platform tests use a timeout per test module, while pytest
supports a timeout per test method. As long as a file-wide
timeout isn't supported, wptrunner has to force-kill the
browser when a wdspec test runs into an external timeout.

Assignee: nobody → hskupin
Status: NEW → ASSIGNED
Component: geckodriver → web-platform-tests
Summary: After a test timeout geckodriver checks 70s for the session status before closing the browser → wptrunner doesn't force-stop the browser when a wdspec test hits an external timeout
Whiteboard: [bidi-m3-mvp]
Pushed by hskupin@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/21dda7d9394c
[wptrunner] Force stopping the browser when wdspec test hits external timeout. r=jgraham
Whiteboard: [bidi-m3-mvp] → [bidi-m3-mvp], [wptsync upstream error]
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → 99 Branch

James, can you please check what's wrong with the upstream sync and why the bot added [wptsync upstream error]? It would have been nice if the bot had printed the URL of the PR in the comment above. I looked it up, and it's the following:

https://github.com/web-platform-tests/wpt/pull/32994

Points: --- → 2
Flags: needinfo?(james)

I merged the upstream PR. The documentation bug is independent of this change and needs to be investigated. I'm not sure why the bot didn't comment on the bug with the sync link; that is definitely supposed to happen.

Flags: needinfo?(james)
Whiteboard: [bidi-m3-mvp], [wptsync upstream error] → [bidi-m3-mvp], [wptsync upstream]
Created web-platform-tests PR https://github.com/web-platform-tests/wpt/pull/33161 for changes under testing/web-platform/tests