Closed Bug 1754712 Opened 2 years ago Closed 2 years ago

wptrunner doesn't force-stop the browser when a wdspec test hits an external timeout

Categories

(Testing :: web-platform-tests, defect, P2)

Points: 2

Tracking

(firefox-esr91 wontfix, firefox97 wontfix, firefox98 wontfix, firefox99 fixed)

RESOLVED FIXED
99 Branch

People

(Reporter: whimboo, Assigned: whimboo)

References

(Regression)

Details

(Keywords: regression, Whiteboard: [bidi-m3-mvp], [wptsync upstream])

Attachments

(1 file)

When a wdspec test times out, geckodriver polls the session status for around 70s before shutting down the browser. This specific code was added in bug 1705770.

Here is an example log:
https://firefox-ci-tc.services.mozilla.com/tasks/BdMN5w-0TAOJK6-cav6QnA/runs/0/logs/live/public/logs/live.log#L3347-3490

[task 2022-02-10T08:03:35.108Z] 08:03:35     INFO - PID 2646 | 1644480215107	webdriver::server	DEBUG	-> POST /session/9474988c-ab89-4b0f-88f5-f6c2b9385079/execute/sync {"script": "console.log('TEST')", "args": []}
[task 2022-02-10T08:03:35.126Z] 08:03:35     INFO - PID 2646 | 1644480215124	webdriver::server	DEBUG	<- 200 OK {"value":null}
[task 2022-02-10T08:05:17.673Z] 08:05:17     INFO - TEST-UNEXPECTED-TIMEOUT | /webdriver/tests/bidi/log/entry_added/console.py | expected OK
[task 2022-02-10T08:05:17.673Z] 08:05:17     INFO - TEST-INFO took 105003ms
[task 2022-02-10T08:05:17.692Z] 08:05:17     INFO - PID 2646 | 1644480317690	webdriver::server	DEBUG	-> GET /status
[task 2022-02-10T08:05:17.694Z] 08:05:17     INFO - PID 2646 | 1644480317691	webdriver::server	DEBUG	<- 200 OK {"value":{"message":"Session already started","ready":false}}
[task 2022-02-10T08:05:18.695Z] 08:05:18     INFO - PID 2646 | 1644480318693
[..]
[task 2022-02-10T08:06:26.913Z] 08:06:26     INFO - PID 2646 | 1644480386911	webdriver::server	DEBUG	-> GET /status
[task 2022-02-10T08:06:26.914Z] 08:06:26     INFO - PID 2646 | 1644480386911	webdriver::server	DEBUG	<- 200 OK {"value":{"message":"Session already started","ready":false}}
[task 2022-02-10T08:06:37.961Z] 08:06:37  WARNING - Forcibly terminating runner process

This is extra time that we should not have to wait. Maybe the session fixture should trigger closing the browser when a test times out?

On the other hand, geckodriver should only check the session state when it knows that the command for destroying the session has already been sent. Otherwise it doesn't make sense to wait. This behavior exists since geckodriver 0.30.0.
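For illustration, here is a rough Python sketch of the guard being suggested. geckodriver itself is written in Rust, so everything here is hypothetical, including the function name and the delete_session_sent parameter:

import time

import requests

def wait_for_session_end(base_url, delete_session_sent, timeout=70):
    # Only poll GET /status if a "Delete Session" command is actually
    # pending; otherwise there is nothing to wait for and polling for
    # 70s just burns time.
    if not delete_session_sent:
        return
    deadline = time.time() + timeout
    while time.time() < deadline:
        status = requests.get(f"{base_url}/status").json()["value"]
        if status["ready"]:
            # The session is gone; safe to shut down the browser.
            return
        time.sleep(1)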

James, mind having a look at this?

Flags: needinfo?(james)
Blocks: 1723202

wptrunner ends up in this state if it's expecting that the WebDriver session will be closed (see [1]). Note that it can't (easily) know if the session is actually being closed, because the session is owned by pytest, which runs in a totally different process. So the actual problem here is that we think we ought to be ending the session but aren't, and the fix is probably to ensure that we actually do end the session in this case.

[1] https://searchfox.org/mozilla-central/source/testing/web-platform/tests/tools/wptrunner/wptrunner/browsers/firefox.py#877

Flags: needinfo?(james)
Has Regression Range: --- → yes

Recent regression, should be fixed for the next release

Severity: -- → S3
Priority: -- → P2

So there is actually a problem when trying to use timeouts with pytest. Web-platform-tests define a timeout per module (file), whereas pytest handles timeouts per function. Plugins like pytest-timeout only work that way, and I cannot find a way to set a module-wide timeout.
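For example, with pytest-timeout even a module-level mark still applies the timeout to each test function individually; there is no way to cap the whole file. The test names below are just for illustration:

import time

import pytest

# Applies per test *function*: each test below gets its own 60s budget.
# There is no equivalent for "this whole module may take at most 60s".
pytestmark = pytest.mark.timeout(60)

def test_first():
    time.sleep(1)  # may run for up to 60s on its own

def test_second():
    time.sleep(1)  # gets a fresh, independent 60s budget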

Further, for the pytestrunner as used by wptrunner, we pass a timeout as an argument when calling run, but that timeout is never actually forwarded to pytest. This means that when a test times out, pytest does not handle it internally (so no safe shutdown happens), and wptrunner has to deal with an inconsistent, unknown state.

Given how differently timeouts are handled, I don't see how we could easily instruct pytest to handle them internally.

From the wptrunner side the status is unclear: it's not known whether some code actually triggered the deletion of the session. As such, just waiting for the session to end is mostly unhelpful; we should at least try to delete the session first, so a clean shutdown can happen before the processes are force-killed.
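A minimal sketch of that idea: issue a best-effort WebDriver "Delete Session" over plain HTTP before killing anything. The helper name is hypothetical, and it assumes the caller somehow knows the session id (which, as noted below, may be exactly the problem):

import requests

def try_delete_session(base_url, session_id, timeout=5):
    # Best-effort DELETE /session/{session id} (the standard WebDriver
    # end-session command) so the browser can shut down cleanly before
    # we fall back to force-killing the processes.
    try:
        requests.delete(f"{base_url}/session/{session_id}", timeout=timeout)
    except requests.RequestException:
        # The session may already be gone or geckodriver unresponsive;
        # either way we continue with the force-kill path.
        pass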

James, am I right that wptrunner starts geckodriver but sadly has no access to the actual session from that code, and that only pytest itself knows about it? If so, we won't be able to delete the session and should directly force-kill the process. We should probably only do that when a timeout error occurs; we could then pass in force=true, and with that flag set we wouldn't enter the session-alive check.
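A hypothetical sketch of what that force flag could look like on the browser side; the class and attribute names are illustrative, not the actual wptrunner code:

import subprocess
import time

class BrowserStub:
    """Illustrative only; not the real wptrunner browser class."""

    def __init__(self, proc: subprocess.Popen, session_ended):
        self.proc = proc
        self._session_ended = session_ended  # callable returning bool

    def stop(self, force=False):
        if not force:
            # Normal shutdown: wait (bounded) for the WebDriver session
            # to end so the browser can exit cleanly.
            deadline = time.time() + 70
            while time.time() < deadline and not self._session_ended():
                time.sleep(1)
        # With force=True the session-alive check is skipped entirely
        # and the process is terminated right away.
        self.proc.kill()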

If you think that this is fine, could you maybe point me to the code where the timeout is handled and the process gets stopped? I'm having trouble finding it.

Flags: needinfo?(james)
Whiteboard: [webdriver:triage]

James will take another look at what would be best here.

No longer blocks: 1723202
Whiteboard: [webdriver:triage]

Try something like:

diff --git a/testing/web-platform/tests/tools/wptrunner/wptrunner/testrunner.py b/testing/web-platform/tests/tools/wptrunner/wptrunner/testrunner.py
index 0dac77c4b5533..9344e454a62af 100644
--- a/testing/web-platform/tests/tools/wptrunner/wptrunner/testrunner.py
+++ b/testing/web-platform/tests/tools/wptrunner/wptrunner/testrunner.py
@@ -246,7 +246,7 @@ class _RunnerManagerState(object):
     initializing = namedtuple("initializing",
                               ["test", "test_group", "group_metadata", "failure_count"])
     running = namedtuple("running", ["test", "test_group", "group_metadata"])
-    restarting = namedtuple("restarting", ["test", "test_group", "group_metadata"])
+    restarting = namedtuple("restarting", ["test", "test_group", "group_metadata", "force_stop"])
     error = namedtuple("error", [])
     stop = namedtuple("stop", [])
 
@@ -417,7 +417,10 @@ class TestRunnerManager(threading.Thread):
             self.logger.debug("Got command: %r" % command)
         except IOError:
             self.logger.error("Got IOError from poll")
-            return RunnerManagerState.restarting(0)
+            return RunnerManagerState.restarting(self.state.test,
+                                                 self.state.test_group,
+                                                 self.state.group_metadata,
+                                                 False)
         except Empty:
             if (self.debug_info and self.debug_info.interactive and
                 self.browser.started and not self.browser.is_alive()):
@@ -442,7 +445,10 @@ class TestRunnerManager(threading.Thread):
                     self.logger.critical("Last test did not complete")
                     return RunnerManagerState.error()
                 self.logger.warning("More tests found, but runner process died, restarting")
-                return RunnerManagerState.restarting(0)
+                return RunnerManagerState.restarting(self.state.test,
+                                                     self.state.test_group,
+                                                     self.state.group_metadata,
+                                                     False)
         else:
             f = (dispatch.get(self.state.__class__, {}).get(command) or
                  dispatch.get(None, {}).get(command))
@@ -549,7 +555,8 @@ class TestRunnerManager(threading.Thread):
             self.logger.info("Restarting browser for new test environment")
             return RunnerManagerState.restarting(self.state.test,
                                                  self.state.test_group,
-                                                 self.state.group_metadata)
+                                                 self.state.group_metadata,
+                                                 False)
 
         self.recording.set(["testrunner", "test"] + self.state.test.id.split("/")[1:])
         self.logger.test_start(self.state.test.id)
@@ -683,6 +690,7 @@ class TestRunnerManager(threading.Thread):
                                file_result.status in ("CRASH", "EXTERNAL-TIMEOUT", "INTERNAL-ERROR") or
                                ((subtest_unexpected or is_unexpected) and
                                 self.restart_on_unexpected))
+        force_stop = test.test_type == "wdspec" and file_result.status == "EXTERNAL-TIMEOUT"
 
         self.recording.set(["testrunner", "after-test"])
         if (not file_result.status == "CRASH" and
@@ -691,7 +699,7 @@ class TestRunnerManager(threading.Thread):
             self.logger.info("Pausing until the browser exits")
             self.send_message("wait")
         else:
-            return self.after_test_end(test, restart_before_next)
+            return self.after_test_end(test, restart_before_next, force_stop=force_stop)
 
     def wait_finished(self, rerun=False):
         assert isinstance(self.state, RunnerManagerState.running)
@@ -701,7 +709,7 @@ class TestRunnerManager(threading.Thread):
         # post-stop processing
         return self.after_test_end(self.state.test, not rerun, force_rerun=rerun)
 
-    def after_test_end(self, test, restart, force_rerun=False):
+    def after_test_end(self, test, restart, force_rerun=False, force_stop=False):
         assert isinstance(self.state, RunnerManagerState.running)
         # Mixing manual reruns and automatic reruns is confusing; we currently assume
         # that as long as we've done at least the automatic run count in total we can
@@ -718,14 +726,14 @@ class TestRunnerManager(threading.Thread):
             test_group = self.state.test_group
             group_metadata = self.state.group_metadata
         if restart:
-            return RunnerManagerState.restarting(test, test_group, group_metadata)
+            return RunnerManagerState.restarting(test, test_group, group_metadata, force_stop)
         else:
             return RunnerManagerState.running(test, test_group, group_metadata)
 
     def restart_runner(self):
         """Stop and restart the TestRunner"""
         assert isinstance(self.state, RunnerManagerState.restarting)
-        self.stop_runner()
+        self.stop_runner(force=self.state.force_stop)
         return RunnerManagerState.initializing(self.state.test, self.state.test_group, self.state.group_metadata, 0)
 
     def log(self, data):

I haven't tried it with a wdspec test, but it at least seems to run other tests, so I don't think it should break everything.

Agreed, this code isn't very simple; the loop there is trying to be an explicit state machine, using tokens to pass data between the different states rather than having everything just call methods directly, but that does make it harder to add more state.
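Reduced to a minimal, self-contained sketch, that pattern looks like this (state names are illustrative):

from collections import namedtuple

# Each state is a token carrying exactly the data the next transition
# needs; adding data (like force_stop) means threading a new field
# through every place that constructs the token.
Running = namedtuple("Running", ["test"])
Restarting = namedtuple("Restarting", ["test", "force_stop"])
Stop = namedtuple("Stop", [])

def step(state):
    # One transition of the loop: return the next state token instead
    # of calling methods on other objects directly.
    if isinstance(state, Running):
        return Restarting(state.test, force_stop=False)
    if isinstance(state, Restarting):
        return Stop()
    return state

state = Running(test="example")
while not isinstance(state, Stop):
    state = step(state)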

Flags: needinfo?(james)

Web-platform tests use a timeout per test module, while pytest
supports a timeout per test method. As long as a file-wide
timeout isn't supported, wptrunner has to force-kill the
browser when a wdspec test runs into an external timeout.

Assignee: nobody → hskupin
Status: NEW → ASSIGNED
Component: geckodriver → web-platform-tests
Summary: After a test timeout geckodriver checks 70s for the session status before closing the browser → wptrunner doesn't force-stop the browser when a wdspec test hits an external timeout
Whiteboard: [bidi-m3-mvp]
Pushed by hskupin@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/21dda7d9394c
[wptrunner] Force stopping the browser when wdspec test hits external timeout. r=jgraham
Whiteboard: [bidi-m3-mvp] → [bidi-m3-mvp], [wptsync upstream error]
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → 99 Branch

James, can you please check what's wrong with the upstream sync and why the bot added [wptsync upstream error]? It would have been nice if the bot had printed the URL of the PR in the comment above. I looked it up, and it's the following:

https://github.com/web-platform-tests/wpt/pull/32994

Points: --- → 2
Flags: needinfo?(james)

I merged the upstream PR. The documentation bug is independent of this change and needs to be investigated. I'm not sure why the bot didn't comment on the bug with the sync link; that is definitely supposed to happen.

Flags: needinfo?(james)
Whiteboard: [bidi-m3-mvp], [wptsync upstream error] → [bidi-m3-mvp], [wptsync upstream]
Created web-platform-tests PR https://github.com/web-platform-tests/wpt/pull/33161 for changes under testing/web-platform/tests