Closed Bug 1947453 Opened 1 year ago Closed 1 year ago

Frequent WD macosx1470-64 debug /webdriver/tests/classic/set_window_rect/set.py | single tracking bug

Categories

(Remote Protocol :: Marionette, defect, P2)

defect

Tracking

(firefox-esr115 unaffected, firefox-esr128 unaffected, firefox135 unaffected, firefox136 unaffected, firefox137 fixed)

RESOLVED FIXED
137 Branch
Tracking Status
firefox-esr115 --- unaffected
firefox-esr128 --- unaffected
firefox135 --- unaffected
firefox136 --- unaffected
firefox137 --- fixed

People

(Reporter: whimboo, Assigned: emilio)

References

(Regression)

Details

(Keywords: intermittent-failure, regression, Whiteboard: [webdriver:m15][webdriver:external])

Attachments

(2 files)

Since bug 581863 landed we can see hangs in WebDriver tests that manipulate the window position and size. Here an example of a try build from earlier today as built from most recent mozilla-central:

https://treeherder.mozilla.org/logviewer?job_id=494342940&repo=try&lineNumber=281343

[task 2025-02-11T11:17:38.859Z] 11:17:38     INFO - PID 4569 | 1739272658857	Marionette	DEBUG	0 -> [0,1236,"WebDriver:SetWindowRect",{"height":1080,"width":1920,"x":0,"y":25}]
[task 2025-02-11T11:17:38.860Z] 11:17:38     INFO - PID 4569 | 1739272658857	Marionette	TRACE	Checking window geometry 800x600 @ (100, 100)
[task 2025-02-11T11:17:38.862Z] 11:17:38     INFO - PID 4569 | 1739272658860	RemoteAgent	TRACE	Received DOM event MozUpdateWindowPos for [object WindowRoot]
[task 2025-02-11T11:17:38.863Z] 11:17:38     INFO - PID 4569 | 1739272658861	Marionette	TRACE	Checking window geometry 1920x1055 @ (0, 25)
[task 2025-02-11T11:17:39.285Z] 11:17:39     INFO - PID 4569 | [Parent 4589, Main Thread] ###!!! ASSERTION: mBounds out of sync!: 'mWindow && mBounds == r', file /builds/worker/checkouts/gecko/widget/cocoa/nsCocoaWindow.mm:1892
[task 2025-02-11T11:17:39.298Z] 11:17:39     INFO - PID 4569 | [Parent 4589, Main Thread] ###!!! ASSERTION: mBounds out of sync!: 'mWindow && mBounds == r', file /builds/worker/checkouts/gecko/widget/cocoa/nsCocoaWindow.mm:1892
[task 2025-02-11T11:17:39.299Z] 11:17:39     INFO - PID 4569 | 1739272659298	RemoteAgent	TRACE	Received DOM event resize for [object Window]
..
[task 2025-02-11T11:29:41.391Z] 11:29:41     INFO - PID 4569 | 1739273381390	RemoteAgent	TRACE	Received DOM event MozUpdateWindowPos for [object WindowRoot]
[task 2025-02-11T11:29:41.391Z] 11:29:41     INFO - PID 4569 | 1739273381390	Marionette	TRACE	Checking window geometry 1920x1055 @ (0, 25)
[task 2025-02-11T11:29:41.641Z] 11:29:41  WARNING - Forcibly terminating runner process
[task 2025-02-11T11:29:41.676Z] 11:29:41     INFO - STDERR: Could not terminate process: 4569
[task 2025-02-11T11:29:41.682Z] 11:29:41    ERROR - Failed to stop either the runner or the browser process

The manifest files for the related wdspec tests have multiple statuses included since then. Here an example:
https://searchfox.org/mozilla-central/source/testing/web-platform/meta/webdriver/tests/classic/set_window_rect/set.py.ini

Flags: needinfo?(emilio)

Oh and we usually had a timeout of 500ms which should have caused the command to return early if the target destination and size cannot be reached:

https://searchfox.org/mozilla-central/rev/7c573d9eb97e7b1ba383239bfac9260b26983544/remote/marionette/driver.sys.mjs#1431-1436

Is this on macOS only? If so, it seems likely something about the remote changes is upsetting macOS? Also, is this -debug only?

The timeout seems to still be there unless I'm missing something, it's probably something hanging on the widget side...

Flags: needinfo?(emilio) → needinfo?(hskupin)

Note that I cannot reproduce the hangs locally on MacOS 15.2. Maybe it's related to the configuration in CI? Also I noticed that the meta file got updated on Monday last week so that for OS X 14.7.1 all tests are marked as [TIMEOUT, OK] now. I've reverted those entries and pushed a try build. I'll re-trigger the specific jobs once it's clear in which chunk the tests are running:

https://treeherder.mozilla.org/jobs?repo=try&revision=a2df83245da3876ea684fb144492d6257d1b9000

Flags: needinfo?(hskupin)
Assignee: nobody → hskupin
Status: NEW → ASSIGNED

It looks like there are no more such failures. Lets get the meta data updated.

Summary: "WebDriver:SetWindowRect" sometimes hangs forever since bug 581863 landed → "WebDriver:SetWindowRect" sometimes hangs forever on MacOS since bug 581863 landed
Pushed by hskupin@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/12b3d9efd15d [wdspec] Update meta data for classic/set_window_rect/set.py on MacOS. r=webdriver-reviewers,Sasha
Status: ASSIGNED → RESOLVED
Closed: 1 year ago
Resolution: --- → FIXED
Target Milestone: --- → 137 Branch
Severity: -- → S3
Points: --- → 1
Priority: -- → P3
Whiteboard: [webdriver:m15]

Actually this is still a problem even through on try it didn't fail:
https://treeherder.mozilla.org/logviewer?job_id=496615509&repo=autoland&lineNumber=180883

And yes this is debug only because that assertion is only run for such builds:
https://searchfox.org/mozilla-central/rev/f256e50e068136275c1a2aff827aeddc92a75e07/widget/cocoa/nsCocoaWindow.mm#1889-1893

#ifdef DEBUG
  LayoutDeviceIntRect r = nsCocoaUtils::CocoaRectToGeckoRectDevPix(
      mWindow.frame, BackingScaleFactor());
  NS_ASSERTION(mWindow && mBounds == r, "mBounds out of sync!");
#endif
Assignee: hskupin → nobody
Status: RESOLVED → REOPENED
Points: 1 → ---
Flags: needinfo?(emilio)
Priority: P3 → --
Resolution: FIXED → ---
Whiteboard: [webdriver:m15]
Target Milestone: 137 Branch → ---

Does removing the "retry" code-path on macOS fix it?

diff --git a/remote/shared/WindowManager.sys.mjs b/remote/shared/WindowManager.sys.mjs
index af1da9d25b452..ea9a825d6833e 100644
--- a/remote/shared/WindowManager.sys.mjs
+++ b/remote/shared/WindowManager.sys.mjs
@@ -251,7 +251,7 @@ class WindowManager {
       // wait longer if window geometry does not match.
       const options = {
         checkFn() {
-          return geometryMatches(/* retry = */ true);
+          return geometryMatches();
         },
         timeout: 500,
       };

I don't know how many retriggers am I expected to do to confirm that so if you could that'd be great :)

Flags: needinfo?(emilio) → needinfo?(hskupin)

I've a dirty working tree with changes I cannot commit right now. So if you could push that would be great. But not sure how reliable the results are given that my last try build also didn't fail for quite a few pushes.

Maybe we should not retry when the MozUpdateWindowPos event reports the exact same data as the one before? It probably means we cannot further move the window?

Flags: needinfo?(hskupin) → needinfo?(emilio)

Here's a patch: https://treeherder.mozilla.org/jobs?repo=try&revision=a4dd6b96694b55a3011cec9e2c65075a25feaf9f

But no, I think there's a cocoa widget bug, but without reproing it is a bit hard for me to justify sinking a lot of time into it :(

Flags: needinfo?(emilio)

Emilio, so the trick here to get it reproduced seem to be a full build. While an artifact build on try didn't reproduce the failure I can see it here multiple times:

https://treeherder.mozilla.org/jobs?repo=try&revision=1ee7423804f22531e6094c4083653d61b7d9533f

You pushed your try build for an artifact build, so maybe you try it locally with a full build as well first, and then push again to try?

Given that all the tests are passing in your try build also for other platforms do we really need the retry logic?

Flags: needinfo?(emilio)

To allow sheriffs to better classify the failures we need to update the bug's summary.

Summary: "WebDriver:SetWindowRect" sometimes hangs forever on MacOS since bug 581863 landed → Frequent WD macosx1470-64 debug /webdriver/tests/classic/set_window_rect/set.py | single tracking bug
Flags: needinfo?(emilio)

It seems to confuse macOS, doesn't seem needed anymore on Linux after
bug 1946184, where we started to look at the X server position rather
than the last configure position.

Assignee: nobody → emilio
Pushed by ealvarez@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/2879c26ed4cc Remove retry logic from geometryMatches. r=whimboo,webdriver-reviewers
Status: REOPENED → RESOLVED
Closed: 1 year ago1 year ago
Resolution: --- → FIXED
Target Milestone: --- → 137 Branch

Comment on attachment 9469207 [details]
Bug 1947453 - Remove retry logic from geometryMatches. r=whimboo

Beta/Release Uplift Approval Request

  • User impact if declined/Reason for urgency: Reduces failure rate of bug 1785367
  • Is this code covered by automated tests?: Yes
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: none
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): simple tweak
  • String changes made/needed: none
  • Is Android affected?: No
Attachment #9469207 - Flags: approval-mozilla-beta?

Comment on attachment 9469207 [details]
Bug 1947453 - Remove retry logic from geometryMatches. r=whimboo

The patch landed during the 137 nightly cycle so it's already in beta.

Attachment #9469207 - Flags: approval-mozilla-beta? → approval-mozilla-beta-

This is strange. I've checked some of these failures and it turns out that those are for opt builds and we run into a timeout already after 13s running the test. Isn't the default timeout 25s for wdspec? Here an example:

https://treeherder.mozilla.org/logviewer?job_id=498600907&repo=mozilla-beta&lineNumber=163036

[task 2025-03-11T17:03:44.727Z] 17:03:44     INFO - TEST-START | /webdriver/tests/classic/set_window_rect/set.py
[task 2025-03-11T17:03:44.852Z] 17:03:44     INFO - STDOUT: tests/web-platform/tests/webdriver/tests/classic/set_window_rect/set.py::test_null_parameter_value 
[task 2025-03-11T17:03:48.997Z] 17:03:48     INFO - STDOUT: tests/web-platform/tests/webdriver/tests/classic/set_window_rect/set.py::test_no_top_browsing_context 
[task 2025-03-11T17:03:49.500Z] 17:03:49     INFO - STDOUT: tests/web-platform/tests/webdriver/tests/classic/set_window_rect/set.py::test_no_browsing_context 
[..]
[task 2025-03-11T17:03:56.489Z] 17:03:56     INFO - STDOUT: tests/web-platform/tests/webdriver/tests/classic/set_window_rect/set.py::test_y_as_current 
[task 2025-03-11T17:03:56.518Z] 17:03:56     INFO - STDOUT: tests/web-platform/tests/webdriver/tests/classic/set_window_rect/set.py::test_negative_x_y 
[task 2025-03-11T17:03:57.622Z] 17:03:57     INFO - TEST-UNEXPECTED-OK | /webdriver/tests/classic/set_window_rect/set.py | expected TIMEOUT

Maybe we need to file a new bug for this particular issue in case we do not have an immediate answer to that. Sasha, maybe something changed in wpt?

Flags: needinfo?(aborovova)

Oh wait. The problem here is not that we timeout, but that we do NOT timeout. Looks like the following entry in the manifest file is causing it:
https://searchfox.org/mozilla-beta/source/testing/web-platform/meta/webdriver/tests/classic/set_window_rect/set.py.ini#5

On mozilla-central we have [OK, TIMEOUT] but shouldn't we no longer expect timeout since the fix landed?
https://searchfox.org/mozilla-central/source/testing/web-platform/meta/webdriver/tests/classic/set_window_rect/set.py.ini#5

I've pushed to try: https://treeherder.mozilla.org/jobs?repo=try&revision=95eaab54270cc9d4db6d2056d9914a1c61ce5ba5, to see if we can try to remove the metadata.

Flags: needinfo?(aborovova)
Priority: -- → P2
Whiteboard: [webdriver:m15][webdriver:external]
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: