Closed Bug 1739369 Opened 3 years ago Closed 2 years ago

WebDriver:NewWindow with type "tab" hangs in upstream wpt stability jobs

Categories

(Remote Protocol :: Marionette, defect, P2)

Default
defect
Points:
2

Tracking

(firefox97 fixed)

RESOLVED FIXED
97 Branch
Tracking Status
firefox97 --- fixed

People

(Reporter: whimboo, Assigned: whimboo)

References

Details

(Whiteboard: [bidi-m3-mvp])

Attachments

(3 files)

See https://github.com/web-platform-tests/wpt/issues/31499 for the upstream issue.

The problem seems to only happen with the tests test_no_top_browsing_context for most of the WebDriver commands. In these cases the New Window command with type tab doesn't return. Interestingly this only happens upstream and not in our own CI.

Here an excerpt:

0:29.61 pid:1401 1636025190236	Marionette	DEBUG	0 -> [0,241,"WebDriver:NewWindow",{"type":"tab"}]
 0:29.62 pid:1401 1636025190241	Marionette	TRACE	Received observer notification domwindowopened
 0:29.74 pid:1401 1636025190366	Marionette	TRACE	Received event beforeunload for about:blank
 0:29.75 pid:1401 1636025190376	Marionette	TRACE	Received event pagehide for about:blank
 0:29.76 pid:1401 1636025190381	Marionette	TRACE	Received event DOMContentLoaded for about:blank
 0:29.76 pid:1401 1636025190381	Marionette	TRACE	Received event pageshow for about:blank
 0:59.37 TEST_END: TIMEOUT, expected OK

Most likely we hang in waiting for the load event of the newly opened tab:

https://searchfox.org/mozilla-central/rev/88cd13997fb0747cdcd78638fc762ff2d75e1fc5/remote/marionette/driver.js#1977-1999

Maybe it's caused by a remoteness change that we miss to detect and wait forever.

Sadly we print out the load events before checking the browsing context:
https://searchfox.org/mozilla-release/rev/67bec9f7c690bbc72cfefe652a9d5fffb961b698/remote/marionette/sync.js#550-557

As such it's possible that the browsing context has been changed and we did not detect that. Maybe it would be good to also use the WebProgressListener here similar to what we have in mind for bug 1739008.

Assignee: nobody → hskupin
Status: NEW → ASSIGNED

I uploaded a small patch to improve the logging. We need to get it landed first and reach Nightly before I can test upstream again.

Keywords: leave-open

Lets wait for the code from bug 1739008 has been landed. It will make things easier also for this command.

Depends on: 1739008
Pushed by hskupin@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/00c3528c6e15
[marionette] Improve logging for load events in waitForLoadEvent. r=webdriver-reviewers,jdescottes

With the patch landed and on Nightly I ran another nightly stability job on the wpt CI and the WebDriver:NewWindow command hangs with:

 0:26.87 pid:1407 1636132910074	Marionette	DEBUG	0 -> [0,241,"WebDriver:NewWindow",{"type":"tab"}]
 0:26.88 pid:1407 1636132910080	Marionette	TRACE	Received observer notification domwindowopened
 0:56.60 TEST_END: TIMEOUT, expected OK

As such it's clear that we do not listen on the correct browsing context for the page load events and as such wait forever for the tab to be loaded.

It's not clear why we do not see that in our own CI and cannot reproduce locally, but a fix is needed and it can be done equally to bug 1739008.

I pushed a try build for that bug:
https://treeherder.mozilla.org/jobs?repo=try&revision=9d21c3194a35a139258bd02a44b94a3360ac034d&selectedTaskRun=IGXt3JFJSs6nENfbCNtyJw.0

Sadly it's always failing the test test_navigation.py TestPageLoadStrategy.test_none with the first patch applied. But that one only moves the already used code from driver.js into the navigate.js module, and doesn't change anything else. So this failure shouldn't actually happen.

Olli, are there any side effects with webProgress listeners that we should be aware of? We register a listener but the onStateChange method is never called, and as such our Promise doesn't resolve.

Flags: needinfo?(bugs)
Blocks: 1740021

Is the listener registered early enough, and is it registered on the right object? Not sure what else I could say about any side effects.

Flags: needinfo?(bugs)

Based on Olli's reply I had another look at these changes, and AFAICS there is actually a race here...

Without my patch we run the browsingContext.webProgress.isLoadingDocument check in driver.js and that returns true. It means that we triggered the page load, immediately destroyed the Marionette session, and the check in session creation triggers the code path that actually handles onStateChange events and finally resolves the promise:

1639469808970	Marionette	DEBUG	4 -> [0,1,"WebDriver:NewSession",{"strictFileInteractability":true}]
** url: http://127.0.0.1:60099/slow_resource.html
** initial document: false
** is loading: true
** await for navigation

By moving the code into navigate.js the situation is different. While all the state as printed is completely the same, there is no onStateChange event fired and as such the promise never resolved:

** url: http://127.0.0.1:60198/slow_resource.html
** initial document: false
** is loading: true
** await for navigation

The URL is correct and as such I assume we attach the progress listener to the correct browsing context. But why it doesn't fire events even that I can see that the page finished loading is unclear.

Olli, what do you mean with registered early enough? We attach it when the page is already loading, but also before it has been finished loading. Are there any MOZ_LOG capabilities that could be used to log more detailed information?

Flags: needinfo?(bugs)

Even when subscribing for all notifications for this WebProgress I never receive an onStateChange event. But when using the BCWebProgress:5 logger I can clearly see that there should be an incoming state change:

1640207319580	Marionette	DEBUG	5 -> [0,1,"WebDriver:NewSession",{"strictFileInteractability":true}]
** add webprogress to browsing context 44 for http://127.0.0.1:62932/slow_resource.html
** initial document: false
** is loading: true
** await for navigation
[Parent 62110: Main Thread]: I/BCWebProgress OnStatusChange(<null>, <null>, NS_OK, "Transferring data from 127.0.0.1…") on {top:1, id:2c, url:http://127.0.0.1:62932/slow_resource.html}
[Parent 62110: Main Thread]: I/BCWebProgress OnStateChange({isTopLevel:1, isLoadingDocument:1}, {URI:http://127.0.0.1:62932/slow_resource.html, originalURI:http://127.0.0.1:62932/slow_resource.html}, STATE_STOP|STATE_IS_NETWORK|STATE_IS_WINDOW, NS_OK) on {top:1, id:2c, url:http://127.0.0.1:62932/slow_resource.html}

Here is what we have without my patch:

1640207843183	Marionette	DEBUG	4 -> [0,1,"WebDriver:NewSession",{"strictFileInteractability":true}]
[Parent 67098: Main Thread]: I/BCWebProgress OnStatusChange(<null>, <null>, NS_OK, "Transferring data from 127.0.0.1…") on {top:1, id:2c, url:http://127.0.0.1:63410/slow_resource.html}
[Parent 67098: Main Thread]: I/BCWebProgress OnStateChange({isTopLevel:1, isLoadingDocument:1}, {URI:http://127.0.0.1:63410/slow_resource.html, originalURI:http://127.0.0.1:63410/slow_resource.html}, STATE_STOP|STATE_IS_NETWORK|STATE_IS_WINDOW, NS_OK) on {top:1, id:2c, url:http://127.0.0.1:63410/slow_resource.html}
1640207847136	Marionette	DEBUG	4 <- [1,1,null,{"sessionId":"45f27c35-d06f-43ee-ba40-6f021c4b0a5a","capabilities":{"browserName":"firefox","browserV

So it's exactly the same but why does it not work when the handling of the webProgress is part of a different module (navigate.js instead of driver.js)? Maybe this is somewhat related to bug 1746483.

(In reply to Henrik Skupin (:whimboo) [⌚️UTC+1] from comment #11)

Maybe this is somewhat related to bug 1746483.

Actually not. In this case we do not even see an onStateChange with the logging module.

Hi Steve and Jon, could one of you please have a look at comment 11? With this patch as attached as WIP I'm moving the handling of a webProgress into an async method. The registered webProgressListener is then defined within this async method. But that is causing problems because the listener is a weak reference and get GC'ed before the page that the progresslistener is waiting for finishes loading.

I would be interested to know why the GC actually garbage collects the listener inside the async method which at this point is still running and the included promise hasn't been resolved?

Thanks a lot!

Flags: needinfo?(sphink)
Flags: needinfo?(jcoppeard)
Flags: needinfo?(bugs)
Blocks: 1747359
Attachment #9256593 - Attachment description: WIP: Bug 1739369 - [marionette] Move wait for initial page load logic into navigate module. → Bug 1739369 - [marionette] Move wait for initial page load logic into navigate module.
Attachment #9256594 - Attachment description: WIP: Bug 1739369 - [marionette] Use webProgress to listen for the initial page being loaded in "WebDriver:NewWindow". → Bug 1739369 - [marionette] Use webProgress to listen for the initial page being loaded in "WebDriver:NewWindow".
Depends on: 1748626

Running on try shows issues on Android where we seem to hang with the isInitialDocument check included. We could leave it out or just disable it for WebDriver:NewSession (see the proposal from Julian).

But before I can make a decision I would like to test myself locally on Android. Sadly various things are currently broken so that I cannot run the tests, and at least bug 1748626 is still required.

Attachment #9256593 - Attachment description: Bug 1739369 - [marionette] Move wait for initial page load logic into navigate module. → Bug 1739369 - [marionette] Move wait for initial page load logic into shared navigate module.
Pushed by hskupin@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/c90687f11d9b
[marionette] Move wait for initial page load logic into shared navigate module. r=webdriver-reviewers,jdescottes
https://hg.mozilla.org/integration/autoland/rev/ebff6a7ca24b
[marionette] Use webProgress to listen for the initial page being loaded in "WebDriver:NewWindow". r=webdriver-reviewers,jdescottes
Regressions: 1749111
Regressions: 1749106
Regressions: 1749104

Marking as fixed for now given that all necessary patches got landed. We see low frequent regressions that I'm going to observe the next days.

Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Keywords: leave-open
Resolution: --- → FIXED
Target Milestone: --- → 97 Branch

I checked the stability jobs of wpt and it looks to work fine now:

 0:31.60 pid:1442 1641829191459	Marionette	DEBUG	0 -> [0,23,"WebDriver:NewWindow",{"type":"tab"}]
 0:31.61 pid:1442 1641829191466	Marionette	TRACE	Received DOM event TabOpen for [object XULElement]
 0:31.63 pid:1442 1641829191493	RemoteAgent	TRACE	Initial navigation already completed
 0:31.64 pid:1442 1641829191494	Marionette	DEBUG	0 <- [1,23,null,{"handle":"d9522b07-d575-403b-8242-8bf144cb0365","type":"tab"}]

We still have the regressions, which might let us return a bit too early. I'll keep an eye on it.

An upcoming patch on bug 1747359 should fix all the known regressions.

Points: --- → 2
Whiteboard: [bidi-m3-mvp]
Priority: P3 → P2
Regressions: 1750441
Regressions: 1749626
Regressions: 1750433
Regressions: 1750302
Regressions: 1750125
Regressions: 1750334
No longer depends on: 1748626
Product: Testing → Remote Protocol

Looks like this got resolved.

Flags: needinfo?(sphink)
Flags: needinfo?(jcoppeard)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: