Open Bug 1664165 Opened 4 years ago Updated 1 day ago

Make use of a webprogress listener instead of page load events for tracking active navigations in Marionette

Categories

(Remote Protocol :: Marionette, task, P2)

Default
task
Points:
5

Tracking

(Not tracked)

ASSIGNED

People

(Reporter: whimboo, Assigned: whimboo)

References

(Blocks 6 open bugs)

Details

(Whiteboard: [webdriver:backlog])

Attachments

(1 file, 2 obsolete files)

With bug 1612831 all the navigation commands will only operate on the parent process. That means no IPC communication will happen between the driver and the listener framescript. By doing that it retains the logic for waiting of the navigation to complete by using the page load events as sent by the framescript.

For a better handling and to finally get rid of the framescript at all, those checks need to be done in the parent process as best via the webprogress listener interface as available on each browsing context.

But before doing that we should get all the navigation tests updates first (see bug 1650132).

Fission Milestone: --- → M7
Whiteboard: [marionette-fission-mvp] → [marionette-fission-mvp][complex]
No longer blocks: 1665210
See Also: → 1673823

Given the complexity of this change I would suggest to try the minimal approach by using the events via a JSWindowActor implementation. I filed bug 1673823 for that.

No longer blocks: 1669174

With the framescript code gone we would only have to implement that for actors. So lets block on bug 1669172.

Depends on: 1669172

Also it would be good to have full support for bfcache (bug 1673059) before doing that change.

Depends on: 1673059

With all the recent work on the different Marionette actors the current way how navigations are handled works pretty fine, and doesn't cause any trouble with Fission. Moving over to a webprogress listener is complex, and can cause lots of side-effects / regressions, which do not outweigh any benefits from doing that now.

Whiteboard: [marionette-fission-mvp][complex]
No longer blocks: marionette-fission
Fission Milestone: M7 → ---
Summary: Move checks for active navigation from page load events to webprogress listener → Make use of a webprogress listener instead of page load events for tracking active navigations

Quick update. Using the webProgress via browsingContext.webProgress would be the way to go here. It automatically tracks browsing context replacements for ongoing loads, and won't need such a complicated logic as what we have right now. Here a simple example which returns when the page has been finished loading:

        await new Promise(resolve => {
          let wpl = {
            QueryInterface: ChromeUtils.generateQI([
              "nsIWebProgressListener",
              "nsISupportsWeakReference",
            ]),
            onStateChange(webProgress, request, stateFlags) {
              const { STATE_STOP } = Ci.nsIWebProgressListener;
              if ((stateFlags & STATE_STOP) == STATE_STOP) {
                browsingContext.webProgress.removeProgressListener(
                  wpl,
                  Ci.nsIWebProgress.NOTIFY_ALL
                );
                resolve();
              }
            },
          };
          browsingContext.webProgress.addProgressListener(
            wpl,
            Ci.nsIWebProgress.NOTIFY_ALL
          );
        });

When implementing this feature we should definitely use the MOZ_LOG=BCWebProgress:5 log module. It gives great information about any notification of web progress instances.

Blocks: 1730642
Points: --- → 13
Whiteboard: [bidi-m3-mvp]
Priority: P3 → P2
Assignee: nobody → jdescottes
Status: NEW → ASSIGNED

After discussing with Henrik, we will keep the current navigation tests as marionette unit tests, because they are much faster to run than wdspec tests. Removing the dependency on Bug 1650132.

The current marionette tests for navigation can be found at:

No longer depends on: 1650132

Summary of the progress so far.

I have added a preference to turn off the webprogress implementation for navigation: marionette.webprogress.enabled.
Even when this preference is enabled, we still use the legacy codepath for click attempts.
The implementation using webprogress listener is using a combination of onStateChange and onLocationChange.

Challenges/issues:

  1. PageLoadStrategy.Eager: onLocationChange is mandatory if we want to replicate the "eager" page load strategy. Otherwise for onStateChange, STATE_START is too early, STATE_STOP is too late. It doesn't seem like webprogresslistener is a good fit for this
  2. detecting about:*error redirections is not straightforward. If you attempt to navigate to "thisprotocoldoesnotexist://", you will reach a STATE_STOP for this URL before you are redirected to about:neterror. DevTools detects such bad requests by checking status != Cr.NS_OK && status != Cr.NS_BINDING_ABORTED and then waits for the next DOMContentLoaded since in DevTools case this runs in the content process (searchfox). Here instead I defer to the next onLocationChange, which should be about:neterror. However, there are also seemingly valid requests which match status != Cr.NS_OK && status != Cr.NS_BINDING_ABORTED. I had the case for a navigation between an inline image and a regular image (searchfox). So for now I simply wait for some time but if no onLocationChange resolves the navigation, I consider it as completed.
  3. I have a lot of intermittent about:blank navigations, although that might be because I was missing the fix from Bug 1747359. Currently rebasing, we'll see if the situation improves.
  4. navigations in driver.js always operate on the top Browsing Context, but navigate.js will watch navigation on the current browsing context. When the current browsing context is an iframe, the behavior is really surprising IMO. The navigation completes because the Frame's browsing context is unloaded and we do listen for that, but I am not sure we are really monitoring the actual full navigation of the top BC. In any case here since we use the webprogresslistener, I think we should always use the top BC
  5. (edit) overall there seems to be an issue with navigations to pages with inline script tags which open a dialog, eg inline("<script>window.{}('Hello');</script>".format(dialog_type)). Usually, this navigation should end when the common-dialog-loaded event is received, meaning the consumer can expect to interact with the dialog. However with the webprogresslistener, we sometimes get the STATE_STOP right before before the dialog was opened, and the consumer fails to find the dialog

I have a few try pushes in progress which are not too depressingly orange (eg https://treeherder.mozilla.org/jobs?repo=try&revision=ea910686f89a1a17ac891eca713b3d268ac41523), but there's still a lot of work here.

(In reply to Julian Descottes [:jdescottes] from comment #8)

  1. PageLoadStrategy.Eager: onLocationChange is mandatory if we want to replicate the "eager" page load strategy. Otherwise for onStateChange, STATE_START is too early, STATE_STOP is too late. It doesn't seem like webprogresslistener is a good fit for this

That is interesting. Is there no equivalent notification available as when DOMContentLoaded is fired? I assume the location change notification might not enough or is the DOM loaded when it fires?

  1. detecting about:*error redirections is not straightforward. If you attempt to navigate to "thisprotocoldoesnotexist://", you will reach a STATE_STOP for this URL before you are redirected to about:neterror. DevTools detects such bad requests by checking status != Cr.NS_OK && status != Cr.NS_BINDING_ABORTED and then waits for the next DOMContentLoaded since in DevTools case this runs in the content process (searchfox). Here instead I defer to the next onLocationChange, which should be about:neterror. However, there are also seemingly valid requests which match status != Cr.NS_OK && status != Cr.NS_BINDING_ABORTED. I had the case for a navigation between an inline image and a regular image (searchfox). So for now I simply wait for some time but if no onLocationChange resolves the navigation, I consider it as completed.

Mind sharing a log with BCWebProgress:5 for such an example? Similar to my comment above I don't feel that this is actually stable enough. Maybe we should talk to some DOM folks?

  1. navigations in driver.js always operate on the top Browsing Context, but navigate.js will watch navigation on the current browsing context. When the current browsing context is an iframe, the behavior is really surprising IMO. The navigation completes because the Frame's browsing context is unloaded and we do listen for that, but I am not sure we are really monitoring the actual full navigation of the top BC. In any case here since we use the webprogresslistener, I think we should always use the top BC

Yes, per WebDriver spec a real navigation can only be triggered on the top-browsing context. This will be different for BiDi. But there are cases like click (navigation or even unload), forward and backward that trigger a navigation only in a sub frame. When using the WebProgress listener we have to keep all these tests working. See also test_click.py.

  1. (edit) overall there seems to be an issue with navigations to pages with inline script tags which open a dialog, eg inline("<script>window.{}('Hello');</script>".format(dialog_type)). Usually, this navigation should end when the common-dialog-loaded event is received, meaning the consumer can expect to interact with the dialog. However with the webprogresslistener, we sometimes get the STATE_STOP right before before the dialog was opened, and the consumer fails to find the dialog

That's annoying and should not happen. That's a case we should clarify with the DOM folks.

I have a few try pushes in progress which are not too depressingly orange (eg https://treeherder.mozilla.org/jobs?repo=try&revision=ea910686f89a1a17ac891eca713b3d268ac41523), but there's still a lot of work here.

That's great to see! And thanks for the update.

Thanks for the reminder about BCWebProgress:5.

Adding more logs. All Trace logs prefixed with "@@@@@" are my debug traces. I log: all events captured by our MarionetteEventsChild actor and calls to my progress listener onLocationChange/onStateChange. For onStateChange, I filter out anything but STATE_STOP.

Below is eager evaluation while fetching a slow resource. Note that in an earlier attempt I tried to resolve on STATE_START from onStateChange, but it was too early as we will see here. In order to create this log I also let the webprogress listener run until STATE_STOP even though for this test case I would normally have resolved in onLocationChange. I did that to get a complete overview of the WebProgress updates.

### This is the navigate command
1642670020044 Marionette  DEBUG 3 -> [0,2,"WebDriver:Navigate",{"url":"http://127.0.0.1:50671/slow_resource.html"}]

### Now we are unloading 
1642670020046 Marionette  TRACE @@@@@ MarionetteEventsChild:PageLoadEvent type:beforeunload, uri:about:blank
[Parent 12948: Main Thread]: I/BCWebProgress OnStateChange({isTopLevel:1, isLoadingDocument:0}, {URI:http://127.0.0.1:50671/slow_resource.html, originalURI:http://127.0.0.1:50671/slow_resource.html}, STATE_START|STATE_IS_REQUEST|STATE_IS_DOCUMENT|STATE_IS_NETWORK|STATE_IS_WINDOW, NS_OK) on {top:1, id:2a, url:about:blank}
1642670020049 RemoteAgent TRACE [42] Web progress state=start: about:blank
[Parent 12948: Main Thread]: I/BCWebProgress OnStatusChange({isTopLevel:1, isLoadingDocument:1}, {URI:http://127.0.0.1:50671/slow_resource.html, originalURI:http://127.0.0.1:50671/slow_resource.html}, NS_ERROR_MALFORMED_URI, "127.0.0.1") on {top:1, id:2a, url:about:blank}
1642670020119 Marionette  TRACE Remoteness change detected. Set new top-level browsing context to 42
1642670020119 Marionette  TRACE @@@@@ MarionetteEventsChild:PageLoadEvent type:pagehide, uri:about:blank
1642670020119 Marionette  TRACE @@@@@ MarionetteEventsChild:PageLoadEvent type:unload, uri:about:blank
_RegisterApplication(), FAILED TO establish the default connection to the WindowServer, _CGSDefaultConnection() is NULL.
1642670020192 Marionette  TRACE @@@@@ MarionetteEventsChild:PageLoadEvent type:beforeunload, uri:about:blank

### Moving on to STATE_START
[Parent 12948: Main Thread]: I/BCWebProgress OnStateChange({isTopLevel:1, isLoadingDocument:1}, {URI:http://127.0.0.1:50671/slow_resource.html, originalURI:http://127.0.0.1:50671/slow_resource.html}, STATE_START|STATE_IS_REQUEST|STATE_IS_DOCUMENT|STATE_IS_NETWORK|STATE_IS_WINDOW, NS_OK) on {top:1, id:2a, url:about:blank}
[Parent 12948: Main Thread]: I/BCWebProgress OnStatusChange({isTopLevel:1, isLoadingDocument:1}, {URI:http://127.0.0.1:50671/slow_resource.html, originalURI:http://127.0.0.1:50671/slow_resource.html}, NS_NET_STATUS_RECEIVING_FROM, "127.0.0.1") on {top:1, id:2a, url:about:blank}

### Note that we are now unloading about:blank again, probably the initial one
1642670020194 Marionette  TRACE @@@@@ MarionetteEventsChild:PageLoadEvent type:pagehide, uri:about:blank
1642670020194 Marionette  TRACE @@@@@ MarionetteEventsChild:PageLoadEvent type:unload, uri:about:blank

### The following OnProgressChange is mostl likely turning into onLocationChange
[Parent 12948: Main Thread]: I/BCWebProgress OnProgressChange({isTopLevel:1, isLoadingDocument:1}, {URI:http://127.0.0.1:50671/slow_resource.html, originalURI:http://127.0.0.1:50671/slow_resource.html}, http://127.0.0.1:50671/slow_resource.html, 0) on {top:1, id:2a, url:http://127.0.0.1:50671/slow_resource.html}
1642670020196 RemoteAgent TRACE @@@@@ onLocationChange documentURI http://127.0.0.1:50671/slow_resource.html

### Now, 6ms after the onLocationChange triggered in the parent process, we get DOMContentLoaded 
### in the child actor, which is normally what we wait for in the current navigation logic. 
1642670020202 Marionette  TRACE @@@@@ MarionetteEventsChild:PageLoadEvent type:DOMContentLoaded, uri:http://127.0.0.1:50671/slow_resource.html

### I didn't add timestamps here, but the following On*Change are all triggered close to DOMContentLoaded
[Parent 12948: Main Thread]: I/BCWebProgress OnSecurityChange(<null>, <null>, 4) on {top:1, id:2a, url:http://127.0.0.1:50671/slow_resource.html}
[Parent 12948: Main Thread]: I/BCWebProgress OnStatusChange(<null>, <null>, NS_OK, "Transferring data from 127.0.0.1…") on {top:1, id:2a, url:http://127.0.0.1:50671/slow_resource.html}
[Parent 12948: Main Thread]: I/BCWebProgress OnProgressChange(<null>, <null>, 0, 0, 511, 511) on {top:1, id:2a, url:http://127.0.0.1:50671/slow_resource.html}
[Parent 12948: Main Thread]: I/BCWebProgress OnStatusChange(<null>, <null>, NS_OK, "Waiting for 127.0.0.1…") on {top:1, id:2a, url:http://127.0.0.1:50671/slow_resource.html}
_RegisterApplication(), FAILED TO establish the default connection to the WindowServer, _CGSDefaultConnection() is NULL.

### And the resource finally is loaded (and document state is complete)
[Parent 12948: Main Thread]: I/BCWebProgress OnStatusChange(<null>, <null>, NS_OK, "Transferring data from 127.0.0.1…") on {top:1, id:2a, url:http://127.0.0.1:50671/slow_resource.html}
1642670024226 Marionette  TRACE @@@@@ MarionetteEventsChild:PageLoadEvent type:pageshow, uri:http://127.0.0.1:50671/slow_resource.html
[Parent 12948: Main Thread]: I/BCWebProgress OnStateChange({isTopLevel:1, isLoadingDocument:1}, {URI:http://127.0.0.1:50671/slow_resource.html, originalURI:http://127.0.0.1:50671/slow_resource.html}, STATE_STOP|STATE_IS_NETWORK|STATE_IS_WINDOW, NS_OK) on {top:1, id:2a, url:http://127.0.0.1:50671/slow_resource.html}
1642670024227 RemoteAgent TRACE [42] Web progress state=stop: http://127.0.0.1:50671/slow_resource.html
1642670024227 RemoteAgent TRACE @@@@@ onStateChange documentURI http://127.0.0.1:50671/slow_resource.html

From this, onLocationChange, seems slightly too early. But I'm not sure I see anything specific enough in any of the onProgressChange/onStatusChange to use them as clear markers of "DOMContentLoaded" has happened.

Quick overview of the other issues

detecting about:*error redirections is not straightforward. If you attempt to navigate to "thisprotocoldoesnotexist://", you will reach a STATE_STOP for this URL before you are redirected to about:neterror. DevTools detects such bad requests by checking status != Cr.NS_OK && status != Cr.NS_BINDING_ABORTED ...

For this topic we probably have to know which error status imply that a redirection to an error page is about to happen. For the "thisprotocoldoesnotexist://" use case, we get NS_ERROR_UNKNOWN_PROTOCOL, while for the image navigation we get NS_ERROR_PARSED_DATA_CACHED. So we probably need to also check against NS_ERROR_PARSED_DATA_CACHED before we consider the request as incomplete.

I have a lot of intermittent about:blank navigations, although that might be because I was missing the fix from Bug 1747359. Currently rebasing, we'll see if the situation improves.

This improved significantly after rebasing, so hopefully this was just linked to the missing patch from Bug 1747359

overall there seems to be an issue with navigations to pages with inline script tags which open a dialog, eg inline("<script>window.{}('Hello');</script>".format(dialog_type))

Allowing for a small delay after STATE_STOP seems to fix the intermittent issues. I am pretty sure that STATE_STOP is only received after the script has fully executed. But on some platforms it seems that we receive STATE_STOP after the script was executed, but before the alert is open and ready to be used.

Hi Olli!

We are trying to use WebProgressListener to monitor navigations in Marionette. Marionette supports several page load strategies and one of them is "Eager" where we are supposed to resolve as soon as the document is "interactive". This comes after WebProgressListener onStateChange reaches STATE_START, but before STATE_STOP. So far, I ended up using onLocationChange, which seems to be triggered around the same time as the DOMContentLoaded, but this might just be by accident. See comment #10 for more details.

Is WebProgressListener a good fit to monitor a navigation and detect when the document becomes "interactive"? Should we try to use the other callbacks (onStatusChange, onProgressChange)?

Flags: needinfo?(bugs)

For the issue with dialogs opened from inline script (inline("<script>window.{}('Hello');</script>".format(dialog_type))), my analysis was wrong. The test where this seems to frequently fail is

    session.url = inline("<script>window.{}('Hello');</script>".format(dialog_type))
    session.alert.accept()

    session.url = inline("<script>window.{}('Hello');</script>".format(dialog_type))
    response = send_alert_text(session, "Federer")
    assert_error(response, "element not interactable")

And the issue is with the second inline call. We seem to get the STATE_STOP from the previous navigation from time to time. I think I can workaround this by only accepting a STATE_STOP after a STATE_START.

(In reply to Julian Descottes [:jdescottes] from comment #13)

And the issue is with the second inline call. We seem to get the STATE_STOP from the previous navigation from time to time. I think I can workaround this by only accepting a STATE_STOP after a STATE_START.

Interesting! This would also be a good addition for the existing ProgressListener class here (which you may already have updated/used):
https://searchfox.org/mozilla-central/rev/c1598f6d3edad19ccc53f53ab045d1d29835e1dd/remote/shared/Navigate.jsm#105-111

I have a relatively green try for the remote-protocol preset: https://treeherder.mozilla.org/jobs?repo=try&revision=8551d16d9859e4a5575169afc86f315b323e587d

The main failures remaining are from wpt tests, and can be reproduced locally. The issue seems related to the PageLoadStrategy eager. As hinted earlier, the onLocationChange callback is triggered slightly before the DOMContentLoaded event we use in the current implementation, and this leads to errors such as window.__wptrunner_process_next_event. Waiting slightly before resolving seems to fix the issue, but at this point we should wait for Olli's feedback.

Interesting! This would also be a good addition for the existing ProgressListener class here (which you may already have updated/used):

Thanks, and yes I plan to reuse it in the final version! I am using it a bit at the moment, but since my patch is not very stable I prefer to keep most of the code independent.

See Also: → 1751856

(In reply to Henrik Skupin (:whimboo) [⌚️UTC+1] from comment #14)

Interesting! This would also be a good addition for the existing ProgressListener class here (which you may already have updated/used):
https://searchfox.org/mozilla-central/rev/c1598f6d3edad19ccc53f53ab045d1d29835e1dd/remote/shared/Navigate.jsm#105-111

Note that for my patch, it's fine to expect a STATE_START first, because in theory the navigation has not started until I trigger the callback passed to waitForNavigationCompleted. For the initial load, we can't be sure that we are starting to listen before some navigation started. So we should be careful if trying to apply the same pattern elsewhere.

Technically DOMContentLoaded doesn't map exactly to anything in WebProgressListener, nor readyState changes.
This is basically because of https://searchfox.org/mozilla-central/rev/02bd3e02b72d431f2c600a7328697a521f87d9b6/dom/base/Document.cpp#8074
At that point readyState is expected to be already interactive and see the callers why DOMContentLoaded may be delayed (ScriptLoader dealing with deferred scripts).

ReadyState gets value interactive when the page has been created, basically all of its DOM
https://searchfox.org/mozilla-central/rev/02bd3e02b72d431f2c600a7328697a521f87d9b6/dom/base/nsContentSink.cpp#790
https://searchfox.org/mozilla-central/rev/02bd3e02b72d431f2c600a7328697a521f87d9b6/parser/html/nsHtml5TreeOpExecutor.cpp#165,167,178

WebProgressListener doesn't really know about interactive state, since the state depends on HTML parser passing treeOps from the parser thread to the main thread and then creating the DOM.

Flags: needinfo?(bugs)

So if you must do something exactly when readyState changes to "interactive", a capturing event listener for readystatechange is possibly the only option. We could add something new, if event listeners are difficult to use in this case.

Thanks for the feedback! This means that if we want to use webprogresslistener, we still needto listen to readystatechange in the content process. I will try to adapt my current patch to handle this properly.

Depends on D138900

Comment on attachment 9264170 [details]
Bug 1664165 - [remote] Emit pageload internal events for existing WindowGlobalMessageHandlers

Revision D138900 was moved to bug 1756809. Setting attachment 9264170 [details] to obsolete.

Attachment #9264170 - Attachment is obsolete: true

Comment on attachment 9264279 [details]
Bug 1664165 - [remote] Add basic navigation helper to RootMessageHandler

Revision D138958 was moved to bug 1756809. Setting attachment 9264279 [details] to obsolete.

Attachment #9264279 - Attachment is obsolete: true

Unassigning for now this no longer blocks m3, we should now focus on bug 1756809 to get an initial navigation helper implemented.

Assignee: jdescottes → nobody
Status: ASSIGNED → NEW
Priority: P2 → P3
No longer blocks: 1730642
Summary: Make use of a webprogress listener instead of page load events for tracking active navigations → Make use of a webprogress listener instead of page load events for tracking active navigations in Marionette
Whiteboard: [bidi-m3-mvp] → [webdriver:backlog]
Product: Testing → Remote Protocol

We have made good experience with the WebProgress listener for tracking a navigation. As such we probably also want to let Marionette use the same logic so that we can get rid of listening / checking load events. As seen in some cases we might miss those especially when a remoteness change takes place.

I wonder if we should prioritize this work a bit higher to make certain navigation related APIs more stable in Marionette.

Whiteboard: [webdriver:backlog] → [webdriver:backlog][webdriver:triage]

Maybe we should not share with the BiDi implementation immediately because we need additional checks like modal dialogs for Marionette. As such lets use the WebProgressListener as next step to make navigation behavior more stable.

Points: 13 → 5
Priority: P3 → P2
Whiteboard: [webdriver:backlog][webdriver:triage] → [webdriver:backlog]
No longer blocks: 1760291

Given that amount of different failures during navigation with Marionette I put a stab on that and converted the code to make use of the webprogress listener from the shared navigation module. It works pretty fine locally. Here a full try build for verification:

https://treeherder.mozilla.org/jobs?repo=try&revision=6ad8cb259f8061599a74218fd3825d35bdea0b78

Assignee: nobody → hskupin
Status: NEW → ASSIGNED

I forgot to actually post an update on December 29th. I was able to fix the listed Mn failures from the last try push, but saw that wpt tests are basically perma-failing now with this change:

https://treeherder.mozilla.org/jobs?repo=try&revision=29047464b77539131dfd95c6e6a2e3c2e915b94a

I've to take a look at this problem next. Here a new try build with Marionette trace logging enabled:
https://treeherder.mozilla.org/jobs?repo=try&revision=7166516525b3c8d3da40a943a626681363290adc

Depends on: 1878690

Depends on D204027

While working on bug 1522790 I noticed that especially for ASAN builds the current navigation logic in Marionette is more than 3x slower when Fission is enabled as when it it disabled. Here an example:

  1. Fission disabled (try build - 210ms)
[task 2024-05-02T15:08:20.826Z] 15:08:20     INFO - PID 2325 | 1714662500825	Marionette	DEBUG	2 -> [0,7155,"WebDriver:Navigate",{"url":"http://web-platform.test:8000/html/canvas/element/path-objects/2d.path.roundrect.radius.intersecting.1.html"}]
[task 2024-05-02T15:08:20.834Z] 15:08:20     INFO - PID 2325 | 1714662500833	Marionette	TRACE	[1202] Received event beforeunload for about:blank
[task 2024-05-02T15:08:20.887Z] 15:08:20     INFO - PID 2325 | 1714662500886	Marionette	TRACE	Remoteness change detected. Set new top-level browsing context to 1203
[task 2024-05-02T15:08:20.915Z] 15:08:20     INFO - PID 2325 | 1714662500910	Marionette	TRACE	[1203] Received event beforeunload for about:blank
[task 2024-05-02T15:08:20.919Z] 15:08:20     INFO - PID 2325 | 1714662500916	Marionette	TRACE	[1203] Received event pagehide for about:blank
[task 2024-05-02T15:08:21.037Z] 15:08:21     INFO - PID 2325 | 1714662501035	Marionette	TRACE	[1203] Received event DOMContentLoaded for http://web-platform.test:8000/html/canvas/element/path-objects/2d.path.roundrect.radius.intersecting.1.html
[task 2024-05-02T15:08:21.037Z] 15:08:21     INFO - PID 2325 | 1714662501036	Marionette	DEBUG	2 <- [1,7155,null,{"value":null}]
  1. Fission enabled (try build - 860ms)
[task 2024-05-02T15:31:55.189Z] 15:31:55     INFO - PID 996 | 1714663915187	Marionette	DEBUG	2 -> [0,8376,"WebDriver:Navigate",{"url":"http://web-platform.test:8000/html/canvas/element/path-objects/2d.path.roundrect.radius.intersecting.1.html"}]
[task 2024-05-02T15:31:55.195Z] 15:31:55     INFO - PID 996 | 1714663915194	Marionette	TRACE	[1462] Received event beforeunload for about:blank
[task 2024-05-02T15:31:55.269Z] 15:31:55     INFO - PID 996 | 1714663915267	Marionette	TRACE	Remoteness change detected. Set new top-level browsing context to 1463
[task 2024-05-02T15:31:55.347Z] 15:31:55     INFO - PID 996 | 1714663915346	Marionette	TRACE	[1463] Received event beforeunload for about:blank
[task 2024-05-02T15:31:55.373Z] 15:31:55     INFO - PID 996 | 1714663915372	Marionette	TRACE	[1463] Received event pagehide for about:blank
[task 2024-05-02T15:31:56.041Z] 15:31:56     INFO - PID 996 | 1714663916040	Marionette	TRACE	[1463] Received event DOMContentLoaded for http://web-platform.test:8000/html/canvas/element/path-objects/2d.path.roundrect.radius.intersecting.1.html
[task 2024-05-02T15:31:56.046Z] 15:31:56     INFO - PID 996 | 1714663916041	Marionette	DEBUG	2 <- [1,8376,null,{"value":null}]

While the pagehide event comes right after the DOMContentLoaded event has this extreme delay. It seems to me very unlikely that this is related to the IPC that our current implementation uses to wait for this event from the content process. I might probably do another try build with the BCWebProgress:5 logging enabled.

Also I may try to actually get this patch landed so that we can fully stay in the parent process and not have to rely on IPC. Lets see if I can get the wpt failures fixed (if they are still present).

I pushed a try build to check the current state of wpt tests:
https://treeherder.mozilla.org/jobs?repo=try&revision=5d88d9f99cb9adb45342ff7c53fca426d5bf2f0c

Tests are still failing. After some investigation I noticed that we basically miss a OnStateChange notification for the real document load when there is already a document with about:blank loading. As such we do not detect the start of the navigation and as result do not stop when state is stop.

I've found the following condition in the docShell code which seems to be the culprit:
https://searchfox.org/mozilla-central/rev/386c7f17b2421374930a447c9f424910551b659d/docshell/base/BrowsingContextWebProgress.cpp#221-227

Here we return early without sending the notification when the document is already loading. But it does not check for the actual URL that gets loaded. That means if we have a situation when the initial about:blank is loaded and there is some kind of delay which bleeds into a navigation as triggered by Marionette, we are not getting a OnStateChange notification for start state.

I'll run a full build locally to check that and will also make the following log call non-debug:
https://searchfox.org/mozilla-central/rev/386c7f17b2421374930a447c9f424910551b659d/uriloader/base/nsDocLoader.cpp#1344-1351

Bug 1673059 shouldn't block us in getting these changes landed.

No longer depends on: 1673059
See Also: → 1895283

Indeed, we return early if there is already an active document load even through the URL is different. But also commenting out the mIsLoadingDocument condition and not returning early doesn't cause the notification to be sent. Maybe the DocLoader:5 logs might be helpful here. But I have to do a full build on my old Linux machine to actually test this.

Beside waiting for the build I also tried to reproduce this locally on MacOS and I stumbled over bug 1895283 which might be related or actually has the same underlying reason. Instead of the expected target page the former navigation is continued, which means that we never actually reach the target page.

There is actually an interesting situation here: https://share.firefox.dev/3Vs1JjB

While the start state is forwarded from the web content process to the parent process and the BCWebProgress also logs it at 78,065s the onStateChange callback of our registered WebProgressListener isn't called. As such we do not recognize that the actual navigation started and miss to set the internal flag to appropriately handle the state change stop at 78,221s, which results waiting forever.

Given that the issue happens right after a remoteness change I wonder if the browsing context check in that following code causes an early return:
https://searchfox.org/mozilla-central/rev/dc08a7321a22918d5a26b7641f9b10cd2a09d98e/dom/ipc/BrowserParent.cpp#2913-2952

I'll have to rebuild locally on my older Linux machine which will take a bit.

I'm trying to understand the situation here and how it is different than the situation on desktop re: onStateChange. On matrix :whimboo mentioned that desktop also doesn't send the STATE_START for the main load, only sending it for the initial about:blank load, but it sounded like the listener is being installed earlier compared to on Android.

I was also curious whether we'd see the STATE_START for the second load if the WPT code called stop() before starting the test page's load to halt any existing about:blank navigation.

Given that I'm off the next couple of days I cannot check if that call to window.stop() would help before wptrunner loads the actual test page. For now I've rebased my patch (to fix merge conflicts) and also pushed to try:

https://treeherder.mozilla.org/jobs?repo=try&revision=76fe21c16d8633d13c986cb2c48b95dd10436f4a

I only triggered a couple of web-platform-tests chunks so I hope that the failure will be visible without the above proposal. Then these patches can be used to test the above scenario. Julian, I would appreciate if you could help out here. Thanks!

Flags: needinfo?(jdescottes)

I am unsure if it'd help, but I'd be curious to know if the changes from https://hg.mozilla.org/try/rev/8492a2e199e4540a74256c6606c742990533981e will help this particular situation at all. My theory with this is that we're waiting too long to stop the previous navigation when a load is triggered from within the parent process, so we don't get the STATE_STOP from the previous load leading to this issue.

Sure. I quickly applied your patch and send it to try as full build:
https://treeherder.mozilla.org/jobs?repo=try&revision=be048a2ac26fdace1a00f7701c795669c376a1ab

Oh, I accidentally triggered a try build via artifact again because I was using mach try again. Here a new one that should really build Firefox with Nika's patch included:

https://treeherder.mozilla.org/jobs?repo=try&revision=1850d6f13a30a8e1b69c77c5b934566a464b5070

I don't see a clear change or improvement between the run with and without Nika's patch.
I pushed to try with the suggestion to call window.stop() before loading the test URL at https://treeherder.mozilla.org/jobs?repo=try&revision=e48a1c046c68cb2dc494af0194a86d36dc3d4b76

This is still in progress and we have a few interesting failures eg all early hints tests failing at https://treeherder.mozilla.org/jobs?repo=try&revision=e48a1c046c68cb2dc494af0194a86d36dc3d4b76&selectedTaskRun=axHKpc00SpWiVJw__mhxrw.0 , but otherwise I think this is much more green than what we had before.

Flags: needinfo?(jdescottes)

I had a look at the failures for the desktop tests:

  • /loading/early-hints/ tests fail because the navigation errors on NS_BINDING_ABORTED. This is very similar to Bug 1908515, we have the test script which redirects to another page before the load is completed. In terms of BiDi and according to the spec, it's fine to throw here actually. But maybe the navigation helper itself should not throw and should let the consumer decide what to do?
  • /html/semantics/forms/the-input-element/show-picker-disabled-readonly.html -> timeouts because each click waits for 1600 ms for a potential navigation. I imagine the timeout is. Our webprogress-based navigation helper uses a timeout multiplier based on the platform, so while the default is 200ms, for a DEBUG platform we will wait 1600ms per click, which is probably too much.
  • /html/semantics/interactive-elements/the-summary-element/interactive-content.html -> same timeout issue as above
  • /html/cross-origin-embedder-policy/require-corp.https.html -> similar to the NS_BINDING_ABORTED issue, we have a navigation that fails here so the test stops.
Depends on: 1761634
No longer blocks: 1522790
No longer depends on: 1761634
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: