Closed Bug 1832891 Opened 11 months ago Closed 10 months ago

AbortError issues after 200ms navigation timeout "No navigation detected: about:blank" for new tabs and windows

Categories

(Remote Protocol :: WebDriver BiDi, defect, P1)

defect
Points:
3

Tracking

(firefox115 fixed, firefox116 fixed)

RESOLVED FIXED
116 Branch
Tracking Status
firefox115 --- fixed
firefox116 --- fixed

People

(Reporter: jdescottes, Assigned: jdescottes)

References

(Blocks 1 open bug)

Details

(Whiteboard: [webdriver:m7][webdriver:relnote])

Attachments

(1 file, 1 obsolete file)

We frequently have issues with the WebDriverBiDi command browsingContext.create followed by a script.evaluate or script.callFunction command.

If browsingContext.create fails to wait for the initial navigation because of the navigation timeout, we might start sending a command to the temporary document, which will very soon be destroyed.

This usually leads to errors such as:

"AbortError: Actor 'MessageHandlerFrame' destroyed before query 'MessageHandlerFrameParent:sendCommand' was resolved"

And if you have trace logs enabled, you can (most likely) confirm that this is due to a navigation timeout by looking for logs (needs Trace level) such as:

RemoteAgent TRACE [31] ProgressListener Setting unload timer (200ms)
RemoteAgent TRACE [31] ProgressListener No navigation detected: about:blank
RemoteAgent TRACE [31] ProgressListener Stop: has error=false

We already increase the unload timer for our debug or slow platforms (see https://searchfox.org/mozilla-central/rev/815764d6f304cd69b60075e83b06320464a66223/remote/shared/Navigate.sys.mjs#24-40).

This bug might be a meta, to track all the reports linked to this root cause and track improvements.

Potential improvements here:

  • add better logging to help the investigation
  • make the timeout duration configurable
Depends on: 1832778
Summary: AbortError issues after initial navigation timeout "No navigation detected: about:blank" → AbortError issues after 200ms navigation timeout "No navigation detected: about:blank"
Blocks: 1832778
No longer depends on: 1832778

Given the results from bug 1816538 it would be good to see what happens when we increase the unload timer from 200ms to a larger number. It would be suspicious if the delays in navigation would also increase which means that something might be blocking.

I've created a try build which increases the timeout to 400ms and enables BCWebProgress:5 logging as well:
https://treeherder.mozilla.org/jobs?repo=try&revision=fa24a1f5b405110a22b392b6f6494ae5fbf0d065

Assignee: nobody → jdescottes
Status: NEW → ASSIGNED

For the record, we can't really reproduce the failure from Bug 1816538 on try at the moment, so it's hard to say if the patch really fixes this particular case. I did a try push without the fix, and no longer managed to reproduce the issue so I imagine we are using "fixed" windows workers now? Or maybe that's random?

In any case, I think this patch should be an improvement overall (as long as it doesn't lead to other regressions) but let's wait until the worker situation is clearer to see if we have a way to verify the fix properly.

I've asked for clarification on bug 1816538. Lets wait and see what the status is for try and other branches.

Severity: -- → S3
Points: --- → 3
Priority: -- → P1
Whiteboard: [webdriver:m7]

We mentioned that we would like to have a "safety" timeout to avoid blocking tests or scripts which might miss the navigation event.

One option is to start implementing a timeout option for the ProgressListener. That would typically be also useful for https://github.com/w3c/webdriver-bidi/issues/188. Compared to the unloadTimeout, the timeout could occur regardless of where we are in the navigation, as long as the delay is reached. We should also try to cancel the navigation.

Since waitForInitialNavigation should only ever have to wait for an about:blank load (and not a custom URL which might be slow to answer for good reasons), it sounds acceptable to come up with a hardcoded timeout value, potentially with a platform specific multiplier applied.

Another option would be to keep relying on the current unloadTimeout mechanism, but use a much higher value when called from waitForInitialNavigation. We would use that instead of expectNavigation.

Henrik: Should we go for the first option here and add a timeout option to ProgressListener + attempt to cancel the navigation? If yes we should split that in a separate bug, in case of backouts. Otherwise the second option is much simpler. Let me know what you think.

Flags: needinfo?(hskupin)

(In reply to Julian Descottes [:jdescottes] from comment #5)

Since waitForInitialNavigation should only ever have to wait for an about:blank load (and not a custom URL which might be slow to answer for good reasons), it sounds acceptable to come up with a hardcoded timeout value, potentially with a platform specific multiplier applied.

What about cases like window.open("https://example.org")? Here we open a new window (and tab) which might load the temporary about:blank first, but then loads immediately https://example.org. So I don't think that we can always assume we only load about:blank. Or do I miss something?

Another option would be to keep relying on the current unloadTimeout mechanism, but use a much higher value when called from waitForInitialNavigation. We would use that instead of expectNavigation.

This might be another good alternative which then only affects the case around unloading. We would still have the problem when the page is not finishing the load, but that could be handled by a different bug.

Henrik: Should we go for the first option here and add a timeout option to ProgressListener + attempt to cancel the navigation? If yes we should split that in a separate bug, in case of backouts. Otherwise the second option is much simpler. Let me know what you think.

I like the 2nd option more as well. So maybe lets try that one first.

Canceling a possible navigation might still be a question that should be discussed. It's not clear if that should be up to the client or the remote end to handle it.

Flags: needinfo?(hskupin)

(In reply to Henrik Skupin [:whimboo][⌚️UTC+1] from comment #6)

(In reply to Julian Descottes [:jdescottes] from comment #5)

Since waitForInitialNavigation should only ever have to wait for an about:blank load (and not a custom URL which might be slow to answer for good reasons), it sounds acceptable to come up with a hardcoded timeout value, potentially with a platform specific multiplier applied.

What about cases like window.open("https://example.org")? Here we open a new window (and tab) which might load the temporary about:blank first, but then loads immediately https://example.org. So I don't think that we can always assume we only load about:blank. Or do I miss something?

At the moment, I think we only use waitForInitialNavigation when we open new tabs/windows without providing a URL. Apart from tests, there are 3 call sites I think:

  • browsingContext.create from BiDi
  • newSession from marionette
  • newWindow from marionette

In all three cases, I don't think we can provide a URL, so the commands should return only with an about:blank load (might be missing something for marionette).

In the general case where a script opens a window, using eg window.open, in BiDi we would only rely on events to monitor that. Not sure if classic/marionette monitors this navigation in some other way? But at least it doesn't seem to rely on waitForInitialNavigation.

Another option would be to keep relying on the current unloadTimeout mechanism, but use a much higher value when called from waitForInitialNavigation. We would use that instead of expectNavigation.

This might be another good alternative which then only affects the case around unloading. We would still have the problem when the page is not finishing the load, but that could be handled by a different bug.

Henrik: Should we go for the first option here and add a timeout option to ProgressListener + attempt to cancel the navigation? If yes we should split that in a separate bug, in case of backouts. Otherwise the second option is much simpler. Let me know what you think.

I like the 2nd option more as well. So maybe lets try that one first.

Canceling a possible navigation might still be a question that should be discussed. It's not clear if that should be up to the client or the remote end to handle it.

Ok, that sounds good to me as well. I had started a prototype for handling timeout + cancelling requests, I might just file a bug and have the patch around in case it's useful to support a future timeout parameter.

Thanks for the feedback.

(In reply to Julian Descottes [:jdescottes] from comment #7)

At the moment, I think we only use waitForInitialNavigation when we open new tabs/windows without providing a URL. Apart from tests, there are 3 call sites I think:

  • browsingContext.create from BiDi
  • newSession from marionette
  • newWindow from marionette

In all three cases, I don't think we can provide a URL, so the commands should return only with an about:blank load (might be missing something for marionette).

In the general case where a script opens a window, using eg window.open, in BiDi we would only rely on events to monitor that. Not sure if classic/marionette monitors this navigation in some other way? But at least it doesn't seem to rely on waitForInitialNavigation.

Fair point. So it can happen in Marionette when executing a script or eg clicking an element. But then we still have selected the current browsing context so that we do not care about the newly created one. But once we switch to the new Tab we should basically do that but I cannot find such a call at the moment.

Ok, that sounds good to me as well. I had started a prototype for handling timeout + cancelling requests, I might just file a bug and have the patch around in case it's useful to support a future timeout parameter.

Sounds good!

See Also: → 1835270
Blocks: 1835369
No longer blocks: 1816538
Depends on: 1816538

On bug 1816538 the unload timeout has been increased to 5000ms. Lets see how this works and if we may still need this patch or not.

Attachment #9335462 - Attachment is obsolete: true

(In reply to Henrik Skupin [:whimboo][⌚️UTC+1] from comment #9)

On bug 1816538 the unload timeout has been increased to 5000ms. Lets see how this works and if we may still need this patch or not.

We had to backout the patch from bug 1816538. We had several issues:

  • the xpcshell test test_Navigate.js started taking too much time on ccov -> bug 1835796
  • many timeouts on Android Debug -> Bug 1835819
  • timeouts on seemingly all wdspec tests in Beta simulation -> Bug 1835822

While the first one is not too concerning on its own, the 2 other issues mean that it's quite frequent to call waitForInitialNavigation in cases where we are missing the first navigation. Which means that we can't consistently expect to see a navigation in situations where we are using this API.

Potential next steps

  • increase the timeout more conservatively (we used 5000ms, we could stay closer to the current 200ms)
  • only apply this to BiDi's browsingContext.create calls
  • expose a preference to drive the timeout (this would help for instance with testing this for puppeteer)
Blocks: 1816538
No longer depends on: 1816538

I had a quick look at the beta sim test failures and I actually cannot find any instance where we actually timing out or delaying the wait for the initial page to be loaded. Something else might have added a small constant delay which triggered that many test failures. Not sure how many of our existing tests are already close to the general timeout limit.

Ok for now the beta failures also don't make much sense to me. I also note that there were no failures on the late beta sim, only early beta. I will start some additional pushes ...

Following the investigations on Bug 1835822, the beta failures were not related to this change.

Meaning we mostly have the Android reftest issues to look at. The logs are not super helpful as we don't have the full logs for those failures.
However we can see 40 seconds pauses between INFO - Starting runner and the next INFO - TEST-START (example https://treeherder.mozilla.org/logviewer?job_id=417476550&repo=autoland&lineNumber=11213-11214) so this must be where we hang while waiting for the navigation.

The issue is that on Android, when you open a new tab on about:blank, it keeps "isInitialDocument" to true until you actually manually perform a navigation.

Interestingly it doesn't seem to be the case when we use BiDi's browsingContext.create. I think this is because we explicitely create a tab for the url "about:blank". But when we open a tab using the browser's UI, or when we start the browser and look at the preloaded tabs, they will retain isInitialDocument === true indefinitely.

So this means that checking isInitialDocument is not a consistent way to know if we can expect a navigation to occur soon. I think it means the document will be disposed once we start actually using the tab (even if we end up going to about:blank), but it might not happen anytime soon.

Based on that, one approach could be to only extend the timeout for BiDi's browsingContext.create and for Marionette's newWindow. They both use the same codepath which should explicitly open a tab on "about:blank" and we should leave the initialDocument immediately. We would keep Marionette's newSession using the current 200ms timeout.

Extending the timeout only when we explicitly create a new window or tab should be safe.
On the other hand, initial windows from the browser or new tabs created by the browser UI can remain on the initial document forever, for instance on Android.
So for those cases we keep the current unloadTimeout value.

(In reply to Julian Descottes [:jdescottes] from comment #14)

The issue is that on Android, when you open a new tab on about:blank, it keeps "isInitialDocument" to true until you actually manually perform a navigation.
[..]
So this means that checking isInitialDocument is not a consistent way to know if we can expect a navigation to occur soon. I think it means the document will be disposed once we start actually using the tab (even if we end up going to about:blank), but it might not happen anytime soon.

Based on that, one approach could be to only extend the timeout for BiDi's browsingContext.create and for Marionette's newWindow. They both use the same codepath which should explicitly open a tab on "about:blank" and we should leave the initialDocument immediately. We would keep Marionette's newSession using the current 200ms timeout.

Interesting. So if under those cases the initial document is kept it won't make sense for us to wait longer. Instead I wonder if we should not check for !isInitial but only for the loading state. If a different page will be loaded right after we probably could spin some event loops and wait if another navigation is started? It's really unsatisfying that it's not known if the initial about:blank is kept or not.

(In reply to Henrik Skupin [:whimboo][⌚️UTC+1] from comment #16)

Interesting. So if under those cases the initial document is kept it won't make sense for us to wait longer. Instead I wonder if we should not check for !isInitial but only for the loading state. If a different page will be loaded right after we probably could spin some event loops and wait if another navigation is started? It's really unsatisfying that it's not known if the initial about:blank is kept or not.

This can be an option, but it might be as racy as our current setup and we would first need to collect data on various platforms to see in which state we get the documents usually. And since the issue is rather hard to reproduce on try it might be hard to be confident here.

We would need to know in which state the document is when we currently miss navigations. We can add logs for that, but we already know a few things. In those scenarios, the progress listener does not observe anything for a long period of time (at least longer than the unload timeout). So I suspect the document is not in loading state (otherwise we would see it with the WebProgress logs).

I did the following try push which adds logs in waitForInitialNavigation and reduces the timeout https://treeherder.mozilla.org/jobs?repo=try&selectedTaskRun=Y0csWSzgQ8CTueFSJ4-QeQ.0&revision=6fa3d6a3c91b6399434c576dd589e27189641b5b

The idea is to make the missed navigations much more likely and when they occur, check what was the state of the document.
I haven't parsed everything of course, but sampling a few failures, I've seen that we miss navigations after stumbling on documents which are not loading.

This means it's not just a matter of spinning the event loop a few times, we can have a significant delay between the initial document in non loading state and doing the upcoming navigation.

Duplicate of this bug: 1836476

(In reply to Julian Descottes [:jdescottes] from comment #18)

This means it's not just a matter of spinning the event loop a few times, we can have a significant delay between the initial document in non loading state and doing the upcoming navigation.

Olli, do you have an idea why we can see such long delays? It got worse since early May and it's not clear to us if that is expected or not. Right now we have a 200ms timeout when trying to detect a navigation but as Julian said we will have to increase it quite a lot to not miss an additional navigation after the initial document has been loaded.

Flags: needinfo?(smaug)

No idea. If one can reproduce this locally, profiling this could be useful.
Does the issue happen also on opt builds? debug builds can be very slow.

Flags: needinfo?(smaug)

(In reply to Olli Pettay [:smaug][bugs@pettay.fi] from comment #21)

No idea. If one can reproduce this locally, profiling this could be useful.
Does the issue happen also on opt builds? debug builds can be very slow.

This typically only happens on CI, on platforms which are slower than what we expect.
But we already use different timeouts depending on the platform type (we wait 8 times longer on debug/tsan/asan, 16 times longer on ccov). And we get the issue both on opt and debug builds.

It's very hard to predict those delays, and their duration.

When an issue occurs on a platform, not every new tab is affected. But once in a while (maybe because a GC slows things down) the delay will be much bigger and be bigger than our timeout. And if we start using a very conservative timeout, then it can also be a problem because in some cases (new tab on android) we stay indefinitely on the initial about:blank, meaning we will always hit our timeout (so a huge timeout means test will take a lot more time to run).

Finding a "good" value for the timeout seems hard IMO. After a recent update of the windows worker images, we started getting much more frequent issues on windows. So even if we find a good value which works as of today, it's unlikely to always be a good fit.

Something which would be useful would be a way to know if a document with isInitialDocument set to true is going to navigate or not. Today we don't know for sure so we have to use the timeout. But if we could know for sure when there will be a load, we could decide whether we wait for a navigation or not and completely remove the timeout.

Pushed by jdescottes@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/48770eeaa782
[remote] Set timeout to 5s for waitForInitialNavigation for new contexts r=webdriver-reviewers,whimboo
Blocks: 1796201
Status: ASSIGNED → RESOLVED
Closed: 10 months ago
Resolution: --- → FIXED
Target Milestone: --- → 116 Branch

Comment on attachment 9337040 [details]
Bug 1832891 - [remote] Set timeout to 5s for waitForInitialNavigation for new contexts

Beta/Release Uplift Approval Request

  • User impact if declined: This patch helps reducing intermittent failures in wdspec tests both for webdriver classic and bidi. It has been on Nightly for a few days and hasn't caused any particular issue, so it should be safe to uplift to help with intermittent issues on beta as well.
  • Is this code covered by automated tests?: Yes
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): The patch is extending the delay used for a navigation timeout in a few situations where we are sure to expect a navigation. This only affects automation.
  • String changes made/needed:
  • Is Android affected?: Yes
Attachment #9337040 - Flags: approval-mozilla-beta?

Comment on attachment 9337040 [details]
Bug 1832891 - [remote] Set timeout to 5s for waitForInitialNavigation for new contexts

Approved for 115.0b5.

Attachment #9337040 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

Moving all the left over dependencies to bug 1838381.

Whiteboard: [webdriver:m7] → [webdriver:m7][webdriver:relnote]
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: