AbortError issues after 200ms navigation timeout "No navigation detected: about:blank" for new tabs and windows
Categories
(Remote Protocol :: WebDriver BiDi, defect, P1)
Tracking
(firefox115 fixed, firefox116 fixed)
People
(Reporter: jdescottes, Assigned: jdescottes)
References
(Blocks 1 open bug)
Details
(Whiteboard: [webdriver:m7][webdriver:relnote])
Attachments
(1 file, 1 obsolete file)
48 bytes,
text/x-phabricator-request
|
dmeehan
:
approval-mozilla-beta+
|
Details | Review |
We frequently have issues with the WebDriverBiDi command browsingContext.create
followed by a script.evaluate
or script.callFunction
command.
If browsingContext.create
fails to wait for the initial navigation because of the navigation timeout, we might start sending a command to the temporary document, which will very soon be destroyed.
This usually leads to errors such as:
"AbortError: Actor 'MessageHandlerFrame' destroyed before query 'MessageHandlerFrameParent:sendCommand' was resolved"
And if you have trace logs enabled, you can (most likely) confirm that this is due to a navigation timeout by looking for logs (needs Trace level) such as:
RemoteAgent TRACE [31] ProgressListener Setting unload timer (200ms)
RemoteAgent TRACE [31] ProgressListener No navigation detected: about:blank
RemoteAgent TRACE [31] ProgressListener Stop: has error=false
We already increase the unload timer for our debug or slow platforms (see https://searchfox.org/mozilla-central/rev/815764d6f304cd69b60075e83b06320464a66223/remote/shared/Navigate.sys.mjs#24-40).
This bug might be a meta, to track all the reports linked to this root cause and track improvements.
Potential improvements here:
- add better logging to help the investigation
- make the timeout duration configurable
Assignee | ||
Updated•1 year ago
|
Updated•1 year ago
|
Comment 1•1 year ago
|
||
Given the results from bug 1816538 it would be good to see what happens when we increase the unload timer from 200ms to a larger number. It would be suspicious if the delays in navigation would also increase which means that something might be blocking.
I've created a try build which increases the timeout to 400ms and enables BCWebProgress:5 logging as well:
https://treeherder.mozilla.org/jobs?repo=try&revision=fa24a1f5b405110a22b392b6f6494ae5fbf0d065
Assignee | ||
Comment 2•1 year ago
|
||
Updated•1 year ago
|
Assignee | ||
Comment 3•1 year ago
|
||
For the record, we can't really reproduce the failure from Bug 1816538 on try at the moment, so it's hard to say if the patch really fixes this particular case. I did a try push without the fix, and no longer managed to reproduce the issue so I imagine we are using "fixed" windows workers now? Or maybe that's random?
In any case, I think this patch should be an improvement overall (as long as it doesn't lead to other regressions) but let's wait until the worker situation is clearer to see if we have a way to verify the fix properly.
Comment 4•1 year ago
|
||
I've asked for clarification on bug 1816538. Lets wait and see what the status is for try and other branches.
Assignee | ||
Updated•1 year ago
|
Assignee | ||
Comment 5•1 year ago
|
||
We mentioned that we would like to have a "safety" timeout to avoid blocking tests or scripts which might miss the navigation event.
One option is to start implementing a timeout
option for the ProgressListener. That would typically be also useful for https://github.com/w3c/webdriver-bidi/issues/188. Compared to the unloadTimeout
, the timeout
could occur regardless of where we are in the navigation, as long as the delay is reached. We should also try to cancel the navigation.
Since waitForInitialNavigation should only ever have to wait for an about:blank load (and not a custom URL which might be slow to answer for good reasons), it sounds acceptable to come up with a hardcoded timeout value, potentially with a platform specific multiplier applied.
Another option would be to keep relying on the current unloadTimeout mechanism, but use a much higher value when called from waitForInitialNavigation. We would use that instead of expectNavigation
.
Henrik: Should we go for the first option here and add a timeout
option to ProgressListener + attempt to cancel the navigation? If yes we should split that in a separate bug, in case of backouts. Otherwise the second option is much simpler. Let me know what you think.
Comment 6•1 year ago
|
||
(In reply to Julian Descottes [:jdescottes] from comment #5)
Since waitForInitialNavigation should only ever have to wait for an about:blank load (and not a custom URL which might be slow to answer for good reasons), it sounds acceptable to come up with a hardcoded timeout value, potentially with a platform specific multiplier applied.
What about cases like window.open("https://example.org")
? Here we open a new window (and tab) which might load the temporary about:blank
first, but then loads immediately https://example.org
. So I don't think that we can always assume we only load about:blank
. Or do I miss something?
Another option would be to keep relying on the current unloadTimeout mechanism, but use a much higher value when called from waitForInitialNavigation. We would use that instead of
expectNavigation
.
This might be another good alternative which then only affects the case around unloading. We would still have the problem when the page is not finishing the load, but that could be handled by a different bug.
Henrik: Should we go for the first option here and add a
timeout
option to ProgressListener + attempt to cancel the navigation? If yes we should split that in a separate bug, in case of backouts. Otherwise the second option is much simpler. Let me know what you think.
I like the 2nd option more as well. So maybe lets try that one first.
Canceling a possible navigation might still be a question that should be discussed. It's not clear if that should be up to the client or the remote end to handle it.
Assignee | ||
Comment 7•1 year ago
|
||
(In reply to Henrik Skupin [:whimboo][βοΈUTC+1] from comment #6)
(In reply to Julian Descottes [:jdescottes] from comment #5)
Since waitForInitialNavigation should only ever have to wait for an about:blank load (and not a custom URL which might be slow to answer for good reasons), it sounds acceptable to come up with a hardcoded timeout value, potentially with a platform specific multiplier applied.
What about cases like
window.open("https://example.org")
? Here we open a new window (and tab) which might load the temporaryabout:blank
first, but then loads immediatelyhttps://example.org
. So I don't think that we can always assume we only loadabout:blank
. Or do I miss something?
At the moment, I think we only use waitForInitialNavigation
when we open new tabs/windows without providing a URL. Apart from tests, there are 3 call sites I think:
- browsingContext.create from BiDi
- newSession from marionette
- newWindow from marionette
In all three cases, I don't think we can provide a URL, so the commands should return only with an about:blank load (might be missing something for marionette).
In the general case where a script opens a window, using eg window.open
, in BiDi we would only rely on events to monitor that. Not sure if classic/marionette monitors this navigation in some other way? But at least it doesn't seem to rely on waitForInitialNavigation
.
Another option would be to keep relying on the current unloadTimeout mechanism, but use a much higher value when called from waitForInitialNavigation. We would use that instead of
expectNavigation
.This might be another good alternative which then only affects the case around unloading. We would still have the problem when the page is not finishing the load, but that could be handled by a different bug.
Henrik: Should we go for the first option here and add a
timeout
option to ProgressListener + attempt to cancel the navigation? If yes we should split that in a separate bug, in case of backouts. Otherwise the second option is much simpler. Let me know what you think.I like the 2nd option more as well. So maybe lets try that one first.
Canceling a possible navigation might still be a question that should be discussed. It's not clear if that should be up to the client or the remote end to handle it.
Ok, that sounds good to me as well. I had started a prototype for handling timeout + cancelling requests, I might just file a bug and have the patch around in case it's useful to support a future timeout
parameter.
Thanks for the feedback.
Comment 8•1 year ago
|
||
(In reply to Julian Descottes [:jdescottes] from comment #7)
At the moment, I think we only use
waitForInitialNavigation
when we open new tabs/windows without providing a URL. Apart from tests, there are 3 call sites I think:
- browsingContext.create from BiDi
- newSession from marionette
- newWindow from marionette
In all three cases, I don't think we can provide a URL, so the commands should return only with an about:blank load (might be missing something for marionette).
In the general case where a script opens a window, using eg
window.open
, in BiDi we would only rely on events to monitor that. Not sure if classic/marionette monitors this navigation in some other way? But at least it doesn't seem to rely onwaitForInitialNavigation
.
Fair point. So it can happen in Marionette when executing a script or eg clicking an element. But then we still have selected the current browsing context so that we do not care about the newly created one. But once we switch to the new Tab we should basically do that but I cannot find such a call at the moment.
Ok, that sounds good to me as well. I had started a prototype for handling timeout + cancelling requests, I might just file a bug and have the patch around in case it's useful to support a future
timeout
parameter.
Sounds good!
Updated•1 year ago
|
Comment 9•1 year ago
|
||
On bug 1816538 the unload timeout has been increased to 5000ms. Lets see how this works and if we may still need this patch or not.
Updated•1 year ago
|
Assignee | ||
Comment 10•1 year ago
•
|
||
(In reply to Henrik Skupin [:whimboo][βοΈUTC+1] from comment #9)
On bug 1816538 the unload timeout has been increased to 5000ms. Lets see how this works and if we may still need this patch or not.
We had to backout the patch from bug 1816538. We had several issues:
- the xpcshell test test_Navigate.js started taking too much time on ccov -> bug 1835796
- many timeouts on Android Debug -> Bug 1835819
- timeouts on seemingly all wdspec tests in Beta simulation -> Bug 1835822
While the first one is not too concerning on its own, the 2 other issues mean that it's quite frequent to call waitForInitialNavigation
in cases where we are missing the first navigation. Which means that we can't consistently expect to see a navigation in situations where we are using this API.
Potential next steps
- increase the timeout more conservatively (we used 5000ms, we could stay closer to the current 200ms)
- only apply this to BiDi's browsingContext.create calls
- expose a preference to drive the timeout (this would help for instance with testing this for puppeteer)
Updated•1 year ago
|
Comment 11•1 year ago
|
||
I had a quick look at the beta sim test failures and I actually cannot find any instance where we actually timing out or delaying the wait for the initial page to be loaded. Something else might have added a small constant delay which triggered that many test failures. Not sure how many of our existing tests are already close to the general timeout limit.
Assignee | ||
Comment 12•1 year ago
|
||
Ok for now the beta failures also don't make much sense to me. I also note that there were no failures on the late beta sim, only early beta. I will start some additional pushes ...
Assignee | ||
Comment 13•1 year ago
|
||
Following the investigations on Bug 1835822, the beta failures were not related to this change.
Meaning we mostly have the Android reftest issues to look at. The logs are not super helpful as we don't have the full logs for those failures.
However we can see 40 seconds pauses between INFO - Starting runner
and the next INFO - TEST-START
(example https://treeherder.mozilla.org/logviewer?job_id=417476550&repo=autoland&lineNumber=11213-11214) so this must be where we hang while waiting for the navigation.
Assignee | ||
Comment 14•1 year ago
|
||
The issue is that on Android, when you open a new tab on about:blank, it keeps "isInitialDocument" to true until you actually manually perform a navigation.
Interestingly it doesn't seem to be the case when we use BiDi's browsingContext.create
. I think this is because we explicitely create a tab for the url "about:blank". But when we open a tab using the browser's UI, or when we start the browser and look at the preloaded tabs, they will retain isInitialDocument === true
indefinitely.
So this means that checking isInitialDocument
is not a consistent way to know if we can expect a navigation to occur soon. I think it means the document will be disposed once we start actually using the tab (even if we end up going to about:blank), but it might not happen anytime soon.
Based on that, one approach could be to only extend the timeout for BiDi's browsingContext.create
and for Marionette's newWindow
. They both use the same codepath which should explicitly open a tab on "about:blank" and we should leave the initialDocument immediately. We would keep Marionette's newSession using the current 200ms timeout.
Assignee | ||
Comment 15•1 year ago
|
||
Extending the timeout only when we explicitly create a new window or tab should be safe.
On the other hand, initial windows from the browser or new tabs created by the browser UI can remain on the initial document forever, for instance on Android.
So for those cases we keep the current unloadTimeout value.
Comment 16•1 year ago
|
||
(In reply to Julian Descottes [:jdescottes] from comment #14)
The issue is that on Android, when you open a new tab on about:blank, it keeps "isInitialDocument" to true until you actually manually perform a navigation.
[..]
So this means that checkingisInitialDocument
is not a consistent way to know if we can expect a navigation to occur soon. I think it means the document will be disposed once we start actually using the tab (even if we end up going to about:blank), but it might not happen anytime soon.Based on that, one approach could be to only extend the timeout for BiDi's
browsingContext.create
and for Marionette'snewWindow
. They both use the same codepath which should explicitly open a tab on "about:blank" and we should leave the initialDocument immediately. We would keep Marionette's newSession using the current 200ms timeout.
Interesting. So if under those cases the initial document is kept it won't make sense for us to wait longer. Instead I wonder if we should not check for !isInitial
but only for the loading state. If a different page will be loaded right after we probably could spin some event loops and wait if another navigation is started? It's really unsatisfying that it's not known if the initial about:blank
is kept or not.
Assignee | ||
Comment 17•1 year ago
•
|
||
(In reply to Henrik Skupin [:whimboo][βοΈUTC+1] from comment #16)
Interesting. So if under those cases the initial document is kept it won't make sense for us to wait longer. Instead I wonder if we should not check for
!isInitial
but only for the loading state. If a different page will be loaded right after we probably could spin some event loops and wait if another navigation is started? It's really unsatisfying that it's not known if the initialabout:blank
is kept or not.
This can be an option, but it might be as racy as our current setup and we would first need to collect data on various platforms to see in which state we get the documents usually. And since the issue is rather hard to reproduce on try it might be hard to be confident here.
We would need to know in which state the document is when we currently miss navigations. We can add logs for that, but we already know a few things. In those scenarios, the progress listener does not observe anything for a long period of time (at least longer than the unload timeout). So I suspect the document is not in loading state (otherwise we would see it with the WebProgress logs).
Assignee | ||
Comment 18•1 year ago
|
||
I did the following try push which adds logs in waitForInitialNavigation and reduces the timeout https://treeherder.mozilla.org/jobs?repo=try&selectedTaskRun=Y0csWSzgQ8CTueFSJ4-QeQ.0&revision=6fa3d6a3c91b6399434c576dd589e27189641b5b
The idea is to make the missed navigations much more likely and when they occur, check what was the state of the document.
I haven't parsed everything of course, but sampling a few failures, I've seen that we miss navigations after stumbling on documents which are not loading.
This means it's not just a matter of spinning the event loop a few times, we can have a significant delay between the initial document in non loading state and doing the upcoming navigation.
Comment 20•1 year ago
|
||
(In reply to Julian Descottes [:jdescottes] from comment #18)
This means it's not just a matter of spinning the event loop a few times, we can have a significant delay between the initial document in non loading state and doing the upcoming navigation.
Olli, do you have an idea why we can see such long delays? It got worse since early May and it's not clear to us if that is expected or not. Right now we have a 200ms timeout when trying to detect a navigation but as Julian said we will have to increase it quite a lot to not miss an additional navigation after the initial document has been loaded.
Comment 21•1 year ago
|
||
No idea. If one can reproduce this locally, profiling this could be useful.
Does the issue happen also on opt builds? debug builds can be very slow.
Assignee | ||
Comment 22•1 year ago
|
||
(In reply to Olli Pettay [:smaug][bugs@pettay.fi] from comment #21)
No idea. If one can reproduce this locally, profiling this could be useful.
Does the issue happen also on opt builds? debug builds can be very slow.
This typically only happens on CI, on platforms which are slower than what we expect.
But we already use different timeouts depending on the platform type (we wait 8 times longer on debug/tsan/asan, 16 times longer on ccov). And we get the issue both on opt and debug builds.
It's very hard to predict those delays, and their duration.
When an issue occurs on a platform, not every new tab is affected. But once in a while (maybe because a GC slows things down) the delay will be much bigger and be bigger than our timeout. And if we start using a very conservative timeout, then it can also be a problem because in some cases (new tab on android) we stay indefinitely on the initial about:blank, meaning we will always hit our timeout (so a huge timeout means test will take a lot more time to run).
Finding a "good" value for the timeout seems hard IMO. After a recent update of the windows worker images, we started getting much more frequent issues on windows. So even if we find a good value which works as of today, it's unlikely to always be a good fit.
Something which would be useful would be a way to know if a document with isInitialDocument
set to true is going to navigate or not. Today we don't know for sure so we have to use the timeout. But if we could know for sure when there will be a load, we could decide whether we wait for a navigation or not and completely remove the timeout.
Comment hidden (Intermittent Failures Robot) |
Comment 24•1 year ago
|
||
Pushed by jdescottes@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/48770eeaa782 [remote] Set timeout to 5s for waitForInitialNavigation for new contexts r=webdriver-reviewers,whimboo
Comment 25•1 year ago
|
||
bugherder |
Assignee | ||
Comment 26•1 year ago
|
||
Comment on attachment 9337040 [details]
Bug 1832891 - [remote] Set timeout to 5s for waitForInitialNavigation for new contexts
Beta/Release Uplift Approval Request
- User impact if declined: This patch helps reducing intermittent failures in wdspec tests both for webdriver classic and bidi. It has been on Nightly for a few days and hasn't caused any particular issue, so it should be safe to uplift to help with intermittent issues on beta as well.
- Is this code covered by automated tests?: Yes
- Has the fix been verified in Nightly?: Yes
- Needs manual test from QE?: No
- If yes, steps to reproduce:
- List of other uplifts needed: None
- Risk to taking this patch: Low
- Why is the change risky/not risky? (and alternatives if risky): The patch is extending the delay used for a navigation timeout in a few situations where we are sure to expect a navigation. This only affects automation.
- String changes made/needed:
- Is Android affected?: Yes
Comment 27•1 year ago
|
||
Comment on attachment 9337040 [details]
Bug 1832891 - [remote] Set timeout to 5s for waitForInitialNavigation for new contexts
Approved for 115.0b5.
Comment hidden (obsolete) |
Updated•1 year ago
|
Description
•