Investigation for failure on Android Fission builds: /webdriver/tests/element_click/navigate.py | test_link_from_nested_context_with_target[_top] - webdriver.error.UnknownErrorException: unknown error (500): InactiveActor: Actor is no longer active
Categories
(Testing :: geckodriver, defect, P1)
Tracking
(firefox106 affected)
Tracking | Status | |
---|---|---|
firefox106 | --- | affected |
People
(Reporter: intermittent-bug-filer, Assigned: jdescottes)
References
(Blocks 1 open bug)
Details
(Whiteboard: [webdriver:m6][fission:android:m2])
Attachments
(1 obsolete file)
Filed by: istorozhko [at] mozilla.com
Parsed log: https://treeherder.mozilla.org/logviewer?job_id=387806546&repo=try
Full log: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/UM5JCZ5oTnOpQIDzteY_TQ/runs/0/artifacts/public/logs/live_backing.log
Here is the link to the try push: https://treeherder.mozilla.org/jobs?repo=try&revision=912cffe26463c17323a7b00ac5c07ceff88409a7
This bug blocks Fission on Android. The tests were run on Fission with `isolateNothing` strategy (so when debugging locally or making new try pushes, please add `pref("fission.webContentIsolationStrategy", 0);` line to the `geckoview-prefs.js`)
Updated•2 years ago
|
Updated•2 years ago
|
Updated•2 years ago
|
Comment 1•2 years ago
|
||
The underlying problem here is the following InactiveActor exception that gets raised:
webdriver.error.UnknownErrorException: unknown error (500): InactiveActor: Actor is no longer active
It means that the contentWindow
property of the JSWindowActorChild isn't valid.
There is also the following note in the webidl file:
/**
* NOTE: As this returns a window proxy, it may not be currently referencing
* the document associated with this JSWindowActor. Generally prefer using
* `document`.
*/
Does it mean someone should better use document.defaultView
to get the window of the currently visible document and not contentWindow
? It works fine for us on desktop so why could it be broken on Android? Could there be a bug or should we change our implementation?
Olli, do you have a suggestion? Thanks!
Comment 2•2 years ago
|
||
Nika might actually remember the setup we have for JSWindowActorChild better.
Comment 3•2 years ago
|
||
(In reply to Olli Pettay [:smaug][bugs@pettay.fi] from comment #2)
Nika might actually remember the setup we have for JSWindowActorChild better.
Hi Nika, would you mind taking a look at the above comment 1? We would appreciate. Thanks!
Updated•2 years ago
|
Comment 4•2 years ago
|
||
(In reply to Olli Pettay [:smaug][bugs@pettay.fi] from comment #2)
Nika might actually remember the setup we have for JSWindowActorChild better.
Nika, in case you have a moment we would appreciate some feedback so we can get this fixed for GeckoView with Fission enabled. Thanks!
Comment 5•2 years ago
|
||
In general, this error message occurs when the JSWindowActor is not the active window actor within its BrowsingContext. The error is being returned because it would be returning a window proxy to a different window global if it didn't, as the BrowsingContext has navigated.
It looks like there are 2 failures in the log linked in comment 0, but based on comment 1, I'm going to assume you're only looking into the second one which is seeing the exception you mention.
I'm not sure I completely understand why this would be failing as I don't remember the internals of how marionette works very well, but it appears it's failing when trying to look for an element in the document after navigating the root document from a link click. There's a very real chance that this failure is just a pre-existing race-condition in marionette where on desktop we never attempt to poll asking about the link while the navigation is happening, meaning we always poll either before the navigation completes or after it completes, and the timing lines up differently on android?
My current impression is that if marionette wants to be resilient to handling situations where the document under it is navigated while sending commands, it might need to re-try to send the command to the child actor when it gets back an exception telling it that the actor is inactive now, as the child process may discover that is is inactive without the parent actor knowing ahead of time, due to a situation like an initial about:blank document being replaced.
Not sure if that was helpful or not, but feel free to ping me again if you have more questions or want to chat.
Comment hidden (Intermittent Failures Robot) |
Comment 7•2 years ago
|
||
Sorry, somehow your response got under the wire. Not sure why I haven't gotten an email for it from Bugzilla. Nevertheless I think this is something we need to have a look at. As such lets bring it up in the next triage meeting. We should spend at least some time to further investigate the problem and if it's not us failing given that on Desktop it's all fine we should file a GeckoView specific bug.
Assignee | ||
Updated•2 years ago
|
Assignee | ||
Comment 8•2 years ago
|
||
For now I can't reproduce locally. There is one failure which is already listed in the corresponding py.ini file, for test_link_unload_event
, but I can't hit the InactiveActor
error, for me test_link_from_nested_context_with_target[_top] passes successfully on the emulator.
Will check try
Assignee | ||
Comment 9•2 years ago
|
||
Looks like this is debug only:
https://treeherder.mozilla.org/jobs?repo=try&revision=2b0d28c1f22df6bf5f9f22a2a3ce8d871a38e857
The test fails on debug builds with fission, passes on opt.
Olivia: I noticed that on opt there are failures which don't seem tracked for android + fission
- /webdriver/tests/navigate_to/navigate.py
- /webdriver/tests/back/back.py | test_cross_origin[capabilities0]
- /webdriver/tests/forward/forward.py | test_cross_origin[capabilities0]
All failing with
webdriver.error.StaleElementReferenceException: stale element reference (404)
Should I file bugs for this, or will it be done during a next simulation? We worked on stale element logic & tests in the past few weeks (eg in Bug 1811545), so it's possible we regressed this recenly.
Comment 10•2 years ago
|
||
Olivia: I noticed that on opt there are failures which don't seem tracked for android + fission
- /webdriver/tests/navigate_to/navigate.py
- /webdriver/tests/back/back.py | test_cross_origin[capabilities0]
- /webdriver/tests/forward/forward.py | test_cross_origin[capabilities0]
All failing with
webdriver.error.StaleElementReferenceException: stale element reference (404)
Should I file bugs for this, or will it be done during a next simulation? We worked on stale element logic & tests in the past few weeks (eg in Bug 1811545), so it's possible we regressed this recenly.
Thanks for identifying these fission bugs. I'm not sure, but I don't think they will be filed during simulation. :owlish might know for sure?
Assignee | ||
Comment 11•2 years ago
|
||
For the current bug:
Finished the investigation for the Inactive Actor
issue. This happens because we run the s.find.css("#foo")
command against a destroyed top level browsing context. Basically we have a race condition between a call to switchToFrame
and a XULFrameLoaderCreated
notification. XULFrameLoaderCreated
is observed while switchToFrame is already in progress, and the (outdated) result of switchToFrame will overwright the information from XULFrameLoaderCreated
In more details:
Test is on PAGE_1, which contains an IFRAME. IFRAME contains a link which will navigate to PAGE_2 with target=_top.
First we switchToFrame to the IFRAME. Then we click on the navigation link.
This click is going to navigate the parent page of IFRAME, which means it will destroy IFRAME.
The click will not wait for any navigation because of this check
Immediately after the test does session.switch_frame(None)
At this point, the current browsing context still points to the iframe, which has not been destroyed yet. Its parent page has not navigated either.
Now the race condition happens. This call to switchToFrame(null)
is going to return the browsing context id of PAGE_1, but before it resolves, we will observe XULFrameLoaderCreated
. In the observer for XULFrameLoaderCreated
, we update the current browsing context to PAGE_2. However right after that switchToFrame
comes back from the content process and overwrites the current browsing context back to its previous value.
After that, we are stuck on an invalid top browsing context and the next find
command will fail.
Will need to think a bit more about this, but it feels like an issue on Marionette's side. Maybe when we receive a XULFrameLoaderCreated
we should "cancel" any pending command which might have updated the current browsing context?
Assignee | ||
Comment 12•2 years ago
|
||
Just to illustrate the analysis, this fixes the issue. Let's discuss if we want to go for something along those lines
Comment 13•2 years ago
|
||
That's an interesting case and I wonder why this only happens for the current Fission implementation on Android but not not desktop. What specifically is different on mobile? Maybe owlish can explain this as well.
(In reply to Olivia Hall [:olivia] from comment #10)
Olivia: I noticed that on opt there are failures which don't seem tracked for android + fission
- /webdriver/tests/navigate_to/navigate.py
- /webdriver/tests/back/back.py | test_cross_origin[capabilities0]
- /webdriver/tests/forward/forward.py | test_cross_origin[capabilities0]
All failing with
webdriver.error.StaleElementReferenceException: stale element reference (404)
Should I file bugs for this, or will it be done during a next simulation? We worked on stale element logic & tests in the past few weeks (eg in Bug 1811545), so it's possible we regressed this recenly.Thanks for identifying these fission bugs. I'm not sure, but I don't think they will be filed during simulation. :owlish might know for sure?
Seeing a StaleElementReferenceException
here would mean that the newly loaded page is not within a different process (browsing context group) and as such it still can be found in our internal node map. So I assume that GeckoView has some custom settings regarding how processes are used? Maybe that's the underlying problem why we see the above failure only on Android?
Assignee | ||
Comment 14•2 years ago
|
||
(In reply to Henrik Skupin (:whimboo) [⌚️UTC+1] from comment #13)
That's an interesting case and I wonder why this only happens for the current Fission implementation on Android but not not desktop. What specifically is different on mobile? Maybe owlish can explain this as well.
Good question, the main difference between desktop and mobile here is that we get a XULFrameLoaderCreated event for this navigation, and the browsing context is replaced. Which I guess means we have a cross group navigation here, whereas we don't on desktop. Normally for a navigation with target _top we shouldn't try to put the old page in bfcache, so maybe there's something wrong on Android there?
The underlying issue is still an edge case that is not handled correctly by Marionette, but it's true there's also something to investigate on geckoview.
Assignee | ||
Comment 15•2 years ago
•
|
||
For reference, here's a test which intermittently reproduces the issue on desktop:
def test_navigate_back_and_switch_frame(session, inline):
target_page = inline("<p id='foo'>foo</p>")
session.url = inline("<a id='link' href='{}'>click</a>".format(target_page))
element = session.find.css("a", all=False)
orig_handles = session.handles
response = element_click(session, element)
assert_success(response)
session.execute_script("history.go(-1)")
session.switch_frame(None)
wait = Poll(
session,
timeout=5,
ignored_exceptions=NoSuchElementException,
message="Expected element has not been found")
wait.until(lambda s: s.find.css("#link"))
The trick is to start a bfcache navigation that will not be awaited by marionette and then call switch_frame(None)
. But as you can see it's a bit harder to trigger than on android, which seems to replace browsing contexts much more often than desktop.
Assignee | ||
Updated•2 years ago
|
Assignee | ||
Comment 16•2 years ago
|
||
Closing as we completed the investigation here. Following up to fix the failure should happen over in Bug 1816061
Comment 17•2 years ago
|
||
Comment on attachment 9316267 [details]
Bug 1786640 - [marionette] WIP: retry switch to frame if browsing context changed mid flight
Revision D169045 was moved to bug 1817820. Setting attachment 9316267 [details] to obsolete.
Description
•