Closed Bug 1786640 Opened 2 years ago Closed 2 years ago

Investigation for failure on Android Fission builds: /webdriver/tests/element_click/navigate.py | test_link_from_nested_context_with_target[_top] - webdriver.error.UnknownErrorException: unknown error (500): InactiveActor: Actor is no longer active

Categories

(Testing :: geckodriver, defect, P1)

Unspecified
Android
defect
Points:
1

Tracking

(firefox106 affected)

RESOLVED FIXED
Tracking Status
firefox106 --- affected

People

(Reporter: intermittent-bug-filer, Assigned: jdescottes)

References

(Blocks 1 open bug)

Details

(Whiteboard: [webdriver:m6][fission:android:m2])

Attachments

(1 obsolete file)

Filed by: istorozhko [at] mozilla.com
Parsed log: https://treeherder.mozilla.org/logviewer?job_id=387806546&repo=try
Full log: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/UM5JCZ5oTnOpQIDzteY_TQ/runs/0/artifacts/public/logs/live_backing.log


Here is the link to the try push: https://treeherder.mozilla.org/jobs?repo=try&revision=912cffe26463c17323a7b00ac5c07ceff88409a7

This bug blocks Fission on Android. The tests were run on Fission with `isolateNothing` strategy (so when debugging locally or making new try pushes, please add `pref("fission.webContentIsolationStrategy", 0);` line to the `geckoview-prefs.js`)
Blocks: 1714654
No longer blocks: fission-tests
OS: Unspecified → Android
Summary: Failure on Fission builds: /webdriver/tests/element_click/navigate.py | test_link_from_nested_context_with_target[_top] - webdriver.error.UnknownErrorException: unknown error (500): InactiveActor: Actor is no longer active → Failure on Android Fission builds: /webdriver/tests/element_click/navigate.py | test_link_from_nested_context_with_target[_top] - webdriver.error.UnknownErrorException: unknown error (500): InactiveActor: Actor is no longer active
Whiteboard: [fission:android:m2]

The underlying problem here is the following InactiveActor exception that gets raised:

webdriver.error.UnknownErrorException: unknown error (500): InactiveActor: Actor is no longer active

It means that the contentWindow property of the JSWindowActorChild isn't valid.

There is also the following note in the webidl file:

  /**
   * NOTE: As this returns a window proxy, it may not be currently referencing
   * the document associated with this JSWindowActor. Generally prefer using
   * `document`.
   */

Does it mean someone should better use document.defaultView to get the window of the currently visible document and not contentWindow? It works fine for us on desktop so why could it be broken on Android? Could there be a bug or should we change our implementation?

Olli, do you have a suggestion? Thanks!

Flags: needinfo?(smaug)

Nika might actually remember the setup we have for JSWindowActorChild better.

Flags: needinfo?(smaug) → needinfo?(nika)

(In reply to Olli Pettay [:smaug][bugs@pettay.fi] from comment #2)

Nika might actually remember the setup we have for JSWindowActorChild better.

Hi Nika, would you mind taking a look at the above comment 1? We would appreciate. Thanks!

Flags: needinfo?(nika)

(In reply to Olli Pettay [:smaug][bugs@pettay.fi] from comment #2)

Nika might actually remember the setup we have for JSWindowActorChild better.

Nika, in case you have a moment we would appreciate some feedback so we can get this fixed for GeckoView with Fission enabled. Thanks!

Flags: needinfo?(nika)

In general, this error message occurs when the JSWindowActor is not the active window actor within its BrowsingContext. The error is being returned because it would be returning a window proxy to a different window global if it didn't, as the BrowsingContext has navigated.

It looks like there are 2 failures in the log linked in comment 0, but based on comment 1, I'm going to assume you're only looking into the second one which is seeing the exception you mention.

I'm not sure I completely understand why this would be failing as I don't remember the internals of how marionette works very well, but it appears it's failing when trying to look for an element in the document after navigating the root document from a link click. There's a very real chance that this failure is just a pre-existing race-condition in marionette where on desktop we never attempt to poll asking about the link while the navigation is happening, meaning we always poll either before the navigation completes or after it completes, and the timing lines up differently on android?

My current impression is that if marionette wants to be resilient to handling situations where the document under it is navigated while sending commands, it might need to re-try to send the command to the child actor when it gets back an exception telling it that the actor is inactive now, as the child process may discover that is is inactive without the parent actor knowing ahead of time, due to a situation like an initial about:blank document being replaced.

Not sure if that was helpful or not, but feel free to ping me again if you have more questions or want to chat.

Flags: needinfo?(nika)

Sorry, somehow your response got under the wire. Not sure why I haven't gotten an email for it from Bugzilla. Nevertheless I think this is something we need to have a look at. As such lets bring it up in the next triage meeting. We should spend at least some time to further investigate the problem and if it's not us failing given that on Desktop it's all fine we should file a GeckoView specific bug.

Whiteboard: [fission:android:m2] → [fission:android:m2][webdriver:triage]
Assignee: nobody → jdescottes
Severity: S4 → S3
Status: NEW → ASSIGNED
Points: --- → 1
Flags: needinfo?(jdescottes)
Priority: P5 → P1
Whiteboard: [fission:android:m2][webdriver:triage] → [webdriver:m6][fission:android:m2]

For now I can't reproduce locally. There is one failure which is already listed in the corresponding py.ini file, for test_link_unload_event , but I can't hit the InactiveActor error, for me test_link_from_nested_context_with_target[_top] passes successfully on the emulator.

Will check try

Flags: needinfo?(jdescottes)

Looks like this is debug only:
https://treeherder.mozilla.org/jobs?repo=try&revision=2b0d28c1f22df6bf5f9f22a2a3ce8d871a38e857

The test fails on debug builds with fission, passes on opt.

Olivia: I noticed that on opt there are failures which don't seem tracked for android + fission

  • /webdriver/tests/navigate_to/navigate.py
  • /webdriver/tests/back/back.py | test_cross_origin[capabilities0]
  • /webdriver/tests/forward/forward.py | test_cross_origin[capabilities0]

All failing with

webdriver.error.StaleElementReferenceException: stale element reference (404)

Should I file bugs for this, or will it be done during a next simulation? We worked on stale element logic & tests in the past few weeks (eg in Bug 1811545), so it's possible we regressed this recenly.

Flags: needinfo?(ohall)

Olivia: I noticed that on opt there are failures which don't seem tracked for android + fission

  • /webdriver/tests/navigate_to/navigate.py
  • /webdriver/tests/back/back.py | test_cross_origin[capabilities0]
  • /webdriver/tests/forward/forward.py | test_cross_origin[capabilities0]

All failing with

webdriver.error.StaleElementReferenceException: stale element reference (404)
Should I file bugs for this, or will it be done during a next simulation? We worked on stale element logic & tests in the past few weeks (eg in Bug 1811545), so it's possible we regressed this recenly.

Thanks for identifying these fission bugs. I'm not sure, but I don't think they will be filed during simulation. :owlish might know for sure?

Flags: needinfo?(ohall) → needinfo?(bugzeeeeee)

For the current bug:

Finished the investigation for the Inactive Actor issue. This happens because we run the s.find.css("#foo") command against a destroyed top level browsing context. Basically we have a race condition between a call to switchToFrame and a XULFrameLoaderCreated notification. XULFrameLoaderCreated is observed while switchToFrame is already in progress, and the (outdated) result of switchToFrame will overwright the information from XULFrameLoaderCreated

In more details:
Test is on PAGE_1, which contains an IFRAME. IFRAME contains a link which will navigate to PAGE_2 with target=_top.
First we switchToFrame to the IFRAME. Then we click on the navigation link.
This click is going to navigate the parent page of IFRAME, which means it will destroy IFRAME.

The click will not wait for any navigation because of this check
Immediately after the test does session.switch_frame(None)

At this point, the current browsing context still points to the iframe, which has not been destroyed yet. Its parent page has not navigated either.

Now the race condition happens. This call to switchToFrame(null) is going to return the browsing context id of PAGE_1, but before it resolves, we will observe XULFrameLoaderCreated. In the observer for XULFrameLoaderCreated, we update the current browsing context to PAGE_2. However right after that switchToFrame comes back from the content process and overwrites the current browsing context back to its previous value.

After that, we are stuck on an invalid top browsing context and the next find command will fail.

Will need to think a bit more about this, but it feels like an issue on Marionette's side. Maybe when we receive a XULFrameLoaderCreated we should "cancel" any pending command which might have updated the current browsing context?

Just to illustrate the analysis, this fixes the issue. Let's discuss if we want to go for something along those lines

That's an interesting case and I wonder why this only happens for the current Fission implementation on Android but not not desktop. What specifically is different on mobile? Maybe owlish can explain this as well.

(In reply to Olivia Hall [:olivia] from comment #10)

Olivia: I noticed that on opt there are failures which don't seem tracked for android + fission

  • /webdriver/tests/navigate_to/navigate.py
  • /webdriver/tests/back/back.py | test_cross_origin[capabilities0]
  • /webdriver/tests/forward/forward.py | test_cross_origin[capabilities0]

All failing with

webdriver.error.StaleElementReferenceException: stale element reference (404)
Should I file bugs for this, or will it be done during a next simulation? We worked on stale element logic & tests in the past few weeks (eg in Bug 1811545), so it's possible we regressed this recenly.

Thanks for identifying these fission bugs. I'm not sure, but I don't think they will be filed during simulation. :owlish might know for sure?

Seeing a StaleElementReferenceException here would mean that the newly loaded page is not within a different process (browsing context group) and as such it still can be found in our internal node map. So I assume that GeckoView has some custom settings regarding how processes are used? Maybe that's the underlying problem why we see the above failure only on Android?

(In reply to Henrik Skupin (:whimboo) [⌚️UTC+1] from comment #13)

That's an interesting case and I wonder why this only happens for the current Fission implementation on Android but not not desktop. What specifically is different on mobile? Maybe owlish can explain this as well.

Good question, the main difference between desktop and mobile here is that we get a XULFrameLoaderCreated event for this navigation, and the browsing context is replaced. Which I guess means we have a cross group navigation here, whereas we don't on desktop. Normally for a navigation with target _top we shouldn't try to put the old page in bfcache, so maybe there's something wrong on Android there?

The underlying issue is still an edge case that is not handled correctly by Marionette, but it's true there's also something to investigate on geckoview.

For reference, here's a test which intermittently reproduces the issue on desktop:

def test_navigate_back_and_switch_frame(session, inline):
    target_page = inline("<p id='foo'>foo</p>")

    session.url = inline("<a id='link' href='{}'>click</a>".format(target_page))
    element = session.find.css("a", all=False)

    orig_handles = session.handles

    response = element_click(session, element)
    assert_success(response)

    session.execute_script("history.go(-1)")
    session.switch_frame(None)

    wait = Poll(
        session,
        timeout=5,
        ignored_exceptions=NoSuchElementException,
        message="Expected element has not been found")
    wait.until(lambda s: s.find.css("#link"))

The trick is to start a bfcache navigation that will not be awaited by marionette and then call switch_frame(None). But as you can see it's a bit harder to trigger than on android, which seems to replace browsing contexts much more often than desktop.

Blocks: 1816061
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Flags: needinfo?(bugzeeeeee)
Resolution: --- → FIXED
Summary: Failure on Android Fission builds: /webdriver/tests/element_click/navigate.py | test_link_from_nested_context_with_target[_top] - webdriver.error.UnknownErrorException: unknown error (500): InactiveActor: Actor is no longer active → Investigation for failure on Android Fission builds: /webdriver/tests/element_click/navigate.py | test_link_from_nested_context_with_target[_top] - webdriver.error.UnknownErrorException: unknown error (500): InactiveActor: Actor is no longer active

Closing as we completed the investigation here. Following up to fix the failure should happen over in Bug 1816061

Comment on attachment 9316267 [details]
Bug 1786640 - [marionette] WIP: retry switch to frame if browsing context changed mid flight

Revision D169045 was moved to bug 1817820. Setting attachment 9316267 [details] to obsolete.

Attachment #9316267 - Attachment is obsolete: true
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: