Closed Bug 1822466 Opened 1 year ago Closed 1 year ago

Element reference is unknown and not stale after a remoteness change since Firefox 110.0

Categories

(Remote Protocol :: Marionette, defect, P1)

Firefox 110
defect
Points:
5

Tracking

(firefox-esr102 unaffected, firefox111 wontfix, firefox112 wontfix, firefox113 wontfix, firefox114 wontfix, firefox115 wontfix, firefox116 fixed)

RESOLVED FIXED
116 Branch
Tracking Status
firefox-esr102 --- unaffected
firefox111 --- wontfix
firefox112 --- wontfix
firefox113 --- wontfix
firefox114 --- wontfix
firefox115 --- wontfix
firefox116 --- fixed

People

(Reporter: xlebnica.booroop, Assigned: whimboo)

References

(Blocks 1 open bug, Regression)

Details

(Keywords: regression, Whiteboard: [webdriver:m7], [wptsync upstream][webdriver:relnote])

Attachments

(7 files, 2 obsolete files)

Attached image err.jpg

User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/111.0

Steps to reproduce:

We have the following env:
selenium java 4.7.2
geckodriver 0.32.1
firefox 110.0

After updating firefox from 109.0.1 to 110.0 we have a problem with ExpectedConditions.stalenessOf(element).

Example of code:

WebElement html = seleniumSettings.getWebDriver().findElement(By.tagName("html"));

seleniumSettings.getWebDriver().get("about:blank");
seleniumSettings.getWebDriver().get(page);

new WebDriverWait(seleniumSettings.getWebDriver(), seleniumSettings.getDefaultTimeout())
.withMessage("Element exist.")
.until(ExpectedConditions.stalenessOf(html));

Actual results:

Test break on WebDriverWait with the following error - Caused by: org.openqa.selenium.NoSuchElementException: Unknown element with id...

Expected results:

WebDriverWait should execute correctly

This problem actually only on Linux.

The problem is exactly the same as in Bug 1685951

Component: Untriaged → Marionette
Product: Firefox → Remote Protocol

Could you please provide a trace log as generated by geckodriver?

I assume that the originally loaded URL is the same as page? With that in mind I would say that the navigation to about:blank is causing the browsing context to be replaced. Navigating to the page again will again replace the browsing context.

As per latest WebDriver classic changes the remote end needs to raise a no such element error given that the element isn't known in the new browsing context.

Flags: needinfo?(xlebnica.booroop)

The landing of bug 1692468 has changed this behavior.

Status: UNCONFIRMED → NEW
Ever confirmed: true
Keywords: regression
Regressed by: 1692468

It would still be good to get the requested details. But nevertheless we will discuss next week in our triage meeting.

Whiteboard: [webdriver:triage]
Attached file log.txt
Flags: needinfo?(xlebnica.booroop)

Attached the log file.

So the log shows two interesting scenarios:

  1. Here an element with the tag name html is retrieved from the initially loaded about:blank page. Then a navigation to about:blank is happening, following by a navigation to http://host.docker.internal:8080. Note that here no remoteness change will happen and as such the element is marked as stale.
1679574453112   webdriver::server       DEBUG   -> POST /session/3c576275-fd0b-4b2d-9bde-f11c8e7a51f8/element {
  "using": "tag name",
  "value": "html"
}
[..]
1679574453136   Marionette      DEBUG   0 -> [0,7,"WebDriver:Navigate",{"url":"about:blank"}]
1679574453138   RemoteAgent     TRACE   [8] ProgressListener Start: expectNavigation=false resolveWhenStarted=true unloadTimeout=200 waitForExplicitStart=true
1679574453138   RemoteAgent     TRACE   [8] ProgressListener Setting unload timer (200ms)
1679574453142   RemoteAgent     TRACE   [8] ProgressListener Check loading state: isStart=1 isStop=0
1679574453142   RemoteAgent     TRACE   [8] ProgressListener state=start: about:blank
1679574453142   RemoteAgent     TRACE   [8] ProgressListener Cleared the unload timer
1679574453142   RemoteAgent     TRACE   [8] ProgressListener Request to stop listening when navigation started
1679574453142   RemoteAgent     TRACE   [8] ProgressListener Stop: has error=false
1679574453144   Marionette      DEBUG   0 <- [1,7,null,{"value":null}]
1679574453145   webdriver::server       DEBUG   <- 200 OK {"value":null}
1679574453153   Marionette      TRACE   [8] MarionetteCommands actor destroyed for window id 4294967297
1679574453154   webdriver::server       DEBUG   -> POST /session/3c576275-fd0b-4b2d-9bde-f11c8e7a51f8/url {
  "url": "http:\u002f\u002fhost.docker.internal:8080"
}
1679574453166   Marionette      DEBUG   0 -> [0,8,"WebDriver:Navigate",{"url":"http://host.docker.internal:8080"}]
1679574453167   RemoteAgent     TRACE   [8] ProgressListener Start: expectNavigation=false resolveWhenStarted=true unloadTimeout=200 waitForExplicitStart=true
1679574453167   RemoteAgent     TRACE   [8] ProgressListener Setting unload timer (200ms)
1679574453178   RemoteAgent     TRACE   [8] ProgressListener Check loading state: isStart=1 isStop=0
1679574453178   RemoteAgent     TRACE   [8] ProgressListener state=start: http://host.docker.internal:8080/
1679574453178   RemoteAgent     TRACE   [8] ProgressListener Cleared the unload timer
1679574453178   RemoteAgent     TRACE   [8] ProgressListener Request to stop listening when navigation started
1679574453178   RemoteAgent     TRACE   [8] ProgressListener Stop: has error=false
1679574453181   Marionette      DEBUG   0 <- [1,8,null,{"value":null}]
1679574453183   webdriver::server       DEBUG   <- 200 OK {"value":null}
1679574453198   webdriver::server       DEBUG   -> GET /session/3c576275-fd0b-4b2d-9bde-f11c8e7a51f8/element/68833415-3c3a-4346-86c7-882a77a9ced8/enabled
1679574453200   Marionette      DEBUG   0 -> [0,9,"WebDriver:IsElementEnabled",{"id":"68833415-3c3a-4346-86c7-882a77a9ced8"}]
1679574453203   Marionette      TRACE   [8] MarionetteCommands actor created for window id 4294967298
1679574453211   Marionette      DEBUG   0 <- [1,9,{"error":"stale element reference","message":"The element reference of <html> is stale; either its node document is not the active document, or it is no longer connected to the DOM","stacktrace":"RemoteError@chrome://remote/content/shared/RemoteError.sys.mjs:8:8\nWebDriverError@chrome://remote/content/shared/webdriver/Er ... arionette/json.sys.mjs:217:10\nreceiveMessage@chrome://remote/content/marionette/actors/MarionetteCommandsChild.sys.mjs:85:30\n"},null]
1679574453213   webdriver::server       DEBUG   <- 404 Not Found {"value":{"error":"stale element reference","message":"The element reference of <html> is stale; either its  ... ette/json.sys.mjs:217:10\nreceiveMessage@chrome://remote/content/marionette/actors/MarionetteCommandsChild.sys.mjs:85:30\n"}}
  1. Here the element will be retrieved from the loaded page http://host.docker.internal:8080, Then a navigation to about:blank happens, which again doesn't cause a remoteness change. But then the next navigation loads http:\u002f\u002fhost.docker.internal:8080\u002fgrid\u002fSimpleGrid.do?ModuleName=ADMIN_PROGRAM (note the encoded slashes) and that one triggers a remoteness change:
1679574461935   webdriver::server       DEBUG   -> POST /session/3c576275-fd0b-4b2d-9bde-f11c8e7a51f8/element {
  "using": "tag name",
  "value": "html"
}
1679574461936   Marionette      DEBUG   0 -> [0,64,"WebDriver:FindElement",{"using":"tag name","value":"html"}]
1679574461938   Marionette      DEBUG   0 <- [1,64,null,{"value":{"element-6066-11e4-a52e-4f735466cecf":"108b4138-1087-4dea-8399-89bc6214e3ec"}}]
1679574461939   webdriver::server       DEBUG   <- 200 OK {"value":{"element-6066-11e4-a52e-4f735466cecf":"108b4138-1087-4dea-8399-89bc6214e3ec"}}
1679574461948   webdriver::server       DEBUG   -> POST /session/3c576275-fd0b-4b2d-9bde-f11c8e7a51f8/url {
  "url": "about:blank"
}
1679574461949   Marionette      DEBUG   0 -> [0,65,"WebDriver:Navigate",{"url":"about:blank"}]
1679574461949   RemoteAgent     TRACE   [11] ProgressListener Start: expectNavigation=false resolveWhenStarted=true unloadTimeout=200 waitForExplicitStart=true
1679574461949   RemoteAgent     TRACE   [11] ProgressListener Setting unload timer (200ms)
1679574461953   RemoteAgent     TRACE   [11] ProgressListener Check loading state: isStart=1 isStop=0
1679574461953   RemoteAgent     TRACE   [11] ProgressListener state=start: about:blank
1679574461953   RemoteAgent     TRACE   [11] ProgressListener Cleared the unload timer
1679574461953   RemoteAgent     TRACE   [11] ProgressListener Request to stop listening when navigation started
1679574461953   RemoteAgent     TRACE   [11] ProgressListener Stop: has error=false
1679574461954   Marionette      DEBUG   0 <- [1,65,null,{"value":null}]
1679574461955   webdriver::server       DEBUG   <- 200 OK {"value":null}
1679574461959   Marionette      TRACE   [8] MarionetteCommands actor destroyed for window id 4294967298
1679574461965   webdriver::server       DEBUG   -> POST /session/3c576275-fd0b-4b2d-9bde-f11c8e7a51f8/url {
  "url": "http:\u002f\u002fhost.docker.internal:8080\u002fgrid\u002fSimpleGrid.do?ModuleName=ADMIN_PROGRAM"
}
1679574461966   Marionette      TRACE   Remoteness change detected. Set new top-level browsing context to 11
1679574461971   Marionette      TRACE   [11] MarionetteCommands actor destroyed for window id 6442450947
1679574461976   Marionette      DEBUG   0 -> [0,66,"WebDriver:Navigate",{"url":"http://host.docker.internal:8080/grid/SimpleGrid.do?ModuleName=ADMIN_PROGRAM"}]
1679574461977   RemoteAgent     TRACE   [11] ProgressListener Start: expectNavigation=false resolveWhenStarted=true unloadTimeout=200 waitForExplicitStart=true
1679574461977   RemoteAgent     TRACE   [11] ProgressListener Document already loading about:blank
1679574461978   RemoteAgent     TRACE   [11] ProgressListener Setting unload timer (200ms)
1679574461988   Marionette      WARN    Ignoring event 'DOMContentLoaded' because document has an invalid readyState of 'complete'.
1679574462003   RemoteAgent     TRACE   [11] ProgressListener Check loading state: isStart=0 isStop=16
1679574462011   RemoteAgent     TRACE   [11] ProgressListener Check loading state: isStart=1 isStop=0
1679574462011   RemoteAgent     TRACE   [11] ProgressListener state=start: http://host.docker.internal:8080/grid/SimpleGrid.do?ModuleName=ADMIN_PROGRAM
1679574462011   RemoteAgent     TRACE   [11] ProgressListener Cleared the unload timer
1679574462011   RemoteAgent     TRACE   [11] ProgressListener Request to stop listening when navigation started
1679574462011   RemoteAgent     TRACE   [11] ProgressListener Stop: has error=false
1679574462014   Marionette      DEBUG   0 <- [1,66,null,{"value":null}]
1679574462021   webdriver::server       DEBUG   <- 200 OK {"value":null}
1679574462029   webdriver::server       DEBUG   -> GET /session/3c576275-fd0b-4b2d-9bde-f11c8e7a51f8/element/108b4138-1087-4dea-8399-89bc6214e3ec/enabled 
1679574462030   Marionette      DEBUG   0 -> [0,67,"WebDriver:IsElementEnabled",{"id":"108b4138-1087-4dea-8399-89bc6214e3ec"}]
1679574462035   RemoteAgent     TRACE   WebDriverProcessData actor created for PID 1515
1679574462036   Marionette      TRACE   [11] MarionetteCommands actor created for window id 8589934593
1679574462044   Marionette      DEBUG   0 <- [1,67,{"error":"no such element","message":"Unknown element with id 108b4138-1087-4dea-8399-89bc6214e3ec","stacktrace":"RemoteError@chrome://remote/content/shared/RemoteError.sys.mjs:8:8\nWebDriverError@chrome://remote/content/shared/webdriver/Er ... arionette/json.sys.mjs:217:10\nreceiveMessage@chrome://remote/content/marionette/actors/MarionetteCommandsChild.sys.mjs:85:30\n"},null]
1679574462044   webdriver::server       DEBUG   <- 404 Not Found {"value":{"error":"no such element","message":"Unknown element with id 108b4138-1087-4dea-8399-89bc6214e3ec" ... ette/json.sys.mjs:217:10\nreceiveMessage@chrome://remote/content/marionette/actors/MarionetteCommandsChild.sys.mjs:85:30\n"}}

That means that the top-level browsing context got replaced as what I have assumed in my last comment, and that indeed causes the node cache to be empty at this point and not knowing about the already retrieved element. So this can basically happen for each and every navigation.

Summary: ExpectedConditions.stalenessOf broken in firefox 110.0 → Element reference is unknown and not stale after a remoteness change since Firefox 110.0

As discussed today getting different states for element references after a navigation might be confusing. James wanted to have a look to understand what's going on and what we can do.

Flags: needinfo?(james)
Whiteboard: [webdriver:triage]

So I think the key thing here is that in https://w3c.github.io/webdriver/#dfn-get-a-known-element we first call "node reference is known" which checks if the element id has been used on the current navigable. That means if we ever saw the element in the navigable, irrespective of remoteness change, then we should send Stale Element Reference. We only send No Such Element if the id wasn't ever used in the navigable, or the id doesn't point at an element.

Flags: needinfo?(james)

Discussed with James on Matrix. So we basically missed the navigable part when we did the refactoring on bug 1692468. Given that we do not support Navigable yet there is no way to store the node cache information in the content process alone. Instead we also need a map in the parent process which at least holds the element or shadow root references in relation to their containing browsing context. In terms of top-level browsing contexts this would be the custom tab id.

To restore the checks we would have to partially revert some code that existed before my refactoring. We should get this fixed sooner than later even we didn't see much reports for it yet.

Points: --- → 5
Priority: -- → P1
Whiteboard: [webdriver:m7]

The severity field is not set for this bug.
:whimboo, could you have a look please?

For more information, please visit auto_nag documentation.

Flags: needinfo?(hskupin)
Severity: -- → S3
Flags: needinfo?(hskupin)
Priority: P1 → P2

I created the upstream PR first to correct the broken tests: https://github.com/web-platform-tests/wpt/pull/39646.

In parallel I'm already going to start working on a fix for Firefox.

Assignee: nobody → hskupin
Status: NEW → ASSIGNED
Priority: P2 → P1
Depends on: 1829413
Attachment #9332369 - Attachment description: Bug 1822466 - [marionette] Add global WebDriverSession map as singleton and initialize parent actor with WebDriver session id. → Bug 1822466 - [marionette] Add global WebDriverSession map as singleton and allow access for Marionette command parent actor.
Pushed by hskupin@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/b89608b3c46a
[wdspec] Correct stale element errors for cross-origin navigations #39646 r=webdriver-reviewers,jdescottes
https://hg.mozilla.org/integration/autoland/rev/2c6d325cb248
[marionette] Add global WebDriverSession map as singleton and allow access for Marionette command  parent actor. r=webdriver-reviewers,jdescottes
https://hg.mozilla.org/integration/autoland/rev/d94b6d6cd713
[marionette] Move WebElement and ShadowRoot serialization/deserialization helpers to Marionette command client actor. r=webdriver-reviewers,jdescottes
https://hg.mozilla.org/integration/autoland/rev/256239106623
[marionette] Check Navigable's seen nodes map for known nodes. r=webdriver-reviewers,jdescottes
Whiteboard: [webdriver:m7] → [webdriver:m7], [wptsync upstream]

Note that Bug 1822466 - [wdspec] Correct stale element errors for cross-origin navigations has already been landed upstream so no upstream sync is needed.

The patch landed in nightly and beta is affected.
:whimboo, is this bug important enough to require an uplift?

  • If yes, please nominate the patch for beta approval.
  • If no, please set status-firefox114 to wontfix.

For more information, please visit BugBot documentation.

Flags: needinfo?(hskupin)

The patches are too risky to uplift to beta and need some follow-up work as well to eg. improve the performance. As such we don't want to fix it in Firefox 114.

Flags: needinfo?(hskupin)

Backed out for causing regressions in the upstream wpt tests.

Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Target Milestone: 115 Branch → ---
Attached image Data flow diagram

After the backout of the above patches we have to find another solution to get the following scenarios working:

  1. An element is retrieved in the child actor and needs to be send back to the client. Therefore a node reference will be created first in the node cache of the child process which identifies the node and is alive as long as the content process exists. The parent actor which receives the serialized data needs to know the WebReference id and store it in the SeenNodes cache for the given navigable (browsing context or browser for top-level).

  2. A formerly retrieved element is used by a command. Hereby the given WebReference needs to be checked if it is known within the SeenNodes cache for the given navigable. Without extra processing the command's payload this check would have to be done when deserializing the data in the child actor.

  3. (Bug 1274251) A window or frame needs to be serialized in the child actor. While for frames we make use of the browsing context id, for top-level browsing contexts we need a unique id given that the BC can be replaced by eg. cross-origin navigations. As such the parent process holds a singleton called browserUniqueIds. We cannot send it along with the command to the child actor given that it could change at any time and Ids might not be up-to-date anymore. Also if no ID exists yet one needs to be created.

  1. (Bug 1274251) A window or frame reference as formerly serialized is used by a command. For top-level browsing contexts the ID needs to be converted back to the associated browsing context's id. Without extra processing of the command's payload this check might have to be done in the child actor.

Generally we would like to avoid any extra IPC calls if possible which would make the methods async and also have a performance penalty.

(In reply to Henrik Skupin [:whimboo][⌚️UTC+1] from comment #24)

  1. An element is retrieved in the child actor and needs to be send back to the client. Therefore a node reference will be created first in the node cache of the child process which identifies the node and is alive as long as the content process exists. The parent actor which receives the serialized data needs to know the WebReference id and store it in the SeenNodes cache for the given navigable (browsing context or browser for top-level).

We agreed on that additionally returning a list of seen nodeIds to the parent actor is a good way to get the seenNodes map for the current WebDriver session updated.

  1. A formerly retrieved element is used by a command. Hereby the given WebReference needs to be checked if it is known within the SeenNodes cache for the given navigable. Without extra processing the command's payload this check would have to be done when deserializing the data in the child actor.

This one is more complex and there are two approaches.

a) We pass the seenNodes list of the current navigable down to the child actor and through the full stack of methods. This has the negative effect that a lot of data might have to be transferred in case of SPA pages.

This approach can be seen with: https://treeherder.mozilla.org/jobs?repo=try&revision=d15508bcd2f458b6336da1aff695712f22e37ff6

b) We let the child actor try finding the node with the given reference within the actual browsing context, and if it is not available only raise a no such node error. Then in the parent actor we need to intercept this error type and check if the nodeId might have been known but the top-level browsing context has been replaced. If the latter is the case a stale element error can be raised.

To get this implemented some more work is needed. I'll try to get this implemented and come up with a proposed solution soon.

  1. (Bug 1274251) A window or frame needs to be serialized in the child actor. While for frames we make use of the browsing context id, for top-level browsing contexts we need a unique id given that the BC can be replaced by eg. cross-origin navigations. As such the parent process holds a singleton called browserUniqueIds. We cannot send it along with the command to the child actor given that it could change at any time and Ids might not be up-to-date anymore. Also if no ID exists yet one needs to be created.
  1. (Bug 1274251) A window or frame reference as formerly serialized is used by a command. For top-level browsing contexts the ID needs to be converted back to the associated browsing context's id. Without extra processing of the command's payload this check might have to be done in the child actor.

These two scenarios are more complex and we would like to defer to a later time over on bug 1274251.

Depends on: 1834971

This is basically the other option for Marionette to intercept errors and rewrite them:
https://hg.mozilla.org/try/rev/b36594fd3f9647bc8cac833bf73f45cf698f6b22#l2.22

Attachment #9332369 - Attachment description: Bug 1822466 - [marionette] Add global WebDriverSession map as singleton and allow access for Marionette command parent actor. → WIP: Bug 1822466 - [marionette] Add global WebDriverSession map as singleton and allow access for Marionette command parent actor.
Attachment #9332369 - Attachment description: WIP: Bug 1822466 - [marionette] Add global WebDriverSession map as singleton and allow access for Marionette command parent actor. → Bug 1822466 - [marionette] Add global WebDriverSession map as singleton and allow access for Marionette command parent actor.
Attachment #9332370 - Attachment is obsolete: true
Attachment #9332368 - Attachment is obsolete: true
Pushed by hskupin@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/0d9194ca2da0
[wdspec] Enhanced serialization and deserialization tests. r=webdriver-reviewers,jdescottes
https://hg.mozilla.org/integration/autoland/rev/d6bcbd39ecf2
[marionette] Add global WebDriverSession map as singleton and allow access for Marionette command parent actor. r=webdriver-reviewers,jdescottes
https://hg.mozilla.org/integration/autoland/rev/796d1aa44633
[marionette] Check Navigable's seen nodes map for known nodes. r=webdriver-reviewers,jdescottes
https://hg.mozilla.org/integration/autoland/rev/06767279ae13
[wdspec] Improve stale element and detached shadow root tests. r=webdriver-reviewers,jdescottes
Failed to create upstream wpt PR due to merge conflicts. This requires fixup from a wpt sync admin.
Whiteboard: [webdriver:m7], [wptsync upstream] → [webdriver:m7], [wptsync upstream error]
Created web-platform-tests PR https://github.com/web-platform-tests/wpt/pull/40495 for changes under testing/web-platform/tests
Whiteboard: [webdriver:m7], [wptsync upstream error] → [webdriver:m7], [wptsync upstream]

Since nightly and release are affected, beta will likely be affected too.
For more information, please visit BugBot documentation.

We are not going to fix this regression for 115 due to the risks being involved. It's safer to let it ride the trains.

Regressions: 1769430
Regressions: 1540367
Whiteboard: [webdriver:m7], [wptsync upstream] → [webdriver:m7], [wptsync upstream][webdriver:relnote]
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: