Open Bug 1925932 Opened 5 days ago Updated 3 days ago

Protocol error (<command>): "no such frame No window found for BrowsingContext with id"

Categories

(Remote Protocol :: WebDriver BiDi, defect, P3)

defect
Points:
3

Tracking

(Not tracked)

People

(Reporter: whimboo, Unassigned, NeedInfo)

References

(Blocks 2 open bugs)

Details

(Whiteboard: [webdriver:m13])

Several Puppeteer Wd tests can fail because the underlying browsing context is gone. Here an example:

https://treeherder.mozilla.org/logviewer?job_id=478419576&repo=autoland&lineNumber=120393-120395

[task 2024-10-15T21:42:55.391Z] TEST-UNEXPECTED-FAIL | waittask specs Frame.waitForSelector xpath should throw when frame is detached (waittask.spec.js) | expected PASS
[task 2024-10-15T21:42:55.391Z] TEST-INFO took 206ms
[task 2024-10-15T21:42:55.392Z] PID 801 | ["fail",{"title":"should throw when frame is detached","fullTitle":"waittask specs Frame.waitForSelector xpath should throw when frame is detached","file":"/builds/worker/checkouts/gecko/remote/test/puppeteer/test/build/waittask.spec.js","duration":205,"currentRetry":0,"err":"\"Waiting for selector `.//*[@class=\"box\"]` failed: Protocol error (script.evaluate): no such frame No window found for BrowsingContext with id 994284929025 RemoteError@chrome://remote/content/shared/RemoteError.sys.mjs:8:8\nWebDriverError@chrome://remote/content/shared/webdriver/Errors.sys.mjs:193:5\nNoSuchFrameError@chrome://remote/content/shared/webdriver/Errors.sys.mjs:572:5\n#getBrowsingContext@chrome://remote/content/webdriver-bidi/modules/root/script.sys.mjs:776:13\n#getContextFromTarget@chrome://remote/content/webdriver-bidi/modules/root/script.sys.mjs:786:38\nevaluate@chrome://remote/content/webdriver-bidi/modules/root/script.sys.mjs:476:53\nhandleCommand@chrome://remote/content/shared/messagehandler/MessageHandler.sys.mjs:257:33\nexecute@chrome://remote/content/shared/webdriver/Session.sys.mjs:390:32\nonPacket@chrome://remote/content/webdriver-bidi/WebDriverBiDiConnection.sys.mjs:228:37\nonMessage@chrome://remote/content/server/WebSocketTransport.sys.mjs:127:18\nhandleEvent@chrome://remote/content/server/WebSocketTransport.sys.mjs:109:14\n\" didn't contain any of the strings [\"waitForFunction failed: frame got detached.\",\"Browsing context already closed.\"]","stack":"Error: \"Waiting for selector `.//*[@class=\"box\"]` failed: Protocol error (script.evaluate): no such frame No window found for BrowsingContext with id 994284929025 RemoteError@chrome://remote/content/shared/RemoteError.sys.mjs:8:8\nWebDriverError@chrome://remote/content/shared/webdriver/Errors.sys.mjs:193:5\nNoSuchFrameError@chrome://remote/content/shared/webdriver/Errors.sys.mjs:572:5\n#getBrowsingContext@chrome://remote/content/webdriver-bidi/modules/root/script.sys.mjs:776:13\n#getContextFromTarget@chrome://remote/content/webdriver-bidi/modules/root/script.sys.mjs:786:38\nevaluate@chrome://remote/content/webdriver-bidi/modules/root/script.sys.mjs:476:53\nhandleCommand@chrome://remote/content/shared/messagehandler/MessageHandler.sys.mjs:257:33\nexecute@chrome://remote/content/shared/webdriver/Session.sys.mjs:390:32\nonPacket@chrome://remote/content/webdriver-bidi/WebDriverBiDiConnection.sys.mjs:228:37\nonMessage@chrome://remote/content/server/WebSocketTransport.sys.mjs:127:18\nhandleEvent@chrome://remote/content/server/WebSocketTransport.sys.mjs:109:14\n\" didn't contain any of the strings [\"waitForFunction failed: frame got detached.\",\"Browsing context already closed.\"]\n    at Context.<anonymous> (/builds/worker/checkouts/gecko/remote/test/puppeteer/test/src/waittask.spec.ts:771:36)"}]

Note that in this case we see it for script.evaluate and it fails when we are trying to get the context from the given target argument:
https://searchfox.org/mozilla-central/rev/ec342a3d481d9ac3324d1041e05eefa6b61392d2/remote/webdriver-bidi/modules/root/script.sys.mjs#476

The Puppeteer test fails in these specific lines:

        const waitPromise = frame
          .waitForSelector('xpath/.//*[@class="box"]')
          .catch(error => {
            return (waitError = error);
          });
        await detachFrame(page, 'frame1');
        await waitPromise;
        expect(waitError).toBeTruthy();
        expect(waitError?.message).atLeastOneToContain([
          'waitForFunction failed: frame got detached.',
          'Browsing context already closed.',
        ]);

It's not clear to me right now (without digging into this issue) what exactly is causing this failure. Lets block bugs for specific tests that see this problem.

Summary: Intermittent Pup Wd | "Protocol error (<command>): no such frame No window found for BrowsingContext with id" → Protocol error (<command>): "no such frame No window found for BrowsingContext with id"
Severity: -- → S3
Points: --- → 3
Priority: -- → P3
Whiteboard: [webdriver:m13]

The issue is most likely related to the retry logic.
We can see:

RootTransport retrying command script.callFunctionDeclaration for browsing context 994284929025, attempt: 1.

And then the next attempt to use callFunctionDeclaration fails because the actual frame we target is gone. Then we fail with the following error:

No window found for BrowsingContext with id 994284929025

But puppeteer is actually relying on AbortError for their WaitTask mechanism:

https://searchfox.org/mozilla-central/rev/6ec81d7b1d2f60c18f1de400ac9e8ada1f905352/remote/test/puppeteer/packages/puppeteer-core/src/common/WaitTask.ts#228-236

// Errors coming from WebDriver BiDi. TODO: Adjust messages after
// https://github.com/w3c/webdriver-bidi/issues/540 is resolved.
if (
  error.message.includes(
    "AbortError: Actor 'MessageHandlerFrame' destroyed"
  )
) {
  return;
}

The code above normally allows Puppeteer to ignore our error and to retry later on. And for this test, normally in the meantime the browsingContext.destroyed event should be emitted for the iframe, which should lead to terminate all WaitTasks for the corresponding realm, which will lead to the "Browsing context already closed" error expected by the test.

But here we retry immediately, before the destroyed event has been emitted, and we fail, this time with "no such frame". This error is not recognized by Puppeteer, and terminates the WaitTask (which is fine) but with an unexpected message.

Oh I thought we do not retry for script.evaluate and script.callFunction because there is the chance that the document get modified. Has that changed at some point?

Even through we probably have to check if the browsing context just gets replaced, or actually destroyed. The latter would mean that it's gone and we should not try to running the command again.

I am not sure if the issue actually comes from the retry logic, or if we just hit a race condition in the WaitTask logic from puppeteer more frequently because of it. I also don't completely understand how the script commands sent by Puppeteer work together here.

I can see mainly two commands: a script.callFunction and a script.evaluate which usually fail with "no such frame". However in successful runs, the WaitTask used by puppeteer most likely gets terminated when the browsing context / realm gets destroyed, and is rejected with a "Browsing context already closed". In failing runs, we send script.evaluate before we received "browsingContext.contextDestroyed" and I imagine the failure from script.evaluate is what terminates the WaitTask, which then has an unexpected error message. What is puzzling for me is that even in successful runs, we still have this script.evaluate call, which fails with the same error.

My theory so far is that when we hit an AbortError on a script.callFunction, Firefox will attempt to resend the command immediately, and we actually don't receive the result for this retried command during the test (that's id 17014 in the logs from the summary). I imagine that without Firefox retrying logic, the WaitTask might be waiting a bit before retrying in case of an AbortError? But I don't understand the relationship between this script.callFunction and the script.evaluate which seems to terminate the WaitTask.

Alex, can you help me understand how the WaitTask is using both commands, and if there is any dependency between the two?

Flags: needinfo?(alexrudenko)

I wonder how easy it is to reproduce. It's most likely not happening that often. Otherwise I would have proposed to set remote.retry-on-abort to false to check with the former behavior, and if it makes a difference (just to exclude the retry behavior or not).

(In reply to Henrik Skupin [:whimboo][⌚️UTC+2] from comment #2)

Oh I thought we do not retry for script.evaluate and script.callFunction because there is the chance that the document get modified. Has that changed at some point?

Even through we probably have to check if the browsing context just gets replaced, or actually destroyed. The latter would mean that it's gone and we should not try to running the command again.

We do, based on the preference added in Bug 1854942 we retry all commands by default, unless they specify retryOnAbort: false. It's only for Marionette that we avoid retrying executeScript.

Maybe we should pass retryOnAbort: false for those. And I agree that if the browsingContext got destroyed, we probably should stop retrying, and maybe we should even fail with the original AbortError we had (because Puppeteer is relying on this error message a bit already).

But compared to my analysis yesterday, I was wrong when I was saying that the command failed with "no such frame" after retrying: I actually don't see the result for this command in the logs. After checking the logs in more details we actually hit the await lazy.waitForCurrentWindowGlobal(webProgress.browsingContext); which I added recently, as we can see the following log:

currentWindowGlobal was not available for Browsing Context with id: 994284929025 after 100 ms

This happens after the test failed, so basically, this means we hold this command for 100ms before rejecting: https://treeherder.mozilla.org/logviewer?job_id=478419576&repo=autoland&lineNumber=120424-120425

But it's still not clear for me if the retry has anything to do with the issue. The root cause of the failure seems to be that a script.evaluate fails and terminates the WaitTask before the browsingContext.contextDestroyed event was received. Let's wait for feedback from Alex to see if there is a relationship between the callFunction and evaluate calls, but if not this might just be a random intermittent.

Nevertheless, based on what we have seen here I think we should improve a few things:

  • avoid retrying on discarded contexts (and avoid waiting for currentWindowGlobal on discarded contexts)
  • avoid retrying callFunctionDeclaration and evaluateExpression

I will file dedicated bugs for that.

(In reply to Henrik Skupin [:whimboo][⌚️UTC+2] from comment #4)

I wonder how easy it is to reproduce. It's most likely not happening that often. Otherwise I would have proposed to set remote.retry-on-abort to false to check with the former behavior, and if it makes a difference (just to exclude the retry behavior or not).

I couldn't trigger it locally at least.

You need to log in before you can comment on or make changes to this bug.