Closed Bug 1816538 Opened 1 year ago Closed 10 months ago

Intermittent Wd | webdriver.error.JavascriptErrorException: javascript error (500): JavaScriptError: Document was unloaded

Categories

(Remote Protocol :: Marionette, defect, P5)

All
Windows
defect

Tracking

(firefox114 wontfix, firefox115 wontfix, firefox116 fixed)

RESOLVED FIXED
116 Branch
Tracking Status
firefox114 --- wontfix
firefox115 --- wontfix
firefox116 --- fixed

People

(Reporter: intermittent-bug-filer, Assigned: whimboo)

References

()

Details

(Keywords: intermittent-failure)

Attachments

(1 file)

Filed by: nbeleuzu [at] mozilla.com
Parsed log: https://treeherder.mozilla.org/logviewer?job_id=405489451&repo=mozilla-beta
Full log: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/B64SY6AXS82vvPvUY16AQg/runs/0/artifacts/public/logs/live_backing.log


[task 2023-02-13T21:06:49.207Z] 21:06:49     INFO - TEST-PASS | /webdriver/tests/close_window/user_prompts.py | test_accept_and_notify[capabilities0-confirm-True] 
[task 2023-02-13T21:06:49.207Z] 21:06:49     INFO - TEST-UNEXPECTED-FAIL | /webdriver/tests/close_window/user_prompts.py | test_accept_and_notify[capabilities0-prompt-] - webdriver.error.JavascriptErrorException: javascript error (500): JavaScriptError: Document was unloaded
[task 2023-02-13T21:06:49.207Z] 21:06:49     INFO - check_user_prompt_closed_with_exception = <function check_user_prompt_closed_with_exception.<locals>.check_user_prompt_closed_with_exception at 0x7ff069db3050>
[task 2023-02-13T21:06:49.207Z] 21:06:49     INFO - dialog_type = 'prompt', retval = ''
[task 2023-02-13T21:06:49.207Z] 21:06:49     INFO - 
[task 2023-02-13T21:06:49.207Z] 21:06:49     INFO -     @pytest.mark.capabilities({"unhandledPromptBehavior": "accept and notify"})
[task 2023-02-13T21:06:49.207Z] 21:06:49     INFO -     @pytest.mark.parametrize("dialog_type, retval", [
[task 2023-02-13T21:06:49.207Z] 21:06:49     INFO -         ("alert", None),
[task 2023-02-13T21:06:49.207Z] 21:06:49     INFO -         ("confirm", True),
[task 2023-02-13T21:06:49.207Z] 21:06:49     INFO -         ("prompt", ""),
[task 2023-02-13T21:06:49.207Z] 21:06:49     INFO -     ])
[task 2023-02-13T21:06:49.207Z] 21:06:49     INFO -     def test_accept_and_notify(check_user_prompt_closed_with_exception, dialog_type, retval):
[task 2023-02-13T21:06:49.207Z] 21:06:49     INFO - >       check_user_prompt_closed_with_exception(dialog_type, retval)
[task 2023-02-13T21:06:49.207Z] 21:06:49     INFO - 
[task 2023-02-13T21:06:49.207Z] 21:06:49     INFO - check_user_prompt_closed_with_exception = <function check_user_prompt_closed_with_exception.<locals>.check_user_prompt_closed_with_exception at 0x7ff069db3050>
[task 2023-02-13T21:06:49.208Z] 21:06:49     INFO - dialog_type = 'prompt'
[task 2023-02-13T21:06:49.208Z] 21:06:49     INFO - retval     = ''
[task 2023-02-13T21:06:49.208Z] 21:06:49     INFO - 
[task 2023-02-13T21:06:49.208Z] 21:06:49     INFO - tests/web-platform/tests/webdriver/tests/close_window/user_prompts.py:87: 

Here the details of the failure:

https://treeherder.mozilla.org/logviewer?job_id=405489451&repo=mozilla-beta&lineNumber=7668-7678

[task 2023-02-13T21:06:34.490Z] 21:06:34     INFO - PID 1336 | 1676322394489	webdriver::server	DEBUG	-> POST /session/2a8055a3-9a49-48aa-93dc-a0ff6896c26f/execute/async {"script": "\n            let dialog_type = arguments[0]; ... _return_value = window[dialog_type](text);\n              }\n            }, 0);\n            ", "args": ["prompt", "prompt"]}
[task 2023-02-13T21:06:34.491Z] 21:06:34     INFO - PID 1336 | 1676322394490	Marionette	DEBUG	0 -> [0,64,"WebDriver:ExecuteAsyncScript",{"args":["prompt","prompt"],"script":"\n            let dialog_type = arguments[0];\n            let text = arguments[1];\n\n            setTimeout(function() {\n       ... lse {\n                window.dialog_return_value = window[dialog_type](text);\n              }\n            }, 0);\n            "}]
[task 2023-02-13T21:06:34.648Z] 21:06:34     INFO - PID 1336 | 1676322394647	RemoteAgent	TRACE	WebDriverProcessData actor created for PID 1386
[task 2023-02-13T21:06:34.649Z] 21:06:34     INFO - PID 1336 | 1676322394648	Marionette	TRACE	[23] MarionetteCommands actor created for window id 50
[task 2023-02-13T21:06:34.659Z] 21:06:34     INFO - PID 1336 | JavaScript error: resource://gre/actors/BrowserElementChild.jsm, line 22: TypeError: can't access property "windowUtils", this.contentWindow is null
[task 2023-02-13T21:06:34.663Z] 21:06:34     INFO - PID 1336 | 1676322394662	Marionette	DEBUG	0 <- [1,64,{"error":"javascript error","message":"JavaScriptError: Document was unloaded","stacktrace":"RemoteError@chrome://remote/content/shared/RemoteError.sys.mjs:8:8\nWebDriverError@chrome://remote/content/shared/webdriver/Er ... 496:10\nprompt@resource://gre/modules/Prompter.sys.mjs:1288:36\n@moz-nullprincipal:{eacbdbbf-4826-4a99-884f-4a90a3819909}:8:65\n"},null]
[task 2023-02-13T21:06:34.664Z] 21:06:34     INFO - PID 1336 | 1676322394663	Marionette	TRACE	[23] MarionetteCommands actor destroyed for window id 50
[task 2023-02-13T21:06:34.664Z] 21:06:34     INFO - PID 1336 | JavaScript error: moz-nullprincipal:{eacbdbbf-4826-4a99-884f-4a90a3819909}, line 8: SecurityError: Permission denied to access property "dialog_return_value" on cross-origin object
[task 2023-02-13T21:06:34.665Z] 21:06:34     INFO - PID 1336 | JavaScript error: resource://gre/modules/Prompter.sys.mjs, line 1042: AbortError: Actor 'Prompt' destroyed before query 'Prompt:Open' was resolved
[task 2023-02-13T21:06:34.666Z] 21:06:34     INFO - PID 1336 | 1676322394666	Marionette	TRACE	Received event DOMModalDialogClosed
[task 2023-02-13T21:06:34.670Z] 21:06:34     INFO - PID 1336 | 1676322394669	webdriver::server	DEBUG	<- 500 Internal Server Error {"value":{"error":"javascript error","message":"JavaScriptError: Document was unloaded","stacktr ... \nprompt@resource://gre/modules/Prompter.sys.mjs:1288:36\n@moz-nullprincipal:{eacbdbbf-4826-4a99-884f-4a90a3819909}:8:65\n"}}

Before we return with a failure I can see the following error:

resource://gre/actors/BrowserElementChild.jsm, line 22: TypeError: can't access property "windowUtils", this.contentWindow is null

Lets observe for now if it happens again.

Moving bug to Remote Protocol::Marionette component per bug 1815831.
Component: geckodriver → Marionette
Product: Testing → Remote Protocol

I cannot se the failure again. Going to close as incomplete for now.

Status: NEW → RESOLVED
Closed: 1 year ago
Resolution: --- → INCOMPLETE
Status: RESOLVED → REOPENED
Resolution: INCOMPLETE → ---
Summary: Intermittent [Tier-2] /webdriver/tests/close_window/user_prompts.py | test_accept_and_notify[capabilities0-prompt-] - webdriver.error.JavascriptErrorException: javascript error (500): JavaScriptError: Document was unloaded → Intermittent Wd | webdriver.error.JavascriptErrorException: javascript error (500): JavaScriptError: Document was unloaded
Duplicate of this bug: 1832814

The JS error resource://gre/actors/BrowserElementChild.jsm, line 22: TypeError: can't access property "windowUtils", this.contentWindow is null is still present and I've filed bug 1832494 to get it fixed.

Depends on: 1832494

This failure increased quite a lot of the last week and it looks like it is Windows only. In cases when this failure happens we also see a ton of failing tests for a single test job.

I can also see the following failure quite a lot:

[task 2023-05-22T00:02:43.422Z] 00:02:43     INFO - PID 8524 | JavaScript error: moz-nullprincipal:{b1efa709-135a-45de-8b76-99d550bb5da2}, line 10: SecurityError: Permission denied to access property "dialog_return_value" on cross-origin object
[task 2023-05-22T00:02:43.423Z] 00:02:43     INFO - PID 8524 | JavaScript error: resource://gre/modules/Prompter.sys.mjs, line 1042: AbortError: Actor 'Prompt' destroyed before query 'Prompt:Open' was resolved

It all seems to have been started on between May 9th (try builds for staging release 115) and 10th (autoland and central).

It doesn't make sense that beta staging fails first before autoland / central. As such I wonder if there was an update of the Windows workers at that time. Mark do you know?

Status: REOPENED → NEW
Flags: needinfo?(mcornmesser)
OS: Unspecified → Windows
Hardware: Unspecified → All
See Also: → 1834217
Duplicate of this bug: 1834186
See Also: → 1832167
No longer duplicate of this bug: 1834186

All the document was unloaded errors seem to follow a missed navigation

[task 2023-05-21T23:46:55.408Z] 23:46:55 INFO - PID 3812 | 1684712815407 RemoteAgent TRACE [23] ProgressListener Start: expectNavigation=false resolveWhenStarted=false unloadTimeout=200 waitForExplicitStart=false
[task 2023-05-21T23:46:55.409Z] 23:46:55 INFO - PID 3812 | 1684712815407 RemoteAgent TRACE [23] ProgressListener Setting unload timer (200ms)
[task 2023-05-21T23:46:55.616Z] 23:46:55 INFO - PID 3812 | 1684712815616 RemoteAgent TRACE [23] ProgressListener No navigation detected: about:blank
[task 2023-05-21T23:46:55.618Z] 23:46:55 INFO - PID 3812 | 1684712815616 RemoteAgent TRACE [23] ProgressListener Stop: has error=false

(and even on debug builds with the x8 = 1600ms timeout...)

Will try to increase the timeout to check if it has any impact on the failure frequency.

Here's a detailed and commented log with the webprogress info:

First we start the progress listener

[task 2023-05-22T10:03:06.232Z] 10:03:06 INFO - PID 5564 | 1684749786231 RemoteAgent TRACE [23] ProgressListener Start: expectNavigation=false resolveWhenStarted=false unloadTimeout=1600 waitForExplicitStart=false
[task 2023-05-22T10:03:06.233Z] 10:03:06 INFO - PID 5564 | 1684749786231 RemoteAgent TRACE [23] ProgressListener Setting unload timer (1600ms)
[task 2023-05-22T10:03:07.839Z] 10:03:07 INFO - PID 5564 | 1684749787838 RemoteAgent TRACE [23] ProgressListener No navigation detected: about:blank
[task 2023-05-22T10:03:07.840Z] 10:03:07 INFO - PID 5564 | 1684749787838 RemoteAgent TRACE [23] ProgressListener Stop: has error=false

And we just timeout after 1.6 seconds, nothing happened. No WebProgress log here. Then we go on with the test.

[task 2023-05-22T10:03:07.841Z] 10:03:07 INFO - PID 5564 | 1684749787839 Marionette DEBUG 0 <- [1,66,null,{"handle":"91e41a91-01c7-4da4-85fb-7f1ae5b66e0e","type":"tab"}]
[task 2023-05-22T10:03:07.841Z] 10:03:07 INFO - PID 5564 | 1684749787839 webdriver::server DEBUG <- 200 OK {"value":{"handle":"91e41a91-01c7-4da4-85fb-7f1ae5b66e0e","type":"tab"}}
[task 2023-05-22T10:03:07.842Z] 10:03:07 INFO - PID 5564 | 1684749787841 webdriver::server DEBUG -> POST /session/6a558360-c7c2-4987-9397-4c546b2e8fa1/window {"handle": "91e41a91-01c7-4da4-85fb-7f1ae5b66e0e"}
[task 2023-05-22T10:03:07.843Z] 10:03:07 INFO - PID 5564 | 1684749787842 Marionette DEBUG 0 -> [0,67,"WebDriver:SwitchToWindow",{"handle":"91e41a91-01c7-4da4-85fb-7f1ae5b66e0e"}]
[task 2023-05-22T10:03:07.854Z] 10:03:07 INFO - PID 5564 | 1684749787856 RemoteAgent TRACE Received DOM event TabSelect for [object XULElement]
[task 2023-05-22T10:03:07.871Z] 10:03:07 INFO - PID 5564 | 1684749787870 Marionette DEBUG 0 <- [1,67,null,{"value":null}]
[task 2023-05-22T10:03:07.872Z] 10:03:07 INFO - PID 5564 | 1684749787873 webdriver::server DEBUG <- 200 OK {"value":null}
[task 2023-05-22T10:03:07.875Z] 10:03:07 INFO - PID 5564 | 1684749787874 webdriver::server DEBUG -> POST /session/6a558360-c7c2-4987-9397-4c546b2e8fa1/execute/async {"script": "\n let dialog_type = arguments[0]; ... _return_value = windowdialog_type;\n }\n }, 0);\n ", "args": ["prompt", "prompt"]}
[task 2023-05-22T10:03:07.877Z] 10:03:07 INFO - PID 5564 | 1684749787876 Marionette DEBUG 0 -> [0,68,"WebDriver:ExecuteAsyncScript",{"args":["prompt","prompt"],"script":"\n let dialog_type = arguments[0];\n let text = arguments[1];\n\n setTimeout(function() {\n ... lse {\n window.dialog_return_value = windowdialog_type;\n }\n }, 0);\n "}]

As we are sending an async script, our first webprogress log shows up:

[task 2023-05-22T10:03:07.909Z] 10:03:07 INFO - PID 5564 | [Parent 5844: Main Thread]: I/BCWebProgress OnStateChange({isTopLevel:1, isLoadingDocument:0}, {URI:about:blank, originalURI:about:blank}, STATE_START|STATE_IS_REQUEST|STATE_IS_DOCUMENT|STATE_IS_NETWORK|STATE_IS_WINDOW, NS_OK) on {top:1, id:17, url:about:blank}
[task 2023-05-22T10:03:07.911Z] 10:03:07 INFO - PID 5564 | [Parent 5844: Main Thread]: I/BCWebProgress OnStateChange({isTopLevel:1, isLoadingDocument:1}, {URI:about:blank, originalURI:about:blank}, STATE_STOP|STATE_IS_NETWORK|STATE_IS_WINDOW, NS_BINDING_ABORTED) on {top:1, id:17, url:about:blank}
[task 2023-05-22T10:03:07.918Z] 10:03:07 INFO - PID 5564 | [Parent 5844: Main Thread]: I/BCWebProgress OnStateChange({isTopLevel:1, isLoadingDocument:0}, {URI:about:blank, originalURI:about:blank}, STATE_START|STATE_IS_REQUEST|STATE_IS_DOCUMENT|STATE_IS_NETWORK|STATE_IS_WINDOW, NS_OK) on {top:1, id:17, url:about:blank}

Interestingly, this happens 1677ms after we started our navigation timer, so only 77ms after our 1600ms timer expired. I wonder why the navigation was blocked? If this is just a timing/delay issue, or if it was stuck?

And then, we create related actors and we get our error:

[task 2023-05-22T10:03:07.922Z] 10:03:07 INFO - PID 5564 | 1684749787921 RemoteAgent TRACE WebDriverProcessData actor created for PID 7340
[task 2023-05-22T10:03:07.927Z] 10:03:07 INFO - PID 5564 | 1684749787926 Marionette TRACE [23] MarionetteCommands actor created for window id 52
[task 2023-05-22T10:03:07.978Z] 10:03:07 INFO - PID 5564 | JavaScript error: resource://gre/actors/BrowserElementChild.sys.mjs, line 19: TypeError: can't access property "windowUtils", this.contentWindow is null
[task 2023-05-22T10:03:08.009Z] 10:03:08 INFO - PID 5564 | 1684749788020 Marionette DEBUG 0 <- [1,68,{"error":"javascript error","message":"JavaScriptError: Document was unloaded","stacktrace":"RemoteError@chrome://remote/content/shared/RemoteError.sys.mjs:8:8\nWebDriverError@chrome://remote/content/shared/webdriver/Er ... 493:10\nprompt@resource://gre/modules/Prompter.sys.mjs:1290:36\n@moz-nullprincipal:{15c9515d-32e1-4481-a0b0-f7236bbd35d2}:8:65\n"},null]

It looks like Azure had changed the patch level of the underlying image and added KB5022497 . We had another test that saw an increase in test failures. Currently we are working on getting a test pool up with mitigation for that update. Once it is ready we can test these tasks as well.

(In reply to Julian Descottes [:jdescottes] from comment #13)

Interestingly, this happens 1677ms after we started our navigation timer, so only 77ms after our 1600ms timer expired. I wonder why the navigation was blocked? If this is just a timing/delay issue, or if it was stuck?

That's very interesting! Thanks a lot for having a look at this Julian. I didn't notice that this was caused by a formerly missed navigation. Maybe if we do not detect one we should force stop whatever the browser is doing?

Regarding the timing this remembers me on bug 1825501 comment 11 where the navigation times out after 3000ms, but then some milliseconds later the actual navigation takes place. Maybe we keep the main thread busy with something? And if yes, why does it happen so infrequently?

(In reply to Mark Cornmesser [:markco] from comment #14)

It looks like Azure had changed the patch level of the underlying image and added KB5022497 . We had another test that saw an increase in test failures. Currently we are working on getting a test pool up with mitigation for that update. Once it is ready we can test these tasks as well.

Thanks for that information and good to know that at this day an update has been applied to the Windows workers. If it's related to that issue I wonder if we could keep an instance running with the update applied so that we could further investigate and fix our implementation? Or would it even be possible to get such a loaner to have better ways of testing as via try?

(In reply to Henrik Skupin [:whimboo][⌚️UTC+1] from comment #15)

(In reply to Julian Descottes [:jdescottes] from comment #13)

Interestingly, this happens 1677ms after we started our navigation timer, so only 77ms after our 1600ms timer expired. I wonder why the navigation was blocked? If this is just a timing/delay issue, or if it was stuck?

That's very interesting! Thanks a lot for having a look at this Julian. I didn't notice that this was caused by a formerly missed navigation. Maybe if we do not detect one we should force stop whatever the browser is doing?

Right! We normally have a flag to expect a navigation, but we don't usually use it when creating new tabs, because we're not 100% sure we will setup the ProgressListener early enough. But I also wonder if this is a valid reason. Maybe it would be worth trying to flip the flag and expect a navigation...

Regarding the timing this remembers me on bug 1825501 comment 11 where the navigation times out after 3000ms, but then some milliseconds later the actual navigation takes place. Maybe we keep the main thread busy with something? And if yes, why does it happen so infrequently?

I did another push with an increased timeout (from 200ms to 800ms) and this seems to fix the issue (with additional timeout vs with default timeout). So even if the fact that the navigation seemed to start right after our timeout, it was probably just a coincidence.

See Also: → 1832778
Depends on: 1832891
No longer depends on: 1832494

Thanks for that information and good to know that at this day an update has been applied to the Windows workers. If it's related to that issue I wonder if we could keep an instance running with the update applied so that we could further investigate and fix our implementation? Or would it even be possible to get such a loaner to have better ways of testing as via try?

We can set up a pool configuration to use the current VM configuration based off of the image with the updates.

Also we have a pool up currently that has the recommended workaround, gecko-t/win11-64-2009-alpha, so we can do a try push against that and see if we see any changes.

Thanks Mark. To confirm the default workers haven't been reverted so far, or? We are trying out a solution to make the initial navigation detection more stable over on bug 1832891, and the patch looks good which means all the tests are passing. But if the update on those workers has been reverted it might explain it as well. So could you please let us know about the status? Thanks!

It is still a bit fluid. We are looking at a couple different options to resolve the issue related to Bug 1832020 . If we can find definitive evidence that the update caused the issue, we are going to work on a way to revert to a prior patch level and maintain that same patch level until we migrate to a newer version of Windows 11.

See Also: → 1832020

Thanks Mark. But we wonder if something has been changed for workers on try given that we cannot reproduce this failure anymore. See bug 1832891 comment 3. As such we cannot really verify if the code change that we want to land will indeed fix this issue.

No, worker configuration has not change.

Flags: needinfo?(mcornmesser)

I updated my initial patch from Bug 1832891 in order to detect situations where the previous implementation would have missed the navigation: https://treeherder.mozilla.org/jobs?repo=try&revision=47d08457c6a83e7cec9f37d9b5345feb7dd08ba4

Hopefully this will make it easier for us to validate the approach (note that the push above is still based on using expectNavigation, I will do another version simply using a higher unloadTimeout)

Assignee: nobody → jdescottes
Status: NEW → ASSIGNED
Pushed by jdescottes@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/23a3b1b5a2b7
[remote] WaitForInitialNavigation should use a longer unloadTimeout r=webdriver-reviewers,whimboo

The patch probably should have been landed for bug 1832891 given that this was the underlying issue. But now that it's on autoland lets just reverse the dependencies.

Blocks: 1832891
No longer depends on: 1832891
Status: ASSIGNED → RESOLVED
Closed: 1 year ago11 months ago
Resolution: --- → FIXED
Target Milestone: --- → 115 Branch

Pascal, I think the last beta uplifts were on May 25, so this is too late to uplift this? This is a low risk patch which should avoid intermittent issues so it could have been nice to uplift it, but if it's too late it's not a big deal. Thanks for letting us know!

Flags: needinfo?(pascalc)
Regressions: 1835796

(In reply to Julian Descottes [:jdescottes] from comment #28)

Pascal, I think the last beta uplifts were on May 25, so this is too late to uplift this? This is a low risk patch which should avoid intermittent issues so it could have been nice to uplift it, but if it's too late it's not a big deal. Thanks for letting us know!

We have already built and shipped our release candidate, if there is an RC2 this week, we could take it as a ride-along in this build or in our planned dot release, but given that it landed in mozilla-central, it will mechanically be part of 115 beta 1 next week, is that worth it?

Flags: needinfo?(pascalc)

It all depends on the for now (?) unknown status of bug 1832020. We seem to be affected the same way as those tests and a lot of tests are failing when the specific Windows update has been applied. Given that we do not have control of the machines the consumers of Webdriver are actually using there is the risk that we will see a higher failure rate for any Webdriver test or automation script using release builds of Firefox, which most of our users do. Note that there is also no tweak / workaround available to get it working again. The only solution for affected consumers would be to test with a beta build instead.

Regressions: 1835819

We also have Bug 1835819 now so I would say, it's definitely not a good candidate for uplift.

Regressions: 1835822
Flags: needinfo?(jdescottes)
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Target Milestone: 115 Branch → ---
No longer blocks: 1832891
Depends on: 1832891
No longer regressions: 1835822
See Also: → 1836006

This is not completely fixed and I've added a comment on bug 1830616 comment 14 (basically should have done it here :/).

URL: 1830616
Depends on: 1838381
Depends on: 1838382
No longer depends on: 1832891
No longer depends on: 1838382

No more failures since my patch from bug 1838382 landed.

Assignee: jdescottes → hskupin
Status: REOPENED → RESOLVED
Closed: 11 months ago10 months ago
Flags: needinfo?(jdescottes)
Resolution: --- → FIXED
Target Milestone: --- → 116 Branch

The patch landed in nightly and beta is affected.
:whimboo, is this bug important enough to require an uplift?

  • If yes, please nominate the patch for beta approval.Also, don't forget to request an uplift for the patches in the regressions caused by this fix.
  • If no, please set status-firefox115 to wontfix.

For more information, please visit BugBot documentation.

Flags: needinfo?(hskupin)

For now we do not want to get it fixed for the 115 release. We can reconsider in case of failures to uplift to a 115ESR minor release.

Flags: needinfo?(hskupin)
Duplicate of this bug: 1808622
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: