Closed Bug 1900222 Opened 1 year ago Closed 1 year ago

insertCSS/removeCSS promise may not settle during heavy page loads, causing subsequent async code to fail

Categories

(WebExtensions :: General, defect, P2)

Firefox 126
defect

Tracking

(firefox126 wontfix, firefox127 wontfix, firefox128 verified, firefox129 verified)

VERIFIED FIXED
129 Branch
Tracking Status
firefox126 --- wontfix
firefox127 --- wontfix
firefox128 --- verified
firefox129 --- verified

People

(Reporter: top.car8902, Assigned: robwu)

References

Details

(Whiteboard: [addons-jira])

Attachments

(6 files)

Steps to reproduce:

Firefox 126, macOS 14.5.

  1. Download the source code for my comment blocker extension from this branch:
    https://github.com/RickyRomero/shut-up-webextension/tree/firefox-css-bug
  2. In the src folder, delete manifest.json and rename manifest.ffx.json to manifest.json
  3. Open about:debugging#/runtime/this-firefox
  4. Click "Load Temporary Add-on" and browse to the src folder in my extension
  5. Grant site access permission in the options page if requested
  6. Inspect the extension from the debugging tab
  7. Load https://www.reddit.com/r/aww/comments/1d5sdin/my_17_yr_old_blind_and_deaf_shiba_loves_the/ in a new tab and observe the console output

NOTE: THIS BUG MAY REQUIRE MULTIPLE PAGE LOADS TO APPEAR.

Actual results:

Some page loads may show "result: elapsed" in red text in the console. That's the insertCSS/removeCSS promise failing to settle within 10 seconds, and my own code stepping in with Promise.race() to prevent a softlock.

I've extended the timeout to 10 seconds to show how serious this bug is. Under normal conditions, I time out within 50ms, extending the timeout over multiple attempts until the promise either resolves or rejects. It can sometimes take 5 or 6 of these timeouts for the promise to settle correctly. You can test this behavior yourself by changing the timeout at the top of blocker.js.

Expected results:

If insertCSS or removeCSS fails with an error, the promise should reject. If it succeeds, it should resolve. In other words, it should always either resolve or reject, not block further async code from executing by remaining open indefinitely.

The Bugbug bot thinks this bug should belong to the 'WebExtensions::Untriaged' component, and is moving the bug to that component. Please correct in case you think the bot is wrong.

Product: Firefox → WebExtensions

Could you create a minimal test case that demonstrates the issue? I see the scripting.insertCSS / scripting.removeCSS calls in core/blocker.js, but it is mixed with lots of unrelated code.

And could you show the output of the global Browser Console? If you don't see any relevant errors, try enabling the Multi Process option to also include logs from other processes.

Flags: needinfo?(top.car8902)

@robwu
I've spent my evening updating my repo with a branch for this bug:
https://github.com/RickyRomero/shut-up-webextension/tree/firefox-css-bug

The branch contains a minimal test case. All relevant code is in src/init.js. You don't need to rename the manifest JSON file anymore either.

You should (hopefully) see that, when loading a simple page, the promises resolve and the operation queue clears every time. Try it on https://rickyromero.com/ to see what I mean. Conversely, https://www.reddit.com/r/aww/comments/1d5sdin/my_17_yr_old_blind_and_deaf_shiba_loves_the/ leaves promises hanging open most times I load the page with this same code.

Unfortunately, the Browser Console doesn't reveal any clues in either parent- or multi-process modes.

Same setup as before: Firefox 126, macOS 14.5.

Flags: needinfo?(top.car8902)

I reproduced the issue on the latest Nightly (128.0a1/20240603160728), Beta (127.0/20240603152359) and Release (126.0.1/20240526221752) under Windows 10 x64 and Ubuntu 22.04 LTS.

Loading the extension from Comment 3 and then https://www.reddit.com/r/aww/comments/1d5sdin/my_17_yr_old_blind_and_deaf_shiba_loves_the/ in a new tab and checking the extension console will show continuous “Async code halted; promise hanging open…” errors. See the attached screenshot for more details.

Status: UNCONFIRMED → NEW
Ever confirmed: true
Blocks: 1898133
Assignee: nobody → rob
Status: NEW → ASSIGNED
Component: Untriaged → General
Whiteboard: [addons-jira]

Minimal test case demonstrating the specific issue:

  1. Visit example.com
  2. Press the extension button. This will insert a slow-loading frame and call scripting.executeScript.

Expected:

  • Slow-loading frame inserted.
  • scripting.executeScript resolves, with results:
    • results from top frame and slow frame (this is Chrome's behavior).
    • or at least only the top frame without the slow frame (but definitely be resolving).
  • second scripting.executeScript call after frame load reports results from top document and the frame.

Actual:

  • Slow-loading frame inserted.
  • scripting.executeScript got stuck.
  • second scripting.executeScript call after frame load reports results from top document and the frame.

The issue is not specific to scripting.insertCSS / scripting.removeCSS. It also happens with scripting.executeScript, and the variants of these APIs in the tabs namespace.

A work-around is to use injectImmediately:true in the scripting.executeScript API, or runAt:"document_start" in the tabs.executeScript API. There are no work-arounds available for insertCSS / removeCSS, because these do not have the runAt option available (they are scheduled at document_idle).

The issue is happening because the implementation for execution at document_idle and document_end gets stuck when an initial about:blank frame is encountered (which has document.readyState = "uninitialized"). When the requested frame src is loaded, the frame will navigate to the final destination, and the internal promiseDocumentIdle, promiseDocumentLoaded (or in document_end's case) promiseReady implementations will never resolve.

Note that if the document load is aborted, that the frame will always be at the uninitialized state, without content script execution.

At the very least we should get unstuck, whether by throwing an error or by executing the script after a delay. A point against executing the script after the delay is that the frame could be cross-origin and be in a different process, so there is no guarantee that we could get meaningful results.

In case of declarative content scripts, the content script will execute in the final frame. I haven't checked whether it would try injecting in the initial about:blank.

The reported bug is caused by an attempt to schedule a content script in
an initial about:blank document/frame for which document_end and
document_idle never resolve. Although the bug is about dynamic execution
(insertCSS/executeScript/etc), the underlying injection logic is also
shared by declarative content scripts.

This test adds test coverage for two relevant scenarios:

  • Content scripts declared via content_scripts in manifest.json (this is
    also the underlying implementation of contentScripts.register,
    userScripts.register and scripting.registerContentScripts).
    This WORKS AS EXPECTED, because scripts never run in initial
    about:blank ( which is reported at bug 1415539 + bug 1486036 ).

  • Content scripts executed in existing content as part of extension
    install via ExtensionPolicyService::InjectContentScripts. This
    UNEXPECTEDLY results in executions that are stuck and never-settling.

The same issue is also relevant to the handleActivateScripts logic,
which did not and still does not have explicit coverage for this
scenario. A generic solution to the InjectContentScripts case will also
apply to handleActivateScripts, so I did not add tests for that here.

A test case for the dynamic code execution case is not part of this
patch because the observed result would be that the test gets stuck.
This test case will be included with the next patch that fixes this bug,
which will also update the test expectation for the "UNEXPECTEDLY"
never-settling promise mentioned above.

This patch fixes the issue by injecting immediately when a content
script execution is scheduled for an initial about:blank document.
This case is only reachable through dynamic content script execution
APIs, and not through declarative content scripts. The latter is out of
scope for this bug and tracked separately at bug 1415539 + bug 1486036.

An alternative solution could have been to not inject, e.g. by ignoring
the frame or throwing an error. This would however be unexpected to an
an extension developer whose intention is to run a code snippet in all
frames that are reachable by web pages (which works in Chrome).

Severity: -- → S3
Priority: -- → P2
Blocks: 1901894
No longer blocks: 1901894
Depends on: 1901894
See Also: → 1902709
Pushed by rob@robwu.nl: https://hg.mozilla.org/integration/autoland/rev/c018b69d1062 Add test coverage for content scripts in slow-loading frames r=zombie https://hg.mozilla.org/integration/autoland/rev/51a2aaa49a8d Avoid getting stuck on initial about:blank r=zombie

Backed out for causing assertion failures @ WebExtensionPolicy.cpp

Backout link: https://hg.mozilla.org/integration/autoland/rev/39acea0a60fe191ead9b51557ac3e7c10782d64e

Push with failures

[Failure log -> 17388] Assertion failure: !mThis.URL().Spec().EqualsLiteral("about:blank"), at /builds/worker/checkouts/gecko/toolkit/components/extensions/WebExtensionPolicy.cpp](https://treeherder.mozilla.org/logviewer?job_id=462466611&repo=autoland&lineNumber=3059)

Flags: needinfo?(rob)

The debug assertion is associated with the patch from bug 1901894, and independent of the patches here.

Flags: needinfo?(rob)
Pushed by rob@robwu.nl: https://hg.mozilla.org/integration/autoland/rev/35a4042189bb Add test coverage for content scripts in slow-loading frames r=zombie https://hg.mozilla.org/integration/autoland/rev/144bdc6f060b Avoid getting stuck on initial about:blank r=zombie
Status: ASSIGNED → RESOLVED
Closed: 1 year ago
Resolution: --- → FIXED
Target Milestone: --- → 129 Branch

Verified as Fixed. Tested using the test extension and STR from Comment 6 on the latest Nightly (129.0a1/20240616215431) under Windows 10 x64 and Ubuntu 22.04 LTS.

The first scripting.executeScript resolves with results and does not get stuck, confirming the fix. For comparison, on the latest Release 127, the first scripting.executeScript gets stuck.

Status: RESOLVED → VERIFIED

The patch landed in nightly and beta is affected.
:robwu, is this bug important enough to require an uplift?

  • If yes, please nominate the patch for beta approval.
  • If no, please set status-firefox128 to wontfix.

For more information, please visit BugBot documentation.

Flags: needinfo?(rob)

The reported bug is caused by an attempt to schedule a content script in
an initial about:blank document/frame for which document_end and
document_idle never resolve. Although the bug is about dynamic execution
(insertCSS/executeScript/etc), the underlying injection logic is also
shared by declarative content scripts.

This test adds test coverage for two relevant scenarios:

  • Content scripts declared via content_scripts in manifest.json (this is
    also the underlying implementation of contentScripts.register,
    userScripts.register and scripting.registerContentScripts).
    This WORKS AS EXPECTED, because scripts never run in initial
    about:blank ( which is reported at bug 1415539 + bug 1486036 ).

  • Content scripts executed in existing content as part of extension
    install via ExtensionPolicyService::InjectContentScripts. This
    UNEXPECTEDLY results in executions that are stuck and never-settling.

The same issue is also relevant to the handleActivateScripts logic,
which did not and still does not have explicit coverage for this
scenario. A generic solution to the InjectContentScripts case will also
apply to handleActivateScripts, so I did not add tests for that here.

A test case for the dynamic code execution case is not part of this
patch because the observed result would be that the test gets stuck.
This test case will be included with the next patch that fixes this bug,
which will also update the test expectation for the "UNEXPECTEDLY"
never-settling promise mentioned above.

Original Revision: https://phabricator.services.mozilla.com/D212986

Attachment #9407973 - Flags: approval-mozilla-beta?

This patch fixes the issue by injecting immediately when a content
script execution is scheduled for an initial about:blank document.
This case is only reachable through dynamic content script execution
APIs, and not through declarative content scripts. The latter is out of
scope for this bug and tracked separately at bug 1415539 + bug 1486036.

An alternative solution could have been to not inject, e.g. by ignoring
the frame or throwing an error. This would however be unexpected to an
an extension developer whose intention is to run a code snippet in all
frames that are reachable by web pages (which works in Chrome).

Original Revision: https://phabricator.services.mozilla.com/D212987

Attachment #9407974 - Flags: approval-mozilla-beta?

beta Uplift Approval Request

  • User impact if declined: Extensions that dynamically insert content scripts may get stuck; memory leaks may occur
  • Code covered by automated testing: yes
  • Fix verified in Nightly: yes
  • Needs manual QE test: no
  • Steps to reproduce for manual QE testing: https://bugzilla.mozilla.org/show_bug.cgi?id=1900222#c6
  • Risk associated with taking this patch: Low
  • Explanation of risk level: Specific change in extensions code, well-understood cause and fix, covered by automated unit tests and manual tests
  • String changes made/needed: none
  • Is Android affected?: yes

Requested uplift for the above patches and the one from bug 1901894.

Flags: needinfo?(rob)
Attachment #9407974 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Attachment #9407973 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Attachment #9407974 - Flags: approval-mozilla-beta+ → approval-mozilla-beta-
Attachment #9407974 - Flags: approval-mozilla-beta?
Attachment #9407974 - Flags: approval-mozilla-beta- → approval-mozilla-beta+
Attachment #9407974 - Flags: approval-mozilla-beta?

Verified as Fixed. Tested using the test extension and STR from Comment 6 on the latest Beta (128.0b5/20240618204306), available from https://treeherder.mozilla.org/jobs?repo=mozilla-beta&revision=94586d93e2a743a9764c62a4098617652fad8339 under Windows 10 x64 and Ubuntu 22.04 LTS.

The first scripting.executeScript resolves with results and does not get stuck, confirming the fix.

Chiming in here, on macOS 14.5 and Firefox 128.0b6 the fix appears to have solved my problem. insertCSS/removeCSS are both much more reliable now. Thank you for quickly fixing this @robwu!

You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: