Open Bug 1735899 Opened 2 months ago Updated 11 days ago

browserAction Pop-up failure

Categories

(WebExtensions :: Frontend, defect, P1)

Firefox 95
defect

Tracking

(firefox-esr78 unaffected, firefox-esr91 unaffected, firefox93 unaffected, firefox94 unaffected, firefox95 fixed)

ASSIGNED
Tracking Status
firefox-esr78 --- unaffected
firefox-esr91 --- unaffected
firefox93 --- unaffected
firefox94 --- unaffected
firefox95 --- fixed

People

(Reporter: eros_uk, Assigned: rpl)

References

(Regression)

Details

(Keywords: leave-open, regression, Whiteboard: [addons-jira])

Attachments

(7 files)

Attached file edit.xpi

Since possibly 95.0a1 (2021-10-10) (64-bit), the browserAction pop-up & sidebar has been failing.

Once Firefox starts, they work and gradually start to fail. It has been the same for multiple unsigned addons that I have installed.

  • When FF starts, pop-ups work
  • After a while they stop working
  • Opening a new window, pop-ups don't work
  • Opening a new Private window and pop-ups work
  • After a while, pop-ups also start to fail in Private window

Error messages

Layout was forced before the page was fully loaded. If stylesheets are not yet loaded this may cause a flash of unstyled content.
FrameData missing for ******* page moz-extension://************/content/popup.html ExtensionPageChild.jsm:429
initExtensionContext resource://gre/modules/ExtensionPageChild.jsm:429
initExtensionDocument resource://gre/modules/ExtensionProcessScript.jsm:322

I noticed that loading the popup or sidebar without the stylesheet solves the problem.

Does not work

  <head>
    <meta charset="utf-8">
    <meta name="viewport" content="width=device-width, initial-scale=1">
    <title>Popup</title>
    <link rel="stylesheet" href="popup.css">
    <script type="module" src="popup.js"></script>
  </head>

Works

  <head>
    <meta charset="utf-8">
    <meta name="viewport" content="width=device-width, initial-scale=1">
    <title>Popup</title>
    <!-- <link rel="stylesheet" href="popup.css"> -->
    <script type="module" src="popup.js"></script>
  </head>
Attached image pop-up-error.jpg

How pop-ups show

Attached image popup-error2.jpg
Attached image popup-error3.jpg

(In reply to erosman from comment #0)

Once Firefox starts, they work and gradually start to fail. It has been the same for multiple unsigned addons that I have installed.

What does "gradually" mean? After a few seconds? minutes? hours?

They start to fail after launching the pop-ups a few times.

Notes:

  • Regardless of the STR, the issue exists but could be specific to certain conditions
  • The Toolbox - Extension - Console errors (FrameData missing for ...) starts to show error after Firefox starts but even WITHOUT opening the pop-up
  • The errors should confirm the existence of an issue
  • The issue relates to loading of stylesheet
  • The issue is affected by the conditions of Private window
  • Mouse-over toolbar icon results in Layout was forced before the page was fully loaded. If stylesheets are not yet loaded this may cause a flash of unstyled content.` message on Toolbox - Extension - Console and sometimes FrameData missing for ... error
  • Above message also appear on Private window toolbar icon hover
  • The issue exists when switching from <link rel="stylesheet" href="popup.css"> to <style>@import 'popup.css';</style>
  • The issue exists with only ONE add-on enabled
  • Disabling and re-enabling the add-on makes no difference on the issue

I will do more testing and update this post with the findings.

Installed 95.0a1 (2021-10-10) (64-bit)

http://ftp.mozilla.org/pub/firefox/nightly/2021/10/2021-10-10-21-49-47-mozilla-central/firefox-95.0a1.en-US.win64.installer.exe

  • Mouse-over toolbar icon results in Layout was forced before the page was fully loaded. If stylesheets are not yet loaded this may cause a flash of unstyled content.` message on Toolbox - Extension - Console
  • The error message was different
TypeError: data is undefined                         ExtensionPageChild.jsm:588:52
    initExtensionContext resource://gre/modules/ExtensionPageChild.jsm:588
    initExtensionDocument resource://gre/modules/ExtensionProcessScript.jsm:321
  • Despite above errors, the pop-up (& sidebar) works fine so far

screenshot to follow

2021-10-10-21-49-47-mozilla-central

Hello,

I’ve attempted to bisect the issue and find the regressing issue, however without success. I ran a bisection spanning Firefox 94 until the present day, but all tested builds were good i.e. both the pop-up and the sidebar, in the default window, a new window and a private window, opened each time without problems (opened and closed both pop-up and sidebar around 20 times each).

I did encounter the errors mentioned in the original description, but the pop-up and sidebar did not start to fail once I started testing them.

Installed 95.0a1 (2021-10-09) (64-bit) 2021-10-09-21-26-22

  • Mouse-over toolbar icon results in Layout was forced before the page was fully loaded. If stylesheets are not yet loaded this may cause a flash of unstyled content. message on Toolbox - Extension - Console
  • There is no TypeError: data is undefined error
  • pop-up (& sidebar) seem to work fine so far

Installed previous versions one by one and the result is:

  • 2021-10-09-21 has Layout was forced before.. on mouse-over but no error & pop-up worked while testing (limited time)
  • 2021-10-10-21 has Layout was forced before.. on mouse-over plus TypeError: data is undefined error & pop-up worked while testing (limited time) ?!
  • 2021-10-11-21 has Layout was forced before.. on mouse-over plus TypeError: data is undefined error & pop-up starts to fail
  • 2021-10-12-09 has Layout was forced before.. on mouse-over plus TypeError: data is undefined error & pop-up starts to fail
  • 2021-10-12-21 has Layout was forced before.. on mouse-over plus TypeError: data is undefined error & pop-up starts to fail
  • 2021-10-13-03 has Layout was forced before.. on mouse-over plus TypeError: data is undefined error & pop-up starts to fail
  • 2021-10-13-21 has Layout was forced before.. on mouse-over plus FrameData missing for error & pop-up starts to fail

It is possible that the problem is related to the error TypeError: data is undefined & later FrameData missing for. Tracing the errors might shed some light on the issue.
Nonetheless, it is worthwhile eliminating the error (as well as the notice) and hopefully the pop-up issue could get solves as a result.

Mouse-over browserAction creates even more errors on Multiprocess Browser Console

NS_ERROR_NOT_AVAILABLE: Component returned failure code: 0x80040111 (NS_ERROR_NOT_AVAILABLE) [nsIDocShell.domWindow] 33 window-global.js:408
    get window resource://devtools/server/actors/targets/window-global.js:408
    _onDocShellDestroy resource://devtools/server/actors/targets/window-global.js:938
    observe resource://devtools/server/actors/targets/window-global.js:876

Exception { name: "NS_ERROR_NOT_AVAILABLE", message: "Component returned failure code: 0x80040111 (NS_ERROR_NOT_AVAILABLE) [nsIDocShell.domWindow]", result: 2147746065, filename: "resource://devtools/server/actors/targets/window-global.js", lineNumber: 1734, columnNumber: 0, data: null, stack: "watch@resource://devtools/server/actors/targets/window-global.js:1734:28\n_watchDocshells@resource://devtools/server/actors/targets/window-global.js:712:28\n_attach/<@resource://devtools/server/actors/targets/window-global.js:694:42\nexports.makeInfallible/<@resource://devtools/shared/ThreadSafeDevToolsUtils.js:103:22\n", location: XPCWrappedNative_NoHelper }
ThreadSafeDevToolsUtils.js:82:13
    reportException resource://devtools/shared/ThreadSafeDevToolsUtils.js:82
    makeInfallible resource://devtools/shared/ThreadSafeDevToolsUtils.js:109

Error while calling actor 'windowGlobalTarget's method 'attach' JSWindowActorChild.sendAsyncMessage: JSWindowActorChild cannot send at the moment Actor.js:92:13
DOMException: JSWindowActorChild.sendAsyncMessage: JSWindowActorChild cannot send at the moment Actor.js:99:15
NS_ERROR_NOT_AVAILABLE: Component returned failure code: 0x80040111 (NS_ERROR_NOT_AVAILABLE) [nsIDocShell.domWindow] window-global.js:1774
InvalidStateError: JSWindowActorChild.sendAsyncMessage: JSWindowActorChild cannot send at the moment DevToolsFrameChild.jsm:437
Error when attaching target: Error: Connection closed, pending request to server1.conn0.windowGlobal8589934743/windowGlobalTarget1, type attach failed

Request stack:
request@resource://devtools/shared/protocol/Front.js:292:14
generateRequestMethods/</frontProto[name]@resource://devtools/shared/protocol/Front/FrontClassWithSpec.js:46:19
attach/this._attach<@resource://devtools/client/fronts/targets/window-global.js:120:36
attach@resource://devtools/client/fronts/targets/window-global.js:133:7
_attachAndInitThread@resource://devtools/client/fronts/targets/target-mixin.js:526:20
attachAndInitThread@resource://devtools/client/fronts/targets/target-mixin.js:505:40
_onTargetAvailable@resource://devtools/shared/commands/target/target-command.js:187:25
_emit@resource://devtools/shared/event-emitter.js:244:34
emit@resource://devtools/shared/event-emitter.js:190:18
emit@resource://devtools/shared/event-emitter.js:342:18
_onTargetAvailable@resource://devtools/client/fronts/watcher.js:60:10
_emit@resource://devtools/shared/event-emitter.js:244:34
emit@resource://devtools/shared/event-emitter.js:190:18
emit@resource://devtools/shared/event-emitter.js:342:18
onPacket@resource://devtools/shared/protocol/Front.js:336:13
onPacket@resource://devtools/client/devtools-client.js:482:13
send/<@resource://devtools/shared/transport/local-transport.js:68:25
exports.makeInfallible/<@resource://devtools/shared/ThreadSafeDevToolsUtils.js:103:22
exports.makeInfallible/<@resource://devtools/shared/ThreadSafeDevToolsUtils.js:103:22
    baseFrontClassDestroy resource://devtools/shared/protocol/Front.js:106
    _onTargetDestroyed resource://devtools/shared/commands/target/target-command.js:307
    _emit resource://devtools/shared/event-emitter.js:244
    emit resource://devtools/shared/event-emitter.js:190
    emit resource://devtools/shared/event-emitter.js:342
    _onTargetDestroyed resource://devtools/client/fronts/watcher.js:74
    _emit resource://devtools/shared/event-emitter.js:244
    emit resource://devtools/shared/event-emitter.js:190
    emit resource://devtools/shared/event-emitter.js:342
    onPacket resource://devtools/shared/protocol/Front.js:336
    onPacket resource://devtools/client/devtools-client.js:482
    send resource://devtools/shared/transport/local-transport.js:68
    makeInfallible resource://devtools/shared/ThreadSafeDevToolsUtils.js:103
    makeInfallible resource://devtools/shared/ThreadSafeDevToolsUtils.js:103
target-command.js:189:15

Thank you for the detailed error report erosman, this is the same as bug 1735347, but this really helps confirming it's a real regression from bug 1708243 part 4.

(I might ask you to test/confirm something there in a bit)

Status: NEW → RESOLVED
Closed: 2 months ago
Resolution: --- → DUPLICATE
Duplicate of bug: 1735347

Bug 1735347 deals with the errors and it is not confirmed yet that the browserAction Pop-up failure is due to the errors.

Furthermore, the browserAction Pop-up failure is still happening in 95.0a1 (2021-10-17) (64-bit).

IMHO, the bug should be kept open until the browserAction Pop-up failure is resolved. Currently (& for the last 9 days), I am forced to restart the browser to get the pop-ups working for a short time, which considerably disrupts the browsing experience.

Furthermore, extension development has become impossible.

Since yesterday, I am noticing the CSS on extension's options pages also started to fail a few times.
It is becoming more likely that there is an issue with loading stylesheets for extensions which is getting more pronounced.

Options page message is from view-source:resource://devtools/server/actors/inspector/node.js

Layout was forced before the page was fully loaded. If stylesheets are not yet loaded this may cause a flash of unstyled content.
node.js:356

Previous CSS message was from the popup.html

Layout was forced before the page was fully loaded. If stylesheets are not yet loaded this may cause a flash of unstyled content.
popup.html

Once again, when Firefox starts, pop-ups and option pages work fine and they start to fail over time. Just now, I have tried to load an extension's options page for over 20 times and it failed every time.

(In reply to erosman from comment #12)

Since yesterday, I am noticing the CSS on extension's options pages also started to fail a few times.
It is becoming more likely that there is an issue with loading stylesheets for extensions which is getting more pronounced.
...
Once again, when Firefox starts, pop-ups and option pages work fine and they start to fail over time. Just now, I have tried to load an extension's options page for over 20 times and it failed every time.

Given that you seem to be able to reproduce that consistently, it would really help if you could try to bisect the possible regressing change using mozregression, which would help us to confirm if the issue that is preventing the popup to be opened correctly it is a separate issue or not.

Flags: needinfo?(eros_uk)

I did try to bisect it by manually downloading and installing previous versions of Nightly as in comment #8.

mozregression does not run on my Win7 OS and that has been the case for many years. There are Python errors. Installing various versions of Python didn't help either.

Looking at Developer Tools ➜ Style Editor, it may !!? relate to CSS @import as I don't see the files loaded. I also use @import in the browserAction pop-ups. Although, the pop-ups that don't have @import also fail.

As mentioned in the first post, commenting out the <link rel="stylesheet" href="popup.css"> will result in the pop-up showing (but looking bad). This point does not relate to the issues in bug 1735347.

Please note that there are specific setting in about:config that relate to Win7. There are a few OS-specific locked entries as I found out when discussing anther matter in #amo:mozilla.org which may explain why the issue in not happening for everyone.

N.B. As I type and as pop-ups fail and as extension option pages fail, I opened a new Private window and everything works fine in that private window.
After a while, that private window will also start to experience the same failures and I would be forced to restart Firefox in order to access pop-up & options.

PS. In case this has a bearing on the issue, all my extensions are under development and are loaded unpacked as it has been for many years.

Update:
It has already been 10 days and it appears that the failures, when happen, only happen for the above mentioned unpacked add-ons. I have 2 other addons (uBlock Origin & uMatirx) and their pop-ups haven't failed yet. Considering that loading stylesheets seems to be a factor, could there have been some changes to the stylesheet loading process on Oct 9th-10th that is affecting the loading from unpacked & out of Firefox own filesystem location?

Flags: needinfo?(eros_uk)
Flags: needinfo?(lgreco)

(In reply to erosman from comment #14)

I did try to bisect it by manually downloading and installing previous versions of Nightly as in comment #8.

mozregression does not run on my Win7 OS and that has been the case for many years. There are Python errors. Installing various versions of Python didn't help either.

Looking at Developer Tools ➜ Style Editor, it may !!? relate to CSS @import as I don't see the files loaded. I also use @import in the browserAction pop-ups. Although, the pop-ups that don't have @import also fail.

As mentioned in the first post, commenting out the <link rel="stylesheet" href="popup.css"> will result in the pop-up showing (but looking bad). This point does not relate to the issues in bug 1735347.

Please note that there are specific setting in about:config that relate to Win7. There are a few OS-specific locked entries as I found out when discussing anther matter in #amo:mozilla.org which may explain why the issue in not happening for everyone.

N.B. As I type and as pop-ups fail and as extension option pages fail, I opened a new Private window and everything works fine in that private window.
After a while, that private window will also start to experience the same failures and I would be forced to restart Firefox in order to access pop-up & options.

PS. In case this has a bearing on the issue, all my extensions are under development and are loaded unpacked as it has been for many years.

Update:
It has already been 10 days and it appears that the failures, when happen, only happen for the above mentioned unpacked add-ons. I have 2 other addons (uBlock Origin & uMatirx) and their pop-ups haven't failed yet. Considering that loading stylesheets seems to be a factor, could there have been some changes to the stylesheet loading process on Oct 9th-10th that is affecting the loading from unpacked & out of Firefox own filesystem location?

The only one that I can think of right now is https://bugzilla.mozilla.org/show_bug.cgi?id=1706594, which got on mozilla-central on Oct 11:

  • The stylesheets listed in an html document do trigger a speculative load, which was one of the kind of requests that were able to trigger that bug
  • the fix explicitly cancel streams that are related to channels that were already canceled and so in theory it shouldn't be preventing the page in the popup from opening correctly
  • but the bug was also more likely to happen for unpacked extensions (I couldn't find a way to reproduce it with packed ones)

and so there are 2 parts of your STR that seems to have some kind of connection with that fix.

It would be great if we could confirm if that is actually the case, would you mind to send me the about:support information and any specific setting in about:config that you already know that relates only to Window 7?

Being able to reproduce the issue locally would allow me to confirm if there is any actual connection between the issue you are experiencing and the fix we landed for Bug 1706594, and in either case (related or not to Bug 1706594) to dig more into it and get a better picture of what is really going on.

Alternatively (or in addition to that), I create a backout commit locally and push it to try, this way you may try on your system a build that includes everything that is already in Nightly besides that single change, you can download the custom build from here:

Let me know if you are unable to reproduce the issue in this build but you are able to reproduce it on the last nightly build.

Flags: needinfo?(lgreco) → needinfo?(eros_uk)
  • I was able to reproduce the bug on a clean profile on the Latest Nightly
  • I was not able to reproduce the bug on the build in comment #15

Update

  • I was able to reproduce the bug on latest Nightly with only one unpacked add-on enabled

Notes

  • Set-up:
    Windows 7 Ultimate, latest Nightly, 17 enabled unpacked installed addons, 5 enabled standard installed addons, 9 disabled standard installed addons
  • The failure only occurs on unpacked addons
  • The failure affects all add-on pages but mostly on popup & sidebar and much less on options page
  • Commenting out <link rel="stylesheet" href="popup.css"> results in failed pop-up showing
  • After failure, it is togglable, i.e. without-CSS-shows, with-CSS-fails
  • @import appears to be much slower than <link rel="stylesheet" href="popup.css">
Flags: needinfo?(eros_uk)
Status: RESOLVED → REOPENED
Resolution: DUPLICATE → ---

I was more and more convinced that the errors logged and the popup issue are two different bugs which happens to be triggerable with a similar STR (which leverages the special "popup extension page preloaded in an hidden browser on mouseover and swapped with the visible browser if the browserAction is clicked or preload hidden browser destroyed if the user move the mouse out of the browserAction")

But unlike Bug 1735347 (which is specifically related to the error logged, and a regression from Bug 1708243), this issue can only be triggered by an unpacked extension (it doesn't matter if it is loaded temporarily or permanently).

STR to reproduce the popup issue consistently

Over the last couple of day I spent some more time gathering some more details about the STR originally shared in comment 1 by discussing it with erosman over Matrix, and I have been finally able to trigger the bug consistently (and just a few seconds) by using the following tweaked STR:

  • run firefox with the "nsCSSLoader" logs enabled: export MOZ_LOG="nsCSSLoader:5,sync" (or even MOZ_LOG="nsCSSLoader:5,sync" mach run if running a local Nightly build)
  • download, unpack in a local directory the xpi file of the FireMonkey extension and install it temporarily from about:debugging (I originally used some more addons all installed as unpacked, but then I have verified that I can also trigger it with just the unpacked FireMonkey extension)
  • move the mouse over and out of the FirefoxMonkey browserAction (which happen to be also how kernp5 reproduced consistently Bug 1735347, see Bug 1735347 comment 15, but this popup issue will only be triggered if the extension is unpacked, unlike the error logged that Bug 1735347 is already tracking)
  • the bug has been triggered when the nsCSSLoader logs shows that there is a FireMonkey popup.css stylesheets cache entry that gets stuck into the Loading state:
[Child 1598065: Main Thread]: D/nsCSSLoader css::Loader::LoadSheet(aURL, aObserver) api call
[Child 1598065: Main Thread]: D/nsCSSLoader   Non-document sheet uri: 'moz-extension://8132eec0-bdce-4534-af99-4edb3415c02c/content/popup.css'
[Child 1598065: Main Thread]: D/nsCSSLoader css::Loader::CreateSheet(moz-extension://8132eec0-bdce-4534-af99-4edb3415c02c/content/popup.css)
[Child 1598065: Main Thread]: D/nsCSSLoader KeyEquals(moz-extension://8132eec0-bdce-4534-af99-4edb3415c02c/content/popup.css)
[Child 1598065: Main Thread]: D/nsCSSLoader KeyEquals(moz-extension://8132eec0-bdce-4534-af99-4edb3415c02c/content/popup.css)
[Child 1598065: Main Thread]: D/nsCSSLoader   Hit cache with state: Loading
  • after the "Hit cache with state: Loading" has been logged at least once for the popup.css url, opening the browserAction will not fully render the FireMonkey popup with its intended content, on the contrary it gets rendered as the one in the screenshot attached in Bug 1735899 comment 1
  • from the BrowserToolbox (by setting a breakpoint on the devtools internals that are observing the destroyed window) I have also verified that the extension page document in the browserAction popup is stuck in the "interactive" state (which I think it is due to the fact that it will be forever waiting for the cache entry stuck into the Loading state)
  • by opening the same FireMonkey popup moz-extension url in a Firefox tab, the tab is stuck into the loading state and nothing gets rendered in the tab. Explicitly stopping the loading tab makes the extension page able to be rendered, but without the stylesheet that was stuck into the Loading state

Actual regressing change

This, as I was already suspecting when I started to investigate it, points suspiciously to the same kind of race issues triggered by the SimpleChannel created by ExtensionProtocolHandler for the unpacked extensions: Bug 1706594.

I'm pretty sure now that this issue was not happening before we landed Bug 1706594 because before that change Gecko would have kept loading data for requests even if technically cancelled (due to the related document being destroyed in the meantime) and so the preloaded stylesheet entries loads would complete and not stay stuck in the Loading state.

As an additional confirmation, I tried to reproduce the issue on a local build where I backed out the patch landed from Bug 1706594 (hg backout -r 2c21101c4b3b -m "TMP: temporarily revert Bug 1706594 patch") and on that build I was not able to trigger the issue anymore.

Next steps

As soon as I got a better picture of the issue yesterday, we (Tomislav and I) agreed on the following next steps:

  • Update this bug with more details about how to reproduce the issue consistently and what seems to be going on based on the details we have been able to collect so far
  • Needinfo Nika and Valentin for their perspective and opinions about this issue, and evaluate with them if we should backout Bug 1706594 patch from Nightly 95 and reopen it (so that we can investigate this issue and land it again after we have got to its root cause and prepared a fix to land along with the original Bug 1706594 patch)
  • Needinfo Emilio for his perspective on the preloaded stylesheet entry that gets stuck, and pointers / ideas that could help us to investigate it more (and then find out where we are missing to propagate the fact that the channel has been cancelled, so that the cache entries would be invalidated accordingly)
  • Reduce the STR described in this comment into an automated test and try to record it with rr (and submit to pernosco)
Motivations for backing out Bug 1706594 patch from Nightly 95
  • As Tomislav did mention to me when we discussed this yesterday, an additional reason to backout Bug 1706594 would be to ensure that this issue is not hiding any other issue that the fission-related changes Tomislav landed in Firefox 95 from Bug 1735347 may have introduced.
  • This bug shows us that there are some side-effects of the fix landed in Bug 1706594 that we missed to notice until now, and so spending some more time investigating this issue may also help to ensure us that we didn't miss other similar ones
What makes a user more likely to be impacted by this bug (and Bug 1706594)

Both this bug and Bug 1706594 can be triggered only on unpacked extensions (because those are the ones that will get a stream asynchronously when a moz-extension url is being loaded), and so the users that are more likely to trigger this issue are:

  • extension developers (because they are more likely be installing and testing unpacked extensions)
  • Linux users that install distro-packaged Firefox and extensions (which are also unpacked as I did mention in Bug 1706594 comment 32)
Impact comparison for Bug 1706594 and Bug 1735899
  • Bug 1706594: for the impacted user that happen to trigger that bug, all extensions would not be able to run any extension code anymore, because the extension child process main thread is kept busy by a "stream pump infinite loop"
  • Bug 1735899:
    • for the impacted user that happen to trigger this bug, for the extension that triggers this bug any extension page that include the css file that gets stuck will never load anymore (unless the extension process is killed and restarted), disabling and re-enabling the extension is not enough.
    • Based on erosman comments it seems that the issue then starts to impact other extensions over the time, but I've not been able to verify this yet

But this impact analysis is based only on the known side-effects of the fix we have landed for Bug 1706594, be sure that there aren't other side effects that we don't know about yet would require some more investigation.

I don't have stats or a complete rationale to bake the following statement, but based on my current understanding of both the bugs (Bug 1706594, Bug 1735899), it looks that they may have more or less the same kind of chances to be triggered by a user.

Assignee: nobody → lgreco
Regressed by: 1706594

Hi Nika and Valentin,
I investigated this bug and it looks that this issue did become "triggerable" because of the changes we introduced in Bug 1706594.
In comment 17 I added details from the investigation (include an STR that can reproduce it consistently) and some preliminary evaluations about possible next steps (included evaluating with you if we should backout Bug 1706594 patch from 95, what makes a user more likely to be affected etc.)

We would like to hear your opinions and perspective (about the issue and if we should back out Bug 1706594 from Nightly 95).

Flags: needinfo?(valentin.gosu)
Flags: needinfo?(nika)

Hi Emilio,
It looks that the fix I landed to fix Bug 1706594 is having a side-effect that is now triggering this issue.

As described in more details in comment 17, a preloaded stylesheet cache entry gets stuck into the Loading state and it never completes (likely because with Bug 1706594 changes the stream pump has been cancelled, previously it would have been left to pump data, which I guess would have allowed those preloaded stylesheet cache entry to be fully loaded and never trigger this issue).

The nsCSSLoader log in comment 17 shows that when a cache entry gets stuck then the issue got triggered and the extension pages that include that CSS will never finish loading, which is clearly a side-effect and not the actual root cause.

I'm needinfo-ing you because I'd like to hear your perspective about this issue, and in particular any additional idea or pointers to details related to the CSS loader internals that I may look more deeply into to try to get a better picture and pin point where the issue is actually triggered to then lead to this issue.

Flags: needinfo?(emilio)

Where is the load canceled? The CSS loader generally coalesced loads, so it's not sound to cancel a load if there's another document loading the same url. Assuming that's not the case however, and we really just want to cancel the load altogether, it seems that the issue is that StreamLoader::OnStopRequest is not being called for that stylesheet.

This call is what eventually ends up in SheetComplete, and thus removing the load from the table (if it failed or was canceled), or marking the load as successful and inserting it into the cache. In any case, without that call the stylesheet will remain as loading for ever, which seems like what you're hitting.

So presumably you're missing an OnStopRequest(NS_BINDING_ABORTED) somewhere in the code you added?

Flags: needinfo?(emilio) → needinfo?(lgreco)

Thanks a lot Emilio, your comment confirmed me that I was looking in the direction and while investigating it I already verified that is the case (those cache entries never gets a OnStopRequest, from what I recall they don't get an OnStartRequest neither but I have to take another look to double-check that was the case) and that made me realize which may be the scenario that may make the OnStopRequest to never be sent.

I'm going to double-check that (and very likely asking your perspective after I have verified if that is the case).

Flags: needinfo?(lgreco)

I confirmed that my theory about what was missing to call OnStopRequest for the preloaded stylesheet cache entries was actually matching what is actually going on:

  • When ExtensionStreamGetter::OnStream is called we are currently passing the ExtensionStreamGetter instance's mListener (an nsCOMPtr to a nsIStreamListener) to the nsInputStreamPump's AsyncOpen method
  • if the call to mPump's AsyncOpen did not fail, then we assume that nsInputStreamPump will be always taking care of notifying OnStartRequest/OnStopRequest to that nsIStreamListener
    • and so in ExtensionStreamGetter::Cancel we just call mPump's Cancel method if the mPump exists and we assumed that was always going to take care of calling OnStopRequest (after having called OnStartRequest if it didn't yet) as we do in ExtensionStreamGetter::Cancel if the cancellable request got cancelled before we actually got a stream
  • unfortunately, that isn't the case and there is still a chance that ExtensionStreamGetter::Cancel may be called after we successfully created the nsInputStreamPump and called its AsyncOpen method, but before that nsInputStreamPump has got to notify OnStartRequest to the nsIStreamListener passed to it (actually when the race behind this bug is triggered, we are getting to cancel the nsInputStreamPump before it ever got to OnInputStreamReady).

Locally I tried the following approach and confirmed that it consistently prevented this issue for me locally, and the resulting patch seems to be small and clear enough:

  • Tweak ExtensionStreamGetter to also implement nsIStreamListener
  • When handling an extension protocol's SimpleChannel that needs to retrieve the data through an asynchronous stream (in other words: "when we are loading a moz-extension url from an unpacked extension"), pass to mPump's AsyncOpen the ExtensionStreamGetter itself as a nsIStreamListener (and keeps the references to mListener and mChannel)
  • Under normal conditions:
    • the ExtensionStreamGetter will just forward to mListener the calls to OnStartRequest, OnStopRequest and OnDataAvailable,
    • once ExtensionStreamGetter::OnStopRequest is called, the stream getter instance can forget both the mListener and mChannel
  • Under the race that triggers this bug:
    • ExtensionStreamGetter keeps track internally if OnStartRequest has ever been called and so it knows if mPump did never start to pump any data when the channel is being cancelled
    • and so if ExtensionStreamGetter::Cancel is called before mPump called ExtensionStreamGetter::OnStartRequest, then we know that we need to take care of notify that to mListener (by explicitly calling its OnStartRequest/OnStopRequest, as we were already doing if get got to create the nsInputStreamPump) as part of the cancellation.

I'm going to attach to this issue a draft patch that implements this strategy in a minute, to evaluate with Nika and Valentin if this is a reasonable strategy also from their perspective.

Making sure that nsInputStreamPump will call mListener's OnStartRequest/OnStopRequest doesn't seem the right way to fix the issue

Flags: needinfo?(valentin.gosu)
Flags: needinfo?(nika)
Severity: -- → S2
Priority: -- → P1
Whiteboard: [addons-jira]

Hi Nika and Valentin,
In comment 22 I tried to provide a shorter summary of what is actually triggering this issue and a proposed approach to handle that in ExtensionStreamGetter (which is the approach drafted in the patch attached in comment 23).

In Comment 17 you can find some more details about the issue (in particular the STR I used to trigger it locally, and which classes of users may be impacted and what kind of impact we may expect based on the picture we got so far).

The motivations to consider backing out Bug 1706594 still stands, unless we are in agreement that the approach in the attached patch seems a reasonable approach and we prefer to aim to get that patch ready to be landed in the next few days (and before the merge to beta) instead.

Flags: needinfo?(valentin.gosu)
Flags: needinfo?(nika)
Status: REOPENED → ASSIGNED

(In reply to Luca Greco [:rpl] [:luca] [:lgreco] from comment #22)

Making sure that nsInputStreamPump will call mListener's OnStartRequest/OnStopRequest doesn't seem the right way to fix the issue

Why doesn't that feel like the right solution? In general nsIChannel/nsIRequest instances should always fire OnStartRequest/OnStopRequest pairs once they've been opened, and fixing the situation here should fix it for other users of nsInputStreamPump as well.

Do you know what code path is being taken in the nsInputStreamPump which is causing us to never end up firing the OnStartRequest/OnStopRequest pair?

Flags: needinfo?(nika) → needinfo?(lgreco)

(In reply to Nika Layzell [:nika] (ni? for response) from comment #27)

(In reply to Luca Greco [:rpl] [:luca] [:lgreco] from comment #22)

Making sure that nsInputStreamPump will call mListener's OnStartRequest/OnStopRequest doesn't seem the right way to fix the issue

Why doesn't that feel like the right solution? In general nsIChannel/nsIRequest instances should always fire OnStartRequest/OnStopRequest pairs once they've been opened, and fixing the situation here should fix it for other users of nsInputStreamPump as well.
Do you know what code path is being taken in the nsInputStreamPump which is causing us to never end up firing the OnStartRequest/OnStopRequest pair?

To be honest, my initial instinct was that too: nsInputStreamPump should be always fire OnStartRequest/OnStopRequest, because in the end it does technically implement nsIRequest.

Based on my current understanding ([1]) nsInputStreamPump's current implementation seems to be guarantying that only if the pump has at least got a call to OnInputStreamReady (which internally will call OnStartRequest/OnStopRequest based on the state transitioning of the pump).

If the pump never got a ready stream then mListener OnStartRequest/OnStopRequest will never be called (there doesn't seem to be any other way to get to them at the moment).
I haven't checked in the detail how many other ways nsInputStreamPump may still not calling mListener's OnStartReqeust/OnStopRequest methods, but I got the impression that nsInputStreamPump's OnInputStreamReady/OnStateStart/OnStateTransfer/OnStateStop may still not call those callbacks in certain error scenarios.

This is mainly what motivated me to also look into "how handling that in ExtensionStreamGetter would have looked like".

And so, more than "nsInputStreamPump doesn't seem the right place where the issue should be fixed" I should have wrote "fixing the issue in nsInputStreamPump may turn out to be harder than expected and may need some more rework of the logic class, and after that some more investigations in the side-effects that changes to that logic may be introducing (in case we may introduce some other regressions due to that)".

As additional side notes:

  • the option to backout Bug 1706594 from Firefox 95 (and spend some more time to look into other approaches to fix this scenario before relanding it in Firefox 96) is definitely still on the table

  • I completely agree that if we end up fixing this scenario in ExtensionStreamGetter, then we should also at least look if a similar scenario has to be handled in the other two classes that we had to adapt in Bug 1706594 (but I deferred that to evaluate if we have a better approach to look into), and we would still have the problem that we may introduce the same issue elsewhere in the future (and I can't exclude that the same issue may already exists somewhere else and a similar issue may be triggered due to that race).


[1]: I've still a lot learn about these components to really have a complete picture, and I assume that I can be wrong.

Flags: needinfo?(lgreco)

Nika and I continued to discuss about this issue over Matrix:

  • Nika did notice that RemoteLazyInputStream::Close doesn't seems to be handling closing the remote lazy input stream correctly
  • which is confirmed by trying to don't use the remote lazy input stream for the requests loading from a file as a quick experiment ([1])

And so we agreed to look into fixing RemoteLazyInputStream::Close first ([2]).

Also clearing the needinfo assigned to Valentin, while I'm looking into the approach Nika and I agreed on.


[1]: verified by applying a temporary 2 line change on ipc/glue/IPCStreamUtils.cpp and then checking that using the STR from comment 17 I couldn't reproduce the issue anymore

[2]: Bug 1706594 backout is not completely off the table, we may still have to do that as a last resort (e.g. if we need more time to investigate the issue and have a fix we are confident about).

Flags: needinfo?(valentin.gosu)
Attachment #9247501 - Attachment description: Bug 1735899 - Make sure ExtensionProtocolHandler notify mListener when a request has been canceled before nsInputStreamPump got to OnStartRequest. r?nika!,valentin! → WIP: Bug 1735899 - Make sure RemoteLazyInputStream::Close calls mInputStreamCallback OnInputStreamReady method. r?nika!

I've updated the attach patch to make it aim for the fix Nika suggested (as briefly described in comment 29).

I tried this version of the fix locally and confirmed that I was unable to trigger the issue using the STR from comment 17.

I'm currently pushed it to try to double-check if in its current form there is any suspicious test failure to look into:

See Also: → 1737783
Attachment #9247501 - Attachment description: WIP: Bug 1735899 - Make sure RemoteLazyInputStream::Close calls mInputStreamCallback OnInputStreamReady method. r?nika! → Bug 1735899 - Make sure RemoteLazyInputStream::Close calls mInputStreamCallback OnInputStreamReady method. r?nika!

Setting leave-open, Nika and I agreed to push the fix to autoland, but we will still try to write a test case based on the STR (I'll remove the leave-open and close this issue if the resulting test case isn't able to trigger the issue consistently enough).

Keywords: leave-open
Pushed by luca.greco@alcacoop.it:
https://hg.mozilla.org/integration/autoland/rev/b064af2922e4
Make sure RemoteLazyInputStream::Close calls mInputStreamCallback OnInputStreamReady method. r=nika

Luca, this is marked as P1/S2 and is a new regression in 95, should we uplift the fix in beta?

Flags: needinfo?(lgreco)

(In reply to Pascal Chevrel:pascalc from comment #34)

Luca, this is marked as P1/S2 and is a new regression in 95, should we uplift the fix in beta?

The patch landed (comment 33) is already in Firefox 95 (the same release where Bug 1706594 patch has also been landed).

I haven't closed the bugzilla issue yet because I want to try to reduce the STR into a test case and see if that is stable enough (and not too intermittent) to be landed in tree (and that test case could be eventually uplifted to beta).
Unfortunately I have not been able to get back to this yet (I was planning to try to get back to it by the end of this week).

I can definitely file a separate follow up if we want to close this issue as fixed in the meantime (especially if keeping it open doesn't make it clear what is the actual status of this).

Flags: needinfo?(lgreco)

Thanks, marking the status as fixed for firefox95 will remove that from our relman queries :)

This patch includes a new test file to cover a scenario similar to the one
fixed in Bug 1735899, and possibly some other similar ones that may slip
through unnoticed.

This test triggers the issue consistently on a build where the fix
is not included and pass consistently in build including the fix,
but even if it doesn't include any arbitrary timeouts to trigger the
issue there is still a chance that the test may start to fail
intermittently for some other reasons.

Keeping it as a separate test file will make it easier to disable it
as a temporary measure if turns out to be necessary.

While putting together this test I did set for myself a couple of base requirements:

  • try to not tight it too much to the issue we fixed, but instead try to make it potentially able to trigger and caught other similar issues due to races that may be triggered by starting to preload the popup on mouseover and then cancelling it on mouseout
  • no use of arbitrary timeout to trigger the mouseover-mouseout race
  • should fail consistently on builds without the fix
  • should pass consistently on builds with the fix

And the test attached in comment 37 does match all of those requirements (well, it doesn't have arbitrary timeouts but I had to use an imported css file with an arbitrary size big enough to make the issue we fixed more likely to be triggered with a less amount of mouseover/mouseout sequences, and an arbitrary amount of these mouseover/mouseout sequences), but I'm still a bit on the fence about landing it on central.

I looked into how stable the desired behavior of this test was going to be with some push to try:

I restricted these try pushes to only run on linux64 while I was still figuring out how stable the test was going to be, and so I pushed one more for win32 and macosx64 that is still running:

It is also worth a mention that this test does still fails in debug builds with and without the fix (due to some assertion failures and shutdown leaks that are being triggered often enough, and seems worth to be investigated further in a separate follow up because they may be an hint of other issues we may have not noticed yet), and so I guess there may be also some more that asan/tsan build may also be able to detect.

You need to log in before you can comment on or make changes to this bug.