Closed Bug 1513855 Opened 2 years ago Closed 1 year ago

Crash in AsyncShutdownTimeout | AddonManager: Waiting for providers to shut down. | EnvironmentAddonBuilder,XPIProvider

Categories

(Toolkit :: Add-ons Manager, defect, P2)

defect

Tracking

()

RESOLVED FIXED
mozilla70
Tracking Status
firefox-esr60 --- unaffected
firefox-esr68 --- fixed
firefox68 - wontfix
firefox69 + wontfix
firefox70 --- wontfix
firefox71 --- verified
firefox72 --- verified

People

(Reporter: robwu, Assigned: mixedpuppy)

References

Details

(Keywords: crash, regression, Whiteboard: webext?)

Crash Data

Attachments

(2 files, 2 obsolete files)

Following a report about a startup crash on Reddit [1], I looked for similar crash signatures on Soccoro [2] and found several crashes. I don't have reproduction steps, but the user on Reddit might be in a position to provide more information if needed.

The most useful information for debugging the report is in the "AsyncShutdownTimeout" field at the "metadata" tab of crash reports.
For example, bp-3c9424b7-aca0-43db-958c-f99910181213 has

{
    "phase": "AddonManager: Waiting for providers to shut down.",
    "conditions": [{
        "name": "XPIProvider",
        "state": "(none)",
        "filename": "resource://gre/modules/AddonManager.jsm",
        "lineNumber": 678,
        "stack": ["resource://gre/modules/AddonManager.jsm:_startProvider:678", "resource://gre/modules/AddonManager.jsm:startup:807", "resource://gre/modules/AddonManager.jsm:startup:2802", "jar:file:///C:/Program%20Files%20(x86)/Mozilla%20Firefox/omni.ja!/components/addonManager.js:observe:66"]
    }, {
        "name": "EnvironmentAddonBuilder",
        "state": "(none)",
        "filename": "resource://gre/modules/TelemetryEnvironment.jsm",
        "lineNumber": 502,
        "stack": ["resource://gre/modules/TelemetryEnvironment.jsm:init:502", "resource://gre/modules/TelemetryEnvironment.jsm:EnvironmentCache:896", "resource://gre/modules/TelemetryEnvironment.jsm:getGlobal:59", "resource://gre/modules/TelemetryEnvironment.jsm:get currentEnvironment:66", "jar:file:///C:/Program%20Files%20(x86)/Mozilla%20Firefox/omni.ja!/components/TelemetryStartup.js:annotateEnvironment:40", "jar:file:///C:/Program%20Files%20(x86)/Mozilla%20Firefox/omni.ja!/components/TelemetryStartup.js:TelemetryStartup.prototype.observe:30"]
    }]
}


[1]: https://www.reddit.com/r/firefox/comments/a5qru6/firefox_64_opens_blank_tab_and_crashes/
[2] https://crash-stats.mozilla.com/search/?signature=~AsyncShutdownTimeout%20%7C%20AddonManager%3A%20Waiting%20for%20providers%20to%20shut%20down.&date=%3E%3D2018-12-06T12%3A20%3A24.000Z&date=%3C2018-12-13T12%3A20%3A24.000Z&_sort=-date&_facets=signature&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-signature
Priority: -- → P3
(NI to determine if this is telemetry or not)
Flags: needinfo?(kmaglione+bmo)
Priority: P3 → --
Priority: -- → P2
Flags: needinfo?(kmaglione+bmo)

Added a new signature from nightly.

Crash Signature: Waiting for providers to shut down. | JSON store: writing data,XPIProvider] → Waiting for providers to shut down. | JSON store: writing data,XPIProvider] [@ AsyncShutdownTimeout | AddonManager: Waiting to start provider shutdown. | EnvironmentAddonBuilder]

I've stumbled on a few more shutdown crashes mentioning the AddonManager but I'm not sure if they're the same issue or not. See here and here.

See Also: → 1555632

looking at the crashing graph in 69.0b this seems to be regressing in volume again compared to 68 (which was already quite bad).

Jim - Do we have anyone that can investigate this further? See phillip's note in Comment 4.

Flags: needinfo?(jmathies)
Flags: needinfo?(jmathies)
Whiteboard: webext?

Getting late in the game to fix for 69, but agreed that it'd be good to do so if possible given the volume.

I have been able to reproduce this crash with the same reproduction outlined in bug 1381633, which is:

# create a new profile
./firefox -CreateProfile testprofile
# run with -silent
./firefox -P testprofile -silent

The first run didn't produce what I expected, but when I did it a second or third time, I got the crash.

I added some additional logging in DeferredTask.jsm to get a traceback on all the console output:

JavaScript error: resource://gre/modules/DeferredTask.jsm, line 195: Error: Unable to arm timer, the object has been finalized.

Which always points to save@resource://gre/modules/ExtensionParent.jsm:2014:27

These two tracebacks happen a lot, the first earlier seemingly during startup, the second later seemingly during shutdown (ie. they are not interleaved).

JavaScript error: resource://gre/modules/DeferredTask.jsm, line 195: Error: Unable to arm timer, the object has been finalized.
arm@resource://gre/modules/DeferredTask.jsm:195:79
save@resource://gre/modules/ExtensionParent.jsm:2014:27
get@resource://gre/modules/ExtensionParent.jsm:2112:20
async*load@resource://gre/modules/Schemas.jsm:3485:35
async*lazyInit/promise<@resource://gre/modules/ExtensionParent.jsm:196:33
observe@resource://gre/modules/AsyncShutdown.jsm:551:16

JavaScript error: resource://gre/modules/DeferredTask.jsm, line 195: Error: Unable to arm timer, the object has been finalized.
arm@resource://gre/modules/DeferredTask.jsm:195:79
save@resource://gre/modules/ExtensionParent.jsm:2014:27
get@resource://gre/modules/ExtensionParent.jsm:2112:20
async*parseManifest@resource://gre/modules/Extension.jsm:1834:35
loadManifest@resource://gre/modules/Extension.jsm:1003:12
loadManifest@resource://gre/modules/Extension.jsm:1850:32
startup@resource://gre/modules/Extension.jsm:2159:18
startup@resource://gre/modules/Extension.jsm:1496:27
callBootstrapMethod@resource://gre/modules/addons/XPIProvider.jsm:1789:33
startup@resource://gre/modules/addons/XPIProvider.jsm:1893:32
startup@resource://gre/modules/addons/XPIProvider.jsm:2462:39
callProvider@resource://gre/modules/AddonManager.jsm:214:31
_startProvider@resource://gre/modules/AddonManager.jsm:650:5
startup@resource://gre/modules/AddonManager.jsm:876:14
startup@resource://gre/modules/AddonManager.jsm:3468:26
observe@resource://gre/modules/addonManager.js:86:29

At the end I get:

JavaScript error: resource://gre/modules/ExtensionSettingsStore.jsm, line 117: Error: The ExtensionSettingsStore was accessed before the initialize promise resolved.
WARNING: At least one completion condition is taking too long to complete. Conditions: [{"name":"EnvironmentAddonBuilder","state":"(none)","filename":"resource://gre/modules/TelemetryEnvironment.jsm","lineNumber":596,"stack":["resource://gre/modules/TelemetryEnvironment.jsm:init:596","resource://gre/modules/TelemetryEnvironment.jsm:EnvironmentCache:1037","resource://gre/modules/TelemetryEnvironment.jsm:getGlobal:94","resource://gre/modules/TelemetryEnvironment.jsm:get currentEnvironment:101","resource://gre/modules/TelemetryStartup.jsm:annotateEnvironment:47","resource://gre/modules/TelemetryStartup.jsm:TelemetryStartup.prototype.observe:34"]}] Barrier: AddonManager: Waiting to start provider shutdown.
WARNING: At least one completion condition is taking too long to complete. Conditions: [{"name":"AddonManager: shutting down.","state":[{"name":"AddonManager: Waiting to start provider shutdown.","state":[{"name":"EnvironmentAddonBuilder","state":"(none)","filename":"resource://gre/modules/TelemetryEnvironment.jsm","lineNumber":596,"stack":["resource://gre/modules/TelemetryEnvironment.jsm:init:596","resource://gre/modules/TelemetryEnvironment.jsm:EnvironmentCache:1037","resource://gre/modules/TelemetryEnvironment.jsm:getGlobal:94","resource://gre/modules/TelemetryEnvironment.jsm:get currentEnvironment:101","resource://gre/modules/TelemetryStartup.jsm:annotateEnvironment:47","resource://gre/modules/TelemetryStartup.jsm:TelemetryStartup.prototype.observe:34"]}]},{"name":"AddonManager: Waiting for providers to shut down.","state":"Not started"},{"name":"AddonRepository: async shutdown","state":""}],"filename":"resource://gre/modules/AddonManager.jsm","lineNumber":866,"stack":["resource://gre/modules/AddonManager.jsm:startup:866","resource://gre/modules/AddonManager.jsm:startup:3468","resource://gre/modules/addonManager.js:observe:86"]}] Barrier: profile-before-change
JavaScript error: resource://gre/modules/UpdateTimerManager.jsm, line 349: TypeError: setting getter-only property "gLogEnabled"
JavaScript error: resource://gre/modules/ModulesPing.jsm, line 66: NS_ERROR_XPC_JAVASCRIPT_ERROR_WITH_DETAILS: [JavaScript Error: "setting getter-only property "gLogEnabled"" {file: "resource://gre/modules/UpdateTimerManager.jsm" line: 349}]'[JavaScript Error: "setting getter-only property "gLogEnabled"" {file: "resource://gre/modules/UpdateTimerManager.jsm" line: 349}]' when calling method: [nsIUpdateTimerManager::registerTimer]
FATAL ERROR: AsyncShutdown timeout in AddonManager: Waiting to start provider shutdown. Conditions: [{"name":"EnvironmentAddonBuilder","state":"(none)","filename":"resource://gre/modules/TelemetryEnvironment.jsm","lineNumber":596,"stack":["resource://gre/modules/TelemetryEnvironment.jsm:init:596","resource://gre/modules/TelemetryEnvironment.jsm:EnvironmentCache:1037","resource://gre/modules/TelemetryEnvironment.jsm:getGlobal:94","resource://gre/modules/TelemetryEnvironment.jsm:get currentEnvironment:101","resource://gre/modules/TelemetryStartup.jsm:annotateEnvironment:47","resource://gre/modules/TelemetryStartup.jsm:TelemetryStartup.prototype.observe:34"]}] At least one completion condition failed to complete within a reasonable amount of time. Causing a crash to ensure that we do not leave the user with an unresponsive process draining resources.

And the OSX crash results are the same as in this bug.

Based on the last output, I think that the change in bug 1379831 did not accomplish (consistently) what was intended, but am unsure still.

@aswan, any thoughts on comment 7?

Flags: needinfo?(aswan)

Running with -silent does not result in the quit application granted notification.
Without that notification, we have a deadlock in EnvironmentAddonBuilder where it blocks
beforeShutdown waiting on the XPIDatabase to load.

Basically, if startup happens and we do not get the quit-application-granted notification, we deadlock then crash. -silent does this as well as a few other paths in commandline handling.

Flags: needinfo?(aswan)
Assignee: nobody → mixedpuppy
Whiteboard: webext?
Pushed by scaraveo@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/61dc2ad51719
ensure we notify quit application when running silent r=kmag
Pushed by scaraveo@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/0fd656a563f0
ensure we notify quit application when running silent r=kmag
Status: NEW → RESOLVED
Closed: 1 year ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla70

the crash volume on nightly hasn't noticeably decreased since the patch landed - the str from comment #7 maybe are just describing an edge-case for this crash signature. should we attempt to uplift the fix to 69 beta anyway to see if it has an impact for the top crasher there and/or reopen the bug?

(In reply to [:philipp] from comment #16)

maybe are just describing an edge-case for this crash signature.

That is highly possible.

should we attempt to uplift the fix to 69 beta anyway to see if it has an impact for the top crasher there and/or reopen the bug?

I don't think it would hurt do to that. Keeping this bug open is probably good as well.

Flags: needinfo?(mixedpuppy)
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Flags: qe-verify+

Comment on attachment 9085311 [details]
Bug 1513855 ensure we notify quit application when running silent

Beta/Release Uplift Approval Request

  • User impact if declined: potential reduction in crashes.

This didn't fix the nightly crash rate, wondering if uplift would be ok to see if it has any affect on the beta crash rate (which is higher than nightly).

  • Is this code covered by automated tests?: No
  • Has the fix been verified in Nightly?: No
  • Needs manual test from QE?: Yes
  • If yes, steps to reproduce: - create a new profile (don't run firefox)
  • run firefox with command line arg -silent

secondary str

  • verify running firefox -search foo works properly after patch applied
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): Small change in shutdown to make sure shutdown notifications are sent
  • String changes made/needed: none
Attachment #9085311 - Flags: approval-mozilla-beta?

Comment on attachment 9085311 [details]
Bug 1513855 ensure we notify quit application when running silent

Might as well see what affect it has on Beta. Approved for 69.0b15.

Attachment #9085311 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
QA Whiteboard: [qa-triaged]
Duplicate of this bug: 1575215

unfortunately the patch hasn't made a sizeable difference in the beta crash volume as expected, so i'll change the state back to affected.

Adding the QE+ flagg back, confirmed with :philipp.

Fwiw, managed to confirm the issue with 62.0.8 on Windows 10.
The current patch was checked using steps from comment #7 with 70.0a1 (2019-08-22), 69.0b15 on Windows 10, macOS 10.14, Ubuntu 16.0.4 and I did not encounter the crash at all.

Flags: qe-verify+

Comment on attachment 9085311 [details]
Bug 1513855 ensure we notify quit application when running silent

Clearing the approval so this doesn't show up on the needs-uplift queries.

Attachment #9085311 - Flags: approval-mozilla-beta+

There is a touch of this happening on Thunderbird as well. When I look at the crash report by clicking through on the graph itself, Thunderbird is the larger portion of the daily crashes. It may be good for someone over there to look at the patch on this bug and see if there are any shutdown paths that are not causing the quit-application notification.

Flags: needinfo?(mkmelin+mozilla)

Thx. Ben, please have a look at some point.

Flags: needinfo?(mkmelin+mozilla)

Curious if there's something more we can do here. I know we've been chipping away at these shutdown issues.

Whiteboard: webext?

Looking from 9/5 to now (past week), most of the crashes seem to be v60 or earlier. Out of 971 results, roughly 20 are Firefox >= 67 WINNT, only one on 69, and none on 68 release.

Thunderbird 60 is a big chunk of the results, as is Firefox 60. I suspect some earlier changes Kris did largely fixed the problem, my patch here just caught an STR that isn't normal.

edit: strange that my search provides different results than the table after the graph, looking more.

(In reply to [:philipp] from comment #29)

have you looked into all signatures?

yes, after realizing again that linking from the graph does not do that (thus my edit in the comment)

I spent a lot of time yesterday eyeballing code related to this. I've no idea how to reproduce it. However, the shutdown hang is happening somewhere here[1]. _shutdownBlocker, which the reports points to, returns the _pendingTask after a couple other items are done.

A small sampling of the crash reports (manually looking at 20-30 of them) show that many (but certainly not all) happen in the first 60-90 seconds of a session. Most but not all crashes are WINNT. Many show that firefox is non-default browser, but not all. I've no idea if it's possible to get any correlation data on uptime and default browser setting.

In any case, my suspicion is that there is some startup to pass a url to a running firefox, which then quickly exits. This is the code path my prior patch touched, but there could be something windows specific I am unaware of.

I'm not certain where to go from here at this time.

[1] https://searchfox.org/mozilla-release/rev/81e7b8ebd4a907d11ae3794cb9ba705e47e39060/toolkit/components/telemetry/app/TelemetryEnvironment.jsm#598-616

Aaron, Jim mentioned the security boostrap work as we were discussing this bug, so I was wondering if you might have some input. I think this crash is still a similar issue as one path I fixed.

Comment 31 above has some basic info that you could read, but the essence of what I think the problem is: "my suspicion is that there is some startup to pass a url to a running firefox, which then quickly exits". There is a potential deadlock in this situation and I fixed one path on the patch that landed in august. Basically if we don't do the quit notification, we get a deadlock and this crash is forced to happen.

Does this sound like a potential problem with the security bootstrap?

Flags: needinfo?(aklotz)

I'm a little hazy on the problem here, but the launcher process should be unaffected by this.

Flags: needinfo?(aklotz)

So I can see this crash with the recent Firefox Nightly build on MacOS. See bp-36181aa7-dc3e-4fcb-b909-c882c0191019.

What happens for me is that when I have selected the profile inside of the profile manager Firefox doesn't startup. Sometimes I see the icon in the dock, but other times even not that. Then after a minute of hanging the crash reporter comes up.

Is that 100% reproducible for you? This sounds in line with one Firefox process quitting and possibly not calling Services.startup.quit. I'm not able to reproduce this with the profile manager as you describe.

Flags: needinfo?(hskupin)

Maybe mossop will know some details about the shutdown in this case (comment 35 and 36)

Flags: needinfo?(dtownsend)

Yes, it reproduces 100% of the time for this particular profile. When I'm back at work tomorrow I will make a copy of it, and try to nail down the problem. Leaving the needinfo request for now.

Hah, I think that I found at least my problem. So in the shell the CWD is eg. ~/foobar. Then in the Finder I removed that folder. When I now start Firefox from the shell, which operates on a folder which doesn't exist anymore, the above failure can be seen. If CWD still exists the problem is not visible.

Flags: needinfo?(hskupin)

Actually on MacOS you will also have to empty the Trash, otherwise the shell will refer to the deleted folder from inside the Trash. Running a debug version of Firefox it looks to be related to the startup cache:

[32842, Main Thread] WARNING: 'result.isErr()', file /builds/worker/workspace/build/src/startupcache/StartupCache.cpp, line 173
[32842, Main Thread] WARNING: dependent window created without a parent: file /builds/worker/workspace/build/src/toolkit/components/startup/nsAppStartup.cpp, line 631
++DOCSHELL 0x1130e6000 == 1 [pid = 32842] [id = {9dd8b9ce-ed4a-5448-9d6e-ec900194cb67}]
++DOMWINDOW == 1 (0x10ff25200) [pid = 32842] [serial = 1] [outer = 0x0]
++DOMWINDOW == 2 (0x115345000) [pid = 32842] [serial = 2] [outer = 0x10ff25200]
[32842, Main Thread] WARNING: 'result.isErr()', file /builds/worker/workspace/build/src/startupcache/StartupCache.cpp, line 173

https://searchfox.org/mozilla-central/rev/8512c64c65031321381d12c8f58cc22d5af15bf1/startupcache/StartupCache.cpp#173

Note that the above might only be one situation when this bug could appear, but there might also be others. Especially if that crash is happening a lot, I doubt that those users have the same situation.

(In reply to Shane Caraveo (:mixedpuppy) from comment #36)

Is that 100% reproducible for you? This sounds in line with one Firefox process quitting and possibly not calling Services.startup.quit. I'm not able to reproduce this with the profile manager as you describe.

We don't call appStartup.quit when quiting the profile manager to load the selected profile.

Flags: needinfo?(dtownsend)

This patch avoids blocking on app quit notifications which are not sent in all
situations. profileChangeTeardown always happens from nsXREDirProvider::DoShutdown.

This forces a quit to occur if it had not been called by this point.

The patch in comment 43 occasionally causes Firefox to want to enter safe mode on the next startup.
The patch in comment 44 occasionally causes Firefox to show the restore session tab on the next startup.

Both cases above are tested using the -silent command line to force the situation.

So it seems that both of these are probably too late. It seems that profileChangeTeardown is too late to shutdown stuff in the AddonManager, and ScopedXPCOMStartup destructor is too late to call quit.

Both remove the deadlock more generically, but I'm still having a problem getting the right timing.

I could probably use some advice from someone who might be more versed in various shutdown stuff. I've got two approaches to trying to deal with this deadlock but neither are completely right.

The second patch could also run into some additional complications due to some windows code that does a hacky shutdown: https://searchfox.org/mozilla-release/rev/167e8d714a6d7cb46cb111cd583b8d5d95318531/widget/windows/nsWindow.cpp#4951-4974

Flags: needinfo?(dtownsend)

We only run the main event loop if at least one window is open. There are various
cases where this doesn't happen, such as -silent. By the time we get there though
some components have started and require appropriate notifications to tell them
to shutdown safely. This patch sends the quit-application-granted and
quit-application notifications in the case that there were no windows open
and nsAppStartup::Quit hasn't already been called when we come to run the
event loop.

I'm not going to claim I know all the ins and outs of shutdown and particularly the async shutdown stuff. The entire startup and shutdown processes are frankly a mess held together with duct tape and bubble gum. But for the -silent case what is happening is we are not entering the event loop because no windows were opened at startup but Quit was never called so the notifications it sends are not sent. This should be a pretty low risk patch that sends those notifications out in that case. I haven't done a great deal of testing with it (and I wouldn't expect it to have any impact on our automated tests), but for the -silent case it goes from logging a whole bunch of exceptions (none that match this bug!) to logging no abnormal exceptions, so that seems like a step in the right direction.

Flags: needinfo?(dtownsend)

Dave, we could have a Marionette restart test for this particular case. So we would have this situation covered by a test. Examples can be found here:

https://searchfox.org/mozilla-central/source/testing/marionette/harness/marionette_harness/tests/unit/test_quit_restart.py

Part of the issue is that I'm not sure how it's being produced on windows. The -silent command line is one way I was able to produce the deadlock on osx, but the prior fix did not address the deadlock on windows. Thus...looking at how to otherwise prevent the deadlock (starting at comment 43).

I'll get around to testing the patch from Dave hopefully this week.

(In reply to Henrik Skupin (:whimboo) [⌚️UTC+2] from comment #49)

Dave, we could have a Marionette restart test for this particular case. So we would have this situation covered by a test. Examples can be found here:

https://searchfox.org/mozilla-central/source/testing/marionette/harness/marionette_harness/tests/unit/test_quit_restart.py

I'm not sure how. Aside from the fact that we don't seem to know what is triggering this my understanding is that marionette tests run against a running browser and this failure occurs when the browser never fully starts up.

It doesn't fully startup because of the -silent argument but otherwise it does? But as Shane also wrote it looks like not always reproducible. So maybe not really testable in automation.

(In reply to Henrik Skupin (:whimboo) [⌚️UTC+2] from comment #52)

It doesn't fully startup because of the -silent argument but otherwise it does? But as Shane also wrote it looks like not always reproducible. So maybe not really testable in automation.

Shane's working theory is that the quit-application notifications are not being sent. As far as I can tell the only way for that to be the case is if we never enter the main event loop.

I know it is a deadlock between asyncshutdown blockers in telemetry[1] and the addonmanager[2] (which waits on quit). Calling quit in my first patch addressed the -silent switch, but that isn't what is causing all the windows deadlock/crashes. I'm building with the latest patch now.

[1] https://searchfox.org/mozilla-release/rev/df950fbe858cb644a71b75fe3a8e3bcb0487b478/toolkit/components/telemetry/app/TelemetryEnvironment.jsm#596

[2] https://searchfox.org/mozilla-release/rev/df950fbe858cb644a71b75fe3a8e3bcb0487b478/toolkit/mozapps/extensions/internal/XPIProvider.jsm#2483

Ok, using D51736 and running ./mach run -P someprofile -silent then running again without -silent results in the restore session tab occasionally being shown.

Since I don't know how to replicate the windows crashes, I'm uncertain if we'd have the same issue in real life.

(In reply to Shane Caraveo (:mixedpuppy) from comment #55)

Ok, using D51736 and running ./mach run -P someprofile -silent then running again without -silent results in the restore session tab occasionally being shown.

Does the same happen without the patch?

I wouldn't expect this change to affect this directly. It would be worth checking the contents of sessionCheckpoints.json in such a case. One thing to note is that we determine if there was a crash based on the "sessionstore-final-state-write-complete" checkpoint having been stored in the file (https://searchfox.org/mozilla-central/rev/8b7aa8af652f87d39349067a5bc9c0256bf6dedc/browser/components/sessionstore/SessionStartup.jsm#227-228) and we don't block shutdown on this happening (https://searchfox.org/mozilla-central/rev/8b7aa8af652f87d39349067a5bc9c0256bf6dedc/toolkit/components/crashmonitor/CrashMonitor.jsm#200-205). It wouldn't surprise me if shutdown was too fast in the case that we haven't actually started up fully and most of the browser services haven't been initialized.

Without the patch you get a hang then a hard crash (which we force to get out of the hang). With the patch, there is no crash, at least I don't get the crash dialog, and there are no errors to the console.

ie. without the patch, and removing my prior change to call quit for -silent

So, looking at this, we need "sessionstore-final-state-write-complete" to be written.

When quit is granted, RunState is updated[1]. So long as it is "closing", the file should be written[2].

[1] https://searchfox.org/mozilla-central/rev/d061ba55ac76f41129618d638f4ef674303ec103/browser/components/sessionstore/SessionStore.jsm#870
[2] https://searchfox.org/mozilla-central/rev/d061ba55ac76f41129618d638f4ef674303ec103/browser/components/sessionstore/SessionFile.jsm#431

During the write, a blocker for profileBeforeChange is setup[3], in which the write-complete notification is sent.

[3] https://searchfox.org/mozilla-central/rev/d061ba55ac76f41129618d638f4ef674303ec103/browser/components/sessionstore/SessionFile.jsm#478

The crash reporter uses the same profileBeforeChange blocker to write it's notifications received[4].

[4] https://searchfox.org/mozilla-central/rev/d061ba55ac76f41129618d638f4ef674303ec103/toolkit/components/crashmonitor/CrashMonitor.jsm#176

So, is there a potential for a race given both are using profileBeforeChange? Seems like that would explain why session restore only happens sometimes (in my testing, probably more than half the time, but not always).

(In reply to Shane Caraveo (:mixedpuppy) from comment #58)

So, looking at this, we need "sessionstore-final-state-write-complete" to be written.

When quit is granted, RunState is updated[1]. So long as it is "closing", the file should be written[2].

[1] https://searchfox.org/mozilla-central/rev/d061ba55ac76f41129618d638f4ef674303ec103/browser/components/sessionstore/SessionStore.jsm#870
[2] https://searchfox.org/mozilla-central/rev/d061ba55ac76f41129618d638f4ef674303ec103/browser/components/sessionstore/SessionFile.jsm#431

During the write, a blocker for profileBeforeChange is setup[3], in which the write-complete notification is sent.

[3] https://searchfox.org/mozilla-central/rev/d061ba55ac76f41129618d638f4ef674303ec103/browser/components/sessionstore/SessionFile.jsm#478

This is all pretty tricky to reason about...

So the profileBeforeChange blocker there blocks profileBeforeChange on the worker completing writing sessionstore.jsonlz4. The dispatch of the sessionstore-final-state-write-complete will be scheduled to occur in the next microtask after the blocking promise resolves (https://searchfox.org/mozilla-central/rev/d061ba55ac76f41129618d638f4ef674303ec103/browser/components/sessionstore/SessionFile.jsm#494). I think that this will be before the AsyncShutdown event loop for profile-before-change terminates but it might be worth checking.

Either way, that means that we start the asynchronous process of writing out the final checkpoint: https://searchfox.org/mozilla-central/rev/d061ba55ac76f41129618d638f4ef674303ec103/toolkit/components/crashmonitor/CrashMonitor.jsm#196 but we don't wait for it in any specific way since CrashMonitor's profileBeforeChange gets resolved once the "profile-before-change" checkpoint is written.

So now it's a race of the write of the final checkpoint complete against thread shutdown. Thread shutdown happens pretty quickly after the profile-before change notification completes (https://searchfox.org/mozilla-central/source/toolkit/xre/nsAppRunner.cpp#1233).

OS.File blocks profile-before-change on completing IO but in this case we're posting something after profile-before-change has started so I'm not sure that will save us. But OS.File also blocks thread shutdown until all scheduled events are complete. writeAtomic ends up getting posted to the stream transport service's thread pool and on thread shutdown that also completes all pending requests. But OS.File spins an event loop after it receives the thread shutdown notification, depending on notification ordering the stream transport service may have already received the thread shutdown notification and so will reject a writeAtomic event posted after that point. Without checking I'd actually expect the stream transport service to get the notification much earlier than any JS.

Basically there is a lot of async and threaded stuff going on that will determine whether the final checkpoint gets written out to sessionCheckpoints.json or not. In a Firefox run that has loaded a lot of stuff and so will need to shutdown a lot of stuff we likely win the race every time. In a -silent run it may be more dicey.

All to say, if we're going to use that final checkpoint to decide whether the last run was a crash or not and we care about never showing the crash page by accident we really should explicitly block shutdown on writing the checkpoint.

In my tests, for whatever reason -silent doesn't crash without your fix in place. And when running with -silent that final checkpoint doesn't get written to sessionCheckpoints.json and does when I start up fully.

In fact it isn't written for me even with your fix in place.

Ah. It looks like the final session store write never happens anyway during -silent.

In order to trigger the write _sessionInitialized should be true: https://searchfox.org/mozilla-central/source/browser/components/sessionstore/SessionStore.jsm#844
We only set it to true in response to a browser window opening: https://searchfox.org/mozilla-central/source/browser/components/sessionstore/SessionStore.jsm#1526

I meant to write more...but had lunch first. I also see that the session file is never written in this case. So there is something that I still don't see that is causing the session restore tab to sometimes appear. The file is getting written but because sessionstore is never started, sessionstore-final-state-write-complete is never sent either.

So, I never see sessionstore-final-state-write-complete in the file when running silent, which seems to indicate we should always get a session restore tab, but we don't always get it. There's some weird logic in deciding whether to show it.

However, what I see is that we would only ever write the session file if we've run a session. And in that case, we should have sessionstore-windows-restored in the file as well. So I'm thinking we should check both here:

https://searchfox.org/mozilla-central/rev/8b7aa8af652f87d39349067a5bc9c0256bf6dedc/browser/components/sessionstore/SessionStartup.jsm#227-228

this._previousSessionCrashed = checkpoints["sessionstore-windows-restored"] && !checkpoints[
  "sessionstore-final-state-write-complete"
];

The other presumption I make here is that -silent is essentially synonymous with any startup that does not run a full session. Perhaps that is passing a url over to an already running instance, or something else. In those cases, the already running session should typically close correctly, in which case this file will be fine. So perhaps we don't need to do anything (like I suggest in comment 62).

(In reply to Shane Caraveo (:mixedpuppy) from comment #63)

The other presumption I make here is that -silent is essentially synonymous with any startup that does not run a full session. Perhaps that is passing a url over to an already running instance, or something else. In those cases, the already running session should typically close correctly, in which case this file will be fine. So perhaps we don't need to do anything (like I suggest in comment 62).

That is not a valid assumption. -silent gets us all the way to running command line handlers which is about the last step before spinning the event loop to load the first window. There are a number of other cases that cause us to bail out earlier than that, everywhere from not even bringing up XPCOM (which wouldn't cause this kind of problem because async shutdown is never even initialised) all the way up to where -silent does it.

So I can reproduce the sessionstore page showing up with or without the existing patch applied with no apparent crashing going on which leads me to believe that it is likely a separate issue to this crash. I think we should land the patch I have and see what effect it has on the metrics and figure out the sessionstore issue separately. Do you disagree?

Flags: needinfo?(mixedpuppy)

(In reply to Dave Townsend [:mossop] (he/him) from comment #64)

Do you disagree?

No.

Flags: needinfo?(mixedpuppy)

(In reply to Shane Caraveo (:mixedpuppy) from comment #62)

I meant to write more...but had lunch first. I also see that the session file is never written in this case. So there is something that I still don't see that is causing the session restore tab to sometimes appear. The file is getting written but because sessionstore is never started, sessionstore-final-state-write-complete is never sent either.

That's interesting. Would that mean if sessionstore never gets started we also don't see a sessionstore-windows-restored notification, or others which depend on it? That's a situation which we have seen for Marionette for a very long time now. See bug 1382162. Nowadays we have a custom marionette-startup-requested observer notification but that also sometimes aren't getting fired. I wonder if those things are somewhat related.

Flags: needinfo?(mixedpuppy)
Pushed by dtownsend@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/40f4f1079552
Make sure that quit-application-granted and quit-application notifications are always sent. r=froydnj

(In reply to Henrik Skupin (:whimboo) [⌚️UTC+2] from comment #66)

(In reply to Shane Caraveo (:mixedpuppy) from comment #62)

I meant to write more...but had lunch first. I also see that the session file is never written in this case. So there is something that I still don't see that is causing the session restore tab to sometimes appear. The file is getting written but because sessionstore is never started, sessionstore-final-state-write-complete is never sent either.

That's interesting. Would that mean if sessionstore never gets started we also don't see a sessionstore-windows-restored notification, or others which depend on it?

Correct. -silent specifically works as a test case for me because it never starts a window, thus sessionstore also is never started.

Flags: needinfo?(mixedpuppy)
Status: REOPENED → RESOLVED
Closed: 1 year ago1 year ago
Resolution: --- → FIXED
Attachment #9105321 - Attachment is obsolete: true
Attachment #9105446 - Attachment is obsolete: true

Comment on attachment 9106375 [details]
Bug 1513855: Make sure that quit-application-granted and quit-application notifications are always sent. r=froydnj

Beta/Release Uplift Approval Request

  • User impact if declined: Occasional crashes
  • Is this code covered by automated tests?: No
  • Has the fix been verified in Nightly?: No
  • Needs manual test from QE?: No
  • If yes, steps to reproduce: We only have one STR on osx for this, which the patch covers.
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): The patch adds an additional quit which doesn't do anything if quit was already done. Given the low crash rate on nightly, I'd like to uplift this and see if it has any effect on beta.
  • String changes made/needed: none
Attachment #9106375 - Flags: approval-mozilla-beta?
Attachment #9085311 - Flags: approval-mozilla-beta?
Attachment #9085311 - Flags: approval-mozilla-beta?

Comment on attachment 9106375 [details]
Bug 1513855: Make sure that quit-application-granted and quit-application notifications are always sent. r=froydnj

Fix attempt for a high volume beta and release shutdown crash, uplift approved for 71 beta 10, let's see if this simple patch has an effect on crashes, thanks.

Attachment #9106375 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

Marking as verified with the current builds: 71.0b10(20191113144733), 72.0a1 (2019-11-13) on Windows10, macOS 10.13, Ubuntu 18.04.
Can confirm that the startup restore_issue noticed is no longer encountered even after 2-3 -silent starts.

Initial data from 71.0b10 looks like there may indeed be a drop in crashes with this signature. We may want to keep this on the radar for ESR68 uplift as it's also showing a pretty high incidence rate. The patch grafts cleanly as-landed.

Dave, should we consider this for ESR68 uplift?

Flags: needinfo?(dtownsend)

Comment on attachment 9106375 [details]
Bug 1513855: Make sure that quit-application-granted and quit-application notifications are always sent. r=froydnj

ESR Uplift Approval Request

  • If this is not a sec:{high,crit} bug, please state case for ESR consideration: We're seeing a drop in the incidence of crashes from this fix.
  • User impact if declined: User may see crashes.
  • Fix Landed on Version: 70
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): The fix is very straightforward and just causes some additional notifications to be sent out on shutdown.
  • String or UUID changes made by this patch: None
Flags: needinfo?(dtownsend)
Attachment #9106375 - Flags: approval-mozilla-esr68?

Comment on attachment 9106375 [details]
Bug 1513855: Make sure that quit-application-granted and quit-application notifications are always sent. r=froydnj

Improves our AsyncShutdown crashes a bit. Approved for 68.4esr.

Attachment #9106375 - Flags: approval-mozilla-esr68? → approval-mozilla-esr68+
You need to log in before you can comment on or make changes to this bug.