Crash in [@ AsyncShutdownTimeout | AddonManager: Waiting to start provider shutdown. | EnvironmentAddonBuilder]
Categories
(Toolkit :: Telemetry, defect, P1)
Tracking
()
Tracking | Status | |
---|---|---|
firefox-esr68 | --- | wontfix |
firefox-esr78 | --- | wontfix |
firefox72 | --- | wontfix |
firefox73 | --- | wontfix |
firefox74 | --- | wontfix |
firefox75 | --- | wontfix |
firefox76 | --- | wontfix |
firefox77 | --- | wontfix |
firefox80 | --- | wontfix |
firefox81 | --- | wontfix |
firefox82 | --- | wontfix |
firefox83 | --- | wontfix |
firefox84 | --- | fixed |
People
(Reporter: mconley, Assigned: robwu)
References
(Regression)
Details
(Keywords: crash, regression, topcrash)
Crash Data
Attachments
(1 file)
This bug is for crash report bp-ce0b31ac-8428-4957-8cf7-98f5a0191205.
Top 10 frames of crashing thread:
0 mozglue.dll mozalloc_abort memory/mozalloc/mozalloc_abort.cpp:33
1 xul.dll static void Abort xpcom/base/nsDebugImpl.cpp:442
2 xul.dll NS_DebugBreak xpcom/base/nsDebugImpl.cpp
3 xul.dll nsresult nsDebugImpl::Abort xpcom/base/nsDebugImpl.cpp:134
4 xul.dll XPTC__InvokebyIndex
5 @0x23d477763ff
6 xul.dll trunc
7 xul.dll trunc
8 xul.dll trunc
9 xul.dll static bool XPCWrappedNative::CallMethod js/xpconnect/src/XPCWrappedNative.cpp:1149
This appears to be an AsyncShutdownTimeout crash. Here's one of the AsyncShutdown status payloads:
{"phase":"AddonManager: Waiting to start provider shutdown.","conditions":[{"name":"EnvironmentAddonBuilder","state":"(none)","filename":"resource://gre/modules/TelemetryEnvironment.jsm","lineNumber":608,"stack":["resource://gre/modules/TelemetryEnvironment.jsm:init:608","resource://gre/modules/TelemetryEnvironment.jsm:EnvironmentCache:1049","resource://gre/modules/TelemetryEnvironment.jsm:getGlobal:99","resource://gre/modules/TelemetryEnvironment.jsm:get currentEnvironment:106","resource://gre/modules/TelemetryStartup.jsm:annotateEnvironment:47","resource://gre/modules/TelemetryStartup.jsm:TelemetryStartup.prototype.observe:34"]}]}
Comment 1•4 years ago
|
||
The priority flag is not set for this bug.
:chutten, could you have a look please?
For more information, please visit auto_nag documentation.
Comment 2•4 years ago
|
||
Well that's weird. The report says the browser was open nearly 8min but it hadn't completed "profile-after-change"? (That's what the observer in TelemetryStartup
is listening for. It's supposed to annotate the crash environment inside "profile-after-change".
(( There has been (quiet, vague) support for moving this out of "profile-after-change" (see bug 1553540), but nothing's ever required it. Maybe this would satisfy the activation energy requirements. ))
Not really sure how to investigate this, but it seems like it has enough volume that I can't just hope it goes away : |
Updated•4 years ago
|
Comment 3•4 years ago
|
||
After an OOM crash, this is the #2 overall top crash in 72.0.1. Volume did go down between 70 and 71 for this signature, and it is still fairly early in 72 cycle but we have accumulated around 1700 crashes so far.
Comment 4•4 years ago
|
||
This also looks similar to Bug 1513855, which was fixed in the 70 cycle.
Updated•4 years ago
|
Updated•4 years ago
|
Comment 5•4 years ago
|
||
(In reply to Chris H-C :chutten from comment #2)
Well that's weird. The report says the browser was open nearly 8min but it hadn't completed "profile-after-change"? (That's what the observer in
TelemetryStartup
is listening for. It's supposed to annotate the crash environment inside "profile-after-change".(( There has been (quiet, vague) support for moving this out of "profile-after-change" (see bug 1553540), but nothing's ever required it. Maybe this would satisfy the activation energy requirements. ))
Not really sure how to investigate this, but it seems like it has enough volume that I can't just hope it goes away : |
This is the second most frequent asyncshutdown crash, so I poked at this a bit. I'm going to just outline what I'm seeing and you can correct me where I'm missing something.
We're hanging because telemetry added a shutdown blocker to add-on manager code. The add-on manager will wait for all of these blockers at https://searchfox.org/mozilla-central/rev/9ad88f80aeedcd3cd7d7f63be07f577861727054/toolkit/mozapps/extensions/AddonManager.jsm#1035
Telemetry's blocker is added at https://searchfox.org/mozilla-central/rev/9ad88f80aeedcd3cd7d7f63be07f577861727054/toolkit/components/telemetry/app/TelemetryEnvironment.jsm#540-542 , and corresponds to _pendingTask
, set just below it. That does the following:
try {
// Gather initial addons details
await this._updateAddons(true);
if (!this._environment._addonsAreFull) {
// The addon database has not been loaded, wait for it to
// initialize and gather full data as soon as it does.
await AddonManagerPrivate.databaseReady;
// Now gather complete addons details.
await this._updateAddons();
}
} catch (err) {
this._environment._log.error("init - Exception in _updateAddons", err);
} finally {
this._pendingTask = null;
}
It also gets re-set at https://searchfox.org/mozilla-central/rev/9ad88f80aeedcd3cd7d7f63be07f577861727054/toolkit/components/telemetry/app/TelemetryEnvironment.jsm#633-650, to be:
this._pendingTask = this._updateAddons().then(
result => {
this._pendingTask = null;
if (result.changed) {
this._environment._onEnvironmentChange(
changeReason,
result.oldEnvironment
);
}
},
err => {
this._pendingTask = null;
this._environment._log.error(
"_checkForChanges: Error collecting addons",
err
);
}
);
Some of these crashes have add-ons data (e.g. https://crash-stats.mozilla.org/report/index/8bd574ef-73df-4db4-91d4-ddc870200608#tab-telemetryenvironment ) and some do not (e.g. https://crash-stats.mozilla.org/report/index/cb329dd7-983e-47a0-91aa-54c270200608#tab-telemetryenvironment ). Though note that the add-ons data we do get is incomplete, ie it records dummy entries for the GMP and NPAPI plugin sections.
Effectively, we're relying on this._updateAddons
eventually resolving, and on await AddonManagerPrivate.databaseReady
doing so.
The latter already flags up an immediate problem: the implementation of databaseReady
can return a never-resolving promise, at https://searchfox.org/mozilla-central/rev/9ad88f80aeedcd3cd7d7f63be07f577861727054/toolkit/mozapps/extensions/AddonManager.jsm#3674 , if the XPIProvider is not yet started, or has been removed from the internal providers
set on the addon manager. It's not clear to me if that is likely to be the case here, though it'd probably be better to explicitly guard against this possibility if possible.
The other potential cause is that this._updateAddons
doesn't resolve. That waits for getting a bunch of information from the XPIProvider and database. They all seem to either depend on the addon database being loaded, or on iterating over install locations. I also found this ominous comment:
Rob, is there anything obvious I'm missing here that could explain why we could hang forever on shutdown? Is there maybe another path to shutting down the database that would cause us to hang forever?
It's not really 100% clear to me how best to move forward here. We could get some debugging information out of the cases where this happens by implementing some more meaningful state keeping inside telemetry - ie https://searchfox.org/mozilla-central/rev/9ad88f80aeedcd3cd7d7f63be07f577861727054/toolkit/components/telemetry/app/TelemetryEnvironment.jsm#540-542 should provide an object that has a fetchState
method that returns some information about what _pendingTask
is blocked on.
:chutten, is that something you could look at doing?
Assignee | ||
Comment 7•4 years ago
|
||
Crashstats has 101 crashes with the annotation from bug 1644204, with the following reported states:
Awaiting AddonManagerPrivate.databaseReady
81 reportsAwaiting second _updateAddons
19 reportsAwaiting _updateAddons
1 report.
"EnvironmentAddonBuilder" is blocked on _pendingTask
, which is blocked on
AddonManager.databaseReady
, which relies on the following:
XPIProvider
has started and is still registered. (otherwise a never-resolving Promise is returned, as Gijs mentioned in comment 5)XPIProvider
is started (part 1, part 2, part 3, part 4) duringAddonManagerInternal.startup()
. This is always called, when the"addons-startup"
notification is received, which happens really early.EnvironmentAddonBuilder
's_pendingTask
only reaches.databaseReady
if_addonsAreFull
is falsey. Since it immediately follows_updateAddons
->_getActiveAddons
, andgetActiveAddons
only returns a non-truthyfullData
value when there is at least one provider (the first one beingXPIProvider
), we can infer thatXPIProvider
has indeed started.
XPIProvider
is never unregistered, except at shutdown, which is a blocker offinalShutdown
. AndfinalShutdown
is reached whenshutdownManager
is called from a "before-profile-change" blocker described by "AddonManager: shutting down.".- That point (
finalShutdown
inshutdownManager
) is only reached whenbeforeShutdown
is unblocked, butEnvironmentAddonBuilder
's_pendingTask
is one of the blockers for that (_shutdownBlocker
). So we can infer thatXPIProvider
hasn't shut down yet.
- That point (
XPIProvider.databaseReady
is composed ofdbReadyPromise
andproviderReadyPromise
:dbReadyPromise
resolves whenXPIDatabase.asyncLoadDB
has been called and finished.- If
asyncLoadDB
is not called, thenXPIProvider.shutdown
will call it, with the commentotherwise the telemetry shutdown blocker will never resolve
. ButXPIProvider.shutdown
is called within a blocker offinalShutdown
in AddonManager, registered in_startProvider
... .... andfinalShutdown
only happens afterbeforeShutdown
(as mentioned in the analysis of theXPIProvider
shutdown). So this logic that is supposed to ensure that the environment blocker is unblocked doesn't work.
- If
providerReadyPromise
resolves when some builtin add-ons have finished installing. This only used to register built-in themes (inBrowserGlue._beforeUIStartup
andXPIProvider.startup
). probably not causing issues
(I haven't looked at the Awaiting second _updateAddons
, but it could be the result of the time spent on the previous plus the time spent on file I/O).
Based on the full analysis above, it looks like there is a flaw in the logic from bug 1379831 (which was meant to address a regression from the 3rd patch of bug 1358907). A way to resolve this by implementing the originally intended logic is:
- Remove
asyncLoadDB
call since it doesn't work - https://searchfox.org/mozilla-central/rev/3b6958c26049c1e27b2790a43154caaba9f6dd4a/toolkit/mozapps/extensions/internal/XPIProvider.jsm#2691-2697 - Set a boolean flag before
await AddonManagerPrivate.databaseReady;
, and flip that boolean after it resolves, at https://searchfox.org/mozilla-central/rev/3b6958c26049c1e27b2790a43154caaba9f6dd4a/toolkit/components/telemetry/app/TelemetryEnvironment.jsm#560 - In the
_shutdownBlocker
, if the flag is set, callasyncLoadDB
- at https://searchfox.org/mozilla-central/rev/3b6958c26049c1e27b2790a43154caaba9f6dd4a/toolkit/components/telemetry/app/TelemetryEnvironment.jsm#679
The above is an implementation of the originally intended logic. But perhaps there are better alternatives (such as completely skipping the addon database access on shutdown).
Updated•4 years ago
|
Comment 8•4 years ago
|
||
I just hit this crash twice in a row with Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:82.0) Gecko/20100101 Firefox/82.0 ID:20200913213803
https://crash-stats.mozilla.org/report/index/608cbd1f-b602-4a90-ba66-ab0000200920
Comment 9•4 years ago
|
||
Looks like I hit it reproducible for every locally installed version of Firefox. Here one crash for 80.0 release:
https://crash-stats.mozilla.org/report/index/b0d3ad37-f8f0-4dd3-a0e3-6747b0200920
Chris, is there anything I can help with to get this debugged?
Updated•4 years ago
|
Assignee | ||
Comment 10•4 years ago
|
||
comment 7 has an analysis of the problem plus a description of a solution.
Henrik: what steps are you following to reproduce this consistently? If you set extensions.logging.enabled
to true
, do you see the asyncLoadDB
(as explained in comment 7) running to completion according to logs in the global JS console (Ctrl-Shift-J)?
Should be something like this:
1600639768916 addons.xpi-utils DEBUG Starting async load of XPI database /path/to/profdir/extensions.json
Unknown category for SetEventRecordingEnabled: fxmonitor
1600639768930 addons.xpi DEBUG Existing add-on doh-rollout@mozilla.org in app-system-defaults
1600639768930 addons.xpi DEBUG Existing add-on formautofill@mozilla.org in app-system-defaults
1600639768930 addons.xpi DEBUG Existing add-on screenshots@mozilla.org in app-system-defaults
1600639768930 addons.xpi DEBUG Existing add-on webcompat-reporter@mozilla.org in app-system-defaults
1600639768930 addons.xpi DEBUG Existing add-on webcompat@mozilla.org in app-system-defaults
1600639768931 addons.xpi DEBUG scanForChanges changed: false, state: {}
1600639768955 addons.xpi-utils DEBUG Finished async read of XPI database, parsing...
1600639769398 addons.xpi-utils DEBUG Successfully read XPI database
Comment 11•4 years ago
|
||
So as it turned out the problem was persistent because I started Firefox from the terminal out of a directory that has already been deleted via the Finder. I assume that it should be easy to get it reproduced on other systems. Starting Firefox from a directory that exists now, it works all fine.
Assignee | ||
Comment 12•4 years ago
|
||
Thanks Henrik, I am able to reproduce on Linux, bp-01bedaf5-12c9-4563-a84f-a64bb0200921
STR:
$ mkdir /tmp/profdir
$ mkdir /tmp/delme
$ cd /tmp/delme
$ rmdir /tmp/delme
$ firefox --no-remote -profile /tmp/profdir # create profile first
$ firefox --no-remote -profile /tmp/profdir
shell-init: error retrieving current directory: getcwd: cannot access parent directories: No such file or directory
shell-init: error retrieving current directory: getcwd: cannot access parent directories: No such file or directory
[3881976, Main Thread] ###!!! ABORT: file resource://gre/modules/TelemetryEnvironment.jsm, line 544
[3881976, Main Thread] ###!!! ABORT: file resource://gre/modules/TelemetryEnvironment.jsm, line 544
ExceptionHandler::WaitForContinueSignal waiting for continue signal...
ExceptionHandler::GenerateDump cloned child 3882238
ExceptionHandler::SendContinueSignalToChild sent continue signal to child
Note: the two lines in stderr are NOT from Firefox; those are from the shell (bash) before launching Firefox. The error message shows that the cwd no longer exists.
Firefox exits very early on startup because of https://searchfox.org/mozilla-central/rev/b58ca45005fe02077c92779483d1b60e9a49687c/toolkit/xre/nsAppRunner.cpp#4568-4570 :
rv = NS_GetSpecialDirectory(NS_OS_CURRENT_WORKING_DIR,
getter_AddRefs(workingDir));
NS_ENSURE_SUCCESS(rv, NS_ERROR_FAILURE);
With MOZ_LOG=ObserverService:5
I can confirm that the quit-application-granted
notification is never triggered, so this logic (which would usually call asyncLoadDB()
) is not triggered: https://searchfox.org/mozilla-central/rev/b58ca45005fe02077c92779483d1b60e9a49687c/toolkit/mozapps/extensions/internal/XPIProvider.jsm#2554-2561
I've also patched omni.ja
to log a message when asyncLoadDB
would be called, and once again confirmed that it is not called when the issue occurs.
This STR and the results confirm that the analysis from comment 7 was correct.
Comment 13•4 years ago
|
||
Thank you for your comprehensive answers, :robwu.
(In reply to Rob Wu [:robwu] from comment #7)
The above is an implementation of the originally intended logic. But perhaps there are better alternatives (such as completely skipping the addon database access on shutdown).
It might be unavoidable as Telemetry needs addon information to populate the Environment in the "main" ping with reason "shutdown". Unless, of course, having the addon information in the Environment of "main" pings with reason "shutdown" that haven't gotten around to loading the db isn't useful or necessary. I don't know how that information is used, though, so I couldn't say whether that was true.
Assignee | ||
Comment 14•4 years ago
|
||
Are you planning to work on it? If not, feel free to re-assign to me. After having looked at all of this I'm having a good idea of a way to fix this. The most significant changes are in the Addon Manager internals.
Updated•4 years ago
|
Assignee | ||
Comment 16•4 years ago
|
||
TelemetryEnvironment.jsm's EnvironmentAddonBuilder has a shutdown
blocker that depends on the addons database to have been loaded.
There are two calls to asyncLoadDB() in XPIProvider.jsm that are
supposed to activate the load. Neither of them work:
-
XPIProvider calls asyncLoadDB() during quitApplicationGranted.
But "quit-application-granted" is not always triggered, as seen in:
https://bugzilla.mozilla.org/show_bug.cgi?id=1601678#c12 -
XPIProvider.shutdown() calls asyncLoadDB().
But shutdown() is only called when TelemetryEnvironment's blocker has
been released. So this is never reached. More details in:
https://bugzilla.mozilla.org/show_bug.cgi?id=1601678#c7
To fix the deadlock, asyncLoadDB() is called from profile-before-change,
which is the same phase as the blocker of EnvironmentAddonBuilder.
The two existing calls to asyncLoadDB() mentioned above are obsolete and
have been removed.
After the removal of asyncLoadDB() from XPIProvider.shutdown(), the
test_ext_persistent_events.js test started to fail. This is because the
test sends the "sessionstore-windows-restored" notification, for which
XPIProvider has a handler that calls asyncLoadDB(), without awaiting
the result.
Since XPIProvider.shutdown() doesn't await the DB load any more, it is
possible for the DB to be unloaded while being used. This only happens
in tests, because the construction with the TelemetryEnvironment ensures
that the addons database has fully loaded before shutdown() is called.
To resolve this test-only issue, AddonTestUtils.promiseShutdownManager()
has been updated to explicitly wait for the pending _dbPromise if any.
Comment 17•4 years ago
|
||
Pushed by rob@robwu.nl: https://hg.mozilla.org/integration/autoland/rev/4ad0f91ab45b Resolve shutdown deadlock in EnvironmentAddonBuilder r=mixedpuppy
Comment 18•4 years ago
|
||
bugherder |
Updated•4 years ago
|
Comment 19•4 years ago
|
||
Is this something we wanted to consider for Beta uplift or should it ride the 84 train to release?
Assignee | ||
Comment 20•4 years ago
|
||
Let's ride the train. I intentionally waited until after the version bump to let it bake. I'm not expecting regressions, but better safe than sorry.
Updated•4 years ago
|
Description
•