Closed Bug 1601678 Opened 4 years ago Closed 4 years ago

Crash in [@ AsyncShutdownTimeout | AddonManager: Waiting to start provider shutdown. | EnvironmentAddonBuilder]

Categories

(Toolkit :: Telemetry, defect, P1)

defect

Tracking

()

RESOLVED FIXED
84 Branch
Tracking Status
firefox-esr68 --- wontfix
firefox-esr78 --- wontfix
firefox72 --- wontfix
firefox73 --- wontfix
firefox74 --- wontfix
firefox75 --- wontfix
firefox76 --- wontfix
firefox77 --- wontfix
firefox80 --- wontfix
firefox81 --- wontfix
firefox82 --- wontfix
firefox83 --- wontfix
firefox84 --- fixed

People

(Reporter: mconley, Assigned: robwu)

References

(Regression)

Details

(Keywords: crash, regression, topcrash)

Crash Data

Attachments

(1 file)

This bug is for crash report bp-ce0b31ac-8428-4957-8cf7-98f5a0191205.

Top 10 frames of crashing thread:

0 mozglue.dll mozalloc_abort memory/mozalloc/mozalloc_abort.cpp:33
1 xul.dll static void Abort xpcom/base/nsDebugImpl.cpp:442
2 xul.dll NS_DebugBreak xpcom/base/nsDebugImpl.cpp
3 xul.dll nsresult nsDebugImpl::Abort xpcom/base/nsDebugImpl.cpp:134
4 xul.dll XPTC__InvokebyIndex 
5  @0x23d477763ff 
6 xul.dll trunc 
7 xul.dll trunc 
8 xul.dll trunc 
9 xul.dll static bool XPCWrappedNative::CallMethod js/xpconnect/src/XPCWrappedNative.cpp:1149

This appears to be an AsyncShutdownTimeout crash. Here's one of the AsyncShutdown status payloads:

{"phase":"AddonManager: Waiting to start provider shutdown.","conditions":[{"name":"EnvironmentAddonBuilder","state":"(none)","filename":"resource://gre/modules/TelemetryEnvironment.jsm","lineNumber":608,"stack":["resource://gre/modules/TelemetryEnvironment.jsm:init:608","resource://gre/modules/TelemetryEnvironment.jsm:EnvironmentCache:1049","resource://gre/modules/TelemetryEnvironment.jsm:getGlobal:99","resource://gre/modules/TelemetryEnvironment.jsm:get currentEnvironment:106","resource://gre/modules/TelemetryStartup.jsm:annotateEnvironment:47","resource://gre/modules/TelemetryStartup.jsm:TelemetryStartup.prototype.observe:34"]}]}

The priority flag is not set for this bug.
:chutten, could you have a look please?

For more information, please visit auto_nag documentation.

Flags: needinfo?(chutten)

Well that's weird. The report says the browser was open nearly 8min but it hadn't completed "profile-after-change"? (That's what the observer in TelemetryStartup is listening for. It's supposed to annotate the crash environment inside "profile-after-change".

(( There has been (quiet, vague) support for moving this out of "profile-after-change" (see bug 1553540), but nothing's ever required it. Maybe this would satisfy the activation energy requirements. ))

Not really sure how to investigate this, but it seems like it has enough volume that I can't just hope it goes away : |

Assignee: nobody → chutten
Status: NEW → ASSIGNED
Priority: -- → P1
See Also: → 1553540
Flags: needinfo?(chutten)

After an OOM crash, this is the #2 overall top crash in 72.0.1. Volume did go down between 70 and 71 for this signature, and it is still fairly early in 72 cycle but we have accumulated around 1700 crashes so far.

This also looks similar to Bug 1513855, which was fixed in the 70 cycle.

Severity: normal → S4

(In reply to Chris H-C :chutten from comment #2)

Well that's weird. The report says the browser was open nearly 8min but it hadn't completed "profile-after-change"? (That's what the observer in TelemetryStartup is listening for. It's supposed to annotate the crash environment inside "profile-after-change".

(( There has been (quiet, vague) support for moving this out of "profile-after-change" (see bug 1553540), but nothing's ever required it. Maybe this would satisfy the activation energy requirements. ))

Not really sure how to investigate this, but it seems like it has enough volume that I can't just hope it goes away : |

This is the second most frequent asyncshutdown crash, so I poked at this a bit. I'm going to just outline what I'm seeing and you can correct me where I'm missing something.

We're hanging because telemetry added a shutdown blocker to add-on manager code. The add-on manager will wait for all of these blockers at https://searchfox.org/mozilla-central/rev/9ad88f80aeedcd3cd7d7f63be07f577861727054/toolkit/mozapps/extensions/AddonManager.jsm#1035

Telemetry's blocker is added at https://searchfox.org/mozilla-central/rev/9ad88f80aeedcd3cd7d7f63be07f577861727054/toolkit/components/telemetry/app/TelemetryEnvironment.jsm#540-542 , and corresponds to _pendingTask, set just below it. That does the following:

      try {
        // Gather initial addons details
        await this._updateAddons(true);

        if (!this._environment._addonsAreFull) {
          // The addon database has not been loaded, wait for it to
          // initialize and gather full data as soon as it does.
          await AddonManagerPrivate.databaseReady;

          // Now gather complete addons details.
          await this._updateAddons();
        }
      } catch (err) {
        this._environment._log.error("init - Exception in _updateAddons", err);
      } finally {
        this._pendingTask = null;
      }

It also gets re-set at https://searchfox.org/mozilla-central/rev/9ad88f80aeedcd3cd7d7f63be07f577861727054/toolkit/components/telemetry/app/TelemetryEnvironment.jsm#633-650, to be:

    this._pendingTask = this._updateAddons().then(
      result => {
        this._pendingTask = null;
        if (result.changed) {
          this._environment._onEnvironmentChange(
            changeReason,
            result.oldEnvironment
          );
        }
      },
      err => {
        this._pendingTask = null;
        this._environment._log.error(
          "_checkForChanges: Error collecting addons",
          err
        );
      }
    );

Some of these crashes have add-ons data (e.g. https://crash-stats.mozilla.org/report/index/8bd574ef-73df-4db4-91d4-ddc870200608#tab-telemetryenvironment ) and some do not (e.g. https://crash-stats.mozilla.org/report/index/cb329dd7-983e-47a0-91aa-54c270200608#tab-telemetryenvironment ). Though note that the add-ons data we do get is incomplete, ie it records dummy entries for the GMP and NPAPI plugin sections.

Effectively, we're relying on this._updateAddons eventually resolving, and on await AddonManagerPrivate.databaseReady doing so.

The latter already flags up an immediate problem: the implementation of databaseReady can return a never-resolving promise, at https://searchfox.org/mozilla-central/rev/9ad88f80aeedcd3cd7d7f63be07f577861727054/toolkit/mozapps/extensions/AddonManager.jsm#3674 , if the XPIProvider is not yet started, or has been removed from the internal providers set on the addon manager. It's not clear to me if that is likely to be the case here, though it'd probably be better to explicitly guard against this possibility if possible.

The other potential cause is that this._updateAddons doesn't resolve. That waits for getting a bunch of information from the XPIProvider and database. They all seem to either depend on the addon database being loaded, or on iterating over install locations. I also found this ominous comment:

https://searchfox.org/mozilla-central/rev/9ad88f80aeedcd3cd7d7f63be07f577861727054/toolkit/mozapps/extensions/internal/XPIProvider.jsm#2691-2696

Rob, is there anything obvious I'm missing here that could explain why we could hang forever on shutdown? Is there maybe another path to shutting down the database that would cause us to hang forever?

It's not really 100% clear to me how best to move forward here. We could get some debugging information out of the cases where this happens by implementing some more meaningful state keeping inside telemetry - ie https://searchfox.org/mozilla-central/rev/9ad88f80aeedcd3cd7d7f63be07f577861727054/toolkit/components/telemetry/app/TelemetryEnvironment.jsm#540-542 should provide an object that has a fetchState method that returns some information about what _pendingTask is blocked on.

:chutten, is that something you could look at doing?

Flags: needinfo?(rob)
Flags: needinfo?(chutten)
Depends on: 1644204

Certainly. I'll take that work in bug 1644204.

Flags: needinfo?(chutten)

Crashstats has 101 crashes with the annotation from bug 1644204, with the following reported states:

  • Awaiting AddonManagerPrivate.databaseReady 81 reports
  • Awaiting second _updateAddons 19 reports
  • Awaiting _updateAddons 1 report.

"EnvironmentAddonBuilder" is blocked on _pendingTask, which is blocked on
AddonManager.databaseReady, which relies on the following:

(I haven't looked at the Awaiting second _updateAddons, but it could be the result of the time spent on the previous plus the time spent on file I/O).

Based on the full analysis above, it looks like there is a flaw in the logic from bug 1379831 (which was meant to address a regression from the 3rd patch of bug 1358907). A way to resolve this by implementing the originally intended logic is:

  1. Remove asyncLoadDB call since it doesn't work - https://searchfox.org/mozilla-central/rev/3b6958c26049c1e27b2790a43154caaba9f6dd4a/toolkit/mozapps/extensions/internal/XPIProvider.jsm#2691-2697
  2. Set a boolean flag before await AddonManagerPrivate.databaseReady;, and flip that boolean after it resolves, at https://searchfox.org/mozilla-central/rev/3b6958c26049c1e27b2790a43154caaba9f6dd4a/toolkit/components/telemetry/app/TelemetryEnvironment.jsm#560
  3. In the _shutdownBlocker, if the flag is set, call asyncLoadDB - at https://searchfox.org/mozilla-central/rev/3b6958c26049c1e27b2790a43154caaba9f6dd4a/toolkit/components/telemetry/app/TelemetryEnvironment.jsm#679

The above is an implementation of the originally intended logic. But perhaps there are better alternatives (such as completely skipping the addon database access on shutdown).

Flags: needinfo?(rob)
Keywords: regression
Regressed by: 1358907
See Also: → 1379831
Has Regression Range: --- → yes

I just hit this crash twice in a row with Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:82.0) Gecko/20100101 Firefox/82.0 ID:20200913213803

https://crash-stats.mozilla.org/report/index/608cbd1f-b602-4a90-ba66-ab0000200920

Looks like I hit it reproducible for every locally installed version of Firefox. Here one crash for 80.0 release:
https://crash-stats.mozilla.org/report/index/b0d3ad37-f8f0-4dd3-a0e3-6747b0200920

Chris, is there anything I can help with to get this debugged?

Flags: needinfo?(chutten)
OS: Windows 10 → All
Hardware: Unspecified → All

comment 7 has an analysis of the problem plus a description of a solution.

Henrik: what steps are you following to reproduce this consistently? If you set extensions.logging.enabled to true, do you see the asyncLoadDB (as explained in comment 7) running to completion according to logs in the global JS console (Ctrl-Shift-J)?

Should be something like this:

1600639768916	addons.xpi-utils	DEBUG	Starting async load of XPI database /path/to/profdir/extensions.json
Unknown category for SetEventRecordingEnabled: fxmonitor
1600639768930	addons.xpi	DEBUG	Existing add-on doh-rollout@mozilla.org in app-system-defaults
1600639768930	addons.xpi	DEBUG	Existing add-on formautofill@mozilla.org in app-system-defaults
1600639768930	addons.xpi	DEBUG	Existing add-on screenshots@mozilla.org in app-system-defaults
1600639768930	addons.xpi	DEBUG	Existing add-on webcompat-reporter@mozilla.org in app-system-defaults
1600639768930	addons.xpi	DEBUG	Existing add-on webcompat@mozilla.org in app-system-defaults
1600639768931	addons.xpi	DEBUG	scanForChanges changed: false, state: {}
1600639768955	addons.xpi-utils	DEBUG	Finished async read of XPI database, parsing...
1600639769398	addons.xpi-utils	DEBUG	Successfully read XPI database

So as it turned out the problem was persistent because I started Firefox from the terminal out of a directory that has already been deleted via the Finder. I assume that it should be easy to get it reproduced on other systems. Starting Firefox from a directory that exists now, it works all fine.

Thanks Henrik, I am able to reproduce on Linux, bp-01bedaf5-12c9-4563-a84f-a64bb0200921

STR:

$ mkdir /tmp/profdir
$ mkdir /tmp/delme
$ cd /tmp/delme
$ rmdir /tmp/delme
$ firefox --no-remote -profile /tmp/profdir  # create profile first
$ firefox --no-remote -profile /tmp/profdir
shell-init: error retrieving current directory: getcwd: cannot access parent directories: No such file or directory
shell-init: error retrieving current directory: getcwd: cannot access parent directories: No such file or directory
[3881976, Main Thread] ###!!! ABORT: file resource://gre/modules/TelemetryEnvironment.jsm, line 544
[3881976, Main Thread] ###!!! ABORT: file resource://gre/modules/TelemetryEnvironment.jsm, line 544
ExceptionHandler::WaitForContinueSignal waiting for continue signal...
ExceptionHandler::GenerateDump cloned child 3882238
ExceptionHandler::SendContinueSignalToChild sent continue signal to child

Note: the two lines in stderr are NOT from Firefox; those are from the shell (bash) before launching Firefox. The error message shows that the cwd no longer exists.

Firefox exits very early on startup because of https://searchfox.org/mozilla-central/rev/b58ca45005fe02077c92779483d1b60e9a49687c/toolkit/xre/nsAppRunner.cpp#4568-4570 :

  rv = NS_GetSpecialDirectory(NS_OS_CURRENT_WORKING_DIR,
                              getter_AddRefs(workingDir));
  NS_ENSURE_SUCCESS(rv, NS_ERROR_FAILURE);

With MOZ_LOG=ObserverService:5 I can confirm that the quit-application-granted notification is never triggered, so this logic (which would usually call asyncLoadDB()) is not triggered: https://searchfox.org/mozilla-central/rev/b58ca45005fe02077c92779483d1b60e9a49687c/toolkit/mozapps/extensions/internal/XPIProvider.jsm#2554-2561

I've also patched omni.ja to log a message when asyncLoadDB would be called, and once again confirmed that it is not called when the issue occurs.

This STR and the results confirm that the analysis from comment 7 was correct.

See Also: → 1666221

Thank you for your comprehensive answers, :robwu.

(In reply to Rob Wu [:robwu] from comment #7)

The above is an implementation of the originally intended logic. But perhaps there are better alternatives (such as completely skipping the addon database access on shutdown).

It might be unavoidable as Telemetry needs addon information to populate the Environment in the "main" ping with reason "shutdown". Unless, of course, having the addon information in the Environment of "main" pings with reason "shutdown" that haven't gotten around to loading the db isn't useful or necessary. I don't know how that information is used, though, so I couldn't say whether that was true.

Flags: needinfo?(chutten)

Are you planning to work on it? If not, feel free to re-assign to me. After having looked at all of this I'm having a good idea of a way to fix this. The most significant changes are in the Addon Manager internals.

Sounds like a wonderful idea to me.

Assignee: chutten → rob

TelemetryEnvironment.jsm's EnvironmentAddonBuilder has a shutdown
blocker that depends on the addons database to have been loaded.
There are two calls to asyncLoadDB() in XPIProvider.jsm that are
supposed to activate the load. Neither of them work:

To fix the deadlock, asyncLoadDB() is called from profile-before-change,
which is the same phase as the blocker of EnvironmentAddonBuilder.

The two existing calls to asyncLoadDB() mentioned above are obsolete and
have been removed.


After the removal of asyncLoadDB() from XPIProvider.shutdown(), the
test_ext_persistent_events.js test started to fail. This is because the
test sends the "sessionstore-windows-restored" notification, for which
XPIProvider has a handler that calls asyncLoadDB(), without awaiting
the result.
Since XPIProvider.shutdown() doesn't await the DB load any more, it is
possible for the DB to be unloaded while being used. This only happens
in tests, because the construction with the TelemetryEnvironment ensures
that the addons database has fully loaded before shutdown() is called.

To resolve this test-only issue, AddonTestUtils.promiseShutdownManager()
has been updated to explicitly wait for the pending _dbPromise if any.

Pushed by rob@robwu.nl:
https://hg.mozilla.org/integration/autoland/rev/4ad0f91ab45b
Resolve shutdown deadlock in EnvironmentAddonBuilder r=mixedpuppy
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → 84 Branch

Is this something we wanted to consider for Beta uplift or should it ride the 84 train to release?

Flags: needinfo?(rob)

Let's ride the train. I intentionally waited until after the version bump to let it bake. I'm not expecting regressions, but better safe than sorry.

Flags: needinfo?(rob)
See Also: → 1677729
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: