Open Bug 1605973 Opened 5 years ago Updated 18 days ago

Crash in [@ AsyncShutdownTimeout | profile-before-change | Remote Settings profile-before-change]

Categories

(Firefox :: Remote Settings Client, defect, P3)

Tracking

REOPENED
Tracking Status
firefox-esr68 --- unaffected
firefox71 --- unaffected
firefox72 --- wontfix
firefox73 --- wontfix
firefox74 --- wontfix
firefox75 --- wontfix
firefox76 --- wontfix
firefox77 --- wontfix
firefox78 --- fix-optional
firefox130 - wontfix

People

(Reporter: philipp, Assigned: leplatrem)

References

(Blocks 1 open bug)

Details

(Keywords: crash, leave-open, regression)

Crash Data

Attachments

(1 file)

This bug is for crash report bp-f6080902-a3c2-47e7-9fd0-6cebf0191226.

AsyncShutdownTimeout:

{
   "phase":"profile-before-change",
   "conditions":[
      {
         "name":"Remote Settings profile-before-change",
         "state":"Remaining: 1 callbacks.",
         "filename":"resource://services-settings/RemoteSettingsWorker.jsm",
         "lineNumber":131,
         "stack":[
            "resource://services-settings/RemoteSettingsWorker.jsm:null:131",
            "resource://services-settings/RemoteSettingsClient.jsm:get:345"
         ]
      }
   ]
}

This async shutdown hang signature is starting to show up during the 72 cycle.

:philipp, since this bug is a regression, could you fill (if possible) the regressed_by field?
For more information, please visit auto_nag documentation.

Flags: needinfo?(madperson)

Philipp, what kind of frequency is this? It looks to me like it's primarily hitting low-end-ish machines, so I'm trying to figure out how frequent this is and/or how likely it is to be related to machines that would previously have shutdown-hung somewhere else.

I also guess that if we wanted to do any debugging here we need a better idea of what's going on, i.e. what call into the worker is pending and why it's not returning. Mathieu, any ideas on that one? I'm primarily seeing pretty short uptimes in the crash reports, in case that's relevant...

Flags: needinfo?(mathieu)

The volume of this crash on beta is low to medium, I'd say, but it's always hard to predict how things will go once this hits the release userbase. The majority of reports come from Windows 10 and 64-bit machines, so it's not only low-end machines, but in 40% of crashes a memory report was also submitted (indicating there was some memory pressure).

Flags: needinfo?(madperson)

In theory, any execution error in the worker is caught and returned as such:
https://searchfox.org/mozilla-central/rev/c79fc6e32a5b561205a71fcfef9112d34b4037ae/services/settings/RemoteSettingsWorker.js#123-130
So this could be an operation that hung. The most likely to fail is the one inserting the JSON dumps into IndexedDB, but that's only a guess...
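
For reference, the worker-side dispatch roughly follows this pattern (a simplified sketch, not the exact code at the linked revision):

// Simplified sketch of the worker-side message handling described above.
// Any error thrown by the method is serialized and posted back, so a
// reply that never arrives means the operation itself never completed.
onmessage = async event => {
  const { callbackId, method, args = [] } = event.data;
  try {
    const result = await Agent[method](...args);
    self.postMessage({ callbackId, result });
  } catch (error) {
    self.postMessage({ callbackId, error: error.toString() });
  }
};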

Flags: needinfo?(mathieu)

(In reply to Mathieu Leplatre [:leplatrem] from comment #4)

In theory, any execution error in the worker is caught and returned as such:
https://searchfox.org/mozilla-central/rev/c79fc6e32a5b561205a71fcfef9112d34b4037ae/services/settings/RemoteSettingsWorker.js#123-130
So this could be an operation that hung. The most likely to fail is the one inserting the JSON dumps into IndexedDB, but that's only a guess...

What could we include in the asyncshutdown detail message to distinguish the commands here?

And, if we're assuming your guess is correct, can we split up the JSON dump loading so it's quicker to complete parts of it (which we can resume on next startup or something)?
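
A very rough sketch of that idea (hypothetical helpers, just to illustrate; not an existing API):

// Hypothetical sketch: import the dump in chunks and persist a resume
// marker, so a shutdown between chunks only loses the current chunk.
const CHUNK_SIZE = 200;

async function importDumpInChunks(records, startAt = 0) {
  for (let i = startAt; i < records.length; i += CHUNK_SIZE) {
    await importChunk(records.slice(i, i + CHUNK_SIZE)); // hypothetical helper
    await saveResumeMarker(i + CHUNK_SIZE); // hypothetical helper
  }
  await clearResumeMarker(); // hypothetical helper
}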

We could also consider doing the dump loading into a profile as part of the first startup routine ( https://firefox-source-docs.mozilla.org/toolkit/modules/toolkit_modules/FirstStartup.html ).

Flags: needinfo?(mathieu)

The volume seems to have a tendency to go up. Are there plans to work on this soon?

(In reply to Jens Stutte [:jstutte] from comment #6)

The volume seems to have a tendency to go up. Are there plans to work on this soon?

I don't see this - looking at https://crash-stats.mozilla.org/signature/?signature=AsyncShutdownTimeout%20%7C%20profile-before-change%20%7C%20Remote%20Settings%20profile-before-change&date=%3E%3D2020-01-01T12%3A09%3A00.000Z&date=%3C2020-02-25T12%3A09%3A00.000Z#graphs it's hovered between 10-20 crashes a day since it first started happening, occasionally peaking at 34 and going back down. I don't see a pattern of it increasing - but perhaps I'm looking at the wrong data? (ni for this)

I also believe 20 crashes a day is very low volume compared to other crashes we have on file. It's also quite plausible that, before the changes from bug 1594521, these would have caused shutdown hangs or crashes anyway, but they'd have happened in IndexedDB or worker code instead. And finally, my understanding is that given that we don't know how to reproduce, it's very hard to track down what exactly is going wrong here or how to fix it - it may just not be worth it given the volume. The fix in bug 1597688 may help here, but it's hard to tell given the volume is so low (almost no crashes on nightly).

Flags: needinfo?(jstutte)

We could add some details about the blocking callbacks like this, at least to be able to investigate a bit more:

diff --git a/services/settings/RemoteSettingsWorker.jsm b/services/settings/RemoteSettingsWorker.jsm
--- a/services/settings/RemoteSettingsWorker.jsm
+++ b/services/settings/RemoteSettingsWorker.jsm
@@ -66,17 +66,17 @@ class Worker {
         this.stop();
       };
     }
     // New activity: reset the idle timer.
     if (this.idleTimeoutId) {
       clearTimeout(this.idleTimeoutId);
     }
     return new Promise((resolve, reject) => {
-      const callbackId = ++this.lastCallbackId;
+      const callbackId = `${method}-${++this.lastCallbackId}`;
       this.callbacks.set(callbackId, [resolve, reject]);
       this.worker.postMessage({ callbackId, method, args });
     });
   }
 
   _onWorkerMessage(event) {
     const { callbackId, result, error } = event.data;
     const [resolve, reject] = this.callbacks.get(callbackId);
@@ -151,17 +151,19 @@ try {
       }
       // Otherwise, return a promise that the worker will resolve.
       return new Promise(resolve => {
         gShutdownResolver = resolve;
       });
     },
     {
       fetchState() {
-        return `Remaining: ${RemoteSettingsWorker.callbacks.size} callbacks.`;
+        const remainingCallbacks = RemoteSettingsWorker.callbacks;
+        const details = Array.from(remainingCallbacks.keys()).join(", ");
+        return `Remaining: ${remainingCallbacks.size} callbacks (${details}).`;
       },
     }
   );
 } catch (ex) {
   Cu.reportError(
     "Couldn't add shutdown blocker, assuming shutdown has started."
   );
   Cu.reportError(ex);

Flags: needinfo?(mathieu)

I was just looking at the graph, which suggested an increase since the beginning of January; I did not look into the details.

Flags: needinfo?(jstutte)

(In reply to Mathieu Leplatre [:leplatrem] from comment #8)

We could add some details about the blocking callbacks like this, at least to be able to investigate a bit more:

rs=me, want to put this up for review? Unsure what string length limits we have on crash reporter info, but it's worth a shot either way.

Flags: needinfo?(mathieu)
Keywords: leave-open
Assignee: nobody → mathieu
Status: NEW → ASSIGNED
Pushed by gijskruitbosch@gmail.com: https://hg.mozilla.org/integration/autoland/rev/e451ebdaac65 Show shutdown blockers in RemoteSettings worker r=Gijs
Flags: needinfo?(mathieu)

Hm, in fact, even before this extra logging, the async shutdown metadata in the crash reports already contains some detail. Here's one from Nightly:

{"phase":"profile-before-change","conditions":[{"name":"Remote Settings profile-before-change","state":"Remaining: 2 callbacks.","filename":"resource://services-settings/RemoteSettingsWorker.jsm","lineNumber":131,"stack":["resource://services-settings/RemoteSettingsWorker.jsm:null:131","resource://services-settings/RemoteSettingsClient.jsm:_importJSONDump:651","resource://services-settings/RemoteSettingsClient.jsm:get:319"]}]}

and from release:

{"phase":"profile-before-change","conditions":[{"name":"Remote Settings profile-before-change","state":"Remaining: 1 callbacks.","filename":"resource://services-settings/RemoteSettingsWorker.jsm","lineNumber":131,"stack":["resource://services-settings/RemoteSettingsWorker.jsm:null:131","resource://services-settings/RemoteSettingsClient.jsm:_validateCollectionSignature:703"]}]}

and

{"phase":"profile-before-change","conditions":[{"name":"Remote Settings profile-before-change","state":"Remaining: 5 callbacks.","filename":"resource://services-settings/RemoteSettingsWorker.jsm","lineNumber":131,"stack":["resource://services-settings/RemoteSettingsWorker.jsm:null:131","resource://services-settings/RemoteSettingsClient.jsm:_validateCollectionSignature:703"]}]}

Does that (esp. the stacks?) help at all, Mathieu?

Flags: needinfo?(mathieu)

I don't know how to read the crash report to find anything specific to these actions :|

Flags: needinfo?(mathieu)

Hi Gijs, do we now have more information, thanks to the patch, on what is going on?

Flags: needinfo?(gijskruitbosch+bugs)

We just have 1 datapoint right now, which is:

{"phase":"profile-before-change","conditions":[{"name":"Remote Settings profile-before-change","state":"Remaining: 1 callbacks (importJSONDump-3).","filename":"resource://services-settings/RemoteSettingsWorker.jsm","lineNumber":147,"stack":["resource://services-settings/RemoteSettingsWorker.jsm:null:147","resource://services-settings/RemoteSettingsClient.jsm:_validateCollectionSignature:780"]}]}

which doesn't tell us much yet. I'm also poking at some of this in bug 1545438 so I have leads there.

Flags: needinfo?(gijskruitbosch+bugs)

There are now 6 crash reports from nightly. 4 show that we were running importJSONDump. 2 show us running checkFileHash, and one of them has literally hundreds of checkFileHash tasks queued up... see the AsyncShutdownTimeout field on the metadata tab of this report: https://crash-stats.mozilla.org/report/index/b4665bb3-c1eb-4194-a82f-1c5140200313#tab-metadata

Not really sure what to do with this info... The multiple file hash checks are very strange, and I'm unsure what to do about the dumps. Should we consider aborting even the writes when shutdown happens? Mathieu, do you have ideas?

Flags: needinfo?(mathieu)

The hundreds of file checks could come from the security-state/intermediates implementation here: https://searchfox.org/mozilla-central/rev/7fba7adfcd695343236de0c12e8d384c9b7cd237/security/manager/ssl/RemoteSecuritySettings.jsm#482-488
I guess that they also would need some sort of shutdown handler there to abort these downloads...
I would feel so much better if we could find a way to package these thousands of attachments as a resource so that clients wouldn't have to download them (bug 1542177)...

As for the importJSONDump calls, they are supposed to happen on fresh profiles, and not too far from startup :/ Namely when the update timer kicks in or when the hello/register notification is received from the Push server.

Flags: needinfo?(mathieu)
Priority: -- → P3
Blocks: 1629005

This is now better on nightly but worse on beta. :-\

These particular hangs are all to do with the worker. bug 1624878 is for the db access from the main thread and shows the same, but much lower numbers on beta.

Flags: needinfo?(gijskruitbosch+bugs)

A lot of the hangs here are after 1-5 minutes, and in almost all of these cases, the install time matches the uptime completely or nearly - in other words, we're hanging on shutdown, having just updated/upgraded/installed Firefox.

importJSONDump is by far the biggest culprit, followed by checkFileHash.

Just like in bug 1624878, I wonder if we can make the import interruptible. Send another message to the worker as soon as shutdown starts, and in the importJSONDump code, abort the transaction if we're shutting down. At least in the importJSONDump case, as far as I can tell we already deal with failures, though in the maybeSync callsite we probably want to abort the sync instead of trying to continue with it.
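
Something like this, perhaps (a hypothetical sketch only, with made-up helper names and a shared shutdown flag in the worker):

// Hypothetical sketch: the main thread posts a "shutdown" message to the
// worker, and importJSONDump checks the flag while writing records.
let gShuttingDown = false;

function prepareShutdown() {
  gShuttingDown = true;
}

async function importJSONDump(bucket, collection) {
  const { data: records } = await loadDumpFile(bucket, collection); // hypothetical helper
  const db = await openDatabase(); // hypothetical helper
  const transaction = db.transaction(["records"], "readwrite");
  const store = transaction.objectStore("records");
  for (const record of records) {
    if (gShuttingDown) {
      // Abort rather than finish the write; the import can be retried on
      // the next startup or sync.
      transaction.abort();
      throw new Error("importJSONDump aborted by shutdown");
    }
    store.put(record);
  }
  return new Promise((resolve, reject) => {
    transaction.oncomplete = () => resolve(records.length);
    transaction.onerror = () => reject(transaction.error);
  });
}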

For checkFileHash, I noticed we do a needless copy (we read the file, construct a Uint8Array from the data provided by the fetch() call, and then construct another Uint8Array based on that), but I also figured we could abort those, and make Downloader pass on the error, and then handle this in the consumers.
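
For what it's worth, hashing the fetched bytes directly could look roughly like this (a sketch with assumed parameter names, not the current implementation):

// Hypothetical sketch: hash the fetched buffer directly instead of
// building a second Uint8Array copy of it first.
async function checkFileHash(fileUrl, expectedSize, expectedHash) {
  const response = await fetch(fileUrl);
  const buffer = await response.arrayBuffer();
  const bytes = new Uint8Array(buffer);
  if (bytes.length !== expectedSize) {
    return false;
  }
  const digest = await crypto.subtle.digest("SHA-256", bytes);
  const hex = Array.from(new Uint8Array(digest), b =>
    b.toString(16).padStart(2, "0")
  ).join("");
  return hex === expectedHash;
}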

Mathieu, does that sound OK?

As an aside, as we did in the remote settings client, I wonder if we can make the asyncshutdown state indicate the collection/bucket we're importing/checking/whatever - that'd be more informative than the counters. :-)
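
For instance, building on the patch above, the callback key could carry that information (hypothetical; assumes the bucket and collection are in scope where the worker method is called):

// Hypothetical sketch: encode the collection into the callback id so
// fetchState() reports e.g. "importJSONDump(main/some-collection)-3"
// instead of a bare counter.
const callbackId = `${method}(${bucket}/${collection})-${++this.lastCallbackId}`;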

Oh, and finally, I wonder if we need something like bug 1402267 - if the worker becomes unresponsive, maybe the JSM needs to give up on it in some way. That said, I don't know if that'd help here; if we get errors, we should still be removing pending callbacks...

Flags: needinfo?(gijskruitbosch+bugs) → needinfo?(mathieu)
See Also: → 1624878

Thanks Gijs for the follow-up here!

[...] At least in the importJSONDump case, as far as I can tell we already deal with failures, though in the maybeSync callsite we probably want to abort the sync instead of trying to continue with it.

Yes, the comment in maybeSync() says "Report but go on": https://searchfox.org/mozilla-central/rev/d2cec90777d573585f8477d5170892e5dcdfb0ab/services/settings/RemoteSettingsClient.jsm#447-448
Basically, the idea behind that was to go on with the sync, just fetching all data from the server instead of relying on the packaged data. But we didn't have the shutdown phase in mind...

[...] but I also figured we could abort those, and make Downloader pass on the error, and then handle this in the consumers.

Yes, that sounds clean!

I wonder if we need something like bug 1402267 - if the worker becomes unresponsive, maybe the JSM needs to give up on it in some way.

I didn't get that part... Would it be different from what we have here?
https://searchfox.org/mozilla-central/rev/d2cec90777d573585f8477d5170892e5dcdfb0ab/services/settings/RemoteSettingsWorker.jsm#66-74

Flags: needinfo?(mathieu)
Depends on: 1635408

More crashes are coming in from beta (and a somewhat alarming number from 76 release? I wonder if it's non-linear based on new RS collections or something?) that do relate to importJSONDump. At least the few I've checked on crash-stats ( https://crash-stats.mozilla.org/report/index/4cf0ac52-3161-43e1-9b1c-549770200507#allthreads , https://crash-stats.mozilla.org/report/index/5dec4f99-ae4a-4148-88fa-9a8c40200506#allthreads , https://crash-stats.mozilla.org/report/index/f45b886a-1b06-480a-bbd0-6f8460200506#allthreads ) do show the quotamanager or indexeddb threads appearing to do actual DB work. Andrew/Simon, does it look more plausible that we're "just" taking longer to do the actual DB work than shutdown allows for?

Flags: needinfo?(sgiesecke)
Flags: needinfo?(bugmail)

Jan, do you think this might be affected/regressed by Bug 1619592?

Flags: needinfo?(sgiesecke) → needinfo?(jvarga)

I don't think bug 1619592 is related to this.
As :Gijs mentioned in comment 23, there's I/O activity on the QuotaManager IO thread, and such activity can take a long time. I looked at the crash mentioned in comment 0, it's thread 17 and CacheQuotaClient::GetUsageForOriginInternal is on the stack. That method has a loop which traverses over directory entries on disk. There's a check QuotaManager::IsShuttingDown in the loop, but the real problem here seems to be that QuotaManager hasn't been notified about the shutdown yet so QuotaManager::IsShuttingDown returns false probably.
I assume that remote settings has a shutdown blocker for profile-before-change.

Our current design is that we let all consumers finish/clean up after getting profile-before-change, then we notify QuotaManager (profile-before-change-qm), which sets the isShuttingDown flag, so some pending initialization can be interrupted.

Maybe we need another notification for QuotaManager which is sent before profile-before-change; the order would then be:
profile-change-net-teardown
profile-change-teardown
profile-before-change-xyz - QuotaManager is notified about upcoming shutdown
profile-before-change - all code that uses QuotaManager, IndexedDB, Cache API, LSNG should close connections if it doesn't want to risk data loss
profile-before-change-qm - real QuotaManager shutdown which may abort existing operations
profile-before-change-telemetry

Just for reference the order is currently:
profile-change-net-teardown
profile-change-teardown
profile-before-change
profile-before-change-qm
profile-before-change-telemetry

Maybe we don't need an extra profile-before-change-xyz and can just use profile-change-teardown; it needs to be investigated more.

If the shutdown flag is set that early, consumers wouldn't be able to open new database connections after profile-change-teardown, and I'm not sure if we can do that.

Andrew, any ideas ?

Flags: needinfo?(jvarga)

If storage is not yet initialized during profile-before-change, there can't be any open database connections and there's nothing to clean up, so we can safely interrupt any pending storage initialization in QuotaManager.
However, new connections still wouldn't be allowed.

Depends on: 1636256

(In reply to Jan Varga [:janv] from comment #25)

it's thread 17 and CacheQuotaClient::GetUsageForOriginInternal is on the stack. That method has a loop which traverses over directory entries on disk.

Can we avoid doing this for system principal consumers like remote settings? QM restrictions shouldn't apply to our indexeddb use...

There's a check QuotaManager::IsShuttingDown in the loop, but the real problem here seems to be that QuotaManager hasn't been notified about the shutdown yet so QuotaManager::IsShuttingDown returns false probably.
I assume that remote settings has a shutdown blocker for profile-before-change.

Yes.

Our current design is that we let all consumers finish/clean up after getting profile-before-change, then we notify QuotaManager (profile-before-change-qm), which sets the isShuttingDown flag, so some pending initialization can be interrupted.

Maybe we need another notification for QuotaManager which is sent before profile-before-change; the order would then be:

This would help a bit for the current situation, though if some of the expensive stuff only happens when we commit a transaction, it may not be sufficient - after all, it's quite possible for a transaction to start before whatever cut-off point we pick to stop accepting new connections to idb, and then run the expensive stuff later.

It seems the third crash report I linked in my comment, i.e. https://crash-stats.mozilla.org/report/index/f45b886a-1b06-480a-bbd0-6f8460200506#allthreads , is stuck doing the actual sqlite transaction work. I haven't audited the full set of crash reports to see how common this is; the crash-stats UI doesn't make this all that easy, but if it's useful to do this I can take a look?

(In reply to Jan Varga [:janv] from comment #25)

If the shutdown flag is set that early, consumers wouldn't be able to open new database connections after profile-change-teardown, and I'm not sure if we can do that.

(In reply to :Gijs (he/him) from comment #27)

Can we avoid doing this for system principal consumers like remote settings? QM restrictions shouldn't apply to our indexeddb use...

An idea I've been proposing but haven't followed through on is to have system logic provide proof of a shutdown blocker when it attempts to open an IDB database. If no blocker is provided, we assume a much more conservative blocker and enforce it. In general the concern is that code that's not shutdown aware tries to do things too close to shutdown, and this helps avoid that problem.

Remote settings is now very much shutdown aware (and getting more responsive to it in bug 1636256), so my main concern would just be the situation where a broken QuotaManager keeps constantly re-initializing itself. It seems reasonable that QuotaManager would give up on re-initializing itself at "profile-change-teardown". (Which would be different from preventing first-time initialization at shutdown. Although that's okay as long as we make sure we're not breaking the situations where QuotaManager gets cleared at shutdown because cookie persistence is session-only.)

Flags: needinfo?(bugmail)

The leave-open keyword is there and there is no activity for 6 months.
:leplatrem, maybe it's time to close this bug?

Flags: needinfo?(mathieu)

It looks like maybe this got fixed in 85? Fingers crossed?

(In reply to :Gijs (he/him) from comment #30)

It looks like maybe this got fixed in 85? Fingers crossed?

Jinxed it. We'll have to take another look in the new year...

The leave-open keyword is there and there is no activity for 6 months.
:leplatrem, maybe it's time to close this bug?

Flags: needinfo?(mathieu)

The leave-open keyword is there and there is no activity for 6 months.
:leplatrem, maybe it's time to close this bug?

Flags: needinfo?(mathieu)
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Flags: needinfo?(mathieu)
Resolution: --- → FIXED

Mathieu, looks like this is still occasionally shutdown crashing in 101 (and probably 102, but it just hasn't been on nightly for very long yet) - did you mean to close this?

Flags: needinfo?(mathieu)

Ok, sorry, I was trying to clean up the issues list, and I thought the lack of activity was a sufficient criterion.

Note for later: this would be the link to crash stats I think: https://crash-stats.mozilla.org/signature/?product=Firefox&signature=AsyncShutdownTimeout%20%7C%20profile-before-change%20%7C%20Remote%20Settings%20profile-before-change&version=100.0

Status: RESOLVED → REOPENED
Flags: needinfo?(mathieu)
Keywords: leave-open
Resolution: FIXED → ---

The bug is linked to a topcrash signature, which matches the following criterion:

  • Top 20 desktop browser crashes on release

:leplatrem, could you consider increasing the severity of this top-crash bug?

For more information, please visit auto_nag documentation.

Flags: needinfo?(mathieu)
Keywords: topcrash

Based on the topcrash criteria, the crash signature linked to this bug is not a topcrash signature anymore.

For more information, please visit auto_nag documentation.

Keywords: topcrash
Severity: normal → S3

The bug is linked to a topcrash signature, which matches the following criterion:

  • Top 20 desktop browser crashes on release

:leplatrem, could you consider increasing the severity of this top-crash bug?

For more information, please visit auto_nag documentation.

Flags: needinfo?(mathieu)
Keywords: topcrash

Based on the topcrash criteria, the crash signature linked to this bug is not a topcrash signature anymore.

For more information, please visit auto_nag documentation.

Keywords: topcrash

The bug is linked to a topcrash signature, which matches the following criterion:

  • Top 20 desktop browser crashes on beta

:leplatrem, could you consider increasing the severity of this top-crash bug?

For more information, please visit auto_nag documentation.

Flags: needinfo?(mathieu)
Keywords: topcrash

Based on the topcrash criteria, the crash signature linked to this bug is not a topcrash signature anymore.

For more information, please visit auto_nag documentation.

Keywords: topcrash

The bug is linked to a topcrash signature, which matches the following criterion:

  • Top 20 desktop browser crashes on release

:leplatrem, could you consider increasing the severity of this top-crash bug?

For more information, please visit auto_nag documentation.

Flags: needinfo?(mathieu)
Keywords: topcrash

Based on the topcrash criteria, the crash signature linked to this bug is not a topcrash signature anymore.

For more information, please visit auto_nag documentation.

Keywords: topcrash

Sorry for removing the keyword earlier but there is a recent change in the ranking, so the bug is again linked to a topcrash signature, which matches the following criterion:

  • Top 20 desktop browser crashes on release

For more information, please visit BugBot documentation.

Keywords: topcrash

Based on the topcrash criteria, the crash signature linked to this bug is not a topcrash signature anymore.

For more information, please visit BugBot documentation.

Keywords: topcrash

One of these signatures is pretty high volume on release 130.
:leplatrem, can you take a look? We can file a new bug if it seems best to. There may be some new regression in 130 for this spike on release.

Flags: needinfo?(mathieu)

No longer a top crash in 130.0.1 release.
