Crash in [@ AsyncShutdownTimeout | profile-before-change | RemoteSettingsClient - finish IDB access.]
Categories
(Firefox :: Remote Settings Client, defect, P3)
Tracking
 | Tracking | Status
---|---|---
firefox-esr68 | --- | unaffected
firefox74 | --- | unaffected
firefox75 | --- | unaffected
firefox76 | --- | wontfix
firefox77 | --- | wontfix
People
(Reporter: sg, Unassigned)
References
(Blocks 1 open bug)
Details
(Keywords: crash, regression)
Crash Data
This bug is for crash report bp-c331ee2e-97b2-47f0-8fc4-679480200325.
Top 10 frames of crashing thread:
0 mozglue.dll mozalloc_abort memory/mozalloc/mozalloc_abort.cpp:33
1 xul.dll NS_DebugBreak xpcom/base/nsDebugImpl.cpp
2 xul.dll nsDebugImpl::Abort xpcom/base/nsDebugImpl.cpp:134
3 xul.dll XPTC__InvokebyIndex
4 @0x1d02591d09f
5 xul.dll static XPCWrappedNative::CallMethod js/xpconnect/src/XPCWrappedNative.cpp:1141
6 xul.dll XPC_WN_CallMethod js/xpconnect/src/XPCWrappedNativeJSOps.cpp:947
7 xul.dll js::InternalCallOrConstruct js/src/vm/Interpreter.cpp:568
8 xul.dll Interpret js/src/vm/Interpreter.cpp:3026
9 xul.dll js::InternalCallOrConstruct js/src/vm/Interpreter.cpp:603
Only a few reports so far, but this seems to have started with Nightly build 20200324214641.
Comment 1•5 years ago
The pushlog for the first affected build:
https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=9b338268ce36&tochange=dbabf2e388fa
so probably a consequence of:
https://hg.mozilla.org/mozilla-central/rev/354489628b3985c7dc385f039f763b15c05ccec7
:gijs, could you have a look please?
Comment 2•5 years ago
The crash report in comment #0 has:
[{"method":"list","identifier":"main/public-suffix-list"},{"method":"getLastModified","identifier":"main/cfr-fxa"},{"method":"getLastModified","identifier":"main/normandy-recipes-capabilities"},{"method":"list","identifier":"main/tippytop"},{"method":"list","identifier":"main/tippytop"},{"method":"list","identifier":"main/tippytop"},{"method":"list","identifier":"main/tippytop"}]
as the pending operations.
Fixing bug 1624983 will stop these from blocking shutdown.
It's still a bit surprising to me that all these operations are pending, though. The 4 identical-looking tippytop list calls are probably not actually identical; they come from https://searchfox.org/mozilla-central/rev/4d9cd186767978a99dafe77eb536a9525980e118/browser/components/newtab/lib/FaviconFeed.jsm#180-184 . What I don't really understand is why these would all still be pending 230s after the browser starts. Either the IO is very slow, or they happened to start right before shutdown, or something else weird is going on.
Anyway, I also expect (see discussion in bug 1545438) that these would also previously have shutdown-crashed -- just with different signatures. :-\
Comment 4•5 years ago
I would say that it was resolved by Bug 1624983.
Comment 5•5 years ago
There's still a bit left here, given there are crash reports from beta. I'll try to investigate more next week.
Comment 6•5 years ago
Going to update the tracking flags, but marking this fix-optional for 76 given the low numbers.
Comment 7•5 years ago
So, looking at a view of the crashes that includes the async shutdown timeout reason, there are about 15 a week, so this is probably less important than the worker ones covered in bug 1605973, which are about 3-4 times as frequent. That said, only a few are for short sessions (and even then, several minutes), so I'm puzzled why they'd be hanging.
The only other thing I noticed is that most of them are now from sync operations, either in the Remote Settings client itself or in its consumers (notably Normandy). They effectively do something like:
- delete entries that shouldn't be there anymore
- update the entries that changed
- save the timestamp
- save the metadata
If verifying the signature then fails, it'll call clear() and try again (which is also represented in these shutdown hangs!), and if that fails, it'll try restoring from the JSON dump...
I guess in principle, the right thing to do there would be to have a single IndexedDB transaction for these operations. Then we could abort the transaction on shutdown, and we'd have to re-run the sync the next time we start (which we could do by clearing the pref tracking when it last ran, I think?). The downside, AFAICT, is that if we're 3/4 of the way through the transaction, reverting it might take more doing than just finishing it, depending on the IDB implementation.
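As a rough illustration of that idea (a minimal sketch: the store names, keys, and the syncInSingleTransaction helper are assumptions, not the actual RemoteSettings code):

function syncInSingleTransaction(db, { toDelete, toUpdate, timestamp, metadata }) {
  return new Promise((resolve, reject) => {
    // One readwrite transaction covering all of the steps above, so aborting
    // it rolls everything back atomically.
    const transaction = db.transaction(["records", "timestamps", "collections"], "readwrite");
    transaction.oncomplete = () => resolve();
    transaction.onerror = event => reject(event.target.error);
    // An abort (e.g. triggered from a shutdown blocker) rejects, so the caller
    // can clear the "last sync" pref and re-run the sync on the next startup.
    transaction.onabort = () => reject(new Error("sync aborted during shutdown"));

    const records = transaction.objectStore("records");
    toDelete.forEach(record => records.delete(record.id));
    toUpdate.forEach(record => records.put(record));
    transaction.objectStore("timestamps").put(timestamp, "collection-timestamp");
    transaction.objectStore("collections").put(metadata, "collection-metadata");
  });
}

The shutdown blocker could then keep a reference to the in-flight transaction and call transaction.abort() on it instead of waiting, at the cost of redoing the sync after the next startup.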
Mathieu/Andrew, how does that sound? And, if we try to squash this stuff into a single transaction and abort it (and set us up to try again shortly after startup next start), how likely are we to hang in IDB code trying to execute the abort instead (e.g. will that be guaranteed to involve IO)?
Another thing we could try to do to reduce the impact here is to break off the existing transactions if we're in the middle of shutdown anyway... but especially for e.g. the Normandy capabilities records, I don't think these are huge amounts of data, so I'm puzzled that we'd hang trying to deal with them. Am I missing something obvious, like could we be throwing somewhere and then leaving tasks that have finished in the list of pending actions forever, or something?
Comment 8•5 years ago
Aha! Funny story! There's no handler for transaction.onabort where the other hookups happen. I think this might have been a regression in bug 1620185, since it seems like kinto.js had an abort handler. I think this means that shutdown hangs are happening every time the shutdown blocker's abort logic comes into play and ends up triggering an abort. (Also, there's probably no test coverage for this. ;)
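For concreteness, a hedged sketch of the kind of hookup that's missing (executeTransaction is a hypothetical helper name for illustration, not necessarily the real one):

function executeTransaction(db, storeNames, mode, callback) {
  return new Promise((resolve, reject) => {
    const transaction = db.transaction(storeNames, mode);
    let result;
    transaction.oncomplete = () => resolve(result);
    transaction.onerror = event => reject(event.target.error);
    // The missing handler: without this, an aborted transaction never settles
    // the promise, and anything awaiting it (including a shutdown blocker)
    // waits forever.
    transaction.onabort = () =>
      reject(transaction.error || new DOMException("Transaction aborted", "AbortError"));
    result = callback(transaction);
  });
}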
Re: Transactions: Our IndexedDB implementation uses the SQLite Write-Ahead Log (WAL), which makes rollbacks/aborts very cheap. You do fundamentally need to be using a single SQLite transaction for that to encompass all of the steps in a single atomic group.
Re: Shutdown hang crashes, I think the above theory probably makes sense as I'm not seeing any IndexedDB threads in most of these crashes (threads should all start with "IndexedDB"). There was one with an IDB thread, but I think that was a legit slow shutdown because there was an fstat-esque thing going on. Similarly, the QuotaManager thread doesn't seem to be doing anything at the times of the hangs either. An exception is https://crash-stats.mozilla.org/report/index/4c2ee3ed-48ce-401d-a566-45d290200415 where QM (thread name: "QuotaManager IO") is still busy initializing QuotaManager's temporary storage at 187 seconds after startup[1].
1: Remote Settings isn't subject to these initialization delays because its storage isn't constrained by quota and the data lives in storage/permanent/chrome, but it's something to keep in mind, as a busy QuotaManager thread would still inherently stall any subsequent attempts to open a system-principal-owned IDB database. We know in that crash, and in general, however, that Remote Settings' attempt to open a database already succeeded.
Comment 9•5 years ago
Thanks, Andrew. I filed bug 1630766 for this. We'll need to address this before doing anything else, I expect.
Comment 10•5 years ago
(In reply to Andrew Sutherland [:asuth] (he/him) from comment #8)
Aha! Funny story! There's no handler for transaction.onabort where the other hookups happen. I think this might have been a regression in bug 1620185 since it seems like kinto.js had an abort handler. I think this means that shutdown hangs are happening every time the shutdown blocker's abort logic comes into play and ends up triggering an abort. (Also, there's probably no test coverage for this. ;)
So, in the clear light of morning... I agree this is a problem, but why would it trigger a shutdown hang in this code? We only abort read-only transactions (gPendingReadOnlyTransactions), and we don't use those to block shutdown. The write transactions (gPendingWriteOperations) do block shutdown, but we don't abort those. Would something else abort them?
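(For reference, a simplified illustration of the bookkeeping being described; this is not the actual implementation, and the real lists track richer operation descriptors like the ones in the crash annotation in comment #2.)

const gPendingReadOnlyTransactions = new Set();
const gPendingWriteOperations = new Set();

// Read-only transactions are tracked only so they can be aborted at shutdown;
// they never block it.
function trackReadOnly(transaction) {
  gPendingReadOnlyTransactions.add(transaction);
  const untrack = () => gPendingReadOnlyTransactions.delete(transaction);
  transaction.addEventListener("complete", untrack);
  transaction.addEventListener("error", untrack);
  transaction.addEventListener("abort", untrack);
}

// Write operations are tracked as promises that the shutdown blocker awaits;
// they are never aborted by this code.
function trackWrite(promise) {
  gPendingWriteOperations.add(promise);
  promise.finally(() => gPendingWriteOperations.delete(promise));
  return promise;
}

// At shutdown: abort the cheap read-only work, wait for the writes to finish.
function onShutdown() {
  for (const transaction of gPendingReadOnlyTransactions) {
    transaction.abort();
  }
  return Promise.all(gPendingWriteOperations);
}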
Comment 11•5 years ago
Ah, yes, in the clear light of my own recent nap, I agree that I got my wires crossed on the ReadOnly and Write variants, which means that things are more mysterious and less causal than I suggested/was hoping. It's not clear why those would ever end in an abort, but it would be nice to have the abort handlers added to help rule that out as a possible source of problems, as now we get into territory like having to suspect the cycle collector, or IndexedDB failing to re-open a database connection that was closed for being idle, or edge cases related to stacked transactions.
Comment 12•5 years ago
Our IndexedDB idleness timeouts are here; excerpting them for direct reference:
// The length of time that database connections will be held open after all
// transactions have completed before doing idle maintenance.
const uint32_t kConnectionIdleMaintenanceMS = 2 * 1000; // 2 seconds
// The length of time that database connections will be held open after all
// transactions and maintenance have completed.
const uint32_t kConnectionIdleCloseMS = 10 * 1000; // 10 seconds
// The length of time that idle threads will stay alive before being shut down.
const uint32_t kConnectionThreadIdleMS = 30 * 1000; // 30 seconds
Setting needinfo on :sg for thoughts about whether idle handling (or other things) could be involved here, and to better understand what our existing test coverage of the idle-handling logic looks like. It wouldn't surprise me if we could improve coverage by having tests that explicitly force connections to be idle and explicitly force idle threads to be spun down, etc. One might also be able to experiment with this by creating a build with the constants lowered somewhat (or turned into preferences).
Comment 13•5 years ago
Cautiously optimistic that bug 1630766 helped here, but the volume is low enough that it'll take a while to be sure.
Comment 14•5 years ago
Because this bug's Severity has not been changed from the default since it was filed, and its Priority is P3 (Backlog), indicating it has been triaged, the bug's Severity is being updated to S3 (normal).
Comment 15•5 years ago
The numbers on 77 are sufficiently low that I don't think this is the best place for me to spend more time right now.
Comment 17•3 years ago
Like bug 1605973, I think we should keep this open until it's actually fixed...
Comment 18•3 years ago
Yes, although I could not find it in crash-stats.
Comment 19•6 months ago
The bug is linked to a topcrash signature, which matches the following criteria:
- Top 20 desktop browser crashes on beta
- Top 5 desktop browser crashes on Windows on beta
:leplatrem, could you consider increasing the severity of this top-crash bug?
For more information, please visit BugBot documentation.
Comment 20•6 months ago
The bug is marked as tracked for firefox130 (beta). However, the bug still isn't assigned, has low priority and has low severity.
:markh, could you please find an assignee, increase the priority and increase the severity for this tracked bug? If you disagree with the tracking decision, please talk with the release managers.
For more information, please visit BugBot documentation.
Comment 21•6 months ago
This is a worryingly high volume on beta 130. Can you help find an owner to investigate it? Thanks!
Comment 22•6 months ago
This collection of crashes shows people with an absolute tonne of saveAttachment blocking write operations (in the security-state/intermediates collection).
Comment 23•6 months ago
If the crash rate increased for this specific collection, security-state/intermediates, then it is related to the work we did in Bug 1889617.
Basically, instead of initiating 1700+ requests for tiny .pem files, we now download a single zip file and extract its content into the IndexedDB attachments storage (i.e. 1700+ calls to saveAttachments() from a client.attachments.cacheAll() method).
In theory, our IndexedDB code is instrumented with shutdown blockers (:Gijs helped us do this years ago :)), but there could be bugs where the code escapes the blocker (i.e. Bug 1890278)?
We are not advanced platform developers, and we have fast hardware, etc., so we would clearly need help to dive into this issue and fix it :/
Comment 24•6 months ago
Looking at the code again, it looks like we don't await the cache.set() promise in the loop (source), which would totally explain the crashing behaviour. That's a good start.
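For illustration, a hedged sketch of the problem and the fix (cacheAllAttachments, entry.id and entry.blob are made-up names; cache.set() is the call mentioned above):

async function cacheAllAttachments(cache, entries) {
  for (const entry of entries) {
    // Before: cache.set(entry.id, entry.blob) with no await, so the loop
    // finishes immediately and leaves 1700+ writes pending in the background,
    // potentially right into shutdown.
    // After: awaiting each write keeps only one operation in flight at a time,
    // so the pending-operations list stays small.
    await cache.set(entry.id, entry.blob);
  }
}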
Comment 25•6 months ago
OK, if this was fixed by the patch bug 1912893 then we should expect to see crashes go back down with 130.0b5. So far so good but let's wait a day or two before marking this fixed for 130/131.
Comment 26•6 months ago
This is a reminder regarding comment #20!
The bug is marked as tracked for firefox130 (beta). We have limited time to fix this; the soft freeze is in 14 days. However, the bug still isn't assigned and has low priority.
Comment 27•6 months ago
(In reply to Liz Henry (:lizzard) (relman/hg->git project) from comment #25)
OK, if this was fixed by the patch bug 1912893 then we should expect to see crashes go back down with 130.0b5. So far so good but let's wait a day or two before marking this fixed for 130/131.
I think this looks better now? There are 0 crashes on b6/7, and the 3 crashes on b5 list "importChanges() in security-state/cert-revocations" as the sole cause of remote settings shutdown hanging, which is very different from the pile of attachment saves in intermediates that was there before - though perhaps separately interesting. I don't know what the uptake rate is for beta, but b4 went to build only 2 days before b5, so the 30-fold higher crash rate seems pretty conclusive (I reached out to gsvelto to ask whether we have a ratio based on release uptake pings or whatnot, which would be better). Not sure if you want to untrack this or mark it "fixed", given that the crash as a whole is still there (and was there in 129, etc.).
Comment 28•6 months ago
I'm going to go with untracking. In an ideal world, we would have had a different bug for the one specific cause of the spike instead of being lumped into this longstanding one, but c'est la vie.
Comment 29•6 months ago
Based on the topcrash criteria, the crash signature linked to this bug is not a topcrash signature anymore.
For more information, please visit BugBot documentation.