Open Bug 1624878 Opened 5 years ago Updated 1 month ago

Crash in [@ AsyncShutdownTimeout | profile-before-change | RemoteSettingsClient - finish IDB access.]

Categories

(Firefox :: Remote Settings Client, defect, P3)

Unspecified
Windows 10
defect

Tracking


REOPENED
Tracking Status
firefox-esr68 --- unaffected
firefox74 --- unaffected
firefox75 --- unaffected
firefox76 --- wontfix
firefox77 --- wontfix

People

(Reporter: sg, Unassigned, NeedInfo)

References

(Blocks 1 open bug)

Details

(Keywords: crash, regression)

Crash Data

This bug is for crash report bp-c331ee2e-97b2-47f0-8fc4-679480200325.

Top 10 frames of crashing thread:

0 mozglue.dll mozalloc_abort memory/mozalloc/mozalloc_abort.cpp:33
1 xul.dll NS_DebugBreak xpcom/base/nsDebugImpl.cpp
2 xul.dll nsDebugImpl::Abort xpcom/base/nsDebugImpl.cpp:134
3 xul.dll XPTC__InvokebyIndex 
4  @0x1d02591d09f 
5 xul.dll static XPCWrappedNative::CallMethod js/xpconnect/src/XPCWrappedNative.cpp:1141
6 xul.dll XPC_WN_CallMethod js/xpconnect/src/XPCWrappedNativeJSOps.cpp:947
7 xul.dll js::InternalCallOrConstruct js/src/vm/Interpreter.cpp:568
8 xul.dll Interpret js/src/vm/Interpreter.cpp:3026
9 xul.dll js::InternalCallOrConstruct js/src/vm/Interpreter.cpp:603

Only a few reports so far, but this seems to have started with Nightly build 20200324214641.

Flags: needinfo?(gijskruitbosch+bugs)

The crash report in comment #0 has:

[{"method":"list","identifier":"main/public-suffix-list"},{"method":"getLastModified","identifier":"main/cfr-fxa"},{"method":"getLastModified","identifier":"main/normandy-recipes-capabilities"},{"method":"list","identifier":"main/tippytop"},{"method":"list","identifier":"main/tippytop"},{"method":"list","identifier":"main/tippytop"},{"method":"list","identifier":"main/tippytop"}]

as the pending operations.

Fixing bug 1624983 will stop these blocking shutdown.

It's still a bit surprising to me that all these operations are pending, though. The 4 identical-looking tippytop list calls are probably not actually identical, but all come from https://searchfox.org/mozilla-central/rev/4d9cd186767978a99dafe77eb536a9525980e118/browser/components/newtab/lib/FaviconFeed.jsm#180-184 . What I don't really understand is why these would all still be pending 230s after the browser starts. Either the IO is very slow, or they accidentally started right before shutdown, or something else weird is going on.

Anyway, I also expect (see discussion in bug 1545438) that these would previously have crashed at shutdown too -- just with different signatures. :-\

Depends on: 1624983
Flags: needinfo?(gijskruitbosch+bugs)

Hi Mathieu, could you please triage this bug?

Flags: needinfo?(mathieu)

I would say that it was resolved by Bug 1624983.

Status: NEW → RESOLVED
Closed: 5 years ago
Flags: needinfo?(mathieu)
Priority: -- → P3
Resolution: --- → FIXED
Assignee: nobody → gijskruitbosch+bugs
Target Milestone: --- → Firefox 76
Blocks: 1629005

There's still a bit left here, given there are crash reports from beta. I'll try to investigate more next week.

Status: RESOLVED → REOPENED
Resolution: FIXED → ---

Gonna update the tracking flags but marking fix-optional for 76 given the low numbers.

Target Milestone: Firefox 76 → ---

So looking at a view of the crashes that includes the async shutdown timeout reason, there are about 15 a week, so this is probably less important than the worker ones covered in bug 1605973, which are about 3-4 times as frequent. That said, only a few are for short sessions (and even then, several minutes), so I'm puzzled why they'd be hanging.

The only other thing I noticed is that most of them are now from sync operations, either in the Remote Settings Client itself or in its consumers (notably Normandy). They effectively do something like:

  • delete entries that shouldn't be there anymore
  • update entries that changed
  • save timestamp
  • save metadata

If signature verification then fails, it'll call clear() and try again (which is also represented in these shutdown hangs!), and if that fails, it'll try restoring from the JSON dump...

I guess in principle, the right thing to do there would be to have a single IndexedDB transaction for these operations. Then we could abort the transaction on shutdown, and we'd have to re-run the sync next time we start (which we could do by clearing the pref tracking when it last ran, I think?). The downside, AFAICT, would be that if we're 3/4 of the way through the transaction, reverting it might take more doing than just finishing it, depending on the IDB implementation.
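
Very roughly, and purely as an illustration (the object store names and record shapes below are assumptions, not the actual RemoteSettingsClient schema), the single-transaction version could look something like:

function applySyncResult(db, { toDelete, toUpdate, timestamp, metadata }) {
  return new Promise((resolve, reject) => {
    // One readwrite transaction covering all four steps, so aborting it at
    // shutdown rolls everything back atomically.
    const transaction = db.transaction(["records", "timestamps", "collections"], "readwrite");
    transaction.oncomplete = () => resolve();
    transaction.onabort = () => reject(transaction.error);
    transaction.onerror = event => reject(event.target.error);

    const records = transaction.objectStore("records");
    toDelete.forEach(record => records.delete(record.id));
    toUpdate.forEach(record => records.put(record));
    transaction.objectStore("timestamps").put({ cid: "main/example", value: timestamp });
    transaction.objectStore("collections").put({ cid: "main/example", metadata });

    // On shutdown we could call transaction.abort() and re-run the sync on the
    // next startup (e.g. by clearing the pref that tracks when it last ran).
  });
}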

Mathieu/Andrew, how does that sound? And, if we try to squash this stuff into a single transaction and abort it (and set us up to try again shortly after startup next start), how likely are we to hang in IDB code trying to execute the abort instead (e.g. will that be guaranteed to involve IO)?

Another thing we could try to do to reduce the impact here is to break off the existing transactions if we're in the middle of shutdown anyway... but especially for, e.g., the Normandy capabilities records, I don't think these are huge amounts of data, so I'm puzzled that we'd hang trying to deal with them. Am I missing something obvious, like could we be throwing somewhere and then leaving tasks that have finished in the list of pending actions forever, or something?

Flags: needinfo?(mathieu)
Flags: needinfo?(bugmail)
See Also: → 1605973

Aha! Funny story! There's no handler for transaction.onabort where the other hookups happen. I think this might have been a regression in bug 1620185 since it seems like kinto.js had an abort handler. I think this means that shutdown hangs are happening every time the shutdown blocker's abort logic comes into play and ends up triggering an abort. (Also, there's probably no test coverage for this. ;)
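
Concretely, the missing hookup would be something along these lines; this is just a sketch of the pattern (the helper name and signature are assumptions), not the actual remote settings database code:

// Sketch only: the real code opens the database elsewhere; the point is the
// onabort handler sitting next to the oncomplete/onerror hookups.
function executeIDB(db, storeName, mode, callback) {
  return new Promise((resolve, reject) => {
    const transaction = db.transaction([storeName], mode);
    let result;
    transaction.oncomplete = () => resolve(result);
    transaction.onerror = event => reject(event.target.error);
    // The missing handler: without it, an aborted transaction never settles
    // the wrapping promise, and the operation stays "pending" forever.
    transaction.onabort = () =>
      reject(transaction.error || new DOMException("IndexedDB transaction aborted", "AbortError"));
    result = callback(transaction.objectStore(storeName));
  });
}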

Re: Transactions: Our IndexedDB implementation uses the SQLite Write-Ahead Log (WAL), which makes rollbacks/aborts very cheap. You do fundamentally need a single SQLite transaction encompassing all of the steps for them to form a single atomic group.

Re: Shutdown hang crashes, I think the above theory probably makes sense, as I'm not seeing any IndexedDB threads in most of these crashes (their names should all start with "IndexedDB"). There was one with an IDB thread, but I think that was a legit slow shutdown because there was an fstat-esque thing going on. Similarly, the QuotaManager thread doesn't seem to be doing anything at the times of the hangs either. An exception is https://crash-stats.mozilla.org/report/index/4c2ee3ed-48ce-401d-a566-45d290200415 where QM (thread name: "QuotaManager IO") is still busy initializing QuotaManager's temporary storage at 187 seconds after startup[1].

1: Remote settings isn't subject to these initialization delays because its storage isn't constrained by quota and the data lives in storage/permanent/chrome, but it's something to keep in mind as a busy QuotaManager thread would still inherently stall any subsequent attempts to open a system-principal-owned IDB database. We know in that crash and in general, however, that remote settings' attempt to open a database already succeeded.

Flags: needinfo?(bugmail)
Depends on: 1630766

Thanks, Andrew. I filed bug 1630766 for this. We'll need to address this before doing anything else, I expect.

Flags: needinfo?(mathieu)

(In reply to Andrew Sutherland [:asuth] (he/him) from comment #8)

> Aha! Funny story! There's no handler for transaction.onabort where the other hookups happen. I think this might have been a regression in bug 1620185 since it seems like kinto.js had an abort handler. I think this means that shutdown hangs are happening every time the shutdown blocker's abort logic comes into play and ends up triggering an abort. (Also, there's probably no test coverage for this. ;)

So, in the clear light of morning... I agree this is a problem, but why would it trigger a shutdown hang in this code? We only abort read-only transactions (gPendingReadOnlyTransactions), and we don't use those to block shutdown. The write transactions (gPendingWriteOperations) do block shutdown, but we don't abort those. Would something else abort them?
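
For context, the blocker bookkeeping is roughly shaped like this (a simplified sketch to illustrate the read/write split, not the exact code in the remote settings IndexedDB helper):

const { AsyncShutdown } = ChromeUtils.import("resource://gre/modules/AsyncShutdown.jsm");

// Simplified bookkeeping; the real sets live in the remote settings database module.
const gPendingReadOnlyTransactions = new Set();
const gPendingWriteOperations = new Set();

AsyncShutdown.profileBeforeChange.addBlocker(
  "RemoteSettingsClient - finish IDB access.",
  async () => {
    // Read-only transactions are simply aborted at shutdown...
    for (const transaction of gPendingReadOnlyTransactions) {
      transaction.abort();
    }
    // ...while write operations are awaited; when one of them never resolves,
    // this is what shows up as the AsyncShutdownTimeout signature.
    await Promise.allSettled(gPendingWriteOperations);
  },
  // The pending-operations list in comment #0 comes from state like this.
  { fetchState: () => Array.from(gPendingWriteOperations) }
);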

Flags: needinfo?(bugmail)

Ah, yes, in the clear light of my own recent nap, I agree that I got my wires crossed on the ReadOnly and Write variants, which means things are more mysterious and less causal than I suggested/was hoping. It's not clear why those would ever end in an abort, but it would be nice to have the abort handlers added to help rule that out as a possible source of problems; otherwise we get into territory like having to suspect the cycle collector, IndexedDB failing to re-open a database connection that was closed for being idle, or edge cases related to stacked transactions.

Flags: needinfo?(bugmail)

Our IndexedDB idleness timeouts are defined here; excerpting them for direct reference:

// The length of time that database connections will be held open after all
// transactions have completed before doing idle maintenance.
const uint32_t kConnectionIdleMaintenanceMS = 2 * 1000;  // 2 seconds

// The length of time that database connections will be held open after all
// transactions and maintenance have completed.
const uint32_t kConnectionIdleCloseMS = 10 * 1000;  // 10 seconds

// The length of time that idle threads will stay alive before being shut down.
const uint32_t kConnectionThreadIdleMS = 30 * 1000;  // 30 seconds

Setting needinfo on :sg for thoughts about whether idle handling (or other things) could be involved here, and to better understand what our existing test coverage of the idle handling logic is. It wouldn't surprise me if we could improve coverage by having tests that explicitly force connections to be idle, explicitly force idle threads to be spun down, etc. One might also be able to experiment with this by creating a build with the constants lowered somewhat (or turned into preferences).

Flags: needinfo?(sgiesecke)

Cautiously optimistic that bug 1630766 helped here, but the volume is low enough that it'll take a while to be sure.

Because this bug's Severity has not been changed from the default since it was filed, and its Priority is P3 (Backlog), indicating it has been triaged, the bug's Severity is being updated to S3 (normal).

Severity: normal → S3

The numbers on 77 are sufficiently low that I don't think this is the best place for me to spend more time right now.

Assignee: gijskruitbosch+bugs → nobody
Flags: needinfo?(sgiesecke)
Status: REOPENED → RESOLVED
Closed: 5 years ago → 2 years ago
Resolution: --- → FIXED

Like bug 1605973, I think we should keep this open until it's actually fixed...

Flags: needinfo?(mathieu)

Yes, although I could not find it in crash-stats.

Status: RESOLVED → REOPENED
Flags: needinfo?(mathieu)
Resolution: FIXED → ---

The bug is linked to a topcrash signature, which matches the following criteria:

  • Top 20 desktop browser crashes on beta
  • Top 5 desktop browser crashes on Windows on beta

:leplatrem, could you consider increasing the severity of this top-crash bug?

For more information, please visit BugBot documentation.

Flags: needinfo?(mathieu)
Keywords: topcrash

The bug is marked as tracked for firefox130 (beta). However, the bug still isn't assigned, has low priority, and has low severity.

:markh, could you please find an assignee, increase the priority and increase the severity for this tracked bug? If you disagree with the tracking decision, please talk with the release managers.

For more information, please visit BugBot documentation.

Flags: needinfo?(markh)
Whiteboard: dom-lws-bugdash-triage

This is a worryingly high volume on beta 130. Can you help find an owner to investigate it? Thanks!

This collection of crashes shows people with an absolute tonne of pending saveAttachment write operations blocking shutdown (in the security-state/intermediates collection).

If the crash rate increased for this specific collection (security-state/intermediates), then it is related to the work we did in Bug 1889617.

Basically, instead of initiating 1700+ requests for tiny .pem files, we now download a single zip file and extract its contents into the IndexedDB attachments storage (i.e. 1700+ calls to saveAttachments() from a client.attachments.cacheAll() method).

In theory, our IndexedDB code is instrumented with shutdown blockers (:Gijs helped us do this years ago :)), but there could be bugs where the code escapes the blocker (i.e. Bug 1890278)?

We are not advanced platform developers, and have fast hardware, etc. Hence we would clearly need help to dive into this issue and fix it :/

Flags: needinfo?(mathieu)

Looking at the code again, it looks like we don't await the cache.set() promise in the loop (source), which would totally explain the crashing behaviour. That's a good start.
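
In other words, the loop is roughly shaped like this (a sketch only; fetchAttachment and cache are illustrative stand-ins, not the exact attachments code):

// Illustrative only; fetchAttachment and cache stand in for the real helpers.
async function cacheAllUnawaited(records, cache, fetchAttachment) {
  for (const record of records) {
    const blob = await fetchAttachment(record);
    // Not awaited: with 1700+ records, a large number of writes can still be
    // in flight when the shutdown blocker runs.
    cache.set(record.id, blob);
  }
}

async function cacheAllAwaited(records, cache, fetchAttachment) {
  for (const record of records) {
    const blob = await fetchAttachment(record);
    // Awaiting each write keeps the pending-operations list bounded and lets
    // shutdown drain it deterministically.
    await cache.set(record.id, blob);
  }
}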

See Also: → 1912893

OK, if this was fixed by the patch in bug 1912893, then we should expect to see crashes go back down with 130.0b5. So far so good, but let's wait a day or two before marking this fixed for 130/131.

This is a reminder regarding comment #20!

The bug is marked as tracked for firefox130 (beta). We have limited time to fix this; the soft freeze is in 14 days. However, the bug still isn't assigned and has low priority.

(In reply to Liz Henry (:lizzard) (relman/hg->git project) from comment #25)

> OK, if this was fixed by the patch in bug 1912893, then we should expect to see crashes go back down with 130.0b5. So far so good, but let's wait a day or two before marking this fixed for 130/131.

I think this looks better now? There are 0 crashes on b6/7, and the 3 crashes on b5 list:

"importChanges() in security-state/cert-revocations"

as the sole cause of remote settings shutdown hanging, which is very different from the pile of attachment saves in intermediates that was there before - though perhaps separately interesting. I don't know what the uptake rate is for beta, but b4 went to build only 2 days before b5, so the 30-fold higher crash rate seems pretty conclusive (I reached out to gsvelto to ask if we have a ratio based on release uptake pings or whatnot, which would be better). Not sure if you want to untrack or mark "fixed" given that the crash as a whole is still there (and was there in 129 etc. etc.).

Flags: needinfo?(ehenry)

I'm going to go with untracking. In an ideal world, we would have had a different bug for the one specific cause of the spike instead of it being lumped into this longstanding one, but c'est la vie.

Flags: needinfo?(ehenry)

Based on the topcrash criteria, the crash signature linked to this bug is not a topcrash signature anymore.

For more information, please visit BugBot documentation.

Keywords: topcrash