Open Bug 1624878 Opened 5 years ago Updated 1 month ago

Crash in [@ AsyncShutdownTimeout | profile-before-change | RemoteSettingsClient - finish IDB access.]

Categories

(Firefox :: Remote Settings Client, defect, P3)

Unspecified
Windows 10
defect

Tracking


REOPENED
Tracking Status
firefox-esr68 --- unaffected
firefox74 --- unaffected
firefox75 --- unaffected
firefox76 --- wontfix
firefox77 --- wontfix

People

(Reporter: sg, Unassigned, NeedInfo)

References

(Blocks 1 open bug)

Details

(Keywords: crash, regression)

Crash Data

This bug is for crash report bp-c331ee2e-97b2-47f0-8fc4-679480200325.

Top 10 frames of crashing thread:

0 mozglue.dll mozalloc_abort memory/mozalloc/mozalloc_abort.cpp:33
1 xul.dll NS_DebugBreak xpcom/base/nsDebugImpl.cpp
2 xul.dll nsDebugImpl::Abort xpcom/base/nsDebugImpl.cpp:134
3 xul.dll XPTC__InvokebyIndex 
4  @0x1d02591d09f 
5 xul.dll static XPCWrappedNative::CallMethod js/xpconnect/src/XPCWrappedNative.cpp:1141
6 xul.dll XPC_WN_CallMethod js/xpconnect/src/XPCWrappedNativeJSOps.cpp:947
7 xul.dll js::InternalCallOrConstruct js/src/vm/Interpreter.cpp:568
8 xul.dll Interpret js/src/vm/Interpreter.cpp:3026
9 xul.dll js::InternalCallOrConstruct js/src/vm/Interpreter.cpp:603

Only a few reports so far, but this seems to have started with Nightly build 20200324214641.

Flags: needinfo?(gijskruitbosch+bugs)

The crash report in comment #0 has:

[{"method":"list","identifier":"main/public-suffix-list"},{"method":"getLastModified","identifier":"main/cfr-fxa"},{"method":"getLastModified","identifier":"main/normandy-recipes-capabilities"},{"method":"list","identifier":"main/tippytop"},{"method":"list","identifier":"main/tippytop"},{"method":"list","identifier":"main/tippytop"},{"method":"list","identifier":"main/tippytop"}]

as the pending operations.

Fixing bug 1624983 will stop these blocking shutdown.

It's still a bit surprising to me that all these operations are pending, though. The 4 identical-looking tippytop list calls are probably not actually identical, but all come from https://searchfox.org/mozilla-central/rev/4d9cd186767978a99dafe77eb536a9525980e118/browser/components/newtab/lib/FaviconFeed.jsm#180-184 . What I don't really understand is why these would all still be pending 230s after the browser starts. Either the IO is very slow, or they accidentally started right before shutdown, or something else weird is going on.

Anyway, I also expect (see discussion in bug 1545438) that these would previously have crashed at shutdown too -- just with different signatures. :-\

Depends on: 1624983
Flags: needinfo?(gijskruitbosch+bugs)

Hi Mathieu, could you please triage this bug?

Flags: needinfo?(mathieu)

I would say that it was resolved by Bug 1624983.

Status: NEW → RESOLVED
Closed: 5 years ago
Flags: needinfo?(mathieu)
Priority: -- → P3
Resolution: --- → FIXED
Assignee: nobody → gijskruitbosch+bugs
Target Milestone: --- → Firefox 76
Blocks: 1629005

There's still a bit left here, given there are crash reports from beta. I'll try to investigate more next week.

Status: RESOLVED → REOPENED
Resolution: FIXED → ---

Gonna update the tracking flags but marking fix-optional for 76 given the low numbers.

Target Milestone: Firefox 76 → ---

So looking at a view of the crashes that includes the async shutdown timeout reason, there are about 15 a week, so this is probably less important than the worker ones covered in bug 1605973, which are about 3-4 times as frequent. That said, only a few are for short sessions (and even then, several minutes), so I'm puzzled why they'd be hanging.

The only other thing I noticed is that most of them are now from sync operations, either in the Remote Settings Client itself or in its consumers (notably Normandy). They effectively do something like:

  • delete entries that shouldn't be there anymore
  • update entries that changed
  • save timestamp
  • save metadata

If signature verification then fails, it'll call clear() and try again (which is also represented in these shutdown hangs!), and if that fails, it'll try restoring from the JSON dump...

I guess in principle, the right thing to do there would be to have a single IndexedDB transaction for these operations. Then we could abort the transaction on shutdown, and we'd have to re-run the sync next time we start (which we could do by clearing the pref tracking when it last ran, I think?). The downside, AFAICT, would be that if we're 3/4 of the way through the transaction, reverting it might take more doing than just finishing it, depending on the IDB implementation.
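
Very roughly, and purely as an illustration (the object store names and record shapes below are assumptions, not the actual RemoteSettingsClient schema), the single-transaction version could look something like:

function applySyncResult(db, { toDelete, toUpdate, timestamp, metadata }) {
  return new Promise((resolve, reject) => {
    // One readwrite transaction covering all four steps, so aborting it at
    // shutdown rolls everything back atomically.
    const transaction = db.transaction(["records", "timestamps", "collections"], "readwrite");
    transaction.oncomplete = () => resolve();
    transaction.onabort = () => reject(transaction.error);
    transaction.onerror = event => reject(event.target.error);

    const records = transaction.objectStore("records");
    toDelete.forEach(record => records.delete(record.id));
    toUpdate.forEach(record => records.put(record));
    transaction.objectStore("timestamps").put({ cid: "main/example", value: timestamp });
    transaction.objectStore("collections").put({ cid: "main/example", metadata });

    // On shutdown we could call transaction.abort() and re-run the sync on the
    // next startup (e.g. by clearing the pref that tracks when it last ran).
  });
}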

Mathieu/Andrew, how does that sound? And, if we try to squash this stuff into a single transaction and abort it (and set us up to try again shortly after startup next start), how likely are we to hang in IDB code trying to execute the abort instead (e.g. will that be guaranteed to involve IO)?

Another thing we could try to do to reduce the impact here is to break off the existing transactions if we're in the middle of shutdown anyway... but especially for, e.g., the Normandy capabilities records, I don't think these are huge amounts of data, so I'm puzzled that we'd hang trying to deal with them. Am I missing something obvious, like could we be throwing somewhere and then leaving tasks that have finished in the list of pending actions forever, or something?

Flags: needinfo?(mathieu)
Flags: needinfo?(bugmail)
See Also: → 1605973

Aha! Funny story! There's no handler for transaction.onabort where the other hookups happen. I think this might have been a regression in bug 1620185 since it seems like kinto.js had an abort handler. I think this means that shutdown hangs are happening every time the shutdown blocker's abort logic comes into play and ends up triggering an abort. (Also, there's probably no test coverage for this. ;)
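
Concretely, the missing hookup would be something along these lines; this is just a sketch of the pattern (the helper name and signature are assumptions), not the actual remote settings database code:

// Sketch only: the real code opens the database elsewhere; the point is the
// onabort handler sitting next to the oncomplete/onerror hookups.
function executeIDB(db, storeName, mode, callback) {
  return new Promise((resolve, reject) => {
    const transaction = db.transaction([storeName], mode);
    let result;
    transaction.oncomplete = () => resolve(result);
    transaction.onerror = event => reject(event.target.error);
    // The missing handler: without it, an aborted transaction never settles
    // the wrapping promise, and the operation stays "pending" forever.
    transaction.onabort = () =>
      reject(transaction.error || new DOMException("IndexedDB transaction aborted", "AbortError"));
    result = callback(transaction.objectStore(storeName));
  });
}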

Re: Transactions: Our IndexedDB implementation uses the SQLite Write-Ahead Log (WAL), which makes rollbacks/aborts very cheap. You do fundamentally need a single SQLite transaction encompassing all of the steps for them to form a single atomic group.

Re: Shutdown hang crashes, I think the above theory probably makes sense, as I'm not seeing any IndexedDB threads in most of these crashes (their names should all start with "IndexedDB"). There was one with an IDB thread, but I think that was a legit slow shutdown because there was an fstat-esque thing going on. Similarly, the QuotaManager thread doesn't seem to be doing anything at the times of the hangs either. An exception is https://crash-stats.mozilla.org/report/index/4c2ee3ed-48ce-401d-a566-45d290200415 where QM (thread name: "QuotaManager IO") is still busy initializing QuotaManager's temporary storage at 187 seconds after startup[1].

1: Remote settings isn't subject to these initialization delays because its storage isn't constrained by quota and the data lives in storage/permanent/chrome, but it's something to keep in mind as a busy QuotaManager thread would still inherently stall any subsequent attempts to open a system-principal-owned IDB database. We know in that crash and in general, however, that remote settings' attempt to open a database already succeeded.

Flags: needinfo?(bugmail)
Depends on: 1630766

Thanks, Andrew. I filed bug 1630766 for this. We'll need to address this before doing anything else, I expect.

Flags: needinfo?(mathieu)

(In reply to Andrew Sutherland [:asuth] (he/him) from comment #8)

> Aha! Funny story! There's no handler for transaction.onabort where the other hookups happen. I think this might have been a regression in bug 1620185 since it seems like kinto.js had an abort handler. I think this means that shutdown hangs are happening every time the shutdown blocker's abort logic comes into play and ends up triggering an abort. (Also, there's probably no test coverage for this. ;)

So, in the clear light of morning... I agree this is a problem, but why would it trigger a shutdown hang in this code? We only abort read-only transactions (gPendingReadOnlyTransactions), and we don't use those to block shutdown. The write transactions (gPendingWriteOperations) do block shutdown, but we don't abort those. Would something else abort them?
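
For context, the blocker bookkeeping is roughly shaped like this (a simplified sketch to illustrate the read/write split, not the exact code in the remote settings IndexedDB helper):

const { AsyncShutdown } = ChromeUtils.import("resource://gre/modules/AsyncShutdown.jsm");

// Simplified bookkeeping; the real sets live in the remote settings database module.
const gPendingReadOnlyTransactions = new Set();
const gPendingWriteOperations = new Set();

AsyncShutdown.profileBeforeChange.addBlocker(
  "RemoteSettingsClient - finish IDB access.",
  async () => {
    // Read-only transactions are simply aborted at shutdown...
    for (const transaction of gPendingReadOnlyTransactions) {
      transaction.abort();
    }
    // ...while write operations are awaited; when one of them never resolves,
    // this is what shows up as the AsyncShutdownTimeout signature.
    await Promise.allSettled(gPendingWriteOperations);
  },
  // The pending-operations list in comment #0 comes from state like this.
  { fetchState: () => Array.from(gPendingWriteOperations) }
);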

Flags: needinfo?(bugmail)

Ah, yes, in the clear light of my own recent nap, I agree that I got my wires crossed on the ReadOnly and Write variants, which means things are more mysterious and less causal than I suggested/was hoping. It's not clear why those would ever end in an abort, but it would be nice to have the abort handlers added to help rule that out as a possible source of problems; otherwise we get into territory like having to suspect the cycle collector, IndexedDB failing to re-open a database connection that was closed for being idle, or edge cases related to stacked transactions.

Flags: needinfo?(bugmail)

Our IndexedDB idleness timeouts are defined here; excerpting them for direct reference:

// The length of time that database connections will be held open after all
// transactions have completed before doing idle maintenance.
const uint32_t kConnectionIdleMaintenanceMS = 2 * 1000;  // 2 seconds

// The length of time that database connections will be held open after all
// transactions and maintenance have completed.
const uint32_t kConnectionIdleCloseMS = 10 * 1000;  // 10 seconds

// The length of time that idle threads will stay alive before being shut down.
const uint32_t kConnectionThreadIdleMS = 30 * 1000;  // 30 seconds

Setting needinfo on :sg for thoughts about whether idle handling (or other things) could be involved here, and to better understand what our existing test coverage of the idle handling logic is. It wouldn't surprise me if we could improve coverage by having tests that explicitly force connections to be idle, explicitly force idle threads to be spun down, etc. One might also be able to experiment with this by creating a build with the constants lowered somewhat (or turned into preferences).

Flags: needinfo?(sgiesecke)

Cautiously optimistic that bug 1630766 helped here, but the volume is low enough that it'll take a while to be sure.

Because this bug's Severity has not been changed from the default since it was filed, and its Priority is P3 (Backlog), indicating it has been triaged, the bug's Severity is being updated to S3 (normal).

Severity: normal → S3

The numbers on 77 are sufficiently low that I don't think this is the best place for me to spend more time right now.

Assignee: gijskruitbosch+bugs → nobody
Flags: needinfo?(sgiesecke)
Status: REOPENED → RESOLVED
Closed: 5 years ago → 2 years ago
Resolution: --- → FIXED

Like bug 1605973, I think we should keep this open until it's actually fixed...

Flags: needinfo?(mathieu)

Yes, although I could not find it in crash-stats.

Status: RESOLVED → REOPENED
Flags: needinfo?(mathieu)
Resolution: FIXED → ---

The bug is linked to a topcrash signature, which matches the following criteria:

  • Top 20 desktop browser crashes on beta
  • Top 5 desktop browser crashes on Windows on beta

:leplatrem, could you consider increasing the severity of this top-crash bug?

For more information, please visit BugBot documentation.

Flags: needinfo?(mathieu)
Keywords: topcrash

The bug is marked as tracked for firefox130 (beta). However, the bug still isn't assigned, has low priority, and has low severity.

:markh, could you please find an assignee, increase the priority and increase the severity for this tracked bug? If you disagree with the tracking decision, please talk with the release managers.

For more information, please visit BugBot documentation.

Flags: needinfo?(markh)
Whiteboard: dom-lws-bugdash-triage

This is a worryingly high volume on beta 130. Can you help find an owner to investigate it? Thanks!

This collection of crashes shows people with an absolute tonne of pending saveAttachment write operations blocking shutdown (in the security-state/intermediates collection).

If the crash rate increased for this specific collection (security-state/intermediates), then it is related to the work we did in Bug 1889617.

Basically, instead of initiating 1700+ requests for tiny .pem files, we now download a single zip file and extract its contents into the IndexedDB attachments storage (i.e. 1700+ calls to saveAttachments() from a client.attachments.cacheAll() method).

In theory, our IndexedDB code is instrumented with shutdown blockers (:Gijs helped us do this years ago :)), but there could be bugs where the code escapes the blocker (i.e. Bug 1890278)?

We are not advanced platform developers, and have fast hardware, etc. Hence we would clearly need help to dive into this issue and fix it :/

Flags: needinfo?(mathieu)

Looking at the code again, it looks like we don't await the cache.set() promise in the loop (source), which would totally explain the crashing behaviour. That's a good start.
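
In other words, the loop is roughly shaped like this (a sketch only; fetchAttachment and cache are illustrative stand-ins, not the exact attachments code):

// Illustrative only; fetchAttachment and cache stand in for the real helpers.
async function cacheAllUnawaited(records, cache, fetchAttachment) {
  for (const record of records) {
    const blob = await fetchAttachment(record);
    // Not awaited: with 1700+ records, a large number of writes can still be
    // in flight when the shutdown blocker runs.
    cache.set(record.id, blob);
  }
}

async function cacheAllAwaited(records, cache, fetchAttachment) {
  for (const record of records) {
    const blob = await fetchAttachment(record);
    // Awaiting each write keeps the pending-operations list bounded and lets
    // shutdown drain it deterministically.
    await cache.set(record.id, blob);
  }
}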

See Also: → 1912893

OK, if this was fixed by the patch in bug 1912893, then we should expect to see crashes go back down with 130.0b5. So far so good, but let's wait a day or two before marking this fixed for 130/131.

This is a reminder regarding comment #20!

The bug is marked as tracked for firefox130 (beta). We have limited time to fix this; the soft freeze is in 14 days. However, the bug still isn't assigned and has low priority.

(In reply to Liz Henry (:lizzard) (relman/hg->git project) from comment #25)

> OK, if this was fixed by the patch in bug 1912893, then we should expect to see crashes go back down with 130.0b5. So far so good, but let's wait a day or two before marking this fixed for 130/131.

I think this looks better now? There are 0 crashes on b6/7, and the 3 crashes on b5 list:

"importChanges() in security-state/cert-revocations"

as the sole cause of remote settings shutdown hanging, which is very different from the pile of attachment saves in intermediates that was there before - though perhaps separately interesting. I don't know what the uptake rate is for beta, but b4 went to build only 2 days before b5, so the 30-fold higher crash rate seems pretty conclusive (I reached out to gsvelto to ask if we have a ratio based on release uptake pings or whatnot, which would be better). Not sure if you want to untrack or mark "fixed" given that the crash as a whole is still there (and was there in 129 etc. etc.).

Flags: needinfo?(ehenry)

I'm going to go with untracking. In an ideal world, we would have had a different bug for the one specific cause of the spike instead of it being lumped into this longstanding one, but c'est la vie.

Flags: needinfo?(ehenry)

Based on the topcrash criteria, the crash signature linked to this bug is not a topcrash signature anymore.

For more information, please visit BugBot documentation.

Keywords: topcrash