Closed Bug 1463512 Opened 3 years ago Closed 3 years ago
Crash in shutdownhang | Rtl
Acquire SRWLock Exclusive | mozilla::places::History::Observe
This bug was filed from the Socorro interface and is report bp-d6b2a216-bb7c-45e6-8a03-316ee0180518. ============================================================= Seen while looking at nightly crash stats - crashes started using 20180515100038: https://bit.ly/2keWCBO. 35 crashes/27 installations. Possible regression range based on Build ID: https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=45ec8fd380dd2c308e79dbb396ca87f2ce9b3f9c&tochange=380cf87c1ee3966dd94499942b73085754dc4824 ni on :mak in case he has some insight. There is one lone crash in 61 in beta 6, but there is a definite noticeable spike on nightly. Top 10 frames of crashing thread: 0 ntdll.dll NtWaitForAlertByThreadId 1 ntdll.dll RtlAcquireSRWLockExclusive 2 xul.dll mozilla::places::History::Observe toolkit/components/places/History.cpp:3321 3 xul.dll nsObserverList::NotifyObservers xpcom/ds/nsObserverList.cpp:112 4 xul.dll nsObserverService::NotifyObservers xpcom/ds/nsObserverService.cpp:297 5 xul.dll mozilla::places::Database::Observe toolkit/components/places/Database.cpp:2715 6 xul.dll nsObserverList::NotifyObservers xpcom/ds/nsObserverList.cpp:112 7 xul.dll nsObserverService::NotifyObservers xpcom/ds/nsObserverService.cpp:297 8 xul.dll nsXREDirProvider::DoShutdown toolkit/xre/nsXREDirProvider.cpp:1070 9 xul.dll ScopedXPCOMStartup::~ScopedXPCOMStartup toolkit/xre/nsAppRunner.cpp:1372 =============================================================
The range includes new autofill, I don't see other large changes there. I was hoping to see this disappear after the fix for bug 1461753. Though, it's also the day after we published https://blog.nightly.mozilla.org/2018/05/14/deep-dive-new-bookmark-sync-in-nightly/ and the new sync involves exclusive locks. I suppose we have no way to correlate crashes to prefs... I have no further ideas for now, but let's track this.
Priority: -- → P1
We're pretty much waiting on shutdown to acquire an exclusive slim R/W lock, likely to close the readonly connection to places.sqlite kept open by History. Sync seems to be more likely, but I'm really guessing here. Kit, could it be that the mirror doesn't give up if shutdown happens in the meanwhile? I think shutdown hang happens after 1 minute and 3 seconds, so it would be an operation longer than this time...
Flags: needinfo?(mak77) → needinfo?(kit)
Hmm, the mirror does have checks (via `Async.jankYielder`, in https://searchfox.org/mozilla-central/rev/eb51d734a8c06ac5973291be9bcc74ed4e999bca/services/common/async.js#105,108) to abort merging if it sees `quit-application`...but we don't (can't, without changing Sqlite.jsm?) abort the triggers. I guess we'll want to cancel the statement once shutdown begins? The mirror writes from a cloned connection, so it seems like it's safe to interrupt its statements without affecting the main Places connection. I also noticed Sqlite.jsm adds a shutdown blocker in `executeTransaction` (https://searchfox.org/mozilla-central/rev/eb51d734a8c06ac5973291be9bcc74ed4e999bca/toolkit/modules/Sqlite.jsm#662-664), which we probably don't want, either. If the transaction doesn't finish before shutdown, we'll just start the process over again on the next sync. OOC, why are we hanging trying to shut down the *read-only* connection? The mirror uses an IMMEDIATE transaction, and I thought, in any case, readers won't wait on writers with WAL? The other use of the shutdown mutex that I found is https://searchfox.org/mozilla-central/rev/eb51d734a8c06ac5973291be9bcc74ed4e999bca/toolkit/components/places/History.cpp#1166 and https://searchfox.org/mozilla-central/rev/eb51d734a8c06ac5973291be9bcc74ed4e999bca/toolkit/components/places/History.cpp#1880. Could it be that the mirror starts merging and runs its trigger, that takes a long time, History.cpp takes `mShutdownMutex` and starts its transaction, that transaction ends up waiting on the mirror, the user shuts down, history hasn't started its transaction and is still holding `mShutdownMutex`, and we hang?
Side note: Slow SQL telemetry (bug 1367023) would be very helpful here. Mine has two hits for `DELETE FROM itemsToMerge`, average 803ms.
(In reply to Kit Cambridge (they/them) [:kitcambridge] from comment #4) > Side note: Slow SQL telemetry (bug 1367023) would be very helpful here. Mine > has two hits for `DELETE FROM itemsToMerge`, average 803ms. we are discussing it and will likely be ready soon.
I have another possible theory, that this was caused by bug 1463017. Clearly that means this crash will disappear/reduce from the 20180530 nightly build. We'll be able to tell in a couple days. And this may also have other implications...
Moving to P2 for now, because we don't have sufficient data to act on it, hopefully the new Slow SQL dashboard will give us that info, plus we are investigating some reports of slow SQL that may be related. Anyway doesn't look like this is directly actionable yet.
Priority: P1 → P2
There are no crashes after the 9th of June build, so I assume bug 1467537 fixed this. IT can make sense, that bug may have caused longstanding statements and large wal journal, that are usually merged on shutdown. It also caused multi-seconds hangs on ssd, let alone mechanical disks.
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla62
You need to log in before you can comment on or make changes to this bug.