Closed Bug 1435446 Opened 6 years ago Closed 6 years ago

Crash in AsyncShutdownTimeout | places.sqlite#0: waiting for clients | places.sqlite#0: AboutSync: promiseSql (18),places.sqlite#0: Bookmarks.jsm: fetchBookmark (17),places.sqlite#0: GuidHelper.getItemGuid (16),places.sqlite#0: History.jsm: insert (13)...

Categories

(Toolkit :: Storage, defect, P1)

60 Branch
Unspecified
Windows 10
defect

Tracking

()

VERIFIED FIXED
mozilla61
Tracking Status
firefox-esr52 --- unaffected
firefox59 --- unaffected
firefox60 --- wontfix
firefox61 --- fixed

People

(Reporter: kkumari, Assigned: lina)

References

Details

(Keywords: crash)

Crash Data

Attachments

(3 files, 1 obsolete file)

This bug was filed from the Socorro interface and is
report bp-9e673c3c-f891-4a43-927f-60ce90180202.
=============================================================

Top 10 frames of crashing thread:

0 mozglue.dll mozalloc_abort memory/mozalloc/mozalloc_abort.cpp:33
1 xul.dll Abort xpcom/base/nsDebugImpl.cpp:457
2 xul.dll NS_DebugBreak xpcom/base/nsDebugImpl.cpp:413
3 xul.dll nsDebugImpl::Abort xpcom/base/nsDebugImpl.cpp:146
4 xul.dll XPTC__InvokebyIndex xpcom/reflect/xptcall/md/win32/xptcinvoke_asm_x86_64.asm:97
5  @0x27852481e7f 
6 xul.dll XPCWrappedNative::CallMethod js/xpconnect/src/XPCWrappedNative.cpp:1234
7 xul.dll XPC_WN_CallMethod js/xpconnect/src/XPCWrappedNativeJSOps.cpp:929
8 xul.dll js::InternalCallOrConstruct js/src/vm/Interpreter.cpp:473
9 xul.dll Interpret js/src/vm/Interpreter.cpp:3096

=============================================================
Prerequisite:
1. About:sync addon installed
2. services.sync.engine.bookmarks.buffer is set to true

STR:

1. Sign in to sync account (~1000 bookmarks)
2. Sync doesn't start automatically with About:sync addon(known bug 1332698)
3. Trigger sync
4. Open browser console and enter Weave.Service.enabled = false to disable sync

Observations: Sync get disabled but Sync icon is still spinning as syncing and last sync time shows wrong time. 

5. Enter Weave.Service.enabled = true to enable sync. 

Observation: I don't see any bookmark in library

6. I disconnect sync from about:preferences?entrypoint=menupanel#sync 
7. Sign in to sync on another new profile, trigger sync

Observations: I see bookmarks in library without favicon and last sync still shows wrong time. After few seconds/a minute browser crashed

I didn't see any error log in about:sync

Restarting the browser brought things to normal/as expected.
Assignee: nobody → kit
Priority: -- → P1
I can reproduce this consistently if I open the Library and start manually reordering bookmarks. Each move triggers a sync, so, if I drag quickly enough, statements on the main Places connection will eventually start failing with `SQLITE_BUSY`. (It helps with reproducing if I also run with a debug build and enable mirror trace logging; writing all those messages to the console increases the time our merge transaction is held open).

When that happens, we enter this loop, and never get out of it: https://searchfox.org/mozilla-central/rev/7ed2a637dc305e90fed6ffb041c6b8fa162b66bd/storage/mozStorageAsyncStatementExecution.cpp#231,250-258 `stepStatement` returns `SQLITE_BUSY` indefinitely, and the async thread on the Places connection locks up trying to retry that statement, forever. Others queue up behind it, and we get a shutdown hang on quit.

I added some statement logging, and I can see we're stuck retrying a statement like `UPDATE moz_bookmarks SET position = position + 1 WHERE parent = 2 AND position BETWEEN 17 AND 28`, which makes sense. I also see some busy logging for `BEGIN IMMEDIATE TRANSACTION`, but those never get stuck looping.

I can still use the mirror connection, though: even if the Places async thread is hung, I can paste `Weave.Service.engineManager.get('bookmarks')._store.ensureOpenMirror().then(m => m.apply()).then(_ => console.log('Merge finished'))` into the Browser Console, and the merge will finish. I can run this snippet several times rapidly, without deadlocking, unlike what I'm seeing for the main connection.

Things I tried:

* Cloning the Places connection in `apply` only, attaching to the mirror, running the merge, and closing the clone.
* Replacing every `executeCached` with `execute` in the mirror, thinking an unfinalized statement might be hanging around.
* Forcing GC or CC while the async thread was stuck.

None of those helped, and Places write queries on the main connection still fail with `SQLITE_BUSY`. Once that happens, we get into an infinite loop. Bumping https://searchfox.org/mozilla-central/rev/7ed2a637dc305e90fed6ffb041c6b8fa162b66bd/toolkit/components/places/Database.cpp#98 to a value like 45 seconds is the worst of both worlds: locks up the main thread for 45 seconds, then infinite loops.

If I remove the busy-waiting logic, the thread never gets stuck...though some moves fail while I'm syncing, of course, with no UI feedback besides the position indicator appearing to "not take".

Stepping back a bit, the only reason we're using a separate connection is because the main connection has helpers that check if a transaction is in progress, and continue anyway (bug 1305563, comment 165 has a longer explanation). We don't want that for the mirror, because the local tree might then change underneath us, and we'll corrupt it when we go to apply our changes.

What do you think we should do here, Mak?
Flags: needinfo?(mak77)
(In reply to Kit Cambridge (they/them) [:kitcambridge] from comment #3)
> Stepping back a bit, the only reason we're using a separate connection is
> because the main connection has helpers that check if a transaction is in
> progress, and continue anyway (bug 1305563, comment 165 has a longer
> explanation). We don't want that for the mirror, because the local tree
> might then change underneath us, and we'll corrupt it when we go to apply
> our changes.

Actually, we might be able to use the main Places connection to apply the merged tree.

When we `DELETE FROM {items, structure}ToMerge`, SQLite won't interleave other statements as it's running all our `INSTEAD OF DELETE` triggers. So, when we run `DELETE FROM structureToMerge`, anything in `moz_bookmarks` that's not in `mergeStates` must have been added by an interleaved `PlacesUtils.bookmarks.insert`, and anything in `mergeStates` that's not in `moz_bookmarks` must have been removed by an interleaved `PlacesUtils.bookmarks.remove`.

We could try to fix up those items in the `updateLocalStructure` trigger: move items that don't exist in `mergeStates` to the end of their parent folders, and fix holes in positions caused by deletes. (I think https://searchfox.org/mozilla-central/rev/7ed2a637dc305e90fed6ffb041c6b8fa162b66bd/toolkit/components/places/Bookmarks.jsm#2007-2023 does something like this for `reorder`). This is going to be tricky to get right, and there are probably better solutions, but something to think about.
(In reply to Kit Cambridge (they/them) [:kitcambridge] from comment #4)
> When we `DELETE FROM {items, structure}ToMerge`, SQLite won't interleave
> other statements as it's running all our `INSTEAD OF DELETE` triggers. So,
> when we run `DELETE FROM structureToMerge`, anything in `moz_bookmarks`
> that's not in `mergeStates` must have been added by an interleaved
> `PlacesUtils.bookmarks.insert`, and anything in `mergeStates` that's not in
> `moz_bookmarks` must have been removed by an interleaved
> `PlacesUtils.bookmarks.remove`.

I think I'm wrong here. If we use the main connection (https://searchfox.org/mozilla-central/rev/7ed2a637dc305e90fed6ffb041c6b8fa162b66bd/toolkit/components/places/PlacesUtils.jsm#1915-1922), we won't interleave `insert` and `remove` calls with our merge transaction, since Sqlite.jsm maintains a transaction queue. We only have to worry about interleaved writes to `moz_bookmarks` from C++ consumers, and those are going away.

So maybe the original reason for a separate connection no longer applies, and everything should Just Work™ now if we use the main Places connection.
Yes, we can remove all the moz_bookmark writes from cpp consumers soon, the tricky parts are tags and the Library left pane. Likely both are fixable in the 60-61 cycles, depending on the other priorities.

(In reply to Kit Cambridge (they/them) [:kitcambridge] from comment #3)
> When that happens, we enter this loop, and never get out of it:
> https://searchfox.org/mozilla-central/rev/
> 7ed2a637dc305e90fed6ffb041c6b8fa162b66bd/storage/
> mozStorageAsyncStatementExecution.cpp#231,250-258 `stepStatement` returns
> `SQLITE_BUSY` indefinitely, and the async thread on the Places connection
> locks up trying to retry that statement, forever.

This is strange, there must be something else ongoing, since after the BUSY reason disappears, the loop should unblock.
There are various possible reasons for an infinite busy status:
1. a transaction is left open (commit failed and we didn't handle that with a rollback)
2. the operation on your connection takes longer than expected, it looks infinite but in reality it's "very long time"
3. we somehow share a pre-compiled statement across different threads

Maybe a MOZ log could help analyzing the queries traffic and finding the culprit.
We should find the reason for the infinite SQLITE_BUSY, since that looks scary also for the rest of the code.
I'm a bit out of my depth looking at the hang.

* Running with `MOZ_LOG=timestamp,mozStorage:5`, I don't see any places where we have a `BEGIN` without a matching `COMMIT`...but we execute a lot of statements. :-)

* I added a retry counter to `AsyncExecuteStatements::executeStatement` that calls `::sqlite3_interrupt(mNativeConnection)` if we run the busy loop 5000 times. (5000 is somewhat arbitrary. I also tried interrupting immediately after seeing `SQLITE_BUSY`, and noticed we'd occasionally interrupt the merge transaction). I also added `sqlite3_expanded_sql` logging to `executeStatement`, so we can see what we're trying to run, and `Connection::stepStatement`, so we can see (in theory) what we interrupted.

* Interrupting "works": it unsticks the async thread, and statements continue to run. However, I did notice some cases where we'd interrupt a statement in a `PlacesUtils.bookmarks.update` transaction, causing SQLite to roll the entire transaction back. That's not desired or acceptable. It's OK to interrupt Sync, since we can just try to merge again later, but interrupting statements on the main Places connection means we'll lose data.

Here's an excerpt from the logging I added.

> executeAndProcessStatement: Ran 
>         INSERT INTO mergeStates(localGuid, mergedGuid, parentGuid, level,
>                                 position, valueState, structureState)
>         VALUES(IFNULL('sJKPawNpF3MW', 'sJKPawNpF3MW'), 'sJKPawNpF3MW', 'menu________',
>                1, 394, 1, 1)
> executeAndProcessStatement: Ran 
>         INSERT INTO mergeStates(localGuid, mergedGuid, parentGuid, level,
>                                 position, valueState, structureState)
>         VALUES(IFNULL('Eq7YgDzXcFMT', 'Eq7YgDzXcFMT'), 'Eq7YgDzXcFMT', 'menu________',
>                1, 395, 1, 1)
> executeStatement (0x0x192c9bda0): Giving up after 5000 attempts and interrupted so we could run 
>       UPDATE moz_bookmarks SET
>         syncChangeCounter = syncChangeCounter + 1
>       WHERE guid = 'menu________'
> stepStatement (0x0x192c9bda0): Interrupted
>       UPDATE moz_bookmarks SET
>         syncChangeCounter = syncChangeCounter + 1
>       WHERE guid = 'menu________' so someone else could run
> executeAndProcessStatement: Ran
>       UPDATE moz_bookmarks SET
>         syncChangeCounter = syncChangeCounter + 1
>       WHERE guid = 'menu________'
> executeAndProcessStatement: Ran SELECT b.guid, IFNULL(p.guid, "") AS parentGuid, b.position AS 'index',
>               b.dateAdded, b.lastModified, b.type, IFNULL(b.title, "") AS title,
>               h.url AS url, b.id AS _id, b.parent AS _parentId,
>               (SELECT count(*) FROM moz_bookmarks WHERE parent = b.id) AS _childCount,
>               p.parent AS _grandParentId, b.syncStatus AS _syncStatus
>        FROM moz_bookmarks b
>        LEFT JOIN moz_bookmarks p ON p.id = b.parent
>        LEFT JOIN moz_places h ON h.id = b.fk
>        WHERE b.guid = 'nWXmEoZ8DKGk'
> executeAndProcessStatement: Ran
>         INSERT INTO mergeStates(localGuid, mergedGuid, parentGuid, level,
>                                 position, valueState, structureState)
>         VALUES(IFNULL('FYScaCI-yqne', 'FYScaCI-yqne'), 'FYScaCI-yqne', 'menu________',
>                1, 396, 1, 1)
> executeAndProcessStatement: Ran SELECT b.guid, IFNULL(p.guid, "") AS parentGuid, b.position AS 'index',
>               b.dateAdded, b.lastModified, b.type, IFNULL(b.title, "") AS title,
>               h.url AS url, b.id AS _id, b.parent AS _parentId,
>               (SELECT count(*) FROM moz_bookmarks WHERE parent = b.id) AS _childCount,
>               p.parent AS _grandParentId, b.syncStatus AS _syncStatus
>        FROM moz_bookmarks b
>        LEFT JOIN moz_bookmarks p ON p.id = b.parent
>        LEFT JOIN moz_places h ON h.id = b.fk
>        WHERE b.guid = 'nWXmEoZ8DKGk'
> executeAndProcessStatement: Ran
>         INSERT INTO mergeStates(localGuid, mergedGuid, parentGuid, level,
>                                 position, valueState, structureState)
>         VALUES(IFNULL('mXrlYc55ohXf', 'mXrlYc55ohXf'), 'mXrlYc55ohXf', 'menu________',
>                1, 397, 1, 1)

So, we're building our `mergeStates` table, then try to bump the change counter on the menu after dragging a bookmark, that fails, so we interrupt...the same exact statement? How does that work? (Is that because we're waiting on all currently executing statements to finish, and we started to execute the `UPDATE` before `sqlite3_interrupt` was done?)
sqlite3_interrupt() interrupts anything on any thread and could be enqueued, it's not safe to call it unless you're on a read-only connection. it's likely the update had started by the time it ran.
The problem is that from that log you see what is being blocked, but you don't see what is blocking. We should check the full MOZ_LOG and identify if there's anything that is not finalized or closed properly.
You could try adding an error callback https://www.sqlite.org/errlog.html maybe it could print out something useful.

(In reply to Kit Cambridge (they/them) [:kitcambridge] from comment #8)
> So, we're building our `mergeStates` table, then try to bump the change
> counter on the menu after dragging a bookmark

INSERT INTO mergeStates is in a transaction, the changeRecords ones. So it's normal that anything else returns SQLITE_BUSY until that transaction is committed/rollbacked. When/where does that happen? could be that one of the awaits in that transaction never resolves?
Flags: needinfo?(mak77)
You could try to reduce TRANSACTIONS_QUEUE_TIMEOUT_MS in Sqlite.jsm to 30 seconds or such, and see if anything changes
This one is a doozy. I've attached patches with two different approaches that we can take. We might pick one or the other, or both, or neither. :-)

The first patch changes the statement retry logic in mozStorage to rollback if we fail with `SQLITE_BUSY` in a transaction. The SQLite docs for `sqlite3_step` (https://sqlite.org/c3ref/step.html) suggest that only COMMITs and statements outside a transaction are safe to retry the way we do now. If any other statement in a transaction fails with `SQLITE_BUSY`, I think we need to rollback the entire transaction. Conceptually, I guess this makes sense: if another connection changed the database while we were running our transaction, our statements might no longer be coherent, and we need to start again.

I suspect this also explains why calling `sqlite3_interrupt(mNativeConnection)` worked, since the docs for `sqlite3_interrupt` (https://sqlite.org/c3ref/interrupt.html) say: "If the interrupted SQL operation is an INSERT, UPDATE, or DELETE that is inside an explicit transaction, then the entire transaction will be rolled back automatically".

However, this causes data loss, since none of our consumers retry transactions. If you make a change that writes to bookmarks or history just as Sync is merging, and the write fails with `SQLITE_BUSY`, we'll fail with an error and lose the write. The user can try again, but they might not have feedback that anything went wrong.

Mak suggested using IMMEDIATE instead of DEFERRED transactions to work around the data loss issue. That still infinite loops with this patch; we get stuck retrying `BEGIN IMMEDIATE TRANSACTION`! I'm confused about why we never get out of the busy state, and why the infinite loop continues even when I close the mirror connection. (FWIW, the mirror closes cleanly at https://searchfox.org/mozilla-central/rev/9d47b5fb14152865ad1375eb3ee2e571d81ecdb9/storage/mozStorageConnection.cpp#1100; it doesn't have to wait for any outstanding statements). We closed the second writer; who's holding the lock?

Returning an error on busy has the same data loss problem.

The second patch changes the mirror to use the main Places connection, instead of a clone. This sidesteps the data loss problem thanks to Sqlite.jsm serializing our transactions. Tags and annos are the last remaining synchronous API consumers, and both should be OK. Tags aren't part of the merge tree; we use cascading `INSTEAD OF` triggers to update them, so the statements in the tagging service shouldn't interfere with our work, even with the `mozStorage` helper plowing ahead.

There are a lot of "should"s in that paragraph, and we'll want to test this thoroughly. There's also the problem that, if the merge fails, we'll also rollback tag changes. However, tags and annos are less commonly used (citation needed), so we may decide all this is fine.
I'm trying to whittle this down to a reproducible test case in https://github.com/kitcambridge/bug-1435446-repro, but that recovers from `SQLITE_BUSY` correctly, so something else is going on. I think my `ROLLBACK` patch working here is a red herring.

Andrew, CCing you just in case you have insights. :-)
Trying to reproduce the hang in xpcshell, no dice yet.
Good news: I found a way to avoid the busy loop without data loss. \o/ If I:

1) Change `Bookmarks.jsm` to use `BEGIN IMMEDIATE` for all transactions, *and*...
2) Call `::sqlite3_reset(aStatement)` after `PR_Sleep(PR_INTERVAL_NO_WAIT)`...

We retry the `BEGIN IMMEDIATE`, succeed, and run the transaction. (FWIW, https://searchfox.org/mozilla-central/rev/88e89261ea81860f44a44ed068cf5839aea5def6/storage/mozStorageConnection.cpp#1186,1200 also resets after `SQLITE_LOCKED_SHAREDCACHE`, so maybe it makes sense we do the same for `SQLITE_BUSY`?)

We don't need to roll anything back, and there's no data loss or errors. The UI lags for about a second to register some of the moves, but they succeed (I'm on a debug build with Sync trace logging enabled, to make the hang easier to reproduce), and the bookmarks end up in the correct location. About Sync shows no validation problems, and the Places integrity check passes.

Bad news: I'm not sure, but I suspect, that we're going to need to change all of Places to use `BEGIN IMMEDIATE`, not `BEGIN DEFERRED`. You might visit a page as we're syncing, for instance, and a statement in the transaction to insert the visit would fail with `SQLITE_BUSY`.

Do you think this is realistic or doable, Mak? If we do switch everything to IMMEDIATE, do you expect perf regressions, to the point where this wouldn't be worth it?
Flags: needinfo?(mak77)
Attachment #8951112 - Attachment is obsolete: true
(In reply to Kit Cambridge (they/them) [:kitcambridge] from comment #17)
> We retry the `BEGIN IMMEDIATE`, succeed, and run the transaction. (FWIW,
> https://searchfox.org/mozilla-central/rev/
> 88e89261ea81860f44a44ed068cf5839aea5def6/storage/mozStorageConnection.
> cpp#1186,1200 also resets after `SQLITE_LOCKED_SHAREDCACHE`, so maybe it
> makes sense we do the same for `SQLITE_BUSY`?)

The documentation says SQLITE_LOCKED_SHAREDCACHE works similarly to SQLITE_BUSY, but for the shared cache case, thus it's likely we should do the same.

> Bad news: I'm not sure, but I suspect, that we're going to need to change
> all of Places to use `BEGIN IMMEDIATE`, not `BEGIN DEFERRED`. You might
> visit a page as we're syncing, for instance, and a statement in the
> transaction to insert the visit would fail with `SQLITE_BUSY`.

That's what I was saying when I suggested trying IMMEDIATE, that we'd have to replace ALL the transactions on the Places connection to be IMMEDIATE, and I wasn't sure how easily we could do that.
I don't have problems from a functional point of view, the main reason to use deferred was to keep the db locked for the shorter time possible. Part of that reasoning has been superseded by using WAL (readers are not blocked), part by moving to async (no jank), and part by trying in general to use as small transactions as possible in code (good coding habits).

The main concern left, is how to do that in a way that has 100% coverage and is hard to regress, because we don't want tomorrow to discover we forgot to change a transaction, or we introduced a new one that is not immediate.
Likely Sqlite.jsm, since it runs everything async, could default to IMMEDIATE for all consumers.
The remaining calls in Places use mozStorageTransaction, and are not many https://searchfox.org/mozilla-central/search?q=mozStorageTransaction&path=places. We could add a defaultTransactionType attribute to mozIStorageConnection (defaults to deferred) and have mozStorageTransaction use the set value if its argument is TRANSACTION_DEFAULT (-1). At the same time I'd remove BeginTransactionAs, that is unused.
Flags: needinfo?(mak77)
Comment on attachment 8950808 [details]
Bug 1435446 - Add a default transaction type for storage connections.

https://reviewboard.mozilla.org/r/220048/#review230386

Cloning a connection should probably set the same transaction type on the cloned connection.

::: storage/mozStorageAsyncStatementExecution.cpp:525
(Diff revision 4)
>    }
>    if (mState == CANCELED)
>      return notifyComplete();
>  
>    if (statementsNeedTransaction() && mConnection->getAutocommit()) {
> -    if (NS_SUCCEEDED(mConnection->beginTransactionInternal(mNativeConnection,
> +    if (NS_SUCCEEDED(mConnection->beginTransactionInternal(mNativeConnection))) {

We probably want to retain IMMEDIATE here, regardless, because we are on the async thread and we are artificially adding a transaction, we want to know if we success/fail immediately, rather than at the next statement...

::: storage/mozStorageHelper.h:78
(Diff revision 4)
>    {
>      if (mConnection) {
>        nsAutoCString query("BEGIN");
> -      switch(aType) {
> +      int32_t type = aType;
> +      if (type == mozIStorageConnection::TRANSACTION_DEFAULT) {
> +        MOZ_ALWAYS_SUCCEEDS(mConnection->GetDefaultTransactionType(&type));

IIRC mozStorageTransaction is thread-safe, probably mDefaultTransactionType should be atomic.
Attachment #8950808 - Flags: review?(mak77)
Comment on attachment 8950808 [details]
Bug 1435446 - Add a default transaction type for storage connections.

https://reviewboard.mozilla.org/r/220048/#review230386

I did add a call to `aClone->SetDefaultTransactionType(mDefaultTransactionType)`; will add a test, too.

> IIRC mozStorageTransaction is thread-safe, probably mDefaultTransactionType should be atomic.

Good catch, since it's a read-write attribute, it might be changed at any time.
Comment on attachment 8950808 [details]
Bug 1435446 - Add a default transaction type for storage connections.

https://reviewboard.mozilla.org/r/220048/#review230392

::: toolkit/modules/Sqlite.jsm:1377
(Diff revision 4)
>     * @param func
>     *        (function) What to perform as part of the transaction.
>     * @param type optional
>     *        One of the TRANSACTION_* constants attached to this type.
>     */
> -  executeTransaction(func, type = this.TRANSACTION_DEFERRED) {
> +  executeTransaction(func, type = this.TRANSACTION_IMMEDIATE) {

Sqlite.jsm connection opening should probably grow an option to define the default transaction type.
Status: NEW → ASSIGNED
Component: Sync → Storage
Product: Firefox → Toolkit
Comment on attachment 8950808 [details]
Bug 1435446 - Add a default transaction type for storage connections.

https://reviewboard.mozilla.org/r/220048/#review231562

::: toolkit/modules/Sqlite.jsm:229
(Diff revision 5)
>    this._statementCounter = 0;
>  
>    // Increments whenever we request a unique operation id.
>    this._operationsCounter = 0;
>  
> +  this.defaultTransactionType = options.defaultTransactionType;

there's something unclear here, options is set as an optional object defaulting to {}. thus this can potentially be undefined.
Later we do
if (type == OpenedConnection.prototype.TRANSACTION_DEFAULT) {
  type = this.defaultTransactionType;

Thus it looks like type can be undefined.

It's likely you pretty much made options not optional anymore, but then the signature here should change and we should throw if options is undefined.

::: toolkit/modules/Sqlite.jsm:935
(Diff revision 5)
> +    }
> +
> +    openedOptions.defaultTransactionType = defaultTransactionType;
> +  } else {
> +    openedOptions.defaultTransactionType =
> +      OpenedConnection.prototype.TRANSACTION_DEFERRED;

We should probably get the default from the underlying connection after we open it, so we can set the default in just one place
Attachment #8950808 - Flags: review?(mak77)
Comment on attachment 8950810 [details]
Bug 1435446 - Reset executing statements on `SQLITE_BUSY` before retrying.

https://reviewboard.mozilla.org/r/220054/#review231596

::: storage/test/unit/test_retry_on_busy.js:1
(Diff revision 5)
> +/* eslint-disable mozilla/no-arbitrary-setTimeout */

please add the PD license boilerplate
Attachment #8950810 - Flags: review?(mak77) → review+
Comment on attachment 8953714 [details]
Bug 1435446 - Use immediate transactions by default in Places.

https://reviewboard.mozilla.org/r/222954/#review231600
Attachment #8953714 - Flags: review?(mak77) → review+
Comment on attachment 8950808 [details]
Bug 1435446 - Add a default transaction type for storage connections.

https://reviewboard.mozilla.org/r/220048/#review231982
Attachment #8950808 - Flags: review?(mak77) → review+
Pushed by kcambridge@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/f8a3b9547dbc
Add a default transaction type for storage connections. r=mak
https://hg.mozilla.org/integration/autoland/rev/400d6348abca
Reset executing statements on `SQLITE_BUSY` before retrying. r=mak
https://hg.mozilla.org/integration/autoland/rev/381fa7a9955c
Use immediate transactions by default in Places. r=mak
https://hg.mozilla.org/mozilla-central/rev/f8a3b9547dbc
https://hg.mozilla.org/mozilla-central/rev/400d6348abca
https://hg.mozilla.org/mozilla-central/rev/381fa7a9955c
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla61
Is this something you'd want to consider backporting to 60?
Flags: needinfo?(kit)
I don't think so.
(In reply to Ryan VanderMeulen [:RyanVM] from comment #48)
> Is this something you'd want to consider backporting to 60?

Nope, as Mak says, this should ride the trains. It's not a regression, and, in practice, only affects an experimental feature that's pref'd off in 60.
Flags: needinfo?(kit)
Status: RESOLVED → VERIFIED

@Lina, on bug 1567018 we are hitting another possible SQLITE_BUSY infinite loop. On bug 1567018 comment 26, the user reported that removing the favicons.sqlite file cleared the problem. Can you say if the fix landed here would apply to the favicons.sqlite access as well? I'm wondering if a similar fix is also needed for favicons or if bug 1567018 is just a different problem.

Flags: needinfo?(lina)

Ooof, good question—I don't know! Even after this fix landed, we were still seeing stepStatement return SQLITE_BUSY forever (just a lot less frequently) when using a second read-write connection. I never could come up with a standalone test case, but it was easy to repro just by opening the Library and quickly dragging bookmarks around with Sync turned on. Eventually, the main connection would wedge forever trying to run a statement, and the second Sync-only connection would keep working...but the awesomebar and library would be broken, and closing Firefox would trigger a shutdown hang, just like bug 1567018 describes. In the end, we fixed this particular problem by removing the second connection entirely (bug 1567238, comment 4), and having the main Places connection do all the writing.

But favicons.sqlite is an attached database (even though it's on the same, single read-write connections), and I'm not sure how that factors into things. It might be a different problem, or it might be the same one, with some combination of attached databases, multiple connections—even though we only have one writer, we still have multiple readers—transactions, and locks not getting released when they should.

Mak, do you have any more insights?

Flags: needinfo?(lina) → needinfo?(mak)

(In reply to :Lina Cambridge from comment #52)

But favicons.sqlite is an attached database (even though it's on the same, single read-write connections), and I'm not sure how that factors into things. It might be a different problem, or it might be the same one, with some combination of attached databases, multiple connections—even though we only have one writer, we still have multiple readers—transactions, and locks not getting released when they should.

Unfortunately no, I'd really like to know what's up. Ideally an attached database shouldn't change much since we still have just a single connection writing. And using WAL should ideally ensure readers never get a BUSY state. I've seen other reports on the Web about SQLITE_BUSY with attached databases, so I can't exclude there are unknowns here.
One interesting question would be to figure out whether that favicons.sqlite was corrupt or sane... But since it has been removed in the meanwhile, we may never know. It could also be a case where a query is taking a really huge amount of time (maybe due to some specific characteristic of that db), now the perf profiler should report the running SQL query in a marker, or at least a part of it (bug 1635570), so we should always ask for profiles when possible.
Of course, having some way to reproduce these behavior would go a long way to help us understanding the problem.

Flags: needinfo?(mak)

Thanks for looking into it. A couple of questions which might be better moved to bug 1567018.

I wonder if we could add some diagnostic asserts to Nightly to trigger a crash if we ever get SQLITE_BUSY repeatedly assuming that would help debug this further. And should we land a fix to make it impossible to infinite loop here?

I noticed that in AsyncExecuteStatements::Run(), we check mCancelRequested. If mCancelRequested could be set between the time the runnable is dispatched and when it is executed, presumably it could be set after we have checked in ::Run. Should we be checking this while in the SQLITE_BUSY loop or perhaps checking for any other state indicating a shutdown is happening?

(In reply to Haik Aftandilian [:haik] from comment #54)

I wonder if we could add some diagnostic asserts to Nightly to trigger a crash if we ever get SQLITE_BUSY repeatedly assuming that would help debug this further. And should we land a fix to make it impossible to infinite loop here?

Note getting SQLITE_BUSY in certain cases is absolutely normal, because we have a few connections using multiple threads, we use PRAGMA busy_timeout to cope with that in Places, we should likely do the same in any component that uses both the sync and async API on the same connection.
By code inspection, that's likely just Places, Cookies and tests. Afaict, Places is the only one that has to do real mixed usage, Cookies only do opening/closing on MT, so maybe it's not strictly necessary. Effectively it looks like fixing Cookies and tests, would leave Places as the only mixed use consumer that is a bit more complex to attack.

I noticed that in AsyncExecuteStatements::Run(), we check mCancelRequested. If mCancelRequested could be set between the time the runnable is dispatched and when it is executed, presumably it could be set after we have checked in ::Run. Should we be checking this while in the SQLITE_BUSY loop or perhaps checking for any other state indicating a shutdown is happening?

Potentially yes, my only doubt is that accessing it requires a lock on mMutex, that may still be ok considered we don't set it that often.
I'm not completely sure why it's not just an atomic, but there's a long discussion in bug 506805, that may be outdated considered it was about PR_AtomicSet and very old Gcc. That discussion ended up replacing the old PR_ATOMIC with the mutex (https://hg.mozilla.org/mozilla-central/rev/097171b5d15f).
I wonder if today we could use std::atomic<bool>, Andrew was part of that original discussion and maybe has an insight.
Otherwise, I think we can just use the mutex.

I'm sorry, forgot the ni? regarding comment 55

Flags: needinfo?(bugmail)

(In reply to Marco Bonardo [:mak] from comment #55)

I noticed that in AsyncExecuteStatements::Run(), we check mCancelRequested. If mCancelRequested could be set between the time the runnable is dispatched and when it is executed, presumably it could be set after we have checked in ::Run. Should we be checking this while in the SQLITE_BUSY loop or perhaps checking for any other state indicating a shutdown is happening?

Potentially yes, my only doubt is that accessing it requires a lock on mMutex, that may still be ok considered we don't set it that often.

Acquiring the lock should be OK. It's only held for a few instructions each time it is used and it appears we never hold both mMutex and mDBMutex meaning we don't have to worry about deadlock if we follow that convention.

See Also: → 1643491
See Also: → 1643793

Sorry about the delay, I was actively looking into this when a big storm knocked out my power for a few days.

(In reply to Marco Bonardo [:mak] from comment #55)

I wonder if today we could use std::atomic<bool>, Andrew was part of that original discussion and maybe has an insight.
Otherwise, I think we can just use the mutex.

We should just use the mutex. If anything, the fact that we now have Link Time Optimization (LTO) means it's much more dangerous to try and get clever with use of atomics. The cost of the mutex is negligible given the inefficiency of SQLITE_BUSY itself.

Flags: needinfo?(bugmail)
You need to log in before you can comment on or make changes to this bug.