<a class="header-button" href="https://bugzilla.mozilla.org/home" title="Go to home page"> Bugzilla

Assignee

Comment 3

•

7 years ago

I can reproduce this consistently if I open the Library and start manually reordering bookmarks. Each move triggers a sync, so, if I drag quickly enough, statements on the main Places connection will eventually start failing with `SQLITE_BUSY`. (It helps with reproducing if I also run with a debug build and enable mirror trace logging; writing all those messages to the console increases the time our merge transaction is held open). When that happens, we enter this loop, and never get out of it: https://searchfox.org/mozilla-central/rev/7ed2a637dc305e90fed6ffb041c6b8fa162b66bd/storage/mozStorageAsyncStatementExecution.cpp#231,250-258 `stepStatement` returns `SQLITE_BUSY` indefinitely, and the async thread on the Places connection locks up trying to retry that statement, forever. Others queue up behind it, and we get a shutdown hang on quit. I added some statement logging, and I can see we're stuck retrying a statement like `UPDATE moz_bookmarks SET position = position + 1 WHERE parent = 2 AND position BETWEEN 17 AND 28`, which makes sense. I also see some busy logging for `BEGIN IMMEDIATE TRANSACTION`, but those never get stuck looping. I can still use the mirror connection, though: even if the Places async thread is hung, I can paste `Weave.Service.engineManager.get('bookmarks')._store.ensureOpenMirror().then(m => m.apply()).then(_ => console.log('Merge finished'))` into the Browser Console, and the merge will finish. I can run this snippet several times rapidly, without deadlocking, unlike what I'm seeing for the main connection. Things I tried: * Cloning the Places connection in `apply` only, attaching to the mirror, running the merge, and closing the clone. * Replacing every `executeCached` with `execute` in the mirror, thinking an unfinalized statement might be hanging around. * Forcing GC or CC while the async thread was stuck. None of those helped, and Places write queries on the main connection still fail with `SQLITE_BUSY`. Once that happens, we get into an infinite loop. Bumping https://searchfox.org/mozilla-central/rev/7ed2a637dc305e90fed6ffb041c6b8fa162b66bd/toolkit/components/places/Database.cpp#98 to a value like 45 seconds is the worst of both worlds: locks up the main thread for 45 seconds, then infinite loops. If I remove the busy-waiting logic, the thread never gets stuck...though some moves fail while I'm syncing, of course, with no UI feedback besides the position indicator appearing to "not take". Stepping back a bit, the only reason we're using a separate connection is because the main connection has helpers that check if a transaction is in progress, and continue anyway (bug 1305563, comment 165 has a longer explanation). We don't want that for the mirror, because the local tree might then change underneath us, and we'll corrupt it when we go to apply our changes. What do you think we should do here, Mak?

Flags: needinfo?(mak77)

Assignee

Comment 4

•

7 years ago

(In reply to Kit Cambridge (they/them) [:kitcambridge] from comment #3) > Stepping back a bit, the only reason we're using a separate connection is > because the main connection has helpers that check if a transaction is in > progress, and continue anyway (bug 1305563, comment 165 has a longer > explanation). We don't want that for the mirror, because the local tree > might then change underneath us, and we'll corrupt it when we go to apply > our changes. Actually, we might be able to use the main Places connection to apply the merged tree. When we `DELETE FROM {items, structure}ToMerge`, SQLite won't interleave other statements as it's running all our `INSTEAD OF DELETE` triggers. So, when we run `DELETE FROM structureToMerge`, anything in `moz_bookmarks` that's not in `mergeStates` must have been added by an interleaved `PlacesUtils.bookmarks.insert`, and anything in `mergeStates` that's not in `moz_bookmarks` must have been removed by an interleaved `PlacesUtils.bookmarks.remove`. We could try to fix up those items in the `updateLocalStructure` trigger: move items that don't exist in `mergeStates` to the end of their parent folders, and fix holes in positions caused by deletes. (I think https://searchfox.org/mozilla-central/rev/7ed2a637dc305e90fed6ffb041c6b8fa162b66bd/toolkit/components/places/Bookmarks.jsm#2007-2023 does something like this for `reorder`). This is going to be tricky to get right, and there are probably better solutions, but something to think about.

Assignee

Comment 5

•

7 years ago

(In reply to Kit Cambridge (they/them) [:kitcambridge] from comment #4) > When we `DELETE FROM {items, structure}ToMerge`, SQLite won't interleave > other statements as it's running all our `INSTEAD OF DELETE` triggers. So, > when we run `DELETE FROM structureToMerge`, anything in `moz_bookmarks` > that's not in `mergeStates` must have been added by an interleaved > `PlacesUtils.bookmarks.insert`, and anything in `mergeStates` that's not in > `moz_bookmarks` must have been removed by an interleaved > `PlacesUtils.bookmarks.remove`. I think I'm wrong here. If we use the main connection (https://searchfox.org/mozilla-central/rev/7ed2a637dc305e90fed6ffb041c6b8fa162b66bd/toolkit/components/places/PlacesUtils.jsm#1915-1922), we won't interleave `insert` and `remove` calls with our merge transaction, since Sqlite.jsm maintains a transaction queue. We only have to worry about interleaved writes to `moz_bookmarks` from C++ consumers, and those are going away. So maybe the original reason for a separate connection no longer applies, and everything should Just Work™ now if we use the main Places connection.

Comment 6

•

7 years ago

Yes, we can remove all the moz_bookmark writes from cpp consumers soon, the tricky parts are tags and the Library left pane. Likely both are fixable in the 60-61 cycles, depending on the other priorities. (In reply to Kit Cambridge (they/them) [:kitcambridge] from comment #3) > When that happens, we enter this loop, and never get out of it: > https://searchfox.org/mozilla-central/rev/ > 7ed2a637dc305e90fed6ffb041c6b8fa162b66bd/storage/ > mozStorageAsyncStatementExecution.cpp#231,250-258 `stepStatement` returns > `SQLITE_BUSY` indefinitely, and the async thread on the Places connection > locks up trying to retry that statement, forever. This is strange, there must be something else ongoing, since after the BUSY reason disappears, the loop should unblock. There are various possible reasons for an infinite busy status: 1. a transaction is left open (commit failed and we didn't handle that with a rollback) 2. the operation on your connection takes longer than expected, it looks infinite but in reality it's "very long time" 3. we somehow share a pre-compiled statement across different threads Maybe a MOZ log could help analyzing the queries traffic and finding the culprit. We should find the reason for the infinite SQLITE_BUSY, since that looks scary also for the rest of the code.

Assignee

Comment 8

•

7 years ago

I'm a bit out of my depth looking at the hang. * Running with `MOZ_LOG=timestamp,mozStorage:5`, I don't see any places where we have a `BEGIN` without a matching `COMMIT`...but we execute a lot of statements. :-) * I added a retry counter to `AsyncExecuteStatements::executeStatement` that calls `::sqlite3_interrupt(mNativeConnection)` if we run the busy loop 5000 times. (5000 is somewhat arbitrary. I also tried interrupting immediately after seeing `SQLITE_BUSY`, and noticed we'd occasionally interrupt the merge transaction). I also added `sqlite3_expanded_sql` logging to `executeStatement`, so we can see what we're trying to run, and `Connection::stepStatement`, so we can see (in theory) what we interrupted. * Interrupting "works": it unsticks the async thread, and statements continue to run. However, I did notice some cases where we'd interrupt a statement in a `PlacesUtils.bookmarks.update` transaction, causing SQLite to roll the entire transaction back. That's not desired or acceptable. It's OK to interrupt Sync, since we can just try to merge again later, but interrupting statements on the main Places connection means we'll lose data. Here's an excerpt from the logging I added. > executeAndProcessStatement: Ran > INSERT INTO mergeStates(localGuid, mergedGuid, parentGuid, level, > position, valueState, structureState) > VALUES(IFNULL('sJKPawNpF3MW', 'sJKPawNpF3MW'), 'sJKPawNpF3MW', 'menu________', > 1, 394, 1, 1) > executeAndProcessStatement: Ran > INSERT INTO mergeStates(localGuid, mergedGuid, parentGuid, level, > position, valueState, structureState) > VALUES(IFNULL('Eq7YgDzXcFMT', 'Eq7YgDzXcFMT'), 'Eq7YgDzXcFMT', 'menu________', > 1, 395, 1, 1) > executeStatement (0x0x192c9bda0): Giving up after 5000 attempts and interrupted so we could run > UPDATE moz_bookmarks SET > syncChangeCounter = syncChangeCounter + 1 > WHERE guid = 'menu________' > stepStatement (0x0x192c9bda0): Interrupted > UPDATE moz_bookmarks SET > syncChangeCounter = syncChangeCounter + 1 > WHERE guid = 'menu________' so someone else could run > executeAndProcessStatement: Ran > UPDATE moz_bookmarks SET > syncChangeCounter = syncChangeCounter + 1 > WHERE guid = 'menu________' > executeAndProcessStatement: Ran SELECT b.guid, IFNULL(p.guid, "") AS parentGuid, b.position AS 'index', > b.dateAdded, b.lastModified, b.type, IFNULL(b.title, "") AS title, > h.url AS url, b.id AS _id, b.parent AS _parentId, > (SELECT count(*) FROM moz_bookmarks WHERE parent = b.id) AS _childCount, > p.parent AS _grandParentId, b.syncStatus AS _syncStatus > FROM moz_bookmarks b > LEFT JOIN moz_bookmarks p ON p.id = b.parent > LEFT JOIN moz_places h ON h.id = b.fk > WHERE b.guid = 'nWXmEoZ8DKGk' > executeAndProcessStatement: Ran > INSERT INTO mergeStates(localGuid, mergedGuid, parentGuid, level, > position, valueState, structureState) > VALUES(IFNULL('FYScaCI-yqne', 'FYScaCI-yqne'), 'FYScaCI-yqne', 'menu________', > 1, 396, 1, 1) > executeAndProcessStatement: Ran SELECT b.guid, IFNULL(p.guid, "") AS parentGuid, b.position AS 'index', > b.dateAdded, b.lastModified, b.type, IFNULL(b.title, "") AS title, > h.url AS url, b.id AS _id, b.parent AS _parentId, > (SELECT count(*) FROM moz_bookmarks WHERE parent = b.id) AS _childCount, > p.parent AS _grandParentId, b.syncStatus AS _syncStatus > FROM moz_bookmarks b > LEFT JOIN moz_bookmarks p ON p.id = b.parent > LEFT JOIN moz_places h ON h.id = b.fk > WHERE b.guid = 'nWXmEoZ8DKGk' > executeAndProcessStatement: Ran > INSERT INTO mergeStates(localGuid, mergedGuid, parentGuid, level, > position, valueState, structureState) > VALUES(IFNULL('mXrlYc55ohXf', 'mXrlYc55ohXf'), 'mXrlYc55ohXf', 'menu________', > 1, 397, 1, 1) So, we're building our `mergeStates` table, then try to bump the change counter on the menu after dragging a bookmark, that fails, so we interrupt...the same exact statement? How does that work? (Is that because we're waiting on all currently executing statements to finish, and we started to execute the `UPDATE` before `sqlite3_interrupt` was done?)

Comment 9

•

7 years ago

sqlite3_interrupt() interrupts anything on any thread and could be enqueued, it's not safe to call it unless you're on a read-only connection. it's likely the update had started by the time it ran. The problem is that from that log you see what is being blocked, but you don't see what is blocking. We should check the full MOZ_LOG and identify if there's anything that is not finalized or closed properly. You could try adding an error callback https://www.sqlite.org/errlog.html maybe it could print out something useful. (In reply to Kit Cambridge (they/them) [:kitcambridge] from comment #8) > So, we're building our `mergeStates` table, then try to bump the change > counter on the menu after dragging a bookmark INSERT INTO mergeStates is in a transaction, the changeRecords ones. So it's normal that anything else returns SQLITE_BUSY until that transaction is committed/rollbacked. When/where does that happen? could be that one of the awaits in that transaction never resolves?

Flags: needinfo?(mak77)

Comment 10

•

7 years ago

You could try to reduce TRANSACTIONS_QUEUE_TIMEOUT_MS in Sqlite.jsm to 30 seconds or such, and see if anything changes

Comment hidden (mozreview-request)

Assignee

Comment 14

•

7 years ago

This one is a doozy. I've attached patches with two different approaches that we can take. We might pick one or the other, or both, or neither. :-) The first patch changes the statement retry logic in mozStorage to rollback if we fail with `SQLITE_BUSY` in a transaction. The SQLite docs for `sqlite3_step` (https://sqlite.org/c3ref/step.html) suggest that only COMMITs and statements outside a transaction are safe to retry the way we do now. If any other statement in a transaction fails with `SQLITE_BUSY`, I think we need to rollback the entire transaction. Conceptually, I guess this makes sense: if another connection changed the database while we were running our transaction, our statements might no longer be coherent, and we need to start again. I suspect this also explains why calling `sqlite3_interrupt(mNativeConnection)` worked, since the docs for `sqlite3_interrupt` (https://sqlite.org/c3ref/interrupt.html) say: "If the interrupted SQL operation is an INSERT, UPDATE, or DELETE that is inside an explicit transaction, then the entire transaction will be rolled back automatically". However, this causes data loss, since none of our consumers retry transactions. If you make a change that writes to bookmarks or history just as Sync is merging, and the write fails with `SQLITE_BUSY`, we'll fail with an error and lose the write. The user can try again, but they might not have feedback that anything went wrong. Mak suggested using IMMEDIATE instead of DEFERRED transactions to work around the data loss issue. That still infinite loops with this patch; we get stuck retrying `BEGIN IMMEDIATE TRANSACTION`! I'm confused about why we never get out of the busy state, and why the infinite loop continues even when I close the mirror connection. (FWIW, the mirror closes cleanly at https://searchfox.org/mozilla-central/rev/9d47b5fb14152865ad1375eb3ee2e571d81ecdb9/storage/mozStorageConnection.cpp#1100; it doesn't have to wait for any outstanding statements). We closed the second writer; who's holding the lock? Returning an error on busy has the same data loss problem. The second patch changes the mirror to use the main Places connection, instead of a clone. This sidesteps the data loss problem thanks to Sqlite.jsm serializing our transactions. Tags and annos are the last remaining synchronous API consumers, and both should be OK. Tags aren't part of the merge tree; we use cascading `INSTEAD OF` triggers to update them, so the statements in the tagging service shouldn't interfere with our work, even with the `mozStorage` helper plowing ahead. There are a lot of "should"s in that paragraph, and we'll want to test this thoroughly. There's also the problem that, if the merge fails, we'll also rollback tag changes. However, tags and annos are less commonly used (citation needed), so we may decide all this is fine.

Assignee

Comment 15

•

7 years ago

I'm trying to whittle this down to a reproducible test case in https://github.com/kitcambridge/bug-1435446-repro, but that recovers from `SQLITE_BUSY` correctly, so something else is going on. I think my `ROLLBACK` patch working here is a red herring. Andrew, CCing you just in case you have insights. :-)

Assignee

Comment 16

•

7 years ago

Attached patch 0001-Bug-1435446-Trying-to-reproduce-the-hang-in-xpcshell.patch (obsolete) — Details — Splinter Review

Trying to reproduce the hang in xpcshell, no dice yet.

Assignee

Comment 17

•

7 years ago

Good news: I found a way to avoid the busy loop without data loss. \o/ If I: 1) Change `Bookmarks.jsm` to use `BEGIN IMMEDIATE` for all transactions, *and*... 2) Call `::sqlite3_reset(aStatement)` after `PR_Sleep(PR_INTERVAL_NO_WAIT)`... We retry the `BEGIN IMMEDIATE`, succeed, and run the transaction. (FWIW, https://searchfox.org/mozilla-central/rev/88e89261ea81860f44a44ed068cf5839aea5def6/storage/mozStorageConnection.cpp#1186,1200 also resets after `SQLITE_LOCKED_SHAREDCACHE`, so maybe it makes sense we do the same for `SQLITE_BUSY`?) We don't need to roll anything back, and there's no data loss or errors. The UI lags for about a second to register some of the moves, but they succeed (I'm on a debug build with Sync trace logging enabled, to make the hang easier to reproduce), and the bookmarks end up in the correct location. About Sync shows no validation problems, and the Places integrity check passes. Bad news: I'm not sure, but I suspect, that we're going to need to change all of Places to use `BEGIN IMMEDIATE`, not `BEGIN DEFERRED`. You might visit a page as we're syncing, for instance, and a statement in the transaction to insert the visit would fail with `SQLITE_BUSY`. Do you think this is realistic or doable, Mak? If we do switch everything to IMMEDIATE, do you expect perf regressions, to the point where this wouldn't be worth it?

Flags: needinfo?(mak77)

Comment hidden (mozreview-request)

Assignee

Updated

•

7 years ago

Attachment #8951112 - Attachment is obsolete: true

Comment 22

•

7 years ago

(In reply to Kit Cambridge (they/them) [:kitcambridge] from comment #17) > We retry the `BEGIN IMMEDIATE`, succeed, and run the transaction. (FWIW, > https://searchfox.org/mozilla-central/rev/ > 88e89261ea81860f44a44ed068cf5839aea5def6/storage/mozStorageConnection. > cpp#1186,1200 also resets after `SQLITE_LOCKED_SHAREDCACHE`, so maybe it > makes sense we do the same for `SQLITE_BUSY`?) The documentation says SQLITE_LOCKED_SHAREDCACHE works similarly to SQLITE_BUSY, but for the shared cache case, thus it's likely we should do the same. > Bad news: I'm not sure, but I suspect, that we're going to need to change > all of Places to use `BEGIN IMMEDIATE`, not `BEGIN DEFERRED`. You might > visit a page as we're syncing, for instance, and a statement in the > transaction to insert the visit would fail with `SQLITE_BUSY`. That's what I was saying when I suggested trying IMMEDIATE, that we'd have to replace ALL the transactions on the Places connection to be IMMEDIATE, and I wasn't sure how easily we could do that. I don't have problems from a functional point of view, the main reason to use deferred was to keep the db locked for the shorter time possible. Part of that reasoning has been superseded by using WAL (readers are not blocked), part by moving to async (no jank), and part by trying in general to use as small transactions as possible in code (good coding habits). The main concern left, is how to do that in a way that has 100% coverage and is hard to regress, because we don't want tomorrow to discover we forgot to change a transaction, or we introduced a new one that is not immediate. Likely Sqlite.jsm, since it runs everything async, could default to IMMEDIATE for all consumers. The remaining calls in Places use mozStorageTransaction, and are not many https://searchfox.org/mozilla-central/search?q=mozStorageTransaction&path=places. We could add a defaultTransactionType attribute to mozIStorageConnection (defaults to deferred) and have mozStorageTransaction use the set value if its argument is TRANSACTION_DEFAULT (-1). At the same time I'd remove BeginTransactionAs, that is unused.

Flags: needinfo?(mak77)

Comment hidden (mozreview-request)

Comment 28

•

7 years ago

mozreview-review

Comment on attachment 8950808 [details] Bug 1435446 - Add a default transaction type for storage connections. https://reviewboard.mozilla.org/r/220048/#review230386 Cloning a connection should probably set the same transaction type on the cloned connection. ::: storage/mozStorageAsyncStatementExecution.cpp:525 (Diff revision 4) > } > if (mState == CANCELED) > return notifyComplete(); > > if (statementsNeedTransaction() && mConnection->getAutocommit()) { > - if (NS_SUCCEEDED(mConnection->beginTransactionInternal(mNativeConnection, > + if (NS_SUCCEEDED(mConnection->beginTransactionInternal(mNativeConnection))) { We probably want to retain IMMEDIATE here, regardless, because we are on the async thread and we are artificially adding a transaction, we want to know if we success/fail immediately, rather than at the next statement... ::: storage/mozStorageHelper.h:78 (Diff revision 4) > { > if (mConnection) { > nsAutoCString query("BEGIN"); > - switch(aType) { > + int32_t type = aType; > + if (type == mozIStorageConnection::TRANSACTION_DEFAULT) { > + MOZ_ALWAYS_SUCCEEDS(mConnection->GetDefaultTransactionType(&type)); IIRC mozStorageTransaction is thread-safe, probably mDefaultTransactionType should be atomic.

Attachment #8950808 - Flags: review?(mak77)

Assignee

Comment 29

•

7 years ago

mozreview-review-reply

Comment on attachment 8950808 [details] Bug 1435446 - Add a default transaction type for storage connections. https://reviewboard.mozilla.org/r/220048/#review230386 I did add a call to `aClone->SetDefaultTransactionType(mDefaultTransactionType)`; will add a test, too. > IIRC mozStorageTransaction is thread-safe, probably mDefaultTransactionType should be atomic. Good catch, since it's a read-write attribute, it might be changed at any time.

Comment 30

•

7 years ago

mozreview-review

Comment on attachment 8950808 [details] Bug 1435446 - Add a default transaction type for storage connections. https://reviewboard.mozilla.org/r/220048/#review230392 ::: toolkit/modules/Sqlite.jsm:1377 (Diff revision 4) > * @param func > * (function) What to perform as part of the transaction. > * @param type optional > * One of the TRANSACTION_* constants attached to this type. > */ > - executeTransaction(func, type = this.TRANSACTION_DEFERRED) { > + executeTransaction(func, type = this.TRANSACTION_IMMEDIATE) { Sqlite.jsm connection opening should probably grow an option to define the default transaction type.

Comment hidden (mozreview-request)

https://treeherder.mozilla.org/#/jobs?repo=try&revision=6cff5d10d20baa280faf44c5dc13555e89d30fe1

Assignee

Comment 34

•

7 years ago

Assignee

Updated

•

7 years ago

Status: NEW → ASSIGNED

Component: Sync → Storage

Product: Firefox → Toolkit

Comment 36

•

7 years ago

mozreview-review

Comment on attachment 8950808 [details] Bug 1435446 - Add a default transaction type for storage connections. https://reviewboard.mozilla.org/r/220048/#review231562 ::: toolkit/modules/Sqlite.jsm:229 (Diff revision 5) > this._statementCounter = 0; > > // Increments whenever we request a unique operation id. > this._operationsCounter = 0; > > + this.defaultTransactionType = options.defaultTransactionType; there's something unclear here, options is set as an optional object defaulting to {}. thus this can potentially be undefined. Later we do if (type == OpenedConnection.prototype.TRANSACTION_DEFAULT) { type = this.defaultTransactionType; Thus it looks like type can be undefined. It's likely you pretty much made options not optional anymore, but then the signature here should change and we should throw if options is undefined. ::: toolkit/modules/Sqlite.jsm:935 (Diff revision 5) > + } > + > + openedOptions.defaultTransactionType = defaultTransactionType; > + } else { > + openedOptions.defaultTransactionType = > + OpenedConnection.prototype.TRANSACTION_DEFERRED; We should probably get the default from the underlying connection after we open it, so we can set the default in just one place

Attachment #8950808 - Flags: review?(mak77)

Comment 37

•

7 years ago

mozreview-review

Comment on attachment 8950810 [details] Bug 1435446 - Reset executing statements on `SQLITE_BUSY` before retrying. https://reviewboard.mozilla.org/r/220054/#review231596 ::: storage/test/unit/test_retry_on_busy.js:1 (Diff revision 5) > +/* eslint-disable mozilla/no-arbitrary-setTimeout */ please add the PD license boilerplate

Attachment #8950810 - Flags: review?(mak77) → review+

Comment 38

•

7 years ago

mozreview-review

Comment on attachment 8953714 [details] Bug 1435446 - Use immediate transactions by default in Places. https://reviewboard.mozilla.org/r/222954/#review231600

Attachment #8953714 - Flags: review?(mak77) → review+

Comment hidden (mozreview-request)

Narcis Beleuzu [:NarcisB]

Comment 42

•

7 years ago

mozreview-review

Comment on attachment 8950808 [details] Bug 1435446 - Add a default transaction type for storage connections. https://reviewboard.mozilla.org/r/220048/#review231982

Attachment #8950808 - Flags: review?(mak77) → review+

Comment hidden (mozreview-request)

Pulsebot

Comment 46

•

7 years ago

Pushed by kcambridge@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/f8a3b9547dbc Add a default transaction type for storage connections. r=mak https://hg.mozilla.org/integration/autoland/rev/400d6348abca Reset executing statements on `SQLITE_BUSY` before retrying. r=mak https://hg.mozilla.org/integration/autoland/rev/381fa7a9955c Use immediate transactions by default in Places. r=mak

Comment 47

•

7 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/f8a3b9547dbc https://hg.mozilla.org/mozilla-central/rev/400d6348abca https://hg.mozilla.org/mozilla-central/rev/381fa7a9955c

Status: ASSIGNED → RESOLVED

Closed: 7 years ago

status-firefox61: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla61

Ryan VanderMeulen [:RyanVM]

Comment 48

•

7 years ago

Is this something you'd want to consider backporting to 60?

status-firefox59: --- → unaffected

status-firefox-esr52: --- → unaffected

Flags: needinfo?(kit)

Comment 49

•

7 years ago

I don't think so.

Assignee

Comment 50

•

7 years ago

(In reply to Ryan VanderMeulen [:RyanVM] from comment #48) > Is this something you'd want to consider backporting to 60? Nope, as Mak says, this should ride the trains. It's not a regression, and, in practice, only affects an experimental feature that's pref'd off in 60.

status-firefox60: affected → wontfix

Flags: needinfo?(kit)

Kanchan Kumari QA

Reporter

Updated

•

7 years ago

Status: RESOLVED → VERIFIED

Comment 51

•

5 years ago

•

Edited

@Lina, on bug 1567018 we are hitting another possible SQLITE_BUSY infinite loop. On bug 1567018 comment 26, the user reported that removing the favicons.sqlite file cleared the problem. Can you say if the fix landed here would apply to the favicons.sqlite access as well? I'm wondering if a similar fix is also needed for favicons or if bug 1567018 is just a different problem.

Flags: needinfo?(lina)

Assignee

Comment 52

•

5 years ago

Ooof, good question—I don't know! Even after this fix landed, we were still seeing stepStatement return SQLITE_BUSY forever (just a lot less frequently) when using a second read-write connection. I never could come up with a standalone test case, but it was easy to repro just by opening the Library and quickly dragging bookmarks around with Sync turned on. Eventually, the main connection would wedge forever trying to run a statement, and the second Sync-only connection would keep working...but the awesomebar and library would be broken, and closing Firefox would trigger a shutdown hang, just like bug 1567018 describes. In the end, we fixed this particular problem by removing the second connection entirely (bug 1567238, comment 4), and having the main Places connection do all the writing.

But favicons.sqlite is an attached database (even though it's on the same, single read-write connections), and I'm not sure how that factors into things. It might be a different problem, or it might be the same one, with some combination of attached databases, multiple connections—even though we only have one writer, we still have multiple readers—transactions, and locks not getting released when they should.

Mak, do you have any more insights?

Flags: needinfo?(lina) → needinfo?(mak)

Comment 53

•

5 years ago

(In reply to :Lina Cambridge from comment #52)

But favicons.sqlite is an attached database (even though it's on the same, single read-write connections), and I'm not sure how that factors into things. It might be a different problem, or it might be the same one, with some combination of attached databases, multiple connections—even though we only have one writer, we still have multiple readers—transactions, and locks not getting released when they should.

Unfortunately no, I'd really like to know what's up. Ideally an attached database shouldn't change much since we still have just a single connection writing. And using WAL should ideally ensure readers never get a BUSY state. I've seen other reports on the Web about SQLITE_BUSY with attached databases, so I can't exclude there are unknowns here.
One interesting question would be to figure out whether that favicons.sqlite was corrupt or sane... But since it has been removed in the meanwhile, we may never know. It could also be a case where a query is taking a really huge amount of time (maybe due to some specific characteristic of that db), now the perf profiler should report the running SQL query in a marker, or at least a part of it (bug 1635570), so we should always ask for profiles when possible.
Of course, having some way to reproduce these behavior would go a long way to help us understanding the problem.

Flags: needinfo?(mak)

Comment 54

•

5 years ago

Thanks for looking into it. A couple of questions which might be better moved to bug 1567018.

I wonder if we could add some diagnostic asserts to Nightly to trigger a crash if we ever get SQLITE_BUSY repeatedly assuming that would help debug this further. And should we land a fix to make it impossible to infinite loop here?

I noticed that in AsyncExecuteStatements::Run(), we check mCancelRequested. If mCancelRequested could be set between the time the runnable is dispatched and when it is executed, presumably it could be set after we have checked in ::Run. Should we be checking this while in the SQLITE_BUSY loop or perhaps checking for any other state indicating a shutdown is happening?

Comment 55

•

5 years ago

•

Edited

(In reply to Haik Aftandilian [:haik] from comment #54)

I wonder if we could add some diagnostic asserts to Nightly to trigger a crash if we ever get SQLITE_BUSY repeatedly assuming that would help debug this further. And should we land a fix to make it impossible to infinite loop here?

Note getting SQLITE_BUSY in certain cases is absolutely normal, because we have a few connections using multiple threads, we use PRAGMA busy_timeout to cope with that in Places, we should likely do the same in any component that uses both the sync and async API on the same connection.
By code inspection, that's likely just Places, Cookies and tests. Afaict, Places is the only one that has to do real mixed usage, Cookies only do opening/closing on MT, so maybe it's not strictly necessary. Effectively it looks like fixing Cookies and tests, would leave Places as the only mixed use consumer that is a bit more complex to attack.

I noticed that in AsyncExecuteStatements::Run(), we check mCancelRequested. If mCancelRequested could be set between the time the runnable is dispatched and when it is executed, presumably it could be set after we have checked in ::Run. Should we be checking this while in the SQLITE_BUSY loop or perhaps checking for any other state indicating a shutdown is happening?

Potentially yes, my only doubt is that accessing it requires a lock on mMutex, that may still be ok considered we don't set it that often.
I'm not completely sure why it's not just an atomic, but there's a long discussion in bug 506805, that may be outdated considered it was about PR_AtomicSet and very old Gcc. That discussion ended up replacing the old PR_ATOMIC with the mutex (https://hg.mozilla.org/mozilla-central/rev/097171b5d15f).
I wonder if today we could use std::atomic<bool>, Andrew was part of that original discussion and maybe has an insight.
Otherwise, I think we can just use the mutex.

Comment 56

•

5 years ago

I'm sorry, forgot the ni? regarding comment 55

Flags: needinfo?(bugmail)

Comment 57

•

5 years ago

(In reply to Marco Bonardo [:mak] from comment #55)

I noticed that in AsyncExecuteStatements::Run(), we check mCancelRequested. If mCancelRequested could be set between the time the runnable is dispatched and when it is executed, presumably it could be set after we have checked in ::Run. Should we be checking this while in the SQLITE_BUSY loop or perhaps checking for any other state indicating a shutdown is happening?

Potentially yes, my only doubt is that accessing it requires a lock on mMutex, that may still be ok considered we don't set it that often.

Acquiring the lock should be OK. It's only held for a few instructions each time it is used and it appears we never hold both mMutex and mDBMutex meaning we don't have to worry about deadlock if we follow that convention.