Closed Bug 1359887 Opened 7 years ago Closed 7 years ago
Potential deadlock when forcing a wal checkpoint on Places startup
59 bytes, text/x-review-board-request
Some of the tests are hitting SQLITE_BUSY errors, this may be expected especially with our special Sqlite usage where 2 threads write to the same connection. We currently protect from that through the busy_timeout pragma, that makes Sqlite block the current thread, until it can actually run the query. The value is set to 100ms, while we potentially would like this to be large enough, it may also cause jankiness on main-thread. The problem should reduce proceeding with the remaining pieces of bug 519514 (and more in general bug 975979). We could increase the timeout for everyone, and risk to hit some jank cases in the wild, where a very long async operation may block the main-thread. The pro would be that the sync operations will fail less likely. Or we could set the timeout from a pref, and increase the pref value only in tests, that are more likely to run lots of operations in a short timeframe. Not setting a priority, because we don't have good data about how often the intermittents happen yet.
Component: Bookmarks & History → Places
Product: Firefox → Toolkit
I suspect the change is mostly due to the wal_checkpoint call we make after setting up the database schema. It's possible that using FULL checkpoint mode will help, the default (PASSIVE) indeed doesn't invoke busy handlers.
Setting as P1 due to the many failures registered in the last days.
Assignee: nobody → mak77
Priority: -- → P1
FULL checkpoint https://treeherder.mozilla.org/#/jobs?repo=try&revision=9c91dd1009d45e677c3fc8b81d2f1c72fb2c3b57 FULL checkpoint + 500ms busy_timeout https://treeherder.mozilla.org/#/jobs?repo=try&revision=76b8f2dcad7d4264fbfccdc56f1320ebfa9c9d55
full checkpoint doesn't help. a larger busy_timeout seems to help, I pushed further tests: timeout = 200 https://treeherder.mozilla.org/#/jobs?repo=try&revision=4f7a3dc5ee501dff8ecfaf29e9d71d36ebd1596b timeout = 300 https://treeherder.mozilla.org/#/jobs?repo=try&revision=ee295daed04c1cd81d2ccff7611273131cc9caf5
All the tests failed so far, I'll have to start other experiments.
As I suspected, this is due to the initial forced checkpoint. But increasing the busy_timeout doesn't help, that means Sqlite detected a possible deadlock condition between the 2 threads and gave up to avoid deadlocking the whole thing. It's likely one thread is trying to get an exclusive lock, while the other one tries to get a shared lock. This is still one of the defects of our current situation where we keep using a mixture of synchronous and asynchronous APIs (especially not having fixed bug 519514, but potentially any sync API could hit the same). The only potential fix on another thread would involve changing a bunch of code to use BEGIN IMMEDIATE, check for SQLITE_LOCKED or SQLITE_BUSY and retry later, but that's not a fix we can apply to the main-thread. For now I guess I'll just remove the forced checkpoint. Based on recent discussions with the Sqlite team, modern filesystems are less likely to choke on the checkpoint and throw all of the wal journal away, plus the crash I introduced in bug 1355414 should help us recovering from an infinite corruption situation. We should try to proceed with the sync->async refactoring though, since these situations may become more frequent.
No longer blocks: 1359164
Summary: Evaluate increasing busy_timeout → Potential deadlock when forcing a wal checkpoint on Places startup
Comment on attachment 8863658 [details] Bug 1359887 - Potential deadlock when forcing wal checkpoints on Places startup. https://reviewboard.mozilla.org/r/135460/#review138452 You might want to add your last comment from the SQLite team about why checkpointing isn't as important nowadays in the commit message.
Attachment #8863658 - Flags: review?(past) → review+
Pushed by email@example.com: https://hg.mozilla.org/integration/autoland/rev/f0b879129876 Potential deadlock when forcing wal checkpoints on Places startup. r=past
You need to log in before you can comment on or make changes to this bug.