Closed Bug 1359887 Opened 4 years ago Closed 4 years ago

Potential deadlock when forcing a wal checkpoint on Places startup


(Toolkit :: Places, enhancement, P1)




Tracking Status
firefox55 --- fixed


(Reporter: mak, Assigned: mak)



(Whiteboard: [fxsearch])


(1 file)

Some of the tests are hitting SQLITE_BUSY errors, this may be expected especially with our special Sqlite usage where 2 threads write to the same connection.
We currently protect from that through the busy_timeout pragma, that makes Sqlite block the current thread, until it can actually run the query. The value is set to 100ms, while we potentially would like this to be large enough, it may also cause jankiness on main-thread.
The problem should reduce proceeding with the remaining pieces of bug 519514 (and more in general bug 975979).

We could increase the timeout for everyone, and risk to hit some jank cases in the wild, where a very long async operation may block the main-thread. The pro would be that the sync operations will fail less likely.

Or we could set the timeout from a pref, and increase the pref value only in tests, that are more likely to run lots of operations in a short timeframe.

Not setting a priority, because we don't have good data about how often the intermittents happen yet.
Component: Bookmarks & History → Places
Product: Firefox → Toolkit
Blocks: 1359167
I suspect the change is mostly due to the wal_checkpoint call we make after setting up the database schema.
It's possible that using FULL checkpoint mode will help, the default (PASSIVE) indeed doesn't invoke busy handlers.
Setting as P1 due to the many failures registered in the last days.
Assignee: nobody → mak77
Priority: -- → P1
All the tests failed so far, I'll have to start other experiments.
Blocks: 1359164
As I suspected, this is due to the initial forced checkpoint.
But increasing the busy_timeout doesn't help, that means Sqlite detected a possible deadlock condition between the 2 threads and gave up to avoid deadlocking the whole thing. It's likely one thread is trying to get an exclusive lock, while the other one tries to get a shared lock. This is still one of the defects of our current situation where we keep using a mixture of synchronous and asynchronous APIs (especially not having fixed bug 519514, but potentially any sync API could hit the same). The only potential fix on another thread would involve changing a bunch of code to use BEGIN IMMEDIATE, check for SQLITE_LOCKED or SQLITE_BUSY and retry later, but that's not a fix we can apply to the main-thread.

For now I guess I'll just remove the forced checkpoint. Based on recent discussions with the Sqlite team, modern filesystems are less likely to choke on the checkpoint and throw all of the wal journal away, plus the crash I introduced in bug 1355414 should help us recovering from an infinite corruption situation.

We should try to proceed with the sync->async refactoring though, since these situations may become more frequent.
No longer blocks: 1359164
Summary: Evaluate increasing busy_timeout → Potential deadlock when forcing a wal checkpoint on Places startup
Blocks: 1359164
Blocks: 1360950
Comment on attachment 8863658 [details]
Bug 1359887 - Potential deadlock when forcing wal checkpoints on Places startup.

You might want to add your last comment from the SQLite team about why checkpointing isn't as important nowadays in the commit message.
Attachment #8863658 - Flags: review?(past) → review+
Pushed by
Potential deadlock when forcing wal checkpoints on Places startup. r=past
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
Blocks: 1358367
Blocks: 1359168
You need to log in before you can comment on or make changes to this bug.