Closed Bug 1359887 Opened 8 years ago Closed 8 years ago

Potential deadlock when forcing a wal checkpoint on Places startup

Tracking

()

Status:

RESOLVED FIXED

Milestone:

mozilla55

Tracking Flags:

Tracking

Status

firefox55

---

fixed

People

(Reporter: mak, Assigned: mak)

References

Details

(Whiteboard: [fxsearch])

Attachments

(1 file)

Bug 1359887 - Potential deadlock when forcing wal checkpoints on Places startup. 8 years ago Marco Bonardo [:mak] 59 bytes, text/x-review-board-request	past : review+	Details

Marco Bonardo [:mak]

Assignee

Description

•

8 years ago

Some of the tests are hitting SQLITE_BUSY errors, this may be expected especially with our special Sqlite usage where 2 threads write to the same connection. We currently protect from that through the busy_timeout pragma, that makes Sqlite block the current thread, until it can actually run the query. The value is set to 100ms, while we potentially would like this to be large enough, it may also cause jankiness on main-thread. The problem should reduce proceeding with the remaining pieces of bug 519514 (and more in general bug 975979). We could increase the timeout for everyone, and risk to hit some jank cases in the wild, where a very long async operation may block the main-thread. The pro would be that the sync operations will fail less likely. Or we could set the timeout from a pref, and increase the pref value only in tests, that are more likely to run lots of operations in a short timeframe. Not setting a priority, because we don't have good data about how often the intermittents happen yet.

Marco Bonardo [:mak]

Assignee

Updated

•

8 years ago

Component: Bookmarks & History → Places

Product: Firefox → Toolkit

Marco Bonardo [:mak]

Assignee

Updated

•

8 years ago

Blocks: 1359167

Marco Bonardo [:mak]

Assignee

Comment 1

•

8 years ago

I suspect the change is mostly due to the wal_checkpoint call we make after setting up the database schema. It's possible that using FULL checkpoint mode will help, the default (PASSIVE) indeed doesn't invoke busy handlers.

Marco Bonardo [:mak]

Assignee

Comment 2

•

8 years ago

Setting as P1 due to the many failures registered in the last days.

Assignee: nobody → mak77

Priority: -- → P1

Marco Bonardo [:mak]

Assignee

Comment 3

•

8 years ago

FULL checkpoint https://treeherder.mozilla.org/#/jobs?repo=try&revision=9c91dd1009d45e677c3fc8b81d2f1c72fb2c3b57 FULL checkpoint + 500ms busy_timeout https://treeherder.mozilla.org/#/jobs?repo=try&revision=76b8f2dcad7d4264fbfccdc56f1320ebfa9c9d55

Marco Bonardo [:mak]

Assignee

Comment 4

•

8 years ago

full checkpoint doesn't help. a larger busy_timeout seems to help, I pushed further tests: timeout = 200 https://treeherder.mozilla.org/#/jobs?repo=try&revision=4f7a3dc5ee501dff8ecfaf29e9d71d36ebd1596b timeout = 300 https://treeherder.mozilla.org/#/jobs?repo=try&revision=ee295daed04c1cd81d2ccff7611273131cc9caf5

Marco Bonardo [:mak]

Assignee

Comment 5

•

8 years ago

All the tests failed so far, I'll have to start other experiments.

Panos Astithas (he/him) [:past] (please ni?)

Updated

•

8 years ago

Status: NEW → ASSIGNED

Marco Bonardo [:mak]

Assignee

Comment 6

•

8 years ago

https://treeherder.mozilla.org/#/jobs?repo=try&revision=0198a1f2d9bfab0529e56786afe78c33b9fb4132

Marco Bonardo [:mak]

Assignee

Updated

•

8 years ago

Blocks: 1359164

Marco Bonardo [:mak]

Assignee

Comment 7

•

8 years ago

As I suspected, this is due to the initial forced checkpoint. But increasing the busy_timeout doesn't help, that means Sqlite detected a possible deadlock condition between the 2 threads and gave up to avoid deadlocking the whole thing. It's likely one thread is trying to get an exclusive lock, while the other one tries to get a shared lock. This is still one of the defects of our current situation where we keep using a mixture of synchronous and asynchronous APIs (especially not having fixed bug 519514, but potentially any sync API could hit the same). The only potential fix on another thread would involve changing a bunch of code to use BEGIN IMMEDIATE, check for SQLITE_LOCKED or SQLITE_BUSY and retry later, but that's not a fix we can apply to the main-thread. For now I guess I'll just remove the forced checkpoint. Based on recent discussions with the Sqlite team, modern filesystems are less likely to choke on the checkpoint and throw all of the wal journal away, plus the crash I introduced in bug 1355414 should help us recovering from an infinite corruption situation. We should try to proceed with the sync->async refactoring though, since these situations may become more frequent.

No longer blocks: 1359164

Summary: Evaluate increasing busy_timeout → Potential deadlock when forcing a wal checkpoint on Places startup

Marco Bonardo [:mak]

Assignee

Updated

•

8 years ago

Blocks: 1359164

Marco Bonardo [:mak]

Assignee

Updated

•

8 years ago

Blocks: 1360950

Comment hidden (mozreview-request)

Panos Astithas (he/him) [:past] (please ni?)

Comment 9

•

8 years ago

mozreview-review

Comment on attachment 8863658 [details] Bug 1359887 - Potential deadlock when forcing wal checkpoints on Places startup. https://reviewboard.mozilla.org/r/135460/#review138452 You might want to add your last comment from the SQLite team about why checkpointing isn't as important nowadays in the commit message.

Attachment #8863658 - Flags: review?(past) → review+

Comment hidden (mozreview-request)

Pulsebot

Comment 11

•

8 years ago

Pushed by mak77@bonardo.net: https://hg.mozilla.org/integration/autoland/rev/f0b879129876 Potential deadlock when forcing wal checkpoints on Places startup. r=past

Wes Kocher (:KWierso) (Not reading bugmail; email directly if needed)

Comment 12

•

8 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/f0b879129876

Status: ASSIGNED → RESOLVED

Closed: 8 years ago

status-firefox55: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla55

Marco Bonardo [:mak]

Assignee

Updated

•

8 years ago

Blocks: 1358367

Marco Bonardo [:mak]

Assignee

Updated

•

8 years ago

Blocks: 1359168

You need to log in before you can comment on or make changes to this bug.