Closed Bug 1579620 Opened 5 years ago Closed 3 years ago

Crash in [@ AsyncShutdownTimeout | Places Clients shutdown | SyncedBookmarksMirror: finalize]

Categories

(Firefox :: Sync, defect, P3)

70 Branch
defect

Tracking

()

RESOLVED WONTFIX
Tracking Status
firefox-esr60 --- unaffected
firefox-esr68 --- unaffected
firefox69 --- unaffected
firefox70 --- wontfix
firefox71 --- wontfix

People

(Reporter: philipp, Unassigned)

References

Details

(Keywords: crash, regression)

Crash Data

This bug is for crash report bp-ca2e3b49-5280-4cdc-8ca4-7ad5e0190907.

crash reports with this signature are starting to show up cross-platform in the firefox 70 cycle. the first nightly this happened from was 70.a1 build 20190807095705.

the async shutdown timeout field generally looks similar to this in the reports:

{

    "phase":"Places Clients shutdown",
    "conditions":[
        {
            "name":"SyncedBookmarksMirror: finalize",
            "state":{
                "steps":[
                    {
                        "step":"finalize",
                        "at":1567842792535
                    }
                ]
            },
            "filename":"resource://gre/modules/SyncedBookmarksMirror.jsm",
            "lineNumber":280,
            "stack":[
                "resource://gre/modules/SyncedBookmarksMirror.jsm:SyncedBookmarksMirror:280",
                "resource://gre/modules/SyncedBookmarksMirror.jsm:open:357"
            ]
        }
    ]

}

Thanks for filing this! There are a few things that can cause a shutdown hang:

  1. Failing to run PRAGMA optimize or detach from the mirror database.
  2. A large merge. See bp-517875df-31b7-4108-8d1f-b64200190907 ({"step":"fetchRemoteTree","at":1567826800504,"took":25221,"counts":[{"name":"items","count":468552}]}), bp-87e5eeb4-915b-442d-b9f5-55a5b0190906 ({"step":"fetchRemoteTree","at":1567794381310,"took":5659,"counts":[{"name":"items","count":235805}]}), and bp-8947f7e6-6280-4c1f-93e1-1315a0190905 ({"step":"fetchRemoteTree","at":1567678575700,"took":19300,"counts":[{"name":"items","count":1016274}]}). These are hanging during application because there are too many bookmarks—468k, 235k, and 1m+ bookmarks, respectively—well above our soft limit.
  3. A large number of tombstones. See bp-ad46895e-eb33-47ba-89b2-5d0480190905 ({"step":"merge","at":1567665986347,"took":970,"counts":[{"name":"items","count":2522},{"name":"deletes","count":1641866}]). The tree is pretty small, but that's a lot of tombstones to insert into a temp table—and then drop on the floor, since we won't store them.
  4. Anything that's blocking the Places async thread, like Sanitizer.jsm or a Places operation. These will also show up in the shutdown hang report, though I can't find an example now. Since we're in the queue behind it, there isn't much we can do.
  5. Database corruption. Nothing we can do here; Firefox will recreate the databases on the next startup.

We can fix (1) by optimizing when we apply, instead of waiting 'til shutdown. We can even get away with not calling finalize at all, since closing the main Places connection at shutdown will automatically detach the mirror database and drop all temp tables. However, if the main connection is stuck, we'll still see a shutdown hang, as in bug 1395542, bug 1426941, bug 1258350, and bug 1326309—but implicating Places instead of SyncedBookmarksMirror. Still, it's worth doing.

I don't think we should push on (2) just yet. Those numbers are well above our soft limit for bookmarks. https://phabricator.services.mozilla.com/D44769 should make this a bit better by adding more interrupt points, but, if SQLite is busy writing those changes to disk, there isn't much we can do. We can fix bug 1556459 to allow interrupting the main Places connections, but I think that's riskier than a shutdown hang.

We should fix (3). Inserting tombstones into a TEMP table for items that don't exist locally, and that we won't upload (and that we'll process one by one thanks to our trigger), is pretty wasteful.

(4) and (5) are out of our hands, unfortunately.

Let's keep this bug open for tracking, and fix the specific issues in dependent bugs.

Severity: critical → normal
Priority: -- → P3
Blocks: 1579727
Blocks: 1579728

Elaborating a bit:

We can fix (1) by optimizing when we apply, instead of waiting 'til shutdown.

I considered removing the shutdown blocker entirely, and using PlacesUtils.registerShutdownFunction. In theory, now that we use the main connection, and don't run any async tasks on shutdown, we can do this.

However, that would mean losing progress reports in shutdown hang crash annotations. Long-running merges blocked on a SQLite operation would still hang at shutdown—we can't interrupt read-write connections—except the signature would be something like AsyncShutdownTimeout | profile-change-teardown | Places Clients shutdown, and wouldn't mention SyncedBookmarksMirror at all. There would be no way to distinguish the mirror from other Places consumers.

The progress reports have been super valuable for optimizing the mirror. Bug 1567238, bug 1576810, and bug 1575757 (and bug 1561467, with the initial numbers) would have been much harder without them. With that in mind, I think we should still fix bug 1579727, but keep the blocker around, and accept that we'll still see shutdown hangs mentioning SyncedBookmarksMirror.

Actually, thinking more about this, (1) is wrong. We're hanging trying to run PRAGMA optimize, but that might be because we're blocked on another statement running on the async thread, not necessarily because the PRAGMA itself takes a long time to complete.

Consider:

  1. The merge finishes. Merging and application happens in a single runnable, so we know no one else is using the connection, and we block the async thread until we're done.
  2. We start fetching info to notify observers.
  3. Another caller wants to use the main Places connection, either during normal operation, or at shutdown. This could be the address bar, Bookmarks.jsm, the library view, maintenance, Sanitizer.jsm clearing data at shutdown...
  4. The other caller executes a statement that takes a long time. At this point, the async thread is blocked.
  5. The mirror can't finish notifying observers, because it can't run any more statements until statement (4) finishes.
  6. The user closes the browser. The mirror's shutdown blocker tries to run PRAGMA optimize, but that's blocked on (5), which is blocked on (4).

Basically, after the merge is done, other callers can use the Places connection, interleaving their own reads and writes. This isn't a consistency issue, since we already committed the merge transaction, and we're only reading from in-memory TEMP tables...but it does mean the mirror can be blocked on someone else's statements.

I think this can also happen even if the mirror isn't in the middle of a merge. In that case, it still can't run PRAGMA optimize or DETACH because the async thread is busy. This also means the hangs we see for notifyObservers and fetchLocalChangeRecords (bug 1561467, comment 0) might be misleading, because they're also counting the time they're waiting for other statements to run, not just their own.

However, fixing bug 1579727 by removing the shutdown blocker without waiting for PRAGMA optimize raises its own issues. If we did that, the mirror wouldn't report shutdown hangs at all (same issue as comment 2), since the blocker isn't waiting on pending merges to finish.

We could fix that by adding an AsyncShutdown.Barrier that waits for merges to finish before lifting the blocker, but now we're back to waiting on statements executed by other callers. That's going to be an issue as long as we're sharing the main Places connection with other callers—and using our own connection raises the specter of SQLITE_BUSY again. We've basically added a lot of complexity to get...what we have today.

TL;DR: let's close bug 1579727 as WONTFIX, and keep an eye on SyncedBookmarksMirror shutdown crashes. I suspect we're going to see many false positives, but the progress reports we get from actual hung merges are very useful, and bug 1579727 would regress that. There are already several bugs open for Places-related AsyncShutdown hangs, mentioned in comment 1. It's a bit flippant to say one more won't hurt, but the trade-off doesn't feel worth it.

Anything that's blocking the Places async thread, like Sanitizer.jsm or a Places operation. These will also show up in the shutdown hang report, though I can't find an example now. Since we're in the queue behind it, there isn't much we can do.

See bp-d59e26a3-22bc-4fe1-958a-dff200190925 (bug 1524200) for an example.

Lina, is there something actionable and upliftable to 71 to fix this crash or can I mark it as wontfix for 71? Thanks

Flags: needinfo?(lina)

As it is, I'm not sure this bug is actionable at all, in 71 or later.

We can "fix" the shutdown hang to have a different signature (like bug 1395542 or bug 1326309), but that would paper over the problem, and make it harder to track hung merges, instead of general Places connection hangs. Bug 1556459 might help, but also unclear.

Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → WONTFIX
You need to log in before you can comment on or make changes to this bug.