Closed Bug 1594521 Opened 5 months ago Closed 5 months ago

Enable remote settings blocklist to ride the train after the beta experiment has finished successfully

Categories

(Toolkit :: Blocklist Implementation, task, P3)

task

Tracking

()

RESOLVED FIXED
mozilla72
Tracking Status
firefox72 --- fixed

People

(Reporter: Gijs, Assigned: Gijs)

References

(Depends on 1 open bug)

Details

Attachments

(3 files)

We'll want to switch 72 nightly over and have it ride the train with that if there are no issues found as a result of the beta experiment.

Priority: -- → P3
Assignee: nobody → gijskruitbosch+bugs
Status: NEW → ASSIGNED
Pushed by gijskruitbosch@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/6e3d285d14ee
enable remote settings blocklist on nightly, r=mconley
Flags: needinfo?(gijskruitbosch+bugs)
Pushed by gijskruitbosch@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/dea221c2e93e
enable remote settings blocklist on nightly, r=mconley
Status: ASSIGNED → RESOLVED
Closed: 5 months ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla72
Status: RESOLVED → REOPENED
Flags: needinfo?(gijskruitbosch+bugs)
Resolution: FIXED → ---

I can reproduce this locally, though only if I run the entire manifest (toolkit/mozapps/extensions/test/xpcshell/xpcshell.ini), which makes it really difficult to get any meaningful debug output for the specific test that's failing. The test is passing, and reaches the end of the test task cleanly, but somehow fails to exit and hangs on shutdown. I expect it's nothing to do with this specific test, but somehow remote settings and/or its indexeddb access does not play nicely with the xpcshell test infrastructure these tests use. Any clues as to where I should start debugging?

We're hoping to ship this with 72, and this is a bit of a nasty surprise and the only blocker to the otherwise straightforward pref flip.

Flags: needinfo?(mathieu)
Flags: needinfo?(kmaglione+bmo)
Flags: needinfo?(gijskruitbosch+bugs)

My best guess is that there's an async shutdown blocker preventing shutdown from finishing. It might help to turn on AsyncShutdown logging. The following message in the log is also suspicious:

[task 2019-11-15T21:33:01.321Z] 21:33:01 INFO - PID 5331 | ###!!! [Parent][DispatchAsyncMessage] Error: PBackgroundIDBFactory::Msg_PBackgroundIDBFactoryRequestConstructor Value error: message was deserialized, but contained an illegal value

and makes me thing indexedDB is doing the blocking (possibly not even via AsyncShutdown, alas...)

Flags: needinfo?(kmaglione+bmo)

It seems like there's some kind of deadlock? I don't know how to figure out more though. :-(

Finding the DOM worker being present in my stacks suspicious. Is it possible that we just need to terminate https://searchfox.org/mozilla-central/rev/b58e44b74ef2b4a44bdfb4140c2565ac852504be/services/settings/RemoteSettingsWorker.jsm#91-93 on xpcom-shutdown (or perhaps even earlier?). It seems to work locally, but not sure if that's just a wallpaper...

(In reply to Kris Maglione [:kmag] from comment #8)

My best guess is that there's an async shutdown blocker preventing shutdown from finishing. It might help to turn on AsyncShutdown logging.

I tried this by just setting the pref in the test that's breaking, and that works when running that individual test that fails, but I get no output when running the entire manifest.

The following message in the log is also suspicious:

[task 2019-11-15T21:33:01.321Z] 21:33:01 INFO - PID 5331 | ###!!! [Parent][DispatchAsyncMessage] Error: PBackgroundIDBFactory::Msg_PBackgroundIDBFactoryRequestConstructor Value error: message was deserialized, but contained an illegal value

and makes me thing indexedDB is doing the blocking (possibly not even via AsyncShutdown, alas...)

Between the error and the main purpose of that worker being IDB access, maybe some of the indexeddb folks have ideas about what's happening here. (context: comment #6 and later; when running some extension tests, things hang on shutdown when using the new (indexeddb-via-remotesettings-based) blocklist backend).

Flags: needinfo?(jvarga)
Flags: needinfo?(bugmail)

I'll take a deeper look later, but re xpcom-shutdown, IndexedDB is no longer usable once "profile-before-change" has concluded. QuotaManager and all clients get torn down during "profile-before-change-qm" which happens after "profile-before-change". It seems very believable that an IndexedDB request might hang if it's issued after QuotaManager begins shutdown.

From the stack trace, XPCOMInit.cpp:632 is https://searchfox.org/mozilla-central/rev/b58e44b74ef2b4a44bdfb4140c2565ac852504be/xpcom/build/XPCOMInit.cpp#632 is "xpcom-shutdown-threads" which is very late and it's indicating PBackground is failing to shutdown, which also seems reasonable to get broken by a late IndexedDB request. (The problem could also be related to bug 1542884?)

One idea I've been kicking around in this space is that consumers of APIs like ChromeWorkers or SQLite.jsm or other uses that really need to be aware of shutdown need to provide a ShutdownBlocker at creation time that indicates what phase they are expected to terminate in. We'd pick a default like "profile-before-change" if nothing is specified. And in that case we'd hard terminate the ChromeWorker and deny creation of any new ChromeWorkers once profile-before-change occurs.

Yeah.

Flags: needinfo?(jvarga)

OK. I've confirmed that when this happens it's because the remote settings worker sticks around (by finding a frame in which I could inspect the WorkerPrivate instance on the DOM Worker thread in xcode, and its mScriptURL is the remote settings worker script).

What I'm not sure about is why. As far as I can tell nothing normally shuts down the worker - it just happens "naturally" at some point (?) when things work. Is that expected? I don't see other worker consumers (like the session store worker) do anything in particular to stop its worker.

If that is expected, then why isn't it shutting down here? As far as I can tell from the stacks in the attachment, all threads are stuck waiting for "something". And yes, in the case of the BackgroundImpl stuff, that's spinning an event loop inside xpcom-shutdown-threads, but it looks like that's supposed to time things out after 10 seconds, which (apparently?) isn't happening here, for reasons I don't understand. Like, is it just because there's JS alive waiting for one of the IDB promises (pretty much everything in https://searchfox.org/mozilla-central/source/services/settings/RemoteSettingsWorker.js seems to be async IDB methods) and it's never resolving? Something else?

What's even less clear to me is how wallpaper-y it would be to make the remote settings module force-terminate() its worker at profile-before-change (and refuse to start it once that's happened). If there's a fundamental issue with indexeddb and how it handles shutdown here, just nuking the worker from orbit, potentially mid-operation, would be Bad. But I don't really understand, nor do I see how I can find out, whether that's the case. I looked at the BackgroundImpl stuff a bit but I don't see how I can log something meaningful from it (like names of actors or something).

Ideas very welcome...

Oh, this may just be an instance of bug 1594572, then...

(In reply to Kris Maglione [:kmag] from comment #16)

Oh, this may just be an instance of bug 1594521, then...

You meant to copy a different bug number, I assume? :-)

Flags: needinfo?(kmaglione+bmo)

(In reply to :Gijs (he/him) from comment #17)

(In reply to Kris Maglione [:kmag] from comment #16)

Oh, this may just be an instance of bug 1594521, then...

You meant to copy a different bug number, I assume? :-)

Bah. Bug 1594572.

Flags: needinfo?(kmaglione+bmo)

Right, so I think there are a few things going on here:

  1. The worker is getting created late in the shutdown process and racing the worker's RuntimeService's compulsory shutdown of all workers. This is a bug in the RuntimeService, bug 1594572.
  2. Something is going wrong in IndexedDB/PBackground when IDB requests are issued against a dying PBackground. I've filed bug 1597312 on this.
  3. Either RemoteSettings or a RemoteSettings consumer is doing things well after shutdown has started, or not using shutdown blockers to avoid things it started before shutdown from bleeding into shutdown.

I'll focus on the 3rd point for the remainder of the comment here. I do want to emphasize that I acknowledge both the first two as important and serious bugs; and as situations that workers and IndexedDB have historically lucked out about because content consumers are destroyed earlier in cleanup. System JS, and in particular JSMs, present an interesting complication because their lifetime is basically forever. And ChromeWorkers as well, by default, at least until the runtime service terminates the worker. And all of this is because shutdown is super confusing and not really documented or owned in Firefox/Gecko.

(In reply to :Gijs (he/him) from comment #15)

What's even less clear to me is how wallpaper-y it would be to make the remote settings module force-terminate() its worker at profile-before-change (and refuse to start it once that's happened).

It's not wallpaper-y for a system service to be aware of system shutdown. When the runtime service forces workers to terminate, they are aborted without regard for what they are doing. Ideally the owner of the worker would use an appropriate shutdown blocker and:

  • Block shutdown if the operation is important.
  • Abort any outstanding transactions if they are not. (While IndexedDB will force-abort transactions at profile-before-change-qm, this is very late in the shutdown process, and allows a potentially huge amount of wasted work to happen which has its own overhead as any pending transactions are rolled back.)

Besides the improved shutdown performance, a major benefit for services that report telemetry need to be able to distinguish between unexpected errors and expected situations like shutdown. For example, Activity Stream's telemetry reports a tremendous number of TRANSACTION_FAILED errors that occur in its use of IndexedDB. Hopefully these are all happening at shutdown when QuotaManager shuts down at profile-before-change-qm and aborts all in-progress transactions, but it's not clear.

It's my hope that we can fix both the workers and IDB bug in this cycle, but I would strongly encourage adding a shutdown blocker, especially if the remote settings API wants to have invariants about its storage being reliable in the face of browser shutdown.

Flags: needinfo?(bugmail)
Flags: needinfo?(mathieu)
See Also: → 1597688
See Also: → 1597693
Pushed by gijskruitbosch@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/579519d2db34
enable remote settings blocklist on nightly, r=mconley
https://hg.mozilla.org/integration/autoland/rev/7cfb4262cf1c
avoid instantiating remote settings worker after shutdown, r=leplatrem,asuth
Blocks: 1597785
Depends on: 1605973
Blocks: 1618188
You need to log in before you can comment on or make changes to this bug.