Open Bug 1558214 Opened 5 years ago Updated 2 years ago

Don't register all profiled threads with nsThreadManager

Categories

(Core :: XPCOM, task)

task

Tracking

()

People

(Reporter: nical, Unassigned)

Details

The profiler currently ensures that all profiled threads have an nsThread wrapper.
I'm told this is because nsThreadManager wants to know about as many threads as possible.

Unfortunately this creates an implicit dependency between the thread's lifetime and ShutdownXPCOM: all profiled threads must be destroyed before the synchronous shutdown of XPCOM happens. That's very hard to enforce in some cases because of complicated dependencies between resources which involve reference counting and other threads.

As a result I need WebRender's threads to not have nsThread wrappers (while still beeing visible in the gecko profiler).

I propose that we wither explicitly aregister non-xpcom threads to nsThreadManager or add a parameter to profiler_register_thread allowing to opt out of the automatic nsThread wrapper.

Flags: needinfo?(kmaglione+bmo)

We need all threads to be registered with the thread manager for the sake of memory reporting. The profiler's involvement is largely incidental. We just happen to already register most threads with the profiler anyway, so it's a convenient place to put a hook.

There shouldn't be any requirement that all threads be destroyed at XPCOM shutdown, though. That's guaranteed not to be the case for the Rayon threadpool used by stylo, so we have code to deal with it.

What exactly is the problem you're seeing?

Flags: needinfo?(kmaglione+bmo)

The problem we are running into is that during xpcom shutdown, nsThreadManager::Shutdown runs which goes through all remaining nsThreads and destroys them. This means that if WebRender is not fully destroyed at that point we get a race condition / UAF from webrender destroying it's own threads after or during xpcom shutdown (bug 1557208).

Enforcing that webrender's threads are all gone before xpcom shutdown is hard because webrender resources are held alive by a graph of objects with reference counting, garbage collection, cycle collection and living in a number of threads. Some of the cycle collected objects won't be destroyed until the cycle collector's own shutdown which happens after nsThreads have become illegal.
Stylo's rayon thread pool is in an easier spot because it is owned by the main thread. It's rather simple to join and destroy the thread pool when you know that xpcom isn't shutting down concurrently and that content is not sending you new work to do while you are trying to shut down. For webrender the main thread has to talk to the compositor thread which has to talk to the render backend thread which has to negotiate shutting down the scene builder and worker threads. If we chain webrender's destruction in a synchronous series of handshakes between all of these threads, closing a widget starts causing jank, so we end up devising ways to make things synchronous only during real shutdown, but now we have to worry about non-synchronous destruction racing with synchronous destruction, and we still have the problem of textures and other resources that depend on webrender, while they are held alive by DOM elements and [...]

Long story short, adding a xpcom shutdown dependency to something on the far end of gfx creates a ton of hard situations to deal with. We had to go through similar things to ship e10s a few years ago and we couldn't really dodge the nsThread implications since all of gfx was running on actual nsThreads.

Now with webrender we have pretty much zero dependency with xpcom-related machinery other than the profiler. It would be massively simpler if webrender didn't get connected to this web of trickiness. so this is where my motivation comes from.

The deeper problem really is that the hard deadline imposed by xpcom shutdown is at odds with the "things will be destroyed in their own time" style of lifetime management we have in gecko (RC/GC/CC) and the need for asynchronous communication between threads for performance. But fixing that would be an enormous project.

Changing component to xpcom, as this bug feels like a higher-level xpcom thread management issue than something strictly related to the profiler (which was just a convenient place to place the thread registration&naming). And hopefully this will make this bug more visible to the right people.
Feel free to move it back to Gecko Profiler if the solution ends up being contained there.

Component: Gecko Profiler → XPCOM
Severity: normal → S3
Severity: normal → S3
You need to log in before you can comment on or make changes to this bug.