Bug 1561937 (Open) · Opened 5 years ago · Updated 2 years ago

coalesce thread pools into 1 + epsilon thread pools

Categories: Core :: XPCOM (enhancement)
People: Reporter: froydnj; Assignee: Unassigned
References: Blocks 1 open bug
Whiteboard: [MemShrink:P1]

We have many XPCOM-ish thread pools (I'm ignoring the non-XPCOM ones for the moment because they either require more thought to get rid of--e.g., they expose extra APIs--or we're just stuck with them):

https://searchfox.org/mozilla-central/search?q=nsThreadPool&case=false&regexp=false&path=
https://searchfox.org/mozilla-central/search?q=symbol:_ZN7mozilla16SharedThreadPool3GetERK9nsTStringIcEj&redirect=false
https://searchfox.org/mozilla-central/search?q=symbol:_ZN7mozilla18GetMediaThreadPoolENS_15MediaThreadTypeE&redirect=false

  • OMTP
  • Canvas workers
  • IPC launching
  • Image Encoding
  • WebCrypto
  • DOM FileHandles (?!)
  • SSL cert verification
  • Media platform decoding
  • Media message control (?)
  • WebRTC decoding (?)
  • MediaDecoderStateMachine
  • MediaPlayback
  • DNS
  • IndexedDB QuotaClient (?!)
  • XPCJSRuntime
  • CubebOperation
  • MediaTimer
  • Stream Transport Service (STS)

...and I'm probably forgetting some. These should all really be a single thread pool.

The easiest path forward to consolidation is to change NS_NewThreadPool to hand out references to the same thread pool, similar to what SharedThreadPool does.
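
To make that concrete, here's a minimal sketch (plain C++, not the actual XPCOM signatures; ThreadPool and NewThreadPool stand in for nsThreadPool/NS_NewThreadPool) of what "hand out references to the same pool" could look like: the factory that used to construct a fresh pool per caller instead lazily creates one process-wide pool and returns it to everyone, so existing call sites consolidate without modification.

    #include <memory>
    #include <mutex>

    class ThreadPool { /* task queue + worker threads elided */ };

    std::shared_ptr<ThreadPool> NewThreadPool() {
      // Previously: return std::make_shared<ThreadPool>();
      static std::mutex sLock;
      static std::weak_ptr<ThreadPool> sShared;  // weak, so the pool can be torn down at shutdown

      std::lock_guard<std::mutex> guard(sLock);
      std::shared_ptr<ThreadPool> pool = sShared.lock();
      if (!pool) {
        pool = std::make_shared<ThreadPool>();
        sShared = pool;
      }
      return pool;  // every caller now shares one pool
    }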

...except that we need to take into account a couple of things:

  • The media platform decoding thread pool needs a non-default stack size because it's executing third-party code that requires large amounts of stack space. I guess if we're consolidating down, maybe we can just declare that amount of stack space the minimum?
  • We can't consolidate things like DNS and STS into the One True Thread Pool because tasks on both of them are often blocking operations. (Along with IndexedDB QuotaClient and maybe others?) We need some method of communicating "this task can block" when dispatching tasks into the pool, and I think we'll probably wind up maintaining a separate set of threads in the One True Thread Pool for servicing blocking tasks. Another dispatch flag might be the right answer here, or we should turn the dispatch flags into an object with a richer interface (sketched below, after this list).
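
As a strawman for that richer interface (all names here are hypothetical; nothing below exists in the tree), a small options object would let a task declare whether it can block, what stack it needs, and so on, instead of packing everything into flag bits:

    #include <cstddef>
    #include <cstdint>

    enum class TaskKind : uint8_t {
      CpuBound,  // parallel compute; capped near the core count
      MayBlock,  // blocking IO (DNS, STS, QuotaClient); serviced by a separate set of threads
    };

    struct DispatchOptions {
      TaskKind kind = TaskKind::CpuBound;
      size_t minStackSize = 0;     // 0 = pool default; media decoders would ask for more
      uint32_t affinityGroup = 0;  // 0 = no affinity; see the affinity discussion below
    };

    // Usage: pool->Dispatch(runnable, DispatchOptions{TaskKind::MayBlock});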

DNS also has issues surrounding outstanding DNS requests and shutdown.

So, when we talked about this in Whistler, the general consensus was that we should have a centralized task queue with extra task metadata that allows us to determine how many threads we need and what sorts of threads to dispatch tasks to.

One of the metadata criteria would be whether the task is primarily for CPU parallelism or whether it's primarily for the purposes of blocking IO. We'd want a maximum number of threads for CPU-bound tasks determined relative to the number of CPUs. For blocking IO tasks, the number would probably be higher. And we'd grow and shrink the thread pool depending on the number of each kind of task we have to deal with.
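
A rough sketch of that sizing policy (the numbers are illustrative assumptions, not anything agreed on here): CPU-bound workers capped near the core count, blocking-IO workers allowed a higher ceiling.

    #include <algorithm>
    #include <thread>

    struct PoolLimits {
      unsigned maxCpuThreads;       // ceiling for CPU-bound workers
      unsigned maxBlockingThreads;  // higher ceiling for blocking-IO workers
    };

    PoolLimits ComputeLimits() {
      // hardware_concurrency() may return 0; fall back to one core.
      unsigned cores = std::max(1u, std::thread::hardware_concurrency());
      return PoolLimits{cores, cores * 4 /* illustrative IO multiplier */};
    }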

There's also the question of thread affinity for the sake of memory access efficiency. So if there's a particular queue of related tasks that are meant to operate on a set of data that they own, we'd ideally always try to dispatch them to the same OS thread when possible. That same sort of logic could be applied to tasks that require a large stack size.
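
A minimal sketch of that affinity bookkeeping, assuming tasks carry a group id as in the options sketch above (all names hypothetical): remember which worker last ran a given group and prefer it on the next dispatch.

    #include <cstdint>
    #include <unordered_map>

    class AffinityTable {
     public:
      // Preferred worker for this group, or -1 if the group hasn't run yet.
      int Preferred(uint32_t aGroup) const {
        auto it = mLastWorker.find(aGroup);
        return it == mLastWorker.end() ? -1 : it->second;
      }
      void Record(uint32_t aGroup, int aWorker) { mLastWorker[aGroup] = aWorker; }

     private:
      std::unordered_map<uint32_t, int> mLastWorker;  // group id -> last worker id
    };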

I don't think we'd want to use that large stack size for tasks that don't need it—any more than we have to anyway—and we'd really want to preferentially spin down those threads before other threads when we don't have tasks that need them. But if we happen to have an idle thread with a large stack, and only pending tasks that don't require a large stack, we could still dispatch them to those threads, since they still meet the minimum requirements.
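
That selection rule might look roughly like this (a sketch, assuming the pool tracks idle workers and their stack sizes): pick the smallest idle thread whose stack meets the task's minimum, so large-stack threads are only the fallback and idle out first.

    #include <cstddef>
    #include <optional>
    #include <vector>

    struct IdleWorker {
      size_t stackSize;
      int id;
    };

    std::optional<IdleWorker> PickWorker(const std::vector<IdleWorker>& aIdle,
                                         size_t aRequiredStack) {
      std::optional<IdleWorker> best;
      for (const IdleWorker& w : aIdle) {
        if (w.stackSize < aRequiredStack) continue;             // doesn't meet the task's minimum
        if (!best || w.stackSize < best->stackSize) best = w;   // smallest sufficient stack wins
      }
      return best;  // nullopt => spawn a new worker instead
    }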

(In reply to Kris Maglione [:kmag] from comment #1)

> There's also the question of thread affinity for the sake of memory access efficiency. So if there's a particular queue of related tasks that are meant to operate on a set of data that they own, we'd ideally always try to dispatch them to the same OS thread when possible. That same sort of logic could be applied to tasks that require a large stack size.

This can be future work, IMHO; it is entirely unrelated to trimming down how many threads we have.

> I don't think we'd want to use that large stack size for tasks that don't need it—any more than we have to anyway—and we'd really want to preferentially spin down those threads before other threads when we don't have tasks that need them. But if we happen to have an idle thread with a large stack, and only pending tasks that don't require a large stack, we could still dispatch them to those threads, since they still meet the minimum requirements.

I'm assuming that the media platform decoder bits are CPU-bound, so the number of threads requiring a larger stack size than normal would be a (small) number, and trying to keep track of "how many threads do I have that are able to service large-stack-size-required requests" seems like unnecessary complication. I'd be willing to revisit this after we've a) trimmed the number of thread pools; and b) done whatever metadata changes we need so that we have a place to stick that flag, but for now, I think we can live without it.

(In reply to Nathan Froyd [:froydnj] from comment #2)

> (In reply to Kris Maglione [:kmag] from comment #1)

>> There's also the question of thread affinity for the sake of memory access efficiency. So if there's a particular queue of related tasks that are meant to operate on a set of data that they own, we'd ideally always try to dispatch them to the same OS thread when possible. That same sort of logic could be applied to tasks that require a large stack size.

> This can be future work, IMHO; it is entirely unrelated to trimming down how many threads we have.

Well, it's related to the extent that we want task metadata that decides what sorts of threads tasks can/should ideally run on. I don't think we need it now, but I think we should keep it in mind when we decide what to implement.

> I'm assuming that the media platform decoder bits are CPU-bound, so the number of threads requiring a larger stack size than normal would be a (small) number, and trying to keep track of "how many threads do I have that are able to service large-stack-size-required requests" seems like unnecessary complication. I'd be willing to revisit this after we've a) trimmed the number of thread pools; and b) done whatever metadata changes we need so that we have a place to stick that flag, but for now, I think we can live without it.

I think it's a necessary complication. Not all processes are ever even going to have media decoder threads, so increasing the stack size of all thread pool threads in all processes for their sake may actually wind up causing regressions, though hopefully it won't.

Regardless, I think that's the sort of future we need to plan for, and if we're going to need this sort of task metadata to achieve the kind of memory reductions we need for Fission, I think it makes sense to design that system as the first step, rather than try to unify thread pools first and then redesign them later.

Not an XPCOM thread pool, but something to keep in mind: stylo uses thread-local memory arenas.

In case it helps:
The profiler has AUTO_PROFILER_THREAD_SLEEP, which keeps track of sleeping threads, so we don't need to sample the stack again when it hasn't changed.

The One True Thread Pool could use the same, or a similar technique (e.g., stack sampling, or ideally some magic way to get the info from the OS?), to know if tasks are blocked and therefore new threads are needed.
This should help avoid CPU under-utilization and deadlocks (when all running threads are waiting for other tasks that can't be scheduled anymore).
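
For example (a sketch of the bookkeeping only; these names are made up, and the real AUTO_PROFILER_THREAD_SLEEP is profiler-internal): workers bracket blocking waits with an RAII guard, and the pool spawns a new thread when every live worker is blocked and work is still queued.

    #include <atomic>

    std::atomic<unsigned> gLiveWorkers{0};
    std::atomic<unsigned> gBlockedWorkers{0};

    class AutoWorkerBlocked {  // a worker holds one of these across a blocking wait
     public:
      AutoWorkerBlocked() { gBlockedWorkers.fetch_add(1, std::memory_order_relaxed); }
      ~AutoWorkerBlocked() { gBlockedWorkers.fetch_sub(1, std::memory_order_relaxed); }
    };

    bool ShouldSpawnWorker(unsigned aMaxWorkers, bool aHaveQueuedTasks) {
      unsigned live = gLiveWorkers.load(std::memory_order_relaxed);
      unsigned blocked = gBlockedWorkers.load(std::memory_order_relaxed);
      // If everyone alive is blocked but tasks are queued, a new thread avoids
      // both under-utilization and the all-workers-waiting deadlock.
      return aHaveQueuedTasks && blocked == live && live < aMaxWorkers;
    }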

Whiteboard: [memshrink]
Whiteboard: [memshrink] → [MemShrink:P1]
Severity: normal → S3