Closed Bug 1637829 Opened 4 years ago Closed 4 years ago

Consider increasing the background thread pool size

Categories: Core :: XPCOM (task)

Status: RESOLVED FIXED
Target Milestone: mozilla80
Tracking Status: firefox80 --- fixed

People: (Reporter: lina, Assigned: KrisWright)

Attachments: (1 file)

Currently, our background thread pools (CPU and I/O) have one thread each. We're about to add some more uses of it from Rust: for extension storage, which uses the pool for disk I/O, and Firefox Accounts, which does blocking network calls. Ryan brought up the concern of FxA hogging the thread, causing other background tasks to get backed up.
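For context, this is roughly how a task lands on those pools from C++ today. A minimal sketch, assuming NS_DispatchBackgroundTask with the DISPATCH_EVENT_MAY_BLOCK flag is what routes work to the I/O pool; the task name and lambda body are made up:

    // Hedged sketch: dispatch a blocking task to the background I/O pool.
    // Without DISPATCH_EVENT_MAY_BLOCK, the runnable goes to the CPU pool
    // instead. The name "fxa::BlockingNetworkCall" is illustrative only.
    #include "nsThreadUtils.h"

    nsresult ScheduleBlockingWork() {
      nsCOMPtr<nsIRunnable> task = NS_NewRunnableFunction(
          "fxa::BlockingNetworkCall", [] {
            // Blocking disk or network I/O happens here, off the main thread.
          });
      return NS_DispatchBackgroundTask(
          task.forget(), nsIEventTarget::DISPATCH_EVENT_MAY_BLOCK);
    }

With only one thread per pool, everything dispatched this way is serialized behind whatever is already running.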

Should we bump the pool size up?

And, in a related question, can we use telemetry to see how long our runnables are taking (or maybe a better measure is how many there are in the queue...or the number of other runnables that a runnable is blocking?), and use that to decide on a good pool size? I found the MOZ_COLLECTING_RUNNABLE_TELEMETRY define, which looks like it's only for Nightly (and Dev Edition?), but the only telemetry histograms I saw are for cycle collection, memory, observers, and thread wake-ups. GetLabeledRunnableName doesn't seem to be used at all...

(In reply to :Lina Cambridge from comment #0)

Should we bump the pool size up?

Bumping the pool size up is easy, so I'm not opposed to adding an extra thread as needed. For network I/O in particular there's also the socket process - I don't know much about it since I don't spend a lot of time in network code, but Necko can route networking I/O onto it. I don't know the details of this, or whether it would apply to Accounts' network calls?

And, in a related question, can we use telemetry to see how long our runnables are taking (or maybe a better measure is how many there are in the queue...or the number of other runnables that a runnable is blocking?), and use that to decide on a good pool size? I found the MOZ_COLLECTING_RUNNABLE_TELEMETRY define, which looks like it's only for Nightly (and Dev Edition?), but the only telemetry histograms I saw are for cycle collection, memory, observers, and thread wake-ups. GetLabeledRunnableName doesn't seem to be used at all...

Recently some additional runnable execution logging was added that may be useful for you. It's not integrated with profiler labels or anything, but it still uses MOZ_LOG and can be parsed to look at event behavior. The logging we have feels like the most concrete way to look at how busy the background pools are right now.
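For anyone else curious, a rough sketch of the MOZ_LOG pattern that kind of logging follows; the "events" module name and the message shape here are assumptions rather than the exact output:

    // Rough sketch of MOZ_LOG-based runnable logging; enable something like
    // MOZ_LOG="events:4" in the environment and parse the dispatch/run pairs.
    // The module name and message format are assumptions, not the real ones.
    #include "mozilla/Logging.h"
    #include "nsThreadUtils.h"

    static mozilla::LazyLogModule sEventLog("events");

    void LogDispatchExample(nsIRunnable* aEvent) {
      MOZ_LOG(sEventLog, mozilla::LogLevel::Debug,
              ("DISPATCH runnable %p", aEvent));
    }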

As for the runnable telemetry defines, the reason MOZ_COLLECTING_RUNNABLE_TELEMETRY fields like GetLabeledRunnableName are Nightly-only is Runnable's use of nsINamed, which bloats up the vtable, so the use of runnable names has sort of fizzled out. There's been talk about getting rid of the nsINamed dependency, which is going to be a bit of legwork. If we want to label runnables outside of Nightly, we probably want to address that first.
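To make the nsINamed point concrete, here's a minimal sketch of what a labeled runnable looks like today; the class and name are made up:

    // Minimal sketch of a named runnable. mozilla::Runnable implements
    // nsINamed, and the name passed to the constructor is what the
    // Nightly-only runnable telemetry and logging report.
    #include "nsThreadUtils.h"

    class StorageFlushTask final : public mozilla::Runnable {
     public:
      StorageFlushTask() : mozilla::Runnable("example::StorageFlushTask") {}

      NS_IMETHOD Run() override {
        // Do the actual work here.
        return NS_OK;
      }
    };

The vtable cost comes from every Runnable carrying that nsINamed implementation whether or not anything ever asks for the name.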

Please note that networking code in Firefox should go, if not through the socket process (although I think it should), at least through the netwerk code that takes care of DoH, proxies, and other fine details of networking. (Also, ideally it shouldn't use blocking APIs.)

Thank you so much for the thorough reply!

For network I/O in particular there's also the socket process - I don't know much about it since I don't spend a lot of time in network code, but Necko can route networking I/O onto it. I don't know the details of this, or whether it would apply to Accounts' network calls?

Kind of ironically, we're probably using that too! 😂 (This might answer Mike's question, too!)

application-services has a networking facade called Viaduct, which we added so that we could use GeckoView's networking stack through Android-Components. In Fenix, Viaduct calls through an FFI into A-C code, with the request headers and body wrapped up in a Protobuf. A-C then calls into GeckoView, which makes the request and returns the response, which A-C packages up and sends back over the FFI to Rust. All that is to work around our Rust components being a "peer dependency" of GV...because a-s needs to make network requests, but doesn't know about GV; A-C is what ties them together, but it's one level up from both.

On Desktop, things are trickier—we're vendoring the Rust component directly, but it also uses Viaduct to make requests! So Ed just landed a backend in bug 1628068 that creates an HTTP channel on the main thread, and calls into the networking code to make the request and get a response. The background thread, meanwhile, is waiting on a condition variable until the request finishes.

So, basically: we're making our (Gecko's) async HTTP API synchronous, since that's what the Rust component expects...but we're then calling into the Rust component from the background thread queue, which makes it async again. Because of the way the dependencies are set up, there are many levels of wrapping and unwrapping involved; I'm wondering if we can cut down on some of those now that Rust has stabilized futures and async/await. But that's a project all its own...
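To make the wrapping a bit more concrete, here's a minimal sketch of the block-on-a-monitor pattern described above. This is not the actual backend from bug 1628068; the types and names are made up:

    // Hedged sketch: a background thread blocks on a Monitor until the async
    // Necko callback (running on the main thread) hands it the response.
    #include "mozilla/Monitor.h"
    #include "nsString.h"

    struct PendingResponse {
      PendingResponse() : mMonitor("PendingResponse"), mDone(false) {}
      mozilla::Monitor mMonitor;
      bool mDone;
      nsCString mBody;
    };

    // Runs on the background pool thread after the HTTP channel has been
    // created and opened on the main thread.
    void WaitForResponse(PendingResponse& aPending) {
      mozilla::MonitorAutoLock lock(aPending.mMonitor);
      while (!aPending.mDone) {
        lock.Wait();  // Parks this pool thread until the listener notifies.
      }
      // aPending.mBody now holds the response for the synchronous caller.
    }

    // Runs on the main thread when Necko finishes the request.
    void OnResponseComplete(PendingResponse& aPending, const nsACString& aBody) {
      mozilla::MonitorAutoLock lock(aPending.mMonitor);
      aPending.mBody = aBody;
      aPending.mDone = true;
      lock.Notify();  // Wakes the waiting background thread.
    }

With a one-thread pool, this is also exactly the shape that can starve other background tasks: the single thread sits in Wait() for the duration of the request.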

The logging we have feels like the most concrete way to look at how busy the background pools are right now.

Oooh, thank you, that's super useful! Looking at those while we're developing locally, or from Nightly users who report slowdowns, would be a great first step!

There's been talk about getting rid of the nsINamed dependency which is going to be a bit of legwork. If we want to label runnables outside of Nightly, we probably want to address this first.

No worries, I don't think there's a mad rush to do this; I was more curious if we could decide on the thread pool size based on telemetry, so we could see how long the runnables take across our users' machines. But it's probably not worth all the legwork just to bump two numbers 😅 There are things we can do in the meantime, too, like LogRunnable and logan that you mentioned, and measuring how long the runnables take in Accounts and extension storage specifically.

(In reply to Mike Hommey [:glandium] from comment #2)

Please note that networking code in Firefox should go, if not through the socket process (although I think it should), at least through the netwerk code that takes care of DoH, proxies, and other fine details of networking. (Also, ideally it shouldn't use blocking APIs.)

I don't want to derail this bug too much, but I wanted to add a few more details about why it's done this way:

  • In Firefox for iOS, application-services uses a Rust-based networking stack (Reqwest), since there's no Necko on iOS. This is synchronous.
  • In Fenix, we use Necko via GeckoView. Because application-services doesn't (can't, really) know about Gecko or GeckoView, it routes requests up through Android-Components, which makes the request through GV and sends the response back. This is async under the hood in GV, but made synchronous by polling the GeckoResult, since the A-C concept-fetch interface is synchronous.
  • So, the way this is designed now, application-services components have to use a synchronous networking API.
  • a-s components also do synchronous disk I/O; its consumers call into the components on background threads. In Firefox for iOS, that's dispatch queues; in Fenix, Kotlin Coroutines; in Desktop, background task queues.
  • On Desktop, we're starting to vendor our Rust components directly, which means they need a way to make network requests.
  • So bug 1628068 added a facade that also uses the Necko async API, and makes it synchronous again by blocking on a monitor until Necko receives the response and calls all the callbacks.
  • So we are using Necko, meaning it'll take care of DoH, proxies, and everything else, just like the rest of Firefox 😊
  • The FxA Rust component needs to make network requests, so it goes through Viaduct. It (I think?) can also do some disk I/O, which is interleaved with the network calls.
  • But, we want to use the Rust component from JavaScript, and we don't want it blocking the main thread...so we call into it via runnables scheduled on the background queue.

It's a layer cake for sure, but it's done this way because our component is shared between three different platforms with different constraints (where the lowest common denominator is a fully synchronous networking stack written in plain Rust).

In Firefox for iOS, application-services uses a Rust-based networking stack (Reqwest), since there's no Necko on iOS. This is synchronous.

FWIW, reqwest can be used with async/await. Not sure when that happened, but that must be more recent than the code for Firefox for iOS.

So we are using Necko,

That's kind of all I needed to know. Thank you for the details, though, that's very informative. Comment #0 only said "blocking network calls", which could have been anything... even calling back into Rust's libstd networking API.

Yep, I think Viaduct predates async/await getting stabilized.

Thanks for double-checking, I really appreciate it! It was helpful to write all this out, too 😊

It was helpful to write all this out, too 😊

Lina, do you think it would be worth us trying to incorporate some or all of the above into a more permanent home, e.g. in the Viaduct readme?

Ryan brought up the concern of FxA hogging the thread, causing other background tasks to get backed up.

Also just to be clear, I don't have any reason to expect that we'll be constantly hammering the FxA servers with network activity or anything like that. I'm mostly thinking through failures in edge cases, such as if several different parts of the browser try to get updated information from the FxA servers at the same time, and a server-side issue spikes the latency on those requests.

(In reply to Ryan Kelly [:rfkelly] from comment #7)

Lina, do you think it would be worth us trying to incorporate some or all of the above into a more permanent home, e.g. in the Viaduct readme?

Yes, that sounds perfect!

Having a thread limit of 1 in the background thread pools causes a deadlock when attempting synchronous dispatch to the same pool. Increasing the thread limit on each pool to 2 lessens the risk of this kind of deadlock; it's not a perfect solution, which would be to disallow any kind of off-thread synchronous dispatch that could cause a deadlock. These thread pools may eventually scale larger, so this patch sets the idle thread limit low.
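For illustration, a minimal sketch of the deadlock shape, using SyncRunnable to stand in for any synchronous dispatch; aBackgroundPool and the task names are hypothetical:

    // Hedged sketch: with a single-threaded pool, a runnable that does a
    // synchronous dispatch back to the same pool can never make progress.
    #include "mozilla/SyncRunnable.h"
    #include "nsThreadUtils.h"

    void OuterTask(nsIEventTarget* aBackgroundPool) {
      // This runs on the pool's only thread.
      nsCOMPtr<nsIRunnable> inner =
          NS_NewRunnableFunction("example::InnerTask", [] {
            // With one thread, nothing is free to run this...
          });
      // ...so this call blocks forever waiting for it: deadlock. With a
      // thread limit of 2, a second thread can pick the inner task up.
      mozilla::SyncRunnable::DispatchToThread(aBackgroundPool, inner);
    }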

Assignee: nobody → kwright
Status: NEW → ASSIGNED
Pushed by kwright@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/8e6c8a5ccaea
Increase the number of background pool threads r=froydnj
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla80