Open Bug 1535381 Opened 5 years ago Updated 2 years ago

Experiment with setting thread affinity in worker threads

Categories

(Core :: Performance, enhancement, P3)


Tracking


Performance Impact low

People

(Reporter: alexical, Unassigned)

Details

(Keywords: perf:resource-use)

There are a number of cases (PaintWorkers, StyleWorkers, and WRWorkers come to mind) where we could potentially benefit from pinning each of our threads to a particular core. We'd need an API in rayon for the StyleWorkers and WRWorkers, but I think we could experiment fairly easily with PaintWorkers.

Whiteboard: [fxperf] → [qf]

Also, to clarify: the benefit I'm interested in is chiefly improved CPU cache utilization. There are a number of reasons this might not pan out. One is that the default soft affinity implemented by the OS scheduler might already be enough on a per-frame basis - I don't know how often our jobs are shuffled to a different core mid-job (probably not often). Even if that rarely happens, we could still see a benefit if the logical successor to a job - one that reads from the same memory as the corresponding job on the previous frame - gets sent to the same worker. I imagine that's not the case today, but it's something to think about.

cc'ing denispal, who I think might also find this interesting.

Whiteboard: [qf] → [qf:investigate]

(I know Core:Performance isn't an amazing component for this bug, but Core:General is even less amazing.)

Component: General → Performance

jld, aaron, bob - what options do we have to do this on the various platforms? Does the sandbox block this (or some of this) from content processes? GPU?

Note that cpearce pointed out a potential issue if we pin threads and other apps (or others of our own processes) pin threads to the same cores... See a blog post by Raymond Chen of MSFT.

https://devblogs.microsoft.com/oldnewthing/20170309-00/?p=95695
Not sure where the post mentioned by Chris is; here's a list of his posts: https://github.com/mity/mctrl/wiki/Old-New-Win32API

Flags: needinfo?(jld)
Flags: needinfo?(cpearce)
Flags: needinfo?(bobowencode)
Flags: needinfo?(aklotz)

On Windows there are three APIs. From oldest to newest:

  1. Hard affinity (threads are guaranteed to run only on the specified processors): SetThreadAffinityMask, or SetThreadGroupAffinity for systems with > 64 processors. I would advise against using this API. Available on all supported versions of Windows.
  2. Soft affinity (a hint that will be used whenever possible, but if necessary threads will be allowed to run on other processors): SetThreadIdealProcessor, or SetThreadIdealProcessorEx for systems with > 64 processors. Requires the THREAD_SET_INFORMATION access right. Can also be set via UpdateProcThreadAttribute at process/thread creation time. Available on all supported versions of Windows. (See the sketch after this list.)
  3. CPU Sets incorporate a system-wide, holistic view of the load on various CPUs. Requires the THREAD_SET_LIMITED_INFORMATION access right. Requires Windows 10 and is the preferred API for that platform.
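
For concreteness, a minimal C++ sketch of option 2 (the helper name is made up, and the target processor index is assumed to have been chosen already, ideally from the cache topology discussed below):

  #include <windows.h>

  // Hypothetical helper: hint that the calling thread should prefer
  // `processorIndex`. This is only a hint; the scheduler may still run
  // the thread elsewhere. Systems with > 64 logical processors would
  // need SetThreadIdealProcessorEx and a processor group instead.
  bool HintIdealProcessorForCurrentThread(DWORD processorIndex) {
    // Returns the previous ideal processor, or (DWORD)-1 on failure.
    DWORD previous =
        ::SetThreadIdealProcessor(::GetCurrentThread(), processorIndex);
    return previous != static_cast<DWORD>(-1);
  }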

Obviously you would want to query for the cache topology before making any assignments to ensure that cache utilization works the way that you want it to.
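
For illustration, a sketch of that query using GetLogicalProcessorInformationEx with RelationCache (the function and its printf output are illustrative only, not a proposed Gecko API):

  #include <windows.h>

  #include <cstdio>
  #include <vector>

  // Enumerate the caches and print which processors share each one;
  // worker threads that share data could then be hinted onto processors
  // covered by the same cache mask.
  void DumpCacheTopology() {
    DWORD length = 0;
    // First call fails with ERROR_INSUFFICIENT_BUFFER and sets `length`.
    ::GetLogicalProcessorInformationEx(RelationCache, nullptr, &length);
    std::vector<BYTE> buffer(length);
    auto* info = reinterpret_cast<SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX*>(
        buffer.data());
    if (!::GetLogicalProcessorInformationEx(RelationCache, info, &length)) {
      return;
    }
    // Records are variable-sized; advance by each record's Size field.
    for (BYTE* p = buffer.data(); p < buffer.data() + length;) {
      auto* entry =
          reinterpret_cast<SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX*>(p);
      std::printf("L%u cache: group %u, mask 0x%llx\n",
                  static_cast<unsigned>(entry->Cache.Level),
                  static_cast<unsigned>(entry->Cache.GroupMask.Group),
                  static_cast<unsigned long long>(entry->Cache.GroupMask.Mask));
      p += entry->Size;
    }
  }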

Sandboxing would only be an issue if the sandbox broker were to restrict the processor affinity mask of the child process. This is because the OS does not allow a thread to assign itself to a processor that is excluded from the current process's affinity mask. We don't do this in the sandbox, so this is a non-issue.

I should point out, however, that the browser process itself could have a restricted affinity mask (which would then be inherited by all of its children) if somebody started the browser that way. To avoid errors it might be prudent to query the current process's affinity mask and ensure that we only assign our threads to a subset of the available processors.
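
A sketch of that check (hypothetical helper; for simplicity this assumes a single processor group, so a group-aware variant would be needed on systems with > 64 logical processors):

  #include <windows.h>

  // Before assigning a thread to processor `index`, confirm that the
  // processor is actually in our process's affinity mask, since someone
  // may have launched the browser with a restricted mask.
  bool ProcessorAvailableToProcess(DWORD index) {
    DWORD_PTR processMask = 0;
    DWORD_PTR systemMask = 0;
    if (!::GetProcessAffinityMask(::GetCurrentProcess(), &processMask,
                                  &systemMask)) {
      return false;
    }
    return (processMask & (DWORD_PTR(1) << index)) != 0;
  }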

Flags: needinfo?(aklotz)

(In reply to Randell Jesup [:jesup] (needinfo me) from comment #4)

Note that cpearce pointed out a potential issue if we pin threads and other apps (or others of our own processes) pin threads to the same cores... See a blog post by Raymond Chen of MSFT.

https://devblogs.microsoft.com/oldnewthing/20170309-00/?p=95695

Raymond Chen's specific post I was thinking of was https://devblogs.microsoft.com/oldnewthing/20050607-00/?p=35413

Flags: needinfo?(cpearce)

Aaron covered all I know on this and more. :-)

Flags: needinfo?(bobowencode)

On Linux we have sched_setaffinity, which provides hard affinity (threads run only on the specified CPUs, and the call fails if that set would be empty after applying other restrictions) and would have the same problems mentioned in earlier comments. As far as sandboxing goes, if we use it we'd want to restrict it to threads changing their own affinity and not that of other threads, because otherwise it could set the affinity of other processes owned by the same user (see also bug 1413313).

I don't know of a way to give hints for soft affinity. (There is an API for specifying NUMA node affinity on memory areas which allows either hard or soft affinity, but that probably doesn't help here.)
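
To illustrate the hard-affinity call described above (hypothetical helper name; assumes glibc, where g++ defines _GNU_SOURCE by default and so exposes sched_setaffinity and the CPU_* macros):

  #include <sched.h>

  // Hard-pin the calling thread to a single CPU. Passing 0 as the pid
  // means "the calling thread" - the narrow form a seccomp-bpf policy
  // could keep allowing while rejecting calls that name other threads
  // or processes.
  bool PinCurrentThreadToCpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    // Fails (EINVAL) if `cpu` is excluded by other restrictions, e.g.
    // cgroup cpusets, leaving the requested set effectively empty.
    return sched_setaffinity(0, sizeof(set), &set) == 0;
  }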

Flags: needinfo?(jld)

Haik, do you know anything about how macOS handles this / how it might interact with our sandboxing?

Flags: needinfo?(haftandilian)

I looked into this and found that macOS has a thread affinity API where threads can be assigned an affinity tag, which is used by the scheduler either to schedule threads together on CPUs sharing caches or to try to place them apart. The docs[1] explicitly state "explicit thread to processor binding is not supported". However, the docs also mention that "threads with affinity tags will tend to remain in place", so it's possible that just setting an affinity tag provides an affinity level better than what the scheduler does by default. The affinity tag namespace is shared with the child process when fork is used.

I don't /think/ sandboxing will have any impact on using this, but I don't know for sure. I haven't come across any sandbox rules affecting scheduling or that sound like they'd affect setting thread affinity. Just today I noticed that some of the policies that ship with macOS use a "system-sched" rule, but it's not documented.

  1. https://developer.apple.com/library/archive/releasenotes/Performance/RN-AffinityAPI/
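
For reference, a sketch of what setting a tag via thread_policy_set could look like (hypothetical helper; threads that should share a cache would pass the same non-zero tag):

  #include <mach/mach.h>
  #include <mach/thread_policy.h>

  // Tag the calling thread. The scheduler tries to keep threads with the
  // same tag on CPUs sharing a cache; per the docs this is only a hint,
  // as explicit thread-to-processor binding is not supported.
  bool SetAffinityTagForCurrentThread(integer_t tag) {
    thread_affinity_policy_data_t policy = { tag };
    mach_port_t thread = mach_thread_self();
    kern_return_t kr =
        thread_policy_set(thread, THREAD_AFFINITY_POLICY,
                          reinterpret_cast<thread_policy_t>(&policy),
                          THREAD_AFFINITY_POLICY_COUNT);
    mach_port_deallocate(mach_task_self(), thread);  // balance the ref
    return kr == KERN_SUCCESS;
  }
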
Flags: needinfo?(haftandilian)

(In reply to Chris Pearce [:cpearce (GMT+13)] from comment #6)

Raymond Chen's specific post I was thinking of was https://devblogs.microsoft.com/oldnewthing/20050607-00/?p=35413

I think that it's pretty clear that we shouldn't be trying to set our thread affinities in an effort to gain an advantage over other programs on the computer. OTOH, I think that using thread affinities to properly separate our threads within our processes is perfectly reasonable.

(In reply to Aaron Klotz [:aklotz] (PTO May 29 - June 5) from comment #11)

I think that it's pretty clear that we shouldn't be trying to set our thread affinities in an effort to gain an advantage over other programs on the computer. OTOH, I think that using thread affinities to properly separate our threads within our processes is perfectly reasonable.

I was taking more of a meta lesson from Chen's blog post: what if two [Gecko] developers do this?

My specific concern is that I've come across several different teams inside Gecko talking about setting affinities or related issues, such as setting the size of thread pools based on the number of cores, or setting thread priorities; if teams make these decisions in isolation we could end up tripping over each other's settings.

Oh yes, absolutely. Really we should probably first get a handle on all the threads that are being created in the first place and take a holistic view of that, before going anywhere near thread affinity.

(In reply to Chris Pearce [:cpearce (GMT+13)] from comment #12)

I was taking more of a meta lesson from Chen's blog post: what if two [Gecko] developers do this?

My specific concern is that I've come across several different teams inside Gecko talking about setting affinities or related issues, such as setting the size of thread pools based on the number of cores, or setting thread priorities; if teams make these decisions in isolation we could end up tripping over each other's settings.

(In reply to Aaron Klotz [:aklotz] (PTO May 29 - June 5) from comment #13)

Oh yes, absolutely. Really we should probably first get a handle on all the threads that are being created in the first place and take a holistic view of that, before going anywhere near thread affinity.

Yes, it would be bad if we started indiscriminately applying thread affinity to thread pools across Gecko, or if two teams started applying thread affinity to their own problems. However, that's certainly not what I'm suggesting, and if anyone has another bug in mind in which people are investigating thread affinity, do let me know. What I'm proposing is that we look into what benefits, if any, we can get from setting thread affinity within one trial thread pool.

This, among other experiments, could help us build the case for a more organized pan-Gecko threading model, by allowing us to say, "we saw this improvement from doing X to thread pool Y, but in order to do X to thread pool Z, we'd need to make sure they're not stepping on each other's toes." Without evidence from these or similar experiments, I think the work required to organize our threading across the application is unlikely to get off the ground: right now I'm hearing engineers talk about wanting a better threading model, but I haven't seen a bug or serious work on such a thing anywhere (please do link it here if there is such a thing - all I've seen is rather surface-level investigations into trimming out unnecessary threads to meet Fission goals).

Performance Impact: --- → ?
Whiteboard: [qf:investigate]
Performance Impact: ? → P3
Priority: -- → P3
Severity: normal → S3