Closed Bug 1563950 Opened 4 months ago Closed 3 months ago

Slow WebAssembly threading with dav1d wasm build on Windows ARM64

Categories

(Core :: DOM: Workers, defect)

68 Branch
ARM64
Windows 10
defect
Not set

Tracking

VERIFIED FIXED
mozilla70
Tracking Status
firefox70 --- verified

People

(Reporter: brion, Assigned: baku)

Details

Attachments

(2 files)

User Agent: Mozilla/5.0 (Windows NT 10.0; rv:67.0) Gecko/20100101 Firefox/67.0

Steps to reproduce:

My WebAssembly builds of the dav1d AV1 decoder are running unexpectedly slowly in threaded mode on a Windows 10 ARM64 machine, in Firefox 67/68.

  1. Enable shared memory in about:config
  2. Go to https://brionv.com/misc/ogv.js/demo3/codec-bench.html
  3. Click 'bench' and wait
  4. Save the "codec time only" time in seconds
  5. Check "Threading" checkbox
  6. Run again, compare the time

Actual results:

On a Snapdragon 850-based Lenovo Yoga C630, I get similar times with and without threading: about 112-115 seconds.

Expected results:

Threaded mode should be about half the time of non-threaded. This can be confirmed with a Linux build of Firefox running in Windows Subsystem for Linux on the same machine (requires an X11 server, either local or remote), where I do see the expected performance improvement.

There may be something wrong with the atomics stuff used for locking? I haven't narrowed down a small test case yet.

Watching the CPU monitor in Task Manager, the Windows version spends more time in the slow "little" cores than the Linux version does, which may indicate that things are spinning (expensively) rather than waiting efficiently, which could be bogging down everything.

Component: Untriaged → Javascript: WebAssembly
Product: Firefox → Core
Flags: needinfo?(lhansen)

Interesting. In the case of the basic atomic accesses, we should be generating the same code on these two systems. For wait/wake, the wasm subsystem does not have its own locks (and does no spinning of its own, though I've contemplated that in the past) but uses wait/wait_until/notify_all on a js::ConditionVariable/js::UniqueLock pair when it needs to do any kind of waiting. In the past, for x86/older Windows, I believe we've had our own implementation of at least the condition variable because Windows did not supply one; I'm not sure what the situation is now.
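
For readers unfamiliar with the pattern described above, here is a minimal sketch of blocking on a condition variable/lock pair instead of spinning. It uses the standard library's std::condition_variable and std::unique_lock as stand-ins for js::ConditionVariable and js::UniqueLock; the Gate class and RunGateDemo function are hypothetical names invented for this illustration and are not SpiderMonkey code.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical one-shot gate: waiters sleep until another thread opens it.
class Gate {
 public:
  // Block (without spinning) until Open() is called or the timeout elapses.
  // Returns true if the gate was open when the wait ended.
  bool WaitFor(std::chrono::milliseconds timeout) {
    std::unique_lock<std::mutex> lock(mutex_);
    // The predicate guards against spurious wakeups and against Open()
    // having already run before the wait started.
    return cond_.wait_for(lock, timeout, [this] { return open_; });
  }

  // Open the gate and wake every waiter.
  void Open() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      open_ = true;
    }
    cond_.notify_all();
  }

 private:
  std::mutex mutex_;
  std::condition_variable cond_;
  bool open_ = false;
};

// Start a waiter thread, open the gate from the calling thread, and report
// whether the waiter observed the gate as open.
bool RunGateDemo() {
  Gate gate;
  bool woke = false;
  std::thread waiter([&] { woke = gate.WaitFor(std::chrono::seconds(5)); });
  gate.Open();
  assert(waiter.joinable());
  waiter.join();
  return woke;
}
```

A waiter blocked this way consumes no CPU until it is notified, which is why the wait path by itself would not be expected to produce the spinning suspected in comment 0.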

I will investigate further, though I don't have access to the hardware in question at the moment, so I may ask for some further assistance.

Indeed, threading is about 50% the time of non-threading on standard x64 Windows 10 too.

Status: UNCONFIRMED → NEW
Ever confirmed: true
Flags: needinfo?(lhansen)
Priority: -- → P3
OS: Unspecified → Windows 10
Hardware: Unspecified → ARM64

Thanks for looking into it! I'll be happy to help test/debug.

Hm, this may be a thread priority issue with an ARM-specific twist.

I noticed that if I run the single-threaded version on a single Worker thread (checking the 'Worker' box but not 'Threading'), it's much slower -- about 3x slower than running on the main thread. Looking at Task Manager, I can see that on the main thread CPU activity stays on the 'big' high-power cores, while the worker version spends most time on the 'little' low-power cores.

This may explain why x64 doesn't show a difference while arm64 does -- all cores are about the same; there's no big/little distinction.
If I understand this correctly, on arm64 turning threading on still pushes the Workers to the little cores but gets a parallelization speedup, coming out to a similar time as the single thread on a big core...

If so there may be something specific about the way worker threads are launched that's causing Windows to schedule them with a lower priority (hence to the little cores), that's different from how the Linux binary running on WSL does it.

Here is a much simpler test case:

https://brionv.com/misc/workerfoo/

It runs an additive calculation series in JS that takes about 1400ms on the main thread, or 7500ms if run in a worker on the ARM64 Windows build. The Linux build on the same machine gives the same times (1400ms-ish) for both modes, as does an x64 Windows machine (2000ms on the old machine I tested).
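
The measurement idea behind that test can be sketched outside the browser. The following is an illustrative C++ analogue, not the code of the page linked above: the workload, the function names, and the Timed struct are all assumptions made for this sketch. It runs the same additive loop once on the calling thread and once on a freshly created thread, so the two wall-clock times can be compared.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <thread>

// Hypothetical stand-in for the test page's workload: sum the integers 0..n-1.
std::uint64_t AdditiveSeries(std::uint64_t n) {
  std::uint64_t sum = 0;
  for (std::uint64_t i = 0; i < n; ++i) {
    sum += i;
  }
  return sum;
}

// Result of one timed run: the computed sum and the elapsed wall-clock time.
struct Timed {
  std::uint64_t result;
  double milliseconds;
};

// Run the workload on whichever thread calls this function and time it.
Timed TimeOnCallingThread(std::uint64_t n) {
  const auto start = std::chrono::steady_clock::now();
  const std::uint64_t result = AdditiveSeries(n);
  const auto end = std::chrono::steady_clock::now();
  return {result,
          std::chrono::duration<double, std::milli>(end - start).count()};
}

// Run the same timed workload on a separate thread and wait for it to finish.
Timed TimeOnWorkerThread(std::uint64_t n) {
  Timed out{};
  std::thread worker([&] { out = TimeOnCallingThread(n); });
  assert(worker.joinable());
  worker.join();
  return out;
}
```

Comparing TimeOnCallingThread(n).milliseconds with TimeOnWorkerThread(n).milliseconds for a large n gives the two numbers of interest. Note that an optimizing compiler may simplify a loop this regular, and that a plain std::thread starts at default priority, so this sketch does not reproduce the lowered worker priority by itself.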

Nathan, is this something you know anything about?

We should make sure this is Windows-on-ARM64 only and not also Android-on-ARM64 (though given the Linux results Brion reports above I don't feel too nervous).

I tried the steps from comment 0 on a C630 with Nightly 2019-07-09, and my times were 120s without threading and 75s with threading. So I did see a decent speedup. Am I doing something wrong?

Hmm, I definitely still see the slowdown on Nightly 2019-07-09 myself; if it's coming up fast for you on the same hardware and same Firefox I wonder if it's possible there's been a behavior change to the asymmetric core scheduling in Windows...

I'm running Windows Insider fast ring build 18932; I believe the machine originally shipped with the 1803 or 1809 edition, and I didn't have a chance to test on those or on 1903 (the current release). I may or may not be able to downgrade and retest; I'll look into it.

I was on 1809, if it matters. (They shipped with 1803, but I upgraded mine.)

Also, I was on battery power at the time, on the Balanced power plan (there aren't any others to choose from, missing drivers maybe?). I'm afraid I'm no longer near that machine to try any other settings, and I'm going to be away for a while.

I've reset the machine from OEM recovery download... Back on Win10 1803, I can confirm that I see no slowdown in either codec-bench.html or the "workerfoo" simpler test.

To see if it is a regression in 1903 or 20H1 (Insiders fast ring), I'll try updating to the current release and then back to insiders and test along the way...

Sorry for the tangent, but how do you get to the OEM recovery download? For some time I've been frustrated at not being able to switch to arbitrary builds, due to the lack of arm64 ISOs on MSDN subscriptions or the Insider site.

The OEM recovery download is a little hard to find; I found a link via some forum post. :) It's hiding here: https://pcsupport.lenovo.com/us/en/lenovorecovery

You'll need the serial number (on the sticker, or findable in BIOS if you hit fn+F2 during boot) and a Lenovo account to register; this lets you download a Windows .exe that downloads the actual recovery image and writes it to a USB stick.

From there, just plug in and boot and follow the instructions, and it'll eventually drop you into the first-boot experience on the original 1803 image.

You can upgrade to 1903 from there via the Windows 10 Update Assistant.

Long story short, in 1803 everything runs full speed but in 1903 I see the slowdown again -- so at least it doesn't require hitting Insider builds to reproduce. Most likely there's been a change in Windows' scheduler that's shoving the background threads farther in the background than they should be, onto the little cores, but I'm not sure what specifically would cause that. I got a little lost poking at the source for Workers. :)

I managed to make a patched build that works for me: if I remove the use of PRIORITY_LOW on non-chrome Worker threads in dom/workers/RuntimeService.cpp I get full speed in my worker tests even on the 20H1 Insiders build (now up to 18936):

  /*
  int32_t priority = aWorkerPrivate->IsChromeWorker()
                         ? nsISupportsPriority::PRIORITY_NORMAL
                         : nsISupportsPriority::PRIORITY_LOW;
  */
  int32_t priority = nsISupportsPriority::PRIORITY_NORMAL;

It looks like NSPR threading maps PRIORITY_LOW to Windows THREAD_PRIORITY_BELOW_NORMAL which is only one notch below normal, so I'm pretty sure this is Windows becoming more aggressive with sending any low-priority threads to the back burner on big-little config. Don't know if it makes sense to leave them at normal priority, or if there is some way to set them to lower priority without being sent to little cores, or if Microsoft needs to fix a regression in their scheduler...
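
To make the chain described above concrete, here is a simplified model of it. Everything in this sketch is a stand-in: the constant names and function names are hypothetical, the numeric values are assumed to mirror nsISupportsPriority and the Win32 thread-priority constants, and the mapping function covers only the two levels discussed in this bug rather than the full NSPR logic. The "patched" function models the local change shown in the snippet above.

```cpp
#include <cassert>
#include <cstdint>

// Assumed to mirror nsISupportsPriority, where a larger number means a
// lower priority.
constexpr std::int32_t kPriorityNormal = 0;  // PRIORITY_NORMAL
constexpr std::int32_t kPriorityLow = 10;    // PRIORITY_LOW

// Assumed to mirror the Win32 thread-priority constants.
constexpr int kThreadPriorityNormal = 0;        // THREAD_PRIORITY_NORMAL
constexpr int kThreadPriorityBelowNormal = -1;  // THREAD_PRIORITY_BELOW_NORMAL

// Original behavior: only chrome (privileged) workers get normal priority.
std::int32_t WorkerPriorityOriginal(bool isChromeWorker) {
  return isChromeWorker ? kPriorityNormal : kPriorityLow;
}

// Locally patched behavior: every worker gets normal priority.
std::int32_t WorkerPriorityPatched(bool /*isChromeWorker*/) {
  return kPriorityNormal;
}

// Simplified model of the mapping to a Windows thread priority; it handles
// only the two levels that matter for this bug.
int ToWindowsThreadPriority(std::int32_t priority) {
  assert(priority == kPriorityNormal || priority == kPriorityLow);
  return priority == kPriorityLow ? kThreadPriorityBelowNormal
                                  : kThreadPriorityNormal;
}
```

Under this model, a content worker ends up one notch below normal before the patch and at normal after it, while chrome workers are unaffected either way.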

That's a great find. baku, should workers just be using PRIORITY_NORMAL?

Component: Javascript: WebAssembly → DOM: Workers
Flags: needinfo?(amarchesini)

The component has been changed since the backlog priority was decided, so we're resetting it.
For more information, please visit auto_nag documentation.

Priority: P3 → --

  int32_t priority = aWorkerPrivate->IsChromeWorker()
                         ? nsISupportsPriority::PRIORITY_NORMAL
                         : nsISupportsPriority::PRIORITY_LOW;

When we implemented workers, they were not so popular and having a low priority was reasonable. I think it's time to change it.

Flags: needinfo?(amarchesini)
Assignee: nobody → amarchesini
Pushed by amarchesini@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/38729643d8c4
Worker threads should run with a normal priority, r=smaug
Status: NEW → RESOLVED
Closed: 3 months ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla70

Confirmed fixed on Nightly. Y'all work fast! ;) Thanks for the fix!

I will mark this bug as verified fixed as per comment 21.

Status: RESOLVED → VERIFIED