Closed Bug 1532286 Opened 5 years ago Closed 5 years ago

dav1d video decoder slower in Firefox than in Chrome and Safari

Categories

(Core :: JavaScript: WebAssembly, defect)


Tracking


RESOLVED FIXED
mozilla68
Tracking Status
firefox68 --- fixed

People

(Reporter: jandem, Assigned: jseward)

Details

Attachments

(3 files)

From https://twitter.com/brionv/status/1100929680054542336:

dav1d decode times for 'Llamigos' cartoon at 360p AV1:

native w/ avx2: 4.875 s <- 256-bit SIMD
native w/ sse41: 12.510 s <- 128-bit SIMD
native plain C: 20.761 s <- baseline

Wasm in Safari: 37.674 s
Wasm in Chrome: 46.396 s
Wasm in Firefox: 55.332 s

Safari does great on both dav1d and libvpx, beating Chrome and Firefox by a significant margin. I think it has better bounds checking; these codecs do a lot of 1-byte memory loads and stores (especially with no SIMD!)

Firefox gains a couple of seconds if I disable the worker thread; there seems to be some extra message-passing overhead. Interestingly, for VP9 with libvpx, Firefox and Chrome are neck and neck. I wonder what's different with dav1d...

Brion, do you happen to have a standalone test case for this? :)

Flags: needinfo?(brion)

What are these bounds checks of which you speak? We should be emitting no code for memory bounds checks on 64-bit platforms.

I haven't been able to narrow down a specific function that is particularly slow yet as I'm unable to get function level profiling to report function names for Wasm on Chrome or Safari. (Firefox shows function names if I build with -g4)

But if you want to give it a whirl, the benchmark I'm using is:
https://brionv.com/misc/ogv.js/demo2/codec-bench.html (ogv.js master with emscripten fastcomp)
https://brionv.com/misc/ogv.js/demo3/codec-bench.html (ogv.js simd2 branch with upstream LLVM)

(I've been informed that there are clever optimizations to minimize bounds checks on 64-bit, so it may be register pressure or other issues.)

Flags: needinfo?(brion)

Great, thank you!

Flags: needinfo?(brion)
Attached file Profile snapshot

I profiled demo2 and demo3 from comment 3 on nightly. These demos run for
very roughly 500 billion instructions, and I looked at chunks of approximately
40 billion instructions. Results from the two demos are very similar, so I
concentrate here on the hottest block for demo3.

The profile fragment attached covers 39.19 billion insns. The top 20 blocks
constitute 55.18% thereof. All of the blocks I looked at are unusually long
(easily 50+ instructions), and they are all pretty similar.

There is one obviously-lame bit of instruction selection here, which looks
like it's intended to be a 32-bit max (or min) operation:

  0xF3C5325632A:  cmpl   8(%rsp), %edi
  0xF3C5325632E:  setb   %sil
  0xF3C53256332:  movzbl %sil, %esi
  0xF3C53256336:  testl  %esi, %esi
  0xF3C53256338:  cmove  8(%rsp), %edi
  and esi is dead at this point (viz., the next event for it is a write)

I think this achieves "edi = min(edi, 8(rsp))" (or perhaps it's "max"). The
middle 3 instructions, materialising the comparison result into %esi and then
testing it again, strike me as pointless and also pointlessly nuke a register.

In the hottest block in this profile snapshot, which contains 100 insns, this
idiom occurs no fewer than 14 times. That is an unusually bad example, but the
idiom also appears often in the other hot blocks I looked at.

Apart from that, I would observe that:

  • the instruction selection is mostly reasonable

  • there's nothing that would cause partial-register stalls

  • this appears at least 3 times:

     0xF3C53256354:  movl   16(%rsp), %eax
     0xF3C53256358:  addl   28(%rsp), %eax

which seems somewhat silly. I don't see any writes in this block to either
16(rsp) or 28(rsp). If, as one might hypothesise, they are loop invariant,
then one could ask why the expression 16(rsp)+28(rsp) wasn't detected as loop
invariant and hoisted out.

  • register allocation isn't obviously bad, but perhaps it could be better?
    There's a bunch of pulling values off the stack: 16(rsp), 28(rsp), 148(rsp),
    220(rsp), and I wonder if some of those are loop invariant. At the same
    time, r8, r9 and r11 aren't mentioned at all. (They might be live across
    the entire block, or appear in the truncated tail.) Or do they have fixed
    roles?

Thanks a bunch for that analysis! Based on the cmov (which we don't often use), they're probably using select, which always does the test+cmov, with no condition-folding optimization like we do for branches:
https://searchfox.org/mozilla-central/source/js/src/jit/x86-shared/CodeGenerator-x86-shared.cpp#302

Proposed scheme to fix the cmov badness:

  • in LIRGenerator::visitCompare, extend existing logic (CanEmitCompareAtUses)
    so that if the compare is used to feed a WasmSelect (and there's only one
    use of it?) then mark it as emitAtUses.

  • in LIRGenerator::visitWasmSelect, if the comparison is marked emitAtUses,
    then don't generate any of the LIR that's currently created. Instead create
    a single, new LIR node, LCompareAndSelect, which has 5 operands:

    • the two things to be compared (args 1, 2)
    • the comparison op (==, >=, etc) (arg 3)
    • the two things to be selected (cmov'd) between (args 4, 5)
  • in the CodeGenerators, turn that basically into

      dst <- arg4
      cmp  arg1, arg2
      cmov(condition=arg3)   dst, arg5
    

    with hopefully some regalloc magic to make "dst <- arg4" disappear

Initial exploratory patch. Mostly debugging printfs.

Assignee: nobody → jseward

Did a quick test:

  • Firefox 66.0.1 (64-bits, Windows): Dropped Frames 208/3000
  • Firefox 67.0b4 (64-bits, Windows): Dropped Frames 0/3000
  • Firefox 68.0a1 (64-bits, Windows): Dropped Frames 2463/3000
  • Chrome 73.0 (64-bits, Windows): Dropped Frames 3/3000

Core i5-4590, on the first 50 seconds of Costa Rica in 1440p60.

This patch creates better code, when compiling Wasm via Ion, for 32-bit
integer selects which are guarded by a 32-bit integer compare. No other type
combinations are optimised at present. Changes:

  • LIR-shared.h: Add new LIR node LWasmCompareAndSelect, holding the values to
    compare, the values to select over, and the comparison details.

  • CanEmitCompareAtUses:

    • Mechanically rewrite to remove hard-to-reason-about control flow (a loop
      which probably only iterates once, and a boolean control variable).
      These are both removed and replaced by obvious straight-line code.

    • Also allow deferred emission when there is a single user and it is a
      WasmSelect.

  • LIRGenerator::visitWasmSelect for both arm and x86-shared: If the condition
    is a compare, marked 'may be emitted at use point', and has the correct
    types, bundle up the comparison arguments and values to be selected-over
    into a single LWasmCompareAndSelect node.

  • CodeGenerator::visitWasmCompareAndSelect for both arm and x86-shared: From
    an LWasmCompareAndSelect node, generate the desired optimal sequence: a
    compare immediately followed by a conditional move.

  • CodeGenerator::generateBody(): an #ifdef JS_JITSPEW one-liner ride-along:
    print "# " before the LIR instruction name, to make reading
    IONFLAGS=codegen output easier. No (release-build) functional change.

  • New test case select-int32.js.

None of the changes apply to aarch64, because wasm-via-Ion on aarch64 is not
currently supported.

It's not clear to me what was actually measured in comment 0 (elapsed or
codec-only time?), but that notwithstanding, the comment 10 patch improves
performance by around 17% on x86_64-linux for an optimised (-Og), non-debug
build, on a quiet Intel Xeon at 2.6 GHz, running the demo3 test case from
comment 3, with default config. The raw numbers are as follows; 3 runs for
each:

UNPATCHED
  61.436882602835546 fps decoding average
  58.613 seconds total (elapsed)
  56.106 seconds total (codec only)

  61.27588612656763 fps decoding average
  58.767 seconds total (elapsed)
  56.916 seconds total (codec only)

  60.77637130801688 fps decoding average
  59.25 seconds total (elapsed)
  57.457 seconds total (codec only)

PATCHED
  71.27587981473418 fps decoding average
  50.522 seconds total (elapsed)
  47.939 seconds total (codec only)

  72.43723849372384 fps decoding average
  49.712 seconds total (elapsed)
  47.877 seconds total (codec only)

  72.78129231764255 fps decoding average
  49.477 seconds total (elapsed)
  47.711 seconds total (codec only)

Why do you need the Lowering and CodeGenerator code to be per-platform? That's something we really try to avoid, and for Lowering especially it's usually avoidable. (Also, the convention is for the commit message to describe what the patch does; it shouldn't just repeat the bug title.)

Flags: needinfo?(jseward)

I wonder if you can use masm.cmp32move32 or add something like that.

(In reply to Jan de Mooij [:jandem] from comment #12)

Why do you need the Lowering and CodeGenerator code to be per-platform?

A good question. At least for the Lowering, the LIRGenerator::visitWasmSelect
implementations for x86-shared, arm and mips-shared looked identical before my
patch, and I had assumed there was some reason, unknown to me, for the
duplication. I can try to common them up and see if anything breaks.

Flags: needinfo?(jseward)
Attachment #9057520 - Attachment description: Bug 1532286 - dav1d video decoder slower in Firefox than in Chrome and Safari. → Bug 1532286 - Wasm/Ion: generate better code for wasmSelect in some cases.
Pushed by jseward@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/5e18acc4a42b
Wasm/Ion: generate better code for wasmSelect in some cases.  r=jandem.
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla68