Closed Bug 1532286 Opened 5 years ago Closed 5 years ago

dav1d video decoder slower in Firefox than in Chrome and Safari

Categories

(Core :: JavaScript: WebAssembly, defect)


Tracking


RESOLVED FIXED
mozilla68
Tracking Status
firefox68 --- fixed

People

(Reporter: jandem, Assigned: jseward)

Details

Attachments

(3 files)

From https://twitter.com/brionv/status/1100929680054542336:

dav1d decode times for 'Llamigos' cartoon at 360p AV1:

native w/ avx2: 4.875 s <- 256-bit SIMD
native w/ sse41: 12.510 s <- 128-bit SIMD
native plain C: 20.761 s <- baseline

Wasm in Safari: 37.674 s
Wasm in Chrome: 46.396 s
Wasm in Firefox: 55.332 s

Safari does great on both dav1d and libvpx, beating Chrome and Firefox by a significant margin. I think it has better bounds checking; these codecs do a lot of 1-byte memory loads and stores (especially with no SIMD!)

Firefox gains a couple of seconds if I disable the worker thread; there seems to be some extra message-passing overhead. Interestingly, for VP9 with libvpx, Firefox and Chrome are neck and neck. I wonder what's different with dav1d...

Brion, do you happen to have a standalone test case for this? :)

Flags: needinfo?(brion)

What are these bounds checks of which you speak? We should be emitting no code for memory bounds checks on 64-bit platforms.

I haven't been able to narrow down a specific function that is particularly slow yet as I'm unable to get function level profiling to report function names for Wasm on Chrome or Safari. (Firefox shows function names if I build with -g4)

But if you want to give it a whirl, the benchmark I'm using is:
https://brionv.com/misc/ogv.js/demo2/codec-bench.html (ogv.js master with emscripten fastcomp)
https://brionv.com/misc/ogv.js/demo3/codec-bench.html (ogv.js simd2 branch with upstream LLVM)

(I've been informed that there are clever optimizations to minimize bounds checks on 64-bit, so it may be register pressure or other issues.)

Flags: needinfo?(brion)

Great, thank you!

Flags: needinfo?(brion)
Attached file Profile snapshot

I profiled demo2 and demo3 from comment 3 on nightly. These demos run for
very roughly 500 billion instructions, and I looked at chunks of approximately
40 billion instructions. Results from the two demos are very similar, so I
concentrate here on the hottest block for demo3.

The profile fragment attached covers 39.19 billion insns. The top 20 blocks
constitute 55.18% thereof. All of the blocks I looked at are unusually long
(easily 50+ instructions), and they are all pretty similar.

There is one obviously-lame bit of instruction selection here, which looks
like it's intended to be a 32-bit max (or min) operation:

  0xF3C5325632A:  cmpl   8(%rsp), %edi
  0xF3C5325632E:  setb   %sil
  0xF3C53256332:  movzbl %sil, %esi
  0xF3C53256336:  testl  %esi, %esi
  0xF3C53256338:  cmove  8(%rsp), %edi
  and esi is dead at this point (viz., the next event for it is a write)

I think this achieves "edi = min(edi, 8(rsp))" (or perhaps it's "max"). The
middle 3 instructions, materialising the comparison result into %esi and then
testing it again, strike me as pointless and also pointlessly nuke a register.

In the hottest block in this profile snapshot, which contains 100 insns, this
idiom occurs no fewer than 14 times. That is an unusually bad example, but the
idiom also appears often in the other hot blocks I looked at.

Apart from that, I would observe that:

  • the instruction selection is mostly reasonable

  • there's nothing that would cause partial-register stalls

  • this appears at least 3 times:

     0xF3C53256354:  movl   16(%rsp), %eax
     0xF3C53256358:  addl   28(%rsp), %eax

which seems somewhat silly. I don't see any writes in this block to either
16(rsp) or 28(rsp). If, as one might hypothesise, they are loop invariant,
then one could ask why the expression 16(rsp)+28(rsp) wasn't detected as loop
invariant and hoisted out.

  • register allocation isn't obviously bad, but perhaps it could be better?
    There's a bunch of pulling values off the stack: 16(rsp), 28(rsp), 148(rsp),
    220(rsp), and I wonder if some of those are loop invariant. At the same
    time, r8, r9 and r11 aren't mentioned at all. (They might be live across
    the entire block, or appear in the truncated tail.) Or do they have fixed
    roles?

Thanks a bunch for that analysis! Based on the cmov (which we don't often use), they're probably using select, which always does the test+cmov, with no condition-folding optimization like we do for branches:
https://searchfox.org/mozilla-central/source/js/src/jit/x86-shared/CodeGenerator-x86-shared.cpp#302

Proposed scheme to fix the cmov badness:

  • in LIRGenerator::visitCompare, extend existing logic (CanEmitCompareAtUses)
    so that if the compare is used to feed a WasmSelect (and there's only one
    use of it?) then mark it as emitAtUses.

  • in LIRGenerator::visitWasmSelect, if the comparison is marked emitAtUses,
    then don't generate any of the LIR that's currently created. Instead create
    a single, new LIR node, LCompareAndSelect, which has 5 operands:

    • the two things to be compared (args 1, 2)
    • the comparison op (==, >=, etc) (arg 3)
    • the two things to be selected (cmov'd) between (args 4, 5)
  • in the CodeGenerators, turn that basically into

      dst <- arg4
      cmp  arg1, arg2
      cmov(condition=arg3)   dst, arg5
    

    with hopefully some regalloc magic to make "dst <- arg4" disappear

Initial exploratory patch. Mostly debugging printfs.

Assignee: nobody → jseward

Did a quick test:

  • Firefox 66.0.1 (64-bits, Windows): Dropped Frames 208/3000
  • Firefox 67.0b4 (64-bits, Windows): Dropped Frames 0/3000
  • Firefox 68.0a1 (64-bits, Windows): Dropped Frames 2463/3000
  • Chrome 73.0 (64-bits, Windows): Dropped Frames 3/3000

Core i5-4590, on the first 50 seconds of Costa Rica in 1440p60.

This patch creates better code, when compiling Wasm via Ion, for 32-bit
integer selects which are guarded by a 32-bit integer compare. No other type
combinations are optimised at present. Changes:

  • LIR-shared.h: Add new LIR node LWasmCompareAndSelect, holding the values to
    compare, the values to select over, and the comparison details.

  • CanEmitCompareAtUses:

    • Mechanically rewrite to remove hard-to-reason-about control flow (a loop
      which probably only iterates once, and a boolean control variable).
      These are both removed and replaced by obvious straight-line code.

    • Also allow deferred emission when there is a single user and it is a
      WasmSelect.

  • LIRGenerator::visitWasmSelect for both arm and x86-shared: If the condition
    is a compare, marked 'may be emitted at use point', and has the correct
    types, bundle up the comparison arguments and values to be selected-over
    into a single LWasmCompareAndSelect node.

  • CodeGenerator::visitWasmCompareAndSelect for both arm and x86-shared: From
    an LWasmCompareAndSelect node, generate the desired optimal sequence: a
    compare immediately followed by a conditional move.

  • CodeGenerator::generateBody(): an #ifdef JS_JITSPEW one-liner ride-along:
    print "# " before the LIR instruction name, to make reading
    IONFLAGS=codegen output easier. No (release-build) functional change.

  • New test case select-int32.js.

None of the changes apply to aarch64, because wasm-via-Ion on aarch64 is not
currently supported.

It's not clear to me what was actually measured in comment 0 (elapsed or
codec-only time?), but that notwithstanding, the comment 10 patch improves
performance by around 17% on x86_64-linux for an optimised (-Og), non-debug
build, on a quiet Intel Xeon at 2.6 GHz, running the demo3 test case from
comment 3, with default config. The raw numbers are as follows; 3 runs for
each:

UNPATCHED
  61.436882602835546 fps decoding average
  58.613 seconds total (elapsed)
  56.106 seconds total (codec only)

  61.27588612656763 fps decoding average
  58.767 seconds total (elapsed)
  56.916 seconds total (codec only)

  60.77637130801688 fps decoding average
  59.25 seconds total (elapsed)
  57.457 seconds total (codec only)

PATCHED
  71.27587981473418 fps decoding average
  50.522 seconds total (elapsed)
  47.939 seconds total (codec only)

  72.43723849372384 fps decoding average
  49.712 seconds total (elapsed)
  47.877 seconds total (codec only)

  72.78129231764255 fps decoding average
  49.477 seconds total (elapsed)
  47.711 seconds total (codec only)

Why do you need the Lowering and CodeGenerator code to be per-platform? That's something we really try to avoid, and for Lowering especially it's usually avoidable. (Also, the convention is for the commit message to describe what the patch does; it shouldn't just repeat the bug title.)

Flags: needinfo?(jseward)

I wonder if you can use masm.cmp32move32 or add something like that.

(In reply to Jan de Mooij [:jandem] from comment #12)

Why do you need the Lowering and CodeGenerator code to be per-platform?

A good question. At least for the Lowering, the LIRGenerator::visitWasmSelect
implementations for x86-shared, arm and mips-shared looked identical before my
patch, and I had assumed there was some reason, unknown to me, for the
duplication. I can try to common them up and see if anything breaks.

Flags: needinfo?(jseward)
Attachment #9057520 - Attachment description: Bug 1532286 - dav1d video decoder slower in Firefox than in Chrome and Safari. → Bug 1532286 - Wasm/Ion: generate better code for wasmSelect in some cases.
Pushed by jseward@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/5e18acc4a42b
Wasm/Ion: generate better code for wasmSelect in some cases.  r=jandem.
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla68