Closed Bug 1891459 Opened 1 month ago Closed 1 month ago

Update dav1d to new version 5b5399911dd24703de641d65eda5b7f1e845d060 from 2024-04-15 13:19:42

Categories

(Core :: Audio/Video: Playback, enhancement)

Tracking

()

RESOLVED FIXED
127 Branch
Tracking Status
firefox127 --- fixed

People

(Reporter: update-bot, Assigned: chunmin)

Details

(Whiteboard: [3pl-filed][task_id: fYmt3Lk_SSm-0ChT4jJAEA])

Attachments

(1 file)

This update covers 25 commits. Here are the overall diff statistics, and then the commit information.


media/libdav1d/asm/moz.build | 1 +
media/libdav1d/moz.yaml | 4 +-
media/libdav1d/vcs_version.h | 2 +-
third_party/dav1d/meson.build | 2 +
third_party/dav1d/meson_options.txt | 5 +
third_party/dav1d/src/arm/64/mc.S | 4 +-
third_party/dav1d/src/arm/64/mc_dotprod.S | 1413 +++++++++++++
third_party/dav1d/src/arm/64/msac.S | 21 +-
third_party/dav1d/src/arm/itx.h | 63 -
third_party/dav1d/src/arm/mc.h | 83 +-
third_party/dav1d/src/cdf.c | 1396 ++++++------
third_party/dav1d/src/cdf.h | 48 +-
third_party/dav1d/src/decode.c | 95 +-
third_party/dav1d/src/internal.h | 9 +-
third_party/dav1d/src/itx.h | 63 +
third_party/dav1d/src/lf_mask.c | 6 +-
third_party/dav1d/src/meson.build | 1 +
third_party/dav1d/src/refmvs.c | 4 +-
third_party/dav1d/src/riscv/itx.h | 63 -
third_party/dav1d/src/x86/ipred_avx2.asm | 3 +-
third_party/dav1d/src/x86/itx.h | 64 -
third_party/dav1d/src/x86/mc16_avx2.asm | 1548 +++++++++++--
third_party/dav1d/src/x86/mc_avx2.asm | 1471 +++++++++++--
third_party/dav1d/src/x86/mc_avx512.asm | 3037 +++++++++++++++++++---------
third_party/dav1d/tests/meson.build | 2 +-
25 files changed, 6745 insertions(+), 2663 deletions(-)


5b5399911dd24703de641d65eda5b7f1e845d060 by Henrik Gramner

https://code.videolan.org/videolan/dav1d/commit/5b5399911dd24703de641d65eda5b7f1e845d060
Authored: 2024-04-08 12:47:10 +0200
Committed: 2024-04-15 13:19:42 +0200

x86: Add 6-tap variants of 8bpc mc AVX-512 (Ice Lake) functions

6-tap filtering is only performed vertically due to use of VNNI
instructions processing 4 pixels per instruction horizontally.

Files Modified:

  • src/x86/mc_avx512.asm

38df35d2d1aa9faf31942cfee9a17244094cb6f8 by Henrik Gramner

https://code.videolan.org/videolan/dav1d/commit/38df35d2d1aa9faf31942cfee9a17244094cb6f8
Authored: 2024-04-08 12:47:09 +0200
Committed: 2024-04-15 13:12:20 +0200

x86: Add various 8bpc mc AVX-512 improvements

Files Modified:

  • src/x86/mc_avx512.asm

313af0b6a574cade1b227a50e29ca8b1b5ffcaee by Matthias Dressel

https://code.videolan.org/videolan/dav1d/commit/313af0b6a574cade1b227a50e29ca8b1b5ffcaee
Authored: 2024-04-06 18:26:26 +0200
Committed: 2024-04-14 01:57:37 +0200

CI: Update images

Now with clang 18 and downgraded xz-utils.

Files Modified:

  • .gitlab-ci.yml

09f2a21e7c49554c0b6755feaedbba1e70b6f7cf by Luca Barbato

https://code.videolan.org/videolan/dav1d/commit/09f2a21e7c49554c0b6755feaedbba1e70b6f7cf
Authored: 2024-04-13 13:52:10 +0200
Committed: 2024-04-13 23:19:52 +0200

Deduplicate itx macros

Files Modified:

  • src/arm/itx.h
  • src/itx.h
  • src/riscv/itx.h
  • src/x86/itx.h

f1c518901b7bff3d5557ab30996ebefdab2db482 by Ronald S. Bultje

https://code.videolan.org/videolan/dav1d/commit/f1c518901b7bff3d5557ab30996ebefdab2db482
Authored: 2024-04-13 09:10:00 -0400
Committed: 2024-04-13 09:53:54 -0400

Increase timeout multiplier for aarch64/riscv64/la64-qemu CI jobs

They have been failing occasionally lately.

Files Modified:

  • .gitlab-ci.yml

aa63a41ccddda86662374d4fb5a1e7fb1c69b881 by Matthias Dressel

https://code.videolan.org/videolan/dav1d/commit/aa63a41ccddda86662374d4fb5a1e7fb1c69b881
Authored: 2024-04-11 23:15:07 +0200
Committed: 2024-04-11 23:15:07 +0200

cli: Add missing ARM cpumasks help text

Forgotten in acc1121d2f6c0b6fb4dc0206a95c77aa2aadd762.

Files Modified:

  • tools/dav1d_cli_parse.c

9d77b6336a5f9921f503cf0ea19f294f558227e3 by Arpad Panyik

https://code.videolan.org/videolan/dav1d/commit/9d77b6336a5f9921f503cf0ea19f294f558227e3
Authored: 2024-03-22 16:20:38 +0100
Committed: 2024-04-11 19:03:58 +0200

AArch64: Add DotProd support for convolutions

Add an Armv8.4-A DotProd code path for standard bitdepth convolutions.
Only horizontal-vertical (HV) convolutions have 6-tap specialisations
of their vertical passes. All other convolutions are 4- or 8-tap
filters which fit well with the 4-element SDOT instruction.

Benchmarks show an FPS increase of 7-29%, depending on the input video
and the CPU used.

This patch increases the size of .text by around 6.5 KiB.

Performance depends heavily on the ratio of SDOT to MLA throughput, which
can be seen in the vertical filter cases. Small cores are also
affected by the TBL execution latencies.

Performance relative to the C reference on some CPUs:

                             A76     A78      X1     A55

regular w4 hv neon:        5.52x   5.78x  10.75x   8.27x
regular w4 hv dotprod:     7.94x   8.49x  16.84x   8.09x
sharp w4 hv neon:          5.27x   5.22x   9.06x   7.87x
sharp w4 hv dotprod:       6.61x   6.73x  12.64x   6.89x

regular w8 hv neon:        1.95x   2.19x   2.56x   3.16x
regular w8 hv dotprod:     3.23x   2.81x   3.20x   3.26x
sharp w8 hv neon:          1.61x   1.79x   2.05x   2.72x
sharp w8 hv dotprod:       2.72x   2.29x   2.66x   2.76x

regular w16 hv neon:       1.63x   2.04x   2.16x   2.73x
regular w16 hv dotprod:    2.72x   2.57x   2.67x   2.80x
sharp w16 hv neon:         1.33x   1.67x   1.74x   2.34x
sharp w16 hv dotprod:      2.31x   2.14x   2.26x   2.39x

regular w32 hv neon:       1.48x   1.92x   1.94x   2.51x
regular w32 hv dotprod:    2.49x   2.40x   2.33x   2.58x
sharp w32 hv neon:         1.21x   1.56x   1.53x   2.14x
sharp w32 hv dotprod:      2.12x   2.02x   2.00x   2.22x

regular w64 hv neon:       1.42x   1.87x   1.85x   2.40x
regular w64 hv dotprod:    2.40x   2.32x   2.21x   2.46x
sharp w64 hv neon:         1.16x   1.52x   1.46x   2.04x
sharp w64 hv dotprod:      2.02x   1.96x   1.90x   2.11x

regular w128 hv neon:      1.39x   1.84x   1.80x   2.27x
regular w128 hv dotprod:   2.33x   2.28x   2.14x   2.35x
sharp w128 hv neon:        1.14x   1.50x   1.42x   1.94x
sharp w128 hv dotprod:     1.98x   1.93x   1.84x   2.03x

regular w8 h neon:         2.61x   3.20x   3.51x   3.55x
regular w8 h dotprod:      4.43x   5.17x   6.26x   4.30x
sharp w8 h neon:           2.01x   2.80x   2.89x   3.12x
sharp w8 h dotprod:        4.42x   5.16x   6.27x   4.28x

regular w16 h neon:        2.17x   3.13x   2.92x   3.35x
regular w16 h dotprod:     4.38x   4.27x   4.53x   3.90x
sharp w16 h neon:          1.74x   2.65x   2.48x   2.92x
sharp w16 h dotprod:       4.33x   4.27x   4.53x   3.91x

regular w64 h neon:        1.92x   2.82x   2.39x   2.96x
regular w64 h dotprod:     3.68x   3.60x   3.40x   3.18x
sharp w64 h neon:          1.47x   2.33x   2.05x   2.54x
sharp w64 h dotprod:       3.68x   3.60x   3.40x   3.17x

regular w4 v neon:         5.39x   7.38x  10.27x  11.41x
regular w4 v dotprod:      9.46x  14.15x  18.72x   9.84x
sharp w4 v neon:           4.51x   6.39x   8.17x  10.70x
sharp w4 v dotprod:        9.35x  14.20x  18.63x   9.78x

regular w16 v neon:        3.03x   4.03x   4.65x   6.28x
regular w16 v dotprod:     4.64x   3.75x   4.78x   3.89x
sharp w16 v neon:          2.29x   3.09x   3.44x   5.52x
sharp w16 v dotprod:       4.62x   3.74x   4.77x   3.89x

regular w64 v neon:        2.17x   3.14x   3.19x   4.46x
regular w64 v dotprod:     3.43x   3.00x   3.31x   2.74x
sharp w64 v neon:          1.61x   2.42x   2.34x   3.89x
sharp w64 v dotprod:       3.38x   3.00x   3.29x   2.73x

Files Added:

  • src/arm/64/mc_dotprod.S

Files Modified:

  • src/arm/64/mc.S
  • src/arm/mc.h
  • src/meson.build
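
To illustrate why 4- and 8-tap filters map well onto a 4-element dot product, here is a scalar C model of one 32-bit SDOT lane and an 8-tap filter built from two such steps. This is only a sketch: the function names are hypothetical, and the trick of subtracting 128 from each pixel (so unsigned pixels fit the signed instruction) and adding the bias back afterwards is an assumption for illustration, not something the commit message confirms about mc_dotprod.S.

```c
#include <stdint.h>

/* Scalar model of one 32-bit lane of SDOT: accumulate four
 * signed 8-bit by signed 8-bit products into a 32-bit sum. */
static int32_t sdot4(int32_t acc, const int8_t a[4], const int8_t b[4]) {
    for (int i = 0; i < 4; i++)
        acc += (int32_t)a[i] * b[i];
    return acc;
}

/* 8-tap filter over unsigned 8-bit pixels using two sdot4() steps.
 * Assumption: pixels are moved into signed range by subtracting 128,
 * and the resulting bias (128 * sum of coefficients) is pre-loaded
 * into the accumulator so the final sum is unchanged. */
static int32_t filter8_sdot(const uint8_t px[8], const int8_t coef[8]) {
    int8_t s[8];
    int32_t coef_sum = 0;
    for (int i = 0; i < 8; i++) {
        s[i] = (int8_t)(px[i] - 128);
        coef_sum += coef[i];
    }
    int32_t acc = 128 * coef_sum;
    acc = sdot4(acc, s, coef);         /* taps 0..3 */
    acc = sdot4(acc, s + 4, coef + 4); /* taps 4..7 */
    return acc;
}

/* Straightforward reference for comparison. */
static int32_t filter8_ref(const uint8_t px[8], const int8_t coef[8]) {
    int32_t sum = 0;
    for (int i = 0; i < 8; i++)
        sum += (int32_t)px[i] * coef[i];
    return sum;
}
```

A 6-tap filter leaves two of the eight multiply slots unused in this scheme, which is consistent with the message's remark that only the vertical passes of HV convolutions get 6-tap specialisations.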

dc9490134f8a3665843f287f721dc587e7c48ea2 by Henrik Gramner

https://code.videolan.org/videolan/dav1d/commit/dc9490134f8a3665843f287f721dc587e7c48ea2
Authored: 2024-04-08 22:30:48 +0200
Committed: 2024-04-08 22:51:15 +0200

meson: Enable parallel execution of checkasm in 'meson test'

It was originally disabled due to older meson versions mixing the output
of 'meson test -v' from different tests, which made the log difficult to
read. Newer versions, however, cache the output from each test as it runs
and print it in one contiguous block, so that's no longer an issue.

Files Modified:

  • tests/meson.build

f6e05da0937f7fcbe7ad6ffc0a0fa94ab0059658 by Henrik Gramner

https://code.videolan.org/videolan/dav1d/commit/f6e05da0937f7fcbe7ad6ffc0a0fa94ab0059658
Authored: 2024-04-04 10:55:36 +0200
Committed: 2024-04-08 20:25:59 +0200

cdf: Combine memcpy() calls in dav1d_cdf_thread_copy()

Place multiple default contexts inside a single outer struct so
that copying can be performed in larger blocks.

Files Modified:

  • src/cdf.c
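
The idea of grouping contexts so they can be copied in one block can be sketched as follows. The struct names, sizes, and values are hypothetical and much smaller than the real CDF tables; the point is only that members of one outer struct are laid out together, so a single memcpy() replaces one call per member.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical, tiny stand-ins for two separate default contexts. */
typedef struct { uint16_t cdf[4]; } mode_ctx;
typedef struct { uint16_t cdf[2]; } coef_ctx;

/* Placing both inside one outer struct lets them be copied together. */
typedef struct {
    mode_ctx mode;
    coef_ctx coef;
} ctx_group;

static const ctx_group default_group = {
    .mode = {{ 100, 200, 300, 400 }},
    .coef = {{ 500, 600 }},
};

/* One memcpy() for the whole group instead of one per member. */
static void copy_defaults(ctx_group *dst) {
    memcpy(dst, &default_group, sizeof(*dst));
}
```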

c8add4f8bfb5d105f7af020b3b40e76bfeccb384 by Henrik Gramner

https://code.videolan.org/videolan/dav1d/commit/c8add4f8bfb5d105f7af020b3b40e76bfeccb384
Authored: 2024-04-02 20:45:50 +0200
Committed: 2024-04-08 20:25:59 +0200

cdf: Reduce code size of dav1d_cdf_thread_update()

Reorder CDF arrays so that copying can be performed in larger blocks.

Files Modified:

  • src/cdf.c
  • src/cdf.h

ed24201356e5fdc893e9a346c428bb2c848887fa by Henrik Gramner

https://code.videolan.org/videolan/dav1d/commit/ed24201356e5fdc893e9a346c428bb2c848887fa
Authored: 2024-04-03 13:24:36 +0200
Committed: 2024-04-08 20:25:58 +0200

cdf: Make qcat calculation branchless

Files Modified:

  • src/cdf.c
  • src/cdf.h
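
The commit message does not show the new expression, so the following is only a sketch of what a branchless category calculation can look like. It assumes the quantizer-index thresholds 20, 60 and 120 that the AV1 specification uses to select one of four default coefficient CDF sets; the function names are hypothetical.

```c
/* Branching form: map a quantizer index to one of four categories. */
static int qcat_branchy(int qidx) {
    if (qidx <= 20)  return 0;
    if (qidx <= 60)  return 1;
    if (qidx <= 120) return 2;
    return 3;
}

/* Branchless form: each comparison evaluates to 0 or 1 in C,
 * so summing them counts how many thresholds were exceeded. */
static int qcat_branchless(int qidx) {
    return (qidx > 20) + (qidx > 60) + (qidx > 120);
}
```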

67fcf01bf2a47a9dc17460c7ac1547a311010468 by Henrik Gramner

https://code.videolan.org/videolan/dav1d/commit/67fcf01bf2a47a9dc17460c7ac1547a311010468
Authored: 2024-04-02 20:45:49 +0200
Committed: 2024-04-08 20:25:58 +0200

decode: Simplify read_mv_residual()

Files Modified:

  • src/decode.c

17a2180a61469c9ebda04fc42eb86b0179059c32 by Henrik Gramner

https://code.videolan.org/videolan/dav1d/commit/17a2180a61469c9ebda04fc42eb86b0179059c32
Authored: 2024-04-02 20:45:48 +0200
Committed: 2024-04-08 20:25:58 +0200

cdf: Remove separate intra-only dmv contexts

We can simply use the regular mv contexts for intra frames.

They are mutually exclusive, and the dmv contexts were already
discarded and replaced with default contexts on frame completion.

Files Modified:

  • src/cdf.c
  • src/cdf.h
  • src/decode.c

e2145f5295ce1de024a05adfe113e5252acf58e6 by Henrik Gramner

https://code.videolan.org/videolan/dav1d/commit/e2145f5295ce1de024a05adfe113e5252acf58e6
Authored: 2024-04-02 20:45:47 +0200
Committed: 2024-04-08 20:25:58 +0200

cdf: Skip unnecessary context copying in dav1d_cdf_thread_update()

The intrabc and dmv contexts are never reused between frames.

Files Modified:

  • src/cdf.c

e27b451e2a33bafbf07dffb3f1a58ad57e5d63fa by Henrik Gramner

https://code.videolan.org/videolan/dav1d/commit/e27b451e2a33bafbf07dffb3f1a58ad57e5d63fa
Authored: 2024-03-22 16:37:07 +0100
Committed: 2024-04-04 13:06:12 +0000

cli: Handle SIGINT and SIGTERM more gracefully

Attempt to finish writing the current frame before exiting to avoid
ending up with a partially written frame at the end of the output file.

Only try catching a signal once, falling back to the default behavior
of exiting immediately the second time a given signal is raised.

Files Modified:

  • tools/dav1d.c
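
The "catch once, then fall back to the default" behavior described above can be sketched on POSIX systems with sigaction() and the SA_RESETHAND flag. This is not necessarily how tools/dav1d.c implements it; the flag variable and function names are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Hypothetical flag that a main loop would poll after each frame. */
static volatile sig_atomic_t stop_requested = 0;

static void on_signal(int sig) {
    (void)sig;
    stop_requested = 1; /* only async-signal-safe work here */
}

/* Install a one-shot handler. SA_RESETHAND restores the default
 * disposition when the handler is entered, so a second delivery of
 * the same signal terminates the process immediately. */
static int install_one_shot_handler(int sig) {
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_signal;
    sa.sa_flags = SA_RESETHAND;
    sigemptyset(&sa.sa_mask);
    return sigaction(sig, &sa, NULL);
}
```

A caller would install the handler for both SIGINT and SIGTERM, finish writing the current frame once stop_requested becomes nonzero, and then exit.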

72dfbc075b772129eaabb101c4c453e9395e5c90 by Kyle Siefring

https://code.videolan.org/videolan/dav1d/commit/72dfbc075b772129eaabb101c4c453e9395e5c90
Authored: 2024-03-30 16:54:39 -0400
Committed: 2024-04-03 09:23:21 +0000

ARM64: Improve hi_tok msac

Before:
msac_decode_hi_tok_c: 259.5 ( 1.00x)
msac_decode_hi_tok_neon: 220.7 ( 1.18x)
msac_decode_symbol_adapt4_c: 105.7 ( 1.00x)
msac_decode_symbol_adapt4_neon: 63.3 ( 1.67x)

After:
msac_decode_hi_tok_c: 260.9 ( 1.00x)
msac_decode_hi_tok_neon: 197.9 ( 1.32x)
msac_decode_symbol_adapt4_c: 105.7 ( 1.00x)
msac_decode_symbol_adapt4_neon: 63.3 ( 1.67x)

decode_symbol_adapt4 is not changed, but is included for reference since
decode_hi_tok calls it.

Files Modified:

  • src/arm/64/msac.S

5e31720b8902ec9bcf1f3aaa9a135ee34b58af30 by Martin Storsjö

https://code.videolan.org/videolan/dav1d/commit/5e31720b8902ec9bcf1f3aaa9a135ee34b58af30
Authored: 2024-03-28 11:30:41 +0200
Committed: 2024-04-02 10:35:29 +0000

checkasm: Add support for the private macOS kperf API for benchmarking

On AArch64, the performance counter registers usually are
restricted and not accessible from user space.

On macOS, we currently use mach_absolute_time() as timer on
aarch64. This measures wallclock time but with a very coarse
resolution.

There is a private API, kperf, that one can use for getting
high precision timers though. Unfortunately, it requires running
the checkasm binary as root (e.g. with sudo).

Also, as it is a private, undocumented API, it can potentially
change at any time.

This is handled by adding a new meson build option, for switching
to this timer. If the timer source in checkasm could be changed
at runtime with an option, this wouldn't need to be a build time
option.

This allows getting benchmarks like this:

mc_8tap_regular_w16_hv_8bpc_c: 1522.1 ( 1.00x)
mc_8tap_regular_w16_hv_8bpc_neon: 331.8 ( 4.59x)

Instead of this:

mc_8tap_regular_w16_hv_8bpc_c: 9.0 ( 1.00x)
mc_8tap_regular_w16_hv_8bpc_neon: 1.9 ( 4.76x)

Co-authored-by: J. Dekker <jdek@itanimul.li>

Files Modified:

  • meson.build
  • meson_options.txt
  • tests/checkasm/checkasm.c
  • tests/checkasm/checkasm.h

abc8a1689fbefec880bb3c0064c66afcb1e9d4b9 by Henrik Gramner

https://code.videolan.org/videolan/dav1d/commit/abc8a1689fbefec880bb3c0064c66afcb1e9d4b9
Authored: 2024-03-28 15:58:36 +0100
Committed: 2024-03-28 15:58:36 +0100

lf_mask: Align lvl buffers

Ensures that SIMD stores performed by memset() are aligned.

Files Modified:

  • src/internal.h

119df64b21304c581ba29bb0f1104695b7943150 by Henrik Gramner

https://code.videolan.org/videolan/dav1d/commit/119df64b21304c581ba29bb0f1104695b7943150
Authored: 2024-03-28 15:58:35 +0100
Committed: 2024-03-28 15:58:35 +0100

lf_mask: Use sizeof() in memset() size calculations

Files Modified:

  • src/lf_mask.c

df3dafddc37191507897d7c2df0f71285640e07a by Henrik Gramner

https://code.videolan.org/videolan/dav1d/commit/df3dafddc37191507897d7c2df0f71285640e07a
Authored: 2024-03-28 15:58:34 +0100
Committed: 2024-03-28 15:58:34 +0100

lf_mask: Use a union type for last_delta_lf

On architectures without unaligned load capabilities the compiler will
otherwise load the individual 8-bit values one at a time.

Files Modified:

  • src/decode.c
  • src/internal.h
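
A minimal sketch of the union technique, assuming four 8-bit values; the type and function names are hypothetical and not taken from internal.h. Because the union contains a 32-bit member, it is aligned for 32-bit access, so all four values can be compared or copied with one load or store even on targets without unaligned loads.

```c
#include <stdint.h>

/* Four 8-bit values that can also be accessed as one 32-bit word.
 * The uint32_t member gives the union 32-bit alignment. */
typedef union {
    int8_t   i8[4];
    uint32_t u32;
} delta4;

/* Compare all four values with a single 32-bit comparison. */
static int delta4_equal(const delta4 *a, const delta4 *b) {
    return a->u32 == b->u32;
}

/* Copy all four values with a single 32-bit store. */
static void delta4_copy(delta4 *dst, const delta4 *src) {
    dst->u32 = src->u32;
}
```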

076955a1534bb49325a2252f6a1f494674e5363a by Henrik Gramner

https://code.videolan.org/videolan/dav1d/commit/076955a1534bb49325a2252f6a1f494674e5363a
Authored: 2024-03-28 01:27:48 +0100
Committed: 2024-03-28 01:41:28 +0100

refmvs: Fix buffer overread in save_tmvs() asm

The refmvs_block struct is only 12 bytes large but it's accessed
using 16-byte unaligned loads in asm.

In order to avoid reading past the end of the allocated buffer
we therefore need to pad the allocation size by 4 bytes.

Files Modified:

  • src/refmvs.c
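
The padding arithmetic can be sketched as below, with a hypothetical 12-byte record standing in for refmvs_block and memcpy() standing in for the 16-byte unaligned load done in asm. A 16-byte load that starts at the last 12-byte record reads 4 bytes past it, so the allocation is extended by 16 - 12 = 4 bytes.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a 12-byte block record. */
typedef struct { uint8_t bytes[12]; } block12;

enum { LOAD_WIDTH = 16 }; /* width of the vector load, in bytes */

/* Allocate n records plus tail padding so that a LOAD_WIDTH-byte
 * load starting at the last record stays inside the allocation. */
static block12 *alloc_blocks(size_t n) {
    const size_t pad = LOAD_WIDTH - sizeof(block12); /* 4 bytes */
    return calloc(1, n * sizeof(block12) + pad);
}

/* Scalar model of a 16-byte unaligned load from one record. */
static void load16(uint8_t dst[LOAD_WIDTH], const block12 *src) {
    memcpy(dst, src, LOAD_WIDTH);
}
```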

3d98a242a055438ca76020434a530ebe074fa892 by Henrik Gramner

https://code.videolan.org/videolan/dav1d/commit/3d98a242a055438ca76020434a530ebe074fa892
Authored: 2024-03-22 10:41:48 +0100
Committed: 2024-03-22 11:11:58 +0100

x86: Add 6-tap variants of high bit-depth mc AVX2 functions

Files Modified:

  • src/x86/mc16_avx2.asm

b3323a8ccd9a45393ad5da5fb438b0572fb4df9c by Henrik Gramner

https://code.videolan.org/videolan/dav1d/commit/b3323a8ccd9a45393ad5da5fb438b0572fb4df9c
Authored: 2024-03-22 10:41:45 +0100
Committed: 2024-03-22 10:41:45 +0100

x86: Add minor high bit-depth mc 8-tap AVX2 improvements

Files Modified:

  • src/x86/mc16_avx2.asm

9849ede1304da1443cfb4a86f197765081034205 by Henrik Gramner

https://code.videolan.org/videolan/dav1d/commit/9849ede1304da1443cfb4a86f197765081034205
Authored: 2024-03-18 11:38:02 +0100
Committed: 2024-03-21 12:30:05 +0000

x86: Add 6-tap variants of 8bpc mc AVX2 functions

6-tap filters are sufficient in the majority of cases, and are
quite a bit faster than the equivalent 8-tap filters.

Files Modified:

  • src/x86/ipred_avx2.asm
  • src/x86/mc_avx2.asm

02c2033a1e73eba7e269828123d9e45c0c81998b by Henrik Gramner

https://code.videolan.org/videolan/dav1d/commit/02c2033a1e73eba7e269828123d9e45c0c81998b
Authored: 2024-03-18 11:37:59 +0100
Committed: 2024-03-21 12:30:05 +0000

x86: Add minor 8bpc mc 8-tap AVX2 improvements

Files Modified:

  • src/x86/mc_avx2.asm

The try push is done; we found jobs with unclassified failures.

Needs Close Investigation:

  • No tests were found for flavor 'plain' and the following manifest filters:
    skip_if, run_if, fail_if, subsuite(name=media), tags(['media-engine-compatible']), pathprefix(['dom/media/test/mochitest_background_video.toml', 'dom/media/test/mochitest_bugs.toml', 'dom/media/test/mochitest_eme.toml', 'dom/media/test/mochitest_stream.toml'])

    Make sure the test paths (if any) are spelt correctly and the corresponding
    --flavor and --subsuite are being used. See mach mochitest --help for a
    list of valid flavors.

    • 4 of 4 failed on the same (retriggered) task
      - test-windows11-64-2009-qr/opt-mochitest-media-wmfme (Lc19kPrsTyiM9pxInjHwVg)
      - test-windows11-64-2009-qr/opt-mochitest-media-wmfme (aqZf1dAnRHO-SBaRPA5Tsw)
      - test-windows11-64-2009-qr/opt-mochitest-media-wmfme (b8Zwt1zZTY-IqFe7idtoOg)
      - test-windows11-64-2009-qr/opt-mochitest-media-wmfme (eE9PgfVNT72hZ1JC9iCDxA)

Needs Investigation (Possible Intermittents):

  • test-android-em-7.0-x86_64-qr/debug-isolated-process-geckoview-junit-fis - 1 of 4 failed on the same (retriggered) task (failed: BrxqwKe8Qha4HsI25Go3yw)

These failures could mean that the library update changed something and caused
tests to fail. You'll need to review them yourself and decide where to go from here.

In either event, I have done all I can and you will need to take it from here. If you
don't want to land my patch, you can replicate it locally for editing with
./mach vendor media/libdav1d/moz.yaml

When reviewing, please note that this is external code, which needs a full and
careful inspection - not a rubberstamp.

Assignee: nobody → cchang
Flags: needinfo?(cchang)
Pushed by cchang@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/6a8b2627281a
Update dav1d to 5b5399911dd24703de641d65eda5b7f1e845d060 r=chunmin
Status: NEW → RESOLVED
Closed: 1 month ago
Resolution: --- → FIXED
Target Milestone: --- → 127 Branch
Flags: needinfo?(cchang)