Update dav1d to new version b129d9f2cb897cedba77a60bd5e3621c14ee5484 from 2025-01-02 15:30:21
Categories
(Core :: Audio/Video: Playback, enhancement)
Tracking
()
| | Tracking | Status |
|---|---|---|
| firefox136 | --- | fixed |
People
(Reporter: update-bot, Assigned: chunmin)
Details
(Whiteboard: [3pl-filed][task_id: DYcpeTJvTkq9L35y26OXUA])
Attachments
(1 file)
This update covers 11 commits. Here are the overall diff statistics, and then the commit information.
media/libdav1d/moz.yaml | 4 +-
media/libdav1d/vcs_version.h | 2 +-
third_party/dav1d/src/arm/32/looprestoration.S | 580 ++++++++++----------
third_party/dav1d/src/arm/32/looprestoration16.S | 626 +++++++++++-----------
third_party/dav1d/src/arm/looprestoration.h | 173 +++++-
third_party/dav1d/src/cpu.c | 1 +
third_party/dav1d/src/decode.c | 14 +-
third_party/dav1d/src/lib.c | 4 +-
third_party/dav1d/src/looprestoration_tmpl.c | 435 +++++++++++----
third_party/dav1d/src/mc_tmpl.c | 223 ++++---
third_party/dav1d/src/obu.c | 3 +-
11 files changed, 1201 insertions(+), 864 deletions(-)
b129d9f2cb897cedba77a60bd5e3621c14ee5484 by Martin Storsjö <martin@martin.st>
https://code.videolan.org/videolan/dav1d/commit/b129d9f2cb897cedba77a60bd5e3621c14ee5484
Authored: 2024-12-24 21:52:15 +0200
Committed: 2025-01-02 15:30:21 +0000
mc: Reduce stack use in {put,prep}scaled{bilin,8tap}
For the bilin cases, this seems to make things marginally faster
(measured on x86_64; 7-25% faster with compiler autovectorization).
For 8tap, it doesn't make much of a difference at all.
Before:                                       GCC     Clang
mc_scaled_8tap_regular_w128_8bpc_c:      115155.5   98549.3
mc_scaled_8tap_regular_w128_8bpc_ssse3:   17936.0   18411.1
mc_scaled_bilinear_w128_8bpc_c:           40290.0   51812.9
mc_scaled_bilinear_w128_8bpc_ssse3:       18243.9   18177.0
After:
mc_scaled_8tap_regular_w128_8bpc_c:      116304.3   99453.2
mc_scaled_8tap_regular_w128_8bpc_ssse3:   18387.0   18077.3
mc_scaled_bilinear_w128_8bpc_c:           37381.4   41145.0
mc_scaled_bilinear_w128_8bpc_ssse3:       18423.8   18031.6
(Benchmarked with the seed 0; the total runtime for the scaled
benchmarks is significantly affected by the random seed.)
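The "7-25% faster" figure for the bilin C code can be checked against the numbers above. The sketch below assumes the benchmark values are time-like (lower is better) and that "faster" means the ratio of the old value to the new one; the helper name speedup_percent is made up for this illustration and is not part of dav1d or checkasm.

```c
/* Percentage speedup when a time-like measurement drops from
 * `before` to `after` (lower is better). Hypothetical helper. */
double speedup_percent(double before, double after) {
    return (before / after - 1.0) * 100.0;
}
```

With the mc_scaled_bilinear_w128_8bpc_c rows, speedup_percent(40290.0, 37381.4) comes out at roughly 8% (GCC) and speedup_percent(51812.9, 41145.0) at roughly 26% (Clang), which is consistent with the range quoted in the commit message.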
This reduces the stack usage of these functions from around 65 KB
each, to less than 1 KB for bilin, and around 2 KB for 8tap.
With this in place, the required stack space for dav1d should
be mostly identical across configurations; on x86_64 (both with
and without assembly), it can run with 62 KB of stack, and
on arm and aarch64, it can run with 58 KB of stack.
Files Modified:
- src/mc_tmpl.c
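As a rough illustration of how a two-pass filter can shed a large stack buffer, the sketch below compares a variant that stores every intermediate row with one that keeps only the rows the vertical step needs. This is not the dav1d code: the 2-tap filters, the function names and the MAX_W/MAX_H limits are all assumptions chosen to keep the example short, and the actual change in src/mc_tmpl.c may be structured differently.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_W 128 /* widest block handled by this sketch (assumption) */
#define MAX_H 128 /* tallest block handled by this sketch (assumption) */

/* Stand-in horizontal pass: sum of two neighbouring pixels. */
static void filter_h(int16_t *out, const uint8_t *src, int w) {
    for (int x = 0; x < w; x++)
        out[x] = (int16_t)(src[x] + src[x + 1]);
}

/* Large-buffer variant: all h + 1 intermediate rows live on the stack. */
void filter_hv_full(uint8_t *dst, ptrdiff_t dst_stride,
                    const uint8_t *src, ptrdiff_t src_stride, int w, int h) {
    int16_t mid[(MAX_H + 1) * MAX_W]; /* about 32 KB */
    for (int y = 0; y <= h; y++)
        filter_h(&mid[y * MAX_W], src + y * src_stride, w);
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++)
            dst[y * dst_stride + x] = (uint8_t)
                ((mid[y * MAX_W + x] + mid[(y + 1) * MAX_W + x] + 2) >> 2);
}

/* Rolling variant: only the two rows the vertical step reads are kept. */
void filter_hv_rolling(uint8_t *dst, ptrdiff_t dst_stride,
                       const uint8_t *src, ptrdiff_t src_stride, int w, int h) {
    int16_t rows[2][MAX_W]; /* 512 bytes */
    int16_t *prev = rows[0], *cur = rows[1];
    filter_h(prev, src, w);
    for (int y = 0; y < h; y++) {
        filter_h(cur, src + (y + 1) * src_stride, w);
        for (int x = 0; x < w; x++)
            dst[y * dst_stride + x] = (uint8_t)((prev[x] + cur[x] + 2) >> 2);
        /* The current row becomes the previous row for the next output line. */
        int16_t *tmp = prev;
        prev = cur;
        cur = tmp;
    }
}
```

Both functions read w + 1 columns and h + 1 rows of source and produce the same output; only the amount of stack held during the call differs.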
cd5bfa124a8c3c4c41e033253a291c387aba0eb0 by Brad Smith <brad@comstyle.com>
https://code.videolan.org/videolan/dav1d/commit/cd5bfa124a8c3c4c41e033253a291c387aba0eb0
Authored: 2024-12-26 04:09:33 -0500
Committed: 2024-12-29 18:32:23 +0000
riscv: Fix building on non-Linux OS's
CLOCK_MONOTONIC_RAW is not POSIX/portable.
Files Modified:
- tests/checkasm/checkasm.h
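One common way to cope with CLOCK_MONOTONIC_RAW being unavailable is to fall back to the POSIX CLOCK_MONOTONIC at compile time. The sketch below shows that pattern only; BENCH_CLOCK and bench_now_ns are hypothetical names, and the actual change in tests/checkasm/checkasm.h may look different.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L /* expose clock_gettime() in strict C modes */
#endif
#include <stdint.h>
#include <time.h>

/* Prefer the raw monotonic clock where the platform defines it,
 * otherwise use the portable POSIX monotonic clock. */
#ifdef CLOCK_MONOTONIC_RAW
#define BENCH_CLOCK CLOCK_MONOTONIC_RAW
#else
#define BENCH_CLOCK CLOCK_MONOTONIC
#endif

/* Current timestamp in nanoseconds (hypothetical helper). */
uint64_t bench_now_ns(void) {
    struct timespec ts;
    clock_gettime(BENCH_CLOCK, &ts);
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
}
```

The check relies on the clock ID being a preprocessor macro, which is how common libcs expose it; a platform that exposes it some other way would need a different test.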
5ea4939a1dde36ebc24fc5a026eeacaa74f366a4 by James Almer <jamrial@gmail.com>
https://code.videolan.org/videolan/dav1d/commit/5ea4939a1dde36ebc24fc5a026eeacaa74f366a4
Authored: 2024-12-27 13:48:54 -0300
Committed: 2024-12-27 13:48:54 -0300
obu: don't print warnings for Metadata OBUs of types "Unregistered user private"
Files Modified:
- src/obu.c
2ba57aa535896bcc8c450bbf7d0958791e38ec78 by Martin Storsjö <martin@martin.st>
https://code.videolan.org/videolan/dav1d/commit/2ba57aa535896bcc8c450bbf7d0958791e38ec78
Authored: 2024-12-16 12:58:00 +0200
Committed: 2024-12-20 14:32:32 +0200
arm32: looprestoration: Rewrite the wiener functions
Switch to the same cache-friendly algorithm as was done for arm64
in 2e73051c57a1b2c28c46f72f9edec62f299ebac5 and for the reference
C code in 8291a66e50f2a1f5fcfa8615379d31ff15626991.
Contrary to the arm64 implementation, this uses a main loop in C
(very similar to the one in the main C implementation in
8291a66e50f2a1f5fcfa8615379d31ff15626991) rather than assembly;
this gives a bit more overhead on the call to each function, but
it shouldn't affect the big picture much.
Performance-wise, this doesn't make much of a difference: it makes
things a little bit faster on some cores, and a little bit slower
on others:
Before:                 Cortex A7        A8       A53       A72       A73
wiener_7tap_8bpc_neon:   269384.4  147730.7  140028.5   92662.5   92929.0
wiener_7tap_10bpc_neon:  352690.2  159970.2  169427.8  116614.9  119371.1
After:
wiener_7tap_8bpc_neon:   238328.0  157274.1  134588.6   92200.3   97619.6
wiener_7tap_10bpc_neon:  336369.3  162182.0  161954.4  125521.2  130634.0
This is mostly in line with the results on arm64 in
2e73051c57a1b2c28c46f72f9edec62f299ebac5. On arm64, there was a
bit larger speedup for the 7tap case, mostly attributed to
unrolling the vertical filter (and the new filter_hv function) to
operate on 16 pixels at a time. On arm32, there's not enough
registers to do that, so we can't get such gains from unrolling.
(Reducing the unrolling on the arm64 version to match the case
on arm32 also shows similar performance numbers as on arm32 here.)
In the arm64 version, we also added separate 5tap versions of all
functions; not doing that for arm32 at this point.
This increases the binary size by 2 KB.
This doesn't have any immediate effect on how much stack space
dav1d requires in total, since the largest stack users on arm
currently are the 8tap_scaled functions.
Files Modified:
- src/arm/32/looprestoration.S
- src/arm/32/looprestoration16.S
- src/arm/looprestoration.h
8291a66e50f2a1f5fcfa8615379d31ff15626991 by Martin Storsjö <martin@martin.st>
https://code.videolan.org/videolan/dav1d/commit/8291a66e50f2a1f5fcfa8615379d31ff15626991
Authored: 2024-12-12 16:05:51 +0200
Committed: 2024-12-19 14:19:19 +0200
looprestoration: Use only 6 row buffer for wiener, like NEON/x86
This uses a separate function for combined horizontal and vertical
filtering, without needing to write the intermediate results
back to memory in between.
This mostly serves as an example for how to adjust the logic for
that case; unless we actually merge the horizontal and vertical
filtering within the _hv function, we still need space for a
7th row on the stack within that function (which means we use just
as much stack as before), but we also need one extra memcpy to
write it into the right destination.
In a build where the compiler is allowed to vectorize and inline
the wiener functions into each other, this change actually reduces
the final binary size by 4 KB, if the C version of the wiener filter
is retained.
This change makes the vectorized C code as fast as it was before
with Clang 18; on Xcode Clang 16, it's 2x slower than it was before.
Unfortunately, with GCC, this change makes the code a bit slower
again.
Files Modified:
- src/looprestoration_tmpl.c
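The 6-row buffer scheme described in this commit can be pictured with a small sketch: six horizontally filtered rows are kept behind rotating pointers, the seventh row is produced into a temporary, and after the vertical step it is copied over the oldest row. Everything below is illustrative and assumed rather than taken from dav1d: the function names, the ROW_W limit, the clamped 3-tap horizontal stand-in, and the unweighted 7-row sum used in place of the real Wiener coefficients.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ROW_W 64 /* maximum row width in this sketch (assumption) */
#define TAPS 7   /* vertical filter length */

/* Stand-in horizontal filter: 3-tap sum with clamped edges. */
static void row_h(int32_t *out, const uint8_t *src, int w) {
    for (int x = 0; x < w; x++) {
        const int l = x > 0 ? x - 1 : 0, r = x < w - 1 ? x + 1 : w - 1;
        out[x] = src[l] + src[x] + src[r];
    }
}

/* Reference: recompute all 7 filtered rows for every output row. */
void filter_rows_direct(int32_t *dst, const uint8_t *src, int w, int h_out) {
    int32_t row[ROW_W];
    for (int y = 0; y < h_out; y++) {
        for (int x = 0; x < w; x++) dst[y * w + x] = 0;
        for (int k = 0; k < TAPS; k++) {
            row_h(row, src + (y + k) * w, w);
            for (int x = 0; x < w; x++) dst[y * w + x] += row[x];
        }
    }
}

/* Rolling: 6 buffered rows plus one temporary row for the newest input. */
void filter_rows_rolling(int32_t *dst, const uint8_t *src, int w, int h_out) {
    int32_t buf[TAPS - 1][ROW_W], newest[ROW_W];
    int32_t *ptrs[TAPS - 1];
    for (int i = 0; i < TAPS - 1; i++) {
        ptrs[i] = buf[i];
        row_h(ptrs[i], src + i * w, w);
    }
    for (int y = 0; y < h_out; y++) {
        row_h(newest, src + (y + TAPS - 1) * w, w);
        for (int x = 0; x < w; x++) {
            int32_t sum = newest[x];
            for (int i = 0; i < TAPS - 1; i++) sum += ptrs[i][x];
            dst[y * w + x] = sum;
        }
        /* Rotate: the oldest row's storage receives the newest row. */
        int32_t *oldest = ptrs[0];
        memmove(ptrs, ptrs + 1, (TAPS - 2) * sizeof(ptrs[0]));
        memcpy(oldest, newest, (size_t)w * sizeof(newest[0]));
        ptrs[TAPS - 2] = oldest;
    }
}
```

Both functions expect h_out + 6 input rows of width w (w at most ROW_W) and give identical results. The memcpy in the rolling version corresponds to the "one extra memcpy" the commit message mentions.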
a149f5c3c098cc8f78298fbad57702779d44b0e3 by Martin Storsjö <martin@martin.st>
https://code.videolan.org/videolan/dav1d/commit/a149f5c3c098cc8f78298fbad57702779d44b0e3
Authored: 2024-12-14 00:38:37 +0200
Committed: 2024-12-19 14:19:19 +0200
looprestoration: Make the C wiener h filter more optimizable for the compiler
This increases the binary size by 9 KB, on aarch64 with Xcode Clang 16,
if the C version of the filter is retained (which it isn't
by default).
This makes the vectorized C code roughly as fast as it was before
the rewrite on GCC; on Clang it also becomes 1.3x-2.0x faster,
while still being slower than it was initially.
Files Modified:
- src/looprestoration_tmpl.c
9da303e989c6d17bd97152ff490e9ab00c739f1f by Martin Storsjö <martin@martin.st>
https://code.videolan.org/videolan/dav1d/commit/9da303e989c6d17bd97152ff490e9ab00c739f1f
Authored: 2024-12-12 16:05:51 +0200
Committed: 2024-12-19 14:19:13 +0200
looprestoration: Rewrite the C version of the wiener filter
This reduces the stack usage of these functions (the C version)
significantly.
These C versions aren't used on architectures that already have
wiener filters implemented in assembly, but they matter both when
running with assembly disabled (e.g. for sanitizer builds)
and as an example of how to do a cache-efficient SIMD
implementation.
This roughly matches how these functions are implemented in the
aarch64 assembly (although that assembly function uses a mainloop
function written in assembly, and custom calling conventions
between the functions).
With this in place, dav1d can run with around 76 KB of stack
with assembly disabled.
This increases the binary size by around 14 KB (in the case of
aarch64 with Xcode Clang 16), unless built with (the default)
-Dtrim_dsp=true. (By default, the C version of the wiener filter
gets skipped entirely.)
On 32 bit arm, the assembly wiener function implementation still
uses large buffers on the stack though, but due to other functions
using less stack there, dav1d can still run with 72 KB of stack
there.
Unfortunately, this change also makes the functions slower, depending
on how well the compiler was able to optimize the previous version.
On GCC (which didn't manage to vectorize the functions so well before),
it becomes 1.6x-2.0x slower, while it gets 2.5x-5x slower on Clang
(where it was very well vectorized before).
Most of this performance can be gained back with later changes on
top, though.
Files Modified:
- src/looprestoration_tmpl.c
d242c47b437c950b545e96e7872aa914edc50be5 by Luc Trudeau <ltrudeau@twoorioles.com>
https://code.videolan.org/videolan/dav1d/commit/d242c47b437c950b545e96e7872aa914edc50be5
Authored: 2024-12-02 09:32:33 -0500
Committed: 2024-12-02 09:32:33 -0500
Replace Av1Block with pal_sz in read_pal_indices
Files Modified:
- src/decode.c
9a75cebc36e04d41b9f18a412209c2e364e0fc8b by Henrik Gramner <gramner@twoorioles.com>
https://code.videolan.org/videolan/dav1d/commit/9a75cebc36e04d41b9f18a412209c2e364e0fc8b
Authored: 2024-12-02 13:40:42 +0100
Committed: 2024-12-02 13:47:04 +0100
Explicitly use uint8_t for the order_palette() scratch buffer
It previously used 'pixel' which is typedefed to uint8_t in files
that aren't bitdepth-templated, but those are indices and not
pixels so that was just confusing and misleading.
Files Modified:
- src/decode.c
575af2585915eb15007bd09f2ec4b8ef3e0e051d by victorien <victorien@videolan.org>
https://code.videolan.org/videolan/dav1d/commit/575af2585915eb15007bd09f2ec4b8ef3e0e051d
Authored: 2024-11-28 17:39:32 +0100
Committed: 2024-11-28 17:56:13 +0100
flush: Reset f->task_thread.error
f->task_thread.error can be set during flushing. Not resetting it can
lead to c->task_thread.first being increased after a frame has already been
submitted post-flush. That's fine if it happens on the very first frame,
but if it happens on any subsequent frame it results in wrong frame
ordering.
Once a non-first frame is treated as the first one, its tasks won't be
able to execute (since they depend on a truly previous frame that is now
considered to come after it), and c->task_thread.cur will be increased past
that frame with no way of it being reset, eventually leading to a hang.
Files Modified:
- src/lib.c
767efeca0621ef7ecdfb8a83afdce54c86ed23fd by Wan-Teh Chang <wtc@google.com>
https://code.videolan.org/videolan/dav1d/commit/767efeca0621ef7ecdfb8a83afdce54c86ed23fd
Authored: 2024-11-26 14:26:25 +0000
Committed: 2024-11-26 14:26:25 +0000
Fix ClangTidy misc-include-cleaner warnings
Files Modified:
- src/cpu.c
- src/looprestoration_tmpl.c
Comment 1 • 1 month ago (Reporter)
DYcpeTJvTkq9L35y26OXUA
I've submitted a try run for this commit: https://treeherder.mozilla.org/jobs?repo=try&revision=0fe361f6f63254f78b94cca0fa73aa1ad166cbad
Comment 2 • 1 month ago (Reporter)
Updated • 1 month ago
Comment 3 • 1 month ago (Reporter)
JGU58EyDS6mXj60GEjkYAw
All jobs completed, we found the following issues.
Known Issues:
- browser/components/migration/tests/browser/browser_do_migration.js
  - test-linux1804-64-qr/opt-mochitest-browser-chrome-swr-a11y-checks-6 (OOhgyQe4Tqy-P18IRWFq2g)
- browser/components/migration/tests/browser/browser_extension_migration.js
  - test-linux1804-64-qr/opt-mochitest-browser-chrome-swr-a11y-checks-6 (OOhgyQe4Tqy-P18IRWFq2g)
- browser/components/sessionstore/test/browser_async_window_flushing.js
  - test-linux1804-64-qr/opt-mochitest-browser-chrome-spi-nw-8 (XDT_rCZFQAasTklHB41yKg)
- dom/base/test/browser_bug1303838.js
  - test-linux1804-64-qr/opt-mochitest-browser-chrome-spi-nw-4 (Z3EGcmv-RKGXRyfV1LpLxw)
- gfx/layers/apz/test/mochitest/browser.toml
  - test-linux1804-64-qr/opt-mochitest-browser-chrome-swr-a11y-checks-3 (GV8cdQMiTgGBN4njRFyNsg)
- gfx/layers/apz/test/mochitest/browser_test_position_sticky.js
  - test-linux1804-64-qr/opt-mochitest-browser-chrome-swr-a11y-checks-3 (GV8cdQMiTgGBN4njRFyNsg)
- toolkit/components/normandy/test/browser/browser_actions_PreferenceExperimentAction.js
  - test-linux1804-64-qr/opt-mochitest-browser-chrome-swr-a11y-checks-8 (R_XoaieeSfCR02M01p-jxg)
- toolkit/mozapps/extensions/test/browser/browser_html_detail_view.js
  - test-linux1804-64-qr/opt-mochitest-browser-chrome-swr-a11y-checks-7 (BfPFfmf3RhiPGz9MQJH7MQ)
- toolkit/mozapps/extensions/test/browser/browser_permission_prompt_userScripts.js
  - test-linux1804-64-qr/opt-mochitest-browser-chrome-swr-a11y-checks-7 (BfPFfmf3RhiPGz9MQJH7MQ)
These failures may mean that the library update succeeded; you'll need to review
them yourself and decide. If there are lint failures, you will need to fix them in
a follow-up patch. (Or ignore the patch I made, and recreate it yourself with
./mach vendor media/libdav1d/moz.yaml.)
In either event, I have done all I can, so you will need to take it from here.
When reviewing, please note that this is external code, which needs a full and
careful inspection - not a rubberstamp.
Comment 5 • 1 month ago
bugherder