Closed
Bug 1352894
Opened 8 years ago
Closed 3 years ago
Crash in ff_vp9_loop_filter_v_16_16_sse2
Categories
(Core :: Audio/Video: Playback, defect, P3)
Tracking
RESOLVED
WORKSFORME
| | Tracking | Status |
|---|---|---|
| firefox55 | --- | affected |
People
(Reporter: n.nethercote, Unassigned)
Details
(Keywords: crash, stale-bug, Whiteboard: qa-not-actionable)
Crash Data
This bug was filed from the Socorro interface and is
report bp-09052897-8633-4191-a8c8-1f93d2170331.
=============================================================
There have been 71 occurrences of this crash in the past 7 days on Nightly, across 6 installations, and smaller numbers on other channels going back to FF47.
jya, any ideas?
Flags: needinfo?(jyavenard)
Comment 1•8 years ago
If you search on crash-stats for ff_vp9_, there are at least 50 similar signatures, like ff_vp9_loop_filter_v_88_16_sse2, ff_vp9_idct_idct_32x32_add_ssse3, ff_vp9_idct_idct_32x32_add_avx, ff_vp9_loop_filter_h_16_16_sse2, etc.
Comment 2•8 years ago
Ronald, any ideas?
Is this a known issue that got fixed upstream and that we could have missed during our code integration?
Flags: needinfo?(jyavenard) → needinfo?(rsbultje)
Comment 3•8 years ago
Do you know what video was being watched (e.g. do we have access to the file, or can we contact the user) while the crash occurred? I'll review the code around the loop filter, but having a complete backtrace (with a "disass" around the assembly crash site) or file access would be very helpful.
Flags: needinfo?(rsbultje)
Comment 4•8 years ago
Unfortunately, that particular crash report has no URL attached, and due to privacy concerns I wouldn't be able to provide much information anyway...
Since it's VP9, there's a good chance that the website is YouTube. The few reports that do have a URL attached are indeed from YouTube.
In the past 7 days, there have been 194 crashes; 53% are on Windows 7 and the rest on Windows XP.
Two public URLs that caused the crash:
https://www.youtube.com/watch?v=HrAnOqztv5w
https://www.youtube.com/watch?v=hA6VrZbv8Ck
Processors involved appear to always be either:
GenuineIntel family 15 model 4 stepping 9 | 2
or:
GenuineIntel family 15 model 4 stepping 3 | 2
so Pentium 4 (who can still be using that in this day and age!??)
All on 32-bit Firefox, over 50% on Intel G41 Express graphics.
Comment 5•8 years ago
I think the reason you see a P4 associated with it is that the crash is in an SSE2 function that has an SSSE3 counterpart. Anyone with a newer CPU would not see a crash in the SSE2 function, but either in the SSSE3 function (if it's a higher-up bug) or not at all (if the bug is specific to the SSE2 code). I see 192 crashes with the SSE2 version and 27 with the SSSE3 counterpart of the same function. This suggests that it's not the specific SSE2 function that has a bug, but rather something higher-level (loopfilter template, loopfilter memory, ...).
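To illustrate the dispatch argument, here is a minimal sketch of run-time function selection. It is not FFmpeg's actual init code: the flag constants, `pick_loop_filter`, and the three stub filters are hypothetical stand-ins. It only shows why a CPU without SSSE3 ends up in the `_sse2` variant, while a newer CPU never calls it.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical CPU feature bits, stand-ins for real capability flags. */
enum {
    CPU_SSE2  = 1 << 0,
    CPU_SSSE3 = 1 << 1,
};

typedef void (*loop_filter_fn)(uint8_t *dst, ptrdiff_t stride);

/* Stubs standing in for the C fallback and the hand-written assembly.
 * Each one leaves a distinct marker so the chosen variant is observable. */
static void loop_filter_c(uint8_t *dst, ptrdiff_t stride)     { (void)stride; dst[0] = 0; }
static void loop_filter_sse2(uint8_t *dst, ptrdiff_t stride)  { (void)stride; dst[0] = 2; }
static void loop_filter_ssse3(uint8_t *dst, ptrdiff_t stride) { (void)stride; dst[0] = 3; }

/* Start from the portable version and let each newer instruction set
 * override the previous choice, so the best supported variant wins. */
static loop_filter_fn pick_loop_filter(int cpu_flags)
{
    loop_filter_fn fn = loop_filter_c;
    if (cpu_flags & CPU_SSE2)
        fn = loop_filter_sse2;
    if (cpu_flags & CPU_SSSE3)
        fn = loop_filter_ssse3;
    return fn;
}
```

With this kind of selection, `pick_loop_filter(CPU_SSE2)` returns the SSE2 stub and `pick_loop_filter(CPU_SSE2 | CPU_SSSE3)` returns the SSSE3 stub, so a bug in shared higher-level code would show up under whichever signature the user's CPU selects.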
For both SSE2 and SSSE3, I clicked on a few raw dumps, and it indeed seems they're all in 32bit code. Would I be able to conclude that this means the bug is likely 32bit-specific? Also, looking at the raw dumps, there is an offset in the first frame of the crashing thread; can we somehow link that to a specific instruction in the binary (disassembly)?
I've downloaded the first video (HrAnOqztv5w) using youtube-dl at all VP9 resolutions (id=242, 243, 244, 247, 248, 271, 278, 313) and played them in a 32bit ffmpeg restricted to SSE2 and built with address sanitizer, and everything worked fine:
$ ls *.webm
Flawless FULL COVERAGE Foundation Routine-HrAnOqztv5w.242.webm
Flawless FULL COVERAGE Foundation Routine-HrAnOqztv5w.243.webm
Flawless FULL COVERAGE Foundation Routine-HrAnOqztv5w.244.webm
Flawless FULL COVERAGE Foundation Routine-HrAnOqztv5w.247.webm
Flawless FULL COVERAGE Foundation Routine-HrAnOqztv5w.248.webm
Flawless FULL COVERAGE Foundation Routine-HrAnOqztv5w.271.webm
Flawless FULL COVERAGE Foundation Routine-HrAnOqztv5w.278.webm
Flawless FULL COVERAGE Foundation Routine-HrAnOqztv5w.313.webm
$ for n in *.webm; do ./ffmpeg -i "${n}" -f null -v error -nostats -; done
$ git diff
diff --git a/libavutil/cpu.c b/libavutil/cpu.c
index 16e0c92..20d81db 100644
--- a/libavutil/cpu.c
+++ b/libavutil/cpu.c
@@ -93,7 +93,8 @@ int av_get_cpu_flags(void)
         flags = get_cpu_flags();
         atomic_store_explicit(&cpu_flags, flags, memory_order_relaxed);
     }
-    return flags;
+    return flags & (AV_CPU_FLAG_MMX | AV_CPU_FLAG_MMXEXT |
+                    AV_CPU_FLAG_SSE | AV_CPU_FLAG_SSE2);
 }
 
 void av_set_cpu_flags_mask(int mask)
$ grep address config.mak
CFLAGS=-m32 -std=c11 -mdynamic-no-pic -fomit-frame-pointer -pthread -g -Wdeclaration-after-statement -Wall -Wdisabled-optimization -Wpointer-arith -Wredundant-decls -Wwrite-strings -Wtype-limits -Wundef -Wmissing-prototypes -Wno-pointer-to-int-cast -Wstrict-prototypes -Wempty-body -Wno-parentheses -Wno-switch -Wno-format-zero-length -Wno-pointer-sign -Wno-unused-const-variable -O0 -fsanitize=address -fno-math-errno -fno-signed-zeros -mstack-alignment=16 -Qunused-arguments -Werror=implicit-function-declaration -Werror=missing-prototypes -Werror=return-type
LDFLAGS=-g -fsanitize=address -Wl,-dynamic,-search_paths_first -Qunused-arguments
$
I realize there are some issues with this test: it's a 64bit system running a 32bit binary (the fact that the bug occurs only in 32bit binaries and has a far higher crash count in SSE2 than in SSSE3 functions makes me believe, by distribution, that the affected systems were 32bit as well, not 64bit systems running 32bit binaries); asan doesn't cover assembly (valgrind does, I believe, but unfortunately valgrind doesn't work on macOS Sierra); and I'm on a Mac, not Windows. However, things like memory management inside ffvp9 do not really differ by architecture or system.
So, some questions that go more to the higher level (where I'm suspecting the bug may lie):
- can you reproduce the crash using a 32bit build on the videos above?
- how do you guys allocate memory for AVFrame data[] planes? Do you use a custom callback or do you let FFmpeg allocate buffers internally? Assuming you're using a custom implementation, do you have a link to the code for that? Does it provide the same characteristics as avcodec_default_get_buffer2() in terms of plane/line padding, line/buffer alignment, etc.? If you remove the custom callback, and if you could reproduce the crash earlier, did it go away after removing the custom callback?
- how is av_malloc() implemented in your (32bit Windows) build? Do you know if symbols like HAVE_POSIX_MEMALIGN, HAVE_ALIGNED_MALLOC or HAVE_MEMALIGN are available for that target platform? (I'm assuming that HAVE_ALIGNED_MALLOC is 1 and the rest is 0.)
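To make the alignment questions above concrete, here is a minimal sketch of a manually aligned allocator plus an alignment check. It is not FFmpeg's `av_malloc()`: `aligned_alloc_manual`, `aligned_free_manual`, and `is_aligned` are hypothetical helpers that illustrate what a build has to do itself when no aligned allocator is available on the platform.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Returns nonzero when p is a multiple of align (align: power of two). */
static int is_aligned(const void *p, size_t align)
{
    return ((uintptr_t)p & (align - 1)) == 0;
}

/* Over-allocate, round the address up, and stash the pointer returned by
 * malloc just below the aligned block so it can be freed later. */
static void *aligned_alloc_manual(size_t size, size_t align)
{
    assert(align >= sizeof(void *) && (align & (align - 1)) == 0);
    if (size > SIZE_MAX - align - sizeof(void *))
        return NULL;                      /* request would overflow */
    unsigned char *raw = malloc(size + align + sizeof(void *));
    if (!raw)
        return NULL;
    uintptr_t start   = (uintptr_t)raw + sizeof(void *);
    uintptr_t aligned = (start + align - 1) & ~(uintptr_t)(align - 1);
    ((void **)aligned)[-1] = raw;         /* remember what malloc returned */
    return (void *)aligned;
}

/* Must only be given pointers obtained from aligned_alloc_manual(). */
static void aligned_free_manual(void *p)
{
    if (p)
        free(((void **)p)[-1]);
}
```

A check such as `is_aligned(frame_plane, 16)` on each plane pointer and line size (names assumed here) would be one cheap way to confirm whether a given build hands sufficiently aligned buffers to the SIMD code.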
Flags: needinfo?(jyavenard)
Comment 6•8 years ago
The config file used for the win32 build can be found there:
https://dxr.mozilla.org/mozilla-central/source/media/ffvpx/config_win32.h
config.h was indeed produced on a 64-bit machine running the Visual Studio SDK in 32-bit mode, using an FFmpeg checkout of the same version as what's being resynced. It's then manually copied into our own tree.
The macros /HAVE_(MALLOC_H|ARC4RANDOM|LOCALTIME_R|MEMALIGN|POSIX_MEMALIGN)/ are set by the Mozilla build system. I'm not sure what those would be here; :glandium will know.
:glandium, what would those be on a Windows 32-bit build?
Flags: needinfo?(jyavenard) → needinfo?(mh+mozilla)
Comment 7•8 years ago
1- I haven't. I don't have a 32-bit-only machine available these days...
2- For the AVFrame: if you're referring to the AVFrame used internally, we let FFmpeg manage the memory internally (we used to make use of callbacks but got rid of that over a year ago).
If you're referring to the AVFrame passed to avcodec_decode_video2, where the result will be copied, then the allocation of that one is done there:
https://dxr.mozilla.org/mozilla-central/source/dom/media/platforms/ffmpeg/FFmpegDataDecoder.cpp#154
Comment 8•8 years ago
(In reply to Jean-Yves Avenard [:jya] from comment #6)
> The config file used for the win32 build can be found there:
> https://dxr.mozilla.org/mozilla-central/source/media/ffvpx/config_win32.h
>
> config.h was indeed produced on a 64 bits machine running Visual Studio SDK
> in 32 bits mode, using a FFmpeg checkout of the same version as what's being
> resynced. It's then manually copied into our own tree.
>
> The macro: /HAVE_(MALLOC_H|ARC4RANDOM|LOCALTIME_R|MEMALIGN|POSIX_MEMALIGN)
> are as set by Mozilla build system. I'm not sure on what those would be
> here. :glandium will know
>
> :glandium what would those be on windows 32 build?
You can check for yourself in the configure logs for Windows 32-bit builds, e.g. https://archive.mozilla.org/pub/firefox/nightly/2017/04/2017-04-06-03-02-06-mozilla-central/mozilla-central-win32-nightly-bm91-build1-build3.txt.gz for the last nightly:
03:12:53 INFO - checking for malloc.h... yes
03:13:07 INFO - checking for memalign... no
03:13:07 INFO - checking for posix_memalign... no
arc4random and localtime_r are not checked on windows at all (the tests are skipped entirely), so the defines are not set.
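As a rough illustration of how such log lines relate to the `HAVE_*` names asked about, here is a small sketch that turns one `checking for ...` line into a macro name and a yes/no value. The function `configure_line_to_macro` and the naming rule (uppercase, non-alphanumerics replaced by `_`) are assumptions made for this example, not the build system's actual code; a value of 0 here simply means the define would not be set.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Turn "checking for malloc.h... yes" into name "HAVE_MALLOC_H", value 1.
 * Returns 0 on success, -1 if the line is not a yes/no check or the
 * name buffer is too small. Any timestamp prefix is skipped. */
static int configure_line_to_macro(const char *line, char *name,
                                   size_t name_size, int *value)
{
    static const char prefix[] = "checking for ";
    const char *start = strstr(line, prefix);
    if (!start)
        return -1;
    start += sizeof(prefix) - 1;

    const char *dots = strstr(start, "... ");
    if (!dots)
        return -1;

    const char *result = dots + 4;
    if (strncmp(result, "yes", 3) == 0)
        *value = 1;
    else if (strncmp(result, "no", 2) == 0)
        *value = 0;
    else
        return -1;

    size_t len = (size_t)(dots - start);
    if (len + sizeof("HAVE_") > name_size)   /* sizeof counts the NUL */
        return -1;
    memcpy(name, "HAVE_", 5);
    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char)start[i];
        name[5 + i] = isalnum(c) ? (char)toupper(c) : '_';
    }
    name[5 + len] = '\0';
    return 0;
}
```

Fed the three log lines above, this yields `HAVE_MALLOC_H` = 1, `HAVE_MEMALIGN` = 0, and `HAVE_POSIX_MEMALIGN` = 0.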
Flags: needinfo?(mh+mozilla)
Comment 9•8 years ago
We've tried to do some tests with ffmpeg developers on this feature. On 32bit Mac with tsan, the whole thing is clear. The part where this is specific to windows/32bit makes me suspicious that it may be related to the manual alignment feature, but it's hard to know that for sure. In the logs that I looked at, the stack pointer was always 16-byte aligned.
Would it be possible for you guys to run a representative firefox or ffmpeg build on such a file on a 32bit windows machine under tsan, valgrind (I don't know if either of those makes sense), drmemory or something similar? I'm hoping for some new insights that I can't get right now because I lack the right combination of tools, machine, etc.
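For readers unfamiliar with what manual alignment means for local buffers, here is a minimal sketch of the general technique, assuming a 16-byte requirement for aligned SSE2 accesses. `align_up`, `scratch_is_aligned`, and `LOCAL_ALIGN` are hypothetical names for this example and are not taken from FFmpeg.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LOCAL_ALIGN 16  /* assumed requirement of aligned SSE2 loads/stores */

/* Round p up to the next multiple of align (align: power of two). */
static uint8_t *align_up(uint8_t *p, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    uintptr_t addr = (uintptr_t)p;
    size_t pad = (align - (size_t)(addr & (align - 1))) & (align - 1);
    return p + pad;
}

/* Carve an aligned 64-byte scratch area out of an over-sized local array
 * instead of trusting the alignment of the incoming stack pointer.
 * Returns 1 when the scratch area is in bounds and correctly aligned. */
static int scratch_is_aligned(void)
{
    uint8_t raw[64 + LOCAL_ALIGN - 1];
    uint8_t *scratch = align_up(raw, LOCAL_ALIGN);
    int in_bounds = scratch >= raw && scratch + 64 <= raw + sizeof(raw);
    return in_bounds && ((uintptr_t)scratch % LOCAL_ALIGN) == 0;
}
```

If code of this kind computed the padding incorrectly, or skipped it because the compiler was assumed to align the stack already, an aligned load on the resulting pointer could fault only on some platform and ABI combinations, which would fit a crash limited to 32-bit Windows.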
Updated•8 years ago
Priority: -- → P1
Comment 10•8 years ago
This is a P1 bug without an assignee.
P1 are bugs which are being worked on for the current release cycle/iteration/sprint.
If the bug is not assigned by Monday, 28 August, the bug's priority will be reset to '--'.
Keywords: stale-bug
Comment 11•8 years ago
Mass change P1->P2 to align with new Mozilla triage process
Priority: P1 → P2
Comment 12•6 years ago
Moving to p3 because no activity for at least 1 year(s).
See https://github.com/mozilla/bug-handling/blob/master/policy/triage-bugzilla.md#how-do-you-triage for more information
Priority: P2 → P3
Updated•4 years ago
Whiteboard: qa-not-actionable
Comment 13•3 years ago
This is still getting the occasional report, e.g. https://crash-stats.mozilla.org/report/index/404b18c2-cd73-4823-8f71-a35b70211129
Comment 14•3 years ago
Closing because no crashes reported for 12 weeks.
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → WORKSFORME