Closed Bug 1293996 Opened 3 years ago Closed 3 years ago
Crash in adapt
701 bytes, patch
|Details | Diff | Splinter Review|
58 bytes, text/x-review-board-request
This bug was filed from the Socorro interface and is report bp-864dc8d3-c840-4fb3-bcbe-256292160810. ============================================================= Crashing Thread (66) Frame Module Signature Source 0 mozavcodec.dll adapt_probs media/ffvpx/libavcodec/vp9.c:3929 1 mozavcodec.dll vp9_decode_frame media/ffvpx/libavcodec/vp9.c:4225 2 mozavcodec.dll frame_worker_thread media/ffvpx/libavcodec/pthread_frame.c:147 3 mozavcodec.dll win32thread_worker media/ffvpx/compat/w32pthreads.h:81 4 ucrtbase.dll _crt_at_quick_exit 5 kernel32.dll BaseThreadInitThunk 6 ntdll.dll __RtlUserThreadStart 7 ntdll.dll _RtlUserThreadStart this vp9 related crash is regressing in crash stats data since firefox 46 and seems to be increasing in numbers in 49.0b1 again. many user comments seem to reference playing content on youtube as source of the crash.
@jya, @k17e - Can you find an owner for this? It's flagged as a carryover regression with no owner. Thanks.
if this is a libavcodec/vp9 bug then its probably ronald who knows this best, he is on vacation till end of august IIRC though. Either way, can this be reproduced somehow with command line ffmpeg?
I dont know if it's ever been reproduced. We can provide the memory dump if required on Windows. Easy to track with visual studio then.
3 years ago
Priority: -- → P1
(In reply to Jean-Yves Avenard [:jya] from comment #4) > We can provide the memory dump if required on Windows. We can't share memory dumps from Socorro outside of Mozilla. We'd need permission from the person submitting the crash.
3 years ago
Assignee: nobody → jyavenard
(99.15% in signature vs 00.53% overall) reason = EXCEPTION_INT_DIVIDE_BY_ZERO (93.45% in signature vs 19.21% overall) adapter_vendor_id = 0x1002 (60.24% in signature vs 00.77% overall) cpu_info = AuthenticAMD family 18 model 1 stepping 0 | 4 (89.70% in signature vs 45.25% overall) platform_pretty_version = Windows 7 (81.94% in signature vs 38.07% overall) platform_version = 6.1.7601 Service Pack 1 (00.61% in signature vs 34.52% overall) reason = EXCEPTION_ACCESS_VIOLATION_READ (33.21% in signature vs 00.48% overall) cpu_info = AuthenticAMD family 18 model 1 stepping 0 | 2 (29.21% in signature vs 00.16% overall) adapter_subsys_id = 061d1025 (28.61% in signature vs 00.17% overall) adapter_driver_version = 8.862.3.0 (24.00% in signature vs 00.22% overall) adapter_device_id = 0x964a (59.27% in signature vs 37.87% overall) bios_manufacturer = American Megatrends Inc. (04.61% in signature vs 23.11% overall) adapter_vendor_id = 0x10de It's often (93.45%) the same CPU: AuthenticAMD family 18 model 1 stepping 0 (with different number of cores).
Hi Jean-Yves, this crash is spiking on 49.0.1. Is there a fix in the works? I'd be happy to take an uplift in Beta50.
Ronald, got any ideas about this one?
Flags: needinfo?(jyavenard) → needinfo?(rsbultje)
Can you provide a link to the exact source code file used in Firefox 49.0.1 for vp9.c, and can you also provide disassembly for the given function in the 49.0.1 build of Firefox? The source code is pretty obviously correct upstream (it does an explicit check for denominator being zero before doing the division), so maybe there's something funky going on like local patches or local miscompilation? We've had an AMD-specific crash before but that was related to longnops instead of shortnops and usually affected SIMD code. Looking at Chrome, there's a remarkably similar bug (claiming the same issue - technically, really! - in libvpx as well as ffmpeg) here: https://bugs.chromium.org/p/chromium/issues/detail?id=599899 Interesting quotes: "I suspect the issue may be related to something I had seen a while back, where crashes happened due to a hardware bug in certain family of AMD processors." (from Yaowu) It's unclear to me what the patch does exactly, but it seems like it's trying to add sign-extension instructions (adding padding, i.e. specifically to delay the pipeline) hoping that that will workaround the known CPU bug "665 Integer Divide Instruction May Cause Unpredictable Behavior". I suppose we can do that also although I have to admit I'm never a big fan of such "fixes".
The other patch referenced in the bug seem to split prob adaptation in two functions, where the return-if-zero is done before entering the function that does the divide. Since that introduces a call between the branch and the div (instead of call before branch+div), it works around the CPU bug also, and would do so without a slowdown. We can of course do both also. I suppose it's not easy to get my hands on access (SSH) to one of the affected machines to reproduce the crash and test potential workarounds?
Here's the source code: https://hg.mozilla.org/releases/mozilla-release/annotate/2d931a5eaf8a/media/ffvpx/libavcodec/vp9.c#l3929 And here's the disassembly around the crashing instruction for one of the crash reports (I've downloaded another one and it's pretty similar): 64f14624 <.data>: mozavcodec.dll+0xe4624 64f14624: 3b c2 cmp %edx,%eax 64f14626: 0f 47 c2 cmova %edx,%eax 64f14629: ba cd cc cc 0c mov $0xccccccd,%edx 64f1462e: f7 e2 mul %edx 64f14630: 0f a4 c2 07 shld $0x7,%eax,%edx 64f14634: 8b 44 24 3c mov 0x3c(%esp),%eax 64f14638: 2b c8 sub %eax,%ecx 64f1463a: 0f af d1 imul %ecx,%edx 64f1463d: 83 ea 80 sub $0xffffff80,%edx 64f14640: c1 ea 08 shr $0x8,%edx 64f14643: 02 d0 add %al,%dl 64f14645: 88 55 01 mov %dl,0x1(%ebp) 64f14648: 8b 07 mov (%edi),%eax 64f1464a: 2b d8 sub %eax,%ebx 64f1464c: 03 d8 add %eax,%ebx 64f1464e: 74 54 je 0x64f146a4 64f14650: 0f b6 4d 02 movzbl 0x2(%ebp),%ecx 64f14654: 33 d2 xor %edx,%edx 64f14656: 89 4c 24 3c mov %ecx,0x3c(%esp) 64f1465a: 8b cb mov %ebx,%ecx 64f1465c: c1 e0 08 shl $0x8,%eax 64f1465f: d1 e9 shr %ecx 64f14661: 03 c1 add %ecx,%eax 64f14663: f7 f3 div %ebx 64f14665: 8b c8 mov %eax,%ecx 64f14667: 83 f9 01 cmp $0x1,%ecx 64f1466a: 7d 05 jge 0x64f14671 64f1466c: 33 c9 xor %ecx,%ecx 64f1466e: 41 inc %ecx 64f1466f: eb 0a jmp 0x64f1467b 64f14671: b8 ff 00 00 00 mov $0xff,%eax 64f14676: 3b c8 cmp %eax,%ecx 64f14678: 0f 4f c8 cmovg %eax,%ecx 64f1467b: 6a 14 push $0x14 64f1467d: 58 pop %eax 64f1467e: 3b d8 cmp %eax,%ebx 64f14680: ba cd cc cc 0c mov $0xccccccd,%edx 64f14685: 0f 47 d8 cmova %eax,%ebx 64f14688: 8b c3 mov %ebx,%eax 64f1468a: f7 e2 mul %edx 64f1468c: 0f a4 c2 07 shld $0x7,%eax,%edx 64f14690: 8b 44 24 3c mov 0x3c(%esp),%eax 64f14694: 2b c8 sub %eax,%ecx 64f14696: 0f af d1 imul %ecx,%edx 64f14699: 83 ea 80 sub $0xffffff80,%edx 64f1469c: c1 ea 08 shr $0x8,%edx 64f1469f: 02 d0 add %al,%dl 64f146a1: 88 55 02 mov %dl,0x2(%ebp) 64f146a4: 8b 5f 10 mov 0x10(%edi),%ebx ^^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^^^ 64f146a7: 8b 47 0c mov 0xc(%edi),%eax 64f146aa: 03 d8 add %eax,%ebx 64f146ac: 74 54 je 0x64f14702 64f146ae: 0f b6 4d 03 movzbl 0x3(%ebp),%ecx 64f146b2: 33 d2 xor %edx,%edx 64f146b4: 89 4c 24 3c mov %ecx,0x3c(%esp) 64f146b8: 8b cb mov %ebx,%ecx 64f146ba: c1 e0 08 shl $0x8,%eax 64f146bd: d1 e9 shr %ecx 64f146bf: 03 c1 add %ecx,%eax 64f146c1: f7 f3 div %ebx 64f146c3: 8b c8 mov %eax,%ecx 64f146c5: 83 f9 01 cmp $0x1,%ecx 64f146c8: 7d 05 jge 0x64f146cf 64f146ca: 33 c9 xor %ecx,%ecx 64f146cc: 41 inc %ecx 64f146cd: eb 0a jmp 0x64f146d9 64f146cf: b8 ff 00 00 00 mov $0xff,%eax 64f146d4: 3b c8 cmp %eax,%ecx 64f146d6: 0f 4f c8 cmovg %eax,%ecx 64f146d9: 6a 14 push $0x14 64f146db: 58 pop %eax 64f146dc: 3b d8 cmp %eax,%ebx 64f146de: ba cd cc cc 0c mov $0xccccccd,%edx 64f146e3: 0f 47 d8 cmova %eax,%ebx 64f146e6: 8b c3 mov %ebx,%eax 64f146e8: f7 e2 mul %edx 64f146ea: 0f a4 c2 07 shld $0x7,%eax,%edx 64f146ee: 8b 44 24 3c mov 0x3c(%esp),%eax 64f146f2: 2b c8 sub %eax,%ecx 64f146f4: 0f af d1 imul %ecx,%edx 64f146f7: 83 ea 80 sub $0xffffff80,%edx 64f146fa: c1 ea 08 shr $0x8,%edx 64f146fd: 02 d0 add %al,%dl 64f146ff: 88 55 03 mov %dl,0x3(%ebp) 64f14702: 8b 47 08 mov 0x8(%edi),%eax 64f14705: 2b f0 sub %eax,%esi 64f14707: 8d 1c 06 lea (%esi,%eax,1),%ebx 64f1470a: 85 db test %ebx,%ebx 64f1470c: 74 54 je 0x64f14762 64f1470e: 0f b6 4d 04 movzbl 0x4(%ebp),%ecx 64f14712: 33 d2 xor %edx,%edx 64f14714: 89 4c 24 3c mov %ecx,0x3c(%esp) 64f14718: 8b cb mov %ebx,%ecx 64f1471a: c1 e0 08 shl $0x8,%eax 64f1471d: d1 e9 shr %ecx 64f1471f: 03 c1 add %ecx,%eax 64f14721: f7 f3 div %ebx 64f14723: 8b .byte 0x8b The values of the registers at the time of the crash: Thread 37 (crashed) 0 mozavcodec.dll!adapt_probs [vp9.c:2d931a5eaf8a : 3929 + 0x0] eip = 0x64f146a4 esp = 0x123bf714 ebp = 0x00eb776f ebx = 0x00000000 esi = 0x00000000 edi = 0x00ebb2d8 eax = 0x00000000 ecx = 0x00000000 edx = 0x00000086 efl = 0x00010246 (In reply to Ronald S. Bultje from comment #10) > I suppose it's not easy to get my hands on access (SSH) to one of the > affected machines to reproduce the crash and test potential workarounds? Unfortunately yes. We would need to contact by email the people who reported crashes and ask them.
(In reply to Ronald S. Bultje from comment #9) > Looking at Chrome, there's a remarkably similar bug (claiming the same issue > - technically, really! - in libvpx as well as ffmpeg) here: > https://bugs.chromium.org/p/chromium/issues/detail?id=599899 > > Interesting quotes: "I suspect the issue may be related to something I had > seen a while back, where crashes happened due to a hardware bug in certain > family of AMD processors." (from Yaowu) ...and the cpu model that was mentioned there (AMD A6-3400M, https://bugs.chromium.org/p/chromium/issues/attachmentText?aid=232024) is the same one that we see failing in our crash stats data according to comment #6 as well.
The "problem" in this disass is that adapt_prob appears inlined, so branch-call-div wouldn't do anything. I'd really like SSH access to an affected machine to test potential fixes. Can we get one from either an affected user or can we contact AMD directly? Maybe AMD (since they have insight in the exact details of this bug) can suggest how we should change the code to work around it?
(In reply to Ronald S. Bultje from comment #9) > Can you provide a link to the exact source code file used in Firefox 49.0.1 > for vp9.c, and can you also provide disassembly for the given function in The link in the description gives you exactly the code where the crash occurred. It's using the revision you're interested in.. https://hg.mozilla.org/mozilla-unified/annotate/f21e48559ecd/media/ffvpx/libavcodec/vp9.c#l3929 The ffvpx code included is almost identical to what's found upstream. We have almost zero patches applied. Great care was taken to ensure we made as little modifications as possible so resyncing with upstream will be easy.
Thanks @jya, I indeed just wanted to confirm there were no strange firefox-specific modifications. I'm at this point pretty sure it's the AMD hardware bug... Not 100% sure how to go from there to a fix yet... Do you guys have suggestions or recommendations?
This is the only patch applied to our tree: https://dxr.mozilla.org/mozilla-central/source/media/ffvpx/changes.patch
Hi David... Didn't you identify an AMD bug a while ago and crafted a solution around it? Any suggestions?
(In reply to Jean-Yves Avenard [:jya] from comment #17) > Hi David... > > Didn't you identify an AMD bug a while ago and crafted a solution around it? > Any suggestions? This looks exactly like the issue that Chrome encountered. How about taking their workaround as a starting point: https://chromium-review.googlesource.com/#/c/73570/
Going by what libvpx did, I propose the attached patch. However, as mentioned, I have no way of testing this patch so I'm not sure how to confirm that this does indeed fix the issue. Can we get in touch with one of the bug submitters and see if they can test a modified binary with this patch included to see if that resolves the crash?
Another option is to take the patch and see if it fixes the crash in the wild (we would need to wait for the patch to reach beta or uplift it, since we don't have any report with Aurora or Nightly).
Hello Jean-Yves, as Marco said, could we get stabilize this patch on pre-beta (to catch unexpected fallouts) and then uplift to Beta50 to determine if it fixes the crash or not? Thoughts?
I have no problem with the patch, it looks identical to the earlier code, just different order of operations. So there's nothing to lose
jya, can you take care of landing the patch or helping Ronald to do it?
Dear Firefox programmer, I had plenty of crashes when whatching vp9 encoded content on youtube, so I followed the bug reports and I ended here. One of my crash report is 852b8ad6-ff1b-4eb7-8f39-4ef152161011. I'll be happy to test a patched binary, and see if the issue is still ocurring. Best regards, Nicolas DEROUINEAU
Thanks Nicolas, this is great. Jya, given the nature of the bug, perhaps we should build Firefox on try with the official configuration and let Nicolas try that instead of a local build.
Attachment #8798908 - Flags: review?(jyavenard) → review+
Comment on attachment 8800178 [details] Bug 1293996: Workaround AMD hardware related crash. https://reviewboard.mozilla.org/r/85168/#review83740
Attachment #8800178 - Flags: review?(jyavenard) → review+
Windows 32 bits installer: https://email@example.com/try-win32/firefox-52.0a1.en-US.win32.installer.exe Windws 64 bits installer: https://firstname.lastname@example.org/try-win64/firefox-52.0a1.en-US.win64.installer.exe Thank you for testing !
Flags: needinfo?(rbultje) → needinfo?(nicodero72)
I have installed the nightly, and so far vp9 content decoding doesn't crash. However, I noticed that this bug happened after a while, so I'll keep some content running and keep you updated.
Pushed by email@example.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/6154e22334bc Workaround AMD hardware related crash. r=jya
[Tracking Requested - why for this release]: It's a low volume crash on Beta, but high volume on release. Fix verified by Nicolas (comment 29), testing with a larger population on Beta will confirm that the crash if fixed for everybody. The risk is pretty low, considering that the patch is just changing the order of operations.
Hello Jean-Yves, should we uplift this patch to Aurora51 and Beta50?
Comment on attachment 8800178 [details] Bug 1293996: Workaround AMD hardware related crash. Approval Request Comment [Feature/regressing bug #]: ffvp9 [User impact if declined]: crashes on some AMD machines [Describe test coverage new/current, TreeHerder]: in central. manual check. Lots of Michu tests. [Risks and why]: can't think of any. It's just reordering instructions to go around AMD CPU pipeline bug. Fix is also found in chrome code [String/UUID change made/needed]: none
Comment on attachment 8800178 [details] Bug 1293996: Workaround AMD hardware related crash. Crash fix that was verified by at least one person, Aurora51+, Beta50+
Needinfoing myself so I don't forget to check if this is actually fixed.
Looks like the crash is gone in 50.0b8!
Awesome. I've upstreamed the patch to ffmpeg git also. Thanks very much for your help and input, guys!
You need to log in before you can comment on or make changes to this bug.