Closed Bug 1293996 Opened 3 years ago Closed 3 years ago

Crash in adapt_probs

Categories

(Core :: Audio/Video: Playback, defect, P1, critical)

46 Branch
x86
All
defect

Tracking

()

RESOLVED FIXED
mozilla52
Tracking Status
firefox48 --- wontfix
firefox49 --- wontfix
firefox50 + verified
firefox51 --- fixed
firefox52 --- fixed

People

(Reporter: philipp, Assigned: jya, NeedInfo)

References

(Blocks 1 open bug)

Details

(Keywords: crash, regression)

Crash Data

Attachments

(2 files)

This bug was filed from the Socorro interface and is 
report bp-864dc8d3-c840-4fb3-bcbe-256292160810.
=============================================================
Crashing Thread (66)
Frame 	Module 	Signature 	Source
0 	mozavcodec.dll 	adapt_probs 	media/ffvpx/libavcodec/vp9.c:3929
1 	mozavcodec.dll 	vp9_decode_frame 	media/ffvpx/libavcodec/vp9.c:4225
2 	mozavcodec.dll 	frame_worker_thread 	media/ffvpx/libavcodec/pthread_frame.c:147
3 	mozavcodec.dll 	win32thread_worker 	media/ffvpx/compat/w32pthreads.h:81
4 	ucrtbase.dll 	_crt_at_quick_exit 	
5 	kernel32.dll 	BaseThreadInitThunk 	
6 	ntdll.dll 	__RtlUserThreadStart 	
7 	ntdll.dll 	_RtlUserThreadStart

this vp9 related crash is regressing in crash stats data since firefox 46 and seems to be increasing in numbers in 49.0b1 again.
many user comments seem to reference playing content on youtube as source of the crash.
Flags: needinfo?(rbultje)
Flags: needinfo?(michael)
Pretty small volume, we won't take any fix as ride along in 48
@jya, @k17e - Can you find an owner for this?  It's flagged as a carryover regression with no owner.  Thanks.
Flags: needinfo?(jyavenard)
Flags: needinfo?(ajones)
if this is a libavcodec/vp9 bug then its probably ronald who knows this best, he is on vacation till end of august IIRC though.
Either way, can this be reproduced somehow with command line ffmpeg?
Flags: needinfo?(michael)
I dont know if it's ever been reproduced. We can provide the memory dump if required on Windows. Easy to track with visual studio then.
Flags: needinfo?(jyavenard)
Flags: needinfo?(ajones)
Priority: -- → P1
(In reply to Jean-Yves Avenard [:jya] from comment #4)
> We can provide the memory dump if required on Windows.

We can't share memory dumps from Socorro outside of Mozilla. We'd need permission from the person submitting the crash.
Assignee: nobody → jyavenard
(99.15% in signature vs 00.53% overall) reason = EXCEPTION_INT_DIVIDE_BY_ZERO
(93.45% in signature vs 19.21% overall) adapter_vendor_id = 0x1002
(60.24% in signature vs 00.77% overall) cpu_info = AuthenticAMD family 18 model 1 stepping 0 | 4
(89.70% in signature vs 45.25% overall) platform_pretty_version = Windows 7
(81.94% in signature vs 38.07% overall) platform_version = 6.1.7601 Service Pack 1
(00.61% in signature vs 34.52% overall) reason = EXCEPTION_ACCESS_VIOLATION_READ
(33.21% in signature vs 00.48% overall) cpu_info = AuthenticAMD family 18 model 1 stepping 0 | 2
(29.21% in signature vs 00.16% overall) adapter_subsys_id = 061d1025
(28.61% in signature vs 00.17% overall) adapter_driver_version = 8.862.3.0
(24.00% in signature vs 00.22% overall) adapter_device_id = 0x964a
(59.27% in signature vs 37.87% overall) bios_manufacturer = American Megatrends Inc.
(04.61% in signature vs 23.11% overall) adapter_vendor_id = 0x10de

It's often (93.45%) the same CPU: AuthenticAMD family 18 model 1 stepping 0 (with different number of cores).
Hi Jean-Yves, this crash is spiking on 49.0.1. Is there a fix in the works? I'd be happy to take an uplift in Beta50.
Flags: needinfo?(jyavenard)
Ronald, got any ideas about this one?
Flags: needinfo?(jyavenard) → needinfo?(rsbultje)
Can you provide a link to the exact source code file used in Firefox 49.0.1 for vp9.c, and can you also provide disassembly for the given function in the 49.0.1 build of Firefox? The source code is pretty obviously correct upstream (it does an explicit check for denominator being zero before doing the division), so maybe there's something funky going on like local patches or local miscompilation? We've had an AMD-specific crash before but that was related to longnops instead of shortnops and usually affected SIMD code.

Looking at Chrome, there's a remarkably similar bug (claiming the same issue - technically, really! - in libvpx as well as ffmpeg) here:
https://bugs.chromium.org/p/chromium/issues/detail?id=599899

Interesting quotes: "I suspect the issue may be related to something I had seen a while back, where crashes happened due to a hardware bug in certain family of AMD processors." (from Yaowu)

It's unclear to me what the patch does exactly, but it seems like it's trying to add sign-extension instructions (adding padding, i.e. specifically to delay the pipeline) hoping that that will workaround the known CPU bug "665 Integer Divide Instruction May Cause Unpredictable Behavior". I suppose we can do that also although I have to admit I'm never a big fan of such "fixes".
The other patch referenced in the bug seem to split prob adaptation in two functions, where the return-if-zero is done before entering the function that does the divide. Since that introduces a call between the branch and the div (instead of call before branch+div), it works around the CPU bug also, and would do so without a slowdown. We can of course do both also.

I suppose it's not easy to get my hands on access (SSH) to one of the affected machines to reproduce the crash and test potential workarounds?
Here's the source code: https://hg.mozilla.org/releases/mozilla-release/annotate/2d931a5eaf8a/media/ffvpx/libavcodec/vp9.c#l3929

And here's the disassembly around the crashing instruction for one of
the crash reports (I've downloaded another one and it's pretty similar):
64f14624 <.data>:
mozavcodec.dll+0xe4624
64f14624:	3b c2                	cmp    %edx,%eax
64f14626:	0f 47 c2             	cmova  %edx,%eax
64f14629:	ba cd cc cc 0c       	mov    $0xccccccd,%edx
64f1462e:	f7 e2                	mul    %edx
64f14630:	0f a4 c2 07          	shld   $0x7,%eax,%edx
64f14634:	8b 44 24 3c          	mov    0x3c(%esp),%eax
64f14638:	2b c8                	sub    %eax,%ecx
64f1463a:	0f af d1             	imul   %ecx,%edx
64f1463d:	83 ea 80             	sub    $0xffffff80,%edx
64f14640:	c1 ea 08             	shr    $0x8,%edx
64f14643:	02 d0                	add    %al,%dl
64f14645:	88 55 01             	mov    %dl,0x1(%ebp)
64f14648:	8b 07                	mov    (%edi),%eax
64f1464a:	2b d8                	sub    %eax,%ebx
64f1464c:	03 d8                	add    %eax,%ebx
64f1464e:	74 54                	je     0x64f146a4
64f14650:	0f b6 4d 02          	movzbl 0x2(%ebp),%ecx
64f14654:	33 d2                	xor    %edx,%edx
64f14656:	89 4c 24 3c          	mov    %ecx,0x3c(%esp)
64f1465a:	8b cb                	mov    %ebx,%ecx
64f1465c:	c1 e0 08             	shl    $0x8,%eax
64f1465f:	d1 e9                	shr    %ecx
64f14661:	03 c1                	add    %ecx,%eax
64f14663:	f7 f3                	div    %ebx
64f14665:	8b c8                	mov    %eax,%ecx
64f14667:	83 f9 01             	cmp    $0x1,%ecx
64f1466a:	7d 05                	jge    0x64f14671
64f1466c:	33 c9                	xor    %ecx,%ecx
64f1466e:	41                   	inc    %ecx
64f1466f:	eb 0a                	jmp    0x64f1467b
64f14671:	b8 ff 00 00 00       	mov    $0xff,%eax
64f14676:	3b c8                	cmp    %eax,%ecx
64f14678:	0f 4f c8             	cmovg  %eax,%ecx
64f1467b:	6a 14                	push   $0x14
64f1467d:	58                   	pop    %eax
64f1467e:	3b d8                	cmp    %eax,%ebx
64f14680:	ba cd cc cc 0c       	mov    $0xccccccd,%edx
64f14685:	0f 47 d8             	cmova  %eax,%ebx
64f14688:	8b c3                	mov    %ebx,%eax
64f1468a:	f7 e2                	mul    %edx
64f1468c:	0f a4 c2 07          	shld   $0x7,%eax,%edx
64f14690:	8b 44 24 3c          	mov    0x3c(%esp),%eax
64f14694:	2b c8                	sub    %eax,%ecx
64f14696:	0f af d1             	imul   %ecx,%edx
64f14699:	83 ea 80             	sub    $0xffffff80,%edx
64f1469c:	c1 ea 08             	shr    $0x8,%edx
64f1469f:	02 d0                	add    %al,%dl
64f146a1:	88 55 02             	mov    %dl,0x2(%ebp)
64f146a4:	8b 5f 10             	mov    0x10(%edi),%ebx
^^^^^^^^^	^^^^^^^^^^^^^^^^^^^^^	^^^^^^^^^^^^^^^^^^^^^^^
64f146a7:	8b 47 0c             	mov    0xc(%edi),%eax
64f146aa:	03 d8                	add    %eax,%ebx
64f146ac:	74 54                	je     0x64f14702
64f146ae:	0f b6 4d 03          	movzbl 0x3(%ebp),%ecx
64f146b2:	33 d2                	xor    %edx,%edx
64f146b4:	89 4c 24 3c          	mov    %ecx,0x3c(%esp)
64f146b8:	8b cb                	mov    %ebx,%ecx
64f146ba:	c1 e0 08             	shl    $0x8,%eax
64f146bd:	d1 e9                	shr    %ecx
64f146bf:	03 c1                	add    %ecx,%eax
64f146c1:	f7 f3                	div    %ebx
64f146c3:	8b c8                	mov    %eax,%ecx
64f146c5:	83 f9 01             	cmp    $0x1,%ecx
64f146c8:	7d 05                	jge    0x64f146cf
64f146ca:	33 c9                	xor    %ecx,%ecx
64f146cc:	41                   	inc    %ecx
64f146cd:	eb 0a                	jmp    0x64f146d9
64f146cf:	b8 ff 00 00 00       	mov    $0xff,%eax
64f146d4:	3b c8                	cmp    %eax,%ecx
64f146d6:	0f 4f c8             	cmovg  %eax,%ecx
64f146d9:	6a 14                	push   $0x14
64f146db:	58                   	pop    %eax
64f146dc:	3b d8                	cmp    %eax,%ebx
64f146de:	ba cd cc cc 0c       	mov    $0xccccccd,%edx
64f146e3:	0f 47 d8             	cmova  %eax,%ebx
64f146e6:	8b c3                	mov    %ebx,%eax
64f146e8:	f7 e2                	mul    %edx
64f146ea:	0f a4 c2 07          	shld   $0x7,%eax,%edx
64f146ee:	8b 44 24 3c          	mov    0x3c(%esp),%eax
64f146f2:	2b c8                	sub    %eax,%ecx
64f146f4:	0f af d1             	imul   %ecx,%edx
64f146f7:	83 ea 80             	sub    $0xffffff80,%edx
64f146fa:	c1 ea 08             	shr    $0x8,%edx
64f146fd:	02 d0                	add    %al,%dl
64f146ff:	88 55 03             	mov    %dl,0x3(%ebp)
64f14702:	8b 47 08             	mov    0x8(%edi),%eax
64f14705:	2b f0                	sub    %eax,%esi
64f14707:	8d 1c 06             	lea    (%esi,%eax,1),%ebx
64f1470a:	85 db                	test   %ebx,%ebx
64f1470c:	74 54                	je     0x64f14762
64f1470e:	0f b6 4d 04          	movzbl 0x4(%ebp),%ecx
64f14712:	33 d2                	xor    %edx,%edx
64f14714:	89 4c 24 3c          	mov    %ecx,0x3c(%esp)
64f14718:	8b cb                	mov    %ebx,%ecx
64f1471a:	c1 e0 08             	shl    $0x8,%eax
64f1471d:	d1 e9                	shr    %ecx
64f1471f:	03 c1                	add    %ecx,%eax
64f14721:	f7 f3                	div    %ebx
64f14723:	8b                   	.byte 0x8b

The values of the registers at the time of the crash:
Thread 37 (crashed)
 0  mozavcodec.dll!adapt_probs [vp9.c:2d931a5eaf8a : 3929 + 0x0]
    eip = 0x64f146a4   esp = 0x123bf714   ebp = 0x00eb776f   ebx = 0x00000000
    esi = 0x00000000   edi = 0x00ebb2d8   eax = 0x00000000   ecx = 0x00000000
    edx = 0x00000086   efl = 0x00010246

(In reply to Ronald S. Bultje from comment #10)
> I suppose it's not easy to get my hands on access (SSH) to one of the
> affected machines to reproduce the crash and test potential workarounds?

Unfortunately yes. We would need to contact by email the people who reported
crashes and ask them.
(In reply to Ronald S. Bultje from comment #9)
> Looking at Chrome, there's a remarkably similar bug (claiming the same issue
> - technically, really! - in libvpx as well as ffmpeg) here:
> https://bugs.chromium.org/p/chromium/issues/detail?id=599899
> 
> Interesting quotes: "I suspect the issue may be related to something I had
> seen a while back, where crashes happened due to a hardware bug in certain
> family of AMD processors." (from Yaowu)

...and the cpu model that was mentioned there (AMD A6-3400M, https://bugs.chromium.org/p/chromium/issues/attachmentText?aid=232024) is the same one that we see failing in our crash stats data according to comment #6 as well.
The "problem" in this disass is that adapt_prob appears inlined, so branch-call-div wouldn't do anything. I'd really like SSH access to an affected machine to test potential fixes. Can we get one from either an affected user or can we contact AMD directly? Maybe AMD (since they have insight in the exact details of this bug) can suggest how we should change the code to work around it?
(In reply to Ronald S. Bultje from comment #9)
> Can you provide a link to the exact source code file used in Firefox 49.0.1
> for vp9.c, and can you also provide disassembly for the given function in

The link in the description gives you exactly the code where the crash occurred. It's using the revision you're interested in..


https://hg.mozilla.org/mozilla-unified/annotate/f21e48559ecd/media/ffvpx/libavcodec/vp9.c#l3929

The ffvpx code included is almost identical to what's found upstream. We have almost zero patches applied. Great care was taken to ensure we made as little modifications as possible so resyncing with upstream will be easy.
Thanks @jya, I indeed just wanted to confirm there were no strange firefox-specific modifications. I'm at this point pretty sure it's the AMD hardware bug... Not 100% sure how to go from there to a fix yet... Do you guys have suggestions or recommendations?
Hi David...

Didn't you identify an AMD bug a while ago and crafted a solution around it? Any suggestions?
Flags: needinfo?(dmajor)
Blocks: 1289666
See Also: → 1034706
(In reply to Jean-Yves Avenard [:jya] from comment #17)
> Hi David...
> 
> Didn't you identify an AMD bug a while ago and crafted a solution around it?
> Any suggestions?

This looks exactly like the issue that Chrome encountered. How about taking their workaround as a starting point: https://chromium-review.googlesource.com/#/c/73570/
Flags: needinfo?(dmajor)
Going by what libvpx did, I propose the attached patch. However, as mentioned, I have no way of testing this patch so I'm not sure how to confirm that this does indeed fix the issue. Can we get in touch with one of the bug submitters and see if they can test a modified binary with this patch included to see if that resolves the crash?
Flags: needinfo?(rsbultje)
Another option is to take the patch and see if it fixes the crash in the wild (we would need to wait for the patch to reach beta or uplift it, since we don't have any report with Aurora or Nightly).
Hello Jean-Yves, as Marco said, could we get stabilize this patch on pre-beta (to catch unexpected fallouts) and then uplift to Beta50 to determine if it fixes the crash or not? Thoughts?
Flags: needinfo?(jyavenard)
I have no problem with the patch, it looks identical to the earlier code, just different order of operations. 

So there's nothing to lose
Flags: needinfo?(jyavenard)
Attachment #8798908 - Flags: review?(jyavenard)
jya, can you take care of landing the patch or helping Ronald to do it?
Dear Firefox programmer,
I had plenty of crashes when whatching vp9 encoded content on youtube, so I followed the bug reports and I ended here. One of my crash report is 852b8ad6-ff1b-4eb7-8f39-4ef152161011.

I'll be happy to test a patched binary, and see if the issue is still ocurring.

Best regards,
Nicolas DEROUINEAU
Thanks Nicolas, this is great. Jya, given the nature of the bug, perhaps we should build Firefox on try with the official configuration and let Nicolas try that instead of a local build.
Flags: needinfo?(jyavenard)
Attachment #8798908 - Flags: review?(jyavenard) → review+
Comment on attachment 8800178 [details]
Bug 1293996: Workaround AMD hardware related crash.

https://reviewboard.mozilla.org/r/85168/#review83740
Attachment #8800178 - Flags: review?(jyavenard) → review+
Flags: needinfo?(jyavenard)
I have installed the nightly, and so far vp9 content decoding doesn't crash. However, I noticed that this bug happened after a while, so I'll keep some content running and keep you updated.
Pushed by jyavenard@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/6154e22334bc
Workaround AMD hardware related crash. r=jya
https://hg.mozilla.org/mozilla-central/rev/6154e22334bc
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla52
[Tracking Requested - why for this release]: It's a low volume crash on Beta, but high volume on release. Fix verified by Nicolas (comment 29), testing with a larger population on Beta will confirm that the crash if fixed for everybody. The risk is pretty low, considering that the patch is just changing the order of operations.
Hello Jean-Yves, should we uplift this patch to Aurora51 and Beta50?
Flags: needinfo?(jyavenard)
Yes please
Flags: needinfo?(jyavenard)
Comment on attachment 8800178 [details]
Bug 1293996: Workaround AMD hardware related crash.

Approval Request Comment
[Feature/regressing bug #]: ffvp9
[User impact if declined]: crashes on some AMD machines 
[Describe test coverage new/current, TreeHerder]: in central. manual check. Lots of Michu tests. 
[Risks and why]: can't think of any. It's just reordering instructions to go around AMD CPU pipeline bug.  Fix is also found in chrome code
[String/UUID change made/needed]: none
Attachment #8800178 - Flags: approval-mozilla-beta?
Attachment #8800178 - Flags: approval-mozilla-aurora?
Comment on attachment 8800178 [details]
Bug 1293996: Workaround AMD hardware related crash.

Crash fix that was verified by at least one person, Aurora51+, Beta50+
Attachment #8800178 - Flags: approval-mozilla-beta?
Attachment #8800178 - Flags: approval-mozilla-beta+
Attachment #8800178 - Flags: approval-mozilla-aurora?
Attachment #8800178 - Flags: approval-mozilla-aurora+
Needinfoing myself so I don't forget to check if this is actually fixed.
Flags: needinfo?(mcastelluccio)
Looks like the crash is gone in 50.0b8!
Flags: needinfo?(mcastelluccio)
Awesome. I've upstreamed the patch to ffmpeg git also. Thanks very much for your help and input, guys!
You need to log in before you can comment on or make changes to this bug.