Closed Bug 1137614 Opened 5 years ago Closed 5 years ago

crash in vp8_diamond_search_sadx4

Categories

(Core :: WebRTC, defect, critical)

All
Windows NT
defect
Not set
critical

Tracking

()

VERIFIED FIXED
mozilla40
Tracking Status
firefox37 - wontfix
firefox38 + verified
firefox39 --- verified
firefox40 --- verified

People

(Reporter: adalucinet, Assigned: dmajor)

References

Details

(Keywords: crash)

Crash Data

Attachments

(2 files, 1 obsolete file)

This bug was filed from the Socorro interface and is 
report bp-08a96f74-e008-465c-9b97-8f9c22150226.
=============================================================

More reports: https://crash-stats.mozilla.com/signature/?signature=vp8_diamond_search_sadx4&_columns=date&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=reason&_columns=address&page=1

Steps to reproduce:
1. Launch any RTL build (Eg: ar build)
2. Start a Hello conversation.
3. Share any app (Eg: Chrome).
4. Focus another app and wait a couple of seconds.

Notes:
1. Reproducible *only* with RTL builds.
Can this be retested in a ASAN build?  Presumption is that something is corrupting memory in RTL mode
(In reply to Randell Jesup [:jesup] from comment #1)
> Can this be retested in a ASAN build?  Presumption is that something is
> corrupting memory in RTL mode

Could you please offer some guidance in order to retest? Couldn't find any RTL asan builds via ftp.
Flags: needinfo?(rjesup)
I experienced a related issue on Windows 7 using Nightly 39.0a1 (non RTL build)
Crash reports:
https://crash-stats.mozilla.com/report/index/ac2cf255-f826-476f-9032-f7cbb2150317
https://crash-stats.mozilla.com/report/index/5b9e05cb-1d8f-423d-8ada-5e6e02150317

This can be reproduced fairly easily although there does not seem to be a sequence to follow, seems to happen when playing with tab and window sharing.
Ralph, tim - this has been showing up recently when sharing windows/tabs in Hello.  Pretty consistently hits this location, all 0 or -1 addresses (all windows, but that might be an artifact).  I'm thinking it's not the codec, but such a consistent failure location in the codec may imply it is a codec bug.
Flags: needinfo?(tterribe)
Flags: needinfo?(rjesup)
Flags: needinfo?(giles)
I reproduced using tab or window sharing:

1 Link clicker joins
2 Link generator joins
3 Link generator shares a window (Microsoft Outlook)
4 Window appears on the link clicker side
5 5 seconds later the link generator Firefox crashes

Crash report: https://crash-stats.mozilla.com/report/index/f3af3a3f-511f-41d6-9f1c-d4b7f2150317


1 Link clicker joins
2 Link generator joins
3 Link generator shares his tabs
4 Tabs appear on the link clicker side
5 20 seconds later the link generator Firefox crashes

Crash report: https://crash-stats.mozilla.com/report/index/ebbe8b70-4bec-4240-80f6-33ae52150317
It's certainly possible, quite a few bugs have been fixed upstream since we last pulled their code. I've been working on a libvpx update.
Flags: needinfo?(giles)
[Tracking Requested - why for this release]: This issue is occurring quite often since screensharing was introduced in Loop/ Hello. This libvpx update is something we'll want to uplift to 38.
Libvpx update is bug 1148639
Depends on: 1148639
[Tracking Requested - why for this release]:

Ralph -- Does it make sense to make this bug about investigating a "cherry-pick" fix for this crash from the libvpx update on bug 1148639?  Can you look into doing that, or would it make sense for someone else to do that? Thanks.

Note: I've added tracking-fx37 to this bug because Firefox 37 is when this bug was first seen/likely introduced.  Also, this bug appears to affect Win 64bit builds only.
(In reply to Maire Reavy [:mreavy] (Plz needinfo me - on PTO March 28 to April 5) from comment #10)
> [Tracking Requested - why for this release]:
> 
> Ralph -- Does it make sense to make this bug about investigating a
> "cherry-pick" fix for this crash from the libvpx update on bug 1148639?  Can
> you look into doing that, or would it make sense for someone else to do
> that? Thanks.
> 
> Note: I've added tracking-fx37 to this bug because Firefox 37 is when this
> bug was first seen/likely introduced.  Also, this bug appears to affect Win
> 64bit builds only.

Adding a needinfo to Ralph for the questions above ^^ in case (like me!), he's behind on bugmail. :-)
Flags: needinfo?(giles)
FYI. looking at the crash:

Where it's crashing (IF there's no inlining), it must be x being null - but x is used right before calling this function.  Also the 0xffffffff address wouldn't make much sense.

And inlining wouldn't affect it either, as that would just deref mv, and mv is &mvp_full in the code that calls diamond_search_sadx4()

Also, looking at upstream for mcomp.c, pickinter.c and findnearmv.h (and their issue tracker) doesn't show any indication of fixes in this area or bugs reported.

This all makes me concerned about a win64 compiler issue, or perhaps a wild pointer (but a wild pointer or memory corruption would be very unlikely to hit this sort of fixed address reliably).  

The last chance would be some sort of processor-specific issue (AMD only, or Intel-only, or something finer-grained).  

Do the crashes have the actual processor info, or just the build target info?
Flags: needinfo?(kairo)
(In reply to Randell Jesup|PTO until Apr 6 [:jesup] from comment #12)
> Do the crashes have the actual processor info, or just the build target info?

We have some CPU info: https://crash-stats.mozilla.com/search/?product=Firefox&signature=%3Dvp8_diamond_search_sadx4&_facets=cpu_info&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-cpu_info

> This all makes me concerned about a win64 compiler issue

Surely possible given we haven't been doing Win64 for a long time yet.

David might be able to help investigate if he has time. He has a lot of crash experience.
Flags: needinfo?(kairo) → needinfo?(dmajor)
If there's nothing to backport, and it's specific to win64, we could disable asm for that file on win64?
Flags: needinfo?(giles)
(In reply to Ralph Giles (:rillian) from comment #14)
> If there's nothing to backport, and it's specific to win64, we could disable
> asm for that file on win64?

I don't see any ASM in that file, nor in those routines
I am experiencing this every time a make a Hello call on nightly. Do you need me to direct-attach and debug this somehow?
tracking for 38. I will let the 37 tracking flag to make sure Lawrence sees it.
  1212:     vp8_clamp_mv(ref_mv, x->mv_col_min, x->mv_col_max, x->mv_row_min, x->mv_row_max);
000007FED6E65DA1  mov         r10d,dword ptr [r14+237Ch]  
000007FED6E65DA8  mov         qword ptr [rsp+60h],rax  
000007FED6E65DAD  mov         rax,qword ptr [rbp+97h]  
000007FED6E65DB4  movaps      xmm0,xmmword ptr [r14+2338h]   <-- crash is here
000007FED6E65DBC  movups      xmmword ptr [rbp-59h],xmm0  
000007FED6E65DC0  mov         qword ptr [rbp-49h],rax  

RAX = 0000000051D66348 RBX = 0000000000000000 RCX = 0000000051D64020 RDX = 0000000051D64768 RSI = 0000000000000002 RDI = 0000000051D64020 R8  = 0000000051D658E0 R9  = 00000000608AEDCC R10 = 00000000000000FF R11 = 000000005A28CCB0 R12 = 0000000000000000 R13 = 00000000000002E0 R14 = 0000000051D64020 R15 = 0000000051D65910 RIP = 000007FED6E65DB4 RSP = 00000000608AEC20 RBP = 00000000608AED11 EFL = 00010204 

XMM0 = 000000000000000000000000000002D0 
XMM1 = 74747474747474747474747474747474 
XMM2 = 00000000000000190000000000000006 
XMM3 = 74747474747474747474747474747474 
XMM4 = 000000000000004B000000000000000A 
XMM5 = 74747474747474747474747474747474 
XMM6 = 00000000000000000000000000000000 
XMM7 = 00000000000000000000000000000000 
XMM00 = +1.0E-042#DEN      XMM01 = +0.00000E+000      XMM02 = +0.00000E+000      XMM03 = +0.00000E+000      XMM10 = +7.74708E+031      XMM11 = +7.74708E+031      XMM12 = +7.74708E+031      XMM13 = +7.74708E+031      XMM20 = +8.4E-045#DEN      XMM21 = +0.00000E+000      XMM22 = +3.5E-044#DEN      XMM23 = +0.00000E+000      XMM30 = +7.74708E+031      XMM31 = +7.74708E+031      XMM32 = +7.74708E+031      XMM33 = +7.74708E+031      XMM40 = +1.4E-044#DEN      XMM41 = +0.00000E+000      XMM42 = +1.1E-043#DEN      XMM43 = +0.00000E+000      XMM50 = +7.74708E+031      XMM51 = +7.74708E+031      XMM52 = +7.74708E+031      XMM53 = +7.74708E+031      XMM60 = +0.00000E+000      XMM61 = +0.00000E+000      XMM62 = +0.00000E+000      XMM63 = +0.00000E+000      XMM70 = +0.00000E+000      XMM71 = +0.00000E+000      XMM72 = +0.00000E+000      XMM73 = +0.00000E+000      MXCSR = 00001FA0
Socorro (and even windbg's !analyze) claim that the crash address is 0, but that's not the case, judging by the registers.

I want to say this is an alignment issue. The crash is on a "movaps" which wants 16-byte alignment, but the addresses I've looked at all end in a hex digit 8.

From the 0x2338 offset it seems to be the "mvsadcost" field.
> 0:101> dt x
> Local var @ 0x3bdfefa0 Type macroblock*
> [...]
>   +0x2338 mvsadcost        : [2] ???? 
>   +0x2348 mbmode_cost      : ???? 
 The two array elements (mvsadcost[0] and mvsadcost[1]) are together 16 bytes, so I guess the compiler is trying to copy them both in one go:
>    mvsadcost[0] = x->mvsadcost[0];
>    mvsadcost[1] = x->mvsadcost[1];

As they are two int*, it seems OK for them to be individually only 8-byte aligned. So this might be a compiler issue. Let's try nudging the code until the aligned copy goes away. Maybe add a dummy field in between?
Flags: needinfo?(dmajor)
Attached patch alignment hack (obsolete) — Splinter Review
What do you think of this?

An alternative might be to change the function not to copy both fields right after one another, but I'm not convinced the compiler won't try to do something clever anyway.
Attachment #8587701 - Flags: review?(giles)
Comment on attachment 8587701 [details] [diff] [review]
alignment hack

Review of attachment 8587701 [details] [diff] [review]:
-----------------------------------------------------------------

Looks safe to try anyway. NB for libvpx changes people include a copy of the patch in tree for any changes you make, and add it to the list applied in update.py. That way it doesn't get clobbered when we pull new upstream code. Please make that change and let me see the patch again.
Attachment #8587701 - Flags: review?(giles)
bsmedberg: can you try the patch as you have a repeatable crash?  Thanks
Flags: needinfo?(tterribe) → needinfo?(benjamin)
Do you have a try build or do I have to do a win64 build myself?
Flags: needinfo?(benjamin) → needinfo?(rjesup)
https://treeherder.mozilla.org/#/jobs?repo=try&revision=e8d0dacf8c5b
(just pushed, it will take a little while)
Flags: needinfo?(rjesup)
(In reply to Maire Reavy [:mreavy] (Plz needinfo me - on PTO March 28 to April 5) from comment #10)
> Note: I've added tracking-fx37 to this bug because Firefox 37 is when this
> bug was first seen/likely introduced.  Also, this bug appears to affect Win
> 64bit builds only.

Given that this bug only affects Win64 builds and we didn't ship these with 37, I'm not tracking for 37.
Build's ready: https://treeherder.mozilla.org/#/jobs?repo=try&revision=e8d0dacf8c5b
Flags: needinfo?(benjamin)
A few minutes with this try build and I didn't crash. That's not conclusive proof but it's looking good.
Flags: needinfo?(benjamin)
Attached patch alignment hackSplinter Review
Tested the patch command on a separate tree
Attachment #8587701 - Attachment is obsolete: true
Attachment #8588135 - Flags: review?(giles)
Comment on attachment 8588135 [details] [diff] [review]
alignment hack

Review of attachment 8588135 [details] [diff] [review]:
-----------------------------------------------------------------

Yep, thanks!
Attachment #8588135 - Flags: review?(giles) → review+
As soon as we merge (or before!), lets get this nominated for Aurora and Beta, thanks!
37 -> wontfix since we don't support win64 for 37
Flags: needinfo?(dmajor)
https://hg.mozilla.org/mozilla-central/rev/7a1eb59f83a5
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla40
David, can we have the uplift to beta and aurora? Thanks
Comment on attachment 8588135 [details] [diff] [review]
alignment hack

Approval Request Comment
[Feature/regressing bug #]: Compiler
[User impact if declined]: crashes on Win64 with Hello
[Describe test coverage new/current, TreeHerder]: green on m-c
[Risks and why]: Should be low risk, just rearranging a data structure
[String/UUID change made/needed]: none
Flags: needinfo?(dmajor)
Attachment #8588135 - Flags: approval-mozilla-beta?
Attachment #8588135 - Flags: approval-mozilla-aurora?
Assignee: nobody → dmajor
Comment on attachment 8588135 [details] [diff] [review]
alignment hack

Should be in 38 beta 3.
Attachment #8588135 - Flags: approval-mozilla-beta?
Attachment #8588135 - Flags: approval-mozilla-beta+
Attachment #8588135 - Flags: approval-mozilla-aurora?
Attachment #8588135 - Flags: approval-mozilla-aurora+
I still crash latest 40.0a1 ar build (from 2015-04-07) with str from comment 0:
- x32 build: 
-- bp-2bfabee0-abfb-45d2-959c-8ca6f2150408
-- bp-8ab7610a-9630-4255-a13f-3e1cc2150408
- x64 build: bp-9c7a95a9-7a3d-41c9-a498-292042150408

I don't think that the old and new crash are related (based on threads). David, any ideas?
Flags: needinfo?(dmajor)
All three of those are crashes in your nvidia driver. I would file a new bug and needinfo :milan to take a look.
Flags: needinfo?(dmajor)
I have not been able to reproduce on today's nightly (40.0a1 2015-04-08) using the Win64 build with E10s disabled whereas I could reproduce pretty consistently before.
I also checked crash-stats and there are no hits on the 04-07 build.
Flags: qe-verify+
A libvpx developer thought this was an include order issue in the way we're building the library.

RT, could you please try reproducing with this build try build? It has the alignment hack disabled and builds all of libvpx in non-unified mode.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=f4a57e6a1786
Flags: needinfo?(rtestard)
(In reply to Ralph Giles (:rillian) from comment #43)
> A libvpx developer thought this was an include order issue in the way we're
> building the library.
> 
> RT, could you please try reproducing with this build try build? It has the
> alignment hack disabled and builds all of libvpx in non-unified mode.
> 
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=f4a57e6a1786

Hi, sure but I'm away on travel this week without access to the computer I could reproduce this on so will only be able to do it early next week.
Flags: needinfo?(rtestard)
I built the change in comment 43 and the generated code is still doing a movaps on offset 0x2338. Unless there is some indirect factor at play, I think that build will still crash. Will be interested to hear the results of actual testing.
Thanks for testing, David. Pinging myself to follow up next week.
Flags: needinfo?(giles)
Depends on: 1155792
cpeterson: are you sure you want that Depends-on here???
Flags: needinfo?(cpeterson)
Sorry. A bad bisection lead me to suspect this bug caused bug 1155792.
No longer depends on: 1155792
Flags: needinfo?(cpeterson)
(In reply to Ralph Giles (:rillian) from comment #43)
> A libvpx developer thought this was an include order issue in the way we're
> building the library.
> 
> RT, could you please try reproducing with this build try build? It has the
> alignment hack disabled and builds all of libvpx in non-unified mode.
> 
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=f4a57e6a1786

Hi, I now tried the build and have not been able to reproduce.

I although noticed that Youtube or Vimeo videos won't play in the build locally regardless of whether I'm on a Hello call or not (the video is not rendered locally in the browser although the audio plays correctly). If I share my tabs whilst viewing a Youtube or Vime video I although can see the video rendered on the clicker side.
RT, just to confirm, you can't reproduce the crash with WebM content, but you're getting the black-frames bug? The later sounds like a known problem with hardware-accelerated h.264 decoding, and shouldn't occur with webm video.
Flags: needinfo?(giles)
Ralph,
Actually when playing a little more on my Windows 8 machine this morning I was able to reproduce when sharing tabs only (I have not been able to reproduce using window sharing so far).
Crash reports:
https://crash-stats.mozilla.com/report/index/30478179-cd65-437f-82af-b14c62150423
https://crash-stats.mozilla.com/report/index/e3697203-63c9-440b-bfb6-32d952150423
https://crash-stats.mozilla.com/report/index/0aecd8d0-0c16-48fa-b06d-4eec22150423
https://crash-stats.mozilla.com/report/index/2de3bebd-da78-4a3c-ab14-fd0422150423

Crashes seem to happen after a short time (30s to 2 minutes) switching between tabs / browsing through sites. The display also goes funky when/prior to crashing (see attached).
Attached image Sans titre.png
The crashes in comment #51 are in the Windows video acceleration driver, so I think they're a separate issue.

Given you can't reproduce the libvpx bug, I'll remove the alignment patch in favour on the non-unified build change in bug 1148639.
Verified as fixed using the following environment:

FF 38.0b9 he and en-us
OS:Win 7 x64

also checked Soccorro the bug doesn't reproduce anymore, based on this verification and the fact that the issues found on comment 51 are video acceleration driver related maarking the bug as verified on FF 38.
Socorro shows 178 crashes in the last week on FF 41.0a1 and FF 40.0a2, we were unable to reproduce the crash though.
https://crash-stats.mozilla.com/report/list?range_unit=days&range_value=7&signature=vp8_diamond_search_sadx4#tab-sigsummary

David, any ideas?
Flags: needinfo?(dmajor)
Spot check of a few minidumps points to the same alignment issue as before.
Flags: needinfo?(dmajor) → needinfo?(giles)
I guess we should restore your patch.
Flags: needinfo?(giles)
I will go ahead and close this bug since there are no more crashes in newer builds after the fix from 1173396.
https://crash-stats.mozilla.com/report/list?range_unit=days&range_value=7&signature=vp8_diamond_search_sadx4#tab-reports
Status: RESOLVED → VERIFIED
You need to log in before you can comment on or make changes to this bug.