Closed Bug 1270686 Opened 8 years ago Closed 8 years ago

Firefox 47 spike in crashes in TppTimerpExecuteCallback, on ATI graphics cards

Categories

(Core :: Audio/Video: Playback, defect, P1)

x86
Windows 8
defect

Tracking

()

RESOLVED FIXED
mozilla49
Tracking Status
firefox47 blocking fixed
firefox48 --- fixed
firefox49 --- fixed
firefox-esr45 --- affected

People

(Reporter: dbaron, Assigned: cpearce)

References

Details

(Keywords: crash, topcrash, Whiteboard: [gfx-noted])

Crash Data

Attachments

(6 files, 4 obsolete files)

84.02 KB, image/png
Details
109.92 KB, image/png
Details
163.98 KB, text/plain
Details
8.02 KB, patch
glandium
: review+
Details | Diff | Splinter Review
2.04 KB, patch
cpearce
: review+
Details | Diff | Splinter Review
58 bytes, text/x-review-board-request
ajones
: review+
Details
[Tracking Requested - why for this release]:

This bug was filed from the Socorro interface and is 
report bp-9e1546a6-57a7-4743-9fe0-9e3062160429.
=============================================================

Both 47.0b1 and 47.0b2 have had significantly more crashes with signature TppTimerpExecuteCallback (see also bug 964351) than other recent beta/release versions.

All of these crashes happen on ATI graphics cards (every single one has vendor ID 0x1002), and they have an interesting distribution of install age (which I believe is time from Firefox install to crash) peaking around 40 seconds, with the oldest being around 100 seconds.

A decent number of individual users have submitted 2 crash reports, although relatively small numbers have submitted more than that.  It seems like it may be the sort of crash that is preventing these users from ever starting Firefox, although I'm not sure (I'd have expected slightly more repeat-submission in that case, although that's just intuition).

The dominant OS is Windows 8.1 (at 90.5%), although there are some crashes from Windows 10 (7%) and Windows 8 (2.5%).

The top of the distribution of adapter device IDs is:

Rank 	Adapter device id 	Count 	%
1 	0x9851 	182 	26.07 %
2 	0x9834 	97 	13.90 %
3 	0x9853 	66 	9.46 %
4 	0x9830 	54 	7.74 %
5 	0x9832 	47 	6.73 %
6 	0x9838 	32 	4.58 %
7 	0x983d 	19 	2.72 %
8 	0x9839 	15 	2.15 %
9 	0x9850 	15 	2.15 %
10 	0x990b 	14 	2.01 %
11 	0x9809 	12 	1.72 %
12 	0x9852 	12 	1.72 %
Flags: needinfo?(milan)
Flags: needinfo?(bas)
FWIW, I noticed this because it's at the #1 on my query for crashes in the first 2 minutes after install:

https://crash-stats.mozilla.com/search/?product=Firefox&version=47.0b1&install_age=%3C120&_facets=signature&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-signature
(that's for 47.0b1)

(though it's #23 on the overall topcrash list).
And this is present on nightly, but not in large enough numbers to get an accurate regression range.  In particular, it started here:

Build ID       Count
20160219030248 1
20160225030209 1
20160301030237 2
20160310030242 1
20160311030233 2
20160312030405 2
20160313030418 1
20160314030215 1
20160315030230 1
20160318030236 1
20160319030558 4
20160321030217 1
20160322030417 1
20160324030447 2
20160402030226 1
20160403030243 1
20160406030221 2
20160411030231 2
20160416030220 1
20160421030302 1
20160424030601 1
(In reply to David Baron [:dbaron] ⌚️UTC-7 (review requests must explain patch) from comment #1)
> FWIW, I noticed this because it's at the #1 on my query for crashes in the
> first 2 minutes after install:
> 
> https://crash-stats.mozilla.com/search/?product=Firefox&version=47.
> 0b1&install_age=%3C120&_facets=signature&_columns=date&_columns=signature&_co
> lumns=product&_columns=version&_columns=build_id&_columns=platform#facet-
> signature
> (that's for 47.0b1)
> 
> (though it's #23 on the overall topcrash list).

I wonder if we're somehow deciding to blocklist these people only after first launch? Indeed I don't see any other explanation behind the low number of repeat submissions.
Flags: needinfo?(bas)
(In reply to Bas Schouten (:bas.schouten) from comment #3)
> I wonder if we're somehow deciding to blocklist these people only after
> first launch? Indeed I don't see any other explanation behind the low number
> of repeat submissions.

I don't think I buy that since there are plenty of 2x submissions, and a few 3x, 4x, etc.

I was thinking a more likely explanation is that people are perhaps likely to try starting up and crash more than 2 times, but they might actually bother to submit the crash report only the first or second time.  This argues for the theory that this is a startup crash that prevents users from using Firefox.
It doesn't look like the crash guard is stopping the repeated crashes, I don't see any of these with GraphicsStartupTest set in the metadata.  There is also plenty of these crashes 5+ minutes into the session, rather than on startup.

I also wouldn't be surprised if some of the other crashes on these devices were "the same" (e.g., https://crash-stats.mozilla.com/report/index/8d424a9e-d247-4d32-a7b7-d04ee2160429, https://crash-stats.mozilla.com/report/index/9a63e97f-6cda-4ffd-8bc5-95bcd2160429).

Anyone knows how to do the correlation on "strange" DLLs being loaded?
Flags: needinfo?(milan)
(In reply to Milan Sreckovic [:milan] from comment #5)
> It doesn't look like the crash guard is stopping the repeated crashes, I
> don't see any of these with GraphicsStartupTest set in the metadata.  There
> is also plenty of these crashes 5+ minutes into the session, rather than on
> startup.

Er, right.  I think I might still have been looking at my filtered query when I concluded that!

It is still the largest crash within the first 120 seconds after install.

> Anyone knows how to do the correlation on "strange" DLLs being loaded?

By reading the correlation reports are in https://crash-analysis.mozilla.com/crash_analysis/ -- but the conclusion I came to right before filing this bug was that this crash was extremely well correlated with ATI drivers, and with a decent (although not huge) spread of versions of the drivers.
Flags: needinfo?(jmuizelaar)
Will need to check. Is there some correlation on what the users did at that time in these reports ? Is some new build 47 functionality being used?
(In reply to Paul Blinzer from comment #7)
> Will need to check. Is there some correlation on what the users did at that
> time in these reports ? Is some new build 47 functionality being used?

There are quite a few youtube reports so it could be video related.
Flags: needinfo?(jmuizelaar)
This crash is a bit weird. It looks like we're calling a TimerCallback that doesn't exist. The best reasoning that I could come up with that would cause this is someone calling FreeLibrary on dll that still has a pending timer.
(In reply to Milan Sreckovic [:milan] from comment #5)
> Anyone knows how to do the correlation on "strange" DLLs being loaded?

these are module correlations for this signature in yesterday's data for 47.0b2:
  TppTimerpExecuteCallback|EXCEPTION_ACCESS_VIOLATION_EXEC (124 crashes)
    100% (124/124) vs.   8% (1857/21929) atiuxpag.dll
    100% (124/124) vs.   8% (1857/21929) aticfx32.dll
    100% (124/124) vs.   9% (1922/21929) atidxx32.dll
    100% (124/124) vs.   9% (1968/21929) msvproc.dll
     99% (123/124) vs.  17% (3638/21929) MSAudDecMFT.dll
     97% (120/124) vs.  15% (3301/21929) RTWorkQ.dll
    100% (124/124) vs.  21% (4576/21929) WINMMBASE.dll
     97% (120/124) vs.  18% (3880/21929) kernel.appcore.dll
    100% (124/124) vs.  21% (4610/21929) SHCore.dll
    100% (124/124) vs.  21% (4611/21929) bcryptPrimitives.dll
    100% (124/124) vs.  21% (4611/21929) combase.dll
     87% (108/124) vs.  13% (2903/21929) ondemandconnroutehelper.dll
     91% (113/124) vs.  25% (5527/21929) winhttp.dll
     98% (122/124) vs.  34% (7468/21929) xmllite.dll
     99% (123/124) vs.  44% (9698/21929) d2d1.dll
     73% (91/124) vs.  22% (4931/21929) wshbth.dll
    100% (124/124) vs.  52% (11380/21929) d3d11.dll
    100% (124/124) vs.  56% (12267/21929) dxgi.dll
    100% (124/124) vs.  67% (14646/21929) msmpeg2vdec.dll
     94% (117/124) vs.  62% (13503/21929) explorerframe.dll
     94% (117/124) vs.  62% (13630/21929) FWPUCLNT.DLL
    100% (124/124) vs.  70% (15290/21929) DWrite.dll
    100% (124/124) vs.  70% (15333/21929) mozavcodec.dll
    100% (124/124) vs.  70% (15333/21929) mozavutil.dll
    100% (124/124) vs.  70% (15414/21929) AudioSes.dll
    100% (124/124) vs.  71% (15517/21929) cryptsp.dll
    100% (124/124) vs.  71% (15543/21929) bcrypt.dll
     99% (123/124) vs.  71% (15538/21929) sspicli.dll
     84% (104/124) vs.  56% (12264/21929) duser.dll
     84% (104/124) vs.  56% (12267/21929) dui70.dll
    100% (124/124) vs.  73% (15968/21929) evr.dll
    100% (124/124) vs.  73% (15972/21929) mf.dll
    100% (124/124) vs.  73% (16033/21929) dxva2.dll
     80% (99/124) vs.  53% (11630/21929) secur32.dll
     40% (49/124) vs.  13% (2809/21929) twinapi.dll
    100% (124/124) vs.  74% (16246/21929) Wpc.dll
    100% (124/124) vs.  75% (16499/21929) pnrpnsp.dll
    100% (124/124) vs.  75% (16520/21929) nlaapi.dll
    100% (124/124) vs.  75% (16526/21929) NapiNSP.dll
     31% (39/124) vs.   7% (1521/21929) Bcp47Langs.dll

...and further faceted by version numbers in the attached file
Paul does the AMD driver use CreateTimerQueueTimer()/CreateThreadpoolTimer() or FreeLibrary() in any suspicious ways?
Flags: needinfo?(paul.blinzer)
it uses FreeLibrary() but not really in any suspicious ways. We will need to look into this...
Flags: needinfo?(paul.blinzer)
This is the sort of crash where I think we need to consider not shipping Firefox 47 if we can't fix it, given that it represents about 20% of crashes within 2 minutes of install (likely users who become unable to use Firefox and have to switch to other browsers).  Seems like we need to do more to make progress here, including possibly disabling code changes around the regression range.
Flags: needinfo?(milan)
(In reply to David Baron [:dbaron] ⌚️UTC-7 (review requests must explain patch) from comment #15)
> Seems like we need to do more to
> make progress here, including possibly disabling code changes around the
> regression range.

Do we have a regression range?

One possible strategy for debugging this is logging the addresses of libraries that have been FreeLibrary()ed. That should hopefully tell us whether it's a library that's going away or a timer being setup for some random code address that never existed.
We have identified that a thread is getting stuck waiting for another thread to finish an operation, but we don't have a root cause yet based on the crashdump info.
Is there a repro scenario observed that has a higher likelihood to lead to the issue?
Flags: needinfo?(dbaron)
Attached patch Log all dlls loaded (obsolete) — Splinter Review
This might do the job. It's sort of spammy but hopefully that's ok for Nightly
Attachment #8754063 - Flags: review?(aklotz)
Yes, I think we want to understand what's going on before we ship 47 with this.

Without knowing if any of this is useful:

This is currently hitting us on AMD, but not exclusively.  For example: https://crash-stats.mozilla.com/report/index/6412e15e-a779-4397-a4ac-d1a372160512 is the same crash, on Intel, on 45esr.

We also had a spike in these two years ago, tracked in bug 964351 (recent comments with this problem already made in there, so this is not news to some.)  I don't think we ever resolved why that was the case.

This is a Flash crash in 46.0.1, on Intel, slightly different stack: https://crash-stats.mozilla.com/report/index/16c7b99b-efcc-4457-a0e6-ccb152160517
(In reply to Jeff Muizelaar [:jrmuizel] from comment #16)
> Do we have a regression range?

An approximate one; see comment 2.

(In reply to Paul Blinzer from comment #17)
> We have identified that a thread is getting stuck waiting for another thread
> to finish an operation, but we don't have a root cause yet based on the
> crashdump info.
> Is there a repro scenario observed that has a higher likelihood to lead to
> the issue?

I don't know of any steps to reproduce the bug.  However, for many users it appears to be basically a crash-on-startup.

(In reply to Milan Sreckovic [:milan] from comment #19)
> This is currently hitting us on AMD, but not exclusively.  For example:
> https://crash-stats.mozilla.com/report/index/6412e15e-a779-4397-a4ac-d1a372160512
> is the same crash, on Intel, on 45esr.
...
> https://crash-stats.mozilla.com/report/index/16c7b99b-efcc-4457-a0e6-ccb152160517

I'd define those to not be "this" or "the same crash".  Crash signature is not a perfect classification method; *this* bug is about the spike in crashes on AMD cards.
Flags: needinfo?(dbaron)
Whiteboard: [gfx-noted]
Comment on attachment 8754063 [details] [diff] [review]
Log all dlls loaded

Review of attachment 8754063 [details] [diff] [review]:
-----------------------------------------------------------------

::: mozglue/build/WindowsDllBlocklist.cpp
@@ +733,5 @@
>        printf_stderr("LdrLoadDll: Blocking load of '%s'.  XPCOM components must support ASLR.\n", dllName);
>        return STATUS_DLL_NOT_FOUND;
>      }
>    }
> +  MODULEINFO info;

Nit: Please move the declaration of info into the #ifdef NIGHTLY_BUILD block.
Attachment #8754063 - Flags: review?(aklotz) → review+
(In reply to Aaron Klotz [:aklotz] from comment #21)
> Comment on attachment 8754063 [details] [diff] [review]
> Log all dlls loaded
> 
> Review of attachment 8754063 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: mozglue/build/WindowsDllBlocklist.cpp
> @@ +733,5 @@
> >        printf_stderr("LdrLoadDll: Blocking load of '%s'.  XPCOM components must support ASLR.\n", dllName);
> >        return STATUS_DLL_NOT_FOUND;
> >      }
> >    }
> > +  MODULEINFO info;
> 
> Nit: Please move the declaration of info into the #ifdef NIGHTLY_BUILD block.

This may not actually work as I don't think we have access to the crash reporter in mozglue...
Ah shite, I think you're right! doh...
Attached patch Log all dlls loaded v2 (obsolete) — Splinter Review
This turned out ugly... Be gentle in the review as I just want it to work and then I'll take it out of the tree ASAP.

This basically keeps a list of information and then passes it to the callback when the callback is registered.
Attachment #8754063 - Attachment is obsolete: true
Attachment #8755053 - Flags: review?(mh+mozilla)
Attachment #8755053 - Flags: review?(aklotz)
Comment on attachment 8755053 [details] [diff] [review]
Log all dlls loaded v2

Ted, is it going to be a problem putting a list of all the DLLs ever loaded into the AppNotes?
Attachment #8755053 - Flags: feedback?(ted)
Hi Jeff, we are a week away from entering RC. If there is a potential fix, I hope we can get it uplifted before Thursday, when I gtb 47.0b9. Thanks!
Flags: needinfo?(jmuizelaar)
(In reply to Ritu Kothari (:ritu) from comment #26)
> Hi Jeff, we are a week away from entering RC. If there is a potential fix, I
> hope we can get it uplifted before Thursday, when I gtb 47.0b9. Thanks!

We're still in the dark - the patch above would be nightly only, and to get us some extra information.  I don't expect us to get anywhere without that information, and it will probably be Thursday before we start getting reasonable amount of data with the extra information if we land this patch (or something like it), so I don't see this making 47b9.

Ritu, can you start the conversation regarding comment 15 - I don't know what it takes to decide and make something a release blocker.
Flags: needinfo?(milan) → needinfo?(rkothari)
Comment on attachment 8755053 [details] [diff] [review]
Log all dlls loaded v2

Review of attachment 8755053 [details] [diff] [review]:
-----------------------------------------------------------------

Why do you need this? We already have a list of loaded dlls in crash reports. What do you expect the difference would be with this additional information?
To know the list of DLLs that were loaded but have since been unloaded.  I don't think that's in the crash report.
Comment on attachment 8755053 [details] [diff] [review]
Log all dlls loaded v2

Review of attachment 8755053 [details] [diff] [review]:
-----------------------------------------------------------------

Ah okay, so a slightly better patch summary would be "Log all DLLs ever loaded"...

::: mozglue/build/WindowsDllBlocklist.cpp
@@ +17,4 @@
>  #include <windows.h>
>  #include <winternl.h>
>  #include <io.h>
> +#define PSAPI_VERSION 2

My reading of https://msdn.microsoft.com/en-us/library/windows/desktop/ms683201(v=vs.85).aspx is that this will break running Firefox on Windows XP.

@@ +561,5 @@
> +
> +extern "C" {
> +NS_EXPORT void
> +RegisterDllLoadCallback(LoadCallBackFn aCallback) {
> +  // some weird mangling thing is making it so I can't pass function pointers

More about this below.

@@ +767,5 @@
> +  if (!ret) {
> +      MODULEINFO moduleInfo;
> +      if (GetModuleInformation(GetCurrentProcess(), *handle, &moduleInfo, sizeof(moduleInfo))) {
> +          DllLoadInfo info;
> +          strcpy(info.name, dllName);

Considering the size of those buffers are some constant + 1, even if it's the same constant, you might as well have a static_assert on their size being equal.

@@ +773,5 @@
> +          info.SizeOfImage = moduleInfo.SizeOfImage;
> +          if (gLoadInfoCallback) {
> +            gLoadInfoCallback(info);
> +          } else {
> +            gDllLoadInfos.push_back(info);

There's potential for thread-safety issues here (as well as with gLoadInfoCallback).

::: mozglue/build/WindowsDllBlocklist.h
@@ +43,5 @@
> +  LPVOID lpBaseOfDll;
> +  DWORD SizeOfImage;
> +};
> +typedef void (*LoadCallBackFn)(DllLoadInfo&);
> +// extern "C" because the demangler was being mean

The reason this happens, I presume, is because WindowsDllBlocklist.h is not included from WindowsDllBlocklist.cpp, and you have those definitions twice, and for some reason, they don't match exactly as far as mangling goes.

@@ +44,5 @@
> +  DWORD SizeOfImage;
> +};
> +typedef void (*LoadCallBackFn)(DllLoadInfo&);
> +// extern "C" because the demangler was being mean
> +extern "C" {

extern "C" NS_IMPORT void ...

That said, MFBT_API would be better here (I know NS_IMPORT is what is being used, but that's because the file used to be in toolkit/, with a function definition in xpcom/).

Now, overall, this is what I'd suggest:
- Create a separate patch that:
  - Includes WindowsDllBlocklist.h from WindowsDllBlocklist.cpp.
  - Replace NS_EXPORT in WindowsDllBlocklist.cpp with MFBT_API.
  - Replace NS_IMPORT in WindowsDllBlocklist.h with MFBT_API.
  - Replace the nscore.h include in WindowsDllBlocklist.h with mozilla/Attributes.h
- Remove the extern "C" in this patch, and remove the duplication of the function type and struct declarations.

The reason for a separate patch is that it would stay in, whether or not the crash reporter annotation thing stays.
Attachment #8755053 - Flags: review?(mh+mozilla) → review-
Depends on: 1275283
Flags: needinfo?(jmuizelaar)
Attached patch Log all dlls ever loaded v3 (obsolete) — Splinter Review
This address the mangling/headers mess. Re: thread safety we register the callback early enough that it should be before there are any parallel dll loading, and once the callback is called we should be thread safe (AppNotes are)
Attachment #8755053 - Attachment is obsolete: true
Attachment #8755053 - Flags: review?(aklotz)
Attachment #8755053 - Flags: feedback?(ted)
Attachment #8755886 - Flags: review?(mh+mozilla)
Attachment #8755886 - Flags: review?(aklotz)
Attachment #8755886 - Flags: feedback?(ted)
Comment on attachment 8755886 [details] [diff] [review]
Log all dlls ever loaded v3

Review of attachment 8755886 [details] [diff] [review]:
-----------------------------------------------------------------

OK but needs some cleanup before landing.

::: mozglue/build/WindowsDllBlocklist.cpp
@@ +267,4 @@
>  
>  namespace {
>  
> +typedef NTSTATUS (NTAPI *LdrLoadDll_func) (PWCHAR filePath, PULONG flags, PUNICODE_STRING moduleFileName, HMODULE *handle);

I'd prefer that this signature not change for the sake of consistency with NT Native API types. Just cast it to an HMODULE in the GetModuleInformation call.

@@ +550,5 @@
> +
> +#ifdef NIGHTLY_BUILD
> +static std::vector<DllLoadInfo> gDllLoadInfos;
> +static LoadCallBackFn gLoadInfoCallback;
> +

Nit: please remove blank lines

@@ +771,3 @@
>  } // namespace
>  
>  MFBT_API void

This should be wrapped in #ifdef NIGHTLY_BUILD

::: mozglue/build/WindowsDllBlocklist.h
@@ +34,5 @@
>      MOZ_DECL_USE_GUARD_OBJECT_NOTIFIER
>  };
>  
> +#ifdef NIGHTLY_BUILD
> +#define DLLNAME_MAX 128

Nit: would you mind declaring an enum class for DLLNAME_MAX instead of using #define?
Attachment #8755886 - Flags: review?(aklotz) → review+
Comment on attachment 8755886 [details] [diff] [review]
Log all dlls ever loaded v3

Review of attachment 8755886 [details] [diff] [review]:
-----------------------------------------------------------------

::: mozglue/build/WindowsDllBlocklist.cpp
@@ +17,4 @@
>  #include <windows.h>
>  #include <winternl.h>
>  #include <io.h>
> +#define PSAPI_VERSION 2

See previous review.
Attachment #8755886 - Flags: review?(mh+mozilla) → review-
Attached patch Log all dlls ever loaded v4 (obsolete) — Splinter Review
This dynamically checks for K32GetModuleInformation so that it works on XP.
Attachment #8755886 - Attachment is obsolete: true
Attachment #8755886 - Flags: feedback?(ted)
Attachment #8756077 - Flags: review?(mh+mozilla)
(In reply to Milan Sreckovic [:milan] from comment #27)
> (In reply to Ritu Kothari (:ritu) from comment #26)
> > Hi Jeff, we are a week away from entering RC. If there is a potential fix, I
> > hope we can get it uplifted before Thursday, when I gtb 47.0b9. Thanks!
> 
> We're still in the dark - the patch above would be nightly only, and to get
> us some extra information.  I don't expect us to get anywhere without that
> information, and it will probably be Thursday before we start getting
> reasonable amount of data with the extra information if we land this patch
> (or something like it), so I don't see this making 47b9.
> 
> Ritu, can you start the conversation regarding comment 15 - I don't know
> what it takes to decide and make something a release blocker.

Hi Milan, I have already set this bug as tracked blocking for Fx47. For a critical issue like this, I would be willing to take a fix or consider backing out some changes (like DBaron suggested) in 47.0b9 (Thursday) or RC1 (Monday). Based on your comment, we might have a speculative fix ready by Thursday. Is that right?
Flags: needinfo?(rkothari) → needinfo?(milan)
(In reply to Ritu Kothari (:ritu) from comment #35)
> 
> Hi Milan, I have already set this bug as tracked blocking for Fx47. For a
> critical issue like this, I would be willing to take a fix or consider
> backing out some changes (like DBaron suggested) in 47.0b9 (Thursday) or RC1
> (Monday). Based on your comment, we might have a speculative fix ready by
> Thursday. Is that right?

Thursday is very optimistic for a fix.
Comment on attachment 8756077 [details] [diff] [review]
Log all dlls ever loaded v4

Review of attachment 8756077 [details] [diff] [review]:
-----------------------------------------------------------------

::: mozglue/build/WindowsDllBlocklist.cpp
@@ +17,4 @@
>  #include <windows.h>
>  #include <winternl.h>
>  #include <io.h>
> +#define PSAPI_VERSION 2

From my reading of that MSDN article, you should "just" set this to 1 and add psapi to OS_LIBS in mozglue/build/moz.build, and shouldn't have to care about the K32 prefix or getting the function from the symbol.
Attachment #8756077 - Flags: review?(mh+mozilla)
(In reply to Jeff Muizelaar [:jrmuizel] from comment #36)
> (In reply to Ritu Kothari (:ritu) from comment #35)
> > 
> > Hi Milan, I have already set this bug as tracked blocking for Fx47. For a
> > critical issue like this, I would be willing to take a fix or consider
> > backing out some changes (like DBaron suggested) in 47.0b9 (Thursday) or RC1
> > (Monday). Based on your comment, we might have a speculative fix ready by
> > Thursday. Is that right?
> 
> Thursday is very optimistic for a fix.

Right - Thursday may be when we start getting the data that would hopefully help us with a fix.  It is very unlikely that we'd actually have a fix by then.
Flags: needinfo?(milan)
Use a different approach for XP support.
Attachment #8756077 - Attachment is obsolete: true
Attachment #8756111 - Flags: review?(mh+mozilla)
Comment on attachment 8756111 [details] [diff] [review]
Log all dlls ever loaded v5

Review of attachment 8756111 [details] [diff] [review]:
-----------------------------------------------------------------

::: mozglue/build/WindowsDllBlocklist.cpp
@@ +545,5 @@
>    }
>    return nullptr;
>  }
>  
> +#define DLLNAME_MAX 128

You're redefining DLLNAME_MAX here.

@@ +770,5 @@
>  
> +#ifdef NIGHTLY_BUILD
> +MFBT_API void
> +RegisterDllLoadCallback(LoadCallBackFn aCallback) {
> +  // some weird mangling thing is making it so I can't pass function pointers

You can remove this comment.

@@ +800,5 @@
>    if (GetModuleHandleA("user32.dll")) {
>      sUser32BeforeBlocklist = true;
>    }
>  
> +

Don't add an extra line here.

::: mozglue/build/WindowsDllBlocklist.h
@@ +34,5 @@
>      MOZ_DECL_USE_GUARD_OBJECT_NOTIFIER
>  };
>  
> +#ifdef NIGHTLY_BUILD
> +#define DLLNAME_MAX 128

Move this out of the ifdef.
Attachment #8756111 - Flags: review?(mh+mozilla) → review+
sorry had to back out for bustage like https://treeherder.mozilla.org/logviewer.html#?job_id=28637696&repo=mozilla-inbound
Flags: needinfo?(jmuizelaar)
Backed out because a later test run reported many crashes caused by this push:

https://hg.mozilla.org/integration/mozilla-inbound/rev/3a3dcaf96852

Test run with failures (on Windows 7 opt): https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=5c66fb81be682bf7602a8a7b8655186cbeaed1d2
Failure log (example): https://treeherder.mozilla.org/logviewer.html#?job_id=28700505&repo=mozilla-inbound

10:54:17     INFO -  Thread 23 (crashed)
10:54:17     INFO -   0  ucrtbase.dll!_invalid_parameter + 0xdc
10:54:17     INFO -      eip = 0x69ad3405   esp = 0x0ba5f7cc   ebp = 0x0ba5f7f4   ebx = 0x00000000
10:54:17     INFO -      esi = 0x024570e0   edi = 0x00000000   eax = 0x0ba5f19c   ecx = 0x00000008
10:54:17     INFO -      edx = 0x00000000   efl = 0x00000212
10:54:17     INFO -      Found by: given as instruction pointer in context
10:54:17     INFO -   1  ucrtbase.dll!_invalid_parameter_noinfo_noreturn + 0x8
10:54:17     INFO -      eip = 0x69ad3448   esp = 0x0ba5f7fc   ebp = 0x0ba5f808
10:54:17     INFO -      Found by: call frame info
10:54:17     INFO -   2  mozglue.dll!std::vector<DllLoadInfo,std::allocator<DllLoadInfo> >::_Reallocate(unsigned int) [vector:5c66fb81be68 : 1638 + 0x11]
10:54:17     INFO -      eip = 0x69bd2242   esp = 0x0ba5f810   ebp = 0x0ba5f808
10:54:17     INFO -      Found by: call frame info
Does gDllLoadInfos need to be protected by a lock?
Hi Milan, Jeff, could you please nominate the patch for uplift to Beta? I will gtb 47.0b9 tomorrow ~10:00am PST and I hope we can get a diagnostic patch landed to m-b before that. Thanks for all the hard work on this one!
Flags: needinfo?(milan)
just by circumstantial evidence this TppTimerpExecuteCallback signature looks related to enabling D3D11 DXVA in bug 1248496.

from 47.0b7 to 47.0b8 the volume of this signature shrunk by halve - in between some versions of atidxx32.dll have been blacklisted from using D3D11 DXVA (bug 1273691) and the TppTimerpExecuteCallback signature has ceased in beta 8 for exactly those versions:

module correlation in beta 7:
   100% (158/158) vs.   9% (1890/20570) atidxx32.dll
          1% (2/158) vs.   0% (32/20570) 8.17.10.483
          0% (0/158) vs.   0% (60/20570) 8.17.10.489
          1% (1/158) vs.   0% (19/20570) 8.17.10.494
          0% (0/158) vs.   0% (1/20570) 8.17.10.498
          1% (2/158) vs.   0% (14/20570) 8.17.10.511
          0% (0/158) vs.   0% (1/20570) 8.17.10.514
         16% (26/158) vs.   0% (76/20570) 8.17.10.519
          1% (1/158) vs.   0% (19/20570) 8.17.10.520
         22% (34/158) vs.   1% (195/20570) 8.17.10.525
         32% (50/158) vs.   0% (80/20570) 8.17.10.531
         14% (22/158) vs.   0% (83/20570) 8.17.10.539
          1% (2/158) vs.   0% (61/20570) 8.17.10.545
          0% (0/158) vs.   0% (10/20570) 8.17.10.560
          2% (3/158) vs.   1% (112/20570) 8.17.10.569
          3% (4/158) vs.   0% (79/20570) 8.17.10.581
          1% (1/158) vs.   0% (6/20570) 8.17.10.595
          0% (0/158) vs.   0% (2/20570) 8.17.10.605
          6% (10/158) vs.   2% (315/20570) 8.17.10.625

...compared to correlation data in beta 8:
    100% (56/56) vs.   9% (916/10640) atidxx32.dll
          5% (3/56) vs.   0% (20/10640) 8.17.10.483
          0% (0/56) vs.   0% (26/10640) 8.17.10.489
          2% (1/56) vs.   0% (9/10640) 8.17.10.494
          0% (0/56) vs.   0% (3/10640) 8.17.10.511 (blocked from using d3d11 dxva in 47.0b8)
          0% (0/56) vs.   0% (1/10640) 8.17.10.514
          0% (0/56) vs.   0% (13/10640) 8.17.10.519 (blocked from using d3d11 dxva in 47.0b8)
          0% (0/56) vs.   0% (9/10640) 8.17.10.520
          0% (0/56) vs.   0% (35/10640) 8.17.10.525 (blocked from using d3d11 dxva in 47.0b8)
         71% (40/56) vs.   1% (82/10640) 8.17.10.531
          0% (0/56) vs.   0% (14/10640) 8.17.10.539 (blocked from using d3d11 dxva in 47.0b8)
          4% (2/56) vs.   0% (15/10640) 8.17.10.545
          0% (0/56) vs.   0% (8/10640) 8.17.10.560
          4% (2/56) vs.   0% (44/10640) 8.17.10.569
         14% (8/56) vs.   0% (43/10640) 8.17.10.581
          0% (0/56) vs.   0% (1/10640) 8.17.10.595
          0% (0/56) vs.   2% (164/10640) 8.17.10.625 (blocked from using d3d11 dxva in 47.0b8)
Blocks: 1248496
Yeah, I think that theory is pretty compelling. This suggests that we should disable D3D11 DXVA on AMD in 47.
Flags: needinfo?(jmuizelaar) → needinfo?(matt.woodrow)
Paul, under what circumstances is atidxx32.dll loaded/unloaded?
Flags: needinfo?(paul.blinzer)
Component: Graphics → Audio/Video: Playback
It looks atidxx32.dll is still loaded in the crashes.

Paul, is there another dll that is loaded/unloaded when using D3D11 for DXVA?
For your information, there also these "browser_perf-recordings-clear-02.js | Test timed out" permanent failures between the last landing and its backout:

https://treeherder.mozilla.org/logviewer.html#?job_id=28748516&repo=mozilla-inbound

A retrigger with 20 jobs on the backout didn't let a single job fail, so either this a new intermittent which came, permafailed, and went away, or it's related to this patch.
So - is there work beyond what's in beta 8 with bug 1273691?  Or we want to go hard at it and disable all AMD D3D11 DXVA, 47 and higher, using the downloadable list?
Flags: needinfo?(milan)
by chance atidxx32.dll 8.17.10.531, which seems to be at the root of 70% of the remaining crashes in beta 8, will go into the blacklist in 47.0b9 as well (bug 1275739). not sure how to progress beyond that...
(In reply to Milan Sreckovic [:milan] from comment #58)
> So - is there work beyond what's in beta 8 with bug 1273691?  Or we want to
> go hard at it and disable all AMD D3D11 DXVA, 47 and higher, using the
> downloadable list?

Unless we have information to suggest otherwise, I think we should delay enabling D3D11 DXVA on all AMD hardware until at least FF48. We can probably just do this with a code change.
(In reply to Jeff Muizelaar [:jrmuizel] from comment #60)
> (In reply to Milan Sreckovic [:milan] from comment #58)
> > So - is there work beyond what's in beta 8 with bug 1273691?  Or we want to
> > go hard at it and disable all AMD D3D11 DXVA, 47 and higher, using the
> > downloadable list?
> 
> Unless we have information to suggest otherwise, I think we should delay
> enabling D3D11 DXVA on all AMD hardware until at least FF48. We can probably
> just do this with a code change.

Hi Milan, Jeff, I just took a patch from Philipp in bug 1275739. Is there a different patch that is needed or is that sufficient? I will gtb 47.0b9 in an hour so appreciate a prompt reply. Thanks!
Flags: needinfo?(milan)
Flags: needinfo?(jmuizelaar)
(In reply to [:philipp] from comment #53)
> just by circumstantial evidence this TppTimerpExecuteCallback signature
> looks related to enabling D3D11 DXVA in bug 1248496.

Note also that the first nightly that https://hg.mozilla.org/mozilla-central/rev/320e4b7ceee4 appeared in was 2016-02-18, which is consistent with the regression range in comment 2 (the first crash was on 2016-02-19).
(In reply to Ritu Kothari (:ritu) from comment #61)
> > 
> > Unless we have information to suggest otherwise, I think we should delay
> > enabling D3D11 DXVA on all AMD hardware until at least FF48. We can probably
> > just do this with a code change.
> 
> Hi Milan, Jeff, I just took a patch from Philipp in bug 1275739. Is there a
> different patch that is needed or is that sufficient? I will gtb 47.0b9 in
> an hour so appreciate a prompt reply. Thanks!

With an hour deadline, we should probably stick with Philipp's patch.  I do like Jeff's suggestion, but we'd want media to comment on it, and that may get us into tomorrow.
Flags: needinfo?(milan) → needinfo?(ajones)
(In reply to David Baron :dbaron: ⌚️UTC-7 (review requests must explain patch) from comment #62)
> (In reply to [:philipp] from comment #53)
> > just by circumstantial evidence this TppTimerpExecuteCallback signature
> > looks related to enabling D3D11 DXVA in bug 1248496.
> 
> Note also that the first nightly that
> https://hg.mozilla.org/mozilla-central/rev/320e4b7ceee4 appeared in was
> 2016-02-18, which is consistent with the regression range in comment 2 (the
> first crash was on 2016-02-19).

If this is true, we should clearly be backing out "Bug 1248496 - Enable D3D11 DXVA" and disabling D3D11 DXVA for Fx47. We can keep it enabled in Fx48 to investigate, rootcause and fix similar issues.

Hi Anthony, Chris, Matt: What would it take for us to backout the patch from bug 1248496? Also, what new risks would this backout be introducing?
Flags: needinfo?(cpearce)
Flags: needinfo?(matt.woodrow)
Attachment #8757006 - Flags: review?(ajones)
Attachment #8757006 - Flags: approval-mozilla-beta?
Flags: needinfo?(cpearce)
Attachment #8757006 - Flags: review?(ajones) → review+
Comment on attachment 8757006 [details] [diff] [review]
Disable D3D11 DXVA (patch for beta)

We've noticed a huge # of crashes and it makes sense to turn this pref off. Beta47+
Attachment #8757006 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Flags: needinfo?(ajones)
(In reply to Ritu Kothari (:ritu) from comment #67)
> Comment on attachment 8757006 [details] [diff] [review]
> Disable D3D11 DXVA (patch for beta)
> 
> We've noticed a huge # of crashes and it makes sense to turn this pref off.
> Beta47+

seems this landed in https://hg.mozilla.org/releases/mozilla-beta/rev/2ee4473c729a
We've already blacklisted most of the driver versions that appear in the crash reports so that D3D11 DXVA won't be used. Let's blacklist the remainder, so that we don't see the same thing in 48.
Comment on attachment 8758089 [details]
MozReview Request: Bug 1270686 - Blacklist more ATI drivers for D3D11 DXVA2. r?kentuckyfriedtakahe

https://reviewboard.mozilla.org/r/56476/#review53040
Attachment #8758089 - Flags: review?(ajones) → review+
Comment on attachment 8758089 [details]
MozReview Request: Bug 1270686 - Blacklist more ATI drivers for D3D11 DXVA2. r?kentuckyfriedtakahe

Approval Request Comment
[Feature/regressing bug #]: D3D11 DXVA
[User impact if declined]: Crashes for some users with ATI graphics cards with obsolete graphics drivers when they try to play video.
[Describe test coverage new/current, TreeHerder]: We don't have coverage of busted drivers AFAIK.
[Risks and why]: Low; we're reverting these users to the previously working path (D3D9 DXVA)
[String/UUID change made/needed]: None.
Attachment #8758089 - Flags: approval-mozilla-aurora?
https://hg.mozilla.org/mozilla-central/rev/524bf899365d
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla49
Assignee: nobody → cpearce
Attachment #8758089 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
likely root cause is found and working on a fix
Flags: needinfo?(paul.blinzer)
The Crimson hotfix driver package 16.6.1 (file version 16.15.2211, available e.g. via http://support.amd.com/en-us/download/desktop?os=Windows+10+-+64, or equivalent for other OS releases) has a fix for the issue included. 
Please confirm on that driver version and remove the blacklist entry from that version onwards
Flags: needinfo?(dbaron)
Flags: needinfo?(dbaron) → needinfo?(cpearce)
(In reply to Paul Blinzer from comment #77)
> The Crimson hotfix driver package 16.6.1 (file version 16.15.2211, available
> e.g. via http://support.amd.com/en-us/download/desktop?os=Windows+10+-+64,
> or equivalent for other OS releases) has a fix for the issue included. 
> Please confirm on that driver version and remove the blacklist entry from
> that version onwards

Is this bug present in all AMD drivers before 16.15.2211? i.e. is there anything we can do to avoid triggering the bug other than blocking D3D11-DXVA on all AMD drivers before 16.15.2211?
Flags: needinfo?(jmuizelaar) → needinfo?(paul.blinzer)
(In reply to Jeff Muizelaar [:jrmuizel] from comment #78)
> (In reply to Paul Blinzer from comment #77)
> > The Crimson hotfix driver package 16.6.1 (file version 16.15.2211, available
> > e.g. via http://support.amd.com/en-us/download/desktop?os=Windows+10+-+64,
> > or equivalent for other OS releases) has a fix for the issue included. 
> > Please confirm on that driver version and remove the blacklist entry from
> > that version onwards
> 
> Is this bug present in all AMD drivers before 16.15.2211? i.e. is there
> anything we can do to avoid triggering the bug other than blocking
> D3D11-DXVA on all AMD drivers before 16.15.2211?

Not all AMD drivers, though due to different changes in that code over time it can't be exactly determined when it was introduced. 
This issue is apparently triggered by allocating/reusing memory resources that are pending release but due to the GPU being very busy, still being held for a relatively long time in the driver, until the OS may trigger a preemption. 
Reducing the load on the GPU during allocation/reallocation of memory resources should avoid the trigger.
Flags: needinfo?(paul.blinzer)
Flags: needinfo?(cpearce)
Depends on: 1284672
You need to log in before you can comment on or make changes to this bug.