Closed Bug 1218473 Opened 9 years ago Closed 8 years ago

Crash spike in CreateWindowEx with Firefox 42.0b9 on Optimus

Categories

(Core :: IPC, defect)

x86
Windows NT
defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla52
Tracking Status
firefox42 + fixed
firefox43 + fixed
firefox44 + wontfix
firefox45 - wontfix
firefox46 - wontfix
firefox47 - wontfix
firefox50 --- wontfix
firefox51 --- wontfix
firefox52 --- fixed

People

(Reporter: bugzilla, Assigned: bugzilla)

References

Details

(Keywords: crash, Whiteboard: [44dll])

Crash Data

Attachments

(1 file, 5 obsolete files)

+++ This bug was initially created as a clone of Bug #1218128 +++

[Tracking Requested - why for this release]:

This bug was filed from the Socorro interface and is 
report bp-2fbc02f7-de63-439f-a485-f12992151024.
=============================================================


Top stack frames:
0 		@0x10f43280 	
1 	xul.dll 	js::CreateRegExpMatchResult(JSContext*, JS::Handle<JSString*>, js::MatchPairs const&, JS::MutableHandle<JS::Value>) 	js/src/builtin/RegExp.cpp
2 	xul.dll 	regexp_exec_impl 	js/src/builtin/RegExp.cpp
3 	xul.dll 	regexp_exec_impl 	js/src/builtin/RegExp.cpp
4 	xul.dll 	js::regexp_exec(JSContext*, unsigned int, JS::Value*) 	js/src/builtin/RegExp.cpp
5 	xul.dll 	js::Invoke(JSContext*, JS::CallArgs, js::MaybeConstruct) 	js/src/vm/Interpreter.cpp
6 	xul.dll 	Interpret 	js/src/vm/Interpreter.cpp
7 	xul.dll 	js::RunScript(JSContext*, js::RunState&) 	js/src/vm/Interpreter.cpp
8 	xul.dll 	js::Invoke(JSContext*, JS::CallArgs, js::MaybeConstruct) 	js/src/vm/Interpreter.cpp
9 	xul.dll 	nsXPCWrappedJSClass::CallMethod(nsXPCWrappedJS*, unsigned short, XPTMethodDescriptor const*, nsXPTCMiniVariant*) 	js/xpconnect/src/XPCWrappedJSClass.cpp
10 	xul.dll 	nsXPCWrappedJS::CallMethod(unsigned short, XPTMethodDescriptor const*, nsXPTCMiniVariant*) 	js/xpconnect/src/XPCWrappedJS.cpp
11 	xul.dll 	PrepareAndDispatch 	xpcom/reflect/xptcall/md/win32/xptcstubs.cpp
[...]

The three crash signatures attached to the bug are highly spiking in the last beta of this cycle, 42.0b9, and are ~40% of the early crashes for this version. Those crashes are practically all early in startup and the stacks for js::CreateRegExpMatchResult and js::AddTypePropertyId are similar (the latter has 2 more frames or so on top), the js::regexp_exec are heavily truncated.

The signatures did exist previously at much lower volume but exploded as startup crashes in this version. The regression range is https://hg.mozilla.org/releases/mozilla-beta/pushloghtml?fromchange=FIREFOX_42_0b8_RELEASE&tochange=FIREFOX_42_0b9_RELEASE - I'm not sure what can cause this.

This is bad enough to block release and we are only a week before release with this, so we need urgent action here (there is a reason why I file this on a weekend).
As per https://bugzilla.mozilla.org/show_bug.cgi?id=1218128#c12 I am separating this bug out from the JS stack.
Attached patch Beta 42 backout (obsolete) — Splinter Review
Approval Request Comment
[Feature/regressing bug #]: bug 1213567
[User impact if declined]: Crashes
[Describe test coverage new/current, TreeHerder]: plugin tests
[Risks and why]: Low, this was the state of beta prior to beta9
[String/UUID change made/needed]: None
Attachment #8678984 - Flags: review+
Attachment #8678984 - Flags: approval-mozilla-beta?
Attached patch Beta 42 backout (obsolete) — Splinter Review
Fixed bug number in commit message.
See https://bugzilla.mozilla.org/show_bug.cgi?id=1218473#c2
Attachment #8678984 - Attachment is obsolete: true
Attachment #8678984 - Flags: approval-mozilla-beta?
Attachment #8678986 - Flags: review+
Attachment #8678986 - Flags: approval-mozilla-beta?
Comment on attachment 8678986 [details] [diff] [review]
Beta 42 backout

Approved for uplift to beta from discussion with Sylvestre. We need this to fix a startup crash in beta 9.
Attachment #8678986 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
I am guessing this also needs uplift to m-r.
Flags: needinfo?(wkocher)
[Tracking Requested - why for this release]:

Setting tracking flags and marking each channel as affected. 
We are backing this out from beta but leaving it in 43/44 for the moment to get more information.
Could be just that another Display DLL (in this case, dlumd11.dll 9.18.7.9) needs blacklisting? Displaylink software has caused crashes before in bug 1195844 and bug 1160295. Displaylink released a new version of their software on the 22nd (http://downloads.displaylink.com/releasenotes/DisplayLink_7.9%20M3_release-notes.txt).
hi arthur, there is no indication that display link drivers would be involved here as well...
Reviewing the correlation data, there is 100% correlation with nVidia drivers. In particular, _etoured.dll is 100% correlation. This is significant because "Detours" is an API interception framework that is licensed by Microsoft Research. Depending on what nVidia drivers are doing with that, they could be modifying code in our address space in such a way that it is incompatible with our own code.
Another thing I've noticed is that the crash reports are showing Intel Graphics.

My best hypothesis at the moment is that the crashing systems have both Intel Integrated Graphics and nVidia Discrete Graphics and the Intel GPU is set as the active GPU.

Unfortunately I don't have suitable hardware to test this.

In summary, we need:

* Hardware with both Intel Integrated Graphics and nVidia Discrete Graphics;
* Intel Integrated Graphics is the active GPU;
* Windows 8 or newer;
* Running 43 Dev Ed (for now at least, unless patch is backed out) or 42 Beta 9

Kairo, I think we need some QA for this. Who should be the point of contact?
Flags: needinfo?(kairo)
(In reply to philipp from comment #9)
> hi arthur, there is no indication that display link drivers would be
> involved here as well...

When I was looking through the crash modules, I saw that DLL highlighted in red so I thought it might have been the culprit.
(In reply to Aaron Klotz [:aklotz] (please use needinfo) from comment #12)
> Kairo, I think we need some QA for this. Who should be the point of contact?

Usually, Florin is our contact for release QA, but I don't know if they have hardware that matches. The gfx team in Toronto (Milan) has a number of different machines esp. in terms of graphics configurations, so we may have a chance to dig up a machine like that there. I'm ni?ing both to see if we have a machine like that somewhere.
Flags: needinfo?(milan)
Flags: needinfo?(florin.mezei)
Comment on attachment 8678986 [details] [diff] [review]
Beta 42 backout

See comment 2 for request comment. A fix probably isn't going to be ready in the next day or two.
Attachment #8678986 - Flags: approval-mozilla-aurora?
Flags: needinfo?(milan) → needinfo?(jmuizelaar)
Flags: needinfo?(jmuizelaar)
Summary: Crash spike in CreateWindowEx with Firefox 42.0b9 → Crash spike in CreateWindowEx with Firefox 42.0b9 on Optimus
Comment on attachment 8678986 [details] [diff] [review]
Beta 42 backout

Approved for uplift to aurora (43) so we avoid hitting this startup crash when 43 moves to beta.
Attachment #8678986 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
I have a W540 in the QA lab that may be similar enough
I can't trivially reproduce on that machine
Thanks Jeff. Do we have any contacts for driver devs at nVidia? I'd love to discuss their use of detours with them...
Flags: needinfo?(jmuizelaar)
Aaron, should we also be backing this out on nightly (so it doesn't hold up the 44 aurora release?)  
It turns out are merging tomorrow instead of Monday. 
We could still uplift to 44 aurora after tomorrow though. Sorry to bug you about this but I wanted to check that you know about the early merge date.
Flags: needinfo?(aklotz)
(In reply to Aaron Klotz [:aklotz] (please use needinfo) from comment #19)
> Thanks Jeff. Do we have any contacts for driver devs at nVidia? I'd love to
> discuss their use of detours with them...

I think Bas knows someone.
Flags: needinfo?(bas)
Removing the ni? to me, I already forwarded the QA request.
Flags: needinfo?(kairo)
this has problems applying to aurora - 

grafting 311479:4d60b9516805 "Bug 1218473: Back out 45ab7cdffbb4 on suspicion of causing spike in CreateWindowEx crashes; r=backout a=lizzard CLOSED TREE"
merging ipc/glue/Neutering.h
warning: conflicts during merge.
merging ipc/glue/Neutering.h incomplete! (edit conflicts, then use 'hg resolve --mark')
merging ipc/glue/WindowsMessageLoop.cpp
merging widget/windows/nsWindow.cpp
abort: unresolved conflicts, can't continue
(use hg resolve and hg graft --continue)

could you take a look, thanks!
remote:   https://hg.mozilla.org/releases/mozilla-aurora/rev/5e1587041f9a

(In reply to Carsten Book [:Tomcat] from comment #23)
> this has problems applying to aurora - 

This was because we had to use MOZ_STACK_CLASS and not MOZ_RAII on Beta.
We should be able to use the Aurora changeset if we decide to backout from 44 (or 45 if it comes to that.)
(In reply to Bob Owen (:bobowen) from comment #24)
> remote:   https://hg.mozilla.org/releases/mozilla-aurora/rev/5e1587041f9a
> 
> (In reply to Carsten Book [:Tomcat] from comment #23)
> > this has problems applying to aurora - 
> 
> This was because we had to use MOZ_STACK_CLASS and not MOZ_RAII on Beta.
> We should be able to use the Aurora changeset if we decide to backout from
> 44 (or 45 if it comes to that.)

Thanks Bob :-)
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #22)
> Removing the ni? to me, I already forwarded the QA request.

We're currently trying to setup a machine with hardware similar to what's been specified in Comment 12. Camelia will follow up here as soon possible.
Flags: needinfo?(florin.mezei)
(In reply to Andrei Vaida, QA [:avaida] from comment #27)
> We're currently trying to setup a machine with hardware similar to what's
> been specified in Comment 12. Camelia will follow up here as soon possible.

Just a quick update here, we managed to setup a similar machine and we're currently trying to reproduce the bug. Camelia will follow up with test results on this matter, first thing tomorrow.
Flags: needinfo?(camelia.badau)
It looks like these crashes went away when 44 went to Aurora (!), so I'd suggest that this be tested with 42 Beta 9.
Looks like this isn't an issue in 44.
Flags: needinfo?(aklotz)
I've tested on Windows 8.1 32bit using Firefox 42 Beta 9 (buildID: 20151022152545) and I can't reproduce the issue (the crash). I've tried several times to reproduce this crash, but with no success. 

I've tested on a machine with the following characteristics: 
- Hardware with both Intel Integrated Graphics (Intel HD 2500) and nVidia Discrete Graphics (NVIDIA GeForce 210);
- Intel Integrated Graphics is the active GPU
Flags: needinfo?(camelia.badau)
Flags: needinfo?(jmuizelaar)
I got a machine with a matching hardware configuration that kept hitting this repeatedly on startup with current Nightly. It stopped after a few times and restarting in safe mode once, so unfortunately there's no real steps to reproduce.
(In reply to Gian-Carlo Pascutto [:gcp] from comment #32)
> I got a machine with a matching hardware configuration that kept hitting
> this repeatedly on startup with current Nightly. It stopped after a few
> times and restarting in safe mode once, so unfortunately there's no real
> steps to reproduce.

Does it come back if you use a fresh profile?
Flags: needinfo?(gpascutto)
(In reply to Jeff Muizelaar [:jrmuizel] from comment #33)
> Does it come back if you use a fresh profile?

Ding ding, you win!

(yes)
Flags: needinfo?(gpascutto)
Do you have a pointer (from about:crashes) to some of these crashes?
The NVIDIA DLLs are injected via AppInit_DLLs:

ntdll!ZwMapViewOfSection+0x12
ntdll!LdrpMapViewOfSection+0xc7
ntdll!LdrpFindOrMapDll+0x345
ntdll!LdrpLoadDll+0x2b6
ntdll!LdrLoadDll+0xaa
mozglue!`anonymous namespace'::patched_LdrLoadDll+0x1b0
KERNELBASE!LoadLibraryExW+0x1f7
KERNELBASE!LoadLibraryExA+0x26
kernel32!LoadLibraryA+0xba
nvinit+0x11cb
nvinit+0x5477
nvinit!nvCoprocThunk+0x6e94
nvinit!nvCoprocThunk+0x6e1a
ntdll!LdrpCallInitRoutine+0x14
ntdll!LdrpRunInitializeRoutines+0x26f
ntdll!LdrpLoadDll+0x453
ntdll!LdrLoadDll+0xaa
mozglue!`anonymous namespace'::patched_LdrLoadDll+0x1b0
KERNELBASE!LoadLibraryExW+0x1f7
kernel32!BasepLoadAppInitDlls+0x167
kernel32!LoadAppInitDlls+0x82
USER32!ClientThreadSetup+0x1f9
USER32!__ClientThreadSetup+0x5
ntdll!KiUserCallbackDispatcher+0x2e
GDI32!GdiDllInitialize+0x1c
USER32!_UserClientDllInitialize+0x32f
ntdll!LdrpCallInitRoutine+0x14
ntdll!LdrpRunInitializeRoutines+0x26f
ntdll!LdrpLoadDll+0x453
ntdll!LdrLoadDll+0xaa
mozglue!`anonymous namespace'::patched_LdrLoadDll+0x1b0
KERNELBASE!LoadLibraryExW+0x1f7
firefox!XPCOMGlueLoad+0x23c
firefox!XPCOMGlueStartup+0x1d
firefox!InitXPCOMGlue+0xba
firefox!NS_internal_main+0x5c
firefox!wmain+0xbe
firefox!__tmainCRTStartup+0xfe
kernel32!BaseThreadInitThunk+0xe
ntdll!__RtlUserThreadStart+0x70
Flags: needinfo?(bas)
CreateWindowExW is being patched by:
00000000 0031d458 00000001 nvd3d9wrap!setDeviceHandle+0x1c91
722b0000 00000001 00000000 nvd3d9wrap!initialise+0x373
722b0000 00000001 00000000 nvd3d9wrap!setDeviceHandle+0x467b
722b0000 00000001 00000000 nvd3d9wrap!setDeviceHandle+0x4602
722bb5d6 722b0000 00000001 ntdll!LdrpCallInitRoutine+0x14
00000000 778bdec7 00000000 ntdll!LdrpRunInitializeRoutines+0x26f
0031d750 0031d634 00000000 ntdll!LdrpLoadDll+0x453
0031d634 0031d768 0031d750 ntdll!LdrLoadDll+0xaa
0060d874 0031d768 0031d750 mozglue!`anonymous namespace'::patched_LdrLoadDll+0x1b0
00000000 00000000 0060d874 KERNELBASE!LoadLibraryExW+0x1f7
0031d7bc 00000000 00000000 KERNELBASE!LoadLibraryExA+0x26
0031d7bc 0031dd6c 752e485c kernel32!LoadLibraryA+0xba
0031de98 00000001 00000000 nvinit+0x11cb
752c0000 00000001 00000000 nvinit+0x5477
752c0000 00000001 00000000 nvinit!nvCoprocThunk+0x6e94
752c0000 00000001 00000000 nvinit!nvCoprocThunk+0x6e1a
752ccfd9 752c0000 00000001 ntdll!LdrpCallInitRoutine+0x14
00000000 778be887 00000000 ntdll!LdrpRunInitializeRoutines+0x26f
0031e190 0031e074 00000000 ntdll!LdrpLoadDll+0x453
0031e074 0031e1a8 0031e190 ntdll!LdrLoadDll+0xaa
005f6654 0031e1a8 0031e190 mozglue!`anonymous namespace'::patched_LdrLoadDll+0x1b0
00000000 00000000 005f6654 KERNELBASE!LoadLibraryExW+0x1f7
00000058 fffdb800 00000001 kernel32!BasepLoadAppInitDlls+0x167
00000000 77afe230 00000001 kernel32!LoadAppInitDlls+0x82
77ae010a 0031e578 00000000 USER32!ClientThreadSetup+0x1f9
0031e578 00000000 0031ed48 USER32!__ClientThreadSetup+0x5
75b70718 0031ec44 75b0b970 ntdll!KiUserCallbackDispatcher+0x2e
75af0000 00000001 0031ed14 GDI32!GdiDllInitialize+0x1c
75af0000 00000001 00000000 USER32!_UserClientDllInitialize+0x32f
75b0b6ed 75af0000 00000001 ntdll!LdrpCallInitRoutine+0x14
00000000 778be60b 00000000 ntdll!LdrpRunInitializeRoutines+0x26f
0031f00c 0031eef0 00000000 ntdll!LdrpLoadDll+0x453
0031eef0 0031f024 0031f00c ntdll!LdrLoadDll+0xaa
005f6654 0031f024 0031f00c mozglue!`anonymous namespace'::patched_LdrLoadDll+0x1b0
00000000 00000000 005f6654 KERNELBASE!LoadLibraryExW+0x1f7
0135141a 76c33656 76bb17e5 firefox!XPCOMGlueLoad+0x23c
76c33656 76bb17e5 00000001 firefox!XPCOMGlueStartup+0x1d
005f7f30 005f7e78 00000001 firefox!InitXPCOMGlue+0xba
00000000 00000000 00000000 firefox!NS_internal_main+0x5c
005f7e78 ffffefa0 005f6e90 firefox!wmain+0xbe
fffde000 0032f8e4 77b09882 firefox!__tmainCRTStartup+0xfe
fffde000 7788f02b 00000000 kernel32!BaseThreadInitThunk+0xe
0135257b fffde000 00000000 ntdll!__RtlUserThreadStart+0x70
0135257b fffde000 00000000 ntdll!_RtlUserThreadStart+0x1b
Monitoring VirtualProtect calls made against addresses containing code, I see these functions are being hooked by these NVIDIA DLLs (this list maight not be complete): 

ole32!CoSetProxyBlanket
ole32!CoCreateInstance
GDI32!NtGdiDdDDIQueryAdapterInfo
GDI32!NtGdiDdDDIGetDisplayModeList
USER32!CreateWindowExW
USER32!ChangeDisplaySettingsExW
USER32!EnumDisplayDevicesW
USER32!EnumDisplayDevicesA
USER32!DisplayConfigGetDeviceInfo
kernel32!RegSetValueExW
kernel32!RegSetValueExA
kernel32!RegDeleteValueW
kernel32!RegQueryValueExW
kernel32!K32GetModuleInformation
kernel32!K32GetMappedFileNameW
kernel32!K32EnumProcessModulesEx
KERNELBASE!FreeLibrary
KERNELBASE!LoadLibraryExW
KERNELBASE!GetModuleHandleExW
KERNELBASE!GetModuleHandleW
The NVIDIA stuff all happens during the loading of the NVINIT stuff and is early enough such that the Firefox process is still single-threaded. Our function hooking code *does* correctly patch the already-hooked CreateWindowExW.

I'm wondering, however, if there is some kind of edge case where the NVINIT dll could be injected off main thread. AppInit_DLLs depends on user32.dll being loaded, and we delayload user32.dll.
We only delayload with mozglue.dll, so by the time we load xul.dll we should have brought in the Optimus stuff.
(In reply to Aaron Klotz [:aklotz] (please use needinfo) from comment #29)
> It looks like these crashes went away when 44 went to Aurora (!), so I'd
> suggest that this be tested with 42 Beta 9.

Aaron, I was looking at the Aurora44 crash stats from the last 7 days and this is a top-crash at #5. Is there a fix that we might consider safe and worth uplifting to Aurora to address it? Thanks!
Flags: needinfo?(aklotz)
gcp, could you please take a look at your registry in HKEY_LOCAL_MACHINE\SOFTWARE\Wow6432Node\Microsoft\Windows NT\CurrentVersion\Windows and tell me what else (if anything) other than nvinit.dll is included in the AppInit_DLLs value?
Flags: needinfo?(gpascutto)
I just solved this mofo! There's a bug in Detours 2.x where if it sees the sequence of jmps from a NOP space patch, it incorrectly calculates the target address of the second jmp. It ends up copying bytes (and writing out a jmp to its trampoline) at an address that is 5 bytes prior to the intended target, corrupting everything on the way. That's why we were seeing CreateWindowExWHook show up in unrelated code like JS regex stuff.

I further confirmed this by examining gcp's crash dump above and comparing it to the bytes surrounding CreateWindowExWHook in the original binary. At CreateWindowExWHook-5, the instructions diverge from the binary, and 0xe9 (jmp opcode) is the first byte.

Corollaries:

1. Detours 2.x is fundamentally incompatible with NOP-space patching. Looking at the code, this bug was fixed in Detours 3.x;
2. We can't use NOP-space patching when Optimus Drivers are running because they use Detours 2.x;
3. There are probably a bunch of other random crashes that are caused by this that will be fixed as fallout from this;
4. The Optimus drivers may cause bad things to happen with other code that does NOP space patching, such as Windows hotfixes.

Solution:
I think that we can work around this by disabling NOP-space patching when detours 2.x is present. We could either make it conditional on the presence of the Detours DLL or on the presence of NVIDIA Optimus DLLs.
Flags: needinfo?(gpascutto)
Flags: needinfo?(aklotz)
Attached patch Patch (obsolete) — Splinter Review
This patch adds a compatibility check to WindowsNopSpacePatcher: It looks for evidence of Optimus DLLs having either already been injected or having the potential to be injected in the future. If the compat check fails, we either abort the initialization, or if we've already previously initialized, prevent any further NOP space patching.

I noticed that TestDllInterceptor should be moved to the same location as nsWindowsDllInterceptor.h but I'll file a follow-up bug for that.
Attachment #8693091 - Flags: review?(ehsan)
Comment on attachment 8693091 [details] [diff] [review]
Patch

Review of attachment 8693091 [details] [diff] [review]:
-----------------------------------------------------------------

Great find!

::: xpcom/build/nsWindowsDllInterceptor.h
@@ +146,5 @@
>  
>    void Init(const char* aModuleName)
>    {
> +    if (!IsCompatible()) {
> +      return;

Do you mind adding an NS_WARNING here, please?

@@ +164,5 @@
> +   * this NVIDIA code in our address space and disable NOP space patching if it
> +   * is. We also check AppInit_DLLs since this is the mechanism that the Optimus
> +   * drivers use to inject into our process.
> +   */
> +  bool IsCompatible()

Nit: please make this a static function.

@@ +189,5 @@
> +    // won't be loaded once user32 is initialized.
> +    HKEY hkey = NULL;
> +    if (!RegOpenKeyExW(HKEY_LOCAL_MACHINE,
> +          MOZ_UTF16("SOFTWARE\\Microsoft\\Windows NT\\CurrentVersion\\Windows"),
> +          0, KEY_READ, &hkey)) {

Since this is in HKLM, we should be asking for the minimum access rights that we need to maximize the chance of it working.  I think KEY_QUERY_VALUE should be all that you need here.

@@ +215,5 @@
> +          if (!_wsplitpath_s(token, nullptr, 0, nullptr, 0,
> +                             fname, mozilla::ArrayLength(fname),
> +                             nullptr, 0)) {
> +            // nvinit.dll is responsible for bootstrapping the DLL injection, so
> +            // so that's the library that we check for here

Nit: s/so //

@@ +218,5 @@
> +            // nvinit.dll is responsible for bootstrapping the DLL injection, so
> +            // so that's the library that we check for here
> +            const wchar_t kNvInitName[] = MOZ_UTF16("nvinit");
> +            if (!_wcslwr_s(fname, mozilla::ArrayLength(fname)) &&
> +                !wcsncmp(fname, kNvInitName, mozilla::ArrayLength(kNvInitName))) {

Why not just use _wcsnicmp() directly?
Attachment #8693091 - Flags: review?(ehsan) → review+
Attached patch Patch (r2) (obsolete) — Splinter Review
Carrying forward r+.

(In reply to :Ehsan Akhgari from comment #47)
> Comment on attachment 8693091 [details] [diff] [review]
> Patch
> 
> Review of attachment 8693091 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Great find!
> 
> ::: xpcom/build/nsWindowsDllInterceptor.h
> @@ +146,5 @@
> >  
> >    void Init(const char* aModuleName)
> >    {
> > +    if (!IsCompatible()) {
> > +      return;
> 
> Do you mind adding an NS_WARNING here, please?
Done.

> 
> @@ +164,5 @@
> > +   * this NVIDIA code in our address space and disable NOP space patching if it
> > +   * is. We also check AppInit_DLLs since this is the mechanism that the Optimus
> > +   * drivers use to inject into our process.
> > +   */
> > +  bool IsCompatible()
> 
> Nit: please make this a static function.
Done.

> 
> @@ +189,5 @@
> > +    // won't be loaded once user32 is initialized.
> > +    HKEY hkey = NULL;
> > +    if (!RegOpenKeyExW(HKEY_LOCAL_MACHINE,
> > +          MOZ_UTF16("SOFTWARE\\Microsoft\\Windows NT\\CurrentVersion\\Windows"),
> > +          0, KEY_READ, &hkey)) {
> 
> Since this is in HKLM, we should be asking for the minimum access rights
> that we need to maximize the chance of it working.  I think KEY_QUERY_VALUE
> should be all that you need here.
Done.

> 
> @@ +215,5 @@
> > +          if (!_wsplitpath_s(token, nullptr, 0, nullptr, 0,
> > +                             fname, mozilla::ArrayLength(fname),
> > +                             nullptr, 0)) {
> > +            // nvinit.dll is responsible for bootstrapping the DLL injection, so
> > +            // so that's the library that we check for here
> 
> Nit: s/so //
Fixed

> 
> @@ +218,5 @@
> > +            // nvinit.dll is responsible for bootstrapping the DLL injection, so
> > +            // so that's the library that we check for here
> > +            const wchar_t kNvInitName[] = MOZ_UTF16("nvinit");
> > +            if (!_wcslwr_s(fname, mozilla::ArrayLength(fname)) &&
> > +                !wcsncmp(fname, kNvInitName, mozilla::ArrayLength(kNvInitName))) {
> 
> Why not just use _wcsnicmp() directly?

Done. I'm not sure why I did that. I blame fatigue. :-)
Attachment #8693091 - Attachment is obsolete: true
Attachment #8693129 - Flags: review+
With vs2015 I ran into errors.
nsWindowsDllInterceptor.h(179): error C2440: 'initializing': cannot convert from 'const char16_t [13]' to 'const wchar_t *'
Great piece of detective work, Aaron.
I wonder if this ever causes problems for the patching used by the sandbox.
I seem to remember that their patching was a bit different. 

(In reply to zhoubcfan from comment #50)
> With vs2015 I ran into errors.
> nsWindowsDllInterceptor.h(179): error C2440: 'initializing': cannot convert
> from 'const char16_t [13]' to 'const wchar_t *'

I think that all of those MOZ_UTF16s should just be L"...", as we're definitely dealing with wchar_t.
You only need to use MOZ_UTF16 for char16_t, because for VS2013 char16_t = wchar_t, but VS2015 correctly defines char16_t as a separate type.
So, they need a different literal specifier. 
(Not realising this caused me a fair amount of pain on the printing work I've just been doing.)
(In reply to Bob Owen (:bobowen) from comment #51)
> Great piece of detective work, Aaron.
> I wonder if this ever causes problems for the patching used by the sandbox.
> I seem to remember that their patching was a bit different. 

It might be worth auditing the code to make sure. The Chromium sandbox uses just about every function interception technique under the sun (and them some), so it is possible ;-)
It's a shutdown crash, the reported "test" is garbage because it's just whatever was the last test to run in that suite (or in that chunk of the suite, for the suites that run a directory of tests, shut down, and start up again on the next directory).
(In reply to Phil Ringnalda (:philor) from comment #55)
> It's a shutdown crash, the reported "test" is garbage because it's just
> whatever was the last test to run in that suite (or in that chunk of the
> suite, for the suites that run a directory of tests, shut down, and start up
> again on the next directory).

Were there any of these failures that you know of that were *not* on x64? The NOP space patcher is disabled on x64...
Flags: needinfo?(philringnalda)
As you can see from the links to the logs, they were all on Win8, which has x64 in the jobname, which is the complete extent of my knowledge of whether or not it is actually x64, much less whether or not something that should be disabled is successfully disabled.

But if you need help retriggering lots of times after you push it to try if you think that the reason these crashes came to a sudden complete stop when I backed you out was coincidence, that's something I actually *do* have info about.
Flags: needinfo?(philringnalda)
Aaron, I am trying to look through 44+ tracked bug and it seems we have a fix ready but it was backed out due to test failures. Are those test failures still an issue? If not, do you think we should consider uplifting the fix to Beta44? Crash fixes are generally good uplift candidates. Please let me know.
Flags: needinfo?(aklotz)
This should only be an issue on 45+.
Flags: needinfo?(aklotz)
(In reply to Aaron Klotz [:aklotz] (please use needinfo) from comment #59)
> This should only be an issue on 45+.

Clarification: this should only be a *serious* issue on 45+. It will probably improve crash rates across the board, but given the problems we had the first time I checked this in, I'm going to want to let this bake a while before deciding on uplifting.
Attached patch Patch (r3) (obsolete) — Splinter Review
Rebased, made some minor fixes to avoid conflicts. Carrying forward r+.
Attachment #8693129 - Attachment is obsolete: true
Attachment #8705206 - Flags: review+
(In reply to Aaron Klotz [:aklotz] (please use needinfo) from comment #60)
> (In reply to Aaron Klotz [:aklotz] (please use needinfo) from comment #59)
> > This should only be an issue on 45+.
> 
> Clarification: this should only be a *serious* issue on 45+. It will
> probably improve crash rates across the board, but given the problems we had
> the first time I checked this in, I'm going to want to let this bake a while
> before deciding on uplifting.

I was leaning towards wontfix'ing it for FF44 until I looked at crash-stats and those don't like nice. We've had ~115 crashes on 44.0b6 with the first signature here. :( I will still keep this around and come back to in a week or so to see if it has baked enough.
https://hg.mozilla.org/mozilla-central/rev/e135879cff29
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla46
i think the "mozilla::UniquePtr<T>::reset" signature popping up since 44.0b6 is related to this as well, as it has a similar profile of intel/nvidia dual gpus on affected installations:
https://crash-stats.mozilla.com/search/?signature=%3Dmozilla%3A%3AUniquePtr%3CT%3E%3A%3Areset&version=44.0b&_facets=signature&_facets=version&_facets=app_notes&_facets=build_id&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-app_notes

on early beta7 data this is making up 5% of all crashes
[@ nsDocument::MozExitPointerLock ] also may be related, as a user in bug 1236718 is experiencing both this signature and that signature with upgraded drivers.
This is showing up in my uplift request queue because of the approvals that were given out three months ago. Can we either remove those a+ flags if they no longer are valid or re-request approval for the current release branches if this needs uplifting?
Flags: needinfo?(aklotz)
Comment on attachment 8705206 [details] [diff] [review]
Patch  (r3)

Approval Request Comment
[Feature/regressing bug #]: n/a
[User impact if declined]: Crashes for users with NVIDIA Optimus drivers
[Describe test coverage new/current, TreeHerder]: Covered by tests running on Windows, already on m-c, crash hasn't been seen on Nightly since Jan 9.
[Risks and why]: Low, disables some code in the presence of infringing drivers
[String/UUID change made/needed]: None
Flags: needinfo?(aklotz)
Attachment #8705206 - Flags: approval-mozilla-beta?
Attachment #8705206 - Flags: approval-mozilla-aurora?
(In reply to philipp from comment #65)
> i think the "mozilla::UniquePtr<T>::reset" signature popping up since 44.0b6
> is related to this as well, as it has a similar profile of intel/nvidia dual
> gpus on affected installations:
> https://crash-stats.mozilla.com/search/
> ?signature=%3Dmozilla%3A%3AUniquePtr%3CT%3E%3A%3Areset&version=44.
> 0b&_facets=signature&_facets=version&_facets=app_notes&_facets=build_id&_colu
> mns=date&_columns=signature&_columns=product&_columns=version&_columns=build_
> id&_columns=platform#facet-app_notes
> 
> on early beta7 data this is making up 5% of all crashes

From what I'm seeing, that sig is also a dup.
Crash Signature: [@ `anonymous namespace''::CreateWindowExWHook ] [@ CreateWindowInternal ] [@ mozilla::widget::TaskbarWindowPreview::Release ] [@ mozilla::ipc::SuppressedNeuteringRegion::SuppressedNeuteringRegion ] → [@ `anonymous namespace''::CreateWindowExWHook ] [@ CreateWindowInternal ] [@ mozilla::widget::TaskbarWindowPreview::Release ] [@ mozilla::ipc::SuppressedNeuteringRegion::SuppressedNeuteringRegion ] [@ mozilla::UniquePtr<T>::reset ]
Comment on attachment 8705206 [details] [diff] [review]
Patch  (r3)

A top crash fix, let's uplift to Aurora45 soon. Hopefully we can stabilize this fix for a day or two and then take this in 44.0b9.
Attachment #8705206 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
the fix seems to have its indented effect - there are no more crashes with the `anonymous namespace''::CreateWindowExWHook signature in recent nightly builds and there are also no new reports of this in yesterday's 45.0a2 build after the fix has landed (whereas there were around 10-30 per day before that).
Comment on attachment 8705206 [details] [diff] [review]
Patch  (r3)

There are some indications on Nightly/Aurora that this fix is helping. Given that this was a top crash, it makes sense to uplift to Beta44.
Attachment #8705206 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
This looked good on Nightly for a few days with no reports after 20160109030208. However, something  is triggering this crash as of yesterdays build (20160113141333) for one user that submitted 9 reports.
See Also: → 1240607
hi, bad news as the "mozilla::UniquePtr<T>::reset" signature is back in 44 rc1. it had been gone in 44.0b8 & 44.0b9 (the fix from comment #77 only landed in beta 9) - so it looks like this signature is appearing arbitrarily.

is it possible that this may be triggered by some compiler optimization?
if so there will probably be a rc2 anyway, so we should keep that under close inspection...
Flags: needinfo?(kairo)
Flags: needinfo?(aklotz)
(In reply to philipp from comment #80)
> hi, bad news as the "mozilla::UniquePtr<T>::reset" signature is back in 44
> rc1. it had been gone in 44.0b8 & 44.0b9 (the fix from comment #77 only
> landed in beta 9) - so it looks like this signature is appearing arbitrarily.
> 
> is it possible that this may be triggered by some compiler optimization?
> if so there will probably be a rc2 anyway, so we should keep that under
> close inspection...

No, this is probably due to a deficiency with the detection of the Optimus driver.

If we're going to do something for the next rc, I suggest that we take bug 1240607 which would eliminate this.
Flags: needinfo?(aklotz)
Depends on: 1243098
Flags: needinfo?(kairo)
Whiteboard: [44dll]
Backed out from 44 to hopefully address the issues being reported in bug 1243098, bug 1243914, and elsewhere. The backout should ship in Firefox 44.0.2.

https://hg.mozilla.org/releases/mozilla-release/rev/cde34eae03ba
Backed out from all other branches as well after IRC discussion with Aaron and Sylvestre.

https://hg.mozilla.org/mozilla-central/rev/576a6dcde5b6
https://hg.mozilla.org/releases/mozilla-aurora/rev/02156c26b928
https://hg.mozilla.org/releases/mozilla-beta/rev/fd105431c25d

I guess one unanswered question is whether we should also take the backout patch we'd been landing on previous releases or not for Fx45 or whether that ship has already sailed.
Status: RESOLVED → REOPENED
Flags: needinfo?(aklotz)
Resolution: FIXED → ---
Target Milestone: mozilla46 → ---
Tracking as it is an important source of crash.
Aaron, ok with a new backout? thanks
(In reply to Ryan VanderMeulen [:RyanVM] from comment #84)
> I guess one unanswered question is whether we should also take the backout
> patch we'd been landing on previous releases or not for Fx45 or whether that
> ship has already sailed.

Which backout patch are you referencing, exactly?
Flags: needinfo?(ryanvm)
The one we'd landed on earlier releases.
Flags: needinfo?(ryanvm)
(In reply to Ryan VanderMeulen [:RyanVM] from comment #87)
> The one we'd landed on earlier releases.

I still don't know which patch I am being asked about here. Please be specific.

With respect to this bug, as long as bug 1240607 stays in the tree (and it is) then as far as I'm concerned we do not need to track this bug any longer. It would still be useful to learn why it has been triggering the problems that it has been, but that does not warrant the urgency of tracking.
Flags: needinfo?(aklotz)
Depends on: 1249849
No longer depends on: 1218128
From the signatures listed this doesn't look like a topcrash anymore and aaron says in comment 88 that the main issues (startup crashes) are fixed in bug 1240607.  Untracking for 46+. Leaving the 45 tracking for sylvestre though.
This should be landed on 52 along with bug 1240977 and bug 1240848. The memory problems that were previously an issue in this patch should have been fixed by bug 1249849. Will verify.
(In reply to Aaron Klotz [:aklotz] from comment #91)
> This should be landed on 52 along with bug 1240977 and bug 1240848. The
> memory problems that were previously an issue in this patch should have been
> fixed by bug 1249849. Will verify.

Yes, the malloc situation is correct now.
Attached patch Patch (r4)Splinter Review
Rebased to 52. Carrying forward r+.
Attachment #8678986 - Attachment is obsolete: true
Attachment #8705206 - Attachment is obsolete: true
Attachment #8800905 - Flags: review+
https://hg.mozilla.org/integration/mozilla-inbound/rev/fcbf4915f370db4034386ff9f4675ad93863daca
Bug 1218473: Add check for presence of NVIDIA Optimus drivers to WindowsNopSpacePatcher; r=ehsan
https://hg.mozilla.org/mozilla-central/rev/fcbf4915f370
Status: REOPENED → RESOLVED
Closed: 8 years ago8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla52
This is showing up in my uplift queries because the old patches have beta/aurora approval and the status flags for 50 and 51 are not set. If we aren't uplifting this, can we set the status flags to unaffected or wontfix?
You need to log in before you can comment on or make changes to this bug.