Startup crashes in BaseThreadInitThunk since 2016-12-03

RESOLVED FIXED in Firefox 55

Status

()

defect
--
critical
RESOLVED FIXED
3 years ago
10 months ago

People

(Reporter: philipp, Assigned: ccorcoran)

Tracking

(Blocks 1 bug, {crash, regression, topcrash})

51 Branch
mozilla55
x86
Windows 7
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox50- wontfix, firefox51+ wontfix, firefox52+ wontfix, firefox-esr52 wontfix, firefox53+ wontfix, firefox54+ wontfix, firefox55+ fixed)

Details

(Whiteboard: [tbird crash], crash signature)

Attachments

(4 attachments, 7 obsolete attachments)

3.10 MB, application/x-zip-compressed
Details
580.57 KB, application/x-7z-compressed
Details
59 bytes, text/x-review-board-request
dmajor
: review+
Details
8.87 KB, patch
Details | Diff | Splinter Review
Reporter

Description

3 years ago
[Tracking Requested - why for this release]:
[Tracking Requested - why for this release]:

This bug was filed from the Socorro interface and is 
report bp-40f2ad2a-3d7f-4eca-86e8-69e762161208.
=============================================================

crashes with this signature have been lingering for a while for very old builds, but on 2016-12-03 they started spiking on 50 and 51 release/beta builds. on 50.0.2 this is accounting for 0.7% of browser crashes, on 51.0b6 for 1.16% .

unfortunately they apparently happen quite early during startup, so there is not much useful metadata/correlations generated in the reprts, that would point to an obvious external cause for this.

user comments for the signature: https://crash-stats.mozilla.com/signature/?release_channel=release&release_channel=beta&signature=BaseThreadInitThunk&date=%3E%3D2016-12-03T00%3A00%3A00.000Z#comments
Track 51+ as startup crash.

Hi ting,
Can you help take a look at this?
Flags: needinfo?(janus926)
The signature has been there for a long time, but it jumps up for 50.0.2 and 50.1.b5. The crashes have suffix 0xfff6 in the address with reason EXCEPTION_ACCESS_VIOLATION_EXEC. The stack from bp-e388f1ec-db23-44e1-a360-a4f8a2161213:

  ntdll!KiFastSystemCallRet
  ntdll!ZwWaitForSingleObject+0xc
  KERNELBASE!WaitForSingleObjectEx+0x98
  kernel32!WaitForSingleObjectExImplementation+0x75
  kernel32!WaitForSingleObject+0x12
  xul!google_breakpad::ExceptionHandler::WriteMinidumpOnHandlerThread(struct _EXCEPTION_POINTERS * exinfo = 0x0652f6a4, struct MDRawAssertionInfo * assertion = 0x00000000)+0x59
  xul!google_breakpad::ExceptionHandler::HandleException(struct _EXCEPTION_POINTERS * exinfo = 0x0652f6a4)+0x73
  kernel32!UnhandledExceptionFilter+0x127
  ntdll!__RtlUserThreadStart+0x62
  ntdll!_EH4_CallFilterFunc+0x12
  ntdll!_except_handler4+0x8e
  ntdll!ExecuteHandler2+0x26
  ntdll!ExecuteHandler+0x24
  ntdll!RtlDispatchException+0x127
  ntdll!KiUserExceptionDispatcher+0xf
  0x75c4fff6
  0x5b001c
  kernel32!BaseThreadInitThunk+0xe
  ntdll!__RtlUserThreadStart+0x70
  ntdll!_RtlUserThreadStart+0x1b

The changes exist on both 50.0.2 [1] and 50.1.b5 [2] are:

  Andrew McCreight — Bug 1321066 - Explicitly guard against reentrance in nsSMILTimeContainer. r=dholbert a=dveditz
  Lee Salzman — Bug 1315848 - Skia clamped gradient fix r=mchang a=ritu
  Frederik Braun — Bug 1312272 - Test that marquee event handlers are subject to CSP. r=smaug, a=gchang

Any ideas?

[1] https://hg.mozilla.org/releases/mozilla-release/pushloghtml?fromchange=d936940d5476&tochange=cc272f7d48d3
[2] https://hg.mozilla.org/releases/mozilla-beta/pushloghtml?fromchange=1b954c82dd04&tochange=9afe68360fa8
Flags: needinfo?(lsalzman)
Flags: needinfo?(janus926)
Flags: needinfo?(continuation)
The crashes from bug 1321066 are release asserts, and look very distinct (inside SMIL code), so I don't think that could be the issue.
Flags: needinfo?(continuation)
Though we're crashing inside breakpad, so I guess it could be anything. Ted, do you have any idea what might be going wrong here?
Flags: needinfo?(ted)
Component: General → Breakpad Integration
Product: Core → Toolkit
I do not believe fix for bug 131584 is related to this.
Flags: needinfo?(lsalzman)
It doesn't seem a crash inside breakpad, but executing code in improper address from the thread function. The stack may be corrupted.
Ah, sorry.
Component: Breakpad Integration → General
Flags: needinfo?(ted)
Product: Toolkit → Core
(In reply to Ting-Yu Chou [:ting] from comment #2)
> The signature has been there for a long time, but it jumps up for 50.0.2 and
> 50.1.b5. The crashes have suffix 0xfff6 in the address with reason
> EXCEPTION_ACCESS_VIOLATION_EXEC. The stack from
> bp-e388f1ec-db23-44e1-a360-a4f8a2161213:

FYI--when you get a stack from a minidump you need to make sure you are looking at the stack starting at the *exception record*. In WinDbg this requires running the command `.ecxr`. I don't remember if there's any special action required in Visual C++.

Otherwise you see the obvious answer that yes, we write minidumps from Breakpad code. :)
The crashes from comment 0 and comment 2 both have the same stack on the main thread:
 0 	ntdll.dll	KiFastSystemCallRet	
1 	user32.dll	NtUserMessageCall	
2 	user32.dll	ImeWndCreateHandler	
3 	user32.dll	SendMessageW	
4 	xul.dll	MessageWindow::SendRequest()	toolkit/xre/nsNativeAppSupportWin.cpp:591
5 	xul.dll	nsNativeAppSupportWin::Start(bool*)	toolkit/xre/nsNativeAppSupportWin.cpp:682
6 	xul.dll	XREMain::XRE_mainStartup(bool*)	toolkit/xre/nsAppRunner.cpp:3745
7 	xul.dll	XREMain::XRE_main(int, char** const, nsXREAppData const*)	toolkit/xre/nsAppRunner.cpp:4409
8 	xul.dll	XRE_main	toolkit/xre/nsAppRunner.cpp:4515 

Frame 4 is https://hg.mozilla.org/releases/mozilla-release/annotate/cc272f7d48d3/toolkit/xre/nsNativeAppSupportWin.cpp#l591

The Firefox process is trying to remote to an existing Firefox process.
For some reasons this happens only on Windows 7.
OS: Windows → Windows 7
Crash Signature: [@ BaseThreadInitThunk] → [@ BaseThreadInitThunk] [@ igd11dxva32.dll | BaseThreadInitThunk]
Note: every crash I've looked at (51.0b9) has thread 0 sitting in  	mozilla::net::nsSocketTransportService::GetThreadSafely() or some related STS code.  

Checking what network code changed around then (or maybe xpcom thread code?) would probably help point out the regression source.

-> Networking, NI mcmanus
Component: General → Networking
Flags: needinfo?(mcmanus)
nothing comes to mind.. but honestly that's far enough back that if it wasn't a major change git is a better resource than my memory.
Flags: needinfo?(mcmanus)
I think it is not related, but in bug 1323084 we have some shutdown hangs in Thread shutdown code. The starting date is different, I would say not related.
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #11)
> The Firefox process is trying to remote to an existing Firefox process.

It doesn't seem related, bp-4a29ceec-2bbe-41b3-9423-9e31b2161210 is a similar crash but the main thread has different stack, and it crashed right after installation.

From the stack content of the crashing thread, it seems the thread function pointer for _RtlUserThreadStart is invalid.
Component: Networking → General
Reporter

Comment 18

2 years ago
this startup crash is still somewhat on the rise in the last couple of days: http://bit.ly/2hHKbhu

maybe the following crash comment provides some insights of what's going on:
"how do i fix this because ever since i downloaded this thing it installed virus (nova rambler) and then it messed up all my browsers so i removed parts of it but it still messes some of my browsers except opera"
https://crash-stats.mozilla.com/report/index/898bff61-cb26-4aab-a195-a5dab2170101
I noticed two comments:

bp-688468cf-9d28-47fb-956e-dfe442170108:
  I installer some download manager. Mozilla start crashing after installing the software. I uninstalled the software but it keeps crashing my browser
bp-1adfb644-b551-4921-b55f-5443e2170103:
  I DON'T EVEN KNOW WHAT HAPPENED. I JUST DOWNLOADED IDM AND THIS HAPPENED

So I installed IDM (Internet Download Manager v627build2) on Windows7, and then it crashes every time I open a page. The crash goes away after uninstalling it and it doesn't crash in safe mode, so should be related to the addon.

However the crash stack/reason don't match to what were reported, see bp-aaacb7a5-230e-45dc-977f-44edd2170109.
If you restrict the search to crashes that have the addons field defined (https://crash-stats.mozilla.com/search/?signature=%3DBaseThreadInitThunk&addons=%21__null__&product=Firefox&version=51.0b&_sort=-date&_facets=signature&_facets=addons&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-addons), there are a few malware addons, e.g.:
12 	{de71f09a-3342-48c5-95c1-4b0f17567554} 	74 	13.86 %
13 	ar1er-ewrgfdgomusix@jetpack 	56 	10.49 %

https://dxr.mozilla.org/addons/source/addons/671633
https://dxr.mozilla.org/addons/source/addons/707711

Andy, could you look at the other addons in that list and see if they are malware?

(In reply to Ting-Yu Chou [:ting] from comment #19)
> So I installed IDM (Internet Download Manager v627build2) on Windows7, and
> then it crashes every time I open a page. The crash goes away after
> uninstalling it and it doesn't crash in safe mode, so should be related to
> the addon.
> 
> However the crash stack/reason don't match to what were reported, see
> bp-aaacb7a5-230e-45dc-977f-44edd2170109.

Perhaps open a new bug for this crash and CC charles@tonec.com (working at Tonec, which builds Internet Download Manager).
Flags: needinfo?(amckay)
Depends on: 1329654

Comment 21

2 years ago
Re-directing to the review team who are the people who implement policy around malware.
Flags: needinfo?(amckay) → needinfo?(jorge)
Covered in bug 1329654.
Flags: needinfo?(jorge)
Is it possible that the thread was created from a module which have been unloaded before the thread runs?

Ted, will there be any concerns if we include (MiniDumpWithUnloadedModules | MiniDumpWithProcessThreadData) in the minidump?
Flags: needinfo?(ted)
The crash has been spiking even more since 13 Jan.
Some users are mentioning IDM.

Comment 26

2 years ago
Marco/Jorge,

I left my comment on bug - https://bugzilla.mozilla.org/show_bug.cgi?id=1329654 to get more information for addon ID {de71f09a-3342-48c5-95c1-4b0f17567554}

I am developer of the addon and trying to get information on what issues are you guys noticing so that we can get it fixed. The extensions uses approved apis and we present complete disclosure to users while extension is installed. In our testing we find no issues with the extension on all versions on Windows 7 and above.
Flags: needinfo?(mcastelluccio)
Flags: needinfo?(jorge)
Replied in bug 1329654.
Flags: needinfo?(mcastelluccio)
Flags: needinfo?(jorge)
There are also shutdown crashes that present pretty much exactly like the startup ones (e.g., https://crash-stats.mozilla.com/report/index/d29c784e-3824-45d0-b1c1-1ed6a2170117) perhaps those, if related, could be used for metadata/correlations?
When this blew up in 2011 (bug 601587) it came and went without us figuring out what's going on, except perhaps for bug 601587 comment 21, a correlation with "windows search" - http://systemexplorer.net/file-database/file/msshsq-dll/5844765
(In reply to Milan Sreckovic [:milan] from comment #28)
> There are also shutdown crashes that present pretty much exactly like the
> startup ones (e.g.,
> https://crash-stats.mozilla.com/report/index/d29c784e-3824-45d0-b1c1-
> 1ed6a2170117) perhaps those, if related, could be used for
> metadata/correlations?

Thread 2 may be interesting; it's about the only thread not sitting in an NtWaitFor.... or equivalent:

nsLayoutStylesheetCache::HTMLSheet() / FinalExceptionHandlerPad2 / wcstombs / Win32FileOutputProvider::Seek / WriteFile / NtWriteFile / NtgetcontextThread

(There's also one thread with no stack at all, but this one is interesting)

I don't really believe the HTMLSheet() frame, but the rest seems consistent
I'd love it if we figured this out but I don't think it should block 51 release at this point - since it's already a bad issue in 50. 

It seems a little odd that the signature doesn't appear in 52/53.
(In reply to Ting-Yu Chou [:ting] from comment #23)
> Ted, will there be any concerns if we include (MiniDumpWithUnloadedModules |
> MiniDumpWithProcessThreadData) in the minidump?

I don't think that should cause any problems. You'll have to get that information out using a debugger, but that should be OK.
Flags: needinfo?(ted)
(In reply to Ting-Yu Chou [:ting] from comment #23)
> Is it possible that the thread was created from a module which have been
> unloaded before the thread runs?

Or it could be something like CreateRemoteThread.
See Also: → 1334027
This seems to be the #1 browser crash in 52.0b
It's the top crash for 51.0.1 still. I'm blocking updates (excluding in Play Store) for a few more of the devices most severely affected by the startup crash:  

  Dell Venue 8 yellowtail
  ZTE V975– redhookbay
  acer B1-730HD vespa
  AcerA1-830– ducati
  AsusTransformer Pad (TF103CG)– K018

With these blocked, I am going to go ahead and enable updates for 51.0 Fennec.
Oops, I just commented in the wrong bug ! Should be in bug 1318667.
I've contacted one of the users who experienced the crash. He said he downloaded "Robosizer" from http://www.robosizer.com/download.html, which redirected to http://www.snapfiles.com/get/robosizer.html, which redirected to http://www.snapfiles.com/php/sfdwnld.php?id=112684&a=7144899&loc=2.

However, he was able to simply remove it and, after that, Firefox started working again.
He also said: "I wonder if Robosizer works in Windows 10 machines. I have another desktop pc running Win7, and Robosizer works as it should, without causing problems.".
Reporter

Comment 40

2 years ago
Posted file RoboSizer_Setup.zip
i can corroborate the observations of the user - installing the robosizer tool (see .zip) on win10 caused crashing or silent failures when trying to launch firefox or thunderbird, whereas on windows 7 it caused no apparent issues.
Reporter

Updated

2 years ago
See Also: → 1335080
Another user replied, he downloaded a file from http://sharkdownloads.com/compressdecompress/winrar-5-31-final-crack-full/.

He had to remove the Firefox installation directory and install Firefox again to get it working again.
This is the only file I could download from that page (from one of the comments).

I haven't tested it yet.
Reporter

Comment 43

2 years ago
i wasn't able to reproduce crashes by anything that i could download from that page in comment #41 :-(
Robosizer has NtCreateThreadEx() called with invalid function pointer with stack:

: kd> !wow64exts.k
Walking Native Stack... 
 # Child-SP          RetAddr           Call Site
00 ffff8701`eceb32e8 fffff801`f90c2fab nt!PspCreateThread
01 ffff8701`eceb32f0 fffff801`f8d5e393 nt!NtCreateThreadEx+0x1ef
02 ffff8701`eceb3a90 00007fff`76e17784 nt!KiSystemServiceCopyEnd+0x13
03 00000000`03a9e848 00000000`50061790 ntdll!NtCreateThreadEx+0x14
04 00000000`03a9e850 00000000`500610de wow64!Wow64NtCreateThreadEx+0xc8
05 00000000`03a9e8d0 00000000`50056ea5 wow64!whNtCreateThreadEx+0x41e
06 00000000`03a9eab0 00000000`50131cf7 wow64!Wow64SystemServiceEx+0x155
07 00000000`03a9f370 00000000`5006bfa1 wow64cpu!ServiceNoTurbo+0xb
08 00000000`03a9f420 00000000`5005cbb0 wow64!RunCpuSimulation+0xf311
09 00000000`03a9f4a0 00007fff`76dea10d wow64!Wow64LdrpInitialize+0x120
0a 00000000`03a9f750 00007fff`76de9fae ntdll!_LdrpInitialize+0x109
0b 00000000`03a9f7d0 00000000`00000000 ntdll!LdrInitializeThunk+0xe
Walking Guest (WoW) Stack... 
 # ChildEBP          RetAddr           
00 03b9f134 10008c3d ntdll_776c0000!NtCreateThreadEx+0xc
WARNING: Stack unwind information not available. Following frames may be wrong.
01 03b9f82c 76f49fb3 EasyHook32!RhIsX64Process+0x23d
02 (Inline) -------- combase!CMessageCall::CleanupMonitorLock+0x5 [d:\rs1\onecore\com\combase\dcomrem\call.cxx @ 578]
03 03b9fc60 00000000 combase!CMessageCall::~CMessageCall+0x43 [d:\rs1\onecore\com\combase\dcomrem\call.cxx @ 522]

I am not sure how does Robosizer use EasyHook, but after I replace "C:\Program Files (x86)\RoboSizer\EasyHook32.dll" with the latest version (2.7.6035.0), Firefox *does not* crash.
(In reply to Ting-Yu Chou [:ting] from comment #44)
> Robosizer has NtCreateThreadEx() called with invalid function pointer with

Actually it's not an invalid function pointer for NtCreateThreadEx(), I just sorted things out for what happen to Robosizer, and will update the details to bug 1335080 because I don't know how many of this crash come from Robosizer.
I just realized the crashes are happened with cpu arch x86, sadly we don't have one in Taipei office.
Reporter

Comment 47

2 years ago
cpu_arch x86 in socorro only means that it is a 32bit build of firefox that is crashing afaik...
Thanks for the information, then I'd like to confirm the meaning of the "os_arch" in correlation tab:

  (96.27% in signature vs 35.68% overall) os_arch = x86 [99.42% vs 54.79% if startup_crash = 1]

Is the x86 here means the OS is a 32bit one?  I know it's a silly question, but I want to make sure I understand it correctly.
(In reply to Ting-Yu Chou [:ting] from comment #48)
> Thanks for the information, then I'd like to confirm the meaning of the
> "os_arch" in correlation tab:
> 
>   (96.27% in signature vs 35.68% overall) os_arch = x86 [99.42% vs 54.79% if
> startup_crash = 1]
> 
> Is the x86 here means the OS is a 32bit one?  I know it's a silly question,
> but I want to make sure I understand it correctly.

Not a silly question, the wording is confusing and could be improved!
os_arch is the architecture of the OS (detected from the amount of virtual memory available), cpu_arch is the arch for which Firefox was built.

So yes, here the OS is almost always 32 bit.
Great, I'll reinstall the Win7 to 32bit and give it another try to reproduce the crash.
This is now top #3 at release channel (51.0.1).
Keywords: topcrash
See Also: → 1290403
(In reply to Ting-Yu Chou [:ting] from comment #50)
> Great, I'll reinstall the Win7 to 32bit and give it another try to reproduce
> the crash.

Still unable to reproduce, tried also IDB but didn't crash. :(
(In reply to Ting-Yu Chou [:ting] from comment #52)
> Still unable to reproduce, tried also IDB but didn't crash. :(

Correction: IDM
An interesting comment from bp-002a95d8-952a-47ba-868d-ec1912170217:

  It seems that installing RemoveWAT causes Firefox to become unusable. From console output, I'm suspicious that it modified some files. 

I'll see if I can reproduce.
I installed RemoveWAT 2.2.7 from https://www.sendspace.com/file/42onal, which installed a bunch of malwares.  Firefox now keeps opening ad pages every few seconds but it doesn't crash every time I launch it, though when it crashes the signature is matched and the address is ended with "fff6" which takes 50% of the crashes in the past week.

I am writing to the user to get the RemoveWAT he/she installed.
The user doesn't keep the RemoveWAT he installed. However, from the binary I have, the injected code for the thread is doing:

  00020000 55              push    ebp
  00020001 8bec            mov     ebp,esp
  00020003 51              push    ecx
  00020004 53              push    ebx
  00020005 56              push    esi
  00020006 57              push    edi
  00020007 8b7d08          mov     edi,dword ptr [ebp+8]
  0002000a 8d4720          lea     eax,[edi+20h]
  0002000d 50              push    eax
  0002000e 57              push    edi
  0002000f ff97c8010000    call    dword ptr [edi+1C8h]  // [1]
  00020015 50              push    eax
  00020016 ff97cc010000    call    dword ptr [edi+1CCh]  // [2] CRASH!

The first call [1] goes to LoadLibraryA() for kernel32.dll, and the second call [2] goes to apphelp!StubGetProcAddress() which is invalid for Firefox process as apphelp.dll is not loaded. Robosizer (bug 1335080) is similiar, just it calls AcLayers!NS_Armadillo::APIHook_GetProcAddress().

This is the reason for the crashes with address suffix "fff6". But I don't know why the number went high after the specific date, maybe a new malware(?) was born that day...
What if we load apphelp.dll earlier by ourselves?
What do you think if we add something like:

  https://hg.mozilla.org/mozilla-central/annotate/e1135c6fdc9bcd80d38f7285b269e030716dcb72/toolkit/mozapps/update/updater/loaddlls.cpp#l11

to preload apphelp.dll in firefox.exe? Or do you have some other ideas to address comment 56?
Flags: needinfo?(benjamin)
I don't think I'm the right person to be asking. I haven't done any debugging of this crash. But updater.exe isn't what's crashing here, so I confident that you don't want to be modifying the updater DLL loading mechanism.

What is apphelp? Why would we want it loaded at all?
Flags: needinfo?(benjamin)
(In reply to Benjamin Smedberg [:bsmedberg] from comment #59)
> debugging of this crash. But updater.exe isn't what's crashing here, so I
> confident that you don't want to be modifying the updater DLL loading
> mechanism.

What I am thinking is a similar DLL loading mechanism for firefox.exe.
 
> What is apphelp? Why would we want it loaded at all?

Honestly I don't know what is apphelp.dll, there's not much information about it, the answer from google is: it's a part of Application Compatibility Client Library from Microsoft.  This issue is about dll injection, which a 3rd party software injects a piece of code into firefox process, and the code calls the function pointers of LoadLibrary() and GetProcAddress() that it retrieved earlier in its process.  However, somehow apphelp.dll or AcLayers.dll was loaded in its process, so the function pointer of GetProcAddress() is actually apphelp!StubGetProcAddress() or AcLayers!NS_Armadillo::APIHook_GetProcAddress() which is invalid for firefox process, because they haven't been loaded when the code runs.

Do you know anyone who is familiar with this that I can talk to?
If people are dynamically adding broken code to our process, I don't think we should try to fix that by preloading DLLs that we don't need ourselves. Each DLL load costs startup I/O.

If we can't resolve this with DLL blocklisting, I recommend WONTFIX.
(In reply to Benjamin Smedberg [:bsmedberg] from comment #61)
> If people are dynamically adding broken code to our process, I don't think
> we should try to fix that by preloading DLLs that we don't need ourselves.
> Each DLL load costs startup I/O.
> 
> If we can't resolve this with DLL blocklisting, I recommend WONTFIX.

As of today this is still top #3 crash (note it is a startup one) in release channel, shouldn't we try to prevent us from being crashed by this kind of code?  User probably wouldn't know which software breaks firefox, and after trying reinstall, they may turn to the other browsers and blame us.

We can measure the cost of preloading DLLs and then decide whether it's worth or not.  Or we can prohibit the other processes to set any piece of memory in firefox process executable (I don't know if it's doable).  Either way, I think we should do something.

:njn, what do you think?

:dmajor, :aklotz, do you have any other ideas how we could deal with this issue (see comment 60 for the brief description)?
Flags: needinfo?(n.nethercote)
Flags: needinfo?(dmajor)
Flags: needinfo?(aklotz)
(In reply to Ting-Yu Chou [:ting] from comment #56)
>   0002000f ff97c8010000    call    dword ptr [edi+1C8h]  // [1]
>   00020015 50              push    eax
>   00020016 ff97cc010000    call    dword ptr [edi+1CCh]  // [2] CRASH!
> 
> The first call [1] goes to LoadLibraryA() for kernel32.dll, and the second
> call [2] goes to apphelp!StubGetProcAddress() which is invalid for Firefox
> process as apphelp.dll is not loaded.

If apphelp is not loaded, how do you know that this call is going to apphelp!<something> ?
I think the fix alluded to by bug 1335080 comment 2 is: https://easyhook.codeplex.com/SourceControl/changeset/74620

So the process that's doing the injection is taking its own function pointer for GetProcAddress and expecting the target's (Firefox's) address space to have the same function at the same address? And things are going wrong because one half of this pair is using apphelp and the other isn't?

I really don't think we should be loading up apphelp just to let some code use a dodgy injection technique on us.

I'll let aklotz chime in if he has any techniques to block this app, otherwise we should look for a non-code solution (contact AVs, support pages, etc).
Flags: needinfo?(dmajor)
(In reply to David Major [:dmajor] from comment #63)
> If apphelp is not loaded, how do you know that this call is going to
> apphelp!<something> ?

I switched (.process) to the suspicious process, and used "ln <function pointer>".
> Honestly I don't know what is apphelp.dll, there's not much information
> about it, the answer from google is: it's a part of Application
> Compatibility Client Library from Microsoft.

Apphelp (and AcLayers) are Windows code that provides backwards-compatibility fixes by shimming APIs, e.g. making GetWindowsVersion lie to keep some old program happy.

I don't remember the details but I've seen this stuff used for questionable purposes in the past, since it's a way to introduce code into a process well before main().
While I don't want to enable evil code injection, startup crashes also mean dead-to-users installs.

Perhaps if we detect a startup crash in a previous run we can start in maximum-chance-of-starting mode (safe mode), and if *that* crashes, try enabling apphelp (and once it comes up, giving the user a warning that there may be malware affecting their Firefox installation and provide a link to more info).  We can discuss if we want to allow browsing or just tell the user and exit.
I'm in favour of any and all reasonable techniques that block third party code injection. I don't have much to add beyond that, sorry.
Flags: needinfo?(n.nethercote)

Updated

2 years ago
Whiteboard: [tbird crash]
One of the action items for the InjectEject project is to intercept BaseThreadInitThunk and examine the thread start address for known malicious entry points (such as LoadLibrary).

A simple mitigation for this could be to add checks inside that hook to ensure that BaseThreadInitThunk's thread start address points to valid executable memory. If we wanted to be even more aggressive we could ensure that said executable memory was backed by an image.

But this hook doesn't exist yet! :-)
Flags: needinfo?(aklotz)
(In reply to Aaron Klotz [:aklotz] from comment #69)
> One of the action items for the InjectEject project is to intercept
> BaseThreadInitThunk and examine the thread start address for known malicious
> entry points (such as LoadLibrary).
> 
> A simple mitigation for this could be to add checks inside that hook to
> ensure that BaseThreadInitThunk's thread start address points to valid
> executable memory. If we wanted to be even more aggressive we could ensure
> that said executable memory was backed by an image.
> 
> But this hook doesn't exist yet! :-)

Is there a bug tracking the addition of this hook?
Ah heck, I can do this right now.

This patch adds some code to the DLL blocklist (for the lack of a better location) that hooks kernel32!BaseThreadInitThunk and verifies that the thread start address belongs to VM that is committed, execute-read, and backed by an image. If that test fails, the thread exits.
Attachment #8850169 - Flags: review?(dmajor)
Comment on attachment 8850169 [details] [diff] [review]
Hook BaseThreadInitThunk and verify thread start address

Ah crap, this needs 64-bit interceptor changes.
Attachment #8850169 - Flags: review?(dmajor)
[Tracking Requested - why for this release]: Top startup crash in 52, even worse than in 51.
This adds support for:
test r/m32, r32
jne rel8

and expands x64 support by also accepting REX.WRB and REX.WB
Assignee: nobody → aklotz
Attachment #8850209 - Flags: review?(m_kato)
Comment on attachment 8850169 [details] [diff] [review]
Hook BaseThreadInitThunk and verify thread start address

OK, now this patch works.
Attachment #8850169 - Flags: review?(dmajor)
Comment on attachment 8850169 [details] [diff] [review]
Hook BaseThreadInitThunk and verify thread start address

Review of attachment 8850169 [details] [diff] [review]:
-----------------------------------------------------------------

I am concerned about this code not providing any feedback if things go wrong, for example:
- Someone already hooked BaseThreadInitThunk in a way that our disassembler no longer understands
- Silently blocking some "good" threads that fail our requirements
- Some future version of Windows adds a new protection flag that makes us block the world

We may want to add some logging about what we blocked or why we failed.
+bsmedberg in case he has other concerns in this regard.

::: mozglue/build/WindowsDllBlocklist.cpp
@@ +277,5 @@
>  }
>  
>  namespace {
>  
> +typedef void (__fastcall* BaseThreadInitThunk_func)(BOOL aIsInitialThread, void* aStartAddress, void* aThreadParam);

Please double-check that this signature looks the same on all platforms (including XP if this is going to be uplifted).

@@ +700,5 @@
>  
>    return stub_LdrLoadDll(filePath, flags, moduleFileName, handle);
>  }
>  
> +static bool

This could use a comment explaining why the lack of these conditions means we should block the thread.

@@ +701,5 @@
>    return stub_LdrLoadDll(filePath, flags, moduleFileName, handle);
>  }
>  
> +static bool
> +ShouldBlockThread(void* aStartAddress, void* aThreadParam)

Nit: aThreadParam unused

@@ +707,5 @@
> +  bool shouldBlock = false;
> +  MEMORY_BASIC_INFORMATION startAddressInfo;
> +  if (VirtualQuery(aStartAddress, &startAddressInfo, sizeof(startAddressInfo))) {
> +    shouldBlock |= startAddressInfo.State != MEM_COMMIT;
> +    shouldBlock |= startAddressInfo.Protect != PAGE_EXECUTE_READ;

Just want to confirm: are these really the only acceptable values?
Is there any valid scenario in which code might be demand-committed or RWX, for example?
Attachment #8850169 - Flags: feedback?(benjamin)
It would be so amazing to be able to prevent this! Tracking, in a hopeful way, for 53 onwards.
(In reply to David Major [:dmajor] from comment #76)
> Comment on attachment 8850169 [details] [diff] [review]
> Hook BaseThreadInitThunk and verify thread start address
> 
> Review of attachment 8850169 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> I am concerned about this code not providing any feedback if things go
> wrong, for example:
> - Someone already hooked BaseThreadInitThunk in a way that our disassembler
> no longer understands

In that case the original function remains untouched, which makes us no worse off than we are now.

> - Some future version of Windows adds a new protection flag that makes us
> block the world

I think that's a risk we take any time we do something like this. It's not much different from LdrLoadDll in that regard.

> 
> We may want to add some logging about what we blocked or why we failed.

This is fair. I suppose we could somehow record the actual protection attributes of any blocked start addresses.

> +bsmedberg in case he has other concerns in this regard.
> 
> ::: mozglue/build/WindowsDllBlocklist.cpp
> @@ +277,5 @@
> >  }
> >  
> >  namespace {
> >  
> > +typedef void (__fastcall* BaseThreadInitThunk_func)(BOOL aIsInitialThread, void* aStartAddress, void* aThreadParam);
> 
> Please double-check that this signature looks the same on all platforms
> (including XP if this is going to be uplifted).

It is identical from Windows 7 through to Windows 10 Anniversary Update on both 32-bit and 64-bit implementations.

> 
> @@ +700,5 @@
> >  
> >    return stub_LdrLoadDll(filePath, flags, moduleFileName, handle);
> >  }
> >  
> > +static bool
> 
> This could use a comment explaining why the lack of these conditions means
> we should block the thread.

Sure.

> 
> @@ +701,5 @@
> >    return stub_LdrLoadDll(filePath, flags, moduleFileName, handle);
> >  }
> >  
> > +static bool
> > +ShouldBlockThread(void* aStartAddress, void* aThreadParam)
> 
> Nit: aThreadParam unused

Ah, yeah, I plan to use that in the future but it isn't needed yet. I'll remove for now.

> 
> @@ +707,5 @@
> > +  bool shouldBlock = false;
> > +  MEMORY_BASIC_INFORMATION startAddressInfo;
> > +  if (VirtualQuery(aStartAddress, &startAddressInfo, sizeof(startAddressInfo))) {
> > +    shouldBlock |= startAddressInfo.State != MEM_COMMIT;
> > +    shouldBlock |= startAddressInfo.Protect != PAGE_EXECUTE_READ;
> 
> Just want to confirm: are these really the only acceptable values?
> Is there any valid scenario in which code might be demand-committed or RWX,
> for example?

At least part of this is a policy decision. I don't really think that we should be allowing RWX, though perhaps we should punt on that for the purposes of this bug, just check for X, and worry about the rest as part of InjectEject efforts. If we added recording of thread blocks, we could follow up on this. I would tend to think that any legitimate binaries would not be using RWX, but I suppose that isn't an assumption we can make.

As for the demand-commit thing, here is my thought process: We know we're dealing with the start address of the thread. Let us hypothetically consider the possibility that its page is intentionally inaccessible. BaseThreadInitThunk invokes the thread function, an AV exception is raised (as is the case with whatever is causing this crash right now). Where can user code catch that AV exception in order to perform the delayed commit? The only thing I can think of that could make this happen is if somebody were to override our unhandled exception filter, which our breakpad integration code actively tries to prevent.
> As for the demand-commit thing, here is my thought process: We know we're
> dealing with the start address of the thread. Let us hypothetically consider
> the possibility that its page is intentionally inaccessible.
> BaseThreadInitThunk invokes the thread function, an AV exception is raised
> (as is the case with whatever is causing this crash right now). Where can
> user code catch that AV exception in order to perform the delayed commit?

In a VectoredExceptionHandler. The Win64 ASan runtime does this for shadow-memory pages. I admit though that taking this approach for actual executable code does seem a bit farfetched.
> > - Someone already hooked BaseThreadInitThunk in a way that our disassembler
> > no longer understands
> 
> In that case the original function remains untouched, which makes us no
> worse off than we are now.

Let's say we ship this and it doesn't fix the current crash. True, we're no worse off, but we still have a high-impact crash on our hands. It would be useful to get some diagnostics to inform a subsequent attempt at a fix.
Not all code injection cause crash like this, shouldn't we leave user the choice (they may be aware of it) whether to run it or not?  If our goal is to get more users on firefox...
Attachment #8850209 - Flags: review?(m_kato) → review+
Should this land on m-c? Do you want to land it behind a pref and run some kind of experiment? 
If it needs a policy decision, who should make that?
Flags: needinfo?(aklotz)
Comment on attachment 8850169 [details] [diff] [review]
Hook BaseThreadInitThunk and verify thread start address

I feel like the MEM_IMAGE check could be pretty risky, and I think it's worth putting some numbers on this before rolling it out. In particular, here are the things we don't know:

* how many people could be affected by this today
* whether we'd accidentally be breaking something real
* whether this would turn crashes into hangs

Elaborating on this last bit: a common thread launch mechanism is to launch a thread and then wait on an event for the thread (immediately or eventually). If the thread never runs, that wait is going to end up hanging. And in general I'd say a hang is worse than a crash.

Also I'm a little worried that this code will break crashfirefox.exe.

In terms of risk, can we measure this before we start blocking things?
Attachment #8850169 - Flags: feedback?(benjamin) → feedback-
(In reply to Benjamin Smedberg [:bsmedberg] from comment #83)
> Comment on attachment 8850169 [details] [diff] [review]
> Hook BaseThreadInitThunk and verify thread start address

> Elaborating on this last bit: a common thread launch mechanism is to launch
> a thread and then wait on an event for the thread (immediately or
> eventually). If the thread never runs, that wait is going to end up hanging.
> And in general I'd say a hang is worse than a crash.
> 

From the perspective of the kernel, the thread *does* run, it's just that the user's thread proc does not. When we "block" the thread from running, we are already running as that new thread, so we just call ExitThread on ourselves, thus "completing" the thread's execution as far as any kernel handles are concerned.

> Also I'm a little worried that this code will break crashfirefox.exe.

We can exempt null pointers.

> 
> In terms of risk, can we measure this before we start blocking things?

Sure. But I'm going to turn this over to ccorcoran go from here.
Assignee: aklotz → ccorcoran
Flags: needinfo?(aklotz)
FWIW, if we can land this without MEM_IMAGE now (very low risk, should fix the crash at hand?) and then deal with the MEM_IMAGE measurement and landing later, that would probably help keep this moving.
Comment hidden (mozreview-request)
Assignee

Comment 89

2 years ago
Ok my first checkin; seems the review requests went overkill. Let me know if I need to clean anything up.
Basically the last commit contains my changes:

- NULL thread procs are allowed (and will thus crash). This allows crashfirefox.exe to work. Send ((LPTHREAD_START_ROUTINE)1) to silently block the thread.
- Removed the check for MEM_IMAGE. agree this would be safer
- I'm afraid of calling ExitThread() from BaseThreadInitThunk; if we want NOP behavior, I think it's slightly safer to use an empty thread proc. Just in case the original BaseThreadInitThunk does magic that's needed.


> In terms of risk, can we measure this before we start blocking things?

Shall I create another bug to add measurement? Seems like a good idea to annotate crash reports with this.


Ting: 

> Not all code injection cause crash like this, shouldn't we leave user the choice (they may be aware of it) whether to run it or not?  If our goal is to get more users on firefox...

IMO this is a strange option to give to the user. By the time we give the user the choice, we know for certain that clicking "Continue" would crash.
Flags: needinfo?(benjamin)

Comment 90

2 years ago
mozreview-review
Comment on attachment 8852415 [details]
Bug 1322554: don't block threads based on MEM_IMAGE; safer exiting thread; don't block null threadproc to allow crashfirefox.exe;

https://reviewboard.mozilla.org/r/124660/#review127190

Thanks! This is looking good, but I don't think it's necessary to land separately. Splitting up patches is normally encouraged, but in this specific case, where the first patch is not quite right and the second one goes back and corrects it, I think it's better to just land the right thing the first time. (If MozReview is being uncooperative because you took this over from Aaron, it might be easier to cancel the first review and start a fresh one.)

Good idea about NopThreadProc; I too was a bit worried about the ExitThread.

A few minor nit-picks below:

::: mozglue/build/WindowsDllBlocklist.cpp:711
(Diff revision 1)
>  }
>  
>  static bool
> -ShouldBlockThread(void* aStartAddress, void* aThreadParam)
> +ShouldBlockThread(void* aStartAddress)
>  {
> +  // allows crashfirefox.exe to continue to work. also if your threadproc is null, this crash is intentional.

When your comments are in complete-sentence style (as opposed to terse same-line comments), please capitalize. Same for NopThreadProc below.

::: mozglue/build/WindowsDllBlocklist.cpp:712
(Diff revision 1)
>  
>  static bool
> -ShouldBlockThread(void* aStartAddress, void* aThreadParam)
> +ShouldBlockThread(void* aStartAddress)
>  {
> +  // allows crashfirefox.exe to continue to work. also if your threadproc is null, this crash is intentional.
> +  if(aStartAddress == 0)

Should have a space after 'if'.

::: mozglue/build/WindowsDllBlocklist.cpp:720
(Diff revision 1)
>    bool shouldBlock = false;
>    MEMORY_BASIC_INFORMATION startAddressInfo;
>    if (VirtualQuery(aStartAddress, &startAddressInfo, sizeof(startAddressInfo))) {
>      shouldBlock |= startAddressInfo.State != MEM_COMMIT;
>      shouldBlock |= startAddressInfo.Protect != PAGE_EXECUTE_READ;
> -    shouldBlock |= !(startAddressInfo.Type & MEM_IMAGE);
> +    //shouldBlock |= !(startAddressInfo.Type & MEM_IMAGE); in the future we may want to be stricter by adding this check. for now playing it safe.

In general we try not to have commented-out code in the tree since it tends to just go stale. (Same for #if 0 and similar.) We'll be adding MEM_IMAGE in a followup bug (hmm, someone should file one!) so there's no concern about forgetting about this.

::: mozglue/build/WindowsDllBlocklist.cpp:728
(Diff revision 1)
>    return shouldBlock;
>  }
>  
> +// allows blocked threads to still run normally through BaseThreadInitThunk, in case there's any magic there that we shouldn't skip.
> +DWORD WINAPI
> +NopThreadProc(void* aThreadParam)

On some of our platforms the compiler complains about unused variables. This code is Windows-only so it's probably fine, but as a general habit I would write "(void* /* aThreadParam */)" or just "(void*)".

Also I'd mark the function static since it's only for a one-off use in this file.
Attachment #8852415 - Flags: review?(dmajor)
Assignee

Updated

2 years ago
Attachment #8852413 - Attachment is obsolete: true
Attachment #8852413 - Flags: review?(m_kato)
Assignee

Updated

2 years ago
Attachment #8852414 - Attachment is obsolete: true
Attachment #8852414 - Flags: review?(dmajor)
Assignee

Updated

2 years ago
Attachment #8852415 - Attachment is obsolete: true

Comment 92

2 years ago
mozreview-review
Comment on attachment 8852516 [details]
Bug 1322554: Interpose kernel32!BaseThreadInitThunk to add verification of thread start addresses;

https://reviewboard.mozilla.org/r/124708/#review127234

Oh, I forgot about the DllInterceptor patch. That one could have stayed separate -- my fault, sorry. I don't want to cause another round of churn so don't worry about splitting them again.

::: mozglue/build/WindowsDllBlocklist.cpp:716
(Diff revision 1)
> +  // Allows crashfirefox.exe to continue to work. Also if your threadproc is null, this crash is intentional.
> +  if (aStartAddress == 0)
> +    return false;
> +
> +  bool shouldBlock = false;
> +  MEMORY_BASIC_INFORMATION startAddressInfo;

One more nit -- just to be super extra safe let's initialize this to {0}. Normally it wouldn't matter, but given the environment we're in, with other code hooking stuff and whatnot, let's be cautious.
Attachment #8852516 - Flags: review?(dmajor) → review+
We should not present anything to users.

From discussion with dmajor on IRC, we're going to proceed here with removing the MEM_IMAGE check, and if we want to add that later we'll need some form of field telemetry to see what might break.
Flags: needinfo?(benjamin)
Attachment #8850169 - Flags: review?(dmajor)
Comment hidden (mozreview-request)
Assignee

Updated

2 years ago
Keywords: checkin-needed
Backed out for breaking at least Windows 8 x64 builds (ignore the "likely all Windows builds"):

https://hg.mozilla.org/integration/autoland/rev/943594e7f5d8460ea0270ee03c297dfa49e7fe91

Push showing the failure (the push for your patch is still running, maybe slower machine): https://treeherder.mozilla.org/#/jobs?repo=autoland&revision=495a39e11056e7137b3b82e83e677d4c60994624&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=retry&filter-resultStatus=usercancel&filter-resultStatus=runnable
Failure log: https://treeherder.mozilla.org/logviewer.html#?job_id=87560660&repo=autoland

08:52:03     INFO -  Executing c:/builds/moz2_slave/autoland-w64-d-000000000000000/build/src/obj-firefox/dist\bin\xpcshell.exe -g c:\builds\moz2_slave\autoland-w64-d-000000000000000\build\src\obj-firefox\dist\bin/ -a c:\builds\moz2_slave\autoland-w64-d-000000000000000\build\src\obj-firefox\dist\bin/ -f c:/builds/moz2_slave/autoland-w64-d-000000000000000/build/src/toolkit/mozapps/installer\precompile_cache.js -e precompile_startupcache("resource://gre/");
08:52:03     INFO -  Assertion failure: false (MOZ_ASSERT_UNREACHABLE: Unrecognized opcode sequence), at c:\builds\moz2_slave\autoland-w64-d-000000000000000\build\src\obj-firefox\dist\include\nsWindowsDllInterceptor.h:1118
08:52:03     INFO -  Traceback (most recent call last):
08:52:03     INFO -    File "c:/builds/moz2_slave/autoland-w64-d-000000000000000/build/src/toolkit/mozapps/installer/packager.py", line 397, in <module>
08:52:03     INFO -      main()
08:52:03     INFO -    File "c:/builds/moz2_slave/autoland-w64-d-000000000000000/build/src/toolkit/mozapps/installer/packager.py", line 391, in main
08:52:03     INFO -      args.source, gre_path, base)
08:52:03     INFO -    File "c:/builds/moz2_slave/autoland-w64-d-000000000000000/build/src/toolkit/mozapps/installer/packager.py", line 165, in precompile_cache
08:52:03     INFO -      errors.fatal('Error while running startup cache precompilation')
08:52:03     INFO -    File "c:\builds\moz2_slave\autoland-w64-d-000000000000000\build\src\python\mozbuild\mozpack\errors.py", line 103, in fatal
08:52:03     INFO -      self._handle(self.FATAL, msg)
08:52:03     INFO -    File "c:\builds\moz2_slave\autoland-w64-d-000000000000000\build\src\python\mozbuild\mozpack\errors.py", line 98, in _handle
08:52:03     INFO -      raise ErrorMessage(msg)
08:52:03     INFO -  mozpack.errors.ErrorMessage: Error: Error while running startup cache precompilation
08:52:03     INFO -  c:/builds/moz2_slave/autoland-w64-d-000000000000000/build/src/toolkit/mozapps/installer/packager.mk:41: recipe for target 'stage-package' failed
08:52:03     INFO -  mozmake.EXE[7]: *** [stage-package] Error 1
08:52:03     INFO -  mozmake.EXE[7]: Leaving directory 'c:/builds/moz2_slave/autoland-w64-d-000000000000000/build/src/obj-firefox/browser/installer'
08:52:03     INFO -  c:/builds/moz2_slave/autoland-w64-d-000000000000000/build/src/toolkit/mozapps/installer/packager.mk:96: recipe for target 'make-package' failed
08:52:03     INFO -  mozmake.EXE[6]: *** [make-package] Error 2
08:52:03     INFO -  c:/builds/moz2_slave/autoland-w64-d-000000000000000/build/src/config/rules.mk:519: recipe for target 'default' failed
08:52:03     INFO -  mozmake.EXE[5]: *** [default] Error 2
08:52:03     INFO -  c:/builds/moz2_slave/autoland-w64-d-000000000000000/build/src/browser/build.mk:9: recipe for target 'package' failed
08:52:03     INFO -  mozmake.EXE[4]: *** [package] Error 2
08:52:03     INFO -  c:/builds/moz2_slave/autoland-w64-d-000000000000000/build/src/build/moz-automation.mk:100: recipe for target 'automation/package' failed
08:52:03     INFO -  mozmake.EXE[3]: *** [automation/package] Error 2
Flags: needinfo?(ccorcoran)
Comment hidden (mozreview-request)
Assignee

Comment 99

2 years ago
(In reply to Carl Corcoran [:ccorcoran] from comment #97)
> Created attachment 8854781 [details]
> Bug 1322554: added handling for CALL reg opcode;
> 
> Review commit: https://reviewboard.mozilla.org/r/126762/diff/#index_header
> See other reviews: https://reviewboard.mozilla.org/r/126762/

This fixes the cppunittest errors on win2008. Kernel32.dll!BaseThreadInitThunk contained:

> FFD2     call rdx

which we didn't handle. Verified fixed on win2008 & treeherder.

Still getting the hang of reviewboard; it re-requested review for the previous patch (attachment 8852516 [details]). Nothing changed there.
Flags: needinfo?(ccorcoran)

Comment 100

2 years ago
mozreview-review
Comment on attachment 8854781 [details]
Bug 1322554: Interpose kernel32!BaseThreadInitThunk to add verification of thread start addresses;

https://reviewboard.mozilla.org/r/126762/#review129568

::: xpcom/build/nsWindowsDllInterceptor.h:1117
(Diff revision 1)
>            int64_t* ptrToJmpDest = reinterpret_cast<int64_t*>(origBytes + nOrigBytes + 5 + offset);
>            intptr_t jmpDest = static_cast<intptr_t>(*ptrToJmpDest);
>            JumpPatch jump(nTrampBytes, jmpDest, JumpType::Jmp);
>            nTrampBytes = jump.GenerateJump(tramp);
>            nOrigBytes += 5;
> +        } else if ((origBytes[nOrigBytes] & (kMaskMod|kMaskReg)) == 0xd0) {// (rm = xx010xxx) | (mod = 11xxxxxx)

Please remove this comment and replace 0xd0 with BuildModRmByte(kModReg, 2, 0)
Attachment #8854781 - Flags: review?(aklotz) → review+

Comment 101

2 years ago
Pushed by rjesup@wgate.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/a50a4f9e972b
Interpose kernel32!BaseThreadInitThunk to add verification of thread start addresses; r=dmajor
Since time for 53 is short, as I indicated to ddurst earlier I landed the patch, with aklotz's nit from his review, merged into one patch.
Please nominate for aurora/beta (and probably ESR52) ASAP, thanks.
Flags: needinfo?(ccorcoran)
Assignee

Comment 104

2 years ago
Approval Request Comment
[Feature/Bug causing the regression]: 3rd-party apps
[User impact if declined]: browser crash. this is top crash & startup crash
[Is this code covered by automated tests?]: partially. if the browser doesn't crash, we know it's passing.
[Has the fix been verified in Nightly?]: just landed
[Needs manual test from QE? If yes, steps to reproduce]: no
[List of other uplifts needed for the feature/fix]: none
[Is the change risky?]: low risk
[Why is the change risky/not risky?]: 
this fix involves hooking a new system function (kernel32!BaseThreadInitThunk); some Windows binaries could potentially break our patching code. This is unlikely imo as the same hooking code is already being used for many other calls without issue.

[String changes made/needed]: none
Attachment #8850169 - Attachment is obsolete: true
Attachment #8850209 - Attachment is obsolete: true
Attachment #8852516 - Attachment is obsolete: true
Attachment #8854781 - Attachment is obsolete: true
Flags: needinfo?(ccorcoran)
Attachment #8855203 - Flags: approval-mozilla-esr52?
Attachment #8855203 - Flags: approval-mozilla-beta?
Attachment #8855203 - Flags: approval-mozilla-aurora?

Comment 105

2 years ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/a50a4f9e972b
Status: NEW → RESOLVED
Last Resolved: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
Comment on attachment 8855203 [details] [diff] [review]
a50a4f9e972b.txt

Fix for malware/antivirus dll related startup crash. Let's try to land this for beta 10.
Attachment #8855203 - Flags: approval-mozilla-beta?
Attachment #8855203 - Flags: approval-mozilla-beta+
Attachment #8855203 - Flags: approval-mozilla-aurora?
Attachment #8855203 - Flags: approval-mozilla-aurora+
This'll need a rebased patch (or some other bugs to be uplifted) for Beta.
Flags: needinfo?(ccorcoran)
Aaron, might you be able to help with the rebasing so we can try to land this today for beta 10?
Flags: needinfo?(aklotz)
I'm sure glad I don't have to explain them, only report on them.

Backed out in https://hg.mozilla.org/mozilla-central/rev/1518ee478393 - since this landed, we've only had a couple of Win8 debug xpcshell runs which did not time out in some random updater test, like https://treeherder.mozilla.org/logviewer.html#?job_id=89061558&repo=mozilla-inbound

Why does it fail? Why does it fail on trunk, but not on aurora? Does that mean the patch is fine on aurora, or that it's broken but the updater tests don't happen to show that it is broken there? Those are some of the things I'm glad I don't have to explain.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Target Milestone: mozilla55 → ---
Assignee

Comment 111

2 years ago
I am looking into this now.

My best guess is that we introduced a race condition somehow. If a caller is waiting for a signal from a thread that we block, then it will freeze. This trades a crash for a freeze, which in this thread we've determined is acceptable. That's not a complete theory though; it doesn't explain random regression timeouts.

And/or maybe there are cases where we're blocking threads too liberally. Still not a complete theory.

Going to test again with logging to narrow it down.
Flags: needinfo?(ccorcoran)
Phil, it doesn't appear that this log has dumps/stacks of the process that timed out, even though it says

mozcrash Using tests\bin\minidumpwriter.exe to write a dump to c:\users\cltbld~1.t-w\appdata\local\temp\xpc-other-69txcc\1ad92b13-faee-47e8-9475-636bc4390bbe.dmp for [5092]

And there aren't any minidumps in the uploaded artifacts (or at least I don't see any).

Stacks should point to the hang pretty reliably. Is there a way to get the minidump output for these minidumps?
Flags: needinfo?(philringnalda)
I doubt it, other than getting a loaner - after the test timeout there's a mozprocess 1000 second timeout, then an 1800 second timeout that's buildbot giving up, then the buildbot master loses contact with the slave. The two ways off the top of my head that we've done that before are by starving the python process running buildbot of either memory or CPU so badly that it blows up, or by actually crashing Windows.
Flags: needinfo?(philringnalda)
Clearing ni? because backout
Flags: needinfo?(aklotz)
Marking this affected for 55 since the patch was backed out from m-c (but not from aurora)
Let's push a Try for beta -- I really, really want to see this in 53 (and in 52ESR after that)
Flags: needinfo?(ccorcoran)
Reporter

Comment 117

2 years ago
though the one reproducible case to trigger this signature (bug 1335080) went away for me with this patch, it curiously seems to have had the opposite effect on the crash rates with aurora, as the volume of the signature increased in builds there after the patch has landed: https://crash-stats.mozilla.com/signature/?release_channel=aurora&signature=BaseThreadInitThunk&date=%3E%3D2016-06-01T00%3A00%3A00.000Z#graphs

under those circumstances and because we shipped with this problem for a while now, i don't think we should attempt to uplift it to 53 that close to the release date now, but first wait and see how things turn out on 54.0b.
This seems really unlikely for 53 at this point. We will do an RC2 later this week but I don't have a good feeling here based on the aurora crashes and the test failures when trying to uplift.
Attachment #8855203 - Flags: approval-mozilla-beta+ → approval-mozilla-beta-
Should this be backed out of Aurora based on Comment 117? The crash rate on Aurora is almost double what it was last week.
Ryan can you back this out of aurora as well?  Based on comment 117 it seems like we shouldn't let this go into beta yet.
Flags: needinfo?(ryanvm)
(In reply to David Major [:dmajor] from comment #80)
> Let's say we ship this and it doesn't fix the current crash. True, we're no
> worse off, but we still have a high-impact crash on our hands. It would be
> useful to get some diagnostics to inform a subsequent attempt at a fix.

Next time I am going to push harder on this. :-(
Reporter

Comment 122

2 years ago
on aurora the spike had a strong correlation to the module WRusr.dll (Webroot SecureAnywhere):
(99.32% in signature vs 02.26% overall) Module "WRusr.dll" = true

one crash comment that might also shed a bit more light on this was:
"Everybody in the team of developers I'm in now cannot start Firefox Developer edition because it immediately crashes. Some of my colleagues didn't have a page set as their home and theirs only crashed when going to a URL but I cannot open it at all as my home page was set as a web URL."
https://crash-stats.mozilla.com/report/index/3d23d0d1-45f2-4a2b-adc4-251b22170411#tab-details
Assignee

Comment 124

2 years ago
FYI I'm looking into the two results that have been noted here as the result of the patch:

1. Test timeouts on win8 x64, which I have finally been able to reproduce in a debugger last night. Today I'm digging into this further. At first glance seems like a race condition buried in the rust runtime that's triggered during Firefox startup. Example stack trace here: https://gist.github.com/thenfour/13617392b36ee2b3e9ae22914c611698

2. Investigate the WRusr.dll crashes and again see how/if this patch is increasing crashes.


We may well find that the impact of hooking the thread start routine is too much impact, revealing other issues that are otherwise dormant. To me, revealing dormant bugs as a good reason to continue efforts here, if only to help resolve other intermittent issues. More to come...
Flags: needinfo?(ccorcoran)
Assignee

Comment 125

2 years ago
I am still waiting on getting access to the raw minidumps to investigate the WRusr.dll crash. Something occurred to me as a possible explanation though: What are the implications of patching BaseThreadInitThunk *while it may be currently executing*? I didn't look yet into our detours implementation to see what kind of guards we have against that scenario, but it's one way this patch may cause new crashes in BaseThreadInitThunk.

Regarding test timeouts, it seems to be a race condition within the rust runtime that is normally hidden from view. Hooking BaseThreadInitThunk nudges timing windows just slightly, very occasionally causing a deadlock. Here's a description of what happens in the code to cause the contention:

> Main thread (XRE_Init)
> {
> 	creates thread1 (InitDwrite)
> 	rust runtime begins lazy-initializing tls,
> 	ACQUIRES rust tls mutex
> 	trying to release that mutex, calls into GetProcAddress
> 	WAITS for loader lock
> }
> 
> Thread1 (InitDwrite)
> {
> 	completes its work, exits
> 	ACQUIRES loader lock
> 	on_tls_callback https://github.com/rust-lang/rust/blob/master/src/libstd/sys/windows/thread_local.rs#L242
> 	WAITS for rust tls mutex
> }

The bug here is that rust runtime is acquiring a mutex while in a loader lock, creating a race condition.

I didn't think of a handy way to forcibly induce the deadlock (e.g. using Sleep()), as both locks are fully contained within the rust runtime. I'm honestly surprised we don't see it deadlock more often; the timing window is not that large here and under normal conditions it seems to barely miss.

A practical solution is to find a workaround in XRE_Main; I would suggest we force rust to complete initializing tls before we create the InitDwrite thread. I'm now looking into the best way to accomplish that; it's certainly a one-liner that just touches a tls value or so.


FYI I'm on Easter holiday, returning on Tuesday.
I'm pretty sure it's a bug to acquire any other lock while holding the loader lock.

https://blogs.msdn.microsoft.com/oldnewthing/20040128-00/?p=40853
(also https://blogs.msdn.microsoft.com/oldnewthing/20140808-00/?p=293)

Although I also bet there's a lot of software out there which might affect this in one way or another.

Are we installing this patch early enough that there aren't other threads running?
(In reply to Carl Corcoran [:ccorcoran] from comment #125)
> What are the implications of patching BaseThreadInitThunk *while it may be
> currently executing*? I didn't look yet into our detours implementation to
> see what kind of guards we have against that scenario, but it's one way this
> patch may cause new crashes in BaseThreadInitThunk.

If we are able to take the NopSpacePatcher route, then this can be done safely. (In fact, this is the reason why functions begin with this awkward-looking "mov edi, edi" rather than "nop; nop" -- so that patching can't cause someone to be mid-instruction.)

If we have to take the AddDetour route, we're super unsafe and all bets are off.

But given the consistency of the repro reported by that engineering team, I'm leaning against a timing problem. In the past we've had problems in wanting to hook the same API as an AV etc. It could be that wrusr tries to hook BaseThreadInitThunk but our patched version confuses its disassembler and it falls over.
Seems highly unlikely that we'd consider this as a dot-release candidate, so marking 53 as wontfix.
Assignee

Comment 129

2 years ago
Regarding the temporary spike in crashes from hooking BaseThreadInitThunk, I'm guessing WRusr.dll is at some point 
attempting to remove our hook, and in doing so causing memory corruption around BaseThreadInitThunk. To confirm this 100%, I would need to install Webroot SecureAnywhere and try to repro. Then find a workaround or block it.

What's happening to cause the crash? Looking at a few minidumps, it's restricted to 32-bit build, across Windows platforms (i've seen win10 and 8.1 out of about 6 dumps). This disassembly reveals pretty much everything:

> 0:036> u kernel32!BaseThreadInitThunk-8 l10
> kernel32!_BaseDllInitialize+0x29b:
> 766e6298 cc              int     3
> 766e6299 cc              int     3
> 766e629a cc              int     3
> 766e629b e90e49ab98      jmp     mozglue!`anonymous namespace'::patched_BaseThreadInitThunk (0f19abae)
> kernel32!BaseThreadInitThunk:
> 766e62a0 8bff            mov     edi,edi
> 766e62a2 55              push    ebp
> 766e62a3 8bec            mov     ebp,esp
> 766e62a5 51              push    ecx
> 766e62a6 a1c4008b6b      mov     eax,dword ptr ds:[6B8B00C4h]
> 766e62ab 33c5            xor     eax,ebp

Points of interest:
  1. 766e629b our long jmp is intact, and calling into our hook. We have not removed the hook.
  2. 766e62a0 mov edi,edi - this is a problem. This should be our trampoline to the long jmp. Suggests that something has clobbered our hook.
  3. 766e62a6 this is actually where the crash happens.

#3 is the most interesting, and I don't really have an explanation yet. Two bytes have been overwritten:

> 0:036> !chkimg -d kernel32
>     766e629b-766e629f  5 bytes - kernel32!_BaseDllInitialize+29e
> 	[ cc cc cc cc cc:e9 0e 49 ab 98 ]
>     766e62a9-766e62aa  2 bytes - kernel32!BaseThreadInitThunk+9 (+0x0e)
> 	[ 78 76:8b 6b ]
> 7 errors : kernel32 (766e629b-766e62aa)

The original bits *should* disassemble like:

> 766e62a6 a1c4007876      mov     eax,dword ptr [KERNEL32!__security_cookie (767800C4h)]

Those 2 overwritten bytes are almost always 8b 6b, though in one dump I saw 90 6b. Could be broken fragments of assembly from something? Maybe but nothing obvious.

Finally, notable thread activity:

  1. Main thread is in GeckoMediaPluginService::Init, waiting while launching a thread; presumably the crashing thread.
  2. Crashing thread has finished and encountered the corrupt address. *boom*
  3. there are about 3 worker threads in WRusr.dll, waiting (presumably for ipc).
  4. all other threads seem fairly benign. Note that almost all threads return into our hook.

So here's what seems to be happening, given the pieces to that puzzle:

  1. we hook BaseThreadInitThunk
  2. thread X starts up, runs under our hook
  3. WRusr clobbers our hook, and in the process corrupts 2 other bytes in BaseThreadInitThunk
  4. thread X ends, but BaseThreadInitThunk explodes because of the corruption.

Next steps? How much diligence do we want to throw at this? If we want to fix this WRusr.dll issue, we should repro in a VM to see what's causing the corruption. Or is blocking it an option?
Flags: needinfo?(benjamin)
Blocking is absolutely an option if it works. Per new blocklisting policy, we don't allow external DLLs in our process.
Flags: needinfo?(benjamin)
Assignee

Updated

2 years ago
Depends on: 1358151
Reporter

Comment 131

2 years ago
the crash volume of this signature is declining to the levels we've had before december - maybe the malware received an update "fixing" the startup crashes.
https://crash-stats.mozilla.com/signature/?product=Firefox&signature=BaseThreadInitThunk&date=%3E%3D2016-10-27T11%3A28%3A07.000Z#graphs
Assignee

Updated

2 years ago
Depends on: 1361410

Comment 132

2 years ago
I'm one of the developers for Webroot (WRusr.dll), I'd like permissions to download the raw dump files from Mozilla.

I'm also having some trouble reproducing this myself. According to what I've gathered in the thread above. This is limited to 32 bit Firefox and has happened on  I tries the GA version of Firefox and the Nightly version. I tried both 64 and 32 bit Firefox on both 64 and 32 bit Windows 7, 8.1, and 10. I see some consensus in the thread that this tends to happen on 32 bit Windows 7. I don't have a 32 bit Windows 7 immediately ready; I'm installing one now to test this.
Johnny, you can find a build to reproduce the problem in bug 1361410 comment 0.

Carl, can you reproduce the crash yourself? If you can, could you send Johnny a raw dump?
We can't share users' raw dumps, because we would need their consent.
Flags: needinfo?(ccorcoran)

Comment 134

2 years ago
Marco, thanks for the build and steps to repro. I'll get on this now.

Comment 135

2 years ago
I just reproduce it myself. Thanks for pointing me in the right direction Marco. I'll report back once I complete my analysis.
Assignee

Comment 136

2 years ago
Johnny, thanks for your help. I see you managed to repro it. If you still need minidumps, I attached them on bug 1361410.
Flags: needinfo?(ccorcoran)

Comment 137

2 years ago
I've found the root of the problem on our end. I'm 90% sure I have a solution in place to resolve this. I'm doing some further validation on my end. Once that is done I'll update again. I might be able to send someone a one-off build to do some validation against on your end. If that is something someone would be interested in let me know.
Assignee

Comment 138

2 years ago
Johnny I'm happy to give it a shot. But we're just as interested in ensuring users with the current WRusr.dll don't crash. Is there a workaround we can use in the meantime so even with the current WRusr.dll, it won't crash? For example a way to tell WRusr not to touch this particular detour?
Flags: needinfo?(johnnyshaw02)

Comment 139

2 years ago
I'm almost positive that isn't an option, but I'll verify. Exposing that ease of control to security software would be a major hole. What is the reason these users can't update their antivirus when this fix becomes available?
Flags: needinfo?(johnnyshaw02)

Comment 140

2 years ago
The user could set Firefox to "Allow" int he Application Protection. This will disable ALL protection for the browser though. If you click the gear next to "Identity Protection", then go to "Application Protection", and then move the Firefox binary to "Allow". This will tell WRusr not to inject into the process.
Assignee

Comment 141

2 years ago
Yea, and as you said, everyone should be caught up to the latest WRusr.dll. For the rare stragglers, and for the Nightly users in the meantime, we may decide to simply skip this detour depending on the version of WRusr.dll. Would you be able to let us know which version of WRusr.dll we could use as a condition?
Flags: needinfo?(johnnyshaw02)

Comment 142

2 years ago
Yeah, I can let you know what version to expect it in when I know.
Flags: needinfo?(johnnyshaw02)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)

Comment 146

2 years ago
mozreview-review
Comment on attachment 8867827 [details]
Bug 1322554: don't hook BaseThreadInitThunk if WRusr.dll is present on Windows x86;

https://reviewboard.mozilla.org/r/139352/#review142728

::: mozglue/build/WindowsDllBlocklist.cpp:765
(Diff revision 1)
>  
> +static bool
> +IsSafeToHookBaseThreadInitThunk()
> +{
> +  // Bug 1361410: WRusr.dll will overwrite our hook and cause a crash. Workaround: If we detect WRusr.dll, don't hook.
> +#ifndef HAVE_64BIT_BUILD

Looking at crash-stats, pretty much all the BaseThreadInitThunk crashes are 32-bit builds. The very very small number of 64-bit ones are probably unrelated noise.

What do you think about putting this entire patch series behind #ifdef _M_IX86? Hooks always carry risk of bad interactions with other software, so I like to avoid hooking on machines where we don't need to. I haven't gone back and looked at the previous patch but this might also eliminate the need for some of the disassembler changes.

::: mozglue/build/WindowsDllBlocklist.cpp:767
(Diff revision 1)
> +IsSafeToHookBaseThreadInitThunk()
> +{
> +  // Bug 1361410: WRusr.dll will overwrite our hook and cause a crash. Workaround: If we detect WRusr.dll, don't hook.
> +#ifndef HAVE_64BIT_BUILD
> +  HMODULE hWRusr = (HMODULE)GetModuleHandleW(L"WRusr.dll");
> +  if (hWRusr != NULL) {

Hm, why the cast? If you like, you could just avoid it with `if (GetModuleHandle...)`. Also, we probably don't need to worry about following the pattern of printf_stderr's -- I don't think they add too much value since we very rarely step through this file after a patch is landed.
Assignee

Comment 147

2 years ago
mozreview-review-reply
Comment on attachment 8867827 [details]
Bug 1322554: don't hook BaseThreadInitThunk if WRusr.dll is present on Windows x86;

https://reviewboard.mozilla.org/r/139352/#review142728

> Hm, why the cast? If you like, you could just avoid it with `if (GetModuleHandle...)`. Also, we probably don't need to worry about following the pattern of printf_stderr's -- I don't think they add too much value since we very rarely step through this file after a patch is landed.

Originally I was writing this with a version check, where I use the HMODULE to get at the resources. Without the version checking it's true this is unnecessary, but also benign to leave it.
Comment hidden (mozreview-request)
Assignee

Comment 149

2 years ago
mozreview-review-reply
Comment on attachment 8867827 [details]
Bug 1322554: don't hook BaseThreadInitThunk if WRusr.dll is present on Windows x86;

https://reviewboard.mozilla.org/r/139352/#review142728

> Looking at crash-stats, pretty much all the BaseThreadInitThunk crashes are 32-bit builds. The very very small number of 64-bit ones are probably unrelated noise.
> 
> What do you think about putting this entire patch series behind #ifdef _M_IX86? Hooks always carry risk of bad interactions with other software, so I like to avoid hooking on machines where we don't need to. I haven't gone back and looked at the previous patch but this might also eliminate the need for some of the disassembler changes.

Seems reasonable; no reason to spread this patch where it doesn't benefit. I would still keep the 64-bit opcode support though; this is just adding support for opcodes we should already be supporting but just haven't seen yet. More the better. Or should we file separate bugs for them?

Comment 150

2 years ago
For what its worth. Rather the just blindly re-writing code WRusr tries to be smart about replacing code when clearing hooks. Its possible the method of hooking mozglue is using isn't conventional. Obviously WRusr could be smarter about removing the hooks. But, the better approach from our side would be to allow mosglue to do it's job. My point is, if you have a secondary method for hooking. You might be positive results using it.

To close the loop on my side I did send over a build to Carl that has a patch on our end. Let me know if you don't get it Carl.
Comment on attachment 8855203 [details] [diff] [review]
a50a4f9e972b.txt

esr52-, this seems too high risk for a limited impact there (and we should get a fix to stick in trunk first anyway :) ).
Attachment #8855203 - Flags: approval-mozilla-esr52? → approval-mozilla-esr52-

Comment 152

2 years ago
mozreview-review
Comment on attachment 8867827 [details]
Bug 1322554: don't hook BaseThreadInitThunk if WRusr.dll is present on Windows x86;

https://reviewboard.mozilla.org/r/139352/#review144122

Sorry, I should clarify -- could you please put the entire patch stack behind a 32-bit ifdef? (Including AddHook, NopThreadProc, ShouldHook, TestDllInterceptor, etc.) And I think I'd prefer _M_IX86 because the HAVE_ stuff is more of configure's turf, and the _M_ approach is the most common pattern in our Windows-only files.

Regarding the 64-bit assembler changes, my concern is mostly because this bug is of high uplift-interest on relman's radar, and I wouldn't want to rush unnecessary code to beta. Perhaps we could split that code off into its own bug that rides the trains like normal?
Attachment #8867827 - Flags: review?(dmajor) → review+
Comment hidden (mozreview-request)
Assignee

Updated

2 years ago
Attachment #8867827 - Attachment is obsolete: true
Assignee

Updated

2 years ago
Attachment #8852516 - Attachment is obsolete: true

Comment 154

2 years ago
mozreview-review
Comment on attachment 8854781 [details]
Bug 1322554: Interpose kernel32!BaseThreadInitThunk to add verification of thread start addresses;

https://reviewboard.mozilla.org/r/126762/#review145738

::: mozglue/build/WindowsDllBlocklist.cpp:870
(Diff revision 3)
>                                reinterpret_cast<intptr_t>(patched_RtlInstallFunctionTableCallback),
>                                (void**)&stub_RtlInstallFunctionTableCallback);
>    }
>  #endif
> +
> +#ifdef _M_IX86// Minimize impact; crashes in BaseThreadInitThunk are vastly more frequent on x86

Nit: Add a space before //

::: toolkit/xre/test/win/TestDllInterceptor.cpp:523
(Diff revision 3)
>        TestHook(TestSendMessageTimeoutW, "user32.dll", "SendMessageTimeoutW") &&
>  #endif
>        TestHook(TestTlsAlloc, "kernel32.dll", "TlsAlloc") &&
>        TestHook(TestTlsFree, "kernel32.dll", "TlsFree") &&
> +#ifdef _M_IX86
> +      TestDetour("kernel32.dll", "BaseThreadInitThunk") &&

This call to TestDetour is going to be a departure from our usual pattern of calling TestHook wherever the product code calls AddHook. I understand why it's this way -- we can't really call stub_BaseThreadInitThunk from the test code -- but it's worth explaining in a comment, if for no other reason than to deter future coders from breaking the pattern without good reason. :)
Attachment #8854781 - Flags: review?(dmajor) → review+
Actually, speaking of the TestHook/TestDetour distinction, have you tried using AddDetour rather than AddHook to see if it interacts any better with wrusr.dll? Johnny Shaw did mention in comment 150 that we might have better luck with alternative hook techniques.
Flags: needinfo?(ccorcoran)
Assignee

Comment 156

2 years ago
David, I just tried AddDetour and indeed Webroot no longer crashes. Our hook is still overwritten, which is a potential problem, but the crash we've noted is no longer happening.

Which means we should be able to proceed now.

Johnny, thank you for all your help and diligence.

New patch incoming...
Flags: needinfo?(ccorcoran)
Comment hidden (mozreview-request)

Comment 158

2 years ago
Happy to help. Thank you guys for working with us. I'm certainly willing to keep our communication open to collaborate in the future.

Comment 159

2 years ago
Pushed by dmajor@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/7d2e74c69253
Interpose kernel32!BaseThreadInitThunk to add verification of thread start addresses; r=aklotz,dmajor

Comment 160

2 years ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/7d2e74c69253
Status: REOPENED → RESOLVED
Last Resolved: 2 years ago2 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
Please request uplift on this when you get a chance; not sure if it's worth seeing if crash reports go down or not, seeing as you've verified it fixes things locally.
Flags: needinfo?(ccorcoran)
OTOH it would be good to observe this on nightly for a few days to watch for regressions. In code like this that's always a risk.
Reporter

Comment 163

2 years ago
unfortunately this patch doesn't prevent the startup issue with robosizer on win10 raised in bug 1335080 when i tested it now again (whereas the prior patch in attachment 8855203 [details] [diff] [review] fixed it for me).
Assignee

Comment 164

2 years ago
:philipp we seem to get very few crashes with this signature on x64, so to reduce impact we decided to start by restricting this patch to x86.

https://crash-stats.mozilla.com/signature/?product=Firefox&signature=BaseThreadInitThunk&date=%3E%3D2017-01-01T08%3A14%3A00.000Z&date=%3C2017-05-18T08%3A14%3A00.000Z#graphs

View crashes per day by cpu arch, and we get 1-4 crashes per day on x64 in BaseThreadInitThunk, compared to x86's in the hundreds.

If this patch is successful in reducing x86's crashes, I think we should indeed try it on x64.

Interesting about the stats for this signature, in the past month it's already gone significantly down. I'm curious what caused this change around mid-March.

All this also reignites the desire to get telemetry about this patch.
Flags: needinfo?(ccorcoran)
Reporter

Comment 165

2 years ago
ah thanks for the pointer. in fact, bug 1335080 is fixed when testing with a 32bit build of nightly.

when robosizer is running, a 64bit nightly fails to launch silently though without generating any report, so it may be that the real extend of the issue with 64bit installations is masked in crash report data...
Depends on: 1368319
Assignee

Comment 166

2 years ago
:philipp I took another look at RoboSizer on win64. It consistently crashes for me too, though it appears unrelated to BaseThreadInitThunk. It's injecting EasyHook64.dll into Firefox, which is causing corruption at a semi-random place during startup. Sometimes it happens before crash reporting is hooked up.

The solution here is likely to do some Dll blocking for EasyHook with Robosizer. I have created bug 1369188 to track it specifically.
(In reply to Carl Corcoran [:ccorcoran] from comment #166)
> :philipp I took another look at RoboSizer on win64. It consistently crashes
> for me too, though it appears unrelated to BaseThreadInitThunk. It's
> injecting EasyHook64.dll into Firefox, which is causing corruption at a
> semi-random place during startup. Sometimes it happens before crash
> reporting is hooked up.
> 
> The solution here is likely to do some Dll blocking for EasyHook with
> Robosizer. I have created bug 1369188 to track it specifically.

The original patch did fix the crash for Philipp though, so that's strange.
Assignee

Comment 168

2 years ago
Robosizer injects its DLLs via CreateRemoteThread, which this patch blocks because the threadstart memory has PAGE_EXECUTE_READWRITE, which we block with this patch. So it's true that this patch fixes this crash, though the crash is unrelated to BaseThreadInitThunk. I lost sight of this effect of the patch.

Our BaseThreadInitThunk detour has the side-effect of auto-blocking lots of DLLs like EasyHook, which means we should consider that we're not just fixing BaseThreadInitThunk crash signatures, but any crash related to remotely-injected code. The motivation for restricting to x86 was the very few BaseThreadInitThunk crash signatures we receive for x64. So I think we should re-consider x64 support.
Is this something we're still considering for eventual backport to ESR52 or should we wontfix? Seems like a tough call given the risk vs. reward :\
The BaseThreadInitThunk crashes are not as common as they used to be (#32 on ESR52, they used to be in the top three).
Assignee

Comment 171

2 years ago
The BaseThreadInitThunk crash signature is just 1 signature this patch should fix. In theory, this should prevent any crashes stemming from WriteProcessMemory/CreateRemoteThread DLL injection.
(In reply to Carl Corcoran [:ccorcoran] from comment #171)
> The BaseThreadInitThunk crash signature is just 1 signature this patch
> should fix. In theory, this should prevent any crashes stemming from
> WriteProcessMemory/CreateRemoteThread DLL injection.

Yes, it is definitely a great prevention mechanism. I'm just saying we currently don't have a crash as concerning as "BaseThreadInitThunk" was a while ago, so we should take this into account when deciding whether to uplift to ESR52.
Doesn't seem like there's a clear consensus to move forward with nominating this for ESR52, so setting the status to wontfix to stop tracking it.
Blocks: 1372827
No longer blocks: 1374611
(In reply to David Major [:dmajor] from comment #66)
> > Honestly I don't know what is apphelp.dll, there's not much information
> > about it, the answer from google is: it's a part of Application
> > Compatibility Client Library from Microsoft.
> 
> Apphelp (and AcLayers) are Windows code that provides
> backwards-compatibility fixes by shimming APIs, e.g. making
> GetWindowsVersion lie to keep some old program happy.
> 
> I don't remember the details but I've seen this stuff used for questionable
> purposes in the past, since it's a way to introduce code into a process well
> before main().

I don't know why I came back to this bug, and I was thinking: "Why would they use apphelp or aclayers to call GetProcAddress, when GetProcAddress is in kernel32.dll and they could have safely called it?". Maybe the issue here wasn't that they were using apphelp and aclayers to do dodgy things, but Windows was automatically "applying compatibility fixes" redirecting GetProcAddress (and maybe other functions?) to those DLLs.
If that's the case, could similar crashes happen again when some third-party software which injects into Firefox (with any mechanism) somehow gets "compatibility fixes" by Windows and out of the blue starts using functions from DLLs that we don't load (as we don't get "compatibility fixes")? Is there any way we could prevent this problem?
I think we're better off spending time on the injecteject effort than worrying about that scenario.
You need to log in before you can comment on or make changes to this bug.