Closed Bug 951827 Opened 10 years ago Closed 10 years ago

startup crash in CryptBaseRunOnceRoutine coming from random_generateSeed in jsMath - some software is hooking LdrLoadDll in a broken way after Mozilla's hooking

Categories

(Toolkit :: Startup and Profile System, defect)

26 Branch
x86
Windows 7
defect
Not set
critical

Tracking

()

VERIFIED FIXED
mozilla30
Tracking Status
firefox27 + verified
firefox28 --- verified
firefox29 --- verified
firefox30 --- verified
b2g-v1.3 --- fixed

People

(Reporter: kairo, Assigned: away)

References

Details

(Keywords: crash, topcrash-win)

Crash Data

Attachments

(2 files, 13 obsolete files)

4.45 KB, patch
ehsan.akhgari
: review+
Details | Diff | Splinter Review
7.27 KB, patch
away
: review+
Details | Diff | Splinter Review
This bug was filed from the Socorro interface and is 
report bp-98d6a12b-0c88-44f6-9ba4-848a32131218.
=============================================================

Top Stack Frames:

0 		@0x750002 	
1 	advapi32.dll 	CryptBaseRunOnceRoutine 	
2 	ntdll.dll 	RtlpRunOnceWakeAll 	
3 	advapi32.dll 	SystemFunction036 	
4 	msvcr100.dll 	rand_s 	f:\dd\vctools\crt_bld\self_x86\crt\src\rand_s.c
5 	mozjs.dll 	random_generateSeed 	js/src/jsmath.cpp
6 	mozjs.dll 	random_initState 	js/src/jsmath.cpp
7 	mozjs.dll 	js_math_random(JSContext *,unsigned int,JS::Value *) 	js/src/jsmath.cpp
8 	mozjs.dll 	js::Invoke(JSContext *,JS::CallArgs,js::MaybeConstruct) 	js/src/vm/Interpreter.cpp
9 	mozjs.dll 	Interpret 	js/src/vm/Interpreter.cpp
10 	mozjs.dll 	js::RunScript(JSContext *,js::RunState &) 	js/src/vm/Interpreter.cpp


This has been around for a while, but has spiked to topcrash #15 in Firefox 27.0b1.
Oh, and not just that it was low-volume but piked in 27 beta, but it's a startup crash (99.9% in the first minute of uptime).
Summary: crash in CryptBaseRunOnceRoutine coming from random_generateSeed in jsMath → startup crash in CryptBaseRunOnceRoutine coming from random_generateSeed in jsMath
NI on Naveed to help with assignee for investigation here.
Flags: needinfo?(nihsanullah)
This is our only startup topcrasher in 27betas. It would be great to get some traction here.  Please let me know if there is anything QA can do to help.
OS: Windows NT → Windows 7
QA Contact: twalker
This crash is a regression from bug 868860. That 99.9% of the crashes are on Windows 7 suggests this might be a Windows 7 bug.
Blocks: 868860
I found some related Firefox and Chromium bug reports for crashes in rand_s(). Those crashes are all startup crashes, possibly related to ASLR or third-party DLL injection.

• Bug 694344 (correlated with mandatory ASLR bug 677797)
• Bug 723447 (correlated with McAfee's "Host Intrusion Prevention" DLL)
• Chromium bug : http://code.google.com/p/chromium/issues/detail?id=74172
See Also: → 694344, 723447
The code that is crashing landed in Firefox 24 (May 2013, bug 868860), but these crashes only started to spike in mid-December. I suspect Microsoft's December 2013 Patch Tuesday might have triggered these crashes.

This InfoWorld article reports that many people had problems with the December 2013 update, "with the majority reported on Windows 7" (just like this bug).

http://www.infoworld.com/t/microsoft-windows/botched-black-tuesday-patch-kb-2887069-freezes-fails-configure-triggers-bsod-andor-zaps-sound-drivers-232
If this were caused *only* by patch Tuesday, It's odd that even in the last week, we are seeing many more of this crash on Firefox 27 than on Firefox 26:

1 	27.0 	820 	83.08 %
2 	26.0 	53 	5.37 %
3 	15.0 	23 	2.33 %

Even accounting for release throttling it seems that this somehow affects FF27 more than FF26.

It doesn't seem to have been mentioned:

*most* of these crashes are EXCEPTION_STACK_OVERFLOW with a crash address that ends in 0x02, e.g. 0x2e0002 or 0x310002 or 0x1350002.

But there are some with EXCEPTION_VIOLATION_ACCESS_READ with a constant crash address of 0x103.

Use https://crash-stats.mozilla.com/search/?signature=%3DCryptBaseRunOnceRoutine&_facets=version&_facets=product&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=user_comments&_columns=reason&_columns=address to peruse them.

Stack is:
0 		@0x2e0002 	
1 	advapi32.dll 	CryptBaseRunOnceRoutine 	
2 	ntdll.dll 	RtlpRunOnceWakeAll 	
3 	advapi32.dll 	SystemFunction036 	
4 	msvcr100.dll 	rand_s 	f:\dd\vctools\crt_bld\self_x86\crt\src\rand_s.c
5 	mozjs.dll 	random_generateSeed 	js/src/jsmath.cpp

The things that are interesting to me about this are:

* Reading the CRT sources, "SystemFunction036" is a synonym for RtlGenRandom
* advapi32.dll seems to have some initialization code which these functions are hooking. I'm walking through this in a debugger now to see what it looks like.
_CryptBaseRunOnceRoutine does a bunch of module loading and dynamic symbol lookups use _LdrGetProcedureAddress. I was in one loop where it appeared to be looking for the symbol "MZ" in a bunch of modules (perhaps every module currently loaded into the current process, but I'm not sure).
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #9)
> _CryptBaseRunOnceRoutine does a bunch of module loading and dynamic symbol
> lookups use _LdrGetProcedureAddress. I was in one loop where it appeared to
> be looking for the symbol "MZ" in a bunch of modules (perhaps every module
> currently loaded into the current process, but I'm not sure).

Do you mean symbol as in a name/export, or literally those two characters? MZ is the magic number at the start of executable binaries.
A simple workaround might be to add an "Is Windows 7?" check to avoid calling rand_s() here:

  https://hg.mozilla.org/mozilla-central/file/bbe08810e87b/js/src/jsmath.cpp#l699

The generated seed will be weak on Windows 7, but the implementation would not be much different than it was before random_generateSeed() was changed to call rand_s():

  https://hg.mozilla.org/mozilla-central/rev/ddfd95298835#l2.23

However, this workaround might only be postponing the CryptBaseRunOnceRoutine crash to a later crypto call. But maybe this crash is a CryptBaseRunOnceRoutine race that won't happen later?
I don't think this is related to rand_s(). If anything I would say this coincides with the blocklist init changes which are on 27.

advapi32!CryptBaseRunOnceRoutine tried to call LdrLoadDll("cryptbase.dll") but wound up in a trampoline (one that doesn't look like ours):

0:000> k4
0015c2a0 76727332 0x750002
0015c2d0 775234c8 advapi32!CryptBaseRunOnceRoutine+0x1d
0015c2ec 76721935 ntdll!RtlRunOnceExecuteOnce+0x33
0015c308 706fb536 advapi32!SystemFunction036+0x1c

0:000> u 750000
00750000 85f5            test    ebp,esi
00750002 51              push    ecx
00750003 7705            ja      0075000a
00750005 ebf9            jmp     00750000
00750007 55              push    ebp
00750008 8bec            mov     ebp,esp
0075000a e97bf5dc76      jmp     ntdll!LdrLoadDll+0x5 (7751f58a)

0:000> !address 750000
Base Address:           00750000
End Address:            00751000
(I dumped this in order to convince myself that the start of trampoline is indeed 750000, and not some address before it)

We loop around and around pushing ecx a ton of times from 15c2a0 down to 63000 (and I guess hitting the guard a couple of pages before 61000)

0:000> r ebp, esp
Last set context:
ebp=0015c2a0 esp=00063000
0:000> !address esp
Base Address:           00061000
End Address:            00160000

-----------

The other group of dumps with the 0x103 have the same trampoline situation, just different code

0:000> r
Last set context:
eax=0018c38c ebx=00000000 ecx=00000000 edx=00000000 esi=00000103 edi=0018c3c4
eip=00380002 esp=0018c348 ebp=0018c360 iopl=0         nv up ei pl nz na pe nc
cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000             efl=00010206
00380002 ad              lods    dword ptr [esi]      ds:0023:00000103=????????

0:000> u 380000
00380000 85f5            test    ebp,esi
00380002 ad              lods    dword ptr [esi]
00380003 7705            ja      0038000a
00380005 ebf9            jmp     00380000
00380007 55              push    ebp
00380008 8bec            mov     ebp,esp
0038000a e97bf57577      jmp     ntdll!LdrLoadDll+0x5 (77adf58a)

-----------

Could this have resulted from two pieces of code trying to hook the same function? Maybe one trampoline messed up the assumptions of the other?
Yikes, we're interpreting an address as instructions. Emphasis added [ with brackets ]:

00750000 [ 85 ][ f5 ]    test    ebp,esi
00750002 [ 51 ]          push    ecx
00750003 [ 77 ]05        ja      0075000a
00750005 ebf9            jmp     00750000
00750007 55              push    ebp
00750008 8bec            mov     ebp,esp
0075000a e97bf5dc76      jmp     ntdll!LdrLoadDll+0x5 ([ 77 ][ 51 ][ f5 ][ 8a ])

00380000 [ 85 ][ f5 ]    test    ebp,esi
00380002 [ ad ]          lods    dword ptr [esi]
00380003 [ 77 ]05        ja      0038000a
00380005 ebf9            jmp     00380000
00380007 55              push    ebp
00380008 8bec            mov     ebp,esp
0038000a e97bf57577      jmp     ntdll!LdrLoadDll+0x5 ([ 77 ][ ad ][ f5 ][ 8a ])

Reverse the endian, then subtract the +0x5 offset, and it's a match.

I think some hook code created a trampoline here, and some other hook code overwrote its first few bytes with the address of LdrLoadDll. One of those two hooks is likely ours.

That certainly explains why the code seems to have had its assumptions messed up. (And it explains how we got such a strange opcode as "lods", that isn't natural in real code)
Yep, somebody invited themselves to the party:
User32BeforeBlocklist 	1
THEORY: External software is racing with our DllBlocklist to hook LdrLoadDll.

***** Prior to bug 932100, they won the race *****

Initial state:
  LdrLoadDll+0: MOV EDI, EDI
  LdrLoadDll+2: PUSH EBP
  LdrLoadDll+3: MOV EBP, ESP
  LdrLoadDll+5: <rest of the function>

They add their hook. They could have used a nop-space patch but it seems their code always does detours:
  LdrLoadDll+0: JMP <their_patched_function>
  LdrLoadDll+5: <rest of the function>

They paste the clobbered bytes into a trampoline. For them, code starts at offset +5.
  TheirTrampoline+0: <address of original LdrLoadDll>
  TheirTrampoline+4: <Unknown but always seems to be 0x05, likely # of bytes copied>
  TheirTrampoline+5: MOV EDI, EDI
  TheirTrampoline+7: PUSH EBP
  TheirTrampoline+8: MOV EBP, ESP
  TheirTrampoline+A: JMP LdrLoadDll+5 

Then the DLL blocklist adds its hook. Because LdrLoadDll doesn't have the right bytes anymore, it cannot use a nop space patch, so it adds a detour:
  LdrLoadDll+0: JMP <our_patched_function>
  LdrLoadDll+5: <rest of the function>

If our patch allows the call, we jump to OurTrampoline+4:
  OurTrampoline+0: <address of original LdrLoadDll>
  OurTrampoline+4: JMP <their_patched_function>

In this way, both detours can coexist. 

***** After bug 932100, we win the race *****

Initial state:
  LdrLoadDll-5: NOP NOP NOP NOP NOP
  LdrLoadDll+0: MOV EDI, EDI
  LdrLoadDll+2: PUSH EBP
  LdrLoadDll+3: MOV EBP, ESP
  LdrLoadDll+5: <rest of the function>

The DLL blocklist adds its hook first. The bytes are right, so it chooses a nop space patch.
  LdrLoadDll-5: JMP <our_patched_function>
  LdrLoadDll+0: JMP -5 ; 2-byte short jump
  LdrLoadDll+2: PUSH EBP
  LdrLoadDll+3: MOV EBP, ESP
  LdrLoadDll+5: <rest of the function>

Then the other code adds its hook:
  LdrLoadDll+0: JMP <their_patched_function>
  LdrLoadDll+5: <rest of the function>

As before, they paste the clobbered bytes into a trampoline:
  TheirTrampoline+0: <address of original LdrLoadDll>
  TheirTrampoline+4: <Unknown but always seems to be 0x05, likely # of bytes copied>
  TheirTrampoline+5: JMP -5     <*<*<*<*<*< THIS IS THE PROBLEM
  TheirTrampoline+7: PUSH EBP
  TheirTrampoline+8: MOV EBP, ESP
  TheirTrampoline+A: JMP LdrLoadDll+5 

The problem is that the code blindly relocated a "JMP -5" instruction. If ever they call into TheirTrampoline+5, it will jump back to TheirTrampoline+0 and start executing an address as code!

Our own detouring code would abort rather than relocate such an instruction. It seems the other software's code does not have such protection, but this is only exposed when they lose the race.

Now, what software this is, I have no idea. Some of these crash reports are rather vanilla.
Blocks: 932100
No longer blocks: 868860
Clearing needinfo? naveed because dmajor is investigating.
Flags: needinfo?(nihsanullah)
Component: JavaScript Engine → Startup and Profile System
Product: Core → Toolkit
Summary: startup crash in CryptBaseRunOnceRoutine coming from random_generateSeed in jsMath → startup crash in CryptBaseRunOnceRoutine coming from random_generateSeed in jsMath - some software is hooking LdrLoadDll in a broken way after Mozilla's hooking
It seems to me like the right fix here is to opt out of creating a short trampoline in nsWindowsDllInterceptor.h.
(In reply to :Ehsan Akhgari (needinfo? me!) from comment #17)
> It seems to me like the right fix here is to opt out of creating a short
> trampoline in nsWindowsDllInterceptor.h.

Are you suggesting that we always do the 5-byte jump instead of the 2-byte jump? I don't think that will be effective. Even the 5-byte jumps are encoded as PC-relative. Our own code has logic to do fixup when we relocate such instructions, but I don't have reason to believe that the other software will do so.
It's interesting that this crash only shows up on Windows 7 build 7600 (that's the original, not SP1) -- presumably because the layout of ntdll affects the instructions that get executed when we treat the address as code.

The underlying problem is likely just as present on other Windows versions, but may or may not result in badness that we can attribute to this via crash-stats.
Maybe we could hook GetProcAddress and fake out anyone who later tries to hook LdrLoadDll -- let them hook the stub_ function instead. (Then I guess we'd have to do the do the same for GetProcAddress itself -- hooks all the way down!) I don't know if I would take such a change straight to beta though. It seems innocuous, but then again, so did the LdrLoadDll changes.
(In reply to comment #20)
> Maybe we could hook GetProcAddress and fake out anyone who later tries to hook
> LdrLoadDll -- let them hook the stub_ function instead. (Then I guess we'd have
> to do the do the same for GetProcAddress itself -- hooks all the way down!) I
> don't know if I would take such a change straight to beta though. It seems
> innocuous, but then again, so did the LdrLoadDll changes.

Yeah I don't think such a patch should skip riding the trains...  It seems quite risky to me.
Ok, how about this: nop out the first five bytes, write our jump at offset 5, and move all the clobbered instructions to our trampoline. This way, those first five nops can be relocated by someone else without fixup.

I think this is about the best we can do as far as safety. We would need to expand the instruction decoding logic a little, to make sure it understands LdrLoadDll bytes 5 through 10 on various platforms. It puts us at risk of a new ntdll shipping with instructions that we don't understand, but this code had already been taking that risk on XP where LdrLoadDll didn't have a hotpatch prologue.
Flags: needinfo?(ehsan)
Attached patch Beta branch pseudo-backout (obsolete) — Splinter Review
Posting this just in case. As a last resort we can back out the blocklist initialization changes from bug 932100. There's been a lot of churn so a straight backout isn't practical, but this change gets us back to the same ordering that we had in https://hg.mozilla.org/releases/mozilla-release/file/29657549d158/browser/app/nsBrowserApp.cpp#l607
Assignee: nobody → dmajor
Status: NEW → ASSIGNED
I think the safest path forward for Beta would be the backout, but I think we should try to fix this on central and aurora using the approach in comment 22 and see if it works when the current aurora train moves to beta.  Comment 22 indeed sounds safer than comment 18.
Flags: needinfo?(ehsan)
Comment on attachment 8363947 [details] [diff] [review]
Beta branch pseudo-backout

[Approval Request Comment]
Bug caused by (feature/regressing bug #): bug 932100 + external code
User impact if declined: Consistent startup crashes
Testing completed (on m-c, etc.): Not tested in builds. Locally I stepped through the code with a debugger, but without STR I can't test the fix.
Risk to taking this patch (and alternatives if risky): Taking this patch will bring back the bitguard crashes. But at least we have other ways to deal with that (e.g. email affected users). For the long term, we are looking at more in-depth fixes such as comment 22, but we don't have time to test those in the current cycle.
String or IDL/UUID changes made by this patch: None
Attachment #8363947 - Flags: review?(benjamin)
Attachment #8363947 - Flags: approval-mozilla-beta?
Comment on attachment 8363947 [details] [diff] [review]
Beta branch pseudo-backout

marking a+ based on email from Bhavana
Attachment #8363947 - Flags: review?(benjamin)
Attachment #8363947 - Flags: review+
Attachment #8363947 - Flags: approval-mozilla-beta?
Attachment #8363947 - Flags: approval-mozilla-beta+
Looks like it worked. I don't see any hits on beta after build 20140120132616.
(In reply to David Major [:dmajor] from comment #28)
> Looks like it worked. I don't see any hits on beta after build
> 20140120132616.

Ok, so the supersearch facets actually pointed out a small number of hits on beta build 20140123185438. But there are also a small number of hits on release 26. This tells me that there are other cases of conflicting hooks outside of the most recent ordering regression. More reason to pursue a permanent fix.
Depends on: 972671
Attachment #8376581 - Flags: review?(ehsan)
Attachment #8376582 - Flags: review?(ehsan)
I did this to avoid dealing with offset math in Part 4.
Attachment #8376584 - Flags: review?(ehsan)
I decided to go with an absolute indirect jump rather than initial-nop trickery. A slot on the trampoline serves as the pointer for the jump.
Attachment #8376585 - Flags: review?(ehsan)
Attached patch Part 5: Tests for shared hooks (obsolete) — Splinter Review
Attachment #8376586 - Flags: review?(ehsan)
Comment on attachment 8376581 [details] [diff] [review]
Part 1: Don't undo hooks to functions that have since changed

Review of attachment 8376581 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/xre/nsWindowsDllInterceptor.h
@@ +89,5 @@
>      for (int i = 0; i < mPatchedFnsLen; i++) {
>        byteptr_t fn = mPatchedFns[i];
>  
> +      // If other code has changed this function, it is not safe to modify.
> +      if (*((uint16_t*)fn) != 0xf9eb) {

Please move 0xf9eb to a static const, and use it both here and in WriteHook.

@@ +242,5 @@
> +      }
> +
> +      // If other code has changed this function, it is not safe to modify.
> +#if defined(_M_IX86)
> +      if (*origBytes != 0xe9) {

Ditto.

@@ +246,5 @@
> +      if (*origBytes != 0xe9) {
> +        continue;
> +      }
> +#elif defined(_M_X64)
> +      if (*((uint16_t*)origBytes) != 0xbb49) {

Ditto.
Attachment #8376581 - Flags: review?(ehsan) → review+
Attachment #8376582 - Flags: review?(ehsan) → review+
Attachment #8376584 - Flags: review?(ehsan) → review+
Comment on attachment 8376585 [details] [diff] [review]
Part 4: Support absolute jumps in hooks

Review of attachment 8376585 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/xre/nsWindowsDllInterceptor.h
@@ +45,5 @@
>   *    function.
> + *    (Special "shared" mode) Replace first 6 bytes of OrigFunction with an
> + *    indirect jump to the hook function. The indirect jump uses an absolute
> + *    address, which allows us to coexist with other hooks that don't know how
> + *    to relocate our 5-byte PC-relative jump.

Please document what you mean here by "shared" mode.  It won't be clear unless people read this bug.
Attachment #8376585 - Flags: review?(ehsan) → review+
Comment on attachment 8376586 [details] [diff] [review]
Part 5: Tests for shared hooks

Review of attachment 8376586 [details] [diff] [review]:
-----------------------------------------------------------------

Very neat test!
Attachment #8376586 - Flags: review?(ehsan) → review+
Invited some static consts to the party
Attachment #8363947 - Attachment is obsolete: true
Attachment #8376581 - Attachment is obsolete: true
Attachment #8378740 - Flags: review+
Attachment #8376582 - Attachment is obsolete: true
Attachment #8378741 - Flags: review+
Attachment #8376584 - Attachment is obsolete: true
Attachment #8378742 - Flags: review+
Attachment #8376585 - Attachment is obsolete: true
Attachment #8378743 - Flags: review+
Attached patch Part 5: Tests for shared hooks (obsolete) — Splinter Review
Attachment #8376586 - Attachment is obsolete: true
Attachment #8378746 - Flags: review+
Keywords: checkin-needed
Ok, let's talk about testing. Basically we won't know whether this fixes the CryptBaseRunOnceRoutine crash until we get it out to a beta population. We only get about one crash a week on Nightly and Aurora. I still don't even know what the other software is; I've just been going off of the remnants that it leaves behind.

Here's what I was able to test: I verified on the latest nightly that the new-style hook gets installed, and that the pass-through to the original function (in the absence of other hooks) works correctly. In order to test our compatibility with various ntdll, I ran the latest TestDllInterceptor.exe on all of our supported Windows versions and service packs from XP SP2 to Windows 8.1. The tests all passed.
Attached patch Aurora patch rollup (obsolete) — Splinter Review
[Approval Request Comment]
Bug caused by (feature/regressing bug #): bug 932100 + external software
User impact if declined: startup crashes
Testing completed (on m-c, etc.): see comment 45
Risk to taking this patch (and alternatives if risky): Since we don't know what the other software is, we don't know how it will react to this change. The type of instruction in the new hook is commonly used, so I expect that the other software should understand it, but I don't have proof. If this patch is not accepted for beta then we need to re-apply the bitguard backout from comment 25.
String or IDL/UUID changes made by this patch: none
Attachment #8379910 - Flags: review+
Attachment #8379910 - Flags: approval-mozilla-beta?
Attachment #8379910 - Flags: approval-mozilla-aurora?
Attachment #8379910 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
I am experience a crash on start-up in ntdll.dll as filed in bug 975820. The change-set indicates that this bug was landed when the crashing started. Given that this was a crash-on-startup bug and I am now experiencing a crash-on-startup bug I wonder if the two are related. (They may not be - other patches were also landed in the cset, but I am taking a guess!)
Comment on attachment 8379910 [details] [diff] [review]
Aurora patch rollup

Putting beta nomination on hold while we look at bug 975820.
Attachment #8379910 - Flags: approval-mozilla-beta?
Backed out at dmajor's request.

https://hg.mozilla.org/integration/mozilla-inbound/rev/1f7c3b10f482
https://hg.mozilla.org/releases/mozilla-aurora/rev/cf50055d30af
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Target Milestone: mozilla30 → ---
Comment on attachment 8379910 [details] [diff] [review]
Aurora patch rollup

Backed out of aurora
Attachment #8379910 - Flags: approval-mozilla-aurora+
Depends on: 975820
Attached patch Pseudo-backout of 932100 (obsolete) — Splinter Review
Revive attachment 8363947 [details] [diff] [review] and un-bitrot
Attachment #8378740 - Attachment is obsolete: true
Attachment #8378741 - Attachment is obsolete: true
Attachment #8378742 - Attachment is obsolete: true
Attachment #8378743 - Attachment is obsolete: true
Attachment #8378746 - Attachment is obsolete: true
Attachment #8379910 - Attachment is obsolete: true
Attachment #8382635 - Flags: review+
Comment on attachment 8382635 [details] [diff] [review]
Pseudo-backout of 932100

Given the recent churn, we aren't going to have a complete fix for Beta in time (possibly not Aurora either). As a stopgap, this patch once again reverts bug 932100, just like we did for beta 27.

[Approval Request Comment]
Bug caused by (feature/regressing bug #): Bug 932100 + external code
User impact if declined: Startup crashes
Testing completed (on m-c, etc.): We already applied this to beta 27 (comment 25)
Risk to taking this patch (and alternatives if risky): Low
String or IDL/UUID changes made by this patch: None
Attachment #8382635 - Flags: approval-mozilla-beta?
Attachment #8382635 - Flags: approval-mozilla-aurora?
Firefox 30 is still marked as affected. If it is still the case, could you land the patch in m-c before? Thanks.
(In reply to Sylvestre Ledru [:sylvestre] from comment #54)
> Firefox 30 is still marked as affected. If it is still the case, could you
> land the patch in m-c before? Thanks.

This is kind of a special case because the proposed patch is not really a fix; it is a stopgap to buy us some time while we try experimental fixes on nightly.

If you insist on landing this on 30 then I'll do it, but it seems awkward because I would just undo it soon afterward.
Flags: needinfo?(sledru)
Comment on attachment 8382635 [details] [diff] [review]
Pseudo-backout of 932100

No, I am fine with the explanation. Thanks!
Attachment #8382635 - Flags: approval-mozilla-beta?
Attachment #8382635 - Flags: approval-mozilla-beta+
Attachment #8382635 - Flags: approval-mozilla-aurora?
Attachment #8382635 - Flags: approval-mozilla-aurora+
Flags: needinfo?(sledru)
Due to bug 975820, the six-byte indirect jump approach has been backed out. This patch is a different approach, using our usual five-byte relative jumps (in other words, Ehsan's suggestion from comment 17).

I was originally wary of this approach on the premise that: if the unknown software doesn't understand our two-byte jumps, they probably won't understand five-byte jumps either.

However, in bug 975280 we learned that various antivirus programs are already installing five-byte hooks, long before any other code runs. Surely there must be some overlap between McAffee/Comodo users, and users who have the unknown software -- so, if the unknown software doesn't understand five-byte jumps, then we would have seen crash reports on this before we started messing with our hooks. So there is a chance that this patch might work.

I offered a try build of this in bug 975820 comment 18, and users of both antiviruses report no crashes. If we're wrong on this approach, then the CryptBaseRunOnceRoutine crash will still happen. I'm willing to give it a try and hope for the best, because it's one of our only remaining low-risk options.
Attachment #8383957 - Flags: review?(ehsan)
Comment on attachment 8383957 [details] [diff] [review]
5-byte relative jump

Review of attachment 8383957 [details] [diff] [review]:
-----------------------------------------------------------------

Let's try this!
Attachment #8383957 - Flags: review?(ehsan) → review+
Attachment #8382635 - Attachment is obsolete: true
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/2c010a9a5d4b
Status: REOPENED → RESOLVED
Closed: 10 years ago10 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla30
[Approval Request Comment]
Bug caused by (feature/regressing bug #): Bug 932100 + external software
User impact if declined: Startup crashes
Testing completed (on m-c, etc.): This has been on nightly for several days, no issues on mozillazine
Risk to taking this patch (and alternatives if risky): This is unlikely to make things worse, but it's possible that this still won't fix the CryptBaseRunOnceRoutine crashes. We won't know until this reaches the beta audience. I am requesting Aurora approval in order to get this onto the first 29 beta.
String or IDL/UUID changes made by this patch: None
Attachment #8388762 - Flags: review+
Attachment #8388762 - Flags: approval-mozilla-aurora?
Attachment #8388762 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
No signs of this crash on 29.0b so far. We should give it a few more days before declaring victory, but it's looking good!
This signature has essentially disappeared from 29.0b. I see three stray hits in the past week, but they are at an unrelated, later part of CryptBaseRunOnceRoutine.

Looks like we've finally got a working fix, and it also means the AppInit protections are enabled again. Thanks everyone for the patience with the various iterations of this change.
(In reply to David Major [:dmajor] (Away March 22 to April 2) from comment #66)
> Looks like we've finally got a working fix, and it also means the AppInit
> protections are enabled again.

Awesome, looking forward to having better blocklisting in 29 and later!
looks like this is all gone.
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: