Closed Bug 1235982 Opened 4 years ago Closed 2 years ago

Investigate enabling VS2015's "Control Flow Guard" (CFG) security checks using /guard:cf

Categories

(Firefox Build System :: General, defect, P1)

defect

Tracking

(firefox60 fixed)

RESOLVED FIXED
mozilla60
Tracking Status
firefox60 --- fixed

People

(Reporter: cpeterson, Assigned: tjr)

References

(Blocks 2 open bugs)

Details

Attachments

(2 files, 5 obsolete files)

VS2015 introduced a new runtime security feature called "Control Flow Guard" (CFG):

"When the /guard:cf Control Flow Guard (CFG) option is specified, the compiler and linker insert extra runtime security checks to detect attempts to compromise your code. During compiling and linking, all indirect calls in your code are analyzed to find every location that the code can reach when it runs correctly."

http://blogs.msdn.com/b/vcblog/archive/2014/12/08/visual-studio-2015-preview-work-in-progress-security-feature.aspx

https://msdn.microsoft.com/en-us/library/dn919635.aspx

Adobe incrementally enabled CFG for the Flash Player in 2015:

http://blogs.adobe.com/security/2015/12/community-collaboration-enhances-flash.html
Sounds similiar to Intel MPX... which we may want to investigate on all desktop platforms.
Depends on: vs2015
Yes, Control Flow Guard is interesting.

To be most effective, we will have to annotate JIT code as well though, else it will just treat all JIT code as valid jump targets. MS did this for Chakra's JIT a while ago.

Anyway, once we have this feature enabled for our C++ code, I'd be happy to look into the JIT side of this.
Assignee: nobody → ted
(In reply to Jan de Mooij [:jandem] from comment #2)
> Yes, Control Flow Guard is interesting.
> 
> To be most effective, we will have to annotate JIT code as well though, else
> it will just treat all JIT code as valid jump targets. MS did this for
> Chakra's JIT a while ago.
> 
> Anyway, once we have this feature enabled for our C++ code, I'd be happy to
> look into the JIT side of this.

It looks like you'd want to use SetProcessValidCallTargets for this (it's only available on Windows 10 and Server 2016, so you'd have to dynamically load it):
https://msdn.microsoft.com/en-us/library/windows/desktop/dn934202(v=vs.85).aspx

The CFG documentation does validate your statement:
"The VirtualProtect and VirtualAlloc functions will by default treat a specified region of executable and committed pages as valid indirect call targets."
https://msdn.microsoft.com/en-us/library/windows/desktop/mt637065(v=vs.85).aspx
I pushed this to Try so we can see if there's any perf hit in Talos from enabling this.
Considering the description of the feature, I'm afraid of the result both in size and in performance for binaries as large as xul.dll. Did you look what that ended up generating for indirect calls?
I didn't yet. I was trying to get Windows 10 Talos results, because that's where the feature is actually most useful, but we don't run those by default so I'd have to do a no-op Try push of the base changeset to get results to compare against.

I just triggered some extra Talos runs on Win 7/Win 8 so we can compare those. The builds are done if you had something specific in mind to compare:
http://archive.mozilla.org/pub/firefox/try-builds/tmielczarek@mozilla.com-6868c8ea9165a1855f4b694211cf77e2f9a86eff/try-win32/
http://archive.mozilla.org/pub/firefox/try-builds/tmielczarek@mozilla.com-6868c8ea9165a1855f4b694211cf77e2f9a86eff/try-win64/
Hi Ted, what's the status here? Do we have Windows 10 Talos results?
Flags: needinfo?(ted)
(In reply to Jan de Mooij [:jandem] from comment #9)
> Hi Ted, what's the status here? Do we have Windows 10 Talos results?

I never got to try that. I'm not very good at evaluating the performance impact of things, so we might need to find someone else to look at that. It should just be a matter of pushing my patch to Try, and triggering the right jobs, and comparing against the right base revision.
Flags: needinfo?(ted)
OK I'll push this to Try so we can at least compare binary size and figure out how to run Windows 10 jobs on Try.
Flags: needinfo?(jdemooij)
Talos shows some regressions unfortunately. xul.dll is 3 MB bigger on 32-bit, 3.9 MB on 64-bit. I then realized I should probably benchmark PGO builds, as that's what we ship to users and MSVC PGO does aggressive inlining of virtual calls. So I'll have to get numbers for PGO builds next.

Talos results:

https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=9d841679906f&newProject=try&newRevision=28398cef4bc8&framework=1&showOnlyImportant=0

Windows 64-bit, binary size:

ZIP file: 57619628 -> 59025348 bytes
xul.dll:  67876944 -> 71953488 bytes

Windows 32-bit, binary size:

ZIP file: 53574619 -> 54698245 bytes
xul.dll:  54236752 -> 57389648 bytes
For the PGO builds, we also have some regressions unfortunately. It probably makes more sense to enable CFG incrementally: start with the relatively cold code and add it to one moz.build file at a time.

Talos: https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=194ed0aac059&newProject=try&newRevision=ed3c138a666c&framework=1&showOnlyImportant=0

Windows 64-bit, binary size:

ZIP file: 57727770 -> 58709306 bytes
xul.dll:  64057424 -> 66875472 bytes

Windows 32-bit, binary size:

ZIP file: 54686009 -> 55683793 bytes
xul.dll:  52973136 -> 56349776 bytes
Sigh, the perfherder links no longer show the Windows 10 results. Maybe a side effect of the perfherder bustage last week :/
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #5)
> (In reply to Jan de Mooij [:jandem] from comment #2)
> > Yes, Control Flow Guard is interesting.
> > 
> > To be most effective, we will have to annotate JIT code as well though, else
> > it will just treat all JIT code as valid jump targets. MS did this for
> > Chakra's JIT a while ago.
> > 
> > Anyway, once we have this feature enabled for our C++ code, I'd be happy to
> > look into the JIT side of this.
> 
> It looks like you'd want to use SetProcessValidCallTargets for this (it's
> only available on Windows 10 and Server 2016, so you'd have to dynamically
> load it):
> https://msdn.microsoft.com/en-us/library/windows/desktop/dn934202(v=vs.85).
> aspx
> 
> The CFG documentation does validate your statement:
> "The VirtualProtect and VirtualAlloc functions will by default treat a
> specified region of executable and committed pages as valid indirect call
> targets."
> https://msdn.microsoft.com/en-us/library/windows/desktop/mt637065(v=vs.85).
> aspx

There is another option to SetProcessValidCallTargets, which is to mark the functions that call the JITed functions with __declspec(guard(ignore)) which will disable CFG for all indirect calls made inside the marked method.  More here: https://blog.trailofbits.com/2016/12/27/lets-talk-about-cfi-microsoft-edition/

Obviously this is less secure than using the SetProcessValidCallTargets API and retaining CFG.
I'm not working on this atm. I think we first have to decide whether the binary size increase in comment 13 is acceptable, and then we should investigate the Talos regressions. Maybe we can start turning this on for parts of the tree (like SpiderMonkey).
Flags: needinfo?(jdemooij)
I tried with VS2017 RC2 (non-PGO) and it's still the same 6% size increase as before, so no improvement there. It's only about 250k for just SpiderMonkey, so that could be a good place to start if we wanted to do it incrementally.

One tricky thing is that we wouldn't get crash reports for these. According to the comments in https://blogs.msdn.microsoft.com/vcblog/2014/12/08/visual-studio-2015-preview-work-in-progress-security-feature/, CFG failures go through RaiseFailFastException which bypasses Breakpad's exception handler. (And apparently the same goes for /GS failures, which is a good thing to keep in mind. I remember a while back we were wondering about possible classes of crashes that go unreported.)
(In reply to David Major [:dmajor] from comment #17)
> One tricky thing is that we wouldn't get crash reports for these. According
> to the comments in
> https://blogs.msdn.microsoft.com/vcblog/2014/12/08/visual-studio-2015-
> preview-work-in-progress-security-feature/, CFG failures go through
> RaiseFailFastException which bypasses Breakpad's exception handler. (And
> apparently the same goes for /GS failures, which is a good thing to keep in
> mind. I remember a while back we were wondering about possible classes of
> crashes that go unreported.)

That's pretty annoying. Could we hook RaiseFailFastException? How hard would it be to trigger one of these crashes intentionally for testing (/GS or /guard:cf)?
(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #18)
> That's pretty annoying. Could we hook RaiseFailFastException? How hard would
> it be to trigger one of these crashes intentionally for testing (/GS or
> /guard:cf)?

I assembled "xor ecx, ecx; xor edx, edx; call kernelbase!RaiseFailFastException" into some random point of my Nightly, detached my debugger, and indeed it skipped Breakpad and landed back in my default postmortem debugger (or presumably would have popped WER on a real-world machine).

I hesitate to think about hooking the API since it seems deliberately designed to avoid (potentially buggy) exception hooks and instead terminate as directly as possible. It may be safer to try to look at the crash reports from the WER side. I want to say that we had an account at one point, but I don't know if it's still active.
(In reply to David Major [:dmajor] from comment #19)
> I hesitate to think about hooking the API since it seems deliberately
> designed to avoid (potentially buggy) exception hooks and instead terminate
> as directly as possible. It may be safer to try to look at the crash reports
> from the WER side. I want to say that we had an account at one point, but I
> don't know if it's still active.

This is a fair point given what scenarios this protection is trying to prevent, but it does seem very unfortunate that we can't track crashes like this using Socorro - especially if the net result is that they end up in a memory hole.
https://codereview.chromium.org/2412983006 suggests that you can take advantage of CFG in system libraries by setting guard:cf on the .exe, but I can't find any documentation to support this. If it's true, that could be another first-step that we could take with minimal size impact.
(In reply to David Major [:dmajor] from comment #21)
> https://codereview.chromium.org/2412983006 suggests that you can take
> advantage of CFG in system libraries by setting guard:cf on the .exe, but I
> can't find any documentation to support this. If it's true, that could be
> another first-step that we could take with minimal size impact.

I believe this refers to the Phase 1 rollout mentioned here: https://bugs.chromium.org/p/chromium/issues/detail?id=584575#c8

Which is to say, they haven't enabled it on all their individual compilation units/libraries/subprocesses - only on the final 'bundle it all up into a exe' step. (I'm not familiar enough with the build system (yet) to know if this is similarly feasible for us.)
(In reply to Tom Ritter [:tjr] from comment #22)

I wouldn't really describe it as a bundling step. Much like us, Chrome has a small .exe that more or less just loads up the DLLs where the actual code lives. My understanding is that they've enabled CFG on the cpp files and linking step of chrome.exe but not the other libraries.

The analogous thing for us would be to enable for firefox.exe (a 468K file) but not xul.dll (a 61M file). Our build system should be able to handle that pretty easily, likely with just a couple tweaks to http://searchfox.org/mozilla-central/source/browser/app/moz.build.
Comment on attachment 8840087 [details]
Bug 1235982 Add CFG to firefox.exe and mozglue, and a mochitest to ensure a crash

https://reviewboard.mozilla.org/r/114624/#review116182

Your changes added trailing whitespace to some files.

::: toolkit/xre/nsAppRunner.cpp:102
(Diff revision 1)
>  #include <windows.h>
>  #include <intrin.h>
>  #include <math.h>
>  #include "cairo/cairo-features.h"
>  #include "mozilla/WindowsDllBlocklist.h"
> +#include "mozilla/WindowsCFGStatus.h"

WindowsCFGStatus.h should be alphabetically #included before WindowsDllBLocklist.h.
Comment on attachment 8840087 [details]
Bug 1235982 Add CFG to firefox.exe and mozglue, and a mochitest to ensure a crash

https://reviewboard.mozilla.org/r/114624/#review116182

Fixed! (modulo the test I need to rewrite)

> WindowsCFGStatus.h should be alphabetically #included before WindowsDllBLocklist.h.

Fixed!
Comment on attachment 8840087 [details]
Bug 1235982 Add CFG to firefox.exe and mozglue, and a mochitest to ensure a crash

https://reviewboard.mozilla.org/r/114624/#review116554

::: toolkit/xre/test/browser_checkcfgstatus.js:1
(Diff revision 2)
> +// Any copyright is dedicated to the Public Domain.

This whole test is incorrect and is in-development.
Depends on: 1342564
Attachment #8840087 - Flags: review?(mh+mozilla)
Comment on attachment 8840087 [details]
Bug 1235982 Add CFG to firefox.exe and mozglue, and a mochitest to ensure a crash

https://reviewboard.mozilla.org/r/114624/#review116890

::: toolkit/xre/test/browser_checkcfgstatus.js:65
(Diff revision 4)
> +                  "work properly.");
> +
> +        let contentProcessGone = TestUtils.topicObserved("ipc:content-shutdown");
> +
> +        ContentTask.spawn(browser, null, function() {
> +          privateNoteIntentionalCrash();

Note that this test will throw a false positive (it will always crash and thus indicate CFG is enabled) until https://bugzilla.mozilla.org/show_bug.cgi?id=1342564 is fixed.
Okay, thanks to the herculean efforts of mconley and dmajor, I've enabled CFG on firefox.exe and mozglue.dll and implemented a test that ensures they are enabled. I have perf measurements here, but I'm told they may not be accurate (and in particular the fileio, the main significant one, should be ignored.) 

https://treeherder.mozilla.org/perf.html#/compare?originalProject=mozilla-inbound&originalRevision=5a7f8a39068a&newProject=try&newRevision=84ca36169d0f&framework=1&showOnlyImportant=0

What would be the next steps for this?
(In reply to Tom Ritter [:tjr] from comment #32)
> Okay, thanks to the herculean efforts of mconley and dmajor, I've enabled
> CFG on firefox.exe and mozglue.dll and implemented a test that ensures they
> are enabled. I have perf measurements here, but I'm told they may not be
> accurate (and in particular the fileio, the main significant one, should be
> ignored.) 
> 
> https://treeherder.mozilla.org/perf.html#/compare?originalProject=mozilla-
> inbound&originalRevision=5a7f8a39068a&newProject=try&newRevision=84ca36169d0f
> &framework=1&showOnlyImportant=0
> 
> What would be the next steps for this?

Joel, can you help Tom interpret the perf results? Our main areas to watch would be binary size, startup time, and jemalloc perf.
Flags: needinfo?(jmaher)
why are there so many retriggers for osx?  isn't this a windows only compiler bug?
I am not sure about jemalloc perf.  We do track binary sizes, although I have not seen that be stable- I can look at startup time when more results come in.

you can see the data for build metrics by switching the framework in the compare view:
https://treeherder.mozilla.org/perf.html#/compare?originalProject=mozilla-inbound&originalRevision=5a7f8a39068a&newProject=try&newRevision=84ca36169d0f&framework=2&filter=win&showOnlyImportant=0

unfortunately 1 data point isn't a good indicator as we have stopped uploading installer size data as of Feb 21st for some odd reason.
(In reply to Joel Maher ( :jmaher) from comment #34)
> why are there so many retriggers for osx?  isn't this a windows only
> compiler bug?

Yes. Are those retriggers not on the base comparison? That is to say: I didn't do them and am not sure who did. I don't know why they show up.
after retriggers to wash out noise, there are only 2 issues on the comparison, these are actually an issue with addon signing (we used signed addons in production, but unsigned on try)- so these are false positives.

From a talos perspective this is all good.
Flags: needinfo?(jmaher)
Comment on attachment 8840087 [details]
Bug 1235982 Add CFG to firefox.exe and mozglue, and a mochitest to ensure a crash

https://reviewboard.mozilla.org/r/114624/#review117650

Someone with knowledge in mochitests should look at the test itself.

I'm not all that convinced that enabling CFG in 2 directories is going to make a difference. The only sources that will be built with -guard:cf are browser/app/nsBrowserApp.cpp, mozglue/build/dummy.cpp, mozglue/build/WindowsDllBlocklist.cpp and mozglue/build/SSE.cpp. Maybe LTCG with -guard:cf in LDFLAGS forces other things that end up in firefox.exe and mozglue.dll to be compiled with CFG, but a) that needs to be assessed whether that's true, b) if it's true, it would only work on PGO builds (so not on local developer builds) and c) firefox.exe and mozglue.dll are still very small attack surfaces. The majority of ROP targets are still in xul.dll.

I guess there's a possible (small) win from being able to enable the system32.dll CFG through firefox.exe, though...

::: browser/app/moz.build:48
(Diff revision 4)
>      CFLAGS += ['-guard:cf']
>      CXXFLAGS += ['-guard:cf']
>      LDFLAGS += ['-guard:cf']

Something is wrong here. Your patch is not based on vanilla mozilla-central, and some other patch already added this, with a comment.

::: mozglue/build/WindowsCFGStatus.cpp:6
(Diff revision 4)
> +#ifdef MOZ_MEMORY
> +#define MOZ_MEMORY_IMPL
> +#include "mozmemory_wrap.h"
> +#define MALLOC_FUNCS MALLOC_FUNCS_MALLOC
> +// See mozmemory_wrap.h for more details. This file is part of libmozglue, so
> +// it needs to use _impl suffixes.
> +#define MALLOC_DECL(name, return_type, ...) \
> +  extern "C" MOZ_MEMORY_API return_type name ## _impl(__VA_ARGS__);
> +#include "malloc_decls.h"
> +#endif

It's not clear to me why you need this. You're not using allocator functions.

::: mozglue/build/WindowsCFGStatus.cpp:59
(Diff revision 4)
> +}
> +
> +typedef int (*int_arg_fn)(int);
> +
> +MFBT_API bool
> +CFG_CheckStatus()

This function name and signature is confusing. It only returns if CFG is not properly enabled.

::: mozglue/build/WindowsCFGStatus.cpp:66
(Diff revision 4)
> +  int val;
> +  val=setjmp(env);
> +
> +  if(val == 0) {
> +    int_arg_fn slide_to_the_left = (int_arg_fn)((uintptr_t)(not_entry_point)+0x20);
> +    slide_to_the_left(10);

It doesn't seem the argument has any sort of importance here. It just makes things harder to understand.

::: xpcom/system/nsIXULRuntime.idl:187
(Diff revision 4)
> +
> +  /**
> +   * Crashes if CFG is enabled. This is only for automated
> +   * testing purposes.
> +   */
> +  readonly attribute boolean windowsCFGEnabled;

I'm not a big fan of adding a seemingly public API to an IDL to trigger a crash. You might as well open the mozglue dll and call the function directly in the mochitest. Or create an entirely separate dll, ship it with mochitest and use that during mochitest.
Attachment #8840087 - Flags: review?(mh+mozilla)
Comment on attachment 8840087 [details]
Bug 1235982 Add CFG to firefox.exe and mozglue, and a mochitest to ensure a crash

https://reviewboard.mozilla.org/r/114624/#review117656

::: mozglue/build/WindowsCFGStatus.cpp:59
(Diff revision 4)
> +CFG_CheckStatus()
> +{
> +  int val;
> +  val=setjmp(env);
> +
> +  if(val == 0) {
> +    int_arg_fn slide_to_the_left = (int_arg_fn)((uintptr_t)(not_entry_point)+0x20);
> +    slide_to_the_left(10);
> +  }
> +
> +  return false;
> +}

Forgot to mention: The whole setup here is really not obvious, and it took some mental effort to understand how and why it works. A comment detailing everything would go a long way.
(In reply to Mike Hommey [:glandium] from comment #38)
> I'm not all that convinced that enabling CFG in 2 directories is going to
> make a difference. The only sources that will be built with -guard:cf are
> browser/app/nsBrowserApp.cpp, mozglue/build/dummy.cpp,
> mozglue/build/WindowsDllBlocklist.cpp and mozglue/build/SSE.cpp. Maybe LTCG
> with -guard:cf in LDFLAGS forces other things that end up in firefox.exe and
> mozglue.dll to be compiled with CFG, but a) that needs to be assessed
> whether that's true, b) if it's true, it would only work on PGO builds (so
> not on local developer builds) and c) firefox.exe and mozglue.dll are still
> very small attack surfaces. The majority of ROP targets are still in xul.dll.
> 
> I guess there's a possible (small) win from being able to enable the
> system32.dll CFG through firefox.exe, though...

Right, it was my understanding that we wanted to deploy it incrementally before turning it on for xul.dll - that's definitely where most of the benefit lies. The point of turning it on in mozglue was just to make the test work and to ship some additional DLL with it. I'd be happy to turn it on in more dlls - libglexv2, mozavcodec, libegl (or even xul.dll) though.

I will work on the patch to clean it up with your feedback, thanks.
Comment on attachment 8840087 [details]
Bug 1235982 Add CFG to firefox.exe and mozglue, and a mochitest to ensure a crash

https://reviewboard.mozilla.org/r/114624/#review117650

> Something is wrong here. Your patch is not based on vanilla mozilla-central, and some other patch already added this, with a comment.

Yea, I don't think I was based on -central.
Comment on attachment 8840087 [details]
Bug 1235982 Add CFG to firefox.exe and mozglue, and a mochitest to ensure a crash

https://reviewboard.mozilla.org/r/114624/#review119512

::: browser/app/moz.build:70
(Diff revision 5)
>      DELAYLOAD_DLLS += [
>          'winmm.dll',
>          'user32.dll',
>      ]
>  
> +    if CONFIG['_MSC_VER']:

Unintuitively, this matches for clang-cl. I think you want to test CONFIG['CC_TYPE'] == 'msvc' instead.

::: mozglue/build/WindowsCFGStatus.h:16
(Diff revision 5)
> +#include <windows.h>
> +#include "mozilla/Attributes.h"
> +#include "mozilla/Types.h"
> +
> +extern "C" {
> +MFBT_API bool CFG_CrashOrDisabled();

DisabledOrCrash?

::: mozglue/build/WindowsCFGStatus.cpp:39
(Diff revision 5)
> +
> +  longjmp(env, 1);
> +  return;
> +}
> +
> +typedef int (*fn_ptr)();

void

::: mozglue/build/WindowsCFGStatus.cpp:45
(Diff revision 5)
> +
> +/*
> + * CFG protects indirect function calls and ensures they call a valid entry
> + * point. We create a function pointer that calls an invalid entry point.
> + * That invalid entry point is a nop sled.
> + * 

tailing whitespace

::: mozglue/build/WindowsCFGStatus.cpp:48
(Diff revision 5)
> + * To 'return' from the function safely we jump back to our original
> + * function - no preamble and no return.

Come to think of it, what's not clear to me is why this is expected to crash when CFG is enabled, considering you longjmp and skip the function return code that does the CFG check. Are we then hitting and failing the CFG check from CFG_CrashOrDisabled? Or does CFG add a check on longjmp too?
Attachment #8840087 - Flags: review?(mh+mozilla)
(Please also find a reviewer for the test)
Comment on attachment 8840087 [details]
Bug 1235982 Add CFG to firefox.exe and mozglue, and a mochitest to ensure a crash

https://reviewboard.mozilla.org/r/114624/#review119512

> Come to think of it, what's not clear to me is why this is expected to crash when CFG is enabled, considering you longjmp and skip the function return code that does the CFG check. Are we then hitting and failing the CFG check from CFG_CrashOrDisabled? Or does CFG add a check on longjmp too?

The CFG check occurs on the function call, not the function return. So it's the indirect function call sldie_to_the_left() that triggers the CFG check and crash.
Attachment #8840087 - Flags: review?(mconley)
(In reply to Tom Ritter [:tjr] from comment #46)
> Comment on attachment 8840087 [details]
> Bug 1235982 Add CFG to firefox.exe and mozglue, and a mochitest to ensure a
> crash
> 
> https://reviewboard.mozilla.org/r/114624/#review119512
> 
> > Come to think of it, what's not clear to me is why this is expected to crash when CFG is enabled, considering you longjmp and skip the function return code that does the CFG check. Are we then hitting and failing the CFG check from CFG_CrashOrDisabled? Or does CFG add a check on longjmp too?
> 
> The CFG check occurs on the function call, not the function return. So it's
> the indirect function call sldie_to_the_left() that triggers the CFG check
> and crash.

Mmmm... how does that protect anything, then? if some malicious code is jumping in the middle of a function, it's not going to be running the CFG check that is added at compile time for function calls... Or does that only protect against malicious code creating fake vtables?
(In reply to Mike Hommey [:glandium] from comment #47)
> (In reply to Tom Ritter [:tjr] from comment #46)
> > Comment on attachment 8840087 [details]
> > Bug 1235982 Add CFG to firefox.exe and mozglue, and a mochitest to ensure a
> > crash
> > 
> > https://reviewboard.mozilla.org/r/114624/#review119512
> > 
> > > Come to think of it, what's not clear to me is why this is expected to crash when CFG is enabled, considering you longjmp and skip the function return code that does the CFG check. Are we then hitting and failing the CFG check from CFG_CrashOrDisabled? Or does CFG add a check on longjmp too?
> > 
> > The CFG check occurs on the function call, not the function return. So it's
> > the indirect function call sldie_to_the_left() that triggers the CFG check
> > and crash.
> 
> Mmmm... how does that protect anything, then? if some malicious code is
> jumping in the middle of a function, it's not going to be running the CFG
> check that is added at compile time for function calls... Or does that only
> protect against malicious code creating fake vtables?

Well it certainly doesn't protect everything, like attacker-controlled jmp targets. But it does protect against function pointer overwriting, some type confusion, and some particular exploitations of UAFs. https://blog.trailofbits.com/2016/12/27/lets-talk-about-cfi-microsoft-edition/ is a good rundown and they have a showcase at https://github.com/trailofbits/cfg-showcase They also have a real-world example of trying to deploy CFG and CFI: https://blog.trailofbits.com/2017/02/20/the-challenges-of-deploying-security-mitigations/
Comment on attachment 8840087 [details]
Bug 1235982 Add CFG to firefox.exe and mozglue, and a mochitest to ensure a crash

https://reviewboard.mozilla.org/r/114624/#review120568

Hey tjr,

I have some questions about our choice here putting the function inside mozglue and accessing it with ctypes. Is there a reason we can't make this a function of nsIDebug2? See below.

::: mozglue/build/WindowsCFGStatus.cpp:52
(Diff revision 6)
> + * performs, so if we hit the return (ret) we would mess up the stack.
> + * To 'return' from the function safely we jump back to our original
> + * function - no preamble and no return.
> + *
> + * We use setjmp/longjmp because inline asm instructions aren't supported
> + * in x64 by MSVC.

Can you document how the return value should be interpreted?

::: mozglue/build/WindowsCFGStatus.cpp:59
(Diff revision 6)
> +  int val;
> +  val=setjmp(env);

Nit: spaces before and after =.

We can also combine these two:

```C++
int val = setjmp(env);
```

::: mozglue/build/WindowsCFGStatus.cpp:62
(Diff revision 6)
> +  // setjmp returns 0 on the initial call and whatever value is given
> +  // to longjmp (here it is 1) when it is jumped back to.
> +  int val;
> +  val=setjmp(env);
> +
> +  if(val == 0) {

Nit: Space after `if`

::: mozglue/build/WindowsCFGStatus.cpp:63
(Diff revision 6)
> +  // to longjmp (here it is 1) when it is jumped back to.
> +  int val;
> +  val=setjmp(env);
> +
> +  if(val == 0) {
> +    fn_ptr slide_to_the_left = (fn_ptr)((uintptr_t)(not_entry_point)+0x20);

Spaces on either side of +.

::: mozglue/build/WindowsCFGStatus.cpp:65
(Diff revision 6)
> +  val=setjmp(env);
> +
> +  if(val == 0) {
> +    fn_ptr slide_to_the_left = (fn_ptr)((uintptr_t)(not_entry_point)+0x20);
> +    slide_to_the_left();
> +  }

Maybe add some documentation here that if CFG is indeed enabled, we expect to crash here.

::: toolkit/xre/test/browser_checkcfgstatus.js:3
(Diff revision 6)
> +"use strict";
> +
> +// From browser_crash_previous_frameloader.js

I think you can remove this documentation. It's unfortunate that we're duplicating here. :/

::: toolkit/xre/test/browser_checkcfgstatus.js:41
(Diff revision 6)
> +    file.remove(true);
> +  }
> +}
> +
> +// Calls a function that should crash when CFG is enabled
> +add_task(function* test_cfg_enabled() {

I'm not certain we need a full-blown mochitest-browser test here. That seems a bit heavy-weight for what we're testing here.

We probably want to use a mochitest-plain instead.

I'm also worried that this test is going to break as we ratchet up sandboxing, since I suspect at some point, pointing ctypes at mozglue.dll is probably going to stop working.

I hate to tell you this since I know you worked so hard to get this stood up and working as is! But looking at it all together as the reviewer, I'm starting to see the potential for brittleness.

Here's my recommendation (which maybe you should run by `ted` or `froydnj`?):

1. Instead of adding the function to mozglue, add the function to [nsDebugImpl.cpp](http://searchfox.org/mozilla-central/rev/78ac0ceba97bd2deed847a8d0ae86ccf7a8887bf/xpcom/base/nsDebugImpl.cpp)
2. Add a method to call your function to [nsIDebug2.idl](http://searchfox.org/mozilla-central/source/xpcom/base/nsIDebug2.idl) (I can guide you on how to wire that up).
2. Write a mochitest-plain test [kinda like this one](http://searchfox.org/mozilla-central/rev/78ac0ceba97bd2deed847a8d0ae86ccf7a8887bf/dom/plugins/test/mochitest/test_hanging.html#12). The script runs in the scope of content by default. You'll need SpecialPowers to get access to nsIDebug2, but then you can crash to your hearts content. That should simplify a lot of this, I think.

I know that sounds like a lot. I'm willing to sign-off on this test and let you land it as a mochitest-browser if you file a follow-up bug to replace it with a mochitest-plain.
Attachment #8840087 - Flags: review?(mconley) → review-
Blocks: 1345985
Comment on attachment 8840087 [details]
Bug 1235982 Add CFG to firefox.exe and mozglue, and a mochitest to ensure a crash

https://reviewboard.mozilla.org/r/114624/#review120568

> I'm not certain we need a full-blown mochitest-browser test here. That seems a bit heavy-weight for what we're testing here.
> 
> We probably want to use a mochitest-plain instead.
> 
> I'm also worried that this test is going to break as we ratchet up sandboxing, since I suspect at some point, pointing ctypes at mozglue.dll is probably going to stop working.
> 
> I hate to tell you this since I know you worked so hard to get this stood up and working as is! But looking at it all together as the reviewer, I'm starting to see the potential for brittleness.
> 
> Here's my recommendation (which maybe you should run by `ted` or `froydnj`?):
> 
> 1. Instead of adding the function to mozglue, add the function to [nsDebugImpl.cpp](http://searchfox.org/mozilla-central/rev/78ac0ceba97bd2deed847a8d0ae86ccf7a8887bf/xpcom/base/nsDebugImpl.cpp)
> 2. Add a method to call your function to [nsIDebug2.idl](http://searchfox.org/mozilla-central/source/xpcom/base/nsIDebug2.idl) (I can guide you on how to wire that up).
> 2. Write a mochitest-plain test [kinda like this one](http://searchfox.org/mozilla-central/rev/78ac0ceba97bd2deed847a8d0ae86ccf7a8887bf/dom/plugins/test/mochitest/test_hanging.html#12). The script runs in the scope of content by default. You'll need SpecialPowers to get access to nsIDebug2, but then you can crash to your hearts content. That should simplify a lot of this, I think.
> 
> I know that sounds like a lot. I'm willing to sign-off on this test and let you land it as a mochitest-browser if you file a follow-up bug to replace it with a mochitest-plain.

That sounds like a good approach. I opened https://bugzilla.mozilla.org/show_bug.cgi?id=1345985 to track that and will work on it.
Comment on attachment 8840087 [details]
Bug 1235982 Add CFG to firefox.exe and mozglue, and a mochitest to ensure a crash

https://reviewboard.mozilla.org/r/114624/#review120730

::: mozglue/build/WindowsCFGStatus.cpp:19
(Diff revision 7)
> +
> +// Inspired by https://github.com/trailofbits/cfg-showcase/blob/master/cfg_icall.cpp
> +
> +jmp_buf env;
> +
> +#pragma optimize("", off )

You'll want to pair this with a #pragma optimize("", on) after this function to restore the optimization settings.

::: mozglue/build/WindowsCFGStatus.cpp:63
(Diff revision 7)
> +MFBT_API bool
> +CFG_DisabledOrCrash()
> +{
> +  // setjmp returns 0 on the initial call and whatever value is given
> +  // to longjmp (here it is 1) when it is jumped back to.
> +  int val = setjmp(env);

I can't figure out the point of the setjmp here. Is it just so we drop down to the "return false" if the crash doesn't happen? Wouldn't that happen anyway without the setjmp?
Comment on attachment 8840087 [details]
Bug 1235982 Add CFG to firefox.exe and mozglue, and a mochitest to ensure a crash

https://reviewboard.mozilla.org/r/114624/#review120730

> I can't figure out the point of the setjmp here. Is it just so we drop down to the "return false" if the crash doesn't happen? Wouldn't that happen anyway without the setjmp?

Oh, I think I get it: You don't want to follow the natural function epilogue because you skipped the prologue.

Also, I should RTFC :-)
Comment on attachment 8840087 [details]
Bug 1235982 Add CFG to firefox.exe and mozglue, and a mochitest to ensure a crash

https://reviewboard.mozilla.org/r/114624/#review121862
Attachment #8840087 - Flags: review?(mh+mozilla) → review+
(In reply to Mike Conley (:mconley) (Catching up on reviews and needinfos) from comment #49)
> Here's my recommendation (which maybe you should run by `ted` or `froydnj`?):
> 
> 1. Instead of adding the function to mozglue, add the function to
> [nsDebugImpl.cpp](http://searchfox.org/mozilla-central/rev/
> 78ac0ceba97bd2deed847a8d0ae86ccf7a8887bf/xpcom/base/nsDebugImpl.cpp)
> 2. Add a method to call your function to
> [nsIDebug2.idl](http://searchfox.org/mozilla-central/source/xpcom/base/
> nsIDebug2.idl) (I can guide you on how to wire that up).
> 2. Write a mochitest-plain test [kinda like this
> one](http://searchfox.org/mozilla-central/rev/
> 78ac0ceba97bd2deed847a8d0ae86ccf7a8887bf/dom/plugins/test/mochitest/
> test_hanging.html#12). The script runs in the scope of content by default.
> You'll need SpecialPowers to get access to nsIDebug2, but then you can crash
> to your hearts content. That should simplify a lot of this, I think.
> 
> I know that sounds like a lot. I'm willing to sign-off on this test and let
> you land it as a mochitest-browser if you file a follow-up bug to replace it
> with a mochitest-plain.

So after talking with ted, he reminded me that nsDebugImpl.cpp lives in xul.dll. We aren't enabling this on xul.dll (yet) because it increases code size a lot (see above). So until it's enabled in xul.dll I can't move the test there.  But I will be happy to move it when we can get sign-off on enabling it on xul.dll. 

Aside from that, ted said that he believes that the test would need to be a browser-chrome test instead of a mochitest-plain, because it's going to crash the content process. He didn't think mochitest-plain gives that sort of power.
Flags: needinfo?(mconley)
(In reply to Tom Ritter [:tjr] from comment #55)
> So after talking with ted, he reminded me that nsDebugImpl.cpp lives in
> xul.dll. We aren't enabling this on xul.dll (yet) because it increases code
> size a lot (see above). So until it's enabled in xul.dll I can't move the
> test there.  But I will be happy to move it when we can get sign-off on
> enabling it on xul.dll. 
> 
> Aside from that, ted said that he believes that the test would need to be a
> browser-chrome test instead of a mochitest-plain, because it's going to
> crash the content process. He didn't think mochitest-plain gives that sort
> of power.

I spoke to soon, I could keep the test in mozglue but call it from nsDebugImpl.cpp - which I will work on in https://bugzilla.mozilla.org/show_bug.cgi?id=1345985
Comment on attachment 8840087 [details]
Bug 1235982 Add CFG to firefox.exe and mozglue, and a mochitest to ensure a crash

https://reviewboard.mozilla.org/r/114624/#review120730

> You'll want to pair this with a #pragma optimize("", on) after this function to restore the optimization settings.

https://msdn.microsoft.com/en-us/library/chh3fb0k.aspx says the pragma applies on a function-by-function basis...
Comment on attachment 8840087 [details]
Bug 1235982 Add CFG to firefox.exe and mozglue, and a mochitest to ensure a crash

https://reviewboard.mozilla.org/r/114624/#review122274

::: mozglue/build/WindowsCFGStatus.cpp:19
(Diff revision 7)
> +
> +// Inspired by https://github.com/trailofbits/cfg-showcase/blob/master/cfg_icall.cpp
> +
> +jmp_buf env;
> +
> +#pragma optimize("", off )

My interpretation of that sentence is that you can control optimization on at function granularity [by setting a different pragma for each one].

The doc also says "takes effect at the first function defined after the pragma is seen" which kind of only makes sense if it's going to stay in effect afterwards.

In all my experience, people have always paired pragma optimize; check out dxr for examples. I'd be happy to be proven wrong but you'd need to show me actual disassembly. :-)
(In reply to David Major [:dmajor] from comment #58)
> In all my experience, people have always paired pragma optimize; check out
> dxr for examples. I'd be happy to be proven wrong but you'd need to show me
> actual disassembly. :-)

Plus it also eliminates any confusion on the part of the human reader. :-)
Comment on attachment 8840087 [details]
Bug 1235982 Add CFG to firefox.exe and mozglue, and a mochitest to ensure a crash

https://reviewboard.mozilla.org/r/114624/#review122522

Hey tjr,

Sorry for the delay here - got bogged down with other reviews.

::: toolkit/xre/test/browser_checkcfgstatus.js:76
(Diff revision 7)
> +      // We are expecting a crash, and if we don't crash, the test hangs
> +      // and eventually times out. This is slow and confusing. Instead
> +      // crash for sure and allow the single assert to mark the test failed.
> +      let zero = new ctypes.intptr_t(8);
> +      let badptr = ctypes.cast(zero, ctypes.PointerType(ctypes.int32_t));
> +      badptr.contents

I worry that this could cause the content process to go down before the Assert.ok(false makes it to the parent.

In that event, we'd lose the most important information; that CFG is somehow gone.

Instead of crashing here, we could have line 84 change to something like:

```js

// If we don't crash within 5 seconds, give up.
let timeout = new Promise(resolve => setTimeout(resolve, 5000, [null]));

let [subject, /* , data */] = yield Promise.race([timeout, contentProcessGone]);

if (!subject) {
  // We timed out, or otherwise didn't crash properly.
  Assert.ok(false, "...");
} else {
  // We crashed properly, clean up...
}
```
Attachment #8840087 - Flags: review?(mconley) → review-
(In reply to Tom Ritter [:tjr] from comment #55)
> So after talking with ted, he reminded me that nsDebugImpl.cpp lives in
> xul.dll. We aren't enabling this on xul.dll (yet) because it increases code
> size a lot (see above). So until it's enabled in xul.dll I can't move the
> test there.  But I will be happy to move it when we can get sign-off on
> enabling it on xul.dll. 
> 
> Aside from that, ted said that he believes that the test would need to be a
> browser-chrome test instead of a mochitest-plain, because it's going to
> crash the content process. He didn't think mochitest-plain gives that sort
> of power.

Okay.

> I spoke to soon, I could keep the test in mozglue but call it from
> nsDebugImpl.cpp - which I will work on in
> https://bugzilla.mozilla.org/show_bug.cgi?id=1345985

Okay. :)
Flags: needinfo?(mconley)
Attachment #8840087 - Flags: review?(mconley)
Fixed mconley's comment about the assert and dmajor's comment about the pragma.
Keywords: leave-open
Comment on attachment 8840087 [details]
Bug 1235982 Add CFG to firefox.exe and mozglue, and a mochitest to ensure a crash

https://reviewboard.mozilla.org/r/114624/#review122568

If you confirm that this indeed hits the timeout and fails properly if CFG isn't available, this looks good to me! Thanks!
Attachment #8840087 - Flags: review?(mconley) → review+
(In reply to Mike Conley (:mconley) (Catching up on reviews and needinfos) from comment #65)
> If you confirm that this indeed hits the timeout and fails properly if CFG
> isn't available, this looks good to me! Thanks!

I can confirm - if you ignore #1342564 which, until fixed, causes the test to always pass no matter what.

Can we merge to inbound?
(In reply to Tom Ritter [:tjr] from comment #66)
> (In reply to Mike Conley (:mconley) (Catching up on reviews and needinfos)
> from comment #65)
> > If you confirm that this indeed hits the timeout and fails properly if CFG
> > isn't available, this looks good to me! Thanks!
> 
> I can confirm - if you ignore #1342564 which, until fixed, causes the test
> to always pass no matter what.
> 
> Can we merge to inbound?

I've queued your patch for autoland.
Pushed by mconley@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/5bbf6972b904
Add CFG to firefox.exe and mozglue, and a mochitest to ensure a crash r=glandium,mconley
Backout by kwierso@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/61aaa7f8adf3
Backed out changeset 5bbf6972b904 for landing prematurely a=backout
For everyone following at home: this is now blocked on being able to track potential breakage from this. This requires reviewing Windows Error Reporting reports.  We have access to these in theory, but in practice the reports seem to be missing data for new Firefox versions. Additionally there are concerns about how we are able to ingest/share/make this data available. So we're talking about these two (separate) issues with Legal and basically anyone who can get us in touch with a real person at WER who can help us.
Attachment #8742329 - Flags: review?(mh+mozilla)
Depends on: 1361058
No longer depends on: 1342564
(In reply to Tom Ritter [:tjr] from comment #70)
> For everyone following at home: this is now blocked on being able to track
> potential breakage from this. This requires reviewing Windows Error
> Reporting reports.  We have access to these in theory, but in practice the
> reports seem to be missing data for new Firefox versions. Additionally there
> are concerns about how we are able to ingest/share/make this data available.
> So we're talking about these two (separate) issues with Legal and basically
> anyone who can get us in touch with a real person at WER who can help us.

We have this set up now albeit manually. I'm curious why we felt we needed WER set up for this. Regardless, can we move this bug along, maybe in 58?
Flags: needinfo?(tom)
Assignee: ted → nobody
Priority: -- → P2
(In reply to Jim Mathies [:jimm] from comment #71)
> (In reply to Tom Ritter [:tjr] from comment #70)
> > For everyone following at home: this is now blocked on being able to track
> > potential breakage from this. This requires reviewing Windows Error
> > Reporting reports.  We have access to these in theory, but in practice the
> > reports seem to be missing data for new Firefox versions. Additionally there
> > are concerns about how we are able to ingest/share/make this data available.
> > So we're talking about these two (separate) issues with Legal and basically
> > anyone who can get us in touch with a real person at WER who can help us.
> 
> We have this set up now albeit manually. I'm curious why we felt we needed
> WER set up for this. Regardless, can we move this bug along, maybe in 58?

I don't think so. The manual process we have is not being performed on the nightly builds (or even beta or 64 bit) so we would not have any chance of seeing crashes until release. Similarly, there is not any guarentee we would see any crashes in the current version of MSFT's crash reports to us.

https://bugzilla.mozilla.org/show_bug.cgi?id=1361058#c22 contains more info about the upcoming SysDev portal which it seems like we should wait for before trying this. (Unless we want to move it forward with no safety net except manual user reports.)
Flags: needinfo?(tom)
> I don't think so. The manual process we have is not being performed on the
> nightly builds (or even beta or 64 bit) so we would not have any chance of
> seeing crashes until release. Similarly, there is not any guarentee we would
> see any crashes in the current version of MSFT's crash reports to us.

Hey Tracy, what's the status on getting all this set up? I'm interested in getting manual updates first.
Flags: needinfo?(twalker)
jimm, see:

> https://bugzilla.mozilla.org/show_bug.cgi?id=1361058#c22 contains more info
> about the upcoming SysDev portal which it seems like we should wait for
> before trying this. (Unless we want to move it forward with no safety net
> except manual user reports.)

It looks like it will be coming out this Fall.

Tom, is there a beta version we could tap into and try to get numbers on our 57beta?
Flags: needinfo?(twalker) → needinfo?(tom)
(In reply to Tracy Walker [:tracy] from comment #74)
> jimm, see:
> 
> > https://bugzilla.mozilla.org/show_bug.cgi?id=1361058#c22 contains more info
> > about the upcoming SysDev portal which it seems like we should wait for
> > before trying this. (Unless we want to move it forward with no safety net
> > except manual user reports.)
> 
> It looks like it will be coming out this Fall.
> 
> Tom, is there a beta version we could tap into and try to get numbers on our
> 57beta?

There is, but I've asked Microsoft about it multiple times and have been ignored, so it doesn't seem likely.
Flags: needinfo?(tom)
We recently landed JOB_OBJECT_LIMIT_DIE_ON_UNHANDLED_EXCEPTION which sets SEM_NOGPFAULTERRORBOX on the content process which might cause problems. Need to test the patch now and see how it behaves. (And maybe test it on an old version too - I don't remember the dialog popping on the content process so long ago.)
Random note: to reduce bitmap size we would need to reduce the number of indirect call targets. AKA: add 'final' keywords to more classes.
Depends on: 1342564
Okay, With Alex's help in Bug 1342564 I have a try run that passes the cfg test, but causes what looks like 3 or 4 unique crashes across 6 tests.


Try run without CFG: https://treeherder.mozilla.org/#/jobs?repo=try&revision=9bf9bf02f366b1f1fc17a3c16bc868e7e2ad0584&selectedJob=148920348
Try run with CFG: https://treeherder.mozilla.org/#/jobs?repo=try&revision=28fe0fe8a7187c34c64e51aa1af6f64e0b8fe568&selectedJob=149017567

Crash Type 1:

toolkit/components/aboutmemory/tests/test_memoryReporters.xul 
	application crashed [@ RtlEnterCriticalSection + 0xd]
	I could preoduce something similar locally

toolkit/components/aboutmemory/tests/test_aboutmemory5.xul
	appears to eb the same as above.
	repdocued locally and got a stacktrace
	log in attached file


Crash Type 2:

tests/reftest/tests/dom/media/test/crashtests/1411322.html
	application crashed [@ IsBadReadPtr + 0x5d]
tests/reftest/tests/layout/style/crashtests/1401692.html
	application crashed [@ IsBadReadPtr + 0x5d]

	./mach test doesn't know how to run crashtests, so I couldn't reproduce locally
	log attached


Crash Type 3: (Really just a test failure, browser doesn't crash)

image/test/mochitest/test_bug601470.html
	Error: (msgtype=0x150090,name=PBrowser::Msg_UpdateNativeWindowHandle) Channel error: cannot send/recv
	Error: (msgtype=0x150084,name=PBrowser::Msg_Destroy) Channel error: cannot send/recv
	Also could reproduce locally
	log below


Crash Type 4 (??)

dom/media/mediasource/test/test_MediaSource_memory_reporting.html
	Error: (msgtype=0x150090,name=PBrowser::Msg_UpdateNativeWindowHandle) Channel error: cannot send/recv
	Error: (msgtype=0x150084,name=PBrowser::Msg_Destroy) Channel error: cannot send/recv

	This one seems similar to the above one; except it both time-outs and crashes, log attached



======================================================

22:18:00     INFO -  1923 INFO None1924 INFO TEST-START | image/test/mochitest/test_bug601470.html
22:18:00     INFO -  GECKO(4768) | ++DOMWINDOW == 37 (000001F676C0A000) [pid = 3896] [serial = 119] [outer = 000001F6724B8A40]
22:18:00     INFO -  GECKO(4768) | ###!!! [Parent][MessageChannel] Error: (msgtype=0x150090,name=PBrowser::Msg_UpdateNativeWindowHandle) Channel error: cannot send/recv
22:18:00     INFO -  GECKO(4768) | ###!!! [Parent][MessageChannel] Error: (msgtype=0x150084,name=PBrowser::Msg_Destroy) Channel error: cannot send/recv
22:18:00     INFO -  GECKO(4768) | ++DOCSHELL 000001854E557000 == 4 [pid = 4768] [id = {ea761e0f-9abb-432a-9972-2308a3834526}]
22:18:00     INFO -  GECKO(4768) | ++DOMWINDOW == 9 (000001854D7E5470) [pid = 4768] [serial = 18] [outer = 0000000000000000]
22:18:00     INFO -  GECKO(4768) | ++DOMWINDOW == 10 (000001854E1A9400) [pid = 4768] [serial = 19] [outer = 000001854D7E5470]
22:18:00     INFO -  GECKO(4768) | ++DOMWINDOW == 11 (000001854ED34400) [pid = 4768] [serial = 20] [outer = 000001854D7E5470]
22:18:13     INFO -  GECKO(4768) | --DOMWINDOW == 10 (000001854E1A9400) [pid = 4768] [serial = 19] [outer = 0000000000000000] [url = about:blank]
22:19:54     INFO -  GECKO(4768) | [Parent 4768, Lazy Idle] WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80520012: file z:/build/build/src/widget/windows/WinUtils.cpp, line 1483
22:21:50     INFO -  [3912, Main Thread] WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80004005: file z:/build/build/src/toolkit/components/places/Database.cpp, line 743
22:21:50     INFO -  [3912, Main Thread] WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80004005: file z:/build/build/src/toolkit/components/places/Database.cpp, line 581
22:21:50     INFO -  [3912, Main Thread] WARNING: Unable to get a connection to vacuum database: file z:/build/build/src/storage/VacuumManager.cpp, line 139
22:21:50     INFO -  [3912, Main Thread] WARNING: 'NS_FAILED(rv)', file z:/build/build/src/dom/quota/ActorsParent.cpp, line 2664
22:21:50     INFO -  [3912, Main Thread] WARNING: 'NS_FAILED(rv)', file z:/build/build/src/dom/quota/ActorsParent.cpp, line 2827
22:21:50     INFO -  [3912, IPDL Background] WARNING: '!quotaManager', file z:/build/build/src/dom/quota/ActorsParent.cpp, line 6462
22:26:04     INFO -  Buffered messages finished
22:26:04  WARNING -  TEST-UNEXPECTED-TIMEOUT | image/test/mochitest/test_bug601470.html | application timed out after 370 seconds with no output

======================================================
See Also: → 1403707
I debugged the test failures in comment 81, and they appear to stem from a common cause: With CFG, ntdll appears to maintain some "protected" heaps (for what purpose, I don't know) that live in read-only pages. These heaps are included in the list returned by GetProcessHeaps. Because they are read-only, HeapLock's write operations will crash.

With CFG:
0:000> dw ntdll!RtlpNumberOfProtectedHeaps L1
00007ffa`29020028  0001

Without CFG:
0:000> dw ntdll!RtlpNumberOfProtectedHeaps L1
00007ffa`29020028  0000

Back in the CFG process - here are the two heap lists in ntdll:

0:000> dqs ntdll!RtlpProcessHeapsListBuffer L5
00007ffa`2901db20  0000026b`034d0000
00007ffa`2901db28  0000026b`013a0000
00007ffa`2901db30  0000026b`037a0000
00007ffa`2901db38  0000026b`05200000
00007ffa`2901db40  00000000`00000000
0:000> dqs ntdll!RtlpProtectedheapsListBuffer L2
00007ffa`2901db00  0000026b`034a0000
00007ffa`2901db08  00000000`00000000

GetProcessHeaps combines these two lists into the array we pass:

0:000> dqs 0x26b18d88b20 L5
0000026b`18d88b20  0000026b`034d0000
0000026b`18d88b28  0000026b`013a0000
0000026b`18d88b30  0000026b`037a0000
0000026b`18d88b38  0000026b`05200000
0000026b`18d88b40  0000026b`034a0000

I hacked around this by ignoring read-only heaps in nsMemoryReporterManager.cpp, and the crashes went away.
Attached patch Workaround for protected heaps (obsolete) — Splinter Review
tjr can you give this a try?
Attachment #8937840 - Flags: feedback?(tom)
Should they be ignored, or should we skip locking them? Presumably whatever concurrent funny business locking is protecting against is inapplicable on a read-only heap.
Protected heaps seem to be an internal implementation detail, and given the lack of any documentation that I can find, I would be very wary of touching them. For all I know, they might become un-protected at some point, in which case walking them without locking would be a bad idea. Also, WinDbg's `!heap` doesn't know what to do with one of those pointers, so maybe they are in some nonstandard format.

I checked that in the scenario above where we had 5 heaps, the non-CFG process only had 4. So this mystery heap is an extra one, rather than a conversion of one of our existing heaps (which are presumably the ones we care most about).
I'm pretty sure the separate heap is where the CFG infrastructure keeps it's sets of valid call targets, so it makes sense that it'd be a new one.

https://bugs.chromium.org/p/chromium/issues/detail?id=665516 seems to indicate that the Chrome folks had the same issue, and came to the same conclusion as you regarding the safety of mucking with these things.

Nice debugging!
Ah, so Chrome replaced GetProcessHeaps (plural) with GetProcessHeap (singular, the main heap). That certainly makes for less complex code!

Maybe we should do the same, given that this function's comment says they are only "10s of KiBs"

// Windows can have multiple separate heaps. During testing there were multiple
// heaps present but the non-default ones had sizes no more than a few 10s of
// KiBs. So we combine their sizes into a single measurement.
I'm not a Windows expert, but my understanding is that the truth of statement is a function of how different DLLs are doing allocations (or maybe a function of if you've got multiple CRTs loaded!) So that'd only be applicable if our build/runtime is similar to theirs :-)
I was referring to the comment in nsMemoryReporterManager, which according to Bugzilla is my own fault! :-)
https://bugzilla.mozilla.org/show_bug.cgi?id=1194061#c4
I had completely forgotten about that bug.
Comment on attachment 8937840 [details] [diff] [review]
Workaround for protected heaps

Thanks :dmajor, that fixed it:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=afb1e337f0d9418ef8abe2481efb28eb66b5e00d&selectedJob=153622062

I believe all tests can be green now, I just need to:
1. Make the tests not give a false negative on Windows 7 (CFG isn't available there)
2. Investigate Comment 76
Attachment #8937840 - Flags: feedback?(tom) → feedback+
Attachment #8933770 - Attachment is obsolete: true
Attachment #8933772 - Attachment is obsolete: true
Attachment #8933773 - Attachment is obsolete: true
Attachment #8742329 - Attachment is obsolete: true
Attachment #8939839 - Flags: review?(nfroyd)
Attachment #8937840 - Attachment is obsolete: true
Comment on attachment 8939839 [details]
Bug 1235982: Work around GetProcessHeaps giving us some read-only heaps with CFG.

https://reviewboard.mozilla.org/r/210150/#review215834
Attachment #8939839 - Flags: review?(nfroyd) → review+
Comment on attachment 8840087 [details]
Bug 1235982 Add CFG to firefox.exe and mozglue, and a mochitest to ensure a crash

https://reviewboard.mozilla.org/r/114624/#review215860

::: mozglue/build/WindowsCFGStatus.h:23
(Diff revision 13)
> + * Tests for Microsoft's Control Flow Guard compiler security feature.
> + * This function will crash if CFG is enabled.
> + * @returns false if CFG is not enabled. Crashes if CFG is enabled.
> + * It will never return true.
> + */
> +MFBT_API bool CFG_DisabledOrCrash();

There is a dependency on the calling convention here from the js file using `ctypes.default_abi`. I'd feel better if there was a comment saying to update the js file if this function's calling convention ever changes.

::: toolkit/xre/test/browser.ini:8
(Diff revision 13)
>  [browser_checkdllblockliststate.js]
>  skip-if = os != "win"
> +
> +[browser_checkcfgstatus.js]
> +# CFG is only supported on Windows 10+, only run it there
> +skip-if = os != "win" || os_version == "5.1" || os_version == "5.2" || os_version == "6.1" || os_version == "6.3"

Nit -- No need to test for 5.x here (bug 1336712 is trying to remove such tests)

::: toolkit/xre/test/browser_checkcfgstatus.js:48
(Diff revision 13)
> +      // Until 1342564 is fixed, we need to disable this call or it will cause False Positives
> +      privateNoteIntentionalCrash();
> +
> +      Cu.import("resource://gre/modules/ctypes.jsm");
> +      let mozglue = ctypes.open("mozglue.dll");
> +      let CFG_DisabledOrCrash = mozglue.declare("CFG_DisabledOrCrash", ctypes.default_abi, ctypes.bool);

What happens if the lookup of `CFG_DisabledOrCrash` fails (e.g. someone renames it, or accidentally drops the dllexport tag)?

Does it throw? Return null? Something else? Mostly I'm wondering: is the result distinguishable from the "crash because we meant to" case?

I wonder if `Assert.ok(CFG_DisabledOrCrash)` would cover this... but if the lookup throws then that doesn't help.
Depends on: 1428137
(In reply to David Major [:dmajor] from comment #97)
> ::: toolkit/xre/test/browser_checkcfgstatus.js:48
> (Diff revision 13)
> > +      // Until 1342564 is fixed, we need to disable this call or it will cause False Positives
> > +      privateNoteIntentionalCrash();
> > +
> > +      Cu.import("resource://gre/modules/ctypes.jsm");
> > +      let mozglue = ctypes.open("mozglue.dll");
> > +      let CFG_DisabledOrCrash = mozglue.declare("CFG_DisabledOrCrash", ctypes.default_abi, ctypes.bool);
> 
> What happens if the lookup of `CFG_DisabledOrCrash` fails (e.g. someone
> renames it, or accidentally drops the dllexport tag)?
> 
> Does it throw? Return null? Something else? Mostly I'm wondering: is the
> result distinguishable from the "crash because we meant to" case?
> 
> I wonder if `Assert.ok(CFG_DisabledOrCrash)` would cover this... but if the
> lookup throws then that doesn't help.

It fails with:

> 22 INFO TEST-UNEXPECTED-FAIL | toolkit/xre/test/browser_checkcfgstatus.js | A promise chain failed to handle a rejection: Error: couldn't find function symbol in library - stack: JS frame :: resource://gre/modules/Promise.jsm -> resource://gre/modules/Promise-backend.js :: register :: line 199

That seems okay to me.

I addressed the other two.
See Also: → 1433171
Okay, I dug into SEM_NOGPFAULTERRORBOX.

First I changed all JOB_LOCKDOWN's to JOB_RESTRICTED so we don't set that. Then I used the Browser Content Toolbox to crash a child tab with

> Components.utils.import("resource://gre/modules/ctypes.jsm");
> let mozglue = ctypes.open("mozglue.dll");
> let CFG_DisabledOrCrash = mozglue.declare("CFG_DisabledOrCrash", ctypes.default_abi, ctypes.bool);
> CFG_DisabledOrCrash();

This did not pop a Windows Error Reporting box. 

In the Event Viewer, I can see my crash, there's one Error entry for the crash and one Information entry relating to Windows Error Reporting.

> Fault bucket 1767808267141311314, type 5
> Event Name: BEX
> Response: Not available
> Cab Id: 0
> 
> Problem signature:
> P1: firefox.exe
> P2: 60.0.0.6599
> P3: 5a6a3377
> P4: ntdll.dll
> P5: 10.0.15063.608
> P6: 802f667e
> P7: 00086170
> P8: c0000409
> P9: 0000000a
> P10: 
> 
> Attached files:
> \\?\C:\ProgramData\Microsoft\Windows\WER\Temp\WER7B1F.tmp.dmp
> \\?\C:\ProgramData\Microsoft\Windows\WER\Temp\WER7BFA.tmp.WERInternalMetadata.xml
> \\?\C:\ProgramData\Microsoft\Windows\WER\Temp\WER7BF9.tmp.csv
> \\?\C:\ProgramData\Microsoft\Windows\WER\Temp\WER7C0A.tmp.txt
> 
> These files may be available here:
> C:\ProgramData\Microsoft\Windows\WER\ReportArchive\AppCrash_firefox.exe_5d6f15fe6fdf16f06ab02a7c2fd63eee54bb4354_19d5be90_6c3e7ff1
> 
> Analysis symbol: 
> Rechecking for solution: 0
> Report Id: 94c2bd3b-6225-42e6-8631-cce237267f10
> Report Status: 268435456
> Hashed bucket: c1e3dd27c376882f48888440b1f4eb52


Next, I built it WITH JOB_LOCKDOWN, and crashed it the same way. I see the same entries in the Event Viewer. 

> Fault bucket 1623672248898971858, type 5
> Event Name: BEX
> Response: Not available
> Cab Id: 0
> 
> Problem signature:
> P1: firefox.exe
> P2: 60.0.0.6599
> P3: 5a6a371b
> P4: ntdll.dll
> P5: 10.0.15063.608
> P6: 802f667e
> P7: 00086170
> P8: c0000409
> P9: 0000000a
> P10: 
> 
> Attached files:
> \\?\C:\ProgramData\Microsoft\Windows\WER\Temp\WER6BD5.tmp.dmp
> \\?\C:\ProgramData\Microsoft\Windows\WER\Temp\WER6C91.tmp.WERInternalMetadata.xml
> \\?\C:\ProgramData\Microsoft\Windows\WER\Temp\WER6CA0.tmp.csv
> \\?\C:\ProgramData\Microsoft\Windows\WER\Temp\WER6CB1.tmp.txt
> 
> These files may be available here:
> C:\ProgramData\Microsoft\Windows\WER\ReportArchive\AppCrash_firefox.exe_5796c88a70dc2a771487e9525d719b534f7fc9b4_4e8ee341_7ee070f5
> 
> Analysis symbol: 
> Rechecking for solution: 0
> Report Id: 588e35ea-8c2b-4075-85b2-0b5a93dbb668
> Report Status: 268435456
> Hashed bucket: eed880e7eb6cbac6c688714ecb4868d2

If I go into 'Problem Reports' it lists both crashes and indicates that both reports were sent.


A close read on the documentation from Microsoft says it does not SHOW the Windows Error Reporting box. Comments on the internet sometimes repeat this and othertimes say it disables Windows Error Reporting, but my testing indicates that is not correct. I'd love to get some sort of official confirmation from Microsoft but I believe that SEM_NOGPFAULTERRORBOX still keeps Windows Error Reporting 'working' (in the background).





I'll need to rebase the patches, and I'll do a final try and perf run to make sure nothing's gone crazy. But I think at this point we can land this in Nightly? I'm going to flag a bunch a few people for signoff.
Flags: needinfo?(mconley)
Flags: needinfo?(jmathies)
Assignee: nobody → tom
Priority: P2 → P1
> A close read on the documentation from Microsoft says it does not SHOW the
> Windows Error Reporting box. Comments on the internet sometimes repeat this
> and othertimes say it disables Windows Error Reporting, but my testing
> indicates that is not correct. I'd love to get some sort of official
> confirmation from Microsoft but I believe that SEM_NOGPFAULTERRORBOX still
> keeps Windows Error Reporting 'working' (in the background).

Seems like a good question to post to the internal ms discussion list. Do you have access to that?
Flags: needinfo?(jmathies)
(In reply to Tom Ritter [:tjr] from comment #103)
> I'll need to rebase the patches, and I'll do a final try and perf run to
> make sure nothing's gone crazy. But I think at this point we can land this
> in Nightly? I'm going to flag a bunch a few people for signoff.

I appreciate the vote of confidence, but this is definitely outside of my area of expertise. :) If you're looking for a second opinion after jimm, dmajor is a good choice for stuff like this, imo.
Flags: needinfo?(mconley)
> I'll need to rebase the patches, and I'll do a final try and perf run to
> make sure nothing's gone crazy. But I think at this point we can land this
> in Nightly? I'm going to flag a bunch a few people for signoff.

If there are no problems with codesize or perf results then I don't see any reason not to proceed. I guess the only thing is to have someone commit to keeping a close eye on WER results for a few days in case something surprising happens.
Alright, let's try and land this.
Pushed by ryanvm@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/2d97a25a89a5
Add CFG to firefox.exe and mozglue, and a mochitest to ensure a crash r=glandium,mconley
https://hg.mozilla.org/integration/autoland/rev/1b20c4624424
Work around GetProcessHeaps giving us some read-only heaps with CFG. r=froydnj
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/2d97a25a89a5
https://hg.mozilla.org/mozilla-central/rev/1b20c4624424
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla60
Depends on: 1435009
To put this data somewhere, a CFG failure looks like this


> Unhandled exception at 0x77C46170 (ntdll.dll) in firefox.exe: RangeChecks instrumentation code detected an out of range array access.
This bug may have contributed to a sudden change in the Telemetry probe MEMORY_VSIZE[1] which seems to have occurred in Nightly 20180201[2][3].

There was a sudden appearance of dominating samples with the value 0, only for Windows[4]. I really have no idea how Windows would report a virtual memory size of 0 all of a sudden, but this seems like a Windows-only memory-adjacent change..

Is this intentional? Is this expected?

[1]: http://alerts.telemetry.mozilla.org/index.html#/detectors/1/metrics/824/alerts/?from=2018-02-01&to=2018-02-01
[2]: https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=3804441e575c9f46fcb03894de3c780eeae7197f&tochange=ee7f64d0764961e9663c681efc8d2b3759af79d6
[3]: https://mzl.la/2BLgEyE
[4]: https://mzl.la/2BNv29M
Flags: needinfo?(tom)
Blocks: 1436914
Flags: needinfo?(tom)
Blocks: cfg
Product: Core → Firefox Build System
(In reply to Chris H-C :chutten from comment #115)
> This bug may have contributed to a sudden change in the Telemetry probe
> MEMORY_VSIZE[1] which seems to have occurred in Nightly 20180201[2][3].
> 
> There was a sudden appearance of dominating samples with the value 0, only
> for Windows[4]. I really have no idea how Windows would report a virtual
> memory size of 0 all of a sudden, but this seems like a Windows-only
> memory-adjacent change..

I'm getting a >2TB vsize on windows. Maybe some counter overflows?
It is a minor annoyance for me because it obscures other virtual memory leaks when monitoring FF processes. I guess I should set 2TB as baseline.
See bug 1436914 for discussion of the issue. We can't fix the vsize number without disabling CFG, which we're not going to do.
You need to log in before you can comment on or make changes to this bug.