Closed Bug 1403668 Opened 3 years ago Closed 3 years ago

AddressSanitizer builds do not dump logs for stack overflows

Categories

(Core :: General, defect)

defect
Not set

Tracking

()

RESOLVED FIXED
mozilla58
Tracking Status
firefox58 --- fixed

People

(Reporter: tsmith, Assigned: decoder)

References

(Blocks 2 open bugs)

Details

Attachments

(2 files)

Attached file asan_verbose_log.txt
Normally on error detection "ASAN:DEADLYSIGNAL" is printed to stderr and logs are then dumped. This does not seem to be working in the "stack overflow" case. We do see that the process exits with -11 (SEGV) but no AddressSanitizer information is dumped. This does work on other application (and I think it has in the past on Firefox but I'm not 100%)

asan_verbose_log.txt shows the output running with ASAN_OPTIONS=verbosity=1

For an example bug 1403656 will reproduce the issue using m-c build:
BuildID=20170925234146
SourceStamp=e6b3498a39b94616ba36798fe0b71a3090b1b14c
Like I said in IRC, my only theory here is that somehow the stack size changes in Stylo broke this. I don't know if Stylo is involved in whatever reflow stuff happens in that bug. Maybe you could bisect that test case to figure out if a change on m-c broke this?
Tyson -- can you perform the bisect?
Flags: needinfo?(twsmith)
Pinged Al about priority.
Alright, I spent a few hours debugging this (I think bisections wouldn't have helped given the outcome) and I think I solved it, but the findings are a bit worrying:

First of all, we have several SIGSEGV handlers in the codebase, these include

  * toolkit/profile/nsProfileLock.cpp (installs all sorts of handlers for cleaning up profile locks)
  * js/src/ds/MemoryProtectionExceptionHandler.cpp (JS handler to annotate some interesting crashes)
  * js/src/wasm/WasmSignalHandlers.cpp (JS Handler for wasm/asm.js out-of-bounds checks)

This is not the full list, but these are the ones that matter. First, I found out that with "allow_user_segv_handler=0" in ASAN_OPTIONS, the problem goes away. That is a clear indicator for one of our handlers being the problem because with this option, the program just can't install any SIGSEGV handlers at all. I then managed to narrow it down to the three handlers above: With all of them disabled, I would get an ASan trace as expected, enabling only one of them broke the ASan trace again. However, *none* of the signal handlers were actually called (confirmed with good old fprintf debugging). And after a while I realized why this is the case: None of these handlers use SA_ONSTACK [1], so they run on the same stack as the main program. That won't work because we happen to be out of stack space so already the attempt to call our signal handler crashes the program. ASan sets up the alternate signal stack and uses that to catch signals, which is why ASan is still working in this situation. The fix for ASan here is to add SA_ONSTACK to all three signal handlers (patch coming in a few).

Now for the reason why this is worrying: This does not only affect ASan. Basically, I expect these three signal handlers to break *all* of our SIGSEGV handling at least on Linux, including the crash reporter, if we happen to be out of stack space. Worst case this means that we are missing crash reports on Linux (maybe Android too?) if we are out of stack space. In combination with stack clash issues (e.g. bug 909094) this might be happening more often also in general OOM situations. In the JS engine, I hit a lot of over-recursion crashes due to that particular bug. I'm not sure about other OSes as I don't know enough about their signal handling mechanisms. I know Mac uses a forwarding model as well but I don't know if by default, signal handlers are on the program's main stack. We should investigate this more closely.

Since none of the other signal handlers will run, this could also lead to locks not being cleaned up (see nsProfileLock.cpp) after such a crash.

For the final question: How did this ever work? I don't know. From what I can see, this should have never worked. Maybe previously, there was some stack space reserved somehow (by the kernel? by us?) which somehow magically made it work. In the Linux Kernel, things regarding the stack recently changed due to the Stack Clash vulnerability, maybe that regressed it.

I'm putting a needinfo on Ted and Nathan here to give feedback on this issue and to help estimate the impact of this outside of ASan. I'll also upload a patch in a few that should fix the problem.

[1] http://man7.org/linux/man-pages/man2/sigaltstack.2.html
Flags: needinfo?(twsmith)
Flags: needinfo?(ted)
Flags: needinfo?(nfroyd)
Comment on attachment 8916274 [details]
Bug 1403668 - Use SA_ONSTACK for several SIGSEGV handlers.

Tyson, could you confirm that this patch fixes the ASan build behavior? It does fix it for me locally.
Attachment #8916274 - Flags: feedback?(twsmith)
One more thing I noticed while looking at this code is that the signal handlers in nsProfileLock.cpp do not use SA_NODEFER while the other SIGSEGV signal handlers we have do. Is this intended?
(In reply to Christian Holler (:decoder) from comment #7)
> Tyson, could you confirm that this patch fixes the ASan build behavior? It
> does fix it for me locally.

Of course. I'm traveling at the moment and tomorrow is a holiday here but I will test it ASAP. Thanks for the patch!
Green on try (oranges are unrelated intermittents, OSX perma-orange is unrelated according to sheriffs): https://treeherder.mozilla.org/#/jobs?repo=try&revision=91573e68a293
Assignee: nobody → choller
Breakpad's signal handler does use sigaltstack / SA_ONSTACK:
https://dxr.mozilla.org/mozilla-central/rev/5eba13f5b3a6ad80decdd8c7b30bff5fa477844f/toolkit/crashreporter/breakpad-client/linux/handler/exception_handler.cc#134
https://dxr.mozilla.org/mozilla-central/rev/5eba13f5b3a6ad80decdd8c7b30bff5fa477844f/toolkit/crashreporter/breakpad-client/linux/handler/exception_handler.cc#299

...but it's entirely plausible that these other handlers not doing so breaks our ability to handle stack overflow. We have had bug 481781 on file for a *long* time.
Flags: needinfo?(ted)
I can believe that adding SA_ONSTACK to registering these signal handlers works; I am skeptical that we really want to do this everywhere.  We should at least do it for ASan, because we know that ASan registers an alternate stack.  Do we want to do it in general for non-ASan builds and silently depend on Breakpad's alternate signal stack?  That doesn't seem like a good thing to me; it seems like we should be setting up our own alternate stack somewhere before we register breakpad (making sure breakpad won't overwrite ours), and then setting SA_ONSTACK.

I don't know why this hasn't been a problem before...stack overflow is not *terribly* common, though, so perhaps the issue was just hidden.
Flags: needinfo?(nfroyd)
(In reply to Nathan Froyd [:froydnj] from comment #12)
> I can believe that adding SA_ONSTACK to registering these signal handlers
> works; I am skeptical that we really want to do this everywhere.  We should
> at least do it for ASan, because we know that ASan registers an alternate
> stack.  Do we want to do it in general for non-ASan builds and silently
> depend on Breakpad's alternate signal stack?  That doesn't seem like a good
> thing to me; 

The worst thing that can happen (in builds without ASan *and* crash reporter), is that the signal handlers behave as they did before this patch. The man page for SA_ONSTACK says:

> "If an alternate stack is not available, the default stack will be used."

So omitting this flag in any configurations that don't have a sigaltstack setup has no effect at all.

> it seems like we should be setting up our own alternate stack
> somewhere before we register breakpad (making sure breakpad won't overwrite
> ours), and then setting SA_ONSTACK.

I agree that we should do this, but it sounds like a major refactoring that we should not try to push through in this bug. This is a priority bug for fuzzing because it keeps hitting us in automation. With this simple patch we have the chance to fix the problem both for ASan as well as for our release builds (fixing problems with the crash reporter that we had for 9 years now). We can then still make a refactoring to fix the bug in builds that don't use either ASan or the crash reporter.

Therefore I propose to move forward with this patch and then file a follow-up bug for making a generic sigaltstack refactoring.
(In reply to Nathan Froyd [:froydnj] from comment #12)
> it seems like we should be setting up our own alternate stack
> somewhere before we register breakpad (making sure breakpad won't overwrite
> ours), and then setting SA_ONSTACK.

FWIW, it doesn't seem to matter *when* you create the alternate stack, just so long as it exists when the exception handler is called. So we could probably check if an alternate stack exists after everything else (Breakpad, ASan, Elf Loader) has had the chance to set one up, and create it if not. The following did the trick in local testing (very quick and dirty of course):

  stack_t alt = {};
  alt.ss_size = 64 * 1024;
  alt.ss_sp = mmap(nullptr, alt.ss_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANON, 0, 0);
  sigaltstack(&alt, nullptr);

So it's just a matter of figuring out where/when to do it.
We forked the Breakpad client a while ago, so we could absolutely pull the altstack stuff out into somewhere common and just call it once on startup and make everything else in Gecko use it.
Blocks: 1407075
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #15)
> We forked the Breakpad client a while ago, so we could absolutely pull the
> altstack stuff out into somewhere common and just call it once on startup
> and make everything else in Gecko use it.

I filed bug 1407075 for this.
Comment on attachment 8916274 [details]
Bug 1403668 - Use SA_ONSTACK for several SIGSEGV handlers.

https://reviewboard.mozilla.org/r/187488/#review193100

Good catch! My old "crash me now" extension no longer works, but it would be interesting to re-test crashing with a stack overflow on a nightly build with this patch.
Attachment #8916274 - Flags: review?(ted) → review+
Pushed by choller@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/0152a50835c9
Use SA_ONSTACK for several SIGSEGV handlers. r=ted
Duplicate of this bug: 481781
https://hg.mozilla.org/mozilla-central/rev/0152a50835c9
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla58
(In reply to Christian Holler (:decoder) from comment #7)
> Tyson, could you confirm that this patch fixes the ASan build behavior? It
> does fix it for me locally.

Works for me as well. Thanks everyone!
Comment on attachment 8916274 [details]
Bug 1403668 - Use SA_ONSTACK for several SIGSEGV handlers.

https://reviewboard.mozilla.org/r/187488/#review193604

It works. Thanks.
Attachment #8916274 - Flags: review+
Attachment #8916274 - Flags: feedback?(twsmith) → feedback+
You need to log in before you can comment on or make changes to this bug.