Closed Bug 1581584 Opened 6 years ago Closed 4 years ago

ASan is emitting wrong stacks for multiple MOZ_CRASH locations

Categories

(Core :: MFBT, defect)

x86_64
Linux
defect
Not set
critical

Tracking

()

VERIFIED FIXED

People

(Reporter: decoder, Unassigned)

References

()

Details

Attachments

(2 files, 1 obsolete file)

I have noticed that on my machine, when compiling with ASan (with or without UBSan) using the Clang provided by mozbuild, the compiler manages to merge multiple MOZ_CRASHlocations into one. It also managed to merge a non-MOZ_CRASH location with a MOZ_CRASH location, giving me a crash trace that indicated a null-deref inside a Length() function call on nsTArray rather than pointing to the MOZ_CRASH a few lines after.

We explicitly try to prevent this from happening as stated in the comments here [1]:

 * We use __LINE__ to prevent the compiler from folding multiple crash sites
 * together, which would make crash reports hard to understand.

This trick apparently doesn't work anymore. I've come up with a gtest that demonstrates the problem (patch attached). You can compile with ASan and then run ./mach gtest TestCrash.ASanBogusStack. I derived this from a fuzzing target in Necko and this is not necessarily minimal but I thought it is good enough to demonstrate the issue.

Note that this also affects MOZ_ASSERT. If you take a look at bug 1577574 you will see that it is about a fatal assertion being hit, but the attached ASan crash information points to a null-deref in StaticPtr.

This bug is harmful to automated testing with ASan, but it might in fact not be limited to ASan at all. We probably need to prevent the compiler from re-ordering things here properly. It is unclear to me if Windows is affected because Windows has MOZ_NoReturn which is marked as noinline. Still that does not seem like a 100% safe guarantee that it will be fine.

If the bug affects non-ASan builds, this might also influence our crash reports.

[1] https://searchfox.org/mozilla-central/rev/7ed8e2d3d1d7a1464ba42763a33fd2e60efcaedc/mfbt/Assertions.h#199

Attachment #9093079 - Attachment description: Patch to reproduce the problem → Patch to reproduce the problem, build with --disable-debug
Attached file Standalone testcase (obsolete) —

Here is a standalone testcase that does not require mozilla-central.

$ ~/.mozbuild/clang/bin/clang++ -O2 -g -fsanitize=address -o test.o test.cpp && ./test.o
If you see a crash in line 38, then the compiler optimized our crash
Calling testCrashFoo
AddressSanitizer:DEADLYSIGNAL
=================================================================
==25990==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000000 (pc 0x0000004f9175 bp 0x7ffe29059e80 sp 0x7ffe29059d40 T0)
==25990==The signal is caused by a WRITE memory access.
==25990==Hint: address points to the zero page.
    #0 0x4f9174 in main /srv/repos/mozilla-central/test.cpp:38:5
    #1 0x7f6445221b96 in __libc_start_main /build/glibc-OTsEL5/glibc-2.27/csu/../csu/libc-start.c:310
    #2 0x41af39 in _start (/srv/repos/mozilla-central/test.o+0x41af39)

AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV /srv/repos/mozilla-central/test.cpp:38:5 in main
==25990==ABORTING

This should work better for testing the different strategies we can take to avoid this.

Attached file Standalone testcase
Attachment #9093098 - Attachment is obsolete: true

Using the MOZ_NoReturn approach on Linux gives me the proper stack with MOZ_NoReturn on the top and the second frame is pointing to the correct location. This means that Windows is likely unaffected by this right now. But I don't know if this isn't just because the compiler isn't smart enough yet.

Debuggability nicety is super-important, and having to jump up a frame in a debugger any time a crash happens to reach the real crash-location is not acceptable on non-Windows. (I'm not sure it should be acceptable on Windows either, but I guess I missed that change whenever it happened.) We need an approach that expands out to crashing behavior without adding an extra frame.

I note that right now,

#  ifdef MOZ_UBSAN
#    define MOZ_CRASH_WRITE_ADDR 0x1
#  else
#    define MOZ_CRASH_WRITE_ADDR NULL
#  endif

#  ifdef __cplusplus
#    define MOZ_REALLY_CRASH(line)                                  \
      do {                                                          \
        *((volatile int*)MOZ_CRASH_WRITE_ADDR) = line; /* NOLINT */ \
        ::abort();                                                  \
      } while (false)
#  else
#    define MOZ_REALLY_CRASH(line)                                  \
      do {                                                          \
        *((volatile int*)MOZ_CRASH_WRITE_ADDR) = line; /* NOLINT */ \
        abort();                                                    \
      } while (false)
#  endif
#endif

will cast to volatile int* a number that is not a multiple of int alignment. I wonder if #define MOZ_CRASH_WRITE_ADDR sizeof(int) would make a difference here, because just casting 1 is undefined behavior even taking into account that compilers may special-case null dereferences of volatile null pointers.

(In reply to Jeff Walden [:Waldo] from comment #4)

I wonder if #define MOZ_CRASH_WRITE_ADDR sizeof(int) would make a difference here, because just casting 1 is undefined behavior even taking into account that compilers may special-case null dereferences of volatile null pointers.

I tried this and unfortunately it doesn't seem to affect the optimization.

We explicitly try to prevent this from happening as stated in the comments here [1]:

 * We use __LINE__ to prevent the compiler from folding multiple crash sites
 * together, which would make crash reports hard to understand.

This trick apparently doesn't work anymore.

Hi. I wrote that comment (and code). Note that it's within the #if defined(_MSC_VER) block. I don't think I touched the Linux code at the time; I don't know if the trick ever worked there. Also, I'll point out that we don't add extra frames on Windows. What brings down the process is the __debugbreak() at https://searchfox.org/mozilla-central/rev/7ed8e2d3d1d7a1464ba42763a33fd2e60efcaedc/mfbt/Assertions.h#223. The MOZ_NoReturn is just a backup in case you step past the break with a debugger, plus having a noinline after the break discourages the compiler from combining like codepaths.

I'm currently away from my usual machines, but if nobody else gets there before me, I'd be curious to look at the disassembly of your test cases on Linux.

This may be a problem with ASan's unwinder. If I run the standalone test under lldb or gdb, the line of the crash is correct without fprintf.

ASAN_OPTIONS=fast_unwind_on_fatal=1 doesn't make a difference.

(In reply to Jesse Schwartzentruber (:truber) from comment #7)

This may be a problem with ASan's unwinder. If I run the standalone test under lldb or gdb, the line of the crash is correct without fprintf.

ASAN_OPTIONS=fast_unwind_on_fatal=1 doesn't make a difference.

Confirmed, in GDB everything looks normal, I even set a breakpoint inside the assembler code:

   0x0000000000516fe0 <+848>:   movl   $0x26,0x0                           // This is the first MOZ_REALLY_CRASH
   0x0000000000516feb <+859>:   callq  0x4e4db0 <__asan_handle_no_return>
   0x0000000000516ff0 <+864>:   callq  0x41a0f0 <abort@plt>
   0x0000000000516ff5 <+869>:   movl   $0x2f,0x0                           // This is the second MOZ_REALLY_CRASH
   0x0000000000517000 <+880>:   callq  0x4e4db0 <__asan_handle_no_return>
   0x0000000000517005 <+885>:   callq  0x41a0f0 <abort@plt>
End of assembler dump.
(gdb) break *0x0000000000516ff5
Breakpoint 1 at 0x516ff5: file test.cpp, line 47.
(gdb) r
Starting program: /srv/repos/mozilla-central/test.o 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
If you see a crash in line 38, then the compiler optimized our crash
Calling testCrashFoo

Breakpoint 1, main () at test.cpp:47
47          MOZ_REALLY_CRASH(__LINE__); // This is where the crash actually happens
(gdb) c
Continuing.

Program received signal SIGSEGV, Segmentation fault.
main () at test.cpp:47
47          MOZ_REALLY_CRASH(__LINE__); // This is where the crash actually happens
(gdb) c
Continuing.
AddressSanitizer:DEADLYSIGNAL
=================================================================
==5803==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000000 (pc 0x000000516ff5 bp 0x7fffffffe040 sp 0x7fffffffdf00 T0)
==5803==The signal is caused by a WRITE memory access.
==5803==Hint: address points to the zero page.
     #0 0x516ff4 in main /srv/repos/mozilla-central/test.cpp:38:5
    #1 0x7ffff6a9bb96 in __libc_start_main /build/glibc-OTsEL5/glibc-2.27/csu/../csu/libc-start.c:310
    #2 0x41a3d9 in _start (/srv/repos/mozilla-central/test.o+0x41a3d9)
 
AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV /srv/repos/mozilla-central/test.cpp:38:5 in main
==5803==ABORTING
[Inferior 1 (process 5803) exited with code 01]
(gdb) 

This seems really weird, maybe an ASan bug, but doesn't look like the kind of optimization problem I was expecting.

I've tried to look a bit further into this, but I don't really see what causes this.

It certainly looks like this problem is ASan specific but the code ordering influences it. Adding the fprintf in front of the "real" MOZ_CRASH fixes the problem as well as removing the abort call in the MOZ_REALLY_CRASH macro. Both seem to produce different code that somehow cause the right stacks to be emitted. However, when the wrong stack is emitted, the signal handler is still called from the correct location, so this is not a case of crashing in the wrong location.

Summary: Clang is optimizing MOZ_CRASH on Linux → ASan is emitting wrong stacks for multiple MOZ_CRASH locations

Sounds like someone should report a bug upstream, at this point, then? As far as I can tell there's nothing we really can do here right now, certainly not without understanding exactly what upstream's error is to work around it.

I already did so, unfortunately there hasn't been a response yet. I'll try to ping the devs directly.

This should be fixed by https://llvm.org/r373529.

decoder, any interest in writing a patch to merge the fix from upstream?

Flags: needinfo?(choller)

Edit: Nevermind, I misread the question ;) Sure, I could do that if it applies cleanly. I'll leave the needinfo here since I need to finish some other stuff first but then I can look. We also used to have Clang Trunk builds that we could use for ASan instead if the patch can't be backported easily.

Flags: needinfo?(choller)
Flags: needinfo?(choller)

Hi Christian, does this issue still occur on your end or has it been fixed? can we close this issue ?

I believe we use a sufficiently recent Clang by now.

Status: NEW → RESOLVED
Closed: 4 years ago
Flags: needinfo?(choller)
Resolution: --- → FIXED

Marking this Verified based on Comment 16.

Status: RESOLVED → VERIFIED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: