Open Bug 1526052 Opened 5 years ago Updated 2 years ago

frame pointer-based unwinding for aarch64 is not so great

Categories

(Toolkit :: Crash Reporting, defect, P3)

ARM64
Windows
defect

Tracking

()

People

(Reporter: froydnj, Unassigned)

References

(Blocks 1 open bug)

Details

The current stackwalker for aarch64 on windows does frame-pointer based unwinding. This kind of unwinding is currently busted because we're not actually compiling everything with frame pointers (bug 1525630), but even once that gets fixed, there are still bugs.

Quick terminology note: fp/lr/sp are the frame pointer, link register, and stack pointer, respectively, and are aliases for the actual hardware registers x29/x30/x31.

Your typical stack frame with frame pointers on aarch64 is going to look like:

caller's sp ---> +----------------------------------+
                 | caller's frame pointer (8 bytes) |
                 +----------------------------------+
                 |     return address (8 bytes)     |
         fp ---> +----------------------------------+
                 |                                  |
                 |     ........................     |
                 |                                  |
         sp ---> +----------------------------------+

The return address slot in the above diagram is going to be the value of lr on entry to the function.

The current unwinding code in breakpad lives here:

https://chromium.googlesource.com/breakpad/breakpad/+/master/src/processor/stackwalker_arm64.cc#189
https://searchfox.org/mozilla-central/rev/490ab7f9b84570573a49d7fa018673ce0d5ddf22/toolkit/crashreporter/google-breakpad/src/processor/stackwalker_arm64.cc#189-218

This currently matches what we have in m-c and what socorro is using to process crash dumps.

This bit parses out the two words for the return address and the previous frame pointer:

https://searchfox.org/mozilla-central/rev/490ab7f9b84570573a49d7fa018673ce0d5ddf22/toolkit/crashreporter/google-breakpad/src/processor/stackwalker_arm64.cc#191-203

That part is fine, but the dodgy part comes in how we're going to fill in the previous frame:

https://searchfox.org/mozilla-central/rev/490ab7f9b84570573a49d7fa018673ce0d5ddf22/toolkit/crashreporter/google-breakpad/src/processor/stackwalker_arm64.cc#214-218

The previous fp and sp are fine, those make sense. However, the assignment for pc in the previous frame is (mostly) incorrect: we should be using the return address that we saved to the stack. Instead we're using lr, which means that if lr is ever modified by a function, i.e.:

  1. it decides to call another function, modifying lr; or
  2. it decides to use lr for some other purpose (temporary values)

we're going to compute a completely bogus pc for our previous frame.

The code in breakpad is more correct for certain kinds of functions, e.g. most crashes will feature a bunch of threads like:

Thread 67
0 ntdll.dll!NtWaitForSingleObject + 0x4

and at that point, it's guaranteed that we can't have set up fp for our new function, so fp actually points to the frame for the caller of NtWaitForSingleObject, and if we used the recovered lr of the frame (as we should be doing), rather than the value of lr (as we're currently doing), we'd skip a frame. But I think that's preferable to unwinding off into the weeds if you get a bogus lr.

I can't tell whether this makes any difference on our current crashes, because our current crashes don't have frame pointers and I haven't been able to make a build with frame pointers give me a minidump that I can twiddle with. But I'm reasonable sure it's going to make a difference on crashes we get in the future.

For comparison's sake I built a little Breakpad test program I have for aarch64-linux and ran it under qemu to generate a minidump. I tarred up the binary, symbols, and minidump here:
https://drive.google.com/file/d/1eqxS5wk4ivksqcDAxEw2ho6RTNqjbjXZ/view?usp=sharing

minidump_stackwalk seems to be able to use frame pointers to unwind if I don't give it the symbols to use, although it loses the last frame, and the frame pointer values don't match up with the ones it recovers when unwinding with symbols.

Without symbols I get:

Thread 0 (crashed)
 0  test_crash + 0x23f0
     x0 = 0x7ffffffff0dea000    x1 = 0x0000000000000001
     x2 = 0x0000000000000000    x3 = 0x0000000000000000
     x4 = 0x0000aaaab2980408    x5 = 0x0000000000000000
     x6 = 0x0000000000000001    x7 = 0x0000aaaab9736010
     x8 = 0x0000000000000000    x9 = 0x0000ffffa2c0b400
    x10 = 0x0000000000000040   x11 = 0x0000000000000000
    x12 = 0x0000000000000002   x13 = 0x0000000000000000
    x14 = 0x0000000000000000   x15 = 0x0000000000000001
    x16 = 0x0000aaaab297fe20   x17 = 0x0000ffffa2bc3558
    x18 = 0x000000000000000f   x19 = 0x0000aaaab296a5f8
    x20 = 0x0000000000000000   x21 = 0x0000aaaab295a2ac
    x22 = 0x0000000000000000   x23 = 0x0000000000000000
    x24 = 0x0000000000000000   x25 = 0x0000000000000000
    x26 = 0x0000000000000000   x27 = 0x0000000000000000
    x28 = 0x0000000000000000    fp = 0x0000ffffca04b590
     lr = 0x0000aaaab295a4c4    sp = 0x0000ffffca04b580
     pc = 0x0000aaaab295a3f0
    Found by: given as instruction pointer in context
 1  test_crash + 0x24c0
     fp = 0x0000ffffca04b750    lr = 0x0000ffffa28aed24
     sp = 0x0000ffffca04b5a0    pc = 0x0000aaaab295a4c4
    Found by: previous frame's frame pointer
 2  libc-2.28.so + 0x20d20
     fp = 0x0000000000000000    lr = 0x0000aaaab295a2e0
     sp = 0x0000ffffca04b760    pc = 0x0000ffffa28aed24
    Found by: previous frame's frame pointer

With symbols I get:

Thread 0 (crashed)
 0  test_crash!do_crash(int volatile*) + 0x10
     x0 = 0x7ffffffff0dea000    x1 = 0x0000000000000001
     x2 = 0x0000000000000000    x3 = 0x0000000000000000
     x4 = 0x0000aaaab2980408    x5 = 0x0000000000000000
     x6 = 0x0000000000000001    x7 = 0x0000aaaab9736010
     x8 = 0x0000000000000000    x9 = 0x0000ffffa2c0b400
    x10 = 0x0000000000000040   x11 = 0x0000000000000000
    x12 = 0x0000000000000002   x13 = 0x0000000000000000
    x14 = 0x0000000000000000   x15 = 0x0000000000000001
    x16 = 0x0000aaaab297fe20   x17 = 0x0000ffffa2bc3558
    x18 = 0x000000000000000f   x19 = 0x0000aaaab296a5f8
    x20 = 0x0000000000000000   x21 = 0x0000aaaab295a2ac
    x22 = 0x0000000000000000   x23 = 0x0000000000000000
    x24 = 0x0000000000000000   x25 = 0x0000000000000000
    x26 = 0x0000000000000000   x27 = 0x0000000000000000
    x28 = 0x0000000000000000    fp = 0x0000ffffca04b590
     lr = 0x0000aaaab295a4c4    sp = 0x0000ffffca04b580
     pc = 0x0000aaaab295a3f0
    Found by: given as instruction pointer in context
 1  test_crash!main + 0xc0
    x19 = 0x0000aaaab296a5f8   x20 = 0x0000000000000000
    x21 = 0x0000aaaab295a2ac   x22 = 0x0000000000000000
    x23 = 0x0000000000000000   x24 = 0x0000000000000000
    x25 = 0x0000000000000000   x26 = 0x0000000000000000
    x27 = 0x0000000000000000   x28 = 0x0000000000000000
     fp = 0x0000ffffca04b590    sp = 0x0000ffffca04b590
     pc = 0x0000aaaab295a4c4
    Found by: call frame info
 2  libc-2.28.so!__libc_start_main + 0xe0
    x19 = 0x0000aaaab296a5f8   x20 = 0x0000000000000000
    x21 = 0x0000aaaab295a2ac   x22 = 0x0000000000000000
    x23 = 0x0000000000000000   x24 = 0x0000000000000000
    x25 = 0x0000000000000000   x26 = 0x0000000000000000
    x27 = 0x0000000000000000   x28 = 0x0000000000000000
     fp = 0x0000ffffca04b750    sp = 0x0000ffffca04b750
     pc = 0x0000ffffa28aed24
    Found by: call frame info
 3  test_crash!_start + 0x30
    x19 = 0x0000000000000000   x20 = 0x0000000000000000
    x21 = 0x0000aaaab295a2ac   x22 = 0x0000000000000000
    x23 = 0x0000000000000000   x24 = 0x0000000000000000
    x25 = 0x0000000000000000   x26 = 0x0000000000000000
    x27 = 0x0000000000000000   x28 = 0x0000000000000000
     fp = 0x0000000000000000    sp = 0x0000ffffca04b890
     pc = 0x0000aaaab295a2e0
    Found by: call frame info
Priority: -- → P3
Severity: normal → S3
You need to log in before you can comment on or make changes to this bug.