Closed Bug 1642516 Opened 5 years ago Closed 4 years ago

Our stackwalkers are inconsistent about caller addresses / return addresses

Categories

(Core :: Gecko Profiler, defect, P3)

defect

Tracking

()

RESOLVED FIXED
92 Branch
Tracking Status
firefox92 --- fixed

People

(Reporter: mstange, Assigned: mstange)

Details

Attachments

(2 files)

CC'ing Jim, Jed, Julian and Jeff, who have all worked on stack walkers for longer than I have.

It looks like our assortment of stack walkers is inconsistent on how "non-leaf" addresses are treated - are those addresses "return addresses", or are they adjusted to point into the calling instruction? The crash stack walkers in breakpad all seem to do an adjustment by at least one byte. As for the profiler, it looks like currently only Lul does any adjustment (by one byte), and the other profiling stack walkers don't make any adjustments and just use the raw return addresses.
We should be consistent: either all stack walkers should do the adjustment, or none should. But which way should we standardize on?

Brief background for other readers of this bug: Stack walkers collect code addresses for functions that are currently executing on a thread. There's one frame at the top with the value from the program counter / "instruction pointer" register, and then there's a number of frames for the rest of the stack for all the calling functions. But which address is reported for those calling frames? Usually, the easiest value to get ahold of during stack walking is the return address, i.e. the address of the instruction that will be executed after the current function returns.

Example:

XRE_InitChildProcess(int, char**, XREChildData const*):
[...]
                    41a004d:  leaq -552(%rbp), %rdi
                    41a0054:  callq -58351369 <MessageLoop::Run()>  ; <-- calls MessageLoop::Run()
 return address ->  41a0059:  movq (%rbx), %rax
                    41a005c:  movq %rbx, %rdi
[...]

If we're inside MessageLoop::Run(), called by XRE_InitChildProcess(int, char**, XREChildData const*), then the immediate return address will be 41a0059. That's the address of the instruction after the call instruction - the call instruction is at 41a0054.
In this case, the breakpad stack walker actually reported the address 41a0058 for the XRE_InitChildProcess frame (see "module_offset": "0x41a0058" in the raw dump of the crash report); this address was obtained by subtracting one byte from the return address.

Why does the specific address matter, and why hasn't the inconsistency shown up in the past? When you only look up function names for an address, the specific address inside the function does not make a difference. But once you start to look up line number information and inline frame information for an address, the specific address does matter. For example, the call to MessageLoop::Run() given above is at line 740:

                  [...]
                  735 #if defined(MOZ_SANDBOX)
                  736       AddContentSandboxLevelAnnotation();
                  737 #endif
                  738
                  739       // Run the UI event loop on the main thread.
 calling line ->  740       uiMessageLoop.MessageLoop::Run();
                  741
                  742       // Allow ProcessChild to clean up after itself before going out of
                  743       // scope and being deleted
                  744       process->CleanUp();
                  745       mozilla::Omnijar::CleanUp();
                  [...]

If you were to look up the line number for 41a0059, you'd get line 744. That's not the value we want! We want to get line 740. And we get line 740 if we look up values from 41a0055 to 41a0058 (i.e. addresses within the bytes of the call instruction).
A similar issue appears if you look up inline frames for an address. Looking up the return address will frequently yield inline frames that are completely unrelated to what's on the stack during the call.
This hasn't come up in the profiler before now, because the profiler wasn't showing line numbers before. I'm currently in the process of adding support for line numbers and inline frames, so now this problem has become apparent.
Crash reports have always displayed line numbers so they always needed the adjustment.

So, it's clear that we need to do this adjustment somewhere in the profiler, but where should we do it? I think we have the following options:

  1. In Gecko, in the stackwalkers
  2. In Gecko, when producing the profile JSON
  3. In the front-end, when loading the profile and converting it to the "processed" format
  4. In the front-end, when looking up symbols

The front-end accepts profiles from more data sources than just the Gecko profiler. At the moment, those other sources all come with pre-symbolicated function names and no addresses, but in theory they could come with unsymbolicated addresses. So then the question is which convention is the norm for other profilers. It seems that at least the sample tool on macOS produces return addresses, i.e. doesn't include adjustment.

I'm leaning towards 3. If we need to do it in the front-end for some data sources anyway, then we should just do it for everyone including Gecko. And I think doing it in 3 rather than 4 is easier to implement; 4 would probably require doing the adjustment in other places in addition to during symbolication.

There's an argument for 1: The stackwalkers are architecture-specific, so they can make a more educated guess at a reasonable adjustment. For example, the breakpad arm64 and PPC unwinders subtract the right amount (4 or 8 bytes) to land exactly at the call instruction, because call instructions are fixed width on those architectures. And the one for arm subtracts 2 because odd and even addresses have different meanings and subtracting an even amount removes temptation for any part in the rest of the pipeline to read unintended meaning into the adjusted address.
However, as for the profiler, all we need is an address that's "somewhere inside" the call instruction, just so long as symbolication gives us the right information. So I think subtracting 1 byte, unilaterally, in the front-end, is totally fine.

So then my proposal for this bug is simple: We should remove the "minus one byte" adjustment from Lul, and do a quick audit of the other profiler stack walkers to make sure I didn't miss anything.
Then I will add code to the front-end that does the adjustment unilaterally during profile import.
Does that sound reasonable?

Update Nov 2021: Option 3 has now been implemented, in this bug and in firefox-devtools/profiler #3548
And there's a good write-up of this problem space at https://docs.rs/symbolic/8.5.0/symbolic/common/struct.InstructionInfo.html#background

Assignee: nobody → mstange
Status: NEW → ASSIGNED

(In reply to Markus Stange [:mstange] from comment #0)

And the one for arm subtracts 2 because odd and even addresses have different meanings and subtracting an even amount removes temptation for any part in the rest of the pipeline to read unintended meaning into the adjusted address.
However, as for the profiler, all we need is an address that's "somewhere inside" the call instruction, just so long as symbolication gives us the right information. So I think subtracting 1 byte, unilaterally, in the front-end, is totally fine.

Actually, it may be the case that symbolication may get tripped up if we subtract only 1 on arm. Here's where the - 2 in the arm breakpad code was introduced:

because when only decrementing by 1, the stack walker find that all instruction are exactly the next instruction to be executed and all line number are wrong by 1.

I will double-check this.

Severity: -- → S3
Priority: -- → P3

ARM is a little weird, because it supports a classical 32-bit RISC-ish instruction encoding, but also the 16-bit Thumb encoding, and the variable-length 16/32-bit Thumb2 instructions which are a compatible extension of Thumb1. (Thumb2 is usually the best choice if the CPU supports it, and CPUs large enough to run Gecko typically do.) The low bit of code pointers indicates whether the CPU should use Thumb mode. As a result, the return addresses (LR values) from the stack unwinder will be the address of the next instruction plus 1, so subtracting 1 is still in the wrong instruction.

This will make it so that, for return addresses, symbolication can look up
the address "returnAddress - 1" and get the correct line number (and inline
stack) for the call instruction.
Without the masking, returnAddress - 1 might still fall into the instruction
after the call instruction, giving wrong line numbers / inline stacks.

This makes all our stackwalkers in Firefox consistent with respect to return addresses:
For non-leaf frames in stacks, the code address now always points to the instruction
after the call instruction, i.e. to the instruction that will be executed once the
function returns.
For symbolication purposes, 1 byte will need to be subtracted in order to obtain correct
line number + inline stack information for the call instruction. This subtraction will
be the responsibility of the Firefox profiler front-end, not of the stackwalkers.

Depends on D121930

Pushed by mstange@themasta.com: https://hg.mozilla.org/integration/autoland/rev/64cddf423637 In the EHABI stack unwinder, make sure to report proper instruction addresses by masking of the bit that indicates ARM or Thumb mode. r=jld https://hg.mozilla.org/integration/autoland/rev/306a7f49a334 In the Lul stack walker, don't subtract 1 byte from the return address, for consistency with other stackwalkers. r=gerald
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → 92 Branch
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: