Closed Bug 856899 Opened 11 years ago Closed 11 years ago

Decide how to do stack unwinding on ARM B2G

Categories

(Firefox OS Graveyard :: General, defect)

ARM
Gonk (Firefox OS)
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jld, Assigned: jld)

References

Details

(Keywords: perf, Whiteboard: [c=profiling s=2013.08.09])

Walking stacks on ARM for B2G profiling is... complicated.  There are a few different approaches:

1. The breakpad unwinder (bug 779291), which interprets DWARF unwind info.  It's already integrated into SPS and it works with the existing toolchain.  However, it might not perform well on our slower target devices; I don't know if we have data on that yet.  This also doesn't help us much if we want visibility into the kernel, or non-Gecko processes.

2. The ARM Procedure Call Standard frame pointer variant.  Which is deprecated by ARM, and (if I understand correctly) can't be implemented entirely correctly on Thumb or Thumb2.  But perf (see bug 831611 comment 52) doesn't depend on the parts that can't be approximated on Thumb2 with a small compiler change.  And this works, today, for perf and for a suitably modified NS_StackWalk.  But: it needs toolchain changes, which the upstream might have issues with because they're arguably inventing a nonstandard ABI, and any closed-source libraries (e.g., from hardware vendors) would need to be recompiled[*] to not break the frame pointer chain.  Also, hand-coded assembly is a problem; if it's inline in C/C++ then GCC will complain about register usage conflicts, but if it's in a .S file then we'll have no idea — and it might not be possible to avoid using r11.

3. The ARM Exception Handling ABI.  ARM has its own specialized EH format, apparently.  The toolchain should already support it (-fasynchronous-unwind-tables), and it's what Linux is already using to for perf stacks in the kernel (i.e., the kernel already has an unwinder for it, which is run synchronously in interrupt context if I understand the perf architecture correctly — contrast the malloc reentrancy issues with libunwind).  It might even be usable to get user stacks from kernel space.  Needs more investigation.

So far I've been working on approach #2; in constrast, Fennec (which can't easily rebuild the system libraries the way B2G can) is using #1.  But that may not be the ideal approach.


[*] Possibly not with the modified GCC; using -ffixed-r11, if it works, the code won't create frames but also won't overwrite the frame pointer.  So it will just disappear from the stack, more or less, the way JaegerMonkey output does; this means we can at least find out what part of Gecko is calling into that code, which is probably what we want to know.
Group: mozilla-corporation-confidential
Whiteboard: c=performance
Status: NEW → ASSIGNED
Whiteboard: c=performance → c=
See also bug 867721 comment 9 — currently breakpad uses too much memory for a 256MB B2G device to start the homescreen app.
Reasons why APCS frame pointers aren't what we want:

* Vendor binaries and assembly code will unavoidably break the frame chain.

* Nobody will give me a clear answer as to how to resolve bug 870639 (or any future instances of that issue which may occur).

* Needing to rebuild the world with a modified toolchain is a nontrivial requirement.

* Making IonMonkey/BC cooperate is nontrivial, because it wants the same register as a not-entirely-compatible kind of "frame pointer".  I already have changes that introduce ifdef'ed machine-dependent code into places where I'm pretty sure it doesn't belong, because no other platform we run on breaks the assumption that the old frame pointer is stored at [fp] instead of [fp, #-12].  And they mostly work but there's subtle breakage I haven't tracked down yet.  And, assuming that can be fixed, this raises questions of future maintainability.

* In general, we're going to be doing everything differently from Fennec.


Reasons why EHABI unwinding isn't what we want:

* It will probably never work for perf.  Trying to dynamically read the unwind information at sample time will encounter several problems due to being run in interrupt context: (1) copy_from_user needs to be able to fail if the memory isn't resident, which can maybe be worked around by mlocking the extab in the loader and burn a bunch of RAM we may not have, and (2) we can't safely do find_vma to find out where the ELF header is hopefully mapped, as it's protected by a sleep lock.  If we load the tables in ahead of time, then more invasive changes need to be made to store that information in a way that it can be gotten to, and I have concerns about getting them upstreamed.

* Unless a miracle occurs with respect to breakpad, it probably needs a from-scratch unwinder for SPS.  (The nice self-contained implementation contributed by ARM Ltd. to the Linux kernel is, of course, MPL-incompatible.)  Which isn't the end of the world, but compared to changes that are already written, it's more work.

* Making script cooperate with this is something that I think has been worked on but isn't particularly close to landing; I could be wrong about this.

* It's unavoidably going to be slower than frame pointers (see above about use in perf, which can otherwise comfortably run at multiple kHz on phones, which is occasionally useful).


Now that bug 863475 has landed, DWARF has the same issues as EHABI, but maybe worse.


And, for the sake of something like completeness:
 Reasons why adopting the iOS ABI isn't what we want:

* Now we have to patch the compiler *and* the kernel!  And chances are we'll break even more inline asm statements!  The only saving grace is that it's not an invasive change to IonMonkey anymore.  (And for the kernel we might be able to binary-patch it on the fly.)

* Generally, all the non-monkey-themed APCS-FP problems apply.
Depends on: 884079
Keywords: perf
It may not be quite as bad as that — :mwu found http://lwn.net/Articles/495595/ which adds support for copying the entire stack (up to a given limit) and registers into the perf buffer and doing the processing later.  This is similar to what our breakpad unwinder is doing in userspace.

The kernel side of this has landed, at least, but it will need to be backported to the older versions we're using, and some ARM-specific glue, and some tooling.

Specifically: our bandwidth for writing to storage (or streaming over USB) is limited, and the writes aren't free in terms of CPU; I suspect that we'll be better off doing the unwinding on the device in miniperf-record (which will no longer be quite so "mini") before writing.  Note that we're already snooping on the perf events there, because the recorder needs to inject synthetic records for mmaps that existed before the profiling started.  But if nothing else we could just lower the sample rate.

(Alternately, if successive samples for the same thread tend to leave most of the stack sample unchanged, it could be omitted and that might help.)

Something related that might be impractical: a tool with the stack dumps and the full DWARF info could probably recover the values of local variables, which could be useful.  Definitely needs work to glue the pieces together.
Whiteboard: c= → [c= ]
Blocks: 884079
No longer depends on: 884079
Whiteboard: [c= ] → c=profiling
Summary: Stack unwinding for ARM B2G profiling → Decide how to do stack unwinding on ARM B2G
Depends on: 895154
Depends on: 897769
Exception handling shall be used.  How to deal with jitcode is undetermined, but out of scope for this bug (bug 895159 seems to be the one for that).
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Whiteboard: c=profiling → [c=profiling s=2013.08.09]
You need to log in before you can comment on or make changes to this bug.