Open Bug 1529108 Opened 6 years ago Updated 1 year ago

Unwinding from libc.so on Android 32 bit fails (bad stacks during syscalls)

Categories

(Core :: Gecko Profiler, defect, P2)

defect

Tracking

()

Tracking Status
firefox67 --- affected

People

(Reporter: mstange, Unassigned)

References

Details

See this (inverted) profile: https://perfht.ml/2Xc01E0

Most samples inside system calls seem to have broken stacks. The immediate parent frame looks somewhat reasonable, but past that, unwinding fails.

This is happening on my Moto G5. I can send you the libc.so from it.

For the syscall frame, the most common raw address, as a module-relative offset, is 0x18d68, which unsurprisingly looks perfectly correct:

             syscall:
00018d4c         mov        ip, sp                                              ; End of unwind block (FDE at 0x88058), Begin of unwind block (FDE at 0x88084), DATA XREF=syscall@PLT+8, syscall@GOT
00018d50         push       {r4, r5, r6, r7}
00018d54         mov        r7, r0
00018d58         mov        r0, r1
00018d5c         mov        r1, r2
00018d60         mov        r2, r3
00018d64         ldm        ip, {r3, r4, r5, r6}
00018d68         svc        #0x0
00018d6c         pop        {r4, r5, r6, r7}
00018d70         cmn        r0, #0x1000
00018d74         bxls       lr
00018d78         rsb        r0, r0, #0x0
00018d7c         b          dword_79730+4
                        ; endp

But for __pthread_cond_timedwait, the two addresses that unwinding reports are 0x46f16 and 0x46f4f, which would be around here:

             loc_46f10:
00046f10         ldr.w      sl, [r6]                                            ; CODE XREF=_ZL24__pthread_cond_timedwaitP23pthread_cond_internal_tP15pthread_mutex_tbPK8timespec+14
00046f14         mov        r0, r8
00046f16         bl         dword_796f0+4
00046f1a         ldr.w      fp, [r6]
00046f1e         bl         sub_795b4                                           ; sub_795b4
00046f22         mov        sb, r0
00046f24         movs       r5, #0x0
00046f26         mov        r1, r6
00046f28         mov        r3, sl
00046f2a         ldr.w      r0, [sb]
00046f2e         str        r0, [sp, #0x38 + var_28]
00046f30         mov.w      r0, #0xffffffff
00046f34         strd       r4, r5, [sp, #0x38 + var_38]
00046f38         str        r0, [sp, #0x38 + var_30]
00046f3a         lsl.w      r0, fp, #0x7
00046f3e         uxtb       r0, r0
00046f40         bfi        r0, r7, #0x8, #0x1
00046f44         eor        r2, r0, #0x89
00046f48         movs       r0, #0xf0
00046f4a         bl         dword_795c0+4
00046f4e         mov        r4, r0
00046f50         adds       r0, r4, #0x1
00046f52         beq        loc_46f6a

which doesn't look correct at all. 0x46f4f is even between two instructions.

Jed, could you look into this? Is there any other information I could gather to help?

Flags: needinfo?(jld)

I seem to also see mis-parented frames in this one, also on Android: https://perfht.ml/2GQeu2E

(In reply to Paul Adenot (:padenot) from comment #2)

I seem to also see mis-parented frames in this one, also on Android: https://perfht.ml/2GQeu2E

Most of the bad stacks in this profile are bug 1434402 though, not this one.

(In reply to Markus Stange [:mstange] from comment #1)

Jed, could you look into this? Is there any other information I could gather to help?

I'd need a copy of the libc.so (or a reference to the source, if it's public).

I think the 0x46f4f is right: this is Thumb code, so that's the value that would be in lr to return to the instruction after the bl. (0x46f16 is more suspect.)

My guess is that the unwind info doesn't represent the pop correctly, so we correctly unwind into __pthread_cond_timedwait because that return address is in lr, but then mis-unwind out of it because the stack pointer is wrong. But this commit from 2013 should have fixed that. (Also, calling directly (up to tail call elimination) from __pthread_cond_timedwait into syscall doesn't look possible either immediately before that patch or in a newer version.)

Flags: needinfo?(jld) → needinfo?(mstange)

I have a copy of the library. These look like the exidx entries bounding this region:

0x7ba14:        0x7ff9d000      0x80b0b0b0      0x7ff9e8f4      0x808408b0
(gdb) p/x 0x7ba14 + 0x7ff9d000 - (1<<31)
$6 = 0x18a14
(gdb) p/x 0x7ba1c + 0x7ff9e8f4 - (1<<31)
$7 = 0x1a310

That range covers syscall and a few other assembly-implemented routines. 80 means a short descriptor and b0 means Finish: move lr to pc and end the unwind step. So the pop isn't represented in the ARM EH tables; this is a bug in Bionic.

But it also has DWARF unwind info:

  [17] .eh_frame         PROGBITS        00087fa4 087fa4 0016bc 00   A  0   0  4
  [18] .eh_frame_hdr     PROGBITS        00089660 089660 00072c 00   A  0   0  4

And that's probably where the problem is. The commit I linked in comment #4 added unwind directives for DWARF; the ARM EH info uses a different syntax (which is something I used to know but haven't had to think about in years).

We ran into a similar problem on B2G in bug 905521 (note that the patches are no longer accessible; according to a Bugzilla admin that's intended behavior) and in particular bug 905521 comment #9 suggests that this is working as intended from the Android project's point of view, although I don't remember what made me come to that conclusion.

So the EHABI unwinder as it exists won't work here; we'd need to either use another unwinder (LUL?) or else write code to dynamically generate a replacement extab/exidx (either by compiling from the DWARF info or recognizing specific binaries and patching them). Or get patches landed in Bionic (if they'd even accept them) and then wait a very long time for that to end up on actual devices.

Flags: needinfo?(mstange)
See Also: → 905521

Oh dear. Thanks for taking a look! The last option (getting patches into bionic and waiting a long time) doesn't really sound practical.

It seems like Lul already has arm support. So maybe we can use dwarf unwinding without much work!
However, this naive patch doesn't give me any stacks:

diff --git a/tools/profiler/core/platform.cpp b/tools/profiler/core/platform.cpp
--- a/tools/profiler/core/platform.cpp
+++ b/tools/profiler/core/platform.cpp
@@ -120,26 +120,20 @@
 // Mac builds only have frame pointers when MOZ_PROFILING is specified, so
 // FramePointerStackWalk() only works in that case. We don't use MozStackWalk()
 // on Mac.
 #if defined(GP_OS_darwin) && defined(MOZ_PROFILING)
 #  define HAVE_NATIVE_UNWIND
 #  define USE_FRAME_POINTER_STACK_WALK
 #endif

-// Android builds use the ARM Exception Handling ABI to unwind.
-#if defined(GP_PLAT_arm_linux) || defined(GP_PLAT_arm_android)
-#  define HAVE_NATIVE_UNWIND
-#  define USE_EHABI_STACKWALK
-#  include "EHABIStackWalk.h"
-#endif
-
 // Linux builds use LUL, which uses DWARF info to unwind stacks.
 #if defined(GP_PLAT_amd64_linux) || defined(GP_PLAT_x86_linux) ||    \
     defined(GP_PLAT_x86_android) || defined(GP_PLAT_mips64_linux) || \
+    defined(GP_PLAT_arm_linux) || defined(GP_PLAT_arm_android) || \
     defined(GP_PLAT_arm64_linux) || defined(GP_PLAT_arm64_android)
 #  define HAVE_NATIVE_UNWIND
 #  define USE_LUL_STACKWALK
 #  include "lul/LulMain.h"
 #  include "lul/platform-linux-lul.h"

 // On linux we use LUL for periodic samples and synchronous samples, but we use
 // FramePointerStackWalk for backtrace samples when MOZ_PROFILING is enabled.

Makoto, Julian, have either of you recently tried lul on ARM 32-bit Android?

Flags: needinfo?(m_kato)
Flags: needinfo?(jseward)

(In reply to Markus Stange [:mstange] from comment #7)

Makoto, Julian, have either of you recently tried lul on ARM 32-bit Android?

No, I haven't. I have a vague memory that LUL used to work on that
target, but maybe the code got removed (I forget). In principle it
seems to me that we could make it work (again?) if we wanted to.

Flags: needinfo?(jseward)

(In reply to Markus Stange [:mstange] from comment #7, some more)

One thing I forgot to mention, which might be relevant here, is that
LUL doesn't force you to choose between Dwarf and EXIDX on arm32.
Instead it can (well, could) read both formats, and store them in
an internal representation which isn't the same as either Dwarf
or EXIDX. Then the unwinder proper will use internal representation
without knowing which format it came from.

The EXIDX -> Internal form reader was removed some years ago in
favour of simply using Jed's standalone EXIDX unwinder on arm32.
But I guess we could reinstate it.

(In reply to Markus Stange [:mstange] from comment #7)

Makoto, Julian, have either of you recently tried lul on ARM 32-bit Android?

When I tried CFI and/or EHABI when enabling native stack walker for Linux/arm (not Android), many results for stack walker was like same. But that test was using gcc, not clang, so I didn't test it after using clang, not gcc. Also, all binaries of Android may not have CFI information (aarch64 case is bug 1525498). Even if using CFI, we need another trick to walk stack when no rule set (JIT frame or Java JIT frame?). Most case of thumb2 by clang doesn't keep frame pointer register into function, so if no CFI, we cannot walk stack well.

Flags: needinfo?(m_kato)
Priority: -- → P2

(In reply to Julian Seward [:jseward] from comment #9)

The EXIDX -> Internal form reader was removed some years ago in
favour of simply using Jed's standalone EXIDX unwinder on arm32.
But I guess we could reinstate it.

I investigated this some more. As Julian predicted, once I restored the EXIDX -> Internal form converter code that was removed in bug 1173847, stack walking libxul using LUL started working! I got beautiful full stacks for samples in libc.so, so the hand-off from Dwarf unwinding to EXIDX unwinding is working, too.
Compare: EHABIStackWalk profile, LUL profile with EXIDX reader reinstated

So, short term, I think switching 32-bit ARM away from EHABStackWalk and to LUL + EXIDX reader would be the fastest way to fix this bug and to get usable call stacks from libc.so on the Moto G5. This would make the EHABStackWalk code unused.
It should be noted that we're already using LUL for arm64 stackwalking; it's just 32-bit ARM that's not using LUL today.

However, LUL comes with a pretty hefty startup cost, and with some memory overhead: This change caused profiler initialization time to increase from ~80ms to ~400ms on my phone. This is already a problem on arm64 (and really in all places where we're using LUL today), so we'll want to have a general solution for it. But for 32 bit arm, we could instead go for another approach.
Proposal: We could turn LUL into a Frankenstein unwinder that supports both the LUL format and the raw EXIDX format, by opportunistically using code from EHABIStackWalk to do EXIDX-unwinding on a per-stackframe basis, controlled by LUL. Then we don't need the EXIDX converter and save the startup overhead.
I hacked this up in a quick-and-dirty way, and got the following profile: LUL profile with per-frame EHABIStackWalk fallback
It seems to work, and profiler initialization time is down to 130ms! The extra 50ms compared to just EHABIStackWalk (80ms) might be from converting Dwarf unwinding info of system libraries. These timings are preliminary, I'll need to collect some more info here.

For a proper implementation, we'll probably want to make a version of the EHABIStackWalk code that uses more LUL-isms, such as using TaggedUWords everywhere (to propagate whether values are valid), reading from a stack image rather than from the raw stack, and using LUL-style "address" terminology.

Summary: Unwinding from syscalls on Android does not work reliably on my phone → Unwinding from libc.so on Android 32 bit fails (bad stacks during syscalls)
Severity: normal → S3
You need to log in before you can comment on or make changes to this bug.