Closed Bug 326594 Opened 15 years ago Closed 13 years ago

Consider using backtrace() for Unix stack traces

Categories

(Core :: XPCOM, defect)

x86
Linux
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla1.9beta2

People

(Reporter: dbaron, Assigned: karlt)

References

Details

Attachments

(2 files, 3 obsolete files)

There's some work in bug 323853 for using backtrace() for Unix stack traces.  It needs some further work (to handle unlimited stack depth), and perhaps some work on getting the tools we have that deal with stack traces to deal with its output.
Assignee: dbaron → nobody
Blocks: 331744
Using the symbols in gcc's unwind.h directly instead of through glibc's backtrace  would avoid the need for a buffer (of stack depth).
With the current logic in nsStackWalk I get SIGSEGV with *bp, when this walks of the top/base/root of the stack (even though the stack is not corrupted):

http://bonsai.mozilla.org/cvsblame.cgi?file=mozilla/xpcom/base/nsStackWalk.cpp&rev=1.10&mark=1168#1154

(gdb) p bp
$15 = (void **) 0x6e09004f4e5f6f6e
(gdb) p *info
$17 = {buffer = 0x523d00, size = 22, entries = 3}
(gdb) p *info->buffer@3
$28 = {0x2adc14856c1e, 0x2adc168ab84a, 0x4f53492e4f4e5f62}
(gdb) bt
#0  0x00002adc1546ad20 in NS_StackWalk (
    aCallback=0x2adc14856644 <stack_callback>, aSkipFrames=1, 
    aClosure=0x2adc14969b88)
    at /home/karl/moz/mozilla/xpcom/base/nsStackWalk.cpp:1168
#1  0x00002adc148566e9 in backtrace (t=0x2adc14969b80, skip=1)
    at /home/karl/moz/mozilla/tools/trace-malloc/lib/nsTraceMalloc.c:914
#2  0x00002adc14856c1e in calloc (count=1, size=296)
    at /home/karl/moz/mozilla/tools/trace-malloc/lib/nsTraceMalloc.c:1108
#3  0x00002adc168ab84a in g_malloc0 () from /usr/lib/libglib-2.0.so.0
#4  0x00002adc16899915 in g_hash_table_resize ()
   from /usr/lib/libglib-2.0.so.0
#5  0x00002adc168c8416 in g_get_language_names ()
   from /usr/lib/libglib-2.0.so.0
#6  0x00002adc147134b9 in XRE_main (argc=3, argv=0x7fff966152e8, 
    aAppData=0x5b1200)
    at /home/karl/moz/mozilla/toolkit/xre/nsAppRunner.cpp:2675
#7  0x0000000000401249 in main (argc=7, argv=0x7fff966152e8)
    at /home/karl/moz/mozilla/browser/app/nsBrowserApp.cpp:153
(gdb) info f
Stack level 0, frame at 0x7fff966147c0:
 rip = 0x2adc1546ad20 in NS_StackWalk
    (/home/karl/moz/mozilla/xpcom/base/nsStackWalk.cpp:1168); 
    saved rip 0x2adc148566e9
 called by frame at 0x7fff96614810
 source language c++.
 Arglist at 0x7fff966147b0, args: aCallback=0x2adc14856644 <stack_callback>, 
    aSkipFrames=1, aClosure=0x2adc14969b88
 Locals at 0x7fff966147b0, Previous frame's sp is 0x7fff966147c0
 Saved registers:
  rbp at 0x7fff966147b0, rip at 0x7fff966147b8
(gdb) p *(void**)0x7fff966147b0
$30 = (void *) 0x7fff96614800
(gdb) p *((void**)0x7fff966147b0 + 1)
$31 = (void *) 0x2adc148566e9
(gdb) p *(void**)0x7fff96614800
$32 = (void *) 0x7fff96614870
(gdb) p *(void**)0x7fff96614870
$33 = (void *) 0x7fff966148f0
(gdb) p *(void**)0x7fff966148f0
$34 = (void *) 0x6e09004f4e5f6f6e
(gdb) p /x (void*)__libc_stack_end
$37 = 0xffffffff966152e0

The algorithm seems to work for most frames but not all.
Not sure if this is specific to x86_64 or gcc-4.1.2 or compilation flags for libglib, with would have been something like
"-O2 -pipe -march=nocona -frename-registers -fweb -funswitch-loops -momit-leaf-frame-pointer".
Attached patch use libgcc_s to unwind the stack (obsolete) — Splinter Review
This solves the problem above for me.
We just need to decide which other platforms should use this method.

As gcc compiled the code, it seems most reliable to ask gcc how the stack should be unwound.

% readelf -s /lib/libgcc_s.so.1| grep Unwind_Backtrace
    31: 0000000000007eb5   200 FUNC    GLOBAL DEFAULT   11 _Unwind_Backtrace@@GCC_3.3

So I guess we can do this with gcc version >= 3.3.
Assignee: nobody → mozbugz
Status: NEW → ASSIGNED
OK, on __i386, _Unwind_Backtrace is only unwinding one frame.  Will look at that.
_Unwind_Backtrace is only effective if the elf file contains unwind
information, the presence of which can be recognised by the .eh_frame_hdr
section (readelf -S).
On i386, this unwind information is included when compiling with -fexceptions
or -funwind-tables, and g++ (4.1.2) turns on -fexceptions by default for C++,
but we explicitly disable with -fno-exceptions.

glibc's i386 |backtrace| function first uses _Unwind_Backtrace to get as far
as it can, then continues with the saved-ebp-register-is-a-frame-pointer
method that we use (plus a couple more safety checks).

According to comments in glibc's i386 backtrace, the assumptions of the
saved-frame-pointer method are valid "with every stack frame if not compiled
without frame pointer" (-fomit-frame-pointer).

On i386, _Unwind_Backtrace will only be helpful for functions in external
libraries compiled from C++ or from C with -funwind-tables/exceptions.  gcc
uses -fexceptions when building libgcc_s, and glibc uses -fexceptions for
source files that call user provided callbacks, so libc libm and libpthread
contain unwind info for at least some functions.

However glibc strongly recommends against using -fomit-frame-pointer, so our
saved-frame-pointer method should work fine there, and probably in most other
libraries.

If there are distributions using -fomit-frame-pointer then using
google-breakpad's StackwalkerX86 is likely more versatile than glibc's
combination of _Unwind_Backtrace and the saved-frame-pointer linking.

The main problem with our saved-frame-pointer method on i386 is that it
doesn't notice when its assumptions are invalid, resulting in a SIGSEGV.
google-breakpad's method looks like it safely tests memory before accessing
it.  We could set up a SIGSEGV handler or use a syscall like mincore or write
(thanks, roc) to detect whether it is safe to access the memory.
On amd64, even gcc -fno-unwind-tables -fno-exceptions still generates a
.eh_frame_hdr section and _Unwind_Backtrace works fine even for the simplest
of C programs.

glibc's x86_64 backtrace only uses _Unwind_Backtrace, and that is the method I
think we should use on this platform.

On amd64, the saved-frame-pointer linkage assumptions are not safe.  It seems
that, even when rbp is a frame pointer and saved, it is not necessarily saved
at the start of the frame, so our current method fails to find the register.

(gdb) info f
Stack level 0, frame at 0x7fff23ff8170:
 rip = 0x2b4a87a89de6 in NS_StackWalk
    (/home/karl/moz/mozilla/xpcom/base/nsStackWalk.cpp:1226); 
    saved rip 0x2b4a86e756e9
 called by frame at 0x7fff23ff81c0
 source language c++.
 Arglist at 0x7fff23ff8160, args: aCallback=0x2b4a86e75644 <stack_callback>, 
    aSkipFrames=1, aClosure=0x2b4a86f88b88
 Locals at 0x7fff23ff8160, Previous frame's sp is 0x7fff23ff8170
 Saved registers:
  rbp at 0x7fff23ff8160, rip at 0x7fff23ff8168
(gdb) p *(void**)0x7fff23ff8160
$4 = (void *) 0x7fff23ff81b0
(gdb) info f 1
Stack frame at 0x7fff23ff81c0:
 rip = 0x2b4a86e756e9 in backtrace
    (/home/karl/moz/mozilla/tools/trace-malloc/lib/nsTraceMalloc.c:914); 
    saved rip 0x2b4a86e75c1e
 called by frame at 0x7fff23ff8230, caller of frame at 0x7fff23ff8170
 source language c.
 Arglist at 0x7fff23ff81b0, args: t=0x2b4a86f88b80, skip=1
 Locals at 0x7fff23ff81b0, Previous frame's sp is 0x7fff23ff81c0
 Saved registers:
  rbp at 0x7fff23ff81b0, rip at 0x7fff23ff81b8
(gdb) p *(void**)0x7fff23ff81b0
$5 = (void *) 0x7fff23ff8220
(gdb) info f 2
Stack frame at 0x7fff23ff8230:
 rip = 0x2b4a86e75c1e in calloc
    (/home/karl/moz/mozilla/tools/trace-malloc/lib/nsTraceMalloc.c:1108); 
    saved rip 0x2b4a88eca84a
 called by frame at 0x7fff23ff8250, caller of frame at 0x7fff23ff81c0
 source language c.
 Arglist at 0x7fff23ff8220, args: count=1, size=296
 Locals at 0x7fff23ff8220, Previous frame's sp is 0x7fff23ff8230
 Saved registers:
  rbp at 0x7fff23ff8220, rip at 0x7fff23ff8228
(gdb) p *(void**)0x7fff23ff8220
$6 = (void *) 0x7fff23ff82a0
(gdb) info f 3
Stack frame at 0x7fff23ff8250:
 rip = 0x2b4a88eca84a in g_malloc0; saved rip 0x2b4a88eb8915
 called by frame at 0x7fff23ff8290, caller of frame at 0x7fff23ff8230
 Arglist at 0x7fff23ff8230, args: 
 Locals at 0x7fff23ff8230, Previous frame's sp is 0x7fff23ff8250
 Saved registers:
  rbx at 0x7fff23ff8238, rbp at 0x7fff23ff8240, rip at 0x7fff23ff8248
(gdb) p *(void**)0x7fff23ff82a0
$7 = (void *) 0x6e09004f4e5f6f6e
(gdb) p *(void**)0x7fff23ff8240
$8 = (void *) 0x7fff23ff82a0
(gdb) info f 4
Stack frame at 0x7fff23ff8290:
 rip = 0x2b4a88eb8915 in g_hash_table_resize; saved rip 0x2b4a88ee7416
 called by frame at 0x7fff23ff83e0, caller of frame at 0x7fff23ff8250
 Arglist at 0x7fff23ff8248, args: 
 Locals at 0x7fff23ff8248, Previous frame's sp is 0x7fff23ff8290
 Saved registers:
  rbx at 0x7fff23ff8258, rbp at 0x7fff23ff8260, r12 at 0x7fff23ff8268,
  r13 at 0x7fff23ff8270, r14 at 0x7fff23ff8278, r15 at 0x7fff23ff8280,
  rip at 0x7fff23ff8288
(gdb) p *(void**)0x7fff23ff8260
$11 = (void *) 0x7fff23ff82a0
(gdb) info f 5
Stack frame at 0x7fff23ff83e0:
 rip = 0x2b4a88ee7416 in g_get_language_names; saved rip 0x2b4a86d32601
 called by frame at 0x7fff23ff8b30, caller of frame at 0x7fff23ff8290
 Arglist at 0x7fff23ff8288, args: 
 Locals at 0x7fff23ff8288, Previous frame's sp is 0x7fff23ff83e0
 Saved registers:
  rbx at 0x7fff23ff83a8, rbp at 0x7fff23ff83b0, r12 at 0x7fff23ff83b8,
  r13 at 0x7fff23ff83c0, r14 at 0x7fff23ff83c8, r15 at 0x7fff23ff83d0,
  rip at 0x7fff23ff83d8
(gdb) p *(void**)0x7fff23ff83b0
$12 = (void *) 0x7fff23ff8b20
(gdb) info f 6
Stack frame at 0x7fff23ff8b30:
 rip = 0x2b4a86d32601 in XRE_main
    (/home/karl/moz/mozilla/toolkit/xre/nsAppRunner.cpp:2675); 
    saved rip 0x401249
 called by frame at 0x7fff23ff8bf0, caller of frame at 0x7fff23ff83e0
 source language c++.
 Arglist at 0x7fff23ff8b20, args: argc=3, argv=0x7fff23ff8cc8, 
    aAppData=0x5b1200
 Locals at 0x7fff23ff8b20, Previous frame's sp is 0x7fff23ff8b30
 Saved registers:
  rbx at 0x7fff23ff8b00, rbp at 0x7fff23ff8b20, r12 at 0x7fff23ff8b08,
  r13 at 0x7fff23ff8b10, r14 at 0x7fff23ff8b18, rip at 0x7fff23ff8b28
The ppc unwind code should be safe as the ABI requires it on linux at least.

The easy safety checks on bp for i386 are borrowed from glibc's i386 backtrace.

_Unwind_Backtrace will also be useful on ia64 (if anyone uses that), and this patch tries it out on other platforms that have the function.  (It can't be worse than nothing.)

I also intend to swap the sun #ifdef/#else block to tidy things up but it's easier to see the changes like this.
Attachment #279157 - Attachment is obsolete: true
Attachment #280850 - Flags: review?(dbaron)
BTW, regarding __libc_stack_end:
Although it is not defined in any header file, it is not a GLIBC_PRIVATE symbol and glibc seems to expect external use.  glibc-2.5/elf/Versions has:

  GLIBC_2.1 {
    # functions used in other libraries
    _dl_mcount;
    # historically used by Garbage Collectors
    __libc_stack_end;
  }
What OS+CPU combinations have you tested this on, and what testing did you do?
I've only tested this on Intel/Linux:
  amd64 was Gentoo.
  i386 was Ubuntu 7.04.

(Assuming I have the #if conditions right:)
* The only ppc change is the "long(next) & 3" which should be unnecessary but
  provides half a safety check for corrupted stacks.
* The ppc64 change was from looking at glibc code but is not enabled.
* The other differences are in gcc platforms that previously returned
  NS_ERROR_NOT_IMPLEMENTED but now give _Unwind_Backtrace a try.

Testing involved running with
  "--trace-malloc malloc.log --shutdown-leaks sdleak.log"
and flicking through the output from uncategorized.pl to check that it looked reasonable.
The line counts in the last hunk of the patch are not right (emacs diff-mode didn't do all that I hoped), so the patch does not apply correctly.
I'll wait to see if there are further comments before updating the patch.
There is an issue with _Unwind_Backtrace that I just came across today:
If the pc jumps (function call) to dynamic library that has been unloaded,
then _Unwind_Backtrace fails with SIGSEGV.

The old code (happened to?) notice that the stack was invalid and so the walk was aborted gracefully in this case.

It still seems that _Unwind_Backtrace is successful a much greater proportion of the time.
Attached patch safety checks and libgcc_s v3 (obsolete) — Splinter Review
This is the patch that I propose we land.
The changes are that the line counts are correct and the __sun-specific code is now first with the generic gcc code later, which makes the #if statements a little cleaner.
Attachment #280850 - Attachment is obsolete: true
Attachment #284250 - Flags: review?(dbaron)
Attachment #280850 - Flags: review?(dbaron)
Comment on attachment 284250 [details] [diff] [review]
safety checks and libgcc_s v3

>Index: xpcom/base/nsStackWalk.cpp
>-/* vim: set shiftwidth=4 tabstop=8 autoindent cindent expandtab: */
>+/* -*- Mode: C; tab-width: 8; indent-tabs-mode: nil; c-basic-offset: 4 -*-
>+ * vim: set shiftwidth=4 tabstop=8 autoindent cindent expandtab: */

Mode: C++

In your changes to the old stack walking code:
>+#if (defined(__ppc__) && defined(XP_MACOSX)) || defined(__ppc64__)

What led you to add the __ppc64__ test?  Do you know that it's correct?

>+    void *pc = *(bp+2);
>+#else // i386 or ppc32 linux

And do you know that this path is correct for ppc32 Linux?

>+    void *pc = *(bp+1);
>+#endif

(I'll comment on the new code shortly; I don't trust editing text here for too long...)
Comment on attachment 284250 [details] [diff] [review]
safety checks and libgcc_s v3

r+sr=dbaron with the comments above addressed.  My main concern about the comments is that I'd rather say nothing than have comments asserting something that's not actually known or tested to be true -- I didn't see any mention that you'd tested on those systems, so I'm a little surprised to see the comments.  (But I missed the above mention of glibc code and ppc64, so that one's ok.)

I'm also a little skeptical about the sanity-checking code in the old Linux NS_StackWalk -- if the stack tracing doesn't work, we may as well crash -- in situations where it doesn't work, people shouldn't be using it.  That said, I'm ok with it.
Attachment #284250 - Flags: superreview+
Attachment #284250 - Flags: review?(dbaron)
Attachment #284250 - Flags: review+
In reply to comment #15)
> Mode: C++

Done.

> 
> In your changes to the old stack walking code:
> >+#if (defined(__ppc__) && defined(XP_MACOSX)) || defined(__ppc64__)
> 
> What led you to add the __ppc64__ test?  Do you know that it's correct?
> 
> >+    void *pc = *(bp+2);
> >+#else // i386 or ppc32 linux
> 
> And do you know that this path is correct for ppc32 Linux?
> 
> >+    void *pc = *(bp+1);
> >+#endif

glibc-2.5/sysdeps/powerpc/powerpc32/backtrace.c has:

  for (                         count = 0;
       current != NULL &&       count < size;
       current = BOUNDED_1 (current->next), count++)
    array[count] = current->return_address;

with

struct layout
{
  struct layout *__unbounded next;
  void *__unbounded return_address;
};

and includes the comment "Note that every routine is required by the ABI to lay out the stack like this".

I haven't tested on powerpc but this convinced me that our code was correct for powerpc linux.

The powerpc64 version differs only in:
struct layout
{
  struct layout *__unbounded next;
  long condition_register;
  void *__unbounded return_address;
};

I also made the following changes:

-#if (defined(__ppc__) && defined(XP_MACOSX)) || defined(__ppc64__)
+#if (defined(__ppc__) && defined(XP_MACOSX)) || defined(__powerpc64__)
+    // ppc mac or powerpc64 linux
     void *pc = *(bp+2);
-#else // i386 or ppc32 linux
+#else // i386 or powerpc32 linux
     void *pc = *(bp+1);
 #endif

__powerpc64__ is the preprocessor symbol to use with gcc/Linux.
I don't know how Darwin 64-bit stacks are laid out so I've added the comment to indicate that we only know that path to be correct for "ppc mac or powerpc64 linux".
Comment on attachment 285565 [details] [diff] [review]
safety checks and libgcc_s v4

Benjamin, should I get review from you on this too?
Attachment #285565 - Flags: review?(benjamin)
comment 17 sounds good.
Comment on attachment 285565 [details] [diff] [review]
safety checks and libgcc_s v4

Was removing the x86-64 ifdef conditions intentional?
(In reply to comment #20)
> Was removing the x86-64 ifdef conditions intentional?

Yes, the saved-frame-pointer linkage assumptions in the old method are not safe
for x86_64 (comment 7 goes into more detail), and glibc's x86_64 backtrace just uses _Unwind_Backtrace.

Removing the x86_64 ifdef conditions means we use _Unwind_Backtrace also (and also removes "__asm__( "movq %%rbp, %0" : "=g"(bp));" which we no longer use).  

HAVE_DLADDR && HAVE__UNWIND_BACKTRACE will be true when defined(linux) && defined(__GNUC__) && defined(__x86_64__).
Attachment #285565 - Flags: review?(benjamin) → review+
<bsmedberg> dbaron: is there a point in reviewing/taking bug 326594 (use backtrace)?
<dbaron> bsmedberg, I don't think it needs your review separately unless you really want to; I think we should take it; it should help a *lot* on x86_64 Linux and some on i386 Linux.
Comment on attachment 285565 [details] [diff] [review]
safety checks and libgcc_s v4

Requesting aproval1.9 as  this enables memory leak analysis and means that our crash handling doesn't crash (any where near as often) on x86_64.  Any risk should be confined to debug or memory test builds.
Attachment #285565 - Flags: approval1.9?
Attachment #285565 - Flags: approval1.9? → approval1.9+
1.1887      mozilla/configure.in
1.11        mozilla/xpcom/base/nsStackWalk.cpp
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla1.9 M10
This breaks building on FreeBSD which has no <unwind.h>, but does have an __Unwind_backtrace(), presumably because it uses gcc 4.2.  Please add a configure check for the header too.  This is on FreeBSD-CURRENT-i386.
(In reply to comment #25)

gcc-4.2.2/gcc/Makefile.in looks to me like it is installs unwind.h into

  $(libdir)/gcc/$(target_noncanonical)/$(version)/include

What are the contents of that directory on FreeBSD?

If unwind.h is not there, then can you check whether this patch fixes the issue on FreeBSD, please?
Attachment #284250 - Attachment is obsolete: true
(In reply to comment #26)
> gcc-4.2.2/gcc/Makefile.in looks to me like it is installs unwind.h into
> 
>   $(libdir)/gcc/$(target_noncanonical)/$(version)/include
> 
> What are the contents of that directory on FreeBSD?

FreeBSD doesn't use GCC's build infrastructure and doesn't have that directory.  There are a bunch of unwind headers in the gcc source tree, but none are installed.  I would have to start a thread on the freebsd mailing lists to find out why.

The patch, however, fixes the problem, and it's better style to explicitly check for the header anyway...  Sorry about the delay in getting back to you.

Regards,
  -Jeremy
Comment on attachment 288388 [details] [diff] [review]
check for unwind.h

I'm not an autoconf expert.  Is this an appropriate way to check for unwind.h?
Attachment #288388 - Flags: review?(benjamin)
Comment on attachment 288388 [details] [diff] [review]
check for unwind.h

Can I have it all on one line?
Attachment #288388 - Flags: review?(benjamin) → review+
Comment on attachment 288388 [details] [diff] [review]
check for unwind.h

> Can I have it all on one line?

Sure.
Attachment #288388 - Flags: approval1.9?
Reopening this since drivers don't seem to look at closed bugs for approval requests (ugh!). :(
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Attachment #288388 - Flags: approval1.9? → approval1.9+
Checked in attachment 288388 [details] [diff] [review], all on one line: revision 1.1891
Status: REOPENED → RESOLVED
Closed: 13 years ago13 years ago
Resolution: --- → FIXED
Depends on: 593467
You need to log in before you can comment on or make changes to this bug.