Closed Bug 787302 Opened 12 years ago Closed 12 years ago

Most stacks are truncated (using minidump_stackwalk on a Mac Tinderbox debug build)

Categories

(Toolkit :: Crash Reporting, defect)

x86_64
macOS
defect
Not set
normal

Tracking

()

VERIFIED FIXED
mozilla18
Tracking Status
firefox17 --- fixed
firefox18 --- fixed

People

(Reporter: jruderman, Assigned: espindola)

References

Details

(Keywords: regression, Whiteboard: [fuzzblocker])

Attachments

(8 files, 1 obsolete file)

This broke a little over a month ago.  Maybe due to the switch to clang?

My DOM fuzzer relies on these crash stacks for distinguishing known bugs from new bugs.  For many crashes, the top item is sufficient, but there are some (e.g. bug 588237) that I need to ignore based on something in the middle of the stack.

I'll attach some dumps and truncated-stacks that I got using the testcases in bug 588237 and bug 763828.  They're all using the build and symbol files in
https://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-macosx64-debug/1346378090/
along with the minidump_stackwalk executable from
http://hg.mozilla.org/build/tools/raw-file/default/breakpad/osx/minidump_stackwalk
Attached file dump (bug 588237)
Attached file dump (bug 763828)
> https://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-
> central-macosx64-debug/1346378090/

Copied to http://www.squarefree.com/bug787302/build_1346378090/ in case it gets deleted.
Okay, so the fact that the second frame in both stacks is "Found by: stack scanning" is telling. That means that we don't have usable unwind information for frame 0, so we're resorting to scanning.
Locally I get a full stack, but it is found by stack scanning. The first function in the stack is:

__Z14mozalloc_abortPKc:
0000000000000b40        pushq   %rbp
0000000000000b41        movq    %rsp, %rbp
0000000000000b44        pushq   %rbx
0000000000000b45        pushq   %rax
0000000000000b46        movq    1219(%rip), %rbx
0000000000000b4d        movq    (%rbx), %rsi
0000000000000b50        callq   0xbc4 ## symbol stub for: _fputs
0000000000000b55        movq    (%rbx), %rsi
0000000000000b58        leaq    241(%rip), %rdi ## literal pool for: 

0000000000000b5f        callq   0xbc4 ## symbol stub for: _fputs
0000000000000b64        movl    $123, 0


So we should be able to just follow rsp, rbp. I will try to figure out what went wrong.
Status: NEW → ASSIGNED
Assignee: nobody → respindola
It's possible the minidump_stackwalk the tinderbox machines are using is old enough that it doesn't know how to use frame pointers to walk an x86-64 stack.
(In reply to Ted Mielczarek [:ted] from comment #8)
> It's possible the minidump_stackwalk the tinderbox machines are using is old
> enough that it doesn't know how to use frame pointers to walk an x86-64
> stack.

Do you know where can I check the version (and update it if needed)?
Looks like we have two problems

1) the 64 bit unwinder (even the current svn one) will not try rsp/rbp.
2) clang and breakpad are not agreeing on the cfi

I will try to fix 2 first.
(In reply to Rafael Ávila de Espíndola (:espindola) from comment #10)
> Looks like we have two problems
> 
> 1) the 64 bit unwinder (even the current svn one) will not try rsp/rbp.

Oh huh, I would have sworn I implemented that at some point. It wouldn't be hard to add.

> 2) clang and breakpad are not agreeing on the cfi

In this case there definitely ought to be usable CFI here since we have full symbol information. Thanks for looking into it!
I just downloaded

https://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-macosx64-debug/1346378090/firefox-18.0a1.en-US.mac64.crashreporter-symbols.zip

and:

$ unzip -L firefox-18.0a1.en-US.mac64.crashreporter-symbols.zip
$ find . -name *.sym | xargs grep STACK
./XUL/83BAA1821A40335FB1C01A83683E86240/XUL.sym:FUNC 1255f80 84 0 JSD_ASSERT_VALID_STACK_FRAME
./libnspr4.dylib/3FECC3934B0F385B90FDFC52F3AF1C0A0/libnspr4.dylib.sym:FUNC a830 34 0 _PR_RELEASE_LOCK_STACK

I will debug why we are missing STACK entries tomorrow.
That is really exceptionally bad. So bad that I would backout clang on Firefox 17 if we don't have an easy fix.
(In reply to Ted Mielczarek [:ted] from comment #13)
> That is really exceptionally bad. So bad that I would backout clang on
> Firefox 17 if we don't have an easy fix.

I just noticed that running dump_syms on a .dylib produces a dump with no line numbers, but with stack information. Running it on the .dylib.dSYM produces a dump without stack information but with line numbers. Our build process uses the .dylib.dSYM, which explains the problem.

Do you now if we changed anything in this area?
A summary of what is going on:

* on x86_64, gcc produces both eh_frame and debug_frame, clang produces only eh_frame
* LOAD segments are not copied to .dSYM. From macho_reader.cc:

      // Mach-O files in .dSYM bundles have the contents of the loaded
      // segments removed, and their file offsets and file sizes zeroed
      // out.  However, the sections within those segments still have
      // non-zero sizes.  There's no reason to call MisplacedSectionData in
      // this case; the caller may just need the section's load
      // address. But do set the contents' limits to NULL, for safety.

The correct fix is to change dump_syms to look at both the .dylib and the .dSYM. The easy fix is to patch clang to produce debug_frame too. I am doing the easy fix, ted, can I assign you a bug about fixing dump_syms and updating it?
Yes, I'm not sure I'll get to it right away though.
I tested this locally and we do get STACK lines in the output of "make buildsyms".

Sending for catlee review since rail is away.
Attachment #660951 - Flags: review?(catlee)
Comment on attachment 660951 [details] [diff] [review]
patch clang to always produce debug_frame

Review of attachment 660951 [details] [diff] [review]:
-----------------------------------------------------------------

<rubberstamp>
Attachment #660951 - Flags: review?(catlee) → review+
(In reply to Jesse Ruderman from comment #21)
> I'm still getting truncated stacks with the crash in bug 588237 using
> https://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-
> central-macosx64-debug/1347968942/
> built from
> http://hg.mozilla.org/mozilla-central/rev/e4757379b99a

Yes, sorry. I had to install centos5 vms as we update clang on linux and OS X in sync. I am just finishing up the last clang build.
Attachment #662222 - Flags: review?(rail) → review+
Attachment #662282 - Flags: review?(rail) → review+
Attachment #662265 - Flags: review?(rail) → review+
The build on try had STACK lines. I have just requested a clobber on inbound, so the next inbound build should be good too.
Whiteboard: [fuzzblocker], [leave-open] → [fuzzblocker]
https://hg.mozilla.org/mozilla-central/rev/20f683bc1ce7
https://hg.mozilla.org/mozilla-central/rev/35b2c35d3243
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla18
Much better!  The stacks still aren't getting all the way to main, but they're getting far enough for my scripts to work.
Blocks: 733905
Status: RESOLVED → VERIFIED
(In reply to Jesse Ruderman from comment #31)
> Much better!  The stacks still aren't getting all the way to main, but
> they're getting far enough for my scripts to work.

Where are they stopping? Can you report a bug about that?
Blocks: 793109
I filed bug 793109
[Approval Request Comment]
Bug caused by (feature/regressing bug #): 787302
User impact if declined: Harder to debug other bugs because of broken stacks.
Testing completed (on m-c, etc.): 
Risk to taking this patch (and alternatives if risky): compiler bug, but low risk since it has been in use on m-c. We could backport just the fix, but we would then have a new version of clang to keep track of.
String or UUID changes made by this patch:
Attachment #663536 - Flags: approval-mozilla-aurora?
Comment on attachment 663536 [details] [diff] [review]
Backport 787568 and 787302 to aurora

[Triage Comment]
Early enough in the cycle that we can take a fix like this in support of debugging/fuzzing.
Attachment #663536 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Setting checkin-needed so we don't forget this for mozilla-aurora.
Keywords: checkin-needed
https://hg.mozilla.org/releases/mozilla-aurora/rev/b88c6a990b0f

Rafael, my apologies for your name getting a bit messed up. You can thank Windows' crappy command line for that. That said, you can avoid possible problems like that in the future by making sure your patches have the necessary commit information in them when you post them.
https://developer.mozilla.org/en-US/docs/Creating_a_patch_that_can_be_checked_in
Note that in the future if you're branch-landing patches that have already landed on mozilla-central you can use "hg transplant" instead and just reuse the existing commit metadata:
https://blog.mozilla.org/ted/2009/12/02/easy-branch-landing-of-patches/
I normally do transplant, but not when there's a branch-specific patch posted to a bug...
Ah, missed that. sorry!
Thanks everyone. The aurora patch was a squashed version of the patches that went to inbound.
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: