Last Comment Bug 787302 - Most stacks are truncated (using minidump_stackwalk on a Mac Tinderbox debug build)
: Most stacks are truncated (using minidump_stackwalk on a Mac Tinderbox debug ...
Status: VERIFIED FIXED
[fuzzblocker]
: regression
Product: Toolkit
Classification: Components
Component: Breakpad Integration (show other bugs)
: Trunk
: x86_64 Mac OS X
: -- normal (vote)
: mozilla18
Assigned To: Rafael Ávila de Espíndola (:espindola) (not reading bugmail)
:
Mentors:
Depends on:
Blocks: 733905 793109
  Show dependency treegraph
 
Reported: 2012-08-30 20:32 PDT by Jesse Ruderman
Modified: 2012-10-01 07:37 PDT (History)
9 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---
fixed
fixed


Attachments
dump (bug 588237) (263.50 KB, text/plain)
2012-08-30 20:33 PDT, Jesse Ruderman
no flags Details
truncated stack (bug 588237) (443 bytes, text/plain)
2012-08-30 20:34 PDT, Jesse Ruderman
no flags Details
dump (bug 763828) (273.23 KB, application/octet-stream)
2012-08-30 20:35 PDT, Jesse Ruderman
no flags Details
truncated stack (bug 763828) (421 bytes, text/plain)
2012-08-30 20:36 PDT, Jesse Ruderman
no flags Details
patch clang to always produce debug_frame (1.42 KB, patch)
2012-09-13 13:21 PDT, Rafael Ávila de Espíndola (:espindola) (not reading bugmail)
catlee: review+
Details | Diff | Review
switch packages once try is done (3.48 KB, patch)
2012-09-18 10:44 PDT, Rafael Ávila de Espíndola (:espindola) (not reading bugmail)
rail: review+
Details | Diff | Review
Fix a really silly bug, we want this patch on OS X :-( (795 bytes, patch)
2012-09-18 12:13 PDT, Rafael Ávila de Espíndola (:espindola) (not reading bugmail)
rail: review+
Details | Diff | Review
switch packages once try is done (3.39 KB, patch)
2012-09-18 13:05 PDT, Rafael Ávila de Espíndola (:espindola) (not reading bugmail)
rail: review+
Details | Diff | Review
Backport 787568 and 787302 to aurora (3.87 KB, patch)
2012-09-21 13:12 PDT, Rafael Ávila de Espíndola (:espindola) (not reading bugmail)
akeybl: approval‑mozilla‑aurora+
Details | Diff | Review

Description Jesse Ruderman 2012-08-30 20:32:10 PDT
This broke a little over a month ago.  Maybe due to the switch to clang?

My DOM fuzzer relies on these crash stacks for distinguishing known bugs from new bugs.  For many crashes, the top item is sufficient, but there are some (e.g. bug 588237) that I need to ignore based on something in the middle of the stack.

I'll attach some dumps and truncated-stacks that I got using the testcases in bug 588237 and bug 763828.  They're all using the build and symbol files in
https://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-macosx64-debug/1346378090/
along with the minidump_stackwalk executable from
http://hg.mozilla.org/build/tools/raw-file/default/breakpad/osx/minidump_stackwalk
Comment 1 Jesse Ruderman 2012-08-30 20:33:38 PDT
Created attachment 657122 [details]
dump (bug 588237)
Comment 2 Jesse Ruderman 2012-08-30 20:34:31 PDT
Created attachment 657123 [details]
truncated stack (bug 588237)
Comment 3 Jesse Ruderman 2012-08-30 20:35:37 PDT
Created attachment 657124 [details]
dump (bug 763828)
Comment 4 Jesse Ruderman 2012-08-30 20:36:00 PDT
Created attachment 657125 [details]
truncated stack (bug 763828)
Comment 5 Jesse Ruderman 2012-08-30 21:09:20 PDT
> https://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-
> central-macosx64-debug/1346378090/

Copied to http://www.squarefree.com/bug787302/build_1346378090/ in case it gets deleted.
Comment 6 Ted Mielczarek [:ted.mielczarek] 2012-08-31 07:43:59 PDT
Okay, so the fact that the second frame in both stacks is "Found by: stack scanning" is telling. That means that we don't have usable unwind information for frame 0, so we're resorting to scanning.
Comment 7 Rafael Ávila de Espíndola (:espindola) (not reading bugmail) 2012-09-12 14:54:34 PDT
Locally I get a full stack, but it is found by stack scanning. The first function in the stack is:

__Z14mozalloc_abortPKc:
0000000000000b40        pushq   %rbp
0000000000000b41        movq    %rsp, %rbp
0000000000000b44        pushq   %rbx
0000000000000b45        pushq   %rax
0000000000000b46        movq    1219(%rip), %rbx
0000000000000b4d        movq    (%rbx), %rsi
0000000000000b50        callq   0xbc4 ## symbol stub for: _fputs
0000000000000b55        movq    (%rbx), %rsi
0000000000000b58        leaq    241(%rip), %rdi ## literal pool for: 

0000000000000b5f        callq   0xbc4 ## symbol stub for: _fputs
0000000000000b64        movl    $123, 0


So we should be able to just follow rsp, rbp. I will try to figure out what went wrong.
Comment 8 Ted Mielczarek [:ted.mielczarek] 2012-09-12 15:38:42 PDT
It's possible the minidump_stackwalk the tinderbox machines are using is old enough that it doesn't know how to use frame pointers to walk an x86-64 stack.
Comment 9 Rafael Ávila de Espíndola (:espindola) (not reading bugmail) 2012-09-12 16:14:24 PDT
(In reply to Ted Mielczarek [:ted] from comment #8)
> It's possible the minidump_stackwalk the tinderbox machines are using is old
> enough that it doesn't know how to use frame pointers to walk an x86-64
> stack.

Do you know where can I check the version (and update it if needed)?
Comment 10 Rafael Ávila de Espíndola (:espindola) (not reading bugmail) 2012-09-12 17:08:41 PDT
Looks like we have two problems

1) the 64 bit unwinder (even the current svn one) will not try rsp/rbp.
2) clang and breakpad are not agreeing on the cfi

I will try to fix 2 first.
Comment 11 Ted Mielczarek [:ted.mielczarek] 2012-09-12 17:33:39 PDT
(In reply to Rafael Ávila de Espíndola (:espindola) from comment #10)
> Looks like we have two problems
> 
> 1) the 64 bit unwinder (even the current svn one) will not try rsp/rbp.

Oh huh, I would have sworn I implemented that at some point. It wouldn't be hard to add.

> 2) clang and breakpad are not agreeing on the cfi

In this case there definitely ought to be usable CFI here since we have full symbol information. Thanks for looking into it!
Comment 12 Rafael Ávila de Espíndola (:espindola) (not reading bugmail) 2012-09-12 18:59:33 PDT
I just downloaded

https://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-macosx64-debug/1346378090/firefox-18.0a1.en-US.mac64.crashreporter-symbols.zip

and:

$ unzip -L firefox-18.0a1.en-US.mac64.crashreporter-symbols.zip
$ find . -name *.sym | xargs grep STACK
./XUL/83BAA1821A40335FB1C01A83683E86240/XUL.sym:FUNC 1255f80 84 0 JSD_ASSERT_VALID_STACK_FRAME
./libnspr4.dylib/3FECC3934B0F385B90FDFC52F3AF1C0A0/libnspr4.dylib.sym:FUNC a830 34 0 _PR_RELEASE_LOCK_STACK

I will debug why we are missing STACK entries tomorrow.
Comment 13 Ted Mielczarek [:ted.mielczarek] 2012-09-13 06:38:31 PDT
That is really exceptionally bad. So bad that I would backout clang on Firefox 17 if we don't have an easy fix.
Comment 14 Rafael Ávila de Espíndola (:espindola) (not reading bugmail) 2012-09-13 11:12:01 PDT
(In reply to Ted Mielczarek [:ted] from comment #13)
> That is really exceptionally bad. So bad that I would backout clang on
> Firefox 17 if we don't have an easy fix.

I just noticed that running dump_syms on a .dylib produces a dump with no line numbers, but with stack information. Running it on the .dylib.dSYM produces a dump without stack information but with line numbers. Our build process uses the .dylib.dSYM, which explains the problem.

Do you now if we changed anything in this area?
Comment 15 Rafael Ávila de Espíndola (:espindola) (not reading bugmail) 2012-09-13 12:36:41 PDT
A summary of what is going on:

* on x86_64, gcc produces both eh_frame and debug_frame, clang produces only eh_frame
* LOAD segments are not copied to .dSYM. From macho_reader.cc:

      // Mach-O files in .dSYM bundles have the contents of the loaded
      // segments removed, and their file offsets and file sizes zeroed
      // out.  However, the sections within those segments still have
      // non-zero sizes.  There's no reason to call MisplacedSectionData in
      // this case; the caller may just need the section's load
      // address. But do set the contents' limits to NULL, for safety.

The correct fix is to change dump_syms to look at both the .dylib and the .dSYM. The easy fix is to patch clang to produce debug_frame too. I am doing the easy fix, ted, can I assign you a bug about fixing dump_syms and updating it?
Comment 16 Ted Mielczarek [:ted.mielczarek] 2012-09-13 12:50:38 PDT
Yes, I'm not sure I'll get to it right away though.
Comment 17 Rafael Ávila de Espíndola (:espindola) (not reading bugmail) 2012-09-13 13:21:42 PDT
Created attachment 660951 [details] [diff] [review]
patch clang to always produce debug_frame

I tested this locally and we do get STACK lines in the output of "make buildsyms".

Sending for catlee review since rail is away.
Comment 18 Chris AtLee [:catlee] 2012-09-14 11:17:54 PDT
Comment on attachment 660951 [details] [diff] [review]
patch clang to always produce debug_frame

Review of attachment 660951 [details] [diff] [review]:
-----------------------------------------------------------------

<rubberstamp>
Comment 19 Rafael Ávila de Espíndola (:espindola) (not reading bugmail) 2012-09-17 06:28:01 PDT
https://hg.mozilla.org/integration/mozilla-inbound/rev/436ae8b59f87
Comment 20 Ed Morley [:emorley] 2012-09-17 12:23:58 PDT
https://hg.mozilla.org/mozilla-central/rev/436ae8b59f87
Comment 22 Rafael Ávila de Espíndola (:espindola) (not reading bugmail) 2012-09-18 06:14:41 PDT
(In reply to Jesse Ruderman from comment #21)
> I'm still getting truncated stacks with the crash in bug 588237 using
> https://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-
> central-macosx64-debug/1347968942/
> built from
> http://hg.mozilla.org/mozilla-central/rev/e4757379b99a

Yes, sorry. I had to install centos5 vms as we update clang on linux and OS X in sync. I am just finishing up the last clang build.
Comment 23 Rafael Ávila de Espíndola (:espindola) (not reading bugmail) 2012-09-18 10:44:43 PDT
Created attachment 662222 [details] [diff] [review]
switch packages once try is done

Try at:
https://tbpl.mozilla.org/?tree=Try&rev=5bd058fc51b1
Comment 24 Rafael Ávila de Espíndola (:espindola) (not reading bugmail) 2012-09-18 12:13:39 PDT
Created attachment 662265 [details] [diff] [review]
Fix a really silly bug, we want this patch on OS X :-(
Comment 25 Rafael Ávila de Espíndola (:espindola) (not reading bugmail) 2012-09-18 13:05:23 PDT
Created attachment 662282 [details] [diff] [review]
switch packages once try is done

now with OS X patched.

https://tbpl.mozilla.org/?tree=Try&rev=10cb43ddedc3
Comment 26 Rafael Ávila de Espíndola (:espindola) (not reading bugmail) 2012-09-18 13:58:37 PDT
 https://hg.mozilla.org/integration/mozilla-inbound/rev/20f683bc1ce7
Comment 27 Rafael Ávila de Espíndola (:espindola) (not reading bugmail) 2012-09-18 14:00:55 PDT
 https://hg.mozilla.org/integration/mozilla-inbound/rev/35b2c35d3243
Comment 28 Rafael Ávila de Espíndola (:espindola) (not reading bugmail) 2012-09-18 14:04:08 PDT
The build on try had STACK lines. I have just requested a clobber on inbound, so the next inbound build should be good too.
Comment 31 Jesse Ruderman 2012-09-19 10:43:40 PDT
Much better!  The stacks still aren't getting all the way to main, but they're getting far enough for my scripts to work.
Comment 32 Rafael Ávila de Espíndola (:espindola) (not reading bugmail) 2012-09-19 11:45:06 PDT
(In reply to Jesse Ruderman from comment #31)
> Much better!  The stacks still aren't getting all the way to main, but
> they're getting far enough for my scripts to work.

Where are they stopping? Can you report a bug about that?
Comment 33 Jesse Ruderman 2012-09-21 01:04:43 PDT
I filed bug 793109
Comment 34 Rafael Ávila de Espíndola (:espindola) (not reading bugmail) 2012-09-21 13:12:13 PDT
Created attachment 663536 [details] [diff] [review]
Backport 787568 and 787302 to aurora

[Approval Request Comment]
Bug caused by (feature/regressing bug #): 787302
User impact if declined: Harder to debug other bugs because of broken stacks.
Testing completed (on m-c, etc.): 
Risk to taking this patch (and alternatives if risky): compiler bug, but low risk since it has been in use on m-c. We could backport just the fix, but we would then have a new version of clang to keep track of.
String or UUID changes made by this patch:
Comment 35 Alex Keybl [:akeybl] 2012-09-21 16:43:51 PDT
Comment on attachment 663536 [details] [diff] [review]
Backport 787568 and 787302 to aurora

[Triage Comment]
Early enough in the cycle that we can take a fix like this in support of debugging/fuzzing.
Comment 36 Gary Kwong [:gkw] [:nth10sd] 2012-09-24 18:04:14 PDT
Setting checkin-needed so we don't forget this for mozilla-aurora.
Comment 37 Ryan VanderMeulen [:RyanVM] 2012-09-24 19:14:00 PDT
https://hg.mozilla.org/releases/mozilla-aurora/rev/b88c6a990b0f

Rafael, my apologies for your name getting a bit messed up. You can thank Windows' crappy command line for that. That said, you can avoid possible problems like that in the future by making sure your patches have the necessary commit information in them when you post them.
https://developer.mozilla.org/en-US/docs/Creating_a_patch_that_can_be_checked_in
Comment 38 Ted Mielczarek [:ted.mielczarek] 2012-09-25 05:30:32 PDT
Note that in the future if you're branch-landing patches that have already landed on mozilla-central you can use "hg transplant" instead and just reuse the existing commit metadata:
https://blog.mozilla.org/ted/2009/12/02/easy-branch-landing-of-patches/
Comment 39 Ryan VanderMeulen [:RyanVM] 2012-09-25 06:09:15 PDT
I normally do transplant, but not when there's a branch-specific patch posted to a bug...
Comment 40 Ted Mielczarek [:ted.mielczarek] 2012-09-25 06:10:39 PDT
Ah, missed that. sorry!
Comment 41 Rafael Ávila de Espíndola (:espindola) (not reading bugmail) 2012-10-01 07:37:50 PDT
Thanks everyone. The aurora patch was a squashed version of the patches that went to inbound.

Note You need to log in before you can comment on or make changes to this bug.