Closed Bug 1062709 Opened 5 years ago Closed 5 years ago

Clean up stack printing and fixing

Categories

(Core :: XPCOM, defect)

x86_64
Linux
defect
Not set

Tracking

()

RESOLVED FIXED
mozilla35

People

(Reporter: njn, Assigned: njn)

References

Details

Attachments

(4 files, 5 obsolete files)

Our printing and "fixing" (i.e. post-processing) of stacks is a mess.

PRINTING
--------
Two separate functions format individual stack entries:

- NS_FormatCodeAddressDetails(), which is used by nsSigHandlers.cpp and
  nsTraceRefcnt.cpp.
  
- CodeAddressService::GetLocation(), which is used by DMD.cpp,
  nsTraceRefcnt.cpp, and BlockingResourceBase.cpp.

They are similar, but CodeAddressService provides a cache, which means it's
more work to set up but is much faster if you're printing lots of stacks with
repeated entries.

The exact formatting of the stack entries depends on which of these functions
you call, and which platform you are on. What follows is my best attempt to
capture all the variations. The inputs are <pc>, <fn>, <foffset>, <file>,
<line>, <lib>, <loffset>. The conditions are shown in the comments on the RHS.

CodeAddressService::GetLocation(), all platforms:

> "??? <pc>"                        #      if: no <fn>, no <lib>, no <loffset>
> "<fn|???> (<file>:<line>) <pc>"   # else if: <file>
> "<fn|???>[<lib> +<loffset>] <pc>" # else

NS_FormatCodeAddressDetails(), Windows:

> "<fn>+<foffset> [<lib> +<loffset>]"    #      if: fn
> "UNKNOWN [<lib> +<loffset>]"           # else if: lib
> "UNKNOWN <pc>"                         # else

then, if <file>, append:

> " (<file>, line <line>)"

# then append:

> "\n"

NS_FormatCodeAddressDetails(), Sun:

> "<pc> <lib|??>:<fn|??>+<foffset>\n"

NS_FormatCodeAddressDetails(), other Unixes:

> "UNKNOWN <pc>\n"                          # if: no lib
> "UNKNOWN [<lib> +<loffset>]\n"            # if: no fn
> "<fn>+<foffset> [<lib> +<loffset>]\n"     # else

NS_FormatCodeAddressDetails(), other:

> (error)

This is madness. I plan to unify these five variants.

FIXING
------
Printed stacks are often "fixed" to add extra information that's not obtainable
in the above functions. Specifically:

- fix_stacks_using_bpsyms.py is used on tinderbox machines; all mochitest
  output goes through it.

- fix_{linux,macosx}_stacks.py are used on local machines for mochitest, and
  also for DMD output.

- fix_b2g_stacks.py is used for DMD output on B2G.

These scripts expect the output to have a particular form, like this:

> <stuff1>[<lib> +<offset>]<stuff2>

The contents of <stuff1> is restricted, but exactly what's allowed is hard to
determine, because it depends on two complex regexps in each of these scripts.

This form isn't particularly easy to parse, and it's causing problems for
bug 1044709 where I need to put stack entries in JSON strings. So I want to
change it to an easier-to-parse form. The above-mentioned unification of the
formatting functions will make this easier.
This patch unifies all formatting functions into the single following variant.

> "<fn|???> (<file>:<line>) <pc>"                    #      if: <file>
> "FIX[fn=<fn|???> lib=<lib> offset=<loffset>] <pc>" # else if: <lib>
> "??? <pc>"                                         # else

Things to note:

- Case 1 is the best case: we don't need to do any post-processing.

- Case 2 indicates that post-processing is needed. It's in a form that's much
  more obvious and easier to parse.

- Case 3 is the "no useful information" case.

- The <pc> is always on the end now. It wasn't on most of the
  NS_FormatCodeAddressDetails() cases previously. It seems useful.

- <foffset> is never printed. It doesn't seem useful.

- "(<file>:<line>) is always used for source locations, rather than "(<file>,
  line <line>).

- "???" is used consistently for unknown things. No more "??" and "UNKNOWN".

- There's no newline. Some callers don't want it, and callers who need it can
  easily add it.

The patch modifies all callers of the formatting functions accordingly.

The patch also modifies the fix scripts accordingly. fix_b2g_stacks.py is in
the B2G repo and will need to be modified separately, and I'll need to
co-ordinate the landings to avoid breaking DMD on B2G, alas.
Attachment #8483951 - Flags: review?(dbaron)
Nice! I made the Windows stack printing match other platforms more closely a while ago, it's all a mess in there.

Just for reference, Breakpad's minidump_stackwalk has a similar stack printing routine:
https://code.google.com/p/google-breakpad/source/browse/trunk/src/processor/minidump_stackwalk.cc#149

(This is what gets used when we print stack frames from minidumps on tinderbox.)
> Just for reference, Breakpad's minidump_stackwalk has a similar stack
> printing routine:
> https://code.google.com/p/google-breakpad/source/browse/trunk/src/processor/
> minidump_stackwalk.cc#149
> 
> (This is what gets used when we print stack frames from minidumps on
> tinderbox.)

Those stack frames don't require post-processing by the fix*.py scripts, right? AFAICT, even prior to this change they aren't in a form that the fix*.py scripts would match.
(In reply to Nicholas Nethercote [:njn] from comment #4)
> Created attachment 8484769 [details] [review]
> Link to GitHub PR: https://github.com/mozilla-b2g/B2G/pull/384

I created this attachment manually; not sure if there's an automated way to do this kind of review request.
(In reply to Nicholas Nethercote [:njn] from comment #3)
> Those stack frames don't require post-processing by the fix*.py scripts,
> right? AFAICT, even prior to this change they aren't in a form that the
> fix*.py scripts would match.

Right, that was just for reference as "here's another way we print stack frames that wind up in test logs".
> "FIX[fn=<fn|???> lib=<lib> offset=<loffset>] <pc>" # else if: <lib>

erahm has said he finds this ugly. On Linux, you usually don't have <fn> so any line in this form is practically useless -- "oh, it's somewhere in libxul" -- and so post-processing is practically mandatory. On Mac, however, having a <fn> is common, so post-processing isn't always necessary to get something useful. Here's a Mac example:

> nsAString_internal::SetCapacity(unsigned int, mozilla::fallible_t const&)[/Users/njn/moz/mi1/d64dmd/dist/NightlyDebug.app/Contents/MacOS/XUL +0x33277] 0x102d0d277

The constraints on the final form are:

1. Need to allow almost arbitrary stuff beforehand, in a way that is easy to parse.

2. <lib> and <loffset> are highly constrained, but <fn> can be C++ type signatures, and so can contain chars like ' ', ':', '*', '&', '(, ')'. So whatever separates <fn> and <lib> must involve chars that can't appear in <fn>.

The FIX form has six chars at the front ("FIX[fn") that satisfy (1), and the "lib=" satisfies (2). If we go with a form similar to the existing forms:

> "<fn|???> [<lib> +<loffset>]"

then the dividing line between any preceding stuff and <fn> (and possibly even <lib>) is unclear.
Is there a reason you don't want to put the function name at the end of the line?

lib+offset fn [file+line]
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #8)
> Is there a reason you don't want to put the function name at the end of the
> line?
> 
> lib+offset fn [file+line]

That'll make the unfixed Mac case harder to read, because it puts the interesting part (<fn>) nearer to the end.

Also, "<lib>+<offset>" doesn't distinguish itself clearly from "arbitrary preceding stuff".
(In reply to Nicholas Nethercote [:njn] from comment #7)
> If we go with a form similar to the existing forms:
> 
> > "<fn|???> [<lib> +<loffset>]"
> 
> then the dividing line between any preceding stuff and <fn> (and possibly
> even <lib>) is unclear.

Perhaps just a very explicit prefix would suffice. Something like:

    frame  0: [Foo::Bar(int baz) lib=libxul+0xABCD1234 pc=0x12300000]
    frame  1: [nsBlah::Foo() /path/to/file/nsBlah.cpp+345 pc=0x12300002]
    ...
    frame 23: [??? lib=libxul+0x1234ABCD pc=0x12300004]

This provides for my proposed constraint 3:
  Format should be consistent across frames, function name always first

So now we can determine if a line has a stack frame that could use fixing by using basic regex like "frame\s+[:digit:]+:\s\[([^\]])*\]

I guess if we always have a pc= at the end then you can get away with not having the brackets:

    frame  0: Foo::Bar(int baz) lib=libxul+0xABCD1234 pc=0x12300000
    frame  1: nsBlah::Foo() /path/to/file/nsBlah.cpp+345 pc=0x12300002
    ...
    frame 23: ??? lib=libxul+0x1234ABCD pc=0x12300004

is_stack_frame_regex = "^.*frame\s.*pc=-0x[:hex:]+.*$"
Frame number prefixes are a good idea. We could even do it like GDB and use #0, #1, etc. That would allow the following:

> "#<n>: <fn|???> (<file>:<line>) <pc>"     #      if: <file>
> "#<n>: <fn|???> [<lib> +<loffset>] <pc>"  # else if: <lib>
> "#<n>: ??? (???:???) <pc>"                # else

Which has the following properties:

- Highly readable even in the non-fixed form.

- More similar to what we currently produce.

- Still pretty easy to parse with a regex. (What comes before cannot match /#\d+:/, which is sufficiently unlikely that I'm happy with it.)

- The 1st and 3rd cases have the same form, which is nice.
I'm going to do it like comment 11, except with the <pc>, because erahm and I decided it doesn't add much.
Attachment #8487687 - Flags: review?(dbaron)
Attachment #8483951 - Attachment is obsolete: true
Attachment #8483951 - Flags: review?(dbaron)
dbaron, I updated the GitHub patch so it's also ready for review. Thanks.
Here's some example output:

>    #00: replace_malloc[cd64dmd/dist/lib/libdmd.so +0x30ee]
>    #01: ???[/home/njn/moz/mi7/cd64dmd/dist/bin/libxul.so +0x2c632a6]
>    #02: ???[/home/njn/moz/mi7/cd64dmd/dist/bin/libxul.so +0x11e27df]
>    #03: ???[/home/njn/moz/mi7/cd64dmd/dist/bin/libxul.so +0x11e2c1f]
>    #04: ???[/home/njn/moz/mi7/cd64dmd/dist/bin/libxul.so +0x11e7701]
>    #05: ???[/home/njn/moz/mi7/cd64dmd/dist/bin/libxul.so +0x11e69b8]
>    #06: ???[/home/njn/moz/mi7/cd64dmd/dist/bin/libxul.so +0x11e6f50]
>    #07: ???[/home/njn/moz/mi7/cd64dmd/dist/bin/libxul.so +0x11e73b2]
>    #08: ???[/home/njn/moz/mi7/cd64dmd/dist/bin/libxul.so +0x1efb146]
>    #09: ???[/home/njn/moz/mi7/cd64dmd/dist/bin/libxul.so +0x1efadb3]
>    #10: ???[/home/njn/moz/mi7/cd64dmd/dist/bin/libxul.so +0x1f398d5]
>    #11: ???[/home/njn/moz/mi7/cd64dmd/dist/bin/libxul.so +0x16aa877]
>    #12: ???[/home/njn/moz/mi7/cd64dmd/dist/bin/libxul.so +0x1adfe67]

I'm ambivalent about the 0-padding of single digit numbers -- it keeps everything at the same indent level, which is nice, but the extra zeroes are ugly. Not sure what's best.
That looks good to me, I guess you could just use spaces. I'm reasonably ambivalent as well, so whatever is easiest is probably fine. FWIW lldb doesn't pad, not sure about gdb.

>     #0: replace_malloc[cd64dmd/dist/lib/libdmd.so +0x30ee]
>     #9: ???[/home/njn/moz/mi7/cd64dmd/dist/bin/libxul.so +0x1efadb3]
>    #10: ???[/home/njn/moz/mi7/cd64dmd/dist/bin/libxul.so +0x1f398d5]
>    #11: ???[/home/njn/moz/mi7/cd64dmd/dist/bin/libxul.so +0x16aa877]
>    #12: ???[/home/njn/moz/mi7/cd64dmd/dist/bin/libxul.so +0x1adfe67]

or

>    #0:  replace_malloc[cd64dmd/dist/lib/libdmd.so +0x30ee]
>    #9:  ???[/home/njn/moz/mi7/cd64dmd/dist/bin/libxul.so +0x1efadb3]
>    #10: ???[/home/njn/moz/mi7/cd64dmd/dist/bin/libxul.so +0x1f398d5]
>    #11: ???[/home/njn/moz/mi7/cd64dmd/dist/bin/libxul.so +0x16aa877]
>    #12: ???[/home/njn/moz/mi7/cd64dmd/dist/bin/libxul.so +0x1adfe67]
BTW, the frame numbers are quite useful in mochitest output, where stack frames are printed among lots of other output -- it makes it much easier to see where the stack traces start and end.
dbaron: 12 day review ping.
Updated patch that fixes some small problems found on the try server.
Attachment #8491106 - Flags: review?(dbaron)
Attachment #8487686 - Attachment is obsolete: true
Attachment #8487686 - Flags: review?(dbaron)
Updated patch that fixes some small problems found on the try server.
Attachment #8491108 - Flags: review?(dbaron)
Attachment #8487687 - Attachment is obsolete: true
Attachment #8487687 - Flags: review?(dbaron)
Blocks: 1044709
Comment on attachment 8491106 [details] [diff] [review]
(part 1) - Add a frame number argument to NS_WalkStackCallback

TableTicker.cpp:

>+  StackWalkCallback(/* frameNumber */ 0, aSample->pc, aSample->sp, &nativeStack);

It almost seems like this should be -1 given that the code following
will send 0, I think.  Although I suppose the argument is a uint32_t.
(Should it be?)

nsStackWalk.h:

>+// aFrameNumber is the frame number. aSP will be the best approximation
>+// possible of what the stack pointer will be pointing to when the execution
>+// returns to executing that at that PC. If no approximation can be made it
>+// will be nullptr.

Perhaps better to make the description of each param described be its
own paragraph (i.e., don't rewrap the old stuff)?


r=dbaron
Attachment #8491106 - Flags: review?(dbaron) → review+
So I'm a bit confused by some of the comments in patch 2?  Is this changing what the fix_*_stack scripts fix up (i.e., both poor function names and adding file:line info)?  And is there such a thing as a "FIX[...]" entry, or is that now "???[...]"?

Does the inclusion of frame numbers break tools that want to invert stacks?  In particular, make-tree.pl has the ability to build stack trees rooted either at main or at AddRef/Release, and both are useful.  (It's the default case, rooted at main, that would be at risk here, rather than --reverse, I think.)  I think make-tree.pl is still a pretty important tool for debugging reference count leaks.


Sorry for taking so long to get to this; it was one of the large items in the pile of reviews when I got back from vacation.
Flags: needinfo?(n.nethercote)
> TableTicker.cpp:
> 
> >+  StackWalkCallback(/* frameNumber */ 0, aSample->pc, aSample->sp, &nativeStack);
> 
> It almost seems like this should be -1 given that the code following
> will send 0, I think.  Although I suppose the argument is a uint32_t.
> (Should it be?)

That is definitely a weird one. The good news is that StackWalkCallback() ignores the frame number argument, so it doesn't really matter what value we pass in. I'll add a comment about this.
Flags: needinfo?(n.nethercote)
> So I'm a bit confused by some of the comments in patch 2?  Is this changing
> what the fix_*_stack scripts fix up (i.e., both poor function names and
> adding file:line info)?  And is there such a thing as a "FIX[...]" entry, or
> is that now "???[...]"?

In short, it:

- Makes the output of our C++ stack-walking code more consistent (e.g. no variation across platforms), and easier to parse.

- Changes the fix*.py scripts to handle the new output form.

Furthermore, there are three variants in the output form. If we have a filename, it'll be like this:

> "#01: foo (Foo.cpp:123)

Otherwise, if we have a library name it'll be like this:

> "#01: foo[lib +0x1234]

(In my original patch this variant used the "FIX" form, but no longer because it was ugly.)

Otherwise, it'll be like this:

> "#01: ??? (???:???)

which is basically the same as the first case.

The fix*.py scripts only have to deal with the second case.


> Does the inclusion of frame numbers break tools that want to invert stacks? 
> In particular, make-tree.pl has the ability to build stack trees rooted
> either at main or at AddRef/Release, and both are useful.  (It's the default
> case, rooted at main, that would be at risk here, rather than --reverse, I
> think.)  I think make-tree.pl is still a pretty important tool for debugging
> reference count leaks.

I don't know anything about make-tree.pl. I'll give it a run and see if it's affected.
Bleh, the frame number does break make-tree.pl. Here's sample output with the frame numbers present:

> n/a     .root
>   1           #09: ??? (???:???)
>   | 1           #08: _start (cd64r/dist/bin/firefox)
>   |   1           #07: __libc_start_main (/build/buildd/eglibc-2.19/csu/libc-start.c:321)
>   |     1           #06: main (/home/njn/moz/mi7/cd64r/browser/app/../../../browser/app/nsBrowserApp.cpp:612)
>   |       1           #05: InitXPCOMGlue(char const*, nsIFile**) (/home/njn/moz/mi7/cd64r/browser/app/../../../browser/app/nsBrowserApp.cpp:571)
>   |         1           #04: NS_NewNativeLocalFile (/home/njn/moz/mi8/cd64r/xpcom/glue/standalone/../../../../xpcom/glue/standalone/nsXPCOMGlue.cpp:575)
>   |           1           #03: NS_NewNativeLocalFile (/home/njn/moz/mi8/cd64r/xpcom/io/../../../xpcom/io/nsLocalFileUnix.cpp:2081)
>   |             1           #02: nsRefPtr (/home/njn/moz/mi8/cd64r/xpcom/io/../../dist/include/nsAutoPtr.h:887)
>   |               1           #01: nsRefPtr (/home/njn/moz/mi8/cd64r/xpcom/io/../../dist/include/nsAutoPtr.h:885)
>   |                 1           #00: nsLocalFile::AddRef() (/home/njn/moz/mi8/cd64r/xpcom/io/../../../xpcom/io/nsLocalFileUnix.cpp:244)
>   1           #10: ??? (???:???)
>   | 1           #09: _start (cd64r/dist/bin/firefox)
>   |   1           #08: __libc_start_main (/build/buildd/eglibc-2.19/csu/libc-start.c:321)
>   |     1           #07: main (/home/njn/moz/mi7/cd64r/browser/app/../../../browser/app/nsBrowserApp.cpp:643)
>   |       1           #06: do_main(int, char**, nsIFile*) (/home/njn/moz/mi7/cd64r/browser/app/../../../browser/app/nsBrowserApp.cpp:282)
>   |         1           #05: XRE_main (/home/njn/moz/mi7/cd64r/toolkit/xre/../../../toolkit/xre/nsAppRunner.cpp:4439)
>   |           1           #04: XREMain::XRE_main(int, char**, nsXREAppData const*) (/home/njn/moz/mi7/cd64r/toolkit/xre/../../../toolkit/xre/nsAppRunner.cpp:4183)
>   |             1           #03: ScopedAppData (/home/njn/moz/mi7/xpcom/glue/AppData.cpp:56)
>   |               1           #02: void mozilla::SetStrongPtr<nsIFile>(nsIFile*&, nsIFile*) (/home/njn/moz/mi7/cd64r/xpcom/build/../../dist/include/mozilla/AppData.h:59)
>   |                 1           #01: void ns_if_addref<nsIFile*>(nsIFile*) (/home/njn/moz/mi7/cd64r/toolkit/xre/../../dist/include/nsISupportsUtils.h:46)
>   |                   1           #00: nsLocalFile::AddRef() (/home/njn/moz/mi8/cd64r/xpcom/io/../../../xpcom/io/nsLocalFileUnix.cpp:244)
>   1           #11: ??? (???:???)
>   | 1           #10: _start (cd64r/dist/bin/firefox)
>   |   1           #09: __libc_start_main (/build/buildd/eglibc-2.19/csu/libc-start.c:321)
>   |     1           #08: main (/home/njn/moz/mi7/cd64r/browser/app/../../../browser/app/nsBrowserApp.cpp:643)
>   |       1           #07: do_main(int, char**, nsIFile*) (/home/njn/moz/mi7/cd64r/browser/app/../../../browser/app/nsBrowserApp.cpp:282)
>   |         1           #06: XRE_main (/home/njn/moz/mi7/cd64r/toolkit/xre/../../../toolkit/xre/nsAppRunner.cpp:4439)
>   |           1           #05: XREMain::XRE_main(int, char**, nsXREAppData const*) (/home/njn/moz/mi7/cd64r/toolkit/xre/../../../toolkit/xre/nsAppRunner.cpp:4205)
>   |             1           #04: XREMain::XRE_mainInit(bool*) (/home/njn/moz/mi7/cd64r/toolkit/xre/../../../toolkit/xre/nsAppRunner.cpp:3137)
>   |               1           #03: nsXREDirProvider::Initialize(nsIFile*, nsIFile*, nsIDirectoryServiceProvider*) (/home/njn/moz/mi7/cd64r/toolkit/xre/../../../toolkit/xre/nsXREDirProvider.cpp:107)
>   |                 1           #02: nsCOMPtr<nsIFile>::operator=(nsIFile*) (/home/njn/moz/mi7/cd64r/toolkit/xre/../../dist/include/nsCOMPtr.h:678)
>   |                   1           #01: nsCOMPtr<nsIFile>::assign_with_AddRef(nsISupports*) (/home/njn/moz/mi7/cd64r/toolkit/xre/../../dist/include/nsCOMPtr.h:1159)
>   |                     1           #00: nsLocalFile::AddRef() (/home/njn/moz/mi8/cd64r/xpcom/io/../../../xpcom/io/nsLocalFileUnix.cpp:244)
>   -1          #13: ??? (???:???)
>   | -1          #12: _start (cd64r/dist/bin/firefox)
>   |   -1          #11: __libc_start_main (/build/buildd/eglibc-2.19/csu/libc-start.c:321)
>   |     -1          #10: main (/home/njn/moz/mi7/cd64r/browser/app/../../../browser/app/nsBrowserApp.cpp:643)
>   |       -1          #09: do_main(int, char**, nsIFile*) (/home/njn/moz/mi7/cd64r/browser/app/../../../browser/app/nsBrowserApp.cpp:282)
>   |         -1          #08: XRE_main (/home/njn/moz/mi7/cd64r/toolkit/xre/../../../toolkit/xre/nsAppRunner.cpp:4439)
>   |           -1          #07: XREMain::XRE_main(int, char**, nsXREAppData const*) (/home/njn/moz/mi7/cd64r/toolkit/xre/../../../toolkit/xre/nsAppRunner.cpp:4205)
>   |             -1          #06: XREMain::XRE_mainInit(bool*) (/home/njn/moz/mi7/cd64r/toolkit/xre/../../../toolkit/xre/nsAppRunner.cpp:3137)
>   |               -1          #05: nsXREDirProvider::Initialize(nsIFile*, nsIFile*, nsIDirectoryServiceProvider*) (/home/njn/moz/mi7/cd64r/toolkit/xre/../../../toolkit/xre/nsXREDirProvider.cpp:107)
>   |                 -1          #04: nsCOMPtr<nsIFile>::operator=(nsIFile*) (/home/njn/moz/mi7/cd64r/toolkit/xre/../../dist/include/nsCOMPtr.h:679)
>   |                   -1          #03: nsCOMPtr<nsIFile>::Assert_NoQueryNeeded() (/home/njn/moz/mi7/cd64r/toolkit/xre/../../dist/include/nsCOMPtr.h:534)
>   |                     -1          #02: ~nsCOMPtr (/home/njn/moz/mi7/cd64r/toolkit/xre/../../dist/include/nsCOMPtr.h:524)
>   |                       -1          #01: ~nsCOMPtr (/home/njn/moz/mi7/cd64r/toolkit/xre/../../dist/include/nsCOMPtr.h:522)
>   |                         -1          #00: nsLocalFile::Release() (/home/njn/moz/mi8/cd64r/xpcom/io/../../../xpcom/io/nsLocalFileUnix.cpp:244)
>   1           #17: ??? (???:???)
>     1           #16: _start (cd64r/dist/bin/firefox)
>       1           #15: __libc_start_main (/build/buildd/eglibc-2.19/csu/libc-start.c:321)
>         1           #14: main (/home/njn/moz/mi7/cd64r/browser/app/../../../browser/app/nsBrowserApp.cpp:643)
>           1           #13: do_main(int, char**, nsIFile*) (/home/njn/moz/mi7/cd64r/browser/app/../../../browser/app/nsBrowserApp.cpp:282)
>             1           #12: XRE_main (/home/njn/moz/mi7/cd64r/toolkit/xre/../../../toolkit/xre/nsAppRunner.cpp:4439)
>               1           #11: XREMain::XRE_main(int, char**, nsXREAppData const*) (/home/njn/moz/mi7/cd64r/toolkit/xre/../../../toolkit/xre/nsAppRunner.cpp:4205)
>                 1           #10: XREMain::XRE_mainInit(bool*) (/home/njn/moz/mi7/cd64r/toolkit/xre/../../../toolkit/xre/nsAppRunner.cpp:3137)
>                   1           #09: nsXREDirProvider::Initialize(nsIFile*, nsIFile*, nsIDirectoryServiceProvider*) (/home/njn/moz/mi7/cd64r/toolkit/xre/../../../toolkit/xre/nsXREDirProvider.cpp:107)
>                     1           #08: nsCOMPtr<nsIFile>::operator=(nsIFile*) (/home/njn/moz/mi7/cd64r/toolkit/xre/../../dist/include/nsCOMPtr.h:679)
>                       1           #07: nsCOMPtr<nsIFile>::Assert_NoQueryNeeded() (/home/njn/moz/mi7/cd64r/toolkit/xre/../../dist/include/nsCOMPtr.h:531)
>                         1           #06: nsCOMPtr (/home/njn/moz/mi7/cd64r/toolkit/xre/../../dist/include/nsCOMPtr.h:614)
>                           1           #05: nsCOMPtr (/home/njn/moz/mi7/cd64r/toolkit/xre/../../dist/include/nsCOMPtr.h:614)
>                             1           #04: nsCOMPtr<nsIFile>::assign_from_qi(nsQueryInterface, nsID const&) (/home/njn/moz/mi7/cd64r/toolkit/xre/../../dist/include/nsCOMPtr.h:1169)
>                               1           #03: nsQueryInterface::operator()(nsID const&, void**) const (/home/njn/moz/mi7/xpcom/glue/nsCOMPtr.cpp:14)
>                                 1           #02: nsLocalFile::QueryInterface(nsID const&, void**) (/home/njn/moz/mi8/cd64r/xpcom/io/../../../xpcom/io/nsLocalFileUnix.cpp:244)
>                                   1           #01: NS_TableDrivenQI(void*, nsID const&, void**, QITableEntry const*) (/home/njn/moz/mi7/xpcom/glue/nsISupportsImpl.cpp:18)
>                                     1           #00: nsLocalFile::AddRef() (/home/njn/moz/mi8/cd64r/xpcom/io/../../../xpcom/io/nsLocalFileUnix.cpp:244)

And the equivalent output with all the frame numbers converted to "#00":

> n/a     .root
>   3           #00: ??? (???:???)
>     3           #00: _start (cd64r/dist/bin/firefox)
>       3           #00: __libc_start_main (/build/buildd/eglibc-2.19/csu/libc-start.c:321)
>         1           #00: main (/home/njn/moz/mi7/cd64r/browser/app/../../../browser/app/nsBrowserApp.cpp:612)
>         | 1           #00: InitXPCOMGlue(char const*, nsIFile**) (/home/njn/moz/mi7/cd64r/browser/app/../../../browser/app/nsBrowserApp.cpp:571)
>         |   1           #00: NS_NewNativeLocalFile (/home/njn/moz/mi8/cd64r/xpcom/glue/standalone/../../../../xpcom/glue/standalone/nsXPCOMGlue.cpp:575)
>         |     1           #00: NS_NewNativeLocalFile (/home/njn/moz/mi8/cd64r/xpcom/io/../../../xpcom/io/nsLocalFileUnix.cpp:2081)
>         |       1           #00: nsRefPtr (/home/njn/moz/mi8/cd64r/xpcom/io/../../dist/include/nsAutoPtr.h:887)
>         |         1           #00: nsRefPtr (/home/njn/moz/mi8/cd64r/xpcom/io/../../dist/include/nsAutoPtr.h:885)
>         |           1           #00: nsLocalFile::AddRef() (/home/njn/moz/mi8/cd64r/xpcom/io/../../../xpcom/io/nsLocalFileUnix.cpp:244)
>         2           #00: main (/home/njn/moz/mi7/cd64r/browser/app/../../../browser/app/nsBrowserApp.cpp:643)
>           2           #00: do_main(int, char**, nsIFile*) (/home/njn/moz/mi7/cd64r/browser/app/../../../browser/app/nsBrowserApp.cpp:282)
>             2           #00: XRE_main (/home/njn/moz/mi7/cd64r/toolkit/xre/../../../toolkit/xre/nsAppRunner.cpp:4439)
>               1           #00: XREMain::XRE_main(int, char**, nsXREAppData const*) (/home/njn/moz/mi7/cd64r/toolkit/xre/../../../toolkit/xre/nsAppRunner.cpp:4183)
>               | 1           #00: ScopedAppData (/home/njn/moz/mi7/xpcom/glue/AppData.cpp:56)
>               |   1           #00: void mozilla::SetStrongPtr<nsIFile>(nsIFile*&, nsIFile*) (/home/njn/moz/mi7/cd64r/xpcom/build/../../dist/include/mozilla/AppData.h:59)
>               |     1           #00: void ns_if_addref<nsIFile*>(nsIFile*) (/home/njn/moz/mi7/cd64r/toolkit/xre/../../dist/include/nsISupportsUtils.h:46)
>               |       1           #00: nsLocalFile::AddRef() (/home/njn/moz/mi8/cd64r/xpcom/io/../../../xpcom/io/nsLocalFileUnix.cpp:244)
>               1           #00: XREMain::XRE_main(int, char**, nsXREAppData const*) (/home/njn/moz/mi7/cd64r/toolkit/xre/../../../toolkit/xre/nsAppRunner.cpp:4205)
>                 1           #00: XREMain::XRE_mainInit(bool*) (/home/njn/moz/mi7/cd64r/toolkit/xre/../../../toolkit/xre/nsAppRunner.cpp:3137)
>                   1           #00: nsXREDirProvider::Initialize(nsIFile*, nsIFile*, nsIDirectoryServiceProvider*) (/home/njn/moz/mi7/cd64r/toolkit/xre/../../../toolkit/xre/nsXREDirProvider.cpp:107)
>                     1           #00: nsCOMPtr<nsIFile>::operator=(nsIFile*) (/home/njn/moz/mi7/cd64r/toolkit/xre/../../dist/include/nsCOMPtr.h:678)
>                     | 1           #00: nsCOMPtr<nsIFile>::assign_with_AddRef(nsISupports*) (/home/njn/moz/mi7/cd64r/toolkit/xre/../../dist/include/nsCOMPtr.h:1159)
>                     |   1           #00: nsLocalFile::AddRef() (/home/njn/moz/mi8/cd64r/xpcom/io/../../../xpcom/io/nsLocalFileUnix.cpp:244)
>                     0           #00: nsCOMPtr<nsIFile>::operator=(nsIFile*) (/home/njn/moz/mi7/cd64r/toolkit/xre/../../dist/include/nsCOMPtr.h:679)
>                       1           #00: nsCOMPtr<nsIFile>::Assert_NoQueryNeeded() (/home/njn/moz/mi7/cd64r/toolkit/xre/../../dist/include/nsCOMPtr.h:531)
>                       | 1           #00: nsCOMPtr (/home/njn/moz/mi7/cd64r/toolkit/xre/../../dist/include/nsCOMPtr.h:614)
>                       |   1           #00: nsCOMPtr (/home/njn/moz/mi7/cd64r/toolkit/xre/../../dist/include/nsCOMPtr.h:614)
>                       |     1           #00: nsCOMPtr<nsIFile>::assign_from_qi(nsQueryInterface, nsID const&) (/home/njn/moz/mi7/cd64r/toolkit/xre/../../dist/include/nsCOMPtr.h:1169)
>                       |       1           #00: nsQueryInterface::operator()(nsID const&, void**) const (/home/njn/moz/mi7/xpcom/glue/nsCOMPtr.cpp:14)
>                       |         1           #00: nsLocalFile::QueryInterface(nsID const&, void**) (/home/njn/moz/mi8/cd64r/xpcom/io/../../../xpcom/io/nsLocalFileUnix.cpp:244)
>                       |           1           #00: NS_TableDrivenQI(void*, nsID const&, void**, QITableEntry const*) (/home/njn/moz/mi7/xpcom/glue/nsISupportsImpl.cpp:18)
>                       |             1           #00: nsLocalFile::AddRef() (/home/njn/moz/mi8/cd64r/xpcom/io/../../../xpcom/io/nsLocalFileUnix.cpp:244)
>                       -1          #00: nsCOMPtr<nsIFile>::Assert_NoQueryNeeded() (/home/njn/moz/mi7/cd64r/toolkit/xre/../../dist/include/nsCOMPtr.h:534)
>                         -1          #00: ~nsCOMPtr (/home/njn/moz/mi7/cd64r/toolkit/xre/../../dist/include/nsCOMPtr.h:524)
>                           -1          #00: ~nsCOMPtr (/home/njn/moz/mi7/cd64r/toolkit/xre/../../dist/include/nsCOMPtr.h:522)
>                             -1          #00: nsLocalFile::Release() (/home/njn/moz/mi8/cd64r/xpcom/io/../../../xpcom/io/nsLocalFileUnix.cpp:244)

Bah. I guess I'll have to remove the frame numbers and find another way to proceed.
(In reply to Nicholas Nethercote [:njn] from comment #26)
> Bah. I guess I'll have to remove the frame numbers and find another way to
> proceed.

An alternative would be to fix make-tree.pl to remove the frame number
(In reply to David Baron [:dbaron] (UTC-7) (needinfo? for questions) from comment #27)
> An alternative would be to fix make-tree.pl to remove the frame number

Although if you do that, you still need to make the fix_*_stack scripts handle fixup of stack frames that have the frame number removed.
Comment on attachment 8491108 [details] [diff] [review]
(part 2) - Clean up stack printing and fixing

Other comments on the patch in addition to the 2 previous comments:

security/sandbox/win/src/warnonlysandbox/wosCallbacks.h:

need to add a "\n"

tools/rb/fix_linux_stack.py:

I think the comment at the top was useful; why not just replace the first
sentence with your replacement, and leave the rest (but replacing UNKNOWN with ???)?


You either need to patch fix-linux-stack.pl or get automation converted
to using fix_linux_stack.py before landing this.  Or did that happen while I wasn't looking?


tools/rb/fix_macosx_stack.py:

Your comment refers to the "FIX[...]" entries that are no longer present.

Same comment as fix_linux_stack.py about the comment as well, although
I'm a bit less confident that it's correct for mac.

tools/rb/fix_stack_using_bpsyms.py:

again, out of date "FIX[...]" comment


review- because I'd like to see what you decide about the make-tree issue
Attachment #8491108 - Flags: review?(dbaron) → review-
> You either need to patch fix-linux-stack.pl or get automation converted
> to using fix_linux_stack.py before landing this.  Or did that happen while I
> wasn't looking?

Already done! :) Bug 1050601.

> review- because I'd like to see what you decide about the make-tree issue

I just talked to erahm about this. I will try to keep the frame numbers in and modify make-tree.pl. One possibility is to just modify the numbers to all be "#00" or something; clunky, but it would work.

Another possibility is to require that fixing occur before running make-tree.pl. I know that the make-tree.pl output is a lot smaller, but that's less of an issue now that the fix*.py scripts have caches in them that make them much faster on files with repeated stack frames. Anyway, not sure if that'll be necessary.

Hopefully there aren't any other tools in the tree that silently rely on stack frames having a particular form...
(In reply to Nicholas Nethercote [:njn] from comment #30)
> I just talked to erahm about this. I will try to keep the frame numbers in
> and modify make-tree.pl. One possibility is to just modify the numbers to
> all be "#00" or something; clunky, but it would work.

Yeah, that sounds fine (done in make-tree.pl, I presume, and with output that would then be parseable by fix_*_stack).
This fixes make-tree.pl to replace stack frame numbers with 0.

I wonder if I should make the stack traces start at one instead of zero. It
would allow tools that need to strip out the frame number (which includes
make-tree.pl and DMD) to use #00 to indicate "no frame number". We could also
use 0 for that weird case in TableTicker.cpp. Plus, counting from 1 is just
nicer for this sort of thing, IMO. dbaron, what do you think?
Attachment #8494910 - Flags: review?(dbaron)
Attachment #8491106 - Attachment is obsolete: true
Updated version with comments addressed. The previous patch fixed the
make-tree.pl issue, so this version is only slightly different to the previous
one.

I tweaked the comments at the top of the fix*.py scripts a little. But the old
comment in fix_linux_stack.py was overly specific and thus misleading. Because
the output really comes from NS_FormatCodeAddress(). That function is called by
NS_StackWalk(), which is called by nsTraceRefcnt::WalkTheStack() but also
several other functions.
Attachment #8494913 - Flags: review?(dbaron)
Attachment #8491108 - Attachment is obsolete: true
Comment on attachment 8494910 [details] [diff] [review]
(part 1) - Add a frame number argument to NS_WalkStackCallback

>diff --git a/tools/rb/make-tree.pl b/tools/rb/make-tree.pl

>+              $_ =~ s/#\d\d: /#00: /;        # replace frame number with 0


This really belongs in patch #2.

You should allow more than 2 digits, since frame numbers could go over 100.  Probably just "\d+".

r=dbaron with that.

(There really wasn't any need to re-request review on patch 1, particularly if the make-tree.pl change were in patch 2.)
Attachment #8494910 - Flags: review?(dbaron) → review+
(In reply to Nicholas Nethercote [:njn] from comment #32)
> I wonder if I should make the stack traces start at one instead of zero. It
> would allow tools that need to strip out the frame number (which includes
> make-tree.pl and DMD) to use #00 to indicate "no frame number". We could also
> use 0 for that weird case in TableTicker.cpp. Plus, counting from 1 is just
> nicer for this sort of thing, IMO. dbaron, what do you think?

I don't have a strong feeling, though this sounds kinda nice.
(In reply to Nicholas Nethercote [:njn] from comment #33)
> I tweaked the comments at the top of the fix*.py scripts a little. But the
> old
> comment in fix_linux_stack.py was overly specific and thus misleading.
> Because
> the output really comes from NS_FormatCodeAddress(). That function is called
> by
> NS_StackWalk(), which is called by nsTraceRefcnt::WalkTheStack() but also
> several other functions.

FWIW, the code actually lived in nsTraceRefcnt::WalkTheStack prior to bug 189221 (after which it was moved again in bug 374689).
Comment on attachment 8494913 [details] [diff] [review]
(part 2) - Clean up stack printing and fixing

r=dbaron, though perhaps the comment for fix_stack_using_bpsyms.py should mention that the symbol is often bad on Linux
Attachment #8494913 - Flags: review?(dbaron) → review+
Comment on attachment 8484769 [details] [review]
Link to GitHub PR: https://github.com/mozilla-b2g/B2G/pull/384

r=dbaron, although the xpcom/base/nsStackWalk.h comment in patch 2 should probably also mention this file.

(One thing I'm having second thoughts about is whether the fix_*_stack.py scripts should remove the #0, #1; I was initially thinking they did, but now I notice they don't.  I guess it's probably ok to keep it, though.)
Attachment #8484769 - Flags: review?(dbaron) → review+
Comment on attachment 8484769 [details] [review]
Link to GitHub PR: https://github.com/mozilla-b2g/B2G/pull/384

I updated fix_b2g_stack.py to handle both the old and the new formats, per erahm's request.
Attachment #8484769 - Flags: review?(erahm)
> I updated fix_b2g_stack.py to handle both the old and the new formats, per
> erahm's request.

And I've tested locally that it works with both formats.
Attachment #8484769 - Flags: review?(erahm) → review+
https://hg.mozilla.org/mozilla-central/rev/a0b82c954206
https://hg.mozilla.org/mozilla-central/rev/7a1b7d7eba12
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla35
> ValueError: invalid literal for int() with base 16: '0x--DOMWINDOW == 36 (10B40A00) [pid = 3756] [serial = 4583] [outer = 0A028E00] [url = about:blank'

For that to happen, there must be some logging lines being spliced together, something like:

> #91: nsTimerEvent::Run() [libxul.so 0x--DOMWINDOW == 36 (10B40A00) [pid = 3756] [serial = 4583] [outer = 0A028E00] [url = about:blank'

That's really weird, but I can tighten up the regexp to avoid franken-frames like this from matching.
This is kind of scope creep, but you could also try doing what is done for --DOMWINDOW:
  http://mxr.mozilla.org/mozilla-central/source/dom/base/nsGlobalWindow.cpp#1291

I think the printf_stderr somehow prints without getting spliced up like that?
  https://bugzilla.mozilla.org/show_bug.cgi?id=934091#c14

Though I see that was based on something you originally suggested. ;)
Hmm, PrintStackFrame() in toolkit/xre/nsSigHandlers.cpp prints its stacks to stdout. I bet that's the problem.
The diagnosis in comment 46 and comment 48 are correct -- some output lines are being spliced together. I did 72 runs of Mochi-5 with the patches applied and got 10 failures. That's a failure rate of ~14%.

The problem is that I was lazy and just used `0x.+` in the regexp matching library addresses. I tried changing it to `0x[0-9A-Fa-f]+` and this fixed the problem because the spliced lines no longer match. I did 45 runs of Mochi-5 with this change and they all passed. Here's an example of the splicing:

> 12:38:58     INFO -  #90: nsTimerImpl::Fire() [xpcom/threads/nsTimerImpl.cpp:618]
> 12:38:58     INFO -  #91: nsTimerEvent::Run() [xpcom/threads/nsTimerImpl.cpp:716]
> 12:38:59     INFO -  #92: XRE_AddJarManifestLocation[xul +0x--DOCSHELL 0FDB2800 == 15 [pid = 1948] [id = 548]
> 12:38:59     INFO -  --DOCSHELL 0FBAD000 == 14 [pid = 1948] [id = 552]
> 12:38:59     INFO -  --DOCSHELL 0F2DD400 == 13 [pid = 1948] [id = 551]
> 12:38:59     INFO -  --DOCSHELL 0E603000 == 12 [pid = 1948] [id = 550]
> 12:38:59     INFO -  --DOCSHELL 0C034800 == 11 [pid = 1948] [id = 562]
> 12:38:59     INFO -  --DOCSHELL 0FD48800 == 10 [pid = 1948] [id = 565]
> 12:38:59     INFO -  --DOCSHELL 10641000 == 9 [pid = 1948] [id = 564]
> 12:38:59     INFO -  --DOCSHELL 0F46A800 == 8 [pid = 1948] [id = 563]
> 12:38:59     INFO -  --DOCSHELL 106CEC00 == 7 [pid = 1948] [id = 560]
> 12:38:59     INFO -  --DOCSHELL 0C037800 == 6 [pid = 1948] [id = 559]
> 12:38:59     INFO -  96307]
> 12:38:59     INFO -  #93: NS_ProcessNextEvent(nsIThread *,bool) [xpcom/glue/nsThreadUtils.cpp:265]
> 12:38:59     INFO -  #94: mozilla::ipc::MessagePump::Run(base::MessagePump::Delegate *) [ipc/glue/MessagePump.cpp:99]

So that's good, but it would be nice to fix the root cause, which is the splicing. So I also tried adding flush() calls to all the places that print stack traces. I did another 45 runs of Mochi-5 with this in place (but without the regexp fix) and all of them passed.

So both changes in isolation avoid the problem. But for thoroughness, I will land both of them.
(In reply to Nicholas Nethercote [:njn] from comment #49)
> So that's good, but it would be nice to fix the root cause, which is the
> splicing. So I also tried adding flush() calls to all the places that print
> stack traces. I did another 45 runs of Mochi-5 with this in place (but
> without the regexp fix) and all of them passed.

Thank you for fixing the root cause too! :-)
Attachment #8500188 - Attachment description: Link to: https://github.com/mozilla-b2g/B2G/pull/393 → Link to GitHub PR: https://github.com/mozilla-b2g/B2G/pull/393
Attachment #8500188 - Flags: review?(erahm)
https://hg.mozilla.org/mozilla-central/rev/4ca3b16e24bd
https://hg.mozilla.org/mozilla-central/rev/0b61bbe1dda7
Status: REOPENED → RESOLVED
Closed: 5 years ago5 years ago
Resolution: --- → FIXED
FYI the root cause of the log-mingling is known, bug 798300. It's just hard to fix because we have to work around Windows' incredibly stupid buffering behavior.
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #54)
> FYI the root cause of the log-mingling is known, bug 798300. It's just hard
> to fix because we have to work around Windows' incredibly stupid buffering
> behavior.

It's easy to fix by adding explicit flushes in the cases that might write to stdout. I did exactly that in this patch and it fixed at least some of the mingling.
You need to log in before you can comment on or make changes to this bug.