Closed Bug 811335 Opened 7 years ago Closed 7 years ago

Scan more bytes to find crash stacks on B2G

Categories

(Toolkit :: Crash Reporting, defect)

defect
Not set

Tracking

()

RESOLVED FIXED
mozilla20
blocking-basecamp +
Tracking Status
firefox18 --- fixed
firefox19 --- fixed
firefox20 --- fixed

People

(Reporter: kairo, Assigned: ted)

References

Details

Not sure if that's actually correct here or on the Socorro side.

We're having a large amount of B2G from-device reports where we only end up with an address, e.g. bp-2061f88f-4045-441d-9574-ca86d2121112 or bp-0a77c52c-1d5a-4075-ba93-1f6502121112, and Ted says we probably can improve on that by scanning further to find the stack frames. We should do that.
Ted, how soon can we make this happen?
Jim Blandy should be able to help, failing that Julian is good fallback.
blocking-basecamp: --- → +
Note that the fix for this would be on the server side of things, it's not something that has to ship in-product.
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #0)
> We're having a large amount of B2G from-device reports where we only end up
> with an address, e.g. bp-2061f88f-4045-441d-9574-ca86d2121112 or
> bp-0a77c52c-1d5a-4075-ba93-1f6502121112, and Ted says we probably can
> improve on that by scanning further to find the stack frames.

It's not clear to me whether the presence of a single frame here is a
result of not scanning _enough_ stack, or whether there's some other
kind of unwind failure -- eg there is no CFI for the given address and
we're not using stack-scanning as a fallback.

At least for the recent breakpad versions that Ted and I have been
hacking around, the ARM unwinder does fall back to stack scanning when
it has no other option.  In that case, assuming (as is likely) that
there is more than one frame in the top 32K of stack, it should have
produced more than one frame.

But I don't know whether the particular snapshot of breakpad you are
can do stack unwind by stack scanning on ARM.  Ted, any idea?
> But I don't know whether the particular snapshot of breakpad you are
> can do stack unwind by stack scanning on ARM.  Ted, any idea?

you are += using

I should add: if it doesn't, it would be easy enough to add it in.
I have a patch lying around somewhere which might do the job.
crash-stats pulls the tip of Breakpad SVN to build its minidump_stackwalk tool with every release, so it's very much up-to-date. I looked at an instance of the crash from comment 0 previously but I don't know where that analysis got off to, so I just repeated it. Using the crash bp-2061f88f-4045-441d-9574-ca86d2121112, minidump_stackwalk gives a single frame:
Thread 0 (crashed)
 0  0x40a804e2
     r4 = 0x41a47e30    r5 = 0xbe9ff930    r6 = 0xbe9ff91c    r7 = 0xbe9ff9c0
     r8 = 0xbe9ffe90    r9 = 0x41a28000   r10 = 0x00000000    fp = 0x00000000
     sp = 0xbe9ff908    lr = 0x40ad90e5    pc = 0x40a804e2
    Found by: given as instruction pointer in context

If I run my dump-lookup tool on it (http://hg.mozilla.org/users/tmielczarek_mozilla.com/dump-lookup/, simply prints out all plausible return addresses on the stack), we find that there are some reasonable-looking things further up the stack (I don't have the symbols for this build locally, but when I did this before these addresses were in reasonable-looking functions):
0xbe9ff9b4: libnspr4.so + 0xce8b
0xbe9ffa24: libnspr4.so + 0xce7d

There are more functions on the stack, from which you could get a plausible stack trace. You'll note that the sp at the time of the crash is 0xbe9ff908, but the first plausible return address there (from the dump-lookup output) is 0xbe9ff9b4, which is a full 172 bytes up the stack. Since we only scan 30 words (120 bytes), we won't find that.

I'm sure this isn't some B2G-specific thing, it just happens that we have low crash volume so this particular crash shows up heavily, and because of this failure mode it's very hard to get useful data out of it.

My proposed change would be to alter unwinding slightly so that if we're attempting to unwind from a frame where we have no CFI (like here, where it's some memory address out in space), we let the stack scanning go quite a bit farther to find a return address.
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #6)
> I'm sure this isn't some B2G-specific thing, it just happens that we have
> low crash volume so this particular crash shows up heavily, and because of
> this failure mode it's very hard to get useful data out of it.

Well, right now, almost all B2G crashes end up in a way like this, though some have a few stack frames, like bp-89e64411-0e20-4d86-b984-7277f2121129 or bp-72168754-807a-42c3-8316-fb3cc2121129 - see https://crash-analysis.mozilla.com/rkaiser/2012-11-29/2012-11-29.b2g.crashes.html for a list of B2G crashes for one day (the "Crash" column has links to the actual reports).

bp-83634501-600e-44ab-bd5a-f9c362121129 at least seems to have a reasonable top of the stack until it wanders off to a bare address, reports like bp-8604473e-df47-4189-a38f-44e432121129 are the only somewhat reasonable stacks I've seen at all so far from our B2G reports.

We need to find some way very soon to get reasonable stacks out of this so we can do any analysis on B2G crashes. If the larger scan range proposed here is that solution, then I'd be happy with that for sure. :)
Here is the place in the breakpad trunk sources specifying the magic
30 word limit that Ted mentioned.  It's obviously trivial to change,
but I don't know where in our system we would put the change, in order
for it to change the behaviour of crash-stats.

Index: src/google_breakpad/processor/stackwalker.h
===================================================================
--- src/google_breakpad/processor/stackwalker.h	(revision 1085)
+++ src/google_breakpad/processor/stackwalker.h	(working copy)
@@ -110,7 +110,7 @@
   bool ScanForReturnAddress(InstructionType location_start,
                             InstructionType* location_found,
                             InstructionType* ip_found) {
-    const int kRASearchWords = 30;
+    const int kRASearchWords = 100;
     return ScanForReturnAddress(location_start, location_found, ip_found,
                                 kRASearchWords);
   }
Okay, I found the time to write the patch I was thinking about, I put it up for upstream review from jimb:
http://breakpad.appspot.com/501002

This is almost equivalent to Julian's patch, but it applies it only in very limited circumstances. That should be enough to fix this bug.
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #9)
> Okay, I found the time to write the patch I was thinking about, I put it up
> for upstream review from jimb:
> http://breakpad.appspot.com/501002

What's the process/ETA for getting this rolled out to Socorro?
This needs to get reviewed and landed upstream, and then will automatically go out with the next Socorro release, or we can probably push it out faster if necessary.
jimb: can you review that Breakpad patch?
LGTM on the breakpad patch.
Landed upstream:
http://code.google.com/p/google-breakpad/source/detail?r=1086

If you want this to go out faster than the next Socorro release you'll need to poke the Socorro team about that.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #14)
> If you want this to go out faster than the next Socorro release you'll need
> to poke the Socorro team about that.

Usually, there's a release every Wednesday. Laura, is there one this week? If so, letting it ride with that is surely enough. ;-)
Target Milestone: --- → mozilla20
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #14)
> Landed upstream:
> http://code.google.com/p/google-breakpad/source/detail?r=1086

Socorro did another push yesterday, and https://crash-stats.mozilla.com/status says that the Breakpad revision it's running now is 1086, so I guess this has been deployed. :)

Thanks, let's hope that today's B2G data will show the improvements from this!
It's a bit hard to try and verify this fix, as we get so many corrupt stacks (see the discussion in bug 817946) on B2G, but it looks like we're getting way fewer reports now that only have a bare address listed as a signature.
You need to log in before you can comment on or make changes to this bug.