Closed
Bug 811335
Opened 12 years ago
Closed 12 years ago
Scan more bytes to find crash stacks on B2G
Categories
(Toolkit :: Crash Reporting, defect)
Toolkit
Crash Reporting
Tracking
()
People
(Reporter: kairo, Assigned: ted)
References
Details
Not sure if that's actually correct here or on the Socorro side.
We're having a large amount of B2G from-device reports where we only end up with an address, e.g. bp-2061f88f-4045-441d-9574-ca86d2121112 or bp-0a77c52c-1d5a-4075-ba93-1f6502121112, and Ted says we probably can improve on that by scanning further to find the stack frames. We should do that.
Reporter | ||
Comment 1•12 years ago
|
||
Ted, how soon can we make this happen?
Comment 2•12 years ago
|
||
Jim Blandy should be able to help, failing that Julian is good fallback.
Updated•12 years ago
|
blocking-basecamp: --- → +
Assignee | ||
Comment 3•12 years ago
|
||
Note that the fix for this would be on the server side of things, it's not something that has to ship in-product.
Comment 4•12 years ago
|
||
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #0)
> We're having a large amount of B2G from-device reports where we only end up
> with an address, e.g. bp-2061f88f-4045-441d-9574-ca86d2121112 or
> bp-0a77c52c-1d5a-4075-ba93-1f6502121112, and Ted says we probably can
> improve on that by scanning further to find the stack frames.
It's not clear to me whether the presence of a single frame here is a
result of not scanning _enough_ stack, or whether there's some other
kind of unwind failure -- eg there is no CFI for the given address and
we're not using stack-scanning as a fallback.
At least for the recent breakpad versions that Ted and I have been
hacking around, the ARM unwinder does fall back to stack scanning when
it has no other option. In that case, assuming (as is likely) that
there is more than one frame in the top 32K of stack, it should have
produced more than one frame.
But I don't know whether the particular snapshot of breakpad you are
can do stack unwind by stack scanning on ARM. Ted, any idea?
Comment 5•12 years ago
|
||
> But I don't know whether the particular snapshot of breakpad you are
> can do stack unwind by stack scanning on ARM. Ted, any idea?
you are += using
I should add: if it doesn't, it would be easy enough to add it in.
I have a patch lying around somewhere which might do the job.
Assignee | ||
Comment 6•12 years ago
|
||
crash-stats pulls the tip of Breakpad SVN to build its minidump_stackwalk tool with every release, so it's very much up-to-date. I looked at an instance of the crash from comment 0 previously but I don't know where that analysis got off to, so I just repeated it. Using the crash bp-2061f88f-4045-441d-9574-ca86d2121112, minidump_stackwalk gives a single frame:
Thread 0 (crashed)
0 0x40a804e2
r4 = 0x41a47e30 r5 = 0xbe9ff930 r6 = 0xbe9ff91c r7 = 0xbe9ff9c0
r8 = 0xbe9ffe90 r9 = 0x41a28000 r10 = 0x00000000 fp = 0x00000000
sp = 0xbe9ff908 lr = 0x40ad90e5 pc = 0x40a804e2
Found by: given as instruction pointer in context
If I run my dump-lookup tool on it (http://hg.mozilla.org/users/tmielczarek_mozilla.com/dump-lookup/, simply prints out all plausible return addresses on the stack), we find that there are some reasonable-looking things further up the stack (I don't have the symbols for this build locally, but when I did this before these addresses were in reasonable-looking functions):
0xbe9ff9b4: libnspr4.so + 0xce8b
0xbe9ffa24: libnspr4.so + 0xce7d
There are more functions on the stack, from which you could get a plausible stack trace. You'll note that the sp at the time of the crash is 0xbe9ff908, but the first plausible return address there (from the dump-lookup output) is 0xbe9ff9b4, which is a full 172 bytes up the stack. Since we only scan 30 words (120 bytes), we won't find that.
I'm sure this isn't some B2G-specific thing, it just happens that we have low crash volume so this particular crash shows up heavily, and because of this failure mode it's very hard to get useful data out of it.
My proposed change would be to alter unwinding slightly so that if we're attempting to unwind from a frame where we have no CFI (like here, where it's some memory address out in space), we let the stack scanning go quite a bit farther to find a return address.
Reporter | ||
Comment 7•12 years ago
|
||
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #6)
> I'm sure this isn't some B2G-specific thing, it just happens that we have
> low crash volume so this particular crash shows up heavily, and because of
> this failure mode it's very hard to get useful data out of it.
Well, right now, almost all B2G crashes end up in a way like this, though some have a few stack frames, like bp-89e64411-0e20-4d86-b984-7277f2121129 or bp-72168754-807a-42c3-8316-fb3cc2121129 - see https://crash-analysis.mozilla.com/rkaiser/2012-11-29/2012-11-29.b2g.crashes.html for a list of B2G crashes for one day (the "Crash" column has links to the actual reports).
bp-83634501-600e-44ab-bd5a-f9c362121129 at least seems to have a reasonable top of the stack until it wanders off to a bare address, reports like bp-8604473e-df47-4189-a38f-44e432121129 are the only somewhat reasonable stacks I've seen at all so far from our B2G reports.
We need to find some way very soon to get reasonable stacks out of this so we can do any analysis on B2G crashes. If the larger scan range proposed here is that solution, then I'd be happy with that for sure. :)
Comment 8•12 years ago
|
||
Here is the place in the breakpad trunk sources specifying the magic
30 word limit that Ted mentioned. It's obviously trivial to change,
but I don't know where in our system we would put the change, in order
for it to change the behaviour of crash-stats.
Index: src/google_breakpad/processor/stackwalker.h
===================================================================
--- src/google_breakpad/processor/stackwalker.h (revision 1085)
+++ src/google_breakpad/processor/stackwalker.h (working copy)
@@ -110,7 +110,7 @@
bool ScanForReturnAddress(InstructionType location_start,
InstructionType* location_found,
InstructionType* ip_found) {
- const int kRASearchWords = 30;
+ const int kRASearchWords = 100;
return ScanForReturnAddress(location_start, location_found, ip_found,
kRASearchWords);
}
Assignee | ||
Comment 9•12 years ago
|
||
Okay, I found the time to write the patch I was thinking about, I put it up for upstream review from jimb:
http://breakpad.appspot.com/501002
This is almost equivalent to Julian's patch, but it applies it only in very limited circumstances. That should be enough to fix this bug.
Comment 10•12 years ago
|
||
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #9)
> Okay, I found the time to write the patch I was thinking about, I put it up
> for upstream review from jimb:
> http://breakpad.appspot.com/501002
What's the process/ETA for getting this rolled out to Socorro?
Assignee | ||
Comment 11•12 years ago
|
||
This needs to get reviewed and landed upstream, and then will automatically go out with the next Socorro release, or we can probably push it out faster if necessary.
Assignee | ||
Comment 12•12 years ago
|
||
jimb: can you review that Breakpad patch?
Comment 13•12 years ago
|
||
LGTM on the breakpad patch.
Assignee | ||
Comment 14•12 years ago
|
||
Landed upstream:
http://code.google.com/p/google-breakpad/source/detail?r=1086
If you want this to go out faster than the next Socorro release you'll need to poke the Socorro team about that.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Reporter | ||
Comment 15•12 years ago
|
||
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #14)
> If you want this to go out faster than the next Socorro release you'll need
> to poke the Socorro team about that.
Usually, there's a release every Wednesday. Laura, is there one this week? If so, letting it ride with that is surely enough. ;-)
Updated•12 years ago
|
status-firefox18:
--- → fixed
status-firefox19:
--- → fixed
status-firefox20:
--- → fixed
Target Milestone: --- → mozilla20
Reporter | ||
Comment 16•12 years ago
|
||
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #14)
> Landed upstream:
> http://code.google.com/p/google-breakpad/source/detail?r=1086
Socorro did another push yesterday, and https://crash-stats.mozilla.com/status says that the Breakpad revision it's running now is 1086, so I guess this has been deployed. :)
Thanks, let's hope that today's B2G data will show the improvements from this!
Reporter | ||
Comment 17•12 years ago
|
||
It's a bit hard to try and verify this fix, as we get so many corrupt stacks (see the discussion in bug 817946) on B2G, but it looks like we're getting way fewer reports now that only have a bare address listed as a signature.
You need to log in
before you can comment on or make changes to this bug.
Description
•