Closed Bug 601312 Opened 14 years ago Closed 14 years ago

Useless crash reports on OS 10.6/x86-64 when crashing in method jit [@ JaegerShot ]

Categories

(Toolkit :: Crash Reporting, defect)

x86_64
macOS
defect
Not set
normal

Tracking


RESOLVED FIXED
Tracking Status
blocking2.0 --- betaN+

People

(Reporter: bzbarsky, Assigned: ted)

References

Details

I crashed twice today and once yesterday. The reports are:

http://crash-stats.mozilla.com/report/index/bp-e4c18a23-f693-4979-a8bd-1c9f82101001
http://crash-stats.mozilla.com/report/index/bp-38135e53-55d3-4351-a560-ea4f72101001
http://crash-stats.mozilla.com/report/index/bp-c4ea3590-86a5-4b2e-aba8-5234f2100930

All three have no stack, and say "No proper signature could be created because no good data for the crashing thread (0) was found" in the "processor notes". A few more crashes like that were mentioned on IRC today. I just queried the crash CSV for 2010-09-30, and of the 48 64-bit Mac crashes it has, 35 look like this. The other 13 look like plausible crashes.
I've had this problem, too. (I mentioned them on IRC. :P) Crashed three times today:

http://crash-stats.mozilla.com/report/index/76903d35-8b5b-4766-981b-511182101001
http://crash-stats.mozilla.com/report/index/bp-7e54763b-2937-4133-9999-c32682101001
http://crash-stats.mozilla.com/report/index/bp-cbd899fd-0bac-4000-9127-ea1bb2101001

But I also had a similar crash on Sept. 27th:

http://crash-stats.mozilla.com/report/index/bp-312b6423-63e8-486a-8a0e-819eb2100927

I had another crash on the 27th that created a signature just fine:

http://crash-stats.mozilla.com/report/index/bp-6a4f63c4-db70-4e59-af3c-425392100927

Note that in between those two crashes on the 27th, the build ID changed. (I must have updated.) The good signature occurred in 20100926031017 and the bad signature showed up in 20100926215019. I believe those build IDs coincide with the 64-bit switch, as bz says, but I offer the hard evidence here, just in case it ever comes into question. :)
See bug 600412 comment 2 for the rationale. I've already fixed this upstream in Breakpad, it just needs to get rolled out to our production server (bug 601114).
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → DUPLICATE
I'm reopening this. I just crashed three times in a row in a span of a few minutes, and every single one of the crashes had this problem. Note that these were in reasonably current nightlies:

http://crash-stats.mozilla.com/report/index/bp-c44b0682-d1d2-42aa-9769-2275b2101012
http://crash-stats.mozilla.com/report/index/bp-ba57e2e0-b1aa-4730-a913-f893d2101012
http://crash-stats.mozilla.com/report/index/bp-5407c8bc-d471-49c9-b46f-2c4d22101012

I'm running the nightly under a debugger now, since I can't rely on Breakpad, but it hasn't crashed yet....
Status: RESOLVED → REOPENED
blocking2.0: --- → ?
Resolution: DUPLICATE → ---
Er, that third one has a stack now (didn't when I filed). And I caught the most recent crash (http://crash-stats.mozilla.com/report/index/bp-3e67135b-9038-47e3-87d3-f516d2101012) in gdb, and _that_ couldn't go up the stack either, past saying that I was called from JaegerShot. So maybe the real issue here is that mjit is just really screwing up. But it'd be nice if breakpad got the JaegerShot part....
I agree. What do the stack bytes look like? The Breakpad stackwalker will only scan back 15 words of stack memory while looking for a return address: http://code.google.com/p/google-breakpad/source/browse/trunk/src/google_breakpad/processor/stackwalker.h#120
Summary: Lots of useless crash reports on OS 10.6 after 64-bit switch → Useless crash reports on OS 10.6/x86-64 when crashing in method jit [@ JaegerShot ]
Status: REOPENED → NEW
> What do the stack bytes look like?

How do I get that information?
In gdb, type "info frame" to find the stack address (Stack level X, frame at <address>:) then x/15xg <address> will show you all the stack memory that Breakpad's stackwalker would scan (in 8-byte words). If you don't see the address of the caller frame in there, then that's why Breakpad can't find it.
Thanks. I'll see if I can get this to happen in gdb again....
Looks like |x/15xg <address> - 15*8| is the right thing (the stack grows down). And <address> should be $rsp, which doesn't necessarily match what "info frame" says. Within those constraints, it looks like the return address was 18 words from $rsp.
Okay, so the simpler fix is probably just to bump that constant from 15 to 20 or so. However, if methodjit's stack frame grows by 2 or 3 words, that means we'd be back in this situation. I'm open to other ideas on how to improve Breakpad/JIT integration. We could, for example, have it provide a frame pointer on x86-64 and teach the stack walker how to look for valid frame pointers and follow them.
I think we should do a client stackwalk and save the eh-pointer memory areas on x86-64. I really don't think that blocks, though.
blocking2.0: ? → -
Well, the only question is how many such crashes we have now and whether we're properly collating them to get an idea of how crashy mjit is.... and whether we need to spend time de-crashifying it.
bsmedberg: will that help here? Does methodjit write out eh_frame data? (I honestly have no idea.)
Probably not, but the EH frame pointer should be pointing at the compiled caller and we'll pick that up.
The constant regulating how far Breakpad tries to scan should be set so that it's a good 2x or 3x the size of the typical stack frame. Increase that puppy, I say.
I'm going to bump this constant up, but I also talked to Jim about a possible API for letting our JITs inform Breakpad about JIT code pages and functions that call into them so it can have more useful data instead of fumbling around on the stack.
Assignee: nobody → ted.mielczarek
I filed bug 604725 on the API proposal.
(In reply to comment #11)
> I think we should do a client stackwalk and save the eh-pointer memory areas
> on x86-64. I really don't think that blocks, though.

Renominating. AIUI, this bug means that we're going to see a bunch of meaningless crash reports, doesn't it?
blocking2.0: - → ?
Depends on: 605798
I landed an upstream fix that should fix a lot of these cases (by doubling the size of the stack area we search, as suggested by jimb):

http://code.google.com/p/google-breakpad/source/detail?r=715

I just filed bug 605798 to get the production Socorro copy updated.
Ok, this is live in staging, and it seems to do the trick for some of these crashes, anyway. Compare:

http://crash-stats.stage.mozilla.com/report/index/8dc72891-cb8d-49c9-a456-0d41f2101020

which is the same exact dump as:

https://crash-stats.mozilla.com/report/index/1aeae141-c5d0-4a74-b4be-c17a72101020

resubmitted to production with the updated minidump_stackwalk. Is that a sane stack? I don't know, it looks kind of crazy to me, but I don't really know anything about Jaeger. It does eventually wind its way down to the event loop, so I guess it's not completely off the rails.
> Is that a sane stack?

For something like gmail or a site using jquery or prototype, yes. That long call() cascade is what your typical API entry point's implementation looks like for "modern" js libraries...
Should be fixed in production now.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
blocking2.0: ? → betaN+