Closed Bug 555309 Opened 14 years ago Closed 14 years ago

Hang crash reports have (no signature)

Categories

(Core Graveyard :: Plug-ins, defect)

x86
Windows 7
defect
Not set
normal

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: benjamin, Assigned: cjones)

References

Details

(Whiteboard: [land lorentz])

Attachments

(5 files)

Crash reporets that are a result of a plugin hang currently end up with (no signature). This is not useful at all.

e.g. parent-side http://crash-stats.mozilla.com/report/index/7b2399aa-1ef1-48ab-be59-35a6a2100326
child-side http://crash-stats.mozilla.com/report/index/e50ae8e1-9381-4181-b4ea-11bdc2100326

"No thread was identified as the cause of the crash; No signature could be created because we do not know which thread crashed".

What we want is the signature of the main thread (for the parent process) and the plugin thread (in the child process). My thought was simply to annotate (in the .extra file) from which thread socorro should generate the signature. If we can just fix the dump file itself, that's fine too.

If we go the .extra route, we need to file a socorro bug too, so let me know how you want to do it.
I have doubts that this is going to produce anything really useful either way. You're dumping live code in both processes, so even for the same hang the chances of getting the same signature don't seem really likely.

Anyway, if you want this to show up we'll have to put a breakpad info stream in the dump (which I asked cjones not to when he wrote that patch, bug 544936 comment 47). The processor uses the crashed or requesting thread from the exception stream or breakpad info stream respectively, and Socorro follows that:
http://code.google.com/p/google-breakpad/source/browse/trunk/src/processor/minidump_processor.cc#73
If the process is dead-hung, which is by far the most common thing we've seen, the stack in the child and parent are both likely to be relatively stable. It seemed to me that modifying the minidump was the hard path, and just telling socorro "take the signature from thread N" would be easier.

The stacks from the parent are going to require additional work, I think. e.g. with the parent crash above you want the frame below CrashReporter::CreatePairedMinidumps, and you probably want to continue all the way below mozilla::ipc::RPCChannel::Call which is where the variability is really going to start.
FWIW, I think that getting usable statistical data out of these reports is vital for the beta. I think we can do it manually if necessary, but I'd really like to end up with socorro doing it for us pretty quickly.
I suspect the best solution would be extracting this information from fully-processed dumps on crash-stats, but we might be able to hack some exception info in the mean time.
Hacking this into the linux code wasn't /too/ bad.  Here's a thread dump with the new code

http://crash-stats.mozilla.com/report/index/aebaa3e3-eb15-48ea-a590-90c062100329

(No stack, huh.  This is a debug build, but I guess socorro has been switched to eh_frame-based stackwalking exclusively?)  Anyway, on linux I'm marking the plugin thread as having crashed on a SIGSTOP, which should easily distinguish it from "real" crashes.

The plugin subprocess reports the plugin thread ID back during NP_Initialize().  If for some reason the plugin thread has died before we dump it, then we get the same kind of minidumps as the current code: "No crashing thread could be identified".

Looking into windows now.
From what I can tell, the best we'll be able to do on windows is identify the plugin thread in the dump, but not set its "crashing address".  To do that, we'd need access to the plugin thread's CONTEXT while it's paused by MiniDumpWriteDump().  I'll see what socorro does with a crash address of "0"; if it looks at the top of the crashing thread's stack, we're back in business.  Otherwise we'll need another approach for windows.  (ISTR it just uses the 0 address.)

One hacky option that requires socorro cooperation is to set a magic crash address, say 42, that says "look at the top of the crashed thread's stack."
(In reply to comment #6)
> One hacky option that requires socorro cooperation is to set a magic crash
> address, say 42, that says "look at the top of the crashed thread's stack."

Actually, this wouldn't need socorro cooperation, just a change to minidump_processor.
Looks like another option is to |OpenThread(pluginThreadId); SuspendThread(); GetThreadContext();| and hope that the thread is actually hung, i.e. the context doesn't change until MiniDumpWriteDump() finishes.  MiniDumpWriteDump() might go ahead and properly dump an already-attached thread too; that's another option.
Went with the SuspendThread()/GetThreadContext()/MiniDumpWriteDump() approach; working quite well indeed.  Here's what the plugin-side crashes look like

http://crash-stats.mozilla.com/report/index/bba7e1e4-13c6-46ac-8f6e-101932100329

I used EXCEPTION_BREAKPOINT to distinguish them from real crashes.  Looks like we'll need some new filters to chomp the kernelbase/ntdll frames.

Interestingly enough, I didn't touch the browser-side dumping code, but see a signature there nonetheless

http://crash-stats.mozilla.com/report/index/500bb5ed-8c21-48e1-8df5-c351a2100329
OK, this is a weird interface to expose through crashreporter.  Granted.  The reason it's done is that on linux, the gettid() syscall isn't exposed through libc, and breakpad goes to great lengths to do so correctly on several platforms.  This patch is my attempt to not duplicate that work.

reviews -> :bs b/c of ted's upcoming paternity leave.
Attachment #435831 - Flags: review?(benjamin)
Separated into parts (a) and (b) for easier reviewing.  They don't compile separately, so I'll qfold them before pushing.
Attachment #435834 - Flags: review?(benjamin)
(In reply to comment #5)
> (No stack, huh.  This is a debug build, but I guess socorro has been switched
> to eh_frame-based stackwalking exclusively?)

That's an x86-64 build. We can't walk the stack at all without symbols on the server.
Attachment #435831 - Flags: review?(benjamin) → review+
Attachment #435833 - Flags: review?(benjamin) → review+
Attachment #435834 - Flags: review?(benjamin) → review+
Comment on attachment 435834 [details] [diff] [review]
Part 2b: Allow a particular subprocess thread to be "blamed" in an OOP dump

The indentation in exception_handler.cc is a little weird, but other than that this looks great.
Comment on attachment 435835 [details] [diff] [review]
Part 3: Grab the plugin thread's ID on startup and blame it in hang dumps

Is `0` a good default for mPluginThread?
Attachment #435835 - Flags: review?(benjamin) → review+
Here's a windows browser-side dump with Part 4

http://crash-stats.mozilla.com/report/index/9dada40d-86b1-43a0-ade9-459b82100330

and linux

http://crash-stats.mozilla.com/report/index/bp-38fa0e43-137c-49fb-9303-461fd2100330

I took the same approach for these as for the plugin-side dumps, identifying the crash reason as EXCEPTION_BREAKPOINT/SIGSTOP.  Patch here in a tick.
Relatively simple patch, so I stuffed the windows and linux changes together.
Attachment #436000 - Flags: review?(benjamin)
(In reply to comment #15)
> (From update of attachment 435834 [details] [diff] [review])
> The indentation in exception_handler.cc is a little weird, but other than that
> this looks great.

Thanks, my windows emacs isn't configured correctly.

(In reply to comment #16)
> (From update of attachment 435835 [details] [diff] [review])
> Is `0` a good default for mPluginThread?

It is on linux, and per jimm's test on IRC windows treats 0 as an invalid thread id.
Attachment #436000 - Flags: review?(benjamin) → review+
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Product: Core → Core Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: