Closed Bug 662646 Opened 13 years ago Closed 13 years ago

Snapshot the stack at every GC

Categories

(Core :: JavaScript Engine, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla8

People

(Reporter: billm, Assigned: billm)

References

Details

(Whiteboard: [inbound])

Attachments

(5 files, 4 obsolete files)

We know that there are a lot of GC-related topcrashes. Many of these may be rooting problems. The typical sequence of events here is that we do a GC at a "bad" time, and it collects an object that we don't want to collect. Then on the next GC we traverse that collected object and crash.

The idea in this bug is to be able to look at a stack trace from when the last GC took place. So at the end of each GC, we would copy a portion of the stack to some global area that breakpad should include in the minidump. That way, if we crash during GC, the stack data in this buffer will be from the last GC.

Ted, I can think of a hacky way to snapshot the stack and get it into the minidump. I'm mainly concerned with how difficult it would be to interpret this data. One idea I have is to take the minidump file, pull out the GC stack trace, and make a new minidump that's the same as the old one but with the GC stack trace overlaying the normal stack trace. Then we could open this minidump in a debugger and get line number info and stuff. Does this seem reasonable? Is there a better way that wouldn't be too hard to implement? I'm interested in doing this soonish.

A few more concrete questions:
- Is there an easy way to get a stack trace? I suspect that backtrace() is not platform-independent. I could do a fixed-size memcpy starting at the address of some dummy stack variable, but then I'd be worried about crashing if the stack happens to be really shallow.
- Is there an API for manipulating minidumps? That way I could replace the normal stack trace with the GC stack trace pretty easily.
When Breakpad creates a minidump, all it actually collects for each thread is the register state + the stack memory, and it walks the stack server side (using the debug symbols previously gathered at build time). If you can snapshot that and include it in the dump, it'd be pretty easy to get a stack trace out of it.

Linux minidump writing code:
http://code.google.com/p/google-breakpad/source/browse/trunk/src/client/linux/minidump_writer/minidump_writer.cc#718
http://code.google.com/p/google-breakpad/source/browse/trunk/src/client/linux/minidump_writer/linux_dumper.cc#475

It looks like the Linux dumper code is kind of dumb and just grabs up to 32k of stack memory (it does parse /proc/<pid>/maps to figure out where mappings end, so it won't go beyond that). I don't know if there's an easy way to find the "top of stack" on Linux in general, but maybe doing something like that would be easy enough for you. We'd also need to add a way to get this memory into the dump, obviously.

As for your second question, the Breakpad processor code is structured as a library for reading minidumps, but it's not really set up for modifying them. You can easily write dumps using the Linux minidump_generator code, and read them using the processor code, but you might be on your own for modifying them.

If you'd like to use the processor lib, grab the Breakpad source from SVN and build it (it's just configure && make):
http://code.google.com/p/google-breakpad/source/checkout

then link src/processor/libbreakpad.a into your app, and you can use the apis from google_breakpad/processor/minidump.h:
http://code.google.com/p/google-breakpad/source/browse/trunk/src/google_breakpad/processor/minidump.h

I have a couple of little apps that use the Breakpad APIs that I've written, like:
http://hg.mozilla.org/users/tmielczarek_mozilla.com/get-minidump-instructions/
http://hg.mozilla.org/users/tmielczarek_mozilla.com/check-minidump-instruction-pointer/
OK, thanks. I'm actually much more interested in Windows, though, since that's where almost all our crashes seem to happen. I guess I can look at the Windows minidump code.
Ah, you won't find much of use there, since we simply call the MinidumpWriteDump API to do all the hard work. The data stored is the same, though. We do already have things setup to include an additional memory region in the minidump, you can see that code here:

http://code.google.com/p/google-breakpad/source/browse/trunk/src/client/windows/handler/exception_handler.cc#769
http://code.google.com/p/google-breakpad/source/browse/trunk/src/client/windows/handler/exception_handler.cc#840
Actually, that looks great. Thanks. I can use the VirtualQuery interface to get the extents of the stack and then I can include it in the minidump using the callback business.

Is the same minidump file format used on all platforms? It would be easier if I could do everything in Linux except for the actual debugging.
Yes, the same minidump format is used for all platforms. The Breakpad processor libraries are written for POSIX, so you can easily do all the minidump manipulation on Linux.
I wonder if these stacks should go in the log that bug 662814 will implement...
a) Seems like that would be very spammy
b) We can't reliably walk the stack in the app, since we ship without a frame pointer
Attached patch crashreporter patch (obsolete) — Splinter Review
This is a WIP, but I was hoping you could give me some feedback on it, Ted. It's a patch to allow user code to add arbitrary memory regions to the minidump. So far it's only for x64 Linux, but I'll add x86 and Windows support if it looks okay to you.

I realized that we already have an "app data" section in the .extra file. However, I'd like something that I can view in the debugger and stuff.

One problem I'm having is that the md2core utility doesn't seem to be working for me. Is this expected to work on x64? I can open the core file in gdb, but the stack I get is just garbage. Is there a trick to it?

Also, for future reference, is there any way I could get access to the Breakpad symbol data used by Socorro? I'm assuming it's stored on a server somewhere. Eventually I'd like to be able to download minidumps, process them to pull out the GC data I'm planning to instrument them with, and then run the stackwalker on them.
Assignee: general → wmccloskey
Status: NEW → ASSIGNED
Attachment #539960 - Flags: feedback?(ted.mielczarek)
I went ahead and did this. It only works on Linux and Windows. On Mac and Solaris, the memory is not attached to the crash dump.
Attachment #539960 - Attachment is obsolete: true
Attachment #539960 - Flags: feedback?(ted.mielczarek)
Attachment #542288 - Flags: review?(ted.mielczarek)
These are some tests for the patch. Some of the file I/O stuff is kinda gross, but I couldn't figure out any better way to transmit the information.
Attachment #542289 - Flags: review?(ted.mielczarek)
This patch adds some code to the JS engine to snapshot the stack at arbitrary times. I'm still working on a patch that takes snapshots on GCs, OOMs, compartment mismatches, and the like.
Attachment #542290 - Flags: review?(dmandelin)
Comment on attachment 542290 [details] [diff] [review]
patch to snapshot the stack in JS

Review of attachment 542290 [details] [diff] [review]:
-----------------------------------------------------------------

The stack size check and snapshotting looks good. I do have some requests for additional documentation and some questions about the API.

::: js/src/jsapi.h
@@ +1180,5 @@
> +typedef JSBool
> +(* JSRegisterCrashMemoryFun)(void *ptr, size_t length);
> +
> +extern JS_PUBLIC_API(void)
> +JS_SetRegisterCrashMemoryFun(JSRuntime *rt, JSRegisterCrashMemoryFun fun);

This needs a better name, and a comment to document it.

It seems the implementation doesn't actually use the runtime, so how about deleting that parameter?

::: js/src/jscrashformat.h
@@ +42,5 @@
> +#define jscrashformat_h___
> +
> +#include <string.h>
> +
> +const static int crash_cookie_len = 16;

This file probably wants namespace js and possibly crash too.

@@ +69,5 @@
> +    uint64 snaptime;
> +    CrashRegisters regs;
> +    uint64 stack_base;
> +    uint64 stack_len;
> +    char stack[crash_buffer_size];

Please briefly document these values, especially snaptime.

Also, why uint64 for stack_base and stack_len? Wouldn't they normally be uintptr_t and size_t/ptrdiff_t?

::: js/src/jscrashreport.h
@@ +42,5 @@
> +#define jscrashreport_h___
> +
> +#include "jstypes.h"
> +
> +JS_BEGIN_EXTERN_C

Why the C API here?
OK, I made those changes.

The reason I used uint64 for everything is that the structures in jscrashformat.h define the binary format of the data that appears in the crash dump. This way I don't have to worry whether the dump came from x86 or x64. Also, making everything the same size means that there aren't any concerns about struct alignment/padding.
Attachment #542290 - Attachment is obsolete: true
Attachment #542290 - Flags: review?(dmandelin)
Attachment #542673 - Flags: review?(dmandelin)
Comment on attachment 542673 [details] [diff] [review]
patch to snapshot the stack in JS, v2

Review of attachment 542673 [details] [diff] [review]:
-----------------------------------------------------------------

Almost there, but not r+ing just yet so we can bikeshed a bit more on the API name.

::: js/src/jsapi.h
@@ +1187,5 @@
> + * included with the crash report. Each memory range is registered via
> + * the given function.
> + */
> +extern JS_PUBLIC_API(void)
> +JS_RegisterCrashMemory(JSRegisterCrashMemoryFun fun);

Let's call this JS_EnumerateDiagnosticMemoryRegions unless someone has a better idea. (That name is a bit long but I think a longer name is almost good for such a specialized function.) And "Callback" is the suffix idiom in JSAPI, so JSEnumerateDiagnosticMemoryCallback for the argument type.

The comment should be something like

/*
 * Enumerate memory regions that contain diagnostic information
 * intended to be included in crash report minidumps.
 */

::: js/src/jscrashformat.h
@@ +80,5 @@
> +    uint64 snaptime;      /* Unix time when the stack was snapshotted. */
> +    CrashRegisters regs;  /* Register contents for the snapshot. */
> +    uint64 stack_base;    /* Base address of stack at the time of snapshot. */
> +    uint64 stack_len;     /* Extent of the stack. */
> +    char stack[crash_buffer_size]; /* Contents of the stack. */

OK on uint64 here. Please do add a comment as to why it was done that way in case it looks funny to future readers.
Added a few more comments, too.
Attachment #542673 - Attachment is obsolete: true
Attachment #542673 - Flags: review?(dmandelin)
Attachment #542898 - Flags: review?(dmandelin)
Comment on attachment 542898 [details] [diff] [review]
patch to snapshot the stack in JS, v3

Thanks for the extra comments.
Attachment #542898 - Flags: review?(dmandelin) → review+
This is the final patch in the series. It adds several things:
- Snapshots the stack at the end of every GC.
- At the beginning of every GC, it stores some information about that GC. Currently this is just whether it was a compartment GC and whether it regenerated shapes.
- Poisons swept GC things with 0xDA, even in opt builds.
- Saves a stack snapshot whenever we OOM.
- Intentionally crashes opt builds if the GC ever traverses from one compartment to another when it doesn't expect to.

All this passes on try server. This is the patch that we should back out after a week or so. It seems like the start of the next release cycle would be a good time to get it in.
Attachment #543181 - Flags: review?(dmandelin)
Comment on attachment 542288 [details] [diff] [review]
patch to register app memory with crashreporter

Review of attachment 542288 [details] [diff] [review]:
-----------------------------------------------------------------

Lots of comments, but they're mostly fiddly stuff. This patch looks good overall. r=me with these addressed.

::: toolkit/crashreporter/google-breakpad/src/client/linux/handler/exception_handler.h
@@ +218,5 @@
>                        uintptr_t start_address,
>                        size_t mapping_size,
>                        size_t file_offset);
>  
> +  void RegisterAppMemory(void *ptr, size_t length);

This needs a comment. Also, there's no corresponding UnregisterAppMemory? Seems like something consumers would want.

::: toolkit/crashreporter/google-breakpad/src/client/linux/minidump_writer/minidump_writer.cc
@@ +518,5 @@
>      if (!WriteMappings(&dirent))
>        return false;
>      dir.CopyIndex(dir_index++, &dirent);
>  
> +    if (!SaveAppMemory())

Should probably be "WriteAppMemory" for consistency with the other methods.

::: toolkit/crashreporter/google-breakpad/src/client/linux/minidump_writer/minidump_writer.h
@@ +43,5 @@
>  // A list of <MappingInfo, GUID>
>  typedef std::pair<struct MappingInfo, u_int8_t[sizeof(MDGUID)]> MappingEntry;
>  typedef std::list<MappingEntry> MappingList;
>  
> +struct AppData {

I think I might call this "AppMemory", that feels a little more specific. Also this could use a comment.

@@ +47,5 @@
> +struct AppData {
> +  AppData(void *ptr, size_t length) : ptr_(ptr), length_(length) {}
> +
> +  void *ptr_;
> +  size_t length_;

Google style says struct members don't need the trailing underscore (only class members):
http://google-styleguide.googlecode.com/svn/trunk/cppguide.xml#Variable_Names (hidden behind the twisty)

@@ +73,5 @@
>  bool WriteMinidump(const char* filename, pid_t process,
>                     pid_t process_blamed_thread);
>  
>  // This overload also allows passing a list of known mappings.
>  bool WriteMinidump(const char* filename, pid_t crashing_process,

Please update the comment here as well.

::: toolkit/crashreporter/google-breakpad/src/client/mac/handler/exception_handler.h
@@ +145,5 @@
>      return crash_generation_client_.get() != NULL;
>    }
>  
> +  void RegisterAppMemory(void *ptr, size_t length) {
> +    // Unimplemented

If you're not going to implement this, let's just #ifdef out our usage in Mozilla code on OS X, and leave this for a future patch. I'd rather not put stubs in upstream.

::: toolkit/crashreporter/google-breakpad/src/client/solaris/handler/exception_handler.h
@@ +133,5 @@
>                              MinidumpCallback callback,
>                              void *callback_context);
>  
> +  void RegisterAppMemory(void *ptr, size_t length) {
> +    // Unimplemented

Same here. (and generally we don't worry about the Solaris code, if it's important to someone else they can fix it.)

::: toolkit/crashreporter/google-breakpad/src/client/windows/handler/exception_handler.cc
@@ +1093,5 @@
>    next_minidump_path_c_ = next_minidump_path_.c_str();
>  }
>  
> +void ExceptionHandler::RegisterAppMemory(void *ptr, size_t length)
> +{

opening brace goes on the line with the function name.

@@ +1094,5 @@
>  }
>  
> +void ExceptionHandler::RegisterAppMemory(void *ptr, size_t length)
> +{
> +  app_data_info_.push_back(AppData(ULONG64(ptr), ULONG(length)));

You should probably be using reinterpret_cast here, although that's sadmaking in terms of line length.

::: toolkit/crashreporter/google-breakpad/src/client/windows/handler/exception_handler.h
@@ +78,5 @@
>  
>  using std::vector;
>  using std::wstring;
>  
> +struct AppData {

As earlier, I think this should probably be AppMemory. Also it needs a comment.

@@ +82,5 @@
> +struct AppData {
> +  AppData(ULONG64 ptr, ULONG length) : ptr_(ptr), length_(length) {}
> +
> +  ULONG64 ptr_;
> +  ULONG length_;

Same comment about trailing underscores.

@@ +230,5 @@
>  
>    // Returns whether out-of-process dump generation is used or not.
>    bool IsOutOfProcess() const { return crash_generation_client_.get() != NULL; }
>  
> +  void RegisterAppMemory(void *ptr, size_t length);

Needs a comment here, and also the complement UnregisterAppMemory.

@@ +432,5 @@
>    // EXCEPTION_SINGLE_STEP exceptions.  Leave this false (the default)
>    // to not interfere with debuggers.
>    bool handle_debug_exceptions_;
>  
> +  AppDataList app_data_info_;

comment.
Attachment #542288 - Flags: review?(ted.mielczarek)
Comment on attachment 543181 [details] [diff] [review]
instrumentation and checking for GC crashes

Review of attachment 543181 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/jsgcmark.cpp
@@ +807,5 @@
>              markDelayedChildren();
>          }
>      }
> +
> +    rt->gcCheckCompartment = NULL;

RAII style could be nice here.
Attachment #543181 - Flags: review?(dmandelin) → review+
I made those changes.
Attachment #542288 - Attachment is obsolete: true
Attachment #543465 - Flags: review?(ted.mielczarek)
Comment on attachment 542898 [details] [diff] [review]
patch to snapshot the stack in JS, v3

Review of attachment 542898 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/jscrashformat.h
@@ +18,5 @@
> + * March 31, 1998.
> + *
> + * The Initial Developer of the Original Code is
> + * Netscape Communications Corporation.
> + * Portions created by the Initial Developer are Copyright (C) 1998

Drive-by comment on this: initial developer should be "The Mozilla Foundation" and the copyright date should be the current year. You could also change "The Original Code" up above to be something more descriptive.

::: js/src/jscrashreport.cpp
@@ +18,5 @@
> + * March 31, 1998.
> + *
> + * The Initial Developer of the Original Code is
> + * Netscape Communications Corporation.
> + * Portions created by the Initial Developer are Copyright (C) 1998

Same thing here.

::: js/src/jscrashreport.h
@@ +18,5 @@
> + * March 31, 1998.
> + *
> + * The Initial Developer of the Original Code is
> + * Netscape Communications Corporation.
> + * Portions created by the Initial Developer are Copyright (C) 1998

And here.
Comment on attachment 543465 [details] [diff] [review]
crashreporter patch, v2

Review of attachment 543465 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good, thanks! I'll get this landed upstream for you as well.
Attachment #543465 - Flags: review?(ted.mielczarek) → review+
Comment on attachment 542289 [details] [diff] [review]
new crashreporter tests

Review of attachment 542289 [details] [diff] [review]:
-----------------------------------------------------------------

r+ with a few comments.

::: toolkit/crashreporter/test/dumputils.cpp
@@ +88,5 @@
> +    memory_list->GetMemoryRegionForAddress(u_int64_t(addr));
> +  if(!region)
> +    return false;
> +
> +  const u_int8_t* chars = region->GetMemory();

Since you're verifying the length and contents of the memory here, you might want to mention that in the comment above. (This is not a general "the dump contains this address" routine.)

::: toolkit/crashreporter/test/nsTestCrasher.cpp
@@ +78,5 @@
> +{
> +  for (size_t i=0; i<sizeof(testData); i++)
> +    testData[i] = i;
> +
> +  FILE *fp = fopen("crash-addr", "w");

Is there a reason you write this to a file given that you're already returning the value? Just awkwardness of working with 64-bit values in JS? js-ctypes is already wrapping the 64-bit value for you, so you shouldn't need this, just pass the memory address back to DumpCheckMemory.

::: toolkit/crashreporter/test/unit/test_crashreporter_appmem.js
@@ +1,4 @@
> +function run_test()
> +{
> +  if (!("@mozilla.org/toolkit/crash-reporter;1" in Components.classes)) {
> +    dump("INFO | test_crashreporter.js | Can't test crashreporter in a non-libxul build.\n");

You don't need this block anymore, we don't support non-libxul builds.

@@ +9,5 @@
> +             .getService(Components.interfaces.nsIHttpProtocolHandler);
> +  if (!ph.userAgent.match(/Windows NT (\d+).(\d+)/) &&
> +      !ph.userAgent.match(/Linux i686/) &&
> +      !ph.userAgent.match(/Linux x86_64/)) {
> +      // App memory is not supported on all platforms

Since I just landed bug 664197 the other day, you can now check this in the manifest instead.

@@ +15,5 @@
> +  }
> +
> +  do_crash(function() {
> +	     appAddr = CrashTestUtils.saveAppMemory();
> +	     crashReporter.registerAppMemory(appAddr, 32);

So this round-trip tests the APIs you added, which is good. Can you test the JS bits as well? That is, can you force a GC and then crash, and verify that your "*J*S*CRASHDATA*" memory region exists in the dump?

::: toolkit/crashreporter/test/unit/xpcshell.ini
@@ +6,5 @@
>  [test_crash_runtimeabort.js]
>  [test_crashreporter.js]
>  [test_crashreporter_crash.js]
>  [test_crashreporter_crash_profile_lock.js]
> +[test_crashreporter_appmem.js]

You can say:
[test_crashreporter_appmem.js]
run-if = os == 'win' || os == 'linux'
Attachment #542289 - Flags: review?(ted.mielczarek) → review+
(In reply to comment #23)
> Is there a reason you write this to a file given that you're already
> returning the value? Just awkwardness of working with 64-bit values in JS?
> js-ctypes is already wrapping the 64-bit value for you, so you shouldn't
> need this, just pass the memory address back to DumpCheckMemory.

There's some fundamental awkwardness here. The address is assigned in one process (the crashing process) and used in another (the test process). And even though they're both running an xpcshell, I think it's possible that the test buffer will end up in different places (due to ASLR). So it's very hard for the test process to know where the crashing process has the buffer stored.
Oh, right. I should know that, having written that crash test harness. Can you at least pass the filename in an environment var or something so we can clean it up?
(In reply to comment #25)
> Oh, right. I should know that, having written that crash test harness. Can
> you at least pass the filename in an environment var or something so we can
> clean it up?

How about I just delete the file at the end of DumpCheckMemory?
Seems fine. Can you document the out-of-band passing of the address in the test functions?
Yes, I'll add some comments about it.

Ted, I think I forgot to ask you about this earlier. In order to analyze the data we'll be getting, I'll need to get access to the crash symbols zip files that tinderbox builds. I assume these are uploaded to a server somewhere?
They get uploaded for tinderbox-builds, and I thought they did for nightly builds, but I can't find them on ftp.mo. They're available via http://symbols.mozilla.org/firefox/, but it's a bit of a pain to pull them down individually.
Hmm, I'm not sure how to use symbols.mozilla.org. When I try to go there, it just says "Forbidden." However, I was able to find *some* of the nightlies on ftp.mozilla.org, in this location:
  ftp://ftp.mozilla.org/pub/firefox/tinderbox-builds/mozilla-central-win32/
I just searched for a build that had the same build id as the crash. Unfortunately, some of them aren't there. I guess maybe they're not built through tinderbox? The weird thing is that some of them have 'nightly' for in the channel field of the crash report. Does that imply that we built those builds, or is it just the default?
Right, sorry, I wasn't very clear there (I wrote that comment as I was being rushed out the door).
1) symbols.mozilla.org is effectively a Microsoft-style symbol server, you have to know what you're looking for. If we look at some arbitrary crash report like https://crash-stats.mozilla.com/report/index/cd1e1c5f-140b-4832-b39b-e1d312110707 , you can construct URLs to the symbol files by looking at the "Modules" tab, like:
http://symbols.mozilla.org/firefox/xul.pdb/9350BB22B914422A8280B43B7E2C18FF2/xul.sym
If you've got a minidump, you can run minidump_stackwalk -m on it and it will produce output like you see in the "Raw dump" tab on crash-stats, which includes a module listing containing that information.

2) tinderbox-builds are all the builds from tinderbox, including the per-changeset builds as well as nightly builds, so yeah, there are some nightlies mixed in there. We only keep a few days worth of builds in that directory though, nightlies get copied off to dated directories like:
http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2011-07-07-03-08-39-mozilla-central/
we don't copy the symbol files when we do that, unfortunately.
Thanks very much, Ted. It's much clearer now.
Whiteboard: [inbound]
I thought of a potential problem with the crash reporter patch. Namely, it dumps out the instruction memory by doing app_memory_info_.push_back(...instruction memory...). Since this is an STL list, it will allocate memory. My understanding is that the Windows crash reporter runs in the same address space as the crashed process. If we happen to have crashed due to heap corruption, allocating is probably bad.

This patch just reserves space for the instruction memory ahead of time. I'd like to get this in before this stuff goes out in tonight's nightly.
Attachment #544864 - Flags: review?(ted.mielczarek)
Comment on attachment 544864 [details] [diff] [review]
spotfix for windows issue

Review of attachment 544864 [details] [diff] [review]:
-----------------------------------------------------------------

Ah, good call. I completely missed that.
Attachment #544864 - Flags: review?(ted.mielczarek) → review+
Comment on attachment 544864 [details] [diff] [review]
spotfix for windows issue

>--- a/toolkit/crashreporter/google-breakpad/src/client/windows/handler/exception_handler.cc
>+++ b/toolkit/crashreporter/google-breakpad/src/client/windows/handler/exception_handler.cc
>+      // Skip the reserved element if there was no instruction memory
>+      if (context.iter->ptr == 0)
>+	context.iter++;

Did you mean to add this tab?
Depends on: 670686
I finally got around to upstreaming the Breakpad patch here:
http://code.google.com/p/google-breakpad/source/detail?r=989
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: