Consider capturing information from __crash_info sections
Categories
(Toolkit :: Crash Reporting, enhancement)
Tracking
()
Tracking | Status | |
---|---|---|
firefox90 | --- | fixed |
People
(Reporter: mstange, Assigned: smichaud)
References
Details
Attachments
(11 files, 7 obsolete files)
14.64 KB,
text/plain
|
Details | |
2.96 KB,
text/plain
|
Details | |
48 bytes,
text/x-phabricator-request
|
Details | Review | |
1.24 MB,
application/octet-stream
|
Details | |
227.96 KB,
text/plain
|
Details | |
81.35 KB,
text/plain
|
Details | |
603.78 KB,
application/octet-stream
|
Details | |
107.07 KB,
text/plain
|
Details | |
61.17 KB,
text/plain
|
Details | |
2.45 KB,
text/plain
|
Details | |
48 bytes,
text/x-phabricator-request
|
Details | Review |
Some macOS system libraries, libGPUSupportMercury.dylib being one of them, have a __crash_info
section in the binary that they write error codes and strings to. Presumably this information gets captured by the Apple crash reporter.
Since our crash reporter intercepts crashes, the Apple crash reporter never sees that information and the information is lost. This means that Apple never sees error codes from drivers or from the kernel that Firefox encounters. For example, if the OpenGL driver encounters a problem when communicating with its kernel piece, Apple doesn't hear about those problems.
It would be nice to capture this information in crash reports so that we could at least forward it to Apple when needed.
See bug 1576767 comment 19 for an example of this.
More information:
- https://alastairs-place.net/blog/2013/01/10/interesting-os-x-crash-report-tidbits/
- https://github.com/apple/swift/blob/master/include/swift/Runtime/Debug.h
- There are
getsectdata
andgetsectdatafromFramework
functions in mach headers.
Assignee | ||
Comment 1•4 years ago
|
||
There are many different kinds of crashes in gpusGenerateCrashLog
or gpusGenerateCrashLog.cold.1
. There also continue to be large numbers of them. Without their Graphics kernel error
information (which is in the __crash_info
section), these crashes are almost impossible even to categorize, never mind to resolve.
I'm going to try to get to this sometime in the next few weeks.
Assignee | ||
Comment 2•4 years ago
•
|
||
I've started working on this. But I quickly discovered that, though many system modules have a "__crash_info" section, they generally aren't initialized as the module is loaded, and so are usually filled with junk when a crash happens.
I'll need to dig much deeper, and reverse-engineer how Apple handles crashes. I suspect I'll discover that the "__crash_info" section only gets initialized when a crash happens in that module. But we'll still need to find a way to tell when a "__crash_info" section is valid and when it isn't.
By the way, the junk all disappears when I use a hook library to explicitly bzero() every module's "__crash_info" section as it's loaded. That's how I know the junk doesn't come from the crashes themselves.
Assignee | ||
Comment 3•4 years ago
|
||
(Following up comment #2)
Oops, I forgot the module "slide". So all my "__crash_info" addresses were off by that amount :-(
Once I fixed the problem I didn't see any more "junk". So this bug isn't nearly as dire as I first thought.
Assignee | ||
Comment 4•4 years ago
•
|
||
I've got a patch for this bug. It's probably not complete. But I'm not entirely sure how to finish it, so I'm asking for help.
With the patch, Breakpad now collects information from the __crash_info
section of every module (in the list of loaded modules) that contains one, and adds it to the minidump. This information is stored per-module. minidump_stackwalk
now gathers all this information and prints out a summary of it near the top of its output (in different formats depending on whether or not it was called with -m
). So though this information is stored per-module, it's reported per-process -- like Apple does in its own crash reports.
I had to change the MDRawModule
structure, which governs how all minidumps are formatted. My changes are backward compatible. The old tools still work with new format minidumps (those with __crash_info
information), and the new tools still work with old format minidumps. The MDRawModule
structure has no versioning. But fortunately it has two "reserved" fields at the end. So the field I added just encroaches on them.
I should probably also change minidump-analyzer
. But I haven't yet, because I have only the vaguest idea how it works, and so don't really know how I'd test it. Please point me to any documentation that exists. I could figure it out on my own by trolling through Mozilla source code. But that'd be a lot of work, which I'd like to avoid if possible :-)
Edit: I've found the following, which is reasonably informative. But I'd still like your opinion on whether or not I should change minidump-analyzer
. As best I can tell, only the "full" output would be effected, and that's not what's written to *.extra
files by default.
https://firefox-source-docs.mozilla.org/toolkit/crashreporter/crashreporter/index.html
I should probably add part of my patch to toolkit/crashreporter/breakpad-patches
. But I'm not entirely sure which changes need to be included. Is it every change under toolkit/crashreporter/google-breakpad
?
I've done quite a bit of manual testing on builds made with my patch. But any practical test requires injecting code, using something like my HookCase. So it won't be possible to add tests for my patch to Mozilla's automated testing. Mozilla's existing general-purpose Breakpad tests will have to suffice. (In a later comment I'll post the hook library I used for testing, and explain the tests I ran.)
I've started a tryserver build with the latest version of my patch:
https://treeherder.mozilla.org/jobs?repo=try&revision=035bec068933ef3061e0676c3e580d4a2724289e
Assignee | ||
Comment 5•4 years ago
|
||
Even after my patch is complete and landed in the tree, Mozilla's support for __crash_info
data won't yet be complete. Socorro and https://crash-stats.mozilla.org/ won't yet have support. So crash reports displayed there won't include this information, and it won't be possible to search on the different kinds of __crash_info
data.
Here too I'll need help. Ideally someone else will be able to write the patch(es).
Assignee | ||
Comment 6•4 years ago
|
||
Assignee | ||
Updated•4 years ago
|
Assignee | ||
Updated•4 years ago
|
Assignee | ||
Comment 7•4 years ago
|
||
Assignee | ||
Comment 8•4 years ago
|
||
Assignee | ||
Updated•4 years ago
|
Assignee | ||
Comment 9•4 years ago
|
||
Assignee | ||
Comment 10•4 years ago
|
||
Assignee | ||
Comment 11•4 years ago
|
||
Assignee | ||
Comment 12•4 years ago
|
||
Here's the source for the hook library I tested with, as a patch on https://github.com/steven-michaud/HookCase/blob/master/HookLibraryTemplate/hook.mm.
Most of it (which is currently commented out) I used to trace Firefox's crash handling. The crucial part (not commented out) injects code into _pthread_cond_wait()
to trigger a crash that writes a message
to two different modules' __crash_info
sections.
Digging through system modules to find how they use __crash_info
, I discovered the following function in the Security framework. It writes directly to that module's __crash_info.message
and then calls abort()
:
void CKRaise(const char *message);
_pthread_cond_wait()
is often used in event handlers, to wait for the next event. So it's an ideal place to inject code to trigger a crash. It "works" in almost any application, in its main process or one of its child processes.
Once my hook library is loaded, you trigger the crash by doing kill -28 [pid]
. The 28
is for SIGWINCH. This signal isn't much used, so I often steal it for my own purposes.
Assignee | ||
Comment 13•4 years ago
|
||
https://treeherder.mozilla.org/jobs?repo=try&revision=035bec068933ef3061e0676c3e580d4a2724289e
This tryserver build has completed. It contains both a Firefox Nightly distro and a package containing the new minidump_stackwalk
.
After you download these packages, be sure to run xattr -c [file]
on each of them, before opening/extracting. This gets rid of any "quarantine" attributes. Otherwise what you extract will be unusable.
Assignee | ||
Comment 14•4 years ago
|
||
(Following up comment 12)
I forgot to mention that I also tested one of my tryserver builds on an Apple Silicon Mac Mini running macOS 11.2.3. It worked fine, but the testing was minimal. I couldn't trigger a crash with __crash_info
data -- I haven't yet ported HookCase to Apple Silicon.
Comment 15•4 years ago
|
||
Sorry for taking so long Steven but I had to double-check stuff both in Breakpad and our infrastructure to be sure I'm giving you the right answers :-)
Let's start with the simple stuff:
- All the changes we do to
toolkit/crashreporter/google-breakpad
have a corresponding patch intoolkit/crashreporter/breakpad-patches
. That's because the code under that directory is vanilla unmodified code. We did fork breakpad's client code which is intoolkit/crashreporter/breakpad-client
. This isn't an ideal arrangement but it makes our life relatively simple until we finish replacing Breakpad's client code with Rust. - We build two stackwalkers in-tree. The first one is based on minidump_stackwalk and used in automation so the changes you made to it will be immediately reflected in try runs (if a test crashes for example). The other is the minidump-analyzer and is used to generate stack traces for crash ping and gather information that isn't available in the minidump. Your change shouldn't affect it because we can't send this data in the crash ping.
- Socorro's stack walker lives in a separate repo and outputs JSON. To apply your change to it you'd need to put your Breakpad patch under the breakpad-patches subdirectory like you did on m-c and then add the additional output to stackwalker.cc.
- Once that's done we'd need to surface the additional stackwalker output into Socorro's web UI but that's the easiest part as by default the new output will already be visible in the "raw crash" tab so that would need change only if we want to include it in another tab.
Now onto the complicated part: in principle we shouldn't change MDRawModule
because we don't control its layout, Microsoft does. I don't know if the reserved fields are used for something else but to maintain forward compatibility with tooling we shouldn't touch them. There's another way to include custom data into a minidump that is both forward- and backward-compatible: custom minidump streams. Breakpad already supports a few so you can use them as examples. It works this way:
- You define your new stream like say
MOZ_MACOS_CRASH_INFO_STREAM = 0x4d7a0001
(for the ID I'm following Breakpad's convention of using two character in hex format for the first two bytes in the extension. In this case it'd "Mz") - You create a writer for your stream similar to the other ones
- Finally you add the new writer here so that when the directory is written out your new stream is included.
Assignee | ||
Comment 16•4 years ago
|
||
Thanks, Gabriele, for your detailed response. It'll be tricky to avoid changing MDRawModule
, but think I can manage it per your suggestion. It'll involve quite a lot more work, though. I'm not sure when I'll be able to finish it.
Assignee | ||
Comment 17•4 years ago
•
|
||
Hi again, Gabriele. I've now revised my patch per your suggestions (under "complicated part" in comment 15). It seems to work fine. But before I post my revision, I want to try integrating my changes into Socorro's stack walker. And for that I need to figure out Docker, and how Mozilla uses it. Not for the first time, the "simple stuff" is more complicated than the "complicated part" :-)
I've got the basic idea (that Docker builds apps in some kind of "container"). But the details are still a mystery. Do I need some kind of Mozilla account, or can I create my own Docker account and make do with their "free" download? I searched on "Docker site:mozilla.org" using both DuckDuckGo and Google. I didn't find anything very helpful.
I need to be able to build Socorro's stack walker with my changes, and then test them.
Comment 18•4 years ago
|
||
(In reply to Steven Michaud [:smichaud] (Retired) from comment #17)
Hi again, Gabriele. I've now revised my patch per your suggestions (under "complicated part" in comment 15). It seems to work fine. But before I post my revision, I want to try integrating my changes into Socorro's stack walker. And for that I need to figure out Docker, and how Mozilla uses it. Not for the first time, the "simple stuff" is more complicated than the "complicated part" :-)
I've got the basic idea (that Docker builds apps in some kind of "container"). But the details are still a mystery. Do I need some kind of Mozilla account, or can I create my own Docker account and make do with their "free" download? I searched on "Docker site:mozilla.org" using both DuckDuckGo and Google. I didn't find anything very helpful.
I actually have very little experience with Docker. While we do have instructions to install the entire Socorro stack under Docker I don't recommend it. It's large and somewhat complicated if your changes only apply to the stack walker.
I need to be able to build Socorro's stack walker with my changes, and then test them.
I can help with that! Apply this patch to a clone of minidump-stackwalk and then run ./bin/build_stackwalker.sh
from the source root directory. When the process ends you'll find the stackwalker
executable under the stackwalk/stackwalker
.
There's a couple of ifs and buts though: it's obviously not very pretty, every time you build it will do a full profile run (because the script generates the PGO-optimized version of the stackwalker) and gclient might download a ton of stuff but it's faster than using Docker.
Once you've built it you can run it with stackwalker example_dump_file.dmp
and it will do everything on its own (including downloading symbols if they are available on our symbol server).
Updated•4 years ago
|
Assignee | ||
Comment 19•4 years ago
|
||
Thanks, Gabriele. But the build died partway through with the following error:
grep: /proc/cpuinfo: No such file or directory
Sounds like it thought it was building on Linux.
I'll dig further tomorrow, and may be able to fix this problem myself. But let me know if you have an easy solution.
Comment 20•4 years ago
|
||
Right, there's Linuxisms in those build scripts. This one should do.
Comment 21•4 years ago
|
||
It seems that my patch is not enough. It's been a while since I built breakpad on macOS and there's something else that's in the way.
Assignee | ||
Comment 22•4 years ago
•
|
||
I finally managed to build Socorro's minidump_stackwalk
, but only inside a Docker image, and only as a Linux binary. Here's how I did it:
-
Download Docker for Mac, install it, and run it.
-
At an ordinary Terminal prompt (not one for a Docker container) run
git clone https://github.com/mozilla-services/minidump-stackwalk.git
-
cd minidump-stackwalk
, and while Docker Desktop is running, runmake build
-
This creates a Docker container, and builds all the appropriate binaries inside it (including
minidump_stackwalk
). -
In the Docker Desktop, run the container and open a "CLI" (command line interface) to it. Confusingly, its "home directory" is
/home/app
, and the CLI (a Terminal window) opens at/app
, but the binaries are in/stackwalk
. They're all Linux binaries, but they do run correctly in the container (via the CLI).
So it looks like I need to add my patch to minidump-stackwalk/tree/main/breakpad-patches
, re-run make build
, then test the altered minidump_stackwalk
inside the container (via its CLI). The CLI supports sftp
, so I should be able to import copies of minidumps that my altered Firefox has generated.
Docker containers run Linux. And the reason the build scripts have Linuxisms is that they're meant to run on Linux -- either directly or via a Docker container.
You don't need a Docker account to do any of this. That's only for getting access to remote storage.
Comment 23•4 years ago
|
||
Running make shell
runs /bin/bash
inside the minidump-stackwalk Docker image that you built with make build
. When the Docker service starts up the container, it mounts the local directory as /app
inside the container. Any files in that directory on your host machine will be available in the container in /app
. That's how I access files on my host from inside the container.
Assignee | ||
Comment 24•4 years ago
|
||
I'm about to submit my revised patch for review.
I've also now got a patch for Socorro's stackwalker
. I built it with make build
and tested it using make shell
(after having copied a bunch of my own minidumps to a minidumps
directory inside the distro). Everything went fine. I'll submit it (or some version of it) for review once my mozilla-central patch has landed on trunk. I assume I should do this using a GitHub pull request.
Surprisingly, my mozilla-central patch was quite easy to write. It's smaller and much neater than the previous version. The only disadvantage (as best I can tell) is that my new patch must iterate a second time through all the crashing process's loaded modules (in addition to the iteration required for MinidumpGenerator::WriteModuleListStream()
). But it doesn't touch the file system -- it reads all these modules' __DATA,__crash_info
sections from memory. So I don't think this will result in any performance problems.
Assignee | ||
Comment 25•4 years ago
|
||
Assignee | ||
Comment 26•4 years ago
•
|
||
All these attachments made with my old patch are now obsolete. I'm updating them.
This was made with a new tryserver build, available here, and my HookCase hook library (attached to this bug).
The minidump_stackwalk
build is here.
https://treeherder.mozilla.org/jobs?repo=try&revision=652e9c56787f702ceaaf9153561fd81a5edb5ce6
Assignee | ||
Comment 27•4 years ago
|
||
Assignee | ||
Comment 28•4 years ago
|
||
Assignee | ||
Comment 29•4 years ago
|
||
Assignee | ||
Comment 30•4 years ago
|
||
Assignee | ||
Comment 31•4 years ago
|
||
Comment hidden (obsolete) |
Comment 33•4 years ago
|
||
Assignee | ||
Comment 34•4 years ago
•
|
||
While reading through my patch for the umpteenth time, I found a bug. It's not something that will effect us now, or in the near future -- only after Apple starts adding new fields to its __crash_info
information, and we follow up by adding those fields to Breakpad. Once that happens, minidump_stackwalk
would likely crash reading a minidump with fewer "strings" than it expected.
The fix is very simple, and is contained in this new patch. It also removes a superfluous bit of code.
Why don't we just land this on mozilla-central, after my current patch lands there? Let me know what you think, Gabriele.
Sorry I didn't catch this earlier :-(
Comment 35•4 years ago
|
||
bugherder |
Assignee | ||
Comment 36•4 years ago
|
||
I just ran ./mach bootstrap
on a fully up-to-date mozilla tree. As always, doing this refreshed the toolset in ~/.mozbuild
, which contains a copy of minidump_stackwalk
. This new minidump_stackwalk
was built with my patch (I tested it against one of my minidumps containing __crash_info
data). So the process of getting new code into the ~/.mozbuild
toolset seems to be automatic. I'm glad to see this. I was wondering if it had to be done by hand.
As I mentioned in comment 24, I've got a patch for Socorro's stackwalker
. I've updated it with my latest 20-mac-crash-info.patch
(updated as per comment 34), and rebuild it and retested it. Everything went fine. I'll wait until Monday. Then if nothing has blown up by then, I'll use this patch to generate a pull request for https://github.com/mozilla-services/minidump-stackwalk.
Assignee | ||
Comment 37•4 years ago
|
||
Assignee | ||
Updated•4 years ago
|
Comment 38•4 years ago
|
||
(In reply to Steven Michaud [:smichaud] (Retired) from comment #34)
Sorry I didn't catch this earlier :-(
No problem! File a follow up and we'll fix it.
Assignee | ||
Comment 39•4 years ago
|
||
(Following up comment #36)
I've created my pull request here: https://github.com/mozilla-services/minidump-stackwalk/pull/29.
Comment 40•4 years ago
|
||
Comment 41•4 years ago
|
||
bugherder |
Assignee | ||
Comment 42•4 years ago
|
||
My pull request at https://github.com/mozilla-services/minidump-stackwalk/pull/29 is about to be landed.
I've opened bug 1709658 to add __crash_info
data to the "details" page of crash reports and to make it searchable.
Assignee | ||
Comment 43•4 years ago
•
|
||
I just discovered that the copy of /usr/lib/dyld
that's loaded into every process has its own __crash_info
section, which isn't examined by current code (by my patch for this bug). This wouldn't be hard to fix. I'll open a new bug to do it, once my current patch has baked for a while.
Edit: This is now fixed by bug 1814852.
Assignee | ||
Comment 44•4 years ago
|
||
As of earlier today, bug 1709658 is now fixed, and it's now possible to search on mac crash info
at https://crash-stats.mozilla.org/!
Comment 45•4 years ago
|
||
Steven, Will, do you want to announce this on the MLs? Or shall I do it? Also probably worth mentioning in the #macdev channel on Matrix.
Assignee | ||
Comment 46•4 years ago
|
||
I'm probably not the best one to do it. I don't even know what "MLs" and "the #macdev channel on Matrix" are :-)
Assignee | ||
Comment 47•4 years ago
•
|
||
We might want to postpone any kind of public announcement until more data has accumulated, and we've had a chance to test what kinds of searches are possible, and which are the most fruitful. This would let us find (and iron out) any kinks that still exist, and also write up some kind of document on how to search mac_crash_info
.
I'd be happy to write that document, or at least a first draft of it. But I'd prefer to wait a couple of weeks or so, to ensure I have a representative amount of data to work with.
By "public" I mean outside of BMO. I've already been spreading the word inside BMO, and maybe others could do more of that.
Assignee | ||
Comment 48•4 years ago
|
||
I just opened three bugs using the mac_crash_info
that has already come over the transom. I referenced this bug in all of them. But maybe I should create a mac_crash_info
meta bug, and make them reference that bug instead.
Comment 49•4 years ago
|
||
Thanks, let's see how these evolve. We might consider putting part of that info in the crash signature if we find some interesting correlations.
Assignee | ||
Comment 50•4 years ago
|
||
I've opened the meta bug I talked about in comment #48.
Assignee | ||
Comment 51•4 years ago
|
||
(In reply to Gabriele Svelto [:gsvelto] from comment #49)
Thanks, let's see how these evolve. We might consider putting part of that info in the crash signature if we find some interesting correlations.
Excellent idea! I've created bug 1711956 to follow up on this.
Assignee | ||
Comment 52•4 years ago
•
|
||
For the record, crash reports with interesting mac_crash_info
have started to appear that have nothing to do with "gpusGenerateCrashLog":
bp-ca7f9643-fdba-4731-98fd-4c8550210519
bp-e4e279f1-ea9b-47f6-80a8-795300210519
bp-0435588b-3896-4b53-8b76-9bd880210518
Description
•