Closed Bug 1577886 Opened 5 years ago Closed 3 years ago

Consider capturing information from __crash_info sections

Categories

(Toolkit :: Crash Reporting, enhancement)

All
macOS
enhancement
Not set
normal

Tracking

()

RESOLVED FIXED
90 Branch
Tracking Status
firefox90 --- fixed

People

(Reporter: mstange, Assigned: smichaud)

References

Details

Attachments

(11 files, 7 obsolete files)

14.64 KB, text/plain
Details
2.96 KB, text/plain
Details
48 bytes, text/x-phabricator-request
Details | Review
1.24 MB, application/octet-stream
Details
227.96 KB, text/plain
Details
81.35 KB, text/plain
Details
603.78 KB, application/octet-stream
Details
107.07 KB, text/plain
Details
61.17 KB, text/plain
Details
2.45 KB, text/plain
Details
48 bytes, text/x-phabricator-request
Details | Review

Some macOS system libraries, libGPUSupportMercury.dylib being one of them, have a __crash_info section in the binary that they write error codes and strings to. Presumably this information gets captured by the Apple crash reporter.

Since our crash reporter intercepts crashes, the Apple crash reporter never sees that information and the information is lost. This means that Apple never sees error codes from drivers or from the kernel that Firefox encounters. For example, if the OpenGL driver encounters a problem when communicating with its kernel piece, Apple doesn't hear about those problems.

It would be nice to capture this information in crash reports so that we could at least forward it to Apple when needed.

See bug 1576767 comment 19 for an example of this.

More information:

There are many different kinds of crashes in gpusGenerateCrashLog or gpusGenerateCrashLog.cold.1. There also continue to be large numbers of them. Without their Graphics kernel error information (which is in the __crash_info section), these crashes are almost impossible even to categorize, never mind to resolve.

I'm going to try to get to this sometime in the next few weeks.

Assignee: nobody → smichaud
Blocks: 1696358
See Also: → 1576767, 1696358, 1692399

I've started working on this. But I quickly discovered that, though many system modules have a "__crash_info" section, they generally aren't initialized as the module is loaded, and so are usually filled with junk when a crash happens.

I'll need to dig much deeper, and reverse-engineer how Apple handles crashes. I suspect I'll discover that the "__crash_info" section only gets initialized when a crash happens in that module. But we'll still need to find a way to tell when a "__crash_info" section is valid and when it isn't.

By the way, the junk all disappears when I use a hook library to explicitly bzero() every module's "__crash_info" section as it's loaded. That's how I know the junk doesn't come from the crashes themselves.

(Following up comment #2)

Oops, I forgot the module "slide". So all my "__crash_info" addresses were off by that amount :-(

Once I fixed the problem I didn't see any more "junk". So this bug isn't nearly as dire as I first thought.

I've got a patch for this bug. It's probably not complete. But I'm not entirely sure how to finish it, so I'm asking for help.

With the patch, Breakpad now collects information from the __crash_info section of every module (in the list of loaded modules) that contains one, and adds it to the minidump. This information is stored per-module. minidump_stackwalk now gathers all this information and prints out a summary of it near the top of its output (in different formats depending on whether or not it was called with -m). So though this information is stored per-module, it's reported per-process -- like Apple does in its own crash reports.

I had to change the MDRawModule structure, which governs how all minidumps are formatted. My changes are backward compatible. The old tools still work with new format minidumps (those with __crash_info information), and the new tools still work with old format minidumps. The MDRawModule structure has no versioning. But fortunately it has two "reserved" fields at the end. So the field I added just encroaches on them.

I should probably also change minidump-analyzer. But I haven't yet, because I have only the vaguest idea how it works, and so don't really know how I'd test it. Please point me to any documentation that exists. I could figure it out on my own by trolling through Mozilla source code. But that'd be a lot of work, which I'd like to avoid if possible :-)

Edit: I've found the following, which is reasonably informative. But I'd still like your opinion on whether or not I should change minidump-analyzer. As best I can tell, only the "full" output would be effected, and that's not what's written to *.extra files by default.

https://firefox-source-docs.mozilla.org/toolkit/crashreporter/crashreporter/index.html

I should probably add part of my patch to toolkit/crashreporter/breakpad-patches. But I'm not entirely sure which changes need to be included. Is it every change under toolkit/crashreporter/google-breakpad?

I've done quite a bit of manual testing on builds made with my patch. But any practical test requires injecting code, using something like my HookCase. So it won't be possible to add tests for my patch to Mozilla's automated testing. Mozilla's existing general-purpose Breakpad tests will have to suffice. (In a later comment I'll post the hook library I used for testing, and explain the tests I ran.)

I've started a tryserver build with the latest version of my patch:
https://treeherder.mozilla.org/jobs?repo=try&revision=035bec068933ef3061e0676c3e580d4a2724289e

Flags: needinfo?(gsvelto)

Even after my patch is complete and landed in the tree, Mozilla's support for __crash_info data won't yet be complete. Socorro and https://crash-stats.mozilla.org/ won't yet have support. So crash reports displayed there won't include this information, and it won't be possible to search on the different kinds of __crash_info data.

Here too I'll need help. Ideally someone else will be able to write the patch(es).

Attachment #9213859 - Attachment mime type: text/plain → application/octet-stream
Attachment #9213859 - Attachment description: Minidump containing __crash_info data on main process crash → 9BB74358-2E48-4F57-95EE-746CA9E224E7: minidump containing __crash_info data on main process crash
Attachment #9213859 - Attachment description: 9BB74358-2E48-4F57-95EE-746CA9E224E7: minidump containing __crash_info data on main process crash → 9BB74358-2E48-4F57-95EE-746CA9E224E7: minidump containing __crash_info data on child process crash

Here's the source for the hook library I tested with, as a patch on https://github.com/steven-michaud/HookCase/blob/master/HookLibraryTemplate/hook.mm.

Most of it (which is currently commented out) I used to trace Firefox's crash handling. The crucial part (not commented out) injects code into _pthread_cond_wait() to trigger a crash that writes a message to two different modules' __crash_info sections.

Digging through system modules to find how they use __crash_info, I discovered the following function in the Security framework. It writes directly to that module's __crash_info.message and then calls abort():

    void CKRaise(const char *message);

_pthread_cond_wait() is often used in event handlers, to wait for the next event. So it's an ideal place to inject code to trigger a crash. It "works" in almost any application, in its main process or one of its child processes.

Once my hook library is loaded, you trigger the crash by doing kill -28 [pid]. The 28 is for SIGWINCH. This signal isn't much used, so I often steal it for my own purposes.

https://treeherder.mozilla.org/jobs?repo=try&revision=035bec068933ef3061e0676c3e580d4a2724289e

This tryserver build has completed. It contains both a Firefox Nightly distro and a package containing the new minidump_stackwalk.

After you download these packages, be sure to run xattr -c [file] on each of them, before opening/extracting. This gets rid of any "quarantine" attributes. Otherwise what you extract will be unusable.

(Following up comment 12)

I forgot to mention that I also tested one of my tryserver builds on an Apple Silicon Mac Mini running macOS 11.2.3. It worked fine, but the testing was minimal. I couldn't trigger a crash with __crash_info data -- I haven't yet ported HookCase to Apple Silicon.

Sorry for taking so long Steven but I had to double-check stuff both in Breakpad and our infrastructure to be sure I'm giving you the right answers :-)

Let's start with the simple stuff:

  • All the changes we do to toolkit/crashreporter/google-breakpad have a corresponding patch in toolkit/crashreporter/breakpad-patches. That's because the code under that directory is vanilla unmodified code. We did fork breakpad's client code which is in toolkit/crashreporter/breakpad-client. This isn't an ideal arrangement but it makes our life relatively simple until we finish replacing Breakpad's client code with Rust.
  • We build two stackwalkers in-tree. The first one is based on minidump_stackwalk and used in automation so the changes you made to it will be immediately reflected in try runs (if a test crashes for example). The other is the minidump-analyzer and is used to generate stack traces for crash ping and gather information that isn't available in the minidump. Your change shouldn't affect it because we can't send this data in the crash ping.
  • Socorro's stack walker lives in a separate repo and outputs JSON. To apply your change to it you'd need to put your Breakpad patch under the breakpad-patches subdirectory like you did on m-c and then add the additional output to stackwalker.cc.
  • Once that's done we'd need to surface the additional stackwalker output into Socorro's web UI but that's the easiest part as by default the new output will already be visible in the "raw crash" tab so that would need change only if we want to include it in another tab.

Now onto the complicated part: in principle we shouldn't change MDRawModule because we don't control its layout, Microsoft does. I don't know if the reserved fields are used for something else but to maintain forward compatibility with tooling we shouldn't touch them. There's another way to include custom data into a minidump that is both forward- and backward-compatible: custom minidump streams. Breakpad already supports a few so you can use them as examples. It works this way:

  • You define your new stream like say MOZ_MACOS_CRASH_INFO_STREAM = 0x4d7a0001 (for the ID I'm following Breakpad's convention of using two character in hex format for the first two bytes in the extension. In this case it'd "Mz")
  • You create a writer for your stream similar to the other ones
  • Finally you add the new writer here so that when the directory is written out your new stream is included.
Flags: needinfo?(gsvelto)

Thanks, Gabriele, for your detailed response. It'll be tricky to avoid changing MDRawModule, but think I can manage it per your suggestion. It'll involve quite a lot more work, though. I'm not sure when I'll be able to finish it.

Hi again, Gabriele. I've now revised my patch per your suggestions (under "complicated part" in comment 15). It seems to work fine. But before I post my revision, I want to try integrating my changes into Socorro's stack walker. And for that I need to figure out Docker, and how Mozilla uses it. Not for the first time, the "simple stuff" is more complicated than the "complicated part" :-)

I've got the basic idea (that Docker builds apps in some kind of "container"). But the details are still a mystery. Do I need some kind of Mozilla account, or can I create my own Docker account and make do with their "free" download? I searched on "Docker site:mozilla.org" using both DuckDuckGo and Google. I didn't find anything very helpful.

I need to be able to build Socorro's stack walker with my changes, and then test them.

Flags: needinfo?(gsvelto)

(In reply to Steven Michaud [:smichaud] (Retired) from comment #17)

Hi again, Gabriele. I've now revised my patch per your suggestions (under "complicated part" in comment 15). It seems to work fine. But before I post my revision, I want to try integrating my changes into Socorro's stack walker. And for that I need to figure out Docker, and how Mozilla uses it. Not for the first time, the "simple stuff" is more complicated than the "complicated part" :-)

I've got the basic idea (that Docker builds apps in some kind of "container"). But the details are still a mystery. Do I need some kind of Mozilla account, or can I create my own Docker account and make do with their "free" download? I searched on "Docker site:mozilla.org" using both DuckDuckGo and Google. I didn't find anything very helpful.

I actually have very little experience with Docker. While we do have instructions to install the entire Socorro stack under Docker I don't recommend it. It's large and somewhat complicated if your changes only apply to the stack walker.

I need to be able to build Socorro's stack walker with my changes, and then test them.

I can help with that! Apply this patch to a clone of minidump-stackwalk and then run ./bin/build_stackwalker.sh from the source root directory. When the process ends you'll find the stackwalker executable under the stackwalk/stackwalker.

There's a couple of ifs and buts though: it's obviously not very pretty, every time you build it will do a full profile run (because the script generates the PGO-optimized version of the stackwalker) and gclient might download a ton of stuff but it's faster than using Docker.

Once you've built it you can run it with stackwalker example_dump_file.dmp and it will do everything on its own (including downloading symbols if they are available on our symbol server).

Flags: needinfo?(gsvelto)
Status: NEW → ASSIGNED

Thanks, Gabriele. But the build died partway through with the following error:

    grep: /proc/cpuinfo: No such file or directory

Sounds like it thought it was building on Linux.

I'll dig further tomorrow, and may be able to fix this problem myself. But let me know if you have an easy solution.

Flags: needinfo?(gsvelto)

Right, there's Linuxisms in those build scripts. This one should do.

Attachment #9216892 - Attachment is obsolete: true
Flags: needinfo?(gsvelto)

It seems that my patch is not enough. It's been a while since I built breakpad on macOS and there's something else that's in the way.

I finally managed to build Socorro's minidump_stackwalk, but only inside a Docker image, and only as a Linux binary. Here's how I did it:

  1. Download Docker for Mac, install it, and run it.

  2. At an ordinary Terminal prompt (not one for a Docker container) run

     git clone https://github.com/mozilla-services/minidump-stackwalk.git
    
  3. cd minidump-stackwalk, and while Docker Desktop is running, run

     make build
    
  4. This creates a Docker container, and builds all the appropriate binaries inside it (including minidump_stackwalk).

  5. In the Docker Desktop, run the container and open a "CLI" (command line interface) to it. Confusingly, its "home directory" is /home/app, and the CLI (a Terminal window) opens at /app, but the binaries are in /stackwalk. They're all Linux binaries, but they do run correctly in the container (via the CLI).

So it looks like I need to add my patch to minidump-stackwalk/tree/main/breakpad-patches, re-run make build, then test the altered minidump_stackwalk inside the container (via its CLI). The CLI supports sftp, so I should be able to import copies of minidumps that my altered Firefox has generated.

Docker containers run Linux. And the reason the build scripts have Linuxisms is that they're meant to run on Linux -- either directly or via a Docker container.

You don't need a Docker account to do any of this. That's only for getting access to remote storage.

Running make shell runs /bin/bash inside the minidump-stackwalk Docker image that you built with make build. When the Docker service starts up the container, it mounts the local directory as /app inside the container. Any files in that directory on your host machine will be available in the container in /app. That's how I access files on my host from inside the container.

I'm about to submit my revised patch for review.

I've also now got a patch for Socorro's stackwalker. I built it with make build and tested it using make shell (after having copied a bunch of my own minidumps to a minidumps directory inside the distro). Everything went fine. I'll submit it (or some version of it) for review once my mozilla-central patch has landed on trunk. I assume I should do this using a GitHub pull request.

Surprisingly, my mozilla-central patch was quite easy to write. It's smaller and much neater than the previous version. The only disadvantage (as best I can tell) is that my new patch must iterate a second time through all the crashing process's loaded modules (in addition to the iteration required for MinidumpGenerator::WriteModuleListStream()). But it doesn't touch the file system -- it reads all these modules' __DATA,__crash_info sections from memory. So I don't think this will result in any performance problems.

All these attachments made with my old patch are now obsolete. I'm updating them.

This was made with a new tryserver build, available here, and my HookCase hook library (attached to this bug).

The minidump_stackwalk build is here.

https://treeherder.mozilla.org/jobs?repo=try&revision=652e9c56787f702ceaaf9153561fd81a5edb5ce6

Attachment #9213870 - Attachment is obsolete: true
Pushed by gsvelto@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/7cb0a587e669
Add support to for macOS __crash_info data to Breakpad. r=gsvelto

While reading through my patch for the umpteenth time, I found a bug. It's not something that will effect us now, or in the near future -- only after Apple starts adding new fields to its __crash_info information, and we follow up by adding those fields to Breakpad. Once that happens, minidump_stackwalk would likely crash reading a minidump with fewer "strings" than it expected.

The fix is very simple, and is contained in this new patch. It also removes a superfluous bit of code.

Why don't we just land this on mozilla-central, after my current patch lands there? Let me know what you think, Gabriele.

Sorry I didn't catch this earlier :-(

Flags: needinfo?(gsvelto)
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → 90 Branch

I just ran ./mach bootstrap on a fully up-to-date mozilla tree. As always, doing this refreshed the toolset in ~/.mozbuild, which contains a copy of minidump_stackwalk. This new minidump_stackwalk was built with my patch (I tested it against one of my minidumps containing __crash_info data). So the process of getting new code into the ~/.mozbuild toolset seems to be automatic. I'm glad to see this. I was wondering if it had to be done by hand.

As I mentioned in comment 24, I've got a patch for Socorro's stackwalker. I've updated it with my latest 20-mac-crash-info.patch (updated as per comment 34), and rebuild it and retested it. Everything went fine. I'll wait until Monday. Then if nothing has blown up by then, I'll use this patch to generate a pull request for https://github.com/mozilla-services/minidump-stackwalk.

Flags: needinfo?(gsvelto)

(In reply to Steven Michaud [:smichaud] (Retired) from comment #34)

Sorry I didn't catch this earlier :-(

No problem! File a follow up and we'll fix it.

(Following up comment #36)

I've created my pull request here: https://github.com/mozilla-services/minidump-stackwalk/pull/29.

Pushed by gsvelto@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/a6afee6925f0
Add support to for macOS __crash_info data to Breakpad (followup). r=gsvelto

My pull request at https://github.com/mozilla-services/minidump-stackwalk/pull/29 is about to be landed.

I've opened bug 1709658 to add __crash_info data to the "details" page of crash reports and to make it searchable.

See Also: → 1709658

I just discovered that the copy of /usr/lib/dyld that's loaded into every process has its own __crash_info section, which isn't examined by current code (by my patch for this bug). This wouldn't be hard to fix. I'll open a new bug to do it, once my current patch has baked for a while.

Edit: This is now fixed by bug 1814852.

As of earlier today, bug 1709658 is now fixed, and it's now possible to search on mac crash info at https://crash-stats.mozilla.org/!

Steven, Will, do you want to announce this on the MLs? Or shall I do it? Also probably worth mentioning in the #macdev channel on Matrix.

I'm probably not the best one to do it. I don't even know what "MLs" and "the #macdev channel on Matrix" are :-)

We might want to postpone any kind of public announcement until more data has accumulated, and we've had a chance to test what kinds of searches are possible, and which are the most fruitful. This would let us find (and iron out) any kinks that still exist, and also write up some kind of document on how to search mac_crash_info.

I'd be happy to write that document, or at least a first draft of it. But I'd prefer to wait a couple of weeks or so, to ensure I have a representative amount of data to work with.

By "public" I mean outside of BMO. I've already been spreading the word inside BMO, and maybe others could do more of that.

See Also: → 1711804
See Also: → 1711802
See Also: → 1711801

I just opened three bugs using the mac_crash_info that has already come over the transom. I referenced this bug in all of them. But maybe I should create a mac_crash_info meta bug, and make them reference that bug instead.

Thanks, let's see how these evolve. We might consider putting part of that info in the crash signature if we find some interesting correlations.

I've opened the meta bug I talked about in comment #48.

See Also: 1711804, 1711802, 17118011711944

(In reply to Gabriele Svelto [:gsvelto] from comment #49)

Thanks, let's see how these evolve. We might consider putting part of that info in the crash signature if we find some interesting correlations.

Excellent idea! I've created bug 1711956 to follow up on this.

For the record, crash reports with interesting mac_crash_info have started to appear that have nothing to do with "gpusGenerateCrashLog":

bp-ca7f9643-fdba-4731-98fd-4c8550210519
bp-e4e279f1-ea9b-47f6-80a8-795300210519
bp-0435588b-3896-4b53-8b76-9bd880210518

You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: