Closed Bug 1371390 Opened 7 years ago Closed 5 years ago

Wrong UUIDs for x86_64h libraries in crash reports

Categories

Product: Toolkit :: Crash Reporting
Hardware: All
OS: macOS
Type: defect
Priority: Not set
Severity: critical

Tracking

Status: RESOLVED FIXED
Target Milestone: mozilla72
Tracking Status: firefox72 tracked, fixed

People

(Reporter: mstange, Assigned: smichaud)

References

Details

Attachments

(3 files)

Here's an example crash report: https://crash-stats.mozilla.com/report/index/dbb6f65f-c0ae-449d-b08a-8148c0170603

This crash happened on a Haswell CPU, for which OS X has a custom variant of libobjc.A.dylib.

This particular libobjc.A.dylib contains three architectures:
MODULE mac x86 A32B2573CC083053A0054AFC9FC0E4CE0 libobjc.A.dylib
MODULE mac x86_64 54CD8D1A5C983559B13A932B3D3DD3380 libobjc.A.dylib
MODULE mac x86_64h DC77AA6EA4E4326D8D7F82D63AA88F990 libobjc.A.dylib

The crash report displays the UUID 54CD8D1A5C983559B13A932B3D3DD3380 for it. However, that's wrong: in this crash the process had the x86_64h variant loaded, so the UUID should have been reported as DC77AA6EA4E4326D8D7F82D63AA88F990.

This affects symbolication, and it also affects the "function_offset" field in the raw dump, which is computed incorrectly as a result.
The code that writes out the module list for mac minidumps is here:
https://dxr.mozilla.org/mozilla-central/rev/a49112c7a5765802096b3fc298069b9495436107/toolkit/crashreporter/breakpad-client/mac/handler/minidump_generator.cc#1294

There are separate codepaths for OOP and in-process crash dump generation.
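For context, here's a minimal sketch of the selection that has to happen when a module's UUID is looked up in a fat (universal) binary. This is not the actual Breakpad code; FindSlice is a hypothetical helper, written under the assumption that the on-disk file is a 32-bit fat file (big-endian header, FAT_MAGIC):

    #include <cstdint>
    #include <libkern/OSByteOrder.h>
    #include <mach-o/fat.h>
    #include <mach/machine.h>

    // Sketch: given the raw bytes of a fat binary, find the slice matching
    // both cputype and cpusubtype (with the capability bits in cpusubtype's
    // top byte stripped). Fat headers are stored big-endian on disk.
    const fat_arch* FindSlice(const uint8_t* file, cpu_type_t cputype,
                              cpu_subtype_t cpusubtype) {
      const fat_header* fh = reinterpret_cast<const fat_header*>(file);
      if (OSSwapBigToHostInt32(fh->magic) != FAT_MAGIC) return nullptr;
      uint32_t count = OSSwapBigToHostInt32(fh->nfat_arch);
      const fat_arch* archs =
          reinterpret_cast<const fat_arch*>(file + sizeof(fat_header));
      for (uint32_t i = 0; i < count; ++i) {
        cpu_type_t type = OSSwapBigToHostInt32(archs[i].cputype);
        cpu_subtype_t subtype =
            OSSwapBigToHostInt32(archs[i].cpusubtype) & ~CPU_SUBTYPE_MASK;
        if (type == cputype && subtype == (cpusubtype & ~CPU_SUBTYPE_MASK))
          return &archs[i];  // This slice's offset/size are also big-endian.
      }
      return nullptr;
    }

Matching only cputype would stop at the first x86_64-compatible slice, which is consistent with the wrong (x86_64) UUID being reported on x86_64h machines.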

Markus, do you know if this ever got fixed?

Flags: needinfo?(mstange)

I don't know.

Flags: needinfo?(mstange)

This isn't fixed.

I used a HookCase (https://github.com/steven-michaud/HookCase) hook library to trigger a crash in __CFRunLoopDoSource0() on a MacBook Pro (Retina, 15-inch, Mid 2015) with a Haswell+ CPU, running macOS 10.15.1 (the current Catalina release). The CoreFoundation framework has both x86_64 and x86_64h architectures, as shown by the output of dump_syms -i -a [arch] on my machine's copy of it:

MODULE mac x86_64 09EB9DD025BC37309716FE231CAF2C700 CoreFoundation
MODULE mac x86_64h 15D61616B29B3BDB86244B84A49564850 CoreFoundation

But the stack trace for the crash (bp-dbaba15a-fbbb-48ff-b117-d324e0191103) has incorrect symbols for the CoreFoundation elements of the stack trace:

    1  CoreFoundation  __CFRunLoopServiceMachPort
    2  CoreFoundation  __CFRunLoopRun
    3  CoreFoundation  __CFRunLoopModeIsEmpty

These should be:

    1  CoreFoundation  __CFRunLoopDoSources0
    2  CoreFoundation  __CFRunLoopRun
    3  CoreFoundation  __CFRunLoopRunSpecific

In the raw dump (https://crash-stats.mozilla.org/report/index/dbaba15a-fbbb-48ff-b117-d324e0191103#tab-rawdump) I notice that the CoreFoundation module's "debug_id" is 09EB9DD025BC37309716FE231CAF2C700 -- which corresponds to the incorrect architecture.

And in the result of minidump_stackwalk -m on the crash minidump, the following shows up:

Module|CoreFoundation||CoreFoundation|09EB9DD025BC37309716FE231CAF2C700|0x7fff2bd1b000|0x7fff2c19afff|0

So Mozilla's Breakpad client has already picked the wrong architecture before the crash minidump gets uploaded to Socorro and processed there.

The symbol server now has symbols for both architectures (x86_64 and x86_64h) of CoreFoundation. The symbols I scraped at bug 1588843 comment 63 are on the symbol server (I just checked by running gathersymbols.py again). So that can't be the cause of this problem.

I'll look for the current version of the code that Ted mentioned in comment 1, and see what possible errors I can find there.

Here's the same crash on a pre-Haswell Mac running macOS 10.15.1 (a VMware virtual machine running on a mid 2012 Mac Pro):

bp-0b2c47f9-9734-4833-a98d-a4b740191103

Everything about the CoreFoundation module is correct in this one.

Here's the hook library I used to trigger the crashes I mentioned above, posted as a diff with https://github.com/steven-michaud/HookCase/tree/master/HookLibraryTemplate.

This is a very serious problem -- one that affects the integrity of many of Socorro's Mac crash stacks. All new Macs since 2014 have Haswell+ CPUs, so over time the proportion of crash stacks affected by this bug can only increase (https://en.wikipedia.org/wiki/List_of_Macintosh_models_grouped_by_CPU_type).

Severity: normal → critical
Type: enhancement → defect

I'll tentatively assign this bug to myself.

Assignee: nobody → smichaud

Here are the modules with an x86_64h architecture (on macOS 10.15.1) that are most likely to appear in Socorro crash stacks:

/usr/lib/system/libsystem_m.dylib
/usr/lib/libobjc.A.dylib

/System/Library/Frameworks/CoreGraphics.framework/Versions/A/CoreGraphics
/System/Library/Frameworks/ImageIO.framework/Versions/A/ImageIO
/System/Library/Frameworks/CoreFoundation.framework/Versions/A/CoreFoundation
/System/Library/Frameworks/CoreData.framework/Versions/A/CoreData

[Tracking Requested - why for this release]: Might be good to follow this crash / symbols issue.

I may have a fix for this bug, for which I've started a tryserver build:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=b05a2c3f3108b29dd09bd57dc7db623f6b47ab7b

I've arranged for its symbols to be uploaded to the symbol server. If all goes well, it should be ready for use by sometime tomorrow.

Here's the tryserver build:

https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/YAlcid7SQa2PyYBWmXgTDw/runs/0/artifacts/public/build/target.dmg

Its symbols are on the symbol server. But, oddly, they don't get demangled in crash logs. It must be due to how I generated the symbol files that were uploaded.

Here's a crash report from the tryserver build, triggered by my HookCase hook library from comment #6 on my Haswell+ MacBook Pro:

bp-0045b89b-8903-4b4b-9ec2-74bd00191115

Notice that the CoreFoundation symbols are correctly symbolicated.

Pushed by smichaud@pobox.com:
https://hg.mozilla.org/integration/autoland/rev/da61ebbdb3a5
Pay attention to macho images' cpusubtype when creating minidumps. r=gsvelto
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla72

We've got a problem. It seems that, as of build id 20191115214405 (the first mozilla-central nightly that contains this patch), no content process crash stacks are getting symbolicated properly:

https://crash-stats.mozilla.com/search/?build_id=%3E%3D20191115214405&platform=Mac%20OS%20X&process_type=content&date=%3E%3D2019-11-15T16%3A14%3A00.000Z&date=%3C2019-11-16T16%3A14%3A00.000Z&_facets=signature&_sort=-date&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-signature

Things are fine (or at least much better) for earlier mozilla-central builds:

https://crash-stats.mozilla.com/search/?build_id=%3C20191115214405&version=72.0a1&platform=Mac%20OS%20X&process_type=content&date=%3E%3D2019-11-15T16%3A16%3A00.000Z&date=%3C2019-11-16T16%3A16%3A00.000Z&_facets=signature&_sort=-date&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-signature

I don't know if it's my patch that triggered this (and I don't yet know how it could have), but it seems likely. I'll be doing local builds to see if I can replicate the problem, then trying to find the culprit and, if it's this patch, the solution. I did test my patch on content process crashes, and it worked reasonably well (apart from the probable effects of bug 1594065 and bug 1594078).

I'll be working on this. If I don't come up with a solution today (and if it is my patch that's triggered the problem), I'll back out my patch.

The crash stacks with hook.dylib in them are my tests. The patch for bug 1516367 landed in the same mozilla-central nightly or the one just before. It may also (somehow) be involved here.

A question for everyone here: can I just back out my patch on my own authority? Would I use Lando to do that? If so, how?

The patch for bug 1516376 landed in the 20191115095319 mozilla-central nightly (the one just before the nightly where my patch landed).

In limited testing with a local build, backing out my patch doesn't help with the symbolication of content process crash stacks. So now I'm wondering about changes that may have happened on Socorro or Tecken.

The patch for bug 1516367 landed in the same mozilla-central nightly or the one just before. It may also (somehow) be involved here.

For what it's worth, disabling minidump-analyzer in the current mozilla-central nightly (build id 20191116051316) doesn't help. I disabled it by renaming it (which prevents other FF code from finding it and running it).

It's interesting that a number of crash stacks with failed symbolication have headers (i.e. signatures) that are symbolicated (whether correctly or not I don't know):

bp-682384d3-70fc-4960-9005-3e12b0191116
bp-8052c734-4bf5-473b-9d6c-2a08f0191116
bp-148599d4-d7ef-420d-bf64-270910191116

If I don't come up with a solution today (and if it is my patch that's triggered the problem), I'll back out my patch.

I haven't found a solution. But as best I can tell, backing out my patch won't fix the failed symbolication of content process crash stacks. So I'm going to sit tight and await further developments.

I quickly investigated on my machine and when I dump out a minidump with your patch applied the various modules have their debug ID set to 000000000000000000000000000000000 which causes the symbol machinery to be unable to find symbols. ATM it's unclear if it's the minidump that contains the wrong debug ID or if the tools are interpreting it erroneously. Backing out will fix the problem until we solve this issue.

I found a hint of what might be going wrong. This invocation of MachoIdentifier() returns false with your patch applied, but returns true without it.

I checked today's nightly and confirmed what you say. I also checked my local build, with my patch backed out, and found that the debug IDs are no longer zeroed. So yes, let's back out my patch. I'll dig deeper and then submit a revised patch.

Thanks for looking into this. I should have thought to use minidump_dump (as I assume you did).

I don't know how to get the patch backed out. Please tell me how to do it, or please contact the appropriate people yourself.

By the way, in my tests at least, it's only minidumps for the content processes that have their debug IDs zeroed. Firefox process minidumps are fine.

I don't know how to get the patch backed out. Please tell me how to do it, or please contact the appropriate people yourself.

I just asked the current sheriff to do it in #developers on IRC. That procedure doesn't seem to have changed since the days when I was a Mozilla employee :-)

Status: RESOLVED → REOPENED
Flags: needinfo?(smichaud)
Resolution: FIXED → ---
Target Milestone: mozilla72 → ---

Thanks!

Flags: needinfo?(smichaud)

I'm about to post a revision of my patch. I basically made three changes:

  1. I found a fix for the problem that gsvelto reported in comment 24 and comment 25 (see the sketch after this list). My mistake was to dereference a pointer after the original object had been resized, and therefore moved in memory: I invoked header->cpusubtype after the second call to ReadTaskMemory(), but by then it referenced random memory, and the values I got were always incorrect.

  2. I noticed that Apple code which references mach_header.cpusubtype always ANDs it with ~CPU_SUBTYPE_MASK; apparently the top 8 bits are reserved for "feature flags", including CPU_SUBTYPE_LIB64 == 0x80000000. (This masking also appears in the sketch after this list.) I actually saw one instance of this as I was debugging. There are lots of references to mach_header.cpusubtype in Breakpad code. I didn't fix all of them -- only those relevant to the code used from XUL, which creates minidumps. Later I'll post another patch for the code used by utilities like minidump_dump and minidump_stackwalk. But since those read minidumps, the changes aren't as urgent.

  3. Last and most importantly, I fixed how Breakpad code computes the "slide" for modules in the "dyld shared cache".
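Here is a minimal sketch of the first two fixes, under stated assumptions: the std::vector buffer and the resize step (standing in for a second ReadTaskMemory()-style read) are hypothetical, not the actual Breakpad code.

    #include <cstdint>
    #include <mach-o/loader.h>
    #include <mach/machine.h>
    #include <vector>

    // Hypothetical sketch: `buffer` holds bytes already read from the target
    // task, and `header` aliases that storage.
    cpu_subtype_t ReadSubtype(std::vector<uint8_t>& buffer) {
      const mach_header_64* header =
          reinterpret_cast<const mach_header_64*>(buffer.data());

      // Fix 1: copy the values we need *before* anything can resize the
      // buffer. Fix 2: mask off the capability bits (e.g. CPU_SUBTYPE_LIB64)
      // in the top byte of cpusubtype before using it in comparisons.
      cpu_subtype_t subtype = header->cpusubtype & ~CPU_SUBTYPE_MASK;
      uint32_t sizeofcmds = header->sizeofcmds;

      // A second read grows the buffer; this can reallocate the storage and
      // leave `header` dangling. The original bug was dereferencing
      // header->cpusubtype after a step like this one.
      buffer.resize(buffer.size() + sizeofcmds);

      return subtype;
    }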

To implement ASLR, all modules are "slid" by a random amount whenever they're loaded into memory. To know where a particular module (like XUL or the CoreFoundation framework) is loaded into memory, one needs to know both its "original" base address and the amount by which it has been slid. When minidump_stackwalk is trying to symbolicate a given address in memory, it needs to be able to correctly identify the module it points to.

The dyld shared cache is a single module into which commonly used system dylibs and frameworks are incorporated. dyld maps it into every process at load time. The component modules all have the same slide. For some time (and maybe since the very beginning), Breakpad code has computed the slide for these modules incorrectly, by assuming that it can use the same procedure it (correctly) uses for modules not in the dyld shared cache. Breakpad only uses this code to create minidumps for other processes (child processes like the content process). But as a result, almost all system calls in content process crash stacks are either symbolicated incorrectly or not at all.

My patch fixes this problem. The "shared cache slide" (sharedCacheSlide) is stored in the dyld_all_image_infos structure. This field is only available on OS X 10.7 and later. But Firefox only supports OS X 10.9 and later, so we don't have to worry about this.
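For illustration, here's a minimal sketch of reading that field from another task, assuming the reader and target have the same word size. GetSharedCacheSlide() is a hypothetical helper; TASK_DYLD_INFO and dyld_all_image_infos are the real Mach/dyld interfaces involved:

    #include <mach-o/dyld_images.h>
    #include <mach/mach.h>

    // Sketch: fetch the target task's dyld_all_image_infos and read the slide
    // shared by every image in the dyld shared cache. A module's actual load
    // address is then its preferred (on-disk) base address plus this slide.
    bool GetSharedCacheSlide(task_t task, uintptr_t* slide) {
      task_dyld_info_data_t dyld_info;
      mach_msg_type_number_t count = TASK_DYLD_INFO_COUNT;
      if (task_info(task, TASK_DYLD_INFO,
                    reinterpret_cast<task_info_t>(&dyld_info),
                    &count) != KERN_SUCCESS) {
        return false;
      }
      vm_offset_t data = 0;
      mach_msg_type_number_t data_size = 0;
      if (vm_read(task, dyld_info.all_image_info_addr,
                  dyld_info.all_image_info_size, &data,
                  &data_size) != KERN_SUCCESS) {
        return false;
      }
      const dyld_all_image_infos* infos =
          reinterpret_cast<const dyld_all_image_infos*>(data);
      // sharedCacheSlide was added in version 12 of the structure (OS X 10.7).
      bool ok = infos->version >= 12;
      if (ok) *slide = infos->sharedCacheSlide;
      vm_deallocate(mach_task_self(), data, data_size);
      return ok;
    }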

Pushed by smichaud@pobox.com:
https://hg.mozilla.org/integration/autoland/rev/0c63dcd7a1c6
Pay attention to macho images' cpusubtype when creating minidumps (revised). r=gsvelto
Status: REOPENED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla72
Regressions: 1652865
