Closed Bug 1371390 Opened 7 years ago Closed 5 years ago

Wrong UUIDs for x86_64h libraries in crash reports

Categories

Product: Toolkit :: Crash Reporting
Hardware: All
OS: macOS
Type: defect
Priority: Not set
Severity: critical

Tracking

Status: RESOLVED FIXED
Target Milestone: mozilla72
Tracking Status: firefox72 tracked, fixed

People

(Reporter: mstange, Assigned: smichaud)

References

Details

Attachments

(3 files)

Here's an example crash report: https://crash-stats.mozilla.com/report/index/dbb6f65f-c0ae-449d-b08a-8148c0170603

This crash happened on a Haswell CPU, for which OS X has a custom variant of libobjc.A.dylib.

This particular libobjc.A.dylib contains three architectures:
MODULE mac x86 A32B2573CC083053A0054AFC9FC0E4CE0 libobjc.A.dylib
MODULE mac x86_64 54CD8D1A5C983559B13A932B3D3DD3380 libobjc.A.dylib
MODULE mac x86_64h DC77AA6EA4E4326D8D7F82D63AA88F990 libobjc.A.dylib

The crash report displays the UUID 54CD8D1A5C983559B13A932B3D3DD3380 for it. However, that's wrong: in this crash the process had the x86_64h variant loaded, so the UUID should have been reported as DC77AA6EA4E4326D8D7F82D63AA88F990.

This affects symbolication, and it also affects the "function_offset" field in the raw dump, which is computed incorrectly as a result.
The code that writes out the module list for mac minidumps is here:
https://dxr.mozilla.org/mozilla-central/rev/a49112c7a5765802096b3fc298069b9495436107/toolkit/crashreporter/breakpad-client/mac/handler/minidump_generator.cc#1294

There are separate codepaths for OOP and in-process crash dump generation.
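For context, here's a minimal sketch of the selection that has to happen when a module's UUID is looked up in a fat (universal) binary. This is not the actual Breakpad code; FindSlice is a hypothetical helper, written under the assumption that the on-disk file is a 32-bit fat file (big-endian header, FAT_MAGIC):

    #include <cstdint>
    #include <libkern/OSByteOrder.h>
    #include <mach-o/fat.h>
    #include <mach/machine.h>

    // Sketch: given the raw bytes of a fat binary, find the slice matching
    // both cputype and cpusubtype (with the capability bits in cpusubtype's
    // top byte stripped). Fat headers are stored big-endian on disk.
    const fat_arch* FindSlice(const uint8_t* file, cpu_type_t cputype,
                              cpu_subtype_t cpusubtype) {
      const fat_header* fh = reinterpret_cast<const fat_header*>(file);
      if (OSSwapBigToHostInt32(fh->magic) != FAT_MAGIC) return nullptr;
      uint32_t count = OSSwapBigToHostInt32(fh->nfat_arch);
      const fat_arch* archs =
          reinterpret_cast<const fat_arch*>(file + sizeof(fat_header));
      for (uint32_t i = 0; i < count; ++i) {
        cpu_type_t type = OSSwapBigToHostInt32(archs[i].cputype);
        cpu_subtype_t subtype =
            OSSwapBigToHostInt32(archs[i].cpusubtype) & ~CPU_SUBTYPE_MASK;
        if (type == cputype && subtype == (cpusubtype & ~CPU_SUBTYPE_MASK))
          return &archs[i];  // This slice's offset/size are also big-endian.
      }
      return nullptr;
    }

Matching only cputype would stop at the first x86_64-compatible slice, which is consistent with the wrong (x86_64) UUID being reported on x86_64h machines.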

Markus, do you know if this ever got fixed?

Flags: needinfo?(mstange)

I don't know.

Flags: needinfo?(mstange)

This isn't fixed.

I used a HookCase (https://github.com/steven-michaud/HookCase) hook library to trigger a crash in __CFRunLoopDoSource0() on a MacBook Pro (Retina, 15-inch, Mid 2015) with a Haswell+ CPU, running macOS 10.15.1 (the current Catalina release). The CoreFoundation framework has both x86_64 and x86_64h architectures, as shown by the output of dump_syms -i -a [arch] on my machine's copy of it:

MODULE mac x86_64 09EB9DD025BC37309716FE231CAF2C700 CoreFoundation
MODULE mac x86_64h 15D61616B29B3BDB86244B84A49564850 CoreFoundation

But the stack trace for the crash (bp-dbaba15a-fbbb-48ff-b117-d324e0191103) has incorrect symbols for the CoreFoundation elements of the stack trace:

    1  CoreFoundation  __CFRunLoopServiceMachPort
    2  CoreFoundation  __CFRunLoopRun
    3  CoreFoundation  __CFRunLoopModeIsEmpty

These should be:

    1  CoreFoundation  __CFRunLoopDoSources0
    2  CoreFoundation  __CFRunLoopRun
    3  CoreFoundation  __CFRunLoopRunSpecific

In the raw dump (https://crash-stats.mozilla.org/report/index/dbaba15a-fbbb-48ff-b117-d324e0191103#tab-rawdump) I notice that the CoreFoundation module's "debug_id" is 09EB9DD025BC37309716FE231CAF2C700 -- which corresponds to the incorrect architecture.

And in the result of minidump_stackwalk -m on the crash minidump, the following shows up:

Module|CoreFoundation||CoreFoundation|09EB9DD025BC37309716FE231CAF2C700|0x7fff2bd1b000|0x7fff2c19afff|0

So Mozilla's Breakpad client has already picked the wrong architecture before the crash minidump gets uploaded to Socorro and processed there.

The symbol server now has symbols for both architectures (x86_64 and x86_64h) of CoreFoundation. The symbols I scraped at bug 1588843 comment 63 are on the symbol server (I just checked by running gathersymbols.py again). So that can't be the cause of this problem.

I'll look for the current version of the code that Ted mentioned in comment 1, and see what possible errors I can find there.

Here's the same crash on a pre-Haswell Mac running macOS 10.15.1 (a VMware virtual machine running on a mid 2012 Mac Pro):

bp-0b2c47f9-9734-4833-a98d-a4b740191103

Everything about the CoreFoundation module is correct in this one.

Here's the hook library I used to trigger the crashes I mentioned above, posted as a diff with https://github.com/steven-michaud/HookCase/tree/master/HookLibraryTemplate.

This is a very serious problem -- one that affects the integrity of many of Socorro's Mac crash stacks. All new Macs since 2014 have Haswell+ CPUs, so over time the proportion of crash stacks affected by this bug can only increase (https://en.wikipedia.org/wiki/List_of_Macintosh_models_grouped_by_CPU_type).

Severity: normal → critical
Type: enhancement → defect

I'll tentatively assign this bug to myself.

Assignee: nobody → smichaud

Here are the modules with an x86_64h architecture (on macOS 10.15.1) that are most likely to appear in Socorro crash stacks:

/usr/lib/system/libsystem_m.dylib
/usr/lib/libobjc.A.dylib

/System/Library/Frameworks/CoreGraphics.framework/Versions/A/CoreGraphics
/System/Library/Frameworks/ImageIO.framework/Versions/A/ImageIO
/System/Library/Frameworks/CoreFoundation.framework/Versions/A/CoreFoundation
/System/Library/Frameworks/CoreData.framework/Versions/A/CoreData

[Tracking Requested - why for this release]: Might be good to follow this crash / symbols issue.

I may have a fix for this bug, for which I've started a tryserver build:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=b05a2c3f3108b29dd09bd57dc7db623f6b47ab7b

I've arranged for its symbols to be uploaded to the symbol server. If all goes well, it should be ready for use by sometime tomorrow.

Here's the tryserver build:

https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/YAlcid7SQa2PyYBWmXgTDw/runs/0/artifacts/public/build/target.dmg

Its symbols are on the symbol server. But, oddly, they don't get demangled in crash logs. It must be due to how I generated the symbol files that were uploaded.

Here's a crash report from the tryserver build, triggered by my HookCase hook library from comment #6 on my Haswell+ MacBook Pro:

bp-0045b89b-8903-4b4b-9ec2-74bd00191115

Notice that the CoreFoundation symbols are correctly symbolicated.

Pushed by smichaud@pobox.com:
https://hg.mozilla.org/integration/autoland/rev/da61ebbdb3a5
Pay attention to macho images' cpusubtype when creating minidumps. r=gsvelto
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla72

We've got a problem. It seems that, as of build id 20191115214405 (the first mozilla-central nightly that contains this patch), no content process crash stacks are getting symbolicated properly:

https://crash-stats.mozilla.com/search/?build_id=%3E%3D20191115214405&platform=Mac%20OS%20X&process_type=content&date=%3E%3D2019-11-15T16%3A14%3A00.000Z&date=%3C2019-11-16T16%3A14%3A00.000Z&_facets=signature&_sort=-date&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-signature

Things are fine (or at least much better) for earlier mozilla-central builds:

https://crash-stats.mozilla.com/search/?build_id=%3C20191115214405&version=72.0a1&platform=Mac%20OS%20X&process_type=content&date=%3E%3D2019-11-15T16%3A16%3A00.000Z&date=%3C2019-11-16T16%3A16%3A00.000Z&_facets=signature&_sort=-date&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-signature

I don't know if it's my patch that triggered this (and I don't yet know how it could have), but it seems likely. I'll be doing local builds to see if I can replicate the problem, then trying to find the culprit and, if it's this patch, the solution. I did test my patch on content process crashes, and it worked reasonably well (apart from the probable effects of bug 1594065 and bug 1594078).

I'll be working on this. If I don't come up with a solution today (and if it is my patch that's triggered the problem), I'll back out my patch.

The crash stacks with hook.dylib in them are my tests. The patch for bug 1516367 landed in the same mozilla-central nightly or the one just before. It may also (somehow) be involved here.

A question for everyone here: can I just back out my patch on my own authority? Would I use Lando to do that? If so, how?

The patch for bug 1516376 landed in the 20191115095319 mozilla-central nightly (the one just before the nightly where my patch landed).

In limited testing with a local build, backing out my patch doesn't help with the symbolication of content process crash stacks. So now I'm wondering about changes that may have happened on Socorro or Tecken.

The patch for bug 1516367 landed in the same mozilla-central nightly or the one just before. It may also (somehow) be involved here.

For what it's worth, disabling minidump-analyzer in the current mozilla-central nightly (build id 20191116051316) doesn't help. I disabled it by renaming it (which prevents other FF code from finding it and running it).

It's interesting that a number of crash stacks with failed symbolication have headers (i.e. signatures) that are symbolicated (whether correctly or not I don't know):

bp-682384d3-70fc-4960-9005-3e12b0191116
bp-8052c734-4bf5-473b-9d6c-2a08f0191116
bp-148599d4-d7ef-420d-bf64-270910191116

If I don't come up with a solution today (and if it is my patch that's triggered the problem), I'll back out my patch.

I haven't found a solution. But as best I can tell, backing out my patch won't fix the failed symbolication of content process crash stacks. So I'm going to sit tight and await further developments.

I quickly investigated on my machine and when I dump out a minidump with your patch applied the various modules have their debug ID set to 000000000000000000000000000000000 which causes the symbol machinery to be unable to find symbols. ATM it's unclear if it's the minidump that contains the wrong debug ID or if the tools are interpreting it erroneously. Backing out will fix the problem until we solve this issue.

I found a hint of what might be going wrong. This invocation of MachoIdentifier() returns false with your patch applied, but returns true without it.

I checked today's nightly and confirmed what you say. I also checked my local build, with my patch backed out, and found that the debug IDs are no longer zeroed. So yes, let's back out my patch. I'll dig deeper and then submit a revised patch.

Thanks for looking into this. I should have thought to use minidump_dump (as I assume you did).

I don't know how to get the patch backed out. Please tell me how to do it, or please contact the appropriate people yourself.

By the way, in my tests at least, it's only minidumps for the content processes that have their debug IDs zeroed. Firefox process minidumps are fine.

I don't know how to get the patch backed out. Please tell me how to do it, or please contact the appropriate people yourself.

I just asked the current sheriff to do it in #developers on IRC. That procedure doesn't seem to have changed since the days when I was a Mozilla employee :-)

Status: RESOLVED → REOPENED
Flags: needinfo?(smichaud)
Resolution: FIXED → ---
Target Milestone: mozilla72 → ---

Thanks!

Flags: needinfo?(smichaud)

I'm about to post a revision of my patch. I basically made three changes:

  1. I found a fix for the problem that gsvelto reported in comment 24 and comment 25 (see the sketch after this list). My mistake was to dereference a pointer after the original object had been resized, and therefore moved in memory: I invoked header->cpusubtype after the second call to ReadTaskMemory(), but by then it referenced random memory, and the values I got were always incorrect.

  2. I noticed that Apple code which references mach_header.cpusubtype always ANDs it with ~CPU_SUBTYPE_MASK; apparently the top 8 bits are reserved for "feature flags", including CPU_SUBTYPE_LIB64 == 0x80000000. (This masking also appears in the sketch after this list.) I actually saw one instance of this as I was debugging. There are lots of references to mach_header.cpusubtype in Breakpad code. I didn't fix all of them -- only those relevant to the code used from XUL, which creates minidumps. Later I'll post another patch for the code used by utilities like minidump_dump and minidump_stackwalk. But since those read minidumps, the changes aren't as urgent.

  3. Last and most importantly, I fixed how Breakpad code computes the "slide" for modules in the "dyld shared cache".
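Here is a minimal sketch of the first two fixes, under stated assumptions: the std::vector buffer and the resize step (standing in for a second ReadTaskMemory()-style read) are hypothetical, not the actual Breakpad code.

    #include <cstdint>
    #include <mach-o/loader.h>
    #include <mach/machine.h>
    #include <vector>

    // Hypothetical sketch: `buffer` holds bytes already read from the target
    // task, and `header` aliases that storage.
    cpu_subtype_t ReadSubtype(std::vector<uint8_t>& buffer) {
      const mach_header_64* header =
          reinterpret_cast<const mach_header_64*>(buffer.data());

      // Fix 1: copy the values we need *before* anything can resize the
      // buffer. Fix 2: mask off the capability bits (e.g. CPU_SUBTYPE_LIB64)
      // in the top byte of cpusubtype before using it in comparisons.
      cpu_subtype_t subtype = header->cpusubtype & ~CPU_SUBTYPE_MASK;
      uint32_t sizeofcmds = header->sizeofcmds;

      // A second read grows the buffer; this can reallocate the storage and
      // leave `header` dangling. The original bug was dereferencing
      // header->cpusubtype after a step like this one.
      buffer.resize(buffer.size() + sizeofcmds);

      return subtype;
    }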

To implement ASLR, all modules are "slid" by a random amount whenever they're loaded into memory. To know where a particular module (like XUL or the CoreFoundation framework) is loaded into memory, one needs to know both its "original" base address and the amount by which it has been slid. When minidump_stackwalk is trying to symbolicate a given address in memory, it needs to be able to correctly identify the module it points to.

The dyld shared cache is a single module into which commonly used system dylibs and frameworks are incorporated. dyld maps it into every process at load time. The component modules all have the same slide. For some time (and maybe since the very beginning), Breakpad code has computed the slide for these modules incorrectly, by assuming that it can use the same procedure it (correctly) uses for modules not in the dyld shared cache. Breakpad only uses this code to create minidumps for other processes (child processes like the content process). But as a result, almost all system calls in content process crash stacks are either symbolicated incorrectly or not at all.

My patch fixes this problem. The "shared cache slide" (sharedCacheSlide) is stored in the dyld_all_image_infos structure. This field is only available on OS X 10.7 and later. But Firefox only supports OS X 10.9 and later, so we don't have to worry about this.
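For illustration, here's a minimal sketch of reading that field from another task, assuming the reader and target have the same word size. GetSharedCacheSlide() is a hypothetical helper; TASK_DYLD_INFO and dyld_all_image_infos are the real Mach/dyld interfaces involved:

    #include <mach-o/dyld_images.h>
    #include <mach/mach.h>

    // Sketch: fetch the target task's dyld_all_image_infos and read the slide
    // shared by every image in the dyld shared cache. A module's actual load
    // address is then its preferred (on-disk) base address plus this slide.
    bool GetSharedCacheSlide(task_t task, uintptr_t* slide) {
      task_dyld_info_data_t dyld_info;
      mach_msg_type_number_t count = TASK_DYLD_INFO_COUNT;
      if (task_info(task, TASK_DYLD_INFO,
                    reinterpret_cast<task_info_t>(&dyld_info),
                    &count) != KERN_SUCCESS) {
        return false;
      }
      vm_offset_t data = 0;
      mach_msg_type_number_t data_size = 0;
      if (vm_read(task, dyld_info.all_image_info_addr,
                  dyld_info.all_image_info_size, &data,
                  &data_size) != KERN_SUCCESS) {
        return false;
      }
      const dyld_all_image_infos* infos =
          reinterpret_cast<const dyld_all_image_infos*>(data);
      // sharedCacheSlide was added in version 12 of the structure (OS X 10.7).
      bool ok = infos->version >= 12;
      if (ok) *slide = infos->sharedCacheSlide;
      vm_deallocate(mach_task_self(), data, data_size);
      return ok;
    }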

Pushed by smichaud@pobox.com:
https://hg.mozilla.org/integration/autoland/rev/0c63dcd7a1c6
Pay attention to macho images' cpusubtype when creating minidumps (revised). r=gsvelto
Status: REOPENED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla72
Regressions: 1652865
