Wrong UUIDs for x86_64h libraries in crash reports
Categories
(Toolkit :: Crash Reporting, defect)
People
(Reporter: mstange, Assigned: smichaud)
Attachments
(3 files)
Comment 1•8 years ago
|
||
Assignee | ||
Comment 2•5 years ago
|
||
Markus, do you know if this ever got fixed?
Assignee | ||
Comment 4•5 years ago
•
|
||
This isn't fixed.
I used a HookCase (https://github.com/steven-michaud/HookCase) hook library to trigger a crash in __CFRunLoopDoSources0() on a MacBook Pro (Retina, 15-inch, Mid 2015) with a Haswell+ CPU, running macOS 10.15.1 (the current Catalina release). The CoreFoundation framework has both x86_64 and x86_64h architectures, as shown by the output of dump_syms -i -a [arch] on my machine's copy of it:
MODULE mac x86_64 09EB9DD025BC37309716FE231CAF2C700 CoreFoundation
MODULE mac x86_64h 15D61616B29B3BDB86244B84A49564850 CoreFoundation
But the stack trace for the crash (bp-dbaba15a-fbbb-48ff-b117-d324e0191103) has incorrect symbols for the CoreFoundation elements of the stack trace:
1 CoreFoundation __CFRunLoopServiceMachPort
2 CoreFoundation __CFRunLoopRun
3 CoreFoundation __CFRunLoopModeIsEmpty
These should be:
1 CoreFoundation __CFRunLoopDoSources0
2 CoreFoundation __CFRunLoopRun
3 CoreFoundation __CFRunLoopRunSpecific
In the raw dump (https://crash-stats.mozilla.org/report/index/dbaba15a-fbbb-48ff-b117-d324e0191103#tab-rawdump) I notice that the CoreFoundation module's "debug_id" is 09EB9DD025BC37309716FE231CAF2C700 -- which corresponds to the incorrect architecture.
And in the result of minidump_stackwalk -m on the crash minidump, the following shows up:
Module|CoreFoundation||CoreFoundation|09EB9DD025BC37309716FE231CAF2C700|0x7fff2bd1b000|0x7fff2c19afff|0
So Mozilla's breakpad client has already picked the wrong architecture before the crash minidump gets uploaded to Socorro, and processed there.
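The comparison made above can be reproduced with a short script. This is only a sketch: it parses the dump_syms MODULE lines and the minidump_stackwalk -m Module line quoted in this comment, and the helper names are made up for illustration. It also assumes that a Mac debug id is the 32 hex digits of the Mach-O UUID followed by a one-digit age suffix ("0" in the ids shown here).

```python
import uuid

# Lines copied from the dump_syms and minidump_stackwalk output quoted above.
DUMP_SYMS_MODULE_LINES = [
    "MODULE mac x86_64 09EB9DD025BC37309716FE231CAF2C700 CoreFoundation",
    "MODULE mac x86_64h 15D61616B29B3BDB86244B84A49564850 CoreFoundation",
]
STACKWALK_MODULE_LINE = (
    "Module|CoreFoundation||CoreFoundation|"
    "09EB9DD025BC37309716FE231CAF2C700|0x7fff2bd1b000|0x7fff2c19afff|0"
)


def debug_id_from_stackwalk(line):
    """Return the debug id field of a 'Module|...' line (hypothetical helper)."""
    fields = line.split("|")
    if fields[0] != "Module":
        raise ValueError("not a Module line")
    return fields[4]


def arch_for_debug_id(module_lines, debug_id):
    """Return the architecture whose MODULE line carries this debug id, or None."""
    for line in module_lines:
        record, _os, arch, ident, _name = line.split(" ", 4)
        if record == "MODULE" and ident == debug_id:
            return arch
    return None


def debug_id_to_uuid(debug_id):
    """Assumption: the first 32 hex digits are the UUID, the rest is the age."""
    return uuid.UUID(debug_id[:32])


reported = debug_id_from_stackwalk(STACKWALK_MODULE_LINE)
print(arch_for_debug_id(DUMP_SYMS_MODULE_LINES, reported))
print(str(debug_id_to_uuid(reported)).upper())
```

For the lines quoted here, the id reported in the minidump matches the x86_64 slice rather than the x86_64h slice that a Haswell+ machine actually loads.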
The symbol server now has symbols for both architectures (x86_64 and x86_64h) of CoreFoundation. The symbols I scraped at bug 1588843 comment 63 are on the symbol server (I just checked by running gathersymbols.py again). So that can't be the cause of this problem.
I'll look for the current version of the code that Ted mentioned in comment 1, and see what possible errors I can find there.
Assignee | ||
Comment 5•5 years ago
|
||
Here's the same crash on a pre-Haswell Mac running macOS 10.15.1 (a VMware virtual machine running on a mid 2012 Mac Pro):
bp-0b2c47f9-9734-4833-a98d-a4b740191103
Everything about the CoreFoundation module is correct in this one.
Assignee | ||
Comment 6•5 years ago
|
||
Here's the hook library I used to trigger the crashes I mentioned above, posted as a diff with https://github.com/steven-michaud/HookCase/tree/master/HookLibraryTemplate.
Assignee | ||
Comment 7•5 years ago
|
||
This is a very serious problem -- one that affects the integrity of many of Socorro's Mac crash stacks. All new Macs since 2014 have Haswell+ CPUs. So over time the proportion of crash stacks affected by this bug can only increase (https://en.wikipedia.org/wiki/List_of_Macintosh_models_grouped_by_CPU_type).
Assignee | ||
Comment 8•5 years ago
|
||
I'll tentatively assign this bug to myself.
Assignee | ||
Comment 9•5 years ago
|
||
Here are the modules that have an x86_64h architecture (on macOS 10.15.1) which are most likely to be included in Socorro crash stacks:
/usr/lib/system/libsystem_m.dylib
/usr/lib/libobjc.A.dylib
/System/Library/Frameworks/CoreGraphics.framework/Versions/A/CoreGraphics
/System/Library/Frameworks/ImageIO.framework/Versions/A/ImageIO
/System/Library/Frameworks/CoreFoundation.framework/Versions/A/CoreFoundation
/System/Library/Frameworks/CoreData.framework/Versions/A/CoreData
Comment 10•5 years ago
|
||
[Tracking Requested - why for this release]: Might be good to follow this crash / symbols issue.
Assignee | ||
Comment 11•5 years ago
|
||
I may have a fix for this bug, for which I've started a tryserver build:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=b05a2c3f3108b29dd09bd57dc7db623f6b47ab7b
I've arranged for its symbols to be uploaded to the symbol server. If all goes well, it should be ready for use by sometime tomorrow.
Assignee | ||
Comment 12•5 years ago
|
||
Here's the tryserver build:
Its symbols are on the symbol server. But, oddly, they don't get demangled in crash logs. It must be how I generated the symbol files that were uploaded.
Assignee | ||
Comment 13•5 years ago
|
||
Here's a crash report from the tryserver build, triggered by my HookCase hook library from comment #6 on my Haswell+ MacBook Pro:
bp-0045b89b-8903-4b4b-9ec2-74bd00191115
Notice that the CoreFoundation symbols are correctly symbolicated.
Assignee | ||
Comment 14•5 years ago
|
||
Comment 15•5 years ago
|
||
Comment 16•5 years ago
|
||
bugherder |
Assignee | ||
Comment 17•5 years ago
|
||
We've got a problem. It seems that, as of build id 20191115214405 (the first mozilla-central nightly that contains this patch), no content process crash stacks are getting symbolicated properly:
Things are fine (or at least much better) for earlier mozilla-central builds:
I don't know if it's my patch that triggered this (and I don't yet know how it could have), but it seems likely. I'll do local builds to see if I can replicate the problem, then try to find the culprit and (if it's this patch) the solution. I did test my patch on content process crashes, and it worked reasonably well (apart from the probable effects of bug 1594065 and bug 1594078).
I'll be working on this. If I don't come up with a solution today (and if it is my patch that's triggered the problem), I'll back out my patch.
The crash stacks with hook.dylib in them are my tests. The patch for bug 1516367 landed in the same mozilla-central nightly or the one just before. It may also (somehow) be involved here.
Assignee | ||
Comment 18•5 years ago
|
||
A question for everyone here: Can I just back out my patch on my own authority? Would I use Lando to do that? If so how?
Assignee | ||
Comment 19•5 years ago
|
||
The patch for bug 1516376 landed in the 20191115095319 mozilla-central nightly (the one just before the nightly where my patch landed).
Assignee | ||
Comment 20•5 years ago
|
||
In limited testing with a local build, backing out my patch doesn't help with the symbolication of content process crash stacks. So now I'm wondering about changes that may have happened on Socorro or Tecken.
Assignee | ||
Comment 21•5 years ago
|
||
> The patch for bug 1516367 landed in the same mozilla-central nightly or the one just before. It may also (somehow) be involved here.
For what it's worth, disabling minidump-analyzer in the current mozilla-central nightly (build id 20191116051316) doesn't help. I disabled it by renaming it (which prevents other FF code from finding it and running it).
Assignee | ||
Comment 22•5 years ago
•
|
||
It's interesting that a number of crash stacks with failed symbolication have headers (i.e. signatures) that are symbolicated (whether correctly or not I don't know):
bp-682384d3-70fc-4960-9005-3e12b0191116
bp-8052c734-4bf5-473b-9d6c-2a08f0191116
bp-148599d4-d7ef-420d-bf64-270910191116
Assignee | ||
Comment 23•5 years ago
|
||
> If I don't come up with a solution today (and if it is my patch that's triggered the problem), I'll back out my patch.
I haven't found a solution. But as best I can tell, backing out my patch won't fix the failed symbolication of content process crash stacks. So I'm going to sit tight and await further developments.
Comment 24•5 years ago
|
||
I quickly investigated on my machine and when I dump out a minidump with your patch applied the various modules have their debug ID set to 000000000000000000000000000000000 which causes the symbol machinery to be unable to find symbols. ATM it's unclear if it's the minidump that contains the wrong debug ID or if the tools are interpreting it erroneously. Backing out will fix the problem until we solve this issue.
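A quick way to spot this symptom is to scan machine-readable stackwalk output for modules whose debug id is all zeros. The sketch below assumes the pipe-delimited "Module|..." line format shown in comment 4 (debug id in the fifth field); the function name and the libexample.dylib sample line are hypothetical.

```python
def modules_with_zeroed_debug_id(stackwalk_output):
    """Return names of modules whose debug id consists only of '0' characters."""
    zeroed = []
    for line in stackwalk_output.splitlines():
        fields = line.split("|")
        # Only 'Module|name|version|debug file|debug id|...' lines are relevant.
        if len(fields) < 5 or fields[0] != "Module":
            continue
        debug_id = fields[4]
        if debug_id and set(debug_id) == {"0"}:
            zeroed.append(fields[1])
    return zeroed


# The first line is taken from comment 4; the second is a made-up example.
SAMPLE_OUTPUT = "\n".join([
    "Module|CoreFoundation||CoreFoundation|"
    "09EB9DD025BC37309716FE231CAF2C700|0x7fff2bd1b000|0x7fff2c19afff|0",
    "Module|libexample.dylib||libexample.dylib|"
    "000000000000000000000000000000000|0x1000|0x1fff|0",
])

print(modules_with_zeroed_debug_id(SAMPLE_OUTPUT))
```

Any module this reports cannot be matched to a symbol file, since the lookup is keyed on the debug id.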
Comment 25•5 years ago
|
||
I found a hint of what might be going wrong. This invocation of MachoIdentifier() is returning false with your patch applied but returns true w/o.
Assignee | ||
Comment 26•5 years ago
|
||
I checked today's nightly and confirmed what you say. I also checked my local build, with my patch backed out, and found that the debug ids are no longer zeroed. So yes, let's back out my patch. I'll dig deeper and then submit a revised patch.
Thanks for looking into this. I should have thought to use minidump_dump (as I assume you did).
I don't know how to get the patch backed out. Please tell me how to do it, or please contact the appropriate people yourself.
Assignee | ||
Comment 27•5 years ago
•
|
||
By the way, in my tests at least, it's only minidumps for the content processes that have their debug ids zeroed. Firefox process minidumps are fine.
Assignee | ||
Comment 28•5 years ago
|
||
> I don't know how to get the patch backed out. Please tell me how to do it, or please contact the appropriate people yourself.
I just asked the current sheriff to do it on #developers on IRC. That procedure doesn't seem to have changed since the days when I was a Mozilla employee :-)
Comment 29•5 years ago
|
||
Backed out on request: https://hg.mozilla.org/mozilla-central/rev/7c2b637d452d37a6ce4320eef98d3d41b0d601c5
Assignee | ||
Comment 31•5 years ago
•
|
||
I'm about to post a revision of my patch. I basically made three changes:
- I found a fix for the problem that gsvelto reported in comment 24 and comment 25. My mistake was to dereference a pointer after the original object had been resized, and therefore moved in memory. I invoked header->cpusubtype after the second call to ReadTaskMemory(). But by then it referenced random memory, and the values I got were always incorrect.
- I noticed that Apple code which references mach_header.cpusubtype always ANDs it with ~CPU_SUBTYPE_MASK. Apparently the top 8 bits are reserved for "feature flags", including CPU_SUBTYPE_LIB64 == 0x80000000. I actually saw one instance of this as I was debugging. There are lots of references to mach_header.cpusubtype in Breakpad code. I didn't fix all of them -- only those relevant to the code used from XUL, which creates minidumps. Later I'll post another patch for the code used by utilities like minidump_dump and minidump_stackwalk. But since those read minidumps, the changes aren't as urgent.
- Last and most importantly, I fixed how Breakpad code computes the "slide" for modules in the "dyld shared cache".
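The cpusubtype masking from the second change can be illustrated as follows. This is a sketch in Python rather than the Breakpad C++ code: the constants are copied from Apple's mach/machine.h, and the helper names are made up for illustration.

```python
# Constants as defined in Apple's <mach/machine.h>.
CPU_SUBTYPE_MASK = 0xFF000000       # top 8 bits: feature flags
CPU_SUBTYPE_LIB64 = 0x80000000      # one such feature flag
CPU_SUBTYPE_X86_64_ALL = 3
CPU_SUBTYPE_X86_64_H = 8            # Haswell slice (x86_64h)


def base_cpusubtype(raw_cpusubtype):
    """Strip the feature-flag bits, leaving the subtype proper."""
    return raw_cpusubtype & ~CPU_SUBTYPE_MASK & 0xFFFFFFFF


def is_haswell_slice(raw_cpusubtype):
    """Hypothetical helper: true if the header describes an x86_64h slice."""
    return base_cpusubtype(raw_cpusubtype) == CPU_SUBTYPE_X86_64_H


# A header whose cpusubtype carries the LIB64 flag.
flagged = CPU_SUBTYPE_LIB64 | CPU_SUBTYPE_X86_64_ALL

print(flagged == CPU_SUBTYPE_X86_64_ALL)                   # unmasked comparison
print(base_cpusubtype(flagged) == CPU_SUBTYPE_X86_64_ALL)  # masked comparison
print(is_haswell_slice(CPU_SUBTYPE_LIB64 | CPU_SUBTYPE_X86_64_H))
```

Comparing the raw field directly against a subtype constant fails whenever a feature flag is set, which is why the mask has to be applied first.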
To implement ASLR, all modules are "slid" by a random amount whenever they're loaded into memory. To know where a particular module (like XUL or the CoreFoundation framework) is loaded into memory, one needs to know both its "original" base address and the amount by which it has been slid. When minidump_stackwalk is trying to symbolicate a given address in memory, it needs to be able to correctly identify the module it points to.
The dyld shared cache is a single module into which commonly used system dylibs and frameworks are incorporated. dyld maps it into every process at load time. The component modules all have the same slide. For some time (and maybe since the very beginning), Breakpad code has computed the slide for these modules incorrectly, by assuming that it can use the same procedure it (correctly) uses for modules not in the dyld shared cache. Breakpad only uses this code to create minidumps for other processes (child processes like the content process). But as a result, almost all system calls in content process crash stacks are either symbolicated incorrectly or not at all.
My patch fixes this problem. The "shared cache slide" (sharedCacheSlide) is stored in the dyld_all_image_infos structure. This field is only available on OS X 10.7 and later. But Firefox only supports OS X 10.9 and later, so we don't have to worry about this.
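To show why the slide matters, here is a small model of the address arithmetic. It is not Breakpad's code: the Module type, the function names, and all addresses, sizes and slide values are hypothetical, chosen only to show that shared cache modules must use the single shared cache slide while other modules use their own.

```python
from dataclasses import dataclass


@dataclass
class Module:
    name: str
    preferred_base: int    # base address before sliding (hypothetical values)
    size: int
    in_shared_cache: bool
    own_slide: int = 0     # per-image slide, used only outside the shared cache


def loaded_base(module, shared_cache_slide):
    """Where the module actually sits in memory once its slide is applied."""
    slide = shared_cache_slide if module.in_shared_cache else module.own_slide
    return module.preferred_base + slide


def find_module(modules, address, shared_cache_slide):
    """Return the name of the module containing a runtime address, or None."""
    for module in modules:
        base = loaded_base(module, shared_cache_slide)
        if base <= address < base + module.size:
            return module.name
    return None


SHARED_CACHE_SLIDE = 0x1A2B000  # hypothetical
MODULES = [
    Module("CoreFoundation", 0x7FFF20000000, 0x480000, True),
    Module("libobjc.A.dylib", 0x7FFF20480000, 0x40000, True),
    Module("XUL", 0x0, 0x8000000, False, own_slide=0x10F000000),
]

# An address 0x1000 bytes into the slid CoreFoundation image.
crash_address = 0x7FFF20000000 + SHARED_CACHE_SLIDE + 0x1000

print(find_module(MODULES, crash_address, SHARED_CACHE_SLIDE))
print(find_module(MODULES, crash_address, 0))  # wrong slide for the cache
```

With the wrong slide, the same address either lands in a different module or in none, which corresponds to the two failure modes described above: frames symbolicated incorrectly or not at all.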
Assignee | ||
Comment 32•5 years ago
|
||
Comment 33•5 years ago
|
||
Comment 34•5 years ago
|
||
bugherder |