Crash in [@ MacFileUtilities::MachoWalker::FindHeader]
Categories: Toolkit :: Crash Reporting, defect
People: Reporter: gsvelto; Assigned: smichaud
References: Regression
Keywords: crash, regression
Attachments (2 files):
- 47 bytes, text/x-phabricator-request | jcristau: approval-mozilla-beta+; jcristau: approval-mozilla-esr78+ | Details | Review
- 13.67 KB, text/plain | Details
Crash report: https://crash-stats.mozilla.org/report/index/d3ddb982-4c08-4f0e-afc9-f98080201109
Reason: EXC_BAD_ACCESS / KERN_INVALID_ADDRESS
Top 10 frames of crashing thread:
0 XUL MacFileUtilities::MachoWalker::FindHeader toolkit/crashreporter/google-breakpad/src/common/mac/macho_walker.cc:134
1 libsystem_c.dylib vsnprintf_l
2 XUL MacFileUtilities::MachoWalker::WalkHeader toolkit/crashreporter/google-breakpad/src/common/mac/macho_walker.cc:91
3 XUL MacFileUtilities::MachoID::WalkHeader toolkit/crashreporter/google-breakpad/src/common/mac/macho_id.cc:242
4 XUL XUL@0x506eadf
5 XUL MacFileUtilities::MachoID::UUIDCommand toolkit/crashreporter/google-breakpad/src/common/mac/macho_id.cc:159
6 XUL google_breakpad::MinidumpGenerator::WriteCVRecord toolkit/crashreporter/breakpad-client/mac/handler/minidump_generator.cc:1481
7 libsystem_kernel.dylib mach_msg
8 libsystem_kernel.dylib _kernelrpc_mach_vm_read
9 libmozglue.dylib malloc memory/build/malloc_decls.h:51
This appears to be a problem with Breakpad when reading from the process we're dumping. The crash address is at the beginning of a page that's not accessible, so we might be reading from the wrong area while looking for the Mach-O header.
Reporter
Comment 1 • 4 years ago
This seems to have started with buildid 20200906215057.
Reporter
Comment 2 • 4 years ago
This looks like it might be a regression introduced by bug 1662862. This crash has a much better stack than the one I referenced on the bug:
https://crash-stats.mozilla.org/report/index/2cbc7b29-d176-48b1-ac20-377bf0200917
It seems that the crash might be caused by module->base_of_image being wrong for certain modules (or maybe the size?).
Steven would you have time to look into this? If not I'll have a look in the coming weeks myself.
Updated • 4 years ago
Assignee
Comment 3 • 4 years ago
Yes, it does appear that these crashes wouldn't be happening without my patch for bug 1662862. But, off the top of my head, I suspect they aren't actually caused by it. I suspect they result from some kind of corruption, which might manifest itself in other ways without my patch. Sometime in the next week I'll dig in more deeply.
> It seems that the crash might be caused by module->base_of_image being wrong for certain modules (or maybe the size?).
I'll try to figure out how to trigger these crashes, and this is definitely one of the things I'll be looking at.
https://crash-stats.mozilla.org/report/index/2cbc7b29-d176-48b1-ac20-377bf0200917
It's interesting that this happened on macOS 10.15.4. 'modulePath' should point to a file in the file system, but we still failed over to in_memory == true.
As best I can tell, these crashes are happening on all versions of macOS, but proportionally more often on 10.11 and older:
I don't see any crashes on macOS 11, though. It's true that Firefox still only has very few users on that version of macOS, but it's still odd.
Assignee
Comment 4 • 4 years ago
https://crash-stats.mozilla.org/report/index/2cbc7b29-d176-48b1-ac20-377bf0200917
Note that the crash address, 0x5e4bf000, is wrong -- it's been truncated to 32 bits. As you can see from the raw dump, it should be 0x15e4bf000 (the value in the rsi register). Note also that rip (the instruction pointer) is 0x10cd48235. So the "bad access" seems to have been to something in the current code (__TEXT) segment (i.e. something in XUL itself). Maybe a failed write, or maybe an attempt to access an address beyond the end of XUL.
Edit: I'm no longer so sure that the attempted access was to something in XUL. Maybe it was to something on the heap. It definitely wasn't to something on the stack. I'll know more when I get to work on this.
Reporter
Comment 5 • 4 years ago
(In reply to Steven Michaud [:smichaud] (Retired) from comment #4)
> Note that the crash address, 0x5e4bf000, is wrong -- it's been truncated to 32 bits. As you can see from the raw dump, it should be 0x15e4bf000 (the value in the rsi register).
Yeah, I need to fix bug 1035892 first to get better visibility into the crash. The patch there has been ready for ages, but it causes a gtest failure I haven't been able to solve yet, so I couldn't land it.
> Edit: I'm no longer so sure that the attempted access was to something in XUL. Maybe it was to something on the heap. It definitely wasn't to something on the stack. I'll know more when I get to work on this.
Thanks!
Assignee
Comment 6 • 4 years ago
I started work on this today, and am making progress. I can reproduce these crashes, sort of. But the crashes I see are so weird that it will take a while to wrap my head around them. Interestingly, I can only reproduce them (in the main process) by triggering crashes in the content process.
Assignee
Comment 7 • 4 years ago
I have a hunch what the problem is: When the crashreporter thread in the main process iterates through images in memory, in order to process a crash in a different process (one of the content processes), it's actually reading memory (or trying to read memory) in the crashing process, to which the main process generally doesn't have access. This does work for images in the dyld shared cache (which is most of them), but doesn't work for those that aren't in the dyld shared cache.
As of my patch for bug 1662862, MinidumpGenerator::WriteCVRecord() defaults to reading image UUIDs from the file system. And even on macOS 11, images that aren't in the dyld shared cache are by definition somewhere in the file system. But if WriteCVRecord() can't access an image's file, it fails over to trying to access that image in memory -- which will fail if the image isn't in the dyld shared cache. The reason we don't see more of these crashes is that problems with file access are apparently quite rare.
I'm very short of time. But over the next few days I'll try to firm up this analysis (or replace it with another one). Then I'll start writing a patch to deal with the problem.
Assignee
Comment 8 • 4 years ago
(Following up comment 3)
> I don't see any crashes on macOS 11, though.
Not any more. There seem to be more in the last week than on any other version of macOS. Note that you have to look for both "10.16" and "11.0" in the "platform version".
Note that none of this bug's crashes are "new" crashes. Each happens in crash processing code, and overlays (and disguises) some other crash. This bug needs to be fixed. But fixing it will just unveil the "real" crashes.
Assignee
Comment 9 • 4 years ago
For the record, all this bug's crashes happen here, during a call to memcpy():
Updated • 4 years ago
Assignee
Comment 10 • 4 years ago
Updated • 4 years ago
Assignee
Comment 11 • 4 years ago
As best I can tell, my hunch from comment 7 still holds. I used a HookCase hook library to test the patch I've just submitted, on both macOS 10.15.7 and macOS 11. It seems to fix the crashes.
Here's the hook library I used (as a patch on https://github.com/steven-michaud/HookCase/tree/v5.0.0/HookLibraryTemplate). I tested with a try build containing my patch, with symbols unstripped, and a few symbols added. (The hook library needs those symbols.)
https://treeherder.mozilla.org/jobs?repo=try&revision=bcf5f57a935834bb6845ea6859db023295890a68
https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/PAhx7zSSSyqiPIYpuqtB_Q/runs/0/artifacts/public/build/target.dmg
Note that for each instance of this bug's crash that gets fixed, the "real" (underlying) crash's crash report will be missing symbols for at least one of the child process's modules.
Comment 13 • 4 years ago
Pushed by smichaud@pobox.com: https://hg.mozilla.org/integration/autoland/rev/c8f168ace30c Prevent access to child process modules not in dyld shared cache. r=gsvelto
Comment 14 • 4 years ago
bugherder
Assignee
Comment 15 • 4 years ago
Here's the first mozilla-central nightly containing this bug's patch (build id 20201121092754):
I'll be keeping an eye on what my patch does to this bug's crash numbers. But the crash volume is low enough that it may take up to a week to be sure, one way or the other.
No crashes for now.
I'll also be looking for content process crash stacks containing modules with no symbols, to get more information about the underlying problem(s) that triggered these crashes. At this point all we know is that they include file access problems.
Comment 16 • 4 years ago
The bulk of these crashes was on beta, so we'll want an uplift.
Assignee
Comment 17 • 4 years ago
Let's wait at least a few days to see how the patch does on trunk. Then if it's doing OK (if there aren't any crashes) we can do the uplift to beta.
Assignee
Comment 18 • 4 years ago
Comment on attachment 9189001 [details]
Bug 1676102 - Prevent access to child process modules not in dyld shared cache. r=gsvelto
Beta/Release Uplift Approval Request
- User impact if declined: There are very few of this bug's crashes on trunk but a huge number on beta. So it will be much easier to judge the success or failure of this patch on beta. It also makes sense to try to clear up the crashes on beta as soon as possible.
- Is this code covered by automated tests?: No
- Has the fix been verified in Nightly?: Yes
- Needs manual test from QE?: No
- If yes, steps to reproduce:
- List of other uplifts needed: None
- Risk to taking this patch: Medium
- Why is the change risky/not risky? (and alternatives if risky): This patch is based on a hunch that has only been lightly tested.
- String changes made/needed: None
Assignee
Comment 19 • 4 years ago
I decided to go ahead with the beta uplift request now. I hadn't realized what a huge number of this bug's crashes are happening on the current (84) beta branch.
Updated • 4 years ago
Comment hidden (Intermittent Failures Robot)
Assignee
Comment 21 • 4 years ago
I just realized that almost all the beta crashes are on Apple Silicon:
That makes fixing this bug less urgent. But it'd still be interesting to see the effects of my patch on beta.
And no, there still haven't been any of this bug's crashes on trunk builds containing my patch.
I don't have an Apple Silicon machine, and am unlikely to get one until they support more than 16GB of RAM.
Comment 22 • 4 years ago
Comment on attachment 9189001 [details]
Bug 1676102 - Prevent access to child process modules not in dyld shared cache. r=gsvelto
approved for 84.0b5
Updated • 4 years ago
Comment 23 • 4 years ago
bugherder uplift
Assignee
Comment 24 • 4 years ago
This bug's crashes have completely disappeared on trunk and beta since my patch was landed there. So it's clearly working as intended. But the number of different kinds of content process signatures hasn't increased (at least on Apple Silicon), as I expected it would. So I suspect some crashes are being misreported (under the wrong signature), or not being reported at all.
Beta 84 macOS content process ARM64 crashes before my patch:
1. _pthread_join: 334 crashes (84.99%), bugs 1679513, 1678226
2. libsystem_pthread.dylib@0x8ad8: 59 crashes (15.01%), bug 1679373
Beta 84 macOS content process ARM64 crashes with my patch:
1. _pthread_join: 446 crashes (80.94%), bugs 1679513, 1678226
2. libsystem_pthread.dylib@0x8ad8: 105 crashes (19.06%), bug 1679373
I also suspect the problem is more general than this. So I've opened bug 1679896.
Comment 25 • 4 years ago
Should we consider uplifting this to ESR78 also?
Assignee
Comment 26 • 4 years ago
Yes, I think so. Even on Intel Macs, this bug's crashes happen disproportionately on macOS 10.11 and older (see comment 3).
Assignee
Comment 27 • 4 years ago
Comment on attachment 9189001 [details]
Bug 1676102 - Prevent access to child process modules not in dyld shared cache. r=gsvelto
ESR Uplift Approval Request
- If this is not a sec:{high,crit} bug, please state case for ESR consideration: This bug's crashes happen disproportionately on OS X 10.11 and older (on Intel hardware).
- User impact if declined: Lots of these crashes will continue to happen on versions of OS X currently only supported on the ESR branch.
- Fix Landed on Version: 85
- Risk to taking this patch: Low
- Why is the change risky/not risky? (and alternatives if risky):
- String or UUID changes made by this patch: None
Reporter
Comment 28 • 4 years ago
(In reply to Steven Michaud [:smichaud] (Retired) from comment #26)
> Yes, I think so. Even on Intel Macs, this bug's crashes happen disproportionately on macOS 10.11 and older (see comment 3).
Agreed.
Comment 29 • 4 years ago
Comment on attachment 9189001 [details]
Bug 1676102 - Prevent access to child process modules not in dyld shared cache. r=gsvelto
approved for 78.6esr
Updated • 4 years ago
Comment 30 • 4 years ago
bugherder uplift