Closed Bug 1676102 Opened 4 years ago Closed 4 years ago

Crash in [@ MacFileUtilities::MachoWalker::FindHeader]

Categories

(Toolkit :: Crash Reporting, defect)

ARM64
macOS
defect

Tracking

()

RESOLVED FIXED
85 Branch
Tracking Status
firefox-esr78 84+ fixed
firefox83 --- wontfix
firefox84 --- fixed
firefox85 --- fixed

People

(Reporter: gsvelto, Assigned: smichaud)

References

(Regression)

Details

(Keywords: crash, regression)

Crash Data

Attachments

(2 files)

Crash report: https://crash-stats.mozilla.org/report/index/d3ddb982-4c08-4f0e-afc9-f98080201109

Reason: EXC_BAD_ACCESS / KERN_INVALID_ADDRESS

Top 10 frames of crashing thread:

0 XUL MacFileUtilities::MachoWalker::FindHeader toolkit/crashreporter/google-breakpad/src/common/mac/macho_walker.cc:134
1 libsystem_c.dylib vsnprintf_l 
2 XUL MacFileUtilities::MachoWalker::WalkHeader toolkit/crashreporter/google-breakpad/src/common/mac/macho_walker.cc:91
3 XUL MacFileUtilities::MachoID::WalkHeader toolkit/crashreporter/google-breakpad/src/common/mac/macho_id.cc:242
4 XUL XUL@0x506eadf 
5 XUL MacFileUtilities::MachoID::UUIDCommand toolkit/crashreporter/google-breakpad/src/common/mac/macho_id.cc:159
6 XUL google_breakpad::MinidumpGenerator::WriteCVRecord toolkit/crashreporter/breakpad-client/mac/handler/minidump_generator.cc:1481
7 libsystem_kernel.dylib mach_msg 
8 libsystem_kernel.dylib _kernelrpc_mach_vm_read 
9 libmozglue.dylib malloc memory/build/malloc_decls.h:51

This appears to be a problem with breakpad when reading from a process we're dumping out. The crash appears to be at the beginning of a page that's not accessible so we might be reading from the wrong area while looking for the MachO header.

This seems to have started with buildid 20200906215057.

This looks like it might be a regression introduced by bug 1662862. This crash has a much better stack than the one I referenced on the bug:

https://crash-stats.mozilla.org/report/index/2cbc7b29-d176-48b1-ac20-377bf0200917

It seems that the crash might be caused by module->base_of_image being wrong for certain modules (or maybe the size?).

https://searchfox.org/mozilla-central/rev/353f334d13359f7cb879b3cb38a94294231c2aa8/toolkit/crashreporter/breakpad-client/mac/handler/minidump_generator.cc#1479-1480

Steven would you have time to look into this? If not I'll have a look in the coming weeks myself.

Flags: needinfo?(smichaud)
Keywords: regression
Regressed by: 1662862
Has Regression Range: --- → yes
Blocks: 1664921

Yes, it does appear that these crashes wouldn't be happening without my patch for bug 1662862. But, off the top of my head, I suspect they aren't actually caused by it. I suspect they result from from some kind of corruption, which might manifest itself in other ways without my patch. Sometime in the next week I'll dig in more deeply.

It seems that the crash might be caused by module->base_of_image being wrong for certain modules (or maybe the size?).

I'll try to figure out how to trigger these crashes, and this is definitely one of the things I'll be looking at.

https://crash-stats.mozilla.org/report/index/2cbc7b29-d176-48b1-ac20-377bf0200917

It's interesting that this happened on macOS 10.15.4. 'modulePath' should be in the file system, but still we failed over to in_memory == true.

As best I can tell, these crashes are happening on all versions of macOS, but proportionally more often on 10.11 and older:

https://crash-stats.mozilla.org/search/?signature=~MacFileUtilities%3A%3AMachoWalker%3A%3AFindHeader&date=%3E%3D2020-08-09T15%3A28%3A00.000Z&date=%3C2020-11-09T15%3A28%3A00.000Z&_facets=signature&_facets=platform_version&_sort=-date&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-platform_version

I don't see any crashes on macOS 11, though. It's true that Firefox still only has very few users on that version of macOS, but it's still odd.

Flags: needinfo?(smichaud)

https://crash-stats.mozilla.org/report/index/2cbc7b29-d176-48b1-ac20-377bf0200917

Note that the crash address, 0x5e4bf000, is wrong -- it's been truncated to 32 bits. As you can see from the raw dump, it should be 0x15e4bf000 (the value in the rsi register). Note also that rip (the instruction pointer) is 0x10cd48235. So the "bad access" seems to have been to something in the current code (__TEXT) segment (i.e. something in XUL itself). Maybe a failed write, or maybe an attempt to access an address beyond the end of XUL.

Edit: I'm no longer so sure that the attempted access was to something in XUL. Maybe it was to something on the heap. It definitely wasn't to something on the stack. I'll know more when I get to work on this.

(In reply to Steven Michaud [:smichaud] (Retired) from comment #4)

Note that the crash address, 0x5e4bf000, is wrong -- it's been truncated to 32 bits. As you can see from the raw dump, it should be 0x15e4bf000 (the value in the rsi register).

Yeah, I need to fix bug 1035892 first to get better visibility into the crash. The patch there has been ready for ages but it causes gtest failure I haven't been able to solve yet so I couldn't land it.

Edit: I'm no longer so sure that the attempted access was to something in XUL. Maybe it was to something on the heap. It definitely wasn't to something on the stack. I'll know more when I get to work on this.

Thanks!

I started work on this today, and am making progress. I can reproduce these crashes, sort of. But the crashes I see are so weird that it will take a while to wrap my head around them. Interestingly, I can only reproduce them (in the main process) by triggering crashes in the content process.

I have a hunch what the problem is: When the crashreporter thread in the main process iterates through images in memory, in order to process a crash in a different process (one of the content processes), it's actually reading memory (or trying to read memory) in the crashing process, to which the main process generally doesn't have access. This does work for images in the dyld shared cache (which is most of them), but doesn't work for those that aren't in the dyld shared cache.

As of my patch for bug 1662862, MinidumpGenerator::WriteCVRecord() defaults to reading image UUIDs from the file system. And even on macOS 11, images that aren't in the dyld shared cache are by definition somewhere in the file system. But if WriteCVRecord() can't access an image's file, it fails over to trying to access that image in memory -- which will fail if the image isn't in the dyld shared cache. The reason we don't see more of these crashes is that problems with file access are apparently quite rare.

I'm very short of time. But over the next few days I'll try to firm up this analysis (or replace it with another one). Then I'll start writing a patch to deal with the problem.

(Following up comment 3)

I don't see any crashes on macOS 11, though.

Not any more. There seem to be more in the last week than on any other version of macOS. Note that you have to look for both "10.16" and "11.0" in the "platform version".

https://crash-stats.mozilla.org/search/?signature=~MacFileUtilities%3A%3AMachoWalker%3A%3AFindHeader&date=%3E%3D2020-11-11T20%3A07%3A00.000Z&date=%3C2020-11-18T20%3A07%3A00.000Z&_facets=signature&_facets=platform_version&_sort=-date&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-platform_version

Note that none of this bug's crashes are "new" crashes. Each happens in crash processing code, and overlays (and disguises) some other crash. This bug needs to be fixed. But fixing it will just unveil the "real" crashes.

Assignee: nobody → smichaud
Status: NEW → ASSIGNED

As best I can tell, my hunch from comment 7 still holds. I used a HookCase hook library to test the patch I've just submitted, on both macOS 10.15.7 and macOS 11. It seems to fix the crashes.

Here's the hook library I used (as a patch on https://github.com/steven-michaud/HookCase/tree/v5.0.0/HookLibraryTemplate). I tested with a try build containing my patch, with symbols unstripped, and a few symbols added. (The hook library needs those symbols.)

https://treeherder.mozilla.org/jobs?repo=try&revision=bcf5f57a935834bb6845ea6859db023295890a68
https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/PAhx7zSSSyqiPIYpuqtB_Q/runs/0/artifacts/public/build/target.dmg

Note that for each instance of this bug's crash that gets fixed, the "real" (underlying) crash's crash report will be missing symbols for at least one of the child process's modules.

Pushed by smichaud@pobox.com:
https://hg.mozilla.org/integration/autoland/rev/c8f168ace30c
Prevent access to child process modules not in dyld shared cache. r=gsvelto
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → 85 Branch

Here's the first mozilla-central nightly containing this bug's patch (build id 20201121092754):

http://ftp.mozilla.org/pub/firefox/nightly/2020/11/2020-11-21-09-27-54-mozilla-central/firefox-85.0a1.en-US.mac.dmg

I'll be keeping an eye on what my patch does to this bug's crash numbers. But the crash volume is low enough that it may take up to a week to be sure, one way or the other.

No crashes for now.

I'll also be looking for content process crash stacks containing modules with no symbols, to get more information about the underlying problem(s) that triggered these crashes. At this point all we know is that they include file access problems.

The bulk of these crashes was on beta, so we'll want an uplift.

Let's wait at least a few days to see how the patch does on trunk. Then if it's doing OK (if there aren't any crashes) we can do the uplift to beta.

See Also: → 1644249

Comment on attachment 9189001 [details]
Bug 1676102 - Prevent access to child process modules not in dyld shared cache. r=gsvelto

Beta/Release Uplift Approval Request

  • User impact if declined: There are very few of this bug's crashes on trunk but a huge number on beta. So it will be much easier to judge the success or failure of this patch on beta. It also makes sense to try to clear up the crashes on beta as soon as possible.
  • Is this code covered by automated tests?: No
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Medium
  • Why is the change risky/not risky? (and alternatives if risky): This patch is based on a hunch that has only been lightly tested.
  • String changes made/needed: None
Attachment #9189001 - Flags: approval-mozilla-beta?

I decided to go ahead with the beta upload request now. I hadn't realized what a huge number of this bug's crashes are happening on the current (84) beta branch.

I just realized that almost all the beta crashes are on Apple Silicon:

https://crash-stats.mozilla.org/search/?signature=~MacFileUtilities%3A%3AMachoWalker%3A%3AFindHeader&version=84.0b&date=%3E%3D2020-11-17T15%3A06%3A00.000Z&date=%3C2020-11-24T15%3A06%3A00.000Z&_facets=signature&_facets=cpu_arch&_sort=-date&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-cpu_arch

That makes fixing this bug less urgent. But it'd still be interesting to see the effects of my patch on beta.

And no, there still haven't been any of this bug's crashes on trunk builds containing my patch.

I don't have an Apple Silicon machine, and am unlikely to get one until they support more than 16GB of RAM.

Comment on attachment 9189001 [details]
Bug 1676102 - Prevent access to child process modules not in dyld shared cache. r=gsvelto

approved for 84.0b5

Attachment #9189001 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Hardware: Unspecified → ARM64

This bug's crashes have completely disappeared on trunk and beta since my patch was landed there. So it's clearly working as intended. But the number of different kinds of content process signatures hasn't increased (at least on Apple Silicon), as I expected it would. So I suspect some crashes are being misreported (under the wrong signature), or not being reported at all.

Beta 84 macOS content process ARM64 crashes before my patch:

    1  _pthread_join   Add term  334  84.99 %  1679513 1678226
    2  libsystem_pthread.dylib@0x8ad8   Add term  59  15.01 %  1679373

https://crash-stats.mozilla.org/search/?build_id=%3C20201126155845&cpu_arch=arm64&version=84.0b&platform=Mac%20OS%20X&process_type=content&date=%3E%3D2020-11-23T23%3A04%3A00.000Z&date=%3C2020-11-30T23%3A04%3A00.000Z&_facets=signature&_sort=-date&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-signature

Beta 84 macOS content process ARM64 crashes with my patch:

    1  _pthread_join   Add term  446  80.94 %  1679513 1678226
    2  libsystem_pthread.dylib@0x8ad8   Add term  105  19.06 %  1679373

https://crash-stats.mozilla.org/search/?build_id=%3E%3D20201126155845&cpu_arch=arm64&version=84.0b&platform=Mac%20OS%20X&process_type=content&date=%3E%3D2020-11-23T23%3A04%3A00.000Z&date=%3C2020-11-30T23%3A04%3A00.000Z&_facets=signature&_sort=-date&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-signature

I also suspect the problem is more general than this. So I've opened bug 1679896.

Should we consider uplifting this to ESR78 also?

Flags: needinfo?(smichaud)

Yes, I think so. Even on Intel Macs, this bug's crashes happen disproportionately on macOS 10.11 and older (see comment 3).

Flags: needinfo?(smichaud)

Comment on attachment 9189001 [details]
Bug 1676102 - Prevent access to child process modules not in dyld shared cache. r=gsvelto

ESR Uplift Approval Request

  • If this is not a sec:{high,crit} bug, please state case for ESR consideration: This bug's crashes happen disproportionately on OS X 10.11 and older (on Intel hardware).
  • User impact if declined: Lots of these crashes will continue to happen on versions of OS X currently only supported on the ESR branch.
  • Fix Landed on Version: 85
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky):
  • String or UUID changes made by this patch: None
Attachment #9189001 - Flags: approval-mozilla-esr78?

(In reply to Steven Michaud [:smichaud] (Retired) from comment #26)

Yes, I think so. Even on Intel Macs, this bug's crashes happen disproportionately on macOS 10.11 and older (see comment 3).

Agreed.

Comment on attachment 9189001 [details]
Bug 1676102 - Prevent access to child process modules not in dyld shared cache. r=gsvelto

approved for 78.6esr

Attachment #9189001 - Flags: approval-mozilla-esr78? → approval-mozilla-esr78+
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: