Closed Bug 1880582 Opened 2 years ago Closed 10 months ago

Poison crash in [@ objc_msgSend | -[_NSTrackingAreaAKManager _updateActiveTrackingAreasForWindowLocation:modifierFlags:]]

Categories

(Core :: Widget: Cocoa, defect, P2)

defect

Tracking

()

VERIFIED FIXED
133 Branch
Tracking Status
thunderbird_esr128 --- fixed
firefox-esr115 --- unaffected
firefox-esr128 133+ fixed
firefox127 --- wontfix
firefox128 + wontfix
firefox129 + wontfix
firefox130 + wontfix
firefox131 + wontfix
firefox132 - wontfix
firefox133 + fixed

People

(Reporter: mccr8, Assigned: bradwerth)

References

Details

(4 keywords, Whiteboard: [tbird crash][adv-main133+r][adv-esr128.5+r])

Crash Data

Attachments

(7 files, 4 obsolete files)

48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
175.14 KB, text/plain
Details
38.53 KB, text/plain
Details
2.09 KB, text/plain
Details
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review

Crash report: https://crash-stats.mozilla.org/report/index/1d1458ba-0b39-4471-b7be-1076d0240215

Reason: EXC_BAD_ACCESS / KERN_INVALID_ADDRESS

Top 10 frames of crashing thread:

0  libobjc.A.dylib  objc_msgSend  
1  AppKit  -[_NSTrackingAreaAKManager _updateActiveTrackingAreasForWindowLocation:modifierFlags:]  
2  AppKit  __58-[_NSTrackingAreaAKManager _activeTrackingAreasNeedUpdate]_block_invoke  
3  AppKit  ___NSMainRunLoopPerformBlockInModes_block_invoke  
4  CoreFoundation  __CFRUNLOOP_IS_CALLING_OUT_TO_A_BLOCK__  
5  CoreFoundation  __CFRunLoopDoBlocks  
6  CoreFoundation  __CFRunLoopRun  
7  CoreFoundation  CFRunLoopRunSpecific  
8  HIToolbox  RunCurrentEventLoopInMode  
9  HIToolbox  ReceiveNextEventCommon  

Solid volume of these. I couldn't find the string "activetracking" in our codebase anywhere, so I'm not sure what this is from. They are all on various releases of MacOS 14, so not super old, but only on 14.2 and older so not the latest versions.

worrying that it seems to be getting more frequent even though it involves old versions of the OS -- seems to imply we've changed something on our end. Or, I suppose, that some popular web site did something to tickle this bug.

This might be an unactionable bug in macOS

Keywords: sec-high
Severity: -- → S3
Priority: -- → P3

NSTrackingArea is related to mouse and cursor tracking events. We do have a few references to that in our code base.

The severity field for this bug is set to S3. However, the bug is flagged with the sec-high keyword.
:spohl, could you consider increasing the severity of this security bug?

For more information, please visit BugBot documentation.

Flags: needinfo?(spohl.mozilla.bugs)
Flags: needinfo?(spohl.mozilla.bugs)
Duplicate of this bug: 1884582

The comments seem to talk about specific user actions, and it seems tricky to trigger even then, so I'll reduce this to sec-moderate. It would still be nice to fix, as the volume is quite high for a poisoning crash.

(In reply to Stephen A Pohl [:spohl] from comment #2)

(In reply to Steven Michaud [:smichaud] (Retired) from comment #1)

Someone please look for URLs and useful comments from this bug's crash data, then post them here. I'm not a Mozilla employee, so I'm not able to access this information myself.

"Closing last private tab", "closing a tab", "closing a tab within a private window" seem to be the most interesting comments. There are also two comments about the crash occurring when watching a youtube video, one of the two mentions youtube in fullscreen. Could this maybe be triggered by clicking the "x" button on a tab but holding the mouse down, initiating a tab drag and then releasing the mouse?

Keywords: sec-highsec-moderate

Could this maybe be triggered by clicking the "x" button on a tab but holding the mouse down, initiating a tab drag and then releasing the mouse?

I tried this in a window (private and standard) and a tab -- sometimes in the last window (or tab). No crashes, and no ASAN errors.

These crashes still happen on macOS 14.4 (which came out last Thursday), so it's not that Apple has fixed the underlying bug. It's still too soon to know, though, whether or not 14.4 will reduce the volume of these crashes. Edit: It hasn't.

Running my ASAN build I got a bunch of malloc: nano zone abandoned due to inability to reserve vm space errors. I was able to get rid of these by using MallocNanoZone=0 ./mach run. Their absence made no difference.

The bug is linked to a topcrash signature, which matches the following criterion:

  • Top 5 desktop browser crashes on Mac on release

:spohl, could you consider increasing the severity of this top-crash bug?

For more information, please visit BugBot documentation.

Flags: needinfo?(spohl.mozilla.bugs)
Keywords: topcrash
Flags: needinfo?(spohl.mozilla.bugs)
Crash Signature: [@ objc_msgSend | -[_NSTrackingAreaAKManager _updateActiveTrackingAreasForWindowLocation:modifierFlags:]] → [@ objc_msgSend | -[_NSTrackingAreaAKManager _updateActiveTrackingAreasForWindowLocation:modifierFlags:]] [@ objc_retain | -[_NSTrackingAreaAKManager _updateActiveTrackingAreasForWindowLocation:modifierFlags:] ]

I just stumbled upon this while doing nightly crash triage. This looks like a regression to me as the crashes seem to have a specific starting date when restricting the query for nightly builds: buildid 20240201095346. There's a few macOS-specific changes on our side in that timeframe with bug 1865372 being the most suspicious, given it deals with destroying windows. Here's several comments from these crashes that point to closing a window being the culprit here:

I closed a couple of tabs of a window using Cmd+W, and Firefox crashed just after I closed the last tab in the window.

Closing last private tab

I had a YouTube video popped out, and I closed the popped out window (pressed X on the popped out window) leading to a crash.

After logging out from ‘Google Docs, I changed a Google Docs spreadsheet in another tab and then clicked close on the tab. It asked whether I was sure because it would not store the change. I was sure and closed Firefox. Then I was informed about the crash which had happened.

From the looks of it this is affecting several hundred users per day, we might want to prioritize fixing it.

n-i Brad for a look.

Flags: needinfo?(bwerth)
See Also: → 1865372

I'll see if I can replicate.

Assignee: nobody → bwerth
Flags: needinfo?(bwerth)

Bug 1891608 is possibly relevant.

I think crash rates are going to zero with the landing of Bug 1891608. I'll wait another two weeks and then resolve this Bug.

(In reply to Brad Werth [:bradwerth] from comment #12)

I think crash rates are going to zero with the landing of Bug 1891608. I'll wait another two weeks and then resolve this Bug.

This is fixed.

Status: NEW → RESOLVED
Closed: 1 year ago
Resolution: --- → FIXED

(In reply to Steven Michaud [:smichaud] (Retired) from comment #14)

I'm pretty sure this bug isn't fixed.

As best I can tell, all beta releases on the 127 channel have the fix for bug 1891608. But there are still quite a few of this bug's crashes in them:

https://crash-stats.mozilla.org/search/?signature=~_NSTrackingAreaAKManager&version=127.0b&platform=Mac%20OS%20X&date=%3E%3D2024-05-21T18%3A45%3A00.000Z&date=%3C2024-05-28T18%3A45%3A00.000Z&_facets=signature&_facets=platform_version&_facets=proto_signature&_facets=address&_facets=version&_sort=-date&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-version

Until FF 127 is released, we won't be able to tell whether or not the frequency of this bug's crashes has been reduced.

Sorry, I misunderstood the crash data. I'll keep thinking about what might be happening here, and I'll look at new crash stacks in builds with patches for Bug 1891608 applied to see what has changed, if anything.

Status: RESOLVED → REOPENED
Resolution: FIXED → ---

Okay, new theory. nsCocoaWindow has a property accessor to get a tracking area view, and calls it when adding the tracking area, and calls it again during deallocation. That works fine as long as the tracking view accessor provides the same value at both call times. But it might not! So the fix will be to also store the view that was giving the tracking area, and then remove the tracking area from that same view later. Patch coming shortly.

It is important to remove the NSTrackingArea before the nsCocoaWindow is
deallocated. This patch ensures that we remember which NSView is holding
the tracking area and remove the tracking area from that view during
dealloc (as well as whenever the tracking area is updated).

Since the trackingAreaView property accessor returns a value based on
contentView or its superview, manipulations of either view will change
the return value of the accessor. We can't rely on it being consistent
throughout the lifetime of the window. This change ensures that we will
always remove the tracking area from the view that added it.

Attachment #9406148 - Attachment description: Bug 1880582: Make nsCocoaWindow and and remove tracking areas from the same view. → Bug 1880582: Make nsCocoaWindow add and remove tracking areas from the same view.
Pushed by bwerth@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/1f084e8e0d62 Make nsCocoaWindow add and remove tracking areas from the same view. r=mac-reviewers,spohl
Group: layout-core-security → core-security-release
Status: REOPENED → RESOLVED
Closed: 1 year ago1 year ago
Resolution: --- → FIXED
Target Milestone: --- → 129 Branch

The patch landed in nightly and beta is affected.
:bradwerth, is this bug important enough to require an uplift?

  • If yes, please nominate the patch for beta approval.
  • If no, please set status-firefox128 to wontfix.

For more information, please visit BugBot documentation.

Flags: needinfo?(bwerth)

I don't want to uplift, because I first want to confirm it fixes the issue, and that will take some time. I think this can ride the trains.

Flags: needinfo?(bwerth)
Flags: needinfo?(haftandilian)

(In reply to Steven Michaud [:smichaud] (Retired) from comment #22)

The 129 branch has been patched for a week (starting with the 20240619095752 build). In that week there've been five of this bug's crashes, three of them in builds with the patch. It's pretty clear the patch hasn't fixed these crashes.

Sorry for the premature closing of the Bug. Let's re-open and keep it open while we try things.

Status: RESOLVED → REOPENED
Keywords: leave-open
Resolution: FIXED → ---

Gabriele, on the off chance that OCLP is involved here, could you check the crash reports' boot args?

Flags: needinfo?(gsvelto)

(Following up comment #25)

Oops, I got my previous analysis badly wrong. I'll try again:

For what it's worth, these crashes happen when -[NSView(NSTrackingArea) trackingAreas] is called on a deleted NSView object. I found this out by looking at the AppKit framework's -[_NSTrackingAreaAKManager _updateActiveTrackingAreasForWindowLocation:modifierFlags:] in a disassembler.

In objc_msgSend() the self argument (arg0) isn't invalid. But since it now points to a block of poisoned memory, self->isa (r10) is invalid (== 0x000065e5e5e5e5e0).

(In reply to Steven Michaud [:smichaud] (Retired) from comment #27)

(Following up comment #25)>
For what it's worth, these crashes happen when -[NSView(NSTrackingArea) trackingAreas] is called on a deleted NSView object.

This sounds very plausible. I think our whole approach to tracking areas is fragile. We set a tracking area on a BaseWindow content view's superview, and clear the tracking area when the window itself is destroyed. It would be conceptually cleaner if whatever view that was would instead set its own tracking area and remove it upon destruction. I'll see if I can untangle our calling patterns here and change to a more direct pattern.

I discovered something interesting with my HookCase hook library:

All the stack traces for calls to -[_NSTrackingAreaAKManager _updateActiveTrackingAreasForWindowLocation:modifierFlags:] look like this. I'm not able to reproduce any traces that look like the ones recorded for this bug's crashes.

    (hook.dylib) -[NSObject(_NSTrackingAreaAKManagerSwizzling) _NSTrackingAreaAKManager__updateActiveTrackingAreasForWindowLocation:modifierFlags:] + 0x87
    (AppKit) -[_NSTrackingAreaAKManager _mouseMoved:] + 0x1d3
    (AppKit) _routeMouseMovedEvent + 0x126
    (AppKit) -[NSWindow(NSEventRouting) _reallySendEvent:isDelayedEvent:] + 0x17d
    (AppKit) -[NSWindow(NSEventRouting) sendEvent:] + 0x159
    (XUL) -[ToolbarWindow sendEvent:] + 0x5c
    (AppKit) routeMouseMovedEvent + 0xb4
    (AppKit) -[NSApplication(NSEventRouting) sendEvent:] + 0x68e
    (XUL) -[GeckoNSApplication sendEvent:] + 0x8a
    (AppKit) -[NSApplication _handleEvent:] + 0x41
    (AppKit) -[NSApplication run] + 0x280
    (XUL) -[GeckoNSApplication run] + 0x42
    (XUL) nsAppShell::Run() + 0x13e
    (XUL) nsAppStartup::Run() + 0x3c
    (XUL) XREMain::XRE_mainRun() + 0x658
    (XUL) XREMain::XRE_main(int, char**, mozilla::BootstrapConfig const&) + 0x34d
    (XUL) XRE_main(int, char**, mozilla::BootstrapConfig const&) + 0x76
    (firefox) main + 0x22f
    (dyld) start + 0x796

I don't know what this means. I'll keep digging.

Edit: No wait, I am able to reproduce stacks like this bug's crash stacks by double-clicking on a link in a page. I'll redo my ASAN build and see if doing that there triggers something interesting.

Edit: No luck with the ASAN build :-( I'll keep digging.

Possibly a coincidence, but I just looked at about ten crash reports and all have RawCamera in their modules list.

Only four crashes have the mac_boot_args field populated, one is -v watchdog=0 agdpmod=pikera alcid=11 -wegnoigpu -radcodec npci=0x2000 revpatch=sbvmm,cpuname and the others -arm64e_preview_abi.

Flags: needinfo?(gsvelto)

Thanks! So OCLP isn't involved here.

There are several kinds of deleted NSView objects that might be involved here, but only two of them are created and managed by Mozilla code: PixelHostingView and ChildView. I'll be looking through the code for any changes, relevant to these objects, on the 124 branch. With luck I'll find what triggered these crashes.

Actually, very few changes landed on the 124 branch in nsChildView.mm or nsCocoaWindow.mm.

Bug 1833814 landed a "change in nsIWidget's cursor logic", and seems unlikely to be relevant here.

Bug 1865372 and bug 1877749 seem much more likely to be relevant -- particularly bug 1865372. Brad, you wrote both of these patches. Could you go over them again and look for problems? Could we back out one or both of them for a while on mozilla-central, to see if that makes this bug's crashes go away?

(Following up comment #8) Gabriele, I now notice that you also suspect bug 1865372 might have regressed this bug.

Flags: needinfo?(spohl.mozilla.bugs)
Flags: needinfo?(bwerth)
Flags: needinfo?(spohl.mozilla.bugs)

(Following up comment #35)

Specifically, Brad, I wonder if this bug's crashes will disappear if you comment out the call to DestroyNativeWindow() in nsCocoaWindow::Destroy().

(In reply to Steven Michaud [:smichaud] (Retired) from comment #36)

(Following up comment #35)

Specifically, Brad, I wonder if this bug's crashes will disappear if you comment out the call to DestroyNativeWindow() in nsCocoaWindow::Destroy().

I find it credible that the changes in Bug 1865372 triggered these crashes, but those changes solved some problems also, so I don't want to back them out. Perhaps we can get all the value if the call to DestroyNativeWindow is deferred to the next event loop. I'll think about it some more and build a patch that does this.

Flags: needinfo?(bwerth)

It was never my intention to have you back out your patch for bug 1865372 permanently. I just wanted to see if backing it out for a short while (say a week) got rid of this bug's crashes. They're so plentiful that a week would be enough to tell. The results of this test, either way, would give you more information on how to fix this bug's crashes.

But if you'd prefer, go ahead and try to postpone the call to DestroyNativeWindow() from nsCocoaWindow::Destroy(). That may be enough to fix things.

I've been trying to use my hook library to trigger these crashes, but without success. Not even (indirectly) calling nsCocoaWindow::Destroy and nsCocoaWindow::DestroyNativeWindow() right before a call to [NSView traceEvents] (with exactly the same call stack as this bug's crashes) does the trick. It's possible I'm doing something wrong -- that I'm not calling these entirely as they normally would be called. But I think that's unlikely.

So it's possible this bug isn't caused by calls to nsCocoaWindow::DestroyNativeWindow(), premature or otherwise. And it may be worthwhile to back the entire patch for bug 1865372 out for a week, and see what happens. If you do this, I suggest waiting until mozilla-central moves to the 130 branch. That's due to happen soon, and we wouldn't want the backout to get into a beta.

See comment #41.

Enough has changed in nsCocoaWindow.cpp since the landing of Bug 1865372 that a pure backout isn't even possible. A backout would become a selective merge/removal. We can first try to remove the call to DestroyNativeWindow (reverting the issue that the landing of Bug 1865372 resolved) and see how that affects these crashes.

(Following up comment #39)

I've kept digging, and have found some new information. But I still haven't figured out this bug's crashes.

They happen at several different offsets in -[_NSTrackingAreaAKManager _updateActiveTrackingAreasForWindowLocation:modifierFlags:]. Two of them involve calls to -[NSView traceEvents] or -[NSView superview] on several different kinds of NSView objects. But the most common involves a call to -[NSWindow windowNumber] on a ToolbarWindow object.

All objects that can be involved in these crashes are deallocated on macOS system calls to "observers" or "source0 sources" (ultimately from CFRunLoopRunSpecific()), on later iterations of the run loop than the one where their retain count reached zero. In principle it's possible for objects to be deallocated prematurely if certain "observers" or "sources" are run prematurely (not in their correct order). I can use my hook library to trigger this kind of behavior, but doing this hasn't yet triggered this bug's crashes.

Thread contention might be involved. The calls to -[NSView traceEvents] and -[NSView superview] that can trigger crashes usually happen on the same object. If this object survives the first call and only crashes on the second, that means it's being deallocated while -[_NSTrackingAreaAKManager _updateActiveTrackingAreasForWindowLocation:modifierFlags:] is running. And that can only be happening on a secondary thread. But though CFRunLoopRunSpecific() can (and does) call "observers" and "sources" on secondary threads, I've never seen it do deallocations (on objects involved in this bug's crashes) anywhere but the main thread.

So there's strong evidence that these crashes are truly weird, and an Apple bug. But Mozilla might still be able to do something about them.

Brad: Once mozilla-central is on the 130 branch, please do land a trial patch that removes the call to nsCocoaWindow::DestroyNativeWindow() from nsCocoaWindow::Destroy(). A week on the trunk should be enough to tell us whether or not this gets rid of this bug's crashes. If it does, you can try something like CFRunLoopTimerCreate() and CFRunLoopAddTimer() to postpone nsCocoaWindow::DestroyNativeWindow() to a later iteration of the run loop.

Note, though, that after you remove nsCocoaWindow::DestroyNativeWindow() from nsCocoaWindow::Destroy(), nsCocoaWindow::DestroyNativeWindow() will still be called, from nsCocoaWindow::~nsCocoaWindow(), on a later iteration of the run loop. It may be tricky to get your "timer" to run before these calls. And it may not even be necessary -- first check that the trial patch does indeed regress bug 1865372.

(In reply to Steven Michaud [:smichaud] (Retired) from comment #41)

first check that the trial patch does indeed regress bug 1865372.

Yes, it does :-( I found out by using my hook library to emulate the trial patch in today's mozilla-central nightly.

I'll play around with CFRunLoopTimerCreate() and CFRunLoopAddTimer() to see if I can get around this.

(In reply to Steven Michaud [:smichaud] (Retired) from comment #41)

Brad: Once mozilla-central is on the 130 branch, please do land a trial patch that removes the call to nsCocoaWindow::DestroyNativeWindow() from nsCocoaWindow::Destroy(). A week on the trunk should be enough to tell us whether or not this gets rid of this bug's crashes. If it does, you can try something like CFRunLoopTimerCreate() and CFRunLoopAddTimer() to postpone nsCocoaWindow::DestroyNativeWindow() to a later iteration of the run loop.

Good to know that delaying the call to DestroyNativeWindow() will fix this (edit: we haven't discerned this yet); too bad about regressing Bug 1865372. I've built a local patch that pushes the call to a later iteration of the event loop using NS_DispatchToCurrentThread, but that had a visual side effect that is unacceptable: the window disappears, reappears, then disappears again. I'm trying to figure out what is causing that, and I hope to get a patch up here soon.

I've found a way to delay the call to nsCocoaWindow::DestroyNativeWindow() without regressing bug 1865372. The code looks something like this:

static void nsCocoaWindow_DestroyNativeWindow_callback(CFRunLoopTimerRef timer, void *info)
{
  nsCocoaWindow *thisptr = (nsCocoaWindow *) info;
  thisptr->DestroyNativeWindow();
  NS_IF_RELEASE(thisptr);
}

void nsCocoaWindow::Destroy()
{
  ...

  if (mWindow && mWindowMadeHere) {
    CancelAllTransitions();
    //DestroyNativeWindow();

    CFRunLoopTimerContext timer_context = CFRunLoopTimerContext();
    NS_ADDREF_THIS();
    timer_context.info = this;
    CFRunLoopTimerRef timer =
      CFRunLoopTimerCreate(NULL, CFAbsoluteTimeGetCurrent() + 0.1, 0, 0, 0,
                           nsCocoaWindow_DestroyNativeWindow_callback, &timer_context);
    CFRunLoopAddTimer(CFRunLoopGetCurrent(), timer, kCFRunLoopCommonModes);
    CFRelease(timer);

  }
}

Of course I don't know if this will also fix this bug's crashes. We really need to land a trial patch, for a week, that only removes the call to nsCocoaWindow::DestroyNativeWindow() from nsCocoaWindow::Destroy(). This will, of course, regress bug 1865372. But it's the only way we'll be able to tell if we're on the right path. If this very simple trial patch does eliminate this bug's crashes, we can start trying possible solutions that don't regress bug 1865372. Each, in turn, should be left in place for another week, to see what effect it has on the crashes.

Edit: I think this code doesn't leak timer. And since timer is a one-shot, I don't think you need to call CFRunLoopRemoveTimer() afterwards. But I'm not completely sure. I based my code on the machine code for -[NSObject(NSDelayedPerforming) performSelector:withObject:afterDelay:inModes:] in the Foundation framework.

Edit: On second thought it probably does leak timer. As I remember it, Apple code uses automatic reference counting, but Mozilla's doesn't. I've corrected the above code accordingly (after I tried out my correction in my hook library and didn't see any crashes). The call to CFRunLoopAddTimer() presumably adds a reference to timer, which presumably gets released when it's invalidated after being used once.

Edit: Yes, I did it right. Here's confirmation. A one-shot timer is automatically invalidated after it runs once.

Edit: Oops, forgot to AddRef() and Release() thisptr. Now corrected.

As of today's mozilla-central nightly, the trunk is now on the 130 branch.

This is a temporary measure. We would like to destroy the native window
here, but we want to see the effect of removal on crash volume.

For what it's worth, I tried deleting objects related to this bug's crashes on com.apple.NSEventThread (using a CFRunLoopTimer in my hook library). I do crash, but the crashes in no way resemble this bug's crashes. This doesn't rule out that this bug's crashes are caused by thread contention. But it does eliminate one particular kind of thread contention from consideration.

I'll keep digging.

Pushed by bwerth@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/95b8a86436fe Don't destory the native window in nsCocoaWindow::Destroy. r=mstange

Backed out for causing OSX bc failures on browser_ext_windows_create_params.js

Backout link

Push with failures

Failure log

Flags: needinfo?(bwerth)

(In reply to Norisz Fay [:noriszfay] from comment #49)

Backed out for causing OSX bc failures on browser_ext_windows_create_params.js

Backout link

Push with failures

Failure log

Given that the patch changes how a window is closed, it's odd that the test failure has to do with a window that's just opened. Could the failure be spurious?

Attachment #9411823 - Attachment description: Bug 1880582: Don't destory the native window in nsCocoaWindow::Destroy. → Bug 1880582: Don't destroy the native window in nsCocoaWindow::Destroy.

I haven't figured out the test failure yet, but I haven't given up on this Bug yet, either. This Bug remains a priority due to its sec-moderate rating and topcrash status.

Flags: needinfo?(haftandilian)
Flags: needinfo?(bwerth)
Priority: P3 → P2
Whiteboard: [adv-main129+r]
Whiteboard: [adv-main129+r]

The crash happens on macOS 15.x as well; not fixed in the new version
bp-50ce7b58-ce50-44d4-971f-aac180240729
bp-9c62d11e-065a-4ec8-b68f-a98f10240725

Ignoring the 1% null crashes, there are two distinct addresses that correspond to the CPU type

amd64 crashes at 0x000065e5e5e5e5f8
arm64 crashes at 0x0065e5e5e5e5e5f0

The arm64 crashes I looked at have the 0xe5e5e5e5e5e5e5e5 poison address in x14, and then 0x0065e5e5e5e5e5e0 in registers x15 and x16. The crashing address is that + 0x10.

amd64 doesn't have as many registers. I always see 0x000065e5e5e5e5e0 in r10 and the instruction is and r11d, dword [r10 + 0x18] (explaining the "f8" at the end). In 2 of the 5 crashes I looked at 0xe5e5e5e5e5e5e5e5 was in rax. Why would that be inconsistent?

I don't know what kind of operation would stomp a consistent 00006 or 006 prefix on top of a value that gets used as a pointer to an object. I don't know arm64 chips and if particular registers are used for consistent things, but the intel registers that are typically used as base pointers (rbp, rsp) aren't anywhere near that value.

The bug is marked as tracked for firefox130 (nightly). However, the bug still has low severity.

:gcp, could you please increase the severity for this tracked bug? If you disagree with the tracking decision, please talk with the release managers.

For more information, please visit BugBot documentation.

Flags: needinfo?(gpascutto)

(In reply to Daniel Veditz [:dveditz] from comment #52)

I don't know what kind of operation would stomp a consistent 00006 or 006 prefix on top of a value that gets used as a pointer to an object. I don't know arm64 chips and if particular registers are used for consistent things, but the intel registers that are typically used as base pointers (rbp, rsp) aren't anywhere near that value.

It happens in objc_msgSend(id self, SEL op, ...) and friends.

All Objective-C objects have Class isa as their first member:

OBJC_ROOT_CLASS
@interface Object {
    Class isa;
}
@end

All these calls derefence self->isa. But beforehand they mask it with ISA_MASK (0x007ffffffffffff8ULL on Apple Silicon and 0x00007ffffffffff8ULL on Intel). When self has been deallocated, and its memory poisoned, the value of self->isa before masking is 0xe5e5e5e5e5e5e5e5. But after masking it's 0x65e5e5e5e5e5e0 on Apple Silicon and 0x65e5e5e5e5e0 on Intel.

Flags: needinfo?(gpascutto)

I'm going to try something new. This crash has something to do with macOS invoking tracking area methods on dead objects. Our code presently creates and re-creates lots of tracking areas for a window, with one destruction and one creation every time the window is resized. With a different structure, we could instead use only one tracking area that persists for the lifetime of the window. I'm going to provide a patch that does this, and I hope that it will reduce or eliminate this crash. I have no evidence for this, because I can't reproduce the crash, but the patch I think is a strict improvement over current code.

macOS offers an enum, NSTrackingInVisibleRect, which keeps the tracking
area up-to-date with the visible area of the window. This patch adds
this enum to our tracking area at window creation time, and removes the
code that was recreating the tracking area with every window resize.

And if needed, I could put this patch on a new, non-security Bug if that helps mask what we are trying to fix here. This might be particularly helpful if the patch causes problems and needs to be backed out.

(In reply to Brad Werth [:bradwerth] from comment #58)

And if needed, I could put this patch on a new, non-security Bug if that helps mask what we are trying to fix here. This might be particularly helpful if the patch causes problems and needs to be backed out.

I have done this, and opened Bug 1912949 to handle this fixup.

See Also: → 1912949

Comment on attachment 9418935 [details]
Bug 1880582: Let macOS window manager calculate visible rect of the mouse tracking area.

Revision D219087 was moved to bug 1912949. Setting attachment 9418935 [details] to obsolete.

Attachment #9418935 - Attachment is obsolete: true

And that patch in Bug 1912949 is also preventing the tracking area from being manipulated during dealloc, which may be the real issue.

Duplicate of this bug: 1913774

We don't know what causes these crashes. Until we do, it's best to stick to the one fact we have -- that the patch for bug 1865372 triggered them. Notice I say "triggered" and not "caused" -- this bug is almost certainly an Apple bug, deep in Apple code. Until we do understand the cause, guesses based on refactoring Mozilla code are just shots in the dark.

After letting it lie fallow for about a month, I'm back to using HookCase hook libraries to find the cause(s). Still no luck, but I have some new ideas. I'll keep trying.

Until I've had better luck, I think it's best to pursue the patch Brad landed in comment #48. That means figuring out the test failure.

Still working on figuring out the test failures with attachment 9411823 [details] -- they don't occur for me locally.

Meanwhile, new theory! I saw in the gtk code a note that BaseWidget::OnDestroy() may delete the widget itself. In such a case, a nsCocoaWindow might see this problem because the deallocator is called and then the rest of nsCocoaWindow::Destroy() would execute -- including the likely-triggering call to DestroyNativeWindow(). Experimenting with this while I try to work out the issue with the other patch.

Okay, interestingly, moving the call to OnDestroy() allowed the patch to run without causing errors. Moderate evidence! In order to land this, the patch will contain both changes. Presuming that it works, I think the next step will be to just resume the call to DestroyNativeWindow() as it is -- no need to delay it.

Pushed by bwerth@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/4b2edb2c26d4 Don't destroy the native window in nsCocoaWindow::Destroy. r=mstange
Status: REOPENED → ASSIGNED
Target Milestone: 129 Branch → ---

Backed out for causing mochitest failures on test_fullscreen-api-race.html

Backout link

Push with failures

Failure log

This failure could also be caused by the push. Log here

Flags: needinfo?(bwerth)

I'm struggling with this. My attempts to fixup the patch as-is, removing the call to DestroyNativeWindow() result in failures for various fullscreen tests, which are notoriously finicky. Whereas if I keep the call in, and make some of the other changes that conceptually seem relevant (don't destroy the widget until Destroy() has finished), tests pass fine. I'm not delighted to make another speculative patch, but that seems like the best move to me. I am going to update the patch and land it after the soft freeze period is over, and we'll see what that does to the crash volume.

To be clear, our fullscreen tests are "finicky" because the DOM does not define an observable event fired when the fullscreen transition is complete. Thus the tests that rapidly progress steps after a fullscreen transition can end up in unexpected states. When we moved to macOS native fullscreen, many of the tests had to be modified to do spin-checks on browser state in order to pass at all. Two of the failing tests, browser_fullscreen-sizemode.js and test_fullscreen-api-race.html have not been modified and I don't want to try to fix them up as part of this Bug.

Attachment #9411823 - Attachment description: Bug 1880582: Don't destroy the native window in nsCocoaWindow::Destroy. → Bug 1880582: Use more correct teardown semantics in nsCocoaWindow::Destroy.

I'm finally making progress finding the true cause of these crashes (the Apple bug). I've got a hook library that crashes mozilla-central nightlies, with this bug's signature, on macOS 14 but not on macOS 13. I still don't understand why the patch for bug 1865372 triggered them, but that may come with time.

Pushed by bwerth@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/90950f99d69f Use more correct teardown semantics in nsCocoaWindow::Destroy. r=mstange
Flags: needinfo?(bwerth)

I've got a very simple patch that fixes the crashes triggered by my hook library:

https://treeherder.mozilla.org/jobs?repo=try&revision=439559d8301cad4168bf8fc26ff8388b273203a6
https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/EYgoFPu3Q5mynPrHrtPN9w/runs/0/artifacts/public/build/target.dmg

Let's see if Brad's patch works. If not, my patch is clearly the next step to take.

Edit: I did a new tryserver build with a slightly better patch and better comments.

Brad's autoland build is just finishing up. Once that happens I'll try my hook library on it.

My hook library doesn't crash Safari. But interestingly it does crash Chrome (the release version). As best I can tell, though, Google makes it impossible for me to see the crash report, so I can't get a crash stack.

Several minidumps are present in ~/Library/Application Support/Google/Chrome/Crashpad/completed, but I haven't figured out how to symbolicate them.

(In reply to Steven Michaud [:smichaud] (Retired) from comment #73)

Brad's autoland build is just finishing up. Once that happens I'll try my hook library on it.

The autoland build made from Brad's latest patch crashes with my hook library. So it appears not to fix this bug's crashes.

My tryserver build from comment #72 doesn't regress bug 1865372.

For what it's worth, here's an Apple crash report for a Chromium crash triggered by my hook library. I was only able to get this after building "ungoogled" Chromium as per ungoogled-chromium-macos. An excellent project, by the way. Before I build it, I set symbol-level to 2 in flags.macos.gn. This stops symbol-stripping, and allows crash reports to be properly symbolicated.

The crash report is distinctly odd, though clearly recognizable as caused by the same bug that causes this bug's crashes.

OK, I'm going to make a stab at describing the Apple "bug" that causes this bug's crashes. It's actually a design flaw, and is built deeply into macOS (and probably most/all other modern OSes). The problem is that run loops can be nested. And if things happen just right (or wrong), a nested run loop can destroy resources being used in the run loop it's nested on top of, leading to UAF crashes in the "lower" run loop.

As best I can tell there's no general fix. The problem needs to be addressed on a case-by-case basis, whenever it shows up. You need to find the resource that's triggering the crashes, and retain it before nesting a run loop that might destroy it. (It can be released after the nested run loop finishes.)

You might say that you should avoid nesting run loops. But that's not feasible. It's benefits far outweigh its disadvantages. For example, a modal run loop (using in dragging, let's say) might freeze up your app if you can't periodically nest other run loops on top of it.

In this bug's case, the way to address the problem is my patch from comment #72. The resource that's being destroyed in a "rogue" nested run loop is the (native) main window. The workaround is to hold a reference to this native window until its Mozilla counterpart (and owner) is destroyed -- until the nsCocoaWindow destructor is called.

It's just luck that Safari doesn't crash. It's not that Apple was aware of these crashes and secretly worked around them. It just happens to hold a reference to the main window long enough to prevent them from happening.

Chromium (and Google Chrome) crashes on accessing not the main window itself, but one of its NSView components (WebContentsViewCocoa). It, too, already holds a reference to the main window. But not, apparently, to all of its components.

Mozilla's apps (Firefox and Thunderbird) do crash on macOS 13 (and presumably also on lower versions), but with much lower frequency than on macOS 14 and 15. I'm not sure why. Chrome, as best I can tell, crashes with equal frequency on macOS 14 and 13. Google, of course, doesn't publish its crash stats. But my hook library "works" on Chrome just as easily on macOS 13 as on macOS 14.

My hook library deliberately nests a "rogue" run loop -- one that closes and destroys the main window. I'm still working on it. But I'll post it when I'm finished.

Duplicate of this bug: 1917227
Whiteboard: [tbird crash]

(In reply to Steven Michaud [:smichaud] (Retired) from comment #80)

OK, I'm going to make a stab at describing the Apple "bug" that causes this bug's crashes. It's actually a design flaw, and is built deeply into macOS (and probably most/all other modern OSes). The problem is that run loops can be nested. And if things happen just right (or wrong), a nested run loop can destroy resources being used in the run loop it's nested on top of, leading to UAF crashes in the "lower" run loop.

Wonderful analysis, thank you. Seems very plausible that some nested run loop might call DestroyNativeWindow and that such a thing would cause a problem. I'm not certain that your patch is sufficient and correct because it will retain mWindow forever if HideWindowChrome is called. And syntactically it doesn't feel great to have an alloc-then-retain pattern. I will try to build and alternate patch based on your insights -- either something that varies behavior based on which run loop we are in or something that ensures that DestroyNativeWindow is only called from the main run loop.

(In reply to Steven Michaud [:smichaud] (Retired) from comment #80)

And if things happen just right (or wrong), a nested run loop can destroy resources being used in the run loop it's nested on top of, leading to UAF crashes in the "lower" run loop.

That's a bit fuzzy. Can you give a more specific example? Who drops the last reference? And who is holding on to a pointer that is now invalid?

In this bug's case, the way to address the problem is my patch from comment #72. The resource that's being destroyed in a "rogue" nested run loop is the (native) main window. The workaround is to hold a reference to this native window until its Mozilla counterpart (and owner) is destroyed -- until the nsCocoaWindow destructor is called.

If the window is destroyed in the native event loop, that must mean that nobody is holding a strong reference to it anymore. Are you saying that, somewhere higher up on the stack, we still have a pointer to it anyways? Do you know who might have such a pointer? And how does this cause crashes in the tracking area code?

Flags: needinfo?(smichaud)

(In reply to Brad Werth [:bradwerth] from comment #82)

I'm not certain that your patch is sufficient and correct because it will retain mWindow forever if HideWindowChrome is called.

Oops, I forgot about that. I'll revise my patch to deal with it.

And syntactically it doesn't feel great to have an alloc-then-retain pattern.

I don't really see anything wrong with it. Feel free to come up with a new patch. But the basic problem is very simple, and so should the solution be. Don't encrust it with other stuff that may actually cause it to miss its goal.

I'll answer your questions in another comment. (Actually they were Markus' questions.)

Flags: needinfo?(smichaud)

(In reply to Markus Stange [:mstange] from comment #84)

(In reply to Steven Michaud [:smichaud] (Retired) from comment #80)

And if things happen just right (or wrong), a nested run loop can destroy resources being used in the run loop it's nested on top of, leading to UAF crashes in the "lower" run loop.

That's a bit fuzzy. Can you give a more specific example? Who drops the last reference? And who is holding on to a pointer that is now invalid?

I don't really know. That's what happens with non-reproducible crashes. At best you'll come up with a plausible way to reproduce them, which isn't necessarily exactly how they happen.

In this bug's case, the way to address the problem is my patch from comment #72. The resource that's being destroyed in a "rogue" nested run loop is the (native) main window. The workaround is to hold a reference to this native window until its Mozilla counterpart (and owner) is destroyed -- until the nsCocoaWindow destructor is called.

If the window is destroyed in the native event loop, that must mean that nobody is holding a strong reference to it anymore. Are you saying that, somewhere higher up on the stack, we still have a pointer to it anyways?

No.

And how does this cause crashes in the tracking area code?

There are two places in -[_NSTrackingAreaAKManager _updateActiveTrackingAreasForWindowLocation:modifierFlags:] where we crash with any frequency. One is on a call to -[NSWindow windowNumber] on the main window. The other is a call to -[NSView trackingArea] on an NSView object associated with the main window. In trying to figure out this bug, I looked for ways to reproduce exactly these crashes. Mostly I failed. But I finally found one. More details in my next comment.

Don't forget: These crashes aren't a Mozilla bug. We weren't doing anything wrong. Or at least if we were, it didn't cause these crashes.

So the solution to them isn't to "fix" Mozilla code. It's to find the least objectionable workaround.

(Following up comment #86)

The external call nearest to these two crash locations (of any complexity) is one to-[NSArray enumerateObjectsWithOptions:usingBlock:] whose block->invoke method is ___collectTrackingAreasForTargetAndWinLoc_block_invoke(block_literal *block, id obj, NSUInteger idx, BOOL *stop). This NSArray method calls out to a block. So why not, after the block has returned, call out to another block that closes the main window? I did this, and it "worked". Including the results from both Firefox and Chromium, it reproduces this bug's crashes exactly. In Firefox the crashes happen on the call to -[NSWindow windowNumber]. In Chromium they happen on the call to -[NSView trackingArea].

I won't go into detail about how I pulled this off. You can look at the source code for my hook library after I post it (which should be pretty soon now).

This told me what I already suspected -- that the solution is to postpone the main window's deallocation. If you look at my hook library's logging, you see that nsCocoaWindow's destructor is called quite "late". And in any case that's the last chance we have to intervene. So why not deallocate it there? And when I do that, my hook library's evil, NSWindow eating block no longer crashes Firefox.

I have to deal with the HideWindowChrome complication. But presuming I can, this still seems the best solution.

We haven't reported this to Apple yet because we didn't have a clear root cause / explanation. Do we have that now? Is there a particular nested run loop occurrence in Apple code we could point to? Thanks for the continued debugging, Steven.

(In reply to Haik Aftandilian [:haik] from comment #89)

We haven't reported this to Apple yet because we didn't have a clear root cause / explanation. Do we have that now? Is there a particular nested run loop occurrence in Apple code we could point to? Thanks for the continued debugging, Steven.

I asked you to report this to Apple when I thought it was an Apple bug. Now I think this kind of crash is something inherent to run loop nesting, and that there's nothing Apple can (or should) do about it. I can't prove it, but I suspect it's impossible to come up with a general/universal solution. App developers need to seek solutions on a case-by-case basis, which will be different according to the different designs of their apps' event handling.

So I'd vote against opening a bug on this with Apple.

I've started to play around with the "HideWindowChrome() complication", but I can't find any way to trigger a call to HideWindowChrome().

Help! Can anyone tell me?

Edit: Even the "hidechrome" tests don't trigger calls to HideWindowChrome(). I searched in https://searchfox.org on "hidechrome".

I've revised my patch do deal with HideWindowChrome():

https://treeherder.mozilla.org/jobs?repo=try&revision=931b7eca3a561be240a95ffeeb0de2a2596bb61a
https://hg.mozilla.org/try/rev/4faee9c7f7bdbe8200776921a46f3c285e3aa3a3

I still haven't figured out an easy way to trigger calls to HideWindowChrome(). But I have noticed that it's called when you run reftests -- in fact while the reftest harness is initialized. (For example by doing ./mach reftest layout/tools/reftest/reftest.xhtml.) This happens when a BorderlessWindow is created -- an NSWindow subclass used for all WindowType::Invisible and WindowType::Child windows, and some WindowType::Popup ones. I'm not sure if these can be created using ordinary, unprivileged JavaScript, but I'll keep trying.

In the meantime I'll continue testing with ./mach reftest layout/tools/reftest/reftest.xhtml.

I'm still working on my hook library. But I should be able to post it soon --possibly later today.

Attachment #9423291 - Attachment description: WIP: Bug 1880582: Ensure nsCocoaWindow native window is only closed/released on the main thread. → Bug 1880582: Ensure nsCocoaWindow native window is only closed/released on the main thread.

(In reply to Steven Michaud [:smichaud] (Retired) from comment #90)

(In reply to Haik Aftandilian [:haik] from comment #89)

We haven't reported this to Apple yet because we didn't have a clear root cause / explanation. Do we have that now? Is there a particular nested run loop occurrence in Apple code we could point to? Thanks for the continued debugging, Steven.

I asked you to report this to Apple when I thought it was an Apple bug. Now I think this kind of crash is something inherent to run loop nesting, and that there's nothing Apple can (or should) do about it. I can't prove it, but I suspect it's impossible to come up with a general/universal solution. App developers need to seek solutions on a case-by-case basis, which will be different according to the different designs of their apps' event handling.

So I'd vote against opening a bug on this with Apple.

I'd like to back off from this, a bit. I still doubt it's possible to build a faultless run loop nesting system from first principles. But there may be an Apple bug here, after all. Or at least something for Apple to do.

These crashes aren't generic -- as you'd expect them to be if the problem was just difficulties with run loop nesting. They happen in a very specific place -- after the call to -[NSArray enumerateObjectsWithOptions:usingBlock:] from -[_NSTrackingAreaAKManager _updateActiveTrackingAreasForWindowLocation:modifierFlags:], whose block's invoke method is ___collectTrackingAreasForTargetAndWinLoc_block_invoke(block_literal *block, id obj, NSUInteger idx, BOOL *stop). Also, the crashes only happen when _updateActiveTrackingAreasForWindowLocation is called from __58-[_NSTrackingAreaAKManager _activeTrackingAreasNeedUpdate]_block_invoke(block_literal *block). Normally it isn't called from a block invoke method.

All I'm certain about is this and the fact that these crashes involve a window being closed in a nested run loop. I can't get much further without being able to reproduce the crashes (which will never happen). Apple probably won't be able to reproduce them, either. But they have access to the source code for where the crashes take place, which may include interesting developer comments. They also have access to their internal bug base and crash report base. They may be able to find relevant bugs in their code, or at least a way to work around these crashes specifically.

So yes, do report these crashes to Apple if you think that's appropriate. You'll need to quote liberally from my comments. If you want, I can actually write the report myself and post it here, for you to send to Apple.

By the way, I already showed (in comment #47 above) that these crashes aren't caused by deleting a window on a secondary thread.

(Following up comment #44)

I thought to try this patch again, in a tryserver build. It does fix this bug's crashes (it doesn't crash with my hook library's test crash). But it regresses bug 1865372.

Attachment #9423291 - Attachment is obsolete: true

If either of these asserts are triggered, it may be more stable to defer
the call to [mWindow close] to a later run of the event loop.

I predict neither of these asserts will ever be triggered.

(In reply to Steven Michaud [:smichaud] (Retired) from comment #97)

I predict neither of these asserts will ever be triggered.

I thought that part of your working theory is that the [window close] call was happening inside a nested run loop. If that's incorrect, what is the current theory?

Flags: needinfo?(smichaud)

I assumed both runloops (base and nested) would be the main runloop. I'll test with my hook library and report back.

Flags: needinfo?(smichaud)

(Following up comment #99)

In limited testing it turns out that it's always the main run loop that's nested, and never some other run loop. This is true for Firefox (today's mozilla-central nightly), Safari and Chrome. I tested on macOS 14.7.

Here's the hook library I tested with, as a patch on https://github.com/steven-michaud/HookCase/blob/v9.0.0/HookLibraryTemplate/hook.mm.

Here are a few examples of nesting that I found in Firefox (today's mozilla-central nightly):

(Fri Oct 11 09:55:49 2024) /Applications/Firefox Nightly.app/Contents/MacOS/firefox[5213] MainThread[0x119380240] Hook.mm: CFRunLoopRunSpecific(2): mode __kCFPasteboardPrivateMode, seconds 60.000000, returnAfterSourceHandled 1, returning 0x4
    (hook.dylib) PrintStackTrace() + 0x68
    (hook.dylib) Hooked_CFRunLoopRunSpecific(__CFRunLoop*, __CFString const*, double, unsigned char) + 0xe6
    (CoreFoundation) ___onqueue_CFPasteboardRequestDataFromDaemon_block_invoke.409 + 0x5c
    (CoreFoundation) CFPasteboardCopyData + 0x21d
    (CarbonCore) CacheFMMapData + 0x1a7
    (CarbonCore) IntlFCOpenComponentData + 0x9c
    (HIToolbox) InitializeInputSourceCache + 0x128
    (HIToolbox) GetInputSourceCacheHeader + 0x6b
    (HIToolbox) islcGetInputMethodIndexFromBundleID + 0x60
    (HIToolbox) ValidateEnabledInputSourceIDs + 0x1eb
    (HIToolbox) CopyInputSourceEnabledPrefs + 0xb7
    (HIToolbox) GetInputSourceEnabledPrefsIncludingThirdParty + 0x72
    (HIToolbox) isPrefsCreateCacheFromEnabledAndDefaultInputSources + 0x49
    (HIToolbox) islGetInputSourceListWithAdditions + 0x10f
    (HIToolbox) _CreateKeyboardInputSourcesArray_Mutable + 0x17f
    (HIToolbox) _CreateKeyboardInputSourcesArray + 0x11
    (HIToolbox) UpdateSourceIndicatorMode + 0x46
    (HIToolbox) InitTSMFirstEventTime + 0x2be
    (HIToolbox) TISCopyCurrentKeyboardInputSource + 0x12
    (XUL) mozilla::widget::TISInputSourceWrapper::InitByCurrentInputSource() + 0x26
    (XUL) mozilla::widget::IMEInputHandler::OnCurrentTextInputSourceChange(__CFNotificationCenter*, void*, __CFString const*, void const*, __CFDictionary const*) + 0x49
    (XUL) mozilla::widget::TextInputHandler::TextInputHandler(nsChildView*, NSView<mozView>*) + 0x1c1
    (XUL) nsChildView::Create(nsIWidget*, mozilla::gfx::IntRectTyped<mozilla::LayoutDevicePixel> const&, mozilla::widget::InitData*) + 0x25f
    (XUL) nsBaseWidget::CreateChild(mozilla::gfx::IntRectTyped<mozilla::LayoutDevicePixel> const&, mozilla::widget::InitData&) + 0x52
    (XUL) nsView::CreateWidget(nsIWidget*, bool, bool) + 0x132
    (XUL) nsDocumentViewer::MakeWindow(nsSize const&, nsView*) + 0x1fb
    (XUL) nsDocumentViewer::InitInternal(nsIWidget*, nsISupports*, mozilla::dom::WindowGlobalChild*, mozilla::gfx::IntRectTyped<mozilla::gfx::UnknownUnits> const&, bool, bool, bool) + 0x16d
    (XUL) nsDocumentViewer::Init(nsIWidget*, mozilla::gfx::IntRectTyped<mozilla::gfx::UnknownUnits> const&, mozilla::dom::WindowGlobalChild*) + 0x1b
    (XUL) nsDocShell::Embed(nsIDocumentViewer*, mozilla::dom::WindowGlobalChild*, bool, bool, nsIRequest*, nsIURI*) + 0x317
    (XUL) nsDocShell::CreateAboutBlankDocumentViewer(nsIPrincipal*, nsIPrincipal*, nsIContentSecurityPolicy*, nsIURI*, bool, mozilla::Maybe<nsILoadInfo::CrossOriginEmbedderPolicy> const&, bool, bool, mozilla::dom::WindowGlobalChild*) + 0x364
    (XUL) nsAppShellService::JustCreateTopWindow(nsIAppWindow*, nsIURI*, unsigned int, int, int, bool, mozilla::AppWindow**) + 0x362
    (XUL) nsAppShellService::CreateHiddenWindow() + 0x17d
    (XUL) nsAppStartup::CreateHiddenWindow() + 0x4d
    (XUL) XREMain::XRE_mainRun() + 0x4c3
    (XUL) XREMain::XRE_main(int, char**, mozilla::BootstrapConfig const&) + 0x34d
    (XUL) XRE_main(int, char**, mozilla::BootstrapConfig const&) + 0xb1
    (firefox) main + 0x22f
    (dyld) start + 0x775
(Fri Oct 11 09:56:11 2024) /Applications/Firefox Nightly.app/Contents/MacOS/firefox[5213] MainThread[0x119380240] Hook.mm: CFRunLoopRunSpecific(2): mode kCFRunLoopDefaultMode, seconds 0.000000, returnAfterSourceHandled 1, returning 0x4
    (hook.dylib) PrintStackTrace() + 0x68
    (hook.dylib) Hooked_CFRunLoopRunSpecific(__CFRunLoop*, __CFString const*, double, unsigned char) + 0xe6
    (HIToolbox) RunCurrentEventLoopInMode + 0x124
    (HIToolbox) ReceiveNextEventCommon + 0xc9
    (HIToolbox) _BlockUntilNextEventMatchingListInModeWithFilter + 0x42
    (AppKit) _DPSNextEvent + 0x370
    (AppKit) -[NSApplication(NSEventRouting) _nextEventMatchingEventMask:untilDate:inMode:dequeue:] + 0x4f9
    (XUL) -[GeckoNSApplication nextEventMatchingMask:untilDate:inMode:dequeue:] + 0x7c
    (XUL) nsAppShell::ProcessNextNativeEvent(bool) + 0x29c
    (XUL) nsBaseAppShell::OnProcessNextEvent(nsIThreadInternal*, bool) + 0x190
    (XUL) NS_ProcessNextEvent(nsIThread*, bool) + 0xa45
    (XUL) nsThreadManager::SpinEventLoopUntilInternal(nsTSubstring<char> const&, nsINestedEventLoopCondition*, mozilla::ShutdownPhase) + 0x3d8
    (XUL) NS_InvokeByIndex + 0x8e
    (XUL) XPCWrappedNative::CallMethod(XPCCallContext&, XPCWrappedNative::CallMode) + 0x816
    (XUL) XPC_WN_CallMethod(JSContext*, unsigned int, JS::Value*) + 0x1bf
    (XUL) js::Interpret(JSContext*, js::RunState&) + 0xb957
    (XUL) js::Call(JSContext*, JS::Handle<JS::Value>, JS::Handle<JS::Value>, js::AnyInvokeArgs const&, JS::MutableHandle<JS::Value>, js::CallReason) + 0xa52
    (XUL) JS_CallFunctionValue(JSContext*, JS::Handle<JSObject*>, JS::Handle<JS::Value>, JS::HandleValueArray const&, JS::MutableHandle<JS::Value>) + 0x326
    (XUL) nsXPCWrappedJS::CallMethod(unsigned short, nsXPTMethodInfo const*, nsXPTCMiniVariant*) + 0x8fb
    (XUL) PrepareAndDispatch + 0x320
    (XUL) SharedStub + 0x5b
    (XUL) NS_InvokeByIndex + 0x8e
    (XUL) XPCWrappedNative::CallMethod(XPCCallContext&, XPCWrappedNative::CallMode) + 0x816
    (XUL) XPC_WN_CallMethod(JSContext*, unsigned int, JS::Value*) + 0x1bf
    (XUL) js::Interpret(JSContext*, js::RunState&) + 0xb957
    (XUL) js::Call(JSContext*, JS::Handle<JS::Value>, JS::Handle<JS::Value>, js::AnyInvokeArgs const&, JS::MutableHandle<JS::Value>, js::CallReason) + 0xa52
    (XUL) JS_CallFunctionValue(JSContext*, JS::Handle<JSObject*>, JS::Handle<JS::Value>, JS::HandleValueArray const&, JS::MutableHandle<JS::Value>) + 0x326
    (XUL) nsXPCWrappedJS::CallMethod(unsigned short, nsXPTMethodInfo const*, nsXPTCMiniVariant*) + 0x8fb
    (XUL) PrepareAndDispatch + 0x320
    (XUL) SharedStub + 0x5b
    (XUL) nsObserverService::NotifyObservers(nsISupports*, char const*, char16_t const*) + 0x1d1
    (XUL) NS_InvokeByIndex + 0x8e
    (XUL) XPCWrappedNative::CallMethod(XPCCallContext&, XPCWrappedNative::CallMode) + 0x816
    (XUL) XPC_WN_CallMethod(JSContext*, unsigned int, JS::Value*) + 0x1bf
    (XUL) js::Interpret(JSContext*, js::RunState&) + 0xb957
    (XUL) js::Call(JSContext*, JS::Handle<JS::Value>, JS::Handle<JS::Value>, js::AnyInvokeArgs const&, JS::MutableHandle<JS::Value>, js::CallReason) + 0xa52
    (XUL) JS::Call(JSContext*, JS::Handle<JS::Value>, JS::Handle<JS::Value>, JS::HandleValueArray const&, JS::MutableHandle<JS::Value>) + 0x24c
    (XUL) mozilla::dom::EventListener::HandleEvent(mozilla::dom::BindingCallContext&, JS::Handle<JS::Value>, mozilla::dom::Event&, mozilla::ErrorResult&) + 0x431
    (XUL) mozilla::EventTargetChainItem::HandleEventTargetChain(nsTArray<mozilla::EventTargetChainItem>&, mozilla::EventChainPostVisitor&, mozilla::EventDispatchingCallback*, mozilla::ELMCreationDetector&) + 0x5b13
    (XUL) mozilla::EventDispatcher::Dispatch(mozilla::dom::EventTarget*, nsPresContext*, mozilla::WidgetEvent*, mozilla::dom::Event*, nsEventStatus*, mozilla::EventDispatchingCallback*, nsTArray<mozilla::dom::EventTarget*>*) + 0x28d5
    (XUL) nsINode::DispatchEvent(mozilla::dom::Event&, mozilla::dom::CallerType, mozilla::ErrorResult&) + 0x10c
    (XUL) mozilla::dom::EventTarget::DispatchEvent(mozilla::dom::Event&, mozilla::ErrorResult&) + 0x2f
    (XUL) nsContentUtils::DispatchXULCommand(nsIContent*, bool, mozilla::dom::Event*, mozilla::PresShell*, bool, bool, bool, bool, unsigned short, short) + 0x185
    (XUL) nsXULElement::DispatchXULCommand(mozilla::EventChainVisitor const&, nsTAutoStringN<char16_t, 64ul>&) + 0x174
    (XUL) nsXULElement::PreHandleEvent(mozilla::EventChainVisitor&) + 0xa8
    (XUL) mozilla::EventDispatcher::Dispatch(mozilla::dom::EventTarget*, nsPresContext*, mozilla::WidgetEvent*, mozilla::dom::Event*, nsEventStatus*, mozilla::EventDispatchingCallback*, nsTArray<mozilla::dom::EventTarget*>*) + 0x3422
    (XUL) nsINode::DispatchEvent(mozilla::dom::Event&, mozilla::dom::CallerType, mozilla::ErrorResult&) + 0x10c
    (XUL) mozilla::dom::EventTarget::DispatchEvent(mozilla::dom::Event&) + 0x2f
    (XUL) nsMenuUtilsX::DispatchCommandTo(nsIContent*, unsigned long, short) + 0x139
    (AppKit) -[NSApplication(NSResponder) sendAction:to:from:] + 0x151
    (AppKit) -[NSMenuItem _corePerformAction] + 0x1c7
    (AppKit) _NSMenuPerformActionWithHighlighting + 0xb9
    (AppKit) -[NSMenu _performActionForItem:atIndex:fromEvent:] + 0xc5
    (AppKit) -[NSMenu performKeyEquivalent:] + 0x1b4
    (XUL) -[GeckoNSMenu performSuperKeyEquivalent:] + 0x37
    (XUL) nsMenuBarX::PerformKeyEquivalent(NSEvent*) + 0x18
    (XUL) nsChildView::SendEventToNativeMenuSystem(NSEvent*) + 0x2b
    (XUL) nsChildView::PostHandleKeyEvent(mozilla::WidgetKeyboardEvent*) + 0xfb
    (XUL) mozilla::EventStateManager::PostHandleKeyboardEvent(mozilla::WidgetKeyboardEvent*, nsIFrame*, nsEventStatus&) + 0x111
    (XUL) mozilla::EventStateManager::PostHandleEvent(nsPresContext*, mozilla::WidgetEvent*, nsIFrame*, nsEventStatus*, nsIContent*) + 0x54e
    (XUL) mozilla::PresShell::EventHandler::DispatchEvent(mozilla::EventStateManager*, mozilla::WidgetEvent*, bool, nsEventStatus*, nsIContent*) + 0x12c
    (XUL) mozilla::PresShell::EventHandler::HandleEventWithCurrentEventInfo(mozilla::WidgetEvent*, nsEventStatus*, bool, nsIContent*) + 0x21c
    (XUL) mozilla::PresShell::EventHandler::HandleEventAtFocusedContent(mozilla::WidgetGUIEvent*, nsEventStatus*) + 0xc2
    (XUL) mozilla::PresShell::HandleEvent(nsIFrame*, mozilla::WidgetGUIEvent*, bool, nsEventStatus*) + 0x151
    (XUL) nsViewManager::DispatchEvent(mozilla::WidgetGUIEvent*, nsView*, nsEventStatus*) + 0x11c
    (XUL) nsView::HandleEvent(mozilla::WidgetGUIEvent*, bool) + 0x91
    (XUL) nsChildView::DispatchEvent(mozilla::WidgetGUIEvent*, nsEventStatus&) + 0x66
    (XUL) nsBaseWidget::ProcessUntransformedAPZEvent(mozilla::WidgetInputEvent*, mozilla::layers::APZEventResult const&) + 0x87
    (XUL) nsBaseWidget::DispatchInputEvent(mozilla::WidgetInputEvent*) + 0xbd
    (XUL) mozilla::widget::TextEventDispatcher::DispatchInputEvent(nsIWidget*, mozilla::WidgetInputEvent&, nsEventStatus&) + 0x44
    (XUL) mozilla::widget::TextEventDispatcher::DispatchKeyboardEventInternal(mozilla::EventMessage, mozilla::WidgetKeyboardEvent const&, nsEventStatus&, void*, unsigned int, bool) + 0x5c7
    (XUL) mozilla::widget::TextEventDispatcher::MaybeDispatchKeypressEvents(mozilla::WidgetKeyboardEvent const&, nsEventStatus&, void*, bool) + 0x94
    (XUL) mozilla::widget::TextInputHandler::HandleKeyDownEvent(NSEvent*, unsigned int) + 0x68e
    (XUL) -[ChildView keyDown:] + 0x16a
    (AppKit) -[NSWindow(NSEventRouting) _reallySendEvent:isDelayedEvent:] + 0x1ad
    (AppKit) -[NSWindow(NSEventRouting) sendEvent:] + 0x159
    (XUL) -[ToolbarWindow sendEvent:] + 0x5c
    (AppKit) -[NSApplication(NSEventRouting) sendEvent:] + 0x5b0
    (XUL) -[GeckoNSApplication sendEvent:] + 0x5e
    (AppKit) -[NSApplication _handleEvent:] + 0x41
    (AppKit) -[NSApplication run] + 0x280
    (XUL) -[GeckoNSApplication run] + 0x42
    (XUL) nsAppShell::Run() + 0x13e
    (XUL) nsAppStartup::Run() + 0x3c
    (XUL) XREMain::XRE_mainRun() + 0x72e
    (XUL) XREMain::XRE_main(int, char**, mozilla::BootstrapConfig const&) + 0x34d
    (XUL) XRE_main(int, char**, mozilla::BootstrapConfig const&) + 0xb1
    (firefox) main + 0x22f
    (dyld) start + 0x775
(Fri Oct 11 09:55:57 2024) /Applications/Firefox Nightly.app/Contents/MacOS/firefox[5213] MainThread[0x119380240] Hook.mm: CFRunLoopRunSpecific(2): mode com.apple.run-loop-mode.view-bridge.blocks, seconds 5.999978, returnAfterSourceHandled 1, returning 0x4
    (hook.dylib) PrintStackTrace() + 0x68
    (hook.dylib) Hooked_CFRunLoopRunSpecific(__CFRunLoop*, __CFString const*, double, unsigned char) + 0xe6
    (HIToolbox) -[IMKInputSessionXPCInvocation runPrivateRunloopWithTimeLimit_0_innerRunLoop0:] + 0x66
    (HIToolbox) -[IMKInputSessionXPCInvocation invocationAwaitXPCReply] + 0xeb
    (HIToolbox) -[IMKInputSession activate] + 0x787
    (HIToolbox) -[IMKInputSession activateAfterServerConnection] + 0x7d
    (HIToolbox) -[IMKClient fulfillServerDependentWork] + 0x2d8
    (HIToolbox) __87+[IMKInputSession IMKXPCPerformBlockOnMainThreadInMode:performHow:callerCmd:workBlock:]_block_invoke + 0x67
    (CoreFoundation) __CFRUNLOOP_IS_CALLING_OUT_TO_A_BLOCK__ + 0xc
    (CoreFoundation) __CFRunLoopDoBlocks + 0x18e
    (CoreFoundation) __CFRunLoopRun + 0x3ac
    (CoreFoundation) CFRunLoopRunSpecific + 0x22d
    (hook.dylib) Hooked_CFRunLoopRunSpecific(__CFRunLoop*, __CFString const*, double, unsigned char) + 0x87
    (HIToolbox) RunCurrentEventLoopInMode + 0x124
    (HIToolbox) ReceiveNextEventCommon + 0xc9
    (HIToolbox) _BlockUntilNextEventMatchingListInModeWithFilter + 0x42
    (AppKit) _DPSNextEvent + 0x370
    (AppKit) -[NSApplication(NSEventRouting) _nextEventMatchingEventMask:untilDate:inMode:dequeue:] + 0x4f9
    (XUL) -[GeckoNSApplication nextEventMatchingMask:untilDate:inMode:dequeue:] + 0x7c
    (AppKit) -[NSApplication run] + 0x25b
    (XUL) -[GeckoNSApplication run] + 0x42
    (XUL) nsAppShell::Run() + 0x13e
    (XUL) nsAppStartup::Run() + 0x3c
    (XUL) XREMain::XRE_mainRun() + 0x72e
    (XUL) XREMain::XRE_main(int, char**, mozilla::BootstrapConfig const&) + 0x34d
    (XUL) XRE_main(int, char**, mozilla::BootstrapConfig const&) + 0xb1
    (firefox) main + 0x22f
    (dyld) start + 0x775
Attachment #9430221 - Attachment is obsolete: true

...and release it in DestroyNativeWindow. This separates
responsibilities in the way we intend, where Destroy is responsible for
clsoing the window and DestroyNativeWindow is responsible for more
general memory teardown.

Brad, can you do a tryserver build with your patch? I'd like to test it with my hook library from comment #93.

(In reply to Steven Michaud [:smichaud] (Retired) from comment #103)

Brad, can you do a tryserver build with your patch? I'd like to test it with my hook library from comment #93.

Try build in progress at https://treeherder.mozilla.org/jobs?repo=try&revision=e6abb831026c02bfc47c1217c3b37ced7309d737. I'm hoping this will not affect the browser chrome tests, avoiding the problem we see when trying to remove the call to DestroyNativeWindow() from Destroy().

According to my tests, Brad's latest patch (from comment #102) is likely to fix (i.e. work around) this bug's crashes. I was unable to use my hook library from comment #93 to crash the tryserver build from comment #104. It doesn't leak the window being destroyed (either its XUL nsCocoaWindow object or its native ToolbarWindow object).

I tested on macOS 14.7.

My hook library does still crash today's mozilla-central nightly (bp-fdf67909-c2f8-4fcb-b186-4b9950241013).

Edit: Brad's tryserver build also doesn't regress bug 1865372.

(In reply to Steven Michaud [:smichaud] (Retired) from comment #105)

According to my tests, Brad's latest patch (from comment #102) is likely to fix (i.e. work around) this bug's crashes.

Given that, a minimal fix is likely just adding [[mWindow retain] autorelease] before [mWindow close] in DestroyNativeWindow(). That's the working part of this patch. I'm going to try to revise D225423 to work better because I think this is the right time to clarify the relative roles of Destroy() and DestroyNativeWindow().

Attachment #9430518 - Attachment is obsolete: true
Pushed by bwerth@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/884289b29239 Hold an autorelease reference to mWindow while releasing it. r=mstange
Status: ASSIGNED → RESOLVED
Closed: 1 year ago10 months ago
Resolution: --- → FIXED
Target Milestone: --- → 133 Branch

Bad news :-( Today's mozilla-central nightly still crashes using my hook library from comment #93 (bp-190ac722-3730-4be2-b7a7-6144f0241017). I'm going to try to figure out why.

Keep track of crashes with the patch from comment #107 by using the following search (suitably updated). So far there aren't any not triggered by my hook library. But that doesn't mean too much -- mozilla-central has very few users.

https://crash-stats.mozilla.org/search/?signature=~_NSTrackingAreaAKManager&build_id=%3E%3D20241017095824&modules_in_stack=%21~hook.dylib&version=133.0a1&platform=Mac%20OS%20X&date=%3E%3D2024-08-21T13%3A11%3A00.000Z&date=%3C2024-09-04T13%3A11%3A00.000Z&_facets=signature&_facets=platform_version&_facets=proto_signature&_facets=address&_facets=version&page=1&_sort=-date&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#crash-reports

This bug will probably need to be reopened.

Here's Brad's patch from comment #107. The problem is that [[mWindow retain] autorelease] doesn't keep mWindow alive "long enough" after mWindow is closed. (A nested run loop can release autoreleased objects "early".) One way around this is my patch from comment #92. Yes, I know you guys don't like it. But to fix this bug (to work around its crashes) you're going to need to do something equivalent.

# HG changeset patch
# User Brad Werth <bwerth@mozilla.com>
# Date 1729118503 0
# Node ID 884289b29239b94fb06f257548502fcb195b5471
# Parent  b8a61a1c09b8dcfa001d3e0d2f1279bd78a2ba71
Bug 1880582: Hold an autorelease reference to mWindow while releasing it. r=mstange

Differential Revision: https://phabricator.services.mozilla.com/D225554

diff --git a/widget/cocoa/nsCocoaWindow.mm b/widget/cocoa/nsCocoaWindow.mm
--- a/widget/cocoa/nsCocoaWindow.mm
+++ b/widget/cocoa/nsCocoaWindow.mm
@@ -164,17 +164,26 @@ void nsCocoaWindow::DestroyNativeWindow(
   Show(false);
 
   [mWindow removeTrackingArea];
 
   [mWindow releaseJSObjects];
   // We want to unhook the delegate here because we don't want events
   // sent to it after this object has been destroyed.
   mWindow.delegate = nil;
+
+  // We might be in an embedded run loop which is still using mWindow.
+  // Hold an extra reference as an autorelease to cover these cases.
+  [[mWindow retain] autorelease];
+
+  // Closing the window will also release it, though our extra retain
+  // above will keep it alive through the event loop.
+  MOZ_ASSERT(mWindow.releasedWhenClosed);
   [mWindow close];
+
   mWindow = nil;
   [mDelegate autorelease];
 
   NS_OBJC_END_TRY_IGNORE_BLOCK;
 }
 
 nsCocoaWindow::~nsCocoaWindow() {
   NS_OBJC_BEGIN_TRY_IGNORE_BLOCK;

(Following up comment #111)

A nested run loop can release autoreleased objects "early".

Here's an example of this from my hook library's log of one of these crashes, with a different object (not an NSWindow object):

(Thu Oct 17 11:42:54 2024) /Applications/Firefox Nightly.app/Contents/MacOS/firefox[77846] MainThread[0x1078cc240] Hook.mm: [NSObject dealloc]: self NSView(0x13889e400)
    (hook.dylib) PrintStackTrace() + 0x68
    (hook.dylib) -[NSObject(MethodSwizzling) NSObject_dealloc] + 0x8c
    (AppKit) -[NSResponder dealloc] + 0x1b1
    (AppKit) -[NSView dealloc] + 0x95
    (AppKit) -[NSViewController dealloc] + 0xc5
    (AppKit) -[NSViewController release] + 0x79
    (libobjc.A.dylib) AutoreleasePoolPage::releaseUntil(objc_object**) + 0xb9
    (libobjc.A.dylib) objc_autoreleasePoolPop + 0xeb
    (CoreFoundation) _CFAutoreleasePoolPop + 0x16
    (CoreFoundation) __CFRunLoopPerCalloutARPEnd + 0x27
    (CoreFoundation) __CFRunLoopDoBlocks + 0x1ac
    (CoreFoundation) __CFRunLoopRun + 0x385
    (CoreFoundation) CFRunLoopRunSpecific + 0x22d
    (CoreFoundation) CFRunLoopRun + 0x28
    (hook.dylib) after_enumerateWithOptions(objc_object*, unsigned long, void (objc_object*, unsigned long, signed char*) block_pointer) + 0x42b
    (hook.dylib) -[NSObject(NSArraySwizzling) __NSArrayM_enumerateObjectsWithOptions:usingBlock:] + 0x5d
    (AppKit) -[_NSTrackingAreaAKManager _updateActiveTrackingAreasForWindowLocation:modifierFlags:] + 0x6f3
    (hook.dylib) -[NSObject(_NSTrackingAreaAKManagerSwizzling) _NSTrackingAreaAKManager__updateActiveTrackingAreasForWindowLocation:modifierFlags:] + 0xba
    (AppKit) __58-[_NSTrackingAreaAKManager _activeTrackingAreasNeedUpdate]_block_invoke + 0x13e
    (hook.dylib) Hooked_activeTrackingAreasNeedUpdate_block_invoke(_activeTrackingAreasNeedUpdate_literal*) + 0xba
    (AppKit) ___NSMainRunLoopPerformBlockInModes_block_invoke + 0x19
    (hook.dylib) Hooked____NSMainRunLoopPerformBlockInModes_block_invoke(_perform_block_literal*) + 0x74
    (CoreFoundation) __CFRUNLOOP_IS_CALLING_OUT_TO_A_BLOCK__ + 0xc
    (CoreFoundation) __CFRunLoopDoBlocks + 0x18e
    (CoreFoundation) __CFRunLoopRun + 0x385
    (CoreFoundation) CFRunLoopRunSpecific + 0x22d
    (HIToolbox) RunCurrentEventLoopInMode + 0x124
    (HIToolbox) ReceiveNextEventCommon + 0xc9
    (HIToolbox) _BlockUntilNextEventMatchingListInModeWithFilter + 0x42
    (AppKit) _DPSNextEvent + 0x370
    (AppKit) -[NSApplication(NSEventRouting) _nextEventMatchingEventMask:untilDate:inMode:dequeue:] + 0x4f9
    (XUL) -[GeckoNSApplication nextEventMatchingMask:untilDate:inMode:dequeue:] + 0x7c
    (AppKit) -[NSApplication run] + 0x25b
    (XUL) -[GeckoNSApplication run] + 0x42
    (XUL) nsAppShell::Run() + 0x13e
    (XUL) nsAppStartup::Run() + 0x3c
    (XUL) XREMain::XRE_mainRun() + 0x72e
    (XUL) XREMain::XRE_main(int, char**, mozilla::BootstrapConfig const&) + 0x34d
    (XUL) XRE_main(int, char**, mozilla::BootstrapConfig const&) + 0xb1
    (firefox) main + 0x22f
    (dyld) start + 0x775

(Following up comment #111)

And here's the NSWindow object itself being deleted "early", after the block in which it was closed has finished running:

(Thu Oct 17 11:42:54 2024) /Applications/Firefox Nightly.app/Contents/MacOS/firefox[77846] MainThread[0x1078cc240] Hook.mm: [NSObject dealloc]: self ToolbarWindow(0x138c7c100)
    (hook.dylib) PrintStackTrace() + 0x68
    (hook.dylib) -[NSObject(MethodSwizzling) NSObject_dealloc] + 0x8c
    (AppKit) -[NSResponder dealloc] + 0x1b1
    (AppKit) -[NSWindow dealloc] + 0xb4c
    (XUL) -[BaseWindow dealloc] + 0x52
    (XUL) -[ToolbarWindow dealloc] + 0x7a
    (Foundation) _NSKVOPerformWithDeallocatingObservable + 0x97
    (Foundation) NSKVODeallocate + 0x96
    (hook.dylib) __destroy_helper_block_e8_32o40r48r + 0x46
    (libsystem_blocks.dylib) _call_dispose_helpers_excp + 0x25
    (libsystem_blocks.dylib) _Block_release + 0xe8
    (CoreFoundation) __CFRunLoopDoBlocks + 0x1b4
    (CoreFoundation) __CFRunLoopRun + 0x385
    (CoreFoundation) CFRunLoopRunSpecific + 0x22d
    (CoreFoundation) CFRunLoopRun + 0x28
    (hook.dylib) after_enumerateWithOptions(objc_object*, unsigned long, void (objc_object*, unsigned long, signed char*) block_pointer) + 0x42b
    (hook.dylib) -[NSObject(NSArraySwizzling) __NSArrayM_enumerateObjectsWithOptions:usingBlock:] + 0x5d
    (AppKit) -[_NSTrackingAreaAKManager _updateActiveTrackingAreasForWindowLocation:modifierFlags:] + 0x6f3
    (hook.dylib) -[NSObject(_NSTrackingAreaAKManagerSwizzling) _NSTrackingAreaAKManager__updateActiveTrackingAreasForWindowLocation:modifierFlags:] + 0xba
    (AppKit) __58-[_NSTrackingAreaAKManager _activeTrackingAreasNeedUpdate]_block_invoke + 0x13e
    (hook.dylib) Hooked_activeTrackingAreasNeedUpdate_block_invoke(_activeTrackingAreasNeedUpdate_literal*) + 0xba
    (AppKit) ___NSMainRunLoopPerformBlockInModes_block_invoke + 0x19
    (hook.dylib) Hooked____NSMainRunLoopPerformBlockInModes_block_invoke(_perform_block_literal*) + 0x74
    (CoreFoundation) __CFRUNLOOP_IS_CALLING_OUT_TO_A_BLOCK__ + 0xc
    (CoreFoundation) __CFRunLoopDoBlocks + 0x18e
    (CoreFoundation) __CFRunLoopRun + 0x385
    (CoreFoundation) CFRunLoopRunSpecific + 0x22d
    (HIToolbox) RunCurrentEventLoopInMode + 0x124
    (HIToolbox) ReceiveNextEventCommon + 0xc9
    (HIToolbox) _BlockUntilNextEventMatchingListInModeWithFilter + 0x42
    (AppKit) _DPSNextEvent + 0x370
    (AppKit) -[NSApplication(NSEventRouting) _nextEventMatchingEventMask:untilDate:inMode:dequeue:] + 0x4f9
    (XUL) -[GeckoNSApplication nextEventMatchingMask:untilDate:inMode:dequeue:] + 0x7c
    (AppKit) -[NSApplication run] + 0x25b
    (XUL) -[GeckoNSApplication run] + 0x42
    (XUL) nsAppShell::Run() + 0x13e
    (XUL) nsAppStartup::Run() + 0x3c
    (XUL) XREMain::XRE_mainRun() + 0x72e
    (XUL) XREMain::XRE_main(int, char**, mozilla::BootstrapConfig const&) + 0x34d
    (XUL) XRE_main(int, char**, mozilla::BootstrapConfig const&) + 0xb1
    (firefox) main + 0x22f
    (dyld) start + 0x775

(Following up comment #111)

Here's the NSWindow object being closed (the one that, after it's deleted, triggers the crash on a call to -[NSWindow windowNumber]). This happens before the events in comment #112 and comment #113.

(Thu Oct 17 11:42:54 2024) /Applications/Firefox Nightly.app/Contents/MacOS/firefox[77846] MainThread[0x1078cc240] Hook.mm: [NSWindow close]: self <ToolbarWindow: 0x138c7c100>
    (hook.dylib) PrintStackTrace() + 0x68
    (hook.dylib) -[NSWindow(MethodSwizzling) NSWindow_close] + 0x8f
    (XUL) nsCocoaWindow::DestroyNativeWindow() + 0x95
    (XUL) nsCocoaWindow::Destroy() + 0x89
    (XUL) mozilla::AppWindow::Destroy() + 0x1e5
    (XUL) mozilla::AppWindow::RequestWindowClose(nsIWidget*) + 0x1ec
    (XUL) mozilla::AppWindow::WidgetListenerDelegate::RequestWindowClose(nsIWidget*) + 0x1c
    (XUL) -[WindowDelegate windowShouldClose:] + 0x29
    (AppKit) __19-[NSWindow __close]_block_invoke + 0xc6
    (AppKit) -[NSWindow __close] + 0x1ad
    (AppKit) -[NSApplication(NSResponder) sendAction:to:from:] + 0x151
    (AppKit) -[NSControl sendAction:to:] + 0x56
    (AppKit) __26-[NSCell _sendActionFrom:]_block_invoke + 0x83
    (AppKit) -[NSCell _sendActionFrom:] + 0xab
    (AppKit) -[NSButtonCell _sendActionFrom:] + 0x60
    (AppKit) __29-[NSButtonCell performClick:]_block_invoke + 0x40
    (AppKit) -[NSButtonCell performClick:] + 0x24d
    (hook.dylib) invocation function for block in after_enumerateWithOptions(objc_object*, unsigned long, void (objc_object*, unsigned long, signed char*) block_pointer) + 0x5f
    (CoreFoundation) __CFRUNLOOP_IS_CALLING_OUT_TO_A_BLOCK__ + 0xc
    (CoreFoundation) __CFRunLoopDoBlocks + 0x18e
    (CoreFoundation) __CFRunLoopRun + 0x385
    (CoreFoundation) CFRunLoopRunSpecific + 0x22d
    (CoreFoundation) CFRunLoopRun + 0x28
    (hook.dylib) after_enumerateWithOptions(objc_object*, unsigned long, void (objc_object*, unsigned long, signed char*) block_pointer) + 0x42b
    (hook.dylib) -[NSObject(NSArraySwizzling) __NSArrayM_enumerateObjectsWithOptions:usingBlock:] + 0x5d
    (AppKit) -[_NSTrackingAreaAKManager _updateActiveTrackingAreasForWindowLocation:modifierFlags:] + 0x6f3
    (hook.dylib) -[NSObject(_NSTrackingAreaAKManagerSwizzling) _NSTrackingAreaAKManager__updateActiveTrackingAreasForWindowLocation:modifierFlags:] + 0xba
    (AppKit) __58-[_NSTrackingAreaAKManager _activeTrackingAreasNeedUpdate]_block_invoke + 0x13e
    (hook.dylib) Hooked_activeTrackingAreasNeedUpdate_block_invoke(_activeTrackingAreasNeedUpdate_literal*) + 0xba
    (AppKit) ___NSMainRunLoopPerformBlockInModes_block_invoke + 0x19
    (hook.dylib) Hooked____NSMainRunLoopPerformBlockInModes_block_invoke(_perform_block_literal*) + 0x74
    (CoreFoundation) __CFRUNLOOP_IS_CALLING_OUT_TO_A_BLOCK__ + 0xc
    (CoreFoundation) __CFRunLoopDoBlocks + 0x18e
    (CoreFoundation) __CFRunLoopRun + 0x385
    (CoreFoundation) CFRunLoopRunSpecific + 0x22d
    (HIToolbox) RunCurrentEventLoopInMode + 0x124
    (HIToolbox) ReceiveNextEventCommon + 0xc9
    (HIToolbox) _BlockUntilNextEventMatchingListInModeWithFilter + 0x42
    (AppKit) _DPSNextEvent + 0x370
    (AppKit) -[NSApplication(NSEventRouting) _nextEventMatchingEventMask:untilDate:inMode:dequeue:] + 0x4f9
    (XUL) -[GeckoNSApplication nextEventMatchingMask:untilDate:inMode:dequeue:] + 0x7c
    (AppKit) -[NSApplication run] + 0x25b
    (XUL) -[GeckoNSApplication run] + 0x42
    (XUL) nsAppShell::Run() + 0x13e
    (XUL) nsAppStartup::Run() + 0x3c
    (XUL) XREMain::XRE_mainRun() + 0x72e
    (XUL) XREMain::XRE_main(int, char**, mozilla::BootstrapConfig const&) + 0x34d
    (XUL) XRE_main(int, char**, mozilla::BootstrapConfig const&) + 0xb1
    (firefox) main + 0x22f
    (dyld) start + 0x775

Tricky, tricky, keep trying for a fix that doesn't leak memory.

Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Status: REOPENED → ASSIGNED
Target Milestone: 133 Branch → ---

(In reply to Brad Werth [:bradwerth] from comment #115)

Tricky, tricky, keep trying for a fix that doesn't leak memory.

For what it's worth, my patch from comment #92 doesn't leak memory.

This is a simplified version of a fix authored by Steven Michaud. This
creates a death grip in DestroyNativeWindow() and holds it until the
nsCocoaWindow is destroyed. This seems to satisfy the various run loops
in macOS which might invoke DestroyNativeWindow() without holding a
reference to the window itself.

Brad, have you started a tryserver build for your new patch? I'd like to test it with my hook library from comment #93.

(In reply to Steven Michaud [:smichaud] (Retired) from comment #118)

Brad, have you started a tryserver build for your new patch? I'd like to test it with my hook library from comment #93.

Try run in progress at https://treeherder.mozilla.org/jobs?repo=try&revision=a8380edf34b8f6117919500f39bd5926b9389e16.

No crashes with my hook library from comment #93. And neither of the relevant ToolbarWindow and nsCocoaWindow objects are leaked. Hooray!

I haven't yet tested HideWindowChrome() (using ./mach reftest layout/tools/reftest/reftest.xhtml). I'll do that later, once Brad's patch from comment #117 has landed on mozilla-central.

(In reply to Steven Michaud [:smichaud] (Retired) from comment #120)

I haven't yet tested HideWindowChrome() (using ./mach reftest layout/tools/reftest/reftest.xhtml). I'll do that later, once Brad's patch from comment #117 has landed on mozilla-central.

I just did, after having added Brad's patch from comment #117 to current trunk code.

I don't see any crashes. The NSWindow object created by the last call to HideWindowChrome() does get leaked (running ./mach reftest layout/tools/reftest/reftest.xhtml). This is because the nsCocoaWindow object on which HideWindowChrome() is called is itself leaked, so its destructor never runs. The nsCocoaWindow object leakage also happens without the patch from comment #117, and with or without the patch from comment #107. So this isn't a big deal.

Pushed by bwerth@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/3a818b2e1d6d Make nsCocoaWindow hold a death grip on its native window until its destructor. r=mstange
Status: ASSIGNED → RESOLVED
Closed: 10 months ago10 months ago
Resolution: --- → FIXED
Target Milestone: --- → 133 Branch

I just tested today's mozilla-central nightly with my hook library -- no crashes! (It's the first nightly with this bug's latest patch.)

I suggest uplifting the patch to beta. Otherwise it's going to take a while to find out if the patch really does get rid of this bug's crashes.

Here's how to search for crashes in mozilla-central nightlies that have this bug's patch. Later, if the patch is uplifted, I'll update the search to include betas that have this bug's patch. (I updated it just now, per comment #125 below.)

https://crash-stats.mozilla.org/search/?signature=~_NSTrackingAreaAKManager&build_id=%3E%3D20241024094434&modules_in_stack=%21~hook.dylib&release_channel=nightly&release_channel=beta&release_channel=aurora&version=133.0a1&version=133.0b&platform=Mac%20OS%20X&date=%3E%3D2024-10-22T18%3A57%3A00.000Z&date=%3C2024-10-29T18%3A57%3A00.000Z&_facets=signature&_facets=platform_version&_facets=proto_signature&_facets=address&_facets=version&page=1&_sort=-date&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#crash-reports

133 rides to Beta next week.

(In reply to Ryan VanderMeulen [:RyanVM] from comment #125)

133 rides to Beta next week.

OK, that should be good enough.

Blocks: 1884624
See Also: 1884624
QA Whiteboard: [post-critsmash-triage]
Flags: qe-verify-
See Also: → 1765391

Brad's patch from comment #117 has been on the 133 branch for about ten days, including six for which it was on beta. There have been no crashes at all in builds containing this patch. Given the high crash volume on earlier branches, I think we can say the fix has been verified.

The patch has also fixed (or probably fixed) several other bugs -- bug 1927157, bug 1884624, bug 1765391, bug 1904850 and bug 1899823. This shows that all of these (along with this bug) were triggered by run loop nesting forcing native (Cocoa) objects to be released "too early".

For a while I hoped the patch might fix all Mozilla UAF bugs on macOS. Unfortunately it doesn't. "Accessibility" UAFs still happen:

https://crash-stats.mozilla.org/search/?build_id=%3E%3D20241024094434&address=~e5e5&release_channel=nightly&release_channel=aurora&release_channel=beta&version=133.0a1&version=133.0b&platform=Mac%20OS%20X&date=%3E%3D2024-10-21T15%3A44%3A00.000Z&date=%3C2024-11-04T15%3A44%3A00.000Z&_facets=signature&_facets=platform_version&_facets=proto_signature&_facets=address&_facets=version&_facets=mac_crash_info&_sort=-date&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-signature

https://crash-stats.mozilla.org/search/?build_id=%3E%3D20241024094434&app_notes=~unrecognized%20selector&release_channel=nightly&release_channel=aurora&release_channel=beta&version=133.0a1&version=133.0b&platform=Mac%20OS%20X&date=%3E%3D2024-10-21T15%3A44%3A00.000Z&date=%3C2024-11-04T15%3A44%3A00.000Z&_facets=signature&_facets=platform_version&_facets=proto_signature&_facets=address&_facets=version&_sort=-date&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform_version&_columns=release_channel#facet-signature

But it certainly fixes a lot of them, all of which were (probably) intractable before my insight about run loop nesting.

Status: RESOLVED → VERIFIED
Blocks: 1765391
See Also: 1765391

:bradwerth could you look at uplift requests for esr128 when you have a moment?

Flags: needinfo?(bwerth)

Comment on attachment 9431899 [details]
Bug 1880582: Make nsCocoaWindow hold a death grip on its native window until its destructor.

ESR Uplift Approval Request

  • If this is not a sec:{high,crit} bug, please state case for ESR consideration: Seems to fix a variety of macOS crashes during tab closing.
  • User impact if declined: Users will continue to experience occasional parent process crashes when closing a tab.
  • Fix Landed on Version: 133
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): Holds a memory reference a bit longer than normal, and releases the memory when the owning object is destructed. Seems quite safe.
Flags: needinfo?(bwerth)
Attachment #9431899 - Flags: approval-mozilla-esr128?
No longer blocks: 1899823
No longer blocks: 1765391

And, very unfortunately, bug 1765391 isn't fixed either. See bug 1765391 comment #34.

Comment on attachment 9431899 [details]
Bug 1880582: Make nsCocoaWindow hold a death grip on its native window until its destructor.

Approved for 128.5esr.

Attachment #9431899 - Flags: approval-mozilla-esr128? → approval-mozilla-esr128+

Bug 1927157 isn't fixed by the latest patch for this bug. See bug 1927157 comment #9.

Whiteboard: [tbird crash] → [tbird crash][adv-main133+r]
Whiteboard: [tbird crash][adv-main133+r] → [tbird crash][adv-main133+r][adv-esr128.5+r]
Group: core-security-release

Glad we finally got this sorted out. Thanks, all, for being patient.

You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: