Closed Bug 1765391 Opened 3 years ago Closed 8 months ago

Crash in [@ nsMenuBarX::Paint]

Categories

(Core :: Widget: Cocoa, defect, P3)

defect

Tracking

()

RESOLVED FIXED
135 Branch
Tracking Status
firefox-esr115 --- wontfix
firefox-esr128 --- wontfix
firefox132 --- wontfix
firefox133 --- wontfix
firefox134 --- wontfix
firefox135 --- fixed

People

(Reporter: mccr8, Assigned: spohl)

References

Details

(Keywords: crash, regression, Whiteboard: [tbird crash] [STR in comment #61])

Crash Data

Attachments

(6 files, 4 obsolete files)

Crash report: https://crash-stats.mozilla.org/report/index/606ea0f7-5d04-41ee-bb7c-a43880220419

MOZ_CRASH Reason: MOZ_CRASH(Encountered unexpected Objective C exception)

Top 10 frames of crashing thread:

0 XUL nsMenuBarX::Paint widget/cocoa/nsMenuBarX.mm:431
1 XUL -[WindowDelegate windowDidResignMain:] widget/cocoa/nsCocoaWindow.mm:2886
2 CoreFoundation __CFNOTIFICATIONCENTER_IS_CALLING_OUT_TO_AN_OBSERVER__ 
3 CoreFoundation ___CFXRegistrationPost_block_invoke 
4 CoreFoundation _CFXRegistrationPost 
5 CoreFoundation _CFXNotificationPost 
6 Foundation -[NSNotificationCenter postNotificationName:object:userInfo:] 
7 AppKit -[NSWindow _changeKeyAndMainLimitedOK:] 
8 AppKit -[NSWindow _makeKeyRegardlessOfVisibility] 
9 AppKit -[NSWindow makeKeyAndOrderFront:] 

I hit this crash, in a build that I think has the patch from bug 1699936. My browser was lagging a lot while trying to switch "tabs" in crash stats, so I decided to restart the browser. While it was in the middle of shutting down, it hit this crash.

signature ranks #35 for Thunderbird 100 beta
bp-d7530b60-9a3f-4395-84a7-185290220421

Whiteboard: [tbird crash]

This should have started to drop off now that bug 1699936 was backed out. Wayne, is this so?

Flags: needinfo?(vseerror)
Regressed by: 1699936

Will have results in about two weeks, with stats from beta 101

Has Regression Range: --- → yes

FWIW a firefox 101.0b2 beta crash- May 3 build bp-8716a618-6155-4634-9c5a-806e90220506

(In reply to Stephen A Pohl [:spohl] from comment #2)

This should have started to drop off now that bug 1699936 was backed out. Wayne, is this so?

The backout was April 25. That change would pick up in beta 101?

Overall, for beta, average crash rate

Flags: needinfo?(vseerror) → needinfo?(spohl.mozilla.bugs)

Then this couldn't have been caused by bug 1699936...

Flags: needinfo?(spohl.mozilla.bugs)
No longer regressed by: 1699936
See Also: 1699936

The volume is pretty low, could we decrease the severity?

Flags: needinfo?(spohl.mozilla.bugs)
Severity: S2 → S3
Flags: needinfo?(spohl.mozilla.bugs)
Priority: -- → P3

The bug is linked to a topcrash signature, which matches the following criterion:

  • Top 5 desktop browser crashes on Mac on beta (startup)

:spohl, could you consider increasing the severity of this top-crash bug?

For more information, please visit auto_nag documentation.

Flags: needinfo?(spohl.mozilla.bugs)

(In reply to Marco Castelluccio [:marco] from comment #7)

The volume is pretty low, could we decrease the severity?

(In reply to Release mgmt bot [:suhaib / :marco/ :calixte] from comment #8)

:spohl, could you consider increasing the severity of this top-crash bug?

🤷‍♂️

Flags: needinfo?(spohl.mozilla.bugs)

A couple of crashes per build doesn't really seem like cause for alarm (comment 8 says this is top 5 for beta Mac startup crashes). Maybe there should be a lower threshold for crash volume for one of these alerts?

Flags: needinfo?(smujahid)

The release reports are throttled so the actual count is like 10x higher? And this is on a minor platform, which means that the chance per user of hitting this crash is proportionally greater? I understand we like S2 bugs to be gone from everyone's radar but I do think this is the kind of thing we actually want to look into...

(In reply to Andrew McCreight [:mccr8] from comment #10)

A couple of crashes per build doesn't really seem like cause for alarm (comment 8 says this is top 5 for beta Mac startup crashes). Maybe there should be a lower threshold for crash volume for one of these alerts?

Autonag implements the top crash criteria which in this specific case have the following thresholds: "If there's less than 5 crashes per week on a signature, that bug probably still doesn't qualify - same for crashes happening to only 2 or 3 installations."

Flags: needinfo?(smujahid)

https://support.mozilla.org/en-US/questions/1396794#answer-1549032 cites Thunderbird crash bp-32bd076c-494d-4eed-bd73-e2cba0221114

0 XUL nsMenuBarX::Paint()
1 XUL +[WindowDelegate paintMenubarForWindow:]
2 XUL -[WindowDelegate windowDidBecomeMain:]
3 CoreFoundation CFNOTIFICATIONCENTER_IS_CALLING_OUT_TO_AN_OBSERVER
4 CoreFoundation ___CFXRegistrationPost_block_invoke
5 CoreFoundation _CFXRegistrationPost
6 CoreFoundation _CFXNotificationPost
7 Foundation -[NSNotificationCenter postNotificationName:object:userInfo:]
8 AppKit -[NSWindow _changeKeyAndMainLimitedOK:]
9 AppKit -[NSMenuWindowManagerWindow _restorePreviousKeyWindowFromSavedProperties]
10 AppKit -[NSMenuWindowManagerWindow _setVisible:]
11 AppKit -[NSWindow _doWindowWillBecomeHidden]
12 AppKit -[NSWindow _reallyDoOrderWindowOutRelativeTo:findKey:forCounter:force:isModal:]

Based on the topcrash criteria, the crash signature linked to this bug is not a topcrash signature anymore.

For more information, please visit auto_nag documentation.

See Also: → 1799247

Firefox crash rate of the past month is tripled compared to April.
And pct of crashes <1min uptime has increased to 67% for past month, up from 56% for April

Flags: needinfo?(spohl.mozilla.bugs)

I am reworking our menu bar in bug 1808223 and this crash will hopefully be fixed at the same time. Marking as blocked by bug 1808223 for now.

Depends on: 1808223
Flags: needinfo?(spohl.mozilla.bugs)

New report: Thunderbird Support Forum: https://support.mozilla.org/en-US/questions/1428216

MAC OS X version 10.15.7 Catalina
Crash ID: bp-290856b9-84d7-4075-aada-dfe180231020

Info says using version 102.15.1
I'm going to suggest:

  • switch off auto compact
  • try manual compact on one folder at time.
  • an upgrade to 115.3.3, but that's going to be the user choice. Just to see if problem continues.

The fix that ultimately landed in bug 1808223 will probably not make a difference in the crash rate here.

No longer depends on: 1808223
Assignee: nobody → spohl.mozilla.bugs
Status: NEW → ASSIGNED

I will be landing the diagnostic patch next week. Adding leave-open keyword.

Keywords: leave-open
Pushed by spohl@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/d5c404ba6047 Add more exception handling to help isolate when the OS is throwing an NSGenericException when painting menu bars on macOS. r=mstange
See Also: → 1870924

The leave-open keyword is there and there is no activity for 6 months.
:spohl, maybe it's time to close this bug?
For more information, please visit BugBot documentation.

Flags: needinfo?(spohl.mozilla.bugs)

The improved exception handling shows that this crash occurs when we're attempting to set NSApp.mainMenu. The exception, "NSGenericException: *** Collection <__NSArrayM: 0x140c82f10> was mutated while being enumerated." indicates that macOS might be in the process of enumerating the menu items when we're attempting to set a new mainMenu. Let's see if we can fix this with a simple mutex lock.

Flags: needinfo?(spohl.mozilla.bugs)
Keywords: leave-open

(In reply to Stephen A Pohl [:spohl] from comment #25)

The improved exception handling shows that this crash occurs when we're attempting to set NSApp.mainMenu. The exception, "NSGenericException: *** Collection <__NSArrayM: 0x140c82f10> was mutated while being enumerated." indicates that macOS might be in the process of enumerating the menu items when we're attempting to set a new mainMenu. Let's see if we can fix this with a simple mutex lock.

Which crash report did you see this in?

Is another thread enumerating, or is the same thread enumerating? The stacks in the report should show us.

Flags: needinfo?(spohl.mozilla.bugs)

(In reply to Markus Stange [:mstange] from comment #27)

(In reply to Stephen A Pohl [:spohl] from comment #25)

The improved exception handling shows that this crash occurs when we're attempting to set NSApp.mainMenu. The exception, "NSGenericException: *** Collection <__NSArrayM: 0x140c82f10> was mutated while being enumerated." indicates that macOS might be in the process of enumerating the menu items when we're attempting to set a new mainMenu. Let's see if we can fix this with a simple mutex lock.

Which crash report did you see this in?

Is another thread enumerating, or is the same thread enumerating? The stacks in the report should show us.

For example in: https://crash-stats.mozilla.org/report/index/beaf09ca-957f-476c-a21a-47f840240618

The top-most frame points to the try/abort block that only has one statement, which is setting NSApp.mainMenu.

Flags: needinfo?(spohl.mozilla.bugs)
Duplicate of this bug: 1924906

This is a stretch, but it's possible the (final) fix for bug 1880582 will have an effect here, or even fix this bug's crashes.

Bug 1880582 involved "mutations" that it seemed could only happen as a result of thread contention. It turns out they were caused by run loop nesting. The same could be true here.

How to search for this bug's crashes on builds that contain the (final) fix for bug 1880582:

https://crash-stats.mozilla.org/search/?app_notes=~mutated%20while%20being%20enumerated&build_id=%3E%3D20241024094434&version=133.0a1&version=133.0b&platform=Mac%20OS%20X&date=%3E%3D2024-10-23T14%3A27%3A00.000Z&date=%3C2024-10-30T14%3A27%3A00.000Z&_facets=signature&page=1&_sort=-date&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-signature

See Also: → 1880582

Thanks, Steven! I'll keep an eye on your search to see if the fix here is still necessary.

See Also: 1880582

This bug's crashes have disappeared on 133.0b builds, all of which have the (latest) patch for bug 1880582. Given this bug's high crash volume, I think we can say this bug's crashes have been fixed (worked around). What do you think, Stephen?

Agreed! Thank you, Steven!

Closing as fixed by bug 1880582.

Status: ASSIGNED → RESOLVED
Closed: 10 months ago
Resolution: --- → FIXED
Assignee: spohl.mozilla.bugs → bwerth
Target Milestone: --- → 133 Branch

Unfortunately, the latest patch for bug 1880582 doesn't seem to have fixed this bug:

bp-be5d3964-cc2e-45a8-8fed-65e8a0241104

That the "mutations'" only easy explanation was run loop nesting made it seem likely. And run loop nesting may yet be involved here. But "early" deallocation of an NSWindow object doesn't seem to be the trigger.

On the positive side, I noticed that none of this bug's crashes happen on macOS 14 or later. So even without a fix, they should diminish over time.

Status: RESOLVED → REOPENED
No longer depends on: 1880582
Resolution: FIXED → ---
Status: REOPENED → NEW
Target Milestone: 133 Branch → ---

I think Stephen's patch from comment #26 is worth a try.

Pushed by spohl@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/38382a91376e Fix crash when attempting to set the app's mainMenu on macOS. r=mac-reviewers,mstange,bradwerth

Didn't realize this one was assigned to me. Stephen is working this.

Assignee: bwerth → spohl.mozilla.bugs
Flags: needinfo?(bwerth) → needinfo?(spohl.mozilla.bugs)
Pushed by spohl@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/7b4b4cbe7323 Fix crash when attempting to set the app's mainMenu on macOS. r=mac-reviewers,mstange,bradwerth
Flags: needinfo?(spohl.mozilla.bugs)
Status: NEW → RESOLVED
Closed: 10 months ago10 months ago
Resolution: --- → FIXED
Target Milestone: --- → 134 Branch

The patch landed in nightly and beta is affected.
:spohl, is this bug important enough to require an uplift?

  • If yes, please nominate the patch for beta approval.
  • If no, please set status-firefox133 to wontfix.

For more information, please visit BugBot documentation.

Flags: needinfo?(spohl.mozilla.bugs)

Let's have this ride the trains to give us an opportunity to look out for any further crash reports and/or regressions.

Flags: needinfo?(spohl.mozilla.bugs)

Stephen's patch (from comment #26) hasn't fixed this bug's crashes :-(

bp-6e9e5bdb-0ac2-47c8-8166-947610241112

I've painstakingly symbolicated the top eight lines of the stack in the App Notes for bp-6e9e5bdb-0ac2-47c8-8166-947610241112. For the record, the dyld shared cache slide for this particular crash is 0x13d7f000. I figured that out by comparing the known value for (dyld) start + 0x768 (on Intel macOS 13.7.1) to the one in the App Notes crash stack. (By the way, the offsets in Mozilla crash stacks are systematically off by 1.) The line just below these eight is the call to nsMenuBarX::Paint() at the top of the crash report's stack. I used atos, plus the output of dyld_shared_cache_util -list -vmaddr to find which modules were involved from the bare addresses (minus the slide) in the original stack.

__exceptionPreprocess (in CoreFoundation) + 178
objc_exception_throw (in libobjc.A.dylib) + 48
__NSFastEnumerationMutationHandler (in CoreFoundation) + 135
-[NSCarbonMenuImpl _menuLostMainMenuStatus] (in AppKit) + 398
-[NSToolbar _notifyFamily_DidRemoveItemAtIndex:] (in AppKit) + 47
_NSLogOverriddenIdentifierForObject (in AppKit) + 196
-[NSMenuKEUniquer _coreAddItem:] (in AppKit) + 793
-[NSMenu setDelegate:] (in AppKit) + 48

Edit: I redid the symbolication using modules that I'd extracted (as files) from the shared library cache. The result is slightly different, but basically the same.

Edit: See comment #59 below.

I wonder if we could sidestep this issue by instead using a pattern of copy-modify-replace rather than trying to change the current mainMenu at all. In other words, nsMenuBarX::Paint modifies both mNativeMenu and outgoingMenu (both of which, I think, are retained pointers to NSApp.mainMenu) and instead maybe we just make a copy of the menu, modify the copy, and then set the mainMenu to the modified copy -- possibly on a later event loop.

That is more complicated, and perhaps not well-reasoned. I'm also not sure if it would solve the problem if we are in a nested run loop as we think was happening in similar Bug 1880582. Just brainstorming, here...

(In reply to Steven Michaud [:smichaud] (Retired) from comment #45)

I've painstakingly symbolicated the top eight lines of the stack in the App Notes

FWIW, there's a tampermonkey script in bug 1692394 comment 5 but it's not working anymore because the symbolication request is blocked by CSP.

A really simple fix occurs to me.

Stephen, you patched nsCocoaWindow::SetMenuBar() to defer a call to mMenuBar->Paint(). But there's still a call to hiddenWindowMenuBar->Paint() in -[WindowDelegate windowDidResignMain:(NSNotification*)aNotification]. Maybe you should defer that call, too.

(Following up comment #45)

I can't make any sense of the symbolicated stack from comment #45. In particular, I can't reproduce the call to -[NSMenu setDelegate:]. I'll keep playing around with my hook library, but it's going to be a lot of work. Hopefully my suggestion from comment #49 will obviate the need for this work.

I am able to reproduce the following, which makes a lot more sense as the top of this bug's crash stacks:

(Fri Nov 15 15:38:34 2024) /Applications/Firefox Nightly.app/Contents/MacOS/firefox[6379] MainThread[0x1118d4280] Hook.mm: -[NSCarbonMenuImpl _menuLostMainMenuStatus]
    (hook.dylib) PrintStackTrace() + 0x68
    (hook.dylib) -[NSObject(NSCarbonMenuImplSwizzling) NSCarbonMenuImpl__menuLostMainMenuStatus] + 0x68
    (AppKit) -[NSMenu _setMenuName:] + 0x27e
    (AppKit) -[NSApplication setMainMenu:] + 0x15e
    (AppKit) -[NSMenu _setMenuName:] + 0x405
    (XUL) nsMenuBarX::Paint() + 0xad
...

Maybe the App Notes' stacks are corrupt. Or maybe (and I think this is more likely) there are problems with atos.

-[NSCarbonMenuImpl _menuLostMainMenuStatus] calls -[NSFastEnumeration countByEnumeratingWithState:objects:count:] on the NSCarbonMenuImpl object's array of menu items. I bet this is where the "mutation" exception is thrown.

Status: RESOLVED → REOPENED
Resolution: FIXED → ---

(In reply to Steven Michaud [:smichaud] (Retired) from comment #51)

I am able to reproduce the following, which makes a lot more sense as the top of this bug's crash stacks:

(Fri Nov 15 15:38:34 2024) /Applications/Firefox Nightly.app/Contents/MacOS/firefox[6379] MainThread[0x1118d4280] Hook.mm: -[NSCarbonMenuImpl _menuLostMainMenuStatus]
    (hook.dylib) PrintStackTrace() + 0x68
    (hook.dylib) -[NSObject(NSCarbonMenuImplSwizzling) NSCarbonMenuImpl__menuLostMainMenuStatus] + 0x68
    (AppKit) -[NSMenu _setMenuName:] + 0x27e
    (AppKit) -[NSApplication setMainMenu:] + 0x15e
    (AppKit) -[NSMenu _setMenuName:] + 0x405
    (XUL) nsMenuBarX::Paint() + 0xad
...

Maybe the App Notes' stacks are corrupt. Or maybe (and I think this is more likely) there are problems with atos.

There should be a call to (AppKit) -[NSApplication setMainMenu:] above (XUL) nsMenuBarX::Paint() + 0xad and below (AppKit) -[NSMenu _setMenuName:] + 0x405. The reason there isn't is because (AppKit) -[NSApplication setMainMenu:], at its end, jmps to (AppKit) -[NSMenu _setMenuName:]. This puts the two calls at the same offset in the call stack, so that the latter overwrites the former.

Status: REOPENED → NEW

(In reply to Steven Michaud [:smichaud] (Retired) from comment #49)

A really simple fix occurs to me.

Stephen, you patched nsCocoaWindow::SetMenuBar() to defer a call to mMenuBar->Paint(). But there's still a call to hiddenWindowMenuBar->Paint() in -[WindowDelegate windowDidResignMain:(NSNotification*)aNotification]. Maybe you should defer that call, too.

Steven, your comment made me try to aggregate the crash reports on their proto signatures, and it turns out that there were a total of five different ways that this crash occurred. My patch addresses the remaining four.

Someone please look at the comments for this bug's crashes and report back here. I'm trying to figure out how run loop nesting might be involved.

  • Several comments indicate crashing upon restart (which seemed to have been a manual close and start, so possibly session restore related?)
  • some indicating opening "attachments" (presumably from webmail)
  • at least one indicated rapid closing of many tabs/windows that were no longer needed
  • one mentioned closing windows during session restore at startup
  • some indicate crashes after restart to update
  • some say that they are no longer able to open Firefox and see an immediate crash

I can dig for more.

Thanks. I was hoping for something about dragging. Otherwise I don't really know what I'm looking for ... at least not yet.

HookCase gives me the power to see all kinds of things that would otherwise be invisible. But sometimes that leads me down rabbit holes. I'll keep playing with my hook library for this bug. With luck something will turn up that isn't just another rabbit hole.

To this point I've discovered that this bug's "mutation" exceptions are thrown when changes are made to an array of NSMenuItem pointers (what's returned by -[NSMenu itemArray]) during a call on one of these arrays to -[NSFastEnumeration countByEnumeratingWithState:objects:count:]. See comment #52. I've also (I think) discovered a way to log all changes to such arrays. So far, though, I haven't seen any of these changes being made on a nested run loop. I'll keep looking.

(Following up comment #45)

I redid my manual symbolication without using atos, and got a reasonable result. (Instead of using atos I looked up addresses (minus the slide) in modules loaded into Hopper Disassembler.)

(CoreFoundation) __exceptionPreprocess + 0xe2
(libobjc.A.dylib) objc_exception_throw + 0x30
(CoreFoundation) __NSFastEnumerationMutationHandler + 0x97
(AppKit) -[NSCarbonMenuImpl _destroyMenuRefIfNotCheckedOut] + 0x104
(AppKit) -[NSCarbonMenuImpl _menuLostMainMenuStatus] + 0x1ed
(AppKit) -[NSMenu _setMenuName:] + 0x27e
(AppKit) -[NSApplication setMainMenu:] + 0x15e
(AppKit) -[NSMenu _setMenuName:] + 0x405

I'm still pretty sure run loop nesting comes into this bug, somehow. But, so far at least, my best efforts don't show me how.

In the meantime I've been looking into other stuff, and as best I can tell this bug's crashes happen in either of two circumstances:

  1. Two or more windows are open and Firefox has the app focus: Do something to focus a (Firefox) window that isn't focused. For example, click on it. Or close the focused window.
  2. Two or more windows are open and Firefox doesn't have the app focus: Do something to focus a (Firefox) window that wasn't focused when Firefox had the app focus. For example click on it -- which gives the app focus back to Firefox. Or click the red close button on the (formerly) focused window -- which doesn't give back the app focus.

The first has nsCocoaWindow::Show() in the stack trace. The second has nsFocusManager::RaiseWindow() in the stack trace, except when you close the window without returning the app focus to Firefox (in this case the stack trace has nsCocoaWindow::Show()).

I have STR for this bug's crashes! The following works at least 50% of the time in today's mozilla-central nightly on macOS 13.7.1.

  1. Run Firefox and open 10 or more windows, then quit Firefox.
  2. Restart Firefox, and as soon as the Firefox menu starts being displayed but before any windows appear, browse quickly and continuously through the menu's components with the mouse held down -- that is, as if you're dragging the mouse through the menus. The windows will appear one by one as you're doing this. Keep going until all of them appear, or you crash.

bp-a8a58c65-2e0d-478c-bba5-750ca0241126
bp-79b95889-f9c2-4794-ac5e-bcab00241126
bp-5035e53e-8923-4a07-925c-4a8a00241126

Has STR: --- → yes
Whiteboard: [tbird crash] → [tbird crash] [STR in comment #61]

Run loop nesting is not involved here. Rather, if -[NSCarbonMenuImpl _destroyMenuRefIfNotCheckedOut] destroys a MenuRef, it can change an array of NSMenuItem *s while it's iterating through them. For example:

(Mon Nov 25 20:01:50 2024) /Applications/Firefox Nightly.app/Contents/MacOS/firefox[5253] MainThread[0x1100ef280] Hook.mm: -[__NSArrayM insertObject:atIndex:]: self (
    "<NSWindowRepresentingMenuItem: 0x13de60d60 Firefox Nightly>"
)
    (hook.dylib) PrintStackTrace() + 0x68
    (hook.dylib) -[NSObject(NSMutableArraySwizzling) __NSArrayM_insertObject:atIndex:] + 0x10d
    (AppKit) _findMenuItemsForWindow + 0xb3
    (AppKit) -[NSApplication(NSWindowsMenu) updateWindowsItem:] + 0xf6
    (AppKit) -[NSWindow becomeKeyWindow] + 0x2df
    (AppKit) -[NSWindow _changeKeyAndMainLimitedOK:] + 0x3e3
    (AppKit) -[NSWindow makeKeyWindow] + 0xc1
    (AppKit) -[NSMenuWindowManagerWindow _restorePreviousKeyWindowFromSavedProperties] + 0x46
    (AppKit) -[NSMenuWindowManagerWindow _setVisible:] + 0x1a1
    (AppKit) -[NSWindow _doWindowWillBecomeHidden] + 0x6b
    (AppKit) -[NSWindow _reallyDoOrderWindowOutRelativeTo:] + 0xc0
    (AppKit) -[NSWindow _reallyDoOrderWindow:] + 0x63
    (AppKit) -[NSWindow _doOrderWindow:] + 0x127
    (HIToolbox) OrderNSMenuWindowManager(void () block_pointer) + 0x9
    (HIToolbox) ShowHidePlatformWindows + 0x176
    (HIToolbox) _ShowHideWindows + 0x33d
    (HIToolbox) ShowHide + 0x23
    (HIToolbox) HideMenuWindow + 0x30
    (HIToolbox) HideMenus(MenuSelectData*, __CFArray*) + 0x47
    (HIToolbox) MenuChanged(MenuSelectData*, unsigned char, unsigned char) + 0x1e9
    (HIToolbox) _CloseMenu + 0xe5
    (AppKit) -[NSCarbonMenuImpl _destroyMenuRef] + 0x27
    (AppKit) -[NSCarbonMenuImpl _destroyMenuRefIfNotCheckedOut] + 0x59
    (hook.dylib) -[NSObject(NSCarbonMenuImplSwizzling) NSCarbonMenuImpl__destroyMenuRefIfNotCheckedOut] + 0xaa
    (AppKit) -[NSCarbonMenuImpl _destroyMenuRefIfNotCheckedOut] + 0x12d
    (hook.dylib) -[NSObject(NSCarbonMenuImplSwizzling) NSCarbonMenuImpl__destroyMenuRefIfNotCheckedOut] + 0xaa
    (AppKit) -[NSCarbonMenuImpl _menuLostMainMenuStatus] + 0x1ed
    (hook.dylib) -[NSObject(NSCarbonMenuImplSwizzling) NSCarbonMenuImpl__menuLostMainMenuStatus] + 0x73
    (AppKit) -[NSMenu _setMenuName:] + 0x27e
    (AppKit) -[NSApplication setMainMenu:] + 0x15e
    (AppKit) -[NSMenu _setMenuName:] + 0x405
    (XUL) nsMenuBarX::Paint() + 0xad
    (XUL) -[WindowDelegate windowDidResignMain:] + 0x56
    (CoreFoundation) __CFNOTIFICATIONCENTER_IS_CALLING_OUT_TO_AN_OBSERVER__ + 0x89
    (CoreFoundation) ___CFXRegistrationPost_block_invoke + 0x58
    (CoreFoundation) _CFXRegistrationPost + 0x218
    (CoreFoundation) _CFXNotificationPost + 0x2d9
    (Foundation) -[NSNotificationCenter postNotificationName:object:userInfo:] + 0x52
    (AppKit) -[NSWindow _changeKeyAndMainLimitedOK:] + 0x413
    (AppKit) -[NSWindow makeKeyWindow] + 0xc1
    (AppKit) -[NSWindow _makeKeyRegardlessOfVisibility] + 0x4c
    (AppKit) -[NSWindow makeKeyAndOrderFront:] + 0x1b
    (XUL) nsCocoaWindow::Show(bool) + 0x1f0
    (XUL) mozilla::AppWindow::SetVisibility(bool) + 0x81
    (XUL) mozilla::AppWindow::OnChromeLoaded() + 0x4a
    (XUL) mozilla::AppWindow::OnStateChange(nsIWebProgress*, nsIRequest*, unsigned int, nsresult) + 0xc6
    (XUL) non-virtual thunk to mozilla::AppWindow::OnStateChange(nsIWebProgress*, nsIRequest*, unsigned int, nsresult) + 0xd
    (XUL) nsDocLoader::DoFireOnStateChange(nsIWebProgress*, nsIRequest*, int&, nsresult) + 0x10a
    (XUL) nsDocLoader::doStopDocumentLoad(nsIRequest*, nsresult) + 0x105
    (XUL) nsDocLoader::DocLoaderIsEmpty(bool, mozilla::Maybe<nsresult> const&) + 0x24f
    (XUL) nsDocLoader::NotifyDoneWithOnload(nsDocLoader*) + 0x4b
    (XUL) nsDocLoader::DocLoaderIsEmpty(bool, mozilla::Maybe<nsresult> const&) + 0x25a
    (XUL) nsDocLoader::OnStopRequest(nsIRequest*, nsresult) + 0x1c1
    (XUL) nsDocShell::OnStopRequest(nsIRequest*, nsresult) + 0x5a
    (XUL) mozilla::net::nsLoadGroup::NotifyRemovalObservers(nsIRequest*, nsresult) + 0xd1
    (XUL) mozilla::net::nsLoadGroup::RemoveRequest(nsIRequest*, nsISupports*, nsresult) + 0x41
    (XUL) mozilla::dom::Document::UnblockOnload(bool) + 0xbc
    (XUL) mozilla::dom::Document::DispatchContentLoadedEvents() + 0x4b7
    (XUL) mozilla::detail::RunnableMethodImpl<mozilla::dom::Document*, void (mozilla::dom::Document::*)(), true, (mozilla::RunnableKind)0>::Run() + 0x1f
    (XUL) mozilla::TaskController::DoExecuteNextTaskOnlyMainThreadInternal(mozilla::detail::BaseAutoLock<mozilla::Mutex&> const&) + 0x9ba
    (XUL) mozilla::detail::RunnableFunction<mozilla::TaskController::TaskController()::$_0>::Run() + 0x3e
    (XUL) nsThread::ProcessNextEvent(bool, bool*) + 0x43a
    (XUL) NS_ProcessPendingEvents(nsIThread*, unsigned int) + 0x7e
    (XUL) nsBaseAppShell::NativeEventCallback() + 0x73
    (XUL) nsAppShell::ProcessGeckoEvents(void*) + 0xa3
    (CoreFoundation) __CFRUNLOOP_IS_CALLING_OUT_TO_A_SOURCE0_PERFORM_FUNCTION__ + 0x11
    (CoreFoundation) __CFRunLoopDoSource0 + 0x9d
    (CoreFoundation) __CFRunLoopDoSources0 + 0xd9
    (CoreFoundation) __CFRunLoopRun + 0x394
    (CoreFoundation) CFRunLoopRunSpecific + 0x230
    (hook.dylib) Hooked_CFRunLoopRunSpecific(__CFRunLoop*, __CFString const*, double, unsigned char) + 0xd1
    (HIToolbox) RunCurrentEventLoopInMode + 0x124
    (hook.dylib) Hooked_RunCurrentEventLoopInMode(__CFString const*, signed char) + 0x2d
    (HIToolbox) ReceiveNextEventCommon + 0xc7
    (HIToolbox) _BlockUntilNextEventMatchingListInModeWithFilter + 0x40
    (AppKit) _DPSNextEvent + 0x35a
    (AppKit) -[NSApplication(NSEvent) _nextEventMatchingEventMask:untilDate:inMode:dequeue:] + 0x4be
    (XUL) -[GeckoNSApplication nextEventMatchingMask:untilDate:inMode:dequeue:] + 0x7c
    (AppKit) _NSHLTBMenuEventProc + 0xd5
    (HIToolbox) IsUserStillTracking(MenuSelectData*, unsigned char*) + 0xc2
    (HIToolbox) TrackMenuCommon(MenuSelectData&, unsigned char*, SelectionData*, MenuResult*, MenuResult*) + 0x550
    (HIToolbox) MenuSelectCore(MenuData*, Point, double, unsigned int, OpaqueMenuRef**, unsigned short*) + 0x18d
    (HIToolbox) _HandleMenuSelection2 + 0x1cb
    (AppKit) _NSHandleCarbonMenuEvent + 0xd7
    (AppKit) _DPSEventHandledByCarbon + 0x36
    (AppKit) -[NSApplication(NSEvent) _nextEventMatchingEventMask:untilDate:inMode:dequeue:] + 0xb67
    (XUL) -[GeckoNSApplication nextEventMatchingMask:untilDate:inMode:dequeue:] + 0x7c
    (AppKit) -[NSApplication run] + 0x24a
    (XUL) -[GeckoNSApplication run] + 0x42
    (XUL) nsAppShell::Run() + 0x13e
    (XUL) nsAppStartup::Run() + 0x3c
    (XUL) XREMain::XRE_mainRun() + 0x751
    (XUL) XREMain::XRE_main(int, char**, mozilla::BootstrapConfig const&) + 0x34d
    (XUL) XRE_main(int, char**, mozilla::BootstrapConfig const&) + 0xb1
    (firefox) main + 0x22f
    (dyld) start + 0x768

This is pretty clearly an Apple bug. I'll be looking for a workaround, but I won't have much time during the Thanksgiving holiday.

Here's the stack trace of a "mutation" that actually triggers one of this bug's crashes:

(Mon Nov 25 20:31:54 2024) /Applications/Firefox Nightly.app/Contents/MacOS/firefox[5395] MainThread[0x10ed95280] Hook.mm: -[__NSArrayM removeObjectsInRange:]: self (
    "<GeckoNSMenuItem: 0x14e642510 Get Help, ke='Command-?'>",
    "<GeckoNSMenuItem: 0x1480ef580 Get Help, ke='Command-?'>"
)
    (hook.dylib) PrintStackTrace() + 0x68
    (hook.dylib) -[NSObject(NSMutableArraySwizzling) __NSArrayM_removeObjectsInRange:] + 0xf4
    (AppKit) -[NSMenuKEUniquer _coreRemoveItem:] + 0xc1
    (AppKit) -[NSMenuKEUniquer removeItems:] + 0xdd
    (AppKit) +[NSMenu(NSKeyEquivalents) _recursivelyUnregisterMenuForKeyEquivalentUniquing:] + 0x4d
    (AppKit) -[NSMenu _setMenuName:] + 0x29d
    (AppKit) -[NSApplication setMainMenu:] + 0x15e
    (XUL) nsMenuBarX::Paint() + 0xad
    (XUL) +[WindowDelegate paintMenubarForWindow:] + 0x72
    (XUL) -[WindowDelegate windowDidBecomeMain:] + 0x6d
    (CoreFoundation) __CFNOTIFICATIONCENTER_IS_CALLING_OUT_TO_AN_OBSERVER__ + 0x89
    (CoreFoundation) ___CFXRegistrationPost_block_invoke + 0x58
    (CoreFoundation) _CFXRegistrationPost + 0x218
    (CoreFoundation) _CFXNotificationPost + 0x2d9
    (Foundation) -[NSNotificationCenter postNotificationName:object:userInfo:] + 0x52
    (AppKit) -[NSWindow _changeKeyAndMainLimitedOK:] + 0x422
    (AppKit) -[NSWindow makeKeyWindow] + 0xc1
    (AppKit) -[NSMenuWindowManagerWindow _restorePreviousKeyWindowFromSavedProperties] + 0x46
    (AppKit) -[NSMenuWindowManagerWindow _setVisible:] + 0x1a1
    (AppKit) -[NSWindow _doWindowWillBecomeHidden] + 0x6b
    (AppKit) -[NSWindow _reallyDoOrderWindowOutRelativeTo:] + 0xc0
    (AppKit) -[NSWindow _reallyDoOrderWindow:] + 0x63
    (AppKit) -[NSWindow _doOrderWindow:] + 0x127
    (HIToolbox) OrderNSMenuWindowManager(void () block_pointer) + 0x9
    (HIToolbox) ShowHidePlatformWindows + 0x176
    (HIToolbox) _ShowHideWindows + 0x33d
    (HIToolbox) ShowHide + 0x23
    (HIToolbox) HideMenuWindow + 0x30
    (HIToolbox) HideMenus(MenuSelectData*, __CFArray*) + 0x47
    (HIToolbox) MenuChanged(MenuSelectData*, unsigned char, unsigned char) + 0x1e9
    (HIToolbox) _CloseMenu + 0xe5
    (AppKit) -[NSCarbonMenuImpl _destroyMenuRef] + 0x27
    (AppKit) -[NSCarbonMenuImpl _destroyMenuRefIfNotCheckedOut] + 0x59
    (hook.dylib) -[NSObject(NSCarbonMenuImplSwizzling) NSCarbonMenuImpl__destroyMenuRefIfNotCheckedOut] + 0xaa
    (AppKit) -[NSCarbonMenuImpl _destroyMenuRefIfNotCheckedOut] + 0x12d
    (hook.dylib) -[NSObject(NSCarbonMenuImplSwizzling) NSCarbonMenuImpl__destroyMenuRefIfNotCheckedOut] + 0xaa
    (AppKit) -[NSCarbonMenuImpl _menuLostMainMenuStatus] + 0x1ed
    (hook.dylib) -[NSObject(NSCarbonMenuImplSwizzling) NSCarbonMenuImpl__menuLostMainMenuStatus] + 0x73
    (AppKit) -[NSMenu _setMenuName:] + 0x27e
    (AppKit) -[NSApplication setMainMenu:] + 0x15e
    (AppKit) -[NSMenu _setMenuName:] + 0x405
    (XUL) nsMenuBarX::Paint() + 0xad
    (XUL) -[WindowDelegate windowDidResignMain:] + 0x56
    (CoreFoundation) __CFNOTIFICATIONCENTER_IS_CALLING_OUT_TO_AN_OBSERVER__ + 0x89
    (CoreFoundation) ___CFXRegistrationPost_block_invoke + 0x58
    (CoreFoundation) _CFXRegistrationPost + 0x218
    (CoreFoundation) _CFXNotificationPost + 0x2d9
    (Foundation) -[NSNotificationCenter postNotificationName:object:userInfo:] + 0x52
    (AppKit) -[NSWindow _changeKeyAndMainLimitedOK:] + 0x413
    (AppKit) -[NSWindow makeKeyWindow] + 0xc1
    (AppKit) -[NSWindow _makeKeyRegardlessOfVisibility] + 0x4c
    (AppKit) -[NSWindow makeKeyAndOrderFront:] + 0x1b
    (XUL) nsCocoaWindow::Show(bool) + 0x1f0
    (XUL) mozilla::AppWindow::SetVisibility(bool) + 0x81
    (XUL) mozilla::AppWindow::OnChromeLoaded() + 0x4a
    (XUL) mozilla::AppWindow::OnStateChange(nsIWebProgress*, nsIRequest*, unsigned int, nsresult) + 0xc6
    (XUL) non-virtual thunk to mozilla::AppWindow::OnStateChange(nsIWebProgress*, nsIRequest*, unsigned int, nsresult) + 0xd
    (XUL) nsDocLoader::DoFireOnStateChange(nsIWebProgress*, nsIRequest*, int&, nsresult) + 0x10a
    (XUL) nsDocLoader::doStopDocumentLoad(nsIRequest*, nsresult) + 0x105
    (XUL) nsDocLoader::DocLoaderIsEmpty(bool, mozilla::Maybe<nsresult> const&) + 0x24f
    (XUL) nsDocLoader::NotifyDoneWithOnload(nsDocLoader*) + 0x4b
    (XUL) nsDocLoader::DocLoaderIsEmpty(bool, mozilla::Maybe<nsresult> const&) + 0x25a
    (XUL) nsDocLoader::OnStopRequest(nsIRequest*, nsresult) + 0x1c1
    (XUL) nsDocShell::OnStopRequest(nsIRequest*, nsresult) + 0x5a
    (XUL) mozilla::net::nsLoadGroup::NotifyRemovalObservers(nsIRequest*, nsresult) + 0xd1
    (XUL) mozilla::net::nsLoadGroup::RemoveRequest(nsIRequest*, nsISupports*, nsresult) + 0x41
    (XUL) mozilla::dom::Document::UnblockOnload(bool) + 0xbc
    (XUL) mozilla::dom::Document::DispatchContentLoadedEvents() + 0x4b7
    (XUL) mozilla::detail::RunnableMethodImpl<mozilla::dom::Document*, void (mozilla::dom::Document::*)(), true, (mozilla::RunnableKind)0>::Run() + 0x1f
    (XUL) mozilla::TaskController::DoExecuteNextTaskOnlyMainThreadInternal(mozilla::detail::BaseAutoLock<mozilla::Mutex&> const&) + 0x9ba
    (XUL) mozilla::detail::RunnableFunction<mozilla::TaskController::TaskController()::$_0>::Run() + 0x3e
    (XUL) nsThread::ProcessNextEvent(bool, bool*) + 0x43a
    (XUL) NS_ProcessPendingEvents(nsIThread*, unsigned int) + 0x7e
    (XUL) nsBaseAppShell::NativeEventCallback() + 0x73
    (XUL) nsAppShell::ProcessGeckoEvents(void*) + 0xa3
    (CoreFoundation) __CFRUNLOOP_IS_CALLING_OUT_TO_A_SOURCE0_PERFORM_FUNCTION__ + 0x11
    (CoreFoundation) __CFRunLoopDoSource0 + 0x9d
    (CoreFoundation) __CFRunLoopDoSources0 + 0xd9
    (CoreFoundation) __CFRunLoopRun + 0x394
    (CoreFoundation) CFRunLoopRunSpecific + 0x230
    (hook.dylib) Hooked_CFRunLoopRunSpecific(__CFRunLoop*, __CFString const*, double, unsigned char) + 0xd1
    (HIToolbox) RunCurrentEventLoopInMode + 0x124
    (hook.dylib) Hooked_RunCurrentEventLoopInMode(__CFString const*, signed char) + 0x2d
    (HIToolbox) ReceiveNextEventCommon + 0xc7
    (HIToolbox) _BlockUntilNextEventMatchingListInModeWithFilter + 0x40
    (AppKit) _DPSNextEvent + 0x35a
    (AppKit) -[NSApplication(NSEvent) _nextEventMatchingEventMask:untilDate:inMode:dequeue:] + 0x4be
    (XUL) -[GeckoNSApplication nextEventMatchingMask:untilDate:inMode:dequeue:] + 0x7c
    (AppKit) _NSHLTBMenuEventProc + 0xd5
    (HIToolbox) IsUserStillTracking(MenuSelectData*, unsigned char*) + 0xc2
    (HIToolbox) TrackMenuCommon(MenuSelectData&, unsigned char*, SelectionData*, MenuResult*, MenuResult*) + 0x550
    (HIToolbox) MenuSelectCore(MenuData*, Point, double, unsigned int, OpaqueMenuRef**, unsigned short*) + 0x18d
    (HIToolbox) _HandleMenuSelection2 + 0x1cb
    (AppKit) _NSHandleCarbonMenuEvent + 0xd7
    (AppKit) _DPSEventHandledByCarbon + 0x36
    (AppKit) -[NSApplication(NSEvent) _nextEventMatchingEventMask:untilDate:inMode:dequeue:] + 0xb67
    (XUL) -[GeckoNSApplication nextEventMatchingMask:untilDate:inMode:dequeue:] + 0x7c
    (AppKit) -[NSApplication run] + 0x24a
    (XUL) -[GeckoNSApplication run] + 0x42
    (XUL) nsAppShell::Run() + 0x13e
    (XUL) nsAppStartup::Run() + 0x3c
    (XUL) XREMain::XRE_mainRun() + 0x751
    (XUL) XREMain::XRE_main(int, char**, mozilla::BootstrapConfig const&) + 0x34d
    (XUL) XRE_main(int, char**, mozilla::BootstrapConfig const&) + 0xb1
    (firefox) main + 0x22f
    (dyld) start + 0x768

My STR from comment #61 don't work in Safari or Chrome -- which may mean that this bug doesn't effect them.

__NSArrayM is the concrete implementation (in the CoreFoundation framework) of the NSMutableArray class.

My STR from comment #61 don't work on macOS Sonoma (14.7.1) and Sequoia (15.1.1), because the Firefox menu doesn't appear until all the Firefox windows are displayed. This may be why this bug's crashes only happen on macOS Ventura and below. There may be a clue here about how to work around them.

I discovered (by hacking in my hook library) that delivering NSWindowDidResignMainNotification and NSWindowDidBecomeMainNotification notifications asynchronously (to WindowDelegate) makes the crashes go away. I'll try to find out if there's a high level way to do this, suitable for use in Mozilla code.

For what it's worth, the following variant of Stephen's last two patches fixes this bug in my tests:

https://treeherder.mozilla.org/jobs?repo=try&revision=fd761af8f1f0450da26d4e912adeb1b23377c4c3
https://hg.mozilla.org/try/rev/d087c3bc96947e9088f0c46bc653858dfdcac7f8

I backed out his patch from comment #26 and used only part of his patch from comment #54. I tested with a local build, and I'll test again with this tryserver build when/if it finishes. Edit: The tryserver build also fixes this bug's crashes. I tested on macOS 13.7.1.

See comment #71 below for another variant.

(Following up comment #69)

Here's another variant of Stephen's last two patches:

https://treeherder.mozilla.org/jobs?repo=try&revision=468e76cc2d076d987cc8065b5c4022f65619bed0
https://hg.mozilla.org/try/rev/dfef7703281fe2f9d370ac619070d02b7f27df43
https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/TvCEPXixShyJsCPE4oPFqQ/runs/0/artifacts/public/build/target.dmg

This time I've guaranteed that geckoWindow and hiddenWindowMenuBar will still be alive in the "lambdas" to which they are passed. I tried doing this with NS_DispatchToCurrentThread(NS_NewRunnableFunction()), but was stymied by bug 1421435. So I ended up using an Objective-C block instead.

As best I can tell std::function() does nothing with the parameters it "captures". So we need to use AddRef() and Release() ourselves. But bug 1421435 prevents them from working in a std::function() lambda. Apple's documentation on block usage is unclear, and partly just wrong. Local variables outside the block that are used in the block, and which don't have the __block storage type, are copied by value. Looking at XUL in a disassembler shows this happens with both geckoWindow and hiddenWindowMenuBar. So once again we must guarantee their survival ourselves. But Release() works just fine in a block's lambda (or closure).

Steven, can you attach the patch you were trying which ran into bug 1421435? It should be possible to write the lambda in such a way that it captures a RefPtr by value, maybe with an extra RefPtr<nsCocoaWindow> self = this;.

Markus, you can see my failed patches here:

https://treeherder.mozilla.org/jobs?repo=try&revision=4163567c1e5f1100109adbf2c65d742b56b23d7d
https://treeherder.mozilla.org/jobs?repo=try&revision=b166f4aeb35e071f09697f38cf91b310c6ac2e7b
https://treeherder.mozilla.org/jobs?repo=try&revision=17922e632521f24ebe4f3a66368b87d18d417943

None of them failed in local builds, which all worked fine. I guess the tryserver does extra "linting" (and probably also the servers that do "official" builds).

I would expect the following to work:

  RefPtr<nsCocoaWindow> geckoWindow = [windowDelegate geckoWidget];
  if (geckoWindow) {
    // We dispatch this in order to prevent crashes when macOS is actively
    // enumerating the menu items in `NSApp.mainMenu`. Use `geckoWindow`
    // because with it we can handle (in the lambda) the case where
    // `nsCocoaWindow::mWindow` has become nil.
    NS_DispatchToCurrentThread(NS_NewRunnableFunction(
        "PaintMenuBar", [geckoWindow]() mutable -> void {
          // `-[[WindowDelegate paintMenubarForWindow:]` does nothing if
          // nsCocoaWindow::GetCocoaWindow() returns nil.
          [WindowDelegate paintMenubarForWindow:gw->GetCocoaWindow()];
        }));
  }

The lambda stores a field of type RefPtr<nsCocoaWindow>, so when the lambda is done running and gets destroyed, the RefPtr's destructor calls Release.

(In reply to Steven Michaud [:smichaud] (Retired) from comment #73)

None of them failed in local builds, which all worked fine. I guess the tryserver does extra "linting" (and probably also the servers that do "official" builds).

I think this linting can be run locally with ./mach static-analysis but I haven't tried it myself. https://firefox-source-docs.mozilla.org/code-quality/static-analysis/existing.html#mach-static-analysis

(In reply to Markus Stange [:mstange] from comment #74)

I would expect the following to work:

  RefPtr<nsCocoaWindow> geckoWindow = [windowDelegate geckoWidget];
  if (geckoWindow) {
    // We dispatch this in order to prevent crashes when macOS is actively
    // enumerating the menu items in `NSApp.mainMenu`. Use `geckoWindow`
    // because with it we can handle (in the lambda) the case where
    // `nsCocoaWindow::mWindow` has become nil.
    NS_DispatchToCurrentThread(NS_NewRunnableFunction(
        "PaintMenuBar", [geckoWindow]() mutable -> void {
          // `-[[WindowDelegate paintMenubarForWindow:]` does nothing if
          // nsCocoaWindow::GetCocoaWindow() returns nil.
          [WindowDelegate paintMenubarForWindow:gw->GetCocoaWindow()];
        }));
  }

The lambda stores a field of type RefPtr<nsCocoaWindow>, so when the lambda is done running and gets destroyed, the RefPtr's destructor calls Release.

I'll try this tomorrow. I'll also check XUL in the disassembler to make sure the lambda really does store (locally) a field of type RefPtr<nsCocoaWindow>.

It's taken me several days to verify it, and to understand its internal plumbing, but you're right, Markus -- your solution "works" (or one very much like it). In the following I'll work through what really happens when the following code gets compiled and runs. I discovered this by looking at the (Intel) machine code and playing with my HookCase hook library. I also made the machine code more readable by doing a customized debug build (actually an opt build with __attribute__((noinline)) added to lots of functions). What follows is very strange. I assume the compiler must have generated the code that actually runs. There's no way it could have passed muster with even the most lenient reviewer.

Here's a block of code from -[WindowDelegate windowDidBecomeMain:]:

  WindowDelegate* windowDelegate = nil;
  if ([[window delegate] isKindOfClass:[WindowDelegate class]]) {
    windowDelegate = (WindowDelegate*)[window delegate];
  }
  printf("-[WindowDelegate windowDidBecomeMain:](1)\n");
  RefPtr<nsCocoaWindow> geckoWindow = [windowDelegate geckoWidget];
  printf("-[WindowDelegate windowDidBecomeMain:](2)\n");
  if (geckoWindow) {
    // We dispatch this in order to prevent crashes when macOS is actively
    // enumerating the menu items in `NSApp.mainMenu`. Use `geckoWindow`
    // because with it we can handle (in the lambda) the case where
    // `nsCocoaWindow::mWindow` has become nil.
    NS_DispatchToCurrentThread(
        NS_NewRunnableFunction("PaintMenuBar", [gw = geckoWindow]() -> void {
          // `-[WindowDelegate paintMenubarForWindow:]` does nothing if
          // nsCocoaWindow::GetCocoaWindow() returns nil.
          printf("-[WindowDelegate windowDidBecomeMain:] lambda(1)\n");
          [WindowDelegate paintMenubarForWindow:gw->GetCocoaWindow()];
          printf("-[WindowDelegate windowDidBecomeMain:] lambda(2)\n");
        }));
  }

NS_NewRunnableFunction() is templated, and is defined as follows:

template <typename Function>
already_AddRefed<mozilla::Runnable> NS_NewRunnableFunction(
    const char* aName, Function&& aFunction) {
  // We store a non-reference in RunnableFunction, but still forward aFunction
  // to move if possible.
  return do_AddRef(new mozilla::detail::RunnableFunctionImpl<Function>(
      aName, std::forward<Function>(aFunction)));
}

The (also templated) RunnableFunction class is defined as follows:

template <typename StoredFunction>
class RunnableFunction : public Runnable {
 public:
  template <typename F>
  explicit RunnableFunction(const char* aName, F&& aFunction)
      : Runnable(aName), mFunction(std::forward<F>(aFunction)) {}

  NS_IMETHOD Run() override {
    static_assert(std::is_void_v<decltype(mFunction())>,
                  "The lambda must return void!");
    mFunction();
    return NS_OK;
  }

 private:
  StoredFunction mFunction;
};

That they're templated means the compiler replaces them with functions suitable to a specific "type". The replacements only barely conform to the templates.

One of the questions I had going into this was how the capture block ([gw = geckoWindow]) is interpreted. Is it read as code? The answer is "no". If it were, gw would be a nsCocoaWindow*. Instead it's still a RefPtr<nsCocoaWindow>. The first thing the compiler does is make a copy of gw (aka geckoWindow) using a copy constructor (RefPtr<nsCocoaWindow>::RefPtr(RefPtr<nsCocoaWindow> const&)), which also AddRef()s the original nsCocoaWindow object. This second RefPtr<nsCocoaWindow> object gets passed as the only parameter to the specialized replacement for NS_NewRunnableFunction(). Yes, there's only one parameter, which doesn't match the types of either of the two parameters in the template.

The specialized replacement for NS_NewRunnableFunction() first allocates a RunnableFunction object and initializes it with a constructor whose template form would be RunnableFunction::RunnableFunction(char const*, F&& aFunction). Once again, though, it passes only one parameter (RefPtr<nsCocoaWindow>), which doesn't match either of the template's two parameters. This constructor already knows aName ("PaintMenuBar"), which it stores in mName. It feeds the RefPtr<nsCocoaWindow> parameter to yet another constructor (RefPtr<nsCocoaWindow>::RefPtr(RefPtr<nsCocoaWindow> const&&)), which makes another copy and forgets the original, and doesn't do any AddRef()s. The copy gets stored in the RunnableFunction object's mFunction field -- which of course is supposed to have entirely different contents.

Next the specialized replacement for NS_NewRunnableFunction() creates a RefPtr<RunnableFunction> object by calling RefPtr<RunnableFunction>::RefPtr(RunnableFunction*). Then it calls RefPtr<RunnableFunction>::forget() and RefPtr<RunnableFunction>::~RefPtr(), and returns the the RunnableFunction object returned by forget().

NS_DispatchToCurrentThread() dispatches the RunnableFunction returned by NS_NewRunnableFunction() to run later on the current thread (the main thread). RunnableFunction::Run() does what you'd expect -- no weirdness here. Later RunnableFunction::~RunnableFunction() gets called, which calls the destructor for the RunnableFunction's copy of RefPtr<nsCocoaWindow>, which calls Release() on its mRawPtr. Before this happens the original RefPtr<nsCocoaWindow> (geckoWindow) goes out of scope and gets destroyed, at which time it calls Release() on its own mRawPtr.

The calls to AddRef() and Release() on the original nsCocoaWindow object are balanced -- two of each. So there are no UAFs of this object, and it doesn't get leaked ... at least not with the code I've discussed here.

At times like this I really miss C. Yes, C++ is more powerful than C, and usually more efficient. But C is like fresh, clean water compared to the sludge we see here. What's most irritating, though, is how hard it was it was to dig up the sludge. C is simple and transparent by comparison.

Followup in comment #80.

(Following up comment #77)

I've noticed that if the "function" passed to NS_NewRunnableFunction() (in the original source code) has an empty capture block ([]), nothing at all is passed to the specialized NS_NewRunnableFunction() in the code generated by the compiler. The same is true if the capture block only contains non-object values. And if the capture block has more than one object value, still only one parameter is passed to the specialized NS_NewRunnableFunction() -- but this time it's an array of object values. Only if an object value is a RefPtr<> is a copy constructor called on it, with the copy stored in the returned RunnableFunction object. Otherwise no copy is made, and it's stored as is.

So in the machine code that runs when NS_NewRunnableFunction() is called, the template's parameter list (const char* aName, Function&& aFunction) is never accurate. Though, to be fair, its template does describe how it should be called in source code.

I'm going to try to find the source for the code that performs these code generations. The first place to look, I suppose, is in the source code for LLVM's clang. But I have a nagging suspicion that it's Mozilla code doing this, after all. It knows about RefPtr<>, and uses moz_xmalloc().

Any ideas, Markus?

Edit: There's evidence clang knows about RefPtr<>, at least informally.

I don't share your distrust of the compiler - people store objects with destructors on lambdas in C++ all the time all over the world.

Also, if you have a RefPtr<nsCocoaWindow> geckoWindow, then writing auto gw = geckoView; will make gw also be of type RefPtr<nsCocoaWindow>.

This is really quite off topic for this bug, but I'll go into your other question too:
For answering questions about specific compiler code generation, I recommend making a reduced testcase and playing around on godbolt.org, in particular with godbolt's Opt Pipeline Viewer. I made a testcase which shows the elimination of the runnable name string argument: https://godbolt.org/z/1r5YnP9fK
You can see in the optimization panel that the %aName argument of the templated RunnableFunction constructor goes away during the DeadArgumentEliminationPass.
I think the compiler can often optimize templated functions more aggressively than non-templated functions - including making their call ABI not match the original signature. Probably because it's easier to prove that the function can't be called from outside the compilation unit.

Thanks for your information, Markus.

You're quite wrong that my inquiry isn't relevant. You seemed to be saying RefPtr<T> objects "just work" with lambdas. This turns out to be true, but there wasn't any good reason for thinking so. Now I've found one ... or at least have begun to do so.

I now agree that, in principle, code that generates code from templates needn't stick precisely to the template if that doesn't "cause trouble". I had never seen this before, and I'm afraid I over-reacted. What I say about C versus C++ still stands, though. C++ hides all kinds of things in the name of making your code "easier". That approach only stays "easy" until you need to look under the hood.

Attached file Lambda testcase (obsolete) —

Here's a fuller version of Markus's testcase from comment #81. It pays close attention to what happens as a lambda "captures" its arguments.

And here's a Makefile to build it:

CC=/usr/bin/clang

lambda-testcase : lambda-testcase.cpp
	$(CC) -o lambda-testcase -std=c++11 -O2 -fno-exceptions -lstdc++ lambda-testcase.cpp

clean :
	rm lambda-testcase

This shows that compiler-generated code doesn't perform any special handling for raw pointers, even if they point to objects that have copy constructors. It just passes them by value, like it does with int values. It does do special handling for RefPtr<> and std::shared_ptr<> objects, to keep their referents alive until the lambda has finished running.

This special handling is a bit fragile, though. If you define BROKEN in my testcase it stops working for both kinds of smart pointers.

The reason clang knows about RefPtr<> is that WebKit also uses it.

I still haven't been able to find the code (in clang or gcc) that generates "lambda glue". Though I'm pretty sure it happens in the "Semantic Analysis" stage (after "Parsing" and before "Code Generation"), I haven't been able to find a way to log "Semantic Analysis".

Thanks, Markus, for showing me the https://godbolt.org/ site. It didn't help me crack this particular puzzle. But it seems very useful for all kinds of other things.

I haven't looked in detail but my guess is that you found the reason for why we have a std::remove_reference_t here: https://searchfox.org/mozilla-central/rev/a5e04104f9c6cfd69f9c855f0b0f665afd444c74/xpcom/threads/nsThreadUtils.h#556-561
I didn't port that part over faithfully to the reduced testcase.

You're right Markus ... I think.

I added the following block to my testcase (also replacing its implementation of NS_NewRunnableFunction()):

template <class T>
using remove_reference_t = typename std::remove_reference<T>::type;

template <class Function>
using RunnableFunctionImpl =
    class RunnableFunction<remove_reference_t<Function>>;

template <typename Function>
already_AddRefed<RunnableFunction<Function>>
NS_NewRunnableFunction(const char* aName, Function&& aFunction) {
    return do_AddRef(new RunnableFunctionImpl<Function>(
        aName, std::forward<Function>(aFunction)));
}

With this change my testcase compiles (and runs) fine with BROKEN undefined. But with BROKEN defined I get the following compilation errors:

lambda-testcase.cpp:268:26: error: no matching constructor for initialization of 'RunnableFunctionImpl<(lambda at lambda-testcase.cpp:295:13) &>' (aka 'RunnableFunction<(lambda at lambda-testcase.cpp:295:13)>')
    return do_AddRef(new RunnableFunctionImpl<Function>(
                         ^
lambda-testcase.cpp:302:20: note: in instantiation of function template specialization 'NS_NewRunnableFunction<(lambda at lambda-testcase.cpp:295:13) &>' requested here
        runnable = NS_NewRunnableFunction("MyString", lambda);
                   ^
lambda-testcase.cpp:229:31: note: candidate constructor not viable: expects an rvalue for 2nd argument
    __attribute__((noinline)) RunnableFunction(const char* aName, F&& aFunction)
                              ^
lambda-testcase.cpp:228:8: note: candidate constructor (the implicit copy constructor) not viable: requires 1 argument, but 2 were provided
struct RunnableFunction : public Runnable {
       ^
1 error generated.
make: *** [lambda-testcase] Error 1

The compilation errors are grotesquely misleading. But with this change, at least one way to break the compiler's "lambda glue" refuses to compile.

As I mentioned above in comment #80, the "lambda glue" replacement for the call to NS_NewRunnableFunction() actually takes just one parameter -- an array of captured object pointers (smart or dumb). So the RunnableFunctionImpl() rigamarole seems intended to allow only altered calls to NS_NewRunnableFunction() to be compilable. I suspect that NS_NewRunnableFunction("MyString", lambda) wasn't altered (when it was still compilable), and this is what broke the compiler's "lambda glue". Edit: I've confirmed this by looking at my "broken" testcase in Hopper Disassembler. The assembly code is much more readable there than at https://godbolt.org/.

I suspect that NS_NewRunnableFunction("MyString", lambda) wasn't altered

It was altered, but so that it took no parameters. So it still didn't match the signature of a "correctly" altered NS_NewRunnableFunction().

Attached file Revised lambda testcase (obsolete) —
Attachment #9443311 - Attachment is obsolete: true
Attached file Revised lambda testcase (obsolete) —
Attachment #9443334 - Attachment is obsolete: true

(In reply to Steven Michaud [:smichaud] (Retired) from comment #85)

As I mentioned above in comment #80, the "lambda glue" replacement for the call to NS_NewRunnableFunction() actually takes just one parameter -- an array of captured object pointers (smart or dumb). So the RunnableFunctionImpl() rigamarole seems intended to allow only altered calls to NS_NewRunnableFunction() to be compilable. I suspect that NS_NewRunnableFunction("MyString", lambda) wasn't altered (when it was still compilable), and this is what broke the compiler's "lambda glue". Edit: I've confirmed this by looking at my "broken" testcase in Hopper Disassembler. The assembly code is much more readable there than at https://godbolt.org/.

I suspect that NS_NewRunnableFunction("MyString", lambda) wasn't altered

It was altered, but so that it took no parameters. So it still didn't match the signature of a "correctly" altered NS_NewRunnableFunction().

This is wrong :-( The compiler errors happen at the "Parsing" stage -- before the "Semantic Analysis" stage. At this point the "lambda glue" hasn't yet been created, and NS_NewRunnableFunction() still has two parameters. The "RunnableFunctionImpl() rigamarole" is just a way to make the compiler fussier about the second parameter, aFunction. Without it the compiler accepts both [T = (lambda at lambda-testcase.cpp:nnn:nn) &] (the "broken" case) and [T = (lambda at lambda-testcase.cpp:nnn:nn) &&] (the non-"broken" case). With it the compiler only accepts [T = (lambda at lambda-testcase.cpp:nnn:nn) &&].

And so the compiler errors aren't misleading.

Attachment #9443341 - Attachment is obsolete: true

Default output of lambda-testcase:

Created RefCountedObject 1
Ran AddRef() on RefCountedObject 1, mRefCnt 1
RefPtr<[T = RefCountedObject *]>::RefPtr(T* aRef) [construct from raw pointer]
Created RefCountedObject 2
Created RefCountedObject 3
*** Creating lambda ***
Ran AddRef() on RefCountedObject 1, mRefCnt 2
RefPtr<[T = RefCountedObject *]>::RefPtr(const RefPtr<T>& aRefPtr) [copy constructor]
Running NS_NewRunnableFunction(): aFunction [T = (lambda at lambda-testcase.cpp:326:13) &&], RunnableFunctionImpl<Function> [T = RunnableFunction<(lambda at lambda-testcase.cpp:326:13)>]
Ran AddRef() on RefCountedObject 1, mRefCnt 3
RefPtr<[T = RefCountedObject *]>::RefPtr(const RefPtr<T>& aRefPtr) [copy constructor]
*** Created function ***
Ran AddRef() on [T = RunnableFunction<(lambda at lambda-testcase.cpp:326:13)> *], mRefCnt 1
RefPtr<[T = RunnableFunction<(lambda at lambda-testcase.cpp:326:13)> *]>::RefPtr(T* aRef) [construct from raw pointer]
RefPtr<[T = RunnableFunction<(lambda at lambda-testcase.cpp:326:13)> *]>::forget()
Deleting RefPtr<[T = RunnableFunction<(lambda at lambda-testcase.cpp:326:13)>]>
RefPtr<[T = RunnableFunction<(lambda at lambda-testcase.cpp:326:13)> *]>::RefPtr<T>& operator=(already_AddRefed<I>&& aRef) [move assignment]
Deleting RefPtr<[T = RefCountedObject *]>
Running Release() on RefCountedObject 1, mRefCnt 3
Deleting RefPtr<[T = RefCountedObject *]>
Running Release() on RefCountedObject 1, mRefCnt 2
*** Now outside the scope of the original RefPtr<RefCountedObject> ***
Ran AddRef() on [T = RunnableFunction<(lambda at lambda-testcase.cpp:326:13)> *], mRefCnt 2
RefPtr<[T = RunnableFunction<(lambda at lambda-testcase.cpp:326:13)> *]>::RefPtr(const RefPtr<T>& aRefPtr) [copy constructor]
*** Running lambda *** smartPtr [T = const RefPtr<RefCountedObject> &]([T = RefCountedObject *]:1), sharedPtr [T = const std::shared_ptr<RefCountedObject> &]([T = RefCountedObject *]:2), dumbPtr [T = RefCountedObject *const &]:3, answer 42
Deleting RefPtr<[T = RunnableFunction<(lambda at lambda-testcase.cpp:326:13)> *]>
Running Release() on [T = RunnableFunction<(lambda at lambda-testcase.cpp:326:13)> *], mRefCnt 2
Deleting RefPtr<[T = RunnableFunction<(lambda at lambda-testcase.cpp:326:13)> *]>
Running Release() on [T = RunnableFunction<(lambda at lambda-testcase.cpp:326:13)> *], mRefCnt 1
*** Deleting function ***
Deleting RefCountedObject 2
Deleting RefPtr<[T = RefCountedObject *]>
Running Release() on RefCountedObject 1, mRefCnt 1
Deleting RefCountedObject 1

(In reply to Steven Michaud [:smichaud] (Retired) from comment #83)

I still haven't been able to find the code (in clang or gcc) that generates "lambda glue". Though I'm pretty sure it happens in the "Semantic Analysis" stage (after "Parsing" and before "Code Generation"), I haven't been able to find a way to log "Semantic Analysis".

Turns out this is wrong, or at least not the full story. The AST ("Abstract Syntax Tree", generated using clang -Xclang -ast-dump -fsyntax-only -std=c++11 lambda-testcase.cpp) has a LambdaExpr object. But its NS_NewRunnableFunction() still has the two parameters you see in the source code. The same is true of the IR ("Intermediate Representation", aka "LLVM assembly language") and the assembly language outputs (and of the machine code), but only if you leave optimization off.

So where does the final implementation of "lamda glue" happen? Probably somewhere in the optimization pipeline, during the "Code Generation" stage.

clang -S -emit-llvm -std=c++11 lambda-testcase.cpp outputs IR without optimization.

I used clang -S -emit-llvm -std=c++11 -O2 -fno-exceptions lambda-testcase.cpp to output IR with optimization.

clang -S -std=c++11 lambda-testcase.cpp outputs assembly language without optimization.

I used clang -S -std=c++11 -O2 -fno-exceptions lambda-testcase.cpp to output assembly language with optimization.

You can see all optimization stages on godbolt.org. Above the assembly view, click Add new and select Opt pipeline. (It's available if you've picked a recent clang compiler in the compiler dropdown.)
Then you can click through the list of passes in the optimization panel, and find which pass made the change you're interested in. The passes that applied have green text in the list.

Ah, with your testcase I can't get that panel to work - it always hits a gateway timeout. And running compiler-explorer locally appears to require Linux.

Pushed by spohl@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/875ab4713f0e Fix four more possible crashes when attempting to set the app's mainMenu on macOS. r=mac-reviewers,mstange
Backout by agoloman@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/9b459ce6809c Backed out changeset 875ab4713f0e requested by developer. CLOSED TREE
Pushed by spohl@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/1543feed0652 Fix four more possible crashes when attempting to set the app's mainMenu on macOS. r=mac-reviewers,mstange
Attached file Lambda Glue generated by clang (obsolete) —

I finally figured out the best way to find my "lambda glue" is to look at binaries compiled without any optimization. The weirdest things I noted above (in comment #77 and comment #80) are actually due to ordinary optimizations, which have nothing to do (per se) with the lambda glue.

This file shows the alterations made for my testcase. They're actually quite reasonable. I still wish they hadn't been so hard to find. But it took me a while for me to realize that the optimizations were standing in my way, so it was partly my fault.

Status: NEW → RESOLVED
Closed: 10 months ago8 months ago
Resolution: --- → FIXED

Objective-C has a well defined ABI for its implementation of blocks. But there doesn't seem to be an equivalent for C++11 lambdas.

However the clang and gcc ABIs (for example) seem to differ only in small details. I got the same output at https://godbolt.org/ compiling with both clang and gcc.

My STR from comment #61 doesn't "work" on today's mozilla-central nightly -- the first with Stephen's latest patch. It doesn't crash.

Target Milestone: 134 Branch → 135 Branch

Perfherder has detected a talos performance change from push 1543feed065207594a549e9a0d2738e0c39e9d2c.

Improvements:

Ratio Test Platform Options Absolute values (old vs new)
12% twinopen ext+twinopen:twinopen.html macosx1470-64-shippable e10s fission stylo webrender-sw 144.96 -> 127.23
11% twinopen ext+twinopen:twinopen.html macosx1470-64-shippable e10s fission stylo webrender 196.92 -> 176.07
8% twinopen ext+twinopen:twinopen.html macosx1470-64-shippable e10s fission stylo webrender 196.39 -> 180.68

Details of the alert can be found in the alert summary, including links to graphs and comparisons for each of the affected tests.

If you need the profiling jobs you can trigger them yourself from treeherder job view or ask a sheriff to do that for you.

You can run these tests on try with ./mach try perf --alert 43143

For more information on performance sheriffing please see our FAQ.

You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: