Closed Bug 1707598 Opened 3 years ago Closed 3 years ago

Intermittent shutdown timeouts in CI with native menus

Categories

(Core :: Widget: Cocoa, defect, P1)

All
macOS
defect

Tracking

()

RESOLVED FIXED
90 Branch
Tracking Status
firefox89 --- fixed
firefox90 --- fixed

People

(Reporter: mstange, Assigned: mstange)

References

Details

(Whiteboard: [proton-context-menus][mac:mr1] [proton-uplift])

Attachments

(7 files)

Now that native context menus are enabled on macOS in Nightly (bug 1700679), we are seeing an increase in jobs that are terminated by [@ mozilla::(anonymous namespace)::RunWatchdog(void*)] after a shutdown timeout, such as this one.
These occurrences have NativeMenuMac::OpenMenu on the stack. This indicates that the nested event loop for a native menu is never exited.
(Bug 1358898 is the general bug about shutdown timeouts in CI.)

I debugged this a little by pushing logging instrumentation to try, and it seems that this happens if we open another native menu before the previous native menu was able to exit its nested event loop.

Could this be related to bug 1705365 ?

(In reply to Jens Stutte [:jstutte] from comment #1)

Could this be related to bug 1705365 ?

I don't think so - that bug was filed before native context menus were enabled, and it doesn't have NativeMenuMac::OpenMenu on the stack.

On CI, where we open and close menu items in quick succession, we sometimes got
into a state where the new menu was opened while we were still in the old menu's
nested event loop. So we had the following sequence of events:

 - old menu +[NSMenu popUpContextMenu:withEvent:forView:]
   - nested event loop for old menu
     - old menu -[NSMenu cancelTrackingWithoutAnimation]
     - new menu +[NSMenu popUpContextMenu:withEvent:forView:]
       - nested event loop for new menu
         - new menu -[NSMenu cancelTrackingWithoutAnimation]
     - new menu's event loop is exited, but old menu's event loop remains on the
       stack
     - shutdown hang here

MOZMenuOpeningCoordinator makes sure that +[NSMenu popUpContextMenu:withEvent:forView:]
is always called in sequence, never in a nested fashion.

Pushed by mstange@themasta.com:
https://hg.mozilla.org/integration/autoland/rev/9577e2feecd1
When opening a new menu just after closing an old menu, make sure we exit the nested event loop for the old NSMenu before we ask the new NSMenu to open. r=harry
Priority: -- → P1

I cannot reproduce the browser_library_bookmark_pages.js timeout locally. I even tried running the entire directory.
Try push with debugging instrumentation: https://treeherder.mozilla.org/jobs?repo=try&revision=975e0a40dfd4576d129d5abe53539c052e169663

Something very strange is happening in that test. Profile: https://share.firefox.dev/3sZhuxZ

The first menu opens. While it is open, i.e. while we're in the tracking event loop for the first menu, we call activateItem. activateItem posts a runnable for the menuitem command, and then calls -[NSMenu cancelTrackingWithoutAnimation]. Then the runnable runs before we we had a chance to exit the nested event loop for the menu. (This is the part that happens on try but not on my machine - on my machine, we exit the nested event loop before we run the runnable.) During the command event, we open a sheet window and spawn another nested event loop, under AppWindow::ShowModal. So now we have two nested event loops on the stack: the event loop for the first menu (which already has had cancelTrackingWithoutAnimation called on it) and inside of that the modal event loop for the sheet window. Then, at some point, we close the sheet window and exit the modal event loop. At this point, you'd expect that we'd leave the nested event loop for the first menu, due to the cancelTrackingWithoutAnimation call earlier! But we don't. The menu's event loop stays on the stack. I'm not sure why. We now try to open the second menu. However, MOZMenuOpeningCoordinator delays this opening because the first menu's event loop is still on the stack. And then we go idle, stuck in the nested event loop of the first menu, and the second menu never opens.
It seems that our request to close the first menu somehow got lost. Maybe cancelTrackingWithoutAnimation doesn't expect us to launch another nested event loop before returning back into the menu's event loop?

I'll try to make it so that the command event from activateItem only runs after we exit the menu's event loop. That should avoid this bug.

Flags: needinfo?(mstange.moz)

That seems to work.

Otherwise the singleton would be leaked. It's not a lot of bytes, but a future
patch in this series is going to add an nsDeque member to this class, and
nsDeque has leak detection which would cause test failures in debug builds if
we didn't clean up MOZMenuOpeningCoordinator on shutdown.

Depends on D113373

It doesn't need to be cancelable - the Run() method already nulls out mMenuItem,
so once Run() has been called once, it won't do anything on subsequent invocations.
This patch also improves safety a bit, by moving the pending command runnables
out of the mPendingCommandRunnables field before running them, so that the Run()
method has no way of invalidating the array that's being iterated over.

Depends on D113731

This is a robust way to queue runnables which won't run until after the nested
event loop of the open NSMenu has been exited.
NS_DispatchToCurrentThread is not a reliable way to achieve that, because it
puts the runnables into the Gecko event loop, and there's nothing that prevents
us from potentially visiting the Gecko event loop inside the NSMenu's event loop,
even after cancelTracking(WithoutAnimation) has been called.

Depends on D113732

This avoids bugs where a menu close request (cancelTrackingWithoutAnimation) would be dropped
if the command event did certain unexpected things. I'm not 100% sure what the exact
circumstances for the bad behavior were (see bug 1707598 comment 7 for details), but
this seems like the right thing to do anyway.

Depends on D113733

It looks like these patches now cause another test failure, in 4 out of 5 runs, in bc7 browser/components/extensions/test/browser/browser_ext_browserAction_contextMenu.js. I'll debug that one now. It doesn't reproduce locally for me.

This also makes sure we call ActivateItemAfterMenuClosing on the submenu that
immediately contains the activated menu item, and not on the root menu.
This usually doesn't make a difference, except in the case where something
calls FlushMenuClosedRunnable afterwards; in that case, before this patch, we
might accidentally fired the popuphidden event for the submenu before the
command event for the clicked item.

Depends on D113733

In the next patch, it will make different decisions for how to dispatch the
MenuClosedAsync event based on this flag.

Depends on D113987

(In reply to Markus Stange [:mstange] from comment #14)

It looks like these patches now cause another test failure, in 4 out of 5 runs, in bc7 browser/components/extensions/test/browser/browser_ext_browserAction_contextMenu.js. I'll debug that one now. It doesn't reproduce locally for me.

This was happening because of changed microtask ordering; the test was expecting telemetry events after a popuphidden event, and one of those telemetry events was inserted from a microtask after the command event. I've addressed this by making -[MOZMenuOpeningCoordinator _runMenu] dispatch the "after menu closed runnables" to the regular event loop instead of calling ->Run() on them directly; this way, the Gecko event loop will make sure there are appropriate microtask checkpoints between them.

There was another test failure, in devtools/client/framework/test/browser_menu_api.js - this one was caused by a missing / delayed popuphidden event for a submenu which was closed with openMenu(false); the popuphidden event was delayed until after the entire menu closed and exited its event loop, but obviously that's too late if we're only closing a submenu and leaving the rest of the menu open. I've addressed this bug with the newly-added patches which let MenuClosed() know whether the entire menu is closing, so that it can pick the appropriate way to dispatch the MenuClosedAsync runnable.

Pushed by mstange@themasta.com:
https://hg.mozilla.org/integration/autoland/rev/8f867acbaa94
When opening a new menu just after closing an old menu, make sure we exit the nested event loop for the old NSMenu before we ask the new NSMenu to open. r=harry
https://hg.mozilla.org/integration/autoland/rev/50adb1fd03f3
Release MOZMenuOpeningCoordinator on shutdown. r=harry
https://hg.mozilla.org/integration/autoland/rev/be5032506244
Make DoCommandRunnable a regular runnable, not a cancelable runnable. r=harry
https://hg.mozilla.org/integration/autoland/rev/d0c2e0e4b6a5
Add -[MOZMenuOpeningCoordinator runAfterMenuClosed:]. r=harry
https://hg.mozilla.org/integration/autoland/rev/5df0f67ca1c6
Rename ActivateItemAndClose and call cancelTrackingWithoutAnimation in NativeMenuMac::ActivateItem. r=harry
https://hg.mozilla.org/integration/autoland/rev/058fd48682c5
Let MenuClosed know when it's called after ActivateItem. r=harry
https://hg.mozilla.org/integration/autoland/rev/2cd944025913
Make sure that command events from activateItem only run once the nested event loop for the NSMenu has been exited. r=harry
Whiteboard: [proton-context-menus][mac:mr1] → [proton-context-menus][mac:mr1] [proton-uplift]
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: