Open Bug 1705365 Opened 3 years ago Updated 27 days ago

A Firefox crash occurs if 3 or more “Close tabs” modals triggered from the macOS “Dock” are confirmed one after another

Categories

(Toolkit :: Content Prompts, defect)

Desktop
macOS
defect

Tracking

()

REOPENED
Tracking Status
firefox-esr78 --- unaffected
firefox-esr91 --- affected
firefox87 --- unaffected
firefox88 --- unaffected
firefox89 --- wontfix
firefox90 --- wontfix
firefox91 --- wontfix
firefox92 --- affected
firefox93 --- affected

People

(Reporter: romartin, Unassigned, NeedInfo)

References

(Blocks 1 open bug)

Details

(Keywords: crash, Whiteboard: [proton-modals])

Attachments

(1 file)

[Affected Versions]:

  • Firefox Nightly 89.0a1 (Build ID: 20210414160838)

[Affected Platforms]:

  • macOS 11.2.3
  • macOS 10.15.7

[Prerequisites]:

  • Have a Firefox profile created.

[Steps to reproduce]:

  1. Start the Firefox Nightly browser.
  2. Open a “New Tab” page.
  3. Control+click on the Firefox Nightly browser from the “Dock”.
  4. Select the “Quit” option.
  5. Repeat steps 3 and 4 for two more times.
  6. Click the “Close tabs” on all 3 prompts and observe the “Dock”.
  7. Wait a few minutes and observe the behavior.

[Expected results]:

  • Step 6: The Firefox Nightly process is fully closed.
  • Step 7: Nothing happens.

[Actual results]:

  • Step 6: The Firefox Nightly process remains on the “Dock” as an active process.
  • Step 7: A crash report is displayed.

[Notes]:

  • You can see a video containing the issue here.

  • The issue is not reproducible on the latest Firefox Beta or Release builds.

  • The issue is not reproducible if Firefox is closed using CMD+Q.

  • The following message is displayed in the Terminal:
    2021-04-15 13:34:27.344 crashreporter[70854:365595] Warning: Expected min height of view: (<NSPopoverTouchBarItemButton: 0x10d7bd000>) to be less than or equal to 30 but got a height of 32.000000. This error will be logged once per view in violation.

  • Report ID: bp-b0eeb7b1-03e8-43f0-8a46-eeb4d0210415 15.04.2021, 13:34

  • @Gijs can you please weigh in on this?

  • Also, I’m not 100% sure that this is the right component for the bug.

Flags: needinfo?(gijskruitbosch+bugs)

This crash is a shutdown hang ( https://crash-stats.mozilla.org/report/index/b0eeb7b1-03e8-43f0-8a46-eeb4d0210415 ). Jens, I thought bug 1696397 was supposed to have fixed these?

With prompts.windowPromptSubDialog turned off, the problem doesn't reproduce - the 2nd + later quit request from the dock just focuses the window that has a sheet open. It's not clear to me what code is responsible for doing this, and how we'd teach it about the new style of prompts. :haik, do you know?

Flags: needinfo?(jstutte)
Flags: needinfo?(haftandilian)
Flags: needinfo?(gijskruitbosch+bugs)

FWIW, I don't think this is S1 - it's a shutdown crash, and it requires the user doing something strange like requesting to quit multiple times and ignoring the dialogs that come up...

Whiteboard: [proton-modals]

(In reply to :Gijs (he/him) from comment #1)

This crash is a shutdown hang ( https://crash-stats.mozilla.org/report/index/b0eeb7b1-03e8-43f0-8a46-eeb4d0210415 ). Jens, I thought bug 1696397 was supposed to have fixed these?

That was the hope, yes. And from what I could see in 5min on crash-stats, for Windows it helped, apparently (but I'd appreciate some help on confirming this). There are some Mac specific code paths during Quit that might change the game?

But what I see in the crash data looks as if AdvanceShutdownPhase had been called normally, otherwise we would not see Shutdown hanging at step quit-application., confirmed also by the obvious fact, that the Watchdog thread has been started at all. So it is still puzzling me why the SpinEventLoopUntilOrQuit would not break.

:pbz, IIRC the STR look very similar to yours, can you confirm that this is now happening only on specific platforms? The Mac quit code is mostly the same as for Linux (apart from some #ifdef ) so I could expect it to happen also there, the Windows QUIT code path is totally different.

Flags: needinfo?(jstutte) → needinfo?(pbz)

Can't reproduce on Nightly 89.0a1 (2021-04-15), Manjaro Linux (Gnome 40) and Windows 10. Nightly shuts down correctly after confirming all the close prompts.
The only terminal output I get on shutdown is:

[2021-04-16T08:59:25Z ERROR xulstore::persist] removeDocument error: unavailable

###!!! [Parent][RunMessage] Error: Channel closing: too late to send/recv, messages will be lost
Flags: needinfo?(pbz)

Thanks. The whole stack is weird (to me, no experience with MacOS). I see no frame on the main thread where nsAppStartup::Quit has been actually ever called. All I see is a reaction on a "quit-application-requested" notification that ends up spinning the event loop probably for Prompter.jsm. That would explain why we hang - we simply never advanced the shutdown phases correctly as we would inside nsAppStartup::Quit.

But in that case I miss any point that justifies that the watchdog has even started and knows about our phase being "quit-application", though. Nika, do you have any hint?

Flags: needinfo?(nika)
Priority: -- → P3

After looking at this a bit, I have a vague theory of how this might be happening. I think this is an interaction between how we implemented nested event loops and the MacOS event loop system. When we do a nested event loop, we repeatedly call NS_ProcessNextEvent to process events, and check for the exit condition between these calls. I think we're not returning from NS_ProcessNextEvent after the shutdown event is processed, and are instead waiting for another native event, so don't have an opportunity to leave the event loop.

When we call into the event loop, if we're allowed to block and have no pending XPCOM messages, we'll call into the native OS message loop, and end up waiting for a native message to arrive, and stay there until something happens which causes us to process an XPCOM event. This means we don't always return from NS_ProcessNextEvent between native notifications, and therefore don't have an opportunity to abort the nested event loop. I think the macOS code for handling this is here: https://searchfox.org/mozilla-central/rev/a8b75e4ba3f8ddf0e76b42681d0a7b7e78e67730/widget/cocoa/nsAppShell.mm#616-623

With a normal situation, I think this wouldn't be an issue, as we'd post the nsAppExitEvent from https://searchfox.org/mozilla-central/rev/a8b75e4ba3f8ddf0e76b42681d0a7b7e78e67730/toolkit/components/startup/nsAppStartup.cpp#458, which both ensures that we leave the native loop to process the XPCOM event, and calls nsAppShell::Exit which does further work to end the native queue. In this particular situation, though, it seems like these events get processed within a multiply-nested event loop, so the toplevel nested event loop doesn't end up seeing a request to process XPCOM events, and never leaves the waiting section.

I am unfortunately not familiar enough with the code to propose a great solution here, but perhaps we need to check if we've advanced shutdown state since nsAppShell::ProcessGeckoEvents was called, and always exit back to the XPCOM loop in that case, even if it happened within a nested event loop or something like that? Perhaps :KrisWright will be more familiar with the code here.

Flags: needinfo?(nika) → needinfo?(kwright)

So to be fair, I think we can implement a solution at a different level, ie the quit/close warnings should be able to keep global state about whether a prompt is pending, and "just" re-surface that window and its prompt if an attempt to re-enter quit/close logic is made. But the earlier comments (thank you everyone) do make me think that there may still be a problem at the cocoa/prompt level, even if we fix the particular quit/close usecase? If I'm misunderstanding that part, please let me know, as it does sound like fixing this particular case at the frontend level might be easier than trying to work around multiple event loops getting into a gordian knot...

(In reply to Jens Stutte [:jstutte] from comment #5)

Thanks. The whole stack is weird (to me, no experience with MacOS). I see no frame on the main thread where nsAppStartup::Quit has been actually ever called. All I see is a reaction on a "quit-application-requested" notification that ends up spinning the event loop probably for Prompter.jsm. That would explain why we hang - we simply never advanced the shutdown phases correctly as we would inside nsAppStartup::Quit.

But in that case I miss any point that justifies that the watchdog has even started and knows about our phase being "quit-application", though. Nika, do you have any hint?

FWIW and just to clarify my own doubt here: nsAppStartup::Quit in the end just posts a nsAppExitEvent. Before doing so, we already advance our shutdown phase to AppShutdownConfirmed. So it is normal to not see any nsAppStartup::Quit on the stack (note to me: read further before posting) and find ourselves already in phase "quit-application".

Thus I am quite sure we can and do end up in the situation that :nika described so well above as a (non)reaction on the nsAppExitEvent (thanks!). And I think we should definitely find a way to avoid such a situation in general, not just work around the issue with the re-entrant prompter in the frontend. I fear I cannot really help with a solution, :KrisWright, any idea?

So I sat down and reproduced this and it is a bit tricky from the xpcom side - nsThreadManager makes an assumption that the event loop will keep moving so we can bail when we reach the shutdown stage without processing any more events but we just never get there. Based on what I'm seeing I think Nika's theory is correct, and I have a few thoughts on the problem:

Since we're looking at multiple dialogue boxes and closing them kicks off the shutdown hang, the theory I'm going with for the most part is we're getting stuck in some kind of nested mismatch here where the native event loop is waiting on the dialog box - but we've already closed the dialog box, and there's nothing to wait on, so we're just getting stuck. What's weird is that SpinEventLoopOrQuit can't just exit anyway, which tells me that we are stuck somewhere in the event loop to the point that we're ignoring shutdown notifications. The event loop's aVeryGoodReasonToDoThis in the stack points to prompts/Prompter.jsm:openPromptSync, so it thinks we are in an open prompt box so I'm pretty sure something's going wrong with the closing of nested prompt box dialogs.

Here's where it goes way out of my depth - we're stuck trying to get the next event in the bottom-most event loop. I'm not versed enough in mac native builds to really know why we're stuck here or what exactly it's waiting on. My theory is that it thinks that it's waiting on the prompt box. My reasoning for this is that if I open a single exit prompt, pause firefox with the prompt open, and examine the main thread backtrace, it looks the same to me as a crashing backtrace. So what I think is that the loop thinks its dialog box is still open. In this specific case I can resume, close firefox, and exit normally, because I have only opened a single dialog box and it notices that it's been closed.

tl;dr - SpinEventLoopUntil is processing a single event on the main thread and it hasn't noticed that it needs to break. But that single event is actually a nested native event loop that has gotten stuck for reasons I don't quite understand but appear to be that it's looking for a prompt that has already happened. We hang, because we are sitting on a native event that is hanging which for some reason we can't get out of, and then we crash.

I can think of a few solutions that vary in complexity, though my scope of mac native loops or our prompt code is minimal so I can't guarantee which ones are better:

  • change our SpinEventLoopUntil code to receive notifications (possibly off a timer at this phase in shutdown) that force exit the entire loop, regardless of whether it is in the middle of processing an event. This is not ideal to me because we might overlook the underlying issue of a shutdown hang if we cancel everything and exit immediately. We also might end up dropping shutdown events in say, a thread shutdown loop, and cause other threads to hang (which would cause issues further into shutdown).
  • Improve the way we process events so that every event loop ensures that it receives a shutdown notification and quits. I just don't think this will fix the hang because I think the problem starts while the dialog box is open. We haven't advanced that event at all so it won't be able to check for shutdown.
  • change the way we handle close tab dialog boxes to only prompt once to close tabs after attempting to quit instead of opening multiple dialog boxes. I feel like this doesn't fix the underlying issue but it would probably prevent the hang from tripping in the first place. I'm a bit out of my depth in this code so I haven't attempted to do this myself. I feel like this is a bug anyway, and it may need to be fixed regardless of if we do anything else.
  • Address this from the native event loops and force them to shut down. I don't know for certain what this might entail since I don't know enough about our mac native builds. Based on what I'm seeing we may still be stuck on the same event we were waiting on to hit 'close tabs'. I think this is the real fix to the problem.

Here's a backtrace of the main thread I based my assumptions off of if anyone has any ideas:

* thread #1, name = 'MainThread', queue = 'com.apple.main-thread'
  * frame #0: 0x00007fff67bc722a libsystem_kernel.dylib`mach_msg_trap + 10
    frame #1: 0x00007fff67bc776c libsystem_kernel.dylib`mach_msg + 60
    frame #2: 0x00007fff3bb1c1ee CoreFoundation`__CFRunLoopServiceMachPort + 328
    frame #3: 0x00007fff3bb1b75c CoreFoundation`__CFRunLoopRun + 1612
    frame #4: 0x00007fff3bb1aebe CoreFoundation`CFRunLoopRunSpecific + 455
    frame #5: 0x00007fff3ad7a1ab HIToolbox`RunCurrentEventLoopInMode + 292
    frame #6: 0x00007fff3ad79ee5 HIToolbox`ReceiveNextEventCommon + 603
    frame #7: 0x00007fff3ad79c76 HIToolbox`_BlockUntilNextEventMatchingListInModeWithFilter + 64
    frame #8: 0x00007fff3911279d AppKit`_DPSNextEvent + 1135
    frame #9: 0x00007fff3911148b AppKit`-[NSApplication(NSEvent) _nextEventMatchingEventMask:untilDate:inMode:dequeue:] + 1361
    frame #10: 0x000000010996427d XUL`::-[GeckoNSApplication nextEventMatchingMask:untilDate:inMode:dequeue:](self=0x00000001012bcd30, _cmd="nextEventMatchingMask:untilDate:inMode:dequeue:", mask=18446744073709551615, expiration=4001-01-01 00:00:00 UTC, mode="kCFRunLoopDefaultMode", flag=YES) at nsAppShell.mm:170:24
    frame #11: 0x000000010996607e XUL`nsAppShell::ProcessNextNativeEvent(this=0x000000013a5ecca0, aMayWait=true) at nsAppShell.mm:633:28
    frame #12: 0x000000010986b25e XUL`nsBaseAppShell::DoProcessNextNativeEvent(this=0x000000013a5ecca0, mayWait=true) at nsBaseAppShell.cpp:120:17
    frame #13: 0x000000010986b7c1 XUL`nsBaseAppShell::OnProcessNextEvent(this=0x000000013a5ecca0, thr=0x0000000101216480, mayWait=false) at nsBaseAppShell.cpp:259:10
    frame #14: 0x0000000109966b80 XUL`nsAppShell::OnProcessNextEvent(this=0x000000013a5ecca0, aThread=0x0000000101216480, aMayWait=true) at nsAppShell.mm:811:26
    frame #15: 0x000000010238e69b XUL`nsThread::ProcessNextEvent(this=0x0000000101216480, aMayWait=true, aResult=0x00007ffeefbf4d97) at nsThread.cpp:1075:10
    frame #16: 0x0000000102397568 XUL`NS_ProcessNextEvent(aThread=0x0000000101216480, aMayWait=true) at nsThreadUtils.cpp:548:10
    frame #17: 0x0000000102397304 XUL`bool mozilla::SpinEventLoopUntil<(mozilla::ProcessFailureBehavior)1, nsThreadManager::SpinEventLoopUntilInternal(nsTSubstring<char> const&, nsINestedEventLoopCondition*, mozilla::ShutdownPhase)::$_10>(aPredicate=0x00007ffeefbf4e50, aThread=0x0000000000000000)::$_10&&, nsIThread*) at SpinEventLoopUntil.h:93:25
    frame #18: 0x0000000102396f3d XUL`nsThreadManager::SpinEventLoopUntilInternal(this=0x0000000115b153e8, aVeryGoodReasonToDoThis=0x00007ffeefbf5160, aCondition=0x00000001725889c0, aCheckingShutdownPhase=AppShutdownConfirmed) at nsThreadManager.cpp:714:8
    frame #19: 0x000000010239701a XUL`nsThreadManager::SpinEventLoopUntilOrQuit(this=0x0000000115b153e8, aVeryGoodReasonToDoThis=0x00007ffeefbf5160, aCondition=0x00000001725889c0) at nsThreadManager.cpp:648:10
    frame #20: 0x000000010240262e XUL`NS_InvokeByIndex at xptcinvoke_asm_x86_64_unix.S:101

NI?ing :nika in case you have any thoughts :)

Flags: needinfo?(kwright) → needinfo?(nika)
Depends on: 1706963
Priority: P3 → P1
Status: NEW → RESOLVED
Closed: 3 years ago
Priority: P3 → P1
Resolution: --- → WONTFIX

Should this be WONTFIX now that bug 1706963 made it so multiple quit/close prompts don't end up open at the same time avoiding the nested event loop spinning. Or are there other places that nest spinning?

(In reply to Ed Lee :Mardak from comment #11)

Should this be WONTFIX now that bug 1706963 made it so multiple quit/close prompts don't end up open at the same time avoiding the nested event loop spinning. Or are there other places that nest spinning?

The problem is still reproducible. I was able to reproduce the hang and crash on the latest Nightly (2021-08-05). Re-opening the bug.

Status: RESOLVED → REOPENED
Resolution: WONTFIX → ---

I think fixing this in the front end would be a good practical approach for now (as Gij's commented in comment 7). Specifically, instead of displaying a new dialog, raising the window with the existing quit dialog and avoiding spinning of a new event loop. Assuming we can handle the existing window with a prompt being minimized by un-minimizing it.

I can't claim to understand the shutdown issue, but it does seem like the call to win.gDialogBox.replaceDialogIfOpen() in BG__onQuitRequest() does not result in the existing native event loop finishing as one would expect because we end up with multiple nsThreadManager::SpinEventLoopUntilInternal() frames on the stack. Is there something else we should be doing in the front end in addition to calling replaceDialogIfOpen() to remove the existing prompt cleanly?

@gijs, do you know the answer to that and/or who would be a good person to work on the front end fix you described in comment 7.

Flags: needinfo?(haftandilian) → needinfo?(gijskruitbosch+bugs)

(In reply to Haik Aftandilian [:haik] from comment #13)

I think fixing this in the front end would be a good practical approach for now (as Gij's commented in comment 7). Specifically, instead of displaying a new dialog, raising the window with the existing quit dialog and avoiding spinning of a new event loop. Assuming we can handle the existing window with a prompt being minimized by un-minimizing it.

I can't claim to understand the shutdown issue, but it does seem like the call to win.gDialogBox.replaceDialogIfOpen() in BG__onQuitRequest() does not result in the existing native event loop finishing as one would expect because we end up with multiple nsThreadManager::SpinEventLoopUntilInternal() frames on the stack. Is there something else we should be doing in the front end in addition to calling replaceDialogIfOpen() to remove the existing prompt cleanly?

@gijs, do you know the answer to that

I don't know per se, but I can make an educated guess... when the call to replaceDialogIfOpen is made, we will close the dialog. That dialog was opened from within a nested event loop. But we won't exit from the initial nested event loop synchronously - we're in the middle of some other sync JS code (the call to replaceDialogIfOpen) and that bit of stack would have to unwind before we'd reconsider the exit condition for the existing event loop and exit it. But it never does, and instead, we enter a new nested loop for the next quit prompt, meaning we never evaluate the exit condition for that loop until we quit the innermost loop (which would be handling events from that point onwards).

To allow the "parent" nested event loop to quit would require letting all the JS come off the stack, but we need an answer to the quit observer notification synchronously, and so we can't stop running sync JS.

and/or who would be a good person to work on the front end fix you described in comment 7.

Well, it'd need prioritization from product, because we have too much stuff to do and too little time. Do the crash volumes indicate that such prioritization is warranted? From a quick look we're at about 900-odd mac crashes a week - but AFAICT based on bug 1610213 (which predates this regression by 2 years) they aren't all going to be caused by this issue, and at this point it's unclear what proportion is caused by this issue. Either way I have no idea if 900 is a lot or not.

Another thing to perhaps point out here is that we're about to add more complexity in this area at product's request, by disabling this warning by default (yay) but creating a separate one that's enabled by default for use of accel-q on linux/mac (which will therefore have the same issue - boooo!). A fix along the lines of comment 7 would need to deal with both types of prompts. Bug 1724959 is the relevant meta, fwiw.

Flags: needinfo?(gijskruitbosch+bugs) → needinfo?(haftandilian)
See Also: → 1610213

Thanks for the explanation. I agree, I don't think the crash reports uniquely identify this crash. And I don't think this warrants high prioritization given the steps to reproduce are contrived in my opinion.

I tested a patch using the global variable approach and it seems to work well, but does change the behavior:

  1. After Cmd-Q or Dock->Quit, when the prompt is displayed, if the user switches to a new window and performs another quit, instead of showing the quit prompt in the second window, we switch focus back to the original window already showing the prompt.
  2. Once the prompt is displayed, if a user switches to an existing window and opens new windows and/or tabs, the window and tab count mentioned in the prompt is not updated.

It's up to UX, but I think ideally we would make all windows modal when the quit prompt is displayed, and not allow minimizing windows until the prompt is dismissed. I'll post the patch.

One data point is that I can't reproduce the crash using command-Q.

Flags: needinfo?(haftandilian)

On Mac, don't display new quit dialogs for nested quit requests. Instead, focus the window and cancel the additional quit.

See Also: → 1729981
See Also: → 1728344

(In reply to Haik Aftandilian [:haik] from comment #15)

It's up to UX, but I think ideally we would make all windows modal when the quit prompt is displayed, and not allow minimizing windows until the prompt is dismissed. I'll post the patch.

Is that something we can actually technically do without significant effort? Because I don't think that we have frontend primitives, at least, to delegate modal-ness to a different window (ie attempts to interact with the other windows should bring the modal'd window to front and focus the dialog there). Perhaps I'm missing something obvious?

(Asking because although I could go to UX with this question, if they say "yes implement what Haik is saying", I'd still be stuck because I don't think we really can...)

Flags: needinfo?(haftandilian)
Severity: S2 → S3
Keywords: crash
Component: Notifications and Alerts → Content Prompts
Flags: needinfo?(nika)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: