Closed Bug 1258009 Opened 4 years ago Closed 3 years ago

crash in shutdownhang | WaitForSingleObjectEx | WaitForSingleObject | PR_WaitCondVar | nsThread::ProcessNextEvent | NS_ProcessNextEvent | nsThread::Shutdown

Categories

(Core :: XPCOM, defect, critical)

x86
Windows NT
defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla49
Tracking Status
firefox45 --- wontfix
firefox46 + wontfix
firefox47 blocking fixed
firefox48 + fixed
firefox49 --- fixed
firefox-esr45 --- affected

People

(Reporter: khuey, Assigned: aklotz)

References

Details

(Keywords: crash, Whiteboard: [pi])

Crash Data

Attachments

(1 file)

This bug was filed from the Socorro interface and is 
report bp-e1000c2c-615e-4697-bc48-a56942160318.
=============================================================

Thread 22 here is stuck in a dialog event loop created by the shell on a ShellExecuteExW.  Is there anything we can do about that?
Flags: needinfo?(jmathies)
Two things - first, I see mozilla::ipc::MessagePumpForNonMainThreads on the stack. We shouldn't be calling ShellExecute on a background thread that doesn't run a message pump. Audit callers of launch() to figure out where this is coming from and try bouncing this over to the main thread.

Second, A possible fix - populating the hwnd param with the browser window might help, that way if the parent closes the dialog would go with it. 

https://msdn.microsoft.com/en-us/library/windows/desktop/bb759784(v=vs.85).aspx

The real bug here may be that the dialog is hung due to the thread we're creating it on.
Flags: needinfo?(jmathies) → needinfo?(khuey)
Bug 632556 made all .launch calls go to the background thread like this, in 2012 ...
Flags: needinfo?(khuey)
Maybe we should be passing SEE_MASK_ASYNCOK?
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #2)
> Bug 632556 made all .launch calls go to the background thread like this, in
> 2012 ...

Doesn't look like it - 

https://bugzilla.mozilla.org/attachment.cgi?id=600092&action=diff

The comments up top indicate we knew this had to happen on the main thread.
The XPIDL methods are main thread only but if you look at nsLocalFile::Launch it unconditionally goes to a background thread to do the ShellExecuteEx.
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #5)
> The XPIDL methods are main thread only but if you look at
> nsLocalFile::Launch it unconditionally goes to a background thread to do the
> ShellExecuteEx.

Yep, I see that now. You probably read the remarks sections too..

"The SEE_MASK_NOASYNC flag must be specified if the thread calling ShellExecuteEx does not have a message loop or if the thread or process will terminate soon after ShellExecuteEx returns."

oops!

Using SEE_MASK_NOASYNC can change the return results. Since this is an async operation maybe that doesn't matter anymore?

Another potential fix, use a thread that runs in a background ui thread loop instead. We have an unused creationFlags for NewThread we can use, which could select the use of a better message pump  [1][2][3][4].

[1] http://mxr.mozilla.org/mozilla-central/source/xpcom/threads/nsThread.cpp#391
[2] http://mxr.mozilla.org/mozilla-central/source/ipc/chromium/src/base/message_loop.h#221
[3] http://mxr.mozilla.org/mozilla-central/source/ipc/chromium/src/base/message_loop.cc#125
[4] http://mxr.mozilla.org/mozilla-central/source/ipc/glue/MessagePump.h#98
Component: Widget: Win32 → XPCOM
[Tracking Requested - why for this release]:
This bug represents a potential deadlock risk for the browser on Windows. It's been around for a few years, we should track it and get it fixed.
Aaron, curious if you might have a few cycles for this, I think a fix is pretty straight forward.
Flags: needinfo?(aklotz)
Whiteboard: [pi]
Sure.
Flags: needinfo?(aklotz)
Assignee: nobody → aklotz
Status: NEW → ASSIGNED
Passing a HWND would be bad if we continued to call ShellExecuteEx from a background thread as that would induce input queue synchronization.

Any UI thrown up by ShellExecuteEx will pump messages since modal dialogs contain an internal message loop.

OTOH (and this has me wondering), since a parent HWND isn't specified, any UI presented by ShellExecuteEx might have fallen behind other windows, so the user never saw it. If it had been in the foreground, perhaps it would have been more obvious that the UI had to be dismissed to continue with shutdown.

I'm going to be very careful about this, obviously, but I'm leaning toward calling ShellExecuteEx on the main thread (so that we can specify a parent window) but with SEE_MASK_ASYNCOK. This way we can get the best of both world: we will have the correct HWND parenting but ShellExecuteEx won't bung up the main thread.

I'll experiment with this a bit to ensure that we can properly respond to a failure in this case.
Ah, I see about the SEE_MASK_NOASYNC and needing to pump for DDE (!).

I still want to proceed with my plan in comment 10, as I think this is a UX issue as much as a deadlock issue.
This is a top-crash (#5 rank) for 47, tracked as blocking.
Tracking for 46 too, to keep an eye on it as more users update to 46.0.1 over the next few days.
Comment on attachment 8749456 [details]
MozReview Request: Bug 1258009: Move nsLocalFile::Launch back to the main thread on Win32, but pass SEE_MASK_ASYNCOK and a parent HWND; r?jimm

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/50961/diff/1-2/
Comment on attachment 8749456 [details]
MozReview Request: Bug 1258009: Move nsLocalFile::Launch back to the main thread on Win32, but pass SEE_MASK_ASYNCOK and a parent HWND; r?jimm

https://reviewboard.mozilla.org/r/50961/#review47829

::: xpcom/io/nsLocalFileWin.cpp:108
(Diff revision 2)
> +  nsCOMPtr<nsIWidget> widget = widget::WidgetUtils::DOMWindowToWidget(win);
> +  if (!widget) {
> +    return result;
> +  }
> +
> +  result = reinterpret_cast<HWND>(widget->GetNativeData(NS_NATIVE_WINDOW));

nit - do we really need result? everything above here can return nullptr, here you just return the result of GetNativeData.

::: xpcom/io/nsLocalFileWin.cpp:170
(Diff revision 2)
> -      }
> -      break;
> +        break;
> -      case LaunchOp: {
> -        Launch();
>        }
> +      default: {

So we don't really need FileOp or any of the mOperation stuff anymore. Why don't we just get rid of all that. We can rename this class to something like AsyncLocalFileReveal.

::: xpcom/io/nsLocalFileWin.cpp:3305
(Diff revision 2)
>    if (NS_FAILED(rv)) {
>      return rv;
>    }
>  
> -  // To create a new thread, get the thread manager
> -  nsCOMPtr<nsIThreadManager> tm = do_GetService(NS_THREADMANAGER_CONTRACTID);
> +  ULONG semask = SEE_MASK_ASYNCOK;
> +  HWND sehwnd = GetMostRecentNavigatorHWND();

nit - you could simplify this a bit by assigning these directly to the members of seinfo (like nShow).
Attachment #8749456 - Flags: review?(jmathies)
Noting this is the #3 topcrash on 46.0.1, #12 on 47 beta 4.
Comment on attachment 8749456 [details]
MozReview Request: Bug 1258009: Move nsLocalFile::Launch back to the main thread on Win32, but pass SEE_MASK_ASYNCOK and a parent HWND; r?jimm

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/50961/diff/2-3/
Attachment #8749456 - Flags: review?(jmathies)
Attachment #8749456 - Flags: review?(jmathies) → review+
Comment on attachment 8749456 [details]
MozReview Request: Bug 1258009: Move nsLocalFile::Launch back to the main thread on Win32, but pass SEE_MASK_ASYNCOK and a parent HWND; r?jimm

https://reviewboard.mozilla.org/r/50961/#review49982
https://hg.mozilla.org/mozilla-central/rev/52a46e3ac003
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla49
Hello Aaron, Jim, should we consider uplifting this to Beta47, Aurora48? Thanks!
Flags: needinfo?(jmathies)
Flags: needinfo?(aklotz)
(In reply to Ritu Kothari (:ritu) from comment #22)
> Hello Aaron, Jim, should we consider uplifting this to Beta47, Aurora48?
> Thanks!

I think an aurora uplift would be relatively safe to do. We're a little to close to merge for beta imo, this should get have some bake time.
Flags: needinfo?(jmathies)
Agreed.
Flags: needinfo?(aklotz)
Comment on attachment 8749456 [details]
MozReview Request: Bug 1258009: Move nsLocalFile::Launch back to the main thread on Win32, but pass SEE_MASK_ASYNCOK and a parent HWND; r?jimm

Approval Request Comment
[Feature/regressing bug #]: Bug 632556
[User impact if declined]: Shutdown hangs, system dialogs displayed behind Firefox windows when launching from the downloads pane
[Describe test coverage new/current, TreeHerder]: On m-c, tested locally
[Risks and why]: Low, this patch actually simplifies the launching code
[String/UUID change made/needed]: None
Attachment #8749456 - Flags: approval-mozilla-aurora?
Comment on attachment 8749456 [details]
MozReview Request: Bug 1258009: Move nsLocalFile::Launch back to the main thread on Win32, but pass SEE_MASK_ASYNCOK and a parent HWND; r?jimm

Let's uplift to Aurora48 to see how much it helps. This is a high volume crasher on all channels.
Attachment #8749456 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
(In reply to Jim Mathies [:jimm] from comment #23)
> (In reply to Ritu Kothari (:ritu) from comment #22)
> > Hello Aaron, Jim, should we consider uplifting this to Beta47, Aurora48?
> > Thanks!
> 
> I think an aurora uplift would be relatively safe to do. We're a little to
> close to merge for beta imo, this should get have some bake time.

Hi Jim, Aaron, this is a high volume crash on Beta47. We've seen ~2200 instances of this on Fx47 in a week. This is also huge on 46.0.1 (~12k crashes in a week). I would like to watch this for a few days on Aurora48 and if it looks good, uplift to Beta47 for 47.0b9. If we do not see an improvement (or any worsening) on Beta47, we can backout in RC1. Thoughts?
Flags: needinfo?(jmathies)
Flags: needinfo?(aklotz)
(In reply to Ritu Kothari (:ritu) from comment #27)
> (In reply to Jim Mathies [:jimm] from comment #23)
> > (In reply to Ritu Kothari (:ritu) from comment #22)
> > > Hello Aaron, Jim, should we consider uplifting this to Beta47, Aurora48?
> > > Thanks!
> > 
> > I think an aurora uplift would be relatively safe to do. We're a little to
> > close to merge for beta imo, this should get have some bake time.
> 
> Hi Jim, Aaron, this is a high volume crash on Beta47. We've seen ~2200
> instances of this on Fx47 in a week. This is also huge on 46.0.1 (~12k
> crashes in a week). I would like to watch this for a few days on Aurora48
> and if it looks good, uplift to Beta47 for 47.0b9. If we do not see an
> improvement (or any worsening) on Beta47, we can backout in RC1. Thoughts?

Sounds like a good plan. This is pretty well understood code here as is the change.
Flags: needinfo?(jmathies)
Hello Wes, this is another one that is kind of critical for uplift to Aurora to stabilize so we can uplift to Beta before we gtb 47.0b9. Many thanks!
Flags: needinfo?(wkocher)
Flags: needinfo?(wkocher)
Comment on attachment 8749456 [details]
MozReview Request: Bug 1258009: Move nsLocalFile::Launch back to the main thread on Win32, but pass SEE_MASK_ASYNCOK and a parent HWND; r?jimm

Approval Request Comment
[Feature/regressing bug #]: Bug 632556
[User impact if declined]: Shutdown hangs, system dialogs displayed behind Firefox windows when launching from the downloads pane
[Describe test coverage new/current, TreeHerder]: On m-c, tested locally
[Risks and why]: Low, this patch actually simplifies the launching code
[String/UUID change made/needed]: None
Flags: needinfo?(aklotz)
Attachment #8749456 - Flags: approval-mozilla-beta?
Comment on attachment 8749456 [details]
MozReview Request: Bug 1258009: Move nsLocalFile::Launch back to the main thread on Win32, but pass SEE_MASK_ASYNCOK and a parent HWND; r?jimm

This is a high volume crasher on Beta47, I am taking a chance by uplifting this in 47.0b9. If this hurts rather than help the situation, we will consider backing out before we gtb RC1.
Attachment #8749456 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Hi Aaron, RyanVM mentioned that there were win7 bc3 permaleaks on beta from this push https://treeherder.mozilla.org/#/jobs?repo=mozilla-beta&revision=cbabb4769b50dc1b9c56f96141e741db6b89b0b8. We suspect it might be due to this patch. Could you please help investigate asap? Beta47 will enter RC week on Monday so the sooner we can root cause the better. Thanks!
Flags: needinfo?(aklotz)
It is not clear to me why this patch would be causing those bc3 permaleaks:

* The main change in this patch eliminates the use nsRunnable from nsLocalFileWin::Launch().
* Some cleanup was done around nsLocalFileWin::Reveal(), which still uses nsRunnable, but the changes were minimal.
* The fact that it isn't present on central and is only present on Windows 7 makes this odd.

The only scenario relating to this patch that I can think of that *might* cause this the cause is due to timing:

1) A Reveal() is done near the end of a test suite;
2) Reveal() executes in a nsRunnable on a background thread;
3) The test suite finishes and main thread shutdown begins;
4) The AsyncRevealOperation runnable enqueues an AsyncLocalFileWinDone runnable onto the main thread to clean up the background thread;

*Maybe* this could happen if step (4) occurred after the main thread has left its event loop, but I'm not 100% sure since my knowledge of main thread event loop shutdown is incomplete.
Flags: needinfo?(aklotz)
The main thread accepts events past the point where all other XPCOM threads are joined, so if you're running on an nsThread, your dispatch should never fail.
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #36)
> The main thread accepts events past the point where all other XPCOM threads
> are joined, so if you're running on an nsThread, your dispatch should never
> fail.

That's what I was hoping it was doing, thanks Kyle.

For the purposes of this discussion, that makes it even less clear why this patch could be leaking like that.
It looks like bc3 contains the loop tests, so I wouldn't be so sure it is this ...
Crash volume for signature 'shutdownhang | WaitForSingleObjectEx | WaitForSingleObject | PR_WaitCondVar | nsThread::ProcessNextEvent | NS_ProcessNextEvent | nsThread::Shutdown':
  - esr (45): 13462

Affected platform: Windows
You need to log in before you can comment on or make changes to this bug.