hang after startup. windows title bar shows "Not Responding"

RESOLVED FIXED in Firefox 52

Status

()

defect
P1
critical
RESOLVED FIXED
3 years ago
3 years ago

People

(Reporter: wsmwk, Assigned: froydnj)

Tracking

({hang})

Trunk
mozilla53
Points:
---

Firefox Tracking Flags

(firefox50+ wontfix, firefox51+ wontfix, firefox52+ fixed, firefox53 fixed)

Details

Attachments

(4 attachments)

restarted firefox
about:sessionrestore page appeared - all seems OK.
did NOT click restore, and had no interraction with Firefox UI.
tabbed away to a different application.
came back within a few minutes to a hung firefox and title bar shows "Not Responding".

unfortunately I did not check to see how much CPU is being used.
Firefox nightly build 20160827

I have never seen a hang after startup like that before.
So far, startups after that incident have not hung.
User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:51.0) Gecko/20100101 Firefox/51.0

I have tested this issue on Windows 10 x64 with the latest Firefox release (48.0.2) and the latest Nightly (51.0a1-20160831030224) and could not reproduce it.
After opening a few tabs, I've restarted Firefox using the "Restart with Add-ons Disabeld" option from the "Help" menu and clicking "Refresh Nightly" from the "Nightly Safe Mode" pop-up.
After the "about:sessionrestore" page appeared I've switched to another application for a couple of minutes and when I've switched back, Firefox was still functional.

Can you please retest this using a new clean Firefox profile? Also could you provide mode details, anything that could help us reproduce the issue?
Flags: needinfo?(vseerror)
I've provided a detailed stacktrace, and it's not reproducible, so I don't plan to do further testing.
Flags: needinfo?(vseerror)
I've investigated a bit the stacktrace and found multiple hang errors related to "ipc". Considering this, I'm assigning the "Core: IPC" component on this. Perhaps there's someone with more knowledge on this area that might be able to help here.
Component: Untriaged → IPC
Product: Firefox → Core
It looks like thread 0 is inside google_breakpad::ExceptionHandler::WriteMinidumpOnHandlerThread(), triggered by mozilla::dom::ContentParent::ForceKillTimerCallback(), maybe waiting on a lock? So I guess we're hanging while creating a minidump. I don't know if that is normal or not.
Ted, is that expected? I have no idea what it is supposed to look like when we're generating a minidump.
Flags: needinfo?(ted)
Oof, this kinda sucks.

Thread 1 noticed the content process hanging near shutdown, so it's trying to kill it. Before we kill it we collect a minidump from both the content process and the current process.

Collecting a minidump in the current process happens on another thread, and the minidump code suspends all threads except the current one. While taking the minidump, it uses a separate heap and must not use the general heap, because another suspended thread may be holding the allocator lock. But we have a hook within the minidump code which is then allocating memory:

00000000`2f0adec0 000007fe`c9a9843d xul!nsAString_internal::Assign(class nsAString_internal * aStr = 0x00000000`2f0ae190, struct mozilla::fallible_t * aFallible = <Value unavailable error>)+0xca [c:\builds\moz2_slave\m-cen-w64-ntly-000000000000000\build\src\xpcom\string\nstsubstring.cpp @ 440]
(Inline Function) --------`-------- xul!mozilla::NtPathToDosPath+0x10d [c:\builds\moz2_slave\m-cen-w64-ntly-000000000000000\build\src\xpcom\string\nstsubstring.cpp @ 396]
(Inline Function) --------`-------- xul!mozilla::NtPathToDosPath+0x10d [c:\builds\moz2_slave\m-cen-w64-ntly-000000000000000\build\src\obj-firefox\dist\include\nstsubstring.h @ 455]
00000000`2f0adef0 000007fe`c9a97cf6 xul!mozilla::NtPathToDosPath(class nsAString_internal * aNtPath = 0x00000000`2f0ae190, class nsAString_internal * aDosPath = 0x00000000`2f0ae060)+0x10d [c:\builds\moz2_slave\m-cen-w64-ntly-000000000000000\build\src\xpcom\io\fileutilswin.h @ 65]
00000000`2f0ae040 000007fe`c9a97a7b xul!`anonymous namespace'::WinIOAutoObservation::WinIOAutoObservation(mozilla::IOInterposeObserver::Operation aOp = <Value unavailable error>, class nsAString_internal * aFilename = 0x00000000`2f0ae190)+0x86 [c:\builds\moz2_slave\m-cen-w64-ntly-000000000000000\build\src\xpcom\build\poisoniointerposerwin.cpp @ 167]
00000000`2f0ae130 000007fe`fd886296 xul!`anonymous namespace'::InterposedNtCreateFile(void ** aFileHandle = 0x00000000`2f0ae298, unsigned long aDesiredAccess = 0x80100080, struct _OBJECT_ATTRIBUTES * aObjectAttributes = 0x00000000`2f0ae2e8, struct _IO_STATUS_BLOCK * aIoStatusBlock = 0x00000000`2f0ae2a8, union _LARGE_INTEGER * aAllocationSize = 0x00000000`00000000, unsigned long aFileAttributes = 0, unsigned long aShareAccess = 3, unsigned long aCreateDisposition = 1, unsigned long aCreateOptions = 0x60, void * aEaBuffer = 0x00000000`00000000, unsigned long aEaLength = 0)+0x67 [c:\builds\moz2_slave\m-cen-w64-ntly-000000000000000\build\src\xpcom\build\poisoniointerposerwin.cpp @ 255]
00000000`2f0ae220 00000000`77990d7d KERNELBASE!CreateFileW+0x2cd
00000000`2f0ae380 000007fe`f4946c7d KERNEL32!CreateFileWImplementation+0x7d
00000000`2f0ae3e0 000007fe`f4943471 dbghelp!Win32LiveSystemProvider::OpenMapping+0x5e
00000000`2f0ae5e0 000007fe`f494565b dbghelp!GenAllocateModuleObject+0xd5
00000000`2f0ae7b0 000007fe`f4941c25 dbghelp!GenGetProcessInfo+0x2db
00000000`2f0af1a0 000007fe`f4942139 dbghelp!MiniDumpProvideDump+0x265
00000000`2f0af8f0 000007fe`cb7f2b37 dbghelp!MiniDumpWriteDump+0x249
00000000`2f0af9b0 000007fe`cb7f28f9 xul!google_breakpad::ExceptionHandler::WriteMinidumpWithExceptionForProcess(unsigned long requesting_thread_id = 0x26ec, struct _EXCEPTION_POINTERS * exinfo = <Value unavailable error>, struct MDRawAssertionInfo * assertion = 0x000007fe`cb7f255a, void * process = 0xffffffff`ffffffff, bool write_requester_stream = true)+0x1ff [c:\builds\moz2_slave\m-cen-w64-ntly-000000000000000\build\src\toolkit\crashreporter\google-breakpad\src\client\windows\handler\exception_handler.cc @ 1018]

So the bug here is that this hook is bad.

glandium, I guess we should somehow forcefully disable the IO interpose hook before taking minidumps?
Flags: needinfo?(mh+mozilla)
Component: IPC → Breakpad Integration
Product: Core → Toolkit
Flags: needinfo?(ted)
Yikes! Is it generally safe for the IO interpose hook to allocate memory? It definitely shouldn't be doing that when we're writing a minidump.
You can tell the IO Interposer to ignore a specific fd. I don't know if there's a way to universally do it.
Making the IO Interposer not allocate might be difficult, but as mccr8 mentions, you can register the minidump fd as not being interposed. See https://dxr.mozilla.org/mozilla-central/rev/c55bcb7c777ea09431b4d16903ed079ae5632648/xpcom/build/PoisonIOInterposer.h#20,23,26
Flags: needinfo?(mh+mozilla)
[Tracking Requested - why for this release]:

[Tracking Requested - why for this release]:

[Tracking Requested - why for this release]:

I don't think we can register this fd for several reasons:

* the FD is not opened by us, it's opened by a MS library
* the registration functions both take a lock (again with a small risk of deadlock) and potentially allocate memory via the global FdList.

I think this is pretty bad and we should fix this ASAP (even for uplift to 50). Who can take it?
Priority: -- → P1
Duplicate of this bug: 1308492
sheppy hit this same issue on OS X in bug 1308492--he had a content process hang, the chrome process called `CreateMinidumpsAndPair`, but unfortunately another thread was holding the malloc lock when it froze all other threads to write a minidump of itself. The IOInterposer was called while writing to the minidump file, and then when it tried to malloc we were deadlocked.
Doesn't look like this can be fixed in time for 50, wontfix.
Hi :ted.mielczarek,
can you help find someone who can work on this given this is high priority and critical?
Flags: needinfo?(ted)
I don't know much about how the IO interposer works. bsmedberg: can you find someone who does? I'm happy to provide info from the crash reporting side of things.
Flags: needinfo?(ted) → needinfo?(benjamin)
That would be Nathan, I think. At least he's the XPCOM owner now! 
Flags: needinfo?(benjamin) → needinfo?(nfroyd)
It looks like we have IOInterposer::Disable:

http://dxr.mozilla.org/mozilla-central/source/xpcom/build/IOInterposer.h#169

which is intended for the crashreporter.  Indeed, various signal-handling bits of the crashreporter already call said function, e.g.:

http://dxr.mozilla.org/mozilla-central/source/toolkit/crashreporter/nsExceptionHandler.cpp#1419

I guess the problem is that we're not disabling the IOInterposer for "manual" minidump writing?

It looks like there are a number of different ways to manually write minidumps, according to nsExceptionHandler.h:

http://dxr.mozilla.org/mozilla-central/source/toolkit/crashreporter/nsExceptionHandler.h#142
http://dxr.mozilla.org/mozilla-central/source/toolkit/crashreporter/nsExceptionHandler.h#180

and so forth, but those all seem to be thin wrappers around google_breakpad::ExceptionHandler::WriteMinidump*, so we'd have to annotate a number of different functions, rather than some common entry point, or refactor things so we could always wrap calls to WriteMinidump*...which doesn't seem like a bad idea, given the number of #ifdefs around such calls.  Ted, is that correct?
Flags: needinfo?(nfroyd) → needinfo?(ted)
So the only time we should need to call this is from `CreateMinidumpsAndPair`. We only hit this deadlock if we try to write a minidump of *the current process*, which only happens if we actually crash (which is covered by the code you linked above) or if we have a hang in a content or plugin process, in which case we write a minidump of the browser process to go along with it.

Although, looking at the code again, it looks like `TakeMinidump` also writes a minidump of the current process:
https://dxr.mozilla.org/mozilla-central/rev/79feeed4293336089590320a9f30a813fade8e3c/toolkit/crashreporter/nsExceptionHandler.cpp#3946

You could probably refactor those two call sites to share code, they do virtually the same thing for writing a dump of the browser process.

In this case, you'd want to reenable the IOInterposer after writing the dump, since the process isn't crashing.
Flags: needinfo?(ted)
Nathan are you able to make progress on this with the info Ted provided?
Flags: needinfo?(nfroyd)
In some parts of the crash reporter, we need to disable I/O
interposition, then turn it back on.  We didn't have a function for
re-enabling the I/O interposer...until now.
Attachment #8821667 - Flags: review?(aklotz)
This will ensure the I/O Interposer is re-enabled after temporarily
disabling it.
Attachment #8821668 - Flags: review?(aklotz)
I think I got the places for the RAII class correct, based on comment 19.
Attachment #8821669 - Flags: review?(ted)
(In reply to Julien Cristau [:jcristau] from comment #20)
> Nathan are you able to make progress on this with the info Ted provided?

Yes, sorry for dropping this on the floor.
Flags: needinfo?(nfroyd)
Comment on attachment 8821668 [details] [diff] [review]
part 2 - add an AutoIOInterposerDisable RAII class

Review of attachment 8821668 [details] [diff] [review]:
-----------------------------------------------------------------

::: xpcom/build/IOInterposer.h
@@ +272,5 @@
>  #endif
>    }
>  };
>  
> +class AutoIOInterposerDisable final MOZ_STACK_CLASS

How 'bout MOZ_RAII instead?
Attachment #8821668 - Flags: review?(aklotz) → review+
Attachment #8821667 - Flags: review?(aklotz) → review+
Comment on attachment 8821669 [details] [diff] [review]
part 3 - disable I/O interposition when manually taking crashdumps

Review of attachment 8821669 [details] [diff] [review]:
-----------------------------------------------------------------

Oh well, that's a simple patch!
Attachment #8821669 - Flags: review?(ted) → review+
Pushed by nfroyd@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/53b091d03fcd
part 1 - add IOInterposer::Enable(); r=aklotz
https://hg.mozilla.org/integration/mozilla-inbound/rev/8b5024ea181a
part 2 - add an AutoIOInterposerDisable RAII class; r=aklotz
https://hg.mozilla.org/integration/mozilla-inbound/rev/8fa726d6a295
part 3 - disable I/O interposition when manually taking crashdumps; r=ted.mielczarek
We've built 51 RC. Mark 51 won't fix. 
Hi :froydnj, 
Is this worth uplifting to Aurora52?
Flags: needinfo?(nfroyd)
(In reply to Gerry Chang [:gchang] from comment #29)
> We've built 51 RC. Mark 51 won't fix. 
> Hi :froydnj, 
> Is this worth uplifting to Aurora52?

Yes!
Flags: needinfo?(nfroyd)
Comment on attachment 8821667 [details] [diff] [review]
part 1 - add IOInterposer::Enable()

This request is for all the patches contained in this bug.

Approval Request Comment
[Feature/Bug causing the regression]: Probably the bug(s) that added the I/O interposer and various features surrounding it.  Been around for a while, I suspect.
[User impact if declined]: Startup hangs, most unfriendly.
[Is this code covered by automated tests?]: No.
[Has the fix been verified in Nightly?]: No.
[Needs manual test from QE? If yes, steps to reproduce]: Manual testing would be fantastic, but I don't think we have steps to reproduce.
[List of other uplifts needed for the feature/fix]: Just the patches in this bug.
[Is the change risky?]: No.
[Why is the change risky/not risky?]: Disabling the I/O interposer is not an unusual operation; we do it during "normal" crashdumps, for instance.  So we know that disabling it works; this is just an additional code path that does that.
[String changes made/needed]: None.
Attachment #8821667 - Flags: approval-mozilla-aurora?
Comment on attachment 8821667 [details] [diff] [review]
part 1 - add IOInterposer::Enable()

disable io interposer when creating a minidump to avoid deadlock, aurora52+ (all three patches)
Attachment #8821667 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Assignee: nobody → nfroyd
Me so happy. Way to go, team!
You need to log in before you can comment on or make changes to this bug.