Open Bug 1461353 Opened 6 years ago Updated 2 years ago

Startup hang in nsIWidget::Create on Windows 10

Categories

(Core :: Widget: Win32, defect, P3)

defect

Tracking

()

People

(Reporter: gbrown, Unassigned)

References

(Blocks 1 open bug)

Details

Bug 1414495 records intermittent failures in tests caused by "no output timeouts": when the test harness times out because tests have not reported any results for several minutes. For many weeks, frequent intermittent failures in that bug have been apparent startup hangs in Windows 10 mochitest tasks. Those hangs appear to be on browser startup, well before any tests are started. Screenshots taken at the time of the timeout show no Firefox window. On timeout, the harness kills the browser and almost always successfully collects a minidump; however, the crash reports have rarely been useful, with no Mozilla code in the crashing thread stack. 

Loading some minidumps in Visual Studio pointed to bug 1373279, which was recently fixed. 

Now some - perhaps 30% - of the Windows startup hang reports have a stack that includes nsIWidget::Create.

Here are some examples:

https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-central&job_id=178273429&lineNumber=7613
https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=178251739&lineNumber=5336
https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=178251664&lineNumber=5808
https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=178251659&lineNumber=10585


https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=178333902&lineNumber=2879-2910

13:26:54     INFO -  16  xul.dll!nsIWidget::Create(nsIWidget *,void *,mozilla::gfx::IntRectTyped<mozilla::DesktopPixel> const &,nsWidgetInitData *) [nsIWidget.h:091cdd176a11741a6d1c5d0d90f5d72b4c146472 : 458 + 0x22]
13:26:54     INFO -      rbp = 0x00007ff88ba50100   rsp = 0x0000006555bfc3f0
13:26:54     INFO -      rip = 0x00007ff869653a2a
13:26:54     INFO -      Found by: stack scanning
13:26:54     INFO -  17  xul.dll!nsWebShellWindow::Initialize(nsIXULWindow *,nsIXULWindow *,nsIURI *,int,int,bool,nsITabParent *,mozIDOMWindowProxy *,nsWidgetInitData &) [nsWebShellWindow.cpp:091cdd176a11741a6d1c5d0d90f5d72b4c146472 : 177 + 0x1a]
13:26:54     INFO -      rbp = 0x00007ff88ba50100   rsp = 0x0000006555bfc450
13:26:54     INFO -      rip = 0x00007ff86965328e
13:26:54     INFO -      Found by: call frame info
13:26:54     INFO -  18  xul.dll!nsAppShellService::JustCreateTopWindow(nsIXULWindow *,nsIURI *,unsigned int,int,int,bool,nsITabParent *,mozIDOMWindowProxy *,nsWebShellWindow * *) [nsAppShellService.cpp:091cdd176a11741a6d1c5d0d90f5d72b4c146472 : 736 + 0x57]
13:26:54     INFO -      rbp = 0x00007ff88ba50100   rsp = 0x0000006555bfc630
13:26:54     INFO -      rip = 0x00007ff869652cff
13:26:54     INFO -      Found by: call frame info
13:26:54     INFO -  19  xul.dll!nsAppShellService::CreateTopLevelWindow(nsIXULWindow *,nsIURI *,unsigned int,int,int,nsITabParent *,mozIDOMWindowProxy *,nsIXULWindow * *) [nsAppShellService.cpp:091cdd176a11741a6d1c5d0d90f5d72b4c146472 : 200 + 0x56]
13:26:54     INFO -      rbp = 0x00007ff88ba50100   rsp = 0x0000006555bfc7a0
13:26:54     INFO -      rip = 0x00007ff86953cb1c
13:26:54     INFO -      Found by: call frame info
13:26:54     INFO -  20  xul.dll!nsAppStartup::CreateChromeWindow2(nsIWebBrowserChrome *,unsigned int,nsITabParent *,mozIDOMWindowProxy *,unsigned __int64,bool *,nsIWebBrowserChrome * *) [nsAppStartup.cpp:091cdd176a11741a6d1c5d0d90f5d72b4c146472 : 680 + 0x3f]
13:26:54     INFO -      rbp = 0x00007ff88ba50100   rsp = 0x0000006555bfc820
13:26:54     INFO -      rip = 0x00007ff86953ca03
13:26:54     INFO -      Found by: call frame info
13:26:54     INFO -  21  xul.dll!nsWindowWatcher::CreateChromeWindow(nsTSubstring<char> const &,nsIWebBrowserChrome *,unsigned int,nsITabParent *,mozIDOMWindowProxy *,unsigned __int64,nsIWebBrowserChrome * *) [nsWindowWatcher.cpp:091cdd176a11741a6d1c5d0d90f5d72b4c146472 : 469 + 0x3d]
13:26:54     INFO -      rbp = 0x00007ff88ba50100   rsp = 0x0000006555bfc8a0
13:26:54     INFO -      rip = 0x00007ff86953c8c4
13:26:54     INFO -      Found by: call frame info
13:26:54     INFO -  22  xul.dll!nsWindowWatcher::OpenWindowInternal(mozIDOMWindowProxy *,char const *,char const *,char const *,bool,bool,bool,nsIArray *,bool,bool,nsIDocShellLoadInfo *,mozIDOMWindowProxy * *) [nsWindowWatcher.cpp:091cdd176a11741a6d1c5d0d90f5d72b4c146472 : 939 + 0x44]
13:26:54     INFO -      rbp = 0x00007ff88ba50100   rsp = 0x0000006555bfc910
13:26:54     INFO -      rip = 0x00007ff86917bc63
13:26:54     INFO -      Found by: call frame info
13:26:54     INFO -  23  xul.dll!nsWindowWatcher::OpenWindow(mozIDOMWindowProxy *,char const *,char const *,char const *,nsISupports *,mozIDOMWindowProxy * *) [nsWindowWatcher.cpp:091cdd176a11741a6d1c5d0d90f5d72b4c146472 : 328 + 0x4e]
13:26:54     INFO -      rbp = 0x00007ff88ba50100   rsp = 0x0000006555bfcc50
13:26:54     INFO -      rip = 0x00007ff86968afef
13:26:54     INFO -      Found by: call frame info
:jimm - It would be great if you could find someone to look into this. These startup hangs affect a wide range of Windows 10 test jobs and have been on-going for many weeks; this stack looks like the most promising way of addressing it.
Blocks: 1414495
Flags: needinfo?(jmathies)
https://treeherder.mozilla.org/#/jobs?repo=try&revision=2aa248746ceed45ac2cf1ecc4e9caa92706c2ac7 reproduces the hang with additional logging:

https://hg.mozilla.org/try/rev/4ff65cda51295d3e7503eebd98a8be650e0385ac#l1.12

https://treeherder.mozilla.org/logviewer.html#?job_id=178817719&repo=try&lineNumber=6392-6395

18:33:17     INFO -  GECKO(6308) | ZZZ calling WTSRegisterSessionNotification 00000000001D00EA
18:36:17     INFO -  runtests.py | Waiting for browser...
18:39:27     INFO -  Buffered messages finished
18:39:27  WARNING -  TEST-UNEXPECTED-TIMEOUT | automation.py | application timed out after 370 seconds with no output

suggesting a hang in WTSRegisterSessionNotification().


Same in:

https://treeherder.mozilla.org/logviewer.html#?job_id=178817592&repo=try&lineNumber=14110-14113
Interesting here is also the comment above the line of code which seems to hang:

>   // Starting with Windows XP, a process always runs within a terminal services
>   // session. In order to play nicely with RDP, fast user switching, and the
>   // lock screen, we should be handling WM_WTSSESSION_CHANGE. We must register
>   // our HWND in order to receive this message.
>   DebugOnly<BOOL> wtsRegistered = ::WTSRegisterSessionNotification(mWnd,
>                                                        NOTIFY_FOR_THIS_SESSION);

Given that this is only a Windows 10 issue, I wonder what's different compared to our Windows 7 AMIs. Rob, any idea?
Flags: needinfo?(rthijssen)
(In reply to Henrik Skupin (:whimboo) from comment #3)
> Given that this is only a Windows 10 issue, I wonder what's different
> compared to our Windows 7 AMIs. Rob, any idea?

well, for starters the operating system is quite different.
https://stackoverflow.com/questions/40854000/wtsregistersessionnotification-causing-process-to-hang
https://stackoverflow.com/questions/39904632/why-is-exitprocess-necessary-under-win32-when-you-can-use-a-ret
Flags: needinfo?(rthijssen)
> https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-
> central&job_id=178273429&lineNumber=7613
> https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-
> inbound&job_id=178251739&lineNumber=5336
> https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-
> inbound&job_id=178251664&lineNumber=5808
> https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-
> inbound&job_id=178251659&lineNumber=10585

All of these appear to be related to a delayed DLL load that seems to be taking a long time. Jank here is common since it involves disk access usually but we shouldn't hang indefinitely. Can you post a minidump for one of these hangs here we can inspect?
Flags: needinfo?(jmathies) → needinfo?(gbrown)
(My first thought is the graphic test window btw, but I have no way to confirm that yet.
All of the treeherder links have minidumps: Just look for "Artifact uploaded: xxx.dmp" in the top-left pane of the logviewer. For instance, https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-central&job_id=178273429&lineNumber=7613 has https://queue.taskcluster.net/v1/task/fAlRn6ksQa2RwJawk3EfuQ/runs/0/artifacts/public/test_info/bd47f9c4-a99c-4183-994f-61bc8fc7145b.dmp
Flags: needinfo?(gbrown)
(In reply to Jim Mathies [:jimm] from comment #5)
> All of these appear to be related to a delayed DLL load that seems to be
> taking a long time. Jank here is common since it involves disk access
> usually but we shouldn't hang indefinitely. Can you post a minidump for one
> of these hangs here we can inspect?

We are definitely waiting over 5 minutes without seeing any progress in the test logs...not just jank.

Did you notice that the frame just after nsIWidget::Create is often (always?) "0x1":

https://treeherder.mozilla.org/logviewer.html#?job_id=180534849&repo=mozilla-central&lineNumber=7630-7645

12:02:22     INFO -  14  wtsapi32.dll!PsspCaptureVaSpaceInformation + 0x35d
12:02:22     INFO -      rbx = 0x0000000000000001   rbp = 0x00007ff975810100
12:02:22     INFO -      rsp = 0x00000017af3fca90   r12 = 0x0000000000000003
12:02:22     INFO -      r13 = 0x0000000000000040   r14 = 0x0000000000000001
12:02:22     INFO -      r15 = 0x0000000000000000   rip = 0x00007ff9726a21d1
12:02:22     INFO -      Found by: call frame info
12:02:22     INFO -  15  0x1
12:02:22     INFO -      rbx = 0x0000000000000001   rbp = 0x00007ff975810100
12:02:22     INFO -      rsp = 0x00000017af3fcbf0   r12 = 0x0000000000000003
12:02:22     INFO -      r13 = 0x0000000000000040   r14 = 0x0000000000000001
12:02:22     INFO -      r15 = 0x0000000000000000   rip = 0x0000000000000001
12:02:22     INFO -      Found by: call frame info
12:02:22     INFO -  16  xul.dll!nsIWidget::Create(nsIWidget *,void *,mozilla::gfx::IntRectTyped<mozilla::DesktopPixel> const &,nsWidgetInitData *) [nsIWidget.h:a466172aed4bc2afc21169b749b8068a4b98c93f : 459 + 0x22]
12:02:22     INFO -      rbp = 0x00007ff975810100   rsp = 0x00000017af3fcc20
12:02:22     INFO -      rip = 0x00007ff953986fca
12:02:22     INFO -      Found by: stack scanning

Is that just a curiosity of the crash report, or are we actually calling into a bad address?
:jimm - See comments 7 through 10. Also, more generally, can you find some time for this bug? It is one of the most common Windows test failures.
Flags: needinfo?(jmathies)
(In reply to Geoff Brown [:gbrown] from comment #1)
> :jimm - It would be great if you could find someone to look into this. These
> startup hangs affect a wide range of Windows 10 test jobs and have been
> on-going for many weeks; this stack looks like the most promising way of
> addressing it.

Looking for someone with cycles, will assign when I do.
(In reply to Geoff Brown [:gbrown] from comment #8)
> And here's a new example, from today:
> 
> https://treeherder.mozilla.org/logviewer.html#?job_id=180534849&repo=mozilla-
> central&lineNumber=7642-7673
> 
> (very similar crash report)
> 
> Minidump:
> https://queue.taskcluster.net/v1/task/DB6POVKrRau_X_X8OWmm9w/runs/0/
> artifacts/public/test_info/c7728861-ab34-40ff-9cc6-8865d9a3b06c.dmp

Hey Adam, would you mind taking ten minutes and poking at this see if you can find anything obvious going on here.
Flags: needinfo?(jmathies) → needinfo?(agashlin)
tldr: We don't prevent stack walking during DLL load/unload unless the profiler is enabled, here we are walking due to Background Hang Monitor and getting deadlocked.

---

The looks like the same symptoms as bug 1349444.

From https://treeherder.mozilla.org/logviewer.html#?job_id=183012047&repo=mozilla-inbound&lineNumber=12237 in Comment 9:

Thread 3020 (main):
ntdll.dll!NtProtectVirtualMemory()
ntdll.dll!LdrpChangeMrdataProtection()
ntdll.dll!LdrProtectMrdata()
ntdll.dll!RtlInsertInvertedFunctionTable()
ntdll.dll!LdrpProcessMappedModule()
ntdll.dll!LdrpMapDllWithSectionHandle()
ntdll.dll!LdrpMapDllNtFileName()
ntdll.dll!LdrpMapDllSearchPath()
ntdll.dll!LdrpProcessWork()
ntdll.dll!LdrpLoadDllInternal()
ntdll.dll!LdrpLoadForwardedDll()
ntdll.dll!LdrpGetDelayloadExportDll()
ntdll.dll!LdrpHandleProtectedDelayload()
ntdll.dll!LdrResolveDelayLoadedAPI()
wtsapi32.dll!__delayLoadHelper2()
wtsapi32.dll!__tailMerge_WINSTA_dll()
wtsapi32.dll!WTSRegisterSessionNotification()
xul.dll!nsWindow::Create(nsIWidget * aParent, void * aNativeParent, const mozilla::gfx::IntRectTyped<mozilla::LayoutDevicePixel> & aRect, nsWidgetInitData * aInitData) Line 928

RtlInsertInvertedFunctionTable does RtlAcquireSRWLockExclusive on LdrpInvertedFunctionTableSRWLock before the call to LdrProtectMrdata and doesn't release it until later.

Thread 6612:
ntdll.dll!NtWaitForAlertByThreadId()
ntdll.dll!RtlpxLookupFunctionTable()
ntdll.dll!RtlLookupFunctionEntry()
mozglue.dll!WalkStackMain64(WalkStackData * aData) Line 317
mozglue.dll!MozStackWalkThread(void(*)(unsigned int, void *, void *, void *) aCallback, unsigned int aSkipFrames, unsigned int aMaxFrames, void * aClosure, void * aThread, _CONTEXT * aContext) Line 443
xul.dll!DoMozStackWalkBacktrace(const mozilla::BaseAutoLock<PSMutex> & aLock, const RegisteredThread & aRegisteredThread, const Registers & aRegs, NativeStack & aNativeStack) Line 1124
[Inline Frame] xul.dll!profiler_suspend_and_sample_thread::__l9::<lambda_9998a5e560acb364e76945bccdc5e530>::operator()(const Registers &) Line 3674
[Inline Frame] xul.dll!Sampler::SuspendAndSampleAndResumeThread(const mozilla::BaseAutoLock<PSMutex> &) Line 177
xul.dll!profiler_suspend_and_sample_thread(int aThreadId, unsigned int aFeatures, ProfilerStackCollector & aCollector, bool aSampleNative) Line 3661
xul.dll!mozilla::ThreadStackHelper::GetStack(mozilla::HangStack & aStack, nsTSubstring<char> & aRunnableName, bool aStackWalk) Line 148
xul.dll!mozilla::BackgroundHangManager::RunMonitorThread() Line 395
xul.dll!mozilla::BackgroundHangManager::MonitorThread(void * aData) Line 80


profiler_suspend_and_sample_thread has aThreadId = 3020, so it has suspended the main thread. The NtWaitForAlertByThreadId is done on LdrpInvertedFunctionTableSRWLock.

So it seems like we're detecting a hang (maybe due to the DLL load taking a while?) and trying to walk the stack, but that's turning into a permanent hang as we were in the middle of DLL loading.

In bug 1349444 we added a hook to prevent stack walking during LdrUnloadDll, in this case we're instead in LdrResolveDelayLoadedAPI, but that should already have a hook in place since bug 1355559.

But it looks like since bug 1378442 we only install the hooks if the profiler is enabled, and in this run ActivePS::sInstance = NULL. So Background Hang Monitor is running without a safety net.

It sounds like we just need to do InitializeWin64ProfilerHooks in BHR init, but I'm reluctant to do that since we explicitly disabled it in bug 1378442.
Flags: needinfo?(agashlin)
dmajor, what do you think about enabling the profiler hooks for background hang reporter?
Flags: needinfo?(dmajor)
I've paged out all the profiler code I worked on. Could you give me some context/backstory around the question?
Flags: needinfo?(dmajor)
Ok I think I managed to page some of this back in... having the hooks installed for typical users in the wild had a lot of bad interactions with AVs and other software that competes to hook those same functions. I moved the hooks to the profiler so that they'd basically only run on the Quantum Flow team's dev machines, which tend to be pretty free of such external software. If you are wanting to re-enable these hooks in the wild, the results are probably not going to be pretty.
Wrote this up before seeing comment 17:

It's dangerous to walk the stack of a thread while it is loading or unloading DLLs, as it could be suspended while holding a lock exclusively on some internal loader structures; when the walking thread does RtlLookupFunctionEntry it tries to get the same lock and we deadlock. (It looks like a related issue with the JIT was first diagnosed in bug 1263595 comment 12.)

The workaround from bug 1349444 (LdrUnloadDll), bug 1354611 (LdrLoadDll), and bug 1355559 (LdrResolveDelayLoadedAPI) is to not allow stack walking while running the Ldr* APIs, this is accomplished by wrapping the calls in a AutoSuppressStackWalking guard [1].

The wrapping is done by installing hooks on the APIs (or adding to the one already in place for LdrLoadDll). Originally this was added to EnsureWalkThreadReady for LdrUnloadDll and LdrResolveDelayLoadedAPI, but it was moved to DllBlocklist_Initialize in bug 1372375 (to avoid taking locks during the profiler's critical section), and then to locked_profiler_start in bug 1378442 (to mitigate shutdown regressions in non-profiling sessions caused by the previous move).

Moving to locked_profiler_start means that the LdrUnloadDll and LdrResolveDelayLoadedAPI hooks are now only installed when profiling is enabled, but Background Hang Monitor (not "Reporter", oops) also does stack walking, which seems to be the cause of the current bug. It seems reasonable to run InitializeWin64ProfilerHooks when setting up Background Hang Monitor. But I think BHM is used on all Nightly builds, and I'm wary of reviving the yet-unsolved shutdown regressions mentioned in bug 1378442.

In bug 1359507 the mechanism for disabling stack walking was simplified, I was hoping that might have solved the shutdown issues, but it looks like you landed that (03 May 2017) [2] before moving the hook installation to mitigate the shutdown issues (05 July) [3], so I assume that whatever was happening on shutdown is still happening.

[1] https://searchfox.org/mozilla-central/rev/d0a41d2e7770fc00df7844d5f840067cc35ba26f/tools/profiler/core/platform-win32.cpp#303-354
[2] https://hg.mozilla.org/mozilla-central/rev/ab0c15933c74
[3] https://hg.mozilla.org/mozilla-central/rev/a199c96110ca
(In reply to David Major [:dmajor] from comment #17)
> Ok I think I managed to page some of this back in... having the hooks
> installed for typical users in the wild had a lot of bad interactions with
> AVs and other software that competes to hook those same functions. I moved
> the hooks to the profiler so that they'd basically only run on the Quantum
> Flow team's dev machines, which tend to be pretty free of such external
> software. If you are wanting to re-enable these hooks in the wild, the
> results are probably not going to be pretty.

Ah, ok, interfering with others hooking the same functions sounds like a recipe for disaster, it's surprising we get away with it for LdrLoadDll. I wonder if there's any way around this that isn't going to neuter BHM.
> I wonder if there's any way around this that isn't going to neuter BHM.

This is probably a horrible idea, but if we can be sure that LdrLoadDll/LdrUnloadDll/LdrResolveDelayLoadedAPI always hold the loader lock before they take the function-table lock, then we could test whether the loader lock is held by calling LdrLockLoaderLock(2, ...) and avoid stack walking if so. This is an undocumented API though so it would be inviting trouble on the next Windows release.
As discussed in our PlatInt meeting:

What if we wrote our own delayLoadHelper(), whose replacement is supported by MSVC?

I guess another question is, what are the implications for clang-cl?
Flags: needinfo?(dmajor)
AIUI it's just normal symbol resolution, so as long as your lib containing __delayLoadHelper2 appears before delayimp.lib, it should be fine.

Do we know that this issue is limited to delay-loading, i.e. it doesn't happen during "vanilla" LdrLoadDll/LdrUnloadDll?
Flags: needinfo?(dmajor)
We already suppress profiler stackwalking in those cases AIUI.
LdrLoadDll, yes. LdrUnloadDll is only hooked when profiling, like LdrResolveDelayLoadedAPI: https://searchfox.org/mozilla-central/rev/d0a41d2e7770fc00df7844d5f840067cc35ba26f/tools/profiler/core/platform-win32.cpp#346

For the record, "cross our fingers and hope this doesn't happen during Unload" is a valid approach IMO. If we've made it this long without problems...
Do we have any update for this bug?
Maybe Adam can comment?

(Intermittent test failures continue to be reported in bug 1414495. Huge thanks to everyone who has commented so far!)
Flags: needinfo?(agashlin)
No immediate plans from me (or platform integration in general) to work on this. As I understand we are only doing Background Hang Monitor on Nightly [1], though that includes automated testing.

Fix will probably be aklotz's suggestion in comment 21, but it isn't known how much work that is going to involve yet.

[1] https://searchfox.org/mozilla-central/rev/6ef785903fee6c0b16a1eab79d722373d940fd78/toolkit/components/backgroundhangmonitor/moz.build#10
Flags: needinfo?(agashlin)
Priority: -- → P2
Please note that bug 1414495 shows a very small number of intermittent failures related to Windows all the last weeks. So has something been changed on mozilla-central, which might have caused this?
Actually this failure is still happening more or less frequently.
Priority: P2 → P3

Hey Geoff,
Can you still reproduce this issue or should we close it?

Flags: needinfo?(whole.grains)

Windows mochitest startup hangs continue (after all this time!), as seen in bug 1414495. I see stacks very similar to those discussed in comment 14+. This bug is still relevant.

Flags: needinfo?(whole.grains)
QA Whiteboard: [qa-not-actionable]

In the process of migrating remaining bugs to the new severity system, the severity for this bug cannot be automatically determined. Please retriage this bug using the new severity system.

Severity: critical → --
Severity: -- → S3
You need to log in before you can comment on or make changes to this bug.