Open Bug 1204682 Opened 9 years ago Updated 2 years ago

Add Wait Chain Traversal to hang reports

Categories

(Toolkit :: Crash Reporting, enhancement, P3)

enhancement

Tracking

()

People

(Reporter: bugzilla, Unassigned)

References

(Depends on 1 open bug)

Details

Attachments

(3 files, 2 obsolete files)

I think we should add wait chain traversal to the Windows crash reporter for hang reports (for Vista and newer). It should be able to give us additional information about hung threads, in particular who they are blocked on and why. Since WCT groks COM, GUI messaging, ALPC, and other things, I think it could give us some insight that we wouldn't otherwise have access to.

See also:
https://msdn.microsoft.com/en-us/library/windows/desktop/ms681622%28v=vs.85%29.aspx
Assignee: nobody → aklotz
Status: NEW → ASSIGNED
Depends on: 1234406
Attached patch First Cut (obsolete) — Splinter Review
Attached patch Part 1 (obsolete) — Splinter Review
Attachment #8700878 - Attachment is obsolete: true
Attachment #8700880 - Flags: review?(ehsan)
Attached patch Part 1Splinter Review
Bah, saw something I didn't like in moz.build.
Attachment #8700880 - Attachment is obsolete: true
Attachment #8700880 - Flags: review?(ehsan)
Attachment #8700882 - Flags: review?(ehsan)
Attached patch Part 2Splinter Review
Attachment #8701130 - Flags: review?(jmathies)
Attachment #8701131 - Flags: review?(vladan.bugzilla)
Comment on attachment 8701131 [details] [diff] [review]
Part 3 - Hang Monitor

Review of attachment 8701131 [details] [diff] [review]:
-----------------------------------------------------------------

::: xpcom/threads/HangMonitor.cpp
@@ +129,5 @@
> +  }
> +  if (GeckoProcessType_Default == XRE_GetProcessType()) {
> +    // Let's get the main thread Ids of all content processes
> +    nsTArray<mozilla::dom::ContentParent*> contentParents;
> +    mozilla::dom::ContentParent::GetAll(contentParents);

did you test this patch manually with multiple content processes?

@@ +132,5 @@
> +    nsTArray<mozilla::dom::ContentParent*> contentParents;
> +    mozilla::dom::ContentParent::GetAll(contentParents);
> +    for (uint32_t i = 0, len = contentParents.Length(); i < len; ++i) {
> +      int32_t pid = contentParents[i]->Pid();
> +      nsAutoHandle snapshot(::CreateToolhelp32Snapshot(TH32CS_SNAPTHREAD, pid));

couldn't the CreateToolhelp32Snapshot call fail if the process exits? e.g. if it's a thumbnail capture process with a short lifetime
Attachment #8701131 - Flags: review?(vladan.bugzilla) → review+
Comment on attachment 8701130 [details] [diff] [review]
Part 2

Review of attachment 8701130 [details] [diff] [review]:
-----------------------------------------------------------------

sorry for the wait.

::: dom/plugins/ipc/PluginModuleParent.cpp
@@ +1248,5 @@
> +    if (NS_SUCCEEDED(NS_GetMainThread(&mainThread)) && mainThread &&
> +        NS_SUCCEEDED(mainThread->GetPRThread(&mainPrThread)) && mainPrThread) {
> +        waitChainTids.AppendElement(PR_GetThreadID(mainPrThread));
> +    }
> +    waitChainTids.AppendElement(crashReporter->ChildMainThreadId());

nit - add a comment explaining what we're doing here please.

@@ +1314,5 @@
>              // include those minidumps as well.
>              if (CreatePluginMinidump(mFlashProcess1, 0, pluginDumpFile,
>                                       NS_LITERAL_CSTRING("flash1"))) {
>                  additionalDumps.AppendLiteral(",flash1");
> +                waitChainTids.AppendElement(GetFirstThreadId(mFlashProcess1));

- ditto

@@ +3133,5 @@
>  
> +#if defined(XP_WIN)
> +
> +DWORD
> +PluginModuleChromeParent::GetFirstThreadId(DWORD aProcessId)

nit - comment me
Attachment #8701130 - Flags: review?(jmathies) → review+
Comment on attachment 8700882 [details] [diff] [review]
Part 1

Review of attachment 8700882 [details] [diff] [review]:
-----------------------------------------------------------------

::: xpcom/tests/windows/TestWaitChainTraversal.cpp
@@ +46,5 @@
> +  {
> +    return !mMutex;
> +  }
> +
> +  CRITICAL_SECTION  mCS;

Nit: Please use AutoCriticalSection.

@@ +59,5 @@
> +{
> +  // This function figures out what the value of CRITICAL_SECTION::LockCount
> +  // should be when the CS is held but not contended. We do it this way because
> +  // the actual semantics of that field can change between Windows versions.
> +  CRITICAL_SECTION tempCS;

Ditto.

::: xpcom/threads/WaitChainTraversal.h
@@ +10,5 @@
> +#if !defined(XP_WIN)
> +#error This feature is only supported on Windows builds.
> +#endif
> +
> +#include <windows.h>

Including <windows.h> in a header is a Bad Idea.  See <https://dxr.mozilla.org/mozilla-central/search?q=%22%23undef+min%22&redirect=false&case=true> for example.  :-)

Please move the implementation to a .cpp file to avoid this header inclusion.

@@ +40,5 @@
> +  {
> +    if (!InitWCTFunctionPtrs()) {
> +      return;
> +    }
> +    mHandle = OpenFnPtr()(0, nullptr);

Hmm...  What prevents this from crashing if InitWCTFunctionPtrs() fails?

@@ +87,5 @@
> +  static HMODULE& GetModule()
> +  {
> +    static HMODULE sModule = NULL;
> +    return sModule;
> +  }

Out of curiosity, why not just use a normal static member.  This is a strange idiom, and I'm not sure if I understand why it's needed.

@@ +118,5 @@
> +    if (advApi32 && openFn && closeFn && getFn) {
> +      return true;
> +    }
> +    if (!advApi32) {
> +      advApi32 = ::LoadLibraryW(L"advapi32.dll");

Please use GetModuleHandleW, so that you don't have to FreeLibrary().

@@ +279,5 @@
> +}
> +
> +#if defined(MOZ_CRASHREPORTER) && defined(MOZILLA_INTERNAL_API)
> +inline void
> +TraverseWaitChain(const nsTArray<DWORD>& aTids)

I think this is a misleading name.  How about calling this AnnotateCrashReportWithWaitChain() or something like that?
Attachment #8700882 - Flags: review?(ehsan) → review+
Depends on: 1253727

This might interest you, but feel free to resolve as INVALID/WONTFIX if not...

Assignee: bugzilla → nobody
Status: ASSIGNED → NEW
Flags: needinfo?(gsvelto)

Oooh, this is very interesting! Definitely not closing it as WONTFIX. Hang reporting needs some love and I think this would be quite useful, especially to diagnose shutdown hangs. For those we rely on reading code and hunches to figure out if we're looking at a complex deadlock scenario.

Type: defect → enhancement
Flags: needinfo?(gsvelto)
Priority: -- → P3
Severity: normal → S3
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: