Closed Bug 1360308 Opened 3 years ago Closed 3 years ago

ExceptionHandler::WriteMinidumpForChild() can hang for a long time when generating a minidump

Categories

(Toolkit :: Crash Reporting, defect)

defect
Not set

Tracking

()

RESOLVED FIXED
mozilla56
Tracking Status
firefox56 --- fixed

People

(Reporter: ehsan, Assigned: cyu)

References

Details

(Whiteboard: [qf:p1][bhr])

Attachments

(4 files, 5 obsolete files)

4.51 KB, patch
gsvelto
: review+
Details | Diff | Splinter Review
10.18 KB, patch
cyu
: review+
Details | Diff | Splinter Review
40.35 KB, patch
cyu
: review+
Details | Diff | Splinter Review
11.98 KB, patch
cyu
: review+
Details | Diff | Splinter Review
See these hang reports, for example:

https://people-mozilla.org/~mlayzell/bhr/20170405/1172.html
https://people-mozilla.org/~mlayzell/bhr/20170405/1287.html
https://people-mozilla.org/~mlayzell/bhr/20170405/1595.html

The main thread has been paused for more than 8 *seconds* here.

(This same code seems to be used for plugins as well: <https://searchfox.org/mozilla-central/rev/ce5ccb6a8ca803271c946ccb8b43b7e7d8b64e7a/dom/plugins/ipc/PluginModuleParent.cpp#1218>)

Can this be reworked so that the minidump generation happens on a background thread?
And here are some example hangs for the plugin case:

https://people-mozilla.org/~mlayzell/bhr/20170405/15.html#3
https://people-mozilla.org/~mlayzell/bhr/20170405/15.html#87
https://people-mozilla.org/~mlayzell/bhr/20170405/15.html#88
https://people-mozilla.org/~mlayzell/bhr/20170405/15.html#95
https://people-mozilla.org/~mlayzell/bhr/20170405/15.html#139
Summary: ContentParent::KillHard() can hang for a long time when generating a minidump → ExceptionHandler::WriteMinidumpForChild() can hang for a long time when generating a minidump
Whiteboard: [qf:p1] → [qf:p1][bhr]
Flags: needinfo?(cyu)
Hang resulting from crash generation for > 8 sec. sounds pretty bad. I'll take a look.
Assignee: nobody → cyu
Flags: needinfo?(cyu)
Cervantes, you might want to have a look at bug 1262852 for the changes we made last year to this code. Some decisions on where and when to grab the minidump were done in order to gather the best possible stacks; we did it differently before but it often resulted in meaningless stacks.
(In reply to Gabriele Svelto [:gsvelto] from comment #3)
> Cervantes, you might want to have a look at bug 1262852 for the changes we
> made last year to this code. Some decisions on where and when to grab the
> minidump were done in order to gather the best possible stacks; we did it
> differently before but it often resulted in meaningless stacks.

Gabriele, thanks for the background information of this bug.
It looks to me that writing minidump doesn't need to be on the main thread: for in-process crash generation, we write the minidump on a specific handler thread. Offloading the IO operation in generating the dump doesn't look like a problem, either. HangMonitorParent::SendHangNotification() is itself async. We can create an async version of mozilla::plugins::TakeFullMinidump() for plugins and mozilla::ipc::CrashReporterHost::GenerateMinidumpAndPair() for content process, probably returning a MozPromise to its consumer and put the remainder in the Then() part. These 2 cases doesn't seem to have a consumer that waits synchronously for the death of the process.

The problem is that we want to freeze the stack ASAP after the freeze is detected. If we just trivially put the call off-main-thread, we might regress in the precision in capturing the hang stack. I am not sure about how much this regression could be. It sounds not too bad if we create a dedicated thread for this, but when the system is under a very heavy load (which might be a source for the hang), the added delay could be much worse. Generating minidumps for capturing the hang stacks itself is not sync anyway, but we don't want go from bad to worse.

A feasible solution is that we suspend the thread in the sync part of taking the minidump, and just offload IO from the main thread. Windows supports nested SuspendThread() call and we don't need to hack the breakpad ExceptionHandler if we want to suspend the thread ASAP. On Linux and Mac we might be able to suspend the process or thread similarly (e.g. sending SIGSTOP).
(In reply to Cervantes Yu [:cyu] [:cervantes] from comment #4)
> Gabriele, thanks for the background information of this bug.
> It looks to me that writing minidump doesn't need to be on the main thread:
> for in-process crash generation, we write the minidump on a specific handler
> thread. Offloading the IO operation in generating the dump doesn't look like
> a problem, either. HangMonitorParent::SendHangNotification() is itself
> async. We can create an async version of
> mozilla::plugins::TakeFullMinidump() for plugins and
> mozilla::ipc::CrashReporterHost::GenerateMinidumpAndPair() for content
> process, probably returning a MozPromise to its consumer and put the
> remainder in the Then() part. These 2 cases doesn't seem to have a consumer
> that waits synchronously for the death of the process.

Yes, that path should be asynchronous so there's no need to wait synchronously.

> The problem is that we want to freeze the stack ASAP after the freeze is
> detected.

Yes, that's the biggest issue.

> If we just trivially put the call off-main-thread, we might
> regress in the precision in capturing the hang stack. I am not sure about
> how much this regression could be. It sounds not too bad if we create a
> dedicated thread for this, but when the system is under a very heavy load
> (which might be a source for the hang), the added delay could be much worse.

Or the hung thread could resume leaving with a non-sensical stack which is what we saw before sometimes.

> A feasible solution is that we suspend the thread in the sync part of taking
> the minidump, and just offload IO from the main thread. Windows supports
> nested SuspendThread() call and we don't need to hack the breakpad
> ExceptionHandler if we want to suspend the thread ASAP. On Linux and Mac we
> might be able to suspend the process or thread similarly (e.g. sending
> SIGSTOP).

That sounds like an excellent idea if you can pull it off!
When I was looking at thread-safety of MinidumpWriteDump(), I found that we could have races when multiple threads are generating minidumps concurrently:

1. In-process crash calls MinidumpWriteDump() on the exception handler thread.
2. OOP crash generation calls MinidumpWriteDump() using Windows-native thread pool.
3. Content or plugin hang calls MinidumpWriteDump() on the main thread.

From MSDN https://msdn.microsoft.com/en-us/library/windows/desktop/ms680360(v=vs.85).aspx :
"All DbgHelp functions, such as this one, are single threaded. Therefore, calls from more than one thread to this function will likely result in unexpected behavior or memory corruption."

This sounds like an issue we need to handle.
(In reply to Cervantes Yu [:cyu] [:cervantes] from comment #6)
> When I was looking at thread-safety of MinidumpWriteDump(), I found that we
> could have races when multiple threads are generating minidumps concurrently:
> 
> 1. In-process crash calls MinidumpWriteDump() on the exception handler
> thread.
> 2. OOP crash generation calls MinidumpWriteDump() using Windows-native
> thread pool.
> 3. Content or plugin hang calls MinidumpWriteDump() on the main thread.
> 
> From MSDN
> https://msdn.microsoft.com/en-us/library/windows/desktop/ms680360(v=vs.85).
> aspx :
> "All DbgHelp functions, such as this one, are single threaded. Therefore,
> calls from more than one thread to this function will likely result in
> unexpected behavior or memory corruption."
> 
> This sounds like an issue we need to handle.

It does, can you file a bug for that explicitly? The overall thread-safety of our crash handling paths hasn't really been tested much.
See Also: → 1368019
Attached patch WIP patch (obsolete) — Splinter Review
This is a WIP patch for making CrashReporterHost::GenerateMinidumpAndPair() async using MozPromise.
What is done in this WIP:
* Make CrashReporterHost::GenerateMinidumpAndPair() async
* ContentParent::KillHard() is updated to consume the promise.
* PluginModuleChromeParent::TakeFullMinidump() is updated to consume and promise. Itself also returns a promise to its user.
* PluginModuleChromeParent::TerminateChildProcess() is updated to consume the promise.

What is not done:
* Really move the MinidumpWriteDump() call off the main thread.
* Keeping the objects in the call stack alive (ContentParent and PluginModuleChromeParent).
* Blocking shutdown.
* May need to update ContentParent::KillHard() to make it return a promise to its callers. Otherwise, we may run some code out of order in its caller and up the calling chain.

Given the size of the current patch, I start to wonder if this is the right direction. Maybe using nested event loop will simplify this work.
(In reply to Cervantes Yu [:cyu] [:cervantes] from comment #8)
> Given the size of the current patch, I start to wonder if this is the right
> direction. Maybe using nested event loop will simplify this work.

This code is rather complicated so this isn't unexpected. If you can find of a way to simplify it, besides not making it happen on the main thread then it would be a significant improvement too :)
Attached patch wip_v2.patch (obsolete) — Splinter Review
I still try to use MozPromise except in PluginModuleChromeParent::ShouldContinueFromReplyTimeout(), which is called in sync IPC code. It makes the test case dom/plugins/test/mochitest/test_hang_submit.xul behave like without the patch, but I still feel it might reorder some events especially in the JS world. I need to take a deeper look at it. Worst case is that we provide a synchronous version for this special case to be safe.

Try push: https://treeherder.mozilla.org/#/jobs?repo=try&revision=48c8d9987a7bf04a4e0bcacc87da600f8b3bfe73
Attachment #8871692 - Attachment is obsolete: true
Support sync and async minidump generation in CrashReporterHost::GenerateMinidumpAndPair(). I gave up MozPromise and and use callbacks to support both sync and async operations.

Try push: https://treeherder.mozilla.org/#/jobs?repo=try&revision=b1fc8b84dbc03c992087270531ade7513a3016e9
Attachment #8875718 - Attachment is obsolete: true
Attachment #8878442 - Flags: review?(gsvelto)
Comment on attachment 8878442 [details] [diff] [review]
Part 1 - De-templatize CrashReporterHost::GenerateMinidumpAndPair().

Review of attachment 8878442 [details] [diff] [review]:
-----------------------------------------------------------------

LGTM, just one minor nit.

::: ipc/glue/CrashReporterHost.cpp
@@ +130,5 @@
> +    return false;
> +  }
> +
> +    return CrashReporter::GetIDFromMinidump(targetDump, mDumpID);
> +  }

nit: Indentation of the two lines above
Attachment #8878442 - Flags: review?(gsvelto) → review+
Comment on attachment 8878443 [details] [diff] [review]
Part 2 - Use callback in CrashReporterHost::GenerateMinidumpAndPair() and up the caller chain.

Review of attachment 8878443 [details] [diff] [review]:
-----------------------------------------------------------------

Also looking good but this is making my head spin... This code was already hard to follow as-is, now that it's asynchronous it's even harder, yuck!

::: dom/plugins/ipc/PluginHangUIParent.cpp
@@ +356,5 @@
>    if (aResponse & HANGUI_USER_RESPONSE_STOP) {
>      // User clicked Stop
> +    // Use a dummy callback to terminate the child process since we are doing
> +    // it synchronously.
> +    std::function<void(bool)> callback = [](bool aResult) { };

Since we do this in two places it would be worth declaring just one dummy callback (e.g. kDummyCallback) in one of the headers and use that.

::: dom/plugins/ipc/PluginModuleParent.cpp
@@ +392,5 @@
> +                                   aBrowserDumpId,
> +                                   Move(aCallback),
> +                                   aAsync);
> +  } else {
> +      aCallback(NS_LITERAL_STRING(""));

nit: Use EmptyString() instead of NS_LITERAL_STRING(""), also, this should be indented two spaces to the left. This file uses mixed indentation unfortunately but I'd like to see new code stick with the 2 spaces we use everywhere else. So if a function already uses 4 spaces then unless you're rewriting it entirely then stick to it; if not use 2 spaces please.

@@ +401,5 @@
>  mozilla::plugins::TerminatePlugin(uint32_t aPluginId,
>                                    base::ProcessId aContentProcessId,
>                                    const nsCString& aMonitorDescription,
> +                                  const nsAString& aDumpId,
> +                                  std::function<void(bool)>&& aCallback,

It seems to me that we're always invoking this callback with the parameter set to true; if I'm not mistaken then let's remove it entirely.

@@ +415,5 @@
> +                                        aDumpId,
> +                                        Move(aCallback),
> +                                        aAsync);
> +  } else {
> +      aCallback(true);

nit: Indentation

::: dom/plugins/ipc/PluginModuleParent.h
@@ +481,5 @@
>       *   TerminateChildProcess will use this dump file instead of generating a
>       *   multi-process crash report. If not provided a multi-process dump will
>       *   be taken at the time of this call.
> +     * @param aCallback the callback when the operation completes. The callback
> +     *   argument denotes whether the operation succeeds.

nit: a callback invoked when the operation completes. The argument denotes whether the operation succeeded.

@@ +494,5 @@
> +                          bool aAsync);
> +
> +  private:
> +#ifdef MOZ_CRASHREPORTER
> +    // The following methods are callbacks after calling TakeFullMinidump().

nit: The following methods are callbacks invoked after calling TakeFullMinidump()

@@ +505,5 @@
> +                                    base::ProcessId aContentPid,
> +                                    const nsAString& aBrowserDumpId);
> +
> +#endif
> +    // The following method is the callback after calling

nit: Likewise

::: ipc/glue/CrashReporterHost.h
@@ +45,5 @@
> +    void Init(std::function<void(T)>&& aCallback, bool aAsync)
> +    {
> +      mCallback = Move(aCallback);
> +      mAsync = aAsync;
> +      if (mAsync) {

nit: Use IsAsync() for clarity here

@@ +86,5 @@
> +      mCallback = nullptr;
> +      mTargetThread = nullptr;
> +      mAsync = false;
> +    }
> +  

nit: Trim trailing space

@@ +116,5 @@
>    // crash service.
>    bool FinalizeCrashReport();
>  
>    // Generate a paired minidump. This does not take the crash report, as
>    // GenerateCrashReport does. After this, FinalizeCrashReport may be called.

Please modify the comment to describe the the new parameters (aCallback & aAsync) and when/how the callback is called.
Attachment #8878443 - Flags: review?(gsvelto) → review+
Comment on attachment 8878445 [details] [diff] [review]
Part 3 - Support off-main-thread I/O in CrashReporter::CreateMinidumpsAndPair().

Review of attachment 8878445 [details] [diff] [review]:
-----------------------------------------------------------------

Functionally speaking this is OK, but the RUN_CALLBACK macro is really ugly. Can't you turn it into a function using std::function() for handling the callback?

::: ipc/glue/CrashReporterHost.cpp
@@ +132,5 @@
>    }
>  
> +  std::function<void(bool)> callback =
> +    [this](bool aResult) {
> +       if (aResult &&

nit: Indentation
Attachment #8878445 - Flags: review?(gsvelto) → review+
Comment on attachment 8878447 [details] [diff] [review]
Part 4 - Block shutdown in creating minidumps.

Review of attachment 8878447 [details] [diff] [review]:
-----------------------------------------------------------------

LGTM
Attachment #8878447 - Flags: review?(gsvelto) → review+
Attachment #8880322 - Attachment description: 1360308_p1.patch → Part 1 - De-templatize CrashReporterHost::GenerateMinidumpAndPair().
https://hg.mozilla.org/integration/mozilla-inbound/rev/fef8b4d5bd1d3abbbd5ed278fb51a49a59575793
Bug 1360308 - Part 1: De-templatize CrashReporterHost::GenerateMinidumpAndPair(). r=gsvelto

https://hg.mozilla.org/integration/mozilla-inbound/rev/2b28e4ee269cb5559f0f28a1421d284cb0c98658
Bug 1360308 - Part 2: Use callback in CrashReporterHost::GenerateMinidumpAndPair() and up the caller chain. r=gsvelto

https://hg.mozilla.org/integration/mozilla-inbound/rev/4f17147356744f39fcd1756c019540a90eda9192
Bug 1360308 - Part 3: Support off-main-thread I/O in CrashReporter::CreateMinidumpsAndPair(). r=gsvelto

https://hg.mozilla.org/integration/mozilla-inbound/rev/5f978d7caf2f81d26372462ecee97fa33cd52a4d
Bug 1360308 - Part 4: Block shutdown when generating minidumps in CrashReporterHost::GenerateMinidumpAndPair(). r=gsvelto
Depends on: 1387369
Depends on: 1390143
Depends on: 1394924
You need to log in before you can comment on or make changes to this bug.