Last Comment Bug 771251 - OOP crash reporting access the directory service off the main thread
: OOP crash reporting access the directory service off the main thread
Status: RESOLVED FIXED
:
Product: Core
Classification: Components
Component: IPC (show other bugs)
: unspecified
: x86_64 Windows 7
: -- normal (vote)
: ---
Assigned To: Benjamin Smedberg [:bsmedberg]
:
: [PTO to Dec5] Bill McCloskey (:billm)
Mentors:
Depends on: 773665 773830 779849
Blocks: 769048
  Show dependency treegraph
 
Reported: 2012-07-05 11:42 PDT by Benjamin Smedberg [:bsmedberg]
Modified: 2012-09-17 06:28 PDT (History)
7 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---
fixed


Attachments
Move the crash callback guts to the main thread synchronously, rev. 1 (7.45 KB, patch)
2012-07-05 11:46 PDT, Benjamin Smedberg [:bsmedberg]
ehsan: review+
Details | Diff | Splinter Review
Give child crashes a sequence number, save the pending directory for threadsafey, rev. 2 (34.07 KB, patch)
2012-07-10 11:58 PDT, Benjamin Smedberg [:bsmedberg]
ted: review+
Details | Diff | Splinter Review
Give child crashes a sequence number, save the pending directory for threadsafey, rev. 2.1 (35.61 KB, patch)
2012-07-11 08:50 PDT, Benjamin Smedberg [:bsmedberg]
ted: review+
lukasblakk+bugs: approval‑mozilla‑beta+
Details | Diff | Splinter Review
v2 -> v2.1 interdiff (3.64 KB, text/plain)
2012-07-11 08:57 PDT, Benjamin Smedberg [:bsmedberg]
no flags Details
Followup: suppress extra dump deletion using an envvar, rev. 1 (747 bytes, patch)
2012-07-13 11:05 PDT, Benjamin Smedberg [:bsmedberg]
no flags Details | Diff | Splinter Review

Description Benjamin Smedberg [:bsmedberg] 2012-07-05 11:42:14 PDT
The OOP crash reporting implementation, and especially the method OnChildProcessDumpRequested happens on a helper thread but ends up using the directory service.

There is also a race condition involved with the new Flash crash injector: because the crashed process can disappear before the crash callback, it is possible that the IPC mechanism will notice that the process has died before the callback occurs; this means that the crash mechanism will not have a notification of the crash minidump. This patch fixes that issue by synchronously informing the main thread of the minidump before allowing the crashing process to continue/crash.
Comment 1 Benjamin Smedberg [:bsmedberg] 2012-07-05 11:46:40 PDT
Created attachment 639418 [details] [diff] [review]
Move the crash callback guts to the main thread synchronously, rev. 1
Comment 2 :Ehsan Akhgari 2012-07-05 13:06:47 PDT
Comment on attachment 639418 [details] [diff] [review]
Move the crash callback guts to the main thread synchronously, rev. 1

Review of attachment 639418 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/crashreporter/nsExceptionHandler.cpp
@@ +1876,5 @@
> +};
> +
> +NS_IMETHODIMP
> +HandleRemoteCrash::Run()
> +{

Please assert that this always runs on the main thread.
Comment 3 Benjamin Smedberg [:bsmedberg] 2012-07-06 14:43:20 PDT
Comment on attachment 639418 [details] [diff] [review]
Move the crash callback guts to the main thread synchronously, rev. 1

This patch causes deadlocks when a crash occurs in plugin-container during an RPC call. I think I have a smaller solution to this, but I also think it can wait until ted gets back, so I'll ping him about it then.
Comment 4 Benjamin Smedberg [:bsmedberg] 2012-07-10 11:58:43 PDT
Created attachment 640706 [details] [diff] [review]
Give child crashes a sequence number, save the pending directory for threadsafey, rev. 2
Comment 5 Ted Mielczarek [:ted.mielczarek] 2012-07-10 18:48:53 PDT
Comment on attachment 640706 [details] [diff] [review]
Give child crashes a sequence number, save the pending directory for threadsafey, rev. 2

Review of attachment 640706 [details] [diff] [review]:
-----------------------------------------------------------------

::: dom/plugins/ipc/PluginModuleParent.cpp
@@ +304,5 @@
> +        }
> +        else {
> +            RemoveMinidump(childDumpFile);
> +        }
> +    }

This feels like it ought to be a loop or a function call or something, but it's not really critical.

::: toolkit/crashreporter/nsExceptionHandler.cpp
@@ +232,5 @@
> +#endif
> +  { }
> +
> +  nsCOMPtr<nsIFile> minidump;
> +  PRUint32 sequence; // sequence of crashing processes

Could you make this comment a little more descriptive?
Comment 6 Benjamin Smedberg [:bsmedberg] 2012-07-10 19:21:34 PDT
https://hg.mozilla.org/mozilla-central/rev/95c9bc0e63f7

Callek graciously agreed to watch the tree and back me out if things go awry.
Comment 7 Justin Wood (:Callek) 2012-07-10 19:55:01 PDT
And things went awry... https://hg.mozilla.org/mozilla-central/rev/d75c06f195d2

Error from linux64 (though B2g, as well as android and osx10.7 builds are all red too):

../../../toolkit/crashreporter/nsExceptionHandler.cpp: In function 'void CrashReporter::OOPInit()':
../../../toolkit/crashreporter/nsExceptionHandler.cpp:2077:39: error: cannot convert 'PRUnichar*' to 'CrashReporter::XP_CHAR*' in assignment
../../../toolkit/crashreporter/nsExceptionHandler.cpp: At global scope:
../../../toolkit/crashreporter/nsExceptionHandler.cpp:1986:13: warning: 'bool CrashReporter::ChildFilter(void*)' defined but not used
make[6]: *** [nsExceptionHandler.o] Error 1
make[6]: Leaving directory `/builds/slave/m-cen-lnx64/build/obj-firefox/toolkit/crashreporter'
make[5]: *** [crashreporter_libs] Error 2
Comment 8 Phil Ringnalda (:philor) 2012-07-10 22:37:13 PDT
Not just the build bustage, though: on Windows, where you did build, you had both a mochitest-chrome failure and 32 xpcshell failures.
Comment 9 Benjamin Smedberg [:bsmedberg] 2012-07-11 08:50:23 PDT
Created attachment 641069 [details] [diff] [review]
Give child crashes a sequence number, save the pending directory for threadsafey, rev. 2.1

Stupid errors and the xpcshell testing framework relies on crash reporting working even when there is no user directory, so I've changed some asserts into warnings. I'll have an interdiff up shortly for ease of review.
Comment 10 Benjamin Smedberg [:bsmedberg] 2012-07-11 08:57:26 PDT
Created attachment 641071 [details]
v2 -> v2.1 interdiff
Comment 11 Benjamin Smedberg [:bsmedberg] 2012-07-13 05:18:13 PDT
https://hg.mozilla.org/mozilla-central/rev/6a640ca09064 for this morning's nightly.
Comment 12 Benjamin Smedberg [:bsmedberg] 2012-07-13 06:56:17 PDT
Comment on attachment 641069 [details] [diff] [review]
Give child crashes a sequence number, save the pending directory for threadsafey, rev. 2.1

Bug caused by (feature/regressing bug #): insufficient implementation of Flash process crash reporting
User impact if declined: Flash process crash reporting will list the wrong crash process in some cases (40-50% of the time)
Testing completed (on m-c, etc.): m-c landed, verification on self-builds of the correct behavior
Risk to taking this patch (and alternatives if risky): This patch is larger than I wanted but I believe it is fairly safe. In the common case of non-Flash crash reporting the risk is only that there is a code error which prevents reports from being submitted in general.
String or UUID changes made by this patch: There is one API change to XRE_TakeMinidumpForChild, but that would only be used by embedders and even that is very unlikely.
Comment 13 Benjamin Smedberg [:bsmedberg] 2012-07-13 07:50:31 PDT
bc, this introduces a behavior change which may affect your automation: previously when a Flash process crashed, you would usually get either two or three .dmp files in the pending directory, one for the Flash process and one for the plugin-container. Now we will only keep the minidump for the process which crashed first, and delete the others. Let me know if you'd like an environment variable/pref setting which will keep all three minidumps, and I can arrange that.
Comment 14 Bob Clary [:bc:] 2012-07-13 08:03:44 PDT
bsmedberg: I think I can live with the change and in fact I believe I will like the new behavior more than the old. Thanks for the heads up though.
Comment 15 Benjamin Smedberg [:bsmedberg] 2012-07-13 09:53:24 PDT
Comment on attachment 641069 [details] [diff] [review]
Give child crashes a sequence number, save the pending directory for threadsafey, rev. 2.1

Clearing aurora nom until the crash is figured out. Thanks scoobidiver.
Comment 16 Benjamin Smedberg [:bsmedberg] 2012-07-13 11:05:27 PDT
Created attachment 641963 [details] [diff] [review]
Followup: suppress extra dump deletion using an envvar, rev. 1

bc, here is a patch to use an envvar to disable dump deletion per our IRC conversation, please check to see if it works.
Comment 17 Benjamin Smedberg [:bsmedberg] 2012-07-13 20:27:30 PDT
Comment on attachment 641069 [details] [diff] [review]
Give child crashes a sequence number, save the pending directory for threadsafey, rev. 2.1

re-requesting aurora approval along with followup: bug 773665 and bug 773830
Comment 18 Bob Clary [:bc:] 2012-07-16 08:04:29 PDT
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #16)

> bc, here is a patch to use an envvar to disable dump deletion per our IRC
> conversation, please check to see if it works.

I spent some time manually trying to get crashes with multiple dumps but none of the urls I tested would create multiple dumps. I ran a set of daily urls with and without the patch hoping to get an overall comparison of the difference but the variation in the crash reproducibility hid whatever differences the crash workers experienced. If you want to wait for me to find a reproducible example where the patch has an affect that is fine with me. Otherwise it seems straightforward to land.
Comment 19 Benjamin Smedberg [:bsmedberg] 2012-07-16 13:34:38 PDT
Comment on attachment 641069 [details] [diff] [review]
Give child crashes a sequence number, save the pending directory for threadsafey, rev. 2.1

Merge day happened, transferring nom.
Comment 20 Lukas Blakk [:lsblakk] use ?needinfo 2012-07-16 15:12:54 PDT
Comment on attachment 641069 [details] [diff] [review]
Give child crashes a sequence number, save the pending directory for threadsafey, rev. 2.1

approving, please land before tomorrow morning so we can get this in the first beta which will go to build then.
Comment 21 Lukas Blakk [:lsblakk] use ?needinfo 2012-07-16 15:24:46 PDT
(In reply to Lukas Blakk [:lsblakk] from comment #20)
> Comment on attachment 641069 [details] [diff] [review]
> Give child crashes a sequence number, save the pending directory for
> threadsafey, rev. 2.1
> 
> approving, please land before tomorrow morning so we can get this in the
> first beta which will go to build then.

I'm sure you already know this, but we just want to make sure that if this lands on beta that bug 773665 lands too.
Comment 22 Benjamin Smedberg [:bsmedberg] 2012-07-17 08:19:53 PDT
Per IRC, I had to do a little backporting of this patch because of nsIFile/nsILocalFile changes and I'm not comfortable pushing it without another tryserver run. So we're going to do 15.0b1 without this change and pick it up for 15.0b2.
Comment 23 Benjamin Smedberg [:bsmedberg] 2012-07-18 06:32:38 PDT
https://hg.mozilla.org/releases/mozilla-beta/rev/0cdc0b40d6d9

Note You need to log in before you can comment on or make changes to this bug.