Communicate crash metadata through shmem

RESOLVED FIXED in Firefox 52

Status

()

defect
RESOLVED FIXED
3 years ago
2 years ago

People

(Reporter: billm, Assigned: dvander)

Tracking

(Blocks 1 bug)

unspecified
mozilla52
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox52 fixed)

Details

Attachments

(3 attachments, 2 obsolete attachments)

It seems like we now have a way for the child process to write out crash information to disk before the crash actually happens. That seems a lot safer to me than using IPC, like we do in PCrashReporter. Aaron, do you think it would be feasible to get rid of PCrashReporter and implement AnnotateCrashReport by storing the data in a hashtable that gets written out in PrepareChildExceptionTimeAnnotations?
Flags: needinfo?(aklotz)
Yeah, I can't think of any reason why we couldn't do that...
Flags: needinfo?(aklotz)
Probably the only thing to be wary of here is how this will interact with sandboxing.
I wrote that machinery to be sandboxing-aware.

In the long term I think it would be nice to push some changes upstream to allow us to stream annotations over the breakpad pipe (I have a WIP patch somewhere), but for the short term we made sure that we were writing this data to a safe path that would be permitted by the sandbox.
(and of course we validate that file in the parent as we read it, since that file should not be trusted)
The upstream client code is effectively EOL, Google is switching everything to crashpad: https://crashpad.chromium.org/index.html
Assignee: nobody → wmccloskey
Duplicate of this bug: 1282737
Posted patch patch (obsolete) — Splinter Review
This is kind of big but pretty straightforward I think:

- CrashReporterChild goes away. Crash annotations/notes end up in the same hashtable as we use for the parent process.
- This hashtable gets written out to the temp dir extra file when the child crashes.
- CrashReporterParent is replaced by CrashReporterHost since we still need some place to store per-process crash data.
- We still need to do a little IPC to send up the child process's main thread ID at startup (for hang reporting).
Attachment #8770306 - Flags: review?(aklotz)
Comment on attachment 8770306 [details] [diff] [review]
patch

Review of attachment 8770306 [details] [diff] [review]:
-----------------------------------------------------------------

LGTM, but please remote the printfs ;-)

Thanks for deduplicating the remote crashreporter init/teardown code!

::: toolkit/crashreporter/nsExceptionHandler.cpp
@@ +1295,5 @@
>    if (gMozCrashReason) {
>      WriteAnnotation(apiData, "MozCrashReason", gMozCrashReason);
>    }
> +
> +  printf("CHILD DUMP CALLBACK\n");

I don't think you want to leave in the printfs :-)

@@ +3542,5 @@
>    MOZ_ASSERT(!gExceptionHandler, "crash client already init'd");
>  
> +#if defined(XP_WIN) || defined(XP_MACOSX)
> +  // crash reporting is disabled
> +  if (crashPipe.Equals(kNullNotifyPipe))

Nit: Curlies

::: toolkit/xre/nsEmbedFunctions.cpp
@@ +263,5 @@
>  {
>  #if defined(XP_WIN) || defined(XP_MACOSX)
>    return CrashReporter::SetRemoteExceptionHandler(nsDependentCString(aPipe));
>  #elif defined(OS_LINUX)
> +  return CrashReporter::SetRemoteExceptionHandler(nsCString(""));

NS_LITERAL_CSTRING("") instead?
Attachment #8770306 - Flags: review?(aklotz) → review+
(In reply to Aaron Klotz [:aklotz] from comment #8)
> Comment on attachment 8770306 [details] [diff] [review]
> patch
> 
> Review of attachment 8770306 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> LGTM, but please remote the printfs ;-)

s/remote/remove

I don't think we need to remote printf for e10s ;-)
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #2)
> Probably the only thing to be wary of here is how this will interact with
> sandboxing.

Bug 1290633 is part of how it interacts with sandboxing at the moment, unfortunately — the “sandboxing-aware” in comment #3 isn't enough awareness for media plugin processes with no FS access at all (on Linux it's roughly the Chromium Web renderer sandbox minus bug 1151624).  So we'll need to do… something to media plugins to get the crash metadata back.
I didn't land this because of Jed's comment, and also because I realize that this change will cause us to lose metadata for hang reports.

I think the right way to do this is to write another patch on top that instead stores the child process annotations in shared memory, similar to what Kan-Ru did in bug 1282737. We would have to create the shared memory region in a process-type-specific way, sort of like how the SetMainThreadId thing happens now. Then we would serialize the crash annotation to the shmem here:
http://searchfox.org/mozilla-central/rev/8910ca900f826a9b714607fd23bfa1b37a191eca/toolkit/crashreporter/nsExceptionHandler.cpp#2189

Sizing the shmem could be tricky. I think it would be reasonable to reserve 16K for it, but maybe other people think that's too big? Most crash report metadata seems to be either about 1.5K or 13K. The difference is caused by whether TelemetryEnvironment is included. I'm not sure how that's determined. In any case, I see no reason why the child process should be responsible for generating that data since the parent should have it. So 1.5K is probably the typical case, and 16K should be more than big enough.
David sort of volunteered to write the shmem part since he needs it soon. David, please let me know if the plan is clear and sounds reasonable.
Flags: needinfo?(dvander)
That sounds reasonable - thanks for the explanation.
Assignee: wmccloskey → dvander
Status: NEW → ASSIGNED
Flags: needinfo?(dvander)
Summary: Replace PCrashReporter with machinery from bug 1236108 → Communicate crash metadata through shmem
This is all the new code that could replace PCrashReporter. I'll do actual integration in separate patches so it's easier to see what changed.

This patch has three new classes:

CrashReporterMetadataShmem holds an ipc::Shmem and provides methods for adding crash metadata. This is the hacky part. I couldn't find any serialization stuff in Gecko outside of Pickle, which doesn't easily support writing to a pre-allocated buffer. Instead I just hand-rolled out some stuff and hopefully it's not too bad - if it is, or there's an alternative, I can change it.

The encoding is very simple. It starts with a control byte (none, annotation, or appnote), then is followed by one or two strings based on the control. Strings are encoded with a length prefix. A "none" byte terminates the metadata.

CrashReporterClient is a singleton that lives in a subprocess. It holds the CrashReporterMetadataShmem and provides threadsafe methods for writing to it. This effectively replaces CrashReporterChild.

CrashReporterHost is what a top-level actor for a subprocess creates, effectively replacing CrashReporterParent. It simply holds the Shmem alive on the parent process side, and when the actor abnormally dies, parses the shmem back out to an AnnotationTable.
Attachment #8770306 - Attachment is obsolete: true
Attachment #8797389 - Flags: review?(wmccloskey)
This changes nsExceptionHandler such that subprocess annotations will go through CrashReporterClient if one exists.

Note that CrashReporterClient is threadsafe so we no longer have to post messages to the main thread.
Attachment #8797391 - Flags: review?(wmccloskey)
This adds GPU process support. GPU process has no UI integration for crashes yet, so this is pretty much just for testing and to see how part 1 integrates.

When I trigger a GPU process crash I do indeed see the metadata materialize on the parent side. If there's tests I can or should add around this let me know.
Attachment #8797392 - Flags: review?(wmccloskey)
Comment on attachment 8797389 [details] [diff] [review]
part 1, CrashReporterClient/Host

Review of attachment 8797389 [details] [diff] [review]:
-----------------------------------------------------------------

I think it would be nicer to put the crash reporter stuff in ipc/glue instead of dom/ipc.

I think the main problem here is with how the metadata is written out. I think it's not uncommon for us to write the same key many times. With your current scheme, we could quickly fill up the 16K that way.

The existing crash reporting code code writes out these annotations by storing them in a hashtable and then converting the entire hashtable to a string each time an annotation changes. It's not very efficient, but it does ensure that we don't store the same key twice. I would recommend doing something like that.

::: dom/ipc/CrashReporterHost.cpp
@@ +53,5 @@
> +  notes.Put(NS_LITERAL_CSTRING("StartupTime"), nsDependentCString(startTime));
> +
> +  CrashReporterMetadataShmem::ReadAppNotes(mShmem, &notes);
> +
> +  CrashReporter::AppendExtraData(dumpID, notes);

I think it would make sense to have something like what CrashReporterParent::FinalizeChildData does here. I'd like to make sure we get telemetry from these crashes.
Attachment #8797389 - Flags: review?(wmccloskey)
Comment on attachment 8797391 [details] [diff] [review]
part 2, nsExceptionHandler changes

Review of attachment 8797391 [details] [diff] [review]:
-----------------------------------------------------------------

r+ with the double escaping fixed.

::: toolkit/crashreporter/nsExceptionHandler.cpp
@@ +2203,5 @@
> +    }
> +
> +    // Otherwise, we have to handle this on the main thread since we will go
> +    // through IPDL.
> +    if (NS_IsMainThread()) {

if (!NS_IsMainThread()) {

Also, by moving this down, you're causing us to escape twice, which is incorrect.

@@ +2290,5 @@
> +    }
> +
> +    if (!NS_IsMainThread()) {
> +      // Child process needs to handle this in the main thread:
> +      nsCOMPtr<nsIRunnable> r = new CrashReporterHelperRunnable(data);

Same issue here about double escaping.
Attachment #8797391 - Flags: review?(wmccloskey) → review+
Attachment #8797392 - Flags: review?(wmccloskey) → review+
(In reply to Bill McCloskey (:billm) from comment #18)
> Comment on attachment 8797391 [details] [diff] [review]
> part 2, nsExceptionHandler changes
> 
> Review of attachment 8797391 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> r+ with the double escaping fixed.
> 
> ::: toolkit/crashreporter/nsExceptionHandler.cpp
> @@ +2203,5 @@
> > +    }
> > +
> > +    // Otherwise, we have to handle this on the main thread since we will go
> > +    // through IPDL.
> > +    if (NS_IsMainThread()) {
> 
> if (!NS_IsMainThread()) {
> 
> Also, by moving this down, you're causing us to escape twice, which is
> incorrect.
> 
> @@ +2290,5 @@
> > +    }
> > +
> > +    if (!NS_IsMainThread()) {
> > +      // Child process needs to handle this in the main thread:
> > +      nsCOMPtr<nsIRunnable> r = new CrashReporterHelperRunnable(data);
> 
> Same issue here about double escaping.

re: double-escaping. It uses the original "data" here, not "escapedData". Or am I missing something? Though it is weird that we escape it before sending it over IPDL rather than after.
> re: double-escaping. It uses the original "data" here, not "escapedData". Or am I missing
> something? Though it is weird that we escape it before sending it over IPDL rather than after.

Sorry, I missed that. You're right.
This version should address the problems in the first patch. I took your advice and just serialized the AnnotationTable on every change. The serialization code mostly stayed the same, but it could probably be simplified.

I also moved the guts of FinalizeChildData/NotifyCrashService into a helper function on CrashReporterHost, so both classes can call it easily.
Attachment #8797389 - Attachment is obsolete: true
Attachment #8797926 - Flags: review?(wmccloskey)
Comment on attachment 8797926 [details] [diff] [review]
part 1, CrashReporterClient/Host

Review of attachment 8797926 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks!

::: ipc/glue/CrashReporterMetadataShmem.cpp
@@ +54,5 @@
> +
> +  MOZ_MUST_USE bool WriteAnnotation(const nsCString& aKey, const nsCString& aValue) {
> +    // Save the current position so we can write the entry type if the entire
> +    // entry fits.
> +    uint8_t* start = mCursor++;

What if mCursor == mEnd at this point before the increment? Seems like Write could go off the end of the buffer. But I guess Commit maintains the invariant that mCursor != mEnd. Maybe assert that at the top here?

@@ +166,5 @@
> +    if (!src) {
> +      return false;
> +    }
> +
> +    aOut.AssignASCII((const char *)src, length);

I'd rather use Assign here since the chars were written out as UTF-8 (although I'm not sure there's really a difference). Is there a type issue in using Assign?
Attachment #8797926 - Flags: review?(wmccloskey) → review+

Comment 23

3 years ago
Pushed by danderson@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/d6435d2c9eff
Add a PCrashReporter alternative that uses shmem instead of IPDL. (bug 1278717 part 1, r=billm)
https://hg.mozilla.org/integration/mozilla-inbound/rev/cdd5e8c99e2f
Add nsExceptionHandler support for CrashReporterClient. (bug 1278717 part 2, r=billm)
https://hg.mozilla.org/integration/mozilla-inbound/rev/a65c408e1376
Use CrashReporterHost/Client in the GPU process. (bug 1278717 part 3, r=billm)

Comment 24

3 years ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/d6435d2c9eff
https://hg.mozilla.org/mozilla-central/rev/cdd5e8c99e2f
https://hg.mozilla.org/mozilla-central/rev/a65c408e1376
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla52
It might have been nice to split the GPU service bits out to a separate bug, I didn't realize that had landed until reviewing bug 1297843.

Is there any way we can generate a crash in the GPU process for testing purposes?

Comment 27

3 years ago
Pushed by danderson@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/c8183800343e
Add GPU process support to the crash reporting service. (bug 1278717, r=ted)
comment #28 was meant for bug 1297843. I'll file another bug for testing.

Comment 30

3 years ago
Pushed by danderson@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/0432b5da3243
Add GPU process support to the crash reporting service. (bug 1278717 part 1, r?=gfritzsche)
You need to log in before you can comment on or make changes to this bug.