Closed Bug 1407693 Opened 3 years ago Closed 2 years ago

Don't attempt to create a file on crash in content process

Categories

(Toolkit :: Crash Reporting, enhancement)

enhancement
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla60
Tracking Status
firefox58 --- wontfix
firefox59 --- wontfix
firefox60 --- fixed

People

(Reporter: Alex_Gaynor, Assigned: Alex_Gaynor)

References

Details

Attachments

(2 files, 1 obsolete file)

Right now on a crash, our callback to the breakpad ExceptionHandler creates a new file in the content temp directory: https://dxr.mozilla.org/mozilla-central/source/toolkit/crashreporter/nsExceptionHandler.cpp#1351-1390

We'd like to drop write permissions from this directory, and I believe this is the last use of that!

A fairly straight forward resolution of this would be to open the file in |SetRemoteExceptionHandler|, and pass it as the callback context.

I'll work up a patch which does this shortly.
Comment on attachment 8932235 [details]
Bug 1407693 - Part 1 - Expose method for sharing a HANDLE to a child process in the sandboxing API;

https://reviewboard.mozilla.org/r/203266/#review208836

::: security/sandbox/win/src/sandboxbroker/sandboxBroker.h:60
(Diff revision 1)
> -  // Exposes AddTargetPeer from broker services, so that none sandboxed
> +  // Exposes AddTargetPeer from broker services, so that non-sandboxed
>    // processes can be added as handle duplication targets.
>    bool AddTargetPeer(HANDLE aPeerProcess);
>  
> +  void AddHandleToShare(HANDLE aHandle);

Oh dear, thanks for fixing.
While we're changing it, can we use JavaDoc-style comments, which I didn't realise were in the style guide at the time.

Also, can you add one to AddHandleToShare to explain that this handle will be inherited by the child.
Attachment #8932235 - Flags: review?(bobowencode) → review+
Comment on attachment 8932236 [details]
Bug 1407693 - Part 2 - when a child process crashes, write extra annotation data to a pre-opened file descriptor instead of creating a new file;

https://reviewboard.mozilla.org/r/203268/#review210132

This looks alright with a nit addressed: you need to add stubs for the functions you've added to nsExceptionHandler.h in nsDummyExceptionHandler.cpp so that things will build even with crash reporting disabled. This is required because I've landed bug 1402519 in the meantime.

::: toolkit/crashreporter/nsExceptionHandler.h:178
(Diff revision 2)
>  // indicating which remote process crashed first.
>  bool TakeMinidumpForChild(uint32_t childPid,
>                            nsIFile** dump,
>                            uint32_t* aSequence = nullptr);
>  
> +// TODO: can we use base::ProcessId?

This is rather ugly but I assume there must have been a reason why we've avoided including chromium headers here...
Attachment #8932236 - Flags: review?(gsvelto) → review+
I'm looking into the tests now. I've been bothered about the way they get .extra/.dmp files for a while so this might be the right time to fix that.
While trying to address the issues with the broken tests I've noticed that there's another problem with your patch that I hadn't spotted during review. If you try to run the toolkit/crashreporter/test/unit_ipc/test_content_oom_annotation_windows.js xpcshell test, you'll notice that the annotations that are added in the exception handler callback do not end up in the .extra file. I'm not sure if the issue is because the test isn't getting the correct .extra file or if the annotations are not written correctly. Thing is, the test does find an .extra file, but it's only partially populated.
Mmm, when I run this test I'm seeing a ton of unrelated error tracebacks, including "Failed to load module resource:///modules/AttributionCode.jsm" and a Python KeyError in mozlog.

Do these match what you're seeing? -- I suspect there's some underlying error in this output, but it's a bit hard to find :-)
Ok, disregard that, I managed to muddle through by ignoring the errors and debugging manually from your description :-) I'm debugging now.
Yeah, those errors are sadly all expected... The AttributionCode.jsm one is particularly annoying and was added recently IIRC.
Gabriele, should we expect this data to be coming from the WriteGlobaMemoryStatus call in PrepareChildExceptionTimeAnnotations or the call in MinidumpCallback?
(In reply to Alex Gaynor [:Alex_Gaynor] from comment #17)
> Gabriele, should we expect this data to be coming from the
> WriteGlobaMemoryStatus call in PrepareChildExceptionTimeAnnotations or the
> call in MinidumpCallback?

Yes, and from my testing it seems to be called correctly. I haven't had time to check the handle that gets passed to the callback though. Either way it seems that some annotation end up in an .extra file, and some others (like those from PrepareChildExceptionTimeAnnotations) end up in another. Or at least that what it looks like.
So it doesn't appear that MinidumpCallback is ever invoked, at least in this test. (Verified by sticking a MOZ_CRASH() at the top of it and seeing if it crashed, which hopefully wouldn't get silently eaten). Is that expected?

I'm not sure I understand the full flow of when these callbacks are used, and when different ones are supposed to be called. Is it a bug that MinidumpCallback isn't called?

If not, the only other called of WriteGlobalMemoryStatus is PrepareChildExceptionTimeAnnotations, which writes it to the main annotation FD (the one this whole patchset is about :-)). If this is the call it is supposed to originate, I need to figure out how it's supposed to migrate to the .extra.

Are there any good docs to read to get more comfortable with this code?
(In reply to Alex Gaynor [:Alex_Gaynor] from comment #20)
> So it doesn't appear that MinidumpCallback is ever invoked, at least in this
> test. (Verified by sticking a MOZ_CRASH() at the top of it and seeing if it
> crashed, which hopefully wouldn't get silently eaten). Is that expected?

Yes, MinidumpCallback is invoked only for main process crashes. This test is causing a content process crash so it shouldn't be called.

> If not, the only other called of WriteGlobalMemoryStatus is
> PrepareChildExceptionTimeAnnotations, which writes it to the main annotation
> FD (the one this whole patchset is about :-)). If this is the call it is
> supposed to originate, I need to figure out how it's supposed to migrate to
> the .extra.

Yes, that's the call that should do the annotations. When we hit a content process crash the main process is notified and it asks for a dump of the process that crashed. This should trigger OnChildProcessDumpRequested() callback:

https://searchfox.org/mozilla-central/rev/1f9b4f0f0653b129b4db1b9fc5e9068054afaac0/toolkit/crashreporter/nsExceptionHandler.cpp#3234

This in turn will attempt to write an .extra file for the associated minidump by calling WriteExtraForMinidump():

https://searchfox.org/mozilla-central/rev/1f9b4f0f0653b129b4db1b9fc5e9068054afaac0/toolkit/crashreporter/nsExceptionHandler.cpp#3269

This will retrieve the .extra file generated by the child process via GetExtraFileForChildPid():

https://searchfox.org/mozilla-central/rev/1f9b4f0f0653b129b4db1b9fc5e9068054afaac0/toolkit/crashreporter/nsExceptionHandler.cpp#3060

In your patch you've replaced this part by retrieving the file descriptor from the processToCrashFd which looks sane since it's the file that you've passed to the content process in the first place.

There's just a thing that might be relevant and which I hadn't noticed during review: you deregister the file descriptor and delete the file in GeckoProcessHost::~GeckoProcessHost(), but this destructor isn't called on the main thread, it's called on the I/O thread:

https://searchfox.org/mozilla-central/rev/1f9b4f0f0653b129b4db1b9fc5e9068054afaac0/ipc/glue/GeckoChildProcessHost.cpp#107

I wonder if we're racing here and deleting the content process' .extra file before we get the chance to retrieve the annotations. Note that if the file isn't present WriteExtraForMinidump() will return false, but OnChildProcessDumpRequested() won't do a thing, so the error might not be apparent.

> Are there any good docs to read to get more comfortable with this code?

Unfortunately not, we have a high-level overview of the whole process but it doesn't got into all the details and quirks of the implementation:

https://firefox-source-docs.mozilla.org/toolkit/crashreporter/crashreporter/index.html
Disabling the de-registration code doesn't cause this test to pass, so I don't think it's that (or at least, not _only_ that).

I'm hitting something that is completely baffling. If I run the test with --debugger="C:\path\to\windbg.exe" the test passes, even when I do nothing but run "g" (continue) when it starts. Running without a debugger reliably fails.

My goal was to put a DebugBreak() in WriteExtraForMinidump, however when running with the debugger that DebugBreak() doesn't fire, even though if I put an fprintf in there, it fires.

Are there issues running the crashreporter tests under a debugger?
With a bunch of fprintf debugging, what I've found is that it doesn't bail out of WriteExtraForMinidump anywhere, it makes it all the way through; the FD is present in the map, it writes it, etc. Next step is to add more prints until I know exactly what's getting written to the .extra
Ok, spent basically the entire day with printfs and a debugger:

- We definitely have the same HANDLE value in both the parent and the child.
- We write stuff to the child HANDLE, and then we close it.
- The parent's position on the HANDLE is at 0 as soon as we receive it -- this was surprising to me, but I think sharing the position between two processes might be a unix-ism.
- When we go to read from the HANDLE in the parent, we read 0 bytes from it.

Aaaaand, as I was writing this I came up with one additional idea and confirmed it: WriteFile in the child is returning FALSE, indicating it did not succeed.

I think this indicates that somehow I am not passing the HANDLE to the child correctly. I will resume investigating this tomorrow.
(In reply to Alex Gaynor [:Alex_Gaynor] from comment #24)
> Ok, spent basically the entire day with printfs and a debugger:
> 
> - We definitely have the same HANDLE value in both the parent and the child.

I wonder if this is the problem.

I know we can't use DuplicateHandle in the parent, because we don't know the target process (because it doesn't exist yet).
However, maybe we have to use DuplicateHandle in the child with the child process as the source and target process, to get a handle we can use.
Just grasping at straws really.
If we could get hang of someone more familiar with Windows internals it would help... However searching around the internet seems to point to the fact that one cannot share a file handle on Windows between processes without calling DuplicateHandle() on it first. Our IPC code seems to follow this guideline, see this:

https://searchfox.org/mozilla-central/rev/b7e3ec2468d42fa59d86c03ec7afeb209813f1d4/ipc/glue/FileDescriptor.cpp#119
Status: NEW → ASSIGNED
*lights computer on fire*

HOKAY, SO! To share a HANDLE with a new process, you need to include it in PROC_THREAD_ATTRIBUTE_HANDLE_LIST, which is accomplished by the AddHandleToShare API.

AddHandleToShare is on the SandboxBroker.

We have two process spawning APIs, depending on whether the sandbox is enabled.

xpcshells disable the sandbox.

PROC_THREAD_ATTRIBUTE_HANDLE_LIST is therefore only set when the sandbox is running, meaning the HANDLE isn't available in the child process, causing these tests to fail.

It seems there's two ways to go here:

a) Use PROC_THREAD_ATTRIBUTE_HANDLE_LIST in the non-sandboxed spawn path
b) Consolidate the process spawning code paths.

Bob, I'm not sure what our plans are here, is (b) already on the roadmap?
Flags: needinfo?(bobowencode)
(In reply to Alex Gaynor [:Alex_Gaynor] from comment #27)
...
> xpcshells disable the sandbox.

Isn't this kind of by accident?
 
> a) Use PROC_THREAD_ATTRIBUTE_HANDLE_LIST in the non-sandboxed spawn path
> b) Consolidate the process spawning code paths.
> 
> Bob, I'm not sure what our plans are here, is (b) already on the roadmap?

Well roadmap might be a bit strong, it should be fairly straight forward once we can unconditionally compile the sandbox code.
I think the only thing blocking that is the MinGW build for tor on which tjr is actively working.

I guess for the purposes of our tests I could consolidate when MOZ_SANDBOX is defined.
Then we could cut out the other stuff once we always have the sandbox.

I have a WIP patch somewhere, let me see how far I got.
Flags: needinfo?(bobowencode)
Accident is also kind of a strong word :-) It's intentional in the sense that we intentionally wrote code to do it; there's a bunch of xpcshell tests that do things like spawn TCP servers, and we didn't want refactoring all of those to block rolling the sandbox out.
(In reply to Alex Gaynor [:Alex_Gaynor] from comment #27)
> It seems there's two ways to go here:
> 
> a) Use PROC_THREAD_ATTRIBUTE_HANDLE_LIST in the non-sandboxed spawn path
> b) Consolidate the process spawning code paths.

Upstream Chromium seems to do (a), for what it's worth: https://codesearch.chromium.org/chromium/src/base/process/launch_win.cc?l=231&rcl=329d49213d27cd7eaf30f52a7fbab6118a7c5c37

But also, so do we:
https://searchfox.org/mozilla-central/rev/e3cba77cee3f/ipc/chromium/src/base/process_util_win.cc#251

…kind of; it's currently only stdout and stderr:
https://searchfox.org/mozilla-central/rev/e3cba77cee3f/ipc/chromium/src/base/process_util_win.cc#363-366

However, it looks like it wouldn't be too hard to add something to LaunchOptions and wire it up in LaunchApp.
(In reply to Jed Davis [:jld] (⏰UTC-7) from comment #30)
...
> However, it looks like it wouldn't be too hard to add something to
> LaunchOptions and wire it up in LaunchApp.

This sounds like the simplest option then.
Went with that option, and indeed |test_content_oom_annotation_windows.js| passes cleanly on my machine now. Kicked off a try run and will review where we're at after lunch :-)
https://treeherder.mozilla.org/#/jobs?repo=try&revision=59db81ed4739ce4167762a88f004b3badb3d39c1&group_state=expanded

Looks like the failing tests are:

- dom/plugins/test/mochitest/test_busy_hang.xul
- dom/plugins/test/mochitest/test_CrashService_hang.html
- browser/components/extensions/test/xpcshell/test_ext_browsingData.js  (and several others from this directory)

I believe these correspond to the TODO comments I put in part 3. IIRC, you were looking at changes to these tests Gabriele, did you make any progress on those, or should I dive in?
I hadn't looked at those yet. Hangs are nasty because they use a slightly different codepath to generate the minidump. I'll have a look later today to see if this is something trivial or not.
I'm reasonably certain the comment in hang_test.js I added is wrong -- that code is running in the parent, so it's acting on the final location of the minidump, not where the content process wrote to. I'm digging in to figure out where the actual bug is.
Ok, fixed the hang tests! https://treeherder.mozilla.org/#/jobs?repo=try&revision=afeab45e919a2aa33f0682e574b8483ead6be8fa&group_state=expanded

Down to just the failures which I believe are caused by the deletion issue, will dive into those tomorrow.
Pretty sure the solution here is to just use an anonymous pipe, instead of a disk file. Will implement today.
Gabriele, one of the limitations of a pipe is that there's a max write size before it starts to block -- obviously that'd be really bad in the crashreporter :-). Do you have any sense of how large the payloads we're typically writing for crash time annotations is?
Flags: needinfo?(gsvelto)
It depends on the annotations, most are small, usually a few tens of bytes. The TelemetryEnvironment annotation on the other hand is usually much bigger as it's a JSON object holding a significant amount of data (think kilobytes). Are you considering removing the content process .extra file altogether?
Flags: needinfo?(gsvelto)
The idea was to inherit a pipe HANDLE, instead of inheriting a disk file HANDLE. Is the TelemetryEnvironment annotation one of the ones that is sent as part of ExceptionTimeAnnotations?
I did some digging to be sure of what is annotated when. The only annotation we write to the content process .extra files are the ones in PrepareChildExceptionTimeAnnotations(). So OOMAllocationSize, MozCrashReason, ThreadIdNameMapping and additionally only on Windows those in WriteGlobalMemoryStatus(). Those are all small (tens of bytes) with the only exception of ThreadIdNameMapping which can be a few hundred KiBs. If you can push those over a pipe then we can get rid of the content process .extra file. This doesn't need to be perfect, we just have to check that if something gets truncated on the parent side it doesn't generate invalid annotations.

[1] https://searchfox.org/mozilla-central/rev/eeb7190f9ad6f1a846cd6df09986325b3f2c3117/toolkit/crashreporter/nsExceptionHandler.cpp#1264
Ooof. I misread your message as "few hundred bytes" and started porting to pipes, but you wrote hundreds of _KiB_. Pipe buffer sizes are max at like 65KiB, so that's not going to work. Going to dive into these tests and why they want to delete the temp dir.
(In reply to Alex Gaynor [:Alex_Gaynor] from comment #48)
> Ooof. I misread your message as "few hundred bytes" and started porting to
> pipes, but you wrote hundreds of _KiB_.

No, that's a typo! I meant a few hundred bytes, the pipe idea is the best approach and it should work.
Attachment #8932237 - Attachment is obsolete: true
https://treeherder.mozilla.org/#/jobs?repo=try&revision=7f01da56df3e7bfe4e08bd6d207c9829532d4100&group_state=expanded

tests look green on Windows with a pipe! (oranges are unrelated) -- going to kick off a build with Linux and macOS now.
Took a bit to get the tests green on Linux, but we're good! https://treeherder.mozilla.org/#/jobs?repo=try&revision=38877abc835b6dab5d47c999f903c3e2c1d45942&group_state=expanded

Going to re-r? since the patch has diverged significantly since the original reviews!
Attachment #8932235 - Flags: review+ → review?(bobowencode)
Attachment #8932236 - Flags: review+ → review?(gsvelto)
Comment on attachment 8932235 [details]
Bug 1407693 - Part 1 - Expose method for sharing a HANDLE to a child process in the sandboxing API;

https://reviewboard.mozilla.org/r/203266/#review224134

I'm not an IPC peer, but given jld's comment 30, I think that's fine.
Attachment #8932235 - Flags: review?(bobowencode) → review+
Comment on attachment 8932236 [details]
Bug 1407693 - Part 2 - when a child process crashes, write extra annotation data to a pre-opened file descriptor instead of creating a new file;

https://reviewboard.mozilla.org/r/203268/#review224238

Looking good, thanks a lot for the time you spent fixing this, sorry it's so messy :(

::: toolkit/crashreporter/nsExceptionHandler.h:178
(Diff revision 7)
>  // indicating which remote process crashed first.
>  bool TakeMinidumpForChild(uint32_t childPid,
>                            nsIFile** dump,
>                            uint32_t* aSequence = nullptr);
>  
> +// TODO: can we use base::ProcessId?

There was a reason why we didn't use base:: here though I can't remember which.
Attachment #8932236 - Flags: review?(gsvelto) → review+
Thanks!
Keywords: checkin-needed
Pushed by apavel@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/f18e1e557cf6
Part 1 - Expose method for sharing a HANDLE to a child process in the sandboxing API; r=bobowen
https://hg.mozilla.org/integration/autoland/rev/9c3346021c21
Part 2 - when a child process crashes, write extra annotation data to a pre-opened file descriptor instead of creating a new file; r=gsvelto
Keywords: checkin-needed
Backed out for windows mingw32 bustages at /builds/worker/workspace/build/src/ipc/glue/GeckoChildProcessHost.cpp:1032

Push that caused the failures: https://treeherder.mozilla.org/#/jobs?repo=autoland&revision=9c3346021c21eec994fc6b5dc6b68c00d05ea7f4

Recent failure log: https://treeherder.mozilla.org/logviewer.html#?job_id=160900661&repo=autoland&lineNumber=13122

Backout: https://hg.mozilla.org/integration/autoland/rev/20ec5beec3715aff1ff658ea95a9806e03a2046e
Flags: needinfo?(agaynor)
We actually backed it out for breaking Android, not for breaking mingw. Nonetheless, working on it :-)
Flags: needinfo?(agaynor)
Ok, updated and green on Android! Gabriele, are you the right person to review the Android changes?
Flags: needinfo?(gsvelto)
(In reply to Alex Gaynor [:Alex_Gaynor] from comment #67)
> Ok, updated and green on Android! Gabriele, are you the right person to
> review the Android changes?

Not really, I did some crash-related hacking on Android but I'm not a peer there. You could ask :jchen or :snorp instead.
Flags: needinfo?(gsvelto)
Attachment #8932236 - Flags: review?(snorp)
Attachment #8932236 - Flags: review?(snorp) → review?(rbarker)
Seems ok to me, but rbarker wrote this code so I'll defer to him.
Comment on attachment 8932236 [details]
Bug 1407693 - Part 2 - when a child process crashes, write extra annotation data to a pre-opened file descriptor instead of creating a new file;

https://reviewboard.mozilla.org/r/203268/#review224960
Attachment #8932236 - Flags: review?(rbarker) → review+
Thanks for all the reviews!
Keywords: checkin-needed
Pushed by dluca@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/202bc739dda8
Part 1 - Expose method for sharing a HANDLE to a child process in the sandboxing API; r=bobowen
https://hg.mozilla.org/integration/autoland/rev/bcc0a91dd43f
Part 2 - when a child process crashes, write extra annotation data to a pre-opened file descriptor instead of creating a new file; r=gsvelto,rbarker
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/202bc739dda8
https://hg.mozilla.org/mozilla-central/rev/bcc0a91dd43f
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla60
Depends on: 1438106
Depends on: 1438406
No longer depends on: 1438406
Depends on: 1438209
Depends on: 1438219
You need to log in before you can comment on or make changes to this bug.