Closed Bug 1390488 Opened 3 years ago Closed 2 years ago

GPU process crashes don't contain crash message (including rust panic messages)

Categories

(Toolkit :: Crash Reporting, defect)

Unspecified
Windows
defect
Not set

Tracking

()

VERIFIED FIXED
mozilla57
Tracking Status
firefox55 --- wontfix
firefox56 --- wontfix
firefox57 --- fixed

People

(Reporter: kats, Assigned: kats)

References

(Blocks 1 open bug)

Details

Attachments

(2 files)

See e.g. https://crash-stats.mozilla.com/report/index/87d7e75e-2a2d-43e0-b914-b66820170813#tab-details

This is a GPU process rust panic in webrender. However the panic error message doesn't appear anywhere in the crash report, as far as I can tell. We should fix that.
First thought is that this is probably similar to bug 1379857, where we weren't capturing panic messages from child processes in crash reports.  Maybe there's a missing install_rust_panic_hook call in the GPU process startup sequence?  I would think the crashreporting gets set up the same way for the child process and the GPU process, but maybe there's some subtle difference there?
I looked for some more crash reports with unwrap_failed in the signature and found these interesting ones:

1. https://crash-stats.mozilla.com/report/index/ad8d7ba6-38d5-467e-9e5a-a8e0e0170813 - this is a content(extension) process and has no MOZ_CRASH_REASON
2. https://crash-stats.mozilla.com/report/index/9bb84aeb-eb41-4f09-b5db-2ebeb0170812 - this is also in a content(extension) process but *does* have a MOZ_CRASH_REASON.

(1) is from a 20170813 build and (2) is from a 20170811 build, so either this regressed recently or it's intermittent
On nightly 20170815 the hook setup appears to be working fine in the gpu process (the one with "gpu" at the end of its command line). The `HOOK` variable goes from empty at the beginning of the function to something plausible after I step out:

Breakpoint 1 hit
xul!gkrust_shared::install_rust_panic_hook:
00007ffc`fa257e60 55              push    rbp
1:036> dq xul!HOOK L2
00007ffc`fdd70718  00000000`00000000 00000000`00000000
1:036> gu; dq xul!HOOK L2
00007ffc`fdd70718  00000000`00000001 00000198`2f908390

So it doesn't look like an across-the-board bustage. Either some subset of machines is failing to set up the hook, or something happens after the setup to make it not be respected, or maybe the hook is running fine but we're not capturing the data somehow.
> or maybe the hook is running fine but we're not capturing the data somehow.

I'm leaning towards that one.

I don't see _any_ MOZ_CRASH text in gpu process crash reports, rust or otherwise, newer than build 20170717100212.
> I don't see _any_ MOZ_CRASH text in gpu process crash reports, rust or
> otherwise, newer than build 20170717100212.

I'm not sure where I got that build ID from. Searching again, I actually don't see _any_ MOZ_CRASH text in gpu process crash reports, ever. Perhaps they're going through a different crash-reporting codepath?
The code to annotate MOZ_CRASH for a child process crash is part of `PrepareChildExceptionTimeAnnotations`[1].  This handler method is installed via `XRE_SetRemoteExceptionHandler`[2].

Is the exception handler being installed for the GPU process?

[1]: http://searchfox.org/mozilla-central/rev/e5b13e6224dbe3182050cf442608c4cb6a8c5c55/toolkit/crashreporter/nsExceptionHandler.cpp#1339
[2]: http://searchfox.org/mozilla-central/rev/e5b13e6224dbe3182050cf442608c4cb6a8c5c55/toolkit/xre/nsEmbedFunctions.cpp#501-517
> Is the exception handler being installed for the GPU process?

Yes, just confirmed with a breakpoint: we do go through XRE_SetRemoteExceptionHandler in the GPU process.
Widening scope of this bug to GPU process crashes based on comment 5. We can punt the maybe-regression in comment 2 to another bug. While digging around the bug tree (rooted at 1264543) I found bug 1278717 which seems relevant. Haven't dug into the code yet though.
Blocks: e10s-gpu
See Also: → 1278717
Summary: Some rust-panic crash reports seem to not contain error messages → GPU process crashes don't contain crash message (including rust panic messages)
When a child crash happens, the extra data (like MOZ_CRASH) should first be written[1] to files named like `GeckoChildCrash<pid>.extra` inside of the `TmpD` directory, and then later combined when the crash report is submitted.

What is `TmpD` for the GPU process?  Is it blocked by sandboxing?

Perhaps the extra files are in fact written but never combined with the crash report...?

[1]: http://searchfox.org/mozilla-central/rev/e5b13e6224dbe3182050cf442608c4cb6a8c5c55/toolkit/crashreporter/nsExceptionHandler.cpp#1344-1373
(In reply to J. Ryan Stinnett [:jryans] (use ni?) from comment #9)
> When a child crash happens, the extra data (like MOZ_CRASH) should first be
> written[1] to files named like `GeckoChildCrash<pid>.extra` inside of the
> `TmpD` directory, and then later combined when the crash report is submitted.
> 
> What is `TmpD` for the GPU process?  Is it blocked by sandboxing?

The GPU process uses an abbreviated XPCOM startup sequence:

http://searchfox.org/mozilla-central/source/xpcom/build/XPCOMInit.cpp#754

It looks like it doesn't initialize the directory service, whereas the normal startup sequence does:

http://searchfox.org/mozilla-central/source/xpcom/build/XPCOMInit.cpp#560-592

So we may be asking for TmpD and not getting anything back.  That'd be something to look into too.
> So we may be asking for TmpD and not getting anything back.  That'd be
> something to look into too.

Yep.

0:001> x xul!CrashReporter::childProcessTmpDir
00007ffc`fde5d658 xul!CrashReporter::childProcessTmpDir = 0x00000000`00000000
I'll see if I can get the temp dir hooked up.
Assignee: nobody → bugmail
So I got TmpD working and I see the GeckoChildCrash<pid>.extra file getting written out but it's still not showing up in the crash report. Does anybody have a quick pointer to which code is supposed to combine the .extra file into the final crash report?
Also for future reference: right now just enabling and initializing the directory service is sufficient to write files into TmpD. However, sandboxing is also currently disabled on the GPU process (see bug 1347710 and bug 1361719 for details). Presumably once sandboxing is turned back on we'll need to make sure TmpD points to a folder that can be written to even with sandboxing. The content processes do this at [1] and we'll probably need something similar in the GPU process.

[1] http://searchfox.org/mozilla-central/rev/b258e6864ee3e809d40982bc5d0d5aff66a20780/dom/ipc/ContentProcess.cpp#96-99
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #13)
> So I got TmpD working and I see the GeckoChildCrash<pid>.extra file getting
> written out but it's still not showing up in the crash report. Does anybody
> have a quick pointer to which code is supposed to combine the .extra file
> into the final crash report?

Here's the code[1] that reads in the .extra file.

It seems like it can either be called by `CrashReporterHost`[2] or by `OnChildProcessDumpRequested`[3].  I am not sure which applies in this scenario...

As for `CrashReporterHost`, I did notice that most process types appear to set this up in the IPC parent, like `ContentParent`, etc.  But for the GPU, it's created in the child[4], which seems suspicious (but I am just making guesses).

[1]: http://searchfox.org/mozilla-central/rev/48ea452803907f2575d81021e8678634e8067fc2/toolkit/crashreporter/nsExceptionHandler.cpp#3311
[2]: http://searchfox.org/mozilla-central/rev/48ea452803907f2575d81021e8678634e8067fc2/ipc/glue/CrashReporterHost.cpp#214
[3]: http://searchfox.org/mozilla-central/rev/48ea452803907f2575d81021e8678634e8067fc2/toolkit/crashreporter/nsExceptionHandler.cpp#3401
[4]: http://searchfox.org/mozilla-central/search?q=CrashReporterHost&case=false&regexp=false&path=
Thanks! For the GPU process the parent/child relationship is reversed (the GPUChild lives in the UI process, GPU parent lives in the GPU process) so that's why it's backwards. I'll trace through this code tomorrow and see what's going on (I don't have access to the windows machine today).
Ah, the problem is that the parent process looks in what *it* thinks is the childProcessTmpDir [1] (which looks something like "C:\Users\kats\AppData\LocalLow\Mozilla\Temp-{guid}" on my machine). However the GPU process wrote the extra file to what *it* had as the childProcessTmpDir which is c:\Users\kats\AppData\Local\Temp. The difference is because the GPU process doesn't run the code at [2] to update it's temp dir when MOZ_SANDBOX is defined.

I guess I might as well make that happen, and then this code will be sandbox-safe as well in the GPU process for when sandboxing is turned back on.

[1] http://searchfox.org/mozilla-central/rev/48ea452803907f2575d81021e8678634e8067fc2/toolkit/crashreporter/nsExceptionHandler.cpp#3203
[2] http://searchfox.org/mozilla-central/rev/48ea452803907f2575d81021e8678634e8067fc2/dom/ipc/ContentProcess.cpp#99
This is turning out to be quite the tricky problem. The directory service needs to have a nsXREDirProvider properly initialized and registered in order to produce any value at all for the NS_APP_CONTENT_PROCESS_TEMP_DIR key, which is what we need. And nsXREDirProvider initialization is somewhat glued to other XRE plumbing so I had to do a bit of refactoring to get that part working.

Next problem is that in a couple of places the sandboxing code checks prefs to figure out if sandboxing is enabled [1] and to obtain the UUID to tack on to the sandbox-friendly temp folder [2]. Problem is, the GPU process doesn't have access to the Preferences service, so this fails. [1] is easy to work around but [2] is not. I see a few possible approaches here:
a) Initialize the Preferences service in the GPU process. I think skipping this was done intentionally so I'd like to avoid doing this. I'm not sure how much other stuff this will pull in the GPU process
b) Find some other mechanism (i.e. not Preferences) to propagate that folder UUID from the parent process to the GPU process. Maybe we could just pass the entire temp dir as a command-line param to the GPU process or something. Other prefs are currently sent to the GPU process via the PGPU::Init ipdl message but that I believe happens too late for what we want here.
c) Drop the requirement that the GPU process put it's .extra file in the same folder that the other content processes are putting theirs in. That way the GPU process doesn't need to know the UUID at all. However I'm not sure this actually solves the problem because the GPU process and the parent process still need to agree on a common location to drop that .extra file. Any mechanism we use to make them agree on this, we can just use to propagate that UUID.

Right now I'm leaning towards option (b).

[1] http://searchfox.org/mozilla-central/rev/48ea452803907f2575d81021e8678634e8067fc2/security/sandbox/common/SandboxSettings.cpp#15
[2] http://searchfox.org/mozilla-central/rev/48ea452803907f2575d81021e8678634e8067fc2/toolkit/xre/nsXREDirProvider.cpp#749
I got the command-line param thing working. So in the end I didn't even need to initialize the directory service or anything else, I just take the childProcessTmpDir directly from the CrashReporter in the parent process and pass it to the GPU process command line, which then reinserts it into its own CrashReporter. Local testing with that seems to work.

Try push with everything, since I don't know what all exercises this code: https://treeherder.mozilla.org/#/jobs?repo=try&revision=99172e4f43064d9c132af60e10506af023fc88a5
Attachment #8900725 - Flags: review?(ted) → review?(nfroyd)
Attachment #8900726 - Flags: review?(ted) → review?(nfroyd)
Comment on attachment 8900725 [details]
Bug 1390488 - Clean up ifdef indenting and balancing comments, no functional changes.

https://reviewboard.mozilla.org/r/172166/#review184392

The ifdeffery here is frightening.

::: ipc/glue/GeckoChildProcessHost.cpp:1134
(Diff revision 1)
>        }
>      }
> -#endif
> +# endif // MOZ_SANDBOX
>    }
>  
> -#else
> +#else // goes with defined(OS_POSIX)

"goes with defined(OS_POSIX)"...*400* lines earlier!

::: ipc/glue/GeckoChildProcessHost.cpp:1136
(Diff revision 1)
> -#endif
> +# endif // MOZ_SANDBOX
>    }
>  
> -#else
> +#else // goes with defined(OS_POSIX)
>  #  error Sorry
>  #endif

Nit: mark this `// defined(OS_POSIX)` for consistency?
Comment on attachment 8900725 [details]
Bug 1390488 - Clean up ifdef indenting and balancing comments, no functional changes.

https://reviewboard.mozilla.org/r/172166/#review184394
Attachment #8900725 - Flags: review?(nfroyd) → review+
Comment on attachment 8900726 [details]
Bug 1390488 - Pass the childProcessTmpDir from the parent process to the GPU process.

https://reviewboard.mozilla.org/r/172168/#review184396

Seems reasonable, thank you for looking into this!
Attachment #8900726 - Flags: review?(nfroyd) → review+
(In reply to Nathan Froyd [:froydnj] from comment #23)
> 
> The ifdeffery here is frightening.

Yeah. For the longest time I was fiddling with the command line args inside the POSIX block and wondering why the heck it wasn't working (on Windows). Once I realized my mistake I figured this cleanup might help a bit :)

> > -#else
> > +#else // goes with defined(OS_POSIX)
> >  #  error Sorry
> >  #endif
> 
> Nit: mark this `// defined(OS_POSIX)` for consistency?

Fixed
Pushed by kgupta@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/fdffa51b68d1
Clean up ifdef indenting and balancing comments, no functional changes. r=froydnj
https://hg.mozilla.org/integration/autoland/rev/b80e267bdf30
Pass the childProcessTmpDir from the parent process to the GPU process. r=froydnj
Backed out for build bustage at ipc/glue/GeckoChildProcessHost.cpp:868: undefined reference to `CrashReporter::GetChildProcessTmpDir(nsIFile**)':

https://hg.mozilla.org/integration/autoland/rev/3f40a2289e410b8aa650d244048407c58a2308f8
https://hg.mozilla.org/integration/autoland/rev/0fd97303a92d2adc9831da022c10c1fd9ffa2dc7

Push with failures: https://treeherder.mozilla.org/#/jobs?repo=autoland&revision=b80e267bdf307f08a49810729a31a567e39fa0c5&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=retry&filter-resultStatus=usercancel&filter-resultStatus=runnable
Failure log: https://treeherder.mozilla.org/logviewer.html#?job_id=130671310&repo=autoland

[task 2017-09-13T14:32:37.672401Z] 14:32:37     INFO -      INPUT("../../modules/zlib/src/zutil.o")
[task 2017-09-13T14:32:37.672610Z] 14:32:37     INFO -      INPUT("StaticXULComponentsEnd/StaticXULComponentsEnd.o")
[task 2017-09-13T14:32:37.672865Z] 14:32:37     INFO -  ../../ipc/glue/Unified_cpp_ipc_glue0.o: In function `mozilla::ipc::GeckoChildProcessHost::PerformAsyncLaunchInternal(std::vector<std::string, std::allocator<std::string> >&, base::ProcessArchitecture)':
[task 2017-09-13T14:32:37.673060Z] 14:32:37     INFO -  /builds/worker/workspace/build/src/ipc/glue/GeckoChildProcessHost.cpp:868: undefined reference to `CrashReporter::GetChildProcessTmpDir(nsIFile**)'
[task 2017-09-13T14:32:37.673260Z] 14:32:37     INFO -  ../../../gcc/bin/ld: libxul.so: hidden symbol `_ZN13CrashReporter21GetChildProcessTmpDirEPP7nsIFile' isn't defined
[task 2017-09-13T14:32:37.673438Z] 14:32:37     INFO -  ../../../gcc/bin/ld: final link failed: Bad value
[task 2017-09-13T14:32:37.673639Z] 14:32:37     INFO -  clang-4.0: error: linker command failed with exit code 1 (use -v to see invocation)
[task 2017-09-13T14:32:37.673834Z] 14:32:37     INFO -  /builds/worker/workspace/build/src/config/rules.mk:721: recipe for target 'libxul.so' failed
[task 2017-09-13T14:32:37.674012Z] 14:32:37     INFO -  gmake[5]: *** [libxul.so] Error 1
Flags: needinfo?(bugmail)
Urgh. Looks like linux64-asan is built with --disable-crashreporter [1] which means we skip the crashreporter folder [2], so the new function in toolkit/crashreporter/nsExceptionHandler.cpp doesn't get compiled. I'll need to wrap the call site in ifdef MOZ_CRASHREPORTER, I think. Other uses of CrashReporter::* in GeckoChildProcessHost.cpp already do that.

[1] http://searchfox.org/mozilla-central/rev/e76c0fee79a34a3612740b33508276f4b37c3ee8/build/unix/mozconfig.asan#24
[2] http://searchfox.org/mozilla-central/rev/e76c0fee79a34a3612740b33508276f4b37c3ee8/toolkit/moz.build#49
Flags: needinfo?(bugmail)
Pushed by kgupta@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/e05d7bb3a80f
Clean up ifdef indenting and balancing comments, no functional changes. r=froydnj
https://hg.mozilla.org/integration/autoland/rev/1682167a8946
Pass the childProcessTmpDir from the parent process to the GPU process. r=froydnj
Not sure that this is worth uplifting to 56. We've lived without the crash reason for this long so I don't think another release will make a big difference.
OS: Unspecified → Windows
You need to log in before you can comment on or make changes to this bug.