Closed
Bug 1390488
Opened 7 years ago
Closed 7 years ago
GPU process crashes don't contain crash message (including rust panic messages)
Categories
(Toolkit :: Crash Reporting, defect)
Tracking
()
VERIFIED
FIXED
mozilla57
People
(Reporter: kats, Assigned: kats)
References
(Blocks 1 open bug)
Details
Attachments
(2 files)
See e.g. https://crash-stats.mozilla.com/report/index/87d7e75e-2a2d-43e0-b914-b66820170813#tab-details This is a GPU process rust panic in webrender. However the panic error message doesn't appear anywhere in the crash report, as far as I can tell. We should fix that.
Comment 1•7 years ago
|
||
First thought is that this is probably similar to bug 1379857, where we weren't capturing panic messages from child processes in crash reports. Maybe there's a missing install_rust_panic_hook call in the GPU process startup sequence? I would think the crashreporting gets set up the same way for the child process and the GPU process, but maybe there's some subtle difference there?
Assignee | ||
Comment 2•7 years ago
|
||
I looked for some more crash reports with unwrap_failed in the signature and found these interesting ones: 1. https://crash-stats.mozilla.com/report/index/ad8d7ba6-38d5-467e-9e5a-a8e0e0170813 - this is a content(extension) process and has no MOZ_CRASH_REASON 2. https://crash-stats.mozilla.com/report/index/9bb84aeb-eb41-4f09-b5db-2ebeb0170812 - this is also in a content(extension) process but *does* have a MOZ_CRASH_REASON. (1) is from a 20170813 build and (2) is from a 20170811 build, so either this regressed recently or it's intermittent
On nightly 20170815 the hook setup appears to be working fine in the gpu process (the one with "gpu" at the end of its command line). The `HOOK` variable goes from empty at the beginning of the function to something plausible after I step out: Breakpoint 1 hit xul!gkrust_shared::install_rust_panic_hook: 00007ffc`fa257e60 55 push rbp 1:036> dq xul!HOOK L2 00007ffc`fdd70718 00000000`00000000 00000000`00000000 1:036> gu; dq xul!HOOK L2 00007ffc`fdd70718 00000000`00000001 00000198`2f908390 So it doesn't look like an across-the-board bustage. Either some subset of machines is failing to set up the hook, or something happens after the setup to make it not be respected, or maybe the hook is running fine but we're not capturing the data somehow.
> or maybe the hook is running fine but we're not capturing the data somehow.
I'm leaning towards that one.
I don't see _any_ MOZ_CRASH text in gpu process crash reports, rust or otherwise, newer than build 20170717100212.
> I don't see _any_ MOZ_CRASH text in gpu process crash reports, rust or
> otherwise, newer than build 20170717100212.
I'm not sure where I got that build ID from. Searching again, I actually don't see _any_ MOZ_CRASH text in gpu process crash reports, ever. Perhaps they're going through a different crash-reporting codepath?
The code to annotate MOZ_CRASH for a child process crash is part of `PrepareChildExceptionTimeAnnotations`[1]. This handler method is installed via `XRE_SetRemoteExceptionHandler`[2]. Is the exception handler being installed for the GPU process? [1]: http://searchfox.org/mozilla-central/rev/e5b13e6224dbe3182050cf442608c4cb6a8c5c55/toolkit/crashreporter/nsExceptionHandler.cpp#1339 [2]: http://searchfox.org/mozilla-central/rev/e5b13e6224dbe3182050cf442608c4cb6a8c5c55/toolkit/xre/nsEmbedFunctions.cpp#501-517
> Is the exception handler being installed for the GPU process?
Yes, just confirmed with a breakpoint: we do go through XRE_SetRemoteExceptionHandler in the GPU process.
Assignee | ||
Comment 8•7 years ago
|
||
Widening scope of this bug to GPU process crashes based on comment 5. We can punt the maybe-regression in comment 2 to another bug. While digging around the bug tree (rooted at 1264543) I found bug 1278717 which seems relevant. Haven't dug into the code yet though.
When a child crash happens, the extra data (like MOZ_CRASH) should first be written[1] to files named like `GeckoChildCrash<pid>.extra` inside of the `TmpD` directory, and then later combined when the crash report is submitted. What is `TmpD` for the GPU process? Is it blocked by sandboxing? Perhaps the extra files are in fact written but never combined with the crash report...? [1]: http://searchfox.org/mozilla-central/rev/e5b13e6224dbe3182050cf442608c4cb6a8c5c55/toolkit/crashreporter/nsExceptionHandler.cpp#1344-1373
Comment 10•7 years ago
|
||
(In reply to J. Ryan Stinnett [:jryans] (use ni?) from comment #9) > When a child crash happens, the extra data (like MOZ_CRASH) should first be > written[1] to files named like `GeckoChildCrash<pid>.extra` inside of the > `TmpD` directory, and then later combined when the crash report is submitted. > > What is `TmpD` for the GPU process? Is it blocked by sandboxing? The GPU process uses an abbreviated XPCOM startup sequence: http://searchfox.org/mozilla-central/source/xpcom/build/XPCOMInit.cpp#754 It looks like it doesn't initialize the directory service, whereas the normal startup sequence does: http://searchfox.org/mozilla-central/source/xpcom/build/XPCOMInit.cpp#560-592 So we may be asking for TmpD and not getting anything back. That'd be something to look into too.
Comment 11•7 years ago
|
||
diagnosis |
> So we may be asking for TmpD and not getting anything back. That'd be
> something to look into too.
Yep.
0:001> x xul!CrashReporter::childProcessTmpDir
00007ffc`fde5d658 xul!CrashReporter::childProcessTmpDir = 0x00000000`00000000
Assignee | ||
Comment 12•7 years ago
|
||
I'll see if I can get the temp dir hooked up.
Assignee: nobody → bugmail
Assignee | ||
Comment 13•7 years ago
|
||
So I got TmpD working and I see the GeckoChildCrash<pid>.extra file getting written out but it's still not showing up in the crash report. Does anybody have a quick pointer to which code is supposed to combine the .extra file into the final crash report?
Assignee | ||
Comment 14•7 years ago
|
||
Also for future reference: right now just enabling and initializing the directory service is sufficient to write files into TmpD. However, sandboxing is also currently disabled on the GPU process (see bug 1347710 and bug 1361719 for details). Presumably once sandboxing is turned back on we'll need to make sure TmpD points to a folder that can be written to even with sandboxing. The content processes do this at [1] and we'll probably need something similar in the GPU process. [1] http://searchfox.org/mozilla-central/rev/b258e6864ee3e809d40982bc5d0d5aff66a20780/dom/ipc/ContentProcess.cpp#96-99
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #13) > So I got TmpD working and I see the GeckoChildCrash<pid>.extra file getting > written out but it's still not showing up in the crash report. Does anybody > have a quick pointer to which code is supposed to combine the .extra file > into the final crash report? Here's the code[1] that reads in the .extra file. It seems like it can either be called by `CrashReporterHost`[2] or by `OnChildProcessDumpRequested`[3]. I am not sure which applies in this scenario... As for `CrashReporterHost`, I did notice that most process types appear to set this up in the IPC parent, like `ContentParent`, etc. But for the GPU, it's created in the child[4], which seems suspicious (but I am just making guesses). [1]: http://searchfox.org/mozilla-central/rev/48ea452803907f2575d81021e8678634e8067fc2/toolkit/crashreporter/nsExceptionHandler.cpp#3311 [2]: http://searchfox.org/mozilla-central/rev/48ea452803907f2575d81021e8678634e8067fc2/ipc/glue/CrashReporterHost.cpp#214 [3]: http://searchfox.org/mozilla-central/rev/48ea452803907f2575d81021e8678634e8067fc2/toolkit/crashreporter/nsExceptionHandler.cpp#3401 [4]: http://searchfox.org/mozilla-central/search?q=CrashReporterHost&case=false®exp=false&path=
Assignee | ||
Comment 16•7 years ago
|
||
Thanks! For the GPU process the parent/child relationship is reversed (the GPUChild lives in the UI process, GPU parent lives in the GPU process) so that's why it's backwards. I'll trace through this code tomorrow and see what's going on (I don't have access to the windows machine today).
Assignee | ||
Comment 17•7 years ago
|
||
diagnosis |
Ah, the problem is that the parent process looks in what *it* thinks is the childProcessTmpDir [1] (which looks something like "C:\Users\kats\AppData\LocalLow\Mozilla\Temp-{guid}" on my machine). However the GPU process wrote the extra file to what *it* had as the childProcessTmpDir which is c:\Users\kats\AppData\Local\Temp. The difference is because the GPU process doesn't run the code at [2] to update it's temp dir when MOZ_SANDBOX is defined. I guess I might as well make that happen, and then this code will be sandbox-safe as well in the GPU process for when sandboxing is turned back on. [1] http://searchfox.org/mozilla-central/rev/48ea452803907f2575d81021e8678634e8067fc2/toolkit/crashreporter/nsExceptionHandler.cpp#3203 [2] http://searchfox.org/mozilla-central/rev/48ea452803907f2575d81021e8678634e8067fc2/dom/ipc/ContentProcess.cpp#99
Assignee | ||
Comment 18•7 years ago
|
||
This is turning out to be quite the tricky problem. The directory service needs to have a nsXREDirProvider properly initialized and registered in order to produce any value at all for the NS_APP_CONTENT_PROCESS_TEMP_DIR key, which is what we need. And nsXREDirProvider initialization is somewhat glued to other XRE plumbing so I had to do a bit of refactoring to get that part working. Next problem is that in a couple of places the sandboxing code checks prefs to figure out if sandboxing is enabled [1] and to obtain the UUID to tack on to the sandbox-friendly temp folder [2]. Problem is, the GPU process doesn't have access to the Preferences service, so this fails. [1] is easy to work around but [2] is not. I see a few possible approaches here: a) Initialize the Preferences service in the GPU process. I think skipping this was done intentionally so I'd like to avoid doing this. I'm not sure how much other stuff this will pull in the GPU process b) Find some other mechanism (i.e. not Preferences) to propagate that folder UUID from the parent process to the GPU process. Maybe we could just pass the entire temp dir as a command-line param to the GPU process or something. Other prefs are currently sent to the GPU process via the PGPU::Init ipdl message but that I believe happens too late for what we want here. c) Drop the requirement that the GPU process put it's .extra file in the same folder that the other content processes are putting theirs in. That way the GPU process doesn't need to know the UUID at all. However I'm not sure this actually solves the problem because the GPU process and the parent process still need to agree on a common location to drop that .extra file. Any mechanism we use to make them agree on this, we can just use to propagate that UUID. Right now I'm leaning towards option (b). [1] http://searchfox.org/mozilla-central/rev/48ea452803907f2575d81021e8678634e8067fc2/security/sandbox/common/SandboxSettings.cpp#15 [2] http://searchfox.org/mozilla-central/rev/48ea452803907f2575d81021e8678634e8067fc2/toolkit/xre/nsXREDirProvider.cpp#749
Assignee | ||
Comment 19•7 years ago
|
||
I got the command-line param thing working. So in the end I didn't even need to initialize the directory service or anything else, I just take the childProcessTmpDir directly from the CrashReporter in the parent process and pass it to the GPU process command line, which then reinserts it into its own CrashReporter. Local testing with that seems to work. Try push with everything, since I don't know what all exercises this code: https://treeherder.mozilla.org/#/jobs?repo=try&revision=99172e4f43064d9c132af60e10506af023fc88a5
Assignee | ||
Comment 20•7 years ago
|
||
Here's one with the Linux fixed: https://treeherder.mozilla.org/#/jobs?repo=try&revision=bd1a3a58819d5468e5b909c451f069804e69507f
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Assignee | ||
Updated•7 years ago
|
Attachment #8900725 -
Flags: review?(ted) → review?(nfroyd)
Attachment #8900726 -
Flags: review?(ted) → review?(nfroyd)
Comment 23•7 years ago
|
||
mozreview-review |
Comment on attachment 8900725 [details] Bug 1390488 - Clean up ifdef indenting and balancing comments, no functional changes. https://reviewboard.mozilla.org/r/172166/#review184392 The ifdeffery here is frightening. ::: ipc/glue/GeckoChildProcessHost.cpp:1134 (Diff revision 1) > } > } > -#endif > +# endif // MOZ_SANDBOX > } > > -#else > +#else // goes with defined(OS_POSIX) "goes with defined(OS_POSIX)"...*400* lines earlier! ::: ipc/glue/GeckoChildProcessHost.cpp:1136 (Diff revision 1) > -#endif > +# endif // MOZ_SANDBOX > } > > -#else > +#else // goes with defined(OS_POSIX) > # error Sorry > #endif Nit: mark this `// defined(OS_POSIX)` for consistency?
Comment 24•7 years ago
|
||
mozreview-review |
Comment on attachment 8900725 [details] Bug 1390488 - Clean up ifdef indenting and balancing comments, no functional changes. https://reviewboard.mozilla.org/r/172166/#review184394
Attachment #8900725 -
Flags: review?(nfroyd) → review+
Comment 25•7 years ago
|
||
mozreview-review |
Comment on attachment 8900726 [details] Bug 1390488 - Pass the childProcessTmpDir from the parent process to the GPU process. https://reviewboard.mozilla.org/r/172168/#review184396 Seems reasonable, thank you for looking into this!
Attachment #8900726 -
Flags: review?(nfroyd) → review+
Assignee | ||
Comment 26•7 years ago
|
||
(In reply to Nathan Froyd [:froydnj] from comment #23) > > The ifdeffery here is frightening. Yeah. For the longest time I was fiddling with the command line args inside the POSIX block and wondering why the heck it wasn't working (on Windows). Once I realized my mistake I figured this cleanup might help a bit :) > > -#else > > +#else // goes with defined(OS_POSIX) > > # error Sorry > > #endif > > Nit: mark this `// defined(OS_POSIX)` for consistency? Fixed
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Comment 29•7 years ago
|
||
Pushed by kgupta@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/fdffa51b68d1 Clean up ifdef indenting and balancing comments, no functional changes. r=froydnj https://hg.mozilla.org/integration/autoland/rev/b80e267bdf30 Pass the childProcessTmpDir from the parent process to the GPU process. r=froydnj
Comment 30•7 years ago
|
||
Backed out for build bustage at ipc/glue/GeckoChildProcessHost.cpp:868: undefined reference to `CrashReporter::GetChildProcessTmpDir(nsIFile**)': https://hg.mozilla.org/integration/autoland/rev/3f40a2289e410b8aa650d244048407c58a2308f8 https://hg.mozilla.org/integration/autoland/rev/0fd97303a92d2adc9831da022c10c1fd9ffa2dc7 Push with failures: https://treeherder.mozilla.org/#/jobs?repo=autoland&revision=b80e267bdf307f08a49810729a31a567e39fa0c5&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=retry&filter-resultStatus=usercancel&filter-resultStatus=runnable Failure log: https://treeherder.mozilla.org/logviewer.html#?job_id=130671310&repo=autoland [task 2017-09-13T14:32:37.672401Z] 14:32:37 INFO - INPUT("../../modules/zlib/src/zutil.o") [task 2017-09-13T14:32:37.672610Z] 14:32:37 INFO - INPUT("StaticXULComponentsEnd/StaticXULComponentsEnd.o") [task 2017-09-13T14:32:37.672865Z] 14:32:37 INFO - ../../ipc/glue/Unified_cpp_ipc_glue0.o: In function `mozilla::ipc::GeckoChildProcessHost::PerformAsyncLaunchInternal(std::vector<std::string, std::allocator<std::string> >&, base::ProcessArchitecture)': [task 2017-09-13T14:32:37.673060Z] 14:32:37 INFO - /builds/worker/workspace/build/src/ipc/glue/GeckoChildProcessHost.cpp:868: undefined reference to `CrashReporter::GetChildProcessTmpDir(nsIFile**)' [task 2017-09-13T14:32:37.673260Z] 14:32:37 INFO - ../../../gcc/bin/ld: libxul.so: hidden symbol `_ZN13CrashReporter21GetChildProcessTmpDirEPP7nsIFile' isn't defined [task 2017-09-13T14:32:37.673438Z] 14:32:37 INFO - ../../../gcc/bin/ld: final link failed: Bad value [task 2017-09-13T14:32:37.673639Z] 14:32:37 INFO - clang-4.0: error: linker command failed with exit code 1 (use -v to see invocation) [task 2017-09-13T14:32:37.673834Z] 14:32:37 INFO - /builds/worker/workspace/build/src/config/rules.mk:721: recipe for target 'libxul.so' failed [task 2017-09-13T14:32:37.674012Z] 14:32:37 INFO - gmake[5]: *** [libxul.so] Error 1
Flags: needinfo?(bugmail)
Assignee | ||
Comment 31•7 years ago
|
||
Urgh. Looks like linux64-asan is built with --disable-crashreporter [1] which means we skip the crashreporter folder [2], so the new function in toolkit/crashreporter/nsExceptionHandler.cpp doesn't get compiled. I'll need to wrap the call site in ifdef MOZ_CRASHREPORTER, I think. Other uses of CrashReporter::* in GeckoChildProcessHost.cpp already do that. [1] http://searchfox.org/mozilla-central/rev/e76c0fee79a34a3612740b33508276f4b37c3ee8/build/unix/mozconfig.asan#24 [2] http://searchfox.org/mozilla-central/rev/e76c0fee79a34a3612740b33508276f4b37c3ee8/toolkit/moz.build#49
Flags: needinfo?(bugmail)
Assignee | ||
Comment 32•7 years ago
|
||
https://treeherder.mozilla.org/#/jobs?repo=try&revision=9c590eeb076985d609a6b9e2d078eefc1f7584c9
Comment hidden (mozreview-request) |
Comment 34•7 years ago
|
||
Pushed by kgupta@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/e05d7bb3a80f Clean up ifdef indenting and balancing comments, no functional changes. r=froydnj https://hg.mozilla.org/integration/autoland/rev/1682167a8946 Pass the childProcessTmpDir from the parent process to the GPU process. r=froydnj
Comment 35•7 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/e05d7bb3a80f https://hg.mozilla.org/mozilla-central/rev/1682167a8946
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla57
Assignee | ||
Comment 36•7 years ago
|
||
Not sure that this is worth uplifting to 56. We've lived without the crash reason for this long so I don't think another release will make a big difference.
Assignee | ||
Comment 37•7 years ago
|
||
As of buildid 20170913220121 we're getting the moz crash reason in GPU process crashes. https://crash-stats.mozilla.com/search/?moz_crash_reason=%40.%2B&product=Firefox&version=57.0a1&platform=Windows&process_type=gpu&date=%3E%3D2017-09-07T12%3A26%3A24.000Z&date=%3C2017-09-14T12%3A26%3A24.000Z&_sort=-date&_facets=signature&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#crash-reports
Status: RESOLVED → VERIFIED
You need to log in
before you can comment on or make changes to this bug.
Description
•