Crashes at gpusGenerateCrashLog.cold.1 or gpusGenerateCrashLog on Apple Intel graphics hardware
Categories
(Core :: Graphics: Layers, defect)
Tracking
()
Tracking | Status | |
---|---|---|
firefox83 | --- | fixed |
People
(Reporter: smichaud, Assigned: smichaud)
References
(Blocks 1 open bug, )
Details
Attachments
(3 files, 1 obsolete file)
This bug is spun off from bug 1576767 comment 315 and following.
These crashes happen on Apple Intel graphics hardware, and are caused by at least one bug in the Apple graphics drivers specific to this hardware. Alexandru Trif found 100% effective STR for one of these bugs (bug 1576767 comment 315). Using that, I found a workaround for it.
Here's a fairly detailed description of Apple's bug, and of my workaround:
The "sideband buffer" is used by user-mode graphics drivers to schedule tasks in kernel-mode graphics drivers. This buffer fills up with "tokens", each of which corresponds to a single task. Periodically the tasks are all performed in a batch, and the sideband buffer is (temporarily) cleared. This is what happens on each call to libGPUSupportMercury.dylib-gpusSubmitDataBuffers()
from user mode.
Each task uses certain resources, which need to be present for it to succeed. In the case of this bug's crashes on Intel hardware, a Blit2D
token is added to the sideband buffer by a user-mode graphics driver. But before the kernel-mode driver can process the token (in a call to AppleIntelHD5000Graphics-IGAccelGLContext::process_token_Blit2D()
), one of the resources required to perform the Blit2D
task is deleted, as the result of a call to mContext->gl->fDeleteTextures()
in XUL-mozilla::WebGLTexture::~WebGLTexture()
. This also triggers a call to libGPUSupportMercury.dylib-gpusSubmitDataBuffers()
, which tries to process the Blit2D
token. But it only happens after the resource has been deleted. So the call to AppleIntelHD5000Graphics-IGAccelGLContext::process_token_Blit2D()
fails, causing the kernel-mode driver to set the "context error" to fffffffe/-2
. This in turn causes the user-mode call to crash with a signature containing gpusGenerateCrashLog
. These crashes happen on all versions of macOS back to 10.12 (including macOS 11).
This kind of error should really be cleaned up by graphics driver code, either in user mode or in kernel mode. That doesn't happen in this case. But the problem is basically very simple. So we can "help" the driver around it by explicitly triggering a call to libGPUSupportMercury.dylib-gpusSubmitDataBuffers()
(via a call to mContext->gl->fFlush()
) before the call to mContext->gl->fDeleteTextures()
.
Assignee | ||
Updated•4 years ago
|
Assignee | ||
Comment 1•4 years ago
|
||
Assignee | ||
Comment 2•4 years ago
|
||
Because of a design flaw in dtrace
(at least Apple's dtrace
), you need some special handwaving to get this script to to symbolicate stack traces made in child processes: Use dtrace
with the -p
option, repeated with the pid for every plugin-container
process that might crash:
sudo dtrace -p [pid1] -p [pid2] -p [pid3] -p [pid4] ... -s bugzilla1576767.d
Assignee | ||
Comment 3•4 years ago
|
||
Assignee | ||
Updated•4 years ago
|
Assignee | ||
Comment 4•4 years ago
|
||
I have deliberately not yet added signatures to this bug. That's because I'm not yet sure of the scope of my fix -- of how many different crash signatures it will eliminate, or at least reduce in frequency.
Assignee | ||
Comment 5•4 years ago
|
||
Assignee | ||
Comment 6•4 years ago
|
||
Updated•4 years ago
|
Comment 8•4 years ago
|
||
bugherder |
Assignee | ||
Comment 9•4 years ago
|
||
Here's the first nightly build containing this bug's patch (build id 20200922094538):
I'll be keeping track of its effect on crash signatures, particularly those containing the string "gpusGenerateCrashLog". It may be a week or two before we really get a good idea of it.
Assignee | ||
Comment 10•4 years ago
|
||
It looks like calls from Rust code to gl.delete_textures
bypass my patch. But I don't know that code at all, so its hard for me to be sure.
My evidence for this is one crash, in a build with my patch, on Intel graphics hardware:
bp-4d936dd1-7c2e-4458-a887-2c71d0200923
The alternate explanation is that my patch isn't working, which would surprise me.
What do you think, Jeff? Please pass this question along to someone else if you think that's appropriate.
Comment 11•4 years ago
|
||
Yes, WebRender's GL usage does not go through GLContext. I think you could apply an equivalent fix in the rust code here: https://searchfox.org/mozilla-central/rev/35d927df97900a57ecb562ad13909e392440b0fb/gfx/wr/webrender/src/device/gl.rs#2674
Something like this could work:
#[cfg(target_os = "macos")]
{
// On the Mac the call to delete_textures() triggers a flush. But it
// happens at the wrong time, which can lead to crashes. To work around
// this we call flush() explicitly ourselves, before the call to
// delete_textures(). This fixes bug 1666293.
self.gl.flush();
}
self.gl.delete_textures(&[texture.id]);
Assignee | ||
Comment 12•4 years ago
|
||
Thanks, Markus.
But after digging around a bit, I wonder if it might be better to intervene at https://searchfox.org/mozilla-central/source/third_party/rust/gleam/src/gl_fns.rs#348:
fn delete_textures(&self, textures: &[GLuint]) {
unsafe {
#[cfg(target_os = "macos")]
{
// On the Mac the call to delete_textures() triggers a flush.
// But it happens at the wrong time, which can lead to crashes.
// To work around this we call flush() explicitly ourselves,
// before the call to delete_textures(). This fixes bug 1666293.
self.ffi_gl_.Flush();
}
self.ffi_gl_
.DeleteTextures(textures.len() as GLsizei, textures.as_ptr());
}
}
Assignee | ||
Comment 13•4 years ago
|
||
(Following up comment 12)
Your suggestion wouldn't catch all uses of delete_textures
, but mine would:
https://searchfox.org/mozilla-central/search?q=delete_textures&path=&case=true®exp=false
Assignee | ||
Comment 14•4 years ago
|
||
Though I do realize that this is third party code, which might complicate things.
Assignee | ||
Comment 15•4 years ago
|
||
I've started a tryserver build with my new patch:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=a5e34083e0a09acb709ed121d61948bb624adcce
Comment 16•4 years ago
|
||
(In reply to Steven Michaud [:smichaud] (Retired) from comment #14)
Though I do realize that this is third party code, which might complicate things.
Yes, this was the reason why I didn't suggest it. We control upstream gleam so it's not a problem to upstream, but it requires a github PR and some coordination.
Assignee | ||
Comment 17•4 years ago
•
|
||
I'm willing to do a PR on the github project. Is it https://github.com/servo/gleam?
I did notice one patch to Gleam code in the Mozilla tree that seems to have happened there directly (bug 1615694). But I agree it's best to do a PR. This bug has existed for years. It won't hurt to wait a little longer for a full fix.
Assignee | ||
Comment 18•4 years ago
|
||
The previous tryserver build died. Apparently there's a trick to making changes to "crates" in third_party/rust
. Here's a new tryserver build that's doing better:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=01ec6ad5751f958e1766fb20addb6731d9142f13
Assignee | ||
Comment 19•4 years ago
|
||
I've created a pull request for the gleam project:
Comment 20•4 years ago
|
||
(In reply to Steven Michaud [:smichaud] (Retired) from comment #17)
I'm willing to do a PR on the github project. Is it https://github.com/servo/gleam?
I did notice one patch to Gleam code in the Mozilla tree that seems to have happened there directly (bug 1615694). But I agree it's best to do a PR. This bug has existed for years. It won't hurt to wait a little longer for a full fix.
The patch in that bug was updating the vendored copy of the code to a more recent upstream version that included the changes in a PR. Similar work will need to happen in this bug once #217 is merged.
Assignee | ||
Comment 21•4 years ago
|
||
My pull request was rejected. The people who manage https://github.com/servo/gleam want to keep it a simple API wrapper. I'll need to do my work in WebRender.
But I'm currently very busy with other things, principally updating https://github.com/steven-michaud/HookCase for macOS 11. So I'm going to put this bug aside for a while.
Comment 22•4 years ago
|
||
Steven, do I understand correctly that this bug should be re-opened?
Assignee | ||
Comment 23•4 years ago
|
||
I don't really know.
My patch is working, as best I can tell. So this bug is half-fixed. Or maybe more than half-fixed, if calls to fDeleteTextures()
coming from Rust code are less common than those not from Rust code. But until the calls from Rust code are also patched, it isn't fixed completely.
My preference is to leave it closed, at least for the time being. I'll reopen it if there are lots of crashes in Rust code that seem to belong here (that seem to be caused by Apple's graphics driver bug).
As per comment 4, I'm still waiting to attach signatures to this bug. I'm still waiting to see the full effect of my patch (the one that's landed) on crash statistics.
Comment 24•4 years ago
|
||
Yeah, I think we should file a new bug for the remaining work.
(In reply to Steven Michaud [:smichaud] (Retired) from comment #13)
(Following up comment 12)
Your suggestion wouldn't catch all uses of
delete_textures
, but mine would:https://searchfox.org/mozilla-central/search?q=delete_textures&path=&case=true®exp=false
There are only two other calls to delete_textures in WebRender:
self.gl.delete_textures(&[external.id]);
in device/gl.rs: This is used in code that is not compiled into Firefox. It is used for the "replay" feature of the stand-alone tool called "wrench", which is only used by WebRender developers.gl.delete_textures(&[tex]);
in wrench/src/main.rs: A very specific call, again only used in wrench, that is unlikely to hit the crash.- The other results are in "SWGL" which is a software implementation of OpenGL. It does not call into macOS OpenGL code.
So I think my suggested fix from comment 11 would cover all remaining cases for Firefox users.
Assignee | ||
Comment 25•4 years ago
|
||
(In reply to comment #24)
Opening a new bug is fine with me. You can do it, Markus. Or we can wait until I once again have time to deal with it.
As for your suggested fix from comment 11, go ahead, if you want. Or we can wait until I can deal with it.
Description
•