Closed Bug 1666293 Opened 4 years ago Closed 4 years ago

Crashes at gpusGenerateCrashLog.cold.1 or gpusGenerateCrashLog on Apple Intel graphics hardware

Categories

(Core :: Graphics: Layers, defect)

Unspecified
macOS
defect

Tracking

()

RESOLVED FIXED
83 Branch
Tracking Status
firefox83 --- fixed

People

(Reporter: smichaud, Assigned: smichaud)

References

(Blocks 1 open bug, )

Details

Attachments

(3 files, 1 obsolete file)

This bug is spun off from bug 1576767 comment 315 and following.

These crashes happen on Apple Intel graphics hardware, and are caused by at least one bug in the Apple graphics drivers specific to this hardware. Alexandru Trif found 100% effective STR for one of these bugs (bug 1576767 comment 315). Using that, I found a workaround for it.

Here's a fairly detailed description of Apple's bug, and of my workaround:

The "sideband buffer" is used by user-mode graphics drivers to schedule tasks in kernel-mode graphics drivers. This buffer fills up with "tokens", each of which corresponds to a single task. Periodically the tasks are all performed in a batch, and the sideband buffer is (temporarily) cleared. This is what happens on each call to libGPUSupportMercury.dylib-gpusSubmitDataBuffers() from user mode.

Each task uses certain resources, which need to be present for it to succeed. In the case of this bug's crashes on Intel hardware, a Blit2D token is added to the sideband buffer by a user-mode graphics driver. But before the kernel-mode driver can process the token (in a call to AppleIntelHD5000Graphics-IGAccelGLContext::process_token_Blit2D()), one of the resources required to perform the Blit2D task is deleted, as the result of a call to mContext->gl->fDeleteTextures() in XUL-mozilla::WebGLTexture::~WebGLTexture(). This also triggers a call to libGPUSupportMercury.dylib-gpusSubmitDataBuffers(), which tries to process the Blit2D token. But it only happens after the resource has been deleted. So the call to AppleIntelHD5000Graphics-IGAccelGLContext::process_token_Blit2D() fails, causing the kernel-mode driver to set the "context error" to fffffffe/-2. This in turn causes the user-mode call to crash with a signature containing gpusGenerateCrashLog. These crashes happen on all versions of macOS back to 10.12 (including macOS 11).

This kind of error should really be cleaned up by graphics driver code, either in user mode or in kernel mode. That doesn't happen in this case. But the problem is basically very simple. So we can "help" the driver around it by explicitly triggering a call to libGPUSupportMercury.dylib-gpusSubmitDataBuffers() (via a call to mContext->gl->fFlush()) before the call to mContext->gl->fDeleteTextures().

Assignee: nobody → smichaud
Status: NEW → ASSIGNED

Because of a design flaw in dtrace (at least Apple's dtrace), you need some special handwaving to get this script to to symbolicate stack traces made in child processes: Use dtrace with the -p option, repeated with the pid for every plugin-container process that might crash:

    sudo dtrace -p [pid1] -p [pid2] -p [pid3] -p [pid4] ... -s bugzilla1576767.d
Blocks: 1576767

I have deliberately not yet added signatures to this bug. That's because I'm not yet sure of the scope of my fix -- of how many different crash signatures it will eliminate, or at least reduce in frequency.

Attachment #9176967 - Attachment is obsolete: true
Pushed by smichaud@pobox.com:
https://hg.mozilla.org/integration/autoland/rev/1516a6b811f2
Flush explicitly before deleting textures. r=jgilbert
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → 83 Branch

Here's the first nightly build containing this bug's patch (build id 20200922094538):

http://ftp.mozilla.org/pub/firefox/nightly/2020/09/2020-09-22-09-45-38-mozilla-central/firefox-83.0a1.en-US.mac.dmg

I'll be keeping track of its effect on crash signatures, particularly those containing the string "gpusGenerateCrashLog". It may be a week or two before we really get a good idea of it.

It looks like calls from Rust code to gl.delete_textures bypass my patch. But I don't know that code at all, so its hard for me to be sure.

My evidence for this is one crash, in a build with my patch, on Intel graphics hardware:

bp-4d936dd1-7c2e-4458-a887-2c71d0200923

The alternate explanation is that my patch isn't working, which would surprise me.

What do you think, Jeff? Please pass this question along to someone else if you think that's appropriate.

Flags: needinfo?(jgilbert)

Yes, WebRender's GL usage does not go through GLContext. I think you could apply an equivalent fix in the rust code here: https://searchfox.org/mozilla-central/rev/35d927df97900a57ecb562ad13909e392440b0fb/gfx/wr/webrender/src/device/gl.rs#2674

Something like this could work:

        #[cfg(target_os = "macos")]
        {
          // On the Mac the call to delete_textures() triggers a flush. But it
          // happens at the wrong time, which can lead to crashes. To work around
          // this we call flush() explicitly ourselves, before the call to
          // delete_textures(). This fixes bug 1666293.
          self.gl.flush();
        }

        self.gl.delete_textures(&[texture.id]);
Flags: needinfo?(jgilbert)

Thanks, Markus.

But after digging around a bit, I wonder if it might be better to intervene at https://searchfox.org/mozilla-central/source/third_party/rust/gleam/src/gl_fns.rs#348:

    fn delete_textures(&self, textures: &[GLuint]) {
        unsafe {
            #[cfg(target_os = "macos")]
            {
                // On the Mac the call to delete_textures() triggers a flush.
                // But it happens at the wrong time, which can lead to crashes.
                // To work around this we call flush() explicitly ourselves,
                // before the call to delete_textures(). This fixes bug 1666293.
                self.ffi_gl_.Flush();
            }
            self.ffi_gl_
                .DeleteTextures(textures.len() as GLsizei, textures.as_ptr());
        }
    }

(Following up comment 12)

Your suggestion wouldn't catch all uses of delete_textures, but mine would:

https://searchfox.org/mozilla-central/search?q=delete_textures&path=&case=true&regexp=false

Though I do realize that this is third party code, which might complicate things.

(In reply to Steven Michaud [:smichaud] (Retired) from comment #14)

Though I do realize that this is third party code, which might complicate things.

Yes, this was the reason why I didn't suggest it. We control upstream gleam so it's not a problem to upstream, but it requires a github PR and some coordination.

I'm willing to do a PR on the github project. Is it https://github.com/servo/gleam?

I did notice one patch to Gleam code in the Mozilla tree that seems to have happened there directly (bug 1615694). But I agree it's best to do a PR. This bug has existed for years. It won't hurt to wait a little longer for a full fix.

The previous tryserver build died. Apparently there's a trick to making changes to "crates" in third_party/rust. Here's a new tryserver build that's doing better:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=01ec6ad5751f958e1766fb20addb6731d9142f13

I've created a pull request for the gleam project:

https://github.com/servo/gleam/pull/217

(In reply to Steven Michaud [:smichaud] (Retired) from comment #17)

I'm willing to do a PR on the github project. Is it https://github.com/servo/gleam?

I did notice one patch to Gleam code in the Mozilla tree that seems to have happened there directly (bug 1615694). But I agree it's best to do a PR. This bug has existed for years. It won't hurt to wait a little longer for a full fix.

The patch in that bug was updating the vendored copy of the code to a more recent upstream version that included the changes in a PR. Similar work will need to happen in this bug once #217 is merged.

My pull request was rejected. The people who manage https://github.com/servo/gleam want to keep it a simple API wrapper. I'll need to do my work in WebRender.

But I'm currently very busy with other things, principally updating https://github.com/steven-michaud/HookCase for macOS 11. So I'm going to put this bug aside for a while.

Steven, do I understand correctly that this bug should be re-opened?

Flags: needinfo?(smichaud)

I don't really know.

My patch is working, as best I can tell. So this bug is half-fixed. Or maybe more than half-fixed, if calls to fDeleteTextures() coming from Rust code are less common than those not from Rust code. But until the calls from Rust code are also patched, it isn't fixed completely.

My preference is to leave it closed, at least for the time being. I'll reopen it if there are lots of crashes in Rust code that seem to belong here (that seem to be caused by Apple's graphics driver bug).

As per comment 4, I'm still waiting to attach signatures to this bug. I'm still waiting to see the full effect of my patch (the one that's landed) on crash statistics.

Flags: needinfo?(smichaud)

Yeah, I think we should file a new bug for the remaining work.

(In reply to Steven Michaud [:smichaud] (Retired) from comment #13)

(Following up comment 12)

Your suggestion wouldn't catch all uses of delete_textures, but mine would:

https://searchfox.org/mozilla-central/search?q=delete_textures&path=&case=true&regexp=false

There are only two other calls to delete_textures in WebRender:

  • self.gl.delete_textures(&[external.id]); in device/gl.rs: This is used in code that is not compiled into Firefox. It is used for the "replay" feature of the stand-alone tool called "wrench", which is only used by WebRender developers.
  • gl.delete_textures(&[tex]); in wrench/src/main.rs: A very specific call, again only used in wrench, that is unlikely to hit the crash.
  • The other results are in "SWGL" which is a software implementation of OpenGL. It does not call into macOS OpenGL code.

So I think my suggested fix from comment 11 would cover all remaining cases for Firefox users.

(In reply to comment #24)

Opening a new bug is fine with me. You can do it, Markus. Or we can wait until I once again have time to deal with it.

As for your suggested fix from comment 11, go ahead, if you want. Or we can wait until I can deal with it.

You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: