Closed Bug 1666293 Opened 4 years ago Closed 4 years ago

Crashes at gpusGenerateCrashLog.cold.1 or gpusGenerateCrashLog on Apple Intel graphics hardware

Tracking

()

Status:

RESOLVED FIXED

Milestone:

83 Branch

Tracking Flags:

Tracking

Status

firefox83

---

fixed

People

(Reporter: smichaud, Assigned: smichaud)

References

(Blocks 1 open bug,
URL
)

Details

Attachments

(3 files, 1 obsolete file)

Dtrace script for debugging bug 1666293 crashes 4 years ago Steven Michaud [:smichaud] (Retired) 757 bytes, text/plain		Details
Output of dtrace script 4 years ago Steven Michaud [:smichaud] (Retired) 10.77 KB, text/plain		Details
Bug 1666293 - Flush explicitly before deleting textures. r=jgilbert 4 years ago Steven Michaud [:smichaud] (Retired) 47 bytes, text/x-phabricator-request		Details \| Review
Bug 1666293 - Flush explicitly before deleting textures. r=jgilbert 4 years ago Steven Michaud [:smichaud] (Retired) 47 bytes, text/x-phabricator-request		Details \| Review

Steven Michaud [:smichaud] (Retired)

Assignee

Description

•

4 years ago

•

Edited

This bug is spun off from bug 1576767 comment 315 and following.

These crashes happen on Apple Intel graphics hardware, and are caused by at least one bug in the Apple graphics drivers specific to this hardware. Alexandru Trif found 100% effective STR for one of these bugs (bug 1576767 comment 315). Using that, I found a workaround for it.

Here's a fairly detailed description of Apple's bug, and of my workaround:

The "sideband buffer" is used by user-mode graphics drivers to schedule tasks in kernel-mode graphics drivers. This buffer fills up with "tokens", each of which corresponds to a single task. Periodically the tasks are all performed in a batch, and the sideband buffer is (temporarily) cleared. This is what happens on each call to libGPUSupportMercury.dylib-gpusSubmitDataBuffers() from user mode.

Each task uses certain resources, which need to be present for it to succeed. In the case of this bug's crashes on Intel hardware, a Blit2D token is added to the sideband buffer by a user-mode graphics driver. But before the kernel-mode driver can process the token (in a call to AppleIntelHD5000Graphics-IGAccelGLContext::process_token_Blit2D()), one of the resources required to perform the Blit2D task is deleted, as the result of a call to mContext->gl->fDeleteTextures() in XUL-mozilla::WebGLTexture::~WebGLTexture(). This also triggers a call to libGPUSupportMercury.dylib-gpusSubmitDataBuffers(), which tries to process the Blit2D token. But it only happens after the resource has been deleted. So the call to AppleIntelHD5000Graphics-IGAccelGLContext::process_token_Blit2D() fails, causing the kernel-mode driver to set the "context error" to fffffffe/-2. This in turn causes the user-mode call to crash with a signature containing gpusGenerateCrashLog. These crashes happen on all versions of macOS back to 10.12 (including macOS 11).

This kind of error should really be cleaned up by graphics driver code, either in user mode or in kernel mode. That doesn't happen in this case. But the problem is basically very simple. So we can "help" the driver around it by explicitly triggering a call to libGPUSupportMercury.dylib-gpusSubmitDataBuffers() (via a call to mContext->gl->fFlush()) before the call to mContext->gl->fDeleteTextures().

Steven Michaud [:smichaud] (Retired)

Assignee

Updated

•

4 years ago

Assignee: nobody → smichaud

Status: NEW → ASSIGNED

Steven Michaud [:smichaud] (Retired)

Assignee

Comment 1

•

4 years ago

Attached file Dtrace script for debugging bug 1666293 crashes — Details

Steven Michaud [:smichaud] (Retired)

Assignee

Comment 2

•

4 years ago

Because of a design flaw in dtrace (at least Apple's dtrace), you need some special handwaving to get this script to to symbolicate stack traces made in child processes: Use dtrace with the -p option, repeated with the pid for every plugin-container process that might crash:

    sudo dtrace -p [pid1] -p [pid2] -p [pid3] -p [pid4] ... -s bugzilla1576767.d

Steven Michaud [:smichaud] (Retired)

Assignee

Comment 3

•

4 years ago

Attached file Output of dtrace script — Details

Steven Michaud [:smichaud] (Retired)

Assignee

Updated

•

4 years ago

URL: https://www.khronos.org/registry/webg...

Steven Michaud [:smichaud] (Retired)

Assignee

Updated

•

4 years ago

Blocks: 1576767

Steven Michaud [:smichaud] (Retired)

Assignee

Comment 4

•

4 years ago

I have deliberately not yet added signatures to this bug. That's because I'm not yet sure of the scope of my fix -- of how many different crash signatures it will eliminate, or at least reduce in frequency.

Steven Michaud [:smichaud] (Retired)

Assignee

Comment 5

•

4 years ago

Attached file Bug 1666293 - Flush explicitly before deleting textures. r=jgilbert — Details

Steven Michaud [:smichaud] (Retired)

Assignee

Comment 6

•

4 years ago

Attached file Bug 1666293 - Flush explicitly before deleting textures. r=jgilbert (obsolete) — Details

Phabricator Automation

Updated

•

4 years ago

Attachment #9176967 - Attachment is obsolete: true

Pulsebot

Comment 7

•

4 years ago

Pushed by smichaud@pobox.com: https://hg.mozilla.org/integration/autoland/rev/1516a6b811f2 Flush explicitly before deleting textures. r=jgilbert

Alexandru Michis [:malexandru]

Comment 8

•

4 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/1516a6b811f2

Status: ASSIGNED → RESOLVED

Closed: 4 years ago

status-firefox83: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → 83 Branch

Steven Michaud [:smichaud] (Retired)

Assignee

Comment 9

•

4 years ago

Here's the first nightly build containing this bug's patch (build id 20200922094538):

http://ftp.mozilla.org/pub/firefox/nightly/2020/09/2020-09-22-09-45-38-mozilla-central/firefox-83.0a1.en-US.mac.dmg

I'll be keeping track of its effect on crash signatures, particularly those containing the string "gpusGenerateCrashLog". It may be a week or two before we really get a good idea of it.

Steven Michaud [:smichaud] (Retired)

Assignee

Comment 10

•

4 years ago

It looks like calls from Rust code to gl.delete_textures bypass my patch. But I don't know that code at all, so its hard for me to be sure.

My evidence for this is one crash, in a build with my patch, on Intel graphics hardware:

bp-4d936dd1-7c2e-4458-a887-2c71d0200923

The alternate explanation is that my patch isn't working, which would surprise me.

What do you think, Jeff? Please pass this question along to someone else if you think that's appropriate.

Flags: needinfo?(jgilbert)

Markus Stange [:mstange]

Comment 11

•

4 years ago

Yes, WebRender's GL usage does not go through GLContext. I think you could apply an equivalent fix in the rust code here: https://searchfox.org/mozilla-central/rev/35d927df97900a57ecb562ad13909e392440b0fb/gfx/wr/webrender/src/device/gl.rs#2674

Something like this could work:

        #[cfg(target_os = "macos")]
        {
          // On the Mac the call to delete_textures() triggers a flush. But it
          // happens at the wrong time, which can lead to crashes. To work around
          // this we call flush() explicitly ourselves, before the call to
          // delete_textures(). This fixes bug 1666293.
          self.gl.flush();
        }

        self.gl.delete_textures(&[texture.id]);

Flags: needinfo?(jgilbert)

Steven Michaud [:smichaud] (Retired)

Assignee

Comment 12

•

4 years ago

Thanks, Markus.

But after digging around a bit, I wonder if it might be better to intervene at https://searchfox.org/mozilla-central/source/third_party/rust/gleam/src/gl_fns.rs#348:

    fn delete_textures(&self, textures: &[GLuint]) {
        unsafe {
            #[cfg(target_os = "macos")]
            {
                // On the Mac the call to delete_textures() triggers a flush.
                // But it happens at the wrong time, which can lead to crashes.
                // To work around this we call flush() explicitly ourselves,
                // before the call to delete_textures(). This fixes bug 1666293.
                self.ffi_gl_.Flush();
            }
            self.ffi_gl_
                .DeleteTextures(textures.len() as GLsizei, textures.as_ptr());
        }
    }

Steven Michaud [:smichaud] (Retired)

Assignee

Comment 13

•

4 years ago

(Following up comment 12)

Your suggestion wouldn't catch all uses of delete_textures, but mine would:

https://searchfox.org/mozilla-central/search?q=delete_textures&path=&case=true&regexp=false

Steven Michaud [:smichaud] (Retired)

Assignee

Comment 14

•

4 years ago

Though I do realize that this is third party code, which might complicate things.

Steven Michaud [:smichaud] (Retired)

Assignee

Comment 15

•

4 years ago

I've started a tryserver build with my new patch:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=a5e34083e0a09acb709ed121d61948bb624adcce

Markus Stange [:mstange]

Comment 16

•

4 years ago

(In reply to Steven Michaud [:smichaud] (Retired) from comment #14)

Though I do realize that this is third party code, which might complicate things.

Yes, this was the reason why I didn't suggest it. We control upstream gleam so it's not a problem to upstream, but it requires a github PR and some coordination.

Steven Michaud [:smichaud] (Retired)

Assignee

Comment 17

•

4 years ago

•

Edited

I'm willing to do a PR on the github project. Is it https://github.com/servo/gleam?

I did notice one patch to Gleam code in the Mozilla tree that seems to have happened there directly (bug 1615694). But I agree it's best to do a PR. This bug has existed for years. It won't hurt to wait a little longer for a full fix.

Steven Michaud [:smichaud] (Retired)

Assignee

Comment 18

•

4 years ago

The previous tryserver build died. Apparently there's a trick to making changes to "crates" in third_party/rust. Here's a new tryserver build that's doing better:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=01ec6ad5751f958e1766fb20addb6731d9142f13

Steven Michaud [:smichaud] (Retired)

Assignee

Comment 19

•

4 years ago

I've created a pull request for the gleam project:

https://github.com/servo/gleam/pull/217

Josh Matthews [:jdm]

Comment 20

•

4 years ago

(In reply to Steven Michaud [:smichaud] (Retired) from comment #17)

I'm willing to do a PR on the github project. Is it https://github.com/servo/gleam?

I did notice one patch to Gleam code in the Mozilla tree that seems to have happened there directly (bug 1615694). But I agree it's best to do a PR. This bug has existed for years. It won't hurt to wait a little longer for a full fix.

The patch in that bug was updating the vendored copy of the code to a more recent upstream version that included the changes in a PR. Similar work will need to happen in this bug once #217 is merged.

Steven Michaud [:smichaud] (Retired)

Assignee

Comment 21

•

4 years ago

My pull request was rejected. The people who manage https://github.com/servo/gleam want to keep it a simple API wrapper. I'll need to do my work in WebRender.

But I'm currently very busy with other things, principally updating https://github.com/steven-michaud/HookCase for macOS 11. So I'm going to put this bug aside for a while.

Gian-Carlo Pascutto [:gcp]

Comment 22

•

4 years ago

Steven, do I understand correctly that this bug should be re-opened?

Flags: needinfo?(smichaud)

Steven Michaud [:smichaud] (Retired)

Assignee

Comment 23

•

4 years ago

I don't really know.

My patch is working, as best I can tell. So this bug is half-fixed. Or maybe more than half-fixed, if calls to fDeleteTextures() coming from Rust code are less common than those not from Rust code. But until the calls from Rust code are also patched, it isn't fixed completely.

My preference is to leave it closed, at least for the time being. I'll reopen it if there are lots of crashes in Rust code that seem to belong here (that seem to be caused by Apple's graphics driver bug).

As per comment 4, I'm still waiting to attach signatures to this bug. I'm still waiting to see the full effect of my patch (the one that's landed) on crash statistics.

Flags: needinfo?(smichaud)

Markus Stange [:mstange]

Comment 24

•

4 years ago

Yeah, I think we should file a new bug for the remaining work.

(In reply to Steven Michaud [:smichaud] (Retired) from comment #13)

(Following up comment 12)

Your suggestion wouldn't catch all uses of delete_textures, but mine would:

https://searchfox.org/mozilla-central/search?q=delete_textures&path=&case=true&regexp=false

There are only two other calls to delete_textures in WebRender:

self.gl.delete_textures(&[external.id]); in device/gl.rs: This is used in code that is not compiled into Firefox. It is used for the "replay" feature of the stand-alone tool called "wrench", which is only used by WebRender developers.
gl.delete_textures(&[tex]); in wrench/src/main.rs: A very specific call, again only used in wrench, that is unlikely to hit the crash.
The other results are in "SWGL" which is a software implementation of OpenGL. It does not call into macOS OpenGL code.

So I think my suggested fix from comment 11 would cover all remaining cases for Firefox users.

Steven Michaud [:smichaud] (Retired)

Assignee

Comment 25

•

4 years ago

(In reply to comment #24)

Opening a new bug is fine with me. You can do it, Markus. Or we can wait until I once again have time to deal with it.

As for your suggested fix from comment 11, go ahead, if you want. Or we can wait until I can deal with it.

You need to log in before you can comment on or make changes to this bug.