Open Bug 1713230 Opened 3 years ago Updated 20 days ago

Crashes at gpusGenerateCrashLog with customized graphics kernel errors '0x1be385f9' and '0x067900fc'

Categories

(Core :: Graphics, defect, P3)

x86_64
macOS
defect

REOPENED

People

(Reporter: smichaud, Unassigned)

References

(Depends on 1 open bug, Blocks 1 open bug)

Details

(Keywords: topcrash)

Attachments

(2 files, 3 obsolete files)

IOAccelContext2::setContextError(unsigned int error) is a method defined in the IOAcceleratorFamily2 kernel-mode graphics driver. It's called, on an error condition, to set the "context error" in an IOAccelContext2 context -- usually from one of the hardware-specific kernel-mode graphics drivers like AppleIntelHD5000Graphics or AMDRadeonX4000. When this happens, this error number becomes the "graphics kernel error" in the mac_crash_info data written by gpusGenerateCrashLog.cold.1().

Normally these error numbers are simple negative integers -- for example 0xfffffff9/-7 or 0xfffffffc/-4. But as of macOS 11.4, context errors set (by calls to setContextError()) from AMDRadeon hardware-specific kernel-mode graphics drivers can have a different, much more elaborate format -- for example 0x1be385f9 or 0x067900fc.

I believe this is part of a special effort on Apple's part to get to the bottom of these errors on AMDRadeon hardware. macOS 11.1 and 11.3 are supposed to have included fixes for these problems. But if anything they've grown worse. (See bug 1576767 comment #347 and bug 1576767 comment #348.) And it appears that Apple has doubled down on their efforts to resolve them, though (so far as I know) they haven't done this publicly. (Which probably shows that many more apps than just Firefox and Thunderbird are affected.)

Each of these error numbers has three "fields": "nnnn:nn:nn". I don't (yet) understand the first and the third. But I'm pretty sure the second indicates the kind of "token" being processed when the failure occurred. More on this in a later comment. But even without understanding the error numbers' format, you can see (in a good disassembler) that each one is only ever used once. So you can tell from the error number exactly where the error happened -- exactly which call to setContextError() "set" it.

Here are some examples from the last few days:

bp-3539c604-8378-4d6d-adcd-f89aa0210526
bp-c5702982-e478-4dad-97fe-4150e0210527
bp-cfb6c9e9-c86b-4449-8bdb-aa9490210527

    {
      "num_records": 2,
      "records": [
        {
          "message": "abort() called",
          "module": "/usr/lib/system/libsystem_c.dylib"
        },
        {
          "module": "/System/Library/PrivateFrameworks/GPUSupport.framework/Versions/A/Libraries/libGPUSupportMercury.dylib",
          "signature_string": "Graphics kernel error: 0x1be385f9\n"
        }
      ]
    }

bp-a6438d18-4304-49d4-990a-4fb2f0210526

    {
      "num_records": 2,
      "records": [
        {
          "message": "abort() called",
          "module": "/usr/lib/system/libsystem_c.dylib"
        },
        {
          "module": "/System/Library/PrivateFrameworks/GPUSupport.framework/Versions/A/Libraries/libGPUSupportMercury.dylib",
          "signature_string": "Graphics kernel error: 0x067900fc\n"
        }
      ]
    }
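The error number has to be pulled out of the "signature_string" record before it can be compared with the values discussed in this bug. Here is a minimal sketch in C, assuming the string always has the form shown in the records above. The function name is hypothetical and not part of any Apple or Mozilla API.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: extracts the numeric error from a mac_crash_info
 * "signature_string" such as "Graphics kernel error: 0x1be385f9\n".
 * Returns 1 and stores the value on success, returns 0 if the string
 * does not have that form. */
int parse_graphics_kernel_error(const char *signature, uint32_t *error_out)
{
    unsigned int value = 0;

    assert(error_out != NULL);
    if (signature == NULL)
        return 0;
    /* The literal "0x" is matched first, then %x reads the hex digits. */
    if (sscanf(signature, "Graphics kernel error: 0x%x", &value) != 1)
        return 0;
    *error_out = (uint32_t)value;
    return 1;
}
```

Given the second record of the first example above, this stores 0x1be385f9. For a string like "abort() called" it returns 0.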
Blocks: 1711944

The error number 0x067900fc from comment #0 is set from the following method in the AMDRadeonX4000 kernel-mode graphics driver:

    AMDRadeonX4000_AMDSIGLContext::processSidebandToken(IOAccelCommandStreamInfo& info);

This happens when a qword value at offset 0xfe0 in the AMDRadeonX4000_AMDSIGLContext object is unexpectedly 0 or NULL. From an error message elsewhere in the code, this value is fCurrentDataBuffer[0]. The "token" value (according to my theory from comment #0) is '0' -- presumably because the error didn't happen processing a particular "token".

The error number 0x1be385f9 is set from the following method, also in the AMDRadeonX4000 kernel-mode graphics driver:

    AMDRadeonX4000_AMDSIGLContext::process_ResourceList(IOAccelCommandStreamInfo& info);

It happens just after an error return from the following method:

    AMDRadeonX4000_AMDAccelResource::BatchPrepare(AMDRadeonX4000_AMDGraphicsAccelerator*, AMDRadeonX4000_AMDAccelResource* const*, unsigned int);

Here the "token" is 0x85, which I think just means "ResourceList".

Edit: fCurrentDataBuffer[2] can contain up to two elements, each of which is an AMDRadeonX4000_AMDAccelResource* (aka IOAccelResource2*) object. These elements are written by calls to IOAccelContext2::process_token_BindDataBuffer(IOAccelCommandStreamInfo& info). "BindDataBuffer"'s "token" id is '0'.

Apple might find this mac_crash_info data useful, especially as more of it accumulates. Does Mozilla have an Apple contact we can CC on this bug?

Crash Signature: [@ __pthread_kill | abort | gpusGenerateCrashLog.cold.1 ]

Here's a followup on what I said in comment #0, with more detail and a few corrections.

The "sideband buffer" is a memory buffer used by graphics drivers, which is double-mapped into kernel-space and user-space. At least on AMD Radeon hardware, there's one buffer per IOAccelContext, and (at least in Firefox and Chrome) there's one IOAccelContext per browser session. User-mode graphics drivers write "tokens" to this buffer, then send it to kernel-mode graphics drivers to be processed (via calls to gpusSubmitDataBuffers() and IOAccelContextSubmitDataBuffersExt2()).

On AMD Radeon hardware and in Firefox and Chrome, the first three tokens in every batch to be processed are:

Token id	Name
0x1		Start
0x0		BindDataBuffer
0x85		ResourceList

Here's the format of the first part of every "token":

    #include <stdint.h>

    typedef struct _token_header {
      // Offset in IOAccelResource2* fCurrentDataBuffer[2]
      uint8_t texbuf_id;     // Offset 0x0
      uint8_t token_id;      // Offset 0x1
      // The size, in "words" (actually dwords), of the token
      uint16_t token_size;   // Offset 0x2
    } token_header;

The first context error from comment #0 (0x067900fc) is set in AMDRadeonX4000_AMDSIGLContext::processSidebandToken(IOAccelCommandStreamInfo& info), just after a call to IOAccelGLContext2::processSidebandToken(IOAccelCommandStreamInfo& info) in its superclass. This error is set if the token's texbuf_id and token_id are both '0' and fCurrentDataBuffer[0] is NULL. So the token in question is "BindDataBuffer", and the error is that IOAccelContext2::process_token_BindDataBuffer(IOAccelCommandStreamInfo& info) failed to set fCurrentDataBuffer[0] to the AMDRadeonX4000_AMDAccelResource* corresponding to the "resource id" in the token's uint32_t field at offset 0x8.

The second context error from comment #0 (0x1be385f9) is set in AMDRadeonX4000_AMDSIGLContext::process_ResourceList(IOAccelCommandStreamInfo& info), just after a failed call to AMDRadeonX4000_AMDAccelResource::BatchPrepare(). A "ResourceList" token is just what it says -- an array of uint32_t "resource ids", starting at offset 0x8 in the token, and continuing to its end (some of these may be '0', but they are ignored by process_ResourceList()). BatchPrepare() is called after process_ResourceList() successfully creates an array of AMDRadeonX4000_AMDAccelResource* by looking them up by their "resource ids". BatchPrepare() is complex, and I haven't yet analyzed why it can fail.
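The "ResourceList" layout described above can be sketched in C. This is only an illustration of the description, not code from any driver. It assumes that token_size counts the whole token (header included), that the buffer is in host byte order, and that nothing needs to be known about the dword at offset 0x4. The function and macro names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct _token_header {
    uint8_t texbuf_id;    /* Offset 0x0 */
    uint8_t token_id;     /* Offset 0x1 */
    uint16_t token_size;  /* Offset 0x2, in dwords */
} token_header;

#define TOKEN_ID_RESOURCE_LIST 0x85u
#define RESOURCE_IDS_OFFSET 0x8u /* resource ids start here, per the text above */

/* Hypothetical helper: copies the non-zero resource ids of a "ResourceList"
 * token into ids_out and returns how many were stored. Returns 0 if the
 * bytes do not look like a complete ResourceList token. */
size_t collect_resource_ids(const uint8_t *token, size_t available_bytes,
                            uint32_t *ids_out, size_t ids_capacity)
{
    token_header header;
    size_t token_bytes;
    size_t offset;
    size_t count = 0;

    assert(ids_out != NULL || ids_capacity == 0);
    if (token == NULL || available_bytes < sizeof header)
        return 0;
    memcpy(&header, token, sizeof header);

    /* Assumption: token_size covers the whole token, header included. */
    token_bytes = (size_t)header.token_size * 4u;
    if (header.token_id != TOKEN_ID_RESOURCE_LIST ||
        token_bytes < sizeof header || token_bytes > available_bytes)
        return 0;

    for (offset = RESOURCE_IDS_OFFSET; offset + 4u <= token_bytes; offset += 4u) {
        uint32_t id;
        memcpy(&id, token + offset, sizeof id);
        /* Ids of '0' are skipped, as process_ResourceList() is said to do. */
        if (id != 0 && count < ids_capacity)
            ids_out[count++] = id;
    }
    return count;
}
```

A hook library could use something like this to log which resource ids a batch refers to before it is submitted. The lookup from resource id to AMDRadeonX4000_AMDAccelResource* happens in the kernel and is not modeled here.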

All the "custom" error numbers we've seen so far indicate problems close to the beginning of the token array in the sideband buffer. I'm going to wait a while for more of these error numbers to accumulate. If they all match the patterns we've seen so far, I'll try to find ways to trigger them (using a HookCase hook library).

Each of these error numbers has three "fields": "nnnn:nn:nn".

I've managed to figure out the third of these fields -- it's the least significant byte of the "conventional" error number that would normally have been used in its place. I confirmed this by comparing calls to IOAccelContext2::setContextError(unsigned int error) in the AMDRadeonX4000 graphics kernel driver on macOS 10.15.7 (build 19H1030) to those in the same kernel driver on macOS 11.4 (build 20F71).

So error number 0x067900fc on macOS 11.4 and up is comparable to error number 0xfffffffc ("internal error") on prior versions of macOS. Likewise, error number 0x1be385f9 is comparable to 0xfffffff9 ("out of memory"). These two "conventional" errors have been by far the most common ones on AMD Radeon hardware. So it looks like the new, "custom" error numbers are tracking the same bug or bugs.
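The "nnnn:nn:nn" split and the mapping back to a "conventional" error can be written out as a short C sketch. The struct, field and function names are hypothetical. The bug only establishes the meaning of the second field (token id) and the third (low byte of the conventional error).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names for the three fields of a custom context error. */
typedef struct custom_context_error {
    uint16_t first_field; /* meaning not yet understood */
    uint8_t token_id;     /* id of the token being processed */
    uint8_t error_low;    /* low byte of the conventional error */
} custom_context_error;

custom_context_error split_context_error(uint32_t error)
{
    custom_context_error fields;
    fields.first_field = (uint16_t)(error >> 16);
    fields.token_id = (uint8_t)((error >> 8) & 0xffu);
    fields.error_low = (uint8_t)(error & 0xffu);
    return fields;
}

/* Rebuilds the older style error. 0xffffff00 | low, read as a signed
 * 32-bit value, equals low - 256, so 0xf9 gives -7 and 0xfc gives -4. */
int32_t conventional_error(uint32_t error)
{
    return (int32_t)(error & 0xffu) - 256;
}

/* Checks the two error numbers from this bug against the analysis above. */
void check_errors_from_this_bug(void)
{
    custom_context_error resource_list = split_context_error(0x1be385f9u);
    custom_context_error bind_buffer = split_context_error(0x067900fcu);

    assert(resource_list.first_field == 0x1be3 && resource_list.token_id == 0x85);
    assert(bind_buffer.first_field == 0x0679 && bind_buffer.token_id == 0x00);
    assert(conventional_error(0x1be385f9u) == -7); /* "out of memory" */
    assert(conventional_error(0x067900fcu) == -4); /* "internal error" */
}
```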

I still haven't figured out the first field. I can see patterns in them: For example, calls to setContextError() that are close together have very similar first fields, and the one at the larger address always has the larger first field. So they are offsets of some kind, but I haven't yet figured out what kind.

Edit: It's possible the first field is source code line numbers. But we won't ever have the source code, so this won't help us.

As I noted above, the second field is the "token id" of the token currently being processed.

Fifteen crashes with custom error numbers have now accumulated, and so far the only examples have been the two already mentioned. I'll add them to the summary to make this bug easier to find.

Summary: Crashes at gpusGenerateCrashLog with customized graphics kernel errors → Crashes at gpusGenerateCrashLog with customized graphics kernel errors 0x1be385f9 and 0x067900fc

Possibly relevant, since its crash stack contains "AMDRsrcList_NewCmdBuf":

bp-3c45560b-353c-4f3d-acbd-1b3a40210602

Also possibly relevant, since its crash stack contains "AMDRsrcList_AddRsrc":

bp-d00b5aa7-4c0d-4681-937b-a468e0210423

Summary: Crashes at gpusGenerateCrashLog with customized graphics kernel errors 0x1be385f9 and 0x067900fc → Crashes at gpusGenerateCrashLog with customized graphics kernel errors '0x1be385f9' and '0x067900fc'
Blocks: gfx-triage
Severity: -- → S4
Flags: needinfo?(mstange.moz)
Priority: -- → P3

For a while I thought Apple's new error numbers might give me leverage on the bug or bugs that (from the quantity of crash reports) badly plague apps that use OpenGL with Apple's AMDRadeonX4000 graphics drivers. Now I no longer think so. Crashes with both error numbers still peter out in the trackless jungle of IOAccelResource2::prepare() and AMDRadeonX4000_AMDAccelResource::prepare() -- the same place I ended up at bug 1576767. None of this code sets any "context errors" (since they aren't methods of IOAccelContext2). So the new error numbers don't help here.

I've tried some of the same tricks I used at bug 1576767, and some new ones, to trigger these crashes. None worked. I messed with the "resource" objects "attached" to the "BindDataBuffer" and "ResourceList" tokens. I also messed with those tokens themselves. Virtually everything I tried triggered a crash -- just not the right ones.

There's probably nothing more I can do until I gain new insights or information.

One more thing: I'm virtually certain these bugs have nothing to do with "resource" objects being deleted/destroyed prematurely.

Attachment #9224182 - Attachment is obsolete: true
Crash Signature: [@ __pthread_kill | abort | gpusGenerateCrashLog.cold.1 ] → [@ __pthread_kill | abort | gpusGenerateCrashLog.cold.1 ] [@ libsystem_kernel.dylib@0x792e]

(Following up comment #9)

I may have spoken too soon. Just now I managed to trigger a "BindDataBuffer" (0x067900fc) crash:

bp-27c18015-7f42-4f41-b037-1f03d0210605

I did it by changing the "resource id" in the "BindDataBuffer" (0x00) tag from the one originally present to one that had already been present at least once in a "ResourceList" (0x85) tag's list of "attachments". The resource was valid (not deleted or malformed), but just the wrong kind.

"Resources" (created by IOAccelResourceCreate() or IOAccelResourceCreateDataBuffer() in the IOAccelResource private framework) can have various types. Those created by IOAccelResourceCreateDataBuffer() seem to always have type 0xa. And these are (normally) the only ones whose "resource ids" are present in the "BindDataBuffer" tag. Those created by IOAccelResourceCreate() can have many different types, but never 0xa. These other types are what's normally present in the "ResourceList" tag's list of attachments.

I haven't yet managed to trigger a "ResourceList" (0x1be385f9) crash. When I tried changing one of this tag's resource ids (in its list of attachments) to one of type 0xa (which had already been present at least once in a "BindDataBuffer" tag), I got another 0x067900fc crash. Doing this didn't mess up the "ResourceList" tag. But apparently it did mess up that 0xa resource (or objects linked to it) badly enough that it triggered a crash the next time it (or one of its linked objects) was used in a "BindDataBuffer" tag.

I'm not sure where that leaves us. I may have found out that these crashes are caused by the wrong kind of resource's id being present in a tag. But I can't really be sure of that until I manage to use the same strategy to trigger 0x1be385f9 crashes.
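The pairing of tags and resource types observed so far can be stated as a small C predicate. This is only a sketch of the working hypothesis, not something the drivers are known to check. The names are hypothetical, and the width of the "type" value is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum {
    TOKEN_ID_BIND_DATA_BUFFER = 0x00,
    TOKEN_ID_RESOURCE_LIST = 0x85,
    /* Type seen on resources from IOAccelResourceCreateDataBuffer(). */
    RESOURCE_TYPE_DATA_BUFFER = 0x0a
};

/* Hypothetical check: does a resource of this type normally appear in a
 * tag with this id? Based only on the observations in this comment. */
bool resource_type_fits_token(uint8_t token_id, uint32_t resource_type)
{
    switch (token_id) {
    case TOKEN_ID_BIND_DATA_BUFFER:
        return resource_type == RESOURCE_TYPE_DATA_BUFFER;
    case TOKEN_ID_RESOURCE_LIST:
        return resource_type != RESOURCE_TYPE_DATA_BUFFER;
    default:
        return true; /* nothing has been observed about other tags */
    }
}

/* The two mismatches tried above both fail this check. */
void check_observed_pairings(void)
{
    assert(resource_type_fits_token(TOKEN_ID_BIND_DATA_BUFFER, 0x0a));
    /* 0x01 stands in for "any type other than 0xa". */
    assert(!resource_type_fits_token(TOKEN_ID_BIND_DATA_BUFFER, 0x01));
    assert(!resource_type_fits_token(TOKEN_ID_RESOURCE_LIST, 0x0a));
}
```

A HookCase hook library could apply a check like this to outgoing tags, to see whether the user-mode drivers ever produce such a mismatch on their own.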

If I'm right, these crashes are caused by a bug or bugs in Apple's user-mode AMDRadeonX4000 graphics drivers. That's good, because it's a lot easier to figure out user-mode bugs (using tools like HookCase) than it is to figure out kernel-mode bugs. But it still won't be easy. This bug will only have moved from "almost impossible" to "very difficult".

Attachment #9225334 - Attachment is obsolete: true

One of my "BindDataBuffer" tests triggered this kernel panic.

It's not terribly surprising that you can trigger kernel panics by messing with the sideband buffer. And if I'm right about this being a user-mode bug, Firefox users might also be seeing them.

They've occasionally been reported in the past. Mozilla's (and my own) response has been to say it must have been an Apple bug, and to throw up our hands. They surely are Apple bugs. But now we might be able to pin the blame more precisely. Whoever sees one of these reports, please check the user's graphics hardware, and note whether or not they're using AMDRadeonX4000 drivers.

My panic report is symbolized, which is very unusual. It's because I've set the keepsyms=1 kernel boot arg (using nvram boot-args="keepsyms=1").

See Also: → 1535120

Just now I managed to trigger a 0x1be385f9 crash:

bp-05d12616-888e-42d9-997f-287e30210609

But I didn't use the strategy I described in comment #13. Instead I used a slightly corrupt IOAccelResource object of type 0xc0 ("VidMemShared"). I corrupted the data passed to IOAccelResourceCreate() to create it. By "slightly corrupt" I mean not enough to make this call fail, but enough to make the kernel mode driver fail (and set the 0x1be385f9 context error) while processing a "ResourceList" tag that includes this object (its resource id).

I'm still not sure where this leaves us. But I did find out that Safari doesn't use "VidMemShared" objects (though Chrome does). And there's an underhanded trick I can play to make Firefox not use them, without any obvious loss of quality (though with perhaps some loss of performance). If I can find a less underhanded way to do this, I'll write a patch for it. It's just possible that it will get rid of these crashes (at least the "out of memory" ones).

Of course we'd want to hide this change behind a pref. But if this patch gets landed, I'd like the pref to be on for a week or so, to see what effect it has on Mozilla's crash stats.

And there's an underhanded trick I can play to make Firefox not use them, without any obvious loss of quality (though with perhaps some loss of performance).

My trick makes Firefox use objects of type 0x80 ("SysMemShared") instead.

No longer blocks: gfx-triage

(Following up comment #16)

The same "slight corruption" also triggers 0x1be385f9 crashes when used with IOAccelResource objects of type 0x40 ("Standard"). These objects are much more common than "VidMemShared" objects, and Firefox can't do without them. Some of them are also much larger. So (presumably) avoiding the use of "VidMemShared" objects won't help here. And my (guarded) optimism in comment #16 was misplaced.

By the way, my "slight corruption" was to drastically increase a resource object's "resident size" -- the amount of space it takes up (in kernel memory) when it's "wired" into VRAM (GPU RAM) or system memory (ordinary RAM). Doing this doesn't stop an IOAccelResource object from being created (by IOAccelResourceCreate() in user-mode code). But it does cause IOAccelResource2::prepare() to fail in kernel-mode code while processing a ResourceList tag (type 0x85) containing this object's resource id.

This is an Apple bug or design flaw. IOAccelResourceCreate() communicates directly with kernel-mode code, which could call IOAccelResource2::prepare() and return an error if it fails. My tests have shown that IOAccelResourceCreate() failures rarely (if ever) cause crashes. But IOAccelResource2::prepare() failures always trigger crashes when called via user-mode calls to gpusSubmitDataBuffers().

Recent developments, post comment #9, made me temporarily more optimistic. But now I think what I said there is largely correct. Apple's customized error codes did allow me to learn more about what's going on with this bug's crashes. But it's still not actionable, and I'm not going to pursue this research any further. Only some big new insight or piece of information will change that.

Here's where things stand, as I now see it:

This bug's error codes (0x1be385f9 and 0x067900fc) correspond to two distinct kinds of crash. They're probably not related.

0x1be385f9 (aka 0xfffffff9/-7) crashes probably happen because of the Apple bug or design flaw that I outlined in comment #18. I suspect they're much more likely to happen on systems that drive lots of pixels -- especially those with several external monitors. They're real "out of memory" crashes, at least in a sense -- they happen when a system is close to the limits of available VRAM and wireable RAM. But they can happen on systems with lots of "available physical memory". They're kernel-mode "out of memory" crashes, as distinct from user-mode "out of memory" crashes.

0x067900fc (aka 0xfffffffc/-4) crashes are harder to understand. They might be caused by user-mode graphics drivers attaching the wrong kind of IOAccelResource object to a "BindDataBuffer" tag (of type 0x0). That, so far, is the only way I've found to trigger them. But I could have missed something.

Crash Signature: [@ __pthread_kill | abort | gpusGenerateCrashLog.cold.1 ] [@ libsystem_kernel.dylib@0x792e] → [@ __pthread_kill | abort | gpusGenerateCrashLog.cold.1 ] [@ __pthread_kill | pthread_kill ] [@ libsystem_kernel.dylib@0x792e]
Flags: needinfo?(mstange.moz)
Crash Signature: [@ __pthread_kill | abort | gpusGenerateCrashLog.cold.1 ] [@ __pthread_kill | pthread_kill ] [@ libsystem_kernel.dylib@0x792e] → [@ __pthread_kill | abort | gpusGenerateCrashLog.cold.1 ] [@ __pthread_kill | pthread_kill | abort | gpusGenerateCrashLog.cold.1 ] [@ __pthread_kill | pthread_kill ] [@ libsystem_kernel.dylib@0x792e]

As of a few days ago, I seem to encounter this crash once per day. It tends to happen when switching between virtual desktops, but so far I haven't been able to isolate a reliable trigger.

macOS 11.6.3
Firefox Nightly 99 - 100
Builds 20220307093830 - 20220309094444

2018 MacBook Pro
AMD Radeon Pro 560X 4 GB
Intel UHD Graphics 630 1536 MB

The AMD GPU is always in use because an external display is attached.

My crash reports so far:

https://crash-stats.mozilla.org/report/index/f5b447cc-b093-44a6-9fa6-9b8650220307
https://crash-stats.mozilla.org/report/index/6118528d-6bb3-4a12-a2ca-cac840220309
https://crash-stats.mozilla.org/report/index/14efc082-bc5d-42ac-8a72-95ef70220310

I'm happy to collect any other data, run investigations, or whatever else might help to investigate this.

So you're getting the 0x1be385f9 (aka 0xfffffff9/-7) context error, which means "out of memory". As I said in comment #19, my hunch is that these are more likely to happen the more pixels your graphics card is driving. So please list all your displays (internal and external), and their sizes in pixels. Also try using gfxCardStatus to force your displays to use your built-in Intel graphics hardware, to see what difference this makes. And (if this is feasible) try disconnecting one or more of your external displays for a day or two, to see if this makes a difference.

Finally, do you go a long time between rebooting your machine? If so, try rebooting it at least once a day, to see if this makes a difference.

As I said above, these crashes are Apple bugs or design flaws. We're not going to be able to fix them. But it'd still be nice to know more about them, and to find possible workarounds.

Thanks for the reply! 😄

Displays

  • Built-in: 2880 x 1800
  • External: 3840 x 2160

Intel GPU

The existence of the external display is listed as depending on the AMD GPU, so gfxCardStatus refuses to switch back to Intel. I can try removing the display for a bit (causing a switch back to Intel) and see how that goes.

Uptime

Looks like it has been around 24 days since the last reboot, so I can explore that path as well.

macOS Updates

I'm also a major version behind current macOS (I usually wait a long time for Apple to sort out bugs before doing a major version upgrade), so it's possible the macOS 12 series could change behaviour here.

Thanks for the information :-)

I can try removing the [external] display for a bit (causing a switch back to Intel) and see how that goes.

When you do this, could you use gfxCardStatus to force apps to use your AMD graphics hardware? I'm most interested in seeing what effect this change has on Apple's AMD graphics drivers.

One thing I forgot to ask about: Do you have other graphics-intensive apps running at the same time as Firefox? If so, try quitting them, to see if this makes a difference.

(In reply to comment #24)

I'm also a major version behind current macOS (I usually wait a long time for Apple to sort out bugs before doing a major version upgrade), so it's possible the macOS 12 series could change behaviour here.

Crash stats indicate that crashes with the 0x1be385f9 context error continue to happen with (more or less) the same frequency in even the latest release of macOS 12 (12.2.1 build 21D62). So Apple hasn't fixed this bug, and upgrading to macOS 12 is unlikely to help. Still, though, it might make a difference. I'd like you to save this change for last. It's irreversible, after all.

https://crash-stats.mozilla.org/search/?mac_crash_info=~Graphics%20kernel%20error%3A%200x1be385f9&date=%3E%3D2022-02-10T22%3A03%3A00.000Z&date=%3C2022-03-10T22%3A03%3A00.000Z&_facets=signature&_facets=mac_crash_info&_facets=platform_version&_sort=-date&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-platform_version

I rebooted towards the end of last week and so far, I haven't hit this crash again yet, so perhaps that has cleared some state or stemmed an internal leak of some kind.

(In reply to Steven Michaud [:smichaud] (Retired) from comment #25)

One thing I forgot to ask about: Do you have other graphics-intensive apps running at the same time as Firefox? If so, try quitting them, to see if this makes a difference.

Hmm, not that I am consciously aware of at least... Looking at Activity Monitor's GPU Time column, iTerm2 is the only other app with a decent chunk of time, though Firefox uses much more GPU time by far. I do occasionally open GPU intensive sites like Figma in Firefox (which uses a canvas and forces discrete GPU mode when loaded).

One related note to mention: I do have many, many open Firefox windows, each with many tabs, so that does make my workflow unusual and different from the average user.

(In reply to J. Ryan Stinnett [:jryans] (Use needinfo, replies may be slow) from comment #28)

I rebooted towards the end of last week and so far, I haven't hit this crash again yet, so perhaps that has cleared some state or stemmed an internal leak of some kind.

...

One related note to mention: I do have many, many open Firefox windows, each with many tabs, so that does make my workflow unusual and different from the average user.

It occurs to me that restarting Firefox might have had the same benefits as rebooting your computer -- especially since you run Firefox with so many open windows and tabs. So the next time these crashes start up again (if they do), try just restarting Firefox.

Edit: Err, oops. Cancel that request. After the first crash has happened, it's already too late to test restarting Firefox. Quitting Firefox normally might cause graphics driver resources to be freed that aren't freed on a crash. But I can't think of any good way for you to test that hypothesis.

It happened again today after 4 days of machine uptime. As a next experiment, I'll try removing the external monitor while still forcing the AMD GPU.

Today the crash occurred again without the external display, so we now know it's possible to trigger with and without the external display. For now, I'll reconnect the display.

While looking at logs after today's crash, I did notice the following was logged just before the time of the crash (00:51:56):

Mar 17 00:51:54 onett firefox[25643]: getattrlist failed for /System/Library/Extensions/AppleIntelKBLGraphicsGLDriver.bundle/Contents/MacOS/AppleIntelKBLGraphicsGLDriver: #2: No such file or directory
Mar 17 00:51:54 onett firefox[25643]: getattrlist failed for /Library/GPUBundles/AMDRadeonX4000GLDriver.bundle/Contents/MacOS/ATIRadeonX4000SCLib.dylib: #2: No such file or directory
Mar 17 00:51:54 onett firefox[25643]: getattrlist failed for /System/Library/Extensions/AMDRadeonX4000GLDriver.bundle/Contents/MacOS/ATIRadeonX4000SCLib.dylib: #2: No such file or directory
Mar 17 00:51:54 onett firefox[25643]: getattrlist failed for /System/Library/Frameworks/OpenGL.framework/Resources//GLRendererFloat.bundle/GLRendererFloat: #2: No such file or directory

...but after some searching, it seems like these logs may be "normal".

I assume you rebooted your computer before testing without the external display.

While looking at logs after today's crash, I did notice the following was logged just before the time of the crash (00:51:56):

Where were these messages logged? In the Console app?

What's "onett"?

I notice that those files really are missing on my macOS 11.6.5 VM. I assume they're also missing on yours. On my macOS 10.15.7 build 19H1824 VM only the second file is missing. Did you upgrade to macOS 11 from macOS 10.15.X?

For what it's worth, I see similar messages in the system log (system.log) on my (Intel) MacBook Pro running macOS 11.6.4. I've never seen any of this bug's crashes there. So yes, they do seem to be unrelated. And yes, I did upgrade that machine from macOS 10.15.7 to macOS 11.

(In reply to Steven Michaud [:smichaud] (Retired) from comment #32)

I assume you rebooted your computer before testing without the external display.

Yes.

Where were these messages logged? In the Console app?

Yes, in system.log, viewable in the Console app.

What's "onett"?

Ah, that's the machine name.

I notice that those files really are missing on my macOS 11.6.5 VM. I assume they're also missing on yours. On my macOS 10.15.7 build 19H1824 VM only the second file is missing. Did you upgrade to macOS 11 from macOS 10.15.X?

Yes, they are really missing on mine as well. Yes, I upgraded from macOS 10.15.x.

(In reply to J. Ryan Stinnett [:jryans] (Use needinfo, replies may be slow) from comment #31)

Today the crash occurred again without the external display, so we now know it's possible to trigger with and without the external display. For now, I'll reconnect the display.

(Following up comment #23)

So you're getting the 0x1be385f9 (aka 0xfffffff9/-7) context error, which means "out of memory". As I said in comment #19, my hunch is that these are more likely to happen the more pixels your graphics card is driving.

I now consider this hunch disproven -- at least with regard to this custom error. I don't know where to go from here. I'll let you know if I think of something.

12 days have now passed since my last crash (2022-03-17), and I believe I have been doing all the same things in terms of my daily workflow... It's great that it seems to be better, but also a bit sad to have no idea why... 😅

Crash stats can now be searched on "Mac memory pressure" -- which can be Normal, Warning or Critical. jryans, I just checked your crash reports from comment 22, comment 30 and comment 31. I found that all of them have "MacMemoryPressure" (in the Crash Annotations tab) set to Normal. I also found that the vast majority of those crashes with the 0xf9/-7 ("out of memory") context error also have their "Mac memory pressure" set to Normal. So "standard" memory pressure seems to play no role in the graphics driver "out of memory" context errors.

https://crash-stats.mozilla.org/search/?mac_crash_info=%40.%2AGraphics%20kernel%20error%3A%20.%2Af9.%2A&date=%3E%3D2021-09-29T19%3A55%3A00.000Z&date=%3C2022-03-29T19%3A55%3A00.000Z&_facets=signature&_facets=mac_crash_info&_facets=mac_memory_pressure&_sort=-date&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-mac_memory_pressure

Closing because no crashes reported for 12 weeks.

Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → WORKSFORME
Status: RESOLVED → REOPENED
Crash Signature: [@ __pthread_kill | abort | gpusGenerateCrashLog.cold.1 ] [@ __pthread_kill | pthread_kill | abort | gpusGenerateCrashLog.cold.1 ] [@ __pthread_kill | pthread_kill ] [@ libsystem_kernel.dylib@0x792e] → [@ __pthread_kill | abort | gpusGenerateCrashLog.cold.1 ] [@ __pthread_kill | pthread_kill | abort | gpusGenerateCrashLog.cold.1 ] [@ __pthread_kill | pthread_kill ] [@ libsystem_kernel.dylib@0x792e] [@ pthread_kill | abort | gpusKillClientExt ]
Resolution: WORKSFORME → ---

The bug is linked to a topcrash signature, which matches the following criterion:

  • Top 5 desktop browser crashes on Mac on release (startup)

:bhood, could you consider increasing the severity of this top-crash bug?

For more information, please visit auto_nag documentation.

Flags: needinfo?(bhood)
Flags: needinfo?(bhood)

Based on the topcrash criteria, the crash signatures linked to this bug are not in the topcrash signatures anymore.

For more information, please visit auto_nag documentation.

Sorry for removing the keyword earlier but there is a recent change in the ranking, so the bug is again linked to a topcrash signature, which matches the following criteria:

  • Top 20 desktop browser crashes on release (startup)
  • Top 5 desktop browser crashes on Mac on release (startup)

For more information, please visit BugBot documentation.

Based on the topcrash criteria, the crash signatures linked to this bug are not in the topcrash signatures anymore.

For more information, please visit BugBot documentation.
