Open Bug 1576767 Opened 5 years ago Updated 4 months ago

[10.15+11] Crash in [@ libsystem_kernel.dylib@0x744e] in mozilla::gl::GLContextCGL::SwapBuffers()

Categories

(Core :: Graphics, defect, P3)

Desktop
macOS
defect

Tracking

()

REOPENED
Tracking Status
firefox-esr60 --- wontfix
firefox-esr68 --- wontfix
firefox-esr78 --- wontfix
firefox67 --- wontfix
firefox68 --- wontfix
firefox69 --- wontfix
firefox70 - wontfix
firefox71 --- wontfix
firefox72 --- wontfix
firefox73 --- wontfix
firefox74 --- wontfix
firefox75 --- wontfix
firefox76 --- wontfix
firefox77 --- wontfix
firefox79 --- wontfix
firefox80 --- wontfix
firefox81 --- wontfix
firefox82 --- wontfix
firefox83 --- wontfix
firefox84 --- wontfix
firefox86 --- wontfix
firefox87 --- wontfix
firefox88 --- wontfix

People

(Reporter: marcia, Unassigned)

References

Details

(Keywords: crash, regression)

Crash Data

Attachments

(22 files, 18 obsolete files)


This bug is for crash report bp-ad07c25b-7d38-4168-86bb-f39420190826.

Seen while looking at macOS specific crashes: https://bit.ly/2MD8vll. Crashes started in 20190823214900. 6 crashes/5 installations.

Possible regression range based on Build ID: https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=dcfcd7909aff0ef81a3b884ead0745645c6d6670&tochange=9f96b6821f1dfe383b34a190df239dfe2463339e

Code was touched in Bug 1571253

Top 10 frames of crashing thread:

0 libsystem_kernel.dylib libsystem_kernel.dylib@0x744e 
1 libsystem_c.dylib libsystem_c.dylib@0x7fa37 
2 libGPUSupportMercury.dylib libGPUSupportMercury.dylib@0xb0a5 
3 libGPUSupportMercury.dylib libGPUSupportMercury.dylib@0x21e6 
4 AMDRadeonX4000GLDriver AMDRadeonX4000GLDriver@0x30640 
5 libGPUSupportMercury.dylib libGPUSupportMercury.dylib@0x35bb 
6 AMDRadeonX4000GLDriver AMDRadeonX4000GLDriver@0xabf9 
7 OpenGL OpenGL@0xd48a 
8 AppKit AppKit@0x383766 
9 XUL mozilla::gl::GLContextCGL::SwapBuffers gfx/gl/GLContextProviderCGL.mm:126

Flags: needinfo?(mstange)

I would suspect a Catalina beta bug except that I haven't updated to a new beta recently, while Firefox has been updated semi-regularly and just became crashy yesterday.

So, for what it's worth, the upgrade I performed that started crashing was from build 20190822215453. So about three days or so worth of changes.

All of the crashes are occurring on 10.15.0 19A536g.

I would suspect bug 1574538.

We call [glContext flushBuffer] on an NSOpenGLContext that is not attached to an NSView and that only renders into (non-0) framebuffers. On pre-10.15, this seems to be handled as just a glFlush(), but maybe it confuses 10.15. We can just call glFlush() instead.

Flags: needinfo?(mstange)

I don't know if that would help, though. But it's worth a try.

Depends on: 1576968

I've attached a patch to bug 1576968. I don't have very high hopes for it making much of a difference though.

Assignee: nobody → mstange
Priority: -- → P2

I have a machine with that version of 10.15 now, but I haven't seen a crash yet. Is there anything particular you did to trigger the crash?

Flags: needinfo?(eshepherd)

:mstange -- So, I get this crash pretty much nonstop. It happens when I click links, it happens when I scroll using the scroll wheel or clicking the scroll bar, it even happens when I'm not even touching my computer. Happens when Firefox is in the foreground, happens when it's in the background. Happens when it's actually completely hidden. Just happens all the time. I can't go more than a few minutes before it bombs.

I will say that I have something like 38 windows open with a total of around 550-600 tabs among them all. Some of these are minimized into the Dock. Also, I use an addon that prevents tab content from loading until you click into the tab for the first time, so most of those tabs are not actually loaded.

Flags: needinfo?(eshepherd)

Is there a way for me to try this patch out without building it myself? I can do it if need be but it would be a big chunk of time out of my day that I can't easily spare. But I will do that if it's the only option.

:mstange - You said in comment 7 that you attached a patch but it's not actually here.

Flags: needinfo?(mstange)

Sorry, pasted the wrong bug number. The patch is in bug 1576968. I'll make a try build.

Flags: needinfo?(mstange)

Here's the build: target.dmg

I'll try it out and post back on the bug once I've got a feel for it.

:mstange - I very tentatively want to say things look promising. I started running your build a little over two hours ago and have been doing my normal work, without any crashes thus far. I'll update here again tomorrow.

My Firefox still has not crashed. This is looking really good.

Since installing the build you shared in comment 14 yesterday afternoon, I haven't crashed once, whereas I'd crashed 6 times on the 26th, 3 times on the 27th, and twice on the 28th -- and that's despite the fact that I'd been doing nearly all my work in Safari instead of Firefox to avoid the crashes. I've switched back to Firefox full-time since installing the build yesterday and it hasn't crashed once yet.

Seems like others are still crashing, even with the fix :(
https://crash-stats.mozilla.org/report/index/80b5e83c-bb2b-4f85-b44a-400ad0190830 is a report with build ID 20190829094151, which is from m-c push 23824765c6aa026ccc3e3aea1c851c07ab8937ee, and the patch for bug 1576968 was included in the m-c push 28ed211ab542dfb8c750688701f1353db47a912e, which happened 5.5 hours earlier.

I've symbolicated the stack from bp-80b5e83c-bb2b-4f85-b44a-400ad0190830:

libsystem_kernel.dylib@0x744e
libsystem_c.dylib@0x7fa37
libGPUSupportMercury.dylib@0xb0a5
libGPUSupportMercury.dylib@0x21e6
AMDRadeonX4000GLDriver@0x30640
libGPUSupportMercury.dylib@0x35bb
AMDRadeonX4000GLDriver@0xabf9
mozilla::gl::GLContextCGL::SwapBuffers()

is:

__pthread_kill (in libsystem_kernel.dylib) + 10
abort (in libsystem_c.dylib) + 119
gpusGenerateCrashLog.cold.1 (in libGPUSupportMercury.dylib) + 93
gpusGenerateCrashLog (in libGPUSupportMercury.dylib) + 88
gpusKillClientExt (in AMDRadeonX4000GLDriver) + 8
gpusSubmitDataBuffers (in libGPUSupportMercury.dylib) + 163
glrATI_Hwl_SubmitPacketsWithToken (in AMDRadeonX4000GLDriver) + 109
mozilla::gl::GLContextCGL::SwapBuffers()
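
For readers unfamiliar with symbolication, the mapping between the two listings above can be sketched as a lookup in a sorted symbol table: find the last function that starts at or before the raw module offset, and report the distance from its start. This is a minimal illustration, not the tool that was actually used. The `Symbol` type and both helper functions are hypothetical, and the start offsets are back-computed from the listings themselves (raw offset minus the "+ N" delta, for example 0x35bb - 163 = 0x3518), so they only hold for this particular build of libGPUSupportMercury.dylib.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// One entry of a per-module symbol table: the module-relative start
// offset of a function and its name. (Hypothetical type for illustration.)
struct Symbol {
    std::uint64_t start;
    std::string name;
};

// Resolve a module-relative offset to "name + delta" by finding the last
// symbol whose start is <= offset. `table` must be sorted by `start`.
// Simplification: a symbol is assumed to extend up to the next symbol.
std::string symbolicate(const std::vector<Symbol>& table, std::uint64_t offset) {
    assert(std::is_sorted(table.begin(), table.end(),
                          [](const Symbol& a, const Symbol& b) { return a.start < b.start; }));
    // upper_bound returns the first symbol that starts *after* offset.
    auto it = std::upper_bound(table.begin(), table.end(), offset,
                               [](std::uint64_t off, const Symbol& s) { return off < s.start; });
    if (it == table.begin()) {
        return "<unknown>";  // offset precedes every known symbol
    }
    --it;
    return it->name + " + " + std::to_string(offset - it->start);
}

// Start offsets back-computed from the raw and symbolicated stacks in this
// comment. Illustrative only; not taken from a real symbol file.
std::vector<Symbol> mercury_symbols() {
    return {
        {0x218e, "gpusGenerateCrashLog"},
        {0x3518, "gpusSubmitDataBuffers"},
        {0xb048, "gpusGenerateCrashLog.cold.1"},
    };
}
```

With this table, `symbolicate(mercury_symbols(), 0x35bb)` yields "gpusSubmitDataBuffers + 163", matching the frame shown above.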

This happens if the call to IOAccelGLContextSubmitDataBuffersExt2 in gpusSubmitDataBuffers returns an error, as far as I can tell. gpusGenerateCrashLog produces an error message with an error code, but our crash reporter doesn't capture that error message, so it is lost and neither we nor Apple see it. If we used the Apple crash reporter instead, Apple would hear about these crashes automatically and get the error message and error code.

I'm not sure if this is our bug or Apple's bug.

(In reply to Markus Stange [:mstange] from comment #19)

but our crash reporter doesn't capture that error message

I've filed bug 1577886 about this.

Weirdly, tonight it began to crash for me again -- after several days of working fine without any crashes at all. It's crashed twice in the last few minutes, with the same signature. I was playing around with the DPI related settings in about:config at the time, trying to adjust things so my two displays could present things differently than the default.

I do wish we participated in Apple's crash reporting, if for no other reason than to handle those cases where it's on their end...

This morning, my Firefox has resumed crashing, and is now stuck crashing immediately on startup while starting to render pages, every time I launch it. :(

Adding 71 as affected. Still a fairly low volume crash.

It's interesting that this seems to come and go. It will happen over and over and over again, then taper off and sometimes let me have several days before the next crash. But it always comes back.

If this is an issue of an interaction between Firefox and the updated Catalina graphics driver, then it may be necessary to ensure that the machine you're using is using the same driver (and possibly also the same graphics chipset).

This may explain why the issue is uncommon -- if it's limited to people not only using the Catalina beta, but also using Macs with a specific graphics chip or set of chips that share a common driver.

This is another bug that might benefit from an analysis using a HookCase hook library (https://github.com/steven-michaud/HookCase). HookCase doesn't currently support macOS 10.15 (Catalina), but I hope to add support a few weeks after Catalina is released (gets out of beta).

Add signatures for other versions of macOS 10.15 beta.

Crash Signature: [@ libsystem_kernel.dylib@0x744e] → [@ libsystem_kernel.dylib@0x744e] [@ libsystem_kernel.dylib@0x747a] [@ libsystem_kernel.dylib@0x72aa]

This bug may not be limited to Catalina.

Here's a similar crash (also in mozilla::gl::GLContextCGL::SwapBuffers()) that happens on macOS 10.14 (Mojave):

https://crash-stats.mozilla.com/report/index/c64ec9ae-9b14-4f26-8e8a-ec7a20190908

There are a number of these crashes in Firefox releases. So they don't seem to have been triggered by recent changes in Firefox code.

Note that crashes in __pthread_kill are likely to have many different causes. But all the crashes I looked at in __pthread_kill on Catalina in Firefox releases seem to belong to this bug -- they all happen in mozilla::gl::GLContextCGL::SwapBuffers().

It'd really be nice to be able to search on signatures lower than the top of a crash stack.
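
Until such a search exists, the filtering described above can be done offline on downloaded stacks. The following is a rough sketch under the assumption that each report has already been reduced to a list of frame strings, top frame first; `stack_contains` and `filter_reports` are hypothetical helpers, not part of any crash-stats API.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Returns true if any frame of `stack` (top frame first) contains `needle`.
bool stack_contains(const std::vector<std::string>& stack, const std::string& needle) {
    return std::any_of(stack.begin(), stack.end(), [&](const std::string& frame) {
        return frame.find(needle) != std::string::npos;
    });
}

// Keep only the reports whose stack mentions `needle` at any depth, so that
// __pthread_kill crashes with unrelated causes are dropped.
std::vector<std::vector<std::string>> filter_reports(
    const std::vector<std::vector<std::string>>& reports, const std::string& needle) {
    std::vector<std::vector<std::string>> matching;
    for (const auto& stack : reports) {
        if (stack_contains(stack, needle)) {
            matching.push_back(stack);
        }
    }
    return matching;
}
```

Filtering on "GLContextCGL::SwapBuffers" would then separate this bug's crashes from other aborts that share the same top frame.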

Summary: Crash in [@ libsystem_kernel.dylib@0x744e] → Crash in [@ libsystem_kernel.dylib@0x744e] in mozilla::gl::GLContextCGL::SwapBuffers()

Let me know if there's anything at all I can do to help sort this out. I will run test builds using a debugger or with added logging or anything you need. Just say the word.

Actually, the signatures in comment #33 seem more directly relevant to bug 1535120, since they're all in glSwap_Exec. But since that bug's crashes are also (ultimately) in mozilla::gl::GLContextCGL::SwapBuffers(), these two bugs are probably related.

See Also: → 1535120

I reinstalled Nightly and created a new profile, then I set up sync and then it automatically reinstalled all my add-ons for me. Next, I opened windows with many of my previous profile's tabs in them (whittled down from around 350 tabs to about 190).

It is once again crashing on every startup, unless I start in safe mode. In safe mode it's fine. I don't know if that's because of add-ons, since those were installed before I got the tabs and such all reopened.

Presumably this is due to bypassing the acceleration and thus not hitting the graphics driver where it does.

Is there a way for me to see what's being passed into that SwapBuffers() call that's failing? I haven't done much debugging of Firefox itself but I'd be happy to do some here with guidance.

Hi Eric,

that's amazing - a 100% reproducible startup crash should be much easier to debug. Can you paste your machine description from "About This Mac" here? I'm particularly interested in the amount of GPU RAM you have.

How many windows do you have now, after you reduced the number of tabs? Do you think you can bisect between your windows so that you can find the window that causes the crash, or maybe the number of windows at which we start crashing? (Or maybe a combination? Who knows.)

(In reply to Eric Shepherd [:sheppy] from comment #35)

It is once again crashing on every startup, unless I start in safe mode. In safe mode it's fine. I don't know if that's because of add-ons, since those were installed before I got the tabs and such all reopened.

Presumably this is due to bypassing the acceleration and thus not hitting the graphics driver where it does.

That's right. (It's still using CoreAnimation in software mode, but no OpenGL.)

Is there a way for me to see what's being passed into that SwapBuffers() call that's failing? I haven't done much debugging of Firefox itself but I'd be happy to do some here with guidance.

Unfortunately SwapBuffers just calls glFlush(), which has no arguments. The work that glFlush does really depends on "every GL call since the last glFlush", so to really debug this we might need to disable compositor functionality until we stop crashing...

Flags: needinfo?(eshepherd)
Crash Signature: [@ libsystem_kernel.dylib@0x744e] [@ libsystem_kernel.dylib@0x747a] [@ libsystem_kernel.dylib@0x72aa] → [@ libsystem_kernel.dylib@0x744e] [@ libsystem_kernel.dylib@0x747a] [@ libsystem_kernel.dylib@0x72aa] [@ libsystem_kernel.dylib@0x1cb66 ]

So, this just happened to me again moments ago. I have exactly one Firefox window open. The crash happened while attempting to open a third tab. So, apparently it will happen with just three total tabs open.

https://crash-stats.mozilla.org/report/index/5416fbdc-af10-49f7-ad8e-5023b0191007

My system description follows:

Model Name: iMac
Model Identifier: iMac18,3
Processor Name: Quad-Core Intel Core i7
Processor Speed: 4.2 GHz
Number of Processors: 1
Total Number of Cores: 4
L2 Cache (per Core): 256 KB
L3 Cache: 8 MB
Hyper-Threading Technology: Enabled
Memory: 48 GB
Boot ROM Version: 178.0.0.0.0
SMC Version (system): 2.41f2

Graphics/Displays:
Chipset Model: Radeon Pro 580
Type: GPU
Bus: PCIe
PCIe Lane Width: x16
VRAM (Total): 8 GB
Vendor: AMD (0x1002)
Device ID: 0x67df
Revision ID: 0x00c0
ROM Revision: 113-D000AA-931
VBIOS Version: 113-D0001A1X-025
EFI Driver Version: 01.00.931
Metal: Supported, feature set macOS GPUFamily2 v1
Displays:
iMac:
Display Type: Built-In Retina LCD
Resolution: 5120 x 2880 Retina
Framebuffer Depth: 30-Bit Color (ARGB2101010)
Main Display: Yes
Mirror: Off
Online: Yes
Rotation: Supported
Automatically Adjust Brightness: No
DELL U2713H:
Resolution: 2560 x 1440 (QHD/WQHD - Wide Quad High Definition)
UI Looks like: 2560 x 1440 @ 59 Hz
Framebuffer Depth: 30-Bit Color (ARGB2101010)
Display Serial Number: C6F0K43P11JL
Mirror: Off
Online: Yes
Rotation: Supported
Automatically Adjust Brightness: No
Connection Type: DisplayPort
1601W:
Resolution: 1920 x 1080 (1080p FHD - Full High Definition)
UI Looks like: 1920 x 1080 @ 60 Hz
Framebuffer Depth: 30-Bit Color (ARGB2101010)
Display Serial Number: MMEK1JA001964
Mirror: Off
Online: Yes
Rotation: Supported
Automatically Adjust Brightness: No
Connection Type: DisplayPort

Flags: needinfo?(eshepherd)

Oh -- also, I now have only two addons installed (Tab Wrangler and WebXR Emulator), neither of which were installed on my previous profile that was crashing every startup.

I'll go ahead and track this for 70 so we can make sure to keep an eye on what happens after 70 release.

@:mstange - Is there any possibility this is related to enabling Core Animation in 70? The timing looks just about right in terms of when that got enabled here... certainly close enough to be asking the question.

Flags: needinfo?(mstange)

The other question is: should we raise a radar for this?

(In reply to Eric Shepherd [:sheppy] from comment #40)

@:mstange - Is there any possibility this is related to enabling Core Animation in 70? The timing looks just about right in terms of when that got enabled here... certainly close enough to be asking the question.

That is very much a possibility, yes! And that's the main reason why this worries me so much. I thought I had stated this explicitly somewhere, but I guess I never did.
The thing is: This crash is happening in OpenGL, but the OpenGL work we're doing in CoreAnimation mode is extremely similar to what we were doing before we had CoreAnimation. We're just rendering to an offscreen buffer instead of an onscreen buffer now. (We have seen one other instance where offscreen vs onscreen makes an actual difference to the driver, in bug 1586627.)

(In reply to Eric Shepherd [:sheppy] from comment #41)

The other question is: should we raise a radar for this?

Probably, yes. I just wish we could provide Apple with more information, e.g. the error code that we're dropping (due to bug 1577886), or even steps to reproduce. I'll file a radar once 70 is out.

Flags: needinfo?(mstange)
Crash Signature: [@ libsystem_kernel.dylib@0x744e] [@ libsystem_kernel.dylib@0x747a] [@ libsystem_kernel.dylib@0x72aa] [@ libsystem_kernel.dylib@0x1cb66 ] → [@ __pthread_kill | abort | gpusGenerateCrashLog.cold.1 ] [@ libsystem_kernel.dylib@0x744e] [@ libsystem_kernel.dylib@0x747a] [@ libsystem_kernel.dylib@0x72aa] [@ libsystem_kernel.dylib@0x1cb66 ]

Markus, did you file your radar bug? The crash volume on 70 release is not too high, so I'm going to stop tracking this issue for 70.

Flags: needinfo?(mstange)

As an experiment, I've set gfx.core-animation.enabled to false. We'll see if anything changes...

Was also running into this a lot, to the point where it was basically unusable. Setting gfx.core-animation.enabled to false seems to have helped.

Marking this as 10.15 specific since all the crashes happen only with that version. This is happening consistently in 71beta, but fairly sporadically on 72 nightly.

Summary: Crash in [@ libsystem_kernel.dylib@0x744e] in mozilla::gl::GLContextCGL::SwapBuffers() → [10.15] Crash in [@ libsystem_kernel.dylib@0x744e] in mozilla::gl::GLContextCGL::SwapBuffers()

:Marcia - That sporadic nature in nightly may be because we’re all turning off core animation in order to make the crashing stop.

So... this crash just happened for me again in Nightly (72.0a1 (2019-11-16)), even with Core Animation disabled. To say again, gfx.core-animation.enabled is still set to false, but it crashed with the same signature.

We have shipped our last beta for 71 but I would consider an uplift in RC week or in a dot release as this is a noticeable crash volume for a subset of the macOS population.

I've got an experimental fix for this coming. Today or tomorrow I'll post a tryserver build here for people who see these crashes.

Here's my tryserver job on treeherder.mozilla.org:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=e16d82932bc70e1fca116143a17044e6013b2866

And here's the build itself (the opt build):

https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/BtH_30lUTmWYINPzJv1n5g/runs/0/artifacts/public/build/target.dmg

Using both a disassembler and a HookCase hook library, I found that the gpusSubmitDataBuffers() method in libGPUSupportMercury.dylib can call _gpusKillClientExt() (in either AMDRadeonX4000GLDriver or AMDRadeonX5000GLDriver) when it detects a "reset". But this behavior can be overridden by setting libGPUSupportMercury.dylib's global no_crash_upon_reset variable to a non-zero value. That's what my patch does, via calls to the OpenGL framework's CGLSetParameter() and (indirectly) libGPUSupportMercury.dylib's gldSetInteger().

Eric and Peter, please test this build for a few days. Try it with CoreAnimation enabled and disabled. Let us know your results.

I'm not entirely sure what kind of "reset" this variable's name refers to, but I assume it's some kind of hardware reset. If hardware resets no longer trigger crashes, they may still cause at least temporary display corruption. Please keep an eye out for this.

My patch is a blunt instrument. As best I can tell, neither Safari nor Chrome change no_crash_upon_reset, though neither appears to be affected by these crashes. If so, there must be some other, more subtle way to avoid them. In the meantime, though, I hope my patch will provide a proof of concept. If it avoids this bug's crashes without causing undue display corruption, we're at least moving in the right direction.

Firefox calls gpusSubmitDataBuffers() on a secondary thread (the Compositor thread). Safari (actually its com.apple.WebKit.WebContent process) calls gpusSubmitDataBuffers() on the main thread, and Chrome (its Google Chrome Helper (GPU) process) calls it on a thread it names "CrGpuMain". This may help explain the difference.

I checked, and "CrGpuMain" really is the main thread.

no_crash_upon_reset is present in libGPUSupportMercury.dylib back to OS X 10.9, and seems to have exactly the same functionality. That's remarkably stable -- something must be using it. I was able to test my tryserver build back to macOS 10.12 -- in all cases it managed to set no_crash_upon_reset to 1. So I assume it will work all the way back to OS X 10.9 (the oldest version supported by Firefox).

Since I don't myself see this bug's crashes, I'm not able to test whether or not my tryserver build prevents them. That's up to Eric and Peter, and anyone else who sees these crashes.

(In reply to Eric Shepherd [:sheppy] from comment #44)

As an experiment, I've set gfx.core-animation.enabled to false. We'll see if anything changes...

Just a note, this no longer does anything on Nightly. CoreAnimation is now always enabled. I landed bug 1576390 just before you announced this plan...

I also landed bug 1579664 recently, which should help with issues related to GPU-switching, but looking at comment 37, this is not a dual-GPU setup, so it probably won't make a difference: Your iMac only has a discrete GPU and no Intel GPU. I wonder if the fact that you have two external screens connected to the iMac is a contributing factor...

Flags: needinfo?(mstange)

I'm downloading the test build shared by :smichaud right now and will let y'all know how it plays out.

So, this crash (or one exactly like it) is still happening, even with the test build. Back to you, :smichaud.

Flags: needinfo?(smichaud)

Sigh. Could you post a crash id? It should be symbolicated, since I arranged for the try build's symbols to be uploaded to the symbol server.

Flags: needinfo?(smichaud)

Just spent a few minutes looking through the code, and because I know all programmers love getting unsolicited ideas for things to look at thrown at them...

  • A previous comment said that on Mac this is happening off the main thread. It should be verified (a) that you're allowed to issue the glFlush() from a different thread, and (b) whether this could be getting called when the state of the GL context is invalid, such as during teardown or creation or...?

I don't know the code well enough to easily pass through it all, or I'd look deeper here.

Thanks, Eric, for trying out my test build. And thanks for the crash id. But I really wish you didn't have macOS 10.15.0 build 19A582a. That's the latest available Catalina beta, right? (The release was build 19A583.) The symbol server doesn't have symbols for your build. And much worse, I don't have a copy of it to play with. I need to check the exact path your crash took through libGPUSupportMercury.dylib. I assume it can't have been the path that my patch forestalled.

I do still have a full installer for one of the Catalina betas. I assume I can get it up to build 19A582a by upgrading it as far as possible. But it'll be a while before I can pull that off. And I won't have much time to spare over the Thanksgiving holiday.

For the time being I'm going to continue to work on this problem from the bottom up -- from the lowest level Apple code that gets exercised when the crashes happen. Rather than trying to find out what Mozilla OpenGL code shouldn't be doing, I'll be trying to figure out how to get its unusual approach working properly. After all, it mostly works OK, even on the compositor thread, as best I can tell.

Peter, have you tried my test build? What build of Catalina do you have? If it's one of the release builds (19A583 or later), please post one of your crash ids here. Presuming that you do crash with it.

Flags: needinfo?(peterv)

Steven --

Huh. Didn't see that I was on a beta. I can fix that easily enough. The 10.15.1 release build is waiting to install in my software update queue, so I'll apply that. You have access to symbols for that?

You have access to symbols for that?

Yes. The symbol server has good coverage of symbols for the original Catalina release and all subsequent updates.

Fingers crossed that you'll still be able to reproduce the crashes :-)

Crash Signature: [@ __pthread_kill | abort | gpusGenerateCrashLog.cold.1 ] [@ libsystem_kernel.dylib@0x744e] [@ libsystem_kernel.dylib@0x747a] [@ libsystem_kernel.dylib@0x72aa] [@ libsystem_kernel.dylib@0x1cb66 ] → [@ __pthread_kill | abort | gpusGenerateCrashLog.cold.1 ] [@ abort | gpusGenerateCrashLog.cold.1 ] [@ libsystem_kernel.dylib@0x744e] [@ libsystem_kernel.dylib@0x747a] [@ libsystem_kernel.dylib@0x72aa] [@ libsystem_kernel.dylib@0x1cb66 ]

Got my first crash in the test build in this code, and yep, looks like it could be helpful, I hope:

https://crash-stats.mozilla.org/report/index/96aad268-3625-44c0-bef9-99b770191128

Flags: needinfo?(smichaud)

At the same time as my last crash, this wakeups log file was also generated. I don't know if there's any useful info there, but the timing is interesting...

71 is shipping in a few days but the volume is high enough on 70 and 71 beta that I would evaluate taking a fix in a 71 dot release if we have one, so marking as fix-optional for 71 in case a fix happens in the 2 weeks to come.

Attachment #9112167 - Attachment mime type: application/octet-stream → text/plain

Thanks, Eric, for the new crash log. I now have the same build of macOS 10.15.1 as you do (19B88), and can see from the offset of the call to _gpusKillClientExt in gpusSubmitDataBuffers that your crash has taken a different path than the one I anticipated (and which I saw being taken in other crash reports). This gives me something to chew on, but I don't know how long it will take for me to digest it. I won't be able to spend much time on this bug until after the Thanksgiving holiday.

I don't know what to make of your wakeups log. I may be able to make better sense of it later.

Flags: needinfo?(smichaud)

Yeah, the wakeups log is odd. No idea of its relevance, but the fact that it’s stuff in the graphics area suggests it just might be relevant, so I might as well provide it. Catch you after the long weekend!

Steven, do you think knowing the error code would help us figure this out? In that case, would you like to look into bug 1577886 a little bit? Having that might also help with other problems of a similar nature.

Thanks, Markus, for pointing that out. Eric's crash goes through the path you mentioned in comment 19 (which I missed) -- the call to _gpusKillClientExt happens after a failed call to IOAccelGLContextSubmitDataBuffersExt2. And yes, gpusGenerateCrashLog.cold.1 writes the address of an error string (a C-string) to a location in the __crash_info section. I'll try to get access to this string in my next try build, and write it somewhere in the crash log.
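
To keep the two crash paths straight, here is a toy model of what the comments so far say about gpusSubmitDataBuffers(). It is emphatically not Apple's code: the `SubmitState` struct, the `Outcome` enum, and `model_submit` are hypothetical, the order of the two checks is an assumption, and so is the default value of 0 for no_crash_upon_reset. It only restates the observed behavior: the "reset" path can be suppressed by the override, while the failed-submit path cannot.

```cpp
// Toy model only: this is NOT Apple's code. It restates, as a small
// decision function, the two crash paths that comments in this bug
// attribute to gpusSubmitDataBuffers().
enum class Outcome { Continue, KillClient };

struct SubmitState {
    bool reset_detected;      // the "reset" path guarded by no_crash_upon_reset
    bool submit_failed;       // IOAccelGLContextSubmitDataBuffersExt2 returned an error
    int no_crash_upon_reset;  // global in libGPUSupportMercury.dylib; 0 by default (assumed)
};

Outcome model_submit(const SubmitState& s) {
    // Path 1: a detected reset kills the client unless the override is set.
    if (s.reset_detected && s.no_crash_upon_reset == 0) {
        return Outcome::KillClient;
    }
    // Path 2: a failed submit kills the client regardless of the override.
    if (s.submit_failed) {
        return Outcome::KillClient;
    }
    return Outcome::Continue;
}
```

In this model, a build that sets no_crash_upon_reset to 1 survives a reset but still dies on a failed submit, which is consistent with the crash still appearing in the first try build.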

My next patch will almost certainly still be an experimental one. It'll likely take a few more iterations before I can get a viable fix. I'm very grateful that Eric is able to reproduce these crashes.

I’m happy to try builds any time. Eager to help get this nailed down.

I've done another tryserver build. It's the same as the previous one except that this one also disables the Mozilla crash reporter. So when it crashes the Apple crashreporter should come up. That can take a long time -- up to 30 seconds. If it takes longer than that, try running the Console app and look for "firefox" crashes under "Crash Reports".

In comment 72 I said I'd try to incorporate the __crash_info error string into the Mozilla crash report. But I decided it's a lot easier just to find that information in the Apple crash report.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=3fe8b61ef9353ecd77146c1536d0a2ae8c434f12

https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/P0q4Bgx6TZKkY-JXDtAFiw/runs/1/artifacts/public/build/target.dmg

Eric, please try this out. Once you crash, you should see something like the following in the Apple crash report, towards the top:

    Application Specific Signatures:
    Graphics kernel error: 0x10000003

The error string will have either of the following two formats:

    "Graphics kernel error: 0x%08x\\n"
    "Graphics hardware encountered an error and was reset: 0x%08x\\n"

I think I've learned how to emulate these crashes (using a HookCase hook library). The error message "Graphics kernel error: 0x10000003" is the one that I see. The number is a Mach error code, MACH_SEND_INVALID_DEST. I'm very interested to find out which error message you see. If it's the same, then there's a good chance I've begun to figure out why these crashes happen.
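
As a cross-check on reading 0x10000003 as a Mach error code, the value can be split into the three fields that Mach error codes use. The sketch below mirrors the err_get_system / err_get_sub / err_get_code macros from `<mach/error.h>` (6 bits of system, 12 bits of subsystem, 14 bits of code), reimplemented here so that it compiles off macOS. The `MachErrorFields` struct and `decode_mach_error` are hypothetical names.

```cpp
#include <cstdint>

// Field layout of a Mach error code, mirroring the err_get_system /
// err_get_sub / err_get_code macros in <mach/error.h>:
// 6 bits of system, 12 bits of subsystem, 14 bits of code.
struct MachErrorFields {
    std::uint32_t system;
    std::uint32_t subsystem;
    std::uint32_t code;
};

// Split a 32-bit error value into its three fields.
constexpr MachErrorFields decode_mach_error(std::uint32_t err) {
    return MachErrorFields{(err >> 26) & 0x3f, (err >> 14) & 0xfff, err & 0x3fff};
}
```

For 0x10000003 this gives system 0x4, subsystem 0, and code 3. System 0x4 is the value `<mach/error.h>` assigns to err_mach_ipc, which fits an IPC send error such as MACH_SEND_INVALID_DEST.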

Flags: needinfo?(peterv) → needinfo?(eshepherd)

I've installed it and am running it now. Will follow up once I've crashed. Could be anywhere from 30 seconds to 3 days, depending on... I don't know, the phase of the moon or something.

Interesting side effects already noted are things like tabs crashing that didn't previously do so, with no way to find out why since there's no crash log for crashed tabs now. :)

Flags: needinfo?(eshepherd)

If that side effect becomes too painful, you can go back to the previous try build and do the following in a Terminal prompt:

    defaults write org.mozilla.firefox OSCrashReporter 1

This is supposed to bring up the Apple crash reporter in addition to the Mozilla one (according to https://developer.mozilla.org/en-US/docs/Archive/Misc_top_level/Environment_variables_affecting_crash_reporting). I don't know how (or if) it works with tab crashes.

You can of course also do this with an ordinary nightly, beta or release. And (if it works with my first try build) I'd like you to try that later. But first I want to get a __crash_info error string from a crash that I know went through the path that's not affected by no_crash_upon_reset.

I take back what I said in comment 76 about setting OSCrashReporter. I just tried it with today's mozilla-central nightly, and it doesn't seem to work at all -- the Apple crash reporter never appears, and no new entries get added to "Crash Reports" in the Console app. I tested with both a tab crash and my emulation of this bug's crash (both using hook libraries).

It's not a big deal actually. I'll report back when I get my first crash.

And I've had my first instance of this crash. I'll attach the full crash log, but some interesting bits:

Crashed Thread:        23  Compositor

Exception Type:        EXC_CRASH (SIGABRT)
Exception Codes:       0x0000000000000000, 0x0000000000000000
Exception Note:        EXC_CORPSE_NOTIFY

Application Specific Information:
abort() called

Application Specific Signatures:
Graphics kernel error: 0xfffffff9
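
Since both error-string formats mentioned earlier end in a hexadecimal code, the line under "Application Specific Signatures" can be pulled out of saved Apple crash reports mechanically when comparing several of them. This is a minimal sketch using `sscanf`; the function name `parse_graphics_error` is hypothetical, and it assumes the line has already been isolated from the rest of the report.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Extract the hexadecimal code from a line that matches one of the two
// error-string formats. Returns false (leaving *out untouched) otherwise.
bool parse_graphics_error(const std::string& line, std::uint32_t* out) {
    static const char* const kFormats[] = {
        "Graphics kernel error: 0x%x",
        "Graphics hardware encountered an error and was reset: 0x%x",
    };
    for (const char* format : kFormats) {
        unsigned int code = 0;
        // sscanf returns the number of converted fields; 1 means the
        // literal prefix matched and a hex number followed it.
        if (std::sscanf(line.c_str(), format, &code) == 1) {
            *out = static_cast<std::uint32_t>(code);
            return true;
        }
    }
    return false;
}
```

Given the line above, `parse_graphics_error("Graphics kernel error: 0xfffffff9", &code)` returns true and stores 0xfffffff9 in `code`; a line such as "abort() called" is rejected.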

And then down to thread 23, the crashed one:

Thread 23 Crashed:: Compositor
0   libsystem_kernel.dylib        	0x00007fff6958349a __pthread_kill + 10
1   libsystem_pthread.dylib       	0x00007fff696406cb pthread_kill + 384
2   libsystem_c.dylib             	0x00007fff6950ba1c abort + 120
3   libGPUSupportMercury.dylib    	0x00007fff529b90a6 gpusGenerateCrashLog.cold.1 + 94
4   libGPUSupportMercury.dylib    	0x00007fff529b01e7 gpusGenerateCrashLog + 89
5   com.apple.AMDRadeonX4000GLDriver	0x0000000127c30a31 gpusKillClientExt + 9
6   libGPUSupportMercury.dylib    	0x00007fff529b15bc gpusSubmitDataBuffers + 164
7   com.apple.AMDRadeonX4000GLDriver	0x0000000127c0af8a glrATI_Hwl_SubmitPacketsWithToken + 110
8   XUL                           	0x00000001066718c5 mozilla::gl::GLContextCGL::SwapBuffers() + 229
9   XUL                           	0x00000001066f9b55 mozilla::layers::CompositorOGL::EndFrame() + 1397
10  XUL                           	0x0000000106812c32 mozilla::layers::LayerManagerComposite::UpdateAndRender() + 14610
11  XUL                           	0x000000010680f1f2 mozilla::layers::LayerManagerComposite::EndTransaction(mozilla::TimeStamp const&, mozilla::layers::LayerManager::EndTransactionFlags) + 194
12  XUL                           	0x000000010682ffb1 mozilla::layers::CompositorBridgeParent::CompositeToTarget(mozilla::layers::BaseTransactionId<mozilla::VsyncIdType>, mozilla::gfx::DrawTarget*, mozilla::gfx::IntRectTyped<mozilla::gfx::UnknownUnits> const*) + 1041
13  XUL                           	0x000000010683c395 mozilla::layers::CompositorVsyncScheduler::Composite(mozilla::layers::BaseTransactionId<mozilla::VsyncIdType>, mozilla::TimeStamp) + 133
14  XUL                           	0x000000010684f85f mozilla::detail::RunnableMethodImpl<mozilla::layers::CompositorVsyncScheduler*, void (mozilla::layers::CompositorVsyncScheduler::*)(mozilla::layers::BaseTransactionId<mozilla::VsyncIdType>, mozilla::TimeStamp), true, (mozilla::RunnableKind)1, mozilla::layers::BaseTransactionId<mozilla::VsyncIdType>, mozilla::TimeStamp>::Run() + 47
15  XUL                           	0x0000000105dbaa8d MessageLoop::DoWork() + 1965
16  XUL                           	0x0000000105dbb58a base::MessagePumpDefault::Run(base::MessagePump::Delegate*) + 506
17  XUL                           	0x0000000105dba026 MessageLoop::Run() + 86
18  XUL                           	0x0000000105dc46d6 base::Thread::ThreadMain() + 1142
19  XUL                           	0x0000000105dc09aa ThreadFunc(void*) (.llvm.13933990605475630895) + 10
20  libsystem_pthread.dylib       	0x00007fff69640d36 _pthread_start + 125
21  libsystem_pthread.dylib       	0x00007fff6963d58f thread_start + 15
Flags: needinfo?(smichaud)

Thanks, Eric! If you get crashes with gpusGenerateCrashLog on the stack and different "Application Specific Signatures", please post them. I will need to dig around in the OS binaries to find out what your error number (0xfffffff9) means.

Flags: needinfo?(smichaud)

I've opened bug 1601366 on the issue with OSCrashReporter.

I've looked everywhere I can think of, and still can't find out what error code 0xfffffff9 means. I'll take a break and see what happens when I come back to the question with a fresher mind.

The string "Graphics kernel error: 0xfffffff9" does have some hits on Google, but none of them explains the error code.

I did also search on "-7", to which (int32_t) 0xfffffff9 translates.
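
For anyone following along, that conversion is just a two's-complement reinterpretation of the low 32 bits. A minimal Python sketch (the helper name to_int32 is made up for illustration and isn't part of any tool used in this bug):

```python
def to_int32(value: int) -> int:
    """Reinterpret the low 32 bits of value as a signed two's-complement int."""
    value &= 0xFFFFFFFF  # keep only the low 32 bits
    # If the sign bit is set, subtract 2**32 to get the negative value.
    return value - 0x100000000 if value & 0x80000000 else value


print(to_int32(0xFFFFFFF9))  # -7
```

The same helper can be applied to any other code that shows up in a "Graphics kernel error" string.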

I'm doing some looking around as well. Will let you know if I find anything.

I do see other reports posted by people that see this issue in other programs, such as this one for Adobe Premiere Pro CC 2017: http://premiere456.rssing.com/chan-7308898/all_p4642.html

Possible clue: Searching through the AMDRadeonX4000 kernel extension in a disassembler for the string 0xfffffff9, I found an IOAccelContext2::setContextError(unsigned int) method being called with 0xfffffff9 as its arg0 parameter. So 0xfffffff9/-7 is presumably some kind of IOAccelerator "context error". I'll pursue this further once I've had lunch :-)

You do have an AMDRadeonX4000 kernel extension loaded, don't you? I do. Run kextstat to find out.

Yeah, I do:

$ kextstat |grep Radeon
  188    0 0xffffff7f84163000 0x5000     0x5000     com.apple.kext.AMDRadeonServiceManager (3.0.2) D3B14FA8-3864-3697-8659-DBDD1DEF740B <13 5 3 1>
  189    0 0xffffff7f84168000 0x11000    0x11000    com.apple.kext.AMDRadeonX4000HWServices (3.0.2) 247AC7B6-C94C-3850-B4D8-BA2D6440188C <41 13 12 8 6 5 3 1>
  191    0 0xffffff7f8417c000 0x446000   0x446000   com.apple.kext.AMDRadeonX4000 (3.0.2) 1E092428-A3E6-3063-B692-B943B1565B91 <113 79 41 13 8 6 5 3 1>
  193    0 0xffffff7f846b2000 0x827000   0x827000   com.apple.kext.AMDRadeonX4200HWLibs (1.0) F6D23316-405D-3CA2-AD46-D6E4984E4F3A <13 6 5 3 1>

This is totally making me wish I had a good disassembler. :)

(In reply to Eric Shepherd [:sheppy] from comment #87)

This is totally making me wish I had a good disassembler. :)

Ghidra works very well as a disassembler.

I don't know Ghidra, but I'm very fond of Hopper Disassembler, which I use.

IOAccelContext2::setContextError(unsigned int) lives in the IOAcceleratorFamily2 kernel extension. That's probably the best place to learn what the 0xfffffff9/-7 error code means. On macOS 10.15.1, the "context error" lives at offset 0x644 in an IOAccelContext2 object. Presumably the IOAccelContext2 object is fairly closely related to the user-mode IOAccelerator framework's IOAccelContext object, though I don't yet know exactly how.

In IOAcceleratorFamily2, the error 0xfffffff9 is associated with the following two error messages:

    "%s: failed to alloc memory for block fence!\\n"
    "%s: failed to allocate memory\\n"

So presumably it means "out of memory".

Eric, you might learn something if you have the Console app running while one of these crashes happens. On recent versions of macOS, most syslog-type messages are lost unless the Console is running. But its output is extremely verbose, so you'd have to search through an enormous haystack to find any possible needles. Since the most informative messages are likely to come from a kernel extension, you could winnow down the output by filtering on "kernel". You'd still have a ton of messages to wade through, though.

Interesting. I wonder how it's running out of memory. Does it mean system memory or VRAM?

I like Hopper very much. I don't like the $99 price tag for it, given how I would only use it in pretty specific circumstances, really. :)

I suspect the error code 0xfffffff9 is spurious, or at least misleading, and that these crashes are an Apple bug. "Out of memory" errors are usually a sign of extreme distress, and I doubt very much that's what you have.

But to make progress on this bug I'll need to figure out how that particular error is triggered. That'll require learning more about how kernel-mode and user-mode "drivers" communicate with each other. At this point I have no idea how long this will take. If I'm lucky I'll have figured everything out in a few days. Otherwise it'll take me longer.

I think I bought my copy of Hopper for $50, but that was years ago. I use it every day, though, and would happily pay $99 for it, or even more.

I've learned how to emulate this bug's crashes, including the "Graphics kernel error: 0xfffffff9" error string. From this I've learned that the crashes happen without any of the methods in the call stack doing an error return. All that's needed is that the "context error" be non-zero, in the AMDRadeonX4000 or AppleIntelHD5000Graphics kernel extensions, before either calls IOAcceleratorFamily2's equivalent of libGPUSupportMercury.dylib's gpusSubmitDataBuffers. Whatever I set the user-mode reflection of the "context error" to, that's what appears in the "Graphics kernel error" string.

I've still got a lot of work to do, though. I still need to figure out why the "context error" is being set in this way. And then I need to figure out what Firefox can do about it.

Hm. Is it possible that there's not really an error at all, per se, but that something isn't getting initialized correctly so that it looks like there's an error when it's really just a spurious value?

Yes, I'd say that's possible. In fact I'm now working on that assumption to see if I can find a way to trigger the crashes by deliberately corrupting the structures that Firefox passes to libGPUSupportMercury.dylib's gpusSubmitDataBuffers().

Just finally got another crash, while Console was open. Attached is the crash log as well as the syslog output, filtered down to only things involving the "firefox" process. There are things in there that are maybe interesting.

Thanks, Eric, for your crash logs. I don't see anything helpful in them, at least yet.

Do note, however, the PerfLogging messages, particularly the "Slow WS Update data" ones (I assume WS means window server). It seems like you were dragging a window around when the crash happened. And remember the excess wakeups you got during one of your previous crashes, which may be related.

I'm glad that your "Graphics kernel error: 0xfffffff9" error message stayed the same. It seems to show that your crashes are consistent.

I'll be doing a new tryserver build, this time with Mozilla's crash reporter disabled but without my no_crash_upon_reset patch. I'll be very interested to see if this changes your crash reports in any way.

It seems like you were dragging a window around when the crash happened.

Do you remember which window? If so, give us the URL that was in the active/visible tab and I'll try it out. Also let us know roughly how many tabs you had open in the window, which may be relevant. Their contents are probably less relevant, since (as I remember it) tab contents get squelched when they're not visible.

I'm slowly figuring out which kernel-mode methods can set the "context error" to 0xfffffff9, and which user-mode methods can call them. dtrace has proved surprisingly useful here. HookCase is much more powerful and flexible than dtrace, but only dtrace can "hook" methods in the kernel and Apple's kernel extensions. Many of the methods I've encountered in the AMDRadeonX4000, AppleIntelHD5000Graphics and IOAcceleratorFamily2 kernel extensions have dtrace probes. I can use dtrace to get kernel-mode and user-mode stacks for each call to one of these methods, for example as follows:

    sudo dtrace -n '_ZN15IOAccelContext219submit_data_buffersEP33IOAccelContextSubmitDataBuffersInP34IOAccelContextSubmitDataBuffersOutyPy:entry{stack();ustack();}'

You can use the following command to get a list of all dtrace probes, then grep it for the ones you're interested in:

    sudo dtrace -l
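
The probe names above are mangled C++ symbols, which makes the "dtrace -l" output hard to grep by eye. As a rough aid, here is a Python sketch that pulls the "Class::method" part out of a simple mangled name. It assumes the plain Itanium-ABI nested-name form (_ZN, then length-prefixed identifiers, then E) and deliberately ignores the parameter types; the function name qualified_name is hypothetical, and c++filt remains the real tool for anything more complicated.

```python
def qualified_name(mangled: str) -> str:
    """Return the 'Class::method' part of a simple Itanium-ABI nested name.

    Only the '_ZN<len><id>...E' prefix is handled. Parameter types after the
    closing 'E' are ignored; templates, substitutions and cv-qualifiers are
    not supported and raise ValueError.
    """
    if not mangled.startswith("_ZN"):
        raise ValueError("not a simple nested name: " + mangled)
    pos = 3
    parts = []
    while pos < len(mangled) and mangled[pos] != "E":
        start = pos
        while pos < len(mangled) and mangled[pos].isdigit():
            pos += 1
        if pos == start:
            raise ValueError("unsupported construct at offset %d" % pos)
        length = int(mangled[start:pos])  # decimal length prefix
        parts.append(mangled[pos:pos + length])
        pos += length
    return "::".join(parts)


probe = ("_ZN15IOAccelContext219submit_data_buffersEP33IOAccelContext"
         "SubmitDataBuffersInP34IOAccelContextSubmitDataBuffersOutyPy")
print(qualified_name(probe))  # IOAccelContext2::submit_data_buffers
```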

I misspoke in comment 97 above. I know that these crashes happen when the "context error" is non-zero when IOAccelContext2::submit_data_buffers() begins executing. What I need to find out is how this can happen. For each kernel-mode method that can set the "context error" to 0xfffffff9, I'll be manipulating (abusing, if you will) the user-mode code that calls it to see if I can trigger the behavior that causes the "context error" to be set to this value. Looking at Mozilla code, I now think it's unlikely that any of this is caused by uninitialized data in the buffers that it (indirectly) sends to AMDRadeonX4000, AppleIntelHD5000Graphics and friends. It's more likely that some extraneous call (from the same thread or a different thread) is "corrupting" a given IOAccelContext2 object (by setting its "context error"), between calls to IOAccelContext2::submit_data_buffers().

Here's my new tryserver build. As mentioned above, its only difference from current mozilla-central code is that it disables the Mozilla crash reporter (so that you'll get the Apple crash reporter instead). I've removed my no_crash_upon_reset patch.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=a7313efce4250136437590605051fdf446b4388c

https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/bz_hhjtrQWO6OVgUQ9vCeQ/runs/0/artifacts/public/build/target.dmg

Please try it out, Eric, and let us know your results. You'll still crash, but I want to know if your __crash_info changes, and if so to what.

Eric, I notice from comment 37 and the log you posted in comment 99 that you have an iMac18,3. Apple's "Identify your iMac model" page says that this particular model has a Retina 5K display. It must be more "work" to "drive" this high-resolution display than for lesser models. And noting evidence of window server congestion just before your latest crash (as I did in comment 101), I wonder if these crashes are more likely to happen with 5K displays.

As far as I know, Mozilla's crash minidumps don't collect the crashing machine's model number on macOS. At least it doesn't appear in the publicly available information at crash-stats.mozilla.com.

Does anyone here know if we collect this information? If not, we should. I'll open a bugzilla bug on this issue and assign it to myself.

Steven, it's possible that these are more likely to happen on higher resolution displays, but I still wonder how likely it is to be my main problem given the fact that the issue can arise while nothing is happening at all. I've downloaded the new try build and will install it right after posting this comment.

That said, I will try to be more observant as to the precise context of each crash in terms of what I'm doing and what else might be going on with the GPU at the time for future incidents of this crash.

but I still wonder how likely it is to be my main problem given the fact that the issue can arise while nothing is happening at all.

Oh well. Then my idea about "window server congestion" may be a red herring.

Do keep running the Console while you wait for a crash. And next time please also look for "kernel" messages within a few seconds of your crash -- they might be interesting. And yes, please try to remember what you were doing just before the crash, with a particular eye on whether or not you were dragging a window.

Thanks in advance!

My most recent crash occurred while I wasn't even using Firefox, and few Firefox windows were even substantially visible, as I was working in Finder and Safari, doing some research for WebXR documentation. I was simply typing in the in-page search box in Safari at the time.

I've tried to find useful probes for dtrace without luck so far. Do you have system integrity disabled? I ask only because it warns me that some probes are unavailable since I keep it enabled.

I've done sudo dtrace -l |grep x for x of "Radeon", "libGPUSupport", "firefox", "IOAccel" and a few other things, without luck.

There are no wakeups logs for Firefox recorded today. Either it had no problems, or something kept them from being recorded.

Immediately after relaunching the latest try build you sent after the crash I mentioned in comment 108, it crashed again. This was while only one window was open, with no tabs (or only the first tab), before reopening my session from before.

This crash was similar but recorded slightly different information, including:

Crashed Thread:        26  Compositor

Exception Type:        EXC_CRASH (SIGABRT)
Exception Codes:       0x0000000000000000, 0x0000000000000000
Exception Note:        EXC_CORPSE_NOTIFY

Application Specific Information:
abort() called

Application Specific Signatures:
Graphics kernel error: 0xfffffffc

I will attach the complete Apple crash log momentarily, as well as a link to the Firefox crash info as soon as I get it.

And of course there's no Firefox crash log; you disabled the internal crash logs, or at least the about:crashes page isn't present. But here's the Apple crash log from the on-startup crash in the same general code.

Very interesting that your "graphics kernel error" changed -- it was 0xfffffff9/-7 and is now 0xfffffffc/-4. Grepping through the IOAcceleratorFamily2 kernel extension's assembly code, the new error code seems to mean "stream error". Your crash took the same path through gpusSubmitDataBuffers(), though -- the path not affected by no_crash_upon_reset.

Do you have two Apple crash reports (one for each crash)? If so, please also attach the other one.

In my experience, dtrace is only really useful when you start digging through assembly code, and begin to understand how it works. It helps you solidify your understanding, and cut off false trails (for example because a method you're interested in never gets called). The same is true of HookCase.

You didn't see any wakeup logs for Firefox. I assume you also didn't see "Slow WS Update data" ones. If so my "window server congestion" idea is definitely a red herring.

Now that I have another "graphics kernel error", I'll need to widen my search for kernel extension code to "tickle", in my attempt to find out how these crashes happen.
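
To keep track of the codes seen so far, something like the following Python sketch could be run over saved Apple crash reports. It assumes the report contains a "Graphics kernel error: 0x..." line as in the excerpts above. The meanings in the table are only the guesses made in this bug from strings in IOAcceleratorFamily2, not documented Apple constants, and the function and table names are invented for illustration.

```python
import re

# Meanings inferred in this bug from the IOAcceleratorFamily2 binary.
# These are educated guesses, not documented Apple error constants.
INFERRED_MEANINGS = {
    -7: "out of memory (presumed)",
    -4: "stream error (presumed)",
}

ERROR_RE = re.compile(r"Graphics kernel error: 0x([0-9a-fA-F]{1,8})")


def describe_graphics_kernel_error(report_text):
    """Return (signed_code, inferred_meaning), or None if no error line is found."""
    match = ERROR_RE.search(report_text)
    if match is None:
        return None
    raw = int(match.group(1), 16)
    # Reinterpret the 32-bit value as a signed integer.
    signed = raw - 0x100000000 if raw & 0x80000000 else raw
    return signed, INFERRED_MEANINGS.get(signed, "unknown")


excerpt = "Application Specific Signatures:\nGraphics kernel error: 0xfffffffc\n"
print(describe_graphics_kernel_error(excerpt))
```

Any code not yet seen in this bug is reported as "unknown" rather than guessed at.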

Another thing:

I only recently discovered that you can search on "proto_signature" at crash-stats.mozilla.com. From this I was able to find out that the vast majority of these crashes happen when the AMDRadeonX4000 kernel extension is active. A few also happen when the AppleIntelHD5000Graphics, AppleIntelHD4000Graphics or GeForce kernel extensions are active. It seems to be only on the AppleIntelHD5000Graphics hardware that some of the crashes take the path affected by no_crash_upon_reset.

For the moment, Eric, I don't have anything new for you to try. I'm still deep in the "code tickling" business, and don't know when I'll resurface. Also, at some point the holidays will require me to drastically cut back the amount of time I'm spending on this bug. It's entirely possible that my work here will carry into next year.

Wait, you're not going to spend your entire holiday trying to figure out an obscure graphics driver crash? What's wrong with... kidding... kidding. :)

Attachment is the Apple crash log from the crash immediately prior to the crash on startup.

I've tried to find useful probes for dtrace without luck so far. Do you have system integrity disabled? I ask only because it warns me that some probes are unavailable since I keep it enabled.

Yes, I have system integrity disabled. I got into this habit while developing HookCase. And though recent versions of HookCase work just fine with everything but "kext protection" enabled, I've stuck with it. It's nice to be able to mess with system files when you want to. And I almost always use my computers at home, behind a NAT firewall.

Over the holidays I'll mostly be stuffing my face, and storing up the energy I'll need to work on this bug in the depths of winter in the upper midwest :-)

Attached image Bugzilla-santa-hat.jpg

Happy holidays Steven!

Steven -- should I disable system integrity and try these dtraces again? I ask because Firefox crashed something like four times in ten minutes this morning. I finally gave up entirely and am working in Chrome now.

Hi Eric. Yes, dtrace could potentially provide very useful information if it's running during one of your crashes. I'm currently very busy with other things, but I'll try to find time today or tomorrow to come up with some dtrace probes for you to run. You'll need to test with a mozilla-central nightly, not a Firefox release -- FF releases have their symbols stripped.

I was only able to come up with one probe:

    sudo dtrace -n '_ZN15IOAccelContext215setContextErrorEj:entry{printf("Process %s, error %x",execname,arg1);stack();ustack();}'

The output might be interesting, or even critically important. This is a probe of the IOAccelContext2::setContextError(unsigned int) method in the IOAcceleratorFamily2 kernel extension.
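
If the probe fires often, its output can be summarized with a small script. The Python sketch below relies only on the "Process %s, error %x" format from the printf in the probe; the sample input line is illustrative, and the name parse_probe_line is hypothetical.

```python
import re

# Matches the text produced by printf("Process %s, error %x", execname, arg1).
LINE_RE = re.compile(r"Process (\S+), error ([0-9a-fA-F]+)")


def parse_probe_line(line):
    """Return (process, raw_code, signed_code), or None for non-matching lines."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    raw = int(match.group(2), 16) & 0xFFFFFFFF
    # setContextError takes an unsigned int; also show it as a signed 32-bit value.
    signed = raw - 0x100000000 if raw & 0x80000000 else raw
    return match.group(1), raw, signed


# Illustrative input only; real lines also carry CPU and probe ID columns.
print(parse_probe_line("Process firefox, error fffffff9"))
```

Stack frame lines don't match the pattern and come back as None, so the whole dtrace output can be fed through line by line.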

Eric, any results testing with my dtrace probe from comment #120?

Is there any chance anyone can pick up this bug now? I am ready to risk a csrutil disable to get more info if needed. The only other alternative is to return to Chrome, which I'd really prefer to avoid.

Sorin, it'd be very helpful if you could do the following:

  1. Turn off SIP (system integrity protection). First boot into your Recovery partition (restart your computer and press Cmd-R immediately after you hear the Mac startup sound). Then start the Terminal app and enter csrutil disable. Then restart your computer again to get back to your "normal" partition. Once there run csrutil status -- the result should be System Integrity Protection status: disabled.

  2. Download a recent mozilla-central nightly. For example, here's a link to today's mozilla-central Mac nightly: http://ftp.mozilla.org/pub/firefox/nightly/2020/01/2020-01-17-09-44-53-mozilla-central/firefox-74.0a1.en-US.mac.dmg.

  3. Run the dtrace command from comment #120 in a Terminal window.

  4. Test with the mozilla-central nightly until you experience this bug's crash.

  5. In the Terminal window where you ran the dtrace command, press Ctrl-C to quit dtrace. Then copy its output into a file. Then use the Attach new file button above to attach that output here.

Flags: needinfo?(sorin.sbarnea)
Flags: needinfo?(eshepherd)

I'm stuck, more or less, and can't make progress until someone who sees this bug's crashes follows the steps in comment #123.

(Following up comment #124)

If at all possible, please test on the current build of Catalina (macOS 10.15.2 build 19C57). That way I'll be able to match the addresses in dtrace's stack traces to exact addresses in the AMDRadeonX4000 and IOAcceleratorFamily2 kernel extension binaries. macOS 10.15.2 build 19C57 is what I'm currently running.

(In reply to Steven Michaud [:smichaud] (Retired) from comment #125)

(Following up comment #124)

If at all possible, please test on the current build of Catalina (macOS 10.15.2 build 19C57). That way I'll be able to match the addresses in dtrace's stack traces to exact addresses in the AMDRadeonX4000 and IOAcceleratorFamily2 kernel extension binaries. macOS 10.15.2 build 19C57 is what I'm currently running.

I had to wait 3 days to reproduce it with Nightly build, but here is the magic

$ sudo dtrace -n '_ZN15IOAccelContext215setContextErrorEj:entry{printf("Process %s, error %x",execname,arg1);stack();ustack();}'
dtrace: description '_ZN15IOAccelContext215setContextErrorEj:entry' matched 1 probe
CPU     ID                    FUNCTION:NAME
  6 302714 _ZN15IOAccelContext215setContextErrorEj:entry Process firefox, error fffffff9
              AMDRadeonX4000`AMDRadeonX4000_AMDSIGLContext::process_ResourceList(IOAccelCommandStreamInfo&)+0x1cb
              AMDRadeonX4000`AMDRadeonX4000_AMDSIGLContext::processSidebandBuffer(IOAccelCommandDescriptor*, bool)+0x15b
              IOAcceleratorFamily2`IOAccelContext2::processDataBuffers(unsigned int)+0x5d
              IOAcceleratorFamily2`IOAccelGLContext2::processDataBuffers(unsigned int)+0x337
              IOAcceleratorFamily2`IOAccelContext2::submit_data_buffers(IOAccelContextSubmitDataBuffersIn*, IOAccelContextSubmitDataBuffersOut*, unsigned long long, unsigned long long*)+0x9d7
              kernel`shim_io_connect_method_structureI_structureO+0x1b0
              kernel`IOUserClient::externalMethod(unsigned int, IOExternalMethodArguments*, IOExternalMethodDispatch*, OSObject*, void*)+0x331
              kernel`is_io_connect_method+0x223
              kernel`0xffffff8004222a10+0x212
              kernel`ipc_kobject_server+0x238
              kernel`ipc_kmsg_send+0x135
              kernel`mach_msg_overwrite_trap+0x2e5
              kernel`mach_call_munger64+0x205
              kernel`hndl_mach_scall64+0x16

              libsystem_kernel.dylib`mach_msg_trap+0xa
              IOKit`io_connect_method+0x17f
              IOKit`IOConnectCallMethod+0xf4
              IOKit`IOConnectCallStructMethod+0x23
              IOAccelerator`IOAccelContextSubmitDataBuffersExt2+0x102
              libGPUSupportMercury.dylib`gpusSubmitDataBuffers+0x88
              AMDRadeonX4000GLDriver`glrATI_Hwl_SubmitPacketsWithToken+0x6e
              XUL`mozilla::gl::GLContext::fFlush()+0x1e
              XUL`mozilla::gl::GLContextCGL::SwapBuffers()+0x78
              XUL`mozilla::layers::CompositorOGL::EndFrame()+0xf3
              XUL`mozilla::layers::LayerManagerComposite::Render(mozilla::gfx::IntRegionTyped<mozilla::gfx::UnknownUnits> const&, mozilla::gfx::IntRegionTyped<mozilla::gfx::UnknownUnits> const&)+0x7e3
              XUL`mozilla::layers::LayerManagerComposite::UpdateAndRender()+0x215
              XUL`mozilla::layers::LayerManagerComposite::EndTransaction(mozilla::TimeStamp const&, mozilla::layers::LayerManager::EndTransactionFlags)+0xa1
              XUL`mozilla::layers::CompositorBridgeParent::CompositeToTarget(mozilla::layers::BaseTransactionId<mozilla::VsyncIdType>, mozilla::gfx::DrawTarget*, mozilla::gfx::IntRectTyped<mozilla::gfx::UnknownUnits> const*)+0x25c
              XUL`mozilla::layers::CompositorVsyncScheduler::Composite(mozilla::layers::BaseTransactionId<mozilla::VsyncIdType>, mozilla::TimeStamp)+0xac
              XUL`mozilla::detail::RunnableMethodImpl<mozilla::layers::CompositorVsyncScheduler*, void (mozilla::layers::CompositorVsyncScheduler::*)(mozilla::layers::BaseTransactionId<mozilla::VsyncIdType>, mozilla::TimeStamp), true, (mozilla::RunnableKind)1, mozilla::layers::BaseTransactionId<mozilla::VsyncIdType>, mozilla::TimeStamp>::Run()+0x27
              XUL`MessageLoop::DoWork()+0x1bc
              XUL`base::MessagePumpDefault::Run(base::MessagePump::Delegate*)+0x14b
              XUL`MessageLoop::Run()+0x50
              XUL`base::Thread::ThreadMain()+0x13d
Flags: needinfo?(sorin.sbarnea)

Thank you, Sorin! I'm glad you persisted.

I'll spend the next few days wringing out as much information as I can from your stack. I already can reproduce something like it (using a HookCase hook library), but it's not identical. My error number is different, for example. I'm glad to see that your error number (fffffff9/-7, "out of memory") is the same as Eric's.

I'll look for a way to reproduce your stack exactly. Then I'll start looking for possible remediations.

I'm convinced this is one or more bugs in Apple's AMDRadeonX4000 and IOAcceleratorFamily2 kernel extensions. The trick will be finding a way to work around it/them. That's likely to take a while -- I don't know how long. But at least I'm no longer stuck.

Guess what, another one:

  2 302714 _ZN15IOAccelContext215setContextErrorEj:entry Process firefox, error fffffff9
              AMDRadeonX4000`AMDRadeonX4000_AMDSIGLContext::process_ResourceList(IOAccelCommandStreamInfo&)+0x1cb
              AMDRadeonX4000`AMDRadeonX4000_AMDSIGLContext::processSidebandBuffer(IOAccelCommandDescriptor*, bool)+0x15b
              IOAcceleratorFamily2`IOAccelContext2::processDataBuffers(unsigned int)+0x5d
              IOAcceleratorFamily2`IOAccelGLContext2::processDataBuffers(unsigned int)+0x337
              IOAcceleratorFamily2`IOAccelContext2::submit_data_buffers(IOAccelContextSubmitDataBuffersIn*, IOAccelContextSubmitDataBuffersOut*, unsigned long long, unsigned long long*)+0x9d7
              kernel`shim_io_connect_method_structureI_structureO+0x1b0
              kernel`IOUserClient::externalMethod(unsigned int, IOExternalMethodArguments*, IOExternalMethodDispatch*, OSObject*, void*)+0x331
              kernel`is_io_connect_method+0x223
              kernel`0xffffff8004222a10+0x212
              kernel`ipc_kobject_server+0x238
              kernel`ipc_kmsg_send+0x135
              kernel`mach_msg_overwrite_trap+0x2e5
              kernel`mach_call_munger64+0x205
              kernel`hndl_mach_scall64+0x16

              libsystem_kernel.dylib`mach_msg_trap+0xa
              IOKit`io_connect_method+0x17f
              IOKit`IOConnectCallMethod+0xf4
              IOKit`IOConnectCallStructMethod+0x23
              IOAccelerator`IOAccelContextSubmitDataBuffersExt2+0x102
              libGPUSupportMercury.dylib`gpusSubmitDataBuffers+0x88
              AMDRadeonX4000GLDriver`glrATI_Hwl_SubmitPacketsWithToken+0x6e
              GLEngine`gleUnbindTextureObject+0x3a
              GLEngine`gleUnbindDeleteHashNamesAndObjects+0x9f
              GLEngine`glDeleteTextures_Exec+0x2ba
              XUL`mozilla::gl::GLContext::raw_fDeleteTextures(int, unsigned int const*)+0x2e
              XUL`mozilla::layers::CompositingRenderTargetOGL::~CompositingRenderTargetOGL()+0xaf
              XUL`mozilla::layers::CompositingRenderTargetOGL::~CompositingRenderTargetOGL()+0xe
              XUL`void mozilla::layers::ContainerRender<mozilla::layers::ContainerLayerComposite>(mozilla::layers::ContainerLayerComposite*, mozilla::layers::LayerManagerComposite*, mozilla::gfx::IntRectTyped<mozilla::gfx::UnknownUnits> const&, mozilla::Maybe<mozilla::gfx::PolygonTyped<mozilla::gfx::UnknownUnits> > const&)+0x1d7
              XUL`void mozilla::layers::RenderLayers<mozilla::layers::ContainerLayerComposite>(mozilla::layers::ContainerLayerComposite*, mozilla::layers::LayerManagerComposite*, mozilla::gfx::IntRectTyped<mozilla::RenderTargetPixel> const&, mozilla::Maybe<mozilla::gfx::PolygonTyped<mozilla::gfx::UnknownUnits> > const&)+0x1e5
              XUL`void mozilla::layers::ContainerRender<mozilla::layers::ContainerLayerComposite>(mozilla::layers::ContainerLayerComposite*, mozilla::layers::LayerManagerComposite*, mozilla::gfx::IntRectTyped<mozilla::gfx::UnknownUnits> const&, mozilla::Maybe<mozilla::gfx::PolygonTyped<mozilla::gfx::UnknownUnits> > const&)+0x55
              XUL`mozilla::layers::LayerManagerComposite::Render(mozilla::gfx::IntRegionTyped<mozilla::gfx::UnknownUnits> const&, mozilla::gfx::IntRegionTyped<mozilla::gfx::UnknownUnits> const&)::$_2::operator()(mozilla::gfx::IntRectTyped<mozilla::gfx::UnknownUnits> const&) const+0x76
              XUL`mozilla::layers::LayerManagerComposite::Render(mozilla::gfx::IntRegionTyped<mozilla::gfx::UnknownUnits> const&, mozilla::gfx::IntRegionTyped<mozilla::gfx::UnknownUnits> const&)+0x669
              XUL`mozilla::layers::LayerManagerComposite::UpdateAndRender()+0x200
              XUL`mozilla::layers::LayerManagerComposite::EndTransaction(mozilla::TimeStamp const&, mozilla::layers::LayerManager::EndTransactionFlags)+0xa1

Interesting, and thanks again!

The top part is the same as the stacks from comment #126. But they differ below the following line:

    AMDRadeonX4000GLDriver`glrATI_Hwl_SubmitPacketsWithToken+0x6e

I don't want to ask you to wait another three days. But if you do get more stacks, please post any that are different from the previous two.
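
For comparing further stacks, a quick way to find where two of them part ways is to walk them frame by frame from the top. This Python sketch does that for lists of frame strings; the function name is made up, and the sample frames are copied from the user-mode parts of the two stacks above.

```python
def split_at_divergence(stack_a, stack_b):
    """Return (common, rest_a, rest_b), where common is the shared leading frames."""
    shared = 0
    for frame_a, frame_b in zip(stack_a, stack_b):
        if frame_a.strip() != frame_b.strip():
            break
        shared += 1
    return stack_a[:shared], stack_a[shared:], stack_b[shared:]


first = [
    "libGPUSupportMercury.dylib`gpusSubmitDataBuffers+0x88",
    "AMDRadeonX4000GLDriver`glrATI_Hwl_SubmitPacketsWithToken+0x6e",
    "XUL`mozilla::gl::GLContext::fFlush()+0x1e",
]
second = [
    "libGPUSupportMercury.dylib`gpusSubmitDataBuffers+0x88",
    "AMDRadeonX4000GLDriver`glrATI_Hwl_SubmitPacketsWithToken+0x6e",
    "GLEngine`gleUnbindTextureObject+0x3a",
]

common, rest_first, rest_second = split_at_divergence(first, second)
print(common[-1])      # last frame the two stacks share
print(rest_first[0])   # first differing frame in the first stack
print(rest_second[0])  # first differing frame in the second stack
```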

Sorin (and Eric), how often do you reboot your Mac? (I don't, of course, mean putting it to sleep -- for example by closing its cover.) If you keep it running for days (or weeks) at a time, do this bug's crashes happen more often after it's been running for a while?

I notice that the Firefox uptimes for these crashes are usually no more than 3 or 4 days, though they are also more likely to happen with those uptimes than with shorter ones. But Mozilla's crash stats don't include information on the uptime of the machine itself.

Flags: needinfo?(sorin.sbarnea)

This happens only on my iMac (5K), which is always on. I don't even use suspend, as I keep an IRC app running 24/7. I only lock it when I'm out; I don't suspend it.

I reboot it only when a system update is needed, and most recently I rebooted it specifically to disable SIP.

I noticed that it took ~2 days after the reboot until Firefox crashed again, which is far less often than normal, so it may be that this bug triggers when the system uptime is higher. Please let me know what other tests to do. Judging by the number of occurrences, I'm not the only one encountering it.

I switched back to vanilla Firefox but I can go back to Nightly if it would be needed.

Flags: needinfo?(sorin.sbarnea)

Thanks for the information.

Now that your IOAccelContext2::setContextError() stacks have narrowed my search for this bug's cause, I'm starting to suspect it isn't an Apple bug at all, but is caused by the (temporary) exhaustion of some kind of system resource. That would explain why Firefox, at least, needs to be running for a while before you first see the crashes. It'd also explain the "out of memory" error code. I'll dig into this over the next few days. Among other things, I'll need to figure out how Chrome (and presumably Safari) manage not to exhaust this resource.

As best I can tell, the "out of memory" failure that causes this bug happens in one or more prepare() methods in the IOAcceleratorFamily2 kernel extension (IOAccelResource2::prepare() and/or IOAccelMemory::prepare()). But these methods are called a lot, and the output of simple probes would be unbearably noisy. I need to figure out how to make dtrace only show output when these methods fail. If I can manage that I'll give you another dtrace probe to run.

Attached file Dtrace script for debugging crashes (obsolete)

I need to figure out how to make dtrace only show output when these methods fail.

Turns out this wasn't hard.

Sorin, here's a dtrace script for you to try. First save it as a file (I suggest naming it bugzilla1576767.d). Then run it as follows from a Terminal prompt, and keep it running until you experience at least one crash. Please test with a mozilla-central nightly (since they don't have their symbols stripped).

    sudo dtrace -s bugzilla1576767.d

Let us know what output you get. Also let us know if you get any output without crashing. If you get a lot of that, just comment out the relevant section and try again. See the existing examples of commented out sections.

This script is probably the best I can do for now. I may be able to improve it after seeing your results.

You can quit the script by typing Control-C.

I just discovered that you can enable SIP partially, in such a way that you can still run dtrace scripts and probes:

  1. Boot into your Recovery partition (restart your computer and press Cmd-R immediately after you hear the Mac startup sound).

  2. Start the Terminal app and enter csrutil enable --without dtrace. You'll get an error message about this being an unsupported configuration, but it still works.

  3. Reboot your computer again to get back to your "normal" partition.

I now have time to get back into this a bit. Steven -- I restart my Mac rarely. It stays on (sleep is disabled even) until it either has a kernel panic or has to be restarted due to a software update. That's it.

My schedule today is a mess, but I have slotted time in tomorrow to do testing with SIP disabled and so I can run the dtraces above.

I haven't used Firefox at all in nearly two months because of this crash; it happens too often and I have too much important work in my browser to risk it. Would be nice to get back to it, so I'm eager to get back into helping sort this.

Flags: needinfo?(eshepherd)

Thanks Eric. Note that you don't have to disable SIP completely (see comment 134). Probably the most important protection is file system protection, and that stays on. Please test with the dtrace script from comment 133. It actually contains the probe from comment 120.

Thanks, Sorin! It'll take me a while to work out the implications of your trace.

Are you still on macOS 10.15.2 build 19C57, or have you upgraded to 10.15.3 build 19D76?

Finally new trace:

(In reply to Steven Michaud [:smichaud] (Retired) from comment #138)
> Thanks, Sorin! It'll take me a while to work out the implications of your trace.
> 
> Are you still on macOS 10.15.2 build 19C57, or have you upgraded to 10.15.3 build 19D76?

I upgraded, so this trace was produced with 19D76.
I cannot say the same about Firefox, which may be a day-old Nightly, since it usually takes time to crash again, but it is never more than one day old. I guess that is not so important because you do receive the crash reports.

Let me know if there is something I can put in report description to ease linking them to this bug. Is putting the bug number enough?

> Let me know if there is something I can put in report description to ease linking them to this bug. Is putting the bug number enough?

I don't know what you're referring to. Here's how I'd add one of my own traces as an attachment:

  1. Cmd-A in the Terminal window to select the entire contents of the dtrace log, then Cmd-C to copy it.

  2. Open a file in a text editor (like BBEdit) and paste (Cmd-V) in the contents. Edit it, if necessary, to remove extraneous content, then save the file. Make sure you save it in text format, and that the filename ends in .txt.

  3. Use the Attach New File button above to attach the file here.

There's an error message associated with the first failure in Sorin's log from the attachment in comment 137 -- "failed to wire down mapping". I'll try to figure out what this means, more precisely. But it lends support to my hunch from comment 132 about the temporary exhaustion of some kind of system resource.

I'll also try to come up with another dtrace script, to gather more information.

There's more evidence that Firefox is triggering some kind of resource leak with AMD GPUs in bug 1583922 - see bug 1583922 comment 9 and following. I haven't had a chance to look into that in more detail but maybe it can give you a hint in the right direction.

Here's another version of my dtrace script. I've added two new probes, testing for possible causes of the failure in IOGraphicsAccelerator2::freeToPrepareMapping(IOAccelMemoryMap*). Please test with this instead of with my previous script, Eric. Sorin, please also test with this script.

Attachment #9123564 - Attachment is obsolete: true

(In reply to comment 142)

Thanks, Markus! That does indeed look related. I didn't know about ioclasscount.

Here's yet another new version of my dtrace script, with yet another probe -- one for failures in IOAccelVidMemoryList::ReverseIterator::getPrevMemory().

Based on what Markus said, I assume the failures will happen here, when it's called from IOGraphicsAccelerator2::freeWaitToPrepareVidMap(IOAccelMemoryMap*, bool, bool).

Eric and Sorin, please test with this script. If you've already started testing with a previous version and haven't yet had a crash, quit the previous version (using Ctrl-C) and run this one instead.

Attachment #9125129 - Attachment is obsolete: true

I just noticed something rather slimy from Apple: dtrace user stacks (logged using ustack()) aren't symbolicated unless SIP is turned off completely. This doesn't matter for the dtrace script I'm asking Sorin and Eric to test with -- I already know what the output of ustack() will be in those cases. (It does matter for the tests I've been running, so I'm back to using csrutil disable.)

So, if you're using csrutil enable --without dtrace and testing with my dtrace script from comment 145, you can use either a Firefox release or a Firefox Nightly -- the user stack symbols will be missing regardless.

Steven -- bummer about the symbols. Annoying. LMK if you do want this run with SIP fully disabled.

At any rate, I've got a crash for you, attached. Used the newest script from #145.

Thanks, Eric! Your results are very interesting, and not what I expected. (I expected failures in IOGraphicsAccelerator2::freeWaitToPrepareVidMap(IOAccelMemoryMap*, bool, bool).) Give me an hour or two and I'll add some more probes to my dtrace script.

Don't worry about the missing ustack() symbols -- the user stacks are all the same, and I already have symbols for them.

Here's another revision of my dtrace script. Eric and Sorin, please quit my previous version and run this instead. I've added one more probe, for IOAccelResidentMemorySet::LRUIterator::getNextMemory().

Attachment #9125138 - Attachment is obsolete: true

IOAccelResidentMemorySet is a list of blocks of wired system memory. Wired memory can't be paged out (unless it's unwired first). There are limits on how much system memory you can have wired at any given time -- both per process and system wide. Running out of wired system memory is not the same thing as running out of system memory altogether.

My current hunch is that this bug is triggered by running out of wired system memory. If so, it's probably not related to bug 1583922.

Eric and Sorin, do you see anything like the long lags reported at bug 1583922?

Another question for Eric and Sorin: How much RAM do you have on your computers?

Do we have a way to find out how much wired memory Firefox uses for its own purposes? about:memory doesn't seem to have anything. IPC tends to use wired memory. If Firefox uses more wired memory than Safari or Chrome, that might be one contributing factor for these crashes.

I don't know - this is the first I've ever heard of "wired memory", to be honest. Our source code doesn't contain the string either.
I just wanted to mention that I'm watching in awe as you are doing the impossible.

Wired memory is mostly a kernel thing, I think. Writing HookCase (a kernel extension), I had to get intimately acquainted with the source code for Apple's xnu kernel. That's where I learned about wired memory. But it can also be used by ordinary applications. I understand that shared memory (memory shared between different processes) is wired.

There must be ways for user-mode programs to figure out how much wired memory is in use -- both in total and in a particular process. top lists the total amount of physical memory that's been wired. sysctl has vm.global_user_wire_limit, vm.user_wire_limit and vm.global_no_user_wire_amount. I'll just need to dig around a bit.
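As a rough illustration of the system-wide side of this, here is a minimal Python sketch that picks the three sysctl names mentioned above out of sysctl-style text. It assumes the "name: value" line format that sysctl prints; parse_wire_limits is a hypothetical helper name, and the sample numbers are placeholders rather than measurements from any machine.

```python
from typing import Dict

# The three sysctl names mentioned in the comment above.
WIRE_KEYS = (
    "vm.global_user_wire_limit",
    "vm.user_wire_limit",
    "vm.global_no_user_wire_amount",
)


def parse_wire_limits(sysctl_output: str) -> Dict[str, int]:
    """Extract the wired-memory sysctl values from 'name: value' lines."""
    limits: Dict[str, int] = {}
    for line in sysctl_output.splitlines():
        name, sep, value = line.partition(":")
        name = name.strip()
        # Skip lines without a colon and names we are not interested in.
        if sep and name in WIRE_KEYS:
            limits[name] = int(value.strip())
    return limits


# Placeholder text in the assumed output shape; the numbers are made up.
SAMPLE = """\
vm.global_user_wire_limit: 1000
vm.user_wire_limit: 800
vm.global_no_user_wire_amount: 200
vm.swapusage: total = 0.00M
"""

limits = parse_wire_limits(SAMPLE)
print(limits)
```

On a real system the text could come from running sysctl with those three names; this only covers the system-wide limits, not per-process usage.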

It's kind of fun to do the impossible ... at least when it actually works :-)

Yet another question for Eric and Sorin: What are your kernel boot args, if any? You can see them by running nvram boot-args from a Terminal prompt.

OK -- answers to questions (I haven't switched to the new script yet but will immediately after posting this info)...

Lags: So... I have a long history of serious problems with long pauses when using Firefox. They seem to come almost at random, but may more often occur when switching tabs, loading or reloading pages, or even simply mousing over an idle tab without clicking on it.

It's also worth adding that the crash sometimes happens after Firefox has been running for an extended period (hours or days), sometimes it happens within a few minutes of starting up the browser, and sometimes it happens before the first page even appears on screen.

RAM installed: My iMac has 48 GB installed.

Boot args:

$ nvram boot-args
nvram: Error getting variable - 'boot-args': (iokit/common) data was not found

Thanks, Eric, for the info.

> My current hunch is that this bug is triggered by running out of wired system memory.

I'm already beginning to question this. As best I can tell, neither Firefox, Safari nor Chrome use any wired memory. (I used vm_region_64() to count it by iterating through all of the current process's "regions", in a hook library.)
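The tally described above can be modeled roughly as follows. This is only a Python sketch of the counting step, not the hook library itself (which would call vm_region_64() from native code). The Region record is hypothetical, and treating a nonzero user_wired_count as "wired" is an assumption about which region information was inspected.

```python
from typing import Iterable, NamedTuple


class Region(NamedTuple):
    """Hypothetical record for one VM region of the current process."""
    address: int
    size: int
    user_wired_count: int  # assumed to be nonzero when the region is wired


def wired_bytes(regions: Iterable[Region]) -> int:
    """Sum the sizes of all regions that report a nonzero wired count."""
    return sum(r.size for r in regions if r.user_wired_count > 0)


# Made-up regions, for illustration only.
regions = [
    Region(address=0x1000, size=4096, user_wired_count=0),
    Region(address=0x2000, size=8192, user_wired_count=1),
    Region(address=0x4000, size=16384, user_wired_count=0),
]

print(wired_bytes(regions))
```

A result of zero for a whole process would match the observation that none of the three browsers appears to use wired memory of its own.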

I have a 15-inch mid-2015 MacBook Pro with 16GB of RAM and the following two graphics "cards":

AMD Radeon R9 M370X (discrete)
Intel Iris Pro (integrated)

I use gfxCardStatus (https://gfx.io/) to force one or the other to be used. When I select "discrete only", my graphics hardware uses the AMDRadeonX4000 kernel extension (as do Eric's and Sorin's iMacs). Using hook libraries, I've largely been able to emulate the behaviors that Eric and Sorin have reported (though I haven't been able to reproduce their crashes exactly). But now I've bumped into something that may represent a substantial difference between their hardware and mine. On my MacBook Pro the following function is apparently never called:

IOGraphicsAccelerator2::freeWaitToPrepareSysMap(IOAccelMemoryMap*, bool)

This is one of Eric's failure points, and I expect it will also be for Sorin.

Eric and Sorin: What, exactly, is the graphics/displays hardware reported on your iMacs by About this Mac : System Information?

My iMac's graphics situation follows. Note that the third display (the 1601W) was added more recently; the problem was occurring before I added it. Which makes me wonder: is it possible this problem only occurs if you have more than one display attached?

Radeon Pro 580:

  Chipset Model:	Radeon Pro 580
  Type:	GPU
  Bus:	PCIe
  PCIe Lane Width:	x16
  VRAM (Total):	8 GB
  Vendor:	AMD (0x1002)
  Device ID:	0x67df
  Revision ID:	0x00c0
  ROM Revision:	113-D000AA-931
  VBIOS Version:	113-D0001A1X-025
  EFI Driver Version:	01.00.931
  Metal:	Supported, feature set macOS GPUFamily2 v1
  Displays:
iMac:
  Display Type:	Built-In Retina LCD
  Resolution:	5120 x 2880 Retina
  Framebuffer Depth:	30-Bit Color (ARGB2101010)
  Main Display:	Yes
  Mirror:	Off
  Online:	Yes
  Automatically Adjust Brightness:	No
  Connection Type:	Internal
DELL U2713H:
  Resolution:	2560 x 1440 (QHD/WQHD - Wide Quad High Definition)
  UI Looks like:	2560 x 1440 @ 59 Hz
  Framebuffer Depth:	30-Bit Color (ARGB2101010)
  Display Serial Number:	C6F0K43P11JL
  Mirror:	Off
  Online:	Yes
  Rotation:	Supported
  Automatically Adjust Brightness:	No
  Connection Type:	DisplayPort
1601W:
  Resolution:	1920 x 1080 (1080p FHD - Full High Definition)
  UI Looks like:	1920 x 1080 @ 60 Hz
  Framebuffer Depth:	30-Bit Color (ARGB2101010)
  Display Serial Number:	MMEK1JA001964
  Mirror:	Off
  Online:	Yes
  Rotation:	Supported
  Automatically Adjust Brightness:	No
  Connection Type:	DisplayPort

Thanks, Eric, for the info. The crucial factor may be how many pixels your graphics hardware needs to drive. My current thinking is that the bug only happens when the AMDRadeonX4000 kernel extension decides that it doesn't have enough "vidmem" to do the job, and starts to also use (wired) "sysmem". This is little more than a guess. But I hope to play some tricks that might confirm it.

Sorin, please also provide your display/graphics information. Also let us know if you, like Eric, see significant lags (like those reported at bug 1583922).

Warning for the overly adventurous: I just tried adding the -amd_no_dgpu_accel kernel boot-arg, and it made my system temporarily unbootable. I got out of it by forcing a reset, then doing Cmd-R to boot into the Recovery partition, then using nvram in the Terminal app to get rid of this boot-arg.

I noticed that this boot-arg is recognized by the AMDRadeonX4000 kernel extension. Apparently it disables the AMD GPU framebuffer. I hoped that this meant it wouldn't use "vidmem" at all, but only "sysmem". Perhaps it did, but you can't test on a system you can't boot :-)

Attached a new log from dtrace running the script in comment #149.

Interesting. Some of this (the failures leading up to the 0xfffffff9/-7 context error) I expected. The rest (the failures leading up to the 0xfffffffc/-4 context error) I didn't. But you did report this error previously in comment 109. More grist for my mill.
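For readers matching the two notations: each pair above is the same 32-bit value, shown once as unsigned hex and once as a signed integer. A minimal Python sketch of the conversion (to_signed32 is a hypothetical helper name, not something from the dtrace script):

```python
def to_signed32(value: int) -> int:
    """Interpret the low 32 bits of value as a two's-complement integer."""
    value &= 0xFFFFFFFF
    # If the sign bit is set, subtract 2**32 to get the negative value.
    return value - 0x100000000 if value & 0x80000000 else value


for code in (0xFFFFFFF9, 0xFFFFFFFC):
    print(hex(code), to_signed32(code))
```

This is handy when a log prints the hex form and the discussion uses the signed form.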

I'll still be concentrating mostly on the 0xfffffff9/-7 context errors, since they seem to be the most frequent. But I'm beginning to doubt that I'll be able to emulate them on my MacBook Pro. If I can emulate the 0xfffffffc/-4 context errors, that might be a good second best. I need to learn how to emulate at least one of these errors, so I can move on to discovering why this bug doesn't happen (or happens so much less often) in Chrome (and presumably Safari).

I may come up with more additions to my dtrace script, but they are now likely to come less frequently.

Oops, I just realized that none of the 0xfffffffc/-4 context errors happened in Firefox. Instead they happened in processes named Pixelmator, Photos, and Monument Helper. What are those processes? Are they separate apps? These context errors should have caused them to crash -- did they? Did they crash around the same time as Firefox? Were they interacting with Firefox in any way?

Note that most of my probes' logging is limited to processes named "firefox" or "plugin-container". That's why they didn't record logs for any other process.

Those are all separate apps. I didn't realize the script was capturing stuff outside of Firefox. :)

Pixelmator is a graphics app. I had a couple of crashes in that yesterday.
Photos is the Apple Photos application.
Monument Helper is an app that automatically copies new photos added to the Photos library to an external photo library management device (https://getmonument.com).

I'm not aware of Photos or Monument Helper crashing, but I suppose it's possible they did. However, none of them were interacting with Firefox in any way.

Is there a way to limit the dtrace command to specific processes?

It's interesting that other apps on your computer also have setContextError() problems. But I agree that this isn't essential information. It would be more interesting, though, if the error codes were 0xfffffff9/-7 instead of 0xfffffffc/-4.

Here's a revised dtrace script that does what you ask.

Look in the Console app under "Crash Reports" for anything on Photos and Monument Helper.

Attachment #9125355 - Attachment is obsolete: true

Yeah, I see the crashes in the crash reports. Looks like some crashes-in-background I wasn't aware of.

All of them seem to be crashing while submitting data buffers to the GPU. That's intriguing, though I'm not entirely sure what to make of it.

Oh! I just thought of something I've noticed that might be relevant now that I think of it -- now and then (and this happens in all of my web browsers including Chrome, Firefox, and Safari, and has happened in other apps too) the entire contents of a window shows up red until I manually do things to trigger refreshes of the contents. This could be suggestive of an issue with texture buffer transfer errors, perhaps? This happens several times a week, typically. I have not seen many if any crashes that are obviously affiliated with this taking place, but it does happen fairly often.

The artifacts you see make sense -- the underlying cause of your crashes is probably system wide. So do the lags -- do you also see those in other browsers? What remains unexplained is why Chrome and Safari don't crash.

Your kernel mode stacks suggest problems with "resources". Seeing that, I wrote a hook for void ioAccelResourceFinalize(void *arg0) (in the IOAccelerator private framework) that (under certain circumstances) prevents it from calling its original function. This leaves these "resources" un-reclaimed, even in the kernel (and its extensions). Firefox's memory usage balloons, and I eventually see errors with my dtrace script that are similar to yours (though in "vidmem" and not "sysmem"), and serious lags (even hangs). I also see occasional video artifacts (though as yet nothing permanent). No crashes. But like I said I expect that's due to hardware differences between your system and mine -- mainly that your graphics drivers have a lot more pixels to drive, which forces them to use "sysmem" in addition to (instead of?) "vidmem".

Hm. I'm starting to toy with Instruments a bit to see if I can find some reporting on graphics memory usage that may be useful here. If there's anything else I should be doing at this point, let me know.

I do see using iStat Menus that my GPU memory is 96% full right now, despite not doing anything all that exciting. Instruments records tons of data but I don't have the knowhow to say what it means. If there's something I could do with that to get data you need, do let me know.

Thanks Eric. What you say confirms my suspicions. Frankly, it sounds like the 5K iMac's graphics hardware is a bit underpowered, at least with regard to the quantity of its video memory.

I'm still working on emulating this bug's crashes, to help find out why Safari and Chrome don't crash. No luck so far.

I'll let you know when I have more questions for you, and tests for you to run.

I am probably overtaxing the VRAM a little driving the built-in monitor at 3200x1800 pixels, a 4K monitor at 2560x1440 pixels, and a third monitor at 1920x1080 pixels (though the latter didn't affect the frequency of the crashes when I added it, and indeed I removed it for a while to test the effect and there was none).

If you feel like there would be value to it, I suppose I could disconnect both of my external displays for a short while to do some added testing. What I do wonder about is how much of the problem is all the displays and how much of it is all my windows I have open. I kind of think it's more about the windows, with their buffers piling up. The three displays together don't actually need that much framebuffer space, comparatively speaking.
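A back-of-the-envelope check of that last point, as a Python sketch. It uses the native resolutions from the System Information listing earlier in this thread and assumes 4 bytes per pixel (30-bit ARGB2101010 packed into 32 bits) and a single buffer per display. Real usage is higher because of multiple buffering and per-window backing stores, which this does not model.

```python
BYTES_PER_PIXEL = 4  # assumption: ARGB2101010 occupies 32 bits per pixel

# Native resolutions as listed in the System Information output above.
DISPLAYS = {
    "iMac built-in": (5120, 2880),
    "DELL U2713H": (2560, 1440),
    "1601W": (1920, 1080),
}


def framebuffer_bytes(width: int, height: int,
                      bytes_per_pixel: int = BYTES_PER_PIXEL) -> int:
    """Size of one full-screen buffer at the given resolution."""
    return width * height * bytes_per_pixel


total = sum(framebuffer_bytes(w, h) for w, h in DISPLAYS.values())
print(total, round(total / 2**20, 1))  # 82022400 bytes, about 78.2 MiB
```

Under these assumptions one buffer per display adds up to roughly 78 MiB, around 1% of the 8 GB of VRAM reported for the Radeon Pro 580, which is consistent with the idea that window buffers matter more than the displays themselves.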

Oh, another thing -- smichaud, let me know if it would be useful for me to set things up for you to remotely access my iMac. I can add a user account on here and arrange things for you to access it remotely if it would help with debugging on a machine that has the problem directly rather than needing to emulate the issue.

Thanks, Eric, for offering me remote access to your machine. I may take you up on it later, but for now I can't think of anything useful I could get from it. Go ahead and perform any tests you can think of. I don't (for the time being at least) have anything to suggest.

Eric and Sorin (and anyone else who sees these crashes regularly), I've created a new tryserver build for you to try. It's a long shot, but there's a reasonably good chance it will "fix" (that is work around) this bug.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=ebffead1c3e3cb8eb1452bc4ed562f0dad6baf76

https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/AtES1jiEQB64A3HCV-SALA/runs/0/artifacts/public/build/target.dmg

I noticed that the autorelease pool is never "drained" on the compositor thread while Firefox is running. Lots of IOAccelResource objects get released to this pool, but ioAccelResourceFinalize() is never called on them until the pool is "drained". These will pile up after Firefox has been running for a few days, and it's easy to imagine them causing a shortage of wired system memory on machines where that resource (along with video memory) is already taxed to the limit.

Apple does a much better job of managing the autorelease pool on the main thread. gpusSubmitDataBuffers() is called on the compositor thread in Firefox, but it's called on the main thread in Chrome and Safari. This could explain why people don't see this bug's crashes in Chrome and Safari, or at least see them much less often.
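The mechanism hypothesized here can be pictured with a toy model. The Python sketch below only illustrates deferred finalization (objects handed to a pool are not finalized until the pool is drained). AutoreleasePoolModel is a hypothetical class, not Apple's implementation, and a later comment in this thread reports that the premise (the pool never being drained on the compositor thread) turned out to be false.

```python
class AutoreleasePoolModel:
    """Toy model: released objects wait in the pool until it is drained."""

    def __init__(self) -> None:
        self._pending = []   # released, but not yet finalized
        self.finalized = []  # finalized when the pool was drained

    def autorelease(self, resource) -> None:
        self._pending.append(resource)

    def pending_count(self) -> int:
        return len(self._pending)

    def drain(self) -> None:
        # Only here do the pending objects actually get finalized.
        self.finalized.extend(self._pending)
        self._pending.clear()


pool = AutoreleasePoolModel()
for frame in range(1000):
    pool.autorelease(f"resource-{frame}")

before = pool.pending_count()
pool.drain()
after = pool.pending_count()
print(before, after)
```

In the model, nothing is reclaimed no matter how many objects are released until drain() runs, which is the kind of pile-up the comment has in mind.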

To be sure that this patch works, you'll need to run the build for at least a week without seeing one of this bug's crashes.

My patch is (shall we say) a bit unorthodox. I'm using it as proof of concept. Even if my patch works, the final patch is likely to be different.

I've given up trying to emulate this bug's crashes on my MacBook Pro. I found a "property" (IOSurfaceSysMemOnly) that I could specify to IOSurfaceCreate() in its dictionary of properties. But when I did this I either crashed my login session (presumably by crashing the WindowServer process) or froze up my machine completely (requiring a hardware reset). Possibly this was because I don't have much system RAM (only 16GB). In any case I consider myself warned away :-)

> I noticed that the autorelease pool is never "drained" on the compositor thread while Firefox is running.

This turns out to be false. So my "fourth experimental patch" won't help. You can stop testing with it. Sigh.

What is true is that ioAccelResourceFinalize() is never called from _CFAutoreleasePoolPop() on the compositor thread while Firefox is running, even with my patch. I'll try to figure out what significance this has, if any.

I'll add that your explanation in comment #174 talks about "when Firefox has been running for a few days" but in my case, the crashes are happening sometimes during or immediately after launch.

I don't yet have a good explanation for what you report. It's confirmed by the fact that not all this bug's crashes have long uptimes.

When a process crashes that used lots of "resources", the kernel counterparts of those "resources" might not get freed right away. This could explain one long uptime followed by one or more short uptimes. That seems to be the pattern with your crashes, or at least one of the patterns.

You're right -- that does seem to be the pattern. Long uptime, crash, then a series of trying to start up and failing due to crashes either during startup or immediately after startup, after which I give up and leave it for a while. When I come back, I usually get a good startup (though not always -- sometimes it continues to be obnoxious for days on end).

As a short-term workaround, I suggest rebooting your computer when it "gets obnoxious". I expect that will clear things up for a while.

Yeah, I try to do that when it comes up... though the amount of stuff I always have going on makes it hard to make time for rebooting. But yeah.. :)

I wonder... I have heard of instances of people pulling things like graphics drivers from older macOS versions. Since the crashes became so much more frequent with Catalina, I wonder if pulling the driver for my GPU from Mojave would have an impact on anything.

I doubt it. Remember that I was able to emulate much of your dtrace script output using standard (and up to date) Apple drivers.

I suspect the strongest correlation is with 5K iMac owners, particularly those with one or more external high-resolution displays. I don't know how to prove this, though.

That kind of test is likely to create all kinds of other problems. I wouldn't do it :-)

I'm starting to see evidence that Safari (at least) is able to make the kernel recover resources that have been leaked. Firefox can't. It'll take me a while to figure out what's going on, but don't lose hope :-)

I never lose hope when Mozilla is involved. :)

(In reply to comment #182, following up comment #183)

Oops, I misread your comment. I assumed you're trying to find ways to make the crashes easier to reproduce (something I've been preoccupied with). Instead you're (presumably) looking for a way to avoid the crashes. But my answer is still the same -- don't do it.

The crashes do happen disproportionately on macOS Catalina (10.15). But they also happen on earlier versions of macOS (with a slightly different signature, "__pthread_kill | abort | gpusGenerateCrashLog"), and I suspect the reason we see more of them on Catalina is that Firefox has more Catalina users. The only safe way to use graphics drivers from earlier versions of macOS is to roll back to those versions, which would mean erasing your boot partition and installing macOS on it from scratch ... if you could find an installer for the earlier version (Apple doesn't archive them). But I expect you'd still see the crashes.

Yeah, probably true. I have known people who have actually successfully done this, though obviously it's a big risk to take -- to go find the driver from a previous release and replace your current driver with it. It's of course asking for trouble. Was just thinking aloud, so to speak.

The weird ones always seem to happen to me. :D

Eric, I actually was able to emulate this bug's crashes a couple of times in a row. The only recent change to my system is that I freed about 750MB of disk space on my boot partition. (The amount of free space on my boot partition is still quite low -- even now it's only a little over 7GB.) Which leads me to suspect that my AMD graphics drivers are, like yours, able to use system memory when video memory runs short -- but only if there's at least a certain amount of disk space available (since system memory swaps to disk when you overuse it).

All of this leads me to ask: How much free space do you have on your boot partition?

I used a HookCase hook to prevent ioAccelResourceFinalize() from calling its original function. Then I loaded three very busy pages and scrolled up and down in them continuously. It took 5-10 minutes of this to trigger dtrace script errors and then the crash.

Steven--

My boot drive's writable partition has 1.68 TB free. From the system information app:

Macintosh HD:

Free: 1.67 TB (1,665,216,946,176 bytes)
Capacity: 3.12 TB (3,121,401,434,112 bytes)
Mount Point: /
File System: APFS
Writable: No
Ignore Ownership: No
BSD Name: disk2s5
Volume UUID: FEDF0CFD-9ECA-479E-8705-F19A17C8FF52
Physical Drive:
Device Name: APPLE SSD SM0128L
Media Name: AppleAPFSMedia
Medium Type: Rotational
Protocol: SATA
Internal: Yes
Partition Map Type: Unknown
S.M.A.R.T. Status: Verified

Macintosh HD - Data:

Free: 1.67 TB (1,665,216,958,464 bytes)
Capacity: 3.12 TB (3,121,401,434,112 bytes)
Mount Point: /System/Volumes/Data
File System: APFS
Writable: Yes
Ignore Ownership: No
BSD Name: disk2s1
Volume UUID: 9EE36C05-B029-413D-AD4F-5DEC41B2A2D7
Physical Drive:
Device Name: APPLE SSD SM0128L
Media Name: AppleAPFSMedia
Medium Type: Rotational
Protocol: SATA
Internal: Yes
Partition Map Type: Unknown
S.M.A.R.T. Status: Verified

Eric (and Sorin), I've got a new tryserver build for you to test:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=00193a9f4532ce5bdfd63fc976d1e83430a1d39c

https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/NPQNI5Q3Rk6s0HyJuX0YCw/runs/0/artifacts/public/build/target.dmg

There's a fair chance it will help with this bug's crashes. It doesn't seem to help in my own tests (with ioAccelResourceFinalize() as per the last paragraph in comment #188), but I'm not sure that's relevant.

A single IOAccelResource is created every time an IOSurface is initialized, which gets released when the IOSurface is destroyed. This resource is presumably the same "size" as the IOSurface, so it's presumably fairly large, and uses a fairly large amount of video memory or system memory. In Firefox's CoreAnimation code, IOSurfaces are re-used, if possible (they're obtained from and released to a pool). So a long time can pass before they're destroyed. But their use count is incremented when they're obtained from the pool, and decremented when they're released back to the pool. My patch causes the OS to "purge" an IOSurface when its use count drops to zero. This doesn't free the IOSurface or its IOAccelResource, but (as best I can tell) does free the resource's backing memory (video memory or system memory). My hope is that this will reduce the pressure on your 5K iMac's video "resources" (video memory and system memory) enough to at least reduce the frequency of this bug's crashes.
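The pooling and purge-on-zero behavior described above can be sketched as a small model. This is a Python illustration only, not Firefox's CoreAnimation code and not the patch itself. Surface, SurfacePool and backing_resident are hypothetical names, and "purging" is reduced to flipping a flag.

```python
from dataclasses import dataclass
from typing import List


@dataclass(eq=False)
class Surface:
    """Hypothetical stand-in for an IOSurface and its backing memory."""
    size_bytes: int
    use_count: int = 0
    backing_resident: bool = True  # False once the backing memory is purged


class SurfacePool:
    """Toy pool that reuses surfaces and purges them when unused."""

    def __init__(self) -> None:
        self._idle: List[Surface] = []

    def obtain(self, size_bytes: int) -> Surface:
        # Reuse an idle surface of the right size if one exists.
        for index, surface in enumerate(self._idle):
            if surface.size_bytes == size_bytes:
                del self._idle[index]
                break
        else:
            surface = Surface(size_bytes)
        surface.backing_resident = True  # backing memory is needed again
        surface.use_count += 1
        return surface

    def release(self, surface: Surface) -> None:
        surface.use_count -= 1
        if surface.use_count == 0:
            # The surface object survives; only its backing memory goes.
            surface.backing_resident = False
            self._idle.append(surface)

    def idle_resident_bytes(self) -> int:
        """Backing memory still held by surfaces nobody is using."""
        return sum(s.size_bytes for s in self._idle if s.backing_resident)


pool = SurfacePool()
first = pool.obtain(1024)
pool.release(first)          # use count hits zero, backing memory is purged
second = pool.obtain(1024)   # the same surface object comes back from the pool
print(second is first, pool.idle_resident_bytes())
```

The point of the model is that idle surfaces stay in the pool but stop holding backing memory, so idle_resident_bytes() stays at zero however many surfaces are parked there.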

Please test with this tryserver build and let us know your results. You'll likely have to test with it for at least a week to be sure.

I only just now saw that there's a new build. I have downloaded and installed it. The result is weird. Pop-up elements in the chrome draw as red translucent boxes instead of being usable. Either it's a debug mode I don't recognize, or something is pretty broken there. :)

I can try to test this as-is, but it would be easier if the UX were fully visible...

Attached: what the toolbar and other chrome pop-up elements look like. Can't work anything in toolbars properly as they come out like this.

Thanks for the info, Eric. I do remember seeing a few very small pinkish artifacts when I tested the patch, though nothing like what you report. However, the fact that we both saw them shows they're triggered by my patch. So it isn't a viable fix, even if it prevents your crashes.

Sigh.

I'm still working on this. Most of my time is spent reverse engineering the IOAcceleratorFamily2 kernel extension, with particular attention to how "purging" works. The encouraging behavior from Safari, that I reported in comment 184, is quite real. But it seems to depend on characteristics of the Metal framework that will be very difficult to replicate elsewhere.

> I do remember seeing a few very small pinkish artifacts when I tested the patch

I just retested my tryserver build, and every popup is opaque and pinkish -- context menus, tooltips, everything. There might conceivably be some way to turn off the effects of my patch in popup windows. But before I look into that I want to continue my line of research into IOAcceleratorFamily2 a bit longer.

Interestingly, the "opaque" popups are pinkish only if I'm using my "discrete" AMD Radeon R9 M370X graphics hardware. They're light grey when I'm using my "integrated" Intel Iris Pro graphics hardware. And neither kind is quite opaque (though no text or images show up in them).

Steven -- Some of my artifacts are opaque, others are translucent. Not sure what causes which to happen. But having been running the build for a while now, it's not just chrome popups. I'm seeing this in page content as well.

It's worth noting that I've been seeing this happen fairly often in Safari for months now (the colored blotches). The behavior is actually remarkably similar in terms of how it acts to the behavior described in the blog post mentioned in the latest newsletter, even though that bug is supposedly specific to Windows. But the behavior of boxes that seem to randomly appear when rendering that clear up on scrolling or mousing over elements is exactly what happens here (except with your patch, where they don't refresh).

So, that's curious. Anyway.

Forgot to add that I've had the test build running since shortly before posting about the graphics issues, and have been using it as my primary browser (running another copy of Firefox side by side with it to compare UX elements to in order to click the right things when I can't see them). No crashes yet...

Attached: screen shot of a window that has red boxes rendering instead of correct content when running the test build.

I see this exact same behavior in release versions of Safari, and have been seeing that for months now. Is it likely to be a graphics driver bug that is being triggered by something that both Firefox and Safari do?

Hi Eric. Thanks for using my tryserver build despite its problems. What you say about artifacts in non-popup windows is bad news, though. Though it'd be tricky, I should be able to change my patch's behavior according to whether or not Firefox is drawing into a popup window. I currently have no idea how I'd deal with them if they can happen in any window, more or less at random.

Please keep testing. If you go for a week without crashes, then it'll be worthwhile trying to save my patch (even though I currently don't have a clue how to do that). Otherwise it's back to the drawing board.

I suspect Safari's artifacts have something to do with its more aggressive "purging", and with the fact that your system uses its video resources to their limits, and sometimes a little beyond. That's just a guess, though. I don't think they are related to the target of your blog post's challenge: Safari's artifacts seem to fill an entire native object (if they're similar to the ones you've seen with my tryserver build), but the webrender artifacts seem to be random splotches.

The bug just reproduced in your tryserver build. :(

Sorry to hear it. Though in a way I'm glad I don't have to go into the rabbit hole of trying to save that particular patch.

I'll keep plugging away, but so far I don't have any further promising ideas for new patches.

Steven - have we filed a radar ticket for this issue with Apple?

Heh - I wonder what would happen if I tried to use one of my paid developer support incidents with my Apple Developer Program membership for this problem. :D

I haven't filed a radar ticket. Go ahead and do that yourself -- it's worth a try. Be sure to refer them to this bug, and to point out your dtrace log from comment 161. That shows exactly where the crashes are triggered. It's conceivable that Apple could "fix" this bug by making the error condition non-fatal -- by stopping it from resulting in IOAccelContext2::setContextError(unsigned int) being called. But somehow I doubt they'll be willing to do that.

Wait until you see the results from that before trying to use a support incident. In my experience it's Apple company policy never to discuss macOS internals with an outsider -- at least never in any detail. That would severely limit the usefulness of the support incident. They might just end up telling you "use Metal" or "don't do CA on a secondary thread", or something similarly unhelpful.

For what it's worth, here's a dtrace log for a crash that I just experienced, after following the procedure in the last paragraph of comment 188. Firefox's (user-level) stacks are symbolicated in it. It was produced with today's mozilla-central nightly.

Most of the time I don't see these crashes. Instead I see a bunch of other failed calls in my dtrace logs, all of which are non-fatal. I don't really know why I sometimes see the crashes and sometimes don't.

I've submitted a radar ticket: https://feedbackassistant.apple.com/feedback/7622909 (I don't know if you can see that or not)

I don't know if you can see that or not

I can't. I'm not surprised. I've only ever been able to see my own bug reports.

It's really too bad that Apple's bugbase isn't public.

Hm. CA is supposedly thread-safe, with certain caveats around making sure you do locking as needed...

Is there value in my continuing to test things here? I have not started Firefox in a week or so now, and probably won't again until there's another patch to try out here, since it's just too unstable to trust for day to day work. :(

Is there value in my continuing to test things here?

Not for the moment, as best I can tell.

I'm still more or less where I was at comment #150: I know generally why these crashes happen (because your hi-resolution displays use system resources to their limits, and sometimes beyond), but I still don't know why they don't happen (or happen less frequently) in Chrome and Safari. I've got some leads I'm pursuing, but I don't know when (or if) they're going to pan out. I'll keep digging. But I take breaks now and then, when the intensity of the work starts to interfere with my sleep :-)

I just noticed something rather interesting, testing with the same three noisy pages on Firefox and Chrome: The average "resident data size" of an IOAccelResource object (as returned by IOAccelResourceGetResidentDataSize()) is much smaller on Chrome than on Firefox. So is the total size of all IOAccelResource objects created during standardized tests of these three pages, though the number of IOAccelResource objects created by Chrome is twice that created by Firefox. I didn't count objects with zero "resident data size".

As best I can tell, a resource's "resident data size" is the amount of wired video memory it takes up when paged in, and the amount of wired system memory it uses for backing store while it can still be paged in or out.

This doesn't apply to Safari, whose average "resident data size" is about the same as Firefox's, and whose total "resident data size" is about twice as large. But then Safari seems to have its own (Metal based) strategy for "purging" resources, used by neither Chrome nor Firefox.

I'll see what I can make of this. It's my first real lead in quite a long time.

That's curious. I wonder what to make of it. Hm.

I'm slowly digging through behaviors that have something to do with OpenGL, trying to find significant differences between Firefox and Chrome, and seeing if changes to either browser make a difference wrt this phenomenon. Nothing so far.

Steven - do you want me to try disconnecting my external displays to reduce the taxation on my video memory, to hopefully confirm the memory pressure hypothesis? Or are you certain enough of that that you don't feel there's any value in it?

Chrome keeps inserting spurious &nbsp; entities throughout the documents I write for MDN and it's driving me freaking batty. :)

Go ahead and try that, if you'd like (and it isn't too inconvenient). But please also reboot your computer when you do, and keep your external displays disconnected for the length of your test. If I'm right, reducing your system's demand for pixels should reduce the frequency of crashes, though I doubt it will eliminate them entirely. I suspect the basic cause is one or more design flaws in the 5K iMac.

No new ideas yet. I've taken a break for the last few days. But I'll start working again today, and see where it gets me.

No new ideas yet.

Speaking of which, one just popped into my head. Do you "share screen" remotely on your computer a lot? Do others? I'm not sure what difference this would make in terms of total "pixel demand", but I figure it's worth asking.

And yet another idea, Eric. Here's something to try before taking the drastic step of disconnecting your external displays and rebooting your system. In about:config, try setting gfx.webrender.compositor.surface-pool-size to 0 (the default is 25). You'll need to restart Firefox after making this change.

IOAccelResource objects corresponding to SurfacePoolEntry objects are quite large. Most SurfacePool objects have their poolSizeLimit set to 0. The webrender compositor's RenderThread::SharedSurfacePool doesn't, though. This change makes it behave like the other SurfacePool objects.
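To illustrate why the pool size limit matters, here is a minimal sketch of a recycling pool with a size limit. The names (`ToySurface`, `ToySurfacePool`, `Acquire`, `Release`, `PooledBytes`) are hypothetical and this is not Firefox's actual SurfacePool code. It only assumes that a pool keeps up to its limit of released entries alive for reuse, so that a limit of 0 means every released surface is freed right away.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for a large GPU-backed surface.
struct ToySurface {
  explicit ToySurface(std::size_t bytes) : data(bytes) {}
  std::vector<unsigned char> data;
};

// Toy recycling pool: keeps at most `limit` released surfaces for reuse.
class ToySurfacePool {
 public:
  explicit ToySurfacePool(std::size_t limit) : mLimit(limit) {}

  // Reuse a pooled surface of the requested size if one exists,
  // otherwise allocate a new one.
  std::unique_ptr<ToySurface> Acquire(std::size_t bytes) {
    for (auto it = mAvailable.begin(); it != mAvailable.end(); ++it) {
      if ((*it)->data.size() == bytes) {
        std::unique_ptr<ToySurface> surface = std::move(*it);
        mAvailable.erase(it);
        return surface;
      }
    }
    return std::make_unique<ToySurface>(bytes);
  }

  // Return a surface to the pool, then evict the oldest entries
  // until the pool is back within its limit.
  void Release(std::unique_ptr<ToySurface> surface) {
    mAvailable.push_back(std::move(surface));
    while (mAvailable.size() > mLimit) {
      mAvailable.pop_front();
    }
  }

  // Bytes that stay allocated only because they sit in the pool.
  std::size_t PooledBytes() const {
    std::size_t total = 0;
    for (const auto& surface : mAvailable) {
      total += surface->data.size();
    }
    return total;
  }

 private:
  std::size_t mLimit;
  std::deque<std::unique_ptr<ToySurface>> mAvailable;
};
```

With a limit of 25 the toy pool can hold up to 25 idle surfaces; with a limit of 0 it holds none, trading reuse for a smaller idle footprint.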

I've had webrender turned on at times, but it's not been on recently and isn't now. I turned it on at one point to see if the crashes went away using webrender and they didn't.

Would there be value in turning it on just to see what if any changes to the crash's stack trace show up? Perhaps there might be a way to help isolate the problem by comparing the two render paths' crash patterns? Just spitballing ideas.

If I'm right about this bug being fundamentally a problem with system resources, turning on webrender should make the crashes happen much more often. But they're so intermittent that it'd be hard to tell this was happening. The stacks might be different, but I doubt that would tell us much -- the crashes would still almost certainly happen at the same location deep in macOS code.

On balance, I suspect it's not worth your trouble trying to test with webrender on.

OK. Then I will go ahead with the plan to try this with the external displays disconnected as soon as is practicable. Probably not today; possibly won't be until the weekend.

Adding another data point: macOS 10.15.4, Firefox 74.0.1. Have not seen this crash previously but am seeing it very frequently now. Although it happens with high frequency and can be caused by various operations (opening a new tab, clicking a link), like others, I have found no specific way to trigger the crash.
The effects go beyond Firefox: my entire system stalled and I was forced to do a hard reset. Although the mouse kept moving, no operation was possible and graphics glitches happened in other windows too.
27" iMac late 2012 (NVIDIA GeForce GTX 660M 512 MB) with large external monitor.
Crash frequency is 10x today and it's only been 1-2 hours trying to get work done.
A few signatures:
https://crash-stats.mozilla.org/report/index/72a46878-8e91-462b-9f80-6d58e0200406
https://crash-stats.mozilla.org/report/index/b753b947-46ea-4329-8398-8f3de0200406
https://crash-stats.mozilla.org/report/index/99e9429c-85a5-471a-89a0-4af9e0200406
https://crash-stats.mozilla.org/report/index/48b410d4-8aea-4a40-be3b-3f54e0200406
https://crash-stats.mozilla.org/report/index/95747d81-eceb-49ba-b165-133a90200406
https://crash-stats.mozilla.org/report/index/7f2b9ceb-1de4-4050-8040-4edd80200406
https://crash-stats.mozilla.org/report/index/2869aa6f-63fb-4f4d-9a43-956390200406
https://crash-stats.mozilla.org/report/index/891df728-57c8-4a3d-a830-2b51b0200406

Hard to tell why this started only now. 10.15.4 is new as is 74.0.1. I did not see the crashes in 74.0 although this bug has been known for a few months, so it is interesting that this is happening only sporadically.

Interesting. Did the crashes keep happening, with the same frequency, after the reset? I'll look at the changes in Firefox 74.0.1 and try to find something relevant.

Eric, have you seen something similar? Did you ever try testing without any external monitors?

(Following up comment #221)

As best I can tell, Firefox 74.0.1 contained only two changes -- fixes for bug 1626728 and bug 1620818. Both are security bugs, so I can't see the details. But I don't think there's much chance of either of them being relevant here. Also, there hasn't been any sudden surge in the number of this bug's crashes, as you'd expect if either macOS 10.15.4 or Firefox 74.0.1 made a big difference here.

So, steve-_-, I suspect your increase in crash frequency is due to some recent change in your system (possibly hardware, possibly software), but not to either macOS 10.15.4 or Firefox 74.0.1.

Reset = restart mac. Crashes persisting after that.
https://crash-stats.mozilla.org/report/index/07c5db9e-d57d-41e5-a5e6-ca6b30200406
Didn't think 74.0.1 was the cause as this crash has been going on much longer, but wanted to name a few recent changes. Disconnected the external monitor and did not see a crash for about 30 minutes.
Reconnected the external monitor and crashes returned in under 10 minutes. Two windows, 9 tabs, but it also happened with a single window and fewer tabs earlier. No proof, but an interesting correlation with the external monitor.
https://crash-stats.mozilla.org/report/index/991ebead-40d6-48a8-a2f2-a8cc20200406

Is your external monitor itself "new" -- did you start using it as an external monitor just before the crashes increased in frequency? If not, did you change some kind of display setting (either system-wide or in Firefox)?

External monitor has been in use for over a year. Now that you mention it, I had recently disconnected the external monitor to change the order of cables on my desk. When I reconnected it had low resolution and I had to re-configure the settings in macOS system preferences.

Installed 75.0 and the crashes are gone (I hope). Before updating, I was seeing major UI issues: Firefox would get stuck and graphics on the external monitor would be completely broken. Moving the mouse was still possible and trying to create a screenshot caused artefacts all over the place. Fingers crossed this is indeed resolved in 75.0.

I had to re-configure the settings in macOS system preferences.

What did you change?

As macOS is bad at using sane settings for external monitors, I opened System Preferences > Displays, set resolution to "Scaled" and used the "More Space" setting on the very right.
Firefox is now crashing on a per-minute basis. Happy to screenshare if any dev wants to debug this.

Thanks for the info. I suggest playing around with the different "Scaled" settings until you get one that you can live with (that doesn't crash too often). Let us know at which setting this happens.

I'm still working on this bug, but am no closer to a resolution.

Happy to screenshare if any dev wants to debug this

Thanks, but the kinds of things I'd need to do would be too invasive, and potentially dangerous.

One more thing, steve-_-: Do problems happen in other apps (and especially browsers) with your external display set at "Scaled" "More Space"? Please let us know. I'm especially interested in hearing how things go in Chrome, if you ever use it.

Knowing this won't fix the problem (which I strongly suspect is an Apple hardware design flaw), but it might provide clues as to how to change Firefox to work around it.

Unable to get Safari or Chromium to crash or cause similar overall graphic glitches. I have the suspicion that playing a youtube video may help trigger this crash.
Also, when the crash occurs, I have to restart the computer as nothing is usable due to UI wonkyness.
This is system preferences (available 30 days): https://upload.disroot.org/r/MxunmE0B#z77neUGAwJ79Tj72eqB9TSx5+aGPWs6/BPCaxssDrck=
This is Firefox own download details (available 30 days): https://upload.disroot.org/r/X72GqFh3#IHIxKNKgmZdS3soF6dfF3INsQjBk+IHXg6Ik6aIuHO8=

Changing the monitor scaling to the middle setting does prevent Firefox from crashing, or at least I have been unable to crash it. Same for the second-from-the-right setting. But as I have no way to intentionally trigger the crash, there is some guessing involved here.

The scaling system preferences look different (maybe a Catalina-specific change?), so Apple did touch that part of the OS recently.

Thanks for the info about which scaling settings work, and about being unable to crash Safari or Chromium.

This is system preferences (available 30 days): https://upload.disroot.org/r/MxunmE0B#z77neUGAwJ79Tj72eqB9TSx5+aGPWs6/BPCaxssDrck=

This is Firefox own download details (available 30 days): https://upload.disroot.org/r/X72GqFh3#IHIxKNKgmZdS3soF6dfF3INsQjBk+IHXg6Ik6aIuHO8=

Both of these files are corrupt. Possibly this is because they aren't *.png files, as they're labelled. I checked, and neither is compressed (in a format recognized by tar ztf or unzip -t).

The files are not corrupt. This is what I am seeing after Firefox crashes.

OK, I misunderstood.

So is this how you created those files?

  1. Set your external display to "Scaled" "More Space".
  2. Ran Firefox until it crashed with one of this bug's crashes.
  3. Opened System Preferences, took a screenshot of the window, and saved it to a file.
  4. Restarted Firefox, chose Tools : Downloads (or pressed ⌘J), took a screenshot of the window, and saved it to a file.

This implies that at least some of the macOS UI still worked. Presumably, though, you then had to restart your computer to get things back to normal.

A download was already running and I clicked the progress icon which was still somewhat visible. It would have been impossible to start a new download, as the UI was unusable all over the place.

I've been digging into the memory management of shared buffers (between kernel mode and user mode) by the IOAcceleratorFamily2 and IOSurface kernel extensions, and finding new information. But so far I haven't been able to find any differences between Firefox and Chrome and Safari that explain why this bug's crashes happen only (or much more often) in Firefox.

But a while ago I noticed that CGLChoosePixelFormat() is called with one "attribute" in Firefox that isn't used by Chrome -- NSOpenGLPFAAccelerated == 73. (This comparison isn't relevant to Safari, which uses Metal instead of OpenGL.) Not using this attribute makes Firefox scroll noticeably less smoothly on pages with lots of "resources", and (judging from comments in the Mozilla source) may have other bad side effects. But since I'm at a loss about how to proceed, I figure it's now worth trying one more rather far-fetched idea.

I've done a tryserver build that disables Firefox's use of this attribute. Please try it out, steve and Eric. Testing it should be relatively easy for steve: Just set your external display to "Scaled" "More Space" and see what happens.

https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/VfanRq6GT5i9kjZXnHZGTw/runs/0/artifacts/public/build/target.dmg
https://treeherder.mozilla.org/#/jobs?repo=try&revision=7a2ef5aaaec599a616b5fb0609a9cadf00fecd9f

Please try it out and let us know your results.

If this "works", it may be worthwhile to create an about:config setting that allows users to turn off NSOpenGLPFAAccelerated. This shouldn't be the default, though.
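As a rough sketch of what such a switch could look like, the function below builds a zero-terminated pixel format attribute list and includes the "accelerated" attribute only when asked to. The value 73 is the one given above. The function name, the `useAccelerated` flag (standing in for a hypothetical about:config setting) and the double-buffer attribute with its value 5 are assumptions for illustration, not Firefox's actual code.

```cpp
#include <cassert>
#include <vector>

// Value given in this bug for NSOpenGLPFAAccelerated.
constexpr int kPFAAccelerated = 73;
// Assumed value for the double-buffer attribute, included only so the
// list has a second entry; check the CGL/NSOpenGL headers before relying on it.
constexpr int kPFADoubleBuffer = 5;

// Build a zero-terminated attribute list. `useAccelerated` stands in for a
// hypothetical preference that lets users turn the attribute off.
std::vector<int> BuildPixelFormatAttribs(bool useAccelerated) {
  std::vector<int> attribs;
  if (useAccelerated) {
    attribs.push_back(kPFAAccelerated);
  }
  attribs.push_back(kPFADoubleBuffer);
  attribs.push_back(0);  // Terminator expected at the end of the list.
  return attribs;
}
```

On macOS, a list like this would be converted to CGLPixelFormatAttribute values and passed to CGLChoosePixelFormat(). The sketch stops short of that call so it stays platform independent.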

Steven: thanks for still digging into this. Ran the tryserver build and that crashed after a few minutes:
https://crash-stats.mozilla.org/report/index/bf1a989f-8694-4108-b83b-1a0550200408
https://crash-stats.mozilla.org/report/index/88f9031b-b07a-4cb2-8300-524690200408

Thanks, steve. Bad news, but at least it wasn't too difficult to find it out.

I'll keep working on this bug, but for now I'm stuck. I'll give it a few weeks' break. Hopefully that will allow me to come back to it with an open mind.

Supplemental Update installed: 10.15.4 (19E287)
So far no crash for about 1h, which looks very promising. Seems Apple messed up big time with 10.15.4.

Interesting. Maybe Apple really has fixed this bug.

Eric, please try the macOS 10.15.4 Supplemental Update and see what happens.

More bad news: falling back to Safari, I discovered it also suffers from graphics glitches that look very similar to those seen when the Firefox crash is happening. Safari does not crash, which may be because, as Steven described, it is not using the same methods as Firefox. But it seems the assumption that this problem is limited to Firefox was wrong.

That's actually good news. Try opening an Apple bug on it. They're much more likely to pay attention to a problem with Safari. Be sure to note that you've changed the Display setting for your external monitor to "Scaled" "More Space".

Does the problem go away if you change your Display setting?

Just off a call with an Apple Care Senior Advisor and lacking words for the arrogance that Advisor displayed. They call the late 2012 iMac vintage and couldn't care less if anything works on an operating system that officially supports that hardware. Tried to get them to look at this bug here, but they refused and said they need a Safari crash log. As Safari is displaying artefacts but not crashing, it is not something I can provide. They suggested I exchange the cable. I don't see how that would be related to anything. Yes, cables can break, but why would Firefox crash if a cable was broken?

New findings:

  • Fresh test user on 2012 iMac: same problems in Safari and Firefox instantly when external monitor is connected
  • both Thunderbolt ports expose the problematic behavior (Thunderbolt to DisplayPort cable used)
  • 2017 MacBook Pro USB-C to USB-C: neither Firefox nor Safari show signs of trouble, scaling setting can be any
  • once the crash occurs, graphic glitches are all over the place, even when the Thunderbolt cable is removed. So how could this not be a macOS bug 🤯

To me this indicates the problem is between the graphics card "NVIDIA GeForce GTX 660M" and macOS 10.15. Apple refuses to acknowledge that, which I find hilarious but not unexpected.

Maybe this is Apple's way of politely informing users with older hardware to go and spend some cash. Would not be the first time.

Eric: what mac are you using?

Sigh, looks like we won't be getting any help from Apple.

Just to be sure, does changing your Display setting for your external display (from "Scaled" "More Space") on your 2012 iMac make the problems in Safari go away?

As I've said above, I think this bug (the crashes and the glitches) is related to the number of pixels your graphics hardware and software is driving -- the more pixels, the more likely the crashes and glitches. A 5K iMac has lots of pixels, even without an external display. So this bug is much more likely to happen on a 5K iMac.

I don't yet have much information about non-standard Display settings. But I suspect that choosing "more space" increases the number of "virtual" pixels (at least) visible to driver software (in kernel extensions and user-mode libraries), and that "larger text" decreases the number of "virtual" pixels.
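As a rough illustration of the "number of pixels" argument, the sketch below computes the pixel count and the size of a single full-screen buffer for a few display modes. The 5120x2880 figure is the 5K iMac's native panel resolution. The other resolutions and the 4-bytes-per-pixel figure are assumptions for illustration, not measurements from any system in this bug.

```cpp
#include <cassert>
#include <cstdint>

// Number of pixels in one full-screen buffer.
constexpr std::uint64_t PixelCount(std::uint64_t width, std::uint64_t height) {
  return width * height;
}

// Bytes needed for one buffer, assuming 4 bytes per pixel by default.
constexpr std::uint64_t BufferBytes(std::uint64_t width, std::uint64_t height,
                                    std::uint64_t bytesPerPixel = 4) {
  return PixelCount(width, height) * bytesPerPixel;
}

// 5K iMac native panel: roughly 14.7 million pixels.
static_assert(PixelCount(5120, 2880) == 14745600, "5K pixel count");
// One 32-bit buffer at that size is 58,982,400 bytes (56.25 MiB).
static_assert(BufferBytes(5120, 2880) == 58982400, "5K buffer size");

// Hypothetical external display modes, for comparison.
static_assert(PixelCount(2560, 1440) == 3686400, "1440p pixel count");
static_assert(PixelCount(1920, 1080) == 2073600, "1080p pixel count");
```

Every additional full-screen surface (double buffering, per-window backing stores, and so on) adds another buffer of this size, which is why a higher "virtual" resolution could plausibly push a system closer to its limits.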

<Does the problem go away if you change your Display setting?>
Yes, as soon as the second-from-the-right option (next to "More Space") is selected, any issues are gone. Yeah, "larger text" is lower resolution while "more space" is higher resolution. Apple seems to think their users are unable to understand what resolution means. In previous versions of macOS they still had a setting giving the resolution.

Excuse the many posts, but I am posting the findings as they happen. v75 just crashed on the second-from-the-right setting (next to "More Space"). Now switching to the middle setting. We'll see if that prevents the crashes.

I've revised this dtrace script again. It's possible that we'll learn something new from its output if it's running when Firefox crashes with one of this bug's crashes. Save the script as bugzilla1576767.d. Then run it by doing the following at a Terminal prompt:

    sudo dtrace -s bugzilla1576767.d

You can quit the script by typing Control-C.

Please attach the script's output to this bug.

Attachment #9126441 - Attachment is obsolete: true

I'm reasonably sure the number of pixels driven by your graphics hardware is a factor in this bug. Now I think I may have found another one -- the number of app windows that use graphics hardware acceleration that are visible onscreen on at least one of your monitors. As best I can tell, windows that aren't currently visible don't count (for example if they're minimized or in another dock).

I recently found out about a graphics hardware resource that seems to be shared among all currently visible windows -- the GART (https://en.wikipedia.org/wiki/Graphics_address_remapping_table). Calls to IOAccelSysMemory::wire() in the IOAcceleratorFamily2 kernel extension fail if there isn't any room in the GART for the memory being "wired", and may lead (via a cascade of other errors) to the call to IOAccelContext2::setContextError(0xfffffff9) that triggers this bug's crashes. I haven't been able to reproduce these failures, but I'm hoping that steve and Eric and Sorin can.

So, steve and Eric and Sorin, please try my dtrace script from comment #248. Your results should tell me whether or not the GART plays a role in this bug.

Here's yet another revision of my dtrace script. I've now found the proximate cause of my own crashes, at least. It seems to be an Apple bug (in the AMDRadeonX4000 kernel extension's AMDRadeonX4000_AMDHWVMContext::mapVA() method). Eric and Sorin, I'm very interested to see the results of your tests with this script.

steve, since you have GeForce video hardware, you won't be able to use this script as is (or any of its previous versions). You'll need to comment out all the probes that contain "AMDRadeonX4000". I'd still very much like to see your results, though.

As best I can tell, Apple's bug is very simple. Here's reconstructed C++ code for the first few lines of AMDRadeonX4000_AMDHWVMContext::mapVA():

    bool AMDRadeonX4000_AMDHWVMContext::mapVA(vm_address_t startAddress, IOAccelMemory* arg2,
                                              vm_size_t arg3, vm_size_t length,
                                              AMDRadeonX4000_IAMDHWVMM::VmMapFlags arg5)
    {
      if (startAddress < vmRangeStart) {
        return false;
      }
      if (startAddress + length >= vmRangeEnd) {
        return false;
      }
      ...
    }

The bug is that the fourth line of the function should be:

      if (startAddress + length > vmRangeEnd) {

Eric and Sorin, please test with this script. I now suspect that the GART stuff was a red herring, and that you will have exactly the same results as I did.

The AMDRadeonX4000 kernel extension uses the IOKit's IORangeAllocator class to manage allocations from the GPU's VRAM. The address range of 0x400000000 through 0x2400000000 is baked into its machine code. The IORangeAllocator code (which is available at https://opensource.apple.com/ as part of the xnu kernel source distro) tries hard to manage this range efficiently, minimizing fragmentation. But as the range fills up, inevitably some allocations from it are made towards its end. The bug is triggered when an allocation is made flush up to the very end of the range.
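Here is a minimal sketch of why the `>=` comparison rejects an allocation that ends exactly at the end of the range. It assumes the end address 0x2400000000 is exclusive (one past the last valid byte), as the proposed fix implies. The names `FitsBuggy` and `FitsFixed` are hypothetical and only mirror the two comparisons in the reconstructed code, and the 4 MiB length is just an example.

```cpp
#include <cassert>
#include <cstdint>

// Address range quoted above for the AMDRadeonX4000 kernel extension.
constexpr std::uint64_t kVmRangeStart = 0x400000000ULL;
constexpr std::uint64_t kVmRangeEnd = 0x2400000000ULL;

// Mirrors the reconstructed checks: rejects a mapping whose end equals
// the end of the range.
constexpr bool FitsBuggy(std::uint64_t start, std::uint64_t length) {
  return !(start < kVmRangeStart) && !(start + length >= kVmRangeEnd);
}

// Mirrors the proposed fix: only rejects a mapping that runs past the end.
constexpr bool FitsFixed(std::uint64_t start, std::uint64_t length) {
  return !(start < kVmRangeStart) && !(start + length > kVmRangeEnd);
}

// Example: a 4 MiB mapping placed flush against the end of the range.
constexpr std::uint64_t kLength = 4ULL * 1024 * 1024;
constexpr std::uint64_t kLastStart = kVmRangeEnd - kLength;

static_assert(kLastStart == 0x23ffc00000ULL, "start of the last 4 MiB slot");
static_assert(kLastStart + kLength == kVmRangeEnd, "ends exactly at the end");
static_assert(!FitsBuggy(kLastStart, kLength), "rejected by the >= check");
static_assert(FitsFixed(kLastStart, kLength), "accepted by the > check");

// A mapping one slot earlier passes both checks.
static_assert(FitsBuggy(kLastStart - kLength, kLength), "not at the edge");
```

Only the very last slot of the range behaves differently under the two comparisons, which would fit with the crashes showing up only once the range has filled up.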

I will report this bug to Apple when I know a bit more about it. In the meantime, it may be possible to work around it by messing with Firefox code to make it invoke odd-sized allocations. That might increase fragmentation and slow performance, and would presumably need to be hidden behind a preference. I'll keep digging and see what I can come up with.

Attachment #9141850 - Attachment is obsolete: true
Flags: needinfo?(sorin.sbarnea)
Flags: needinfo?(eshepherd)

Here's a log made with my latest dtrace script, of Firefox crashing with one of this bug's crashes. As always, I used HookCase to hook the IOAccelerator framework's ioAccelResourceFinalize(), preventing it from calling the original method (thus leaking all the resources). Then I scrolled up and down in three different very noisy pages in three different tabs, until the crash happened (about five minutes).

I don't know why none of the user mode (Firefox) stacks are symbolicated -- I tested with today's mozilla-central nightly, which doesn't have its symbols stripped. But at least all the kernel mode stacks are symbolicated.

Here's a better one, whose Firefox stacks are symbolicated. I edited it down -- there was a lot of repetition.

Attachment #9142545 - Attachment is obsolete: true

Here are a couple of my Firefox crash logs:

bp-10c9755c-63c2-49be-a8fb-921ac0200422
bp-4910ab8a-b6fc-47b6-a941-d9d6d0200422

Note "hook.dylib" in the crash stacks. This marks them as test crashes, and not "real" ones.

(Following up comment #250)

The same bug is also present in the AMDRadeonX5000 and AMDRadeonX6000 kernel extensions, in the following methods:

    AMDRadeonX5000_AMDHWVMContext::mapVA(unsigned long long, IOAccelMemory*, unsigned long long, unsigned long long, AMDRadeonX5000_IAMDHWVMM::VmMapFlags)
    AMDRadeonX6000_AMDHWVMContext::mapVA(unsigned long long, IOAccelMemory*, unsigned long long, unsigned long long, AMDRadeonX6000_IAMDHWVMM::VmMapFlags)

I don't know about Apple's other graphics driver kernel extensions. I've only been able to test with AppleIntelHD5000Graphics, since my MacBook Pro also has Intel Iris Pro graphics hardware. It doesn't seem to have the bug.

steve, you almost certainly aren't experiencing the bug I described in comment #250. You might, though, be hitting the GART problem I described in comment #249.

(Following up comment #254)

I should note that all my tests have so far been on macOS Catalina (10.15.X). But I just looked at the machine code for macOS 10.14.6's AMDRadeonX4000 kernel extension, and it appears to have the same bug in the same method. Over the next few days I'll do more work on macOS Mojave, and maybe also look at earlier macOS versions.

@Steven: Firefox refused to crash. Felt like going to a car repair shop and the problem refusing to show. It's fascinating, as previously the crashes would happen every few minutes while now it ran for 25 minutes straight on the "More Space" system preferences > monitor setting. It produced some output, attached below as it was only a few lines:

dtrace: script 'bugzilla1576767.d' matched 11 probes
dtrace: error on enabled probe ID 9 (ID 188733: fbt:com.apple.iokit.IOAcceleratorFamily2:_ZN20IOAccelVidMemoryList15ReverseIterator13getPrevMemoryEv:return): invalid address (0x4) in action #4
dtrace: error on enabled probe ID 8 (ID 190582: fbt:com.apple.iokit.IOAcceleratorFamily2:_ZN22IOGraphicsAccelerator223freeWaitToPrepareVidMapEP16IOAccelMemoryMapbb:return): invalid address (0x4) in action #3
dtrace: error on enabled probe ID 9 (ID 188733: fbt:com.apple.iokit.IOAcceleratorFamily2:_ZN20IOAccelVidMemoryList15ReverseIterator13getPrevMemoryEv:return): invalid address (0x4) in action #4
dtrace: error on enabled probe ID 8 (ID 190582: fbt:com.apple.iokit.IOAcceleratorFamily2:_ZN22IOGraphicsAccelerator223freeWaitToPrepareVidMapEP16IOAccelMemoryMapbb:return): invalid address (0x4) in action #3

After my first test run without a crash for 25 minutes, Firefox did crash as soon as I had closed debugging in Terminal.
Crash log 1: https://crash-stats.mozilla.org/report/index/3fd6e165-415c-4059-9a9c-968f90200426

Second attempt with the script running was more successful, below is the output that appeared while a crash happened:

dtrace: script 'bugzilla1576767.d' matched 11 probes
dtrace: error on enabled probe ID 9 (ID 188733: fbt:com.apple.iokit.IOAcceleratorFamily2:_ZN20IOAccelVidMemoryList15ReverseIterator13getPrevMemoryEv:return): invalid address (0x4) in action #4
dtrace: error on enabled probe ID 8 (ID 190582: fbt:com.apple.iokit.IOAcceleratorFamily2:_ZN22IOGraphicsAccelerator223freeWaitToPrepareVidMapEP16IOAccelMemoryMapbb:return): invalid address (0x4) in action #3
dtrace: error on enabled probe ID 9 (ID 188733: fbt:com.apple.iokit.IOAcceleratorFamily2:_ZN20IOAccelVidMemoryList15ReverseIterator13getPrevMemoryEv:return): invalid address (0x4) in action #4
dtrace: error on enabled probe ID 8 (ID 190582: fbt:com.apple.iokit.IOAcceleratorFamily2:_ZN22IOGraphicsAccelerator223freeWaitToPrepareVidMapEP16IOAccelMemoryMapbb:return): invalid address (0x4) in action #3
dtrace: error on enabled probe ID 9 (ID 188733: fbt:com.apple.iokit.IOAcceleratorFamily2:_ZN20IOAccelVidMemoryList15ReverseIterator13getPrevMemoryEv:return): invalid address (0x4) in action #4
dtrace: error on enabled probe ID 8 (ID 190582: fbt:com.apple.iokit.IOAcceleratorFamily2:_ZN22IOGraphicsAccelerator223freeWaitToPrepareVidMapEP16IOAccelMemoryMapbb:return): invalid address (0x4) in action #3
dtrace: error on enabled probe ID 9 (ID 188733: fbt:com.apple.iokit.IOAcceleratorFamily2:_ZN20IOAccelVidMemoryList15ReverseIterator13getPrevMemoryEv:return): invalid address (0x4) in action #4
dtrace: error on enabled probe ID 8 (ID 190582: fbt:com.apple.iokit.IOAcceleratorFamily2:_ZN22IOGraphicsAccelerator223freeWaitToPrepareVidMapEP16IOAccelMemoryMapbb:return): invalid address (0x4) in action #3
dtrace: error on enabled probe ID 9 (ID 188733: fbt:com.apple.iokit.IOAcceleratorFamily2:_ZN20IOAccelVidMemoryList15ReverseIterator13getPrevMemoryEv:return): invalid address (0x4) in action #4
dtrace: error on enabled probe ID 8 (ID 190582: fbt:com.apple.iokit.IOAcceleratorFamily2:_ZN22IOGraphicsAccelerator223freeWaitToPrepareVidMapEP16IOAccelMemoryMapbb:return): invalid address (0x4) in action #3
dtrace: error on enabled probe ID 5 (ID 190578: fbt:com.apple.iokit.IOAcceleratorFamily2:_ZN22IOGraphicsAccelerator220freeToPrepareMappingEP16IOAccelMemoryMap:return): invalid address (0x4) in action #3
dtrace: error on enabled probe ID 2 (ID 189419: fbt:com.apple.iokit.IOAcceleratorFamily2:_ZN16IOAccelResource27prepareEv:return): invalid address (0x4) in action #3
dtrace: error on enabled probe ID 1 (ID 187560: fbt:com.apple.iokit.IOAcceleratorFamily2:_ZN15IOAccelContext215setContextErrorEj:entry): invalid address (0x4) in action #4

Crash log 2: https://crash-stats.mozilla.org/report/index/4a1de7c4-961d-428a-8593-c33730200426

Interesting. Something seems to be suppressing the script's output. Do you have antivirus software running? If so, please turn it off, reboot your computer, and test again. It might also help to turn SIP off entirely (by rebooting into your Recovery partition and doing "csrutil disable").

The dtrace probes may change the timing and order of events, which might explain why it's harder to reproduce this bug's crashes while the dtrace script is running. It might help if I commented out more of the probes. But I don't want to accidentally comment out something critically important. For the time being it's more important to stop the script's output from being suppressed. Later, if need be, I'll post copies of the dtrace script with more probes commented out.

Thanks, by the way, for running my dtrace script. As I learn more, I may give you more to run. They'll all be tailored to your machine, with its GeForce graphics hardware.

I've now proved to my own satisfaction that Apple's bug is the major (and perhaps the only) source of this bug's crashes on AMD Radeon graphics hardware. To do this I added code to HookCase to patch the bug. That required only a single byte change, to morph a JAE instruction to a JA instruction. Then I reran my test with calls to ioAccelResourceFinalize()'s original function disabled. No crashes, even though I carried my test beyond the point I normally do. The only output from my dtrace script was as follows:

    dtrace: script 'bugzilla1576767.d' matched 14 probes
    CPU     ID                    FUNCTION:NAME
      2 147679 _ZN29AMDRadeonX4000_AMDHWVMContext5mapVAEyP13IOAccelMemoryyyN24AMDRadeonX4000_IAMDHWVMM10VmMapFlagsE:entry Process firefox, start 0x23ffc00000, length 4194304

This shows my patch was working. This single line of output would normally have been the start of a long cascade of errors, leading to a call to IOAccelContext2::setContextError(0xfffffff9).
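To illustrate why a single byte is enough: on x86-64, JAE ("jump if above or equal") and JA ("jump if above") differ in one opcode byte, so turning an unsigned `>=` test into `>` means rewriting that byte. The following is only a minimal user-space sketch on a byte buffer, not the HookCase code. The function name `patch_jae_to_ja` is hypothetical, and which encoding (short or near) the kernel extension actually uses is an assumption the sketch does not settle, so it handles both.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// x86-64 conditional-jump opcodes:
//   JAE rel8  = 73 cb       JA rel8  = 77 cb
//   JAE rel32 = 0F 83 cd    JA rel32 = 0F 87 cd
// Hypothetical helper: rewrites the JAE at `offset` as a JA.
// Returns true only if a JAE was found there and patched.
bool patch_jae_to_ja(std::vector<std::uint8_t>& code, std::size_t offset) {
  if (offset >= code.size()) {
    return false;
  }
  if (code[offset] == 0x73) {  // short form: one opcode byte
    code[offset] = 0x77;
    return true;
  }
  if (code[offset] == 0x0F && offset + 1 < code.size() &&
      code[offset + 1] == 0x83) {  // near form: second opcode byte differs
    code[offset + 1] = 0x87;
    return true;
  }
  return false;  // not a JAE: leave the bytes untouched
}
```

Checking the existing byte before writing is the same kind of sanity check that keeps a patch from being applied to code that has already changed.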

When I do my "leak resources" tests, I use the output of ioclasscount as a rough guide to how far the test has gone. The crashes normally happen when AMDRadeonX4000_AMDAccelVidMemory and AMDRadeonX4000_AMDAccelSysMemory reach about 25000. This time I carried on to 31000 without seeing any crashes. All I saw was some lags -- no graphics artifacts, even temporary ones.

If people are interested, I can attach a patch to the current HookCase master branch that fixes the Apple bug in AMDRadeonX4000, AMDRadeonX5000 and AMDRadeonX6000 kernel extensions on the current version of macOS Catalina (10.15.4). Loading HookCase patches the bug in running kernel extension code. Unloading it unpatches it (restores the code to its original condition).

If there's sufficient interest, I could even create a separate kernel extension, using the relevant parts of HookCase code, that does nothing but fix the bug when it's loaded, and unfix it when it's unloaded. For those of you who see this bug on AMD Radeon hardware, it'd be interesting to test with Apple's bug fixed.

You need to have Xcode and its command line tools to build either HookCase or this new kernel extension. You also need to add "keepsyms=1" to your kernel boot args.

I've created a new kernel extension that patches Apple's bug:

https://github.com/steven-michaud/PatchBug1576767

Anyone who has AMD Radeon hardware and sees this bug's crashes, please try it out. You'll need the current version of macOS Catalina (10.15.4). I hope and assume that PatchBug1576767 will stop all your crashes. Please let me know your results.

I will refer to PatchBug1576767 as proof of concept when I open my bug with Apple. But before I do that, I'd like to know what happens to people who see these crashes without having to use HookCase to hack ioAccelResourceFinalize().

Steven, please excuse the late reply. Held up with other work and personal stuff.
Disabled SIP, ran the script, and this was the output once Firefox crashed (which still happens with high frequency).
As the output still shows errors on your probes, I am unsure what could be blocking it. No AV here. I disabled the LittleSnitch firewall for the test. The macOS firewall is not running.

dtrace: script 'bugzilla1576767.d' matched 11 probes
dtrace: error on enabled probe ID 9 (ID 184269: fbt:com.apple.iokit.IOAcceleratorFamily2:_ZN20IOAccelVidMemoryList15ReverseIterator13getPrevMemoryEv:return): invalid address (0x4) in action #4
dtrace: error on enabled probe ID 8 (ID 186118: fbt:com.apple.iokit.IOAcceleratorFamily2:_ZN22IOGraphicsAccelerator223freeWaitToPrepareVidMapEP16IOAccelMemoryMapbb:return): invalid address (0x4) in action #3
dtrace: error on enabled probe ID 9 (ID 184269: fbt:com.apple.iokit.IOAcceleratorFamily2:_ZN20IOAccelVidMemoryList15ReverseIterator13getPrevMemoryEv:return): invalid address (0x4) in action #4
dtrace: error on enabled probe ID 8 (ID 186118: fbt:com.apple.iokit.IOAcceleratorFamily2:_ZN22IOGraphicsAccelerator223freeWaitToPrepareVidMapEP16IOAccelMemoryMapbb:return): invalid address (0x4) in action #3
dtrace: error on enabled probe ID 9 (ID 184269: fbt:com.apple.iokit.IOAcceleratorFamily2:_ZN20IOAccelVidMemoryList15ReverseIterator13getPrevMemoryEv:return): invalid address (0x4) in action #4
dtrace: error on enabled probe ID 8 (ID 186118: fbt:com.apple.iokit.IOAcceleratorFamily2:_ZN22IOGraphicsAccelerator223freeWaitToPrepareVidMapEP16IOAccelMemoryMapbb:return): invalid address (0x4) in action #3
dtrace: error on enabled probe ID 5 (ID 186114: fbt:com.apple.iokit.IOAcceleratorFamily2:_ZN22IOGraphicsAccelerator220freeToPrepareMappingEP16IOAccelMemoryMap:return): invalid address (0x4) in action #3
dtrace: error on enabled probe ID 2 (ID 184955: fbt:com.apple.iokit.IOAcceleratorFamily2:_ZN16IOAccelResource27prepareEv:return): invalid address (0x4) in action #3
dtrace: error on enabled probe ID 1 (ID 183096: fbt:com.apple.iokit.IOAcceleratorFamily2:_ZN15IOAccelContext215setContextErrorEj:entry): invalid address (0x4) in action #4

I'm stumped, Steve. I don't know what to ask you next.

Steve: On second thought, maybe you've hit some kind of dtrace bug. Here's a script with all the hardware-specific probes removed, and also all references to the built-in "execname" variable. Please test with it and let us know your results.

Hm, this looks very similar.

@@@
dtrace: script 'bugzilla1576767.d' matched 11 probes
dtrace: error on enabled probe ID 9 (ID 195724: fbt:com.apple.iokit.IOAcceleratorFamily2:_ZN20IOAccelVidMemoryList15ReverseIterator13getPrevMemoryEv:return): invalid address (0x4) in action #3
dtrace: error on enabled probe ID 8 (ID 197573: fbt:com.apple.iokit.IOAcceleratorFamily2:_ZN22IOGraphicsAccelerator223freeWaitToPrepareVidMapEP16IOAccelMemoryMapbb:return): invalid address (0x4) in action #3
dtrace: error on enabled probe ID 9 (ID 195724: fbt:com.apple.iokit.IOAcceleratorFamily2:_ZN20IOAccelVidMemoryList15ReverseIterator13getPrevMemoryEv:return): invalid address (0x4) in action #3
dtrace: error on enabled probe ID 8 (ID 197573: fbt:com.apple.iokit.IOAcceleratorFamily2:_ZN22IOGraphicsAccelerator223freeWaitToPrepareVidMapEP16IOAccelMemoryMapbb:return): invalid address (0x4) in action #3
dtrace: error on enabled probe ID 9 (ID 195724: fbt:com.apple.iokit.IOAcceleratorFamily2:_ZN20IOAccelVidMemoryList15ReverseIterator13getPrevMemoryEv:return): invalid address (0x4) in action #3
dtrace: error on enabled probe ID 8 (ID 197573: fbt:com.apple.iokit.IOAcceleratorFamily2:_ZN22IOGraphicsAccelerator223freeWaitToPrepareVidMapEP16IOAccelMemoryMapbb:return): invalid address (0x4) in action #3
dtrace: error on enabled probe ID 5 (ID 197569: fbt:com.apple.iokit.IOAcceleratorFamily2:_ZN22IOGraphicsAccelerator220freeToPrepareMappingEP16IOAccelMemoryMap:return): invalid address (0x4) in action #3
dtrace: error on enabled probe ID 2 (ID 196410: fbt:com.apple.iokit.IOAcceleratorFamily2:_ZN16IOAccelResource27prepareEv:return): invalid address (0x4) in action #3
dtrace: error on enabled probe ID 1 (ID 194551: fbt:com.apple.iokit.IOAcceleratorFamily2:_ZN15IOAccelContext215setContextErrorEj:entry): invalid address (0x4) in action #3
@@@

Try this. I wish I knew what "action #3" means.

Dtrace script with hardware-specific probes removed, new try (take two) - looks like we have some useful output.

https://bin.disroot.org/?ea0c90cd4828f23f#6DxGP9DTHmecn8u5YgKiUuAkoZaZGhjZorVnRpML47KN (1 year)

Excellent! And thanks for keeping up with your tests.

Your "context error" is different -- 0xfffffffd/-3 instead of 0xfffffff9/-7. The latter means "out of memory". Yours seems to mean something like "unable to make resident" -- the memory is allocated, but the system is unable to wire/map it in. The 0xfffffffd/-3 context error is somewhat rare. So far I've only found it in the GeForce kernel extension. Other graphics extensions (like the AMDRadeon and AppleIntel ones) may consider the "unable to make resident" error condition to be just a subset of the "out of memory" condition. So your problem is somewhat like those seen by people with AMDRadeon graphics hardware, but we can't assume that it's exactly the same.
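The paired notation used above (0xfffffff9/-7, 0xfffffffd/-3) is the same 32-bit value read as unsigned and as signed. A small sketch of that conversion follows. The function names are hypothetical, and the meanings attached to -7 and -3 are only the readings given in this comment, not anything taken from Apple documentation.

```cpp
#include <cstdint>
#include <string>

// Reinterpret a 32-bit context error as a signed value without relying on
// implementation-defined narrowing: widen first, then subtract 2^32 when
// the top bit is set.
constexpr std::int32_t as_signed(std::uint32_t code) {
  return (code & 0x80000000u)
             ? static_cast<std::int32_t>(static_cast<std::int64_t>(code) -
                                         0x100000000LL)
             : static_cast<std::int32_t>(code);
}

// Hypothetical lookup. The descriptions are the interpretations from this
// thread; any other value is reported as unknown.
std::string describe_context_error(std::uint32_t code) {
  switch (as_signed(code)) {
    case -7:
      return "out of memory";
    case -3:
      return "unable to make resident";
    default:
      return "unknown";
  }
}
```

For example, `as_signed(0xfffffff9u)` evaluates to -7 and `as_signed(0xfffffffdu)` to -3.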

I will pore through your dtrace log, and the GeForce and IOAcceleratorFamily2 kernel extensions, looking for probes to add to what I'll now start calling the GeForce-specific dtrace script. Once I'm done I'll post the script and ask you to test with it. It will probably take several iterations to really learn much.

Steve: Here's my revised dtrace script. Please test with it and let us know your results.

Attachment #9146519 - Attachment is obsolete: true
Attachment #9146938 - Attachment is obsolete: true

The iMac came to a complete halt. Nothing worked, UI blinking, artefacts all over, no navigation possible, nothing. Force quit via hardware. On restart Terminal had quite a lot of debug info restored:
https://bin.disroot.org/?f626113bba45fa4d#5fw82ssAJUijfd8B8FnRh2RemFepdKrwLYUrAjKWEENR (1 year)

A large part of the output is missing, I think. There's nothing from the "firefox" process, and also no log for "setContextError".

What content there is is all from the WindowServer process. If that crashes, your current login session dies. If it hangs, so does the entire UI. You probably experienced a hang, or at least a drastic slowdown, in the WindowServer process.

I will try to revise my script to make its output less noisy, and prevent any WindowServer output from showing up.

Steve, please try this.

Attachment #9147318 - Attachment is obsolete: true

Steve: Your latest results are interesting, but I'm having trouble wrapping my mind around them. Here's another revision of my script, with a couple more probes commented out. Please try it and let us know your results.

Your trace from comment 273 does contain a log for setContextError, and the error is the same (0xfffffffd/-3, unable to make resident). But it's also significantly different from your trace from comment 267 -- it doesn't contain a log for IOAccelMemoryMap::commit_pte(). But I'm not sure how significant this difference is -- whether or not it shows that you're experiencing two different bugs. Probes and hooks that produce output can alter the timing of events, and possibly that's all that's going on. My latest dtrace script comments out a couple more of the noisiest probes.

NvRmVidHeapControl() may turn out to be the key here. Its error return (0x51) means "out of memory". But as I mentioned above, there's an ambiguity here -- has your system truly run out of (video) memory, or has it just failed to map/wire it in? Right now my mind is spinning from trying to figure out this function (it doesn't help that I don't have GeForce hardware to test with). I'm going to take a few days' break and then come back.

Attachment #9147473 - Attachment is obsolete: true

Dtrace script with GeForce-specific probes
https://bin.disroot.org/?de1565c5eac3f761#J3gD8Nn3ymdbJmEqFs51wtpyKNLc6Z1EWCkdSumYrK9P (1 year)
Take your time and thanks for your continued investigations.

Thanks. Still no call to IOAccelMemoryMap::commit_pte(), as there was in your trace from comment #267. Maybe that call was just a fluke. I'll be back in a few days to mull over all of this.

Steve, here's another version of my GeForce-specific dtrace script for you to try. I added several new probes, more or less at random. I've been concentrating my attention on the NVDAResman kernel extension (where NvRmVidHeapControl() lives). But most of its symbols have been stripped, and it's very difficult to work with.

Please let us know your results.

Attachment #9147708 - Attachment is obsolete: true

There is an error with this version of the script:

sudo dtrace -s /Users/username/Desktop/bugzilla1576767.d 
dtrace: failed to compile script /Users/username/Desktop/bugzilla1576767.d: line 152: probe description ::RmMapMemoryDMA:return does not match any probes
Flags: needinfo?(smichaud)

Oops, I misspelled the name of that probe. Sorry.

Please try this revision.

Attachment #9150309 - Attachment is obsolete: true
Flags: needinfo?(smichaud)

Actually, better still to try this revision. I added a couple more probes that I just realized I'd missed.

Attachment #9150459 - Attachment is obsolete: true

As much output as the script generated, some of it is still missing. The only probe failures I see are for osAllocMem(). There should be a lot more different kinds (as there were in your output from comment 273 and comment 275).

That said, what output you have is quite interesting. It'll take me a while to digest it.

In the meantime please try this new revision, with the osAllocMem() probe commented out.

Attachment #9150490 - Attachment is obsolete: true

Also very interesting. I don't see any failures in IOMalloc() (the kernel's standard way of allocating memory, called from osAllocMem()). I think this means that you aren't actually running out of memory -- at least of standard system memory. So, if I can manage it, I need to figure out why osAllocMem() is failing but not IOMalloc().

Here's another revision of my GeForce-specific dtrace script. I added a few more probes, but the osAllocMem() probe is still commented out. Please try it, Steve, and let us know your results.

I looked more closely at the binary code for osAllocMem(). As best I can tell, almost all the error returns recorded for it in your trace from comment 281 are impossible -- the only "legitimate" error return is 0x51 (meaning "out of memory"). So something strange is going on, which leads me to wonder if you have any unusual kernel extensions running. So here's another task I'd like you to perform:

At a Terminal prompt, enter "kextstat" and post its results here.

kextstat shows information on all running kernel extensions.

Attachment #9150582 - Attachment is obsolete: true

Interesting. It's quite different from your output from comment 283, though both end in setContextError(0xfffffffd). There appears to be some natural variation in exactly how the error propagates through your graphics kernel extensions. It may even be true that not all your error cascades start with an error return from osAllocMem() (from the NVDAResman kernel extension). But if possible I'd like to learn more about your osAllocMem() failures. Please start by running kextstat in a Terminal window and posting the output here.

You've got one non-Apple kernel extension: at.obdev.nke.LittleSnitch. Try disabling LittleSnitch and restarting your computer. Let us know if that makes any difference to the crashes you've been seeing.

LittleSnitch at.obdev.nke.LittleSnitch
Keybase com.github.kbfuse.filesystems.kbfuse
Disabled both and restarted the mac.

Crashes persisting:
https://crash-stats.mozilla.org/report/index/c73a6906-bc9d-436f-bf24-efa510200523

Thanks for testing without those kernel extensions, Steve. It's conceivable the osAllocMem() weirdness (comment 285) is a bug or design flaw in dtrace itself. But in any case I think I've gone about as far as I can understanding the errors you (sometimes) see in the NVDAResman kernel extension. Now I'm going to explore more carefully your output from comment 286 (especially its errors in IOAccelVidMemory::wire()).

macOS 10.15.5 (19F96), Firefox 76.0.1 still crashing
https://crash-stats.mozilla.org/report/index/01af10ac-91e7-49b7-8ccd-81d690200529
Considering that Apple stated in a call that they do not care about crashes in macOS on old hardware, and that they probably see this crash on old hardware as a nice incentive to nudge users to buy new hardware, I'm not really surprised the issue was not addressed in 10.15.5.

(Following up comment 261)

I just discovered that Apple's latest macOS update, macOS 10.15.5 Supplemental Update, breaks my PatchBug1576767. If you want to test with it, don't apply this update.

I really do want people to test with it, though I haven't yet heard from anyone who has. I'm currently trying to find if there's a way around what Apple did to cause the breakage.

(Following up comment 293)

Turns out the problem was caused by a bug in PatchBug1576767. I just landed a patch that fixes it. Please, please, please try it out, those of you who regularly see this bug's crashes on AMDRadeon video hardware.

I still haven't heard from anyone testing with PatchBug1576767. But I've grown tired of waiting, so I submitted the following bug report to Apple. For what it's worth, the "feedback number" is FB7875887. I'll post updates on Apple's responses.

Crash bug in macOS Catalina AMDRadeonXN000 kernel extension

macOS Catalina's AMDRadeonX4000, AMDRadeonX5000 and AMDRadeonX6000
kernel extensions have a bug that can cause crashes in an application
that uses graphics acceleration. The bug is in code that manages GPU
accelerator memory, specifically in the following two methods:

    bool
    AMDRadeonX4000_AMDHWVMContext::mapVA(vm_address_t startAddress,
                                         IOAccelMemory* arg2,
                                         vm_size_t arg3, vm_size_t length,
                                         AMDRadeonX4000_IAMDHWVMM::VmMapFlags arg5);

    bool
    AMDRadeonX4000_AMDHWVMContext::unmapVA(vm_address_t startAddress,
                                           vm_size_t length);

The C++ code for the start of both functions can be reconstructed as follows:

    if (startAddress < vmRangeStart) {
      return false;
    }
    if (startAddress + length >= vmRangeEnd) {
      return false;
    }

The bug is in the fourth line, which should be:

    if (startAddress + length > vmRangeEnd) {

The mapVA() method maps a buffer into GPU accelerator memory. The
unmapVA() method unmaps it. These six lines of code do an error return
if the buffer won't fit into the available range of GPU memory. The
bug is triggered when mapVA() is called to map a buffer into the very
end of the range. The buffer would fit just fine. But the bug in the
fourth line triggers an error return. Since this error return is
unexpected, it triggers a cascade of other errors that finishes with a
call to IOAccelContext2::setContextError(0xfffffff9/-7). This in turn
causes the application to crash at gpusGenerateCrashLog.cold.1() (in
libGPUSupportMercury.dylib).

Since the bug is very simple and straightforward, I've created a
kernel extension that patches it in running instances of the
AMDRadeonX4000, AMDRadeonX5000 and AMDRadeonX6000 kernel extensions. In
my tests it fixes the crashes.

https://github.com/steven-michaud/PatchBug1576767

For some reason, Firefox is much more prone to these crashes than
other applications. That's how they came to my attention. I've been
working on the crashes at
https://bugzilla.mozilla.org/show_bug.cgi?id=1576767. More information
is available there.
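The boundary condition described in the report above can be sketched as two versions of the same range check. This is an illustration only: `VmRange`, `fits_buggy` and `fits_fixed` are hypothetical names, the range end is treated as exclusive, and address overflow is ignored. Nothing here is Apple's actual code beyond the comparison reconstructed in the report.

```cpp
#include <cstdint>

// Hypothetical stand-in for the driver's GPU virtual address range.
// `end` is exclusive: the last usable byte is at end - 1.
struct VmRange {
  std::uint64_t start;
  std::uint64_t end;
};

// The check as reconstructed in the report: rejects a buffer that ends
// exactly at the end of the range, even though it fits.
bool fits_buggy(const VmRange& range, std::uint64_t startAddress,
                std::uint64_t length) {
  if (startAddress < range.start) {
    return false;
  }
  if (startAddress + length >= range.end) {
    return false;
  }
  return true;
}

// The proposed fix: only reject a buffer that extends past the end.
bool fits_fixed(const VmRange& range, std::uint64_t startAddress,
                std::uint64_t length) {
  if (startAddress < range.start) {
    return false;
  }
  if (startAddress + length > range.end) {
    return false;
  }
  return true;
}
```

As a usage example, take the mapVA() call logged earlier in this bug (start 0x23ffc00000, length 4194304). Start plus length is 0x2400000000, so if the range ends there (an assumption, the real value of vmRangeEnd isn't shown in the trace), `fits_buggy` returns false for that call while `fits_fixed` returns true.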

FYI it looks like https://youtrack.jetbrains.com/issue/JBR-2474 is tracking the same issue (0xfffffff9/-7) and a few people report that 10.15.6 Beta releases appeared to finally contain some kind of fix.

(BTW absolutely amazing and inspiring debugging work @Steven!)

The IDEA bug seems related, but I suspect it's not the same as the one reported here.

One crash report does have the "graphics kernel error" 0xfffffff9/-7. And the "victims" do often (apparently) have external monitors.

But all the other crash reports have the "graphics kernel error" 0xfffffffc/-4 (those that have a "graphics kernel error" at all). The IDEA crashes seem to happen when the computer goes to sleep, or when it's sleeping. But the Firefox crashes have no observable pattern. The IDEA crashes seem to have started (or greatly increased) with macOS 10.15.5. But the number of Firefox crashes in gpusGenerateCrashLog.cold.1 on macOS 10.15.5 over the last six months is actually considerably lower than the number of crashes on 10.15.4.

And finally, macOS 10.15.6 doesn't fix the bug I reported to Apple. PatchBug1576767 still applies its patch (which its sanity checks would prevent it from doing if the bug had been fixed).

macOS 10.15.6 was just released (in the last couple of days). You'll be able to observe whether this "fixes" the IDEA crashes. I suspect it will have no effect on the Firefox crashes. We should be able to tell for sure in the next week or so.

(BTW absolutely amazing and inspiring debugging work @Steven!)

Thanks! :-)

I'm currently trying to reverse engineer what Apple does with the "sideband buffer" and "command buffers" (what gets "submitted" by gpusSubmitDataBuffers()) -- on both the user side and the kernel side. But the code is fiercely complex, and my progress has been slow. I've been working on a kernel extension dedicated to this task, with the same functionality as HookCase. (A general purpose HookCase for kernel mode would be far too dangerous.) I've also added watchpoint support to HookCase and this new kernel extension. Once I'm done with this work (6 months from now?), I should have a much better understanding of how crashes in gpusGenerateCrashLog() can happen. Already, though, I'm pretty sure they can have many different, unrelated causes.

By the way, I've had no response at all from Apple to my bug report. This isn't necessarily a bad sign. In my experience, Apple tends to say as little as possible in any even semi-public forum. And I've done at least one other bug report where the only response was (after a long delay) "this bug is fixed in the such-and-such release".

(Following up comment 297)

But all the other crash reports have the "graphics kernel error" 0xfffffffc/-4 (those that have a "graphics kernel error" at all).

0xfffffffc/-4 or 0xfffffffd/-3

Steve, a couple of the IDEA crash reports from comment 296 were on GeForce hardware. So it's conceivable that macOS 10.15.6 will "fix" your crashes. Please let us know one way or the other.

Congrats to comment #300 🙉

10.15.6 and 78.0.2 tested, and the entire system ended up with graphics glitches and hanging streaming video. Disconnecting the external monitor did not help; a restart was required. Sadly no crash report appeared for Firefox.
As soon as the external monitor is connected, overall UI interaction lags. This is everything but fixed in 10.15.6.

Sounds like your experience is still pretty miserable, Steve, but at least your Firefox crashes seem to be fixed.

Thanks for letting us know.

(Following up comment #297)

macOS 10.15.6 was just released (in the last couple of days). You'll be able to observe whether this "fixes" the IDEA crashes. I suspect it will have no effect on the Firefox crashes. We should be able to tell for sure in the next week or so.

In the last week there have been 154 of this bug's crashes (in gpusGenerateCrashLog.cold.1) on macOS 10.15.6, and 310 of them on 10.15.5. So 10.15.6 hasn't fixed this bug (these crashes in Firefox), and it probably hasn't made any difference at all.

Would be interesting to learn how macOS Big Sur behaves. Anybody affected by this problem running the developer or public beta and able to give it a go?

I can only find three of this bug's crashes on BigSur (macOS 11/10.16) over the last six months:

https://crash-stats.mozilla.org/search/?platform_version=~10.16&proto_signature=~fFlush&date=%3E%3D2020-02-12T16%3A20%3A00.000Z&date=%3C2020-08-12T16%3A20%3A00.000Z&_facets=signature&_sort=-date&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-signature

So Apple might have fixed this bug in BigSur, or at least done something to alleviate it.

I still don't have my own copy of the BigSur beta, so I can't check if Apple fixed the bug I reported in comment #295.

Apple still hasn't commented on or responded to my bug report.

Crash Signature: [@ __pthread_kill | abort | gpusGenerateCrashLog.cold.1 ] [@ abort | gpusGenerateCrashLog.cold.1 ] [@ libsystem_kernel.dylib@0x744e] [@ libsystem_kernel.dylib@0x747a] [@ libsystem_kernel.dylib@0x72aa] [@ libsystem_kernel.dylib@0x1cb66 ] → [@ __pthread_kill | abort | gpusGenerateCrashLog.cold.1 ] [@ abort | gpusGenerateCrashLog.cold.1 ] [@ libsystem_kernel.dylib@0x744e] [@ libsystem_kernel.dylib@0x747a] [@ libsystem_kernel.dylib@0x72aa] [@ libsystem_kernel.dylib@0x1cb66 ] [@ libsystem…

I didn't intend to change the status flags. But they don't seem to be incorrect, so I'll leave them as is.

Keep in mind that some of the old Macs which are affected by this Apple bug will not be able to install Big Sur. The late 2012 iMac I am seeing the crashes with is one of those machines. So decreased crash numbers for macOS 10.16 / 11 would be somewhat expected, and are not necessarily an indicator that Apple has done anything about the problem.

(Following up comment #306)

I just installed the current BigSur beta (build 20A5354i, beta5), and, no, Apple has not fixed the bug I reported in comment #295. Still no response from them on my bug report.

Crash Signature: libsystem_kernel.dylib@0x7812 ] → libsystem_kernel.dylib@0x7812 ] [@ libsystem_kernel.dylib@0x7842 ] [@ libsystem_kernel.dylib@0x7552 ]

The number of this bug's crashes on BigSur is starting to pick up:

https://crash-stats.mozilla.org/search/?platform_version=~10.16&proto_signature=~fFlush&date=%3E%3D2020-02-27T20%3A25%3A00.000Z&date=%3C2020-08-27T20%3A25%3A00.000Z&_facets=signature&_sort=-date&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-signature

I've started scraping symbols on BigSur, starting with Beta5. I'm sending them to Marco Castelluccio, who's uploading them to the symbol server. I'll keep doing this as new betas are released. Within a week or so, most of the BigSur crash stacks at https://crash-stats.mozilla.org/ should be fully symbolicated.

(Following up comment #310)

Oops, spoke too soon. Big Sur crash reports still aren't getting fully symbolicated. I've opened bug 1661771.

(Following up comment #309)

I just installed the current BigSur beta (build 20A5354i, beta5), and, no, Apple has not fixed the bug I reported in comment #295.

Still unfixed in Beta 6 (build 20A5364e).

Still no response from them on my bug report.

Ditto.

Haik, I understand from bug 1663467 comment 7 that Mozilla has an Apple contact to help with Apple bugs, or other bad interactions with Firefox. Could you mention this bug to them? At the very least it'd be nice to know if they're working on it. My Apple bug report is FB7875887.

Flags: needinfo?(haftandilian)

(In reply to Steven Michaud [:smichaud] (Retired) from comment #313)

Haik, I understand from bug 1663467 comment 7 that Mozilla has an Apple contact to help with Apple bugs, or other bad interactions with Firefox. Could you mention this bug to them? At the very least it'd be nice to know if they're working on it. My Apple bug report is FB7875887.

Done. I referenced your feedback report and stressed that you've included the root cause and a proposed fix. And to echo comment 296, absolutely inspiring debugging work and perseverance on your part!

Flags: needinfo?(haftandilian)

I can't, which isn't terribly surprising.

So tell us more about your machine. Particularly how many and what kinds of external monitors you have, and whether you go a long time between reboots. (Closing your laptop's shell, if that's what you have, doesn't count -- that just puts the machine to sleep.)

Do the crashes go away for a while after you reboot your computer?

Please try my workaround at https://github.com/steven-michaud/PatchBug1576767.

(In reply to Steven Michaud [:smichaud] (Retired) from comment #316)

I can't, which isn't terribly surprising.

So tell us more about your machine. Particularly how many and what kinds of external monitors you have, and whether you go a long time between reboots. (Closing your laptop's shell, if that's what you have, doesn't count -- that just puts the machine to sleep.)

I have a MacBook Pro (13-inch, 2017) - Intel Iris Plus Graphics 650. I have a second LG FHD monitor connected with a TypeC-> HDMI cable. I reboot the computer when I change between OS's (I have macOS 11 and macOS 10.13 installed as well)

Do the crashes go away for a while after you reboot your computer?

I restarted the computer now and visited that link and got an instant crash.

Please try my workaround at https://github.com/steven-michaud/PatchBug1576767.

I will try to use the workaround ASAP and get back here with the results. Thank you!

Thanks for the info.

Please try my workaround at https://github.com/steven-michaud/PatchBug1576767.

I will try to use the workaround ASAP and get back here with the results. Thank you!

I forgot to mention that my workaround is hardware specific -- it requires AMD Radeon graphics hardware. That's where the vast majority of this bug's crashes happen, though a few also happen on Intel graphics hardware.

You say your MacBook Pro has Intel Iris Plus Graphics 650 hardware. But I thought the whole MacBook Pro line has two kinds of graphics hardware -- one "integrated" and one "discrete". Is that also true of yours?

Please post a few of your crash reports. It should be clear from them what kind of graphics hardware the crashes are happening on.

Your crashes are happening on Intel graphics hardware. And I've since noticed that the 13" line of "4th generation" MacBook Pros only have "integrated" video hardware. So you can't use my workaround at https://github.com/steven-michaud/PatchBug1576767. That's really too bad :-(

But since you can reproduce the crashes so easily, I'll try to work up some tests (and test builds) for you to run, like I did for Eric Shepherd and Steve.

Interestingly, I also crash on your link from comment 315 on Intel graphics hardware (bp-63700ccb-8b6f-47ed-9312-50d7d0200914), but it's not the same crash. I don't crash at all on AMD Radeon graphics hardware.

My crash report from comment 320 was spurious -- almost completely wrong. But I did a custom build (of current mozilla-central code) and managed to get a crash stack from lldb. It's very similar to your reports, Alexandru, though it's in a secondary process. (Yours is in the main process.)

Now that I can reproduce the Intel crashes (after a fashion), I'm going to start digging into them as best I can. But I've got other things I need to be doing now (most importantly an update to HookCase for macOS 11). So I'm not sure how much time I'm going to be able to spend on this Intel stuff.

I'm making some progress on the Intel crashes using dtrace probes. The proximate cause seems to be that kernel-mode graphics driver code is trying to access a "resource" after it's been deleted. The "resource" is being deleted from user mode, so this might be a Firefox bug. It may be tricky to find out for sure, though. Dtrace is being a pain -- I can't get it to symbolicate user-mode stacks, though it's supposed to (and often does).

I've been testing with a local build without symbols stripped. Next I'll try with a tryserver build with the same characteristics.

The lack of user-mode symbolication was due to a design flaw in dtrace (at least Apple's dtrace). You need special handwaving to get it to symbolicate stack traces made in child processes: Use dtrace with the -p option, repeated with the pid for every plugin-container process that might crash:

    sudo dtrace -p [pid1] -p [pid2] -p [pid3] -p [pid4] ... -s bugzilla1576767.d

This symbolicated trace shows, I think, that the Intel crashes are caused by a bug in one of Apple's user-mode graphics drivers -- probably AppleIntelHD5000GraphicsGLDriver, but possibly GLEngine.

Attachment #9175659 - Attachment is obsolete: true

It's interesting that the resource that's destroyed prematurely has resource id 10 in both cases. That id might have a special meaning to Apple's kernel mode Intel graphics drivers, and the resource itself might be a special kind.

Fix a mistake in dtrace script.

Attachment #9175679 - Attachment is obsolete: true

What I used to generate this output.

Here are a few crash ids generated by the try build with non-stripped symbols that I've just been testing with. They're much better than the crash report from comment #320, but still lower quality than the lldb crash stack from comment #321:

bp-cdf3a47c-32ca-40f1-bbe8-b860e0200915
bp-6f3f8d0a-d4dc-4ac7-972f-d645a0200915
bp-7020e67f-a26a-43bc-88e5-940590200915

Alexandru, I have another question for you:

What version of macOS 11 do you have (which beta and build)? And when running it do you crash on the link from comment 315?

Flags: needinfo?(alexandru.trif)
Flags: needinfo?(the.sheppy)
Flags: needinfo?(sorin.sbarnea)

(In reply to Steven Michaud [:smichaud] (Retired) from comment #329)

Alexandru, I have another question for you:

What version of macOS 11 do you have (which beta and build)? And when running it do you crash on the link from comment 315?

Yes, I do get the same tab crash when I visit the link from comment 315. The tab crashes every time the link is visited on macOS 11.0 Beta (20A5364e).
Links to crashes:
https://crash-stats.mozilla.org/report/index/16b09d32-e0a2-4b1e-ae01-eeb3e0200916#tab-details
https://crash-stats.mozilla.org/report/index/458ead59-7379-41f6-b6f7-5e9180200916#tab-details

If anything else is needed please let me know!

Flags: needinfo?(alexandru.trif)

Interesting, and thanks. Your crashes on macOS 11 are content process crashes, like mine on macOS 10.15.6, but unlike the 10.15.6 crashes you reported in comment 319.

My main reason for asking was to find out if Apple had fixed this bug in its Big Sur betas -- apparently not.

Crash Signature: libsystem_kernel.dylib@0x7812 ] [@ libsystem_kernel.dylib@0x7842 ] [@ libsystem_kernel.dylib@0x7552 ] → libsystem_kernel.dylib@0x7812 ] [@ libsystem_kernel.dylib@0x7842 ] [@ libsystem_kernel.dylib@0x7552 ] [@ __pthread_kill | libsystem_c.dylib@0x8073f]

After digging around for the last few days, here's my understanding of what causes this bug's crashes on Intel graphics hardware:

The "sideband buffer" is used by user-mode graphics drivers to schedule tasks in kernel-mode graphics drivers. This buffer fills up with "tokens", each of which corresponds to a single task. Periodically the tasks are all performed in a batch, and the sideband buffer is (temporarily) cleared. This is what happens on each call to libGPUSupportMercury.dylib-gpusSubmitDataBuffers() from user mode.

Each task uses certain resources, which need to be present for it to succeed. In the case of this bug's crashes on Intel hardware, a Blit2D token is added to the sideband buffer by a user-mode graphics driver. But before the kernel-mode driver can process the token (in a call to AppleIntelHD5000Graphics-IGAccelGLContext::process_token_Blit2D()), one of the resources required to perform the Blit2D task is deleted, as the result of a call to mContext->gl->fDeleteTextures() in XUL-mozilla::WebGLTexture::~WebGLTexture(). This also triggers a call to libGPUSupportMercury.dylib-gpusSubmitDataBuffers(), which tries to process the Blit2D token. But it only happens after the resource has been deleted. So the call to AppleIntelHD5000Graphics-IGAccelGLContext::process_token_Blit2D() fails, causing the kernel-mode driver to set the "context error" to fffffffe/-2. This in turn causes the user-mode call to fail in __pthread_kill | abort | gpusGenerateCrashLog.cold.1.

This kind of error should really be cleaned up by graphics driver code, either in user mode or in kernel mode. That doesn't happen in this case. But the problem is basically very simple. So we can "help" the driver around it by explicitly triggering a call to libGPUSupportMercury.dylib-gpusSubmitDataBuffers() (via a call to mContext->gl->fFlush()) before the call to mContext->gl->fDeleteTextures().
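
To make the ordering problem concrete, here is a minimal sketch of the sequence described above. It is a toy model only, not Apple's driver code and not the actual Firefox patch: ModelContext, Blit2DToken, runScenario and checkScenarios are hypothetical names invented for illustration. It assumes, as described above, that deleting a texture removes the resource first and only then triggers a submit of the sideband buffer, and that a token whose resource is missing sets the context error to -2.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical model of the behavior described in this comment.
// None of these names exist in Apple's drivers or in Firefox.
constexpr int kContextErrorMissingResource = -2;  // fffffffe in the driver

struct Blit2DToken {
  std::uint32_t resourceId;  // resource the queued blit needs at submit time
};

class ModelContext {
 public:
  std::uint32_t createTexture() {
    std::uint32_t id = nextId_++;
    live_.insert(id);
    return id;
  }

  // A user-mode driver adds a Blit2D token to the sideband buffer.
  void queueBlit2D(std::uint32_t id) { sideband_.push_back(Blit2DToken{id}); }

  // Stands in for gpusSubmitDataBuffers(): process every queued token in a
  // batch, then clear the sideband buffer.
  void submit() {
    for (const Blit2DToken& token : sideband_) {
      if (live_.count(token.resourceId) == 0) {
        contextError_ = kContextErrorMissingResource;
      }
    }
    sideband_.clear();
  }

  // Stands in for fFlush(): force a submit now.
  void flush() { submit(); }

  // Stands in for fDeleteTextures() as observed: the resource is removed
  // first, and only afterwards are the pending tokens submitted.
  void deleteTexture(std::uint32_t id) {
    live_.erase(id);
    submit();
  }

  int contextError() const { return contextError_; }

 private:
  std::uint32_t nextId_ = 1;
  std::set<std::uint32_t> live_;
  std::vector<Blit2DToken> sideband_;
  int contextError_ = 0;
};

// Returns the context error after queueing a blit and deleting its texture.
int runScenario(bool flushBeforeDelete) {
  ModelContext ctx;
  std::uint32_t tex = ctx.createTexture();
  ctx.queueBlit2D(tex);
  if (flushBeforeDelete) {
    ctx.flush();  // the workaround: drain pending tokens while tex still exists
  }
  ctx.deleteTexture(tex);
  return ctx.contextError();
}

// Usage: the unpatched order fails, the flush-first order succeeds.
inline void checkScenarios() {
  assert(runScenario(false) == kContextErrorMissingResource);
  assert(runScenario(true) == 0);
}
```

In this model the flush costs one extra submit per texture deletion, which is the trade-off the workaround accepts in exchange for never leaving a token pointing at a deleted resource.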

I have a patch for this, which I've used to do a tryserver build:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=dfa3798acb87dbc73907c957221cc7d89f5d5680
https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/f4R6vV1ARkmgaBnWyjWi2A/runs/0/artifacts/public/build/target.dmg

It fixes the crashes in my tests. Please try it, Alexandru, and let us know your results.

Crash Signature: libsystem_kernel.dylib@0x7812 ] [@ libsystem_kernel.dylib@0x7842 ] [@ libsystem_kernel.dylib@0x7552 ] [@ __pthread_kill | libsystem_c.dylib@0x8073f] → libsystem_kernel.dylib@0x7812 ] [@ libsystem_kernel.dylib@0x7842 ] [@ libsystem_kernel.dylib@0x7552 ] [@ __pthread_kill | libsystem_c.dylib@0x8073f]

Testing with current mozilla-central nightlies, I crash with the link from comment 315 on all versions of macOS (on Intel graphics hardware) going back to 10.12 (always the latest minor version of each major version). My patch from comment 333 fixes my crashes on all of them. (These crashes are all content process crashes -- aka tab crashes.) So Apple's bug goes at least that far back. And so my patch will fix a significant number of crashes.

I didn't test on OS X 10.11 or earlier. I'm also not yet able to test on Intel hardware on macOS 11.

Thank you very much, Alexandru, for your STR from comment 315! I would never have found it myself, except by accident. And without it I would never have been able to crack this bug (on Intel hardware).

By the way, and to make sure people understand:

Apple's Intel graphics driver bug, which my patch from comment 333 works around, is not the same as their AMD Radeon graphics driver bug that I reported to them in comment 295. As best I can tell the two bugs are completely unrelated. They just share the same signature.

The AMD Radeon bug would be much harder to work around. So I've relied on Apple to fix it. Judging by their lack of response to my report, and of any sign of their being willing to fix their bug, that may have been a mistake. If a few more months go by and Apple still hasn't fixed it, I'll start trying to look for a workaround.

(In reply to Steven Michaud [:smichaud] (Retired) from comment #333)

I have a patch for this, which I've used to do a tryserver build:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=dfa3798acb87dbc73907c957221cc7d89f5d5680
https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/f4R6vV1ARkmgaBnWyjWi2A/runs/0/artifacts/public/build/target.dmg

It fixes the crashes in my tests. Please try it, Alexandru, and let us know your results.

I'm confirming that the tab is not crashing anymore on macOS 10.15.6 and macOS 11 Beta (20A5364e) when visiting the address from comment 315 with the build from this link: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/f4R6vV1ARkmgaBnWyjWi2A/runs/0/artifacts/public/build/target.dmg.
Testing was performed on the same machine, where the crash still occurs on the affected builds.

Thank you very much, Alexandru, for your STR from comment 315! I would never have found it myself, except by accident. And without it I would never have been able to crack this bug (on Intel hardware).

I'm glad that I could help here. Thank you as well.

I've opened bug 1666293 to deal with the Apple bug that Alexandru found STR for.

Depends on: 1666293
No longer depends on: 1576968
Assignee: mstange.moz → smichaud

Steven: Thanks for continuing your work on this bug.
Tried 83.0a1 (2020-09-22) (64-Bit). Unsure if that build would have your patch for #1666293
10.15.6 Supplement Update
Still crashing: https://crash-stats.mozilla.org/report/index/898780d9-b179-40b0-9560-ab4f90200922
NVIDIA GeForce GTX 660M 512 MB, 2,9 GHz Quad-Core Intel Core i5

Visiting https://www.khronos.org/registry/webgl/sdk/tests/conformance/textures/misc/texture-copying-and-deletion.html?webglVersion=1&quiet=0&quick=1 does not trigger a crash.

There are generally two mozilla-central nightlies released every day. But as I say in bug 1666293 comment 9, the "first" one already has my patch for that bug. (The "second" hasn't yet been released.)

Since you don't have Intel graphics hardware, I don't expect that my patch for bug 1666293 will make any difference to you. To find out if it does, you need to also test with builds that don't contain my patch -- any mozilla-central nightly with an earlier date than 2020-09-22. If you crash on those builds (at https://www.khronos.org/registry/webgl/sdk/tests/conformance/textures/misc/texture-copying-and-deletion.html?webglVersion=1&quiet=0&quick=1), then my patch does work on your system (with its NVIDIA GeForce graphics hardware), though it apparently doesn't fix all your crashes.

Right, the link crashes neither Firefox 81 nor the 83 nightly. So it's not useful as a test case for Nvidia hardware.

Crash Signature: [@ __pthread_kill | abort | gpusGenerateCrashLog.cold.1 ] [@ abort | gpusGenerateCrashLog.cold.1 ] [@ libsystem_kernel.dylib@0x744e] [@ libsystem_kernel.dylib@0x747a] [@ libsystem_kernel.dylib@0x72aa] [@ libsystem_kernel.dylib@0x1cb66 ] [@ libsystem… → [@ __pthread_kill | abort | gpusGenerateCrashLog.cold.1 ] [@ abort | gpusGenerateCrashLog.cold.1 ] [@ libsystem_kernel.dylib@0x744e] [@ libsystem_kernel.dylib@0x7462] [@ libsystem_kernel.dylib@0x747a] [@ libsystem_kernel.dylib@0x72aa] [@ libsystem_k…
Summary: [10.15] Crash in [@ libsystem_kernel.dylib@0x744e] in mozilla::gl::GLContextCGL::SwapBuffers() → [10.15+11] Crash in [@ libsystem_kernel.dylib@0x744e] in mozilla::gl::GLContextCGL::SwapBuffers()

(Following up comment 295)

At some point in the last month or two, the status of my FB7875887 bug report to Apple has changed from "open" to "Potential fix identified - For a future OS update". I haven't checked the status very often lately, so I don't know exactly when this happened.

(Following up comment 341)

Someone just pointed out to me that Apple seems to have fixed the AMDRadeon bug in macOS 11.1:

https://github.com/steven-michaud/PatchBug1576767/issues/2#issuecomment-753017240

(Following up comment 342)

The status of my FB7875887 bug report to Apple has now changed to "Potential fix identified - In macOS 11.1".

For what it's worth, over the last month there has only been one [@ __pthread_kill | abort | gpusGenerateCrashLog.cold.1 ] crash on macOS 11.1, and that's on Intel video hardware: bp-efe464eb-5f6a-4907-8457-46aa40210104.

So, as of macOS 11.1, this bug does really seem to be fixed on AMD Radeon video hardware.

https://crash-stats.mozilla.org/search/?platform_version=~11.1.0&signature=~gpusGenerateCrashLog&date=%3E%3D2020-12-18T20%3A28%3A00.000Z&date=%3C2021-01-18T20%3A28%3A00.000Z&_facets=signature&_sort=-date&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-signature

(Following up comment 344)

Oops, I was wrong. At least I think so.

As best I can tell, almost all AMD64 crash logs on macOS 11.1 report "10.16.0 20C69" as the "platform version". "20C69" is the build id for the release version of macOS 11.1. So it's likely that all of these happened on macOS 11.1. And there appear to be lots of [@ __pthread_kill | abort | gpusGenerateCrashLog.cold.1 ] crashes on AMD Radeon hardware:

https://crash-stats.mozilla.org/search/?platform_version=~20C69&proto_signature=~glrATI_Hwl_SubmitPacketsWithToken&signature=~gpusGenerateCrashLog.&platform=Mac%20OS%20X&date=%3E%3D2020-12-24T22%3A29%3A00.000Z&date=%3C2021-01-24T22%3A29%3A00.000Z&_facets=signature&_facets=platform_version&_facets=cpu_arch&_sort=-date&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-signature

I can't explain this. It makes me wonder if the "10.16.0 20C69" crashes are actually on 11.0.1, after all.

I opened a new bug on the incorrect "platform version" numbers: bug 1690604.

(Following up comment 295)

As I noted in comment #342, Apple seems to have fixed this issue as of macOS 11.1. But just today Apple commented for the first time on my original bug report, saying that "there are changes in the latest update, build 20E5196f (Big Sur 11.3 Beta 3), that may have resolved this issue". So macOS 11.3, when it's released, may have an effect on this bug's crashes.

See Also: → 1577886
Severity: critical → S2

As of today, it's now possible to search on mac crash info at https://crash-stats.mozilla.org/!

For now only crashes on the 90 branch can contain this data. And the "database" of searchable crashes is small -- it starts from today. But over time Mozilla should accumulate information that is likely to be very helpful in this bug, and in others whose underlying causes are low-level Apple bugs.

For more information see bug 1577886 and bug 1709658.

I've created some new bug reports that use mac_crash_info data, and I've made them all depend on a meta bug: bug 1711944.

It looks like Apple, as of macOS 11.4, is paying more attention to crashes at gpusGenerateCrashLog.cold.1 on AMDRadeon graphics hardware: bug 1713230.

Assignee: smichaud → nobody
Crash Signature: [@ __pthread_kill | abort | gpusGenerateCrashLog.cold.1 ] [@ abort | gpusGenerateCrashLog.cold.1 ] [@ libsystem_kernel.dylib@0x744e] [@ libsystem_kernel.dylib@0x7462] [@ libsystem_kernel.dylib@0x747a] [@ libsystem_kernel.dylib@0x72aa] [@ libsystem_k… → [@ __pthread_kill | abort | gpusGenerateCrashLog.cold.1 ] [@ abort | gpusGenerateCrashLog.cold.1 ] [@ __pthread_kill | pthread_kill ] [@ libsystem_kernel.dylib@0x744e] [@ libsystem_kernel.dylib@0x7462] [@ libsystem_kernel.dylib@0x747a] [@ libsystem_…
Crash Signature: [@ __pthread_kill | abort | gpusGenerateCrashLog.cold.1 ] [@ abort | gpusGenerateCrashLog.cold.1 ] [@ __pthread_kill | pthread_kill ] [@ libsystem_kernel.dylib@0x744e] [@ libsystem_kernel.dylib@0x7462] [@ libsystem_kernel.dylib@0x747a] [@ libsystem_… → [@ __pthread_kill | abort | gpusGenerateCrashLog.cold.1 ] [@ __pthread_kill | pthread_kill | abort | gpusGenerateCrashLog.cold.1 ] [@ abort | gpusGenerateCrashLog.cold.1 ] [@ __pthread_kill | pthread_kill ] [@ libsystem_kernel.dylib@0x744e] [@ libsys…

Crashes down from >1k to 200 in 98.0.1 so far.
Let's take another look in triage and see if we know anything more now than before.

Blocks: gfx-triage
Severity: S2 → S3
Component: Graphics: Layers → Graphics

Note that the fall in volume here is probably entirely due to Apple having fixed bug 1738289.

Using only signatures to measure this kind of crash is a very blunt instrument. Unfortunately we don't (as yet) have any better alternative. For that we'd need to look at the contents of mac crash info.

No longer blocks: gfx-triage
Priority: P2 → P3

Closing because no crashes reported for 12 weeks.

Status: NEW → RESOLVED
Closed: 4 months ago
Resolution: --- → WORKSFORME

This bug still has crashes. Socorro changed how unsymbolicated signatures are recorded.

Status: RESOLVED → REOPENED
Crash Signature: __pthread_kill | libsystem_c.dylib@0x8073f] → libsystem_kernel.dylib ] [@ __pthread_kill | libsystem_c.dylib@0x8073f]
Resolution: WORKSFORME → ---