Open Bug 1996653 Opened 1 month ago Updated 19 days ago

Mac GPU process: Crash in [@ IPCError-browser | GPUProcessKill]

Categories

(Core :: Graphics, defect, P2)

Unspecified
macOS
defect

Tracking

()

Tracking Status
firefox145 --- affected
firefox146 --- affected
firefox147 --- affected

People

(Reporter: aleiserson, Assigned: bradwerth)

References

Details

(Keywords: topcrash, topcrash-startup)

Crash Data

It is tricky to capture the scope of this issue because GPUProcessKill is a common crash signature with multiple causes. Crash stats shows 113 GPUProcessKill crashes in the past week on macOS (https://crash-stats.mozilla.org/search/?product=Firefox&platform=Mac%20OS%20X&process_type=gpu&date=>%3D2025-10-20T16%3A35%3A00.000Z&date=<2025-10-27T16%3A35%3A00.000Z&_facets=signature). Note that the big uptick in the crash data graph is on Android, and is associated with bug 1900134 and bug 1908798.

Looking at a sampling of the reports (and clicking "Show other threads" to see all the threads), in most of them the Renderer thread appears to be actively working, with no consistency in exactly what it is doing. I did find one report where that is not the case: https://crash-stats.mozilla.org/report/index/a868da07-ebd8-4ce9-9674-f693c0251022

If I understand correctly, "GPUProcessKill" is only emitted when a minidump is generated. As best I can tell, there are only two call sites that request a minidump while killing the GPU process:

  1. CompositorManagerChild::ShouldContinueFromReplyTimeout()
  2. UiCompositorControllerChild::SetReplyTimeout()

I'll check the crash reports and see if there's some correlation to these callsites.

Here is another view that may be useful: https://crash-stats.mozilla.org/search/?signature=%3DIPCError-browser%20%7C%20GPUProcessKill&product=Firefox&platform=Mac%20OS%20X&process_type=gpu&date=%3E%3D2025-10-20T16%3A35%3A00.000Z&date=%3C2025-10-27T19%3A07%3A00.000Z&_facets=signature&_sort=-date&_columns=date&_columns=version&_columns=build_id&_columns=graphics_critical_error#crash-reports

The common thread seems to be "Killing GPU process due to IPC reply timeout". (I noted this earlier but then decided that maybe this was inherently associated with GPUProcessKill and didn't mention it in the description.) This is why I was trying to figure out where things got stuck and noted that the renderer thread seemed to still be working.

I don't know what the timeout is or what the considerations are in setting it. Possibly it could be adjusted? I figured that someone more knowledgeable about graphics crashes than I am might have better intuition about what could be going on here and what strategies make sense to narrow down the problem.

I see. I wonder if actor destruction can lead to an IPC timeout somehow. Suppose one side of the bridge is alive and CanSend() is true, a sync message(?) is sent, and the receiving side dies without processing it. Actually, that seems like the sort of thing we would be encountering broadly if it were possible. I'll see if I can come up with a better theory.
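
To make the theory above concrete, here is a toy model of the sender side of a sync message. It is only a sketch under stated assumptions: std::promise/std::future stand in for the IPC channel, and SyncSendResult and WaitForReply are hypothetical names for illustration, not Gecko or IPDL APIs. It says nothing about how the real channel behaves; it only shows that a dropped message turns into a reply timeout in this model when nothing reports the peer's death to the waiting side.

```cpp
#include <cassert>
#include <chrono>
#include <future>

// Hypothetical outcome of a blocking send; not a Gecko type.
enum class SyncSendResult { Replied, PeerGone, TimedOut };

// Toy model of the sender side of a sync message: block until a reply
// arrives, the peer is observed to be gone, or the timeout expires.
inline SyncSendResult WaitForReply(std::future<int>& reply,
                                   std::chrono::milliseconds timeout) {
  assert(reply.valid());
  if (reply.wait_for(timeout) != std::future_status::ready) {
    // Nobody answered and nobody reported a failure: a reply timeout.
    return SyncSendResult::TimedOut;
  }
  try {
    reply.get();
    return SyncSendResult::Replied;
  } catch (const std::future_error&) {
    // The promise was destroyed without a value, so the peer's death
    // was observed and the wait ended early instead of timing out.
    return SyncSendResult::PeerGone;
  }
}
```

In this model, destroying the std::promise (the peer dying in a way the sender can see) yields PeerGone, while keeping the promise alive and never fulfilling it (the message is lost but the channel still looks healthy) yields TimedOut. Only the second case corresponds to the timeout described in the theory.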

See also bug 1900134, the Android version of this crash.

See Also: → 1900134

The bug is linked to a topcrash signature, which matches the following criteria:

  • Top 10 desktop browser crashes on nightly (startup)
  • Top 5 GPU process crashes on release (startup)
  • Top 10 AArch64 and ARM crashes on nightly
  • Top 10 AArch64 and ARM crashes on beta
  • Top 10 AArch64 and ARM crashes on release

:bhood, could you consider increasing the severity of this top-crash bug?

For more information, please visit BugBot documentation.

Flags: needinfo?(bhood)

Re: bugbot topcrash notice, this is a generic crash signature that is almost certainly being produced for multiple reasons. Most of the recent crash volume is on Android, so bug 1900134 or bug 1908798 should be the topcrash bug, if any.

Flags: needinfo?(bhood)

(In reply to Brad Werth [:bradwerth] from comment #3)

I will take this bug and try to make this crash signature less noisy, maybe by pursuing my theory from comment #3.

Assignee: nobody → bwerth