Bug 2017910 Comment 7 Edit History

Note: The actual edited comment in the bug view page will always show the original commenter’s name and original timestamp.

Original comment by

Jens Stutte [:jstutte]

on 2026-02-21 01:20:12 PST

Just an additional note on what I assume we are seeing here: content process crashes that occur close to or during parent shutdown, so that their processing bleeds into or happens entirely during parent shutdown:

> Recapping my understanding of what's actually happening here:
> 
> For child process crashes, the ping submission is triggered immediately when the crash occurs: `CrashService.addCrash()` → minidump analysis → `CrashManager.addCrash()` → `sendGleanPing()` spawns the child process and registers the shutdown blocker. This is the real-time path, not deferred aggregation of old event files.
> 
> Note that [`maybeRunMinidumpAnalyzer`](https://searchfox.org/firefox-main/rev/eb4a741aa4a06a105d7749a718d2503c6622726b/toolkit/components/crashes/CrashService.sys.mjs#21) is skipped when `gQuitting` is true, but the rest of the flow — reading the `.extra` file, computing the hash, spawning the ping child — still runs during shutdown.
> 
> So the likely scenario is: content processes crash close to or during shutdown (not unusual — the parent tears down IPC channels, child processes may crash as a result), the ping child gets spawned atRecapping my understanding of what's actually happening here:
> 
> For child process crashes, the ping submission is triggered immediately when the crash occurs: `CrashService.addCrash()` → minidump analysis → `CrashManager.addCrash()` → `sendGleanPing()` spawns the child process and registers the shutdown blocker. This is the real-time path, not deferred aggregation of old event files.
> 
> Note that [`maybeRunMinidumpAnalyzer`](https://searchfox.org/firefox-main/rev/eb4a741aa4a06a105d7749a718d2503c6622726b/toolkit/components/crashes/CrashService.sys.mjs#21) is skipped when `gQuitting` is true, but the rest of the flow — reading the `.extra` file, computing the hash, spawning the ping child — still runs during shutdown.
> 
> So the likely scenario is: content processes crash close to or during shutdown (not unusual — the parent tears down IPC channels, child processes may crash as a result), the ping child gets spawned at that moment, and it needs to do flock + Glean init + DB write + network upload + Glean shutdown while the parent is counting down the shutdown timeout. That's a race it often loses on slow systems, especially Windows with AV/disk encryption overhead.
> 
> This is consistent with the data: 87% of these hangs have exactly 1 blocker — a single content process crash near shutdown, not a backlog.
> 
> For the crash helper integration, this means the helper should already be running with Glean initialized, so handing off annotations is just an IPC message + `crash.submit()` — no process spawn, no lock acquisition, no Glean init. That should complete well within the shutdown budget.
>  that moment, and it needs to do flock + Glean init + DB write + network upload + Glean shutdown while the parent is counting down the shutdown timeout. That's a race it often loses on slow systems, especially Windows with AV/disk encryption overhead.
> 
> This is consistent with the data: 87% of these hangs have exactly 1 blocker — a single content process crash near shutdown, not a backlog.
> 
> For the crash helper integration, this means the helper should already be running with Glean initialized, so handing off annotations is just an IPC message + `crash.submit()` — no process spawn, no lock acquisition, no Glean init. That should complete well within the shutdown budget.
>

Revision 1 by

Jens Stutte [:jstutte]

on 2026-02-21 01:21:52 PST

Just an additional note on what I assume we are seeing here: content process crashes that occur close to or during parent shutdown, so that their processing bleeds into or happens entirely during parent shutdown:

> Recapping my understanding of what's actually happening here:
> 
> For child process crashes, the ping submission is triggered immediately when the crash occurs: `CrashService.addCrash()` → minidump analysis → `CrashManager.addCrash()` → `sendGleanPing()` spawns the child process and registers the shutdown blocker. This is the real-time path, not deferred aggregation of old event files.
> 
> Note that [`maybeRunMinidumpAnalyzer`](https://searchfox.org/firefox-main/rev/eb4a741aa4a06a105d7749a718d2503c6622726b/toolkit/components/crashes/CrashService.sys.mjs#21) is skipped when `gQuitting` is true, but the rest of the flow — reading the `.extra` file, computing the hash, spawning the ping child — still runs during shutdown.
> 
> So the likely scenario is: content processes crash close to or during shutdown (not unusual — the parent tears down IPC channels, child processes may crash as a result), the ping child gets spawned at that moment, and it needs to do flock + Glean init + DB write + network upload + Glean shutdown while the parent is counting down the shutdown timeout. That's a race it often loses on slow systems, especially Windows with AV/disk encryption overhead.
> 
> This is consistent with the data: 87% of these hangs have exactly 1 blocker — a single content process crash near shutdown, not a backlog.
> 
> For the crash helper integration, this means the helper should already be running with Glean initialized, so handing off annotations is just an IPC message + `crash.submit()` — no process spawn, no lock acquisition, no Glean init. That should complete well within the shutdown budget.
>

Revision 2 by

Jens Stutte [:jstutte]

on 2026-02-21 01:23:20 PST

Just an additional note on what I assume we are seeing here: content process crashes that occur close to or during parent shutdown, so that their processing bleeds into or happens entirely during parent shutdown:

> Recapping my understanding of what's actually happening here:
> 
> For child process crashes, the ping submission is triggered immediately when the crash occurs: `CrashService.addCrash()` → minidump analysis → `CrashManager.addCrash()` → `sendGleanPing()` spawns the child process and registers the shutdown blocker. This is the real-time path, not deferred aggregation of old event files.
> 
> Note that [`maybeRunMinidumpAnalyzer`](https://searchfox.org/firefox-main/rev/eb4a741aa4a06a105d7749a718d2503c6622726b/toolkit/components/crashes/CrashService.sys.mjs#21) is skipped when `gQuitting` is true, but the rest of the flow — reading the `.extra` file, computing the hash, spawning the ping child — still runs during shutdown.
> 
> So the likely scenario is: content processes crash close to or during shutdown (not unusual — the parent tears down IPC channels, child processes may crash as a result), the ping child gets spawned at that moment, and it needs to do flock + Glean init + DB write + network upload + Glean shutdown while the parent is counting down the shutdown timeout. That's a race it often loses on slow systems, especially Windows with AV/disk encryption overhead.
> 
> This is consistent with the data: 87% of these hangs have exactly 1 blocker — a single content process crash near shutdown, not a backlog.

Revision 3 by

Jens Stutte [:jstutte]

on 2026-02-21 01:26:47 PST

Just an additional note on what I assume we are seeing here: content process crashes that occur close to or during parent shutdown, so that their processing bleeds into or happens entirely during parent shutdown:

> Recapping my understanding of what's actually happening here:
> 
> For child process crashes, the ping submission is triggered immediately when the crash occurs: `CrashService.addCrash()` → minidump analysis → `CrashManager.addCrash()` → `sendGleanPing()` spawns the child process and registers the shutdown blocker. This is the real-time path, not deferred aggregation of old event files.
> 
> Note that [`maybeRunMinidumpAnalyzer`](https://searchfox.org/firefox-main/rev/eb4a741aa4a06a105d7749a718d2503c6622726b/toolkit/components/crashes/CrashService.sys.mjs#21) is skipped when `gQuitting` is true, but the rest of the flow — reading the `.extra` file, computing the hash, spawning the ping child — still runs during shutdown.
> 
> So the likely scenario is: content processes crash close to or during shutdown (not unusual — the parent tears down IPC channels, child processes may crash as a result), the ping child gets spawned at that moment, and it needs to do flock + Glean init + DB write + network upload + Glean shutdown while the parent is counting down the shutdown timeout. That's a race it often loses on slow systems, especially Windows with AV/disk encryption overhead.
> 
> This is consistent with the data: 87% of these hangs have exactly 1 blocker — a single content process crash near shutdown, not a backlog.

So I wonder if there is much we can improve in the flow, as most likely the vast majority of crashes are already processed before the parent enters shutdown. This may explain the relatively low numbers here.
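
To make the race from the quoted analysis a bit more concrete, here is a minimal sketch of how one could model it. Everything in it is an assumption for illustration: `fitsInBudget`, the two step lists, the budget, and all millisecond values are made up and come neither from the Firefox tree nor from measurements. Only the step names follow the quoted description (the helper path is the hand-off described in the earlier revisions of this comment).

```javascript
// Hypothetical model of the shutdown race. None of these names or numbers
// exist in Firefox; they only illustrate the reasoning in the quote above.

// Walks the steps in order and reports whether their summed cost stays
// within the budget. Stops at the first step that pushes past it.
function fitsInBudget(steps, budgetMs) {
  let elapsedMs = 0;
  for (const step of steps) {
    elapsedMs += step.ms;
    if (elapsedMs > budgetMs) {
      return { ok: false, elapsedMs, failedAt: step.name };
    }
  }
  return { ok: true, elapsedMs, failedAt: null };
}

// Assumed costs (ms) for the current path: a fresh ping child per crash.
const spawnPath = [
  { name: "process spawn", ms: 300 },
  { name: "flock", ms: 200 },
  { name: "Glean init", ms: 1500 },
  { name: "DB write", ms: 500 },
  { name: "network upload", ms: 3000 },
  { name: "Glean shutdown", ms: 500 },
];

// Assumed costs (ms) for a helper that is already running with Glean set up.
const helperPath = [
  { name: "IPC message", ms: 50 },
  { name: "crash.submit()", ms: 200 },
];

// Made-up shutdown budget, not the real timeout.
const ASSUMED_BUDGET_MS = 5000;

const spawnResult = fitsInBudget(spawnPath, ASSUMED_BUDGET_MS);
const helperResult = fitsInBudget(helperPath, ASSUMED_BUDGET_MS);

console.log(spawnResult);
console.log(helperResult);
```

With these invented numbers the spawn path runs out of budget during the upload step, while the helper path finishes with plenty of room. The point is only the shape of the argument: the fewer sequential steps have to start after the crash, the less the outcome depends on how slow the system is.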
