<a class="header-button" href="https://bugzilla.mozilla.org/home" title="Go to home page"> Bugzilla

Comment 1

•

4 months ago

Timeline lines up with bug 1989439

Flags: needinfo?(chutten) → needinfo?(afranchuk)

Keywords: regression

Regressed by: 1989439

Pascal Chevrel (relman team) -> :pascalc

Assignee

Comment 2

•

4 months ago

Yes, that's from the new crash ping submission flow. I'll think about what may be a better approach. I have considered launching the crash reporter in the background at startup to submit any pending pings; if we do that, it wouldn't be a big deal to simply kill the process in these cases, because it would submit them later. We could also potentially disown the process, too (it is very unlikely to hang around stuck; in these cases the external process is probably just blocked on Glean shutdown, waiting for pings to be sent).

Assignee: nobody → afranchuk

Status: NEW → ASSIGNED

Flags: needinfo?(afranchuk)

Updated

•

4 months ago

tracking-firefox149: ? → +

Assignee

Comment 3

•

4 months ago

FWIW, this is very likely an issue due to a combination of factors, each of which adds latency:

An external program is now used for sending crash pings.
The external program only sends one ping (and then exits).
The external program invokes (at least at first) a firefox background task to use the network stack for sending the ping.
The external program interacts with its own glean store, so it locks the store (only one instance of a program may run at a time).

I speculate that this is getting hit when there are a lot of crashes in rapid succession, creating a chain of latency. There are a few things we can do to mitigate this.

There's also a separate discussion to be had about whether we should care very much about this sort of case, too. While ideally we get all crash information, this sort of case is more than likely a single crash cause occurring repeatedly (often in a background process), so we are trying to preserve and send a whole bunch of near-identical crashes.

Comment 4

•

4 months ago

I think the thesis that we see a lot of (equal) pings might actually be wrong. The tl;dr is probably that the child process used to send the pings may block on anything also the parent could block (slow file IO, network, ...) and may just never finish. Of course we have 0 insight in the state of the client process when this happens, but experience from other shutdown hangs show that even simplest OS operations can block endlessly during shutdown.

I actually started from thinking about adding a crash annotation with the initial amount of pings to send. But:

I took a closer look at the code and the crash data. Here's what I found:

The 87% single-blocker case is the main problem

Looking at all 82 crashes with this signature via SuperSearch, the async_shutdown_timeout annotation reveals:

71 crashes (87%) have exactly 1 blocker — a single crash ping submission is timing out
11 crashes (13%) have 2–13 blockers (queue buildup from multiple crashes)

Note: We seem to create the shutdown blockers all at once, so the chain of blockers in the hang annotation shows us the number of pending pings at the moment of the hang. Numbers are low, most of the times just one.

So while the unbounded queue (comment 3) makes things worse in some cases, even a single sendGleanPing call can exceed the shutdown timeout. The queue pileup isn't the primary driver.

Why a single ping can block shutdown

Each sendGleanPing call spawns a child process that does significant sequential work:

flock(LOCK_EX)/LockFileEx on crashping.pid — blocks if another instance holds it
Glean DB init (file I/O)
Submit ping + glean::shutdown() — waits for network upload
Unlock + exit

Any of these can be slow, especially on Windows where file I/O can block for extended periods (AV scanning, disk encryption, indexing, slow HDD). We also have no visibility into what the child is doing — the blocker state is always (none).

Silent data loss already happens today

The event file gets deleted as soon as _handleEventFilePayload returns EVENT_FILE_SUCCESS, but sendGleanPing is not awaited at that point. If the child process fails (network, Glean issue, crash), the ping is silently lost with no retry. So the current approach blocks shutdown and can lose data anyway.

It is important to mention that this loss seems to be limited to one ping, though, IIUC. Disowning the process and going on could actually make us loose all, so I do not think this is an option.

Suggested approach: decouple ping submission from shutdown entirely

Instead of blocking shutdown on the child process, move to a file-based handoff:

When processing an event file, write the annotations to a pending-pings directory (e.g. pending-pings/<crash-id>.json)

I assume we originally wanted to avoid to do this from the parent, as it might block, like all file IO. So maybe loosing a ping once in a while is just fine...

Spawn one child process (not one per crash) pointing at that directory, and don't block on it
The child iterates over pending files, inits Glean once, sends all pings, and deletes each pending file only after successful submission
At next startup, if there are leftover files in pending-pings/, spawn the child again

This way shutdown never blocks on crash ping submission, and no data is lost — a pending file only disappears when the ping was actually sent. Failed submissions survive on disk for the next session to retry.

I am not sure there is real urge to do anything, until numbers stay relatively low. But the current architecture feels a bit incomplete if our goal is to get every possible ping over the wire. Alternatively to the above I could also imagine a world where piping the data to the child makes actually sense to avoid further disk IO, but then the child should really avoid ANY disk IO, and I do not think we can easily achieve that. So the above suggestion seems easier to achieve (and btw this would mimic to some extent a background deleter for origins we have elsewhere).

Assignee

Comment 5

•

4 months ago

I just looked at the annotations for the crashes in the past month, and they all seem like single-blockers to me, unless I am missing some subtlety. I didn't realize that it would show more of them (or rather, I wasn't sure whether it was condensing them to a single entry). If it would normally show a list of them, then I definitely agree: it seems like there is no build-up of crashes, and they are simply taking too long to submit. I may try to add something to get visibility into what it's doing at that point. If, for instance, it is always stuck sending the pings, it is safe to kill it (the pings will already be persisted in the Glean DB at that point).

We will eventually move to crash pings being sent by the crash helper. This will have a number of advantages, like initing a Glean DB once and sending pings as necessary from a single, long-lived process. It will also always be spawned at startup, so it will be able to handle retries if pings didn't previously send. I don't know whether the crash helper is yet ready to take on this responsibility (though :gsvelto could weigh in on that), but the crashping crate was made with that transition in mind, so when the time comes it should be as simple as a boilerplate Glean setup and a single additional function call in the crash helper (the crash ping crate needs only the extras JSON data).

One of the main benefits of using Glean here was supposed to be that it would handle management of pending pings and reliable sending, but to do so we need to get the pings into the store in the first place (and that is clearly made more difficult by needing the store to be managed by the crash reporter client). It would be disappointing to duplicate that logic manually, when we just removed the code doing so for Legacy Telemetry logic, but I will think it over further.

Comment 6

•

4 months ago

•

Edited

(In reply to Alex Franchuk [:afranchuk] from comment #5)

I just looked at the annotations for the crashes in the past month, and they all seem like single-blockers to me, unless I am missing some subtlety. I didn't realize that it would show more of them (or rather, I wasn't sure whether it was condensing them to a single entry). If it would normally show a list of them, then I definitely agree: it seems like there is no build-up of crashes, and they are simply taking too long to submit.

It shows all active blockers and there are a few reports where this is visible for this signature.

I may try to add something to get visibility into what it's doing at that point. If, for instance, it is always stuck sending the pings, it is safe to kill it (the pings will already be persisted in the Glean DB at that point).

Good point. So it would be safe to disown the process once we now it persisted the data locally to the Glean DB (which might also block, so until then loss could happen, IIUC).

We will eventually move to crash pings being sent by the crash helper. This will have a number of advantages, like initing a Glean DB once and sending pings as necessary from a single, long-lived process. It will also always be spawned at startup, so it will be able to handle retries if pings didn't previously send. I don't know whether the crash helper is yet ready to take on this responsibility (though :gsvelto could weigh in on that), but the crashping crate was made with that transition in mind, so when the time comes it should be as simple as a boilerplate Glean setup and a single additional function call in the crash helper (the crash ping crate needs only the extras JSON data).

That sounds like a good plan. We may indeed be able to find better moments than shutdown to hand over the JSON to the crashreporter, like through the UserIdleService.

One of the main benefits of using Glean here was supposed to be that it would handle management of pending pings and reliable sending, but to do so we need to get the pings into the store in the first place (and that is clearly made more difficult by needing the store to be managed by the crash reporter client). It would be disappointing to duplicate that logic manually, when we just removed the code doing so for Legacy Telemetry logic, but I will think it over further.

I think Glean being the right choice holds, we must "just" make sure that things arrive in the right Glean DB in time without shutdown interfering. My gut feeling tells me that disk IO is a likely culprit, like we often see in QM or other shutdown hangs. As said, preparing all things up to that point while we are alive could help. I'd want to avoid to do more work directly at startup, we should use later pauses.

Comment 7

•

4 months ago

•

Edited

Just an additional note on what I assume we see here, that is late content process crashes close to or during parent shutdown whose processing bleeds into or happens during parent shutdown:

Recapping my understanding of what's actually happening here:

For child process crashes, the ping submission is triggered immediately when the crash occurs: CrashService.addCrash() → minidump analysis → CrashManager.addCrash() → sendGleanPing() spawns the child process and registers the shutdown blocker. This is the real-time path, not deferred aggregation of old event files.

Note that maybeRunMinidumpAnalyzer is skipped when gQuitting is true, but the rest of the flow — reading the .extra file, computing the hash, spawning the ping child — still runs during shutdown.

So the likely scenario is: content processes crash close to or during shutdown (not unusual — the parent tears down IPC channels, child processes may crash as a result), the ping child gets spawned at that moment, and it needs to do flock + Glean init + DB write + network upload + Glean shutdown while the parent is counting down the shutdown timeout. That's a race it often loses on slow systems, especially Windows with AV/disk encryption overhead.

This is consistent with the data: 87% of these hangs have exactly 1 blocker — a single content process crash near shutdown, not a backlog.

So I wonder if there is much we can improve in the flow, as most likely the vast majority of crashes are already processed before the parent enters shutdown. This may explain the relatively low numbers here.

Updated

•

4 months ago

Severity: -- → S2

Priority: -- → P1

Updated

•

4 months ago

Severity: S2 → S4

Pascal Chevrel (relman team) -> :pascalc

Assignee

Comment 8

•

4 months ago

We could try to use the existing event file infrastructure to temporarily persist shutdown crash events. When we start up again, they will be read and pings will be sent. Note: I saw one or two crashes that had a stack indicating the crash was being sent from this code path at shutdown, however all the rest were from normal crash handling code paths, so taking this approach would address most crashes (assuming they don't get blocked on IO writing the event files, that is).

On the other hand, it's important to remember that at this point we already have the crash data persisted. Our code is reading the extra file and dump files that are already on disk. So it might be even easier for us to simply not make any attempt at sending crash pings during shutdown, and instead send them later when we find the pending crashes. We have code that looks for pending crashes to prompt users (e.g. "you have unsubmitted crash reports"). We could do something similar for crash pings, though it would occur silently, and we'd need to add some bit of bookkeeping to know that a ping was sent, which would be separate from the bookkeeping done for reports.

Updated

•

4 months ago

status-firefox150: --- → affected

tracking-firefox150: --- → +

Comment 9

•

4 months ago

The bug is linked to a topcrash signature, which matches the following criterion:

Top 10 desktop browser crashes on nightly

:afranchuk, could you consider increasing the severity of this top-crash bug?

For more information, please visit BugBot documentation.

Flags: needinfo?(afranchuk)

Keywords: topcrash

Alessio Placitelli [:Dexter]

Comment 10

•

4 months ago

The bug is marked as tracked for firefox149 (beta) and tracked for firefox150 (nightly). However, the bug still has low severity.

:Dexter, could you please increase the severity for this tracked bug? If you disagree with the tracking decision, please talk with the release managers.

For more information, please visit BugBot documentation.

Flags: needinfo?(alessio.placitelli)

Comment 11

•

4 months ago

Hey Chris, would you please check comment 10? Thanks!

Flags: needinfo?(alessio.placitelli) → needinfo?(chutten)

Comment 12

•

4 months ago

I'll let :afranchuk decide

Flags: needinfo?(chutten)

Assignee

Comment 13

•

4 months ago

•

Edited

I'm confused by bugbot: this has very few crashes. I'm not sure what I'm doing wrong here, but I can't get it to come up on crash-stats in the list of top crashers for Nightly/Beta at all (I've expanded filters as much as possible). I do not see how this is a top-10 crash.

Flags: needinfo?(afranchuk)

Assignee

Comment 14

•

4 months ago

By the way, looking at the crashes on crash-stats: it appears that it is dominated by Tahoe installs, and almost all of the other crashes are also various macOS versions. This implies that either

there is some additional slowness introduced by macOS (which might be related to what's happening in bug 2019539), and/or
macOS more frequently experiences crashes in Nightly at shutdown.

Assignee

Comment 15

•

4 months ago

Also note that, so far, there aren't any crash reports from beta. The crash reports we have received are also mostly (60%) from users with automatic submission enabled, which is more common in our Nightly population. :jstutte had theorized out-of-band that the crashes during shutdown (which are subsequently causing the async shutdown timeout in ping submission) are content process IPC-ShutdownKill crashes. If this is the case, these will only be crashes in Nightly, and not in beta/release. If we continue to see no crashes in beta, it may corroborate this idea.

Assignee

Comment 16

•

4 months ago

WRT comment 13: it seems as if the top crashers list may exclude some signatures? In supersearch, I do see this signature, albeit it is ranked 20th (but I imagine bugbot is set to ignore some of the signatures here, so that may increase the rank).

Comment 17

•

4 months ago

I think around merge day statistics are a bit shuffled due to there not being data for the new version numbers, not sure if that can still have some effect today.

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Comment 18

•

4 months ago

It appears there are some beta crashes. This implies there are other shutdown crashes besides IPC ShutDownKill that are triggering this subsequent crash.

Comment hidden (duplicate)

Comment 20

•

3 months ago

Running the following crash test locally as part of the WebDriver test suite always causes a timeout, due to a shutdown hang in CrashManager while it tries to submit Glean crash pings.

mach wpt --headless --setpref="remote.log.level=Trace" --webdriver-arg=-vv testing/web-platform/mozilla/tests/webdriver/harness/detect_crash.py

Here an excerpt from the log output:

 0:19.27 INFO STDOUT: WARNING: At least one completion condition is taking too long to complete. Conditions: [{"name":"CrashManager: submitting Glean crash ping(s)","state":"(none)","filename":"resource://gre/modules/CrashManager.sys.mjs","lineNumber":73,"stack":["resource://gre/modules/CrashManager.sys.mjs:sendGleanPing:73","resource://gre/modules/CrashManager.sys.mjs:_sendCrashPing:829","resource://gre/modules/CrashManager.sys.mjs:addCrash/promise<:536"]}] Barrier: profile-before-change-telemetry
 0:36.00 TEST_END: TIMEOUT, expected OK

Alex, I assume that this is the same problem? Note that for WebDriver we have Telemetry turned off.

Flags: needinfo?(afranchuk)

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Comment 21

•

3 months ago

Yes, that's likely the same issue. We should short-circuit the logic for testing: the CrashManager will invoke the crash reporter client to send the ping, but the crash reporter client will never actually send a ping if tests are being run. So we could detect a testing environment and skip launching the client entirely.

Flags: needinfo?(afranchuk)

Comment 22

•

3 months ago

Note that in terms of WebDriver we set the following environment variables to control the behavior of the crash reporter:

            .env("MOZ_CRASHREPORTER", "1")
            .env("MOZ_CRASHREPORTER_NO_REPORT", "1")
            .env("MOZ_CRASHREPORTER_SHUTDOWN", "1");

Maybe that's helpful as well.

Pascal Chevrel (relman team) -> :pascalc

Updated

•

3 months ago

status-firefox149: affected → fix-optional

tracking-firefox149: + → ---

Jonathan Almeida [:jonalmeida]

Comment 23

•

3 months ago

REO for 150: Is this a user perceived crash (that we'd like to get into 150)?

Flags: needinfo?(afranchuk)

Assignee

Comment 24

•

3 months ago

•

Edited

This is user perceived: the crash reporter will come up when Firefox is shutting down. I double-checked crash pings, and the counts there are roughly the same as the report counts, which is nice. So that confirms this is fairly low-volume. I've still been mulling over a solution here, and I'll try to put up a patch soon (with intent to uplift to 150 beta).

Flags: needinfo?(afranchuk)

Comment 25

•

3 months ago

Based on the topcrash criteria, the crash signature linked to this bug is not a topcrash signature anymore.

For more information, please visit BugBot documentation.

Keywords: topcrash

Assignee

Comment 26

•

3 months ago

Attached file Bug 2017910 - Prevent crash pings during shutdown r=gsvelto — Details

We will send them upon the next restart.

Phabricator Automation

Updated

•

3 months ago

Attachment #9562382 - Attachment description: WIP: Bug 2017910 - Prevent crash pings during shutdown r=gsvelto → Bug 2017910 - Prevent crash pings during shutdown r=gsvelto

Dianna Smith [:diannaS]

Updated

•

2 months ago

status-firefox149: fix-optional → wontfix

Dianna Smith [:diannaS]

Comment 27

•

2 months ago

:afranchuk just a friendly reminder that RC builds next week

Flags: needinfo?(afranchuk)

https://hg.mozilla.org/mozilla-central/rev/d264b1a8f063

Assignee

Comment 28

•

2 months ago

I have not landed this yet because it will make uplifting bug 2026650 more difficult, but I suppose both need to get into beta, so I can do this one first. Bug 2026650 just backed out for an odd reason, so it might take a bit to figure it out.

Flags: needinfo?(afranchuk)

Pulsebot

Comment 29

•

2 months ago

Pushed by afranchuk@mozilla.com: https://github.com/mozilla-firefox/firefox/commit/25242d4115b3 https://hg.mozilla.org/integration/autoland/rev/d264b1a8f063 Prevent crash pings during shutdown r=gsvelto

Cosmin Sabou [:CosminS]

Comment 30

•

2 months ago

bugherder

Status: ASSIGNED → RESOLVED

Closed: 2 months ago

status-firefox151: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → 151 Branch

Comment 31

•

2 months ago

The patch landed in nightly and beta is affected.
:afranchuk, is this bug important enough to require an uplift?

If yes, please nominate the patch for beta approval.
- See https://wiki.mozilla.org/Release_Management/Requesting_an_Uplift for documentation on how to request an uplift.
If no, please set status-firefox150 to wontfix.

For more information, please visit BugBot documentation.

Flags: needinfo?(afranchuk)

Phabricator Automation

Comment 32

•

2 months ago

firefox-beta Uplift Approval Request

User impact if declined/Reason for urgency: More crashes at shutdown, and missed crash pings from crashes during shutdown.
Code covered by automated testing?: yes
Fix verified in Nightly?: yes
Needs manual QE testing?: no
Steps to reproduce for manual QE testing:
Risk associated with taking this patch: low
Explanation of risk level: This is logically a somewhat simple change. In the worst case (some weird interaction with the crash store, perhaps), we will miss some pings, but in most cases the logic remains the same as before.
String changes made/needed?: No
Is Android affected?: no

Attachment #9569287 - Flags: approval-mozilla-beta?

Assignee

Comment 33

•

2 months ago

Attached file Bug 2017910 - Prevent crash pings during shutdown — Details

We will send them upon the next restart.

Original Revision: https://phabricator.services.mozilla.com/D291323

Assignee

Updated

•

2 months ago

Flags: needinfo?(afranchuk)

Dianna Smith [:diannaS]

Comment 34

•

2 months ago

looks like there are still some crashes in nightly, can you take a look before we uplift this?

Flags: needinfo?(afranchuk)