(In reply to Alex Franchuk [:afranchuk] from comment #5) > I just looked at the annotations for the crashes in the past month, and they all seem like single-blockers to me, unless I am missing some subtlety. I didn't realize that it would show more of them (or rather, I wasn't sure whether it was condensing them to a single entry). If it _would_ normally show a list of them, then I definitely agree: it seems like there is no build-up of crashes, and they are simply taking too long to submit. It shows all active blockers and there are a few reports where this is visible for this signature. > I may try to add something to get visibility into what it's doing at that point. If, for instance, it is always stuck sending the pings, it is safe to kill it (the pings will already be persisted in the Glean DB at that point). Good point. So it would be safe to disown the process once we now it persisted the data locally to the Glean DB (which might also block, so until then loss could happen, IIUC). > We will eventually move to crash pings being sent by the crash helper. This will have a number of advantages, like initing a Glean DB once and sending pings as necessary from a single, long-lived process. It will also always be spawned at startup, so it will be able to handle retries if pings didn't previously send. I don't know whether the crash helper is yet ready to take on this responsibility (though :gsvelto could weigh in on that), but the `crashping` crate was made with that transition in mind, so when the time comes it should be as simple as a boilerplate Glean setup and a single additional function call in the crash helper (the crash ping crate needs only the extras JSON data). > > One of the main benefits of using Glean here was supposed to be that _it_ would handle management of pending pings and reliable sending, but to do so we need to get the pings into the store in the first place (and that is clearly made more difficult by needing the store to be managed by the crash reporter client). It would be disappointing to duplicate that logic manually, when we just removed the code doing so for Legacy Telemetry logic, but I will think it over further.
Bug 2017910 Comment 6 Edit History
Note: The actual edited comment in the bug view page will always show the original commenter’s name and original timestamp.
(In reply to Alex Franchuk [:afranchuk] from comment #5) > I just looked at the annotations for the crashes in the past month, and they all seem like single-blockers to me, unless I am missing some subtlety. I didn't realize that it would show more of them (or rather, I wasn't sure whether it was condensing them to a single entry). If it _would_ normally show a list of them, then I definitely agree: it seems like there is no build-up of crashes, and they are simply taking too long to submit. It shows all active blockers and there are a few reports where this is visible for this signature. > I may try to add something to get visibility into what it's doing at that point. If, for instance, it is always stuck sending the pings, it is safe to kill it (the pings will already be persisted in the Glean DB at that point). Good point. So it would be safe to disown the process once we now it persisted the data locally to the Glean DB (which might also block, so until then loss could happen, IIUC). > We will eventually move to crash pings being sent by the crash helper. This will have a number of advantages, like initing a Glean DB once and sending pings as necessary from a single, long-lived process. It will also always be spawned at startup, so it will be able to handle retries if pings didn't previously send. I don't know whether the crash helper is yet ready to take on this responsibility (though :gsvelto could weigh in on that), but the `crashping` crate was made with that transition in mind, so when the time comes it should be as simple as a boilerplate Glean setup and a single additional function call in the crash helper (the crash ping crate needs only the extras JSON data). That sounds like a good plan. We may indeed be able to find better moments than shutdown to hand over the JSON to the crashreporter, like though the UserIdleService. > One of the main benefits of using Glean here was supposed to be that _it_ would handle management of pending pings and reliable sending, but to do so we need to get the pings into the store in the first place (and that is clearly made more difficult by needing the store to be managed by the crash reporter client). It would be disappointing to duplicate that logic manually, when we just removed the code doing so for Legacy Telemetry logic, but I will think it over further. I think Glean being the right choice holds, we must "just" make sure that things arrive in the right Glean DB in time without shutdown interfering. My gut feeling tells me that disk IO is a likely culprit, like we often see in QM or other shutdown hangs. As said, preparing all things up to that point while we are alive could help. I'd want to avoid to do more work directly at startup, we should use later pauses.
(In reply to Alex Franchuk [:afranchuk] from comment #5) > I just looked at the annotations for the crashes in the past month, and they all seem like single-blockers to me, unless I am missing some subtlety. I didn't realize that it would show more of them (or rather, I wasn't sure whether it was condensing them to a single entry). If it _would_ normally show a list of them, then I definitely agree: it seems like there is no build-up of crashes, and they are simply taking too long to submit. It shows all active blockers and there are a few reports where this is visible for this signature. > I may try to add something to get visibility into what it's doing at that point. If, for instance, it is always stuck sending the pings, it is safe to kill it (the pings will already be persisted in the Glean DB at that point). Good point. So it would be safe to disown the process once we now it persisted the data locally to the Glean DB (which might also block, so until then loss could happen, IIUC). > We will eventually move to crash pings being sent by the crash helper. This will have a number of advantages, like initing a Glean DB once and sending pings as necessary from a single, long-lived process. It will also always be spawned at startup, so it will be able to handle retries if pings didn't previously send. I don't know whether the crash helper is yet ready to take on this responsibility (though :gsvelto could weigh in on that), but the `crashping` crate was made with that transition in mind, so when the time comes it should be as simple as a boilerplate Glean setup and a single additional function call in the crash helper (the crash ping crate needs only the extras JSON data). That sounds like a good plan. We may indeed be able to find better moments than shutdown to hand over the JSON to the crashreporter, like through the UserIdleService. > One of the main benefits of using Glean here was supposed to be that _it_ would handle management of pending pings and reliable sending, but to do so we need to get the pings into the store in the first place (and that is clearly made more difficult by needing the store to be managed by the crash reporter client). It would be disappointing to duplicate that logic manually, when we just removed the code doing so for Legacy Telemetry logic, but I will think it over further. I think Glean being the right choice holds, we must "just" make sure that things arrive in the right Glean DB in time without shutdown interfering. My gut feeling tells me that disk IO is a likely culprit, like we often see in QM or other shutdown hangs. As said, preparing all things up to that point while we are alive could help. I'd want to avoid to do more work directly at startup, we should use later pauses.