Open Bug 1315306 Opened 4 years ago Updated 3 months ago

Crash in AsyncShutdownTimeout | profile-before-change-telemetry | TelemetryController: shutting down

Categories

(Toolkit :: Telemetry, defect, P3)

50 Branch
defect

Tracking


Tracking Status
firefox49 --- unaffected
firefox50 --- wontfix
firefox51 --- wontfix
firefox52 --- wontfix
firefox-esr52 --- wontfix
firefox-esr60 --- wontfix
firefox-esr68 --- wontfix
firefox53 --- wontfix
firefox54 --- wontfix
firefox55 --- wontfix
firefox59 --- wontfix
firefox60 --- wontfix
firefox61 --- wontfix
firefox62 --- wontfix
firefox63 --- wontfix
firefox64 --- wontfix
firefox65 --- wontfix
firefox66 --- wontfix
firefox67 --- wontfix
firefox68 --- wontfix
firefox69 --- wontfix
firefox70 + wontfix
firefox71 + wontfix
firefox72 --- wontfix
firefox73 --- wontfix
firefox74 --- fix-optional

People

(Reporter: philipp, Unassigned)

References

Details

(4 keywords, Whiteboard: [measurement:client])

Crash Data

Attachments

(2 files)

This bug was filed from the Socorro interface and is report bp-99b7b6be-6dcd-4f89-b230-dbed72161104.
=============================================================

This cross-platform signature started showing up in Firefox 50, indicating that something is blocking shutdown for a minute. It currently accounts for ~0.3% of browser crashes in 50.0b.

The most common "AsyncShutdownTimeout" annotation of these crash reports is:
{"phase":"profile-before-change-telemetry","conditions":[{"name":"TelemetryController: shutting down","state":{"initialized":true,"initStarted":true,"haveDelayedInitTask":false,"shutdownBarrier":"Complete","connectionsBarrier":"Complete","sendModule":{"sendingEnabled":false,"pendingPingRequestCount":0,"pendingPingActivityCount":0,"unpersistedPingCount":0,"persistedPingCount":4,"schedulerState":{"shutdown":true,"hasSendTask":false,"sendsFailed":true,"sendTaskState":"bail out - shutdown check","backoffDelay":60000}}},"filename":"resource://gre/modules/TelemetryController.jsm","lineNumber":729,"stack":["resource://gre/modules/TelemetryController.jsm:setupTelemetry:729","resource://gre/modules/TelemetryController.jsm:Impl.observe:829","resource://gre/modules/TelemetryController.jsm:this.TelemetryController<.observe:178","resource://gre/components/TelemetryStartup.js:TelemetryStartup.prototype.observe:28"]}]}
Is this us killing the process because sending Telemetry is taking too long (I'm almost 100% guessing here in an attempt to move this forward)? Chris, do you know?
Flags: needinfo?(chutten)
Telemetry does do stuff during shutdown, but I'm afraid that's about as far as my knowledge goes about its interaction with shutdown. Georg will know more.
Flags: needinfo?(chutten) → needinfo?(gfritzsche)
The AsyncShutdownTimeout signatures were only improved recently, so I expect that this is not actually a new issue.

Formatting the sampled info here a bit:
> {
>   "phase": "profile-before-change-telemetry",
>   "conditions": [
>     {
>       "name": "TelemetryController: shutting down",
>       "state": {
>         "initialized": true,
>         "initStarted": true,
>         "haveDelayedInitTask": false,
>         "shutdownBarrier": "Complete",
>         "connectionsBarrier": "Complete",
>         "sendModule": {
>           "sendingEnabled": false,
>           "pendingPingRequestCount": 0,
>           "pendingPingActivityCount": 0,
>           "unpersistedPingCount": 0,
>           "persistedPingCount": 4,
>           "schedulerState": {
>             "shutdown": true,
>             "hasSendTask": false,
>             "sendsFailed": true,
>             "sendTaskState": "bail out - shutdown check",
>             "backoffDelay": 60000
>           }
>         }
>       },
>       // ...

... shutdownBarrier & connectionsBarrier are "Complete", but initialized & initStarted are still true.
So we must be waiting for TelemetryStorage.shutdown() here:
https://dxr.mozilla.org/mozilla-central/rev/86f702229e32c6119d092e86431afee576f033a1/toolkit/components/telemetry/TelemetryController.jsm#787
https://dxr.mozilla.org/mozilla-central/rev/86f702229e32c6119d092e86431afee576f033a1/toolkit/components/telemetry/TelemetryStorage.jsm#608

We'd need to instrument this more to see where this module is stuck.
We'll also need to check whether all or most of the samples show the same pattern.

From looking at the signature summary on crash-stats, I also notice that 78% of the uptime ranges are in the 1-5min range.
That makes it likely that some of the early expensive tasks could cause this (disk scanning, ping cleanup, ...), although those tasks should short-circuit on shutdown.
Flags: needinfo?(gfritzsche)
Priority: -- → P2
Whiteboard: [measurement:client]
Priority: P2 → P3
Georg, thanks for the analysis. Can you investigate more? Also, should we move this to a better component?
Flags: needinfo?(gfritzsche)
http://bit.ly/2ekYAAd is a similar signature that shows in 49.0.2 - one question the Triage team had is whether this is actually a browser crash or just a freeze?
(In reply to Andrew Overholt [:overholt] from comment #4)
> Georg, thanks for the analysis. Can you investigate more? Also, should we
> move this to a better component?

I prioritized it so it will show up in our work planning. Given the current volume on beta, and this being a shutdown hang, it doesn't seem urgent to investigate.
The component is correct.

(In reply to Marcia Knous [:marcia - use ni] from comment #5)
> http://bit.ly/2ekYAAd is a similar signature that shows in 49.0.2 - one
> question the Triage team had is whether this is actually a browser crash or
> just a freeze?

Bug 1246828 ended up renaming profile-before-change2 to profile-before-change-telemetry, so this should be the same signature.
This is not a crash, it is an AsyncShutdownTimeout.
This is the Firefox process waiting longer than 60 seconds to shut down, then killing itself to submit a crash report for diagnosis.
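The watchdog mechanism described here can be sketched in plain JavaScript. This is a minimal illustration of the AsyncShutdown idea only; the function and parameter names are invented, not Firefox's actual implementation:

```javascript
// Sketch of an AsyncShutdown-style watchdog: if the shutdown work
// (a Promise) does not settle within the timeout, we give up and
// report the still-pending blocker state for diagnosis.
function withShutdownWatchdog(shutdownPromise, getState, timeoutMs = 60000) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => {
      // In Firefox this is the point where the process deliberately
      // crashes itself, annotating the crash report with the state.
      reject(new Error("AsyncShutdownTimeout: " + JSON.stringify(getState())));
    }, timeoutMs);
  });
  return Promise.race([shutdownPromise, timeout])
    .finally(() => clearTimeout(timer));
}
```

If the shutdown work finishes in time, the timer is cleared and nothing is reported; only a genuinely stuck shutdown produces the annotated "crash".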
Crash Signature: [@ AsyncShutdownTimeout | profile-before-change-telemetry | TelemetryController: shutting down] → [@ AsyncShutdownTimeout | profile-before-change-telemetry | TelemetryController: shutting down] [@ AsyncShutdownTimeout | profile-before-change2 | TelemetryController: shutting down]
Flags: needinfo?(gfritzsche)
Thanks for the clarifications!
As far as volume, in the latest release (51.0.1) we have 523 crashes in the last 7 days and 231 in 52.0b7.
See Also: → 1348698
Duplicate of this bug: 1348698
Attached image Frequency graph
If you look at bug 1348698 with the "Bugzilla Socorro Lens" add-on installed, you get the attached graph. It has large spikes that correlate perfectly with release dates. Not sure what this means, but maybe it's useful info to someone else.
(In reply to Nicholas Nethercote [:njn] from comment #10)
> If you look at bug 1348698 with the "Bugzilla Socorro Lens" add-on
> installed, you get the attached graph. It has large spikes that correlate
> perfectly with release dates. Not sure what this means, but maybe it's
> useful info to someone else.

This is just the effect of the crash submission mechanism through the infobar being turned on by default on early beta (bug 1303067): the whole backlog of unsubmitted [@ IPCError-browser | ShutDownKill] content process reports from beta users is then getting uploaded at once.
Attached image Pingsender crash.gif
Hi everyone!

I managed to reproduce this crash using Firefox Nightly 56.0a1 and Firefox RC 55.0 on Windows 7 following these steps:

1. Open Firefox.
2. Close Firefox.
3. Leave the pingsender.exe Windows notification open (~1 min).
4. Firefox crashes.

Crash IDs:
78bcdbec-0c32-46f7-aab9-fbbb90170801
b7857985-5a46-4de6-b26d-e50c90170801
0724d5aa-e913-4e04-b3c2-383170170801
88aa3cc6-6348-4110-92ca-7b2960170801

Please Note:
User Account Control is set to default.
Please observe the attached screencast for further details regarding this issue.
Based on comment 12, I think what happens here is when the dialog is shown, ShellExecuteExW here fails <https://searchfox.org/mozilla-central/rev/6326724982c66aaeaf70bb7c7ee170f7a38ca226/xpcom/threads/nsProcessCommon.cpp#500> and as a result the caller in https://searchfox.org/mozilla-central/rev/6326724982c66aaeaf70bb7c7ee170f7a38ca226/toolkit/components/telemetry/TelemetrySend.jsm#1323 throws an exception which somehow messes up the state of the async shutdown code.

Does this sound plausible?
Also (and sorry if this is a dumb question) why does the sendTelemetry async shutdown blocker need to exist in the first place? <https://searchfox.org/mozilla-central/rev/6326724982c66aaeaf70bb7c7ee170f7a38ca226/toolkit/components/asyncshutdown/nsIAsyncShutdown.idl#198>  There is no code that uses it as far as I can see.  It seems like if we deleted it this crash would go away and we would remove future footguns like this.  If someone needed it again it's very easy to add it back.
ni?Alessio since he might know why we have that lever.
Flags: needinfo?(alessio.placitelli)
(In reply to Out sick from comment #14)
> Also (and sorry if this is a dumb question) why does the sendTelemetry async
> shutdown blocker need to exist in the first place?
> <https://searchfox.org/mozilla-central/rev/
> 6326724982c66aaeaf70bb7c7ee170f7a38ca226/toolkit/components/asyncshutdown/
> nsIAsyncShutdown.idl#198>  There is no code that uses it as far as I can
> see.  It seems like if we deleted it this crash would go away and we would
> remove future footguns like this.  If someone needed it again it's very easy
> to add it back..

Whoa, this is an interesting crash! So, :ehsan, the |sendTelemetry| shutdown blocker is used to control the whole Telemetry shutdown process [1]. The problem here seems to be that spawning the pingsender is blocking the shutdown process due to the window requesting user input: we should detach from the pingsender process as soon as it starts.

I have a doubt though: why is that prompt being displayed? All our executables are signed and shouldn't trigger that UAC prompt as far as I know. Unless I'm missing something!

[1] - https://searchfox.org/mozilla-central/rev/05c4c3bc0cfb9b0fc66bdfc8c47cac674e45f151/toolkit/components/telemetry/TelemetryController.jsm#741
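The blocker/barrier idea can be sketched in plain JavaScript (a toy model with invented names, not Mozilla's actual nsIAsyncShutdown API): shutdown proceeds only once every registered blocker's promise has resolved, which is why deleting |sendTelemetry| would remove the ordering guarantee rather than the underlying work.

```javascript
// Minimal barrier sketch: clients add blockers (promise-returning
// functions), and wait() resolves only after every blocker resolves.
// The state getter is what a crash annotation could report if wait()
// never settles.
class Barrier {
  constructor(name) {
    this.name = name;
    this.blockers = [];  // [{name, fn}]
  }
  addBlocker(name, fn) {
    this.blockers.push({ name, fn });
  }
  get state() {
    return this.blockers.map(b => b.name);
  }
  async wait() {
    await Promise.all(this.blockers.map(b => b.fn()));
  }
}
```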
Flags: needinfo?(alessio.placitelli)
(In reply to Alessio Placitelli [:Dexter] from comment #16)
> (In reply to Out sick from comment #14)
> > Also (and sorry if this is a dumb question) why does the sendTelemetry async
> > shutdown blocker need to exist in the first place?
> > <https://searchfox.org/mozilla-central/rev/
> > 6326724982c66aaeaf70bb7c7ee170f7a38ca226/toolkit/components/asyncshutdown/
> > nsIAsyncShutdown.idl#198>  There is no code that uses it as far as I can
> > see.  It seems like if we deleted it this crash would go away and we would
> > remove future footguns like this.  If someone needed it again it's very easy
> > to add it back..
> 
> Woha, this is an interesting crash! So, :ehsan, the |sendTelemetry| shutdown
> blocker is used to control the whole Telemetry shutdown process [1].

Oh wow, I *swear* searchfox wasn't showing me that result on Thursday, so I thought it was unused!  But yeah, that makes perfect sense.

> The
> problem here seems to be that spawning the pingsender is blocking the
> shutdown process due to the window requesting user input: we should detach
> from the pingsender process as soon as it starts.

Note that it's not the pingsender window; it's Windows spawning a dialog before pingsender itself has executed.

> I have a doubt though: why is that prompt being displayed? All our
> executables are signed and shouldn't trigger that UAC prompt as far as I
> know. Unless I'm missing something!

This isn't UAC.  This is a dialog that Windows shows for files downloaded from the Internet AFAIK (perhaps among other cases.)  For example if you download a ZIP package containing Firefox you'll get this dialog the first time you run Firefox.  The "Publisher" line in the dialog in the screencast shows the signature on the file was successfully verified by Windows.  :-)
Setting ni? as a reminder to debug this at the beginning of the next week.
Flags: needinfo?(alessio.placitelli)
(In reply to Emil Ghitta, QA [:emilghitta] from comment #12)
> Created attachment 8892443 [details]
> Pingsender crash.gif
> 
> Hi everyone!
> 
> I managed to reproduce this crash using Firefox Nightly 56.0a1 and Firefox
> RC 55.0 on Windows 7 following this steps:

Hi Emil! Looks like I can't reproduce this on Windows 10 x64. Is this a Windows 7/Vista specific STR/issue?
Flags: needinfo?(alessio.placitelli) → needinfo?(emil.ghitta)
Hi Alessio!

Indeed, I managed to reproduce this issue only on Windows 7.

Should I file a new bug for this?
Flags: needinfo?(emil.ghitta)
Depends on: 1404857
(In reply to Emil Ghitta, QA [:emilghitta] from comment #20)
> Hi Alessio!
> 
> Indeed, I managed to reproduce this issue only on Windows 7.
> 
> Should I file a new bug for this?

Nope, thanks, I filed bug 1404857 :)
Top Crasher:

#11 on Nightly.
#13 on Beta.
#11 on Release.

Top Crashers for Firefox 62.0a1
Top 50 Crashing Signatures. 7 days ago

11 	1.79% 	-0.38% 	AsyncShutdownTimeout | profile-before-change-telemetry | TelemetryController: shutting down	58 	56 	0 	0 	56 	0 	2016-09-01

Top Crashers for Firefox 61.0b
Top 50 Crashing Signatures. 7 days ago
	
13 	0.69% 	-0.1% 	AsyncShutdownTimeout | profile-before-change-telemetry | TelemetryController: shutting down	228 	228 	0 	0 	208 	0 	2016-09-01 		
	
Top Crashers for Firefox 60.0.1
Top 50 Crashing Signatures. 7 days ago 	
	
11 	0.87% 	new 	AsyncShutdownTimeout | profile-before-change-telemetry | TelemetryController: shutting down	25 	25 	0 	0 	25 	0 	2016-09-01
Keywords: topcrash
Adding 62 as affected. On release this is currently the #11 top browser crash.
Marking fix-optional for 64. We could still take a patch for 65, and if it's verified and doesn't seem risky, could still take fixes for 64 as well.
Still a high volume crash on release and beta. Selena, can you help find an owner for this bug? We may find that it's stalled, but before I remove this from weekly regression triage it seems worth another investigation.
Can this be reproduced on Win 10? Tom - can someone have a look at reproducing this issue per comment #12 on win10?
Flags: needinfo?(sdeckelmann) → needinfo?(tgrabowski)

Hi Selena,

The steps mentioned in comment 12 are no longer valid. The issue triggered by following those particular steps was fixed by bug 1400294. Also, the crash was reproducible only on Windows 7 (per comment 20). We didn't manage to find other steps that could lead to a crash with this signature, but we will continue to investigate, and if something shows up I will post an update.

Flags: needinfo?(tgrabowski)

Updating the tracking flags for this one. During the 66 releases so far we have over 6300 crashes. The crash is visible in 67 beta and somewhat consistent in 68 nightly as well (although the number of crashes per build is not as high - some days only 1 crash, some days 17).

Bulk change of P3 carryover bugs to wontfix for 68.

These two crash signatures are #12 and #14 on 68 release. That's very high volume (in the top 5 crashes on release, if you combine the signatures)

Given the high volume over many versions, do you think this is worth investigating again?
If not, we can treat this as something that we don't keep triaging over and over.

Flags: needinfo?(ehsan)
Flags: needinfo?(alessio.placitelli)

(In reply to Liz Henry (:lizzard) from comment #30)

These two crash signatures are #12 and #14 on 68 release. That's very high volume (in the top 5 crashes on release, if you combine the signatures)

Given the high volume over many versions, do you think this is worth investigating again?
If not, we can treat this as something that we don't keep triaging over and over.

Redirecting to :chutten

Flags: needinfo?(alessio.placitelli) → needinfo?(chutten)

I'm afraid I don't really have any particular insights to share here, especially if we no longer trigger the dialog discussed in comment 17 from pingsender.exe...

Flags: needinfo?(ehsan)

I concur that these reports are different from the earlier ones. The shutdown barriers haven't been started for these ones:

(( from https://crash-stats.mozilla.org/report/index/7b0f4890-3b88-40ac-9ccc-d66a90190813#tab-metadata ))

{
  "phase": "profile-before-change-telemetry",
  "conditions": [
    {
      "name": "TelemetryController: shutting down",
      "state": {
        "initialized": true,
        "initStarted": true,
        "haveDelayedInitTask": false,
        "shutdownBarrier": "Not started",
        "connectionsBarrier": "Not started",
        "sendModule": {
          "sendingEnabled": false,
          "pendingPingRequestCount": 0,
          "pendingPingActivityCount": 0,
          "unpersistedPingCount": 0,
          "persistedPingCount": 1,
          "schedulerState": {
            "shutdown": true,
            "hasSendTask": false,
            "sendsFailed": false,
            "sendTaskState": "bail out - no pings to send",
            "backoffDelay": 60000
          }
        },
        "haveDelayedNewProfileTask": false
      },
      "filename": "resource:\/\/gre\/modules\/TelemetryController.jsm",
      "lineNumber": 854,
      "stack": [
        "resource:\/\/gre\/modules\/TelemetryController.jsm:setupTelemetry:854",
        "resource:\/\/gre\/modules\/TelemetryController.jsm:observe:989",
        "resource:\/\/gre\/modules\/TelemetryController.jsm:observe:180",
        "resource:\/\/gre\/modules\/TelemetryStartup.jsm:TelemetryStartup.prototype.observe:31"
      ]
    }
  ]
}

This means we're in _cleanupOnShutdown() (( the barrier calls shutdown(), _initStarted is true and _delayedInitTask is falsey so we're calling here )). We've skipped _delayedNewPingTask, and from the state in the reports it looks as though we've jumped through everything else and gotten past TelemetrySend.shutdown() as well (sendModule.schedulerState.shutdown is true). But we haven't yet started the shutdownBarrier.

This is consistent with the pingsender hypotheses mentioned after comment 12 because the only remaining things are sending the "health" ping by pingsender, sending the "main" ping with reason "shutdown" by pingsender, and some native deinit.

We could be triggering some deadlock in the native deinit, but I would be surprised if that were the problem.

+gsvelto for pingsender knowledge: we're trying our best to run pingsender asynchronously, but is it possible that something could still be causing us to block when starting the pingsender process?

Without insight there, I'm guessing we add more state to Telemetry shutdown to narrow down exactly which module's causing the issue (and maybe to split out the signatures into their own bugs? This one seems past its best before date.)

Flags: needinfo?(chutten) → needinfo?(gsvelto)

We know that pingsender.exe can start very slowly sometimes, one of those instances is bug 1386760. I have no good solution for that yet and only a guess of why it happens: AV software hooking kernel calls and doing (slow) stuff when we try to spawn pingsender.exe. However I've inspected some of the minidumps for this signature and they're different from bug 1386760.

If we've launched pingsender.exe the crashes are happening past the CreateProcess() call and IIRC we shouldn't be waiting on the resulting process. I don't know how else we can get stuck on pingsender.exe; in theory we shouldn't.

Flags: needinfo?(gsvelto)

Anything you can try for 70 here? Sounds like maybe this is something that needs more time to overhaul how the ping sending works.

Flags: needinfo?(gsvelto)

(In reply to Gabriele Svelto [:gsvelto] from comment #34)

We know that pingsender.exe can start very slowly sometimes, one of those instances is bug 1386760. I have no good solution for that yet

Is it possible to start the process from another thread, and do it earlier during shutdown, so that it happens in parallel while the rest of shutdown is happening?

Changing the priority to P1 as the bug is tracked by a release manager for the current beta.
See What Do You Triage for more information

Priority: P3 → P1

(In reply to Florian Quèze [:florian] from comment #37)

Is it possible to start the process from another thread, and do it earlier during shutdown, so that it happens in parallel while the rest of shutdown is happening?

Theoretically yes, but it's a bit tricky. The pingsender executable is launched from JavaScript code using an nsIProcess instance. The relevant code is running from the main thread and I don't know how to make JavaScript's equivalent of a background thread and delegate launching to that. I could wrap launching pingsender within some C++ code that would then spawn it from a separate thread but it would be a bit icky.

Chris, do you have a better idea on how to spawn pingsender from outside of the main thread?

Flags: needinfo?(gsvelto) → needinfo?(chutten)

The "JS calls into C++ which dispatches to a background thread to do potentially-heavy work" dance is something we do in a couple of places in Telemetry.cpp (see for example where we get the list of loaded modules)

Though this doesn't make it any less icky, it does at least show precedent :|

While I'm here I'll note that Telemetry's pattern of pingsender use might be exacerbating this. It has become a common pattern to send pings at shutdown, so we're not just spawning one pingsender at shutdown, we're spawning a handful of them one after another. We could change how we send pings at shutdown (maybe we register shutdown pings with TelemetrySend so we can batch them somehow), but I'm not sure whether that solves the issue under concern here.
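A registry along those lines might look like this (hypothetical API and names; it only builds the argument list for a single pingsender invocation, assuming pingsender grows support for repeated url/path pairs as discussed in this bug):

```javascript
// Sketch of a shutdown-ping registry: modules register pings during
// shutdown, and a single flush builds one command line instead of
// spawning one pingsender process per ping.
class ShutdownPingRegistry {
  constructor() {
    this.pending = [];  // [{url, filePath}]
  }
  register(url, filePath) {
    this.pending.push({ url, filePath });
  }
  // argv for one pingsender invocation covering every registered ping.
  buildArgv() {
    return this.pending.flatMap(p => [p.url, p.filePath]);
  }
}
```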

That's all I have, but I'm not exactly a mozjs guru. There might be an API for "schedule this task that is guaranteed to execute on a bg thread even during profile-before-change phase of shutdown" that I just haven't run across before. Who would I ask about that... :farre, maybe?

Flags: needinfo?(chutten) → needinfo?(afarre)

Thanks Chris. I'm now focusing on the recent OOM ramp up on Windows but I intend to find a fix for this as soon as I'm done with that. Note that :kwright has started rewriting the pingsender in Rust in bug 1339035. Adding additional features to the pingsender - like sending multiple pings with a single invocation - is going to be a lot easier after that's done.

Sounds unlikely we will come up with a fix before the 70 release, but I'll keep this marked fix-optional in case there is something simple enough to uplift as a ridealong for a dot release, since the crash volume is so high.

Just another thing, assuming pingsender would be able to send more than one ping in one invocation, how hard would it be to adapt the telemetry code to use that feature?

Flags: needinfo?(chutten)

It'll require a little work. The more obvious followers of the "send a ping at shutdown" pattern are all told about shutdown in TelemetryController's _cleanupOnShutdown. I'd probably refactor the various pings' shutdown()s to return any pings they'd like to send and then pass the list on to TelemetrySend's shutdown.

We might not get them all, but we'd get the obvious ones with, let's say, a day's effort.

Flags: needinfo?(chutten)

I'm not aware of such an API; my experience mostly revolves around the idle scheduled tasks, which all run on the main thread. I'm guessing Olli or Nathan would be the right people to ask.

Flags: needinfo?(afarre) → needinfo?(bugs)
See Also: → 1381002
Depends on: 1590114

Has bug 1590114 helped here at all?

I don't know the answer to
"There might be an API for "schedule this task that is guaranteed to execute on a bg thread even during profile-before-change phase of shutdown" that I just haven't run across before."

And I don't know what could have changed around ~68.
But I'll keep this in mind while looking into scheduling changes.

Flags: needinfo?(bugs)

(In reply to Olli Pettay [:smaug] from comment #46)

Has bug 1590114 helped here at all?

You know, I'm not sure. Are these frequent enough that we'd notice a dropoff around Oct 24 on Nightly? :gsvelto?

Flags: needinfo?(gsvelto)

The nightly data is very noisy; eyeballing it, I don't see a dropoff.

Flags: needinfo?(gsvelto)

Really bad crash that seems to have fallen off our radar since 50! Need some help diagnosing the cause and potential fixes.

Component: General → Telemetry
Product: Core → Toolkit

We have shipped our last beta for 71; marking fix-optional for 71 as this is a significant crasher on release. If a safe patch materializes in 72 we may want to uplift it in a potential dot release as a ridealong.

Chris, can you take a look at this high volume crash or find someone to investigate?

Flags: needinfo?(chutten)

gsvelto, what do you think the likelihood of a Rust rewrite of pingsender (bug 1339035) and augmenting it with "send multiple pings at once" is for Q1?

If we're at less than "somewhat likely", I think maybe we should pursue the "add a method to nsITelemetry to spawn pingsender on bg threads" alternative. Thoughts?

Flags: needinfo?(chutten) → needinfo?(gsvelto)

The Rust rewrite is a long way off because of the lack of adequate crates to make HTTPS requests. Adding support for sending multiple pings (all specified on the command-line in one go) could be done this week.

Flags: needinfo?(gsvelto)

Okidoki, let's get started on that angle of the solution then: collecting at-shutdown pings into some sort of registry and passing them all to pingsender in one swell foop. I'm thinking we need a bug for the registry and a bug for the pingsender feature, and then we can use this one for the final piece when both are done. Sound good?

Assignee: nobody → chutten
Status: NEW → ASSIGNED
Depends on: 1607586

(In reply to Chris H-C :chutten from comment #54)

Okidoki, let's get started on that angle of the solution then: collecting at-shutdown pings into some sort of registry and passing them all to pingsender in one swell foop. I'm thinking we need a bug for the registry and a bug for the pingsender feature, and then we can use this one for the final piece when both are done. Sound good?

+1

Depends on: 1607908

It's been two weeks since we landed the pingsender fixes and we're still seeing Nightly crashes with similar signatures at about the same rate. I'd like to see what it does on beta after the merge; with the larger population there we might be able to see a smaller effect.

But it might end up being that we need to look further. Maybe saving all these pending pings to disk at such a contentious moment is part of the problem?

The Nightly crash volumes going to 0 might be related to bug 1613692 (but that landed in the first 2020-02-13 nightly, not the second, so maybe not?)

Still plenty of crashes on beta. Gabriele, any ideas what we might have missed? Doug, can you point me at tools for looking at shutdown slowness? (assuming that shutdown slowness is what we're seeing, and not some intermittent lock)

Flags: needinfo?(gsvelto)
Flags: needinfo?(dothayer)

I've opened a bunch of minidumps from the beta builds and I couldn't find the pingsender.exe string anymore on the stack, so at least we're not stuck launching it like we did in the past... maybe we're stuck waiting on it to send all the pings? We still have bug 1610757 to implement to speed up the process.

Flags: needinfo?(gsvelto)

Isn't it an async launch, though? We should only be waiting for the process to start, not for it to stop: we don't pass an observer or block shutdown on its exit code.

The only tool I can recommend is the Gecko profiler with MOZ_PROFILER_SHUTDOWN set (pointing to an absolute filepath), running on a spinny disk. Am I reading the history correctly that at this point we don't know anything beyond that we're crashing waiting for Telemetry's AsyncShutdown blocker to resolve? I suspect that there's no cheeky performance cliff that we're falling off of here, and it's just that Telemetry is doing a lot during this window and we need to generally improve that (unless I'm misunderstanding something?) If the situation is worse on Beta, that would make sense, since Beta just has a higher representation of spinny disks.

Flags: needinfo?(dothayer)

I don't have access to spinny disk hardware, so my investigation lines are limited. Let's take a look at some reports' async shutdown state (field "Async shutdown timeout" in "metadata")

There are two types: those where shutdownBarrier has a state of "Not started", and those where nothing in the async shutdown state suggests why we're still in it (all state information says it's shut down, so I'm not sure what it's doing; maybe it's in TelemetryStorage.shutdown()).

The overwhelming majority of the crashes are of the first type (shutdownBarrier with a state of "Not started"). This is no longer consistent with a pingsender hypothesis, as pingsender isn't even spawned until TelemetrySend.flushPingSenderBatch, which is after we await shutdownBarrier here.

I guess we need to add more fine-grained instrumentation to shutdown and see where we're losing the plot. We're still getting the occasional crash on Nightly so we'll be able to iterate. Is there any reason to not include an async shutdown state parameter that is straight up telling us which step of _cleanupOnShutdown we're in? Plus the number of millis since the top of the function?
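The proposed instrumentation could look roughly like this (hypothetical helper and names, not the actual TelemetryController code): record which step of _cleanupOnShutdown we're in, plus elapsed milliseconds, and expose that through the blocker's state callback.

```javascript
// Sketch of finer-grained shutdown state: each step is run through a
// wrapper that stamps the step name and the millis since the top of
// the function, so a timeout annotation pinpoints the stuck step.
function makeShutdownProgress(now = Date.now) {
  const start = now();
  const state = { step: "begin", elapsedMs: 0 };
  return {
    async run(name, task) {
      state.step = name;
      state.elapsedMs = now() - start;
      return task();
    },
    // What we'd expose through the blocker's state-fetching callback.
    get state() { return { ...state }; },
  };
}
```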

Depends on: 1631605

All eight reports on Nightly since the finer-grained logging landed are complaining about TelemetryStorage.shutdown. Some are getting to it very quickly (16ms from the top of _cleanupOnShutdown); some are taking longer (over 2s).

This is my first time looking at TelemetryStorage.shutdown. Looks like it's waiting for the orderly completion of certain tasks:

  1. Awaiting the removal of the "aborted-session"-reason "main" ping. This is important because otherwise we'll send it at next startup, making us think the previous session terminated due to a crash or something. (( though I'm not sure about the analysis value of this sort of ping these days. We have the "crash" ping now. ))
  2. If we're in the middle of cleaning the archive of pings older than 60 days and trimming it down to at most 120MB of disk space, we wait for that to complete. This doesn't strictly need to complete right now, as we'll kick off another task during delayedInit at next startup if we need it.
  3. Enforcing the pending-ping quota (removing pending pings if there's more than 15MB of them). Like #2, we don't need to complete this now; we will kick it off again during TelemetrySend.setup.
  4. Awaiting us removing all pending pings. This only happens when you opt out of Telemetry, so I'm not sure it's likely to be the problem here. It is however important to have this run to completion as we will not restart it.
  5. Awaiting the saving of pending pings. This is crucial for data integrity and is likely to have some work to do given all the pings we're sending at shutdown via pingsender. (have to save the pings to disk after all).

If I had to place bets, I'd bet on #5 being the biggest slowdown of the bunch. We could drop #2 and #3 to see if it has any effect, but if I'm right there's just nothing we can do without asking data owners not to send pings at shutdown (because sending at shutdown requires dumping them on disk).
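The idea of skipping the restartable tasks (#2 and #3) during shutdown can be sketched as follows (invented names; the real tasks live in TelemetryStorage.jsm, and whether a task reschedules itself at next startup is the property that makes it safe to skip):

```javascript
// Sketch: during shutdown, only await tasks that will NOT be redone at
// next startup; skip restartable cleanup work instead of blocking on it.
async function storageShutdown(tasks, { shuttingDown }) {
  const results = [];
  for (const task of tasks) {
    if (shuttingDown && task.restartsNextStartup) {
      results.push({ name: task.name, skipped: true });
      continue;  // safe to redo later, so don't block shutdown on it
    }
    await task.run();
    results.push({ name: task.name, skipped: false });
  }
  return results;
}
```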

All these new reports are from sessions under 3min long (uptime less than 180), though, so I'd like to wait for a few more reports from longer sessions so that I don't over-rotate on this small sample. Maybe I'll look into checking "crash" pings (which should have the annotation) because we'll have more reports there and the minidump's not of much use for this investigation (at the moment at least).

Some data's had a chance to accumulate, and here it is: https://sql.telemetry.mozilla.org/queries/71846/source

We have a lot of await TelemetrySession.shutdown()s followed by not quite as many await TelemetryStorage.shutdown()s. Since I only looked at the latter back in Comment 63, let's take a look at the former...

...Oh look it's saving pings at shutdown, too.

There appears to be a pattern that saving and sending pings at shutdown on some clients just takes too darn long. Perhaps because saving to disk is taking too long...but perhaps because assembling the pings is taking too long. The Environment (present in many of these pings) isn't considered complete enough to be used in pings until it has quite a bit of information in it ranging from HW information to extended OS information to a full list of the Addons you have installed. This can take a while (again, exacerbated by slow disks).

So perhaps we have a combination of slow ping assembly and slow ping persistence causing these signatures. ...but if that's the case then the only thing I can see us doing is changing the contract of what data is guaranteed to be sent at shutdown. We're already missing some from these thousands of MAU (they do send "crash" pings, though), so maybe this won't be a problem?

I'll ponder this. ni?myself so I don't forget to come back to this.

Flags: needinfo?(chutten)

More data, this time examining the number of milliseconds it takes to get from the top of _cleanupOnShutdown to just before TelemetrySession.shutdown() (plot) or TelemetryStorage.shutdown (plot) (only counting cases where the async shutdown timed out there)

The plurality (250+ of the 1.7k cases since the end of April) of the hangs that timed out in TelemetrySession.shutdown got there immediately (0ms), and it's an exponential falloff from there, suggesting that either we're already running out of time by the time we get there and TelemetrySession.shutdown is just a heavy operation at the end of shutdown that gets the blame, or TelemetrySession.shutdown is taking tens of seconds.

Looking at the fewer cases of TelemetryStorage.shutdown, around 100 of the 966 are in the 10+ seconds range. The rest are getting there reasonably quickly, suggesting (to this casual observer) that TelemetrySession.shutdown might take seconds, but not likely tens of them.

The proper solution for all of this is to not perform costly ping assembly and persistence operations at shutdown and to take a more incremental approach. This will ultimately be accomplished through Project FOG bringing the Glean SDK to Firefox Desktop. For now, though, I think we can just monitor these to ensure they don't suddenly get worse. This happened a total of 3350 times across 2426 clients on Beta and Nightly, so I'm afraid I'm going to have to deprioritize working on this in favour of working on FOG.

Severity: critical → S4
Flags: needinfo?(chutten)
Priority: P1 → P3
Assignee: chutten → nobody
Status: ASSIGNED → NEW
You need to log in before you can comment on or make changes to this bug.