Open Bug 1315306 Opened 4 years ago Updated 3 months ago

Crash in AsyncShutdownTimeout | profile-before-change-telemetry | TelemetryController: shutting down

Categories

(Toolkit :: Telemetry, defect, P3)

50 Branch
defect

Tracking


Tracking Status
firefox49 --- unaffected
firefox50 --- wontfix
firefox51 --- wontfix
firefox52 --- wontfix
firefox-esr52 --- wontfix
firefox-esr60 --- wontfix
firefox-esr68 --- wontfix
firefox53 --- wontfix
firefox54 --- wontfix
firefox55 --- wontfix
firefox59 --- wontfix
firefox60 --- wontfix
firefox61 --- wontfix
firefox62 --- wontfix
firefox63 --- wontfix
firefox64 --- wontfix
firefox65 --- wontfix
firefox66 --- wontfix
firefox67 --- wontfix
firefox68 --- wontfix
firefox69 --- wontfix
firefox70 + wontfix
firefox71 + wontfix
firefox72 --- wontfix
firefox73 --- wontfix
firefox74 --- fix-optional

People

(Reporter: philipp, Unassigned)

References

Details

(4 keywords, Whiteboard: [measurement:client])

Crash Data

Attachments

(2 files)

This bug was filed from the Socorro interface and is report bp-99b7b6be-6dcd-4f89-b230-dbed72161104.
=============================================================

This cross-platform signature started showing up in Firefox 50, indicating that something is blocking shutdown for a minute. It currently accounts for ~0.3% of browser crashes in 50.0b.

The most common "AsyncShutdownTimeout" annotation of these crash reports is:
{"phase":"profile-before-change-telemetry","conditions":[{"name":"TelemetryController: shutting down","state":{"initialized":true,"initStarted":true,"haveDelayedInitTask":false,"shutdownBarrier":"Complete","connectionsBarrier":"Complete","sendModule":{"sendingEnabled":false,"pendingPingRequestCount":0,"pendingPingActivityCount":0,"unpersistedPingCount":0,"persistedPingCount":4,"schedulerState":{"shutdown":true,"hasSendTask":false,"sendsFailed":true,"sendTaskState":"bail out - shutdown check","backoffDelay":60000}}},"filename":"resource://gre/modules/TelemetryController.jsm","lineNumber":729,"stack":["resource://gre/modules/TelemetryController.jsm:setupTelemetry:729","resource://gre/modules/TelemetryController.jsm:Impl.observe:829","resource://gre/modules/TelemetryController.jsm:this.TelemetryController<.observe:178","resource://gre/components/TelemetryStartup.js:TelemetryStartup.prototype.observe:28"]}]}
Is this us killing the process because sending Telemetry is taking too long (I'm almost 100% guessing here in an attempt to move this forward)? Chris, do you know?
Flags: needinfo?(chutten)
Telemetry does do stuff during shutdown, but I'm afraid that's about as far as my knowledge goes about its interaction with shutdown. Georg will know more.
Flags: needinfo?(chutten) → needinfo?(gfritzsche)
The AsyncShutdownTimeout signatures were only improved recently, so I expect that this is not actually a new issue.

Formatting the sampled info here a bit:
> {
>   "phase": "profile-before-change-telemetry",
>   "conditions": [
>     {
>       "name": "TelemetryController: shutting down",
>       "state": {
>         "initialized": true,
>         "initStarted": true,
>         "haveDelayedInitTask": false,
>         "shutdownBarrier": "Complete",
>         "connectionsBarrier": "Complete",
>         "sendModule": {
>           "sendingEnabled": false,
>           "pendingPingRequestCount": 0,
>           "pendingPingActivityCount": 0,
>           "unpersistedPingCount": 0,
>           "persistedPingCount": 4,
>           "schedulerState": {
>             "shutdown": true,
>             "hasSendTask": false,
>             "sendsFailed": true,
>             "sendTaskState": "bail out - shutdown check",
>             "backoffDelay": 60000
>           }
>         }
>       },
>       // ...

... shutdownBarrier & connectionsBarrier are "Complete", but initialized & initStarted are still true.
So we must be waiting for TelemetryStorage.shutdown() here:
https://dxr.mozilla.org/mozilla-central/rev/86f702229e32c6119d092e86431afee576f033a1/toolkit/components/telemetry/TelemetryController.jsm#787
https://dxr.mozilla.org/mozilla-central/rev/86f702229e32c6119d092e86431afee576f033a1/toolkit/components/telemetry/TelemetryStorage.jsm#608

We'd need to instrument this more to see where this module is stuck.
We'll also need to check whether all or most of the samples show the same pattern.

From looking at the signature summary on crash-stats, I also notice that 78% of the uptime ranges are in the 1-5min range.
That makes it likely that some of the early expensive tasks could cause this (disk scanning, ping cleanup, ...), although those tasks should short-circuit on shutdown.
Flags: needinfo?(gfritzsche)
Priority: -- → P2
Whiteboard: [measurement:client]
Priority: P2 → P3
Georg, thanks for the analysis. Can you investigate more? Also, should we move this to a better component?
Flags: needinfo?(gfritzsche)
http://bit.ly/2ekYAAd is a similar signature that shows in 49.0.2 - one question the Triage team had is whether this is actually a browser crash or just a freeze?
(In reply to Andrew Overholt [:overholt] from comment #4)
> Georg, thanks for the analysis. Can you investigate more? Also, should we
> move this to a better component?

I prioritized it so it will show up in our work planning. Given the current volume on beta, and this being a shutdown hang, it doesn't seem urgent to investigate.
The component is correct.

(In reply to Marcia Knous [:marcia - use ni] from comment #5)
> http://bit.ly/2ekYAAd is a similar signature that shows in 49.0.2 - one
> question the Triage team had is whether this is actually a browser crash or
> just a freeze?

Bug 1246828 ended up renaming profile-before-change2 to profile-before-change-telemetry, so this should be the same signature.
This is not a crash, it is an AsyncShutdownTimeout.
This is the Firefox process waiting longer than 60 seconds to shut down, then killing itself to submit a crash report for diagnosis.
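The watchdog mechanism described here can be sketched in plain JavaScript. This is a minimal illustration of the AsyncShutdown idea only; the function and parameter names are invented, not Firefox's actual implementation:

```javascript
// Sketch of an AsyncShutdown-style watchdog: if the shutdown work
// (a Promise) does not settle within the timeout, we give up and
// report the still-pending blocker state for diagnosis.
function withShutdownWatchdog(shutdownPromise, getState, timeoutMs = 60000) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => {
      // In Firefox this is the point where the process deliberately
      // crashes itself, annotating the crash report with the state.
      reject(new Error("AsyncShutdownTimeout: " + JSON.stringify(getState())));
    }, timeoutMs);
  });
  return Promise.race([shutdownPromise, timeout])
    .finally(() => clearTimeout(timer));
}
```

If the shutdown work finishes in time, the timer is cleared and nothing is reported; only a genuinely stuck shutdown produces the annotated "crash".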
Crash Signature: [@ AsyncShutdownTimeout | profile-before-change-telemetry | TelemetryController: shutting down] → [@ AsyncShutdownTimeout | profile-before-change-telemetry | TelemetryController: shutting down] [@ AsyncShutdownTimeout | profile-before-change2 | TelemetryController: shutting down]
Flags: needinfo?(gfritzsche)
Thanks for the clarifications!
As far as volume, in the latest release (51.0.1) we have 523 crashes in the last 7 days and 231 in 52.0b7.
See Also: → 1348698
Duplicate of this bug: 1348698
Attached image Frequency graph
If you look at bug 1348698 with the "Bugzilla Socorro Lens" add-on installed, you get the attached graph. It has large spikes that correlate perfectly with release dates. Not sure what this means, but maybe it's useful info to someone else.
(In reply to Nicholas Nethercote [:njn] from comment #10)
> If you look at bug 1348698 with the "Bugzilla Socorro Lens" add-on
> installed, you get the attached graph. It has large spikes that correlate
> perfectly with release dates. Not sure what this means, but maybe it's
> useful info to someone else.

This is just the effect of the crash submission mechanism through the infobar being turned on by default on early beta (bug 1303067): the whole backlog of unsubmitted [@ IPCError-browser | ShutDownKill] content process reports from beta users is then getting uploaded at once.
Attached image Pingsender crash.gif
Hi everyone!

I managed to reproduce this crash using Firefox Nightly 56.0a1 and Firefox RC 55.0 on Windows 7 following these steps:

1. Open Firefox.
2. Close Firefox.
3. Leave the pingsender.exe Windows notification open (~1 min).
4. Firefox crashes.

Crash IDs:
78bcdbec-0c32-46f7-aab9-fbbb90170801
b7857985-5a46-4de6-b26d-e50c90170801
0724d5aa-e913-4e04-b3c2-383170170801
88aa3cc6-6348-4110-92ca-7b2960170801

Please Note:
User Account Control is set to default.
Please observe the attached screencast for further details regarding this issue.
Based on comment 12, I think what happens here is when the dialog is shown, ShellExecuteExW here fails <https://searchfox.org/mozilla-central/rev/6326724982c66aaeaf70bb7c7ee170f7a38ca226/xpcom/threads/nsProcessCommon.cpp#500> and as a result the caller in https://searchfox.org/mozilla-central/rev/6326724982c66aaeaf70bb7c7ee170f7a38ca226/toolkit/components/telemetry/TelemetrySend.jsm#1323 throws an exception which somehow messes up the state of the async shutdown code.

Does this sound plausible?
Also (and sorry if this is a dumb question) why does the sendTelemetry async shutdown blocker need to exist in the first place? <https://searchfox.org/mozilla-central/rev/6326724982c66aaeaf70bb7c7ee170f7a38ca226/toolkit/components/asyncshutdown/nsIAsyncShutdown.idl#198>  There is no code that uses it as far as I can see.  It seems like if we deleted it this crash would go away and we would remove future footguns like this.  If someone needed it again it's very easy to add it back.
ni?Alessio since he might know why we have that lever.
Flags: needinfo?(alessio.placitelli)
(In reply to Out sick from comment #14)
> Also (and sorry if this is a dumb question) why does the sendTelemetry async
> shutdown blocker need to exist in the first place?
> <https://searchfox.org/mozilla-central/rev/
> 6326724982c66aaeaf70bb7c7ee170f7a38ca226/toolkit/components/asyncshutdown/
> nsIAsyncShutdown.idl#198>  There is no code that uses it as far as I can
> see.  It seems like if we deleted it this crash would go away and we would
> remove future footguns like this.  If someone needed it again it's very easy
> to add it back..

Whoa, this is an interesting crash! So, :ehsan, the |sendTelemetry| shutdown blocker is used to control the whole Telemetry shutdown process [1]. The problem here seems to be that spawning the pingsender is blocking the shutdown process due to the window requesting user input: we should detach from the pingsender process as soon as it starts.

I have a doubt though: why is that prompt being displayed? All our executables are signed and shouldn't trigger that UAC prompt as far as I know. Unless I'm missing something!

[1] - https://searchfox.org/mozilla-central/rev/05c4c3bc0cfb9b0fc66bdfc8c47cac674e45f151/toolkit/components/telemetry/TelemetryController.jsm#741
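The blocker/barrier idea can be sketched in plain JavaScript (a toy model with invented names, not Mozilla's actual nsIAsyncShutdown API): shutdown proceeds only once every registered blocker's promise has resolved, which is why deleting |sendTelemetry| would remove the ordering guarantee rather than the underlying work.

```javascript
// Minimal barrier sketch: clients add blockers (promise-returning
// functions), and wait() resolves only after every blocker resolves.
// The state getter is what a crash annotation could report if wait()
// never settles.
class Barrier {
  constructor(name) {
    this.name = name;
    this.blockers = [];  // [{name, fn}]
  }
  addBlocker(name, fn) {
    this.blockers.push({ name, fn });
  }
  get state() {
    return this.blockers.map(b => b.name);
  }
  async wait() {
    await Promise.all(this.blockers.map(b => b.fn()));
  }
}
```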
Flags: needinfo?(alessio.placitelli)
(In reply to Alessio Placitelli [:Dexter] from comment #16)
> (In reply to Out sick from comment #14)
> > Also (and sorry if this is a dumb question) why does the sendTelemetry async
> > shutdown blocker need to exist in the first place?
> > <https://searchfox.org/mozilla-central/rev/
> > 6326724982c66aaeaf70bb7c7ee170f7a38ca226/toolkit/components/asyncshutdown/
> > nsIAsyncShutdown.idl#198>  There is no code that uses it as far as I can
> > see.  It seems like if we deleted it this crash would go away and we would
> > remove future footguns like this.  If someone needed it again it's very easy
> > to add it back..
> 
> Woha, this is an interesting crash! So, :ehsan, the |sendTelemetry| shutdown
> blocker is used to control the whole Telemetry shutdown process [1].

Oh wow, I *swear* searchfox wasn't showing me that result on Thursday, so I thought it was unused!  But yeah, that makes perfect sense.

> The
> problem here seems to be that spawning the pingsender is blocking the
> shutdown process due to the window requesting user input: we should detach
> from the pingsender process as soon as it starts.

Note that it's not the pingsender window; it's Windows spawning a dialog before pingsender itself has executed.

> I have a doubt though: why is that prompt being displayed? All our
> executables are signed and shouldn't trigger that UAC prompt as far as I
> know. Unless I'm missing something!

This isn't UAC.  This is a dialog that Windows shows for files downloaded from the Internet AFAIK (perhaps among other cases.)  For example if you download a ZIP package containing Firefox you'll get this dialog the first time you run Firefox.  The "Publisher" line in the dialog in the screencast shows the signature on the file was successfully verified by Windows.  :-)
Setting ni? as a reminder to debug this at the beginning of the next week.
Flags: needinfo?(alessio.placitelli)
(In reply to Emil Ghitta, QA [:emilghitta] from comment #12)
> Created attachment 8892443 [details]
> Pingsender crash.gif
> 
> Hi everyone!
> 
> I managed to reproduce this crash using Firefox Nightly 56.0a1 and Firefox
> RC 55.0 on Windows 7 following this steps:

Hi Emil! Looks like I can't reproduce this on Windows 10 x64. Is this a Windows 7/Vista specific STR/issue?
Flags: needinfo?(alessio.placitelli) → needinfo?(emil.ghitta)
Hi Alessio!

Indeed, I managed to reproduce this issue only on Windows 7.

Should I file a new bug for this?
Flags: needinfo?(emil.ghitta)
Depends on: 1404857
(In reply to Emil Ghitta, QA [:emilghitta] from comment #20)
> Hi Alessio!
> 
> Indeed, I managed to reproduce this issue only on Windows 7.
> 
> Should I file a new bug for this?

Nope, thanks, I filed bug 1404857 :)
Top Crasher:

#11 on Nightly.
#13 on Beta.
#11 on Release.

Top Crashers for Firefox 62.0a1
Top 50 Crashing Signatures. 7 days ago

11 	1.79% 	-0.38% 	AsyncShutdownTimeout | profile-before-change-telemetry | TelemetryController: shutting down	58 	56 	0 	0 	56 	0 	2016-09-01

Top Crashers for Firefox 61.0b
Top 50 Crashing Signatures. 7 days ago
	
13 	0.69% 	-0.1% 	AsyncShutdownTimeout | profile-before-change-telemetry | TelemetryController: shutting down	228 	228 	0 	0 	208 	0 	2016-09-01 		
	
Top Crashers for Firefox 60.0.1
Top 50 Crashing Signatures. 7 days ago 	
	
11 	0.87% 	new 	AsyncShutdownTimeout | profile-before-change-telemetry | TelemetryController: shutting down	25 	25 	0 	0 	25 	0 	2016-09-01
Keywords: topcrash
Adding 62 as affected. On release this is currently the #11 top browser crash.
Marking fix-optional for 64. We could still take a patch for 65, and if it's verified and doesn't seem risky, could still take fixes for 64 as well.
Still a high volume crash on release and beta. Selena, can you help find an owner for this bug? We may find that it's stalled, but before I remove this from weekly regression triage it seems worth another investigation.
Can this be reproduced on Win 10? Tom - can someone have a look at reproducing this issue per comment #12 on win10?
Flags: needinfo?(sdeckelmann) → needinfo?(tgrabowski)

Hi Selena,

The steps mentioned in comment 12 are no longer valid. The issue triggered by following those particular steps was fixed by bug 1400294. Also, the crash was reproducible only on Windows 7 (per comment 20). We didn't manage to find other steps that could lead to a crash with this signature, but we will continue to investigate, and if something shows up I will post an update.

Flags: needinfo?(tgrabowski)

Updating the tracking flags for this one. During the 66 releases so far we have over 6300 crashes. The crash is visible in 67 beta and somewhat consistent in 68 nightly as well (although the number of crashes per build is not as high - some days only 1 crash, some days 17).

Bulk change of P3 carryover bugs to wontfix for 68.

These two crash signatures are #12 and #14 on 68 release. That's very high volume (in the top 5 crashes on release, if you combine the signatures)

Given the high volume over many versions, do you think this is worth investigating again?
If not, we can treat this as something that we don't keep triaging over and over.

Flags: needinfo?(ehsan)
Flags: needinfo?(alessio.placitelli)

(In reply to Liz Henry (:lizzard) from comment #30)

These two crash signatures are #12 and #14 on 68 release. That's very high volume (in the top 5 crashes on release, if you combine the signatures)

Given the high volume over many versions, do you think this is worth investigating again?
If not, we can treat this as something that we don't keep triaging over and over.

Redirecting to :chutten

Flags: needinfo?(alessio.placitelli) → needinfo?(chutten)

I'm afraid I don't really have any particular insights to share here, especially if we no longer trigger the dialog discussed in comment 17 from pingsender.exe...

Flags: needinfo?(ehsan)

I concur that these reports are different from the earlier ones. The shutdown barriers haven't been started for these ones:

(( from https://crash-stats.mozilla.org/report/index/7b0f4890-3b88-40ac-9ccc-d66a90190813#tab-metadata ))

{
  "phase": "profile-before-change-telemetry",
  "conditions": [
    {
      "name": "TelemetryController: shutting down",
      "state": {
        "initialized": true,
        "initStarted": true,
        "haveDelayedInitTask": false,
        "shutdownBarrier": "Not started",
        "connectionsBarrier": "Not started",
        "sendModule": {
          "sendingEnabled": false,
          "pendingPingRequestCount": 0,
          "pendingPingActivityCount": 0,
          "unpersistedPingCount": 0,
          "persistedPingCount": 1,
          "schedulerState": {
            "shutdown": true,
            "hasSendTask": false,
            "sendsFailed": false,
            "sendTaskState": "bail out - no pings to send",
            "backoffDelay": 60000
          }
        },
        "haveDelayedNewProfileTask": false
      },
      "filename": "resource:\/\/gre\/modules\/TelemetryController.jsm",
      "lineNumber": 854,
      "stack": [
        "resource:\/\/gre\/modules\/TelemetryController.jsm:setupTelemetry:854",
        "resource:\/\/gre\/modules\/TelemetryController.jsm:observe:989",
        "resource:\/\/gre\/modules\/TelemetryController.jsm:observe:180",
        "resource:\/\/gre\/modules\/TelemetryStartup.jsm:TelemetryStartup.prototype.observe:31"
      ]
    }
  ]
}

This means we're in _cleanupOnShutdown() (( the barrier calls shutdown(), _initStarted is true and _delayedInitTask is falsey so we're calling here )). We've skipped _delayedNewPingTask, and from the state in the reports it looks as though we've jumped through everything else and gotten past TelemetrySend.shutdown() as well (sendModule.schedulerState.shutdown is true). But we haven't yet started the shutdownBarrier.

This is consistent with the pingsender hypotheses mentioned after comment 12 because the only remaining things are sending the "health" ping by pingsender, sending the "main" ping with reason "shutdown" by pingsender, and some native deinit.

We could be triggering some deadlock in the native deinit, but I would be surprised if that were the problem.

+gsvelto for pingsender knowledge: we're trying our best to run pingsender asynchronously, but is it possible that something could still be causing us to block when starting the pingsender process?

Without insight there, I'm guessing we add more state to Telemetry shutdown to narrow down exactly which module's causing the issue (and maybe to split out the signatures into their own bugs? This one seems past its best before date.)

Flags: needinfo?(chutten) → needinfo?(gsvelto)

We know that pingsender.exe can start very slowly sometimes, one of those instances is bug 1386760. I have no good solution for that yet and only a guess of why it happens: AV software hooking kernel calls and doing (slow) stuff when we try to spawn pingsender.exe. However I've inspected some of the minidumps for this signature and they're different from bug 1386760.

If we've launched pingsender.exe the crashes are happening past the CreateProcess() call and IIRC we shouldn't be waiting on the resulting process. I don't know how else we can get stuck on pingsender.exe; in theory we shouldn't.

Flags: needinfo?(gsvelto)

Anything you can try for 70 here? Sounds like maybe this is something that needs more time to overhaul how the ping sending works.

Flags: needinfo?(gsvelto)

(In reply to Gabriele Svelto [:gsvelto] from comment #34)

We know that pingsender.exe can start very slowly sometimes, one of those instances is bug 1386760. I have no good solution for that yet

Is it possible to start the process from another thread, and do it earlier during shutdown, so that it happens in parallel while the rest of shutdown is happening?

Changing the priority to P1 as the bug is tracked by a release manager for the current beta.
See What Do You Triage for more information

Priority: P3 → P1

(In reply to Florian Quèze [:florian] from comment #37)

Is it possible to start the process from another thread, and do it earlier during shutdown, so that it happens in parallel while the rest of shutdown is happening?

Theoretically yes, but it's a bit tricky. The pingsender executable is launched from JavaScript code using an nsIProcess instance. The relevant code is running from the main thread and I don't know how to make JavaScript's equivalent of a background thread and delegate launching to that. I could wrap launching pingsender within some C++ code that would then spawn it from a separate thread but it would be a bit icky.

Chris, do you have a better idea on how to spawn pingsender from outside of the main thread?

Flags: needinfo?(gsvelto) → needinfo?(chutten)

The "JS calls into C++ which dispatches to a background thread to do potentially-heavy work" dance is something we do in a couple of places in Telemetry.cpp (see for example where we get the list of loaded modules)

Though this doesn't make it any less icky, it does at least show precedent :|

While I'm here I'll note that Telemetry's pattern of pingsender use might be exacerbating this. It has become a common pattern to send pings at shutdown, so we're not just spawning one pingsender at shutdown, we're spawning a handful of them one after another. We could change how we send pings at shutdown (maybe we register shutdown pings with TelemetrySend so we can batch them somehow), but I'm not sure whether that solves the issue under concern here.
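A registry along those lines might look like this (hypothetical API and names; it only builds the argument list for a single pingsender invocation, assuming pingsender grows support for repeated url/path pairs as discussed in this bug):

```javascript
// Sketch of a shutdown-ping registry: modules register pings during
// shutdown, and a single flush builds one command line instead of
// spawning one pingsender process per ping.
class ShutdownPingRegistry {
  constructor() {
    this.pending = [];  // [{url, filePath}]
  }
  register(url, filePath) {
    this.pending.push({ url, filePath });
  }
  // argv for one pingsender invocation covering every registered ping.
  buildArgv() {
    return this.pending.flatMap(p => [p.url, p.filePath]);
  }
}
```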

That's all I have, but I'm not exactly a mozjs guru. There might be an API for "schedule this task that is guaranteed to execute on a bg thread even during profile-before-change phase of shutdown" that I just haven't run across before. Who would I ask about that... :farre, maybe?

Flags: needinfo?(chutten) → needinfo?(afarre)

Thanks Chris. I'm now focusing on the recent OOM ramp up on Windows but I intend to find a fix for this as soon as I'm done with that. Note that :kwright has started rewriting the pingsender in Rust in bug 1339035. Adding additional features to the pingsender - like sending multiple pings with a single invocation - is going to be a lot easier after that's done.

Sounds unlikely we will come up with a fix before the 70 release, but I'll keep this marked fix-optional in case there is something simple enough to uplift as a ridealong for a dot release, since the crash volume is so high.

Just another thing, assuming pingsender would be able to send more than one ping in one invocation, how hard would it be to adapt the telemetry code to use that feature?

Flags: needinfo?(chutten)

It'll require a little work. The more obvious followers of the "send a ping at shutdown" pattern are all told about shutdown in TelemetryController's _cleanupOnShutdown. I'd probably refactor the various pings' shutdown()s to return any pings they'd like to send and then pass the list on to TelemetrySend's shutdown.

We might not get them all, but we'd get the obvious ones with, let's say, a day's effort.

Flags: needinfo?(chutten)

I'm not aware of such an API; my experience mostly revolves around the idle scheduled tasks, which all run on the main thread. I'm guessing Olli or Nathan would be the right people to ask.

Flags: needinfo?(afarre) → needinfo?(bugs)
See Also: → 1381002
Depends on: 1590114

Has bug 1590114 helped here at all?

I don't know the answer to
"There might be an API for "schedule this task that is guaranteed to execute on a bg thread even during profile-before-change phase of shutdown" that I just haven't run across before."

And I don't know what could have changed around ~68.
But I'll keep this in mind while looking into scheduling changes.

Flags: needinfo?(bugs)

(In reply to Olli Pettay [:smaug] from comment #46)

Has bug 1590114 helped here at all?

You know, I'm not sure. Are these frequent enough that we'd notice a dropoff around Oct 24 on Nightly? :gsvelto?

Flags: needinfo?(gsvelto)

The nightly data is very noisy; eyeballing it, I don't see a dropoff.

Flags: needinfo?(gsvelto)

Really bad crash that seems to have fallen off our radar since 50! Need some help diagnosing the cause and potential fixes.

Component: General → Telemetry
Product: Core → Toolkit

We have shipped our last beta for 71; marking fix-optional for 71 as this is a significant crasher on release. If a safe patch materializes in 72 we may want to uplift it in a potential dot release as a ridealong.

Chris, can you take a look at this high volume crash or find someone to investigate?

Flags: needinfo?(chutten)

gsvelto, what do you think the likelihood of a Rust rewrite of pingsender (bug 1339035) and augmenting it with "send multiple pings at once" is for Q1?

If we're at less than "somewhat likely", I think maybe we should pursue the "add a method to nsITelemetry to spawn pingsender on bg threads" alternative. Thoughts?

Flags: needinfo?(chutten) → needinfo?(gsvelto)

The Rust rewrite is a long way off because of the lack of adequate crates to make HTTPS requests. Adding support for sending multiple pings (all specified on the command-line in one go) could be done this week.

Flags: needinfo?(gsvelto)

Okidoki, let's get started on that angle of the solution then: collecting at-shutdown pings into some sort of registry and passing them all to pingsender in one swell foop. I'm thinking we need a bug for the registry and a bug for the pingsender feature, and then we can use this one for the final piece when both are done. Sound good?

Assignee: nobody → chutten
Status: NEW → ASSIGNED
Depends on: 1607586

(In reply to Chris H-C :chutten from comment #54)

Okidoki, let's get started on that angle of the solution then: collecting at-shutdown pings into some sort of registry and passing them all to pingsender in one swell foop. I'm thinking we need a bug for the registry and a bug for the pingsender feature, and then we can use this one for the final piece when both are done. Sound good?

+1

Depends on: 1607908

It's been two weeks since we landed the pingsender fixes and we're still seeing Nightly crashes with similar signatures at about the same rate. I'd like to see what it does on beta after the merge; with the larger population there we might be able to see a smaller effect.

But it might end up being that we need to look further. Maybe saving all these pending pings to disk at such a contentious moment is part of the problem?

The Nightly crash volumes going to 0 might be related to bug 1613692 (but that landed in the first 2020-02-13 nightly, not the second, so maybe not?)

Still plenty of crashes on beta. Gabriele, any ideas what we might have missed? Doug, can you point me at tools for looking at shutdown slowness? (assuming that shutdown slowness is what we're seeing, and not some intermittent lock)

Flags: needinfo?(gsvelto)
Flags: needinfo?(dothayer)

I've opened a bunch of minidumps from the beta builds and I couldn't find the pingsender.exe string anymore on the stack, so at least we're not stuck launching it like we did in the past... maybe we're stuck waiting on it to send all the pings? We still have bug 1610757 to implement to speed up the process.

Flags: needinfo?(gsvelto)

Isn't it an async launch, though? We should only be waiting for the process to start, not for it to stop: we don't pass an observer or block shutdown on its exit code.

The only tool I can recommend is the Gecko profiler with MOZ_PROFILER_SHUTDOWN set (pointing to an absolute filepath), running on a spinny disk. Am I reading the history correctly that at this point we don't know anything beyond that we're crashing waiting for Telemetry's AsyncShutdown blocker to resolve? I suspect that there's no cheeky performance cliff that we're falling off of here, and it's just that Telemetry is doing a lot during this window and we need to generally improve that (unless I'm misunderstanding something?) If the situation is worse on Beta, that would make sense, since Beta just has a higher representation of spinny disks.

Flags: needinfo?(dothayer)

I don't have access to spinny disk hardware, so my investigation lines are limited. Let's take a look at some reports' async shutdown state (field "Async shutdown timeout" in "metadata")

There are two types: those where shutdownBarrier has a state of "Not started", and those where nothing in the async shutdown state suggests why we're still in it (all state information says it's shut down, so I'm not sure what it's doing; maybe it's in TelemetryStorage.shutdown()).

The overwhelming majority of the crashes are of the first type (shutdownBarrier with a state of "Not started"). This is no longer consistent with a pingsender hypothesis, as pingsender isn't even spawned until TelemetrySend.flushPingSenderBatch, which is after we await shutdownBarrier here.

I guess we need to add more fine-grained instrumentation to shutdown and see where we're losing the plot. We're still getting the occasional crash on Nightly so we'll be able to iterate. Is there any reason to not include an async shutdown state parameter that is straight up telling us which step of _cleanupOnShutdown we're in? Plus the number of millis since the top of the function?
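The proposed instrumentation could look roughly like this (hypothetical helper and names, not the actual TelemetryController code): record which step of _cleanupOnShutdown we're in, plus elapsed milliseconds, and expose that through the blocker's state callback.

```javascript
// Sketch of finer-grained shutdown state: each step is run through a
// wrapper that stamps the step name and the millis since the top of
// the function, so a timeout annotation pinpoints the stuck step.
function makeShutdownProgress(now = Date.now) {
  const start = now();
  const state = { step: "begin", elapsedMs: 0 };
  return {
    async run(name, task) {
      state.step = name;
      state.elapsedMs = now() - start;
      return task();
    },
    // What we'd expose through the blocker's state-fetching callback.
    get state() { return { ...state }; },
  };
}
```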

Depends on: 1631605

All eight reports on Nightly since the finer-grained logging landed are complaining about TelemetryStorage.shutdown. Some are getting to it very quickly (16ms from the top of _cleanupOnShutdown); some are taking longer (over 2s).

This is my first time looking at TelemetryStorage.shutdown. Looks like it's waiting for the orderly completion of certain tasks:

  1. Awaiting the removal of the "aborted-session"-reason "main" ping. This is important because otherwise we'll send it at next startup, making us think the previous session terminated due to a crash or something. (( though I'm not sure about the analysis value of this sort of ping these days. We have the "crash" ping now. ))
  2. If we're in the middle of cleaning the archive of pings older than 60 days and trimming it down to at most 120MB of disk space, we wait for that to complete. This doesn't strictly need to complete right now, as we'll kick off another task during delayedInit at next startup if we need it.
  3. Enforcing the pending-ping quota (removing pending pings if there's more than 15MB of them). Like #2, we don't need to complete this now; we will kick it off again during TelemetrySend.setup.
  4. Awaiting us removing all pending pings. This only happens when you opt out of Telemetry, so I'm not sure it's likely to be the problem here. It is however important to have this run to completion as we will not restart it.
  5. Awaiting the saving of pending pings. This is crucial for data integrity and is likely to have some work to do given all the pings we're sending at shutdown via pingsender. (have to save the pings to disk after all).

If I had to place bets, I'd bet on #5 being the biggest slowdown of the bunch. We could drop #2 and #3 to see if it has any effect, but if I'm right there's just nothing we can do without asking data owners not to send pings at shutdown (because sending at shutdown requires dumping them on disk).
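The idea of skipping the restartable tasks (#2 and #3) during shutdown can be sketched as follows (invented names; the real tasks live in TelemetryStorage.jsm, and whether a task reschedules itself at next startup is the property that makes it safe to skip):

```javascript
// Sketch: during shutdown, only await tasks that will NOT be redone at
// next startup; skip restartable cleanup work instead of blocking on it.
async function storageShutdown(tasks, { shuttingDown }) {
  const results = [];
  for (const task of tasks) {
    if (shuttingDown && task.restartsNextStartup) {
      results.push({ name: task.name, skipped: true });
      continue;  // safe to redo later, so don't block shutdown on it
    }
    await task.run();
    results.push({ name: task.name, skipped: false });
  }
  return results;
}
```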

All these new reports are from sessions under 3min long (uptime less than 180), though, so I'd like to wait for a few more reports from longer sessions so that I don't over-rotate on this small sample. Maybe I'll look into checking "crash" pings (which should have the annotation) because we'll have more reports there and the minidump's not of much use for this investigation (at the moment at least).

Some data's had a chance to accumulate, and here it is: https://sql.telemetry.mozilla.org/queries/71846/source

We have a lot of await TelemetrySession.shutdown()s followed by not quite as many await TelemetryStorage.shutdown()s. Since I only looked at the latter back in Comment 63, let's take a look at the former...

...Oh look it's saving pings at shutdown, too.

There appears to be a pattern that saving and sending pings at shutdown on some clients just takes too darn long. Perhaps because saving to disk is taking too long...but perhaps because assembling the pings is taking too long. The Environment (present in many of these pings) isn't considered complete enough to be used in pings until it has quite a bit of information in it ranging from HW information to extended OS information to a full list of the Addons you have installed. This can take a while (again, exacerbated by slow disks).

So perhaps we have a combination of slow ping assembly and slow ping persistence causing these signatures. ...but if that's the case then the only thing I can see us doing is changing the contract of what data is guaranteed to be sent at shutdown. We're already missing some from these thousands of MAU (they do send "crash" pings, though), so maybe this won't be a problem?

I'll ponder this. ni?myself so I don't forget to come back to this.

Flags: needinfo?(chutten)

More data, this time examining the number of milliseconds it takes to get from the top of _cleanupOnShutdown to just before TelemetrySession.shutdown() (plot) or TelemetryStorage.shutdown (plot) (only counting cases where the async shutdown timed out there)

The plurality (250+ of the 1.7k cases since the end of April) of the hangs that timed out in TelemetrySession.shutdown got there immediately (0ms), and it's an exponential falloff from there, suggesting that either we're already running out of time by the time we get there and TelemetrySession.shutdown is just a heavy operation at the end of shutdown that gets the blame, or TelemetrySession.shutdown is taking tens of seconds.

Looking at the fewer cases of TelemetryStorage.shutdown, around 100 of the 966 are in the 10+ seconds range. The rest are getting there reasonably quickly, suggesting (to this casual observer) that TelemetrySession.shutdown might take seconds, but not likely tens of them.

The proper solution for all of this is to not perform costly ping assembly and persistence operations at shutdown and to take a more incremental approach. This will ultimately be accomplished through Project FOG bringing the Glean SDK to Firefox Desktop. For now, though, I think we can just monitor these to ensure they don't suddenly get worse. This happened a total of 3350 times across 2426 clients on Beta and Nightly, so I'm afraid I'm going to have to deprioritize working on this in favour of working on FOG.

Severity: critical → S4
Flags: needinfo?(chutten)
Priority: P1 → P3
Assignee: chutten → nobody
Status: ASSIGNED → NEW
You need to log in before you can comment on or make changes to this bug.