Closed Bug 1690096 Opened 5 years ago Closed 2 years ago

Remove Terminator Telemetry

Categories

(Toolkit :: Telemetry, task, P3)

task

Tracking

()

RESOLVED DUPLICATE of bug 1879136

People

(Reporter: chutten, Unassigned)

References

(Blocks 1 open bug)

Details

Attachments

(1 file)

Terminator Telemetry is another one of those custom pieces of instrumentation that, bereft of expiry, lingers on to this day without users or an owner.

We can be rid of it now.

This bug's about removing TerminatorTelemetry.jsm, its registration in components.conf and moz.build, its subordinate SHUTDOWN_PHASE_* Histograms, and the code in nsTerminator that writes to its file (UpdateTelemetry and friends).

This bug is not about rendering any judgement on the rest of Terminator.

Jens, I hate to do this to you, given that you just spent time on the patch to update these, but were you getting any value out of this terminator telemetry? Given that it's at 1-second intervals, and does not really give us much insight regarding what code is actually making a given phase take longer, it seems like we could tear it out and all be fine. If you're okay with that I don't mind throwing a patch up to rip it out.

Flags: needinfo?(jstutte)

I think the phase duration information is useful but it appears that the current mechanism has never been reliable and has perhaps gotten less reliable and would need to be converted to using a more clever shutdown mechanism like a memory mapped file. Maybe glean provides this?

Jens was noticing in his testing that he was only getting data points for quit-application in about:telemetry. I did some quick hacky instrumentation and it appears that the race prone nature of the writer thread (there's no ping/pong to block shutdown) means that whether we have data and/or how much data we have is going to be inconsistent and tend to always leave off the last meaningful data point (which provides the opportunity to actually write the data to disk and rename it). Bug 1505660 likely improved things by providing more opportunities for UpdateTelemetry to be called (via profile-before-change-qm which can have some duration, versus xpcom-will-shutdown and xpcom-shutdown which were always likely to be fast and then bypassed entirely by fast shutdown, making it impossible to trigger a write of the preceding step).

The client counts I think help demonstrate this:

  • "profile-change-teardown"
    • per glam had like 20k pings a day before falling off a cliff in August, which perhaps was due to fast shutdown being enabled?
    • per measurements dashboard has 8k recently?
  • "profile-before-change"

Here's an example of a very boring shutdown of "mach run" on a powerful machine with basically nothing in the profile where we reliably only ever see quit-application. (The 2nd write starts but doesn't complete the rename as confirmed by a subsequent run.)

LOADING /home/visbrero/rev_control/hg/mozilla-unified3/obj-firefox-opt/tmp/profile-default/ShutdownDuration.json
LOADED: {"quit-application": 0}
Processing key "quit-application"
  Yes such histogram, logging datapoint: 0
Telemetry histogram complete.
TELEMETRY: Updating state to {"quit-application": 0}
TELEMETRY WRITER: HAVE DATA: {"quit-application": 0}
TELEMETRY WRITER: wrote data
TELEMETRY WRITER: renamed!
TELEMETRY: Updating state to {"quit-application": 0, "profile-change-net-teardown": 0}
TELEMETRY WRITER: HAVE DATA: {"quit-application": 0, "profile-change-net-teardown": 0}
TELEMETRY WRITER: wrote data
TELEMETRY: Updating state to {"quit-application": 0, "profile-change-net-teardown": 0, "profile-change-teardown": 0}
TELEMETRY: Updating state to {"quit-application": 0, "profile-change-net-teardown": 0, "profile-change-teardown": 0, "profile-before-change": 0}
TELEMETRY: Updating state to {"quit-application": 0, "profile-change-net-teardown": 0, "profile-change-teardown": 0, "profile-before-change": 0, "profile-before-change-qm": 0}
See Also: → 1689953

I mean, yeah, the whole infrastructure around writing the ShutdownDuration.json file just feels strange, and I think if we want to keep this information or something similar we should rewrite it with a clear goal in mind. The existing implementation feels like it probably works okay for cases where the nsTerminator crashes us, and maybe that's what it's biased towards? Since the write before the crashing phase is likely to fully resolve before we crash, and we can infer the duration of the crashing phase based on the crash timeout. But I think that makes this probe pretty misleading for other cases. I'd much rather see an implementation that gets fine-grained phase timing information on both graceful exits as well as nsTerminator crashes, but doesn't so much care about checking in the data along the way, as that just protects against non-hang crashes during shutdown, which I don't expect to be much related to phase timings.

As far as I'm concerned, we already write to a file during shutdown to ensure we can record the overall shutdown duration. Why don't we just extend that file to include timings for each phase encountered along the way? If we want to still have that data in cases where nsTerminator crashes us, we can make nsTerminator write it write before crashing (we'd need to sync this obviously to avoid stepping on the main thread's toes).

But if we're going to keep it we should change it to milliseconds, rather than "ticks" which loosely approximate seconds. Additionally, instead of having a set of histograms for this, could we make it a keyed scalar? Seems more appropriate and simpler to track together.

(In reply to Andrew Sutherland [:asuth] (he/him) from comment #2)

I think the phase duration information is useful but it appears that the current mechanism has never been reliable and has perhaps gotten less reliable and would need to be converted to using a more clever shutdown mechanism like a memory mapped file. Maybe glean provides this?

Glean's current guarantees on Firefox Desktop are from app start until ShutdownPhase::ShutdownFinal (we use RunOnShutdown to call our shutdown routines). (So long as everything else we rely on is as happy. Without many metrics in Firefox Desktop just yet, we haven't stressed the edges). Anything outside of that will need something special inside Firefox to bridge the gap until Glean's available again.

(In reply to Doug Thayer [:dthayer] (he/him) from comment #3)

But if we're going to keep it we should change it to milliseconds, rather than "ticks" which loosely approximate seconds. Additionally, instead of having a set of histograms for this, could we make it a keyed scalar? Seems more appropriate and simpler to track together.

I'd definitely like to keep (and repair) the telemetry and I proposed ticks of 100ms in the patch of bug 1689953.

Flags: needinfo?(jstutte)
See Also: → 1879136
Status: NEW → RESOLVED
Closed: 2 years ago
Duplicate of bug: 1879136
Resolution: --- → DUPLICATE
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: