Bug 1690096 Comment 2 Edit History

Note: The actual edited comment in the bug view page will always show the original commenter’s name and original timestamp.

I think the phase duration information is useful but it appears that the current mechanism has never been reliable and has perhaps gotten less reliable and would need to be converted to using a more clever shutdown mechanism like a memory mapped file.  Maybe glean provides this?

Jens was noticing in his testing that he was only getting data points for quit-application in about:telemetry.  I did some quick hacky instrumentation and it appears that the race prone nature of the writer thread (there's no ping/pong to block shutdown) means that whether we have data and/or how much data we have is going to be inconsistent and tend to always leave off the last meaningful data point (which provides the opportunity to actually write the data to disk and rename it).  Bug 1505660 likely improved things by providing more opportunities for UpdateTelemetry to be called.

The client counts I think help demonstrate this:
- "profile-change-teardown"
  - [per glam](https://glam.telemetry.mozilla.org/firefox/probe/shutdown_phase_duration_ticks_profile_change_teardown/explore?process=parent&ref=20210201094443&timeHorizon=ALL) had like 20k pings a day before falling off a cliff in August, which perhaps was due to the early writes?
  - [per measurements dashboard](https://telemetry.mozilla.org/new-pipeline/dist.html#!cumulative=0&end_date=2021-02-02&include_spill=0&keys=__none__!__none__!__none__&max_channel_version=nightly%252F87&measure=SHUTDOWN_PHASE_DURATION_TICKS_PROFILE_CHANGE_TEARDOWN&min_channel_version=nightly%252F55&processType=*&product=Firefox&sanitize=1&sort_by_value=0&sort_keys=submissions&start_date=2021-01-25&table=0&trim=1&use_submission_date=0) has 8k recently?
- "profile-before-change"
  - [per glam](https://glam.telemetry.mozilla.org/firefox/probe/shutdown_phase_duration_ticks_profile_before_change/explore?process=parent&ref=20210129160110&timeHorizon=ALL) which was the only other phase that could be longish had like 100 client pings a day?
  - [per measurements dashboard](https://telemetry.mozilla.org/new-pipeline/dist.html#!cumulative=0&end_date=2021-02-01&include_spill=0&keys=__none__!__none__!__none__&max_channel_version=nightly%252F87&measure=SHUTDOWN_PHASE_DURATION_TICKS_PROFILE_BEFORE_CHANGE&min_channel_version=nightly%252F55&processType=*&product=Firefox&sanitize=1&sort_by_value=0&sort_keys=submissions&start_date=2021-01-25&table=0&trim=1&use_submission_date=0) has 1.82k recently?

Here's an example of a very boring shutdown of "mach run" on a powerful machine with basically nothing in the profile where we reliably only ever see quit-application.  (The 2nd write starts but doesn't complete the rename as confirmed by a subsequent run.)
```
LOADING /home/visbrero/rev_control/hg/mozilla-unified3/obj-firefox-opt/tmp/profile-default/ShutdownDuration.json
LOADED: {"quit-application": 0}
Processing key "quit-application"
  Yes such histogram, logging datapoint: 0
Telemetry histogram complete.
TELEMETRY: Updating state to {"quit-application": 0}
TELEMETRY WRITER: HAVE DATA: {"quit-application": 0}
TELEMETRY WRITER: wrote data
TELEMETRY WRITER: renamed!
TELEMETRY: Updating state to {"quit-application": 0, "profile-change-net-teardown": 0}
TELEMETRY WRITER: HAVE DATA: {"quit-application": 0, "profile-change-net-teardown": 0}
TELEMETRY WRITER: wrote data
TELEMETRY: Updating state to {"quit-application": 0, "profile-change-net-teardown": 0, "profile-change-teardown": 0}
TELEMETRY: Updating state to {"quit-application": 0, "profile-change-net-teardown": 0, "profile-change-teardown": 0, "profile-before-change": 0}
TELEMETRY: Updating state to {"quit-application": 0, "profile-change-net-teardown": 0, "profile-change-teardown": 0, "profile-before-change": 0, "profile-before-change-qm": 0}
```
I think the phase duration information is useful but it appears that the current mechanism has never been reliable and has perhaps gotten less reliable and would need to be converted to using a more clever shutdown mechanism like a memory mapped file.  Maybe glean provides this?

Jens was noticing in his testing that he was only getting data points for quit-application in about:telemetry.  I did some quick hacky instrumentation and it appears that the race prone nature of the writer thread (there's no ping/pong to block shutdown) means that whether we have data and/or how much data we have is going to be inconsistent and tend to always leave off the last meaningful data point (which provides the opportunity to actually write the data to disk and rename it).  Bug 1505660 likely improved things by providing more opportunities for UpdateTelemetry to be called (via profile-before-change-qm which can have some duration, versus xpcom-will-shutdown and xpcom-shutdown which were always likely to be fast and then bypassed entirely by fast shutdown, making it impossible to trigger a write of the preceding step).

The client counts I think help demonstrate this:
- "profile-change-teardown"
  - [per glam](https://glam.telemetry.mozilla.org/firefox/probe/shutdown_phase_duration_ticks_profile_change_teardown/explore?process=parent&ref=20210201094443&timeHorizon=ALL) had like 20k pings a day before falling off a cliff in August, which perhaps was due to fast shutdown being enabled?
  - [per measurements dashboard](https://telemetry.mozilla.org/new-pipeline/dist.html#!cumulative=0&end_date=2021-02-02&include_spill=0&keys=__none__!__none__!__none__&max_channel_version=nightly%252F87&measure=SHUTDOWN_PHASE_DURATION_TICKS_PROFILE_CHANGE_TEARDOWN&min_channel_version=nightly%252F55&processType=*&product=Firefox&sanitize=1&sort_by_value=0&sort_keys=submissions&start_date=2021-01-25&table=0&trim=1&use_submission_date=0) has 8k recently?
- "profile-before-change"
  - [per glam](https://glam.telemetry.mozilla.org/firefox/probe/shutdown_phase_duration_ticks_profile_before_change/explore?process=parent&ref=20210129160110&timeHorizon=ALL) which was the only other phase that could be longish had like 100 client pings a day?
  - [per measurements dashboard](https://telemetry.mozilla.org/new-pipeline/dist.html#!cumulative=0&end_date=2021-02-01&include_spill=0&keys=__none__!__none__!__none__&max_channel_version=nightly%252F87&measure=SHUTDOWN_PHASE_DURATION_TICKS_PROFILE_BEFORE_CHANGE&min_channel_version=nightly%252F55&processType=*&product=Firefox&sanitize=1&sort_by_value=0&sort_keys=submissions&start_date=2021-01-25&table=0&trim=1&use_submission_date=0) has 1.82k recently?

Here's an example of a very boring shutdown of "mach run" on a powerful machine with basically nothing in the profile where we reliably only ever see quit-application.  (The 2nd write starts but doesn't complete the rename as confirmed by a subsequent run.)
```
LOADING /home/visbrero/rev_control/hg/mozilla-unified3/obj-firefox-opt/tmp/profile-default/ShutdownDuration.json
LOADED: {"quit-application": 0}
Processing key "quit-application"
  Yes such histogram, logging datapoint: 0
Telemetry histogram complete.
TELEMETRY: Updating state to {"quit-application": 0}
TELEMETRY WRITER: HAVE DATA: {"quit-application": 0}
TELEMETRY WRITER: wrote data
TELEMETRY WRITER: renamed!
TELEMETRY: Updating state to {"quit-application": 0, "profile-change-net-teardown": 0}
TELEMETRY WRITER: HAVE DATA: {"quit-application": 0, "profile-change-net-teardown": 0}
TELEMETRY WRITER: wrote data
TELEMETRY: Updating state to {"quit-application": 0, "profile-change-net-teardown": 0, "profile-change-teardown": 0}
TELEMETRY: Updating state to {"quit-application": 0, "profile-change-net-teardown": 0, "profile-change-teardown": 0, "profile-before-change": 0}
TELEMETRY: Updating state to {"quit-application": 0, "profile-change-net-teardown": 0, "profile-change-teardown": 0, "profile-before-change": 0, "profile-before-change-qm": 0}
```

Back to Bug 1690096 Comment 2