Closed Bug 1552621 Opened 5 years ago Closed 5 years ago

Track bookmark merge events in telemetry and shutdown hang reports

Categories

(Firefox :: Sync, task)

task
Not set
normal

Tracking

()

RESOLVED FIXED
Firefox 69
Tracking Status
firefox69 --- fixed

People

(Reporter: lina, Assigned: lina)

References

Details

Attachments

(4 files, 1 obsolete file)

We're still seeing shutdown hangs after bug 1551062 landed, and I'm curious at what point we're getting stuck. Those crashes don't have any context, but AsyncShutdown lets us annotate a shutdown blocker with extra state, which will get serialized and included in the crash report if shutdown takes too long.

Sqlite.jsm callers get this for free if they're stuck executing a statement, but, since we're running statements ourselves on the async thread in Rust, we need to be a bit more creative.

Relatedly, aborting a merge currently discards all stats that we might have recorded so far, like the size of the tree and the time taken to fetch it. I think this would be useful to have for telemetry, even if the merge fails or gets interrupted.

What if we killed two birds with one stone?

  • Instead of having Dogear's Store::merge return Stats, we can add a Driver::record_telemetry_event callback. The merger would call this callback with timings and counts after each operation (fetching the local and remote trees, merging, and applying).
  • Our bookmark_sync::Driver would dispatch a runnable to the main thread for each call, which would call a callback implemented in JS.
  • On the JS side, we'd record the event in event telemetry, and store progress state internally indicating how far we got. (So, if we see a hang after fetch_local_tree, we know we're blocked on fetch_local_contents. If we hang at fetch_local_contents, we know we're stuck on fetch_remote_tree, since that's the next step).
  • When the merge completes (with a call to either handleResult or handleError), reset the progress state.
  • Pass a fetchState function as the third argument to AsyncShutdown.profileBeforeChange.addBlocker, which returns the current progress, as well as anything else we might want to record (size of the merged tree? Time taken for the previous phases?)
Flags: needinfo?(markh)

This commit introduces a mozISyncedBookmarksMirrorProgressListener
interface for capturing telemetry and updating shutdown blocker state
after each step of the merge, instead of waiting until the end. This
also means we can also record events for interrupted and failed merges,
and pass validation data through to the Sync ping.

Shutdown hang crash reports now have a step field indicating the last
finished step. For example, if we see a hang on fetchLocalTree, we
know the merger is blocked on fetchLocalContents. If we hang on
fetchLocalContents, the merger is stuck at fetchRemoteTree, since
that's the next step after fetchLocalContents.

This commit also implements Driver::record_telemetry_event to
dispatch progress callback runnables to the main thread.

Assignee: nobody → lina
Status: NEW → ASSIGNED
Type: defect → task
Flags: needinfo?(markh)
Summary: Record merge progress in the bookmark mirror's shutdown blocker → Track bookmark merge events in telemetry and shutdown hang reports
Attachment #9066304 - Attachment description: Bug 1552621 - Record bookmark validation telemetry and shutdown blocker progress in the bookmarks mirror. r?markh!,tcsc! → Bug 1552621 - Record shutdown blocker progress in the bookmarks mirror. r?markh!,tcsc!

Before this commit, we used event telemetry to record timing and counts
for the different steps of a bookmark merge (fetching the local tree,
new local contents, remote tree, and new remote contents; merging;
applying; fetching outgoing records; and notifying observers).

This has several limitations. We need to store all numbers as strings,
include a "flow ID" to tag events from the same merge, and collect
failure reasons twice. We also can't correlate these events to the
existing engine telemetry, meaning we can't see other stats for that
engine, or for the entire sync. Finally, we need to run separate
queries on these events for analysis, instead of extending our
existing engine dashboards.

This approach also feels like an abuse of event telemetry, so this
commit adds a "steps" field for each engine in the Sync ping. Each step
has a name, time taken, and additional named counts, like the number of
items in the tree, or merged structure stats.

Currently, only the buffered-bookmarks engine records these steps.

Depends on D33410

Attached file engine-steps-data-review.md (obsolete) —
Attachment #9069190 - Flags: data-review?(chutten)
Attachment #9069190 - Attachment is obsolete: true
Attachment #9069190 - Flags: data-review?(chutten)
Attachment #9069191 - Flags: data-review?(chutten)

This turned into a whole thing, but I'm a lot happier with how we track progress. I'd love a second (and third!) set of eyes to make sure this doesn't do anything silly! 😄

  • Shutdown hang reports will be annotated with something like { "steps": [{ "step": "fetchLocalTree", "at": 1559423421582 }, { "step": "fetchNewLocalContents", "at": 1559423421999 }] }.
  • We'll record timings and counts for fetchLocalTree, fetchNewLocalContents, etc. in a new steps field in the engine telemetry payload, instead of event telemetry.
  • We'll also record bookmark validation telemetry in the engine payload, like we do now.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=4a97992f2943da2e0b2c8603a4b953b7e0b21aa5

Comment on attachment 9069191 [details]
engine-steps-data-review.md

Load-balancing redirect to :tdsmith
Attachment #9069191 - Flags: data-review?(chutten) → data-review?(tdsmith)
Comment on attachment 9069191 [details]
engine-steps-data-review.md

Thank you for the complete and thoughtful explanations. My only concerns are about documentation. https://firefox-source-docs.mozilla.org/toolkit/components/telemetry/telemetry/data/sync-ping.html is our public-facing documentation about the sync ping. While that page references `services/sync/tests/unit/sync_ping_schema.json`, which I see is updated in D33411, I don't think the schema document alone gives a good sense of what the collection is or what kind of values it can contain. That documentation would be helpful both for community members evaluating Firefox's collections and for me as a data scientist. Would you mind extending the narrative documentation to include this field?

I'm not sure how shutdown hang reports are collected or reported, but altering or expanding that collection would also be subject to data review; if that's still in scope for this work, is there documentation to point to for those?
Flags: needinfo?(lina)

Oops, I thought for sure I updated sync-ping.rst, too, but I think hg ate my changes during a histedit—I definitely meant to do that. I'm not sure what the process is for AsyncShutdown.jsm crash report annotations...looking at other places we've added them, it doesn't look like they go through data review, but I can add a comment explaining the format.

Done! Added docs in https://phabricator.services.mozilla.com/D33411, and a fixed enum with the step names in https://phabricator.services.mozilla.com/D31950 (search for ProgressTracker.STEPS).

Flags: needinfo?(lina)
Comment on attachment 9069191 [details]
engine-steps-data-review.md

Awesome, thanks. Thanks also for explaining the crash annotations to me on IRC; I see now that those use an existing facility that has already been reviewed.

--

DATA COLLECTION REVIEW

1) Is there or will there be **documentation** that describes the schema for the ultimate data set in a public, complete, and accurate way?

Yes, at https://firefox-source-docs.mozilla.org/toolkit/components/telemetry/telemetry/data/sync-ping.html.

2) Is there a control mechanism that allows the user to turn the data collection on and off?

Yes, the [Mozilla telemetry opt-out](https://support.mozilla.org/en-US/kb/share-data-mozilla-help-improve-firefox) or disabling sync.

3) If the request is for permanent data collection, is there someone who will monitor the data over time?

Yes, Lina Cambridge.

4) Using the **[category system of data types](https://wiki.mozilla.org/Firefox/Data_Collection)** on the Mozilla wiki, what collection type of data do the requested measurements fall under?

Category 2, interaction data.

5) Is the data collection request for default-on or default-off?

Default-on.

6) Does the instrumentation include the addition of **any *new* identifiers** (whether anonymous or otherwise; e.g., username, random IDs, etc.?

No.

7) Is the data collection covered by the existing Firefox privacy notice?

Yes.

8) Does there need to be a check-in in the future to determine whether to renew the data?

No, this is a permanent collection.

9) Does the data collection use a third-party collection tool?

No.

Result: data-review+
Attachment #9069191 - Flags: data-review?(tdsmith) → data-review+
Pushed by kcambridge@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/7a67c62d4b8b
Record shutdown blocker progress in the bookmarks mirror. r=markh,tcsc
https://hg.mozilla.org/integration/autoland/rev/22fa4447a9a2
Record bookmark validation telemetry for the buffered engine. r=tcsc
https://hg.mozilla.org/integration/autoland/rev/246b4a16ab96
Include optional steps in Sync engine telemetry. r=tcsc,tdsmith
Regressions: 1558388
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: