Track bookmark merge events in telemetry and shutdown hang reports
Categories
(Firefox :: Sync, task)
Tracking
()
Tracking | Status | |
---|---|---|
firefox69 | --- | fixed |
People
(Reporter: lina, Assigned: lina)
References
Details
Attachments
(4 files, 1 obsolete file)
We're still seeing shutdown hangs after bug 1551062 landed, and I'm curious at what point we're getting stuck. Those crashes don't have any context, but AsyncShutdown
lets us annotate a shutdown blocker with extra state, which will get serialized and included in the crash report if shutdown takes too long.
Sqlite.jsm
callers get this for free if they're stuck executing a statement, but, since we're running statements ourselves on the async thread in Rust, we need to be a bit more creative.
Relatedly, aborting a merge currently discards all stats that we might have recorded so far, like the size of the tree and the time taken to fetch it. I think this would be useful to have for telemetry, even if the merge fails or gets interrupted.
What if we killed two birds with one stone?
- Instead of having Dogear's
Store::merge
returnStats
, we can add aDriver::record_telemetry_event
callback. The merger would call this callback with timings and counts after each operation (fetching the local and remote trees, merging, and applying). - Our
bookmark_sync::Driver
would dispatch a runnable to the main thread for each call, which would call a callback implemented in JS. - On the JS side, we'd record the event in event telemetry, and store progress state internally indicating how far we got. (So, if we see a hang after
fetch_local_tree
, we know we're blocked onfetch_local_contents
. If we hang atfetch_local_contents
, we know we're stuck onfetch_remote_tree
, since that's the next step). - When the merge completes (with a call to either
handleResult
orhandleError
), reset the progress state. - Pass a
fetchState
function as the third argument toAsyncShutdown.profileBeforeChange.addBlocker
, which returns the current progress, as well as anything else we might want to record (size of the merged tree? Time taken for the previous phases?)
Updated•5 years ago
|
Assignee | ||
Comment 1•5 years ago
|
||
This commit introduces a mozISyncedBookmarksMirrorProgressListener
interface for capturing telemetry and updating shutdown blocker state
after each step of the merge, instead of waiting until the end. This
also means we can also record events for interrupted and failed merges,
and pass validation data through to the Sync ping.
Shutdown hang crash reports now have a step
field indicating the last
finished step. For example, if we see a hang on fetchLocalTree
, we
know the merger is blocked on fetchLocalContents
. If we hang on
fetchLocalContents
, the merger is stuck at fetchRemoteTree
, since
that's the next step after fetchLocalContents
.
This commit also implements Driver::record_telemetry_event
to
dispatch progress callback runnables to the main thread.
Assignee | ||
Updated•5 years ago
|
Updated•5 years ago
|
Assignee | ||
Comment 2•5 years ago
|
||
Depends on D31950
Assignee | ||
Comment 3•5 years ago
|
||
Before this commit, we used event telemetry to record timing and counts
for the different steps of a bookmark merge (fetching the local tree,
new local contents, remote tree, and new remote contents; merging;
applying; fetching outgoing records; and notifying observers).
This has several limitations. We need to store all numbers as strings,
include a "flow ID" to tag events from the same merge, and collect
failure reasons twice. We also can't correlate these events to the
existing engine telemetry, meaning we can't see other stats for that
engine, or for the entire sync. Finally, we need to run separate
queries on these events for analysis, instead of extending our
existing engine dashboards.
This approach also feels like an abuse of event telemetry, so this
commit adds a "steps" field for each engine in the Sync ping. Each step
has a name, time taken, and additional named counts, like the number of
items in the tree, or merged structure stats.
Currently, only the buffered-bookmarks
engine records these steps.
Depends on D33410
Assignee | ||
Comment 4•5 years ago
|
||
Assignee | ||
Comment 5•5 years ago
|
||
Assignee | ||
Comment 6•5 years ago
|
||
This turned into a whole thing, but I'm a lot happier with how we track progress. I'd love a second (and third!) set of eyes to make sure this doesn't do anything silly! 😄
- Shutdown hang reports will be annotated with something like
{ "steps": [{ "step": "fetchLocalTree", "at": 1559423421582 }, { "step": "fetchNewLocalContents", "at": 1559423421999 }] }
. - We'll record timings and counts for
fetchLocalTree
,fetchNewLocalContents
, etc. in a newsteps
field in the engine telemetry payload, instead of event telemetry. - We'll also record bookmark validation telemetry in the engine payload, like we do now.
https://treeherder.mozilla.org/#/jobs?repo=try&revision=4a97992f2943da2e0b2c8603a4b953b7e0b21aa5
Comment 7•5 years ago
|
||
Comment on attachment 9069191 [details]
engine-steps-data-review.md
Load-balancing redirect to :tdsmith
Comment 8•5 years ago
|
||
Comment on attachment 9069191 [details] engine-steps-data-review.md Thank you for the complete and thoughtful explanations. My only concerns are about documentation. https://firefox-source-docs.mozilla.org/toolkit/components/telemetry/telemetry/data/sync-ping.html is our public-facing documentation about the sync ping. While that page references `services/sync/tests/unit/sync_ping_schema.json`, which I see is updated in D33411, I don't think the schema document alone gives a good sense of what the collection is or what kind of values it can contain. That documentation would be helpful both for community members evaluating Firefox's collections and for me as a data scientist. Would you mind extending the narrative documentation to include this field? I'm not sure how shutdown hang reports are collected or reported, but altering or expanding that collection would also be subject to data review; if that's still in scope for this work, is there documentation to point to for those?
Assignee | ||
Comment 9•5 years ago
|
||
Oops, I thought for sure I updated sync-ping.rst
, too, but I think hg ate my changes during a histedit
—I definitely meant to do that. I'm not sure what the process is for AsyncShutdown.jsm
crash report annotations...looking at other places we've added them, it doesn't look like they go through data review, but I can add a comment explaining the format.
Assignee | ||
Comment 10•5 years ago
|
||
Done! Added docs in https://phabricator.services.mozilla.com/D33411, and a fixed enum with the step names in https://phabricator.services.mozilla.com/D31950 (search for ProgressTracker.STEPS
).
Comment 11•5 years ago
|
||
Comment on attachment 9069191 [details] engine-steps-data-review.md Awesome, thanks. Thanks also for explaining the crash annotations to me on IRC; I see now that those use an existing facility that has already been reviewed. -- DATA COLLECTION REVIEW 1) Is there or will there be **documentation** that describes the schema for the ultimate data set in a public, complete, and accurate way? Yes, at https://firefox-source-docs.mozilla.org/toolkit/components/telemetry/telemetry/data/sync-ping.html. 2) Is there a control mechanism that allows the user to turn the data collection on and off? Yes, the [Mozilla telemetry opt-out](https://support.mozilla.org/en-US/kb/share-data-mozilla-help-improve-firefox) or disabling sync. 3) If the request is for permanent data collection, is there someone who will monitor the data over time? Yes, Lina Cambridge. 4) Using the **[category system of data types](https://wiki.mozilla.org/Firefox/Data_Collection)** on the Mozilla wiki, what collection type of data do the requested measurements fall under? Category 2, interaction data. 5) Is the data collection request for default-on or default-off? Default-on. 6) Does the instrumentation include the addition of **any *new* identifiers** (whether anonymous or otherwise; e.g., username, random IDs, etc.? No. 7) Is the data collection covered by the existing Firefox privacy notice? Yes. 8) Does there need to be a check-in in the future to determine whether to renew the data? No, this is a permanent collection. 9) Does the data collection use a third-party collection tool? No. Result: data-review+
Comment 12•5 years ago
|
||
Pushed by kcambridge@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/7a67c62d4b8b Record shutdown blocker progress in the bookmarks mirror. r=markh,tcsc https://hg.mozilla.org/integration/autoland/rev/22fa4447a9a2 Record bookmark validation telemetry for the buffered engine. r=tcsc https://hg.mozilla.org/integration/autoland/rev/246b4a16ab96 Include optional steps in Sync engine telemetry. r=tcsc,tdsmith
Comment 13•5 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/7a67c62d4b8b
https://hg.mozilla.org/mozilla-central/rev/22fa4447a9a2
https://hg.mozilla.org/mozilla-central/rev/246b4a16ab96
Description
•