Closed Bug 1536133 Opened 6 years ago Closed 6 years ago

Figure out the right solution for sync telemetry on Fenix

Categories

(Data Science :: General, task, P1)

task
Points:
3

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: loines, Assigned: loines)

Details

Requested by Alex Davis

Copied with slight modifications from an email:

:markh is looking into Sync telemetry for Fenix. We want to make sure that what we're shipping is working as expected. Please help him identify the best path forward.

Things that come to mind:

  • Do we still need our own ping?
  • How does this fit with the new telemetry component (Glean)?
  • Is there anything we should consider with the upcoming ecosystem telemetry?
  • What questions will we want to answer and what data will we capture to answer them?

Thanks for your help.

Leif additionally notes:

Alessio led an initial meeting about this. Three possible ways forward were identified:

Alternative 1, probably the best option in the long run but more work now: port the ping entirely to Glean. The work needed is non-trivial, and the AS team already has their plate full. The ping's JSON payload is currently generated within the sync Rust framework and would need to be handed over to Glean somehow (and possibly restructured as well to fit its format).

Alternative 2, the least amount of short-term work: make the sync Rust component POST the ping directly to the AWS pipeline, where it would pass through the current sync Scala code and end up in the existing sync datasets. This would bypass Glean completely for the time being.

Alternative 3, a hybrid of 1 and 2: critical pieces of the sync ping, such as basic counts of records uploaded, applied, etc., are handed over to Glean, while the full sync payload is POSTed to the "legacy" sync pipeline.

Next step is to decide which one to go with.
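For concreteness, a minimal sketch of what Alternative 3 might look like on the Rust side. All type and field names here are hypothetical stand-ins, not the actual sync component's API; the real sync ping payload is far richer and nested.

```rust
// Hypothetical per-engine record, loosely modeled on the sync ping.
#[derive(Debug)]
struct EngineSyncRecord {
    engine: String,
    uploaded: u32,
    applied: u32,
    failed: u32,
}

// The "critical pieces" that would be handed to Glean as flat counters,
// while the full payload keeps flowing to the legacy pipeline.
#[derive(Debug, PartialEq)]
struct CriticalCounts {
    total_uploaded: u32,
    total_applied: u32,
    total_failed: u32,
}

fn extract_critical_counts(records: &[EngineSyncRecord]) -> CriticalCounts {
    CriticalCounts {
        total_uploaded: records.iter().map(|r| r.uploaded).sum(),
        total_applied: records.iter().map(|r| r.applied).sum(),
        total_failed: records.iter().map(|r| r.failed).sum(),
    }
}
```

The full record list would still be serialized and POSTed to the legacy pipeline; only the extracted counts would be recorded through Glean.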

(In reply to Leif Oines [:loines] from comment #0)

Alternative 3, a hybrid of 1 and 2: critical pieces of the sync ping, such as basic counts of records uploaded, applied, etc., are handed over to Glean, while the full sync payload is POSTed to the "legacy" sync pipeline.

Next step is to decide which one to go with.

Maybe this could also be a parallel path, independent of the other options?

To add some more detail here... for Fenix specifically, we have been tracking the "north star metrics" that are critical to the product.
Those include "Increased sync activations (Desktop + Fenix)" and "Accounts / Sign-ups".

That is achievable by recording some metrics directly into Glean from Fenix or other libraries and something we've been tracking as a requirement.

(In reply to Georg Fritzsche [:gfritzsche] from comment #1)

(In reply to Leif Oines [:loines] from comment #0)

Alternative 3, a hybrid of 1 and 2: critical pieces of the sync ping, such as basic counts of records uploaded, applied, etc., are handed over to Glean, while the full sync payload is POSTed to the "legacy" sync pipeline.

Next step is to decide which one to go with.

Maybe this could also be a parallel path, independent of the other options?

Yes, I think that's right. I'm guessing Mark and Thom would prefer just doing #2 atm. I'll have a think about what "minimal" measures might be both easy and usable to pass to glean.

To add some more detail here... for Fenix specifically, we have been tracking the "north star metrics" that are critical to the product.
Those include "Increased sync activations (Desktop + Fenix)" and "Accounts / Sign-ups".

That is achievable by recording some metrics directly into Glean from Fenix or other libraries and something we've been tracking as a requirement.

FWIW and for total transparency, we should be able to track the number of users activating sync from Fenix via the FxA server-side metrics, exactly as we do now for fennec and iOS. But it would probably also be good to track them client-side through glean as well.

(In reply to Leif Oines [:loines] from comment #2)

Yes, I think that's right. I'm guessing Mark and Thom would prefer just doing #2 atm. I'll have a think about what "minimal" measures might be both easy and usable to pass to glean.

I'd prefer to avoid that - as the telemetry team knows, doing this reliably is very tricky. As a very short-term measure, just trying to POST and throwing the ping away if there's any problem might be acceptable, but that is a long way from ideal and runs the risk of skewing the results towards, e.g., users with great connectivity. Working out the policy decisions (e.g., is telemetry enabled? Should we only submit when on wifi?) also sounds tricky.
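The "POST and throw away on failure" approach might look roughly like the sketch below; `Transport`, `SubmissionPolicy`, and the URL are all hypothetical stand-ins rather than the actual component's API. The comment on the return value captures the skew concern: failed submissions are silently lost.

```rust
// Stand-in for whatever HTTP client the sync component would use.
trait Transport {
    fn post(&self, url: &str, body: &str) -> Result<(), String>;
}

// The policy questions raised above, as explicit inputs.
#[derive(Default)]
struct SubmissionPolicy {
    telemetry_enabled: bool,
    wifi_only: bool,
    on_wifi: bool,
}

impl SubmissionPolicy {
    fn allows_upload(&self) -> bool {
        self.telemetry_enabled && (!self.wifi_only || self.on_wifi)
    }
}

/// Best-effort submission: returns true if the ping was handed off,
/// false if it was dropped. Pings from users with poor connectivity
/// are silently lost, which is exactly the skew risk described above.
fn submit_best_effort(
    policy: &SubmissionPolicy,
    transport: &dyn Transport,
    ping_json: &str,
) -> bool {
    if !policy.allows_upload() {
        return false;
    }
    transport.post("https://incoming.example/sync", ping_json).is_ok()
}
```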

Alternative 1, probably best option in the long run but more work now: port the ping entirely to glean.

From the meeting, that sounds tricky too, as the sync ping is quite heavily nested and, IIUC, Glean doesn't support that. So recording a very small subset of the data via Glean sounds like it's also a short-term option.

To be frank, the "Alternative 4" we'd prefer is this: a couple of years ago, the telemetry team helped us define and formalize a schema for the sync ping, which all our products support and which is actively used for analysis. We'd prefer an option where we can hand over a payload in this format and have the telemetry utilities manage getting it into the back-end. From the meeting I understand this isn't trivial, but it does seem the best long-term option.

Alessio tells me he should have a document ready on this by the end of the week.

Status: NEW → ASSIGNED

(In reply to Mark Hammond [:markh] from comment #3)

To be frank, the "Alternative 4" we'd prefer is this: a couple of years ago, the telemetry team helped us define and formalize a schema for the sync ping, which all our products support and which is actively used for analysis. We'd prefer an option where we can hand over a payload in this format and have the telemetry utilities manage getting it into the back-end. From the meeting I understand this isn't trivial, but it does seem the best long-term option.

When the sync ping started, the support of different metric types in Telemetry was limited and a custom ping was required.
Over the last years we have been moving towards more structured and flat data formats, both on the client side and in the pipeline that processes the data.

For Glean the goal is to enable other teams to send their own metrics with minimal effort, just by recording metrics (or probes) into the library and the library takes care of pings and other details.
This is possible because Glean comes with a pipeline that is built on a specific set of metric types and does not support custom JSON pings.
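As an illustration of why a nested custom JSON ping doesn't fit such a pipeline, here is a sketch (not the Glean API, and the dotted naming scheme is an assumption) of flattening a nested per-engine payload into individually named flat metrics:

```rust
use std::collections::BTreeMap;

// A nested per-engine payload, loosely modeled on the sync ping:
// engine name -> counter name -> value.
// A pipeline built on flat metric types can't ingest this shape directly;
// each leaf has to become its own named, typed metric.
fn flatten(engines: &BTreeMap<String, BTreeMap<String, u32>>) -> BTreeMap<String, u32> {
    let mut flat = BTreeMap::new();
    for (engine, counters) in engines {
        for (name, value) in counters {
            // e.g. the "bookmarks" engine's "uploaded" counter becomes
            // a single flat metric named "sync.bookmarks.uploaded".
            flat.insert(format!("sync.{}.{}", engine, name), *value);
        }
    }
    flat
}
```

Each flat entry then maps onto one predefined metric, which is what makes a fixed set of metric types workable on the pipeline side.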

There are long-term and short-term considerations here:
Long-term, we recommend using Glean, not custom pings - we're happy to support you in using this.
For the short term there are some options (alternatives 2 and 3?) and we can talk through what is most viable for your plans.

Alessio tells me he should have a document ready on this by the end of the week.

I think this bug already outlines the different options now, so a separate document doesn't seem needed. Is there any detail missing here?

Alternative 1, probably best option in the long run but more work now: port the ping entirely to glean.

From the meeting, that sounds tricky too, as the sync ping is quite heavily nested and, IIUC, Glean doesn't support that. So recording a very small subset of the data via Glean sounds like it's also a short-term option.

This looks like a good option (high-value but low-effort) to us.

(In reply to Georg Fritzsche [:gfritzsche] from comment #4)

(In reply to Mark Hammond [:markh] from comment #3)
When the sync ping started, the support of different metric types in Telemetry was limited and a custom ping was required.
Over the last years we have been moving towards more structured and flat data formats, both on the client side and in the pipeline that processes the data.

For Glean the goal is to enable other teams to send their own metrics with minimal effort, just by recording metrics (or probes) into the library and the library takes care of pings and other details.
This is possible because Glean comes with a pipeline that is built on a specific set of metric types and does not support custom JSON pings.

It's great that you are looking at how to make telemetry simpler moving forward, but we do need to keep in mind that the sync payload is recorded across all products and platforms which support sync, and that analysis of this data is done across all of them. We currently have the ability to reconstruct a series of events for a single account regardless of platform, so that we can detect, for example, how one erroneously performing device impacts sync on different products, or how reliable cross-device actions (such as "send tab") are.

There are long-term and short-term considerations here:
Long-term, we recommend using Glean, not custom pings - we're happy to support you in using this.

When do you expect Glean to be working on desktop and iOS? I'm not that bothered about what technology is used to record or submit our telemetry data, just that we are able to have a consistent view of the entire ecosystem. Given that having this data on Fennec (and iOS) has paid large dividends in terms of concrete improvements to reliability, we'd obviously like to avoid an outcome where the analysis opportunities on Fenix are significantly reduced compared to Fennec, iOS or Desktop.

Alessio tells me he should have a document ready on this by the end of the week.

I think this bug already outlines the different options now, so a separate document doesn't seem needed. Is there any detail missing here?

I think there's still a lot of detail missing. I don't understand exactly what we'd record in Glean, and how Leif could turn this data back into a format where analysis can be done between the "sync pings" delivered on iOS, Fennec and Desktop.

Alternative 1, probably best option in the long run but more work now: port the ping entirely to glean.

From the meeting, that sounds tricky too, as the sync ping is quite heavily nested and, IIUC, Glean doesn't support that. So recording a very small subset of the data via Glean sounds like it's also a short-term option.

This looks like a good option (high-value but low-effort) to us.

I'm not sure "a very small subset" is actually high-value (but agree it's probably low-effort). I was thinking we could record a simple "did it work" flag, which is better than nothing, but not much better in the "it didn't work" case.

To get "high value", it would be great if you could give us the markup required for Glean to support the data we capture now, and also show how Leif would integrate that data into analysis done on the existing sync ping. As things stand, I don't think we understand (a) how to capture data at a similar granularity to all other products, or (b) exactly how we would integrate this with our existing analysis, particularly cross-device analysis.

Janet is putting together a project plan about this here.

Looks like we've arrived at our chosen solution: :lina will start work on sending one sync ping for each engine every time it syncs, using Glean.

More background is in the doc.

Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED