Closed Bug 1267919 Opened 3 years ago Closed 3 years ago

Create new sync telemetry ping type.

Categories

(Firefox :: Sync, defect, P1)

Tracking

RESOLVED FIXED
Firefox 50
Tracking Status
firefox50 --- fixed

People

(Reporter: markh, Assigned: tcsc)

References

(Depends on 1 open bug, Blocks 1 open bug)

Details

(Whiteboard: [data-integrity] [sync-metrics])

Attachments

(4 files)

Attached file sync-telemetry.json
We want to submit custom telemetry data for Sync health. This bug is for the client side work of this effort and encompasses defining the first cut of a json schema for the data, then collecting this data and submitting it as a new ping type. I'll open further bugs (so the pipeline ops team knows about the new ping type, and for redshift processing etc) once we have general agreement on the direction.

I believe the back-end processing of this ping data will expect to validate the data against the defined schema - so I think we also need to perform this validation as part of client-side tests to have any hope of getting it right - bug 1249925 tracks that in general so I'm having that bug block this.

I'm attaching a proposed JSON Schema for this work. At a high-level, each ping will include an array of "syncs", where each element is the info for a single sync - so no data is rolled-up between Syncs (ie, there's no concept like "count of successful syncs" - just an array of syncs, each one of which may have success or failure information). Each "sync" record has its own "engines" array, which lists information about every engine that synced. The expectation is that the reduction of these records into summary information will be done at the back-end.

In later bugs we will look to extend this schema into collecting further "health" related stats, but this bug is just to get the ball rolling with a relatively minimal set of data for our initial dashboards.

An example of this data is below. It shows a ping containing a single sync that started at the timestamp 12345467, took 560ms, was done on behalf of the Sync user with UID "abcdefg", has no failure (ie, it succeeded), and only Synced one engine, "bookmarks". The bookmarks engine took 500ms to sync, saw 100 incoming records, but only applied 50 of them, and 50 of them failed with 5 of them being new failures in this sync. It also attempted to upload 100 records, but the server reported 3 of them failed:

{
  syncs: [{
    when: 12345467,
    took: 560,
    uid: "abcdefg",
    failureReason: null,
    engines: [{
      name: "bookmarks",
      took: 500,
      incoming: {
        considered: 100,
        applied: 50,
        failed: 50,
        failedNew: 5,
      },
      outgoing: [{
        sent: 100,
        failed: 3,
      }],
    }]
  }],
}

Note that the attached schema is more JS than JSON - it has comments, doesn't include quotes around all property names, has trailing commas, etc. From my experiments some validators allow this while others do not. Allowing comments seems particularly useful - but ultimately this will probably depend on exactly what validator we choose to use in the front-end (which I assume itself depends on what is used on the back-end?). This schema is valid according to http://www.jsonschemavalidator.net/
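To make the central constraint concrete - each sync record should carry either engine results or a failure reason - here is a hand-rolled check against the example ping above. This is purely an illustrative sketch, not part of the proposal, and independent of whichever schema validator ends up being used:

```javascript
// Illustrative only: checks the "engines or failureReason" constraint from
// the proposed schema by hand, without any JSON Schema validator.
function syncRecordLooksValid(sync) {
  const hasBase = ["when", "took", "uid"].every(k => k in sync);
  const hasEngines = Array.isArray(sync.engines) && sync.engines.length > 0;
  const hasFailure = sync.failureReason != null;
  return hasBase && (hasEngines || hasFailure);
}

// The example ping from this comment, written as strict-JSON-compatible JS.
const examplePing = {
  syncs: [{
    when: 12345467,
    took: 560,
    uid: "abcdefg",
    failureReason: null,
    engines: [{
      name: "bookmarks",
      took: 500,
      incoming: { considered: 100, applied: 50, failed: 50, failedNew: 5 },
      outgoing: [{ sent: 100, failed: 3 }],
    }],
  }],
};

console.log(examplePing.syncs.every(syncRecordLooksValid)); // true
```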

Georg, does this all look like it is the correct direction to take here? Does the schema look reasonable and looks like something we can start working from? (the schema is a WIP and requires further fleshing out, but it should convey the idea)
Attachment #8745856 - Flags: feedback?(gfritzsche)
Flags: firefox-backlog+
Comment on attachment 8745856 [details]
sync-telemetry.json

We should use strict JSON for the schemas.
The schemas are currently living in [0].
You will probably want to put a ping format version number into the payload, so you can tell different versions apart.
I'd recommend to involve bsmedberg early for quick design input, to avoid bigger re-designs later after the data collection review [1].

Some questions to dive into it a bit:
* This seems like a lot of info to send - will this be opt-out or opt-in?
* At what interval is this sent? Once per day/week/...?
* Do you need to know the detailed info for each individual sync? Why can't you aggregate on the client side?

I think unless there is a compelling reason to not aggregate, we really want to submit the minimal data possible, both for privacy reasons as well as storage and processing cost.
  
0: https://github.com/mozilla-services/mozilla-pipeline-schemas/
1: https://wiki.mozilla.org/Firefox/Data_Collection
Attachment #8745856 - Flags: feedback?(gfritzsche)
Thanks for the feedback Georg.

(In reply to Georg Fritzsche [:gfritzsche] from comment #1)
> We should use strict JSON for the schemas.

My intent was to end up with a strict schema, but I'm not sure I'm using the term in the same way you are :) eg, looking at the schemas in https://github.com/mozilla-services/mozilla-pipeline-schemas/, they seem roughly the same as my WIP - less strict if anything (eg, core.schema.json doesn't specify additionalProperties=false).

What specifically makes a schema "strict" in this context?

> You will probably want to put a ping format version number into the payload,
> so you can tell different versions apart.

Yeah, thanks.

> I'd recommend to involve bsmedberg early for quick design input, to avoid
> bigger re-designs later after the data collection review [1].

Done, thanks.

> Some questions to dive into it a bit:
> * This seems like a lot of info to send - will this be opt-out or opt-in?

We intend for this to be opt-out, using the existing opt-out mechanisms, but obviously only Sync users would send this ping. We are looking to include the sync "userid" (a guid) in the payload so we can perform long-term user-based analysis of client data, and also to correlate it with the information Danny Coates is already collecting on the server and funneling into Presto.

> * At what interval is this sent? Once per day/week/...?

That's a question I hadn't considered yet, and ultimately it may depend on the data format we end up with - my gut tells me that the more aggregated the data is, the more often we will want to send it, simply because sending more frequently leaves each payload less heavily aggregated - more below.

> * Do you need to know the detailed info for each individual sync? Why can't
> you aggregate on the client side?
> 
> I think unless there is a compelling reason to not aggregate, we really want
> to submit the minimal data possible, both for privacy reasons as well as
> storage and processing cost.

I initially assumed we'd want aggregated data, but recently concluded that individual records with aggregation done by the pipeline made more sense. This was mainly due to discussions with :rvitillo and Ilana Segal about the UITelemetry payload - Robert said he was looking to move away from aggregated data about UI clicks to a format like |{"Toolbar button X": [ts1, ts2, ts3]}|, and Ilana mentioned she preferred this format as it allows deeper analysis, such as "did the user click button 1 followed by button 2", which is impossible when the payload just records the count of clicks for each button.

Similarly for Sync, it would allow analysis of things like "were we more likely to see sync errors if the previous sync was interrupted by app shutdown?" or "are we more likely to see validation errors after applying incoming records?", which doesn't seem possible with client-side aggregation.

Obviously storage and processing costs are critical, but I don't know how to make the correct tradeoffs here.

Robert/Benjamin, are you able to offer additional guidance here?

Thanks!
Flags: needinfo?(rvitillo)
Flags: needinfo?(benjamin)
(In reply to Mark Hammond [:markh] from comment #2)
> Thanks for the feedback Georg.
> 
> (In reply to Georg Fritzsche [:gfritzsche] from comment #1)
> > We should use strict JSON for the schemas.
> 
> My intent was to end up with a strict schema,

Oh - re-reading, I see now you mean "the schema should be strict JSON, not a javascript-y JSON", so please ignore that question.

(pity though - comments in schemas seem valuable ;)
(In reply to Mark Hammond [:markh] from comment #2)
> I initially assumed we'd want aggregated data, but recently concluded that
> individual records with aggregation done by the pipeline made more sense.
> This was mainly due to discussions with :rvitillo and Ilana Segal about the
> UITelemetry payload - Robert said he was looking to move away from
> aggregated data about UI clicks to a format like |{"Toolbar button X": [ts1,
> ts2, ts3]}| and Ilana mentioned she preferred this format as it allows
> deeper analysis, such as "did the user click button 1 followed by button 2",
> which is impossible when the payload just records the count of clicks for
> each button.

For UITelemetry it makes sense as it grants the analyst the ability to perform exploratory analyses that wouldn't be possible with pre-aggregated data. I am not sure that's the main use-case for Sync's data, though I might be wrong.

> Similarly for Sync, it would allow analysis of things like "were we more
> likely to see sync errors if the previous sync was interrupted by app
> shutdown?" or "are we more likely to see validation errors after applying
> incoming records?", which doesn't seem possible with client-side aggregation.

What are you planning to do with those pings? What are the main questions that you want to answer with it? Who is going to be the main user of that data? Engineers, PMs, analysts, ...?

You would need to [over]estimate the total size of the raw uncompressed data for a given time interval given say:
- The number of users of Sync
- An estimate of the distribution of the number of pings sent by the users in a given time interval
- An estimate of the distribution of the size of a ping

If the estimated amount of data generated in a day/week/month isn't too high then we could choose to go with the schema that gives us the greatest amount of flexibility.
Flags: needinfo?(rvitillo)
Blocks: 1263835
From a data perspective, the questions I need to understand are:

* what is the privacy risk?
** My main concern here is about error reasons. It's easy for errors to contain identifying data by accident, and I don't quite understand the JSON schema here and what data is actually being sent about errors.
** do the sync engine names represent privacy risk? The basic engines like bookmarks and history probably don't: but if we have other custom engines (especially if they can be provided by addons), that might represent subsets of sync data that give away browsing history patterns that might be sensitive. Can we hardcode the list of engines?

* is this the minimal set of data required to answer the questions you're asking?
** I can't know this without understanding your questions.

I assume that these payloads will not contain the telemetry client ID. Will they contain the environment block?

On a purely technical note, if you're going to have individual records, we should consider sending each as a separate ping. It will reduce client complexity, and especially if you don't include the environment block it won't increase the total size by a significant amount.
Flags: needinfo?(benjamin)
Assignee: nobody → markh
Blocks: 1250012
Flags: firefox-backlog+
Priority: -- → P1
(In reply to Roberto Agostino Vitillo (:rvitillo) from comment #4)
> (In reply to Mark Hammond [:markh] from comment #2)
> > I initially assumed we'd want aggregated data, but recently concluded that
> > individual records with aggregation done by the pipeline made more sense.
> > This was mainly due to discussions with :rvitillo and Ilana Segal about the
> > UITelemetry payload - Robert said he was looking to move away from
> > aggregated data about UI clicks to a format like |{"Toolbar button X": [ts1,
> > ts2, ts3]}| and Ilana mentioned she preferred this format as it allows
> > deeper analysis, such as "did the user click button 1 followed by button 2",
> > which is impossible when the payload just records the count of clicks for
> > each button.
> 
> For UITelemetry it makes sense as it grants the analyst the ability to
> perform exploratory analyses that wouldn't be possible with the
> pre-aggregated data. I am not sure that's the main use-case for Sync's data
> though, but I might be wrong.

I was under the impression that UITelemetry's main use-case isn't to perform exploratory analyses, merely that the ability to do so in a post-hoc manner is considered useful? I'd say Sync sits similarly - it's not the primary use-case, but it would be useful to be able to do so.

> What are you planning to do with those pings? What are the main questions
> that you want to answer with it? Who is going to be the main user of that
> data? Engineers, PMs, analysts, ...?

Initially it would be engineers, but it would probably feed into PMs. I've tried to document the initial use for these metrics at https://gist.github.com/mhammond/e51a494cb04dc0acd44acf2c1589e7c5.

> You would need to [over]estimate the total size of the raw uncompressed data
> for a given time interval given say:
> - The number of users of Sync
> - An estimate of the distribution of the number of pings sent by the users
> in a given time interval
> - An estimate of the distribution of the size of a ping
> 
> If the estimated amount of data generated in a day/week/month isn't too high
> then we could choose to go with the schema that gives us the greatest amount
> of flexibility.

Using some info provided by Danny, it seems that for all devices (ie, not limited to desktop) we see ~112M syncs per day. The most-written collection is "tabs" where we see ~55M writes per day. So let's say 1/2 of those 112M syncs are no-ops (ie, we find there's nothing incoming and we have no outgoing items).

If I tweak the schema to record a no-op Sync minimally, each Sync would be about 110 bytes of uncompressed json - quite a bit less if we reported the Sync userid once per ping rather than in each sync record. I'm guessing we'd expect that to double for a Sync that actually reads and writes a lot of stuff - so say 60M per day at 110 bytes and 60M per day at 220 bytes. That doesn't include any additional telemetry overhead, just the size of the sync data.
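For what it's worth, the back-of-envelope arithmetic implied by those figures (raw uncompressed payload bytes only, ignoring any telemetry overhead):

```javascript
// Rough daily-volume sketch using the figures above: ~60M no-op syncs at
// ~110 bytes of JSON each, ~60M "real" syncs at ~220 bytes each.
const noopBytes = 60e6 * 110;   // 6.6 GB
const fullBytes = 60e6 * 220;   // 13.2 GB
const totalBytes = noopBytes + fullBytes;
console.log(`~${(totalBytes / 1e9).toFixed(1)} GB/day uncompressed`);
// prints "~19.8 GB/day uncompressed"
```

So roughly 20GB/day of raw JSON before any compression or transformation by the pipeline.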

I'm not sure how to translate that into actual storage though - JSON is obviously very verbose and wouldn't have duplicated field names once transformed into a different store. OTOH, JSON will compress well - so I'm not sure how to use those stats to come up with the figure you are asking for. Can you help here?

As mentioned, I went for non-aggregated data mainly to offer future flexibility in analysis, so if this looks like being a deal-breaker, I'd happily accept client aggregated records instead (although as Benjamin notes, determining exactly when to send the ping would seem to add a fair bit of client complexity, so I'd still prefer non-aggregated data).

(In reply to Benjamin Smedberg  [:bsmedberg] from comment #5)
> From a data perspective, the questions I need to understand are:
> 
> * what is the privacy risk?
> ** My main concern here is about error reasons. It's easy for errors to
> contain identifying data by accident, and I don't quite understand the JSON
> schema here and what data is actually being sent about errors.

It would be minimal - eg, no stack traces etc. In some cases it would be just a string to indicate the error category (eg, "app shutdown") but in some other cases it might include the http status code (if a request failed), or possibly the NS_ERROR_* value (if we failed to even make the request)
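For illustration only - the names and fields below are assumptions, not a settled format - the three failure variants described above (error category, HTTP status, NS_ERROR_* value) might be modelled in the schema along these lines:

```json
{
  "oneOf": [
    { "properties": { "name": { "enum": ["shutdownerror", "othererror"] } } },
    { "properties": { "name": { "enum": ["httperror"] },
                      "code": { "type": "integer" } } },
    { "properties": { "name": { "enum": ["nserror"] },
                      "code": { "type": "integer" } } }
  ]
}
```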

> ** do the sync engine names represent privacy risk? The basic engines like
> bookmarks and history probably don't: but if we have other custom engines
> (especially if they can be provided by addons), that might represent subsets
> of sync data that give away browsing history patterns that might be
> sensitive. Can we hardcode the list of engines?

We can either exclude all 3rd-party engines or put them into "other" - I guess we might as well exclude them as putting them all into the same bucket would prevent the results being actionable (ie, without knowing what addons are represented, we will not know who to ping about a specific addon's failure rates)

> * is this the minimal set of data required to answer the questions you're
> asking?
> ** I can't know this without understanding your questions.

Those error records aren't strictly necessary for the artifacts I list in https://gist.github.com/mhammond/e51a494cb04dc0acd44acf2c1589e7c5, but I'm predicting they will be valuable once we get some dashboards in place. I'd happily remove them if that is a deal-breaker. Similarly for, eg, how long each sync took - not in the initial artifacts but future analysis on them would be interesting.

> I assume that these payloads will not contain the telemetry client ID. Will
> they contain the environment block?

I expect we will want the "build" record initially, but probably don't need the rest of the default environment block.

> On a purely technical note, if you're going to have individual records, we
> should consider sending each as a separate ping. It will reduce client
> complexity, and especially if you don't include the environment block it
> won't increase the total size by a significant amount.

SGTM.

Adding needinfo back to Roberto and Benjamin to make sure this answers their questions and to help me work out the next steps in moving this forward.

Thanks all!
Flags: needinfo?(rvitillo)
Flags: needinfo?(benjamin)
> > What are you planning to do with those pings? What are the main questions
> > that you want to answer with it? Who is going to be the main user of that
> > data? Engineers, PMs, analysts, ...?
> 
> Initially it would be engineers, but it would probably feed into PMs. I've
> tried to document the initial use for these metrics at
> https://gist.github.com/mhammond/e51a494cb04dc0acd44acf2c1589e7c5

This is excellent. I like how you've included both the question and the visualization.

> > ** My main concern here is about error reasons. It's easy for errors to
> > contain identifying data by accident, and I don't quite understand the JSON
> > schema here and what data is actually being sent about errors.
> 
> It would be minimal - eg, no stack traces etc. In some cases it would be
> just a string to indicate the error category (eg, "app shutdown") but in
> some other cases it might include the http status code (if a request
> failed), or possibly the NS_ERROR_* value (if we failed to even make the
> request)

Can you describe exactly where the error string will come from? I'm mainly worried about catching JS exceptions and including those unfiltered in telemetry reports. If it's possible to enumerate all the possible error types being caught, then I have no problems with this.

> guess we might as well exclude them as putting them all into the same bucket
> would prevent the results being actionable (ie, without knowing what addons
> are represented, we will not know who to ping about a specific addon's
> failure rates)

Yeah, let's exclude these for now.

> > I assume that these payloads will not contain the telemetry client ID. Will
> > they contain the environment block?
> 
> I expect we will want the "build" record initially, but probably don't need
> the rest of the default environment block.

We always submit the "application" block from https://gecko.readthedocs.io/en/latest/toolkit/components/telemetry/telemetry/common-ping.html. Will that be sufficient?

I have no problems with what you've described, so I'm happy to count that as a preliminary data review and I'll do the final signoff when this is done and you've added in-tree docs.
Flags: needinfo?(benjamin)
(In reply to Mark Hammond [:markh] from comment #6)
> > What are you planning to do with those pings? What are the main questions
> > that you want to answer with it? Who is going to be the main user of that
> > data? Engineers, PMs, analysts, ...?
> 
> Initially it would be engineers, but it would probably feed into PMs. I've
> tried to document the initial use for these metrics at
> https://gist.github.com/mhammond/e51a494cb04dc0acd44acf2c1589e7c5.

That's very clear, thanks.

> As mentioned, I went for non-aggregated data mainly to offer future
> flexibility in analysis, so if this looks like being a deal-breaker, I'd
> happily accept client aggregated records instead (although as Benjamin
> notes, determining exactly when to send the ping would seem to add a fair
> bit of client complexity, so I'd still prefer non-aggregated data)

I am OK using non-aggregated data given the estimated volume mentioned in Comment 6.
Flags: needinfo?(rvitillo)
Whiteboard: [sync-data-integrity]
Assignee: markh → tchiovoloni
Attachment #8763708 - Flags: feedback?(markh)
Attachment #8763708 - Flags: feedback?(benjamin)
Attachment #8763709 - Flags: feedback?(markh)
Attachment #8763709 - Flags: feedback?(benjamin)
Comment on attachment 8763708 [details]
Bug 1267919 - Part 1. Import Ajv for validation of sync telemetry ping schema

I don't think, as data steward, I'm the right person to be providing feedback on this. If you're looking for a technical review of what you're doing, you should talk to Georg and/or Alessio.
Attachment #8763708 - Flags: feedback?(benjamin)
Comment on attachment 8763709 [details]
Bug 1267919 - Part 2. Add documentation and a schema for a new "sync" telemetry ping.

What are you asking me to review in this patch? Typically I review changes to histograms.json and the doc files. Are you asking me to review the schema? I can do that, but I still want this to show up in human-readable and linkable doc files.

Is there a tool that translates these schemas into human-readable English descriptions?

I'd be looking for the answers to the following questions:
* When will this ping be sent for each client?
* What is the ping "type" field?
* What other metadata will the ping contain, such as clientid or the environment block?
* What are the valid values for strings like "name"? If we can't have a discrete list in the schema itself, can we reference other parts of the code?
* What are the units of measurement for numbers like "took"?
* What period of time does "took" cover? Is that wall-clock time or CPU time? If the client's clock changes during the measurement, will this metric still be monotonic or skew with the clock?
Attachment #8763709 - Flags: feedback?(benjamin) → feedback-
Review commit: https://reviewboard.mozilla.org/r/60442/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/60442/
Attachment #8763708 - Attachment description: Bug 1267919 - Part 1. Import Ajv for validation of sync telemetry ping schema f?bsmedberg,markh → Bug 1267919 - Part 1. Import Ajv for validation of sync telemetry ping schema
Attachment #8763709 - Attachment description: Bug 1267919 - Part 2. Record telemetry for sync data, and integrate it into existing tests, ensuring that the data is valid using a json-schema. f?bsmedberg,markh → Bug 1267919 - Part 2. Add documentation and a schema for a new "sync" telemetry ping.
Attachment #8764763 - Flags: review?(markh)
Attachment #8764763 - Flags: review?(alessio.placitelli)
Attachment #8763708 - Flags: review?(markh)
Attachment #8763708 - Flags: review?(alessio.placitelli)
Attachment #8763708 - Flags: feedback?(markh)
Attachment #8763709 - Flags: review?(benjamin)
Attachment #8763709 - Flags: review?(alessio.placitelli)
Attachment #8763709 - Flags: feedback?(markh)
Attachment #8763709 - Flags: feedback-
Comment on attachment 8763708 [details]
Bug 1267919 - Part 1. Import Ajv for validation of sync telemetry ping schema

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/59842/diff/1-2/
Comment on attachment 8763709 [details]
Bug 1267919 - Part 2. Add documentation and a schema for a new "sync" telemetry ping.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/59844/diff/1-2/
bsmedberg: Understood. I've removed you from the technical review, and added documentation to the folder where I found the rest of the telemetry documentation. I tried to address all of your questions there, but let me know if I missed anything.
Comment on attachment 8763708 [details]
Bug 1267919 - Part 1. Import Ajv for validation of sync telemetry ping schema

https://reviewboard.mozilla.org/r/59842/#review57404

It looks like there was a discussion behind this in bug 1249925. As discussed, since this code will only be used during tests, security and performance should not be real concerns.

That means I'm ok with this library, but I'm not sure I should be the one making this call.
Attachment #8763708 - Flags: review?(alessio.placitelli) → review+
Comment on attachment 8763709 [details]
Bug 1267919 - Part 2. Add documentation and a schema for a new "sync" telemetry ping.

https://reviewboard.mozilla.org/r/59844/#review57400

We're nearly there, my main concern is with the schema not being a JSON file (if that was discussed somewhere else, sorry for missing the comment!).

::: services/sync/modules-testing/sync_ping_schema.js:5
(Diff revision 2)
> +"use strict";
> +
> +this.EXPORTED_SYMBOLS = ["SyncPingSchema"]
> +
> +// We define this as a module instead of a JSON file that is loaded and parsed

Is there any particular reason why the schema is not a JSON file?
Using JSON would make it easier to use this schema server-side as well. See the [pipeline schemas repo][1].

[1]: https://github.com/mozilla-services/mozilla-pipeline-schemas/

::: services/sync/modules-testing/sync_ping_schema.js:17
(Diff revision 2)
> +  "$schema": "http://json-schema.org/draft-04/schema#",
> +  "description": "schema for Sync pings",
> +  "type": "object",
> +  "additionalProperties": false,
> +
> +  // Either some engines or a failureReason is required

Can we have pings with both _engines_ and _failures_? If not, we should probably explicitly cover this case in the schema.

::: services/sync/modules-testing/sync_ping_schema.js:19
(Diff revision 2)
> +  "type": "object",
> +  "additionalProperties": false,
> +
> +  // Either some engines or a failureReason is required
> +  "anyOf": [
> +    {"required": ["when", "version", "took", "uid", "engines"]},

Since "when", "version", "took", "uid" are always required, maybe we can take them off these two lists and move them to a separate "required" entry.
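A sketch of the suggested factoring (assuming JSON Schema draft 4 semantics, where a top-level "required" and an "anyOf" combine with a logical AND):

```json
{
  "required": ["when", "version", "took", "uid"],
  "anyOf": [
    { "required": ["engines"] },
    { "required": ["failureReason"] }
  ]
}
```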

::: services/sync/modules-testing/sync_ping_schema.js:24
(Diff revision 2)
> +    {"required": ["when", "version", "took", "uid", "engines"]},
> +    {"required": ["when", "version", "took", "uid", "failureReason"]}
> +  ],
> +
> +  "properties": {
> +    "version": { "type": "integer" },

Maybe we can use pattern matching here (and elsewhere in the schema) to strictly check for the expected version format?

Something like: "pattern": "^\\d{2}\\." (the pipeline schemas have some examples about that)

::: services/sync/modules-testing/sync_ping_schema.js:28
(Diff revision 2)
> +  "properties": {
> +    "version": { "type": "integer" },
> +    "didLogin": { "type": "boolean" },
> +    "when": { "type": "integer" },
> +    "why": { "enum": ["startup", "schedule", "score", "user", "tabs"] },
> +    "took": { "type": "integer" },

Since we're using monotonic clocks for this, maybe we can validate that the field only accepts integers >= 0?
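If that's the direction taken, draft 4's `minimum` keyword would express it, e.g.:

```json
"took": { "type": "integer", "minimum": 0 }
```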

::: services/sync/modules-testing/sync_ping_schema.js:43
(Diff revision 2)
> +  "definitions": {
> +    // Each Sync record will have one or more "engine" records.
> +    "engine": {
> +      // We need at least one of "incoming", "outgoing", or "failureReason" to be valid.
> +      "anyOf": [
> +        {"required": ["name", "when", "took", "incoming"]},

I think "name", "when", "took" can be removed from this list (and the next two) since you're requiring them on line 47.

::: services/sync/modules-testing/sync_ping_schema.js:52
(Diff revision 2)
> +      "required": ["name", "when", "took"],
> +      "additionalProperties": false,
> +      "properties": {
> +        "failureReason": { "$ref": "#/definitions/error" },
> +        "when": { "type": "integer" },
> +        "name": { "type": "string" },

Are we allowing third party engines? If not, we could probably be stricter about what we expect here.

::: services/sync/modules-testing/sync_ping_schema.js:53
(Diff revision 2)
> +      "additionalProperties": false,
> +      "properties": {
> +        "failureReason": { "$ref": "#/definitions/error" },
> +        "when": { "type": "integer" },
> +        "name": { "type": "string" },
> +        "took": { "type": "integer" },

Same as the other "took" definition.

::: toolkit/components/telemetry/docs/index.rst:27
(Diff revision 2)
>     core-ping
>     deletion-ping
>     crash-ping
>     uitour-ping
>     heartbeat-ping
> +   sync-ping

Please also add the sync-ping to [pings.rst][1].

[1]: https://dxr.mozilla.org/mozilla-central/rev/d1102663db10b3d4b9358f3cf4e16b7c56902352/toolkit/components/telemetry/docs/pings.rst#51

::: toolkit/components/telemetry/docs/sync-ping.rst:54
(Diff revision 2)
> +              {
> +                sent: <integer>, // number of outgoing records sent
> +                failed: <integer>, // number that failed to send.
> +              }
> +            ],
> +            // optional, excluded in the case of errors

Is this correct? If so, can we expand on the reasons why this could be missing?

::: toolkit/components/telemetry/docs/sync-ping.rst:91
(Diff revision 2)
> +why
> +~~~
> +
> +One of the following values:
> +
> +- ``"startup"``: This is the first sync triggered after browser startup.

Nit: can you remove the quotes from ``"startup"`` and the other ones (e.g. ``startup``)? I think that would be more consistent with what we have in the other pages of the documentation (see the docs for the _reason_ field of the main ping).

::: toolkit/components/telemetry/docs/sync-ping.rst:116
(Diff revision 2)
> +
> +- ``"autherror"``: Indicates an unrecoverable authentication error.
> +
> +    - ``"from"``: Where the authentication error occurred, one of the following values: ``"tokenserver"``, ``"fxaccounts"``, or ``"hawkclient"``.
> +
> +- ``"othererror"``: Indicates that it is a sync error code that we are unable to give more specific information on. A complete list of them is available in services/sync/modules/constants.js.

Can you make services/sync/modules/constants.js link to the file on DXR?

::: toolkit/components/telemetry/docs/sync-ping.rst:127
(Diff revision 2)
> +   - ``"error"``: The message provided by the error.
> +
> +engine.validation
> +~~~~~~~~~~~~~~~~~
> +
> +For engines that can run validation on themselves, an array of objects describing validation errors that have occurred. Items that would have a count of 0 are excluded. Each engine will have its own set of items that it might put in the ``name`` field, but there are a finite number. See ``BookmarkProblemData.getSummary`` in services/sync/modules/bookmark\_validator.js for an example.

Can you link bookmark_validator.js to the file on DXR?
Attachment #8763709 - Flags: review?(alessio.placitelli) → review-
Comment on attachment 8764763 [details]
Bug 1267919 - Part 3. Implement initial sync telemetry recording code.

https://reviewboard.mozilla.org/r/60442/#review57408

I'll defer the Sync specific parts to :markh, as I don't know much about those. I reviewed telemetry.js and the tests: in general, they look good. There are just a few things to address (mostly nits), the biggest issue being the JSON vs JS schema that I mentioned in the other reviews.

Regarding the tests, I think it would be useful to check that all the fields of the payload we are sending have the expected values. It should not be a big task, as it's just a few fields sent with each xpcshell test.

::: services/sync/modules/browserid_identity.js:119
(Diff revision 1)
>  
> +  userUID() {
> +    if (this._signedInUser) {
> +      return this._signedInUser.uid;
> +    } else {
> +      return '';

Mh, if the UID can happen to be an empty string, we should probably document that in the sync ping rst file.

::: services/sync/modules/telemetry.js:24
(Diff revision 1)
> +Cu.import("resource://gre/modules/XPCOMUtils.jsm");
> +
> +let constants = {};
> +Cu.import("resource://services-sync/constants.js", constants);
> +
> +

nit: extra blank line

::: services/sync/modules/telemetry.js:33
(Diff revision 1)
> +
> +XPCOMUtils.defineLazyServiceGetter(this, "Telemetry",
> +                                   "@mozilla.org/base/telemetry;1",
> +                                   "nsITelemetry");
> +
> +

nit: extra blank line

::: services/sync/modules/telemetry.js:49
(Diff revision 1)
> +  "weave:engine:sync:error",
> +  "weave:engine:sync:applied",
> +  "weave:engine:sync:uploaded",
> +];
> +
> +

nit: extra blank line

::: services/sync/modules/telemetry.js:377
(Diff revision 1)
> +  }
> +
> +  onSyncStarted() {
> +    let setupStatus = Status.checkSetup();
> +    if (setupStatus !== constants.STATUS_OK) {
> +      log.info("Not recording sync telemetry for a user that isn't logged in. Status = "+setupStatus)

That's missing the trailing ";".

nit: this is the only log line in this file not using the "${}" syntax. We can make it consistent with the others.

::: services/sync/modules/telemetry.js:385
(Diff revision 1)
> +    this.current = new TelemetryRecord(this.allowedEngines);
> +  }
> +
> +  _checkCurrent(topic) {
> +    if (!this.current) {
> +      log.warn(`Observed notification ${topic} but no current sync is being recorded.`)

That's missing the trailing ";"

::: services/sync/modules/telemetry.js:404
(Diff revision 1)
> +    this.submit(current.toJSON());
> +  }
> +
> +  observe(subject, topic, data) {
> +    log.trace(`observed ${topic} ${subject} ${data}`);
> +

nit: extra empty line

::: services/sync/tests/unit/head_helpers.js:8
(Diff revision 1)
>  
>  Cu.import("resource://services-common/async.js");
>  Cu.import("resource://testing-common/services/common/utils.js");
>  Cu.import("resource://testing-common/PlacesTestUtils.jsm");
>  
> +Cu.import("resource://testing-common/services/sync/sync_ping_schema.js");

If we switch to JSON, we would still be able to reference the schema in XPCSHELL tests by adding it to the support-files section of the xpcshell.ini file.

::: services/sync/tests/unit/head_helpers.js:220
(Diff revision 1)
>    for (let i = 0; i < a1.length; ++i) {
>      do_check_eq(a1[i], a2[i]);
>    }
>  }
> +
> +// Helper function to get the sync telemetry and add the typically used test 

nit: trailing whitespace

::: services/sync/tests/unit/test_bookmark_engine.js:102
(Diff revision 1)
>                                            syncID: engine.syncID}}}},
>      bookmarks: {}
>    });
>  }
>  
> -add_test(function test_processIncoming_error_orderChildren() {
> +add_task(function test_processIncoming_error_orderChildren() {

This should be a generator function: function*

::: services/sync/tests/unit/test_bookmark_engine.js:157
(Diff revision 1)
>      } catch(ex) {
>        error = ex;
>      }
>      do_check_true(!!error);
> +    do_check_true(!!ping);
> +    do_check_eq(ping.failureReason.error, "error.engine.reason.record_download_fail");

Other things that would be useful to check here, IMHO:

- uid contains the expected value
- failureReason.name must have the correct value
- engines[0].name must have the correct value
- add test coverage for when/took

::: services/sync/tests/unit/test_bookmark_engine.js:288
(Diff revision 1)
>        _("Got error: " + Log.exceptionStr(ex));
>      }
> +    ok(!!ping);
> +    equal(ping.failureReason, undefined);
> +    equal(ping.engines.length, 1);
> +    ok(ping.engines[0].outgoing.length > 0);

We could probably also check engines[0].outgoing[0].sent/failed here

::: services/sync/tests/unit/test_telemetry.js:18
(Diff revision 1)
> +
> +Cu.import("resource://gre/modules/PlacesUtils.jsm");
> +Cu.import("resource://services-sync/util.js");
> +
> +const ajv = new Ajv({ async: "co*" });
> +const validateSyncPing = ajv.compile(SyncPingSchema);

Can't we use SyncPingValidator from head_helpers.js here?
Attachment #8764763 - Flags: review?(alessio.placitelli) → review-
Comment on attachment 8763709 [details]
Bug 1267919 - Part 2. Add documentation and a schema for a new "sync" telemetry ping.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/59844/diff/2-3/
Attachment #8763709 - Flags: review- → review?(alessio.placitelli)
Attachment #8764763 - Flags: review- → review?(alessio.placitelli)
Comment on attachment 8764763 [details]
Bug 1267919 - Part 3. Implement initial sync telemetry recording code.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/60442/diff/1-2/
https://reviewboard.mozilla.org/r/60442/#review57408

I've added checks for expected values wherever a mismatch would actually indicate an error, and avoided adding tests for values where we don't care. E.g. on the first sync of a fresh profile we upload 3 bookmarks; uploading 4, or 2, would be okay as well, so long as we didn't *fail* to upload any. (But where we actually should be sending 3, and not sending them would be a bug, I did try to check.)

Keep in mind that the validation runs over all of these pings as well, and ensures that the required fields are present, with no additional fields, especially now that schema also validates the integers are positive, etc.

> Mh, if the UID can happen to be an empty string, we should probably document that in the sync ping rst file.

Good point, I had forgotten about this case.

> Other things that would be useful to check here, IMHO:
> 
> - uid contains the expected value
> - failureReason.name must have the correct value
> - engines[0].name must have the correct value
> - add test coverage for when/took

I've added coverage for `when` to the validation in head_helpers; the only requirement for `took` is that it be >= 0, which is (now) covered by the schema.

For the uid, I've added checks in some places, but a number of the tests run with different auth systems (mocked legacy auth, then mocked fxa auth), so we might not know the UID.
https://reviewboard.mozilla.org/r/59844/#review57400

> Is there any particular reason why the schema is not a JSON file?
> Using JSON would make it easier to use this schema server-side as well. See the [pipeline schemas repo][1].
> 
> [1]: https://github.com/mozilla-services/mozilla-pipeline-schemas/

Now that there's dedicated documentation, there's no reason to keep it as a module.

> Can we have pings with both _engines_ and _failures_? If not, we should probably explicitly cover this case in the schema.

No, but I've updated the documentation to be more clear about this.

> Maybe we can use pattern matching here (and elsewhere in the schema) to strictly check for the expected version format?
> 
> Something like: "pattern": "^\\d{2}\\." (the pipeline schemas have some examples about that)

This is an integer, and so a pattern doesn't make sense. I did add a check to ensure it is always positive, and added a pattern to the "uid" property.

> Are we allowing third party engines? If not, we could probably be stricter about what we expect here.

No third party engines, good point. Only point of consideration is tests, but that's easy to work around by patching the JSON schema after loading it in the test code.

> Is this correct? If so, can we expand on the reasons why this could be missing?

Thanks, that was reversed.
https://reviewboard.mozilla.org/r/59842/#review57572

What's the process for updating this library? It doesn't look like it came directly out of github, and I don't want it to be difficult to update to later versions (ie, if it was put together by hand I'd be concerned). I think at a minimum we'd want docs next to the file explaining the update process.
https://reviewboard.mozilla.org/r/59842/#review57574

::: services/common/moz.build:41
(Diff revision 2)
>      TESTING_JS_MODULES.services.common += [
>          'modules-testing/storageserver.js',
>      ]
>  
>  TESTING_JS_MODULES.services.common += [
> +    'modules-testing/ajv.js',

oops - also meant to say that we could consider putting this file into testing/modules, then it could be referenced via resource://testing-common/... and available for bug 1249925.
https://reviewboard.mozilla.org/r/59842/#review57574

> oops - also meant to say that we could consider putting this file into testing/modules, then it could be referenced via resource://testing-common/... and available for bug 1249925.

Hm, I don't mind moving it, but I'm able to reference it from testing-common as is, via resource://testing-common/services/common/ajv.js (which is used in tests in the later commits).
https://reviewboard.mozilla.org/r/59842/#review57572

It's not part of their github, but it's distributed with their npm bundle (It's in node_modules/ajv/dist/ajv.bundle.js after installing it).  I installed it to a local folder, took that file, and added a license header and the EXPORTED_SYMBOLS line. But I can definitely document this process.
https://reviewboard.mozilla.org/r/60442/#review57586

Wow, this is a large patch :) But I think it is looking great - nice job!

There are lots of comments here and I ran out of time for today, so ended up skipping over some of the test changes, but in general, I think we can sacrifice some of the "robust" validation checking in some of the existing tests in the interests of keeping the validation code in those tests much smaller - we can do fully robust validation checks in telemetry-specific tests.  IOW, I think it might make sense to add new telemetry specific tests with robust checks for record counts etc.

As I mention a few times I also think the error stuff needs more love, even if it does end up meaning we need other changes in Sync itself to get access to the more detailed errors.

::: services/sync/modules/browserid_identity.js:119
(Diff revision 2)
>  
> +  userUID() {
> +    if (this._signedInUser) {
> +      return this._signedInUser.uid;
> +    } else {
> +      return '';

I assume this is needed for tests, otherwise we could throw, right? ISTM it might be possible for the tests to mock this function so the real function can throw?

::: services/sync/modules/telemetry.js:66
(Diff revision 2)
> +  } catch (e) {}
> +  return false;
> +}
> +
> +function checkStatusError(engineName) {
> +  if (engineName && Status.engines && Status.engines[engineName]) {

FYI, I need to have another look at this Status handling, but I'm not going to get to that today - but I'll post the rest of the review anyway.

::: services/sync/modules/telemetry.js:99
(Diff revision 2)
> +
> +  if (error.failureCode) {
> +    return { name: "othererror", error: error.failureCode };
> +  }
> +
> +  if (isBrowerIdAuthError(error)) {

An alternative here would be to make AuthenticationError() in browserid_identity public, and instead of duplicating the logic have bid_identity set a "from" attribute on the error, and just so an instanceof check here and use .from
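A minimal sketch of that alternative, assuming `AuthenticationError` were exported from browserid_identity with a `from` tag (the class shape and `transformAuthError` helper here are illustrative, not the actual patch):

```javascript
// Hypothetical sketch: browserid_identity exports AuthenticationError and
// tags each instance with where the failure originated, so telemetry.js
// needs only a single instanceof check instead of duplicating the logic.
class AuthenticationError extends Error {
  constructor(message, from) {
    super(message);
    this.name = "AuthenticationError";
    // "from" would be one of "tokenserver", "fxaccounts", or "hawkclient".
    this.from = from;
  }
}

// Illustrative transform: map an auth error to the ping's failureReason shape.
function transformAuthError(error) {
  if (error instanceof AuthenticationError) {
    return { name: "autherror", from: error.from };
  }
  return null; // not an auth error; caller falls through to other checks
}
```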

::: services/sync/modules/telemetry.js:122
(Diff revision 2)
> +  if (httpCode) {
> +    return { name: "httperror", code: httpCode };
> +  }
> +
> +  if (error.result) {
> +    switch (error.result) {

It might be better to check if the error code has the "module" set as NS_ERROR_MODULE_NETWORK. Sadly this isn't sanely exposed to .js, but cargo-culting https://dxr.mozilla.org/mozilla-central/source/toolkit/components/jsdownloads/src/DownloadCore.jsm#1516 seems reasonable.
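The module check being suggested boils down to a few bit operations on the nsresult value; a sketch along the lines of the DownloadCore.jsm pattern (constants per XPCOM's nsError.h):

```javascript
// Sketch of extracting the module from an nsresult code and comparing it
// against NS_ERROR_MODULE_NETWORK, rather than switch-ing on known codes.
const NS_ERROR_MODULE_BASE_OFFSET = 0x45; // per nsError.h
const NS_ERROR_MODULE_NETWORK = 6;

function isNetworkError(result) {
  // Bits 16..30 hold (module + base offset); bit 31 is the failure flag,
  // which the 0x7fff0000 mask deliberately excludes.
  return (((result & 0x7fff0000) >>> 16) - NS_ERROR_MODULE_BASE_OFFSET) ===
         NS_ERROR_MODULE_NETWORK;
}

isNetworkError(0x804B0010); // NS_ERROR_OFFLINE -> true
isNetworkError(0xC1F30001); // NS_ERROR_NOT_INITIALIZED -> false
```

(0xC1F30001 is 3253927937 in decimal, the same code discussed later in this thread.)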

::: services/sync/modules/telemetry.js:159
(Diff revision 2)
> +class EngineRecord {
> +  constructor(name) {
> +    this.when = Date.now();
> +    // startTime is in ms from process start, but is monotonic (unlike Date.now())
> +    // so we need to keep both it and when.
> +    this.startTime = tryGetMonotonicTimestamp();

I'm wondering if .when should just be a monotonic timestamp - I doubt we actually need the clock time for "when" and will be more interested in calculating the time between subsequent syncs.

But I guess that depends on how unlikely failing to get a monotonic timestamp is in practice?

::: services/sync/modules/telemetry.js:185
(Diff revision 2)
> +  recordApplied(counts) {
> +    if (!(counts.applied || counts.succeeded || counts.failed ||
> +          counts.newFailed || counts.reconciled)) {
> +      return;
> +    }
> +    if (!this.incoming) {

ISTM there would be a logic error somewhere if this was called twice for the same record? IOW, should we throw and/or log an error if this.incoming already exists? Similarly for the upload counts below.
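The "called twice" behavior settled on later in this thread is to sum the counts rather than clobber or throw; a sketch of that shape (the class and field names mirror the patch but this is a standalone illustration, not the real telemetry.js):

```javascript
// Sketch: repeated "applied" notifications for the same engine record
// accumulate, so a second call sums into the existing counts.
class EngineRecord {
  recordApplied(counts) {
    // Ignore notifications where every count is zero/absent.
    if (!(counts.applied || counts.succeeded || counts.failed ||
          counts.newFailed || counts.reconciled)) {
      return;
    }
    if (!this.incoming) {
      this.incoming = {};
    }
    for (let key of ["applied", "succeeded", "failed", "newFailed", "reconciled"]) {
      if (counts[key]) {
        this.incoming[key] = (this.incoming[key] || 0) + counts[key];
      }
    }
  }
}
```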

::: services/sync/modules/telemetry.js:217
(Diff revision 2)
> +class TelemetryRecord {
> +  constructor(allowedEngines) {
> +    this.allowedEngines = allowedEngines;
> +    // Our failure reason. This property only exists in the generated ping if an
> +    // error actually occurred.
> +    this.failureReason = null;

Can we set this to undefined and unconditionally include it in result in toJSON - IIUC it would still not appear in the JSON in that case.
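The behavior being relied on here is standard `JSON.stringify` semantics: object properties whose value is `undefined` are simply omitted from the output, so an unconditional `failureReason` in `toJSON` never reaches the payload when unset. A quick demonstration (the `toJSON` shape is illustrative):

```javascript
// Properties set to undefined are dropped by JSON.stringify, so including
// failureReason unconditionally is safe when no error occurred.
function toJSON(record) {
  return { when: record.when, failureReason: record.failureReason };
}

const serialized = JSON.stringify(toJSON({ when: 12345467, failureReason: undefined }));
// serialized === '{"when":12345467}' -- no failureReason key at all
```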

::: services/sync/modules/telemetry.js:222
(Diff revision 2)
> +    this.failureReason = null;
> +    try {
> +      this.uid = Weave.Service.identity.userUID();
> +    } catch (e) {
> +      // Legacy sync authentication. This is should only ever happen during tests.
> +      this.uid = "<LEGACY>";

Even though I recall suggesting this literal, I'm surprised this doesn't cause schema validation failures? (It would be reasonable to include "0"*32 if it did - I'm just more interested in why it doesn't cause test failures and/or whether the tests should be validating more often.) Edit: I see now that you neuter this check in tests - I guess I'm not too bothered, but it doesn't sound like it would be too much work to (a) use "0"*32 here and (b) have configureFxaIdentity (or whatever it is called) use a "valid" uid.

::: services/sync/modules/telemetry.js:258
(Diff revision 2)
> +    if (engines.length > 0) {
> +      result.engines = engines;
> +    }
> +    if (result.failureReason || result.engines) {
> +      return result;
> +    } else {

nit: no "else" after return. I think we also want a comment about why we are doing this (ie, to avoid uploading a "synced but did nothing" record.)

::: services/sync/modules/telemetry.js:264
(Diff revision 2)
> +      return null;
> +    }
> +  }
> +
> +  finished(error) {
> +    this.took = Math.round(tryGetMonotonicTimestamp() - this.startTime);

Still not sure what I think the best thing to do is with monotonic timestamps, but I wonder if this.took being zero is actually better in practice than a value calculated from .now() as a fallback?

::: services/sync/modules/telemetry.js:275
(Diff revision 2)
> +    if (error) {
> +      this.failureReason = transformError(error);
> +    }
> +
> +    if (!this.uid) {
> +      // Try again, it's possible that by now we've logged in.

Any reason we can't *only* fetch it here? I guess there is also the possibility a user disconnects *during* Sync - in which case I think it would be fine to not record the final sync.

::: services/sync/modules/telemetry.js:365
(Diff revision 2)
> +  }
> +}
> +
> +class SyncTelemetryImpl {
> +  constructor(allowedEngines) {
> +    // This is accessable so we can enable custom engines during tests.

nit: accessible

::: services/sync/modules/telemetry.js:394
(Diff revision 2)
> +  }
> +
> +  onSyncStarted() {
> +    let setupStatus = Status.checkSetup();
> +    if (setupStatus !== constants.STATUS_OK) {
> +      log.info(`Not recording sync telemetry for a user that isn't logged in. Status = ${setupStatus}.`);

do we actually need this? It seems likely to be the first sync, and as per the comment above, we should expect a UID at sync finish (or user disconnected during sync, in which case I think we could just discard the ping)

::: services/sync/modules/telemetry.js:429
(Diff revision 2)
> +        this.shutdown();
> +        break;
> +
> +      /* sync itself state changes */
> +      case "weave:service:sync:start":
> +        this.current = new TelemetryRecord(this.allowedEngines);

seems like it might be worth checking !this.current here and logging otherwise?

::: services/sync/modules/telemetry.js:433
(Diff revision 2)
> +      case "weave:service:sync:start":
> +        this.current = new TelemetryRecord(this.allowedEngines);
> +        break;
> +
> +      case "weave:service:sync:finish":
> +        this.onSyncFinished(null);

should this use _checkCurrent() and avoid the additional check in onSyncFinished?

::: services/sync/tests/unit/head_helpers.js:7
(Diff revision 2)
>     http://creativecommons.org/publicdomain/zero/1.0/ */
>  
>  Cu.import("resource://services-common/async.js");
>  Cu.import("resource://testing-common/services/common/utils.js");
>  Cu.import("resource://testing-common/PlacesTestUtils.jsm");
> +Cu.import("resource://gre/modules/Promise.jsm");

We should be able to avoid this import (it used to be necessary before DOM Promises, hence you will see existing imports of this, but they can actually be removed as convenient)

::: services/sync/tests/unit/head_helpers.js:298
(Diff revision 2)
> +    }
> +  }
> +}
> +
> +// Hooks into telemetry to validate all pings after calling.
> +function validate_all_pings() {

I think a name similar to "validate_all_future_pings" would be better.

::: services/sync/tests/unit/head_helpers.js:322
(Diff revision 2)
> +// engine is actually synced, but we still want to ensure we're generating a
> +// valid ping. Returns a promise that resolves to the ping, or rejects with the
> +// thrown error after calling an optional callback.
> +function with_validated_engine_sync(callback, optErrPingCb) {
> +  let telem = get_sync_test_telemetry();
> +  let deferred = Promise.defer();

oh - using Promise.defer here *does* require Promise.jsm, but I think it should be easy to avoid here (ie, return new Promise(...))
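The suggested rewrite of the deferred pattern can be sketched with the plain Promise constructor; the helper name and `submit` hook below are illustrative stand-ins for the test helper being discussed, not its actual signature:

```javascript
// Sketch: express the Promise.defer() pattern with the Promise constructor.
// The submit hook is installed inside the executor, before the callback
// runs, so a ping submitted during callback() is not missed.
function withValidatedEngineSync(callback) {
  return new Promise((resolve, reject) => {
    const submit = ping => resolve(ping);
    try {
      callback(submit); // hypothetical: callback drives the sync and
                        // eventually invokes submit with the ping
    } catch (e) {
      reject(e);
    }
  });
}
```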

::: services/sync/tests/unit/test_bookmark_engine.js:160
(Diff revision 2)
> -    do_check_true(!!error);
> +    ok(!!error);
> +    ok(!!ping);
> +    equal(ping.uid, "<LEGACY>");
> +    deepEqual(ping.failureReason, {
> +      name: "othererror",
> +      error: "error.engine.reason.record_download_fail"

re errors, this doesn't really look right - "record_download_fail" doesn't give us the information we want IMO - I'd expect an unexpectederror with "Sync this!" or similar.

::: services/sync/tests/unit/test_bookmark_engine.js:243
(Diff revision 2)
> +        engine.sync());
>      } catch(ex) {
>        error = ex;
>        _("Got error: " + Log.exceptionStr(ex));
>      }
> +    ok(!!ping);

I wonder if we can cleanly move some of this boilerplate out (eg, have with_validated_engine_sync() take the engine and some additional validation as params instead of a callback). I think it would also be OK to have less strenuous checks here if it helps clean things up - eg, I'm not sure there's a whole lot of value in checking the number of records in these kinds of tests, especially where the number (eg, 6 in this case) is really more a side-effect than a carefully constructed thing.

While doing the schema validation here is awesome, I'm trying to avoid obscuring the actual point of these tests.

::: services/sync/tests/unit/test_collections_recovery.js:37
(Diff revision 2)
>      "/1.1/johndoe/storage/meta/global": johnU("meta",   new ServerWBO("global").handler())
>    };
>    let collections = ["clients", "bookmarks", "forms", "history",
>                       "passwords", "prefs", "tabs"];
> +  // Disable addon sync because AddonManager won't be initialized here.
> +  Service.engineManager.unregister("addons");

what goes wrong here without this change?

::: services/sync/tests/unit/test_errorhandler.js:208
(Diff revision 2)
>  
>    _("Starting first sync.");
> -  Service.sync();
> +  ping = yield wait_for_ping(() => Service.sync());
> +  ok(!!ping);
> +  ok(!ping.engines);
> +  // Unfortunately this is an unexpectederror right now...

why is that? Is that something we can fix before landing, or something we can fix after landing?

::: services/sync/tests/unit/test_syncengine_sync.js:1795
(Diff revision 2)
>  
> +    ok(!!error);
> +    ok(!!ping);
> +    deepEqual(ping.failureReason, {
> +      name: "othererror",
> +      error: "error.engine.reason.record_upload_fail"

ditto here re errors - ideally we'd record *why* rather than the *what*
https://reviewboard.mozilla.org/r/59844/#review57578

I think this looks good, but we should wait for all other patches to be ready before landing (even if r+ comes for this first :)

::: services/sync/tests/unit/sync_ping_schema.json:21
(Diff revision 3)
> +    "took": { "type": "integer", "minimum": 0 },
> +    "uid": {
> +      "type": "string",
> +      "oneOf": [
> +        { "pattern": "[0-9a-f]{32}" },
> +        { "maxLength": 0 }

http://json-schema.org/latest/json-schema-validation.html doesn't mention special semantics for maxLength = 0 - is that actually necessary (and should the pattern include start and end of string expressions?)
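The anchoring concern is real: JSON Schema's `pattern` keyword only requires that the regex match *somewhere* in the string, so without `^`/`$` the uid pattern accepts any string merely containing 32 hex characters. A quick demonstration of the difference (the `oneOf` empty-string branch then covers the documented "UID unavailable" case):

```javascript
// JSON Schema "pattern" searches rather than fully matching, so the
// unanchored form is too permissive; anchoring fixes it.
const unanchored = /[0-9a-f]{32}/;
const anchored = /^[0-9a-f]{32}$/;
const uid = "0123456789abcdef0123456789abcdef";

unanchored.test(uid + "-extra-junk"); // true  -- accepted, too permissive
anchored.test(uid + "-extra-junk");   // false -- rejected as desired
anchored.test(uid);                   // true
anchored.test("");                    // false -- hence the empty-string branch
```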

::: services/sync/tests/unit/sync_ping_schema.json:114
(Diff revision 3)
> +      "properties": {
> +        "name": { "enum": ["autherror"] },
> +        "from": { "enum": ["tokenserver", "fxaccounts", "hawkclient"] }
> +      }
> +    },
> +    "otherError": {

As per the other patch, I think we need to think a little more about errors.

::: toolkit/components/telemetry/docs/sync-ping.rst:8
(Diff revision 3)
> +===========
> +
> +This ping is generated after a sync is completed, for both successful and failed syncs. It's payload contains measurements
> +pertaining to sync performance and error information.
> +
> +Structure::

I think we should link to the schema itself as the .rst seems likely to get out of date.
Comment on attachment 8763709 [details]
Bug 1267919 - Part 2. Add documentation and a schema for a new "sync" telemetry ping.

https://reviewboard.mozilla.org/r/59844/#review57660

data-review=me without a clientID, or please ping me and explain why you need the clientid

based on re-reading the bug, I'm assuming you don't need the full environment block (https://bugzilla.mozilla.org/show_bug.cgi?id=1267919#c7) so you should document that you're not collecting it.

::: toolkit/components/telemetry/docs/sync-ping.rst:14
(Diff revision 3)
> +
> +    {
> +      version: 4,
> +      type: "sync",
> +      ... common ping data
> +      clientId: <UUID>,

Do you need the telemetry clientID for this ping? This would allow us to correlate the sync account ID with telemetry clientID, which may not be desirable. If you don't need it, it would be better not to include it.

Also, will this ping contain the environment block? It's not documented one way or the other.

::: toolkit/components/telemetry/docs/sync-ping.rst:81
(Diff revision 3)
> +Additionally, both ``failureReason`` and ``engines`` may be present in the payload, as we may have synced or partially synced some engines when an error occurred. If the error occured due to the engine's operation, there may be additional error information on the failureReason of one or more engines.
> +
> +took
> +~~~~
> +
> +These values should be monotonic, however 0 is reported if we can't get a monotonic timestamp.

Wouldn't `null` or missing be better in this case? Are there really cases where we can't get a monotonic timestamp?
Attachment #8763709 - Flags: review?(benjamin) → review+
https://reviewboard.mozilla.org/r/60442/#review57586

The more robust validation checking was largely added in response to :Dexter's comments, so hopefully there's some middle ground. I'll work on more telemetry-specific tests in the hope that this is acceptable to both of you (and also on reducing the testing boilerplate in more of the tests).

> I assume this is needed for tests, otherwise we could throw, right? ISTM like it might be possible for the tests to mock this function so the real function can throw?

Right, that sounds possible, and any case where mocking it causes a problem can probably just be addressed there.

> It might be better to check if the error code has the "module" set as NS_ERROR_MODULE_NETWORK. Sadly this isn't sanely exposed to .js, but cargo-culting https://dxr.mozilla.org/mozilla-central/source/toolkit/components/jsdownloads/src/DownloadCore.jsm#1516 seems reasonable.

Sounds good to me. (It also occurs to me that we probably don't want to report an unexpectederror in the case that it does have an nsresult code.)

edit: I talk about this more later, IMO maybe we just want to report nsresults as nsresults. e.g. something like {name: "nserror", result: 12345} or something like this...

> I'm wondering if .when should just be a monotonic timestamp - I doubt we actually need the clock time for "when" and will be more interested in calculating the time between subsequent syncs.
> 
> But I guess that depends on how unlikely failing to get a monotonic timestamp is in practice?

I think it depends more on how much we care about the time delta for two syncs that have a browser restart in between, since we wouldn't get that with the monotonic time function. Of course, an alternative would be to include the process start time -- or even just the time when telemetry.js loads, if the process start time isn't available anywhere -- *and* the monotonic msSinceProcessStart.

But IMO whether or not that's worth doing depends on how likely Date.now()'s lack of monotonicity is to matter. I have no data on this (and I'd be interested if we *do*), but ISTM that it's unlikely to shift so substantially that we care -- but I could definitely be wrong about this.

> ISTM there would be a logic error somewhere if this was called twice for the same record? IOW, should we throw and/or log an error if this.incoming already exists? Similarly for the upload counts below.

I hadn't dug into it too much, and since there is a totally natural way to handle it being called multiple times (sum them), I just did that. I still think summing them is going to be much better than throwing (except possibly during tests...), but logging an error sounds like a good idea.

I was under the impression that it was completely valid for the upload counts to be reported multiple times, indicating multiple uploads, hence it being an array.

> Can we set this to undefined and unconditionally include it in result in toJSON - IIUC it would still not appear in the JSON in that case.

I'll double-check this; I was using a different JSON-schema validator that didn't like this at first. I had looked into it and, well, JSON-schema doesn't really have a stance on undefined properties, since they aren't part of JSON.

OTOH when validating I could just JSON.parse(JSON.stringify(obj)) before validating which would resolve this.

> Even though I recall suggesting this literal, I'm surprised this doesn't cause schema validation failures? (It would be reasonable to include "0"*32 if it did - I'm just more interested in why it doesn't cause test failures and/or whether the tests should be validating more often?  (edit: I see now that you neuter this check in tests - I guess I'm not too bothered, but it doesn't sounds like it would be too much work to (a) use "0"*32 here and (b) have configureFxaIdentity (or whatever it is called) use a "valid" uid?

Yeah, I'd much prefer using an all-zero UUID. This means that should the unthinkable happen and it make its way to the server, it wouldn't be a hard error.

> Still not sure what I think the best thing to do is with monotonic timestamps, but I wonder if this.took being zero is actually better in practice to it being a value calculated from .now() as a fallback?

IMO it probably is. We could still keep the monotonicity guarantee by using `Math.max(0, delta)`, or similar.

I dug into it a bit and it's not clear to me that this will never throw, since it depends not only on the platform having a monotonic timer (which, ignoring some very obscure unices, they all do), but also on our computed process start time being consistent with what we recorded for it... Which, well, who knows.

That said, it sounds like it would be easy, at least in theory, to find out via telemetry whether this can ever happen in practice, as a special value (-1) is recorded when we can't get a monotonic timestamp.

> Any reason we can't *only* fetch it here? I guess there is also the possibility a user disconnects *during* Sync - in which case I think it would be fine to not record the final sync.

Sure, although I think we might still want to record the sync, e.g. in the case of authentication/login errors or the like where we can't actually get a unique ID...

> do we actually need this? It seems likely to be the first sync, and as per the comment above, we should expect a UID at sync finish (or user disconnected during sync, in which case I think we could just discard the ping)

Sounds like a good point, especially since that code wasn't even running. As I mentioned above, I don't think this code was even doing a smart thing -- we still would probably want to report auth errors, etc.

> seems like it might be worth checking !this.current here and logging otherwise?

Oh, this really should be calling `this.onSyncStarted()`. But yes, that check seems valuable as well.

> We should be able to avoid this import (it used to be necessary before DOM Promises, hence you will see existing imports of this, but they can actually be removed as convenient)

This is a good point; there are a couple of reasons I wanted `Promise.defer` (mostly because of when `new Promise` runs its function), but I'll try harder to avoid it.

(Of course, ideally those utilities would still be available somewhere).

> oh - using Promise.defer here *does* require Promise.jsm, but I think it should be easy to avoid here (ie, return new Promise(...))

Right, there's a difference though, in that `new Promise()` runs its function argument asynchronously, so if `callback` reports an error or finishes before the promise microtask gets a chance to run, then we'll miss the submit.

This seemed to be happening in practice, so I moved from `new Promise()` to using defer. I'll see if I can get it to work with `new Promise()` though (it seems like I should be able to if I move enough of the setup into the `new Promise` callback).

> re errors, this doesn't really look right - "record_download_fail" doesn't give us the information we want IMO - I'd expect an unexpectederror with "Sync this!" or similar.

This seems to be dropped/swallowed in the sync error handling code -- https://dxr.mozilla.org/mozilla-central/source/services/sync/modules/resource.js#300 and https://dxr.mozilla.org/mozilla-central/source/services/sync/modules/engines.js#1135-1138 seem to be where the problem is.

IMO fixing all the places where sync does bad things with errors is out of the scope of this bug - though, well, maybe the data isn't valuable without better error reporting, in which case it would need to be in scope after all.

> I wonder if we can cleanly move some of this boilerplate out (eg, have with_validated_engine_sync() take the engine and some additional validation as params instead of a callback? I think it would also be OK to have less strenuous checks here if is helps clean things up (eg, I'm not sure there's a whole lot of value in checking the number of records in these kinds of tests, especially where the number (eg, 6 in this case) is really more a side-effect than a carefully constructed thing.
> 
> While doing the schema validation here is awesome, I'm trying to avoid obscuring the actual point of these tests.

Well, a lot of it was added after feedback from the earlier review.

But adding additional params to move the checking into the with_validated_engine_sync function sounds fine to me (although, I thought there is at least one place where we do more than just `engine.sync()` inside the callback, but it's possible I got rid of it already).

> what goes wrong here without this change?

This throws an NS_ERROR_NOT_INITIALIZED: https://dxr.mozilla.org/mozilla-central/source/services/sync/modules/addonsreconciler.js#351.

But thanks for bringing my attention back to this, I had meant to look into it more, since we were reporting this as "error.engine.reason.unknown_fail", which, well, pretty much should never be reported. Even if we know nothing else, we should be reporting it as an "unexpectederror". So I've fixed this so now it should be an "unexpectederror" with "error": "3253927937" -- which is the code for that nsresult.

But... I'm not sure what we should do about non-network nsresults that are thrown. It does seem to fall into the category of 'unexpected error', but it probably makes sense to narrow that scope, since we can probably do more with the knowledge of which nsresult it is. (I'd almost say that networkerror is bad and it should just be {"name": "nserror", "result": error.result} or similar. But I don't know.)
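
A minimal sketch of the shape being suggested here. The `errorToFailureReason` name and the exact fallback are illustrative, not the actual sync code; the only assumption taken from the thread is that anything carrying a numeric nsresult (`error.result`) gets reported uniformly as "nserror":

```javascript
// Hypothetical sketch: map a thrown value to a telemetry failureReason.
// The function name and fallback shape are illustrative only.
function errorToFailureReason(error) {
  if (error && typeof error.result === "number") {
    // Any nsresult, network-related or not, reported uniformly.
    return { name: "nserror", result: error.result };
  }
  // Otherwise fall back to a generic unexpected error string.
  return { name: "unexpectederror", error: String(error) };
}
```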

> why is that? Is that something we can fix before landing, or something we can fix after landing?

Along the lines of my other comment about sync error handling. This is from https://dxr.mozilla.org/mozilla-central/source/services/sync/modules/service.js#551, and the reporting can be somewhat improved by throwing `info` instead (which causes it to show up as a 401 httperror, which possibly should be an autherror instead...), but it seems like really we should be returning something from checkServerError instead.

But as I mentioned above, it seems like there's a lot of places where the sync code swallows errors or error info, and fixing them doesn't sound like it should be part of this bug. (But I'll totally defer to your judgement here). Changing what we throw here is easy enough that it's worth doing though, especially since it will probably cause other kinds of errors to be reported better as well.
https://reviewboard.mozilla.org/r/59844/#review57660

> Do you need the telemetry clientID for this ping? This would allow us to correlate the sync account ID with telemetry clientID, which may not be desirable. If you don't need it, it would be better not to include it.
> 
> Also, will this ping contain the environment block? It's not documented one way or the other.

Sorry, it won't contain the environment block; it will have the application block, but my understanding is that all pings have that.

Mark: do you have a strong opinion on whether or not we have the client id?

> Wouldn't `null` or missing be better in this case? Are there really cases where we can't get a monotonic timestamp?

There's some discussion of this in the reply I made to markh's comments just now, but it's not clear from the code that it's impossible (e.g. it's not solely dependent on the OS having a monotonic timer, which they essentially all do). After some discussion on IRC, it looks like this is unlikely but can happen (it shows up 280 times in the longitudinal table, which AIUI is 1% of users over the last 6 months).

FWIW I think that a reasonable solution might be to attempt to fall back on a non-monotonic timer, but if it reports a negative value, *then* either leave the time out, report 0, or some other error indicator (-1 would be consistent with the rest of telemetry).
https://reviewboard.mozilla.org/r/60442/#review57586

> Right, that sounds possible, and any case where mocking it causes a problem can probably just be addressed there.

Actually, this handles the case where the user is not logged in or not yet authenticated.
Whiteboard: [sync-data-integrity] → [data-integrity]
(In reply to Thom Chiovoloni [:tcsc] from comment #32)
> Mark: do you have a strong opinion on whether or not we have the client id?

I think we do not want the client id for the reasons Benjamin mentions, and there's no identified requirement to correlate between the two (and if a good reason comes up later, we can address it then).

> > Wouldn't `null` or missing be better in this case? Are there really cases where we can't get a monotonic timestamp?
> 
> There's some discussion of this in the reply I made to markh's comments just
> now, but it's not clear from the code that it's impossible (e.g. it's not
> solely dependent on the OS having a monotonic timer, which they essentially
> all do). After some discussion on IRC, it looks like this is unlikely but
> can happen (it shows up 280 times in the longitudinal table, which AIUI is
> 1% of users over the last 6 months).
> 
> FWIW I think that a reasonable solution might be to attempt to fall back on
> a non-monotonic timer, but if it reports a negative value, *then* either
> leave the time out, report 0, or some other error indicator (-1 would be
> consistent with the rest of telemetry).

For our analysis I think losing that 1% due to no monotonic timestamp is reasonable (and may actually be helpful in terms of filtering out insane platforms/situations we don't actually care about ;) So I'd be fine with no fallback to Date.now() and just recording -1 for those cases - it keeps the code simpler and offers less chance of polluting our data with outliers due to this.
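
The agreed approach above can be sketched as follows. `monotonicNow` stands in for whatever monotonic timer the code actually uses (e.g. something like msSinceProcessStart); the parameter and function names here are assumptions for illustration:

```javascript
// Illustrative sketch: use the monotonic timer when available, and record
// -1 otherwise -- no Date.now() fallback, to avoid polluting the data
// with non-monotonic outliers. `monotonicNow` is a hypothetical stand-in
// for the real monotonic clock.
function tookFrom(startMs, monotonicNow) {
  if (typeof startMs !== "number" || startMs < 0) {
    return -1; // no monotonic timestamp was available at sync start
  }
  return Math.round(monotonicNow() - startMs);
}
```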
https://reviewboard.mozilla.org/r/60442/#review57586

> I think it depends more on how much we care about the time delta for two syncs that happen when the browser restarts in between, since we wouldn't get that with that monotonic time function. Of course, an alternative would be to include the process start time -- or even just when telemetry.js loads if the process start time isn't available anywhere, *and* the monotonic msSinceProcessStart.
> 
> But IMO whether or not that's worth doing depends on how likely Date.now()'s lack of monotonicity is to matter. I have no data on this (and I'd be interested if we *do*), but ISTM that it's unlikely to shift so substantially that we care -- but I could definitely be wrong about this.

As already mentioned, I think just monotonic is sounding ok.

> I hadn't dug into it too much, and since there is a totally natural way to handle it being called multiple times (sum them), I just did that. I still think summing them is going to be much better than throwing (except possibly during tests...), but logging an error sounds like a good idea.
> 
> I was under the impression that it was completely valid for the upload counts to be reported multiple times, indicating multiple uploads, hence it being an array.

I believe the incoming fields should only ever be called once, so if we get multiple something has gone badly wrong that we need to know about. You are correct about outgoing though and I suggested we don't sum them up so we can get some insights into how the batching is working - that lack of symmetry is a shame, but I think it's probably OK.

> I'll double check this, I was using a different JSON-schema validator that didn't like this at first. I had looked into it and, well, JSON-schema doesn't really have a stance on undefined properties, since they aren't part of JSON.
> 
> OTOH when validating I could just JSON.parse(JSON.stringify(obj)) before validating which would resolve this.

I see - you aren't stringifying it - I think that's fine as it stands then.
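
For reference, the JSON round-trip mentioned above sidesteps the undefined-property question entirely: JSON has no `undefined`, so `JSON.stringify` simply omits those keys, and a strict schema validator never sees them. A small self-contained demonstration:

```javascript
// JSON.stringify drops properties whose value is undefined, so a
// parse(stringify(...)) round-trip yields an object a schema validator
// can check without tripping over undefined.
const ping = { took: 500, why: undefined };
const roundTripped = JSON.parse(JSON.stringify(ping));
// roundTripped has no "why" property at all; "took" survives unchanged.
```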

> This is a good point, there are a couple reasons I wanted `Promise.defer` (mostly because `new Promise` runs its function asynchronously), but I'll try harder to avoid it.
> 
> (Of course, ideally those utilities would still be available somewhere).

> mostly because new Promise runs its function asynchronously

It runs it synchronously - https://developer.mozilla.org/en/docs/Web/JavaScript/Reference/Global_Objects/Promise
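
This is easy to verify directly: the executor passed to `new Promise()` runs synchronously, before the constructor returns; only `.then()` callbacks are deferred to a microtask.

```javascript
// The Promise executor runs synchronously during construction.
let order = [];
order.push("before");
new Promise(resolve => {
  order.push("executor"); // runs right here, before the constructor returns
  resolve();
});
order.push("after");
// order is ["before", "executor", "after"]
```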

> This seems to be dropped/swallowed in the sync error handling code -- https://dxr.mozilla.org/mozilla-central/source/services/sync/modules/resource.js#300 and https://dxr.mozilla.org/mozilla-central/source/services/sync/modules/engines.js#1135-1138 seem to be where the problem is.
> 
> IMO fixing all the places where sync does bad things with errors is out of the scope of this bug, but, well, maybe the data isn't valuable without better error reporting, in which case it might not be.

Yeah, agreed. I still need to circle back on the errors.

> This throws an NS_ERROR_NOT_INITIALIZED: https://dxr.mozilla.org/mozilla-central/source/services/sync/modules/addonsreconciler.js#351.
> 
> But thanks for bringing my attention back to this, I had meant to look into it more, since we were reporting this as "error.engine.reason.unknown_fail", which, well, pretty much should never be reported. Even if we know nothing else, we should be reporting it as an "unexpectederror". So i've fixed this so now it should be an "unexpectederror" with "error": "3253927937" -- which is the code for that nsresult...
> 
> But... I'm not sure what we should do about non-network nsresults that are thrown. It does seem to fall into the category of 'unexpected error', but it probably makes sense to narrow that scope, since we can probably do more with the knowledge of what nsresult it is. (I'd almost say that networkerror is bad and it should just be {"name": "nserror", "result": error.result} or similar. But, I don't know.

That's a good idea. I'm a little worried about some NS_ERROR_XPC_* errors (eg, NS_ERROR_XPC_JAVASCRIPT_ERROR, NS_ERROR_XPC_JS_THREW_EXCEPTION) which sometimes mean "regular JS exception crossed an xpcom (ie, c++->js) boundary, but I'm inclined to think your idea is good and we can fine-tune this later if we start to see those exceptions in large numbers and want to know more.

> Along the lines of my other comment about sync error handling. This is from https://dxr.mozilla.org/mozilla-central/source/services/sync/modules/service.js#551, and the reporting can be somewhat improved by throwing `info` instead (which causes it to show up as a 401 httperror, which possibly should be an autherror instead...), but it seems like really we should be returning something from checkServerError instead.
> 
> But as I mentioned above, it seems like there's a lot of places where the sync code swallows errors or error info, and fixing them doesn't sound like it should be part of this bug. (But I'll totally defer to your judgement here). Changing what we throw here is easy enough that it's worth doing though, especially since it will probably cause other kinds of errors to be reported better as well.

yep, fair enough - maybe add the short version of the above as a comment?
Comment on attachment 8763708 [details]
Bug 1267919 - Part 1. Import Ajv for validation of sync telemetry ping schema

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/59842/diff/2-3/
Comment on attachment 8763709 [details]
Bug 1267919 - Part 2. Add documentation and a schema for a new "sync" telemetry ping.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/59844/diff/3-4/
Comment on attachment 8764763 [details]
Bug 1267919 - Part 3. Implement initial sync telemetry recording code.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/60442/diff/2-3/
https://reviewboard.mozilla.org/r/60442/#review57586

> > mostly because new Promise runs its function asynchronously
> 
> It runs it synchronously - https://developer.mozilla.org/en/docs/Web/JavaScript/Reference/Global_Objects/Promise

Oh, wow, you're totally right. I think at one point I must have confused this with `Promise.resolve().then` and have been wrong about it ever since.
https://reviewboard.mozilla.org/r/59844/#review57660

Just a heads up, a few small changes were made to the schema. networkerror was replaced with nserror (we aren't requiring that the error code be from the network module), and we now record -1 when we can't get a monotonic timestamp.

I assume these are both fine, but let me know if they are not.

> There's some discussion of this in the reply I made to markh's comments just now, but it's not clear from the code that it's impossible (e.g. it's not solely dependent on the OS having a monotonic timer, which they essentially all do). After some discussion on IRC, it looks like this is unlikely but can happen (it shows up 280 times in the longitudinal table, which AIUI is 1% of users over the last 6 months).
> 
> FWIW I think that a reasonable solution might be to attempt to fall back on a non-monotonic timer, but if it reports a negative value, *then* either leave the time out, report 0, or some other error indicator (-1 would be consistent with the rest of telemetry).

Ended up going with -1 for this, which is consistent with the other monotonic timestamps in telemetry.
https://reviewboard.mozilla.org/r/59844/#review57578

> http://json-schema.org/latest/json-schema-validation.html doesn't mention special semantics for maxLength = 0 - is that actually necessary (and should the pattern include start and end of string expressions?)

No special semantics are required for maxLength 0, it will only permit empty strings (this is what we send for auth errors, if we have no uid). You're right about the pattern though.
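
The anchoring point matters because JSON-schema `pattern` matching is unanchored: without `^` and `$` a pattern accepts any string that merely *contains* a match. The exact uid pattern below is assumed for illustration only:

```javascript
// Unanchored regexes (as JSON-schema "pattern" uses them) match
// substrings; anchored ones match the whole string. The 32-hex-char
// pattern here is only an example, not the real schema's pattern.
const unanchored = /[0-9a-f]{32}/;
const anchored = /^[0-9a-f]{32}$/;
const junk = "xx" + "a".repeat(32) + "yy";
// unanchored.test(junk) is true; anchored.test(junk) is false
```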

> I think we should link to the schema itself as the .rst seems likely to get out of date.

Would including a link to the documentation work? There doesn't seem to be a point to having the docs if we're not planning on updating them, and a link in the "description" field of the ping would address that -- I think.
(In reply to Thom Chiovoloni [:tcsc] from comment #42)
> > I think we should link to the schema itself as the .rst seems likely to get out of date.
> 
> Would including a link to the documentation work? There doesn't seem to be a
> point to having the docs if we're not planning on updating them, and a link
> in the "description" field of the ping would address that -- I think.

I'm not quite sure what you are asking here, but I think there's a real risk that someone will change the schema without updating the docs, so I was thinking having the docs point to the canonical schema might help someone who can't reconcile the (outdated) docs with reality. Having the schema point at the docs also makes sense and might help encourage people to update the docs when changing the schema.

(TBH, for this exact reason I probably would have preferred the schema be .js and the docs be embedded in it as comments and extracted as things change, but that doesn't sound feasible)
https://reviewboard.mozilla.org/r/60442/#review58540

Thinking a little more about this error handling, I wonder if we can consider changing the schema so each Sync reports 2 values - an error and the Sync status. eg, there are some cases where sync "fails" without throwing an exception, but the status is recorded, whereas others are actually exceptions (eg, I believe that in some cases Status.sync will be NO_SYNC_NODE_FOUND but the sync will be considered "successful" - we'd have no visibility into that).

IOW, I'm thinking failureReason should always reflect the actual error being thrown (if one was) and a new field syncStatus that always records Status.sync. These strings aren't very short though and will (hopefully) usually be "success.sync" so maybe we could optimize things somehow to keep the size of the payload down (eg, maybe mapping the longer string to a short int, or just making success.sync an empty string, or something)

What do you think?
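
One of the size optimizations suggested above could look roughly like this. The helper name and constant are hypothetical; the only idea taken from the comment is to omit the field entirely for the common success case:

```javascript
// Illustrative sketch: only record syncStatus when it differs from the
// common "success.sync" value, keeping noop/success pings small.
// `statusField` and SYNC_SUCCEEDED are invented names for this sketch.
const SYNC_SUCCEEDED = "success.sync";
function statusField(statusSync) {
  // Returning undefined means the property is simply omitted when the
  // ping is JSON-stringified.
  return statusSync === SYNC_SUCCEEDED ? undefined : statusSync;
}
```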

::: services/sync/modules/browserid_identity.js:120
(Diff revision 3)
>  
> +  userUID() {
> +    if (this._signedInUser) {
> +      return this._signedInUser.uid;
> +    } else {
> +      return '';

It looks like you can throw here now without impacting the telemetry collection.

::: services/sync/tests/unit/head_helpers.js:319
(Diff revision 3)
> +
> +// Used for the (many) cases where we do a 'partial' sync, where only a single
> +// engine is actually synced, but we still want to ensure we're generating a
> +// valid ping. Returns a promise that resolves to the ping, or rejects with the
> +// thrown error after calling an optional callback.
> +function with_validated_engine_sync(engine, success, onError) {

this function name seems a little strange - the "with" in particular. I wonder if something like "sync_engine_and_validate_telem()" or something might be better (it's still a bit wordy though)

::: services/sync/tests/unit/test_errorhandler.js:222
(Diff revision 3)
>  
>    generateCredentialsChangedFailure();
> -  Service.sync();
>  
> +  let ping = yield wait_for_ping(() => Service.sync());
> +  deepEqual(ping.failureReason, { name: "othererror", error: CREDENTIALS_CHANGED });

Re above: For example, in this case what's happening is that an exception is being thrown from https://dxr.mozilla.org/mozilla-central/source/services/sync/modules/stages/enginesync.js#85 (ie, new Error("Aborting sync, remote setup failed")) but CREDENTIALS_CHANGED is the result reported in telemetry. I'm thinking that in this case we should attempt to get *both*
(In reply to Mark Hammond [:markh] from comment #43)
> (In reply to Thom Chiovoloni [:tcsc] from comment #42)
> > > I think we should link to the schema itself as the .rst seems likely to get out of date.
> > 
> > Would including a link to the documentation work? There doesn't seem to be a
> > point to having the docs if we're not planning on updating them, and a link
> > in the "description" field of the ping would address that -- I think.
> 
> I'm not quite sure what you are asking here, but I think there's a real risk
> that someone will change the schema without updating the docs, so I was
> thinking having the docs point to the canonical schema might help someone
> who can't reconcile the (outdated) docs with reality. Having the schema
> point at the docs also makes sense and might help encourage people to update
> the docs when changing the schema.

Hm, alright. That makes sense, I'll update it to point to the docs. My hesitancy was mainly about consistency with the documentation of the other ping types (but that's probably not going to actually be a real problem).
Whiteboard: [data-integrity] → [data-integrity] [sync-metrics]
Comment on attachment 8763709 [details]
Bug 1267919 - Part 2. Add documentation and a schema for a new "sync" telemetry ping.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/59844/diff/4-5/
Comment on attachment 8764763 [details]
Bug 1267919 - Part 3. Implement initial sync telemetry recording code.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/60442/diff/3-4/
https://reviewboard.mozilla.org/r/60442/#review59684

Ran out of time, but I think this is looking great. I'll get back to it ASAP.

::: services/sync/modules/telemetry.js:209
(Diff revisions 3 - 4)
>        version: PING_FORMAT_VERSION,
>        failureReason: this.failureReason,
> +      status: this.status,
>      };
>      let engines = [];
>      for (let engine of this.engines) {

I think we still write an entry for a "do nothing" engine sync - ie, just { took: xxx }

::: services/sync/modules/telemetry.js:218
(Diff revisions 3 - 4)
>      }
>      if (engines.length > 0) {
>        result.engines = engines;
>      }
>      // Avoid uploading a ping that has no information.
>      if (result.failureReason || result.engines) {

and ditto here (maybe :) - can this actually happen in practice? If not, maybe writing the "empty" ping is the right thing to do given it's not expected - it points at an otherwise hidden problem.

::: services/sync/modules/telemetry.js:252
(Diff revisions 3 - 4)
>        }
>      }
>  
> -    if (!this.failureReason) {
> -      if (this.engines.some(e => e.failureReason)) {
> -        // One of the engines failed but we never got the message (this is
> +    let statusObject = {};
> +
> +    // It might be better to check a few times for status.service...

did you mean "things" here? I suspect you are correct, but this should do to get it landed.

::: services/sync/tests/unit/head_helpers.js:255
(Diff revisions 3 - 4)
>  }
>  
>  function assert_valid_ping(record) {
>    if (record) {
>      if (!SyncPingValidator(record)) {
> +      dump("PING: "+JSON.stringify(record, null, 4));

gotta love dump debugging :/

::: services/sync/tests/unit/head_helpers.js:339
(Diff revisions 3 - 4)
> +    // assume that if no failureReason or engine failures are set, and the
> +    // status properties are the same as they were initially, that it's just
> +    // a leftover.
> +    // This is only an issue since we're triggering the sync of just one engine,
> +    // without doing any other parts of the sync.
> +    let initialServiceStatus = ns.Status._service;

If the tests all work, just blindly resetting these to "success" sounds fine. If just a few tests have problems here, please share what they are and it might be better to change them in some way (the convolutions below to check if it really was successful look a bit smelly, so blindly resetting status and letting assert_success_ping determine if it really was successful seems ideal)

In an even more nit-picking vein... I think the "success" param might be back-to-front, or should have a default of true, or we need an assert_error_ping. Someone copying a test might call |sync_and_validate_telem(engine)| expecting success but not noticing they should pass |false|. IIUC, this will work fine but simply not call |assert_success_ping()|. I think we need to arrange so an inverted |success| arg always fails.
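
A hedged sketch of how the symmetric check could work, independent of the real test helpers (`pingFailed` and `assert_ping_outcome` are invented names, and the failure predicate is simplified): with a default of true, a caller who forgets to pass `false` for an expected-failure sync gets a test failure rather than a silent pass, and an inverted flag fails in both directions.

```javascript
// Hypothetical helper: derive success/failure from the ping itself and
// assert it matches the caller's expectation (default: success).
function pingFailed(ping) {
  return !!(ping.failureReason ||
            (ping.engines || []).some(e => e.failureReason));
}
function assert_ping_outcome(ping, success = true) {
  if (pingFailed(ping) === success) {
    throw new Error(`expected ${success ? "success" : "failure"} ping`);
  }
}
```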
Comment on attachment 8763709 [details]
Bug 1267919 - Part 2. Add documentation and a schema for a new "sync" telemetry ping.

https://reviewboard.mozilla.org/r/59844/#review59686

Looks good to me with the new issues addressed, assuming :markh is fine with the new additions about the errors that he requested!

::: toolkit/components/telemetry/docs/sync-ping.rst:8
(Diff revisions 4 - 5)
>  ===========
>  
>  This ping is generated after a sync is completed, for both successful and failed syncs. Its payload contains measurements
>  pertaining to sync performance and error information. It does not contain the environment block, nor the clientId.
>  
> +A JSON-schema document describing the exact format of the ping's payload property can be found at `services/sync/tests/unit/sync\_ping\_schema.json <services/sync/tests/unit/sync\_ping\_schema.json>`_.

<p>I think you can link to DXR for the schema file as well.</p>

::: toolkit/components/telemetry/docs/sync-ping.rst:24
(Diff revisions 4 - 5)
>          took: <integer duration in milliseconds>,
>          uid: <string>, // FxA unique ID, or empty string
>          didLogin: <bool>, // optional, is this the first sync after login? excluded if we don't know.
>          why: <string>, // optional, why the sync occured, excluded if we don't know.
>  
>          // optional, excluded if there was no error.

nit: could you please use a capital letter at the beginning of the comment and add a trailing "."?

That's not mandatory, but it would be good to have a consistent style within this file: the "status" comments all start with capital letters and terminate with a trailing ".", while the others don't.
Attachment #8763709 - Flags: review?(alessio.placitelli) → review+
https://reviewboard.mozilla.org/r/60442/#review59684

> did you mean "things" here? I suspect you are correct, but this should do to get it landed.

I'm not sure what you mean.

My comment is mainly about the fact that a number of things clear out Status.service, or replace it with less useful data. If we check a few times, over the course of syncing, we're more likely to see the useful information instead of just a very high level overview of how the sync went.

> gotta love dump debugging :/

Crap, forgot to delete that one...

> If the tests all work, just blindly resetting these to "success" sounds fine. If just a few tests have problems here, please share what they are and it might be better to change them in some way (the convolutions below to check if it really was successful looks a bit smelly, so blindly reseting status and letting assert_success_ping determine if it really was successful seems ideal)
> 
> In an even more nit-picking vein... I think the "success" param might be back-to-front, or should have a default of true, or we need an assert_error_ping. Someone copying a test might call |sync_and_validate_telem(engine)| expecting success but not noticing they should pass |false|. IIUC, this will work fine but simply not call |assert_succcess_ping()|. I think we need to arrange so an inverted |success| arg always fails.

Yes, all the tests work as it is now. It's actually a large number of tests that had issues...  Nearly every test that used this function did, in fact.

Essentially what would happen is that they'd fail to create the collection URL, due to storageURL being undefined. I'll admit to not digging into this a ton, but my assumption was that it was due to us skipping the code in EngineSynchronizer.prototype.sync. But I could definitely be wrong, I'll dig into this some more.


I agree WRT the success argument -- it definitely seems like it should ensure success by default.
Comment on attachment 8763708 [details]
Bug 1267919 - Part 1. Import Ajv for validation of sync telemetry ping schema

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/59842/diff/3-4/
Comment on attachment 8763709 [details]
Bug 1267919 - Part 2. Add documentation and a schema for a new "sync" telemetry ping.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/59844/diff/5-6/
Comment on attachment 8764763 [details]
Bug 1267919 - Part 3. Implement initial sync telemetry recording code.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/60442/diff/4-5/
https://reviewboard.mozilla.org/r/59844/#review59686

> <p>I think you can link to DXR for the schema file as well.</p>

Ah, crap, that was supposed to be a DXR link. Thanks.

> nit: could you please use a capital letter at the beginning of the comment and add a trailing "."?
> 
> That's not mandatory, but it would be good to have a consistent styile within this file: the "status" comments all start with capital letters and terminate with a trailing ".", while the others don't.

All the comments should start with a capital letter now, except for the ones that start with a string naming an enum. 

(Initially they did not, to be consistent with the summaries in the other files.)
https://reviewboard.mozilla.org/r/60442/#review59684

> I'm not sure what you mean.
> 
> My comment is mainly about the fact that a number of things clear out Status.service, or replace it with less useful data. If we check a few times, over the course of syncing, we're more likely to see the useful information instead of just a very high level overview of how the sync went.

I thought the comment was supposed to read "// It might be better to check a few *things* for status.service..."
https://reviewboard.mozilla.org/r/60442/#review59684

> I thought the comment was supposed to read "// It might be better to check a few *things* for status.service..."

Sorry - to be clear, I thought you meant "a few things" as I thought you were saying "there are possibly a few Status.service values we should consider as success". I don't *think* it's worthwhile checking this at other times TBH - Sync tries hard to have a sane status after sync completes as some of the UI used to rely on it and I believe the scheduler still does - so I suspect you are just seeing weirdness in tests and not with real syncs - but please let me know if there are specific cases you have seen that are of concern.
https://reviewboard.mozilla.org/r/60442/#review59950

::: services/sync/modules/telemetry.js:33
(Diff revision 5)
> +XPCOMUtils.defineLazyServiceGetter(this, "Telemetry",
> +                                   "@mozilla.org/base/telemetry;1",
> +                                   "nsITelemetry");
> +
> +const log = Log.repository.getLogger("Sync.Telemetry");
> +

we need to set a level for the log - I think:

> log.level = Log.Level[Svc.Prefs.get("log.logger.telemetry", "Trace")];

is all we need (no need to define the pref IMO - if someone wants to tweak it they can just create the pref)

(Alternatively, I think "Debug" would be OK too, but in that case I'd probably want a log.info() in .submit or somewhere else so that we get a feel-good feeling of seeing at least 1 log from telemetry in each sync log. Setting a default of Trace means we get your observer traces)

::: services/sync/modules/telemetry.js:363
(Diff revision 5)
> +    this.current = null;
> +    this.submit(current.toJSON());
> +  }
> +
> +  observe(subject, topic, data) {
> +    log.trace(`observed ${topic} ${subject} ${data}`);

logging subject here probably doesn't make much sense - it's always going to be "null" or "[object Object]" (and I don't think it's worthwhile trying to stringify the wrapped object for trace level debugging)

::: services/sync/tests/unit/head_helpers.js:255
(Diff revision 5)
> +}
> +
> +function assert_valid_ping(record) {
> +  if (record) {
> +    if (!SyncPingValidator(record)) {
> +      dump("PING: "+JSON.stringify(record, null, 4));

dump

::: services/sync/tests/unit/head_helpers.js:301
(Diff revision 5)
> +function validate_all_future_pings() {
> +  let telem = get_sync_test_telemetry();
> +  telem.submit = assert_valid_ping;
> +}
> +
> +function wait_for_ping(callback, success) {

This function is almost always called with Service.sync() - what do you think about a helper sync_and_validate_telem?

Either way, I think my earlier comment about the |success| param should apply here too - it should fail if success = false but the ping was actually a success.

::: services/sync/tests/unit/test_telemetry.js:135
(Diff revision 5)
> +      name: "othererror",
> +      error: "error.engine.reason.record_download_fail"
> +    });
> +
> +  } finally {
> +    store.wipe();

Is there any reason we can't put the wipe and clearCache in cleanupAndGo and use it everywhere in this test?
https://reviewboard.mozilla.org/r/60442/#review59950

> we need to set a level for the log - I think:
> 
> > log.level = Log.Level[Svc.Prefs.get("log.logger.telemetry", "Trace")];
> 
> is all we need (no need to define the pref IMO - if someone wants to tweak it they can just create the pref)
> 
> (Alternatively, I think "Debug" would be OK too, but in that case I'd probably want a log.info() in .submit or somewhere else so that we get a feel-good feeling of seeing at least 1 log from telemetry in each sync log. Setting a default of Trace means we get your observer traces)

For what it's worth, |TelemetryController.submitExternalPing| logs (trace level) some useful information about the submitted pings. The log level is controlled by the |toolkit.telemetry.log.level| pref.
https://reviewboard.mozilla.org/r/60442/#review59684

> Sorry - to be clear, I thought you meant "a few things" as I thought you were saying "there are possibly a few Status.service values we should consider as success". I don't *think* it's worthwhile checking this at other times TBH - Sync tries hard to have a sane status after sync completes as some of the UI used to rely on it and I believe the scheduler still does - so I suspect you are just seeing weirdness in tests and not with real syncs - but please let me know if there are specific cases you have seen that are of concern.

So the end result of the above is that I think for now you should just remove that comment :)
Comment on attachment 8764763 [details]
Bug 1267919 - Part 3. Implement initial sync telemetry recording code.

https://reviewboard.mozilla.org/r/60442/#review59964

I think this is looking great and it seems to work as advertised \o/. Please fix the open issues but they are all minor enough I don't think I need to re-review afterwards.
Attachment #8764763 - Flags: review?(markh) → review+
Comment on attachment 8764763 [details]
Bug 1267919 - Part 3. Implement initial sync telemetry recording code.

Argh - I spoke too soon :( Sorry for not picking this up earlier, but I think we need to tighten the size of the data a little - in comment 6 I mentioned a 110 byte payload (optimistic much? ;) but we are well over that.

I think we should:
* Drop the "when" from each engine - we have the when for the Sync itself and we record the took for each engine.

* I'm also surprised to see a number of { took: 0 } entries - I guess it's just a low-res timer? Related, we could also consider a threshold before we write "took" for the engines - I doubt we care much about exact "took" figures for fast syncs - 50ms would remove 5 "tooks" from my sample payload and seems "fast enough" - not sure this is actually worthwhile though?

* Remove all zero values from the "incoming" and "outgoing" objects.

* Hack something up for the "clients" engine - for each Sync it reports:
> { "applied": 1,"succeeded": 1,"failed": 0, "newFailed": 0, "reconciled": $numClients }
Along with not having the 0s, we should arrange to see nothing here at all (as it reflects a noop). I don't really care if we do that in the engine itself or hard-code an engine-name hack in the telemetry module - whatever's cleanest and easiest.

Doing the above (by hand) for one of my syncs (that sent a tab but was otherwise a noop, and *without* removing 5 "tooks") took the payload (without newlines) from 609 bytes to 351. Taking out "tooks" of < 50 brings it to 303, and removing the 1 tab from the payload brings it down to 279. Not quite 110, but hopefully good enough.
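The trimming described above could be sketched roughly like this (a hypothetical helper, not the actual patch - the record shape here is simplified, with "incoming"/"outgoing" as flat counter objects):

```javascript
// Hypothetical sketch of trimming a per-engine record: drop the per-engine
// "when", omit "took" when it's 0, and remove zero-valued counters from the
// "incoming"/"outgoing" objects so a no-op engine shrinks to just its name.
function trimEngineRecord(engine) {
  const out = { name: engine.name };
  if (engine.took > 0) {
    out.took = engine.took;
  }
  for (const key of ["incoming", "outgoing"]) {
    if (!engine[key]) {
      continue;
    }
    const counts = {};
    for (const [counter, value] of Object.entries(engine[key])) {
      if (value !== 0) {
        counts[counter] = value;
      }
    }
    // Only keep the object if any non-zero counters survived.
    if (Object.keys(counts).length) {
      out[key] = counts;
    }
  }
  return out;
}
```

With this, a no-op engine like `{ name: "bookmarks", when: ..., took: 0, incoming: { applied: 0, failed: 0 } }` collapses to `{ name: "bookmarks" }`.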
Attachment #8764763 - Flags: review+
(In reply to Alessio Placitelli [:Dexter] from comment #59)
> For what it's worth, |TelemetryController.submitExternalPing| logs (trace
> level) some useful information about the submitted pings. The log level is
> controlled by the |toolkit.telemetry.log.level| pref.

Thanks, that's useful to know, but I don't think it helps us too much here - I doubt we want to write trace-level logging from telemetry to every sync log, as presumably many such messages aren't related to Sync. If the telemetry framework was set up such that only stuff we do could be logged it might make more sense, but TBH I'm not sure it is worth spending any effort on that.
(In reply to Mark Hammond [:markh] from comment #63)
> Comment on attachment 8764763 [details]
> Bug 1267919 - Part 3. Implement initial sync telemetry recording code.
> 
> Argh - I spoke too soon :( Sorry for not picking this up earlier, but I
> think we need to tighten the size of the data a little - in comment 6 I
> mentioned a 110 byte payload (optimistic much? ;) but we are well that over
> that.

I think it increased a good amount in the last patch (we always submit all engines) so I'

> 
> I think we should:
> * Drop the "when" from each engine - we have the when for the Sync itself
> and we record the took for each engine.
> 
> * I'm also surprised to see a number of { took: 0 } entries - I guess it's
> just a low-res timer? 

Yes, I found this out while digging through the telemetry timing code.  The monotonic timer used by telemetry is explicitly low resolution (it uses `TimeStamp::NowLoRes()`). This seems, well, less than ideal to me, but apparently there are performance issues with the high res timer on some versions of Windows.

> Related, we could also consider a threshold before we
> write "took" for the engines - I doubt we care much about exact "took"
> figures for fast syncs - 50ms would remove 5 "tooks" from my sample payload
> and seems "fast enough" - not sure this is actually worthwhile though?

Well, if we did this (and I'm writing up a patch to do it now), there's really no reason to include do-nothing engines in the sync, since they would literally only have {"name": "foo"}.

> 
> * Remove all zero values from the "incoming" and "outgoing" objects.
> 
> * Hack something up for the "clients" engine, or each Sync it reports:
> > { "applied": 1,"succeeded": 1,"failed": 0, "newFailed": 0, "reconciled": $numClients }
> along with not having the 0s, we should arrange to see nothing here (as it
> reflects a noop). I don't really care if we do that in the engine itself or
> hard-code an engine name hack in the telemetry module - whatever's cleanest
> and easiest.

IMO a hardcoded check in telemetry is simpler and more flexible. Of course, we'd only want to ignore it in the success case, I think (e.g. if failed is nonzero, then we want to know).
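The hardcoded check could look something like this (an illustrative sketch only - names like `shouldIgnoreEngine` and `failureReason` are made up, and the record shape is simplified):

```javascript
// Hypothetical sketch: suppress the "clients" engine record when it only
// reflects the usual no-op reconcile, but keep it whenever anything failed.
function shouldIgnoreEngine(engine) {
  if (engine.name !== "clients") {
    return false;
  }
  if (engine.failureReason) {
    // Only ignore it in the success case.
    return false;
  }
  const { applied = 0, failed = 0, newFailed = 0 } = engine.incoming || {};
  // A routine clients sync applies at most its own record and fails nothing.
  return failed === 0 && newFailed === 0 && applied <= 1;
}
```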

Related, we can get rid of the `success` property -- it looks like it's always just `applied - failed`, and the code that generates this record confirms this.

> Doing the above (by hand) for one of my syncs (that sent a tab but was
> otherwise a noop, and *without* removing 5 "tooks") took the payload
> (without newlines) from 609 bytes to 351. Taking "tooks" of < 50 out is 303.
> Taking the 1 tab from the payload and it's down to 279. Not quite 110, but
> hopefully good enough.

Are these numbers compressed or uncompressed? I just asked on #telemetry - we compress the pings before sending, so knowing how much we save in the compressed data seems more important.
Argh, forgot to finish typing my first comment.  That should be, "I think it increased a good amount in the last patch (we always submit all engines) so I'm not sure you could have noticed this before, at least not as much".
> Well, if we did this (and I'm writing up a patch to do it now), there's really no reason to include do-nothing engines in the sync. If they would literally only have {"name": "foo"}

No, I changed my mind on this, it's still useful to have the information that we tried to do something.

After making the changes (but leaving non-zero "took" values -- cutting off at, e.g. 50 is a weird special case that I'm not a fan of, especially since we'd probably need to document that value with the telemetry...) I get 692b for an initial sync (274b gzipped), and 181b for a single tab sync (155b gzipped).

Hopefully this is small enough. Taking a glance at it (https://gist.github.com/thomcc/27b391be971ba8cee9e0f0b3058229f4), it doesn't look like any of that data is redundant or useless, unless we want to start shortening property names or something (and I tried this - moving to single-character property names saves around 10%-20% compressed... which isn't trivial, but doesn't seem worth it to me).
Comment on attachment 8763708 [details]
Bug 1267919 - Part 1. Import Ajv for validation of sync telemetry ping schema

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/59842/diff/4-5/
Comment on attachment 8763709 [details]
Bug 1267919 - Part 2. Add documentation and a schema for a new "sync" telemetry ping.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/59844/diff/6-7/
Comment on attachment 8764763 [details]
Bug 1267919 - Part 3. Implement initial sync telemetry recording code.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/60442/diff/5-6/
Comment on attachment 8764763 [details]
Bug 1267919 - Part 3. Implement initial sync telemetry recording code.

https://reviewboard.mozilla.org/r/60442/#review60454

That looks awesome and the ping size is great - thanks!
Attachment #8764763 - Flags: review+
Comment on attachment 8763708 [details]
Bug 1267919 - Part 1. Import Ajv for validation of sync telemetry ping schema

https://reviewboard.mozilla.org/r/59842/#review60456
Attachment #8763708 - Flags: review?(markh) → review+
Comment on attachment 8764763 [details]
Bug 1267919 - Part 3. Implement initial sync telemetry recording code.

https://reviewboard.mozilla.org/r/60442/#review60572
Attachment #8764763 - Flags: review?(alessio.placitelli) → review+
Pushed by cbook@mozilla.com:
https://hg.mozilla.org/integration/fx-team/rev/2e399342098f
Part 1. Import Ajv for validation of sync telemetry ping schema. r=Dexter,markh
https://hg.mozilla.org/integration/fx-team/rev/2ee8d8806fca
Part 2. Add documentation and a schema for a new "sync" telemetry ping. r=bsmedberg,Dexter
https://hg.mozilla.org/integration/fx-team/rev/f919bd0f63f4
Part 3. Implement initial sync telemetry recording code. r=Dexter,markh
Keywords: checkin-needed
Depends on: 1287473
Depends on: 1288445
Depends on: 1295058