Closed Bug 1267919 Opened 3 years ago Closed 3 years ago

Create new sync telemetry ping type.

Categories

(Firefox :: Sync, defect, P1)

Tracking

RESOLVED FIXED
Firefox 50
Tracking Status
firefox50 --- fixed

People

(Reporter: markh, Assigned: tcsc)

References

(Depends on 1 open bug, Blocks 1 open bug)

Details

(Whiteboard: [data-integrity] [sync-metrics])

Attachments

(4 files)

Attached file sync-telemetry.json
We want to submit custom telemetry data for Sync health. This bug is for the client side work of this effort and encompasses defining the first cut of a json schema for the data, then collecting this data and submitting it as a new ping type. I'll open further bugs (so the pipeline ops team knows about the new ping type, and for redshift processing etc) once we have general agreement on the direction.

I believe the back-end processing of this ping data will expect to validate the data against the defined schema - so I think we also need to perform this validation as part of client-side tests to have any hope of getting it right - bug 1249925 tracks that in general so I'm having that bug block this.

I'm attaching a proposed JSON Schema for this work. At a high-level, each ping will include an array of "syncs", where each element is the info for a single sync - so no data is rolled-up between Syncs (ie, there's no concept like "count of successful syncs" - just an array of syncs, each one of which may have success or failure information). Each "sync" record has its own "engines" array, which lists information about every engine that synced. The expectation is that the reduction of these records into summary information will be done at the back-end.

In later bugs we will look to extend this schema into collecting further "health" related stats, but this bug is just to get the ball rolling with a relatively minimal set of data for our initial dashboards.

An example of this data is below. It shows a ping containing a single sync that started at the timestamp 12345467, took 560ms, was done on behalf of the Sync user with UID "abcdefg", has no failure (ie, it succeeded), and only Synced one engine, "bookmarks". The bookmarks engine took 500ms to sync, saw 100 incoming records, but only applied 50 of them, and 50 of them failed with 5 of them being new failures in this sync. It also attempted to upload 100 records, but the server reported 3 of them failed:

{
  syncs: [{
    when: 12345467,
    took: 560,
    uid: "abcdefg",
    failureReason: null,
    engines: [{
      name: "bookmarks",
      took: 500,
      incoming: {
        considered: 100,
        applied: 50,
        failed: 50,
        failedNew: 5,
      },
      outgoing: [{
        sent: 100,
        failed: 3,
      }],
    }]
  }],
}

Note that the attached schema is more JS than JSON - it has comments, doesn't include quotes around all property names, has trailing commas, etc. From my experiments some validators allow this while others do not. Allowing comments seems particularly useful - but ultimately this will probably depend on exactly what validator we choose to use in the front-end (which I assume itself depends on what is used on the back-end?). This schema is valid according to http://www.jsonschemavalidator.net/
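To make the central constraint concrete - each sync record should carry either engine results or a failure reason - here is a hand-rolled check against the example ping above. This is purely an illustrative sketch, not part of the proposal, and independent of whichever schema validator ends up being used:

```javascript
// Illustrative only: checks the "engines or failureReason" constraint from
// the proposed schema by hand, without any JSON Schema validator.
function syncRecordLooksValid(sync) {
  const hasBase = ["when", "took", "uid"].every(k => k in sync);
  const hasEngines = Array.isArray(sync.engines) && sync.engines.length > 0;
  const hasFailure = sync.failureReason != null;
  return hasBase && (hasEngines || hasFailure);
}

// The example ping from this comment, written as strict-JSON-compatible JS.
const examplePing = {
  syncs: [{
    when: 12345467,
    took: 560,
    uid: "abcdefg",
    failureReason: null,
    engines: [{
      name: "bookmarks",
      took: 500,
      incoming: { considered: 100, applied: 50, failed: 50, failedNew: 5 },
      outgoing: [{ sent: 100, failed: 3 }],
    }],
  }],
};

console.log(examplePing.syncs.every(syncRecordLooksValid)); // true
```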

Georg, does this all look like it is the correct direction to take here? Does the schema look reasonable and looks like something we can start working from? (the schema is a WIP and requires further fleshing out, but it should convey the idea)
Attachment #8745856 - Flags: feedback?(gfritzsche)
Flags: firefox-backlog+
Comment on attachment 8745856 [details]
sync-telemetry.json

We should use strict JSON for the schemas.
The schemas are currently living in [0].
You will probably want to put a ping format version number into the payload, so you can tell different versions apart.
I'd recommend to involve bsmedberg early for quick design input, to avoid bigger re-designs later after the data collection review [1].

Some questions to dive into it a bit:
* This seems like a lot of info to send - will this be opt-out or opt-in?
* At what interval is this sent? Once per day/week/...?
* Do you need to know the detailed info for each individual sync? Why can't you aggregate on the client side?

I think unless there is a compelling reason to not aggregate, we really want to submit the minimal data possible, both for privacy reasons as well as storage and processing cost.
  
0: https://github.com/mozilla-services/mozilla-pipeline-schemas/
1: https://wiki.mozilla.org/Firefox/Data_Collection
Attachment #8745856 - Flags: feedback?(gfritzsche)
Thanks for the feedback Georg.

(In reply to Georg Fritzsche [:gfritzsche] from comment #1)
> We should use strict JSON for the schemas.

My intent was to end up with a strict schema, but I'm not sure I'm using the term in the same way you are :) eg, looking at the schemas in https://github.com/mozilla-services/mozilla-pipeline-schemas/, they seem roughly the same as my WIP - less strict if anything (eg, core.schema.json doesn't specify additionalProperties=false).

What specifically makes a schema "strict" in this context?

> You will probably want to put a ping format version number into the payload,
> so you can tell different versions apart.

Yeah, thanks.

> I'd recommend to involve bsmedberg early for quick design input, to avoid
> bigger re-designs later after the data collection review [1].

Done, thanks.

> Some questions to dive into it a bit:
> * This seems like a lot of info to send - will this be opt-out or opt-in?

We intend for this to be opt-out, using the existing opt-out mechanisms, but obviously only Sync users would send this ping. We are looking to include the sync "userid" (a guid) in the payload so we can perform long-term user-based analysis of client data, and also to correlate it with the information Danny Coates is already collecting on the server and funneling into Presto.

> * At what interval is this sent? Once per day/week/...?

That's a question I hadn't considered yet, and ultimately it may depend on the data format we end up with - my gut tells me that the more aggregated the data is, the more often we will want to send it, simply because sending more frequently leaves each payload less heavily aggregated - more below.

> * Do you need to know the detailed info for each individual sync? Why can't
> you aggregate on the client side?
> 
> I think unless there is a compelling reason to not aggregate, we really want
> to submit the minimal data possible, both for privacy reasons as well as
> storage and processing cost.

I initially assumed we'd want aggregated data, but recently concluded that individual records with aggregation done by the pipeline made more sense. This was mainly due to discussions with :rvitillo and Ilana Segal about the UITelemetry payload - Robert said he was looking to move away from aggregated data about UI clicks to a format like |{"Toolbar button X": [ts1, ts2, ts3]}|, and Ilana mentioned she preferred this format as it allows deeper analysis, such as "did the user click button 1 followed by button 2", which is impossible when the payload just records the count of clicks for each button.

Similarly for Sync, it would allow analysis of things like "were we more likely to see sync errors if the previous sync was interrupted by app shutdown?" or "are we more likely to see validation errors after applying incoming records?", which doesn't seem possible with client-side aggregation.

Obviously storage and processing costs are critical, but I don't know how to make the correct tradeoffs here.

Robert/Benjamin, are you able to offer additional guidance here?

Thanks!
Flags: needinfo?(rvitillo)
Flags: needinfo?(benjamin)
(In reply to Mark Hammond [:markh] from comment #2)
> Thanks for the feedback Georg.
> 
> (In reply to Georg Fritzsche [:gfritzsche] from comment #1)
> > We should use strict JSON for the schemas.
> 
> My intent was to end up with a strict schema,

Oh - re-reading, I see now you mean "the schema should be strict JSON, not a javascript-y JSON", so please ignore that question.

(pity though - comments in schemas seem valuable ;)
(In reply to Mark Hammond [:markh] from comment #2)
> I initially assumed we'd want aggregated data, but recently concluded that
> individual records with aggregation done by the pipeline made more sense.
> This was mainly due to discussions with :rvitillo and Ilana Segal about the
> UITelemetry payload - Robert said he was looking to move away from
> aggregated data about UI clicks to a format like |{"Toolbar button X": [ts1,
> ts2, ts3]}| and Ilana mentioned she preferred this format as it allows
> deeper analysis, such as "did the user click button 1 followed by button 2",
> which is impossible when the payload just records the count of clicks for
> each button.

For UITelemetry it makes sense as it grants the analyst the ability to perform exploratory analyses that wouldn't be possible with pre-aggregated data. I am not sure that's the main use-case for Sync's data, though I might be wrong.

> Similarly for Sync, it would allow analysis of things like "were we more
> likely to see sync errors if the previous sync was interrupted by app
> shutdown?" or "are we more likely to see validation errors after applying
> incoming records?", which doesn't seem possible with client-side aggregation.

What are you planning to do with those pings? What are the main questions that you want to answer with it? Who is going to be the main user of that data? Engineers, PMs, analysts, ...?

You would need to [over]estimate the total size of the raw uncompressed data for a given time interval given say:
- The number of users of Sync
- An estimate of the distribution of the number of pings sent by the users in a given time interval
- An estimate of the distribution of the size of a ping

If the estimated amount of data generated in a day/week/month isn't too high then we could choose to go with the schema that gives us the greatest amount of flexibility.
Flags: needinfo?(rvitillo)
Blocks: 1263835
From a data perspective, the questions I need to understand are:

* what is the privacy risk?
** My main concern here is about error reasons. It's easy for errors to contain identifying data by accident, and I don't quite understand the JSON schema here and what data is actually being sent about errors.
** do the sync engine names represent privacy risk? The basic engines like bookmarks and history probably don't: but if we have other custom engines (especially if they can be provided by addons), that might represent subsets of sync data that give away browsing history patterns that might be sensitive. Can we hardcode the list of engines?

* is this the minimal set of data required to answer the questions you're asking?
** I can't know this without understanding your questions.

I assume that these payloads will not contain the telemetry client ID. Will they contain the environment block?

On a purely technical note, if you're going to have individual records, we should consider sending each as a separate ping. It will reduce client complexity, and especially if you don't include the environment block it won't increase the total size by a significant amount.
Flags: needinfo?(benjamin)
Assignee: nobody → markh
Blocks: 1250012
Flags: firefox-backlog+
Priority: -- → P1
(In reply to Roberto Agostino Vitillo (:rvitillo) from comment #4)
> (In reply to Mark Hammond [:markh] from comment #2)
> > I initially assumed we'd want aggregated data, but recently concluded that
> > individual records with aggregation done by the pipeline made more sense.
> > This was mainly due to discussions with :rvitillo and Ilana Segal about the
> > UITelemetry payload - Robert said he was looking to move away from
> > aggregated data about UI clicks to a format like |{"Toolbar button X": [ts1,
> > ts2, ts3]}| and Ilana mentioned she preferred this format as it allows
> > deeper analysis, such as "did the user click button 1 followed by button 2",
> > which is impossible when the payload just records the count of clicks for
> > each button.
> 
> For UITelemetry it makes sense as it grants the analyst the ability to
> perform exploratory analyses that wouldn't be possible with the
> pre-aggregated data. I am not sure that's the main use-case for Sync's data
> though, but I might be wrong.

I was under the impression that UITelemetry's main use-case isn't to perform exploratory analyses, merely that the ability to do so in a post-hoc manner is considered useful? I'd say Sync sits similarly - it's not the primary use-case, but it would be useful to be able to do so.

> What are you planning to do with those pings? What are the main questions
> that you want to answer with it? Who is going to be the main user of that
> data? Engineers, PMs, analysts, ...?

Initially it would be engineers, but it would probably feed into PMs. I've tried to document the initial use for these metrics at https://gist.github.com/mhammond/e51a494cb04dc0acd44acf2c1589e7c5.

> You would need to [over]estimate the total size of the raw uncompressed data
> for a given time interval given say:
> - The number of users of Sync
> - An estimate of the distribution of the number of pings sent by the users
> in a given time interval
> - An estimate of the distribution of the size of a ping
> 
> If the estimated amount of data generated in a day/week/month isn't too high
> then we could choose to go with the schema that gives us the greatest amount
> of flexibility.

Using some info provided by Danny, it seems that for all devices (ie, not limited to desktop) we see ~112M syncs per day. The most-written collection is "tabs" where we see ~55M writes per day. So let's say 1/2 of those 112M syncs are no-ops (ie, we find there's nothing incoming and we have no outgoing items).

If I tweak the schema to record a no-op Sync minimally, each Sync would be about 110 bytes of uncompressed json - quite a bit less if we reported the Sync userid once per ping rather than in each sync record. I'm guessing we'd expect that to double for a Sync that actually reads and writes a lot of stuff - so say 60M per day at 110 bytes and 60M per day at 220 bytes. That doesn't include any additional telemetry overhead, just the size of the sync data.
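For what it's worth, the back-of-envelope arithmetic implied by those figures (raw uncompressed payload bytes only, ignoring any telemetry overhead):

```javascript
// Rough daily-volume sketch using the figures above: ~60M no-op syncs at
// ~110 bytes of JSON each, ~60M "real" syncs at ~220 bytes each.
const noopBytes = 60e6 * 110;   // 6.6 GB
const fullBytes = 60e6 * 220;   // 13.2 GB
const totalBytes = noopBytes + fullBytes;
console.log(`~${(totalBytes / 1e9).toFixed(1)} GB/day uncompressed`);
// prints "~19.8 GB/day uncompressed"
```

So roughly 20GB/day of raw JSON before any compression or transformation by the pipeline.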

I'm not sure how to translate that into actual storage though - JSON is obviously very verbose and wouldn't have duplicated field names once transformed into a different store. OTOH, JSON will compress well - so I'm not sure how to use those stats to come up with the figure you are asking for. Can you help here?

As mentioned, I went for non-aggregated data mainly to offer future flexibility in analysis, so if this looks like being a deal-breaker, I'd happily accept client aggregated records instead (although as Benjamin notes, determining exactly when to send the ping would seem to add a fair bit of client complexity, so I'd still prefer non-aggregated data).

(In reply to Benjamin Smedberg  [:bsmedberg] from comment #5)
> From a data perspective, the questions I need to understand are:
> 
> * what is the privacy risk?
> ** My main concern here is about error reasons. It's easy for errors to
> contain identifying data by accident, and I don't quite understand the JSON
> schema here and what data is actually being sent about errors.

It would be minimal - eg, no stack traces etc. In some cases it would be just a string to indicate the error category (eg, "app shutdown") but in some other cases it might include the http status code (if a request failed), or possibly the NS_ERROR_* value (if we failed to even make the request)
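For illustration only - the names and fields below are assumptions, not a settled format - the three failure variants described above (error category, HTTP status, NS_ERROR_* value) might be modelled in the schema along these lines:

```json
{
  "oneOf": [
    { "properties": { "name": { "enum": ["shutdownerror", "othererror"] } } },
    { "properties": { "name": { "enum": ["httperror"] },
                      "code": { "type": "integer" } } },
    { "properties": { "name": { "enum": ["nserror"] },
                      "code": { "type": "integer" } } }
  ]
}
```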

> ** do the sync engine names represent privacy risk? The basic engines like
> bookmarks and history probably don't: but if we have other custom engines
> (especially if they can be provided by addons), that might represent subsets
> of sync data that give away browsing history patterns that might be
> sensitive. Can we hardcode the list of engines?

We can either exclude all 3rd-party engines or put them into "other" - I guess we might as well exclude them as putting them all into the same bucket would prevent the results being actionable (ie, without knowing what addons are represented, we will not know who to ping about a specific addon's failure rates)

> * is this the minimal set of data required to answer the questions you're
> asking?
> ** I can't know this without understanding your questions.

Those error records aren't strictly necessary for the artifacts I list in https://gist.github.com/mhammond/e51a494cb04dc0acd44acf2c1589e7c5, but I'm predicting they will be valuable once we get some dashboards in place. I'd happily remove them if that is a deal-breaker. Similarly for, eg, how long each sync took - not in the initial artifacts but future analysis on them would be interesting.

> I assume that these payloads will not contain the telemetry client ID. Will
> they contain the environment block?

I expect we will want the "build" record initially, but probably don't need the rest of the default environment block.

> On a purely technical note, if you're going to have individual records, we
> should consider sending each as a separate ping. It will reduce client
> complexity, and especially if you don't include the environment block it
> won't increase the total size by a significant amount.

SGTM.

Adding needinfo back to Roberto and Benjamin to make sure this answers their questions and to help me work out the next steps in moving this forward.

Thanks all!
Flags: needinfo?(rvitillo)
Flags: needinfo?(benjamin)
> > What are you planning to do with those pings? What are the main questions
> > that you want to answer with it? Who is going to be the main user of that
> > data? Engineers, PMs, analysts, ...?
> 
> Initially it would be engineers, but it would probably feed into PMs. I've
> tried to document the initial use for these metrics at
> https://gist.github.com/mhammond/e51a494cb04dc0acd44acf2c1589e7c5

This is excellent. I like how you've included both the question and the visualization.

> > ** My main concern here is about error reasons. It's easy for errors to
> > contain identifying data by accident, and I don't quite understand the JSON
> > schema here and what data is actually being sent about errors.
> 
> It would be minimal - eg, no stack traces etc. In some cases it would be
> just a string to indicate the error category (eg, "app shutdown") but in
> some other cases it might include the http status code (if a request
> failed), or possibly the NS_ERROR_* value (if we failed to even make the
> request)

Can you describe exactly where the error string will come from? I'm mainly worried about catching JS exceptions and including those unfiltered in telemetry reports. If it's possible to enumerate all the possible error types being caught, then I have no problems with this.

> guess we might as well exclude them as putting them all into the same bucket
> would prevent the results being actionable (ie, without knowing what addons
> are represented, we will not know who to ping about a specific addon's
> failure rates)

Yeah, let's exclude these for now.

> > I assume that these payloads will not contain the telemetry client ID. Will
> > they contain the environment block?
> 
> I expect we will want the "build" record initially, but probably don't need
> the rest of the default environment block.

We always submit the "application" block from https://gecko.readthedocs.io/en/latest/toolkit/components/telemetry/telemetry/common-ping.html. Will that be sufficient?

I have no problems with what you've described, so I'm happy to count that as a preliminary data review and I'll do the final signoff when this is done and you've added in-tree docs.
Flags: needinfo?(benjamin)
(In reply to Mark Hammond [:markh] from comment #6)
> > What are you planning to do with those pings? What are the main questions
> > that you want to answer with it? Who is going to be the main user of that
> > data? Engineers, PMs, analysts, ...?
> 
> Initially it would be engineers, but it would probably feed into PMs. I've
> tried to document the initial use for these metrics at
> https://gist.github.com/mhammond/e51a494cb04dc0acd44acf2c1589e7c5.

That's very clear, thanks.

> As mentioned, I went for non-aggregated data mainly to offer future
> flexibility in analysis, so if this looks like being a deal-breaker, I'd
> happily accept client aggregated records instead (although as Benjamin
> notes, determining exactly when to send the ping would seem to add a fair
> bit of client complexity, so I'd still prefer non-aggregated data)

I am OK using non-aggregated data given the estimated volume mentioned in Comment 6.
Flags: needinfo?(rvitillo)
Whiteboard: [sync-data-integrity]
Assignee: markh → tchiovoloni
Attachment #8763708 - Flags: feedback?(markh)
Attachment #8763708 - Flags: feedback?(benjamin)
Attachment #8763709 - Flags: feedback?(markh)
Attachment #8763709 - Flags: feedback?(benjamin)
Comment on attachment 8763708 [details]
Bug 1267919 - Part 1. Import Ajv for validation of sync telemetry ping schema

I don't think, as data steward, I'm the right person to be providing feedback on this. If you're looking for a technical review of what you're doing, you should talk to Georg and/or Alessio.
Attachment #8763708 - Flags: feedback?(benjamin)
Comment on attachment 8763709 [details]
Bug 1267919 - Part 2. Add documentation and a schema for a new "sync" telemetry ping.

What are you asking me to review in this patch? Typically I review changes to histograms.json and the doc files. Are you asking me to review the schema? I can do that, but I still want this to show up in human-readable and linkable doc files.

Is there a tool that translates these schemas into human-readable English descriptions?

I'd be looking for the answers to the following questions:
* When will this ping be sent for each client?
* What is the ping "type" field?
* What other metadata will the ping contain, such as clientid or the environment block?
* What are the valid values for strings like "name"? If we can't have a discrete list in the schema itself, can we reference other parts of the code?
* What are the units of measurement for numbers like "took"?
* What period of time does "took" cover? Is that wall-clock time or CPU time? If the client's clock changes during the measurement, will this metric still be monotonic or skew with the clock?
Attachment #8763709 - Flags: feedback?(benjamin) → feedback-
Review commit: https://reviewboard.mozilla.org/r/60442/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/60442/
Attachment #8763708 - Attachment description: Bug 1267919 - Part 1. Import Ajv for validation of sync telemetry ping schema f?bsmedberg,markh → Bug 1267919 - Part 1. Import Ajv for validation of sync telemetry ping schema
Attachment #8763709 - Attachment description: Bug 1267919 - Part 2. Record telemetry for sync data, and integrate it into existing tests, ensuring that the data is valid using a json-schema. f?bsmedberg,markh → Bug 1267919 - Part 2. Add documentation and a schema for a new "sync" telemetry ping.
Attachment #8764763 - Flags: review?(markh)
Attachment #8764763 - Flags: review?(alessio.placitelli)
Attachment #8763708 - Flags: review?(markh)
Attachment #8763708 - Flags: review?(alessio.placitelli)
Attachment #8763708 - Flags: feedback?(markh)
Attachment #8763709 - Flags: review?(benjamin)
Attachment #8763709 - Flags: review?(alessio.placitelli)
Attachment #8763709 - Flags: feedback?(markh)
Attachment #8763709 - Flags: feedback-
Comment on attachment 8763708 [details]
Bug 1267919 - Part 1. Import Ajv for validation of sync telemetry ping schema

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/59842/diff/1-2/
Comment on attachment 8763709 [details]
Bug 1267919 - Part 2. Add documentation and a schema for a new "sync" telemetry ping.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/59844/diff/1-2/
bsmedberg: Understood. I've removed you from the technical review, and added documentation to the folder where I found the rest of the telemetry documentation. I tried to address all of your questions there, but let me know if I missed anything.
Comment on attachment 8763708 [details]
Bug 1267919 - Part 1. Import Ajv for validation of sync telemetry ping schema

https://reviewboard.mozilla.org/r/59842/#review57404

It looks like there was a discussion behind this in bug 1249925. As discussed, since this code will only be used during tests, security and performance should not be real concerns.

That means I'm ok with this library, but I'm not sure I should be the one making this call.
Attachment #8763708 - Flags: review?(alessio.placitelli) → review+
Comment on attachment 8763709 [details]
Bug 1267919 - Part 2. Add documentation and a schema for a new "sync" telemetry ping.

https://reviewboard.mozilla.org/r/59844/#review57400

We're nearly there, my main concern is with the schema not being a JSON file (if that was discussed somewhere else, sorry for missing the comment!).

::: services/sync/modules-testing/sync_ping_schema.js:5
(Diff revision 2)
> +"use strict";
> +
> +this.EXPORTED_SYMBOLS = ["SyncPingSchema"]
> +
> +// We define this as a module instead of a JSON file that is loaded and parsed

Is there any particular reason why the schema is not a JSON file?
Using JSON would make it easier to use this schema server-side as well. See the [pipeline schemas repo][1].

[1]: https://github.com/mozilla-services/mozilla-pipeline-schemas/

::: services/sync/modules-testing/sync_ping_schema.js:17
(Diff revision 2)
> +  "$schema": "http://json-schema.org/draft-04/schema#",
> +  "description": "schema for Sync pings",
> +  "type": "object",
> +  "additionalProperties": false,
> +
> +  // Either some engines or a failureReason is required

Can we have pings with both _engines_ and _failures_? If not, we should probably explicitly cover this case in the schema.

::: services/sync/modules-testing/sync_ping_schema.js:19
(Diff revision 2)
> +  "type": "object",
> +  "additionalProperties": false,
> +
> +  // Either some engines or a failureReason is required
> +  "anyOf": [
> +    {"required": ["when", "version", "took", "uid", "engines"]},

Since "when", "version", "took", "uid" are always required, maybe we can take them off these two lists and move them to a separate "required" entry.
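A sketch of the suggested factoring (assuming JSON Schema draft 4 semantics, where a top-level "required" and an "anyOf" combine with a logical AND):

```json
{
  "required": ["when", "version", "took", "uid"],
  "anyOf": [
    { "required": ["engines"] },
    { "required": ["failureReason"] }
  ]
}
```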

::: services/sync/modules-testing/sync_ping_schema.js:24
(Diff revision 2)
> +    {"required": ["when", "version", "took", "uid", "engines"]},
> +    {"required": ["when", "version", "took", "uid", "failureReason"]}
> +  ],
> +
> +  "properties": {
> +    "version": { "type": "integer" },

Maybe we can use pattern matching here (and elsewhere in the schema) to strictly check for the expected version format?

Something like: "pattern": "^\\d{2}\\." (the pipeline schemas have some examples about that)

::: services/sync/modules-testing/sync_ping_schema.js:28
(Diff revision 2)
> +  "properties": {
> +    "version": { "type": "integer" },
> +    "didLogin": { "type": "boolean" },
> +    "when": { "type": "integer" },
> +    "why": { "enum": ["startup", "schedule", "score", "user", "tabs"] },
> +    "took": { "type": "integer" },

Since we're using monotonic clocks for this, maybe we can validate that the field only accepts integers >= 0?
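If that's the direction taken, draft 4's `minimum` keyword would express it, e.g.:

```json
"took": { "type": "integer", "minimum": 0 }
```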

::: services/sync/modules-testing/sync_ping_schema.js:43
(Diff revision 2)
> +  "definitions": {
> +    // Each Sync record will have one or more "engine" records.
> +    "engine": {
> +      // We need at least one of "incoming", "outgoing", or "failureReason" to be valid.
> +      "anyOf": [
> +        {"required": ["name", "when", "took", "incoming"]},

I think "name", "when", "took" can be removed from this list (and the next two) since you're requiring them on line 47.

::: services/sync/modules-testing/sync_ping_schema.js:52
(Diff revision 2)
> +      "required": ["name", "when", "took"],
> +      "additionalProperties": false,
> +      "properties": {
> +        "failureReason": { "$ref": "#/definitions/error" },
> +        "when": { "type": "integer" },
> +        "name": { "type": "string" },

Are we allowing third party engines? If not, we could probably be stricter about what we expect here.

::: services/sync/modules-testing/sync_ping_schema.js:53
(Diff revision 2)
> +      "additionalProperties": false,
> +      "properties": {
> +        "failureReason": { "$ref": "#/definitions/error" },
> +        "when": { "type": "integer" },
> +        "name": { "type": "string" },
> +        "took": { "type": "integer" },

Same as the other "took" definition.

::: toolkit/components/telemetry/docs/index.rst:27
(Diff revision 2)
>     core-ping
>     deletion-ping
>     crash-ping
>     uitour-ping
>     heartbeat-ping
> +   sync-ping

Please also add the sync-ping to [pings.rst][1].

[1]: https://dxr.mozilla.org/mozilla-central/rev/d1102663db10b3d4b9358f3cf4e16b7c56902352/toolkit/components/telemetry/docs/pings.rst#51

::: toolkit/components/telemetry/docs/sync-ping.rst:54
(Diff revision 2)
> +              {
> +                sent: <integer>, // number of outgoing records sent
> +                failed: <integer>, // number that failed to send.
> +              }
> +            ],
> +            // optional, excluded in the case of errors

Is this correct? If so, can we expand on the reasons why this could be missing?

::: toolkit/components/telemetry/docs/sync-ping.rst:91
(Diff revision 2)
> +why
> +~~~
> +
> +One of the following values:
> +
> +- ``"startup"``: This is the first sync triggered after browser startup.

Nit: can you remove the quotes from ``"startup"`` and the other ones (e.g. ``startup``)? I think that would be more consistent with what we have in the other pages of the documentation (see the docs for the _reason_ field of the main ping).

::: toolkit/components/telemetry/docs/sync-ping.rst:116
(Diff revision 2)
> +
> +- ``"autherror"``: Indicates an unrecoverable authentication error.
> +
> +    - ``"from"``: Where the authentication error occurred, one of the following values: ``"tokenserver"``, ``"fxaccounts"``, or ``"hawkclient"``.
> +
> +- ``"othererror"``: Indicates that it is a sync error code that we are unable to give more specific information on. A complete list of them is available in services/sync/modules/constants.js.

Can you make services/sync/modules/constants.js link to the file on DXR?

::: toolkit/components/telemetry/docs/sync-ping.rst:127
(Diff revision 2)
> +   - ``"error"``: The message provided by the error.
> +
> +engine.validation
> +~~~~~~~~~~~~~~~~~
> +
> +For engines that can run validation on themselves, an array of objects describing validation errors that have occurred. Items that would have a count of 0 are excluded. Each engine will have its own set of items that it might put in the ``name`` field, but there are a finite number. See ``BookmarkProblemData.getSummary`` in services/sync/modules/bookmark\_validator.js for an example.

Can you link bookmark_validator.js to the file on DXR?
Attachment #8763709 - Flags: review?(alessio.placitelli) → review-
Comment on attachment 8764763 [details]
Bug 1267919 - Part 3. Implement initial sync telemetry recording code.

https://reviewboard.mozilla.org/r/60442/#review57408

I'll defer the Sync specific parts to :markh, as I don't know much about those. I reviewed telemetry.js and the tests: in general, they look good. There are just a few things to address (mostly nits), the biggest issue being the JSON vs JS schema that I mentioned in the other reviews.

Regarding the tests, I think it would be useful to check that all the fields of the payload we are sending have the expected values. It should not be a big task, as it's just a few fields sent with each xpcshell test.

::: services/sync/modules/browserid_identity.js:119
(Diff revision 1)
>  
> +  userUID() {
> +    if (this._signedInUser) {
> +      return this._signedInUser.uid;
> +    } else {
> +      return '';

Mh, if the UID can happen to be an empty string, we should probably document that in the sync ping rst file.

::: services/sync/modules/telemetry.js:24
(Diff revision 1)
> +Cu.import("resource://gre/modules/XPCOMUtils.jsm");
> +
> +let constants = {};
> +Cu.import("resource://services-sync/constants.js", constants);
> +
> +

nit: extra blank line

::: services/sync/modules/telemetry.js:33
(Diff revision 1)
> +
> +XPCOMUtils.defineLazyServiceGetter(this, "Telemetry",
> +                                   "@mozilla.org/base/telemetry;1",
> +                                   "nsITelemetry");
> +
> +

nit: extra blank line

::: services/sync/modules/telemetry.js:49
(Diff revision 1)
> +  "weave:engine:sync:error",
> +  "weave:engine:sync:applied",
> +  "weave:engine:sync:uploaded",
> +];
> +
> +

nit: extra blank line

::: services/sync/modules/telemetry.js:377
(Diff revision 1)
> +  }
> +
> +  onSyncStarted() {
> +    let setupStatus = Status.checkSetup();
> +    if (setupStatus !== constants.STATUS_OK) {
> +      log.info("Not recording sync telemetry for a user that isn't logged in. Status = "+setupStatus)

That's missing the trailing ";".

nit: this is the only log line in this file not using the "${}" syntax. We can make it consistent with the others.

::: services/sync/modules/telemetry.js:385
(Diff revision 1)
> +    this.current = new TelemetryRecord(this.allowedEngines);
> +  }
> +
> +  _checkCurrent(topic) {
> +    if (!this.current) {
> +      log.warn(`Observed notification ${topic} but no current sync is being recorded.`)

That's missing the trailing ";"

::: services/sync/modules/telemetry.js:404
(Diff revision 1)
> +    this.submit(current.toJSON());
> +  }
> +
> +  observe(subject, topic, data) {
> +    log.trace(`observed ${topic} ${subject} ${data}`);
> +

nit: extra empty line

::: services/sync/tests/unit/head_helpers.js:8
(Diff revision 1)
>  
>  Cu.import("resource://services-common/async.js");
>  Cu.import("resource://testing-common/services/common/utils.js");
>  Cu.import("resource://testing-common/PlacesTestUtils.jsm");
>  
> +Cu.import("resource://testing-common/services/sync/sync_ping_schema.js");

If we switch to JSON, we would still be able to reference the schema in XPCSHELL tests by adding it to the support-files section of the xpcshell.ini file.

::: services/sync/tests/unit/head_helpers.js:220
(Diff revision 1)
>    for (let i = 0; i < a1.length; ++i) {
>      do_check_eq(a1[i], a2[i]);
>    }
>  }
> +
> +// Helper function to get the sync telemetry and add the typically used test 

nit: trailing whitespace

::: services/sync/tests/unit/test_bookmark_engine.js:102
(Diff revision 1)
>                                            syncID: engine.syncID}}}},
>      bookmarks: {}
>    });
>  }
>  
> -add_test(function test_processIncoming_error_orderChildren() {
> +add_task(function test_processIncoming_error_orderChildren() {

This should be a generator function: function*

::: services/sync/tests/unit/test_bookmark_engine.js:157
(Diff revision 1)
>      } catch(ex) {
>        error = ex;
>      }
>      do_check_true(!!error);
> +    do_check_true(!!ping);
> +    do_check_eq(ping.failureReason.error, "error.engine.reason.record_download_fail");

Other things that would be useful to check here, IMHO:

- uid contains the expected value
- failureReason.name must have the correct value
- engines[0].name must have the correct value
- add test coverage for when/took

::: services/sync/tests/unit/test_bookmark_engine.js:288
(Diff revision 1)
>        _("Got error: " + Log.exceptionStr(ex));
>      }
> +    ok(!!ping);
> +    equal(ping.failureReason, undefined);
> +    equal(ping.engines.length, 1);
> +    ok(ping.engines[0].outgoing.length > 0);

We could probably also check engines[0].outgoing[0].sent/failed here

::: services/sync/tests/unit/test_telemetry.js:18
(Diff revision 1)
> +
> +Cu.import("resource://gre/modules/PlacesUtils.jsm");
> +Cu.import("resource://services-sync/util.js");
> +
> +const ajv = new Ajv({ async: "co*" });
> +const validateSyncPing = ajv.compile(SyncPingSchema);

Can't we use SyncPingValidator from head_helpers.js here?
Attachment #8764763 - Flags: review?(alessio.placitelli) → review-
Comment on attachment 8763709 [details]
Bug 1267919 - Part 2. Add documentation and a schema for a new "sync" telemetry ping.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/59844/diff/2-3/
Attachment #8763709 - Flags: review- → review?(alessio.placitelli)
Attachment #8764763 - Flags: review- → review?(alessio.placitelli)
Comment on attachment 8764763 [details]
Bug 1267919 - Part 3. Implement initial sync telemetry recording code.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/60442/diff/1-2/
https://reviewboard.mozilla.org/r/60442/#review57408

I've added checks for expected values wherever a mismatch would actually indicate an error, and avoided adding tests for values where we don't care. E.g. on the first sync of a fresh profile we upload 3 bookmarks; uploading 4, or 2, would be okay as well, so long as we didn't *fail* to upload any. (But where we actually should be sending 3, and not sending them would be a bug, I did try to check.)

Keep in mind that the validation runs over all of these pings as well, and ensures that the required fields are present, with no additional fields, especially now that schema also validates the integers are positive, etc.

> Mh, if the UID can happen to be an empty string, we should probably document that in the sync ping rst file.

Good point, I had forgotten about this case.

> Other things that would be useful to check here, IMHO:
> 
> - uid contains the expected value
> - failureReason.name must have the correct value
> - engines[0].name must have the correct value
> - add test coverage for when/took

I've added coverage for `when` to the validation in head_helpers; the only requirement for `took` is that it be >= 0, which is (now) covered by the schema.

For the uid, I've added checks in some places, but a number of the tests run with different auth systems (mocked legacy auth, then mocked fxa auth), so we might not know the UID.
https://reviewboard.mozilla.org/r/59844/#review57400

> Is there any particular reason why the schema is not a JSON file?
> Using JSON would make it easier to use this schema server-side as well. See the [pipeline schemas repo][1].
> 
> [1]: https://github.com/mozilla-services/mozilla-pipeline-schemas/

Now that there's dedicated documentation, there's no reason to keep it as a module.

> Can we have pings with both _engines_ and _failures_? If not, we should probably explicitly cover this case in the schema.

No, but I've updated the documentation to be more clear about this.

> Maybe we can use pattern matching here (and elsewhere in the schema) to strictly check for the expected version format?
> 
> Something like: "pattern": "^\\d{2}\\." (the pipeline schemas have some examples about that)

This is an integer, and so a pattern doesn't make sense. I did add a check to ensure it is always positive, and added a pattern to the "uid" property.

> Are we allowing third party engines? If not, we could probably be stricter about what we expect here.

No third party engines, good point. Only point of consideration is tests, but that's easy to work around by patching the JSON schema after loading it in the test code.

> Is this correct? If so, can we expand on the reasons why this could be missing?

Thanks, that was reversed.
https://reviewboard.mozilla.org/r/59842/#review57572

What's the process for updating this library? It doesn't look like it came directly out of github, and I don't want it to be difficult to update to later versions (ie, if it was put together by hand I'd be concerned). I think at a minimum we'd want docs next to the file explaining the update process.
https://reviewboard.mozilla.org/r/59842/#review57574

::: services/common/moz.build:41
(Diff revision 2)
>      TESTING_JS_MODULES.services.common += [
>          'modules-testing/storageserver.js',
>      ]
>  
>  TESTING_JS_MODULES.services.common += [
> +    'modules-testing/ajv.js',

oops - also meant to say that we could consider putting this file into testing/modules, then it could be referenced via resource://testing-common/... and available for bug 1249925.
https://reviewboard.mozilla.org/r/59842/#review57574

> oops - also meant to say that we could consider putting this file into testing/modules, then it could be referenced via resource://testing-common/... and available for bug 1249925.

Hm, I don't mind moving it, but I'm able to reference it from testing-common as is, via resource://testing-common/services/common/ajv.js (which is used in tests in the later commits).
https://reviewboard.mozilla.org/r/59842/#review57572

It's not part of their github, but it's distributed with their npm bundle (It's in node_modules/ajv/dist/ajv.bundle.js after installing it).  I installed it to a local folder, took that file, and added a license header and the EXPORTED_SYMBOLS line. But I can definitely document this process.
https://reviewboard.mozilla.org/r/60442/#review57586

Wow, this is a large patch :) But I think it is looking great - nice job!

There are lots of comments here and I ran out of time for today, so ended up skipping over some of the test changes, but in general, I think we can sacrifice some of the "robust" validation checking in some of the existing tests in the interests of keeping the validation code in those tests much smaller - we can do fully robust validation checks in telemetry-specific tests.  IOW, I think it might make sense to add new telemetry specific tests with robust checks for record counts etc.

As I mention a few times I also think the error stuff needs more love, even if it does end up meaning we need other changes in Sync itself to get access to the more detailed errors.

::: services/sync/modules/browserid_identity.js:119
(Diff revision 2)
>  
> +  userUID() {
> +    if (this._signedInUser) {
> +      return this._signedInUser.uid;
> +    } else {
> +      return '';

I assume this is needed for tests, otherwise we could throw, right? ISTM it might be possible for the tests to mock this function so the real function can throw?

::: services/sync/modules/telemetry.js:66
(Diff revision 2)
> +  } catch (e) {}
> +  return false;
> +}
> +
> +function checkStatusError(engineName) {
> +  if (engineName && Status.engines && Status.engines[engineName]) {

FYI, I need to have another look at this Status handling, but I'm not going to get to that today - but I'll post the rest of the review anyway.

::: services/sync/modules/telemetry.js:99
(Diff revision 2)
> +
> +  if (error.failureCode) {
> +    return { name: "othererror", error: error.failureCode };
> +  }
> +
> +  if (isBrowerIdAuthError(error)) {

An alternative here would be to make AuthenticationError() in browserid_identity public, and instead of duplicating the logic have bid_identity set a "from" attribute on the error, and just so an instanceof check here and use .from
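A minimal sketch of that alternative, assuming `AuthenticationError` were exported from browserid_identity with a `from` tag (the class shape and `transformAuthError` helper here are illustrative, not the actual patch):

```javascript
// Hypothetical sketch: browserid_identity exports AuthenticationError and
// tags each instance with where the failure originated, so telemetry.js
// needs only a single instanceof check instead of duplicating the logic.
class AuthenticationError extends Error {
  constructor(message, from) {
    super(message);
    this.name = "AuthenticationError";
    // "from" would be one of "tokenserver", "fxaccounts", or "hawkclient".
    this.from = from;
  }
}

// Illustrative transform: map an auth error to the ping's failureReason shape.
function transformAuthError(error) {
  if (error instanceof AuthenticationError) {
    return { name: "autherror", from: error.from };
  }
  return null; // not an auth error; caller falls through to other checks
}
```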

::: services/sync/modules/telemetry.js:122
(Diff revision 2)
> +  if (httpCode) {
> +    return { name: "httperror", code: httpCode };
> +  }
> +
> +  if (error.result) {
> +    switch (error.result) {

It might be better to check if the error code has the "module" set as NS_ERROR_MODULE_NETWORK. Sadly this isn't sanely exposed to .js, but cargo-culting https://dxr.mozilla.org/mozilla-central/source/toolkit/components/jsdownloads/src/DownloadCore.jsm#1516 seems reasonable.
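The module check being suggested boils down to a few bit operations on the nsresult value; a sketch along the lines of the DownloadCore.jsm pattern (constants per XPCOM's nsError.h):

```javascript
// Sketch of extracting the module from an nsresult code and comparing it
// against NS_ERROR_MODULE_NETWORK, rather than switch-ing on known codes.
const NS_ERROR_MODULE_BASE_OFFSET = 0x45; // per nsError.h
const NS_ERROR_MODULE_NETWORK = 6;

function isNetworkError(result) {
  // Bits 16..30 hold (module + base offset); bit 31 is the failure flag,
  // which the 0x7fff0000 mask deliberately excludes.
  return (((result & 0x7fff0000) >>> 16) - NS_ERROR_MODULE_BASE_OFFSET) ===
         NS_ERROR_MODULE_NETWORK;
}

isNetworkError(0x804B0010); // NS_ERROR_OFFLINE -> true
isNetworkError(0xC1F30001); // NS_ERROR_NOT_INITIALIZED -> false
```

(0xC1F30001 is 3253927937 in decimal, the same code discussed later in this thread.)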

::: services/sync/modules/telemetry.js:159
(Diff revision 2)
> +class EngineRecord {
> +  constructor(name) {
> +    this.when = Date.now();
> +    // startTime is in ms from process start, but is monotonic (unlike Date.now())
> +    // so we need to keep both it and when.
> +    this.startTime = tryGetMonotonicTimestamp();

I'm wondering if .when should just be a monotonic timestamp - I doubt we actually need the clock time for "when" and will be more interested in calculating the time between subsequent syncs.

But I guess that depends on how unlikely failing to get a monotonic timestamp is in practice?

::: services/sync/modules/telemetry.js:185
(Diff revision 2)
> +  recordApplied(counts) {
> +    if (!(counts.applied || counts.succeeded || counts.failed ||
> +          counts.newFailed || counts.reconciled)) {
> +      return;
> +    }
> +    if (!this.incoming) {

ISTM there would be a logic error somewhere if this was called twice for the same record? IOW, should we throw and/or log an error if this.incoming already exists? Similarly for the upload counts below.
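The "called twice" behavior settled on later in this thread is to sum the counts rather than clobber or throw; a sketch of that shape (the class and field names mirror the patch but this is a standalone illustration, not the real telemetry.js):

```javascript
// Sketch: repeated "applied" notifications for the same engine record
// accumulate, so a second call sums into the existing counts.
class EngineRecord {
  recordApplied(counts) {
    // Ignore notifications where every count is zero/absent.
    if (!(counts.applied || counts.succeeded || counts.failed ||
          counts.newFailed || counts.reconciled)) {
      return;
    }
    if (!this.incoming) {
      this.incoming = {};
    }
    for (let key of ["applied", "succeeded", "failed", "newFailed", "reconciled"]) {
      if (counts[key]) {
        this.incoming[key] = (this.incoming[key] || 0) + counts[key];
      }
    }
  }
}
```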

::: services/sync/modules/telemetry.js:217
(Diff revision 2)
> +class TelemetryRecord {
> +  constructor(allowedEngines) {
> +    this.allowedEngines = allowedEngines;
> +    // Our failure reason. This property only exists in the generated ping if an
> +    // error actually occurred.
> +    this.failureReason = null;

Can we set this to undefined and unconditionally include it in result in toJSON - IIUC it would still not appear in the JSON in that case.
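The behavior being relied on here is standard `JSON.stringify` semantics: object properties whose value is `undefined` are simply omitted from the output, so an unconditional `failureReason` in `toJSON` never reaches the payload when unset. A quick demonstration (the `toJSON` shape is illustrative):

```javascript
// Properties set to undefined are dropped by JSON.stringify, so including
// failureReason unconditionally is safe when no error occurred.
function toJSON(record) {
  return { when: record.when, failureReason: record.failureReason };
}

const serialized = JSON.stringify(toJSON({ when: 12345467, failureReason: undefined }));
// serialized === '{"when":12345467}' -- no failureReason key at all
```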

::: services/sync/modules/telemetry.js:222
(Diff revision 2)
> +    this.failureReason = null;
> +    try {
> +      this.uid = Weave.Service.identity.userUID();
> +    } catch (e) {
> +      // Legacy sync authentication. This is should only ever happen during tests.
> +      this.uid = "<LEGACY>";

Even though I recall suggesting this literal, I'm surprised this doesn't cause schema validation failures? (It would be reasonable to include "0"*32 if it did - I'm just more interested in why it doesn't cause test failures and/or whether the tests should be validating more often.) Edit: I see now that you neuter this check in tests - I guess I'm not too bothered, but it doesn't sound like it would be too much work to (a) use "0"*32 here and (b) have configureFxaIdentity (or whatever it is called) use a "valid" uid.

::: services/sync/modules/telemetry.js:258
(Diff revision 2)
> +    if (engines.length > 0) {
> +      result.engines = engines;
> +    }
> +    if (result.failureReason || result.engines) {
> +      return result;
> +    } else {

nit: no "else" after return. I think we also want a comment about why we are doing this (ie, to avoid uploading a "synced but did nothing" record.)

::: services/sync/modules/telemetry.js:264
(Diff revision 2)
> +      return null;
> +    }
> +  }
> +
> +  finished(error) {
> +    this.took = Math.round(tryGetMonotonicTimestamp() - this.startTime);

Still not sure what I think the best thing to do is with monotonic timestamps, but I wonder if this.took being zero is actually better in practice than a value calculated from .now() as a fallback?

::: services/sync/modules/telemetry.js:275
(Diff revision 2)
> +    if (error) {
> +      this.failureReason = transformError(error);
> +    }
> +
> +    if (!this.uid) {
> +      // Try again, it's possible that by now we've logged in.

Any reason we can't *only* fetch it here? I guess there is also the possibility a user disconnects *during* Sync - in which case I think it would be fine to not record the final sync.

::: services/sync/modules/telemetry.js:365
(Diff revision 2)
> +  }
> +}
> +
> +class SyncTelemetryImpl {
> +  constructor(allowedEngines) {
> +    // This is accessable so we can enable custom engines during tests.

nit: accessible

::: services/sync/modules/telemetry.js:394
(Diff revision 2)
> +  }
> +
> +  onSyncStarted() {
> +    let setupStatus = Status.checkSetup();
> +    if (setupStatus !== constants.STATUS_OK) {
> +      log.info(`Not recording sync telemetry for a user that isn't logged in. Status = ${setupStatus}.`);

do we actually need this? It seems likely to be the first sync, and as per the comment above, we should expect a UID at sync finish (or user disconnected during sync, in which case I think we could just discard the ping)

::: services/sync/modules/telemetry.js:429
(Diff revision 2)
> +        this.shutdown();
> +        break;
> +
> +      /* sync itself state changes */
> +      case "weave:service:sync:start":
> +        this.current = new TelemetryRecord(this.allowedEngines);

seems like it might be worth checking !this.current here and logging otherwise?

::: services/sync/modules/telemetry.js:433
(Diff revision 2)
> +      case "weave:service:sync:start":
> +        this.current = new TelemetryRecord(this.allowedEngines);
> +        break;
> +
> +      case "weave:service:sync:finish":
> +        this.onSyncFinished(null);

should this use _checkCurrent() and avoid the additional check in onSyncFinished?

::: services/sync/tests/unit/head_helpers.js:7
(Diff revision 2)
>     http://creativecommons.org/publicdomain/zero/1.0/ */
>  
>  Cu.import("resource://services-common/async.js");
>  Cu.import("resource://testing-common/services/common/utils.js");
>  Cu.import("resource://testing-common/PlacesTestUtils.jsm");
> +Cu.import("resource://gre/modules/Promise.jsm");

We should be able to avoid this import (it used to be necessary before DOM Promises, hence you will see existing imports of this, but they can actually be removed as convenient)

::: services/sync/tests/unit/head_helpers.js:298
(Diff revision 2)
> +    }
> +  }
> +}
> +
> +// Hooks into telemetry to validate all pings after calling.
> +function validate_all_pings() {

I think a name similar to "validate_all_future_pings" would be better.

::: services/sync/tests/unit/head_helpers.js:322
(Diff revision 2)
> +// engine is actually synced, but we still want to ensure we're generating a
> +// valid ping. Returns a promise that resolves to the ping, or rejects with the
> +// thrown error after calling an optional callback.
> +function with_validated_engine_sync(callback, optErrPingCb) {
> +  let telem = get_sync_test_telemetry();
> +  let deferred = Promise.defer();

oh - using Promise.defer here *does* require Promise.jsm, but I think it should be easy to avoid here (ie, return new Promise(...))
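The suggested rewrite of the deferred pattern can be sketched with the plain Promise constructor; the helper name and `submit` hook below are illustrative stand-ins for the test helper being discussed, not its actual signature:

```javascript
// Sketch: express the Promise.defer() pattern with the Promise constructor.
// The submit hook is installed inside the executor, before the callback
// runs, so a ping submitted during callback() is not missed.
function withValidatedEngineSync(callback) {
  return new Promise((resolve, reject) => {
    const submit = ping => resolve(ping);
    try {
      callback(submit); // hypothetical: callback drives the sync and
                        // eventually invokes submit with the ping
    } catch (e) {
      reject(e);
    }
  });
}
```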

::: services/sync/tests/unit/test_bookmark_engine.js:160
(Diff revision 2)
> -    do_check_true(!!error);
> +    ok(!!error);
> +    ok(!!ping);
> +    equal(ping.uid, "<LEGACY>");
> +    deepEqual(ping.failureReason, {
> +      name: "othererror",
> +      error: "error.engine.reason.record_download_fail"

re errors, this doesn't really look right - "record_download_fail" doesn't give us the information we want IMO - I'd expect an unexpectederror with "Sync this!" or similar.

::: services/sync/tests/unit/test_bookmark_engine.js:243
(Diff revision 2)
> +        engine.sync());
>      } catch(ex) {
>        error = ex;
>        _("Got error: " + Log.exceptionStr(ex));
>      }
> +    ok(!!ping);

I wonder if we can cleanly move some of this boilerplate out (eg, have with_validated_engine_sync() take the engine and some additional validation as params instead of a callback). I think it would also be OK to have less strenuous checks here if it helps clean things up - eg, I'm not sure there's a whole lot of value in checking the number of records in these kinds of tests, especially where the number (eg, 6 in this case) is really more a side-effect than a carefully constructed thing.

While doing the schema validation here is awesome, I'm trying to avoid obscuring the actual point of these tests.

::: services/sync/tests/unit/test_collections_recovery.js:37
(Diff revision 2)
>      "/1.1/johndoe/storage/meta/global": johnU("meta",   new ServerWBO("global").handler())
>    };
>    let collections = ["clients", "bookmarks", "forms", "history",
>                       "passwords", "prefs", "tabs"];
> +  // Disable addon sync because AddonManager won't be initialized here.
> +  Service.engineManager.unregister("addons");

what goes wrong here without this change?

::: services/sync/tests/unit/test_errorhandler.js:208
(Diff revision 2)
>  
>    _("Starting first sync.");
> -  Service.sync();
> +  ping = yield wait_for_ping(() => Service.sync());
> +  ok(!!ping);
> +  ok(!ping.engines);
> +  // Unfortunately this is an unexpectederror right now...

why is that? Is that something we can fix before landing, or something we can fix after landing?

::: services/sync/tests/unit/test_syncengine_sync.js:1795
(Diff revision 2)
>  
> +    ok(!!error);
> +    ok(!!ping);
> +    deepEqual(ping.failureReason, {
> +      name: "othererror",
> +      error: "error.engine.reason.record_upload_fail"

ditto here re errors - ideally we'd record *why* rather than the *what*
https://reviewboard.mozilla.org/r/59844/#review57578

I think this looks good, but we should wait for all other patches to be ready before landing (even if r+ comes for this first :)

::: services/sync/tests/unit/sync_ping_schema.json:21
(Diff revision 3)
> +    "took": { "type": "integer", "minimum": 0 },
> +    "uid": {
> +      "type": "string",
> +      "oneOf": [
> +        { "pattern": "[0-9a-f]{32}" },
> +        { "maxLength": 0 }

http://json-schema.org/latest/json-schema-validation.html doesn't mention special semantics for maxLength = 0 - is that actually necessary (and should the pattern include start and end of string expressions?)
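The anchoring concern is real: JSON Schema's `pattern` keyword only requires that the regex match *somewhere* in the string, so without `^`/`$` the uid pattern accepts any string merely containing 32 hex characters. A quick demonstration of the difference (the `oneOf` empty-string branch then covers the documented "UID unavailable" case):

```javascript
// JSON Schema "pattern" searches rather than fully matching, so the
// unanchored form is too permissive; anchoring fixes it.
const unanchored = /[0-9a-f]{32}/;
const anchored = /^[0-9a-f]{32}$/;
const uid = "0123456789abcdef0123456789abcdef";

unanchored.test(uid + "-extra-junk"); // true  -- accepted, too permissive
anchored.test(uid + "-extra-junk");   // false -- rejected as desired
anchored.test(uid);                   // true
anchored.test("");                    // false -- hence the empty-string branch
```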

::: services/sync/tests/unit/sync_ping_schema.json:114
(Diff revision 3)
> +      "properties": {
> +        "name": { "enum": ["autherror"] },
> +        "from": { "enum": ["tokenserver", "fxaccounts", "hawkclient"] }
> +      }
> +    },
> +    "otherError": {

As per the other patch, I think we need to think a little more about errors.

::: toolkit/components/telemetry/docs/sync-ping.rst:8
(Diff revision 3)
> +===========
> +
> +This ping is generated after a sync is completed, for both successful and failed syncs. It's payload contains measurements
> +pertaining to sync performance and error information.
> +
> +Structure::

I think we should link to the schema itself as the .rst seems likely to get out of date.
Comment on attachment 8763709 [details]
Bug 1267919 - Part 2. Add documentation and a schema for a new "sync" telemetry ping.

https://reviewboard.mozilla.org/r/59844/#review57660

data-review=me without a clientID, or please ping me and explain why you need the clientid

based on re-reading the bug, I'm assuming you don't need the full environment block (https://bugzilla.mozilla.org/show_bug.cgi?id=1267919#c7) so you should document that you're not collecting it.

::: toolkit/components/telemetry/docs/sync-ping.rst:14
(Diff revision 3)
> +
> +    {
> +      version: 4,
> +      type: "sync",
> +      ... common ping data
> +      clientId: <UUID>,

Do you need the telemetry clientID for this ping? This would allow us to correlate the sync account ID with telemetry clientID, which may not be desirable. If you don't need it, it would be better not to include it.

Also, will this ping contain the environment block? It's not documented one way or the other.

::: toolkit/components/telemetry/docs/sync-ping.rst:81
(Diff revision 3)
> +Additionally, both ``failureReason`` and ``engines`` may be present in the payload, as we may have synced or partially synced some engines when an error occurred. If the error occured due to the engine's operation, there may be additional error information on the failureReason of one or more engines.
> +
> +took
> +~~~~
> +
> +These values should be monotonic, however 0 is reported if we can't get a monotonic timestamp.

Wouldn't `null` or missing be better in this case? Are there really cases where we can't get a monotonic timestamp?
Attachment #8763709 - Flags: review?(benjamin) → review+
https://reviewboard.mozilla.org/r/60442/#review57586

The more robust validation checking was largely added in response to :Dexter's comments, so hopefully there's some middle ground. I'll work on more telemetry-specific tests in the hope that this is acceptable to both of you (and also on reducing the testing boilerplate in more of the tests).

> I assume this is needed for tests, otherwise we could throw, right? ISTM like it might be possible for the tests to mock this function so the real function can throw?

Right, that sounds possible, and any case where mocking it causes a problem can probably just be addressed there.

> It might be better to check if the error code has the "module" set as NS_ERROR_MODULE_NETWORK. Sadly this isn't sanely exposed to .js, but cargo-culting https://dxr.mozilla.org/mozilla-central/source/toolkit/components/jsdownloads/src/DownloadCore.jsm#1516 seems reasonable.

Sounds good to me. (It also occurs to me that we probably don't want to report an unexpectederror in the case that it does have an nsresult code.)

edit: I talk about this more later, IMO maybe we just want to report nsresults as nsresults. e.g. something like {name: "nserror", result: 12345} or something like this...

> I'm wondering if .when should just be a monotonic timestamp - I doubt we actually need the clock time for "when" and will be more interested in calculating the time between subsequent syncs.
> 
> But I guess that depends on how unlikely failing to get a monotonic timestamp is in practice?

I think it depends more on how much we care about the time delta for two syncs that have a browser restart in between, since we wouldn't get that with the monotonic time function. Of course, an alternative would be to include the process start time -- or even just the time when telemetry.js loads, if the process start time isn't available anywhere -- *and* the monotonic msSinceProcessStart.

But IMO whether or not that's worth doing depends on how likely Date.now()'s lack of monotonicity is to matter. I have no data on this (and I'd be interested if we *do*), but ISTM that it's unlikely to shift so substantially that we care -- but I could definitely be wrong about this.

> ISTM there would be a logic error somewhere if this was called twice for the same record? IOW, should we throw and/or log an error if this.incoming already exists? Similarly for the upload counts below.

I hadn't dug into it too much, and since there is a totally natural way to handle it being called multiple times (sum them), I just did that. I still think summing them is going to be much better than throwing (except possibly during tests...), but logging an error sounds like a good idea.

I was under the impression that it was completely valid for the upload counts to be reported multiple times, indicating multiple uploads, hence it being an array.

> Can we set this to undefined and unconditionally include it in result in toJSON - IIUC it would still not appear in the JSON in that case.

I'll double-check this; I was using a different JSON-schema validator that didn't like this at first. I had looked into it and, well, JSON-schema doesn't really have a stance on undefined properties, since they aren't part of JSON.

OTOH when validating I could just JSON.parse(JSON.stringify(obj)) before validating which would resolve this.

> Even though I recall suggesting this literal, I'm surprised this doesn't cause schema validation failures? (It would be reasonable to include "0"*32 if it did - I'm just more interested in why it doesn't cause test failures and/or whether the tests should be validating more often?  (edit: I see now that you neuter this check in tests - I guess I'm not too bothered, but it doesn't sounds like it would be too much work to (a) use "0"*32 here and (b) have configureFxaIdentity (or whatever it is called) use a "valid" uid?

Yeah, I'd much prefer using an all-zero UUID. This means that should the unthinkable happen and it make its way to the server, it wouldn't be a hard error.

> Still not sure what I think the best thing to do is with monotonic timestamps, but I wonder if this.took being zero is actually better in practice to it being a value calculated from .now() as a fallback?

IMO it probably is. We could still keep the monotonicity guarantee by using `Math.max(0, delta)`, or similar.

I dug into it a bit and it's not clear to me that this will never throw, since it depends not only on the platform having a monotonic timer (which, ignoring some very obscure unices, they all do), but also on our computed process start time being consistent with what we recorded for it... Which, well, who knows.

That said, it sounds like it would be easy, at least in theory, to find out via telemetry whether this can ever happen in practice, as a special value (-1) is recorded when we can't get a monotonic timestamp.

> Any reason we can't *only* fetch it here? I guess there is also the possibility a user disconnects *during* Sync - in which case I think it would be fine to not record the final sync.

Sure, although I think we might still want to record the sync, e.g. in the case of authentication/login errors or the like where we can't actually get a unique ID...

> do we actually need this? It seems likely to be the first sync, and as per the comment above, we should expect a UID at sync finish (or user disconnected during sync, in which case I think we could just discard the ping)

Sounds like a good point, especially since that code wasn't even running. As I mentioned above, I don't think this code was even doing a smart thing -- we still would probably want to report auth errors, etc.

> seems like it might be worth checking !this.current here and logging otherwise?

Oh, this really should be calling `this.onSyncStarted()`. But yes, that check seems valuable as well.

> We should be able to avoid this import (it used to be necessary before DOM Promises, hence you will see existing imports of this, but they can actually be removed as convenient)

This is a good point; there are a couple of reasons I wanted `Promise.defer` (mostly because of when `new Promise` runs its function), but I'll try harder to avoid it.

(Of course, ideally those utilities would still be available somewhere).

> oh - using Promise.defer here *does* require Promise.jsm, but I think it should be easy to avoid here (ie, return new Promise(...))

Right, there's a difference though, in that `new Promise()` runs its function argument asynchronously, so if `callback` reports an error or finishes before the promise microtask gets a chance to run, then we'll miss the submit.

This seemed to be happening in practice, so I moved from `new Promise()` to using defer. I'll see if I can get it to work with `new Promise()` though (it seems like I should be able to if I move enough of the setup into the `new Promise` callback).

> re errors, this doesn't really look right - "record_download_fail" doesn't give us the information we want IMO - I'd expect an unexpectederror with "Sync this!" or similar.

This seems to be dropped/swallowed in the sync error handling code -- https://dxr.mozilla.org/mozilla-central/source/services/sync/modules/resource.js#300 and https://dxr.mozilla.org/mozilla-central/source/services/sync/modules/engines.js#1135-1138 seem to be where the problem is.

IMO fixing all the places where sync does bad things with errors is out of the scope of this bug - though, well, maybe the data isn't valuable without better error reporting, in which case it would need to be in scope after all.

> I wonder if we can cleanly move some of this boilerplate out (eg, have with_validated_engine_sync() take the engine and some additional validation as params instead of a callback? I think it would also be OK to have less strenuous checks here if is helps clean things up (eg, I'm not sure there's a whole lot of value in checking the number of records in these kinds of tests, especially where the number (eg, 6 in this case) is really more a side-effect than a carefully constructed thing.
> 
> While doing the schema validation here is awesome, I'm trying to avoid obscuring the actual point of these tests.

Well, a lot of it was added after feedback from the earlier review.

But adding additional params to move the checking into the with_validated_engine_sync function sounds fine to me (although, I thought there is at least one place where we do more than just `engine.sync()` inside the callback, but it's possible I got rid of it already).

> what goes wrong here without this change?

This throws an NS_ERROR_NOT_INITIALIZED: https://dxr.mozilla.org/mozilla-central/source/services/sync/modules/addonsreconciler.js#351.

But thanks for bringing my attention back to this, I had meant to look into it more, since we were reporting this as "error.engine.reason.unknown_fail", which, well, pretty much should never be reported. Even if we know nothing else, we should be reporting it as an "unexpectederror". So I've fixed this so now it should be an "unexpectederror" with "error": "3253927937" -- which is the code for that nsresult.

But... I'm not sure what we should do about non-network nsresults that are thrown. It does seem to fall into the category of 'unexpected error', but it probably makes sense to narrow that scope, since we can probably do more with the knowledge of which nsresult it is. (I'd almost say that networkerror is bad and it should just be {"name": "nserror", "result": error.result} or similar. But I don't know.)
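
A minimal sketch of the shape being suggested here. The `errorToFailureReason` name and the exact fallback are illustrative, not the actual sync code; the only assumption taken from the thread is that anything carrying a numeric nsresult (`error.result`) gets reported uniformly as "nserror":

```javascript
// Hypothetical sketch: map a thrown value to a telemetry failureReason.
// The function name and fallback shape are illustrative only.
function errorToFailureReason(error) {
  if (error && typeof error.result === "number") {
    // Any nsresult, network-related or not, reported uniformly.
    return { name: "nserror", result: error.result };
  }
  // Otherwise fall back to a generic unexpected error string.
  return { name: "unexpectederror", error: String(error) };
}
```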

> why is that? Is that something we can fix before landing, or something we can fix after landing?

Along the lines of my other comment about sync error handling. This is from https://dxr.mozilla.org/mozilla-central/source/services/sync/modules/service.js#551, and the reporting can be somewhat improved by throwing `info` instead (which causes it to show up as a 401 httperror, which possibly should be an autherror instead...), but it seems like really we should be returning something from checkServerError instead.

But as I mentioned above, it seems like there's a lot of places where the sync code swallows errors or error info, and fixing them doesn't sound like it should be part of this bug. (But I'll totally defer to your judgement here). Changing what we throw here is easy enough that it's worth doing though, especially since it will probably cause other kinds of errors to be reported better as well.
https://reviewboard.mozilla.org/r/59844/#review57660

> Do you need the telemetry clientID for this ping? This would allow us to correlate the sync account ID with telemetry clientID, which may not be desirable. If you don't need it, it would be better not to include it.
> 
> Also, will this ping contain the environment block? It's not documented one way or the other.

Sorry, it won't contain the environment block; it will have the application block, but my understanding is that all pings have that.

Mark: do you have a strong opinion on whether or not we have the client id?

> Wouldn't `null` or missing be better in this case? Are there really cases where we can't get a monotonic timestamp?

There's some discussion of this in the reply I made to markh's comments just now, but it's not clear from the code that it's impossible (e.g. it's not solely dependent on the OS having a monotonic timer, which they essentially all do). After some discussion on IRC, it looks like this is unlikely but can happen (it shows up 280 times in the longitudinal table, which AIUI is 1% of users over the last 6 months).

FWIW I think that a reasonable solution might be to attempt to fall back on a non-monotonic timer, but if it reports a negative value, *then* either leave the time out, report 0, or some other error indicator (-1 would be consistent with the rest of telemetry).
https://reviewboard.mozilla.org/r/60442/#review57586

> Right, that sounds possible, and any case where mocking it causes a problem can probably just be addressed there.

Actually, this handles the case where the user is not logged in or not yet authenticated.
Whiteboard: [sync-data-integrity] → [data-integrity]
(In reply to Thom Chiovoloni [:tcsc] from comment #32)
> Mark: do you have a strong opinion on whether or not we have the client id?

I think we do not want the client id for the reasons Benjamin mentions, and there's no identified requirement to correlate between the two (and if a good reason comes up later, we can address it then).

> > Wouldn't `null` or missing be better in this case? Are there really cases where we can't get a monotonic timestamp?
> 
> There's some discussion of this in the reply I made to markh's comments just
> now, but it's not clear from the code that it's impossible (e.g. it's not
> solely dependent on the OS having a monotonic timer, which they essentially
> all do). After some discussion on IRC, it looks like this is unlikely but
> can happen (it shows up 280 times in the longitudinal table, which AIUI is
> 1% of users over the last 6 months).
> 
> FWIW I think that a reasonable solution might be to attempt to fall back on
> a non-monotonic timer, but if it reports a negative value, *then* either
> leave the time out, report 0, or some other error indicator (-1 would be
> consistent with the rest of telemetry).

For our analysis I think losing that 1% due to no monotonic timestamp is reasonable (and may actually be helpful in terms of filtering out insane platforms/situations we don't actually care about ;) So I'd be fine with no fallback to Date.now() and just recording -1 for those cases - it keeps the code simpler and offers less chance of polluting our data with outliers due to this.
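
The agreed approach above can be sketched as follows. `monotonicNow` stands in for whatever monotonic timer the code actually uses (e.g. something like msSinceProcessStart); the parameter and function names here are assumptions for illustration:

```javascript
// Illustrative sketch: use the monotonic timer when available, and record
// -1 otherwise -- no Date.now() fallback, to avoid polluting the data
// with non-monotonic outliers. `monotonicNow` is a hypothetical stand-in
// for the real monotonic clock.
function tookFrom(startMs, monotonicNow) {
  if (typeof startMs !== "number" || startMs < 0) {
    return -1; // no monotonic timestamp was available at sync start
  }
  return Math.round(monotonicNow() - startMs);
}
```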
https://reviewboard.mozilla.org/r/60442/#review57586

> I think it depends more on how much we care about the time delta for two syncs that happen when the browser restarts in between, since we wouldn't get that with that monotonic time function. Of course, an alternative would be to include the process start time -- or even just when telemetry.js loads if the process start time isn't available anywhere, *and* the monotonic msSinceProcessStart.
> 
> But IMO whether or not that's worth doing depends on how likely Date.now()'s lack of monotonicity is to matter. I have no data on this (and I'd be interested if we *do*), but ISTM that it's unlikely to shift so substantially that we care -- but I could definitely be wrong about this.

As already mentioned, I think just monotonic is sounding ok.

> I hadn't dug into it too much, and since there is a totally natural way to handle it being called multiple times (sum them), I just did that. I still think summing them is going to be much better than throwing (except possibly during tests...), but logging an error sounds like a good idea.
> 
> I was under the impression that it was completely valid for the upload counts to be reported multiple times, indicating multiple uploads, hence it being an array.

I believe the incoming fields should only ever be called once, so if we get multiple something has gone badly wrong that we need to know about. You are correct about outgoing though and I suggested we don't sum them up so we can get some insights into how the batching is working - that lack of symmetry is a shame, but I think it's probably OK.

> I'll double check this, I was using a different JSON-schema validator that didn't like this at first. I had looked into it and, well, JSON-schema doesn't really have a stance on undefined properties, since they aren't part of JSON.
> 
> OTOH when validating I could just JSON.parse(JSON.stringify(obj)) before validating which would resolve this.

I see - you aren't stringifying it - I think that's fine as it stands then.
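
For reference, the JSON round-trip mentioned above sidesteps the undefined-property question entirely: JSON has no `undefined`, so `JSON.stringify` simply omits those keys, and a strict schema validator never sees them. A small self-contained demonstration:

```javascript
// JSON.stringify drops properties whose value is undefined, so a
// parse(stringify(...)) round-trip yields an object a schema validator
// can check without tripping over undefined.
const ping = { took: 500, why: undefined };
const roundTripped = JSON.parse(JSON.stringify(ping));
// roundTripped has no "why" property at all; "took" survives unchanged.
```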

> This is a good point, there are a couple reasons I wanted `Promise.defer` (mostly because `new Promise` runs its function asynchronously), but I'll try harder to avoid it.
> 
> (Of course, ideally those utilities would still be available somewhere).

> mostly because new Promise runs its function asynchronously

It runs it synchronously - https://developer.mozilla.org/en/docs/Web/JavaScript/Reference/Global_Objects/Promise
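
This is easy to verify directly: the executor passed to `new Promise()` runs synchronously, before the constructor returns; only `.then()` callbacks are deferred to a microtask.

```javascript
// The Promise executor runs synchronously during construction.
let order = [];
order.push("before");
new Promise(resolve => {
  order.push("executor"); // runs right here, before the constructor returns
  resolve();
});
order.push("after");
// order is ["before", "executor", "after"]
```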

> This seems to be dropped/swallowed in the sync error handling code -- https://dxr.mozilla.org/mozilla-central/source/services/sync/modules/resource.js#300 and https://dxr.mozilla.org/mozilla-central/source/services/sync/modules/engines.js#1135-1138 seem to be where the problem is.
> 
> IMO fixing all the places where sync does bad things with errors is out of the scope of this bug, but, well, maybe the data isn't valuable without better error reporting, in which case it might not be.

Yeah, agreed. I still need to circle back on the errors.

> This throws an NS_ERROR_NOT_INITIALIZED: https://dxr.mozilla.org/mozilla-central/source/services/sync/modules/addonsreconciler.js#351.
> 
> But thanks for bringing my attention back to this, I had meant to look into it more, since we were reporting this as "error.engine.reason.unknown_fail", which, well, pretty much should never be reported. Even if we know nothing else, we should be reporting it as an "unexpectederror". So i've fixed this so now it should be an "unexpectederror" with "error": "3253927937" -- which is the code for that nsresult...
> 
> But... I'm not sure what we should do about non-network nsresults that are thrown. It does seem to fall into the category of 'unexpected error', but it probably makes sense to narrow that scope, since we can probably do more with the knowledge of what nsresult it is. (I'd almost say that networkerror is bad and it should just be {"name": "nserror", "result": error.result} or similar. But, I don't know.

That's a good idea. I'm a little worried about some NS_ERROR_XPC_* errors (eg, NS_ERROR_XPC_JAVASCRIPT_ERROR, NS_ERROR_XPC_JS_THREW_EXCEPTION) which sometimes mean "regular JS exception crossed an xpcom (ie, c++->js) boundary, but I'm inclined to think your idea is good and we can fine-tune this later if we start to see those exceptions in large numbers and want to know more.

> Along the lines of my other comment about sync error handling. This is from https://dxr.mozilla.org/mozilla-central/source/services/sync/modules/service.js#551, and the reporting can be somewhat improved by throwing `info` instead (which causes it to show up as a 401 httperror, which possibly should be an autherror instead...), but it seems like really we should be returning something from checkServerError instead.
> 
> But as I mentioned above, it seems like there's a lot of places where the sync code swallows errors or error info, and fixing them doesn't sound like it should be part of this bug. (But I'll totally defer to your judgement here). Changing what we throw here is easy enough that it's worth doing though, especially since it will probably cause other kinds of errors to be reported better as well.

yep, fair enough - maybe add the short version of the above as a comment?
Comment on attachment 8763708 [details]
Bug 1267919 - Part 1. Import Ajv for validation of sync telemetry ping schema

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/59842/diff/2-3/
Comment on attachment 8763709 [details]
Bug 1267919 - Part 2. Add documentation and a schema for a new "sync" telemetry ping.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/59844/diff/3-4/
Comment on attachment 8764763 [details]
Bug 1267919 - Part 3. Implement initial sync telemetry recording code.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/60442/diff/2-3/
https://reviewboard.mozilla.org/r/60442/#review57586

> > mostly because new Promise runs its function asynchronously
> 
> It runs it synchronously - https://developer.mozilla.org/en/docs/Web/JavaScript/Reference/Global_Objects/Promise

Oh, wow, you're totally right. I think at one point I must have confused this with `Promise.resolve().then` and have been wrong about it ever since.
https://reviewboard.mozilla.org/r/59844/#review57660

Just a heads up, a few small changes were made to the schema. networkerror was replaced with nserror (we aren't requiring that the error code be from the network module), and we now record -1 when we can't get a monotonic timestamp.

I assume these are both fine, but let me know if they are not.

> There's some discussion of this in the reply I made to markh's comments just now, but it's not clear from the code that it's impossible (e.g. it's not solely dependent on the OS having a monotonic timer, which they essentially all do). After some discussion on IRC, it looks like this is unlikely but can happen (it shows up 280 times in the longitudinal table, which AIUI is 1% of users over the last 6 months).
> 
> FWIW I think that a reasonable solution might be to attempt to fall back on a non-monotonic timer, but if it reports a negative value, *then* either leave the time out, report 0, or some other error indicator (-1 would be consistent with the rest of telemetry).

Ended up going with -1 for this, which is consistent with the other monotonic timestamps in telemetry.
https://reviewboard.mozilla.org/r/59844/#review57578

> http://json-schema.org/latest/json-schema-validation.html doesn't mention special semantics for maxLength = 0 - is that actually necessary (and should the pattern include start and end of string expressions?)

No special semantics are required for maxLength 0, it will only permit empty strings (this is what we send for auth errors, if we have no uid). You're right about the pattern though.
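
The anchoring point matters because JSON-schema `pattern` matching is unanchored: without `^` and `$` a pattern accepts any string that merely *contains* a match. The exact uid pattern below is assumed for illustration only:

```javascript
// Unanchored regexes (as JSON-schema "pattern" uses them) match
// substrings; anchored ones match the whole string. The 32-hex-char
// pattern here is only an example, not the real schema's pattern.
const unanchored = /[0-9a-f]{32}/;
const anchored = /^[0-9a-f]{32}$/;
const junk = "xx" + "a".repeat(32) + "yy";
// unanchored.test(junk) is true; anchored.test(junk) is false
```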

> I think we should link to the schema itself as the .rst seems likely to get out of date.

Would including a link to the documentation work? There doesn't seem to be a point to having the docs if we're not planning on updating them, and a link in the "description" field of the ping would address that -- I think.
(In reply to Thom Chiovoloni [:tcsc] from comment #42)
> > I think we should link to the schema itself as the .rst seems likely to get out of date.
> 
> Would including a link to the documentation work? There doesn't seem to be a
> point to having the docs if we're not planning on updating them, and a link
> in the "description" field of the ping would address that -- I think.

I'm not quite sure what you are asking here, but I think there's a real risk that someone will change the schema without updating the docs, so I was thinking having the docs point to the canonical schema might help someone who can't reconcile the (outdated) docs with reality. Having the schema point at the docs also makes sense and might help encourage people to update the docs when changing the schema.

(TBH, for this exact reason I probably would have preferred the schema be .js and the docs be embedded in it as comments and extracted as things change, but that doesn't sound feasible)
https://reviewboard.mozilla.org/r/60442/#review58540

Thinking a little more about this error handling, I wonder if we can consider changing the schema so each Sync reports 2 values - an error and the Sync status. eg, there are some cases where sync "fails" without throwing an exception, but the status is recorded, whereas others are actually exceptions (eg, I believe that in some cases Status.sync will be NO_SYNC_NODE_FOUND but the sync will be considered "successful" - we'd have no visibility into that).

IOW, I'm thinking failureReason should always reflect the actual error being thrown (if one was) and a new field syncStatus that always records Status.sync. These strings aren't very short though and will (hopefully) usually be "success.sync" so maybe we could optimize things somehow to keep the size of the payload down (eg, maybe mapping the longer string to a short int, or just making success.sync an empty string, or something)

What do you think?
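
One of the size optimizations suggested above could look roughly like this. The helper name and constant are hypothetical; the only idea taken from the comment is to omit the field entirely for the common success case:

```javascript
// Illustrative sketch: only record syncStatus when it differs from the
// common "success.sync" value, keeping noop/success pings small.
// `statusField` and SYNC_SUCCEEDED are invented names for this sketch.
const SYNC_SUCCEEDED = "success.sync";
function statusField(statusSync) {
  // Returning undefined means the property is simply omitted when the
  // ping is JSON-stringified.
  return statusSync === SYNC_SUCCEEDED ? undefined : statusSync;
}
```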

::: services/sync/modules/browserid_identity.js:120
(Diff revision 3)
>  
> +  userUID() {
> +    if (this._signedInUser) {
> +      return this._signedInUser.uid;
> +    } else {
> +      return '';

It looks like you can throw here now without impacting the telemetry collection.

::: services/sync/tests/unit/head_helpers.js:319
(Diff revision 3)
> +
> +// Used for the (many) cases where we do a 'partial' sync, where only a single
> +// engine is actually synced, but we still want to ensure we're generating a
> +// valid ping. Returns a promise that resolves to the ping, or rejects with the
> +// thrown error after calling an optional callback.
> +function with_validated_engine_sync(engine, success, onError) {

this function name seems a little strange - the "with" in particular. I wonder if something like "sync_engine_and_validate_telem()" or something might be better (it's still a bit wordy though)

::: services/sync/tests/unit/test_errorhandler.js:222
(Diff revision 3)
>  
>    generateCredentialsChangedFailure();
> -  Service.sync();
>  
> +  let ping = yield wait_for_ping(() => Service.sync());
> +  deepEqual(ping.failureReason, { name: "othererror", error: CREDENTIALS_CHANGED });

Re above: For example, in this case what's happening is that an exception is being thrown from https://dxr.mozilla.org/mozilla-central/source/services/sync/modules/stages/enginesync.js#85 (ie, new Error("Aborting sync, remote setup failed")) but CREDENTIALS_CHANGED is the result reported in telemetry. I'm thinking that in this case we should attempt to get *both*
(In reply to Mark Hammond [:markh] from comment #43)
> (In reply to Thom Chiovoloni [:tcsc] from comment #42)
> > > I think we should link to the schema itself as the .rst seems likely to get out of date.
> > 
> > Would including a link to the documentation work? There doesn't seem to be a
> > point to having the docs if we're not planning on updating them, and a link
> > in the "description" field of the ping would address that -- I think.
> 
> I'm not quite sure what you are asking here, but I think there's a real risk
> that someone will change the schema without updating the docs, so I was
> thinking having the docs point to the canonical schema might help someone
> who can't reconcile the (outdated) docs with reality. Having the schema
> point at the docs also makes sense and might help encourage people to update
> the docs when changing the schema.

Hm, alright. That makes sense, I'll update it to point to the docs. My hesitancy was mainly about consistency with the documentation of the other ping types (but that's probably not going to actually be a real problem).
Whiteboard: [data-integrity] → [data-integrity] [sync-metrics]
Comment on attachment 8763709 [details]
Bug 1267919 - Part 2. Add documentation and a schema for a new "sync" telemetry ping.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/59844/diff/4-5/
Comment on attachment 8764763 [details]
Bug 1267919 - Part 3. Implement initial sync telemetry recording code.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/60442/diff/3-4/
https://reviewboard.mozilla.org/r/60442/#review59684

Ran out of time, but I think this is looking great. I'll get back to it ASAP.

::: services/sync/modules/telemetry.js:209
(Diff revisions 3 - 4)
>        version: PING_FORMAT_VERSION,
>        failureReason: this.failureReason,
> +      status: this.status,
>      };
>      let engines = [];
>      for (let engine of this.engines) {

I think we still write an entry for a "do nothing" engine sync - ie, just { took: xxx }

::: services/sync/modules/telemetry.js:218
(Diff revisions 3 - 4)
>      }
>      if (engines.length > 0) {
>        result.engines = engines;
>      }
>      // Avoid uploading a ping that has no information.
>      if (result.failureReason || result.engines) {

and ditto here (maybe :) - can this actually happen in practice? If not, maybe writing the "empty" ping is the right thing to do given it's not expected - it points at an otherwise hidden problem.

::: services/sync/modules/telemetry.js:252
(Diff revisions 3 - 4)
>        }
>      }
>  
> -    if (!this.failureReason) {
> -      if (this.engines.some(e => e.failureReason)) {
> -        // One of the engines failed but we never got the message (this is
> +    let statusObject = {};
> +
> +    // It might be better to check a few times for status.service...

did you mean "things" here? I suspect you are correct, but this should do to get it landed.

::: services/sync/tests/unit/head_helpers.js:255
(Diff revisions 3 - 4)
>  }
>  
>  function assert_valid_ping(record) {
>    if (record) {
>      if (!SyncPingValidator(record)) {
> +      dump("PING: "+JSON.stringify(record, null, 4));

gotta love dump debugging :/

::: services/sync/tests/unit/head_helpers.js:339
(Diff revisions 3 - 4)
> +    // assume that if no failureReason or engine failures are set, and the
> +    // status properties are the same as they were initially, that it's just
> +    // a leftover.
> +    // This is only an issue since we're triggering the sync of just one engine,
> +    // without doing any other parts of the sync.
> +    let initialServiceStatus = ns.Status._service;

If the tests all work, just blindly resetting these to "success" sounds fine. If just a few tests have problems here, please share what they are and it might be better to change them in some way (the convolutions below to check if it really was successful look a bit smelly, so blindly resetting status and letting assert_success_ping determine if it really was successful seems ideal)

In an even more nit-picking vein... I think the "success" param might be back-to-front, or should have a default of true, or we need an assert_error_ping. Someone copying a test might call |sync_and_validate_telem(engine)| expecting success but not noticing they should pass |false|. IIUC, this will work fine but simply not call |assert_success_ping()|. I think we need to arrange so an inverted |success| arg always fails.
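
A hedged sketch of how the symmetric check could work, independent of the real test helpers (`pingFailed` and `assert_ping_outcome` are invented names, and the failure predicate is simplified): with a default of true, a caller who forgets to pass `false` for an expected-failure sync gets a test failure rather than a silent pass, and an inverted flag fails in both directions.

```javascript
// Hypothetical helper: derive success/failure from the ping itself and
// assert it matches the caller's expectation (default: success).
function pingFailed(ping) {
  return !!(ping.failureReason ||
            (ping.engines || []).some(e => e.failureReason));
}
function assert_ping_outcome(ping, success = true) {
  if (pingFailed(ping) === success) {
    throw new Error(`expected ${success ? "success" : "failure"} ping`);
  }
}
```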
Comment on attachment 8763709 [details]
Bug 1267919 - Part 2. Add documentation and a schema for a new "sync" telemetry ping.

https://reviewboard.mozilla.org/r/59844/#review59686

Looks good to me with the new issues addressed, assuming :markh is fine with the new additions about the errors that he requested!

::: toolkit/components/telemetry/docs/sync-ping.rst:8
(Diff revisions 4 - 5)
>  ===========
>  
>  This ping is generated after a sync is completed, for both successful and failed syncs. Its payload contains measurements
>  pertaining to sync performance and error information. It does not contain the environment block, nor the clientId.
>  
> +A JSON-schema document describing the exact format of the ping's payload property can be found at `services/sync/tests/unit/sync\_ping\_schema.json <services/sync/tests/unit/sync\_ping\_schema.json>`_.

<p>I think you can link to DXR for the schema file as well.</p>

::: toolkit/components/telemetry/docs/sync-ping.rst:24
(Diff revisions 4 - 5)
>          took: <integer duration in milliseconds>,
>          uid: <string>, // FxA unique ID, or empty string
>          didLogin: <bool>, // optional, is this the first sync after login? excluded if we don't know.
>          why: <string>, // optional, why the sync occured, excluded if we don't know.
>  
>          // optional, excluded if there was no error.

nit: could you please use a capital letter at the beginning of the comment and add a trailing "."?

That's not mandatory, but it would be good to have a consistent style within this file: the "status" comments all start with capital letters and terminate with a trailing ".", while the others don't.
Attachment #8763709 - Flags: review?(alessio.placitelli) → review+
https://reviewboard.mozilla.org/r/60442/#review59684

> did you mean "things" here? I suspect you are correct, but this should do to get it landed.

I'm not sure what you mean.

My comment is mainly about the fact that a number of things clear out Status.service, or replace it with less useful data. If we check a few times, over the course of syncing, we're more likely to see the useful information instead of just a very high level overview of how the sync went.

> gotta love dump debugging :/

Crap, forgot to delete that one...

> If the tests all work, just blindly resetting these to "success" sounds fine. If just a few tests have problems here, please share what they are and it might be better to change them in some way (the convolutions below to check if it really was successful looks a bit smelly, so blindly reseting status and letting assert_success_ping determine if it really was successful seems ideal)
> 
> In an even more nit-picking vein... I think the "success" param might be back-to-front, or should have a default of true, or we need an assert_error_ping. Someone copying a test might call |sync_and_validate_telem(engine)| expecting success but not noticing they should pass |false|. IIUC, this will work fine but simply not call |assert_succcess_ping()|. I think we need to arrange so an inverted |success| arg always fails.

Yes, all the tests work as it is now. It's actually a large number of tests that had issues...  Nearly every test that used this function did, in fact.

Essentially what would happen is that they'd fail to create the collection URL, due to storageURL being undefined. I'll admit to not digging into this a ton, but my assumption was that it was due to us skipping the code in EngineSynchronizer.prototype.sync. But I could definitely be wrong, I'll dig into this some more.


I agree WRT the success argument -- it definitely seems like it should ensure success by default.
Comment on attachment 8763708 [details]
Bug 1267919 - Part 1. Import Ajv for validation of sync telemetry ping schema

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/59842/diff/3-4/
Comment on attachment 8763709 [details]
Bug 1267919 - Part 2. Add documentation and a schema for a new "sync" telemetry ping.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/59844/diff/5-6/
Comment on attachment 8764763 [details]
Bug 1267919 - Part 3. Implement initial sync telemetry recording code.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/60442/diff/4-5/
https://reviewboard.mozilla.org/r/59844/#review59686

> <p>I think you can link to DXR for the schema file as well.</p>

Ah, crap, that was supposed to be a DXR link. Thanks.

> nit: could you please use a capital letter at the beginning of the comment and add a trailing "."?
> 
> That's not mandatory, but it would be good to have a consistent styile within this file: the "status" comments all start with capital letters and terminate with a trailing ".", while the others don't.

All the comments should start with a capital letter now, except for the ones that start with a string naming an enum. 

(Initially they did not, to be consistent with the summaries in the other files.)
https://reviewboard.mozilla.org/r/60442/#review59684

> I'm not sure what you mean.
> 
> My comment is mainly about the fact that a number of things clear out Status.service, or replace it with less useful data. If we check a few times, over the course of syncing, we're more likely to see the useful information instead of just a very high level overview of how the sync went.

I thought the comment was supposed to read "// It might be better to check a few *things* for status.service..."
https://reviewboard.mozilla.org/r/60442/#review59684

> I thought the comment was supposed to read "// It might be better to check a few *things* for status.service..."

Sorry - to be clear, I thought you meant "a few things" as I thought you were saying "there are possibly a few Status.service values we should consider as success". I don't *think* it's worthwhile checking this at other times TBH - Sync tries hard to have a sane status after sync completes as some of the UI used to rely on it and I believe the scheduler still does - so I suspect you are just seeing weirdness in tests and not with real syncs - but please let me know if there are specific cases you have seen that are of concern.
https://reviewboard.mozilla.org/r/60442/#review59950

::: services/sync/modules/telemetry.js:33
(Diff revision 5)
> +XPCOMUtils.defineLazyServiceGetter(this, "Telemetry",
> +                                   "@mozilla.org/base/telemetry;1",
> +                                   "nsITelemetry");
> +
> +const log = Log.repository.getLogger("Sync.Telemetry");
> +

we need to set a level for the log - I think:

> log.level = Log.Level[Svc.Prefs.get("log.logger.telemetry", "Trace")];

is all we need (no need to define the pref IMO - if someone wants to tweak it they can just create the pref)

(Alternatively, I think "Debug" would be OK too, but in that case I'd probably want a log.info() in .submit or somewhere else so that we get a feel-good feeling of seeing at least 1 log from telemetry in each sync log. Setting a default of Trace means we get your observer traces)

::: services/sync/modules/telemetry.js:363
(Diff revision 5)
> +    this.current = null;
> +    this.submit(current.toJSON());
> +  }
> +
> +  observe(subject, topic, data) {
> +    log.trace(`observed ${topic} ${subject} ${data}`);

logging subject here probably doesn't make much sense - it's always going to be "null" or "[object Object]" (and I don't think it's worthwhile trying to stringify the wrapped object for trace level debugging)

::: services/sync/tests/unit/head_helpers.js:255
(Diff revision 5)
> +}
> +
> +function assert_valid_ping(record) {
> +  if (record) {
> +    if (!SyncPingValidator(record)) {
> +      dump("PING: "+JSON.stringify(record, null, 4));

dump

::: services/sync/tests/unit/head_helpers.js:301
(Diff revision 5)
> +function validate_all_future_pings() {
> +  let telem = get_sync_test_telemetry();
> +  telem.submit = assert_valid_ping;
> +}
> +
> +function wait_for_ping(callback, success) {

This function is almost always called with Service.sync() - what do you think about a helper sync_and_validate_telem?

Either way, I think my earlier comment about the |success| param should apply here too - it should fail if success = false but the ping was actually a success.

::: services/sync/tests/unit/test_telemetry.js:135
(Diff revision 5)
> +      name: "othererror",
> +      error: "error.engine.reason.record_download_fail"
> +    });
> +
> +  } finally {
> +    store.wipe();

Is there any reason we can't put the wipe and clearCache in cleanupAndGo and use it everywhere in this test?
https://reviewboard.mozilla.org/r/60442/#review59950

> we need to set a level for the log - I think:
> 
> > log.level = Log.Level[Svc.Prefs.get("log.logger.telemetry", "Trace")];
> 
> is all we need (no need to define the pref IMO - if someone wants to tweak it they can just create the pref)
> 
> (Alternatively, I think "Debug" would be OK too, but in that case I'd probably want a log.info() in .submit or somewhere else so that we get a feel-good feeling of seeing at least 1 log from telemetry in each sync log. Setting a default of Trace means we get your observer traces)

For what it's worth, |TelemetryController.submitExternalPing| logs (trace level) some useful information about the submitted pings. The log level is controlled by the |toolkit.telemetry.log.level| pref.
https://reviewboard.mozilla.org/r/60442/#review59684

> Sorry - to be clear, I thought you meant "a few things" as I thought you were saying "there are possibly a few Status.service values we should consider as success". I don't *think* it's worthwhile checking this at other times TBH - Sync tries hard to have a sane status after sync completes as some of the UI used to rely on it and I believe the scheduler still does - so I suspect you are just seeing weirdness in tests and not with real syncs - but please let me know if there are specific cases you have seen that are of concern.

So the end result of the above is that I think for now you should just remove that comment :)
Comment on attachment 8764763 [details]
Bug 1267919 - Part 3. Implement initial sync telemetry recording code.

https://reviewboard.mozilla.org/r/60442/#review59964

I think this is looking great and it seems to work as advertised \o/. Please fix the open issues but they are all minor enough I don't think I need to re-review afterwards.
Attachment #8764763 - Flags: review?(markh) → review+
Comment on attachment 8764763 [details]
Bug 1267919 - Part 3. Implement initial sync telemetry recording code.

Argh - I spoke too soon :( Sorry for not picking this up earlier, but I think we need to tighten the size of the data a little - in comment 6 I mentioned a 110 byte payload (optimistic much? ;) but we are well over that.

I think we should:
* Drop the "when" from each engine - we have the when for the Sync itself and we record the took for each engine.

* I'm also surprised to see a number of { took: 0 } entries - I guess it's just a low-res timer? Related, we could also consider a threshold before we write "took" for the engines - I doubt we care much about exact "took" figures for fast syncs - 50ms would remove 5 "tooks" from my sample payload and seems "fast enough" - not sure this is actually worthwhile though?

* Remove all zero values from the "incoming" and "outgoing" objects.

* Hack something up for the "clients" engine - for each Sync it reports:
> { "applied": 1,"succeeded": 1,"failed": 0, "newFailed": 0, "reconciled": $numClients }
Along with not having the 0s, we should arrange to see nothing here at all (as it reflects a noop). I don't really care if we do that in the engine itself or hard-code an engine-name hack in the telemetry module - whatever's cleanest and easiest.

Doing the above (by hand) for one of my syncs (that sent a tab but was otherwise a noop, and *without* removing 5 "tooks") took the payload (without newlines) from 609 bytes to 351. Taking out "tooks" of < 50 brings it to 303, and removing the 1 tab from the payload brings it down to 279. Not quite 110, but hopefully good enough.
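The trimming described above could be sketched roughly like this (a hypothetical helper, not the actual patch - the record shape here is simplified, with "incoming"/"outgoing" as flat counter objects):

```javascript
// Hypothetical sketch of trimming a per-engine record: drop the per-engine
// "when", omit "took" when it's 0, and remove zero-valued counters from the
// "incoming"/"outgoing" objects so a no-op engine shrinks to just its name.
function trimEngineRecord(engine) {
  const out = { name: engine.name };
  if (engine.took > 0) {
    out.took = engine.took;
  }
  for (const key of ["incoming", "outgoing"]) {
    if (!engine[key]) {
      continue;
    }
    const counts = {};
    for (const [counter, value] of Object.entries(engine[key])) {
      if (value !== 0) {
        counts[counter] = value;
      }
    }
    // Only keep the object if any non-zero counters survived.
    if (Object.keys(counts).length) {
      out[key] = counts;
    }
  }
  return out;
}
```

With this, a no-op engine like `{ name: "bookmarks", when: ..., took: 0, incoming: { applied: 0, failed: 0 } }` collapses to `{ name: "bookmarks" }`.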
Attachment #8764763 - Flags: review+
(In reply to Alessio Placitelli [:Dexter] from comment #59)
> For what it's worth, |TelemetryController.submitExternalPing| logs (trace
> level) some useful information about the submitted pings. The log level is
> controlled by the |toolkit.telemetry.log.level| pref.

Thanks, that's useful to know, but I don't think it helps us too much here - I doubt we want to write trace-level logging from telemetry to every sync log, as presumably many such messages aren't related to Sync. If the telemetry framework was set up such that only stuff we do could be logged it might make more sense, but TBH I'm not sure it is worth spending any effort on that.
(In reply to Mark Hammond [:markh] from comment #63)
> Comment on attachment 8764763 [details]
> Bug 1267919 - Part 3. Implement initial sync telemetry recording code.
> 
> Argh - I spoke too soon :( Sorry for not picking this up earlier, but I
> think we need to tighten the size of the data a little - in comment 6 I
> mentioned a 110 byte payload (optimistic much? ;) but we are well that over
> that.

I think it increased a good amount in the last patch (we always submit all engines) so I'

> 
> I think we should:
> * Drop the "when" from each engine - we have the when for the Sync itself
> and we record the took for each engine.
> 
> * I'm also surprised to see a number of { took: 0 } entries - I guess it's
> just a low-res timer? 

Yes, I found this out while digging through the telemetry timing code.  The monotonic timer used by telemetry is explicitly low resolution (it uses `TimeStamp::NowLoRes()`). This seems, well, less than ideal to me, but apparently there are performance issues with the high res timer on some versions of Windows.

> Related, we could also consider a threshold before we
> write "took" for the engines - I doubt we care much about exact "took"
> figures for fast syncs - 50ms would remove 5 "tooks" from my sample payload
> and seems "fast enough" - not sure this is actually worthwhile though?

Well, if we did this (and I'm writing up a patch to do it now), there's really no reason to include do-nothing engines in the sync, since they would literally only have {"name": "foo"}.

> 
> * Remove all zero values from the "incoming" and "outgoing" objects.
> 
> * Hack something up for the "clients" engine, or each Sync it reports:
> > { "applied": 1,"succeeded": 1,"failed": 0, "newFailed": 0, "reconciled": $numClients }
> along with not having the 0s, we should arrange to see nothing here (as it
> reflects a noop). I don't really care if we do that in the engine itself or
> hard-code an engine name hack in the telemetry module - whatever's cleanest
> and easiest.

IMO a hardcoded check in telemetry is simpler and more flexible. Of course, we'd only want to ignore it in the success case, I think (e.g. if failed is nonzero, then we want to know).
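The hardcoded check could look something like this (an illustrative sketch only - names like `shouldIgnoreEngine` and `failureReason` are made up, and the record shape is simplified):

```javascript
// Hypothetical sketch: suppress the "clients" engine record when it only
// reflects the usual no-op reconcile, but keep it whenever anything failed.
function shouldIgnoreEngine(engine) {
  if (engine.name !== "clients") {
    return false;
  }
  if (engine.failureReason) {
    // Only ignore it in the success case.
    return false;
  }
  const { applied = 0, failed = 0, newFailed = 0 } = engine.incoming || {};
  // A routine clients sync applies at most its own record and fails nothing.
  return failed === 0 && newFailed === 0 && applied <= 1;
}
```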

Related, we can get rid of the `success` property -- it looks like it's always just `applied - failed`, and the code that generates this record confirms this.

> Doing the above (by hand) for one of my syncs (that sent a tab but was
> otherwise a noop, and *without* removing 5 "tooks") took the payload
> (without newlines) from 609 bytes to 351. Taking "tooks" of < 50 out is 303.
> Taking the 1 tab from the payload and it's down to 279. Not quite 110, but
> hopefully good enough.

Are these numbers compressed or uncompressed? I just asked on #telemetry - we compress the pings before sending, so knowing how much we save in the compressed data seems more important.
Argh, forgot to finish typing my first comment.  That should be, "I think it increased a good amount in the last patch (we always submit all engines) so I'm not sure you could have noticed this before, at least not as much".
> Well, if we did this (and I'm writing up a patch to do it now), there's really no reason to include do-nothing engines in the sync. If they would literally only have {"name": "foo"}

No, I changed my mind on this, it's still useful to have the information that we tried to do something.

After making the changes (but leaving non-zero "took" values -- cutting off at, e.g. 50 is a weird special case that I'm not a fan of, especially since we'd probably need to document that value with the telemetry...) I get 692b for an initial sync (274b gzipped), and 181b for a single tab sync (155b gzipped).

Hopefully this is small enough. Taking a glance at it (https://gist.github.com/thomcc/27b391be971ba8cee9e0f0b3058229f4), it doesn't look like any of that data is redundant or useless, unless we want to start shortening property names or something (and I tried this - moving to single-character property names saves around 10%-20% compressed... which isn't trivial, but doesn't seem worth it to me).
Comment on attachment 8763708 [details]
Bug 1267919 - Part 1. Import Ajv for validation of sync telemetry ping schema

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/59842/diff/4-5/
Comment on attachment 8763709 [details]
Bug 1267919 - Part 2. Add documentation and a schema for a new "sync" telemetry ping.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/59844/diff/6-7/
Comment on attachment 8764763 [details]
Bug 1267919 - Part 3. Implement initial sync telemetry recording code.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/60442/diff/5-6/
Comment on attachment 8764763 [details]
Bug 1267919 - Part 3. Implement initial sync telemetry recording code.

https://reviewboard.mozilla.org/r/60442/#review60454

That looks awesome and the ping size is great - thanks!
Attachment #8764763 - Flags: review+
Comment on attachment 8763708 [details]
Bug 1267919 - Part 1. Import Ajv for validation of sync telemetry ping schema

https://reviewboard.mozilla.org/r/59842/#review60456
Attachment #8763708 - Flags: review?(markh) → review+
Comment on attachment 8764763 [details]
Bug 1267919 - Part 3. Implement initial sync telemetry recording code.

https://reviewboard.mozilla.org/r/60442/#review60572
Attachment #8764763 - Flags: review?(alessio.placitelli) → review+
Pushed by cbook@mozilla.com:
https://hg.mozilla.org/integration/fx-team/rev/2e399342098f
Part 1. Import Ajv for validation of sync telemetry ping schema. r=Dexter,markh
https://hg.mozilla.org/integration/fx-team/rev/2ee8d8806fca
Part 2. Add documentation and a schema for a new "sync" telemetry ping. r=bsmedberg,Dexter
https://hg.mozilla.org/integration/fx-team/rev/f919bd0f63f4
Part 3. Implement initial sync telemetry recording code. r=Dexter,markh
Keywords: checkin-needed
Depends on: 1287473
Depends on: 1288445
Depends on: 1295058