Closed Bug 1267917 Opened 8 years ago Closed 8 years ago

Report consistency of bookmarks data via telemetry.

Categories

(Firefox :: Sync, defect, P1)

defect

Tracking

()

RESOLVED FIXED
Firefox 52
Tracking Status
firefox52 --- fixed

People

(Reporter: markh, Assigned: tcsc)

References

Details

(Whiteboard: [data-integrity][measure-integrity])

Attachments

(1 file)

Bug 1265419 is adding the ability to measure the consistency of Sync related data. This bug is to take that facility, integrate it into Sync itself, then report the results via telemetry.
Flags: firefox-backlog+
Depends on: 1267919
Assignee: nobody → tchiovoloni
Status: NEW → ASSIGNED
This is just the first part of this (integrating it into sync), which was initially a patch on bug 1265419. But it's not particularly meaningful until bug 1267919 is resolved.
Priority: -- → P1
Whiteboard: [data-integrity]
Comment on attachment 8747102 [details]
Bug 1267917 - Hook the sync bookmark validator into the new sync telemetry ping

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/49731/diff/1-2/
Attachment #8747102 - Attachment description: MozReview Request: Bug 1267917 - Hook bookmark sync validator into the bookmark sync engine and add a telemetry probe for it → Bug 1267917 - Hook the sync bookmark validator into the new sync telemetry ping f?markh
Comment on attachment 8747102 [details]
Bug 1267917 - Hook the sync bookmark validator into the new sync telemetry ping

Lots of parts of this I'm not sure about but I'm getting it up for feedback anyway, at least because doing any work on other validators will depend on knowing how we want the API/integration with telemetry to work. Open questions:

- Should we be running the validation for an engine in the "middle" of syncing the engine (e.g. before the engine emit's its finished event). This allows it to be more consistent for integrating with telemetry, but it's definitely unclear that this is the best time to do it (The earlier patch waited until everything had).

- Providing more information about differences. It's unclear if this information is that valuable (aside from the structural differences, for which there are only two cases, childGUIDs and parentid, which should possibly just be considered their own groups). However, this seems (possibly) more valuable for other types of validators.

- Should we be tracking last validation time on a per-engine basis?  We might even want to stagger them, to reduce server load or poor performance from fetching the entire world in a given sync.

Worth noting that if you look at the diff between the last two patches on mozreview there's *extremely* poor signal to noise, (since it has every change that has happened to these files between the last patch and now) but if you look at the diff between orig and patch 2, it doesn't include this, and is much more readable as a result.
Attachment #8747102 - Flags: feedback?(markh)
https://reviewboard.mozilla.org/r/49731/#review61190

This is certainly heading in the right direction!

::: services/sync/modules/bookmark_validator.js:84
(Diff revision 2)
>  
>    /**
> +   * Convert ("difference", [{ differences: ["tags", "name"] }, { differences: ["name"] }]) into
> +   * [{ name: "difference:tags", count: 1}, { name: "difference:name", count: 2 }], etc.
> +   */
> +  _summarizeDifferences(prefix, diffs) {

I'm not totally convinced this is worthwhile at this early stage as I'm not sure what action we will take based on the breakdown. I could be wrong though, but probably need to have a better understanding of typical payloads with and without this additional param.

::: services/sync/modules/bookmark_validator.js:148
(Diff revision 2)
>  }
>  
>  class BookmarkValidator {
>  
> +  constructor() {
> +    this._log = Log.repository.getLogger("Sync.Validator.Bookmarks");

later you are passing a log here. I think a dedicated log is fine (ie, don't bother passing the log in)

(I also think a later patch could decide to set the level on a log names "Sync.Validator", and if each engine then uses a "sub" log they get the default level for free - but this is fine for now)

::: services/sync/modules/engines.js:1564
(Diff revision 2)
>        this._tracker.addChangedID(id, when);
>      }
>      this._modified = {};
>    },
>  
> -  _sync: function () {
> +  _sync: function (tryValidate) {

I'm a little reluctant to do this at sync time as unexpected delays will make Sync appear very laggy - ie, we should try our best to sync all engines before validating.

That will obviously have an impact on the telemetry collecting module, but I expect that *should* be OK (although if it really makes things over complicated we can reconsider)

::: services/sync/modules/engines/bookmarks.js:433
(Diff revision 2)
>      this._log.debug(item.id + " mapped to " + mapped);
>      return mapped;
> +  },
> +
> +  getValidator() {
> +    return new BookmarkValidator(this._log.debug.bind(this._log));

your passing a log to the ctor, but the ctor creates its own load.

::: services/sync/services-sync.js:82
(Diff revision 2)
>  pref("services.sync.log.cryptoDebug", false);
>  
>  pref("services.sync.fxa.termsURL", "https://accounts.firefox.com/legal/terms");
>  pref("services.sync.fxa.privacyURL", "https://accounts.firefox.com/legal/privacy");
> +
> +pref("services.sync.validationInterval", 86400);

let's comment what this means. As we discussed, I think we should add a few other "throttles" for this - we should insist on there being no more than (say) 1000 bookmarks and I don't think it would hurt to use a hacky random "coin toss" so there's only a 10% (1%?) of the validation being done this session. Then, once we get the results via telemetry (which should include how long the validation took) we can look into making this run more often.

I think we should also write something to the console log so people looking there after seeing Nightly suddenly crawl to a halt have something to blame.

I'm making alot of that up as I go, so I'm open to other implementation ideas that achieve the same effect (ie, minimize the chance of this feature causing Nightly users to have a bad time)

::: toolkit/components/telemetry/Histograms.json:9408
(Diff revision 2)
>      "kind": "count",
>      "keyed": true,
>      "releaseChannelCollection": "opt-out",
>      "description": "Exceptions thrown by a Sync engine. Keyed on the engine name."
>    },
> +  "WEAVE_BOOKMARK_VALIDATION_DURATION": {

I'm inclined to think that the validation duration should just be written to our ping and ignore the histogram. It might also be worth writing how many server records we processed if it's not too hacky, so we can get some idea of what the pain point is.

You almost certainly also need changes to the sync telemetry schema, and ensure the validation tests create valid pings.
https://reviewboard.mozilla.org/r/49731/#review61190

> I'm not totally convinced this is worthwhile at this early stage as I'm not sure what action we will take based on the breakdown. I could be wrong though, but probably need to have a better understanding of typical payloads with and without this additional param.

Hm. Not sure. It seems valuable to me, here's why: say we see a ton of items in `differences`, or even not a ton, but some percentage of all of the validations we get *something* there. It's not clear how we could go about debugging that without some insight into which property is different. Moreover, it doesn't seem very actionable, and even if it is, it's potentially a lot of work, wheras if we see "oh the tags are different", that could be a very short amount of time to track down and fix.

Also, it would also make it easier to ignore known bugs. E.g. bug 1276969 will cause a number of `diff:parentName`, but if this is just in `differences`, we wouldn't be able to detect this....  Although arguably we should just be special casing these and either removing them or making them an explicit category, e.g. `{ name: "bug1276969", count: 5 }`, or something.

> I'm a little reluctant to do this at sync time as unexpected delays will make Sync appear very laggy - ie, we should try our best to sync all engines before validating.
> 
> That will obviously have an impact on the telemetry collecting module, but I expect that *should* be OK (although if it really makes things over complicated we can reconsider)

It makes sense to me to wait until all the engines have synced (and this is reasonably easy to do), but running it at some point *after* sync sounds somewhat problematic. I don't really like the idea of having the telemetry module wait for a message about validation in order to submit, since validation is not guaranteed to run... 

Although I guess we could have a 'validation finished' event, when we actually send the ping, and before that point, any time we get a 'engine validated' event, we look through the engines in the ping (I guess creating one if necessary? Although it shouldn't be...), and add it to that engine's ping data.

That actually doesn't sound too bad. Although it *does* sound a bit weird to me to record the sync as complete -- e.g. we are no longer spinning the spinner, etc. when sync is still doing work...

> let's comment what this means. As we discussed, I think we should add a few other "throttles" for this - we should insist on there being no more than (say) 1000 bookmarks and I don't think it would hurt to use a hacky random "coin toss" so there's only a 10% (1%?) of the validation being done this session. Then, once we get the results via telemetry (which should include how long the validation took) we can look into making this run more often.
> 
> I think we should also write something to the console log so people looking there after seeing Nightly suddenly crawl to a halt have something to blame.
> 
> I'm making alot of that up as I go, so I'm open to other implementation ideas that achieve the same effect (ie, minimize the chance of this feature causing Nightly users to have a bad time)

I think doing the following things sounds like a good start:

- Don't bother validating at all if telemetry recording is disabled for the user (#telemetry indicates that this is `Services.telemetry.canRecordBase` for opt-out pings, and `Services.telemetry.canRecordExtended` for opt-in ones. I *think* the sync ping is opt-out, but... I guess I don't actually know?).

- Add a pref for the probability of running a validation this time around, controlled by a pref. If `Math.random() < percentage`, then we validate. Otherwise, we don't and we wait until the validationInterval is up to try again.

- Logging into the console (I guess via `Console.jsm`? Or is there a better way to get stuff to show up there?) sounds good. Maybe with a message that this is controlled by prefs x, y, and z (or the pref branch `services.sync.validation.*`, and put all the prefs that control it under there), so that anybody who wants to turn it off or reduce it's frequency can.

- A number of records (also a pref) that above this point, we won't bother validating (should this be per-engine) because it would be too costly. I don't know what the right number is here. https://telemetry.mozilla.org/new-pipeline/dist.html#!measure=PLACES_BOOKMARKS_COUNT indicates that 100 would get 2/3 of users, and beyond that are steeply diminishing returns (80% requires ~300), so 100 sounds like a nice number to me.  
You mentioned checking the server for this number yesterday, I don't see the benefit to doing this, since at this point we will have synced, so the client and the server should have roughly the same number of entries. I guess there's an edge case where maybe the server has hundreds of deleted entries? I'm willing to be persuaded here, it just sounds like more trouble than it's worth.

> I'm inclined to think that the validation duration should just be written to our ping and ignore the histogram. It might also be worth writing how many server records we processed if it's not too hacky, so we can get some idea of what the pain point is.
> 
> You almost certainly also need changes to the sync telemetry schema, and ensure the validation tests create valid pings.

Well, the validation tests are very self contained. They just instantiate the validator and pass in objects/arrays that look like the bookmark tree/server records, that is, they don't touch the rest of sync and so they wouldn't create pings at all.

(But this does sound like something I could fake/mock in some way.)
https://reviewboard.mozilla.org/r/49731/#review61190

> I think doing the following things sounds like a good start:
> 
> - Don't bother validating at all if telemetry recording is disabled for the user (#telemetry indicates that this is `Services.telemetry.canRecordBase` for opt-out pings, and `Services.telemetry.canRecordExtended` for opt-in ones. I *think* the sync ping is opt-out, but... I guess I don't actually know?).
> 
> - Add a pref for the probability of running a validation this time around, controlled by a pref. If `Math.random() < percentage`, then we validate. Otherwise, we don't and we wait until the validationInterval is up to try again.
> 
> - Logging into the console (I guess via `Console.jsm`? Or is there a better way to get stuff to show up there?) sounds good. Maybe with a message that this is controlled by prefs x, y, and z (or the pref branch `services.sync.validation.*`, and put all the prefs that control it under there), so that anybody who wants to turn it off or reduce it's frequency can.
> 
> - A number of records (also a pref) that above this point, we won't bother validating (should this be per-engine) because it would be too costly. I don't know what the right number is here. https://telemetry.mozilla.org/new-pipeline/dist.html#!measure=PLACES_BOOKMARKS_COUNT indicates that 100 would get 2/3 of users, and beyond that are steeply diminishing returns (80% requires ~300), so 100 sounds like a nice number to me.  
> You mentioned checking the server for this number yesterday, I don't see the benefit to doing this, since at this point we will have synced, so the client and the server should have roughly the same number of entries. I guess there's an edge case where maybe the server has hundreds of deleted entries? I'm willing to be persuaded here, it just sounds like more trouble than it's worth.

Actually, regarding checking the server, this does seem simpler than writing code that counts the records for each engine, especially since every engine would need its own version of that code.
Blocks: 1286915
Blocks: 1286918
Blocks: 1286923
Comment on attachment 8747102 [details]
Bug 1267917 - Hook the sync bookmark validator into the new sync telemetry ping

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/49731/diff/2-3/
Attachment #8747102 - Flags: feedback?(markh)
https://reviewboard.mozilla.org/r/49731/#review61190

> later you are passing a log here. I think a dedicated log is fine (ie, don't bother passing the log in)
> 
> (I also think a later patch could decide to set the level on a log names "Sync.Validator", and if each engine then uses a "sub" log they get the default level for free - but this is fine for now)

Forgot to mention this, we don't do any logging that matters here, so it doesn't seem worth it to log in the validator at all.

> your passing a log to the ctor, but the ctor creates its own load.

There's no need for either of these, really.
Comment on attachment 8747102 [details]
Bug 1267917 - Hook the sync bookmark validator into the new sync telemetry ping

Ack, it looks like I forgot to add you back on this after updating.

(Should have a version of this with tests up pretty soon/tomorrow, just fyi)
Attachment #8747102 - Flags: feedback?(markh)
https://reviewboard.mozilla.org/r/49731/#review61190

> Hm. Not sure. It seems valuable to me, here's why: say we see a ton of items in `differences`, or even not a ton, but some percentage of all of the validations we get *something* there. It's not clear how we could go about debugging that without some insight into which property is different. Moreover, it doesn't seem very actionable, and even if it is, it's potentially a lot of work, wheras if we see "oh the tags are different", that could be a very short amount of time to track down and fix.
> 
> Also, it would also make it easier to ignore known bugs. E.g. bug 1276969 will cause a number of `diff:parentName`, but if this is just in `differences`, we wouldn't be able to detect this....  Although arguably we should just be special casing these and either removing them or making them an explicit category, e.g. `{ name: "bug1276969", count: 5 }`, or something.

These are good points, but I guess I'm mostly concerned with increasing the size of the ping without actionable data (ie, it's not clear to me that we will prioritize fixing anything other than structural issues for some time).

Let's defer the decision here while we wait for the fallout from the main sync ping to settle.

> Forgot to mention this, we don't do any logging that matters here, so it doesn't seem worth it to log in the validator at all.

SGTM

> It makes sense to me to wait until all the engines have synced (and this is reasonably easy to do), but running it at some point *after* sync sounds somewhat problematic. I don't really like the idea of having the telemetry module wait for a message about validation in order to submit, since validation is not guaranteed to run... 
> 
> Although I guess we could have a 'validation finished' event, when we actually send the ping, and before that point, any time we get a 'engine validated' event, we look through the engines in the ping (I guess creating one if necessary? Although it shouldn't be...), and add it to that engine's ping data.
> 
> That actually doesn't sound too bad. Although it *does* sound a bit weird to me to record the sync as complete -- e.g. we are no longer spinning the spinner, etc. when sync is still doing work...

Sorry, I wasn't clear - I meant we shouldn't do it at *engine* sync time. ie, I think we should do this as part of the sync (ie, before we've sent sync:finished and released the "lock"), but should do it *after* all engines have synced. This should make life easier for the validator (it still submits on sync finish) and is probably necessary anyway to avoid us starting a Sync during validation (which would almost certainly be bad)

> Actually, regarding checking the server, this does seem simpler than writing code that counts the records for each engine, especially since every engine would need its own version of that code.

All the above sounds good to me. I think unconditionally logging a message each validation is OK initially, and something we might address if we increse the frequency later. Console.jsm or Services.console.logStringMessage() both would be fine.

> Well, the validation tests are very self contained. They just instantiate the validator and pass in objects/arrays that look like the bookmark tree/server records, that is, they don't touch the rest of sync and so they wouldn't create pings at all.
> 
> (But this does sound like something I could fake/mock in some way.)

Right - it still seems worthile to ensure that when the output from the validator is fed into the telemetry module, the ping remains valid. There's significant flexibility how we actually achieve that though.
Comment on attachment 8747102 [details]
Bug 1267917 - Hook the sync bookmark validator into the new sync telemetry ping

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/49731/diff/3-4/
Attachment #8747102 - Flags: feedback?(markh)
Attachment #8747102 - Flags: feedback?(markh)
https://reviewboard.mozilla.org/r/49731/#review61190

> These are good points, but I guess I'm mostly concerned with increasing the size of the ping without actionable data (ie, it's not clear to me that we will prioritize fixing anything other than structural issues for some time).
> 
> Let's defer the decision here while we wait for the fallout from the main sync ping to settle.

Hm, thats a good point. I do think that the info about *which* structural difference it is (since there are two possible ones) is still relevant, given that that's our priority. Doing the same for differences, should we end up seeing any of these, would be easy given this.
https://reviewboard.mozilla.org/r/49731/#review62598

::: browser/app/profile/firefox.js:1460
(Diff revision 3)
>  pref("print.use_simplify_page", true);
>  
>  // Space separated list of URLS that are allowed to send objects (instead of
>  // only strings) through webchannels.
>  pref("webchannel.allowObject.urlWhitelist", "https://accounts.firefox.com https://content.cdn.mozilla.net https://hello.firefox.com https://input.mozilla.org https://support.mozilla.org https://install.mozilla.org");
> +

Removing this so that there isn't so much of a reason to wait for data analysis to land it.
https://reviewboard.mozilla.org/r/49731/#review62824

This is looking great!

::: services/sync/modules/stages/enginesync.js:194
(Diff revision 4)
>          } catch (error) {
>            this._log.error("Unable to upload meta/global. Leaving marked as new.");
>          }
>        }
>  
> +      this._tryValidateEngines(enginesToSync)

nit: missing ';'

::: services/sync/modules/stages/enginesync.js:258
(Diff revision 4)
> +    } else {
> +
> +      let collectionCountsURL = this.service.userBaseURL + "info/collection_counts";
> +      try {
> +        let infoResp = this.service._fetchInfo(collectionCountsURL);
> +        info = infoResp.obj; // might throw because obj is a getter which parses json.

I think you also want to check infoResp.success here, or there's a risk you will get a non-200 with valid JSON

::: services/sync/modules/stages/enginesync.js:261
(Diff revision 4)
> +      try {
> +        let infoResp = this.service._fetchInfo(collectionCountsURL);
> +        info = infoResp.obj; // might throw because obj is a getter which parses json.
> +      } catch (e) {
> +        // Not running validation is totally fine, so we just write an error log and return.
> +        this._log.error("Can't run validation: Caught error when fetching counts {}", e);

I don't think you need the trailing "{}" in the string - you could use "${}" although it's not necessary in this case (ie, where you want the object appended)

::: services/sync/modules/stages/enginesync.js:275
(Diff revision 4)
> +    let engineLookup = new Map(recentlySyncedEngines.map(e => [e.name, e]));
> +    let toRun = [];
> +    for (let [engineName, recordCount] of Object.entries(info)) {
> +      let engine = engineLookup.get(engineName);
> +      if (recordCount > maxRecords || !engine) {
> +        continue;

let's log.debug here

::: services/sync/modules/stages/enginesync.js:290
(Diff revision 4)
> +
> +    if (!toRun.length) {
> +      return;
> +    }
> +    Services.console.logStringMessage(
> +      "Sync is about to run a consistency check. This may be slow, and " +

let's throw the word "experimental" in there :) Seeing as we are being verbose, it might even be worthwhile requesting they open a bug if they notice bad things.

::: services/sync/modules/stages/enginesync.js:293
(Diff revision 4)
> +    }
> +    Services.console.logStringMessage(
> +      "Sync is about to run a consistency check. This may be slow, and " +
> +      "can be controlled using the pref \"services.sync.validation.enabled\".");
> +    for (let { validator, engine } of toRun) {
> +      let result = Async.promiseSpinningly(validator.validate(engine));

should we catch here?

::: services/sync/tests/unit/sync_ping_schema.json:48
(Diff revision 4)
>        "properties": {
>          "failureReason": { "$ref": "#/definitions/error" },
>          "name": { "enum": ["addons", "bookmarks", "clients", "forms", "history", "passwords", "prefs", "tabs"] },
>          "took": { "type": "integer", "minimum": 1 },
> +        "validationTook": { "type": "integer", "minimum": 1 },
> +        "validationChecked": { "type": "integer", "minimum": 1 },

I was kinda expecting a "validation" object with this meta-data and the actual results? What does a new ping look like with this patch (assuming it ran)?
https://reviewboard.mozilla.org/r/49731/#review62824

> I think you also want to check infoResp.success here, or there's a risk you will get a non-200 with valid JSON

`Service._fetchInfo` takes care of this for us, but there's probably no harm in being more explict about how a unsuccessful request is handled, especially since I did go back and forth on using `_fetchInfo`, and it seems plausible that something else could be used at some point in the future. (And it would only be three additional lines)

> I don't think you need the trailing "{}" in the string - you could use "${}" although it's not necessary in this case (ie, where you want the object appended)

Would ${} in a template string output the same data? It would just do String(e) -- although I guess I don't know for sure that that won't provide as much info as log.error (which I had assumed knows how to handle exception objects, but maybe not). 

I'll just take {} out though, no reason to provide it if it's there by default.

> let's log.debug here

Sure. It's going to be spammy since `info` contains the counts of collections that don't correspond to engines (like `meta`), but a debug log is harmless.

> should we catch here?

Sure, I don't think it's likely that they'll throw, but I've been taking the "failure to validate is not a sync failure" approach elsewhere in this patch. 

(That said, I'm somewhat torn about that approach in general, since it *does* indicate a problem we want to know about, but the sync didn't actually *fail*. Maybe we should add something to Status.engines and let telemetry pick it up?)

> I was kinda expecting a "validation" object with this meta-data and the actual results? What does a new ping look like with this patch (assuming it ran)?

Ah, this was a somewhat misguided attempt at backwards compatibility. If we had already been getting validation arrays, it would have allowed us to add those fields without requiring any special handling in the code looking at the pings.

Of course, we aren't getting any validation arrays yet, so there's no backwards compatibility issue -- I'll change it to an object.
Comment on attachment 8747102 [details]
Bug 1267917 - Hook the sync bookmark validator into the new sync telemetry ping

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/49731/diff/4-5/
Attachment #8747102 - Flags: feedback?(markh)
Attachment #8747102 - Flags: feedback?(markh)
https://reviewboard.mozilla.org/r/49731/#review62824

> `Service._fetchInfo` takes care of this for us, but there's probably no harm in being more explict about how a unsuccessful request is handled, especially since I did go back and forth on using `_fetchInfo`, and it seems plausible that something else could be used at some point in the future. (And it would only be three additional lines)

Ah, yes, you are correct. I'm fine with this either way (ie, originally or with those 3 additional lines)

> Would ${} in a template string output the same data? It would just do String(e) -- although I guess I don't know for sure that that won't provide as much info as log.error (which I had assumed knows how to handle exception objects, but maybe not). 
> 
> I'll just take {} out though, no reason to provide it if it's there by default.

yeah, you should get the same results either way - it was more an FYI

> Sure, I don't think it's likely that they'll throw, but I've been taking the "failure to validate is not a sync failure" approach elsewhere in this patch. 
> 
> (That said, I'm somewhat torn about that approach in general, since it *does* indicate a problem we want to know about, but the sync didn't actually *fail*. Maybe we should add something to Status.engines and let telemetry pick it up?)

Hopefully we'd pick it up eventually via someone noticing an error log, or we could consider adding a flag to indicate it failed.
https://reviewboard.mozilla.org/r/49731/#review64748

Looks great! As we discussed, I think the next steps (apart from the minor issues) here are to ping the telemetry team before landing (but after they are happy with the volume of the main ping :)

::: services/sync/modules/engines/bookmarks.js:21
(Diff revision 5)
>  Cu.import("resource://services-sync/constants.js");
>  Cu.import("resource://services-sync/engines.js");
>  Cu.import("resource://services-sync/record.js");
>  Cu.import("resource://services-sync/util.js");
>  Cu.import("resource://services-sync/bookmark_utils.js");
> +Cu.import("resource://services-sync/bookmark_validator.js");

probably not particularly important, but a lazy import seems to make sense here.

::: services/sync/modules/stages/enginesync.js:220
(Diff revision 5)
> +      this._log.info("Skipping validation: validation or telemetry reporting is disabled");
> +      return;
> +    }
> +
> +    let lastValidation = Svc.Prefs.get("validation.lastTime", 0);
> +    let validationInterval = Svc.Prefs.get("validation.interval", 60 * 60 * 24);

I don't think it's worth re-specifying the default value here (doesn't really hurt though - whatever)

::: services/sync/modules/stages/enginesync.js:237
(Diff revision 5)
> +    let validationProbability = Svc.Prefs.get("validation.percentageChance", 0) / 100.0;
> +    if (validationProbability < Math.random()) {
> +      this._log.info("Skipping validation: Probability threshold not met");
> +      return;
> +    }
> +    let maxRecords = Svc.Prefs.get("validation.maxRecords", 0);

ditto (or especially) here as the default (if no pref) is undefined which would work fine.

::: services/sync/modules/stages/enginesync.js:302
(Diff revision 5)
> +      "Sync is about to run a consistency check. This may be slow, and " +
> +      "can be controlled using the pref \"services.sync.validation.enabled\".\n" +
> +      "If you encounter any problems because of this, please file a bug.");
> +    for (let { validator, engine } of toRun) {
> +      try {
> +        let result = Async.promiseSpinningly(validator.validate(engine));

can we turn this function into a task and hoist the promiseSpinningly up to the caller?

::: services/sync/modules/telemetry.js:178
(Diff revision 5)
> +      log.error(`Multiple validations occurred for engine ${this.name}!`);
> +      return;
> +    }
> +    let { problems, duration, recordCount } = validationResult;
> +    let validation = {
> +      checked: recordCount || 0,

when will we report validation with a non-integer record count?

::: services/sync/services-sync.js:82
(Diff revision 5)
>  pref("services.sync.log.cryptoDebug", false);
>  
>  pref("services.sync.fxa.termsURL", "https://accounts.firefox.com/legal/terms");
>  pref("services.sync.fxa.privacyURL", "https://accounts.firefox.com/legal/privacy");
> +
> +pref("services.sync.validation.interval", 86400);

I thought I mentioned this last time, but apparently not. Let's add comments for these prefs, including the unit (seconds) and  that 86400 is once per day.

::: services/sync/tests/unit/sync_ping_schema.json:70
(Diff revision 5)
>            "type": "array",
>            "minItems": 1,
>            "items": { "$ref": "#/definitions/outgoingBatch" }
>          },
>          "validation": {
> +          "type": "object",

ah, sorry, I missed the fact that validation was already in the schema in my previous comments. I think this change is fine though.

::: services/sync/tests/unit/test_bookmark_validator.js:264
(Diff revision 5)
> +  // remove c
> +  server.pop();
> +  server[0].children.pop();
> +  const duration = 50;
> +  let ping = yield validationPing(server, client, duration);
> +  ok(!!ping.engines);

I doubt we really need the !! here and below - assert.jsm explicitly says it allows anything "truthy" (but it doesn't really hurt; it's more a style issue)
Attachment #8747102 - Flags: feedback?(markh) → feedback+
Priority: P1 → P2
Whiteboard: [data-integrity] → [data-integrity][measure-integrity]
Priority: P2 → P1
Comment on attachment 8747102 [details]
Bug 1267917 - Hook the sync bookmark validator into the new sync telemetry ping

Clearing Benjamin's request for now - I think the patch will want one more pass, and we should re-flag Benjamin for review later, when we also summarize the patch and bug so he doesn't need to piece the context together from the many comments above.
Attachment #8747102 - Flags: review?(benjamin)
Comment on attachment 8747102 [details]
Bug 1267917 - Hook the sync bookmark validator into the new sync telemetry ping

https://reviewboard.mozilla.org/r/49731/#review76060

This is looking great and almost ready to land - just a few little tweaks. As per my last comment, when you push the next version, leave bsmedberg off the commit message so he doesn't get re-flagged, then we can manually flag him for review in bugzilla and also summarize what's going on here to help him (hopefully!) perform a quick rubber-stamp.

::: services/sync/modules/stages/enginesync.js:194
(Diff revision 6)
>          } catch (error) {
>            this._log.error("Unable to upload meta/global. Leaving marked as new.");
>          }
>        }
>  
> +      this._tryValidateEngines(enginesToSync);

I should have thought of this earlier, but we probably want to skip validation of the engine failed to Sync - IIUC we are almost certain to find client/server inconsistencies in that case. I suspect this is probably best done by using a new var `enginesToValidate` and push the current engine name after it syncs successfully.

::: services/sync/modules/stages/enginesync.js:214
(Diff revision 6)
>  
>      this.onComplete(null);
>    },
>  
> +  _tryValidateEngines(recentlySyncedEngines) {
> +    if (!Services.telemetry.canRecordBase || !Svc.Prefs.get("validation.enabled", false)) {

I can't see where the default value for this pref is, and it doesn't seem to already be in the tree. I mainly wanted to make sure it was only true for nightly builds at this stage, but it looks like it's never true by default?

::: services/sync/modules/stages/enginesync.js:305
(Diff revision 6)
> +    for (let { validator, engine } of toRun) {
> +      try {
> +        let result = Async.promiseSpinningly(validator.validate(engine));
> +        Observers.notify("weave:engine:sync:validated", result, engine.name);
> +      } catch (e) {
> +        this._log.error(`Failed to run validation on ${engine.name}!`, e);

Do you think we can tweak the telemetry such that the failure is recorded?
Attachment #8747102 - Flags: review?(markh)
Comment on attachment 8747102 [details]
Bug 1267917 - Hook the sync bookmark validator into the new sync telemetry ping

https://reviewboard.mozilla.org/r/49731/#review76060

Sounds good, I wasn't sure if I should have added him last time, looks like I shouldn't have.

> Do you think we can tweak the telemetry such that the failure is recorded?

I did this, but it seems like there might be a simpler way of doing so (Maybe I'm worrying too much though).
Comment on attachment 8747102 [details]
Bug 1267917 - Hook the sync bookmark validator into the new sync telemetry ping

https://reviewboard.mozilla.org/r/49731/#review78010

Awesome - as mentioned above, please summarize this patch and flag Benjamin for review.
Attachment #8747102 - Flags: review?(markh) → review+
Comment on attachment 8747102 [details]
Bug 1267917 - Hook the sync bookmark validator into the new sync telemetry ping

bsmedberg: This patch enables adds reporting of sync engine validation information to the sync ping.

Note that this was already part of the sync ping spec, however actually plugging the validator's information into the ping was left as a TODO.  Now that we actually do this, there were a few minor changes to the spec (since nothing would have previously existed in that property, there doesn't sound like any reason to bump the version).

It will increase the size of the pings when validation information is actually reported, however there are a few steps taken to prevent the size increase.


1. Validation is only enabled on Nightly.

2. It can only attempt to run at maximum once per day.

3. It will only actually run 10% of the times it attempts to run (randomly, e.g. it bails out and doesn't try for at least another day if `Math.random() < 0.1`).

4. It only runs for users who have relatively few records stored for the given engine -- this doesn't actually effect the size of the validation results, and should have no real bearing on the ping, but it further reduces the number of pings the data will show up in, and it ensures we don't degrade performance while e.g. trying to figure out if the user's 5000 bookmarks are the same on the client and server.


These are all controlled via prefs so should it turn out to be an issue, we could easily land something either reducing it's frequency of usage or turning it off entirely.

The ping size increase (when it *does* run) in the worst case should be under 1kb per engine which did validation.  This is much higher than should be expected in practice -- but if somehow every property was wrong with every record, and every engine reported validation errors in as much detail as the bookmark engine, then ~1kb would be added to the ping for those engines.

There's no risk of PII here, we only report the types of errors we find and their count, not any more information than that.

Let me know if you have further questions.
Attachment #8747102 - Flags: review?(benjamin)
Comment on attachment 8747102 [details]
Bug 1267917 - Hook the sync bookmark validator into the new sync telemetry ping

https://reviewboard.mozilla.org/r/49731/#review78540

data-review=me based on reading only sync-ping.rst
Attachment #8747102 - Flags: review?(benjamin) → review+
I'm not sure why this still lists 2 open issues on reviewboard, since there don't seem to be any. (I've heard open issues can prevent autoland). Let me know if this is a problem.
Keywords: checkin-needed
Looks like autoland couldn't rebase this patch for landing.
Flags: needinfo?(tchiovoloni)
Keywords: checkin-needed
Sorry about that, should be good now.
Flags: needinfo?(tchiovoloni)
Keywords: checkin-needed
Pushed by ryanvm@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/12d60c0e0b0e
Hook the sync bookmark validator into the new sync telemetry ping r=bsmedberg,markh
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/12d60c0e0b0e
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 52
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: