Open Bug 1260900 Opened 3 years ago Updated 4 months ago

[meta] Add a command to remotely flag records for (re)upload

Categories

(Firefox :: Sync, task, P3)

task

Tracking

()

People

(Reporter: rnewman, Unassigned)

References

(Depends on 1 open bug, Blocks 2 open bugs)

Details

(Keywords: meta, Whiteboard: [data-integrity][Q2 OKR])

We have a resetEngine command that remotely requests a full download then a full upload.

It would be convenient to have a more tightly scoped command: track GUIDs in a collection for upload in the next sync.

Sometimes a client will detect that some records are missing or stale. This command would allow it to ask another client to upload the current state of those records, correcting the inconsistency without first downloading and applying the corruption.

See also Bug 1260896.

This command might (thanks to Sync object format limitations) look like:

   {"command": "trackGUIDs", "args": ["bookmarks", "abcdefghi,lmnopqrst"]}


Unknown commands are dropped on the floor by current desktop clients.
Summary: Add a command to flag records for upload → Add a command to remotely flag records for (re)upload
Note that the queued request should be executed _after_ downloading records in the given collection. That is: this is equivalent to adding a few GUIDs to the tracker right before engine._uploadOutgoing().
Mark noted that this doesn't directly address missed deletions. Here's a concrete proposal that does.


Assume four clients:

  CI: iOS v5.0
  CA: Android v44
  CD1: Desktop v49
  CD2: Desktop v49

Assume three missing or inconsistent records:

  R1: moved and missed (on CD1)
  R2: added and missed (on CD1)
  RD: deleted and missed (so a non-deleted record is on the server, but its parent doesn't refer to it) (on CD2)


CI detects these three missing or inconsistent items.

It selects the most recently synced other client. It should do so by knowledge of versions, to skip clients that don't understand these commands, but let's assume it doesn't. It picks CA, which is old.


It sends a command to CA:

  [trackGUIDs, "R1,R2,RD", "CI", 1234567890]

CA doesn't understand this command, and drops the command on the floor. It uploads its own client record again with the command removed.

CI sees this, uploads the command once more, and waits again until CA reuploads a record without the command. This cycle confirms that CA has completed a sync and had its opportunity to process the command; that it didn't send a reply means it dropped it on the floor.


CI moves on to CD1, uploading the same command.

CD1 understands the command, and it tracks R1, R2, and both their correct parents. (A second cycle through this whole process might be needed to detect and resolve the former parent.)

[[Mark points out that here a smart client can choose to check for more missing records, particularly in the subtree under R2, and proactively upload those.]]


CD1 downloads changed bookmarks as usual, reconciling and tracking/untracking as usual, and uploads the remainder of R1, R2, P(R1), P(R2). It sends a response command to CI:

  [trackedGUIDs, has: "R1,R2,A,B", didnothave: "RD", "CD1", 1234569990]

CI now downloads those four records and the command. We're inching closer to consistency!

The only outstanding question is: what happened to RD?

We have one more client to ask. We send a command to CD2:

  [trackGUIDs, "RD", "CI", 1234570000]

CD2 doesn't have RD -- it was deleted! So it sends back:

  [trackedGUIDs, has: [], didnothave: "RD", "CD2", 1234571000]

CI has now been told by every client that RD doesn't exist; that's as good as we're gonna get. It marks RD as deleted and in need of upload. CI has a consistent tree, and can now sync and upload the final result.



As clients get better at tracking changes, we'll get into this situation less often.

As clients get better at automatically and proactively recovering, we similarly won't need to have iOS ask for missing records so often.

In the mean time, I think this is relatively straightforward. Receiving clients will have an easy time implementing trackGUIDs. iOS will need to persist and work through a kind of "recovery plan", driven by arriving records and changes in the clients engine.


In the simplest case -- one desktop, one iOS device -- this process should complete conclusively quite quickly, even if we need to recursively find half the tree!

N.B., pay attention to the maximum size of a client payload.

Mark, what do you think?
Flags: needinfo?(markh)
(In reply to Richard Newman [:rnewman] from comment #2)
> Mark noted that this doesn't directly address missed deletions.

If we ignore deletions, it seems we can probably simplify this quite a bit - particularly the "response" part. Knowing which clients failed to help doesn't seem worth the complexity it adds - let's have faith one of them will *eventually* help :)

It could look something like:

* CI advertises it (ie, the server collection's view) is missing GUIDs by sending a new command to every client.
* other clients should respond by (a) seeing if they have it, (b) re-checking the server is still missing it and (c) uploading it.
* eventually CI ends up consistent. Or it doesn't. Then it does something else to get closer :) 

this is a pure "items may be missing" feature - clients are still free to check if other related IDs are missing and do the same dance with them, but the contract is pure "create" - first client wins.

Ideally we could track how successful this is somehow, but ISTM it ends up with the same result...

...except for deletes...

How important *are* deletes? I've still not studied your validation code, but do you think missing-deleted is a significant part of the validation failures? Maybe they could have a specific command with the complex semantics at a later date?
Flags: needinfo?(markh)
(In reply to Mark Hammond [:markh] from comment #3)

tl;dr - how bad would it be to ignore deletes and go back to your original idea (asking *all* clients)?
(In reply to Mark Hammond [:markh] from comment #3)

> If we ignore deletions, it seems we can probably simplify this quite a bit -
> particularly the "response" part. Knowing which clients failed to help
> doesn't seem worth the complexity it adds - let's have faith one of them
> will *eventually* help :)

There is additional value in the request/response dance: there'll be a state machine on the 'driving' client, and getting some kind of feedback from compliant clients reduces the guesswork in moving through those states.


> How important *are* deletes? I've still not studied your validation code,
> but do you think missing-deleted is a significant part of the validation
> failures? Maybe they could have a specific command with the complex
> semantics at a later date?

It's hard to draw firm conclusions about missed deletes, because they look very much like other kinds of inconsistency: they're records with a parentid that doesn't agree with the rest of the tree, and that looks just like half-a-move or half-an-add.

Worst case we assume that moves and adds usually make it to the server with their new parent, in which case we can attribute perhaps 50% of our observed corruption to missed deletions.


My suspicion -- based on intuitions about user behavior -- is that missed moves (re-filing through the bookmark star) are more common than missed deletes, but I think deletes are more common than partial writes. Together they make up 50-75% of the observed corruption.


How desktop observer notifications handle multiple deletions and large-scale bulk deletions might flip this around entirely -- for all we know we almost never handle deletions of whole folders correctly, which makes this much more important!

(Possibilities: we get a notification for each deleted item, start syncing on the first, and miss the last few. Or we get a "deleted lots!" notice, and never upload deletions at all. Both are bad. Or maybe we do it right…)


The main reason I propose the slightly more complicated version with delete-acknowledgement is that our test cycle on this is ~3 months long, so I'd rather do an extra week of work now than find out in September that we've only fixed half the problem!

That said: if you can do informal testing locally that suggests that desktop only misses deletions very infrequently, then you might sway me that we can just do a simple broadcast.
Blocks: 1260896
(In reply to Richard Newman [:rnewman] from comment #5)
> There is additional value in the request/response dance: there'll be a state
> machine on the 'driving' client, and getting some kind of feedback from
> compliant clients reduces the guesswork in moving through those states.

Agreed - and the additional burden of this mechanism isn't really going to fall on desktop (ie, writing the response is trivial) - so I've no objection if you think it will help in the short term.

> CD1 downloads changed bookmarks as usual, reconciling and tracking/untracking as usual, and 
> uploads the remainder of R1, R2, P(R1), P(R2). It sends a response command to CI:
>
>  [trackedGUIDs, has: "R1,R2,A,B", didnothave: "RD", "CD1", 1234569990]

I'm wonder if this could also have an extra field: alsoFound: [U1, U2] which is optionally a list of other IDs it found while examining R1 and R2 (eg, children it also noticed were missing from the server). That might not offer much as we'd expect CI to automagically notice these new IDs - but it still might give us some interesting insights for telemetry (ie, CI isn't going to know *why* those new IDs arrived and that they were a direct result of this strategy.

Similarly, I'd like to ensure that you have some recording of how successful this strategy is (but that's really a comment for bug  	1260896). But desktop should probably create some of its own telemetry around this (ie, track if it could help or not)

Finally, I can't really see a reason you shouldn't broadcast this to all clients rather than waiting for a round-robin - I assume the client would want to be smart and only upload the purportedly missing records if they are still actually missing. That might change the response strategy though - eg:

>  [trackedGUIDs, has: "R1,R2,A,B", skipped: "R1,A,B", didnothave: "RD", "CD1", 1234569990]

would imply that the client does have the listed records, but skipped uploading R1,A,B due to the client finding them already on the server (ie, another client beat us to them). However, it did upload R2.
Depends on: 1269902
Blocks: 1269902
No longer depends on: 1269902
No longer blocks: 1257961
Priority: -- → P2
Whiteboard: [sync-data-integrity]
Whiteboard: [sync-data-integrity] → [data-integrity]
Priority: P2 → P3
We'll need another bug for tracking telemetry from both the "choir" and the "conductor" for this, Mark. Could you file that, along with the desktop client bugs?
Flags: needinfo?(markh)
Keywords: meta
Priority: P3 → --
Summary: Add a command to remotely flag records for (re)upload → [meta] Add a command to remotely flag records for (re)upload
No longer blocks: 1260896
Depends on: 1260896
I shifted Bug 1260896 to be the iOS implementation part of this work, because we're not planning to use the resetClient approach.
Blocks: 1261186
Depends on: 1317223
Blocks: 1257970
(In reply to Richard Newman [:rnewman] from comment #7)
> We'll need another bug for tracking telemetry from both the "choir" and the
> "conductor" for this, Mark. Could you file that, along with the desktop
> client bugs?

Bug 1317223 is the desktop implementation, which includes reporting telemetry "events" for every stage of the repair process. Once this is ready to land, I'll be updating the documentation at https://docs.google.com/document/d/1hghyRyg737rUVOjDFqqNYBxmujpl121c1OMRB9tjlOI/edit# with specific information so the mobile implementations can fill in the other bits of the jigsaw.
Flags: needinfo?(markh)
Priority: -- → P3
Whiteboard: [data-integrity] → [data-integrity][Q2 OKR]
Component: Firefox Sync: Cross-client → Sync
Product: Cloud Services → Firefox
Type: defect → task
You need to log in before you can comment on or make changes to this bug.