Closed Bug 1260896 Opened 8 years ago Closed 2 years ago

Send commands from iOS to a desktop client to remotely fix bookmark corruption

Categories: Firefox for iOS :: Sync
Type: defect
Platform: All iOS
Priority: Not set
Severity: normal
Status: RESOLVED WONTFIX
Tracking Status: fxios + ---
Reporter: rnewman (Unassigned, Mentored)
References: Depends on 1 open bug, Blocks 1 open bug
Whiteboard: [data-integrity] [triage] [Q2 OKR]

Basic principle: when corruption is found on the server, ask another client to fix it. The other client has a consistent tree and all the missing facts, right?

We can't simply request a full remote re-sync: that would cause the remote client to download the corrupt records and apply the corruption locally before uploading whatever was missing.

We don't want to delete everything from the server and request a full remote re-sync: it's expensive (every client must upload and download everything, including us), and the server might be the user's only copy of their data.

A decent middle ground is to delete just the records that might conflict, then send resetEngine.

This should be equivalent to a wipe-and-re-sync, but without the full reupload/redownload.
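
For concreteness, the command itself is tiny. A minimal Swift sketch of the payload we'd queue into the target client's record in the clients collection (the resetEngine name and args follow the existing clients-engine wire format; the surrounding types are hypothetical):

// Hypothetical wrapper for a clients-engine command. The wire format
// {"command": "resetEngine", "args": ["bookmarks"]} is the existing
// clients-collection convention; everything else here is illustrative.
struct ClientCommand: Codable {
    let command: String
    let args: [String]
}

// Sketch: after deleting the conflicting records from the server, queue a
// targeted reset against the chosen remote client. The caller merges this
// into the target's record in the clients collection and re-uploads it;
// the target executes the command on its next sync.
func bookmarkResetCommand() -> ClientCommand {
    ClientCommand(command: "resetEngine", args: ["bookmarks"])
}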

We can offer this, with a client picker, to allow the user to un-break their own account.


It should be possible to manually test this hypothesis on a synthetically corrupted account without writing any code.
See Also: → 1260900
Proposed implementation outline:

The bookmarks synchronizer owns a RecoveryPlan.

There is a small amount of long-lasting account- or server-scoped state (e.g., requests made, commands sent, records known to be missing?), which exists to avoid cycling through the same work endlessly without making progress.

Naturally there is also some plan-scoped state: attempts made, missing records, etc.

When the buffer is found to be inconsistent, a plan is built. This encapsulates the set of records that would need to be updated or added in order to move us closer to consistency, and also the steps we'll take to get there.


Starting inputs:

* State extracted from the buffer and mirror.
* The set of clients, versions, etc.

Inputs over time:

* Received commands.
* New/changed buffer records. (The mirror can't change under us.)
* Invalidating events (changed syncID, node reassignment, changed FxA, etc.)

I use the term "plan" to indicate that this (a) starts out with a prediction, (b) is effected over time, (c) eventually succeeds, fails, or is superseded by a new plan.

When records in the buffer change, the plan is updated to match. When all of the plan's needs are met, it's done and can be discarded; the validation/plan/execute cycle begins again.

The plan doesn't execute immediately; we might be seeing partial writes, so there's no need to flip out right away. We wait a reasonable amount of time before acting.

The internal life of a plan might vary; we might decide that the server is in such a bad state that we need a reset, or we might decide that we can use commands or sentinel records to trigger uploads from other devices. The latter will need a small state machine to track advancement.

We expect most situations to be simple, and thus most plans to be simple: an input set of inconsistent or missing records, a single other client, and resolution after a finite number of steps (usually just one or two).
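
Rendered as types, the outline above might look something like this Swift sketch (all names hypothetical; this is shape, not implementation):

import Foundation

// Durable, account-scoped state: what we've already asked for, kept so we
// don't cycle through the same work endlessly without making progress.
struct RecoveryHistory: Codable {
    var clientsMessaged: Set<String>    // client GUIDs we've sent commands to
    var recordsRequested: Set<String>   // record GUIDs we've asked to have reuploaded
    var attempts: Int
}

// Plan-scoped state. A plan starts with a prediction and advances over time.
enum PlanState {
    case waiting(until: Date)        // might be a partial write; don't act yet
    case requestingUploads(from: String, missing: Set<String>)
    case awaitingRecords(missing: Set<String>)
    case succeeded
    case failed
    case superseded                  // syncID change, node reassignment,
                                     // FxA change, etc.
}

// Built when the buffer is found to be inconsistent.
struct RecoveryPlan {
    var state: PlanState
    var missingRecords: Set<String>  // extracted from buffer + mirror

    // New/changed buffer records arrive over time (the mirror can't change
    // under us). When all needs are met the plan is done, and the
    // validate/plan/execute cycle begins again.
    mutating func recordsArrived(_ guids: Set<String>) {
        missingRecords.subtract(guids)
        if missingRecords.isEmpty { state = .succeeded }
    }
}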
Depends on: 1261169
Depends on: 1260900
Blocks: 1261186
Flags: needinfo?(sleroux)
Triage: is this still needed considering all the work that's being done?
Priority: -- → P3
Mark: what's the status of recovery from corrupted server states?
Flags: needinfo?(markh)
(In reply to Richard Newman [:rnewman] from comment #3)
> Mark: what's the status of recovery from corrupted server states?

That's a good question. I wish I had a good answer. My approach so far has been to first identify and fix scenarios where we know desktop causes bad state (e.g., bug 1276823, bug 1293163), then to work out how best to repair such issues already caused - my thinking is that it's better to repair after we are confident we aren't going to continue to cause the corruption.

The other well-known problem is simply IDs missing from the server caused by limitations in the tracker. This *is* something we could repair sooner rather than later. However, the question here is "how?".

One scenario is as outlined above - wait for a client to notice it missing and to try and message other clients to upload the missing records. The main problem I see with this is that it relies on those other clients being smart - we'd need to make all 3 clients both detect missing records, and also trigger the other clients to upload the missing items. IOW, it seems unlikely we would complete that this year.

Another option is that desktop (and maybe other clients later) would periodically ask the server for all IDs it knows about, compare that with all IDs locally, and upload the missing items. This doesn't rely on other clients even noticing missing items. It is a relatively expensive operation for outliers with a huge number of bookmarks, but much less expensive than the client trying to fix other structural issues, as it only requires a list of IDs rather than the full records.
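
As a rough Swift sketch of that ID-diff (names hypothetical; a Sync collection fetch without full records returns just the IDs, which is what keeps this cheap):

// Sketch of the periodic ID-diff (hypothetical names). fetchServerIDs
// would wrap a plain collection GET that returns only record IDs.
func locallyKnownButMissingFromServer(
    localIDs: Set<String>,
    fetchServerIDs: () throws -> Set<String>
) rethrows -> Set<String> {
    let serverIDs = try fetchServerIDs()
    return localIDs.subtracting(serverIDs)  // flag these for (re)upload
}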

So TBH I'm not sure how to proceed here, but agree we need a plan. What do you think?
Flags: needinfo?(markh) → needinfo?(rnewman)
(In reply to Mark Hammond [:markh] from comment #4)

> my thinking is that it's better to repair after
> we are confident we aren't going to continue to cause the corruption.

I agree, but it'd be nice if we could pipeline the programming work some.


> The other well-known problem is simply IDs missing from the server caused by
> limitations in the tracker. This *is* something we could repair sooner
> rather than later. However, the question here is "how?".

Missing records would be a good start, but IIRC we rarely see missing records alone -- they come with other missed changes.


> One scenario is as outlined above - wait for a client to notice it missing
> and to try and message other clients to upload the missing records. The main
> problem I see with this is that it relies on those other clients being smart
> - we'd need to make all 3 clients both detect missing records, and also
> trigger the other clients to upload the missing items. IOW, it seems
> unlikely we would complete that this year.

A much more limited version is: iOS is the implementation that cares most, and it's also the one that already knows what's missing. I don't think we need to build detection (or, indeed, handling of absences!) into every platform in order to get value out of this.

It's worth noting that it's way more efficient to detect a problem when you're downloading and processing changes. Android is already working on two-stage buffer/application like iOS has, at which point it'll be easy to spot issues there, too.


> So TBH I'm not sure how to proceed here, but agree we need a plan. What do
> you think?

Regardless of the approach we take, there are two parts:

* Recognizing when there's a problem. IMO a signal from another client is a decent first step for that.
* Fixing the problem.

This bug and Bug 1260900 are two approaches to the latter: this bug is to have the detecting client delete and send resetEngine; the other is to be a little more specific and hands-off. I'd prefer the latter.

It seems like there's a lot to figure out about how desktop might trigger detection and how it would detect inconsistency, so I'm leery about waiting for that.

If indeed we reach a point where desktop doesn't corrupt server data, then perhaps we don't need to do the work for desktop to detect corruption at all -- corruption should be rare, and so it's okay to have the mobile device ask for a fixup and otherwise fall back to SUMO.
Flags: needinfo?(rnewman)
(In reply to Richard Newman [:rnewman] from comment #5)
> Missing records would be a good start, but IIRC we rarely see missing
> records alone -- they come with other missed changes.

Can you clarify here? Bug 1260900 mentions "flagging records for reupload", which I assumed was mainly for missing records. I guess you also mean "we have an item, but some of the content is missing/invalid, so please re-upload it". I'm having trouble seeing why it is likely another client would upload anything different than what is already on the server, and what would happen when the record re-uploaded still isn't suitable.

> A much more limited version is: iOS is the implementation that cares most,
> and it's also the one that already knows what's missing. I don't think we
> need to build detection (or, indeed, handling of absences!) into every
> platform in order to get value out of this.

Fair enough - I'm more than happy to work with iOS on the desktop side of this as a priority.

> It's worth noting that it's way more efficient to detect a problem when
> you're downloading and processing changes. Android is already working on
> two-stage buffer/application like iOS has, at which point it'll be easy to
> spot issues there, too.

Yes, that's true - I was suggesting that for items that are simply missing, it's probably not *that* expensive for desktop to upload them without signals from other clients, and would help the user should they go and connect another desktop device to the account.

I agree that without far more work, desktop can't efficiently handle problems other than items missing.

> Regardless of the approach we take, there are two parts:
> 
> * Recognizing when there's a problem. IMO a signal from another client is a
> decent first step for that.

As above, I'm fine with that, although I do think the problem of missing IDs could be worthwhile without that signal as it will fix the problem for users without an iOS device attached.

> * Fixing the problem.
> 
> This bug and Bug 1260900 are two approaches to the latter: this bug is to
> have the detecting client delete and send resetEngine; the other is to be a
> little more specific and hands-off. I'd prefer the latter.
> 
> It seems like there's a lot to figure out about how desktop might trigger
> detection and how it would detect inconsistency, so I'm leery about waiting
> for that.

Fair enough - it need not be either/or, so I fully support us reacting to such a signal. Given we are worried about more than just missing records, are we sure a resetEngine() isn't what we want here, forcing the client to consider every bookmark?

> If indeed we reach a point where desktop doesn't corrupt server data, then
> perhaps we don't need to do the work for desktop to detect corruption at all
> -- corruption should be rare, and so it's okay to have the mobile device ask
> for a fixup and otherwise fall back to SUMO.

SGTM.
(In reply to Mark Hammond [:markh] from comment #6)

> Can you clarify here? Bug 1260900 mentions "flagging records for reupload",
> which I assumed was mainly for missing records. I guess you also mean "we
> have an item, but some of the content is missing/invalid, so please
> re-upload it".

Yes. More specifically: we have several records that, when considered together, are inconsistent; please reupload them all.

> I'm having trouble seeing why it is likely another client
> would upload anything different than what is already on the server, and what
> would happen when the record re-uploaded still isn't suitable.

Bug 1235269 (duped to tracker work in Bug 1258127) is exactly this problem. If you just make the client reupload the records, the corruption goes away.

Say desktop at some point failed to track a record change (e.g., during deduping, or moving a bookmark soon after creation).

The new, moved bookmark will be on the server, and iOS can see that its parent record doesn't know about it. iOS has enough context that it knows which parts of the tree are inconsistent, so it can request reupload of dodgy records. We'll ask for the parent record, and desktop will regenerate that parent's children list when uploading, perhaps fixing the inconsistency.

It won't always succeed, but we can definitely do more than just find missing records.

One way of looking at this is that there's rarely such a thing as a missing record. The missing record is usually referenced somewhere else in the tree…
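
To make that concrete, the structural check iOS can already run over its buffer is roughly the following Swift sketch (types hypothetical; the parentid and children fields mirror the real bookmark record format):

// Hypothetical buffered record; parentid/children mirror the server format.
struct BufferedBookmark {
    let guid: String
    let parentid: String
    let children: [String]?   // non-nil for folders
}

// Sketch: find parent/child disagreements. Each hit is a record whose
// reupload we'd request, e.g. a folder that doesn't list a child that
// claims to live inside it.
func inconsistentRecords(in buffer: [String: BufferedBookmark]) -> Set<String> {
    var dodgy = Set<String>()
    for (guid, record) in buffer {
        if let parent = buffer[record.parentid],
           let siblings = parent.children,
           !siblings.contains(guid) {
            // The child points at a parent that doesn't know about it:
            // ask the other client to reupload both.
            dodgy.insert(parent.guid)
            dodgy.insert(guid)
        }
    }
    return dodgy
}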


> Yes, that's true - I was suggesting that for items that are simply missing,
> it's probably not *that* expensive for desktop to upload them without
> signals from other clients, and would help the user should they go and
> connect another desktop device to the account.

True. I'll be as supportive as I can if you want to work on that :D


> Fair enough - it need not be either/or, so I fully support us reacting to
> such a signal. Given we are worried about more than just missing records,
> are we sure a resetEngine() isn't what we want here, forcing the client to
> consider every bookmark?

My worry about resetEngine alone is that it downloads every record from the server, with the server record winning, which will destroy the local state we're trying to keep on the supposedly pristine client.

My worry about having another client delete records from the server then call resetEngine is that, well, it's deleting data from the server and hoping someone can replace it. And we might not excise all the bad bits, so we might still get corruption. And if we send resetEngine to the wrong device, the bad bits won't be reuploaded! It's risky.

Requesting upload (Bug 1260900) is much safer, in that it's much more specific: either nothing happens or known-good records make it up to the server.
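
The difference in blast radius shows up in the payloads themselves. Only resetEngine is an existing clients-engine command; the second command name below is hypothetical, just the shape Bug 1260900 describes:

struct ClientCommand { let command: String; let args: [String] }

// resetEngine: the target reconsiders every bookmark, with the server copy
// winning. There's no way to scope the damage.
let reset = ClientCommand(command: "resetEngine", args: ["bookmarks"])

// Targeted upload request (hypothetical name, per Bug 1260900): either
// nothing happens, or exactly these records come back known-good.
let request = ClientCommand(command: "requestUpload",
                            args: ["bookmarks", "guidX", "guidY", "guidZ"])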
As originally filed, this bug was "delete records and reset desktop". Bug 1260900 was "tell desktop that records X, Y, and Z are missing".

The latter involves more implementation work on desktop, but (a) we also plan to have desktop play the 'conductor', not just the 'orchestra', and (b) it seems more likely to produce good results.

So let's morph this bug to be the iOS implementation part of Bug 1260900, and a nice repository of discussion.
Blocks: 1260900
No longer blocks: 1261186
Mentor: rnewman
No longer depends on: 1260900
Priority: P3 → --
Summary: Use a combination of deletion of records and a targeted resetEngine command to remotely recover from bookmark corruption → Send commands from iOS to a desktop client to remotely fix bookmark corruption
Whiteboard: [data-integrity] [triage]
Using this bug as the root bug for all bookmark-repair-related bugs for iOS.
Depends on: 1350959
Depends on: 1350961
Depends on: 1350963
Whiteboard: [data-integrity] [triage] → [data-integrity] [triage] [Q2 OKR]
Depends on: 1365329
Blocks: 1365329
No longer depends on: 1365329

We've no plans for this at the current time.

Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → WONTFIX