Closed Bug 824888 Opened 12 years ago Closed 9 years ago

Allow client to recover gracefully if server reverts to a previous version?

Categories: Firefox :: Sync, enhancement
Status: RESOLVED WONTFIX
Reporter: rfkelly, Unassigned
(This is currently a vague thought; it may not be possible or desirable, but it's worth thinking about.)

In the increasingly-less-hypothetical future world of "Durable Sync", we may want to replicate each client's sync data across multiple servers and even multiple data centres.

Suppose we have some sort of asynchronous replication process running, wherein the slave server may be ever-so-slightly behind the master server.  A failure occurs, and we switch over from master to slave.

From the client's POV, it would appear as though the server has "forgotten" the last little bit of its history.  The server-side version number would jump backwards and the stored data would be left in a consistent, but out-of-date state.

Can the client deal gracefully with such a scenario?

IIUC, this scenario would lead to data loss with the current client architecture.  Any data that was not replicated out to the slave prior to failover would be lost: the client believes it has been safely uploaded, and since the version number does not increase, there is no trigger to upload it again.

This is not currently a problem, because there's no replication or failover on the server.  Either your data is in the state you left it, or it's completely wiped out and you have to start over from scratch.

Can we do better?  Should we bother?
I wonder if the client could even detect such a situation if it occurred partway through a sync.

The current plan is to pre-condition all writes with the X-If-Unmodified-Since-Version header, which allows the client to detect whether the server-side version number has *increased* since it was last read.  It would be of no help in the above scenario where the version number has *decreased*.

An "X-If-Match-Version" header might be useful here, letting us treat the version number more like an etag.
Worth discussing!

Obvious: clients have to track server version with slightly more nuance. That's easy enough.

Assumption: version numbers are linear and ever-increasing, even after rollback.

Trivial solution: whenever rollback is detected, do a full sync.
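
A minimal sketch of that detection rule, assuming the client persists the highest server version it has acknowledged (names are illustrative, not actual Sync client state):

    def choose_sync_mode(server_version: int, last_acked_version: int) -> str:
        # Rollback detection: the server reporting a *lower* version than we
        # have already acknowledged means history has been lost; fall back to
        # a full bidirectional sync and reconcile.
        if server_version < last_acked_version:
            return "full-sync"
        if server_version > last_acked_version:
            return "incremental-sync"
        return "no-op"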

Improvement: the last client to have synced should be the one to do the full sync, because the full sync is going to really screw up version numbers. Hope that the right client gets there first!

Real recovery is harder, and involves uploading the current canonical version of the missing records.


Two classes of errors: failure to detect, failure to recover.

Failure to detect:

* Client A syncs. Version = 3.
* Client B syncs twice. Version = 4, 5.
* Server fails over. Version = 4.
* Client A syncs.

Obviously, this can yield one kind of failure to recover: the changes between versions 4 and 5 are lost, and the new version 6 will rely on Client B successfully reconciling. If A and B both changed the same records, data loss will occur: the wrong one wins, because the server timestamp is always later in this case.

Client B must sync first. If Client B is now gone, we cannot recover.
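
Walking the numbers above through a simple version comparison shows why Client A is blind to the rollback (the values come from the scenario; the variables are just for illustration):

    client_a_last_acked = 3            # Client A last synced at server version 3
    server_version_after_failover = 4  # Client B reached 5, but 5 never replicated

    # The only signal Client A has is whether the version went backwards.
    rollback_visible_to_a = server_version_after_failover < client_a_last_acked
    print(rollback_visible_to_a)  # False: 4 > 3, so the sync looks perfectly normal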


This suggests that in-protocol recovery is impossible: once a rollback is detected, the clients must elect which of them performs recovery.


Recovery also relies on clients maintaining state about correspondences between local changes and server changes; essentially, as well as "what's changed locally since I last synced?", the local storage layer has to be able to answer the question "what's changed locally since server time V?". I planned for something like this in the bookmarks data structure Bug 814801, but it has not been generally solved.
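
A rough sketch of the extra bookkeeping that question implies, assuming the local store records the server version at which each record was last known to be uploaded (none of these structures exist in the current clients):

    from dataclasses import dataclass

    @dataclass
    class RecordState:
        local_modified: float     # local change timestamp / counter
        uploaded_at_version: int  # server version at which this record was last uploaded

    def changed_since_server_version(store: dict, rollback_version: int) -> list:
        # After a rollback to rollback_version, these are the records whose
        # uploads the server has "forgotten" and which must be re-uploaded.
        return [guid for guid, state in store.items()
                if state.uploaded_at_version > rollback_version]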

If they can't do that, they have to do a full bidirectional sync and reconcile.


And failure during recovery is pernicious: if we get an interrupted connection or any other kind of failure, we will have advanced the version beyond the highest previous version (thus 'recovering' from the perspective of other clients) without actually having restored the data.

A recovery mode where clients can set each record's version might help with this, but still, nasty.


If we can avoid this situation, I would very much recommend it, even if handling it is technically possible and we add logic to each client to detect and deal with it. Better to have the service unavailable until full recovery can be achieved…


Error situations I don't even want to think about:

* Rollback that crosses a meta/global change.
* Rollback across salts or keys changing.
* Rollback that reintroduces a command (however we implement those), or undoes the effect of a wipe.

To handle those, every client essentially needs to track a complete transaction log, reproducing it from the server, and ensure that the server matches or supersedes the log chunks it already knows about. I allude to this in Bug 814801. If moving in this direction is what we need to do for durability, then perhaps we should start thinking about syncing changelogs….
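
A hypothetical shape for that kind of transaction log, just to make the idea concrete (the structures and the supersedes check are invented for illustration):

    from dataclasses import dataclass

    @dataclass
    class LogEntry:
        server_version: int
        op: str          # "put", "delete", "wipe", "command", "key-change", ...
        collection: str
        record_id: str

    def server_supersedes_local_log(local_log: list, server_max_version: int) -> bool:
        # If we hold log entries newer than anything the server now knows
        # about, the server has lost history this client already observed.
        return all(entry.server_version <= server_max_version
                   for entry in local_log)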
(In reply to Richard Newman [:rnewman] from comment #2)
> Assumption: version numbers are linear and ever-increasing, even after
> rollback.

I'm not sure what you mean here - do you mean that all version numbers after a rollback must be larger than the version numbers before the rollback?  In other words, is this an acceptable timeline of version numbers?

    1 --> 2 --> 3 --*failure*--> 2 --> 3 --> 4

If not, what benefit do we gain from this restriction?

> Trivial solution: whenever rollback is detected, do a full sync.
> 
> Improvement: the last client to sync should do so, because we're going to
> really screw up version numbers when we do. Hope that the right client does
> the full sync!

Alternative: whenever rollback is detected, instruct *all* clients to do a full sync.  This is clearly much less efficient, but seems safer.

Rollback events would hopefully be very rare, so "everyone does a full sync and reconcile" seems like a feasible recovery strategy to me.  As long as we can do it without blowing away the user's data on the server.

> Failure to detect:
> 
> * Client A syncs. Version = 3.
> * Client B syncs twice. Version = 4, 5.
> * Server fails over. Version = 4.
> * Client A syncs.

Indeed.  Humph.  This example suggests that in-protocol detection of a rollback is impossible in the general case.  From the perspective of Client A, it looks like nothing has gone wrong.

If we can't even detect it properly, any recovery strategy seems moot.

I suppose the server could somehow flag that a rollback has occurred, instructing each client that they need to do some recovery.  But I don't see a nice way to fit that into the current protocol at the moment.

> If we can avoid this situation, I would very much recommend it

"Don't do that" may end up being the best option, or even our only realistic option.  Coping with the occasional rollback would enable us to do some interesting things with durability on the server side, but I'm not sure it's worth too much extra complication in the client.

In the scheme of things, saying "Sorry, a meteor hit our data centre, your data has been lost, please re-sync" might be a better overall solution.

> Error situations I don't even want to think about:
> 
> * Rollback that crosses a meta/global change.
> * Rollback across salts or keys changing.
> * Rollback that reintroduces a command (however we implement those), or
> undoes the effect of a wipe.

/me backs slowly away...

> To handle those, every client essentially needs to track a complete
> transaction log, reproducing it from the server, and ensure that the server
> matches or supersedes the log chunks it already knows about. I allude to
> this in Bug 814801. If moving in this direction is what we need to do for
> durability, then perhaps we should start thinking about syncing changelogs….

There is a quite different, vastly superior sync protocol in there somewhere, just waiting to get out.  Explicitly changeset-driven, with an append-only git-like storage model...but I digress.

Summing up, within the scope of the current Sync 2.0 protocol and the current client architecture:

    * there is no way for clients to reliably detect a version rollback in the general case
    * doing any recovery more clever than "everyone resync" would be very complicated

Will have to think on this some more...
(In reply to Ryan Kelly [:rfkelly] from comment #3)

> > Assumption: version numbers are linear and ever-increasing, even after
> > rollback.
> 
> I'm not sure what you mean here - do you mean that all version numbers after
> a rollback must be larger than the version numbers before the rollback?  In
> other words, is this an acceptable timeline of version numbers?
> 
>     1 --> 2 --> 3 --*failure*--> 2 --> 3 --> 4
> 
> If not, what benefit do we gain from this restriction?

I'm assuming that -- because version ~= timestamp, at least for now -- every version will be later than all previously saved versions.

That is,

  1 2 3 *failure* … 5 6

That has one set of properties. (Probably the most desirable, because the timestamps are still present in uploaded records, and will be used for reconciling; the version is just used to tell clients when to fetch.)
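
A sketch of that assumption as a server-side allocation rule, assuming timestamp-derived versions (purely illustrative; the real server does nothing like this explicitly):

    import time

    def next_version(highest_version_ever_issued: int) -> int:
        # Versions are timestamp-derived, so new versions naturally land after
        # everything issued before the failover; the max() guards against ever
        # reusing a number.
        candidate = int(time.time() * 1000)
        return max(candidate, highest_version_ever_issued + 1)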

The alternative -- which we can name "version reuse" -- has a different set of properties: for one, clients seeing that the server has a version number that *has already been seen* but *indicates different data* will be confused, and data omission could occur, or the client could decide not to sync.


> Alternative: whenever rollback is detected, instruct *all* clients to do a
> full sync.  This is clearly much less efficient, but seems safer.

It'll catch the case of the non-latest client syncing, sure, but otherwise it doesn't buy us anything except a lot of client expense (particularly on mobile). It could also lead to the *wrong* client syncing first and producing a different end state.


> Indeed.  Humph.  This example suggests that in-protocol detection of a
> rollback is impossible in the general case.  From the perspective of Client
> A, it looks like nothing has gone wrong.

Correct. The protocol is designed to use meta/global syncIDs to detect significant server-side changes, with the response having been tailored for the "wiped server" case.

We could have the server rewrite meta/global on rollback, of course… so long as the rollback doesn't cross a meta/global boundary.
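
If the storage layer could do it, the rewrite might look something like this (purely hypothetical; as the next reply notes, a generic syncstorage service doesn't even know that meta/global is special):

    import json, uuid

    def rewrite_meta_global_on_rollback(meta_global_payload: str) -> str:
        # Assign a fresh syncID so every client treats the store as new and
        # performs a full sync. The syncID format here is just a placeholder.
        doc = json.loads(meta_global_payload)
        doc["syncID"] = uuid.uuid4().hex[:12]
        return json.dumps(doc)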


> If we can't even detect it properly, any recovery strategy seems moot.
> 
> I suppose the server could somehow flag that a rollback has occurred,
> instructing each client that they need to do some recovery.  But I don't see
> a nice way to fit that into the current protocol at the moment.

Well, we could do it by explicitly denoting vector changes: that is, our timeline becomes a tree, not a line, and clients are explicitly merging tree branches. This would need version numbers to become IDs, or carry a branch identifier.

At that point we've reinvented Git.
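
Purely to illustrate what "version numbers become IDs, or carry a branch identifier" might mean (nothing like this exists in Sync 2.0):

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class VersionId:
        branch: str   # one server-side line of history; changes on failover
        counter: int  # monotonically increasing within a branch

    # A client that last saw VersionId("primary", 5) and now sees
    # VersionId("replica-1", 4) knows it is looking at a different branch of
    # history and must merge rather than fast-forward.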


> There is a quite different, vastly superior sync protocol in there
> somewhere, just waiting to get out.  Explicitly changeset-driven, with an
> append-only git-like storage model...but I digress.

Yeah, I've been wanting to build that for quite a while. Maybe for Sync 3.0 ;)
(Also, at some point we need to create a "Sync: cross-client" bug component. "General" sucks, and "Backend" only applies to desktop.)
(In reply to Richard Newman [:rnewman] from comment #4)
> (In reply to Ryan Kelly [:rfkelly] from comment #3)
> 
> The alternative -- which we can name "version reuse" -- has a different set
> of properties: for one, clients seeing that the server has a version number
> that *has already been seen* but *indicates different data* will be
> confused, and data omission could occur, or the client could decide not to
> sync.

Thanks, that makes a lot of sense. 

> > Alternative: whenever rollback is detected, instruct *all* clients to do a
> > full sync.  This is clearly much less efficient, but seems safer.
> 
> It'll catch the case of the non-latest-client syncing, sure, but otherwise
> doesn't buy us anything other than a lot of client expense (mobile clients
> particularly). And it also could lead to the *wrong* client syncing first,
> and getting different end state.

It also gets around the problem of figuring out *who* was last to sync, which could be tricky to do after a rollback.
 
> We could have the server rewrite meta/global on rollback, of course… so long
> as the rollback doesn't cross a meta/global boundary.

This is complicated by the notion that syncstorage is a generic service, which doesn't know about any of the client-specific details of FF Sync, doesn't know that meta/global is special, etc.

Strawman:

* If we failover to an old snapshot of your data, all read requests start returning an error response { "status": "repair-needed" }.  This lets clients reliably detect that something has gone wrong.

* Any write request will clear the server's "repair-needed" flag, confirming that the failover has been detected by a client and returning the store to working order so it can be repaired.

* After that, clients are on their own.  They'd probably need to write some sort of flag BSO to say "repair in progress", interrogate the store, put it in a consistent state, convince other clients to re-sync, then move on with their lives.
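
To make the strawman concrete, a rough client-side sketch; the "repair-needed" response, the flag-clearing write, and the "repair in progress" BSO are all proposals from this bug rather than anything the protocol actually supports, and the URLs and names are placeholders:

    import requests

    def fetch_collection(base_url, collection, auth):
        resp = requests.get(f"{base_url}/storage/{collection}", auth=auth)
        if resp.status_code >= 400:
            body = resp.json()  # per the strawman, errors carry a JSON body
            if body.get("status") == "repair-needed":
                # The server has failed over to an old snapshot.
                return begin_repair(base_url, auth)
        resp.raise_for_status()
        return resp.json()

    def begin_repair(base_url, auth):
        # Any write clears the server-side "repair-needed" flag; use it to drop
        # a "repair in progress" marker BSO so other clients know to re-sync.
        requests.put(
            f"{base_url}/storage/meta/repair",
            json={"payload": '{"repair": "in-progress"}'},
            auth=auth,
        )
        # ...interrogate the store, restore a consistent state, convince other
        # clients to re-sync, then clear the marker. Not sketched here.
        raise NotImplementedError("repair/reconcile logic is out of scope")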

That's pretty awful, but may be approximately as good as we can do without serious protocol surgery.  I'm not convinced it would be a net win.

> > There is a quite different, vastly superior sync protocol in there
> > somewhere, just waiting to get out.  Explicitly changeset-driven, with an
> > append-only git-like storage model...but I digress.
> 
> Yeah, I've been wanting to build that for quite a while. Maybe for Sync 3.0
> ;)

I live in hope ;-)
Doing two things with this ancient bug:

1) Adding :dcoates for some background context on previous discussions for "durable sync"
2) Closing it out, because whatever we do, it'll probably have to be better than the above
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → WONTFIX
Component: Firefox Sync: Backend → Sync
Product: Cloud Services → Firefox