Closed Bug 664865 Opened 10 years ago Closed 10 years ago

More considered handling of node reassignment mid-sync

Categories

(Firefox :: Sync, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla7

People

(Reporter: rnewman, Assigned: rnewman)

References

Details

(Whiteboard: [sync-hmac][fixed in services][qa-])

Attachments

(1 file, 2 obsolete files)

A 401 mid-sync can result in an inconsistent server state.
Attached patch Proposed patch. v1 (obsolete) — Splinter Review
HOORAY!

This patch:

* Adds all the tests from Bug 664808, including the check for consistency after a 401, which we now pass.
* Removes entirely the 401 observer notification from Resource. It's unnecessary, because Service checks the status code of an HTTP failure, and thus calls updateCluster anyway!
* Removes tests for the 401 redirect nonsense.
* Adds a status code, NODE_REASSIGNED, which we examine to decide whether to abort at the start of each engine sync.
* Rejig some of the logic between setCluster and updateCluster. We now don't have two different entry points, which is nice.

All 101 tests pass on s-c, and I've observed that sync aborts after a 401 with cluster change. Haven't run TPS yet (still b0rked on my Mac).

I think this might be the sculpture inside the marble.
Attachment #540152 - Flags: review?(philipp)
What this *doesn't* yet do is encourage another sync. Looking at that now.
Looks great, a 1000 times better than what we have. I think we could simplify some things. See comments below:

(In reply to comment #1)
> * Removes entirely the 401 observer notification from Resource. It's
> unnecessary, because Service checks the status code of an HTTP failure, and
> thus calls updateCluster anyway!

Yes yes yes. Yes!

> * Adds a status code, NODE_REASSIGNED, which we examine to decide whether to
> abort at the start of each engine sync.

First let me see if understand the new behaviour as implemented by your patch:

If we end up with a 401, we call Service._updateCluster() which calls _setCluster(). This sets Status.sync = NODE_REASSIGNED if the cluster has actually changed. We then return false from _syncEngine, which causes Service.sync() to skip all other engines and then bail.

Comments/questions regarding this behaviour:

* I find it a bit icky that Status.sync will be NODE_REASSIGNED during the first sync when we there wasn't a cluster URL set yet. Of course, a successful sync is going to set Stauts.sync = SYNC_SUCCEEDED, but still startling at first to anybody who's going to read the code.

* If we end up with Status.sync = NODE_REASSIGNED, it's not immediately clear to me how Status.sync gets reset to a value that will allow us to sync the next time. Maybe this still needs to be addressed (see next point)?

* As you point out in comment 2, we should really trigger a sync right away. Marina's work of factoring all that triggering out to a separate SyncScheduler object will help us here, as it nicely encapsulates all the cases when we actually kick off a sync in an automated way. Unfortunately, you guys are competing for time here :/


Now for my alternate suggestion:

A 401 can occur for one of two things: Either the user changed his/her password and this particular machine wasn't updated yet. Or they got moved to another node = their cluster URL changed. So, if we detect 401 during a sync, why don't we:

1. reset the clusterURL pref
2. log them out
3. trigger another sync immediately (as in, Utils.nextTick)

Since they're logged out, we're first going to try and log in. And since and the clusterURL pref isn't set anymore, Service.verifyLogin() is going to fetch it. If it's the same as the old one, no harm done. It means the user probably changed their password on another machine. Then Service.verifyLogin() will proceed to verify the username's password + Sync Key. If this fails, they'll get the error bar we know and love. But if it succeeds (because it was just a node reassignments), we sync normally and everybody's happy.

I see a few advantages of this:

* It avoids calling _updateCluster from inside the sync call stack (less entry points to the cluster setting/update code)
* It avoids introducing another Status.sync flag (doing so will always raise the question of where it gets reset)
* It appears to be bit more reliable since we can't really get into a deadlock state (an unset clusterURL is something Sync knows how to deal with very easily).

What do you think?

> * Rejig some of the logic between setCluster and updateCluster. We now don't
> have two different entry points, which is nice.

That *is* nice indeed!
(In reply to comment #3)
> 1. reset the clusterURL pref
> 2. log them out
> 3. trigger another sync immediately (as in, Utils.nextTick)

Step 2 should really be:

2. log them out and abort the sync, hard. (= throw, so that Service.sync actually doesn't even try to do anything else.)
Attached patch Proposed patch. v2 (obsolete) — Splinter Review
This might be enough, but I want to look at it again in the clear light of day.

(Philipp, if you wish to review regardless, you may do so!)

Note the test shenanigans to avoid lock contention.
Attachment #540152 - Attachment is obsolete: true
Attachment #540152 - Flags: review?(philipp)
Comment on attachment 540687 [details] [diff] [review]
Proposed patch. v2

>+function no_add_test() {}
>+no_add_test(function hmac_error_during_404() {

no_add_test? I think you forgot to get rid of this :)

>+  try {
>+    _("Syncing.");
>+    Service.sync();
>+    _("---------------------------");
>+    _("Partially resetting client, as if after a restart.");
>+    CollectionKeys.clear();
>+    engine.lastSync = 0;

Why would a restart reset engine.lastSync? Perhaps you're just resetting it so that the engine ends up redownloading everything?

>+  let should401 = false;
>+  function upd401(coll, handler) {
>+    return function (request, response) {
>+      if (should401 && (request.method != "DELETE")) {
>+        on401();
>+        should401 = false;
>+        let body = "\"reassigned!\"";
>+        response.setStatusLine(request.htttpVersion, 401, "Node reassignment.");

typo: htttpVersion

>+  // Use observers to perform actions when our sync finishes.
>+  // This allows us to observe the automatic next-tick sync that occurs after
>+  // an abort.
>+  function onSyncError() {}

Put a do_throw("Should not get a sync error!") in here so ensure the user will never see an error bar.

>+      // Kick off another sync. Can't just call it, because we're inside the
>+      // lock...
>+      Utils.nextTick(function() {
>+        _("Now a fresh sync will get no HMAC errors.");
>+        _("Partially resetting client, as if after a restart.");
>+        _("---------------------------");
>+        CollectionKeys.clear();
>+        engine.lastSync = 0;
>+        hmacErrorCount = 0;

Again, I don't see why a restart would reset engine.lastSync.

>+
>+        // Event-driven try/finally!
>+        onSyncFinished = onSyncError = function() {

Why would we expect the last sync to throw an error?

>+          Svc.Prefs.resetBranch("");
>+          Records.clearCache();
>+          server.stop(run_next_test);
>+        };

Please unregister the observers, too:

    Svc.Obs.remove("weave:service:sync:finish", obs);
    Svc.Obs.remove("weave:service:sync:error", obs);

>+        Service.sync();
>+        _("---------------------------");
>+
>+        // Two rotary items, one client record... no errors.
>+        do_check_eq(hmacErrorCount, 0)

Can you please move this check into the onSyncFinished function above (the one that ends up calling run_next_test). That way we don't depend on Service.sync() blocking execution :)

(P.S.: I found all that _("---------------------------") noise a bit distracting... but maybe that's just me :))


Treating this as a WIP, based on comment 5, so I'm not setting a review flag yet. I couldn't find any major flaws, though, so I think a patch with the above points addressed should get an r+ no problem :)
This should do the trick... thanks for catching my late-night brainos, Philipp!
Attachment #540687 - Attachment is obsolete: true
Attachment #541735 - Flags: review?(philipp)
Comment on attachment 541735 [details] [diff] [review]
Proposed patch. v3

r=me
Attachment #541735 - Flags: review?(philipp) → review+
I went ahead and landed this on s-c so that Marina can rebase Instant Sync on top of this (in case there are conflicts). We're zeroing in on the final set of patches for that :)

http://hg.mozilla.org/services/services-central/rev/196903b5b002
Whiteboard: [sync-hmac] → [sync-hmac][fixed in services]
(In reply to comment #9)
> I went ahead and landed this on s-c so that Marina can rebase Instant Sync
> on top of this (in case there are conflicts).

Thanks, Philipp!
Duplicate of this bug: 664808
Duplicate of this bug: 663953
I can't think of an (easy) way for QA to verify this bug, so I'm marking it [qa-]. Please feel free to correct me, Richard, if you disagree and have STRs for them :)
Whiteboard: [sync-hmac][fixed in services] → [sync-hmac][fixed in services][qa-]
(In reply to comment #13)
> I can't think of an (easy) way for QA to verify this bug, so I'm marking it
> [qa-]. Please feel free to correct me, Richard, if you disagree and have
> STRs for them :)

Indeed, there's no easy way -- it would require either careful timing with ops to modify staging node assignment, or futzing in a privileged JS console to change things mid-flight.

At that point the (quite thorough) xpcshell-tests become clearly the better simulation of the issue.
Hmm, apparently we forgot to mark this as fixed.
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Component: Firefox Sync: Backend → Sync
Product: Cloud Services → Firefox
You need to log in before you can comment on or make changes to this bug.