Need to handle node reassignment on all storage requests

RESOLVED FIXED in mozilla10

Status

RESOLVED FIXED
Severity: major
Opened: 7 years ago
Last modified: a day ago

People

(Reporter: tchung, Assigned: rnewman)

Tracking

Keywords: regression
Version: unspecified
Target Milestone: mozilla10
Firefox Tracking Flags: firefox7 affected, firefox8 affected, firefox9 affected
Whiteboard: [fixed in services]

Attachments

(7 attachments, 8 obsolete attachments)

835 bytes, text/rtf
43.66 KB, text/rtf
25.44 KB, patch (philikon: review+)
6.75 KB, patch
9.91 KB, patch
429.85 KB, application/zip
12.54 KB, patch (philikon: review+)
(Reporter)

Description

7 years ago
Created attachment 565445 [details]
Sync Error 401 log

I was just surfing around on the Mac Nightly, and a 401 suddenly popped up with a yellow bar, saying incorrect name or password. Clicking the preferences button popped up the username reset window. I cancelled the dialog, and the yellow bar went away.

Attaching success and error logs.

Repro:
1) install Mac nightly : Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:10.0a1) Gecko/20111005 Firefox/10.0a1
2) log into sync on phx
3) sit for a while; a 401 error pops up with a yellow bar.

Expected:
- no 401 error, no yellow bar notification
(Reporter)

Comment 1

7 years ago
Created attachment 565447 [details]
Sync Success log
So yeah, in bug 664865 we removed a very generic handling of 401 responses that would have us detect a node reassignment on all requests, in order to make it more reliable mid-sync. The problem is that any request leading up to a sync (info/collections, etc.) isn't covered by this and will, in the worst case, produce the yellow error bar telling users their username and password are wrong.
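The gap can be sketched in plain JavaScript (a toy Node stand-in, not the real Sync service; `makeClient`, `handleResponse`, and the URL are invented for illustration): a 401 that may signal a node reassignment should clear the cluster URL and reschedule, rather than flag a login failure.

```javascript
// Toy sketch of the desired generic 401 handling. All names here are
// hypothetical; the real logic lives in ErrorHandler/SyncScheduler.
function makeClient() {
  const state = {
    clusterURL: "https://node1.example/", // invented URL
    loginError: false,
    rescheduled: false,
  };
  function handleResponse(status, opts) {
    if (status !== 401) {
      return state;
    }
    if (opts.reassignmentPossible) {
      // A 401 on a storage request may just mean we were moved to a new
      // node: clear the cluster URL and retry, instead of blaming the
      // user's password.
      state.clusterURL = "";
      state.rescheduled = true;
    } else {
      // Genuine credential failure: surface it to the user.
      state.loginError = true;
    }
    return state;
  }
  return { state, handleResponse };
}
```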
Blocks: 664865
Summary: 401 migration error thrown → Need to handle node reassignment on all storage requests
While I haven't done a complete analysis of which requests are missing treatment for node reassignment, the evidence from Tony's logs is pretty clear. Bug 664865 landed in Firefox 7, so I'm marking everything since then as affected.
status-firefox10: --- → affected
status-firefox7: --- → affected
status-firefox8: --- → affected
status-firefox9: --- → affected
(Assignee)

Comment 4

7 years ago
*dons dunce hat*

I wonder if I'll be awake enough to work on this tonight...
OS: Mac OS X → All
Hardware: x86 → All
Keywords: regression
(Assignee)

Comment 5

7 years ago
Looking into this now.
Assignee: nobody → rnewman
Status: NEW → ASSIGNED
(Assignee)

Comment 6

7 years ago
Wrote some tests. Looks like the client requests a node reassignment after a 401 on, e.g., the initial info/collections request:

Sync.Service	DEBUG	User-Agent: XPCShell/1 FxSync/1.13.0.20100621.
Sync.Service	INFO	Starting sync at 2011-10-09 19:42:48
Sync.Service	DEBUG	In sync: should login.
Sync.Service	INFO	Logging in user johndoe
Sync.Test.Server	DEBUG	SyncServer: Handling request: GET /1.1/johndoe/info/collections
Sync.Resource	DEBUG	mesg: GET fail 401 http://localhost:8080/1.1/johndoe/info/collections
Sync.Resource	DEBUG	GET fail 401 http://localhost:8080/1.1/johndoe/info/collections
Sync.Service	WARN	401: login failed.
Sync.Service	DEBUG	Finding cluster for user johndoe
Client made a request for a node reassignment.  << TEST CODE
Sync.Resource	DEBUG	mesg: GET success 200 http://localhost:8080/user/1.0/johndoe/node/weave
Sync.Resource	DEBUG	GET success 200 http://localhost:8080/user/1.0/johndoe/node/weave
Sync.Service	DEBUG	Cluster value = null
Sync.SyncScheduler	DEBUG	Clearing sync triggers.
Sync.Service	DEBUG	Exception: Login failed: error.login.reason.account No traceback available
Sync.Service	DEBUG	Not syncing: login returned false.

Also tested 401s for storage requests and only engine storage requests. All of them result in the clusterURL pref being reset or node/weave being fetched.

Will continue investigating.
(Assignee)

Comment 7

7 years ago
Created attachment 566624 [details] [diff] [review]
Proposed patch. v1

Haven't run TPS; waiting for build.
Attachment #566624 - Flags: review?(philipp)
(Assignee)

Comment 8

7 years ago
Comment on attachment 566624 [details] [diff] [review]
Proposed patch. v1

Adding a couple things.
Attachment #566624 - Flags: review?(philipp)
(Assignee)

Comment 9

7 years ago
Created attachment 566863 [details] [diff] [review]
Proposed patch. v2

Today's a travel day, so on and off I'll be doing the following:

* Running TPS
* Manually checking behavior in error scenarios (such as bad stored password)
* Porting tests to use old test infrastructure for landing on Aurora and Beta

Until then, here's a patch that passes 108 tests on s-c.
Attachment #566624 - Attachment is obsolete: true
Attachment #566863 - Flags: review?(philipp)
(Assignee)

Comment 10

7 years ago
TPS output:

Test Summary

TEST-PASS | test_sync.js | 
TEST-PASS | test_prefs.js | 
TEST-PASS | test_tabs.js | 
TEST-PASS | test_passwords.js | 
TEST-PASS | test_history.js | 
TEST-PASS | test_formdata.js | 
TEST-PASS | test_bug530717.js | 
TEST-PASS | test_bug531489.js | 
TEST-PASS | test_bug538298.js | 
TEST-PASS | test_bug556509.js | 
TEST-PASS | test_bug562515.js | 
TEST-PASS | test_bug563989.js | 
TEST-PASS | test_bug535326.js | 
TEST-PASS | test_bug501528.js | 
TEST-PASS | test_bug575423.js | 
TEST-PASS | test_bug546807.js | 
TEST-PASS | test_history_collision.js | 
TEST-PASS | test_privbrw_formdata.js | 
TEST-PASS | test_privbrw_passwords.js | 
TEST-PASS | test_privbrw_tabs.js | 
TEST-PASS | test_bookmarks_in_same_named_folder.js | 
TEST-PASS | test_client_wipe.js | 
TEST-PASS | test_special_tabs.js | 
TEST-UNEXPECTED-FAIL | test_mozmill_sanity.js | test did not complete
Comment on attachment 566863 [details] [diff] [review]
Proposed patch. v2

>       case 401:
>         Weave.Service.logout();
>         Status.login = LOGIN_FAILED_LOGIN_REJECTED;

Can you explain (best in a comment) why this is still needed here? I personally have doubts.

>+        this._log.info("Got 401 response; resetting clusterURL.");
>+        Svc.Prefs.reset("clusterURL");
>+        this._log.info("Attempting to schedule another sync.");
>+        SyncScheduler.scheduleNextSync(0);
>         break;

>+/**
>+ * Make a test request to `url`, then watch the result of two syncs
>+ * to ensure that a node request was made.
>+ */
>+function syncAndExpectNodeReassignment(server, firstNotification, secondNotification, url) {
>+  function onwards() {
>+    // Set backoff to avoid repeated syncs.
>+    Status.backoffInterval = 9999999999;
>+    let nodeFetched = false;
>+    function onFirstSync() {
>+      _("First sync completed.");
>+      Svc.Obs.remove(firstNotification, onFirstSync);
>+      Svc.Obs.add(secondNotification, onSecondSync);
>+      do_check_eq(Service.clusterURL, "");
>+
>+      // Track whether we fetched node/weave. We want to wait for the second
>+      // sync to finish so that we're cleaned up for the next test, so don't
>+      // run_next_test in the node handler.
>+      nodeFetched = false;
>+
>+      // Verify that the client requests a node reassignment.
>+      // Install a node handler to watch for these requests.
>+      installNodeHandler(server, function () { 
>+        nodeFetched = true;
>+      });
>+    }
>+    function onSecondSync() {
>+      _("Second sync completed.");
>+      Svc.Obs.remove(secondNotification, onSecondSync);
>+      do_check_true(nodeFetched);
>+      server.stop(run_next_test);

You forgot to Service.startOver(); here. Also, to decouple that from ErrorHandler's and SyncScheduler's own sync:finish or sync:error notifications, best to wrap it in a nextTick.

I would also like to add negative tests for weave:ui:sync:error and weave:ui:login:error, along the lines of:

  function onUIError() {
    do_throw("Poor user shouldn't see a thing!");
  }
  Svc.Obs.add("weave:ui:login:error", onUIError);
  Svc.Obs.add("weave:ui:sync:error", onUIError);

Could even make this a global observer if we're confident that every test in here will not show UI messages. Otherwise you want to unregister them of course :)
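As a runnable illustration of the negative-test idea (a toy observer registry standing in for Svc.Obs; all names here are invented):

```javascript
// Minimal observer registry sketch: subscribing a throwing handler to the
// weave:ui:*:error topics makes any UI error notification fail the test.
const listeners = {};
function addObserver(topic, fn) {
  (listeners[topic] = listeners[topic] || []).push(fn);
}
function notifyObservers(topic) {
  for (const fn of listeners[topic] || []) {
    fn();
  }
}
function onUIError() {
  throw new Error("Poor user shouldn't see a thing!");
}
addObserver("weave:ui:login:error", onUIError);
addObserver("weave:ui:sync:error", onUIError);
```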

r- mostly for that last point: I want to ensure the user doesn't get to see *anything*
Attachment #566863 - Flags: review?(philipp) → review-
(Assignee)

Comment 12

7 years ago
(In reply to Philipp von Weitershausen [:philikon] from comment #11)

> You forgot to Service.startOver(); here.

One of the tests relies on the user still being logged in from the previous test. Starting over would cause it to pass without the fix.

I should probably make that clear, perhaps by adjusting that test to independently set up its environment.

> I would also like to add negative tests for weave:ui:sync:error and
> weave:ui:login:error, along the lines of:

Good idea.
(In reply to Richard Newman [:rnewman] from comment #12)
> I should probably make that clear, perhaps by adjusting that test to
> independently set up its environment.

I would prefer the latter.
(Assignee)

Comment 14

7 years ago
This is turning out to be slightly more involved than I first expected. Will resume work in the morning.
(Assignee)

Comment 15

7 years ago
Created attachment 567256 [details] [diff] [review]
Part 0: cleanup and logging. v1

Might as well get this into the review queue while I finish up the rest.

Along with

  Bug 694728 - Sync: Log Status transitions

this makes the behavior of Sync in these error situations much more obvious from the log (albeit preferably in Trace mode).
Attachment #567256 - Flags: review?(philipp)
(Assignee)

Comment 16

7 years ago
Created attachment 567263 [details] [diff] [review]
Part 1: handle node reassignment on all storage requests. v1

This passes tests, and applies quite cleanly on Aurora. (Actual Aurora patch to come.)

Tests use the old HTTP test harness, which is why they're ugly and awful. Part 2 improves matters.

TPS passes. I have done some hand verification of edge cases, but would like to do more, considering the possible impact.

My current patch queue:

 0 A bug-694728-status: Bug 694728 - Sync: Log Status transitions.
 1 A bug-694657: Bug 694657 - Sync: test JS server 404 for GET on missing WBO.
 2 A bug-692714-part-0-cleanup: Bug 692714 - Part 0: cleanup and logging.
 3 A bug-692714-part-1: Bug 692714 - Part 1: handle node reassignment on all storage requests.
 4 A bug-692714-part-2: Bug 692714 - Part 2: port tests to SyncServer for non-Aurora landing.

Not requesting review yet (because I want to do more testing), but feel free to review it if you want, Philipp.
Attachment #566863 - Attachment is obsolete: true
(Assignee)

Comment 17

7 years ago
Created attachment 567267 [details] [diff] [review]
Part 2: port tests to SyncServer for non-Aurora landing. v1

This switches to using SyncServer, which makes the tests much cleaner (and with fewer spurious errors in the log). Depends on Bug 694744, which fixes a stoopid typo in the server code.
(Assignee)

Comment 18

7 years ago
Created attachment 567674 [details] [diff] [review]
Part 0: cleanup and logging. v2
Attachment #567256 - Attachment is obsolete: true
Attachment #567256 - Flags: review?(philipp)
Attachment #567674 - Flags: review?(philipp)
(Assignee)

Comment 19

7 years ago
Created attachment 567675 [details] [diff] [review]
Part 1: handle node reassignment on all storage requests. v2

This version adds a backoff (with pref-based persistence of cross-sync 401 knowledge) and a test for same.

Manual testing with a faked 401 on /storage/ suggests that each instant sync will cause another (silent) failure, with timed retries occurring at the backoff duration.

"Sync Now" causes an Unknown Error, but improving that is outside the scope of this bug.

Tests are in the old style (see next patch), so this should apply quite cleanly on Aurora.
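A rough sketch of the pref-persisted backoff decision described here, assuming an in-memory `prefs` map in place of Svc.Prefs and an invented value for MINIMUM_BACKOFF_INTERVAL:

```javascript
// Assumed value; the real constant is defined elsewhere in the Sync code.
const MINIMUM_BACKOFF_INTERVAL = 5 * 60 * 1000;

// On a mid-sync 401: if the last sync also ended in a 401 (recorded in the
// lastSyncReassigned pref), delay the retry; otherwise retry immediately
// and make a note for next time.
function delayForMidSync401(prefs) {
  let delay = 0;
  if (prefs.get("lastSyncReassigned")) {
    // Second 401 in a row: give the server room to recover.
    delay = MINIMUM_BACKOFF_INTERVAL;
  } else {
    prefs.set("lastSyncReassigned", true);
  }
  return delay;
}
```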
Attachment #567263 - Attachment is obsolete: true
Attachment #567675 - Flags: review?(philipp)
(Assignee)

Comment 20

7 years ago
Created attachment 567676 [details] [diff] [review]
Part 2: port tests to SyncServer for non-Aurora landing. v2

Because SyncServer.
Attachment #567267 - Attachment is obsolete: true
Attachment #567676 - Flags: review?(philipp)
(Assignee)

Updated

7 years ago
Attachment #567675 - Attachment description: Part 1: handle node reassignment on all storage requests. v1 → Part 1: handle node reassignment on all storage requests. v2
(Assignee)

Comment 21

7 years ago
These patches touch the Service/ErrorHandler/Resource/SyncScheduler/backoff/Status mess, and attempt to ameliorate some edge cases that have not previously been considered. As such, I wouldn't be in the least bit surprised if I've broken or missed something, or at least generated a few follow-up bugs.

Assuming this passes review, I'd like to land it on s-c and get a week of banging on it to ensure coverage of all the important cases. We can request Aurora approval if it seems solid.
(Assignee)

Comment 22

7 years ago
I've done a bunch of manual verification of this. Incorrect password still shows up in the correct spot, and overall it seems to be acting fine.
Comment on attachment 567674 [details] [diff] [review]
Part 0: cleanup and logging. v2

>@@ -134,16 +134,17 @@ let SyncScheduler = {
> 
>         // reset backoff info, if the server tells us to continue backing off,
>         // we'll handle that later
>         Status.resetBackoff();
> 
>         this.globalScore = 0;
>         break;
>       case "weave:service:sync:finish":
>+        this._log.trace("SyncScheduler handling weave:service:sync:finish."); 

this._log will already print "Sync.Scheduler". Also, you have the topic variable. So why don't we put

  this._log.trace("Handling " + topic);

at the very top of observe()?

>@@ -548,32 +554,35 @@ let ErrorHandler = {
>           this.notifyOnNextTick("weave:ui:login:error");
>         } else {
>           this.notifyOnNextTick("weave:ui:clear-error");
>         }
> 
>         this.dontIgnoreErrors = false;
>         break;
>       case "weave:service:sync:error":
>+        this._log.debug("ErrorHandler handling weave:service:sync:error.");

Ditto for ErrorHandler, but make it trace too!

>       case "weave:service:sync:finish":
>+        this._log.debug("ErrorHandler handling weave:service:sync:finish.");
>+        this._log.debug("Status.service is " + Status.service);

Make this trace please!

>@@ -588,17 +597,17 @@ let ErrorHandler = {
>         break;
>     }
>   },
> 
>   notifyOnNextTick: function notifyOnNextTick(topic) {
>     Utils.nextTick(function() {
>       this._log.trace("Notifying " + topic +
>                       ". Status.login is " + Status.login +
>-                      ". Status.sync is " + Status.sync);
>+                      ". Status.sync is " + Status.sync + ".");

Unnecessary churn and minus points for inconsistency with your own log statement for Status.service above!

>diff --git a/services/sync/modules/service.js b/services/sync/modules/service.js
>--- a/services/sync/modules/service.js
>+++ b/services/sync/modules/service.js
>@@ -508,28 +508,31 @@ WeaveSvc.prototype = {
>   // gets cluster from central LDAP server and returns it, or null on error
>   _findCluster: function _findCluster() {
>     this._log.debug("Finding cluster for user " + this.username);
> 
>     let fail;
>     let res = new Resource(this.userAPI + this.username + "/node/weave");
>     try {
>       let node = res.get();
>+      this._log.trace("_findCluster got HTTP status " + node.status);

The logging from resource isn't enough?

>@@ -729,19 +732,19 @@ WeaveSvc.prototype = {
>       } catch (ex) {
>         this._log.debug("Fetching passphrase threw " + ex +
>                         "; assuming master password locked.");
>         Status.login = MASTER_PASSWORD_LOCKED;
>         return false;
>       }
> 
>       try {
>-        // Make sure we have a cluster to verify against
>-        // this is a little weird, if we don't get a node we pretend
>-        // to succeed, since that probably means we just don't have storage
>+        // Make sure we have a cluster to verify against.
>+        // This is a little weird, if we don't get a node we pretend
>+        // to succeed, since that probably means we just don't have storage.

If you're getting in there, please rewrap to 80 chars :)

r+ with those changes.
Attachment #567674 - Flags: review?(philipp) → review+
(Assignee)

Comment 24

7 years ago
(In reply to Philipp von Weitershausen [:philikon] from comment #23)

> >+        // to succeed, since that probably means we just don't have storage.
> 
> If you're getting in there, please rewrap to 80 chars :)

That's 76 chars. Otherwise, all done, thanks!
Comment on attachment 567675 [details] [diff] [review]
Part 1: handle node reassignment on all storage requests. v2

>diff --git a/services/sync/modules/policies.js b/services/sync/modules/policies.js
>--- a/services/sync/modules/policies.js
>+++ b/services/sync/modules/policies.js
>@@ -151,16 +151,23 @@ let SyncScheduler = {
> 
>         let sync_interval;
>         this._syncErrors = 0;
>         if (Status.sync == NO_SYNC_NODE_FOUND) {
>           this._log.trace("Scheduling a sync at interval NO_SYNC_NODE_FOUND.");
>           sync_interval = NO_SYNC_NODE_INTERVAL;
>         }
>         this.scheduleNextSync(sync_interval);
>+
>+        if (Status.sync == SYNC_SUCCEEDED) {
>+          // Great. Let's clear our mid-sync 401 note.
>+          this._log.trace("Clearing lastSyncReassigned.");
>+          Svc.Prefs.reset("lastSyncReassigned");
>+        }

Why are we doing this in SyncScheduler and not in ErrorHandler? This is where the pref gets set so it should also be where it gets reset. Encapsulation FTW! (r- for this)

>@@ -734,16 +741,25 @@ let ErrorHandler = {
> 
>     let lastSync = Svc.Prefs.get("lastSync");
>     if (lastSync && ((Date.now() - Date.parse(lastSync)) >
>         Svc.Prefs.get("errorhandler.networkFailureReportTimeout") * 1000)) {
>       Status.sync = PROLONGED_SYNC_FAILURE;
>       this._log.trace("shouldReportError: true (prolonged sync failure).");
>       return true;
>     }
>+ 
>+    // We got a 401 mid-sync. Wait for the next sync before actually handling
>+    // an error. This assumes that we'll get a 401 again on a login fetch in
>+    // order to report the error.
>+    if (Weave.Service.clusterURL == "") {

if (!Weave.Service.clusterURL) ?

>       case 401:
>         Weave.Service.logout();
>-        Status.login = LOGIN_FAILED_LOGIN_REJECTED;
>+        this._log.info("Got 401 response; resetting clusterURL.");
>+        Svc.Prefs.reset("clusterURL");
>+
>+        let delay = 0;
>+        if (Svc.Prefs.get("lastSyncReassigned")) {
>+          this._log.warn("Last sync also failed for 401. Delaying next sync.");
>+          delay = MINIMUM_BACKOFF_INTERVAL;

Add a comment here along the lines of

  // We're getting another 401 mid-sync when the last sync already did that. It can't
  // be a wrong password since login worked properly. Probably an intermittent problem
  // on the server, so let's give it some space to recover.

>diff --git a/services/sync/modules/service.js b/services/sync/modules/service.js
>--- a/services/sync/modules/service.js
>+++ b/services/sync/modules/service.js
>@@ -917,16 +917,17 @@ WeaveSvc.prototype = {
>       CollectionKeys.clear();
> 
>       /* Login and sync. This also generates new keys. */
>       this.sync();
>       return true;
>     }))(),
> 
>   startOver: function() {
>+    this._log.info("Invoking Service.startOver.");

trace!


I skimmed over the tests, they look good (and familiar from the WIP). Will take a closer look when you upload a new version  of this patch :)
Attachment #567675 - Flags: review?(philipp) → review-
Comment on attachment 567676 [details] [diff] [review]
Part 2: port tests to SyncServer for non-Aurora landing. v2

>@@ -66,40 +64,21 @@ function installNodeHandler(server, next
>   function handleNodeRequest(req, resp) {
>     _("Client made a request for a node reassignment.");
>     resp.setStatusLine(req.httpVersion, 200, "OK");
>     resp.setHeader("Content-Type", "text/plain");
>     resp.bodyOutputStream.write(newNodeBody, newNodeBody.length);
>     Utils.nextTick(next);
>   }
>   let nodePath = "/user/1.0/johndoe/node/weave";
>-  server.registerPathHandler(nodePath, handleNodeRequest);
>+  server.server.registerPathHandler(nodePath, handleNodeRequest);

server.server? Meh... :)
Attachment #567676 - Flags: review?(philipp) → review+
(Assignee)

Comment 27

7 years ago
Created attachment 567939 [details] [diff] [review]
Part 1: handle node reassignment on all storage requests. v3

Review comments addressed. Tests pass. Part 2 applies cleanly.
Attachment #567675 - Attachment is obsolete: true
Attachment #567939 - Flags: review?(philipp)
(Assignee)

Comment 28

7 years ago
Created attachment 568255 [details] [diff] [review]
Part 0: cleanup and logging. v3

Updated part 0.
Attachment #567674 - Attachment is obsolete: true
(Assignee)

Comment 29

7 years ago
Created attachment 568256 [details] [diff] [review]
Part 2: port tests to SyncServer for non-Aurora landing. v3

Updated Part 2.
Attachment #567676 - Attachment is obsolete: true
Comment on attachment 567939 [details] [diff] [review]
Part 1: handle node reassignment on all storage requests. v3

>@@ -151,16 +152,17 @@ let SyncScheduler = {
> 
>         let sync_interval;
>         this._syncErrors = 0;
>         if (Status.sync == NO_SYNC_NODE_FOUND) {
>           this._log.trace("Scheduling a sync at interval NO_SYNC_NODE_FOUND.");
>           sync_interval = NO_SYNC_NODE_INTERVAL;
>         }
>         this.scheduleNextSync(sync_interval);
>+
>         break;

Spurious line break, pls remove :)

>+  Svc.Obs.add("weave:service:sync:error", onSyncError);
>+  function onSyncError() {
>+    _("Got weave:service:sync:error in first sync.");
>+    // Wait for the automatic next sync.
>+    Svc.Obs.remove("weave:service:sync:error", onSyncError);
>+    Svc.Obs.add("weave:service:login:error", onLoginError);
>+  }
>+
>+  function onLoginError() {
>+    _("Got weave:service:login:error in second sync.");
>+    Svc.Obs.remove("weave:service:login:error", onLoginError);
>+
>+    do_check_eq(Status.login, LOGIN_FAILED_LOGIN_REJECTED);
>+    do_check_false(Service.isLoggedIn);
>+
>+    // Clean up.
>+    Utils.nextTick(function () {
>+      Service.startOver();
>+      server.stop(run_next_test);
>+    });
>+  }
>+

I'm almost inclined to prefer that onLoginError were nested inside onSyncError to indicate succession. Or do what you did in test_node_reassignment.js and call them onFirstLoginError and onSecondSyncError or something? Don't feel strongly about it...

>+    function onSecondSync() {
>+      _("Second sync completed.");
>+      Svc.Obs.remove(secondNotification, onSecondSync);
>+      SyncScheduler.clearSyncTriggers();
>+
>+      // Make absolutely sure that any event listeners are done with their work
>+      // before we proceed.
>+      nthTick(4, function () {

So how are you coming up with the number 4? What's really involved behind the scenes? If it's a timer, you want to wait 150ms and then do one or two next ticks. See the waitForZeroTimer() helper in test_syncscheduler.js. Feel free to move it to head_helpers.js and reuse it :)
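A hedged approximation of the timer-settling idea in plain Node (the real waitForZeroTimer lives in the Sync test suite; this version only mirrors the behavior described above: wait ~150 ms, then a couple of next ticks):

```javascript
// Wait long enough for any short timers set by the code under test to
// fire, then yield twice more so chained callbacks also complete before
// invoking the continuation.
function waitForZeroTimer(callback) {
  setTimeout(function () {
    process.nextTick(function () {
      process.nextTick(callback);
    });
  }, 150);
}
```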

Looks good otherwise! r=me
Attachment #567939 - Flags: review?(philipp) → review+
(Assignee)

Comment 32

7 years ago
Will request approval for Aurora after this has gone through QA.

Background for QA:

This set of patches addresses handling of 401 responses during a sync: in particular, 401 responses that indicate a node reassignment, and especially those received mid-sync (as would happen if a user is reassigned whilst they are syncing).

The desired behavior in this case is as follows:

* Abort the current sync
* Do not show an error bar (unless this was a Sync Now, or the usual prolonged error handling kicks in)
* Log out and clear our cluster URL
* Schedule another sync
* Allow the login logic on the next sync to pick up a node reassignment or detect a bad password.

A subsequent mid-sync failure, as would happen in the case of a pernicious server issue, will result in a minimum backoff.

So, please verify:

* Syncing normally continues to work.
* An ordinary node reassignment outside of a sync results in the usual client behavior: silently picking up the new server and doing a first sync.
* A node reassignment within a sync (easily simulated by having Zeus return a 401 for /storage/) results in a silent immediate abort and retry, followed on repeated errors by repeated retries every MINIMUM_BACKOFF_INTERVAL.
* "Instant Sync" syncs during this backoff time will proceed as normal (that is, triggering a sync, failing, and not reporting an error).
* A bad password is reported as expected, with no subsequent attempt scheduled until the password is changed.
Created attachment 568490 [details]
zip of log files during loop creating endless logs
(Assignee)

Updated

7 years ago
Attachment #568490 - Attachment mime type: text/plain → application/zip
(Assignee)

Comment 34

7 years ago
13:17:01 <@rnewman> so checkSyncStatus is scheduling another sync, then my 401 handling code gets to run
13:17:39 <@rnewman> you can see in the log: we schedule one sync, then schedule an immediate sync, then we finish up and schedule a sync for 90 seconds
13:17:56 <@rnewman> well, the last schedule is an interval change
13:17:58 <@rnewman> but still
13:18:04 <@rnewman> there's way too much logic firing here
13:19:02 <@rnewman> next line in the next log: 1319139800308   Sync.SyncScheduler      TRACE   There's already a sync scheduled in 88692 ms.
13:19:14 <@rnewman> so we have kicked off a sync and still have one scheduled
13:19:17 < tracy> 2 syncs per second were firing, 'til I had atoll shot off the 401
13:19:23 <@atoll> yeah
13:19:56 <@rnewman> yes, on each failure it's scheduling another sync, exactly as it was when I tested
13:20:06 <@rnewman> but something is also causing an immediate sync to be scheduled
13:20:22 <@rnewman> presumably because of some small discrepancy introduced in event handling order
13:20:35 <@rnewman> so despite the backoff I added, the loop still occurs
13:21:43 < tracy> immediateInterval was last before I shut browser off.  this test was triggered by that timer being past, thus syncing on startup.
13:21:54 < tracy> if that helps at all.
13:22:46 < tracy> all of the every-half-second syncs are setting to activeInterval.
(Assignee)

Comment 35

7 years ago
13:30:44 <@rnewman> OK, I found the bug
13:30:51 <@atoll> with 10 minutes to spare!
13:30:57 <@rnewman> at the end of the *first* 401:
13:31:01 <@rnewman> 1319139799468   Sync.ErrorHandler       TRACE   Clearing lastSyncReassigned.
13:31:12 <@rnewman> so we're making a note that next time we need to back off...
13:31:19 <@rnewman> and then we clear the note before the next sync!
13:31:46 <@rnewman> this is occurring because we got sync:finish, but Status.sync == SYNC_SUCCEEDED
13:31:49 <@rnewman> which is wrong
13:32:46 <@rnewman> I don't see Status.sync *ever* being adjusted in these logs
13:32:50 <@rnewman> Status.service is
13:33:50 <@rnewman> so I believe that if we can get the status to be correct by
                    the time finish is called, or we alter the check, things
                    will start working as they did when I tested this myself
(Assignee)

Comment 37

7 years ago
Created attachment 568603 [details] [diff] [review]
Possible incremental remediation. v1

I spent a little quality time tonight with some bourbon and Vim.

I don't know at what point it regressed, but even in my tests it's plain that lastSyncReassigned was being reset at the end of the bad sync, because we were only checking one Status flag. Status needs to die.

When I added a login observer to check lastSyncReassigned was still set (given that we expect it to be unset at the end of a completely successful sync), my test failed. Hurrah!

This is a delta patch for purposes of examination. It includes the following:

* We no longer simply check Status.sync; we verify that both Status.sync and Status.service are set to their respective success values before we clear the "back off, I got a 401" pref.

* I added a test for 401 looping on a particular engine, not just all of storage, in case that case ever differs.

* I added the login observer to the node reassignment tests to verify that the pref sticks around.

Status needs to die.

I welcome feedback from interested parties, when they're done turning petroleum and rubber into noise!
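The first bullet can be sketched as follows (the constant values and the `prefs` map are invented stand-ins; the real code checks Status.sync and Status.service):

```javascript
// Assumed constant values for illustration only.
const SYNC_SUCCEEDED = "success.sync";
const STATUS_OK = "success.status_ok";

// Only clear the "back off, I got a 401" note when BOTH flags report
// success. Checking the sync flag alone was the regression: a failed
// service status could coexist with sync == SYNC_SUCCEEDED and wrongly
// clear the note at the end of a bad sync.
function maybeClearReassignmentNote(status, prefs) {
  if (status.sync === SYNC_SUCCEEDED && status.service === STATUS_OK) {
    prefs.delete("lastSyncReassigned");
  }
}
```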
Attachment #568603 - Flags: feedback?
(Assignee)

Updated

7 years ago
Attachment #568603 - Flags: feedback? → feedback?(philipp)
Comment on attachment 568603 [details] [diff] [review]
Possible incremental remediation. v1

Status is a total jerk.
Attachment #568603 - Flags: feedback?(philipp) → review+

Updated

7 years ago
status-firefox10: affected → ---
Component: Firefox Sync: Backend → Sync
Product: Cloud Services → Firefox