Closed Bug 985145 Opened 6 years ago Closed 6 years ago

Error bar displayed during Sync 1.5 node re-assignment

Categories

(Firefox :: Sync, defect)

x86
All
defect
Not set

Tracking

()

VERIFIED FIXED
mozilla31
Tracking Status
firefox29 --- fixed
firefox30 --- fixed
firefox31 --- fixed

People

(Reporter: bobm, Assigned: markh)

References

Details

(Whiteboard: [qa+])

Attachments

(1 file, 2 obsolete files)

An error bar was displayed during a node reassignment.  The node reassignment took place as follows:

1. The node https://sync-1-us-east-1.sync.services.mozilla.com was set to downed and backoff in the token db.
2. The DNS for sync-1-us-east-1.sync.services.mozilla.com was redirected to the Sync 401inator (a web service that returns 401s for all requests).
3. User assignments to sync-1-us-east-1.sync.services.mozilla.com were removed from the token db.

Relevant Sync 1.5 log: 
    1395100241050   Sync.ErrorHandler  DEBUG     Flushing file log.
    1395100241051   Sync.Service       DEBUG  Exception: Error JS Stack trace: Res_get@resource.js:413:5 < Sync11Service.prototype._fetchInfo@service.js:526:7 < sync@enginesync.js:77:1 < onNotify@service.js:1249:7 < WrappedNotify@util.js:143:1 < WrappedLock@util.js:98:9 < _lockedSync@service.js:1243:1 < sync/<@service.js:1234:7 < WrappedCatch@util.js:72:9 < sync@service.js:1222:5
    1395100241056   Sync.Status        DEBUG   Status.service: success.status_ok => success.status_ok
    1395100241057   Sync.Status        DEBUG   Status.service: success.status_ok => success.status_ok
    1395100241057   Sync.ErrorHandler  DEBUG     Log cleanup threshold time: 1394236241057
    1395100241058   Sync.ErrorHandler  DEBUG     No logs to clean up.
    1395100540030   Sync.Status        DEBUG   Status.service: success.status_ok => success.status_ok
    1395100540031   Sync.ErrorHandler  DEBUG     Beginning user-triggered sync.
    1395100540031   Sync.Service       DEBUG  User-Agent: Firefox/30.0a1 FxSync/1.32.0.20140317030202.
    1395100540031   Sync.Service       INFO   Starting sync at 2014-03-17 16:55:40
    1395100540032   Sync.SyncScheduler DEBUG    Clearing sync triggers and the global score.
    1395100540032   Sync.Status        INFO    Resetting Status.
    1395100540032   Sync.Status        DEBUG   Status.service: success.status_ok => success.status_ok
    1395100540504   Sync.Resource      DEBUG mesg: GET fail 401 https://sync-1-us-east-1.sync.services.mozilla.com/1.5/XXXXX/info/collections
    1395100540504   Sync.Resource      DEBUG GET fail 401 https://sync-1-us-east-1.sync.services.mozilla.com/1.5/XXXXX/info/collections
    1395100540504   Sync.Service       INFO   Logging out
    1395100540505   Sync.Status        DEBUG   Status.service: success.status_ok => success.status_ok
    1395100540505   Sync.Status        DEBUG   Status.service: success.status_ok => success.status_ok
    1395100540506   Sync.SyncScheduler DEBUG    Next sync in 600000 ms.
    1395100540506   Sync.ErrorHandler  INFO      Got 401 response; resetting clusterURL.
    1395100540506   Sync.ErrorHandler  DEBUG     New mid-sync 401 failure. Making a note.
    1395100540507   Sync.ErrorHandler  INFO      Attempting to schedule another sync.
    1395100540508   Sync.SyncScheduler DEBUG    Sync error count has exceeded 3; enforcing backoff.
    1395100540508   Sync.SyncScheduler DEBUG    Starting client-initiated backoff. Next sync in 14158771 ms.
    1395100540508   Sync.SyncScheduler DEBUG    Next sync in 14158771 ms.
The problem is that in general, we treat 401 errors as real auth errors that require user intervention (such as re-entering the password) to resolve.

If there a 404inator you could use instead? :) Telling sync "not found" seems more accurate than "unauthorized" in this case.
(hrm, well, 404 probably isn't great either, but some 500 - or almost anything else - would work better)
Node reassignments in Sync 1.5 are accomplished by the token server, not by the storage server.

All the storage server needs to do is tell the client to go away.

A 503 with Retry-After should suffice if nothing else works.
Whiteboard: [qa+]
> Node reassignments in Sync 1.5 are accomplished by the token server, not by the storage server.

Hmm, the in
Yay accidental submission.

> Node reassignments in Sync 1.5 are accomplished by the token server, not by the storage server.

In my head, they're still triggered by a 401 from the storage server.  Your token expires, you get a 401, you go back to the tokenserver and detect that your node assignment has changed.  In fact, there should be no other reason to get a 401 from the storage server.

But I can see how that flow doesn't make as much sense as it used to, since we're not persisting tokens in a variety of circumstances.

I don't think a 503 is appropriate here, but if it'll cause better client behavior, I guess we can roll with it.  Operationally, it would be nice to have something to trigger a "go get your new node assignment" from the storage servers with a minimum of fuss.
(In reply to Ryan Kelly [:rfkelly] from comment #5)
> Yay accidental submission.
> 
> > Node reassignments in Sync 1.5 are accomplished by the token server, not by the storage server.
> 
> In my head, they're still triggered by a 401 from the storage server.  Your
> token expires, you get a 401, you go back to the tokenserver and detect that
> your node assignment has changed.  In fact, there should be no other reason
> to get a 401 from the storage server.

Hrm - yes, good point - we should have re-hit the token server in this case.

> But I can see how that flow doesn't make as much sense as it used to, since
> we're not persisting tokens in a variety of circumstances.

But the point above remains valid - ie, presumably we had a token, then got a 401 from a storage server, so should have tried to grab a new token.  I also don't understand why the log seems to omit entries from the browserid_identity module, which does the token fetch etc.

> I don't think a 503 is appropriate here, but if it'll cause better client
> behavior, I guess we can roll with it.  Operationally, it would be nice to
> have something to trigger a "go get your new node assignment" from the
> storage servers with a minimum of fuss.

I now take back what I said above - it's not at all clear to me why the 401 caused this.
Concur that 401 *should* work. That should be tested. If it doesn't work, and we don't know why, let's fall back to 503+RA.
:jbonacci, sounds like we need to try this out in a controlled manner in stage and try to capture a detailed log.
In case it helps, here's the error log just prior to the one above. The reason this was failing was that the 401inator had a bad SSL cert:

1395099520462	Sync.ErrorHandler	DEBUG	Flushing file log.
1395099520463	Sync.Service	DEBUG	Exception: Error JS Stack trace: Res_get@resource.js:413:5 < Sync11Service.prototype._fetchInfo@service.js:526:7 < sync@enginesync.js:77:1 < onNotify@service.js:1249:7 < WrappedNotify@util.js:143:1 < WrappedLock@util.js:98:9 < _lockedSync@service.js:1243:1 < sync/<@service.js:1234:7 < WrappedCatch@util.js:72:9 < sync@service.js:1222:5
1395099520469	Sync.Status	DEBUG	Status.service: success.status_ok => success.status_ok
1395099520470	Sync.Status	DEBUG	Status.service: success.status_ok => success.status_ok
1395099520470	Sync.ErrorHandler	DEBUG	Log cleanup threshold time: 1394235520470
1395099520471	Sync.ErrorHandler	DEBUG	No logs to clean up.
1395099640541	Sync.SyncScheduler	DEBUG	Next sync in 600000 ms.
1395100240541	Sync.Service	DEBUG	User-Agent: Firefox/30.0a1 FxSync/1.32.0.20140317030202.
1395100240542	Sync.Service	INFO	Starting sync at 2014-03-17 16:50:40
1395100240543	Sync.SyncScheduler	DEBUG	Clearing sync triggers and the global score.
1395100240544	Sync.Status	INFO	Resetting Status.
1395100240544	Sync.Status	DEBUG	Status.service: success.status_ok => success.status_ok
1395100240545	Sync.BrowserIDManager	INFO	Fetching assertion and token from: https://token.services.mozilla.com/1.0/sync/1.5
1395100240547	Sync.BrowserIDManager	DEBUG	Getting an assertion
1395100240551	Sync.BrowserIDManager	DEBUG	Getting a token
1395100240764	Sync.BrowserIDManager	DEBUG	Successfully got a sync token
1395100241047	Sync.CertListener	WARN	Invalid HTTPS certificate encountered!
1395100241050	Sync.SyncScheduler	DEBUG	Sync error count has exceeded 3; enforcing backoff.
1395100241050	Sync.SyncScheduler	DEBUG	Starting client-initiated backoff. Next sync in 12404960 ms.
1395100241050	Sync.SyncScheduler	DEBUG	Next sync in 12404960 ms.
:rfkelly and :bobm 
ok, let's set this up for tomorrow
11am till 3pm PDT is ok
or after 5pm PDT is ok
or Thursday
Status: NEW → ASSIGNED
QA Contact: jbonacci
(In reply to James Bonacci [:jbonacci] from comment #10)
> :rfkelly and :bobm 
> ok, let's set this up for tomorrow
> 11am till 3pm PDT is ok
> or after 5pm PDT is ok
> or Thursday

I will build a stage 401inator before hand.  I am available during either window listed, but would prefer after 1:30 pm.
:bobm ok, let's start at 1:30pm tomorrow. If we don't finish by 3pm, I can pick it up again with you and :rfkelly after 5pm
So, our quick testing of sync, node reassignment, force 401, sync again 
basically showed the "gray bar" bug.

I got the gray bar ;-)
Here is my sync log:
1395269719748	Sync.ErrorHandler	DEBUG	Flushing file log.
1395269719750	Sync.Status	DEBUG	Status.service: success.status_ok => success.status_ok
1395269719750	Sync.Status	DEBUG	Status.service: success.status_ok => success.status_ok
1395269720579	Sync.Tracker.Clients	DEBUG	Saving changed IDs to clients
1395269720582	Sync.Tracker.Bookmarks	DEBUG	Saving changed IDs to bookmarks
1395269720587	Sync.Tracker.Forms	DEBUG	Saving changed IDs to forms
1395269720588	Sync.Tracker.History	DEBUG	Saving changed IDs to history
1395269720670	Sync.Tracker.Passwords	DEBUG	Saving changed IDs to passwords
1395269720748	Sync.Tracker.Addons	DEBUG	Saving changed IDs to addons
1395269727058	Sync.Tracker.History	DEBUG	Saving changed IDs to history
1395270536962	Sync.Status	DEBUG	Status.service: success.status_ok => success.status_ok
1395270538245	Sync.ErrorHandler	DEBUG	Beginning user-triggered sync.
1395270538246	Sync.Service	DEBUG	User-Agent: Firefox/29.0a2 FxSync/1.31.0.20140317004002.
1395270538246	Sync.Service	INFO	Starting sync at 2014-03-19 16:08:58
1395270538246	Sync.SyncScheduler	DEBUG	Clearing sync triggers and the global score.
1395270538248	Sync.Status	INFO	Resetting Status.
1395270538248	Sync.Status	DEBUG	Status.service: success.status_ok => success.status_ok
1395270538683	Sync.Resource	DEBUG	mesg: GET fail 401 https://sync-jb-us-west-2.sync.services.mozilla.com/1.5/BLAH/info/collections
1395270538683	Sync.Resource	DEBUG	GET fail 401 https://sync-jb-us-west-2.sync.services.mozilla.com/1.5/BLAH/info/collections
1395270538683	Sync.Service	INFO	Logging out
1395270538684	Sync.Status	DEBUG	Status.service: success.status_ok => success.status_ok
1395270538684	Sync.Status	DEBUG	Status.service: success.status_ok => success.status_ok
1395270538684	Sync.SyncScheduler	DEBUG	Next sync in 3600000 ms.
1395270538685	Sync.ErrorHandler	INFO	Got 401 response; resetting clusterURL.
1395270538685	Sync.ErrorHandler	DEBUG	New mid-sync 401 failure. Making a note.
1395270538685	Sync.ErrorHandler	INFO	Attempting to schedule another sync.
1395270538686	Sync.SyncScheduler	DEBUG	Next sync in 3600000 ms.
Blocks: 969593, 979432
Component: Firefox Sync: UI → Firefox Sync: Backend
I see the problem.  When a 401 is seen, Sync makes a to setCluster().  With the legacy provider, this will cause a request to be made to determine the node to use and all is good.  However, under FxA, we use the token from the token server to determine the node, and while our token remains valid we re-use the same token.  In this case we should be ignoring the existing token and requesting a new one which (presumably) will give us a different node to use.

A subtlety is that we don't know if we are being asked for the node because of a previous 401 - but I *think* we can simply check if the service already has a clusterURL - if so, it means it is bad and we should always fetch a new token, and if not, we can just reuse the token we have.

This fix should be easy; a test for it will be harder, but worthwhile.
Assignee: nobody → mhammond
Be a little careful here, because there be monsters.

When you get a 401 from a node, you should go get a new token. It's possible we've changed secrets, you've been reassigned, etc, etc. The first time you see it, just do it.

However, if you get the new token and then get another 401, please sleep for some period of time (20 min?) before trying again. This shouldn't happen, so it's best to be conservative here and not swamp the tokenserver with a loop.
Chatting in IRC, Mark and I hypothesize a possible server-side component to this.  Specifically, we need to dig into this sequence to check for edge cases:

"""
2. The DNS for sync-1-us-east-1.sync.services.mozilla.com was redirected to the Sync 401inator (a web service that returns 401s for all requests).
3. User assignments to sync-1-us-east-1.sync.services.mozilla.com were removed from the token db.
"""

First, there's clearly a race here, any client that happens connect between (2) and (3) will not see a node re-assignment and hence will get the error bar.  But perhaps there's something else going on.  Bob, what query is used for "User assignments to sync-1-us-east-1.sync.services.mozilla.com were removed from the token db"?
Flags: needinfo?(bobm)
Just for some more context:

(In reply to Mark Hammond [:markh] from comment #14)
> I see the problem.  When a 401 is seen, Sync makes a to setCluster().  With
> the legacy provider, this will cause a request to be made to determine the
> node to use and all is good.  However, under FxA, we use the token from the
> token server to determine the node, and while our token remains valid we
> re-use the same token.  In this case we should be ignoring the existing
> token and requesting a new one which (presumably) will give us a different
> node to use.

The above was conjecture - what I'm actually seeing while writing a test for this is that after the initial 401, sync asks us to log the user out, then log back in before starting the next sync.  This logout causes us to drop the token and subsequent login to fetch a new one.  IOW, I can't see how we are using the same token in this case.

However, if that subsequent login gave us back a token with an endpoint that *still* 401'd, I could see the problem.
Depends on: 988134
Depends on: 988137
Attached patch Test a 401 for info/collections (obsolete) — Splinter Review
It looks alot like a 401 on info/collections doesn't do the right thing.  It appears to also not do the right thing with legacy sync.

I'd like to check I'm not missing something subtle here.  The following attachment changes the existing test_node_reassignment test so that it simulates a 401 on info/collections, using the same test infrastructure that simulates the 401 on /storage.  With this change the test fails in 2 ways: the cluster URL isn't reset, and the error *is* surfaced to the UI (which is exactly what we are seeing here).  Looking at the code, this isn't that surprising - fetching the info URL doesn't call checkServerError(), which is where alot of the necessary magic happens.  A fix for this is subtle, so I'm asking for more info before I waste more time here.

Richard, can you please confirm that in an ideal world, these test changes would succeed?  Also, re the changes from add_identity_test->add_task, this just prevents these tests from using the FxA provider - I'm working on an equivalent FxA specific test.
Flags: needinfo?(rnewman)
(In reply to Ryan Kelly [:rfkelly] from comment #16)
First, there's clearly a race here, any client that happens connect between
> (2) and (3) will not see a node re-assignment and hence will get the error
> bar.  

This just looks like an operational problem. Steps 2 and 3 should never intentionally happen in that order.
(In reply to Ryan Kelly [:rfkelly] from comment #16)

> 2. The DNS for sync-1-us-east-1.sync.services.mozilla.com was redirected to
> the Sync 401inator (a web service that returns 401s for all requests).
> 3. User assignments to sync-1-us-east-1.sync.services.mozilla.com were
> removed from the token db.

Happy to change this.  What flow would you suggest?  Remove user assignments first, and then redirect the DNS?  


> First, there's clearly a race here, any client that happens connect between
> (2) and (3) will not see a node re-assignment and hence will get the error
> bar.  But perhaps there's something else going on.  Bob, what query is used
> for "User assignments to sync-1-us-east-1.sync.services.mozilla.com were
> removed from the token db"?

DELETE FROM users WHERE node="https://sync-1-us-east-1.sync.services.mozilla.com";
Flags: needinfo?(bobm)
(In reply to Bob Micheletto [:bobm] from comment #20)

> Happy to change this.  What flow would you suggest?  Remove user assignments
> first, and then redirect the DNS?  

It's never good to tell users that their credentials or node assignment are wrong (401) without first establishing a different node assignment.

(2) then (3) is the equivalent of the DMV bouncing you between the same windows over and over.

Establish the new node assignments (which I think in 1.5 is equivalent to "turn off new assignments for the old server, delete user records that used to be assigned to that server"), *then* start the 401.
Flags: needinfo?(rnewman)
(In reply to Mark Hammond [:markh] from comment #18)
> It appears to also not do the right thing with legacy sync.

It used to, and the code looks fine to me.

We'd routinely return a 401 on info/collections to cause a node reassignment. The does this:

* In verifyLogin, fetch info. service.js:698.
* If we get a 401 or 404, fetch a new node assignment.
* If fetching node/weave failed, or we fetched the same node assignment, fail with LOGIN_FAILED_LOGIN_REJECTED.
* Otherwise, proceed with the new node assignment.

A 401 on any other request (e.g., a subsequent sync) will cause the service to log you out (policies.js:888) and the sync to quietly abort. (There might be a bug in the 'quietly' part, but I'd be slightly surprised, 'cos this is tested every day by users.)

The next sync attempt will call verifyLogin again as part of logging in, and we'll go through the handling above.

> I'd like to check I'm not missing something subtle here.  The following
> attachment changes the existing test_node_reassignment test so that it
> simulates a 401 on info/collections, using the same test infrastructure that
> simulates the 401 on /storage.  With this change the test fails in 2 ways:
> the cluster URL isn't reset,

The cluster URL will only be reset for the in-sync error case. We need to know the old cluster URL in the verifyLogin state in order to detect when your password is wrong. The first sync on the service will hit verifyLogin first, and that leaves the old clusterURL.

> and the error *is* surfaced to the UI (which is
> exactly what we are seeing here).

If the client gets a 401 on info/collections, and then it doesn't get a different node assignment, then that's a wrong-password error, which will be surfaced to the user.

The only way you can sanely return 401 on info/collections and expect anything else is if you have a node/weave endpoint that returns a different node assignment (i.e., _clusterManager.setCluster() returns true), or you're actively trying to test the wrong-password flow.
(In reply to Toby Elliott [:telliott] from comment #19)

> This just looks like an operational problem. Steps 2 and 3 should never
> intentionally happen in that order.

I've changed our operational task flow documentation to reflect this.  However, should the tokenserver be providing tokens for nodes in a 'downed' state?
(In reply to Bob Micheletto [:bobm] from comment #23)

> I've changed our operational task flow documentation to reflect this. 
> However, should the tokenserver be providing tokens for nodes in a 'downed'
> state?

That seems OK. Those tokens will invalidate themselves as soon as the 401-inator moves into place.
> What flow would you suggest?  Remove user assignments first, and then redirect the DNS?

It sounds like this is already resolved, but for completeness here's the suggested order:

  1. Mark the node as down in the tokenserver db.  This prevents new assignments to that node.
  2. Clear existing assignments to that node.  This means any new token requests will get a new node.
  3. Engage the 401inator.  This sends all active clients to get their new node assignment.

> However, should the tokenserver be providing tokens for nodes in a 'downed' state?

It currently does provide tokens to existing users who are assigned to a downed node.  We can tweak this but it'll cost an extra DB read or join to check the node state.  Currently behavior is safe but suboptimal as it delays detection of a down node until the client actually tries to talk to that node.

> > Bob, what query is used
> > for "User assignments to sync-1-us-east-1.sync.services.mozilla.com were
> > removed from the token db"?
> 
> DELETE FROM users WHERE node="https://sync-1-us-east-1.sync.services.mozilla.com";

This is not what I had imagined, but ironically, doing it this way will avoid surfacing the error bar :-)

We need to keep the old records around in order for things like Bug 984297 to work properly.  Please wait for Bug 988134  to hit production, then change the procedure to set the replaced_at column to the current time in milliseconds.  Like this:

  "UPDATE users SET replaced_at = (UNIX_TIMESTAMP() * 1000) WHERE node=<foo>;"

I can provide a "clear_node_assignments" script that does the appropriate query if that's cleaner, similar.

Also, we don't have an index on the "node" column so this'll be a full table scan.  Do we care enough about the performance of this query to try adding one?
(In reply to Ryan Kelly [:rfkelly] from comment #25)
> 
> Also, we don't have an index on the "node" column so this'll be a full table
> scan.  Do we care enough about the performance of this query to try adding
> one?

You're looking at a couple million rows and growing, so yeah, that seems wise.
> You're looking at a couple million rows and growing, so yeah, that seems wise.

Filed Bug 988643.
Even if the server did the right thing, I've determined FxA would fail to correctly handle node reassignment iff the user was not already logged in (ie, if a previous sync hasn't yet run) and info/collections returned a 401.  In this case we'd continuously return the same clusterURL making things go pear-shaped and show the error UI.

I believe I've also found an issue with meta/global returning a 401 in the "not already logged in" case - in this case we'd just attempt to re-upload new metadata (which presumably would also fail) instead of treating this 401 like other 401s.  This would be very rare (but not impossible) in practice, as this is fetched after info/collections, so a 401 on info/collections would prevent us trying meta/global - but it still seems worthwhile to fix.  I've added a test to the existing test_node_reassignment test for this case.

The fixes are:

* If browserid_identity is requested for a clusterURL while the service already has a cluster URL, we assume the existing one failed in some way and fetch a new token regardless of the current token's validity.

* If we see a 401 *during* login, we previously didn't call Service.logout, as Service._isLoggedIn didn't get set to true.  Service.logout now ignores the loggedin state and always asks the identity manager to log out.  This forces us to re-fetch a new token next time login is called.

* We explicitly handle a 401 on meta/global during login.

* As a bonus, I've added a guard so verifyLogin() can't recurse to death - it will only recurse once and give up in disgust.  This is to solve a pathological server case that we've probably never seen for real, but given its theoretically possible it seems prudent to guard against it.
Attachment #8397641 - Flags: feedback?(rnewman)
Attachment #8397641 - Flags: feedback?(ckarlof)
Attachment #8396936 - Attachment is obsolete: true
Comment on attachment 8397641 [details] [diff] [review]
0001-Bug-985145-make-node-reassignment-work-correctly-wit.patch

Review of attachment 8397641 [details] [diff] [review]:
-----------------------------------------------------------------

This looks like it's going in the right direction. Without spending the time to check, I'm slightly alarmed that this doesn't touch stages/cluster.js -- ClusterManager is what handles node reassignment. Worth thinking about.

::: services/sync/tests/unit/head_http_server.js
@@ +809,5 @@
> +    // Doing a float compare of the version allows for us to pretend there was
> +    // a node-reassignment - eg, we could re-assign from "1.1/user/" to
> +    // "1.10/user" - this server will then still accept requests with the new
> +    // URL while any code in sync itself which compares URLs will see a
> +    // different URL.

I think I just threw up in my mouth. A lot.

::: services/sync/tests/unit/test_fxa_node_reassignment.js
@@ +56,5 @@
> +
> +let numTokenRequests = 0;
> +
> +function prepareServer(cbAfterTokenFetch) {
> +//  let deferred = Promise.defer();

Kill.
Attachment #8397641 - Flags: feedback?(rnewman) → feedback+
(In reply to Richard Newman [:rnewman] from comment #29)
> I'm slightly alarmed that this doesn't touch stages/cluster.js --
> ClusterManager is what handles node reassignment. Worth thinking about.

FWIW, it does touch the _findCluster method in browserid_identity, which overrides this method from the ClusterManager.  I didn't find it necessary to change the _findCluster() impl used for legacy-sync (thank goodness)

> ::: services/sync/tests/unit/head_http_server.js
> @@ +809,5 @@
> > +    // Doing a float compare of the version allows for us to pretend there was
> > +    // a node-reassignment - eg, we could re-assign from "1.1/user/" to
> > +    // "1.10/user" - this server will then still accept requests with the new
> > +    // URL while any code in sync itself which compares URLs will see a
> > +    // different URL.
> 
> I think I just threw up in my mouth. A lot.

Heh - and here I was thinking how clever I was ;) ;)

> Kill.

Oops - will do.
Comment on attachment 8397641 [details] [diff] [review]
0001-Bug-985145-make-node-reassignment-work-correctly-wit.patch

Review of attachment 8397641 [details] [diff] [review]:
-----------------------------------------------------------------

I guess we only half-addressed this issue during last work week. Thanks Mark.
Attachment #8397641 - Flags: feedback?(ckarlof) → feedback+
Only change is the removal of the commented code rnewman noticed giving feedback.  I looked into a better way to have the test server allow node-reassignment (eg, by stopping the existing server and restarting it on a new port etc) but that doesn't seem simple enough to make it worthwhile for these tests.
Attachment #8400939 - Flags: review?(rnewman)
I'll try to get to this tomorrow, once I have full brain again.

But even then, I'd like this to be thoroughly QAed. Could you line up services QA and ops to get this tested? That can happen in parallel with the review.
(In reply to Richard Newman [:rnewman] from comment #33)
> I'll try to get to this tomorrow, once I have full brain again.
> 
> But even then, I'd like this to be thoroughly QAed. Could you line up
> services QA and ops to get this tested? That can happen in parallel with the
> review.

James, are you able to help with this?  A try is at https://tbpl.mozilla.org/?tree=Try&rev=170b26c0fcd3
Flags: needinfo?(jbonacci)
:markh yea, I think we can do this today with :bobm using the try build above.
We are in Work Week, so scheduling is a bit uneven.
If not today (Thu), can this wait until Monday?
Flags: needinfo?(jbonacci)
(In reply to James Bonacci [:jbonacci] from comment #35)
> :markh yea, I think we can do this today with :bobm using the try build
> above.
> We are in Work Week, so scheduling is a bit uneven.
> If not today (Thu), can this wait until Monday?

That would be fine, thanks.
Try build is installed on my Mac and I am pointing to TokenServer and Sync Stage environmnets. I have a node and sync is working for me. :bobm is setting up the actual test from the node side of things...
I grabbed this build: Try link: https://tbpl.mozilla.org/?tree=Try&rev=170b26c0fcd3
Mac (10.8 Opt) build link: https://tbpl.mozilla.org/php/getParsedLog.php?id=37204126&tree=Try
which redirects to here:
http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/mhammond@skippinet.com.au-170b26c0fcd3/try-macosx64/

Installed that
Created a new profile
Set up sync with a new account to point to TokenServer Stage
Originally got this Stage node: https://sync-3-us-east-1.stage.mozaws.net
Was moved to this test node: https://sync-jb-us-west-2.sync.services.mozilla.com
All my attempts to sync here were met with a Yellow Bar (not a gray bar)
:bobm was finally able to force me to a new node: https://sync-3-us-east-1.stage.mozaws.net
At this point, I was able to sync using this new node in East.
Then when he attempted to force me into a 401 situation, I got the dreaded gray bar, followed quickly by the yellow bar
Gray bar error: https://jbonacci.pastebin.mozilla.org/4775090
Yellow bar error: https://jbonacci.pastebin.mozilla.org/4775093
I think this time we had a problem with the server-side node clearing process, as the code from Bug 988134 hasn't made it to stage.  This would have interfered with Bob's node-reallocation process.
OK. We are all good here. 
Started here: https://sync-3-us-east-1.stage.mozaws.net/1.5/BLAH/
Ended up here: https://sync-2-us-east-1.stage.mozaws.net/1.5/BLAH/
with no gray or yellow bar and no error logs.

Let's get this uplifted to Beta if at all possible...
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Attachment #8397641 - Attachment is obsolete: true
reopening - this hasn't landed yet and the patch is still awaiting final r+ from rnewman
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment on attachment 8400939 [details] [diff] [review]
0001-Bug-985145-make-node-reassignment-work-correctly-wit.patch

Review of attachment 8400939 [details] [diff] [review]:
-----------------------------------------------------------------

If it works, who am I to judge? :P
Attachment #8400939 - Flags: review?(rnewman) → review+
Comment on attachment 8400939 [details] [diff] [review]
0001-Bug-985145-make-node-reassignment-work-correctly-wit.patch

[Approval Request Comment]
Bug caused by (feature/regressing bug #): FxA Sync
User impact if declined: Firefox might show sync errors when the sync servers are undergoing maintenance (specifically, then users are being pushed to different storage servers)
Testing completed (on m-c, etc.): Landed on fx-team and green there.
Risk to taking this patch (and alternatives if risky): Low risk, limited to
sync
String or IDL/UUID changes made by this patch: None
Attachment #8400939 - Flags: approval-mozilla-beta?
Attachment #8400939 - Flags: approval-mozilla-aurora?
https://hg.mozilla.org/mozilla-central/rev/d598a49f5e25
Status: ASSIGNED → RESOLVED
Closed: 6 years ago6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla31
Attachment #8400939 - Flags: approval-mozilla-beta?
Attachment #8400939 - Flags: approval-mozilla-beta+
Attachment #8400939 - Flags: approval-mozilla-aurora?
Attachment #8400939 - Flags: approval-mozilla-aurora+
James, could you take this for a spin on any of the Fx29b8 release candidates when those are out? Tomorrow morning, hopefully.
:juanb
Yep. Probably some time after lunch Tuesday...
:bobm
I assume it's "easy enough" to force my user onto a new Prod node?
(In reply to James Bonacci [:jbonacci] from comment #49)
> I assume it's "easy enough" to force my user onto a new Prod node?

Yes, "easy enough" is about right.  We will spin a special new prod node without any users on it, and manually migrate your user to it.  Then we can migrate that node.
Did some testing today with Bobm.
Using what I assume is Fx29b8:


I started on this node: sync-4-us-west-2
Got moved to this node: sync-11-us-west-2
At this point, :bobm forced a 401
I saw a gray bar, followed immediately by
an error log
a success log
and I ended up here: sync-3-us-west-2

The error log showed this:
GET fail 401 https://sync-11-us-west-2

The success log showed this:
1397610168869	Sync.ErrorHandler	DEBUG	Flushing file log.
1397610168869	Sync.Service	DEBUG	Exception: Aborting sync: failed to get collections. No traceback available
1397610168870	Sync.Service	DEBUG	User-Agent: Firefox/29.0 FxSync/1.31.0.20140414143035.
1397610168870	Sync.Service	INFO	Starting sync at 2014-04-15 18:02:48
1397610168870	Sync.Service	DEBUG	In sync: should login.
1397610168870	Sync.Status	DEBUG	Status.service: success.status_ok => success.status_ok
1397610168870	Sync.Status	DEBUG	Status.service: success.status_ok => success.status_ok
1397610168877	Sync.Status	DEBUG	Status.service: success.status_ok => success.status_ok
1397610168880	Sync.Status	DEBUG	Status.service: success.status_ok => success.status_ok
1397610168881	Sync.Service	INFO	Logging in user BLAH
1397610168882	Sync.ErrorHandler	DEBUG	Log cleanup threshold time: 1396746168882
1397610168882	Sync.ErrorHandler	DEBUG	No logs to clean up.
1397610168882	Sync.BrowserIDManager	INFO	Fetching assertion and token from: https://token.services.mozilla.com/1.0/sync/1.5
1397610168883	Sync.BrowserIDManager	DEBUG	Getting an assertion
1397610168905	Sync.BrowserIDManager	DEBUG	Getting a token
1397610170050	Sync.BrowserIDManager	DEBUG	Successfully got a sync token
1397610170051	Sync.BrowserIDManager	DEBUG	_findCluster returning https://sync-3-us-west-2


So, in Stage, I did not see a gray bar or a yellow bar.
Forgot this key part from about:config:
browser.startup.homepage_override.buildID = 20140414143035
browser.startup.homepage_override.mstone = 29.0
:juanb - did I happen to catch the wrong build?
:markh - ideas?
Maybe it was all about the release.
I switched to Fx29 b9 today.

Working with OPs we switched from one node to another.
Then they forced a 401 on me and a new node assignment.

All of this came together as expected with no yellow or gray bar.
We are done here!

From the Success log (not any error log!):
...etc...
1397865617518	Sync.Resource	DEBUG	mesg: GET fail 401 https://sync-11-us-west-2/1.5/BLAH/info/collections
1397865617518	Sync.Resource	DEBUG	GET fail 401 https://sync-11-us-west-2/1.5/BLAH/info/collections
1397865617518	Sync.Service	WARN	401: login failed.
...etc...
GET success 200 https://sync-7-us-west-2/1.5/BLAH
...etc...
PUT success 200 https://sync-7-us-west-2/1.5/BLAH/storage
...etc...
DELETE success 200 https://sync-7-us-west-2/1.5/BLAH/storage
...etc...
POST success 200 https://sync-7-us-west-2/1.5/BLAH/storage
...etc...
Status: RESOLVED → VERIFIED
Component: Firefox Sync: Backend → Sync
Product: Cloud Services → Firefox
You need to log in before you can comment on or make changes to this bug.