Closed
Bug 985145
Opened 11 years ago
Closed 11 years ago
Error bar displayed during Sync 1.5 node re-assignment
Categories
(Firefox :: Sync, defect)
Tracking
VERIFIED
FIXED
mozilla31
People
(Reporter: bobm, Assigned: markh)
References
Details
(Whiteboard: [qa+])
Attachments
(1 file, 2 obsolete files)
26.54 KB, patch
rnewman: review+
Sylvestre: approval-mozilla-aurora+
Sylvestre: approval-mozilla-beta+
Details | Diff | Splinter Review
An error bar was displayed during a node reassignment. The node reassignment took place as follows:
1. The node https://sync-1-us-east-1.sync.services.mozilla.com was set to downed and backoff in the token db.
2. The DNS for sync-1-us-east-1.sync.services.mozilla.com was redirected to the Sync 401inator (a web service that returns 401s for all requests).
3. User assignments to sync-1-us-east-1.sync.services.mozilla.com were removed from the token db.
Relevant Sync 1.5 log:
1395100241050 Sync.ErrorHandler DEBUG Flushing file log.
1395100241051 Sync.Service DEBUG Exception: Error JS Stack trace: Res_get@resource.js:413:5 < Sync11Service.prototype._fetchInfo@service.js:526:7 < sync@enginesync.js:77:1 < onNotify@service.js:1249:7 < WrappedNotify@util.js:143:1 < WrappedLock@util.js:98:9 < _lockedSync@service.js:1243:1 < sync/<@service.js:1234:7 < WrappedCatch@util.js:72:9 < sync@service.js:1222:5
1395100241056 Sync.Status DEBUG Status.service: success.status_ok => success.status_ok
1395100241057 Sync.Status DEBUG Status.service: success.status_ok => success.status_ok
1395100241057 Sync.ErrorHandler DEBUG Log cleanup threshold time: 1394236241057
1395100241058 Sync.ErrorHandler DEBUG No logs to clean up.
1395100540030 Sync.Status DEBUG Status.service: success.status_ok => success.status_ok
1395100540031 Sync.ErrorHandler DEBUG Beginning user-triggered sync.
1395100540031 Sync.Service DEBUG User-Agent: Firefox/30.0a1 FxSync/1.32.0.20140317030202.
1395100540031 Sync.Service INFO Starting sync at 2014-03-17 16:55:40
1395100540032 Sync.SyncScheduler DEBUG Clearing sync triggers and the global score.
1395100540032 Sync.Status INFO Resetting Status.
1395100540032 Sync.Status DEBUG Status.service: success.status_ok => success.status_ok
1395100540504 Sync.Resource DEBUG mesg: GET fail 401 https://sync-1-us-east-1.sync.services.mozilla.com/1.5/XXXXX/info/collections
1395100540504 Sync.Resource DEBUG GET fail 401 https://sync-1-us-east-1.sync.services.mozilla.com/1.5/XXXXX/info/collections
1395100540504 Sync.Service INFO Logging out
1395100540505 Sync.Status DEBUG Status.service: success.status_ok => success.status_ok
1395100540505 Sync.Status DEBUG Status.service: success.status_ok => success.status_ok
1395100540506 Sync.SyncScheduler DEBUG Next sync in 600000 ms.
1395100540506 Sync.ErrorHandler INFO Got 401 response; resetting clusterURL.
1395100540506 Sync.ErrorHandler DEBUG New mid-sync 401 failure. Making a note.
1395100540507 Sync.ErrorHandler INFO Attempting to schedule another sync.
1395100540508 Sync.SyncScheduler DEBUG Sync error count has exceeded 3; enforcing backoff.
1395100540508 Sync.SyncScheduler DEBUG Starting client-initiated backoff. Next sync in 14158771 ms.
1395100540508 Sync.SyncScheduler DEBUG Next sync in 14158771 ms.
Assignee
Comment 1•11 years ago
The problem is that in general, we treat 401 errors as real auth errors that require user intervention (such as re-entering the password) to resolve.
Is there a 404inator you could use instead? :) Telling sync "not found" seems more accurate than "unauthorized" in this case.
Assignee
Comment 2•11 years ago
(hrm, well, 404 probably isn't great either, but some 500 - or almost anything else - would work better)
Comment 3•11 years ago
Node reassignments in Sync 1.5 are accomplished by the token server, not by the storage server.
All the storage server needs to do is tell the client to go away.
A 503 with Retry-After should suffice if nothing else works.
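For illustration, a minimal stand-in for this kind of responder (an editor's sketch only, not the actual 401inator; the port and payload are made up), switchable between a blanket 401 and a 503 with Retry-After:

// Hypothetical responder: answer every request with the same status code.
// Set RESPOND_WITH to 503 to exercise the Retry-After path instead of 401.
const http = require('http');

const RESPOND_WITH = 401;        // or 503
const RETRY_AFTER_SECONDS = 600; // only sent with 503 responses

http.createServer((req, res) => {
  const headers = { 'Content-Type': 'application/json' };
  if (RESPOND_WITH === 503) {
    headers['Retry-After'] = String(RETRY_AFTER_SECONDS);
  }
  res.writeHead(RESPOND_WITH, headers);
  res.end(JSON.stringify({ status: 'error', code: RESPOND_WITH }));
}).listen(8080);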
Updated•11 years ago
Whiteboard: [qa+]
Comment 4•11 years ago
> Node reassignments in Sync 1.5 are accomplished by the token server, not by the storage server.
Hmm, the in
Comment 5•11 years ago
Yay accidental submission.
> Node reassignments in Sync 1.5 are accomplished by the token server, not by the storage server.
In my head, they're still triggered by a 401 from the storage server. Your token expires, you get a 401, you go back to the tokenserver and detect that your node assignment has changed. In fact, there should be no other reason to get a 401 from the storage server.
But I can see how that flow doesn't make as much sense as it used to, since we're not persisting tokens in a variety of circumstances.
I don't think a 503 is appropriate here, but if it'll cause better client behavior, I guess we can roll with it. Operationally, it would be nice to have something to trigger a "go get your new node assignment" from the storage servers with a minimum of fuss.
Assignee
Comment 6•11 years ago
(In reply to Ryan Kelly [:rfkelly] from comment #5)
> Yay accidental submission.
>
> > Node reassignments in Sync 1.5 are accomplished by the token server, not by the storage server.
>
> In my head, they're still triggered by a 401 from the storage server. Your
> token expires, you get a 401, you go back to the tokenserver and detect that
> your node assignment has changed. In fact, there should be no other reason
> to get a 401 from the storage server.
Hrm - yes, good point - we should have re-hit the token server in this case.
> But I can see how that flow doesn't make as much sense as it used to, since
> we're not persisting tokens in a variety of circumstances.
But the point above remains valid - ie, presumably we had a token, then got a 401 from a storage server, so we should have tried to grab a new token. I also don't understand why the log seems to omit entries from the browserid_identity module, which does the token fetch etc.
> I don't think a 503 is appropriate here, but if it'll cause better client
> behavior, I guess we can roll with it. Operationally, it would be nice to
> have something to trigger a "go get your new node assignment" from the
> storage servers with a minimum of fuss.
I now take back what I said above - it's not at all clear to me why the 401 caused this.
Comment 7•11 years ago
Concur that 401 *should* work. That should be tested. If it doesn't work, and we don't know why, let's fall back to 503+RA.
Comment 8•11 years ago
:jbonacci, sounds like we need to try this out in a controlled manner in stage and try to capture a detailed log.
Comment 9•11 years ago
In case it helps, here's the error log just prior to the one above. The reason this was failing was that the 401inator had a bad SSL cert:
1395099520462 Sync.ErrorHandler DEBUG Flushing file log.
1395099520463 Sync.Service DEBUG Exception: Error JS Stack trace: Res_get@resource.js:413:5 < Sync11Service.prototype._fetchInfo@service.js:526:7 < sync@enginesync.js:77:1 < onNotify@service.js:1249:7 < WrappedNotify@util.js:143:1 < WrappedLock@util.js:98:9 < _lockedSync@service.js:1243:1 < sync/<@service.js:1234:7 < WrappedCatch@util.js:72:9 < sync@service.js:1222:5
1395099520469 Sync.Status DEBUG Status.service: success.status_ok => success.status_ok
1395099520470 Sync.Status DEBUG Status.service: success.status_ok => success.status_ok
1395099520470 Sync.ErrorHandler DEBUG Log cleanup threshold time: 1394235520470
1395099520471 Sync.ErrorHandler DEBUG No logs to clean up.
1395099640541 Sync.SyncScheduler DEBUG Next sync in 600000 ms.
1395100240541 Sync.Service DEBUG User-Agent: Firefox/30.0a1 FxSync/1.32.0.20140317030202.
1395100240542 Sync.Service INFO Starting sync at 2014-03-17 16:50:40
1395100240543 Sync.SyncScheduler DEBUG Clearing sync triggers and the global score.
1395100240544 Sync.Status INFO Resetting Status.
1395100240544 Sync.Status DEBUG Status.service: success.status_ok => success.status_ok
1395100240545 Sync.BrowserIDManager INFO Fetching assertion and token from: https://token.services.mozilla.com/1.0/sync/1.5
1395100240547 Sync.BrowserIDManager DEBUG Getting an assertion
1395100240551 Sync.BrowserIDManager DEBUG Getting a token
1395100240764 Sync.BrowserIDManager DEBUG Successfully got a sync token
1395100241047 Sync.CertListener WARN Invalid HTTPS certificate encountered!
1395100241050 Sync.SyncScheduler DEBUG Sync error count has exceeded 3; enforcing backoff.
1395100241050 Sync.SyncScheduler DEBUG Starting client-initiated backoff. Next sync in 12404960 ms.
1395100241050 Sync.SyncScheduler DEBUG Next sync in 12404960 ms.
Comment 10•11 years ago
:rfkelly and :bobm
ok, let's set this up for tomorrow
11am till 3pm PDT is ok
or after 5pm PDT is ok
or Thursday
Status: NEW → ASSIGNED
QA Contact: jbonacci
Reporter
Comment 11•11 years ago
(In reply to James Bonacci [:jbonacci] from comment #10)
> :rfkelly and :bobm
> ok, let's set this up for tomorrow
> 11am till 3pm PDT is ok
> or after 5pm PDT is ok
> or Thursday
I will build a stage 401inator before hand. I am available during either window listed, but would prefer after 1:30 pm.
Comment 12•11 years ago
:bobm ok, let's start at 1:30pm tomorrow. If we don't finish by 3pm, I can pick it up again with you and :rfkelly after 5pm
Comment 13•11 years ago
So, our quick testing of sync, node reassignment, force 401, sync again
basically showed the "gray bar" bug.
I got the gray bar ;-)
Here is my sync log:
1395269719748 Sync.ErrorHandler DEBUG Flushing file log.
1395269719750 Sync.Status DEBUG Status.service: success.status_ok => success.status_ok
1395269719750 Sync.Status DEBUG Status.service: success.status_ok => success.status_ok
1395269720579 Sync.Tracker.Clients DEBUG Saving changed IDs to clients
1395269720582 Sync.Tracker.Bookmarks DEBUG Saving changed IDs to bookmarks
1395269720587 Sync.Tracker.Forms DEBUG Saving changed IDs to forms
1395269720588 Sync.Tracker.History DEBUG Saving changed IDs to history
1395269720670 Sync.Tracker.Passwords DEBUG Saving changed IDs to passwords
1395269720748 Sync.Tracker.Addons DEBUG Saving changed IDs to addons
1395269727058 Sync.Tracker.History DEBUG Saving changed IDs to history
1395270536962 Sync.Status DEBUG Status.service: success.status_ok => success.status_ok
1395270538245 Sync.ErrorHandler DEBUG Beginning user-triggered sync.
1395270538246 Sync.Service DEBUG User-Agent: Firefox/29.0a2 FxSync/1.31.0.20140317004002.
1395270538246 Sync.Service INFO Starting sync at 2014-03-19 16:08:58
1395270538246 Sync.SyncScheduler DEBUG Clearing sync triggers and the global score.
1395270538248 Sync.Status INFO Resetting Status.
1395270538248 Sync.Status DEBUG Status.service: success.status_ok => success.status_ok
1395270538683 Sync.Resource DEBUG mesg: GET fail 401 https://sync-jb-us-west-2.sync.services.mozilla.com/1.5/BLAH/info/collections
1395270538683 Sync.Resource DEBUG GET fail 401 https://sync-jb-us-west-2.sync.services.mozilla.com/1.5/BLAH/info/collections
1395270538683 Sync.Service INFO Logging out
1395270538684 Sync.Status DEBUG Status.service: success.status_ok => success.status_ok
1395270538684 Sync.Status DEBUG Status.service: success.status_ok => success.status_ok
1395270538684 Sync.SyncScheduler DEBUG Next sync in 3600000 ms.
1395270538685 Sync.ErrorHandler INFO Got 401 response; resetting clusterURL.
1395270538685 Sync.ErrorHandler DEBUG New mid-sync 401 failure. Making a note.
1395270538685 Sync.ErrorHandler INFO Attempting to schedule another sync.
1395270538686 Sync.SyncScheduler DEBUG Next sync in 3600000 ms.
Updated•11 years ago
Assignee
Comment 14•11 years ago
I see the problem. When a 401 is seen, Sync makes a call to setCluster(). With the legacy provider, this will cause a request to be made to determine the node to use and all is good. However, under FxA, we use the token from the token server to determine the node, and while our token remains valid we re-use the same token. In this case we should be ignoring the existing token and requesting a new one which (presumably) will give us a different node to use.
A subtlety is that we don't know if we are being asked for the node because of a previous 401 - but I *think* we can simply check if the service already has a clusterURL - if so, it means it is bad and we should always fetch a new token, and if not, we can just reuse the token we have.
This fix should be easy; a test for it will be harder, but worthwhile.
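A rough sketch of that idea (an editor's illustration, not the actual browserid_identity code; tokenCache and fetchNewToken are invented placeholder names):

// Sketch: only reuse the cached token when we don't already have a clusterURL.
// An existing clusterURL implies the last request against it failed (e.g. a
// 401), so fetch a fresh token, which may point at a different storage node.
async function findClusterSketch(service, tokenCache, fetchNewToken) {
  if (!service.clusterURL && tokenCache.token && !tokenCache.isExpired()) {
    return tokenCache.token.endpoint;
  }
  tokenCache.token = await fetchNewToken();
  return tokenCache.token.endpoint;
}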
Assignee: nobody → mhammond
Comment 15•11 years ago
Be a little careful here, because there be monsters.
When you get a 401 from a node, you should go get a new token. It's possible we've changed secrets, you've been reassigned, etc, etc. The first time you see it, just do it.
However, if you get the new token and then get another 401, please sleep for some period of time (20 min?) before trying again. This shouldn't happen, so it's best to be conservative here and not swamp the tokenserver with a loop.
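Roughly, that policy in client pseudocode (illustrative only; the helper names are invented and the 20-minute figure comes from the comment above):

// Sketch: retry once on a 401 by fetching a new token; if the retry also
// 401s, back off for ~20 minutes rather than hammering the tokenserver.
const TOKEN_RETRY_BACKOFF_MS = 20 * 60 * 1000;

async function requestWithReauth(doRequest, fetchNewToken, scheduleBackoff) {
  let response = await doRequest();
  if (response.status !== 401) {
    return response;
  }
  await fetchNewToken();   // secrets may have rotated, node may have changed
  response = await doRequest();
  if (response.status === 401) {
    // Two 401s in a row should not happen; be conservative and wait.
    scheduleBackoff(TOKEN_RETRY_BACKOFF_MS);
  }
  return response;
}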
Comment 16•11 years ago
Chatting in IRC, Mark and I hypothesize a possible server-side component to this. Specifically, we need to dig into this sequence to check for edge cases:
"""
2. The DNS for sync-1-us-east-1.sync.services.mozilla.com was redirected to the Sync 401inator (a web service that returns 401s for all requests).
3. User assignments to sync-1-us-east-1.sync.services.mozilla.com were removed from the token db.
"""
First, there's clearly a race here: any client that happens to connect between (2) and (3) will not see a node re-assignment and hence will get the error bar. But perhaps there's something else going on. Bob, what query is used for "User assignments to sync-1-us-east-1.sync.services.mozilla.com were removed from the token db"?
Flags: needinfo?(bobm)
Assignee
Comment 17•11 years ago
Just for some more context:
(In reply to Mark Hammond [:markh] from comment #14)
> I see the problem. When a 401 is seen, Sync makes a call to setCluster(). With
> the legacy provider, this will cause a request to be made to determine the
> node to use and all is good. However, under FxA, we use the token from the
> token server to determine the node, and while our token remains valid we
> re-use the same token. In this case we should be ignoring the existing
> token and requesting a new one which (presumably) will give us a different
> node to use.
The above was conjecture - what I'm actually seeing while writing a test for this is that after the initial 401, sync asks us to log the user out, then log back in before starting the next sync. This logout causes us to drop the token, and the subsequent login fetches a new one. IOW, I can't see how we are using the same token in this case.
However, if that subsequent login gave us back a token with an endpoint that *still* 401'd, I could see the problem.
Assignee
Comment 18•11 years ago
It looks a lot like a 401 on info/collections doesn't do the right thing. It also appears not to do the right thing with legacy sync.
I'd like to check I'm not missing something subtle here. The following attachment changes the existing test_node_reassignment test so that it simulates a 401 on info/collections, using the same test infrastructure that simulates the 401 on /storage. With this change the test fails in two ways: the cluster URL isn't reset, and the error *is* surfaced to the UI (which is exactly what we are seeing here). Looking at the code, this isn't that surprising - fetching the info URL doesn't call checkServerError(), which is where a lot of the necessary magic happens. A fix for this is subtle, so I'm asking for more info before I waste more time here.
Richard, can you please confirm that in an ideal world, these test changes would succeed? Also, re the changes from add_identity_test->add_task, this just prevents these tests from using the FxA provider - I'm working on an equivalent FxA specific test.
Flags: needinfo?(rnewman)
Comment 19•11 years ago
(In reply to Ryan Kelly [:rfkelly] from comment #16)
> First, there's clearly a race here: any client that happens to connect between
> (2) and (3) will not see a node re-assignment and hence will get the error
> bar.
This just looks like an operational problem. Steps 2 and 3 should never intentionally happen in that order.
Reporter
Comment 20•11 years ago
(In reply to Ryan Kelly [:rfkelly] from comment #16)
> 2. The DNS for sync-1-us-east-1.sync.services.mozilla.com was redirected to
> the Sync 401inator (a web service that returns 401s for all requests).
> 3. User assignments to sync-1-us-east-1.sync.services.mozilla.com were
> removed from the token db.
Happy to change this. What flow would you suggest? Remove user assignments first, and then redirect the DNS?
> First, there's clearly a race here, any client that happens connect between
> (2) and (3) will not see a node re-assignment and hence will get the error
> bar. But perhaps there's something else going on. Bob, what query is used
> for "User assignments to sync-1-us-east-1.sync.services.mozilla.com were
> removed from the token db"?
DELETE FROM users WHERE node="https://sync-1-us-east-1.sync.services.mozilla.com";
Flags: needinfo?(bobm)
Comment 21•11 years ago
(In reply to Bob Micheletto [:bobm] from comment #20)
> Happy to change this. What flow would you suggest? Remove user assignments
> first, and then redirect the DNS?
It's never good to tell users that their credentials or node assignment are wrong (401) without first establishing a different node assignment.
(2) then (3) is the equivalent of the DMV bouncing you between the same windows over and over.
Establish the new node assignments (which I think in 1.5 is equivalent to "turn off new assignments for the old server, delete user records that used to be assigned to that server"), *then* start the 401.
Flags: needinfo?(rnewman)
Comment 22•11 years ago
(In reply to Mark Hammond [:markh] from comment #18)
> It appears to also not do the right thing with legacy sync.
It used to, and the code looks fine to me.
We'd routinely return a 401 on info/collections to cause a node reassignment. The client does this:
* In verifyLogin, fetch info. service.js:698.
* If we get a 401 or 404, fetch a new node assignment.
* If fetching node/weave failed, or we fetched the same node assignment, fail with LOGIN_FAILED_LOGIN_REJECTED.
* Otherwise, proceed with the new node assignment.
A 401 on any other request (e.g., a subsequent sync) will cause the service to log you out (policies.js:888) and the sync to quietly abort. (There might be a bug in the 'quietly' part, but I'd be slightly surprised, 'cos this is tested every day by users.)
The next sync attempt will call verifyLogin again as part of logging in, and we'll go through the handling above.
> I'd like to check I'm not missing something subtle here. The following
> attachment changes the existing test_node_reassignment test so that it
> simulates a 401 on info/collections, using the same test infrastructure that
> simulates the 401 on /storage. With this change the test fails in 2 ways:
> the cluster URL isn't reset,
The cluster URL will only be reset for the in-sync error case. We need to know the old cluster URL in the verifyLogin state in order to detect when your password is wrong. The first sync on the service will hit verifyLogin first, and that leaves the old clusterURL.
> and the error *is* surfaced to the UI (which is
> exactly what we are seeing here).
If the client gets a 401 on info/collections, and then it doesn't get a different node assignment, then that's a wrong-password error, which will be surfaced to the user.
The only way you can sanely return 401 on info/collections and expect anything else is if you have a node/weave endpoint that returns a different node assignment (i.e., _clusterManager.setCluster() returns true), or you're actively trying to test the wrong-password flow.
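Restating the verifyLogin flow described above as a sketch (an editor's paraphrase, not the real service.js source; function names are placeholders):

// Pseudocode for the legacy verifyLogin handling described in this comment.
async function verifyLoginSketch(service) {
  const info = await service.fetchInfoCollections();   // cf. service.js:698
  if (info.status !== 401 && info.status !== 404) {
    return "LOGIN_SUCCEEDED";
  }
  // 401/404: ask node/weave (or the tokenserver) for a node assignment.
  const newCluster = await service.fetchNodeAssignment();
  if (!newCluster || newCluster === service.clusterURL) {
    // No new node, or the same node: treat as a genuine credential problem.
    return "LOGIN_FAILED_LOGIN_REJECTED";
  }
  service.clusterURL = newCluster;  // proceed with the new assignment
  return "RETRY_WITH_NEW_NODE";
}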
Reporter
Comment 23•11 years ago
(In reply to Toby Elliott [:telliott] from comment #19)
> This just looks like an operational problem. Steps 2 and 3 should never
> intentionally happen in that order.
I've changed our operational task flow documentation to reflect this. However, should the tokenserver be providing tokens for nodes in a 'downed' state?
Comment 24•11 years ago
(In reply to Bob Micheletto [:bobm] from comment #23)
> I've changed our operational task flow documentation to reflect this.
> However, should the tokenserver be providing tokens for nodes in a 'downed'
> state?
That seems OK. Those tokens will invalidate themselves as soon as the 401-inator moves into place.
Comment 25•11 years ago
> What flow would you suggest? Remove user assignments first, and then redirect the DNS?
It sounds like this is already resolved, but for completeness here's the suggested order:
1. Mark the node as down in the tokenserver db. This prevents new assignments to that node.
2. Clear existing assignments to that node. This means any new token requests will get a new node.
3. Engage the 401inator. This sends all active clients to get their new node assignment.
> However, should the tokenserver be providing tokens for nodes in a 'downed' state?
It currently does provide tokens to existing users who are assigned to a downed node. We can tweak this but it'll cost an extra DB read or join to check the node state. Current behavior is safe but suboptimal, as it delays detection of a down node until the client actually tries to talk to that node.
> > Bob, what query is used
> > for "User assignments to sync-1-us-east-1.sync.services.mozilla.com were
> > removed from the token db"?
>
> DELETE FROM users WHERE node="https://sync-1-us-east-1.sync.services.mozilla.com";
This is not what I had imagined, but ironically, doing it this way will avoid surfacing the error bar :-)
We need to keep the old records around in order for things like Bug 984297 to work properly. Please wait for Bug 988134 to hit production, then change the procedure to set the replaced_at column to the current time in milliseconds. Like this:
"UPDATE users SET replaced_at = (UNIX_TIMESTAMP() * 1000) WHERE node=<foo>;"
I can provide a "clear_node_assignments" script that does the appropriate query if that's cleaner, or something similar.
Also, we don't have an index on the "node" column so this'll be a full table scan. Do we care enough about the performance of this query to try adding one?
Comment 26•11 years ago
(In reply to Ryan Kelly [:rfkelly] from comment #25)
>
> Also, we don't have an index on the "node" column so this'll be a full table
> scan. Do we care enough about the performance of this query to try adding
> one?
You're looking at a couple million rows and growing, so yeah, that seems wise.
Comment 27•11 years ago
> You're looking at a couple million rows and growing, so yeah, that seems wise.
Filed Bug 988643.
Assignee
Comment 28•11 years ago
Even if the server did the right thing, I've determined FxA would fail to correctly handle node reassignment iff the user was not already logged in (ie, if a previous sync hasn't yet run) and info/collections returned a 401. In this case we'd continuously return the same clusterURL, making things go pear-shaped and showing the error UI.
I believe I've also found an issue with meta/global returning a 401 in the "not already logged in" case - in this case we'd just attempt to re-upload new metadata (which presumably would also fail) instead of treating this 401 like other 401s. This would be very rare (but not impossible) in practice, as this is fetched after info/collections, so a 401 on info/collections would prevent us trying meta/global - but it still seems worthwhile to fix. I've added a test to the existing test_node_reassignment test for this case.
The fixes are:
* If browserid_identity is requested for a clusterURL while the service already has a cluster URL, we assume the existing one failed in some way and fetch a new token regardless of the current token's validity.
* If we see a 401 *during* login, we previously didn't call Service.logout, as Service._isLoggedIn didn't get set to true. Service.logout now ignores the loggedin state and always asks the identity manager to log out. This forces us to re-fetch a new token next time login is called.
* We explicitly handle a 401 on meta/global during login.
* As a bonus, I've added a guard so verifyLogin() can't recurse to death - it will only recurse once and give up in disgust. This is to solve a pathological server case that we've probably never seen for real, but given it's theoretically possible it seems prudent to guard against it.
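For the last bullet, the guard might look roughly like this (sketch only, with invented names; the actual patch may structure it differently):

// Sketch: allow verifyLogin to retry itself at most once after a node
// reassignment, instead of recursing indefinitely against a broken server.
async function verifyLoginGuarded(service, allowRetry = true) {
  const ok = await service.verifyLoginOnce();
  if (ok || !allowRetry) {
    return ok;  // success, or we've already used our single retry
  }
  // The cluster URL may have changed; try exactly one more time.
  return verifyLoginGuarded(service, false);
}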
Attachment #8397641 - Flags: feedback?(rnewman)
Attachment #8397641 - Flags: feedback?(ckarlof)
Assignee
Updated•11 years ago
Attachment #8396936 - Attachment is obsolete: true
Comment 29•11 years ago
Comment on attachment 8397641 [details] [diff] [review]
0001-Bug-985145-make-node-reassignment-work-correctly-wit.patch
Review of attachment 8397641 [details] [diff] [review]:
-----------------------------------------------------------------
This looks like it's going in the right direction. Without spending the time to check, I'm slightly alarmed that this doesn't touch stages/cluster.js -- ClusterManager is what handles node reassignment. Worth thinking about.
::: services/sync/tests/unit/head_http_server.js
@@ +809,5 @@
> + // Doing a float compare of the version allows for us to pretend there was
> + // a node-reassignment - eg, we could re-assign from "1.1/user/" to
> + // "1.10/user" - this server will then still accept requests with the new
> + // URL while any code in sync itself which compares URLs will see a
> + // different URL.
I think I just threw up in my mouth. A lot.
::: services/sync/tests/unit/test_fxa_node_reassignment.js
@@ +56,5 @@
> +
> +let numTokenRequests = 0;
> +
> +function prepareServer(cbAfterTokenFetch) {
> +// let deferred = Promise.defer();
Kill.
Attachment #8397641 - Flags: feedback?(rnewman) → feedback+
Assignee
Comment 30•11 years ago
(In reply to Richard Newman [:rnewman] from comment #29)
> I'm slightly alarmed that this doesn't touch stages/cluster.js --
> ClusterManager is what handles node reassignment. Worth thinking about.
FWIW, it does touch the _findCluster method in browserid_identity, which overrides this method from the ClusterManager. I didn't find it necessary to change the _findCluster() impl used for legacy-sync (thank goodness)
> ::: services/sync/tests/unit/head_http_server.js
> @@ +809,5 @@
> > + // Doing a float compare of the version allows for us to pretend there was
> > + // a node-reassignment - eg, we could re-assign from "1.1/user/" to
> > + // "1.10/user" - this server will then still accept requests with the new
> > + // URL while any code in sync itself which compares URLs will see a
> > + // different URL.
>
> I think I just threw up in my mouth. A lot.
Heh - and here I was thinking how clever I was ;) ;)
> Kill.
Oops - will do.
Comment 31•11 years ago
Comment on attachment 8397641 [details] [diff] [review]
0001-Bug-985145-make-node-reassignment-work-correctly-wit.patch
Review of attachment 8397641 [details] [diff] [review]:
-----------------------------------------------------------------
I guess we only half-addressed this issue during last work week. Thanks Mark.
Attachment #8397641 - Flags: feedback?(ckarlof) → feedback+
Assignee
Comment 32•11 years ago
The only change is the removal of the commented-out code rnewman noticed while giving feedback. I looked into a better way to have the test server allow node-reassignment (eg, by stopping the existing server and restarting it on a new port, etc.) but that doesn't seem simple enough to make it worthwhile for these tests.
Attachment #8400939 - Flags: review?(rnewman)
Comment 33•11 years ago
I'll try to get to this tomorrow, once I have full brain again.
But even then, I'd like this to be thoroughly QAed. Could you line up services QA and ops to get this tested? That can happen in parallel with the review.
Assignee
Comment 34•11 years ago
(In reply to Richard Newman [:rnewman] from comment #33)
> I'll try to get to this tomorrow, once I have full brain again.
>
> But even then, I'd like this to be thoroughly QAed. Could you line up
> services QA and ops to get this tested? That can happen in parallel with the
> review.
James, are you able to help with this? A try is at https://tbpl.mozilla.org/?tree=Try&rev=170b26c0fcd3
Flags: needinfo?(jbonacci)
Comment 35•11 years ago
:markh yea, I think we can do this today with :bobm using the try build above.
We are in Work Week, so scheduling is a bit uneven.
If not today (Thu), can this wait until Monday?
Flags: needinfo?(jbonacci)
Assignee
Comment 36•11 years ago
(In reply to James Bonacci [:jbonacci] from comment #35)
> :markh yea, I think we can do this today with :bobm using the try build
> above.
> We are in Work Week, so scheduling is a bit uneven.
> If not today (Thu), can this wait until Monday?
That would be fine, thanks.
Comment 37•11 years ago
Try build is installed on my Mac and I am pointing to TokenServer and Sync Stage environments. I have a node and sync is working for me. :bobm is setting up the actual test from the node side of things...
Comment 38•11 years ago
I grabbed this build: Try link: https://tbpl.mozilla.org/?tree=Try&rev=170b26c0fcd3
Mac (10.8 Opt) build link: https://tbpl.mozilla.org/php/getParsedLog.php?id=37204126&tree=Try
which redirects to here:
http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/mhammond@skippinet.com.au-170b26c0fcd3/try-macosx64/
Installed that
Created a new profile
Set up sync with a new account to point to TokenServer Stage
Originally got this Stage node: https://sync-3-us-east-1.stage.mozaws.net
Was moved to this test node: https://sync-jb-us-west-2.sync.services.mozilla.com
All my attempts to sync here were met with a Yellow Bar (not a gray bar)
:bobm was finally able to force me to a new node: https://sync-3-us-east-1.stage.mozaws.net
At this point, I was able to sync using this new node in East.
Then when he attempted to force me into a 401 situation, I got the dreaded gray bar, followed quickly by the yellow bar
Gray bar error: https://jbonacci.pastebin.mozilla.org/4775090
Yellow bar error: https://jbonacci.pastebin.mozilla.org/4775093
Comment 39•11 years ago
I think this time we had a problem with the server-side node clearing process, as the code from Bug 988134 hasn't made it to stage. This would have interfered with Bob's node-reallocation process.
Comment 40•11 years ago
OK. We are all good here.
Started here: https://sync-3-us-east-1.stage.mozaws.net/1.5/BLAH/
Ended up here: https://sync-2-us-east-1.stage.mozaws.net/1.5/BLAH/
with no gray or yellow bar and no error logs.
Let's get this uplifted to Beta if at all possible...
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Assignee
Updated•11 years ago
Attachment #8397641 - Attachment is obsolete: true
Assignee
Comment 41•11 years ago
reopening - this hasn't landed yet and the patch is still awaiting final r+ from rnewman
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 42•11 years ago
Comment on attachment 8400939 [details] [diff] [review]
0001-Bug-985145-make-node-reassignment-work-correctly-wit.patch
Review of attachment 8400939 [details] [diff] [review]:
-----------------------------------------------------------------
If it works, who am I to judge? :P
Attachment #8400939 - Flags: review?(rnewman) → review+
Assignee
Comment 43•11 years ago
Status: REOPENED → ASSIGNED
Assignee
Comment 44•11 years ago
Comment on attachment 8400939 [details] [diff] [review]
0001-Bug-985145-make-node-reassignment-work-correctly-wit.patch
[Approval Request Comment]
Bug caused by (feature/regressing bug #): FxA Sync
User impact if declined: Firefox might show sync errors when the sync servers are undergoing maintenance (specifically, when users are being pushed to different storage servers)
Testing completed (on m-c, etc.): Landed on fx-team and green there.
Risk to taking this patch (and alternatives if risky): Low risk, limited to sync
String or IDL/UUID changes made by this patch: None
Attachment #8400939 - Flags: approval-mozilla-beta?
Attachment #8400939 - Flags: approval-mozilla-aurora?
Comment 45•11 years ago
Status: ASSIGNED → RESOLVED
Closed: 11 years ago → 11 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla31
Updated•11 years ago
Attachment #8400939 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Attachment #8400939 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Comment 46•11 years ago
Comment 47•11 years ago
James, could you take this for a spin on any of the Fx29b8 release candidates when those are out? Tomorrow morning, hopefully.
Comment 48•11 years ago
:juanb
Yep. Probably some time after lunch Tuesday...
Comment 49•11 years ago
:bobm
I assume it's "easy enough" to force my user onto a new Prod node?
Reporter
Comment 50•11 years ago
(In reply to James Bonacci [:jbonacci] from comment #49)
> I assume it's "easy enough" to force my user onto a new Prod node?
Yes, "easy enough" is about right. We will spin a special new prod node without any users on it, and manually migrate your user to it. Then we can migrate that node.
Comment 51•11 years ago
Did some testing today with Bobm.
Using what I assume is Fx29b8:
I started on this node: sync-4-us-west-2
Got moved to this node: sync-11-us-west-2
At this point, :bobm forced a 401
I saw a gray bar, followed immediately by
an error log
a success log
and I ended up here: sync-3-us-west-2
The error log showed this:
GET fail 401 https://sync-11-us-west-2
The success log showed this:
1397610168869 Sync.ErrorHandler DEBUG Flushing file log.
1397610168869 Sync.Service DEBUG Exception: Aborting sync: failed to get collections. No traceback available
1397610168870 Sync.Service DEBUG User-Agent: Firefox/29.0 FxSync/1.31.0.20140414143035.
1397610168870 Sync.Service INFO Starting sync at 2014-04-15 18:02:48
1397610168870 Sync.Service DEBUG In sync: should login.
1397610168870 Sync.Status DEBUG Status.service: success.status_ok => success.status_ok
1397610168870 Sync.Status DEBUG Status.service: success.status_ok => success.status_ok
1397610168877 Sync.Status DEBUG Status.service: success.status_ok => success.status_ok
1397610168880 Sync.Status DEBUG Status.service: success.status_ok => success.status_ok
1397610168881 Sync.Service INFO Logging in user BLAH
1397610168882 Sync.ErrorHandler DEBUG Log cleanup threshold time: 1396746168882
1397610168882 Sync.ErrorHandler DEBUG No logs to clean up.
1397610168882 Sync.BrowserIDManager INFO Fetching assertion and token from: https://token.services.mozilla.com/1.0/sync/1.5
1397610168883 Sync.BrowserIDManager DEBUG Getting an assertion
1397610168905 Sync.BrowserIDManager DEBUG Getting a token
1397610170050 Sync.BrowserIDManager DEBUG Successfully got a sync token
1397610170051 Sync.BrowserIDManager DEBUG _findCluster returning https://sync-3-us-west-2
So, in Stage, I did not see a gray bar or a yellow bar.
Comment 52•11 years ago
Forgot this key part from about:config:
browser.startup.homepage_override.buildID = 20140414143035
browser.startup.homepage_override.mstone = 29.0
Comment 53•11 years ago
:juanb - did I happen to catch the wrong build?
:markh - ideas?
Comment 54•11 years ago
Maybe it was all about the release.
I switched to Fx29 b9 today.
Working with OPs we switched from one node to another.
Then they forced a 401 on me and a new node assignment.
All of this came together as expected with no yellow or gray bar.
We are done here!
From the Success log (not any error log!):
...etc...
1397865617518 Sync.Resource DEBUG mesg: GET fail 401 https://sync-11-us-west-2/1.5/BLAH/info/collections
1397865617518 Sync.Resource DEBUG GET fail 401 https://sync-11-us-west-2/1.5/BLAH/info/collections
1397865617518 Sync.Service WARN 401: login failed.
...etc...
GET success 200 https://sync-7-us-west-2/1.5/BLAH
...etc...
PUT success 200 https://sync-7-us-west-2/1.5/BLAH/storage
...etc...
DELETE success 200 https://sync-7-us-west-2/1.5/BLAH/storage
...etc...
POST success 200 https://sync-7-us-west-2/1.5/BLAH/storage
...etc...
Status: RESOLVED → VERIFIED
Updated•6 years ago
Component: Firefox Sync: Backend → Sync
Product: Cloud Services → Firefox