Bug 680323 (Closed)
More considered handling of 401s resulting from Sync LDAP replication delays
Opened 14 years ago; closed 3 years ago
Product/Component: Firefox :: Sync (defect)
Status: RESOLVED INACTIVE
Reporter: petef; Assignee: Unassigned
Whiteboard: [closeme-sync.next]
When a client first signs up, they write to an LDAP master in PHX (through sreg), get their node, and immediately start syncing (sometimes to a server in an entirely different datacenter). Our production LDAP replication tree is generally fast enough to handle this, but it would be nice to fail gracefully if replication gets a bit behind.
This error can also happen when an existing user gets node-reassigned: they hit auth and get a brand new node allocated, then get 401s when they try to sync because the node allocation hasn't propagated through LDAP (the webhead verifies that Host: matches the allocated node).
If the first few hits to our clusterURL after fetching our node result in 401s, we could back off and try again (without showing an error bar). After so many minutes/tries (5 minutes?), the normal error handling could resume.
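A rough sketch of what that grace period could look like (TypeScript, illustrative only; the function and constant names here are made up and are not the actual Sync client code):

const GRACE_PERIOD_MS = 5 * 60 * 1000;  // "after so many minutes (5 minutes?)"
const RETRY_DELAY_MS = 30 * 1000;       // assumed spacing between silent retries

// fetchInfoCollections stands in for whatever request hits the clusterURL;
// it resolves to the HTTP status code.
async function syncWithGracePeriod(
  fetchInfoCollections: () => Promise<number>,
): Promise<"ok" | "auth-error"> {
  const start = Date.now();
  while (true) {
    const status = await fetchInfoCollections();
    if (status !== 401) {
      return "ok";
    }
    // Early 401s are assumed to be LDAP replication lag: back off quietly,
    // without showing an error bar.
    if (Date.now() - start < GRACE_PERIOD_MS) {
      await new Promise((resolve) => setTimeout(resolve, RETRY_DELAY_MS));
      continue;
    }
    // Grace period over: hand the 401 to the normal error handling.
    return "auth-error";
  }
}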
Comment 1•14 years ago
Are there any replication speed guarantees for our LDAP infra? If there are, then there might be a really simple solution here: delay sync in those situations.
If not (which is what I expect), then I guess this is another item for the error policy wishlist…
OS: Linux → All
Hardware: x86_64 → All
Updated•14 years ago
Summary: be more resilient to ldap replication delays → More considered handling of 401s resulting from Sync LDAP replication delays
Comment 2 (Reporter)•14 years ago
(In reply to Richard Newman [:rnewman] from comment #1)
> Are there any replication speed guarantees for our LDAP infra? If there are,
> then there might be a really simple solution here: delay sync in those
> situations.
No speed guarantees. Normally it's pretty instant, but there are a lot of factors (many out of our control, like cross-DC VPN links) that can impact that.
- From memory, on any given day, <5 seconds lag is normal, and most likely 0-2 seconds. (Everyday life.)
- From memory, we might see a lag of 2-5 minutes when the intra-site VPN goes down, or when we have to nuke many nodes at once, or when we're doing ldap maintenance. (Once or twice a month.)
- From memory, our worst lag was 25-30 minutes, when there was a core router crash between PHX-SJC and we had no way to replicate data. (Once or twice a year.)
Comment 4•14 years ago
Summary of discussion today:
* LDAP replication lag is a normal event.
* Fetching node/weave will only succeed after LDAP replication has completed.
* Consequently, a 401 should be followed by some delay before retrying, and/or some kind of silent retry mechanism, giving LDAP replication time to catch up.
Philipp's view:
---
Currently we do this:
(1) fetch info/collections (first login), get a 401
(2) refetch node/weave.
- if it returns a new node, go to (1)
- otherwise report incorrect password
You would like us to do this:
(0) set counter to 0
(1) fetch info/collections (first login), get a 401
(2) refetch node/weave.
- if it returns a new node, go to (1)
- if counter == 0, increment counter by 1, go to (1)
- otherwise report incorrect password
---
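A minimal sketch of that counter-based flow (TypeScript; the helper names are hypothetical, not real Sync client functions):

// fetchInfoCollections resolves to the HTTP status; fetchNodeWeave resolves
// to the assigned node URL.
async function firstLoginSync(
  fetchInfoCollections: (node: string) => Promise<number>,
  fetchNodeWeave: () => Promise<string>,
): Promise<"ok" | "incorrect-password"> {
  let node = await fetchNodeWeave();
  let counter = 0;                          // step (0)
  while (true) {
    // (1) fetch info/collections (first login)
    if ((await fetchInfoCollections(node)) !== 401) {
      return "ok";
    }
    // (2) refetch node/weave
    const newNode = await fetchNodeWeave();
    if (newNode !== node) {
      node = newNode;                       // new node: go to (1)
      continue;
    }
    if (counter === 0) {
      counter++;                            // counter == 0: retry (1) once
      continue;
    }
    return "incorrect-password";            // otherwise report incorrect password
  }
}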
atoll's view:
---
(0) set counter to 0
(1) fetch whatever, get a 401 or several
(2) (a) if counter == 0, increment counter by 1, wait 10 minutes, go to (1)
(2) (b) else, refetch node/weave, compare, report incorrect password, etc.
Because if 10 minutes pass and they *still* can't auth, then it's truly an issue, and we're probably already being paged about it anyway.
---
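The same idea as a sketch (TypeScript, hypothetical names), with the ten-minute wait in place of the extra immediate retry:

async function handle401sWithWait(
  fetchWhatever: () => Promise<number>,        // HTTP status of "fetch whatever"
  refetchNodeAndReport: () => Promise<void>,   // node/weave compare + error report
  wait: (ms: number) => Promise<void>,
): Promise<void> {
  let counter = 0;                             // (0)
  while ((await fetchWhatever()) === 401) {    // (1)
    if (counter === 0) {
      counter++;
      await wait(10 * 60 * 1000);              // (2)(a) wait 10 minutes, go to (1)
      continue;
    }
    await refetchNodeAndReport();              // (2)(b) it's a real auth problem
    return;
  }
}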
Comment 5 (Reporter)•14 years ago
(In reply to Richard Newman [:rnewman] from comment #4)
> * Fetching node/weave will only succeed after LDAP replication has completed.
No, the node servers talk directly to the LDAP master. The issue I'm worried about is when you get assigned a node that is NOT in the same datacenter as the LDAP master (master in PHX, get an SCL2 node). When you ask for node/weave, you're talking to PHX and you get pointed to SCL2. You may start talking to an SCL2 webhead before the LDAP slaves in SCL2 have replicated the user. Of course, this can happen within PHX too, although it's less likely there (master and slave on the same VLAN vs. going over an internet-backed VPN).
It might not even be the first request you make to the sync node, because we have multiple LDAP slaves. You might be 3 or 4 requests into the sync before you see a 401 because of an out-of-date LDAP slave.
tl;dr: you will get a 200 fetching node/weave, and can still sometimes hit a 401 later because of LDAP replication lag.
Comment 6•14 years ago
(In reply to Pete Fritchman [:petef] from comment #5)
> (In reply to Richard Newman [:rnewman] from comment #4)
> > * Fetching node/weave will only succeed after LDAP replication has completed.
>
> No, the node servers talk directly to the LDAP master.
Sorry, that was an awful misphrasing on my part. Thanks for correcting!
Our "we got a 401" strategy is "clear our node assignment and schedule another sync to deal with any problem"; at the start of a sync we hit node/weave to set our local node assignment, and then proceed with syncing.
What I ought to have said was "the node assignment we get from node/weave isn't guaranteed to be 100% functional until some minutes after our last 401", because we can make subsequent requests and some of them might fail.
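Roughly, the strategy described above amounts to something like this (TypeScript sketch; the names are hypothetical, not the real client code):

interface SyncState {
  clusterURL: string | null;   // our local node assignment
}

function onUnauthorized(state: SyncState, scheduleSync: () => void): void {
  state.clusterURL = null;     // clear our node assignment
  scheduleSync();              // schedule another sync to deal with any problem
}

async function startSync(
  state: SyncState,
  fetchNodeWeave: () => Promise<string>,
  doSync: (clusterURL: string) => Promise<void>,
): Promise<void> {
  let clusterURL = state.clusterURL;
  if (clusterURL === null) {
    clusterURL = await fetchNodeWeave();   // hit node/weave at sync start
    state.clusterURL = clusterURL;         // remember the assignment locally
  }
  // Even a freshly fetched assignment can still 401 for a while, until every
  // LDAP slave behind the assigned node has the user.
  await doSync(clusterURL);
}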
(Fortunately, Bug 692714 largely addresses the issue of 401s happening repeatedly in the middle of consecutive syncs. This bug is to cover any other holes.)
As you can see, I don't have a clear enough head to address those tonight :D
Comment 8•13 years ago
Possible candidate for WONTFIX with new auth?
Comment 9•13 years ago
New auth makes this irrelevant, but it sounds like that's not happening any time soon.
Updated•12 years ago
Whiteboard: [closeme-sync.next]
Updated•7 years ago
Component: Firefox Sync: Backend → Sync
Product: Cloud Services → Firefox
Comment 10•3 years ago
bulk closing old sync bugs with "closeme" in the whiteboard.
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → INACTIVE