A 503 from tokenserver should mean we don't try to sync to the storage node

NEW
Unassigned

Status

Cloud Services
Firefox Sync: Backend
P3
normal
3 years ago
a year ago

People

(Reporter: rfkelly, Unassigned)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

MozReview Requests

Submitter Diff Changes Open Issues Last Updated
Loading...
Error loading review requests:

Attachments

(1 attachment)

(Reporter)

Description

3 years ago
:jrgm noticed this behaviour in Bug 1186238 Comment 17.  If firefox gets a 503 response from the tokenserver, it appears to plow right ahead and try syncing anyway, despite that fact that it didn't successfully acquire a token.  Perhaps there's a missing error-handling path that should be bailing out in this case, but is instead swallowing the error?
I'll check if this is still the case.
Flags: needinfo?(kit)
Comment hidden (mozreview-request)
The attached test shows this is still the case. The process is fairly side-effect heavy, but here's my understanding of what happens:

* We try to fetch info/collections in http://searchfox.org/mozilla-central/rev/ac40ca3ec39efe85bfb111274c10ee4ceea5bb7a/services/sync/modules/service.js#552. This is an authenticated request, so we use http://searchfox.org/mozilla-central/rev/ac40ca3ec39efe85bfb111274c10ee4ceea5bb7a/services/sync/modules/browserid_identity.js#726-728,743 to obtain the token.

* We make a request to the token server. The token server clients converts non-200 responses into errors. Based on the comments in http://searchfox.org/mozilla-central/rev/ac40ca3ec39efe85bfb111274c10ee4ceea5bb7a/services/common/tokenserverclient.js#146-148, I think a 503 will be either a "general" or "malformed-response" error class, but not an auth error.

* We correctly assume this is a transient network error (http://searchfox.org/mozilla-central/rev/ac40ca3ec39efe85bfb111274c10ee4ceea5bb7a/services/sync/modules/browserid_identity.js#680-683), and fail the background key bundle fetch as a side effect. But, `_getAuthenticationHeader` also catches the token server error, and returns `null`.

* This causes us to make an info/collections request without a token, which returns a 401.

Richard's comment in bug 1205111, comment 10 seems relevant here. It seems we intentionally return `null` instead of letting the error propagate, but with the downside that we end up trying to sync, even if we don't have a token. Mark noted in bug 1205111, comment 11 that throwing is risky, because it impacts invalidation logic that runs during the sync.

But ISTM we specifically don't want that invalidation logic to run if we got a 401 after a 503 from the token server? Flagging Richard and Mark for comment, since they have a lot more insight into how these different layers interact.
Flags: needinfo?(rnewman)
Flags: needinfo?(markh)
Flags: needinfo?(kit)
(In reply to Kit Cambridge [:kitcambridge] (PTO from 2017-01-02 to 2017-01-09) from comment #3)

> Flagging Richard and Mark for
> comment, since they have a lot more insight into how these different layers
> interact.

ha! I wish :)

> * We make a request to the token server. The token server clients converts
> non-200 responses into errors. Based on the comments in
> http://searchfox.org/mozilla-central/rev/
> ac40ca3ec39efe85bfb111274c10ee4ceea5bb7a/services/common/tokenserverclient.
> js#146-148, I think a 503 will be either a "general" or "malformed-response"
> error class, but not an auth error.

Yep, that's what should happen.

> * This causes us to make an info/collections request without a token, which
> returns a 401.
> 
> Richard's comment in bug 1205111, comment 10 seems relevant here. It seems
> we intentionally return `null` instead of letting the error propagate, but
> with the downside that we end up trying to sync, even if we don't have a
> token. Mark noted in bug 1205111, comment 11 that throwing is risky, because
> it impacts invalidation logic that runs during the sync.

Actually, in that comment I meant something along the lines of "even if we do have such invalidation logic in place, I can change my patch to avoid hitting it and thus cut the risk" - but I don't think there is much in the way of such invalidation logic (at least, I can't see any beyond a call to service.logout()

> But ISTM we specifically don't want that invalidation logic to run if we got
> a 401 after a 503 from the token server?

As above, I don't think we do, but I'm not sure it matters in practice (ie, I don't think any harm will come from that 401 other than a possible inefficiency.)

As Richard says in bug 1205111, comment 10, the behaviour is really a "shortcut" - I think it is safe to abandon the sync on such errors (probably still ensuring service.logout() is called) - it's just not clear on the mechanics we should use.

To complicate the matter further, there are actually 2 requests of interest here. The sync code will (say) attempt to make a request to storage. *As part of* establishing that request, we will, in some cases, need to perform a token fetch. The way things are structured currently, it's very difficult to simply abandon the "outer" request at this point - we've already opened the channel for this outer request and are trying to add headers - at this point we find we need a new token and fetch a new one, effectively ignoring any errors. IOW:

> If firefox gets a 503 response from the tokenserver, it appears to plow right 
> ahead and try syncing anyway, despite that fact that it didn't successfully
> acquire a token.

Should really be written as "If firefox gets any response from the tokenserver without a token ..."

In both cases, the ideal would be that we make *no* request after the failed tokenserver call - but as above, this is tricky to recover from, as in most cases we've already opened the channel to storage, so an exception is probably the only sane way to recover from this. However, the sync code handling a 401/5XX is looking at the response to this "outer" request - throwing an exception at tokenfetch time means there will be no response to examine.

IOW, the sync code as currently written really isn't expecting that an ancillary request for a token as a side-effect of attempting to hit storage, and certainly isn't setup to handle a failure in this request.

So without a significant refactor, I'd be happy with a failure to fetch a token causing, at most, a single further storage request be made without a header, expecting a 401, causing a logout. However, a subsequent attempt to re-fetch a token should honour any retry-after semantics from a 50X request - and as this will be the first token fetch after a logout, this token fetch would *not* be part of a storage request, so the login should gracefully fail and no further requests would be made. It's not ideal, but I expect it will not cause harm even in the case of some server issue causing the 5XX responses from the token server.

And I expect we are already relatively close to this today.

Ryan, does that sound reasonable to you, or do you think we should bit the bullet and schedule a broader refactor here to make this cleaner. (We already need a refactor here for various reasons - Resource vs AsyncResource anyone? :)
Flags: needinfo?(rnewman)
Flags: needinfo?(rfkelly)
Flags: needinfo?(markh)
(Reporter)

Comment 5

a year ago
It sounds like a reasonable short-term compromise, sure :-)

>  a failure to fetch a token causing, at most, a single further storage request
> be made without a header, expecting a 401, causing a logout

I don't recall all of the context of this bug, does "causing a logout" here mean that a 503 from tokenserver will cause your browser to enter the "needs reauth" state?  ISTM it should abort the sync but leave you connected to FxA.
Flags: needinfo?(rfkelly)
(In reply to Ryan Kelly [:rfkelly] from comment #5)
> I don't recall all of the context of this bug, does "causing a logout" here
> mean that a 503 from tokenserver will cause your browser to enter the "needs
> reauth" state?  ISTM it should abort the sync but leave you connected to FxA.

No, it just enters a "not yet logged in" state, which causes us to do a little more work and pre-fetch a token before the sync itself starts (and the result of that seeing a 401 will indeed enter the "needs reauth" state)
(Reporter)

Comment 7

a year ago
In which case, :thumbsup: to fixing lowhanging fruit and deferring the rest to part of a larger refactor
Priority: -- → P3
You need to log in before you can comment on or make changes to this bug.