Closed Bug 985784 Opened 10 years ago Closed 10 years ago

Prevent/mitigate storage of bad client-state values sent by clients

Categories

(Cloud Services Graveyard :: Server: Token, defect, P2)

defect

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: rfkelly, Assigned: rfkelly)

References

Details

(Whiteboard: [qa+])

Attachments

(3 files)

We've seen several bugs where clients send a badly-computed x-client-state hash to the tokenserver.  In theory this should never happen, and the server is correctly surfacing invalid-client-state errors in these cases.  But as seen in e.g. Bug 985504, if the tokenserver records a badly-computed x-client-state as the current client state of your account, you get locked out on all of your devices.  The only way to dig the account out of this hole is a password reset.

Ideally we will do two things: make it harder for a bad client-state value to be recorded into the tokenserver database, and make it easier for clients to recover if their client-state is busted for whatever reason.

Ideas for prevention:

  * Do more extensive server-side validation of the x-client-state value.  Currently the tokenserver will accept any appropriate length-and-charset-limited string and default it to the empty string if missing.  We could instead make it compulsory and sanity-check that it's hex-encoded bytes of the proper length.

  * Have the server check correlation between cert generation and client state.  If everything is working properly, each change in client-state should be accompanied by an increase in the cert version number.  The server could error out if that's not the case.  It could also accept previously-seen client-state values if they come with an increased generation number.

Both of these changes mean that we give an immediate invalid-client-state error to a buggy client, rather than accepting their buggy write and then giving invalid-client-state errors to everyone else.

They also encode more sync-specific semantics into the tokenserver, making the client-state less "some opaque protocol-specific state identifier" and more "this is a hash of my password-derived sync key".


Ideas for mitigation:

  * Provide an explicit "reset my account" API on the tokenserver.  If a client is damn sure that their kB is correct and the server-side state is wrong, then they could issue some sort of override command to the tokenserver.  This might be a "DELETE /1.0/sync/1.5" request or it might be an "X-Client-State-Override" header or similar.  It's ugly but not as ugly having to force a password reset event in order to generate a fresh kB.

  * Use lastAuthAt timestamps to override conflicting client state values.  The FxA auth server includes a lastAuthAt timestamp in its certificates, giving the time at which the user last successfully entered their password.  The tokenserver could use this to resolve conflicting client-state values in a similar manner to generation numbers - the client-state with the highest lastAuthAt is assumed to be correct.

This won't help against "actively malicious" clients that e.g. just do the wrong hash calculation on kB.  But it would make it easier for clients who have some transient buggy calculation (e.g. accidentally hashing the empty string) to dig themselves out of the hole.
More generally, we can sort FxA certificates by the tuple (generation, lastAuthAt, iat) and assume that the client with the newest certificate has the correct current client-state value.
(In reply to Ryan Kelly [:rfkelly] from comment #0)
> It could also accept
> previously-seen client-state values if they come with an increased
> generation number.

This makes sense: one of the purposes of X-C-S is to ensure that only clients with the same kB reach the same server.
Here's a tiny mitigation that will prevent the current bug in the wild - if we've ever seen a non-empty client-state string, don't allow clients to revert to an empty one.  This should be completely safe to push out to prod to avoid more users getting stuck in the bad state of Bug 985504, and we can think a little more deeply about the semantic changes proposed above.
Attachment #8393951 - Flags: review?(telliott)
> It could also accept previously-seen client-state values if
> they come with an increased generation number.

I like this idea less on reflection.  Bug-free clients should *never* revert to a previous value of kB, so we lose an opportunity to detect legitimate bugs (like the one Gavin found where cycling through passwords produced a corresponding cycle in kB values).
(In reply to Ryan Kelly [:rfkelly] from comment #4)
> Bug-free clients should *never* revert
> to a previous value of kB, so we lose an opportunity to detect legitimate
> bugs (like the one Gavin found where cycling through passwords produced a
> corresponding cycle in kB values).

Unless you consider it not a "reversion", but a continuation. Imagine a correct client A, and a broken client B:

A: 1234 1
B: 1234 1
A: 1234 1
B: 0000 1   < invalid value, but new, so wins
A: 1234 1   < fails
A: gets new cert with new generation number or later lastAuthAt
A: 1234 2   < succeeds

kB didn't change, so X-C-S didn't change, and we needed to accept the old one in order to proceed.
True, but I'm pretty sure this would have made Gavin's bug invisible.  Each password change/reset would have bumped the generation number and caused the X-C-S value to be accepted despite the fact that it was cycling around.

Your scenario above would be detected by the "X-C-S change must be accompanied by generation change" rule.
(In reply to Ryan Kelly [:rfkelly] from comment #6)

> Your scenario above would be detected by the "X-C-S change must be
> accompanied by generation change" rule.

Ah, I missed that. Then yeah, I agree with you.
Attachment #8393951 - Flags: review?(telliott) → review+
(In reply to Ryan Kelly [:rfkelly] from comment #0)
> the client-state with the highest
> lastAuthAt is assumed to be correct.

Can clients mess with this value? This seems like the most promising direction, as it has direct correlation with validity.

I'd be a little concerned if the value is external to the state calculation and could be easily manipulated.
> Can clients mess with this value?

No, it's covered by the signature from the fxa-auth-server, same as the generation number.
Priority: -- → P2
Whiteboard: [qa+]
> We could instead make it compulsory and sanity-check that it's hex-encoded bytes of the proper length.

yes

> each change in client-state should be accompanied by an increase in the cert version number

yes

> It could also accept previously-seen client-state values if they come with an increased generation number

no, that should never happen kB, but I guess it could happen if ever allow users to manage their own keys.
Depends on: 987506
Here's the next mitigation: make client-state change require a generation-number change.

I think this is apprximately as far as we should go, and we should not implement the ability to revert to previous client-state strings under any circumstances.  If a client finds itself in a state where its known-to-be-valid client-state string is being rejected, it means that someone screwed up in handling the encryption keys.  The safest thing for it to do is change kB and try again.
Attachment #8396140 - Flags: review?(telliott)
And for good measure, here's an update to the tokenserver API docs that explains the rules for updating X-Client-State.
Attachment #8396143 - Flags: review?(telliott)
Comment on attachment 8396143 [details] [diff] [review]
docs-ts-client-state-change-rules.diff

Review of attachment 8396143 [details] [diff] [review]:
-----------------------------------------------------------------

::: source/token/apis.rst
@@ +72,5 @@
> +    accessing :ref:`server_syncstorage_api_15` would include a hex-encoded
> +    hash of the encryption key in this header, since a change in the encryption
> +    key will make any existing data unreadable.
> +
> +    Updated values of the  **X-Client-State** will be rejected with an error

random extra space
Attachment #8396143 - Flags: review?(telliott) → review+
Attachment #8396140 - Flags: review?(telliott) → review+
Committed in:

https://github.com/mozilla-services/docs/commit/c64335739ed354236e532568fe9f220e835da8e7
https://github.com/mozilla-services/tokenserver/commit/b9e58fb1353f351a27579b4924d6568ed700e98e

Nick and Richard, do you have further changes you want to hang off this bug?
Status: NEW → ASSIGNED
Flags: needinfo?(nalexander)
Nothing in the short term.
OK, I'm going to close this out, feel free to re-open if we need to do more here.
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Flags: needinfo?(nalexander)
Resolution: --- → FIXED
Verified in code.
Verified this here: https://docs.services.mozilla.com/token/apis.html
But is this something we need to specifically deploy and test in TS Stage?
I will Verify for now.
Status: RESOLVED → VERIFIED
Product: Cloud Services → Cloud Services Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: