Closed Bug 988134 Opened 10 years ago Closed 10 years ago

X-Client-State tracking prevents proper node-reassignment

Categories

(Cloud Services Graveyard :: Server: Token, defect)


Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: rfkelly, Assigned: rfkelly)

References

Details

(Whiteboard: [qa+])

Attachments

(2 files)

(I suspect this to be the server-side problem underlying Bug 985145)

The tokenserver database currently has a unique constraint on (email, service, client_state) in order to help enforce the "client-state values must never be re-used" rule.  Unfortunately this will interact badly with node-reassignments.
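To make the constraint concrete, here is a minimal sketch using SQLite as a stand-in for the production database; the table and column names are simplified assumptions, not the real tokenserver DDL:

```python
import sqlite3

# SQLite stand-in for the tokenserver user-assignment table.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE users (
        uid INTEGER PRIMARY KEY,
        email TEXT, service TEXT, client_state TEXT,
        node TEXT, replaced_at INTEGER,
        UNIQUE (email, service, client_state)
    )
""")
conn.execute("INSERT INTO users (email, service, client_state, node) "
             "VALUES ('user@example.com', 'sync-1.5', 'aaaa', 'node1')")

# A second row re-using the same client-state value is rejected outright;
# this is how the index enforces the "never re-use client-state" rule.
reuse_rejected = False
try:
    conn.execute("INSERT INTO users (email, service, client_state, node) "
                 "VALUES ('user@example.com', 'sync-1.5', 'aaaa', 'node2')")
except sqlite3.IntegrityError:
    reuse_rejected = True
```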

There are two ways we could try to node-reassign a user:

  1. Change the value of the "node" column in their current assignment row.
     (Or equivalently, delete their current assignment row).

  2. Mark their existing assignment row as "replaced", so that the next
     time they ask for a token a new assignment row will be created.


Doing (1) means losing the history of where that user's data has been, which means we can't properly clean it up for Bug 971907 and friends.

But doing (2) means that the next time they show up, we must create them a new row with their current (unchanged) value of X-Client-State.  This will fail due to the unique index.
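The bad interaction in (2) can be reproduced with a small SQLite stand-in (table and column names are simplified assumptions): mark the existing row as replaced, then try to create the new assignment row with the user's unchanged client-state.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE users (
        uid INTEGER PRIMARY KEY,
        email TEXT, service TEXT, client_state TEXT,
        node TEXT, replaced_at INTEGER,
        UNIQUE (email, service, client_state)
    )
""")
conn.execute("INSERT INTO users (email, service, client_state, node) "
             "VALUES ('user@example.com', 'sync-1.5', 'aaaa', 'node1')")

# Node-reassignment option (2): mark the existing assignment row as replaced.
conn.execute("UPDATE users SET replaced_at = ? WHERE email = 'user@example.com'",
             (int(time.time()),))

# The client returns with its *unchanged* client-state; creating the new
# assignment row fails because the old (replaced) row still occupies the index.
reassignment_failed = False
try:
    conn.execute("INSERT INTO users (email, service, client_state, node) "
                 "VALUES ('user@example.com', 'sync-1.5', 'aaaa', 'node2')")
except sqlite3.IntegrityError:
    reassignment_failed = True
```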

Bleh.

I think the right thing to do is (2), but without the unique constraint on client_state.  We can still check for old values by slurping in the old rows and doing a lookup, and indeed we already do this in a number of places.  But it may have some interesting implications for concurrency/race-conditions which I need to dig into.
This patch removes the unique index on client-state, and the corresponding places in the code where this might previously have raised an IntegrityError.

This opens up the possibility of two clients racing and creating two different active rows for a single user.  This seems unlikely in practice and I don't think it will actually *break* anything.  Future attempts to read the user's active record will consistently pick whichever row sorts highest.  So the only outcome of the race would be that one client syncs to the wrong node for a while.
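A sketch of why the race is benign (SQLite stand-in, simplified schema): even with two active rows for one user, a deterministic ORDER BY means every reader agrees on the same "winning" row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE users (
        uid INTEGER PRIMARY KEY,
        email TEXT, service TEXT, client_state TEXT,
        node TEXT, created_at INTEGER, replaced_at INTEGER
    )
""")
# Two racing clients each created an active row for the same user.
conn.execute("INSERT INTO users (email, service, client_state, node, created_at) "
             "VALUES ('user@example.com', 'sync-1.5', 'aaaa', 'node1', 100)")
conn.execute("INSERT INTO users (email, service, client_state, node, created_at) "
             "VALUES ('user@example.com', 'sync-1.5', 'aaaa', 'node2', 100)")

# Every reader sorts the same way, so they all select the same active row.
row = conn.execute(
    "SELECT uid, node FROM users "
    "WHERE email = 'user@example.com' AND replaced_at IS NULL "
    "ORDER BY created_at DESC, uid DESC LIMIT 1"
).fetchone()
```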

An alternative is to take a transaction, and check that there's no existing active record before inserting a new one.  At first glance it doesn't seem worth it to me.
Attachment #8396908 - Flags: review?(telliott)
Blocks: 988137
And an Ops/QA note to self: this needs a corresponding DB migration before it can go live.
No longer blocks: 988137
Blocks: 988137
Whiteboard: [qa+]
Comment on attachment 8396908 [details] [diff] [review]
ts-no-unique-index-on-client-state.diff

Review of attachment 8396908 [details] [diff] [review]:
-----------------------------------------------------------------

We could catch this in the get_node query - if 2 nodes come back, we trigger a subprocess that removes one of them.
Attachment #8396908 - Flags: review?(telliott) → review+
> We could catch this in the get_node query - if 2 nodes come back,
> we trigger a subprocess that removes one of them.

This is a very good idea; we iterate through the old records in get_user() already, so it costs basically nothing to check that they're marked as replaced and to update them if not.
Here is a quick follow-up to recover from racy record creation by marking them as replaced at runtime.

Note that _GET_USER_RECORDS will always fetch in a fully consistent order, since it sorts by (created_at, uid).
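A runtime-recovery sketch along these lines (SQLite stand-in; the get_user function and schema here are simplified assumptions, not the actual wimms code): iterate the user's records in (created_at, uid) order, treat the highest-sorting row as active, and mark any other still-active row as replaced on the spot.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE users (
        uid INTEGER PRIMARY KEY,
        email TEXT, service TEXT, client_state TEXT,
        node TEXT, created_at INTEGER, replaced_at INTEGER
    )
""")
# Two active rows for one user: the leftover of a racy double-create.
conn.execute("INSERT INTO users (email, service, client_state, node, created_at) "
             "VALUES ('user@example.com', 'sync-1.5', 'aaaa', 'node1', 100)")
conn.execute("INSERT INTO users (email, service, client_state, node, created_at) "
             "VALUES ('user@example.com', 'sync-1.5', 'aaaa', 'node2', 100)")

def get_user(conn, email, service):
    # Fetch in a fully consistent order, sorting by (created_at, uid),
    # in the spirit of _GET_USER_RECORDS.
    rows = conn.execute(
        "SELECT uid, node, replaced_at FROM users "
        "WHERE email = ? AND service = ? "
        "ORDER BY created_at DESC, uid DESC",
        (email, service)).fetchall()
    active, extras = rows[0], rows[1:]
    # Any older row not yet marked as replaced is a race leftover;
    # repair it in place, since we're iterating the records anyway.
    for uid, node, replaced_at in extras:
        if replaced_at is None:
            conn.execute("UPDATE users SET replaced_at = ? WHERE uid = ?",
                         (int(time.time()), uid))
    return active

winner = get_user(conn, "user@example.com", "sync-1.5")
still_active = conn.execute(
    "SELECT COUNT(*) FROM users WHERE replaced_at IS NULL").fetchone()[0]
```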
Attachment #8397456 - Flags: review?(telliott)
Attachment #8397456 - Flags: review?(telliott) → review+
https://github.com/mozilla-services/wimms/commit/ee59b851369dd873a15eb0e35a27874164683baa
https://github.com/mozilla-services/wimms/commit/e1dada719b6e0b4d078bfaffe871fc6964b572ce

Resolving fixed, but let's not verify this until we get it rolled to prod with appropriate schema changes.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Blocks: 988643
Actually, we need to verify this in Stage as part of the fix for bug 985145.
I am wondering if we can add this to the Verifier deployment and test: bug 996763
Just to catch two important changes in one bit of work?...
This is now in prod, I think we can call it verified.
OK.
Status: RESOLVED → VERIFIED
Product: Cloud Services → Cloud Services Graveyard
