Closed
Bug 984297
Opened 11 years ago
Closed 11 years ago
Add tokenserver admin script to purge old user records
Categories
(Cloud Services Graveyard :: Server: Token, defect, P2)
Tracking
(Not tracked)
VERIFIED
FIXED
People
(Reporter: rfkelly, Assigned: rfkelly)
References
Details
(Whiteboard: [qa+])
Attachments
(1 file, 2 obsolete files)
ts-purge-old-records.diff (15.00 KB, patch) - telliott: review+
The tokenserver database keeps a history of "user" records for each email address that it has seen, to help prevent clients from reverting to a previous state. We need to periodically clean up this history to prevent it from growing without bound.
I'm attaching a simple script to do this, based on the approach taken for purging expired items in server-syncstorage. This script can be run in the background to periodically find old user records, clean up the associated service data, and delete them.
The tricky thing here is that it cleans up old data stored on the syncstorage nodes, by "forging" an access token and doing a DELETE /storage request on behalf of the no-longer-around-to-do-it-themselves user. This makes things a bit slower and more complex than a simple "delete from users where replaced_at is not null" but it seems like a sensible thing to do.
(Plus we want such functionality anyway for Bug 971907)
I need to do some more testing of this before asking for r?, but any feedback at this stage is appreciated.
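For context, a minimal sketch of that forging step, assuming the names used in the patch (tokenlib plus a HawkAuth requests auth helper); the import path for HawkAuth and the endpoint pattern are assumptions, not the actual code:

import requests
import tokenlib
from tokenserver.util import HawkAuth  # assumed import path for the Hawk auth helper


def delete_service_data(uid, node, service, node_secret,
                        pattern="{node}/1.5/{uid}"):
    """Issue a DELETE /storage request on behalf of a replaced user."""
    user = {"uid": uid, "node": node}
    # Forge an access token exactly as the tokenserver would have issued it,
    # signed with the most recent secret for that node.
    token = tokenlib.make_token(user, secret=node_secret)
    secret = tokenlib.get_derived_secret(token, secret=node_secret)
    endpoint = pattern.format(uid=uid, service=service, node=node)
    resp = requests.delete(endpoint, auth=HawkAuth(token, secret))
    resp.raise_for_status()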
Attachment #8392098 - Flags: feedback?(telliott)
Attachment #8392098 - Flags: feedback?(bwong)
Comment 1•11 years ago
If you play this out, you're getting into SWF (Simple Workflow) functionality. SWF is much more complicated than SQS, but it was designed exactly for managing jobs like this cleanup. I like it over something like a cron job since we have the DELETE on the remote syncstorage nodes, which could have a high chance of failure.
Though since we don't have anything in place right now, I'm OK with a cronjob that runs on the tokenserver admin box. I'd like lots of logging so we can grep to see if we need to build out the infrastructure more.
Assignee
Comment 2•11 years ago
Thanks Ben, I'll dig into SWF a little more for potential future updates.
Also note that the script is currently designed to run as a persistent background service by default, rather than a cronjob. IIRC this is how the equivalent functionality is currently run on the syncstorage nodes.
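Roughly, the service mode is a loop of this shape; the function below is a sketch under that assumption, not the script's actual code, and purge_once stands in for a single purging pass:

import time


def run_as_service(purge_once, purge_interval=3600, oneshot=False):
    """Run the purge as a persistent background service.

    purge_once is whatever callable performs a single purging pass;
    purge_interval matches the --purge-interval default in the patch.
    """
    while True:
        purge_once()
        if oneshot:
            # A cron-style deployment would instead run a single pass on a schedule.
            break
        time.sleep(purge_interval)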
Updated•11 years ago
Whiteboard: [qa+]
Assignee
Comment 3•11 years ago
Comment on attachment 8392098 [details] [diff] [review]
ts-purge-old-records.diff
clearing :mostlygeek feedback flag
Attachment #8392098 - Flags: feedback?(bwong) → feedback+
Assignee
Updated•11 years ago
Attachment #8392098 - Flags: feedback+
Assignee
Comment 4•11 years ago
> SWF is much more complicated than SQS
By the way, this is now my front-runner for the Understatement Of The Year award...
Comment 5•11 years ago
The year is still young.
Assignee
Comment 6•11 years ago
Here's an updated version that includes a testcase for proper functioning of the script. A clunky testcase, but a testcase nonetheless!
Attachment #8392098 - Attachment is obsolete: true
Attachment #8392098 - Flags: feedback?(telliott)
Attachment #8392681 - Flags: review?(telliott)
Assignee
Comment 7•11 years ago
Slight tweaks to the patch now that Bug 984731 and Bug 984232 have landed.
Attachment #8392681 - Attachment is obsolete: true
Attachment #8392681 - Flags: review?(telliott)
Attachment #8393280 - Flags: review?(telliott)
Comment 8•11 years ago
Comment on attachment 8393280 [details] [diff] [review]
ts-purge-old-records.diff
Review of attachment 8393280 [details] [diff] [review]:
-----------------------------------------------------------------
Looks good. Some notes below for enhancements or potential adjustments.
::: tokenserver/scripts/purge_old_records.py
@@ +31,5 @@
> +
> +logger = logging.getLogger("tokenserver.scripts.purge_old_records")
> +
> +
> +def purge_old_records(config_file, grace_period=-1, max_per_loop=100):
Is the purpose of capping the loop simply to keep down the max time an iteration of this process might take? I feel like 100 is still pretty high for that, and since each record is processed individually, it makes no difference to the external service how many records there are in this DB. I suspect you want a number closer to 10.
@@ +57,5 @@
> + rows = list(backend.get_old_user_records(service, **kwds))
> + for row in rows:
> + logger.debug("Purging uid %s on %s", row.uid, row.node)
> + delete_service_data(config, service, row.uid, row.node)
> + backend.delete_user_record(service, row.uid)
I think we need to take some protective steps here. If something gets a little stuck (or we run the script twice in succession), we could get into a situation where we're doing multiple delete calls on the same account. That's not the end of the world (it'll eventually kill all the data, so there's no side-effect danger), but it may cause a bit of db contention. It wouldn't completely solve it, but maybe as a precaution, before starting on a row, we give it a higher timestamp?
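One possible shape for that precaution (the backend method name here is hypothetical, not an existing API):

# Hypothetical sketch: bump the row's replaced_at before working on it, so a
# concurrent or re-run instance of the script won't pick up the same row again.
for row in rows:
    backend.bump_replaced_at(service, row.uid)  # hypothetical method name
    delete_service_data(config, service, row.uid, row.node)
    backend.delete_user_record(service, row.uid)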
@@ +87,5 @@
> + user = {"uid": uid, "node": node}
> + token = tokenlib.make_token(user, secret=node_secrets[-1])
> + secret = tokenlib.get_derived_secret(token, secret=node_secrets[-1])
> + endpoint = pattern.format(uid=uid, service=service, node=node)
> + resp = requests.delete(endpoint, auth=HawkAuth(token, secret))
What happens if we timeout here?
@@ +116,5 @@
> + parser.add_option("", "--purge-interval", type="int", default=3600,
> + help="Interval to sleep between purging runs")
> + parser.add_option("", "--grace-period", type="int", default=86400,
> + help="Number of seconds grace to allow on replacement")
> + parser.add_option("", "--max-per-loop", type="int", default=1000,
A default of 1000 here? That also seems excessive :)
Attachment #8393280 - Flags: review?(telliott) → review+
Assignee
Comment 9•11 years ago
> Is the purpose of capping the loop simply to keep down the max time an iteration of
> this process might take?
Broadly yes, although it's less about the time taken (nothing is waiting for this process, and it's not holding any locks) and more about having to hold the unprocessed items in memory. 10 seems sensible.
> I think we need to take some protective steps here.
Right. I guess that comes back to Benson's point about slowly iterating toward a full-blown workflow-based system. If the job fails, we should "put it at the back of the queue" and move on to the next one. Except we don't have an explicit queue.
> What happens if we timeout here?
It'll log the error, and if you're not using --oneshot, it'll sleep for a while and then try again. No progress will be made until that record is successfully handled, due to the above point.
I will also add an explicit timeout value to that request, for completeness.
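For example, something along these lines; the helper name and the 30-second value are illustrative, not what will land:

import logging
import requests

logger = logging.getLogger("tokenserver.scripts.purge_old_records")


def delete_with_timeout(endpoint, auth, timeout=30):
    """DELETE the user's storage data with an explicit request timeout."""
    try:
        resp = requests.delete(endpoint, auth=auth, timeout=timeout)
        resp.raise_for_status()
    except requests.exceptions.RequestException:
        # Covers timeouts as well as connection and HTTP errors; the user
        # record stays in place so the next purge pass retries it.
        logger.exception("Failed to delete data at %s", endpoint)
        return False
    return True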
> 1000 default here? that also seems excessive :)
Hah, indeed! I copied this boilerplate from the sync ttl-purge script where that's a much more reasonable value. Good catch.
Assignee
Comment 10•11 years ago
Committed with above-mentioned tweaks:
https://github.com/mozilla-services/tokenserver/commit/e68a6055527f97d69a7ceb2ea4abf2bb57a34012
I filed Bug 985794 as a follow-up for more robust handling of downstream service errors. What's in this script will work, but it allows a single troublesome record to block all progress until it resolves.
Assignee
Updated•11 years ago
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Updated•11 years ago
Priority: -- → P2
Updated•2 years ago
Product: Cloud Services → Cloud Services Graveyard