Closed Bug 984297 Opened 10 years ago Closed 10 years ago

Add tokenserver admin script to purge old user records

Categories

(Cloud Services Graveyard :: Server: Token, defect, P2)

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: rfkelly, Assigned: rfkelly)

References

Details

(Whiteboard: [qa+])

Attachments

(1 file, 2 obsolete files)

Attached patch ts-purge-old-records.diff (obsolete) — Splinter Review
The tokenserver database keeps a history of "user" records for each email address that it has seen, to help prevent clients from reverting to a previous state.  We need to periodically clean up this history to prevent it from growing without bound.

I'm attaching a simple script to do this, based on the approach taken for purging expired items in server-syncstorage.  This script can be run in the background to periodically find old user records, clean up the associated service data, and delete them.

The tricky thing here is that it cleans up old data stored on the syncstorage nodes, by "forging" an access token and doing a DELETE /storage request on behalf of the no-longer-around-to-do-it-themselves user.  This makes things a bit slower and more complex than a simple "delete from users where replaced_at is not null" but it seems like a sensible thing to do.
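
Roughly, the forging step looks like this as a standalone sketch (the endpoint pattern and the HawkAuth wiring are illustrative here; the actual script builds them from the service config and node-secrets file):

    import hawkauthlib
    import requests
    import tokenlib

    class HawkAuth(requests.auth.AuthBase):
        # Hawk-sign outgoing requests using the forged token id and key.
        def __init__(self, token, secret):
            self.token = token
            self.secret = secret

        def __call__(self, req):
            hawkauthlib.sign_request(req, self.token, self.secret)
            return req

    def delete_service_data(uid, node, node_secret, pattern="{node}/1.5/{uid}"):
        # Forge an access token for the replaced user, signed with the storage
        # node's current secret, then DELETE their data on that node.
        token = tokenlib.make_token({"uid": uid, "node": node}, secret=node_secret)
        secret = tokenlib.get_derived_secret(token, secret=node_secret)
        endpoint = pattern.format(uid=uid, node=node)
        resp = requests.delete(endpoint, auth=HawkAuth(token, secret))
        resp.raise_for_status()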

(Plus we want such functionality anyway for Bug 971907)

I need to do some more testing of this before asking for r?, but any feedback at this stage is appreciated.
Attachment #8392098 - Flags: feedback?(telliott)
Attachment #8392098 - Flags: feedback?(bwong)
Blocks: 971907
If you play this out you're getting into SWF (Simple Workflow) functionality. SWF is much more complicated than SQS, but it was designed exactly for managing jobs like this cleanup. I like it over something like a cron job since we have the DELETE on the remote syncstorage nodes, which could have a high chance of failure.

Though since we don't have anything in place right now, I'm OK with a cronjob that runs on the tokenserver admin box. I'd like lots of logging so we can grep to see if we need to build out the infrastructure more.
Thanks Ben, I'll dig into SWF a little more for potential future updates.

Also note that the script is currently designed to run as a persistent background service by default, rather than a cronjob.  IIRC this is how the equivalent functionality is currently run on the syncstorage nodes.
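
Concretely, the default mode is roughly this (run_purge_service is just a made-up name for the wrapper around the real entry point):

    import time

    def run_purge_service(config_file, purge_interval=3600, oneshot=False):
        # Do one purge pass, then either exit (--oneshot) or sleep for
        # --purge-interval seconds and go again, so the script can sit in the
        # background indefinitely.
        while True:
            purge_old_records(config_file)
            if oneshot:
                break
            time.sleep(purge_interval)
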
Whiteboard: [qa+]
Comment on attachment 8392098 [details] [diff] [review]
ts-purge-old-records.diff

clearing :mostlygeek feedback flag
Attachment #8392098 - Flags: feedback?(bwong) → feedback+
Attachment #8392098 - Flags: feedback+
> SWF is much more complicated than SQS

By the way, this is now my front-runner for the Understatement Of The Year award...
The year is still young.
Depends on: 984731
Attached patch ts-purge-old-records.diff (obsolete) — Splinter Review
Here's an updated version that includes a testcase for proper functioning of the script.  A clunky testcase, but a testcase nonetheless!
Attachment #8392098 - Attachment is obsolete: true
Attachment #8392098 - Flags: feedback?(telliott)
Attachment #8392681 - Flags: review?(telliott)
Slight tweaks to the patch now that Bug 984731 and Bug 984232 have landed.
Attachment #8392681 - Attachment is obsolete: true
Attachment #8392681 - Flags: review?(telliott)
Attachment #8393280 - Flags: review?(telliott)
Comment on attachment 8393280 [details] [diff] [review]
ts-purge-old-records.diff

Review of attachment 8393280 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good. Some notes below for enhancements or potential adjustments.

::: tokenserver/scripts/purge_old_records.py
@@ +31,5 @@
> +
> +logger = logging.getLogger("tokenserver.scripts.purge_old_records")
> +
> +
> +def purge_old_records(config_file, grace_period=-1, max_per_loop=100):

Is the purpose of capping the loop simply to keep down the max time an iteration of this process might take? I feel like 100 is still pretty high for that, and since it's a discrete process for each record individually, it makes no difference to the external service how many records there are in this DB. I suspect you want a number closer to 10.

@@ +57,5 @@
> +                rows = list(backend.get_old_user_records(service, **kwds))
> +                for row in rows:
> +                    logger.debug("Purging uid %s on %s", row.uid, row.node)
> +                    delete_service_data(config, service, row.uid, row.node)
> +                    backend.delete_user_record(service, row.uid)

I think we need to take some protective steps here. If something gets a little stuck (or we get the script run twice in succession), we could get into a situation where we're doing multiple delete calls on the same account. That's not the end of the world (it'll eventually kill all the data, so there's no side-effect danger), but it may cause a bit of db contention. It wouldn't completely solve it, but maybe as a precaution before starting on a row we give it a higher timestamp?
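
For illustration, roughly what I have in mind (bump_replaced_at is a hypothetical backend method, not something in the patch):

    import time

    rows = list(backend.get_old_user_records(service, **kwds))
    for row in rows:
        # Refresh replaced_at before touching the row, so a second copy of the
        # script won't select it again until the grace period has passed.
        backend.bump_replaced_at(service, row.uid, int(time.time() * 1000))
        delete_service_data(config, service, row.uid, row.node)
        backend.delete_user_record(service, row.uid)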

@@ +87,5 @@
> +    user = {"uid": uid, "node": node}
> +    token = tokenlib.make_token(user, secret=node_secrets[-1])
> +    secret = tokenlib.get_derived_secret(token, secret=node_secrets[-1])
> +    endpoint = pattern.format(uid=uid, service=service, node=node)
> +    resp = requests.delete(endpoint, auth=HawkAuth(token, secret))

What happens if we time out here?

@@ +116,5 @@
> +    parser.add_option("", "--purge-interval", type="int", default=3600,
> +                      help="Interval to sleep between purging runs")
> +    parser.add_option("", "--grace-period", type="int", default=86400,
> +                      help="Number of seconds grace to allow on replacement")
> +    parser.add_option("", "--max-per-loop", type="int", default=1000,

1000 default here? that also seems excessive :)
Attachment #8393280 - Flags: review?(telliott) → review+
> Is the purpose of capping the loop simply to keep down the max time an iteration of
> this process might take?

Broadly yes, although it's less about the time taken (nothing is waiting for this process, and it's not holding any locks) and more about having to hold the unprocessed items in memory.  10 seems sensible.
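
For the record, the batching looks roughly like this (load_from_config is a stand-in for the config plumbing, and the limit kwarg name is an assumption):

    def purge_old_records(config_file, grace_period=-1, max_per_loop=10):
        # Pull at most max_per_loop old rows per query so we never hold a huge
        # result set in memory, and keep looping until a pass comes back empty.
        config, backend, service = load_from_config(config_file)  # stand-in helper
        kwds = {"grace_period": grace_period, "limit": max_per_loop}
        while True:
            rows = list(backend.get_old_user_records(service, **kwds))
            if not rows:
                break
            for row in rows:
                delete_service_data(config, service, row.uid, row.node)
                backend.delete_user_record(service, row.uid)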

> I think we need to take some protective steps here. 

Right.  I guess that comes back to Benson's point about slowly iterating towards a full-blown workflow-based system.  If the job fails we should "put it at the back of the queue" and move on to the next one.  Except we don't have an explicit queue.

> What happens if we timeout here?

It'll log the error, and if you're not using --oneshot, it'll sleep for a while and then try again.  No progress will be made until that record is successfully handled, due to the above point.

I will also add an explicit timeout value to that request for completeness.
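
Something like the following, say (the 30-second value is just a placeholder):

    try:
        resp = requests.delete(endpoint, auth=HawkAuth(token, secret), timeout=30)
        resp.raise_for_status()
    except requests.exceptions.RequestException:
        # Log it and leave the record in place; the next pass (or the next
        # sleep/retry cycle when not running with --oneshot) will try again.
        logger.exception("Error purging data for uid %s on node %s", uid, node)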

> 1000 default here? that also seems excessive :)

Hah, indeed!  I copied this boilerplate from the sync ttl-purge script where that's a much more reasonable value.  Good catch.
Blocks: 985794
Committed with above-mentioned tweaks:
https://github.com/mozilla-services/tokenserver/commit/e68a6055527f97d69a7ceb2ea4abf2bb57a34012

I filed Bug 985794 as a follow-up for more robust handling of downstream service errors.  What's in this script will work, but it allows a single troublesome record to block all progress until it resolves.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Priority: -- → P2
All good here.
Status: RESOLVED → VERIFIED
Product: Cloud Services → Cloud Services Graveyard