Closed Bug 996751 Opened 12 years ago Closed 11 years ago

Run the tokenserver purge_old_records script from the tokenserver webheads

Categories

(Cloud Services :: Operations: Miscellaneous, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: rfkelly, Unassigned)

References

Details

(Whiteboard: [qa+])

(Spinning this off from Bug 971907 Comment 37)

The tokenserver provides a "purge_old_records" script that searches for old records in the tokenserver database, deletes the corresponding data from the sync node via an HTTP DELETE request, then removes the old record from the tokenserver db. This is necessary both for general garbage-reduction purposes and to meet the timely-deletion-of-user-data requirements from Bug 971907.

Please arrange for it to be run as a background process, to ensure that old records are cleaned up in a timely manner. The necessary command is:

    ./bin/python -m tokenserver.scripts.purge_old_records -v ./path/to/tokenserver/config.ini

It will run in a loop, periodically querying the db and doing the necessary cleanup, and logging to stderr.
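For reference, one way to keep that command running in the background on a webhead might look like the following. This is only an illustrative sketch; the log file path and the use of nohup (rather than whatever supervision puppet sets up) are assumptions, not part of the script itself.

```shell
# Hypothetical background invocation: run the purge loop detached,
# capturing its stderr logging to a file (log path is an assumption).
nohup ./bin/python -m tokenserver.scripts.purge_old_records -v \
    ./path/to/tokenserver/config.ini \
    >> /var/log/tokenserver/purge_old_records.log 2>&1 &
```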
Do you mind if we run this from the tokenserver itself?
Whiteboard: [qa+]
> Do you mind if we run this from the tokenserver itself?

I don't mind where it runs from. The only trick with running it on the webheads is preventing several webheads from running it concurrently. This won't *break* anything, but it's wasteful.
This is the process that needs to connect to individual sync nodes to delete data?
> This is the process that needs to connect to individual sync nodes to delete data?

Correct.
Not in Stage yet. Is this a prerequisite for bug 996915? (noting that 1.2.0 is already out in Stage and Prod)
Status: NEW → ASSIGNED
> Is this a prerequisite for bug 996915?

No, it's not a prereq for any particular deployment, but we do need to figure it out in time for FF29 to satisfy Bug 971907.
Well, we have less than a week for Fx29 in Beta. Is this a hard blocker or a soft blocker for Fx29?
The node_manager would be a nice place to put it, since it will only be a single server and we wouldn't have to do any coordination between nodes. However, it depends on tokenserver's configuration/secrets; though it already has a copy of these, it'd be more things to keep in sync.

If we did it on the tokenserver, which is a cluster of machines, we would need some sort of semaphore or synchronization system so that only one process runs at a time. We don't want to go full SWF (Simple Workflow). I think an SQS queue would be enough: we could have a wrapper script that receives a message from the queue, runs the script, deletes the old message, and then puts a new message on the queue with a timeout for running the job again. It's a lot more complicated than a cronjob, a lot less complicated than SWF, and less technical debt than keeping duplicate values in the node_manager. I think this would be my preferred solution.
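The SQS wrapper idea above could be sketched roughly as follows. The queue client is injected here so the scheduling logic can be shown without real AWS credentials; with boto3 the corresponding calls would be receive_message / delete_message / send_message with DelaySeconds. All names, the FakeQueue class, and the daily delay are hypothetical illustrations, not tokenserver code.

```python
RESCHEDULE_DELAY = 24 * 60 * 60  # re-run the purge roughly daily (assumption)

def run_purge_cycle(queue, run_purge_script):
    """Receive one token message, run the purge job, then re-queue it.

    Exactly one "run the purge" token circulates in the queue, so only
    one webhead runs the job per cycle.
    """
    msg = queue.receive_message()
    if msg is None:
        return False  # another webhead holds the token right now
    try:
        run_purge_script()
    finally:
        # Delete the consumed message and schedule the next run.
        queue.delete_message(msg)
        queue.send_message("run-purge", delay_seconds=RESCHEDULE_DELAY)
    return True

class FakeQueue:
    """Minimal in-memory stand-in for an SQS queue (illustration only)."""
    def __init__(self, messages=None):
        self.messages = list(messages or [])
        self.sent = []
    def receive_message(self):
        return self.messages.pop(0) if self.messages else None
    def delete_message(self, msg):
        pass
    def send_message(self, body, delay_seconds=0):
        self.sent.append((body, delay_seconds))
```

With a real queue, the SQS visibility timeout would cover the window where a message has been received but not yet deleted, which is exactly where the loss/duplication concerns raised below come in.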
> If we did it on the tokenserver, which is a cluster of machines, we would
> need some sort of semaphore or synchronization system so only one process is happening at a time

We could use the database for this, via "SELECT ... FOR UPDATE" or similar.

> I think an SQS queue would be enough. We could have a wrapper script that receives a message
> from the queue, runs the script, deletes the old message and then puts a new message on the
> queue with a timeout for running the job again.

There are still windows for job loss or duplication here; I think a db-based approach may be better. Will have to dig in in more detail to see for sure.

I propose the following:

* We run this on the webheads as a persistent background process, not via cron. The process will do a run, sleep for a bit, do a run, sleep for a bit, repeat.
* Initially, we use a randomized sleep interval so that webheads are unlikely to be running the process at the same time. Even if they do happen to overlap by chance, this won't break anything; it will just result in some useless duplicate work.
* For a future release, we move to a more nuanced scheduling system, the details of which are yet to be determined.

Thoughts?
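The randomized-sleep proposal above can be sketched like this. The interval bounds and function names are illustrative assumptions; the point is only that each webhead picks its own random sleep, so concurrent runs become unlikely without any coordination.

```python
import random
import time

MIN_SLEEP = 30 * 60   # 30 minutes (assumed bound, for illustration)
MAX_SLEEP = 120 * 60  # 2 hours (assumed bound, for illustration)

def next_sleep_interval(rng=random.random):
    """Pick a random sleep so webheads rarely purge at the same time."""
    return MIN_SLEEP + (MAX_SLEEP - MIN_SLEEP) * rng()

def purge_loop(do_purge, sleep=time.sleep, cycles=None):
    """Run the purge, sleep a randomized interval, repeat.

    An overlap with another webhead is harmless here: it only means
    some useless duplicate work, never incorrect behavior.
    """
    n = 0
    while cycles is None or n < cycles:
        do_purge()
        sleep(next_sleep_interval())
        n += 1
```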
:rfkelly, I agree. Having it run on multiple machines as a background process at randomized times is a good solution. Moving later to a scheme where they coordinate (maybe through the DB, since they're already talking to it, and these are infrequent, roughly daily coordination events) sounds good.
Summary: Periodically run the tokenserver purge_old_records script from the node-admin server → Run the tokenserver purge_old_records script from the tokenserver webheads
OK, here's what we need to do, in order:

* Get 1.2.1 deployed, which we need for db migrations but doesn't include this code change (Bug 996915)
* Update the config to run purge_old_records (https://github.com/mozilla-services/puppet-config/pull/396)
* Do another deployment with this config change in place
In the interests of getting this out before FF29 release, I've tweaked the upcoming deployments to include it *before* the db migration stuff. So the new plan is:

* Update the config to run purge_old_records (https://github.com/mozilla-services/puppet-config/pull/396)
* Get 1.2.3 deployed (Bug 996915)
* Follow up with a second deployment to complete the db migrations, which we can do after FF29
Built a new stack in Stage with this running. It seems to work, judging by the logs. It would be good if the log entries had a timestamp. Maybe in 1.2.4 ;)
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
I can verify this once bug 1001305 is Verified.
Which, of course, is already out there and running in Prod...
Status: RESOLVED → VERIFIED