Bug 1231531 (Closed) - Opened 10 years ago, Closed 10 years ago

Backfill Push Endpoint server Redshift derived stream

Categories

(Cloud Services :: Operations: Metrics/Monitoring, task, P1)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: rmiller, Assigned: relud)

References

Details

One of the consequences of bug 1231477 is that the current set of data in the production Push Redshift cluster is full of duplicate entries. All of the tables in the existing database should be dropped, and the database should be backfilled with at least 30 days of data. To make this easier, I've updated the push_load Hindsight setup so it can be configured to load multiple historical days of data. The following steps should get us where we want to be (a shell sketch of the whole sequence follows this list):

* Drop all existing db tables.
* Update the push_derived code to the latest revision on master (i.e. https://github.com/rafrombrc/push_derived, SHA b5f99641ca866969b84a6758845cd7d5225b62b3).
* Edit the `push_log_in.cfg` file to add `num_days = 30` and `instruction_limit = 1000000 * 350`.
* Run `hindsight_cli hindsight/etc/hindsight.cfg` as normal. This will likely take about 15-20 minutes.
* When it's done, delete the contents of the hs_output directory and remove the `num_days` and `instruction_limit` settings that were added to `push_log_in.cfg`.
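A minimal shell sketch of the sequence above. The connection string variable `$REDSHIFT_URL`, the table name in step 1, and the file paths are all assumptions, not part of this bug; substitute the real cluster endpoint, table names, and locations.

```sh
# 1. Drop the existing tables ("push_messages" is a hypothetical placeholder).
psql "$REDSHIFT_URL" -c 'DROP TABLE IF EXISTS push_messages;'   # repeat for each table

# 2. Move push_derived to the revision named above.
cd push_derived
git fetch origin
git checkout b5f99641ca866969b84a6758845cd7d5225b62b3

# 3. Append the backfill settings to push_log_in.cfg (Hindsight cfg files
#    use Lua syntax, so the arithmetic expression is evaluated as written).
cat >> push_log_in.cfg <<'EOF'
num_days = 30
instruction_limit = 1000000 * 350
EOF

# 4. Run Hindsight as normal; expect roughly 15-20 minutes.
hindsight_cli hindsight/etc/hindsight.cfg

# 5. Clean up: clear the output directory.
rm -rf hs_output/*
```

The cfg additions then have to be reverted by hand (or with `git checkout -- push_log_in.cfg`, if the file happens to be tracked) so the next normal run doesn't re-load 30 days of history.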
Just pushed a new change to the push_derived code that helps identify duplicate records, which I'd like to use when performing the backfill (a checkout sketch follows). All of the instructions above are still accurate, but you'll want to use SHA af1b835c2658b6dec933888a6120b7054822410f instead. Thanks!
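Switching to the newer revision is a standard git checkout of that SHA; a sketch, assuming the repo was already cloned as in the steps above:

```sh
cd push_derived
git fetch origin
git checkout af1b835c2658b6dec933888a6120b7054822410f
```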
Blocks: 1232848
I've run the backfill with 13a7ca0 and saw no unusual errors.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED