Closed Bug 1102228 Opened 11 years ago Closed 11 years ago

Improve the data cycling routine to divide the target dataset in chunks

Categories

(Tree Management :: Treeherder, defect, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mdoglio, Assigned: mdoglio)

Attachments

(1 file)

We need to improve the data cycling routine so that it can delete one month of data. At the moment it doesn't partition the target dataset. If we ran it now, it would try to delete 33000 x 30 x 12 rows (jobs per day x number of days x average number of artifacts per job, roughly 11.9 million) from the job_artifact table, and it would do so with a single query whose IN filter contains 33000 x 30 (990,000) IDs.
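To make the numbers above concrete, here is a minimal sketch of the arithmetic together with a generic helper that splits a list of IDs into fixed-size chunks. The helper name `chunked` and the chunk size of 4 in the usage note are illustrative assumptions, not taken from the treeherder-service code.

```python
# Figures quoted in the bug description.
JOBS_PER_DAY = 33_000
DAYS = 30
ARTIFACTS_PER_JOB = 12  # average, per the description

# Number of IDs that would end up in the single IN filter.
ids_in_filter = JOBS_PER_DAY * DAYS

# Approximate number of job_artifact rows a one-month cycle would delete.
artifact_rows = ids_in_filter * ARTIFACTS_PER_JOB


def chunked(ids, chunk_size):
    """Yield consecutive slices of `ids`, each at most `chunk_size` long.

    Hypothetical helper for illustration only.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, len(ids), chunk_size):
        yield ids[start:start + chunk_size]
```

For example, `list(chunked(list(range(10)), 4))` produces three chunks of sizes 4, 4 and 2, so no single statement ever carries more than `chunk_size` IDs.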
Blocks: 1078523
Assignee: nobody → mdoglio
Status: NEW → ASSIGNED
OS: Mac OS X → All
Priority: -- → P1
Hardware: x86 → All
I added some chunking logic and started testing it on dev. I got an operational error while processing the fx-team database; I will investigate why. The error is:
>OperationalError: (2006, 'MySQL server has gone away')
The operational error I hit was probably due to the sheer size of the query that the routine was trying to execute. I had to add a new parameter to specify the size of the data partitions from the command line. I'm running the routine on dev; once it has finished I'll merge into master and run it on stage.
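As a rough illustration of why a partition size helps, the sketch below deletes rows with one bounded IN list per statement instead of one giant query. It uses an in-memory SQLite database as a stand-in for MySQL so that it runs anywhere; the `job_id` column, the function name `delete_in_chunks` and the sample data are assumptions made for the example, not the actual treeherder-service schema or code.

```python
import sqlite3


def delete_in_chunks(conn, job_ids, chunk_size):
    """Delete job_artifact rows for `job_ids`, at most `chunk_size` IDs per statement.

    Hypothetical helper: the real routine and schema may differ.
    """
    deleted = 0
    for start in range(0, len(job_ids), chunk_size):
        chunk = job_ids[start:start + chunk_size]
        # One "?" placeholder per ID keeps the statement parameterised.
        placeholders = ",".join("?" * len(chunk))
        cur = conn.execute(
            f"DELETE FROM job_artifact WHERE job_id IN ({placeholders})",
            chunk,
        )
        # Commit per chunk so each transaction stays small.
        conn.commit()
        deleted += cur.rowcount
    return deleted


# Stand-in data: 100 jobs with 12 artifacts each.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE job_artifact (id INTEGER PRIMARY KEY, job_id INTEGER)")
conn.executemany(
    "INSERT INTO job_artifact (job_id) VALUES (?)",
    [(job_id,) for job_id in range(1, 101) for _ in range(12)],
)
conn.commit()

# Cycle out the first 50 jobs, 20 IDs per DELETE statement.
total_deleted = delete_in_chunks(conn, list(range(1, 51)), chunk_size=20)
remaining = conn.execute("SELECT COUNT(*) FROM job_artifact").fetchone()[0]
```

With a bounded chunk size, the length of each statement no longer grows with the amount of data being cycled, which is the property the unpartitioned query lacked.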
Attachment #8529145 - Flags: review?(cdawson)
Comment on attachment 8529145: Github PR #291 on treeherder-service
I commented on the question of using cascading deletes for some of these tables. But if that isn't possible (or feasible), then this is good to go.
Attachment #8529145 - Flags: review?(cdawson) → review+
Commits pushed to master at https://github.com/mozilla/treeherder-service

https://github.com/mozilla/treeherder-service/commit/06f62d21f61fd5775e9fa31612be9b0fe6e666bf
Bug 1102228 - Improve the data cycling routine
Added several parameters to the cycle_data shell command: cycle-interval (in days), chunk-size (in number of result sets), sleep-time (in seconds). I made the cycle_data task a very thin wrapper around the shell command, there is no more logic in it. All the queries for data cycling are executed with the retry logic to handle db deadlocks

https://github.com/mozilla/treeherder-service/commit/0a02d494ef8d53ea7717bcfe374ed8489f870bfa
Merge pull request #291 from mozilla/bug-1102228-improve-data-cycling
Bug 1102228 improve data cycling
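The three parameters and the deadlock retry described in the commit message could look roughly like the sketch below. Only the parameter names (cycle-interval, chunk-size, sleep-time) come from the commit; the `--` flag spelling, the default values, the use of argparse, the `FakeOperationalError` class and the `retry_on_deadlock` helper are all assumptions made for illustration, and the actual command in treeherder-service is likely structured differently.

```python
import argparse
import time

# MySQL reports a deadlock as error 1213 (ER_LOCK_DEADLOCK).
DEADLOCK_ERRNO = 1213


class FakeOperationalError(Exception):
    """Stand-in for a DB driver's OperationalError; args[0] is the error code."""


def build_parser():
    """Parser with the three options named in the commit (defaults are made up)."""
    parser = argparse.ArgumentParser(prog="cycle_data")
    parser.add_argument("--cycle-interval", type=int, default=30,
                        help="age of the data to cycle, in days")
    parser.add_argument("--chunk-size", type=int, default=100,
                        help="number of result sets per chunk")
    parser.add_argument("--sleep-time", type=float, default=2.0,
                        help="pause between chunks, in seconds")
    return parser


def retry_on_deadlock(func, retries=3, sleep_time=0.0, sleep=time.sleep):
    """Call `func`, retrying only when it fails with a deadlock error.

    Any other error, or a deadlock on the final attempt, is re-raised.
    """
    for attempt in range(1, retries + 1):
        try:
            return func()
        except FakeOperationalError as exc:
            if exc.args[0] != DEADLOCK_ERRNO or attempt == retries:
                raise
            sleep(sleep_time)
```

A caller would parse the options once, then wrap each per-chunk query in `retry_on_deadlock` and pause for `sleep_time` seconds between chunks so that other writers get a chance to run.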
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED