Closed Bug 1283492 Opened 9 years ago Closed 8 years ago

make balrog cleanup/retention policy runnable through cron

Tracking

(Not tracked)

Status:

RESOLVED FIXED

People

(Reporter: bhearsum, Assigned: bhearsum)

References

Details

Attachments

(1 file)

add cleanup option to manage-db.py 9 years ago bhearsum@mozilla.com (:bhearsum) 41 bytes, text/x-github-pull-request	nthomas : review+ nthomas : feedback+ bhearsum : checked-in+	Details \| Review

bhearsum@mozilla.com (:bhearsum)

Assignee

Description

•

9 years ago

We will not be migrating the stored procedures we currently use to cleanup old releases and releases_history data into RDS. Instead, CloudOps recommends doing something that is runnable by cron. Benson even suggested that we can run it on a much more frequent interval (up to as often as every 15min) to make the load from it less spikey. One of the nice side effects of this is that it now makes it easy to put this policy into the Balrog repo for easier updates.

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 1

•

9 years ago

Attached file add cleanup option to manage-db.py — Details

Attachment #8766788 - Flags: review?(nthomas)

Nick Thomas [:nthomas] (UTC+12)

Updated

•

9 years ago

Attachment #8766788 - Flags: review?(nthomas) → feedback+

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 2

•

9 years ago

Comment on attachment 8766788 [details] [review] add cleanup option to manage-db.py This is ready for final review now.

Attachment #8766788 - Flags: review?(nthomas)

Nick Thomas [:nthomas] (UTC+12)

Comment 3

•

9 years ago

Comment on attachment 8766788 [details] [review] add cleanup option to manage-db.py Clearing r? since we found some more improvements.

Attachment #8766788 - Flags: review?(nthomas)

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 4

•

9 years ago

Comment on attachment 8766788 [details] [review] add cleanup option to manage-db.py This should be ready for final review now.

Attachment #8766788 - Flags: review?(nthomas)

Nick Thomas [:nthomas] (UTC+12)

Updated

•

9 years ago

Attachment #8766788 - Flags: review?(nthomas) → review+

[github robot]

Comment 5

•

9 years ago

Commit pushed to master at https://github.com/mozilla/balrog https://github.com/mozilla/balrog/commit/cfe4f31f17fea8b0f8e8f210a6aca703899d5b71 bug 1283492: Integrate retention policy with existing db management script (#96). r=nthomas,mostlygeek

bhearsum@mozilla.com (:bhearsum)

Assignee

Updated

•

9 years ago

Attachment #8766788 - Flags: checked-in+

bhearsum@mozilla.com (:bhearsum)

Assignee

Updated

•

9 years ago

Depends on: 1293306

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 6

•

9 years ago

This is in production now, we still need to do a dry run against the prod db, and enable a cronjob to run it regularly.

Benson Wong [:mostlygeek]

Comment 7

•

9 years ago

Did some testing with a snapshot version of the RDS DB. Same size datbase / storage / iops. TL;DR: - it will take about 5.7hours to delete a days worth of expired '%latest' records. - no idea how long it will take to delete the %nightly% records - the index on the name column is never used. It only works for left hand matches uses index: some-string-% does not use index: %latest, %nightly% - we should think of a different strategy for storing releases_history changes DETAILS: The SQL conditions: WHERE name LIKE '%latest' AND timestamp<1000*UNIX_TIMESTAMP(NOW()-INTERVAL 14 DAY) Takes a very long time. The `name LIKE '%latest'` causes a scan as it is not an indexed column. Without the name column a range scan is very fast. It takes about 300ms to 350ms PER RECORD to find and delete it. There is about 6700ish %latest records everyday so it would take about 5.7hours to delete a days worth. I haven't even dug into the nightly records yet. Here is my crazy query to delete a range of records: ---- DELETE R.* FROM releases_history R WHERE R.change_id IN ( SELECT x.change_id FROM (SELECT change_id FROM releases_history WHERE name LIKE '%latest' AND timestamp<1000*UNIX_TIMESTAMP(NOW()-INTERVAL 14 DAY) limit 1000) x); Query OK, 1000 rows affected (6 min 42.66 sec) ---- The above query uses sub-queries to make a list of change_id. Then the DELETE will just use the primary key. It looks like this in the DB graphs: https://www.evernote.com/l/AAOmEW4547FEy7g3RorfannGcI9ggVyfKC4B/image.png You can see the read IO for SELECT and where the DELETE kicks in. Looks like MySQL is keeping the ID's in RAM and running through the list. I'm sure my query sucks.

Benson Wong [:mostlygeek]

Comment 8

•

9 years ago

I feel like I've gone off the deep end with this one. I dug a bit deeper and wrote this ridiculous query to estimate how many %latest records there are. This monstrosity exists since we don't have a proper index that uses both timestamp+name for the query we need. It's fast though. :) SELECT count(*) FROM ( # ------------------------------------------------------ # Use index(timestamp) to find records that are too old # and are candidates to be deleted. # ------------------------------------------------------ SELECT B.change_id FROM releases_history B WHERE B.timestamp<1000*UNIX_TIMESTAMP(NOW()-INTERVAL 7 DAY) ) LSIDE LEFT JOIN ( # ------------------------------------------------------ # Use index(name) to match records. Extract the record's # primary key to use for matching. # ------------------------------------------------------ SELECT C.change_id FROM releases_history C WHERE C.name in ( # ------------------------------------------------------ # Use index(name) to get unique names and filter the # ones that end with %latest # ------------------------------------------------------ SELECT UniqueOnly.name FROM ( SELECT distinct name FROM releases_history ) UniqueOnly WHERE UniqueOnly.name LIKE '%latest' ) ) RSIDE ON LSIDE.change_id=RSIDE.change_id WHERE RSIDE.change_id IS NOT NULL # too old AND correct name This returns a count of about 400K records %latest that are suitable for cleaning up. A similar query shows about 200K %nightly% records to clean up. So about 600K suitable records to clean up combined. Due to innodb reasons (rollback log, updating indexes) deleting a ton of rows is really bad. Batches seems to be the recommended approach. After some testing this seems to give a good balance of performance and batch size: DELETE R.* FROM releases_history R WHERE R.change_id IN ( SELECT X.change_id FROM (SELECT change_id FROM releases_history WHERE name LIKE '%latest' AND timestamp<1000*UNIX_TIMESTAMP(NOW()-INTERVAL 14 DAY) ORDER BY change_id LIMIT 100 ) X ); It averages about 400ms/rows. Which is about 9000 rows/hour we can delete. Using a 6 hour, 3PM - 9PM PST purge window, we can delete about 54K rows/day. My best guess is we add about 8K rows/day to the history table. So... - 54K - 8K = 46K rows of backlog we clean up / day - ~600K rows of backlog / 46K = ~ 13 days to get caught up Though I wonder, the releases table has 1,034 rows in it. Why are we adding so many rows/day to the releases_history table? An alternative is to maybe redesign the way releases_history is kept so we can maybe: - truncate - drop table or something else that is a lot more efficient.

Benson Wong [:mostlygeek]

Comment 9

•

9 years ago

Update from meeting: - Will put script in place to delete in batches (bwong) - bhearsum will look at not saving history for nightly and latest which should get rid of the problem

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 10

•

9 years ago

(In reply to Benson Wong [:mostlygeek] from comment #9) > - bhearsum will look at not saving history for nightly and latest which > should get rid of the problem https://bugzilla.mozilla.org/show_bug.cgi?id=1294493 or this

[github robot]

Comment 11

•

9 years ago

Commit pushed to master at https://github.com/mozilla/balrog https://github.com/mozilla/balrog/commit/7a21838321a65de1a9b138bc98ae168043e9c488 bug 1283492: Add batch delete for releases and releases_history tables (#110). r=bhearsum

bhearsum@mozilla.com (:bhearsum)

Assignee

Updated

•

8 years ago

Depends on: 1295183

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 12

•

8 years ago

This got enabled as part of the push from https://bugzilla.mozilla.org/show_bug.cgi?id=1304054. The cronjob is running at 2100 UTC. The most recent run took a couple of hours, but it was cleaning up over a month of backlog. When we timed it out last time, a daily run should only take a few minutes.

Status: NEW → RESOLVED

Closed: 8 years ago

Resolution: --- → FIXED

BMO Automation

Updated

•

5 years ago

Product: Release Engineering → Release Engineering Graveyard

You need to log in before you can comment on or make changes to this bug.

Bugzilla

make balrog cleanup/retention policy runnable through cron

Categories

(Release Engineering Graveyard :: Applications: Balrog (backend), defect)

Tracking

(Not tracked)

People

(Reporter: bhearsum, Assigned: bhearsum)

References

Details

Crash Data

Security

(public)

User Story

Attachments

(1 file)

Description

Comment 1

Updated

Comment 2

Comment 3

Comment 4

Updated

Comment 5

Updated

Updated

Comment 6

Comment 7

Comment 8

Comment 9

Comment 10

Comment 11

Updated

Comment 12

Updated

Attachment

General

Description

File Name

Content Type