make balrog cleanup/retention policy runnable through cron

RESOLVED FIXED

Status

Release Engineering
Balrog: Backend
RESOLVED FIXED
2 years ago
a year ago

People

(Reporter: bhearsum, Assigned: bhearsum)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(1 attachment)

(Assignee)

Description

2 years ago
We will not be migrating the stored procedures we currently use to cleanup old releases and releases_history data into RDS. Instead, CloudOps recommends doing something that is runnable by cron. Benson even suggested that we can run it on a much more frequent interval (up to as often as every 15min) to make the load from it less spikey.

One of the nice side effects of this is that it now makes it easy to put this policy into the Balrog repo for easier updates.
(Assignee)

Comment 1

2 years ago
Created attachment 8766788 [details] [review]
add cleanup option to manage-db.py
Attachment #8766788 - Flags: review?(nthomas)
Attachment #8766788 - Flags: review?(nthomas) → feedback+
(Assignee)

Comment 2

a year ago
Comment on attachment 8766788 [details] [review]
add cleanup option to manage-db.py

This is ready for final review now.
Attachment #8766788 - Flags: review?(nthomas)
Comment on attachment 8766788 [details] [review]
add cleanup option to manage-db.py

Clearing r? since we found some more improvements.
Attachment #8766788 - Flags: review?(nthomas)
(Assignee)

Comment 4

a year ago
Comment on attachment 8766788 [details] [review]
add cleanup option to manage-db.py

This should be ready for final review now.
Attachment #8766788 - Flags: review?(nthomas)
Attachment #8766788 - Flags: review?(nthomas) → review+

Comment 5

a year ago
Commit pushed to master at https://github.com/mozilla/balrog

https://github.com/mozilla/balrog/commit/cfe4f31f17fea8b0f8e8f210a6aca703899d5b71
bug 1283492: Integrate retention policy with existing db management script (#96). r=nthomas,mostlygeek
(Assignee)

Updated

a year ago
Attachment #8766788 - Flags: checked-in+
(Assignee)

Updated

a year ago
Depends on: 1293306
(Assignee)

Comment 6

a year ago
This is in production now, we still need to do a dry run against the prod db, and enable a cronjob to run it regularly.
Did some testing with a snapshot version of the RDS DB. Same size datbase / storage / iops. 

TL;DR:

 - it will take about 5.7hours to delete a days worth of expired '%latest' records. 
 - no idea how long it will take to delete the %nightly% records
 - the index on the name column is never used. It only works for left hand matches
   uses index: some-string-%
   does not use index: %latest, %nightly%
 - we should think of a different strategy for storing releases_history changes


DETAILS: 

The SQL conditions: 

   WHERE name LIKE '%latest' AND timestamp<1000*UNIX_TIMESTAMP(NOW()-INTERVAL 14 DAY)

Takes a very long time. The `name LIKE '%latest'` causes a scan as it is not an indexed column. Without the name column a range scan is very fast. 

It takes about 300ms to 350ms PER RECORD to find and delete it. There is about 6700ish %latest records everyday so it would take about 5.7hours to delete a days worth. I haven't even dug into the nightly records yet. 

Here is my crazy query to delete a range of records:

----
DELETE R.*
FROM releases_history R 
WHERE R.change_id IN (
    SELECT x.change_id 
    FROM (SELECT change_id 
          FROM releases_history 
          WHERE name LIKE '%latest' AND timestamp<1000*UNIX_TIMESTAMP(NOW()-INTERVAL 14 DAY) limit 1000) x);

Query OK, 1000 rows affected (6 min 42.66 sec)
----

The above query uses sub-queries to make a list of change_id. Then the DELETE will just use the primary key. It looks like this in the DB graphs: 

   https://www.evernote.com/l/AAOmEW4547FEy7g3RorfannGcI9ggVyfKC4B/image.png

You can see the read IO for SELECT and where the DELETE kicks in. Looks like MySQL is keeping the ID's in RAM and running through the list. I'm sure my query sucks.
I feel like I've gone off the deep end with this one. 

I dug a bit deeper and wrote this ridiculous query to estimate how many %latest records there are. This monstrosity exists since we don't have a proper index that uses both timestamp+name for the query we need. It's fast though. :)

SELECT count(*)
FROM (
      # ------------------------------------------------------
      # Use index(timestamp) to find records that are too old
      # and are candidates to be deleted. 
      # ------------------------------------------------------      
      SELECT B.change_id
      FROM releases_history B
      WHERE B.timestamp<1000*UNIX_TIMESTAMP(NOW()-INTERVAL 7 DAY)      
    ) LSIDE LEFT JOIN (
        # ------------------------------------------------------
        # Use index(name) to match records. Extract the record's 
        # primary key to use for matching.
        # ------------------------------------------------------            
        SELECT C.change_id
        FROM releases_history C
        WHERE C.name in (
            # ------------------------------------------------------
            # Use index(name) to get unique names and filter the 
            # ones that end with %latest
            # ------------------------------------------------------
         
            SELECT UniqueOnly.name 
            FROM (
                SELECT distinct name FROM releases_history
            ) UniqueOnly
            WHERE UniqueOnly.name LIKE '%latest'
        )        
    ) RSIDE ON LSIDE.change_id=RSIDE.change_id    
WHERE RSIDE.change_id IS NOT NULL # too old AND correct name

This returns a count of about 400K records %latest that are suitable for cleaning up. A similar query shows about 200K %nightly% records to clean up. So about 600K suitable records to clean up combined. 

Due to innodb reasons (rollback log, updating indexes) deleting a ton of rows is really bad. Batches seems to be the recommended approach. After some testing this seems to give a good balance of performance and batch size:

DELETE R.*
FROM releases_history R 
WHERE R.change_id IN (
    SELECT X.change_id 
    FROM (SELECT change_id 
          FROM releases_history 
          WHERE name LIKE '%latest' AND timestamp<1000*UNIX_TIMESTAMP(NOW()-INTERVAL 14 DAY) 
          ORDER BY change_id
          LIMIT 100
    ) X
);

It averages about 400ms/rows. Which is about 9000 rows/hour we can delete. 

Using a 6 hour, 3PM - 9PM PST purge window, we can delete about 54K rows/day. My best guess is we add about 8K rows/day to the history table. So... 

  - 54K - 8K = 46K rows of backlog we clean up / day
  - ~600K rows of backlog / 46K = ~ 13 days to get caught up

Though I wonder, the releases table has 1,034 rows in it. Why are we adding so many rows/day to the releases_history table?
An alternative is to maybe redesign the way releases_history is kept so we can maybe:

  - truncate
  - drop table

or something else that is a lot more efficient.
Update from meeting: 

- Will put script in place to delete in batches (bwong)
- bhearsum will look at not saving history for nightly and latest which should get rid of the problem
(Assignee)

Comment 10

a year ago
(In reply to Benson Wong [:mostlygeek] from comment #9)
> - bhearsum will look at not saving history for nightly and latest which
> should get rid of the problem

https://bugzilla.mozilla.org/show_bug.cgi?id=1294493 or this

Comment 11

a year ago
Commit pushed to master at https://github.com/mozilla/balrog

https://github.com/mozilla/balrog/commit/7a21838321a65de1a9b138bc98ae168043e9c488
bug 1283492: Add batch delete for releases and releases_history tables (#110). r=bhearsum
(Assignee)

Updated

a year ago
Depends on: 1295183
(Assignee)

Comment 12

a year ago
This got enabled as part of the push from https://bugzilla.mozilla.org/show_bug.cgi?id=1304054. The cronjob is running at 2100 UTC. The most recent run took a couple of hours, but it was cleaning up over a month of backlog. When we timed it out last time, a daily run should only take a few minutes.
Status: NEW → RESOLVED
Last Resolved: a year ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.