Closed Bug 952135 Opened 11 years ago Closed 10 years ago

Create a data lifecycle for aus data

Categories

(Data & BI Services Team :: DB: MySQL, task)

Platform: x86
OS: macOS
Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: scabral, Assigned: scabral)

Details

(Whiteboard: [2014q2] June)

Since it started gathering data in late June 2013, the aus4_dev database has been growing rapidly. There are 174,804 rows in releases_history, and the table is 38G in size.

The production database started gathering data in late September 2013, and already has 145,262 rows and is 26G in size.

These are rates of about 6-9G per month (dev: 38G since late June, roughly 6-7G/month; production: 26G since late September, roughly 9G/month). While there's no danger of running out of space in the near future, there's only room for about 2-3 years' worth of data, assuming the 6-9G per month rate holds.

We should put some kind of a data lifecycle policy in place. This isn't urgent, but it should happen eventually, so I'm making a placeholder bug for it. (I tried to defragment, but the table is not fragmented).
Yes, indeed... Nick and I have talked vaguely about this before. I think it would be good to figure this out - at least for the releases_history table. (The other tables don't grow at anywhere near the same rate.)
Sheeri, once we decide on the lifecycle for the various tables, is that something that's implemented on your side and/or by you, or do we need to write clean-up scripts ourselves?
Flags: needinfo?(scabral)
We can do it either way. For example, we have written and implemented scripts for tbpl cleanup, but bouncer cleanup is done by the app itself... We'll work with you to figure out what's best.
Flags: needinfo?(scabral)
(In reply to Sheeri Cabral [:sheeri] from comment #3)
> We can do it either way. For example, we have written and implemented scripts
> for tbpl cleanup, but bouncer cleanup is done by the app itself... We'll work
> with you to figure out what's best.

Thanks! I'm going to try to drive the policy decision, and then we can figure that out.
So, we've got 6 tables to think about:
permissions
permissions_history
rules
rules_history
releases
releases_history

Both permissions and permissions_history are very small, and change very infrequently. I propose we keep their data indefinitely.

Rules and rules_history are slightly bigger, and will see small updates at least twice a week once we move betas/releases to aus4. The rate of change could increase further if Mozilla starts directly serving some FxOS updates.

Releases currently contains information about all nightlies going back at least a year, and will contain authoritative information about releases in the future. I think we should keep the rows that correspond to release builds indefinitely, but we should be able to purge old nightly rows after some threshold (6 months? a year?). We have to be careful not to purge rows that rules.mapping points at. (Aside: we should probably make rules.mapping a foreign key - I'm not sure why it isn't already; see the sketch after the list below.) To restate the above more clearly:
* Never purge rows whose name matches "%latest%" or "%build%".
* Never purge rows that a rules.mapping points at.
* Other rows can be purged after $threshold
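
A minimal sketch of the foreign key from the aside above (hypothetical constraint name; assumes releases.name is the primary key and the column definitions match - this wasn't actually run as part of this bug):

ALTER TABLE rules
  ADD CONSTRAINT fk_rules_mapping
  FOREIGN KEY (mapping) REFERENCES releases (name);

NULL mappings would still be allowed; the constraint only checks non-NULL values.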

I'm a bit torn on what to do with releases_history. It could either follow the same logic as above (likely with a smaller $threshold), or we could just purge all rows after $threshold. Doing the latter potentially makes it more difficult to debug issues with releases later.

---

All of the above is meant for production -- for dev I think we should do the same for permissions/rules, and purge releases/releases_history after a small threshold (only keeping rows that a rules.mapping points at).

Nick, what do you think of the above?
Flags: needinfo?(nthomas)
Looks good to me. I'd opt to treat releases and releases_history the same - the audit trail for releases is valuable, the churn of nightly builds not so much.
Flags: needinfo?(nthomas)
Okay! So the policy for production is as follows:
* No removal of data from rules, rules_history, permissions, or permissions_history
* Remove rows from releases and releases_history that meet all of the following conditions:
** "name" column does not match "%latest%" or "%build%"
** No rules.mapping matches the "name" column
** Is older than 6 months

And for dev:
* No removal of data from rules, rules_history, permissions, or permissions_history
* Remove rows from releases and releases_history that meet all of the following conditions:
** No rules.mapping matches the "name" column
** Is older than a month

Sheeri, can we chat sometime soon about how we'll go about this?
Flags: needinfo?(scabral)
This all looks good - releases_history is really the one to watch out for. All tables are currently 96K in production, except for releases (141M) and releases_history (51G), so that's where the pain point will be anyway.

1) The releases and releases_history table join on name, right?

2) What if the timestamp is NULL?

Looking at current production data:
mysql> select count(name) FROM releases inner join releases_history using (name) LEFT JOIN rules ON (name=mapping) WHERE rules.mapping is null AND name not like '%build%' and name not like '%latest%' and timestamp<now()-interval 6 month;
+-------------+
| count(name) |
+-------------+
|      152823 |
+-------------+
1 row in set (18.90 sec)

The query is taking a while because of the string matching. I want to look into putting an InnoDB fulltext index on the name column (I'll use the backup server so as not to affect production).
Flags: needinfo?(scabral)
Oh, that timestamp is in Unix timestamp format, in milliseconds...


select count(name) FROM releases inner join releases_history using (name) LEFT JOIN rules ON (name=mapping) WHERE rules.mapping is null AND name not like '%build%' and name not like '%latest%' and from_unixtime(timestamp/1000)<now()-interval 6 month;

shows nothing, because the earliest production data is from September...

but this works!
select count(name) FROM releases inner join releases_history using (name) LEFT JOIN rules ON (name=mapping) WHERE rules.mapping is null AND name not like '%build%' and name not like '%latest%' and from_unixtime(timestamp/1000)<now()-interval 1 month;

The fulltext indexing isn't working as I expected (I get 0 matches both when I search for entries that contain 'latest' and for entries that don't contain 'latest' - maybe because the names aren't really words; the tokens are separated by dashes).
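
For reference, a minimal sketch of the fulltext experiment (hypothetical index name; InnoDB fulltext indexes require MySQL 5.6+, and this was tried on the backup host rather than production):

ALTER TABLE releases_history ADD FULLTEXT INDEX ft_name (name);

SELECT COUNT(*) FROM releases_history
WHERE MATCH(name) AGAINST ('+latest' IN BOOLEAN MODE);

The default fulltext parser splits tokens on non-alphanumeric characters and drops short tokens (innodb_ft_min_token_size), so dash-separated release names can match in unexpected ways.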

Anyway.....

We can meet whenever; my calendar in zimbra is up to date, so you can just schedule something (just remember I'm in Eastern time).
Sheeri and I chatted today. She said that her team should be able to take on implementation sometime in Q2, so I'm moving this bug over to y'all!

She also mentioned that we may need to take a slightly different approach in determining which rows are old, because releases_history.timestamp can be validly null in some cases.
Assignee: nobody → server-ops-database
Component: Balrog: Backend → Server Operations: Database
Product: Release Engineering → mozilla.org
QA Contact: bhearsum → scabral
Version: unspecified → other
Whiteboard: [2014q2]
Hrm, applying a function to timestamp prevents the index from being used. Let's change that:

select count(name) FROM releases inner join releases_history using (name) LEFT JOIN rules ON (name=mapping) WHERE rules.mapping is null AND name not like '%build%' and name not like '%latest%' and timestamp<1000*unix_timestamp(now()-interval 1 month);
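
To confirm the rewritten predicate can actually use an index, a quick hypothetical check (this assumes releases_history has an index on timestamp - look for it in the key column of the plan):

EXPLAIN select count(name) FROM releases inner join releases_history using (name) LEFT JOIN rules ON (name=mapping) WHERE rules.mapping is null AND name not like '%build%' and name not like '%latest%' and timestamp<1000*unix_timestamp(now()-interval 1 month);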

Here's a sample of what's coming up - looks good to me:

mysql> select name,from_unixtime(timestamp/1000) FROM releases inner join releases_history using (name) LEFT JOIN rules ON (name=mapping) WHERE rules.mapping is null AND name not like '%build%' and name not like '%latest%' and timestamp<1000*unix_timestamp(now()-interval 1 month) limit 5;
+------------------------------------------------+-------------------------------+
| name                                           | from_unixtime(timestamp/1000) |
+------------------------------------------------+-------------------------------+
| Firefox-mozilla-central-nightly-20130927030202 | 2013-09-27 12:57:13.5820      |
| Firefox-mozilla-central-nightly-20130927030202 | 2013-09-27 12:57:13.5830      |
| Firefox-mozilla-central-nightly-20130927030202 | 2013-09-27 12:57:13.5930      |
| Firefox-ux-nightly-20130927040201              | 2013-09-27 12:58:05.4330      |
| Firefox-ux-nightly-20130927040201              | 2013-09-27 12:58:05.4340      |
+------------------------------------------------+-------------------------------+
5 rows in set (1.31 sec)

Now we just need to automate deleting those records :D
Assignee: server-ops-database → scabral
Note that in this first iteration, I am leaving NULL values of timestamp alone.

Great, so we need to delete matching rows in releases and releases_history: older than 6 months for production, older than 1 month for dev.

DELETE can support multiple tables, but not with a LIMIT, and right now there's a lot of data. So I think the best way to go through this is to use a cursor, fetching the primary keys from the releases and releases_history tables via:

select releases.name,releases_history.change_id FROM releases inner join releases_history using (name) LEFT JOIN rules ON (name=mapping) WHERE rules.mapping is null AND name not like '%build%' and name not like '%latest%' and timestamp<1000*unix_timestamp(now()-interval 1 month);

To make this portable, I can:
0) make a stored procedure that lives on both dev and prod (and stage if that's needed), which takes a number of months as a parameter,
1) then create an event on dev/prod that calls the stored procedure once a week, deleting everything older than 1 or 6 months, depending on the server, and
2) finally, monitor the age of the oldest record, and e-mail if the oldest record is older than 2 or 7 months (depending on the server).
Whiteboard: [2014q2] → [2014q2] June
Here's the stored procedure that's working on dev:

drop procedure if exists clean_aus_data;
DELIMITER |
CREATE PROCEDURE clean_aus_data (IN purge_months TINYINT)
BEGIN
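-- Purge nightly data older than purge_months months: delete matching rows
-- from releases (by name) and releases_history (by change_id), skipping
-- names that look like real releases ('%build%'/'%latest%') and anything
-- a rules.mapping still points at.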
DECLARE delete_from_releases VARCHAR(100);
DECLARE delete_from_rel_hist INT;
DECLARE done TINYINT UNSIGNED DEFAULT 0;
DECLARE record_to_delete CURSOR FOR
SELECT releases.name,
releases_history.change_id
FROM releases INNER JOIN releases_history USING (name) 
LEFT JOIN rules ON (name=mapping) 
WHERE rules.mapping IS NULL AND name NOT LIKE '%build%' 
AND name NOT LIKE '%latest%' 
AND timestamp<1000*UNIX_TIMESTAMP(NOW()-INTERVAL purge_months MONTH);

DECLARE CONTINUE HANDLER FOR NOT FOUND SET done=1;

OPEN record_to_delete;
WHILE done=0 DO
FETCH NEXT FROM record_to_delete INTO delete_from_releases,delete_from_rel_hist;
-- Guard: when the cursor is exhausted, the handler sets done=1 but the loop
-- body still runs, so skip the deletes to avoid re-deleting the last row.
IF done=0 THEN
-- SELECT delete_from_releases, delete_from_rel_hist;
SET @delete_sql=CONCAT("DELETE FROM releases WHERE name='",delete_from_releases,"';");
-- SELECT @delete_sql;
PREPARE stmt1 FROM @delete_sql;
EXECUTE stmt1;
DEALLOCATE PREPARE stmt1;
SET @delete_sql=CONCAT("DELETE FROM releases_history WHERE change_id='",delete_from_rel_hist,"';");
-- SELECT @delete_sql;
PREPARE stmt1 FROM @delete_sql;
EXECUTE stmt1;
DEALLOCATE PREPARE stmt1;
END IF;
END WHILE;
CLOSE record_to_delete;
END
|
DELIMITER ;

I called it with:
CALL clean_aus_data(11);
And it worked - it moved the minimum timestamp from:
2013-06-21 07:23:46.3940 
to:
2013-07-10 08:55:19.4560
Now I'm working month by month to delete the data on dev:

CALL clean_aus_data(10);
CALL clean_aus_data(9);
CALL clean_aus_data(8);
etc.

I'll stop at 2 months, then defragment dev.
I have put the stored procedure on production, and am deleting records older than 8 months.

The deletion on dev is still going on, 15 minutes and counting (I'm checking it; it's deleting the proper records).
Verified aus4_stage is empty; I'm not putting the stored procedure on there.
dev finally finished deleting records older than 10 months; it took 25 minutes. Will continue to delete records from dev. We have backups if anything goes awry.
Production is now purged down to the last 6 months:

mysql> select @@hostname; CALL clean_aus_data(6);
+----------------------------+
| @@hostname                 |
+----------------------------+
| aus4-1.db.phx1.mozilla.com |
+----------------------------+
1 row in set (0.00 sec)

Query OK, 0 rows affected (3 min 48.87 sec)

Min datetime in releases_history: 2013-12-10 09:17:04.5900  

I have defragmented releases, bringing it from 228M to 156M - a reduction of roughly a third. (The "Table does not support optimize" note below is normal for InnoDB; OPTIMIZE is implemented as recreate + analyze.)
mysql> optimize table releases;
+---------------------------+----------+----------+-------------------------------------------------------------------+
| Table                     | Op       | Msg_type | Msg_text                                                          |
+---------------------------+----------+----------+-------------------------------------------------------------------+
| aus4_mozilla_org.releases | optimize | note     | Table does not support optimize, doing recreate + analyze instead |
| aus4_mozilla_org.releases | optimize | status   | OK                                                                |
+---------------------------+----------+----------+-------------------------------------------------------------------+
2 rows in set (4.49 sec)


Can we set aside time to defragment releases_history? The releases_history table is 83G, and it should shrink a lot when we defragment, because of the deletions - I'd expect comparable shrinkage, probably down to 41G or so. I'd want a window of 30 mins, just in case (I'd expect it to take 10 mins).
purge of dev complete (took 49 mins)

Defrag'ing. Before:
-rw-rw---- 1 mysql mysql 480M Jun  9 19:11 releases.ibd
-rw-rw---- 1 mysql mysql  39G Jun  9 19:11 releases_history.ibd

After:
-rw-rw---- 1 mysql mysql 340M Jun  9 20:22 releases.ibd
-rw-rw---- 1 mysql mysql  29G Jun  9 21:35 /var/lib/mysql/aus4_dev/releases_history.ibd
Took aus4-2 out of the load balancer; defrag'ing the table to see how long it takes.

mysql> \! du -sh releases_history.ibd
85G     releases_history.ibd
mysql> set sql_log_bin=0;
Query OK, 0 rows affected (0.00 sec)

mysql> optimize table aus4_mozilla_org.releases_history;
Not much of a change on aus4-2:
78G     /var/lib/mysql/aus4_mozilla_org/releases_history.ibd
aus4-2 is back in the load balancer :D
Steps left to do:

1) Make an event on dev/prod that calls the stored procedure once a week, deleting everything older than 1 or 6 months, depending on the server.
2) Monitor the age of the oldest record, and e-mail if the oldest record is older than 2 or 7 months (depending on the server).
3) Put the stored procedure and events into puppet to ensure they're on the right servers, and that the code is backed up.
1) Events created:
on dev:
CREATE EVENT aus4_lifecycle ON SCHEDULE EVERY 1 WEEK DO CALL clean_aus_data(1);

on production:
CREATE EVENT aus4_lifecycle ON SCHEDULE EVERY 1 WEEK DO CALL clean_aus_data(6);



2) Monitoring check info - the check should monitor:
select datediff(CURRENT_DATE,DATE(min(from_unixtime(timestamp/1000)))) FROM releases RIGHT JOIN releases_history USING (name)  LEFT JOIN rules ON (name=mapping)  WHERE rules.mapping IS NULL AND name NOT LIKE '%build%'  AND name NOT LIKE '%latest%';

Should be no greater than 60 on dev
Should be no greater than 7*30 (210) on production

3) puppetize stored procedure and events, making sure they're disabled on the slave(s) and enabled on the master.
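
As a hedged sketch of what that puppetization needs to guarantee (standard MySQL statements; not something already applied in this bug): the event scheduler has to be running on the master for aus4_lifecycle to fire, and the event must not also run on replicas.

-- On the master:
SET GLOBAL event_scheduler = ON;
-- On a replica where the event was created directly (events that arrive via
-- replication are slave-side disabled automatically):
ALTER EVENT aus4_lifecycle DISABLE ON SLAVE;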
The monitoring check is in place; it's at https://nagios.mozilla.org/phx1/cgi-bin/extinfo.cgi?type=2&host=aus4-1.db.phx1.mozilla.com&service=MySQL+data+lifecycle+check+for+aus4

Need to puppetize the stored procedure and events, but that's a bigger topic, and the functionality and monitoring exist, so I'm going to consider this closed.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Awesome, thanks Sheeri.
Product: mozilla.org → Data & BI Services Team