Closed Bug 952135 Opened 11 years ago Closed 10 years ago

Create a data lifecycle for aus data

Categories

(Data & BI Services Team :: DB: MySQL, task)

Platform: x86
OS: macOS
Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: scabral, Assigned: scabral)

Details

(Whiteboard: [2014q2] June)

Since it started gathering data in late June 2013, the aus4_dev database has been growing rapidly. There are 174,804 rows in releases_history, and the table is 38G in size.

The production database started gathering data in late September 2013, and already has 145,262 rows and is 26G in size.

These are rates of about 6-9G per month (dev: 38G since late June, roughly 6-7G/month; production: 26G since late September, roughly 9G/month). While there's no danger of running out of space in the near future, there's only room for about 2-3 years' worth of data, assuming the 6-9G per month rate holds.

We should put some kind of a data lifecycle policy in place. This isn't urgent, but it should happen eventually, so I'm making a placeholder bug for it. (I tried to defragment, but the table is not fragmented).
Yes, indeed... Nick and I have talked vaguely about this before. I think it would be good to figure this out - at least for the releases_history table. (The other tables don't grow at anywhere near the same rate.)
Sheeri, once we decide on the lifecycle for the various tables, is that something that's implemented on your side and/or by you, or do we need to write clean-up scripts ourselves?
Flags: needinfo?(scabral)
We can do it either way. For example, we have written and implemented scripts for tbpl cleanup, but bouncer cleanup is done by the app itself... We'll work with you to figure out what's best.
Flags: needinfo?(scabral)
(In reply to Sheeri Cabral [:sheeri] from comment #3)
> We can do it either way. For example, we have written and implemented scripts
> for tbpl cleanup, but bouncer cleanup is done by the app itself... We'll work
> with you to figure out what's best.

Thanks! I'm going to try to drive the policy decision, and then we can figure that out.
So, we've got 6 tables to think about:
permissions
permissions_history
rules
rules_history
releases
releases_history

Both permissions and permissions_history are very small, and change very infrequently. I propose we keep their data indefinitely.

Rules and rules_history are slightly bigger, and will see small updates at least twice a week once we move betas/releases to aus4. The rate of change could increase further if Mozilla starts directly serving some FxOS updates.

Releases currently contains information about all nightlies going back at least a year, and will contain authoritative information about releases in the future. I think we should keep the rows that correspond to release builds indefinitely, but we should be able to purge old nightly rows after some threshold (6 months? a year?). We have to be careful not to purge rows that rules.mapping points at. (Aside: we should probably make rules.mapping a foreign key - I'm not sure why it isn't already; see the sketch after the list below.) To restate the above more clearly:
* Never purge rows whose name matches "%latest%" or "%build%".
* Never purge rows that a rules.mapping points at.
* Other rows can be purged after $threshold
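
A minimal sketch of the foreign key from the aside above (hypothetical constraint name; assumes releases.name is the primary key and the column definitions match - this wasn't actually run as part of this bug):

ALTER TABLE rules
  ADD CONSTRAINT fk_rules_mapping
  FOREIGN KEY (mapping) REFERENCES releases (name);

NULL mappings would still be allowed; the constraint only checks non-NULL values.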

I'm a bit torn on what to do with releases_history. It could either follow the same logic as above (likely with a smaller $threshold), or we could just purge all rows after $threshold. Doing the latter potentially makes it more difficult to debug issues with releases later.

---

All of the above is meant for production -- for dev I think we should do the same for permissions/rules, and purge releases/releases_history after a small threshold (only keeping rows that a rules.mapping points at).

Nick, what do you think of the above?
Flags: needinfo?(nthomas)
Looks good to me. I'd opt to treat releases and releases_history the same - the audit trail for releases is valuable, the churn of nightly builds not so much.
Flags: needinfo?(nthomas)
Okay! So the policy for production is as follows:
* No removal of data from rules, rules_history, permissions, or permissions_history
* Remove rows from releases and releases_history that meet all of the following conditions:
** "name" column does not match "%latest%" or "%build%"
** No rules.mapping matches the "name" column
** Is older than 6 months

And for dev:
* No removal of data from rules, rules_history, permissions, or permissions_history
* Remove rows from releases and releases_history that meet all of the following conditions:
** No rules.mapping matches the "name" column
** Is older than a month

Sheeri, can we chat sometime soon about how we'll go about this?
Flags: needinfo?(scabral)
This all looks good - releases_history is really the one to watch out for. All tables are currently 96K in production, except for releases (141M) and releases_history (51G), so that's where the pain point will be anyway.

1) The releases and releases_history table join on name, right?

2) What if the timestamp is NULL?

Looking at current production data:
mysql> select count(name) FROM releases inner join releases_history using (name) LEFT JOIN rules ON (name=mapping) WHERE rules.mapping is null AND name not like '%build%' and name not like '%latest%' and timestamp<now()-interval 6 month;
+-------------+
| count(name) |
+-------------+
|      152823 |
+-------------+
1 row in set (18.90 sec)

The query is taking a while because of the string matching. I want to look into putting an InnoDB fulltext index on the name column (I'll use the backup server so as not to affect production).
Flags: needinfo?(scabral)
Oh, that timestamp is in Unix timestamp format, in milliseconds...


select count(name) FROM releases inner join releases_history using (name) LEFT JOIN rules ON (name=mapping) WHERE rules.mapping is null AND name not like '%build%' and name not like '%latest%' and from_unixtime(timestamp/1000)<now()-interval 6 month;

shows nothing, because the earliest production data is from September...

but this works!
select count(name) FROM releases inner join releases_history using (name) LEFT JOIN rules ON (name=mapping) WHERE rules.mapping is null AND name not like '%build%' and name not like '%latest%' and from_unixtime(timestamp/1000)<now()-interval 1 month;

The fulltext indexing isn't working as I expected (I get 0 matches both when I search for entries that contain 'latest' and for entries that don't contain 'latest' - maybe because the names aren't really words; the tokens are separated by dashes).
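
For reference, a minimal sketch of the fulltext experiment (hypothetical index name; InnoDB fulltext indexes require MySQL 5.6+, and this was tried on the backup host rather than production):

ALTER TABLE releases_history ADD FULLTEXT INDEX ft_name (name);

SELECT COUNT(*) FROM releases_history
WHERE MATCH(name) AGAINST ('+latest' IN BOOLEAN MODE);

The default fulltext parser splits tokens on non-alphanumeric characters and drops short tokens (innodb_ft_min_token_size), so dash-separated release names can match in unexpected ways.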

Anyway.....

We can meet whenever; my calendar in zimbra is up to date, so you can just schedule something (just remember I'm in Eastern time).
Sheeri and I chatted today. She said that her team should be able to take on implementation sometime in Q2, so I'm moving this bug over to y'all!

She also mentioned that we may need to take a slightly different approach in determining which rows are old, because releases_history.timestamp can be validly null in some cases.
Assignee: nobody → server-ops-database
Component: Balrog: Backend → Server Operations: Database
Product: Release Engineering → mozilla.org
QA Contact: bhearsum → scabral
Version: unspecified → other
Whiteboard: [2014q2]
Hrm, applying a function to timestamp prevents the index from being used. Let's change that:

select count(name) FROM releases inner join releases_history using (name) LEFT JOIN rules ON (name=mapping) WHERE rules.mapping is null AND name not like '%build%' and name not like '%latest%' and timestamp<1000*unix_timestamp(now()-interval 1 month);
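
To confirm the rewritten predicate can actually use an index, a quick hypothetical check (this assumes releases_history has an index on timestamp - look for it in the key column of the plan):

EXPLAIN select count(name) FROM releases inner join releases_history using (name) LEFT JOIN rules ON (name=mapping) WHERE rules.mapping is null AND name not like '%build%' and name not like '%latest%' and timestamp<1000*unix_timestamp(now()-interval 1 month);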

Here's a sample of what's coming up - looks good to me:

mysql> select name,from_unixtime(timestamp/1000) FROM releases inner join releases_history using (name) LEFT JOIN rules ON (name=mapping) WHERE rules.mapping is null AND name not like '%build%' and name not like '%latest%' and timestamp<1000*unix_timestamp(now()-interval 1 month) limit 5;
+------------------------------------------------+-------------------------------+
| name                                           | from_unixtime(timestamp/1000) |
+------------------------------------------------+-------------------------------+
| Firefox-mozilla-central-nightly-20130927030202 | 2013-09-27 12:57:13.5820      |
| Firefox-mozilla-central-nightly-20130927030202 | 2013-09-27 12:57:13.5830      |
| Firefox-mozilla-central-nightly-20130927030202 | 2013-09-27 12:57:13.5930      |
| Firefox-ux-nightly-20130927040201              | 2013-09-27 12:58:05.4330      |
| Firefox-ux-nightly-20130927040201              | 2013-09-27 12:58:05.4340      |
+------------------------------------------------+-------------------------------+
5 rows in set (1.31 sec)

Now we just need to automate deleting those records :D
Assignee: server-ops-database → scabral
Note that in this first iteration, I am leaving NULL values of timestamp alone.

Great, so we need to delete matching rows in releases and releases_history: older than 6 months for production, older than 1 month for dev.

DELETE can support multiple tables, but not with a LIMIT, and right now there's a lot of data. So I think the best way to go through this is to use a cursor, fetching the primary keys from the releases and releases_history tables via:

select releases.name,releases_history.change_id FROM releases inner join releases_history using (name) LEFT JOIN rules ON (name=mapping) WHERE rules.mapping is null AND name not like '%build%' and name not like '%latest%' and timestamp<1000*unix_timestamp(now()-interval 1 month);

To make this portable, I can:
0) make a stored procedure that lives on both dev and prod (and stage if that's needed), which takes a number of months as a parameter,
1) then create an event on dev/prod that calls the stored procedure once a week, deleting everything older than 1 or 6 months, depending on the server, and
2) finally, monitor the age of the oldest record, and e-mail if the oldest record is older than 2 or 7 months (depending on the server).
Whiteboard: [2014q2] → [2014q2] June
Here's the stored procedure that's working on dev:

drop procedure if exists clean_aus_data;
DELIMITER |
CREATE PROCEDURE clean_aus_data (IN purge_months TINYINT)
BEGIN
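-- Purge nightly data older than purge_months months: delete matching rows
-- from releases (by name) and releases_history (by change_id), skipping
-- names that look like real releases ('%build%'/'%latest%') and anything
-- a rules.mapping still points at.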
DECLARE delete_from_releases VARCHAR(100);
DECLARE delete_from_rel_hist INT;
DECLARE done TINYINT UNSIGNED DEFAULT 0;
DECLARE record_to_delete CURSOR FOR
SELECT releases.name,
releases_history.change_id
FROM releases INNER JOIN releases_history USING (name) 
LEFT JOIN rules ON (name=mapping) 
WHERE rules.mapping IS NULL AND name NOT LIKE '%build%' 
AND name NOT LIKE '%latest%' 
AND timestamp<1000*UNIX_TIMESTAMP(NOW()-INTERVAL purge_months MONTH);

DECLARE CONTINUE HANDLER FOR NOT FOUND SET done=1;

OPEN record_to_delete;
WHILE done=0 DO
FETCH NEXT FROM record_to_delete INTO delete_from_releases,delete_from_rel_hist;
-- Guard: when the cursor is exhausted, the handler sets done=1 but the loop
-- body still runs, so skip the deletes to avoid re-deleting the last row.
IF done=0 THEN
-- SELECT delete_from_releases, delete_from_rel_hist;
SET @delete_sql=CONCAT("DELETE FROM releases WHERE name='",delete_from_releases,"';");
-- SELECT @delete_sql;
PREPARE stmt1 FROM @delete_sql;
EXECUTE stmt1;
DEALLOCATE PREPARE stmt1;
SET @delete_sql=CONCAT("DELETE FROM releases_history WHERE change_id='",delete_from_rel_hist,"';");
-- SELECT @delete_sql;
PREPARE stmt1 FROM @delete_sql;
EXECUTE stmt1;
DEALLOCATE PREPARE stmt1;
END IF;
END WHILE;
CLOSE record_to_delete;
END
|
DELIMITER ;

I called it with:
CALL clean_aus_data(11);
And it worked - it moved the minimum timestamp from:
2013-06-21 07:23:46.3940 
to:
2013-07-10 08:55:19.4560
Now I'm working month by month to delete the data on dev:

CALL clean_aus_data(10);
CALL clean_aus_data(9);
CALL clean_aus_data(8);
etc.

I'll stop at 2 months, then defragment dev.
I have put the stored procedure on production, and am deleting records older than 8 months.

The deletion on dev is still going on, 15 minutes and counting (I'm checking it; it's deleting the proper records).
Verified aus4_stage is empty; I'm not putting the stored procedure on there.
dev finally finished deleting records older than 10 months; it took 25 minutes. Will continue to delete records from dev. We have backups if anything goes awry.
Production is now purged down to the last 6 months:

mysql> select @@hostname; CALL clean_aus_data(6);
+----------------------------+
| @@hostname                 |
+----------------------------+
| aus4-1.db.phx1.mozilla.com |
+----------------------------+
1 row in set (0.00 sec)

Query OK, 0 rows affected (3 min 48.87 sec)

Min datetime in releases_history: 2013-12-10 09:17:04.5900  

I have defragmented releases, bringing it from 228M to 156M - a reduction of roughly a third. (The "Table does not support optimize" note below is normal for InnoDB; OPTIMIZE is implemented as recreate + analyze.)
mysql> optimize table releases;
+---------------------------+----------+----------+-------------------------------------------------------------------+
| Table                     | Op       | Msg_type | Msg_text                                                          |
+---------------------------+----------+----------+-------------------------------------------------------------------+
| aus4_mozilla_org.releases | optimize | note     | Table does not support optimize, doing recreate + analyze instead |
| aus4_mozilla_org.releases | optimize | status   | OK                                                                |
+---------------------------+----------+----------+-------------------------------------------------------------------+
2 rows in set (4.49 sec)


Can we set aside time to defragment releases_history? The releases_history table is 83G, and it should shrink a lot when we defragment, because of the deletions - I'd expect comparable shrinkage, probably down to 41G or so. I'd want a window of 30 mins, just in case (I'd expect it to take 10 mins).
purge of dev complete (took 49 mins)

Defrag'ing. Before:
-rw-rw---- 1 mysql mysql 480M Jun  9 19:11 releases.ibd
-rw-rw---- 1 mysql mysql  39G Jun  9 19:11 releases_history.ibd

After:
-rw-rw---- 1 mysql mysql 340M Jun  9 20:22 releases.ibd
-rw-rw---- 1 mysql mysql  29G Jun  9 21:35 /var/lib/mysql/aus4_dev/releases_history.ibd
Took aus4-2 out of the load balancer; defrag'ing the table to see how long it takes.

mysql> \! du -sh releases_history.ibd
85G     releases_history.ibd
mysql> set sql_log_bin=0;
Query OK, 0 rows affected (0.00 sec)

mysql> optimize table aus4_mozilla_org.releases_history;
Not much of a change on aus4-2:
78G     /var/lib/mysql/aus4_mozilla_org/releases_history.ibd
aus4-2 is back in the load balancer :D
Steps left to do:

1) Make an event on dev/prod that calls the stored procedure once a week, deleting everything older than 1 or 6 months, depending on the server.
2) Monitor the age of the oldest record, and e-mail if the oldest record is older than 2 or 7 months (depending on the server).
3) Put the stored procedure and events into puppet to ensure they're on the right servers, and that the code is backed up.
1) Events created:
on dev:
CREATE EVENT aus4_lifecycle ON SCHEDULE EVERY 1 WEEK DO CALL clean_aus_data(1);

on production:
CREATE EVENT aus4_lifecycle ON SCHEDULE EVERY 1 WEEK DO CALL clean_aus_data(6);



2) Monitoring check info - the check should monitor:
select datediff(CURRENT_DATE,DATE(min(from_unixtime(timestamp/1000)))) FROM releases RIGHT JOIN releases_history USING (name)  LEFT JOIN rules ON (name=mapping)  WHERE rules.mapping IS NULL AND name NOT LIKE '%build%'  AND name NOT LIKE '%latest%';

Should be no greater than 60 on dev
Should be no greater than 7*30 (210) on production

3) puppetize stored procedure and events, making sure they're disabled on the slave(s) and enabled on the master.
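
As a hedged sketch of what that puppetization needs to guarantee (standard MySQL statements; not something already applied in this bug): the event scheduler has to be running on the master for aus4_lifecycle to fire, and the event must not also run on replicas.

-- On the master:
SET GLOBAL event_scheduler = ON;
-- On a replica where the event was created directly (events that arrive via
-- replication are slave-side disabled automatically):
ALTER EVENT aus4_lifecycle DISABLE ON SLAVE;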
The monitoring check is in place; it's at https://nagios.mozilla.org/phx1/cgi-bin/extinfo.cgi?type=2&host=aus4-1.db.phx1.mozilla.com&service=MySQL+data+lifecycle+check+for+aus4

Need to puppetize the stored procedure and events, but that's a bigger topic, and the functionality and monitoring exist, so I'm going to consider this closed.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Awesome, thanks Sheeri.
Product: mozilla.org → Data & BI Services Team