Closed Bug 450674 Opened 16 years ago Closed 16 years ago

Remove old test data

Categories: Webtools Graveyard :: Graph Server (defect; priority not set, severity major)

Tracking: (Not tracked)

Status: RESOLVED FIXED

People: (Reporter: rdoherty, Assigned: xb95)

A quick win for performance could be removing old test data.

First ideas are removing/archiving anything older than 2 months. Need to get some feedback first from users before removing data.
I think that the continuous graph data should persist (those are the graphs that most people use for regression tracking).  They are also relatively small and grow at a reasonable rate.

It's the fine-grained data for the individual tests that gets out of hand.  These would be the individual page load times, memory size sampled every second, CPU usage sampled every second, etc.  Since we end up maintaining the average value of each of these sets of data in the continuous graph, I think we could reasonably age out the raw data after a few months.

In this scheme we would keep around a track record of the health of builds (and be able to monitor branching points, regressions, etc) while removing a lot of cruft that nobody is looking at.
Target Milestone: 0.4 → Future
Severity: normal → critical
Target Milestone: Future → 0.5
Adding Mark as we'll be working with him on this.
Reassigning to Mark after he proved his scripting skillz were far superior to mine.

Basic plan is:
1) Make backup
2) Delete data older than 60 days AND type = 'discrete'

Thanks Mark!
Assignee: rdoherty → mark
Severity: critical → major
Target Milestone: 0.5 → 0.6
When am I okay to do this?  Once I get the backup, can I start immediately?

Also, we talked about setting up a cron job to do it every night.  For now, you don't want to do that, correct?  We'll just purge once, see how it goes, and maybe cron it later?
FYI

mysql> select count(*) from dataset_branchinfo bi, dataset_info i where bi.dataset_id = i.id and i.date < unix_timestamp() - 86400*60;
+-----------+
| count(*)  |
+-----------+
| 765774246 |
+-----------+
1 row in set (18 min 28.06 sec)

This is 75% of the content in the graphs database.  (Which means that 25% of the content of the database has been created in the last two months... and since the oldest data is older than that... it's gone to plaid.)
(In reply to comment #4)
> When am I okay to do this?  Once I get the backup, can I start immediately?

Let's start dumping data on Tues 9/16 unless we have any big complaints when we announce it on Monday. Since we don't have a good list of everyone who uses the graph server, we'll have to announce it during the all-hands.

> Also we talked about making a cron to do it every night.  For now, you don't
> want to do that, correct?  We'll just purge and see how it goes once, and maybe
> cron it later?

Yes, correct.
Mark, can you list/attach the queries you'll be using to dump the data here? We'll need to verify we're removing the data we don't want and we're not missing anything.
Timestamp: 1216278000 in epoch time, aka 2008-07-17 00:00:00 PDT (07:00:00 UTC).

Step 1)

[root@mrdb05 graphs-dump]# echo "SELECT id FROM dataset_info WHERE date < 1216278000 AND type = 'discrete';" | mysql graphs_mozilla_org > old-ids
[root@mrdb05 graphs-dump]# wc -l old-ids
1172834 old-ids

So there are 1,172,834 datasets that are discrete and have a timestamp more than 60 days old.  These are the items to remove.

Step 2)

Short Perl script that does the following:

    # get all the values used for this set
    SELECT time FROM dataset_branchinfo WHERE dataset_id = ?

    # now remove those rows everywhere
    DELETE FROM dataset_branchinfo WHERE dataset_id = ? AND time IN (?)
    DELETE FROM dataset_values WHERE dataset_id = ? AND time IN (?)
    DELETE FROM dataset_extra_data WHERE dataset_id = ? AND time IN (?)

Step 3)

Record the dataset_id and the times deleted in this format:

    dataset_id:time,time,time,...
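
For concreteness, here's a minimal sketch of what that script could look like (illustrative only, not the script that was actually run: the DSN, credentials, file names, and throttle are placeholders; the tables and the log format come from the steps above):

    #!/usr/bin/perl
    # Illustrative sketch -- not the actual script.  Reads the ids from
    # step 1 out of 'old-ids', deletes by (dataset_id, time), and logs
    # what was removed in the "dataset_id:time,time,..." format above.
    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('DBI:mysql:database=graphs_mozilla_org',
                           'user', 'pass', { RaiseError => 1 });  # placeholders

    open my $ids, '<',  'old-ids'     or die "old-ids: $!";
    open my $log, '>>', 'deleted.log' or die "deleted.log: $!";

    while (my $id = <$ids>) {
        chomp $id;
        next unless $id =~ /^\d+$/;    # skip mysql's column-header line

        # step 2: get all the time values used for this set
        my $times = $dbh->selectcol_arrayref(
            'SELECT DISTINCT time FROM dataset_branchinfo WHERE dataset_id = ?',
            undef, $id);
        next unless @$times;

        # now remove those rows everywhere, by (dataset_id, time)
        my $in = join ',', ('?') x @$times;
        for my $table (qw(dataset_branchinfo dataset_values dataset_extra_data)) {
            $dbh->do("DELETE FROM $table WHERE dataset_id = ? AND time IN ($in)",
                     undef, $id, @$times);
        }

        # step 3: record what was removed, for recovery from backup
        print $log "$id:" . join(',', @$times) . "\n";

        sleep 1 if $. % 100 == 0;      # crude throttle so the db keeps up
    }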

----

That's the process.  The logic for deleting only by the tuple (dataset_id, time) is that there seem to be duplicate uses of dataset_id in the dataset_extra_data table.  For example:

mysql> select * from dataset_extra_data where dataset_id = 1;
+------------+------------+------+
| dataset_id | time       | data |
+------------+------------+------+
|          1 |          0 | ms   |
|          1 |          0 | ms   |
|          1 |          1 | ms   |
|          1 |          1 | ms   |
|          1 |          2 | ms   |
|          1 |          2 | ms   |
|          1 |          3 | ms   |
|          1 |          3 | ms   |
|          1 |          4 | ms   |
|          1 |          4 | ms   |
|          1 | 1181712240 | --   |
|          1 | 1181714460 | --   |
|          1 | 1181716860 | --   |
|          1 | 1181719020 | --   |
|          1 | 1181721240 | --   |
...
+------------+------------+------+
1136 rows in set (0.01 sec)

I don't know what creates those actually-timestamped rows, or whether they're used for something else, so it might be safer to just leave them alone unless you can vouch that they're safe to remove.

If you approve this process, I'll get the ball rolling.  I've already taken a backup from last week that we can use to restore the data being removed if needed.  (Since the data before that timestamp shouldn't be changing, unless you're actually inserting data with an old timestamp for some reason...)
Oh, and for the record, the recovery process in case we blow away data and need it again:

1) restore the backup to a temporary location somewhere
2) use the log generated in the third step of the original process to select out data
3) re-insert this data into the production database

It's not an amazingly fast process - it will take time to restore the database and write a script to do the above.  But the recovery process is there in case we need to do it.
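
As a sketch of what steps 2 and 3 could look like (assuming the backup gets restored into a database named graphs_restore -- a made-up name -- and a log in the dataset_id:time,time,... format from step 3):

    #!/usr/bin/perl
    # Illustrative recovery sketch; database names and credentials are
    # placeholders.  Copies the logged (dataset_id, time) rows from the
    # restored backup back into production (assumes identical schemas).
    use strict;
    use warnings;
    use DBI;

    my $backup = DBI->connect('DBI:mysql:database=graphs_restore',
                              'user', 'pass', { RaiseError => 1 });
    my $prod   = DBI->connect('DBI:mysql:database=graphs_mozilla_org',
                              'user', 'pass', { RaiseError => 1 });

    open my $log, '<', 'deleted.log' or die "deleted.log: $!";
    while (my $line = <$log>) {
        chomp $line;
        my ($id, $timelist) = split /:/, $line, 2;
        my @times = split /,/, $timelist;
        next unless @times;
        my $in = join ',', ('?') x @times;

        for my $table (qw(dataset_branchinfo dataset_values dataset_extra_data)) {
            # pull the deleted rows out of the restored backup...
            my $rows = $backup->selectall_arrayref(
                "SELECT * FROM $table WHERE dataset_id = ? AND time IN ($in)",
                undef, $id, @times);
            # ...and re-insert them into production
            for my $row (@$rows) {
                my $ph = join ',', ('?') x @$row;
                $prod->do("INSERT INTO $table VALUES ($ph)", undef, @$row);
            }
        }
    }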
I'd be more comfortable with determining a set of ids from dataset_info which are within the time range, and then removing all data associated with those ids from dataset_branchinfo/dataset_extra_data/dataset_values/etc.

The meaning of 'time' differs across the various tables, and I can only guarantee that we'll eliminate the data we're interested in if we base it on the date provided for a given dataset in dataset_info.
Yes, step 1 is getting the ids from dataset_info.  That is as you say.

However, we ALSO need to get the 'times' from dataset_branchinfo (or dataset_values) in order to determine which rows to remove from dataset_extra_data.  As it is now, there are lots of rows in dataset_extra_data that do not appear to correspond to rows in dataset_values or dataset_branchinfo.  I am not sure whether we want to remove these rows or not.

See my example: note that the top 10 rows have "time" values of 0, 1, 2, 3, 4, which correspond to the iterations of the tests.  These are the same values as in the dataset_branchinfo and dataset_values tables.

However, the last ~1000 rows have time values that do NOT correspond to anything in the other tables.  What are these values for?  If we can delete them, I'm all for it, but I'm not sure.
There shouldn't be any of those duplicate entries - so I can't say when/how they were created.  I'd be comfortable offing them as they seem like db corruption to me.  

Could you tell me more about a dataset that has those points?  That might help me track down how they got into the db in the first place.
Well, I'm not too worried about the dupes.  Those are okay.  It's the ones with a real time value that I care about:

|          1 | 1181712240 | --   |

What are those?  Why are they there?  Compare the other tables for this dataset:

mysql> select id, type, machine, test, test_type, date
       from dataset_info where id = 1;
+----+----------+----------+------+-----------+------------+
| id | type     | machine  | test | test_type | date       |
+----+----------+----------+------+-----------+------------+
|  1 | discrete | qm-pxp01 | ts   | perf      | 1180657203 |
+----+----------+----------+------+-----------+------------+
1 row in set (0.00 sec)

mysql> select * from dataset_values where dataset_id = 1;
+------------+------+-------+
| dataset_id | time | value |
+------------+------+-------+
|          1 |    0 |  2281 |
|          1 |    0 |  2281 |
|          1 |    1 |  2250 |
|          1 |    1 |  2250 |
|          1 |    2 |  2235 |
|          1 |    2 |  2235 |
|          1 |    3 |  2266 |
|          1 |    3 |  2266 |
|          1 |    4 |  2297 |
|          1 |    4 |  2297 |
+------------+------+-------+
10 rows in set (0.02 sec)

mysql> select * from dataset_branchinfo where dataset_id = 1;
+------------+------+------------+
| dataset_id | time | branchid   |
+------------+------+------------+
|          1 |    0 | 2007053104 |
|          1 |    1 | 2007053104 |
|          1 |    2 | 2007053104 |
|          1 |    3 | 2007053104 |
|          1 |    4 | 2007053104 |
|          1 |    0 | 2007053104 |
|          1 |    1 | 2007053104 |
|          1 |    2 | 2007053104 |
|          1 |    3 | 2007053104 |
|          1 |    4 | 2007053104 |
+------------+------+------------+
10 rows in set (0.05 sec)

mysql> select * from dataset_extra_data where dataset_id = 1 limit 30;
+------------+------------+------+
| dataset_id | time       | data |
+------------+------------+------+
|          1 |          0 | ms   |
|          1 |          0 | ms   |
|          1 |          1 | ms   |
|          1 |          1 | ms   |
|          1 |          2 | ms   |
|          1 |          2 | ms   |
|          1 |          3 | ms   |
|          1 |          3 | ms   |
|          1 |          4 | ms   |
|          1 |          4 | ms   |
|          1 | 1181712240 | --   |
|          1 | 1181714460 | --   |
|          1 | 1181716860 | --   |
|          1 | 1181719020 | --   |
|          1 | 1181721240 | --   |
...
|          1 | 1184303100 | --   |
|          1 | 1184305680 | --   |
+------------+------------+------+
1136 rows in set (0.03 sec)

See the difference?  I'm not sure why those extra rows in dataset_extra_data are there.  Should they be removed too, or left?
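
If it helps, something along these lines (an untested sketch) should enumerate exactly those rows -- the dataset_extra_data rows with no matching (dataset_id, time) in dataset_values:

    SELECT ed.dataset_id, ed.time, ed.data
      FROM dataset_extra_data ed
      LEFT JOIN dataset_values v
        ON v.dataset_id = ed.dataset_id AND v.time = ed.time
     WHERE ed.dataset_id = 1
       AND v.dataset_id IS NULL;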
Per in-person discussion, pulling the trigger on this.

I am keeping a log of every dataset_id that is being removed.  The script ensures that a dataset is both type = 'discrete' and dated more than 60 days ago.

Initial estimates put this at 2-3 weeks to complete.  I can speed it up by some amount if it turns out the databases aren't having any trouble keeping up.  If they do end up having issues, I will have to slow it down.

The SQL backup I took last week is at mrdb-backup01:/data/backups/one-offs in case we need to recover this data.

And so it begins:

-I- 0.00% - 93451 rows deleted
-I- 0.00% - 37.48d - 222407 rows deleted
-I- 0.00% - 41.81d - 316572 rows deleted
-I- 0.00% - 43.81d - 380114 rows deleted
-I- 0.01% - 43.30d - 443896 rows deleted
-I- 0.01% - 42.93d - 508062 rows deleted
-I- 0.01% - 41.62d - 572884 rows deleted
-I- 0.01% - 40.46d - 636942 rows deleted
-I- 0.01% - 39.80d - 700964 rows deleted
-I- 0.01% - 38.89d - 765334 rows deleted
-I- 0.01% - 38.48d - 829124 rows deleted
-I- 0.01% - 38.13d - 893026 rows deleted
-I- 0.01% - 38.90d - 957756 rows deleted
-I- 0.01% - 41.86d - 1021658 rows deleted
-I- 0.01% - 41.79d - 1085824 rows deleted
On second thought, Alice recommends this be mentioned more widely first.  I've stopped the process and posted to my blog which will get picked up on Planet.  Sometime tomorrow I will get this rolling again.
Target Milestone: 0.6 → ---
This is done, but it's not visible yet: the deletion ran on the new master, which hasn't been put into production yet.  Current schedule is for the new master to be in production on Thursday, October 2nd.

Resolving this as the work is done, but it can't be verified until the new master slots in.
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Product: Webtools → Webtools Graveyard