Closed Bug 450674 Opened 16 years ago Closed 16 years ago

Remove old test data

Categories: Webtools Graveyard :: Graph Server (defect; priority not set, severity major)

Tracking: (Not tracked)

Status: RESOLVED FIXED

People: (Reporter: rdoherty, Assigned: xb95)

A quick win for performance could be removing old test data.

First ideas are removing/archiving anything older than 2 months. Need to get some feedback first from users before removing data.
I think that the continuous graph data should persist (those are the graphs that most people use for regression tracking).  They are also relatively small and grow at a reasonable rate.

It's the fine-grained data for the individual tests that gets out of hand.  These would be the individual page load times, memory size sampled every second, CPU usage sampled every second, etc.  Since we end up maintaining the average value of each of these sets of data in the continuous graph, I think we could reasonably age out the raw data after a few months.

In this scheme we would keep around a track record of the health of builds (and be able to monitor branching points, regressions, etc) while removing a lot of cruft that nobody is looking at.
Target Milestone: 0.4 → Future
Severity: normal → critical
Target Milestone: Future → 0.5
Adding Mark as we'll be working with him on this.
Reassigning to Mark after he proved his scripting skillz were far superior to mine.

Basic plan is:
1) Make backup
2) Delete data older than 60 days AND type = 'discrete'

Thanks Mark!
Assignee: rdoherty → mark
Severity: critical → major
Target Milestone: 0.5 → 0.6
When am I okay to do this?  Once I get the backup, can I start immediately?

Also, we talked about setting up a cron job to do it every night.  For now, you don't want to do that, correct?  We'll just purge once, see how it goes, and maybe cron it later?
FYI

mysql> select count(*) from dataset_branchinfo bi, dataset_info i where bi.dataset_id = i.id and i.date < unix_timestamp() - 86400*60;
+-----------+
| count(*)  |
+-----------+
| 765774246 |
+-----------+
1 row in set (18 min 28.06 sec)

This is 75% of the content in the graphs database.  (Which means that 25% of the content of the database has been created in the last two months... and since the oldest data is older than that... it's gone to plaid.)
(In reply to comment #4)
> When am I okay to do this?  Once I get the backup, can I start immediately?

Let's start dumping data on Tues 9/16 unless we have any big complaints when we announce it on Monday. Since we don't have a good list of everyone who uses the graph server, we'll have to announce it during the all-hands.

> Also we talked about making a cron to do it every night.  For now, you don't
> want to do that, correct?  We'll just purge and see how it goes once, and maybe
> cron it later?

Yes, correct.
Mark, can you list/attach the queries you'll be using to dump the data here? We'll need to verify we're removing the data we don't want and we're not missing anything.
Timestamp: 1216278000 in epoch time, aka 2008-07-17 00:00:00 PDT (07:00:00 UTC).

Step 1)

[root@mrdb05 graphs-dump]# echo "SELECT id FROM dataset_info WHERE date < 1216278000 AND type = 'discrete';" | mysql graphs_mozilla_org > old-ids
[root@mrdb05 graphs-dump]# wc -l old-ids
1172834 old-ids

So there are 1,172,834 datasets that are discrete and have a timestamp more than 60 days old.  These are the items to remove.

Step 2)

Short Perl script that does the following:

    # get all the values used for this set
    SELECT time FROM dataset_branchinfo WHERE dataset_id = ?

    # now remove those rows everywhere
    DELETE FROM dataset_branchinfo WHERE dataset_id = ? AND time IN (?)
    DELETE FROM dataset_values WHERE dataset_id = ? AND time IN (?)
    DELETE FROM dataset_extra_data WHERE dataset_id = ? AND time IN (?)

Step 3)

Record the dataset_id and the times deleted in this format:

    dataset_id:time,time,time,...
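
For concreteness, here's a minimal sketch of what that script could look like (illustrative only, not the script that was actually run: the DSN, credentials, file names, and throttle are placeholders; the tables and the log format come from the steps above):

    #!/usr/bin/perl
    # Illustrative sketch -- not the actual script.  Reads the ids from
    # step 1 out of 'old-ids', deletes by (dataset_id, time), and logs
    # what was removed in the "dataset_id:time,time,..." format above.
    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('DBI:mysql:database=graphs_mozilla_org',
                           'user', 'pass', { RaiseError => 1 });  # placeholders

    open my $ids, '<',  'old-ids'     or die "old-ids: $!";
    open my $log, '>>', 'deleted.log' or die "deleted.log: $!";

    while (my $id = <$ids>) {
        chomp $id;
        next unless $id =~ /^\d+$/;    # skip mysql's column-header line

        # step 2: get all the time values used for this set
        my $times = $dbh->selectcol_arrayref(
            'SELECT DISTINCT time FROM dataset_branchinfo WHERE dataset_id = ?',
            undef, $id);
        next unless @$times;

        # now remove those rows everywhere, by (dataset_id, time)
        my $in = join ',', ('?') x @$times;
        for my $table (qw(dataset_branchinfo dataset_values dataset_extra_data)) {
            $dbh->do("DELETE FROM $table WHERE dataset_id = ? AND time IN ($in)",
                     undef, $id, @$times);
        }

        # step 3: record what was removed, for recovery from backup
        print $log "$id:" . join(',', @$times) . "\n";

        sleep 1 if $. % 100 == 0;      # crude throttle so the db keeps up
    }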

----

That's the process.  The logic for deleting only by the tuple (dataset_id, time) is that there seem to be duplicate uses of dataset_id in the dataset_extra_data table.  For example:

mysql> select * from dataset_extra_data where dataset_id = 1;
+------------+------------+------+
| dataset_id | time       | data |
+------------+------------+------+
|          1 |          0 | ms   |
|          1 |          0 | ms   |
|          1 |          1 | ms   |
|          1 |          1 | ms   |
|          1 |          2 | ms   |
|          1 |          2 | ms   |
|          1 |          3 | ms   |
|          1 |          3 | ms   |
|          1 |          4 | ms   |
|          1 |          4 | ms   |
|          1 | 1181712240 | --   |
|          1 | 1181714460 | --   |
|          1 | 1181716860 | --   |
|          1 | 1181719020 | --   |
|          1 | 1181721240 | --   |
...
+------------+------------+------+
1136 rows in set (0.01 sec)

I don't know what creates those actually-timestamped rows, or whether they're used for something else, so it might be safer to just leave them alone unless you can vouch that they're safe to remove.

If you approve this process, I'll get the ball rolling.  I've already taken a backup from last week that we can use to restore the data being removed if needed.  (Since the data before that timestamp shouldn't be changing, unless you're actually inserting data with an old timestamp for some reason...)
Oh, and for the record, the recovery process in case we blow away data and need it again:

1) restore the backup to a temporary location somewhere
2) use the log generated in the third step of the original process to select out data
3) re-insert this data into the production database

It's not an amazingly fast process - it will take time to restore the database and write a script to do the above.  But the recovery process is there in case we need to do it.
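
As a sketch of what steps 2 and 3 could look like (assuming the backup gets restored into a database named graphs_restore -- a made-up name -- and a log in the dataset_id:time,time,... format from step 3):

    #!/usr/bin/perl
    # Illustrative recovery sketch; database names and credentials are
    # placeholders.  Copies the logged (dataset_id, time) rows from the
    # restored backup back into production (assumes identical schemas).
    use strict;
    use warnings;
    use DBI;

    my $backup = DBI->connect('DBI:mysql:database=graphs_restore',
                              'user', 'pass', { RaiseError => 1 });
    my $prod   = DBI->connect('DBI:mysql:database=graphs_mozilla_org',
                              'user', 'pass', { RaiseError => 1 });

    open my $log, '<', 'deleted.log' or die "deleted.log: $!";
    while (my $line = <$log>) {
        chomp $line;
        my ($id, $timelist) = split /:/, $line, 2;
        my @times = split /,/, $timelist;
        next unless @times;
        my $in = join ',', ('?') x @times;

        for my $table (qw(dataset_branchinfo dataset_values dataset_extra_data)) {
            # pull the deleted rows out of the restored backup...
            my $rows = $backup->selectall_arrayref(
                "SELECT * FROM $table WHERE dataset_id = ? AND time IN ($in)",
                undef, $id, @times);
            # ...and re-insert them into production
            for my $row (@$rows) {
                my $ph = join ',', ('?') x @$row;
                $prod->do("INSERT INTO $table VALUES ($ph)", undef, @$row);
            }
        }
    }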
I'd be more comfortable with determining a set of ids from dataset_info which are within the time range, and then removing all data associated with those ids from dataset_branchinfo/dataset_extra_data/dataset_values/etc.

The meaning of 'time' differs across the various tables, and I can only guarantee that we'll eliminate the data we're interested in if we base it on the date provided for a given dataset in dataset_info.
Yes, step 1 is getting the ids from dataset_info.  That is as you say.

However, we ALSO need to get the 'times' from dataset_branchinfo (or dataset_values) in order to determine which rows to remove from dataset_extra_data.  As it is now, there are lots of rows in dataset_extra_data that do not appear to correspond to rows in dataset_values or dataset_branchinfo.  I am not sure whether we want to remove these rows or not.

See my example: note that the top 10 rows have "time" values of 0, 1, 2, 3, 4, which correspond to the iterations of the tests.  These are the same values as in the dataset_branchinfo and dataset_values tables.

However, the last ~1000 rows have time values that do NOT correspond to anything in the other tables.  What are these values for?  If we can delete them, I'm all for it, but I'm not sure.
There shouldn't be any of those duplicate entries - so I can't say when/how they were created.  I'd be comfortable offing them as they seem like db corruption to me.  

Could you tell me more about a dataset that has those points?  That might help me track down how they got into the db in the first place.
Well, I'm not too worried about the dupes.  Those are okay.  It's the ones with a real time value that I care about:

|          1 | 1181712240 | --   |

What are those?  Why are they there?  Compare the other tables for this dataset:

mysql> select id, type, machine, test, test_type, date
       from dataset_info where id = 1;
+----+----------+----------+------+-----------+------------+
| id | type     | machine  | test | test_type | date       |
+----+----------+----------+------+-----------+------------+
|  1 | discrete | qm-pxp01 | ts   | perf      | 1180657203 |
+----+----------+----------+------+-----------+------------+
1 row in set (0.00 sec)

mysql> select * from dataset_values where dataset_id = 1;
+------------+------+-------+
| dataset_id | time | value |
+------------+------+-------+
|          1 |    0 |  2281 |
|          1 |    0 |  2281 |
|          1 |    1 |  2250 |
|          1 |    1 |  2250 |
|          1 |    2 |  2235 |
|          1 |    2 |  2235 |
|          1 |    3 |  2266 |
|          1 |    3 |  2266 |
|          1 |    4 |  2297 |
|          1 |    4 |  2297 |
+------------+------+-------+
10 rows in set (0.02 sec)

mysql> select * from dataset_branchinfo where dataset_id = 1;
+------------+------+------------+
| dataset_id | time | branchid   |
+------------+------+------------+
|          1 |    0 | 2007053104 |
|          1 |    1 | 2007053104 |
|          1 |    2 | 2007053104 |
|          1 |    3 | 2007053104 |
|          1 |    4 | 2007053104 |
|          1 |    0 | 2007053104 |
|          1 |    1 | 2007053104 |
|          1 |    2 | 2007053104 |
|          1 |    3 | 2007053104 |
|          1 |    4 | 2007053104 |
+------------+------+------------+
10 rows in set (0.05 sec)

mysql> select * from dataset_extra_data where dataset_id = 1 limit 30;
+------------+------------+------+
| dataset_id | time       | data |
+------------+------------+------+
|          1 |          0 | ms   |
|          1 |          0 | ms   |
|          1 |          1 | ms   |
|          1 |          1 | ms   |
|          1 |          2 | ms   |
|          1 |          2 | ms   |
|          1 |          3 | ms   |
|          1 |          3 | ms   |
|          1 |          4 | ms   |
|          1 |          4 | ms   |
|          1 | 1181712240 | --   |
|          1 | 1181714460 | --   |
|          1 | 1181716860 | --   |
|          1 | 1181719020 | --   |
|          1 | 1181721240 | --   |
...
|          1 | 1184303100 | --   |
|          1 | 1184305680 | --   |
+------------+------------+------+
1136 rows in set (0.03 sec)

See the difference?  I'm not sure why those extra rows in dataset_extra_data are there.  Should they be removed too, or left?
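
If it helps, something along these lines (an untested sketch) should enumerate exactly those rows -- the dataset_extra_data rows with no matching (dataset_id, time) in dataset_values:

    SELECT ed.dataset_id, ed.time, ed.data
      FROM dataset_extra_data ed
      LEFT JOIN dataset_values v
        ON v.dataset_id = ed.dataset_id AND v.time = ed.time
     WHERE ed.dataset_id = 1
       AND v.dataset_id IS NULL;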
Per in-person discussion, pulling the trigger on this.

I am keeping a log of every dataset_id that is being removed.  The script ensures that a dataset is both type = 'discrete' and dated more than 60 days ago.

Initial estimates put this at 2-3 weeks to complete.  I can speed it up by some amount if it turns out the databases aren't having any trouble keeping up.  If they do end up having issues, I will have to slow it down.

The SQL backup I took last week is at mrdb-backup01:/data/backups/one-offs in case we need to recover this data.

And so it begins:

-I- 0.00% - 93451 rows deleted
-I- 0.00% - 37.48d - 222407 rows deleted
-I- 0.00% - 41.81d - 316572 rows deleted
-I- 0.00% - 43.81d - 380114 rows deleted
-I- 0.01% - 43.30d - 443896 rows deleted
-I- 0.01% - 42.93d - 508062 rows deleted
-I- 0.01% - 41.62d - 572884 rows deleted
-I- 0.01% - 40.46d - 636942 rows deleted
-I- 0.01% - 39.80d - 700964 rows deleted
-I- 0.01% - 38.89d - 765334 rows deleted
-I- 0.01% - 38.48d - 829124 rows deleted
-I- 0.01% - 38.13d - 893026 rows deleted
-I- 0.01% - 38.90d - 957756 rows deleted
-I- 0.01% - 41.86d - 1021658 rows deleted
-I- 0.01% - 41.79d - 1085824 rows deleted
On second thought, Alice recommends this be mentioned more widely first.  I've stopped the process and posted to my blog which will get picked up on Planet.  Sometime tomorrow I will get this rolling again.
Target Milestone: 0.6 → ---
This is done, but it's not visible yet: the deletion ran on the new master, which hasn't been put into production yet.  Current schedule is for the new master to be in production on Thursday, October 2nd.

Resolving this as the work is done, but it can't be verified until the new master slots in.
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Product: Webtools → Webtools Graveyard