
Status

Product: Release Engineering
Component: General Automation
Priority: P2
Severity: normal
Status: RESOLVED FIXED
Reported: 7 years ago
Last modified: 3 years ago

People

(Reporter: catlee, Assigned: catlee)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [database][qa][buildbot])

Attachments

(3 attachments)

(Assignee)

Description

7 years ago
Bad Things happen, and we sometimes end up with stale data in the DB, e.g.:
- Rows in buildrequests that will never be run if we disable old builders or rename things
- Builds that never get marked as finished if the master crashes

We should periodically find and clean up these kinds of things.
This should probably be (optionally) done in Buildbot itself, along with the regular pruning?
Priority: -- → P5
Whiteboard: [database][qa]
Whiteboard: [database][qa] → [database][qa][buildbot]
(Assignee)

Updated

6 years ago
Assignee: nobody → catlee
Severity: enhancement → normal
Depends on: 713255
Priority: P5 → P2
(Assignee)

Comment 2

6 years ago
Notes w/ sheeri on how to use MySQL partitions to do this on an ongoing basis in the future: https://etherpad.mozilla.org/GQrjGbbAjn
FYI, buildbot is on Percona 5.5, so we could indeed set this up with partitioning when we're ready.
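For reference, a rough sketch (illustrative only, not taken from the etherpad) of the kind of range partitioning those notes describe. On MySQL/Percona 5.5 the partitioning column has to be part of every unique key, and partitioned tables can't carry foreign keys, so the real schema might need adjusting first:

-- Illustrative only: partition a history table by month of starttime so old data
-- can be dropped as whole partitions instead of row-by-row deletes.
ALTER TABLE builds
    PARTITION BY RANGE (TO_DAYS(starttime)) (
        PARTITION p201107 VALUES LESS THAN (TO_DAYS('2011-08-01')),
        PARTITION p201108 VALUES LESS THAN (TO_DAYS('2011-09-01')),
        PARTITION pmax    VALUES LESS THAN MAXVALUE
    );

-- Purging a month then becomes a cheap metadata operation:
ALTER TABLE builds DROP PARTITION p201107;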

Updated

5 years ago
Depends on: 755012
(Assignee)

Updated

5 years ago
Priority: P2 → P5
Note that partitioning doesn't work with foreign keys (as of now), so any tables that have foreign keys on them can't be partitioned. :D
Is this still an issue?
Flags: needinfo?(catlee)
(Assignee)

Comment 7

4 years ago
In terms of having bad/stale data, yes, we still need to fix that up.

In terms of pruning old data to improve load on the DB, you're probably in a better position to say than I am.
Flags: needinfo?(catlee)
OK, we'll put this as a q3 goal to get this done, then.
Whiteboard: [database][qa][buildbot] → [2013q3][database][qa][buildbot]
Assignee: catlee → scabral
Component: Release Engineering → Server Operations: Database
QA Contact: scabral
Priority: P5 → P3
Is it accurate to say that keeping a month of data is desired? (e.g. when should data be purged?)

We do this on tbpl (keep 1 month of data) so it should be no problem to do it here, too.
Whiteboard: [2013q3][database][qa][buildbot] → [2013q3] August [database][qa][buildbot]
FWIW if we're only keeping a month there's no need to partition.

It also looks like :catlee might have some scripts as used in https://bugzilla.mozilla.org/show_bug.cgi?id=713255 - do you still have those scripts? Otherwise we'll just make our own with what's in the etherpad and bug 713255.
(Assignee)

Comment 11

4 years ago
Yeah, I still have them. Running them is really expensive, as we found out the last time we tried to purge old data.

Most of the data isn't indexed by date. The existing scripts work by deleting rows that are keyed by date, and then finding and deleting orphaned rows.
What was expensive - the deletes, or the defragmenting with OPTIMIZE TABLE?

If we delete regularly - say, every week we make sure there's nothing older than a month - we don't have to defrag, because deleting frees up space within the table it was deleted from. If you're just going to fill that table with more data, it will reuse that freed space before taking more disk space.
(Assignee)

Comment 13

4 years ago
I think the deletes themselves were expensive. The times are listed here:
https://bugzilla.mozilla.org/show_bug.cgi?id=713255#c4

It took several hours to run on the master DB, and several more hours for the slave to catch up.

Perhaps we could start by deleting only a few weeks of data at a time, until we are left with "enough" data.
:catlee - that sounds like a great plan. 

mysql> select min(starttime) from builds;
+---------------------+
| min(starttime)      |
+---------------------+
| 2011-01-01 00:00:13 |
+---------------------+
1 row in set (0.00 sec)

So indeed, we removed builds prior to 1/1/2011 in bug 713255.

I would say let's try a progression:

1 day
2 days
4 days
7 days
14 days
1 month

This will show how long the deletes take at each step, and how much replication lag they cause. If at any step things are taking "too long", we won't continue to the next step....

e.g. let's say 1 day takes 10 seconds to delete and causes 10 seconds of replication lag. Then 2 days might take 20 seconds, and 4 days 40 seconds, and 7 days 1 min 10 seconds. We may decide that one week at a time is all we want to do, but we might run that once an hour during non-peak times to catch up to where we want to be (e.g. every hour, we delete another week).

Or, we could just do it during a tree closure if we can get permission.

Whenever we start, I'll be sure to save that morning's backups just in case.
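A rough sketch of what the first step of that progression could look like from the mysql command line (the cutoff date follows from the min(starttime) above; everything else is illustrative):

-- Delete a single day of history, timing the statement on the master:
DELETE FROM builds WHERE starttime < '2011-01-02 00:00:00';

-- On the slave, watch how far it falls behind while the delete replicates:
SHOW SLAVE STATUS\G   -- Seconds_Behind_Master is the number to watch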
Depends on: 904756
Blocks: 899498
(Assignee)

Comment 15

4 years ago
The scripts we ran last time are:
https://github.com/catlee/tools/blob/db_maint/buildbot-helpers/clean_schedulerdb.py
and
https://github.com/catlee/tools/blob/db_maint/buildbot-helpers/clean_statusdb.py

Their basic approach is to delete some key data prior to the cutoff date, and then delete any orphaned rows in the rest of the tables.

Sheeri, is there a way to estimate the cost of queries like this one (https://github.com/catlee/tools/blob/db_maint/buildbot-helpers/clean_schedulerdb.py#L12):

DELETE LOW_PRIORITY FROM buildrequests USING
buildrequests LEFT OUTER JOIN buildsets ON
buildrequests.buildsetid = buildsets.id
WHERE
buildsets.id IS NULL

Alternatively, is there a better way of deleting orphaned rows?
Flags: needinfo?(scabral)
The best way to delete orphaned rows would be to make an actual foreign key dependency with ON DELETE CASCADE - this will ensure that any rows referencing a deleted parent row are removed automatically.

If you can't do that, then yes, this is the best way. The only suggestion I might have is to do something like a DELETE with a LIMIT 100 or something and stop after the 2nd or 3rd time you get "0 rows deleted" (you could preface it with a COUNT and batch it up, and add a few extra batches just in case).

You can estimate the cost with an EXPLAIN - before MySQL 5.6, you'd change it to a SELECT (SELECT * with the same FROM/WHERE clause) to get the number of rows. As for "cost" that's hard to quantify - you might know there are 5,000 rows to delete but how long does that take, how much I/O, how much replication lag - it's hard to say.

Once we start doing this regularly, like daily or weekly, the number of rows deleted goes WAY down.
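A hedged sketch of those two suggestions, using the buildrequests query from comment 15 (the batch size and stop condition are illustrative):

-- Pre-5.6 cost estimate: rewrite the DELETE as a SELECT with the same FROM/WHERE
-- and EXPLAIN it to see the access plan and row estimates.
EXPLAIN
SELECT buildrequests.id
FROM buildrequests
LEFT OUTER JOIN buildsets ON buildrequests.buildsetid = buildsets.id
WHERE buildsets.id IS NULL;

-- Batched delete with a LIMIT. Multi-table DELETEs can't take a LIMIT, so this uses
-- the single-table NOT IN form (which, as a later comment notes, ends up as a
-- dependent subquery); rerun it until a few consecutive runs delete 0 rows.
DELETE FROM buildrequests
WHERE buildsetid NOT IN (SELECT id FROM buildsets)
ORDER BY id
LIMIT 100;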
Flags: needinfo?(scabral)
(Assignee)

Comment 17

4 years ago
I don't think there are any foreign key dependencies in place right now. Adding a limit to the # of deleted rows is a good idea. Is the delete or the big outer join the most expensive bit, do you think?
My first instinct says that the delete is the most expensive bit - disk is usually the bottleneck. Even if all the rows are uncached and have to be read from disk, writing is more expensive than reading.

My second instinct says "it depends on how many rows you're deleting" - if you're doing a left join, the left table (buildrequests) will do a full table scan. It has to, because you want every single row there. If you're deleting 75% of the rows, a full table scan is the best thing, and the delete is the most expensive operation.

If you're only deleting 10% of the rows, it would be better to somehow get the ids you want to delete...but that would require doing an outer join or an IN subquery, like:

select * FROM buildrequests where buildsetid not in (select id from buildsets)

For that matter you could do 

DELETE FROM buildrequests where buildsetid not in (select id from buildsets)

However, that would generate not only a full table scan on buildrequests, but buildsets ends up being a dependent subquery.

If we could keep track somehow of the deleted ids, that would be the fastest, but more complex. For example, here is https://github.com/catlee/tools/blob/db_maint/buildbot-helpers/clean_statusdb.py in pseudocode - this is long, so feel free to call me if any of this doesn't make sense.

[result_hash] = SELECT id,builder_id,master_id,slave_id,source_id FROM builds WHERE starttime < cutoff;

for each b_id,bld_id,m_id,sl_id,so_id in [result_hash]
do
 DELETE FROM builds WHERE id=b_id;
 DELETE FROM builders WHERE id=bld_id;
 DELETE FROM slaves WHERE id=sl_id;
 DELETE FROM builder_slaves WHERE builder_id=bld_id;

 [prop_array] = SELECT property_id FROM build_properties WHERE build_id=b_id;

 DELETE FROM build_properties WHERE build_id=b_id;
 DELETE FROM build_requests WHERE build_id=b_id;

 [req_array] = SELECT DISTINCT id FROM requests WHERE builder_id=b_id;
 for each r_id in [req_array]
 do
  DELETE FROM requests WHERE id=r_id;
  [prop_array2] = SELECT property_id FROM request_properties WHERE request_id=r_id;
  DELETE FROM request_properties WHERE request_id=r_id;
 done

 DELETE FROM steps WHERE build_id=b_id;

 [patch_array] = SELECT patch_id FROM sourcestamps WHERE id=so_id;
 for each p_id in [patch_array]
 do
   DELETE FROM patches WHERE id=p_id;
 done
 DELETE FROM sourcestamps WHERE id=so_id;

 [src_array] = SELECT DISTINCT change_id FROM source_changes WHERE source_id=so_id;
 for each c_id in [src_array]
 do
   DELETE FROM changes WHERE id=c_id;

   [fc_array] = SELECT DISTINCT file_id FROM file_changes WHERE change_id=c_id;
   for each f_id in [fc_array]
   do
     DELETE FROM files WHERE id=f_id;
   done

   DELETE FROM file_changes WHERE change_id=c_id;
 done

 DELETE FROM source_changes WHERE source_id=so_id;

 [props_to_del] = intersection [prop_array],[prop_array2]
 DELETE FROM properties WHERE id in [props_to_del];

 DELETE FROM masters WHERE id=m_id;
 DELETE FROM master_slaves WHERE master_id=m_id;
done

I put the masters and master_slaves stuff at the end because in https://github.com/catlee/tools/blob/db_maint/buildbot-helpers/clean_statusdb.py you never actually call it. You also don't call patches, but that depends on sourcestamps, so I put that in the right place.

Just in case, I checked to see if there was existing cleanup that needed to be done and there are no masters or master_slaves that are orphans:
mysql> select * from masters LEFT OUTER JOIN builds ON builds.master_id = masters.id WHERE builds.id IS NULL;
Empty set (0.00 sec)

mysql> select * from master_slaves LEFT OUTER JOIN masters ON master_slaves.master_id = masters.id WHERE masters.id IS NULL;
Empty set (0.00 sec)

mysql> select * from patches LEFT OUTER JOIN sourcestamps ON patches.id = sourcestamps.patch_id WHERE sourcestamps.id IS NULL;
Empty set (0.00 sec)

This makes the deletes really easy - no joins; you're basically making cursors and traveling through everything one number at a time. It's complex and error prone, because it's less intuitive than just saying "if it's not in table foo, delete it from table bar, baz and bap" - you're saying "find properties of the thing we're deleting in foo, then delete whatever in tables bar, baz and bap has those properties." For instance, if you use b_id instead of bld_id in the wrong place, it's harder to spot (builds.id as opposed to builders.id). If you typo in an outer join query, MySQL will give an error, because builders.id doesn't exist in the builds table; but if you mix up bld_id vs. b_id there won't be any failures, and you might accidentally delete important stuff.

So tl;dr is I'd leave it as-is...
:catlee - can you comment on the progress that was made over the weekend?
Whiteboard: [2013q3] August [database][qa][buildbot] → [database][qa][buildbot]
(also is there any way we can keep the target of q3 to have the initial cleanout done?)
Whiteboard: [database][qa][buildbot] → [2013q3] August [database][qa][buildbot]
So, here's an idea.

What if we take the slave out of the load balancer, so the master is doing all the reads and writes. Then we run the script specifically against the slave, cleaning up all the data, letting it run for however long it runs, but don't let it replicate back to the master.

Then we let the slave catch up, so it should have all the recent data, and then we failover to the slave, so the slave is doing all reads and writes. Then we can delete from the master, the same stuff we deleted on the slave, for however long it takes, again not letting the changes replicate.

Before we put the master back in as the master, we run checksums to make sure the data is exactly the same, and set up the backup server to have the same data as the master.

This should work in a shorter timeframe than trying to run the script every outage window.

We will still need to figure out a solution to clean up the database more frequently, in an automated way that does not lock tables. Perhaps we can set up triggers, so that when a row is deleted from one table, all its children are deleted too?
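For illustration, here's roughly what that trigger idea could look like (table names follow the schema discussed in this bug; this is only a sketch - foreign keys with ON DELETE CASCADE would be the cleaner way to express it, if the tables can carry them):

DELIMITER $$
CREATE TRIGGER builds_cascade_delete
AFTER DELETE ON builds
FOR EACH ROW
BEGIN
    -- remove the child rows that reference the deleted build
    DELETE FROM steps            WHERE build_id = OLD.id;
    DELETE FROM build_properties WHERE build_id = OLD.id;
END$$
DELIMITER ;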
Chris - any feedback on the plan in comment 21? The sooner we can clean up buildbot, the better.
(Assignee)

Comment 23

4 years ago
per some emails back and forth:

I think the plan in comment 21 could end up with dangling references to data in some of the shared tables like properties. Consider the following initial state:

build id 1
  date 1
  properties: 1, 2, 3 (via build_properties table)

build id 2
  date 2
  properties: 2, 3, 4 (via build_properties table)

If we decide to delete all builds older than 2 on the slave, we will delete build 1. Then any orphaned properties would be deleted. In this case, property 1 would be deleted as well.

Let's say right after this has happened, we write a new build to the master db:

build id 3
   date 3
   properties: 1, 4, 5
 
since property 1 still exists in the master, we think we're ok referencing it. however, it's been deleted on the slave. so now when we replicate the new build info to the slave, we'll be referencing data that doesn't exist any more.

For now, I think an approach like comment 18 is more workable. We can delete rows from builds and steps without too much trouble, since those have date columns on them. After that we can look at how to delete orphaned properties and other data.
It'd be a huge win to delete from steps, so let's do that (and builds too, why not?).

One big problem is that steps does not have an index on either starttime or endtime. 

I think we'll probably still need a failover to do this: take the slave out of the load balancer, delete from steps and builds, add an index to steps (this will rebuild the table and defragment it), then promote the slave to master, delete up to the same date from steps/builds, and add the index there too.

How does that sound?
Also I agree that comment 18 is the best idea, but I assume that will take some time to code, which is why I'm pushing on the cleanup of steps sooner rather than later.
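A sketch of the index addition proposed above (the exact column is an assumption based on the deletes being keyed by start time; on MySQL 5.5 an ALTER like this rebuilds the table, which is what gives the defragmentation for free):

ALTER TABLE steps ADD INDEX idx_steps_starttime (starttime);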
Whiteboard: [2013q3] August [database][qa][buildbot] → [database][qa][buildbot]
Priority: P3 → P2
What's the plan for the cleanup?
This can be done any time:

repeat until done:
select id from builds where starttime<'2012-10-01 00:00:00' limit 100;
delete from builds where id in (those 100 ids)
delete from steps where build_id in (those 100 ids)
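A minimal runnable version of that loop as a MySQL stored procedure (the procedure name, the temporary table, the ORDER BY for deterministic batches, and the WHILE structure are assumptions; the cutoff and batch size come from the pseudocode above):

DELIMITER $$
CREATE PROCEDURE purge_builds_before_cutoff()
BEGIN
    DECLARE batch_rows INT DEFAULT 1;
    WHILE batch_rows > 0 DO
        -- grab the next batch of ids ("those 100 ids" above)
        CREATE TEMPORARY TABLE batch_ids
            SELECT id FROM builds
            WHERE starttime < '2012-10-01 00:00:00'
            ORDER BY id
            LIMIT 100;
        SET batch_rows = (SELECT COUNT(*) FROM batch_ids);
        DELETE FROM builds WHERE id IN (SELECT id FROM batch_ids);
        DELETE FROM steps WHERE build_id IN (SELECT id FROM batch_ids);
        DROP TEMPORARY TABLE batch_ids;
    END WHILE;
END$$
DELIMITER ;

-- run it with:
CALL purge_builds_before_cutoff();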
Chris deleted everything in builds and steps before Oct 2012, and I defragmented those tables. We regained about 45G of disk space on buildbot2!

We will regain the space on buildbot1 when we upgrade, before the end of October.
The space has been regained on buildbot1.

The only thing left in this is to automate the deletions, which I'll put in Chris' hands.
(Assignee)

Updated

4 years ago
Assignee: scabral → catlee
(Assignee)

Comment 30

4 years ago
Created attachment 821223 [details]
cleanup_db.py

cleanup_db.py
Attachment #821223 - Flags: review?(bhearsum)
Comment on attachment 821223 [details]
cleanup_db.py

>def cleaner_upper(select_query, delete_queries):
>    while True:
>        t = time.time()
>        log.debug("finding rows: %s", query_to_str(select_query))
>        rows = select_query.execute()

Is there any concern that this could return so many ids that it chews up a ton of RAM? If so, might be better to use something that returns an iterator.


>def cleanup_orphaned_steps(meta):

It's likely that I'm just missing something here, but why is this method (and the next one) necessary? It looks to me that cleanup_builds should take care of deleting recently orphaned steps and properties already...
(Assignee)

Comment 32

4 years ago
(In reply to Ben Hearsum [:bhearsum] from comment #31)
> Comment on attachment 821223 [details]
> cleanup_db.py
> 
> >def cleaner_upper(select_query, delete_queries):
> >    while True:
> >        t = time.time()
> >        log.debug("finding rows: %s", query_to_str(select_query))
> >        rows = select_query.execute()
> 
> Is there any concern that this could return so many ids that it chews up a
> ton of RAM? If so, might be better to use something that returns an iterator.

I'm relying on the queries to limit the # of results. Using an iterator would mean holding some kind of cursor/connection open for longer, which could get closed if one of the subsequent deletions takes a long time.


> >def cleanup_orphaned_steps(meta):
> 
> It's likely that I'm just missing something here, but why is this method
> (and the next one) necessary? It looks to me that cleanup_builds should take
> care of deleting recently orphaned steps and properties already...

These are to find and remove steps and properties whose builds have gone away. You're right in that the regular cleanup should take care of deleting these rows first, and so we wouldn't end up with orphans, but this is to catch the edge cases. We had plenty of orphaned steps/build_properties until recently!
Attachment #821223 - Flags: review?(bhearsum) → review+
(Assignee)

Updated

4 years ago
Component: Server Operations: Database → General Automation
Product: mozilla.org → Release Engineering
QA Contact: scabral → catlee
(Assignee)

Comment 33

4 years ago
Comment on attachment 821223 [details]
cleanup_db.py

https://hg.mozilla.org/build/tools/rev/27d68d07222d
Attachment #821223 - Flags: checked-in+
w00t!

what's the next step on this?
(Assignee)

Comment 35

4 years ago
Need to get this deployed w/ puppet and running weekly or somesuch
Sure, is that something you do, or we do?
(Assignee)

Comment 37

4 years ago
I'm working on a patch today!
(Assignee)

Comment 38

4 years ago
Created attachment 824099 [details] [diff] [review]
add logging support
Attachment #824099 - Flags: review?(bhearsum)
(Assignee)

Comment 39

4 years ago
Created attachment 824101 [details] [diff] [review]
deploy db cleanup script with puppet
Attachment #824101 - Flags: review?(dustin)
Comment on attachment 824101 [details] [diff] [review]
deploy db cleanup script with puppet

Review of attachment 824101 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good

::: manifests/moco-nodes.pp
@@ +456,5 @@
>  }
>  
>  node "buildbot-master81.srv.releng.scl3.mozilla.com" {
>      include toplevel::server
>      include toplevel::server::buildmaster

Why are both of these specified here?  I guess that's really bug 891859.
Attachment #824101 - Flags: review?(dustin) → review+
Comment on attachment 824099 [details] [diff] [review]
add logging support

Review of attachment 824099 [details] [diff] [review]:
-----------------------------------------------------------------

::: buildfarm/maintenance/cleanup_db.py
@@ +190,3 @@
>  
>      if not options.cutoff:
> +        parser.error("cutoff date is both required")

SyntaxError: could not parse "is both required"
Attachment #824099 - Flags: review?(bhearsum) → review+
(Assignee)

Comment 42

4 years ago
Comment on attachment 824099 [details] [diff] [review]
add logging support

https://hg.mozilla.org/build/tools/rev/39741a8d27bd
Attachment #824099 - Flags: checked-in+
(Assignee)

Comment 43

4 years ago
Comment on attachment 824101 [details] [diff] [review]
deploy db cleanup script with puppet

https://hg.mozilla.org/build/puppet/rev/df9273a70c62
Attachment #824101 - Flags: checked-in+
So has this been cleaning stuff up since the end of October? or are there more steps here?
(Assignee)

Comment 45

4 years ago
Yes, it has been running since October.

How are we doing for space? There are still lots of other things we could be cleaning up, but aren't yet.
I just wanted to make sure it wasn't something we should be running from cron or anything.

Looks like we're running great:

mysql> select min(starttime) from builds;
+---------------------+
| min(starttime)      |
+---------------------+
| 2012-12-08 00:00:00 |
+---------------------+
1 row in set (0.01 sec)


We're doing fine on space - 126G used, 82G free.

buildbot:
[root@buildbot1.db.scl3 buildbot]# du -sh * | grep G | sort -n
1.5G    builds.MYD
2.5G    builds.MYI
2.6G    schedulerdb_requests.ibd
6.3G    build_properties.MYD
7.6G    properties.MYI
9.5G    properties.MYD
11G     steps.MYI
15G     build_properties.MYI
26G     steps.MYD

buildbot_schedulers:
[root@buildbot1.db.scl3 buildbot_schedulers]# du -sh * | grep G | sort -n
1.5G    buildrequests.MYI
1.5G    builds.MYI
5.1G    buildrequests.MYD

I'm happy for y'all to build in lifecycles when you can, but I'm going to consider this a success and closed!
Status: NEW → RESOLVED
Last Resolved: 4 years ago
Resolution: --- → FIXED
Blocks: 1124687