
Status

Product: Release Engineering
Component: General Automation
Priority: P2
Severity: normal
Status: RESOLVED FIXED
Reported: 7 years ago
Last modified: 3 years ago

People

(Reporter: catlee, Assigned: catlee)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [database][qa][buildbot])

Attachments

(3 attachments)

(Assignee)

Description

7 years ago
Bad Things happen, and we sometimes end up with stale data in the DB, e.g.:
- Rows in buildrequests that will never be run if we disable old builders or rename things
- Builds that never get marked as finished if the master crashes

We should periodically find and clean up these kinds of things.
This should probably be (optionally) done in Buildbot itself, along with the regular pruning?
Priority: -- → P5
Whiteboard: [database][qa]
Whiteboard: [database][qa] → [database][qa][buildbot]
(Assignee)

Updated

6 years ago
Assignee: nobody → catlee
Severity: enhancement → normal
Depends on: 713255
Priority: P5 → P2
(Assignee)

Comment 2

6 years ago
Notes w/ sheeri on how to use MySQL partitions to do this on an ongoing basis in the future: https://etherpad.mozilla.org/GQrjGbbAjn
FYI, buildbot is on Percona 5.5, so we could indeed set this up with partitioning when we're ready.
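For reference, a rough sketch (illustrative only, not taken from the etherpad) of the kind of range partitioning those notes describe. On MySQL/Percona 5.5 the partitioning column has to be part of every unique key, and partitioned tables can't carry foreign keys, so the real schema might need adjusting first:

-- Illustrative only: partition a history table by month of starttime so old data
-- can be dropped as whole partitions instead of row-by-row deletes.
ALTER TABLE builds
    PARTITION BY RANGE (TO_DAYS(starttime)) (
        PARTITION p201107 VALUES LESS THAN (TO_DAYS('2011-08-01')),
        PARTITION p201108 VALUES LESS THAN (TO_DAYS('2011-09-01')),
        PARTITION pmax    VALUES LESS THAN MAXVALUE
    );

-- Purging a month then becomes a cheap metadata operation:
ALTER TABLE builds DROP PARTITION p201107;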

Updated

5 years ago
Depends on: 755012
(Assignee)

Updated

5 years ago
Priority: P2 → P5
Note that partitioning doesn't work with foreign keys (as of now), so any tables that have foreign keys on them can't be partitioned. :D
Is this still an issue?
Flags: needinfo?(catlee)
(Assignee)

Comment 7

4 years ago
In terms of having bad/stale data, yes, we still need to fix that up.

In terms of pruning old data to improve load on the DB, you're probably in a better position to say than I am.
Flags: needinfo?(catlee)
OK, we'll put this as a q3 goal to get this done, then.
Whiteboard: [database][qa][buildbot] → [2013q3][database][qa][buildbot]
Assignee: catlee → scabral
Component: Release Engineering → Server Operations: Database
QA Contact: scabral
Priority: P5 → P3
Is it accurate to say that keeping a month of data is desired? (e.g. when should data be purged?)

We do this on tbpl (keep 1 month of data) so it should be no problem to do it here, too.
Whiteboard: [2013q3][database][qa][buildbot] → [2013q3] August [database][qa][buildbot]
FWIW if we're only keeping a month there's no need to partition.

It also looks like :catlee might have some scripts as used in https://bugzilla.mozilla.org/show_bug.cgi?id=713255 - do you still have those scripts? Otherwise we'll just make our own with what's in the etherpad and bug 713255.
(Assignee)

Comment 11

4 years ago
Yeah, I still have them. Running them is really expensive, as we found out the last time we tried to purge old data.

Most of the data isn't indexed by date. The existing scripts work by deleting rows that are keyed by date, and then finding and deleting orphaned rows.
What was expensive - the deletes, or the defragmenting with OPTIMIZE TABLE?

If we delete regularly - say, every week we make sure there's nothing older than a month - we don't have to defrag, because deleting frees up space within the table it was deleted from. If you're just going to fill that table with more data, it will reuse that freed space before taking more disk space.
(Assignee)

Comment 13

4 years ago
I think the deletes themselves were expensive. The times are listed here:
https://bugzilla.mozilla.org/show_bug.cgi?id=713255#c4

It took several hours to run on the master DB, and several more hours for the slave to catch up.

Perhaps we could start by deleting only a few weeks of data at a time, until we are left with "enough" data.
:catlee - that sounds like a great plan. 

mysql> select min(starttime) from builds;
+---------------------+
| min(starttime)      |
+---------------------+
| 2011-01-01 00:00:13 |
+---------------------+
1 row in set (0.00 sec)

So indeed, we removed builds prior to 1/1/2011 in bug 713255.

I would say let's try a progression:

1 day
2 days
4 days
7 days
14 days
1 month

This will show how long the deletes take at each step, and how much replication lag they cause. If at any step things are taking "too long", we won't continue to the next step....

e.g. let's say 1 day takes 10 seconds to delete and causes 10 seconds of replication lag. Then 2 days might take 20 seconds, and 4 days 40 seconds, and 7 days 1 min 10 seconds. We may decide that one week at a time is all we want to do, but we might run that once an hour during non-peak times to catch up to where we want to be (e.g. every hour, we delete another week).

Or, we could just do it during a tree closure if we can get permission.

Whenever we start, I'll be sure to save that morning's backups just in case.
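A rough sketch of what the first step of that progression could look like from the mysql command line (the cutoff date follows from the min(starttime) above; everything else is illustrative):

-- Delete a single day of history, timing the statement on the master:
DELETE FROM builds WHERE starttime < '2011-01-02 00:00:00';

-- On the slave, watch how far it falls behind while the delete replicates:
SHOW SLAVE STATUS\G   -- Seconds_Behind_Master is the number to watch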
Depends on: 904756
Blocks: 899498
(Assignee)

Comment 15

4 years ago
The scripts we ran last time are:
https://github.com/catlee/tools/blob/db_maint/buildbot-helpers/clean_schedulerdb.py
and
https://github.com/catlee/tools/blob/db_maint/buildbot-helpers/clean_statusdb.py

Their basic approach is to delete some key data prior to the cutoff date, and then delete any orphaned rows in the rest of the tables.

Sheeri, is there a way to estimate the cost of queries like this one (https://github.com/catlee/tools/blob/db_maint/buildbot-helpers/clean_schedulerdb.py#L12):

DELETE LOW_PRIORITY FROM buildrequests USING
buildrequests LEFT OUTER JOIN buildsets ON
buildrequests.buildsetid = buildsets.id
WHERE
buildsets.id IS NULL

Alternatively, is there a better way of deleting orphaned rows?
Flags: needinfo?(scabral)
The best way to delete orphaned rows would be to make an actual foreign key dependency with ON DELETE CASCADE - this will ensure that any rows referencing a deleted parent row are removed automatically.

If you can't do that, then yes, this is the best way. The only suggestion I might have is to do something like a DELETE with a LIMIT 100 or something and stop after the 2nd or 3rd time you get "0 rows deleted" (you could preface it with a COUNT and batch it up, and add a few extra batches just in case).

You can estimate the cost with an EXPLAIN - before MySQL 5.6, you'd change it to a SELECT (SELECT * with the same FROM/WHERE clause) to get the number of rows. As for "cost" that's hard to quantify - you might know there are 5,000 rows to delete but how long does that take, how much I/O, how much replication lag - it's hard to say.

Once we start doing this regularly, like daily or weekly, the number of rows deleted goes WAY down.
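A hedged sketch of those two suggestions, using the buildrequests query from comment 15 (the batch size and stop condition are illustrative):

-- Pre-5.6 cost estimate: rewrite the DELETE as a SELECT with the same FROM/WHERE
-- and EXPLAIN it to see the access plan and row estimates.
EXPLAIN
SELECT buildrequests.id
FROM buildrequests
LEFT OUTER JOIN buildsets ON buildrequests.buildsetid = buildsets.id
WHERE buildsets.id IS NULL;

-- Batched delete with a LIMIT. Multi-table DELETEs can't take a LIMIT, so this uses
-- the single-table NOT IN form (which, as a later comment notes, ends up as a
-- dependent subquery); rerun it until a few consecutive runs delete 0 rows.
DELETE FROM buildrequests
WHERE buildsetid NOT IN (SELECT id FROM buildsets)
ORDER BY id
LIMIT 100;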
Flags: needinfo?(scabral)
(Assignee)

Comment 17

4 years ago
I don't think there are any foreign key dependencies in place right now. Adding a limit to the # of deleted rows is a good idea. Is the delete or the big outer join the most expensive bit, do you think?
My first instinct says that the delete is the most expensive bit - disk is usually the bottleneck. Even if all the rows are uncached and have to be read from disk, writing is more expensive than reading.

My second instinct says "it depends on how many rows you're deleting" - if you're doing a left join, the left table (buildrequests) will do a full table scan. It has to, because you want every single row there. If you're deleting 75% of the rows, a full table scan is the best thing, and the delete is the most expensive operation.

If you're only deleting 10% of the rows, it would be better to somehow get the ids you want to delete...but that would require doing an outer join or an IN subquery, like:

select * FROM buildrequests where buildsetid not in (select id from buildsets)

For that matter you could do 

DELETE FROM buildrequests where buildsetid not in (select id from buildsets)

However, that would generate not only a full table scan on buildrequests, but buildsets ends up being a dependent subquery.

If we could keep track somehow of the deleted ids, that would be the fastest, but more complex. For example, here is https://github.com/catlee/tools/blob/db_maint/buildbot-helpers/clean_statusdb.py in pseudocode - this is long, so feel free to call me if any of this doesn't make sense.

[result_hash] = SELECT id,builder_id,master_id,slave_id,source_id FROM builds WHERE starttime < cutoff;

for each b_id,bld_id,m_id,sl_id,so_id in [result_hash]
do
 DELETE FROM builds WHERE id=b_id;
 DELETE FROM builders WHERE id=bld_id;
 DELETE FROM slaves WHERE id=sl_id;
 DELETE FROM builder_slaves WHERE builder_id=bld_id;

 [prop_array] = SELECT property_id FROM build_properties WHERE build_id=b_id;

 DELETE FROM build_properties WHERE build_id=b_id;
 DELETE FROM build_requests WHERE build_id=b_id;

 [req_array] = SELECT DISTINCT id FROM requests WHERE builder_id=b_id;
 for each r_id in [req_array]
 do
  DELETE FROM requests WHERE id=r_id;
  [prop_array2] = SELECT property_id FROM request_properties WHERE request_id=r_id;
  DELETE FROM request_properties WHERE request_id=r_id;
 done

 DELETE FROM steps WHERE build_id=b_id;

 [patch_array] = SELECT patch_id FROM sourcestamps WHERE id=so_id;
 for each p_id in [patch_array]
 do
   DELETE FROM patches WHERE id=p_id;
 done
 DELETE FROM sourcestamps WHERE id=so_id;

 [src_array] = SELECT DISTINCT change_id FROM source_changes WHERE source_id=so_id;
 for each c_id in [src_array]
 do
   DELETE FROM changes WHERE id=c_id;

   [fc_array] = SELECT DISTINCT file_id FROM file_changes WHERE change_id=c_id;
   for each f_id in [fc_array]
   do
     DELETE FROM files WHERE id=f_id;
   done

   DELETE FROM file_changes WHERE change_id=c_id;
 done

 DELETE FROM source_changes WHERE source_id=so_id;

 [props_to_del] = intersection [prop_array],[prop_array2]
 DELETE FROM properties WHERE id in [props_to_del];

 DELETE FROM masters WHERE id=m_id;
 DELETE FROM master_slaves WHERE master_id=m_id;
done

I put the masters and master_slaves stuff at the end because in https://github.com/catlee/tools/blob/db_maint/buildbot-helpers/clean_statusdb.py you never actually call it. You also don't call patches, but that depends on sourcestamps, so I put that in the right place.

Just in case, I checked to see if there was existing cleanup that needed to be done and there are no masters or master_slaves that are orphans:
mysql> select * from masters LEFT OUTER JOIN builds ON builds.master_id = masters.id WHERE builds.id IS NULL;
Empty set (0.00 sec)

mysql> select * from master_slaves LEFT OUTER JOIN masters ON master_slaves.master_id = masters.id WHERE masters.id IS NULL;
Empty set (0.00 sec)

mysql> select * from patches LEFT OUTER JOIN sourcestamps ON patches.id = sourcestamps.patch_id WHERE sourcestamps.id IS NULL;
Empty set (0.00 sec)

This makes the deletes really easy - no joins; you're basically making cursors and traveling through everything one number at a time. It's complex and error prone, because it's less intuitive than just saying "if it's not in table foo, delete it from table bar, baz and bap" - you're saying "find properties of the thing we're deleting in foo, then delete whatever in tables bar, baz and bap has those properties." For instance, if you use b_id instead of bld_id in the wrong place, it's harder to spot (builds.id as opposed to builders.id). If you typo in an outer join query, MySQL will give an error, because builders.id doesn't exist in the builds table; but if you mix up bld_id vs. b_id there won't be any failures, and you might accidentally delete important stuff.

So tl;dr is I'd leave it as-is...
:catlee - can you comment on the progress that was made over the weekend?
Whiteboard: [2013q3] August [database][qa][buildbot] → [database][qa][buildbot]
(also is there any way we can keep the target of q3 to have the initial cleanout done?)
Whiteboard: [database][qa][buildbot] → [2013q3] August [database][qa][buildbot]
So, here's an idea.

What if we take the slave out of the load balancer, so the master is doing all the reads and writes. Then we run the script specifically against the slave, cleaning up all the data, letting it run for however long it runs, but don't let it replicate back to the master.

Then we let the slave catch up, so it should have all the recent data, and then we failover to the slave, so the slave is doing all reads and writes. Then we can delete from the master, the same stuff we deleted on the slave, for however long it takes, again not letting the changes replicate.

Before we put the master back in as the master, we run checksums to make sure the data is exactly the same, and set up the backup server to have the same data as the master.

This should work in a shorter timeframe than trying to run the script every outage window.

We will still need to figure out a solution to clean up the database more frequently, in an automated way that does not lock tables. Perhaps we can set up triggers, so that when a row is deleted from one table, all its children are deleted too?
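For illustration, here's roughly what that trigger idea could look like (table names follow the schema discussed in this bug; this is only a sketch - foreign keys with ON DELETE CASCADE would be the cleaner way to express it, if the tables can carry them):

DELIMITER $$
CREATE TRIGGER builds_cascade_delete
AFTER DELETE ON builds
FOR EACH ROW
BEGIN
    -- remove the child rows that reference the deleted build
    DELETE FROM steps            WHERE build_id = OLD.id;
    DELETE FROM build_properties WHERE build_id = OLD.id;
END$$
DELIMITER ;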
Chris - any feedback on the plan in comment 21? The sooner we can clean up buildbot, the better.
(Assignee)

Comment 23

4 years ago
per some emails back and forth:

I think the plan in comment 21 could end up with dangling references to data in some of the shared tables like properties. Consider the following initial state:

build id 1
  date 1
  properties: 1, 2, 3 (via build_properties table)

build id 2
  date 2
  properties: 2, 3, 4 (via build_properties table)

If we decide to delete all builds older than 2 on the slave, we will delete build 1. Then any orphaned properties would be deleted. In this case, property 1 would be deleted as well.

Let's say right after this has happened, we write a new build to the master db:

build id 3
   date 3
   properties: 1, 4, 5
 
since property 1 still exists in the master, we think we're ok referencing it. however, it's been deleted on the slave. so now when we replicate the new build info to the slave, we'll be referencing data that doesn't exist any more.

For now, I think an approach like comment 18 is more workable. We can delete rows from builds and steps without too much trouble, since those have date columns on them. After that we can look at how to delete orphaned properties and other data.
It'd be a huge win to delete from steps, so let's do that (and builds too, why not?).

One big problem is that steps does not have an index on either starttime or endtime. 

I think we'll probably still need a failover to do this: take the slave out of the load balancer, delete from steps and builds, add an index to steps (this will rebuild the table and defragment it), then promote the slave to master, delete up to the same date from steps/builds, and add the index there too.

How does that sound?
Also I agree that comment 18 is the best idea, but I assume that will take some time to code, which is why I'm pushing on the cleanup of steps sooner rather than later.
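A sketch of the index addition proposed above (the exact column is an assumption based on the deletes being keyed by start time; on MySQL 5.5 an ALTER like this rebuilds the table, which is what gives the defragmentation for free):

ALTER TABLE steps ADD INDEX idx_steps_starttime (starttime);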
Whiteboard: [2013q3] August [database][qa][buildbot] → [database][qa][buildbot]
Priority: P3 → P2
What's the plan for the cleanup?
This can be done any time:

repeat until done:
select id from builds where starttime<'2012-10-01 00:00:00' limit 100;
delete from builds where id in (those 100 ids)
delete from steps where build_id in (those 100 ids)
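A minimal runnable version of that loop as a MySQL stored procedure (the procedure name, the temporary table, the ORDER BY for deterministic batches, and the WHILE structure are assumptions; the cutoff and batch size come from the pseudocode above):

DELIMITER $$
CREATE PROCEDURE purge_builds_before_cutoff()
BEGIN
    DECLARE batch_rows INT DEFAULT 1;
    WHILE batch_rows > 0 DO
        -- grab the next batch of ids ("those 100 ids" above)
        CREATE TEMPORARY TABLE batch_ids
            SELECT id FROM builds
            WHERE starttime < '2012-10-01 00:00:00'
            ORDER BY id
            LIMIT 100;
        SET batch_rows = (SELECT COUNT(*) FROM batch_ids);
        DELETE FROM builds WHERE id IN (SELECT id FROM batch_ids);
        DELETE FROM steps WHERE build_id IN (SELECT id FROM batch_ids);
        DROP TEMPORARY TABLE batch_ids;
    END WHILE;
END$$
DELIMITER ;

-- run it with:
CALL purge_builds_before_cutoff();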
Chris deleted everything in builds and steps before Oct 2012, and I defragmented those tables. We regained about 45G of disk space on buildbot2!

We will regain the space on buildbot1 when we upgrade, before the end of October.
The space has been regained on buildbot1.

The only thing left in this is to automate the deletions, which I'll put in Chris' hands.
(Assignee)

Updated

4 years ago
Assignee: scabral → catlee
(Assignee)

Comment 30

4 years ago
Created attachment 821223 [details]
cleanup_db.py

cleanup_db.py
Attachment #821223 - Flags: review?(bhearsum)
Comment on attachment 821223 [details]
cleanup_db.py

>def cleaner_upper(select_query, delete_queries):
>    while True:
>        t = time.time()
>        log.debug("finding rows: %s", query_to_str(select_query))
>        rows = select_query.execute()

Is there any concern that this could return so many ids that it chews up a ton of RAM? If so, might be better to use something that returns an iterator.


>def cleanup_orphaned_steps(meta):

It's likely that I'm just missing something here, but why is this method (and the next one) necessary? It looks to me that cleanup_builds should take care of deleting recently orphaned steps and properties already...
(Assignee)

Comment 32

4 years ago
(In reply to Ben Hearsum [:bhearsum] from comment #31)
> Comment on attachment 821223 [details]
> cleanup_db.py
> 
> >def cleaner_upper(select_query, delete_queries):
> >    while True:
> >        t = time.time()
> >        log.debug("finding rows: %s", query_to_str(select_query))
> >        rows = select_query.execute()
> 
> Is there any concern that this could return so many ids that it chews up a
> ton of RAM? If so, might be better to use something that returns an iterator.

I'm relying on the queries to limit the # of results. Using an iterator would mean holding some kind of cursor/connection open for longer, which could get closed if one of the subsequent deletions takes a long time.


> >def cleanup_orphaned_steps(meta):
> 
> It's likely that I'm just missing something here, but why is this method
> (and the next one) necessary? It looks to me that cleanup_builds should take
> care of deleting recently orphaned steps and properties already...

These are to find and remove steps and properties whose builds have gone away. You're right in that the regular cleanup should take care of deleting these rows first, and so we wouldn't end up with orphans, but this is to catch the edge cases. We had plenty of orphaned steps/build_properties until recently!
Attachment #821223 - Flags: review?(bhearsum) → review+
(Assignee)

Updated

4 years ago
Component: Server Operations: Database → General Automation
Product: mozilla.org → Release Engineering
QA Contact: scabral → catlee
(Assignee)

Comment 33

4 years ago
Comment on attachment 821223 [details]
cleanup_db.py

https://hg.mozilla.org/build/tools/rev/27d68d07222d
Attachment #821223 - Flags: checked-in+
w00t!

what's the next step on this?
(Assignee)

Comment 35

4 years ago
Need to get this deployed w/ puppet and running weekly or somesuch
Sure, is that something you do, or we do?
(Assignee)

Comment 37

4 years ago
I'm working on a patch today!
(Assignee)

Comment 38

4 years ago
Created attachment 824099 [details] [diff] [review]
add logging support
Attachment #824099 - Flags: review?(bhearsum)
(Assignee)

Comment 39

4 years ago
Created attachment 824101 [details] [diff] [review]
deploy db cleanup script with puppet
Attachment #824101 - Flags: review?(dustin)
Comment on attachment 824101 [details] [diff] [review]
deploy db cleanup script with puppet

Review of attachment 824101 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good

::: manifests/moco-nodes.pp
@@ +456,5 @@
>  }
>  
>  node "buildbot-master81.srv.releng.scl3.mozilla.com" {
>      include toplevel::server
>      include toplevel::server::buildmaster

Why are both of these specified here?  I guess that's really bug 891859.
Attachment #824101 - Flags: review?(dustin) → review+
Comment on attachment 824099 [details] [diff] [review]
add logging support

Review of attachment 824099 [details] [diff] [review]:
-----------------------------------------------------------------

::: buildfarm/maintenance/cleanup_db.py
@@ +190,3 @@
>  
>      if not options.cutoff:
> +        parser.error("cutoff date is both required")

SyntaxError: could not parse "is both required"
Attachment #824099 - Flags: review?(bhearsum) → review+
(Assignee)

Comment 42

4 years ago
Comment on attachment 824099 [details] [diff] [review]
add logging support

https://hg.mozilla.org/build/tools/rev/39741a8d27bd
Attachment #824099 - Flags: checked-in+
(Assignee)

Comment 43

4 years ago
Comment on attachment 824101 [details] [diff] [review]
deploy db cleanup script with puppet

https://hg.mozilla.org/build/puppet/rev/df9273a70c62
Attachment #824101 - Flags: checked-in+
So has this been cleaning stuff up since the end of October? or are there more steps here?
(Assignee)

Comment 45

4 years ago
Yes, it has been running since October.

How are we doing for space? There are still lots of other things we could be cleaning up, but aren't yet.
I just wanted to make sure it wasn't something we should be running from cron or anything.

Looks like we're running great:

mysql> select min(starttime) from builds;
+---------------------+
| min(starttime)      |
+---------------------+
| 2012-12-08 00:00:00 |
+---------------------+
1 row in set (0.01 sec)


We're doing fine on space - 126G used, 82G free.

buildbot:
[root@buildbot1.db.scl3 buildbot]# du -sh * | grep G | sort -n
1.5G    builds.MYD
2.5G    builds.MYI
2.6G    schedulerdb_requests.ibd
6.3G    build_properties.MYD
7.6G    properties.MYI
9.5G    properties.MYD
11G     steps.MYI
15G     build_properties.MYI
26G     steps.MYD

buildbot_schedulers:
[root@buildbot1.db.scl3 buildbot_schedulers]# du -sh * | grep G | sort -n
1.5G    buildrequests.MYI
1.5G    builds.MYI
5.1G    buildrequests.MYD

I'm happy for y'all to build in lifecycles when you can, but I'm going to consider this a success and closed!
Status: NEW → RESOLVED
Last Resolved: 4 years ago
Resolution: --- → FIXED
Blocks: 1124687