Closed Bug 1302790 Opened 8 years ago Closed 8 years ago

Treeherder SCL3 prod DB usage increased by 75GB on 8-9th Sept

Categories

(Tree Management :: Treeherder: Infrastructure, defect, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: emorley, Assigned: emorley)

References

Details

On 8-9th Sept, disk usage increased from 375GB to 445GB.
Today (6 days later), it's at 456GB, and increasing each day.

See:
https://rpm.newrelic.com/accounts/677903/servers/6106888/disks#id=5b2253797374656d2f46696c6573797374656d2f5e646174612f557365642f6279746573222c2253797374656d2f4469736b2f5e6465765e73646231225d

This blocks performing the DB migration to Heroku (bug 1283170), since we need to dump the DB to the same disk, and there is no longer enough free space to do that with a safe amount of headroom (the disk has 984GB usable).
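
For what it's worth, compressing the dump on the fly keeps the on-disk footprint of the dump itself down. A minimal sketch, assuming a gzipped SQL dump is acceptable for the migration (database name, output path and credentials are illustrative):

# Consistent snapshot, streamed straight into gzip so only the compressed
# dump ever touches the disk (names illustrative)
mysqldump --single-transaction --quick treeherder | gzip > /data/treeherder.sql.gz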

I believe there are a couple of causes:
1) the recent schema migration means there is log data duplicated between several tables
2) (to a lesser extent) cycle_data is now timing out, so we're not expiring old data (see the sketch after this list):
https://rpm.newrelic.com/accounts/677903/applications/4180461/filterable_errors#/show/4eecf682-7a4f-11e6-a90b-b82a72d22a14_21418_27301/stack_trace?top_facet=transactionUiName&primary_facet=error.class&barchart=barchart&filters=%5B%7B%22key%22%3A%22transactionUiName%22%2C%22value%22%3A%22cycle-data%22%2C%22like%22%3Afalse%7D%5D&_k=0o5hw5
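
The usual way to avoid that kind of timeout is to expire rows in small batches rather than in one big DELETE. A rough sketch of the idea only, not what cycle_data actually does (table, column, cutoff, batch size, database name and credentials are all illustrative):

# Delete expired rows in small batches until nothing is left to delete
while true; do
  deleted=$(mysql -N -e "DELETE FROM job WHERE submit_time < NOW() - INTERVAL 4 MONTH LIMIT 10000; SELECT ROW_COUNT();" treeherder)
  [ "$deleted" -eq 0 ] && break
done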

Please could we all make a concerted effort to drive the usage down here?

Once we're on Heroku we can request as much disk as we like, so we can stop worrying about this kind of thing.
The easiest thing would be to stop ingesting text log artifacts and purge them from the database, since they're no longer used for anything. I left them in so we could revert, but it's been several days now.
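
To size that up first, something like the following would show what is actually sitting in job_artifact (database name and credentials are illustrative):

# Row counts per artifact name, largest first
mysql -e "SELECT name, COUNT(*) AS row_count FROM job_artifact GROUP BY name ORDER BY row_count DESC;" treeherder

Note that with InnoDB, deleting the rows only frees space inside the tablespace; the files on disk don't shrink until the table is rebuilt.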
I've discovered that the dumps for bug 1246965 were left behind on treeherder2.db.scl3; they total 45GB (gzipped).

I've removed /data/dump_bug_1246965.

[emorley@treeherder1.db.scl3 mysql]$ sudo find . -type f -printf "%f %s\n" | awk '{
    split( $1, FILEPARTS, "." );
    FILENAME = FILEPARTS[1];
    SIZE_IN_MB = int($2/(1024*1024));
    FILENAME_MAP[FILENAME] += SIZE_IN_MB;
  }
  END {
    for( FILENAME in FILENAME_MAP ) {
      SIZE_IN_GB = FILENAME_MAP[FILENAME] / 1024;
      printf("%.1fGB %s\n", SIZE_IN_GB, FILENAME);
    }
  }' | sort -rh | head -n 15
81.4GB job_detail
57.0GB treeherder1-bin
56.0GB job_artifact
47.8GB performance_datum
47.1GB failure_line
44.7GB job
29.4GB text_log_step
20.8GB job_log
17.9GB text_log_error
7.2GB text_log_summary_line
4.1GB text_log_summary
1.4GB ibdata1
1.0GB revision
0.4GB treeherder1-relay-bin
0.3GB revision_map

It's possible some of these tables are fragmented too.
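
To confirm, information_schema shows how much reclaimable space each table is carrying, and an OPTIMIZE TABLE (which for InnoDB rewrites the whole table, so is best done one table at a time in a quiet period) reclaims it. A sketch, with the schema name and example table illustrative:

# Reclaimable space per table, largest first
mysql -e "SELECT table_name, ROUND(data_free/1024/1024/1024, 1) AS reclaimable_gb
          FROM information_schema.tables
          WHERE table_schema = 'treeherder'
          ORDER BY data_free DESC
          LIMIT 15;"

# Rebuild a fragmented table to reclaim the space
mysql -e "OPTIMIZE TABLE job_artifact;" treeherder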
Depends on: 1303069
Between:
* stopping storing some old artifact types (bug 1258861, bug 1301729)
* deleting the old data dump (comment 2)
* purging old job_artifacts/defragging most tables (bug 1303069)
* the bloated binlogs (from the recent data migrations) slowly expiring (we keep 7 days' worth; see the note at the end of this comment)

...prod DB usage has dropped by 102 GB since Wednesday, and is now at 354 GB.

There is now plenty of headroom for the DB dump (and less data to dump in the first place).
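
On the binlog point: assuming retention is configured via the standard expire_logs_days setting, the 7-day window can be confirmed, and older binlogs purged by hand if they're eating space (replication permitting), with something like:

mysql -e "SHOW VARIABLES LIKE 'expire_logs_days';"
mysql -e "PURGE BINARY LOGS BEFORE DATE_SUB(NOW(), INTERVAL 7 DAY);"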
Assignee: nobody → emorley
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED