Closed Bug 1142488 Opened 10 years ago Closed 10 years ago

Stage DB usage increased after recent perfherder changes

Categories

(Tree Management :: Perfherder, defect, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: emorley, Assigned: emorley)

Details

I clearly spoke too soon in the meeting yesterday. I just got an alert from New Relic for the stage DB - it's currently at 80% full. The usage rate climbed after the deploy on 10th March @ ~12-2pm UTC: https://rpm.newrelic.com/accounts/677903/servers/6106894/disks#id=868884149

In two days it's used another 80GB (from 220GB to 300GB). The changes only recently made it to prod, so it's hard to say what the impact is there - though we have more wiggle room on prod.

We can:
a) Expire data sooner (again)
b) Try the gzipping blobs idea from the meeting
c) Double check there's not something crazy going on (80GB in two days seems like a lot!)

Will, would you mind driving this?
Flags: needinfo?(wlachance)
Meant to say: I don't see any cycle-data errors on stage's New Relic page:
https://rpm.newrelic.com/accounts/677903/applications/5585473/traced_errors
(remember to increase the range to >24 hours, since the task only runs once a day)

The cycle-data task seems to be running, according to:
https://rpm.newrelic.com/accounts/677903/applications/5585473/transactions#id=5b224f746865725472616e73616374696f6e2f43656c6572792f6379636c652d64617461222c22225d

(In reply to Ed Morley [:edmorley] from comment #0)
> b) Try the gzipping blobs idea from the meeting
> c) Double check there's not something crazy going on (80GB in two days seems
> like a lot!)

Put it this way: 300GB is currently 6 weeks of Treeherder data, which is 7GB/day for _everything_. We've just gone up 40GB per day.
That's pretty odd - I wouldn't have thought that what's currently in git should be causing any change in git usage. I'll take a look...
Disk usage that is, not git usage
On stage db1:

> SELECT ROUND(SUM((data_length+index_length)/power(1024,3)),1) size_gb, table_schema AS db
  FROM information_schema.tables
  GROUP BY table_schema
  ORDER BY size_gb DESC
  LIMIT 20

+---------+-------------------------------+
| size_gb | db                            |
+---------+-------------------------------+
|    75.8 | mozilla_inbound_jobs_1        |
|    29.3 | try_jobs_1                    |
|    27.2 | fx_team_jobs_1                |
|    13.8 | mozilla_central_jobs_1        |
|    12.5 | b2g_inbound_jobs_1            |
|     7.7 | mozilla_aurora_jobs_1         |
|     4.0 | mozilla_beta_jobs_1           |
|     2.6 | try_objectstore_1             |
|     2.3 | mozilla_inbound_objectstore_1 |
|     1.9 | gaia_try_jobs_1               |
|     1.6 | gum_jobs_1                    |
|     0.9 | ash_jobs_1                    |
|     0.8 | cypress_jobs_1                |
|     0.8 | cedar_jobs_1                  |
|     0.8 | fx_team_objectstore_1         |
|     0.7 | mozilla_central_objectstore_1 |
|     0.6 | mozilla_b2g37_v2_2_jobs_1     |
|     0.5 | b2g_inbound_objectstore_1     |
|     0.4 | gaia_try_objectstore_1        |
|     0.4 | mozilla_aurora_objectstore_1  |
+---------+-------------------------------+
20 rows

> SELECT ROUND(SUM((data_length+index_length)/power(1024,3)),1) size_gb, table_name
  FROM information_schema.tables
  GROUP BY table_name
  ORDER BY size_gb DESC
  LIMIT 10

+---------+----------------------+
| size_gb | table_name           |
+---------+----------------------+
|   119.8 | performance_artifact |
|    48.7 | job_artifact         |
|     8.6 | objectstore          |
|     5.5 | job                  |
|     3.4 | performance_series   |
|     1.3 | job_log_url          |
|     0.7 | job_eta              |
|     0.2 | revision             |
|     0.2 | series_signature     |
|     0.1 | revision_map         |
+---------+----------------------+
10 rows
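(For reference when re-running this later: the same information_schema query can be scoped to a single schema rather than grouped across all of them. This is only a sketch, reusing the columns from the query above, with the schema name taken from the top of the first listing:)

SELECT ROUND((data_length + index_length) / POWER(1024, 3), 1) AS size_gb,
       table_name
FROM information_schema.tables
WHERE table_schema = 'mozilla_inbound_jobs_1'
ORDER BY size_gb DESC
LIMIT 10;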
Wow ok I may have spoken too soon. There are a lot of treeherder1-bin.* binary logs - though the timestamps from them are only over two days:

-bash-4.1$ du -bc treeherder1-bin.* -ch | tail -n 1
109G  total

I know Sheeri mentioned wanting to keep the binary logs for longer, but 109GB seems excessive. I guess the question is: have we always had 109GB (and something else changed), or have the logs got larger (eg more activity)/the duration we're keeping them for increased?

The full breakdown:

-bash-4.1$ du -hs * | sort -hr
82G   mozilla_inbound_jobs_1
33G   try_jobs_1
29G   fx_team_jobs_1
15G   mozilla_central_jobs_1
14G   b2g_inbound_jobs_1
8.4G  mozilla_aurora_jobs_1
5.0G  mozilla_beta_jobs_1
2.7G  try_objectstore_1
2.4G  mozilla_inbound_objectstore_1
2.2G  gaia_try_jobs_1
2.1G  gum_jobs_1
1.2G  ash_jobs_1
1.1G  treeherder1-bin.001553
1.1G  treeherder1-bin.001552
1.1G  treeherder1-bin.001551
1.1G  treeherder1-bin.001550
1.1G  treeherder1-bin.001549
1.1G  treeherder1-bin.001548
1.1G  treeherder1-bin.001547
1.1G  treeherder1-bin.001546
1.1G  treeherder1-bin.001545
1.1G  treeherder1-bin.001544
1.1G  treeherder1-bin.001543
1.1G  treeherder1-bin.001542
1.1G  treeherder1-bin.001541
1.1G  treeherder1-bin.001540
1.1G  treeherder1-bin.001539
1.1G  treeherder1-bin.001538
1.1G  treeherder1-bin.001537
1.1G  treeherder1-bin.001536
1.1G  treeherder1-bin.001535
1.1G  treeherder1-bin.001534
1.1G  treeherder1-bin.001533
1.1G  treeherder1-bin.001532
1.1G  treeherder1-bin.001531
1.1G  treeherder1-bin.001530
1.1G  treeherder1-bin.001529
1.1G  treeherder1-bin.001528
1.1G  treeherder1-bin.001527
1.1G  treeherder1-bin.001526
1.1G  treeherder1-bin.001525
1.1G  treeherder1-bin.001524
1.1G  treeherder1-bin.001523
1.1G  treeherder1-bin.001522
1.1G  treeherder1-bin.001521
1.1G  treeherder1-bin.001520
1.1G  treeherder1-bin.001519
1.1G  treeherder1-bin.001518
1.1G  treeherder1-bin.001517
1.1G  treeherder1-bin.001516
1.1G  treeherder1-bin.001515
1.1G  treeherder1-bin.001514
1.1G  treeherder1-bin.001513
1.1G  treeherder1-bin.001512
1.1G  treeherder1-bin.001511
1.1G  treeherder1-bin.001510
1.1G  treeherder1-bin.001509
1.1G  treeherder1-bin.001508
1.1G  treeherder1-bin.001507
1.1G  treeherder1-bin.001506
1.1G  treeherder1-bin.001505
1.1G  treeherder1-bin.001504
1.1G  treeherder1-bin.001503
1.1G  treeherder1-bin.001502
1.1G  treeherder1-bin.001501
1.1G  treeherder1-bin.001500
1.1G  treeherder1-bin.001499
1.1G  treeherder1-bin.001498
1.1G  treeherder1-bin.001497
1.1G  treeherder1-bin.001496
1.1G  treeherder1-bin.001495
1.1G  treeherder1-bin.001494
1.1G  treeherder1-bin.001493
1.1G  treeherder1-bin.001492
1.1G  treeherder1-bin.001491
1.1G  treeherder1-bin.001490
1.1G  treeherder1-bin.001489
1.1G  treeherder1-bin.001488
1.1G  treeherder1-bin.001487
1.1G  treeherder1-bin.001486
1.1G  treeherder1-bin.001485
1.1G  treeherder1-bin.001484
1.1G  treeherder1-bin.001483
1.1G  treeherder1-bin.001482
1.1G  treeherder1-bin.001481
1.1G  treeherder1-bin.001480
1.1G  treeherder1-bin.001479
1.1G  treeherder1-bin.001478
1.1G  treeherder1-bin.001477
1.1G  treeherder1-bin.001476
1.1G  treeherder1-bin.001475
1.1G  treeherder1-bin.001474
1.1G  treeherder1-bin.001473
1.1G  treeherder1-bin.001472
1.1G  treeherder1-bin.001471
1.1G  treeherder1-bin.001470
1.1G  treeherder1-bin.001469
1.1G  treeherder1-bin.001468
1.1G  treeherder1-bin.001467
1.1G  treeherder1-bin.001466
1.1G  treeherder1-bin.001465
1.1G  treeherder1-bin.001464
1.1G  treeherder1-bin.001463
1.1G  treeherder1-bin.001462
1.1G  treeherder1-bin.001461
1.1G  treeherder1-bin.001460
1.1G  treeherder1-bin.001459
1.1G  treeherder1-bin.001458
1.1G  treeherder1-bin.001457
1.1G  treeherder1-bin.001456
1.1G  treeherder1-bin.001455
1.1G  treeherder1-bin.001454
1.1G  treeherder1-bin.001453
1.1G  treeherder1-bin.001452
1.1G  treeherder1-bin.001451
1.1G  treeherder1-bin.001450
1.1G  treeherder1-bin.001449
1.1G  treeherder1-bin.001448
1.1G  treeherder1-bin.001447
1.1G  treeherder1-bin.001446
1.1G  cypress_jobs_1
1022M cedar_jobs_1
1005M mozilla_central_objectstore_1
809M  fx_team_objectstore_1
768M  mozilla_b2g37_v2_2_jobs_1
674M  mozilla_release_jobs_1
668M  mozilla_b2g34_v2_1_jobs_1
585M  gaia_try_objectstore_1
570M  mozilla_b2g32_v2_0_jobs_1
533M  b2g_inbound_objectstore_1
513M  mozilla_aurora_objectstore_1
378M  holly_jobs_1
376M  comm_central_jobs_1
370M  treeherder1-bin.001554
337M  mozilla_b2g30_v1_4_jobs_1
333M  ibdata1
313M  maple_jobs_1
305M  mozilla_beta_objectstore_1
301M  ib_logfile1
300M  ib_logfile0
299M  oak_jobs_1
290M  try_comm_central_jobs_1
268M  mozilla_esr31_jobs_1
200M  date_jobs_1
173M  mozilla_b2g37_v2_2_objectstore_1
166M  comm_aurora_jobs_1
156M  mozilla_b2g34_v2_1s_jobs_1
150M  elm_jobs_1
145M  pine_jobs_1
129M  mozilla_b2g34_v2_1_objectstore_1
109M  mozilla_b2g32_v2_0_objectstore_1
99M   gaia_jobs_1
97M   larch_jobs_1
95M   comm_esr31_jobs_1
89M   mozilla_esr31_objectstore_1
83M   comm_beta_jobs_1
77M   mozilla_b2g30_v1_4_objectstore_1
77M   gum_objectstore_1
77M   cypress_objectstore_1
77M   cedar_objectstore_1
72M   treeherder_stage
71M   jamun_jobs_1
69M   ash_objectstore_1
56M   ux_jobs_1
54M   addon_sdk_jobs_1
49M   oak_objectstore_1
49M   mozilla_release_objectstore_1
37M   mozilla_b2g34_v2_1s_objectstore_1
31M   comm_central_objectstore_1
27M   gaia_master_jobs_1
25M   alder_jobs_1
24M   try_comm_central_objectstore_1
23M   comm_aurora_objectstore_1
22M   staging_gaia_try_jobs_1
16M   percona
16M   holly_objectstore_1
14M   larch_objectstore_1
14M   comm_esr31_objectstore_1
13M   maple_objectstore_1
13M   addon_sdk_objectstore_1
12M   pine_objectstore_1
12M   comm_beta_objectstore_1
11M   elm_objectstore_1
11M   date_objectstore_1
7.6M  mysql
4.2M  ib_buffer_pool
4.1M  alder_objectstore_1
2.9M  mozilla_b2g28_v1_3t_jobs_1
2.7M  unknown_jobs_1
2.7M  try_taskcluster_jobs_1
2.7M  taskcluster_integration_jobs_1
2.7M  services_central_jobs_1
2.7M  qa_try_jobs_1
2.7M  mozilla_esr24_jobs_1
2.7M  mozilla_esr17_jobs_1
2.7M  mozilla_b2g28_v1_3_jobs_1
2.7M  mozilla_b2g26_v1_2_jobs_1
2.7M  mozilla_b2g18_v1_1_0_hd_jobs_1
2.7M  mozilla_b2g18_jobs_1
2.7M  graphics_jobs_1
2.7M  gaia_v1_4_jobs_1
2.7M  fig_jobs_1
2.7M  comm_esr24_jobs_1
2.7M  build_system_jobs_1
2.7M  bugzilla_jobs_1
2.7M  bmo_jobs_1
2.7M  birch_jobs_1
636K  performance_schema
292K  mozilla_b2g28_v1_3t_objectstore_1
180K  ux_objectstore_1
180K  unknown_objectstore_1
180K  try_taskcluster_objectstore_1
180K  taskcluster_integration_objectstore_1
180K  staging_gaia_try_objectstore_1
180K  services_central_objectstore_1
180K  qa_try_objectstore_1
180K  mozilla_esr24_objectstore_1
180K  mozilla_esr17_objectstore_1
180K  mozilla_b2g28_v1_3_objectstore_1
180K  mozilla_b2g26_v1_2_objectstore_1
180K  mozilla_b2g18_v1_1_0_hd_objectstore_1
180K  mozilla_b2g18_objectstore_1
180K  jamun_objectstore_1
180K  graphics_objectstore_1
180K  gaia_v1_4_objectstore_1
180K  gaia_objectstore_1
180K  gaia_master_objectstore_1
180K  fig_objectstore_1
180K  comm_esr24_objectstore_1
180K  build_system_objectstore_1
180K  bugzilla_objectstore_1
180K  bmo_objectstore_1
180K  birch_objectstore_1
8.0K  treeherder1-bin.index
4.0K  treeherder1.stage.db.scl3.mozilla.com.pid
4.0K  treeherder1-relay-bin.index
4.0K  treeherder1-relay-bin.000896
4.0K  treeherder1-relay-bin.000895
4.0K  test
4.0K  RPM_UPGRADE_MARKER-LAST
4.0K  RPM_UPGRADE_HISTORY
4.0K  relay-log.info
4.0K  mysql_upgrade_info
4.0K  auto.cnf
0     mysql.sock
That said, performance_artifact still accounts for 120GB out of 200GB total table usage...
Summary: DB usage increased after recent perfherder changes → Stage DB usage increased after recent perfherder changes
So a few things about performance artifacts:

1. It looks like the performance artifact has quite a bit of useless data in it (talos aux data) which we're unnecessarily including as metadata on every artifact. This should actually be stored with the summary series, as it's essentially data which applies to the whole suite.
2. We should probably gzip the performance artifacts.

I think between those two things we should be able to bring db usage right down for that aspect of Perfherder. This still doesn't really explain the recent disk usage spike, though. Sheeri, do you have any suggestions on what might be going on with the proliferation of binary logs mentioned above in comment 5?
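(As a rough way to gauge how much (2) might save before writing any code, MySQL's built-in COMPRESS() - zlib, the same deflate family gzip uses - can be run over a small sample of rows. This is only a sketch: it assumes the payload column in performance_artifact is called `blob` and that the table has an auto-increment `id`, so check the actual schema before running it.)

-- Compare average raw vs. zlib-compressed payload size over a recent sample
SELECT ROUND(AVG(LENGTH(`blob`)) / 1024, 1)           AS avg_raw_kb,
       ROUND(AVG(LENGTH(COMPRESS(`blob`))) / 1024, 1) AS avg_compressed_kb
FROM (
    -- Sample the 1000 most recent artifacts rather than scanning the table
    SELECT `blob`
    FROM mozilla_inbound_jobs_1.performance_artifact
    ORDER BY id DESC
    LIMIT 1000
) AS sample;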
I have another theory. When I put the broken perf summary stuff in, no talos data was being ingested for suites like tp5. This meant that disk usage would go down, as old data was expired and no new data was being ingested to take its place. It makes sense now that this has been "fixed" that disk usage should be crawling back up. However, I believe it should plateau at around the level it is at now. I'll do a bit more digging but I suspect this is a problem that won't get any worse. We should probably still fix the excessive perf artifact space usage issues. I'll file another issue about that.
Filed bug 1142631 to deal with the perf artifact bloat (basically just (1) in my list in comment 7). Let's get that in and see if this problem gets any better or worse.
Flags: needinfo?(wlachance)
(In reply to William Lachance (:wlach) from comment #7)
> 2. We should probably gzip the performance artifacts.

Filed bug 1142648.
The binary logs have been huge for a while. When treeherder disk space first started filling up, we reduced the binary logs down to about 2 days' worth. 50G per day sounds about right, at least in the last few months.

If you like, we can analyze to see what's going on inside the logs. But the binary logs reflect every single change that happens in the system, so if you don't think treeherder has a huge volume of changes (e.g. INSERT/UPDATE/DELETE/REPLACE/CREATE/DROP/ALTER etc, but *not* SELECT) then that's a flag to raise.

It's probably easiest if I analyze one binary log, about 1.1G of info, sounds like about 30 mins of data. Let me know if you want that analysis.
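(For anyone checking the binlog situation from inside MySQL rather than on the filesystem, these two statements show each log with its size in bytes and the current automatic-expiry window. They're standard MySQL commands, included here only for reference; SHOW BINARY LOGS needs the REPLICATION CLIENT or SUPER privilege.)

-- One row per binary log, with its size in bytes
SHOW BINARY LOGS;
-- How many days of binary logs are kept before automatic expiry
SHOW VARIABLES LIKE 'expire_logs_days';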
(In reply to Sheeri Cabral [:sheeri] from comment #11)
> The binary logs have been huge for a while. When treeherder disk space first
> started filling up, we reduced the binary logs down to about 2 days' worth.
> 50G per day sounds about right, at least in the last few months.
> 
> If you like, we can analyze to see what's going on inside the logs. But the
> binary logs reflect every single change that happens in the system, so if
> you don't think treeherder has a huge volume of changes (e.g.
> INSERT/UPDATE/DELETE/REPLACE/CREATE/DROP/ALTER etc, but *not* SELECT) then
> that's a flag to raise.
> 
> It's probably easiest if I analyze one binary log, about 1.1G of info,
> sounds like about 30 mins of data. Let me know if you want that analysis.

Let's see how things go over the next few days now that the fix to bug 1142631 is deployed. I suspect the removal of the aux stuff in the performance series should also reduce the size of the logs.
Thank you Sheeri - knowing that 50GB/day is in roughly the right ballpark is fine for now. We do have a fair rate of churn on some tables, which probably doesn't help (things like bug 1140349 will help with that).
And today (though Sheeri has just pruned logs):

(In reply to Ed Morley [:edmorley] from comment #5)
> -bash-4.1$ du -bc treeherder1-bin.* -ch | tail -n 1
> 109G  total

-bash-4.1$ du -bc treeherder1-bin.* -ch | tail -n 1
43G   total

> The full breakdown:
>
> -bash-4.1$ du -hs * | sort -hr
> 82G   mozilla_inbound_jobs_1
> 33G   try_jobs_1
> 29G   fx_team_jobs_1
> 15G   mozilla_central_jobs_1
> 14G   b2g_inbound_jobs_1
> 8.4G  mozilla_aurora_jobs_1
> 5.0G  mozilla_beta_jobs_1
> 2.7G  try_objectstore_1
> 2.4G  mozilla_inbound_objectstore_1
> 2.2G  gaia_try_jobs_1
> 2.1G  gum_jobs_1
> 1.2G  ash_jobs_1
> 1.1G  treeherder1-bin.001553
> 1.1G  treeherder1-bin.001552
> 1.1G  treeherder1-bin.001551

-bash-4.1$ du -hs * | sort -hr | head -n 15
84G   mozilla_inbound_jobs_1
36G   try_jobs_1
30G   fx_team_jobs_1
16G   mozilla_central_jobs_1
14G   b2g_inbound_jobs_1
8.6G  mozilla_aurora_jobs_1
5.1G  mozilla_beta_jobs_1
3.3G  try_objectstore_1
3.1G  mozilla_inbound_objectstore_1
2.5G  gaia_try_jobs_1
2.2G  gum_jobs_1
1.2G  ash_jobs_1
1.1G  treeherder1-bin.001813
1.1G  treeherder1-bin.001812
1.1G  treeherder1-bin.001811
Running Ed's query on mozilla_inbound_jobs again, we see that performance_artifact is taking up most of the space:

+---------+----------------------+
| size_gb | table_name           |
+---------+----------------------+
|   111.9 | performance_artifact |
|    54.1 | job_artifact         |
|    10.2 | objectstore          |
|     5.8 | job                  |
|     3.6 | performance_series   |
|     1.5 | job_log_url          |
|     0.7 | job_eta              |
|     0.2 | revision             |
|     0.1 | series_signature     |
|     0.1 | revision_map         |
+---------+----------------------+
10 rows in set (0.15 sec)

It's possible that the addition of summary series artifacts is causing a net increase in space used, even if the removal of the auxiliary data helped somewhat. We should probably bite the bullet and implement bug 1142648; I don't think it should be that hard.
Stage paged for space again - expire_logs_days is set to 1 and that's still over 100G of logs.
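(For reference, a manual prune presumably looks something like the following - a sketch only; it needs the appropriate privileges, and replicas must already have read past the purged logs. Since expire_logs_days is already at 1 here, the remaining lever is really reducing the volume of changes written to the logs in the first place.)

-- Remove binary logs older than one day in a single pass
PURGE BINARY LOGS BEFORE NOW() - INTERVAL 1 DAY;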
Whiteboard: MGSEI-RTL-3F
Stage db{1,2} alerted for disk usage again this evening (85%+). I've run an optimize table on a few tables (see bug 1142648 comment 5) - this freed ~44GB. I also truncated a few of the objectstore tables. It looks like Sheeri also purged the binlogs. We're now comfortably within the limit (43%), and should be fine even when the logs grow back (~70%).
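(Roughly what the reclamation above amounts to, as a sketch - the schema and table names are taken from the listings earlier in this bug, so double-check them before running. OPTIMIZE TABLE on InnoDB rebuilds the table, so it needs temporary free space and takes a while on tables this size.)

-- Rebuild the largest tables to reclaim space left behind by deleted rows
OPTIMIZE TABLE mozilla_inbound_jobs_1.performance_artifact;
OPTIMIZE TABLE mozilla_inbound_jobs_1.job_artifact;
-- Empty one of the objectstore tables (repeat per schema as needed)
TRUNCATE TABLE mozilla_inbound_objectstore_1.objectstore;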
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Assignee: nobody → emorley