Closed Bug 1142488 Opened 10 years ago Closed 10 years ago
Stage DB usage increased after recent perfherder changes
Categories: (Tree Management :: Perfherder, defect, P1)
Tracking: (Not tracked)
Status: RESOLVED FIXED
People: (Reporter: emorley, Assigned: emorley)
Details
I clearly spoke too soon in the meeting yesterday.
I just got an alert from New Relic for the stage DB - it's currently at 80% full, and the usage rate climbed with the deploy on 10th March at ~12:00-14:00 UTC:
https://rpm.newrelic.com/accounts/677903/servers/6106894/disks#id=868884149
In two days it has used another 80GB (from 220GB to 300GB).
The changes only recently made it to prod, so it's hard to say what the impact will be there - though we do have more wiggle room on prod.
We can:
a) Expire data sooner (again)
b) Try the gzipping blobs idea from the meeting
c) Double check there's not something crazy going on (80GB in two days seems like a lot!)
Will, would you mind driving this?
Flags: needinfo?(wlachance)
Assignee
Comment 1•10 years ago
Meant to say:
I don't see any cycle-data errors on stage's New Relic page:
https://rpm.newrelic.com/accounts/677903/applications/5585473/traced_errors
(remember to increase the time range to more than 24 hours, since the task only runs once a day)
The cycle data task seems to be running according to:
https://rpm.newrelic.com/accounts/677903/applications/5585473/transactions#id=5b224f746865725472616e73616374696f6e2f43656c6572792f6379636c652d64617461222c22225d
(In reply to Ed Morley [:edmorley] from comment #0)
> b) Try the gzipping blobs idea from the meeting
> c) Double check there's not something crazy going on (80GB in two days seems
> like a lot!)
Put it this way: 300GB is currently 6 weeks of Treeherder data, which works out to ~7GB/day for _everything_. We've just gone up by 40GB per day.
Comment 2•10 years ago
That's pretty odd - I wouldn't have thought that what's currently in git would be causing any change in git usage. I'll take a look...
Comment 3•10 years ago
Disk usage, that is - not git usage.
Assignee
Comment 4•10 years ago
On stage db1
> SELECT ROUND(SUM((data_length+index_length)/power(1024,3)),1) size_gb, table_schema AS db
FROM information_schema.tables GROUP BY table_schema
ORDER BY size_gb DESC LIMIT 20
+---------+-------------------------------+
| size_gb | db                            |
+---------+-------------------------------+
|    75.8 | mozilla_inbound_jobs_1        |
|    29.3 | try_jobs_1                    |
|    27.2 | fx_team_jobs_1                |
|    13.8 | mozilla_central_jobs_1        |
|    12.5 | b2g_inbound_jobs_1            |
|     7.7 | mozilla_aurora_jobs_1         |
|     4.0 | mozilla_beta_jobs_1           |
|     2.6 | try_objectstore_1             |
|     2.3 | mozilla_inbound_objectstore_1 |
|     1.9 | gaia_try_jobs_1               |
|     1.6 | gum_jobs_1                    |
|     0.9 | ash_jobs_1                    |
|     0.8 | cypress_jobs_1                |
|     0.8 | cedar_jobs_1                  |
|     0.8 | fx_team_objectstore_1         |
|     0.7 | mozilla_central_objectstore_1 |
|     0.6 | mozilla_b2g37_v2_2_jobs_1     |
|     0.5 | b2g_inbound_objectstore_1     |
|     0.4 | gaia_try_objectstore_1        |
|     0.4 | mozilla_aurora_objectstore_1  |
+---------+-------------------------------+
20 rows
> SELECT ROUND(SUM((data_length+index_length)/power(1024,3)),1) size_gb, table_name
FROM information_schema.tables GROUP BY table_name
ORDER BY size_gb DESC LIMIT 10
+---------+----------------------+
| size_gb | table_name           |
+---------+----------------------+
|   119.8 | performance_artifact |
|    48.7 | job_artifact         |
|     8.6 | objectstore          |
|     5.5 | job                  |
|     3.4 | performance_series   |
|     1.3 | job_log_url          |
|     0.7 | job_eta              |
|     0.2 | revision             |
|     0.2 | series_signature     |
|     0.1 | revision_map         |
+---------+----------------------+
10 rows
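For drilling into a single repo, the same information_schema query can be scoped with a WHERE clause - a sketch only, with the schema name taken from the listing above:
> SELECT ROUND(SUM((data_length+index_length)/power(1024,3)),1) size_gb, table_name
  FROM information_schema.tables
  WHERE table_schema = 'mozilla_inbound_jobs_1'
  GROUP BY table_name
  ORDER BY size_gb DESC LIMIT 10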
Assignee
Comment 5•10 years ago
Wow, OK - I may have spoken too soon. There are a lot of treeherder1-bin.* binary logs, though their timestamps only span the last two days:
-bash-4.1$ du -bc treeherder1-bin.* -ch | tail -n 1
109G total
I know Sheeri mentioned wanting to keep the binary logs for longer, but 109GB seems excessive. I guess the question is: have we always had 109GB of them (and something else changed), or have the logs grown larger (e.g. more activity), or has the length of time we keep them increased?
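A quick way to answer that would be to compare the oldest/newest binlog timestamps against the server's retention setting (a sketch; SHOW BINARY LOGS and expire_logs_days are standard MySQL):
-bash-4.1$ ls -lh treeherder1-bin.0* | head -n 2    # oldest logs
-bash-4.1$ ls -lh treeherder1-bin.0* | tail -n 2    # newest logs
> SHOW VARIABLES LIKE 'expire_logs_days';
> SHOW BINARY LOGS;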
The full breakdown:
-bash-4.1$ du -hs * | sort -hr
82G mozilla_inbound_jobs_1
33G try_jobs_1
29G fx_team_jobs_1
15G mozilla_central_jobs_1
14G b2g_inbound_jobs_1
8.4G mozilla_aurora_jobs_1
5.0G mozilla_beta_jobs_1
2.7G try_objectstore_1
2.4G mozilla_inbound_objectstore_1
2.2G gaia_try_jobs_1
2.1G gum_jobs_1
1.2G ash_jobs_1
1.1G treeherder1-bin.001553
1.1G treeherder1-bin.001552
1.1G treeherder1-bin.001551
1.1G treeherder1-bin.001550
1.1G treeherder1-bin.001549
1.1G treeherder1-bin.001548
1.1G treeherder1-bin.001547
1.1G treeherder1-bin.001546
1.1G treeherder1-bin.001545
1.1G treeherder1-bin.001544
1.1G treeherder1-bin.001543
1.1G treeherder1-bin.001542
1.1G treeherder1-bin.001541
1.1G treeherder1-bin.001540
1.1G treeherder1-bin.001539
1.1G treeherder1-bin.001538
1.1G treeherder1-bin.001537
1.1G treeherder1-bin.001536
1.1G treeherder1-bin.001535
1.1G treeherder1-bin.001534
1.1G treeherder1-bin.001533
1.1G treeherder1-bin.001532
1.1G treeherder1-bin.001531
1.1G treeherder1-bin.001530
1.1G treeherder1-bin.001529
1.1G treeherder1-bin.001528
1.1G treeherder1-bin.001527
1.1G treeherder1-bin.001526
1.1G treeherder1-bin.001525
1.1G treeherder1-bin.001524
1.1G treeherder1-bin.001523
1.1G treeherder1-bin.001522
1.1G treeherder1-bin.001521
1.1G treeherder1-bin.001520
1.1G treeherder1-bin.001519
1.1G treeherder1-bin.001518
1.1G treeherder1-bin.001517
1.1G treeherder1-bin.001516
1.1G treeherder1-bin.001515
1.1G treeherder1-bin.001514
1.1G treeherder1-bin.001513
1.1G treeherder1-bin.001512
1.1G treeherder1-bin.001511
1.1G treeherder1-bin.001510
1.1G treeherder1-bin.001509
1.1G treeherder1-bin.001508
1.1G treeherder1-bin.001507
1.1G treeherder1-bin.001506
1.1G treeherder1-bin.001505
1.1G treeherder1-bin.001504
1.1G treeherder1-bin.001503
1.1G treeherder1-bin.001502
1.1G treeherder1-bin.001501
1.1G treeherder1-bin.001500
1.1G treeherder1-bin.001499
1.1G treeherder1-bin.001498
1.1G treeherder1-bin.001497
1.1G treeherder1-bin.001496
1.1G treeherder1-bin.001495
1.1G treeherder1-bin.001494
1.1G treeherder1-bin.001493
1.1G treeherder1-bin.001492
1.1G treeherder1-bin.001491
1.1G treeherder1-bin.001490
1.1G treeherder1-bin.001489
1.1G treeherder1-bin.001488
1.1G treeherder1-bin.001487
1.1G treeherder1-bin.001486
1.1G treeherder1-bin.001485
1.1G treeherder1-bin.001484
1.1G treeherder1-bin.001483
1.1G treeherder1-bin.001482
1.1G treeherder1-bin.001481
1.1G treeherder1-bin.001480
1.1G treeherder1-bin.001479
1.1G treeherder1-bin.001478
1.1G treeherder1-bin.001477
1.1G treeherder1-bin.001476
1.1G treeherder1-bin.001475
1.1G treeherder1-bin.001474
1.1G treeherder1-bin.001473
1.1G treeherder1-bin.001472
1.1G treeherder1-bin.001471
1.1G treeherder1-bin.001470
1.1G treeherder1-bin.001469
1.1G treeherder1-bin.001468
1.1G treeherder1-bin.001467
1.1G treeherder1-bin.001466
1.1G treeherder1-bin.001465
1.1G treeherder1-bin.001464
1.1G treeherder1-bin.001463
1.1G treeherder1-bin.001462
1.1G treeherder1-bin.001461
1.1G treeherder1-bin.001460
1.1G treeherder1-bin.001459
1.1G treeherder1-bin.001458
1.1G treeherder1-bin.001457
1.1G treeherder1-bin.001456
1.1G treeherder1-bin.001455
1.1G treeherder1-bin.001454
1.1G treeherder1-bin.001453
1.1G treeherder1-bin.001452
1.1G treeherder1-bin.001451
1.1G treeherder1-bin.001450
1.1G treeherder1-bin.001449
1.1G treeherder1-bin.001448
1.1G treeherder1-bin.001447
1.1G treeherder1-bin.001446
1.1G cypress_jobs_1
1022M cedar_jobs_1
1005M mozilla_central_objectstore_1
809M fx_team_objectstore_1
768M mozilla_b2g37_v2_2_jobs_1
674M mozilla_release_jobs_1
668M mozilla_b2g34_v2_1_jobs_1
585M gaia_try_objectstore_1
570M mozilla_b2g32_v2_0_jobs_1
533M b2g_inbound_objectstore_1
513M mozilla_aurora_objectstore_1
378M holly_jobs_1
376M comm_central_jobs_1
370M treeherder1-bin.001554
337M mozilla_b2g30_v1_4_jobs_1
333M ibdata1
313M maple_jobs_1
305M mozilla_beta_objectstore_1
301M ib_logfile1
300M ib_logfile0
299M oak_jobs_1
290M try_comm_central_jobs_1
268M mozilla_esr31_jobs_1
200M date_jobs_1
173M mozilla_b2g37_v2_2_objectstore_1
166M comm_aurora_jobs_1
156M mozilla_b2g34_v2_1s_jobs_1
150M elm_jobs_1
145M pine_jobs_1
129M mozilla_b2g34_v2_1_objectstore_1
109M mozilla_b2g32_v2_0_objectstore_1
99M gaia_jobs_1
97M larch_jobs_1
95M comm_esr31_jobs_1
89M mozilla_esr31_objectstore_1
83M comm_beta_jobs_1
77M mozilla_b2g30_v1_4_objectstore_1
77M gum_objectstore_1
77M cypress_objectstore_1
77M cedar_objectstore_1
72M treeherder_stage
71M jamun_jobs_1
69M ash_objectstore_1
56M ux_jobs_1
54M addon_sdk_jobs_1
49M oak_objectstore_1
49M mozilla_release_objectstore_1
37M mozilla_b2g34_v2_1s_objectstore_1
31M comm_central_objectstore_1
27M gaia_master_jobs_1
25M alder_jobs_1
24M try_comm_central_objectstore_1
23M comm_aurora_objectstore_1
22M staging_gaia_try_jobs_1
16M percona
16M holly_objectstore_1
14M larch_objectstore_1
14M comm_esr31_objectstore_1
13M maple_objectstore_1
13M addon_sdk_objectstore_1
12M pine_objectstore_1
12M comm_beta_objectstore_1
11M elm_objectstore_1
11M date_objectstore_1
7.6M mysql
4.2M ib_buffer_pool
4.1M alder_objectstore_1
2.9M mozilla_b2g28_v1_3t_jobs_1
2.7M unknown_jobs_1
2.7M try_taskcluster_jobs_1
2.7M taskcluster_integration_jobs_1
2.7M services_central_jobs_1
2.7M qa_try_jobs_1
2.7M mozilla_esr24_jobs_1
2.7M mozilla_esr17_jobs_1
2.7M mozilla_b2g28_v1_3_jobs_1
2.7M mozilla_b2g26_v1_2_jobs_1
2.7M mozilla_b2g18_v1_1_0_hd_jobs_1
2.7M mozilla_b2g18_jobs_1
2.7M graphics_jobs_1
2.7M gaia_v1_4_jobs_1
2.7M fig_jobs_1
2.7M comm_esr24_jobs_1
2.7M build_system_jobs_1
2.7M bugzilla_jobs_1
2.7M bmo_jobs_1
2.7M birch_jobs_1
636K performance_schema
292K mozilla_b2g28_v1_3t_objectstore_1
180K ux_objectstore_1
180K unknown_objectstore_1
180K try_taskcluster_objectstore_1
180K taskcluster_integration_objectstore_1
180K staging_gaia_try_objectstore_1
180K services_central_objectstore_1
180K qa_try_objectstore_1
180K mozilla_esr24_objectstore_1
180K mozilla_esr17_objectstore_1
180K mozilla_b2g28_v1_3_objectstore_1
180K mozilla_b2g26_v1_2_objectstore_1
180K mozilla_b2g18_v1_1_0_hd_objectstore_1
180K mozilla_b2g18_objectstore_1
180K jamun_objectstore_1
180K graphics_objectstore_1
180K gaia_v1_4_objectstore_1
180K gaia_objectstore_1
180K gaia_master_objectstore_1
180K fig_objectstore_1
180K comm_esr24_objectstore_1
180K build_system_objectstore_1
180K bugzilla_objectstore_1
180K bmo_objectstore_1
180K birch_objectstore_1
8.0K treeherder1-bin.index
4.0K treeherder1.stage.db.scl3.mozilla.com.pid
4.0K treeherder1-relay-bin.index
4.0K treeherder1-relay-bin.000896
4.0K treeherder1-relay-bin.000895
4.0K test
4.0K RPM_UPGRADE_MARKER-LAST
4.0K RPM_UPGRADE_HISTORY
4.0K relay-log.info
4.0K mysql_upgrade_info
4.0K auto.cnf
0 mysql.sock
Assignee
Comment 6•10 years ago
That said, performance_artifact still accounts for 120GB out of 200GB total table usage...
Assignee
Updated•10 years ago
Summary: DB usage increased after recent perfherder changes → Stage DB usage increased after recent perfherder changes
Comment 7•10 years ago
So, a few things about performance artifacts:
1. It looks like the performance artifact has quite a bit of useless data in it (talos aux data) which we're unnecessarily including as metadata on every artifact. This should really be stored with the summary series, since it's essentially data that applies to the whole suite.
2. We should probably gzip the performance artifacts.
Between those two things, I think we should be able to bring db usage right down for that aspect of Perfherder.
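As a very rough feel for what (2) would save, gzipping a dumped artifact payload on the shell gives a ballpark compression ratio (a sketch only - sample_artifact.json is a hypothetical filename; the real change would happen at ingestion time in Treeherder):
-bash-4.1$ ls -lh sample_artifact.json
-bash-4.1$ gzip -9 -c sample_artifact.json > sample_artifact.json.gz
-bash-4.1$ ls -lh sample_artifact.json.gz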
This still doesn't really explain the recent disk usage spike, though. Sheeri, do you have any suggestions on what might be going on with the proliferation of binary logs mentioned in comment 5?
Comment 8•10 years ago
I have another theory. When I landed the broken perf summary stuff, no talos data was being ingested for suites like tp5. That meant disk usage would go down, since old data was being expired and no new data was taking its place. Now that this has been "fixed", it makes sense that disk usage is climbing back up - but I believe it should plateau at around its current level.
I'll do a bit more digging, but I suspect this is a problem that won't get any worse. We should still fix the excessive perf artifact space usage, though; I'll file another issue about that.
Comment 9•10 years ago
Filed bug 1142631 to deal with the perf artifact bloat (basically just item (1) from my list in comment 7). Let's get that in and see whether this problem gets any better or worse.
Flags: needinfo?(wlachance)
Assignee
Comment 10•10 years ago
(In reply to William Lachance (:wlach) from comment #7)
> 2. We should probably gzip the performance artifacts.
Filed bug 1142648.
Comment 11•10 years ago
The binary logs have been huge for a while. When treeherder disk space first started filling up, we reduced the binary logs down to about two days' worth. 50GB per day sounds about right, at least over the last few months.
If you like, we can analyze what's going on inside the logs. The binary logs reflect every single change that happens in the system (e.g. INSERT/UPDATE/DELETE/REPLACE/CREATE/DROP/ALTER, but *not* SELECT), so if you don't think treeherder has a huge volume of changes, that's a flag to raise.
It's probably easiest if I analyze one binary log; at about 1.1GB it sounds like roughly 30 minutes of data. Let me know if you want that analysis.
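Roughly, that analysis is just counting statement types in one log with mysqlbinlog (a sketch; file name taken from the listing in comment 5, and if the server logs in row format you'd add --verbose --base64-output=decode-rows so the row events are decoded to pseudo-SQL):
-bash-4.1$ mysqlbinlog treeherder1-bin.001553 \
    | grep -oiE '^(INSERT|UPDATE|DELETE|REPLACE)' \
    | tr '[:lower:]' '[:upper:]' \
    | sort | uniq -c | sort -rn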
Comment 12•10 years ago
(In reply to Sheeri Cabral [:sheeri] from comment #11)
> The binary logs have been huge for a while. When treeherder disk space first
> started filling up, we reduced the binary logs down to about two days'
> worth. 50GB per day sounds about right, at least over the last few months.
>
> If you like, we can analyze what's going on inside the logs. The binary logs
> reflect every single change that happens in the system (e.g.
> INSERT/UPDATE/DELETE/REPLACE/CREATE/DROP/ALTER, but *not* SELECT), so if you
> don't think treeherder has a huge volume of changes, that's a flag to raise.
>
> It's probably easiest if I analyze one binary log; at about 1.1GB it sounds
> like roughly 30 minutes of data. Let me know if you want that analysis.
Let's see how things go over the next few days now that the fix for bug 1142631 is deployed. I suspect that removing the aux data from the performance series should also reduce the size of the logs.
Assignee
Comment 13•10 years ago
Thank you Sheeri - knowing that 50GB/day is in roughly the right ballpark is fine for now. We do have a fair rate of churn on some tables, which probably doesn't help (things like bug 1140349 will help with that).
Assignee
Comment 14•10 years ago
And today (though Sheeri has just pruned logs):
(In reply to Ed Morley [:edmorley] from comment #5)
> -bash-4.1$ du -bc treeherder1-bin.* -ch | tail -n 1
> 109G total
-bash-4.1$ du -bc treeherder1-bin.* -ch | tail -n 1
43G total
> The full breakdown:
>
> -bash-4.1$ du -hs * | sort -hr
> 82G mozilla_inbound_jobs_1
> 33G try_jobs_1
> 29G fx_team_jobs_1
> 15G mozilla_central_jobs_1
> 14G b2g_inbound_jobs_1
> 8.4G mozilla_aurora_jobs_1
> 5.0G mozilla_beta_jobs_1
> 2.7G try_objectstore_1
> 2.4G mozilla_inbound_objectstore_1
> 2.2G gaia_try_jobs_1
> 2.1G gum_jobs_1
> 1.2G ash_jobs_1
> 1.1G treeherder1-bin.001553
> 1.1G treeherder1-bin.001552
> 1.1G treeherder1-bin.001551
-bash-4.1$ du -hs * | sort -hr | head -n 15
84G mozilla_inbound_jobs_1
36G try_jobs_1
30G fx_team_jobs_1
16G mozilla_central_jobs_1
14G b2g_inbound_jobs_1
8.6G mozilla_aurora_jobs_1
5.1G mozilla_beta_jobs_1
3.3G try_objectstore_1
3.1G mozilla_inbound_objectstore_1
2.5G gaia_try_jobs_1
2.2G gum_jobs_1
1.2G ash_jobs_1
1.1G treeherder1-bin.001813
1.1G treeherder1-bin.001812
1.1G treeherder1-bin.001811
Comment 15•10 years ago
Running Ed's query on mozilla_inbound_jobs again, we see that performance_artifact is taking up most of the space:
+---------+----------------------+
| size_gb | table_name |
+---------+----------------------+
| 111.9 | performance_artifact |
| 54.1 | job_artifact |
| 10.2 | objectstore |
| 5.8 | job |
| 3.6 | performance_series |
| 1.5 | job_log_url |
| 0.7 | job_eta |
| 0.2 | revision |
| 0.1 | series_signature |
| 0.1 | revision_map |
+---------+----------------------+
10 rows in set (0.15 sec)
It's possible that the addition of summary series artifacts is causing a net increase in space used, even if dropping the auxiliary data helped somewhat. We should probably bite the bullet and implement bug 1142648; I don't think it should be that hard.
Comment 16•10 years ago
Stage paged for space again - expire_logs_days is set to 1 and that's still over 100GB of logs.
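For reference, the relevant knobs (standard MySQL; the cut-off in the PURGE example is arbitrary, and a manual purge is only safe once any replicas have read past that point):
> SET GLOBAL expire_logs_days = 1;                      -- the current retention setting
> PURGE BINARY LOGS BEFORE NOW() - INTERVAL 12 HOUR;    -- example cut-off for a manual prune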
Updated•10 years ago
Whiteboard: MGSEI-RTL-3F
Assignee
Updated•10 years ago
Whiteboard: MGSEI-RTL-3F
Assignee
Comment 17•10 years ago
Stage db{1,2} alerted for disk usage again this evening (85%+ used). I've run OPTIMIZE TABLE on a few tables (see bug 1142648 comment 5), which freed ~44GB, and I truncated a few of the objectstore tables. It looks like Sheeri has also purged the binlogs. We're now comfortably within the limit (43% used), and should be fine even once the logs grow back (~70%).
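For the record, this was along these lines (standard MySQL; the schema/table names here are illustrative, taken from the listings earlier in the bug - the exact set of tables is in bug 1142648 comment 5):
> OPTIMIZE TABLE mozilla_inbound_jobs_1.performance_artifact;   -- rebuild the table to reclaim space from deleted rows
> TRUNCATE TABLE try_objectstore_1.objectstore;                 -- drop all rows from an objectstore table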
Assignee
Updated•10 years ago
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Assignee
Updated•10 years ago
Assignee: nobody → emorley