Closed Bug 1142488 Opened 9 years ago Closed 9 years ago

Stage DB usage increased after recent perfherder changes

Categories

(Tree Management :: Perfherder, defect, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: emorley, Assigned: emorley)

Details

I clearly spoke too soon in the meeting yesterday.

I just got an alert from New Relic for the stage DB - it's currently at 80% full. The usage growth rate climbed after the deploy on 10th March @ ~12-2pm UTC:
https://rpm.newrelic.com/accounts/677903/servers/6106894/disks#id=868884149

In two days it's used another 80GB (from 220GB to 300GB).

The changes only recently made it to prod, so it's hard to say what the impact will be there yet - though we have more wiggle room on prod.

We can:
a) Expire data sooner (again) - roughly the effect sketched below this list
b) Try the gzipping blobs idea from the meeting
c) Double check there's not something crazy going on (80GB in two days seems like a lot!)
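
For reference, option (a) boils down to shrinking the retention window the daily cycle-data task deletes against. A minimal sketch of the effect at the SQL level - the real task is a Celery job, and the table/column names below are illustrative placeholders rather than the exact schema:

-- Hypothetical sketch only: cycle-data is actually a Celery task, and these
-- table/column names are placeholders. "Expire sooner" just means a smaller cutoff.
SET @cutoff = UNIX_TIMESTAMP(NOW() - INTERVAL 30 DAY);  -- e.g. 30 days instead of ~6 weeks

-- Delete in bounded chunks so each transaction (and its binlog entry) stays small.
DELETE FROM job WHERE submit_timestamp < @cutoff LIMIT 1000;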

Will, would you mind driving this?
Flags: needinfo?(wlachance)
Meant to say:

I don't see any cycle-data errors on stage's New Relic page: 
https://rpm.newrelic.com/accounts/677903/applications/5585473/traced_errors
(remember to increase the range to >24 hours, since the task only runs once a day)

The cycle data task seems to be running according to:
https://rpm.newrelic.com/accounts/677903/applications/5585473/transactions#id=5b224f746865725472616e73616374696f6e2f43656c6572792f6379636c652d64617461222c22225d

(In reply to Ed Morley [:edmorley] from comment #0)
> b) Try the gzipping blobs idea from the meeting
> c) Double check there's not something crazy going on (80GB in two days seems
> like a lot!)

Put it this way: 300GB is currently 6 weeks of Treeherder data, which is 7GB/day for _everything_. We've just gone up 40GB per day.
That's pretty odd - I wouldn't think what's currently in git should be causing any change in git usage. I'll take a look...
Disk usage that is, not git usage
On stage db1

> SELECT ROUND(SUM((data_length+index_length)/power(1024,3)),1) size_gb, table_schema AS db
> FROM information_schema.tables GROUP BY table_schema
> ORDER BY size_gb DESC LIMIT 20

+---------+-------------------------------+
| size_gb | db                            |
+---------+-------------------------------+
|    75.8 | mozilla_inbound_jobs_1        |
|    29.3 | try_jobs_1                    |
|    27.2 | fx_team_jobs_1                |
|    13.8 | mozilla_central_jobs_1        |
|    12.5 | b2g_inbound_jobs_1            |
|     7.7 | mozilla_aurora_jobs_1         |
|     4.0 | mozilla_beta_jobs_1           |
|     2.6 | try_objectstore_1             |
|     2.3 | mozilla_inbound_objectstore_1 |
|     1.9 | gaia_try_jobs_1               |
|     1.6 | gum_jobs_1                    |
|     0.9 | ash_jobs_1                    |
|     0.8 | cypress_jobs_1                |
|     0.8 | cedar_jobs_1                  |
|     0.8 | fx_team_objectstore_1         |
|     0.7 | mozilla_central_objectstore_1 |
|     0.6 | mozilla_b2g37_v2_2_jobs_1     |
|     0.5 | b2g_inbound_objectstore_1     |
|     0.4 | gaia_try_objectstore_1        |
|     0.4 | mozilla_aurora_objectstore_1  |
+---------+-------------------------------+
20 rows

> SELECT ROUND(SUM((data_length+index_length)/power(1024,3)),1) size_gb, table_name
> FROM information_schema.tables GROUP BY table_name
> ORDER BY size_gb DESC LIMIT 10

+---------+----------------------+
| size_gb | table_name           |
+---------+----------------------+
|   119.8 | performance_artifact |
|    48.7 | job_artifact         |
|     8.6 | objectstore          |
|     5.5 | job                  |
|     3.4 | performance_series   |
|     1.3 | job_log_url          |
|     0.7 | job_eta              |
|     0.2 | revision             |
|     0.2 | series_signature     |
|     0.1 | revision_map         |
+---------+----------------------+
10 rows
Wow, OK, I may have spoken too soon. There are a lot of treeherder1-bin.* binary logs - though their timestamps only span two days:

-bash-4.1$ du -bc treeherder1-bin.* -ch | tail -n 1
109G    total

I know Sheeri mentioned wanting to keep the binary logs for longer, but 109GB seems excessive. I guess the question is: have we always had 109GB (and something else changed), or have the logs grown larger (e.g. more activity), or has the duration we keep them for increased?
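
For reference, the log inventory and retention settings can be checked from the MySQL side with standard commands (nothing Treeherder-specific):

SHOW BINARY LOGS;                        -- one row per treeherder1-bin.* file, with sizes
SHOW VARIABLES LIKE 'expire_logs_days';  -- auto-expiry window in days (0 = keep forever)
SHOW VARIABLES LIKE 'max_binlog_size';   -- size at which the server rotates to a new file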

The full breakdown:

-bash-4.1$ du -hs * | sort -hr
82G     mozilla_inbound_jobs_1
33G     try_jobs_1
29G     fx_team_jobs_1
15G     mozilla_central_jobs_1
14G     b2g_inbound_jobs_1
8.4G    mozilla_aurora_jobs_1
5.0G    mozilla_beta_jobs_1
2.7G    try_objectstore_1
2.4G    mozilla_inbound_objectstore_1
2.2G    gaia_try_jobs_1
2.1G    gum_jobs_1
1.2G    ash_jobs_1
1.1G    treeherder1-bin.001553
1.1G    treeherder1-bin.001552
1.1G    treeherder1-bin.001551
1.1G    treeherder1-bin.001550
1.1G    treeherder1-bin.001549
1.1G    treeherder1-bin.001548
1.1G    treeherder1-bin.001547
1.1G    treeherder1-bin.001546
1.1G    treeherder1-bin.001545
1.1G    treeherder1-bin.001544
1.1G    treeherder1-bin.001543
1.1G    treeherder1-bin.001542
1.1G    treeherder1-bin.001541
1.1G    treeherder1-bin.001540
1.1G    treeherder1-bin.001539
1.1G    treeherder1-bin.001538
1.1G    treeherder1-bin.001537
1.1G    treeherder1-bin.001536
1.1G    treeherder1-bin.001535
1.1G    treeherder1-bin.001534
1.1G    treeherder1-bin.001533
1.1G    treeherder1-bin.001532
1.1G    treeherder1-bin.001531
1.1G    treeherder1-bin.001530
1.1G    treeherder1-bin.001529
1.1G    treeherder1-bin.001528
1.1G    treeherder1-bin.001527
1.1G    treeherder1-bin.001526
1.1G    treeherder1-bin.001525
1.1G    treeherder1-bin.001524
1.1G    treeherder1-bin.001523
1.1G    treeherder1-bin.001522
1.1G    treeherder1-bin.001521
1.1G    treeherder1-bin.001520
1.1G    treeherder1-bin.001519
1.1G    treeherder1-bin.001518
1.1G    treeherder1-bin.001517
1.1G    treeherder1-bin.001516
1.1G    treeherder1-bin.001515
1.1G    treeherder1-bin.001514
1.1G    treeherder1-bin.001513
1.1G    treeherder1-bin.001512
1.1G    treeherder1-bin.001511
1.1G    treeherder1-bin.001510
1.1G    treeherder1-bin.001509
1.1G    treeherder1-bin.001508
1.1G    treeherder1-bin.001507
1.1G    treeherder1-bin.001506
1.1G    treeherder1-bin.001505
1.1G    treeherder1-bin.001504
1.1G    treeherder1-bin.001503
1.1G    treeherder1-bin.001502
1.1G    treeherder1-bin.001501
1.1G    treeherder1-bin.001500
1.1G    treeherder1-bin.001499
1.1G    treeherder1-bin.001498
1.1G    treeherder1-bin.001497
1.1G    treeherder1-bin.001496
1.1G    treeherder1-bin.001495
1.1G    treeherder1-bin.001494
1.1G    treeherder1-bin.001493
1.1G    treeherder1-bin.001492
1.1G    treeherder1-bin.001491
1.1G    treeherder1-bin.001490
1.1G    treeherder1-bin.001489
1.1G    treeherder1-bin.001488
1.1G    treeherder1-bin.001487
1.1G    treeherder1-bin.001486
1.1G    treeherder1-bin.001485
1.1G    treeherder1-bin.001484
1.1G    treeherder1-bin.001483
1.1G    treeherder1-bin.001482
1.1G    treeherder1-bin.001481
1.1G    treeherder1-bin.001480
1.1G    treeherder1-bin.001479
1.1G    treeherder1-bin.001478
1.1G    treeherder1-bin.001477
1.1G    treeherder1-bin.001476
1.1G    treeherder1-bin.001475
1.1G    treeherder1-bin.001474
1.1G    treeherder1-bin.001473
1.1G    treeherder1-bin.001472
1.1G    treeherder1-bin.001471
1.1G    treeherder1-bin.001470
1.1G    treeherder1-bin.001469
1.1G    treeherder1-bin.001468
1.1G    treeherder1-bin.001467
1.1G    treeherder1-bin.001466
1.1G    treeherder1-bin.001465
1.1G    treeherder1-bin.001464
1.1G    treeherder1-bin.001463
1.1G    treeherder1-bin.001462
1.1G    treeherder1-bin.001461
1.1G    treeherder1-bin.001460
1.1G    treeherder1-bin.001459
1.1G    treeherder1-bin.001458
1.1G    treeherder1-bin.001457
1.1G    treeherder1-bin.001456
1.1G    treeherder1-bin.001455
1.1G    treeherder1-bin.001454
1.1G    treeherder1-bin.001453
1.1G    treeherder1-bin.001452
1.1G    treeherder1-bin.001451
1.1G    treeherder1-bin.001450
1.1G    treeherder1-bin.001449
1.1G    treeherder1-bin.001448
1.1G    treeherder1-bin.001447
1.1G    treeherder1-bin.001446
1.1G    cypress_jobs_1
1022M   cedar_jobs_1
1005M   mozilla_central_objectstore_1
809M    fx_team_objectstore_1
768M    mozilla_b2g37_v2_2_jobs_1
674M    mozilla_release_jobs_1
668M    mozilla_b2g34_v2_1_jobs_1
585M    gaia_try_objectstore_1
570M    mozilla_b2g32_v2_0_jobs_1
533M    b2g_inbound_objectstore_1
513M    mozilla_aurora_objectstore_1
378M    holly_jobs_1
376M    comm_central_jobs_1
370M    treeherder1-bin.001554
337M    mozilla_b2g30_v1_4_jobs_1
333M    ibdata1
313M    maple_jobs_1
305M    mozilla_beta_objectstore_1
301M    ib_logfile1
300M    ib_logfile0
299M    oak_jobs_1
290M    try_comm_central_jobs_1
268M    mozilla_esr31_jobs_1
200M    date_jobs_1
173M    mozilla_b2g37_v2_2_objectstore_1
166M    comm_aurora_jobs_1
156M    mozilla_b2g34_v2_1s_jobs_1
150M    elm_jobs_1
145M    pine_jobs_1
129M    mozilla_b2g34_v2_1_objectstore_1
109M    mozilla_b2g32_v2_0_objectstore_1
99M     gaia_jobs_1
97M     larch_jobs_1
95M     comm_esr31_jobs_1
89M     mozilla_esr31_objectstore_1
83M     comm_beta_jobs_1
77M     mozilla_b2g30_v1_4_objectstore_1
77M     gum_objectstore_1
77M     cypress_objectstore_1
77M     cedar_objectstore_1
72M     treeherder_stage
71M     jamun_jobs_1
69M     ash_objectstore_1
56M     ux_jobs_1
54M     addon_sdk_jobs_1
49M     oak_objectstore_1
49M     mozilla_release_objectstore_1
37M     mozilla_b2g34_v2_1s_objectstore_1
31M     comm_central_objectstore_1
27M     gaia_master_jobs_1
25M     alder_jobs_1
24M     try_comm_central_objectstore_1
23M     comm_aurora_objectstore_1
22M     staging_gaia_try_jobs_1
16M     percona
16M     holly_objectstore_1
14M     larch_objectstore_1
14M     comm_esr31_objectstore_1
13M     maple_objectstore_1
13M     addon_sdk_objectstore_1
12M     pine_objectstore_1
12M     comm_beta_objectstore_1
11M     elm_objectstore_1
11M     date_objectstore_1
7.6M    mysql
4.2M    ib_buffer_pool
4.1M    alder_objectstore_1
2.9M    mozilla_b2g28_v1_3t_jobs_1
2.7M    unknown_jobs_1
2.7M    try_taskcluster_jobs_1
2.7M    taskcluster_integration_jobs_1
2.7M    services_central_jobs_1
2.7M    qa_try_jobs_1
2.7M    mozilla_esr24_jobs_1
2.7M    mozilla_esr17_jobs_1
2.7M    mozilla_b2g28_v1_3_jobs_1
2.7M    mozilla_b2g26_v1_2_jobs_1
2.7M    mozilla_b2g18_v1_1_0_hd_jobs_1
2.7M    mozilla_b2g18_jobs_1
2.7M    graphics_jobs_1
2.7M    gaia_v1_4_jobs_1
2.7M    fig_jobs_1
2.7M    comm_esr24_jobs_1
2.7M    build_system_jobs_1
2.7M    bugzilla_jobs_1
2.7M    bmo_jobs_1
2.7M    birch_jobs_1
636K    performance_schema
292K    mozilla_b2g28_v1_3t_objectstore_1
180K    ux_objectstore_1
180K    unknown_objectstore_1
180K    try_taskcluster_objectstore_1
180K    taskcluster_integration_objectstore_1
180K    staging_gaia_try_objectstore_1
180K    services_central_objectstore_1
180K    qa_try_objectstore_1
180K    mozilla_esr24_objectstore_1
180K    mozilla_esr17_objectstore_1
180K    mozilla_b2g28_v1_3_objectstore_1
180K    mozilla_b2g26_v1_2_objectstore_1
180K    mozilla_b2g18_v1_1_0_hd_objectstore_1
180K    mozilla_b2g18_objectstore_1
180K    jamun_objectstore_1
180K    graphics_objectstore_1
180K    gaia_v1_4_objectstore_1
180K    gaia_objectstore_1
180K    gaia_master_objectstore_1
180K    fig_objectstore_1
180K    comm_esr24_objectstore_1
180K    build_system_objectstore_1
180K    bugzilla_objectstore_1
180K    bmo_objectstore_1
180K    birch_objectstore_1
8.0K    treeherder1-bin.index
4.0K    treeherder1.stage.db.scl3.mozilla.com.pid
4.0K    treeherder1-relay-bin.index
4.0K    treeherder1-relay-bin.000896
4.0K    treeherder1-relay-bin.000895
4.0K    test
4.0K    RPM_UPGRADE_MARKER-LAST
4.0K    RPM_UPGRADE_HISTORY
4.0K    relay-log.info
4.0K    mysql_upgrade_info
4.0K    auto.cnf
0       mysql.sock
That said, performance_artifact still accounts for 120GB out of 200GB total table usage...
Summary: DB usage increased after recent perfherder changes → Stage DB usage increased after recent perfherder changes
So a few things about performance artifacts:

1. It looks like the performance artifact has quite a bit of useless data in it (talos aux data) which we're unnecessarily including as metadata on every artifact. This should really be stored with the summary series, as it's essentially data that applies to the suite as a whole.
2. We should probably gzip the performance artifacts.

I think between those two things we should be able to bring db usage right down for that aspect of perfherder.

This still doesn't really explain the recent disk usage spike. Sheeri, do you have any suggestions on what might be going on with the proliferation of binary logs mentioned above in comment 5?
I have another theory. When I put the broken perf summary stuff in, no talos data was being ingested for suites like tp5. This meant that disk usage would go down, as old data was expired and no new data was being ingested to take its place. Now that this has been "fixed", it makes sense that disk usage is crawling back up. However, I believe it should plateau at around the level it is at now.

I'll do a bit more digging but I suspect this is a problem that won't get any worse. We should probably still fix the excessive perf artifact space usage issues. I'll file another issue about that.
Filed bug 1142631 to deal with the perf artifact bloat (basically just (1) in my list in comment 7). Let's get that in and see if this problem gets any better or worse.
Flags: needinfo?(wlachance)
(In reply to William Lachance (:wlach) from comment #7)
> 2. We should probably gzip the performance artifacts.

Filed bug 1142648.
The binary logs have been huge for a while. When treeherder disk space first started filling up, we reduced the binary logs down to about 2 days' worth. 50G per day sounds about right, at least in the last few months.

If you like, we can analyze to see what's going on inside the logs. But the binary logs reflect every single change that happens in the system, so if you don't think treeherder has a huge volume of changes (e.g. INSERT/UPDATE/DELETE/REPLACE/CREATE/DROP/ALTER etc, but *not* SELECT) then that's a flag to raise.

It's probably easiest if I analyze one binary log - about 1.1G of info, which sounds like about 30 minutes of data. Let me know if you want that analysis.
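
(For anyone wanting a quick look themselves - assuming statement-based replication; with row-based logging you'd see row events rather than the original SQL - one of the logs listed above can be sampled with:)

SHOW BINLOG EVENTS IN 'treeherder1-bin.001553' LIMIT 20;
-- or dump the whole file with the mysqlbinlog CLI and eyeball which tables dominate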
(In reply to Sheeri Cabral [:sheeri] from comment #11)
> The binary logs have been huge for a while. When treeherder disk space first
> started filling up, we reduced the binary logs down to about 2 days' worth.
> 50G per day sounds about right, at least in the last few months.
> 
> If you like, we can analyze to see what's going on inside the logs. But the
> binary logs reflect every single change that happens in the system, so if
> you don't think treeherder has a huge volume of changes (e.g.
> INSERT/UPDATE/DELETE/REPLACE/CREATE/DROP/ALTER etc, but *not* SELECT) then
> that's a flag to raise.
> 
> It's probably easiest if I analyze one binary log, about 1.1G of info,
> sounds like about 30 mins of data. Let me know if you want that analysis.

Let's see how things go over the next few days now that the fix to bug 1142631 is deployed. I suspect the removal of the aux stuff in the performance series should also reduce the size of the logs.
Thank you Sheeri - knowing that 50GB/day is roughly in the right ballpark is fine for now. We do have a fair rate of churn on some tables, which probably doesn't help (things like bug 1140349 will help with that).
And today (though Sheeri has just pruned logs):

(In reply to Ed Morley [:edmorley] from comment #5)
> -bash-4.1$ du -bc treeherder1-bin.* -ch | tail -n 1
> 109G    total

-bash-4.1$ du -bc treeherder1-bin.* -ch | tail -n 1
43G     total

> The full breakdown:
> 
> -bash-4.1$ du -hs * | sort -hr
> 82G     mozilla_inbound_jobs_1
> 33G     try_jobs_1
> 29G     fx_team_jobs_1
> 15G     mozilla_central_jobs_1
> 14G     b2g_inbound_jobs_1
> 8.4G    mozilla_aurora_jobs_1
> 5.0G    mozilla_beta_jobs_1
> 2.7G    try_objectstore_1
> 2.4G    mozilla_inbound_objectstore_1
> 2.2G    gaia_try_jobs_1
> 2.1G    gum_jobs_1
> 1.2G    ash_jobs_1
> 1.1G    treeherder1-bin.001553
> 1.1G    treeherder1-bin.001552
> 1.1G    treeherder1-bin.001551

-bash-4.1$ du -hs * | sort -hr | head -n 15
84G     mozilla_inbound_jobs_1
36G     try_jobs_1
30G     fx_team_jobs_1
16G     mozilla_central_jobs_1
14G     b2g_inbound_jobs_1
8.6G    mozilla_aurora_jobs_1
5.1G    mozilla_beta_jobs_1
3.3G    try_objectstore_1
3.1G    mozilla_inbound_objectstore_1
2.5G    gaia_try_jobs_1
2.2G    gum_jobs_1
1.2G    ash_jobs_1
1.1G    treeherder1-bin.001813
1.1G    treeherder1-bin.001812
1.1G    treeherder1-bin.001811
Running Ed's query on mozilla_inbound_jobs again, we see that performance_artifact is taking up most of the space:

+---------+----------------------+
| size_gb | table_name           |
+---------+----------------------+
|   111.9 | performance_artifact |
|    54.1 | job_artifact         |
|    10.2 | objectstore          |
|     5.8 | job                  |
|     3.6 | performance_series   |
|     1.5 | job_log_url          |
|     0.7 | job_eta              |
|     0.2 | revision             |
|     0.1 | series_signature     |
|     0.1 | revision_map         |
+---------+----------------------+
10 rows in set (0.15 sec)

It's possible that the addition of summary series artifacts is causing a net increase in space used, even if dropping the auxiliary data helped somewhat. We should probably bite the bullet and implement bug 1142648; I don't think it should be that hard.
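
A rough way to gauge what gzipping would buy is to sample recent rows with MySQL's built-in zlib COMPRESS(). This assumes the JSON payload lives in a column named `blob` (run inside one of the jobs schemas, e.g. mozilla_inbound_jobs_1); the real fix in bug 1142648 would presumably compress in the application layer rather than in SQL:

SELECT ROUND(SUM(LENGTH(`blob`)) / POWER(1024,2), 1)           AS raw_mb,
       ROUND(SUM(LENGTH(COMPRESS(`blob`))) / POWER(1024,2), 1) AS compressed_mb
FROM (SELECT `blob` FROM performance_artifact
      ORDER BY id DESC LIMIT 1000) AS sample;   -- sample the newest 1000 artifacts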
Stage paged for space again - expire_logs_days is set to 1 and that's still over 100G of logs.
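
(expire_logs_days is an integer number of days on this MySQL version, so 1 is already the floor; beyond that the options are manual purges - DBA territory, and unsafe if a replica hasn't read the logs yet - or simply writing less to the binlog:)

SHOW BINARY LOGS;                                            -- confirm what's actually on disk
PURGE BINARY LOGS BEFORE DATE_SUB(NOW(), INTERVAL 12 HOUR);  -- manual prune of older files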
Whiteboard: MGSEI-RTL-3F
Stage db{1,2} alerted for disk usage again this evening (85%+). I've run OPTIMIZE TABLE on a few tables (see bug 1142648 comment 5), which freed ~44GB. I also truncated a few of the objectstore tables. It looks like Sheeri also purged the binlogs. We're now comfortably within the limit (43% used), and should be fine even once the logs grow back (~70%).
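
(For the record, the reclamation was along these lines; with InnoDB, OPTIMIZE TABLE is mapped to a full table rebuild, so it needs temporary disk headroom and can take a while on tables this size. Which objectstore tables were truncated isn't recorded here, so that line is illustrative:)

OPTIMIZE TABLE mozilla_inbound_jobs_1.performance_artifact,
               mozilla_inbound_jobs_1.job_artifact;   -- rebuilds the tables and reclaims free space

TRUNCATE TABLE try_objectstore_1.objectstore;   -- illustrative; the objectstore stages incoming
                                                -- payloads, so truncating was deemed safe on stage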
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Assignee: nobody → emorley