1078523 - Analyse current DB size & growth rates

Assignee

Description

•

10 years ago

In order to find ways to help with bug 1078392, it would be great to have DB usage broken down by:
* repo
* table
* for job artefacts, by artefact type

Once we have the first round of stats, we may need to drill deeper (eg log parser artefact for successful vs not successful jobs).

Ed Morley [:emorley]

Assignee

Comment 1

•

10 years ago

We'll likely also want to just look at the last 2 weeks worth of data to avoid the skew from past lower consumption - and also likely again after bug 1078415 lands, to see if that helped.

Ed Morley [:emorley]

Assignee

Updated

•

10 years ago

Summary: Analyse current DB usage → Analyse current DB storage usage

Ed Morley [:emorley]

Assignee

Updated

•

10 years ago

Summary: Analyse current DB storage usage → Analyse current DB size & growth rates

Ed Morley [:emorley]

Assignee

Updated

•

10 years ago

Blocks: 1080757

Ed Morley [:emorley]

Assignee

Updated

•

10 years ago

Priority: -- → P3

Jonathan Griffin (:jgriffin)

Updated

•

10 years ago

Priority: P3 → P2

Mauro Doglio [:mdoglio]

Updated

•

10 years ago

Assignee: nobody → mdoglio

Mauro Doglio [:mdoglio]

Comment 2

•

10 years ago

Attached file prod_db_size_20141119.csv — Details

The attached csv file contains a list of all the databases on production with their respective size in GB.

Mauro Doglio [:mdoglio]

Comment 3

•

10 years ago

New relic says that the disk on treeherder1.db.scl3.mozilla.com will be full in 16 days and and the one on treeherder2.db.scl3.mozilla.com will be full in 13 days. Raising the priority here to P1

Priority: P2 → P1

Jonathan Griffin (:jgriffin)

Comment 4

•

10 years ago

Whoops.  Can we prune a month of old data as an emergency measure to give us more space to fix this correctly?

Ed Morley [:emorley]

Assignee

Comment 5

•

10 years ago

Summing the table sizes in the attached CSV gives a total of 330GB - bug 1076740 supposedly gave us 700GB to play with. Why the substantial difference?
a) New Relic incorrectly reporting free space?
b) The 700GB is also used for something else? (stage?)
c) The increase to 700GB didn't stick?
d) The values in the CSV are incorrect (eg don't including indexes)?

That aside, perhaps we could dial down the 6 month expiry to something shorter for now in case the 16 days figure is true?

Some of the oldest data has less use at the moment, since (a) the trees weren't being sheriffed then, so things like the failure classifications are missing, (b) job name regexes were wrong etc. Also, by the time we have dashboards written to use Treeherder as a datasource (eg OrangeFactor v2), we'll have expired these older jobs even if we just left the timeframe at the current 6 months.

Bug 1078392 comment 0 implies as of today, we have ~5 months of data in treeherder - is this correct?

Reducing the 6 months to say 4, should give us some breathing room for now - plus also test our data expiration code.

We should also remove repos we've marked as "onhold" at some point, however looking at the CSV, the top 20 tables account for 97% of the data usage, and none of them are repos we've hidden/deprecated.

Jonathan Griffin (:jgriffin)

Comment 6

•

10 years ago

At the work week, Fubar showed us a new relic view which claimed we were using 400GB+ of db space.  Maybe he can give us current figures?

Flags: needinfo?(klibby)

Mauro Doglio [:mdoglio]

Comment 7

•

10 years ago

Attached file prod_try_tables_size_20141119.csv — Details

Size of each table in try_jobs_1

Mauro Doglio [:mdoglio]

Comment 8

•

10 years ago

Attached file job_artifact_type_size_20141119.csv — Details

This list tells us what's the impact of each artifact type on the size of ~25000 rows. Query used: 

select name, sum(char_length(`blob`)) /1024 / 1024 "size in MB" from job_artifact where id > <place a job_artifact id here> group by name;

Mauro Doglio [:mdoglio]

Comment 9

•

10 years ago

:edmorley this is the query I used to obtain the total size of the databases
SELECT
  table_schema "Data Base Name",
  SUM( data_length + index_length ) / 1024 / 1024 / 1024 "Data Base Size in GB"
FROM
  information_schema.TABLES
GROUP BY
  table_schema ;

Mauro Doglio [:mdoglio]

Comment 10

•

10 years ago

:jgriffin if cutting out a month of data is acceptable, that's the easiest and quickest solution. Otherwise we can ask to the DBAs if we can move try_jobs_1 to a separate VM, from the application point of view that's totally doable.

Jonathan Griffin (:jgriffin)

Comment 11

•

10 years ago

I think stripping the oldest month of data is very acceptable, for all the reasons that Ed stated in comment #5.  Let's hold off on moving try_jobs_1 to a separate VM until we've done some more work to analyze and reduce our DB growth rate.

Kendall Libby [:fubar] (he/him)

Comment 12

•

10 years ago

I would suggest involving the DBAs if you have any questions at all about how much space is actually used, and by what. e.g. mysql binary logs are 1.1gb a pop, and there are 20+. also, your try table sizes above show low-60gb, but on disk I see:

69G     try_jobs_1
9.3G    try_objectstore_1


Current disk usage:
treeherder1.db.scl3# df -h /data
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb1       690G  400G  255G  62% /data

Flags: needinfo?(klibby)

Mauro Doglio [:mdoglio]

Comment 13

•

10 years ago

:jgriffin we need to improve the data cycling routine before we run it to delete one month of data. I'll work on the required changes in bug 1102228

Mauro Doglio [:mdoglio]

Updated

•

10 years ago

Depends on: 1102228

Mauro Doglio [:mdoglio]

Updated

•

10 years ago

Depends on: 1102257

Mauro Doglio [:mdoglio]

Comment 14

•

10 years ago

I'm running the improved data cycling routine on production. I'll start with a 150 days cycle.

Ed Morley [:emorley]

Assignee

Updated

•

10 years ago

No longer blocks: 1080757

Component: Treeherder → Treeherder: Data Ingestion

Ed Morley [:emorley]

Assignee

Comment 15

•

9 years ago

(In reply to Mauro Doglio [:mdoglio] from comment #14)
> I'm running the improved data cycling routine on production. I'll start with
> a 150 days cycle.

Looking at dev/stage/prod, it seems that stage/dev are still set to 6 months (rather than 5 of dev) - is this going to cause a problem, or were the time until disk full predictions completely off after all?

Flags: needinfo?(mdoglio)

Mauro Doglio [:mdoglio]

Comment 16

•

9 years ago

I haven't set the new data cycling policy in the setting file yet.

Flags: needinfo?(mdoglio)

Treeherder GitHub Bugbot

Comment 17

•

9 years ago

Commit pushed to master at https://github.com/mozilla/treeherder-service

https://github.com/mozilla/treeherder-service/commit/ae1ef4e3457262980ce3dad8d216d574ba845bef
Bug 1078523 - cycle data older than 5 months

Ed Morley [:emorley]

Assignee

Comment 18

•

9 years ago

I think this is no longer a P1, now that we've reduced the retained data to 4 months.,

Priority: P1 → P2

Ed Morley [:emorley]

Assignee

Updated

•

9 years ago

Blocks: 1120019

Mauro Doglio [:mdoglio]

Updated

•

9 years ago

No longer blocks: 1120019

Ed Morley [:emorley]

Assignee

Updated

•

9 years ago

Blocks: 1120019

Ed Morley [:emorley]

Assignee

Updated

•

9 years ago

Depends on: 1125569

Ed Morley [:emorley]

Assignee

Comment 19

•

9 years ago

Sheeri, does this look like it will be at least roughly accurate, in terms of calculating table sizes and row counts? (Or is an ANALYZE required to even get the row counts in the correct ballpark?)

SELECT table_schema AS db, table_name, ROUND(SUM((data_length+index_length)/power(1024,2)),1) size_mb, table_rows
FROM information_schema.tables
GROUP BY table_schema, table_name
ORDER BY size_mb DESC;

Flags: needinfo?(scabral)

Sheeri Cabral [:sheeri]

Updated

•

9 years ago

Flags: needinfo?(scabral)

Whiteboard: [data:consultative]

Ed Morley [:emorley]

Assignee

Comment 20

•

9 years ago

Initial thoughts (see below for data):

* There are some strange discrepancies between the figures on treeherder1 and treeherder2 - different data_free amounts would make sense, since the fragmentation may be different on each host - however I thought the rest should be roughly the same?
* We can save ~80GB just by running OPTIMIZE on the job_artifact table of inbound/try/fx-team/central/b2g-inbound.
** That said, it's strange that job_artifact has such a high proportion of free space compared to other tables, given the data expiration should affect them all equally. I can only think this is because in the past we've had much larger artifacts than we have at the moment? (eg the thousand line Talos error summaries; bug 1078409).
* performance_artifact accounts for 40% of total usage, even though it is only populated on trees that run Talos, and aiui hasn't even been ingested for the full 4 months of our data lifecycle yet.
** There must be some wins in terms of optimising what we store.
** Reducing lifecycle for just the perf artifacts table will get us a big win without loosing history for everything else.
* We could prune objectstores sooner than 4 months (currently 25GB, not a huge disk win, but would help with objectstore perf of bug 1125410).
* It looks like job_eta isn't being expired properly (it still has size on now-inactive repos where all other tables have been purged), though across all databases it is only 2.5GB anyway.
* We should delete databases for repositories now marked as 'onhold' (bug 1035294), though surprisingly it looks like it won't win back much disk space, since an OPTIMIZE must have been run for them post data expiration (or else the data_free field is inaccurate).
* We should spot-check the oldest rows in several tables to make sure we don't have any orphaned rows left due to issues with expiration.

...can anyone see anything else?

I'm happy to drive this bug for the moment - and will file bugs for the above :-)

===

Largest 20 databases:

> SELECT ROUND(SUM((data_length+index_length)/power(1024,3)),1) size_gb, table_schema AS db
FROM information_schema.tables GROUP BY table_schema
ORDER BY size_gb DESC LIMIT 20

treeherder1:

+ ------------ + ------- +
| size_gb      | db      |
+ ------------ + ------- +
| 136.4        | mozilla_inbound_jobs_1 |
| 61.3         | try_jobs_1 |
| 38.3         | fx_team_jobs_1 |
| 23.0         | mozilla_central_jobs_1 |
| 18.5         | b2g_inbound_jobs_1 |
| 7.8          | mozilla_inbound_objectstore_1 |
| 7.5          | try_objectstore_1 |
| 7.4          | mozilla_beta_jobs_1 |
| 6.8          | mozilla_aurora_jobs_1 |
| 3.6          | mozilla_esr31_jobs_1 |
| 3.3          | gaia_try_jobs_1 |
| 3.2          | cedar_jobs_1 |
| 2.9          | fx_team_objectstore_1 |
| 2.8          | maple_jobs_1 |
| 2.5          | ash_jobs_1 |
| 2.0          | b2g_inbound_objectstore_1 |
| 2.0          | gum_jobs_1 |
| 1.6          | mozilla_central_objectstore_1 |
| 1.3          | mozilla_b2g34_v2_1_jobs_1 |
| 1.3          | mozilla_release_jobs_1 |
+ ------------ + ------- +

treeherder2:

+ ------------ + ------- +
| size_gb      | db      |
+ ------------ + ------- +
| 166.4        | mozilla_inbound_jobs_1 |
| 63.0         | fx_team_jobs_1 |
| 61.3         | try_jobs_1 |
| 33.4         | mozilla_central_jobs_1 |
| 28.7         | b2g_inbound_jobs_1 |
| 13.0         | mozilla_aurora_jobs_1 |
| 7.8          | mozilla_inbound_objectstore_1 |
| 7.5          | try_objectstore_1 |
| 7.4          | mozilla_beta_jobs_1 |
| 3.6          | mozilla_esr31_jobs_1 |
| 3.5          | gaia_try_jobs_1 |
| 3.2          | cedar_jobs_1 |
| 2.9          | fx_team_objectstore_1 |
| 2.8          | maple_jobs_1 |
| 2.5          | ash_jobs_1 |
| 2.0          | b2g_inbound_objectstore_1 |
| 2.0          | gum_jobs_1 |
| 1.6          | mozilla_central_objectstore_1 |
| 1.3          | mozilla_b2g34_v2_1_jobs_1 |
| 1.3          | mozilla_release_jobs_1 |
+ ------------ + ------- +


Top table sizes, summed across all databases:

> SELECT ROUND(SUM((data_length+index_length)/power(1024,3)),1) size_gb, table_name
FROM information_schema.tables GROUP BY table_name
ORDER BY size_gb DESC LIMIT 10;

treeherder1:

+ ------------ + --------------- +
| size_gb      | table_name      |
+ ------------ + --------------- +
| 239.2        | job_artifact    |
| 58.0         | performance_artifact |
| 25.5         | objectstore     |
| 10.2         | job             |
| 5.3          | performance_series |
| 2.9          | job_log_url     |
| 2.3          | job_eta         |
| 0.6          | revision        |
| 0.3          | series_signature |
| 0.1          | revision_map    |
+ ------------ + --------------- +

treeherder2:

+ ------------ + --------------- +
| size_gb      | table_name      |
+ ------------ + --------------- +
| 198.0        | job_artifact    |
| 181.7        | performance_artifact |
| 25.4         | objectstore     |
| 9.9          | job             |
| 5.3          | performance_series |
| 2.8          | job_log_url     |
| 2.5          | job_eta         |
| 0.4          | revision        |
| 0.2          | series_signature |
| 0.1          | revision_map    |
+ ------------ + --------------- +


Top 10 candidates for OPTIMIZE:

> SELECT ROUND(data_free/power(1024,3),1) data_free_gb,
ROUND((data_length+index_length)/power(1024,3),1) size_gb, table_schema, table_name
FROM information_schema.tables
ORDER BY data_free DESC LIMIT 10

treeherder1:

+ ----------------- + ------------ + ----------------- + --------------- +
| data_free_gb      | size_gb      | table_schema      | table_name      |
+ ----------------- + ------------ + ----------------- + --------------- +
| 53.4              | 106.8        | mozilla_inbound_jobs_1 | job_artifact    |
| 22.3              | 54.8         | try_jobs_1        | job_artifact    |
| 6.2               | 29.0         | fx_team_jobs_1    | job_artifact    |
| 2.5               | 13.4         | mozilla_central_jobs_1 | job_artifact    |
| 2.3               | 13.9         | b2g_inbound_jobs_1 | job_artifact    |
| 0.6               | 2.2          | mozilla_inbound_jobs_1 | performance_series |
| 0.4               | 2.5          | cedar_jobs_1      | job_artifact    |
| 0.4               | 1.8          | ash_jobs_1        | job_artifact    |
| 0.4               | 0.4          | elm_jobs_1        | job_artifact    |
| 0.4               | 1.6          | maple_jobs_1      | job_artifact    |
+ ----------------- + ------------ + ----------------- + --------------- +

treeherder2:

+ ----------------- + ------------ + ----------------- + --------------- +
| data_free_gb      | size_gb      | table_schema      | table_name      |
+ ----------------- + ------------ + ----------------- + --------------- +
| 52.9              | 76.9         | mozilla_inbound_jobs_1 | job_artifact    |
| 21.6              | 42.6         | try_jobs_1        | job_artifact    |
| 6.2               | 29.1         | fx_team_jobs_1    | job_artifact    |
| 2.5               | 13.3         | mozilla_central_jobs_1 | job_artifact    |
| 2.2               | 13.9         | b2g_inbound_jobs_1 | job_artifact    |
| 0.4               | 2.4          | cedar_jobs_1      | job_artifact    |
| 0.4               | 1.8          | ash_jobs_1        | job_artifact    |
| 0.4               | 1.6          | maple_jobs_1      | job_artifact    |
| 0.3               | 0.4          | jamun_jobs_1      | job_artifact    |
| 0.3               | 2.1          | mozilla_inbound_jobs_1 | performance_series |
+ ----------------- + ------------ + ----------------- + --------------- +

I've also exported the more detailed |GROUP BY table_schema, table_name| breakdown to Google docs (public link):
https://docs.google.com/spreadsheets/d/1gOIjXkGgp4oIxulTXVoinkE8khI5s_7h1QybK6uhEvQ/edit?usp=sharing

Assignee: mdoglio → emorley

Status: NEW → ASSIGNED

Ed Morley [:emorley]

Assignee

Comment 21

•

9 years ago

(In reply to Ed Morley [:edmorley] from comment #20)
> * There are some strange discrepancies between the figures on treeherder1
> and treeherder2 - different data_free amounts would make sense, since the
> fragmentation may be different on each host - however I thought the rest
> should be roughly the same?

Sheeri, do you have any ideas about this?

Flags: needinfo?(scabral)

Ed Morley [:emorley]

Assignee

Updated

•

9 years ago

No longer blocks: 1120019

Sheeri Cabral [:sheeri]

Comment 22

•

9 years ago

Ed, 

data_free is not the amount of fragmentation in a table.

I defragmented tables on 1/19, as per https://bugzilla.mozilla.org/show_bug.cgi?id=1120019#c10

Here's what changed:
Specifically:
fx_team_jobs1.job_artifact shrank from 46G to 30G
b2g_inbound_jobs_1.job_artifact shrank from 22G to 14G
mozilla_central_jobs_1.job_artifact shrank from 22G to 14G
mozilla_inbound_objectstore_1.objectstore shrank from 11G to 7.6G

These tables also shrank, but were smaller tables and could be run in real-time:
cedar_jobs_1.job_artifact
mozilla_central_objectstore_1.objectstore
fx_team_jobs_1.job
try_jobs_1.job_log_url
mozilla_inbound_jobs_1.job_log_url
mozilla_aurora_objectstore_1.objectstore
b2g_inbound_objectstore_1.objectstore
fx_team_objectstore_1.objectstore
fx_team_jobs_1.performance_series
ash_jobs_1.job_artifact
b2g_inbound_jobs_1.job
fx_team_jobs_1.job
gaia_try_jobs_1.job_artifact
jamun_jobs_1.job_artifact
maple_jobs_1.job_artifact
mozilla_aurora_jobs_1.job_artifact
mozilla_beta_jobs_1.job_artifact
mozilla_inbound_jobs_1.job
mozilla_inbound_jobs_1.performance_series
try_jobs_1.job

Flags: needinfo?(scabral)

Sheeri Cabral [:sheeri]

Comment 23

•

9 years ago

InnoDB uses approximate statistics, so even if you use ANALYZE you won't get exact numbers. To see about fragmentation, add up data_length and index_length and compare it to the size on disk, again keeping in mind that InnoDB stats are approximate, so give it a 10% margin of error.

Ed Morley [:emorley]

Assignee

Comment 24

•

9 years ago

(In reply to Sheeri Cabral [:sheeri] from comment #23)
> InnoDB uses approximate statistics, so even if you use ANALYZE you won't get
> exact numbers. To see about fragmentation, add up data_length and
> index_length and compare it to the size on disk, again keeping in mind that
> InnoDB stats are approximate, so give it a 10% margin of error.

Meant to say thank you for this and comment 22 :-)

Ed Morley [:emorley]

Assignee

Comment 25

•

9 years ago

Breakdown of stage's mozilla-inbound job_artifact table, by artefact type:

> SELECT name as `Artefact name`, count(*) as `Count`, ROUND(SUM(LENGTH(`blob`))/power(1024,2),1) AS `Size in MB` FROM mozilla_inbound_jobs_1.job_artifact WHERE id > 9700000 GROUP BY name ORDER BY `Size in MB` DESC

+ ------------------ + ---------- + --------------- +
| Artefact name      | Count      | Size in MB      |
+ ------------------ + ---------- + --------------- +
| Structured Log     | 15007      | 75.1            |
| buildapi_complete  | 12965      | 14.6            |
| Job Info           | 14832      | 10.3            |
| Bug suggestions    | 15008      | 8.5             |
| buildapi_running   | 13033      | 4.7             |
| buildapi           | 24730      | 2.4             |
| buildapi_pending   | 15735      | 1.8             |
+ ------------------ + ---------- + --------------- +
7 rows

...strange how many more buildapi artefacts there are compared to anything else? I get that not all jobs will have {Structured Log, buildapi_complete, Bug suggestions} since they may have been coalesced, but they should still have a comparable number of {Job Info, buildapi_pending} to buildapi, right? (buildapi being where we store the buildername).

Ed Morley [:emorley]

Assignee

Comment 26

•

9 years ago

(In reply to Ed Morley [:edmorley] from comment #25)
> Breakdown of stage's mozilla-inbound job_artifact table, by artefact type:

For the last ~100,000 rows in the jobs_artifact table that is, the full query would have taken too long.

Ed Morley [:emorley]

Assignee

Updated

•

9 years ago

Status: ASSIGNED → RESOLVED

Closed: 9 years ago

Resolution: --- → FIXED

prod_db_size_20141119.csv 10 years ago Mauro Doglio [:mdoglio] 4.80 KB, text/csv		Details
prod_try_tables_size_20141119.csv 10 years ago Mauro Doglio [:mdoglio] 426 bytes, text/csv		Details
job_artifact_type_size_20141119.csv 10 years ago Mauro Doglio [:mdoglio] 213 bytes, text/csv		Details