Closed Bug 1120019 Opened 10 years ago Closed 10 years ago

Diskspace low on Treeherder DB nodes

Categories

(Tree Management :: Treeherder: Infrastructure, defect, P1)

defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mpressman, Assigned: mpressman)

References

(Depends on 1 open bug)

Details

(Whiteboard: [data:serveropt])

We are seeing a massive spike in Treeherder activity. The database size and logging activity are reaching critical levels on disk. In order to better manage this, can you answer the following questions: Is there a data lifecycle? More precisely, is older data being purged? If not, at what age can data be removed? If so, how can we verify that it is working? Thank you
Flags: needinfo?(mdoglio)
Old data is being (hopefully) purged at 5 months. What is perhaps not happening is automatic maintenance (defrag) and/or rotation of the logs.

(In reply to Matt Pressman [:mpressman] from comment #0)
> We are seeing a massive spike in treeherder activity. The database size and
> logging activity are reaching critical levels on disk.

Could you be more specific? "Massive spike in treeherder activity" reads to me as an increase in I/O, whereas the next sentence implies size on disk. (eg: Yesterday we deployed the "only keep 5 months, instead of 6 months of data" patch, which presumably would have caused a lot of deletes and thus I/O. Whether that resulted in reduced disk usage remains to be seen; perhaps we need a defrag.)
We can further reduce our data lifecycle to 4 months if needed. :mpressman, do you think it would be a good idea to change the logging policy to only log slow queries instead of (I guess) everything?

If you want to verify the correctness of the data cycling routine, you can look at:
- the oldest push_timestamp value of the result_set table in a *_jobs_1 database
- the oldest loaded_timestamp value of the objectstore table in a *_objectstore_1 database

They shouldn't be more than 5 months old.
Flags: needinfo?(mdoglio)
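(For reference, a spot-check along the lines mdoglio describes might look like the following; this is a sketch assuming direct MySQL access and that both columns are stored as unix epoch seconds, using database names that appear elsewhere in this bug as examples:)

    -- Oldest result set in an example jobs database:
    SELECT FROM_UNIXTIME(MIN(push_timestamp)) AS oldest_push
      FROM mozilla_inbound_jobs_1.result_set;

    -- Oldest row in the corresponding objectstore database:
    SELECT FROM_UNIXTIME(MIN(loaded_timestamp)) AS oldest_loaded
      FROM mozilla_inbound_objectstore_1.objectstore;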
Please do reduce the data lifecycle. I spot-checked on the biggest dbs, and those tables indeed have data within the last 5 months only. However, the largest tables in the *_jobs_1 databases are job_artifact and performance_artifact (the only table in the *_objectstore_1 databases is objectstore) - do those tables ever get pruned? This is an important issue, as we do not want to run out of disk space - that can corrupt all the data in the tables. The database is over 520G at this point (over half that is stored in the mozilla_inbound databases).
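(A sketch of how a per-table size check like this spot-check can be pulled from information_schema; illustrative only, not necessarily the exact query used. The data_free column also hints at how much space a defrag could reclaim:)

    SELECT table_schema, table_name,
           ROUND((data_length + index_length) / 1024 / 1024 / 1024, 1) AS size_gb,
           ROUND(data_free / 1024 / 1024 / 1024, 1) AS reclaimable_gb
      FROM information_schema.tables
     WHERE table_schema LIKE '%\_jobs\_1'
        OR table_schema LIKE '%\_objectstore\_1'
     ORDER BY data_length + index_length DESC
     LIMIT 20;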
Will, do we know how old the data is in the performance_artifact table?
Flags: needinfo?(wlachance)
(In reply to Jonathan Griffin (:jgriffin) from comment #4)
> Will, do we know how old the data is in the performance_artifact table?

They are just JSON blobs. Running this query, it looks like the first one is from mid-October:

https://treeherder.mozilla.org/api/project/mozilla-central/performance_artifact/?id__gte=0&id__lte=10&count=10

We aren't actually using these blobs for anything after we've updated the series tables inside Treeherder. Kyle is going to use them for dzAlerts until we put those in Treeherder itself, so we probably want to keep a month or so's worth of them. If there's an urgent issue, we could probably safely delete at least half the table without ill effect.
Flags: needinfo?(wlachance)
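(If the older blobs were trimmed by hand rather than through application code, a batched delete along these lines would avoid one enormous transaction; a sketch only, with a purely hypothetical cutoff id that would have to be chosen from the data:)

    -- Hypothetical cutoff id; repeat the DELETE until it affects zero rows.
    SET @cutoff_id = 1000000;
    DELETE FROM mozilla_central_jobs_1.performance_artifact
     WHERE id < @cutoff_id
     LIMIT 10000;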
I reduced the data lifecycle here: https://github.com/mozilla/treeherder-service/commit/6468df0247fe21b0b59c1f6e1ff867aa545b2b0c It will go to production with the next push
The 4 months data lifecycle is now on prod
Thanks for the quick work, folks! Removing all but the last month from performance_artifact would be super-helpful, as well.
Yesterday I defragmented all the tables on treeherder2 (the slave) while it was out of the load balancer. Today I failed over treeherder so I could work on treeherder1, although I only focused on the tables that actually had fragmentation. Specifically:

fx_team_jobs_1.job_artifact shrank from 46G to 30G
b2g_inbound_jobs_1.job_artifact shrank from 22G to 14G
mozilla_central_jobs_1.job_artifact shrank from 22G to 14G
mozilla_inbound_objectstore_1.objectstore shrank from 11G to 7.6G

These tables also shrank, but were smaller tables and could be run in real time:

cedar_jobs_1.job_artifact
mozilla_central_objectstore_1.objectstore
fx_team_jobs_1.job
try_jobs_1.job_log_url
mozilla_inbound_jobs_1.job_log_url
mozilla_aurora_objectstore_1.objectstore
b2g_inbound_objectstore_1.objectstore
fx_team_objectstore_1.objectstore
fx_team_jobs_1.performance_series
ash_jobs_1.job_artifact
b2g_inbound_jobs_1.job
fx_team_jobs_1.job
gaia_try_jobs_1.job_artifact
jamun_jobs_1.job_artifact
maple_jobs_1.job_artifact
mozilla_aurora_jobs_1.job_artifact
mozilla_beta_jobs_1.job_artifact
mozilla_inbound_jobs_1.job
mozilla_inbound_jobs_1.performance_series
try_jobs_1.job
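(For context, defragmenting an InnoDB table amounts to rebuilding it; a minimal sketch of the kind of statement involved, using one of the tables above as an example. The actual maintenance may have been done with different tooling:)

    -- Rebuilds the table and reclaims space reported as data_free in information_schema.
    OPTIMIZE TABLE fx_team_jobs_1.job_artifact;

    -- Equivalent explicit rebuild for InnoDB tables:
    ALTER TABLE fx_team_jobs_1.job_artifact ENGINE=InnoDB;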
This is *just* enough (by about 3G) to not go into the critical state daily. Please let us know when you have removed all but the prior month from performance_artifact, as that will also be a big savings (once we defragment again).
Any news on the performance_artifact cleanup? We have put a script in place to purge binary logs every 12 hours because we need the space; otherwise we get paged at all hours of the day and night. :(
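(A sketch of the kind of statement such a purge script would run every 12 hours; the actual script isn't shown in this bug:)

    -- Drop binary logs older than 12 hours (any replicas must already be past them).
    PURGE BINARY LOGS BEFORE NOW() - INTERVAL 12 HOUR;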
Depends on: 1124708
Tweaking the summary to be more specific, seeing as we have bug 1078392 and bug 1078523 about other work.

Sheeri, I don't suppose you could help us out - in bug 1078523 we've been trying to figure out (a) total vs free space, and (b) where it's going. However we hit a few problems - mdoglio's rough calculations (back in Nov) said the DB tables were 330 GB, but New Relic said we were running out of space, even though I thought we had 700GB to play with. So:

a) Is total disk definitely 700GB?
b) Is that shared between stage and prod, or just prod?
c) Would it be possible to have a breakdown of DB (and even table) sizes?
d) Other than that, where else is the space going (eg logs, indexes, fragmentation, ...?) and how much there?

I think we also need to check that we're expiring data from all tables correctly - eg: perhaps we're not expiring the performance artefacts at all? It's just that I'm really surprised we're still hitting the limits, given we've now dropped from 6 months to 4 months of data retention and you've defragged the tables. Yes, our usage per day has gone up now that we ingest the performance data, but we've been doing so for a while now. Thanks :-)
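(For (c), a per-database breakdown can be pulled from information_schema; a sketch, assuming direct MySQL access:)

    SELECT table_schema,
           ROUND(SUM(data_length + index_length) / 1024 / 1024 / 1024, 1) AS size_gb,
           ROUND(SUM(data_free) / 1024 / 1024 / 1024, 1) AS reclaimable_gb
      FROM information_schema.tables
     GROUP BY table_schema
     ORDER BY SUM(data_length + index_length) DESC;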
Depends on: 1078392, 1078523
Flags: needinfo?(scabral)
Summary: treeherder database data lifecycle → Diskspace low on Treeherder DB nodes
Also, seeing as Treeherder is going to subsume Datazilla - could we shorten the Datazilla data lifecycle, reduce the disk space allocated there, and either:
a) directly use it for the Treeherder DB nodes (if they're even on the same cluster), or
b) use the goodwill generated to beg for some more disk space in Treeherder?
(In reply to Ed Morley [:edmorley] from comment #14)
> Also, seeing as Treeherder is going to subsume Datazilla

To be clearer: Treeherder's data usage has partly increased due to it now ingesting the same performance data that Datazilla does - since the intent is for Treeherder to replace Datazilla and for the latter to be EOLed.
Depends on: 1124723
No longer depends on: 1078392, 1078523
Summary: Diskspace low on Treeherder DB nodes → treeherder database data lifecycle
(In reply to Ed Morley [:edmorley] from comment #13)
> I think we also need to check that we're expiring data from all tables
> correctly - eg: perhaps we're not expiring the performance artefacts at all?

Ah, bug 1124723. Good spot, Mauro :-)
Depends on: 1078392, 1078523
OS: Mac OS X → All
Priority: -- → P1
Hardware: x86 → All
Summary: treeherder database data lifecycle → Diskspace low on Treeherder DB nodes
[scabral@treeherder1.db.scl3 ~]$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda3        37G  6.3G   29G  19% /
tmpfs           4.9G     0  4.9G   0% /dev/shm
/dev/sda1       969M   59M  860M   7% /boot
/dev/sdb1       689G  565G   90G  87% /data

Disk size is definitely around 700Gb - it's 730G total for the machine. 700Gb for /data is a bit on the high side, but not by much.

One of the problems is that binary logs from MySQL take up a fair bit of room. We keep up to 18 hours at a time, which is approximately 90G. Binary logs are required for replication and backups, and on most clusters we keep 10 days' worth of logs. We've been reducing the amount of binary logs we keep for a few *months* because of the growing amounts of space.

The reason this is an issue now is that there's only so low we can go with the logs - we need to keep enough binary logs around for backups to function properly. For example, we take the backup server offline to back it up, so we need to keep enough logs to compensate for the time it takes to do backups.
Flags: needinfo?(scabral)
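(The binary-log footprint can be confirmed from the server itself; a sketch:)

    SHOW BINARY LOGS;                        -- one row per binlog file, with its size in bytes
    SHOW VARIABLES LIKE 'expire_logs_days';  -- built-in retention is in whole days only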
Whiteboard: [data:serveropt][2014q4]
Whiteboard: [data:serveropt][2014q4] → [data:serveropt]
Depends on: 1125903
As we just realized, these are VMs, so we can increase the disk pretty easily! We just need to find a little operational time to do so.
Removing deps that are children of bug 1078392, since they are already in the dependency tree.
No longer depends on: 1078523, 1124723, 1124708
Depends on: 1131603
Increased disk on treeherder2.db.scl3 and treeherder2.stage.db.scl3. In order to increase treeherder1.db.scl3, we'll need CAB approval since it'll require a failover. As for the treeherder1.stage.db.scl3 master, can you let us know when would be a good time to do that failover?
Any time on stage is fine with me - if you could just comment on the bug or in #treeherder so we know it's happening for when the new relic alerts go off and we start scratching our heads :-)
treeherder2.stage.db.scl3 disk size has been increased
That's great - thank you :-)
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Assignee: nobody → mpressman