Closed Bug 1120019 Opened 10 years ago Closed 10 years ago

Diskspace low on Treeherder DB nodes

Categories

(Tree Management :: Treeherder: Infrastructure, defect, P1)

defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mpressman, Assigned: mpressman)

References

(Depends on 1 open bug)

Details

(Whiteboard: [data:serveropt])

We are seeing a massive spike in Treeherder activity. The database size and logging activity are reaching critical levels on disk. In order to better manage this, can you answer the following questions: Is there a data lifecycle? More precisely, is older data being purged? If not, at what age can data be removed? If so, how can we verify that it is working? Thank you
Flags: needinfo?(mdoglio)
Old data is being (hopefully) purged at 5 months. What is perhaps not happening is automatic maintenance (defrag) and/or rotation of the logs.

(In reply to Matt Pressman [:mpressman] from comment #0)
> We are seeing a massive spike in treeherder activity. The database size and
> logging activity are reaching critical levels on disk.

Could you be more specific? "Massive spike in treeherder activity" reads to me as an increase in I/O, whereas the next sentence implies size on disk. (eg: Yesterday we deployed the "only keep 5 months, instead of 6 months of data" patch, which presumably would have caused a lot of deletes and thus I/O. Whether that resulted in reduced disk usage remains to be seen; perhaps we need a defrag.)
We can further reduce our data lifecycle to 4 months if needed. :mpressman, do you think it would be a good idea to change the logging policy to only log slow queries instead of (I guess) everything?

If you want to verify the correctness of the data cycling routine, you can look at:
- the oldest push_timestamp value of the result_set table in a *_jobs_1 database
- the oldest loaded_timestamp value of the objectstore table in a *_objectstore_1 database

They shouldn't be more than 5 months old.
Flags: needinfo?(mdoglio)
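(For reference, a spot-check along the lines mdoglio describes might look like the following; this is a sketch assuming direct MySQL access and that both columns are stored as unix epoch seconds, using database names that appear elsewhere in this bug as examples:)

    -- Oldest result set in an example jobs database:
    SELECT FROM_UNIXTIME(MIN(push_timestamp)) AS oldest_push
      FROM mozilla_inbound_jobs_1.result_set;

    -- Oldest row in the corresponding objectstore database:
    SELECT FROM_UNIXTIME(MIN(loaded_timestamp)) AS oldest_loaded
      FROM mozilla_inbound_objectstore_1.objectstore;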
Please do reduce the data lifecycle. I spot-checked on the biggest dbs, and those tables indeed have data within the last 5 months only. However, the largest tables in the *_jobs_1 databases are job_artifact and performance_artifact (the only table in the *_objectstore_1 databases is objectstore) - do those tables ever get pruned? This is an important issue, as we do not want to run out of disk space - that can corrupt all the data in the tables. The database is over 520G at this point (over half that is stored in the mozilla_inbound databases).
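(A sketch of how a per-table size check like this spot-check can be pulled from information_schema; illustrative only, not necessarily the exact query used. The data_free column also hints at how much space a defrag could reclaim:)

    SELECT table_schema, table_name,
           ROUND((data_length + index_length) / 1024 / 1024 / 1024, 1) AS size_gb,
           ROUND(data_free / 1024 / 1024 / 1024, 1) AS reclaimable_gb
      FROM information_schema.tables
     WHERE table_schema LIKE '%\_jobs\_1'
        OR table_schema LIKE '%\_objectstore\_1'
     ORDER BY data_length + index_length DESC
     LIMIT 20;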
Will, do we know how old the data is in the performance_artifact table?
Flags: needinfo?(wlachance)
(In reply to Jonathan Griffin (:jgriffin) from comment #4)
> Will, do we know how old the data is in the performance_artifact table?

They are just JSON blobs. Running this query, it looks like the first one is from mid-October:

https://treeherder.mozilla.org/api/project/mozilla-central/performance_artifact/?id__gte=0&id__lte=10&count=10

We aren't actually using these blobs for anything after we've updated the series tables inside Treeherder. Kyle is going to use them for dzAlerts until we put those in Treeherder itself, so we probably want to keep a month or so's worth of them. If there's an urgent issue, we could probably safely delete at least half the table without ill effect.
Flags: needinfo?(wlachance)
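(If the older blobs were trimmed by hand rather than through application code, a batched delete along these lines would avoid one enormous transaction; a sketch only, with a purely hypothetical cutoff id that would have to be chosen from the data:)

    -- Hypothetical cutoff id; repeat the DELETE until it affects zero rows.
    SET @cutoff_id = 1000000;
    DELETE FROM mozilla_central_jobs_1.performance_artifact
     WHERE id < @cutoff_id
     LIMIT 10000;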
I reduced the data lifecycle here: https://github.com/mozilla/treeherder-service/commit/6468df0247fe21b0b59c1f6e1ff867aa545b2b0c It will go to production with the next push
The 4 months data lifecycle is now on prod
Thanks for the quick work, folks! Removing all but the last month from performance_artifact would be super-helpful, as well.
Yesterday I defragmented all the tables on treeherder2 (the slave) while it was out of the load balancer. Today I failed over treeherder so I could work on treeherder1, although I only focused on the tables that actually had fragmentation. Specifically:

fx_team_jobs_1.job_artifact shrank from 46G to 30G
b2g_inbound_jobs_1.job_artifact shrank from 22G to 14G
mozilla_central_jobs_1.job_artifact shrank from 22G to 14G
mozilla_inbound_objectstore_1.objectstore shrank from 11G to 7.6G

These tables also shrank, but were smaller tables and could be run in real time:

cedar_jobs_1.job_artifact
mozilla_central_objectstore_1.objectstore
fx_team_jobs_1.job
try_jobs_1.job_log_url
mozilla_inbound_jobs_1.job_log_url
mozilla_aurora_objectstore_1.objectstore
b2g_inbound_objectstore_1.objectstore
fx_team_objectstore_1.objectstore
fx_team_jobs_1.performance_series
ash_jobs_1.job_artifact
b2g_inbound_jobs_1.job
fx_team_jobs_1.job
gaia_try_jobs_1.job_artifact
jamun_jobs_1.job_artifact
maple_jobs_1.job_artifact
mozilla_aurora_jobs_1.job_artifact
mozilla_beta_jobs_1.job_artifact
mozilla_inbound_jobs_1.job
mozilla_inbound_jobs_1.performance_series
try_jobs_1.job
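(For context, defragmenting an InnoDB table amounts to rebuilding it; a minimal sketch of the kind of statement involved, using one of the tables above as an example. The actual maintenance may have been done with different tooling:)

    -- Rebuilds the table and reclaims space reported as data_free in information_schema.
    OPTIMIZE TABLE fx_team_jobs_1.job_artifact;

    -- Equivalent explicit rebuild for InnoDB tables:
    ALTER TABLE fx_team_jobs_1.job_artifact ENGINE=InnoDB;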
This is *just* enough (by about 3G) to not go into the critical state daily. Please let us know when you have removed all but the prior month from performance_artifact, as that will also be a big savings (once we defragment again).
Any news on the performance_artifact cleanup? We have put a script in place to purge binary logs every 12 hours because we need the space; otherwise we get paged at all hours of the day and night. :(
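(A sketch of the kind of statement such a purge script would run every 12 hours; the actual script isn't shown in this bug:)

    -- Drop binary logs older than 12 hours (any replicas must already be past them).
    PURGE BINARY LOGS BEFORE NOW() - INTERVAL 12 HOUR;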
Depends on: 1124708
Tweaking the summary to be more specific, seeing as we have bug 1078392 and bug 1078523 about other work.

Sheeri, I don't suppose you could help us out - in bug 1078523 we've been trying to figure out (a) total vs free space, and (b) where it's going. However we hit a few problems - mdoglio's rough calculations (back in Nov) said the DB tables were 330 GB, but New Relic said we were running out of space, even though I thought we had 700GB to play with. So:

a) Is total disk definitely 700GB?
b) Is that shared between stage and prod, or just prod?
c) Would it be possible to have a breakdown of DB (and even table) sizes?
d) Other than that, where else is the space going (eg logs, indexes, fragmentation, ...?) and how much there?

I think we also need to check that we're expiring data from all tables correctly - eg: perhaps we're not expiring the performance artefacts at all? It's just that I'm really surprised we're still hitting the limits, given we've now dropped from 6 months to 4 months of data retention and you've defragged the tables. Yes, our usage per day has gone up now that we ingest the performance data, but we've been doing so for a while now. Thanks :-)
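(For (c), a per-database breakdown can be pulled from information_schema; a sketch, assuming direct MySQL access:)

    SELECT table_schema,
           ROUND(SUM(data_length + index_length) / 1024 / 1024 / 1024, 1) AS size_gb,
           ROUND(SUM(data_free) / 1024 / 1024 / 1024, 1) AS reclaimable_gb
      FROM information_schema.tables
     GROUP BY table_schema
     ORDER BY SUM(data_length + index_length) DESC;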
Depends on: 1078392, 1078523
Flags: needinfo?(scabral)
Summary: treeherder database data lifecycle → Diskspace low on Treeherder DB nodes
Also, seeing as Treeherder is going to subsume Datazilla - could we shorten the Datazilla data lifecycle, reduce the disk space allocated there, and either:
a) directly use it for the Treeherder DB nodes (if they're even on the same cluster), or
b) use the goodwill generated to beg for some more disk space in Treeherder?
(In reply to Ed Morley [:edmorley] from comment #14)
> Also, seeing as Treeherder is going to subsume Datazilla

To be clearer: Treeherder's data usage has partly increased due to it now ingesting the same performance data that Datazilla does - since the intent is for Treeherder to replace Datazilla and for the latter to be EOLed.
Depends on: 1124723
No longer depends on: 1078392, 1078523
Summary: Diskspace low on Treeherder DB nodes → treeherder database data lifecycle
(In reply to Ed Morley [:edmorley] from comment #13)
> I think we also need to check that we're expiring data from all tables
> correctly - eg: perhaps we're not expiring the performance artefacts at all?

Ah, bug 1124723. Good spot, Mauro :-)
Depends on: 1078392, 1078523
OS: Mac OS X → All
Priority: -- → P1
Hardware: x86 → All
Summary: treeherder database data lifecycle → Diskspace low on Treeherder DB nodes
[scabral@treeherder1.db.scl3 ~]$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda3        37G  6.3G   29G  19% /
tmpfs           4.9G     0  4.9G   0% /dev/shm
/dev/sda1       969M   59M  860M   7% /boot
/dev/sdb1       689G  565G   90G  87% /data

Disk size is definitely around 700Gb - it's 730G total for the machine. 700Gb for /data is a bit on the high side, but not by much.

One of the problems is that binary logs from MySQL take up a fair bit of room. We keep up to 18 hours at a time, which is approximately 90G. Binary logs are required for replication and backups, and on most clusters we keep 10 days' worth of logs. We've been reducing the amount of binary logs we keep for a few *months* because of the growing amounts of space.

The reason this is an issue now is that there's only so low we can go with the logs - we need to keep enough binary logs around for backups to function properly. For example, we take the backup server offline to back it up, so we need to keep enough logs to compensate for the time it takes to do backups.
Flags: needinfo?(scabral)
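(The binary-log footprint can be confirmed from the server itself; a sketch:)

    SHOW BINARY LOGS;                        -- one row per binlog file, with its size in bytes
    SHOW VARIABLES LIKE 'expire_logs_days';  -- built-in retention is in whole days only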
Whiteboard: [data:serveropt][2014q4]
Whiteboard: [data:serveropt][2014q4] → [data:serveropt]
Depends on: 1125903
As we just realized, these are VMs, so we can increase the disk pretty easily! We just need to find a little operational time to do so.
Removing deps that are children of bug 1078392, since they are already in the dependency tree.
No longer depends on: 1078523, 1124723, 1124708
Depends on: 1131603
Increased disk on treeherder2.db.scl3 and treeherder2.stage.db.scl3. In order to increase treeherder1.db.scl3, we'll need CAB approval since it'll require a failover. As for the treeherder1.stage.db.scl3 master, can you let us know when would be a good time to do that failover?
Any time on stage is fine with me - if you could just comment on the bug or in #treeherder so we know it's happening for when the new relic alerts go off and we start scratching our heads :-)
treeherder2.stage.db.scl3 disk size has been increased
That's great - thank you :-)
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Assignee: nobody → mpressman