Closed
Bug 1120019
Opened 10 years ago
Closed 10 years ago
Diskspace low on Treeherder DB nodes
Categories
(Tree Management :: Treeherder: Infrastructure, defect, P1)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: mpressman, Assigned: mpressman)
References
(Depends on 1 open bug)
Details
(Whiteboard: [data:serveropt])
We are seeing a massive spike in Treeherder activity. The database size and logging activity are reaching critical levels on disk. In order to manage this better, can you give us answers to the following questions:
Is there a data lifecycle? More precisely, is older data being purged? If not, at what age can data be removed? If so, how can we verify that it is working?
Thank you
Updated•10 years ago
Flags: needinfo?(mdoglio)
Comment 1•10 years ago
Old data is being (hopefully) purged at 5 months.
What is perhaps not happening is automatic maintenance (defrag) and/or rotation of the logs.
(In reply to Matt Pressman [:mpressman] from comment #0)
> We are seeing a massive spike in treeherder activity. The database size and
> logging activity are reaching critical levels on disk.
Could you be more specific? "massive spike in treeherder activity" reads to me as an increase in I/O, whereas the next sentence implies size on disk. (e.g. yesterday we deployed the "only keep 5 months of data, instead of 6" patch, which presumably caused a lot of deletes and thus I/O. Whether that resulted in reduced disk usage remains to be seen; perhaps we need a defrag.)
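For what it's worth, how much a defrag might reclaim can be estimated up front from information_schema - a minimal sketch, assuming the *_jobs_1 / *_objectstore_1 database naming used on these nodes (data_free is only an approximation for InnoDB, but large values are a strong hint):
SELECT table_schema,
       table_name,
       ROUND(data_length / POW(1024, 3), 1)  AS data_gb,
       ROUND(index_length / POW(1024, 3), 1) AS index_gb,
       ROUND(data_free / POW(1024, 3), 1)    AS reclaimable_gb
FROM information_schema.tables
WHERE table_schema LIKE '%\_jobs\_1'
   OR table_schema LIKE '%\_objectstore\_1'
ORDER BY data_free DESC
LIMIT 20;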
Comment 2•10 years ago
We can further reduce our data lifecycle to 4 months if needed.
:mpressman, do you think it would be a good idea to change the logging policy to only log slow queries instead of (I guess) everything?
If you want to verify the correctness of the data cycling routine, you can look at (see the query sketch below):
- the oldest push_timestamp value of the result_set table in a *_jobs_1 database
- the oldest loaded_timestamp value of the objectstore table in a *_objectstore_1 database
Neither should be more than 5 months old.
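For example, something along these lines (database names are just examples, and both columns are assumed to hold unix epoch seconds):
-- Oldest result set in one of the jobs databases.
SELECT FROM_UNIXTIME(MIN(push_timestamp)) AS oldest_result_set
FROM mozilla_inbound_jobs_1.result_set;
-- Oldest row in the corresponding objectstore database.
SELECT FROM_UNIXTIME(MIN(loaded_timestamp)) AS oldest_objectstore_row
FROM mozilla_inbound_objectstore_1.objectstore;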
Flags: needinfo?(mdoglio)
Comment 3•10 years ago
Please do reduce the data lifecycle. I spot-checked on the biggest dbs, and those tables indeed have data within the last 5 months only.
However, the largest tables in the *_jobs_1 databases are job_artifact and performance_artifact (the only table in the *_objectstore_1 databases is objectstore) - do those tables ever get pruned?
This is an important issue, as we do not want to run out of disk space - that can corrupt all the data in the tables. The database is over 520G at this point (over half that is stored in the mozilla_inbound databases).
Comment 4•10 years ago
Will, do we know how old the data is in the performance_artifact table?
Flags: needinfo?(wlachance)
Comment 5•10 years ago
(In reply to Jonathan Griffin (:jgriffin) from comment #4)
> Will, do we know how old the data is in the performance_artifact table?
They are just JSON blobs. Running this query, it looks like the first one is from mid-October:
https://treeherder.mozilla.org/api/project/mozilla-central/performance_artifact/?id__gte=0&id__lte=10&count=10
We aren't actually using these blobs for anything after we've updated the series tables inside Treeherder. Kyle is going to use them for dzAlerts until we put those in treeherder itself, so we probably want to keep a month or so's worth of them.
If there's an urgent issue we could probably safely delete at least half the table without ill effect.
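For reference, a one-off prune like that could be done in small batches against each *_jobs_1 database - this is only a sketch with a purely hypothetical id cutoff; the bug doesn't record the exact approach taken:
-- Hypothetical batched prune of old performance_artifact rows; the cutoff id
-- below is a placeholder, not a value taken from this bug.
DELETE FROM performance_artifact
WHERE id < 1000000
ORDER BY id
LIMIT 10000;
-- Repeat until zero rows are affected. Note that InnoDB won't hand the freed
-- space back to the filesystem until the table is rebuilt (e.g. OPTIMIZE TABLE).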
Flags: needinfo?(wlachance)
Comment 6•10 years ago
Commit pushed to master at https://github.com/mozilla/treeherder-service
https://github.com/mozilla/treeherder-service/commit/6468df0247fe21b0b59c1f6e1ff867aa545b2b0c
Bug 1120019 - set the data lifecycle to 4 months
Comment 7•10 years ago
I reduced the data lifecycle here:
https://github.com/mozilla/treeherder-service/commit/6468df0247fe21b0b59c1f6e1ff867aa545b2b0c
It will go to production with the next push
Comment 8•10 years ago
The 4-month data lifecycle is now on prod.
Comment 9•10 years ago
Thanks for the quick work, folks!
Removing all but the last month from performance_artifact would be super-helpful, as well.
Comment 10•10 years ago
Yesterday I defragmented all the tables on treeherder2 (the slave) while it was out of the load balancer. Today I failed over treeherder so I could work on treeherder1, although there I only focused on the tables that actually had fragmentation (see the sketch at the end of this comment for the kind of statement involved). Specifically:
fx_team_jobs_1.job_artifact shrank from 46G to 30G
b2g_inbound_jobs_1.job_artifact shrank from 22G to 14G
mozilla_central_jobs_1.job_artifact shrank from 22G to 14G
mozilla_inbound_objectstore_1.objectstore shrank from 11G to 7.6G
These tables also shrank, but were smaller tables and could be run in real-time:
cedar_jobs_1.job_artifact
mozilla_central_objectstore_1.objectstore
fx_team_jobs_1.job
try_jobs_1.job_log_url
mozilla_inbound_jobs_1.job_log_url
mozilla_aurora_objectstore_1.objectstore
b2g_inbound_objectstore_1.objectstore
fx_team_objectstore_1.objectstore
fx_team_jobs_1.performance_series
ash_jobs_1.job_artifact
b2g_inbound_jobs_1.job
fx_team_jobs_1.job
gaia_try_jobs_1.job_artifact
jamun_jobs_1.job_artifact
maple_jobs_1.job_artifact
mozilla_aurora_jobs_1.job_artifact
mozilla_beta_jobs_1.job_artifact
mozilla_inbound_jobs_1.job
mozilla_inbound_jobs_1.performance_series
try_jobs_1.job
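For the record, the rebuilds above were presumably done with statements along these lines - a sketch using one table from the list, not necessarily the exact commands run:
-- Rebuilds the table and its indexes; with file-per-table InnoDB this returns
-- the reclaimable (data_free) space to the filesystem. Slow on a 46G table,
-- hence running it on the out-of-rotation host.
OPTIMIZE TABLE fx_team_jobs_1.job_artifact;
-- Equivalent for InnoDB, which maps OPTIMIZE TABLE to a rebuild anyway:
ALTER TABLE fx_team_jobs_1.job_artifact ENGINE=InnoDB;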
Comment 11•10 years ago
This is *just* enough (by about 3G) to not go into the critical state daily. Please let us know when you have removed all but the prior month from performance_artifact, as that will also be a big savings (once we defragment again).
Comment 12•10 years ago
Any news on the performance_artifact cleanup? We have put a script in place to purge binary logs every 12 hours because we need the space; otherwise we get paged at all hours of the day and night. :(
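For context, the purge script amounts to the equivalent of the statement below (a sketch; the actual script isn't attached to this bug):
-- Drop binary logs older than 12 hours. Only safe if all slaves and the
-- backup process have already read past the logs being removed.
PURGE BINARY LOGS BEFORE NOW() - INTERVAL 12 HOUR;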
Comment 13•10 years ago
Tweaking summary to be more specific, seeing as we have bug 1078392 and bug 1078523 about other work.
Sheeri, I don't suppose you could help us out - in bug 1078523 we've been trying to figure out (a) total vs free space, and (b) where it's going.
However we hit a few problems - mdoglio's rough calculations (back in Nov) said the DB tables were 330 GB, but New Relic said we were running out of space, even though I thought we had 700GB to play with.
So:
a) Is total disk definitely 700GB?
b) Is that shared between stage and prod, or just prod?
c) Would it be possible to have a breakdown of DB (and even table) sizes?
d) Other than that, where else is the space going (eg logs, indexes, fragmentation, ...?) and how much there?
I think we also need to check that we're expiring data from all tables correctly - eg: perhaps we're not expiring the performance artefacts at all?
It's just that I'm really surprised we're still hitting the limits, given we've now dropped from 6 months to 4 months of data retention and you've defragged the tables. Yes, our usage per day has gone up now that we ingest the performance data, but we've been doing so for a while now.
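For (c), I'm guessing something like the following against information_schema would give the breakdown - a minimal sketch:
-- Total on-disk size (data + indexes) per database, largest first.
SELECT table_schema,
       ROUND(SUM(data_length + index_length) / POW(1024, 3), 1) AS size_gb
FROM information_schema.tables
GROUP BY table_schema
ORDER BY size_gb DESC;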
Thanks :-)
Comment 14•10 years ago
Also, seeing as Treeherder is going to subsume Datazilla - could we shorten the Datazilla data lifecycle, reduce the disk space allocated there and either:
a) Directly use it for the Treeherder DB nodes (if they're even on the same cluster)
b) Use the goodwill generated to beg for some more disk space in Treeherder?
Comment 15•10 years ago
(In reply to Ed Morley [:edmorley] from comment #14)
> Also, seeing as Treeherder is going to subsume Datazilla
To be clearer: Treeherder's data usage has partly increased due to it now ingesting the same performance data that Datazilla does - since the intent is for Treeherder to replace Datazilla and for the latter to be EOLed.
Comment 16•10 years ago
(In reply to Ed Morley [:edmorley] from comment #13)
> I think we also need to check that we're expiring data from all tables
> correctly - eg: perhaps we're not expiring the performance artefacts at all?
Ah, bug 1124723. Good spot Mauro :-)
Comment 17•10 years ago
[scabral@treeherder1.db.scl3 ~]$ df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sda3 37G 6.3G 29G 19% /
tmpfs 4.9G 0 4.9G 0% /dev/shm
/dev/sda1 969M 59M 860M 7% /boot
/dev/sdb1 689G 565G 90G 87% /data
Disk size is definitely around 700GB - it's 730G total for the machine, so 700GB for /data is a bit on the high side, but not by much.
One of the problems is that binary logs from MySQL take up a fair bit of room. We keep up to 18 hours' worth at a time, which is approximately 90G. Binary logs are required for replication and backups, and on most clusters we keep 10 days' worth.
We've been reducing the amount of binary logs we keep for a few *months* now because of the growing space usage. The reason this is an issue now is that there's only so low we can go: we need to keep enough binary logs for backups to function properly - we take the backup server offline to back it up, so the logs kept have to cover the time a backup takes.
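For reference, the current binary log usage and retention can be inspected like this (a sketch; expire_logs_days is assumed to be the relevant variable on these MySQL versions):
-- List the current binary logs and their sizes (in bytes).
SHOW BINARY LOGS;
-- Time-based retention; 0 means logs are only removed by explicit
-- PURGE BINARY LOGS statements (such as the 12-hourly script mentioned above).
SHOW VARIABLES LIKE 'expire_logs_days';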
Flags: needinfo?(scabral)
Updated•10 years ago
Whiteboard: [data:serveropt][2014q4]
Updated•10 years ago
Whiteboard: [data:serveropt][2014q4] → [data:serveropt]
Comment 18•10 years ago
As we just realized, these are VMs, so we can increase the disk pretty easily! We just need to find a little operational time to do so.
Comment 19•10 years ago
Removing deps that are children of bug 1078392, since they are already in the dependency tree.
Assignee
Comment 20•10 years ago
Increased disk on treeherder2.db.scl3 and treeherder2.stage.db.scl3. In order to increase treeherder1.db.scl3, we'll need CAB approval since it'll require a failover. As for the treeherder1.stage.db.scl3 master, can you let us know when would be a good time to do that failover?
Comment 21•10 years ago
Any time on stage is fine with me - if you could just comment on the bug or in #treeherder so we know it's happening, for when the New Relic alerts go off and we start scratching our heads :-)
Assignee
Comment 22•10 years ago
treeherder2.stage.db.scl3 disk size has been increased
Comment 23•10 years ago
That's great - thank you :-)
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Updated•10 years ago
Assignee: nobody → mpressman