Closed
Bug 593523
Opened 15 years ago
Closed 14 years ago
Running out of space on surf
Categories
(mozilla.org Graveyard :: Server Operations, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: jlaz, Assigned: jabba)
References
Details
(Whiteboard: [cleanup][buildduty])
Attachments
(1 file)
9.19 KB, image/png
surf:disk - /mnt/netapp/stage/archive.mozilla.org/pub/firefox is WARNING: DISK WARNING - free space: /mnt/netapp/stage/archive.mozilla.org/pub/firefox 71844 MB (3%):
Is there anything build can do to free up space here?
Comment 1•15 years ago
Unassigning from bhearsum, who will be out through Wednesday.
Assignee: bhearsum → nobody
Comment 2•15 years ago
What can we do to add disk?
We already went through and added a cleanup policy. We can go back and see if there are things we've missed, but I imagine usage is only going to go up.
Comment 3•15 years ago
If we're at a critical point, we could delete old try server builds. We currently keep 14 days' worth, and I think we could get away with deleting stuff older than 10 days to free up space.
Going to need more space in the long term though.
Comment 4•15 years ago
I've changed this to 10 days in trybld's crontab.
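For reference, a cleanup entry of that general shape (the path, schedule, and exact find invocation here are guesses for illustration, not copied from trybld's actual crontab) might look like:

# m h dom mon dow  command
0 3 * * * find /home/ftp/pub/firefox/tryserver-builds -mindepth 1 -maxdepth 1 -type d -mtime +10 -exec rm -rf {} +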
Comment 5•15 years ago
Nick and I had a brief back-and-forth about this on IRC the other day.
The explosion in tryserver usage means we're regularly keeping ~500GB of try builds at any given time, and this was not the case (yet) in the spring when the cleanup policy went into effect. Current try build usage is 311GB after the updated 10-day cronjob ran last night.
Nick and I had discussed cutting try build retention back to 7 days. I'm fine with that, but it does represent a policy change and should be discussed with developers. We should also have a standard place that specific try builds can be moved to if they do need more longevity, e.g. tabcandy.
We could also consider setting up a new, separate partition for try builds that doesn't have to be as reliable as the other storage.
However, our nightly builds are also taking up much more space than before. With 4 months left to go in 2010, we've already doubled our disk usage from 2009. (2009: 161GB, 2010 so far: 318GB). This is largely due to the proliferation of project branches.
Should we have a separate longevity policy for project branch builds? Are we at the point we've discussed before where we simply buy more storage?
Comment 6•15 years ago
How hard would it be to have the trychooser allow specifying retention? I'd say probably close to 99% of try builds only need to be retained for as long as it takes the test machines to download them.
Comment 7•15 years ago
The trychooser doesn't reach into buildbot steps or the underlying factories; it only handles the creation of buildbot schedulers, so that only what is asked for gets created and less effort is wasted setting up builders and killing them off.
If some project needs builds to outlast the N days of the tryserver retention policy, perhaps booking out a disposable branch for a few builds is a good way to keep some builds around longer, but that brings up the fact that the disposable project branches are supposed to be just that: disposable. So we could also be looking at an automated way of wiping out the {maple,cedar,birch} dirs whenever a disposable branch is released.
The tryserver has never been known as a tool for saving a particular build, and I've seen devs download and host a special try build elsewhere when they need to, because they are aware of the retention policy for try. I like that people take that into their own hands, because it allows us to figure out how to scale our storage and shape our cleanup scripts with a decent approximation of how much space we need. Changing to a user-specified retention would throw a lot of unknowns into the mix.
Now that more folks use the try syntax, do we have an idea whether the average amount of try build data is dropping? Can we afford to wait a bit and see what happens as trychooser use gets more widespread?
Comment 8•15 years ago
I'm not arguing for longer retention, or devs specifying precisely how many days, I'm saying that dropping it to 10 days is not nearly enough. I don't really care whether the long period is 10 or 14 days, what I care about is that even 1 day is far too long a period for 99% of try pushes, which will never be downloaded by anyone at all other than the test machines.
Wanting the builds to be available for download at all is much more of an edge case than wanting talos. So if it's possible to tell the builders to either upload to a directory that includes "retainme" in the name, or upload a file marker saying "this is the rare build someone actually wants to have kept for a few days," then having the default retention period be something more like "24 hours and no pending tests" would be more reasonable. The odd case where someone actually wants to download the build, rather than just see the results of test runs, could require a --retain param passed to the trychooser.
300GB down from 500GB is a nice little gain, but 265GB of that remaining 300 is just wasted space.
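As a sketch of that marker-file idea (the directory layout and the "retainme" filename are assumptions taken from the proposal above, not an existing script), the cleanup pass could look something like:

find /home/ftp/pub/firefox/tryserver-builds -mindepth 1 -maxdepth 1 -type d -mtime +0 |
while read -r dir; do
  # keep anything a developer explicitly flagged; purge the rest after ~24 hours
  # (a real version would also need to check that no test jobs are still pending)
  [ -e "$dir/retainme" ] || rm -rf "$dir"
done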
Comment 9•15 years ago
I'm going to start a discussion on the lists to have try modified to support a developer flagging their build as purgeable. This will cause it to be cleaned up faster than the default retention period.
This is not to replace any discussion of future increases to disk space, but just as a way of managing current requirements.
Assignee: nobody → bear
Updated•15 years ago
Whiteboard: [cleanup]
Comment 10•15 years ago
The focus of the post has changed from starting a discussion to notifying folks of the change to try build retention from 10 to 4 days.
Updated•15 years ago
Assignee: bear → nobody
Updated•15 years ago
Priority: -- → P3
Whiteboard: [cleanup] → [cleanup][buildduty]
Comment 11•15 years ago
As a short-term workaround, we removed builds that are >10 days old, but this space is being quickly consumed by a combination of a) more people using tryserver and b) more builds being produced per tryserver run.
After discussions with shaver, we're adding space and moving older TryServer builds to additional disks. The dependent bugs linked here will add extra space and modify crontabs.
Leaving this bug open to track the issue.
OS: Mac OS X → All
Comment 12•15 years ago
Tryserver builds are now moved to /home/ftp/pub/firefox/tryserver-builds/old after 7 days. This new partition has 2TB of dedicated storage. Try builds will be purged from the old dir after a further 7 days (14 days total retention).
We are experiencing an uptick in regular usage now that we're uploading logs alongside builds. Not sure how desperate the situation is now, but I think this bug needs to stay open a little longer to watch it.
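A two-stage retention like this could be driven by a pair of cron entries roughly like the following (paths, times, and the exact find expressions are assumptions, not the real crontab):

# stage 1: move try dirs untouched for 7 days into old/
0 4 * * * find /home/ftp/pub/firefox/tryserver-builds -mindepth 1 -maxdepth 1 -type d ! -name old -mtime +7 -exec mv -t /home/ftp/pub/firefox/tryserver-builds/old {} +
# stage 2: delete from old/ once a build dir is 14 days old (7 days after the move)
0 5 * * * find /home/ftp/pub/firefox/tryserver-builds/old -mindepth 1 -maxdepth 1 -type d -mtime +14 -exec rm -rf {} +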
Comment 13•15 years ago
We've recently added a couple of new build types without adding any space to the main firefox partition:
* build logs that used to only go to tinderbox
* fuzzing results
* shadow-central builds
Fuzzing results are currently capped at 29 days in crontab. shadow-central builds weren't capped at all (at least not in ffxbld's crontab; I can't see root's), so I just added a corresponding 29-day cap.
I've mostly removed the 3.5.13-candidates and 3.6.10-candidates dirs from under /pub/mozilla.org/firefox/nightly, but there are some contrib/ dirs blocking complete removal. If someone with root could finish the job, it would be appreciated.
Even with all that, we still only have 95GB free on stage.
Reporter
Comment 14•15 years ago
It looks like this is still getting worse:
11/06 @ 9:07 AM
[92] surf:disk - /mnt/netapp/stage/archive.mozilla.org/pub/firefox is WARNING: DISK WARNING - free space: /mnt/netapp/stage/archive.mozilla.org/pub/firefox 82464 MB (4%)
Comment 15•15 years ago
The red line on this graph is /mnt/eql/builds/firefox, as reported by df. The big dip is probably because we had 76 pushes to try on 11/04, which is quite a bit higher than the recent trend. The previous high was 78 on 09/09, and we've been maxing out at ~60 recently; a full try run is more than a GB of disk space, but people are often limiting to a smaller set of jobs.
I would have expected the change in comment #12 (new mount at firefox/tryserver-builds/old and moving content there) to make the red line jump up at least 100-200GB. Instead we just see the weekly variation we've seen for the last month or so, where we have a busy week on the try server, a quiet weekend, and clean up files 2 weeks old.
I'm wondering if the reporting by df is off. Here's part of the output:
Filesystem Size Used Avail Use% Mounted on
/mnt/eql/builds/firefox
2.0T 1.9T 84G 96% /mnt/netapp/stage/archive.mozilla.org/pub/firefox
10.253.0.10:/vol/ftp_stage
4.8T 239M 4.8T 1% /mnt/netapp/stage_new
10.253.0.139:/data/try-builds
17T 60M 17T 1% /mnt/eql/builds/firefox/tryserver-builds/old
du says that there is 146G of content in tryserver-builds/old which isn't showing up there. Unfortunately it's a very slow process to check all of firefox/ with du. Could the 'deep mounting' throw df off?
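One way to cross-check (commands assumed, run on stage) is to compare du and df for the same path and confirm what is actually mounted there:

du -sh /mnt/eql/builds/firefox/tryserver-builds/old   # walks the tree as visible at that path
df -h /mnt/eql/builds/firefox/tryserver-builds/old    # reports the filesystem df thinks is mounted there
grep tryserver-builds /proc/mounts                    # shows which export/device actually backs the path

If du and df disagree badly, one likely explanation is that files are sitting on the parent filesystem underneath the mountpoint, or that the path being walked isn't backed by the mount df is reporting.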
Comment 16•15 years ago
(In reply to comment #15)
> The red line on this graph is /mnt/eql/builds/firefox, as reported by df. The
> big dip is probably because we had 76 pushes to try on the 11/04, which is
s/76/79/, plus there's 11G of firefox/4.0b7-candidates/build1 built on the 4th too.
(In reply to comment #13)
> We've recently added a couple of new build types without adding any space to
> the main firefox partition:
>
> * build logs that used to only go to tinderbox
> * fuzzing results
> * shadow-central builds
fuzzing and shadow currently account for 3G of space, so that's under control now. Not sure how much the build logs add; it looks like about 1M per nightly build and 5M when you have all the tests in a tinderbox-builds/foo/blah dir. Doesn't seem like that'd add up to lots of gigs of space.
Comment 17•15 years ago
df handles nested mounts fine, so that's not it. Indeed, all but one mount are necessarily nested under /!
Comment 18•15 years ago
It's a nontrivial case:
nfs mount at /mnt/netapp/stage
nfs mount at /mnt/eql/builds
bind mount from /mnt/eql/builds/firefox to /mnt/netapp/stage/firefox
nfs mount at /mnt/eql/builds/firefox/tryserver-builds/old
So nfs mounted onto a bind of an nfs mount inside an nfs mount. And
10.253.0.139:/data/try-builds
17T 60M 17T 1% /mnt/eql/builds/firefox/tryserver-builds/old
is clearly bogus if du is to be trusted (145GB used).
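Roughly, the stack described above would be built like this (a reconstruction for illustration; the export names are hypothetical, and only the try-builds export is taken from the df output):

mount -t nfs netapp:/vol/stage /mnt/netapp/stage              # export name hypothetical
mount -t nfs eql:/builds /mnt/eql/builds                      # export name hypothetical
mount --bind /mnt/eql/builds/firefox /mnt/netapp/stage/firefox
mount -t nfs 10.253.0.139:/data/try-builds /mnt/eql/builds/firefox/tryserver-builds/old

A plain bind mount doesn't automatically pick up submounts added to the source tree later, so a path reached via the bind can show the directory sitting underneath the 'old' mount rather than the NFS export, which would make du and df disagree.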
Comment 19•15 years ago
Cleaned up 28G of old versions in
firefox/tinderbox-builds/mozilla-{1.9.2,central}-l10n
firefox/nightly/latest/mozilla-{1.9.1,1.9.2}{,-l10n}
Currently 145G free.
Comment 20•15 years ago
@14:30 PDT:
> phong: armenzg_buildduty: [27] surf:disk - /mnt/netapp/stage/archive.mozilla.org/pub/firefox is WARNING: DISK WARNING - free space: /mnt/netapp/stage/archive.mozilla.org/pub/firefox 80074 MB (4%):
Comment 21•15 years ago
(In reply to comment #12)
> Tryserver builds are now moved to /home/ftp/pub/firefox/tryserver-builds/old
> after 7 days. This new partition has 2TB of dedicated storage. Try builds will
> be purged from the old dir after a further 7 days (14 days total retention).
>
> We are experiencing an uptick in regular usage now that we're uploading logs
> alongside builds. Not sure how desperate the situation is now, but I think this
> bug needs to stay open a little longer to watch it.
Could we change this crontab to 4 days on tryserver-builds and 10 days on try-server-builds/old?
BTW why does "old" appear under "try-server-builds"?
> -bash-3.2$ ls -l /mnt/netapp/stage/archive.mozilla.org/pub/firefox/tryserver-builds/ | grep old
> drwxr-sr-x 246 trybld users 12288 Nov 11 14:00 old
I believe what nthomas mentions in comment 15 is what I am noticing, and I believe we have really not moved the old builds to the new 2TB mount. It seems that the new mount is not being used:
> Filesystem Size Used Avail Use% Mounted on
> 10.253.0.139:/data/try-builds
> 17T 60M 17T 1% /mnt/eql/builds/firefox/tryserver-builds/old
> -bash-3.2$ ls -l /mnt/eql/builds/firefox/tryserver-builds/old/trybuilds/
> total 0
(In reply to comment #15)
> I would have expected the change in comment #12 (new mount at
> firefox/tryserver-builds/old and move content there) make the red line jump up
> at least 100-200GB. Instead we just see the weekly variation we've seen for the
> last month or so, where we have a busy week on the try server and a quiet
> weekend and clean up files 2 weeks old.
Severity: normal → major
Priority: P3 → --
Comment 22•15 years ago
(In reply to comment #21)
> Could we change this crontab to 4 days on tryserver-builds and 10 days on
> try-server-builds/old?
As an interim solution, yes.
> BTW why does "old" appear under "try-server-builds"?
I assume you mean "tryserver-builds."
Having old/ underneath tryserver-builds/ seems natural enough to me, provided our disk usage reporting is accurate (see comment #18).
> I believe what nthomas mentions on comment 15 is what I am noticing and I
> believe we have really not moved the old builds to the new 2TB mount. It seems
> that the referred new mount is not being used:
It may be being used, but the reporting may be broken (see comment #18).
I'm assigning this back to IT to investigate the nested mount.
Assignee: nobody → server-ops
Component: Release Engineering → Server Operations
QA Contact: release → mrz
Updated•15 years ago
Assignee: server-ops → armenzg
Updated•15 years ago
Assignee: armenzg → jdow
Assignee
Comment 23•15 years ago
All this talk of mounts under mounts is confusing. What exactly are we investigating about the nested mounts?
Comment 24•15 years ago
(In reply to comment #23)
> all this talk of mounts under mounts is confusing. What exactly are we
> investigating about the nested mounts?
See comment #18.
We're not getting valid disk usage reports for the mount at /mnt/eql/builds/firefox/tryserver-builds/old. Either the mount isn't actually being used or reporting of disk usage numbers is broken for it.
Once we're sure the mount *is* being populated properly and reporting works, we can move tryserver builds to the old/ dir sooner (adjust the crontabs).
Comment 25•15 years ago
Something weird is going on. If you umount /mnt/eql/builds/firefox/tryserver-builds/old, there's still an old dir in /mnt/eql/builds/firefox/tryserver-builds/. If you remount it, the files in there go away. I've left it unmounted for now. Could something be holding a reference?
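A couple of checks that might help pin this down (a sketch; the bind target path is made up):

fuser -vm /mnt/eql/builds/firefox/tryserver-builds/old   # list processes holding files open on that mount
mkdir /tmp/peek
mount --bind /mnt/eql/builds/firefox/tryserver-builds /tmp/peek
ls /tmp/peek/old    # a plain bind doesn't carry submounts, so this shows whatever sits on the parent filesystem directly underneath 'old'
umount /tmp/peek && rmdir /tmp/peek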
Comment 26•14 years ago
Aravind has reworked the mounting a bit. The 2T partition for old try builds (bug 595802, which is presumably a share of the 17T reported by df) is at /mnt/cm-ixstore01/try-builds and bind mounted to firefox/tryserver-builds/old for both the eql and netapp mounts.
The files we had already moved to old/ using cron are now in old_nthomas/, but we are in the process of moving them back into old/. The cron job is currently disabled but will be re-enabled, and possibly the age threshold shortened.
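In other words, something like the following (an illustrative reconstruction, not the actual mount configuration):

mount -t nfs cm-ixstore01:/try-builds /mnt/cm-ixstore01/try-builds                        # export name hypothetical
mount --bind /mnt/cm-ixstore01/try-builds /mnt/eql/builds/firefox/tryserver-builds/old
mount --bind /mnt/cm-ixstore01/try-builds /mnt/netapp/stage/firefox/tryserver-builds/old  # netapp-side path approximate

Binding the same source into both trees means both paths see the same data, instead of an NFS mount hidden inside a bind of another NFS mount.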
Comment 27•14 years ago
What is left to do on this bug? Are we pretty much done with what we have to do now?
Comment 28•14 years ago
(In reply to comment #26)
> The files we had already moved to old/ using cron are now in old_nthomas/, but
> are in the process moving them back into old/. The cron job is currently
> disabled but will be renabled, and possible the age threshold shortened.
The move was done a couple of days ago and cron jobs were re-enabled as is. Currently have 382G of data in tryserver-builds/, 176G in tryserver-builds/old and 206G of builds from the last 7 days. This will fluctuate with try server usage.
There is 191G free in the Firefox mount (/mnt/eql/builds/firefox), and 429G free for everything else, so we're OK for now. As time goes by we will accumulate more nightly builds and releases and hit space pressure again. I'd like to hear more about the large /mnt/netapp/stage_new mount on stage which justdave has been working on, and plans for growing storage.
Comment 29•14 years ago
(In reply to comment #28)
> The move was done a couple of days ago and cron jobs were re-enabled as is.
> Currently have 382G of data in tryserver-builds/, 176G in tryserver-builds/old
> and 206G of builds from the last 7 days. This will fluctuate with try server
> usage.
Can we also jigger the cronjobs to move tryserver-builds to tryserver-builds/old after only 4 days (vs. 7 now), but keep the total retention time the same (14 days)?
Comment 30•14 years ago
Done, we just have to be careful out when people requests for retests on older builds.
Comment 31•14 years ago
And again in English ...
Done, we just have to be careful when people request retests for older builds.
Assignee
Comment 32•14 years ago
I assume this is done and fixed now?
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Comment 33•14 years ago
(In reply to comment #32)
> I assume this is done and fixed now?
We'll need to revisit this eventually, but no reason to keep this particular bug open indefinitely until we do.
Updated•10 years ago
Product: mozilla.org → mozilla.org Graveyard