Closed Bug 1007583 Opened 10 years ago Closed 8 years ago

More intelligent disk space management - jobs to report their disk usage, so jobs don't need to guess their usage, or daemon to clean up on-the-fly

Categories

(Release Engineering :: General, defect)

Hardware: x86
OS: All
Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: pmoore, Unassigned)

References

Details

Recently we've had issues with build machines running out of disk space; most recently I worked on bug 1001518, but prior to this there have been numerous other bugs.

At the moment, we have a mechanism to define how much disk space we believe we need for a job, and then to purge builds until that amount of free space is available at the start of the job (https://bugzilla.mozilla.org/show_bug.cgi?id=1001518#c5).

Rather than us maintaining a list of disk space sizes that can quickly become out-of-date, I propose that we introduce a mechanism whereby a job notes how much disk space is available when it starts and again when it completes, and then updates a database with how much disk space it used on that run.

By keeping a history of how much disk space each job takes up with each run, we can formulate an algorithm to provide a suitable build_space value, rather than manually maintaining it.

There are several degrees of freedom in implementing this - e.g. maybe we keep all disk usage stats, maybe only for the last run, or maybe just the most recent X runs. Possible formulas for calculating build_space could be taking either the average or maximum of previous runs, and multiplying it by a factor like 1.5, or something similar.

The two key parts are therefore:
1) recording disk usage info
2) estimating build_space based on data captured in step 1
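A minimal sketch of what these two steps could look like, assuming a simple per-builder JSON file as the "database", a /builds mount point, and a 1.5 safety factor with a 20GB fallback (all illustrative choices, not an agreed design):

    # Illustrative only: the JSON "database", /builds path, 1.5 factor and
    # 20GB fallback are assumptions, not settled values.
    import json
    import os
    import time


    def free_bytes(path="/builds"):
        # Free disk space on the volume holding the build dirs.
        st = os.statvfs(path)
        return st.f_bavail * st.f_frsize


    def record_job_usage(db_path, builder, before_free, after_free, keep=10):
        # Step 1: append the usage observed on this run, keeping the last N runs.
        history = {}
        if os.path.exists(db_path):
            with open(db_path) as f:
                history = json.load(f)
        runs = history.setdefault(builder, [])
        runs.append({"ts": int(time.time()),
                     "bytes_used": max(before_free - after_free, 0)})
        history[builder] = runs[-keep:]
        with open(db_path, "w") as f:
            json.dump(history, f, indent=2)


    def estimate_build_space(db_path, builder, factor=1.5, default=20 * 1024 ** 3):
        # Step 2: max of recent runs times a safety factor, falling back to a
        # default when there is no history for this builder yet.
        if not os.path.exists(db_path):
            return default
        with open(db_path) as f:
            history = json.load(f)
        runs = history.get(builder, [])
        if not runs:
            return default
        return int(max(r["bytes_used"] for r in runs) * factor)

The caller would capture free_bytes() at job start and end, record the delta, and use estimate_build_space() as the purge target for the next run of the same builder.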

The goal of this bug is to reduce the manual time spent tweaking build space settings, and to make this process entirely automatic. For example, on an unrelated topic, bug 862910 is about caching downloads in update verify, which will increase the disk space used by that job. Currently we would have to perform several manual tests to try to establish the disk usage, and then add config to buildbot-configs to handle it. With the mechanism described in this bug, this would not be necessary.

This solution should be a "nail in the coffin". Once done, this should be a problem of the past - so despite being a fair amount of work to implement, the benefit is putting a recurring problem to bed, hopefully forever. This should reduce buildduty work, which is a goal for this year.
gps added monitoring of a whole bunch of stats using psutil. If we're not already watching free disk space, we could add that, which also has the advantage of catching peak usage (which can be higher than final usage).
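For reference, a tiny sampler along these lines (assuming psutil is available on the slaves; the /builds path, interval and sample count are made up):

    import time
    import psutil


    def min_free_during_job(path="/builds", interval=30, samples=120):
        """Poll free space while a job runs; the minimum seen approximates peak
        usage, which can be higher than the free space left when the job ends."""
        min_free = None
        for _ in range(samples):
            free = psutil.disk_usage(path).free
            min_free = free if min_free is None else min(min_free, free)
            time.sleep(interval)
        return min_free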
Thanks Nick!

Gregory, do you have links to the work you've done on this?
Flags: needinfo?(gps)
RyanVM tells me gps is on holiday - Nick, do you have links/more details about gps's psutil work?

Thanks!
Pete
Flags: needinfo?(gps) → needinfo?(nthomas)
I searched around a bit and found the original bug 883209. The call './mach resource-usage' allows developers to see the result, but we're not using mach in automation, so that probably isn't available to us.

The other caveat is that we'd only cover the gecko part of b2g builds with this; should be OK for Firefox desktop and mobile though.
Flags: needinfo?(nthomas)
So today I've again had two or three machines run out of disk space that needed manual fixing.

This seems to be becoming more of a problem.

How about an even simpler solution - we have a daemon running on the build slave, which is monitoring disk availability. When it gets critically low (e.g. <1 GB free space) it starts purging builds, until we have e.g. 2GB available. This could even run while a job is running, so if a job starts eating up more and more disk, it would take care of clearing space as needed *while the job is running*.
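Roughly, that daemon could look like the sketch below; the /builds layout, the LRU purge policy and the 1GB/2GB watermarks are assumptions for illustration, and a real version would need to skip the directory of the currently running job:

    import os
    import shutil
    import time

    GB = 1024 ** 3
    LOW_WATER, HIGH_WATER = 1 * GB, 2 * GB
    BUILDS_DIR = "/builds"


    def free_bytes(path=BUILDS_DIR):
        st = os.statvfs(path)
        return st.f_bavail * st.f_frsize


    def purge_until(target):
        # Delete least-recently-used build dirs until the target free space is met.
        dirs = [os.path.join(BUILDS_DIR, d) for d in os.listdir(BUILDS_DIR)]
        for d in sorted((d for d in dirs if os.path.isdir(d)), key=os.path.getmtime):
            if free_bytes() >= target:
                break
            shutil.rmtree(d, ignore_errors=True)


    def main(poll_seconds=60):
        while True:
            if free_bytes() < LOW_WATER:
                purge_until(HIGH_WATER)
            time.sleep(poll_seconds)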

This is of course fine, unless we are performance testing.

In the end I'm happy with any mechanism that is self-managing, reliable (i.e. doesn't fail to free up enough disk space in time under certain circumstances), and doesn't require us to manually maintain data about how much disk space we think a particular job will take up. We shouldn't keep manually fixing machines that run out of disk space.
A daemon seems pretty heavy-handed for this. I think slave pre-flight tasks will cover this: bug 712206.
Agreed a daemon could be pretty heavy-handed. The benefit it has is that we don't need to know how much space any job takes, either by manually setting it or by recording data about historic runs. Of course, this can be made simple if we have an "absolute" disk availability requirement - although this might not sit well with e.g. AWS slaves that only have 15GB disk.

So how can we best balance the following goals?

1) We don't want to introduce too much complexity if possible
2) We don't want to manually "manage numbers" i.e. to keep track of how much disk space each job requires
3) We want AWS slaves with less disk space to be able to survive with a small amount of free disk space (e.g. 1GB)
4) Some jobs (linux 64 debug builds for b2g) may need around 16GB free disk space
5) We don't want anything too heavy-handed

The benefit of a daemon is that it satisfies goals 1-4. You could argue that keeping a database of usage (comment 0) satisfies goals 2-5.

Preflight tasks in principle can clean up space, but how do they decide how much space to free up? I think the challenge is working out a way to free up space that works for low disk machines like AWS slaves with 15GB, that typically only need 1GB free disk for a new job, but also works with slaves with 200GB disk, that need over 15GB for a particular build job - and at the same time, in a way that we don't need to manually keep track of how much disk space each job needs (i.e. that we are not editing config files that say "job X requires 15GB, job Y requires 1GB"). Anything that requires manual maintenance is going to go wrong, and keep us from being as automated as possible.
I agree that manual maintenance is a non-starter. Why not just increase the space required for everything to 20GB and avoid the problem entirely? 

I think trying to micro-manage the disk on AWS is the wrong approach here. We should be tracking the build + objdir + max-build-process size requirements instead so that we don't find out about these problems when builds start failing. Having a consistent target of 20GB (or whatever) would make that monitoring easier.
Did you also want to give up on either a) treating release- dirs as sacred artifacts or b) actually using the in-house slaves for anything other than releases? I just disabled one because it could only free up 12GB, good luck getting 20GB on them.
(In reply to Phil Ringnalda (:philor) from comment #9)
> Did you also want to give up on either a) treating release- dirs as sacred
> artifacts or b) actually using the in-house slaves for anything other than
> releases? I just disabled one because it could only free up 12GB, good luck
> getting 20GB on them.

Yes, frankly. AFAIK we've *never* actually gone back to those slaves to use the artifacts after a release. If we're serious about keeping them, we should be moving them to "safer," non-slave-based storage.
Which AWS slaves only have 15G disk? That sounds like testers rather than builders.
(In reply to Nick Thomas [:nthomas] from comment #11)
> Which AWS slaves only have 15G disk? That sounds like testers rather than
> builders.

15G is used for tst-linux32 instances only, see:
http://hg.mozilla.org/build/cloud-tools/file/5a73c7cf1d5f/configs/tst-linux32#l16

tst-linux64 instances use 20G, per:
http://hg.mozilla.org/build/cloud-tools/file/5a73c7cf1d5f/configs/tst-linux64#l16
Ah great, so that means we can do what Coop suggests, and go with a standard 20GB for all jobs, as per comment 8?

I think if we have jobs that use more than 20GB, they could probably do with a little optimising. :)
The historical reason for choosing smaller, closely matching values is free-space clobbers on other builddirs: asking for more space ends up removing more build dirs, so when we come back to those builds they start from scratch. Since then I think we decided the pools are so large that we effectively clobber anyway.

We should check with bhearsum how he sets up jacuzzis though - a global 20G requirement may reduce the # of jobs we can put in each jacuzzi.
(In reply to Chris Cooper [:coop] from comment #10)
> (In reply to Phil Ringnalda (:philor) from comment #9)
> > Did you also want to give up on either a) treating release- dirs as sacred
> > artifacts or b) actually using the in-house slaves for anything other than
> > releases? I just disabled one because it could only free up 12GB, good luck
> > getting 20GB on them.
> 
> Yes, frankly. AFAIK we've *never* actually gone back to those slaves to use
> the artifacts after a release. If we're serious about keeping them, we
> should be moving them to "safer," non-slave-based storage.

Yeah, I'm on board with this too. When there was less contention for space I think it was a fine trade-off. These days it's pretty clearly not the right thing.

(In reply to Nick Thomas [:nthomas] from comment #14)
> The historical reason for choosing smaller, closely matching values is
> free-space clobbers on other builddirs: asking for more space ends up
> removing more build dirs, so when we come back to those builds they start
> from scratch. Since then I think we decided the pools are so large that we
> effectively clobber anyway.
> 
> We should check with bhearsum how he sets up jacuzzis though - a global 20G
> requirement may reduce the # of jobs we can put in each jacuzzi.

It's mostly pick and guess, but catlee has more up-to-date information now that we're doing semi-dynamic jacuzzis...
Flags: needinfo?(catlee)
The jacuzzi allocation right now isn't really taking into account disk space directly. I've just set a maximum of 2 builders per slave, and left it at that.
Flags: needinfo?(catlee)
It sounds like we're all happy with a 20GB default?
(In reply to Chris Cooper [:coop] from comment #10)
> (In reply to Phil Ringnalda (:philor) from comment #9)
> > Did you also want to give up on either a) treating release- dirs as sacred
> > artifacts or b) actually using the in-house slaves for anything other than
> > releases? I just disabled one because it could only free up 12GB, good luck
> > getting 20GB on them.
> 
> Yes, frankly. AFAIK we've *never* actually gone back to those slaves to use
> the artifacts after a release. If we're serious about keeping them, we
> should be moving them to "safer," non-slave-based storage.

I've done plenty of release cleanup work on the slaves that did the release builds. Most recently doing funnelcake stub installer builds for fx29 based on the previous build directory.
(In reply to Pete Moore [:pete][:pmoore] from comment #13)
> Ah great, so that means we can do what Coop suggests, and go with a standard
> 20GB for all jobs, as per comment 8?
> 
> I think if we have jobs that use more than 20GB, they could probably do with
> a little optimising. :)

We should still monitor how much space each build type takes. Maybe this is better to do in mozharness? Run "df" before and after doing a job, and report each, and the difference, both in the log and in buildbot properties?
Blocks: 958987
(In reply to Chris AtLee [:catlee] from comment #19)
> (In reply to Pete Moore [:pete][:pmoore] from comment #13)
> > Ah great, so that means we can do what Coop suggests, and go with a standard
> > 20GB for all jobs, as per comment 8?
> > 
> > I think if we have jobs that use more than 20GB, they could probably do with
> > a little optimising. :)
> 
> We should still monitor how much space each build type takes. Maybe this is
> better to do in mozharness? Run "df" before and after doing a job, and
> report each, and the difference, both in the log and in buildbot
> properties?

Now mozharness can access disk space info:

http://hg.mozilla.org/build/mozharness/file/807367f3bce5/mozharness/base/diskutils.py
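To illustrate the before/after reporting from comment 19 (this doesn't use the diskutils API linked above, whose exact interface isn't shown here; paths and output format are assumptions):

    import os


    def disk_free_gb(path="/builds"):
        st = os.statvfs(path)
        return st.f_bavail * st.f_frsize / float(1024 ** 3)


    def run_with_disk_report(run_job, path="/builds"):
        # Log free space before and after the job, plus the difference; the
        # same numbers could also be set as buildbot properties.
        before = disk_free_gb(path)
        result = run_job()
        after = disk_free_gb(path)
        print("disk free before: %.2f GB, after: %.2f GB, used: %.2f GB"
              % (before, after, before - after))
        return result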
If we report on disk usage, we should feed that back somewhere such that a preflight check can ensure that a job only runs if it has e.g. 125% of the required disk space available.
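For example, a preflight gate along these lines (the 1.25 headroom mirrors the 125% above; where the required size comes from is whatever the reporting step feeds back):

    import os


    def preflight_ok(path, required_bytes, headroom=1.25):
        # Let the job start only if free space is at least 125% of what the
        # reported history says this job needs; otherwise purge or skip.
        st = os.statvfs(path)
        free = st.f_bavail * st.f_frsize
        return free >= required_bytes * headroom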

I think it might be cleaner to not have the job itself concerned with disk usage, i.e. the job execution framework should probably be responsible for this. Otherwise it means we have a dependency that all jobs have to run in mozharness. I still like the idea of a service running on the boxes which cleans up as disk fills up, and alarms if it is not able to clean up. This requires no changes to any job definitions, and is a generic way to keep disk space available, regardless of the job type that runs. I also think it is the simplest, since we do not need to report numbers back into a system and keep tabs on them.

However, we should also wrap this up so the topic doesn't stay unresolved indefinitely, so maybe we can come up with an imperfect solution that tides us over, especially in light of jobs moving over to Task Cluster.
Greg,

How do you take care of ensuring disk space availability on Task Cluster docker workers? Do you retain any state on workers outside the docker containers, or does everything live and die within the docker container with each job, so disk space does not fill up because the container is destroyed after each task completes?

I believe in RelEng, jobs are designed to retain working dirs and vcs checkouts between jobs to reduce iterative build times (i.e. not always a clean build) - but I think in Task Cluster, builds are made from a clean working dir inside the docker container, using recent vcs caches to avoid fresh clones, but other than that, everything is built from scratch - is that right? If that is the case, then no doubt this problem does not exist in Task Cluster, but wanted to check, in case there is overlap here. =)

Pete
Flags: needinfo?(garndt)
Summary: More intelligent disk space management - jobs to report their disk usage, so jobs don't need to guess their usage → More intelligent disk space management - jobs to report their disk usage, so jobs don't need to guess their usage, or daemon to clean up on-the-fly
We have a diskspace threshold (defaulted to 10GB/task) that is managed by a garbage collector (very loose term) in the worker, and this logic does not exist within a task container. The worker will not claim any work if there is not enough space to accommodate as many tasks as the worker could claim at that time. Also, if a garbage collection cycle begins and the threshold is reached, all docker images and cached volumes currently not in use will be removed in an attempt to free up enough space for the worker to take on some additional tasks.
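Paraphrasing that logic in Python for clarity (the actual docker-worker implementation is JavaScript and differs in detail; the names here are made up):

    GB = 1024 ** 3
    THRESHOLD_PER_TASK = 10 * GB  # default diskspace threshold per task


    def can_claim(free_bytes, claimable_tasks):
        # Don't claim work unless there is room for every task we could claim now.
        return free_bytes >= claimable_tasks * THRESHOLD_PER_TASK


    def gc_cycle(free_bytes, unused_images, unused_caches):
        # When the threshold is reached, remove all docker images and cached
        # volumes that no running task is using.
        if free_bytes < THRESHOLD_PER_TASK:
            return unused_images + unused_caches
        return []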

Along with caching the vcs checkouts, we cache the build object directories, keyed on build type and whether it's a debug or opt build.

Hope this answers some of your questions, ping me if there is anything else I can clear up.
Flags: needinfo?(garndt)
Oh, also, about the task containers: they are marked for removal after a task completes, and the garbage collector will remove them regardless of whether the threshold is met.
Migrating to TaskCluster...
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WONTFIX
Component: General Automation → General