Closed Bug 482205 Opened 13 years ago Closed 11 years ago

handle out-of-disk better on build slaves (was "unittest running out of space")

Categories

(Release Engineering :: General, defect, P2)

x86
Windows Server 2003
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: nthomas, Assigned: jhford)

References

Details

(Whiteboard: [buildslaves][automation])

Attachments

(5 files, 1 obsolete file)

This slave ran out of space doing a mozilla-central unit test build
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1236546136.1236550760.10190.gz
buildSpace is hardwired to 3GB in config.py, while Windows Explorer says there were 1.83GB of files in mozilla-central-win32-unit/, taking up 3.11GB on disk. Interestingly, the same slave's mozilla-1.9.1 unit dir was only 1.08GB/2.33GB.

TODO (1) - check if 3.11GB of actual disk is typical for a m-c unit build, possibly bump up this value in config.py

The unit build died in the compile step but tried to run the unit tests anyway; I think we should give up on the build in that case. Filed as bug 482169, but I'll dupe it over here.

TODO (2) - set haltOnFailure for compile step of unit test builds
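For reference, a minimal sketch of what TODO (2) means in buildbot terms, assuming a plain ShellCommand compile step; the real unittest factory lives in buildbotcustom/process/factory.py, and the step name and make command below are placeholders rather than the actual code:

# Sketch only: haltOnFailure is the real buildbot parameter; everything
# else here is a stand-in for the buildbotcustom unittest factory.
from buildbot.process.factory import BuildFactory
from buildbot.steps.shell import ShellCommand

f = BuildFactory()
f.addStep(ShellCommand(
    name='compile',
    command=['make', '-f', 'client.mk', 'build'],
    description=['compiling'],
    haltOnFailure=True,  # stop here instead of running tests against a broken build
))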

The next build on the slave, a mozilla-central build, failed to clean up space properly:
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1236551524.1236551587.12400.gz
It clobbers build/tools but fails to reclone it; odd, but perhaps hg needs more space for temporary files than the final working dir takes. We don't stop on this error, and unsurprisingly the clobber check and purge_builds.py can't run. I think we should definitely halt in the latter case, but we really need to avoid getting into this situation in the first place, maybe with a disk space check before we delete build/tools. The slave got cleaned up by hand.

TODO (3) - haltOnFailure if we fail to pull build/tools/ or to run purge_builds.py, or robustify our current cleanup.

TODO (4) - would be helpful for debugging if purge_builds.py always reported free space before exiting
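For TODO (4) (and the pre-delete space check floated above for TODO (3)), something along these lines would do; a minimal POSIX-only sketch, assuming os.statvfs (the win32 slaves would need ctypes and GetDiskFreeSpaceExW instead), with a made-up function name rather than anything taken from purge_builds.py:

import os
import sys

def free_space_gb(path):
    # POSIX only; on the win32 slaves this would need
    # ctypes.windll.kernel32.GetDiskFreeSpaceExW instead.
    st = os.statvfs(path)
    return (st.f_bavail * st.f_frsize) / (1024.0 ** 3)

if __name__ == '__main__':
    path = sys.argv[1] if len(sys.argv) > 1 else '.'
    # e.g. called at the end of purge_builds.py, and again before
    # deciding whether it is safe to delete and reclone build/tools
    print("%.2f GB free on %s" % (free_space_gb(path), path))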
Duplicate of this bug: 482169
FWIW: I bumped down the size requirement for unittest builds after surveying them a couple of weeks ago. As far as I can remember there weren't any unittest builds over 3GB "size on disk" at the time.
These two linux unittest builds also ran out of space today:
 Tue Mar 10 16:19:54 (Build #1989)
 Tue Mar 10 19:31:02 (Build #1995)
Both in the compile phase. We need to revisit required space for unit test builds, or something is busted in the purge script.
r=bustage for land and reconfig.
Attachment #366761 - Flags: checked-in+
Summary: moz2-win32-slave11 ran out of disk space → unittest running out of disk space
Attachment #366760 - Flags: review?(catlee) → review+
Attachment #366760 - Flags: checked-in+
(In reply to comment #5)
> Created an attachment (id=366761) [details]
> TODO1 (sort of): Bump from 3GB to 4GB required space

Followed up with the matching change to mozilla2-staging/master-main.cfg
Adding symbols has boosted required size quite a bit, and 4G is not enough:

4.7G    mozilla-central-linux-unittest
3.2G    mozilla-1.9.1-linux-unittest   (have also seen 3.7G)
3.8G    tracemonkey-linux-unittest

2.2G    mozilla-central-macosx-unittest
1.9G    mozilla-1.9.1-macosx-unittest
1.9G    tracemonkey-macosx-unittest

3.6G    mozilla-central-win32-unittest
3.9G    mozilla-1.9.1-win32-unittest
3.5G    tracemonkey-win32-unittest

Win32 is size on disk from folder properties. Setting a blanket 5GB is a bit of a sledgehammer approach, but we're not running short of space on Mac (typically 30+GB free).
Assignee: nobody → nthomas
Status: NEW → ASSIGNED
Attachment #367161 - Flags: review?(bhearsum)
I think this is the right buildbot <blah>On<blah> for this job.
Attachment #367162 - Flags: review?(bhearsum)
Comment on attachment 367161 [details] [diff] [review]
TODO1 (properly): 5GB free space required

Yeah, this seems fine. A little tight on Linux, but I think the size varies a lot less there.
Attachment #367161 - Flags: review?(bhearsum) → review+
Attachment #367162 - Flags: review?(bhearsum) → review+
Comment on attachment 367162 [details] [diff] [review]
TODO2: haltOnFailure if unit compile fails

Yep, that's the right parameter. Minor nit: use True instead of 1 for consistency with the rest of the file. r=bhearsum with that change.
Comment on attachment 367161 [details] [diff] [review]
TODO1 (properly): 5GB free space required

changeset:   1008:58d74c2fe4c8
Attachment #367161 - Flags: checked-in+
Comment on attachment 367162 [details] [diff] [review]
TODO2: haltOnFailure if unit compile fails

changeset:   216:cc9598743763
Attachment #367162 - Flags: checked-in+
(In reply to comment #0)
> TODO (3) - haltOnFailure if we fail to pull build/tools/ or purge_builds.py, or
> robustify our current cleaning up.

What do people think about this? It's to do with how we recover from a full-disk situation, where we don't have enough space to remove build/tools/ and reclone it. Obviously we shouldn't end up here, but we will occasionally, even if only through repos growing, and we should decide whether we want to handle it manually or automatically.

Currently
 http://hg.mozilla.org/build/buildbotcustom/file/tip/process/factory.py#l147
we're not setting any "flags" on removing and recloning build/tools; the clobberer step has flunkOnFailure=False, so it won't turn the build red on failure; and the clean-up-old-builds step has that plus warnOnFailure=True, so it'll go orange (right?). I'd be tempted to make the whole build fail if it can't get enough space or can't run the script, since we know it's going to blow up later.
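For concreteness, a hedged sketch of the flag combination being proposed for the cleanup step; the real steps around factory.py line 147 use their own classes and commands, so only the flag names below are the point and the script invocation is illustrative:

# Placeholder step: haltOnFailure/flunkOnFailure are what matters here;
# the command and arguments are not the real purge_builds invocation.
from buildbot.process.factory import BuildFactory
from buildbot.steps.shell import ShellCommand

f = BuildFactory()
f.addStep(ShellCommand(
    name='clean_old_builds',
    command=['python', 'purge_builds.py', '-s', '5', '..'],
    description=['cleaning', 'old', 'builds'],
    haltOnFailure=True,    # no point carrying on if we could not free space
    flunkOnFailure=True,   # and mark the build red rather than orange
))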

Other notes/ideas
* the hg clone operation is returning exit code 0 when it fails
* 'hg pull && hg up -C' will fail just as badly if the disk is completely full, but may be better in low disk space situations
* we could move the "delete old package" step up to guarantee a bit of free space, but that won't work if we're starting from scratch (when another build clobbered the dir)
* add an always-run step that checks free space at the end of the build, mails when below a threshold, and disconnects the slave? (rough sketch below)
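A rough sketch of that always-run check, assuming a slave-side helper: the threshold, addresses, and local SMTP server are all made up, and a real version would more likely be a custom buildbot step on the master:

import os
import smtplib
from email.mime.text import MIMEText

THRESHOLD_GB = 5                      # made-up threshold
MAILTO = 'release@example.com'        # placeholder address

def check_free_space(path='.'):
    st = os.statvfs(path)             # POSIX only, as noted earlier
    free_gb = (st.f_bavail * st.f_frsize) / (1024.0 ** 3)
    if free_gb < THRESHOLD_GB:
        msg = MIMEText('Only %.2f GB free in %s on this slave' % (free_gb, path))
        msg['Subject'] = 'build slave low on disk'
        msg['From'] = 'buildslave@example.com'
        msg['To'] = MAILTO
        server = smtplib.SMTP('localhost')   # assumes a local MTA
        server.sendmail(msg['From'], [MAILTO], msg.as_string())
        server.quit()
    return free_gb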
(In reply to comment #13)
> What do people think about this ? It's do with how we recover from a full disk
> situation, where we don't have enough space to remove build/tools/ and reclone
> it.  Obviously we shouldn't end up here but will occasionally, even if only
> through repo's growing, and we should decide if we want to handle it manually
> or automatically.

I think haltOnFailure && flunkOnFailure is probably appropriate here. You could make an argument for warnOnFailure instead of flunk, but halting is the right thing to do IMHO, as you note below.


> * 'hg pull && hg up -C' will fail just as badly if the disk is completely full,

We won't hit this if we bail on clobberer problems, right?

> * add an always run step that checks free space at end of build, mails when
> below a threshold and disconnects the slave ?

I'm not sure disconnecting is a useful thing to do because it should get some more free space when the next build starts. Mailing is a fantastic idea for sure, though. As long as we pay attention to it more than Nagios mail.
Not working on this right now, but still worth handling out-of-disk better.
Assignee: nthomas → nobody
Status: ASSIGNED → NEW
Component: Release Engineering → Release Engineering: Future
Priority: -- → P3
Summary: unittest running out of disk space → handle out-of-disk better on build slaves (was "unittest running out of space")
Mass move of bugs from Release Engineering:Future -> Release Engineering. See
http://coop.deadsquid.com/2010/02/kiss-the-future-goodbye/ for more details.
Component: Release Engineering: Future → Release Engineering
Priority: P3 → P5
Whiteboard: [buildslaves][automation]
Assignee: nobody → jhford
My plan for this:
1. Parse the output of purge_builds to get free space information into build properties (sketched below).
2. Write a daemon that monitors the amount of free space on disk during the build and writes that info to a file. The lowest amount of disk space seen during the build would be set in a property.
3. Create a webservice that looks at the DB to return the lowest amount of free space, and use that value plus an extra 2GB for safety as the amount of space purge_builds should require, instead of using a static value.
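A minimal sketch of step 1: the property names purge_actual/purge_target match the test output shown further down in this bug, but the sample line and regexps are assumptions, since the real output format of purge_builds.py and the landed patch aren't quoted here:

import re

# Assumed output line; the real purge_builds.py wording may differ.
SAMPLE_OUTPUT = "Purged 3 dirs, 20.30GB free, 10GB requested"

# Compiled once at import time so they are reused across builds.
ACTUAL_RE = re.compile(r'(\d+(?:\.\d+)?GB) free')
TARGET_RE = re.compile(r'(\d+(?:\.\d+)?GB) requested')

def extract_purge_properties(stdout):
    props = {}
    actual = ACTUAL_RE.search(stdout)
    if actual:
        props['purge_actual'] = actual.group(1)
    target = TARGET_RE.search(stdout)
    if target:
        props['purge_target'] = target.group(1)
    return props

props = extract_purge_properties(SAMPLE_OUTPUT)
print(props['purge_actual'])   # 20.30GB
print(props['purge_target'])   # 10GB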
Going to be working on these bugs
Status: NEW → ASSIGNED
Priority: P5 → P2
Tested that this works in the success case

purge_actual: '20.30GB'
purge_target: '10GB'

I am currently creating a 20GB file on the disk to test the failure case.
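A quick way to chew up that space for the failure test; a sketch only, with the file name made up, writing real zero-filled chunks so the blocks are actually allocated rather than sparse:

# Fill the disk with a real (non-sparse) file to provoke the low-space path.
CHUNK = b'\0' * (1024 * 1024)   # 1 MB of zeros
SIZE_GB = 20                    # matches the test described above

with open('spacehog.bin', 'wb') as f:
    for _ in range(SIZE_GB * 1024):
        f.write(CHUNK)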
This patch adds information to the status DB in the form of properties that show how much disk space is available vs. what was requested.
Attachment #471198 - Attachment is obsolete: true
Attachment #494504 - Flags: review?(coop)
Comment on attachment 494504 [details] [diff] [review]
set properties for diskspace

I don't know anything about Python regexp performance, but does it make sense to compile the actual regexps beforehand and re-use them? I reckon they're cached, which is probably enough for us here.
Attachment #494504 - Flags: review?(coop) → review+
They're not cached, but this doesn't look like it's executed that often, so I suspect it's OK.
I'd like to include this small patch in the next downtime.
Flags: needs-reconfig?
(In reply to comment #23)
> I'd like to include this small patch in the next downtime.

s/downtime/reconfig/
Planning to do a reconfig tomorrow morning.
Flags: needs-reconfig? → needs-reconfig+
Comment on attachment 494504 [details] [diff] [review]
set properties for diskspace

changeset:   1234:92d255b03c1a
Attachment #494504 - Flags: checked-in+
Masters have been updated.
Flags: needs-reconfig+
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
If I'm reading this right, we now have some data stuffed into the status DB, from which we can build a report. The original point of this bug was to handle out-of-disk better on the slave, so at a bare minimum we should file a follow-up to create the report.
Product: mozilla.org → Release Engineering