Bug 482205 - handle out-of-disk better on build slaves (was "unittest running out of space")
Status: RESOLVED FIXED (Closed)
Opened 16 years ago • Closed 15 years ago
Categories: Release Engineering :: General, defect, P2
Tracking: Not tracked
People: Reporter: nthomas, Assigned: jhford
Whiteboard: [buildslaves][automation]
Attachments (5 files, 1 obsolete file)
- 537 bytes, patch (catlee: review+; nthomas: checked-in+)
- 1.02 KB, patch (nthomas: checked-in+)
- 3.61 KB, patch (bhearsum: review+; bhearsum: checked-in+)
- 938 bytes, patch (bhearsum: review+; bhearsum: checked-in+)
- 2.41 KB, patch (coop: review+; bhearsum: checked-in+)
This slave ran out of space doing a mozilla-central unit test build
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1236546136.1236550760.10190.gz
buildSpace is hardwired to 3GB in config.py, while Windows Explorer says there were 1.83GB of files in mozilla-central-win32-unit/, taking up 3.11GB on disk. Interestingly, the same slave's mozilla-1.9.1 unit dir was only 1.08/2.33GB.
TODO (1) - check if 3.11GB of actual disk is typical for a m-c unit build, possibly bump up this value in config.py
The unit build died in the compile step but tried to run the unit tests anyway; I think we should give up on the build in that case. Filed as bug 482169, but I'll dupe it over here.
TODO (2) - set haltOnFailure for compile step of unit test builds
The next build on the slave, a mozilla-central build, failed to clean up space properly:
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1236551524.1236551587.12400.gz
It clobbers build/tools but fails to reclone it, which is weird, but perhaps hg needs more space than the final working dir for temporary files. We don't stop on this error, and unsurprisingly the clobber check and purge_builds.py can't run. I think we should definitely halt in the latter case, but we really need to avoid getting into this situation, maybe with a disk space check before we delete build/tools. The slave got cleaned up by hand.
TODO (3) - haltOnFailure if we fail to pull build/tools/ or purge_builds.py, or robustify our current cleaning up.
TODO (4) - would be helpful for debugging if purge_builds.py always reported free space before exiting
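For TODO (4), a minimal sketch of the kind of free-space report purge_builds.py could emit before exiting; the helper name and message format are hypothetical, not the current script (POSIX only, a win32 slave would need something like win32api.GetDiskFreeSpaceEx instead):

import os

def report_free_space(path="."):
    # Hypothetical helper: print how much space is left on the volume
    # holding `path` so the tinderbox log shows it even when we bail out.
    st = os.statvfs(path)
    free_gb = (st.f_bavail * st.f_frsize) / float(1024 ** 3)
    print("%0.2fGB free on %s" % (free_gb, os.path.abspath(path)))
    return free_gb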
Comment 2 • 16 years ago
FWIW: I bumped down the size requirement for unittest builds after surveying them a couple of weeks ago. As far as I can remember there weren't any >3GB "size on disk" unittest builds at the time.
Comment 3 (Reporter) • 16 years ago
These two Linux unittest builds also ran out of space today:
Tue Mar 10 16:19:54 (Build #1989)
Tue Mar 10 19:31:02 (Build #1995)
Both in the compile phase. We need to revisit required space for unit test builds, or something is busted in the purge script.
Comment 4 (Reporter) • 16 years ago
Attachment #366760 - Flags: review?(catlee)
Comment 5 (Reporter) • 16 years ago
r=bustage for land and reconfig.
Attachment #366761 - Flags: checked-in+
Updated (Reporter) • 16 years ago
Summary: moz2-win32-slave11 ran out of disk space → unittest running out of disk space
Updated • 16 years ago
Attachment #366760 - Flags: review?(catlee) → review+
Updated (Reporter) • 16 years ago
Attachment #366760 - Flags: checked-in+
Comment 6 (Reporter) • 16 years ago
(In reply to comment #5)
> Created an attachment (id=366761) [details]
> TODO1 (sort of): Bump from 3GB to 4GB required space
Followed up with the matching change to mozilla2-staging/master-main.cfg
Comment 7 (Reporter) • 16 years ago
Adding symbols has boosted required size quite a bit, and 4G is not enough:
4.7G mozilla-central-linux-unittest
3.2G mozilla-1.9.1-linux-unittest (have also seen 3.7G)
3.8G tracemonkey-linux-unittest
2.2G mozilla-central-macosx-unittest
1.9G mozilla-1.9.1-macosx-unittest
1.9G tracemonkey-macosx-unittest
3.6G mozilla-central-win32-unittest
3.9G mozilla-1.9.1-win32-unittest
3.5G tracemonkey-win32-unittest
Win32 figures are "size on disk" from folder properties. Setting a blanket 5GB is a bit of a sledgehammer approach, but we're not running short of space on Mac (typically 30+GB free).
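A per-platform requirement would be the alternative to a blanket bump; this is purely a hypothetical sketch, the real config.py keys and layout differ:

UNITTEST_BUILD_SPACE = {   # GB free required before starting a unittest build (made-up key name)
    'linux':  5,   # mozilla-central has hit 4.7G
    'win32':  5,   # 3.5-3.9G size on disk, plus headroom
    'macosx': 3,   # ~2.2G max observed, and the macs have 30+GB free
}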
Comment 8 (Reporter) • 16 years ago
I think this is the right buildbot <blah>On<blah> for this job.
Attachment #367162 - Flags: review?(bhearsum)
Comment 9 • 16 years ago
Comment on attachment 367161 [details] [diff] [review]
TODO1 (properly): 5GB free space required
Yeah, this seems fine. A little tight on Linux, but I think the size varies a lot less there.
Attachment #367161 - Flags: review?(bhearsum) → review+
Updated • 16 years ago
Attachment #367162 - Flags: review?(bhearsum) → review+
Comment 10 • 16 years ago
Comment on attachment 367162 [details] [diff] [review]
TODO2: haltOnFailure if unit compile fails
Yep, that's the right parameter. Minor nit: use True instead of 1 for consistency with the rest of the file. r=bhearsum with that change.
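For reference, the shape of the change under review is roughly the following; the step name, command, and the factory variable `f` are placeholders rather than the actual buildbotcustom code:

from buildbot.steps.shell import ShellCommand

# Unittest compile step that stops the whole build if compilation fails,
# instead of carrying on and running the tests against nothing.
f.addStep(ShellCommand(
    name='compile',
    command=['make', '-f', 'client.mk', 'build'],
    haltOnFailure=True,   # True rather than 1, per the review nit
))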
Comment 11 • 16 years ago
Comment on attachment 367161 [details] [diff] [review]
TODO1 (properly): 5GB free space required
changeset: 1008:58d74c2fe4c8
Attachment #367161 - Flags: checked-in+
Comment 12 • 16 years ago
Comment on attachment 367162 [details] [diff] [review]
TODO2: haltOnFailure if unit compile fails
changeset: 216:cc9598743763
Attachment #367162 - Flags: checked-in+
Comment 13 (Reporter) • 16 years ago
(In reply to comment #0)
> TODO (3) - haltOnFailure if we fail to pull build/tools/ or purge_builds.py, or
> robustify our current cleaning up.
What do people think about this? It's to do with how we recover from a full-disk situation, where we don't have enough space to remove build/tools/ and reclone it. Obviously we shouldn't end up here, but we will occasionally, even if only through repos growing, and we should decide if we want to handle it manually or automatically.
Currently
http://hg.mozilla.org/build/buildbotcustom/file/tip/process/factory.py#l147
we're not setting any "flags" on removing and recloning build/tools; clobberer has flunkOnFailure=False, so it won't turn the build red on failure, and cleaning old builds has that plus warnOnFailure=True, so it'll go orange (right?). I'd be tempted to make the whole build fail if it can't get enough space or can't run the script, since we know it's going to blow up later.
Other notes/ideas
* the hg clone operation is returning exit code 0 when it fails
* 'hg pull && hg up -C' will fail just as badly if the disk is completely full, but may be better in low disk space situations
* we could move the "delete old package" step up to guarantee a bit of free space, but that won't work if we're starting from scratch (when another build clobbered the dir)
* add an always-run step that checks free space at the end of the build, mails when below a threshold, and disconnects the slave? (rough sketch below)
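A rough sketch of what that last idea could look like as a buildbot step; the step name is made up, `f` is a placeholder factory, and the mail/disconnect logic would need a small wrapper script that doesn't exist yet:

from buildbot.steps.shell import ShellCommand

# Always-run, purely informational disk space report at the end of the build.
# A threshold check that mails and/or disconnects the slave would replace the
# bare `df` with a wrapper script (hypothetical); note `df` also isn't
# available on win32 slaves without MSYS.
f.addStep(ShellCommand(
    name='report_free_space',
    command=['df', '-h', '.'],
    alwaysRun=True,
    flunkOnFailure=False,
    warnOnFailure=False,
))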
Comment 14 • 16 years ago
(In reply to comment #13)
> What do people think about this? It's to do with how we recover from a full-disk
> situation, where we don't have enough space to remove build/tools/ and reclone
> it. Obviously we shouldn't end up here but will occasionally, even if only
> through repos growing, and we should decide if we want to handle it manually
> or automatically.
I think haltOnFailure && flunkOnFailure is probably appropriate here. You could make an argument for warnOnFailure instead of flunk, but halting is the right thing to do IMHO, as you note below.
> * 'hg pull && hg up -C' will fail just as badly if the disk is completely full,
We won't hit this if we bail on clobberer problems, right?
> * add an always-run step that checks free space at the end of the build, mails when
> below a threshold, and disconnects the slave?
I'm not sure disconnecting is a useful thing to do because it should get some more free space when the next build starts. Mailing is a fantastic idea for sure, though. As long as we pay attention to it more than Nagios mail.
Comment 15 (Reporter) • 16 years ago
Not working on this right now, but still worth handling out-of-disk better.
Assignee: nthomas → nobody
Status: ASSIGNED → NEW
Component: Release Engineering → Release Engineering: Future
Priority: -- → P3
Updated (Reporter) • 16 years ago
Summary: unittest running out of disk space → handle out-of-disk better on build slaves (was "unittest running out of space")
Comment 16 • 15 years ago
Mass move of bugs from Release Engineering:Future -> Release Engineering. See
http://coop.deadsquid.com/2010/02/kiss-the-future-goodbye/ for more details.
Component: Release Engineering: Future → Release Engineering
Updated (Reporter) • 15 years ago
Priority: P3 → P5
Updated • 15 years ago
Whiteboard: [buildslaves][automation]
Updated (Assignee) • 15 years ago
Assignee: nobody → jhford
Comment 17 (Assignee) • 15 years ago
My plan for this:
1. Parse the output of purge_builds to get free-space information into build properties (see the sketch after this list).
2. Write a daemon that monitors the amount of free disk space during the build and writes that info to a file; the lowest amount of free space seen during the build would be set in a property.
3. Create a webservice that queries the DB for the lowest amount of free space, and use that value plus an extra 2GB for safety as the amount of space to require for purge_builds, instead of using a static value.
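A minimal sketch of step 1, assuming purge_builds prints a line like "20.30GB of space available"; the regex, command-line flags, script path, and property names here are illustrative guesses rather than the eventual patch:

import re
from buildbot.steps.shell import SetProperty

# Assumed output format; the real purge_builds.py message may differ.
_free_re = re.compile(r'(\d+(?:\.\d+)?)\s*GB of space available')

def extract_purge_info(rc, stdout, stderr):
    # Turn the script output into build properties for the status DB.
    props = {'purge_target': '10GB'}     # how much space we asked purge_builds to keep free
    m = _free_re.search(stdout)
    if m:
        props['purge_actual'] = '%sGB' % m.group(1)
    return props

f.addStep(SetProperty(
    name='clean_old_builds',
    command=['python', 'tools/buildfarm/maintenance/purge_builds.py', '-s', '10'],  # path/flags assumed
    extract_fn=extract_purge_info,
))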
Comment 18 (Assignee) • 15 years ago
Going to be working on these bugs.
Status: NEW → ASSIGNED
Priority: P5 → P2
Comment 19 (Assignee) • 15 years ago
Tested that this works in the success case:
purge_actual: '20.30GB'
purge_target: '10GB'
I am currently creating a 20GB file on the disk to test the failure case.
Comment 20 (Assignee) • 15 years ago
This patch adds information to the status DB in the form of properties that show how much disk space is available vs. what was requested.
Attachment #471198 - Attachment is obsolete: true
Attachment #494504 - Flags: review?(coop)
Comment 21 • 15 years ago
Comment on attachment 494504 [details] [diff] [review]
set properties for diskspace
I don't know anything about Python regexp performance, but does it make sense to compile the actual regexps beforehand and re-use them? I reckon they're cached, which is probably enough for us here.
Attachment #494504 - Flags: review?(coop) → review+
Comment 22 • 15 years ago
They're not cached, but this doesn't look like it's executed that often, so I suspect it's OK.
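For context, the precompiled form coop suggests would look something like this (pattern and names are hypothetical; CPython's re module does keep a small internal cache of recently compiled patterns, so in practice either form is cheap here):

import re

# Compiled once at import time instead of on every call.
PURGE_ACTUAL_RE = re.compile(r'(\d+(?:\.\d+)?GB) of space available')

def parse_purge_actual(stdout):
    m = PURGE_ACTUAL_RE.search(stdout)
    return m.group(1) if m else None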
Comment 23 (Assignee) • 15 years ago
I'd like to include this small patch in the next downtime.
Flags: needs-reconfig?
Comment 24 (Assignee) • 15 years ago
(In reply to comment #23)
> I'd like to include this small patch in the next downtime.
s/downtime/reconfig/
Comment 25 • 15 years ago
Planning to do a reconfig tomorrow morning.
Flags: needs-reconfig? → needs-reconfig+
Comment 26 • 15 years ago
Comment on attachment 494504 [details] [diff] [review]
set properties for diskspace
changeset: 1234:92d255b03c1a
Attachment #494504 - Flags: checked-in+
Updated (Assignee) • 15 years ago
Status: ASSIGNED → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Comment 28 (Reporter) • 15 years ago
If I'm reading this right, we now have some data stuffed into the status DB from which we can build a report. The original point of this bug was to handle out-of-disk better on the slave, so at a bare minimum we should file a follow-up to create the report.
Updated • 12 years ago
Product: mozilla.org → Release Engineering