Closed Bug 482205 Opened 13 years ago Closed 11 years ago

handle out-of-disk better on build slaves (was "unittest running out of space")

Categories

(Release Engineering :: General, defect, P2)

x86
Windows Server 2003
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: nthomas, Assigned: jhford)

References

Details

(Whiteboard: [buildslaves][automation])

Attachments

(5 files, 1 obsolete file)

This slave ran out of space doing a mozilla-central unit test build
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1236546136.1236550760.10190.gz
buildSpace is hardwired to 3GB in config.py, while Windows Explorer says there were 1.83GB of files in mozilla-central-win32-unit/, taking up 3.11GB on disk. Interestingly, the same slave's mozilla-1.9.1 unit dir was only 1.08GB/2.33GB.

TODO (1) - check if 3.11GB of actual disk is typical for a m-c unit build, possibly bump up this value in config.py

The unit build died in the compile step but tried to run the unit tests anyway; I think we should give up on the build in that case. Filed as bug 482169, but I'll dupe it over here.

TODO (2) - set haltOnFailure for compile step of unit test builds
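For reference, a minimal sketch of what TODO (2) means in buildbot terms, assuming a plain ShellCommand compile step; the real unittest factory lives in buildbotcustom/process/factory.py, and the step name and make command below are placeholders rather than the actual code:

# Sketch only: haltOnFailure is the real buildbot parameter; everything
# else here is a stand-in for the buildbotcustom unittest factory.
from buildbot.process.factory import BuildFactory
from buildbot.steps.shell import ShellCommand

f = BuildFactory()
f.addStep(ShellCommand(
    name='compile',
    command=['make', '-f', 'client.mk', 'build'],
    description=['compiling'],
    haltOnFailure=True,  # stop here instead of running tests against a broken build
))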

The next build on the slave, a mozilla-central build, failed to clean up space properly:
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1236551524.1236551587.12400.gz
It clobbers build/tools but fails to reclone it; odd, but perhaps hg needs more space for temporary files than the final working dir takes. We don't stop on this error, and unsurprisingly the clobber check and purge_builds.py can't run. I think we should definitely halt in the latter case, but we really need to avoid getting into this situation in the first place, maybe with a disk space check before we delete build/tools. The slave got cleaned up by hand.

TODO (3) - haltOnFailure if we fail to pull build/tools/ or to run purge_builds.py, or robustify our current cleanup.

TODO (4) - would be helpful for debugging if purge_builds.py always reported free space before exiting
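For TODO (4) (and the pre-delete space check floated above for TODO (3)), something along these lines would do; a minimal POSIX-only sketch, assuming os.statvfs (the win32 slaves would need ctypes and GetDiskFreeSpaceExW instead), with a made-up function name rather than anything taken from purge_builds.py:

import os
import sys

def free_space_gb(path):
    # POSIX only; on the win32 slaves this would need
    # ctypes.windll.kernel32.GetDiskFreeSpaceExW instead.
    st = os.statvfs(path)
    return (st.f_bavail * st.f_frsize) / (1024.0 ** 3)

if __name__ == '__main__':
    path = sys.argv[1] if len(sys.argv) > 1 else '.'
    # e.g. called at the end of purge_builds.py, and again before
    # deciding whether it is safe to delete and reclone build/tools
    print("%.2f GB free on %s" % (free_space_gb(path), path))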
Duplicate of this bug: 482169
FWIW: I bumped down the size requirement for unittest builds after surveying them a couple of weeks ago. As far as I can remember there weren't any unittest builds over 3GB "size on disk" at the time.
These two linux unittest builds also ran out of space today:
 Tue Mar 10 16:19:54 (Build #1989)
 Tue Mar 10 19:31:02 (Build #1995)
Both in the compile phase. We need to revisit required space for unit test builds, or something is busted in the purge script.
r=bustage for land and reconfig.
Attachment #366761 - Flags: checked-in+
Summary: moz2-win32-slave11 ran out of disk space → unittest running out of disk space
Attachment #366760 - Flags: review?(catlee) → review+
Attachment #366760 - Flags: checked-in+
(In reply to comment #5)
> Created an attachment (id=366761) [details]
> TODO1 (sort of): Bump from 3GB to 4GB required space

Followed up with the matching change to mozilla2-staging/master-main.cfg
Adding symbols has boosted required size quite a bit, and 4G is not enough:

4.7G    mozilla-central-linux-unittest
3.2G    mozilla-1.9.1-linux-unittest   (have also seen 3.7G)
3.8G    tracemonkey-linux-unittest

2.2G    mozilla-central-macosx-unittest
1.9G    mozilla-1.9.1-macosx-unittest
1.9G    tracemonkey-macosx-unittest

3.6G    mozilla-central-win32-unittest
3.9G    mozilla-1.9.1-win32-unittest
3.5G    tracemonkey-win32-unittest

Win32 is size on disk from folder properties. Setting a blanket 5GB is a bit of a sledgehammer approach, but we're not running short of space on Mac (typically 30+GB free).
Assignee: nobody → nthomas
Status: NEW → ASSIGNED
Attachment #367161 - Flags: review?(bhearsum)
I think this is the right buildbot <blah>On<blah> for this job.
Attachment #367162 - Flags: review?(bhearsum)
Comment on attachment 367161 [details] [diff] [review]
TODO1 (properly): 5GB free space required

Yeah, this seems fine. A little tight on Linux, but I think the size varies a lot less there.
Attachment #367161 - Flags: review?(bhearsum) → review+
Attachment #367162 - Flags: review?(bhearsum) → review+
Comment on attachment 367162 [details] [diff] [review]
TODO2: haltOnFailure if unit compile fails

Yep, that's the right parameter. Minor nit: use True instead of 1 for consistency with the rest of the file. r=bhearsum with that change.
Comment on attachment 367161 [details] [diff] [review]
TODO1 (properly): 5GB free space required

changeset:   1008:58d74c2fe4c8
Attachment #367161 - Flags: checked-in+
Comment on attachment 367162 [details] [diff] [review]
TODO2: haltOnFailure if unit compile fails

changeset:   216:cc9598743763
Attachment #367162 - Flags: checked-in+
(In reply to comment #0)
> TODO (3) - haltOnFailure if we fail to pull build/tools/ or purge_builds.py, or
> robustify our current cleaning up.

What do people think about this? It's to do with how we recover from a full-disk situation, where we don't have enough space to remove build/tools/ and reclone it. Obviously we shouldn't end up here, but we will occasionally, even if only through repos growing, and we should decide whether we want to handle it manually or automatically.

Currently
 http://hg.mozilla.org/build/buildbotcustom/file/tip/process/factory.py#l147
we're not setting any "flags" on removing and recloning build/tools; the clobberer step has flunkOnFailure=False, so it won't turn the build red on failure; and the clean-up-old-builds step has that plus warnOnFailure=True, so it'll go orange (right?). I'd be tempted to make the whole build fail if it can't get enough space or can't run the script, since we know it's going to blow up later.
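For concreteness, a hedged sketch of the flag combination being proposed for the cleanup step; the real steps around factory.py line 147 use their own classes and commands, so only the flag names below are the point and the script invocation is illustrative:

# Placeholder step: haltOnFailure/flunkOnFailure are what matters here;
# the command and arguments are not the real purge_builds invocation.
from buildbot.process.factory import BuildFactory
from buildbot.steps.shell import ShellCommand

f = BuildFactory()
f.addStep(ShellCommand(
    name='clean_old_builds',
    command=['python', 'purge_builds.py', '-s', '5', '..'],
    description=['cleaning', 'old', 'builds'],
    haltOnFailure=True,    # no point carrying on if we could not free space
    flunkOnFailure=True,   # and mark the build red rather than orange
))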

Other notes/ideas
* the hg clone operation is returning exit code 0 when it fails
* 'hg pull && hg up -C' will fail just as badly if the disk is completely full, but may be better in low disk space situations
* we could move the "delete old package" step up to guarantee a bit of free space, but that won't work if we're starting from scratch (when another build clobbered the dir)
* add an always-run step that checks free space at the end of the build, mails when below a threshold, and disconnects the slave? (rough sketch below)
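A rough sketch of that always-run check, assuming a slave-side helper: the threshold, addresses, and local SMTP server are all made up, and a real version would more likely be a custom buildbot step on the master:

import os
import smtplib
from email.mime.text import MIMEText

THRESHOLD_GB = 5                      # made-up threshold
MAILTO = 'release@example.com'        # placeholder address

def check_free_space(path='.'):
    st = os.statvfs(path)             # POSIX only, as noted earlier
    free_gb = (st.f_bavail * st.f_frsize) / (1024.0 ** 3)
    if free_gb < THRESHOLD_GB:
        msg = MIMEText('Only %.2f GB free in %s on this slave' % (free_gb, path))
        msg['Subject'] = 'build slave low on disk'
        msg['From'] = 'buildslave@example.com'
        msg['To'] = MAILTO
        server = smtplib.SMTP('localhost')   # assumes a local MTA
        server.sendmail(msg['From'], [MAILTO], msg.as_string())
        server.quit()
    return free_gb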
(In reply to comment #13)
> What do people think about this ? It's do with how we recover from a full disk
> situation, where we don't have enough space to remove build/tools/ and reclone
> it.  Obviously we shouldn't end up here but will occasionally, even if only
> through repo's growing, and we should decide if we want to handle it manually
> or automatically.

I think haltOnFailure && flunkOnFailure is probably appropriate here. You could make an argument for warnOnFailure instead of flunk, but halting is the right thing to do IMHO, as you note below.


> * 'hg pull && hg up -C' will fail just as badly if the disk is completely full,

We won't hit this if we bail on clobberer problems, right?

> * add an always run step that checks free space at end of build, mails when
> below a threshold and disconnects the slave ?

I'm not sure disconnecting is a useful thing to do because it should get some more free space when the next build starts. Mailing is a fantastic idea for sure, though. As long as we pay attention to it more than Nagios mail.
Not working on this right now, but still worth handling out-of-disk better.
Assignee: nthomas → nobody
Status: ASSIGNED → NEW
Component: Release Engineering → Release Engineering: Future
Priority: -- → P3
Summary: unittest running out of disk space → handle out-of-disk better on build slaves (was "unittest running out of space")
Mass move of bugs from Release Engineering:Future -> Release Engineering. See
http://coop.deadsquid.com/2010/02/kiss-the-future-goodbye/ for more details.
Component: Release Engineering: Future → Release Engineering
Priority: P3 → P5
Whiteboard: [buildslaves][automation]
Assignee: nobody → jhford
My plan for this:
1. Parse the output of purge_builds to get free space information into build properties (sketched below).
2. Write a daemon that monitors the amount of free space on disk during the build and writes that info to a file. The lowest amount of disk space seen during the build would be set in a property.
3. Create a webservice that looks at the DB to return the lowest amount of free space, and use that value plus an extra 2GB for safety as the amount of space purge_builds should require, instead of using a static value.
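A minimal sketch of step 1: the property names purge_actual/purge_target match the test output shown further down in this bug, but the sample line and regexps are assumptions, since the real output format of purge_builds.py and the landed patch aren't quoted here:

import re

# Assumed output line; the real purge_builds.py wording may differ.
SAMPLE_OUTPUT = "Purged 3 dirs, 20.30GB free, 10GB requested"

# Compiled once at import time so they are reused across builds.
ACTUAL_RE = re.compile(r'(\d+(?:\.\d+)?GB) free')
TARGET_RE = re.compile(r'(\d+(?:\.\d+)?GB) requested')

def extract_purge_properties(stdout):
    props = {}
    actual = ACTUAL_RE.search(stdout)
    if actual:
        props['purge_actual'] = actual.group(1)
    target = TARGET_RE.search(stdout)
    if target:
        props['purge_target'] = target.group(1)
    return props

props = extract_purge_properties(SAMPLE_OUTPUT)
print(props['purge_actual'])   # 20.30GB
print(props['purge_target'])   # 10GB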
Going to be working on these bugs
Status: NEW → ASSIGNED
Priority: P5 → P2
Tested that this works in the success case

purge_actual: '20.30GB'
purge_target: '10GB'

I am currently creating a 20GB file on the disk to test the failure case.
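A quick way to chew up that space for the failure test; a sketch only, with the file name made up, writing real zero-filled chunks so the blocks are actually allocated rather than sparse:

# Fill the disk with a real (non-sparse) file to provoke the low-space path.
CHUNK = b'\0' * (1024 * 1024)   # 1 MB of zeros
SIZE_GB = 20                    # matches the test described above

with open('spacehog.bin', 'wb') as f:
    for _ in range(SIZE_GB * 1024):
        f.write(CHUNK)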
This patch adds information to the status DB in the form of properties that show how much disk space is available vs. what was requested.
Attachment #471198 - Attachment is obsolete: true
Attachment #494504 - Flags: review?(coop)
Comment on attachment 494504 [details] [diff] [review]
set properties for diskspace

I don't know anything about Python regexp performance, but does it make sense to compile the actual regexps beforehand and re-use them? I reckon they're cached, which is probably enough for us here.
Attachment #494504 - Flags: review?(coop) → review+
They're not cached, but this doesn't look like it's executed that often, so I suspect it's OK.
I'd like to include this small patch in the next downtime.
Flags: needs-reconfig?
(In reply to comment #23)
> I'd like to include this small patch in the next downtime.

s/downtime/reconfig/
Planning to do a reconfig tomorrow morning.
Flags: needs-reconfig? → needs-reconfig+
Comment on attachment 494504 [details] [diff] [review]
set properties for diskspace

changeset:   1234:92d255b03c1a
Attachment #494504 - Flags: checked-in+
Masters have been updated.
Flags: needs-reconfig+
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
If I'm reading this right, we now have some data stuffed into the status DB, from which we can build a report. The original point of this bug was to handle out-of-disk better on the slave, so at a bare minimum we should file a follow-up to create the report.
Product: mozilla.org → Release Engineering