Closed
Bug 1001518
Opened 11 years ago
Closed 11 years ago
running out of disk space during linux64_gecko-debug jobs
Categories
(Infrastructure & Operations Graveyard :: CIDuty, task)
Infrastructure & Operations Graveyard
CIDuty
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: jlund, Assigned: pmoore)
References
Details
Attachments
(1 file, 2 obsolete files)
1.30 KB, patch
nthomas: review+
pmoore: checked-in+
bld-centos6-hp-019
bld-centos6-hp-009
bld-centos6-hp-012
bld-centos6-hp-006
have all run out of disk space recently. We may need to either bump the storage on these machines or scrub space on them more regularly.
Looking at our slave mgmt wiki: https://wiki.mozilla.org/ReleaseEngineering/Buildduty/Slave_Management#Known_failure_modes
For out-of-disk on AWS machines, it suggests: "To clean them, you can run mock_mozilla -v -r mozilla-centos6-i386 --scrub=all". See bug 829186 for an example.
I am not sure if we can do the same or similar for in-house machines.
Reporter
Comment 1•11 years ago
=== short term solution:
deleted the contents of the largest builder dirs under /builds/slave,
e.g.: rm -rf /builds/slave/b2g_fx-team_flame_eng_dep-0000/build/*
[cltbld@bld-centos6-hp-019.build.scl1.mozilla.com slave]$ df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sda3 226G 128G 87G 60% /
[cltbld@bld-centos6-hp-009.build.scl1.mozilla.com slave]$ df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sda3 226G 121G 94G 57% /
[cltbld@bld-centos6-hp-012.build.scl1.mozilla.com slave]$ df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sda3 226G 136G 79G 64% /
[cltbld@bld-centos6-hp-006.build.scl1.mozilla.com slave]$ df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sda3 226G 117G 98G 55% /
enabled and rebooted the four machines.
==== long term solution:
- added as a discussion item for next buildduty meeting
- this will most likely continue happening so we will need to decide on a solution sooner rather than later.
NOTE: leaving this bug open until long term solution is addressed.
Comment 2•11 years ago
The other approach is to look at the history for the slave, and figure out which was the first job to run out of space. Then increase the space requirement for that job, assuming that it has either grown gradually over time or some change has stepwise increased the needed space.
Comment 3•11 years ago
e.g. for bld-centos6-hp-019 I gradually increased the numbuilds argument on
https://secure.pub.build.mozilla.org/buildapi/recent/bld-centos6-hp-019?numbuilds=600
until I hit a run of builds with result 5, around the end of April, and found that the
b2g_b2g-inbound_linux64_gecko-debug build
job was the first to fail.
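For reference, here's a rough sketch of how that lookup could be scripted instead of bumping numbuilds by hand. It assumes the buildapi endpoint can also return JSON via format=json, that each entry carries 'buildername' and 'result' fields, and that results come back newest-first; all of those are assumptions worth checking against the live API.

# Hypothetical helper: find the earliest build with result 5 for a slave via
# buildapi. The format=json parameter, the field names and the newest-first
# ordering are assumptions, not verified against the API.
import json
import urllib2

slave = "bld-centos6-hp-019"
url = ("https://secure.pub.build.mozilla.org/buildapi/recent/%s"
       "?numbuilds=600&format=json" % slave)

builds = json.load(urllib2.urlopen(url))
failed = [b for b in builds if b.get("result") == 5]
if failed:
    # assuming newest-first ordering, the last match is the earliest failure
    print(failed[-1].get("buildername"))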
Reporter
Comment 4•11 years ago
Thanks, Nick.
I'll ask a follow-up question here so it's documented:
1) How do I go about increasing the space requirement for a job, once I've found it using the method described in comment 3?
2) If I do specify a space requirement for a build, does that mean a slave needs at least that much free space before a master will allocate the job to it? Wouldn't that mean we could have fewer 'able' slaves available in our pool at certain times?
Comment 5•11 years ago
For MercurialBuildFactory and subclasses it's the buildSpace argument, which is usually set as the build_space variable in PLATFORM_VARS (config.py). In mozharness it's purge_minsize. Assorted misc. and release jobs hard-code the value in their own scripts.
Buildbot doesn't know anything about how much free space is available when selecting a slave for a job; instead we try to free the needed space after the job has started. There's a call to buildfarm/maintenance/purge_builds.py in tools, or the copy in mozharness, to do this.
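To make that concrete, here is a rough sketch of the shape of those knobs; the platform name and the surrounding structure are placeholders from memory, not copied from the real config files.

# Illustrative only: approximate shape of the settings described above.
# buildbot-configs (config.py and friends): build_space is in GB and is what
# gets requested from purge_builds.py before the build runs.
PLATFORM_VARS = {
    'some-linux-platform': {      # placeholder platform name
        'build_space': 14,
    },
}

# mozharness configs: the equivalent knob for jobs that use it (value in GB,
# if I remember the units correctly).
config = {
    'purge_minsize': 14,
}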
Assignee
Comment 6•11 years ago
This has occurred again on bld-centos6-hp-019 in bug 803087.
This time the first build with result=5 was "b2g_mozilla-inbound_linux64_gecko-debug build". So I will see if I can increase the disk space required for all linux64 gecko debug builds (since previously in comment 3 it was also a linux64 gecko debug build: "b2g_b2g-inbound_linux64_gecko-debug build").
Assignee
Comment 7•11 years ago
Alternatively - if we jacuzzi up these jobs, this would also solve the disk space problem...
Ben, are there any plans in progress to jacuzzi these hp slaves?
Flags: needinfo?(bhearsum)
Assignee
Comment 8•11 years ago
Bumping up build_space for linux debug builds to 18GB.
Assignee
Comment 9•11 years ago
Longer term solution documented in bug 1007583 (automatic setting of build_space based on historical job usage).
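Roughly, the idea there is to derive build_space from what jobs have actually used, rather than hand-tuning it per bug. A minimal sketch of that calculation, with a made-up data source and an arbitrary 20% headroom, might look like:

# Hypothetical sketch only: 'usage_gb_per_run' stands in for whatever
# per-job disk usage data the real implementation would collect.
def suggest_build_space(usage_gb_per_run, headroom=1.2):
    if not usage_gb_per_run:
        return None
    return int(round(max(usage_gb_per_run) * headroom))

print(suggest_build_space([11.8, 12.4, 14.9, 15.2]))  # -> 18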
Comment 10•11 years ago
Comment on attachment 8419264 [details] [diff] [review]
bug1001518_buildbot-configs.patch
"b2g_mozilla-inbound_linux64_gecko-debug build" jobs do
python tools/buildfarm/maintenance/purge_builds.py -s 13
so you need to modify buildbot-configs/mozilla/b2g_config.py
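For context, the change being asked for is essentially a one-line bump in the B2G platform dict, something like the sketch below (structure approximated from memory, not the literal patch that eventually landed):

# mozilla/b2g_config.py -- illustrative structure only
PLATFORM_VARS = {
    # ...
    'linux64_gecko-debug': {
        # ...
        'build_space': 18,   # these jobs currently run purge_builds.py -s 13
    },
}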
Attachment #8419264 - Flags: review?(nthomas) → review-
Assignee
Comment 11•11 years ago
Thanks Nick! Good spot.
Looks like we've only had linux64 gecko debug builds for B2G since last month:
https://github.com/mozilla/build-buildbot-configs/commit/2359768e2b84004dc0d5de2588c61ca239cb9b36
This might explain why it is showing up as an issue now, and didn't before.
Hopefully this patch will fix it. I nervously hand it over to you for review. :)
Thanks,
Pete
Attachment #8419264 - Attachment is obsolete: true
Attachment #8419385 - Flags: review?(nthomas)
Comment 12•11 years ago
(In reply to Pete Moore [:pete][:pmoore] from comment #7)
> Alternatively - if we jacuzzi up these jobs, this would also solve the disk
> space problem...
>
> Ben, are there any plans in progress to jacuzzi these hp slaves?
Eventually, probably. Nothing specific right now.
Flags: needinfo?(bhearsum)
Reporter
Comment 13•11 years ago
Thanks, Pete. I had a patch for this as well, but I was trying to verify whether there were other b2g platforms that needed a bump too.
Hopefully it's just this one since, as you mentioned, it's new!
Comment 14•11 years ago
Comment on attachment 8419385 [details] [diff] [review]
bug1001518_buildbot-configs_v2.patch
This modifies jobs like
b2g_mozilla-inbound_linux64_gecko
rather than
b2g_mozilla-inbound_linux64_gecko-debug build
There's a separate 'linux64_gecko-debug' platform you want to change to fix this.
FYI, it's good to generate this sort of patch with extra context lines, enough that the platform name shows up in the diff. And you can verify the change is having the effect you want with dump_master.py; docs at
https://wiki.mozilla.org/ReleaseEngineering:TestingTechniques#builder_list.py_.2F_dump_master.py
(although you can just run this for a build master in this case).
Attachment #8419385 - Flags: review?(nthomas) → review-
Assignee
Comment 15•11 years ago
Ah my bad! Actually, it was the right change (i.e. for linux64_gecko-debug), but the diff was against the wrong base revision - I seem to have forgotten to update my working version before creating the patch, and I was patching this revision from April 21st:
https://hg.mozilla.org/build/buildbot-configs/file/7e61f9dabf1d/mozilla/b2g_config.py#l240
In any case, you are absolutely right: I should have included more context in the patch, and that would have helped. And of course I should have refreshed my working dir first! Thanks for the dump master tip too - I will use it now to validate (hopefully) my third *and final* patch(!!). Third time lucky, as they say.
Will attach the new patch once I've run the dump master test! :)
Apologies for the two failed attempts so far...
Pete
Assignee | ||
Comment 16•11 years ago
OK, I wasn't able to successfully test the dump master script today - I set up a dev master, but got no output:
(build1)bash-4.1$ pwd
/builds/buildbot/pmoore/build1
(build1)bash-4.1$ braindump/buildbot-related/builder_list.py master/master.cfg
/builds/buildbot/pmoore/build1/lib/python2.6/site-packages/twisted/mail/smtp.py:10: DeprecationWarning: the MimeWriter module is deprecated; use the email package instead
import MimeWriter, tempfile, rfc822
At the moment it looks like my master has no builders, which might be the reason: http://dev-master1.srv.releng.scl3.mozilla.com:8444/builders - so I will have to troubleshoot this another time.
In any case, to avoid further delay, here is the patch, which I am relatively sure is correct even though I couldn't test it with dump master yet.
Attachment #8419385 - Attachment is obsolete: true
Attachment #8420241 - Flags: review?(nthomas)
Updated•11 years ago
Attachment #8420241 - Flags: review?(nthomas) → review+
Assignee
Comment 17•11 years ago
Committed on default: https://hg.mozilla.org/build/buildbot-configs/rev/dd61727e8e79
Assignee
Updated•11 years ago
Attachment #8420241 - Flags: checked-in+
Assignee
Updated•11 years ago
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Comment 18•11 years ago
Let's leave this open until the fix is merged to production and deployed.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 19•11 years ago
Merged into production and live.
Comment 20•11 years ago
Ok, closing now.
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Resolution: --- → FIXED
Summary: bld-centos6-hp-* slaves are running out of disk space → running out of disk space during linux64_gecko-debug jobs
Updated•7 years ago
Product: Release Engineering → Infrastructure & Operations
Updated•5 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard