Closed Bug 1001518 Opened 8 years ago Closed 8 years ago
running out of disk space during linux64_gecko-debug jobs
bld-centos6-hp-019, bld-centos6-hp-009, bld-centos6-hp-012 and bld-centos6-hp-006 have all run out of disk space recently. We may need to either bump the storage or regularly scrub some space on these.

Looking at our slave mgmt wiki: https://wiki.mozilla.org/ReleaseEngineering/Buildduty/Slave_Management#Known_failure_modes

For out-of-disk on AWS machines, it suggests:

To clean them, you can run
mock_mozilla -v -r mozilla-centos6-i386 --scrub=all

See bug 829186 for an example. I am not sure if we can do the same or similar for in-house machines.
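A minimal sketch of the buildduty cleanup logic implied above: check free space on the build partition and, when it drops below some threshold, run the wiki's scrub command. The threshold value and the `/builds` path are assumptions for illustration, not values from the wiki.

```python
import shutil

# The scrub command from the slave-management wiki, as an argv list
# (so it could be handed to subprocess.run without shell quoting).
SCRUB_CMD = ["mock_mozilla", "-v", "-r", "mozilla-centos6-i386", "--scrub=all"]

def needs_scrub(path="/builds", min_free_gb=20):
    """Return True when the partition holding `path` has less than
    `min_free_gb` gigabytes free. Threshold is an illustrative guess."""
    usage = shutil.disk_usage(path)
    return usage.free < min_free_gb * 1024 ** 3
```

Whether the same scrub is safe on the in-house HP machines is exactly the open question in this comment.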
=== short term solution:

deleted the largest chunk of builder dirs from /builds/slave, e.g.:

rm -rf /builds/slave/b2g_fx-team_flame_eng_dep-0000/build/*

[email@example.com slave]$ df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda3             226G  128G   87G  60% /

[firstname.lastname@example.org slave]$ df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda3             226G  121G   94G  57% /

[email@example.com slave]$ df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda3             226G  136G   79G  64% /

[firstname.lastname@example.org slave]$ df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda3             226G  117G   98G  55% /

enabled and rebooted the four machines.

==== long term solution:

- added as a discussion item for next buildduty meeting
- this will most likely continue happening, so we will need to decide on a solution sooner rather than later.

NOTE: leaving this bug open until long term solution is addressed.
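The "delete the largest builder dirs" step above can be sketched as a small helper that ranks the directories under /builds/slave by size, so buildduty can see which rm -rf buys the most space. This is an illustrative sketch, not a script that exists in the tools repo.

```python
import os

def dir_size(path):
    """Total size in bytes of all regular files under `path`."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            fp = os.path.join(root, name)
            if os.path.isfile(fp):
                total += os.path.getsize(fp)
    return total

def largest_builder_dirs(slave_root, top=5):
    """Return (size_bytes, path) pairs for the biggest builder dirs,
    largest first, e.g. largest_builder_dirs("/builds/slave")."""
    entries = [os.path.join(slave_root, d) for d in os.listdir(slave_root)]
    dirs = [p for p in entries if os.path.isdir(p)]
    return sorted(((dir_size(p), p) for p in dirs), reverse=True)[:top]
```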
The other approach is to look at the history for the slave, and figure out which was the first job to run out of space. Then increase the space requirement for that job, assuming that it has either grown gradually over time or some change has stepwise increased the needed space.
eg for bld-centos6-hp-019 I gradually increased the numbuilds argument on https://secure.pub.build.mozilla.org/buildapi/recent/bld-centos6-hp-019?numbuilds=600 until I hit a run of builds with result 5, around the end of April, and found it was b2g_b2g-inbound_linux64_gecko-debug build job that failed first.
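The manual bisection above (grow numbuilds until a run of result-5 builds appears, then take the earliest one) could be sketched as a filter over the buildapi records. The key names (`starttime`, `result`, `buildername`) are an assumption about the shape of the buildapi JSON, not verified against the live endpoint.

```python
def first_failing_build(builds, failure_result=5):
    """Given build records like those from the buildapi 'recent' endpoint,
    return the earliest build whose result code matches `failure_result`
    (5 being the out-of-space failures seen here), or None."""
    failed = [b for b in builds if b.get("result") == failure_result]
    if not failed:
        return None
    return min(failed, key=lambda b: b["starttime"])
```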
Thanks Nick. I'll ask a follow-up question here so it's documented:

1) how do I go about 'increasing the space requirement for that job' once I find the job using the method described in comment 3?

2) if I do specify the space requirement for a build, does that mean a slave with that amount of space or more will be required before a master allocates the job to it? Would that not mean we could potentially have fewer available 'able' slaves at certain times in our pool?
For MercurialBuildFactory and subclasses it's the buildSpace argument, which is usually set as the build_space variable in PLATFORM_VARS (config.py). In mozharness it's purge_minsize. Assorted misc. and release jobs hard-code the value in their own scripts.

Buildbot doesn't know anything about how much free space is available when selecting a slave for a job; instead we try to free the needed space after the job has been started. There's a call to buildfarm/maintenance/purge_builds.py in tools, or the copy in mozharness, to do this.
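A rough sketch of how the two pieces relate, based on the description above: the per-platform build_space value says how much room a job needs, and the purge step tries to reclaim the difference after the job starts. The dict layout loosely mirrors PLATFORM_VARS in buildbot-configs' config.py, but the keys and numbers here are illustrative, not copied from the real file.

```python
# Illustrative PLATFORM_VARS-style table; values in GB.
PLATFORM_VARS = {
    "linux64": {"build_space": 13},        # illustrative baseline
    "linux64-debug": {"build_space": 18},  # the bump proposed in this bug
}

def space_to_free(platform, free_gb):
    """GB a purge_builds.py-style cleanup would need to reclaim before a
    job on `platform` can proceed; 0 when enough is already free."""
    needed = PLATFORM_VARS[platform]["build_space"]
    return max(0, needed - free_gb)
```

This also shows why slave selection can still pick a machine that's too full: the check happens after allocation, as the comment above explains.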
This has occurred again on bld-centos6-hp-019 in bug 803087. This time the first build with result=5 was "b2g_mozilla-inbound_linux64_gecko-debug build". So I will see if I can increase the disk space required for all linux64 gecko debug builds (since previously in comment 3 it was also a linux64 gecko debug build: "b2g_b2g-inbound_linux64_gecko-debug build").
Alternatively - if we jacuzzi up these jobs, this would also solve the disk space problem... Ben, are there any plans in progress to jacuzzi these hp slaves?
Bumping up build_space for linux debug builds to 18GB.
Assignee: nobody → pmoore
Status: NEW → ASSIGNED
Attachment #8419264 - Flags: review?(nthomas)
Longer term solution documented in bug 1007583 (automatic setting of build_space based on historical job usage).
Comment on attachment 8419264 [details] [diff] [review]
bug1001518_buildbot-configs.patch

"b2g_mozilla-inbound_linux64_gecko-debug build" jobs do:

python tools/buildfarm/maintenance/purge_builds.py -s 13

so you need to modify buildbot-configs/mozilla/b2g_config.py
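To make the review comment concrete: the -s 13 means these jobs currently purge down to a 13 GB requirement, and the fix is to raise build_space for the debug gecko platform entry in b2g_config.py to the 18 GB proposed earlier. The surrounding structure and key names below are assumptions in the shape of that file, not an excerpt from it; only the 13 and 18 GB figures come from this bug.

```python
# Hypothetical fragment in the shape of b2g_config.py's platform config.
PLATFORMS = {
    "linux64_gecko": {"build_space": 13},
    "linux64_gecko-debug": {"build_space": 18},  # bumped from 13 per this bug
}
```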
Attachment #8419264 - Flags: review?(nthomas) → review-
Thanks Nick! Good spot. Looks like we've only had linux64 gecko debug builds for B2G since last month: https://github.com/mozilla/build-buildbot-configs/commit/2359768e2b84004dc0d5de2588c61ca239cb9b36 This might explain why it is showing up as an issue now, and didn't before. Hopefully this patch will fix it. I nervously hand it over to you for review. :) Thanks, Pete
(In reply to Pete Moore [:pete][:pmoore] from comment #7) > Alternatively - if we jacuzzi up these jobs, this would also solve the disk > space problem... > > Ben, are there any plans in progress to jacuzzi these hp slaves? Eventually, probably. Nothing specific right now.
Thanks Pete. I had a patch for this but I was trying to verify whether there were other b2g platforms that needed a bump too. Hopefully it's just this one since, as you mentioned, it's new!
Comment on attachment 8419385 [details] [diff] [review]
bug1001518_buildbot-configs_v2.patch

This modifies jobs like b2g_mozilla-inbound_linux64_gecko rather than "b2g_mozilla-inbound_linux64_gecko-debug build". There's a separate 'linux64_gecko-debug' platform you want to change to fix this.

FYI, it's good to provide extra context on this sort of patch, enough to include the platform name. And you can verify the change is having the effect you want with dump_master.py; docs at https://wiki.mozilla.org/ReleaseEngineering:TestingTechniques#builder_list.py_.2F_dump_master.py (although you can just run this for a build master in this case).
Attachment #8419385 - Flags: review?(nthomas) → review-
Ah, my bad! Actually, it was the right change (i.e. for linux64_gecko-debug), but the diff was against the wrong base revision - I seem to have forgotten to update my working copy before creating the patch, and I was patching this revision from April 21st: https://hg.mozilla.org/build/buildbot-configs/file/7e61f9dabf1d/mozilla/b2g_config.py#l240

In any case, you are absolutely right, I should have included more context in the patch, and this would have helped. And of course I should have refreshed my working dir first!

Thanks for the dump_master tip - I will use it now to validate hopefully my third *and final* patch(!!). Third time lucky, as they say. Will attach the new patch once I've run the dump_master test! :)

Apologies for the two failed attempts so far...
Pete
OK, I wasn't able to successfully test the dump_master script today - I set up a dev master, but got no output:

(build1)bash-4.1$ pwd
/builds/buildbot/pmoore/build1
(build1)bash-4.1$ braindump/buildbot-related/builder_list.py master/master.cfg
/builds/buildbot/pmoore/build1/lib/python2.6/site-packages/twisted/mail/smtp.py:10: DeprecationWarning: the MimeWriter module is deprecated; use the email package instead
  import MimeWriter, tempfile, rfc822

At the moment it looks like my master has no builders - this might be the reason: http://dev-master1.srv.releng.scl3.mozilla.com:8444/builders - so I will have to troubleshoot this another time.

In any case, to avoid delay, here is the patch, which I am relatively sure is correct, even though I couldn't yet test it using dump_master.
Attachment #8420241 - Flags: review?(nthomas) → review+
Committed on default: https://hg.mozilla.org/build/buildbot-configs/rev/dd61727e8e79
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Let's leave this open until the fix is merged to production and deployed.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Merged into production and live.
Ok, closing now.
Status: REOPENED → RESOLVED
Closed: 8 years ago → 8 years ago
Resolution: --- → FIXED
Summary: bld-centos6-hp-* slaves are running out of disk space → running out of disk space during linux64_gecko-debug jobs
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard