running out of disk space during linux64_gecko-debug jobs

RESOLVED FIXED

5 years ago
Last year

People

(Reporter: jlund, Assigned: pmoore)

(1 attachment, 2 obsolete attachments)

bld-centos6-hp-019
bld-centos6-hp-009
bld-centos6-hp-012
bld-centos6-hp-006

have all run out of disk space recently. We may need to either bump the storage or scrub space on these machines more regularly.

looking at our slave mgmt wiki: https://wiki.mozilla.org/ReleaseEngineering/Buildduty/Slave_Management#Known_failure_modes

   for out-of-disk failures on AWS machines, it suggests: "To clean them, you can run mock_mozilla -v -r mozilla-centos6-i386 --scrub=all". See bug 829186 for an example.

I am not sure if we can do the same or similar for in-house machines.
=== short term solution:

deleted the largest chunk of builder dirs from /builds/slave.

eg: rm -rf /builds/slave/b2g_fx-team_flame_eng_dep-0000/build/*

[cltbld@bld-centos6-hp-019.build.scl1.mozilla.com slave]$ df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda3             226G  128G   87G  60% /

[cltbld@bld-centos6-hp-009.build.scl1.mozilla.com slave]$ df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda3             226G  121G   94G  57% /

[cltbld@bld-centos6-hp-012.build.scl1.mozilla.com slave]$ df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda3             226G  136G   79G  64% /

[cltbld@bld-centos6-hp-006.build.scl1.mozilla.com slave]$ df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda3             226G  117G   98G  55% /

enabled and rebooted the four machines.
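
For picking which builder dirs to delete, something like the following could rank them by size. This is only a rough sketch: the helper names are made up for illustration, and the only thing taken from the steps above is that builder dirs live directly under /builds/slave.

```python
import os


def dir_size(path):
    """Total size in bytes of the regular files under path (symlinks skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if os.path.islink(full):
                continue
            try:
                total += os.path.getsize(full)
            except OSError:
                # File vanished or is unreadable; ignore it for the estimate.
                pass
    return total


def largest_builder_dirs(base, top=10):
    """Return up to `top` (size_in_bytes, dirname) pairs, largest first."""
    sizes = []
    for entry in os.listdir(base):
        full = os.path.join(base, entry)
        if os.path.isdir(full) and not os.path.islink(full):
            sizes.append((dir_size(full), entry))
    sizes.sort(reverse=True)
    return sizes[:top]
```

Calling largest_builder_dirs("/builds/slave") on a slave would list the candidates; the rm -rf itself stays a manual, deliberate step.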

=== long term solution:

- added as a discussion item for next buildduty meeting
- this will most likely continue happening so we will need to decide on a solution sooner rather than later.



NOTE: leaving this bug open until long term solution is addressed.
The other approach is to look at the history for the slave, and figure out which was the first job to run out of space. Then increase the space requirement for that job, assuming that it has either grown gradually over time or some change has stepwise increased the needed space.
eg for bld-centos6-hp-019 I gradually increased the numbuilds argument on 
 https://secure.pub.build.mozilla.org/buildapi/recent/bld-centos6-hp-019?numbuilds=600
until I hit a run of builds with result 5, around the end of April, and found it was 
 b2g_b2g-inbound_linux64_gecko-debug build
job that failed first.
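The same search can be sketched in code, assuming the slave history has already been fetched and parsed into a list of records. The field names used here ('starttime', 'buildername', 'result') are assumptions for illustration and not a verified description of what buildapi returns; min_run is likewise a made-up knob for how many consecutive result-5 builds count as "a run".

```python
def first_failure_run(builds, bad_result=5, min_run=3):
    """Return the earliest build that starts a run of min_run consecutive
    builds with result == bad_result, or None if there is no such run.

    builds: list of dicts with 'starttime', 'buildername' and 'result'
    keys (hypothetical field names).
    """
    ordered = sorted(builds, key=lambda b: b["starttime"])
    for i, build in enumerate(ordered):
        window = ordered[i:i + min_run]
        if len(window) == min_run and all(
            b["result"] == bad_result for b in window
        ):
            return build
    return None
```

Requiring a run rather than a single result 5 avoids blaming a one-off failure that happened before the disk actually filled up.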
Thanks nick.

I'll ask a follow-up question here so it's documented:

1) how do I go about 'increasing the space requirement for that job' once I find the job using the method described in comment 3?
2) if I do specify the space requirement for a build, does that mean a slave must have that much free space or more before a master will allocate a job to it? Would that not mean we could potentially have fewer 'able' slaves available in our pool at certain times?
For MercurialBuildFactory and subclasses it's the buildSpace argument, which is usually set as the build_space variable in PLATFORM_VARS (config.py). In mozharness it's purge_minsize. Assorted misc. and release jobs hard code the value in their own scripts.

Buildbot doesn't know anything about how much free space is available when selecting a slave for a job; instead we try to free the needed space after the job has started. There's a call to buildfarm/maintenance/purge_builds.py in tools, or the copy in mh, to do this.
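To illustrate the "free the space after the job has started" idea, here is a minimal sketch. It is not the actual purge_builds.py logic: the oldest-first deletion order, the function names and the skip parameter are all assumptions made for the example.

```python
import os
import shutil


def free_gb(path):
    """Free space on the filesystem holding path, in GB."""
    return shutil.disk_usage(path).free / 1024 ** 3


def purge_until_free(base, needed_gb, skip=(), free_fn=free_gb,
                     remove_fn=shutil.rmtree):
    """Delete builder dirs under base, oldest mtime first, until at least
    needed_gb is free. Returns the list of paths that were removed.

    skip: directory names to leave alone (e.g. the job that is running).
    free_fn / remove_fn are injectable so the logic can be dry-run.
    """
    candidates = [
        os.path.join(base, name)
        for name in os.listdir(base)
        if name not in skip and os.path.isdir(os.path.join(base, name))
    ]
    candidates.sort(key=os.path.getmtime)  # oldest first
    removed = []
    for path in candidates:
        if free_fn(base) >= needed_gb:
            break
        remove_fn(path)
        removed.append(path)
    return removed
```

The point relevant to this bug: if needed_gb is lower than what the job really consumes, the purge stops too early and the job can still run out of disk partway through.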
This has occurred again on bld-centos6-hp-019 in bug 803087.

This time the first build with result=5 was "b2g_mozilla-inbound_linux64_gecko-debug build". So I will see if I can increase the disk space required for all linux64 gecko debug builds (since previously in comment 3 it was also a linux64 gecko debug build: "b2g_b2g-inbound_linux64_gecko-debug build").
Alternatively - if we jacuzzi up these jobs, this would also solve the disk space problem...

Ben, are there any plans in progress to jacuzzi these hp slaves?
Flags: needinfo?(bhearsum)
Bumping up build_space for linux debug builds to 18GB.
Assignee: nobody → pmoore
Status: NEW → ASSIGNED
Attachment #8419264 - Flags: review?(nthomas)
Longer term solution documented in bug 1007583 (automatic setting of build_space based on historical job usage).
Comment on attachment 8419264 [details] [diff] [review]
bug1001518_buildbot-configs.patch

"b2g_mozilla-inbound_linux64_gecko-debug build" jobs do
  python tools/buildfarm/maintenance/purge_builds.py -s 13 
so you need to modify buildbot-configs/mozilla/b2g_config.py
Attachment #8419264 - Flags: review?(nthomas) → review-
Thanks Nick! Good spot.

Looks like we've only had linux64 gecko debug builds for B2G since last month:
https://github.com/mozilla/build-buildbot-configs/commit/2359768e2b84004dc0d5de2588c61ca239cb9b36

This might explain why it is showing up as an issue now, and didn't before.

Hopefully this patch will fix it. I nervously hand it over to you for review. :)

Thanks,
Pete
Attachment #8419264 - Attachment is obsolete: true
Attachment #8419385 - Flags: review?(nthomas)
(In reply to Pete Moore [:pete][:pmoore] from comment #7)
> Alternatively - if we jacuzzi up these jobs, this would also solve the disk
> space problem...
> 
> Ben, are there any plans in progress to jacuzzi these hp slaves?

Eventually, probably. Nothing specific right now.
Flags: needinfo?(bhearsum)
Thanks pete. I had a patch for this but I was trying to verify if there were other b2g platforms that needed a bump too.

Hopefully it's just this one since, as you mentioned, it's new!
Comment on attachment 8419385 [details] [diff] [review]
bug1001518_buildbot-configs_v2.patch

This modifies jobs like 
  b2g_mozilla-inbound_linux64_gecko
rather than 
  b2g_mozilla-inbound_linux64_gecko-debug build

There's a separate 'linux64_gecko-debug' platform you want to change to fix this.

FYI, it's good to provide extra context on this sort of patch, which goes far enough back to include the platform name. And you can verify the change is having the effect you want with dump_master.py, docs at
 https://wiki.mozilla.org/ReleaseEngineering:TestingTechniques#builder_list.py_.2F_dump_master.py
(although you can just run this for a build master in this case).
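As a quick sanity check before running dump_master.py, the builder-to-platform mapping can be sketched as below. This assumes builder names follow the pattern seen in this bug ("b2g_<branch>_<platform> build"), which is inferred from the two names quoted here rather than from the configs; the function name is hypothetical.

```python
def platform_for_builder(builder_name, platforms):
    """Return the platform key whose name the builder ends with, or None.

    Assumes names look like 'b2g_<branch>_<platform> build' (an inference
    from the builder names in this bug, not a documented rule).
    """
    name = builder_name
    if name.endswith(" build"):
        name = name[:-len(" build")]
    matches = [p for p in platforms if name.endswith("_" + p)]
    # Prefer the longest match in case one platform name is a suffix of another.
    return max(matches, key=len) if matches else None
```

With platforms "linux64_gecko" and "linux64_gecko-debug", the debug builder resolves only to the "-debug" key, so a change made under "linux64_gecko" never reaches it.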
Attachment #8419385 - Flags: review?(nthomas) → review-
Ah my bad! Actually, it was the right change (i.e. for linux64_gecko-debug), but the diff was against the wrong base revision - I seem to have forgotten to update my working version before creating the patch, and I was patching this revision from April 21st:

https://hg.mozilla.org/build/buildbot-configs/file/7e61f9dabf1d/mozilla/b2g_config.py#l240

In any case, you are absolutely right, I should have included more context in the patch, and this would have helped. And of course I should have refreshed my working dir first! And thanks for the dump master tip - I will use it now to validate hopefully my third *and final* patch(!!). Third time lucky, like they say.

Will attach the new patch once I've run the dump master test! :)

Apologies for the two failed attempts so far...

Pete
OK wasn't able to successfully test dump master script today - I set up a dev master, but I got no output:

(build1)bash-4.1$ pwd
/builds/buildbot/pmoore/build1
(build1)bash-4.1$ braindump/buildbot-related/builder_list.py master/master.cfg
/builds/buildbot/pmoore/build1/lib/python2.6/site-packages/twisted/mail/smtp.py:10: DeprecationWarning: the MimeWriter module is deprecated; use the email package instead
  import MimeWriter, tempfile, rfc822

At the moment it looks like my master has no builders - this might be the reason: http://dev-master1.srv.releng.scl3.mozilla.com:8444/builders so I will have to troubleshoot this another time.

In any case, to avoid further delay, here is the patch, which I am relatively sure is correct even though I couldn't yet test it using dump master.
Attachment #8419385 - Attachment is obsolete: true
Attachment #8420241 - Flags: review?(nthomas)
Attachment #8420241 - Flags: review?(nthomas) → review+
Attachment #8420241 - Flags: checked-in+
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Let's leave this open until the fix is merged to production and deployed.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Merged into production and live.
Ok, closing now.
Status: REOPENED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Summary: bld-centos6-hp-* slaves are running out of disk space → running out of disk space during linux64_gecko-debug jobs
Product: Release Engineering → Infrastructure & Operations