Closed Bug 1001518 Opened 11 years ago Closed 11 years ago

running out of disk space during linux64_gecko-debug jobs

Tracking

(Not tracked)

Status:

RESOLVED FIXED

People

(Reporter: jlund, Assigned: pmoore)

Attachments

(1 file, 2 obsolete files)

bug1001518_buildbot-configs.patch (obsolete), 11 years ago, Pete Moore [:pmoore][:pete], 496 bytes, patch, nthomas: review-
bug1001518_buildbot-configs_v2.patch (obsolete), 11 years ago, Pete Moore [:pmoore][:pete], 478 bytes, patch, nthomas: review-
bug1001518_buildbot-configs_v3.patch, 11 years ago, Pete Moore [:pmoore][:pete], 1.30 KB, patch, nthomas: review+, pmoore: checked-in+

Jordan Lund (:jlund)

Reporter

Description

•

11 years ago

bld-centos6-hp-019, bld-centos6-hp-009, bld-centos6-hp-012, and bld-centos6-hp-006 have all run out of disk space recently. We may need to either bump the storage or regularly scrub more space on these machines.

Looking at our slave management wiki: https://wiki.mozilla.org/ReleaseEngineering/Buildduty/Slave_Management#Known_failure_modes

For out-of-disk failures on AWS machines, it suggests:

    To clean them, you can run mock_mozilla -v -r mozilla-centos6-i386 --scrub=all
    See bug 829186 for an example.

I am not sure if we can do the same or similar for in-house machines.

Jordan Lund (:jlund)

Reporter

Comment 1

•

11 years ago

=== short term solution:

Deleted the largest chunk of builder dirs from /builds/slave, e.g.:

    rm -rf /builds/slave/b2g_fx-team_flame_eng_dep-0000/build/*

    [cltbld@bld-centos6-hp-019.build.scl1.mozilla.com slave]$ df -h
    Filesystem      Size  Used Avail Use% Mounted on
    /dev/sda3       226G  128G   87G  60% /

    [cltbld@bld-centos6-hp-009.build.scl1.mozilla.com slave]$ df -h
    Filesystem      Size  Used Avail Use% Mounted on
    /dev/sda3       226G  121G   94G  57% /

    [cltbld@bld-centos6-hp-012.build.scl1.mozilla.com slave]$ df -h
    Filesystem      Size  Used Avail Use% Mounted on
    /dev/sda3       226G  136G   79G  64% /

    [cltbld@bld-centos6-hp-006.build.scl1.mozilla.com slave]$ df -h
    Filesystem      Size  Used Avail Use% Mounted on
    /dev/sda3       226G  117G   98G  55% /

Enabled and rebooted the four machines.

=== long term solution:

- added as a discussion item for the next buildduty meeting
- this will most likely continue happening, so we will need to decide on a solution sooner rather than later.

NOTE: leaving this bug open until the long term solution is addressed.
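A minimal sketch of how one might rank builder directories by size before deleting any of them, assuming a plain directory tree under /builds/slave. The helper names dir_size_bytes and largest_builder_dirs are hypothetical and not part of any releng tool.

```python
import os


def dir_size_bytes(path):
    """Sum the sizes of all regular files under path, without following symlinks."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            try:
                if not os.path.islink(full):
                    total += os.path.getsize(full)
            except OSError:
                # The file vanished or is unreadable; skip it.
                pass
    return total


def largest_builder_dirs(base, top=5):
    """Return up to `top` (name, size_bytes) pairs for subdirectories of base, largest first."""
    if not os.path.isdir(base):
        return []
    sizes = []
    for name in os.listdir(base):
        full = os.path.join(base, name)
        if os.path.isdir(full) and not os.path.islink(full):
            sizes.append((name, dir_size_bytes(full)))
    sizes.sort(key=lambda item: item[1], reverse=True)
    return sizes[:top]


if __name__ == "__main__":
    # /builds/slave is the path mentioned in the comment; adjust as needed.
    for name, size in largest_builder_dirs("/builds/slave"):
        print("%8.1f GB  %s" % (size / 1024.0 ** 3, name))
```

This only reports sizes; the decision about which directories are safe to remove is left to whoever is on buildduty.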

Nick Thomas [:nthomas] (UTC+12)

Comment 2

•

11 years ago

The other approach is to look at the history for the slave, and figure out which was the first job to run out of space. Then increase the space requirement for that job, assuming that it has either grown gradually over time or some change has stepwise increased the needed space.

Nick Thomas [:nthomas] (UTC+12)

Comment 3

•

11 years ago

eg for bld-centos6-hp-019 I gradually increased the numbuilds argument on https://secure.pub.build.mozilla.org/buildapi/recent/bld-centos6-hp-019?numbuilds=600 until I hit a run of builds with result 5, around the end of April, and found it was b2g_b2g-inbound_linux64_gecko-debug build job that failed first.
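The manual search described here can be sketched in code. This assumes the slave's build history has already been loaded into a list of dicts with 'starttime', 'buildername' and 'result' keys; that shape is an assumption for illustration, not the documented buildapi format, and first_build_of_failure_run is a hypothetical helper.

```python
def first_build_of_failure_run(builds, failing_result=5, min_run=3):
    """Return the build that starts the first run of at least `min_run`
    consecutive builds with `failing_result`, or None if there is no such run.

    `builds` is a list of dicts with 'starttime', 'buildername' and 'result'
    keys (an assumed shape, see the note above).
    """
    ordered = sorted(builds, key=lambda b: b["starttime"])
    run_start = None
    run_length = 0
    for index, build in enumerate(ordered):
        if build["result"] == failing_result:
            if run_length == 0:
                run_start = index
            run_length += 1
            if run_length >= min_run:
                return ordered[run_start]
        else:
            # A build with any other result breaks the run.
            run_length = 0
    return None
```

The min_run threshold is a design choice: a single result 5 can have many causes, whereas a sustained run on one slave is more suggestive of a full disk.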

Jordan Lund (:jlund)

Reporter

Comment 4

•

11 years ago

Thanks nick. I'll ask a follow-up question here so it's documented:

1) How do I go about 'increasing the space requirement for that job' once I find the job using the method described in comment 3?

2) If I do specify the space requirement for a build, does that mean a slave with at least that much free space will be required before a master allocates a job to it? Will that not mean we could potentially have fewer 'able' slaves available in our pool at certain times?

Nick Thomas [:nthomas] (UTC+12)

Comment 5

•

11 years ago

For MercurialBuildFactory and subclasses it's the buildSpace argument, which is usually set as the build_space variable in PLATFORM_VARS (config.py). In mozharness it's purge_minsize. Assorted misc. and release jobs hard code the value in their own scripts.

Buildbot doesn't know anything about how much free space is available when selecting a slave for a job; instead we try to free the needed space after the job has been started. There's a call to buildfarm/maintenance/purge_builds.py in tools, or the copy in mh, to do this.
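To illustrate the "free the needed space after the job has started" idea, here is a rough sketch. It is not the actual purge_builds.py; the function names, the oldest-first policy and the skip list are assumptions made for the example, and it needs Python 3.3+ for shutil.disk_usage.

```python
import os
import shutil


def free_gb(path):
    """Free space, in GB, on the filesystem holding path."""
    usage = shutil.disk_usage(path)
    return usage.free / 1024.0 ** 3


def purge_until_free(base, required_gb, skip=(), get_free_gb=free_gb,
                     remove=shutil.rmtree):
    """Delete subdirectories of base, least recently modified first, until
    at least required_gb is free. Returns the names that were removed.

    Names in `skip` (for example the directory of the job about to run)
    are never deleted.
    """
    removed = []
    if get_free_gb(base) >= required_gb:
        return removed
    candidates = [
        name for name in os.listdir(base)
        if name not in skip and os.path.isdir(os.path.join(base, name))
    ]
    candidates.sort(key=lambda name: os.path.getmtime(os.path.join(base, name)))
    for name in candidates:
        remove(os.path.join(base, name))
        removed.append(name)
        if get_free_gb(base) >= required_gb:
            break
    return removed
```

Under this model, raising the space requirement for a job does not shrink the pool of eligible slaves; it only makes the purge step delete more old build directories before the job proceeds.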

Pete Moore [:pmoore][:pete]

Assignee

Comment 6

•

11 years ago

This has occurred again on bld-centos6-hp-019 in bug 803087. This time the first build with result=5 was "b2g_mozilla-inbound_linux64_gecko-debug build". So I will see if I can increase the disk space required for all linux64 gecko debug builds (since previously in comment 3 it was also a linux64 gecko debug build: "b2g_b2g-inbound_linux64_gecko-debug build").

Pete Moore [:pmoore][:pete]

Assignee

Comment 7

•

11 years ago

Alternatively - if we jacuzzi up these jobs, this would also solve the disk space problem... Ben, are there any plans in progress to jacuzzi these hp slaves?

Flags: needinfo?(bhearsum)

Pete Moore [:pmoore][:pete]

Assignee

Comment 8

•

11 years ago

Attached patch bug1001518_buildbot-configs.patch (obsolete)

Bumping up build_space for linux debug builds to 18GB.

Assignee: nobody → pmoore

Status: NEW → ASSIGNED

Attachment #8419264 - Flags: review?(nthomas)

Pete Moore [:pmoore][:pete]

Assignee

Comment 9

•

11 years ago

Longer term solution documented in bug 1007583 (automatic setting of build_space based on historical job usage).

Nick Thomas [:nthomas] (UTC+12)

Comment 10

•

11 years ago

Comment on attachment 8419264
bug1001518_buildbot-configs.patch

"b2g_mozilla-inbound_linux64_gecko-debug build" jobs do

    python tools/buildfarm/maintenance/purge_builds.py -s 13

so you need to modify buildbot-configs/mozilla/b2g_config.py

Attachment #8419264 - Flags: review?(nthomas) → review-

Pete Moore [:pmoore][:pete]

Assignee

Comment 11

•

11 years ago

Attached patch bug1001518_buildbot-configs_v2.patch (obsolete)

Thanks Nick! Good spot.

Looks like we've only had linux64 gecko debug builds for B2G since last month: https://github.com/mozilla/build-buildbot-configs/commit/2359768e2b84004dc0d5de2588c61ca239cb9b36

This might explain why it is showing up as an issue now, and didn't before. Hopefully this patch will fix it. I nervously hand it over to you for review. :)

Thanks,
Pete

Attachment #8419264 - Attachment is obsolete: true

Attachment #8419385 - Flags: review?(nthomas)

bhearsum@mozilla.com (:bhearsum)

Comment 12

•

11 years ago

(In reply to Pete Moore [:pete][:pmoore] from comment #7)
> Alternatively - if we jacuzzi up these jobs, this would also solve the disk
> space problem...
>
> Ben, are there any plans in progress to jacuzzi these hp slaves?

Eventually, probably. Nothing specific right now.

Flags: needinfo?(bhearsum)

Jordan Lund (:jlund)

Reporter

Comment 13

•

11 years ago

Thanks pete. I had a patch for this but I was trying to verify whether there were other b2g platforms that needed a bump too. Hopefully it's just this one since, as you mentioned, it's new!

Nick Thomas [:nthomas] (UTC+12)

Comment 14

•

11 years ago

Comment on attachment 8419385
bug1001518_buildbot-configs_v2.patch

This modifies jobs like
    b2g_mozilla-inbound_linux64_gecko
rather than
    b2g_mozilla-inbound_linux64_gecko-debug build
There's a separate 'linux64_gecko-debug' platform you want to change to fix this.

FYI, it's good to provide extra context on this sort of patch, going far enough back to include the platform name. And you can verify the change is having the effect you want with dump_master.py, docs at https://wiki.mozilla.org/ReleaseEngineering:TestingTechniques#builder_list.py_.2F_dump_master.py (although you can just run this for a build master in this case).
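One way to picture the verification step: compare the space setting per builder before and after applying a patch, and check that only the intended builders changed. The sketch below is not dump_master.py and does not parse its output; it assumes the per-builder values have already been extracted into two dicts, and changed_build_space is a hypothetical helper.

```python
def changed_build_space(before, after):
    """Return {builder: (old, new)} for every builder whose space setting
    differs between the two mappings. A builder missing on one side is
    reported with None for that side.
    """
    changes = {}
    for name in sorted(set(before) | set(after)):
        old = before.get(name)
        new = after.get(name)
        if old != new:
            changes[name] = (old, new)
    return changes


# Example input. The builder names and the value 13 come from this bug;
# the value 16 is a made-up number for illustration only.
before = {
    "b2g_mozilla-inbound_linux64_gecko build": 13,
    "b2g_mozilla-inbound_linux64_gecko-debug build": 13,
}
after = {
    "b2g_mozilla-inbound_linux64_gecko build": 13,
    "b2g_mozilla-inbound_linux64_gecko-debug build": 16,
}
changes = changed_build_space(before, after)
```

With a comparison like this, a patch that touched the non-debug builder instead of the debug one would show up as an unexpected key in the result.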

Attachment #8419385 - Flags: review?(nthomas) → review-

Pete Moore [:pmoore][:pete]

Assignee

Comment 15

•

11 years ago

Ah my bad! Actually, it was the right change (i.e. for linux64_gecko-debug), but the diff was against the wrong base revision - I seem to have forgotten to update my working version before creating the patch, and I was patching this revision from April 21st: https://hg.mozilla.org/build/buildbot-configs/file/7e61f9dabf1d/mozilla/b2g_config.py#l240 In any case, you are absolutely right, I should have included more context in the patch, and this would have helped. And of course I should have refreshed my working dir first! And thanks for the dump master tip - I will use it now to validate hopefully my third *and final* patch(!!). Third time lucky, like they say. Will attach the new patch once I've run the dump master test! :) Apologies for the two failed attempts so far... Pete

Pete Moore [:pmoore][:pete]

Assignee

Comment 16

•

11 years ago

Attached patch bug1001518_buildbot-configs_v3.patch

OK, I wasn't able to successfully test with the dump master script today. I set up a dev master, but I got no output:

    (build1)bash-4.1$ pwd
    /builds/buildbot/pmoore/build1
    (build1)bash-4.1$ braindump/buildbot-related/builder_list.py master/master.cfg
    /builds/buildbot/pmoore/build1/lib/python2.6/site-packages/twisted/mail/smtp.py:10: DeprecationWarning: the MimeWriter module is deprecated; use the email package instead
      import MimeWriter, tempfile, rfc822

At the moment it looks like my master has no builders, which might be the reason: http://dev-master1.srv.releng.scl3.mozilla.com:8444/builders

So I will have to troubleshoot this another time. In order to avoid delay, here is the patch, which I am relatively sure is correct, even if I couldn't test it yet using dump master.

Attachment #8419385 - Attachment is obsolete: true

Attachment #8420241 - Flags: review?(nthomas)

Nick Thomas [:nthomas] (UTC+12)

Updated

•

11 years ago

Attachment #8420241 - Flags: review?(nthomas) → review+

Pete Moore [:pmoore][:pete]

Assignee

Comment 17

•

11 years ago

Committed on default: https://hg.mozilla.org/build/buildbot-configs/rev/dd61727e8e79

Pete Moore [:pmoore][:pete]

Assignee

Updated

•

11 years ago

Attachment #8420241 - Flags: checked-in+

Pete Moore [:pmoore][:pete]

Assignee

Updated

•

11 years ago

Status: ASSIGNED → RESOLVED

Closed: 11 years ago

Resolution: --- → FIXED

Nick Thomas [:nthomas] (UTC+12)

Comment 18

•

11 years ago

Let's leave this open until the fix is merged to production and deployed.

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

Armen [:armenzg]

Comment 19

•

11 years ago

Merged into production and live.

Nick Thomas [:nthomas] (UTC+12)

Comment 20

•

11 years ago

Ok, closing now.

Status: REOPENED → RESOLVED

Closed: 11 years ago → 11 years ago

Resolution: --- → FIXED

Summary: bld-centos6-hp-* slaves are running out of disk space → running out of disk space during linux64_gecko-debug jobs

BMO Automation

Updated

•

7 years ago

Product: Release Engineering → Infrastructure & Operations

BMO Automation

Updated

•

6 years ago

Product: Infrastructure & Operations → Infrastructure & Operations Graveyard
