Closed Bug 1069811 Opened 10 years ago Closed 10 years ago

AWS Linux builders fail to allocate LVM free space

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

Platform: x86 Linux
Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: cbook, Assigned: rail)

References

()

Details

Attachments

(2 files)

b2g_fx-team_emulator-jb-debug_dep on 2014-09-19 00:51:19 PDT for push f6c42abb5457

slave: bld-linux64-spot-303

https://tbpl.mozilla.org/php/getParsedLog.php?id=48446342&tree=Fx-Team

00:51:25     INFO -  Deleting /builds/slave/m-beta-lx-00000000000000000000
00:51:25     INFO -  Deleting /builds/slave/m-rel-lx-d-0000000000000000000
00:51:25     INFO -  Deleting /builds/slave/oak-and-ntly-00000000000000000
00:51:25     INFO -  Deleting ./scripts
00:51:25     INFO -  Deleting ./logs
00:51:25     INFO -  Error: unable to free 20.00 GB of space. Free space only 13.48 GB
00:51:25    ERROR - Return code: 1
00:51:25    FATAL - failed to purge builds
00:51:25    FATAL - Running post_fatal callback...
00:51:25    FATAL - Exiting 2

Maybe the slaves need a bigger disk?
I'm not sure why it could not free the space: the slave appears to have 35GB of storage, and 14GB of the 21.5GB used was under /builds, so I would have expected the purge to work.

I tried to run the purge command manually, but the script had already been deleted:

[cltbld@bld-linux64-spot-303.build.releng.usw2.mozilla.com ~]$ /builds/slave/b2g_fx-team_emu-jb-d_dep-00000/scripts/external_tools/purge_builds.py -s 20 --max-age 14 --not info --not rel-* --not tb-rel-* /builds/slave
-bash: /builds/slave/b2g_fx-team_emu-jb-d_dep-00000/scripts/external_tools/purge_builds.py: No such file or directory
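
For context, the command above asks purge_builds.py to make 20GB free (-s 20) under /builds/slave by deleting old build directories, skipping anything matched by the --not exclusions. The snippet below is only a minimal sketch of that idea, with an assumed directory layout and selection policy; it is not the real purge_builds.py.

import os
import shutil

def free_space_gb(path):
    """Free space (GB) on the filesystem containing `path`."""
    st = os.statvfs(path)
    return st.f_bavail * st.f_frsize / (1024.0 ** 3)

def purge(basedir, required_gb, skip=("info",)):
    """Delete the least recently used directories under `basedir`
    until `required_gb` is free; return True on success."""
    candidates = [os.path.join(basedir, name) for name in os.listdir(basedir)
                  if name not in skip]
    candidates = [d for d in candidates if os.path.isdir(d)]
    candidates.sort(key=os.path.getmtime)  # oldest first
    for d in candidates:
        if free_space_gb(basedir) >= required_gb:
            return True
        shutil.rmtree(d, ignore_errors=True)
    return free_space_gb(basedir) >= required_gb

# e.g. purge("/builds/slave", 20) - this fails in the log above because
# almost everything still using disk lives outside /builds/slave.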




[root@bld-linux64-spot-303.build.releng.usw2.mozilla.com /]# du -sh /*
7.7M	/bin
21M	/boot
14G	/builds
4.0K	/cgroup
152K	/dev
6.2M	/etc
160K	/home
109M	/lib
26M	/lib64
16K	/lost+found
4.0K	/media
4.0K	/mnt
9.4M	/opt
du: cannot access `/proc/1503/task/1503/fd/4': No such file or directory
du: cannot access `/proc/1503/task/1503/fdinfo/4': No such file or directory
du: cannot access `/proc/1503/fd/4': No such file or directory
du: cannot access `/proc/1503/fdinfo/4': No such file or directory
0	/proc
4.0K	/REBOOT_AFTER_PUPPET
84K	/root
13M	/sbin
4.0K	/selinux
4.0K	/srv
4.1G	/swap_file
0	/sys
16M	/tmp
220M	/tools
770M	/usr
161M	/var

Shortly after I connected and ran the above commands, the slave shut down and was terminated (confirmed by checking in the EC2 console).
This occurred again, this time on a different slave:

[root@bld-linux64-spot-1005.build.releng.use1.mozilla.com ~]# du -sk /* | sort -n
du: cannot access `/proc/1470/task/1470/fd/4': No such file or directory
du: cannot access `/proc/1470/task/1470/fdinfo/4': No such file or directory
du: cannot access `/proc/1470/fd/4': No such file or directory
du: cannot access `/proc/1470/fdinfo/4': No such file or directory
0	/proc
0	/sys
4	/cgroup
4	/media
4	/mnt
4	/REBOOT_AFTER_PUPPET
4	/selinux
4	/srv
16	/lost+found
80	/root
152	/dev
156	/home
6136	/etc
7380	/bin
9580	/opt
12384	/sbin
16136	/tmp
20877	/boot
26280	/lib64
111320	/lib
161872	/var
225212	/tools
783968	/usr
4194308	/swap_file
14338628	/builds
[root@bld-linux64-spot-1005.build.releng.use1.mozilla.com ~]#
Something strange is going on here; these totals don't add up...

[root@bld-linux64-spot-1005.build.releng.use1.mozilla.com builds]# du -sh $(echo *)
136K	ccache
4.0K	gapi.data
11G	git-shared
3.6G	hg-shared
32K	mock_mozilla
4.0K	mozilla-api.key
6.3M	slave
4.0K	tooltool_cache
0	tooltool.py
[root@bld-linux64-spot-1005.build.releng.use1.mozilla.com builds]# du -sh .
14G	.
[root@bld-linux64-spot-1005.build.releng.use1.mozilla.com builds]#
ahh they do - I didn't spot "git-shared"
So I can't see yet which change has led to this - I'll see if I can find changes to the amount of free space we ask the purge to make - but since only 13.5GB is available, and other jobs previously needed 16GB if I recall correctly, I suspect one of the following causes:

  1) the git shared repos have grown because one or more new repos were added
  2) the disk space settings on spot instances changed (we currently have 35GB)
  3) the jacuzzi settings changed, so these slaves now run more types of jobs and therefore use more git shared or hg shared repos

Clobberer itself is doing a good job - it clears everything out - but still doesn't get the 20GB it needs, since the hg shared and git shared directories are so big, so this is not a clobberer bug.
I did not find any Release Engineering bugs landed in the last few days that look like good candidates for causing this, and I also could not find any recent changes in buildbot-configs, tools, puppet, or mozharness that might explain why this has started happening.

Given the size of the purge we want for builds, the small disk, and the size of the shared repos, we are very close to the limit of what we can do. However, it surprises me that we are a full 6.5GB short of our target (we can free 13.5GB but want 20GB).

I suspect :nthomas or :catlee will be able to tell immediately what changed to push us over the limit - I can't seem to find it easily.

Both jobs were b2g emulator builds, I believe, but on different trees (fx-team and mozilla-inbound):
 * https://tbpl.mozilla.org/php/getParsedLog.php?id=48446342&tree=Fx-Team
 * https://tbpl.mozilla.org/php/getParsedLog.php?id=48452285&tree=Mozilla-Inbound
Flags: needinfo?(nthomas)
Flags: needinfo?(catlee)
Since I did not explicitly summarize it above, I'll do it here:

  * spot instances in this case have 35GB disk available
  * it looks like we need around 14GB just for git shared and hg shared repos
  * we are asking to free 20GB, meaning we leave only 1GB space for everything else needed on the disk
  * this is not enough!
  * something has changed recently to alter one or more of these numbers
  * above I tried to find out which of these numbers has moved, to explain why "the maths worked" previously but now we fall short by as much as 6.5GB in both of these failed job runs; a rough budget follows below.
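
A rough budget from the du output above (approximate, ignoring filesystem overhead and reserved blocks):

   35.0 GB  disk
 - ~14.0 GB  git-shared + hg-shared (outside /builds/slave, so never purged)
 -  ~4.1 GB  /swap_file
 -  ~1.5 GB  OS, /usr, /tools, /var, etc.
 = ~15.4 GB  theoretical best case after purging everything under /builds/slave

which lines up with the 13.48GB the purge step actually reported free, and is well short of the 20GB target.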
cc'ing sheriffs in case this happens again
CCing rail. We've had a few AWS instances that have had these issues in the past couple days.
https://tbpl.mozilla.org/php/getParsedLog.php?id=48462070&tree=Mozilla-Aurora is another example, this time:


b2g_mozilla-aurora_flame_periodic on 2014-09-19 06:30:55 PDT for push ab2a88c05a4b

slave: b-linux64-hp-0033

so it's not even just the AWS slaves?
Flags: needinfo?(catlee) → needinfo?(rail)
This is something intermittent with lvextend...
Assignee: nobody → rail
Depends on: 1069561
Flags: needinfo?(rail)
Flags: needinfo?(nthomas)
I think I found the issue.

AWS shows 2 ephemeral devices via the metadata API, but we have only one:

$ curl http://169.254.169.254/latest/meta-data/block-device-mapping/
ami
ephemeral0
ephemeral1

$ blkid
/dev/xvda1: UUID="ed219abd-9757-4308-82d8-501046eadccc" TYPE="ext2" 
/dev/xvda2: UUID="7tdAmc-b480-A36E-MIkn-Ut5g-n9U5-MnON2h" TYPE="LVM2_member" 
/dev/mapper/cloud_root-lv_root: LABEL="root_dev" UUID="1ee91aee-09d0-449a-ba61-8a71685f5494" TYPE="ext4" 
/dev/xvdb: UUID="6ameZS-IweR-bJLX-DKgy-uVpV-DJID-StrOtS" TYPE="LVM2_member"
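
A quick way to see the mismatch from the instance itself is to walk the block-device-mapping entries and check whether the advertised devices actually exist. This is a hypothetical sketch, not the tooling used here, and the sdX-to-xvdX renaming is an assumption about how these Xen-based instances expose devices:

import os
import urllib.request

METADATA = "http://169.254.169.254/latest/meta-data/block-device-mapping/"

def metadata(path):
    """Fetch a plain-text value from the EC2 instance metadata service."""
    with urllib.request.urlopen(METADATA + path, timeout=2) as resp:
        return resp.read().decode().strip()

def main():
    for entry in metadata("").splitlines():
        if not entry.startswith("ephemeral"):
            continue
        name = metadata(entry)  # e.g. "sdb"
        # Xen kernels usually expose sdX as xvdX (assumption).
        candidates = ["/dev/" + name, "/dev/" + name.replace("sd", "xvd", 1)]
        if any(os.path.exists(dev) for dev in candidates):
            print("%s -> %s (present)" % (entry, name))
        else:
            print("%s -> %s (advertised but missing)" % (entry, name))

if __name__ == "__main__":
    main()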
Just found another instance trying to recover itself in endless runner limbo:

[root@bld-linux64-spot-358.build.releng.usw2.mozilla.com ~]# pvs
  WARNING: Volume Group cloud_root is not consistent
  PV         VG         Fmt  Attr PSize  PFree 
  /dev/xvda2 cloud_root lvm2 a--  34.94g     0 
  /dev/xvdb  cloud_root lvm2 a--  75.00g 75.00g
[root@bld-linux64-spot-358.build.releng.usw2.mozilla.com ~]# lvs
  LV      VG         Attr   LSize  Origin Snap%  Move Log Copy%  Convert
  lv_root cloud_root -wi-ao 34.94g 


The warning must be the issue - the second PV (/dev/xvdb) was added to cloud_root, but its 75GB of free extents were never allocated to lv_root.
Attached patch 20g.diff
Let's purge 20G on build slaves, so we leave the broken slaves in the reboot limbo instead of burning builds.

This is not the final solution; it just moves the problem from one stage to another.
Attachment #8492641 - Flags: review?(bhearsum)
Attachment #8492641 - Flags: review?(bhearsum) → review+
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #17)
> https://tbpl.mozilla.org/php/getParsedLog.php?id=48589613&tree=Mozilla-Aurora

fixed bld-linux64-spot-440 manually
* Adds syslog logging
* Uses "-l 100%VG" instead of "-l 100%FREE", which wasn't working in all cases
* Just in case, adds a function to fix inconsistent LVM volumes (a rough sketch of the idea is below)
* Tested on at least 5 broken instances
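
For readers who can't open the attachment: below is a heavily simplified sketch of what the two reviewed helpers could look like. The function names match the ones mentioned in the review comment that follows; the bodies, the vgck/vgreduce repair step, and the resize2fs call are my assumptions, not the landed puppet code.

import subprocess
import syslog

def run(cmd):
    """Run an LVM command, mirroring it to syslog so failed spot
    instances leave a trace after they are recycled."""
    syslog.syslog(syslog.LOG_INFO, "running: %s" % " ".join(cmd))
    return subprocess.check_output(cmd).decode()

def query_pv_free_size(vg="cloud_root"):
    """Return the unallocated PV space (in GB) inside volume group `vg`."""
    out = run(["pvs", "--noheadings", "--nosuffix", "--units", "g",
               "-o", "vg_name,pv_free"])
    total = 0.0
    for line in out.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0] == vg:
            total += float(fields[1])
    return total

def maybe_fix_lvm_devices(vg="cloud_root"):
    """If the VG metadata is inconsistent (the pvs warning above), attempt
    a conservative repair before extending any LVs."""
    try:
        run(["vgck", vg])  # exits non-zero when it finds metadata problems
    except subprocess.CalledProcessError:
        syslog.syslog(syslog.LOG_WARNING, "%s is inconsistent, repairing" % vg)
        # One plausible repair: drop PVs whose backing devices are gone.
        run(["vgreduce", "--removemissing", vg])

def extend_root_lv(vg="cloud_root", lv="lv_root"):
    """Grow lv_root over every extent in the VG; "-l 100%VG" also picks up
    PVs that were added but never allocated (the 75GB case above)."""
    if query_pv_free_size(vg) > 0:
        maybe_fix_lvm_devices(vg)
        run(["lvextend", "-l", "100%VG", "/dev/%s/%s" % (vg, lv)])
        run(["resize2fs", "/dev/mapper/%s-%s" % (vg, lv)])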
Attachment #8493328 - Flags: review?(mgervasini)
Comment on attachment 8493328 [details] [diff] [review]
syslog-puppet.diff

Looks good! Could you just add docstrings for the query_pv_free_size and maybe_fix_lvm_devices functions?
Attachment #8493328 - Flags: review?(mgervasini) → review+
Comment on attachment 8493328 [details] [diff] [review]
syslog-puppet.diff

remote:   https://hg.mozilla.org/build/puppet/rev/262f22c51330
remote:   https://hg.mozilla.org/build/puppet/rev/e96dd1ea689b

with docstrings added

I'm going to regenerate AMIs manually.
Attachment #8493328 - Flags: checked-in+
Just recovered 4 builders from the reboot limbo.
Summary: AWS Linux Slave failed with failed to purge builds - Error: unable to free 20.00 GB of space. → AWS Linux builders fail to allocate LVM free space
Blocks: 1068025
build/try AMIs have been refreshed. Assuming it's fixed now.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #24)
> https://tbpl.mozilla.org/php/getParsedLog.php?id=48724828&tree=Mozilla-B2g32-
> v2.0

That was based on an AMI from yesterday. Fixed manually.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(In reply to Carsten Book [:Tomcat] from comment #26)
> https://tbpl.mozilla.org/php/getParsedLog.php?id=48749623&tree=B2g-Inbound
> https://tbpl.mozilla.org/php/getParsedLog.php?id=48749976&tree=B2g-Inbound
> 
> Rail, could you look at these 2 also? Thanks!

Thanks Pete!

FTR, those were in-house builders, not related to the LVM issue, just a regular out-of-space problem.
Flags: needinfo?(rail)
I'm going to close the bug since the last 3 slaves are in-house and their issues are not related to the initial one. We can track those in the slave bugs or we can open a new bug to address the issue globally.
Status: REOPENED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard