Closed Bug 1069811 Opened 10 years ago Closed 10 years ago

AWS Linux builders fail to allocate LVM free space

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

Platform: x86 Linux
Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: cbook, Assigned: rail)

References

()

Details

Attachments

(2 files)

b2g_fx-team_emulator-jb-debug_dep on 2014-09-19 00:51:19 PDT for push f6c42abb5457

slave: bld-linux64-spot-303

https://tbpl.mozilla.org/php/getParsedLog.php?id=48446342&tree=Fx-Team

00:51:25     INFO -  Deleting /builds/slave/m-beta-lx-00000000000000000000
00:51:25     INFO -  Deleting /builds/slave/m-rel-lx-d-0000000000000000000
00:51:25     INFO -  Deleting /builds/slave/oak-and-ntly-00000000000000000
00:51:25     INFO -  Deleting ./scripts
00:51:25     INFO -  Deleting ./logs
00:51:25     INFO -  Error: unable to free 20.00 GB of space. Free space only 13.48 GB
00:51:25    ERROR - Return code: 1
00:51:25    FATAL - failed to purge builds
00:51:25    FATAL - Running post_fatal callback...
00:51:25    FATAL - Exiting 2

Maybe the slaves need a bigger disk?
I'm not sure why it could not free the space: the slave appears to have 35GB of storage, and 14GB of the 21.5GB used was under /builds, so I would have expected the purge to work.

I tried to run the purge command manually, but the script had already been deleted:

[cltbld@bld-linux64-spot-303.build.releng.usw2.mozilla.com ~]$ /builds/slave/b2g_fx-team_emu-jb-d_dep-00000/scripts/external_tools/purge_builds.py -s 20 --max-age 14 --not info --not rel-* --not tb-rel-* /builds/slave
-bash: /builds/slave/b2g_fx-team_emu-jb-d_dep-00000/scripts/external_tools/purge_builds.py: No such file or directory
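
For context, the command above asks purge_builds.py to make 20GB free (-s 20) under /builds/slave by deleting old build directories, skipping anything matched by the --not exclusions. The snippet below is only a minimal sketch of that idea, with an assumed directory layout and selection policy; it is not the real purge_builds.py.

import os
import shutil

def free_space_gb(path):
    """Free space (GB) on the filesystem containing `path`."""
    st = os.statvfs(path)
    return st.f_bavail * st.f_frsize / (1024.0 ** 3)

def purge(basedir, required_gb, skip=("info",)):
    """Delete the least recently used directories under `basedir`
    until `required_gb` is free; return True on success."""
    candidates = [os.path.join(basedir, name) for name in os.listdir(basedir)
                  if name not in skip]
    candidates = [d for d in candidates if os.path.isdir(d)]
    candidates.sort(key=os.path.getmtime)  # oldest first
    for d in candidates:
        if free_space_gb(basedir) >= required_gb:
            return True
        shutil.rmtree(d, ignore_errors=True)
    return free_space_gb(basedir) >= required_gb

# e.g. purge("/builds/slave", 20) - this fails in the log above because
# almost everything still using disk lives outside /builds/slave.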




[root@bld-linux64-spot-303.build.releng.usw2.mozilla.com /]# du -sh /*
7.7M	/bin
21M	/boot
14G	/builds
4.0K	/cgroup
152K	/dev
6.2M	/etc
160K	/home
109M	/lib
26M	/lib64
16K	/lost+found
4.0K	/media
4.0K	/mnt
9.4M	/opt
du: cannot access `/proc/1503/task/1503/fd/4': No such file or directory
du: cannot access `/proc/1503/task/1503/fdinfo/4': No such file or directory
du: cannot access `/proc/1503/fd/4': No such file or directory
du: cannot access `/proc/1503/fdinfo/4': No such file or directory
0	/proc
4.0K	/REBOOT_AFTER_PUPPET
84K	/root
13M	/sbin
4.0K	/selinux
4.0K	/srv
4.1G	/swap_file
0	/sys
16M	/tmp
220M	/tools
770M	/usr
161M	/var

Shortly after I connected and ran the above commands, the slave shut down and was terminated (confirmed by checking in the EC2 console).
This occurred again, this time on a different slave:

[root@bld-linux64-spot-1005.build.releng.use1.mozilla.com ~]# du -sk /* | sort -n
du: cannot access `/proc/1470/task/1470/fd/4': No such file or directory
du: cannot access `/proc/1470/task/1470/fdinfo/4': No such file or directory
du: cannot access `/proc/1470/fd/4': No such file or directory
du: cannot access `/proc/1470/fdinfo/4': No such file or directory
0	/proc
0	/sys
4	/cgroup
4	/media
4	/mnt
4	/REBOOT_AFTER_PUPPET
4	/selinux
4	/srv
16	/lost+found
80	/root
152	/dev
156	/home
6136	/etc
7380	/bin
9580	/opt
12384	/sbin
16136	/tmp
20877	/boot
26280	/lib64
111320	/lib
161872	/var
225212	/tools
783968	/usr
4194308	/swap_file
14338628	/builds
[root@bld-linux64-spot-1005.build.releng.use1.mozilla.com ~]#
Something strange is going on here; these totals don't add up...

[root@bld-linux64-spot-1005.build.releng.use1.mozilla.com builds]# du -sh $(echo *)
136K	ccache
4.0K	gapi.data
11G	git-shared
3.6G	hg-shared
32K	mock_mozilla
4.0K	mozilla-api.key
6.3M	slave
4.0K	tooltool_cache
0	tooltool.py
[root@bld-linux64-spot-1005.build.releng.use1.mozilla.com builds]# du -sh .
14G	.
[root@bld-linux64-spot-1005.build.releng.use1.mozilla.com builds]#
ahh they do - I didn't spot "git-shared"
So I can't see yet which change has led to this - I'll see if I can find changes to the amount of free space we ask the purge to make - but since only 13.5GB is available, and other jobs previously needed 16GB if I recall correctly, I suspect one of the following causes:

  1) the git shared repos have grown because one or more new repos were added
  2) the disk space settings on spot instances changed (we currently have 35GB)
  3) the jacuzzi settings changed, so these slaves now run more types of jobs and therefore use more git shared or hg shared repos

Clobberer itself is doing a good job - it clears everything out - but still doesn't get the 20GB it needs, since the hg shared and git shared directories are so big, so this is not a clobberer bug.
I did not find any Release Engineering bugs landed in the last few days that look like good candidates for causing this, and I also could not find any recent changes in buildbot-configs, tools, puppet, or mozharness that might explain why this has started happening.

Given the size of the purge we want for builds, the small disk, and the size of the shared repos, we are very close to the limit of what we can do. However, it surprises me that we are a full 6.5GB short of our target (we can free 13.5GB but want 20GB).

I suspect :nthomas or :catlee will be able to tell immediately what changed to push us over the limit - I can't seem to find it easily.

Both jobs were b2g emulator builds, I believe, but on different trees (fx-team and mozilla-inbound):
 * https://tbpl.mozilla.org/php/getParsedLog.php?id=48446342&tree=Fx-Team
 * https://tbpl.mozilla.org/php/getParsedLog.php?id=48452285&tree=Mozilla-Inbound
Flags: needinfo?(nthomas)
Flags: needinfo?(catlee)
Since I did not explicitly summarize it above, I'll do it here:

  * spot instances in this case have 35GB disk available
  * it looks like we need around 14GB just for git shared and hg shared repos
  * we are asking to free 20GB, meaning we leave only 1GB space for everything else needed on the disk
  * this is not enough!
  * something has changed recently to alter one or more of these numbers
  * above I tried to find out which of these numbers has moved, to explain why "the maths worked" previously but now we fall short by as much as 6.5GB in both of these failed job runs; a rough budget follows below.
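
A rough budget from the du output above (approximate, ignoring filesystem overhead and reserved blocks):

   35.0 GB  disk
 - ~14.0 GB  git-shared + hg-shared (outside /builds/slave, so never purged)
 -  ~4.1 GB  /swap_file
 -  ~1.5 GB  OS, /usr, /tools, /var, etc.
 = ~15.4 GB  theoretical best case after purging everything under /builds/slave

which lines up with the 13.48GB the purge step actually reported free, and is well short of the 20GB target.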
cc'ing sheriffs in case this happens again
CCing rail. We've had a few AWS instances that have had these issues in the past couple days.
https://tbpl.mozilla.org/php/getParsedLog.php?id=48462070&tree=Mozilla-Aurora is another example, this time:


b2g_mozilla-aurora_flame_periodic on 2014-09-19 06:30:55 PDT for push ab2a88c05a4b

slave: b-linux64-hp-0033

so it's not even just the AWS slaves?
Flags: needinfo?(catlee) → needinfo?(rail)
This is something intermittent with lvextend...
Assignee: nobody → rail
Depends on: 1069561
Flags: needinfo?(rail)
Flags: needinfo?(nthomas)
I think I found the issue.

AWS shows 2 ephemeral devices via the metadata API, but we have only one:

$ curl http://169.254.169.254/latest/meta-data/block-device-mapping/
ami
ephemeral0
ephemeral1

$ blkid
/dev/xvda1: UUID="ed219abd-9757-4308-82d8-501046eadccc" TYPE="ext2" 
/dev/xvda2: UUID="7tdAmc-b480-A36E-MIkn-Ut5g-n9U5-MnON2h" TYPE="LVM2_member" 
/dev/mapper/cloud_root-lv_root: LABEL="root_dev" UUID="1ee91aee-09d0-449a-ba61-8a71685f5494" TYPE="ext4" 
/dev/xvdb: UUID="6ameZS-IweR-bJLX-DKgy-uVpV-DJID-StrOtS" TYPE="LVM2_member"
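
A quick way to see the mismatch from the instance itself is to walk the block-device-mapping entries and check whether the advertised devices actually exist. This is a hypothetical sketch, not the tooling used here, and the sdX-to-xvdX renaming is an assumption about how these Xen-based instances expose devices:

import os
import urllib.request

METADATA = "http://169.254.169.254/latest/meta-data/block-device-mapping/"

def metadata(path):
    """Fetch a plain-text value from the EC2 instance metadata service."""
    with urllib.request.urlopen(METADATA + path, timeout=2) as resp:
        return resp.read().decode().strip()

def main():
    for entry in metadata("").splitlines():
        if not entry.startswith("ephemeral"):
            continue
        name = metadata(entry)  # e.g. "sdb"
        # Xen kernels usually expose sdX as xvdX (assumption).
        candidates = ["/dev/" + name, "/dev/" + name.replace("sd", "xvd", 1)]
        if any(os.path.exists(dev) for dev in candidates):
            print("%s -> %s (present)" % (entry, name))
        else:
            print("%s -> %s (advertised but missing)" % (entry, name))

if __name__ == "__main__":
    main()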
Just found another instance trying to recover itself in endless runner limbo:

[root@bld-linux64-spot-358.build.releng.usw2.mozilla.com ~]# pvs
  WARNING: Volume Group cloud_root is not consistent
  PV         VG         Fmt  Attr PSize  PFree 
  /dev/xvda2 cloud_root lvm2 a--  34.94g     0 
  /dev/xvdb  cloud_root lvm2 a--  75.00g 75.00g
[root@bld-linux64-spot-358.build.releng.usw2.mozilla.com ~]# lvs
  LV      VG         Attr   LSize  Origin Snap%  Move Log Copy%  Convert
  lv_root cloud_root -wi-ao 34.94g 


The warning must be the issue - the second PV (/dev/xvdb) was added to cloud_root, but its 75GB of free extents were never allocated to lv_root.
Attached patch 20g.diff
Let's purge 20G on build slaves, so we leave the broken slaves in the reboot limbo instead of burning builds.

This is not the final solution; it just moves the problem from one stage to another.
Attachment #8492641 - Flags: review?(bhearsum)
Attachment #8492641 - Flags: review?(bhearsum) → review+
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #17)
> https://tbpl.mozilla.org/php/getParsedLog.php?id=48589613&tree=Mozilla-Aurora

fixed bld-linux64-spot-440 manually
* Adds syslog logging
* Uses "-l 100%VG" instead of "-l 100%FREE", which wasn't working in all cases
* Just in case, adds a function to fix inconsistent LVM volumes (a rough sketch of the idea is below)
* Tested on at least 5 broken instances
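
For readers who can't open the attachment: below is a heavily simplified sketch of what the two reviewed helpers could look like. The function names match the ones mentioned in the review comment that follows; the bodies, the vgck/vgreduce repair step, and the resize2fs call are my assumptions, not the landed puppet code.

import subprocess
import syslog

def run(cmd):
    """Run an LVM command, mirroring it to syslog so failed spot
    instances leave a trace after they are recycled."""
    syslog.syslog(syslog.LOG_INFO, "running: %s" % " ".join(cmd))
    return subprocess.check_output(cmd).decode()

def query_pv_free_size(vg="cloud_root"):
    """Return the unallocated PV space (in GB) inside volume group `vg`."""
    out = run(["pvs", "--noheadings", "--nosuffix", "--units", "g",
               "-o", "vg_name,pv_free"])
    total = 0.0
    for line in out.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0] == vg:
            total += float(fields[1])
    return total

def maybe_fix_lvm_devices(vg="cloud_root"):
    """If the VG metadata is inconsistent (the pvs warning above), attempt
    a conservative repair before extending any LVs."""
    try:
        run(["vgck", vg])  # exits non-zero when it finds metadata problems
    except subprocess.CalledProcessError:
        syslog.syslog(syslog.LOG_WARNING, "%s is inconsistent, repairing" % vg)
        # One plausible repair: drop PVs whose backing devices are gone.
        run(["vgreduce", "--removemissing", vg])

def extend_root_lv(vg="cloud_root", lv="lv_root"):
    """Grow lv_root over every extent in the VG; "-l 100%VG" also picks up
    PVs that were added but never allocated (the 75GB case above)."""
    if query_pv_free_size(vg) > 0:
        maybe_fix_lvm_devices(vg)
        run(["lvextend", "-l", "100%VG", "/dev/%s/%s" % (vg, lv)])
        run(["resize2fs", "/dev/mapper/%s-%s" % (vg, lv)])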
Attachment #8493328 - Flags: review?(mgervasini)
Comment on attachment 8493328 [details] [diff] [review]
syslog-puppet.diff

Looks good! Could you just add docstrings for the query_pv_free_size and maybe_fix_lvm_devices functions?
Attachment #8493328 - Flags: review?(mgervasini) → review+
Comment on attachment 8493328 [details] [diff] [review]
syslog-puppet.diff

remote:   https://hg.mozilla.org/build/puppet/rev/262f22c51330
remote:   https://hg.mozilla.org/build/puppet/rev/e96dd1ea689b

with docstrings added

I'm going to regenerate AMIs manually.
Attachment #8493328 - Flags: checked-in+
Just recovered 4 builders from the reboot limbo.
Summary: AWS Linux Slave failed with failed to purge builds - Error: unable to free 20.00 GB of space. → AWS Linux builders fail to allocate LVM free space
Blocks: 1068025
build/try AMIs have been refreshed. Assuming it's fixed now.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #24)
> https://tbpl.mozilla.org/php/getParsedLog.php?id=48724828&tree=Mozilla-B2g32-
> v2.0

That was based on an AMI from yesterday. Fixed manually.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(In reply to Carsten Book [:Tomcat] from comment #26)
> https://tbpl.mozilla.org/php/getParsedLog.php?id=48749623&tree=B2g-Inbound
> https://tbpl.mozilla.org/php/getParsedLog.php?id=48749976&tree=B2g-Inbound
> 
> Rail, could you look at these 2 also? Thanks!

Thanks Pete!

FTR, those were in-house builders, not related to the LVM issue, just a regular out-of-space problem.
Flags: needinfo?(rail)
I'm going to close the bug since the last 3 slaves are in-house and their issues are not related to the initial one. We can track those in the slave bugs or we can open a new bug to address the issue globally.
Status: REOPENED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard