AWS Linux builders fail to allocate LVM free space



Opened 4 years ago
Last modified 9 months ago


(Reporter: cbook, Assigned: rail)





(2 attachments)



4 years ago
b2g_fx-team_emulator-jb-debug_dep on 2014-09-19 00:51:19 PDT for push f6c42abb5457

slave: bld-linux64-spot-303

00:51:25     INFO -  Deleting /builds/slave/m-beta-lx-00000000000000000000
00:51:25     INFO -  Deleting /builds/slave/m-rel-lx-d-0000000000000000000
00:51:25     INFO -  Deleting /builds/slave/oak-and-ntly-00000000000000000
00:51:25     INFO -  Deleting ./scripts
00:51:25     INFO -  Deleting ./logs
00:51:25     INFO -  Error: unable to free 20.00 GB of space. Free space only 13.48 GB
00:51:25    ERROR - Return code: 1
00:51:25    FATAL - failed to purge builds
00:51:25    FATAL - Running post_fatal callback...
00:51:25    FATAL - Exiting 2
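
The purge step shown failing above deletes old build directories until a target amount of free space is reached, and errors out if it can't get there. A minimal sketch of that decision logic (hypothetical function name, not the real purge tool; sizes in GB):

```python
def plan_purge(free_gb, target_gb, candidates):
    """Pick build dirs to delete until free space reaches target_gb.

    candidates: {dirname: size_gb} of deletable build directories.
    Returns (dirs_to_delete, projected_free_gb). If projected_free_gb
    is still below target_gb after deleting everything, the caller
    reports "unable to free N GB of space", as in the log above.
    """
    to_delete, free = [], free_gb
    for name, size in sorted(candidates.items()):
        if free >= target_gb:
            break
        to_delete.append(name)
        free += size
    return to_delete, free
```

For the failing job: even after deleting every candidate, the slave only reached 13.48 GB free, well short of the 20 GB target.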

Maybe the slaves need a bigger disk?
I'm not sure why it could not free the space; it looks like it has 35GB of storage, and 14GB of the 21.5GB used was under /builds, so I would have expected it to work.

I tried to manually run the purge command, but it had already been deleted:

[ ~]$ /builds/slave/b2g_fx-team_emu-jb-d_dep-00000/scripts/external_tools/ -s 20 --max-age 14 --not info --not rel-* --not tb-rel-* /builds/slave
-bash: /builds/slave/b2g_fx-team_emu-jb-d_dep-00000/scripts/external_tools/ No such file or directory

[ /]# du -sh /*
7.7M	/bin
21M	/boot
14G	/builds
4.0K	/cgroup
152K	/dev
6.2M	/etc
160K	/home
109M	/lib
26M	/lib64
16K	/lost+found
4.0K	/media
4.0K	/mnt
9.4M	/opt
du: cannot access `/proc/1503/task/1503/fd/4': No such file or directory
du: cannot access `/proc/1503/task/1503/fdinfo/4': No such file or directory
du: cannot access `/proc/1503/fd/4': No such file or directory
du: cannot access `/proc/1503/fdinfo/4': No such file or directory
0	/proc
84K	/root
13M	/sbin
4.0K	/selinux
4.0K	/srv
4.1G	/swap_file
0	/sys
16M	/tmp
220M	/tools
770M	/usr
161M	/var

Shortly after I connected and ran the above commands, the slave shutdown and was terminated (confirmed by checking in ec2 console).
Occurred again, this time a different slave:

[ ~]# du -sk /* | sort -n
du: cannot access `/proc/1470/task/1470/fd/4': No such file or directory
du: cannot access `/proc/1470/task/1470/fdinfo/4': No such file or directory
du: cannot access `/proc/1470/fd/4': No such file or directory
du: cannot access `/proc/1470/fdinfo/4': No such file or directory
0	/proc
0	/sys
4	/cgroup
4	/media
4	/mnt
4	/selinux
4	/srv
16	/lost+found
80	/root
152	/dev
156	/home
6136	/etc
7380	/bin
9580	/opt
12384	/sbin
16136	/tmp
20877	/boot
26280	/lib64
111320	/lib
161872	/var
225212	/tools
783968	/usr
4194308	/swap_file
14338628	/builds
[ ~]#
Something strange going on here, these totals don't add up...

[ builds]# du -sh $(echo *)
136K	ccache
11G	git-shared
3.6G	hg-shared
32K	mock_mozilla
4.0K	mozilla-api.key
6.3M	slave
4.0K	tooltool_cache
[ builds]# du -sh .
14G	.
[ builds]#
ahh they do - I didn't spot "git-shared"
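
To double-check: the per-directory figures above (each already rounded by `du -sh`) do sum to roughly the reported total for /builds:

```python
# Sizes from the `du -sh` output above, in MB (approximate, since
# du -h had already rounded each value).
builds_mb = {
    "ccache": 136 / 1024.0,         # 136K
    "git-shared": 11 * 1024.0,      # 11G
    "hg-shared": 3.6 * 1024.0,      # 3.6G
    "mock_mozilla": 32 / 1024.0,    # 32K
    "mozilla-api.key": 4 / 1024.0,  # 4.0K
    "slave": 6.3,                   # 6.3M
    "tooltool_cache": 4 / 1024.0,   # 4.0K
}
total_gb = sum(builds_mb.values()) / 1024.0
print(round(total_gb, 1))  # ~14.6, consistent with the rounded "14G" total
```

git-shared alone accounts for 11 of the ~14 GB, which is the piece that was easy to miss.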
So I can't see what change has led to this yet - I'll see if I can find changes to the amount of free space we purge - but since only 13.5 GB is available, and other jobs previously needed 16GB if I recall correctly, I suspect one of the following is the cause:

  1) git shared repos have increased in size because a new one or new ones got added
  2) disk space settings changed on spot instances (we currently have 35GB)
  3) the jacuzzi settings changed, so now these slaves run more types of jobs, so use more git shared or hg shared repos

Clobberer itself is doing a good job - it clears everything out - but still can't reach the 20GB it needs, since the hg shared and git shared directories are so big, so it is not a clobberer bug.
I did not find any Release Engineering bugs landed in the last few days that seem like good candidates for causing this, and I also could not find any recent changes in buildbot-configs, tools, puppet, or mozharness that might explain why this has started happening.

Given the size of the purge we want for builds, the small disk space, and the size of the shared repos, we are very close to the limit of what we can do. However, it surprises me that we are a full 6.5GB short of our target (we can free 13.5GB but want 20GB).

I suspect :nthomas or :catlee will be able to see immediately what changed to push us over the limit - I don't seem to be able to find it easily.

Both jobs were b2g emulator builds, I believe, but on different trees (fx-team and mozilla-inbound):
Flags: needinfo?(nthomas)
Flags: needinfo?(catlee)
Since I did not explicitly summarize it above, I'll do it here:

  * spot instances in this case have 35GB disk available
  * it looks like we need around 14GB just for git shared and hg shared repos
  * we are asking to free 20GB, meaning we leave only 1GB of space for everything else needed on the disk
  * this is not enough!
  * something has changed recently to alter one or more of these numbers
  * above I tried to find out which of these numbers has moved, to explain why "the maths worked" previously but now we are short of space - by as much as 6.5 GB in both of these failed job runs.
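
The budget in those bullets can be checked with simple arithmetic:

```python
disk_gb = 35.0          # spot instance disk
shared_repos_gb = 14.0  # git-shared + hg-shared (from the du output)
purge_target_gb = 20.0  # free space the job asks for

# Space left for everything else (OS, swap file, tools, workdirs)
# if the purge target must be met without touching the shared repos:
leftover_gb = disk_gb - shared_repos_gb - purge_target_gb
print(leftover_gb)      # 1.0 -- "this is not enough!"

# Observed shortfall on the failing slave, which could only free 13.48 GB:
shortfall_gb = purge_target_gb - 13.48
print(round(shortfall_gb, 2))  # 6.52 -- the ~6.5 GB mentioned above
```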

Comment 8

4 years ago
cc'ing sheriffs in case this happens again
CCing rail. We've had a few AWS instances that have had these issues in the past couple days.

Comment 10

4 years ago

Here is another example, this time:

b2g_mozilla-aurora_flame_periodic on 2014-09-19 06:30:55 PDT for push ab2a88c05a4b

slave: b-linux64-hp-0033

So it's not even just the AWS slaves?
Flags: needinfo?(catlee) → needinfo?(rail)

Comment 11

4 years ago
This is something intermittent with lvextend...
Assignee: nobody → rail
Depends on: 1069561
Flags: needinfo?(rail)
Flags: needinfo?(nthomas)

Comment 12

4 years ago
I think I found the issue.

AWS shows 2 ephemeral devices via API, but we have only one:

$ curl

$ blkid
/dev/xvda1: UUID="ed219abd-9757-4308-82d8-501046eadccc" TYPE="ext2" 
/dev/xvda2: UUID="7tdAmc-b480-A36E-MIkn-Ut5g-n9U5-MnON2h" TYPE="LVM2_member" 
/dev/mapper/cloud_root-lv_root: LABEL="root_dev" UUID="1ee91aee-09d0-449a-ba61-8a71685f5494" TYPE="ext4" 
/dev/xvdb: UUID="6ameZS-IweR-bJLX-DKgy-uVpV-DJID-StrOtS" TYPE="LVM2_member"
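
A check for this kind of mismatch would compare what the metadata API reports against the device nodes actually present. A sketch (the names and the translation are assumptions; EC2 reports `sd*` device names that show up as `xvd*` on these kernels):

```python
def missing_devices(reported, present,
                    translate=lambda d: d.replace("sd", "xvd")):
    """Return metadata-reported devices with no matching node.

    reported: ephemeral device names from the instance metadata API,
              e.g. ["sdb", "sdc"] (fetched with curl as above).
    present:  device names that actually exist, e.g. derived from
              os.listdir("/dev") or the blkid output.
    """
    return [d for d in (translate(r) for r in reported)
            if d not in present]
```

On the broken instance this would flag the second ephemeral device the API advertises but the kernel never got.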

Comment 13

4 years ago
Just found another instance trying to recover itself in endless runner limbo:

[ ~]# pvs
  WARNING: Volume Group cloud_root is not consistent
  PV         VG         Fmt  Attr PSize  PFree 
  /dev/xvda2 cloud_root lvm2 a--  34.94g     0 
  /dev/xvdb  cloud_root lvm2 a--  75.00g 75.00g
[ ~]# lvs
  LV      VG         Attr   LSize  Origin Snap%  Move Log Copy%  Convert
  lv_root cloud_root -wi-ao 34.94g 

The warning must be the issue - the VG shows 75GB of free space on /dev/xvdb that was never added to lv_root.

Comment 15

4 years ago
Created attachment 8492641 [details] [diff] [review]

Let's purge 20G on build slaves, so we leave the broken slaves in the reboot limbo instead of burning builds.

This is not the final solution, just moves the problem from one stage to another.
Attachment #8492641 - Flags: review?(bhearsum)
Attachment #8492641 - Flags: review?(bhearsum) → review+

Comment 18

4 years ago
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #17)

fixed bld-linux64-spot-440 manually

Comment 19

4 years ago
Created attachment 8493328 [details] [diff] [review]

* Adds syslog logging
* Uses "-l 100%VG" instead of "-l 100%FREE", which wasn't working in all cases
* Just in case, added a function to fix inconsistent LVM volumes
* tested on at least 5 broken instances
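
The shape of that recovery sequence, per the description above (a sketch only, not the actual patch; the `vgck` repair step and the filesystem resize are my assumptions, and nothing is executed here):

```python
def lvm_recovery_commands(vg="cloud_root", lv="lv_root", pv_free_gb=75.0):
    """Build the commands to grow lv_root over the whole volume group.

    pv_free_gb stands in for query_pv_free_size(): if the VG has
    unallocated extents (75 GB on /dev/xvdb in the pvs output above),
    repair consistency first, then extend. "-l 100%VG" targets the
    entire volume group, unlike "-l 100%FREE" which is relative to
    whatever free extents lvextend sees and wasn't working in all cases.
    """
    cmds = []
    if pv_free_gb > 0:
        cmds.append(["vgck", vg])  # stand-in for maybe_fix_lvm_devices()
        cmds.append(["lvextend", "-l", "100%VG", "/dev/%s/%s" % (vg, lv)])
        # assumed follow-up: grow the ext4 filesystem to match the LV
        cmds.append(["resize2fs", "/dev/mapper/%s-%s" % (vg, lv)])
    return cmds
```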
Attachment #8493328 - Flags: review?(mgervasini)
Comment on attachment 8493328 [details] [diff] [review]

Looks good! Could you just add a docstring for query_pv_free_size and maybe_fix_lvm_devices functions?
Attachment #8493328 - Flags: review?(mgervasini) → review+

Comment 21

4 years ago
Comment on attachment 8493328 [details] [diff] [review]


with docstrings added

I'm going to regenerate AMIs manually.
Attachment #8493328 - Flags: checked-in+

Comment 22

4 years ago
Just recovered 4 builders from the reboot limbo.


4 years ago
Summary: AWS Linux Slave failed with failed to purge builds - Error: unable to free 20.00 GB of space. → AWS Linux builders fail to allocate LVM free space


4 years ago
Blocks: 1068025

Comment 23

4 years ago
build/try AMIs have been refreshed. Assuming it's fixed now.
Last Resolved: 4 years ago
Resolution: --- → FIXED

Comment 25

4 years ago
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #24)
> v2.0

That was based on an AMI from yesterday. Fixed manually.
Resolution: FIXED → ---

Comment 27

4 years ago
(In reply to Carsten Book [:Tomcat] from comment #26)
> Rail could you look at this 2 also ? thanks!

Thanks Pete!

FTR, those were in-house builders, not related to the LVM issue, just a regular out-of-space problem.
Flags: needinfo?(rail)

Comment 29

4 years ago
I'm going to close the bug since the last 3 slaves are in-house and their issues are not related to the initial one. We can track those in the slave bugs or we can open a new bug to address the issue globally.


4 years ago
Last Resolved: 4 years ago
Resolution: --- → FIXED


9 months ago
Product: Release Engineering → Infrastructure & Operations