Closed
Bug 1069811
Opened 10 years ago
Closed 10 years ago
AWS Linux builders fail to allocate LVM free space
Categories
(Infrastructure & Operations Graveyard :: CIDuty, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: cbook, Assigned: rail)
Attachments
(2 files)
20g.diff: 910 bytes, patch; bhearsum: review+, rail: checked-in+
syslog-puppet.diff: 6.18 KB, patch; massimo: review+, rail: checked-in+
b2g_fx-team_emulator-jb-debug_dep on 2014-09-19 00:51:19 PDT for push f6c42abb5457
slave: bld-linux64-spot-303
https://tbpl.mozilla.org/php/getParsedLog.php?id=48446342&tree=Fx-Team
00:51:25 INFO - Deleting /builds/slave/m-beta-lx-00000000000000000000
00:51:25 INFO - Deleting /builds/slave/m-rel-lx-d-0000000000000000000
00:51:25 INFO - Deleting /builds/slave/oak-and-ntly-00000000000000000
00:51:25 INFO - Deleting ./scripts
00:51:25 INFO - Deleting ./logs
00:51:25 INFO - Error: unable to free 20.00 GB of space. Free space only 13.48 GB
00:51:25 ERROR - Return code: 1
00:51:25 FATAL - failed to purge builds
00:51:25 FATAL - Running post_fatal callback...
00:51:25 FATAL - Exiting 2
Maybe the slaves need a bigger disk?
Comment 1•10 years ago
I'm not sure why it could not free the space: it looks like it has 35GB of storage, and 14GB of the 21.5GB used was under /builds, so I would have expected it to work.
I tried to manually run the purge command, but the script had already been deleted:
[cltbld@bld-linux64-spot-303.build.releng.usw2.mozilla.com ~]$ /builds/slave/b2g_fx-team_emu-jb-d_dep-00000/scripts/external_tools/purge_builds.py -s 20 --max-age 14 --not info --not rel-* --not tb-rel-* /builds/slave
-bash: /builds/slave/b2g_fx-team_emu-jb-d_dep-00000/scripts/external_tools/purge_builds.py: No such file or directory
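(For context, purge_builds.py essentially deletes the oldest build directories under the slave dir until the requested amount of space is free. A rough illustrative sketch of that logic only - not the actual tool, and the helper names are made up:)

import os
import shutil

GB = 1024 ** 3

def free_space_gb(path):
    st = os.statvfs(path)
    return st.f_bavail * st.f_frsize / float(GB)

def purge_until_free(base, want_gb, keep=("info",)):
    # Candidate build dirs, oldest first by mtime; skip protected names.
    dirs = [os.path.join(base, d) for d in os.listdir(base)
            if d not in keep and os.path.isdir(os.path.join(base, d))]
    dirs.sort(key=os.path.getmtime)
    for d in dirs:
        if free_space_gb(base) >= want_gb:
            return True
        shutil.rmtree(d, ignore_errors=True)
    return free_space_gb(base) >= want_gb

# purge_until_free("/builds/slave", 20) returning False corresponds to the
# "unable to free 20.00 GB" failure above: there is nothing left to delete.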
[root@bld-linux64-spot-303.build.releng.usw2.mozilla.com /]# du -sh /*
7.7M /bin
21M /boot
14G /builds
4.0K /cgroup
152K /dev
6.2M /etc
160K /home
109M /lib
26M /lib64
16K /lost+found
4.0K /media
4.0K /mnt
9.4M /opt
du: cannot access `/proc/1503/task/1503/fd/4': No such file or directory
du: cannot access `/proc/1503/task/1503/fdinfo/4': No such file or directory
du: cannot access `/proc/1503/fd/4': No such file or directory
du: cannot access `/proc/1503/fdinfo/4': No such file or directory
0 /proc
4.0K /REBOOT_AFTER_PUPPET
84K /root
13M /sbin
4.0K /selinux
4.0K /srv
4.1G /swap_file
0 /sys
16M /tmp
220M /tools
770M /usr
161M /var
Shortly after I connected and ran the above commands, the slave shut down and was terminated (confirmed in the EC2 console).
Comment 2•10 years ago
Occurred again, this time a different slave:
[root@bld-linux64-spot-1005.build.releng.use1.mozilla.com ~]# du -sk /* | sort -n
du: cannot access `/proc/1470/task/1470/fd/4': No such file or directory
du: cannot access `/proc/1470/task/1470/fdinfo/4': No such file or directory
du: cannot access `/proc/1470/fd/4': No such file or directory
du: cannot access `/proc/1470/fdinfo/4': No such file or directory
0 /proc
0 /sys
4 /cgroup
4 /media
4 /mnt
4 /REBOOT_AFTER_PUPPET
4 /selinux
4 /srv
16 /lost+found
80 /root
152 /dev
156 /home
6136 /etc
7380 /bin
9580 /opt
12384 /sbin
16136 /tmp
20877 /boot
26280 /lib64
111320 /lib
161872 /var
225212 /tools
783968 /usr
4194308 /swap_file
14338628 /builds
[root@bld-linux64-spot-1005.build.releng.use1.mozilla.com ~]#
Comment 3•10 years ago
Something strange is going on here - these totals don't add up...
[root@bld-linux64-spot-1005.build.releng.use1.mozilla.com builds]# du -sh $(echo *)
136K ccache
4.0K gapi.data
11G git-shared
3.6G hg-shared
32K mock_mozilla
4.0K mozilla-api.key
6.3M slave
4.0K tooltool_cache
0 tooltool.py
[root@bld-linux64-spot-1005.build.releng.use1.mozilla.com builds]# du -sh .
14G .
[root@bld-linux64-spot-1005.build.releng.use1.mozilla.com builds]#
Comment 4•10 years ago
ahh they do - I didn't spot "git-shared"
Comment 5•10 years ago
So I can't yet see what change has led to this - I'll see if I can find changes to the amount of free space we purge - but since only 13.5GB is available, and other jobs previously needed 16GB if I recall correctly, I suspect one of the following is the cause:
1) the git shared repos have increased in size because one or more new ones got added
2) the disk space settings on spot instances changed (we currently have 35GB)
3) the jacuzzi settings changed, so these slaves now run more types of jobs and therefore use more git shared or hg shared repos
Clobberer itself is doing a good job - it clears everything out - but still can't reach the 20GB it needs because the hg shared and git shared directories are so big, so this is not a clobberer bug.
Comment 6•10 years ago
I did not find any Release Engineering bugs landed in the last few days that look like good candidates for causing this, and I also could not find any recent changes in buildbot-configs, tools, puppet, or mozharness that might explain why this has started happening.
Given the size of the purge we want for builds, the small disk, and the size of the shared repos, we are very close to the limit of what we can do. However, it surprises me that we are a full 6.5GB short of our target (we can free only 13.5GB but want 20GB).
I suspect :nthomas or :catlee will be able to see immediately what changed to push us over the limit - I can't seem to find it easily.
Both jobs were b2g emulator builds I believe, but on different trees (fx-team and mozilla-inbound):
* https://tbpl.mozilla.org/php/getParsedLog.php?id=48446342&tree=Fx-Team
* https://tbpl.mozilla.org/php/getParsedLog.php?id=48452285&tree=Mozilla-Inbound
Flags: needinfo?(nthomas)
Flags: needinfo?(catlee)
Comment 7•10 years ago
Since I did not explicitly summarize it above, I'll do it here:
* spot instances in this case have 35GB of disk available
* it looks like we need around 14GB just for the git shared and hg shared repos
* we are asking to free 20GB, meaning we leave only about 1GB for everything else needed on the disk
* this is not enough!
* something has changed recently to alter one or more of these numbers
* above I tried to find out which of these numbers has moved, to explain why "the maths worked" previously but we are now short by as much as 6.5GB in both of these failed job runs.
Reporter
Comment 8•10 years ago
cc'ing sheriffs in case this happens again
Comment 9•10 years ago
CCing rail. We've had a few AWS instances that have had these issues in the past couple days.
Reporter
Comment 10•10 years ago
https://tbpl.mozilla.org/php/getParsedLog.php?id=48462070&tree=Mozilla-Aurora is another example, this time
b2g_mozilla-aurora_flame_periodic on 2014-09-19 06:30:55 PDT for push ab2a88c05a4b
slave: b-linux64-hp-0033
so it's not even just the AWS slaves?
Updated•10 years ago
Flags: needinfo?(catlee) → needinfo?(rail)
Assignee
Comment 11•10 years ago
This is something intermittent with lvextend...
Assignee
Comment 12•10 years ago
I think I found the issue.
AWS shows 2 ephemeral devices via API, but we have only one:
$ curl http://169.254.169.254/latest/meta-data/block-device-mapping/
ami
ephemeral0
ephemeral1
$ blkid
/dev/xvda1: UUID="ed219abd-9757-4308-82d8-501046eadccc" TYPE="ext2"
/dev/xvda2: UUID="7tdAmc-b480-A36E-MIkn-Ut5g-n9U5-MnON2h" TYPE="LVM2_member"
/dev/mapper/cloud_root-lv_root: LABEL="root_dev" UUID="1ee91aee-09d0-449a-ba61-8a71685f5494" TYPE="ext4"
/dev/xvdb: UUID="6ameZS-IweR-bJLX-DKgy-uVpV-DJID-StrOtS" TYPE="LVM2_member"
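A quick way to confirm the mismatch is to compare what the metadata service advertises with the block devices that actually exist. A minimal sketch, assuming the standard metadata endpoint and the usual sd* to xvd* device renaming on these instances:

import os
import urllib.request

META = "http://169.254.169.254/latest/meta-data/block-device-mapping/"

def meta(path=""):
    # The metadata service returns one entry per line.
    with urllib.request.urlopen(META + path, timeout=2) as resp:
        return resp.read().decode().split()

for mapping in meta():
    if not mapping.startswith("ephemeral"):
        continue
    # e.g. "sdb"; assume the sd* -> xvd* rename used by these kernels.
    name = meta(mapping)[0]
    dev = "/dev/" + name.replace("sd", "xvd", 1)
    status = "present" if os.path.exists(dev) else "MISSING"
    print(mapping, "->", dev, status)

On the instance above this would flag ephemeral1 as MISSING even though the API lists it.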
Assignee
Comment 13•10 years ago
Just found another instance trying to recover itself in endless runner limbo:
[root@bld-linux64-spot-358.build.releng.usw2.mozilla.com ~]# pvs
WARNING: Volume Group cloud_root is not consistent
PV VG Fmt Attr PSize PFree
/dev/xvda2 cloud_root lvm2 a-- 34.94g 0
/dev/xvdb cloud_root lvm2 a-- 75.00g 75.00g
[root@bld-linux64-spot-358.build.releng.usw2.mozilla.com ~]# lvs
LV VG Attr LSize Origin Snap% Move Log Copy% Convert
lv_root cloud_root -wi-ao 34.94g
The warning must be the issue.
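For the record, the unallocated space is easy to detect programmatically. A minimal sketch using standard pvs reporting options (the helper name is illustrative, not necessarily what the eventual patch uses):

import subprocess

def query_pv_free_gb():
    # Map each physical volume to (VG name, free GB).
    out = subprocess.check_output(
        ["pvs", "--noheadings", "--nosuffix", "--units", "g",
         "-o", "pv_name,vg_name,pv_free"]).decode()
    result = {}
    for line in out.splitlines():
        fields = line.split()
        if len(fields) == 3:
            result[fields[0]] = (fields[1], float(fields[2]))
    return result

# On the instance above this would report roughly:
# {'/dev/xvda2': ('cloud_root', 0.0), '/dev/xvdb': ('cloud_root', 75.0)}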
Comment 14•10 years ago
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?name=bld-linux64-spot-199
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?name=bld-linux64-spot-455
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?name=bld-linux64-spot-139
Assignee
Comment 15•10 years ago
Let's purge 20G on build slaves, so we leave the broken slaves in reboot limbo instead of burning builds.
This is not the final solution; it just moves the problem from one stage to another.
Attachment #8492641 - Flags: review?(bhearsum)
Updated•10 years ago
Attachment #8492641 - Flags: review?(bhearsum) → review+
Assignee
Comment 16•10 years ago
Comment on attachment 8492641 [details] [diff] [review]
20g.diff
remote: https://hg.mozilla.org/build/puppet/rev/504e0ec41685
remote: https://hg.mozilla.org/build/puppet/rev/1693b630c065
Attachment #8492641 - Flags: checked-in+
Comment 17•10 years ago
https://tbpl.mozilla.org/php/getParsedLog.php?id=48589613&tree=Mozilla-Aurora
Assignee
Comment 18•10 years ago
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #17)
> https://tbpl.mozilla.org/php/getParsedLog.php?id=48589613&tree=Mozilla-Aurora
fixed bld-linux64-spot-440 manually
Assignee
Comment 19•10 years ago
* Adds syslog logging
* Uses "-l 100%VG" instead of "-l 100%FREE", which wasn't working in all cases (rough sketch of the resulting extend step below)
* Just in case, adds a function to fix inconsistent LVM volumes
* Tested on at least 5 broken instances
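Roughly what the extend step boils down to, as a sketch only: it assumes the cloud_root/lv_root naming from comment 13 and an ext4 root filesystem, and the actual puppet script also handles the syslog logging and metadata repair mentioned above:

import subprocess

def run(cmd):
    print("running:", " ".join(cmd))
    subprocess.check_call(cmd)

def grow_root_lv(vg="cloud_root", lv="lv_root"):
    # "-l 100%VG" sizes the LV to the whole volume group, which (unlike
    # "-l 100%FREE") also does the right thing when a previous partial
    # extend already left the VG in an odd state.
    run(["lvextend", "-l", "100%VG", "%s/%s" % (vg, lv)])
    # Grow the (mounted) ext4 filesystem to match the new LV size.
    run(["resize2fs", "/dev/mapper/%s-%s" % (vg, lv)])

# grow_root_lv() would pick up the 75G currently sitting unused on /dev/xvdb.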
Attachment #8493328 - Flags: review?(mgervasini)
Comment 20•10 years ago
Comment on attachment 8493328 [details] [diff] [review]
syslog-puppet.diff
Looks good! Could you just add docstrings for the query_pv_free_size and maybe_fix_lvm_devices functions?
Attachment #8493328 - Flags: review?(mgervasini) → review+
Assignee
Comment 21•10 years ago
Comment on attachment 8493328 [details] [diff] [review]
syslog-puppet.diff
remote: https://hg.mozilla.org/build/puppet/rev/262f22c51330
remote: https://hg.mozilla.org/build/puppet/rev/e96dd1ea689b
with docstrings added
I'm going to regenerate AMIs manually.
Attachment #8493328 - Flags: checked-in+
Assignee
Comment 22•10 years ago
Just recovered 4 builders from the reboot limbo.
Assignee
Updated•10 years ago
Summary: AWS Linux Slave failed with failed to purge builds - Error: unable to free 20.00 GB of space. → AWS Linux builders fail to allocate LVM free space
Assignee
Comment 23•10 years ago
build/try AMIs have been refreshed. Assuming it's fixed now.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Comment 24•10 years ago
https://tbpl.mozilla.org/php/getParsedLog.php?id=48724828&tree=Mozilla-B2g32-v2.0
Assignee
Comment 25•10 years ago
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #24)
> https://tbpl.mozilla.org/php/getParsedLog.php?id=48724828&tree=Mozilla-B2g32-v2.0
That was based on an AMI from yesterday. Fixed manually.
Reporter
Comment 26•10 years ago
https://tbpl.mozilla.org/php/getParsedLog.php?id=48749623&tree=B2g-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=48749976&tree=B2g-Inbound
Rail, could you look at these 2 also? Thanks!
Flags: needinfo?(rail)
Updated•10 years ago
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Assignee
Comment 27•10 years ago
(In reply to Carsten Book [:Tomcat] from comment #26)
> https://tbpl.mozilla.org/php/getParsedLog.php?id=48749623&tree=B2g-Inbound
> https://tbpl.mozilla.org/php/getParsedLog.php?id=48749976&tree=B2g-Inbound
>
> Rail could you look at this 2 also ? thanks!
Thanks Pete!
FTR, those were in-house builders, not related to the LVM issue, just a regular out-of-space problem.
Flags: needinfo?(rail)
Reporter
Comment 28•10 years ago
https://tbpl.mozilla.org/php/getParsedLog.php?id=48760788&tree=Mozilla-Inbound
slave: b-linux64-hp-0027
Assignee
Comment 29•10 years ago
I'm going to close the bug since the last 3 slaves are in-house and their issues are not related to the initial one. We can track those in the slave bugs or we can open a new bug to address the issue globally.
Assignee
Updated•10 years ago
Status: REOPENED → RESOLVED
Closed: 10 years ago → 10 years ago
Resolution: --- → FIXED
Updated•7 years ago
Product: Release Engineering → Infrastructure & Operations
Updated•5 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard