b2g_fx-team_emulator-jb-debug_dep on 2014-09-19 00:51:19 PDT for push f6c42abb5457
slave: bld-linux64-spot-303
https://tbpl.mozilla.org/php/getParsedLog.php?id=48446342&tree=Fx-Team

00:51:25 INFO - Deleting /builds/slave/m-beta-lx-00000000000000000000
00:51:25 INFO - Deleting /builds/slave/m-rel-lx-d-0000000000000000000
00:51:25 INFO - Deleting /builds/slave/oak-and-ntly-00000000000000000
00:51:25 INFO - Deleting ./scripts
00:51:25 INFO - Deleting ./logs
00:51:25 INFO - Error: unable to free 20.00 GB of space. Free space only 13.48 GB
00:51:25 ERROR - Return code: 1
00:51:25 FATAL - failed to purge builds
00:51:25 FATAL - Running post_fatal callback...
00:51:25 FATAL - Exiting 2

Maybe the slaves need a bigger disk?
I'm not sure why it could not free the space; the slave appears to have 35GB of storage, and 14GB of the 21.5GB used was under /builds, so I would have expected the purge to succeed. I tried to run the purge command manually, but the script had already been deleted:

[firstname.lastname@example.org ~]$ /builds/slave/b2g_fx-team_emu-jb-d_dep-00000/scripts/external_tools/purge_builds.py -s 20 --max-age 14 --not info --not rel-* --not tb-rel-* /builds/slave
-bash: /builds/slave/b2g_fx-team_emu-jb-d_dep-00000/scripts/external_tools/purge_builds.py: No such file or directory

[email@example.com /]# du -sh /*
7.7M    /bin
21M     /boot
14G     /builds
4.0K    /cgroup
152K    /dev
6.2M    /etc
160K    /home
109M    /lib
26M     /lib64
16K     /lost+found
4.0K    /media
4.0K    /mnt
9.4M    /opt
du: cannot access `/proc/1503/task/1503/fd/4': No such file or directory
du: cannot access `/proc/1503/task/1503/fdinfo/4': No such file or directory
du: cannot access `/proc/1503/fd/4': No such file or directory
du: cannot access `/proc/1503/fdinfo/4': No such file or directory
0       /proc
4.0K    /REBOOT_AFTER_PUPPET
84K     /root
13M     /sbin
4.0K    /selinux
4.0K    /srv
4.1G    /swap_file
0       /sys
16M     /tmp
220M    /tools
770M    /usr
161M    /var

Shortly after I connected and ran the above commands, the slave shut down and was terminated (confirmed by checking the EC2 console).
Occurred again, this time on a different slave:

[firstname.lastname@example.org ~]# du -sk /* | sort -n
du: cannot access `/proc/1470/task/1470/fd/4': No such file or directory
du: cannot access `/proc/1470/task/1470/fdinfo/4': No such file or directory
du: cannot access `/proc/1470/fd/4': No such file or directory
du: cannot access `/proc/1470/fdinfo/4': No such file or directory
0          /proc
0          /sys
4          /cgroup
4          /media
4          /mnt
4          /REBOOT_AFTER_PUPPET
4          /selinux
4          /srv
16         /lost+found
80         /root
152        /dev
156        /home
6136       /etc
7380       /bin
9580       /opt
12384      /sbin
16136      /tmp
20877      /boot
26280      /lib64
111320     /lib
161872     /var
225212     /tools
783968     /usr
4194308    /swap_file
14338628   /builds
[email@example.com ~]#
Something strange going on here, these totals don't add up...

[firstname.lastname@example.org builds]# du -sh $(echo *)
136K    ccache
4.0K    gapi.data
11G     git-shared
3.6G    hg-shared
32K     mock_mozilla
4.0K    mozilla-api.key
6.3M    slave
4.0K    tooltool_cache
0       tooltool.py
[email@example.com builds]# du -sh .
14G     .
[firstname.lastname@example.org builds]#
ahh they do - I didn't spot "git-shared"
I can't yet see what change has led to this - I'll look for changes to the amount of free space we ask to purge - but since only 13.5GB can be freed, and other jobs previously needed 16GB if I recall correctly, I suspect one of the following is the cause:

1) The git shared repos have grown in size, e.g. because one or more new ones were added.
2) The disk space settings on spot instances changed (we currently have 35GB).
3) The jacuzzi settings changed, so these slaves now run more types of jobs and therefore use more git-shared or hg-shared repos.

Clobberer itself is doing a good job - it clears everything out - but still can't reach the 20GB it needs, since the hg-shared and git-shared directories are so big, so this is not a clobberer bug.
I did not find any Release Engineering bugs landed in the last few days that look like good candidates for causing this, nor any recent changes in buildbot-configs, tools, puppet, or mozharness that might explain why this has started happening. Given the size of the purge we want for builds, the small disk, and the size of the shared repos, we are very close to the limit of what we can do. However, it surprises me that we are a full 6.5GB short of our target (we can free 13.5GB but want 20GB). I suspect :nthomas or :catlee will be able to say immediately what changed to push us over the limit - I can't find it easily. Both jobs were b2g emulator builds, I believe, but on different trees (Fx-Team and Mozilla-Inbound):

* https://tbpl.mozilla.org/php/getParsedLog.php?id=48446342&tree=Fx-Team
* https://tbpl.mozilla.org/php/getParsedLog.php?id=48452285&tree=Mozilla-Inbound
Since I did not explicitly summarize it above, I'll do it here (rough arithmetic sketched below):

* Spot instances in this case have 35GB of disk available.
* We appear to need around 14GB just for the git-shared and hg-shared repos.
* We ask to free 20GB, which leaves only about 1GB for everything else needed on the disk.
* This is not enough!
* Something has changed recently to alter one or more of these numbers.
* Above I tried to find out which of these numbers has moved, to explain why previously "the maths worked" when now there is not enough space - we fell short by as much as 6.5GB in both of these failed job runs.
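To make the shortfall concrete, here is the back-of-the-envelope version of those numbers (using the round figures above, not exact measurements):

$ echo $(( 35 - 14 - 20 ))    # total disk - shared repos - requested purge, in GB
1

and the observed shortfall matches: the purge asks for 20GB but only ~13.5GB is actually freeable, i.e. roughly 6.5GB short.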
cc'ing sheriffs in case this happens more times
CCing rail. We've had a few AWS instances that have had these issues in the past couple days.
https://tbpl.mozilla.org/php/getParsedLog.php?id=48462070&tree=Mozilla-Aurora is another example, this time b2g_mozilla-aurora_flame_periodic on 2014-09-19 06:30:55 PDT for push ab2a88c05a4b:

slave: b-linux64-hp-0033

So it's not even just the AWS slaves?
This is something intermittent with lvextend...
I think I found the issue. AWS shows 2 ephemeral devices via the API, but we have only one:

$ curl http://169.254.169.254/latest/meta-data/block-device-mapping/
ami
ephemeral0
ephemeral1

$ blkid
/dev/xvda1: UUID="ed219abd-9757-4308-82d8-501046eadccc" TYPE="ext2"
/dev/xvda2: UUID="7tdAmc-b480-A36E-MIkn-Ut5g-n9U5-MnON2h" TYPE="LVM2_member"
/dev/mapper/cloud_root-lv_root: LABEL="root_dev" UUID="1ee91aee-09d0-449a-ba61-8a71685f5494" TYPE="ext4"
/dev/xvdb: UUID="6ameZS-IweR-bJLX-DKgy-uVpV-DJID-StrOtS" TYPE="LVM2_member"
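A quick way to spot this mismatch on a booted instance would be something along these lines (a hypothetical check, not part of the actual setup code; it assumes the usual sdX -> xvdX device renaming on these Xen-based instances):

META=http://169.254.169.254/latest/meta-data/block-device-mapping
for m in $(curl -s $META/ | grep '^ephemeral'); do
    # metadata reports names like "sdb"; the kernel exposes them as /dev/xvdb
    dev=/dev/$(curl -s $META/$m | sed 's/^sd/xvd/')
    if [ -b "$dev" ]; then echo "$m -> $dev present"; else echo "$m -> $dev MISSING"; fi
done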
Just found another instance trying to recover itself, stuck in endless runner limbo:

[email@example.com ~]# pvs
  WARNING: Volume Group cloud_root is not consistent
  PV         VG         Fmt  Attr PSize  PFree
  /dev/xvda2 cloud_root lvm2 a--  34.94g     0
  /dev/xvdb  cloud_root lvm2 a--  75.00g 75.00g
[firstname.lastname@example.org ~]# lvs
  LV      VG         Attr   LSize  Origin Snap%  Move Log Copy%  Convert
  lv_root cloud_root -wi-ao 34.94g

The warning must be the issue.
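For reference, manual recovery on an instance in this state would look roughly like the following (a sketch assuming the cloud_root/lv_root names and ext4 root filesystem shown above; not the exact commands used on these slaves):

# report VG metadata inconsistencies
vgck cloud_root
# grow the root LV into the 75GB of free space sitting unused on /dev/xvdb
lvextend -l +100%FREE /dev/cloud_root/lv_root
# grow the ext4 filesystem to match (can be done online)
resize2fs /dev/mapper/cloud_root-lv_root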
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?name=bld-linux64-spot-199
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?name=bld-linux64-spot-455
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?name=bld-linux64-spot-139
Created attachment 8492641 [details] [diff] [review]
20g.diff

Let's purge 20G on build slaves, so we leave the broken slaves in the reboot limbo instead of burning builds. This is not the final solution; it just moves the problem from one stage to another.
Comment on attachment 8492641 [details] [diff] [review]
20g.diff

remote:   https://hg.mozilla.org/build/puppet/rev/504e0ec41685
remote:   https://hg.mozilla.org/build/puppet/rev/1693b630c065
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #17)
> https://tbpl.mozilla.org/php/getParsedLog.php?id=48589613&tree=Mozilla-Aurora

Fixed bld-linux64-spot-440 manually.
Created attachment 8493328 [details] [diff] [review]
syslog-puppet.diff

* Adds syslog logging
* Uses "-l 100%VG" instead of "-l 100%FREE", which wasn't working in all cases
* Just in case, adds a function to fix inconsistent LVM volumes
* Tested on at least 5 broken instances
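For context on the second bullet, this is the difference between the two lvextend forms (illustrative commands using the VG/LV names from the comments above, not an excerpt from the patch itself):

# old form: extend by whatever extents happen to be free right now
# (per the notes above, this "wasn't working in all cases")
lvextend -l +100%FREE /dev/cloud_root/lv_root

# new form: size the LV relative to the whole volume group,
# which does not depend on how much space happens to be free at the time
lvextend -l 100%VG /dev/cloud_root/lv_root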
Comment on attachment 8493328 [details] [diff] [review]
syslog-puppet.diff

Looks good! Could you just add docstrings for the query_pv_free_size and maybe_fix_lvm_devices functions?
Comment on attachment 8493328 [details] [diff] [review]
syslog-puppet.diff

remote:   https://hg.mozilla.org/build/puppet/rev/262f22c51330
remote:   https://hg.mozilla.org/build/puppet/rev/e96dd1ea689b

Landed with the docstrings added. I'm going to regenerate the AMIs manually.
Just recovered 4 builders from the reboot limbo.
build/try AMIs have been refreshed. Assuming it's fixed now.
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #24)
> https://tbpl.mozilla.org/php/getParsedLog.php?id=48724828&tree=Mozilla-B2g32-v2.0

That was based on an AMI from yesterday. Fixed manually.
https://tbpl.mozilla.org/php/getParsedLog.php?id=48749623&tree=B2g-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=48749976&tree=B2g-Inbound

Rail, could you look at these 2 also? Thanks!
(In reply to Carsten Book [:Tomcat] from comment #26)
> https://tbpl.mozilla.org/php/getParsedLog.php?id=48749623&tree=B2g-Inbound
> https://tbpl.mozilla.org/php/getParsedLog.php?id=48749976&tree=B2g-Inbound
>
> Rail, could you look at these 2 also? Thanks!

Thanks Pete! FTR, those were in-house builders, not related to the LVM issue - just a regular out-of-space problem.
https://tbpl.mozilla.org/php/getParsedLog.php?id=48760788&tree=Mozilla-Inbound

slave: b-linux64-hp-0027
I'm going to close the bug since the last 3 slaves are in-house and their issues are not related to the initial one. We can track those in the slave bugs or we can open a new bug to address the issue globally.