Disk full on tst-linux64-spot-757:

[ integration]$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvda1       15G   15G   82M 100% /
udev            1.9G  4.0K  1.9G   1% /dev
tmpfs           751M  656K  750M   1% /run
none            5.0M     0  5.0M   0% /run/lock
none            1.9G  4.8M  1.9G   1% /run/shm

Biggest "eaters":

[ ~]$ du -sk /* 2>/dev/null | sort -n | tail -5
337212	/tools
802088	/var
2420164	/usr
3147704	/home
7199976	/builds

Relatively big:
[ .android]$ du -sh /home/cltbld/.android/avd
2.6G	/home/cltbld/.android/avd

Three gaia repositories are checked out, consuming 5.1G:
[ test]$ cd /builds/hg-shared/integration/
[ integration]$ du -sk * | sort -n
1425320	gaia-1_2
1901980	gaia-1_4
1920616	gaia-central

[ integration]$ du -sh /builds/hg-shared/integration/
5.1G	/builds/hg-shared/integration/

A random spot (tst-linux64-spot-248) without 100% disk usage:
[ ~]$ du -sk /* 2>/dev/null | sort -n | tail -5
337212	/tools
801104	/var
2420164	/usr
3350764	/home
5182808	/builds

However, it is already at 88% usage, with only 1.9G available:

[ ~]$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvda1       15G   13G  1.9G  88% /
udev            1.9G  4.0K  1.9G   1% /dev
tmpfs           751M  652K  750M   1% /run
none            5.0M     0  5.0M   0% /run/lock
none            1.9G  5.7M  1.9G   1% /run/shm
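
A quick way to flag instances that are getting close to full, as this one is, would be to compare usage against a threshold. A minimal sketch using only Python's standard library; the 85% threshold and the function name are assumptions for illustration, not taken from any existing slave tooling:

```python
import shutil


def root_usage_percent(path="/"):
    """Return used space on the filesystem holding `path`, as a percentage."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


# Hypothetical warning threshold; the 88% seen on spot-248 would trip it.
WARN_THRESHOLD = 85.0

pct = root_usage_percent("/")
if pct >= WARN_THRESHOLD:
    print(f"WARNING: / is {pct:.0f}% full")
else:
    print(f"/ is {pct:.0f}% full")
```

Note that df derives Use% from used / (used + avail), so its figure can differ by a few points from used / total when blocks are reserved for root.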

This spot instance does *not* have the 1.9G gaia-1_4 repo checked out!

[ integration]$ ls -ltr
total 8
drwxrwxr-x 19 cltbld cltbld 4096 Mar 11 18:00 gaia-central
drwxrwxr-x 19 cltbld cltbld 4096 Mar 23 02:17 gaia-1_2

In other words, it looks like as soon as a spot has gaia-central, gaia-1_2 *and* gaia-1_4 cloned, we are going to hit this issue!
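
The numbers above support this. A rough sketch of the arithmetic, using the `du -sk` figures from spot-757 and the rounded 1.9G Avail value from spot-248 (so the result is only approximate; the helper name is illustrative):

```python
KIB_PER_GIB = 1024 * 1024


def parse_du_sk(text):
    """Parse `du -sk` output into a {path: size_in_KiB} dict."""
    sizes = {}
    for line in text.strip().splitlines():
        size_kib, path = line.split(None, 1)
        sizes[path.strip()] = int(size_kib)
    return sizes


# du -sk output from /builds/hg-shared/integration on spot-757.
du_output = """
1425320 gaia-1_2
1901980 gaia-1_4
1920616 gaia-central
"""

repos = parse_du_sk(du_output)
total_gib = sum(repos.values()) / KIB_PER_GIB

# Avail on spot-248 as reported by df -h (rounded, so this is an estimate).
avail_gib_spot_248 = 1.9
remaining_gib = avail_gib_spot_248 - repos["gaia-1_4"] / KIB_PER_GIB

print(f"all three repos: {total_gib:.2f} GiB")
print(f"spot-248 after cloning gaia-1_4: ~{remaining_gib * 1024:.0f} MiB free")
```

Under these assumptions, cloning gaia-1_4 on spot-248 would leave well under 100M free, which is in line with the 82M left on spot-757.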


It looks like we only create 15G instances (thanks mgerva for the link).

I guess we need to either work out how to save disk space *somewhere* or increase this value.
Or maybe we can "jacuzzi" the spots a bit more, so that different gaia versions are tested on different instances?
Disabled this slave in slavealloc.
We can bump the size of the root partition easily for spot instances. All new instances will be coming with more space. On-demand instances would need to be recreated.

This will also affect our AWS bill.
We could also have the slaves pull the various gaia repos into the same local repo. As long as they're checking things out by revision, this should be safe.
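
A minimal sketch of what that could look like, assuming a single shared local repository that every gaia branch is pulled into, followed by an update to the exact revision a job needs. The function only builds the hg command lines and does not run them; the repository path, URL, and revision are hypothetical placeholders:

```python
def shared_repo_commands(shared_repo, source_url, revision):
    """Build hg commands that pull one revision into a shared local repo
    and then update that repo's working copy to the same revision."""
    return [
        # Pull only the requested revision (and its ancestors) from the branch.
        ["hg", "-R", shared_repo, "pull", "-r", revision, source_url],
        # Check out by revision, discarding any leftover local changes.
        ["hg", "-R", shared_repo, "update", "--clean", "-r", revision],
    ]


# Hypothetical values, for illustration only.
commands = shared_repo_commands(
    shared_repo="/builds/hg-shared/integration/gaia-shared",
    source_url="https://hg.example.invalid/integration/gaia-1_4",
    revision="abcdef123456",
)
for cmd in commands:
    print(" ".join(cmd))
```

Because changesets common to the branches would be stored once, the three separate clones would no longer each carry a full copy of the shared history.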
Breakdown on a running machine that still has 2G free:
[ /]# du -chs * 2>/dev/null
0	1
6.9M	bin
22M	boot
5.1G	builds
4.0K	dev
12M	etc
3.2G	home
0	initrd.img
216M	lib
4.0K	lib64
16K	lost+found
4.0K	media
4.0K	mnt
4.0K	opt
0	proc
104K	root
728K	run
7.9M	sbin
4.0K	selinux
4.0K	srv
0	sys
60K	tmp
330M	tools
2.4G	usr
788M	var
0	vmlinuz
12G	total
[ /]# df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvda1       15G   13G  2.0G  87% /
(In reply to Pete Moore [:pete][:pmoore] from comment #0)
> [ .android]$ du -sh /home/cltbld/.android/avd
> 2.6G	/home/cltbld/.android/avd

The Android 2.3 jobs extract AVDs-armv7a-gingerbread-build-2014-01-23-ubuntu.tar.gz there, and that is approximately the expected size.

I think it could be reduced to about 25% of that though: we distribute 4 identical avd definitions but only use one of them. (The Android 4.2 x86 jobs need 4 avd definitions because they run up to 4 emulators at one time, but the Android 2.3 jobs on ec2 slaves only run 1 emulator at a time.)
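
As a rough estimate of what that would save, assuming the four definitions are the same size and together account for essentially all of the 2.6G directory:

```python
AVD_DIR_GIB = 2.6   # from du -sh /home/cltbld/.android/avd
AVDS_SHIPPED = 4    # identical definitions in the tarball
AVDS_USED = 1       # Android 2.3 jobs run one emulator at a time

reduced_gib = AVD_DIR_GIB * AVDS_USED / AVDS_SHIPPED
saved_gib = AVD_DIR_GIB - reduced_gib

print(f"avd directory after trimming: ~{reduced_gib:.2f}G")
print(f"space reclaimed: ~{saved_gib:.2f}G")
```

That works out to roughly 0.65G kept and a little under 2G reclaimed per slave.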
RyanVM found these 5:
tst-linux64-spot-665  (unreachable, assumed terminated)

Did a short sweep through TBPL for m-c for others: 
tst-linux64-spot-781  (unreachable, assumed terminated)

There are surely more.
Attached patch: bump the size
Let's bump the root partition size until we can figure out how to reduce disk usage.
attachment 8397593: bump the size

This change will affect new spot instances, but not existing on-demand ones.
Attached patch: resize.diff
It turns out that it's not enough to set the size of the volume properly; we also need to grow the partition when the new size is larger than the one in the AMI. This won't work for HVM root devices, but should work for PV instances.
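
The condition described here can be sketched as a small check. This is not the code from resize.diff, only an illustration of the decision; the function name is made up, the 20G volume size is a hypothetical example, and "paravirtual" / "hvm" are the virtualization type values EC2 reports for an instance:

```python
def should_grow_root(volume_gib, ami_gib, virtualization_type):
    """Return True when the root partition should be grown after boot.

    Growing is only needed when the requested volume is larger than the
    AMI's root volume, and (per the comment above) is only expected to
    work for PV instances, not for HVM root devices.
    """
    if volume_gib <= ami_gib:
        return False
    return virtualization_type == "paravirtual"


# Hypothetical sizes: AMI built with a 15G root, new volume requested at 20G.
print(should_grow_root(20, 15, "paravirtual"))  # PV, larger volume
print(should_grow_root(20, 15, "hvm"))          # HVM root device
print(should_grow_root(15, 15, "paravirtual"))  # same size as the AMI
```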
The last fix made it work \o/

The on-demand instances still use 15G volumes.
Assignee: nobody → rail
Rail: What is the plan to age out the old on-demand and old spot instances?
Flags: needinfo?(rail)
(In reply to John Hopkins (:jhopkins) from comment #14)
> Rail: What is the plan to age out the old on-demand and old spot instances?

I was thinking about re-creating them during this TCW. Is there any reason why this should be done earlier?
Flags: needinfo?(rail)
Rail: we've been failing test runs due to "out of disk space" messages from time to time all week.  I conferred with edmorley and he says we can wait until the TCW to reimage.
No need to wait: they were idle, so I recreated 200 on-demand instances (tst-linux64-{0,3}{01..99}).
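
For reference, the brace pattern can be expanded into explicit hostnames as sketched below. Read literally it covers 198 names (001 to 099 and 301 to 399), so the figure of 200 looks like a round number; the helper name is illustrative.

```python
def expand_hosts(prefix="tst-linux64-", hundreds=("0", "3"), first=1, last=99):
    """Expand the equivalent of tst-linux64-{0,3}{01..99} into hostnames."""
    return [
        f"{prefix}{h}{n:02d}"
        for h in hundreds
        for n in range(first, last + 1)
    ]


hosts = expand_hosts()
print(len(hosts), hosts[0], hosts[-1])
```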
