Closed Bug 1217863 Opened 9 years ago Closed 9 years ago

tst-emulator64 instances are running out of disk space

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: kmoir, Unassigned)

References

Details

Attachments

(1 file)

http://buildbot-master68.bb.releng.usw2.mozilla.com:8201/buildslaves/tst-emulator64-spot-329

I think this is:
kmoir: 10:36 AM <Tomcat|sheriffduty> Callek_cloud: tst-emulator64-spot-329 out of space
kmoir: 10:40 AM <Tomcat|sheriffduty> Callek_cloud: and tst-emulator64-spot-025 joined that party

kmoir	I think the problem is that we are getting m3.xlarge (1x32GB) for those instance types and we really want c3.xlarge (2x40GB)
kmoir	the configs bid on either instance type 
kmoir	and perhaps the new debug tests or gtests consume all the space
kmoir	I can't tell because the instances have been terminated now

So we can either
1) be more aggressive about cleanup, or
2) restrict the instance type to c3.xlarge
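(For reference, on a live instance the type it actually landed on can be confirmed from the EC2 instance metadata service; this is a generic check, not something quoted from the bug:)

    # Query the EC2 instance metadata service from the instance itself to see
    # which instance type the spot request actually got.
    curl -s http://169.254.169.254/latest/meta-data/instance-type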
Summary: emulator64 instance types are running out of disk space → tst-emulator64 instances are running out of disk space
Also, the problem could be that a test is consuming more disk space than it used to.
Another theory is that since I enabled so many new Android debug tests earlier this week, the additional load is causing instances to live longer, which may expose a problem: looking at the runner logs, it appears purge_builds only runs on startup. For instance, on tst-emulator64-spot-957 purge_builds last ran at 2015-10-22 22:12:46, and at 08:33:07 the uptime was 10:21.
There are plenty of these now! I've had to close trees for it.
Is there anywhere we can check uptime before a job starts and force a purge_builds if uptime is greater than some value (three or four hours, maybe)? That would hopefully confirm or deny comment 2's theory without having to revert whatever changes were made last week.
Flags: needinfo?(kmoir)
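A rough sketch of that idea as a shell check run before taking a job (the threshold, the tools checkout path, and the -s free-space argument are assumptions, not taken from this bug):

    #!/bin/bash
    # If the instance has been up longer than MAX_HOURS, force a purge_builds
    # run before the next job instead of waiting for the next reboot.
    MAX_HOURS=4
    uptime_hours=$(awk '{ print int($1 / 3600) }' /proc/uptime)
    if [ "$uptime_hours" -ge "$MAX_HOURS" ]; then
        # Path to purge_builds.py in the build/tools checkout is assumed here.
        python /builds/slave/tools/buildfarm/maintenance/purge_builds.py -s 20 /builds/slave
    fi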
Managed to catch tst-emulator64-spot-371 in the act, as it were.

* Of the 20G on /, only 507M was free.
* About 10G was used in /tmp/android-cltbld.
* About 50 files of the form /tmp/android-cltbld/emulator-yHgXIL were present, each 220M, going back to 'Oct 26 04:48 PDT'.
* uptime reported that the instance had been up continuously since 'Oct 26 04:42 PDT'.
* The most recent file was in use, and an 'Android armv7 API 9 try opt test jsreftest-2' job was running at the time.
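(Roughly the sort of inspection that produces those numbers; the exact commands are assumed, not quoted from the host:)

    df -h /                             # free space on the root filesystem
    du -sh /tmp/android-cltbld          # total space used by emulator temp files
    ls -lt /tmp/android-cltbld | head   # newest emulator-* files and their timestamps
    uptime                              # how long the instance has been up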

So, a runner task that does 
   find /tmp/android-cltbld -mmin +30 -delete
before buildbot starts would fix this up.
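A minimal sketch of such a task as a standalone script (the actual attached patch may differ; the path and the 30-minute threshold are taken from the comment above):

    #!/bin/bash
    # Pre-buildbot cleanup: delete stale emulator temp files. Files modified
    # in the last 30 minutes are kept so an image belonging to a job that is
    # still running is not pulled out from under it.
    if [ -d /tmp/android-cltbld ]; then
        find /tmp/android-cltbld -mmin +30 -delete
    fi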
I think this should be OK. The -f arg will prevent errors when that directory is missing, say on non-emulator instances.
Attachment #8679875 - Flags: review?(kmoir)
FTR, I'm not pushing that out urgently because today's troubles are likely due to a huge load spike that resulted in the whole emulator pool working flat out for several hours. With that all dealt with, and a bit of a lull from the tree closure, a lot of those instances will recycle themselves and come back with an empty /tmp. The trees reopened a few minutes ago.
Flags: needinfo?(kmoir)
Attachment #8679875 - Flags: review?(kmoir) → review+
Attachment #8679875 - Flags: checked-in+
Started the process to regenerate the golden ami for this instance type too.
The new golden ami is available.
Seems like the new ami does not have the change, and we had failures again today. It looks like there was a puppet error when generating the ami.
from #buildduty

	arr	maybe something weird happened in the copy phase
	arr	so things in usw2 won't have the change, but things in use1 will
	arr	(new instances)
	kmoir	that makes sense
	arr	well, I'm not sure how the heck the copy got munged like that, though
	arr	kmoir: when you ran the script by hand, did you include the --copy-to-region bits?
	kmoir	arr: I didn't run it by hand.  I just modified the crontab to run it at the time
	arr	ah
	arr	something strange happened, because there are TWO amis with the name spot-tst-emulator64-2015-10-28-09-25 in usw2
	arr	even though they show separate creation dates
	arr	and there is none for the earlier one
	arr	wonders how the heck that state came about
	arr	kmoir: I've initiated a copy of the good ami from use1 to usw2
	kmoir	arr: okay thanks
Looks like there are many instances which have been up for > 1 day and haven't run check_ami in over a day. Thus they don't have the patch and are running out of disk space. Going to open a bug against runner to run check_ami more frequently.
I ended up terminating the 54 old instances still running, as they were continuing to cause grief to sheriffs. We're probably done with this particular bug, and can follow up in bug 1219763 to figure out why the newer ami wasn't being picked up more quickly.
Thanks Nick, I should have used the terminate-by-ami script to delete them. Looking at the instances, it appears runner doesn't run again after buildbot has started, which means check_ami doesn't run either on instances that have been up for several days.
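(Something along these lines would do it with the plain AWS CLI; this is not the actual terminate-by-ami script, and the region, tag filter, and instance ids are placeholders:)

    # List running tst-emulator64 instances with their AMI and launch time,
    # to spot the ones still on the old image.
    aws ec2 describe-instances --region us-west-2 \
        --filters "Name=tag:Name,Values=tst-emulator64-spot-*" \
                  "Name=instance-state-name,Values=running" \
        --query 'Reservations[].Instances[].[InstanceId,ImageId,LaunchTime]' \
        --output text

    # Terminate the ones still running the stale AMI (ids are placeholders).
    aws ec2 terminate-instances --region us-west-2 --instance-ids i-0123456789abcdef0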
Looks like amis are being recycled regularly according to the console now that bug 1219763 is fixed.  All the current instances for this type are from today.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard