Closed
Bug 1143681
Opened 10 years ago
Closed 9 years ago
Some AWS test slaves not being recycled as expected
Categories
(Release Engineering :: General, defect)
Tracking
(Not tracked)
RESOLVED
WORKSFORME
People
(Reporter: coop, Unassigned)
Attachments
(1 file)
9.60 KB, text/plain
For the last few weeks, we've had a bunch (>80) of tst-linux64-spot nodes that are not being recycled properly, despite frequently high numbers (>1000) of pending jobs.
As an example, the instance that has been in this state the longest, tst-linux64-spot-233, has no status info when I try to look it up on aws-manager2, and the AWS console reports nothing about this instance either.
https://console.aws.amazon.com/ec2/v2/home?region=us-east-1#Instances:search=tst-linux64-spot-2;sort=tag:Name
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=test&type=tst-linux64-spot&name=tst-linux64-spot-233
Attached is the list of instances that are in this state, in case there is some pattern. (The URL above provides the same data.)
We should figure out what is preventing these instances from being recycled properly and find a way to make sure it happens automatically.
Comment 1•10 years ago
This may be similar to https://bugzilla.mozilla.org/show_bug.cgi?id=1141339#c20, where we had instances without the recycling script installed.
We should probably install that script on all machines and make it exit 0 on non-AWS instances.
The script lives here: http://hg.mozilla.org/build/puppet/file/1cc84a9642ee/modules/runner/files/check_ami.py
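A minimal sketch of the suggested "exit 0 on non-AWS" guard, assuming the EC2 instance metadata endpoint (169.254.169.254) is only reachable from AWS instances; this is illustrative and not what check_ami.py actually does:

# Hypothetical guard for a check_ami.py-style script: do nothing on non-AWS hosts.
import sys
import urllib2

def running_on_aws(timeout=1):
    # The metadata service only answers on EC2 instances; treat any
    # failure (timeout, connection refused) as "not on AWS".
    try:
        urllib2.urlopen("http://169.254.169.254/latest/meta-data/", timeout=timeout)
        return True
    except Exception:
        return False

if not running_on_aws():
    sys.exit(0)  # nothing to check on non-AWS slaves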
Reporter
Comment 2•10 years ago
If you search for the hostname in Spot Requests in the AWS console, you can click on the instance ID to get more details about the state. For all the instances I've checked so far, the tags have been empty (including Name), which is why you can't find them in the instance list by name.
The states of the individual instances have been a mix of starting up (pending) and shutting down. Here's a sampling:
tst-linux64-spot-233 use1 i-69c0d065 shutting-down
tst-linux64-spot-1456 usw2 i-69c0d065 pending
tst-linux64-spot-783 usw2 i-97c0d09b pending
tst-linux64-spot-379 usw2 i-6dc0d061 pending
tst-linux64-spot-1166 usw2 i-95c0d099 pending
I tried terminating tst-linux64-spot-1166 just to make sure I could. It worked.
Knowing this, I can sort the console listing to batch terminate these instances by hand.
However, is there a way to add a check or timeout to cloud-tools for instances that spend undue time in the pending or shutting-down state?
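A rough sketch of what such a check could look like, assuming boto 2 (not actual cloud-tools code); the region names mirror the use1/usw2 entries above. Combined with the launch_time check in the next comment, this would give the timeout behaviour asked about here:

import boto.ec2

# Regions corresponding to the use1/usw2 instances listed above.
for region in ("us-east-1", "us-west-2"):
    conn = boto.ec2.connect_to_region(region)
    # Only look at instances stuck starting up or shutting down.
    reservations = conn.get_all_instances(
        filters={"instance-state-name": ["pending", "shutting-down"]})
    for reservation in reservations:
        for instance in reservation.instances:
            print region, instance.id, instance.state, \
                instance.tags.get("Name", "<no Name tag>")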
Comment 3•10 years ago
If the instances have a launch_time attribute (sometimes it disappears), we can use it. Pseudocode would look like this:
if now - i.launch_time > 2 days:
    i.terminate()
Sometimes these instances are indestructible, I had to open an AWS ticket last month to kill one of those.
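Fleshed out slightly, assuming boto 2 instance objects whose launch_time is an ISO-8601 string; the 2-day cutoff comes from the pseudocode above, the rest is illustrative:

from datetime import datetime, timedelta

MAX_AGE = timedelta(days=2)

def maybe_terminate(instance):
    # launch_time sometimes disappears; skip those instances rather than guess.
    if not instance.launch_time:
        return
    launched = datetime.strptime(instance.launch_time, "%Y-%m-%dT%H:%M:%S.%fZ")
    if datetime.utcnow() - launched > MAX_AGE:
        instance.terminate()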
Reporter
Comment 4•10 years ago
(In reply to Rail Aliiev [:rail] from comment #3)
> Sometimes these instances are indestructible, I had to open an AWS ticket
> last month to kill one of those.
I was able to terminate all the pending ones (which is promising), but I couldn't affect the state of those listed as shutting-down. I opened a support case for those: https://console.aws.amazon.com/support/home?region=us-east-1#/case/?displayId=1359595871&language=en
Reporter
Comment 5•10 years ago
(In reply to Chris Cooper [:coop] from comment #4)
> I was able to terminate all the pending ones (which is promising), but I
> couldn't affect the state of those listed as shutting-down. I opened a
> support case for those:
> https://console.aws.amazon.com/support/home?region=us-east-1#/case/
> ?displayId=1359595871&language=en
Support case was resolved this morning.
Updated•9 years ago
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → WORKSFORME
Assignee
Updated•7 years ago
Component: General Automation → General