Closed
Bug 985330
Opened 11 years ago
Closed 11 years ago
Integration trees closed, high number of pending linux/android compile jobs (about 1 hour backlog)
Categories
(Infrastructure & Operations Graveyard :: CIDuty, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: cbook, Unassigned)
Details
And here we go again: we have a high number of pending Linux (and Android) build jobs. Closing the trees so that they can catch up.
Comment 1•11 years ago
So we see, e.g., this build request: https://secure.pub.build.mozilla.org/buildapi/self-serve/fx-team/request/38333900
claimed_at: 0
It looks like no master has claimed it.
It also looks like this job is not handled by jacuzzi, since the buildername is
"buildername": "Android 2.2 fx-team non-unified"
but the jacuzzi builders are listed at http://jacuzzi-allocator.pub.build.mozilla.org/v1/builders/
and it is not listed there.
So if I understand correctly, the scheduler has created an entry for a master to pick up, but no master has claimed it.
Presumably we should see logs on the masters showing where they try to claim a job.
Presumably each linux buildbot master would try to grab the job above, so we can look at any one of them to see why it did not successfully take this particular job.
And presumably this will be in the buildbot master system log directly.
(Lots of "presumably"s here.) :)
We will now check to see if we can find a buildbot master log file.
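For reference, the jacuzzi membership check above can also be scripted. A minimal sketch, assuming the allocator endpoint returns JSON with a "builders" list (shape not verified here), and using the Python 2.7 stdlib to match the environment in this bug:

# Quick check of whether a buildername is handled by the jacuzzi allocator,
# mirroring the manual lookup above. The JSON shape ("builders" list) is an
# assumption, not a verified schema.
import json
import urllib2  # Python 2.7, matching the buildbot-era environment

JACUZZI_URL = "http://jacuzzi-allocator.pub.build.mozilla.org/v1/builders/"
buildername = "Android 2.2 fx-team non-unified"

data = json.load(urllib2.urlopen(JACUZZI_URL))
builders = data.get("builders", []) if isinstance(data, dict) else data
print("%r handled by jacuzzi: %s" % (buildername, buildername in builders))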
Comment 2•11 years ago
2014-03-19 04:32:44-0700 [-] prioritizeBuilders: 0.19s found 0 available of 40 connected slaves
on buildbot-master54.srv.releng.usw2.mozilla.com:/builds/buildbot/tests1-linux/master/twistd.log
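For anyone repeating this, a small sketch of how to confirm the symptom across a master's twistd.log. The path and log line format are taken from the excerpt above and may differ on other masters:

# Scan a buildbot master's twistd.log for prioritizeBuilders lines reporting
# zero available slaves, i.e. the symptom pasted above.
import re

LOG = "/builds/buildbot/tests1-linux/master/twistd.log"
pattern = re.compile(r"prioritizeBuilders: .* found (\d+) available of (\d+) connected slaves")

with open(LOG) as log:
    for line in log:
        match = pattern.search(line)
        if match and int(match.group(1)) == 0:
            print(line.rstrip())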
Comment 3•11 years ago
we appear to have errors listing available AWS instances:
buildduty@cruncher.srv.releng.scl3:/home/buildduty/logs/aws/aws_watch_pending.log
2014-03-19 05:10:29,059 - DEBUG - bld-linux64: 25 running spot instances in us-west-2
Traceback (most recent call last):
File "aws_watch_pending.py", line 854, in <module>
instance_type_changes=config.get("instance_type_changes", {})
File "aws_watch_pending.py", line 795, in aws_watch_pending
slaveset=slaveset)
File "aws_watch_pending.py", line 447, in request_spot_instances
active_requests = aws_get_spot_requests(region=region, moz_instance_type=moz_instance_type)
File "aws_watch_pending.py", line 167, in aws_get_spot_requests
req = conn.get_all_spot_instance_requests(filters=filters)
File "/home/buildduty/aws/aws-ve-2/lib/python2.7/site-packages/boto/ec2/connection.py", line 1302, in get_all_spot_instance_requests
[('item', SpotInstanceRequest)], verb='POST')
File "/home/buildduty/aws/aws-ve-2/lib/python2.7/site-packages/boto/connection.py", line 1143, in get_list
response = self.make_request(action, params, path, verb)
File "/home/buildduty/aws/aws-ve-2/lib/python2.7/site-packages/boto/connection.py", line 1089, in make_request
return self._mexe(http_request)
File "/home/buildduty/aws/aws-ve-2/lib/python2.7/site-packages/boto/connection.py", line 923, in _mexe
response = connection.getresponse()
File "/tools/python27/lib/python2.7/httplib.py", line 1030, in getresponse
response.begin()
File "/tools/python27/lib/python2.7/httplib.py", line 407, in begin
version, status, reason = self._read_status()
File "/tools/python27/lib/python2.7/httplib.py", line 365, in _read_status
line = self.fp.readline()
File "/tools/python27/lib/python2.7/socket.py", line 430, in readline
data = recv(1)
File "/tools/python27/lib/python2.7/ssl.py", line 241, in recv
return self.read(buflen)
File "/tools/python27/lib/python2.7/ssl.py", line 160, in read
return self._sslobj.read(len)
ssl.SSLError: The read operation timed out
Comment 4•11 years ago
Removing dead spot instance requests has reduced the response time, and we no longer seem to hit the timeout. New spot instances are now getting created, so hopefully the linux backlog should start reducing.
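For the record, a rough boto sketch of what that cleanup amounts to: find spot requests that are still 'active'/fulfilled but reference instances that no longer exist, then cancel them. The region, the per-request lookups, and the decision to cancel are illustrative assumptions, not the exact commands that were run:

# Sketch: find spot requests that claim to be fulfilled but reference
# instances that no longer exist, and cancel them (boto 2.x API).
# Region is a placeholder; double-check the list before cancelling anything.
import boto.ec2
from boto.exception import EC2ResponseError

conn = boto.ec2.connect_to_region("us-west-2")

dead = []
for req in conn.get_all_spot_instance_requests():
    if req.state != "active" or not req.instance_id:
        continue
    try:
        conn.get_only_instances(instance_ids=[req.instance_id])
    except EC2ResponseError as e:
        if e.error_code == "InvalidInstanceID.NotFound":
            dead.append(req.id)
        else:
            raise

print("dead spot requests: %s" % dead)
if dead:
    conn.cancel_spot_instance_requests(request_ids=dead)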
Comment 5•11 years ago
Suggested improvements:
* Find out why the timeout is reached - why the query is taking so long to respond - and fix it
* Increase the timeout if possible (see if there is a config setting in boto to increase it; a possible sketch follows this list)
* Clean out any still-existing dead spot-instance requests, so we have a completely clean spot-instance request list
* Update wiki docs to explain how to troubleshoot this problem
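On the timeout point: boto does read an http_socket_timeout value from its config, so one possible approach (untested here; the 70-second value is arbitrary) would be to raise it before creating the EC2 connection:

# Possible way to raise boto's HTTP socket timeout before talking to EC2.
# "Boto"/"http_socket_timeout" is boto's own config knob; the value and this
# programmatic usage are assumptions for illustration, not tested here.
import boto

if not boto.config.has_section("Boto"):
    boto.config.add_section("Boto")
boto.config.set("Boto", "http_socket_timeout", "70")

import boto.ec2
conn = boto.ec2.connect_to_region("us-west-2")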
Comment 6•11 years ago
* Revisit the method we use to find available spot instances, rather than querying network interfaces
* Catch the time-out exception and handle it while we continue to use the current method (see the sketch below)
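A minimal sketch of what "catch and handle" could look like around the failing call in aws_watch_pending.py; the retry count, backoff, and exception set are assumptions for illustration, not what the tools actually do:

# Sketch: wrap the spot-request query so a transient SSL read timeout is
# retried a few times instead of killing the whole aws_watch_pending run.
import socket
import ssl
import time


def get_spot_requests_with_retry(conn, filters, retries=3, delay=10):
    for attempt in range(retries):
        try:
            return conn.get_all_spot_instance_requests(filters=filters)
        except (ssl.SSLError, socket.timeout) as e:
            if attempt == retries - 1:
                raise
            print("spot request query timed out (%s), retrying..." % e)
            time.sleep(delay)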
Comment 7•11 years ago
I'll raise separate bugs for the above, and close this bug now, as resolved.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Comment 8•11 years ago
Good detective skills! Well done Sherlock Moore Holmes! :)
Comment 9•11 years ago
Thanks Armen! But I was just the scribe! =) Thanks to sbruno, mgerva and catlee.
Amazon support case raised:
===========================
We have several spot requests that are 'active' and 'fulfilled' with instance ids, but the referenced instances don't exist.
One example was sir-6c107449 in us-east-1, which references instance i-0b6e6a2a. I've tried to cancel that spot request and others in that state, so its status is now 'request-canceled-and-instance-running'.
I trust we're not being charged for spot requests that reference non-existent instances.
Instance ID(s): sir-6c107449,
=======================================
To contact us again about this case, please return to the AWS Support Center using the following URL:
https://aws.amazon.com/support/case?caseId=174589251&language=en
Reporter
Comment 10•11 years ago
For the initial cause, the trees were reopened at 05:32:03, but it seems this issue is now back, so fx-team is closed again and I'm reopening this bug.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Reporter
Comment 11•11 years ago
Reopened fx-team again since builds are catching up, but I'm not sure if this is it for today.
Comment 12•11 years ago
I believe this is now fixed, or worked around, in our tools.
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Resolution: --- → FIXED
Updated•7 years ago
Product: Release Engineering → Infrastructure & Operations
Updated•5 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard