Bug 985330 (Closed) - Opened 11 years ago - Closed 11 years ago

Integration trees closed: high number of pending Linux/Android compile jobs, about 1 hour backlog

Categories: Infrastructure & Operations Graveyard :: CIDuty (task)
Hardware: x86 / Linux
Type: task
Priority: Not set
Severity: blocker
Tracking: Not tracked
Status: RESOLVED FIXED
People: Reporter: cbook; Assignee: Unassigned

Details
And here we go again: we have a high number of pending Linux (and Android) build jobs. Closing the trees so that they can catch up.
So we see e.g. this build request: https://secure.pub.build.mozilla.org/buildapi/self-serve/fx-team/request/38333900

claimed_at: 0 - looks like no master has claimed it.

It also looks like this job is not handled by jacuzzi: the buildername is "Android 2.2 fx-team non-unified", but the jacuzzi builders are listed at http://jacuzzi-allocator.pub.build.mozilla.org/v1/builders/ and it is not there.

So, if I understand correctly, the scheduler has created an entry for a master to pick up, but no master has claimed it. Presumably we should see logs on the masters from the code where they try to claim a job; presumably each Linux buildbot master would try to grab the job above, so we can look at any one of them to see why it did not successfully take this particular job; and presumably this will be in the buildbot master system log directly. (Lots of "presumably"s here.) :)

We will now check to see if we can find a buildbot master log file.
From buildbot-master54.srv.releng.usw2.mozilla.com:/builds/buildbot/tests1-linux/master/twistd.log:

2014-03-19 04:32:44-0700 [-] prioritizeBuilders: 0.19s found 0 available of 40 connected slaves
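As an aside to the claimed_at / jacuzzi analysis above, here is a minimal sketch (not part of the bug; the JSON shapes and field names are assumptions) of how those two checks could be scripted rather than eyeballed in a browser:

# Hypothetical helper: check whether a buildername is jacuzzi-allocated and
# whether a buildapi self-serve request has been claimed by a master.
# Assumes both endpoints return JSON and that buildapi auth is already handled.
import requests

JACUZZI_BUILDERS = "http://jacuzzi-allocator.pub.build.mozilla.org/v1/builders/"
BUILDAPI_REQUEST = "https://secure.pub.build.mozilla.org/buildapi/self-serve/fx-team/request/38333900"

def is_jacuzzi_builder(buildername):
    # The allocator only lists builders that are pinned to jacuzzi slave sets;
    # the "builders" key is an assumption about the response shape.
    builders = requests.get(JACUZZI_BUILDERS).json().get("builders", [])
    return buildername in builders

def request_claimed(url):
    # claimed_at == 0 means no buildbot master has picked the request up yet.
    req = requests.get(url, headers={"Accept": "application/json"}).json()
    return req.get("claimed_at", 0) != 0

if __name__ == "__main__":
    print("jacuzzi builder:", is_jacuzzi_builder("Android 2.2 fx-team non-unified"))
    print("claimed:", request_claimed(BUILDAPI_REQUEST))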
We appear to have errors listing available AWS instances:

buildduty@cruncher.srv.releng.scl3:/home/buildduty/logs/aws/aws_watch_pending.log

2014-03-19 05:10:29,059 - DEBUG - bld-linux64: 25 running spot instances in us-west-2
Traceback (most recent call last):
  File "aws_watch_pending.py", line 854, in <module>
    instance_type_changes=config.get("instance_type_changes", {})
  File "aws_watch_pending.py", line 795, in aws_watch_pending
    slaveset=slaveset)
  File "aws_watch_pending.py", line 447, in request_spot_instances
    active_requests = aws_get_spot_requests(region=region, moz_instance_type=moz_instance_type)
  File "aws_watch_pending.py", line 167, in aws_get_spot_requests
    req = conn.get_all_spot_instance_requests(filters=filters)
  File "/home/buildduty/aws/aws-ve-2/lib/python2.7/site-packages/boto/ec2/connection.py", line 1302, in get_all_spot_instance_requests
    [('item', SpotInstanceRequest)], verb='POST')
  File "/home/buildduty/aws/aws-ve-2/lib/python2.7/site-packages/boto/connection.py", line 1143, in get_list
    response = self.make_request(action, params, path, verb)
  File "/home/buildduty/aws/aws-ve-2/lib/python2.7/site-packages/boto/connection.py", line 1089, in make_request
    return self._mexe(http_request)
  File "/home/buildduty/aws/aws-ve-2/lib/python2.7/site-packages/boto/connection.py", line 923, in _mexe
    response = connection.getresponse()
  File "/tools/python27/lib/python2.7/httplib.py", line 1030, in getresponse
    response.begin()
  File "/tools/python27/lib/python2.7/httplib.py", line 407, in begin
    version, status, reason = self._read_status()
  File "/tools/python27/lib/python2.7/httplib.py", line 365, in _read_status
    line = self.fp.readline()
  File "/tools/python27/lib/python2.7/socket.py", line 430, in readline
    data = recv(1)
  File "/tools/python27/lib/python2.7/ssl.py", line 241, in recv
    return self.read(buflen)
  File "/tools/python27/lib/python2.7/ssl.py", line 160, in read
    return self._sslobj.read(len)
ssl.SSLError: The read operation timed out
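On the timeout itself: boto 2.x reads an http_socket_timeout value from its [Boto] config section, which is what httplib uses for the socket read that timed out above. A minimal sketch of raising it before the EC2 connection is opened (an assumption about how a fix could look, not the actual change that was made):

# Raise boto's HTTP socket timeout before any connection is created.
# boto.config.set is assumed to be honoured here; the same value can also
# be put in ~/.boto or /etc/boto.cfg under [Boto] http_socket_timeout.
import boto
import boto.ec2

boto.config.set('Boto', 'http_socket_timeout', '60')  # seconds; 60 is an arbitrary example

conn = boto.ec2.connect_to_region('us-west-2')
spot_requests = conn.get_all_spot_instance_requests()
print('%d spot instance requests' % len(spot_requests))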
Removing the dead spot instance requests has reduced the response time, and we no longer seem to hit the timeout. New spot instances are now getting created, so the Linux backlog should hopefully start to reduce.
Suggested improvements:
* Find out why the timeout is reached (why the query is taking so long to respond) and fix it
* Increase the timeout if possible (see if there is a config setting in boto to increase it)
* Clean out any still-existing dead spot-instance requests, so we have a completely clean spot-instance request list
* Update the wiki docs to explain how to troubleshoot this problem
* Revisit the method we use to find available spot instances, rather than querying network interfaces
* Catch the time-out exception and handle it, while we use the current method (see the sketch after this list)
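For the "catch the time-out exception and handle it" item, a minimal sketch (an assumption, not the actual aws_watch_pending.py patch) of wrapping the slow EC2 call so one slow region cannot abort the whole run:

# Retry the spot-request listing a few times on a read timeout, then give up
# gracefully with an empty list instead of letting the traceback kill the run.
import ssl
import logging

import boto.ec2

log = logging.getLogger(__name__)

def get_spot_requests_safely(region, filters=None, attempts=3):
    conn = boto.ec2.connect_to_region(region)
    for attempt in range(1, attempts + 1):
        try:
            return conn.get_all_spot_instance_requests(filters=filters)
        except ssl.SSLError as e:
            # python 2.7's httplib surfaces socket read timeouts as
            # ssl.SSLError("The read operation timed out"), as in the log above.
            log.warning("spot request listing timed out in %s (attempt %d/%d): %s",
                        region, attempt, attempts, e)
    return []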
I'll raise separate bugs for the above, and close this bug now, as resolved.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Good detective skills! Well done Sherlock Moore Holmes! :)
Thanks Armen! But I was just the scribe! =) Thanks to sbruno, mgerva and catlee.

Amazon support case raised:

===========================
We have several spot requests that are 'active' and 'fulfilled' with instance ids, but the referenced instances don't exist.

One example was sir-6c107449 in us-east-1, which references instance i-0b6e6a2a. I've tried to cancel that spot request and others in that state, so its status is now 'request-canceled-and-instance-running'.

I trust we're not being charged for spot requests that reference non-existent instances.

Instance ID(s): sir-6c107449,
=======================================

To contact us again about this case, please return to the AWS Support Center using the following URL: https://aws.amazon.com/support/case?caseId=174589251&language=en
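For reference, a hedged sketch (not the exact commands that were run) of how such dead spot requests could be detected and cancelled with boto, by cross-checking each request's instance id against the instances that actually exist in the region:

# Find 'active' spot requests whose referenced instance no longer exists,
# print them, and optionally cancel them.
import boto.ec2

def cancel_dead_spot_requests(region, dry_run=True):
    conn = boto.ec2.connect_to_region(region)
    # Build the set of instance ids that actually exist in this region.
    live = set()
    for reservation in conn.get_all_instances():
        for instance in reservation.instances:
            live.add(instance.id)
    dead = [r for r in conn.get_all_spot_instance_requests()
            if r.state == 'active' and r.instance_id and r.instance_id not in live]
    for req in dead:
        print('%s -> %s looks dead' % (req.id, req.instance_id))
    if dead and not dry_run:
        conn.cancel_spot_instance_requests([r.id for r in dead])
    return dead

# e.g. cancel_dead_spot_requests('us-east-1', dry_run=False)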
So, for the initial cause, trees were reopened at 05:32:03, but it seems this issue is now back, so fx-team is closed again and I'm reopening this bug.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Reopened fx-team again since builds are catching up, but not sure if this is it for today.
I believe this is now fixed, or worked around, in our tools.
Status: REOPENED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard