Bug 985330 (Closed) - Opened 11 years ago - Closed 11 years ago

Integration trees closed: high number of pending Linux/Android compile jobs, about 1 hour backlog

Categories: Infrastructure & Operations Graveyard :: CIDuty (task)
Hardware: x86 / Linux
Type: task
Priority: Not set
Severity: blocker
Tracking: Not tracked
Status: RESOLVED FIXED
People: Reporter: cbook; Assignee: Unassigned

Details
And here we go again: we have a high number of pending Linux (and Android) build jobs. Closing the trees so that they can catch up.
So we see e.g. this build request: https://secure.pub.build.mozilla.org/buildapi/self-serve/fx-team/request/38333900

claimed_at: 0 - looks like no master has claimed it.

It also looks like this job is not handled by jacuzzi: the buildername is "Android 2.2 fx-team non-unified", but the jacuzzi builders are listed at http://jacuzzi-allocator.pub.build.mozilla.org/v1/builders/ and it is not there.

So, if I understand correctly, the scheduler has created an entry for a master to pick up, but no master has claimed it. Presumably we should see logs on the masters from the code where they try to claim a job; presumably each Linux buildbot master would try to grab the job above, so we can look at any one of them to see why it did not successfully take this particular job; and presumably this will be in the buildbot master system log directly. (Lots of "presumably"s here.) :)

We will now check to see if we can find a buildbot master log file.
From buildbot-master54.srv.releng.usw2.mozilla.com:/builds/buildbot/tests1-linux/master/twistd.log:

2014-03-19 04:32:44-0700 [-] prioritizeBuilders: 0.19s found 0 available of 40 connected slaves
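As an aside to the claimed_at / jacuzzi analysis above, here is a minimal sketch (not part of the bug; the JSON shapes and field names are assumptions) of how those two checks could be scripted rather than eyeballed in a browser:

# Hypothetical helper: check whether a buildername is jacuzzi-allocated and
# whether a buildapi self-serve request has been claimed by a master.
# Assumes both endpoints return JSON and that buildapi auth is already handled.
import requests

JACUZZI_BUILDERS = "http://jacuzzi-allocator.pub.build.mozilla.org/v1/builders/"
BUILDAPI_REQUEST = "https://secure.pub.build.mozilla.org/buildapi/self-serve/fx-team/request/38333900"

def is_jacuzzi_builder(buildername):
    # The allocator only lists builders that are pinned to jacuzzi slave sets;
    # the "builders" key is an assumption about the response shape.
    builders = requests.get(JACUZZI_BUILDERS).json().get("builders", [])
    return buildername in builders

def request_claimed(url):
    # claimed_at == 0 means no buildbot master has picked the request up yet.
    req = requests.get(url, headers={"Accept": "application/json"}).json()
    return req.get("claimed_at", 0) != 0

if __name__ == "__main__":
    print("jacuzzi builder:", is_jacuzzi_builder("Android 2.2 fx-team non-unified"))
    print("claimed:", request_claimed(BUILDAPI_REQUEST))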
We appear to have errors listing available AWS instances:

buildduty@cruncher.srv.releng.scl3:/home/buildduty/logs/aws/aws_watch_pending.log

2014-03-19 05:10:29,059 - DEBUG - bld-linux64: 25 running spot instances in us-west-2
Traceback (most recent call last):
  File "aws_watch_pending.py", line 854, in <module>
    instance_type_changes=config.get("instance_type_changes", {})
  File "aws_watch_pending.py", line 795, in aws_watch_pending
    slaveset=slaveset)
  File "aws_watch_pending.py", line 447, in request_spot_instances
    active_requests = aws_get_spot_requests(region=region, moz_instance_type=moz_instance_type)
  File "aws_watch_pending.py", line 167, in aws_get_spot_requests
    req = conn.get_all_spot_instance_requests(filters=filters)
  File "/home/buildduty/aws/aws-ve-2/lib/python2.7/site-packages/boto/ec2/connection.py", line 1302, in get_all_spot_instance_requests
    [('item', SpotInstanceRequest)], verb='POST')
  File "/home/buildduty/aws/aws-ve-2/lib/python2.7/site-packages/boto/connection.py", line 1143, in get_list
    response = self.make_request(action, params, path, verb)
  File "/home/buildduty/aws/aws-ve-2/lib/python2.7/site-packages/boto/connection.py", line 1089, in make_request
    return self._mexe(http_request)
  File "/home/buildduty/aws/aws-ve-2/lib/python2.7/site-packages/boto/connection.py", line 923, in _mexe
    response = connection.getresponse()
  File "/tools/python27/lib/python2.7/httplib.py", line 1030, in getresponse
    response.begin()
  File "/tools/python27/lib/python2.7/httplib.py", line 407, in begin
    version, status, reason = self._read_status()
  File "/tools/python27/lib/python2.7/httplib.py", line 365, in _read_status
    line = self.fp.readline()
  File "/tools/python27/lib/python2.7/socket.py", line 430, in readline
    data = recv(1)
  File "/tools/python27/lib/python2.7/ssl.py", line 241, in recv
    return self.read(buflen)
  File "/tools/python27/lib/python2.7/ssl.py", line 160, in read
    return self._sslobj.read(len)
ssl.SSLError: The read operation timed out
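On the timeout itself: boto 2.x reads an http_socket_timeout value from its [Boto] config section, which is what httplib uses for the socket read that timed out above. A minimal sketch of raising it before the EC2 connection is opened (an assumption about how a fix could look, not the actual change that was made):

# Raise boto's HTTP socket timeout before any connection is created.
# boto.config.set is assumed to be honoured here; the same value can also
# be put in ~/.boto or /etc/boto.cfg under [Boto] http_socket_timeout.
import boto
import boto.ec2

boto.config.set('Boto', 'http_socket_timeout', '60')  # seconds; 60 is an arbitrary example

conn = boto.ec2.connect_to_region('us-west-2')
spot_requests = conn.get_all_spot_instance_requests()
print('%d spot instance requests' % len(spot_requests))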
Removing the dead spot instance requests has reduced the response time, and we no longer seem to hit the timeout. New spot instances are now getting created, so the Linux backlog should hopefully start to reduce.
Suggested improvements:
* Find out why the timeout is reached (why the query is taking so long to respond) and fix it
* Increase the timeout if possible (see if there is a config setting in boto to increase it)
* Clean out any still-existing dead spot-instance requests, so we have a completely clean spot-instance request list
* Update the wiki docs to explain how to troubleshoot this problem
* Revisit the method we use to find available spot instances, rather than querying network interfaces
* Catch the time-out exception and handle it, while we use the current method (see the sketch after this list)
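For the "catch the time-out exception and handle it" item, a minimal sketch (an assumption, not the actual aws_watch_pending.py patch) of wrapping the slow EC2 call so one slow region cannot abort the whole run:

# Retry the spot-request listing a few times on a read timeout, then give up
# gracefully with an empty list instead of letting the traceback kill the run.
import ssl
import logging

import boto.ec2

log = logging.getLogger(__name__)

def get_spot_requests_safely(region, filters=None, attempts=3):
    conn = boto.ec2.connect_to_region(region)
    for attempt in range(1, attempts + 1):
        try:
            return conn.get_all_spot_instance_requests(filters=filters)
        except ssl.SSLError as e:
            # python 2.7's httplib surfaces socket read timeouts as
            # ssl.SSLError("The read operation timed out"), as in the log above.
            log.warning("spot request listing timed out in %s (attempt %d/%d): %s",
                        region, attempt, attempts, e)
    return []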
I'll raise separate bugs for the above, and close this bug now, as resolved.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Good detective skills! Well done Sherlock Moore Holmes! :)
Thanks Armen! But I was just the scribe! =) Thanks to sbruno, mgerva and catlee.

Amazon support case raised:

===========================
We have several spot requests that are 'active' and 'fulfilled' with instance ids, but the referenced instances don't exist.

One example was sir-6c107449 in us-east-1, which references instance i-0b6e6a2a. I've tried to cancel that spot request and others in that state, so its status is now 'request-canceled-and-instance-running'.

I trust we're not being charged for spot requests that reference non-existent instances.

Instance ID(s): sir-6c107449,
=======================================

To contact us again about this case, please return to the AWS Support Center using the following URL: https://aws.amazon.com/support/case?caseId=174589251&language=en
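For reference, a hedged sketch (not the exact commands that were run) of how such dead spot requests could be detected and cancelled with boto, by cross-checking each request's instance id against the instances that actually exist in the region:

# Find 'active' spot requests whose referenced instance no longer exists,
# print them, and optionally cancel them.
import boto.ec2

def cancel_dead_spot_requests(region, dry_run=True):
    conn = boto.ec2.connect_to_region(region)
    # Build the set of instance ids that actually exist in this region.
    live = set()
    for reservation in conn.get_all_instances():
        for instance in reservation.instances:
            live.add(instance.id)
    dead = [r for r in conn.get_all_spot_instance_requests()
            if r.state == 'active' and r.instance_id and r.instance_id not in live]
    for req in dead:
        print('%s -> %s looks dead' % (req.id, req.instance_id))
    if dead and not dry_run:
        conn.cancel_spot_instance_requests([r.id for r in dead])
    return dead

# e.g. cancel_dead_spot_requests('us-east-1', dry_run=False)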
So, for the initial cause, trees were reopened at 05:32:03, but it seems this issue is now back, so fx-team is closed again and I'm reopening this bug.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Reopened fx-team again since builds are catching up, but not sure if this is it for today.
I believe this is now fixed, or worked around, in our tools.
Status: REOPENED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard