Closed Bug 844648 Opened 12 years ago Closed 11 years ago

ec2 slave remoteFailed: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.

Categories

(Release Engineering :: General, defect)

x86
Linux
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: philor, Unassigned)

References

Details

(Keywords: intermittent-failure)

OS: Mac OS X → Linux
coop, please can you find an owner for this intermittent-failure - the current overall tree intermittent failure rate is spiralling out of control & the majority of bugs are unowned (see dev.platform thread).
Flags: needinfo?(coop)
Boy, that didn't stay "build slave" for very long.
Summary: ec2 build slave remoteFailed: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion. → ec2 slave remoteFailed: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.
Taking a look at the hostnames:
* 37 instances on test slaves, all but 3 are in tst-linux(32|64)-ec2-xxx below 300, which is Amazon's us-east-1 talking to three masters in scl1. bm17 and bm24 mainly
* 25 instances on build slaves, all but 1 are bld-linux64-ec2-6xx, which is us-west-2 region talking to bm35 in scl3. The other is us-east-1 talking to bm49 in scl3

The test failures are recent, the bld ones generally older. Some (all?) of them could be related to the scl3 network issues, perhaps this is a 'canary down the mine' given the persistent nature of the buildbot connections.
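For anyone wanting to repeat the hostname breakdown above, a minimal sketch follows. It assumes slave names follow the `tst-linux64-ec2-064` / `bld-linux64-ec2-6xx` pattern seen in this bug; the function names and the bucketing by hundreds are illustrative and not part of any existing tool.

```python
import re
from collections import Counter

# Assumed naming pattern, inferred from hostnames quoted in this bug.
HOST_RE = re.compile(r"^(tst|bld)-(linux(?:32|64))-ec2-(\d+)$")


def classify(hostname):
    """Return (kind, platform, hundreds_bucket), or None if the name does not match."""
    m = HOST_RE.match(hostname)
    if m is None:
        return None
    kind, platform, number = m.group(1), m.group(2), int(m.group(3))
    # Bucket by hundreds so ranges such as "below 300" or "6xx" stand out.
    return kind, platform, (number // 100) * 100


def tally(hostnames):
    """Count failures per (kind, platform, bucket), skipping unrecognised names."""
    return Counter(c for c in map(classify, hostnames) if c is not None)
```

Feeding the list of affected slaves to `tally` gives a per-range count; mapping a range to an AWS region or a master is left to the reader, since that mapping is only described informally above.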
Bug 781860 may be at fault here.
Just caught this in tcpdump. The build in question is http://buildbot-master18.build.scl1.mozilla.com:8201/builders/Ubuntu%2012.04%20x64%20mozilla-central%20pgo%20test%20jsreftest/builds/51

Interrupted at Mon Mar 11 14:15:07 2013 with [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.]

The last few packets are:
14:15:05.500256 IP 10.134.56.226.37248 > 10.12.49.18.9201: . 5260116:5261491(1375) ack 847 win 1002 <nop,nop,timestamp 53341 444211914>
14:15:05.500267 IP 10.134.56.226.37248 > 10.12.49.18.9201: P 5261491:5262678(1187) ack 847 win 1002 <nop,nop,timestamp 53341 444211914>
14:15:05.501119 IP 10.12.49.18.9201 > 10.134.56.226.37248: . ack 5262678 win 5409 <nop,nop,timestamp 444212005 53341>
14:15:05.503124 IP 10.12.49.18.9201 > 10.134.56.226.37248: P 847:856(9) ack 5262678 win 5409 <nop,nop,timestamp 444212007 53341>
14:15:05.596952 IP 10.134.56.226.37248 > 10.12.49.18.9201: . ack 856 win 1002 <nop,nop,timestamp 53363 444212007>
14:15:06.594025 IP 10.134.56.226.37248 > 10.12.49.18.9201: F 5262678:5262678(0) ack 856 win 1002 <nop,nop,timestamp 53614 444212007>
14:15:06.594401 IP 10.12.49.18.9201 > 10.134.56.226.37248: F 856:856(0) ack 5262679 win 5409 <nop,nop,timestamp 444213098 53614>
14:15:06.681670 IP 10.134.56.226.37248 > 10.12.49.18.9201: . ack 857 win 1002 <nop,nop,timestamp 53636 444213098>
14:16:23.691716 IP 10.134.56.226.46578 > 10.12.49.18.9201: S 3381875677:3381875677(0) win 14600 <mss 1387,sackOK,timestamp 4294902761 0,nop,wscale 6>
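To check which side closed first in a capture like the one above, a small parser is enough. This is a sketch that assumes the older tcpdump text format shown here (timestamp, `IP`, `src > dst:`, then a flags token such as `.`, `P`, `F` or `S`); `first_fin` is a hypothetical helper name.

```python
import re

# Matches the leading part of each line in the older tcpdump text output shown above.
LINE_RE = re.compile(
    r"^(?P<ts>\d{2}:\d{2}:\d{2}\.\d+) IP (?P<src>\S+) > (?P<dst>\S+): (?P<flags>\S+)"
)


def first_fin(lines):
    """Return (timestamp, source) of the first packet carrying a FIN flag, or None."""
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m and "F" in m.group("flags"):
            return m.group("ts"), m.group("src")
    return None
```

Run over the packets listed above, this reports the first FIN at 14:15:06.594025 from 10.134.56.226.37248, i.e. the end that connected to port 9201 closed first.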
About the same time our auto-rebooter decided this machine was idle and rebooted it:
2013-03-11 14:15:03,566 - INFO - Rebooting the following instances:
2013-03-11 14:15:03,566 - INFO - tst-linux64-ec2-064
Sorry, that was our impaired instance watcher, not the idle watcher.
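The correlation between the watcher log and the interrupted build can be checked by comparing the two timestamps. The sketch below assumes both clocks are in the same timezone and that the two timestamp formats are exactly those quoted above; the helper names are hypothetical.

```python
from datetime import datetime


def parse_watcher_ts(line):
    """Parse the leading 'YYYY-MM-DD HH:MM:SS,mmm' stamp of a watcher log line."""
    stamp = line.split(" - ", 1)[0]
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S,%f")


def parse_interrupt_ts(text):
    """Parse a buildbot-style 'Mon Mar 11 14:15:07 2013' stamp (English locale assumed)."""
    return datetime.strptime(text, "%a %b %d %H:%M:%S %Y")


def seconds_from_reboot_to_interrupt(watcher_line, interrupt_text):
    """Seconds between the reboot being logged and the build being interrupted."""
    delta = parse_interrupt_ts(interrupt_text) - parse_watcher_ts(watcher_line)
    return delta.total_seconds()
```

With the values from this bug, the build is interrupted a few seconds after the reboot is logged, which is consistent with the reboot being the cause of this particular disconnect.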
I've disabled the impaired instance watcher for now. Let's see if this gets better.
(In reply to Chris AtLee [:catlee] from comment #122)
> I've disabled the impaired instance watcher for now. Let's see if this gets
> better.

Thank you for tracking this down - fingers crossed! :-)
Flags: needinfo?(coop)
Blocks: 851431
See Also: → 851622
Blocks: 851697
Product: mozilla.org → Release Engineering
Depends on: 925285
These should now be retries, thanks to bug 925285. The question is whether it's still worth leaving this bug open, or whether to close it and leave bug 918677 to handle making AWS connections more resilient, having in-house masters, etc.
Closing bugs where TBPLbot has previously commented, but which have not been modified for >3 months and do not contain the whiteboard strings for disabled/annotated tests or use the keyword leave-open. Filter on: mass-intermittent-bug-closure-2014-07
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → WORKSFORME
Component: General Automation → General