Closed Bug 812533 Opened 12 years ago Closed 12 years ago

talos-r3-fed64 slaves that are connected but are not given any jobs

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

x86
macOS
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: armenzg, Unassigned)

References

Details

(Whiteboard: [buildduty])

talos-r3-fed64-033
talos-r3-fed64-037
talos-r3-fed64-069

2012-11-16 08:12:54-0800 [Broker,client] Connected to buildbot-master18.build.scl1.mozilla.com:9201; slave is ready
2012-11-16 07:51:49-0800 [Broker,client] Connected to buildbot-master17.build.scl1.mozilla.com:9201; slave is ready
2012-11-16 08:18:28-0800 [Broker,client] Connected to buildbot-master18.build.scl1.mozilla.com:9201; slave is ready

Their uptimes are recent. Going by the slave log below, they seem to reboot 7 hours after connecting to the master (connected 01:07:47, idle reboot 08:07:47).
The buildbot masters show them as connected.

On the slave side:
2012-11-16 01:07:47-0800 [Broker,client] Connected to buildbot-master18.build.scl1.mozilla.com:9201; slave is ready
2012-11-16 08:07:47-0800 [-] I feel very idle and was thinking of rebooting as soon as the buildmaster says it's OK
2012-11-16 08:07:47-0800 [-] Telling the master we want to shutdown after any running builds are finished
2012-11-16 08:08:26-0800 [Broker,client] Master does not support slave initiated shutdown.  Upgrade master to 0.8.3 or later to use this feature.
2012-11-16 08:08:26-0800 [Broker,client] rebooting NOW, since the master won't talk to us
2012-11-16 08:08:26-0800 [Broker,client] Invoking platform-specific reboot command
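
For reference, the gap between the connect line and the idle-reboot line above can be computed straight from the timestamps. This is only a minimal sketch, assuming the timestamp format shown in the log; the helper name `parse_ts` is made up for illustration and is not part of buildbot.

```python
from datetime import datetime

def parse_ts(line):
    """Parse the leading 'YYYY-MM-DD HH:MM:SS-0800' timestamp of a log line."""
    # The timestamp occupies the first 24 characters of each line.
    return datetime.strptime(line[:24], "%Y-%m-%d %H:%M:%S%z")

# Two lines copied from the slave log above.
connected = "2012-11-16 01:07:47-0800 [Broker,client] Connected to buildbot-master18.build.scl1.mozilla.com:9201; slave is ready"
idle = "2012-11-16 08:07:47-0800 [-] I feel very idle and was thinking of rebooting as soon as the buildmaster says it's OK"

idle_gap = parse_ts(idle) - parse_ts(connected)
print(idle_gap)  # 7:00:00
```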

On the master's side (10.12.49.213==talos-r3-fed64-033):
2012-11-16 08:07:53-0800 [Broker,55578,10.12.49.213] Peer will receive following PB traceback:
2012-11-16 08:07:53-0800 [Broker,55578,10.12.49.213] Unhandled Error
        Traceback (most recent call last):
          File "/builds/buildbot/tests1-linux/lib/python2.6/site-packages/twisted/spread/banana.py", line 153, in gotItem
            self.callExpressionReceived(item)
          File "/builds/buildbot/tests1-linux/lib/python2.6/site-packages/twisted/spread/banana.py", line 116, in callExpressionReceived
            self.expressionReceived(obj)
          File "/builds/buildbot/tests1-linux/lib/python2.6/site-packages/twisted/spread/pb.py", line 514, in expressionReceived
            method(*sexp[1:])
          File "/builds/buildbot/tests1-linux/lib/python2.6/site-packages/twisted/spread/pb.py", line 826, in proto_message
            self._recvMessage(self.localObjectForID, requestID, objectID, message, answerRequired, netArgs, netKw)
        --- <exception caught here> ---
          File "/builds/buildbot/tests1-linux/lib/python2.6/site-packages/twisted/spread/pb.py", line 840, in _recvMessage
            netResult = object.remoteMessageReceived(self, message, netArgs, netKw)
          File "/builds/buildbot/tests1-linux/lib/python2.6/site-packages/twisted/spread/pb.py", line 223, in perspectiveMessageReceived
            method = getattr(self, "perspective_%s" % message)
        exceptions.AttributeError: BuildSlave instance has no attribute 'perspective_shutdown'
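
The last frames of the traceback show how the master resolves a remote call: it builds a method name from the incoming message and looks it up with `getattr`. Below is a simplified model of that lookup without Twisted. `OldBuildSlave`, `NewBuildSlave` and `dispatch` are hypothetical stand-ins for illustration, not the real buildbot classes.

```python
class OldBuildSlave:
    """Stand-in for a master-side slave object with no shutdown handler."""

    def perspective_keepalive(self):
        return "ok"


class NewBuildSlave(OldBuildSlave):
    """Stand-in for a master-side slave object that accepts shutdown requests."""

    def perspective_shutdown(self):
        return "graceful shutdown requested"


def dispatch(perspective, message):
    # Same lookup pattern as the pb.py frame in the traceback.
    method = getattr(perspective, "perspective_%s" % message)
    return method()


print(dispatch(NewBuildSlave(), "shutdown"))
try:
    dispatch(OldBuildSlave(), "shutdown")
except AttributeError as exc:
    # This is the failure the master logged for talos-r3-fed64-033.
    print("master cannot handle the request:", exc)
```

This matches the slave-side message that the master does not support slave-initiated shutdown: the request reaches the master, but there is no `perspective_shutdown` to call.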
From the master's perspective the story goes like this (log excerpts, most recent block first):
...
2012-11-16 08:08:25-0800 [Broker,55578,10.12.49.213] BuildSlave.detached(talos-r3-fed64-033)
2012-11-16 08:12:29-0800 [Broker,56308,10.12.49.213] Got slaveinfo from 'talos-r3-fed64-033'
2012-11-16 08:12:29-0800 [Broker,56308,10.12.49.213] bot attached
...
2012-11-16 01:01:56-0800 [Broker,55573,10.12.49.213] duplicate slave talos-r3-fed64-033; rejecting new slave and pinging old
2012-11-16 01:01:56-0800 [Broker,55573,10.12.49.213] old slave was connected from IPv4Address(TCP, '10.12.49.213', 58943)
2012-11-16 01:01:56-0800 [Broker,55573,10.12.49.213] new slave is from IPv4Address(TCP, '10.12.49.213', 46192)
...
2012-11-16 01:06:51-0800 [-] killing new slave on IPv4Address(TCP, '10.12.49.213', 46192)
2012-11-16 01:06:52-0800 [Broker,54874,10.12.49.213] BuildSlave.detached(talos-r3-fed64-033)
2012-11-16 01:06:52-0800 [Broker,54874,10.12.49.213] Unhandled error in Deferred:
2012-11-16 01:06:52-0800 [Broker,54874,10.12.49.213] Unhandled Error
        Traceback (most recent call last):
        Failure: twisted.spread.pb.PBConnectionLost: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.
        ]
...
2012-11-16 01:07:10-0800 [Broker,55578,10.12.49.213] Got slaveinfo from 'talos-r3-fed64-033'
2012-11-16 01:07:10-0800 [Broker,55578,10.12.49.213] bot attached
...
2012-11-15 17:55:37-0800 [Broker,54874,10.12.49.213] Got slaveinfo from 'talos-r3-fed64-033'
2012-11-15 17:55:38-0800 [Broker,54874,10.12.49.213] bot attached
2012-11-15 17:31:44-0800 [Broker,54421,10.12.49.213] BuildSlave.sendBuilderList (<BuildSlave 'talos-r3-fed64-033'>) failed
2012-11-15 17:31:44-0800 [Broker,54421,10.12.49.213] Unhandled Error
2012-11-15 17:31:44-0800 [Broker,54421,10.12.49.213] Unhandled Error
        Traceback (most recent call last):
        Failure: twisted.spread.pb.PBConnectionLost: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.
        ]
2012-11-15 17:31:44-0800 [Broker,54421,10.12.49.213] BuildSlave.detached(talos-r3-fed64-033)
...
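
To follow the sequence, the key attach/detach lines above can be put in chronological order. This is a minimal sketch, assuming the timestamp format shown in the log; `timeline` is a hypothetical helper name.

```python
from datetime import datetime

# A few of the master log lines quoted above, in the order they were pasted.
excerpt = [
    "2012-11-16 08:08:25-0800 [Broker,55578,10.12.49.213] BuildSlave.detached(talos-r3-fed64-033)",
    "2012-11-16 08:12:29-0800 [Broker,56308,10.12.49.213] bot attached",
    "2012-11-16 01:01:56-0800 [Broker,55573,10.12.49.213] duplicate slave talos-r3-fed64-033; rejecting new slave and pinging old",
    "2012-11-16 01:06:52-0800 [Broker,54874,10.12.49.213] BuildSlave.detached(talos-r3-fed64-033)",
    "2012-11-16 01:07:10-0800 [Broker,55578,10.12.49.213] bot attached",
    "2012-11-15 17:55:38-0800 [Broker,54874,10.12.49.213] bot attached",
    "2012-11-15 17:31:44-0800 [Broker,54421,10.12.49.213] BuildSlave.detached(talos-r3-fed64-033)",
]

def timeline(lines):
    """Return the lines sorted by their leading timestamp, oldest first."""
    return sorted(
        lines,
        key=lambda line: datetime.strptime(line[:24], "%Y-%m-%d %H:%M:%S%z"),
    )

for line in timeline(excerpt):
    print(line)
```

Read oldest first: connection 54421 detaches at 17:31, 54874 attaches at 17:55 and detaches at 01:06:52, a second connection (55573) is rejected as a duplicate at 01:01:56, 55578 attaches at 01:07:10 and detaches at 08:08:25 after the failed shutdown request, and 56308 attaches at 08:12:29.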
Armen, it's unclear to me what the actionable item for buildduty is here: a reimage, a netops conversation, something else?
Flags: needinfo?(armenzg)
> Armen, it's unclear to me what the actionable item for buildduty is here: a reimage, a netops conversation, something else?
I don't know myself.
Let's bring it to the Monday meeting and see if anyone has any suggestions.
Flags: needinfo?(armenzg)
bhearsum said that we can perhaps fix this by restarting the masters.

Let's try that.
bm17/bm18/bm24 were restarted today.
talos-r3-fed64-033, talos-r3-fed64-037 and talos-r3-fed64-069 are back now.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard