Closed Bug 833334 Opened 12 years ago Closed 12 years ago

aws slaves often cause hangs during reconfig

Categories

(Release Engineering :: General, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bhearsum, Assigned: catlee)

Attachments

(1 file)

One or more of the slaves often get into the state described in https://wiki.mozilla.org/ReleaseEngineering/How_To/Unstick_a_Stuck_Slave_From_A_Master. I think we talked about this a bit last week and thought that it might be happening because the slaves reboot before buildbot has fully shut down.
Just ran into this again on bm49 with bld-linux64-ec2-340. Here are some snippets from the log I think are relevant:
# start of reconfig
2013-01-22 13:39:25-0800 [-] loading configuration from /builds/buildbot/build1/master/master.cfg
# new builder steals buildslave reference from old builder
2013-01-22 13:39:59-0800 [-] stealing buildslave <SlaveBuilder builder='Linux mozilla-inbound build' slave='bld-linux64-ec2-340'>
# The slave is running the 'reboot' step. After 60s we give up and try and disconnect the slave.
# Note, there's no "BuildSlave.detached(bld-linux64-ec2-340)" here like we'd expect
2013-01-22 13:40:21-0800 [-] Forcibly disconnecting bld-linux64-ec2-340
2013-01-22 13:40:21-0800 [-] disconnecting old slave bld-linux64-ec2-340 now
2013-01-22 13:40:21-0800 [-] waiting for slave to finish disconnecting
# Slave has rebooted and tries to reconnect
2013-01-22 13:42:20-0800 [Broker,29446,10.134.52.85] duplicate slave bld-linux64-ec2-340; rejecting new slave and pinging old
2013-01-22 13:42:20-0800 [Broker,29446,10.134.52.85] old slave was connected from IPv4Address(TCP, '10.134.52.85', 42362)
2013-01-22 13:42:20-0800 [Broker,29446,10.134.52.85] new slave is from IPv4Address(TCP, '10.134.52.85', 41372)
...
# Old slave eventually disconnects
2013-01-22 13:56:48-0800 [Broker,29395,10.134.52.85] BuildSlave.sendBuilderList (<BuildSlave 'bld-linux64-ec2-340'>) failed
2013-01-22 13:56:48-0800 [Broker,29395,10.134.52.85] Unhandled Error
Traceback (most recent call last):
Failure: twisted.spread.pb.PBConnectionLost: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.
]
2013-01-22 13:56:48-0800 [Broker,29395,10.134.52.85] BuildSlave.detached(bld-linux64-ec2-340)
2013-01-22 13:56:48-0800 [Broker,29395,10.134.52.85] WEIRD: Builder.detached(<BuildSlave 'bld-linux64-ec2-340'>) (bld-linux64-ec2-340) not in attaching_slaves([]) or slaves([<SlaveBuilder builder='release-comm-beta-linux64_standalone_repack' slave='bld-linux64-ec2-301'>, ....
# reconfig done
2013-01-22 13:57:05-0800 [-] configuration update complete
Similar for bld-linux64-ec2-364:
# start of reconfig
2013-01-22 13:39:25-0800 [-] loading configuration from /builds/buildbot/build1/master/master.cfg
# Stole buildslave
2013-01-22 13:40:01-0800 [-] stealing buildslave <SlaveBuilder builder='Android no-ionmonkey mozilla-inbound build' slave='bld-linux64-ec2-364'>
# setBuilders._add
2013-01-22 13:40:14-0800 [-] setBuilders._add: [<buildbot.util.loop.DelegateLoop instance at 0x1de49a28>, ..., <BuildSlave 'bld-linux64-ec2-340'>, <BuildSlave 'bld-linux64-ec2-364'>,
# Force disconnect
2013-01-22 13:40:20-0800 [-] Forcibly disconnecting bld-linux64-ec2-364
2013-01-22 13:40:20-0800 [-] disconnecting old slave bld-linux64-ec2-364 now
2013-01-22 13:40:20-0800 [-] waiting for slave to finish disconnecting
# Tries to reconnect
2013-01-22 13:42:17-0800 [Broker,29445,10.134.52.57] duplicate slave bld-linux64-ec2-364; rejecting new slave and pinging old
2013-01-22 13:42:17-0800 [Broker,29445,10.134.52.57] old slave was connected from IPv4Address(TCP, '10.134.52.57', 41190)
2013-01-22 13:42:17-0800 [Broker,29445,10.134.52.57] new slave is from IPv4Address(TCP, '10.134.52.57', 34221)
# Finally disconnects
2013-01-22 13:56:55-0800 [Broker,29388,10.134.52.57] BuildSlave.sendBuilderList (<BuildSlave 'bld-linux64-ec2-364'>) failed
2013-01-22 13:56:55-0800 [Broker,29388,10.134.52.57] Unhandled Error
Traceback (most recent call last):
Failure: twisted.spread.pb.PBConnectionLost: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.
]
2013-01-22 13:56:55-0800 [Broker,29388,10.134.52.57] removing IStatusReceiver <WebStatus on port tcp:8001 at 0x3159a9e0>
2013-01-22 13:56:55-0800 [Broker,29388,10.134.52.57] removing IStatusReceiver <buildbotcustom.status.pulse.PulseStatus instance at 0x2aaabf611ef0>
2013-01-22 13:56:55-0800 [Broker,29388,10.134.52.57] Pulse <0x2aaabf611ef0>: shutting down
2013-01-22 13:56:55-0800 [Broker,29388,10.134.52.57] Pulse <0x2aaabf611ef0>: Processed 4 events (0 heartbeats) in 0.00 seconds
2013-01-22 13:56:55-0800 [Broker,29388,10.134.52.57] BuildSlave.detached(bld-linux64-ec2-364)
2013-01-22 13:56:55-0800 [Broker,29388,10.134.52.57] WEIRD: Builder.detached(<BuildSlave 'bld-linux64-ec2-364'>) (bld-linux64-ec2-364) not in attaching_slaves([]) or slaves([...
# reconfig done
2013-01-22 13:57:05-0800 [-] configuration update complete
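To spot this pattern without reading the log by hand, something like the following could measure how long each slave sat between "Forcibly disconnecting" and "BuildSlave.detached". This is only a rough sketch: the function name detach_delays and the regexes are my own, written against the line format in the snippets above, and the sample lines are copied from those snippets.

```python
import re
from datetime import datetime

# Assumed line shape, based on the snippets above: "<timestamp> [<source>] <message>"
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}[+-]\d{4}) \[[^\]]*\] (.*)$"
)
FORCE_RE = re.compile(r"^Forcibly disconnecting (\S+)$")
DETACH_RE = re.compile(r"^BuildSlave\.detached\(([^)]+)\)$")


def detach_delays(lines):
    """Map slave name -> seconds between forced disconnect and detach.

    A slave that was forcibly disconnected but never detached maps to None.
    """
    pending = {}  # slave name -> time of "Forcibly disconnecting"
    delays = {}
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # comment lines, tracebacks, "..." and so on
        when = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S%z")
        message = match.group(2)
        forced = FORCE_RE.match(message)
        if forced:
            pending.setdefault(forced.group(1), when)
            continue
        detached = DETACH_RE.match(message)
        if detached and detached.group(1) in pending:
            name = detached.group(1)
            delays[name] = (when - pending.pop(name)).total_seconds()
    for name in pending:
        delays[name] = None
    return delays


SAMPLE = """\
2013-01-22 13:40:21-0800 [-] Forcibly disconnecting bld-linux64-ec2-340
2013-01-22 13:40:20-0800 [-] Forcibly disconnecting bld-linux64-ec2-364
2013-01-22 13:56:48-0800 [Broker,29395,10.134.52.85] BuildSlave.detached(bld-linux64-ec2-340)
2013-01-22 13:56:55-0800 [Broker,29388,10.134.52.57] BuildSlave.detached(bld-linux64-ec2-364)
""".splitlines()

if __name__ == "__main__":
    for slave, seconds in sorted(detach_delays(SAMPLE).items()):
        print(slave, seconds)
```

On the two slaves above this gives roughly 16.5 minutes each, which matches how long the reconfig was blocked.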
Assignee: nobody → catlee
Blocks: 835360
I really suspect DisconnectStep for this. All of the occurrences of this I've seen have been when the slave is attempting to reboot.
We had a tree closure today when test masters that have EC2 machines connected to them forced a graceful shutdown, which caused a big test backlog. We really need to fix this :(.
I had a bunch of linux test slaves (3-5 per master, over 3 masters) that were stuck in download_props if you believe the individual job status.
See Also: → 851622
Summary: aws masters often hang during reconfig → aws slaves often cause hangs during reconfig
I haven't been able to reproduce the problem...but this doesn't seem to hurt. It attempts to forcibly close the socket for the old slave connection.
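For anyone following along, the general idea of forcibly closing a socket can be illustrated with plain Python sockets. To be clear, this is not the attached patch and not Buildbot or Twisted code; force_close is a hypothetical helper, and the socket pair just stands in for the master's connection to the old slave.

```python
import socket


def force_close(sock):
    """Hypothetical helper: tear down a connection without waiting on the peer."""
    try:
        # Stop both directions right away so nothing keeps waiting on this socket.
        sock.shutdown(socket.SHUT_RDWR)
    except OSError:
        # Already closed or never connected; there is nothing left to shut down.
        pass
    sock.close()


if __name__ == "__main__":
    # A connected pair stands in for (master side, slave side).
    old_conn, peer = socket.socketpair()
    force_close(old_conn)
    # The other end sees end-of-stream immediately instead of hanging.
    print(repr(peer.recv(1)))
    peer.close()
```

The point is that the side doing the close does not depend on the rebooting machine ever answering, which seems to be what was holding up the detach in the logs above.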
Attachment #726270 - Flags: review?(rail)
Attachment #726270 - Flags: review?(rail) → review+
Attachment #726270 - Flags: checked-in+
This is in production now.
Fixed I think?
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering
Component: General Automation → General