Bug 833334 · Opened 11 years ago · Closed 11 years ago

aws slaves often cause hangs during reconfig

Categories: Release Engineering :: General, defect
Priority: Not set
Severity: normal
Tracking: Not tracked
Status: RESOLVED FIXED
Reporter: bhearsum
Assignee: catlee
Attachments: 1 file

One or more of the slaves often get into the state described in https://wiki.mozilla.org/ReleaseEngineering/How_To/Unstick_a_Stuck_Slave_From_A_Master

I think we talked about this a bit last week and thought that it might be happening because the slaves reboot before buildbot has fully shut down.
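To make the suspected mechanism concrete (a hedged illustration, not anything we actually deploy): if the OS reboot kills the slave before its TCP connection to the master is closed cleanly, no FIN ever reaches the master, and the master's end of the socket sits half-open until TCP itself gives up. Something like TCP keepalive would at least bound how long that takes; the function name and timing values here are made up:

import socket

def enable_tcp_keepalive(sock, idle=60, interval=10, count=3):
    # Have the kernel probe an idle connection so a peer that vanished
    # without sending a FIN (e.g. a slave whose OS rebooted out from
    # under buildbot) is detected after roughly idle + interval * count
    # seconds, instead of lingering half-open for many minutes.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # TCP_KEEPIDLE/KEEPINTVL/KEEPCNT are Linux-specific socket options.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)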
Just ran into this again on bm49 with bld-linux64-ec2-340. Here are some snippets from the log that I think are relevant:

# start of reconfig
2013-01-22 13:39:25-0800 [-] loading configuration from /builds/buildbot/build1/master/master.cfg

# new builder steals buildslave reference from old builder
2013-01-22 13:39:59-0800 [-]  stealing buildslave <SlaveBuilder builder='Linux mozilla-inbound build' slave='bld-linux64-ec2-340'>

# The slave is running the 'reboot' step. After 60s we give up and try to disconnect the slave.
# Note, there's no "BuildSlave.detached(bld-linux64-ec2-340)" here like we'd expect
2013-01-22 13:40:21-0800 [-] Forcibly disconnecting bld-linux64-ec2-340
2013-01-22 13:40:21-0800 [-] disconnecting old slave bld-linux64-ec2-340 now
2013-01-22 13:40:21-0800 [-] waiting for slave to finish disconnecting

# Slave has rebooted and tries to reconnect
2013-01-22 13:42:20-0800 [Broker,29446,10.134.52.85] duplicate slave bld-linux64-ec2-340; rejecting new slave and pinging old
2013-01-22 13:42:20-0800 [Broker,29446,10.134.52.85] old slave was connected from IPv4Address(TCP, '10.134.52.85', 42362)
2013-01-22 13:42:20-0800 [Broker,29446,10.134.52.85] new slave is from IPv4Address(TCP, '10.134.52.85', 41372)
...

# Old slave eventually disconnects
2013-01-22 13:56:48-0800 [Broker,29395,10.134.52.85] BuildSlave.sendBuilderList (<BuildSlave 'bld-linux64-ec2-340'>) failed
2013-01-22 13:56:48-0800 [Broker,29395,10.134.52.85] Unhandled Error
        Traceback (most recent call last):
        Failure: twisted.spread.pb.PBConnectionLost: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.
        ]

2013-01-22 13:56:48-0800 [Broker,29395,10.134.52.85] BuildSlave.detached(bld-linux64-ec2-340)
2013-01-22 13:56:48-0800 [Broker,29395,10.134.52.85] WEIRD: Builder.detached(<BuildSlave 'bld-linux64-ec2-340'>) (bld-linux64-ec2-340) not in attaching_slaves([]) or slaves([<SlaveBuilder builder='release-comm-beta-linux64_standalone_repack' slave='bld-linux64-ec2-301'>, ....

# reconfig done
2013-01-22 13:57:05-0800 [-] configuration update complete

Similar for bld-linux64-ec2-364:

# start of reconfig
2013-01-22 13:39:25-0800 [-] loading configuration from /builds/buildbot/build1/master/master.cfg

# Stole buildslave
2013-01-22 13:40:01-0800 [-]  stealing buildslave <SlaveBuilder builder='Android no-ionmonkey mozilla-inbound build' slave='bld-linux64-ec2-364'>

# setBuilders._add
2013-01-22 13:40:14-0800 [-] setBuilders._add: [<buildbot.util.loop.DelegateLoop instance at 0x1de49a28>, ..., <BuildSlave 'bld-linux64-ec2-340'>, <BuildSlave 'bld-linux64-ec2-364'>,

# Force disconnect
2013-01-22 13:40:20-0800 [-] Forcibly disconnecting bld-linux64-ec2-364
2013-01-22 13:40:20-0800 [-] disconnecting old slave bld-linux64-ec2-364 now
2013-01-22 13:40:20-0800 [-] waiting for slave to finish disconnecting

# Tries to reconnect
2013-01-22 13:42:17-0800 [Broker,29445,10.134.52.57] duplicate slave bld-linux64-ec2-364; rejecting new slave and pinging old
2013-01-22 13:42:17-0800 [Broker,29445,10.134.52.57] old slave was connected from IPv4Address(TCP, '10.134.52.57', 41190)
2013-01-22 13:42:17-0800 [Broker,29445,10.134.52.57] new slave is from IPv4Address(TCP, '10.134.52.57', 34221)

# Finally disconnects
2013-01-22 13:56:55-0800 [Broker,29388,10.134.52.57] BuildSlave.sendBuilderList (<BuildSlave 'bld-linux64-ec2-364'>) failed
2013-01-22 13:56:55-0800 [Broker,29388,10.134.52.57] Unhandled Error
        Traceback (most recent call last):
        Failure: twisted.spread.pb.PBConnectionLost: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.
        ]

2013-01-22 13:56:55-0800 [Broker,29388,10.134.52.57] removing IStatusReceiver <WebStatus on port tcp:8001 at 0x3159a9e0>
2013-01-22 13:56:55-0800 [Broker,29388,10.134.52.57] removing IStatusReceiver <buildbotcustom.status.pulse.PulseStatus instance at 0x2aaabf611ef0>
2013-01-22 13:56:55-0800 [Broker,29388,10.134.52.57] Pulse <0x2aaabf611ef0>: shutting down
2013-01-22 13:56:55-0800 [Broker,29388,10.134.52.57] Pulse <0x2aaabf611ef0>: Processed 4 events (0 heartbeats) in 0.00 seconds
2013-01-22 13:56:55-0800 [Broker,29388,10.134.52.57] BuildSlave.detached(bld-linux64-ec2-364)
2013-01-22 13:56:55-0800 [Broker,29388,10.134.52.57] WEIRD: Builder.detached(<BuildSlave 'bld-linux64-ec2-364'>) (bld-linux64-ec2-364) not in attaching_slaves([]) or slaves([...

# reconfig done
2013-01-22 13:57:05-0800 [-] configuration update complete
Assignee: nobody → catlee
Blocks: 835360
I really suspect DisconnectStep for this. Every occurrence of this I've seen has been while the slave was attempting to reboot.
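For reference, here's the shape of step I mean (a hedged sketch against buildbot 0.8's API, not the actual buildbotcustom DisconnectStep; the class name and reboot command are made up). The step launches a reboot and counts the resulting connection loss as success, so nothing guarantees the slave's buildbot process has shut down before the OS cuts the connection:

from buildbot.steps.shell import ShellCommand
from buildbot.status.builder import SUCCESS

class RebootAndDisconnect(ShellCommand):
    # Hypothetical stand-in for a DisconnectStep-style step: it fires
    # off a reboot and treats losing the slave as the expected outcome.
    name = 'reboot'

    def __init__(self, **kwargs):
        kwargs.setdefault('command', ['sudo', 'reboot'])
        ShellCommand.__init__(self, **kwargs)

    def interrupt(self, reason):
        # The connection dropping is what a reboot is supposed to do,
        # so an interrupted step is marked finished rather than failed.
        # This is exactly the race: the master may still be mid-reconfig
        # and holding a half-open socket when the machine goes down.
        self.finished(SUCCESS)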
We had a tree closure today when test masters with EC2 machines connected to them were forced into a graceful shutdown, which caused a big test backlog. We really need to fix this :(.
I had a bunch of Linux test slaves (3-5 per master, across 3 masters) that were stuck in download_props, if you believe the individual job status.
See Also: → 851622
Summary: aws masters often hang during reconfig → aws slaves often cause hangs during reconfig
I haven't been able to reproduce the problem...but this doesn't seem to hurt. It attempts to forcibly close the socket for the old slave connection.
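The attachment itself isn't inlined here, but the approach is roughly this (a hedged sketch: I'm assuming the stale perspective-broker connection is reachable as slave.slave.broker, and that we're on a Twisted new enough to have abortConnection()):

def force_close_old_connection(old_slave):
    # Tear the stale TCP socket down immediately instead of waiting for
    # the rebooted peer to time out.  old_slave.slave is assumed to be
    # the pb.RemoteReference for the old, dead connection.
    transport = old_slave.slave.broker.transport
    if hasattr(transport, 'abortConnection'):
        # Twisted >= 11.1: close without waiting for pending writes or
        # for a FIN that the rebooted slave will never send.
        transport.abortConnection()
    else:
        # loseConnection() alone waits for a clean close, which is what
        # hung the reconfig for ~16 minutes in the logs above.
        transport.loseConnection()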
Attachment #726270 - Flags: review?(rail)
Attachment #726270 - Flags: review?(rail) → review+
Attachment #726270 - Flags: checked-in+
This is in production now.
Fixed I think?
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering
Component: General Automation → General