Closed
Bug 833334
Opened 12 years ago
Closed 12 years ago
aws slaves often cause hangs during reconfig
Categories
(Release Engineering :: General, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: bhearsum, Assigned: catlee)
Attachments
(1 file)
1.01 KB, patch | rail: review+ | catlee: checked-in+
One or more of the slaves often get into the state described in https://wiki.mozilla.org/ReleaseEngineering/How_To/Unstick_a_Stuck_Slave_From_A_Master
I think we talked about this a bit last week and thought it might be happening because the slaves reboot before buildbot has fully shut down.
Comment 1 (Assignee) • 12 years ago
Just ran into this again on bm49 with bld-linux64-ec2-340. Here are some snippets from the log I think are relevant:
# start of reconfig
2013-01-22 13:39:25-0800 [-] loading configuration from /builds/buildbot/build1/master/master.cfg
# new builder steals buildslave reference from old builder
2013-01-22 13:39:59-0800 [-] stealing buildslave <SlaveBuilder builder='Linux mozilla-inbound build' slave='bld-linux64-ec2-340'>
# The slave is running the 'reboot' step. After 60s we give up and try to disconnect the slave.
# Note, there's no "BuildSlave.detached(bld-linux64-ec2-340)" here like we'd expect
2013-01-22 13:40:21-0800 [-] Forcibly disconnecting bld-linux64-ec2-340
2013-01-22 13:40:21-0800 [-] disconnecting old slave bld-linux64-ec2-340 now
2013-01-22 13:40:21-0800 [-] waiting for slave to finish disconnecting
# Slave has rebooted and tries to reconnect
2013-01-22 13:42:20-0800 [Broker,29446,10.134.52.85] duplicate slave bld-linux64-ec2-340; rejecting new slave and pinging old
2013-01-22 13:42:20-0800 [Broker,29446,10.134.52.85] old slave was connected from IPv4Address(TCP, '10.134.52.85', 42362)
2013-01-22 13:42:20-0800 [Broker,29446,10.134.52.85] new slave is from IPv4Address(TCP, '10.134.52.85', 41372)
...
# Old slave eventually disconnects
2013-01-22 13:56:48-0800 [Broker,29395,10.134.52.85] BuildSlave.sendBuilderList (<BuildSlave 'bld-linux64-ec2-340'>) failed
2013-01-22 13:56:48-0800 [Broker,29395,10.134.52.85] Unhandled Error
Traceback (most recent call last):
Failure: twisted.spread.pb.PBConnectionLost: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.
]
2013-01-22 13:56:48-0800 [Broker,29395,10.134.52.85] BuildSlave.detached(bld-linux64-ec2-340)
2013-01-22 13:56:48-0800 [Broker,29395,10.134.52.85] WEIRD: Builder.detached(<BuildSlave 'bld-linux64-ec2-340'>) (bld-linux64-ec2-340) not in attaching_slaves([]) or slaves([<SlaveBuilder builder='release-comm-beta-linux64_standalone_repack' slave='bld-linux64-ec2-301'>, ....
# reconfig done
2013-01-22 13:57:05-0800 [-] configuration update complete
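To put a number on the stall in the log above, here is a minimal sketch that measures the gap between the forced disconnect and the eventual `BuildSlave.detached` line. The helper names `parse_ts` and `stall_seconds` are hypothetical and are not part of buildbot; the two log lines are copied from this comment.

```python
from datetime import datetime

# Two lines copied verbatim from the log excerpt in this comment.
LOG = """\
2013-01-22 13:40:21-0800 [-] Forcibly disconnecting bld-linux64-ec2-340
2013-01-22 13:56:48-0800 [Broker,29395,10.134.52.85] BuildSlave.detached(bld-linux64-ec2-340)
"""


def parse_ts(line):
    # The timestamp is the first 24 characters, e.g. "2013-01-22 13:40:21-0800".
    return datetime.strptime(line[:24], "%Y-%m-%d %H:%M:%S%z")


def stall_seconds(log, slave):
    """Seconds between the forced disconnect and the detach for one slave.

    Returns None if either line is missing from the log text.
    """
    start = end = None
    for line in log.splitlines():
        if f"Forcibly disconnecting {slave}" in line:
            start = parse_ts(line)
        elif f"BuildSlave.detached({slave})" in line:
            end = parse_ts(line)
    if start is None or end is None:
        return None
    return (end - start).total_seconds()


print(stall_seconds(LOG, "bld-linux64-ec2-340"))  # 987.0
```

That is roughly 16.5 minutes during which the master was waiting on the old connection.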
Comment 2 (Assignee) • 12 years ago
Similar for bld-linux64-ec2-364:
# start of reconfig
2013-01-22 13:39:25-0800 [-] loading configuration from /builds/buildbot/build1/master/master.cfg
# Stole buildslave
2013-01-22 13:40:01-0800 [-] stealing buildslave <SlaveBuilder builder='Android no-ionmonkey mozilla-inbound build' slave='bld-linux64-ec2-364'>
# setBuilders._add
2013-01-22 13:40:14-0800 [-] setBuilders._add: [<buildbot.util.loop.DelegateLoop instance at 0x1de49a28>, ..., <BuildSlave 'bld-linux64-ec2-340'>, <BuildSlave 'bld-linux64-ec2-364'>,
# Force disconnect
2013-01-22 13:40:20-0800 [-] Forcibly disconnecting bld-linux64-ec2-364
2013-01-22 13:40:20-0800 [-] disconnecting old slave bld-linux64-ec2-364 now
2013-01-22 13:40:20-0800 [-] waiting for slave to finish disconnecting
# Tries to reconnect
2013-01-22 13:42:17-0800 [Broker,29445,10.134.52.57] duplicate slave bld-linux64-ec2-364; rejecting new slave and pinging old
2013-01-22 13:42:17-0800 [Broker,29445,10.134.52.57] old slave was connected from IPv4Address(TCP, '10.134.52.57', 41190)
2013-01-22 13:42:17-0800 [Broker,29445,10.134.52.57] new slave is from IPv4Address(TCP, '10.134.52.57', 34221)
# Finally disconnects
2013-01-22 13:56:55-0800 [Broker,29388,10.134.52.57] BuildSlave.sendBuilderList (<BuildSlave 'bld-linux64-ec2-364'>) failed
2013-01-22 13:56:55-0800 [Broker,29388,10.134.52.57] Unhandled Error
Traceback (most recent call last):
Failure: twisted.spread.pb.PBConnectionLost: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.
]
2013-01-22 13:56:55-0800 [Broker,29388,10.134.52.57] removing IStatusReceiver <WebStatus on port tcp:8001 at 0x3159a9e0>
2013-01-22 13:56:55-0800 [Broker,29388,10.134.52.57] removing IStatusReceiver <buildbotcustom.status.pulse.PulseStatus instance at 0x2aaabf611ef0>
2013-01-22 13:56:55-0800 [Broker,29388,10.134.52.57] Pulse <0x2aaabf611ef0>: shutting down
2013-01-22 13:56:55-0800 [Broker,29388,10.134.52.57] Pulse <0x2aaabf611ef0>: Processed 4 events (0 heartbeats) in 0.00 seconds
2013-01-22 13:56:55-0800 [Broker,29388,10.134.52.57] BuildSlave.detached(bld-linux64-ec2-364)
2013-01-22 13:56:55-0800 [Broker,29388,10.134.52.57] WEIRD: Builder.detached(<BuildSlave 'bld-linux64-ec2-364'>) (bld-linux64-ec2-364) not in attaching_slaves([]) or slaves([...
# reconfig done
2013-01-22 13:57:05-0800 [-] configuration update complete
Updated (Assignee) • 12 years ago
Assignee: nobody → catlee
Comment 3 (Assignee) • 12 years ago
I strongly suspect DisconnectStep is responsible for this. Every occurrence I've seen has been while the slave was attempting to reboot.
Comment 4 (Reporter) • 12 years ago
We had a tree closure today when test masters that have EC2 machines connected to them forced a graceful shutdown, which caused a big test backlog. We really need to fix this :(.
Comment 5 • 12 years ago
I had a bunch of Linux test slaves (3-5 per master, across 3 masters) that were stuck in download_props, if you believe the individual job status.
Updated • 12 years ago
Summary: aws masters often hang during reconfig → aws slaves often cause hangs during reconfig
Comment 6 (Assignee) • 12 years ago
I haven't been able to reproduce the problem, but this patch doesn't seem to hurt. It attempts to forcibly close the socket for the old slave connection.
Attachment #726270 - Flags: review?(rail)
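Comment 6 describes the patch as forcibly closing the socket for the old slave connection. The patch itself is not reproduced in this thread, so the following is only a generic sketch of forcing a socket closed with Python's standard library. `force_close` is a hypothetical helper, not buildbot or Twisted code, and the socket pair stands in for the stale master-to-slave connection.

```python
import socket


def force_close(sock):
    """Shut down both directions of the connection, then release the descriptor."""
    try:
        # Stop further sends and receives so the peer sees end-of-stream.
        sock.shutdown(socket.SHUT_RDWR)
    except OSError:
        # The peer may already be gone; closing is still safe.
        pass
    finally:
        sock.close()


# Stand-in for the old slave connection and its remote end.
old_conn, peer = socket.socketpair()
force_close(old_conn)

closed = old_conn.fileno() == -1   # descriptor has been released locally
eof = peer.recv(1) == b""          # remote end observes end-of-stream
peer.close()
```

The point of closing from the master side is that it no longer depends on the rebooting slave to finish the disconnect, which is the wait seen in the logs above.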
Updated • 12 years ago
Attachment #726270 - Flags: review?(rail) → review+
Updated (Assignee) • 12 years ago
Attachment #726270 - Flags: checked-in+
Comment 7 • 12 years ago
This is in production now.
Comment 8 (Assignee) • 12 years ago
Fixed I think?
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Updated • 12 years ago
Product: mozilla.org → Release Engineering
Updated • 7 years ago
Component: General Automation → General