Bug 833334
Opened 11 years ago
Closed 11 years ago
aws slaves often cause hangs during reconfig
Categories
(Release Engineering :: General, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: bhearsum, Assigned: catlee)
Attachments
(1 file)
1.01 KB, patch | rail: review+ | catlee: checked-in+
One or more of the slaves often get into the state described in https://wiki.mozilla.org/ReleaseEngineering/How_To/Unstick_a_Stuck_Slave_From_A_Master. I think we talked about this a bit last week and thought that it might be happening because the slaves reboot before buildbot has fully shut down.
Comment 1 • 11 years ago (Assignee)
Just ran into this again on bm49 with bld-linux64-ec2-340. Here are some snippets from the log I think are relevant:

# start of reconfig
2013-01-22 13:39:25-0800 [-] loading configuration from /builds/buildbot/build1/master/master.cfg

# new builder steals buildslave reference from old builder
2013-01-22 13:39:59-0800 [-] stealing buildslave <SlaveBuilder builder='Linux mozilla-inbound build' slave='bld-linux64-ec2-340'>

# The slave is running the 'reboot' step. After 60s we give up and try and disconnect the slave.
# Note, there's no "BuildSlave.detached(bld-linux64-ec2-340)" here like we'd expect
2013-01-22 13:40:21-0800 [-] Forcibly disconnecting bld-linux64-ec2-340
2013-01-22 13:40:21-0800 [-] disconnecting old slave bld-linux64-ec2-340 now
2013-01-22 13:40:21-0800 [-] waiting for slave to finish disconnecting

# Slave has rebooted and tries to reconnect
2013-01-22 13:42:20-0800 [Broker,29446,10.134.52.85] duplicate slave bld-linux64-ec2-340; rejecting new slave and pinging old
2013-01-22 13:42:20-0800 [Broker,29446,10.134.52.85] old slave was connected from IPv4Address(TCP, '10.134.52.85', 42362)
2013-01-22 13:42:20-0800 [Broker,29446,10.134.52.85] new slave is from IPv4Address(TCP, '10.134.52.85', 41372)
...

# Old slave eventually disconnects
2013-01-22 13:56:48-0800 [Broker,29395,10.134.52.85] BuildSlave.sendBuilderList (<BuildSlave 'bld-linux64-ec2-340'>) failed
2013-01-22 13:56:48-0800 [Broker,29395,10.134.52.85] Unhandled Error
Traceback (most recent call last):
Failure: twisted.spread.pb.PBConnectionLost: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.
]
2013-01-22 13:56:48-0800 [Broker,29395,10.134.52.85] BuildSlave.detached(bld-linux64-ec2-340)
2013-01-22 13:56:48-0800 [Broker,29395,10.134.52.85] WEIRD: Builder.detached(<BuildSlave 'bld-linux64-ec2-340'>) (bld-linux64-ec2-340) not in attaching_slaves([]) or slaves([<SlaveBuilder builder='release-comm-beta-linux64_standalone_repack' slave='bld-linux64-ec2-301'>, ....

# reconfig done
2013-01-22 13:57:05-0800 [-] configuration update complete
Comment 2 • 11 years ago (Assignee)
Similar for bld-linux64-ec2-364:

# start of reconfig
2013-01-22 13:39:25-0800 [-] loading configuration from /builds/buildbot/build1/master/master.cfg

# Stole buildslave
2013-01-22 13:40:01-0800 [-] stealing buildslave <SlaveBuilder builder='Android no-ionmonkey mozilla-inbound build' slave='bld-linux64-ec2-364'>

# setBuilders._add
2013-01-22 13:40:14-0800 [-] setBuilders._add: [<buildbot.util.loop.DelegateLoop instance at 0x1de49a28>, ..., <BuildSlave 'bld-linux64-ec2-340'>, <BuildSlave 'bld-linux64-ec2-364'>,

# Force disconnect
2013-01-22 13:40:20-0800 [-] Forcibly disconnecting bld-linux64-ec2-364
2013-01-22 13:40:20-0800 [-] disconnecting old slave bld-linux64-ec2-364 now
2013-01-22 13:40:20-0800 [-] waiting for slave to finish disconnecting

# Tries to reconnect
2013-01-22 13:42:17-0800 [Broker,29445,10.134.52.57] duplicate slave bld-linux64-ec2-364; rejecting new slave and pinging old
2013-01-22 13:42:17-0800 [Broker,29445,10.134.52.57] old slave was connected from IPv4Address(TCP, '10.134.52.57', 41190)
2013-01-22 13:42:17-0800 [Broker,29445,10.134.52.57] new slave is from IPv4Address(TCP, '10.134.52.57', 34221)

# Finally disconnects
2013-01-22 13:56:55-0800 [Broker,29388,10.134.52.57] BuildSlave.sendBuilderList (<BuildSlave 'bld-linux64-ec2-364'>) failed
2013-01-22 13:56:55-0800 [Broker,29388,10.134.52.57] Unhandled Error
Traceback (most recent call last):
Failure: twisted.spread.pb.PBConnectionLost: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.
]
2013-01-22 13:56:55-0800 [Broker,29388,10.134.52.57] removing IStatusReceiver <WebStatus on port tcp:8001 at 0x3159a9e0>
2013-01-22 13:56:55-0800 [Broker,29388,10.134.52.57] removing IStatusReceiver <buildbotcustom.status.pulse.PulseStatus instance at 0x2aaabf611ef0>
2013-01-22 13:56:55-0800 [Broker,29388,10.134.52.57] Pulse <0x2aaabf611ef0>: shutting down
2013-01-22 13:56:55-0800 [Broker,29388,10.134.52.57] Pulse <0x2aaabf611ef0>: Processed 4 events (0 heartbeats) in 0.00 seconds
2013-01-22 13:56:55-0800 [Broker,29388,10.134.52.57] BuildSlave.detached(bld-linux64-ec2-364)
2013-01-22 13:56:55-0800 [Broker,29388,10.134.52.57] WEIRD: Builder.detached(<BuildSlave 'bld-linux64-ec2-364'>) (bld-linux64-ec2-364) not in attaching_slaves([]) or slaves([...

# reconfig done
2013-01-22 13:57:05-0800 [-] configuration update complete
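
To put a number on how long each slave held up the reconfig, the gap between the "Forcibly disconnecting" line and the matching "BuildSlave.detached" line can be pulled out of the master log. The following is only a rough sketch: the function name `stall_durations` and the `SAMPLE_LOG` list are made up for illustration (the sample lines are copied from the snippets above), and it assumes every log line starts with a timestamp followed by a bracketed source tag.

```python
import re
from datetime import datetime

# Timestamp, bracketed source tag, then the message (format taken from the snippets above).
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}[+-]\d{4}) \[[^\]]*\] (.*)$")
DETACHED_RE = re.compile(r"BuildSlave\.detached\((.+)\)$")

# Hypothetical sample: four lines copied from the log excerpts in this bug.
SAMPLE_LOG = [
    "2013-01-22 13:40:21-0800 [-] Forcibly disconnecting bld-linux64-ec2-340",
    "2013-01-22 13:40:20-0800 [-] Forcibly disconnecting bld-linux64-ec2-364",
    "2013-01-22 13:56:48-0800 [Broker,29395,10.134.52.85] BuildSlave.detached(bld-linux64-ec2-340)",
    "2013-01-22 13:56:55-0800 [Broker,29388,10.134.52.57] BuildSlave.detached(bld-linux64-ec2-364)",
]


def stall_durations(lines):
    """Map slave name -> time between the forced disconnect and the actual detach."""
    forced = {}   # slave name -> time the master started forcing the disconnect
    stalls = {}   # slave name -> timedelta until BuildSlave.detached was logged
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # continuation lines (tracebacks etc.) carry no timestamp
        when = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S%z")
        message = match.group(2)
        if message.startswith("Forcibly disconnecting "):
            forced[message.split()[-1]] = when
            continue
        detached = DETACHED_RE.match(message)
        if detached and detached.group(1) in forced:
            name = detached.group(1)
            stalls[name] = when - forced.pop(name)
    return stalls


if __name__ == "__main__":
    for name, delta in sorted(stall_durations(SAMPLE_LOG).items()):
        print(name, delta)
    # bld-linux64-ec2-340 0:16:27
    # bld-linux64-ec2-364 0:16:35
```

Both slaves took over 16 minutes to detach, which accounts for most of the time between the start of the reconfig (13:39:25) and "configuration update complete" (13:57:05).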
Updated • 11 years ago
Assignee: nobody → catlee
Comment 3 • 11 years ago (Assignee)
I really suspect DisconnectStep for this. All of the occurrences of this I've seen have been when the slave is attempting to reboot.
Comment 4 • 11 years ago (Reporter)
We had a tree closure today when test masters that have EC2 machines connected to them forced a graceful shutdown, which caused a big test backlog. We really need to fix this :(
Comment 5 • 11 years ago
I had a bunch of linux test slaves (3-5 per master, over 3 masters) that were stuck in download_props if you believe the individual job status.
Updated • 11 years ago
Summary: aws masters often hang during reconfig → aws slaves often cause hangs during reconfig
Comment 6 • 11 years ago (Assignee)
I haven't been able to reproduce the problem, but this patch doesn't seem to hurt. It attempts to forcibly close the socket for the old slave connection.
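
The attached patch is not reproduced here. As a general illustration of what "forcibly close the socket" can mean, here is a minimal standard-library sketch, not the buildbot or Twisted code: it sets SO_LINGER with a zero timeout so that close() drops the connection at once instead of waiting for a peer that has already rebooted. The names `abort_connection` and `demo` are hypothetical, and the two-int `struct linger` layout is an assumption that holds on Linux and similar Unix systems.

```python
import socket
import struct


def abort_connection(sock: socket.socket) -> None:
    """Close `sock` at once, discarding unsent data instead of draining it."""
    # SO_LINGER with l_onoff=1 and l_linger=0 makes close() reset the
    # connection immediately rather than waiting for the peer to
    # acknowledge queued data. The two-int layout of struct linger is an
    # assumption that holds on Linux and other common Unix systems.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))
    sock.close()


def demo() -> str:
    """Abort the master-side end of a loopback connection; report what the peer sees."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    slave_side = socket.create_connection(listener.getsockname(), timeout=5)
    master_side, _addr = listener.accept()
    try:
        abort_connection(master_side)
        try:
            data = slave_side.recv(1)
        except ConnectionResetError:
            return "reset"  # the peer saw an abortive close
        return "eof" if data == b"" else "data"
    finally:
        slave_side.close()
        listener.close()


if __name__ == "__main__":
    print(demo())
```

The point of an abortive close in this situation is that the master's side of the connection is released immediately, so it does not have to wait for the old connection to time out before the rebooted slave can reattach.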
Attachment #726270 - Flags: review?(rail)
Updated • 11 years ago
Attachment #726270 - Flags: review?(rail) → review+
Updated • 11 years ago (Assignee)
Attachment #726270 - Flags: checked-in+
Comment 7 • 11 years ago
This is in production now.
Comment 8 • 11 years ago (Assignee)
Fixed I think?
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Updated • 11 years ago
Product: mozilla.org → Release Engineering
Updated • 6 years ago
Component: General Automation → General