Bug 833334 · Opened 11 years ago · Closed 11 years ago

aws slaves often cause hangs during reconfig

Categories: Release Engineering :: General, defect
Priority: Not set
Severity: normal
Tracking: Not tracked
Status: RESOLVED FIXED
Reporter: bhearsum
Assignee: catlee
Attachments: 1 file

One or more of the slaves often get into the state described in https://wiki.mozilla.org/ReleaseEngineering/How_To/Unstick_a_Stuck_Slave_From_A_Master

I think we talked about this a bit last week and thought that it might be happening because the slaves reboot before buildbot has fully shut down.
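To make the suspected mechanism concrete (a hedged illustration, not anything we actually deploy): if the OS reboot kills the slave before its TCP connection to the master is closed cleanly, no FIN ever reaches the master, and the master's end of the socket sits half-open until TCP itself gives up. Something like TCP keepalive would at least bound how long that takes; the function name and timing values here are made up:

import socket

def enable_tcp_keepalive(sock, idle=60, interval=10, count=3):
    # Have the kernel probe an idle connection so a peer that vanished
    # without sending a FIN (e.g. a slave whose OS rebooted out from
    # under buildbot) is detected after roughly idle + interval * count
    # seconds, instead of lingering half-open for many minutes.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # TCP_KEEPIDLE/KEEPINTVL/KEEPCNT are Linux-specific socket options.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)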
Just ran into this again on bm49 with bld-linux64-ec2-340. Here are some snippets from the log that I think are relevant:

# start of reconfig
2013-01-22 13:39:25-0800 [-] loading configuration from /builds/buildbot/build1/master/master.cfg

# new builder steals buildslave reference from old builder
2013-01-22 13:39:59-0800 [-]  stealing buildslave <SlaveBuilder builder='Linux mozilla-inbound build' slave='bld-linux64-ec2-340'>

# The slave is running the 'reboot' step. After 60s we give up and try to disconnect the slave.
# Note, there's no "BuildSlave.detached(bld-linux64-ec2-340)" here like we'd expect
2013-01-22 13:40:21-0800 [-] Forcibly disconnecting bld-linux64-ec2-340
2013-01-22 13:40:21-0800 [-] disconnecting old slave bld-linux64-ec2-340 now
2013-01-22 13:40:21-0800 [-] waiting for slave to finish disconnecting

# Slave has rebooted and tries to reconnect
2013-01-22 13:42:20-0800 [Broker,29446,10.134.52.85] duplicate slave bld-linux64-ec2-340; rejecting new slave and pinging old
2013-01-22 13:42:20-0800 [Broker,29446,10.134.52.85] old slave was connected from IPv4Address(TCP, '10.134.52.85', 42362)
2013-01-22 13:42:20-0800 [Broker,29446,10.134.52.85] new slave is from IPv4Address(TCP, '10.134.52.85', 41372)
...

# Old slave eventually disconnects
2013-01-22 13:56:48-0800 [Broker,29395,10.134.52.85] BuildSlave.sendBuilderList (<BuildSlave 'bld-linux64-ec2-340'>) failed
2013-01-22 13:56:48-0800 [Broker,29395,10.134.52.85] Unhandled Error
        Traceback (most recent call last):
        Failure: twisted.spread.pb.PBConnectionLost: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.
        ]

2013-01-22 13:56:48-0800 [Broker,29395,10.134.52.85] BuildSlave.detached(bld-linux64-ec2-340)
2013-01-22 13:56:48-0800 [Broker,29395,10.134.52.85] WEIRD: Builder.detached(<BuildSlave 'bld-linux64-ec2-340'>) (bld-linux64-ec2-340) not in attaching_slaves([]) or slaves([<SlaveBuilder builder='release-comm-beta-linux64_standalone_repack' slave='bld-linux64-ec2-301'>, ....

# reconfig done
2013-01-22 13:57:05-0800 [-] configuration update complete

Similar for bld-linux64-ec2-364:

# start of reconfig
2013-01-22 13:39:25-0800 [-] loading configuration from /builds/buildbot/build1/master/master.cfg

# Stole buildslave
2013-01-22 13:40:01-0800 [-]  stealing buildslave <SlaveBuilder builder='Android no-ionmonkey mozilla-inbound build' slave='bld-linux64-ec2-364'>

# setBuilders._add
2013-01-22 13:40:14-0800 [-] setBuilders._add: [<buildbot.util.loop.DelegateLoop instance at 0x1de49a28>, ..., <BuildSlave 'bld-linux64-ec2-340'>, <BuildSlave 'bld-linux64-ec2-364'>,

# Force disconnect
2013-01-22 13:40:20-0800 [-] Forcibly disconnecting bld-linux64-ec2-364
2013-01-22 13:40:20-0800 [-] disconnecting old slave bld-linux64-ec2-364 now
2013-01-22 13:40:20-0800 [-] waiting for slave to finish disconnecting

# Tries to reconnect
2013-01-22 13:42:17-0800 [Broker,29445,10.134.52.57] duplicate slave bld-linux64-ec2-364; rejecting new slave and pinging old
2013-01-22 13:42:17-0800 [Broker,29445,10.134.52.57] old slave was connected from IPv4Address(TCP, '10.134.52.57', 41190)
2013-01-22 13:42:17-0800 [Broker,29445,10.134.52.57] new slave is from IPv4Address(TCP, '10.134.52.57', 34221)

# Finally disconnects
2013-01-22 13:56:55-0800 [Broker,29388,10.134.52.57] BuildSlave.sendBuilderList (<BuildSlave 'bld-linux64-ec2-364'>) failed
2013-01-22 13:56:55-0800 [Broker,29388,10.134.52.57] Unhandled Error
        Traceback (most recent call last):
        Failure: twisted.spread.pb.PBConnectionLost: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.
        ]

2013-01-22 13:56:55-0800 [Broker,29388,10.134.52.57] removing IStatusReceiver <WebStatus on port tcp:8001 at 0x3159a9e0>
2013-01-22 13:56:55-0800 [Broker,29388,10.134.52.57] removing IStatusReceiver <buildbotcustom.status.pulse.PulseStatus instance at 0x2aaabf611ef0>
2013-01-22 13:56:55-0800 [Broker,29388,10.134.52.57] Pulse <0x2aaabf611ef0>: shutting down
2013-01-22 13:56:55-0800 [Broker,29388,10.134.52.57] Pulse <0x2aaabf611ef0>: Processed 4 events (0 heartbeats) in 0.00 seconds
2013-01-22 13:56:55-0800 [Broker,29388,10.134.52.57] BuildSlave.detached(bld-linux64-ec2-364)
2013-01-22 13:56:55-0800 [Broker,29388,10.134.52.57] WEIRD: Builder.detached(<BuildSlave 'bld-linux64-ec2-364'>) (bld-linux64-ec2-364) not in attaching_slaves([]) or slaves([...

# reconfig done
2013-01-22 13:57:05-0800 [-] configuration update complete
Assignee: nobody → catlee
Blocks: 835360
I really suspect DisconnectStep for this. Every occurrence of this I've seen has been while the slave was attempting to reboot.
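For reference, here's the shape of step I mean (a hedged sketch against buildbot 0.8's API, not the actual buildbotcustom DisconnectStep; the class name and reboot command are made up). The step launches a reboot and counts the resulting connection loss as success, so nothing guarantees the slave's buildbot process has shut down before the OS cuts the connection:

from buildbot.steps.shell import ShellCommand
from buildbot.status.builder import SUCCESS

class RebootAndDisconnect(ShellCommand):
    # Hypothetical stand-in for a DisconnectStep-style step: it fires
    # off a reboot and treats losing the slave as the expected outcome.
    name = 'reboot'

    def __init__(self, **kwargs):
        kwargs.setdefault('command', ['sudo', 'reboot'])
        ShellCommand.__init__(self, **kwargs)

    def interrupt(self, reason):
        # The connection dropping is what a reboot is supposed to do,
        # so an interrupted step is marked finished rather than failed.
        # This is exactly the race: the master may still be mid-reconfig
        # and holding a half-open socket when the machine goes down.
        self.finished(SUCCESS)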
We had a tree closure today when test masters with EC2 machines connected to them were forced into a graceful shutdown, which caused a big test backlog. We really need to fix this :(.
I had a bunch of Linux test slaves (3-5 per master, across 3 masters) that were stuck in download_props, if you believe the individual job status.
See Also: → 851622
Summary: aws masters often hang during reconfig → aws slaves often cause hangs during reconfig
I haven't been able to reproduce the problem...but this doesn't seem to hurt. It attempts to forcibly close the socket for the old slave connection.
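The attachment itself isn't inlined here, but the approach is roughly this (a hedged sketch: I'm assuming the stale perspective-broker connection is reachable as slave.slave.broker, and that we're on a Twisted new enough to have abortConnection()):

def force_close_old_connection(old_slave):
    # Tear the stale TCP socket down immediately instead of waiting for
    # the rebooted peer to time out.  old_slave.slave is assumed to be
    # the pb.RemoteReference for the old, dead connection.
    transport = old_slave.slave.broker.transport
    if hasattr(transport, 'abortConnection'):
        # Twisted >= 11.1: close without waiting for pending writes or
        # for a FIN that the rebooted slave will never send.
        transport.abortConnection()
    else:
        # loseConnection() alone waits for a clean close, which is what
        # hung the reconfig for ~16 minutes in the logs above.
        transport.loseConnection()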
Attachment #726270 - Flags: review?(rail)
Attachment #726270 - Flags: review?(rail) → review+
Attachment #726270 - Flags: checked-in+
This is in production now.
Fixed I think?
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering
Component: General Automation → General