Closed Bug 665254 Opened 13 years ago Closed 13 years ago

idleizer: call loseConnection when master does not support slave-initiated graceful shutdown

Categories

(Release Engineering :: General, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dustin, Assigned: dustin)

Attachments

(1 file)

We're seeing this in the slave logs:

2011-06-16 18:22:23-0700 [Broker,client] Master does not support slave initiated shutdown.  Upgrade master to 0.8.3 or later to use this feature.

because the masters are still running Buildbot 0.8.2.

In this case, when the slave *is* eventually gracefully shut down, it's already primed to reboot.  I suspect that adding a simple loseConnection() call at that point would fix this without any significant risk of burning a build that happened to start just as the slave was rebooting itself.
From bug 665765: gracefulShutdown's behavior of shutting down immediately when the slave is not connected is also problematic; I saw this on three hosts over the weekend.  It turns out that the fix for that is easiest to make in the same patch as the one for this bug, so I'm merging them here.
With the fix, a *connected* idle slave looks like:

2011-06-20 16:42:34-0700 [-] I feel very idle and was thinking of rebooting as soon as the buildmaster says it's OK
2011-06-20 16:42:34-0700 [-] Telling the master we want to shutdown after any running builds are finished
2011-06-20 16:42:34-0700 [Broker,client] Master does not support slave initiated shutdown.  Upgrade master to 0.8.3 or later to use this feature.
2011-06-20 16:42:34-0700 [Broker,client] rebooting NOW, since the master won't talk to us
2011-06-20 16:42:34-0700 [Broker,client] Invoking platform-specific reboot command
2011-06-20 16:42:34-0700 [Broker,client] lost remote
2011-06-20 16:42:34-0700 [Broker,client] lost remote
2011-06-20 16:42:34-0700 [Broker,client] lost remote
(and rebooted)

and a disconnected slave looks like:

2011-06-20 16:48:18-0700 [-] Connecting to preproduction-master.build.sjc1.mozilla.comm:9010
2011-06-20 16:48:18-0700 [-] Connection to preproduction-master.build.sjc1.mozilla.comm:9010 failed: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.DNSLookupError'>: DNS lookup failed: address 'preproduction-master.build.sjc1.mozilla.comm' not found: [Errno 8] nodename nor servname provided, or not known.
        ]
2011-06-20 16:48:18-0700 [-] <twisted.internet.tcp.Connector instance at 0x1012721b8> will retry in 15 seconds
2011-06-20 16:48:18-0700 [-] Stopping factory <buildslave.bot.BotFactory instance at 0x101879518>
2011-06-20 16:48:21-0700 [-] I feel very idle and was thinking of rebooting as soon as the buildmaster says it's OK
2011-06-20 16:48:21-0700 [-] No active connection, rebooting NOW
2011-06-20 16:48:21-0700 [-] Invoking platform-specific reboot command
2011-06-20 16:48:23-0700 [-] Main loop terminated.
2011-06-20 16:48:23-0700 [-] Server Shut Down.
(and rebooted)

(this was tested on moz2-darwin10-slave01.  No builds were burned in the making of this patch.  Does not contain BPA.)
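
For readers without access to the attachment, the two log excerpts above correspond to roughly the following decision logic. This is only a minimal, synchronous sketch and not the committed patch: `idle_reboot`, `MasterTooOld`, `RecordingConnection`, and the callback parameters are hypothetical names, the real idleizer works with Twisted callbacks, and the order of dropping the connection before invoking the reboot is an assumption.

```python
class MasterTooOld(Exception):
    """Hypothetical stand-in for 'master does not support slave-initiated shutdown'."""


class RecordingConnection:
    """Hypothetical stand-in for the slave's connection to the master."""

    def __init__(self):
        self.closed = False

    def lose_connection(self):
        # Stands in for the loseConnection() call discussed in this bug.
        self.closed = True


def idle_reboot(connection, request_graceful_shutdown, reboot, log=print):
    """Decide how an idle slave reboots; returns a label for the path taken."""
    log("I feel very idle and was thinking of rebooting")

    # Disconnected slave: nobody can hand us a build, so reboot right away.
    if connection is None:
        log("No active connection, rebooting NOW")
        reboot()
        return "rebooted-disconnected"

    try:
        # Ask the master to let us shut down once running builds finish.
        request_graceful_shutdown(connection)
    except MasterTooOld:
        # Old master: drop the connection first (assumed order) so it cannot
        # start a build on us, then reboot.
        log("rebooting NOW, since the master won't talk to us")
        connection.lose_connection()
        reboot()
        return "rebooted-unsupported"

    # Supported master: wait for it to say it's OK before rebooting.
    return "waiting-for-master"
```

Passing the reboot command and the shutdown request in as callables keeps the sketch testable without a real master or an actual reboot.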
Attachment #540625 - Flags: review?(catlee)
Attachment #540625 - Flags: review?(catlee) → review+
Committed to the 'slaves' branch.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering