Closed Bug 663399 Opened 14 years ago Closed 14 years ago

look for patterns in idleizer behavior in dev/pp

Categories

(Release Engineering :: General, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dustin, Assigned: dustin)

Branching from bug 627126: This *is* idleizer:
    2011-06-09 20:44:28-0700 [-] Log opened.
    2011-06-09 20:44:28-0700 [-] twistd 10.2.0 (/tools/buildbot-0.8.4-pre-moz1/bin/python 2.6.5) starting up.
    2011-06-09 20:44:28-0700 [-] reactor class: twisted.internet.selectreactor.SelectReactor.
    2011-06-09 20:44:28-0700 [-] Starting factory <buildslave.bot.BotFactory instance at 0x9eb240c>
    2011-06-09 20:44:28-0700 [-] Connecting to staging-master.build.mozilla.org:9052
    2011-06-09 20:44:28-0700 [Uninitialized] Connection to staging-master.build.mozilla.org:9052 failed: Connection Refused
    2011-06-09 20:44:28-0700 [Uninitialized] <twisted.internet.tcp.Connector instance at 0x9eb2eec> will retry in 2 seconds
    2011-06-09 20:44:28-0700 [Uninitialized] Stopping factory <buildslave.bot.BotFactory instance at 0x9eb240c>
    2011-06-09 20:44:31-0700 [-] Starting factory <buildslave.bot.BotFactory instance at 0x9eb240c>
    2011-06-09 20:44:31-0700 [-] Connecting to staging-master.build.mozilla.org:9052
    ..
    2011-06-09 21:39:48-0700 [-] Connecting to staging-master.build.mozilla.org:9052
    2011-06-09 21:39:48-0700 [Uninitialized] Connection to staging-master.build.mozilla.org:9052 failed: Connection Refused
    2011-06-09 21:39:48-0700 [Uninitialized] <twisted.internet.tcp.Connector instance at 0x9eb2eec> will retry in 281 seconds
    2011-06-09 21:39:48-0700 [Uninitialized] Stopping factory <buildslave.bot.BotFactory instance at 0x9eb240c>
    2011-06-09 21:41:48-0700 [-] connection attempt timed out (is the port number correct?)
    2011-06-09 21:44:28-0700 [-] I feel very idle and was thinking of rebooting as soon as the buildmaster says it's OK
    2011-06-09 21:44:28-0700 [-] No active connection, shutting down NOW
    2011-06-09 21:44:28-0700 [-] Main loop terminated.
    2011-06-09 21:44:28-0700 [-] Server Shut Down.
But it didn't reboot at that point. I see aki on the host now, and the host is locked to aki's master, so I assume that's the reason for the non-connection. The question is, why didn't it reboot?
Similar on mv-moz2-darwin10-slave01:
    2011-06-14 17:08:08-0700 [Uninitialized] Stopping factory <buildslave.bot.BotFactory instance at 0x86d718c>
    2011-06-14 17:10:08-0700 [-] connection attempt timed out (is the port number correct?)
    2011-06-14 17:12:09-0700 [-] I feel very idle and was thinking of rebooting as soon as the buildmaster says it's OK
    2011-06-14 17:12:09-0700 [-] No active connection, shutting down NOW
    2011-06-14 17:12:09-0700 [-] Main loop terminated.
    2011-06-14 17:12:09-0700 [-] Server Shut Down.
(I should say, it didn't reboot at 17:12, judging by the uptime. Rebooted manually.)
And that was mv-moz2-linux-slave01. I really should read my bug posts before clicking "Save Changes".
Same deal with moz2-darwin10-slave03:
    2011-06-14 17:25:26-0700 [Uninitialized] Connection to dev-master01.build.scl1.mozilla.com:9018 failed: Connection Refused
    2011-06-14 17:25:26-0700 [Uninitialized] <twisted.internet.tcp.Connector instance at 0x1011111b8> will retry in 257 seconds
    2011-06-14 17:25:26-0700 [Uninitialized] Stopping factory <buildslave.bot.BotFactory instance at 0x1013854d0>
    2011-06-14 17:27:26-0700 [-] connection attempt timed out (is the port number correct?)
    2011-06-14 17:27:47-0700 [-] I feel very idle and was thinking of rebooting as soon as the buildmaster says it's OK
    2011-06-14 17:27:47-0700 [-] No active connection, shutting down NOW
    2011-06-14 17:27:47-0700 [-] Main loop terminated.
    2011-06-14 17:27:48-0700 [-] Server Shut Down.
and moz2-darwin10-slave04:
    2011-06-14 17:26:02-0700 [-] Connecting to dev-master01.build.scl1.mozilla.com:9018
    2011-06-14 17:26:02-0700 [Uninitialized] Connection to dev-master01.build.scl1.mozilla.com:9018 failed: Connection Refused
    2011-06-14 17:26:02-0700 [Uninitialized] <twisted.internet.tcp.Connector instance at 0x101590dd0> will retry in 273 seconds
    2011-06-14 17:26:02-0700 [Uninitialized] Stopping factory <buildslave.bot.BotFactory instance at 0x1018795a8>
    2011-06-14 17:27:48-0700 [-] I feel very idle and was thinking of rebooting as soon as the buildmaster says it's OK
    2011-06-14 17:27:48-0700 [-] No active connection, shutting down NOW
    2011-06-14 17:27:48-0700 [-] Main loop terminated.
    2011-06-14 17:27:49-0700 [-] Server Shut Down.
and linux64-ix-slave01:
    2011-06-14 10:22:33-0700 [Uninitialized] <twisted.internet.tcp.Connector instance at 0x374c290> will retry in 229 seconds
    2011-06-14 10:22:33-0700 [Uninitialized] Stopping factory <buildslave.bot.BotFactory instance at 0x3a803f8>
    2011-06-14 10:24:33-0700 [-] connection attempt timed out (is the port number correct?)
    2011-06-14 10:24:45-0700 [-] I feel very idle and was thinking of rebooting as soon as the buildmaster says it's OK
    2011-06-14 10:24:45-0700 [-] No active connection, shutting down NOW
    2011-06-14 10:24:45-0700 [-] Main loop terminated.
    2011-06-14 10:24:45-0700 [-] Server Shut Down.
Common theme: these were all reboots after connection failures, rather than after idleness.
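The pattern in these excerpts (an idleizer trigger immediately followed by "No active connection, shutting down NOW") can be checked mechanically across many slave logs. The following is only a rough sketch, not a tool used in this bug: the function name, the event labels, and the regular expression are assumptions for illustration, and the message prefixes are copied from the log lines quoted in this bug.

```python
import re

# twistd log line: "<timestamp> [<source>] <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}[+-]\d{4}) "
    r"\[(?P<src>[^\]]*)\] (?P<msg>.*)$"
)

# Message prefixes taken from the log excerpts quoted in this bug.
IDLE_MSG = "I feel very idle"
NO_CONN_MSG = "No active connection, shutting down NOW"
GRACEFUL_MSG = "Telling the master we want to shutdown"


def classify_idleizer_events(log_text):
    """Return [(timestamp, kind)] for each idleizer trigger in log_text.

    kind is "no-connection" when the trigger fired with no master
    connection, "graceful-request" when the slave asked the master for
    a graceful shutdown, and "unknown" when neither follow-up appears.
    (These labels are hypothetical, chosen for this sketch.)
    """
    events = []
    pending = None  # timestamp of a trigger still awaiting its follow-up line
    for raw in log_text.splitlines():
        match = LINE_RE.match(raw.strip())
        if not match:
            continue  # skip elisions ("..") and anything that is not a log line
        ts, msg = match.group("ts"), match.group("msg")
        if msg.startswith(IDLE_MSG):
            if pending is not None:
                events.append((pending, "unknown"))
            pending = ts
        elif pending is not None:
            if msg.startswith(NO_CONN_MSG):
                events.append((pending, "no-connection"))
                pending = None
            elif msg.startswith(GRACEFUL_MSG):
                events.append((pending, "graceful-request"))
                pending = None
    if pending is not None:
        events.append((pending, "unknown"))
    return events
```

Feeding it the contents of a slave's twistd.log and counting the "no-connection" entries would show how often the idleizer fired because of connection failures rather than genuine idleness.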
Assignee: nobody → dustin
There's trouble in paradise:
    2011-06-16 18:22:23-0700 [-] I feel very idle and was thinking of rebooting as soon as the buildmaster says it's OK
    2011-06-16 18:22:23-0700 [-] Telling the master we want to shutdown after any running builds are finished
    2011-06-16 18:22:23-0700 [Broker,client] Master does not support slave initiated shutdown. Upgrade master to 0.8.3 or later to use this feature.
I'm thinking that, until we upgrade the masters, we should add a fallback here that just calls loseConnection() and reactor.stop() and hopes the master doesn't manage to start a job at that very instant. I think that's a *fairly* (99.9%) safe assumption. Catlee, does that seem reasonable?
I just graceful'd a slave in this state (moz2-darwin9-slave68) and it rebooted once the master disconnected it. So it may be enough to just call loseConnection() after finding that the master is not modern enough.
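The fallback discussed in the two comments above could look roughly like the following. This is a simplified, synchronous sketch and not the actual idleizer or buildslave code: MasterTooOldError, shutdown_when_idle, request_graceful_shutdown, and lose_connection are hypothetical names standing in for the real objects, and in real Twisted code the graceful request would be asynchronous, so the fallback would most likely hang off an errback instead of an except clause.

```python
class MasterTooOldError(Exception):
    """Hypothetical stand-in for the failure logged as
    'Master does not support slave initiated shutdown'."""


def shutdown_when_idle(connection, stop_reactor=None, log=print):
    """Ask for a graceful shutdown, falling back to dropping the connection.

    connection is assumed (for this sketch) to expose
    request_graceful_shutdown() and lose_connection(). stop_reactor is an
    optional callable playing the role of reactor.stop(); pass None to
    only drop the connection, as the second comment suggests may suffice.
    """
    try:
        connection.request_graceful_shutdown()
        return "graceful"
    except MasterTooOldError:
        # Old master: drop the link ourselves and hope that no job is
        # handed out in the small window before the connection closes.
        log("Master too old for graceful shutdown; dropping connection")
        connection.lose_connection()
        if stop_reactor is not None:
            stop_reactor()
        return "fallback"
```

Keeping stop_reactor optional lets the same helper cover both proposals: loseConnection() plus reactor.stop(), or loseConnection() alone.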
Opened bug 665254 to deal with the new-master problem.
I think that all of the behaviors I'm seeing are explained, so this bug is done.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering