Closed
Bug 663399
Opened 14 years ago
Closed 14 years ago
look for patterns in idleizer behavior in dev/pp
Categories
(Release Engineering :: General, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: dustin, Assigned: dustin)
Branching from bug 627126:
This *is* idleizer:
2011-06-09 20:44:28-0700 [-] Log opened.
2011-06-09 20:44:28-0700 [-] twistd 10.2.0 (/tools/buildbot-0.8.4-pre-moz1/bin/python 2.6.5) starting up.
2011-06-09 20:44:28-0700 [-] reactor class: twisted.internet.selectreactor.SelectReactor.
2011-06-09 20:44:28-0700 [-] Starting factory <buildslave.bot.BotFactory instance at 0x9eb240c>
2011-06-09 20:44:28-0700 [-] Connecting to staging-master.build.mozilla.org:9052
2011-06-09 20:44:28-0700 [Uninitialized] Connection to staging-master.build.mozilla.org:9052 failed: Connection Refused
2011-06-09 20:44:28-0700 [Uninitialized] <twisted.internet.tcp.Connector instance at 0x9eb2eec> will retry in 2 seconds
2011-06-09 20:44:28-0700 [Uninitialized] Stopping factory <buildslave.bot.BotFactory instance at 0x9eb240c>
2011-06-09 20:44:31-0700 [-] Starting factory <buildslave.bot.BotFactory instance at 0x9eb240c>
2011-06-09 20:44:31-0700 [-] Connecting to staging-master.build.mozilla.org:9052
..
2011-06-09 21:39:48-0700 [-] Connecting to staging-master.build.mozilla.org:9052
2011-06-09 21:39:48-0700 [Uninitialized] Connection to staging-master.build.mozilla.org:9052 failed: Connection Refused
2011-06-09 21:39:48-0700 [Uninitialized] <twisted.internet.tcp.Connector instance at 0x9eb2eec> will retry in 281 seconds
2011-06-09 21:39:48-0700 [Uninitialized] Stopping factory <buildslave.bot.BotFactory instance at 0x9eb240c>
2011-06-09 21:41:48-0700 [-] connection attempt timed out (is the port number correct?)
2011-06-09 21:44:28-0700 [-] I feel very idle and was thinking of rebooting as soon as the buildmaster says it's OK
2011-06-09 21:44:28-0700 [-] No active connection, shutting down NOW
2011-06-09 21:44:28-0700 [-] Main loop terminated.
2011-06-09 21:44:28-0700 [-] Server Shut Down.
But the host didn't reboot at that point. I see aki on the host now, and the host is locked to aki's master, so I assume that's why it couldn't connect. The question is: why didn't it reboot?
Comment 1•14 years ago
Similar on mv-moz2-darwin10-slave01:
2011-06-14 17:08:08-0700 [Uninitialized] Stopping factory <buildslave.bot.BotFactory instance at 0x86d718c>
2011-06-14 17:10:08-0700 [-] connection attempt timed out (is the port number correct?)
2011-06-14 17:12:09-0700 [-] I feel very idle and was thinking of rebooting as soon as the buildmaster says it's OK
2011-06-14 17:12:09-0700 [-] No active connection, shutting down NOW
2011-06-14 17:12:09-0700 [-] Main loop terminated.
2011-06-14 17:12:09-0700 [-] Server Shut Down.
Comment 2•14 years ago
(I should say, it didn't reboot at 17:12, judging by the uptime. I rebooted it manually.)
Comment 3•14 years ago
And that was mv-moz2-linux-slave01. I really should read my bug posts before clicking "Save Changes".
Same deal with moz2-darwin10-slave03:
2011-06-14 17:25:26-0700 [Uninitialized] Connection to dev-master01.build.scl1.mozilla.com:9018 failed: Connection Refused
2011-06-14 17:25:26-0700 [Uninitialized] <twisted.internet.tcp.Connector instance at 0x1011111b8> will retry in 257 seconds
2011-06-14 17:25:26-0700 [Uninitialized] Stopping factory <buildslave.bot.BotFactory instance at 0x1013854d0>
2011-06-14 17:27:26-0700 [-] connection attempt timed out (is the port number correct?)
2011-06-14 17:27:47-0700 [-] I feel very idle and was thinking of rebooting as soon as the buildmaster says it's OK
2011-06-14 17:27:47-0700 [-] No active connection, shutting down NOW
2011-06-14 17:27:47-0700 [-] Main loop terminated.
2011-06-14 17:27:48-0700 [-] Server Shut Down.
and moz2-darwin10-slave04:
2011-06-14 17:26:02-0700 [-] Connecting to dev-master01.build.scl1.mozilla.com:9018
2011-06-14 17:26:02-0700 [Uninitialized] Connection to dev-master01.build.scl1.mozilla.com:9018 failed: Connection Refused
2011-06-14 17:26:02-0700 [Uninitialized] <twisted.internet.tcp.Connector instance at 0x101590dd0> will retry in 273 seconds
2011-06-14 17:26:02-0700 [Uninitialized] Stopping factory <buildslave.bot.BotFactory instance at 0x1018795a8>
2011-06-14 17:27:48-0700 [-] I feel very idle and was thinking of rebooting as soon as the buildmaster says it's OK
2011-06-14 17:27:48-0700 [-] No active connection, shutting down NOW
2011-06-14 17:27:48-0700 [-] Main loop terminated.
2011-06-14 17:27:49-0700 [-] Server Shut Down.
and linux64-ix-slave01:
2011-06-14 10:22:33-0700 [Uninitialized] <twisted.internet.tcp.Connector instance at 0x374c290> will retry in 229 seconds
2011-06-14 10:22:33-0700 [Uninitialized] Stopping factory <buildslave.bot.BotFactory instance at 0x3a803f8>
2011-06-14 10:24:33-0700 [-] connection attempt timed out (is the port number correct?)
2011-06-14 10:24:45-0700 [-] I feel very idle and was thinking of rebooting as soon as the buildmaster says it's OK
2011-06-14 10:24:45-0700 [-] No active connection, shutting down NOW
2011-06-14 10:24:45-0700 [-] Main loop terminated.
2011-06-14 10:24:45-0700 [-] Server Shut Down.
Common theme: these were all reboots after connection failures, rather than after idleness.
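A minimal sketch of how that pattern could be checked across slave logs, assuming the twistd line format shown in the excerpts above. The function name `classify_idle_events` and the labels it returns are hypothetical, not part of idleizer or buildslave; the two marker strings are copied from the logs in this bug.

```python
import re

# Matches twistd log lines such as:
#   2011-06-14 17:27:47-0700 [-] I feel very idle and was thinking ...
# and captures only the message part after the "[system]" field.
LINE_RE = re.compile(
    r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}[+-]\d{4} \[[^\]]*\] (.*)$"
)

# Marker strings copied from the log excerpts in this bug.
IDLE_MARKER = "I feel very idle"
NO_CONNECTION = "No active connection, shutting down NOW"


def classify_idle_events(log_text):
    """Return one label per idleizer trigger found in log_text.

    "no-connection" means the idle message was immediately followed by
    the "No active connection" shutdown; anything else is "other".
    The labels are hypothetical names chosen for this sketch.
    """
    messages = []
    for raw in log_text.splitlines():
        match = LINE_RE.match(raw.strip())
        if match:
            messages.append(match.group(1))

    labels = []
    for index, message in enumerate(messages):
        if IDLE_MARKER in message:
            following = messages[index + 1] if index + 1 < len(messages) else ""
            labels.append("no-connection" if NO_CONNECTION in following else "other")
    return labels


# Excerpt from the moz2-darwin10-slave03 log quoted above.
SAMPLE = """\
2011-06-14 17:27:26-0700 [-] connection attempt timed out (is the port number correct?)
2011-06-14 17:27:47-0700 [-] I feel very idle and was thinking of rebooting as soon as the buildmaster says it's OK
2011-06-14 17:27:47-0700 [-] No active connection, shutting down NOW
2011-06-14 17:27:47-0700 [-] Main loop terminated.
2011-06-14 17:27:48-0700 [-] Server Shut Down.
"""

print(classify_idle_events(SAMPLE))
```

Run over each slave's twistd.log, a tally of the labels would show whether the no-connection shutdowns dominate, which is the pattern described above.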
Updated•14 years ago
Assignee: nobody → dustin
Comment 4•14 years ago
There's trouble in paradise:
2011-06-16 18:22:23-0700 [-] I feel very idle and was thinking of rebooting as soon as the buildmaster says it's OK
2011-06-16 18:22:23-0700 [-] Telling the master we want to shutdown after any running builds are finished
2011-06-16 18:22:23-0700 [Broker,client] Master does not support slave initiated shutdown. Upgrade master to 0.8.3 or later to use this feature.
I'm thinking that, until we upgrade the masters, we should add a fallback here that just calls loseConnection() and reactor.stop() and hopes the master doesn't manage to start a job at that very instant. I think that's a *fairly* (99.9%) safe assumption.
Catlee, does that seem reasonable?
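A rough sketch of the fallback logic proposed above, written against plain callables so it runs without Twisted. Everything here is an assumption for illustration: `request_shutdown`, its parameters, and the returned labels are hypothetical names, and how the slave actually learns that the master lacks slave-initiated shutdown is not shown. In a real slave, `lose_connection` and `stop_reactor` would presumably wrap the `loseConnection()` and `reactor.stop()` calls mentioned above.

```python
def request_shutdown(master_supports_graceful, ask_master,
                     lose_connection, stop_reactor):
    """Shut down gracefully if the master allows it, else fall back.

    All four parameters are hypothetical stand-ins:
      master_supports_graceful -- True if the master accepts
                                  slave-initiated shutdown
      ask_master               -- callable that requests graceful shutdown
      lose_connection          -- callable that drops the master connection
      stop_reactor             -- callable that stops the event loop
    Returns a label naming the path taken.
    """
    if master_supports_graceful:
        # Normal path: the master finishes running builds, then lets us go.
        ask_master()
        return "graceful"
    # Fallback for old masters: disconnect, stop, and hope no job
    # starts at that very instant.
    lose_connection()
    stop_reactor()
    return "fallback"


# Usage with recording stubs in place of the real connection and reactor.
calls = []
path = request_shutdown(
    master_supports_graceful=False,
    ask_master=lambda: calls.append("ask_master"),
    lose_connection=lambda: calls.append("lose_connection"),
    stop_reactor=lambda: calls.append("stop_reactor"),
)
print(path, calls)
```

The fallback disconnects before stopping so that the master sees the slave leave before the process exits.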
Comment 5•14 years ago
I just graceful'd a slave in this state (moz2-darwin9-slave68) and it rebooted once the master disconnected it. So it may be enough to just call loseConnection() after finding that the master is not modern enough.
Comment 6•14 years ago
Opened bug 665254 to deal with the new-master problem.
Comment 7•14 years ago
I think that all of the behaviors I'm seeing are explained, so this bug is done.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Updated•12 years ago
Product: mozilla.org → Release Engineering