Closed Bug 1088032 Opened 10 years ago Closed 9 years ago

Test slaves sometimes fail to start buildbot after a reboot

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: coop, Unassigned)

Details

(Whiteboard: [capacity][win8][win7][10.8][xp])

Test slaves will sometimes fail to start buildbot after a reboot. A quick look at system.log on a mtnlion machine indicates that runslave.py was called, but buildbot never ends up running. The machine is eventually resurrected after 6 hours, when slaverebooter finds the idle slave and reboots it via slaveapi.
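For reference, that recovery path is essentially a poll-and-reboot check. A minimal sketch of the idea (this is not the real slaverebooter code; the slaveapi base URL and the /actions/reboot path are assumptions for illustration only):

import time
import requests

SLAVEAPI = "https://slaveapi.example.com/slaves"  # hypothetical base URL, not the real service address
IDLE_THRESHOLD = 6 * 3600                         # the ~6 hour idle window mentioned above

def reboot_if_idle(slave, last_job_timestamp):
    # Reboot a slave via slaveapi if it has been idle past the threshold.
    # The /actions/reboot path is assumed; check the real slaveapi before using it.
    idle_for = time.time() - last_job_timestamp
    if idle_for > IDLE_THRESHOLD:
        requests.post("%s/%s/actions/reboot" % (SLAVEAPI, slave))
        return True
    return False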

We see this issue most acutely on the mtnlion slaves, where we get backed up very easily and pay more attention to capacity, but we also see it happen on the Windows test platforms. It might be the same root cause, or it might not.

We tend to see this behavior when our systems are under heavy load or during outages. That may be because we're simply paying more attention to capacity at those times and noticing when slaves are unnecessarily unavailable, or it could genuinely be load-related (e.g. the slave being unable to connect to slavealloc).

There is some investigation to do here. We may need to instrument runslave.py or the environment it runs in to get better diagnostic data.
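One low-effort form of instrumentation would be to wrap the buildslave startup with timestamped logging, so we can tell whether the start command was actually issued and what it returned. A rough sketch, assuming buildbot is started via "buildslave start <basedir>" and using a hypothetical log path (this is not the current runslave.py code):

import logging
import subprocess

# Hypothetical log location; adjust per platform.
logging.basicConfig(filename="/var/log/runslave-debug.log",
                    format="%(asctime)s %(levelname)s %(message)s",
                    level=logging.INFO)

def start_buildslave(basedir):
    # Start the buildslave and record exactly what happened.
    cmd = ["buildslave", "start", basedir]
    logging.info("starting: %s", " ".join(cmd))
    try:
        proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                                stderr=subprocess.STDOUT)
        out, _ = proc.communicate()
        logging.info("exit code %d, output: %s", proc.returncode, out)
        return proc.returncode
    except OSError as e:
        logging.error("failed to launch buildslave: %s", e)
        return -1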

Please add affected platforms to the status whiteboard. I'm unsure whether this also affects build machines, but we don't tend to be constrained there as much, so we may have overlooked the issue there.
Status: NEW → ASSIGNED
A Pivotal Tracker story has been created for this Bug: https://www.pivotaltracker.com/story/show/81643576
Found a few xp slaves in this state today: no work taken in 5-9 hours, and apparently not rebooting on their own.

* t-xp32-ix-106 9:22:07
* t-xp32-ix-060 5:55:03

Both were still accessible via ssh and VNC.

twistd.log indicated that the slave had initiated a reboot via count_and_reboot.py. Logging in via VNC, I could see that the cmd windows had failed to shut down cleanly. On both hung slaves, the cmd window was stuck at "Terminate batch job (Y/N)?". Presumably the shutdown was still waiting on this window to close.
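If that stuck prompt is what blocks the reboot, one possible mitigation (not something count_and_reboot.py does today, as far as I know) would be to force the shutdown so Windows doesn't wait for the cmd window to close. A minimal sketch using the standard Windows shutdown.exe flags:

import subprocess

def force_reboot():
    # Windows only: /r reboot, /f force running applications (including
    # the stuck cmd window) to close without prompting, /t 0 no delay.
    subprocess.check_call(["shutdown", "/r", "/f", "/t", "0"])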

Not sure if this failure mode is also happening on the other Windows slaves, but it gives us somewhere to start looking.

I've left both slaves in this state and have disabled them in slavealloc. I haven't dug any further at this point.
Whiteboard: [capacity][win8][win7][10.8] → [capacity][win8][win7][10.8][xp]
A Pivotal Tracker story has been created for this Bug: https://www.pivotaltracker.com/story/show/82064758
If you wanted to give us back those two WinXP slaves, you can get two (or five, or seven) more any morning you want until slaverebooter is running again.
Assignee: nobody → coop
(In reply to Phil Ringnalda (:philor) from comment #4)
> If you wanted to give us back those two WinXP slaves, you can get two (or
> five, or seven) more any morning you want until slaverebooter is running
> again.

They're back in the pool.
I haven't noticed this in a while.
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → WORKSFORME
Don't you get up before I do? That should give you a couple of hours pretty much every single day to notice it, before I roll out of bed and reboot Windows test slaves as the second thing I do.
(In reply to Phil Ringnalda (:philor) from comment #7)
> Don't you get up before I do? That should give you a couple of hours pretty
> much every single day to notice it, before I roll out of bed and reboot
> Windows test slaves as the second thing I do.

If you do it without updating the bug with at least some indication of general frequency, it is invisible to me. You are effectively acting as the human analog of slaverebooter, perhaps even more effectively.
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
Assignee: coop → nobody
Yep, that was the deal I made to get people to stop turning slaverebooter back on: that I would be a far more effective and less destructive version of it, until someone could teach it not to graceful slaves that were running a job after having been idle for four hours, and to actually detect that it had gracefulled them, so it didn't just take slaves out of the pool for 12-15 hours.

Guess the context around comment 4 wasn't as clear as I thought it was: I meant literally exactly that. Every morning (except weekends, when I'm away from the computer and we get broken) I wake up and reboot two or five or seven mostly-Windows test slaves that went idle after a job-and-reboot and thus, I presume, hit this.
I've been digging into this over the past few days, and have yet to find a smoking gun. However, I did find a pervasive anomaly with runslave.py and basedirs that I've filed as bug 1143018.
I think it's unlikely we're going to find the cause before the switch to TaskCluster. If we get the fix for bug 1143018 deployed to Windows machines and delete any lingering, unused dirs and buildbot.tac files, that should get us most of the way here.
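For the cleanup half of that, a dry-run sketch of finding lingering basedirs: walk the slave root and report any directory holding a buildbot.tac that isn't the expected basedir. The C:\slave root and the expected basedir value are assumptions for illustration; verify them against the actual runslave.py configuration before deleting anything.

import os

SLAVE_ROOT = r"C:\slave"             # assumed root; varies per platform
EXPECTED_BASEDIR = r"C:\slave\test"  # assumed active basedir

def find_stale_basedirs(root=SLAVE_ROOT, keep=EXPECTED_BASEDIR):
    # Yield directories containing buildbot.tac that aren't the live basedir.
    for dirpath, dirnames, filenames in os.walk(root):
        if ("buildbot.tac" in filenames
                and os.path.normcase(dirpath) != os.path.normcase(keep)):
            yield dirpath

if __name__ == "__main__":
    for stale in find_stale_basedirs():
        print("stale basedir: %s" % stale)  # dry run: report only, don't delete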
Status: REOPENED → RESOLVED
Closed: 9 years ago
Resolution: --- → INCOMPLETE
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard