Closed Bug 1088032 Opened 10 years ago Closed 9 years ago

Test slaves sometimes fail to start buildbot after a reboot

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: coop, Unassigned)

Details

(Whiteboard: [capacity][win8][win7][10.8][xp])

Test slaves will sometimes fail to start buildbot after a reboot. A quick look at system.log on a mtnlion machine indicates that runslave.py was called, but buildbot never ends up running. The machine is eventually resurrected after 6 hours, when slaverebooter finds the idle slave and reboots it via slaveapi.
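For reference, that recovery path is essentially a poll-and-reboot check. A minimal sketch of the idea (this is not the real slaverebooter code; the slaveapi base URL and the /actions/reboot path are assumptions for illustration only):

import time
import requests

SLAVEAPI = "https://slaveapi.example.com/slaves"  # hypothetical base URL, not the real service address
IDLE_THRESHOLD = 6 * 3600                         # the ~6 hour idle window mentioned above

def reboot_if_idle(slave, last_job_timestamp):
    # Reboot a slave via slaveapi if it has been idle past the threshold.
    # The /actions/reboot path is assumed; check the real slaveapi before using it.
    idle_for = time.time() - last_job_timestamp
    if idle_for > IDLE_THRESHOLD:
        requests.post("%s/%s/actions/reboot" % (SLAVEAPI, slave))
        return True
    return False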

We see this issue most acutely on the mtnlion slaves, where we get backed up very easily and pay more attention to capacity, but we also see it happen on the Windows test platforms. It might be the same root cause, or it might not.

We tend to see this behavior when our systems are under heavy load or during outages. That may be because we're simply paying more attention to capacity at those times and noticing when slaves are unnecessarily unavailable, or it could genuinely be load-related (e.g. the slave being unable to connect to slavealloc).

There is some investigation to do here. We may need to instrument runslave.py or the environment it runs in to get better diagnostic data.
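One low-effort form of instrumentation would be to wrap the buildslave startup with timestamped logging, so we can tell whether the start command was actually issued and what it returned. A rough sketch, assuming buildbot is started via "buildslave start <basedir>" and using a hypothetical log path (this is not the current runslave.py code):

import logging
import subprocess

# Hypothetical log location; adjust per platform.
logging.basicConfig(filename="/var/log/runslave-debug.log",
                    format="%(asctime)s %(levelname)s %(message)s",
                    level=logging.INFO)

def start_buildslave(basedir):
    # Start the buildslave and record exactly what happened.
    cmd = ["buildslave", "start", basedir]
    logging.info("starting: %s", " ".join(cmd))
    try:
        proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                                stderr=subprocess.STDOUT)
        out, _ = proc.communicate()
        logging.info("exit code %d, output: %s", proc.returncode, out)
        return proc.returncode
    except OSError as e:
        logging.error("failed to launch buildslave: %s", e)
        return -1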

Please add affected platforms to the status whiteboard. I'm unsure whether this also affects build machines, but we don't tend to be constrained there as much, so we may have overlooked the issue there.
Status: NEW → ASSIGNED
A Pivotal Tracker story has been created for this Bug: https://www.pivotaltracker.com/story/show/81643576
Found a few xp slaves in this state today: no work taken in 5-9 hours, and apparently not rebooting on their own.

* t-xp32-ix-106 9:22:07
* t-xp32-ix-060 5:55:03

Both were still accessible via ssh and VNC.

twistd.log indicated that the slave had initiated a reboot via count_and_reboot.py. Logging in via VNC, I could see that the cmd windows had failed to shut down cleanly. On both hung slaves, the cmd window was stuck at "Terminate batch job (Y/N)?". Presumably the shutdown was still waiting on this window to close.
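If that stuck prompt is what blocks the reboot, one possible mitigation (not something count_and_reboot.py does today, as far as I know) would be to force the shutdown so Windows doesn't wait for the cmd window to close. A minimal sketch using the standard Windows shutdown.exe flags:

import subprocess

def force_reboot():
    # Windows only: /r reboot, /f force running applications (including
    # the stuck cmd window) to close without prompting, /t 0 no delay.
    subprocess.check_call(["shutdown", "/r", "/f", "/t", "0"])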

Not sure if this failure mode is also happening on the other Windows slaves, but it gives us somewhere to start looking.

I've left both slaves in this state and have disabled them in slavealloc. I haven't dug any further at this point.
Whiteboard: [capacity][win8][win7][10.8] → [capacity][win8][win7][10.8][xp]
A Pivotal Tracker story has been created for this Bug: https://www.pivotaltracker.com/story/show/82064758
If you wanted to give us back those two WinXP slaves, you can get two (or five, or seven) more any morning you want until slaverebooter is running again.
Assignee: nobody → coop
(In reply to Phil Ringnalda (:philor) from comment #4)
> If you wanted to give us back those two WinXP slaves, you can get two (or
> five, or seven) more any morning you want until slaverebooter is running
> again.

They're back in the pool.
I haven't noticed this in a while.
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → WORKSFORME
Don't you get up before I do? That should give you a couple of hours pretty much every single day to notice it, before I roll out of bed and reboot Windows test slaves as the second thing I do.
(In reply to Phil Ringnalda (:philor) from comment #7)
> Don't you get up before I do? That should give you a couple of hours pretty
> much every single day to notice it, before I roll out of bed and reboot
> Windows test slaves as the second thing I do.

If you do it without updating the bug with at least some indication of general frequency, it is invisible to me. You are effectively acting as the human analog of slaverebooter, perhaps even more effectively.
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
Assignee: coop → nobody
Yep, that was the deal I made to get people to stop turning slaverebooter back on: that I would be a far more effective and less destructive version of it, until someone could teach it not to graceful slaves that were running a job after having been idle for four hours, and to actually detect that it had gracefulled them, so it didn't just take slaves out of the pool for 12-15 hours.

Guess the context around comment 4 wasn't as clear as I thought it was: I meant literally exactly that. Every morning (except weekends, when I'm away from the computer and we get broken) I wake up and reboot two or five or seven mostly-Windows test slaves that went idle after a job-and-reboot and thus, I presume, hit this.
I've been digging into this over the past few days, and have yet to find a smoking gun. However, I did find a pervasive anomaly with runslave.py and basedirs that I've filed as bug 1143018.
I think it's unlikely we're going to find the cause before the switch to TaskCluster. If we get the fix for bug 1143018 deployed to Windows machines and delete any lingering, unused dirs and buildbot.tac files, that should get us most of the way here.
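For the cleanup half of that, a dry-run sketch of finding lingering basedirs: walk the slave root and report any directory holding a buildbot.tac that isn't the expected basedir. The C:\slave root and the expected basedir value are assumptions for illustration; verify them against the actual runslave.py configuration before deleting anything.

import os

SLAVE_ROOT = r"C:\slave"             # assumed root; varies per platform
EXPECTED_BASEDIR = r"C:\slave\test"  # assumed active basedir

def find_stale_basedirs(root=SLAVE_ROOT, keep=EXPECTED_BASEDIR):
    # Yield directories containing buildbot.tac that aren't the live basedir.
    for dirpath, dirnames, filenames in os.walk(root):
        if ("buildbot.tac" in filenames
                and os.path.normcase(dirpath) != os.path.normcase(keep)):
            yield dirpath

if __name__ == "__main__":
    for stale in find_stale_basedirs():
        print("stale basedir: %s" % stale)  # dry run: report only, don't delete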
Status: REOPENED → RESOLVED
Closed: 9 years ago
Resolution: --- → INCOMPLETE
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard