Bug 629377 (Closed) - Opened 13 years ago, Closed 13 years ago

win32-slaveNN on pm03 are often idle

Categories

(Release Engineering :: General, defect, P4)

All
Windows Server 2003
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dustin, Assigned: arich)

References

Details

(Whiteboard: [slaveduty])

Attachments

(1 file)

Of the 14 distinct hung slaves I've seen this week, 12 have been win32-slaveNN, attached to pm03.  Some have run fuzzer jobs, and one has run a release repack, but for the most part they have done nothing since starting up 4 days before the alert.

Catlee suggests that since tests aren't run on builders anymore, and since builds prefer the much faster IX boxes, this pool of VMs may simply be larger than it needs to be.

pm03 has 32 win32-slaveNN builders attached
pm02 has 1 win32-slaveNN builder attached (win32-slave46)
pm01 has 13 win32-slaveNN builders attached

I'd like to be sure that there isn't some misconfiguration on pm03 causing all of these idle slaves to appear there.  It's reasonable to suppose that pm01 can keep 13 slaves busy, and that if it had 32 slaves it just wouldn't be able to keep them busy enough.  So my thought is to disable 19 of the slaves attached to pm03, but leave them running, and see if the idleness goes away.
Summary: win32-ix-slaveNN on pm03 are often hung → win32-slaveNN on pm03 are often hung
OS: All → Windows Server 2003
Priority: -- → P3
(In reply to comment #0)
> Of the 14 distinct hung slaves I've seen this week, 12 have been win32-slaveNN,
> attached to pm03.  Some have run fuzzer jobs, and one has run a release repack,
> but for the most part they have done nothing since starting up 4 days before
> the alert.

So by "hung" you really mean "idle." Changing the summary to reflect that.
 
> I'd like to be sure that there isn't some misconfiguration on pm03 causing all
> of these idle slaves to appear there.  It's reasonable to suppose that pm01 can
> keep 13 slaves busy, and that if it had 32 slaves it just wouldn't be able to
> keep them busy enough.  So my thought is to disable 19 of the slaves attached
> to pm03, but leave them running, and see if the idleness goes away.

Disable how precisely? Graceful shutdown followed by turning off the nagios alerts?
Summary: win32-slaveNN on pm03 are often hung → win32-slaveNN on pm03 are often idle
(In reply to comment #1)
> Disable how precisely? Graceful shutdown followed by turning off the nagios
> alerts?

Sure.  I'll set this up on Monday.

And I used "hung" because that's the name of the nagios alert.  The two are different, but it's a surprisingly subtle distinction!
Assignee: nobody → dustin
OK, slaves
 w32-ix-slave31
 w32-ix-slave32
 w32-ix-slave33
 w32-ix-slave34
 w32-ix-slave35
 w32-ix-slave36
 w32-ix-slave37
 w32-ix-slave38
 w32-ix-slave39
 w32-ix-slave40
 w32-ix-slave41
 w32-ix-slave42
 w32-ix-slave43
are disabled (I moved buildbot.tac).  Let's see how the idleness goes.  I'll mark them for two weeks of downtime in nagios.
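For the record, the disable-in-place step amounts to something like the sketch below. This is only an illustration: the slave basedir, the nagios command-file path, and the example hostname are assumptions, not the actual production values.

# Minimal sketch, assuming a lot: basedir, nagios command-file path, and
# hostnames below are illustrative stand-ins, not what was actually used.
import time
from pathlib import Path

def disable_tac(basedir):
    """Run on the slave itself: move buildbot.tac aside so the buildslave
    can't start again after a reboot."""
    tac = Path(basedir) / "buildbot.tac"
    if tac.exists():
        tac.rename(tac.parent / "buildbot.tac.disabled")

def schedule_downtime(host, duration=14 * 24 * 3600,
                      cmdfile="/var/lib/nagios3/rw/nagios.cmd"):
    """Run on the nagios host: schedule fixed host downtime via the external
    command file so the 'hung slave' alert stays quiet."""
    now = int(time.time())
    line = ("[%d] SCHEDULE_HOST_DOWNTIME;%s;%d;%d;1;0;%d;slaveduty;bug 629377\n"
            % (now, host, now, now + duration, duration))
    with open(cmdfile, "a") as cmd:
        cmd.write(line)

# Example usage (hypothetical basedir and host):
# disable_tac("E:/builds/moz2_slave")
# schedule_downtime("win32-slave31")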
Comment 3 has the wrong slave names. It should read:
 win32-slave31
 win32-slave32
 win32-slave33
 win32-slave34
 win32-slave35
 win32-slave36
 win32-slave37
 win32-slave38
 win32-slave39
 win32-slave40
 win32-slave41
 win32-slave42
 win32-slave43
Oh good, you scared me for a moment there. :-)
So these machines actually ended up on staging-master, probably because the old-school buildbot.tac generator couldn't write the control file due to permissions. And they still have prod ssh keys, so the jobs barf on uploads. Sad faces for my 4.0b12 staging release.

cssh + "taskkill /f /im python.exe" was my partner in crime to stop buildbot for real, so long as we don't reboot these things.
So this expired, and I don't see a substantial difference in wait times over the last two weeks.  There were only two idle slave incidents in this silo on pm03 during those two weeks, which is a heck of a lot better than we were seeing before!

These are VMs, so given that we didn't see any great harm befall us from this change, I think we should just delete them.  Can I get a second on that motion before I write up the corresponding buildbot-configs patch and hand it off to IT?
tossing back into the releng pool
Assignee: dustin → nobody
Priority: P3 → --
Priority: -- → P4
From bug 639840, we should look at the set of available VMs, and delete a dozen or so of the VMs with small FAT partitions.
Disabled due to the disk space issue in bug 639840:
 win32-slave18
 win32-slave19
 win32-slave29

And I've done win32-slave22 now too. All are noted in the slave tracking sheet.
win32-slave28 is another one with a 30G E: drive.
OK, I met my clicking-on-stuff quota for today in VI.  The following machines have 30G drives:

win32-slave02
win32-slave05
win32-slave12
win32-slave13
win32-slave15
win32-slave16
win32-slave17
win32-slave18
win32-slave19
win32-slave22
win32-slave23
win32-slave24
win32-slave25
win32-slave27
win32-slave28
win32-slave29

And this one's broken:
win32-slave03 (bug 636606)

Let's bring it up to an even 18 with the perfectly-functional 80G builder:
win32-slave14
(since we've had 18 win32 VMs down for 2 months with no pain)

So I'll bring all of the slaves in comment 4 back up, and once that's done, I'll take all of the slaves in this comment down, delete the VMs, and put up a patch for the relevant localconfigs.
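The config side of that is essentially just removing the comment 13 names from the win32 slave list. A minimal sketch follows, with a made-up SLAVES layout standing in for the real buildbot-configs structure; only the shape of the edit is the point.

# Sketch only: the SLAVES structure is a hypothetical stand-in for the real
# buildbot-configs layout.
DECOMMISSIONED = {
    'win32-slave02', 'win32-slave03', 'win32-slave05', 'win32-slave12',
    'win32-slave13', 'win32-slave14', 'win32-slave15', 'win32-slave16',
    'win32-slave17', 'win32-slave18', 'win32-slave19', 'win32-slave22',
    'win32-slave23', 'win32-slave24', 'win32-slave25', 'win32-slave27',
    'win32-slave28', 'win32-slave29',
}

SLAVES = {
    'win32': ['win32-slave%02d' % n for n in range(1, 60)],  # placeholder range
}

SLAVES['win32'] = [s for s in SLAVES['win32'] if s not in DECOMMISSIONED]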
Slaves from comment 4 are up - all are on pm03.
win32-slave02 and win32-slave23 are stretching their swan-song builds out as long as they can, but the rest of the slaves in comment 13 are dead and gone.  Once the last two are finished, I'll hand this over to Amy for the remaining bookkeeping.
remove slaves in comment 13 from all configs
Assignee: nobody → dustin
Attachment #526160 - Flags: review?(nrthomas)
Amy: can you make the slaves in comment 13 disappear from inventory, nagios, DNS, and DHCP?  They're never coming back.
Assignee: dustin → arich
Component: Release Engineering → Server Operations: RelEng
QA Contact: release → zandr
Attachment #526160 - Flags: review?(nrthomas) → review+
Removed from nagios, DNS, DHCP, and the inventory.
Assignee: arich → dustin
Component: Server Operations: RelEng → Release Engineering
QA Contact: zandr → release
Attachment #526160 - Flags: checked-in+
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Amy, I think you missed the last two:

 win32-slave03
 win32-slave14

they're still in inventory and DNS at least.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Not in DNS anymore, but still in inventory.
Assignee: dustin → arich
oh, duh, I can fix inventory :)

done
Status: REOPENED → RESOLVED
Closed: 13 years ago13 years ago
Resolution: --- → FIXED
I deleted these slaves from the production opsi server, too.
Product: mozilla.org → Release Engineering