Bug 629377 (Closed) - Opened 13 years ago, Closed 13 years ago

win32-slaveNN on pm03 are often idle

Categories

(Release Engineering :: General, defect, P4)

All
Windows Server 2003
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dustin, Assigned: arich)

References

Details

(Whiteboard: [slaveduty])

Attachments

(1 file)

Of the 14 distinct hung slaves I've seen this week, 12 have been win32-slaveNN, attached to pm03.  Some have run fuzzer jobs, and one has run a release repack, but for the most part they have done nothing since starting up 4 days before the alert.

Catlee suggests that since tests aren't run on builders anymore, and since builds prefer the much faster IX boxes, this pool of VMs may simply be larger than it needs to be.

pm03 has 32 win32-slaveNN builders attached
pm02 has 1 win32-slaveNN builder attached (win32-slave46)
pm01 has 13 win32-slaveNN builders attached

I'd like to be sure that there isn't some misconfiguration on pm03 causing all of these idle slaves to appear there.  It's reasonable to suppose that pm01 can keep 13 slaves busy, and that if it had 32 slaves it just wouldn't be able to keep them busy enough.  So my thought is to disable 19 of the slaves attached to pm03, but leave them running, and see if the idleness goes away.
Summary: win32-ix-slaveNN on pm03 are often hung → win32-slaveNN on pm03 are often hung
OS: All → Windows Server 2003
Priority: -- → P3
(In reply to comment #0)
> Of the 14 distinct hung slaves I've seen this week, 12 have been win32-slaveNN,
> attached to pm03.  Some have run fuzzer jobs, and one has run a release repack,
> but for the most part they have done nothing since starting up 4 days before
> the alert.

So by "hung" you really mean "idle." Changing the summary to reflect that.
 
> I'd like to be sure that there isn't some misconfiguration on pm03 causing all
> of these idle slaves to appear there.  It's reasonable to suppose that pm01 can
> keep 13 slaves busy, and that if it had 32 slaves it just wouldn't be able to
> keep them busy enough.  So my thought is to disable 19 of the slaves attached
> to pm03, but leave them running, and see if the idleness goes away.

Disable how precisely? Graceful shutdown followed by turning off the nagios alerts?
Summary: win32-slaveNN on pm03 are often hung → win32-slaveNN on pm03 are often idle
(In reply to comment #1)
> Disable how precisely? Graceful shutdown followed by turning off the nagios
> alerts?

Sure.  I'll set this up on Monday.

And I used "hung" because that's the name of the nagios alert.  The two are different, but it's a surprisingly subtle distinction!
Assignee: nobody → dustin
OK, slaves
 w32-ix-slave31
 w32-ix-slave32
 w32-ix-slave33
 w32-ix-slave34
 w32-ix-slave35
 w32-ix-slave36
 w32-ix-slave37
 w32-ix-slave38
 w32-ix-slave39
 w32-ix-slave40
 w32-ix-slave41
 w32-ix-slave42
 w32-ix-slave43
are disabled (I moved buildbot.tac).  Let's see how the idleness goes.  I'll mark them for two weeks of downtime in nagios.
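For the record, the disable-in-place step amounts to something like the sketch below. This is only an illustration: the slave basedir, the nagios command-file path, and the example hostname are assumptions, not the actual production values.

# Minimal sketch, assuming a lot: basedir, nagios command-file path, and
# hostnames below are illustrative stand-ins, not what was actually used.
import time
from pathlib import Path

def disable_tac(basedir):
    """Run on the slave itself: move buildbot.tac aside so the buildslave
    can't start again after a reboot."""
    tac = Path(basedir) / "buildbot.tac"
    if tac.exists():
        tac.rename(tac.parent / "buildbot.tac.disabled")

def schedule_downtime(host, duration=14 * 24 * 3600,
                      cmdfile="/var/lib/nagios3/rw/nagios.cmd"):
    """Run on the nagios host: schedule fixed host downtime via the external
    command file so the 'hung slave' alert stays quiet."""
    now = int(time.time())
    line = ("[%d] SCHEDULE_HOST_DOWNTIME;%s;%d;%d;1;0;%d;slaveduty;bug 629377\n"
            % (now, host, now, now + duration, duration))
    with open(cmdfile, "a") as cmd:
        cmd.write(line)

# Example usage (hypothetical basedir and host):
# disable_tac("E:/builds/moz2_slave")
# schedule_downtime("win32-slave31")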
Comment 3 has the wrong slave names. It should read:
 win32-slave31
 win32-slave32
 win32-slave33
 win32-slave34
 win32-slave35
 win32-slave36
 win32-slave37
 win32-slave38
 win32-slave39
 win32-slave40
 win32-slave41
 win32-slave42
 win32-slave43
Oh good, you scared me for a moment there. :-)
So these machines actually ended up on staging-master, probably because the old-school buildbot.tac generator couldn't write the control file due to permissions. And they still have prod ssh keys, so the jobs barf on uploads. Sad faces for my 4.0b12 staging release.

cssh + "taskkill /f /im python.exe" was my partner in crime to stop buildbot for real, so long as we don't reboot these things.
So this expired, and I don't see a substantial difference in wait times over the last two weeks.  There were only two idle slave incidents in this silo on pm03 during those two weeks, which is a heck of a lot better than we were seeing before!

These are VMs, so given that we didn't see any great harm befall us from this change, I think we should just delete them.  Can I get a second on that motion before I write up the corresponding buildbot-configs patch and hand it off to IT?
tossing back into the releng pool
Assignee: dustin → nobody
Priority: P3 → --
Priority: -- → P4
From bug 639840, we should look at the set of available VMs, and delete a dozen or so of the VMs with small FAT partitions.
Disabled due to the disk space issue in bug 639840:
 win32-slave18
 win32-slave19
 win32-slave29

And I've done win32-slave22 now too. All are noted in the slave tracking sheet.
win32-slave28 is another one with a 30G E: drive.
OK, I met my clicking-on-stuff quota for today in VI.  The following machines have 30G drives:

win32-slave02
win32-slave05
win32-slave12
win32-slave13
win32-slave15
win32-slave16
win32-slave17
win32-slave18
win32-slave19
win32-slave22
win32-slave23
win32-slave24
win32-slave25
win32-slave27
win32-slave28
win32-slave29

And this one's broken:
win32-slave03 (bug 636606)

Let's bring it up to an even 18 with the perfectly-functional 80G builder:
win32-slave14
(since we've had 18 win32 VMs down for 2 months with no pain)

So I'll bring all of the slaves in comment 4 back up, and once that's done, I'll take all of the slaves in this comment down, delete the VMs, and put up a patch for the relevant localconfigs.
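The config side of that is essentially just removing the comment 13 names from the win32 slave list. A minimal sketch follows, with a made-up SLAVES layout standing in for the real buildbot-configs structure; only the shape of the edit is the point.

# Sketch only: the SLAVES structure is a hypothetical stand-in for the real
# buildbot-configs layout.
DECOMMISSIONED = {
    'win32-slave02', 'win32-slave03', 'win32-slave05', 'win32-slave12',
    'win32-slave13', 'win32-slave14', 'win32-slave15', 'win32-slave16',
    'win32-slave17', 'win32-slave18', 'win32-slave19', 'win32-slave22',
    'win32-slave23', 'win32-slave24', 'win32-slave25', 'win32-slave27',
    'win32-slave28', 'win32-slave29',
}

SLAVES = {
    'win32': ['win32-slave%02d' % n for n in range(1, 60)],  # placeholder range
}

SLAVES['win32'] = [s for s in SLAVES['win32'] if s not in DECOMMISSIONED]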
Slaves from comment 4 are up - all are on pm03.
win32-slave02 and win32-slave23 are stretching their swan-song builds out as long as they can, but the rest of the slaves in comment 13 are dead and gone.  Once the last two are finished, I'll hand this over to Amy for the remaining bookkeeping.
remove slaves in comment 13 from all configs
Assignee: nobody → dustin
Attachment #526160 - Flags: review?(nrthomas)
Amy: can you make the slaves in comment 13 disappear from inventory, nagios, DNS, and DHCP?  They're never coming back.
Assignee: dustin → arich
Component: Release Engineering → Server Operations: RelEng
QA Contact: release → zandr
Attachment #526160 - Flags: review?(nrthomas) → review+
Removed from nagios, DNS, DHCP, and the inventory.
Assignee: arich → dustin
Component: Server Operations: RelEng → Release Engineering
QA Contact: zandr → release
Attachment #526160 - Flags: checked-in+
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Amy, I think you missed the last two:

 win32-slave03
 win32-slave14

they're still in inventory and DNS at least.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Not in DNS anymore, but still in inventory.
Assignee: dustin → arich
oh, duh, I can fix inventory :)

done
Status: REOPENED → RESOLVED
Closed: 13 years ago13 years ago
Resolution: --- → FIXED
I deleted these slaves from the production opsi server, too.
Product: mozilla.org → Release Engineering