Closed
Bug 629377
Opened 14 years ago
Closed 14 years ago
win32-slaveNN on pm03 are often idle
Categories
(Release Engineering :: General, defect, P4)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: dustin, Assigned: arich)
References
Details
(Whiteboard: [slaveduty])
Attachments
(1 file)
2.56 KB, patch
nthomas: review+
dustin: checked-in+
Of the 14 distinct hung slaves I've seen this week, 12 have been win32-slaveNN, attached to pm03. Some have run fuzzer jobs, and one has run a release repack, but for the most part they have done nothing since starting up 4 days before the alert.
Catlee suggests that since tests aren't run on builders anymore, and since builds prefer the much faster IX boxes, this pool of VMs may simply be larger than it needs to be.
pm03 has 32 win32-slaveNN builders attached
pm02 has 1 win32-slaveNN builder attached (win32-slave46)
pm01 has 13 win32-slaveNN builders attached
I'd like to be sure that there isn't some misconfiguration on pm03 causing all of these idle slaves to appear there. It's reasonable to suppose that pm01 can keep 13 slaves busy, and that if it had 32 slaves it just wouldn't be able to keep them busy enough. So my thought is to disable 19 of the slaves attached to pm03, but leave them running, and see if the idleness goes away.
Updated•14 years ago
Summary: win32-ix-slaveNN on pm03 are often hung → win32-slaveNN on pm03 are often hung
Updated•14 years ago
OS: All → Windows Server 2003
Priority: -- → P3
Comment 1•14 years ago
(In reply to comment #0)
> Of the 14 distinct hung slaves I've seen this week, 12 have been win32-slaveNN,
> attached to pm03. Some have run fuzzer jobs, and one has run a release repack,
> but for the most part they have done nothing since starting up 4 days before
> the alert.
So by "hung" you really mean "idle." Changing the summary to reflect that.
> I'd like to be sure that there isn't some misconfiguration on pm03 causing all
> of these idle slaves to appear there. It's reasonable to suppose that pm01 can
> keep 13 slaves busy, and that if it had 32 slaves it just wouldn't be able to
> keep them busy enough. So my thought is to disable 19 of the slaves attached
> to pm03, but leave them running, and see if the idleness goes away.
Disable how precisely? Graceful shutdown followed by turning off the nagios alerts?
Summary: win32-slaveNN on pm03 are often hung → win32-slaveNN on pm03 are often idle
Reporter
Comment 2•14 years ago
(In reply to comment #1)
> Disable how precisely? Graceful shutdown followed by turning off the nagios
> alerts?
Sure. I'll set this up on Monday.
And I used "hung" because that's the name of the nagios alert. The two are different, but it's a surprisingly subtle distinction!
Assignee: nobody → dustin
Reporter
Comment 3•14 years ago
OK, slaves
w32-ix-slave31
w32-ix-slave32
w32-ix-slave33
w32-ix-slave34
w32-ix-slave35
w32-ix-slave36
w32-ix-slave37
w32-ix-slave38
w32-ix-slave39
w32-ix-slave40
w32-ix-slave41
w32-ix-slave42
w32-ix-slave43
are disabled (I moved buildbot.tac). Let's see how the idleness goes. I'll mark them for two weeks of downtime in nagios.
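For context, disabling a slave by moving buildbot.tac just renames the file so buildbot has nothing to start after a reboot. A rough sketch of what that looks like fanned out over the pool (the slave directory path /c/slave is an assumption, not taken from this bug; shown as a dry run that prints the commands rather than executing them over ssh):

```shell
# Hypothetical sketch of the "moved buildbot.tac" disable step.
# The /c/slave path is assumed for illustration.
for n in $(seq 31 43); do
  host="win32-slave$n"
  # Dry run: print the command instead of running it over ssh.
  echo "ssh $host mv /c/slave/buildbot.tac /c/slave/buildbot.tac.disabled"
done
```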
Reporter
Comment 4•14 years ago
Comment 3 has the wrong slave names; it should read:
win32-slave31
win32-slave32
win32-slave33
win32-slave34
win32-slave35
win32-slave36
win32-slave37
win32-slave38
win32-slave39
win32-slave40
win32-slave41
win32-slave42
win32-slave43
Comment 5•14 years ago
Oh good, you scared me for a moment there. :-)
Comment 6•14 years ago
So these machines actually ended up on staging-master, probably because the old-school buildbot.tac generator couldn't write the control file due to permissions. And they still have prod ssh keys, so the jobs barf on uploads. Sad faces for my 4.0b12 staging release.
cssh + "taskkill /f /im python.exe" was my partner in crime to stop buildbot for real, so long as we don't reboot these things.
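The cssh fan-out there amounts to sending the same Windows taskkill to every affected host; a non-interactive sketch of the equivalent (host names are assumed for illustration, and this dry run only prints what would be sent):

```shell
# Sketch of fanning out the taskkill from the comment above.
# cssh opens interactive sessions; this variant just prints the
# per-host command (hosts assumed for illustration).
for host in win32-slave31 win32-slave32 win32-slave33; do
  echo "$host: taskkill /f /im python.exe"
done
```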
Reporter
Comment 7•14 years ago
So this expired, and I don't see a substantial difference in wait times over the last two weeks. There were only two idle slave incidents in this silo on pm03 during those two weeks, which is a heck of a lot better than we were seeing before!
These are VMs, so given that we didn't see any great harm befall us from this change, I think we should just delete them. Can I get a second on that motion before I write up the corresponding buildbot-configs patch and hand it off to IT?
Reporter
Comment 8•14 years ago
tossing back into the releng pool
Assignee: dustin → nobody
Priority: P3 → --
Updated•14 years ago
Priority: -- → P4
Reporter
Comment 10•14 years ago
From bug 639840, we should look at the set of available VMs, and delete a dozen or so of the VMs with small FAT partitions.
Comment 11•14 years ago
Disabled due to the disk space issue in bug 639840:
win32-slave18
win32-slave19
win32-slave29
And I've done win32-slave22 now too. All in slave tracking sheet.
Comment 12•14 years ago
win32-slave28 is another one with a 30G E: drive.
Reporter
Comment 13•14 years ago
OK, I met my clicking-on-stuff quota for today in VI. The following machines have 30G drives:
win32-slave02
win32-slave05
win32-slave12
win32-slave13
win32-slave15
win32-slave16
win32-slave17
win32-slave18
win32-slave19
win32-slave22
win32-slave23
win32-slave24
win32-slave25
win32-slave27
win32-slave28
win32-slave29
And this one's broken:
win32-slave03 (bug 636606)
Let's bring it up to an even 18 with this perfectly-functional 80G builder:
win32-slave14
(since we've had 18 win32 VMs down for 2 months with no pain)
So I'll bring all of the slaves in comment 4 back up, and once that's done, I'll take all of the slaves in this comment down, delete the VMs, and put up a patch for the relevant localconfigs.
Reporter
Comment 14•14 years ago
Slaves from comment 4 are up - all are on pm03.
Reporter
Comment 15•14 years ago
win32-slave02 and win32-slave23 are stretching their swan-song builds out as long as they can, but the rest of the slaves in comment 13 are dead and gone. Once the last two are finished, I'll hand this over to amy for the remaining bookkeeping.
Reporter
Comment 16•14 years ago
Remove slaves in comment 13 from all configs.
Assignee: nobody → dustin
Attachment #526160 -
Flags: review?(nrthomas)
Reporter
Comment 17•14 years ago
Amy: can you make the slaves in comment 13 disappear from inventory, nagios, DNS, and DHCP? They're never coming back.
Assignee: dustin → arich
Component: Release Engineering → Server Operations: RelEng
QA Contact: release → zandr
Updated•14 years ago
Attachment #526160 -
Flags: review?(nrthomas) → review+
Assignee
Comment 18•14 years ago
Removed from nagios, DNS, DHCP, and the inventory.
Assignee: arich → dustin
Reporter
Updated•14 years ago
Component: Server Operations: RelEng → Release Engineering
QA Contact: zandr → release
Reporter
Updated•14 years ago
Attachment #526160 -
Flags: checked-in+
Reporter
Updated•14 years ago
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Reporter
Comment 19•14 years ago
Amy, I think you missed the last two:
win32-slave03
win32-slave14
They're still in inventory and DNS at least.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Reporter
Comment 20•14 years ago
Not in DNS anymore, but still in inventory.
Assignee: dustin → arich
Reporter
Comment 21•14 years ago
oh, duh, I can fix inventory :)
done
Status: REOPENED → RESOLVED
Closed: 14 years ago → 14 years ago
Resolution: --- → FIXED
Comment 22•14 years ago
I deleted these slaves from the production opsi server, too.
Updated•12 years ago
Product: mozilla.org → Release Engineering