Closed
Bug 617831
Opened 14 years ago
Closed 13 years ago
Image 2 ix machines as buildbot-master5
Categories
(Infrastructure & Operations :: RelOps: General, task, P3)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: joduinn, Assigned: zandr)
References
Details
(Whiteboard: [hardware][buildduty])
*) Please take 3 of the 1u ix machines in SCL (earmarked for win64), and image them for use as these new masters. These 3 machines need to be imaged identically to buildbot-master1,2.
*) I've cc'd jabba and bhearsum, who worked on the setup of buildbot-master1,2, in case they have additional info.
*) This should be done before the upcoming rejuggling of machines in SCL, so that each master can be on its own separate power and network circuit.
*) After these masters are live in production, we can rebalance our allocation of slaves across these 5 master machines.
Assignee
Comment 1•14 years ago
The existing buildmasters and these new buildmasters are on deliberately cheap machines: Single PSU, single drive, single NIC. Having 50% (or 20% after implementing this) of Talos dependent on a single spindle is not a good idea. There are two machines with redundant power supplies, four spindles available for RAID, and lots of cores. These were intended to be kvm hosts for this purpose and provide a MUCH higher level of reliability than we'll get out of 5 cheap 1Us, particularly if we cluster them. What's blocking using those machines instead?
Comment 2•14 years ago
Masters should really be put on the Ganeti VM hosts that were purchased for this purpose. The 1u builders were really supposed to be temporary and I don't want to go further down that path, as it creates double the workload. I'd recommend waiting until after the great shuffle (during which the Ganeti machines will also get their network configs updated to support VMs in the build vlan) and then starting to test and stage buildmaster VMs. There are two identical hosts in Castro that can also be used to stage buildmaster VMs anytime. Zandr and Bkero know Ganeti and can create CentOS 5.5 VMs for testing.
Assignee: server-ops → zandr
Assignee
Comment 3•14 years ago
I spoke to John about this on Friday, and we're more or less on the same page. Based on experience with running builders on VMs, he's gun-shy about virtualizing anything, so we're going to need to do some serious load testing to prove out this solution. In the meantime, we have half of Talos depending on a single spindle. So, I think we're going to go ahead with these new buildmasters in parallel. I have slaves picked out in the new layout that will spread them across four circuits. Once we get a buildmaster VM through staging, we can return the hardware buildmasters (and puppet master) to the w64 build pool.
Comment 4•13 years ago
Joduinn wanted me to comment here on what OS should go on these machines. Please use the existing 32-bit Linux IX ref image we have; that's what we're using everywhere else.
Reporter
Comment 5•13 years ago
(In reply to comment #4)
> Joduinn wanted me to comment here on what OS should go on these machines.
> Please use the existing 32-bit Linux IX ref image we have, that's what we're
> using everywhere else.

zandr: quick followup after some irc with bhearsum; we believe jlazaro or phong would know the exact name/location of this IX refimage. If it helps, it's the same refimage that was used for imaging the linux32 ix builder slaves.
Updated•13 years ago
Summary: Image 3 ix machines as buildbot-master03,04,05 → Image 3 ix machines as buildbot-master04,05,06
Updated•13 years ago
Summary: Image 3 ix machines as buildbot-master04,05,06 → Image 3 ix machines as buildbot-master4,5,6
Assignee
Comment 6•13 years ago
Some of the win64 builders should be commandeered for this. Reimage with the linux-ix-slave-ref image, change the hostnames in /etc/network/sysconfig, and update inventory/DHCP/DNS for the following:
w64-ix-slave02 -> buildbot-master4
w64-ix-slave04 -> buildbot-master5
w64-ix-slave06 -> buildbot-master6
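The hostname edit described above can be sketched as a pure-text transform. This is a hypothetical helper, not the script actually used; note that on CentOS the hostname conventionally lives in /etc/sysconfig/network as a HOSTNAME= line, so the path and function name here are assumptions:

```shell
# Hypothetical sketch of the post-reimage hostname change. Takes the
# sysconfig file contents on stdin so it can be exercised without
# touching a real machine; rewrites the HOSTNAME= line to the new name.
set_master_hostname() {
  # $1: new hostname; stdin: sysconfig file contents; stdout: updated contents
  sed "s/^HOSTNAME=.*/HOSTNAME=$1/"
}

# Example invocation (slave/master names from the mapping above):
#   set_master_hostname buildbot-master4.build.mozilla.org < /etc/sysconfig/network
```

The inventory/DHCP/DNS updates would still be manual, per the comment.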
Assignee: zandr → shui
Component: Server Operations → Server Operations: Labs
QA Contact: mrz → zandr
Assignee
Comment 7•13 years ago
(oops, changed to wrong component)
Component: Server Operations: Labs → Server Operations: RelEng
Comment 8•13 years ago
It would be great if we could grab machines off the end of any list rather than the start, e.g. w64-ix-slave4x.
Updated•13 years ago
Assignee: shui → zandr
Assignee
Comment 9•13 years ago
(In reply to comment #8)
> It would be great if we could grab machines off the end of any list rather than
> the start, eg w64-ix-slave4x.

In hindsight, yes. In this case I was following the lead of buildbot-master1,2 and the scl production puppet master, which were slave01, 03, and 05, IIRC.
Updated•13 years ago
Assignee: zandr → shui
Comment 11•13 years ago
buildbot-master4 is malfunctioning and not ready to use.

Short-term (this morning) plan: move tm03 to buildbot-master5.

Long-term plan: distribute one build, one test, and one try master to each of:
buildbot-master1
buildbot-master2
buildbot-master4
buildbot-master5
buildbot-master6
all of which are in SCL.
Assignee
Comment 12•13 years ago
I've just done DNS (under embargo, a=bustage).
5 and 6 are handed over to releng; 4 is sad, investigating.
DHCP can wait for the embargo to end, and I'll take care of inventory.
Assignee
Updated•13 years ago
Assignee: shui → zandr
Comment 13•13 years ago
I just reassigned the following slaves to buildbot-master5:9011:

talos-r3-fed-003 talos-r3-fed-012 talos-r3-fed-013 talos-r3-fed-014 talos-r3-fed-015 talos-r3-fed-016 talos-r3-fed-025 talos-r3-fed-026 talos-r3-fed-032 talos-r3-fed-033 talos-r3-fed-043 talos-r3-fed-046 talos-r3-fed-050 talos-r3-fed-052 talos-r3-fed-053
talos-r3-fed64-003 talos-r3-fed64-009 talos-r3-fed64-011 talos-r3-fed64-016 talos-r3-fed64-019 talos-r3-fed64-020 talos-r3-fed64-039 talos-r3-fed64-045 talos-r3-fed64-046 talos-r3-fed64-047 talos-r3-fed64-048 talos-r3-fed64-049 talos-r3-fed64-050 talos-r3-fed64-051 talos-r3-fed64-052
talos-r3-leopard-004 talos-r3-leopard-005 talos-r3-leopard-006 talos-r3-leopard-007 talos-r3-leopard-008 talos-r3-leopard-009 talos-r3-leopard-015 talos-r3-leopard-018 talos-r3-leopard-023 talos-r3-leopard-024 talos-r3-leopard-025 talos-r3-leopard-026 talos-r3-leopard-041 talos-r3-leopard-042
talos-r3-snow-003 talos-r3-snow-017 talos-r3-snow-019 talos-r3-snow-021 talos-r3-snow-022 talos-r3-snow-023 talos-r3-snow-026
talos-r3-w7-004 talos-r3-w7-005 talos-r3-w7-006 talos-r3-w7-007 talos-r3-w7-008 talos-r3-w7-009 talos-r3-w7-011 talos-r3-w7-012 talos-r3-w7-013 talos-r3-w7-014 talos-r3-w7-022 talos-r3-w7-023 talos-r3-w7-025 talos-r3-w7-026 talos-r3-w7-027 talos-r3-w7-028 talos-r3-w7-029 talos-r3-w7-033 talos-r3-w7-034 talos-r3-w7-041 talos-r3-w7-042 talos-r3-w7-043 talos-r3-w7-044 talos-r3-w7-045
talos-r3-xp-004 talos-r3-xp-006 talos-r3-xp-008 talos-r3-xp-011 talos-r3-xp-012 talos-r3-xp-022 talos-r3-xp-023 talos-r3-xp-030 talos-r3-xp-041 talos-r3-xp-042 talos-r3-xp-044 talos-r3-xp-045 talos-r3-xp-047
t-r3-w764-005 t-r3-w764-018

coop's working on bringing up the master. Still on zandr's plate: nagios, dhcp, and inventory.
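A quick way to sanity-check a reassignment list like the one above is to tally slaves per platform. This is a hypothetical helper (sketch only; the index-stripping assumes the trailing -NNN naming convention visible in the list):

```shell
# Count slaves per platform from a whitespace-separated slave list on stdin,
# by splitting on whitespace and stripping the trailing numeric index.
count_by_platform() {
  tr -s '[:space:]' '\n' | sed 's/-[0-9]*$//' | sort | uniq -c | sort -rn
}

# Usage: count_by_platform < slave_list.txt
```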
Comment 14•13 years ago
(In reply to comment #13)
> coop's working on bringing up the master.

I think the master is ready to go. We're just waiting for the appropriate hole(s) to be punched in the firewall so it can talk to the scheduler (tm-b01-master01.mozilla.org). zandr was going to file a netops bug to get this done.
Comment 15•13 years ago
This needs an update to production-masters.json and to https://intranet.mozilla.org/RelEngWiki/index.php/Masters#Production. I *thought* these two were aligned, but it seems that's not the case, so I'll leave this update to someone else.
Reporter
Comment 16•13 years ago
From the RelEng meeting with zandr: we're doing 5,6 here in this bug. bear filed bug#639628 for the creation of master4 as a VM on KVM on the new server box.
Reporter
Updated•13 years ago
Summary: Image 4 ix machines as buildbot-master5,6 → Image 2 ix machines as buildbot-master5,6
Assignee
Comment 17•13 years ago
(In reply to comment #16)
> From RelEng meeting with zandr: we're doing 5,6 here in this bug. bear filed
> bug#639628 for creation of master4 as a VM on a KVM on new server box.

In which case there is no further Server Ops work in this bug. Over to releng for setup, or to close as appropriate.
Assignee: zandr → nobody
Component: Server Operations: RelEng → Release Engineering
QA Contact: zandr → release
Comment 18•13 years ago
The masters JSON was updated (comment 15). However, buildbot-master6 is still idle, so we'll need to set that up. For the record, almost all of the w7 systems above tried to connect to buildbot-master5 until sometime on March 7, at which point they just stopped. I blame Windows. I just restarted them all by hand.
Comment 19•13 years ago
Who's handling this now? buildduty?
Priority: -- → P3
Whiteboard: [hardware][slaveduty]
Comment 20•13 years ago
Masters are buildduty's, yes.
Whiteboard: [hardware][slaveduty] → [hardware][buildduty]
Comment 21•13 years ago
glibc was updated in bug 641377 on both of these machines.

I noticed that I/O was pretty slow on buildbot-master6, e.g. ls on an empty dir:

[root@buildbot-master6 tmp]# time ls
real    0m2.271s
user    0m0.002s
sys     0m2.268s

which is pretty appalling, and nearly 6 minutes to upgrade the glibc rpms. But hdparm is over 90MB/s for buffered reads:

[root@buildbot-master6 ~]# hdparm -tT /dev/sda
/dev/sda:
 Timing cached reads:   29332 MB in  1.99 seconds = 14735.68 MB/sec
 Timing buffered disk reads:  284 MB in  3.02 seconds =  94.03 MB/sec

so it's not in the slow slave list in bug 596366. I think it needs some more attention before we use it for anything. Might just be a file system issue, or the hdparm test for slowness may not be valid.
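The hdparm figure above can be checked mechanically across masters. A minimal sketch, assuming a 90 MB/s cutoff taken from the "over 90MB/s" figure in this comment (both helper names are hypothetical, not an existing tool):

```shell
# Extract the buffered-read MB/sec figure from `hdparm -tT` output on stdin
# (integer part of the last "Timing buffered disk reads" number).
parse_buffered_mbs() {
  awk '/Timing buffered disk reads/ { print int($(NF-1)) }'
}

# Compare a MB/sec figure against a threshold (default 90, per this comment)
# and report whether the disk looks slow.
check_disk_speed() {
  mbs="$1"; threshold="${2:-90}"
  if [ "$mbs" -lt "$threshold" ]; then
    echo "SLOW: ${mbs} MB/sec (threshold ${threshold})"
  else
    echo "OK: ${mbs} MB/sec"
  fi
}

# Usage on a live host:
#   check_disk_speed "$(hdparm -tT /dev/sda | parse_buffered_mbs)"
```

By this yardstick the 94.03 MB/sec reading above passes, which is why the machine doesn't land in the slow-slave list despite the dreadful interactive latency.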
Depends on: 641377
Comment 22•13 years ago
buildbot-master6 now has puppet disabled using chkconfig, to match buildbot-master5. Awesome that the inventory had the right IPMI IP.

It's much more responsive after a reboot, but when running on /dev/sda4 (/builds) there is a lot of
ata1: spurious interrupt (irq_stat 0x8 active-tag -84148995 sactive 0xa
or 0xa or 0x8, or 0x0, in the first pass. Didn't let it run for long because I think it's sick and needs to go back to IX.
Comment 23•13 years ago
Can someone summarize what's left to do here?
Assignee
Comment 24•13 years ago
Apparently we need to pick another machine and reimage that. What's your style on reusing hostnames? Should the new machine be buildbot-master6 again, or buildbot-master7?
Assignee
Updated•13 years ago
Assignee: nobody → server-ops-releng
Component: Release Engineering → Server Operations: RelEng
QA Contact: release → zandr
Comment 25•13 years ago
Let's re-use buildbot-master6; when the original hardware for that name comes back, we'll use it for something else. Let's use linux64-ix-slave41 for this, since we've already used 6 machines from the 64-bit Windows pool.
Assignee
Comment 26•13 years ago
Over to Spencer to image this. Please image linux64-ix-slave41 (IPMI: http://10.12.49.42/) with the linux-ix-ref image (32-bit) and update the hostname in /etc/network/sysconfig. I'll take care of DNS/DHCP/Inventory.
Assignee: server-ops-releng → shui
Reporter
Comment 27•13 years ago
(In reply to comment #22)
> buildbot-master6 now has puppet disabled using chkconfig, to match
> buildbot-master5. Awesome that the inventory had the right IPMI IP.
>
> It's much more responsive after a reboot, but on running on /dev/sda4 (/builds)
> there is a lot of
> ata1: spurious interrupt (irq_stat 0x8 active-tag -84148995 sactive 0xa
> or 0xa or 0x8, or 0x0, in the first pass. Didn't let it run for long because I
> think it's sick and needs to go to IX.

Filed bug#641926 to track this return-and-repair.
Comment 28•13 years ago
linux64-ix-slave41 is running really slowly, so linux64-ix-slave40 is taking its place.
Updated•13 years ago
Assignee: shui → server-ops-releng
Updated•13 years ago
Assignee: server-ops-releng → zandr
Assignee
Comment 29•13 years ago
buildbot-master6.build.mozilla.org is ready to go (rDNS should be propagating now). I'll investigate linux64-ix-slave41 separately.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Comment 30•13 years ago
All the nagios checks for buildbot-master6 disappeared; can these get re-added?
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 31•13 years ago
hdparm looks really bad on the new buildbot-master6, too :(

/dev/sda:
 Timing cached reads:   29288 MB in  1.99 seconds = 14714.35 MB/sec
 Timing buffered disk reads:  150 MB in  3.02 seconds =  49.67 MB/sec
Assignee
Comment 32•13 years ago
Per IRC, punting on building up master6. The VM solution is preferred for this.
Status: REOPENED → RESOLVED
Closed: 13 years ago → 13 years ago
Resolution: --- → FIXED
Comment 33•13 years ago
Updating summary for posterity.
Summary: Image 2 ix machines as buildbot-master5,6 → Image 2 ix machines as buildbot-master5
Updated•11 years ago
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations