Closed Bug 617831 Opened 14 years ago Closed 13 years ago

Image 2 ix machines as buildbot-master5

Categories

(Infrastructure & Operations :: RelOps: General, task, P3)

x86
Linux

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: joduinn, Assigned: zandr)

References

Details

(Whiteboard: [hardware][buildduty])

*) Please take 3 of the 1u ix machines in SCL (earmarked for win64), and image them for use as these new masters. These 3 machines need to be imaged identically to how we did buildbot-master1,2.

*) I've cc'd jabba and bhearsum, who worked on the setup of buildbot-master1,2 in case they have additional info. 

*) This should be done before the upcoming rejuggling of machines in SCL, so that each master can be on its own separate power and network circuit.

*) After these masters are live in production, we can rebalance our allocation of slaves across these 5 master machines.
The existing buildmasters and these new buildmasters are on deliberately cheap machines: Single PSU, single drive, single NIC.

Having 50% (or 20% after implementing this) of Talos dependent on a single spindle is not a good idea.

There are two machines with redundant power supplies, four spindles available for RAID, and lots of cores. These were intended to be kvm hosts for this purpose and provide a MUCH higher level of reliability than we'll get out of 5 cheap 1Us, particularly if we cluster them.

What's blocking using those machines instead?
Masters should really be put on the Ganeti VM hosts that were purchased for this purpose. The 1u builders were really supposed to be temporary and I don't want to go further down that path, as it creates double the workload. I'd recommend waiting until after the great shuffle (during which the Ganeti machines will also get their network configs updated to support VMs in the build vlan) and then starting to test and stage buildmaster VMs. There are two identical hosts in Castro that can also be used to stage buildmaster VMs anytime.

Zandr and Bkero know Ganeti and can create CentOS5.5 VMs for testing.
Assignee: server-ops → zandr
I spoke to John about this on Friday, and we're more or less on the same page. Based on experience with running builders on VMs, he's gun-shy about virtualizing anything, so we're going to need to do some serious load testing to prove out this solution. 

In the meantime, we have half of Talos depending on a single spindle. So I think we're going to go ahead with these new buildmasters in parallel. I have slaves picked out in the new layout that will spread them across four circuits.

Once we get a buildmaster VM through staging, we can return the hardware buildmasters (and puppet master) to the w64 build pool.
Joduinn wanted me to comment here on what OS should go on these machines. Please use the existing 32-bit Linux IX ref image we have, that's what we're using everywhere else.
(In reply to comment #4)
> Joduinn wanted me to comment here on what OS should go on these machines.
> Please use the existing 32-bit Linux IX ref image we have, that's what we're
> using everywhere else.

zandr: quick followup after some irc with bhearsum; we believe jlazaro or phong would know the exact name/location of this IX refimage. If it helps, it's the same refimage that was used for imaging the linux32 ix builder slaves.
Summary: Image 3 ix machines as buildbot-master03,04,05 → Image 3 ix machines as buildbot-master04,05,06
Summary: Image 3 ix machines as buildbot-master04,05,06 → Image 3 ix machines as buildbot-master4,5,6
Some of the win64 builders should be commandeered for this. Reimage with the linux-ix-slave-ref image, change the hostname in /etc/sysconfig/network, and update inventory/DHCP/DNS for the following (a sketch of the rename step follows the list):

w64-ix-slave02 -> buildbot-master4
w64-ix-slave04 -> buildbot-master5
w64-ix-slave06 -> buildbot-master6
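
For reference, the rename step on a freshly reimaged box looks something like this (a sketch, not a record of what was actually run; it assumes the ref image is CentOS 5, where the hostname lives in /etc/sysconfig/network, and uses the first mapping above as the example):

  [root@w64-ix-slave02 ~]# sed -i 's/^HOSTNAME=.*/HOSTNAME=buildbot-master4.build.mozilla.org/' /etc/sysconfig/network
  [root@w64-ix-slave02 ~]# hostname buildbot-master4.build.mozilla.org
  [root@w64-ix-slave02 ~]# grep HOSTNAME /etc/sysconfig/network
  HOSTNAME=buildbot-master4.build.mozilla.org

Inventory/DHCP/DNS still have to be updated separately, as noted above.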
Assignee: zandr → shui
Component: Server Operations → Server Operations: Labs
QA Contact: mrz → zandr
(oops, changed to wrong component)
Component: Server Operations: Labs → Server Operations: RelEng
It would be great if we could grab machines off the end of any list rather than the start, eg w64-ix-slave4x.
Assignee: shui → zandr
(In reply to comment #8)
> It would be great if we could grab machines off the end of any list rather than
> the start, eg w64-ix-slave4x.

In hindsight, yes. In this case I was following the lead of buildbot-master1,2 and scl production puppet, which were slave01, 03, and 05, IIRC.
Assignee: zandr → shui
buildbot-master4 is malfunctioning and not ready to use.

Short-term this-morning plan:
 move tm03 to buildbot-master5

Long-term plan: distribute one build, one test, and one try master to each of
 buildbot-master1
 buildbot-master2
 buildbot-master4
 buildbot-master5
 buildbot-master6
all of which are in SCL.
I've just done DNS under embargo a=bustage.
5 and 6 handed over to releng
4 is sad, investigating
DHCP can wait for the embargo to end, and I'll take care of inventory
Assignee: shui → zandr
I just reassigned the following slaves to buildbot-master5:9011:

talos-r3-fed-003
talos-r3-fed-012
talos-r3-fed-013
talos-r3-fed-014
talos-r3-fed-015
talos-r3-fed-016
talos-r3-fed-025
talos-r3-fed-026
talos-r3-fed-032
talos-r3-fed-033
talos-r3-fed-043
talos-r3-fed-046
talos-r3-fed-050
talos-r3-fed-052
talos-r3-fed-053
talos-r3-fed64-003
talos-r3-fed64-009
talos-r3-fed64-011
talos-r3-fed64-016
talos-r3-fed64-019
talos-r3-fed64-020
talos-r3-fed64-039
talos-r3-fed64-045
talos-r3-fed64-046
talos-r3-fed64-047
talos-r3-fed64-048
talos-r3-fed64-049
talos-r3-fed64-050
talos-r3-fed64-051
talos-r3-fed64-052
talos-r3-leopard-004
talos-r3-leopard-005
talos-r3-leopard-006
talos-r3-leopard-007
talos-r3-leopard-008
talos-r3-leopard-009
talos-r3-leopard-015
talos-r3-leopard-018
talos-r3-leopard-023
talos-r3-leopard-024
talos-r3-leopard-025
talos-r3-leopard-026
talos-r3-leopard-041
talos-r3-leopard-042
talos-r3-snow-003
talos-r3-snow-017
talos-r3-snow-019
talos-r3-snow-021
talos-r3-snow-022
talos-r3-snow-023
talos-r3-snow-026
talos-r3-w7-004
talos-r3-w7-005
talos-r3-w7-006
talos-r3-w7-007
talos-r3-w7-008
talos-r3-w7-009
talos-r3-w7-011
talos-r3-w7-012
talos-r3-w7-013
talos-r3-w7-014
talos-r3-w7-022
talos-r3-w7-023
talos-r3-w7-025
talos-r3-w7-026
talos-r3-w7-027
talos-r3-w7-028
talos-r3-w7-029
talos-r3-w7-033
talos-r3-w7-034
talos-r3-w7-041
talos-r3-w7-042
talos-r3-w7-043
talos-r3-w7-044
talos-r3-w7-045
talos-r3-xp-004
talos-r3-xp-006
talos-r3-xp-008
talos-r3-xp-011
talos-r3-xp-012
talos-r3-xp-022
talos-r3-xp-023
talos-r3-xp-030
talos-r3-xp-041
talos-r3-xp-042
talos-r3-xp-044
talos-r3-xp-045
talos-r3-xp-047
t-r3-w764-005
t-r3-w764-018

coop's working on bringing up the master.

still on zandr's plate: nagios, dhcp, and inventory
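
For anyone spot-checking the move: grepping a slave's buildbot.tac shows which master and port it will connect to (a sketch, not from this bug; it assumes a stock buildslave-style .tac file, and the basedir path is a guess that varies by slave type):

  [cltbld@talos-r3-fed-003 ~]$ grep -E 'buildmaster_host|^port' /builds/slave/buildbot.tac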
(In reply to comment #13)
> coop's working on bringing up the master.

I think the master is ready to go. We're just waiting for the appropriate hole(s) to be punched in the firewall so it can talk to the scheduler (tm-b01-master01.mozilla.org). zandr was going to file a netops bug to get this done.
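
Once the rule lands, a quick check from the master will confirm the path is open (a sketch, not run as part of this bug; the port is only a placeholder, since the scheduler's actual PB port isn't recorded here):

  [root@buildbot-master5 ~]# nc -zv tm-b01-master01.mozilla.org 9011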
This needs an update to production-masters.json and to https://intranet.mozilla.org/RelEngWiki/index.php/Masters#Production.  I *thought* these two were aligned, but it seems that's not the case, so I'll leave this update to someone else.
See Also: → 638814
Blocks: 639628
From RelEng meeting with zandr: we're doing 5,6 here in this bug. bear filed bug#639628 for the creation of master4 as a VM on a KVM host on the new server box.
No longer blocks: 639628
See Also: → 639628
Summary: Image 3 ix machines as buildbot-master4,5,6 → Image 4 ix machines as buildbot-master5,6
Summary: Image 4 ix machines as buildbot-master5,6 → Image 2 ix machines as buildbot-master5,6
(In reply to comment #16)
> From RelEng meeting with zandr: we're doing 5,6 here in this bug. bear filed
> bug#639628 for the creation of master4 as a VM on a KVM host on the new server box.

In which case there is no further Server Ops work in this bug. Over to releng for setup or to close as appropriate.
Assignee: zandr → nobody
Component: Server Operations: RelEng → Release Engineering
QA Contact: zandr → release
The masters JSON was updated (comment 15).  However, buildbot-master6 is still idle, so we'll need to set that up.

For the record, almost all of the w7 systems above tried to connect to buildbot-master5 until sometime on March 7, at which point they just stopped.  I blame Windows.  I just restarted them all by hand.
Who's handling this now? buildduty?
Priority: -- → P3
Whiteboard: [hardware][slaveduty]
Masters are buildduty's, yes.
Whiteboard: [hardware][slaveduty] → [hardware][buildduty]
glibc was updated in bug 641377, on both these machines. 

I noticed that I/O was pretty slow on buildbot-master6, e.g. ls on an empty dir:
  [root@buildbot-master6 tmp]# time ls
  real	0m2.271s
  user	0m0.002s
  sys	0m2.268s
which is pretty appalling, and it took nearly 6 minutes to upgrade the glibc rpms. But hdparm is over 90 MB/s for buffered reads:
  [root@buildbot-master6 ~]# hdparm -tT /dev/sda
  /dev/sda:
   Timing cached reads:   29332 MB in  1.99 seconds = 14735.68 MB/sec
   Timing buffered disk reads:  284 MB in  3.02 seconds =  94.03 MB/sec
so it's not in the slow slave list in bug 596366. 

I think it needs some more attention before we use it for anything. Might just be a file system issue, or the hdparm test for slowness may not be valid.
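
One way to separate "slow filesystem" from "slow disk" would be to time a synced write through the filesystem and compare it with the raw hdparm figures above (a sketch, not something run in this bug; path and size are arbitrary):

  [root@buildbot-master6 ~]# time (dd if=/dev/zero of=/builds/ddtest bs=1M count=512 && sync)
  [root@buildbot-master6 ~]# rm -f /builds/ddtest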
Depends on: 641377
buildbot-master6 now has puppet disabled using chkconfig, to match buildbot-master5. Awesome that the inventory had the right IPMI IP.

It's much more responsive after a reboot, but on running on /dev/sda4 (/builds) there is a lot of 
  ata1: spurious interrupt (irq_stat 0x8 active-tag -84148995 sactive 0xa
or 0xa or 0x8, or 0x0, in the first pass. Didn't let it run for long because I think it's sick and needs to go to IX.
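
Before it goes back, it may be worth counting how many of those errors have accumulated in the kernel log (a sketch, not something run here; the exact message text may differ):

  [root@buildbot-master6 ~]# dmesg | grep -c 'spurious interrupt'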
Can someone summarize what's left to do here?
Apparently we need to pick another machine and reimage that.

What's your style on reusing hostnames? Should the new machine be buildbot-master6 again, or buildbot-master7?
Assignee: nobody → server-ops-releng
Component: Release Engineering → Server Operations: RelEng
QA Contact: release → zandr
Let's re-use buildbot-master6; when the original hardware for that name comes back, we'll use it for something else.

Let's use linux64-ix-slave41 for this, since we've already used 6 machines from the 64-bit Windows pool.
Over to Spencer to image this.

Please image linux64-ix-slave41 (IPMI: http://10.12.49.42/ ) with the linux-ix-ref image (32bit) and update the hostname in /etc/sysconfig/network. I'll take care of DNS/DHCP/Inventory.
Assignee: server-ops-releng → shui
(In reply to comment #22)
> buildbot-master6 now has puppet disabled using chkconfig, to match
> buildbot-master5. Awesome that the inventory had the right IPMI IP.
> 
> It's much more responsive after a reboot, but on running on /dev/sda4 (/builds)
> there is a lot of 
>   ata1: spurious interrupt (irq_stat 0x8 active-tag -84148995 sactive 0xa
> or 0xa or 0x8, or 0x0, in the first pass. Didn't let it run for long because I
> think it's sick and needs to go to IX.

Filed bug#641926 to track this return-and-repair
linux64-ix-slave41 is running really slowly, so linux64-ix-slave40 is taking its place.
Assignee: shui → server-ops-releng
Assignee: server-ops-releng → zandr
buildbot-master6.build.mozilla.org is ready to go (rDNS should be propagating now)

I'll investigate linux64-ix-slave41 separately.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
All the nagios checks for buildbot-master6 disappeared; can these get re-added?
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
hdparm looks really bad on the new buildbot-master6, too :(

/dev/sda:
 Timing cached reads:   29288 MB in  1.99 seconds = 14714.35 MB/sec
 Timing buffered disk reads:  150 MB in  3.02 seconds =  49.67 MB/sec
Per IRC, punting on building up master6. The VM solution is preferred for this.
Status: REOPENED → RESOLVED
Closed: 13 years ago → 13 years ago
Resolution: --- → FIXED
Updating summary for posterity.
Summary: Image 2 ix machines as buildbot-master5,6 → Image 2 ix machines as buildbot-master5
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations