Closed Bug 737594 Opened 12 years ago Closed 12 years ago

configure buildbot masters for r5 and linux builders in scl3

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task, P3)

x86
Linux

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: arich, Assigned: jhford)

References

Details

(Whiteboard: [buildmasters][capacity])

Attachments

(3 files)

As soon as puppet is working in releng.scl3.mozilla.com, I'll be deploying two VMs for buildbot masters called buildbot-master30 and buildbot-master31.srv.releng.scl3.mozilla.com.  In order to get the r5 builders up by the end of this week, we'll need someone to set up at least one of these buildbot masters (we'll also be bringing up 50 Linux VMware VM builders in the next week or so).

Is there anything else required other than a VM (matching the other buildbot masters in specs) with the appropriate root pw set?  Will a third buildbot master be required?
The buildbot-master30.srv.releng.scl3.mozilla.com and buildbot-master31.srv.releng.scl3.mozilla.com VMs are up with the standard root pw.
Assignee: nobody → coop
Status: NEW → ASSIGNED
Component: Release Engineering → Release Engineering: Platform Support
OS: Mac OS X → Linux
Priority: -- → P2
QA Contact: release → coop
Whiteboard: [buildmasters][capacity]
Comment on attachment 608791 [details] [diff] [review]
Add new masters to production-masters.json

If you have them already up and running, then land as-is.
If you don't have them running yet, could you please land it with "enabled": false?
Otherwise, people using fabric will try to reconfigure masters that aren't up yet.
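
For reference, a minimal sketch of what an entry with that flag might look like and how a fabric-style tool could skip it. The field names other than "enabled", and the filtering code, are illustrative assumptions, not the actual production-masters.json schema or tooling:

  import json

  # Hypothetical entry shape -- only "enabled" is the field discussed above;
  # the other keys are illustrative placeholders.
  masters = json.loads("""
  [
    {
      "name": "bm30-build1",
      "hostname": "buildbot-master30.srv.releng.scl3.mozilla.com",
      "http_port": 8001,
      "enabled": false
    }
  ]
  """)

  # A fabric-style tool would filter on the flag and leave not-yet-running
  # masters alone when reconfiguring.
  to_reconfig = [m for m in masters if m.get("enabled")]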
Attachment #608791 - Flags: review?(armenzg) → review+
Attachment #608792 - Flags: review?(armenzg) → review+
(In reply to Armen Zambrano G. [:armenzg] - Release Engineer from comment #4)
> Comment on attachment 608791 [details] [diff] [review]
> Add new masters to production-masters.json
> 
> If you have them already up and running, then land as-is.
> If you don't have them running yet, could you please land it with
> "enabled": false?
> Otherwise, people using fabric will try to reconfigure masters that aren't
> up yet.

I'm aiming to have them running before that would be an issue.
Comment on attachment 608791 [details] [diff] [review]
Add new masters to production-masters.json

https://hg.mozilla.org/build/tools/rev/b6ba25b1e855
Attachment #608791 - Flags: checked-in+
Both masters are running now:

http://buildbot-master30.srv.releng.scl3.mozilla.com:8001/
http://buildbot-master31.srv.releng.scl3.mozilla.com:8101/

On reboot, the initial connection to puppet seems to hang, despite it reporting success in /var/log/messages, e.g.:

Mar 23 17:22:11 buildbot-master30 puppet-agent[2766]: Starting Puppet client version 2.6.14
Mar 23 17:22:21 buildbot-master30 puppet-agent[2766]: Finished catalog run in 7.33 seconds

For now, it's enough to know that if we kill that hung process on reboot, the master will start correctly. I'll try to debug this further on Monday.
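
For the record, "kill that hung process" amounts to something along these lines (a minimal sketch; the assumption that the hung client shows up in the process table as "puppet agent" is mine, not verified on these hosts):

  # Find any puppet agent still hanging around after the catalog run has
  # reported success, and terminate it so the buildbot master can start.
  import os
  import signal
  import subprocess

  result = subprocess.run(["pgrep", "-f", "puppet agent"],
                          capture_output=True, text=True)
  for pid in result.stdout.split():
      os.kill(int(pid), signal.SIGTERM)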
Right after you did this, all of the buildbot processes on all of the minis in scl3 stopped.  I'm not sure what to do about the nagios alerts for now (are things broken, or is this expected?), so I'm just going to leave them be and let you or someone else in releng downtime, ack, or fix them, whichever is appropriate.
The slaves hit exceptions because the passwords in the slave buildbot.tac were set to None, and that only caused things to break once the masters were available to connect to. Dustin helped me track down the missing password entries in the slavealloc db and get them fixed up.

I've also made temporary additions to localconfig.py on both masters to allow the slaves to connect, but I'll need to discuss with jhford on Monday how we actually want to handle the new slaves.
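
For context, this is roughly the relevant part of a generated slave buildbot.tac (a minimal sketch based on the stock buildslave 0.8.x template; the host, port, name, and path values are placeholders rather than the real ones). When slavealloc has no password entry, passwd comes out as None, and the slave only blows up once a master is actually reachable and it tries to authenticate:

  from twisted.application import service
  from buildslave.bot import BuildSlave

  basedir = '/builds/slave'            # placeholder path
  buildmaster_host = 'buildbot-master30.srv.releng.scl3.mozilla.com'
  port = 9001                          # placeholder slave port
  slavename = 'example-slave'          # placeholder name
  passwd = None                        # what the broken tac files effectively had
  keepalive = 600
  usepty = 0

  application = service.Application('buildslave')
  s = BuildSlave(buildmaster_host, port, slavename, passwd, basedir,
                 keepalive, usepty)
  s.setServiceParent(application)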
I've shut down both of these masters. See bug 739032
Attachment #608910 - Flags: review?(armenzg) → review+
Comment on attachment 608910 [details] [diff] [review]
Ganglia and fileserver changes for new buildbot masters

https://hg.mozilla.org/build/puppet-manifests/rev/b182eed16427
Attachment #608910 - Flags: checked-in+
Re-assigning to jhford to get the rev5 builders running side-by-side with the existing builders for a little while.
Assignee: coop → jhford
Status: ASSIGNED → NEW
Priority: P2 → P3
(In reply to Chris AtLee [:catlee] from comment #12)
> I've shut down both of these masters. See bug 739032

I disabled them in the JSON, too:
Catlee fixed the master-side issues in bug 739032.  I've been running the r5 machines on scl1 masters, limited to the build-system branch.

I'd like to have the r5 slaves pointing to scl3 masters today.  Does anyone have objections to this plan?

I have a test build running at http://buildbot-master30.srv.releng.scl3.mozilla.com:8001/builders/OS%20X%2010.7%20build-system%20build/builds/0 to make sure that the keys are still working.
(In reply to John Ford [:jhford] from comment #16)
> I'd like to have the r5 slaves pointing to scl3 masters today.  Does anyone
> have objections to this plan?

This was done by about 10am on April 4th.  These masters are working in production, with a possible issue sending tinderbox email as documented in bug 744462.

Bug 744462 covers the only remaining work item, so I think it's time to close this bug.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering
Component: Platform Support → Buildduty
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard