Closed Bug 1298437 Opened 8 years ago Closed 7 years ago

create a pool of 10 machines for OSX tests to be run as tier-2 in taskcluster

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jmaher, Assigned: jmaher)

Attachments

(2 files)

Our plan is to run a subset of tests as tier-2 for osx on trunk.  This would allow us to get by with a smaller pool of machines.  When we feel all worker, build, and deployment issues are resolved, we will make a larger pool and do a quick transition of all debug tests to tier-1.
Specifically this is to use a pool of machines to run the in-progress taskcluster-worker for os x tests.
Summary: create a pool of 20 machines for OSX tests to be run as tier-2 → create a pool of 20 machines for OSX tests to be run as tier-2 in taskcluster
Is this supposed to be subtracting from the existing pool of 400 10.10 machines? If so, I'm not sure what this looks like for implementation (hostname, OS version, what gets installed, etc). Could someone provide some guidance?
It would subtract from the existing pool. Right now we have 1 machine already out as a 'loaner'; should we just add 19 more loaners?  These should be identical to the existing osx 10.10 machines except we won't need buildbot running on them.

possibly :wcosta knows more about what is needed?
Okay, in that case, I'll move this over to the buildduty queue, since they handle loaners.
Assignee: relops → nobody
Component: RelOps → Buildduty
Product: Infrastructure & Operations → Release Engineering
QA Contact: arich → bugspam.Callek
We can help with this one, but would like to know what exactly is needed.

@wcosta Any ideas? :-)
Flags: needinfo?(wcosta)
(In reply to Alin Selagea [:aselagea][:buildduty] from comment #5)
> We can help with this one, but would like to know what exactly is needed.
> 
> @wcosta Any ideas? :-)

As taskcluster-worker is still WIP, I would like to keep one loaner to myself for development. Note that things are not ready yet; it will take a couple of weeks before we have a version of taskcluster-worker ready for tier-2.
Flags: needinfo?(wcosta)
There are two issues with the loaner machines:

1) We need a public IP for live logging.
2) We need to redirect syslog to papertrail.

Should I file bugs for these?
Flags: needinfo?(arich)
The machines already send logs to papertrail, so it's either a matter of changing the config to send them to a different account, or you just look for the logs in the releng papertrail account.

For security purposes, I can't imagine that we will ever allow machines in the datacenter (especially running desktop OSes) to have a public IP, though. At best you're going to need to connect to the VPN and auth with MFA.
Flags: needinfo?(arich)
That's good to know about those requirements. I was not aware that we would not be able to open up a port to those machines, but it makes sense.  We might need to get creative with how to allow live logging from those instances.

We originally had some kind of azure logging backend that we wrote to that could be made accessible without touching the machine; maybe that's what we do here too.
Depends on: 1305571
Depends on: 1305707
Depends on: 1305982
Depends on: 1283859
Hrm, we have this Treeherder bug that prevents it from parsing taskcluster-worker logs. Not sure if this is a blocker to deploying OS X Tier 2 or not.
Flags: needinfo?(garndt)
No longer depends on: 1283859
See Also: → 1283859
I talked to garndt in irc, I think this is ready to go!
Some of the reasoning behind this being ok is that the breakage is outside of TaskCluster logging and is affecting other systems.  If a job was already live, the tier level (or being hidden) would not be touched (I hope) if log parsing was an issue with another system.
Flags: needinfo?(garndt)
Okay, I'd have two questions here:
    - should we create a separate pool for these 20 yosemite machines which are going to be used for tests? Or simply disable them in slavealloc and add a corresponding note?
    - are there any other requirements for these machines/loaners?
Flags: needinfo?(garndt)
I'm not sure how the pools of hardware have been managed in the past, but these 20 machines are still considered a trial of sorts.  If it proves that we're on the right track the next steps would be to work with someone familiar with how to provision these machines to provision them in a more permanent fashion and slowly move machines from the buildbot pool to a taskcluster pool.

I hope I didn't confuse it more.
Flags: needinfo?(garndt)
I believe the intention here would be to get these as loaners and we would do the manual work to get the machines set up for taskcluster.  The medium term goal is to support both buildbot and taskcluster based images/machines, and long term would be only taskcluster.

I do not think we are ready yet for these to be in a special pool. Possibly, if we feel we are ready after getting the loaners, we could create a pool with the 20 machines to start with and grow it as we make it more formalized.
All right then! I disabled t-yosemite-r7 machines in range [0040 - 0069] and added a note in slavealloc for each of them (excluding t-yosemite-r7-0050 which is already loaned to :wcosta). It will take some time for these machines to finish their current jobs, and they will then reboot.
I didn't do any sort of cleaning for now.

Let me know if anything else is needed.
I'll assign this to Joel while the work here is in progress.
Assignee: nobody → jmaher
:aselagea, will those machines in 0040-0069 have ssh/vnc credentials to access them?  I assume they will be accessible by :wcosta (and maybe me, :jmaher) ?
Flags: needinfo?(aselagea)
(In reply to Joel Maher ( :jmaher) from comment #18)
> :aselagea, will those machines in 0040-0069 have ssh/vnc credentials to
> access them?  I assume they will be accessible by :wcosta (and maybe me,
> :jmaher) ?

The range actually is [0040-0059], sorry for mistyping that.
Noticed that :wcosta has been added to the releng and vpn_releng LDAP groups in bug 1309408, so he should have access to those machines via ssh (using his ssh keypair). VNC access is not set up at the moment, though.
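
For reference, the corrected range can be expanded into individual hostnames with a short sketch like the one below. It assumes the hosts follow the t-yosemite-r7-NNNN pattern with a zero-padded four-digit suffix, as in the names mentioned in this bug; the function name `yosemite_hosts` is made up for illustration and is not part of any releng tooling.

```python
def yosemite_hosts(first, last, exclude=()):
    """Return t-yosemite-r7-NNNN hostnames for first..last inclusive.

    Numbers listed in `exclude` are skipped. The naming pattern is an
    assumption based on the hostnames quoted in this bug.
    """
    skipped = set(exclude)
    return [
        f"t-yosemite-r7-{n:04d}"
        for n in range(first, last + 1)
        if n not in skipped
    ]


# Range [0040-0059], leaving out 0050, which was already loaned to :wcosta.
hosts = yosemite_hosts(40, 59, exclude=[50])
print(len(hosts))
print(hosts[0], hosts[-1])
```

The range covers 20 hostnames; leaving out the existing loaner gives the 19 machines that were newly disabled.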

Per IRC: 

<aselagea|buildduty> jmaher: hello!
16:15:44 <aselagea|buildduty> jmaher: I was wondering if you also need access to the yosemite machines disabled in bug 1298437 :)
16:16:25 <jmaher> aselagea|buildduty: technically I don't, but it would be nice if others like myself could help out wcosta as needed

So I'd have two questions here:
1. @wcosta: is VNC access also needed? If yes, we can go on and ensure that access, then e-mail the password for that
2. @coop: any suggestions on how we could ensure access for Joel to those machines?
Flags: needinfo?(wcosta)
Flags: needinfo?(coop)
Flags: needinfo?(aselagea)
(In reply to Alin Selagea [:aselagea][:buildduty] from comment #19)

[snip]

> 1. @wcosta: is VNC access also needed? If yes, we can go on and ensure that
> access, then e-mail the password for that

Yes, please :)
Flags: needinfo?(wcosta)
(In reply to Alin Selagea [:aselagea][:buildduty] from comment #19)
> 2. @coop: any suggestions on how we could ensure access for Joel to those
> machines?

Joel's key should be in authorized_keys for cltbld if he has access to those machines.
Flags: needinfo?(coop)
Trying to summarize the current situation and see what's needed here:
    - Joel does not have access to those machines (checked authorized_keys for both root and cltbld users)
    - per #c3: "These should be identical to the existing osx 10.10 machines except we won't need buildbot running on them" - so I haven't done any cleaning here ==> puppet will continue to run
    - setting up a VNC password does not help much, as it will also require an SSH password to connect (see attachment). The SSH passwords will be reset when running puppet.
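
The authorized_keys check in the first point can be sketched roughly as follows. This is only an illustration, assuming the usual one-key-per-line layout of an authorized_keys file ("[options] <type> <base64-blob> [comment]"); the function names and the placeholder key blobs are hypothetical, and the file contents are passed in as a string rather than read from a real host.

```python
def key_blob(public_key_line):
    """Return the base64 blob from a '<type> <blob> [comment]' public key line."""
    fields = public_key_line.split()
    if len(fields) < 2:
        raise ValueError("not a public key line")
    return fields[1]


def key_is_authorized(authorized_keys_text, public_key_line):
    """True if the key's blob appears as a field on any non-comment line."""
    blob = key_blob(public_key_line)
    for line in authorized_keys_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if blob in line.split():
            return True
    return False


# Placeholder content; these are not real keys.
sample = """\
# keys for cltbld
ssh-rsa PLACEHOLDERBLOBONE someone@example
ssh-ed25519 PLACEHOLDERBLOBTWO other@example
"""

print(key_is_authorized(sample, "ssh-rsa PLACEHOLDERBLOBONE someone@example"))
print(key_is_authorized(sample, "ssh-rsa PLACEHOLDERBLOBTHREE joel@example"))
```

Comparing only the blob field means a key still matches if its trailing comment differs between the user's copy and the file.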
Hi Alin,

For the 20 machines, can my public key also be added to an admin user's authorized_keys?

Also, can you confirm the hostnames?

I'll raise a separate bug about getting VPN access granted to me, once I can confirm the hostnames / subnet they reside on.

Many thanks!
Pete
buildduty will grant you the VPN access and you can get the existing password from wcosta. Keys don't get added to loaners, but you can do that yourself.
Depends on: 1338557
(In reply to Amy Rich [:arr] [:arich] from comment #25)
> buildduty will grant you the VPN access and you can get the existing
> password from wcosta. Keys don't get added to loaners, but you can do that
> yourself.

Ah, dang, read this too late! Thanks Amy.
Depends on: 1338560
I'm reclaiming t-yosemite-r7-005[0-9] to help with the current test backlog.
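
As a rough illustration of what this leaves behind: assuming the loaned set is the [0040-0059] range given earlier in this bug, the pattern t-yosemite-r7-005[0-9] can be applied with Python's `fnmatch` module to split the hosts into reclaimed and remaining. The variable names are made up for the example.

```python
from fnmatch import fnmatchcase

# Assumption: the 20 loaned hosts are t-yosemite-r7-0040 .. t-yosemite-r7-0059.
loaned = [f"t-yosemite-r7-{n:04d}" for n in range(40, 60)]

# Shell-style pattern from the comment above; [0-9] matches one digit.
pattern = "t-yosemite-r7-005[0-9]"

reclaimed = [h for h in loaned if fnmatchcase(h, pattern)]
remaining = [h for h in loaned if not fnmatchcase(h, pattern)]

print(len(reclaimed), reclaimed[0], reclaimed[-1])
print(len(remaining), remaining[0], remaining[-1])
```

Under that assumption, ten hosts go back to the test pool and ten (0040 through 0049) stay out of it.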
Comment on attachment 8836173 [details] [diff] [review]
[puppet] Reclaim 10 yosemite machines

I just grabbed a quick review here from mtabara so I can start reimaging the machines.
Attachment #8836173 - Flags: review?(kmoir) → review+
(In reply to Chris Cooper [:coop] from comment #30)
> Comment on attachment 8836173 [details] [diff] [review]
> [puppet] Reclaim 10 yosemite machines
> 
> I just grabbed a quick review here from mtabara so I can start reimaging the
> machines.

These hosts are starting to pick up test jobs now.
Summary: create a pool of 20 machines for OSX tests to be run as tier-2 in taskcluster → create a pool of 10 machines for OSX tests to be run as tier-2 in taskcluster
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard