Closed Bug 1248755 Opened 5 years ago Closed 4 years ago

Review and create a security plan for accessing the Admin API

Categories

(Release Engineering Graveyard :: Applications: Balrog (backend), defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mostlygeek, Assigned: mostlygeek)

Attachments

(1 file, 1 obsolete file)

The Admin API has higher security requirements than the public update API. In this bug we should determine: 

- who and what needs to access the Admin API 
- what sort of access does each party require
- what is the best way to grant access (ACL, user/pass, etc)
Taking this to come up with a plan next week.
Flags: needinfo?(jvehent)
Attached image AUS Architecture.png (obsolete)
Attached is an architecture proposal. The goal is to reuse as much of the existing infrastructure as possible while applying cloud services' security standards [1] to the admin panel.

There are two types of administrator that need access to the balrog admin hosts:
* Release engineers and managers (users)
* BUILD instances in EC2 (machines)

Users connect to the admin via mozvpn, which requires LDAP and MFA. Their connections are routed to cloud services' AWS via the NAT instance in SCL3. The security group in front of the admin hosts only accepts connections from the public IP of the NAT instance in SCL3, so no connection is permitted unless it is routed through that network. Upon connecting to the admin panel, users must authenticate against Okta to enter the admin.

Machines follow a similar path, but from releng's AWS. They first connect back to SCL3 using the existing IPSec tunnel, then are routed to the NAT instance in SCL3, and finally to the admin panel. The authentication method is yet to be determined, but something like HAWK would be appropriate.

This architecture requires:
1. Balrog admin hosts must have an elastic IP that mozvpn routes to, and a security group that only allows connections from the NAT in SCL3. We can use ELBs in front of the admin hosts.
2. VPN and routing in SCL3 need to be configured by IT (systems & netops).
3. Balrog admins need to integrate with Okta (SAML protocol), for users, and Hawk, for machines.

I have some other security controls to discuss, but they go beyond the scope of the migration, so I'll create a separate bug for them.

Let me know what you think of this plan.

[1] https://mana.mozilla.org/wiki/display/SVCOPS/Services+Security+Principles
Flags: needinfo?(jvehent)
Attachment #8724774 - Flags: review?(bwong)
Attachment #8724774 - Flags: review?(bhearsum)
Comment on attachment 8724774 [details]
AUS Architecture.png

Using the MozVPN to restrict access to both users and machines is good. We already have prior art for this with the tiles:splice implementation. It seems no longer possible to accomplish this via security groups so the VPN is the preferred approach.
Attachment #8724774 - Flags: review?(bwong)
Comment on attachment 8724774 [details]
AUS Architecture.png

This looks generally OK, a couple of comments/questions in line:

(In reply to Julien Vehent [:ulfr] from comment #2)
> Created attachment 8724774 [details]
> AUS Architecture.png
> 
> Attached is an architecture proposal. The goal is to reuse as much of the
> existing infrastructure as possible while applying cloud services' security
> standards [1] to the admin panel.
> 
> There are two types of administrator that need access to the balrog admin
> hosts:
> * Release engineers and managers (users)
> * BUILD instances in EC2 (machines)

There are still some Windows build machines in SCL3 for now. I assume this won't be an issue?

> Users connect to the admin via mozvpn, which requires LDAP and MFA. Their
> connections are routed to cloud services' AWS via the NAT instance in SCL3.
> The security group in front of the admin hosts only accepts connections from
> the public IP of the NAT instance in SCL3, so no connection is permitted
> unless it is routed through that network. Upon connecting to the admin
> panel, users must authenticate against Okta to enter the admin.
> 
> Machines follow a similar path, but from releng's AWS. They first connect
> back to SCL3 using the existing IPSec tunnel, then are routed to the NAT
> instance in SCL3, and finally the admin panel. The authentication is yet to
> be determined, but something like HAWK would be appropriate.
> 
> This architecture requires:
> 1. Balrog admin hosts must have an elastic IP that mozvpn routes to, and a
> security group that only allows connections from the NAT in SCL3. We can use
> ELBs in front of the admin hosts.
> 2. VPN and routing in SCL3 need to be configured by IT (systems & netops).
> 3. Balrog admins need to integrate with Okta (SAML protocol), for users, and
> Hawk, for machines.

I'd really like to avoid changing the way machines authenticate as part of this project. I'm more than happy to look at it very soon after the migration is complete, but I'm worried about blocking the migration on too many changes (especially something like this, that is wholly incompatible with the existing system). Can we stick with http auth for the machines to start with? I think we talked about this last week...
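
For reference, a minimal sketch of what a machine client sends, assuming "http auth" here means HTTP Basic authentication (the thread does not spell this out). The helper name is hypothetical.

```python
import base64


def basic_auth_header(username: str, password: str) -> dict:
    """Build an HTTP Basic Authorization header (assumed auth scheme)."""
    # Basic auth is base64("username:password"); it is an encoding, not encryption,
    # so it must only ever be sent over HTTPS.
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return {"Authorization": f"Basic {token}"}
```

Because the credentials are only encoded, this scheme leans on the network restrictions and TLS discussed in this bug for its protection.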

> 
> I have some other security controls to discuss, but they go beyond the scope
> of the migration, so I'll create a separate bug for them.
> 
> Let me know what you think of this plan.
> 
> [1] https://mana.mozilla.org/wiki/display/SVCOPS/Services+Security+Principles
> There are still some Windows build machines in SCL3 for now. I assume this won't be an issue?

We just need to make sure those machines are routed to cloudservices-aws through the same NAT as the VPN and the EC2 instances. It shouldn't be a problem.

> I'd really like to avoid changing the way machines authenticate as part of this project.
> I'm more than happy to look at it very soon after the migration is complete, but I'm
> worried about blocking the migration on too many changes (especially something like this,
> that is wholly incompatible with the existing system). Can we stick with http auth for
> the machines to start with? I think we talked about this last week...

I would agree. It does mean we need to reconfigure LDAP on balrog admin to point to the public LDAP endpoint, using client certs for auth (there is no direct connection to LDAP from cloudservices-aws). That should still be less work than moving to token-based auth, but it's not work-free.
(In reply to Julien Vehent [:ulfr] from comment #5)
> > I'd really like to avoid changing the way machines authenticate as part of this project.
> > I'm more than happy to look at it very soon after the migration is complete, but I'm
> > worried about blocking the migration on too many changes (especially something like this,
> > that is wholly incompatible with the existing system). Can we stick with http auth for
> > the machines to start with? I think we talked about this last week...
> 
> I would agree. It does mean we need to reconfigure LDAP on balrog admin to
> point to the public LDAP endpoint, using client certs for auth (there is no
> direct connection to LDAP from cloudservices-aws). That should still be less
> work than moving to token-based auth, but it's not work-free.

Sounds fine to me. If there's anything that needs to happen in the Admin WSGI app for this, just let me know.
Attachment #8724774 - Flags: review?(bhearsum) → review+
cc'ing Nick, just to make sure he saw this.
> Sounds fine to me. If there's anything that needs to happen in the Admin WSGI app for this, just let me know.

Benson: do you want to use stunnel for this or make the change in the balrog admin directly? Either way, I can take care of generating the client certs; just needinfo me when they are needed.
The preference is for the app to accept client certs and use those to connect to LDAP. This is what mozidp does and it is a very reliable approach.
(In reply to Benson Wong [:mostlygeek] from comment #9)
> The preference is for the app to accept client certs and use those to
> connect to LDAP. This is what mozidp does and it is a very reliable approach.

This is for humans connecting via a browser, or machines, or both?
> This is for humans connecting via a browser, or machines, or both?

For machines
A suggestion from today's sync up meeting:

- can we use VPC peering from releng's AWS account to the CloudOps AWS account?
 - use a NAT instance with a static IP
 - the NAT instance is whitelisted
- avoids a round trip through SCL3
Another option to look at: 

- releng has a NAT instance/proxy server for talking to balrog-admin
- servers that need to talk to balrog-admin have a specific routing rule
- we whitelist that instance (has an elastic ip) to talk to balrog-admin
Had a discussion with :ulfr and :dustin and decided to:

 - use existing ipsec VPN tunnel to SCL3
 - whitelist all of SCL3's outgoing NAT IP addresses to talk to balrog-admin

This is the most pragmatic option, since it requires no network routing changes. The ipsec tunnel is already the default gateway and there are no restrictions on outgoing HTTPS connections.

The VPC peering route wouldn't work since the build machines are spread across multiple regions.

A NAT instance to talk to balrog-admin directly from releng's AWS would require reconfiguration, plus an owner to configure and monitor it.
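
The whitelisting decision above amounts to a source-address check in front of balrog-admin. A minimal sketch of that check, assuming the rule is expressed as a list of CIDR blocks; the addresses below are placeholders from the RFC 5737 documentation ranges, not SCL3's real NAT IPs, and the names are hypothetical.

```python
import ipaddress

# Placeholder egress addresses (documentation ranges), NOT the real SCL3 NAT IPs.
ALLOWED_SOURCES = [
    ipaddress.ip_network("192.0.2.10/32"),    # a single NAT address
    ipaddress.ip_network("198.51.100.0/28"),  # a small NAT pool
]


def is_allowed(source_ip: str) -> bool:
    """Return True if the connection's source address falls in an allowed block."""
    addr = ipaddress.ip_address(source_ip)
    return any(addr in network for network in ALLOWED_SOURCES)
```

In practice this rule would live in the security group in front of the admin hosts rather than in application code; the sketch only shows the matching logic.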
Flags: needinfo?(bhearsum)
Flags: needinfo?(bhearsum)
Depends on: 1253367
Attached image AUS Architecture.png
Updated infrastructure diagram.
Attachment #8724774 - Attachment is obsolete: true
The plan is defined; reassigning to :mostlygeek for implementation (or closing this bug if implementation is tracked elsewhere).
Assignee: jvehent → bwong
Resolving bug. Admin is protected behind the VPN and LDAP auth.
Access to it has been verified for the appropriate people and boxes that need to talk to it.
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Product: Release Engineering → Release Engineering Graveyard