Bug 560588 (Closed) · Opened 14 years ago · Closed 14 years ago

Private API for registration actions that bind with LDAP root DN (create / delete user)

Categories

(Cloud Services Graveyard :: Server: Sync, defect)

x86, macOS
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: lorchard, Assigned: lorchard)

Details

Attachments

(2 files)

Using an async queue to create and delete users could help in two areas:

1) Performance, in case user creation is too slow to keep up with requests.

2) Security, preventing the use of the root DN or other over-powered DN on webheads to create / delete users.

Beyond creation / deletion of users, a user DN with the proper permissions could bind with the LDAP server to perform self-modifications (eg. email, password, etc) without affecting other users.
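
A minimal sketch of that self-service bind, assuming PHP's ldap extension and an invented DN layout ($userPassword comes from the authenticated request):

<?php
// Bind as the user's own DN; the root DN is never involved, so this
// code path can only touch the one record.
$conn = ldap_connect('ldap://ldap.internal');
ldap_set_option($conn, LDAP_OPT_PROTOCOL_VERSION, 3);

$userDn = 'uid=jdoe,ou=users,dc=mozilla';  // hypothetical DN layout
if (ldap_bind($conn, $userDn, $userPassword)) {
    // Self-modification, eg. updating the user's own email attribute.
    ldap_modify($conn, $userDn, array('mail' => array('jdoe@example.com')));
}
ldap_unbind($conn);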
Based on initial load tests, #1 is not expected to be a problem.  Still need to get the specific capacity numbers, though.

#2 can be a concern, though is it enough of a concern to warrant adding a job queue to the infrastructure mix?  

For what it's worth, all of our web sites with authentication based on MySQL tables have the same potential issue - that is, the credentials to affect all users are present on webheads.
Some notes on what's in progress, slow going because my LDAP is rusty:

* Revise LDAP driver for auth to use downgraded privileges as appropriate; eg. all user manipulation done by binding as that user, simple lookups (ie. cluster node location) done as anonymous.  May need some LDAP ACL tweakery.

* Gearman for the work queue (http://gearman.org/). Job server can live on a secured machine with a single port opened to webheads.  Workers that bind as root DN can live on same machine as job server, or any machine protected from webheads.  A break-in on one of the webheads would allow access to the job server to create / delete users, but no modification or read of existing user records.
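
To make that concrete, here's a rough sketch of what a root-DN worker could look like (job name, DN layout, and payload fields are all invented for illustration; assumes the pecl/gearman and ldap extensions):

<?php
$worker = new GearmanWorker();
$worker->addServer('127.0.0.1', 4730);  // gearmand on the same protected box

// Only this process ever sees the root DN credentials.
$worker->addFunction('weave_create_user', function (GearmanJob $job) {
    $p = json_decode($job->workload(), true);

    $ldap = ldap_connect('ldap://ldap.internal');
    ldap_set_option($ldap, LDAP_OPT_PROTOCOL_VERSION, 3);
    ldap_bind($ldap, 'cn=admin,dc=mozilla', getenv('ROOT_DN_PASSWORD'));

    $dn = sprintf('uid=%s,ou=users,dc=mozilla', $p['username']);
    ldap_add($ldap, $dn, array(
        'objectClass'  => array('inetOrgPerson'),
        'uid'          => $p['username'],
        'cn'           => $p['username'],
        'sn'           => $p['username'],
        'userPassword' => $p['password_hash'],
    ));
    ldap_unbind($ldap);
    return 'ok';
});

while ($worker->work());  // block, handling one job at a time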
Flags: blocking-weave1.3?
Flags: blocking-weave1.3? → blocking-weave1.3+
Had a quick call with clyon and mcoates about the security of this thing.  Notes from the meeting:

* The Gearman job server is a persistent daemon that will live on a machine with limited access from webheads, ie. just port 4730.

* The connection from webheads to the job server should be secure to deter packet sniffing of jobs (which can contain user credentials) from elsewhere on the network.  Since Gearman doesn't support SSL connections, can we establish secure port tunnels as part of our infrastructure (eg. using stunnel)?

* A job worker is a persistent daemon that connects to the job server, and can live on the same protected machine as the job server.  Webheads will have no access to job workers, which speak only to the job server.

* Only job workers will have access to LDAP root DN credentials.

* Account deletion jobs will require the user's password, which job workers will validate before executing account deletion.  This should limit the arbitrary deletion of accounts via a compromised webhead, but requires a secure connection between webheads and job server to protect the credentials.

* Account creation jobs from a compromised webhead seem no worse than creation requests at the HTTP API level.

* All jobs will be logged per https://intranet.mozilla.org/Security/Users_and_Logs#Logging_Recommendations

Let me know if there's anything I missed!
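
Regarding the stunnel point above, a minimal client-side config on each webhead might look like this (hostnames and the tunnel port are placeholders):

; /etc/stunnel/gearman.conf on a webhead
client = yes

[gearman]
accept  = 127.0.0.1:4730           ; webhead code connects to localhost
connect = jobserver.internal:14730 ; stunnel endpoint on the protected box

The protected box would run the matching server stanza (client = no, a cert, accept on 14730, connect to the local gearmand on 4730).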
Status: NEW → ASSIGNED
> * The connection from webheads to the job server should be secure to deter
> packet sniffing of jobs (which can contain user credentials) from elsewhere on
> the network.  Since Gearman doesn't support SSL connections, can we establish
> secure port tunnels as part of our infrastructure (eg. using stunnel)?

Is that really necessary?  That'd imply that we need to safeguard against physical access to the switch.

If you can root the web server I'd bet you could insert yourself before the stunnel.  Is there enough value in doing crypto across a trusted network (in a locked cabinet with security cameras)?
(In reply to comment #5) 
> Is that really necessary?  That'd imply that we need to safeguard against
> physical access to the switch.
> 
> If you can root the web server I'd bet you could insert yourself before the
> stunnel.  Is there enough value in doing crypto across a trusted network (in a
> locked cabinet with security cameras)?

I think the concern was if someone got access to another machine (webhead or not) on that network and might then sniff all traffic between all webheads and the job server.  Is that even possible with a switch between machines?  You might want to arm-wrestle with clyon and mcoates about that.
Not possible or not easily possible on a switched network.  Easy if you have physical access to the switch of course.
(In reply to comment #5)
> Is that really necessary?  That'd imply that we need to safeguard against
> physical access to the switch.
> 
The issue would be that any host on this network could see the password; access to the switch isn't necessary. If they are on the network, they can see it.

> If you can root the web server I'd bet you could insert yourself before the
> stunnel.  Is there enough value in doing crypto across a trusted network (in a
> locked cabinet with security cameras)?

It isn't a matter of whether you can root the web server; it's a matter of rooting anything with access to that network stream.
Switches are more point-to-point than broadcast mediums, so you'd really have to get the switch to forward unicast packets out your rooted host's switch port before you could really sniff anything.
Okay, looks like I've got a working patch that uses the registered user's credentials for self-modifying actions and passes off creation/deletion to a Gearman worker:

http://hg.mozilla.org/users/lorchard_mozilla.com/weaveserver-registration-patches/file/2fc85d618697/ldap-queued

It turned out simpler than I expected, so I expect there to be something horribly wrong with it.

One thing I noticed (maybe a question for telliott) is that cluster node assignment didn't happen at account creation, instead happening at the first attempt to get a node location.  

Since the HTTP request to get the cluster node location is unauthenticated, and I needed credentials to assign a node, I moved that into account creation.  Does anyone know if that will break anything?
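
For reference, the webhead side of the hand-off amounts to something like this (function and payload names are hypothetical, not necessarily what the patch uses; $username and $password come from the registration request):

<?php
$client = new GearmanClient();
$client->addServer('127.0.0.1', 4730);  // local end of the tunnel to the job server

// The worker re-validates the password before deleting, per the earlier notes.
$payload = json_encode(array('username' => $username, 'password' => $password));
$client->doBackground('weave_delete_user', $payload);  // fire-and-forget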
(In reply to comment #10)

> One thing I noticed (maybe a question for telliott) is that cluster node
> assignment didn't happen at account creation, instead happening at the first
> attempt to get a node location.  
> 
> Since the HTTP request to get the cluster node location is unauthenticated, and
> I needed credentials to assign a node, I moved that into account creation. 
> Does anyone know if that will break anything?

That could be an issue, yes. We use the node assignment to throttle the maximum number of users who can register for a node at any particular time (since they're the ones uploading a ton of data the first time). Returning "no node" is a valid response that tells the client to try again in a while.


I'm a little concerned with the KISS violation here. That's not a comment on Les' code, which looks like a good implementation from my first glance, but more a concern that we're creating complexity to solve problems we may not have, or that may be easier to solve another way.

As best I can tell, this implementation is being driven by two issues:

(1) Concern over the ability of the LDAP server to handle the load under heavy usage. This concern seems to have been mitigated by Aravind's testing and tweaking, and his identification of a couple inefficiencies that were likely causing bottlenecks.

(2) Concern over use of the LDAP master password for the account operations and the dangers of a compromise here.

My question is - assuming that (1) is no longer an issue, is this the best approach to solving (2)? It may be, but I see a lot of moving parts in this flow, and I feel like I want to be reassured of this before we roll all the pieces into place.
(In reply to comment #10)
> One thing I noticed (maybe a question for telliott) is that cluster node
> assignment didn't happen at account creation, instead happening at the first
> attempt to get a node location.  

To add to Toby's reply here, this is something we really need/want from the client/user experience side (and we explicitly asked for this around nine months ago).  This separation means that, even if available storage nodes are melting/over capacity, we're still okay, and the client will understand what to do then.  If we assign a node, we'll just hammer the storage nodes until they force backoff, and that's not a great user experience.
(In reply to comment #12)
> (In reply to comment #10)
> > One thing I noticed (maybe a question for telliott) is that cluster node
> > assignment didn't happen at account creation, instead happening at the first
> > attempt to get a node location.  
> 
> To add to Toby's reply here, this is something we really need/want from the
> client/user experience side (and we explicitly asked for this around nine
> months ago).  This separation means that, even if available storage nodes are
> melting/over capacity, we're still okay, and the client will understand what to
> do then.  If we assign a node, we'll just hammer the storage nodes until they
> force backoff, and that's not a great user experience.

Modifying an attribute for a user (ie. node location) requires some credentials - either the user's own or the root DN.  But, the request to fetch the node location is unauthenticated, so I'm left with the root DN.

So, what I can probably do is hit Gearman with a synchronous job to assign a node location.  I'll poke at that.
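
Something like this, presumably (job name invented; doNormal() is the synchronous submit, called do() in older pecl/gearman releases):

<?php
$client = new GearmanClient();
$client->addServer('127.0.0.1', 4730);

// Block until the worker (which holds the root DN) has assigned a node.
$node = $client->doNormal('weave_assign_node',
                          json_encode(array('username' => $username)));
// An empty return could map to the "no node, try again later" response.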
(In reply to comment #11)
> (In reply to comment #10)
>
> My question is - assuming that (1) is no longer an issue, is this the best
> approach to solving (2)? It may be, but I see a lot of moving parts in this
> flow, and I feel like I want to be reassured of this before we roll all the
> pieces into place.

As far as I understand, #1 doesn't sound like a huge issue at this point.

As for #2, this does seem like a lot of moving parts.

But, I can't think of an alternative to using the root DN to create / delete accounts.  LDAP ACLs don't seem to allow granting record creation and deletion without also granting modification and read rights over those same records.

So, if the over-powered root DN needs protecting from a potential webhead compromise, then isolating it on a protected box that exposes a limited API is the best I can think of.

Gearman seems a pretty simple way to do that, and it could come in handy in the future.  There are other ways, but all of them seem roughly equivalent or worse in complexity (eg. a private internal HTTP service).

What it boils down to is:

* How likely is a webhead compromise versus the effort to maintain this infrastructure, and is it a good bargain?  (A security vs IT question, I think.)

* Can anyone think of a better approach?  I'm fine with tossing this out and trying something else.
This appears to save us from one scenario: where an attacker wants to modify an item in the account without the user knowing. As it currently stands, there's no particular use for this - there's no data worth modifying.

If they want to create, delete, or gain control of accounts, it's still trivially easy. With root on the box, they have access to the password reset key db, after which it's a quick hack to gain control of the account and do whatever you want with it. It means you've changed the password (since you don't know the original, you can't switch back) and people may complain, but until then, it's undetectable.

Getting the passwords into an encrypted key store would save us from most stuff, but if the box gets rooted, we're pretty doomed regardless of our approach.
(In reply to comment #15)
> If they want to create, delete, or gain control of accounts, it's still
> trivially easy. With root on the box, they have access to the password reset
> key db after which it's a quick hack to gain control of the account and do
> whatever you wanted with it. It means you've changed the password (since you
> don't know the original, you can't switch back) and people may complain, but
> until then, it's undetectable.
> 
> Getting the passwords into an encrypted key store would save us from most
> stuff, but if the box gets rooted, we're pretty doomed regardless of our
> approach.

We batted this around a bit on IRC, and it sounds like this could be improved by: 

1) putting password reset codes and expiration times into LDAP, readable only by root DN and the user's own credentials. (requires an LDAP schema change)

2) pulling password reset code generation, emailing, and verification into Gearman.
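
For 1), a slapd.conf-style ACL sketch (attribute names are invented, and the root DN bypasses ACLs in OpenLDAP anyway, so only the self-read grant needs spelling out):

access to attrs=weaveResetCode,weaveResetExpires
    by self read
    by * none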

But... this is starting to feel like we're backing into the reinvention of a self-service LDAP wheel someone's got lying around somewhere.  It's been years since I played this much with LDAP, so I feel like I'm missing something.
(In reply to comment #16)

> But... this is starting to feel like we're backing into the reinvention of a
> self-service LDAP wheel someone's got lying around somewhere.  It's been years
> since I played this much with LDAP, so I feel like I'm missing something.

At this point, why wouldn't we just proxy all commands that require master ldap access and support that subset? It's slightly (slightly) less complicated, and probably almost as secure.

The asynchronous approach was for performance reasons, and nobody seems too worried about that.
Not trying to pile on here, but this does seem like a *lot* of changes to a pretty key part of the service, one that is likely to be exercised a lot given we are turning on marketing to get more users. 

If the key issue here is protecting the credentials used to create/delete accounts, do we really need async + gearman for that?
Attaching my proposed architecture diagram. Some of you have seen this already. It keeps things relatively simple while still allowing us to lock away the master credentials.
Summary: Async queue for registration actions that bind with LDAP root DN (create / delete user) → Private API for registration actions that bind with LDAP root DN (create / delete user)
Okay, so I've got an initial stab at a new HTTP-based API for a private admin server.  Haven't requested a labs HG repo yet, so I've just checked my progress so far into my own repos:

http://hg.mozilla.org/users/lorchard_mozilla.com/weaveserver-registration-admin/

http://hg.mozilla.org/users/lorchard_mozilla.com/weaveserver-registration-admin/file/fe21e4fe9288/1.0/index.php

http://hg.mozilla.org/users/lorchard_mozilla.com/weaveserver-registration-patches/file/b4f025756e60/admin-server


This may all be moot pending a review of new LDAP ACLs, but I wanted to get this stuff out there.
Target Milestone: --- → 1.3
Here's a patch that ties the reg API to the private admin account API from:

http://hg.mozilla.org/labs/weaveserver-registration-secure/

It should work without the API (eg. using the mysql auth driver), if a base URL for it isn't configured.
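
The fallback amounts to something like this (config key, endpoint path, and surrounding variables are hypothetical, standing in for the patch's driver code):

<?php
if (!empty($config['admin_api_base_url'])) {
    // Proxy the account operation to the private admin API over HTTP.
    $ch = curl_init($config['admin_api_base_url'] . '/1.0/' . urlencode($username));
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($params));
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $ok = curl_exec($ch) !== false;
    curl_close($ch);
} else {
    // No base URL configured: fall back to the local auth driver
    // (eg. the mysql driver), exactly as before.
    $ok = $auth->create_user($params);
}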
Attachment #442431 - Flags: review?
Comment on attachment 442431 [details] [diff] [review]
Integration of reg component with private API

Looks good. Let's get this onto stage and test it.
Attachment #442431 - Flags: review? → review+
Going to say this is done, since it's pushed to hg.  Fire off more bugs if problems are found.
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Product: Cloud Services → Cloud Services Graveyard