Closed Bug 1335954 Opened 8 years ago Closed 8 years ago

Get taskcluster-pulse a namespace on the pulse rabbitmq server

Categories

(Taskcluster :: General, defect)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bstack, Assigned: dustin)

References

Details

Attachments

(1 file)

48 bytes, text/x-github-pull-request
We will need a namespace to manage independently of pulseguardian/etc. I assume it should be something like 'taskcluster'. I'm not really sure what changes will need to occur for this to happen, but step one is probably to talk to mcote.
Assignee: nobody → dustin
It's just an admin user... well, actually it's probably just giving the admin credentials for pulse to taskcluster-pulse. We have those passwords in lastpass, but it might be rude/unwise to sidestep mcote :)
Greg: can you fix "You do not have permission to view addon resources on taskcluster-pulse. You need to have the deploy or operate permission on this app." for my account on the taskcluster-pulse app?

Mark! I've forgotten a lot of Pulse. Before I bother you with requesting an administrative account for this service, can you remind me if there's a way to use my regular pulse credentials to see what permissions they have? I want to remind myself how things work now, so that I can ask for the right stuff.

I think, in general, that the managed queues and exchanges should have the prefix `taskcluster/`. Currently the usernames tc-pulse uses are arbitrary user-supplied strings (from a limited charset, up to 64 characters), but aren't specifically prefixed. We're only using the `/` vhost. A user U has write/config access to `taskcluster/(exchanges|queues)/$U/.*` and read access to `taskcluster/exchanges/.*`. I suspect that read access will need to be broader, since users will want to listen to other pulse exchanges. Should the read access be `.*` or `(taskcluster/exchanges|exchange)/.*` or something else?
Flags: needinfo?(mcote)
Flags: needinfo?(garndt)
It's fixed, sir.
Flags: needinfo?(garndt)
OK, I got my paws on some admin credentials to pulse (I have ways...) and here's what the perms are for a "normal" user:

    vhost: /
    config: ^(queue/anarute/.*|exchange/anarute/.*)
    write: ^(queue/anarute/.*|exchange/anarute/.*)
    read: ^(queue/anarute/.*|exchange/.*)

so basically, queues are named `queue/<username>/<arbitrary>` and exchanges are named `exchange/<username>/<arbitrary>`. I assume that usernames are the arbiter of uniqueness -- that is, I can't "take over" one of anarute's exchanges because I can't create a new user named anarute, because it already exists.

We need to ensure that TC-pulse doesn't violate that. It's designed to continually reset passwords for <username>-1 and <username>-2, but it would be pretty bad for it to reset the password for an existing, important user like "buildbot" or "treeherder". Equally bad would be allowing registration of namespace "treeherder", creating users "treeherder-1" and "treeherder-2" without noticing that "treeherder" already exists, then proceeding to publish messages under "exchanges/treeherder/something". So I'll need to verify that (and maybe change the username-generation to suit).

I think we should share namespaces with pulse. Currently anyone publishing using taskcluster-pulse must publish to an exchange starting with "taskcluster/", to which pulseguardian users do not have access. That would mean that a tc-pulse "namespace" N would create:
 - two users
 - with config access to `^(queue|exchange)/N/.*`
 - with write access to `^(queue|exchange)/N/.*`
 - with read access to `^(queue|exchange)/.*`

I see two options for the usernames:

1. "N" and "N-2". This sort-of keeps the existing pulseguardian invariant that user U has access only to `(queue|exchange)/U/.*`, but adding this funny "U-2" which has access to U's queues and exchanges. I *think* that's safe, but I'd want to write out the proof and get Mark to check my math.

2. Prefix all tc-generated names with "tc1:" and "tc2:" (thus limiting namespaces to 60 characters). If we add a bit to pulseguardian to prohibit such names, then we have a nice partition of usernames into tc-pulse and non-tc-pulse.

I *think* we also need to make sure that taskcluster-pulse is monitoring its users' queues, while pulseguardian is monitoring its users' queues, without the two stepping on each other. So maybe option #2 makes the most sense, with pulseguardian also adjusted to not monitor queues beginning with /tc[12]:/.

OK, Mark, I think I understand the situation.. what are your thoughts?
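For the "don't clobber an existing user" concern above, the verification could be as simple as asking the RabbitMQ management HTTP API whether the bare name is already taken before creating the pair of users. A minimal sketch, Python for illustration only; the management URL, credentials, and helper name are made up:

    import requests

    MGMT_URL = "https://pulse.mozilla.org/api"   # hypothetical management-API endpoint
    ADMIN_AUTH = ("admin", "secret")             # placeholder admin credentials

    def namespace_is_free(namespace):
        # GET /api/users/<name> returns 404 when no such user exists, so we only
        # go ahead with <namespace>-1 / <namespace>-2 (option 1) or
        # tc1:<namespace> / tc2:<namespace> (option 2) when the bare name is free.
        r = requests.get("%s/users/%s" % (MGMT_URL, namespace), auth=ADMIN_AUTH)
        if r.status_code == 404:
            return True
        r.raise_for_status()   # surface real errors (401/500/...) rather than hiding them
        return False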
I've read this a couple times, and I think I'm lacking some context. :) My first question is "would it make sense to use a different vhost for TaskCluster"? The original model for Pulse was pretty simple: a place for automation to post events so that people could build apps that are loosely coupled to our systems. TaskCluster is using Pulse as a core component, much more than a simple pub/sub system, and I don't know if PulseGuardian, as the enforcer of general Pulse usage, provides much useful functionality to TaskCluster. Maybe it makes sense to better (logically) separate "standard" Pulse usage and TaskCluster's usage?
Flags: needinfo?(mcote)
I think quite the opposite, actually: this service is intended as a more programmatically-accessible means of accessing and participating in pulse. Most of TC's use of Pulse is intended to support loose coupling -- especially task status messages and index routes. We *want* those messages on the mozilla-wide thing called "pulse" because we want people throughout mozilla to be able to react to those events. We have, in places, used pulse for tight coupling (for example, the tc-auth service recalculates its data structures when notified of a change to the DB tables via a pulse message), but we really should be using another queueing system (or at least a different vhost) for those purposes.

In other words, we want tc-pulse users to be able to create queues subscribed to exchanges created by pulseguardian users, and we want pulseguardian users to be able to create queues subscribed to exchanges created by tc-pulse users. We want "one big pulse".

I think the question comes down to partitioning the existing pulse "username" space so that pulseguardian can guard most of it, and tc-pulse can handle the remainder. That's a little tricky because of our habit of assigning two users per namespace, but the plan in comment 4, sol'n #2 might be sufficient.
Flags: needinfo?(mcote)
Okay that's interesting, thanks for the perspective. Let me restate your basic requirement so I understand it.

Currently, PulseGuardian users can create queues, to which only they have any (read/write/configure) access. This is to preserve the integrity of the queues, so that other users can't consume or insert messages. PulseGuardian users can also create exchanges, for which only they have write/configure access. All PulseGuardian users have read access to any other user's exchanges. This ensures that only the user in question can post messages on their exchange, but anyone can hook a queue up to that exchange to subscribe.[1]

Your modification to the above is that you want sets of two users to share one user's queues and exchanges. That is, to use your terminology from option 1, both user and user-2 would be able to write to user's exchanges and would be able to consume messages from user's queues. Is that correct?

Assuming it is, it sounds like option 1 is to encode this in PulseGuardian, so that any user can have a "shadow user" (to make up a term) that can access the original user's resources. E.g. person creates myuser; they can then create myuser-2 which has the same permissions (or extra permissions) as myuser (or maybe this user is created automatically). Option 2 would be for PulseGuardian to just prevent the creation of users with a special prefix, e.g. "tc[12]:", and let a TaskCluster-specific management app handle those.

Is that all accurate? If so, yeah, I think #2 makes the most sense, just because this is a very special situation that probably doesn't make sense to generalize.

[1] Technically this isn't quite true, since any permission settings apply to both queues and exchanges; that is, someone could create a queue called "exchange/<user>/publicqueue" that others could read from, but if they follow naming conventions in their queues and exchanges, they're safe.
Flags: needinfo?(mcote)
In chatting with mcote we figure:
1) make pulseguardian ignore any queue/exchange created by a user with the prefix tc- (matching /tc-.*/)
2) forbid pulseguardian from creating users matching /tc-.*/
3) forbid pulseguardian from creating a user called 'taskcluster'
4) give taskcluster-pulse admin access
5) modify taskcluster-pulse to follow the following namespace pattern:

For a given <namespace>, create:
    users:
        tc-<namespace>-1
        tc-<namespace>-2
    permissions:
        READ-WRITE-CONFIGURE: /(exchange|queue)/taskcluster/<namespace>/.*
        READ: /exchange/.*

Note: taskcluster-pulse is then responsible for monitoring queues matching /(exchange|queue)/taskcluster/<namespace>/.*, alerting and deleting them if necessary.
(In reply to Jonas Finnemann Jensen (:jonasfj) from comment #8)
> In chatting with mcote we figure:

Continuing that chat, I disagree slightly. :) I don't think we need a special namespace for queues and exchanges; rather, I think we should restrict the exception to just a user namespace. If we reserve a certain user namespace for your app, that by definition restricts the namespaces of queues and exchanges, which always include the username, by convention. This also allows more exceptions to be added down the road for other apps.

So to modify the above, I propose (changing the username prefix from "tc-" to "taskcluster-" as suggested by jonasfj):
1) make pulseguardian ignore any queue/exchange matching ^(queue|exchange)/taskcluster-.*/.*
2) forbid pulseguardian from creating users matching /taskcluster-.*/
3) give taskcluster-pulse admin access
4) modify taskcluster-pulse to follow the following namespace pattern:

For a given <namespace>, create:
    users:
        taskcluster-<namespace>-1
        taskcluster-<namespace>-2
    permissions:
        READ-WRITE-CONFIGURE: /(exchange|queue)/taskcluster-<namespace>/.*
        READ: /exchange/.*

Note: taskcluster-pulse is then responsible for monitoring queues matching /(exchange|queue)/taskcluster-.*/.*, alerting and deleting them if necessary.
This way we can delete our existing pulseguardian users when they've been moved over. And we don't have to rename exchanges, since we're already using the name pattern: taskcluster-queue|auth|...
Regarding the convention of "taskcluster-" instead of "tc-", you're right: the exchanges in https://docs.taskcluster.net/reference/platform/queue/references/events for example have the prefix "exchanges/taskcluster-queue", so we need to provide a way for the queue service to get credentials that can write to exchanges with that prefix. The downside is that, since usernames are limited to 64 characters, we're left with 53-character namespaces. Still, I can live with this, and I can attempt to make the required PulseGuardian changes.
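To make the pattern proposed in the previous two comments concrete, here's a rough sketch of what tc-pulse would compute for a given <namespace>. This is illustrative only: the real service is Node, the helper name is invented, and the regex anchoring follows the existing pulseguardian style rather than anything agreed above.

    def pulse_resources(namespace):
        # Two rotating users per namespace, each with read/write/configure on
        # the namespace's own queues and exchanges.
        users = ["taskcluster-%s-1" % namespace, "taskcluster-%s-2" % namespace]
        own = r"^(queue|exchange)/taskcluster-%s/.*" % namespace
        return {
            "users": users,
            "configure": own,
            "write": own,
            # The proposal lists only exchange/.* for READ, but consuming from
            # the namespace's own queues needs read access too, matching the
            # normal pulseguardian permissions shown earlier in this bug.
            "read": r"^(queue/taskcluster-%s/.*|exchange/.*)" % namespace,
        }

    # e.g. pulse_resources("index") -> users taskcluster-index-1 / taskcluster-index-2,
    # configure/write ^(queue|exchange)/taskcluster-index/.*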
So, things to do:
 * get tests running for tc-pulse (I think this requires running rabbitmq locally, maybe in a docker container)
 * limit namespace length
 * modify namespace pattern to that given above
 * ...
From talking to Jonas, I'm going to shift things around so that the service has a configurable pattern of namespaces it will allow (above and beyond those limited by scopes). We'll use /^taskcluster-/, but another install could use something different (for example, we can make a staging instance with /^tc-pulse-staging-/). Then the API needs to refuse to create namespaces that do not match that pattern, and the monitor needs to only consider queues and exchanges matching that pattern. I'll try to make it general, so someone could configure this for a fairly arbitrary rabbitmq install.
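A sketch of what that configurable check might look like, Python for illustration; the config names, charset, and suffix handling are assumptions:

    import re

    # Hypothetical config: the production deployment would use "^taskcluster-",
    # a staging instance might use "^tc-pulse-staging-".
    ALLOWED_NAMESPACE_PATTERN = re.compile(r"^taskcluster-[A-Za-z0-9_-]+$")
    MAX_USERNAME_LENGTH = 64   # pulse/rabbitmq username limit mentioned earlier

    def validate_namespace(namespace):
        # The API refuses namespaces that don't match the configured pattern,
        # and the monitor only considers queues/exchanges matching it.
        if not ALLOWED_NAMESPACE_PATTERN.match(namespace):
            raise ValueError("namespace does not match the configured pattern")
        # usernames are <namespace>-1 / <namespace>-2, so leave room for the suffix
        if len(namespace) + len("-1") > MAX_USERNAME_LENGTH:
            raise ValueError("namespace too long to form a valid username")
        return namespace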
OK, I'm embarking on a rather, uh, drastic overhaul of the tc-pulse app. Per Brian, I'm going to leave the "rabbitmonitor" stuff alone for now. I'm focusing on the API and the rotation and expiration.

Remaining:
 * lots of TODO items in https://github.com/taskcluster/taskcluster-pulse/compare/bug1335954?expand=1
 * add a virtualHost configuration option
 * squint carefully at rotation
   * does it rotate the right password?
   * does it fail the second rotation due to not having an editUser function?
I thought a bit about rotation. Each of the two user/passwords has a lifetime of two rotation intervals -- one "active" (when the user/password is being handed out) and one "standby" (when the other user/password is being handed out). We want the caller to re-claim the namespace well in advance of the end of the second rotation interval, because the password will be changed at that point. But we do not want the caller to re-claim at the same time as the rotation occurs, as that would lead to a bit of a race condition and maybe leave the caller with a password only good for one rotation interval. And "the same time as" is tricky because the rotation occurs on a Heroku scheduled task, so rotations will occur on the next scheduled-task run *after* the rotation time.

I think we can make this work if:
 * the scheduled task runs at least three times per rotation interval
 * the reclaim time given to callers is one-half rotation interval *after* the next rotation time
 * all of these times are large enough that a few minutes' "slop" on the part of the caller won't change the assumptions.

So I would propose:
 * scheduled task runs every 20 minutes
 * rotation interval is one hour

Then on the first claim of a namespace at, say, 10:03,
 - the namespace's rotation time will be set to 11:03,
 - the caller will get user 1
 - the caller will be told to reclaim at 11:33

The scheduled task will rotate the passwords at 11:20, well in advance of the 11:30 reclaim, and set the next rotation time to 12:03. At 11:30, the re-claim call will
 - get user 2
 - be told to reclaim at 12:33

I think we can "enforce" this by having the rotation task show warnings if any namespace has a rotation time more than one third of the rotation interval in the past.
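The timing arithmetic from the example above, as a quick sketch (Python; the function names are invented):

    from datetime import datetime, timedelta

    ROTATION_INTERVAL = timedelta(hours=1)   # proposed rotation interval
    # the Heroku scheduled task runs every 20 minutes, i.e. >= 3x per interval

    def initial_rotation_time(claimed_at):
        # on the first claim, the namespace's rotation time is one interval out
        return claimed_at + ROTATION_INTERVAL

    def reclaim_time(next_rotation):
        # callers are told to reclaim half an interval *after* the stored
        # rotation time, so a reclaim never races the rotation itself
        return next_rotation + ROTATION_INTERVAL / 2

    rotation = initial_rotation_time(datetime(2017, 3, 31, 10, 3))
    print(rotation, reclaim_time(rotation))   # 11:03 and 11:33
    # The 20-minute task rotates at 11:20 and pushes the rotation time to 12:03,
    # so the ~11:30 reclaim gets user 2 and a new reclaim time of 12:33.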
https://github.com/taskcluster/taskcluster-pulse/pull/72 I think that with this, plus getting RabbitMonitor up and running, and fixing PulseGuardian to prohibit usernames starting with "taskcluster-", we will be ready to put this into production. Then we need to write support in pulse-publisher :)
OK, I've landed PR#72. I created a "taskcluster" user in pulse and added its credentials to the Heroku config. Of course, it just has "normal user" privileges, so it won't be causing any mayhem just yet. Hopefully Mark can adjust that permission:
 - tags: ['administrator'] (so it can add/remove users)
 - vhost: /
   read/write/config: ^(queues/taskcluster-garbage-.*|exchanges/taskcluster-garbage-.*)
   (we'll remove the "-garbage" at some point soon)

Remaining to do:
 - pulseguardian patch updating/landing
 - enable namespace expiration and rotation in the scheduler
 - add queue monitoring
 - add support to pulse-publisher, taskcluster-client
 - add UI (https://github.com/taskcluster/taskcluster-tools/pull/182)
Flags: needinfo?(mcote)
I also dropped the azure table, since its schema had changed.
This user doesn't need the standard "^exchange/.*" read permission?
Just so I'm not blocking you, I set the permissions as follows:

    Read: ^(queue/taskcluster-garbage-.*|exchange/.*)
    Write & configure: ^(queue/taskcluster-garbage-.*|exchange/taskcluster-garbage-.*)

Let me know if you'd prefer to use the more restrictive write & configure permissions for read as well.

I also set the "administrator" tag on the user. Unfortunately, for some reason the RabbitMQ management UI and API require a password to be set when tags are updated, so I generated a random one (again in order not to block you; I didn't see any connections as that user so I assumed this wouldn't break anything). You can set a new one via PulseGuardian.
Flags: needinfo?(mcote)
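For the record, setting those permissions and tags through the RabbitMQ management HTTP API looks roughly like this. A sketch only: the management URL and admin credentials are placeholders, and the password value is obviously made up.

    import requests

    MGMT = "https://pulse.mozilla.org/api"   # hypothetical management-API endpoint
    ADMIN = ("admin", "secret")              # placeholder admin credentials

    # Tagging a user requires (re)setting the password at the same time, as noted above.
    requests.put("%s/users/taskcluster" % MGMT, auth=ADMIN, json={
        "password": "some-new-random-password",
        "tags": "administrator",
    }).raise_for_status()

    # Permissions on the / vhost (%2F), matching what was set above.
    requests.put("%s/permissions/%%2F/taskcluster" % MGMT, auth=ADMIN, json={
        "read": "^(queue/taskcluster-garbage-.*|exchange/.*)",
        "write": "^(queue/taskcluster-garbage-.*|exchange/taskcluster-garbage-.*)",
        "configure": "^(queue/taskcluster-garbage-.*|exchange/taskcluster-garbage-.*)",
    }).raise_for_status()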
..and it turns out, when you reset the password, the tags and permissions get overwritten :) I used the admin credentials I mentioned above to reset those.

    $ curl https://pulse.taskcluster.net/v1/namespaces
    {
      "namespaces": [
        {
          "namespace": "taskcluster-garbage-dustin",
          "created": "2017-03-31T14:45:37.951Z",
          "contact": {
            "method": "email",
            "payload": {
              "address": "dustin@mozilla.com"
            }
          }
        }
      ]
    }
OK, I think we're in good shape for now.
 * service is deployed and issuing credentials
 * rotation works, and the rotation scheduled task is enabled
 * a few minor bits and bobs in another PR

Remaining to do:
 - pulseguardian patch updating/landing (https://github.com/mozilla/pulseguardian/pull/25)
 - enable namespace expiration (bug 1352470)
 - add queue monitoring (bug 1352473)
 - add UI (https://github.com/taskcluster/taskcluster-tools/pull/182)
   - worth noting here, there are a *lot* of exchanges and they're not a particularly fascinating list
 - add support to pulse-publisher, taskcluster-client
- add support to pulse-publisher, taskcluster-client (bug 1352474)
Attached file pr
Attachment #8855093 - Flags: review?(cdawson)
Hey Dustin-- Mostly looks great. Just a few small changes requested. Not sure what your preferred "ping" method is, but I figure a n-i works. :)
Flags: needinfo?(dustin)
A reply in github works fine :)
Flags: needinfo?(dustin)
Attachment #8855093 - Flags: review?(cdawson) → review+
Comment on attachment 8855093 [details] [review]
pr

Clearing my review, so it gets out of my Todo list, since the bug is not yet resolved.
Attachment #8855093 - Flags: review+
Blocks: 1352474
@bstack, @jonasfj and I discussed by video. I'll try to summarize here :)

* The use-cases for this service are
  * trusted services getting pulse credentials without hard-coding them
  * tests getting temporary pulse credentials
  * notably, we are not opening the service to "untrusted" users such as via browsers
* The primary purpose of the monitor is to protect the pulse service from crashing, with a secondary purpose of giving warning before starting to kill things. PulseGuardian does an OK job of that, but its warnings aren't very early for heavy-traffic queues and it is sometimes too aggressive in killing queues when it has lots of capacity left. It also fails to kill small, slowly-growing queues that may, together, consume a lot of RAM.
* If we compare this service to the other temporary-credentials endpoints in tc-auth (statsum, sentry, STS, SAS, etc.), the monitor doesn't have a parallel. As such, I'd argue that we should build the API to work mostly like the other endpoints; the monitoring is really a separate concern.
* Test credentials should have a very short expiration time, so that they do not clutter rmq (like, one hour). Production credentials should have a very long expiration time, so they can ride out extended outages if necessary. We can solve the issue where connections don't appear immediately by periodically killing all (tc-pulse-related) connections over a certain age (rabbitmq connection metadata has a `connected_at` field). That age then provides an upper bound for the lifetime of any stolen username/password combination.

This divides the labor nicely:

* API + maintenance scheduled tasks handle Namespaces and rmq users
  * claiming (including creating users)
  * rotating passwords
  * expiring (including deleting users, but not worrying about connections, queues, or exchanges)
* Monitor handles connections, queues, and exchanges -- including those not associated with a namespace. At a minimum:
  * warn for and kill queues gone wild
  * delete queues and exchanges with no associated namespace
  * terminate connections older than N hours (where N is a little larger than the rotation interval, but small enough to encourage users to actually implement re-claiming); see the sketch after this comment

There's plenty of scope to be smarter about monitoring, and we can implement some of that before fully deploying this service.

The claimNamespace endpoint will have a default value for `expires`, and will allow namespaces without contact information. These will behave just like any other namespace, except there's no pre-kill warning.

I'll remove the deleteNamespace endpoint. It is unlikely to see much use, and can be simulated by setting expires to the current time, for example.
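A rough sketch of the "terminate connections older than N hours" piece of the monitor, via the RabbitMQ management HTTP API. Python for illustration; the URL, credentials, and the user-prefix filter are assumptions, not the actual implementation.

    import time
    import requests
    from requests.utils import quote

    MGMT = "https://pulse.mozilla.org/api"   # hypothetical management-API endpoint
    ADMIN = ("admin", "secret")              # placeholder admin credentials
    MAX_AGE_HOURS = 2                        # "N": a bit larger than the rotation interval

    def terminate_stale_connections():
        # connected_at is reported in milliseconds since the epoch, so it bounds
        # how long a stolen username/password stays usable.
        cutoff_ms = (time.time() - MAX_AGE_HOURS * 3600) * 1000
        for conn in requests.get("%s/connections" % MGMT, auth=ADMIN).json():
            if not conn.get("user", "").startswith("taskcluster-"):
                continue   # only touch tc-pulse-managed users
            if conn.get("connected_at", 0) < cutoff_ms:
                requests.delete(
                    "%s/connections/%s" % (MGMT, quote(conn["name"], safe="")),
                    auth=ADMIN,
                    headers={"X-Reason": "stale tc-pulse connection"},
                ).raise_for_status()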
OK, I think that's all done -- all that remains now is the monitoring, which is a separate bug (bug 1352473)
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
