Closed Bug 934593 Opened 11 years ago Closed 11 years ago

Memcached, RabbitMQ for releng

Categories

(Infrastructure & Operations :: IT-Managed Tools, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dustin, Assigned: cliang)

BuildAPI, a releng project, is currently hosted on KVM, and we need to get it off of there (bug 863268).  I'd like to put this on the releng cluster, but to do that we'll need a reasonably reliable cache backend and a message queueing architecture suitable for use with carrot.

My understanding is that the supported tools for those are, respectively, memcached and rabbitmq.  I know we have generic versions of both of those running already.

If releng can use the generic versions, then please set up the necessary flows, accounts, etc.

If releng should have its own dedicated systems, please create them.
No longer blocks: 863268
Blocks: 934594
From IRC:

This is definitely tree-closing, which means dedicated infra.  For rabbit, that's pretty much the only question.

For the caching backend, infra supports both redis and memcached, but both have problems:
 * memcached is in-memory only, so state is easily lost
 * memcached doesn't cluster - data's not shared from node to node, and there's no built-in support for finding a working node
 * memcached has no authentication, which may present security concerns for signing
 * redis still doesn't cluster very well
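To make the clustering point concrete: because memcached nodes share nothing, the client has to pick a node for each key and route around a dead node on its own. A minimal sketch of that client-side logic, assuming hypothetical node names and a caller-supplied liveness check (neither comes from this bug):

```python
import zlib

# Hypothetical node list; real hostnames would come from configuration.
NODES = ["cache1.example.com:11211", "cache2.example.com:11211"]


def pick_node(key, nodes, is_alive=lambda node: True):
    """Return the node responsible for `key`, skipping nodes that are down.

    The key is hashed to a starting index. If that node fails the
    `is_alive` check, we walk to the next one. Anything cached on the
    dead node is simply gone, since memcached does not replicate.
    """
    if not nodes:
        raise ValueError("no memcached nodes configured")
    start = zlib.crc32(key.encode("utf-8")) % len(nodes)
    for offset in range(len(nodes)):
        node = nodes[(start + offset) % len(nodes)]
        if is_alive(node):
            return node
    raise RuntimeError("no memcached node is reachable")
```

Note that a failover like this only keeps the cache available; it does nothing to preserve state, which is the concern raised in the first bullet.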

So the options for the backend are memcached, redis, or a real DB.  I'll go talk to releng about those options.  The rabbitmq half of this bug remains actionable, though.
In my (limited) web architecture experience, if you need decent performance, but not 5000 hits/s, a classic database is good enough. MySQL and Postgres both cache in memory, so you get really close to memcached performance for simple queries.

And it's backed up.
Per a meeting yesterday, we'll use RabbitMQ and Memcached for BuildAPI, which means we need both of those services up and running.

Jake, how practical is that in the near term (like, soon enough for me to move the services before 12/20/2013)?
Flags: needinfo?(nmaul)
Per more meetings, we're going to keep both BuildAPI and signing on Redis, and not use Memcached at all.  So I've changed the topic - we now need reliable, releng-dedicated Redis and RabbitMQ.
Summary: Memcached, RabbitMQ for releng → Redis, RabbitMQ for releng
Blocks: 863268
No longer blocks: 934594
Blocks: 934627
Blocks: 926246
Assignee: server-ops-webops → cliang
Depends on: 945834
I've got a RabbitMQ cluster running across rabbit[12].releng.webapp.scl3.mozilla.com.  The cluster can be addressed via the internal ZLB address releng-rabbitmq-zlb.webapp.scl3.mozilla.com.  It has been set up with a 'buildapi' user that should have full powers over the 'buildapi' exchange.

Things should be set up to the point that folks should be able to test the new RabbitMQ cluster to see if it meets their needs. 
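For anyone testing with a client that accepts AMQP URIs, the connection string might be assembled roughly like this. This is only a sketch: the port (5672, the AMQP default), the default vhost "/", and the placeholder password are assumptions, not values confirmed in this bug.

```python
from urllib.parse import quote


def amqp_url(user, password, host, port=5672, vhost="/"):
    """Build an AMQP URI, percent-encoding each component.

    The default vhost "/" has to be encoded as %2F in the path.
    """
    return "amqp://{}:{}@{}:{}/{}".format(
        quote(user, safe=""),
        quote(password, safe=""),
        host,
        port,
        quote(vhost, safe=""),
    )


# "CHANGE_ME" is a placeholder; the real credential is not in this bug.
url = amqp_url(
    "buildapi",
    "CHANGE_ME",
    "releng-rabbitmq-zlb.webapp.scl3.mozilla.com",
)
```

Clients that take separate hostname/userid/password/virtual-host settings instead of a URI would use the same values unencoded.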

As per a short IRC conversation, I'm holding off on implementing the redis portion of this bug pending discussions in bug 945751.
I really loved memcached all along.  Let's hold off a little, but as of right now the plan is to use memcached for buildapi (bug 934594), and no k/v store at all for signing (bug 945751).

Rabbit is still required, no worries there :)
No longer blocks: 934627
Flags: needinfo?(nmaul)
Blocks: 945927
I swear we won't change the plan again.  Memcached it is.  I set up two systems with SREGs in inventory:

https://inventory.mozilla.org/en-US/systems/show/11180/
https://inventory.mozilla.org/en-US/systems/show/11181/
Summary: Redis, RabbitMQ for releng → Memcached, RabbitMQ for releng
Blocks: 946334
The VMs memcache[12].releng.webapp.scl3.mozilla.com have been set up as memcached servers.  For right now, they are limited to 500MB of cache.  (The base system seems to already be taking up a significant chunk of memory.)  Both should be set up to the point that folks should be able to test code against them.
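A quick way to test against one of these servers is a set/get round trip over memcached's text protocol. The sketch below assumes the default port 11211 and a throwaway key name; both are assumptions rather than details from this bug.

```python
import socket


def build_set(key, value, exptime=60):
    """Build a text-protocol `set` command (flags fixed at 0)."""
    return b"set %s 0 %d %d\r\n%s\r\n" % (
        key.encode("ascii"), exptime, len(value), value)


def build_get(key):
    """Build a text-protocol `get` command for a single key."""
    return b"get %s\r\n" % key.encode("ascii")


def parse_get(reply):
    """Return the value from a single-key `get` reply, or None on a miss."""
    if reply.startswith(b"END"):
        return None
    # Header looks like: VALUE <key> <flags> <bytes>
    header, rest = reply.split(b"\r\n", 1)
    size = int(header.split()[3])
    return rest[:size]


def smoke_test(host, port=11211, timeout=2.0):
    """Store a small value and read it back; True if the round trip works.

    Replies here are tiny, so a single recv() is assumed to be enough.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(build_set("smoke", b"ok"))
        if sock.recv(1024).strip() != b"STORED":
            return False
        sock.sendall(build_get("smoke"))
        return parse_get(sock.recv(1024)) == b"ok"
```

Calling `smoke_test("memcache1.releng.webapp.scl3.mozilla.com")` from a host with the right network flows would exercise the first VM.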

I've set up monitoring for these that is similar to some, but not all, of the other memcached servers.  It looks like it is a check against the current number of connections.
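A connection-count check of this kind can be approximated by reading `curr_connections` from the reply to memcached's `stats` command. The sketch below only parses a reply that has already been fetched; the threshold of 1000 is a made-up value, not the one used by the actual monitoring.

```python
def parse_stats(reply):
    """Parse a `stats` reply (bytes) into a dict of name -> string value."""
    stats = {}
    for line in reply.decode("ascii").splitlines():
        parts = line.split(" ", 2)
        if len(parts) == 3 and parts[0] == "STAT":
            stats[parts[1]] = parts[2]
    return stats


def connections_ok(reply, max_connections=1000):
    """True if curr_connections is at or below the (hypothetical) limit."""
    stats = parse_stats(reply)
    return int(stats["curr_connections"]) <= max_connections
```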
Given that no glaring defects have been found, I'm going to close this bug and tie off this particular thread in the bug tapestry.  =)  If you do find a problem, please open a new bug.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED