Closed Bug 606362 Opened 14 years ago Closed 10 years ago

Add SSL/TLS support to RabbitMQ (pulse.mozilla.org)

Categories: Infrastructure & Operations Graveyard :: WebOps: Other (task)
Status: RESOLVED FIXED
Tracking: Not tracked
People: Reporter: dchanm+bugzilla, Assigned: cliang
Whiteboard: [change - configuration]
Attachments: 1 file

Net::RabbitFoot does not provide TLS support for the AMQP client/server communication. This allows a MITM attack to potentially steal sensitive bug information if "push-publish-restricted-messages" is enabled.

I'm not sure if TLS is compatible with current AMQP client/server implementations nor the potential performance implications. AnyEvent::RabbitMQ::connect() would need to be refactored to pass in the "tls" argument to AnyEvent::Handle::connect() .
Attached patch: enable tls
This is a quick and dirty proof of concept to enable TLS for AnyEvent::RabbitMQ . I changed my perl @INC path to use the modified RabbitMQ module before the system one. 

SSL was enabled on the server by following the instructions on the rabbitmq site
http://www.rabbitmq.com/ssl.html

For my particular setup I had to pass the ca_file argument to tls_ctx and change the peername, since I was using a self-signed certificate. However, this shouldn't be necessary in a production environment.

Ideally the use of tls would be a client specified option.
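
For illustration only, here is a rough sketch of what "TLS as a client-specified option" could look like. It uses Python's standard library rather than the Perl AnyEvent modules the patch touches, so it is an analogue and not the patch itself. The function names and the use_tls, ca_file and peername parameters are hypothetical; ca_file and peername only mirror the two overrides mentioned above for a self-signed setup.

```python
import socket
import ssl
from typing import Optional


def make_tls_context(ca_file: Optional[str] = None) -> ssl.SSLContext:
    """Build a verifying client context.

    With ca_file=None the system trust store is used. Passing a CA bundle
    is only needed for setups such as a self-signed test CA.
    """
    return ssl.create_default_context(cafile=ca_file)


def open_connection(host: str, port: int, use_tls: bool = False,
                    ca_file: Optional[str] = None,
                    peername: Optional[str] = None) -> socket.socket:
    """Open a TCP connection and wrap it in TLS only if the caller asks."""
    sock = socket.create_connection((host, port), timeout=10)
    if not use_tls:
        return sock
    ctx = make_tls_context(ca_file)
    try:
        # peername overrides the name checked against the certificate
        # (hypothetical parameter, useful when the cert name differs).
        return ctx.wrap_socket(sock, server_hostname=peername or host)
    except Exception:
        sock.close()
        raise
```

The default stays plaintext so that existing callers are unaffected; a caller opts in with use_tls=True.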
Attachment #485385 - Attachment is patch: true
Attachment #485385 - Attachment mime type: application/octet-stream → text/plain
Attachment #485385 - Flags: feedback?(clegnitto)
Moving to WebOps as we need to verify that this is configured on our current RabbitMQ cluster properly.

This site outlines the instructions needed to enable SSL support on pulse.mozilla.org
http://www.rabbitmq.com/ssl.html

I assume we need to do this on both servers in the cluster. Also we have the following questions/requirements:

1. Needs to have a proper CA certificate installed and not self-signed.
2. If an SSL connection is made, certificates must be verified; otherwise the connection is allowed using a fallback.
3. Non-SSL ports must also remain available until we make SSL mandatory at a later time.

dkl
Assignee: christian → server-ops-webops
Component: Pulse → WebOps: Other
Product: Webtools → Infrastructure & Operations
QA Contact: nmaul
Summary: Add TLS support to AMQP Backend → Add SSL/TLS support to RabbitMQ (pulse.mozilla.org)
Version: Trunk → other
Cyliang, is this something you would be able to verify for me?

dkl
Flags: needinfo?(cliang)
Are you asking me to verify that the cluster is already set up with SSL enabled or to verify that SSL can be enabled?  (Right now, it doesn't look like SSL is enabled.)

* I would assume that both servers would need to have SSL enabled if clients expecting to send an SSL cert can connect to either node in the cluster.  

* Some investigation shows that WebOps would probably need to do a small extension to the RabbitMQ puppet module to support adding ssl_listeners and ssl_options to the RabbitMQ configuration file.  I don't believe that should be too difficult.

* RE: proper CA (not self-signed) -- I'm assuming that you mean getting the cert signed by a third-party CA (we've been primarily using DigiCert lately).  If not, there is an internal Mozilla CA, so if the clients are controlled by Mozilla, we should be ok by adding the Mozilla CA certificate to the list of trusted CAs.  

The DigiCert certificates will probably need the intermediate CA cert added into the cacert file; the rabbitmq-discuss list shows someone doing something similar so this should work.

* It looks like the documented example has the SSL options you want: "Through the {fail_if_no_peer_cert,false} option, we state that we're prepared to accept clients which don't have a certificate to send us, but through the {verify,verify_peer} option, we state that if the client does send us a certificate, we must be able to establish a chain of trust to it."
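
As a sketch of the policy quoted above (not RabbitMQ or Erlang code), the combination of {verify,verify_peer} and {fail_if_no_peer_cert,false} can be modelled as a small decision function. The function names are hypothetical. The second function shows what I believe is the closest Python standard-library analogue, ssl.CERT_OPTIONAL on a server-side context.

```python
import ssl


def accepts_client(sent_cert: bool, chain_trusted: bool,
                   fail_if_no_peer_cert: bool = False) -> bool:
    """Model the handshake outcome under {verify, verify_peer}.

    - No client certificate: accepted unless fail_if_no_peer_cert is set.
    - Client certificate sent: accepted only if a chain of trust exists.
    """
    if not sent_cert:
        return not fail_if_no_peer_cert
    return chain_trusted


def optional_client_cert_context() -> ssl.SSLContext:
    """Server-side context that requests, but does not require, a client cert.

    A real server would also load its own certificate and the trusted CAs;
    that is omitted here because the paths are deployment-specific.
    """
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.verify_mode = ssl.CERT_OPTIONAL
    return ctx
```

Flipping fail_if_no_peer_cert to True in this model corresponds to making client certificates mandatory later.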
Flags: needinfo?(cliang)
Thank you for taking the time to look at this!

(In reply to C. Liang [:cyliang] from comment #4)
> Are you asking me to verify that the cluster is already set up with SSL
> enabled or to verify that SSL can be enabled?  (Right now, it doesn't look
> like SSL is enabled.)
> 
> * I would assume that both servers would need to have SSL enabled if clients
> expecting to send an SSL cert can connect to either node in the cluster.  

Correct. We will need identical configurations on both nodes as the client can connect to either.

> * Some investigation shows that WebOps would probably need to do a small
> extension to the RabbitMQ puppet module to support adding ssl_listeners and
> ssl_options to the RabbitMQ configuration file.  I don't believe that should
> be too difficult.

Great
 
> * RE: proper CA (not self-signed) -- I'm assuming that you mean getting the
> cert signed by a third-party CA (we've been primarily using DigiCert
> lately).  If not, there is an internal Mozilla CA, so if the clients are
> controlled by Mozilla, we should be ok by adding the Mozilla CA certificate
> to the list of trusted CAs.  
> 
> The DigiCert certificates will probably need the intermediate CA cert added
> into the cacert file; the rabbitmq-discuss list shows someone doing
> something similar so this should work.

Hmm. Currently only Mozilla clients are connecting to our current servers and I am not sure if we will need
to have non-Mozilla properties connect in the future but I may need to verify that answer. If this is true then
using the Mozilla CA should be sufficient.
 
> * It looks like the documented example has the SSL options you want:
> "Through the {fail_if_no_peer_cert,false} option, we state that we're
> prepared to accept clients which don't have a certificate to send us, but
> through the {verify,verify_peer} option, we state that if the client does
> send us a certificate, we must be able to establish a chain of trust to it."

Right. That is what we want for now. Later on we may opt to make the certs mandatory. I wish we could do this on a user basis but I don't think that is possible at the moment.

So for the time being let's see what it would take to 1) enable SSL on all nodes, 2) use the Mozilla CA and 3) make certificates not mandatory.

dkl
(In reply to David Lawrence [:dkl] from comment #5)  
> > * RE: proper CA (not self-signed) -- I'm assuming that you mean getting the
> > cert signed by a third-party CA (we've been primarily using DigiCert
> > lately).  If not, there is an internal Mozilla CA, so if the clients are
> > controlled by Mozilla, we should be ok by adding the Mozilla CA certificate
> > to the list of trusted CAs.  
> > 
> > The DigiCert certificates will probably need the intermediate CA cert added
> > into the cacert file; the rabbitmq-discuss list shows someone doing
> > something similar so this should work.
> 
> Hmm. Currently only Mozilla clients are connecting to our current servers
> and I am not sure if we will need
> to have non-Mozilla properties connect in the future but I may need to
> verify that answer. If this is true then
> using the Mozilla CA should be sufficient.

OK. I talked to someone and we should be able to allow external sources to connect to Pulse, so we may need to use the third-party CA as you mentioned. Hopefully that doesn't add too much complexity to the task.

dkl
Assignee: server-ops-webops → cliang
Whiteboard: [change - configuration]
Thanks. Is the new cert installed on the cluster now, or does that part still need to be done? Also, am I allowed to create client certificates or do they need to be requested from IT on a user basis?

Thanks
dkl
Flags: needinfo?(cliang)
The cert isn't yet installed on the cluster as I've been sidelined this week due to the evacuation of the Labs infrastructure managed by IT.  I would like to install the certificate some time next week, which will require a shutdown of the service.   How much lead time do you need for a shutdown, reconfiguration, and restart of the service?  

If you create client certificates that you want to be recognized by the server, I will need to have the CA (Certificate Authority) certificate for whatever is signing your client certificates so I can add it to the list of trusted CAs.  Otherwise, if we use purchased SSL certs, those will be signed by a CA that rabbitMQ will already trust.
Flags: needinfo?(cliang)
(In reply to C. Liang [:cyliang] from comment #8)
> The cert isn't yet installed on the cluster as I've been sidelined this week
> due to the evacuation of the Labs infrastructure managed by IT.  I would
> like to install the certificate some time next week, which will require a
> shutdown of the service.   How much lead time do you need for a shutdown,
> reconfiguration, and restart of the service?  
> 
> If you create client certificates that you want to be recognized by the
> server, I will need to have the CA (Certificate Authority) certificate for
> whatever is signing your client certificates so I can add it to the list of
> trusted CAs.  Otherwise, if we use purchased SSL certs, those will be signed
> by a CA that rabbitMQ will already trust.

Ah thanks. Are you available over the weekend to do the configuration? Or we could do it around close of business on Friday or any day of your choosing. If I send out a notification today to m.tools.pulse and maybe dev.planning, it should be enough notice. Basically, clients may or may not need to be restarted if the server goes down for a short period of time. And since SSL is on a different port, clients should continue to work after the restart.

Let me know what times work best for you and I can get the warning out today.

dkl
Flags: needinfo?(cliang)
My hesitation in doing anything on Friday or the weekend is that, if the RabbitMQ service fails in some subtle way, it may mean trying to page people to restart services.  I like to avoid interrupting anyone's cocktail / happy hour or risk messages plugging up in the queues, which will send out alerts. =) 

Even without that, this week is starting to look quite tight.  If it's not too soon, I have Thursday afternoon (the 20th) clear (starting at roughly noon PST).  The following week (beginning February 24th), my afternoons are clear on Tuesday, Wednesday, and Thursday (the 25th through the 27th).
Flags: needinfo?(cliang)
(In reply to C. Liang [:cyliang] from comment #10)
> Even without that, this week is starting to look quite tight.  If it's not
> too soon, I have Thursday afternoon (the 20th) clear (starting at roughly
> noon PST).  The following week (beginning February 24th), my afternoons are
> clear on Tuesday, Wednesday, and Thursday (the 25th through the 27th).

Moving forward on this on Tuesday February 25th at 2pm PST (5pm EST). I will send out a notice about this today.

dkl
A shutdown of rabbitMQ will take possible requeues from RelEng -- which is (part of) why this should go through CAB. Also, we do have a tree closing window coming up Feb 22 that this may be better to piggy-back on, depending on the outage duration, and complexity of testing the SSL portions.

I've marked for CAB review - :cyliang if you could update this bug with expected outage duration, that would help.
Flags: needinfo?(cliang)
Flags: cab-review?
(In reply to David Lawrence [:dkl] from comment #6)
> OK. I talked to someone and we should be able to allow external sources to
> connect to Pulse so we may need to use the third-party CA as you mentioned.
> Hopefully that doesn't add too much complexity to the task.

I'm out of the loop here, so I'm sorry if this is a stupid question...

What sort of abuse could we expect from this, and what (if anything) do we need to do to mitigate that?

I'm trying to think of ways to refine that question, but not coming up with anything great. If the outside world can connect directly to RabbitMQ, what sort of value does that represent to a malicious actor?
(In reply to Hal Wine [:hwine] (use needinfo) from comment #12)
> A shutdown of rabbitMQ will take possible requeues from RelEng -- which is
> (part of) why this should go through CAB. Also, we do have a tree closing
> window coming up Feb 22 that this may be better to piggy-back on, depending
> on the outage duration, and complexity of testing the SSL portions.
> 
> I've marked for CAB review - :cyliang if you could update this bug with
> expected outage duration, that would help.

Apologies if this was done too quickly. I did not realize the short downtime would have any negative impact on RelEng, and I do not have an issue with performing the changes during the tree closure if WebOps has the ability to work on it then. On the other hand, I would rather not wait until the next closure in 6(?) weeks. If WebOps is unable to perform the change on the 22nd, we have in the past worked with the Sheriffs to figure out a time that would have the least impact. Once we figure out the new time, I can send an email announcing the change.

(In reply to Jake Maul [:jakem] from comment #13)
> What sort of abuse could we expect from this, and what (if anything) do we
> need to do to mitigate that?
> 
> I'm trying to think of ways to refine that question, but not coming up with
> anything great. If the outside world can connect directly to RabbitMQ, what
> sort of value does that represent to a malicious actor?

As the system is right now, any client can connect and create queues, and since we have it under a single vhost, it is not simple to create proper accounts with proper ACLs configured for each client individually. It has been a goal for some time to get the pulse cluster better organized and give each group its own vhost and proper accounts. Using SSL, we can ensure that their credentials are not sent in the clear. Enabling SSL support in the cluster is one step towards that and is something I have been wanting to do for some time. Then we can work on the rest of the refactoring as time allows.

dkl
Flags: needinfo?(cliang)
Dave, have we checked what impact this change has on pulsetranslator and pulsebuildmonitor? Will both still work with SSL enabled?
Flags: needinfo?(dkl)
1) If things go swimmingly, the downtime would be as short as 10 minutes (push config changes, reboot rabbitMQ, test SSL listener to verify that it works the way it is intended).  If things don't go swimmingly, I'd say something more like 30 minutes (10 minutes to fail, time for troubleshooting / info gathering, and then time to revert).

2) I don't know if this answers whimboo's question or not: based on other conversations with dkl, I was going to be binding the SSL listener on a different port (the docs suggest 5671).
(In reply to Henrik Skupin (:whimboo) from comment #15)
> Dave, have we checked which impact this change has for pulsetranslator and
> pulsebuildmonitor? Will both still work with SSL enabled?

mozillapulse used by pulsebuildmonitor uses port 5672 by default which will still be active after the update. The SSL support will work off of port 5671 so in theory pulsebuildmonitor can be switched to use that port instead later.

dkl
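
To make the port split concrete, here is a minimal sketch of how a client could choose its endpoint. This is not mozillapulse's actual API; PulseEndpoint and pulse_endpoint are hypothetical names, and only the host and the two port numbers come from this bug.

```python
from dataclasses import dataclass

PLAIN_PORT = 5672  # existing non-SSL listener, stays active after the update
SSL_PORT = 5671    # new SSL listener


@dataclass(frozen=True)
class PulseEndpoint:
    host: str
    port: int
    ssl: bool


def pulse_endpoint(use_ssl: bool = False,
                   host: str = "pulse.mozilla.org") -> PulseEndpoint:
    """Return the endpoint for a client; plaintext remains the default."""
    port = SSL_PORT if use_ssl else PLAIN_PORT
    return PulseEndpoint(host=host, port=port, ssl=use_ssl)
```

A client that is not changed keeps getting port 5672; switching later is a one-flag change, e.g. pulse_endpoint(use_ssl=True).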
Flags: needinfo?(dkl)
Next tree closing window is this Saturday (2/22) - is someone available to make this change during that window and will they be attending CAB today (9am PST)?
I'd like to review the conf before it goes live, take a look at ciphersuites & so on. Can you post a conf sample in the bug?
Flags: needinfo?(cliang)
Flags: cab-review? → cab-review+
We are a go for this Saturday (Feb 22nd) at 3:00pm EST to add SSL support.

1) cyliang to provide config to jvehent for review
2) cyliang to also file a dependency bug to update the netflows to allow access to port 5671 for SSL. cyliang can test that it is working locally if this does not get added before Saturday.
3) dkl to send out revised announcement to tools.pulse and dev.planning to let everyone know of the date change.

Thanks!
dkl
The SSL config snippet is going to look very similar to what is at http://www.rabbitmq.com/ssl.html, only using a cert signed by DigiCert rather than a self-created CA.

[
  {rabbit, [
     {ssl_listeners, [5671]},
     {ssl_options, [{cacertfile,"/etc/rabbitmq/certs/cacert.pem"},
                    {certfile,"/etc/rabbitmq/certs/cert.pem"},
                    {keyfile,"/etc/rabbitmq/certs/key.pem"},
                    {verify,verify_peer},
                    {fail_if_no_peer_cert,false}]}
   ]}
].


Looking at https://www.rabbitmq.com/configure.html, I didn't see anything that would allow specification of ciphersuites.
Flags: needinfo?(cliang)
Not to derail work already done, but if this is going through zeus, we could do the SSL termination at that layer. Obviates the need to mess with rabbitmq at all. Potentially there would be no downtime. If it's just fairly normal SSL/TLS connections (like HTTPS), that seems feasible to me. If it's less normal, that might not be useful. Just food for thought.
(In reply to Jake Maul [:jakem] from comment #22)
> Not to derail work already done, but if this is going through zeus, we could
> do the SSL termination at that layer. Obviates the need to mess with
> rabbitmq at all. Potentially there would be no downtime. If it's just fairly
> normal SSL/TLS connections (like HTTPS), that seems feasible to me. If it's
> less normal, that might not be useful. Just food for thought.

I'm not a Zeus expert; will that allow clients still using port 5672 to connect without SSL and clients connecting to 5671 with SSL? Also, how does that protect clients using non-SSL to Zeus from getting their credentials intercepted before Zeus? Apologies if I am missing something.

dkl
Flags: needinfo?(nmaul)
Blocks: 971818
(In reply to C. Liang [:cyliang] from comment #21)
> 
> Looking at https://www.rabbitmq.com/configure.html, I didn't see anything
> that would allow specification of ciphersuites.

The doc is in erlang, not rabbitmq.

[
  {rabbit, [
         {ssl_listeners, [5671]},
         {ssl_options, [{cacertfile,"/etc/rabbitmq/certs/cacert.pem"},
                        {certfile,"/etc/rabbitmq/certs/servercert.pem"},
                        {keyfile,"/etc/rabbitmq/certs/serverkey.pem"},
                        {verify,verify_peer},
                        {fail_if_no_peer_cert,false},
                        {ciphers, [{dhe_rsa,aes_128_cbc,sha},
                                   {dhe_rsa,aes_256_cbc,sha},
                                   {dhe_rsa,'3des_ede_cbc',sha},
                                   {rsa,aes_128_cbc,sha},
                                   {rsa,aes_256_cbc,sha},
                                   {rsa,'3des_ede_cbc',sha}]},
                        {versions, [tlsv1]}
         ]}
  ]}
].

I'd recommend using cipherscan to test the configuration before going live. Ping me if you need help with that.
This did not work. I think the options are to:
   1. Try to find a newer version of Erlang (tricky, given that we're on the latest the Fedora project has)
   2. Use a self-signed CA (which means that clients will need to add the CA cert if they want to verify the SSL cert presented by the RabbitMQ server) or
   3. Try the option Jake mentioned of doing SSL at the Zeus load balancer. I've only ever done this with SSL, so I'm not 100% sure how this will interact with RabbitMQ.


After restarting the RabbitMQ servers, attempts to connect over SSL produced errors similar to the ones listed in http://osdir.com/ml/erlang-questions-programming/2013-04/msg00030.html.  (On the client side, this looked like an SSL handshake failure.) The implication from the mailing list is that this may be fixed in a newer version of Erlang.

I'm assuming that if we control the CA signing, we can ensure that there are only PKCS-standard oids in the cert.
Depends on: 976290
(In reply to Jake Maul [:jakem] from comment #22)
> Not to derail work already done, but if this is going through zeus, we could
> do the SSL termination at that layer. Obviates the need to mess with
> rabbitmq at all. Potentially there would be no downtime. If it's just fairly
> normal SSL/TLS connections (like HTTPS), that seems feasible to me. If it's
> less normal, that might not be useful. Just food for thought.

Given the difficulty of doing this natively, as we found out during the latest outage, this doesn't seem like a bad idea. How difficult would it be to configure Zeus to accept SSL connections on port 5671 and then direct them to the non-SSL port 5672 on the rabbit cluster? Also, would we still need the new CA we purchased originally for this task, and can Zeus use that?

I am not sure if we would ever need to issue actual client certs to the consumers/publishers but if we did, would we be able to do that with Zeus? 

Thanks!
dkl
Apologies for the long delay in responding... hopefully my answers here are still relevant.

(In reply to David Lawrence [:dkl] from comment #23)
> (In reply to Jake Maul [:jakem] from comment #22)
> > Not to derail work already done, but if this is going through zeus, we could
> > do the SSL termination at that layer. Obviates the need to mess with
> > rabbitmq at all. Potentially there would be no downtime. If it's just fairly
> > normal SSL/TLS connections (like HTTPS), that seems feasible to me. If it's
> > less normal, that might not be useful. Just food for thought.
> 
> Not a Zeus expert, will that allow clients still using port 5672 to connect
> without SSL and clients connecting to 5671 with SSL?

Yes... we can set up one port to do SSL decrypting in Zeus and another that doesn't (plaintext). Each port is treated as an entirely separate vserver... they'd only happen to listen on the same IP in this case. :)

That said, if we *don't* want that, we can always disable the non-SSL port altogether.

And Zeus can translate between ports on the external and internal sides. So end users might connect to port 12093 (random) and Zeus could forward that internally to 5672.

> Also how does that
> protect against clients using non-SSL to Zeus from getting their credentials
> intercepted before Zeus? Apologies if I am missing something.

It doesn't. If the client doesn't use SSL, they're not protected. This is true regardless of whether Zeus does the SSL decryption or the RabbitMQ nodes do it. The only thing we can feasibly do to prevent this is to make non-SSL connections fail, by simply not hosting a non-SSL service.
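
A rough model of the layout described in this comment, in Python rather than actual Zeus configuration: each external port is its own listener that either decrypts SSL or passes plaintext through, and forwards to an internal port. The Listener class, the LISTENERS table and both functions are hypothetical; the 5671 to 5672 mapping is the one proposed earlier in this bug.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Listener:
    terminate_tls: bool  # True: the load balancer decrypts before forwarding
    backend_port: int    # port on the RabbitMQ nodes


# Assumed layout: both listeners forward to the plaintext AMQP port.
LISTENERS: Dict[int, Listener] = {
    5671: Listener(terminate_tls=True, backend_port=5672),
    5672: Listener(terminate_tls=False, backend_port=5672),
}


def route(external_port: int,
          listeners: Dict[int, Listener] = LISTENERS) -> Optional[Listener]:
    """Return the listener for a port, or None if nothing is hosted there."""
    return listeners.get(external_port)


def without_plaintext(listeners: Dict[int, Listener]) -> Dict[int, Listener]:
    """Drop every non-SSL listener, i.e. stop hosting a plaintext service."""
    return {port: l for port, l in listeners.items() if l.terminate_tls}
```

In this model, route(5672, without_plaintext(LISTENERS)) returns None, which is the "make non-SSL connections fail" case.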
Flags: needinfo?(nmaul)
(In reply to Jake Maul [:jakem] from comment #27)

Thanks. We have decided to go the route of Zeus and Cyliang has assured me she can perform the needed changes to make it work. We can change this bug to reflect the goal and close it when it is done.

dkl
Depends on: 1005629
The current configuration is now:

   pulse.mozilla.org:5671 - rabbitmq+ssl (MUST use SSL)
   pulse.mozilla.org:5672 - rabbitmq only

Please test and let me know if this works as intended.
Blocks: 1005629
No longer depends on: 1005629
Flags: needinfo?(dkl)
ulfr: I forgot to ask -- did you still want to run a ciphersuite scan against pulse.mozilla.org:5671 given that we're currently doing SSL termination on the load balancer?
Flags: needinfo?(jvehent)
It will only show the ZLB ciphersuite, of course.

    $ ./cipherscan pulse.mozilla.org:5671
    ......
    prio  ciphersuite         protocols            pfs_keysize
    1     DHE-RSA-AES128-SHA  SSLv3,TLSv1,TLSv1.1  DH,1024bits
    2     DHE-RSA-AES256-SHA  SSLv3,TLSv1,TLSv1.1  DH,1024bits
    3     AES128-SHA          SSLv3,TLSv1,TLSv1.1
    4     AES256-SHA          SSLv3,TLSv1,TLSv1.1
    5     RC4-SHA             SSLv3,TLSv1,TLSv1.1
    Certificate: trusted, 2048 bit, sha256WithRSAEncryption signature

Once we upgrade the ZLB to 9.6, we can set a separate SSL policy for this site, and remove that nasty RC4 ;)
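
If it helps with re-checking after the policy change, here is a small sketch that parses a cipherscan-style table and flags RC4 suites. The column layout is assumed from the output pasted above, and parse_rows and flag_rc4 are hypothetical helper names, not part of cipherscan.

```python
from typing import Dict, List

SAMPLE = """\
prio  ciphersuite         protocols            pfs_keysize
1     DHE-RSA-AES128-SHA  SSLv3,TLSv1,TLSv1.1  DH,1024bits
2     DHE-RSA-AES256-SHA  SSLv3,TLSv1,TLSv1.1  DH,1024bits
3     AES128-SHA          SSLv3,TLSv1,TLSv1.1
4     AES256-SHA          SSLv3,TLSv1,TLSv1.1
5     RC4-SHA             SSLv3,TLSv1,TLSv1.1
"""


def parse_rows(report: str) -> List[Dict]:
    """Parse table rows; the header and non-table lines are skipped."""
    rows = []
    for line in report.splitlines():
        fields = line.split()
        # Data rows start with a numeric priority and have at least 3 columns.
        if len(fields) < 3 or not fields[0].isdigit():
            continue
        rows.append({
            "prio": int(fields[0]),
            "ciphersuite": fields[1],
            "protocols": fields[2].split(","),
            "pfs": fields[3] if len(fields) > 3 else None,
        })
    return rows


def flag_rc4(rows: List[Dict]) -> List[str]:
    """Return the names of ciphersuites that use RC4."""
    return [r["ciphersuite"] for r in rows if "RC4" in r["ciphersuite"]]


if __name__ == "__main__":
    print(flag_rc4(parse_rows(SAMPLE)))
```

Once RC4 has been removed from the accepted list, the same check on fresh output should return an empty list.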
Flags: needinfo?(jvehent)
Verified that both publishers and consumers work fine with ssl=True and port=5671.  Thanks!
Flags: needinfo?(dkl)
ulfr: 

I *think* I've removed RC4 from the list of accepted ciphers on both pulse.mozilla.org and pulse-dev.allizom.org.  

If you could please 1) confirm this and 2) let me know if I need to make any other SSL-related changes,  I'd appreciate it. =)
Flags: needinfo?(jvehent)
Confirmed. And Awesome, thanks for taking care of it.

I don't see a point in getting any fancier with a 2048-bit DH key and OCSP stapling; client libraries wouldn't support these anyway. So we can call it done.


$ ./cipherscan pulse.mozilla.org:5671
.......
prio  ciphersuite           protocols            pfs_keysize
1     DHE-RSA-AES128-SHA    SSLv3,TLSv1,TLSv1.1  DH,1024bits
2     DHE-RSA-AES256-SHA    SSLv3,TLSv1,TLSv1.1  DH,1024bits
3     EDH-RSA-DES-CBC3-SHA  SSLv3,TLSv1,TLSv1.1  DH,1024bits
4     AES128-SHA            SSLv3,TLSv1,TLSv1.1
5     AES256-SHA            SSLv3,TLSv1,TLSv1.1
6     DES-CBC3-SHA          SSLv3,TLSv1,TLSv1.1

Certificate: trusted, 2048 bit, sha256WithRSAEncryption signature

$ ./cipherscan pulse-dev.allizom.org:5671
.......
prio  ciphersuite           protocols            pfs_keysize
1     DHE-RSA-AES128-SHA    SSLv3,TLSv1,TLSv1.1  DH,1024bits
2     DHE-RSA-AES256-SHA    SSLv3,TLSv1,TLSv1.1  DH,1024bits
3     EDH-RSA-DES-CBC3-SHA  SSLv3,TLSv1,TLSv1.1  DH,1024bits
4     AES128-SHA            SSLv3,TLSv1,TLSv1.1
5     AES256-SHA            SSLv3,TLSv1,TLSv1.1
6     DES-CBC3-SHA          SSLv3,TLSv1,TLSv1.1

Certificate: trusted, 2048 bit, sha1WithRSAEncryption signature
Flags: needinfo?(jvehent)
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Change Request: --- → approved
Flags: cab-review+
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard