Closed Bug 916243 Opened 11 years ago Closed 11 years ago

Categories

(Infrastructure & Operations Graveyard :: WebOps: Product Delivery, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: aphadke, Assigned: cturra)

Details

Modify Apache (or equivalent) to redirect/forward all requests coming 
from:
https://data.mozilla.com/submit/telemetry/*  
to:
http://ec2-54-245-195-188.us-west-2.compute.amazonaws.com:8080/submit/telemetry/*

Please ensure that POST body is relayed as-is.
i don't like the idea of specifying the ec2 instance directly to send these requests to - is this not behind an elb or similar?
(In reply to Chris Turra [:cturra] from comment #1)
> i don't like the idea of specifying the ec2 instance directly to send these
> requests to - is this not behind an elb or similar?

The service will be behind an elb instance.  Also, we do not want to make any changes to the production environment until we cut production over to the new backend on October 1.

We will need to test to make sure that forwarding between DCs will work at the required scale [1].

Please see bug 911309 and bug 915808 for related work.

The destination hostname will be "incoming.telemetry.mozilla.org" per the above bugs.

[1]: Current production telemetry traffic to data.mozilla.com averages approximately 240 requests per second at about 14 KB (gzip-compressed) per request.
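
For a rough sense of the scale in footnote [1], the arithmetic can be sketched as below. This assumes 1 KB = 1000 bytes and a constant average rate; the function name is illustrative only and does not come from any telemetry tooling.

```python
def telemetry_bandwidth(requests_per_second, bytes_per_request):
    """Return (bytes/s, megabits/s, gigabytes/day) for a steady request rate."""
    bytes_per_second = requests_per_second * bytes_per_request
    megabits_per_second = bytes_per_second * 8 / 1_000_000
    gigabytes_per_day = bytes_per_second * 86_400 / 1_000_000_000
    return bytes_per_second, megabits_per_second, gigabytes_per_day


# Figures from footnote [1]: ~240 requests/s at ~14 KB each
# (assumption: 1 KB = 1000 bytes).
bps, mbps, gb_day = telemetry_bandwidth(240, 14_000)
print(f"{bps} B/s, {mbps:.2f} Mbit/s, {gb_day:.1f} GB/day")
```

Under those assumptions the forward would carry roughly 27 Mbit/s of compressed payload between data centers, before HTTP and TLS overhead.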
Summary: Redirect https://data.mozilla.com/submit/telemetry/* TO http://ec2-54-245-195-188.us-west-2.compute.amazonaws.com:8080/submit/telemetry/* → Redirect https://data.mozilla.com/submit/telemetry/* TO https://incoming.telemetry.mozilla.org/submit/telemetry/*
Per discussion with :cturra, there's no Apache instance in the current setup that we can use to implement these redirects.  The current load balancer at data.mozilla.com is a Zeus load balancer, so the next step is to check and see if Zeus can do what we want.
A few things on my mind with this...

1) We cannot redirect (301/302) a POST... the browser will do a GET on the result, and the data will be lost. So this would have to be something where our Apache (or Zeus) servers silently, internally forward this data back out to Amazon. That would probably *work*, but I'd consider it a hack... I'm not particularly happy with this as a long-term solution, and would want to see a plan for something better.

2) While this is still pointing internally, this doesn't matter, but post-Oct-1... why not https? This would be forwarded over the big bad Internet. If the data was sensitive enough to merit https protection on the way to us, it should be protected on the way to Amazon too. We could probably rig it up to go through the VPC, but it'd take longer and probably be more fragile in some respects.

3) If I understand the situation properly, this will catch some older versions' FHR data (before it was configured to send to fhr.data.mozilla.com), as well as all current telemetry data. How is that FHR data going to come back to SCL3?

4) I'm pretty sure Zeus can do this (I don't think we've ever tried something like this at that level), though as pointed out there's no real benefit to doing this before Oct 1, except as a way to make sure that we can. The TrafficScript rule in Zeus is likely to simply choose a different backend pool... but, in this case, the backend pool is the same, because we're not actually changing anything yet. So I'm not sure how useful this test will be in practice.

Thanks!
I believe older FHR clients would be submitting to data.mozilla.com/submit/metrics.

HTTP/1.1 clients are allowed to resubmit a POST when they receive an HTTP 307 response (as opposed to 301/302). I'm not sure if Firefox supports this, however. The concept of a permanent variation of a 307 is proposed as 308, but I don't believe it has been formally accepted, nor is it widely implemented. However, 307 supports Cache-Control, so you can make it behave like a 301/permanent redirect.
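
The difference between the redirect status codes discussed above can be summarised in a small sketch. It models behaviour commonly implemented by HTTP clients (301/302 turn a POST into a GET, 303 switches to GET, 307/308 preserve method and body). It is not taken from Firefox's source, and the function name is made up for illustration.

```python
def method_after_redirect(status, method):
    """Return the method a typical client uses when following a redirect.

    Models common client behaviour, not any specific browser.
    """
    method = method.upper()
    if status in (307, 308):
        # Method and body are preserved.
        return method
    if status == 303:
        # "See Other": follow with GET (HEAD stays HEAD).
        return "HEAD" if method == "HEAD" else "GET"
    if status in (301, 302):
        # Most clients historically rewrite POST to GET and drop the body.
        return "GET" if method == "POST" else method
    raise ValueError(f"{status} is not a redirect status handled here")


print(method_after_redirect(302, "POST"))
print(method_after_redirect(307, "POST"))
```

This is why a 301/302 would lose the telemetry payload, while a 307 would only work if every submitting client honours it.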
1) agree. We'd want a longer-term solution.
2) no reason to not do https. Atm we don't use https forwarding because we have an ssh tunnel.
3) older fhr is only going to go to AWS if we switch dns over to point at AWS. If we don't do that and do an http-level forward (which i think is the consensus), only telemetry data will end up on AWS (because fhr and telemetry use different urls).
4) We should set up a similar-looking pathname (to the telemetry submission url) to redirect to aws. We can then load-test it to make sure it works.
i have an update from some hacking/testing this morning. i have set up a test pool in zeus that points to a single ec2 instance :mreid provided. the following is the zeus (riverbed) TrafficScript rule i have written and the results of my testing.

  if ( http.getHeader("cturra") == "true" ) {
    if (http.getHeader("Host") == "data.mozilla.com") {
      if (string.startsWith(http.getPath(),"/submit/telemetry/" )) {
        pool.select("incoming-telemetry-testing");
      }
    }
  }
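
The decision this rule makes can be mirrored outside Zeus for quick sanity checks. The Python sketch below is only a model of the TrafficScript above, not something Zeus runs; "select_pool" and its arguments are hypothetical names.

```python
def select_pool(headers, path, require_test_header=True):
    """Return the pool name the rule would select, or None for the default pool."""
    # HTTP header names are case-insensitive, so normalise them first.
    normalised = {name.lower(): value for name, value in headers.items()}
    if require_test_header and normalised.get("cturra") != "true":
        return None
    if normalised.get("host") != "data.mozilla.com":
        return None
    if not path.startswith("/submit/telemetry/"):
        return None
    return "incoming-telemetry-testing"


# Only requests carrying the test header are diverted to the test pool.
print(select_pool({"Host": "data.mozilla.com", "cturra": "true"},
                  "/submit/telemetry/foo"))
print(select_pool({"Host": "data.mozilla.com"}, "/submit/telemetry/foo"))
```

Any request that fails one of the three checks falls through to whatever pool the virtual host already uses, so normal production traffic is untouched while the test header is required.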


$ HOSTNAME=data.mozilla.com; curl -X POST https://$HOSTNAME/submit/telemetry/foo/saved_session/Firefox/release/22.0/20130618035212 -v -H "cturra: true" -d '{"Host":1}'
* About to connect() to data.mozilla.com port 443 (#0)
* Trying 63.245.215.38...
* connected
* Connected to data.mozilla.com (63.245.215.38) port 443 (#0)
* SSLv3, TLS handshake, Client hello (1):
* SSLv3, TLS handshake, Server hello (2):
* SSLv3, TLS handshake, CERT (11):
* SSLv3, TLS handshake, Server finished (14):
* SSLv3, TLS handshake, Client key exchange (16):
* SSLv3, TLS change cipher, Client hello (1):
* SSLv3, TLS handshake, Finished (20):
* SSLv3, TLS change cipher, Client hello (1):
* SSLv3, TLS handshake, Finished (20):
* SSL connection using RC4-SHA
* Server certificate:
* subject: serialNumber=jf9U8AqKeLl1oWvqlzWKbOqP/7IsEi9H; C=US; ST=California; L=Mountain View; O=Mozilla Corporation; CN=data.mozilla.com
* start date: 2013-05-27 09:35:44 GMT
* expire date: 2015-06-29 16:38:53 GMT
* subjectAltName: data.mozilla.com matched
* issuer: C=US; O=GeoTrust, Inc.; CN=GeoTrust SSL CA
* SSL certificate verify ok.
> POST /submit/telemetry/foo/saved_session/Firefox/release/22.0/20130618035212 HTTP/1.1
> User-Agent: curl/7.24.0 (x86_64-apple-darwin12.0) libcurl/7.24.0 OpenSSL/0.9.8x zlib/1.2.5
> Host: data.mozilla.com
> Accept: */*
> cturra: true
> Content-Length: 10
> Content-Type: application/x-www-form-urlencoded
>
* upload completely sent off: 10 out of 10 bytes
< HTTP/1.1 200 OK
< Content-Type: text/plain
< Date: Mon, 16 Sep 2013 18:08:30 GMT
< Connection: keep-alive
< Transfer-Encoding: chunked
<
* Connection #0 to host data.mozilla.com left intact
OK* Closing connection #0
* SSLv3, TLS alert, Client hello (1):
now that we have completed the testing and know it will be possible to pass along these requests to an external pool (as described in comment 7), our next steps are:

 1) let us know when the elb is set up with nodes and DNS can be pointed to it
 2) we will update the zeus pool to use this elb and disable the "cturra" test header.

*it sounds like these won't be happening until oct 1st, per :mreid in comment 2.
Assignee: nobody → cturra
Flags: needinfo?(aphadke)
OS: Mac OS X → All
Hardware: x86 → All
Component: Operations → WebOps: Product Delivery
Product: Mozilla Services → Infrastructure & Operations
QA Contact: nmaul
Version: unspecified → other
I have a test load balancer in AWS:

telemetry-server-lb-2038738656.us-west-2.elb.amazonaws.com

Can you update the "incoming-telemetry-testing" pool to point there so I can load test?  I will make sure to add the "cturra: true" header.
as discussed on irc, i have updated the pool in zeus for this. for testing, i have set up 80/tcp, but we'll need to chat about how to do ssl pass-through when we're ready for this to be https straight through.
Some weird things going on here.

If I hit port 80 on data.mozilla.com, I get "Connection Refused", so maybe data.mozilla.com doesn't allow any non-secure connections?

Also, if I use 'ab' to hit d.m.c via https, it works with a small amount of concurrency (-c5).  If I increase the number of concurrent connections (-c50), I start to get "SSL handshake failed (5)" and lots of failed requests.

Update: now none of the requests are going through - maybe I triggered some kind of anti-spam rules on data.mozilla.com?
(In reply to Mark Reid [:mreid] from comment #11)
> Update: now none of the requests are going through - maybe I triggered some
> kind of anti-spam rules on data.mozilla.com?

you had triggered one of our protection classes in zeus. to get you back on track, i have added an allow ip to the data.mozilla.org protection class - pls remind me when you're done so i can remove it.
we also discovered that the way zeus is configured on data.mozilla.org only allows a maximum of 20 simultaneous connections from a single IP. this is why the error rate climbs after we reach the magical number of 20. we didn't tinker with this value since it's in production for the existing data.mozilla.org virtual host.
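
The failures seen with "ab -c50" are consistent with a per-IP cap on concurrent connections. The sketch below is an assumed model of such a cap, not Zeus's actual protection-class implementation; the class and method names are invented for illustration.

```python
from collections import defaultdict


class PerIpConnectionLimiter:
    """Allow at most `max_per_ip` simultaneous connections per client IP."""

    def __init__(self, max_per_ip=20):
        self.max_per_ip = max_per_ip
        self._open = defaultdict(int)

    def try_open(self, ip):
        """Return True and count the connection if the client is under the cap."""
        if self._open[ip] >= self.max_per_ip:
            return False
        self._open[ip] += 1
        return True

    def close(self, ip):
        """Release one connection slot previously granted to `ip`."""
        if self._open[ip] > 0:
            self._open[ip] -= 1


# 50 concurrent attempts from one address: only the first 20 are accepted.
limiter = PerIpConnectionLimiter(max_per_ip=20)
accepted = sum(limiter.try_open("203.0.113.7") for _ in range(50))
print(accepted)  # prints 20
```

With a cap like this, a load test from a single host cannot exercise more than 20 parallel connections, so the results say little about capacity for traffic arriving from many client addresses.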
General note: please be careful with commenting, specifically mozilla.com vs mozilla.org. Our infrastructure is more than complicated enough without mixing up FQDNs. :)
OK, we're ready to enable this redirect.  The DNS entry for "incoming.telemetry.mozilla.org" has been updated to point at the ELB CNAME (in bug 921167).

Please update the rule in comment 7 to enable the redirect:
- Check that the ZLB's DNS entry for "incoming.telemetry.mozilla.org" has updated to point to the AWS ELB CNAME (dualstack.telemetry-server-lb-2038738656.us-west-2.elb.amazonaws.com)
- Update the target pool name to a production name (I suggest "incoming-telemetry-aws")
- Update the redirect destination to "incoming.telemetry.mozilla.org"
- Remove the check for the "cturra: true" header
Flags: needinfo?(aphadke)
(In reply to Mark Reid [:mreid] from comment #15)
> OK, we're ready to enable this redirect.  The DNS entry for
> "incoming.telemetry.mozilla.org" has been updated to point at the ELB CNAME
> (in bug 921167).
> 
> Please update the rule in comment 7 to enable the redirect:
> - Check that the ZLB's DNS entry for "incoming.telemetry.mozilla.org" has
> updated to point to the AWS ELB CNAME

Confirmed

[root@zlb6.ops.scl3 ~]# host incoming.telemetry.mozilla.org
incoming.telemetry.mozilla.org is an alias for dualstack.telemetry-server-lb-2038738656.us-west-2.elb.amazonaws.com.
dualstack.telemetry-server-lb-2038738656.us-west-2.elb.amazonaws.com has address 54.214.236.47
dualstack.telemetry-server-lb-2038738656.us-west-2.elb.amazonaws.com has IPv6 address 2620:108:700f::36d6:ec2f

> (dualstack.telemetry-server-lb-2038738656.us-west-2.elb.amazonaws.com)



> - Update the target pool name to a production name (I suggest
> "incoming-telemetry-aws")

Done: https://www.zlb.ops.scl3.mozilla.com:9090/apps/zxtm/?name=incoming-telemetry-aws&section=Pools%3AEdit#anchor

> - Update the redirect destination to "incoming.telemetry.mozilla.org"

Updated to only be incoming.telemetry.mozilla.org:443

> - Remove the check for the "cturra: true" header

Updated https://www.zlb.ops.scl3.mozilla.com:9090/apps/zxtm/?name=data-mo-incoming-telemetry-aws&section=Rules%3AEdit#anchor to be

if (http.getHeader("Host") == "data.mozilla.com") {
  if (string.startsWith(http.getPath(),"/submit/telemetry/" )) {
    pool.select("incoming-telemetry-aws");
  }
}
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
I'm getting a 500 error when I try submitting to data.mozilla.com, though submitting directly to incoming.telemetry.mozilla.org works as expected.

Details:

* About to connect() to data.mozilla.com port 443 (#0)
*   Trying 63.245.215.38...
* Connected to data.mozilla.com (63.245.215.38) port 443 (#0)
* successfully set certificate verify locations:
*   CAfile: none
  CApath: /etc/ssl/certs
* SSLv3, TLS handshake, Client hello (1):
* SSLv3, TLS handshake, Server hello (2):
* SSLv3, TLS handshake, CERT (11):
* SSLv3, TLS handshake, Server finished (14):
* SSLv3, TLS handshake, Client key exchange (16):
* SSLv3, TLS change cipher, Client hello (1):
* SSLv3, TLS handshake, Finished (20):
* SSLv3, TLS change cipher, Client hello (1):
* SSLv3, TLS handshake, Finished (20):
* SSL connection using RC4-SHA
* Server certificate:
* 	 subject: serialNumber=jf9U8AqKeLl1oWvqlzWKbOqP/7IsEi9H; C=US; ST=California; L=Mountain View; O=Mozilla Corporation; CN=data.mozilla.com
* 	 start date: 2013-05-27 09:35:44 GMT
* 	 expire date: 2015-06-29 16:38:53 GMT
* 	 subjectAltName: data.mozilla.com matched
* 	 issuer: C=US; O=GeoTrust, Inc.; CN=GeoTrust SSL CA
* 	 SSL certificate verify ok.
> POST /submit/telemetry/foo/saved_session/Firefox/release/22.0/20130618035213 HTTP/1.1
> User-Agent: curl/7.29.0
> Host: data.mozilla.com
> Accept: */*
> Content-Length: 49064
> Content-Type: application/x-www-form-urlencoded
> Expect: 100-continue
> 
< HTTP/1.1 500 Internal Server Error
< Date: Tue, 01 Oct 2013 20:41:20 GMT
< Connection: close
< Content-Type: text/html
< 
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8">
<title>Service Unavailable</title>
<style type="text/css">
body, p, h1 {
  font-family: Verdana, Arial, Helvetica, sans-serif;
}
h2 {
  font-family: Arial, Helvetica, sans-serif;
  color: #b10b29;
}
</style>
</head>
<body>
<h2>Service Unavailable</h2>
<p>The service is temporarily unavailable. Please try again later.</p>
</body>
</html>
* Closing connection 0
* SSLv3, TLS alert, Client hello (1):
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
We're losing release-channel data until this is fixed.
Severity: normal → blocker
Priority: -- → P1
I am looking into it
Fixed now. Zeus thought the server incoming.telemetry.mozilla.org:443 was down.

2 problems, one of which wasn't actually a problem:

1) Passive health checks were turned on. We normally turn them off because they generate false positives. In this case, it was indeed the cause of Zeus thinking it was down... however...

2) Zeus was not configured to re-encrypt the traffic out to incoming.telemetry.mozilla.org:443 (note the :443... it's expecting SSL). Because of this, from Zeus's perspective, it really *was* down... passive health checking was, in this case, doing the right thing.


I've enabled SSL on this pool, and disabled the passive health check to prevent future problems. All seems well now (confirmed in IRC w/ :mreid).
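
Why a passive check takes a misconfigured node out of rotation can be illustrated with a toy model. This is a sketch under assumptions: the failure threshold and the class below are hypothetical and do not describe how Zeus implements passive monitoring.

```python
class PassiveMonitor:
    """Mark a node as failed after consecutive failures seen on real traffic."""

    def __init__(self, failure_threshold=3):
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    def record(self, success):
        """Record the outcome of one proxied request."""
        if success:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1

    @property
    def node_up(self):
        return self.consecutive_failures < self.failure_threshold


# Plain HTTP sent to a TLS-only port fails every time,
# so the only node in the pool is soon considered down.
monitor = PassiveMonitor()
for _ in range(3):
    monitor.record(success=False)
print(monitor.node_up)  # prints False
```

In this model the monitor is behaving correctly: the root cause is the missing re-encryption, and once every node in the pool is marked down the load balancer has nothing to forward to and returns an error page.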
Severity: blocker → normal
Status: REOPENED → RESOLVED
Closed: 11 years ago
Priority: P1 → --
Resolution: --- → FIXED
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard