Closed Bug 1170832 Opened 9 years ago Closed 8 years ago

From MTV2/SCL3 (wired LAN) traffic gets sent to CloudFront CDN nodes/POPs in Japan/Hong Kong

Categories

(Cloud Services :: Operations: Miscellaneous, task)

task
Not set
normal

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: stephend, Assigned: oremj)

References

Details

(Whiteboard: [fromAutomation])

Attachments

(1 file)

(Filing this primarily to note it -- I'm not sure what, if anything, can be done from our -- Mozilla's -- side to help mitigate this.)

Problem: From ~11:55am PDT to 2:00pm PDT, we noticed a huge slowdown (up to 15 seconds for a GET for a font/image/asset to complete) in our Mozilla.org test-automation suite (and in manual testing, too) against the following host:

https://www.allizom.org/en-US/ (which uses CloudFront as its CDN)

The problem appears to have been a CloudFront misconfiguration, which sent at least internal Mozilla traffic first to a CDN node in Japan and later to one in Hong Kong. When it finally recovered, we hit an Equinix-peered Amazon CDN node in SF (see the last traceroute below).

Resources:

A) New Relic captured this with its Real User Monitoring feature:

1) https://rpm.newrelic.com/accounts/263620/browser/2638806 - www.allizom.org (staging) data for the Python side of Mozilla.org
2) https://rpm.newrelic.com/public/charts/iuRHrSSbARd - chart from the 2-hour window, showing the response-time spike for the above (an average page-load time of 29 seconds, at times 36+ seconds).

B) Here's a screencast, demonstrating the end-user problem (again, on staging, which affects test automation and manual testing): http://screencast.com/t/OBxkVr30T

C) And, finally, here are some traceroutes:

C:\Users\stephend>tracert d3kg653zred9tf.cloudfront.net
 
Tracing route to d3kg653zred9tf.cloudfront.net [54.182.4.69]
over a maximum of 30 hops:
 
  1    <1 ms    <1 ms    <1 ms  fw1.corp.mtv2.mozilla.net [10.252.24.1]
  2     1 ms     1 ms     1 ms  v-1077.border1.sjc2.mozilla.net [63.245.219.182]
  3     2 ms     3 ms     2 ms  sjo-b21-link.telia.net [62.115.8.161]
  4     2 ms     2 ms     2 ms  ntt-ic-306350-sjo-b21.c.telia.net [62.115.44.46]
  5     2 ms     2 ms     2 ms  ae-4.r23.snjsca04.us.bb.gin.ntt.net [129.250.4.25]
  6   128 ms   128 ms   128 ms  ae-6.r21.osakjp02.jp.bb.gin.ntt.net [129.250.2.131]
  7   127 ms   114 ms   114 ms  ae-5.r23.osakjp02.jp.bb.gin.ntt.net [129.250.6.144]
  8   153 ms   153 ms   159 ms  ae-8.r23.tkokhk01.hk.bb.gin.ntt.net [129.250.2.222]
  9   167 ms   166 ms   166 ms  ae-1.r01.newthk03.hk.bb.gin.ntt.net [129.250.2.3]
 10   175 ms   177 ms   175 ms  203.131.245.182
 11   175 ms   175 ms   175 ms  server-54-182-4-69.hkg51.r.cloudfront.net [54.182.4.69]

C:\Users\stephend>tracert d37xxasan61b7v.cloudfront.net
 
Tracing route to d37xxasan61b7v.cloudfront.net [54.239.194.209]
over a maximum of 30 hops:
 
  1     *        *        *     Request timed out.
  2     2 ms     1 ms     1 ms  v-1077.border1.sjc2.mozilla.net [63.245.219.182]
  3     2 ms     2 ms     2 ms  sjo-b21-link.telia.net [62.115.8.161]
  4     2 ms     2 ms     2 ms  ntt-ic-306350-sjo-b21.c.telia.net [62.115.44.46]
  5     2 ms     2 ms     2 ms  ae-4.r22.snjsca04.us.bb.gin.ntt.net [129.250.4.5]
  6   129 ms   129 ms   123 ms  ae-7.r20.osakjp02.jp.bb.gin.ntt.net [129.250.2.165]
  7   119 ms   159 ms   119 ms  ae-4.r23.osakjp02.jp.bb.gin.ntt.net [129.250.6.90]
  8   133 ms   133 ms   133 ms  ae-2.r01.osakjp02.jp.bb.gin.ntt.net [129.250.3.199]
  9   121 ms   120 ms   121 ms  ae-1.amazon.osakjp02.jp.bb.gin.ntt.net [61.200.82.218]
 10   235 ms   235 ms     *     54.239.52.144
 11   235 ms   234 ms   234 ms  54.239.52.149
 12   236 ms     *      237 ms  27.0.0.115
 13     *        *        *     Request timed out.

Once this cleared up ~2pm PDT, we saw this:

C:\Users\stephend>tracert d3kg653zred9tf.cloudfront.net

Tracing route to d3kg653zred9tf.cloudfront.net [205.251.215.221]
over a maximum of 30 hops:

  1    <1 ms    <1 ms    <1 ms  fw1.corp.mtv2.mozilla.net [10.252.24.1]
  2     2 ms     1 ms     2 ms  v-1077.border1.sjc2.mozilla.net [63.245.219.182]
  3     1 ms     1 ms     1 ms  xe-5-0-0.mpr4.sjc7.us.above.net [64.125.170.37]
  4     1 ms     1 ms     1 ms  ae11.mpr3.sjc7.us.zip.zayo.com [64.125.28.1]
  5     1 ms     1 ms     1 ms  equinix02-sfo5.amazon.com [206.223.116.236]
  6     2 ms     2 ms     4 ms  205.251.229.141
  7     1 ms     1 ms     1 ms  205.251.230.91
  8     1 ms     2 ms     1 ms  server-205-251-215-221.sfo5.r.cloudfront.net [205.251.215.221]

I should note: although performance of both the routes to, and the CDN POPs in, Japan and Hong Kong was poor, the real problem seems to be that we were sent to them at all, rather than to a local POP -- or at least one in or near North America.
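
For anyone wanting to spot-check which POP a given resolver maps us to, here's a minimal sketch, assuming Python 3 on the test host (not part of the original report); the hostname is the distribution from the traceroutes above, and the PTR records CloudFront returns (e.g. server-54-182-4-69.hkg51.r.cloudfront.net) embed the POP code (hkg51 = Hong Kong, sfo5 = San Francisco):

import socket

HOSTNAME = "d3kg653zred9tf.cloudfront.net"  # distribution from the traceroutes above

def cloudfront_pops(hostname):
    """Return (ip, ptr) pairs for every A record the system resolver hands back."""
    _, _, addresses = socket.gethostbyname_ex(hostname)
    results = []
    for ip in addresses:
        try:
            ptr = socket.gethostbyaddr(ip)[0]  # PTR name carries the POP code
        except socket.herror:
            ptr = "<no PTR record>"
        results.append((ip, ptr))
    return results

if __name__ == "__main__":
    for ip, ptr in cloudfront_pops(HOSTNAME):
        print(f"{ip} -> {ptr}")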
Did someone open a ticket with cloudfront?

NI on fox2mike as I think webops manages the CDNs
Assignee: network-operations → arzhel
Flags: needinfo?(smani)
(In reply to Arzhel Younsi [:XioNoX] from comment #1)
> Did someone open a ticket with cloudfront?
> 
> NI on fox2mike as I think webops manages the CDNs

I'm not sure how this automatically warrants a ticket with Amazon, especially when the ticket says it's from the office. I'm also not sure whose Amazon account this is being set up from :) and will have to dig around to find that. You might be able to file a more meaningful ticket than I can wrt this, fwiw.
Flags: needinfo?(smani)
Based on comment 0 it seems to be a CDN issue, and I have no idea how that's managed. I can't reproduce it, though, so they probably fixed it. If it was a one-off, I don't think it's worth opening a ticket with them.

Note that CloudFront usually uses the location of your DNS resolver to figure out which endpoint to send you to, so make sure you're using a local resolver.

For example, if I'm connected to the Mozilla VPN and use it for DNS (the default), it will direct me to the SFO1 endpoint even though I'm in Paris. Using Google's DNS will send me to the Paris one.
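
To illustrate that point, a quick hypothetical check (assuming Python 3 and the dig utility are available on the host; not something actually run in this bug) that compares the answers the system resolver and Google Public DNS hand back for the same distribution hostname:

import subprocess

HOSTNAME = "d3kg653zred9tf.cloudfront.net"

def resolve(hostname, server=None):
    """Return the A records dig prints, optionally querying a specific server."""
    cmd = ["dig", "+short", hostname]
    if server:
        cmd.insert(1, f"@{server}")
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return [line for line in out.splitlines() if line]

# If the two answer sets reverse-resolve to different POP codes, resolver
# choice is what is driving the edge selection.
print("system resolver:", resolve(HOSTNAME))
print("Google DNS     :", resolve(HOSTNAME, server="8.8.8.8"))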
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → WORKSFORME
(In reply to Arzhel Younsi [:XioNoX] from comment #3)
> Based on comment 0 it seems to be a CDN issue, and I have no idea how that's
> managed. I can't reproduce it, though, so they probably fixed it. If it was a
> one-off, I don't think it's worth opening a ticket with them.
> 
> Note that CloudFront usually uses the location of your DNS resolver to
> figure out which endpoint to send you to, so make sure you're using a local
> resolver.
> 
> For example, if I'm connected to the Mozilla VPN and use it for DNS (the
> default), it will direct me to the SFO1 endpoint even though I'm in Paris.
> Using Google's DNS will send me to the Paris one.

Right; FWIW, all of our Mountain View (MTV2 QA lab) hosts on the LAN, as well as my wired desk connection, saw this, and all use the default DNS servers assigned through DHCP.
Just going to dump some more info from today:

C:\Users\Stephen Donner>tracert d3kg653zred9tf.cloudfront.net

Tracing route to d3kg653zred9tf.cloudfront.net [205.251.212.200]
over a maximum of 30 hops:

  1    <1 ms    <1 ms    <1 ms  fw1.corp.mtv2.mozilla.net [10.252.24.1]
  2     1 ms    <1 ms    <1 ms  v-1075.border1.sjc2.mozilla.net [63.245.219.186]
  3     1 ms     1 ms     1 ms  xe-3-1-0.mpr2.pao1.us.above.net [64.125.170.33]
  4     1 ms     1 ms     1 ms  zayo-kddi.pao1.us.zip.zayo.com [64.125.14.34]
  5     1 ms     1 ms     2 ms  pajbb002.int-gw.kddi.ne.jp [111.87.3.61]
  6   114 ms   111 ms   111 ms  obpjbb206.int-gw.kddi.ne.jp [203.181.100.185]
  7   110 ms   110 ms   110 ms  jc-osa302.kddnet.ad.jp [113.157.227.46]
  8   111 ms   111 ms   114 ms  106.187.29.142
  9   120 ms   120 ms   117 ms  27.0.0.250
 10   111 ms   115 ms   111 ms  27.0.2.1
 11     *        *        *     Request timed out.
 12     *        *

Even externally, over a "cable" connection (hosted by Yottaa.com) via webpagetest.org, we get a nearly 14-second page-load time: http://www.webpagetest.org/result/150825_7K_1546/1/details/

There's no way to tell whether they were also sent to a Japanese node rather than a "local" US-based one, but at least it shows the problem isn't specific to the MTV2 network (I hope).

In particular, it takes nearly 4 seconds to fetch a single image over the wire: http://www.webpagetest.org/result/150825_7K_1546/1/details/#request112

Request 112: https://d3kg653zred9tf.cloudfront.net/media/img/firefox/partners/form-bg.ce84507a80dd.jpg

URL: https://d3kg653zred9tf.cloudfront.net/media/img/firefox/partners/form-bg.ce84507a80dd.jpg
Host: d3kg653zred9tf.cloudfront.net
IP: 127.0.0.1
Error/Status Code: 200
Initiated By: https://d3kg653zred9tf.cloudfront.net/media/js/partners_desktop-bundle.d53e626103af.js line 4 column 17757
Request Start: 9.272 s
Time to First Byte: 3990 ms

This might be telling?

X-Cache-Info: caching
X-Cache-Info: caching
X-Cache: Miss from cloudfront
Via: 1.1 1dfcd1bafe1d21759a09fdcad1a93705.cloudfront.net (CloudFront)
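
A small sketch of how those response headers and a rough time-to-first-byte figure can be captured outside webpagetest (assuming Python 3; the URL is request 112 above, and none of this was part of the original comment):

import time
import urllib.request

URL = ("https://d3kg653zred9tf.cloudfront.net/media/img/firefox/partners/"
       "form-bg.ce84507a80dd.jpg")

start = time.monotonic()
with urllib.request.urlopen(URL) as resp:
    ttfb = time.monotonic() - start   # headers received; body not yet read
    body = resp.read()                # drain the body to complete the request
    total = time.monotonic() - start
    # get_all() returns every occurrence, so duplicated headers show up too
    for name in ("Via", "X-Cache", "X-Cache-Info"):
        for value in resp.headers.get_all(name) or []:
            print(f"{name}: {value}")
    print(f"status={resp.status} bytes={len(body)} "
          f"ttfb={ttfb:.3f}s total={total:.3f}s")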
Reopening; there are at least 2 bugs here besides the perf/routing/DNS/node-region issue:

1) redundant X-Cache-Info: caching headers/values
2) an X-Cache: Miss from cloudfront

Even if we're not yet expecting this environment to be "future prod," we'd do well to sort this out.

Another datapoint from yesterday:

Again we're seeing requests for assets on CloudFront routed to Japanese nodes, with higher latency/request-transfer times.

    Here's a webpagetest.org run:
        http://www.webpagetest.org/result/150825_SR_16P9/1/details/
    Here's a traceroute:
        sdonners-MacBook-Pro:~ sdonner$ traceroute d3kg653zred9tf.cloudfront.net
        traceroute: Warning: d3kg653zred9tf.cloudfront.net has multiple addresses; using 216.137.52.241
        traceroute to d3kg653zred9tf.cloudfront.net (216.137.52.241), 64 hops max, 52 byte packets
         1  fw1.corp.mtv2.mozilla.net (10.252.24.1)  2.240 ms * *
         2  v-1075.border1.sjc2.mozilla.net (63.245.219.186)  2.790 ms  3.219 ms  2.908 ms
         3  xe-3-1-0.mpr2.pao1.us.above.net (64.125.170.33)  2.889 ms  2.630 ms  2.548 ms
         4  zayo-kddi.pao1.us.zip.zayo.com (64.125.14.34)  2.585 ms  2.739 ms  2.881 ms
         5  pajbb002.int-gw.kddi.ne.jp (111.87.3.77)  2.544 ms
            pajbb001.int-gw.kddi.ne.jp (111.87.3.73)  2.205 ms
            pajbb002.int-gw.kddi.ne.jp (111.87.3.77)  22.997 ms
         6  obpjbb205.int-gw.kddi.ne.jp (203.181.100.193)  146.551 ms
            obpjbb206.int-gw.kddi.ne.jp (203.181.100.37)  108.115 ms
            obpjbb206.int-gw.kddi.ne.jp (203.181.100.185)  112.914 ms
         7  jc-osa302.int-gw.kddi.ne.jp (113.157.227.122)  134.100 ms
            jc-osa302.int-gw.kddi.ne.jp (113.157.227.126)  113.457 ms
            jc-osa302.int-gw.kddi.ne.jp (113.157.227.122)  127.153 ms
         8  106.187.29.142 (106.187.29.142)  116.112 ms  114.752 ms  119.901 ms
         9  27.0.0.250 (27.0.0.250)  127.205 ms
            54.239.52.142 (54.239.52.142)  128.851 ms
            27.0.0.250 (27.0.0.250)  133.873 ms
        10  54.239.52.149 (54.239.52.149)  122.667 ms  127.080 ms
            54.239.52.135 (54.239.52.135)  123.080 ms
        11  27.0.0.237 (27.0.0.237)  124.237 ms
        ^C
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
Richard, any potential insights on this?
Flags: needinfo?(riweiss)
Stephen, can you give me access to that AWS account? I'd like to look at the configuration and possibly open a ticket with Amazon.
Flags: needinfo?(riweiss) → needinfo?(stephen.donner)
(In reply to Richard Weiss [:r2] from comment #8)
> Stephen, can you give me access to that AWS account? I'd like to look at the
> configuration and possibly open a ticket with Amazon.

Sorry, I have zero knowledge of it - was told that fox2mike and/or jakem would, though.
Flags: needinfo?(stephen.donner) → needinfo?(smani)
I pointed r2 at the right account
Flags: needinfo?(smani)
r2, fox2mike, does it look like a CDN issue? who should own that bug?
Flags: needinfo?(smani)
Flags: needinfo?(riweiss)
(In reply to Arzhel Younsi [:XioNoX] from comment #12)
> r2, fox2mike, does it look like a CDN issue? who should own that bug?

Hey Arzhel - sorry for not updating the bug earlier; Jason Thomas has reached out to Amazon with a trouble ticket (1524721631), and their last response, 6 days ago, was that they are looking into it further.
I can see the same for our Mozmill-CI nodes in qa.scl3.mozilla.com. We got dozens of test failures today, and downloads of builds take extraordinarily long, at about 25 kb/s to 75 kb/s.
Just got this from Amazon support:

> For some reasons when you use your DNS resolver we are mapping you always outside US, which is the issue I am working to get corrected with our Cloudfront Services Team. If you are using Google DNS, Google sends your actual Public IP as part of the DNS extension, in this case Cloudfront relies on Geo Database and is mapping you with-in US based on Geo Distance.
> This certainly is an issue at our side which I am working to get corrected.
> We regret the inconvenience this has been causing and the delay in resolving this.
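
For context, the "DNS extension" Amazon refers to is EDNS Client Subnet (ECS). A hedged sketch of how that difference can be observed (assuming Python 3 and a dig build with +subnet support, BIND 9.10 or newer; the 198.51.100.0/24 prefix is purely a documentation example, not anything from this bug):

import subprocess

HOSTNAME = "d3kg653zred9tf.cloudfront.net"

def resolve(hostname, server="8.8.8.8", subnet=None):
    """Query the given DNS server, optionally asserting a client subnet via ECS."""
    cmd = ["dig", f"@{server}", "+short", hostname]
    if subnet:
        cmd.append(f"+subnet={subnet}")
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return [line for line in out.splitlines() if line]

# Without an explicit ECS prefix the resolver's own location tends to drive the
# answer; with one, CloudFront can map the asserted client prefix instead.
print("no ECS :", resolve(HOSTNAME))
print("US ECS :", resolve(HOSTNAME, subnet="198.51.100.0/24"))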
As noted earlier, this also affects our Firefox UI tests for Firefox Desktop as run in SCL3, so I'm updating the summary to reflect the actual scope.

Thank you, Jeremy, for the new information. Hopefully they will be able to fix that soon. As we noticed late yesterday, the Japan/Hong Kong routing got fixed, but we were still being routed via the UK for US-internal traffic.
Summary: From MTV2 (wired LAN) traffic was sent to Cloudfront CDN nodes/POPs in Japan/Hong Kong for 2 hours (affected Mozilla.org test automation) → From MTV2/SCL3 (wired LAN) traffic gets send to Cloudfront CDN nodes/POPs in Japan/Hong Kong
This blocks most parts of our automation due to broken downloads of the Firefox binary. Adding our whiteboard entry for tracking.
Whiteboard: [fromAutomation] → [fromAutomation][qa-automation-blocked]
Summary: From MTV2/SCL3 (wired LAN) traffic gets send to Cloudfront CDN nodes/POPs in Japan/Hong Kong → From MTV2/SCL3 (wired LAN) traffic gets sent to CloudFront CDN nodes/POPs in Japan/Hong Kong
Can you tell me which AWS account this ticket was opened in? Either the account name or number will do. I'd like to call this to the attention of our Amazon account team.
Flags: needinfo?(riweiss) → needinfo?(stephen.donner)
Richard, I've e-mailed you the case id.
Flags: needinfo?(stephen.donner)
Did we get an update here in the meantime? Half a month has passed. As of today we hit bug 1219934 again for a couple of Firefox DMGs (ar locale).
Flags: needinfo?(oremj)
It seems to be getting better. I've attached a throughput graph; the average looks quite a bit better over the last 4 days. The ticket is still open with AWS, and they are working on it.

We are also considering fronting CloudFront with Akamai, which should eliminate these performance issues.
Flags: needinfo?(oremj)
Attachment #8688527 - Attachment description: cloudfront-throughput.PNG → scl3 -> cloudfront throughput
Moving the bug to oremj/Services.
Assignee: arzhel → oremj
Component: NetOps: Other → Operations
Product: Infrastructure & Operations → Cloud Services
QA Contact: jbarnell
We again had a couple of incompletely downloaded installers for the latest Firefox 43.0b8 build1 candidate builds, which caused problems across all platforms.
(In reply to Henrik Skupin (:whimboo) from comment #23)
> We again had a couple of incompletely downloaded installers for the latest
> Firefox 43.0b8 build1 candidate builds, which caused problems across all
> platforms.

Incomplete != latency, to my knowledge; can you please file a separate issue for that (and cc me), Henrik?
Flags: needinfo?(hskupin)
Stephen, bug 1219934 is already filed and is also marked as a dependency of this bug. As :oremj mentioned, the two should be related.
Flags: needinfo?(hskupin)
Flags: needinfo?(smani)
No longer blocks: 1231938
I haven't seen broken installer downloads for a while now, so at least I can remove the blocking whiteboard entry.
Whiteboard: [fromAutomation][qa-automation-blocked] → [fromAutomation]
I'm going to assume this has been fixed on CloudFront's side. Please reopen if there are still problems.
Status: REOPENED → RESOLVED
Closed: 9 years ago → 8 years ago
Resolution: --- → FIXED
(In reply to Jeremy Orem [:oremj] from comment #28)
> I'm going to assume this has been fixed on cloudfront's side. Please reopen
> if there are still problems.

Yep, thanks, Jeremy. I haven't seen this rear its head since.
Status: RESOLVED → VERIFIED