Closed Bug 1170832 Opened 9 years ago Closed 8 years ago

From MTV2/SCL3 (wired LAN) traffic gets sent to CloudFront CDN nodes/POPs in Japan/Hong Kong

Categories

(Cloud Services :: Operations: Miscellaneous, task)

task
Not set
normal

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: stephend, Assigned: oremj)

References

Details

(Whiteboard: [fromAutomation])

Attachments

(1 file)

(Filing this primarily to note it -- I'm not sure what, if anything, can be done from our -- Mozilla's -- side to help mitigate this.)

Problem: From ~11:55am PDT to 2:00pm PDT, we noticed a huge slowdown (up to 15 seconds for a GET for a font/image/asset to complete) in our Mozilla.org test-automation suite (and in manual testing, too) against the following host:

https://www.allizom.org/en-US/ (which uses CloudFront as its CDN)

The problem appears to have been a CloudFront misconfiguration, which sent at least internal Mozilla traffic first to a CDN node in Japan and later to one in Hong Kong. When it finally recovered, we hit an Equinix-peered Amazon CDN node in SF (see the last traceroute below).

Resources:

A) New Relic captured this with its Real User Monitoring feature:

1) https://rpm.newrelic.com/accounts/263620/browser/2638806 - www.allizom.org (staging) data for the Python side of Mozilla.org
2) https://rpm.newrelic.com/public/charts/iuRHrSSbARd - chart from the 2-hour window, showing the response-time spike for the above (an average page-load time of 29 seconds, at times 36+ seconds).

B) Here's a screencast, demonstrating the end-user problem (again, on staging, which affects test automation and manual testing): http://screencast.com/t/OBxkVr30T

C) And, finally, here are some traceroutes:

C:\Users\stephend>tracert d3kg653zred9tf.cloudfront.net
 
Tracing route to d3kg653zred9tf.cloudfront.net [54.182.4.69]
over a maximum of 30 hops:
 
  1    <1 ms    <1 ms    <1 ms  fw1.corp.mtv2.mozilla.net [10.252.24.1]
  2     1 ms     1 ms     1 ms  v-1077.border1.sjc2.mozilla.net [63.245.219.182]
  3     2 ms     3 ms     2 ms  sjo-b21-link.telia.net [62.115.8.161]
  4     2 ms     2 ms     2 ms  ntt-ic-306350-sjo-b21.c.telia.net [62.115.44.46]
  5     2 ms     2 ms     2 ms  ae-4.r23.snjsca04.us.bb.gin.ntt.net [129.250.4.25]
  6   128 ms   128 ms   128 ms  ae-6.r21.osakjp02.jp.bb.gin.ntt.net [129.250.2.131]
  7   127 ms   114 ms   114 ms  ae-5.r23.osakjp02.jp.bb.gin.ntt.net [129.250.6.144]
  8   153 ms   153 ms   159 ms  ae-8.r23.tkokhk01.hk.bb.gin.ntt.net [129.250.2.222]
  9   167 ms   166 ms   166 ms  ae-1.r01.newthk03.hk.bb.gin.ntt.net [129.250.2.3]
 10   175 ms   177 ms   175 ms  203.131.245.182
 11   175 ms   175 ms   175 ms  server-54-182-4-69.hkg51.r.cloudfront.net [54.182.4.69]

C:\Users\stephend>tracert d37xxasan61b7v.cloudfront.net
 
Tracing route to d37xxasan61b7v.cloudfront.net [54.239.194.209]
over a maximum of 30 hops:
 
  1     *        *        *     Request timed out.
  2     2 ms     1 ms     1 ms  v-1077.border1.sjc2.mozilla.net [63.245.219.182]
  3     2 ms     2 ms     2 ms  sjo-b21-link.telia.net [62.115.8.161]
  4     2 ms     2 ms     2 ms  ntt-ic-306350-sjo-b21.c.telia.net [62.115.44.46]
  5     2 ms     2 ms     2 ms  ae-4.r22.snjsca04.us.bb.gin.ntt.net [129.250.4.5]
  6   129 ms   129 ms   123 ms  ae-7.r20.osakjp02.jp.bb.gin.ntt.net [129.250.2.165]
  7   119 ms   159 ms   119 ms  ae-4.r23.osakjp02.jp.bb.gin.ntt.net [129.250.6.90]
  8   133 ms   133 ms   133 ms  ae-2.r01.osakjp02.jp.bb.gin.ntt.net [129.250.3.199]
  9   121 ms   120 ms   121 ms  ae-1.amazon.osakjp02.jp.bb.gin.ntt.net [61.200.82.218]
 10   235 ms   235 ms     *     54.239.52.144
 11   235 ms   234 ms   234 ms  54.239.52.149
 12   236 ms     *      237 ms  27.0.0.115
 13     *        *        *     Request timed out.

Once this cleared up ~2pm PDT, we saw this:

C:\Users\stephend>tracert d3kg653zred9tf.cloudfront.net

Tracing route to d3kg653zred9tf.cloudfront.net [205.251.215.221]
over a maximum of 30 hops:

  1    <1 ms    <1 ms    <1 ms  fw1.corp.mtv2.mozilla.net [10.252.24.1]
  2     2 ms     1 ms     2 ms  v-1077.border1.sjc2.mozilla.net [63.245.219.182]
  3     1 ms     1 ms     1 ms  xe-5-0-0.mpr4.sjc7.us.above.net [64.125.170.37]
  4     1 ms     1 ms     1 ms  ae11.mpr3.sjc7.us.zip.zayo.com [64.125.28.1]
  5     1 ms     1 ms     1 ms  equinix02-sfo5.amazon.com [206.223.116.236]
  6     2 ms     2 ms     4 ms  205.251.229.141
  7     1 ms     1 ms     1 ms  205.251.230.91
  8     1 ms     2 ms     1 ms  server-205-251-215-221.sfo5.r.cloudfront.net [205.251.215.221]

I should note: although performance of both the routes to, and the CDN POPs in, Japan and Hong Kong was poor, the real problem seems to be that we were sent to them at all, rather than to a local POP -- or at least one in or near North America.
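
For anyone wanting to spot-check which POP a given resolver maps us to, here's a minimal sketch, assuming Python 3 on the test host (not part of the original report); the hostname is the distribution from the traceroutes above, and the PTR records CloudFront returns (e.g. server-54-182-4-69.hkg51.r.cloudfront.net) embed the POP code (hkg51 = Hong Kong, sfo5 = San Francisco):

import socket

HOSTNAME = "d3kg653zred9tf.cloudfront.net"  # distribution from the traceroutes above

def cloudfront_pops(hostname):
    """Return (ip, ptr) pairs for every A record the system resolver hands back."""
    _, _, addresses = socket.gethostbyname_ex(hostname)
    results = []
    for ip in addresses:
        try:
            ptr = socket.gethostbyaddr(ip)[0]  # PTR name carries the POP code
        except socket.herror:
            ptr = "<no PTR record>"
        results.append((ip, ptr))
    return results

if __name__ == "__main__":
    for ip, ptr in cloudfront_pops(HOSTNAME):
        print(f"{ip} -> {ptr}")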
Did someone open a ticket with cloudfront?

NI on fox2mike as I think webops manages the CDNs
Assignee: network-operations → arzhel
Flags: needinfo?(smani)
(In reply to Arzhel Younsi [:XioNoX] from comment #1)
> Did someone open a ticket with cloudfront?
> 
> NI on fox2mike as I think webops manages the CDNs

I'm not sure how this automatically warrants a ticket with Amazon, especially when the ticket says it's from the office. I'm also not sure whose Amazon account this is being set up from :) and will have to dig around to find that. You might be able to file a more meaningful ticket than I can wrt this, fwiw.
Flags: needinfo?(smani)
Based on comment 0 it seems to be a CDN issue, and I have no idea how that's managed. I can't reproduce it, though, so they probably fixed it. If it was a one-off, I don't think it's worth opening a ticket with them.

Note that CloudFront usually uses the location of your DNS resolver to figure out which endpoint to send you to, so make sure you're using a local resolver.

For example, if I'm connected to the Mozilla VPN and use it for DNS (the default), it will direct me to the SFO1 endpoint even though I'm in Paris. Using Google's DNS will send me to the Paris one.
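
To illustrate that point, a quick hypothetical check (assuming Python 3 and the dig utility are available on the host; not something actually run in this bug) that compares the answers the system resolver and Google Public DNS hand back for the same distribution hostname:

import subprocess

HOSTNAME = "d3kg653zred9tf.cloudfront.net"

def resolve(hostname, server=None):
    """Return the A records dig prints, optionally querying a specific server."""
    cmd = ["dig", "+short", hostname]
    if server:
        cmd.insert(1, f"@{server}")
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return [line for line in out.splitlines() if line]

# If the two answer sets reverse-resolve to different POP codes, resolver
# choice is what is driving the edge selection.
print("system resolver:", resolve(HOSTNAME))
print("Google DNS     :", resolve(HOSTNAME, server="8.8.8.8"))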
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → WORKSFORME
(In reply to Arzhel Younsi [:XioNoX] from comment #3)
> Based on comment 0 it seems to be a CDN issue, and I have no idea how that's
> managed. I can't reproduce it, though, so they probably fixed it. If it was a
> one-off, I don't think it's worth opening a ticket with them.
> 
> Note that CloudFront usually uses the location of your DNS resolver to
> figure out which endpoint to send you to, so make sure you're using a local
> resolver.
> 
> For example, if I'm connected to the Mozilla VPN and use it for DNS (the
> default), it will direct me to the SFO1 endpoint even though I'm in Paris.
> Using Google's DNS will send me to the Paris one.

Right; FWIW, all of our Mountain View (MTV2 QA lab) hosts on the LAN, as well as my wired desk connection, saw this, and all use the default DNS servers assigned through DHCP.
Just going to dump some more info from today:

C:\Users\Stephen Donner>tracert d3kg653zred9tf.cloudfront.net

Tracing route to d3kg653zred9tf.cloudfront.net [205.251.212.200]
over a maximum of 30 hops:

  1    <1 ms    <1 ms    <1 ms  fw1.corp.mtv2.mozilla.net [10.252.24.1]
  2     1 ms    <1 ms    <1 ms  v-1075.border1.sjc2.mozilla.net [63.245.219.186]
  3     1 ms     1 ms     1 ms  xe-3-1-0.mpr2.pao1.us.above.net [64.125.170.33]
  4     1 ms     1 ms     1 ms  zayo-kddi.pao1.us.zip.zayo.com [64.125.14.34]
  5     1 ms     1 ms     2 ms  pajbb002.int-gw.kddi.ne.jp [111.87.3.61]
  6   114 ms   111 ms   111 ms  obpjbb206.int-gw.kddi.ne.jp [203.181.100.185]
  7   110 ms   110 ms   110 ms  jc-osa302.kddnet.ad.jp [113.157.227.46]
  8   111 ms   111 ms   114 ms  106.187.29.142
  9   120 ms   120 ms   117 ms  27.0.0.250
 10   111 ms   115 ms   111 ms  27.0.2.1
 11     *        *        *     Request timed out.
 12     *        *

Even externally, over a "cable" connection (hosted by Yottaa.com) via webpagetest.org, we get a nearly 14-second page-load time: http://www.webpagetest.org/result/150825_7K_1546/1/details/

There's no way to tell whether they were also sent to a Japanese node rather than a "local" US-based one, but at least it shows the problem isn't specific to the MTV2 network (I hope).

In particular, it takes nearly 4 seconds to fetch a single image over the wire: http://www.webpagetest.org/result/150825_7K_1546/1/details/#request112

Request 112: https://d3kg653zred9tf.cloudfront.net/media/img/firefox/partners/form-bg.ce84507a80dd.jpg

URL: https://d3kg653zred9tf.cloudfront.net/media/img/firefox/partners/form-bg.ce84507a80dd.jpg
Host: d3kg653zred9tf.cloudfront.net
IP: 127.0.0.1
Error/Status Code: 200
Initiated By: https://d3kg653zred9tf.cloudfront.net/media/js/partners_desktop-bundle.d53e626103af.js line 4 column 17757
Request Start: 9.272 s
Time to First Byte: 3990 ms

This might be telling?

X-Cache-Info: caching
X-Cache-Info: caching
X-Cache: Miss from cloudfront
Via: 1.1 1dfcd1bafe1d21759a09fdcad1a93705.cloudfront.net (CloudFront)
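
A small sketch of how those response headers and a rough time-to-first-byte figure can be captured outside webpagetest (assuming Python 3; the URL is request 112 above, and none of this was part of the original comment):

import time
import urllib.request

URL = ("https://d3kg653zred9tf.cloudfront.net/media/img/firefox/partners/"
       "form-bg.ce84507a80dd.jpg")

start = time.monotonic()
with urllib.request.urlopen(URL) as resp:
    ttfb = time.monotonic() - start   # headers received; body not yet read
    body = resp.read()                # drain the body to complete the request
    total = time.monotonic() - start
    # get_all() returns every occurrence, so duplicated headers show up too
    for name in ("Via", "X-Cache", "X-Cache-Info"):
        for value in resp.headers.get_all(name) or []:
            print(f"{name}: {value}")
    print(f"status={resp.status} bytes={len(body)} "
          f"ttfb={ttfb:.3f}s total={total:.3f}s")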
Reopening; there are at least 2 bugs here besides the perf/routing/DNS/node-region issue:

1) redundant X-Cache-Info: caching headers/values
2) an X-Cache: Miss from cloudfront

Even if we're not yet expecting this environment to be "future prod," we'd do well to sort this out.

Another datapoint from yesterday:

Again we're seeing requests for assets on CloudFront routed to Japanese nodes, with higher latency/request-transfer times.

    Here's a webpagetest.org run:
        http://www.webpagetest.org/result/150825_SR_16P9/1/details/
    Here's a traceroute:
        sdonners-MacBook-Pro:~ sdonner$ traceroute d3kg653zred9tf.cloudfront.net
        traceroute: Warning: d3kg653zred9tf.cloudfront.net has multiple addresses; using 216.137.52.241
        traceroute to d3kg653zred9tf.cloudfront.net (216.137.52.241), 64 hops max, 52 byte packets
         1  fw1.corp.mtv2.mozilla.net (10.252.24.1)  2.240 ms * *
         2  v-1075.border1.sjc2.mozilla.net (63.245.219.186)  2.790 ms  3.219 ms  2.908 ms
         3  xe-3-1-0.mpr2.pao1.us.above.net (64.125.170.33)  2.889 ms  2.630 ms  2.548 ms
         4  zayo-kddi.pao1.us.zip.zayo.com (64.125.14.34)  2.585 ms  2.739 ms  2.881 ms
         5  pajbb002.int-gw.kddi.ne.jp (111.87.3.77)  2.544 ms
            pajbb001.int-gw.kddi.ne.jp (111.87.3.73)  2.205 ms
            pajbb002.int-gw.kddi.ne.jp (111.87.3.77)  22.997 ms
         6  obpjbb205.int-gw.kddi.ne.jp (203.181.100.193)  146.551 ms
            obpjbb206.int-gw.kddi.ne.jp (203.181.100.37)  108.115 ms
            obpjbb206.int-gw.kddi.ne.jp (203.181.100.185)  112.914 ms
         7  jc-osa302.int-gw.kddi.ne.jp (113.157.227.122)  134.100 ms
            jc-osa302.int-gw.kddi.ne.jp (113.157.227.126)  113.457 ms
            jc-osa302.int-gw.kddi.ne.jp (113.157.227.122)  127.153 ms
         8  106.187.29.142 (106.187.29.142)  116.112 ms  114.752 ms  119.901 ms
         9  27.0.0.250 (27.0.0.250)  127.205 ms
            54.239.52.142 (54.239.52.142)  128.851 ms
            27.0.0.250 (27.0.0.250)  133.873 ms
        10  54.239.52.149 (54.239.52.149)  122.667 ms  127.080 ms
            54.239.52.135 (54.239.52.135)  123.080 ms
        11  27.0.0.237 (27.0.0.237)  124.237 ms
        ^C
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
Richard, any potential insights on this?
Flags: needinfo?(riweiss)
Stephen, can you give me access to that AWS account? I'd like to look at the configuration and possibly open a ticket with Amazon.
Flags: needinfo?(riweiss) → needinfo?(stephen.donner)
(In reply to Richard Weiss [:r2] from comment #8)
> Stephen, can you give me access to that AWS account? I'd like to look at the
> configuration and possibly open a ticket with Amazon.

Sorry, I have zero knowledge of it - was told that fox2mike and/or jakem would, though.
Flags: needinfo?(stephen.donner) → needinfo?(smani)
I pointed r2 at the right account
Flags: needinfo?(smani)
r2, fox2mike, does it look like a CDN issue? who should own that bug?
Flags: needinfo?(smani)
Flags: needinfo?(riweiss)
(In reply to Arzhel Younsi [:XioNoX] from comment #12)
> r2, fox2mike, does it look like a CDN issue? who should own that bug?

Hey Arzhel - sorry for not updating the bug earlier; Jason Thomas has reached out to Amazon with a trouble ticket (1524721631), and their last response, 6 days ago, was that they are looking into it further.
I can see the same for our Mozmill-CI nodes in qa.scl3.mozilla.com. We got dozens of test failures today, and downloads of builds take extraordinarily long, at about 25 kb/s to 75 kb/s.
Just got this from Amazon support:

> For some reasons when you use your DNS resolver we are mapping you always outside US, which is the issue I am working to get corrected with our Cloudfront Services Team. If you are using Google DNS, Google sends your actual Public IP as part of the DNS extension, in this case Cloudfront relies on Geo Database and is mapping you with-in US based on Geo Distance.
> This certainly is an issue at our side which I am working to get corrected.
> We regret the inconvenience this has been causing and the delay in resolving this.
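
For context, the "DNS extension" Amazon refers to is EDNS Client Subnet (ECS). A hedged sketch of how that difference can be observed (assuming Python 3 and a dig build with +subnet support, BIND 9.10 or newer; the 198.51.100.0/24 prefix is purely a documentation example, not anything from this bug):

import subprocess

HOSTNAME = "d3kg653zred9tf.cloudfront.net"

def resolve(hostname, server="8.8.8.8", subnet=None):
    """Query the given DNS server, optionally asserting a client subnet via ECS."""
    cmd = ["dig", f"@{server}", "+short", hostname]
    if subnet:
        cmd.append(f"+subnet={subnet}")
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return [line for line in out.splitlines() if line]

# Without an explicit ECS prefix the resolver's own location tends to drive the
# answer; with one, CloudFront can map the asserted client prefix instead.
print("no ECS :", resolve(HOSTNAME))
print("US ECS :", resolve(HOSTNAME, subnet="198.51.100.0/24"))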
As noted earlier, this also affects our Firefox UI tests for Firefox Desktop as run in SCL3, so I'm updating the summary to reflect the actual scope.

Thank you, Jeremy, for the new information. Hopefully they will be able to fix that soon. As we noticed late yesterday, the Japan/Hong Kong routing got fixed, but we were still being routed via the UK for US-internal traffic.
Summary: From MTV2 (wired LAN) traffic was sent to Cloudfront CDN nodes/POPs in Japan/Hong Kong for 2 hours (affected Mozilla.org test automation) → From MTV2/SCL3 (wired LAN) traffic gets send to Cloudfront CDN nodes/POPs in Japan/Hong Kong
This blocks most parts of our automation due to broken downloads of the Firefox binary. Adding our whiteboard entry for tracking.
Whiteboard: [fromAutomation] → [fromAutomation][qa-automation-blocked]
Summary: From MTV2/SCL3 (wired LAN) traffic gets send to Cloudfront CDN nodes/POPs in Japan/Hong Kong → From MTV2/SCL3 (wired LAN) traffic gets sent to CloudFront CDN nodes/POPs in Japan/Hong Kong
Can you tell me which AWS account this ticket was opened in? Either the account name or number will do. I'd like to call this to the attention of our Amazon account team.
Flags: needinfo?(riweiss) → needinfo?(stephen.donner)
Richard, I've e-mailed you the case id.
Flags: needinfo?(stephen.donner)
Did we get an update here in the meantime? Half a month has passed. As of today we hit bug 1219934 again for a couple of Firefox DMGs (ar locale).
Flags: needinfo?(oremj)
It seems to be getting better. I've attached a throughput graph; the average looks quite a bit better over the last 4 days. The ticket is still open with AWS, and they are working on it.

We are also considering fronting CloudFront with Akamai, which should eliminate these performance issues.
Flags: needinfo?(oremj)
Attachment #8688527 - Attachment description: cloudfront-throughput.PNG → scl3 -> cloudfront throughput
Moving the bug to oremj/Services.
Assignee: arzhel → oremj
Component: NetOps: Other → Operations
Product: Infrastructure & Operations → Cloud Services
QA Contact: jbarnell
We again had a couple of incompletely downloaded installers for the latest Firefox 43.0b8 build1 candidate builds, which caused problems across all platforms.
(In reply to Henrik Skupin (:whimboo) from comment #23)
> We again had a couple of incompletely downloaded installers for the latest
> Firefox 43.0b8 build1 candidate builds, which caused problems across all
> platforms.

Incomplete != latency, to my knowledge; can you please file a separate issue for that (and cc me), Henrik?
Flags: needinfo?(hskupin)
Stephen, bug 1219934 is already filed and is also marked as a dependency of this bug. As :oremj mentioned, the two should be related.
Flags: needinfo?(hskupin)
Flags: needinfo?(smani)
No longer blocks: 1231938
I haven't seen broken installer downloads for a while now, so at least I can remove the blocking whiteboard entry.
Whiteboard: [fromAutomation][qa-automation-blocked] → [fromAutomation]
I'm going to assume this has been fixed on CloudFront's side. Please reopen if there are still problems.
Status: REOPENED → RESOLVED
Closed: 9 years ago → 8 years ago
Resolution: --- → FIXED
(In reply to Jeremy Orem [:oremj] from comment #28)
> I'm going to assume this has been fixed on cloudfront's side. Please reopen
> if there are still problems.

Yep, thanks, Jeremy. I haven't seen this rear its head since.
Status: RESOLVED → VERIFIED