Closed
Bug 1170832
Opened 9 years ago
Closed 8 years ago
From MTV2/SCL3 (wired LAN) traffic gets sent to CloudFront CDN nodes/POPs in Japan/Hong Kong
Categories
(Cloud Services :: Operations: Miscellaneous, task)
Tracking
(Not tracked)
VERIFIED
FIXED
People
(Reporter: stephend, Assigned: oremj)
References
Details
(Whiteboard: [fromAutomation])
Attachments
(1 file)
100.07 KB, image/png
(Filing this primarily to note it -- I'm not sure what, if anything, can be done from our -- Mozilla's -- side to help mitigate this.)

Problem: From ~11:55am PDT to 2:00pm PDT, we noticed a huge slowdown (up to 15 seconds for a GET to a font/image/asset to complete) in our Mozilla.org test-automation suite (and manually, too) against the following host: https://www.allizom.org/en-US/ (which uses CloudFront as its CDN)

The problem appears to have been a CloudFront misconfiguration, which sent at least internal Mozilla traffic first to a CDN node in Japan and then, later, to one in Hong Kong. When it finally recovered, we hit an Equinix-peered Amazon CDN node in SF (see last traceroute, below).

Resources:

A) New Relic captured this with its Real User Measurement feature:
1) https://rpm.newrelic.com/accounts/263620/browser/2638806 - www.allizom.org (staging) data for the Python side of Mozilla.org
2) https://rpm.newrelic.com/public/charts/iuRHrSSbARd - chart from the 2-hour window, showing the response-time spike for the above (average of 29 seconds page-load time, but at times, 36+ seconds).
B) Here's a screencast, demonstrating the end-user problem (again, on staging, which affects test automation and manual testing): http://screencast.com/t/OBxkVr30T

C) And, finally, here are some traceroutes:

C:\Users\stephend>tracert d3kg653zred9tf.cloudfront.net

Tracing route to d3kg653zred9tf.cloudfront.net [54.182.4.69]
over a maximum of 30 hops:

  1    <1 ms    <1 ms    <1 ms  fw1.corp.mtv2.mozilla.net [10.252.24.1]
  2     1 ms     1 ms     1 ms  v-1077.border1.sjc2.mozilla.net [63.245.219.182]
  3     2 ms     3 ms     2 ms  sjo-b21-link.telia.net [62.115.8.161]
  4     2 ms     2 ms     2 ms  ntt-ic-306350-sjo-b21.c.telia.net [62.115.44.46]
  5     2 ms     2 ms     2 ms  ae-4.r23.snjsca04.us.bb.gin.ntt.net [129.250.4.25]
  6   128 ms   128 ms   128 ms  ae-6.r21.osakjp02.jp.bb.gin.ntt.net [129.250.2.131]
  7   127 ms   114 ms   114 ms  ae-5.r23.osakjp02.jp.bb.gin.ntt.net [129.250.6.144]
  8   153 ms   153 ms   159 ms  ae-8.r23.tkokhk01.hk.bb.gin.ntt.net [129.250.2.222]
  9   167 ms   166 ms   166 ms  ae-1.r01.newthk03.hk.bb.gin.ntt.net [129.250.2.3]
 10   175 ms   177 ms   175 ms  203.131.245.182
 11   175 ms   175 ms   175 ms  server-54-182-4-69.hkg51.r.cloudfront.net [54.182.4.69]

C:\Users\stephend>tracert d37xxasan61b7v.cloudfront.net

Tracing route to d37xxasan61b7v.cloudfront.net [54.239.194.209]
over a maximum of 30 hops:

  1     *        *        *     Request timed out.
  2     2 ms     1 ms     1 ms  v-1077.border1.sjc2.mozilla.net [63.245.219.182]
  3     2 ms     2 ms     2 ms  sjo-b21-link.telia.net [62.115.8.161]
  4     2 ms     2 ms     2 ms  ntt-ic-306350-sjo-b21.c.telia.net [62.115.44.46]
  5     2 ms     2 ms     2 ms  ae-4.r22.snjsca04.us.bb.gin.ntt.net [129.250.4.5]
  6   129 ms   129 ms   123 ms  ae-7.r20.osakjp02.jp.bb.gin.ntt.net [129.250.2.165]
  7   119 ms   159 ms   119 ms  ae-4.r23.osakjp02.jp.bb.gin.ntt.net [129.250.6.90]
  8   133 ms   133 ms   133 ms  ae-2.r01.osakjp02.jp.bb.gin.ntt.net [129.250.3.199]
  9   121 ms   120 ms   121 ms  ae-1.amazon.osakjp02.jp.bb.gin.ntt.net [61.200.82.218]
 10   235 ms   235 ms     *     54.239.52.144
 11   235 ms   234 ms   234 ms  54.239.52.149
 12   236 ms     *     237 ms   27.0.0.115
 13     *        *        *     Request timed out.
Once this cleared up ~2pm PDT, we saw this:

C:\Users\stephend>tracert d3kg653zred9tf.cloudfront.net

Tracing route to d3kg653zred9tf.cloudfront.net [205.251.215.221]
over a maximum of 30 hops:

  1    <1 ms    <1 ms    <1 ms  fw1.corp.mtv2.mozilla.net [10.252.24.1]
  2     2 ms     1 ms     2 ms  v-1077.border1.sjc2.mozilla.net [63.245.219.182]
  3     1 ms     1 ms     1 ms  xe-5-0-0.mpr4.sjc7.us.above.net [64.125.170.37]
  4     1 ms     1 ms     1 ms  ae11.mpr3.sjc7.us.zip.zayo.com [64.125.28.1]
  5     1 ms     1 ms     1 ms  equinix02-sfo5.amazon.com [206.223.116.236]
  6     2 ms     2 ms     4 ms  205.251.229.141
  7     1 ms     1 ms     1 ms  205.251.230.91
  8     1 ms     2 ms     1 ms  server-205-251-215-221.sfo5.r.cloudfront.net [205.251.215.221]

I should note: although the performance of the routes to, and of, the CDN POPs in Japan and Hong Kong was poor, the real problem seems to be that we were sent to those at all, rather than to a local POP -- at least somewhere in North America or close.
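The reverse-DNS names at the ends of these traceroutes encode which POP served the request: the label after the `server-…` part (e.g. `hkg51`, `sfo5`) begins with an IATA-style airport code for the POP's city. A minimal sketch for extracting that label, assuming the naming pattern seen in these traceroutes (inferred from the hostnames in this bug, not from any documented CloudFront interface):

```python
import re
from typing import Optional

def cloudfront_pop(hostname: str) -> Optional[str]:
    """Extract the POP label (e.g. 'hkg51') from a CloudFront edge
    hostname like 'server-54-182-4-69.hkg51.r.cloudfront.net'.
    Returns None if the name doesn't match the observed pattern."""
    m = re.match(r"server-[\d-]+\.([a-z]+\d+)\.r\.cloudfront\.net$", hostname)
    return m.group(1) if m else None

# The two edges seen above: Hong Kong vs. San Francisco.
print(cloudfront_pop("server-54-182-4-69.hkg51.r.cloudfront.net"))     # hkg51
print(cloudfront_pop("server-205-251-215-221.sfo5.r.cloudfront.net"))  # sfo5
```

The leading three letters (HKG, SFO) are enough to tell at a glance whether a wired MTV2/SCL3 client is being mapped to a local edge or an overseas one.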
Comment 1•9 years ago
Did someone open a ticket with cloudfront? NI on fox2mike as I think webops manages the CDNs
Assignee: network-operations → arzhel
Flags: needinfo?(smani)
Comment 2•9 years ago
(In reply to Arzhel Younsi [:XioNoX] from comment #1)
> Did someone open a ticket with cloudfront?
>
> NI on fox2mike as I think webops manages the CDNs

I'm not sure how this automatically warrants a ticket with Amazon, especially when the ticket says it's from the office. I'm also not sure whose Amazon account this is being set up from :) and will have to dig around to find that. You might be able to file a more meaningful ticket than I can wrt this, fwiw.
Flags: needinfo?(smani)
Comment 3•9 years ago
Based on comment 0 it seems to be a CDN issue, and I have no idea how that's managed. I can't reproduce it, though, so they probably fixed it. If it was a one-time occurrence, I don't think it's worth opening a ticket with them.

Note that they usually use your DNS resolver to figure out which endpoint to send you to, so make sure you're using a local resolver.

For example, if I'm connected to the MozillaVPN and use it for DNS (the default), they will redirect me to the SFO1 endpoint even though I'm in Paris. Using Google's DNS will send me to the Paris one.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → WORKSFORME
Reporter
Comment 4•9 years ago
(In reply to Arzhel Younsi [:XioNoX] from comment #3)
> Based on comment 0 it seems to be a CDN issue and I have no idea on how
> that's managed. I can't reproduce it though so they probably fixed it. If
> it's the only time I don't think it's worth opening a ticket with them.
>
> Note that they usually use your DNS resolver to figure out which endpoint to
> send you to. So make sure you're using a local resolver.
>
> For example if I'm connected to the MozillaVPN and use it for DNS (by
> default), even if I'm in Paris they will redirect me to the SFO1 endpoint.
> Using google's dns, will send to to the Paris one.

Right; FWIW, all of our in-Mountain View (MTV2 QA Lab) hosts (on LAN), as well as my wired-desk connection, saw this, and all are using the default DNS servers through DHCP.
Reporter
Comment 5•9 years ago
Just going to dump some more info from today:

C:\Users\Stephen Donner>tracert d3kg653zred9tf.cloudfront.net

Tracing route to d3kg653zred9tf.cloudfront.net [205.251.212.200]
over a maximum of 30 hops:

  1    <1 ms    <1 ms    <1 ms  fw1.corp.mtv2.mozilla.net [10.252.24.1]
  2     1 ms    <1 ms    <1 ms  v-1075.border1.sjc2.mozilla.net [63.245.219.186]
  3     1 ms     1 ms     1 ms  xe-3-1-0.mpr2.pao1.us.above.net [64.125.170.33]
  4     1 ms     1 ms     1 ms  zayo-kddi.pao1.us.zip.zayo.com [64.125.14.34]
  5     1 ms     1 ms     2 ms  pajbb002.int-gw.kddi.ne.jp [111.87.3.61]
  6   114 ms   111 ms   111 ms  obpjbb206.int-gw.kddi.ne.jp [203.181.100.185]
  7   110 ms   110 ms   110 ms  jc-osa302.kddnet.ad.jp [113.157.227.46]
  8   111 ms   111 ms   114 ms  106.187.29.142
  9   120 ms   120 ms   117 ms  27.0.0.250
 10   111 ms   115 ms   111 ms  27.0.2.1
 11     *        *        *     Request timed out.
 12     *        *

Even externally, over a "cable" connection (hosted by Yottaa.com) via webpagetest.org, we get nearly 14-second page-load time: http://www.webpagetest.org/result/150825_7K_1546/1/details/

No way to tell whether they were also sent to a Japanese node rather than a "local" US-based one, but at least this shows it's external to the MTV2 network (I hope). In particular, nearly 4 seconds to get a single image over the wire: http://www.webpagetest.org/result/150825_7K_1546/1/details/#request112

Request 112: https://d3kg653zred9tf.cloudfront.net/media/img/firefox/partners/form-bg.ce84507a80dd.jpg
URL: https://d3kg653zred9tf.cloudfront.net/media/img/firefox/partners/form-bg.ce84507a80dd.jpg
Host: d3kg653zred9tf.cloudfront.net
IP: 127.0.0.1
Error/Status Code: 200
Initiated By: https://d3kg653zred9tf.cloudfront.net/media/js/partners_desktop-bundle.d53e626103af.js line 4 column 17757
Request Start: 9.272 s
Time to First Byte: 3990 ms

This might be telling?

X-Cache-Info: caching
X-Cache-Info: caching
X-Cache: Miss from cloudfront
Via: 1.1 1dfcd1bafe1d21759a09fdcad1a93705.cloudfront.net (CloudFront)
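The response headers quoted above show two symptoms at once: a duplicated X-Cache-Info header and a CloudFront cache miss. A small sketch for flagging both mechanically (the header list is copied from the response above; the helper name is mine):

```python
def diagnose_cdn_headers(headers):
    """Flag two symptoms from this bug: repeated X-Cache-Info headers
    and a CloudFront cache miss. `headers` is a list of (name, value)
    pairs, preserving duplicates the way raw HTTP responses do."""
    findings = []
    names = [name.lower() for name, _ in headers]
    if names.count("x-cache-info") > 1:
        findings.append("redundant X-Cache-Info headers")
    for name, value in headers:
        if name.lower() == "x-cache" and value.lower().startswith("miss"):
            findings.append("CloudFront cache miss: " + value)
    return findings

# The headers exactly as reported in this comment:
observed = [
    ("X-Cache-Info", "caching"),
    ("X-Cache-Info", "caching"),
    ("X-Cache", "Miss from cloudfront"),
    ("Via", "1.1 1dfcd1bafe1d21759a09fdcad1a93705.cloudfront.net (CloudFront)"),
]
print(diagnose_cdn_headers(observed))
```

A run against a healthy response (a single `X-Cache: Hit from cloudfront`) returns an empty list, which makes this easy to drop into an automation health check.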
Reporter
Comment 6•9 years ago
Reopening; there are at least 2 bugs here besides the perf/routing/DNS/node-region issue:

1) redundant X-Cache-Info: caching headers/values
2) an X-Cache: Miss from cloudfront

Even if we're not yet expecting this environment to be "future prod," we'd do well to sort this out.

Another datapoint from yesterday: again we're seeing assets on CloudFront being sent to Japanese nodes, with higher latency/request-transfer times.

Here's a webpagetest.org run: http://www.webpagetest.org/result/150825_SR_16P9/1/details/

Here's a traceroute:

sdonners-MacBook-Pro:~ sdonner$ traceroute d3kg653zred9tf.cloudfront.net
traceroute: Warning: d3kg653zred9tf.cloudfront.net has multiple addresses; using 216.137.52.241
traceroute to d3kg653zred9tf.cloudfront.net (216.137.52.241), 64 hops max, 52 byte packets
 1  fw1.corp.mtv2.mozilla.net (10.252.24.1)  2.240 ms * *
 2  v-1075.border1.sjc2.mozilla.net (63.245.219.186)  2.790 ms  3.219 ms  2.908 ms
 3  xe-3-1-0.mpr2.pao1.us.above.net (64.125.170.33)  2.889 ms  2.630 ms  2.548 ms
 4  zayo-kddi.pao1.us.zip.zayo.com (64.125.14.34)  2.585 ms  2.739 ms  2.881 ms
 5  pajbb002.int-gw.kddi.ne.jp (111.87.3.77)  2.544 ms
    pajbb001.int-gw.kddi.ne.jp (111.87.3.73)  2.205 ms
    pajbb002.int-gw.kddi.ne.jp (111.87.3.77)  22.997 ms
 6  obpjbb205.int-gw.kddi.ne.jp (203.181.100.193)  146.551 ms
    obpjbb206.int-gw.kddi.ne.jp (203.181.100.37)  108.115 ms
    obpjbb206.int-gw.kddi.ne.jp (203.181.100.185)  112.914 ms
 7  jc-osa302.int-gw.kddi.ne.jp (113.157.227.122)  134.100 ms
    jc-osa302.int-gw.kddi.ne.jp (113.157.227.126)  113.457 ms
    jc-osa302.int-gw.kddi.ne.jp (113.157.227.122)  127.153 ms
 8  106.187.29.142 (106.187.29.142)  116.112 ms  114.752 ms  119.901 ms
 9  27.0.0.250 (27.0.0.250)  127.205 ms
    54.239.52.142 (54.239.52.142)  128.851 ms
    27.0.0.250 (27.0.0.250)  133.873 ms
10  54.239.52.149 (54.239.52.149)  122.667 ms  127.080 ms
    54.239.52.135 (54.239.52.135)  123.080 ms
11  27.0.0.237 (27.0.0.237)  124.237 ms
^C
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
Comment 8•9 years ago
Stephen, can you give me access to that AWS account? I'd like to look at the configuration and possibly open a ticket with Amazon.
Flags: needinfo?(riweiss) → needinfo?(stephen.donner)
Reporter
Comment 9•9 years ago
(In reply to Richard Weiss [:r2] from comment #8)
> Stephen, can you give me access to that AWS account? I'd like to look at the
> configuration and possibly open a ticket with Amazon.

Sorry, I have zero knowledge of it -- I was told that fox2mike and/or jakem would, though.
Flags: needinfo?(stephen.donner) → needinfo?(smani)
Comment 11•9 years ago
https://www.dropbox.com/s/vhif6mzowhv5v2a/Screenshot%202015-08-27%2014.50.41.png?dl=0

That's the setting it's currently at; I wonder whether it has an impact?
Comment 12•9 years ago
r2, fox2mike: does this look like a CDN issue? Who should own this bug?
Flags: needinfo?(smani)
Flags: needinfo?(riweiss)
Reporter
Comment 13•9 years ago
(In reply to Arzhel Younsi [:XioNoX] from comment #12)
> r2, fox2mike, does it look like a CDN issue? who should own that bug?

Hey Arzhel - sorry for not updating the bug earlier; Jason Thomas has reached out to Amazon with a trouble ticket (1524721631), and their last response, 6 days ago, was that they are looking into it further.
Comment 14•9 years ago
I can see the same for our Mozmill-CI nodes in qa.scl3.mozilla.com. We got dozens of test failures today, and downloads of builds take extraordinarily long, at about 25 KB/s up to 75 KB/s.
Assignee
Comment 15•9 years ago
Just got this from Amazon support:
> For some reason, when you use your DNS resolver we are always mapping you
> outside the US, which is the issue I am working to get corrected with our
> CloudFront Services Team. If you are using Google DNS, Google sends your
> actual public IP as part of the DNS extension; in this case CloudFront
> relies on its geo database and maps you within the US based on geo distance.
>
> This certainly is an issue on our side which I am working to get corrected.
>
> We regret the inconvenience this has been causing and the delay in resolving this.
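Amazon's explanation matches how EDNS Client Subnet (ECS) works: without ECS, the CDN's authoritative DNS only sees the recursive resolver's IP and geo-maps on that; a resolver like Google Public DNS forwards the client's subnet, so the CDN can map on the client instead. A toy sketch of the two paths (the prefixes and the geo table are made up for illustration; real CloudFront mapping is far more involved):

```python
from typing import Optional

# Hypothetical /24-prefix -> region table, a stand-in for the CDN's geo database.
GEO_DB = {
    "63.245.219": "us-west",    # Mozilla SJC border, per the traceroutes in this bug
    "198.51.100": "eu-west",    # made-up resolver prefix that geo-maps abroad
}

def pick_region(resolver_prefix: str, ecs_client_prefix: Optional[str]) -> str:
    """Prefer the ECS-forwarded client subnet when present; otherwise the
    CDN can only geo-map on the resolver's own address."""
    prefix = ecs_client_prefix if ecs_client_prefix else resolver_prefix
    return GEO_DB.get(prefix, "unknown (may be mapped to a distant POP)")

# Resolver that doesn't forward ECS and isn't in the geo DB: mapping goes wrong.
print(pick_region("10.252.24", None))
# Resolver that forwards the client's subnet: mapping lands in the right region.
print(pick_region("10.252.24", "63.245.219"))
```

This is why comment 3's advice about using a local (or ECS-capable) resolver matters: the CDN's answer is only as good as the client-location signal the resolver gives it.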
Comment 16•9 years ago
As noted earlier, this also affects our Firefox UI tests for Firefox Desktop as run in SCL3, so I'm updating the summary to reflect the real state. Thank you, Jeremy, for the new information; hopefully they will be able to fix that soon. As we noticed late yesterday, the Japan/Hong Kong routing got fixed, but we were still routed via the UK for US-internal traffic.
Summary: From MTV2 (wired LAN) traffic was sent to Cloudfront CDN nodes/POPs in Japan/Hong Kong for 2 hours (affected Mozilla.org test automation) → From MTV2/SCL3 (wired LAN) traffic gets send to Cloudfront CDN nodes/POPs in Japan/Hong Kong
Comment 17•9 years ago
This blocks most parts of our automation due to broken downloads of the Firefox binary. Adding our whiteboard entry for tracking.
Whiteboard: [fromAutomation] → [fromAutomation][qa-automation-blocked]
Reporter
Updated•9 years ago
Summary: From MTV2/SCL3 (wired LAN) traffic gets send to Cloudfront CDN nodes/POPs in Japan/Hong Kong → From MTV2/SCL3 (wired LAN) traffic gets sent to CloudFront CDN nodes/POPs in Japan/Hong Kong
Comment 18•9 years ago
Can you tell me which AWS account this ticket was opened in? Either the account name or number will do. I'd like to call this to the attention of our Amazon account team.
Flags: needinfo?(riweiss) → needinfo?(stephen.donner)
Assignee
Comment 19•9 years ago
Richard, I've e-mailed you the case id.
Reporter
Updated•9 years ago
Flags: needinfo?(stephen.donner)
Comment 20•9 years ago
Did we get an update here in the meantime? Half a month has passed. As of today we hit bug 1219934 again for a couple of DMGs of Firefox (ar locale).
Flags: needinfo?(oremj)
Assignee
Comment 21•9 years ago
It seems to be getting better; I've attached a throughput graph, and the average looks quite a bit better over the last 4 days. The ticket is still open with AWS and they are working on it. We are also considering fronting CloudFront with Akamai, which should eliminate these performance issues.
Flags: needinfo?(oremj)
Assignee
Updated•9 years ago
Attachment #8688527 - Attachment description: cloudfront-throughput.PNG → scl3 -> cloudfront throughput
Comment 22•9 years ago
Moving the bug to oremj/Services.
Assignee: arzhel → oremj
Component: NetOps: Other → Operations
Product: Infrastructure & Operations → Cloud Services
QA Contact: jbarnell
Comment 23•9 years ago
We again had a couple of incompletely downloaded installers for the latest Firefox 43.0b8 build1 candidate builds, which caused problems across all platforms.
Reporter
Comment 24•9 years ago
(In reply to Henrik Skupin (:whimboo) from comment #23)
> We had again a couple of incomplete downloaded installers for the latest
> Firefox 43.0b8 build1 candidate builds which caused problems across all
> platforms.

Incomplete downloads != latency, to my knowledge; can you please file a separate issue for that (and cc: me), Henrik?
Flags: needinfo?(hskupin)
Comment 25•9 years ago
Stephen, there is already bug 1219934 filed, which is also marked as a dependency of this bug. As :oremj mentioned, those should be related.
Flags: needinfo?(hskupin)
Updated•9 years ago
Flags: needinfo?(smani)
Comment 27•8 years ago
I haven't seen broken installer downloads for a while now, so at least I can remove the blocking whiteboard entry.
Whiteboard: [fromAutomation][qa-automation-blocked] → [fromAutomation]
Assignee
Comment 28•8 years ago
I'm going to assume this has been fixed on cloudfront's side. Please reopen if there are still problems.
Status: REOPENED → RESOLVED
Closed: 9 years ago → 8 years ago
Resolution: --- → FIXED
Reporter
Comment 29•8 years ago
(In reply to Jeremy Orem [:oremj] from comment #28)
> I'm going to assume this has been fixed on cloudfront's side. Please reopen
> if there are still problems.

Yep, thanks, Jeremy. I haven't seen this rear its head since.
Status: RESOLVED → VERIFIED