(Filing this primarily to note it -- I'm not sure what, if anything, can be done from our -- Mozilla's -- side to help mitigate this.)

Problem:

From ~11:55am PDT to 2:00pm PDT, we noticed a huge slowdown (up to 15 seconds for a GET to a font/image/asset to complete) in our Mozilla.org test-automation suite (and manually, too) against the following host: https://www.allizom.org/en-US/ (which uses Cloudfront as its CDN).

The problem appears to have been a Cloudfront misconfiguration, which sent at least internal Mozilla traffic first to a CDN node in Japan, and then, later, to one in Hong Kong. When it finally recovered, we hit an Equinix-peered Amazon CDN node in SF (see the last traceroute, below).

Resources:

A) New Relic captured this with its Real User Monitoring feature:

1) https://rpm.newrelic.com/accounts/263620/browser/2638806 - www.allizom.org (staging) data for the Python side of Mozilla.org
2) https://rpm.newrelic.com/public/charts/iuRHrSSbARd - chart from the 2-hour window, showing the response-time spike for the above (an average of 29 seconds page-load time, but at times 36+ seconds).
B) Here's a screencast demonstrating the end-user problem (again, on staging, which affects test automation and manual testing): http://screencast.com/t/OBxkVr30T

C) And, finally, here are some traceroutes:

C:\Users\stephend>tracert d3kg653zred9tf.cloudfront.net

Tracing route to d3kg653zred9tf.cloudfront.net [184.108.40.206]
over a maximum of 30 hops:

  1    <1 ms    <1 ms    <1 ms  fw1.corp.mtv2.mozilla.net [10.252.24.1]
  2     1 ms     1 ms     1 ms  v-1077.border1.sjc2.mozilla.net [220.127.116.11]
  3     2 ms     3 ms     2 ms  sjo-b21-link.telia.net [18.104.22.168]
  4     2 ms     2 ms     2 ms  ntt-ic-306350-sjo-b21.c.telia.net [22.214.171.124]
  5     2 ms     2 ms     2 ms  ae-4.r23.snjsca04.us.bb.gin.ntt.net [126.96.36.199]
  6   128 ms   128 ms   128 ms  ae-6.r21.osakjp02.jp.bb.gin.ntt.net [188.8.131.52]
  7   127 ms   114 ms   114 ms  ae-5.r23.osakjp02.jp.bb.gin.ntt.net [184.108.40.206]
  8   153 ms   153 ms   159 ms  ae-8.r23.tkokhk01.hk.bb.gin.ntt.net [220.127.116.11]
  9   167 ms   166 ms   166 ms  ae-1.r01.newthk03.hk.bb.gin.ntt.net [18.104.22.168]
 10   175 ms   177 ms   175 ms  22.214.171.124
 11   175 ms   175 ms   175 ms  server-54-182-4-69.hkg51.r.cloudfront.net [126.96.36.199]

C:\Users\stephend>tracert d37xxasan61b7v.cloudfront.net

Tracing route to d37xxasan61b7v.cloudfront.net [188.8.131.52]
over a maximum of 30 hops:

  1     *        *        *     Request timed out.
  2     2 ms     1 ms     1 ms  v-1077.border1.sjc2.mozilla.net [184.108.40.206]
  3     2 ms     2 ms     2 ms  sjo-b21-link.telia.net [220.127.116.11]
  4     2 ms     2 ms     2 ms  ntt-ic-306350-sjo-b21.c.telia.net [18.104.22.168]
  5     2 ms     2 ms     2 ms  ae-4.r22.snjsca04.us.bb.gin.ntt.net [22.214.171.124]
  6   129 ms   129 ms   123 ms  ae-7.r20.osakjp02.jp.bb.gin.ntt.net [126.96.36.199]
  7   119 ms   159 ms   119 ms  ae-4.r23.osakjp02.jp.bb.gin.ntt.net [188.8.131.52]
  8   133 ms   133 ms   133 ms  ae-2.r01.osakjp02.jp.bb.gin.ntt.net [184.108.40.206]
  9   121 ms   120 ms   121 ms  ae-1.amazon.osakjp02.jp.bb.gin.ntt.net [220.127.116.11]
 10   235 ms   235 ms     *    18.104.22.168
 11   235 ms   234 ms   234 ms  22.214.171.124
 12   236 ms     *     237 ms  126.96.36.199
 13     *        *        *     Request timed out.
Once this cleared up ~2pm PDT, we saw this:

C:\Users\stephend>tracert d3kg653zred9tf.cloudfront.net

Tracing route to d3kg653zred9tf.cloudfront.net [188.8.131.52]
over a maximum of 30 hops:

  1    <1 ms    <1 ms    <1 ms  fw1.corp.mtv2.mozilla.net [10.252.24.1]
  2     2 ms     1 ms     2 ms  v-1077.border1.sjc2.mozilla.net [184.108.40.206]
  3     1 ms     1 ms     1 ms  xe-5-0-0.mpr4.sjc7.us.above.net [220.127.116.11]
  4     1 ms     1 ms     1 ms  ae11.mpr3.sjc7.us.zip.zayo.com [18.104.22.168]
  5     1 ms     1 ms     1 ms  equinix02-sfo5.amazon.com [22.214.171.124]
  6     2 ms     2 ms     4 ms  126.96.36.199
  7     1 ms     1 ms     1 ms  188.8.131.52
  8     1 ms     2 ms     1 ms  server-205-251-215-221.sfo5.r.cloudfront.net [184.108.40.206]

I should note: although the performance of the routes to the CDN POPs in Japan and Hong Kong was poor, the real problem seems to be that we were sent to those at all, rather than to a local POP -- at least one somewhere in North America, or close to it.
Did someone open a ticket with cloudfront? NI on fox2mike as I think webops manages the CDNs
(In reply to Arzhel Younsi [:XioNoX] from comment #1)
> Did someone open a ticket with cloudfront?
>
> NI on fox2mike as I think webops manages the CDNs

I'm not sure this automatically warrants a ticket with Amazon, especially when the ticket says it's from the office. I'm also not sure whose Amazon account this was set up under :) and will have to dig around to find that. You might be able to file a more meaningful ticket than I can wrt this, fwiw.
Based on comment 0 it seems to be a CDN issue, and I have no idea how that's managed. I can't reproduce it, though, so they probably fixed it. If it was a one-time event, I don't think it's worth opening a ticket with them.

Note that they usually use your DNS resolver to figure out which endpoint to send you to, so make sure you're using a local resolver.

For example, if I'm connected to the MozillaVPN and use it for DNS (the default), they will redirect me to the SFO1 endpoint even if I'm in Paris. Using Google's DNS will send me to the Paris one.
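One quick way to see which endpoint DNS actually mapped you to is the reverse-DNS name of the edge server, which shows up in the traceroutes above (e.g. server-54-182-4-69.hkg51.r.cloudfront.net). The POP code embedded in those names (an IATA-style airport code plus a number) is an observed naming convention, not a documented API, so treat this as a hedged sketch:

```python
import re

# CloudFront edge hostnames appear to embed the POP code, e.g. "hkg51"
# (Hong Kong) or "sfo5" (San Francisco) in the reverse-DNS names seen in
# the traceroutes above. This pattern is inferred from observation only.
POP_RE = re.compile(r"server-[\d-]+\.([a-z]{3})(\d+)\.r\.cloudfront\.net")

def pop_of(hostname):
    """Return (airport_code, pop_number) for a CloudFront edge hostname,
    or None if the name doesn't match the observed pattern."""
    m = POP_RE.match(hostname)
    if m is None:
        return None
    return m.group(1).upper(), int(m.group(2))

print(pop_of("server-54-182-4-69.hkg51.r.cloudfront.net"))    # ('HKG', 51)
print(pop_of("server-205-251-215-221.sfo5.r.cloudfront.net")) # ('SFO', 5)
```

Running this against the hostnames from a `tracert`/`traceroute` (or a reverse lookup of the resolved IP) makes it easy to spot when you've been mapped outside your region.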
(In reply to Arzhel Younsi [:XioNoX] from comment #3)
> Based on comment 0 it seems to be a CDN issue and I have no idea on how
> that's managed. I can't reproduce it though so they probably fixed it. If
> it's the only time I don't think it's worth opening a ticket with them.
>
> Note that they usually use your DNS resolver to figure out which endpoint to
> send you to. So make sure you're using a local resolver.
>
> For example if I'm connected to the MozillaVPN and use it for DNS (by
> default), even if I'm in Paris they will redirect me to the SFO1 endpoint.
> Using google's dns, will send to to the Paris one.

Right; FWIW, all of our hosts in the Mountain View (MTV2) QA lab (on the LAN), as well as my wired desk connection, saw this, and all of them use the default DNS servers assigned through DHCP.
Just going to dump some more info from today:

C:\Users\Stephen Donner>tracert d3kg653zred9tf.cloudfront.net

Tracing route to d3kg653zred9tf.cloudfront.net [220.127.116.11]
over a maximum of 30 hops:

  1    <1 ms    <1 ms    <1 ms  fw1.corp.mtv2.mozilla.net [10.252.24.1]
  2     1 ms    <1 ms    <1 ms  v-1075.border1.sjc2.mozilla.net [18.104.22.168]
  3     1 ms     1 ms     1 ms  xe-3-1-0.mpr2.pao1.us.above.net [22.214.171.124]
  4     1 ms     1 ms     1 ms  zayo-kddi.pao1.us.zip.zayo.com [126.96.36.199]
  5     1 ms     1 ms     2 ms  pajbb002.int-gw.kddi.ne.jp [188.8.131.52]
  6   114 ms   111 ms   111 ms  obpjbb206.int-gw.kddi.ne.jp [184.108.40.206]
  7   110 ms   110 ms   110 ms  jc-osa302.kddnet.ad.jp [220.127.116.11]
  8   111 ms   111 ms   114 ms  18.104.22.168
  9   120 ms   120 ms   117 ms  22.214.171.124
 10   111 ms   115 ms   111 ms  126.96.36.199
 11     *        *        *     Request timed out.
 12     *        *

Even externally, over a "cable" connection (hosted by Yottaa.com) via webpagetest.org, we get nearly a 14-second page-load time: http://www.webpagetest.org/result/150825_7K_1546/1/details/

There's no way to tell whether they were also sent to a Japanese node rather than a "local" US-based one, but at least this shows the problem is external to the MTV2 network (I hope). In particular, it took nearly 4 seconds to fetch a single image over the wire: http://www.webpagetest.org/result/150825_7K_1546/1/details/#request112

Request 112: https://d3kg653zred9tf.cloudfront.net/media/img/firefox/partners/form-bg.ce84507a80dd.jpg
URL: https://d3kg653zred9tf.cloudfront.net/media/img/firefox/partners/form-bg.ce84507a80dd.jpg
Host: d3kg653zred9tf.cloudfront.net
IP: 127.0.0.1
Error/Status Code: 200
Initiated By: https://d3kg653zred9tf.cloudfront.net/media/js/partners_desktop-bundle.d53e626103af.js line 4 column 17757
Request Start: 9.272 s
Time to First Byte: 3990 ms

This might be telling?

X-Cache-Info: caching
X-Cache-Info: caching
X-Cache: Miss from cloudfront
Via: 1.1 1dfcd1bafe1d21759a09fdcad1a93705.cloudfront.net (CloudFront)
Reopening; there are at least two bugs here, besides the perf/routing/DNS/node-region issue:

1) redundant X-Cache-Info: caching headers/values
2) an X-Cache: Miss from cloudfront

Even if we're not yet expecting this environment to be "future prod," we'd do well to sort this out.

Another datapoint from yesterday: again, we're seeing assets on Cloudfront being served from Japanese nodes, with higher latency/request-transfer times.

Here's a webpagetest.org run: http://www.webpagetest.org/result/150825_SR_16P9/1/details/

Here's a traceroute:

sdonners-MacBook-Pro:~ sdonner$ traceroute d3kg653zred9tf.cloudfront.net
traceroute: Warning: d3kg653zred9tf.cloudfront.net has multiple addresses; using 188.8.131.52
traceroute to d3kg653zred9tf.cloudfront.net (184.108.40.206), 64 hops max, 52 byte packets
 1  fw1.corp.mtv2.mozilla.net (10.252.24.1)  2.240 ms * *
 2  v-1075.border1.sjc2.mozilla.net (220.127.116.11)  2.790 ms  3.219 ms  2.908 ms
 3  xe-3-1-0.mpr2.pao1.us.above.net (18.104.22.168)  2.889 ms  2.630 ms  2.548 ms
 4  zayo-kddi.pao1.us.zip.zayo.com (22.214.171.124)  2.585 ms  2.739 ms  2.881 ms
 5  pajbb002.int-gw.kddi.ne.jp (126.96.36.199)  2.544 ms
    pajbb001.int-gw.kddi.ne.jp (188.8.131.52)  2.205 ms
    pajbb002.int-gw.kddi.ne.jp (184.108.40.206)  22.997 ms
 6  obpjbb205.int-gw.kddi.ne.jp (220.127.116.11)  146.551 ms
    obpjbb206.int-gw.kddi.ne.jp (18.104.22.168)  108.115 ms
    obpjbb206.int-gw.kddi.ne.jp (22.214.171.124)  112.914 ms
 7  jc-osa302.int-gw.kddi.ne.jp (126.96.36.199)  134.100 ms
    jc-osa302.int-gw.kddi.ne.jp (188.8.131.52)  113.457 ms
    jc-osa302.int-gw.kddi.ne.jp (184.108.40.206)  127.153 ms
 8  220.127.116.11 (18.104.22.168)  116.112 ms  114.752 ms  119.901 ms
 9  22.214.171.124 (126.96.36.199)  127.205 ms
    188.8.131.52 (184.108.40.206)  128.851 ms
    220.127.116.11 (18.104.22.168)  133.873 ms
10  22.214.171.124 (126.96.36.199)  122.667 ms  127.080 ms
    188.8.131.52 (184.108.40.206)  123.080 ms
11  220.127.116.11 (18.104.22.168)  124.237 ms
^C
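The duplicated X-Cache-Info header above is easy to flag programmatically when sifting through captured responses. A minimal sketch (the raw header block is the one captured from the webpagetest.org run; the helper function is hypothetical, written for this bug):

```python
from collections import Counter

# Raw response headers as captured from the webpagetest.org run above.
raw_headers = """\
X-Cache-Info: caching
X-Cache-Info: caching
X-Cache: Miss from cloudfront
Via: 1.1 1dfcd1bafe1d21759a09fdcad1a93705.cloudfront.net (CloudFront)
"""

def duplicate_headers(raw):
    """Return the header names that appear more than once in a raw
    header block (names compared case-insensitively, per HTTP)."""
    names = [line.split(":", 1)[0].strip().lower()
             for line in raw.splitlines() if ":" in line]
    return sorted(name for name, count in Counter(names).items() if count > 1)

print(duplicate_headers(raw_headers))  # ['x-cache-info']
```

Note that repeated headers are not always a bug in HTTP generally, but two identical X-Cache-Info: caching lines carry no extra information, which is what makes them look like a misconfiguration here.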
Richard, any potential insights on this?
Stephen, can you give me access to that AWS account? I'd like to look at the configuration and possibly open a ticket with Amazon.
(In reply to Richard Weiss [:r2] from comment #8)
> Stephen, can you give me access to that AWS account? I'd like to look at the
> configuration and possibly open a ticket with Amazon.

Sorry, I have zero knowledge of it - I was told that fox2mike and/or jakem would, though.
I pointed r2 at the right account
https://www.dropbox.com/s/vhif6mzowhv5v2a/Screenshot%202015-08-27%2014.50.41.png?dl=0 That's the setting it's at. I wonder if that has an impact?
r2, fox2mike, does it look like a CDN issue? who should own that bug?
(In reply to Arzhel Younsi [:XioNoX] from comment #12)
> r2, fox2mike, does it look like a CDN issue? who should own that bug?

Hey Arzhel - sorry for not updating the bug earlier; Jason Thomas has opened a trouble ticket with Amazon (1524721631), and as of their last response, 6 days ago, they were still looking into it.
I can see the same for our Mozmill-CI nodes in qa.scl3.mozilla.com. We got a dozen test failures today, and downloads of builds take extraordinarily long, at about 25KB/s to 75KB/s.
Just got this from Amazon support: > For some reasons when you use your DNS resolver we are mapping you always outside US, which is the issue > I am working to get corrected with our Cloudfront Services Team. If you are using Google DNS, Google sends your actual Public IP as part of the DNS extension, in this case Cloudfront relies on Geo Database and is mapping you with-in US based on Geo Distance. > This certainly is an issue at our side which I am working to get corrected. > We regret the inconvenience this has been causing and the delay in resolving this.
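Amazon's explanation matches how EDNS Client Subnet (ECS) works: without ECS the CDN can only geolocate the *resolver's* address, while a resolver that sends ECS (like Google Public DNS) lets the CDN geolocate the client's own subnet. A toy model of that decision, purely for illustration -- the geo table and subnets below are invented, not real mapping data:

```python
# Hypothetical geo database: subnet -> region. In the real incident,
# the entry for our resolver was presumably wrong or missing, which is
# why we kept getting mapped outside the US.
GEO = {
    "203.0.113.0/24": "US",    # invented client/office subnet (TEST-NET-3)
    "198.51.100.0/24": "APAC", # invented resolver subnet, mis-geolocated
}

def pick_region(resolver_subnet, client_subnet=None):
    """Return the region the CDN would map a request to: use the ECS
    client subnet when the resolver forwards one, else fall back to
    geolocating the resolver itself."""
    subnet = client_subnet if client_subnet is not None else resolver_subnet
    return GEO.get(subnet, "unknown")

# Without ECS, the mis-geolocated resolver sends us abroad...
print(pick_region("198.51.100.0/24"))                    # 'APAC'
# ...with ECS, the client subnet wins and we stay mapped in the US.
print(pick_region("198.51.100.0/24", "203.0.113.0/24"))  # 'US'
```

This is consistent with comment 3's observation that switching to Google's DNS changed which endpoint you were sent to.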
As noted earlier, this also affects our Firefox UI tests for Firefox Desktop as run in SCL3, so I'm updating the summary to reflect the real state. Thank you, Jeremy, for the latest information; hopefully they will be able to fix this soon. As we noticed late yesterday, the Japan/Hong Kong routing got fixed, but we were still being routed via the UK for US-internal traffic.
This blocks most parts of our automation due to broken downloads of the Firefox binary. Adding our whiteboard entry for tracking.
Can you tell me which AWS account this ticket was opened in? Either the account name or number will do. I'd like to call this to the attention of our Amazon account team.
Richard, I've e-mailed you the case id.
Did we get an update here in the meantime? Half a month has passed. As of today we hit bug 1219934 again for a couple of DMGs of Firefox (ar locale).
Created attachment 8688527 [details]
scl3 -> cloudfront throughput

It seems to be getting better; I've attached a throughput graph, and the average looks quite a bit better over the last 4 days. The ticket is still open with AWS, and they are working on it. We are also considering fronting Cloudfront with Akamai, which should eliminate these performance issues.
Moving the bug to oremj/Services.
We again had a couple of incomplete installer downloads for the latest Firefox 43.0b8 build1 candidate builds, which caused problems across all platforms.
(In reply to Henrik Skupin (:whimboo) from comment #23)
> We had again a couple of incomplete downloaded installers for the latest
> Firefox 43.0b8 build1 candidate builds which caused problems across all
> platforms.

Incomplete != latency, to my knowledge; can you please file a separate issue for that (and cc me), Henrik?
Stephen, there is already bug 1219934 filed, which is also marked as a dependency of this bug. As :oremj mentioned, those should be related.
I haven't seen broken installer downloads for a while now, so at least I can remove the blocking whiteboard entry.
I'm going to assume this has been fixed on cloudfront's side. Please reopen if there are still problems.
(In reply to Jeremy Orem [:oremj] from comment #28)
> I'm going to assume this has been fixed on cloudfront's side. Please reopen
> if there are still problems.

Yep, thanks, Jeremy. I haven't seen this rear its head since.