Closed Bug 1130386 Opened 10 years ago Closed 10 years ago

SSL errors on aus4.mozilla.org load balancer

Categories

(Infrastructure & Operations Graveyard :: WebOps: Product Delivery, task)

task
Not set
critical

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: rail, Assigned: cliang)

Details

We started seeing SSL related failures today trying to talk to aus4.mozilla.org. The ratio is not high, but this prevents 36.0b7 final verification tests to pass. An example output from a curl based test: Fri Feb 6 02:28:28 PST 2015: FAILURE 10: Update xml file not available Fri Feb 6 02:28:28 PST 2015: Fri Feb 6 02:28:28 PST 2015: Download url: https://aus4.mozilla.org/update/3/Firefox/33.0/20140911191954/Linux_x86-gcc3/es-CL/beta-cdntest/default/default/default/update.xml?force=1 Fri Feb 6 02:28:28 PST 2015: Curl returned exit code: 56 Fri Feb 6 02:28:28 PST 2015: Fri Feb 6 02:28:28 PST 2015: The HTTP headers were: Fri Feb 6 02:28:28 PST 2015: Fri Feb 6 02:28:28 PST 2015: The full curl debug output was: Fri Feb 6 02:28:28 PST 2015: * About to connect() to aus4.mozilla.org port 443 (#0) Fri Feb 6 02:28:28 PST 2015: * Trying 63.245.217.219... connected Fri Feb 6 02:28:28 PST 2015: * Connected to aus4.mozilla.org (63.245.217.219) port 443 (#0) Fri Feb 6 02:28:28 PST 2015: * Initializing NSS with certpath: sql:/etc/pki/nssdb Fri Feb 6 02:28:28 PST 2015: * warning: ignoring value of ssl.verifyhost Fri Feb 6 02:28:28 PST 2015: * skipping SSL peer certificate verification Fri Feb 6 02:28:28 PST 2015: * SSL connection using TLS_DHE_RSA_WITH_AES_128_CBC_SHA Fri Feb 6 02:28:28 PST 2015: * Server certificate: Fri Feb 6 02:28:28 PST 2015: * subject: CN=aus4.mozilla.org,O=Mozilla Foundation,L=Mountain View,ST=CA,C=US Fri Feb 6 02:28:28 PST 2015: * start date: Sep 23 00:00:00 2013 GMT Fri Feb 6 02:28:28 PST 2015: * expire date: Sep 27 12:00:00 2016 GMT Fri Feb 6 02:28:28 PST 2015: * common name: aus4.mozilla.org Fri Feb 6 02:28:28 PST 2015: * issuer: CN=DigiCert Secure Server CA,O=DigiCert Inc,C=US Fri Feb 6 02:28:28 PST 2015: > GET /update/3/Firefox/33.0/20140911191954/Linux_x86-gcc3/es-CL/beta-cdntest/default/default/default/update.xml?force=1 HTTP/1.1 Fri Feb 6 02:28:28 PST 2015: > User-Agent: curl/7.19.7 (x86_64-redhat-linux-gnu) libcurl/7.19.7 NSS/3.14.0.0 zlib/1.2.3 libidn/1.18 libssh2/1.4.2 Fri Feb 6 02:28:28 PST 2015: > Host: aus4.mozilla.org Fri Feb 6 02:28:28 PST 2015: > Accept: */* Fri Feb 6 02:28:28 PST 2015: > Fri Feb 6 02:28:28 PST 2015: * SSL connection receive timeout Fri Feb 6 02:28:28 PST 2015: * Closing connection #0 And this one: Fri Feb 6 02:28:27 PST 2015: ==================================== Fri Feb 6 02:28:27 PST 2015: Fri Feb 6 02:28:27 PST 2015: FAILURE 6: Update xml file not available Fri Feb 6 02:28:27 PST 2015: Fri Feb 6 02:28:27 PST 2015: Download url: https://aus4.mozilla.org/update/3/Firefox/32.0/20140814150857/Linux_x86_64-gcc3/rm/beta-cdntest/default/default/default/update.xml?force=1 Fri Feb 6 02:28:27 PST 2015: Curl returned exit code: 35 Fri Feb 6 02:28:27 PST 2015: Fri Feb 6 02:28:27 PST 2015: The HTTP headers were: Fri Feb 6 02:28:27 PST 2015: Fri Feb 6 02:28:27 PST 2015: The full curl debug output was: Fri Feb 6 02:28:27 PST 2015: * About to connect() to aus4.mozilla.org port 443 (#0) Fri Feb 6 02:28:27 PST 2015: * Trying 63.245.217.219... connected Fri Feb 6 02:28:27 PST 2015: * Connected to aus4.mozilla.org (63.245.217.219) port 443 (#0) Fri Feb 6 02:28:27 PST 2015: * Initializing NSS with certpath: sql:/etc/pki/nssdb Fri Feb 6 02:28:27 PST 2015: * warning: ignoring value of ssl.verifyhost Fri Feb 6 02:28:27 PST 2015: * NSS error -5990 Fri Feb 6 02:28:27 PST 2015: * Closing connection #0 Fri Feb 6 02:28:27 PST 2015: * SSL connect error
I don't know if it's related or not, but bug 1125159 has been open for awhile.
Also, looking at https://rpm.newrelic.com/accounts/263620/applications/5929969 I see that the backends are generally fine - there was a spike in response time for ~5min around 5:45am pacific, but that appears to just be on aus2.webapp.phx1, and hasn't recurred.
FYI: the timestamps in comment #0 are invalid - they're done in batch at the end of our test script. We can probably guesstimate some more legit times if it's helpful.
(In reply to Rail Aliiev [:rail] from comment #4) > The full log can be found at: > > http://ftp.mozilla.org/pub/mozilla.org/firefox/candidates/36.0b7-candidates/ > build1/logs/release-mozilla-beta-final_verification-bm77-build1-build11.txt. > gz > > There is a summary at the end. Does it sort errors to the end or were all the errors at the end of the test?
If it's not horribly difficult to guesstimate timestamps, it would be helpful. I'm wondering if it's a cached curl connection that goes bad (and doesn't get cleared until something else times out / closes). The logs don't appear to be showing any SSL-related errors. I'm going to dig some more into SSL tuning on the load balancer.
(In reply to Jason Crowe [:jd] from comment #5) > (In reply to Rail Aliiev [:rail] from comment #4) > > The full log can be found at: > > > > http://ftp.mozilla.org/pub/mozilla.org/firefox/candidates/36.0b7-candidates/ > > build1/logs/release-mozilla-beta-final_verification-bm77-build1-build11.txt. > > gz > > > > There is a summary at the end. > > Does it sort errors to the end or were all the errors at the end of the test? Yeah, the timestamps are uses in the summary block :(
also, is this new behavior or have you always gotten a few failures? It his a thing like, we are over a threshold of errors, or is it that you have never seen these errors before? If these errors are new, do you know when they started? It might help to determine what might have changed
This is new. The test has no tolerance. :) Last time when we run this test was on Tuesday I believe, for the previous release, see http://ftp.mozilla.org/pub/mozilla.org/firefox/candidates/36.0b6-candidates/build1/logs/release-mozilla-beta-final_verification-bm77-build1-build7.txt.gz
(In reply to C. Liang [:cyliang] from comment #6) > If it's not horribly difficult to guesstimate timestamps, it would be > helpful. I'm wondering if it's a cached curl connection that goes bad (and > doesn't get cleared until something else times out / closes). The most recent batch of failures occurred between 9:40am pacific and 10:04am pacific.
Rail just remembered that we took a CentOS 6.2 -> 6.5 upgrade this week. I'm running this test on a 6.2 machine to make sure that's not the issue.
Assignee: server-ops-webops → cliang
catlee changing AWS routing tables to force using public internet to access aus4.m.o. I forced another test to verify this.
An update here : I was helping C troubleshoot this issue. Here's some of things we found : 1) There were a bunch of "spurious" SSL requests to aus4.mozilla.org and engaged kang from opsec to help look at that. Most of these were harmless and we concluded that this wasn't actually affecting performance in any way. We were getting about 2500 a minute so about 40 a second and compared to actual traffic, wasn't much of an impact. 2) I cloned the build-tools repo and ran the tests that were failing in use1 across different machines (my laptop, my personal instance in usw2 and 2 different DCs) and none of them had any failures. At this time, it seems like Zeus is performing as expected and this feels like an issue specific to use1 <-> aus4.mozilla.org. I believe releng is adding specific routes to aus4.mozilla.org to mitigate the issue at this time.
mtr on the slave the current test is running: [root@bld-linux64-spot-220.build.releng.use1.mozilla.com ~]# mtr -r aus4.mozilla.org HOST: bld-linux64-spot-220.build. Loss% Snt Last Avg Best Wrst StDev 1. ??? 100.0 10 0.0 0.0 0.0 0.0 0.0 2. ??? 100.0 10 0.0 0.0 0.0 0.0 0.0 3. ??? 100.0 10 0.0 0.0 0.0 0.0 0.0 4. 100.64.16.111 0.0% 10 0.8 0.7 0.6 1.0 0.1 5. 205.251.245.231 0.0% 10 1.1 1.1 1.0 1.6 0.2 6. 72.21.220.60 0.0% 10 1.7 2.1 0.9 4.5 1.2 7. ae1.er5.iad10.us.zip.zayo.co 0.0% 10 0.9 2.4 0.9 15.3 4.5 8. ae5.er2.iad10.us.zip.zayo.co 0.0% 10 1.2 1.2 1.0 1.5 0.1 9. ae1.cr2.dca2.us.zip.zayo.com 0.0% 10 2.1 2.1 1.9 2.3 0.1 10. ae14.cr2.iah1.us.zip.zayo.co 0.0% 10 25.3 30.2 25.3 56.3 9.8 11. ae6.mpr4.phx2.us.zip.zayo.co 0.0% 10 74.4 74.4 74.1 75.1 0.3 12. 64.124.201.182.mozilla.com 0.0% 10 59.1 59.2 59.1 59.4 0.1 13. xe-1-0-1.0.core1.phx1.mozill 0.0% 10 60.8 61.1 60.8 62.8 0.6 14. aus4.vips.phx1.mozilla.com 0.0% 10 59.5 59.6 59.4 59.8 0.2
... and it passes! 6 mins, 56 secs vs almost 30minutes before
Let's resolve it! Thanks to cyliang and fox2mike!
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard
You need to log in before you can comment on or make changes to this bug.