Bug 1244851 (Closed) · Opened 8 years ago · Closed 8 years ago

Investigate latency between webqa-ci1.qa.scl3.mozilla.com and bouncer via proxy

Categories

(Infrastructure & Operations :: Infrastructure: Other, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mbrandt, Unassigned)

References

Details

Attachments

(1 file)

We've recently been seeing timeouts in our test automation requests. The error behavior is very similar to bug 1238977. The timeouts are sporadic but recur regularly.

I've been unable to reproduce the error state when running the tests external to our Jenkins environment.

The affected URLs under test include:
- https://download.mozilla.org
- http://bouncer-bouncer.stage.mozaws.net


>    if r.status_code == 302 and r.headers['Location']:
>        try:
>            request_url = r.headers['Location']
>            r = requests.head(request_url, headers=headers, verify=False,
>                              timeout=15, allow_redirects=True)
>        except requests.RequestException as e:
>           raise AssertionError('Failing URL: %s.\nError message: %s' % (request_url, e))
>E           AssertionError: Failing URL: http://download.cdn.mozilla.net/pub/firefox/releases/31.0/linux-i686/ja/firefox-31.0.tar.bz2.
>E           Error message: HTTPConnectionPool(host='proxy.dmz.scl3.mozilla.com', port=3128): Read timed out. (read timeout=15)
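For context, here's a minimal sketch of the kind of check the traceback above comes from, with the timeout pulled into one place so it can be raised easily. The helper name and the PROXY_TIMEOUT variable are illustrative assumptions, not the actual bouncer test code:

import os
import requests

# Illustrative sketch only: check_redirect and PROXY_TIMEOUT are assumed
# names, not the real test code from the traceback above.
PROXY_TIMEOUT = float(os.environ.get('PROXY_TIMEOUT', '15'))

def check_redirect(url, headers=None):
    r = requests.head(url, headers=headers, verify=False,
                      timeout=PROXY_TIMEOUT, allow_redirects=False)
    if r.status_code == 302 and r.headers.get('Location'):
        request_url = r.headers['Location']
        try:
            # A "Read timed out" here means the proxy accepted the
            # connection but never returned a response within the limit.
            r = requests.head(request_url, headers=headers, verify=False,
                              timeout=PROXY_TIMEOUT, allow_redirects=True)
        except requests.RequestException as e:
            raise AssertionError('Failing URL: %s.\nError message: %s'
                                 % (request_url, e))
    return r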
To confirm, you're running the tests that *do* work from a server within the SCL3 datacenter?
(In reply to Richard Soderberg [:atoll] from comment #1)
> To confirm, you're running the tests that *do* work from a server within the
> SCL3 datacenter?

The tests are running from a server external to SCL3.
(In reply to Matt Brandt [:mbrandt] from comment #0)
> >            r = requests.head(request_url, headers=headers, verify=False,
> >                              timeout=15, allow_redirects=True)

This timeout is too low for use with the Squid servers. They do not have a timeout configured manually, and so are using the Squid defaults:

connect_timeout 60s
request_timeout 300s

You'll need to raise your application timeouts if you want to get back diagnostic information from Squid about why it was unable to retrieve a reply to your request.
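For example (a sketch, not the actual change to the test suite): requests accepts a (connect, read) timeout tuple, so the application limits can be aligned with the Squid defaults above:

import requests

url = ('http://download.cdn.mozilla.net/pub/firefox/releases/31.0/'
       'linux-i686/ja/firefox-31.0.tar.bz2')

# Match Squid's connect_timeout (60s) and request_timeout (300s) so a
# failure surfaces Squid's diagnostic rather than a client-side timeout.
r = requests.head(url, timeout=(60, 300), allow_redirects=True)
print(r.status_code, r.elapsed)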

(In reply to Matt Brandt [:mbrandt] from comment #2)
> (In reply to Richard Soderberg [:atoll] from comment #1)
> > To confirm, you're running the tests that *do* work from a server within the
> > SCL3 datacenter?
> 
> The tests are running from a server external to SCL3.

Do you have access to any servers in SCL3 that are *not* required to go through the proxy for direct internet communication? It would be extremely useful to confirm that this issue is restricted only to the proxies, and not also to all of SCL3.
> This timeout is too low for use with the Squid servers. You'll need to raise your 
> application timeouts if you want to get back diagnostic information from Squid

Thank you for noticing this nuance. I'll up our timeouts.

> Do you have access to any servers in SCL3 that are *not* required to go through the 
> proxy for direct internet communication?

No, I do not. But I can reach out to someone on my team who may.
Re: timeouts, I don't know that we've ever configured them, or if Squid can configure them differently for certain requestors. It might be possible to tune that for your app specifically so that you don't have to go all the way out to 5 minutes, but I can't guarantee that, so temporarily lifting the timeouts in the app might be enough to get some answers.
Yeah, agreed. I like the idea of upping the timeout to explore the problem space a bit, especially given that it's something I have access to.
Just wanted to report in: since upping the timeout to 300 seconds, the test failures appear to have ceased. I'm going to let the tests continue to run overnight (they run every 5 minutes) to gather more data points. I'll work with my team tomorrow to investigate option #2.

> Do you have access to any servers in SCL3 that are *not* required to go through the 
> proxy for direct internet communication?
In https://webqa-ci.mozilla.com/view/Bouncer/job/bouncer.stage/27786/testReport/tests.test_redirects/TestRedirects/ there were three tests that exceeded the previous 15-second timeout, though all are still under 60 seconds. I wonder why this happens, what an acceptable duration for these tests is, and why these delays appear to have increased recently.

atoll: In bug 1238977 it appears to take 900 seconds before I see a proxy timeout, which doesn't match either of the timeouts mentioned in comment 3.

> Do you have access to any servers in SCL3 that are *not* required to go through the proxy for direct internet communication?

I thought all servers were required to go through the proxy? Is it possible to provide a list of addresses that avoid the proxy? Like you say, this would help to confirm whether the proxy is really the issue.
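In the meantime, a client-side comparison is possible from the same box (a sketch, assuming the proxy is picked up from the environment, as the HTTPConnectionPool(host='proxy...') errors suggest, and assuming direct egress is permitted at all, which is exactly the open question): a requests Session with trust_env=False ignores the HTTP(S)_PROXY environment variables:

import requests

url = ('http://download.cdn.mozilla.net/pub/firefox/releases/31.0/'
       'linux-i686/ja/firefox-31.0.tar.bz2')

with requests.Session() as session:
    session.trust_env = False  # ignore proxy settings from the environment
    r = session.head(url, timeout=(60, 300), allow_redirects=True)
    print('direct:', r.status_code, r.elapsed)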
"All servers", where all is defined as a limited subset. I'm not privy to the actual definition, but I can offer some assistance here. Is this something we can test with a simple curl -o /dev/null? If you can give us a simple bash script that logs fetch times and such, we can run that for you and report back.

I'm not surprised that the proxy timeout differs from the defaults I found in the docs. We don't appear to set it intentionally anywhere in Puppet, but I'm sorry that it's so much higher! Perhaps we should set something lower.
Taking an example of a >15 second request from the bouncer results, this will log the time to connect, time to start transfer, and total time (as the bouncer tests make HEAD requests, these last two should be identical) for the URL:

$ while true ; do curl --head -w '%{time_connect}:%{time_starttransfer}:%{time_total}\n' -o /dev/null -s http://download.cdn.mozilla.net/pub/firefox/releases/31.0/win32/es-MX/Firefox%20Setup%2031.0.exe >> bouncer.log ; sleep 5 ; done

We could run this for a while with/without the proxy and compare results.
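Once bouncer.log has some samples, a few lines of Python can summarize it (a sketch; the field names are just for readability):

from statistics import mean

# Each line is the connect:starttransfer:total triple written by curl's -w.
with open('bouncer.log') as f:
    rows = [tuple(map(float, line.strip().split(':')))
            for line in f if line.strip()]

connect, start, total = zip(*rows)
print('samples: %d' % len(rows))
print('connect: mean=%.2fs max=%.2fs' % (mean(connect), max(connect)))
print('total:   mean=%.2fs max=%.2fs' % (mean(total), max(total)))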

> We don't appear to set it intentionally anywhere in Puppet, but I'm sorry that it's so much higher! Perhaps we should set something lower.

My evidence is somewhat circumstantial, and either way I'm not too concerned with the high timeout. I'd rather identify why we're getting timeouts when we shouldn't be, and address that.
Severity: major → normal
Summary: Squid proxy timeouts between webqa-ci1.qa.scl3.mozilla.com → Investigate latency between webqa-ci1.qa.scl3.mozilla.com and bouncer via proxy
See Also: → 1238977
We've seen two timeouts in a single test run at the new 300-second timeout: https://webqa-ci.mozilla.com/job/bouncer.prod/28063/console. As our tests now time out after 300 seconds, I suspect we may still not be hitting any proxy timeout, but perhaps we could use these occurrences to check the proxy logs? Times are Pacific and approximate (each is the time the failure was logged to the console, so may be shortly after the failure actually occurred). The date is Feb 2nd.

17:11:47             except requests.RequestException as e:
17:11:47 >               raise AssertionError('Failing URL: %s.\nError message: %s' % (request_url, e))
17:11:47 E               AssertionError: Failing URL: http://download.cdn.mozilla.net/pub/firefox/releases/31.0/linux-x86_64/pl/firefox-31.0.tar.bz2.
17:11:47 E               Error message: HTTPConnectionPool(host='proxy.dmz.scl3.mozilla.com', port=3128): Read timed out. (read timeout=300)

17:11:47             except requests.RequestException as e:
17:11:47 >               raise AssertionError('Failing URL: %s.\nError message: %s' % (request_url, e))
17:11:47 E               AssertionError: Failing URL: http://download.cdn.mozilla.net/pub/firefox/releases/31.0/win32/pt-PT/Firefox%20Setup%2031.0.exe.
17:11:47 E               Error message: HTTPConnectionPool(host='proxy.dmz.scl3.mozilla.com', port=3128): Read timed out. (read timeout=300)
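To make the proxy-log check easier, the console timestamp converts to UTC and epoch seconds like this (a sketch; the 2016 year is assumed from the bug's age, and Squid's native access.log format uses epoch seconds):

from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # Python 3.9+

# 17:11:47 Pacific on Feb 2nd; the year 2016 is an assumption.
pacific = datetime(2016, 2, 2, 17, 11, 47,
                   tzinfo=ZoneInfo('America/Los_Angeles'))
print(pacific.astimezone(timezone.utc))  # 2016-02-03 01:11:47+00:00
print(int(pacific.timestamp()))          # epoch seconds, as in squid access.log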
> Do you have access to any servers in SCL3 that are *not* required to go through 
> the proxy for direct internet communication?

I'm still on the hunt for a box to run the test suite; I have a few feelers out. Hope to get you fresh data soon.
An update: I'm actively receiving help with the investigation in bug 1247332. More information to come as we narrow down the variables.
Just wanted to check in here, bug 1247332 is closed. Is there anything further we can do?
Flags: needinfo?(mbrandt)
Sorry, this slipped through my net. Yes, I think we're done here. The issues that we're experiencing appear to be on Sauce's side.

I deeply appreciate your help. Count me in your debt; tea and muffins on me next we meet.
Status: NEW → RESOLVED
Closed: 8 years ago
Flags: needinfo?(mbrandt)
Resolution: --- → FIXED