Bug 1244851 (Closed) · Opened 8 years ago · Closed 8 years ago

Investigate latency between webqa-ci1.qa.scl3.mozilla.com and bouncer via proxy

Categories

(Infrastructure & Operations :: Infrastructure: Other, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mbrandt, Unassigned)

References

Details

Attachments

(1 file)

We've recently been seeing timeouts in our test automation requests. The error behavior is very similar to bug 1238977. The timeouts are sporadic but recur regularly.

I've been unable to reproduce the error state when running the tests external to our Jenkins environment.

The affected URLs under test include:
- https://download.mozilla.org
- http://bouncer-bouncer.stage.mozaws.net


>    if r.status_code == 302 and r.headers['Location']:
>        try:
>            request_url = r.headers['Location']
>            r = requests.head(request_url, headers=headers, verify=False,
>                              timeout=15, allow_redirects=True)
>        except requests.RequestException as e:
>           raise AssertionError('Failing URL: %s.\nError message: %s' % (request_url, e))
>E           AssertionError: Failing URL: http://download.cdn.mozilla.net/pub/firefox/releases/31.0/linux-i686/ja/firefox-31.0.tar.bz2.
>E           Error message: HTTPConnectionPool(host='proxy.dmz.scl3.mozilla.com', port=3128): Read timed out. (read timeout=15)
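For context, here's a minimal sketch of the kind of check the traceback above comes from, with the timeout pulled into one place so it can be raised easily. The helper name and the PROXY_TIMEOUT variable are illustrative assumptions, not the actual bouncer test code:

import os
import requests

# Illustrative sketch only: check_redirect and PROXY_TIMEOUT are assumed
# names, not the real test code from the traceback above.
PROXY_TIMEOUT = float(os.environ.get('PROXY_TIMEOUT', '15'))

def check_redirect(url, headers=None):
    r = requests.head(url, headers=headers, verify=False,
                      timeout=PROXY_TIMEOUT, allow_redirects=False)
    if r.status_code == 302 and r.headers.get('Location'):
        request_url = r.headers['Location']
        try:
            # A "Read timed out" here means the proxy accepted the
            # connection but never returned a response within the limit.
            r = requests.head(request_url, headers=headers, verify=False,
                              timeout=PROXY_TIMEOUT, allow_redirects=True)
        except requests.RequestException as e:
            raise AssertionError('Failing URL: %s.\nError message: %s'
                                 % (request_url, e))
    return r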
To confirm, you're running the tests that *do* work from a server within the SCL3 datacenter?
(In reply to Richard Soderberg [:atoll] from comment #1)
> To confirm, you're running the tests that *do* work from a server within the
> SCL3 datacenter?

The tests are running from a server external to SCL3.
(In reply to Matt Brandt [:mbrandt] from comment #0)
> >            r = requests.head(request_url, headers=headers, verify=False,
> >                              timeout=15, allow_redirects=True)

This timeout is too low for use with the Squid servers. They do not have a timeout configured manually, and so are using the Squid defaults:

connect_timeout 60s
request_timeout 300s

You'll need to raise your application timeouts if you want to get back diagnostic information from Squid about why it was unable to retrieve a reply to your request.
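For example (a sketch, not the actual change to the test suite): requests accepts a (connect, read) timeout tuple, so the application limits can be aligned with the Squid defaults above:

import requests

url = ('http://download.cdn.mozilla.net/pub/firefox/releases/31.0/'
       'linux-i686/ja/firefox-31.0.tar.bz2')

# Match Squid's connect_timeout (60s) and request_timeout (300s) so a
# failure surfaces Squid's diagnostic rather than a client-side timeout.
r = requests.head(url, timeout=(60, 300), allow_redirects=True)
print(r.status_code, r.elapsed)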

(In reply to Matt Brandt [:mbrandt] from comment #2)
> (In reply to Richard Soderberg [:atoll] from comment #1)
> > To confirm, you're running the tests that *do* work from a server within the
> > SCL3 datacenter?
> 
> The tests are running from a server external to SCL3.

Do you have access to any servers in SCL3 that are *not* required to go through the proxy for direct internet communication? It would be extremely useful to confirm that this issue is restricted only to the proxies, and not also to all of SCL3.
> This timeout is too low for use with the Squid servers. You'll need to raise your 
> application timeouts if you want to get back diagnostic information from Squid

Thank you for noticing this nuance. I'll up our timeouts.

> Do you have access to any servers in SCL3 that are *not* required to go through the 
> proxy for direct internet communication?

No, I do not. But I can reach out to someone on my team who may.
Re: timeouts, I don't know that we've ever configured them, or if Squid can configure them differently for certain requestors. It might be possible to tune that for your app specifically so that you don't have to go all the way out to 5 minutes, but I can't guarantee that, so temporarily lifting the timeouts in the app might be enough to get some answers.
Yeah, agreed. I like the idea of upping the timeout to explore the problem space a bit, especially given that it's something I have access to.
Just wanted to report in: since upping the timeout to 300 seconds, the test failures appear to have ceased. I'm going to let the tests continue to run overnight (they run every 5 minutes) to gather more data points. I'll work with my team tomorrow to investigate option #2.

> Do you have access to any servers in SCL3 that are *not* required to go through the 
> proxy for direct internet communication?
In https://webqa-ci.mozilla.com/view/Bouncer/job/bouncer.stage/27786/testReport/tests.test_redirects/TestRedirects/ there were three tests that exceeded the previous 15-second timeout, though all are still under 60 seconds. I wonder why this happens, what an acceptable duration for these tests is, and why these delays appear to have increased recently.

atoll: In bug 1238977 it appears to take 900 seconds before I see a proxy timeout, which doesn't match either of the timeouts mentioned in comment 3.

> Do you have access to any servers in SCL3 that are *not* required to go through the proxy for direct internet communication?

I thought all servers were required to go through the proxy? Is it possible to provide a list of addresses that avoid the proxy? Like you say, this would help to confirm whether the proxy is really the issue.
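In the meantime, a client-side comparison is possible from the same box (a sketch, assuming the proxy is picked up from the environment, as the HTTPConnectionPool(host='proxy...') errors suggest, and assuming direct egress is permitted at all, which is exactly the open question): a requests Session with trust_env=False ignores the HTTP(S)_PROXY environment variables:

import requests

url = ('http://download.cdn.mozilla.net/pub/firefox/releases/31.0/'
       'linux-i686/ja/firefox-31.0.tar.bz2')

with requests.Session() as session:
    session.trust_env = False  # ignore proxy settings from the environment
    r = session.head(url, timeout=(60, 300), allow_redirects=True)
    print('direct:', r.status_code, r.elapsed)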
"All servers", where all is defined as a limited subset. I'm not privy to the actual definition, but I can offer some assistance here. Is this something we can test with a simple curl -o /dev/null? If you can give us a simple bash script that logs fetch times and such, we can run that for you and report back.

I'm not surprised that the proxy timeout differs from the defaults I found in the docs. We don't appear to set it intentionally anywhere in Puppet, but I'm sorry that it's so much higher! Perhaps we should set something lower.
Taking an example of a >15 second request from the bouncer results, this will log the time to connect, time to start transfer, and total time (as the bouncer tests make HEAD requests, these last two should be identical) for the URL:

$ while true ; do curl --head -w '%{time_connect}:%{time_starttransfer}:%{time_total}\n' -o /dev/null -s http://download.cdn.mozilla.net/pub/firefox/releases/31.0/win32/es-MX/Firefox%20Setup%2031.0.exe >> bouncer.log ; sleep 5 ; done

We could run this for a while with/without the proxy and compare results.
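Once bouncer.log has some samples, a few lines of Python can summarize it (a sketch; the field names are just for readability):

from statistics import mean

# Each line is the connect:starttransfer:total triple written by curl's -w.
with open('bouncer.log') as f:
    rows = [tuple(map(float, line.strip().split(':')))
            for line in f if line.strip()]

connect, start, total = zip(*rows)
print('samples: %d' % len(rows))
print('connect: mean=%.2fs max=%.2fs' % (mean(connect), max(connect)))
print('total:   mean=%.2fs max=%.2fs' % (mean(total), max(total)))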

> We don't appear to set it intentionally anywhere in Puppet, but I'm sorry that it's so much higher! Perhaps we should set something lower.

My evidence is somewhat circumstantial, and either way I'm not too concerned with the high timeout. I'd rather identify why we're getting timeouts when we shouldn't be, and address that.
Severity: major → normal
Summary: Squid proxy timeouts between webqa-ci1.qa.scl3.mozilla.com → Investigate latency between webqa-ci1.qa.scl3.mozilla.com and bouncer via proxy
See Also: → 1238977
We've seen two timeouts in a single test run at the new 300-second timeout: https://webqa-ci.mozilla.com/job/bouncer.prod/28063/console. As our tests now time out after 300 seconds, I suspect we may still not be hitting any proxy timeout, but perhaps we could use these occurrences to check the proxy logs? Times are Pacific and approximate (each is the time the failure was logged to the console, so may be shortly after the failure actually occurred). The date is Feb 2nd.

17:11:47             except requests.RequestException as e:
17:11:47 >               raise AssertionError('Failing URL: %s.\nError message: %s' % (request_url, e))
17:11:47 E               AssertionError: Failing URL: http://download.cdn.mozilla.net/pub/firefox/releases/31.0/linux-x86_64/pl/firefox-31.0.tar.bz2.
17:11:47 E               Error message: HTTPConnectionPool(host='proxy.dmz.scl3.mozilla.com', port=3128): Read timed out. (read timeout=300)

17:11:47             except requests.RequestException as e:
17:11:47 >               raise AssertionError('Failing URL: %s.\nError message: %s' % (request_url, e))
17:11:47 E               AssertionError: Failing URL: http://download.cdn.mozilla.net/pub/firefox/releases/31.0/win32/pt-PT/Firefox%20Setup%2031.0.exe.
17:11:47 E               Error message: HTTPConnectionPool(host='proxy.dmz.scl3.mozilla.com', port=3128): Read timed out. (read timeout=300)
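To make the proxy-log check easier, the console timestamp converts to UTC and epoch seconds like this (a sketch; the 2016 year is assumed from the bug's age, and Squid's native access.log format uses epoch seconds):

from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # Python 3.9+

# 17:11:47 Pacific on Feb 2nd; the year 2016 is an assumption.
pacific = datetime(2016, 2, 2, 17, 11, 47,
                   tzinfo=ZoneInfo('America/Los_Angeles'))
print(pacific.astimezone(timezone.utc))  # 2016-02-03 01:11:47+00:00
print(int(pacific.timestamp()))          # epoch seconds, as in squid access.log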
> Do you have access to any servers in SCL3 that are *not* required to go through 
> the proxy for direct internet communication?

I'm still on the hunt for a box to run the test suite; I have a few feelers out. Hope to get you fresh data soon.
An update: I'm actively receiving help with the investigation in bug 1247332. More information to come as we narrow down the variables.
Just wanted to check in here, bug 1247332 is closed. Is there anything further we can do?
Flags: needinfo?(mbrandt)
Sorry, this slipped through my net. Yes, I think we're done here. The issues that we're experiencing appear to be on Sauce's side.

I deeply appreciate your help. Count me in your debt; tea and muffins on me next we meet.
Status: NEW → RESOLVED
Closed: 8 years ago
Flags: needinfo?(mbrandt)
Resolution: --- → FIXED