Closed Bug 1238977 Opened 8 years ago Closed 8 years ago

Squid proxy timeouts between webqa-ci1.qa.scl3.mozilla.com and saucelabs.com

Categories

(Infrastructure & Operations :: Infrastructure: Other, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: davehunt, Unassigned)

References

Details

Attachments

(4 files)

We've recently been seeing timeouts in our browser automation requests using Sauce Labs. It's unclear if these are caused by problems with our infrastructure, so I'm also going to open a ticket with Sauce Labs.

I've attached a console log from a recent failure. Here are some of the more interesting details from the log:

> ERROR: The requested URL could not be retrieved
> The following error was encountered while trying to retrieve the URL:
> http://162.222.75.179/wd/hub/session/3fe195fb23014178a5d040aa0588f86d/element/1/click
> Read Timeout
> The system returned: [No Error]
> A Timeout occurred while waiting to read data from the network. The network
> or server may be down or congested. Please retry your request.
> Generated Tue, 12 Jan 2016 03:26:42 GMT by proxy2.dmz.scl3.mozilla.com (squid/3.1.23)

The timeout appears to be 900 seconds (15 minutes).

One reason I suspect Sauce Labs is that the request that didn't get a response was 'click', and looking at the related Sauce Labs job (https://saucelabs.com/jobs/3fe195fb23014178a5d040aa0588f86d) I can see that the last command received was the click. The session then times out in Sauce Labs after 90 seconds with no further commands received.
punting bug to the infra team as it's not a DCOPs issue.
Assignee: server-ops-dcops → infra
Component: DCOps → Infrastructure: Other
QA Contact: jdow
Another Jenkins instance outside of our infrastructure doesn't appear to be having these issues with Sauce Labs, which makes me think this could be an intermittent issue with our proxy. We also notice occasional timeouts in other jobs, which may be related:

> Error message: HTTPConnectionPool(host='proxy.dmz.scl3.mozilla.com', port=3128): Read timed out. (read timeout=15)

The above is from https://webqa-ci.mozilla.com/job/bouncer.stage/26812/console
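
For context, that error signature is what urllib3/requests produces when a client-side read timeout fires while talking through the proxy. A minimal sketch of a request configured that way; the proxy host and the 15-second value come from the error above, while the target URL and the use of the requests library are assumptions:

```python
import requests

# Proxy host and timeout taken from the error message above; the target URL
# is a placeholder, not the actual job's endpoint.
PROXIES = {"http": "http://proxy.dmz.scl3.mozilla.com:3128"}

try:
    response = requests.get(
        "http://example.com/",  # hypothetical target
        proxies=PROXIES,
        timeout=15,  # matches "read timeout=15" in the error
    )
    print(response.status_code)
except requests.exceptions.ReadTimeout:
    # Raised when the proxy accepts the connection but no bytes arrive
    # within 15 seconds, which is the failure quoted above.
    print("Read timed out via proxy")
```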
jabba: As the QA Contact for this component, do you have some suggestions for how I could dig into this?
Flags: needinfo?(jdow)
jbarnell: I'm also told you may be able to provide some assistance.
Flags: needinfo?(jbarnell)
(In reply to Dave Hunt (:davehunt) from comment #3)
> jabba: As the QA Contact for this component, do you have some suggestions
> for how I could dig into this?

I'm gonna pass my needinfo on to cshields.
Flags: needinfo?(jdow) → needinfo?(cshields)
Hey Dave, sorry for the hot potato here.  What host(s) is trying to make the connection and getting the timeout?  We will try to reproduce it and check the proxy logs.
Flags: needinfo?(jbarnell)
Flags: needinfo?(cshields)
Hey Corey! There's an example in https://webqa-ci.mozilla.com/view/Buildmaster/job/bedrock.dev.linux.firefox/27/console

The host http://webqa-ci1.qa.scl3.mozilla.com is attempting to communicate with saucelabs.com (162.222.75.179) and eventually getting a timeout. The error notes that it was using proxy4.dmz.scl3.mozilla.com (squid/3.1.23) and it likely happened around 01:23:38 Pacific.
Hi :davehunt

> > http://162.222.75.179/wd/hub/session/3fe195fb23014178a5d040aa0588f86d/element/1/click
> > Read Timeout

Thank you for the logs!

> 
> The timeout appears to be 900 seconds (15 minutes).
> 
> One reason that leads me to suspect Sauce Labs is that the request that
> didn't get a response was 'click' and looking at the related Sauce Labs job
> (https://saucelabs.com/jobs/3fe195fb23014178a5d040aa0588f86d) I can see that
> the last command received was the click. The session then times out in Sauce
> Labs due to no further commands received after 90 seconds.

I pulled all proxy logs for this job and I can confirm that your click failure was observed at 1452569202.003 with a request duration of 900700ms. The next request to SauceLabs was observed at 1452569322.070, with a request duration of 120007ms.

I went ahead and pulled all log data for 30 seconds before and after 1452569202 (click), filtering out requests related to this SauceLabs job. The average request duration at this time was 10739.3ms. The average duration of all requests over the past 30 days is 13073.9ms. I dug deeper and did find two requests with durations of ~60000ms, but further analysis confirmed this is normal for this particular endpoint. Based on this data I cannot find any evidence of a systematic problem with our proxy servers or network.

I've included all proxy logs related to this job below. Data is in native squid format, described at http://wiki.squid-cache.org/Features/LogFormat#Squid_native_access.log_format_in_detail

> 1452568297.288    216 10.22.73.35 TCP_MISS/200 468 POST http://162.222.75.179/wd/hub/session/3fe195fb23014178a5d040aa0588f86d/window/current/size - DIRECT/162.222.75.179 application/json
> 1452568297.538    244 10.22.73.35 TCP_MISS/200 1297 PUT http://saucelabs.com/rest/v1/mozilla-ee/jobs/3fe195fb23014178a5d040aa0588f86d - DIRECT/162.222.75.243 application/json
> 1452568300.834   3289 10.22.73.35 TCP_MISS/200 467 POST http://162.222.75.179/wd/hub/session/3fe195fb23014178a5d040aa0588f86d/url - DIRECT/162.222.75.179 application/json
> 1452568300.906     70 10.22.73.35 TCP_MISS/200 518 GET http://162.222.75.179/wd/hub/session/3fe195fb23014178a5d040aa0588f86d/url - DIRECT/162.222.75.179 application/json
> 1452568301.087    178 10.22.73.35 TCP_MISS/200 479 POST http://162.222.75.179/wd/hub/session/3fe195fb23014178a5d040aa0588f86d/element - DIRECT/162.222.75.179 application/json
> 1452568301.205    110 10.22.73.35 TCP_MISS/200 497 GET http://162.222.75.179/wd/hub/session/3fe195fb23014178a5d040aa0588f86d/element/0/attribute/class - DIRECT/162.222.75.179 application/json
> 1452568301.297     90 10.22.73.35 TCP_MISS/200 480 POST http://162.222.75.179/wd/hub/session/3fe195fb23014178a5d040aa0588f86d/element - DIRECT/162.222.75.179 application/json
> 1452569202.003 900700 10.22.73.35 TCP_MISS/- 4012 POST http://162.222.75.179/wd/hub/session/3fe195fb23014178a5d040aa0588f86d/element/1/click - DIRECT/162.222.75.179 text/html
> 1452569322.070 120007 10.22.73.35 TCP_MISS/404 319 GET http://162.222.75.179/wd/hub/session/3fe195fb23014178a5d040aa0588f86d/url - DIRECT/162.222.75.179 text/plain
> 1452569322.084      8 10.22.73.35 TCP_MISS/404 591 GET http://162.222.75.179/wd/hub/session/3fe195fb23014178a5d040aa0588f86d/screenshot - DIRECT/162.222.75.179 text/plain
> 1452569322.110     20 10.22.73.35 TCP_MISS/404 591 GET http://162.222.75.179/wd/hub/session/3fe195fb23014178a5d040aa0588f86d/source - DIRECT/162.222.75.179 text/plain
> 1452569322.331    201 10.22.73.35 TCP_MISS/200 1305 PUT http://saucelabs.com/rest/v1/mozilla-ee/jobs/3fe195fb23014178a5d040aa0588f86d - DIRECT/162.222.75.243 application/json
> 1452569322.342     13 10.22.73.35 TCP_MISS/404 591 DELETE http://162.222.75.179/wd/hub/session/3fe195fb23014178a5d040aa0588f86d - DIRECT/162.222.75.179 text/plain
> 1452569322.468     16 10.22.73.35 TCP_MISS/404 591 GET http://162.222.75.179/wd/hub/session/3fe195fb23014178a5d040aa0588f86d/url - DIRECT/162.222.75.179 text/plain
> 1452569424.067 101594 10.22.73.35 TCP_MISS/404 319 GET http://162.222.75.179/wd/hub/session/3fe195fb23014178a5d040aa0588f86d/screenshot - DIRECT/162.222.75.179 text/plain
> 1452569424.080      9 10.22.73.35 TCP_MISS/404 591 GET http://162.222.75.179/wd/hub/session/3fe195fb23014178a5d040aa0588f86d/source - DIRECT/162.222.75.179 text/plain
> 1452569424.136     48 10.22.73.35 TCP_MISS/200 1305 PUT http://saucelabs.com/rest/v1/mozilla-ee/jobs/3fe195fb23014178a5d040aa0588f86d - DIRECT/162.222.75.243 application/json
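
For anyone digging through these entries, a minimal sketch of splitting one native-format line into its documented fields; the field order follows the wiki page linked above, and the sample is the 900700 ms click entry from this excerpt:

```python
def parse_squid_line(line):
    """Split a squid native access.log entry into labelled fields.

    Field order per the squid wiki's native format: timestamp, duration (ms),
    client address, result code/status, bytes, method, URL, rfc931 ident,
    hierarchy/peer, content type.
    """
    fields = line.split()
    return {
        "timestamp": float(fields[0]),
        "duration_ms": int(fields[1]),
        "client": fields[2],
        "result": fields[3],       # e.g. TCP_MISS/200
        "bytes": int(fields[4]),
        "method": fields[5],
        "url": fields[6],
        "ident": fields[7],
        "hierarchy": fields[8],    # e.g. DIRECT/162.222.75.179
        "content_type": fields[9],
    }

# The 900700 ms click request from the excerpt above:
entry = parse_squid_line(
    "1452569202.003 900700 10.22.73.35 TCP_MISS/- 4012 POST "
    "http://162.222.75.179/wd/hub/session/3fe195fb23014178a5d040aa0588f86d/element/1/click "
    "- DIRECT/162.222.75.179 text/html"
)
print(entry["duration_ms"] / 1000.0)  # ~900 seconds
```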
See Also: → 1244851
Thanks Brian, that's really helpful! If we were to assume that Sauce Labs was up and responsive at this time (running tests outside of our datacenter do not appear to hit this issue), what might cause this 900700ms response without a status code? I did some digging at the wiki page you mentioned and found this interesting page: http://wiki.squid-cache.org/ZeroSizedReply. This is another error we see occasionally, and the page mentions a race condition between the timeouts and zero sized reply, which sounds like it might be affecting us.
Flags: needinfo?(bhourigan)
(In reply to Dave Hunt (:davehunt) from comment #9)
> Thanks Brian, that's really helpful! If we were to assume that Sauce Labs
> was up and responsive at this time (running tests outside of our datacenter
> do not appear to hit this issue), what might cause this 900700ms response
> without a status code? I did some digging at the wiki page you mentioned and
> found this interesting page: http://wiki.squid-cache.org/ZeroSizedReply.
> This is another error we see occasionally, and the page mentions a race
> condition between the timeouts and zero sized reply, which sounds like it
> might be affecting us.

A zero sized reply is likely the effect of a remote timeout, where the remote side was unable to send any kind of HTTP response. If it were a remote failure of any sort I would at least expect some sort of reply, even an error. The subsequent requests all got a 404 (except for the PUT), so this makes me even more suspicious of the remote side.

I'm happy to dig deeper; I just need some more direction on how I can pinpoint this as an issue with our squid setup.
Flags: needinfo?(bhourigan)
This is continuing to occur; a recent example can be found here: https://webqa-ci.mozilla.com/job/bedrock.dev.win10.ie/275/console

Once again, the host http://webqa-ci1.qa.scl3.mozilla.com is attempting to communicate with saucelabs.com (162.222.75.39) and eventually getting a timeout. The error notes that it was using proxy4.dmz.scl3.mozilla.com (squid/3.1.23) and it likely happened around 09:27:21 Pacific on Feb 28, 2016.

Brian: Could you take a look at the logs to see if you can confirm this timeout like you did before? We are also running tests from outside of our data center, and these have not experienced similar issues, which suggests it has something to do with our infrastructure. Do you have any idea what might cause a 900 second timeout? Are you able to determine from the logs how often we hit this timeout, and if it's only affecting Sauce Labs traffic?
Flags: needinfo?(bhourigan)
> Brian: Could you take a look at the logs to see if you can confirm this
> timeout like you did before?

I can confirm that the same 900 second timeout was reached:

1456680317.041 900144 10.22.73.35 TCP_MISS/- 4015 POST http://162.222.75.39/wd/hub/session/d446ec325e8d40c18a7e969a2dd154b8/element/4/element - DIRECT/162.222.75.39 text/html

I also see a second request that timed out after ~90s, but returned a 404:

1456680408.918  91762 10.22.73.35 TCP_MISS/404 319 GET http://162.222.75.39/wd/hub/session/d446ec325e8d40c18a7e969a2dd154b8/source - DIRECT/162.222.75.39 text/plain

> We are also running tests from outside of our
> data center, and these have not experienced similar issues, which suggests
> it has something to do with our infrastructure.

Are you running the same tests outside of the datacenter?

> Do you have any idea what might cause a 900 second timeout? Are
> you able to determine from the logs how often we hit this timeout,
> and if it's only affecting Sauce Labs traffic?

I can confirm that no other proxy traffic is experiencing timeouts. I aggregated the logs for the past week, and response times are normal. I filtered out averages and looked for peak request durations, but nothing stood out.

I'm fairly confident this is not a problem with our proxies, and at the very least this should provide sufficient evidence to involve Sauce Labs technical support. They may have more logs on their side related to the test timeouts.
Flags: needinfo?(bhourigan)
(In reply to Brian Hourigan [:digi] from comment #12)
> > We are also running tests from outside of our
> > data center, and these have not experienced similar issues, which suggests
> > it has something to do with our infrastructure.
> 
> Are you running the same tests outside of the datacenter?

Yes, the exact same tests are running from an AWS instance without this issue.

> > Do you have any idea what might cause a 900 second timeout? Are
> > you able to determine from the logs how often we hit this timeout,
> > and if it's only affecting Sauce Labs traffic?
> 
> I can confirm that no other proxy traffic is experiencing timeouts. I
> aggregated the logs for the past week, and response times are normal. I
> filtered out averages and looked for peak request durations, but nothing
> stood out.

We see this 900 second timeout several times a week, so I'm surprised you didn't find these. Would it help for me to provide more timestamps for when we've seen these?

> I'm fairly confident this is not a problem with our proxies, and in the
> least, this should provide sufficient evidence to involve Sauce Labs
> technical support. They may have more logs on their side related to the test
> timeouts.

We've already involved Sauce Labs, and they've not been able to find anything on their side so far. Is it possible to avoid the proxy (perhaps temporarily) for Sauce Labs to rule it out?
Flags: needinfo?(bhourigan)
James had talked about bypassing the proxy to test; I'm guessing that did not go anywhere?
> We see this 900 second timeout several times a week, so I'm surprised you
> didn't find these. Would it help for me to provide more timestamps for when
> we've seen these?

My search review specifically excluded SauceLabs requests with the intent of finding evidence of a systematic problem with the proxy hosts.

> We've already involved Sauce Labs, and they've not been able to find
> anything on their side so far. Is it possible to avoid the proxy (perhaps
> temporarily) for Sauce Labs to rule it out?

I do not see a problem with this; we'll need to open flows for you to Saucelabs. Do you have a list of all endpoints you need to connect to?
Flags: needinfo?(bhourigan)
(In reply to Brian Hourigan [:digi] from comment #15)
> > We see this 900 second timeout several times a week, so I'm surprised you
> > didn't find these. Would it help for me to provide more timestamps for when
> > we've seen these?
> 
> My search review specifically excluded SauceLabs requests with the intent of
> finding evidence of a systematic problem with the proxy hosts.

Ah, that makes sense then. I'd be interested in getting details of the timed out Sauce Labs requests for the last couple of weeks. It's not too easy for me to get that information from our CI server, and it would be useful information to provide Sauce for their investigation.

> > We've already involved Sauce Labs, and they've not been able to find
> > anything on their side so far. Is it possible to avoid the proxy (perhaps
> > temporarily) for Sauce Labs to rule it out?
> 
> I do not see a problem with this, we'll need to open flows for you to
> Saucelabs. Do you have a list of all endpoints you need to connect to?

I believe it would be just saucelabs.com. Without the proxy to time out these requests, what would we expect here? If we genuinely never get a response from Sauce Labs, would the request potentially wait indefinitely?
Flags: needinfo?(bhourigan)
> > We've already involved Sauce Labs, and they've not been able to find
> > > anything on their side so far. Is it possible to avoid the proxy (perhaps
> > > temporarily) for Sauce Labs to rule it out?
> > 
> > I do not see a problem with this, we'll need to open flows for you to
> > Saucelabs. Do you have a list of all endpoints you need to connect to?
> 
> I believe it would be just saucelabs.com. Without the proxy to time out
> these requests, what would we expect here? If we genuinely never get a
> response from Sauce Labs, would the request potentially wait indefinitely?

No problem; the current IP of saucelabs.com is 162.222.75.243, so I requested a flow for that in bug 1252933. As soon as that is resolved you should be able to stop using the proxy servers. When filing the request I noticed that Saucelabs has a /21, which is rather large. Their DNS provider is Dynect, which offers rule-based query responses, so we'll want to get an IP address list from Saucelabs if this remains in place for any significant period of time.

It's hard to say for sure how your timeout behavior will change without the proxy in the middle. I would say it depends on what your user agent is. I conducted a quick test with curl using the proxy:

host1$ nc -l 8888
host2$ http_proxy=http://localhost:3128 time curl http://<host1>:8888/

Once host2 connected to host1 I sent SIGSTOP to nc, simulating a dead socket. It timed out in 15m0.586s. If I re-ran this test with curl but without http_proxy, curl's default behavior would be to never time out [1].

[1] https://curl.haxx.se/libcurl/c/CURLOPT_TIMEOUT.html
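
A rough Python equivalent of that dead-socket experiment, for anyone who wants to reproduce it without nc and curl. Everything here runs locally; the port and the 15-second timeout are only illustrative:

```python
import socket
import threading
import time

def silent_server(port=8888):
    """Accept one connection and never send a byte, like nc after SIGSTOP."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("127.0.0.1", port))
    srv.listen(1)
    conn, _ = srv.accept()
    threading.Event().wait()  # hold the connection open indefinitely

threading.Thread(target=silent_server, daemon=True).start()
time.sleep(0.5)  # give the listener a moment to bind

client = socket.create_connection(("127.0.0.1", 8888))
client.settimeout(15)  # without this, recv() blocks until someone gives up
client.sendall(b"GET / HTTP/1.0\r\n\r\n")
try:
    client.recv(4096)
except socket.timeout:
    print("read timed out after 15s; squid plays the same role at 900s")
```

With the settimeout line removed, the client simply hangs, which is the behaviour you would expect from a client with no timeout of its own once the proxy is out of the picture.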
Flags: needinfo?(bhourigan)
For completeness (and I'll try to see if I can get more specifics on these)

1) Sauce Labs had a maintenance window last night:

http://status.saucelabs.com/incidents/g5kvh13dcs5y

2) This Saturday, March 5th, @9am PDT, they begin a 32-hour (likely extensive, by the sound of it) network/infra-maintenance window, which will result in an outright outage, while they work:

http://status.saucelabs.com/incidents/rp8gx9wr3xyh

I'll be keeping an eye on the above as much as I'm able, and have alerted the team already via mozwebqa@mozilla.org
(In reply to Brian Hourigan [:digi] from comment #17)
> No problem, the current IP of saucelabs.com is 162.222.75.243 so I requested
> a flow for that in bug 1252933. As soon as that is resolved you should be
> able to stop using the proxy servers. When filing the request I noticed that
> Saucelabs has a /21, which is rather large. Their DNS provider is Dynect
> which offers rule based query responses, so we'll want to get an IP address
> list from Saucelabs if this remains in place for any significant period of
> time.

Thanks for raising that bug Brian.

> It's hard to say for sure how your timeout behavior will change without the
> proxy in the middle. I would say it has to depend on what your user agent
> is.

Yeah, I suspect we'll see jobs stall until we abort them manually. At least this will confirm that the proxy is not the issue, and we'll be able to switch back to the proxy easily enough.

Brian: If it's not too much to ask, could you gather details of the timed out Sauce Labs requests for the last few weeks from the logs? At the moment I need to dig through build results to find this evidence, and many are likely masked by our tests automatically rerunning when they encounter a failure. Gathering this information may help Sauce Labs to identify any issues on their side, and it may also help us to identify a pattern. If there's a way I can gather this data myself I'd be happy to do so with your guidance.
Flags: needinfo?(bhourigan)
> Brian: If it's not too much to ask, could you gather details of the timed
> out Sauce Labs requests for the last few weeks from the logs? At the moment
> I need to dig through build results to find this evidence, and many are
> likely masked by our test automatically rerunning when they encounter a
> failure. Gathering this information may help Sauce Labs to identify any
> issues on their side, and it may also help us to identify a pattern. If
> there's a way I can gather this data myself I'd be happy to with your
> guidance.

No trouble at all. I grepped for all requests to saucelabs.com and emailed you the raw log data directly.
Flags: needinfo?(bhourigan)
Thanks Brian. These appear to be only the REST API calls, and it looks like the slowest response took just over 200 seconds. This particular entry and the failures surrounding it fall perfectly into a known incident: http://status.saucelabs.com/incidents/gnmjbjjfhpmd

I'm also interested in the commands sent to Sauce Labs outside of their REST API. These may not mention saucelabs.com in the logs, much like your extract from comment 8, which states 162.222.75.179 as the host in the URL. I wonder if we can use this IP or somehow expand it to Sauce Labs' entire subnet range, or perhaps just grep the logs for either saucelabs.com OR /wd/hub/ in the URL?
Flags: needinfo?(bhourigan)
> I'm also interested in the commands sent to Sauce Labs outside of their REST
> API. These may not mention saucelabs.com in the logs, much like your extract
> from comment 8, which states 162.222.75.179 as the host in the URL. I wonder
> if we can use this IP or somehow expand it to Sauce Labs entire subnet
> range, or perhaps just grep the logs for either saucelabs.com OR /wd/hub/ in
> the URL?

Not a problem. I went ahead and searched all proxy logs for 162.222.7[2-9], which will match all Saucelabs IP addresses in 162.222.72.0/21. It's quite a bit more data, too big to email. I chronologically sorted all the logs, compressed them, and dropped them into your home directory on people.mozilla.org.
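
(As a quick sanity check on that pattern, Python's ipaddress module confirms the WebDriver endpoint from the earlier excerpt falls inside the /21; a small sketch, assuming Python 3:)

```python
import ipaddress

saucelabs_net = ipaddress.ip_network("162.222.72.0/21")

# 162.222.75.179 is inside 162.222.72.0/21, so the 162.222.7[2-9] grep
# pattern covers it.
print(ipaddress.ip_address("162.222.75.179") in saucelabs_net)  # True
```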
Flags: needinfo?(bhourigan)
Thanks Brian! I've filtered these to requests that took >800 seconds and there are 173 occurrences. I've attached them to the bug, and will also send them to Sauce Labs to see if they correlate with any issues on their side.

Interestingly, I've been trying to replicate this issue from our staging box - so far without success. Can you see any reason why this issue would occur for webqa-ci1 but not webqa-staging-ci1? As far as I'm aware these boxes should be configured identically. If I continue to have difficulties replicating this issue on staging, it may indicate that the problem is with the production VM itself...
Flags: needinfo?(bhourigan)
> Thanks Brian! I've filtered these to requests that took >800 seconds and
> there are 173 occurrences. I've attached them them to the bug, and will also
> send them to Sauce Labs to see if they correlate with any issues on their
> side.

No problem - I'm glad the log data was helpful.

> 
> Interestingly, I've been trying to replicate this issue from our staging box
> - so far without success. Can you see any reason why this issue would occur
> for webqa-ci1 but not webqa-staging-ci1? As far as I'm aware these boxes
> should be configured identically. If I continue to have difficulties
> replicating this issue on staging, it may indicate that the problem is with
> the production VM itself...

I can't speak to the configuration of Jenkins or any differences between the hosts, but as far as network connectivity and proxy usage go, both appear to be operating identically.

I performed the following test on both webqa-ci1.qa.scl3 and webqa-ci-staging1.qa.scl3:

> for i in `seq 1000`; do curl -s -o /dev/null -I -w "%{http_code}\n" --proxy proxy.dmz.scl3.mozilla.com:3128 https://www.mozilla.org/en-US/; sleep 0.5; done

All requests returned 200. I didn't want to run this too much and accidentally DDoS bedrock, but both hosts finished the same test in an identical amount of time.
Flags: needinfo?(bhourigan)
I'm curious, did the direct connection to Saucelabs make any difference?
(In reply to Brian Hourigan [:digi] from comment #25)
> I'm curious, did the direct connection to Saucelabs make any difference?

I haven't tried yet. I wanted to get this replicated on staging first, as I'd rather not continue to investigate on our production box. Fortunately, by increasing the number of tests running in parallel I have been able to replicate this on staging now. I'll leave this running today, and then tomorrow I'll attempt to disable use of the proxy for this test to see what impact that has.
(In reply to Dave Hunt (:davehunt) from comment #26)
> (In reply to Brian Hourigan [:digi] from comment #25)
> > I'm curious, did the direct connection to Saucelabs make any difference?
> 
> I haven't tried yet. I wanted to get this replicated on staging first, as
> I'd rather not continue to investigate on our production box. Fortunately,
> by increasing the number of tests running in parallel I have been able to
> replicate this on staging now. I'll leave this running today, and then
> tomorrow I'll attempt to disable use of the proxy for this test to see what
> impact that has.

No problem. Given that I had the wrong assumption about the log entry searches, I'd like to confirm that the netops ACL is correct to allow you full access to Saucelabs. Do you have a list of API endpoints you'll need to connect to? Right now only 162.222.75.243 is on the access list.
I've requested a list of IPs from Sauce Labs. When I hear back I'll reopen bug 1252933 with the details.
I disabled the proxy for our staging job a couple of days ago, and have seen a suspicious failure with the following stack trace:

> self = <httplib.HTTPResponse instance at 0x7fb8a47178c0>
> 
>     def _read_status(self):
>         # Initialize with Simple-Response defaults
>         line = self.fp.readline(_MAXLINE + 1)
>         if len(line) > _MAXLINE:
>             raise LineTooLong("header line")
>         if self.debuglevel > 0:
>             print "reply:", repr(line)
>         if not line:
>             # Presumably, the server closed the connection before
>             # sending a valid response.
> >           raise BadStatusLine(line)
> E           BadStatusLine: ''

I'll attach the full console log. Interestingly, these were much quicker to fail than the proxy timeouts we've been seeing and they were all very close together between 10:39:47 and 10:39:51 (Pacific) on March 15th, 2016. This may of course be an unrelated issue.

Brian: Do you know of any network issues in the given timeframe that may have affected our tests? If this is the same issue as our proxy timeouts but without the proxy, do you have any suggestions for how we can investigate further? I'll continue to watch this job and report back if I see more evidence of an issue without the proxy in use.
Flags: needinfo?(bhourigan)
Okay, we have a test hanging on staging, which is likely to be the type of issue that the proxy would time out for us. This means that the proxy is not likely to be the issue, but we still need to know what the issue is. I've left the job running (waiting for response) in case there's something we can do to investigate while it's in this state. Here's what I believe is the netstat output for this connection:

> tcp        0      0 10.22.73.142:36241          162.222.75.33:80            ESTABLISHED

Brian: Is there something else you can suggest to help me to investigate?
> Brian: Do you know of any network issues in the given timeframe that may
> have affected our tests? If this is the same issue as our proxy timeouts but
> without the proxy, do you have any suggestions for how we can investigate
> further? I'll continue to watch this job and report back if I see more
> evidence of an issue without the proxy in use.

Tagging netops to provide input here - I was out on PTO last week and I'm still catching up. :dcurado, would you mind mentioning if there were any known issues between 10:39:47 and 10:39:51 (Pacific) on March 15th?

> Okay, we have a test hanging on staging, which is likely to be the type of
> issue that the proxy would time out for us. This means that the proxy is not
> likely to be the issue, but we still need to know what the issue is. I've
> left the job running (waiting for response) in case there's something we can
> do to investigate while it's in this state. Here's what I believe is the
> netstat output for this connection:
> 
> > tcp        0      0 10.22.73.142:36241          162.222.75.33:80            ESTABLISHED
> 
> Brian: Is there something else you can suggest to help me to investigate?

Dave, we have exhausted all of the steps that we can take on our side. We have an overwhelming amount of evidence that highlights failures internal to Saucelabs and I think it's time to escalate the issue with them. Beyond identifying an open connection in netstat we can run packet captures, but I wouldn't expect to learn anything new. It'll hang until some timeout, at which point socket read functions will return EOF.
Flags: needinfo?(bhourigan) → needinfo?(dcurado)
I checked the logs for that date/hour between 38 minutes and 40 minutes.
There is nothing to indicate any network problem(s).

If you have overwhelming evidence that shows that "Saucelabs" has a problem, then I agree that this issue should be escalated with them.

Please let me know if I can help in any way.
Dave
Flags: needinfo?(dcurado)
I disagree that we have overwhelming evidence of the issues being internal to Sauce Labs. We're running many tests from an AWS instance of Jenkins [1] that are not suffering from this problem at all. I believe Sauce also use AWS, so that might be in the other Jenkins' favour, but I wouldn't expect that difference to cause intermittent timeouts on our side.

[1] https://ci.us-west.moz.works/job/bedrock_integration_tests_runner/
We have no evidence to support that there is an issue with SCL3 itself; literally no other users report any such issues with frequently-broken connections, and the sporadic VPC outages do not coincide with the Saucelabs issues.

We have significant evidence to support that there is an issue somewhere between SCL3 and Saucelabs. As observed above, whether the proxies are involved or not, our connections to Saucelabs are often unstable and need either timeout and/or retry handling to proceed correctly.

Saucelabs observes that their product uses, in part, AWS hosting [1], and over time they have observed sporadic network-layer issues that have been traced to AWS itself. We have observed similar issues with AWS over the past 3 months, and our network team has - multiple times - opened support tickets with Amazon asking them to repair their internal network such that it can reply to us correctly. They have done so.

However, unlike our own networks, we are not the owner of Saucelabs' AWS properties, and thus we cannot usefully file network complaints against it. Assuming the issue is with AWS, which historically it has been every time in the past few months, they will need to take this up further with Amazon to diagnose and resolve whatever issue is occurring.

Dave, we need you to verify that all Saucelabs IPs to which you've had these connection failures are AWS-hosted IPs. Assuming this is the case, we then strongly encourage you to invite Saucelabs to have Amazon repair whatever issues are occurring here. This bug is public/open and they are invited to review it and/or participate directly.

Aside from the direct involvement of Saucelabs, there is nothing else IT can do to assist you at this time. It is unfortunate that this issue is occurring with a third-party, but the evidence collected by all of the IT departments involved here supports a reasonable conclusion that Saucelabs or their hosting provider has issues we cannot resolve on their behalf.

Comment 34 also indicates that an AWS-hosted test suite has no such issues, which makes sense, given that AWS-to-AWS connections over Amazon's private links are presumably far more reliable than SCL3-to-AWS connections over the public Internet. Please consider either altering your test suites to improve their resilience to public Internet transit issues - or, if this is unacceptable, moving all Saucelabs-dependent test suites to AWS to provide the resilience they depend on today.

[1] https://status.saucelabs.com/incidents/ppdnhfljfmdk
Flags: needinfo?(dave.hunt)
(In reply to Richard Soderberg [:atoll] from comment #35)

Wow - thanks for the awesome detailed response :atoll!

> Dave, we need you to verify that all Saucelabs IPs to which you've had these
> connection failures are AWS-hosted IPs.

How do we do this? Here are the IPs that have caused us timeouts from the logs :digi provided me:

162.222.75.33
162.222.75.39
162.222.75.45
162.222.75.179

> Comment 34 also indicates that an AWS-hosted test suite has no such issues,
> which makes sense, given that AWS-to-AWS connections over Amazon's private
> links are presumably far more reliable than SCL3-to-AWS connections over the
> public Internet.

Thanks, this confirms what I was speculating too.

> Please consider either altering your test suites to improve
> their resilience to public Internet transit issues - or, if this is
> unacceptable, moving all Saucelabs-dependent test suites to AWS to provide
> the resilience they depend on today.

Both of these are valid options. Do you know if there's a way to reduce the proxy timeout, though, which currently appears to be 15 minutes? One course of action would be to retry failures automatically; however, 15 minutes for a single test is quite an impact on a test suite.
Flags: needinfo?(dave.hunt)
I don't believe that 162.222.75.0/24 is an AWS IP range.

If you traceroute to any of those IPs, they don't go to AWS.

traceroute to 162.222.75.33 (162.222.75.33), 30 hops max, 60 byte packets
 1  fw1.private.scl3.mozilla.net (10.22.75.1)  0.689 ms  0.679 ms  0.642 ms
 2  63.245.214.53 (63.245.214.53)  2.657 ms  2.644 ms  2.632 ms
 3  63.245.214.46 (63.245.214.46)  0.906 ms  0.903 ms  0.889 ms
 4  xe-1-2-3.border1.sjc2.mozilla.net (63.245.219.161)  1.483 ms  1.482 ms  1.467 ms
 5  sjo-b21-link.telia.net (62.115.8.161)  1.990 ms  2.018 ms  2.086 ms
 6  internap-ic-140172-sjo-b21.c.telia.net (213.248.81.134)  2.043 ms  2.058 ms  1.943 ms
 7  66.151.144.15 (66.151.144.15)  2.233 ms 66.151.144.85 (66.151.144.85)  2.205 ms  2.264 ms
 8  saucelabs-1.border3.sje011.pnap.net (23.92.186.22)  18.222 ms  18.190 ms  18.174 ms
 9  * * *
10  * * *
11  * * *
12  * * *
13  * * *
14  *^C
Those IPs are all, oddly, Comcast Business. So it appears that they're running some component of their business on non-AWS properties out of Equinix SV5, which is near the end of Highway 85 on the South San Jose side.

We do have the ability to alter proxy timeouts, though it may have a global impact on all users (and not just yourself). What timeout were you thinking of? Would it be simpler to implement your own timeouts and simply treat the proxy connection as "timed out" when your own limit is reached, since otherwise it would eventually reply with "timed out" anyway?
My traceroute is different, since I'm on Comcast. So it turns out they have peering agreements, and one of them is Comcast. This is totally normal.

traceroute to 162.222.75.179 (162.222.75.179), 64 hops max, 52 byte packets
 1  172.16.51.1 (172.16.51.1)  4.626 ms  8.403 ms  7.295 ms
 2  96.120.88.229 (96.120.88.229)  24.972 ms  19.562 ms  22.985 ms
 3  te-0-3-0-9-sur03.santaclara.ca.sfba.comcast.net (68.85.190.253)  22.505 ms  22.821 ms  20.126 ms
 4  hu-0-3-0-7-ar01.santaclara.ca.sfba.comcast.net (68.87.192.189)  24.674 ms
    hu-0-3-0-5-ar01.santaclara.ca.sfba.comcast.net (68.87.192.181)  20.150 ms
    hu-0-3-0-4-ar01.santaclara.ca.sfba.comcast.net (68.87.192.177)  24.195 ms
 5  be-33651-cr01.sunnyvale.ca.ibone.comcast.net (68.86.90.93)  20.077 ms  22.186 ms  23.048 ms
 6  be-10925-cr01.9greatoaks.ca.ibone.comcast.net (68.86.87.158)  23.955 ms  27.921 ms  23.157 ms
 7  hu-0-11-0-1-pe03.11greatoaks.ca.ibone.comcast.net (68.86.85.242)  27.993 ms  23.809 ms  20.454 ms
 8  66.208.233.250 (66.208.233.250)  16.731 ms  19.660 ms
    66.208.233.246 (66.208.233.246)  38.517 ms
 9  162-222-75-179.saucelabs.net (162.222.75.179)  32.469 ms  30.368 ms  25.459 ms
For comparison, the same IP from my traceroute, done from admin1.scl3 (to match Dave's routes in comment 37):

$ traceroute 162.222.75.179
traceroute to 162.222.75.179 (162.222.75.179), 30 hops max, 60 byte packets
 1  fw1.private.scl3.mozilla.net (10.22.75.1)  1.279 ms  1.210 ms  1.217 ms
 2  63.245.214.53 (63.245.214.53)  3.157 ms  3.208 ms  3.133 ms
 3  63.245.214.46 (63.245.214.46)  1.430 ms  1.423 ms  1.407 ms
 4  xe-1-2-3.border1.sjc2.mozilla.net (63.245.219.161)  2.039 ms  1.982 ms  1.978 ms
 5  sjo-b21-link.telia.net (62.115.8.161)  2.395 ms  2.310 ms  2.403 ms
 6  internap-ic-140172-sjo-b21.c.telia.net (213.248.81.134)  2.322 ms  2.023 ms  2.066 ms
 7  66.151.144.85 (66.151.144.85)  2.208 ms  2.178 ms  2.233 ms
 8  saucelabs-1.border3.sje011.pnap.net (23.92.186.22)  2.191 ms  2.124 ms  2.072 ms
 9  162-222-75-179.saucelabs.net (162.222.75.179)  2.138 ms  2.199 ms  2.179 ms
As you can tell from these traceroutes, SCL3's quality of internet connection is significantly higher than my own :)
Got asked to look at this with fresh eyes.  If this is totally off-base and I need smacking, I won't be offended.

comment 2 says "(read timeout=15)"
comment 0 and comment 36 say "15 minutes" but, I -think- that's wrong.  comment 29 says "httplib.HTTPResponse", and https://docs.python.org/2/library/httplib.html says the timeout is measured in seconds (but the default is something system-sized)


"But Greg!  I am totally seeing things time out at 900 seconds and that's 15 minutes!"  Yes.  However, http://www.squid-cache.org/Versions/v3/3.1/cfgman/read_timeout.html - squid has a 15 minute timeout going up.  If the client connection goes away, who's to say it won't still try northbound?


On the proxy,   cat /var/log/squid/access.log | grep '162.222.7[2-9]' | sort -n -k2
(grep based on comment 22)
Most of the connections are way under 15s.  I see 2 blowouts at 5min right now.  Everything else looks "okay?"  With most of the connections way under 15s, let alone 15min, this makes it a hard case to chase.

comment 29's snippet, run through my (admittedly faulty) eyeball python parser:
line = readline... but if that times out, then line is empty and we move on.  Not over _MAXLINE, not in debug...  IF NOT LINE (hey that's me!) raise a bad status line of ''.  Which it does.  This can mask any timeout.


Speculation:
* The bug has a lot of mentions of 15m timeouts in code; I think it's 15s.
** Should be easy to prove/disprove this (see the snippet at the end of this comment):
**  https://docs.python.org/2/library/socket.html?highlight=socket#socket.getdefaulttimeout
** 15s is too short to avoid issues, whether they're squid- or saucelabs- induced.

* If squid doesn't get a 'steady' stream back from saucelabs, enough to send data down to the client often enough to beat the 15s timeout, the code loses.  This is happening "often" (FSVO).
* If saucelabs doesn't give -any- stream back, we blow out at 900s because squid times out, but the client is long gone.  This is happening "rarely but nonzero enough to be a distraction" (FSVO).

Totally not tested, add salt, etc.
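
The getdefaulttimeout check mentioned above is a one-liner; a minimal sketch, assuming it's run inside the same Python environment the tests use:

```python
import socket

# None means "no default timeout": blocked reads wait until the peer or an
# intermediary (such as squid) gives up, rather than failing at 15 seconds.
print(socket.getdefaulttimeout())
```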
(In reply to Greg Cox [:gcox] from comment #42)
> "But Greg!  I am totally seeing things time out at 900 seconds and that's 15
> minutes!"  Yes.  However,
> http://www.squid-cache.org/Versions/v3/3.1/cfgman/read_timeout.html - squid
> has a 15 minute timeout going up.  If the client connection goes away, who's
> to say it won't still try northbound?

Sorry, I don't understand the "still try northbound" phrase. We're using the Python Selenium client, which by default uses socket._GLOBAL_DEFAULT_TIMEOUT, which by default is None (no timeout): https://github.com/SeleniumHQ/selenium/blob/8f455096da66de5f39a886e22338fdae43bbe12d/py/selenium/webdriver/remote/remote_connection.py#L463, and this explains why the proxy takes over at 15 minutes if that's what it's set to timeout at. When we avoided the proxy, we saw the same issue but the tests were waiting for hours (days) without timing out.

> On the proxy,   cat /var/log/squid/access.log | grep '162.222.7[2-9]' | sort
> -n -k2
> (grep based on comment 22)
> Most of the connections are way under 15s.  I see 2 blowouts at 5min right
> now.  Everything else looks "okay?"  With most of the connections are way
> under 15s, let alone 15min, this makes it a hard case to chase

Most of the connections are fine. I'm only really concerned with the ~900 second connection timeouts in attachment 8727423 [details].

> comment 29's snippet, run through my (admittedly faulty) eyeball python
> parser:
> line = readline... but if that times out, then line is empty and we move on.
> Not over _MAXLINE, not in debug...  IF NOT LINE (hey that's me!) raise a bad
> status line of ''.  Which it does.  This can mask any timeout.

I think we can disregard comment 29 as a red herring. We later saw the test hang and never timeout - a much more likely scenario without the proxy.

> Speculation:
> * The bug has a lot of mentions of 15m timeouts in code; I think it's 15s.

In the Python code, I think it's actually *never*. We can override this, and perhaps we should, but all that will do is reduce how long it takes for tests to fail in this scenario. Admittedly that may be worth doing and is most likely a better solution than changing any global defaults on the proxy configuration.
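
A minimal sketch of what that override might look like with the Python Selenium client. RemoteConnection.set_timeout is assumed to be available in the client version in use, the executor URL is a placeholder, and the 120-second value is only illustrative:

```python
from selenium import webdriver
from selenium.webdriver.remote.remote_connection import RemoteConnection

# Cap every WebDriver HTTP request at 120 seconds instead of inheriting the
# global socket default (None), which is what lets the proxy's 900 second
# read_timeout become the effective limit.
RemoteConnection.set_timeout(120)

driver = webdriver.Remote(
    # Placeholder executor URL; the real one carries the Sauce credentials.
    command_executor="http://ondemand.saucelabs.com:80/wd/hub",
    desired_capabilities={"browserName": "firefox"},
)
```

A process-wide socket.setdefaulttimeout() would also work, but it affects every socket the test process opens, not just Selenium's.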

> ** Should be easy to prove/disprove this:
> ** 
> https://docs.python.org/2/library/socket.html?highlight=socket#socket.
> getdefaulttimeout
> ** 15s is too short to avoid issues, whether they're squid- or saucelabs-
> induced.
> 
> * If squid doesn't get a 'steady' stream back from saucelabs, enough to send
> data down to the client often enough to beat the 15s timeout, the code
> loses.  This is happening "often" (FSVO).

I don't believe there is a 15 second timeout, but I also don't understand this statement. :/

> * If saucelabs doesn't give -any- stream back, we blow out at 900s because
> squid times out, but the client is long gone.  This is happening "rarely but
> nonzero enough to be a distraction" (FSVO).

Right, and this is the key issue for me. How can we address this? It sounds like these IPs are not in AWS, but that doesn't mean it's not an issue with Sauce Labs. Is there anything we can run that could monitor one of these IPs and give us more information in the scenario where no response is received within 900 seconds?

> Totally not tested, add salt, etc.

Really appreciate you looking at this with fresh eyes, Greg! :)
I think, in the end, our code needs to explicitly define a timeout lower than 900 seconds. Saucelabs normally takes "a few seconds" to respond to you. So anything over.. 15? 30? .. seconds is bad, and should be halted and retried, right?
(In reply to Richard Soderberg [:atoll] from comment #44)
> I think, in the end, our code needs to explicitly define a timeout lower
> than 900 seconds. Saucelabs normally takes "a few seconds" to respond to
> you. So anything over.. 15? 30? .. seconds is bad, and should be halted and
> retried, right?

We can set a connection timeout, but unfortunately we can't retry individual commands - we would have to retry the entire test. Retrying all tests that failed will cause a penalty for all test failures that are not due to this timeout. I suppose we could patch Selenium to retry on timeout, but it's telling that there have been no requests for this in the past considering the number of Selenium and Sauce Labs users there must be.

Is there anything we can do to get to the bottom of this network issue and ensure a more reliable service? I've also asked Sauce Labs to comment on this bug.
I'm also uncertain how a connection timeout will be handled by our Selenium Grid instance, which queues sessions until there is an available node and has its own timeout value (currently set to 30 seconds). Sauce Labs defaults to failing the test when there have been no commands for 90 seconds.
New to the discussion, so maybe this was asked and answered in a way I don't see above.

A problem releng has had from time to time with buildbot's (Python) TCP connections is that TCP-level keepalive was not configured by default when sockets were opened. That would sometimes lead to hardware-in-the-middle properly dropping an it's-taking-a-long-time-to-build-the-answer type of connection. (Especially true when we had cross-data-center connections. We solved it with a mix of code patching and replicating services across data centers.)

(I'm not a network guy, but aiui, tcp level keep alives should provide enough traffic to prevent any network gear from disconnecting-for-idle. If the problem still occurs, it's something in the application layer doing the disconnect.)

(My general interest in this is to try to "bound" the type of timeout & retry behavior we clients are responsible for handling, so we don't encounter "surprises" like this.)
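
(For reference, a minimal sketch of turning on keepalive for an individual Python socket; the TCP_KEEP* constants are Linux-specific and the interval values are only illustrative. This isn't something the Selenium client does by default.)

```python
import socket

sock = socket.create_connection(("saucelabs.com", 80))

# Enable TCP keepalive so a dead path is detected even while the application
# is blocked waiting for a response.
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

# Linux-specific tuning: start probing after 60s of idle, probe every 10s,
# and drop the connection after 5 unanswered probes. Values are illustrative.
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 10)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 5)
```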
Flags: needinfo?(hwine)
Dave asked me to look into this through a backchannel(AKA selenium IRC).

A few comments in response to prior comments:
- SauceLabs has our own data center.  We do not host our active tests on AWS.  After testing some of the artifacts do get archived onto AWS.  Hence why you saw Comcast.  
- The fact that your tests running on a system external to SCL3 do not time out tends to point at SCL3 being the culprit
- The 90 second timeout is our default settings.  We do have a desired capability you can set in your code to increase or decrease that.  Please see https://wiki.saucelabs.com/display/DOCS/Test+Configuration+Options#TestConfigurationOptions-IdleTestTimeout
- I would actually consider setting up test retry logic on *all* failing tests.  In prior QA Automation work I have found it significantly beneficial to rerun failing tests to avoid false positives/flapping tests.
:davehunt, try these sysctl's. They're designed to aggressively detect and kill connections where the underlying TCP channel is dead, so that you can be sure that any timeouts are application-layer rather than socket-layer:

# { date; sysctl -a | grep tcp_keepalive; } >> ~root/keepalive_prev.log
# sysctl -w net.ipv4.tcp_keepalive_time=5
# sysctl -w net.ipv4.tcp_keepalive_intvl=5
# sysctl -w net.ipv4.tcp_keepalive_probes=3

This waits 5 seconds after the latest data packet before beginning keepalives, then sends them at 5-second intervals, force-terminating your connection if 3 probes fail to receive a response.

I'm not sure I advise running these values long-term, but they should provide some clarity on what's going on without causing any particular harm to any of your other sessions.
Flags: needinfo?(dave.hunt)
(In reply to Leo Laskin from comment #48)
> Dave asked me to look into this through a backchannel(AKA selenium IRC).
> 
> A few comments in response to prior comments:
> - SauceLabs has our own data center.  We do not host our active tests on
> AWS.  After testing some of the artifacts do get archived onto AWS.  Hence
> why you saw Comcast.  
> - The fact that your tests running on a system external to SCL3 do not time
> out tends to point at SCL3 being the culprit
> - The 90 second timeout is our default settings.  We do have a desired
> capability you can set in your code to increase or decrease that.  Please
> see
> https://wiki.saucelabs.com/display/DOCS/
> Test+Configuration+Options#TestConfigurationOptions-IdleTestTimeout
> -I would actually consider setting up test retry logic on *all* failing
> tests.  In prior QA Automation work I have found it significantly beneficial
> to rerun failing tests to avoid false positives/flapping tests.

I'm reluctant to automatically rerun any failures as there's a risk we'd be paving over intermittent issues either due to genuine bugs in the application under test or due to flaky tests. Once identified, flaky tests can either be made more robust, individually marked to automatically rerun, or deleted as appropriate.
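
For the "individually marked" option, a hedged sketch assuming pytest with the pytest-rerunfailures plugin; the test name and the selenium fixture are hypothetical stand-ins for whatever the suite actually uses:

```python
import pytest

# Rerun only this known-flaky test once on failure, rather than rerunning
# every failure and paving over genuine regressions.
@pytest.mark.flaky(reruns=1, reruns_delay=5)
def test_click_promo_banner(selenium):
    # 'selenium' is a hypothetical fixture standing in for however the suite
    # obtains its remote WebDriver session.
    selenium.find_element_by_css_selector(".promo").click()
```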
(In reply to Richard Soderberg [:atoll] from comment #49)
> :davehunt, try these sysctl's. They're designed to aggressively detect and
> kill connections where the underlying TCP channel is dead, so that you can
> be sure that any timeouts are application-layer rather than socket-layer:
> 
> # { date; sysctl -a | grep tcp_keepalive; } >> ~root/keepalive_prev.log
> # sysctl -w net.ipv4.tcp_keepalive_time=5
> # sysctl -w net.ipv4.tcp_keepalive_intvl=5
> # sysctl -w net.ipv4.tcp_keepalive_probes=3
> 
> This waits 5 seconds after the latest data packet before beginning
> keepalives, then sends them at 5 seconds intervals, force-terminating your
> connection if 3 probes fail to receive a response.
> 
> I'm not sure I advise running these values long-term, but they should
> provide some clarity on what's going on without causing any particular harm
> to any of your other sessions.

I've configured staging with these suggestions and will provide an update once it's had some time to replicate the issue.
Flags: needinfo?(dave.hunt)
atoll: It looks like these changes have not made any difference to the failure signature. Tests still take 900 seconds to timeout, and the traceback is the same. Do you have any other suggestions? I'm going to see if I can replicate the same issue with another provider, which might eliminate Sauce Labs as being the cause.
Flags: needinfo?(rsoderberg)
I've replicated the issue with TestingBot, another Selenium as a service provider. A proxy timeout occurred at approximately 23:44:59 on Apr 4, 2016. There were also a number of ERR_CONNECT_FAIL failures in this build. This makes me once again suspect our infrastructure, and likely eliminates Sauce Labs from being the cause of these timeouts. I'm also running tests with BrowserStack, and will report back if I see timeouts there.

digi: Could you take a look at the logs for the above timestamp to see if there's anything suspicious? Given that we've now replicated this using another provider, what can we do to further investigate this?
Flags: needinfo?(bhourigan)
I'm starting to doubt if comment 53 was actually the same issue, as I've not seen it since, and not seen the same issue with BrowserStack. I'm running a build every two hours against each provider, and Sauce Labs is failing once or twice a day with the timeout.

Sauce Labs proxy timeouts from the last three days:
Apr 8, 2016 12:22 AM
Apr 7, 2016 6:22 AM
Apr 7, 2016 12:22 AM
Apr 6, 2016 4:22 PM
Apr 6, 2016 4:22 AM
Apr 6, 2016 12:22 AM

We've seen issues with BrowserStack in the last three days, but not related to timeouts. I've reported the issue to them and suspect it to be related to their infrastructure (the error mentions a missing library). TestingBot has seen no issues in the last three days. Perhaps we could try switching to TestingBot for our main jobs if Sauce Labs are unable to help to resolve this issue.
The fixes in bug 1289697 seem to have addressed most of the open issues with the proxies, so I'm going to close this bug. Please re-open if you're still seeing failures to saucelabs.com.
Status: NEW → RESOLVED
Closed: 8 years ago
Flags: needinfo?(rsoderberg)
Flags: needinfo?(hwine)
Flags: needinfo?(bhourigan)
Resolution: --- → FIXED