Closed Bug 1235563 Opened 8 years ago Closed 8 years ago

[investigate] Selenium test setup failures against dev

Categories

(Infrastructure & Operations :: Infrastructure: Other, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: agibson, Unassigned)

Details

We currently have three different jobs set up in Jenkins that test against dev using Sauce Labs:

https://webqa-ci.mozilla.com/view/Mozilla.org/job/bedrock.dev.win10.chrome/
https://webqa-ci.mozilla.com/view/Mozilla.org/job/bedrock.dev.win10.firefox/
https://webqa-ci.mozilla.com/view/Mozilla.org/job/bedrock.dev.win10.ie/
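
For context, here is roughly how jobs like these start a Sauce Labs session (a minimal sketch using the Selenium 2-era Python API, not the actual CI configuration; the credentials are placeholders and the dev URL is assumed to be www-dev.allizom.org):

from selenium import webdriver

# Placeholder credentials; the real jobs read these from Jenkins secrets.
SAUCE_USERNAME = "example-user"
SAUCE_ACCESS_KEY = "example-key"

capabilities = {
    "browserName": "internet explorer",  # or "chrome" / "firefox"
    "platform": "Windows 10",
}

# Session setup: this remote call is where the intermittent failures occur.
driver = webdriver.Remote(
    command_executor="http://%s:%s@ondemand.saucelabs.com:80/wd/hub"
    % (SAUCE_USERNAME, SAUCE_ACCESS_KEY),
    desired_capabilities=capabilities,
)
try:
    driver.get("https://www-dev.allizom.org/")  # the dev environment under test
finally:
    driver.quit()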

These test runs seem to be failing intermittently during job setup. I don't think there are issues with the tests or pages themselves. Both stage and prod currently show a nice run of green:

https://webqa-ci.mozilla.com/view/Mozilla.org/job/bedrock.stage/
https://webqa-ci.mozilla.com/view/Mozilla.org/job/bedrock.prod/

The failures on dev often include Squid error page messages in the test results. For example:

https://webqa-ci.mozilla.com/view/Mozilla.org/job/bedrock.dev.win10.ie/159/testReport/junit/tests.functional.firefox.os.version/test_all/test_news_is_displayed_1_3_/

It would be great to identify the issue causing these test failures and get it resolved.
It's worth noting that since 23rd Dec 2015, these failures on dev have suddenly become much more frequent. More often than not, we're seeing a random test fail with a Squid error message in the log.
Version: Production → Development/Staging
These errors are coming from the ZLBs in SCL3, which run Squid.
Assignee: nobody → server-ops-webops
Component: Bedrock → WebOps: Product Delivery
Product: www.mozilla.org → Infrastructure & Operations
QA Contact: smani
Version: Development/Staging → unspecified
Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/2373]
The datacenter proxies (Squid clusters) do not use ZLB, to the best of my knowledge (as previous owner of the Squid clusters).

However, I'm not sure there's anything the Squid admins can do to help you here, either. The HTML error as indicated by one of your links in comment 0 is:

<p>The following error was encountered while trying to retrieve the URL: <a href="http://162.222.75.179/wd/hub/session/2c2c8937fcba4e1cbe10c98df8bfd876/url">http://162.222.75.179/wd/hub/session/2c2c8937fcba4e1cbe10c98df8bfd876/url</a></p>
E
E   <blockquote id="error">
E   <p><b>Zero Sized Reply</b></p>
E   </blockquote>
E
E   <p>Squid did not receive any data for this request.</p>

Which, clarified, is:

The following error was encountered while trying to retrieve the URL:
http://162.222.75.179/wd/hub/session/2c2c8937fcba4e1cbe10c98df8bfd876/url
Zero Sized Reply
Squid did not receive any data for this request.

Which means that the remote server, for whatever reason, did not respond to the request correctly. No other users of the proxy are reporting issues and so I'm relatively certain the issue will be with 162.222.75.179.
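
For illustration, one way a datacenter host could try to reproduce this (a hedged sketch; the proxy FQDN is the one named later in this bug, and the target is the Selenium hub host from the error above):

import requests

# The datacenter Squid proxy and the failing Selenium hub host.
PROXY = "http://proxy.dmz.scl3.mozilla.com:3128/"
URL = "http://162.222.75.179/wd/hub/status"  # hub status endpoint

try:
    resp = requests.get(URL, proxies={"http": PROXY}, timeout=330)
except requests.RequestException as exc:
    print("request failed:", exc)
else:
    if "Zero Sized Reply" in resp.text:
        print("Squid reached the upstream but got no data back")
    else:
        print(resp.status_code, resp.text[:200])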

I'm going to *tentatively* close this as WORKSFORME, but if further evidence arises that points to some sort of response failure by Squid itself, please feel free to reopen and this will go to the correct team for further consideration.
Assignee: server-ops-webops → infra
Component: WebOps: Product Delivery → Infrastructure: Other
QA Contact: smani → jdow
Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/2373]
(In reply to Richard Soderberg [:atoll] from comment #3)
> The datacenter proxies (Squid clusters) do not use ZLB, to the best of my
> knowledge (as previous owner of the Squid clusters).

Thanks for that information. I was under the impression that the ZLB nodes did caching via Squid in addition to load balancing, and was not aware that there was a separate Squid cluster. Is this something that might have changed in the last 3 years, or has this always been the case to your knowledge? Can you provide me with a link to the appropriate documentation for this?

> However, I'm not sure there's anything the Squid admins can do to help you
> here, either. The HTML error as indicated by one of your links in comment 0
> is:
> 
> <p>The following error was encountered while trying to retrieve the URL: <a
> href="http://162.222.75.179/wd/hub/session/2c2c8937fcba4e1cbe10c98df8bfd876/
> url">http://162.222.75.179/wd/hub/session/2c2c8937fcba4e1cbe10c98df8bfd876/
> url</a></p> E    E   <blockquote id="error"> E   <p><b>Zero Sized
> Reply</b></p> E   </blockquote> E    E   <p>Squid did not receive any data
> for this request.</p>
> 
> Which, clarified, is:
> 
> The following error was encountered while trying to retrieve the URL:
> http://162.222.75.179/wd/hub/session/2c2c8937fcba4e1cbe10c98df8bfd876/url
> Zero Sized Reply
> Squid did not receive any data for this request.
> 
> Which means that the remote server, for whatever reason, did not respond to
> the request correctly. No other users of the proxy are reporting issues and
> so I'm relatively certain the issue will be with 162.222.75.179.
> 
> I'm going to *tentatively* close this as WORKSFORME, but if further evidence
> arises that points to some sort of response failure by Squid itself, please
> feel free to reopen and this will go to the correct team for further
> consideration.

The "Zero sized reply" error seems to indicate a timeout waiting on a reply from the upstream server, which could indicate network issues, an issue with the squid node itself, or (most likely) the apache server itself simply taking too long to serve the reply. What is the timeout value in the squid config? Can we get a count of how many times this error has occurred in the past day, week, and/or month?
Flags: needinfo?(rsoderberg)
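
One way to get those counts (a rough sketch, assuming Squid's default native access.log format and that Zero Sized Reply failures are logged as 502s; the path and format may differ on the proxy hosts):

from collections import Counter
from datetime import datetime, timedelta

# Count 502 (Zero Sized Reply) entries in the past day/week/month.
counts = Counter()
now = datetime.now()
with open("/var/log/squid/access.log") as log:
    for line in log:
        fields = line.split()
        # Native format: epoch-timestamp elapsed client code/status ...
        if len(fields) < 4 or not fields[3].endswith("/502"):
            continue
        age = now - datetime.fromtimestamp(float(fields[0]))
        if age <= timedelta(days=1):
            counts["past day"] += 1
        if age <= timedelta(weeks=1):
            counts["past week"] += 1
        if age <= timedelta(days=30):
            counts["past month"] += 1
print(dict(counts))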
(In reply to Richard Soderberg [:atoll] from comment #3)
> Which means that the remote server, for whatever reason, did not respond to
> the request correctly. No other users of the proxy are reporting issues and
> so I'm relatively certain the issue will be with 162.222.75.179.

162.222.75.179 was the public IP that the request was coming from. The error message (in this case) was from proxy2.dmz.scl3.mozilla.com.
Nope, nothing's changed in the past three years. They're all at 'proxy[1-4].dmz.[scl3,phx1].mozilla.com', and there's a DNS round-robin record 'proxy.dmz.[scl3,phx1].mozilla.com' that references them.

The datacenter isn't experiencing network issues, so I'm inclined to think it's not that.

(In reply to Josh Mize [:jgmize] from comment #5)
> (In reply to Richard Soderberg [:atoll] from comment #3)
> > Which means that the remote server, for whatever reason, did not respond to
> > the request correctly. No other users of the proxy are reporting issues and
> > so I'm relatively certain the issue will be with 162.222.75.179.
> 
> 162.222.75.179 was public IP that the request was coming from. The error
> message (in this case) was from proxy2.dmz.scl3.mozilla.com.

To the best of my knowledge our squid cluster (as described above) is *not* publicly available, and in fact is only accessible to datacenter hosts (not even VPN hosts).

That squid error is an HTML page returned *by a Squid instance* indicating that Squid, itself, was unable to connect to the URL with that IP address in it.

So, perhaps it's important to clarify here - which Squid instance are you using? What is your http_proxy/https_proxy variable? What is the FQDN of the host that is having trouble making connections through Squid?
Flags: needinfo?(rsoderberg) → needinfo?(jmize)
Ah, I see the problem now. I misunderstood where the error was actually occurring: I originally thought that the Squid errors were being returned during the requests from the Sauce Labs instance to the bedrock-dev instance, but the errors were actually coming from the Squid instance when Jenkins itself was attempting to communicate with Sauce Labs. There have been some ongoing issues for a while, but Sauce Labs recently had some major issues[0] that were causing this error to appear. Thanks :atoll for the thorough explanations and clearing up my misunderstanding. Since this bug was an "investigation", and I believe the root cause has been identified, I'll mark this as resolved fixed.

[0] https://status.saucelabs.com/incidents/s2f7zl2r4942
Status: NEW → RESOLVED
Closed: 8 years ago
Flags: needinfo?(jmize)
Resolution: --- → FIXED
Fwiw, we're still seeing the same failures when running tests against dev, which makes me believe the Selenium issue linked in Comment 7 is not likely the root cause here, sadly.
(In reply to Alex Gibson [:agibson] from comment #8)
> Fwiw, we're still seeing the same failures when running tests against dev,
> which makes me believe the Selenium issue linked in Comment 7 is not likely
> the root cause here, sadly.

Sorry, I mean Saucelabs issue.
(In reply to Alex Gibson [:agibson] from comment #9)
> (In reply to Alex Gibson [:agibson] from comment #8)
> > Fwiw, we're still seeing the same failures when running tests against dev,
> > which makes me believe the Selenium issue linked in Comment 7 is not likely
> > the root cause here, sadly.
> 
> Sorry, I mean Saucelabs issue.

I believe now that the root cause is still that Squid is timing out while waiting on a reply from the Sauce Labs API. While the major issues they experienced on the 29th may have been "resolved", I believe there are ongoing "minor" intermittent issues causing degraded performance, network timeouts, etc., which manifest as Squid "zero sized reply" errors on the Jenkins instance running in SCL3, and show up as actual network timeout errors in AWS[0], where we do not go through a proxy.

[0] https://ci.us-west.moz.works/view/Bedrock/job/bedrock_test_dev_eu_west/827/console
Oh, also, I would hazard a guess that the Squid timeout is 300 seconds, given that the test fails in 310 seconds.
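
One quick way to check that guess on a proxy host (a hypothetical sketch; the config path may differ, and read_timeout defaults to 15 minutes, so a failure at ~300s would suggest either a locally lowered value or a different timeout such as forward_timeout):

import re

# List the timeout directives actually set in squid.conf.
with open("/etc/squid/squid.conf") as conf:
    for line in conf:
        if re.match(r"\s*[a-z_]*_timeout\b", line):
            print(line.rstrip())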
(In reply to Richard Soderberg [:atoll] from comment #6)
> What is your http_proxy/https_proxy variable?

http://proxy.dmz.scl3.mozilla.com:3128/

> What is the FQDN of the host that is having trouble making connections through Squid?

webqa-ci1.qa.scl3.mozilla.com

Sauce Labs are investigating "an intermittent low-level network issue that is resulting in sporadic and varied failures": https://status.saucelabs.com/incidents/ppdnhfljfmdk
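
Given those values, a minimal end-to-end check from webqa-ci1 might look like this (illustrative only; assumes the requests library and the unauthenticated Sauce Labs REST status endpoint of the time):

import time
import requests

PROXIES = {
    "http": "http://proxy.dmz.scl3.mozilla.com:3128/",
    "https": "http://proxy.dmz.scl3.mozilla.com:3128/",
}

start = time.time()
try:
    # Service-status endpoint from the Sauce Labs REST API.
    resp = requests.get(
        "https://saucelabs.com/rest/v1/info/status",
        proxies=PROXIES,
        timeout=330,  # longer than the suspected ~300s Squid timeout
    )
    print(resp.status_code, resp.text, "in %.1fs" % (time.time() - start))
except requests.RequestException as exc:
    print("failed after %.1fs: %s" % (time.time() - start, exc))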
(In reply to Dave Hunt (:davehunt) from comment #12)
> Sauce Labs are investigating "an intermittent low-level network issue that
> is resulting in sporadic and varied failures":
> https://status.saucelabs.com/incidents/ppdnhfljfmdk

That would be precisely the sort of thing that would cause the issues observed here, unfortunately.