Closed Bug 1235563 Opened 8 years ago Closed 8 years ago

[investigate] Selenium test setup failures against dev

Categories

(Infrastructure & Operations :: Infrastructure: Other, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: agibson, Unassigned)

Details

We currently have three different jobs set up in Jenkins that test against dev using Sauce Labs:

https://webqa-ci.mozilla.com/view/Mozilla.org/job/bedrock.dev.win10.chrome/
https://webqa-ci.mozilla.com/view/Mozilla.org/job/bedrock.dev.win10.firefox/
https://webqa-ci.mozilla.com/view/Mozilla.org/job/bedrock.dev.win10.ie/
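
For context, here is roughly how jobs like these start a Sauce Labs session (a minimal sketch using the Selenium 2-era Python API, not the actual CI configuration; the credentials are placeholders and the dev URL is assumed to be www-dev.allizom.org):

from selenium import webdriver

# Placeholder credentials; the real jobs read these from Jenkins secrets.
SAUCE_USERNAME = "example-user"
SAUCE_ACCESS_KEY = "example-key"

capabilities = {
    "browserName": "internet explorer",  # or "chrome" / "firefox"
    "platform": "Windows 10",
}

# Session setup: this remote call is where the intermittent failures occur.
driver = webdriver.Remote(
    command_executor="http://%s:%s@ondemand.saucelabs.com:80/wd/hub"
    % (SAUCE_USERNAME, SAUCE_ACCESS_KEY),
    desired_capabilities=capabilities,
)
try:
    driver.get("https://www-dev.allizom.org/")  # the dev environment under test
finally:
    driver.quit()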

These test runs seem to be failing intermittently during job setup. I don't think there are issues with the tests or pages themselves. Both stage and prod currently show a nice run of green:

https://webqa-ci.mozilla.com/view/Mozilla.org/job/bedrock.stage/
https://webqa-ci.mozilla.com/view/Mozilla.org/job/bedrock.prod/

The failures on dev often include Squid error page messages in the test results. For example:

https://webqa-ci.mozilla.com/view/Mozilla.org/job/bedrock.dev.win10.ie/159/testReport/junit/tests.functional.firefox.os.version/test_all/test_news_is_displayed_1_3_/

It would be great to identify the issue causing these test failures and get it resolved.
It's worth noting that since 23rd Dec 2015, these failures on dev have suddenly become much more frequent. More often than not, we're seeing a random test fail with a Squid error message in the log.
Version: Production → Development/Staging
These errors are coming from the ZLBs in SCL3, which run Squid.
Assignee: nobody → server-ops-webops
Component: Bedrock → WebOps: Product Delivery
Product: www.mozilla.org → Infrastructure & Operations
QA Contact: smani
Version: Development/Staging → unspecified
Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/2373]
The datacenter proxies (Squid clusters) do not use ZLB, to the best of my knowledge (as previous owner of the Squid clusters).

However, I'm not sure there's anything the Squid admins can do to help you here, either. The HTML error as indicated by one of your links in comment 0 is:

<p>The following error was encountered while trying to retrieve the URL: <a href="http://162.222.75.179/wd/hub/session/2c2c8937fcba4e1cbe10c98df8bfd876/url">http://162.222.75.179/wd/hub/session/2c2c8937fcba4e1cbe10c98df8bfd876/url</a></p>
E
E   <blockquote id="error">
E   <p><b>Zero Sized Reply</b></p>
E   </blockquote>
E
E   <p>Squid did not receive any data for this request.</p>

Which, clarified, is:

The following error was encountered while trying to retrieve the URL:
http://162.222.75.179/wd/hub/session/2c2c8937fcba4e1cbe10c98df8bfd876/url
Zero Sized Reply
Squid did not receive any data for this request.

Which means that the remote server, for whatever reason, did not respond to the request correctly. No other users of the proxy are reporting issues and so I'm relatively certain the issue will be with 162.222.75.179.
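
For illustration, one way a datacenter host could try to reproduce this (a hedged sketch; the proxy FQDN is the one named later in this bug, and the target is the Selenium hub host from the error above):

import requests

# The datacenter Squid proxy and the failing Selenium hub host.
PROXY = "http://proxy.dmz.scl3.mozilla.com:3128/"
URL = "http://162.222.75.179/wd/hub/status"  # hub status endpoint

try:
    resp = requests.get(URL, proxies={"http": PROXY}, timeout=330)
except requests.RequestException as exc:
    print("request failed:", exc)
else:
    if "Zero Sized Reply" in resp.text:
        print("Squid reached the upstream but got no data back")
    else:
        print(resp.status_code, resp.text[:200])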

I'm going to *tentatively* close this as WORKSFORME, but if further evidence arises that points to some sort of response failure by Squid itself, please feel free to reopen and this will go to the correct team for further consideration.
Assignee: server-ops-webops → infra
Component: WebOps: Product Delivery → Infrastructure: Other
QA Contact: smani → jdow
Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/2373]
(In reply to Richard Soderberg [:atoll] from comment #3)
> The datacenter proxies (Squid clusters) do not use ZLB, to the best of my
> knowledge (as previous owner of the Squid clusters).

Thanks for that information. I was under the impression that the ZLB nodes did caching via Squid in addition to load balancing, and was not aware that there was a separate Squid cluster. Is this something that might have changed in the last 3 years, or has this always been the case to your knowledge? Can you provide me with a link to the appropriate documentation for this?

> However, I'm not sure there's anything the Squid admins can do to help you
> here, either. The HTML error as indicated by one of your links in comment 0
> is:
> 
> <p>The following error was encountered while trying to retrieve the URL: <a
> href="http://162.222.75.179/wd/hub/session/2c2c8937fcba4e1cbe10c98df8bfd876/
> url">http://162.222.75.179/wd/hub/session/2c2c8937fcba4e1cbe10c98df8bfd876/
> url</a></p> E    E   <blockquote id="error"> E   <p><b>Zero Sized
> Reply</b></p> E   </blockquote> E    E   <p>Squid did not receive any data
> for this request.</p>
> 
> Which, clarified, is:
> 
> The following error was encountered while trying to retrieve the URL:
> http://162.222.75.179/wd/hub/session/2c2c8937fcba4e1cbe10c98df8bfd876/url
> Zero Sized Reply
> Squid did not receive any data for this request.
> 
> Which means that the remote server, for whatever reason, did not respond to
> the request correctly. No other users of the proxy are reporting issues and
> so I'm relatively certain the issue will be with 162.222.75.179.
> 
> I'm going to *tentatively* close this as WORKSFORME, but if further evidence
> arises that points to some sort of response failure by Squid itself, please
> feel free to reopen and this will go to the correct team for further
> consideration.

The "Zero sized reply" error seems to indicate a timeout waiting on a reply from the upstream server, which could indicate network issues, an issue with the squid node itself, or (most likely) the apache server itself simply taking too long to serve the reply. What is the timeout value in the squid config? Can we get a count of how many times this error has occurred in the past day, week, and/or month?
Flags: needinfo?(rsoderberg)
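
One way to get those counts (a rough sketch, assuming Squid's default native access.log format and that Zero Sized Reply failures are logged as 502s; the path and format may differ on the proxy hosts):

from collections import Counter
from datetime import datetime, timedelta

# Count 502 (Zero Sized Reply) entries in the past day/week/month.
counts = Counter()
now = datetime.now()
with open("/var/log/squid/access.log") as log:
    for line in log:
        fields = line.split()
        # Native format: epoch-timestamp elapsed client code/status ...
        if len(fields) < 4 or not fields[3].endswith("/502"):
            continue
        age = now - datetime.fromtimestamp(float(fields[0]))
        if age <= timedelta(days=1):
            counts["past day"] += 1
        if age <= timedelta(weeks=1):
            counts["past week"] += 1
        if age <= timedelta(days=30):
            counts["past month"] += 1
print(dict(counts))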
(In reply to Richard Soderberg [:atoll] from comment #3)
> Which means that the remote server, for whatever reason, did not respond to
> the request correctly. No other users of the proxy are reporting issues and
> so I'm relatively certain the issue will be with 162.222.75.179.

162.222.75.179 was the public IP that the request was coming from. The error message (in this case) was from proxy2.dmz.scl3.mozilla.com.
Nope, nothing's changed in the past three years. They're all at 'proxy[1-4].dmz.[scl3,phx1].mozilla.com', and there's a DNS round-robin record 'proxy.dmz.[scl3,phx1].mozilla.com' that references them.

The datacenter isn't experiencing network issues, so I'm inclined to think it's not that.

(In reply to Josh Mize [:jgmize] from comment #5)
> (In reply to Richard Soderberg [:atoll] from comment #3)
> > Which means that the remote server, for whatever reason, did not respond to
> > the request correctly. No other users of the proxy are reporting issues and
> > so I'm relatively certain the issue will be with 162.222.75.179.
> 
> 162.222.75.179 was public IP that the request was coming from. The error
> message (in this case) was from proxy2.dmz.scl3.mozilla.com.

To the best of my knowledge our squid cluster (as described above) is *not* publicly available, and in fact is only accessible to datacenter hosts (not even VPN hosts).

That squid error is an HTML page returned *by a Squid instance* indicating that Squid, itself, was unable to connect to the URL with that IP address in it.

So, perhaps it's important to clarify here - which Squid instance are you using? What is your http_proxy/https_proxy variable? What is the FQDN of the host that is having trouble making connections through Squid?
Flags: needinfo?(rsoderberg) → needinfo?(jmize)
Ah, I see the problem now. I misunderstood where the error was actually occurring: I originally thought that the Squid errors were being returned during the requests from the Sauce Labs instance to the bedrock-dev instance, but the errors were actually coming from the Squid instance when Jenkins itself was attempting to communicate with Sauce Labs. There have been some ongoing issues for a while, but Sauce Labs recently had some major issues[0] that were causing this error to appear. Thanks :atoll for the thorough explanations and clearing up my misunderstanding. Since this bug was an "investigation", and I believe the root cause has been identified, I'll mark this as resolved fixed.

[0] https://status.saucelabs.com/incidents/s2f7zl2r4942
Status: NEW → RESOLVED
Closed: 8 years ago
Flags: needinfo?(jmize)
Resolution: --- → FIXED
Fwiw, we're still seeing the same failures when running tests against dev, which makes me believe the Selenium issue linked in Comment 7 is not likely the root cause here, sadly.
(In reply to Alex Gibson [:agibson] from comment #8)
> Fwiw, we're still seeing the same failures when running tests against dev,
> which makes me believe the Selenium issue linked in Comment 7 is not likely
> the root cause here, sadly.

Sorry, I mean Saucelabs issue.
(In reply to Alex Gibson [:agibson] from comment #9)
> (In reply to Alex Gibson [:agibson] from comment #8)
> > Fwiw, we're still seeing the same failures when running tests against dev,
> > which makes me believe the Selenium issue linked in Comment 7 is not likely
> > the root cause here, sadly.
> 
> Sorry, I mean Saucelabs issue.

I believe now that the root cause is still that Squid is timing out while waiting on a reply from the Sauce Labs API. While the major issues they experienced on the 29th may have been "resolved", I believe there are ongoing "minor" intermittent issues causing degraded performance, network timeouts, etc., which manifest as Squid "zero sized reply" errors on the Jenkins instance running in SCL3, and show up as actual network timeout errors in AWS[0], where we do not go through a proxy.

[0] https://ci.us-west.moz.works/view/Bedrock/job/bedrock_test_dev_eu_west/827/console
Oh, also, I would hazard a guess that the Squid timeout is 300 seconds, given that the test fails in 310 seconds.
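
One quick way to check that guess on a proxy host (a hypothetical sketch; the config path may differ, and read_timeout defaults to 15 minutes, so a failure at ~300s would suggest either a locally lowered value or a different timeout such as forward_timeout):

import re

# List the timeout directives actually set in squid.conf.
with open("/etc/squid/squid.conf") as conf:
    for line in conf:
        if re.match(r"\s*[a-z_]*_timeout\b", line):
            print(line.rstrip())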
(In reply to Richard Soderberg [:atoll] from comment #6)
> What is your http_proxy/https_proxy variable?

http://proxy.dmz.scl3.mozilla.com:3128/

> What is the FQDN of the host that is having trouble making connections through Squid?

webqa-ci1.qa.scl3.mozilla.com

Sauce Labs are investigating "an intermittent low-level network issue that is resulting in sporadic and varied failures": https://status.saucelabs.com/incidents/ppdnhfljfmdk
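
Given those values, a minimal end-to-end check from webqa-ci1 might look like this (illustrative only; assumes the requests library and the unauthenticated Sauce Labs REST status endpoint of the time):

import time
import requests

PROXIES = {
    "http": "http://proxy.dmz.scl3.mozilla.com:3128/",
    "https": "http://proxy.dmz.scl3.mozilla.com:3128/",
}

start = time.time()
try:
    # Service-status endpoint from the Sauce Labs REST API.
    resp = requests.get(
        "https://saucelabs.com/rest/v1/info/status",
        proxies=PROXIES,
        timeout=330,  # longer than the suspected ~300s Squid timeout
    )
    print(resp.status_code, resp.text, "in %.1fs" % (time.time() - start))
except requests.RequestException as exc:
    print("failed after %.1fs: %s" % (time.time() - start, exc))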
(In reply to Dave Hunt (:davehunt) from comment #12)
> Sauce Labs are investigating "an intermittent low-level network issue that
> is resulting in sporadic and varied failures":
> https://status.saucelabs.com/incidents/ppdnhfljfmdk

That would be precisely the sort of thing that would cause the issues observed here, unfortunately.