Closed Bug 1459910 Opened 6 years ago Closed 5 years ago

Intermittent testing/firefox-ui/tests/functional/security/test_ssl_disabled_error_page.py TestSSLDisabledErrorPage.test_ssl_disabled_error_page | AssertionError: u'The connection has timed out' != u'Secure Connection Failed'

Categories

(Testing :: Firefox UI Tests, defect, P5)

Version 3
defect

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: intermittent-bug-filer, Unassigned)

References

Details

(Keywords: intermittent-failure, Whiteboard: [retriggered][stockwell unknown])

Attachments

(1 file)

There was a network timeout trying to reach `tls-v1-0.badssl.com`:

01:32:33     INFO -  1525768353943	Marionette	DEBUG	Received DOM event DOMContentLoaded for about:neterror?e=netTimeout&u=https%3A//tls-v1-0.badssl.com%3A1010/&c=UTF-8&f=regular&d=The%20server%20at%20tls-v1-0.badssl.com%20is%20taking%20too%20long%20to%20respond.
01:32:33     INFO -  1525768353958	Marionette	TRACE	16 <- [1,59,{"error":"unknown error","message":"Reached error page: about:neterror?e=netTimeout&u=https%3A//tls-v1-0.badssl.com%3A1 ... :565:3\nregisterSelf@chrome://marionette/content/listener.js:465:5\n@chrome://marionette/content/listener.js:1697:1\n"},null]
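
For triage it can help to pull the error code out of the `about:neterror` URL rather than compare page titles. The following is a minimal sketch, not part of the test harness: the helper name `parse_neterror` is hypothetical, and reading the `e` parameter as the error code is an assumption based on the log line above.

```python
from urllib.parse import parse_qs


def parse_neterror(url):
    """Split an about:neterror URL into its decoded query parameters.

    Hypothetical helper for log triage; not part of Marionette.
    """
    prefix, _, query = url.partition("?")
    if prefix != "about:neterror":
        raise ValueError("not an about:neterror URL: %r" % url)
    # parse_qs percent-decodes values and returns lists; keep the first value.
    return {key: values[0] for key, values in parse_qs(query).items()}


if __name__ == "__main__":
    logged = (
        "about:neterror?e=netTimeout&u=https%3A//tls-v1-0.badssl.com%3A1010/"
        "&c=UTF-8&f=regular"
    )
    info = parse_neterror(logged)
    print(info["e"], info["u"])
```

A check on `info["e"]` would separate a network timeout (`netTimeout`) from the TLS error page the test expects, without depending on the localized title string.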

Not sure how often this happens yet, but it is worth keeping an eye on before we ask for an uplift of my patch on bug 1414776.
Richard, it looks like the network availability for badssl.com is kinda bad, seeing 16 failures from yesterday alone. Can we figure out what has caused those timeouts?
Flags: needinfo?(atoll)
Eliza, this is all already known. Please read the comments already made on this bug.
Thanks, Eliza, for looking into this. I am glad that :whimboo is already working on it and that we have clear data showing the root cause, plus confirmation from the test owner :)
Flags: needinfo?(jmaher)
(In reply to Henrik Skupin (:whimboo) from comment #3)
> Richard, it looks like the network availability for badssl.com is kinda bad,
> seeing 16 failures from yesterday alone. Can we figure out what has caused
> those timeouts?

BadSSL is hosted by Google. April is our liaison. I'm going to redirect your question to her. However, since this is causing failures in pushes, you should consider reverting first, and then work out a way forward given the intermittency issues of BadSSL.
Flags: needinfo?(atoll) → needinfo?(april)
I asked sheriffs to backout my patch for now. We can re-land once connections to tls-v1-0.badssl.com don't fail that often anymore.
Summary: Intermittent testing/firefox-ui/tests/functional/security/test_ssl_disabled_error_page.py TestSSLDisabledErrorPage.test_ssl_disabled_error_page | AssertionError: u'The connection has timed out' != u'Secure Connection Failed' → Intermittent testing/firefox-ui/tests/functional/security/test_ssl_disabled_error_page.py TestSSLDisabledErrorPage.test_ssl_disabled_error_page | AssertionError: u'The connection has timed out' != u'Secure Connection Failed' (tls-v1-0.badssl.com)
I'm going to add Chris Thompson to this bug, as he is the Google person who handles the hosting (largely).

Do we know if it's just one specific BadSSL domain that is causing this problem, or if it's all of them randomly?

I almost never get bugs about BadSSL downtime (since it's just a simple nginx server), but I'm happy to look into it.
Flags: needinfo?(april) → needinfo?(chris.j.thompson)
It looks like it's just the TLS 1.0 website from the logs, so I've set up a monitor on https://tls-v1-0.badssl.com:1010/, which should check it every five minutes and shoot me an email if there is any notable downtime. Hopefully this'll help me learn whether it's a problem in the build system or a problem on BadSSL.
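
A rough sketch of one such check, for illustration only (this is not the monitor actually used). It assumes a plain TCP connect is enough to tell "server unreachable" apart from "server reachable but the TLS handshake fails", which is the distinction this bug hinges on. The names `probe` and `summarize` and the injectable `connect` parameter are hypothetical.

```python
import socket


def probe(host, port, timeout=10.0, connect=socket.create_connection):
    """Classify one TCP connection attempt to host:port.

    `connect` is injectable so the classification can be exercised
    without touching the network (an assumption made for this sketch).
    """
    try:
        conn = connect((host, port), timeout=timeout)
    except (socket.timeout, TimeoutError):
        return "timeout"   # what the failing test runs appear to hit
    except ConnectionRefusedError:
        return "refused"
    except OSError:
        return "error"     # DNS failure, unreachable network, etc.
    conn.close()
    return "ok"


def summarize(results):
    """Count how often each outcome occurred."""
    counts = {}
    for outcome in results:
        counts[outcome] = counts.get(outcome, 0) + 1
    return counts


if __name__ == "__main__":
    # Single probe; scheduling every five minutes is left to cron or similar.
    print(probe("tls-v1-0.badssl.com", 1010))
```

Running the same probe against port 1011 (the TLS 1.1 site) from the same machine would show whether timeouts are specific to one endpoint.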
Attached image screenshot.png
The problem largely happens on macOS, from what I can see in the intermittent-failures viewer:

https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?startday=2018-05-03&endday=2018-05-10&tree=trunk&bug=1459910

There are fewer instances on Linux and only a single one on Windows. I attached a screenshot, which Marionette creates automatically on failure.
That really leads me to think that the problem might lie outside of BadSSL; TLS 1.1 is served almost identically, from the same nginx instance, and:

> ssl_protocols TLSv1;

> ssl_protocols TLSv1.1;

is the only difference between the two, other than the port. If BadSSL.com were down, you should be seeing errors for both of these equally, on all platforms.
Looking at the error logs on badssl, I don't see a ton of handshake related errors. Obviously if there are timeouts, I wouldn't necessarily see them, but there aren't _that_ many errors in the nginx error log. Here are the systems I see handshake errors from:

root@badssl-zesty:/var/log/nginx# for i in `gunzip -c error.log.3.gz error.log.2.gz | grep handshaking | grep -v list | grep "05/08" | cut -d ' ' -f 17 | cut -d ',' -f 1 | sort -u | grep -v ':'`; do host $i; done
Host 77.202.130.104.in-addr.arpa. not found: 3(NXDOMAIN)
99.93.172.163.in-addr.arpa domain name pointer 163-172-93-99.rev.poneytelecom.eu.
139.97.172.163.in-addr.arpa domain name pointer 163-172-97-139.rev.poneytelecom.eu.
Host 126.107.99.167.in-addr.arpa. not found: 3(NXDOMAIN)
120.28.165.188.in-addr.arpa domain name pointer ip120.ip-188-165-28.eu.
21.116.184.35.in-addr.arpa domain name pointer 21.116.184.35.bc.googleusercontent.com.
166.137.80.52.in-addr.arpa domain name pointer ec2-52-80-137-166.cn-north-1.compute.amazonaws.com.cn.
20.73.80.52.in-addr.arpa domain name pointer ec2-52-80-73-20.cn-north-1.compute.amazonaws.com.cn.
105.200.41.64.in-addr.arpa domain name pointer www.ssllabs.com.
106.200.41.64.in-addr.arpa domain name pointer www.ssllabs.com.
107.200.41.64.in-addr.arpa domain name pointer www.ssllabs.com.
108.200.41.64.in-addr.arpa domain name pointer www.ssllabs.com.
Host 106.137.71.64.in-addr.arpa. not found: 3(NXDOMAIN)

Not a single one on 05/08 from our AWS region.

There are some that could be us from 05/07:

root@badssl-zesty:/var/log/nginx# for i in `gunzip -c error.log.3.gz error.log.2.gz | grep handshaking | grep -v list | grep "05/07" | cut -d ' ' -f 17 | cut -d ',' -f 1 | sort -u | grep -v ':'`; do host $i; done
178.63.9.176.in-addr.arpa domain name pointer static.178.63.9.176.clients.your-server.de.
120.28.165.188.in-addr.arpa domain name pointer ip120.ip-188-165-28.eu.
198.56.233.34.in-addr.arpa domain name pointer ec2-34-233-56-198.compute-1.amazonaws.com.
28.55.3.52.in-addr.arpa domain name pointer ec2-52-3-55-28.compute-1.amazonaws.com.
117.185.45.52.in-addr.arpa domain name pointer ec2-52-45-185-117.compute-1.amazonaws.com.
11.31.54.52.in-addr.arpa domain name pointer ec2-52-54-31-11.compute-1.amazonaws.com.
166.137.80.52.in-addr.arpa domain name pointer ec2-52-80-137-166.cn-north-1.compute.amazonaws.com.cn.
20.73.80.52.in-addr.arpa domain name pointer ec2-52-80-73-20.cn-north-1.compute.amazonaws.com.cn.
105.200.41.64.in-addr.arpa domain name pointer www.ssllabs.com.
107.200.41.64.in-addr.arpa domain name pointer www.ssllabs.com.
108.200.41.64.in-addr.arpa domain name pointer www.ssllabs.com.

But overall that isn't much. The last nginx restart was on 05/01.
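
The shell pipeline above can be approximated in Python as follows. This is a sketch under assumptions: it takes the client address from the `client: ...` field that nginx error-log entries normally carry instead of using a fixed column number, and the function names `handshake_clients` and `reverse_lookup` are hypothetical. Like the original, it keeps only lines mentioning "handshaking", drops lines containing "list", matches the day as a substring, and skips IPv6 addresses.

```python
import re
import socket

# Assumed log shape: "... while SSL handshaking, client: 1.2.3.4, server: ..."
CLIENT_RE = re.compile(r"client: ([^,\s]+)")


def handshake_clients(lines, day):
    """Return sorted unique IPv4 clients with handshake errors on `day`."""
    clients = set()
    for line in lines:
        if "handshaking" not in line or "list" in line or day not in line:
            continue
        match = CLIENT_RE.search(line)
        # Skip IPv6 addresses, as `grep -v ':'` does in the pipeline.
        if match and ":" not in match.group(1):
            clients.add(match.group(1))
    return sorted(clients)


def reverse_lookup(ip):
    """Reverse-resolve an address; return None when there is no PTR record."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except (socket.herror, socket.gaierror):
        return None
```

Feeding it the decompressed contents of `error.log.2.gz` and `error.log.3.gz` with `day="05/08"` and passing each result to `reverse_lookup` should give a list comparable to the `host` output above.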
Hm, so the test disables all of SSL 3.0, TLS 1.0 and TLS 1.1. Given that we have problems with the TLS 1.0 subdomain, are there instances for SSL3 and TLS 1.1 too? Maybe we should switch and see if those work. Btw, I was able to reproduce the problem locally on my macOS machine when running the test hundreds of times. But so far it has happened only that one time, and in headless mode, so I was not able to interact with the browser while it was happening. Anyway, I would exclude a DNS problem in automation.
There is a TLS 1.1 site:

https://tls-v1-1.badssl.com:1011/

Although we don't have an SSLv3 site anymore, I believe, because nothing supports it.
I believe the SSLv3 site still works (although I don't think you can test it with any modern browser), and the current plan is to maintain it if we have users relying on it. We might need to migrate it to a legacy server sometime, but that will likely be an internal implementation detail.
Flags: needinfo?(chris.j.thompson)
Whiteboard: [retriggered] → [retriggered][stockwell needswork]
I've been testing this more over the last week. I've found that if I sit in Firefox and refresh the page constantly, occasionally a stylesheet will fail to load because of a connection issue.

I haven't been able to replicate this in Chrome or Safari, nor have my downtime monitors caught it at all. Is it possible that this is a bug in NSS? I only seem to be able to trigger it in Firefox specifically.
Whiteboard: [retriggered][stockwell needswork] → [retriggered][stockwell unknown]
Btw, this was fixed by the backout of my patch on bug 1414776. Let's continue the discussion on that other bug.
Assignee: nobody → hskupin
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla62
I am reopening this bug given that the failure started to happen again with the landing of my patch on bug 1414776.

If we cannot fix it we may have to skip the test for now.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Summary: Intermittent testing/firefox-ui/tests/functional/security/test_ssl_disabled_error_page.py TestSSLDisabledErrorPage.test_ssl_disabled_error_page | AssertionError: u'The connection has timed out' != u'Secure Connection Failed' (tls-v1-0.badssl.com) → Intermittent testing/firefox-ui/tests/functional/security/test_ssl_disabled_error_page.py TestSSLDisabledErrorPage.test_ssl_disabled_error_page | AssertionError: u'The connection has timed out' != u'Secure Connection Failed'
Assignee: hskupin → nobody
Status: REOPENED → NEW

Looks like there have been no more failures in the last 9 months. Let's close the bug.

Status: NEW → RESOLVED
Closed: 6 years ago → 5 years ago
Resolution: --- → WORKSFORME
Target Milestone: mozilla62 → ---