Closed Bug 1828861 Opened 2 years ago Closed 2 years ago

Investigate why 'Secure Connection Failed' but reloading makes it work

Tracking

Status:

RESOLVED FIXED

Milestone:

115 Branch

Tracking Flags:

Tracking

Status

firefox115

---

fixed

People

(Reporter: ckerschb, Assigned: kershaw)

References

(Blocks 1 open bug)

Details

(Whiteboard: [necko-triaged] [necko-priority-queue])

Attachments

(5 files)

astral.png 2 years ago Christoph Kerschbaumer [:ckerschb] 43.13 KB, image/png		Details
torxtrail.png 2 years ago Christoph Kerschbaumer [:ckerschb] 41.06 KB, image/png		Details
page info with this failure 2 years ago Neha Kochar [:neha] 52.97 KB, image/png		Details
Bug 1828861 - Also check Alpn when selecting the next record for retrying, r=#necko 2 years ago Kershaw Chang [:kershaw] 48 bytes, text/x-phabricator-request		Details | Review
Bug 1828861 - Test case, r=#necko 2 years ago Kershaw Chang [:kershaw] 48 bytes, text/x-phabricator-request		Details | Review

Christoph Kerschbaumer [:ckerschb]

Reporter

Description

•

2 years ago

Attached image astral.png — Details

When connecting to astral.sh it seems that in the initial request the 'Secure Connection Failed' appears, but when reloading it works again.

Christoph Kerschbaumer [:ckerschb]

Reporter

Comment 1

•

2 years ago

Attached image torxtrail.png — Details

It seems the same error occurs on other pages as well, e.g. when visiting www.torxtrail.com the error page appears - reloading makes it work again.

Dennis Jackson

Comment 2

•

2 years ago

This is likely a bug in Necko or PSM, since we're missing any specific error code as to why the secure connection failed and those areas are responsible for handling automatic retries.

Some information that would be helpful for tracking down the root cause:

What OS is the reporter running?
What version of Firefox is the reporter running?
Can the reporter reproduce with a different version of Firefox (e.g. Nightly, ESR or Release) or a different browser (e.g. Chrome, Edge, Safari)
Is the reporter using a VPN? Where is the reporter located?
Is the reporter using any extensions which hook into page loads? E.g. an ad blocker? Does the issue persist if the extensions are disabled?
Did the issue appear recently?
Does the issue reproduce 100% of the time?

Eric Rescorla (:ekr)

Comment 3

•

2 years ago

I'm the initial reporter here.

What OS is the reporter running?
What version of Firefox is the reporter running?

Nightly on MacOS

Can the reporter reproduce with a different version of Firefox (e.g. Nightly, ESR or Release) or a different browser (e.g. Chrome, Edge, Safari)

I can't reliably reproduce it even in the same browser. It just happens sporadically, but it's happening often enough now that I don't think it's just a network error.

Is the reporter using a VPN? Where is the reporter located?

The US.

Is the reporter using any extensions which hook into page loads? E.g. an ad blocker? Does the issue persist if the extensions are disabled?

No.

Did the issue appear recently?

Over the past few weeks.

What I would suggest is rather than doing a lot of debugging here, we instead
look to see what code path leads to this particular error page (I imagine there are
several) and surface something more specific to this page, and then I can
report what's going on in more detail. That's probably more productive than
a lot of time spent on guesswork about what's going on and might also help
debug future issues.

Neha Kochar [:neha]

Comment 4

•

2 years ago

•

Edited

I just ran into this too. I'm on a Mac in Canada using Nightly. I'm uploading the page info that seems to indicate a cert validation issue. It resolved with a page reload and does not reproduce with the same site again.

Neha Kochar [:neha]

Comment 5

•

2 years ago

Attached image page info with this failure — Details

Anna Weine

Updated

•

2 years ago

Blocks: 1829092

Anna Weine

Updated

•

2 years ago

No longer blocks: 1829092

John Schanck [:jschanck]

Comment 6

•

2 years ago

Could be a coincidence, but I just ran into this bug after switching from network.trr.mode = 0 to network.trr.mode = 3. I hadn't run into it previously. It happened within an hour of flipping the pref.

BugBot [:suhaib / :marco/ :calixte]

Comment 7

•

2 years ago

The severity field is not set for this bug.
:beurdouche, could you have a look please?

For more information, please visit BugBot documentation.

Flags: needinfo?(bbeurdouche)

Neha Kochar [:neha]

Comment 8

•

2 years ago

This happened again today to me on a new site. FWIW I haven't flipped network.trr.mode. I have default DoH settings and ETP-Strict and CBH enabled on latest Nightly.
It happens sporadically and does not reproduce with the same site in another tab or similarly reloading the existing broken tab.
What are the next steps here? NI'ing Dana also in case she has ideas for any debugging info that can be collected to help once in this situation.

Flags: needinfo?(dkeeler)

Dana Keeler (she/her) (use needinfo) [:keeler]

Comment 9

•

2 years ago

I suspect this is due to a transient network issue. What we could do is attempt to reproduce the issue with a TLS server that deliberately introduces errors after the handshake has completed. Once we can reproduce the issue, we can decide how Firefox should handle such situations.

Flags: needinfo?(dkeeler)

Dennis Jackson

Comment 10

•

2 years ago

Sorry for not updating this proactively. Here's a quick summary:

For this bug specifically, my current suspicion is that it's due to an issue in the Necko or Neqo code for handling ECH+QUIC connections. This is because the impacted sites are all hosted on Cloudflare and have ECH records, plus the users reporting the issue are all using Nightly and located in regions where DoH is enabled by default. If you'd like to help confirm this issue, you can take the following steps:

When you see an impacted page load, run dig {domain-name.com} NS in your local terminal and check whether the response contains a Cloudflare entry. If my hypothesis is correct, it should.
Check if setting network.dns.http3_echconfig.enabled to false stops the issue from happening.
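The NS check in the first step can be scripted. The following is a minimal sketch, not an official tool. It assumes that Cloudflare-assigned nameservers have names ending in .ns.cloudflare.com, and the helper names cloudflare_nameservers and query_ns are hypothetical. query_ns shells out to dig, so it needs dig installed and network access.

```python
import subprocess

# Assumption: Cloudflare-assigned nameservers end with this suffix.
CLOUDFLARE_NS_SUFFIX = ".ns.cloudflare.com"


def cloudflare_nameservers(ns_names):
    """Return the NS names that look like Cloudflare-operated nameservers."""
    hits = []
    for name in ns_names:
        # dig prints fully qualified names with a trailing dot; DNS names
        # are case-insensitive, so normalize before comparing.
        normalized = name.strip().rstrip(".").lower()
        if normalized.endswith(CLOUDFLARE_NS_SUFFIX):
            hits.append(normalized)
    return hits


def query_ns(domain):
    """Run `dig +short <domain> NS` and return one nameserver per entry."""
    result = subprocess.run(
        ["dig", "+short", domain, "NS"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.split()
```

For an impacted site, cloudflare_nameservers(query_ns("astral.sh")) returning a non-empty list would be consistent with the hypothesis above. An empty list would not rule it out, since a site can sit behind Cloudflare without using its nameservers.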

I will need to spend a bit of time sitting down and reproducing this locally to be able to fix it, or this could be passed to the Necko team who own the relevant code if they have capacity. In the longer term, we have two identified steps for improving the situation:

Improve the testing of error conditions in Necko. Our error handling in Necko is currently under-tested and we need proper CI tests to ensure we correctly retry when we have these kinds of errors.
Enable support for including PSM, Neqo and NSS logs when using about:logging.

Flags: needinfo?(bbeurdouche)

Dennis Jackson

Updated

•

2 years ago

Comment 11

•

2 years ago

I spent some time reproducing this issue. There are a few different things going on.

Reproduction Instructions

On a network with IPv6 disabled:

Use the latest Nightly with DoH enabled and default ech / http3 prefs (echconfig.enabled, http3_echconfig.enabled should be true, fallback_to_origin_when_all_failed should be false).
Connect to https://cloudflare-quic.com/cdn-cgi/trace or https://cloudflare-http3.com/cdn-cgi/trace
The initial connection should load with http=http/3 and sni=encrypted.
Do a force refresh with Ctrl-Shift-R or similar.
You should now see Secure Connection Failed with no further details. Click refresh.
You should now see http=http/2 and sni=encrypted.
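To compare the page contents across these steps without reading them by eye, the trace output can be parsed. This is a rough sketch that assumes the trace body is plain text with one key=value pair per line, as the http= and sni= values quoted in the steps suggest. The function names are hypothetical.

```python
def parse_trace(body):
    """Parse a key=value-per-line trace body into a dict."""
    fields = {}
    for line in body.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip lines that contain no '='
            fields[key.strip()] = value.strip()
    return fields


def summarize(fields):
    """Return the (protocol, SNI state) pair that the steps above compare."""
    return (fields.get("http", "unknown"), fields.get("sni", "unknown"))
```

With this, the first load in the steps corresponds to summarize(...) returning ("http/3", "encrypted"), and the load after the error page to ("http/2", "encrypted").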

Observations

From testing with Wireshark, we don't even produce any packets for the second QUIC connection after the force refresh, so there's no crypto problem or compatibility issue with Cloudflare; we just don't get that far. Kershaw pointed me to bug 1816677, which notes that we don't have retry logic for a QUIC IPv6 connection. However, we still shouldn't be showing an error page in this circumstance; we should be falling back to http/2 with an encrypted sni. Let's call this Issue 1: we aren't automatically falling back to ECH+http/2 correctly when ECH+http/3 is enabled and fallback_to_origin is false. Instead, we're displaying an error page and requiring the user to refresh.

If I enable IPv6 on my local network, then the initial connection works with http/3 and ECH. Force refreshing triggers a second http/3+ECH handshake which succeeds, but still results in the connection migrating to http/2 and ECH. This is Issue 2: we seem to be inappropriately disabling HTTP3+ECH after tearing down the connection, even if it was successful.

If I enable IPv6 on my local network and also change fallback_to_origin_when_all_failed to true, then the initial connection does not use ECH at all, just plain http/2. Force refreshing upgrades to http/2 + ECH. Disabling ECH+http/3 correctly results in Necko using http/2+ECH on the first attempt. This is Issue 3: setting fallback_to_origin_when_all_failed with ECH+http/3 enabled means ECH+http/3 doesn't get attempted at all and ECH isn't used for the first http/2 connection.

As none of these issues involve the crypto code or NSS, I'm moving this bug into Networking.

Assignee: nobody → nobody

Component: Libraries → Networking

Product: NSS → Core

Version: trunk → Firefox 113

Kershaw Chang [:kershaw]

Assignee

Updated

•

2 years ago

Severity: -- → S3

Priority: -- → P2

Whiteboard: [necko-triaged] [necko-priority-next]

Kershaw Chang [:kershaw]

Assignee

Updated

•

2 years ago

Priority: P2 → P1

Whiteboard: [necko-triaged] [necko-priority-next] → [necko-triaged] [necko-priority-queue]

Kershaw Chang [:kershaw]

Assignee

Updated

•

2 years ago

Assignee: nobody → kershaw

Kershaw Chang [:kershaw]

Assignee

Comment 12

•

2 years ago

Thanks for the detailed analysis, it's really helpful.

Observations

From testing with Wireshark, we don't even produce any packets for the second QUIC connection after the force refresh, so there's no crypto problem or compatibility issue with Cloudflare; we just don't get that far. Kershaw pointed me to bug 1816677, which notes that we don't have retry logic for a QUIC IPv6 connection. However, we still shouldn't be showing an error page in this circumstance; we should be falling back to http/2 with an encrypted sni. Let's call this Issue 1: we aren't automatically falling back to ECH+http/2 correctly when ECH+http/3 is enabled and fallback_to_origin is false. Instead, we're displaying an error page and requiring the user to refresh.

For Issue 1, the problem is that we only compare the domain here when finding the next record for retrying.
We should also check the ALPN that was used, to make sure we fall back to h2 correctly.
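To illustrate why comparing only the domain can leave nothing to retry, here is a simplified model. It is not the Necko code. It assumes, as one reading of the comment above, that the h3 and h2 candidates share the same target name, so a domain-only comparison treats the h2 candidate as already tried. SvcRecord, next_record and the compare_alpn switch are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class SvcRecord:
    name: str  # target name of the candidate record
    alpn: str  # protocol the candidate would use, e.g. "h3" or "h2"


def next_record(
    records: Sequence[SvcRecord],
    failed: SvcRecord,
    compare_alpn: bool,
) -> Optional[SvcRecord]:
    """Return the first candidate that does not count as already tried."""
    for record in records:
        same_name = record.name == failed.name
        same_alpn = record.alpn == failed.alpn
        # Domain-only comparison: any record with the failed name is skipped.
        # With the ALPN check: only the exact (name, alpn) pair is skipped.
        already_tried = same_name and (same_alpn or not compare_alpn)
        if not already_tried:
            return record
    return None  # nothing left to retry
```

In this model, with candidates ("example.com", "h3") and ("example.com", "h2") and a failed h3 attempt, the domain-only comparison returns None, which would surface as an error page when falling back to the origin is disabled. With the ALPN check, the h2 candidate is returned.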

If I enable IPv6 on my local network, then the initial connection works with http/3 and ECH. Force refreshing triggers a second http/3+ECH handshake which succeeds, but still results in the connection migrating to http/2 and ECH. This is Issue 2: we seem to be inappropriately disabling HTTP3+ECH after tearing down the connection, even if it was successful.

For issue 2, I think this is a timing issue. The following log shows that it took around 1s to close the HTTP/3 connection completely. During this 1s, creating a new HTTP/3 connection is not allowed. In the end, the fast-fallback HTTP/2 connection won, so the second HTTP request used HTTP/2.

2023-05-30 10:00:56.068892 UTC - [Parent 45620: Socket Thread]: V/nsHttp ConnectionEntry::ClosePersistentConnections [ci=.S........[tlsflags0x00000000]cloudflare-quic.com:443 <ROUTE-via cloudflare-quic.com:443> {NPN-TOKEN h3}^partitionKey=%28https%2Ccloudflare-quic.com%29]
2023-05-30 10:00:56.068966 UTC - [Parent 45620: Socket Thread]: V/nsHttp HttpConnectionUDP::DontReuse 13db14a00 http3session=13da8eb00
2023-05-30 10:00:57.115124 UTC - [Parent 45620: Socket Thread]: V/nsHttp nsHttpConnectionMgr::OnMsgReclaimConnection [ent=1498d0300 conn=13db14a00]
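The "around 1s" figure can be checked from the timestamps in these log lines. This is a small sketch that assumes every line starts with a timestamp in the form shown, followed by " UTC". The helper names are hypothetical.

```python
from datetime import datetime


def parse_log_time(line):
    """Extract the leading 'YYYY-MM-DD HH:MM:SS.ffffff' timestamp of a log line."""
    stamp = line.split(" UTC", 1)[0]
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S.%f")


def elapsed_seconds(first_line, last_line):
    """Seconds between the timestamps of two log lines."""
    delta = parse_log_time(last_line) - parse_log_time(first_line)
    return delta.total_seconds()
```

Applied to the ClosePersistentConnections and OnMsgReclaimConnection lines, the difference between 10:00:56.068892 and 10:00:57.115124 is 1.046232 seconds.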

If I enable IPv6 on my local network and also change fallback_to_origin_when_all_failed to true, then the initial connection does not use ECH at all, just plain http/2. Force refreshing upgrades to http/2 + ECH. Disabling ECH+http/3 correctly results in Necko using http/2+ECH on the first attempt. This is Issue 3: setting fallback_to_origin_when_all_failed with ECH+http/3 enabled means ECH+http/3 doesn't get attempted at all and ECH isn't used for the first http/2 connection.

I can't reproduce issue 3 locally. Note that the pref fallback_to_origin_when_all_failed is only used for retry logic. It should not affect the initial connection.
Dennis, if you can reproduce this, could you send me an HTTP log? Thanks.

Flags: needinfo?(djackson)

Kershaw Chang [:kershaw]

Assignee

Comment 13

•

2 years ago

Attached file Bug 1828861 - Also check Alpn when selecting the next record for retrying, r=#necko — Details

Dennis Jackson

Comment 14

•

2 years ago

Thanks for the quick investigation and patch Kershaw!

I tested your patch and verified that it fixed Issue 1. I no longer see an error page, instead the connection switches over to http/2 with ECH. When I shift-refreshed again a few more times, the connection remained sticky on http/2 with ECH. In combination with what you said regarding Issue 2 - I guess that means that shift-refreshing will force a site to exclude http/3 but perhaps that's not a bad thing for the purposes of reliability. I'm fine with leaving this and issue 2 as-is.

I spent some time trying to reproduce issue 3 today with various operating systems, prefs and networking configurations but wasn't able to. The only variable I can think of that I couldn't control for is whether Cloudflare's HTTPS RR has changed order or format since last week, but that seems unlikely. I'll try again over the next few days just in case I can reproduce, but hopefully this was just some transient error or mistake on my part.

Flags: needinfo?(djackson)

Dennis Jackson

Updated

•

2 years ago

Blocks: ech

Kershaw Chang [:kershaw]

Assignee

Comment 15

•

2 years ago

Attached file Bug 1828861 - Test case, r=#necko — Details

Pulsebot

Comment 16

•

2 years ago

Pushed by kjang@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/a182d2375f8f
Also check Alpn when selecting the next record for retrying, r=necko-reviewers,valentin
https://hg.mozilla.org/integration/autoland/rev/513ac812d28f
Test case, r=necko-reviewers,valentin

Sandor Molnar[:smolnar]

Comment 17

•

2 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/a182d2375f8f
https://hg.mozilla.org/mozilla-central/rev/513ac812d28f

Status: NEW → RESOLVED

Closed: 2 years ago

status-firefox115: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → 115 Branch
