Closed Bug 1828861 Opened 1 year ago Closed 1 year ago

Investigate why 'Secure Connection Failed' but reloading makes it work

Categories

(Core :: Networking, defect, P1)

Firefox 113
defect

Tracking

()

RESOLVED FIXED
115 Branch
Tracking Status
firefox115 --- fixed

People

(Reporter: ckerschb, Assigned: kershaw)

References

(Blocks 1 open bug)

Details

(Whiteboard: [necko-triaged] [necko-priority-queue])

Attachments

(5 files)

Attached image astral.png

When connecting to astral.sh it seems that in the initial request the 'Secure Connection Failed' appears, but when reloading it works again.

Attached image torxtrail.png

It seems the same error occurs on other pages as well, e.g. when visiting www.torxtrail.com the error page appears - reloading makes it work again.

This is likely a bug in Necko or PSM, since we're missing any specific error code as to why the secure connection failed and those areas are responsible for handling automatic retries.

Some information that would be helpful for tracking down the root cause:

  • What OS is the reporter running?
  • What version of Firefox is the reporter running?
  • Can the reporter reproduce with a different version of Firefox (e.g. Nightly, ESR or Release) or a different browser (e.g. Chrome, Edge, Safari)?
  • Is the reporter using a VPN? Where is the reporter located?
  • Is the reporter using any extensions which hook into page loads? E.g. an ad blocker? Does the issue persist if the extensions are disabled?
  • Did the issue appear recently?
  • Does the issue reproduce 100% of the time?

I'm the initial reporter here.

What OS is the reporter running?
What version of Firefox is the reporter running?

Nightly on macOS

Can the reporter reproduce with a different version of Firefox (e.g. Nightly, ESR or Release) or a different browser (e.g. Chrome, Edge, Safari)

I can't reliably reproduce it even in the same browser. It just happens sporadically, but it's happening often enough now that I don't think it's just a network error.

Is the reporter using a VPN? Where is the reporter located?

The US.

Is the reporter using any extensions which hook into page loads? E.g. an ad blocker? Does the issue persist if the extensions are disabled?

No.

Did the issue appear recently?

Over the past few weeks.

What I would suggest is rather than doing a lot of debugging here, we instead
look to see what code path leads to this particular error page (I imagine there are
several) and surface something more specific to this page, and then I can
report what's going on in more detail. That's probably more productive than
a lot of time spent on guesswork about what's going on and might also help
debug future issues.

I just ran into this too. I'm on a Mac in Canada using Nightly. I'm uploading the page info that seems to indicate a cert validation issue. It resolved with a page reload and does not reproduce with the same site again.

Blocks: 1829092
No longer blocks: 1829092

Could be a coincidence, but I just ran into this bug after switching from network.trr.mode = 0 to network.trr.mode = 3. I hadn't run into it previously. It happened within an hour of flipping the pref.

The severity field is not set for this bug.
:beurdouche, could you have a look please?

For more information, please visit BugBot documentation.

Flags: needinfo?(bbeurdouche)

This happened again today to me on a new site. FWIW I haven't flipped network.trr.mode. I have default DoH settings and ETP-Strict and CBH enabled on latest Nightly.
It happens sporadically and does not reproduce when loading the same site in another tab or when reloading the existing broken tab.
What are the next steps here? NI'ing Dana also in case she has ideas for any debugging info that can be collected to help once in this situation.

Flags: needinfo?(dkeeler)

I suspect this is due to a transient network issue. What we could do is attempt to reproduce the issue with a TLS server that deliberately introduces errors after the handshake has completed. Once we can reproduce the issue, we can decide how Firefox should handle such situations.

Flags: needinfo?(dkeeler)

Sorry for not updating this proactively. Here's a quick summary:

For this bug specifically, my current suspicion is that it's due to an issue in the Necko or Neqo code for handling ECH+QUIC connections. This is because the impacted sites are all hosted on Cloudflare and have ECH records, plus the users reporting the issue are all using Nightly and located in regions where DoH is enabled by default. If you'd like to help confirm this issue, you can take the following steps:

  • When you see an impacted page load, run this command in your local terminal: dig {domain-name.com} NS. Then check whether the response contains a Cloudflare entry. If my hypothesis is correct, it should.
  • Check if setting network.dns.http3_echconfig.enabled to false stops the issue from happening.
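
The first check above can be scripted. The following is a minimal sketch, assuming that dig is installed and that Cloudflare-hosted zones use name servers under cloudflare.com; the helper names is_cloudflare_ns and lookup_ns are hypothetical and not part of any Firefox tooling.

```python
import subprocess


def is_cloudflare_ns(ns_names):
    """Return True if any name-server hostname is under cloudflare.com.

    Assumption: Cloudflare-hosted zones list name servers whose hostnames
    end in ".cloudflare.com".
    """
    for name in ns_names:
        host = name.strip().rstrip(".").lower()
        if host == "cloudflare.com" or host.endswith(".cloudflare.com"):
            return True
    return False


def lookup_ns(domain):
    """Run `dig +short <domain> NS` and return the non-empty answer lines.

    Requires the dig binary on PATH and a working network connection.
    """
    result = subprocess.run(
        ["dig", "+short", domain, "NS"],
        capture_output=True,
        text=True,
        check=True,
    )
    return [line for line in result.stdout.splitlines() if line.strip()]
```

Calling is_cloudflare_ns(lookup_ns("astral.sh")) right after seeing the error page would give a quick yes/no answer for the hypothesis, without reading the dig output by hand.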

I will need to spend a bit of time sitting down and reproducing this locally to be able to fix it, or this could be passed to the Necko team who own the relevant code if they have capacity. In the longer term, we have two identified steps for improving the situation:

  • Improve the testing of error conditions in Necko. Our error handling in Necko is currently under-tested and we need proper CI tests to ensure we correctly retry when we have these kinds of errors.
  • Enable support for including PSM, Neqo and NSS logs when using about:logging.
Flags: needinfo?(bbeurdouche)
See Also: → 1833592

I spent some time reproducing this issue. There are a few different things going on.

Reproduction Instructions

On a network with IPv6 disabled:

  1. Use the latest Nightly with DoH enabled and default ech / http3 prefs (echconfig.enabled, http3_echconfig.enabled should be true, fallback_to_origin_when_all_failed should be false).
  2. Connect to https://cloudflare-quic.com/cdn-cgi/trace or https://cloudflare-http3.com/cdn-cgi/trace
  3. The initial connection should load with http=http/3 and sni=encrypted.
  4. Do a force refresh with Ctrl-Shift-R or similar.
  5. You should now see Secure Connection Failed with no further details. Click refresh.
  6. You should now see http=http/2 and sni=encrypted.
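
To make steps 3 and 6 easier to compare across reloads, the trace page body can be parsed mechanically. This is a minimal sketch, assuming the page body is plain text with one key=value pair per line (which matches the http= and sni= fields mentioned above); parse_trace and describe_connection are hypothetical helper names.

```python
def parse_trace(body):
    """Parse key=value lines into a dict, skipping lines without '='."""
    fields = {}
    for line in body.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def describe_connection(body):
    """Return the (http, sni) pair from a trace body, or 'unknown' if absent."""
    fields = parse_trace(body)
    return fields.get("http", "unknown"), fields.get("sni", "unknown")
```

With the behaviour described in the steps, the body from step 3 would yield ("http/3", "encrypted") and the body from step 6 would yield ("http/2", "encrypted").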

Observations

From testing with Wireshark, we don't even produce any packets for the second QUIC connection after the force refresh, so there's no crypto problem or compatibility issue with Cloudflare; we just don't get that far. Kershaw pointed me to bug 1816677, which notes that we don't have retry logic for a QUIC IPv6 connection. However, we still shouldn't be showing an error page in this circumstance; we should be falling back to http/2 with an encrypted SNI. Let's call this Issue 1: we aren't automatically falling back to ECH+http/2 correctly when ECH+http/3 is enabled and fallback_to_origin is false. Instead, we're displaying an error page and requiring the user to refresh.

If I enable IPv6 on my local network, then the initial connection works with http/3 and ECH. Force refreshing triggers a second http/3+ECH handshake which succeeds, but still results in the connection migrating to http/2 and ECH. This is Issue 2: we seem to be inappropriately disabling HTTP3+ECH after tearing down the connection, even if it was successful.

If I enable IPv6 on my local network and also change fallback_to_origin_when_all_failed to true, then the initial connection does not use ECH at all, just plain http/2. Force refreshing upgrades to http/2 + ECH. Disabling ECH+http/3 correctly results in Necko using http/2+ECH on the first attempt. This is Issue 3: setting fallback_to_origin_when_all_failed with ECH+http/3 enabled means ECH+http/3 doesn't get attempted at all and ECH isn't used for the first http/2 connection.

As none of these issues involve the crypto code or NSS, I'm moving this bug into Networking.

Assignee: nobody → nobody
Component: Libraries → Networking
Product: NSS → Core
Version: trunk → Firefox 113
Severity: -- → S3
Priority: -- → P2
Whiteboard: [necko-triaged] [necko-priority-next]
Priority: P2 → P1
Whiteboard: [necko-triaged] [necko-priority-next] → [necko-triaged] [necko-priority-queue]
Assignee: nobody → kershaw

Thanks for the detailed analysis, it's really helpful.

Observations

From testing with Wireshark, we don't even produce any packets for the second QUIC connection after the force refresh, so there's no crypto problem or compatibility issue with Cloudflare; we just don't get that far. Kershaw pointed me to bug 1816677, which notes that we don't have retry logic for a QUIC IPv6 connection. However, we still shouldn't be showing an error page in this circumstance; we should be falling back to http/2 with an encrypted SNI. Let's call this Issue 1: we aren't automatically falling back to ECH+http/2 correctly when ECH+http/3 is enabled and fallback_to_origin is false. Instead, we're displaying an error page and requiring the user to refresh.

For Issue 1, the problem is that we only compare the domain here when finding the next record for retrying.
We should also check the ALPN that was used, to make sure we fall back to h2 correctly.
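
To illustrate why comparing only the domain blocks the fallback, here is a simplified model of the selection step. It is a sketch, not Necko's actual code: the Candidate type, the next_candidate function, and the compare_alpn flag are all hypothetical, and it assumes the h3 and h2 options share the same target name.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    """One (target name, ALPN) option for connecting; a simplified stand-in."""
    name: str
    alpn: str


def next_candidate(candidates, failed, compare_alpn):
    """Return the first candidate not considered failed, or None.

    With compare_alpn=False a candidate counts as failed when its name
    matches any failed candidate (domain-only comparison). With
    compare_alpn=True both the name and the ALPN must match.
    """
    for cand in candidates:
        if compare_alpn:
            already_failed = cand in failed
        else:
            already_failed = any(cand.name == f.name for f in failed)
        if not already_failed:
            return cand
    return None


h3 = Candidate("cloudflare-quic.com", "h3")
h2 = Candidate("cloudflare-quic.com", "h2")

# Domain-only comparison: the h2 option is skipped because it shares the name.
print(next_candidate([h3, h2], [h3], compare_alpn=False))
# Domain + ALPN comparison: the h2 option is still available for the retry.
print(next_candidate([h3, h2], [h3], compare_alpn=True))
```

In this model, the domain-only comparison leaves nothing to retry with after the h3 attempt fails, which corresponds to the error page; including the ALPN in the comparison lets the retry land on h2.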

If I enable IPv6 on my local network, then the initial connection works with http/3 and ECH. Force refreshing triggers a second http/3+ECH handshake which succeeds, but still results in the connection migrating to http/2 and ECH. This is Issue 2: we seem to be inappropriately disabling HTTP3+ECH after tearing down the connection, even if it was successful.

For Issue 2, I think this is a timing issue. The following log shows that it took around 1s to close the HTTP/3 connection completely. During this 1s, creating a new HTTP/3 connection is not allowed. In the end, the fast-fallback HTTP/2 connection won, so the second HTTP request used HTTP/2.

2023-05-30 10:00:56.068892 UTC - [Parent 45620: Socket Thread]: V/nsHttp ConnectionEntry::ClosePersistentConnections [ci=.S........[tlsflags0x00000000]cloudflare-quic.com:443 <ROUTE-via cloudflare-quic.com:443> {NPN-TOKEN h3}^partitionKey=%28https%2Ccloudflare-quic.com%29]
2023-05-30 10:00:56.068966 UTC - [Parent 45620: Socket Thread]: V/nsHttp HttpConnectionUDP::DontReuse 13db14a00 http3session=13da8eb00
2023-05-30 10:00:57.115124 UTC - [Parent 45620: Socket Thread]: V/nsHttp nsHttpConnectionMgr::OnMsgReclaimConnection [ent=1498d0300 conn=13db14a00]
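
The "around 1s" figure can be read directly off the log timestamps. A minimal sketch, assuming every log line starts with a timestamp in the format shown above followed by " UTC - "; log_timestamp and teardown_seconds are hypothetical helper names, and the two sample lines are shortened copies of the first and last log lines.

```python
from datetime import datetime


def log_timestamp(line):
    """Extract the leading timestamp from a log line formatted as above."""
    stamp, _, _ = line.partition(" UTC - ")
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S.%f")


def teardown_seconds(first_line, last_line):
    """Seconds elapsed between two log lines."""
    return (log_timestamp(last_line) - log_timestamp(first_line)).total_seconds()


close_line = "2023-05-30 10:00:56.068892 UTC - [Parent 45620: Socket Thread]: V/nsHttp ConnectionEntry::ClosePersistentConnections"
reclaim_line = "2023-05-30 10:00:57.115124 UTC - [Parent 45620: Socket Thread]: V/nsHttp nsHttpConnectionMgr::OnMsgReclaimConnection"

print(teardown_seconds(close_line, reclaim_line))  # 1.046232
```

That is the window between ClosePersistentConnections and OnMsgReclaimConnection, during which the Http/2 fallback connection had time to win.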

If I enable IPv6 on my local network and also change fallback_to_origin_when_all_failed to true, then the initial connection does not use ECH at all, just plain http/2. Force refreshing upgrades to http/2 + ECH. Disabling ECH+http/3 correctly results in Necko using http/2+ECH on the first attempt. This is Issue 3: setting fallback_to_origin_when_all_failed with ECH+http/3 enabled means ECH+http/3 doesn't get attempted at all and ECH isn't used for the first http/2 connection.

I can't reproduce Issue 3 locally. Note that the pref fallback_to_origin_when_all_failed is only used for retry logic. It should not affect the initial connection.
Dennis, if you can reproduce this, could you send me an HTTP log? Thanks.

Flags: needinfo?(djackson)

Thanks for the quick investigation and patch Kershaw!

I tested your patch and verified that it fixed Issue 1. I no longer see an error page, instead the connection switches over to http/2 with ECH. When I shift-refreshed again a few more times, the connection remained sticky on http/2 with ECH. In combination with what you said regarding Issue 2 - I guess that means that shift-refreshing will force a site to exclude http/3 but perhaps that's not a bad thing for the purposes of reliability. I'm fine with leaving this and issue 2 as-is.

I spent some time trying to reproduce issue 3 today with various operating systems, prefs and networking configurations but wasn't able to. The only variable I can think of that I couldn't control for is whether Cloudflare's HTTPS RR has changed order or format since last week, but that seems unlikely. I'll try again over the next few days just in case I can reproduce, but hopefully this was just some transient error or mistake on my part.

Flags: needinfo?(djackson)
Blocks: ech
Pushed by kjang@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/a182d2375f8f
Also check Alpn when selecting the next record for retrying, r=necko-reviewers,valentin
https://hg.mozilla.org/integration/autoland/rev/513ac812d28f
Test case, r=necko-reviewers,valentin
Status: NEW → RESOLVED
Closed: 1 year ago
Resolution: --- → FIXED
Target Milestone: --- → 115 Branch