Closed Bug 1566175 Opened 2 years ago Closed 1 year ago

SSL_ERROR_MISSING_ESNI_EXTENSION occurs occasionally when visiting websites

Categories

(Core :: Networking: HTTP, defect, P3)

68 Branch
defect

Tracking

()

RESOLVED WORKSFORME
Tracking Status
firefox-esr68 --- wontfix
firefox68 --- wontfix
firefox69 --- wontfix
firefox70 --- affected
firefox71 --- affected

People

(Reporter: raphael.mauro, Unassigned)

References

(Blocks 1 open bug)

Details

(Keywords: regression, Whiteboard: [necko-triaged])

Attachments

(2 files)

Attached image ESNI-Error.png

User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101 Firefox/68.0

Steps to reproduce:

I activated DOH and ESNI in about:config from Mozilla Firefox Version 68.0 on Windows 10 1903.
I set those values in about:config :

network.trr.mode;3
network.trr.bootstrapAddress;1.1.1.1
network.trr.uri;https://mozilla.cloudflare-dns.com/dns-query
network.security.esni.enabled;true
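
For anyone who wants to reproduce with the same configuration, here is a minimal sketch that renders the four preferences above as `user_pref(...)` lines. It assumes they are applied through a `user.js` file in the Firefox profile directory; the helper name `render_user_js` is hypothetical, and only the preference names and values come from this report.

```python
import json

# Preference names and values copied from the steps to reproduce above.
PREFS = {
    "network.trr.mode": 3,
    "network.trr.bootstrapAddress": "1.1.1.1",
    "network.trr.uri": "https://mozilla.cloudflare-dns.com/dns-query",
    "network.security.esni.enabled": True,
}


def render_user_js(prefs):
    """Render each preference as one user_pref(...) line (hypothetical helper)."""
    # json.dumps produces literals that are also valid in user.js:
    # integers stay bare, strings are double-quoted, booleans become true/false.
    return "\n".join(
        f"user_pref({json.dumps(name)}, {json.dumps(value)});"
        for name, value in prefs.items()
    )


if __name__ == "__main__":
    print(render_user_js(PREFS))
```

Setting network.security.esni.enabled to False in the dictionary gives the configuration used as a workaround below.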

Actual results:

When visiting websites (sometimes, not always) I get this type of error:
SSL_ERROR_MISSING_ESNI_EXTENSION

For example, this website produces the error: https://www.frandroid.com/
Some websites work fine with ESNI set to "true" and some don't work at all; I have to set network.security.esni.enabled to "false" in order to access them.

Expected results:

The website should work fine, like the others where I don't get this type of error message.
Please see the attached screenshot of the error that occurred while trying to access the community.cloudflare.com site.

Also have the same error on nightly (70). The error is NOT permanent. The sites are working just fine and then suddenly throw this error for a few seconds or minutes (and when one site does, then the other sites will too, so it's not site specific) and then start working again.

(In reply to khagaroth from comment #1)

Hello,

What do I have to do, then, to get those sites working? For me, I can't access them at all. I always get this error.
Thanks.

I'm also encountering this. It's inconsistent, and sites may start working 5 minutes after I've received the error SSL_ERROR_MISSING_ESNI_EXTENSION when attempting to connect. It happens with connections over HTTP, as well as websockets. My network.trr.mode is set to 2; otherwise all my settings are the same as Raphael's.

Hi,
I wasn't able to reproduce this issue on Nightly 70.0a1 (2019-07-29) or on 68.0.
Also, could you please check whether it's reproducible on Nightly? Here is the download link: https://www.mozilla.org/en-US/firefox/nightly/all/

Thanks!

Flags: needinfo?(raphael.mauro)

(In reply to Luciana queirolo from comment #4)

Hello,

Sorry for the delay. I don't have any ESNI issue with Nightly. Still, I would like to point out that this error is so random: sometimes a website throws this error and then a few seconds later it works fine. For the moment, on Firefox 68.0.1, the error sometimes occurs and sometimes doesn't.

Flags: needinfo?(raphael.mauro)

Is there anything I can do to try to diagnose this error when it happens? I've had it happen multiple times over the past few days, I just don't know how to diagnose the problem.

I got the issue on feedly.com on Nightly, but it fixed itself after a few minutes.

The site uses Cloudflare, so perhaps there is an issue with server key rotation.
https://blog.cloudflare.com/encrypted-sni/

Cloudflare’s own SNI encryption implementation rotates the server’s keys every hour to improve forward secrecy, but keeps track of the keys for the previous few hours to allow for DNS caching and replication delays, so that clients with slightly outdated keys can still use ESNI without problems (but eventually all keys are discarded and forgotten).

Does Firefox keep DNS cache entries (or more specifically ESNI records) in a disk cache?
I restarted it less than 30 min ago, so if the cache is only in RAM, then the key must be fresh, unless there is a server-side issue (or the issue isn't related to key rotation).

This just randomly started happening to me as well, on feedly.com and canary.discordapp.com. The site is broken for up to 60 seconds, I just continue to refresh and it goes away eventually. I'm on 70.0b10.

I have found something interesting about this bug: It only happens on full hours.
For example, it happens at 9:00 and keeps happening. Then it starts working again at 9:01.
This is in line with the person above stating it happens for "up to 60 seconds".
For me, this bug happens every day, and at times like 8:00, 11:00, 6:00, you get the idea.
It also only happens on some websites, but it's always the same websites.
There's probably some kind of bug in how you compare the time at full hours during ESNI validation.
I'm no programmer, so I hope someone else will take a close look at the time part of the code and fix the bug.

Sorry for the doublepost, but I could not find an "edit comment" function.
I want to add the following to the above post:

  • Changing my computer time (Win10's) to a full hour does not reproduce the bug. It appears that it actually has to be a full hour, not just on your computer.
  • The website I experience this regularly on uses cloudflare. This could be a coincidence, but I thought I should mention it, because the other posters also experienced issues on cloudflare websites.

Mark to NEW based on several reports.

Status: UNCONFIRMED → NEW
Component: Untriaged → Networking: HTTP
Ever confirmed: true
Product: Firefox → Core
Summary: ESNI error when visiting websites SSL_ERROR_MISSING_ESNI_EXTENSION → SSL_ERROR_MISSING_ESNI_EXTENSION occurs occasionally when visiting websites

(In reply to NoName from comment #10)

This is really interesting.
Could you try to get the http log when this happens?
Thanks.

Flags: needinfo?(defer.com)

mt, can you look at this or pass it on?

Flags: needinfo?(mt)

This looks like a server configuration error... or a DNS server over-caching an old ESNI record ... or a bad local clock at the time that keys roll over. I don't see any problem with these sites, unless I have a very old ESNI record. I'll forward this to our friends at Cloudflare to get a better idea of what is going on, but we might need to build some better diagnostics for this error.

We probably need to log the ESNI record and the system time when this error happens. Ideally, we should also ask the DoH server about what time it thinks that it is. (If we already do these things, that's super.)

Flags: needinfo?(mt)

(In reply to Kershaw Chang [:kershaw] from comment #12)

I'm working on it, though it's rather difficult, because the bug happens somewhat randomly and I need to start logging before it fixes itself. I'll make another post then.
Meanwhile I got some more info:
A few weeks ago I stated that the bug happens from XX:00 to XX:01 every hour for me.
Shortly afterwards, this actually changed to being XX:04 to XX:05. For example, I noticed it happening about 4 more times around times like 6:04, 11:04, and so on. So while probably not connected to a certain number, it still happens somewhat periodically.
And next, I tried logging the http traffic by refreshing around that time. I kept refreshing pages every few seconds between XX:55 and XX:10, but the bug never happened (I did it especially often around XX:04).
So it seems to me that it doesn't happen EVERY hour, but when it does happen, it's always around the same time.

(In reply to raphael.mauro from comment #0)

Created attachment 9078235 [details]
ESNI-Error.png

I am having this same issue with ESNI support enabled in Firefox 70. My TRR mode is set to 2, if that matters.

(In reply to Kershaw Chang [:kershaw] from comment #12)

I managed to get a good HTTP log of this bug now. It's uploaded here:
https://x0.at/Zuw.txt
I did the following:

  • started browser
  • started logging
  • entered https://boards.4channel.org/g/ into the URL bar of a new tab and hit enter, 1 minute past a full hour (This is a cloudflare-backed website, if I'm not mistaken)
  • ESNI error appears almost instantly
  • stopped logging
  • removed cookies with personal information from the log

Note that this bug, once again, occurred 1 minute past a full hour (in this case 1:01AM) and fixed itself just a few seconds later, around 1:02AM. Just like the previous times it occurred to me.
Hope this helps.

Flags: needinfo?(defer.com)

The formatting in the previous post got messed up, but it contains a link to a clean log of the bug.
I forgot to add some possibly relevant info:
My system clock showed 2:01AM when I looked at it. The log says 1:01AM UTC, so it's probably just a one-hour time zone difference, and the milliseconds should be the same as in the log.
Windows 10, Firefox 70.0, I use the HTTPS Everywhere and uBlock Origin extension.

I am getting this error in Firefox 71 beta 4 when visiting https://feedly.com/ with ESNI encryption and DoH enabled.

Based on what I'm seeing in the log (https://x0.at/Zuw.txt), we have a valid ESNI record that was generated at 2019-10-26T23:00Z. I'm guessing, but this is exactly the middle of the notBefore and notAfter fields. The DNS TTL is 3600 (one hour). I need to confirm, but I expect that these records are generated once per hour on the hour. If that is the case, then this record might have entered a cache just before 2019-10-27T00:00Z. That means that the record would be considered usable right up to 2019-10-27T01:00Z, a minute before the failed attempt. It is likely that the ESNI keys at the server were rotated at 2019-10-27T01:00Z or soon afterwards. As a result, the server wasn't able to produce the correct ESNI response and we failed the connection.

This is clearly a server configuration issue. Because Cloudflare tell us that this is valid for 3 days, we can't do anything to safeguard against failures here. That's their prerogative, and they make that call because they don't trust client clocks, which is probably wise. Our experience shows that clocks on clients are sometimes very bad.

In the end, the goal is to have the server ensure that the client won't attempt the ESNI request past when they drop their keys. They do that either by reducing the time that the record is valid (by reducing the TTL on DNS records, for instance) or by extending the time that the server retains the corresponding keys.

If, as I'm guessing, the record was issued at 23:00, replaced at 00:00, and the keys were retired at 01:00, then a TTL of 3600 is too tight. Any delay in provisioning replacement keys at 00:00 would leave a period of exposure. If keys are replaced too soon relative to that, this error occurs. If a DNS server holds on to keys too long, or there are network delays in delivering DNS responses, then the TTL will extend past the 3600 and bad things happen.
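
The guessed timeline can be sketched as follows. The timestamps are the ones mentioned in this comment; the cache-entry time (one second before 00:00Z), the key retirement time, and the 120-second over-hold are assumptions used only for illustration, and the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

UTC = timezone.utc


def usable_until(cached_at, ttl_seconds, extra_hold_seconds=0):
    """Latest time at which a cached record would still be handed out."""
    return cached_at + timedelta(seconds=ttl_seconds + extra_hold_seconds)


def uses_retired_key(attempt_at, cached_at, ttl_seconds, keys_retired_at,
                     extra_hold_seconds=0):
    """True if the attempt still uses the cached record after key retirement."""
    still_cached = attempt_at <= usable_until(
        cached_at, ttl_seconds, extra_hold_seconds)
    return still_cached and attempt_at >= keys_retired_at


# Assumed: the record entered a cache just before 2019-10-27T00:00Z.
cached_at = datetime(2019, 10, 26, 23, 59, 59, tzinfo=UTC)
# Assumed: the server dropped the matching keys at 01:00Z.
keys_retired_at = datetime(2019, 10, 27, 1, 0, 0, tzinfo=UTC)
# From the log: the failed attempt was at about 01:01Z.
attempt_at = datetime(2019, 10, 27, 1, 1, 0, tzinfo=UTC)

# With the TTL honored exactly, the record has expired before the attempt.
print(uses_retired_key(attempt_at, cached_at, 3600, keys_retired_at))  # False
# With a hypothetical 120 s of over-caching or delay, the stale record is used.
print(uses_retired_key(attempt_at, cached_at, 3600, keys_retired_at,
                       extra_hold_seconds=120))  # True
```

Under these assumptions a TTL of 3600 leaves no margin at all: any extra hold time past the key retirement is enough to produce the error.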

Now, I might be wrong and this might be the result of an insane over-extension of the DNS TTL, but I'll check with Cloudflare to confirm this.

A note:
Necko increases the validity (valid + grace time) of a record to at least 1 minute (it is the grace period that is increased, but either way we will use the record for at least 1 minute).
This is not the problem here because the record is not from the necko cache.
(For these records we should not increase the TTL, and we should not add a grace period as we do for A and AAAA records. I will file a bug.)

This does not look like our bug. I will wait to get a confirmation from Cloudflare.

Priority: -- → P3
Whiteboard: [necko-triaged]

I just got this error at https://clark.com/latest (Win7/FF 71.0).

Settings:
network.trr.mode - 2
network.trr.bootstrapAddress - 9.9.9.11
network.trr.uri - https://dns11.quad9.net/dns-query
network.trr.custom_uri - https://dns11.quad9.net/dns-query
network.trr.early-AAAA - true
network.trr.request-timeout - 3000
network.trr.request_timeout_ms - 3000
network.trr.wait-for-portal - true
network.security.esni.enabled - true

Changing only network.security.esni.enabled to false let me in. I then set it back to true and still got in. Will see what happens tomorrow.

(In reply to Dragana Damjanovic [:dragana] from comment #22)

This does not look like our bug. I will wait to get a confirmation from Cloudflare.

Indeed; for me, it's caused by having the AVG Antivirus setting for 'Enable HTTPS scanning' enabled.

Well, I had an AVG update waiting to be installed on reboot, so I did it, and the problem has disappeared. Also, my laptop doesn't have the problem and it's still waiting for a reboot to install the AVG update. Don't know what to make of that, as both PCs have the same FF DNS setup as mentioned above :-/

Regarding the comment above, I don't use an antivirus at all, and I still get this problem multiple times every day at regular time intervals.
So possibly the AVG was just a coincidence.

Hello everyone,

As the person who reported this issue, I can confirm that the problem is not coming from the AVG antivirus, as I'm using Eset Internet Security (version 13.0.24.0) and still have the issue I reported.

Regards,

Have to confirm - also don't use any antivirus. Just to add a new thing - for about a week now the same sites that give me the ESNI error now also rarely (ie even more rarely than the ESNI error) give me the SSL_ERROR_NO_CYPHER_OVERLAP error instead which also clears itself after few seconds/minutes.

I'm experiencing this issue as well, though it is not nearly as intermittent as others have reported. For me, the error doesn't clear itself until I restart Firefox.

I don't use any antivirus. Firefox 72.0.1 on Linux

Can also confirm I encountered this error on Firefox 73.0b5 (Windows 10). My TRR mode is set to 3 and I have the same standard settings as the users above.
I disabled the protection feature "Web Anti-Virus" in Kaspersky after seeing the comment regarding AVG, since then it's worked.

Some people still seem to be affected without having an AV, but if you have one it's worth disabling it to check.

This happens for me regularly at hourly intervals, usually at 59 minutes past the hour, and usually abates at 1 minute or so past the hour. I'm running Linux, using Firefox Developer Edition 73.0b5 (64-bit).

This still happens to me rarely; I can't seem to reproduce it. The last time it happened was on a different, older version of Firefox Nightly. I'm currently using version 76.0a1 (2020-03-17) (64-bit).
When the problem occurs it happens on a certain site (Different each time it occurs.) however other sites work.
No anti-virus, ESNI + DOH + DNSSEC are all on, using Windows 10 stable latest version. This also happens right as the time hits XX:00 (A new hour.) which aligns with the previous comment above me. Problem dissipates after about a minute to three minutes.

Bugbug thinks this bug is a regression, but please revert this change in case of error.

Keywords: regression

I am still getting this on FF 75. I have no antivirus other than Windows Defender, and it is not happening at regular intervals; it happens randomly. Pressing refresh a few times usually gets the page to load. I am using DoH with NextDNS, TRR mode set to 3. I have had to stop using DoH until this is resolved as it is driving me mad. How this is still going on after this much time I don't know.

Experiencing the same randomly today but only on CloudFlare hosted websites, I think comment 14 lays out the real issue with CF.

I got SSL_ERROR_MISSING_ESNI_EXTENSION today, although I didn't get it yesterday.
I took a network log, which I have attached.

The site is https://pastebin.com/

<My Environment>
Firefox 75.0 (64-bit)
Windows 10 (64-bit)

network.trr.bootstrapAddress = 1.1.1.1
network.trr.mode = 3
network.trr.custom_uri = https://mozilla.cloudflare-dns.com/dns-query
network.trr.uri = https://mozilla.cloudflare-dns.com/dns-query
network.security.esni.enabled = true

This seems to remain broken for longer periods of time; it has been at least 10 minutes now (since I updated to today's Nightly).

Disabling DoH and restarting Firefox avoids the issues.

FWIW, I'm having the 'SSL_ERROR_MISSING_ESNI_EXTENSION' issue with some sites on Firefox Nightly version 77.0a1 (2020-04-15) (64-bit) - Windows 10.

A temporary work-around on my end seems to be setting network.trr.mode to 1 (network.trr.mode = 1).

So have the settings as follows:

network.trr.bootstrapAddress = 1.1.1.1
network.trr.mode = 1
network.trr.custom_uri = https://mozilla.cloudflare-dns.com/dns-query
network.trr.uri = https://mozilla.cloudflare-dns.com/dns-query
network.security.esni.enabled = true

However, this leads the browser to fail the Cloudflare ESNI check:

https://www.cloudflare.com/ssl/encrypted-sni/

I have commented on this months ago and stated that it happens for about a minute an hour.
But recently it has happened for 10 minutes an hour and then 15 minutes an hour.
At the time of this writing, it has been happening for about 90 minutes nonstop, and it's still happening.

Hi,
Also getting this issue, only on the Discord website, every other site is loading fine. Running https://www.cloudflare.com/ssl/encrypted-sni/ I pass all tests
https://bin.privacytools.io/?3834a6d573796256#V932WglnnlG29qLklYXN7lQWiodAdYl0crklAqrYEEU=
My network.trr.mode is set to 2. I changed network.trr.uri from https://mozilla.cloudflare-dns.com/dns-query to https://dns.quad9.net/dns-query and it's now working.

Possible issue with Cloudflare itself: https://www.cloudflarestatus.com/

Hi,

A fix has been rolled out at Cloudflare earlier today. The problem was exposed by two separate bugs:

  • Key rotation happens every hour, but Cloudflare's authoritative DNS servers could serve stale records for up to 15 minutes. Since the TTL was originally 1 hour, this could mean that a client could continue to use the previous-previous key for 1h15m after the intended key rotation.
  • The TLS server was supposed to keep the previous two ESNI keys to counter DNS caches that do not respect TTL. However, due to a bug, only the previous and current keys were kept. That could result in a window of up to 15 minutes where client connections using older ESNI keys would fail.

A fix for the first issue is to reduce the TTL from 1h to 30 minutes; this change seems to have an immediate effect on the failure rates. There are close to no hourly failure spikes anymore.

A fix for the second issue has not been applied yet, but interestingly it does not seem necessary to reduce failure rates for the current deployment. While previous research revealed that resolvers can significantly increase TTL, the current deployment of ESNI being coupled with DoH seems to prevent these TTL modifications. As deployment of ESNI grows, it remains to be seen whether this still holds.
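
The interaction of the two bugs can be illustrated with a simplified model. This is a sketch under stated assumptions, not Cloudflare's implementation: it treats time in whole minutes, assumes resolvers honor the TTL exactly, and uses a hypothetical function name and parameters.

```python
def failure_window_minutes(rotation=60, stale_serving=15, ttl=60,
                           keys_retained=2):
    """Minutes during which a client may still use a key the server dropped.

    Simplified model: the key published at t=0 is replaced at t=rotation,
    may still be served by authoritative DNS until rotation + stale_serving,
    and can then stay in a client-side cache for ttl more minutes.
    The server keeps the newest keys_retained keys (current key included),
    so the key from t=0 is dropped at keys_retained * rotation.
    """
    last_client_use = rotation + stale_serving + ttl
    key_dropped_at = keys_retained * rotation
    return max(0, last_client_use - key_dropped_at)


# Original deployment: 1h TTL, only current + previous key kept.
print(failure_window_minutes())                 # 15
# First fix: TTL reduced to 30 minutes.
print(failure_window_minutes(ttl=30))           # 0
# Second fix alone: current key plus the previous two kept.
print(failure_window_minutes(keys_retained=3))  # 0
```

In this model either fix on its own closes the 15-minute window, which is consistent with the failure spikes disappearing after only the TTL change. A resolver that extends the TTL would reopen the window, which is why the second fix still matters.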

As for the reports on 2020-04-15, there was a temporary issue that caused increased failure rates for up to 2 hours.

@mt Do you have any metrics on ESNI failure rates from your side to verify this?
I think that this issue can be closed as it was a problem at our (CF) side that has since been resolved.

Flags: needinfo?(mt)

(In reply to Peter Wu from comment #43)

Hello,

Thanks for your feedback, though I still get the same error when accessing this URL: https://developers.cloudflare.com/
My settings are the same as my original post.

I don't think we should close this case until most people can confirm that everything's working fine.

Regards,

Raphael,

Can you reproduce the problem after restarting your browser?

If so, please share:

  • The info from https://1.1.1.1/help
  • On about:networking#dns, do any entries with "TRR" set to "true" exist?
  • On about:networking#dns, can you see developers.cloudflare.com after opening it? If so, what does it say?

I am unable to reproduce it locally with the settings you provided. I tried to connect to https://developers.cloudflare.com/, and it works. Wireshark shows the presence of ESNI too in the Encrypted Extensions handshake message.

Firstly, thanks to Peter for following up here.

The bugs regarding key rotation and DNS caching mean that we're finding the hidden operational traps in this. On the one hand, I'm glad that it was just server-side bugs; on the other, as long as these are inherent in the design we'll have these problems. I don't know how much we can feed this into improvements to specifications or documentation, but it would be good to capture this somewhere. Especially the bit about future research.

As far as knowing that problems remain or whether this fixes things, that's difficult. I believe we do track TLS error codes, but that tracking is not public and getting the data is a little difficult (as in, I have made some casual attempts to get data and only ever failed). Also, we are limited in the granularity of the data. We won't be able to tell what part of an hour an observation is from; we can barely tell which day a submission was from. I might be able to look into whether this trends down as a result of these fixes, but I expect that this will end up in the noise: our telemetry has a bunch of noise and as we haven't enabled this by default, the number of people who might encounter this is probably quite small (no, we don't have telemetry on that, and I don't believe we have plans to build that capability: about:config is a little sensitive).

This was always an experimental feature (and it's going to change a LOT soon), so this is valuable experience more than it is a problem that needs to be fixed. I'm going to close this, but feel free to use this bug to coordinate the last little details if that suits.

Flags: needinfo?(mt)
Status: NEW → RESOLVED
Closed: 1 year ago
Resolution: --- → WORKSFORME

The issue seems to be anything but resolved. I just started getting SSL_ERROR_MISSING_ESNI_EXTENSION on sites like Patreon. And the errors don't go away; they're permanent. Firefox 77.0b7 (64-bit)

Can't edit the comment, so forgive double-posting. Turns out, that it was ESET Nod32's doing. Unchecking

Settings > Advanced settings > Web and email > access protection > protocols > HTTPS checking

fixed the issue

Are you using ESNI?
ESNI is/was an experimental feature that had some deployment difficulties, e.g. unexpectedly getting this error. It will be replaced with a new version soon. There is already a draft specification for the new version.
We do not advise using ESNI at the moment.

I probably am using it, yes. That being said, changing this antivirus setting did help.
