Open Bug 1913559 Opened 1 year ago Updated 4 months ago

Extremely long first DNS request for local-only domains

Categories

(Core :: Networking, defect, P2)

Firefox 129
defect
Points:
3

Tracking

()

UNCONFIRMED

People

(Reporter: taynik777, Unassigned)

References

(Depends on 1 open bug)

Details

(Whiteboard: [necko-triaged][necko-priority-next])

User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:130.0) Gecko/20100101 Firefox/130.0

Steps to reproduce:

I have a locally accessible nginx balancer that serves websites and services to local users. It serves all websites under the *.lan.zonastro.com domain, which is accessible ONLY from certain internal networks. All local DNS servers know about these domains, but no external DNS server does. Since every local user device is set up to use specific DNS servers, everything has worked as expected for years. Since the Firefox 129 update (including the 130.0b6 dev version), after Firefox starts, every FIRST request to any *.lan.zonastro.com domain takes one minute to load. This reproduces once for each new domain, never reproduces in any other browser, and did not reproduce during testing in Firefox 128.0.3.

I've recorded one example of this behavior: https://share.firefox.dev/3WRTGgA
Since it appeared in v129 and above, I think it may be related to another issue I just reported: https://bugzilla.mozilla.org/show_bug.cgi?id=1913558

Actual results:

On the first attempt to load any domain in *.lan.zonastro.com, the page loads for 60s and then works normally. This reproduces for each domain opened for the first time and does not reproduce in other browsers.
For example, if the user opens abc123.lan.zonastro.com, it gets stuck loading for 60s, then recovers and loads normally for the whole session. If the user then tries to load, for example, abc1234.lan.zonastro.com, it gets stuck again as before.

Expected results:

Page should open fast enough without this delay.

The Bugbug bot thinks this bug should belong to the 'Core::Networking' component, and is moving the bug to that component. Please correct in case you think the bot is wrong.

Component: Untriaged → Networking
Product: Firefox → Core

Could you find a regression range using mozregression?

Flags: needinfo?(taynik777)

I ran the same mozregression for https://bugzilla.mozilla.org/show_bug.cgi?id=1913558 and, while testing for it, got the same results for this issue. The results are identical, so I did not repeat the process for this issue: during the test, the builds showed both the long load and that SSL issue together. I'm currently using v128.0b9 (Build ID 20240628091536) and it works perfectly fine. I specified it as the last known good build in mozregression, and 129 as the first bad release.

All it gave me is this (end of log):

2024-08-17T18:36:01.254000: INFO : Narrowed integration regression window from [ca0abc9a, 21a08e73] (3 builds) to [ca0abc9a, 023a7981] (2 builds) (~1 steps left)
2024-08-17T18:36:01.260000: DEBUG : Starting merge handling...
2024-08-17T18:36:01.261000: DEBUG : Using url: https://hg.mozilla.org/integration/autoland/json-pushes?changeset=023a7981f5eb5fbeb6587ab28ac7a06221f3d27c&full=1
2024-08-17T18:36:01.261000: DEBUG : redo: attempt 1/3
2024-08-17T18:36:01.261000: DEBUG : redo: retry: calling _default_get with args: ('https://hg.mozilla.org/integration/autoland/json-pushes?changeset=023a7981f5eb5fbeb6587ab28ac7a06221f3d27c&full=1',), kwargs: {}, attempt #1
2024-08-17T18:36:01.264000: DEBUG : urllib3.connectionpool: Resetting dropped connection: hg.mozilla.org
2024-08-17T18:36:02.701000: DEBUG : urllib3.connectionpool: https://hg.mozilla.org:443 "GET /integration/autoland/json-pushes?changeset=023a7981f5eb5fbeb6587ab28ac7a06221f3d27c&full=1 HTTP/11" 200 None
2024-08-17T18:36:02.755000: DEBUG : Found commit message:
Bug 1904270 - Use a customizable dimens variable for the marginEnd in the Toolbar URL background ImageView r=android-reviewers,twhite

- Replaces the existing hardcoded marginEnd value in the Toolbar URL background ImageView with a dimens variable that can be customized by the consumer.

Differential Revision: https://phabricator.services.mozilla.com/D214686

2024-08-17T18:36:02.755000: DEBUG : Did not find a branch, checking all integration branches
2024-08-17T18:36:02.756000: INFO : The bisection is done.
2024-08-17T18:36:02.756000: INFO : Stopped
Flags: needinfo?(taynik777)

Hi Astro,

Are you using DNS over HTTPS?
Check the values of network.trr.mode and doh-rollout.mode in about:config

Could you also capture a new profile from about:logging ?
Thanks!

Flags: needinfo?(taynik777)

Hi Valentin,

No, I don't use DNS over HTTPS, it shows status "Off"
Both of network.trr.mode and doh-rollout.mode have "0" values

Here is new profile recording from about:logging: https://share.firefox.dev/3yOaz3B
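
For readers following along, a small lookup of what the network.trr.mode values mean may help interpret the "0" reported above. The table reflects my understanding of the Firefox TRR documentation and should be treated as an assumption to verify; the helper name `uses_doh` is made up for this sketch.

```python
# Meanings of network.trr.mode values, as I understand them from the
# Firefox TRR documentation (an assumption worth double-checking).
TRR_MODES = {
    0: "off (default): native resolver only",
    1: "reserved (unused)",
    2: "TRR first: try DoH, fall back to the native resolver",
    3: "TRR only: DoH with no native fallback",
    4: "reserved (unused)",
    5: "off by explicit choice",
}


def uses_doh(mode: int) -> bool:
    """Return True when this network.trr.mode sends lookups over DoH."""
    return mode in (2, 3)


# The reporter's value: DoH is not in play, so lookups use the system resolver.
print(TRR_MODES[0], "| uses DoH:", uses_doh(0))
```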

Flags: needinfo?(taynik777)

I have the same issue in Firefox 129.0.1.

I use split DNS and the internally resolved DNS records are immediately hitting the internal DNS server, as I see it in the DNS server logs, but then it takes about 1 minute until some timeout happens in Firefox, and only then it continues loading the page.

I have completely disabled DoH in Firefox ("Use your default DNS resolver"). Once Firefox loads the page, it works fine until Firefox clears the entry from the DNS cache.

Could you folks see if setting network.dns.native_https_query to false in about:config makes the problem go away?

Flags: needinfo?(taynik777)
Flags: needinfo?(janhouse)

(In reply to Valentin Gosu [:valentin] (he/him) from comment #7)

Could you folks see if setting network.dns.native_https_query to false in about:config makes the problem go away?

Yes, setting that to "false" makes the problem go away.
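
If someone wants to apply this workaround persistently rather than through about:config, the pref could go into a profile's user.js file. The sketch below only formats the line; the helper name `user_pref_line` is hypothetical, and where the profile directory lives is left to the reader.

```python
import json


def user_pref_line(name: str, value) -> str:
    """Format one user.js line; json.dumps yields JS-compatible literals."""
    return f"user_pref({json.dumps(name)}, {json.dumps(value)});"


# Workaround discussed in this bug: disable native HTTPS record queries.
print(user_pref_line("network.dns.native_https_query", False))
```

Note that this turns off HTTPS record lookups through the native resolver for all domains, not only the local ones.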

Flags: needinfo?(janhouse)

(In reply to Valentin Gosu [:valentin] (he/him) from comment #7)

Could you folks see if setting network.dns.native_https_query to false in about:config makes the problem go away?

But I also notice that it stops making the "HTTPS" DNS queries when that setting is false. It works, but I want to note that my DNS server does respond to internal "HTTPS" queries correctly.

(In reply to Valentin Gosu [:valentin] (he/him) from comment #7)

Could you folks see if setting network.dns.native_https_query to false in about:config makes the problem go away?

It does go away with that setting set to false.

Flags: needinfo?(taynik777)

Just as a check, can you try disabling IPV6: network.dns.disableIPv6

Severity: -- → S3
Priority: -- → P2
Whiteboard: [necko-triaged][necko-priority-new]

(In reply to Randell Jesup [:jesup] (needinfo me) from comment #11)

Just as a check, can you try disabling IPV6: network.dns.disableIPv6

It does not change anything and does not fix the problem.

Flags: needinfo?(kershaw)
Whiteboard: [necko-triaged][necko-priority-new] → [necko-triaged][necko-priority-review]

Thanks for the profile. Here's what seems to have happened:

  1. We received an HTTPS record indicating that h3 could be used.
2024-08-21 18:08:38.732 UTC - [Parent Process 16948 Socket Thread] D/nsHttp Http3Session::Init 17260e63000
2024-08-21 18:08:38.732 UTC - [Parent Process 16948 Socket Thread] D/nsHttp Http3Session::Init origin=kibana.lan.zonastro.com, alpn=h3, selfAddr=0.0.0.0, peerAddr=10.20.24.2, qpack table size=65536, max blocked streams=20 webtransport=0 [this=17260e63000]
  2. However, the HTTP/3 connection was blocked, and the connection failed due to a timeout.
2024-08-21 18:09:08.734 UTC - [Parent Process 16948 Socket Thread] D/nsHttp Http3Session::ProcessEvents - ConnectionClosed
2024-08-21 18:09:08.734 UTC - [Parent Process 16948 Socket Thread] D/nsHttp Http3Session::ProcessEvents - ConnectionClosed error=804b000e
  3. Firefox should have created a fallback connection to h2, but this also failed as the server doesn't appear to support h2.
2024-08-21 18:08:38.784 UTC - [Parent Process 16948 Socket Thread] D/nsHttp Creating DnsAndConnectSocket [this=17259eefd40 trans=1725f4ad9b0 ent=kibana.lan.zonastro.com key=.SA.....F.[tlsflags0x00000000]kibana.lan.zonastro.com:443 {NPN-TOKEN h2}]
2024-08-21 18:08:38.784 UTC - [Parent Process 16948 Socket Thread] D/nsHttp SpeculativeTransaction::Close 1725f4ad9b0 aReason=804b000d
2024-08-21 18:08:38.784 UTC - [Parent Process 16948 Socket Thread] D/nsHttp nsHttpConnection::Activate [this=17263110000 trans=1725f4ad9b0 caps=200011]

2024-08-21 18:08:38.784 UTC - [Parent Process 16948 Socket Thread] D/nsHttp nsHttpConnection::Activate [this=17263110000] Bad Socket 804b000d
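
The hex values in the log lines above (804b000e and 804b000d) are nsresult codes. A minimal sketch of how they break down follows, assuming the usual Gecko layout (failure bit, module number offset by 0x45, 16-bit code). The two names in the table are the ones I know for the network module; the function name `decode_nsresult` is made up for this example.

```python
NS_ERROR_MODULE_BASE_OFFSET = 0x45
NS_ERROR_MODULE_NETWORK = 6

# Only the two network-module codes that appear in the log above.
KNOWN_NETWORK_CODES = {
    13: "NS_ERROR_CONNECTION_REFUSED",
    14: "NS_ERROR_NET_TIMEOUT",
}


def decode_nsresult(hex_code: str):
    """Split an nsresult into (is_failure, module, code, name or None)."""
    value = int(hex_code, 16)
    is_failure = bool(value & 0x80000000)
    module = ((value >> 16) - NS_ERROR_MODULE_BASE_OFFSET) & 0x1FFF
    code = value & 0xFFFF
    name = KNOWN_NETWORK_CODES.get(code) if module == NS_ERROR_MODULE_NETWORK else None
    return is_failure, module, code, name


for logged in ("804b000e", "804b000d"):
    print(logged, decode_nsresult(logged))
```

Read this way, the HTTP/3 session closed with a timeout, while the h2 attempt on port 443 was refused.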

Hi Reporter,

Could you clarify the HTTPS record for kibana.lan.zonastro.com? It appears that the record indicates Firefox can connect to the server using h3 and h2, but both connection attempts failed.

Thanks.
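
As a quick check on the timeline in this comment, the gap between the Http3Session::Init line and the ConnectionClosed line can be computed directly from the two timestamps. This is only arithmetic on the quoted log lines; the helper name `parse_log_time` is hypothetical.

```python
from datetime import datetime


def parse_log_time(line: str) -> datetime:
    """Parse the leading 'YYYY-MM-DD HH:MM:SS.mmm UTC' stamp of a log line."""
    stamp = line.split(" UTC", 1)[0]
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S.%f")


init_line = (
    "2024-08-21 18:08:38.732 UTC - [Parent Process 16948 Socket Thread] "
    "D/nsHttp Http3Session::Init 17260e63000"
)
closed_line = (
    "2024-08-21 18:09:08.734 UTC - [Parent Process 16948 Socket Thread] "
    "D/nsHttp Http3Session::ProcessEvents - ConnectionClosed error=804b000e"
)

elapsed = parse_log_time(closed_line) - parse_log_time(init_line)
print(f"HTTP/3 attempt lasted {elapsed.total_seconds():.3f} s")
```

The quoted lines cover roughly 30 seconds for this one HTTP/3 attempt. The excerpt does not show what accounts for the rest of the 60 seconds the reporter observed.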

Flags: needinfo?(kershaw) → needinfo?(taynik777)

Redirect a needinfo that is pending on an inactive user to the triage owner.
:jesup, since the bug has recent activity, could you have a look please?

For more information, please visit BugBot documentation.

Flags: needinfo?(taynik777) → needinfo?(rjesup)

Astro, any chance you could answer comment 13? Thanks!

Flags: needinfo?(rjesup) → needinfo?(taynik777)

(In reply to Kershaw Chang [:kershaw] from comment #13)

Thanks for the profile. Here's what seems to have happened:

  1. We received an HTTPS record indicating that h3 could be used.
2024-08-21 18:08:38.732 UTC - [Parent Process 16948 Socket Thread] D/nsHttp Http3Session::Init 17260e63000
2024-08-21 18:08:38.732 UTC - [Parent Process 16948 Socket Thread] D/nsHttp Http3Session::Init origin=kibana.lan.zonastro.com, alpn=h3, selfAddr=0.0.0.0, peerAddr=10.20.24.2, qpack table size=65536, max blocked streams=20 webtransport=0 [this=17260e63000]
  2. However, the HTTP/3 connection was blocked, and the connection failed due to a timeout.
2024-08-21 18:09:08.734 UTC - [Parent Process 16948 Socket Thread] D/nsHttp Http3Session::ProcessEvents - ConnectionClosed
2024-08-21 18:09:08.734 UTC - [Parent Process 16948 Socket Thread] D/nsHttp Http3Session::ProcessEvents - ConnectionClosed error=804b000e
  3. Firefox should have created a fallback connection to h2, but this also failed as the server doesn't appear to support h2.
2024-08-21 18:08:38.784 UTC - [Parent Process 16948 Socket Thread] D/nsHttp Creating DnsAndConnectSocket [this=17259eefd40 trans=1725f4ad9b0 ent=kibana.lan.zonastro.com key=.SA.....F.[tlsflags0x00000000]kibana.lan.zonastro.com:443 {NPN-TOKEN h2}]
2024-08-21 18:08:38.784 UTC - [Parent Process 16948 Socket Thread] D/nsHttp SpeculativeTransaction::Close 1725f4ad9b0 aReason=804b000d
2024-08-21 18:08:38.784 UTC - [Parent Process 16948 Socket Thread] D/nsHttp nsHttpConnection::Activate [this=17263110000 trans=1725f4ad9b0 caps=200011]

2024-08-21 18:08:38.784 UTC - [Parent Process 16948 Socket Thread] D/nsHttp nsHttpConnection::Activate [this=17263110000] Bad Socket 804b000d

Hi Reporter,

Could you clarify the HTTPS record for kibana.lan.zonastro.com? It appears that the record indicates Firefox can connect to the server using h3 and h2, but both connection attempts failed.

Thanks.

It seems that Firefox did not follow DNS for the exact requested local-only domain and went straight to Cloudflare (the main NS), where it got HTTPS records mentioning HTTP/3 and other things. Those are not valid for the .lan.zonastro.com domain, which has no HTTPS records on the local DNS servers in use. For now it is semi-solved by manually adding a DNS HTTPS record (.lan.zonastro.com. 300 IN HTTPS 1 10.20.24.2.) to Cloudflare for the local-only domains, which is not great, but it works for now at least. So none of the local-only (*.lan.zonastro.com) domains, which have records only on the DNS servers used in the local network, have any HTTPS records.
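
To compare what the local resolver and a public one return for the HTTPS record type, a raw query can be sent to each server. Below is a minimal sketch using only the standard library: it builds a DNS question for type 65 (HTTPS, per RFC 9460) and reads the answer count from the reply header. The function names and the default query ID are assumptions made for illustration, and the sketch does not parse the answer section or handle truncated replies.

```python
import socket
import struct

HTTPS_RRTYPE = 65  # HTTPS resource record type (RFC 9460)
IN_CLASS = 1


def build_query(name: str, rrtype: int = HTTPS_RRTYPE, query_id: int = 0x1234) -> bytes:
    """Build a single-question DNS query with the recursion-desired flag set."""
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", rrtype, IN_CLASS)


def answer_count(server: str, name: str, timeout: float = 3.0) -> int:
    """Send the HTTPS query over UDP and return ANCOUNT from the reply header."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(name), (server, 53))
        reply, _ = sock.recvfrom(4096)
    return struct.unpack("!H", reply[6:8])[0]
```

Calling `answer_count(<resolver address>, "abc123.lan.zonastro.com")` once against the internal DNS server and once against a public resolver would show which of them returns HTTPS answers for the name.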

Flags: needinfo?(taynik777)
Flags: needinfo?(kershaw)

Hey Kershaw, could you please help investigate this further?

Hi Astro,

Could you open about:networking#dns and see if .lan.zonastro.com is listed in DNS suffix? If yes, I think maybe we should skip native HTTPS query for local domains. However, the HTTPS record is provided by your system resolver, not TRR. I am not sure if it's correct to ignore HTTPS record for this case.

Thanks.

Flags: needinfo?(kershaw) → needinfo?(taynik777)

(In reply to Kershaw Chang [:kershaw] from comment #18)

Hi Astro,

Could you open about:networking#dns and see if .lan.zonastro.com is listed in DNS suffix? If yes, I think maybe we should skip native HTTPS query for local domains. However, the HTTPS record is provided by your system resolver, not TRR. I am not sure if it's correct to ignore HTTPS record for this case.

Thanks.

Hi Kershaw,

I've opened about:networking#dns, but there is nothing in DNS suffix; only the table of DNS records under the Clear DNS Cache button is filled.

Flags: needinfo?(taynik777)

Hi Kershaw, any direction on next steps for this issue?

Flags: needinfo?(kershaw)

It seems that Firefox did not follow DNS for the exact requested local-only domain and went straight to Cloudflare (the main NS), where it got HTTPS records mentioning HTTP/3 and other things. Those are not valid for the .lan.zonastro.com domain, which has no HTTPS records on the local DNS servers in use. For now it is semi-solved by manually adding a DNS HTTPS record (.lan.zonastro.com. 300 IN HTTPS 1 10.20.24.2.) to Cloudflare for the local-only domains, which is not great, but it works for now at least. So none of the local-only (*.lan.zonastro.com) domains, which have records only on the DNS servers used in the local network, have any HTTPS records.

Please note that the HTTPS record is actually provided by the system resolver, so there isn’t much we can do in this case. I believe adding this HTTPS record (*.lan.zonastro.com. 300 IN HTTPS 1 10.20.24.2) is a good workaround.

Ideally, there would be a way to instruct Firefox not to resolve HTTPS records for .lan.zonastro.com, but I’m not aware of a good solution for this.

Valentin, do you have any idea here?
Thanks.
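
For reference, the workaround record quoted in this comment can be split into its fields as sketched below. The field names follow RFC 9460 terminology as I understand it; the parser itself (`HttpsRecord`, `parse_https_record`) is hypothetical and handles only this simple single-line form.

```python
from typing import NamedTuple


class HttpsRecord(NamedTuple):
    owner: str
    ttl: int
    rrclass: str
    rrtype: str
    priority: int   # SvcPriority: 0 means AliasMode, anything else ServiceMode
    target: str     # TargetName
    params: tuple   # SvcParams such as alpn=..., empty if none are given


def parse_https_record(text: str) -> HttpsRecord:
    """Split a one-line HTTPS record in zone-file presentation form."""
    owner, ttl, rrclass, rrtype, priority, target, *params = text.split()
    if rrtype != "HTTPS":
        raise ValueError(f"not an HTTPS record: {rrtype}")
    return HttpsRecord(owner, int(ttl), rrclass, rrtype, int(priority), target, tuple(params))


record = parse_https_record("*.lan.zonastro.com. 300 IN HTTPS 1 10.20.24.2")
print(record.priority, record.target, record.params)
```

The record carries no SvcParams, so it does not list h3 in an alpn parameter. That may be why Firefox no longer attempts HTTP/3 with it in place, though this is an inference rather than something confirmed in this bug.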

Flags: needinfo?(kershaw) → needinfo?(valentin.gosu)

Let's fix bug 1884762 and use that API to avoid resolving HTTPS records for any local domains.
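
The decision described here could look roughly like the sketch below: skip the HTTPS record lookup when the host falls under a known local suffix. This is not the API from bug 1884762, whose shape this thread does not describe. The function names and the source of the suffix list are assumptions for illustration only.

```python
def is_local_host(host: str, local_suffixes: list[str]) -> bool:
    """Return True if host equals, or is a subdomain of, any local suffix."""
    host = host.rstrip(".").lower()
    for suffix in local_suffixes:
        suffix = suffix.strip(".").lower()
        if suffix and (host == suffix or host.endswith("." + suffix)):
            return True
    return False


def should_query_https_record(host: str, local_suffixes: list[str]) -> bool:
    """Hypothetical policy: no HTTPS record lookups for local domains."""
    return not is_local_host(host, local_suffixes)


# Hypothetical suffix list; in this bug the reporter's DNS suffix list was empty.
suffixes = ["lan.zonastro.com"]
print(should_query_https_record("kibana.lan.zonastro.com", suffixes))
print(should_query_https_record("www.zonastro.com", suffixes))
```

Matching on a label boundary (the leading dot) keeps a host such as notlan.zonastro.com from being treated as local by accident.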

Depends on: 1884762
Flags: needinfo?(valentin.gosu)
Whiteboard: [necko-triaged][necko-priority-review] → [necko-triaged][necko-priority-next]
Points: --- → 3
Rank: 3