Open Bug 1947621 Opened 11 days ago Updated 10 days ago

Firefox is querying public nameservers for HTTPS RRs despite being on VPN and having DoH off

Categories

(Core :: Networking: DNS, defect, P2)

Firefox 134
defect

Tracking

()

UNCONFIRMED

People

(Reporter: paulfurtado91, Assigned: valentin)

References

(Blocks 1 open bug)

Details

(Whiteboard: [necko-triaged])

User Agent: Mozilla/5.0 (X11; Linux x86_64; rv:134.0) Gecko/20100101 Firefox/134.0

Steps to reproduce:

I have a hostname that uses split DNS such that it resolves directly to the internal load balancer when on VPN and to cloudflare access when not on VPN. I have DoH totally disabled in settings, which has historically made this reliable.

For a while now, I have been noticing that firefox is choosing to hit cloudflare for many of these requests despite being on VPN.

If I use about:networking#dnslookuptool to look up the hostname while keeping tcpdump running, I can see that the "IPs" section always shows the correct IPs, but the HTTPS RRs section alternates between NS_ERROR_UNKNOWN_HOST and the cloudflare IPs. Looking at the tcpdump logs, I see that firefox is intermittently querying 1.1.1.1 and 75.75.75.75 directly for the HTTPS RRs.

As far as I understand it, Firefox should never be querying these DNS servers, which are not present in resolv.conf. I'm actually confused about how firefox is even coming up with these nameservers:

  • 75.75.75.75 is one of Comcast's nameservers, which is returned by my router's DHCP response. Firefox must either be caching this from before I logged onto the VPN, or querying NetworkManager for it?
  • 1.1.1.1 is not configured anywhere in my network, and is not in the DHCP lease coming back from the router. The only way I can imagine this getting chosen is that Firefox has it hardcoded for DoH and is somehow choosing to use it for HTTPS RRs despite DoH being off.

When DoH is disabled, I expect firefox to only ever query DNS records against the servers that exist in my resolv.conf file, so it seems like there is a bug in the HTTPS RR lookup causing it to fall back to public DNS.

The Bugbug bot thinks this bug should belong to the 'Core::Networking: DNS' component, and is moving the bug to that component. Please correct in case you think the bot is wrong.

Component: Untriaged → Networking: DNS
Product: Firefox → Core

Thanks for reporting! To help diagnose this issue better, please can you:

  1. Type "about:support" in Firefox and copy-paste its contents here
  2. Capture a log when the issue occurs:
  • In Firefox, ideally freshly started with no other tabs, go to about:logging in a new tab
  • Select the "Networking" preset
  • Enable "stack traces for log messages"
  • Click on "Set Log Modules"
  • Click Start Logging
  • Reproduce the bug
  • Back on about:logging, click Stop Logging
  • In the new tab that appears with the Firefox Profiler web application, in the top right click the button to upload the profile
  • Make sure hidden threads are included and upload, then share the link here or send privately to a Mozilla developer
    -See https://paul.cx/public/about-logging-presentation.webm for a video walk-through
    Once the profile is captured, please share it privately on "necko@mozilla.com", quoting this bug number.
Flags: needinfo?(paulfurtado91)

While trying things to make a minimal reproducer, I think I discovered what the bug really is:
Firefox is using glibc's functions for resolving A records when DoH is disabled, and glibc's resolver reloads /etc/resolv.conf correctly, so the A record resolution always looks correct. However, when performing the HTTPS record lookups, it is using a DNS resolver library, which seems to have a bug with reloading nameservers from resolv.conf.

The steps to reproduce are:

  1. Update /etc/resolv.conf to contain only: nameserver 1.1.1.1
  2. Start running tcpdump. The command I used to show minimal noise is: sudo tcpdump -i any -nnep udp dst port 53
  3. Start firefox. It will do a bunch of random DNS queries on startup, so I find it helpful to hit enter a few times in the tcpdump window to visually separate the queries.
  4. go to about:networking#dnslookuptool, enter "google.com" and click "resolve". At this point, tcpdump will show:
    23:43:58.223985 wlp0s20f3 Out ifindex 3 a4:42:3b:0a:27:d3 ethertype IPv4 (0x0800), length 76: 10.0.0.107.48945 > 1.1.1.1.53: 6766+ A? google.com. (28)
    23:43:58.224089 wlp0s20f3 Out ifindex 3 a4:42:3b:0a:27:d3 ethertype IPv4 (0x0800), length 76: 10.0.0.107.50287 > 1.1.1.1.53: 32402+ HTTPS? google.com. (28)
    
    which shows that one A query was made against 1.1.1.1 and one HTTPS query was made against 1.1.1.1. All is well.
  5. Now, go back to resolv.conf and update the contents to be only: nameserver 8.8.8.8
  6. Return to firefox and click "resolve" again. tcpdump will show:
    23:46:28.852025 wlp0s20f3 Out ifindex 3 a4:42:3b:0a:27:d3 ethertype IPv4 (0x0800), length 76: 10.0.0.107.54025 > 8.8.8.8.53: 28543+ A? google.com. (28)
    23:46:28.852069 wlp0s20f3 Out ifindex 3 a4:42:3b:0a:27:d3 ethertype IPv4 (0x0800), length 76: 10.0.0.107.51413 > 8.8.8.8.53: 2131+ HTTPS? google.com. (28)
    
    Both queries went to 8.8.8.8 as we expect.
  7. Now, go back to resolv.conf and update the contents to be only: nameserver 8.8.4.4
  8. Now return to firefox and click "resolve" again. This time, tcpdump shows:
    23:47:18.751674 wlp0s20f3 Out ifindex 3 a4:42:3b:0a:27:d3 ethertype IPv4 (0x0800), length 76: 10.0.0.107.48610 > 8.8.4.4.53: 32807+ A? google.com. (28)
    23:47:18.751716 wlp0s20f3 Out ifindex 3 a4:42:3b:0a:27:d3 ethertype IPv4 (0x0800), length 76: 10.0.0.107.45323 > 8.8.8.8.53: 22027+ HTTPS? google.com. (28)
    
    this shows that the A query went to 8.8.4.4 as we expected, but the HTTPS query remained with 8.8.8.8.
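To make the check in the steps above less manual, here is a minimal Python sketch that scans tcpdump output lines and reports any DNS query sent to a server other than the ones currently expected. The function names (`parse_queries`, `stray_queries`) and the regular expression are my own illustration, not part of Firefox or tcpdump, and the pattern assumes the line layout shown in the captures above (IPv4, destination port 53). The sample lines are the two from step 8.

```python
import re

# Matches the "> SERVER.53: ID+ TYPE? NAME" part of a tcpdump DNS query line.
# Assumption: IPv4 only, and the same output layout as the captures in this bug.
QUERY_RE = re.compile(
    r"> (?P<server>\d{1,3}(?:\.\d{1,3}){3})\.53: "
    r"\d+\S* (?P<qtype>[A-Za-z0-9]+)\? (?P<name>\S+)"
)

# The two lines captured in step 8 of the reproduction.
SAMPLE_LINES = [
    "23:47:18.751674 wlp0s20f3 Out ifindex 3 a4:42:3b:0a:27:d3 ethertype IPv4 "
    "(0x0800), length 76: 10.0.0.107.48610 > 8.8.4.4.53: 32807+ A? google.com. (28)",
    "23:47:18.751716 wlp0s20f3 Out ifindex 3 a4:42:3b:0a:27:d3 ethertype IPv4 "
    "(0x0800), length 76: 10.0.0.107.45323 > 8.8.8.8.53: 22027+ HTTPS? google.com. (28)",
]


def parse_queries(lines):
    """Return (qtype, server, name) for every line that looks like a DNS query."""
    queries = []
    for line in lines:
        match = QUERY_RE.search(line)
        if match:
            queries.append(
                (match.group("qtype"), match.group("server"), match.group("name"))
            )
    return queries


def stray_queries(lines, expected_servers):
    """Return the queries whose destination is not in expected_servers."""
    return [q for q in parse_queries(lines) if q[1] not in expected_servers]


if __name__ == "__main__":
    # resolv.conf contained only 8.8.4.4 at this point in the reproduction.
    for qtype, server, name in stray_queries(SAMPLE_LINES, {"8.8.4.4"}):
        print(f"{qtype} query for {name} went to unexpected server {server}")
```

With the step 8 capture and 8.8.4.4 as the only expected server, only the HTTPS query is flagged; the A query is not.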

I have repeated this process a number of times and there is some variation:

  • sometimes, the first time I swap the nameserver the problem reproduces immediately
  • sometimes, when you return to firefox and click "resolve", it doesn't make the HTTPS query (due to cache?) but waiting a few seconds and trying again usually makes it happen
  • on one round of testing, when I swapped nameservers the second time (from 8.8.8.8 to 8.8.4.4), instead of sticking with 8.8.8.8, the resolver actually reverted to 1.1.1.1!

To avoid adding noise to the thread, I have uploaded my about:support to a gist rather than inlining it in this message.

Flags: needinfo?(paulfurtado91)

I think the problem here is that we only call res_ninit once at the beginning of the thread, then keep using it.
While regular A and AAAA lookups are made with getaddrinfo, which uses the global _res that does get updated on resolv.conf changes, the HTTPS record lookup does not.
We should be able to reinitialize resState when we detect changes to _res.nscount or _res.nsaddr_list
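As a rough model of the idea in this comment (not the actual Necko code, which is C++ and uses the libc resolver), the Python sketch below keeps a snapshot of the nameserver list taken at initialization and reinitializes when the current list differs. Everything here is hypothetical: `HttpsRecordResolver` stands in for the per-thread resState, and the `read_config` callback stands in for the global `_res` that glibc keeps up to date; comparing the two lists plays the role of comparing `nscount` and `nsaddr_list`.

```python
def parse_nameservers(resolv_conf_text):
    """Extract nameserver addresses from resolv.conf-style text, in order."""
    servers = []
    for raw in resolv_conf_text.splitlines():
        line = raw.strip()
        # Lines starting with '#' or ';' are comments in resolv.conf.
        if not line or line[0] in "#;":
            continue
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "nameserver":
            servers.append(parts[1])
    return servers


class HttpsRecordResolver:
    """Hypothetical stand-in for the resolver state used for HTTPS lookups."""

    def __init__(self, read_config):
        # read_config is a callable returning the current resolv.conf text;
        # in this model it plays the role of the always-fresh global state.
        self._read_config = read_config
        self.init_count = 0
        self._reinit()

    def _reinit(self):
        # Models calling the init routine again to rebuild the cached state.
        self.nameservers = parse_nameservers(self._read_config())
        self.init_count += 1

    def servers_for_query(self):
        """Return the servers to query, reinitializing if the config changed."""
        current = parse_nameservers(self._read_config())
        if current != self.nameservers:
            self._reinit()
        return list(self.nameservers)


if __name__ == "__main__":
    config = {"text": "nameserver 8.8.8.8\n"}
    resolver = HttpsRecordResolver(lambda: config["text"])
    print(resolver.servers_for_query())
    config["text"] = "nameserver 8.8.4.4\n"  # same edit as step 7
    print(resolver.servers_for_query())
```

Without the comparison in `servers_for_query`, the second call would keep returning the snapshot taken at startup, which matches the stale 8.8.8.8 seen in step 8.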

Assignee: nobody → valentin.gosu
Blocks: httpssvc
Severity: -- → S3
Priority: -- → P2
Whiteboard: [necko-triaged]