Closed Bug 1643866 Opened 4 years ago Closed 3 years ago

Too small PR_NETDB_BUF_SIZE causes pr_GetAddrInfoByNameFB() to fail

Categories

(NSPR :: NSPR, defect)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: ceimoh3on3ox7io7, Assigned: KaiE)

Attachments

(3 files)

User Agent: Mozilla/5.0 (X11; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0

Steps to reproduce:

Try to access page "https://www.freefilefillableforms.com/".

The user (me) is an experienced user, sysadmin, network engineer, amateur coder.

The env is Debian Linux unstable, Firefox is Linux x86_64.

Actual results:

Firefox says "Hmm. We’re having trouble finding that site. We can’t connect to the server at www.freefilefillableforms.com."

This appears to be a DNS lookup failure, but it isn't. Instead, Firefox does not appear to be doing a DNS lookup at all.

I tried using the about:networking#dnslookuptool and put in "www.freefilefillableforms.com", which results in a NS_ERROR_UNKNOWN_HOST error.

DNS on the local host works fine. I am not aware of anything which might be affecting DNS for this particular domain.

Other browsers on the same host work fine. DNS lookups for the problematic domain from the command line with dig/ping work fine.

All other websites/DNS lookups appear to work fine with this Firefox install. Only this one domain appears to be problematic.

I downloaded a completely new firefox tar.bz2 file and set up an installation. Started with firefox --new-instance --safe-mode --ProfileManager. Ran as basic/vanilla as possible. The problem continues.

It appears to be some issue with the local OS environment, but I'm not sure what it might be.

I have a second computer with a similar Debian unstable installation with very similar packages and setup, but it does not have this same problem.

I tried some network logging and got the following output:

2020-06-06 00:52:21.483568 UTC - [Parent 3180: Main Thread]: D/nsHostResolver Resolving host [www.freefilefillableforms.com] type 0. [this=0x7f23de995040]
2020-06-06 00:52:21.483593 UTC - [Parent 3180: Main Thread]: D/nsHostResolver   No usable record in cache for host [www.freefilefillableforms.com] type 0.
2020-06-06 00:52:21.483598 UTC - [Parent 3180: Main Thread]: D/nsHostResolver NameLookup: www.freefilefillableforms.com effectiveTRRmode: 1
2020-06-06 00:52:21.483612 UTC - [Parent 3180: Main Thread]: D/nsHostResolver   DNS thread counters: total=2 any-live=0 idle=2 pending=1
2020-06-06 00:52:21.483617 UTC - [Parent 3180: Main Thread]: D/nsHostResolver   DNS lookup for host [www.freefilefillableforms.com] blocking pending 'getaddrinfo' or trr query: callback [0x7f23d02fe660]
2020-06-06 00:52:21.483631 UTC - [Parent 3180: DNS Resolver #2]: E/nsHostResolver DNS lookup thread - Calling getaddrinfo for host [www.freefilefillableforms.com].
2020-06-06 00:52:21.559784 UTC - [Parent 3180: DNS Resolver #2]: D/nsHostResolver Calling 'res_ninit'.
2020-06-06 00:52:21.562595 UTC - [Parent 3180: DNS Resolver #2]: E/nsHostResolver DNS lookup thread - lookup completed for host [www.freefilefillableforms.com]: failure: unknown host.
2020-06-06 00:52:21.562610 UTC - [Parent 3180: DNS Resolver #2]: D/nsHostResolver nsHostResolver::CompleteLookup www.freefilefillableforms.com (nil) 804B001E trr=0 stillResolving=0
2020-06-06 00:52:21.562614 UTC - [Parent 3180: DNS Resolver #2]: D/nsHostResolver nsHostResolver record 0x7f23e19201c0 new gencnt
2020-06-06 00:52:21.562618 UTC - [Parent 3180: DNS Resolver #2]: D/nsHostResolver Caching host [www.freefilefillableforms.com] negative record for 60 seconds.
2020-06-06 00:52:21.562621 UTC - [Parent 3180: DNS Resolver #2]: D/nsHostResolver CompleteLookup: www.freefilefillableforms.com has NO address
2020-06-06 00:52:21.562624 UTC - [Parent 3180: DNS Resolver #2]: D/nsHostResolver nsHostResolver record 0x7f23e19201c0 calling back dns users

Expected results:

Firefox should have done a DNS lookup. It did not.

Bugbug thinks this bug should belong to this component, but please revert this change in case of error.

Component: Untriaged → Networking: DNS
Product: Firefox → Core

I can't reproduce this bug (nslookup, Firefox Nightly and Release, Windows 10)

Attached file dns_hook.c

You could check what glibc resolving functions return using this small program that hooks selected functions. Compile it:

gcc -shared -fPIC dns_hook.c -o dns_hook.so -ldl

And then run firefox as follows:

LD_PRELOAD=/path/to/dns_hook.so ./firefox --no-remote
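
For readers without the attachment, a hook of this kind could look roughly like the following. This is a minimal sketch, not the attached dns_hook.c: it assumes glibc on Linux, wraps only gethostbyname_r(), and its log format merely mimics the output quoted later in this bug (the "ok" label for success is an assumption).

```c
#define _GNU_SOURCE          /* needed for RTLD_NEXT */
#include <assert.h>
#include <dlfcn.h>
#include <netdb.h>
#include <stdio.h>

typedef int (*ghbn_r_fn)(const char *, struct hostent *, char *, size_t,
                         struct hostent **, int *);

/* Interposes the C library's gethostbyname_r(): forwards the call to the
 * real function, then logs the host name and return value to stderr. */
int gethostbyname_r(const char *name, struct hostent *ret, char *buf,
                    size_t buflen, struct hostent **result, int *h_errnop)
{
    static ghbn_r_fn real;   /* looked up lazily on first call */

    if (!real) {
        real = (ghbn_r_fn)dlsym(RTLD_NEXT, "gethostbyname_r");
        assert(real != NULL);
    }

    int rc = real(name, ret, buf, buflen, result, h_errnop);

    /* *result is only inspected when the call itself reported success. */
    fprintf(stderr, "gethostbyname_r=%s, retval=%d, %s\n", name, rc,
            (rc == 0 && *result != NULL) ? "ok" : "failed");
    return rc;
}
```

Built as a shared object and loaded through LD_PRELOAD as described above, it prints one line per lookup. A retval of 34 on Linux is ERANGE, meaning the caller's buffer was too small.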
Flags: needinfo?(ceimoh3on3ox7io7)

I have my system cloned into a VM, so it's easy to reproduce the issue there and do destructive system testing.

I compiled and ran the tool you gave me. All other domains resolve fine and return 0s, but when I get to this one....

gethostbyname_r=www.freefilefillableforms.com, retval=34, failed
gethostbyname_r=www.freefilefillableforms.com, retval=34, failed
gethostbyname_r=www.freefilefillableforms.com, retval=34, failed
gethostbyname_r=www.freefilefillableforms.com, retval=34, failed
^CExiting due to channel error.
Exiting due to channel error.
Exiting due to channel error.

[~/bin/installs/firefox]
bg@desktest-->dig www.freefilefillableforms.com

; <<>> DiG 9.16.2-Debian <<>> www.freefilefillableforms.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 27194
;; flags: qr rd ra; QUERY: 1, ANSWER: 13, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
;; QUESTION SECTION:
;www.freefilefillableforms.com. IN      A

;; ANSWER SECTION:
www.freefilefillableforms.com. 9 IN     CNAME   ffffgateway.api.intuit.com.
ffffgateway.api.intuit.com. 9   IN      CNAME   ffffgateway.prd.api.a.intuit.com.
ffffgateway.prd.api.a.intuit.com. 4 IN  CNAME   ffffgateway-us-west-2.prd.api.a.intuit.com.
ffffgateway-us-west-2.prd.api.a.intuit.com. 4 IN CNAME sw3_us-west-2_web.prd.api.a.intuit.com.
sw3_us-west-2_web.prd.api.a.intuit.com. 59 IN CNAME sw3prdwebbluealb-230791229.us-west-2.elb.amazonaws.com.
sw3prdwebbluealb-230791229.us-west-2.elb.amazonaws.com. 39 IN A 44.227.75.186
sw3prdwebbluealb-230791229.us-west-2.elb.amazonaws.com. 39 IN A 54.218.237.204
sw3prdwebbluealb-230791229.us-west-2.elb.amazonaws.com. 39 IN A 52.42.248.199
sw3prdwebbluealb-230791229.us-west-2.elb.amazonaws.com. 39 IN A 54.71.25.45
sw3prdwebbluealb-230791229.us-west-2.elb.amazonaws.com. 39 IN A 54.185.76.99
sw3prdwebbluealb-230791229.us-west-2.elb.amazonaws.com. 39 IN A 34.217.163.192
sw3prdwebbluealb-230791229.us-west-2.elb.amazonaws.com. 39 IN A 35.161.90.144
sw3prdwebbluealb-230791229.us-west-2.elb.amazonaws.com. 39 IN A 52.24.4.127

;; Query time: 56 msec
;; SERVER: 192.168.149.1#53(192.168.149.1)
;; WHEN: Mon Jun 15 16:18:40 PDT 2020
;; MSG SIZE  rcvd: 392

Hmmm... that's a lotta Fs in those CNAMEs. I wonder...

Yes, this is an ipv6 issue.

This system has linux kernel argument ipv6.disable=1 set. Removing this makes the problem go away.

I see firefox has a "network.dns.disableIPv6" option. I set this, but it still doesn't work.

If this was a greater ipv6 issue I would expect to see more problems, but only this one domain is known to be affected.

Flags: needinfo?(ceimoh3on3ox7io7)

I have a couple of domain names on DNS servers. Unfortunately I can't share them here, but I set up some records to test with:

test1                  IN    CNAME   ffffgateway.api.intuit.com.
www.freefilefillableforms.com   IN    CNAME   ffffgateway.api.intuit.com.

The first record works fine and the CNAME is resolved. The second record reproduces our broken behavior. Firefox fails to resolve the second entry, but it works in every other program on the same system and I can dig the record.

These DNS servers also have ipv6 entirely disabled.

So Firefox is definitely doing something wrong where the behavior is dependent upon something in that second resource record's name.

But wait, it gets stranger.

something    IN    CNAME   ffffgateway.api.intuit.com.
www.ffff.com    IN    CNAME   ffffgateway.api.intuit.com.
ffff    IN    CNAME   ffffgateway.api.intuit.com.

The first and second records reproduce our broken behavior. The second one works fine. I have no idea why.

I did a tcpdump on my firewall/router, which runs dnsmasq. I can see the DNS requests and replies coming back. Looks like Firefox just didn't like what it's getting back from the local resolver?

19:58:17.858863 IP desktest.lan.43301 > firewall.53: 42051+ A? www.freefilefillableforms.com. (47)
19:58:17.937496 IP firewall.53 > desktest.lan.43301: 42051 13/0/0 CNAME ffffgateway.api.intuit.com., CNAME ffffgateway.prd.api.a.intuit.com., CNAME ffffgateway-us-west-2.prd.api.a.intuit.com., CNAME sw3_us-west-2_web.prd.api.a.intuit.com., CNAME sw3prdwebbluealb-230791229.us-west-2.elb.amazonaws.com., A 54.185.76.99, A 52.39.222.244, A 44.227.75.186, A 34.217.163.192, A 52.24.4.127, A 52.42.248.199, A 54.218.237.204, A 54.71.25.45 (381)
19:58:17.940951 IP desktest.lan.35406 > firewall.53: 20891+ A? www.freefilefillableforms.com. (47)
19:58:17.941447 IP firewall.53 > desktest.lan.35406: 20891 13/0/0 CNAME ffffgateway.api.intuit.com., CNAME ffffgateway.prd.api.a.intuit.com., CNAME ffffgateway-us-west-2.prd.api.a.intuit.com., CNAME sw3_us-west-2_web.prd.api.a.intuit.com., CNAME sw3prdwebbluealb-230791229.us-west-2.elb.amazonaws.com., A 54.71.25.45, A 54.218.237.204, A 52.42.248.199, A 52.24.4.127, A 34.217.163.192, A 44.227.75.186, A 52.39.222.244, A 54.185.76.99 (437)

Sorry, I meant to say the third entry, ffff.mydomain.com, works fine.

The data contents of the DNS lookups from a working and non-working hostname above are basically identical (TTLs and a few other expected differences in a DNS lookup). The only thing different is the RR name and how firefox is behaving for each.

Given that this is all ipv4, I'm not sure why enabling ipv6 on the local host makes the problem go away. Note that the firewall has no real ipv6 configured beyond LL either.

(In reply to BG from comment #4)

> Yes, this is an ipv6 issue.
>
> This system has linux kernel argument ipv6.disable=1 set. Removing this makes the problem go away.

I'm able to reproduce it with ipv6.disable=1. When ipv6 is disabled, we call gethostbyname_r() (via pr_GetAddrInfoByNameFB()) instead of getaddrinfo(). The buffer in the PRAddrInfoFB struct is 1024 bytes, which is too small for resolving this host. It seems we cannot do much about it in necko code, so the buffer size needs to be changed in NSPR. Valentin, what do you think?

Flags: needinfo?(valentin.gosu)
Attached file test

Simple test that can be used to see the problem.

Wow, great find Michal!
I agree, there doesn't seem to be an easy way around this bug. It needs to be fixed in NSPR.

Component: Networking: DNS → NSPR
Flags: needinfo?(valentin.gosu)
Product: Core → NSPR
QA Contact: jjones
Summary: Firefox fails to resolve specific domain: NS_ERROR_UNKNOWN_HOST → Too small PR_NETDB_BUF_SIZE causes pr_GetAddrInfoByNameFB() to fail
Version: 77 Branch → other

Thanks Michal.

The first question is, what buffer size should we use instead? I find very little information on a suggested size.

One place mentions that 2K should be more than enough. In another place I read a recommendation that responses should usually be at most 512 bytes, to work with UDP. So, given that a 1K buffer has worked until today, maybe changing it to 2K might be reasonable.

The trivial change could be to simply change the constant PR_NETDB_BUF_SIZE to a larger value.

However, I wonder if this trivial approach could break some applications. The API docs say, the output buffer passed to some of the NSPR DNS functions must be at least of size PR_NETDB_BUF_SIZE.

NSPR gives a promise that bundling future versions of NSPR with an existing binary application will work, even without recompiling the application. Changing the buffer size would break that promise, because applications would continue to use the smaller buffer, and at runtime the new NSPR would decide that the buffer is insufficient.

Even if applications are recompiled, we might break them, if they use a buffer size that is sufficiently large today, but which is smaller than the new value of PR_NETDB_BUF_SIZE (the application doesn't use constant PR_NETDB_BUF_SIZE to define the size of their output buffer).

Actually, it looks like I exaggerated the situation.

Most functions already will use a larger buffer, if the caller provides one.
It's just about the specific place that Michal pointed out, which is always using the small size. So changing that to a higher value should be trivial.

We can try with a 2 K buffer and see if that works.

Assignee: nobody → kaie

(In reply to Kai Engert (:KaiE:) from comment #12)

> The first question is, what buffer size should we use instead? I find very little information on a suggested size.
>
> One place mentions that 2K should be more than enough. In another place I read a recommendation that responses should usually be at most 512 bytes, to work with UDP. So, given that a 1K buffer has worked until today, maybe changing it to 2K might be reasonable.

In this particular case, 1300 bytes seems to be enough, so let's say 2K should be fine. Maybe the code should handle the ERANGE error and try again with a larger buffer, as suggested by the manual page or e.g. here: https://stackoverflow.com/questions/6517478/how-to-use-gethostbyname-r-in-linux
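
The ERANGE retry could be sketched as follows. This is only an illustration of the pattern from the manual page, not NSPR code: lookup_with_retry() and its parameters are hypothetical names, and the doubling policy and size cap are assumptions.

```c
#define _GNU_SOURCE          /* gethostbyname_r() is a GNU extension */
#include <assert.h>
#include <errno.h>
#include <netdb.h>
#include <stdlib.h>

/* Hypothetical helper: resolve `name` with gethostbyname_r(), doubling the
 * scratch buffer each time the call reports ERANGE. On success it returns 0,
 * fills *he, and stores the malloc'ed scratch buffer in *buf_out; *he points
 * into that buffer, so the caller frees it only when done with *he.
 * On failure it returns nonzero and leaves *buf_out set to NULL. */
int lookup_with_retry(const char *name, struct hostent *he, char **buf_out,
                      size_t initial_len, size_t max_len)
{
    assert(name != NULL && he != NULL && buf_out != NULL && initial_len > 0);

    size_t len = initial_len;
    char *buf = NULL;
    *buf_out = NULL;

    for (;;) {
        char *tmp = realloc(buf, len);
        if (tmp == NULL) {
            free(buf);
            return ENOMEM;
        }
        buf = tmp;

        struct hostent *result = NULL;
        int herr = 0;
        int rc = gethostbyname_r(name, he, buf, len, &result, &herr);

        if (rc == 0 && result != NULL) {
            *buf_out = buf;
            return 0;
        }
        /* Give up on any error other than "buffer too small", or once the
         * buffer has reached the cap. */
        if (rc != ERANGE || len >= max_len) {
            free(buf);
            return rc != 0 ? rc : -1;   /* rc == 0 with no result: not found */
        }
        len *= 2;
    }
}
```

Starting at 1024 bytes with a cap of a few kilobytes would cover the 1300-byte case mentioned above without changing behavior for hosts that already fit.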

Also a question is why pr_GetAddrInfoByNameFB() is used when ipv6 isn't present even if GetAddrInfo() is available.

Regarding timing:

For other reasons, we will create a new NSPR 4.26 release for FF 79. But because of our usual testing periods, we're already at the end of the possible development period. With the usual one week of testing in FF nightly, prior to declaring an NSPR snapshot as stable, we should get this NSPR change finalized by June 22 at the latest.

Should we try to target this fix for the upcoming NSPR 4.26 release, and use the minimal approach, increasing the buffer size?

My schedule doesn't allow me to help much with the code for this bug this week. So if you want the better fix suggested by Michal in comment 16, I'd need you to provide patches and reviews, so I could only focus on the mechanics of landing and releasing.

If we don't agree on the simple buffer increase, and if we don't get help this week, then we'll have to postpone this fix to a later NSPR release and FF 80.

I personally would prefer to do a quick buffer size fix for 79 and try to do a better fix later.

(In reply to Kai Engert (:KaiE:) from comment #12)

> Thanks Michal.
>
> The first question is, what buffer size should we use instead? I find very little information on a suggested size.
>
> One place mentions that 2K should be more than enough. In another place I read a recommendation that responses should usually be at most 512 bytes, to work with UDP. So, given that a 1K buffer has worked until today, maybe changing it to 2K might be reasonable.
>
> The trivial change could be to simply change the constant PR_NETDB_BUF_SIZE to a larger value.
>
> However, I wonder if this trivial approach could break some applications. The API docs say, the output buffer passed to some of the NSPR DNS functions must be at least of size PR_NETDB_BUF_SIZE.
>
> NSPR gives a promise that bundling future versions of NSPR with an existing binary application will work, even without recompiling the application. Changing the buffer size would break that promise, because applications would continue to use the smaller buffer, and at runtime the new NSPR would decide that the buffer is insufficient.

Nothing in NSPR (AFAICT) is using PR_NETDB_BUF_SIZE as the size of a buffer given by the API caller. Callers are expected to give the size of the buffer they pass, and several callers in Firefox actually do that. If there's a problem within NSPR with the current size, I would expect those other callers in Firefox would have a problem too.

> Even if applications are recompiled, we might break them, if they use a buffer size that is sufficiently large today, but which is smaller than the new value of PR_NETDB_BUF_SIZE (the application doesn't use constant PR_NETDB_BUF_SIZE to define the size of their output buffer).

And that's already a problem with the current PR_NETDB_BUF_SIZE, as per the paragraph above about Firefox, so it wouldn't be a new problem.

Mike, my pondering from comment 12 seems unnecessary for this issue, as I tried to clarify in comment 13.

(In reply to Mike Hommey [:glandium] from comment #19)

> Nothing in NSPR (AFAICT) is using PR_NETDB_BUF_SIZE as the size of a buffer given by the API caller. Callers are expected to give the size of the buffer they pass, and several callers in Firefox actually do that. If there's a problem within NSPR with the current size, I would expect those other callers in Firefox would have a problem too.

Some NSPR functions reject a call with PR_INVALID_ARGUMENT_ERROR, if the given buffer size is smaller than PR_NETDB_BUF_SIZE.

If we increased PR_NETDB_BUF_SIZE, then this might cause some existing calls by applications to fail.

My worries were about that potential scenario. But it looks like we don't need to change the value of PR_NETDB_BUF_SIZE, because the callers already have flexibility to use a larger buffer, if they want to.

To summarize the latest state:

  • Don't change PR_NETDB_BUF_SIZE, because in most scenarios the caller of an API is able to use a larger buffer, if necessary

  • only change the place that Michal identified, which cannot be influenced by API callers.
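
The second point could look roughly like this. It is a sketch under assumptions, not the attached patch: struct addr_info_fb stands in for PRAddrInfoFB, and INTERNAL_FB_BUF_SIZE and fb_lookup() are made-up names. The 2048-byte size follows the 2K figure discussed earlier in this bug.

```c
#define _GNU_SOURCE          /* gethostbyname_r() is a GNU extension */
#include <assert.h>
#include <netdb.h>
#include <stddef.h>

#define PUBLIC_NETDB_BUF_SIZE 1024   /* stand-in for PR_NETDB_BUF_SIZE, unchanged */
#define INTERNAL_FB_BUF_SIZE  2048   /* hypothetical larger, library-private size */

/* Hypothetical stand-in for the fallback struct: the scratch buffer lives
 * inside the library's own struct, so API callers never see its size. */
struct addr_info_fb {
    char buf[INTERNAL_FB_BUF_SIZE];
    struct hostent hostent;
};

/* Returns 0 on success, -1 if the host was not found, or the error code
 * from gethostbyname_r() (for example ERANGE) otherwise. */
int fb_lookup(const char *name, struct addr_info_fb *ai)
{
    assert(name != NULL && ai != NULL);

    struct hostent *result = NULL;
    int herr = 0;
    int rc = gethostbyname_r(name, &ai->hostent, ai->buf, sizeof ai->buf,
                             &result, &herr);
    if (rc != 0)
        return rc;
    return result != NULL ? 0 : -1;
}
```

Because the public constant keeps its value, existing callers that pass their own buffers see no change; only the internal fallback path gets more room.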

With this explanation, is the attached patch acceptable?

Flags: needinfo?(mh+mozilla)

The r+ hasn't arrived in time.
This will miss 4.26 and FF 79.

> If we increased PR_NETDB_BUF_SIZE, then this might cause some existing calls by applications to fail.

Not if you change those tests that return PR_INVALID_ARGUMENT_ERROR to check for the old value (i.e. half the new one)
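
One possible reading of this suggestion, sketched with made-up names (LEGACY_MIN_BUF_SIZE, NEW_NETDB_BUF_SIZE and caller_buffer_acceptable() are not NSPR identifiers): the public constant doubles, while the argument check keeps accepting buffers of the old size.

```c
#include <assert.h>
#include <stddef.h>

/* Size that already-compiled applications were built against. */
#define LEGACY_MIN_BUF_SIZE 1024
/* Hypothetical enlarged value of the public constant. */
#define NEW_NETDB_BUF_SIZE  2048

static_assert(NEW_NETDB_BUF_SIZE == 2 * LEGACY_MIN_BUF_SIZE,
              "the argument check assumes the old value is half the new one");

/* Returns 1 if a caller-supplied buffer should be accepted, 0 if the call
 * should be rejected as an invalid argument. The check uses the old minimum,
 * so binaries built against the smaller constant keep working. */
int caller_buffer_acceptable(size_t bufsize)
{
    return bufsize >= NEW_NETDB_BUF_SIZE / 2;
}
```

Newly compiled callers that size their buffers with the constant get 2048 bytes automatically, while a 1024-byte buffer from an old binary still passes the check.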

Flags: needinfo?(mh+mozilla)

The severity field is not set for this bug.
:KaiE, could you have a look please?

For more information, please visit auto_nag documentation.

Flags: needinfo?(kaie)
Severity: -- → S2
Flags: needinfo?(kaie)

Mike, IIUC, you don't like the suggestion to use the simpler attached patch. Instead, you ask that we increase the size of PR_NETDB_BUF_SIZE - correct?

I'm currently busy with other work, and find it difficult to focus on this bug.
If we don't want to land the attached, but a different solution should be used, I'd appreciate help with a better patch.

Status: UNCONFIRMED → NEW
Ever confirmed: true

This one misses NSPR 4.27 and FF 80.

Hello, I appear to also have been stung by this bug... for several days recently.
Sometimes (!?) I could not sign in to Microsoft services using Firefox, and only Firefox!?

Dev Tools > Network indicated that nothing from logincdn.msauth.net was being fetched?

about:networking#dnslookuptool claimed that logincdn.msauth.net was an unknown host?

All other software I tried had no problems resolving and fetching data from logincdn.msauth.net?

> host logincdn.msauth.net # example resolving logincdn.msauth.net from command line...
logincdn.msauth.net is an alias for lgincdn.trafficmanager.net.
lgincdn.trafficmanager.net is an alias for lgincdnmsftuswe2.azureedge.net.
lgincdnmsftuswe2.azureedge.net is an alias for lgincdnmsftuswe2.afd.azureedge.net.
lgincdnmsftuswe2.afd.azureedge.net is an alias for star-azureedge-prod.trafficmanager.net.
star-azureedge-prod.trafficmanager.net is an alias for dual.t-0009.t-msedge.net.
dual.t-0009.t-msedge.net is an alias for t-0009.t-msedge.net.
t-0009.t-msedge.net is an alias for Edge-Prod-EWR30r3.ctrl.t-0009.t-msedge.net.
Edge-Prod-EWR30r3.ctrl.t-0009.t-msedge.net is an alias for standard.t-0009.t-msedge.net.
standard.t-0009.t-msedge.net has address 13.107.246.19
standard.t-0009.t-msedge.net has address 13.107.213.19
standard.t-0009.t-msedge.net has IPv6 address 2620:1ec:bdf::19
standard.t-0009.t-msedge.net has IPv6 address 2620:1ec:46::19

I intercepted DNS calls with a DNS proxy, and DNS server was returning correct responses to Firefox?

Running Firefox with nsHostResolver debug log did not provide much more clues:

> NSPR_LOG_MODULES=nsHostResolver:5 NSPR_LOG_FILE=ffdnslog.txt firefox-esr
[Parent 13683: Main Thread]: D/nsHostResolver Resolving host [logincdn.msauth.net]<> type 0. [this=0x7f0fadc2e2e0]
[Parent 13683: Main Thread]: D/nsHostResolver   No usable record in cache for host [logincdn.msauth.net] type 0.
[Parent 13683: Main Thread]: D/nsHostResolver NameLookup host:logincdn.msauth.net af:2
[Parent 13683: Main Thread]: D/nsHostResolver NameLookup: logincdn.msauth.net effectiveTRRmode: 1 flags: 0
[Parent 13683: Main Thread]: D/nsHostResolver   DNS thread counters: total=1 any-live=0 idle=1 pending=1
[Parent 13683: Main Thread]: D/nsHostResolver   DNS lookup for host [logincdn.msauth.net] blocking pending 'getaddrinfo' or trr query: callback [0x7f0fa8dceea0]
[Parent 13683: DNS Resolver #1]: E/nsHostResolver DNS lookup thread - Calling getaddrinfo for host [logincdn.msauth.net].
[Parent 13683: DNS Resolver #1]: D/nsHostResolver Calling 'res_ninit'.
[Parent 13683: DNS Resolver #1]: E/nsHostResolver DNS lookup thread - lookup completed for host [logincdn.msauth.net]: failure: unknown host.
[Parent 13683: DNS Resolver #1]: D/nsHostResolver nsHostResolver::CompleteLookup logincdn.msauth.net (nil) 804B001E trr=0 stillResolving=0

Based on that log I intercepted all getaddrinfo libc calls, and that gave me my first clue: Firefox and Thunderbird were the only programs on my system not calling getaddrinfo!?

...Which thankfully allowed me to find this thread (firefox and getaddrinfo)

I removed the ipv6.disable=1 boot option, and then I was able to log in again. Phew. 😌

For what it's worth, to my uninformed self it also seems that if getaddrinfo is available it should be used instead of the obsolete gethostbyname, regardless of whether IPv6 is disabled.
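
For illustration, an IPv4-only lookup through getaddrinfo() might look like the sketch below. It is not what NSPR or Firefox does; resolve_ipv4() is a hypothetical helper, and restricting the query with AF_INET is an assumption about how one would avoid depending on IPv6 support.

```c
#define _POSIX_C_SOURCE 200112L   /* for getaddrinfo() under strict modes */
#include <assert.h>
#include <netdb.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/* Resolve `name` to IPv4 addresses only. Stores up to `max` addresses in
 * `out` and returns how many were stored, or -1 if the lookup failed. */
int resolve_ipv4(const char *name, struct in_addr *out, int max)
{
    assert(name != NULL && out != NULL && max > 0);

    struct addrinfo hints;
    struct addrinfo *res = NULL;

    memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_INET;        /* ask for IPv4 results only */
    hints.ai_socktype = SOCK_STREAM;  /* avoid one entry per socket type */

    if (getaddrinfo(name, NULL, &hints, &res) != 0)
        return -1;

    int n = 0;
    for (struct addrinfo *p = res; p != NULL && n < max; p = p->ai_next) {
        const struct sockaddr_in *sin = (const struct sockaddr_in *)p->ai_addr;
        out[n++] = sin->sin_addr;
    }

    freeaddrinfo(res);
    return n;
}
```

getaddrinfo() allocates its own result list, so the caller never has to guess a buffer size for long CNAME chains like the one above.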

Mostly posting this here for others, in case, because I couldn't find anything about my problem anywhere.
Thanks!

What do we need to do to get the ball rolling on this one?

Flags: needinfo?(mh+mozilla)
Flags: needinfo?(kaie)

I tried to make progress, but didn't get sufficient feedback/explanations from glandium, and because I needed him to review the patch, didn't know how to write the patch in a way that is acceptable to him.

And I couldn't spend too much time on this.

I think it would be best if someone from the Mozilla network team could volunteer as a reviewer, review the existing patch and suggestions, and comment / explain / suggest.

Flags: needinfo?(kaie)

(In reply to Mike Hommey [:glandium] from comment #23)

> > If we increased PR_NETDB_BUF_SIZE, then this might cause some existing calls by applications to fail.
>
> Not if you change those tests that return PR_INVALID_ARGUMENT_ERROR to check for the old value (i.e. half the new one)

:glandium mentioned comment 23 as the blocker to r+ing this when pinged on Slack.
I assume it's referring to this code.

Flags: needinfo?(kaie)

Valentin and Mike, thanks for following up.
(Last year I was too busy to focus on this issue.)

I reread comment 19 and comment 23 and today they make perfect sense to me.
I've pushed an updated patch to phabricator.

Flags: needinfo?(kaie)
Blocks: 1715584
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → 4.32
Flags: needinfo?(mh+mozilla)