intermittent server not found
Categories
(Core :: Networking, defect, P5)
Tracking
()
People
(Reporter: rex, Assigned: michal)
Details
(Whiteboard: [necko-triaged][no-nag])
Attachments
(20 files)
172.67 KB,
text/plain
|
Details | |
929.96 KB,
text/plain
|
Details | |
552 bytes,
text/x-csrc
|
Details | |
373.55 KB,
text/plain
|
Details | |
973 bytes,
text/plain
|
Details | |
8.02 KB,
application/vnd.tcpdump.pcap
|
Details | |
3.92 KB,
application/vnd.tcpdump.pcap
|
Details | |
217.99 KB,
text/plain
|
Details | |
24 bytes,
application/vnd.tcpdump.pcap
|
Details | |
769 bytes,
application/zip
|
Details | |
801 bytes,
text/x-csrc
|
Details | |
20.75 KB,
text/plain
|
Details | |
383.27 KB,
text/plain
|
Details | |
965 bytes,
text/plain
|
Details | |
1.03 KB,
application/x-compressed-tar
|
Details | |
28.24 KB,
application/zip
|
Details | |
19.20 KB,
application/zip
|
Details | |
1.07 KB,
application/x-compressed-tar
|
Details | |
30.19 KB,
application/zip
|
Details | |
9.06 KB,
text/plain
|
Details |
Comment 3•6 years ago
|
||
Comment 7•6 years ago
|
||
Updated•6 years ago
|
Comment 10•6 years ago
|
||
Reporter | ||
Comment 11•6 years ago
|
||
Hi Honza! Thanks for the tip!
I'm currently unable to reproduce the issue at will with the current 65.0b8 beta. I will continue to use it until I notice an issue and then try to reproduce it while logging.
Reporter | ||
Comment 12•6 years ago
|
||
Started logging after the issue started while attempting to visit CNN.com. Was given "We can’t connect to the server at www.cnn.com" page. Started logging and hit refresh, and was shown the same page. Then stopped logging.
Reporter | ||
Comment 13•6 years ago
|
||
Aha! I started the log right before attempting to connect to CNN.com, received the "can't connect to the server" page, and stopped logging.
Comment 14•6 years ago
•
|
||
(In reply to rex from comment #13)
Created attachment 9034842 [details]
log.txt-main.9860Aha! I started the log right before attempting to connect to CNN.com, received the "can't connect to the server" page, and stopped logging.
Thank you.
Looks like your local resolver has an issue:
2019-01-07 22:07:23.681 ⁃ nsHttpChannel ⁃ 7f7450f0a800 ⁃ finished ⁃ status=804b001e ⁃ http-status=n/a ⁃ url=https://www.cnn.com/
2019-01-07 22:07:23.691 ⁃ nsSocketTransport ⁃ 7f74521a4c00 ⁃ released ⁃ origin=www.cnn.com:443
2019-01-07 22:07:23.691473 UTC - [Parent 9860: Socket Thread]: D/nsSocketTransport creating nsSocketTransport @0x7f74521a4c00
2019-01-07 22:07:23.691482 UTC - [Parent 9860: Socket Thread]: E/nsSocketTransport nsSocketTransport::Init [this=0x7f74521a4c00 host=www.cnn.com:443 origin=www.cnn.com:443 proxy=:0]
2019-01-07 22:07:23.691500 UTC - [Parent 9860: Socket Thread]: D/nsSocketTransport nsSocketTransport::SetConnectionFlags 0x7f74521a4c00 flags=0
flags = 0
2019-01-07 22:07:23.691525 UTC - [Parent 9860: Socket Thread]: D/nsSocketTransport nsSocketTransport::OpenOutputStream [this=0x7f74521a4c00 flags=2]
2019-01-07 22:07:23.691533 UTC - [Parent 9860: Socket Thread]: D/nsSocketTransport nsSocketTransport::PostEvent [this=0x7f74521a4c00 type=0 status=0 param=(nil)]
2019-01-07 22:07:23.691557 UTC - [Parent 9860: Socket Thread]: D/nsSocketTransport nsSocketTransport::OpenInputStream [this=0x7f74521a4c00 flags=2]
2019-01-07 22:07:23.691565 UTC - [Parent 9860: Socket Thread]: D/nsSocketTransport nsSocketTransport::PostEvent [this=0x7f74521a4c00 type=0 status=0 param=(nil)]
nsHalfOpenSocket @7f7454d67ae0 --> nsSocketTransport @7f74521a4c00
2019-01-07 22:07:23.692309 UTC - [Parent 9860: Socket Thread]: D/nsSocketTransport nsSocketTransport::OnSocketEvent [this=0x7f74521a4c00 type=0 status=0 param=(nil)]
2019-01-07 22:07:23.692324 UTC - [Parent 9860: Socket Thread]: D/nsSocketTransport nsSocketTransport::ResolveHost [this=0x7f74521a4c00 www.cnn.com:443]
2019-01-07 22:07:23.692335 UTC - [Parent 9860: Socket Thread]: E/nsSocketTransport nsSocketTransport::SendStatus [this=0x7f74521a4c00 status=804b0003]
804b0003 = STATUS_RESOLVING
2019-01-07 22:07:23.692 ⁃ nsHostResolver ⁃ log.txt-main!nsHostResolver::service ⁃ service
2019-01-07 22:07:23.692387 UTC - [Parent 9860: Socket Thread]: D/nsHostResolver Resolving host [www.cnn.com] type 0. [this=0x7f748678ceb0]
2019-01-07 22:07:23.692414 UTC - [Parent 9860: Socket Thread]: D/nsHostResolver No usable record in cache for host [www.cnn.com] type 0.
2019-01-07 22:07:23.692460 UTC - [Parent 9860: Socket Thread]: D/nsHostResolver DNS thread counters: total=2 any-live=0 idle=2 pending=1
2019-01-07 22:07:23.692484 UTC - [Parent 9860: Socket Thread]: D/nsHostResolver DNS lookup for host [www.cnn.com] blocking pending 'getaddrinfo' or trr query: callback [0x7f7454e6b700]
2019-01-07 22:07:23.692506 UTC - [Parent 9860: DNS Resolver #2]: E/nsHostResolver DNS lookup thread - Calling getaddrinfo for host [www.cnn.com].
2019-01-07 22:07:23.692511 UTC - [Parent 9860: Socket Thread]: D/nsSocketTransport nsSocketTransport::OnSocketEvent [this=0x7f74521a4c00 type=0 status=0 param=(nil)]
2019-01-07 22:07:23.692641 UTC - [Parent 9860: DNS Resolver #2]: D/nsHostResolver Calling 'res_ninit'.
2019-01-07 22:07:23.692738 UTC - [Parent 9860: DNS Resolver #2]: E/nsHostResolver DNS lookup thread - lookup completed for host [www.cnn.com]: failure: unknown host.
2019-01-07 22:07:23.692771 UTC - [Parent 9860: DNS Resolver #2]: D/nsHostResolver nsHostResolver::CompleteLookup www.cnn.com (nil) 804B001E trr=0 stillResolving=0
2019-01-07 22:07:23.692822 UTC - [Parent 9860: DNS Resolver #2]: D/nsHostResolver nsHostResolver record 0x7f7466ac4580 new gencnt
2019-01-07 22:07:23.692849 UTC - [Parent 9860: DNS Resolver #2]: D/nsHostResolver Caching host [www.cnn.com] negative record for 60 seconds.
2019-01-07 22:07:23.692903 UTC - [Parent 9860: DNS Resolver #2]: D/nsHostResolver CompleteLookup: www.cnn.com has NO address
2019-01-07 22:07:23.692939 UTC - [Parent 9860: DNS Resolver #2]: D/nsHostResolver nsHostResolver record 0x7f7466ac4580 calling back dns users
2019-01-07 22:07:23.693519 UTC - [Parent 9860: Socket Thread]: D/nsSocketTransport nsSocketTransport::OnLookupComplete: this=0x7f74521a4c00 status 804b001e.
I can see "DNS lookup thread - lookup completed for host [www.cnn.com]: failure: unknown host." from the call to getaddrinfo, which is even called twice (getaddrinfo, res_ninit, getaddrinfo).
If you open a terminal (command prompt) and do nslookup www.cnn.com
what result do you get?
What exactly is your local DNS setting?
Reporter | ||
Comment 15•6 years ago
|
||
Hey Honza!
rex@ofc-ate ~ $ nslookup www.cnn.com
Server: 127.0.1.1
Address: 127.0.1.1#53
Non-authoritative answer:
www.cnn.com canonical name = turner-tls.map.fastly.net.
Name: turner-tls.map.fastly.net
Address: 151.101.1.67
Name: turner-tls.map.fastly.net
Address: 151.101.65.67
Name: turner-tls.map.fastly.net
Address: 151.101.129.67
Name: turner-tls.map.fastly.net
Address: 151.101.193.67
My DNS is set to automatic and uses what my isp provides: 217.23.2.11 and 8.8.8.8 .
Comment 16•6 years ago
|
||
Thanks. Dragana, Michal, Valentin, any ideas what more to check here?
Comment 17•6 years ago
|
||
(In reply to Honza Bambas (:mayhemer) from comment #16)
Thanks. Dragana, Michal, Valentin, any ideas what more to check here?
The only suggestion I have is to run nslookup www.cnn.com when problem occurs.
You os does not give us a ip address for www.cnn.com, we cannot do much about it.
Comment 18•6 years ago
|
||
(In reply to Dragana Damjanovic [:dragana] from comment #17)
(In reply to Honza Bambas (:mayhemer) from comment #16)
Thanks. Dragana, Michal, Valentin, any ideas what more to check here?
The only suggestion I have is to run nslookup www.cnn.com when problem occurs.
You os does not give us a ip address for www.cnn.com, we cannot do much about it.
As a workaround you can try DNSoverHTTPS (also called TRR).
This bypasses the system resolver, and uses an external one.
https://www.ghacks.net/2018/04/02/configure-dns-over-https-in-firefox/
https://daniel.haxx.se/blog/2018/06/03/inside-firefoxs-doh-engine/
Reporter | ||
Comment 19•6 years ago
|
||
I ran into this issue again trying to get to google.com.
The nslookup appeared to be successful, but the issue persisted after the seemingly successful lookup:
Server: 127.0.1.1
Address: 127.0.1.1#53
Non-authoritative answer:
Name: www.google.com
Address: 172.217.9.164
I will try out DNSoverHTTPS! Neat suggestion!
Assignee | ||
Comment 20•6 years ago
|
||
(In reply to Honza Bambas (:mayhemer) from comment #16)
Thanks. Dragana, Michal, Valentin, any ideas what more to check here?
In bug 1439780 there is a simple program (hook3.c) that hooks resolver functions and dumps info to standard error, see https://bugzilla.mozilla.org/show_bug.cgi?id=1439780#c26. You can use it to see if it really fails on the system level. See comment #18 how to compile and run it.
Reporter | ||
Comment 21•6 years ago
|
||
I had some issues with DNSoverHTTPS. I'm thinking firefox had intermittent difficulty resolving the host from which I was supposed to do dns services with.
For the hook3.c program, I've run it with firefox normally for the few business days since the time of Michal's post without the issue recurring. I've even redownloaded that 65.0b8 meta and turned off updating to ensure my variables are not changing. DNS servers are the same. Hopefully like last time the issue will spring up when I mention it. Anyway, please know that I am continuing to test and will post if I have news.
Comment 22•6 years ago
|
||
(In reply to rex from comment #21)
I had some issues with DNSoverHTTPS. I'm thinking firefox had intermittent difficulty resolving the host from which I was supposed to do dns services with.
You can use the network.trr.bootstrapAddress pref to set the IP of the DoH service you want to use.
You can use dig mozilla.cloudflare-dns.com
do get the IP; I think 1.1.1.1 should work too.
https://gist.github.com/bagder/5e29101079e9ac78920ba2fc718aceec
For the hook3.c program, I've run it with firefox normally for the few business days since the time of Michal's post without the issue recurring. I've even redownloaded that 65.0b8 meta and turned off updating to ensure my variables are not changing. DNS servers are the same. Hopefully like last time the issue will spring up when I mention it. Anyway, please know that I am continuing to test and will post if I have news.
Thank you for this! We are really grateful. I'm sure this kind of hard-to-reproduce bug can be annoying to a lot of users, and any little help in finding out the root cause is great!
Reporter | ||
Comment 23•6 years ago
|
||
Hey Michal, sorry if I completely misunderstand, but does your hook ask the system to resolve the address before firefox makes the attempt? I have a hunch that, where firefox would fail to resolve an address, if another program or service had already resolved the address, firefox would succeed.
Reporter | ||
Comment 24•6 years ago
|
||
Where I've spent a little under three business days running the firefox beta and hook with no issues, I was able to reproduce the issue without the hook in a minute.
Again I apologize for the following speculation about something I completely do not understand. I think this issue occurs when firefox asks to resolve a new name. If the hook checks whether the system can resolve the new name first, then firefox might "only" need to resolve an already resolved name, which is consistently successful. Thus days of success with the hook and immediate failure without.
Does this make sense for how firefox and the hook actually work?
Comment 25•6 years ago
|
||
:michal, could you take a look at comment #24?
Do you have any idea why this issue is not reproducible with the hook? Thanks.
Assignee | ||
Comment 26•6 years ago
|
||
(In reply to rex from comment #23)
Hey Michal, sorry if I completely misunderstand, but does your hook ask the system to resolve the address before firefox makes the attempt? I have a hunch that, where firefox would fail to resolve an address, if another program or service had already resolved the address, firefox would succeed.
The utility just intercepts calls to gethostbyname, gethostbyname_r and getaddrinfo. It calls the original function and prints the result to stderr. I have no idea why it's not reproducible with it. It might slightly change the timing but it should be negligible.
Reporter | ||
Comment 27•6 years ago
|
||
Thanks for the explanation, :michal. Is there any chance that the hook utility could be calling the functions differently than firefox (especially if it was doing so incorrectly somehow) or is using a different process to do so? I'm repeating similar results today. Firefox+hook resolves consistently, firefox alone does not.
I'm imagining this bug could be something about my system intermittently ignoring firefox's calls or improperly responding to firefox, where the utility and chromium work consistently.
Assignee | ||
Comment 28•6 years ago
|
||
Functions from glibc should be called in both cases (with the LD_PRELOAD as well as without it). Do you run the same firefox binary with the same environment? I.e. when running firefox without the hook, do you run it from console as well? Also, does LD_PRELOAD variable contain anything by default?
Reporter | ||
Comment 29•6 years ago
|
||
Good questions Michal. I had been running firefox without the hook outside the console, where I was running with the hook in the console. Turns out the issue persists without the hook in the console though.
I couldn't find how to determine what my LD_PRELOAD variable is by default, but I ran a script with 'echo "LD_PRELOAD IS ${LD_PRELOAD}"' and starting up firefox which did not return anything after "LD_PRELOAD IS ". Not sure if that is applicable though. Is there a different way I should check?
Assignee | ||
Comment 30•6 years ago
|
||
I couldn't find how to determine what my LD_PRELOAD variable is by default, but I ran a script with 'echo "LD_PRELOAD IS ${LD_PRELOAD}"' and starting up firefox which did not return anything after "LD_PRELOAD IS ". Not sure if that is applicable though. Is there a different way I should check?
Running echo "${LD_PRELOAD}" would show it. That's really strange. If you comment out getaddrinfo from hook3.c (or just rename it to some random name) but keep gethostbyname and gethostbyname_r as they are, can you reproduce the problem?
Reporter | ||
Comment 31•6 years ago
|
||
I've made a newhook3.c, renaming getaddrinfo to dontgetaddrinfo (heh), used the gcc command to produce a newhook3.so, and using that new .so, I am able to reproduce the issue.
Assignee | ||
Comment 32•6 years ago
|
||
Thanks for doing the test. Please, try this new file. It only intercepts the function call but does nothing with the result.
Reporter | ||
Comment 33•6 years ago
|
||
With this one it looks like I cannot resolve any address.
Comment 34•6 years ago
|
||
(In reply to rex from comment #21)
I had some issues with DNSoverHTTPS. I'm thinking firefox had intermittent difficulty resolving the host from which I was supposed to do dns services with.
I would like to understand this: Can you explain what was the issue? How did firefox behave: firefox could not connect to web sites?
Assignee | ||
Comment 35•6 years ago
|
||
It should work. Did it compile without any warning?
Reporter | ||
Comment 36•6 years ago
|
||
:michal, it took about 5 seconds to compile and did not return anything to the terminal.
:dragana, yes, I did not seem to be able to resolve any addresses and only hit server not found pages. Though I had not tested with an IP address as my network.trr.uri, testing now shows the same thing. To be transparent, my only changes are network.trr.mode to 3, and network.trr.uri as default or to 104.16.249.249 or 1.1.1.1, and the rest of network.trr.* default. Maybe I am missing or misusing a setting?
Assignee | ||
Comment 37•6 years ago
|
||
(In reply to rex from comment #36)
:michal, it took about 5 seconds to compile and did not return anything to the terminal.
So, how the output written to stderr by hook3.so (which resolves correctly) looks like?
Comment 38•6 years ago
|
||
(In reply to rex from comment #36)
:michal, it took about 5 seconds to compile and did not return anything to the terminal.
:dragana, yes, I did not seem to be able to resolve any addresses and only hit server not found pages. Though I had not tested with an IP address as my network.trr.uri, testing now shows the same thing. To be transparent, my only changes are network.trr.mode to 3, and network.trr.uri as default or to 104.16.249.249 or 1.1.1.1, and the rest of network.trr.* default. Maybe I am missing or misusing a setting?
mode 3 is only using doh. mode 2 will fall back to dns if doh is not working. I recommend to use mode 2.
I am also curious to know why doh mode 3 is not working. Can you make me a http log:
https://developer.mozilla.org/en-US/docs/Mozilla/Debugging/HTTP_logging
Thanks!
Reporter | ||
Comment 39•6 years ago
|
||
Hi :michal, using the gcc command for either one now is about instantaneous and produces no stderr output.
Hi :dragana, I agree mode 2 would make sense long term, but my intentions for testing was pure doh since currently dns isn't a solid fallback. I'll attach the log in just a moment.
Reporter | ||
Comment 40•6 years ago
|
||
Reporter | ||
Comment 41•6 years ago
|
||
Assignee | ||
Comment 42•6 years ago
|
||
(In reply to rex from comment #39)
Hi :michal, using the gcc command for either one now is about instantaneous and produces no stderr output.
I meant the output when you run it. You should have output like this in the console:
getaddrinfo node=detectportal.firefox.com, service=(null), hints=0x7fcb0b678500, retval=0
hints.ai_family=0
hints.ai_socktype=1
hints.ai_protocol=0
hints.ai_flags=32
IPv4 address: 23.219.91.81
IPv4 address: 23.219.91.27
IPv6 address: 2a02:26f0:40::17db:5b51
IPv6 address: 2a02:26f0:40::17db:5b1b
getaddrinfo node=tiles.services.mozilla.com, service=(null), hints=0x7fcb0b678500, retval=0
hints.ai_family=0
hints.ai_socktype=1
hints.ai_protocol=0
hints.ai_flags=32
IPv4 address: 52.25.70.97
IPv4 address: 52.26.103.165
IPv4 address: 52.39.131.77
IPv4 address: 52.41.60.30
IPv4 address: 52.24.236.113
IPv4 address: 35.160.41.125
IPv4 address: 52.41.78.152
IPv4 address: 52.34.107.172
Assignee | ||
Comment 43•6 years ago
|
||
Could you please also provide you DNS setup? Content of /etc/resolv.conf and line beginning with "hosts" from /etc/nsswitch.conf. If your resolv.conf contains 127.0.0.53 then also an output from "systemd-resolve --status".
Reporter | ||
Comment 44•6 years ago
|
||
Ah, thanks for clarifying Michal.
The original hook3 running firefox provides output like yours:
gethostbyname_r=myhostname, retval=0, succeeded
address: 127.0.1.1
gethostbyname_r=myhostname, retval=0, succeeded
address: 127.0.1.1
getaddrinfo node=detectportal.firefox.com, service=(null), hints=0x7f0d6787c940, retval=0
hints.ai_family=0
hints.ai_socktype=1
hints.ai_protocol=0
hints.ai_flags=32
IPv4 address: 23.54.163.42
IPv4 address: 23.54.163.75
IPv6 address: 2600:1404:27::17d7:6222
IPv6 address: 2600:1404:27::17d7:6241
etc.
My edited hook3 to rename gethostbyname running firefox only outputs:
gethostbyname_r=ofc-ate, retval=0, succeeded
address: 127.0.1.1
gethostbyname_r=ofc-ate, retval=0, succeeded
address: 127.0.1.1
The latest hook3_test1 outputs nothing.
/etc/resolv.conf
nameserver 127.0.1.1
"hosts" from /etc/nsswitch.conf
hosts: files wins dns
I attempted systemd-resolve --status just in case, but received:
systemd-resolve: unrecognized option '--status'
Assignee | ||
Comment 45•6 years ago
|
||
(In reply to rex from comment #44)
The original hook3 running firefox provides output like yours:
gethostbyname_r=myhostname, retval=0, succeeded
address: 127.0.1.1
gethostbyname_r=myhostname, retval=0, succeeded
address: 127.0.1.1
getaddrinfo node=detectportal.firefox.com, service=(null), hints=0x7f0d6787c940, retval=0
hints.ai_family=0
hints.ai_socktype=1
hints.ai_protocol=0
hints.ai_flags=32
IPv4 address: 23.54.163.42
IPv4 address: 23.54.163.75
IPv6 address: 2600:1404:27::17d7:6222
IPv6 address: 2600:1404:27::17d7:6241
etc.
This looks normal.
/etc/resolv.conf
nameserver 127.0.1.1"hosts" from /etc/nsswitch.conf
hosts: files wins dnsI attempted systemd-resolve --status just in case, but received:
systemd-resolve: unrecognized option '--status'
You could try:
resolvectl status
127.0.1.1 is a local address so something must be listening and forwarding the requests. You can use following command to find out, what it is:
sudo netstat -u -l -p -n | grep -e ":53[[:space:]]"
Reporter | ||
Comment 46•6 years ago
|
||
$ resolvectl status
resolvectl: command not found
$ sudo netstat -u -l -p -n | grep -e ":53[[:space:]]"
udp 0 0 127.0.1.1:53 0.0.0.0:* 2325/dnsmasq
Assignee | ||
Comment 47•6 years ago
|
||
You could try to put ISP's DNS servers to /etc/resolv.conf to make sure dnsmasq doesn't mess anything up.
I still don't understand why nothing is resolved when hook3_test1.so is preloaded. I will try to come with something else later.
Reporter | ||
Comment 48•6 years ago
|
||
I think I've run into some voodoo.
I updated my /etc/resolv.conf to:
nameserver 217.23.2.11
nameserver 8.8.8.8
nameserver 127.0.1.1
and then ran:
sudo /etc/init.d/dns-clean start
as described here: https://tecadmin.net/flush-dns-cache-ubuntu/
In using firefox without any hook, I still experience this issue.
Assignee | ||
Comment 49•6 years ago
|
||
Could you please put only "nameserver 217.23.2.11" to /etc/resolv.conf and provide HTTP log (the same as in comment #13) and dump of DNS packets when the failure happens? Use tcpdump to dump the DNS packets:
sudo tcpdump -w /tmp/dump.pcap -s 0 udp and port 53 and host 217.23.2.11
Reporter | ||
Comment 50•6 years ago
|
||
Gladly!
With only 'nameserver 217.23.2.11' in my /etc/resolv.conf and running sudo /etc/init.d/dns-clean start, I appear to be completely unable to resolve any name in any of my internet browsers. The dump.pcap simply shows entries like 'Standard query 0x3347 A www.yahoo.com'.
With only 'nameserver 8.8.8.8' in my /etc/resolv.conf and running sudo /etc/init.d/dns-clean start, I can resolve names again. I've changed the tcpdump command to 'sudo tcpdump -w /tmp/dump.pcap -s 0 udp and port 53 and host 8.8.8.8'.
I've attached the resulting dump.pcap below. In it it shows that several hostnames were successfully resolved. At the end, it failed to resolve cnn.com but is not shown in the dump.
I then did an extra tcpdump with 'sudo tcpdump -w /tmp/dump.pcap -s 0 udp and port 53' with google.com failing at the end, but again, it is not shown.
My naive interpretation is that firefox is simply not sending a dns request for some reason.
Reporter | ||
Comment 51•6 years ago
|
||
Reporter | ||
Comment 52•6 years ago
|
||
Assignee | ||
Comment 53•6 years ago
|
||
(In reply to rex from comment #50)
Gladly!
With only 'nameserver 217.23.2.11' in my /etc/resolv.conf and running sudo /etc/init.d/dns-clean start, I appear to be completely unable to resolve any name in any of my internet browsers. The dump.pcap simply shows entries like 'Standard query 0x3347 A www.yahoo.com'.
In comment #13 you wrote that 217.23.2.11 and 8.8.8.8 are DNS servers provided automatically by ISP. It's bad, if they really return non-working DNS server in DHCP response. Check using dig utility that it really doesn't work:
dig www.cnn.com @217.23.2.11
If it times out or returns an error, get rid of this DNS server in the system configuration. I.e. manually set DNS servers to 8.8.8.8 and 8.8.4.4.
With only 'nameserver 8.8.8.8' in my /etc/resolv.conf and running sudo /etc/init.d/dns-clean start, I can resolve names again. I've changed the tcpdump command to 'sudo tcpdump -w /tmp/dump.pcap -s 0 udp and port 53 and host 8.8.8.8'.
I've attached the resulting dump.pcap below. In it it shows that several hostnames were successfully resolved. At the end, it failed to resolve cnn.com but is not shown in the dump.
I then did an extra tcpdump with 'sudo tcpdump -w /tmp/dump.pcap -s 0 udp and port 53' with google.com failing at the end, but again, it is not shown.
I meant to provide both logs, tcpdump and HTTP log from firefox, where we could see that we tried to resolve the address.
Reporter | ||
Comment 54•6 years ago
|
||
Thanks Michal. From using the dig command, it seems clear that what my ISP is providing is not a working DNS server. I have changed my top level router to distribute 8.8.8.8, 8.8.4.4, and 1.1.1.1 as my DNS servers, and it appears to have propagated correctly.
I've set my /etc/resolv.conf to just 'nameserver 8.8.8.8' and ran 'sudo /etc/init.d/dns-clean start'. I ran the tcpdump as 'sudo tcpdump -w /tmp/dump.pcap -s 0 udp and port 53' in case the requests go somewhere else by chance.
Right now I'm having difficulty getting "good" logs. At first, all I was getting was 'server not found' with nothing in my pcap, so I thought I had messed up something. Now I'm getting only successful resolutions. I'll try again in a few hours.
Reporter | ||
Comment 55•6 years ago
|
||
Reporter | ||
Comment 56•6 years ago
|
||
Was built via 'sudo tcpdump -w /tmp/dump.pcap -s 0 udp and port 53' and was active when requesting reddit.com. Appears to be empty.
Assignee | ||
Comment 57•6 years ago
|
||
Can you please try a special build where I've added a code to log error returned by GetAddrInfo? The try push is here:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=a9fbc1eae0ad3b6989fd2d2e42b449ae371f8ca0&selectedJob=225192845
Linux builds can be downloaded from:
32-bit - https://queue.taskcluster.net/v1/task/YQ59Q_ckRbqFDz7D4oCkrQ/runs/0/artifacts/public/build/target.tar.bz2
64-bit - https://queue.taskcluster.net/v1/task/L8QATV2TSkORJwzXFEI3vg/runs/0/artifacts/public/build/target.tar.bz2
The error is logged in GetAddrInfo module, so it has to be added to MOZ_LOG. Following MOZ_LOG is sufficient:
timestamp,sync,GetAddrInfo:5,nsHostResolver:5
Reporter | ||
Comment 58•6 years ago
|
||
Thanks for the special build, Michal! I've downloaded the 64 bit Linux build you linked, and am running the following script to produce the desired log file:
#! /bin/bash
export MOZ_LOG=timestamp,sync,GetAddrInfo:5,nsHostResolver:5
export MOZ_LOG_FILE=~/Desktop/log.txt
./firefox
However, I have not been able to reproduce the issue yet whether I use this script or the binary file directly. I will continue to use the binary today to see if it recurs, and if so, then the script to try to capture that issue. I'll also download the vanilla version of this firefox in case there's some difference there.
I notice this is a different version than the 65.0b8 version that I was testing with. Do we think it is better to continue testing in the same 65.0b8 environment or stay current in case the issue was resolved elsewhere?
Assignee | ||
Comment 59•6 years ago
|
||
Give a try to both versions. And if it's not reproducible in this 66.0a1 build but is reproducible in 65.0b8, I'll create a testing build from the same source tree.
Reporter | ||
Comment 60•6 years ago
|
||
I am able to reproduce the issue with the plain 66.0a1 alpha from a shell script or otherwise.
I am not able to reproduce the issue with the plain 66.0a1 alpha from a shell script that calls out:
export MOZ_LOG=timestamp,sync,GetAddrInfo:5,nsHostResolver:5
export MOZ_LOG_FILE=./2019-01-31log.txt
I am not able to reproduce the issue with the special build of 66.0a1, with or without the MOZ definitions.
Assignee | ||
Comment 61•6 years ago
|
||
(In reply to rex from comment #60)
I am able to reproduce the issue with the plain 66.0a1 alpha from a shell script or otherwise.
I am not able to reproduce the issue with the plain 66.0a1 alpha from a shell script that calls out:
export MOZ_LOG=timestamp,sync,GetAddrInfo:5,nsHostResolver:5
export MOZ_LOG_FILE=./2019-01-31log.txt
Since this is not a crash, sync isn't necessary. Timing won't be changes so much without sync, so please try timestamp,GetAddrInfo:5,nsHostResolver:5 or even just timestamp,GetAddrInfo:5
I am not able to reproduce the issue with the special build of 66.0a1, with or without the MOZ definitions.
Hmm, I would expect the same behavior as in the official 66.0a1. This strange behavior makes finding the problem really hard.
Reporter | ||
Comment 62•6 years ago
|
||
These logs were produced using the plain 66.0a1 firefox with MOZ_LOG=timestamp,GetAddrInfo:5. Firefox successfully resolved google.com, but failed to resolve bing.com
Assignee | ||
Comment 63•6 years ago
|
||
Thanks, unfortunately this log doesn't contain useful information. I suggested it mainly to check if removing sync helps with reproducing. Helpful would be the log from my version which logs the error code.
Reporter | ||
Comment 64•6 years ago
|
||
I see. I attempted the special build with MOZ_LOG=timestamp,GetAddrInfo:5 just in case but am still unable to reproduce the issue. I will keep it running and logging in case it is a rare occurrence for whatever reason.
Assignee | ||
Comment 65•6 years ago
|
||
You could give this version a try. It only prints error message in case of failure. Use it as the other versions, i.e. compile with:
gcc -shared -ldl -fPIC log_getaddrinfo_errors.c -o log_getaddrinfo_errors.so
And run it with the official firefox:
LD_PRELOAD=/path/to/log_getaddrinfo_errors.so ./firefox --no-remote
Reporter | ||
Comment 66•6 years ago
|
||
Thanks Michal. Are you interested in just the terminal output, or would you still like the HTTP log with MOZ_LOG set to something?
Reporter | ||
Comment 67•6 years ago
|
||
Reporter | ||
Comment 68•6 years ago
|
||
Reporter | ||
Comment 69•6 years ago
|
||
Reporter | ||
Comment 70•6 years ago
|
||
Firefox 66.0a1 running with with that LD_Preload appears to not be able to resolve any address. :(
Assignee | ||
Comment 71•6 years ago
|
||
getaddrinfo fails with EAI_SYSTEM and errno should contain details, but it's 0.
Could you please extract the attached archive and run "make" in the extracted directory? It should generate few files in logs directory. I'm curious whether getaddrinfo called from python code fails as well and maybe I'll see something interesting in the strace log.
Reporter | ||
Comment 72•6 years ago
|
||
Can do, Michal. Below is what printed in the terminal, and the attached logs follow.
gcc -shared -ldl -fPIC log_getaddrinfo_errors.c -o log_getaddrinfo_errors.so
rm -f logs/*.txt
./run_test.sh
Makefile:5: recipe for target 'test' failed
make: *** [test] Error 1
Reporter | ||
Comment 73•6 years ago
|
||
Assignee | ||
Comment 74•6 years ago
|
||
Is there any change if you remove wins resolving? I.e. in /etc/nsswitch.conf change line "hosts: files wins dns" to "hosts: files dns" and rerun the test.
Reporter | ||
Comment 75•6 years ago
|
||
Yes, there are changes! Below is the terminal output, and the attached logs follow.
gcc -shared -ldl -fPIC log_getaddrinfo_errors.c -o log_getaddrinfo_errors.so
rm -f logs/*.txt
./run_test.sh
Reporter | ||
Comment 76•6 years ago
|
||
Assignee | ||
Comment 77•6 years ago
|
||
Now, it seems to resolve normally. I guess firefox shouldn't fail anymore. Please, test it for a while without wins resolving. I'm not sure what's exactly wrong, but according to the rule in nsswitch.conf dns request should be sent when the host isn't found using wins and because no request is sent, there is probably some fatal failure in wins nss module.
Reporter | ||
Comment 78•6 years ago
|
||
Good find Michal! I'm not able to easily reproduce the issue in the old beta or 66.0a1 alpha. I will continue my normal browsing across the firefox versions and report if it recurs.
Reporter | ||
Comment 79•6 years ago
|
||
Firefox continues to resolve addresses correctly.
Do we know of a difference between Firefox and Chromium in how they interact with wins? This may provide further answers as to why Firefox has difficulty with wins in my system whereas Chromium does not.
Assignee | ||
Comment 80•6 years ago
|
||
(In reply to rex from comment #79)
Firefox continues to resolve addresses correctly.
Do we know of a difference between Firefox and Chromium in how they interact with wins? This may provide further answers as to why Firefox has difficulty with wins in my system whereas Chromium does not.
A quick test shows that chrome uses getaddrinfo just like firefox. Given that first getaddrinfo in the python test succeeded and also that firefox resolves correctly with hook3.so (which adds some delay by printing result to console), the problem might be when subsequent call to getaddrinfo is called too quickly after previous call and something in wins code doesn't have time to fail and clean up. Could you please do another test?
- put wins back to nsswitch.conf
- run the old test to make sure it still fails
- run this new test which sleeps between getaddrinfo calls
Also, would you mind sharing your /etc/samba/smb.conf ?
Reporter | ||
Comment 81•6 years ago
|
||
Good thinking Michal.
I've put wins back in nsswitch.conf.
Running "make" in the new test directory seems to give good terminal output:
gcc -shared -ldl -fPIC log_getaddrinfo_errors.c -o log_getaddrinfo_errors.so
rm -f logs/*.txt
./run_test.sh
A zip file of the new resulting logs and my smb.conf file follow.
Reporter | ||
Comment 82•6 years ago
|
||
Reporter | ||
Comment 83•6 years ago
|
||
Current smb.conf file with active shares removed.
Comment 84•6 years ago
|
||
Michal, I'm assigning this to you just to get it off the triage list, leaving UNCONFIRMED. Feel free to revert or anything you seem more fit for this bug. Thanks.
Assignee | ||
Comment 85•6 years ago
|
||
I'm closing this as WFM because the bug isn't limited only to firefox. Any application that calls getaddrinfo quickly enough will trigger it.
Rex, I didn't find anything strange in any config file you've posted. I tried to reproduce it on Fedora and Valentin tried it on Ubuntu, but we were unable to reproduce it. If you would like to track the bug down to nsswitch level, I would recommend to run the python test with rr (https://rr-project.org/) and if it's reproducible using rr, then you should be able to debug it after installing appropriate debug packages.
Reporter | ||
Comment 86•6 years ago
|
||
No worries, Michal. Thank you for your help diagnosing to this level. I'm rolling along just fine with WINS removed from my nsswitch.conf and am happy to have helped reach a conclusion on this bug.
Description
•