Closed Bug 1499825 Opened 6 years ago Closed 6 years ago

intermittent server not found

Categories

(Core :: Networking, defect, P5)

63 Branch
defect

Tracking

()

RESOLVED WORKSFORME

People

(Reporter: rex, Assigned: michal)

Details

(Whiteboard: [necko-triaged][no-nag])

Attachments

(20 files)

172.67 KB, text/plain
Details
929.96 KB, text/plain
Details
552 bytes, text/x-csrc
Details
373.55 KB, text/plain
Details
973 bytes, text/plain
Details
8.02 KB, application/vnd.tcpdump.pcap
Details
3.92 KB, application/vnd.tcpdump.pcap
Details
217.99 KB, text/plain
Details
24 bytes, application/vnd.tcpdump.pcap
Details
769 bytes, application/zip
Details
801 bytes, text/x-csrc
Details
20.75 KB, text/plain
Details
383.27 KB, text/plain
Details
965 bytes, text/plain
Details
1.03 KB, application/x-compressed-tar
Details
28.24 KB, application/zip
Details
19.20 KB, application/zip
Details
1.07 KB, application/x-compressed-tar
Details
30.19 KB, application/zip
Details
9.06 KB, text/plain
Details
User Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/69.0.3497.81 Chrome/69.0.3497.81 Safari/537.36 Steps to reproduce: 1. Download the current beta from https://download.mozilla.org/?product=firefox-beta-latest-ssl&os=linux64&lang=en-US . 2. Unpack firefox. 3. Run firefox. 4. Navigate to several different sites in a row on the same tab. Actual results: Some of the sites display as "Server not found", "we can't connect to the server at" the site address. Refreshing the page sometimes displays the site instead of "Server not found". When not, opening a new tab and entering the same address sometimes displays the site instead of "Server not found". Expected results: Sites load without "Server not found" events.
ubuntu-bug provides the following information: ProblemType: Bug DistroRelease: Linux 18.3 Package: firefox 63.0~b13+build1-0ubuntu0.16.04.1 [modified: usr/lib/firefox/browser/defaults/preferences/vendor-firefox.js usr/lib/firefox/distribution/distribution.ini] [origin: LP-PPA-mozillateam-firefox-next] ProcVersionSignature: Ubuntu 4.15.0-36.39~16.04.1-generic 4.15.18 Uname: Linux 4.15.0-36-generic x86_64 AddonCompatCheckDisabled: False ApportVersion: 2.20.1-0ubuntu2.18 Architecture: amd64 AudioDevicesInUse: USER PID ACCESS COMMAND /dev/snd/controlC0: rex 3357 F.... pulseaudio BuildID: 20181009094547 Channel: beta CurrentDesktop: X-Cinnamon Date: Tue Oct 9 15:54:39 2018 DefaultProfileExtensions: extensions.sqlite corrupt or missing DefaultProfileIncompatibleExtensions: Unavailable (corrupt or non-existant compatibility.ini or extensions.sqlite) DefaultProfileLocales: extensions.sqlite corrupt or missing DefaultProfilePlugins: Shockwave Flash - /usr/lib/adobe-flashplugin/libflashplayer.so (adobe-flashplugin) DefaultProfilePrefSources: /usr/lib/firefox/defaults/pref/vendor-gre.js /usr/lib/firefox/defaults/pref/channel-prefs.js prefs.js DefaultProfileThemes: extensions.sqlite corrupt or missing ForcedLayersAccel: False IfupdownConfig: # interfaces(5) file used by ifup(8) and ifdown(8) auto lo iface lo inet loopback InstallationDate: Installed on 2017-01-12 (635 days ago) InstallationMedia: Linux Mint 17.2 "Rafaela" - Release amd64 20150627 IpRoute: default via 192.168.103.1 dev wlan0 proto static metric 600 169.254.0.0/16 dev wlan0 scope link metric 1000 192.168.103.0/24 dev wlan0 proto kernel scope link src 192.168.103.222 metric 600 MostRecentCrashID: bp-392336ee-2be7-4d6e-89e0-e40350180301 Profile0BrokenPermissions: browser-extension-data/uBlock0@raymondhill.net/storage.js.corrupt (0o600, wrong owner) extension-settings.json.corrupt (0o600, wrong owner) saved-telemetry-pings/ba50588c-e150-45ec-9d22-916c98101815 (0o600, wrong owner) saved-telemetry-pings/f50808ab-da4a-421b-8acd-2795287134d8 (0o600, wrong owner) Profile0Extensions: extensions.sqlite corrupt or missing Profile0IncompatibleExtensions: Unavailable (corrupt or non-existant compatibility.ini or extensions.sqlite) Profile0Locales: extensions.sqlite corrupt or missing Profile0Plugins: Shockwave Flash - /usr/lib/adobe-flashplugin/libflashplayer.so (adobe-flashplugin) Profile0PrefSources: /usr/lib/firefox/defaults/pref/vendor-gre.js /usr/lib/firefox/defaults/pref/channel-prefs.js prefs.js Profile0Themes: extensions.sqlite corrupt or missing Profiles: Profile0 - LastVersion=62.0/20180911150120 (Out of date) Profile1 (Default) - LastVersion=63.0/20181009094547 (In use) RelatedPackageVersions: adobe-flashplugin 1:20181009.1-0ubuntu0.16.04.1 RunningIncompatibleAddons: False SourcePackage: firefox SubmittedCrashIDs: bp-392336ee-2be7-4d6e-89e0-e40350180301 UpgradeStatus: No upgrade log present (probably fresh install) dmi.bios.date: 10/05/2011 dmi.bios.vendor: Hewlett-Packard dmi.bios.version: F.1B dmi.board.asset.tag: Base Board Asset Tag dmi.board.name: 165A dmi.board.vendor: Hewlett-Packard dmi.board.version: 10.31 dmi.chassis.asset.tag: Chassis Asset Tag dmi.chassis.type: 10 dmi.chassis.vendor: Hewlett-Packard dmi.chassis.version: Chassis Version dmi.modalias: dmi:bvnHewlett-Packard:bvrF.1B:bd10/05/2011:svnHewlett-Packard:pnHPPaviliondv7NotebookPC:pvr058D110000244620001620100:rvnHewlett-Packard:rn165A:rvr10.31:cvnHewlett-Packard:ct10:cvrChassisVersion: dmi.product.family: 103C_5335KV G=N L=CON B=HP S=PAV dmi.product.name: HP Pavilion dv7 Notebook PC dmi.product.version: 058D110000244620001620100 dmi.sys.vendor: Hewlett-Packard
Ah! Sorry, the ubuntu-bug information was for a different version of firefox that I was testing before the steps provided in this bug. I apologize for muddling this issue.
Hi, I wasn't able to reproduce this issue, could you specify some of the websites you've visited where this occurred? Also, could you test if the issue is reproducible in safe mode, here is a link that can help you do that: https://support.mozilla.org/en-US/kb/troubleshoot-firefox-issues-using-safe-mode If the issue still occurs please try using a new profile, you can find the steps here: https://support.mozilla.org/en-US/kb/profile-manager-create-and-remove-firefox-profiles?redirectlocale=en-US&redirectslug=Managing-profiles#w_starting-the-profile-manager Thank you!
Flags: needinfo?(rex)
Hi Peter! Thank you for your help on this issue. I was able to reproduce the issue with the following site "chain": bing.com -> google.com -> yahoo.com -> reddit.com I noticed that yahoo.com only downloaded partially when I attempted to move to the reddit.com. I moved into safe mode, which automatically loaded reddit.com, and when I tried to go to bing.com, "Server not found" was displayed. I started a new profile, which automatically loaded the Firefox privacy notice page, and from there I used the following site "chain" to reproduce the issue: firefox privacy notice page -> google.com -> bing.com -> google.com -> yahoo.com -> reddit.com -> msnbc.com Again, reddit.com had only loaded partially. I look forward to anything further that I can provide. In case it is helpful, I am linking the launchpad bug that I had reported previously. That reported issue is a duplicate of another report confirmed by one other user: https://bugs.launchpad.net/ubuntu/+source/firefox/+bug/1796979
Flags: needinfo?(rex)
The issue persists in 64.0b3. I was moving through a site similar to the above chain without issue, but when I clicked my email link to get to this page, I was presented with "Server not found". Refreshing the page and trying in a new tab gave the same page. Closing the new and original tab and trying again with a new tab got me through.
Just for the record, I was the other user that confirmed this problem on Launchpad. I used to see this issue very frequently using Ubuntu 18.04 and later with Ubuntu 18.10 on one laptop, and with Xubuntu 18.04 on another. The first laptop was completely wiped several days ago and Ubuntu was 18.04 installed. After extensive testing I have not seen this issue again on either laptop so I am unable to confirm the bug here. If I do encounter this problem again then I will of course come back and confirm the bug report if no-one else has already done so.
Hi, I'm still unable to reproduce the issue you described. Could you please try and reproduce it on the latest Firefox Beta 64.0(20181205213546) and Firefox Nightly 65.0a1(20181207093029)? Thanks!
Flags: needinfo?(rex)
Hi Peter! Can you clarify how to get the 20181205... and 20181207... versions? The issue persists for me on both versions downloaded at the time of posting from https://www.mozilla.org/en-US/firefox/channel/desktop/ , but I want to ensure I have the right versions. Thanks!
Component: Untriaged → Networking
Product: Firefox → Core
I'm clearing the needinfo request since I cannot provide info without assistance myself.
Flags: needinfo?(rex)
rex, could you please provide HTTP logs as described here: https://developer.mozilla.org/en-US/docs/Mozilla/Debugging/HTTP_logging it may shade some light on this issue. thanks!
Flags: needinfo?(rex)

Hi Honza! Thanks for the tip!
I'm currently unable to reproduce the issue at will with the current 65.0b8 beta. I will continue to use it until I notice an issue and then try to reproduce it while logging.

Attached file log.txt-main.8416

Started logging after the issue started while attempting to visit CNN.com. Was given "We can’t connect to the server at www.cnn.com" page. Started logging and hit refresh, and was shown the same page. Then stopped logging.

Flags: needinfo?(rex)
Attached file log.txt-main.9860

Aha! I started the log right before attempting to connect to CNN.com, received the "can't connect to the server" page, and stopped logging.

(In reply to rex from comment #13)

Created attachment 9034842 [details]
log.txt-main.9860

Aha! I started the log right before attempting to connect to CNN.com, received the "can't connect to the server" page, and stopped logging.

Thank you.

Looks like your local resolver has an issue:

2019-01-07 22:07:23.681 ⁃ nsHttpChannel ⁃ 7f7450f0a800 ⁃ finished ⁃ status=804b001e ⁃ http-status=n/a ⁃ url=https://www.cnn.com/
2019-01-07 22:07:23.691 ⁃ nsSocketTransport ⁃ 7f74521a4c00 ⁃ released ⁃ origin=www.cnn.com:443
2019-01-07 22:07:23.691473 UTC - [Parent 9860: Socket Thread]: D/nsSocketTransport creating nsSocketTransport @0x7f74521a4c00
2019-01-07 22:07:23.691482 UTC - [Parent 9860: Socket Thread]: E/nsSocketTransport nsSocketTransport::Init [this=0x7f74521a4c00 host=www.cnn.com:443 origin=www.cnn.com:443 proxy=:0]
2019-01-07 22:07:23.691500 UTC - [Parent 9860: Socket Thread]: D/nsSocketTransport nsSocketTransport::SetConnectionFlags 0x7f74521a4c00 flags=0
  flags = 0
2019-01-07 22:07:23.691525 UTC - [Parent 9860: Socket Thread]: D/nsSocketTransport nsSocketTransport::OpenOutputStream [this=0x7f74521a4c00 flags=2]
2019-01-07 22:07:23.691533 UTC - [Parent 9860: Socket Thread]: D/nsSocketTransport nsSocketTransport::PostEvent [this=0x7f74521a4c00 type=0 status=0 param=(nil)]
2019-01-07 22:07:23.691557 UTC - [Parent 9860: Socket Thread]: D/nsSocketTransport nsSocketTransport::OpenInputStream [this=0x7f74521a4c00 flags=2]
2019-01-07 22:07:23.691565 UTC - [Parent 9860: Socket Thread]: D/nsSocketTransport nsSocketTransport::PostEvent [this=0x7f74521a4c00 type=0 status=0 param=(nil)]
nsHalfOpenSocket @7f7454d67ae0 --> nsSocketTransport @7f74521a4c00
2019-01-07 22:07:23.692309 UTC - [Parent 9860: Socket Thread]: D/nsSocketTransport nsSocketTransport::OnSocketEvent [this=0x7f74521a4c00 type=0 status=0 param=(nil)]
2019-01-07 22:07:23.692324 UTC - [Parent 9860: Socket Thread]: D/nsSocketTransport nsSocketTransport::ResolveHost [this=0x7f74521a4c00 www.cnn.com:443]
2019-01-07 22:07:23.692335 UTC - [Parent 9860: Socket Thread]: E/nsSocketTransport nsSocketTransport::SendStatus [this=0x7f74521a4c00 status=804b0003]
  804b0003 = STATUS_RESOLVING
2019-01-07 22:07:23.692 ⁃ nsHostResolver ⁃ log.txt-main!nsHostResolver::service ⁃ service
2019-01-07 22:07:23.692387 UTC - [Parent 9860: Socket Thread]: D/nsHostResolver Resolving host [www.cnn.com] type 0. [this=0x7f748678ceb0]
2019-01-07 22:07:23.692414 UTC - [Parent 9860: Socket Thread]: D/nsHostResolver   No usable record in cache for host [www.cnn.com] type 0.
2019-01-07 22:07:23.692460 UTC - [Parent 9860: Socket Thread]: D/nsHostResolver   DNS thread counters: total=2 any-live=0 idle=2 pending=1
2019-01-07 22:07:23.692484 UTC - [Parent 9860: Socket Thread]: D/nsHostResolver   DNS lookup for host [www.cnn.com] blocking pending 'getaddrinfo' or trr query: callback [0x7f7454e6b700]
2019-01-07 22:07:23.692506 UTC - [Parent 9860: DNS Resolver #2]: E/nsHostResolver DNS lookup thread - Calling getaddrinfo for host [www.cnn.com].
2019-01-07 22:07:23.692511 UTC - [Parent 9860: Socket Thread]: D/nsSocketTransport nsSocketTransport::OnSocketEvent [this=0x7f74521a4c00 type=0 status=0 param=(nil)]
2019-01-07 22:07:23.692641 UTC - [Parent 9860: DNS Resolver #2]: D/nsHostResolver Calling 'res_ninit'.
2019-01-07 22:07:23.692738 UTC - [Parent 9860: DNS Resolver #2]: E/nsHostResolver DNS lookup thread - lookup completed for host [www.cnn.com]: failure: unknown host.
2019-01-07 22:07:23.692771 UTC - [Parent 9860: DNS Resolver #2]: D/nsHostResolver nsHostResolver::CompleteLookup www.cnn.com (nil) 804B001E trr=0 stillResolving=0
2019-01-07 22:07:23.692822 UTC - [Parent 9860: DNS Resolver #2]: D/nsHostResolver nsHostResolver record 0x7f7466ac4580 new gencnt
2019-01-07 22:07:23.692849 UTC - [Parent 9860: DNS Resolver #2]: D/nsHostResolver Caching host [www.cnn.com] negative record for 60 seconds.
2019-01-07 22:07:23.692903 UTC - [Parent 9860: DNS Resolver #2]: D/nsHostResolver CompleteLookup: www.cnn.com has NO address
2019-01-07 22:07:23.692939 UTC - [Parent 9860: DNS Resolver #2]: D/nsHostResolver nsHostResolver record 0x7f7466ac4580 calling back dns users
2019-01-07 22:07:23.693519 UTC - [Parent 9860: Socket Thread]: D/nsSocketTransport nsSocketTransport::OnLookupComplete: this=0x7f74521a4c00 status 804b001e.

I can see "DNS lookup thread - lookup completed for host [www.cnn.com]: failure: unknown host." from the call to getaddrinfo, which is even called twice (getaddrinfo, res_ninit, getaddrinfo).

If you open a terminal (command prompt) and do nslookup www.cnn.com what result do you get?

What exactly is your local DNS setting?

Hey Honza!

rex@ofc-ate ~ $ nslookup www.cnn.com
Server: 127.0.1.1
Address: 127.0.1.1#53

Non-authoritative answer:
www.cnn.com canonical name = turner-tls.map.fastly.net.
Name: turner-tls.map.fastly.net
Address: 151.101.1.67
Name: turner-tls.map.fastly.net
Address: 151.101.65.67
Name: turner-tls.map.fastly.net
Address: 151.101.129.67
Name: turner-tls.map.fastly.net
Address: 151.101.193.67

My DNS is set to automatic and uses what my isp provides: 217.23.2.11 and 8.8.8.8 .

Thanks. Dragana, Michal, Valentin, any ideas what more to check here?

(In reply to Honza Bambas (:mayhemer) from comment #16)

Thanks. Dragana, Michal, Valentin, any ideas what more to check here?

The only suggestion I have is to run nslookup www.cnn.com when problem occurs.
You os does not give us a ip address for www.cnn.com, we cannot do much about it.

(In reply to Dragana Damjanovic [:dragana] from comment #17)

(In reply to Honza Bambas (:mayhemer) from comment #16)

Thanks. Dragana, Michal, Valentin, any ideas what more to check here?

The only suggestion I have is to run nslookup www.cnn.com when problem occurs.
You os does not give us a ip address for www.cnn.com, we cannot do much about it.

As a workaround you can try DNSoverHTTPS (also called TRR).
This bypasses the system resolver, and uses an external one.

https://www.ghacks.net/2018/04/02/configure-dns-over-https-in-firefox/
https://daniel.haxx.se/blog/2018/06/03/inside-firefoxs-doh-engine/

I ran into this issue again trying to get to google.com.
The nslookup appeared to be successful, but the issue persisted after the seemingly successful lookup:
Server: 127.0.1.1
Address: 127.0.1.1#53

Non-authoritative answer:
Name: www.google.com
Address: 172.217.9.164

I will try out DNSoverHTTPS! Neat suggestion!

(In reply to Honza Bambas (:mayhemer) from comment #16)

Thanks. Dragana, Michal, Valentin, any ideas what more to check here?

In bug 1439780 there is a simple program (hook3.c) that hooks resolver functions and dumps info to standard error, see https://bugzilla.mozilla.org/show_bug.cgi?id=1439780#c26. You can use it to see if it really fails on the system level. See comment #18 how to compile and run it.

I had some issues with DNSoverHTTPS. I'm thinking firefox had intermittent difficulty resolving the host from which I was supposed to do dns services with.

For the hook3.c program, I've run it with firefox normally for the few business days since the time of Michal's post without the issue recurring. I've even redownloaded that 65.0b8 meta and turned off updating to ensure my variables are not changing. DNS servers are the same. Hopefully like last time the issue will spring up when I mention it. Anyway, please know that I am continuing to test and will post if I have news.

(In reply to rex from comment #21)

I had some issues with DNSoverHTTPS. I'm thinking firefox had intermittent difficulty resolving the host from which I was supposed to do dns services with.

You can use the network.trr.bootstrapAddress pref to set the IP of the DoH service you want to use.
You can use dig mozilla.cloudflare-dns.com do get the IP; I think 1.1.1.1 should work too.

https://gist.github.com/bagder/5e29101079e9ac78920ba2fc718aceec

For the hook3.c program, I've run it with firefox normally for the few business days since the time of Michal's post without the issue recurring. I've even redownloaded that 65.0b8 meta and turned off updating to ensure my variables are not changing. DNS servers are the same. Hopefully like last time the issue will spring up when I mention it. Anyway, please know that I am continuing to test and will post if I have news.

Thank you for this! We are really grateful. I'm sure this kind of hard-to-reproduce bug can be annoying to a lot of users, and any little help in finding out the root cause is great!

Hey Michal, sorry if I completely misunderstand, but does your hook ask the system to resolve the address before firefox makes the attempt? I have a hunch that, where firefox would fail to resolve an address, if another program or service had already resolved the address, firefox would succeed.

Where I've spent a little under three business days running the firefox beta and hook with no issues, I was able to reproduce the issue without the hook in a minute.

Again I apologize for the following speculation about something I completely do not understand. I think this issue occurs when firefox asks to resolve a new name. If the hook checks whether the system can resolve the new name first, then firefox might "only" need to resolve an already resolved name, which is consistently successful. Thus days of success with the hook and immediate failure without.

Does this make sense for how firefox and the hook actually work?

:michal, could you take a look at comment #24?

Do you have any idea why this issue is not reproducible with the hook? Thanks.

Flags: needinfo?(michal.novotny)

(In reply to rex from comment #23)

Hey Michal, sorry if I completely misunderstand, but does your hook ask the system to resolve the address before firefox makes the attempt? I have a hunch that, where firefox would fail to resolve an address, if another program or service had already resolved the address, firefox would succeed.

The utility just intercepts calls to gethostbyname, gethostbyname_r and getaddrinfo. It calls the original function and prints the result to stderr. I have no idea why it's not reproducible with it. It might slightly change the timing but it should be negligible.

Flags: needinfo?(michal.novotny)

Thanks for the explanation, :michal. Is there any chance that the hook utility could be calling the functions differently than firefox (especially if it was doing so incorrectly somehow) or is using a different process to do so? I'm repeating similar results today. Firefox+hook resolves consistently, firefox alone does not.

I'm imagining this bug could be something about my system intermittently ignoring firefox's calls or improperly responding to firefox, where the utility and chromium work consistently.

Functions from glibc should be called in both cases (with the LD_PRELOAD as well as without it). Do you run the same firefox binary with the same environment? I.e. when running firefox without the hook, do you run it from console as well? Also, does LD_PRELOAD variable contain anything by default?

Good questions Michal. I had been running firefox without the hook outside the console, where I was running with the hook in the console. Turns out the issue persists without the hook in the console though.

I couldn't find how to determine what my LD_PRELOAD variable is by default, but I ran a script with 'echo "LD_PRELOAD IS ${LD_PRELOAD}"' and starting up firefox which did not return anything after "LD_PRELOAD IS ". Not sure if that is applicable though. Is there a different way I should check?

I couldn't find how to determine what my LD_PRELOAD variable is by default, but I ran a script with 'echo "LD_PRELOAD IS ${LD_PRELOAD}"' and starting up firefox which did not return anything after "LD_PRELOAD IS ". Not sure if that is applicable though. Is there a different way I should check?

Running echo "${LD_PRELOAD}" would show it. That's really strange. If you comment out getaddrinfo from hook3.c (or just rename it to some random name) but keep gethostbyname and gethostbyname_r as they are, can you reproduce the problem?

I've made a newhook3.c, renaming getaddrinfo to dontgetaddrinfo (heh), used the gcc command to produce a newhook3.so, and using that new .so, I am able to reproduce the issue.

Attached file hook3_test1

Thanks for doing the test. Please, try this new file. It only intercepts the function call but does nothing with the result.

With this one it looks like I cannot resolve any address.

(In reply to rex from comment #21)

I had some issues with DNSoverHTTPS. I'm thinking firefox had intermittent difficulty resolving the host from which I was supposed to do dns services with.

I would like to understand this: Can you explain what was the issue? How did firefox behave: firefox could not connect to web sites?

It should work. Did it compile without any warning?

:michal, it took about 5 seconds to compile and did not return anything to the terminal.

:dragana, yes, I did not seem to be able to resolve any addresses and only hit server not found pages. Though I had not tested with an IP address as my network.trr.uri, testing now shows the same thing. To be transparent, my only changes are network.trr.mode to 3, and network.trr.uri as default or to 104.16.249.249 or 1.1.1.1, and the rest of network.trr.* default. Maybe I am missing or misusing a setting?

(In reply to rex from comment #36)

:michal, it took about 5 seconds to compile and did not return anything to the terminal.

So, how the output written to stderr by hook3.so (which resolves correctly) looks like?

(In reply to rex from comment #36)

:michal, it took about 5 seconds to compile and did not return anything to the terminal.

:dragana, yes, I did not seem to be able to resolve any addresses and only hit server not found pages. Though I had not tested with an IP address as my network.trr.uri, testing now shows the same thing. To be transparent, my only changes are network.trr.mode to 3, and network.trr.uri as default or to 104.16.249.249 or 1.1.1.1, and the rest of network.trr.* default. Maybe I am missing or misusing a setting?

mode 3 is only using doh. mode 2 will fall back to dns if doh is not working. I recommend to use mode 2.

I am also curious to know why doh mode 3 is not working. Can you make me a http log:
https://developer.mozilla.org/en-US/docs/Mozilla/Debugging/HTTP_logging

Thanks!

Hi :michal, using the gcc command for either one now is about instantaneous and produces no stderr output.

Hi :dragana, I agree mode 2 would make sense long term, but my intentions for testing was pure doh since currently dns isn't a solid fallback. I'll attach the log in just a moment.

(In reply to rex from comment #39)

Hi :michal, using the gcc command for either one now is about instantaneous and produces no stderr output.

I meant the output when you run it. You should have output like this in the console:

getaddrinfo node=detectportal.firefox.com, service=(null), hints=0x7fcb0b678500, retval=0
hints.ai_family=0
hints.ai_socktype=1
hints.ai_protocol=0
hints.ai_flags=32
IPv4 address: 23.219.91.81
IPv4 address: 23.219.91.27
IPv6 address: 2a02:26f0:40::17db:5b51
IPv6 address: 2a02:26f0:40::17db:5b1b
getaddrinfo node=tiles.services.mozilla.com, service=(null), hints=0x7fcb0b678500, retval=0
hints.ai_family=0
hints.ai_socktype=1
hints.ai_protocol=0
hints.ai_flags=32
IPv4 address: 52.25.70.97
IPv4 address: 52.26.103.165
IPv4 address: 52.39.131.77
IPv4 address: 52.41.60.30
IPv4 address: 52.24.236.113
IPv4 address: 35.160.41.125
IPv4 address: 52.41.78.152
IPv4 address: 52.34.107.172

Could you please also provide you DNS setup? Content of /etc/resolv.conf and line beginning with "hosts" from /etc/nsswitch.conf. If your resolv.conf contains 127.0.0.53 then also an output from "systemd-resolve --status".

Ah, thanks for clarifying Michal.

The original hook3 running firefox provides output like yours:
gethostbyname_r=myhostname, retval=0, succeeded
address: 127.0.1.1
gethostbyname_r=myhostname, retval=0, succeeded
address: 127.0.1.1
getaddrinfo node=detectportal.firefox.com, service=(null), hints=0x7f0d6787c940, retval=0
hints.ai_family=0
hints.ai_socktype=1
hints.ai_protocol=0
hints.ai_flags=32
IPv4 address: 23.54.163.42
IPv4 address: 23.54.163.75
IPv6 address: 2600:1404:27::17d7:6222
IPv6 address: 2600:1404:27::17d7:6241
etc.

My edited hook3 to rename gethostbyname running firefox only outputs:
gethostbyname_r=ofc-ate, retval=0, succeeded
address: 127.0.1.1
gethostbyname_r=ofc-ate, retval=0, succeeded
address: 127.0.1.1

The latest hook3_test1 outputs nothing.

/etc/resolv.conf
nameserver 127.0.1.1

"hosts" from /etc/nsswitch.conf
hosts: files wins dns

I attempted systemd-resolve --status just in case, but received:
systemd-resolve: unrecognized option '--status'

(In reply to rex from comment #44)

The original hook3 running firefox provides output like yours:
gethostbyname_r=myhostname, retval=0, succeeded
address: 127.0.1.1
gethostbyname_r=myhostname, retval=0, succeeded
address: 127.0.1.1
getaddrinfo node=detectportal.firefox.com, service=(null), hints=0x7f0d6787c940, retval=0
hints.ai_family=0
hints.ai_socktype=1
hints.ai_protocol=0
hints.ai_flags=32
IPv4 address: 23.54.163.42
IPv4 address: 23.54.163.75
IPv6 address: 2600:1404:27::17d7:6222
IPv6 address: 2600:1404:27::17d7:6241
etc.

This looks normal.

/etc/resolv.conf
nameserver 127.0.1.1

"hosts" from /etc/nsswitch.conf
hosts: files wins dns

I attempted systemd-resolve --status just in case, but received:
systemd-resolve: unrecognized option '--status'

You could try:
resolvectl status

127.0.1.1 is a local address so something must be listening and forwarding the requests. You can use following command to find out, what it is:
sudo netstat -u -l -p -n | grep -e ":53[[:space:]]"

$ resolvectl status
resolvectl: command not found

$ sudo netstat -u -l -p -n | grep -e ":53[[:space:]]"
udp 0 0 127.0.1.1:53 0.0.0.0:* 2325/dnsmasq

You could try to put ISP's DNS servers to /etc/resolv.conf to make sure dnsmasq doesn't mess anything up.

I still don't understand why nothing is resolved when hook3_test1.so is preloaded. I will try to come with something else later.

I think I've run into some voodoo.

I updated my /etc/resolv.conf to:
nameserver 217.23.2.11
nameserver 8.8.8.8
nameserver 127.0.1.1

and then ran:
sudo /etc/init.d/dns-clean start

as described here: https://tecadmin.net/flush-dns-cache-ubuntu/

In using firefox without any hook, I still experience this issue.

Could you please put only "nameserver 217.23.2.11" to /etc/resolv.conf and provide HTTP log (the same as in comment #13) and dump of DNS packets when the failure happens? Use tcpdump to dump the DNS packets:

sudo tcpdump -w /tmp/dump.pcap -s 0 udp and port 53 and host 217.23.2.11

Flags: needinfo?(rex)

Gladly!
With only 'nameserver 217.23.2.11' in my /etc/resolv.conf and running sudo /etc/init.d/dns-clean start, I appear to be completely unable to resolve any name in any of my internet browsers. The dump.pcap simply shows entries like 'Standard query 0x3347 A www.yahoo.com'.

With only 'nameserver 8.8.8.8' in my /etc/resolv.conf and running sudo /etc/init.d/dns-clean start, I can resolve names again. I've changed the tcpdump command to 'sudo tcpdump -w /tmp/dump.pcap -s 0 udp and port 53 and host 8.8.8.8'.

I've attached the resulting dump.pcap below. In it it shows that several hostnames were successfully resolved. At the end, it failed to resolve cnn.com but is not shown in the dump.

I then did an extra tcpdump with 'sudo tcpdump -w /tmp/dump.pcap -s 0 udp and port 53' with google.com failing at the end, but again, it is not shown.

My naive interpretation is that firefox is simply not sending a dns request for some reason.

Flags: needinfo?(rex)

(In reply to rex from comment #50)

Gladly!
With only 'nameserver 217.23.2.11' in my /etc/resolv.conf and running sudo /etc/init.d/dns-clean start, I appear to be completely unable to resolve any name in any of my internet browsers. The dump.pcap simply shows entries like 'Standard query 0x3347 A www.yahoo.com'.

In comment #13 you wrote that 217.23.2.11 and 8.8.8.8 are DNS servers provided automatically by ISP. It's bad, if they really return non-working DNS server in DHCP response. Check using dig utility that it really doesn't work:

dig www.cnn.com @217.23.2.11

If it times out or returns an error, get rid of this DNS server in the system configuration. I.e. manually set DNS servers to 8.8.8.8 and 8.8.4.4.

With only 'nameserver 8.8.8.8' in my /etc/resolv.conf and running sudo /etc/init.d/dns-clean start, I can resolve names again. I've changed the tcpdump command to 'sudo tcpdump -w /tmp/dump.pcap -s 0 udp and port 53 and host 8.8.8.8'.

I've attached the resulting dump.pcap below. In it it shows that several hostnames were successfully resolved. At the end, it failed to resolve cnn.com but is not shown in the dump.

I then did an extra tcpdump with 'sudo tcpdump -w /tmp/dump.pcap -s 0 udp and port 53' with google.com failing at the end, but again, it is not shown.

I meant to provide both logs, tcpdump and HTTP log from firefox, where we could see that we tried to resolve the address.

Thanks Michal. From using the dig command, it seems clear that what my ISP is providing is not a working DNS server. I have changed my top level router to distribute 8.8.8.8, 8.8.4.4, and 1.1.1.1 as my DNS servers, and it appears to have propagated correctly.

I've set my /etc/resolv.conf to just 'nameserver 8.8.8.8' and ran 'sudo /etc/init.d/dns-clean start'. I ran the tcpdump as 'sudo tcpdump -w /tmp/dump.pcap -s 0 udp and port 53' in case the requests go somewhere else by chance.

Right now I'm having difficulty getting "good" logs. At first, all I was getting was 'server not found' with nothing in my pcap, so I thought I had messed up something. Now I'm getting only successful resolutions. I'll try again in a few hours.

Flags: needinfo?(rex)

Was built via 'sudo tcpdump -w /tmp/dump.pcap -s 0 udp and port 53' and was active when requesting reddit.com. Appears to be empty.

Flags: needinfo?(rex)

Can you please try a special build where I've added a code to log error returned by GetAddrInfo? The try push is here:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=a9fbc1eae0ad3b6989fd2d2e42b449ae371f8ca0&selectedJob=225192845

Linux builds can be downloaded from:
32-bit - https://queue.taskcluster.net/v1/task/YQ59Q_ckRbqFDz7D4oCkrQ/runs/0/artifacts/public/build/target.tar.bz2
64-bit - https://queue.taskcluster.net/v1/task/L8QATV2TSkORJwzXFEI3vg/runs/0/artifacts/public/build/target.tar.bz2

The error is logged in GetAddrInfo module, so it has to be added to MOZ_LOG. Following MOZ_LOG is sufficient:
timestamp,sync,GetAddrInfo:5,nsHostResolver:5

Flags: needinfo?(rex)

Thanks for the special build, Michal! I've downloaded the 64 bit Linux build you linked, and am running the following script to produce the desired log file:

#! /bin/bash
export MOZ_LOG=timestamp,sync,GetAddrInfo:5,nsHostResolver:5
export MOZ_LOG_FILE=~/Desktop/log.txt
./firefox

However, I have not been able to reproduce the issue yet whether I use this script or the binary file directly. I will continue to use the binary today to see if it recurs, and if so, then the script to try to capture that issue. I'll also download the vanilla version of this firefox in case there's some difference there.

I notice this is a different version than the 65.0b8 version that I was testing with. Do we think it is better to continue testing in the same 65.0b8 environment or stay current in case the issue was resolved elsewhere?

Give a try to both versions. And if it's not reproducible in this 66.0a1 build but is reproducible in 65.0b8, I'll create a testing build from the same source tree.

I am able to reproduce the issue with the plain 66.0a1 alpha from a shell script or otherwise.

I am not able to reproduce the issue with the plain 66.0a1 alpha from a shell script that calls out:
export MOZ_LOG=timestamp,sync,GetAddrInfo:5,nsHostResolver:5
export MOZ_LOG_FILE=./2019-01-31log.txt

I am not able to reproduce the issue with the special build of 66.0a1, with or without the MOZ definitions.

(In reply to rex from comment #60)

I am able to reproduce the issue with the plain 66.0a1 alpha from a shell script or otherwise.

I am not able to reproduce the issue with the plain 66.0a1 alpha from a shell script that calls out:
export MOZ_LOG=timestamp,sync,GetAddrInfo:5,nsHostResolver:5
export MOZ_LOG_FILE=./2019-01-31log.txt

Since this is not a crash, sync isn't necessary. Timing won't be changes so much without sync, so please try timestamp,GetAddrInfo:5,nsHostResolver:5 or even just timestamp,GetAddrInfo:5

I am not able to reproduce the issue with the special build of 66.0a1, with or without the MOZ definitions.

Hmm, I would expect the same behavior as in the official 66.0a1. This strange behavior makes finding the problem really hard.

These logs were produced using the plain 66.0a1 firefox with MOZ_LOG=timestamp,GetAddrInfo:5. Firefox successfully resolved google.com, but failed to resolve bing.com

Flags: needinfo?(rex)

Thanks, unfortunately this log doesn't contain useful information. I suggested it mainly to check if removing sync helps with reproducing. Helpful would be the log from my version which logs the error code.

I see. I attempted the special build with MOZ_LOG=timestamp,GetAddrInfo:5 just in case but am still unable to reproduce the issue. I will keep it running and logging in case it is a rare occurrence for whatever reason.

You could give this version a try. It only prints error message in case of failure. Use it as the other versions, i.e. compile with:

gcc -shared -ldl -fPIC log_getaddrinfo_errors.c -o log_getaddrinfo_errors.so

And run it with the official firefox:

LD_PRELOAD=/path/to/log_getaddrinfo_errors.so ./firefox --no-remote

Thanks Michal. Are you interested in just the terminal output, or would you still like the HTTP log with MOZ_LOG set to something?

Attached file 2019-02-01 HTTP Log

Firefox 66.0a1 running with with that LD_Preload appears to not be able to resolve any address. :(

Attached file test.tgz

getaddrinfo fails with EAI_SYSTEM and errno should contain details, but it's 0.

Could you please extract the attached archive and run "make" in the extracted directory? It should generate few files in logs directory. I'm curious whether getaddrinfo called from python code fails as well and maybe I'll see something interesting in the strace log.

Can do, Michal. Below is what printed in the terminal, and the attached logs follow.
gcc -shared -ldl -fPIC log_getaddrinfo_errors.c -o log_getaddrinfo_errors.so
rm -f logs/*.txt
./run_test.sh
Makefile:5: recipe for target 'test' failed
make: *** [test] Error 1

Attached file 2019-02-04logs.zip

Is there any change if you remove wins resolving? I.e. in /etc/nsswitch.conf change line "hosts: files wins dns" to "hosts: files dns" and rerun the test.

Yes, there are changes! Below is the terminal output, and the attached logs follow.
gcc -shared -ldl -fPIC log_getaddrinfo_errors.c -o log_getaddrinfo_errors.so
rm -f logs/*.txt
./run_test.sh

Now, it seems to resolve normally. I guess firefox shouldn't fail anymore. Please, test it for a while without wins resolving. I'm not sure what's exactly wrong, but according to the rule in nsswitch.conf dns request should be sent when the host isn't found using wins and because no request is sent, there is probably some fatal failure in wins nss module.

Good find Michal! I'm not able to easily reproduce the issue in the old beta or 66.0a1 alpha. I will continue my normal browsing across the firefox versions and report if it recurs.

Firefox continues to resolve addresses correctly.

Do we know of a difference between Firefox and Chromium in how they interact with wins? This may provide further answers as to why Firefox has difficulty with wins in my system whereas Chromium does not.

Attached file test with pauses

(In reply to rex from comment #79)

Firefox continues to resolve addresses correctly.

Do we know of a difference between Firefox and Chromium in how they interact with wins? This may provide further answers as to why Firefox has difficulty with wins in my system whereas Chromium does not.

A quick test shows that chrome uses getaddrinfo just like firefox. Given that first getaddrinfo in the python test succeeded and also that firefox resolves correctly with hook3.so (which adds some delay by printing result to console), the problem might be when subsequent call to getaddrinfo is called too quickly after previous call and something in wins code doesn't have time to fail and clean up. Could you please do another test?

  • put wins back to nsswitch.conf
  • run the old test to make sure it still fails
  • run this new test which sleeps between getaddrinfo calls

Also, would you mind sharing your /etc/samba/smb.conf ?

Good thinking Michal.

I've put wins back in nsswitch.conf.

Running "make" in the new test directory seems to give good terminal output:
gcc -shared -ldl -fPIC log_getaddrinfo_errors.c -o log_getaddrinfo_errors.so
rm -f logs/*.txt
./run_test.sh

A zip file of the new resulting logs and my smb.conf file follow.

Attached file smb.conf

Current smb.conf file with active shares removed.

Michal, I'm assigning this to you just to get it off the triage list, leaving UNCONFIRMED. Feel free to revert or anything you seem more fit for this bug. Thanks.

Assignee: nobody → michal.novotny
Priority: -- → P5
Whiteboard: [necko-triaged][no-nag]

I'm closing this as WFM because the bug isn't limited only to firefox. Any application that calls getaddrinfo quickly enough will trigger it.

Rex, I didn't find anything strange in any config file you've posted. I tried to reproduce it on Fedora and Valentin tried it on Ubuntu, but we were unable to reproduce it. If you would like to track the bug down to nsswitch level, I would recommend to run the python test with rr (https://rr-project.org/) and if it's reproducible using rr, then you should be able to debug it after installing appropriate debug packages.

Status: UNCONFIRMED → RESOLVED
Closed: 6 years ago
Resolution: --- → WORKSFORME

No worries, Michal. Thank you for your help diagnosing to this level. I'm rolling along just fine with WINS removed from my nsswitch.conf and am happy to have helped reach a conclusion on this bug.

You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: