Closed Bug 475603 Opened 16 years ago Closed 16 years ago

Lots of timeouts for DNS requests with Netgear Router WGR614, and stylesheet/css rendering problems

Categories

(Core :: Networking, defect)

Type: defect
Priority: Not set
Severity: critical

Tracking

Status: VERIFIED INVALID

People

(Reporter: whimboo, Assigned: mcmanus)

Details

(Keywords: common-issue-)

Attachments

(5 files)

Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.2a1pre) Gecko/20081224 Minefield/3.2a1pre
During regression tests I've noticed that I regularly run into timeouts when I try to access Flickr. It happens with builds from at least the last 2 months; builds from last June work fine. I haven't had time yet to narrow down the regression range.
Steps:
1. Create a fresh profile
2. Navigate to http://www.flickr.com
3. Click on Sign in
Most of the time I get a timeout in step 2. If the page does load, I end up with a timeout in step 3. Often the page isn't displayed correctly either: the CSS is missing and sub-frames show a DNS timeout. See the attached screenshot.
It would be nice to try and catch this in a debugger. Nightly builds have symbols via the symbol server, and you might be able to catch the DNS thread(s) in action by breaking into the process when you see a hang.
I'll dig into this this week
Could this be a regression from the DNS prefetch (bug 453403) ?
For me this regressed between the following builds:
Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.1b2pre) Gecko/20081110 Minefield/3.1b2pre
Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.1b2pre) Gecko/20081112 Minefield/3.1b2pre
Tagged changesets:
http://hg.mozilla.org/mozilla-central/rev/b5f3b30402cb
http://hg.mozilla.org/mozilla-central/rev/8242c6adbf63
http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=b5f3b30402cb&tochange=8242c6adbf63
(In reply to comment #3)
> Could this be a regression from the DNS prefetch (bug 453403) ?
That looks likely. Its check-in happened within this time frame; see the pushlog output too.
Blocks: 453403
OS: Windows Vista → All
Hardware: x86 → All
We time out really often on every platform. I think we should block on this, so I'm asking for blocking.
Severity: major → critical
Flags: blocking1.9.1?
This appears to be a lot like bug 463215, but that was supposedly addressed in the dns-prefetch bug itself. Need logs or tracebacks of some sort, I think.
Assignee: nobody → mcmanus
Version: 1.9.0 Branch → Trunk
Henrik, can you confirm you see this in a build that would have included the Jan 14th changes (a8fae999e836507f3303f3fe86811281c45695ec)? I assume so, but the latest version number I see in any of the comments is 20081224, so I just want to be sure. I cannot reproduce the Flickr problem on Linux x86_64 using either a Minefield build from 1228 (the earliest I could find) or last night's. I assume there is a problem and I'd love a commonly occurring test case, so that's frustrating :( I'm going to give Mac a try now too. I don't have a Windows box.
Patrick, I can see the timeouts regularly on my MacBook (Pro) too. I would say that roughly every 5th page load for a site which wasn't opened for a while runs into a timeout. You should be able to reproduce it at least on OS X. Have a look at comment 4: the pushlog link shows all check-ins between the builds where the problem started. I fetched the changesets from about:buildconfig, so the list is definitely correct.
Man, I just can't make this happen on Mac either. I understand that comment 4's pushlog shows the DNS prefetching changes. And I agree they are a likely cause for what you are seeing. What they don't contain is changeset a8fae999e836507f3303f3fe86811281c45695ec which landed around Jan 14th. I want to make sure you're running that too, as it fixed some problems in this area. That being said, I cannot make this happen with builds from around the new year either. I've cleared local DNS caches and the caches on the recursive resolver. When you get flickr to fail, is it the first time you have visited flickr since opening the browser? If I had a kingdom, I'd trade it for a reproduce recipe that worked for me with last night's nightly :)
Henrik: Are you using a Fritzbox as router ?
I'm gonna block on this as it seems to be a regression (see comment 4) but if we could somehow get more reproducible STR that would really help. We might unblock on this if it's not something that seems to happen all the time.
Flags: blocking1.9.1? → blocking1.9.1+
(In reply to comment #10)
> Henrik: Are you using a Fritzbox as router ?
No, I don't have a Fritzbox. The router I use is a Netgear WGR614v5: http://netgear-us.custhelp.com/app/products/model/a_id/2587
Following the article about general DNS problems with the latest 10.5.6 upgrade of Leopard, I've added the nameservers from http://www.opendns.com/ directly in my system settings, but the problem still remains. Running dig in the terminal for all the websites where I get a timeout doesn't show any lag; the IP address is displayed promptly. I don't know where to look to check how Firefox handles it. Patrick, is there anything I can use to offer more information?
I see this also, perhaps in a similar environment: Netgear WGR614v5 to cable modem. Vista laptop, wireless connection, 54.0 Mbps, excellent signal, Broadcom 802.11g Network Adapter. I first saw this with 20081222 Shiretoko/3.1b3pre, but haven't sought a regression range. It happens variably depending on the site, but the STR below is pretty solid. It seems better with 20090131 Shiretoko/3.1b3pre, which has the 467562 check-in, but it's not gone.
STR
1. Install the snaplinks add-on. I've been using v0.0.5, but 1.0 is at http://snaplinks.mozdev.org/
2. Go to google.com/news
3. Using snaplinks, open 2 sets of 5-6 links
3a. I chose links which I had not opened before
3b. Right click, drag to make a rectangle around the desired links, release the click
Results: timeouts and CSS rendering problems in 30-50% of tabs.
Reproducible: always.
I will attempt to reproduce on cable to see if it's different from wireless, with snaplinks 1.0.
Same behavior on a wired connection. IE also shows signs of the problem. Same deal on an XP machine. There were no Windows Vista updates on this machine between 11/21 and 12/31/08, and I don't recall seeing the problem before 11/21.
If I connect direct to cablemodem (i.e. remove WGR614v5 router) I don't see the problem. Also don't see it with IE.
It can't be Firefox's fault if you get it with IE; it must be either the OS or the DNS server, in your case the DNS forwarder in your router. Henrik still gets the issue without the DNS forwarder from the router.
(In reply to comment #16)
> Henrik still gets the issue without the DNS forwarder from the router.
At least the default DNS forwarder is grayed out on OS X; I had to add additional ones from OpenDNS. I have no idea whether this still conflicts. For now I cannot run a test without the router. :/
(In reply to comment #15)
> If I connect direct to cablemodem (i.e. remove WGR614v5 router) I don't see the problem. Also don't see it with IE.
I agree with Matthias that this makes it highly unlikely to be related to Firefox or to Henrik's problem report. FWIW I did try out your STR and was not able to reproduce with it. Thanks for sharing it; I'm interested in any other STRs folks haven't shared yet, and I'll try them all.
For whatever it is worth, I would guess your problem is related to relatively small buffers on the WGR614v5 combined with the pretty big bursts of data snaplinks is going to create (snaplinks opens a lot of pages in parallel). Those bursts are going to cause drops in the small-buffered device, and DNS, running on unreliable UDP, operates extremely poorly in a packet-drop environment.
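To put rough numbers on the last point: UDP does no retransmission of its own, so a dropped query or reply is only recovered when the resolver's own timeout fires and it retries. The Python sketch below assumes independent per-packet loss with probability p, plus a hypothetical retry timeout and retry limit; real resolvers and routers behave less neatly, so treat it as an illustration only.

```python
def exchange_success(p: float) -> float:
    """Probability that one query packet AND its reply both get through,
    assuming each packet is dropped independently with probability p."""
    return (1.0 - p) ** 2


def expected_extra_delay(p: float, retry_timeout_s: float) -> float:
    """Expected time lost to retries before a lookup succeeds.

    Attempts are geometric with success probability s, so the expected
    number of failed attempts is (1 - s) / s. Each failure costs one full
    retry timeout, because nothing on the UDP path resends for us.
    """
    s = exchange_success(p)
    return (1.0 - s) / s * retry_timeout_s


def give_up_probability(p: float, max_attempts: int) -> float:
    """Probability that every attempt fails, i.e. the user sees
    'Server not found' (assumes the resolver stops after max_attempts)."""
    return (1.0 - exchange_success(p)) ** max_attempts


if __name__ == "__main__":
    RETRY_TIMEOUT_S = 5.0  # hypothetical resolver retry timeout
    MAX_ATTEMPTS = 3       # hypothetical retry limit
    for loss in (0.0, 0.05, 0.2, 0.4):
        print(f"loss={loss:.2f}  "
              f"success/attempt={exchange_success(loss):.2f}  "
              f"extra wait={expected_extra_delay(loss, RETRY_TIMEOUT_S):.1f}s  "
              f"gives up={give_up_probability(loss, MAX_ATTEMPTS):.3f}")
```

With 20% packet loss, for example, only 64% of exchanges succeed on the first try, which is how a burst that overflows a small router buffer can turn into multi-second stalls or outright lookup failures.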
> additional ones from OpenDNS. No idea if this is still in conflict. For now I cannot run a test without the router. :/
But you can go between 3.0.5 and 3.1 beta 2 and switch between a clean bill of health and a semi-reliable reproduce, right? That would be the key thing.
> I can have a look to check how Firefox handles it. Patrick, is there anything I can use to offer more information?
I've been thinking about this, and I can't think of anything all that useful in the default build. By tomorrow I expect to attach (roughly) 3 different patches to this bug which, assuming you confirm the behavior of comment 19, should help isolate where something is going off the rails. They aren't fixes, because I don't know what's wrong :) But they sort of short out the likely code one section at a time:
1] Makes the DNS threads poll for queued work every 250ms or so instead of relying on the wakeup logic. If this bears fruit, it means I need to look at the wakeup logic again.
2] Makes the DNS pool never shrink. If this bears fruit, it means I have to look at losing wakeups when threads exit.
3] Removes the quota on any-priority threads so that any thread can serve any request. There are good reasons for each of those algorithms, but you can certainly test a normal load with the changes.
I don't have tryserver permissions. If I post the patches, can you make the builds and try them out?
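The idea behind patch 1 can be illustrated outside Firefox. The Python sketch below is hypothetical and is not the real DNS resolver code: `WorkQueue`, the `notify=False` flag and `POLL_INTERVAL_S` are invented for the illustration. A worker sleeps on a condition variable and a wakeup is deliberately lost. Without polling, the queued lookup sits until something else wakes the worker (here, an artificial deadline so the demo terminates); with a 250 ms polling fallback it is picked up within one interval.

```python
import collections
import threading
import time

POLL_INTERVAL_S = 0.25  # "every 250ms or so", as in patch 1


class WorkQueue:
    """Toy DNS work queue (hypothetical, not Firefox's implementation)."""

    def __init__(self, poll: bool):
        self._items = collections.deque()
        self._cond = threading.Condition()
        self._poll = poll

    def put(self, item, notify: bool = True):
        with self._cond:
            self._items.append(item)
            if notify:  # notify=False simulates a lost wakeup
                self._cond.notify()

    def get(self, deadline_s: float):
        """Wait for an item; give up (return None) after deadline_s."""
        end = time.monotonic() + deadline_s
        with self._cond:
            while not self._items:
                remaining = end - time.monotonic()
                if remaining <= 0:
                    return None
                # Polling caps each sleep, so the queue is re-checked even
                # if nobody ever calls notify().
                timeout = min(remaining, POLL_INTERVAL_S) if self._poll else remaining
                self._cond.wait(timeout)
            return self._items.popleft()


def demo(poll: bool, deadline_s: float = 2.0):
    """Queue one lookup with a lost wakeup while a worker is already waiting.
    Returns (item, seconds the worker took to notice it)."""
    q = WorkQueue(poll)
    result = {}

    def worker():
        start = time.monotonic()
        result["item"] = q.get(deadline_s)
        result["elapsed"] = time.monotonic() - start

    t = threading.Thread(target=worker)
    t.start()
    time.sleep(0.1)  # give the worker time to start waiting
    q.put("lookup www.flickr.com", notify=False)
    t.join()
    return result["item"], result["elapsed"]


if __name__ == "__main__":
    for poll in (False, True):
        item, elapsed = demo(poll)
        print(f"poll={poll}: got {item!r} after {elapsed:.2f}s")
```

If a build with polling made the timeouts disappear, that would point at lost wakeups rather than at the network, which is the kind of isolation the diagnostic patches aim for.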
(In reply to comment #19)
> but you can go between 3.0.5 and 3.1.beta2 and switch between a clean bill of health and a semi-reliable reproduce, right? That would be the key thing.
Meanwhile I've noticed short hangs even with 3.0.6, but there were no timeouts. After the comment from Matthias I'm not quite sure whether I'm also affected by a router bug. Patrick, I'll be in Mountain View for a month starting Monday, which means I'll have a different network connection. I could probably see what happens there. Also keep in mind that I won't be able to run any local tests. Perhaps Wayne could help out.
(In reply to comment #20)
> I don't have tryserver permissions. If I post the patches can you make the builds and try them out?
I can, yes. That means I have to run this three times, with each patch resulting in its own tryserver build, correct?
from comment 20 1] Makes the DNS threads poll for queued work every 250ms or so instead of relying on the wakeup logic.. if this bears fruit it means I need to look at the wakeup logic again.
referred to in comment 20 2] makes the DNS pool never shrink.. if this bears fruit it means I have to look at losing wakeups when threads exit.
referred to in comment 20 3] removes the quota on any-priority threads so that any thread can serve any request..
> I can do, yes. Means I have to run this three times with each patch resulting > in its own tryserver build, correct? right. 3 different data points - not cumulative. Thanks!
Despite the same behavior from IE, I'm not sure my problem is entirely the router, because I could swear FF got worse at some point. But it could be my ISP, so I also have more testing to do. I didn't try FF3.0. It may be a couple of weeks until I can get to it. If there are builds you want me to try, perhaps Henrik or someone can put them up somewhere.
I *think* I'm seeing this too. It's not clear though, as I'm on a really shady network (satellite) at home, but it sure does seem to me like I get DNS timeouts nowadays, and I don't remember that being the norm some time ago. There's not much concrete I can say here, really. I'll play around with Patrick's patches and see what I can find, but unfortunately it'll take some time to see if there's any real difference on my network.
I've started tryserver builds for each of the given patches. I'll come back with their locations once they have finished building.
All the win32 builds are busted. Ted, is the Try server win32 hg builder somehow broken?
Might be bug 476635. Not sure, you'd have to ask RelEng.
Patrick, I assume that all the DNS problems I've noticed at home are related to my router. I don't have any problems using a WLAN access point in my apartment in Mountain View. It would be great if someone with the same Netgear router could have a look with the above tryserver builds. I won't be able to until March 9th.
(In reply to comment #32)
> Patrick, I assume that all the DNS problems I've noticed at home are related to my router. I don't have any problems by using a WLAN access point in my apartment in Mountain View.
>
> It would be great if someone with the same Netgear router could have a look with the above tryserver builds. I won't be able to until March 9th.
That's good news. I don't think any of those patches would help you in that case; they are aimed at isolating algorithmic problems in Firefox. Maybe we should drop the 1.9.1 blocking? When I first read this I kind of figured that you were somehow overrunning a queue in your Netgear router, but there are never more than 3 concurrent speculative lookups outstanding (the queue can be a lot bigger than that), so that seems pretty unlikely. Just to rule it out, you (or someone else with that router) can try out patch 4, which drops it to just one speculative lookup at a time. (Of course, non-speculative lookups are also going on.)
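The policy described above (at most 3 speculative lookups in flight, with a possibly much longer queue behind them) can be sketched with a thread pool whose size is the cap. This is a hypothetical Python illustration, not Firefox's code: `SpeculativePrefetcher` and `ConcurrencyProbe` are invented names, and the probe merely stands in for a real resolver. Setting the cap to 1 corresponds to what patch 4 tests.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_SPECULATIVE = 3  # "never more than 3 concurrent speculative lookups"


class SpeculativePrefetcher:
    """Queue any number of prefetch requests, but keep at most
    max_concurrent of them in flight (hypothetical sketch)."""

    def __init__(self, resolve_fn, max_concurrent: int = MAX_SPECULATIVE):
        self._resolve = resolve_fn
        # The executor's internal queue is unbounded; only the worker count is capped.
        self._pool = ThreadPoolExecutor(max_workers=max_concurrent)

    def prefetch(self, hostnames):
        return [self._pool.submit(self._resolve, h) for h in hostnames]

    def shutdown(self):
        self._pool.shutdown(wait=True)


class ConcurrencyProbe:
    """Fake resolver that records the peak number of simultaneous calls."""

    def __init__(self, delay_s: float = 0.02):
        self._lock = threading.Lock()
        self._active = 0
        self._delay = delay_s
        self.peak = 0

    def __call__(self, host):
        with self._lock:
            self._active += 1
            self.peak = max(self.peak, self._active)
        time.sleep(self._delay)  # pretend the network is slow
        with self._lock:
            self._active -= 1
        return host


if __name__ == "__main__":
    hosts = [f"host{i}.example" for i in range(20)]
    for cap in (MAX_SPECULATIVE, 1):  # cap=1 mirrors patch 4
        probe = ConcurrencyProbe()
        prefetcher = SpeculativePrefetcher(probe, cap)
        resolved = [f.result() for f in prefetcher.prefetch(hosts)]
        prefetcher.shutdown()
        print(f"cap={cap}: resolved {len(resolved)} hosts, peak in flight={probe.peak}")
```

The point of the cap is that a page with many links produces a long queue but only a small, fixed number of simultaneous requests on the wire.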
The tryserver builds for patch 4 will appear at the following location: https://build.mozilla.org/tryserver-builds/2009-02-08_16:34-hskupin@mozilla.com-bug475603_1thread/ Wayne, if you would have time for a short test, it would be really helpful.
(In reply to comment #35)
> The tryserver builds for patch 4 will appear at the following location:
> https://build.mozilla.org/tryserver-builds/2009-02-08_16:34-hskupin@mozilla.com-bug475603_1thread/
>
> Wayne, if you would have time for a short test, it would be really helpful.
Still modest problems:
- Restarting FF with ~25 tabs in 3 windows: 2 had a page load error.
- google.com/news, opening 11 links in tabs: 4 had a page load error.
I tried to get it verified by running DNS requests via a local SOCKS proxy, but it doesn't work: I'm connected to the network at home, but I cannot see this issue. Perhaps I've missed something, but for now it looks like I have to be at home to reproduce it. Thanks, Wayne, for verifying. Shall we contact Netgear to inform them about this problem?
Before you contact Netgear you should probably generate a packet trace with Wireshark (with a DNS filter).
So is there more to be done here, is this purely a Netgear issue, or something we'd attempt to deal with? Or do we not know yet?
I haven't had any problems so far while staying at headquarters, so I strongly suspect it is an issue with at least this type of Netgear router. The next time I can test this again is March 9th.
(In reply to comment #40)
> I haven't had any problems so far while staying in the headquarters.
> So I highly suspect it is an issue with at least this type of routers from Netgear.
If your problem depends on the environment, IPv6 may be relevant. Can the following be a workaround for your problem?
network.dns.disableIPv6=true
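As a rough illustration of what turning IPv6 off changes at the resolver level: a program can ask the system resolver for IPv4 results only (`AF_INET`) instead of any address family (`AF_UNSPEC`). The Python sketch below uses the standard `socket.getaddrinfo`; it is only an analogy for the effect of `network.dns.disableIPv6`, not Firefox's implementation, and the OS and router may still do their own IPv6-related work regardless of what the application asks for.

```python
import socket


def resolve(host: str, disable_ipv6: bool) -> list:
    """Return the sorted, de-duplicated addresses the system resolver gives for host.

    AF_INET restricts the answer to IPv4 addresses; AF_UNSPEC lets the
    resolver return IPv4 and IPv6 addresses (typically by asking for both
    A and AAAA records).
    """
    family = socket.AF_INET if disable_ipv6 else socket.AF_UNSPEC
    infos = socket.getaddrinfo(host, None, family, socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the address.
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    for disable in (False, True):
        print(f"disable_ipv6={disable}:", resolve("localhost", disable))
```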
Ok. Given that we won't be blocking on this bug. Please renominate if there's reasons to reconsider.
Flags: blocking1.9.1+ → blocking1.9.1-
I'm back at home and hit this problem again. I tried the workaround of disabling IPv6 and it worked, but even after resetting the pref everything is still fine. I'll have a look at this in the next few days. Wayne, can you do the same?
It's back. Even with IPv6 disabled, the lookup often takes ages.
(In reply to comment #44)
> It's back. Even with IPv6 disabled the lookup takes ages a lot of times.
Henrik Skupin, is your router's firmware the newest one? Is it possible to turn IPv6 off at the OS level or at the firmware level? Even if IPv6 support in Fx is disabled, the IPv6-related functions of the OS (network modules) and the firmware still work if OS-level and/or firmware-level IPv6 support is enabled.
(In reply to comment #45)
> Henrik Skupin, firmware of your router is newest one? OS level IPv6=Off or firmware level IPv6=Off is possible?
Yes, everything has the newest firmware. I have now switched off IPv6 support on OS X too; I had forgotten this. But I cannot find a setting for the router.
I did a further test just now. While Firefox wasn't able to resolve a hostname, I ran the following command in parallel in the console to check whether the OS can resolve the domain name. But it fails too:
> henrik$ dig heise.de
>
> ; <<>> DiG 9.4.2-P2 <<>> heise.de
> ;; global options: printcmd
> ;; connection timed out; no servers could be reached
It really looks like it is not a fault on our side. Wayne, do you still have time to check this on your side too? This bug should probably be resolved as invalid.
Patrick, what do you plan with the given enhancements? Will you create new bugs for them, let them wait, or not get them into the tree at all?
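The check done with dig above (does the OS resolver fail at the same moment Firefox does?) can also be scripted so it runs repeatedly while the problem is happening. A minimal Python sketch, assuming the system resolver reached through `socket.getaddrinfo` is representative of what other applications see; the 5-second limit is an arbitrary choice and `time_lookup` is a hypothetical helper, not an existing tool.

```python
import socket
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# getaddrinfo() has no timeout argument, so run it in a worker thread and
# stop waiting after timeout_s. A stuck lookup keeps its thread busy until
# the OS resolver gives up on its own.
_pool = ThreadPoolExecutor(max_workers=4)


def time_lookup(host: str, timeout_s: float = 5.0):
    """Resolve host via the OS resolver; return (ok, elapsed_seconds, error)."""
    start = time.monotonic()
    future = _pool.submit(socket.getaddrinfo, host, None)
    try:
        future.result(timeout=timeout_s)
        return True, time.monotonic() - start, ""
    except FutureTimeout:
        return False, time.monotonic() - start, "timed out"
    except socket.gaierror as exc:
        return False, time.monotonic() - start, str(exc)


if __name__ == "__main__":
    # Hostnames mentioned in this bug; run this while Firefox is stalling.
    for host in ("heise.de", "www.flickr.com"):
        ok, elapsed, err = time_lookup(host)
        print(f"{host}: {'ok' if ok else 'FAILED'} in {elapsed:.2f}s {err}")
```

If this script stalls or fails at the same moments Firefox does, the problem sits below the browser (OS, router or upstream DNS), which is the conclusion the dig output points to.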
(In reply to comment #46) > Patrick, what do you plan with the given enhancements? Will you create new bugs > for that, have them to wait, or won't you even get them into the tree? are you referring to the attachments to this bug? If so, those can just be forgotten - they have only diagnostic value for your (non) problem.
(In reply to comment #46) > It really looks like that it is not a fault on our side. Wayne, do you still > have time to check this too on your side? Henrik, does this match your experience at all - this week I found the browser behaved better loading links after I terminated some network apps, like remote desktop. I could see a clear change between RDP running and not running. I did not get to try IPv6 or some other things. I too would think it INVALID, but I can't get over the fact that it hasn't always worked this poorly - and I've had the network box for a long time. Guess that means I should test FF2! Did you try v2?
I don't have RDP running, and I suspect that this has something to do with DNS resolution. Even Firefox 2 shows these lookup timeouts; I have tried it again. I think it started after I upgraded the firmware of my Netgear router to V1.0.9_1.0.6. Which version are you running?
(In reply to comment #49)
> I think it has been started after I upgraded my firmware of the Netgear router to V1.0.9_1.0.6. Which version have you running?
Same, and the box is v5, mostly still on default settings except that WPA-PSK is set. I upgraded (I don't recall the date, but it's been a long time) because I've been fighting a problem with my wife's ThinkPad + WGR511 PCMCIA card unpredictably dropping the wireless connection.
Closing as invalid since this is not a problem with Firefox.
Status: ASSIGNED → RESOLVED
Closed: 16 years ago
Resolution: --- → INVALID
I contacted Netgear a while back but never got a reply to my support ticket. Since I changed my router everything is fine again; no more DNS timeouts are happening. Marking as verified.
Status: RESOLVED → VERIFIED
Summary: Lots of timeouts for DNS requests → Lots of timeouts for DNS requests with Netgear Router WGR614
Sounds like a good issue to do a support page for.
Summary for a support page:
Firefox 3.5 does DNS prefetching (new in 3.5). That means that if you load a page that contains links to different domains, Firefox will try to get these domain names resolved to IP addresses while loading the page. This can cause several DNS requests at once, and that seems to break the DNS forwarder in several different routers. You will get an "address not found" error in Firefox. Windows does DNS caching, and a good sign that this is a router issue is that you get the same error from IE if you try to visit the same domain immediately after you got the error in Firefox.
Solution:
Get a fixed firmware version, or
disable DNS prefetching in Firefox (create (!) a new boolean pref network.dns.disablePrefetch=true).
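To make the mechanism concrete: the hostnames a prefetcher could resolve are roughly the distinct hosts of the links on the page, which is where the burst of simultaneous lookups comes from. The Python sketch below only collects those candidates from `<a href>` links using the standard `html.parser`; it is a simplified, hypothetical model (the class and function names are invented), not how Firefox actually decides what to prefetch.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkHostCollector(HTMLParser):
    """Collect the hostnames of <a href> links, in order, without duplicates."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.hosts = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        parts = urlsplit(urljoin(self.base_url, href))
        if parts.scheme not in ("http", "https"):
            return  # mailto:, javascript: etc. need no DNS lookup here
        if parts.hostname and parts.hostname not in self.hosts:
            self.hosts.append(parts.hostname)


def prefetch_candidates(page_url: str, html: str) -> list:
    """Distinct link hosts other than the page's own host (already resolved)."""
    collector = LinkHostCollector(page_url)
    collector.feed(html)
    collector.close()
    own_host = urlsplit(page_url).hostname
    return [h for h in collector.hosts if h != own_host]


if __name__ == "__main__":
    page = """
      <a href="/local">same site</a>
      <a href="http://a.example/story">A</a>
      <a href="https://b.example/">B</a>
      <a href="http://a.example/other">A again</a>
      <a href="mailto:someone@c.example">mail</a>
    """
    hosts = prefetch_candidates("http://news.example/", page)
    print(f"{len(hosts)} lookups could be started at once: {hosts}")
```

On a news front page with links to dozens of different sites, the candidate list is correspondingly long, which fits the reports in this bug of problems appearing mostly on link-heavy pages such as google.com/news or msnbc.com.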
(In reply to comment #57)
> Solution:
> Get a fixed Firmware version or
> disable the DNS prefetching in Firefox ( create (!) a new boolean pref
> with network.dns.disablePrefetch=true )
See my comment 49, which states that it isn't caused by the new DNS prefetching feature.
Henrik: That doesn't match the information from the other people like https://bugzilla.mozilla.org/show_bug.cgi?id=506588#c2 (3.0.X works)
Probably dependent on the used router. Seems like it is not always its fault.
Regarding this being a common issue, note the following: the residential gateway/router I use, the 2Wire 3800HGV-B Gateway, is I believe the standard one deployed for AT&T U-Verse fiber optic Internet/TV/Voice installations. AT&T U-Verse has 1.6 million subscribers as of AT&T's last quarterly earnings report (I would guess >90% select the Internet option): http://www.mediaweek.com/mw/content_display/news/cable-tv/e3i017491d566a16661e1d1a754de299987
I suggest at least adding "Server not found" to the title of this bug. It is the main symptom of this issue, and adding it may help reduce the filing of duplicate bugs. The greater issue is that this appears to be a more generic "DNS requests overloading router" issue that may deserve its own bug NOT limited to Netgear. Whether the issue is documented here, in 453403, or in some other bug, I think the information needs to be easier to find (great idea to do a support page!).
(In reply to comment #55)
I disabled DNS prefetching by adding the network.dns.disablePrefetch=true preference. The workaround has virtually eliminated "page not found" errors for me, with the EXCEPTION I speculate about below. Now what I get is more of a ~5 second pause occasionally when loading a web page (which is preferable to a "Server not found" error message).
Google Maps exception (DNS issue as well?)
Steps to reproduce:
1) Visit http://maps.google.com and look at the map that appears (USA for me).
2) See ~4 squares of "We are sorry, but we don't have maps at this zoom level for this region. Try zooming out for a broader look."
Workaround: Drag the map around to get those areas to load. I think this is related to DNS as well, but am not sure.
Just a follow-up from me. While I saw improvements with my WGR614 when disabling Firefox prefetching and reverting to an earlier version of the router's firmware, I replaced the Netgear router with a new Linksys. The new router completely eliminated the Server Not Found issue with prefetch enabled. In fact, Firefox is now performing far better with the Linksys than with the Netgear WGR614.
Gene
We don't normally find out what router users have so we don't know how common this is. Stalled DNS requests can also be firewall or spyware.
I believe I have been experiencing this issue for a while. Let me try to describe my experience.
Environment:
Windows XP Pro (fully patched)
Firefox version 3.6 (but it was happening with earlier builds)
Gigabit home network
Netgear WGR614v6
Comcast internet provider
DNS servers: Google 8.8.8.8 and 8.8.4.4
Scenario 1. With my home page set to www.msnbc.com I would experience lots of timeouts connecting to additional sites via tabbed browsing or opening a new window. Opening Google Chrome and hitting the same sites would work without any problems. Doing an ipconfig /flushdns and hitting refresh on the pages would solve the problem temporarily. Before I tried scenario 2 I thought it had something to do with my DNS servers. I was using Comcast DNS servers, then public DNS servers. I then switched to Google's after reading an article about them, and found that it did not make a difference.
Scenario 2. With my home page set to news.yahoo.com: no problems.
Scenario 3. With my home page set to news.google.com: same problems as with www.msnbc.com.
Resolution: While trying to figure this out I found this site: http://kb.mozillazine.org/Error_loading_any_website
I followed the instructions to enable network.dns.disablePrefetch. Since disabling prefetching I have not had any problems at all, and Firefox is running faster than ever.
Other information: I am a user who keeps lots of tabs and browser windows open. Whether surfing around on reddit and digg or researching some new technology, I hate to close tabs. I also have multiple computers (5) running in my home, and none of them has ever experienced DNS problems.
Hopefully all this rambling is for nothing.
Summary: Lots of timeouts for DNS requests with Netgear Router WGR614 → Lots of timeouts for DNS requests with Netgear Router WGR614, and stylesheet/css rendering problems