Closed Bug 467562 Opened 12 years ago Closed 12 years ago

DNS requests seem to hang sometimes

Categories

(Core :: Networking, defect, P3)

defect

Tracking

()

RESOLVED FIXED
mozilla1.9.2a1

People

(Reporter: sdwilsh, Assigned: mcmanus)

References

Details

(Keywords: fixed1.9.1)

Attachments

(1 file, 2 obsolete files)

About once a day my browser won't load anything in the current tab.  This seems to happen most often with my gmail tab.  The taskbar says it's looking up the host, and it just sits there.  The only way I've been able to resolve this issue and make it actually load is to open up a new tab and just try loading an arbitrary site (usually google.com).

I originally thought it was a flakey connection, but I'm hard-wired now.  I asked on irc, and at least one other person was seeing this as well.

I'm thinking this might be because of dns prefetching?
Flags: blocking1.9.1?
I'm experiencing similar behaviour - though it's hard to know when it's Firefox and when it's my ISP.
hm. I haven't seen that.

Just to be certain - are you saying that tab A freezes, and then you open tab B and initiate a lookup in tab B and that will un-freeze A?

That sounds like a lookup for A got queued but it did not make it out of the queue when a thread opened up for it until the new request was submitted in B. That shouldn't happen :(
(In reply to comment #2)
> Just to be certain - are you saying that tab A freezes, and then you open tab B
> and initiate a lookup in tab B and that will un-freeze A?
> 
> That sounds like a lookup for A got queued but it did not make it out of the
> queue when a thread opened up for it until the new request was submitted in B.
> That shouldn't happen :(
Yes, that is what I'm seeing.
ah.

ok, I see a problem with signal accounting and logic. Basically the code is assuming 1 signal to 1 wakeup, which the documentation says you cannot count on, and there is also a problem with timeout detection (which is inherently racy with the nspr api).

I'll make a patch.
as I cannot reproduce this, can you give the attached patch a spin?
I don't have an up-to-data build tree handy but if someone can point me at a build I'll give it a whirl.

I haven't noticed the interaction between tabs, but that is a possibility.  When I see the problems I've been experiencing I have lots of tabs and windows open, with gmail and other web apps that might be opening connections.

I also don't see "freezing" so much as just an hour glass and then time out with "site not found" error.   I'm able to hit stop and reload and get response to those actions, and sometimes that seems to work to break the deadlock and get the site to load.  Maybe I'm just hitting an open window in the race condition between two tabs or windows when I do this.
(In reply to comment #6)
> I don't have an up-to-data build tree handy but if someone can point me at a
> build I'll give it a whirl.
> 

I don't have tryserver access, I do have a mozcentral build from my macbook I can offer...

http://www.ducksong.com/mozilla/misc/firefox-3.2a1pre.en-US.mac.dmg
I can see similar things all the last days. I get timeout so often that it's really annoying. I did a test with Safari in parallel and it hangs too. So probably it will be mostly an ISP issue. I'll send the patch to the try server as soon as I can open the upload page. Now it seems to be down.
Hardware: PC → All
I've been testing wth the build from comment 7 since last night.  I still see the problem of  "server not found" time outs when clicking on links from google search results, but it seems like I'm not seeing the problem as often.   Maybe I was seeing a combination of network flakeyness, and the bug being patched, and now I'm just down to the occasional network flakeyness.  I'll keep testing to see if I can figure out more about what is going on.
Curiously the try server wasn't able to build with the patch on OS X. Sadly there is no build error output available. Linux and Windows builds are located here: https://build.mozilla.org/tryserver-builds/2008-12-13_11:00-mozilla@hskupin.info-bug467562/
probably not related to the patch for this bug, but that debug build in commment 7 crashes a very high pct. of the time when I click in a link in a thunderbird e-mail message, and the browser tries to open a new tab.
Is this ready for review?
Assignee: nobody → mcmanus
Flags: blocking1.9.1? → blocking1.9.1+
Priority: -- → P3
Comment on attachment 352191 [details] [diff] [review]
fix a race condition with timeouts and singnals reaching non-pri threads

feedback seems positive, if inconclusive. I've been using the patch on various builds without a problem.
Attachment #352191 - Flags: review?(bzbarsky)
Attachment #352191 - Flags: superreview?(bzbarsky)
Patrick, what's the race with timeout detection?

Also, where's the "one signal, one wakeup" assumption, and which documentation is comment 4 referring to?
(In reply to comment #14)
> Patrick, what's the race with timeout detection?

PR_WaitCondVar() does not indicate a timeout when returning so you have to sample 'now' before and after and calculate the elapsed time yourself and compare it to the timeout value. The elapsed is going to be an over-estimate and a value that exceeds the timeout value might not be the reason the wait returned. Without the patch, a thread might wakeup due to a signal but exit without doing anything because it perceives a timeout. This is changed to always checking to make sure there is nothing to do before exiting.

> 
> Also, where's the "one signal, one wakeup" assumption, 

threads were divided into "can do only high priority" or "can do any priority". When data was enqueued exactly one idle thread was woken, of unknown class. If the woken thread couldn't act on the queued data, it would go back to sleep without doing anything - effectively dropping the notification even though there might be another idle thread with the right classification. Realistically this wouldn't effect anything the browser was blocked on, because that all happens at high priority and all the threads are eligible to process that. The code changed from designating certain threads as "any priority" to just keeping a counter and quota of them. 

> and which documentation
> is comment 4 referring to?

http://groups.google.com/group/mozilla.dev.tech.nspr/msg/aca68a5dd34701e9?dmode=source

I'm fairly sure I saw something similar in a more formal location too...
Comment on attachment 352191 [details] [diff] [review]
fix a race condition with timeouts and singnals reaching non-pri threads

>+++ b/netwerk/dns/src/nsHostResolver.cpp
>+nsHostResolver::GetHostToLookup(nsHostRecord **result)
>+    PRBool TimedOut = PR_FALSE;

timedOut, perhaps?

>+    PRIntervalTime start = 0, now;

So |start| is also used to detect whether we've been through the loop before, right?  Please use a boolean for that.  0 is a perfectly valid value for PR_IntervalNow, after all.

>+        if (!start)
>+            now  = start =  PR_IntervalNow();
>+        
>+        PRIntervalTime expirationTime = start + ((mNumIdleThreads >= HighThreadThreshold) ? mShortIdleTimeout : mLongIdleTimeout);

Why are we basing the expiration time on when we entered this function?

>+        PRIntervalTime timeout = expirationTime - now;
>+        if (timeout < 0)
>+            timeout = 0;

Can't happen (and I'd have thought the compiler would warn you about it), since PRIntervalTime is an unsigned type.

So if now > expirationTime, you'll just end up with a huge timeout.

>+        if (now >= expirationTime)
>+            TimedOut = PR_TRUE;

Similar issue here: just because now is later than expirationTime doesn't mean that now >= expirationTime.  PR_IntervalNow values happily wrap (every 4 million seconds or so, at least).

In general, you can't compare a PR_IntervalNow() value to another PR_IntervalTime; you can only compare the difference of two PR_IntervalNow values, and only if you know they're "not too far apart".  And you need to make sure that you subtract the earlier one from the later one...

So what you really want to check, to see whether you've timed out, is whether |now - start| is greater than the timeout you desire.
Attachment #352191 - Flags: superreview?(bzbarsky)
Attachment #352191 - Flags: superreview-
Attachment #352191 - Flags: review?(bzbarsky)
Attachment #352191 - Flags: review-
Attached patch patch v2 (obsolete) — Splinter Review
I have now read http://www.mozilla.org/projects/nspr/reference/html/prinrvl.html more closely - you're right that I thought that worked differently. sorry about that.
Attachment #352191 - Attachment is obsolete: true
Attachment #356551 - Flags: review?(bzbarsky)
I've thought about a new way to test this.   When I still see the problem (as I still do with Firefox trunk build like Build identifier: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.1b3pre) Gecko/20090107 Shiretoko/3.1b3pre Ubiquity/0.1.4 ) I fire up 

  ping site_Im_pointing_the_browser_at

ping is showing that it can connect and get reasonable round trips, while Firefox still shows "looking up [site]", and then the time out page.  Is that a valid test?
i can thin(In reply to comment #18)
> I've thought about a new way to test this.   When I still see the problem (as I
> still do with Firefox trunk build like Build identifier: Mozilla/5.0
> (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.1b3pre) Gecko/20090107
> Shiretoko/3.1b3pre Ubiquity/0.1.4 ) I fire up 
> 
>   ping site_Im_pointing_the_browser_at
> 
> ping is showing that it can connect and get reasonable round trips, while
> Firefox still shows "looking up [site]", and then the time out page.  Is that a
> valid test?

It might not be perfect, depending on the OS libraries, but it would prbly corolate well and be interesting information.
Comment on attachment 356551 [details] [diff] [review]
patch v2

>+++ b/netwerk/dns/src/nsHostResolver.cpp
>+        else {
>+            timeout -= (PRIntervalTime)(now - epoch);
>+            epoch = now;
>+        }

That's needed to handle cases of us staying inside this function long enough that we wrap the counter, right?  But our timeout seems to be small enough that this should never happen...  We could even assert that if desired. 

Do we really need this block?
(In reply to comment #20)
> (From update of attachment 356551 [details] [diff] [review])
> >+++ b/netwerk/dns/src/nsHostResolver.cpp
> >+        else {
> >+            timeout -= (PRIntervalTime)(now - epoch);
> >+            epoch = now;
> >+        }
> 
> That's needed to handle cases of us staying inside this function long enough
> that we wrap the counter, right?  But our timeout seems to be small enough that
> this should never happen...  We could even assert that if desired. 
> 
> Do we really need this block?

it is needed so that the timeout value is set appropriately for the wait() the next time through the loop in the case of an early return from the wait().
Comment on attachment 356551 [details] [diff] [review]
patch v2

Aha!  Document that, please, and with that looks great.
Attachment #356551 - Flags: review?(bzbarsky) → review+
Attached patch patch v3Splinter Review
comments updated as per review comment 22
Attachment #356551 - Attachment is obsolete: true
Pushed http://hg.mozilla.org/mozilla-central/rev/a8fae999e836
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
This is still happening with the latest builds and highly visible on Vista. In my case it runs in VmWare as a Bootcamp partition. But even on OS X I see the same timeouts.

Create a fresh profile and load http://www.flickr.com. Each time I do this for a fresh profile I run into a timeout and Minefield says:

Firefox can't find the server at www.flickr.com

Hitting reload mostly helps but sometimes I've to hit it twice. Even clicking on sign in afterward times out again mostly.
Status: RESOLVED → REOPENED
Keywords: fixed1.9.1
OS: Mac OS X → All
Resolution: FIXED → ---
Henrik, since this one already has a patch+branch landings, can you file your issue as a new bug please?
Status: REOPENED → RESOLVED
Closed: 12 years ago12 years ago
Keywords: fixed1.9.1
Resolution: --- → FIXED
Sure. I've filed bug 475603.
Target Milestone: --- → mozilla1.9.2a1
You need to log in before you can comment on or make changes to this bug.