DNS requests seem to hang sometimes

RESOLVED FIXED in mozilla1.9.2a1

Status

()

defect
P3
normal
RESOLVED FIXED
11 years ago
10 years ago

People

(Reporter: sdwilsh, Assigned: mcmanus)

Tracking

({fixed1.9.1})

Trunk
mozilla1.9.2a1
Points:
---
Bug Flags:
blocking1.9.1 +

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(1 attachment, 2 obsolete attachments)

(Reporter)

Description

11 years ago
About once a day my browser won't load anything in the current tab.  This seems to happen most often with my gmail tab.  The taskbar says it's looking up the host, and it just sits there.  The only way I've been able to resolve this issue and make it actually load is to open up a new tab and just try loading an arbitrary site (usually google.com).

I originally thought it was a flakey connection, but I'm hard-wired now.  I asked on irc, and at least one other person was seeing this as well.

I'm thinking this might be because of dns prefetching?
Flags: blocking1.9.1?
I'm experiencing similar behaviour - though it's hard to know when it's Firefox and when it's my ISP.
(Assignee)

Comment 2

11 years ago
hm. I haven't seen that.

Just to be certain - are you saying that tab A freezes, and then you open tab B and initiate a lookup in tab B and that will un-freeze A?

That sounds like a lookup for A got queued but it did not make it out of the queue when a thread opened up for it until the new request was submitted in B. That shouldn't happen :(
(Reporter)

Comment 3

11 years ago
(In reply to comment #2)
> Just to be certain - are you saying that tab A freezes, and then you open tab B
> and initiate a lookup in tab B and that will un-freeze A?
> 
> That sounds like a lookup for A got queued but it did not make it out of the
> queue when a thread opened up for it until the new request was submitted in B.
> That shouldn't happen :(
Yes, that is what I'm seeing.
(Assignee)

Comment 4

11 years ago
ah.

ok, I see a problem with signal accounting and logic. Basically the code is assuming 1 signal to 1 wakeup, which the documentation says you cannot count on, and there is also a problem with timeout detection (which is inherently racy with the nspr api).

I'll make a patch.
(Assignee)

Comment 5

11 years ago
as I cannot reproduce this, can you give the attached patch a spin?

Comment 6

11 years ago
I don't have an up-to-data build tree handy but if someone can point me at a build I'll give it a whirl.

I haven't noticed the interaction between tabs, but that is a possibility.  When I see the problems I've been experiencing I have lots of tabs and windows open, with gmail and other web apps that might be opening connections.

I also don't see "freezing" so much as just an hour glass and then time out with "site not found" error.   I'm able to hit stop and reload and get response to those actions, and sometimes that seems to work to break the deadlock and get the site to load.  Maybe I'm just hitting an open window in the race condition between two tabs or windows when I do this.
(Assignee)

Comment 7

11 years ago
(In reply to comment #6)
> I don't have an up-to-data build tree handy but if someone can point me at a
> build I'll give it a whirl.
> 

I don't have tryserver access, I do have a mozcentral build from my macbook I can offer...

http://www.ducksong.com/mozilla/misc/firefox-3.2a1pre.en-US.mac.dmg
I can see similar things all the last days. I get timeout so often that it's really annoying. I did a test with Safari in parallel and it hangs too. So probably it will be mostly an ISP issue. I'll send the patch to the try server as soon as I can open the upload page. Now it seems to be down.
Hardware: PC → All

Comment 9

11 years ago
I've been testing wth the build from comment 7 since last night.  I still see the problem of  "server not found" time outs when clicking on links from google search results, but it seems like I'm not seeing the problem as often.   Maybe I was seeing a combination of network flakeyness, and the bug being patched, and now I'm just down to the occasional network flakeyness.  I'll keep testing to see if I can figure out more about what is going on.
Curiously the try server wasn't able to build with the patch on OS X. Sadly there is no build error output available. Linux and Windows builds are located here: https://build.mozilla.org/tryserver-builds/2008-12-13_11:00-mozilla@hskupin.info-bug467562/

Comment 11

11 years ago
probably not related to the patch for this bug, but that debug build in commment 7 crashes a very high pct. of the time when I click in a link in a thunderbird e-mail message, and the browser tries to open a new tab.

Comment 12

11 years ago
Is this ready for review?
Assignee: nobody → mcmanus
Flags: blocking1.9.1? → blocking1.9.1+
Priority: -- → P3
(Assignee)

Comment 13

11 years ago
Comment on attachment 352191 [details] [diff] [review]
fix a race condition with timeouts and singnals reaching non-pri threads

feedback seems positive, if inconclusive. I've been using the patch on various builds without a problem.
Attachment #352191 - Flags: review?(bzbarsky)
Attachment #352191 - Flags: superreview?(bzbarsky)
Patrick, what's the race with timeout detection?

Also, where's the "one signal, one wakeup" assumption, and which documentation is comment 4 referring to?
(Assignee)

Comment 15

11 years ago
(In reply to comment #14)
> Patrick, what's the race with timeout detection?

PR_WaitCondVar() does not indicate a timeout when returning so you have to sample 'now' before and after and calculate the elapsed time yourself and compare it to the timeout value. The elapsed is going to be an over-estimate and a value that exceeds the timeout value might not be the reason the wait returned. Without the patch, a thread might wakeup due to a signal but exit without doing anything because it perceives a timeout. This is changed to always checking to make sure there is nothing to do before exiting.

> 
> Also, where's the "one signal, one wakeup" assumption, 

threads were divided into "can do only high priority" or "can do any priority". When data was enqueued exactly one idle thread was woken, of unknown class. If the woken thread couldn't act on the queued data, it would go back to sleep without doing anything - effectively dropping the notification even though there might be another idle thread with the right classification. Realistically this wouldn't effect anything the browser was blocked on, because that all happens at high priority and all the threads are eligible to process that. The code changed from designating certain threads as "any priority" to just keeping a counter and quota of them. 

> and which documentation
> is comment 4 referring to?

http://groups.google.com/group/mozilla.dev.tech.nspr/msg/aca68a5dd34701e9?dmode=source

I'm fairly sure I saw something similar in a more formal location too...
Comment on attachment 352191 [details] [diff] [review]
fix a race condition with timeouts and singnals reaching non-pri threads

>+++ b/netwerk/dns/src/nsHostResolver.cpp
>+nsHostResolver::GetHostToLookup(nsHostRecord **result)
>+    PRBool TimedOut = PR_FALSE;

timedOut, perhaps?

>+    PRIntervalTime start = 0, now;

So |start| is also used to detect whether we've been through the loop before, right?  Please use a boolean for that.  0 is a perfectly valid value for PR_IntervalNow, after all.

>+        if (!start)
>+            now  = start =  PR_IntervalNow();
>+        
>+        PRIntervalTime expirationTime = start + ((mNumIdleThreads >= HighThreadThreshold) ? mShortIdleTimeout : mLongIdleTimeout);

Why are we basing the expiration time on when we entered this function?

>+        PRIntervalTime timeout = expirationTime - now;
>+        if (timeout < 0)
>+            timeout = 0;

Can't happen (and I'd have thought the compiler would warn you about it), since PRIntervalTime is an unsigned type.

So if now > expirationTime, you'll just end up with a huge timeout.

>+        if (now >= expirationTime)
>+            TimedOut = PR_TRUE;

Similar issue here: just because now is later than expirationTime doesn't mean that now >= expirationTime.  PR_IntervalNow values happily wrap (every 4 million seconds or so, at least).

In general, you can't compare a PR_IntervalNow() value to another PR_IntervalTime; you can only compare the difference of two PR_IntervalNow values, and only if you know they're "not too far apart".  And you need to make sure that you subtract the earlier one from the later one...

So what you really want to check, to see whether you've timed out, is whether |now - start| is greater than the timeout you desire.
Attachment #352191 - Flags: superreview?(bzbarsky)
Attachment #352191 - Flags: superreview-
Attachment #352191 - Flags: review?(bzbarsky)
Attachment #352191 - Flags: review-
(Assignee)

Comment 17

10 years ago
Posted patch patch v2 (obsolete) — Splinter Review
I have now read http://www.mozilla.org/projects/nspr/reference/html/prinrvl.html more closely - you're right that I thought that worked differently. sorry about that.
Attachment #352191 - Attachment is obsolete: true
Attachment #356551 - Flags: review?(bzbarsky)

Comment 18

10 years ago
I've thought about a new way to test this.   When I still see the problem (as I still do with Firefox trunk build like Build identifier: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.1b3pre) Gecko/20090107 Shiretoko/3.1b3pre Ubiquity/0.1.4 ) I fire up 

  ping site_Im_pointing_the_browser_at

ping is showing that it can connect and get reasonable round trips, while Firefox still shows "looking up [site]", and then the time out page.  Is that a valid test?
(Assignee)

Comment 19

10 years ago
i can thin(In reply to comment #18)
> I've thought about a new way to test this.   When I still see the problem (as I
> still do with Firefox trunk build like Build identifier: Mozilla/5.0
> (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.1b3pre) Gecko/20090107
> Shiretoko/3.1b3pre Ubiquity/0.1.4 ) I fire up 
> 
>   ping site_Im_pointing_the_browser_at
> 
> ping is showing that it can connect and get reasonable round trips, while
> Firefox still shows "looking up [site]", and then the time out page.  Is that a
> valid test?

It might not be perfect, depending on the OS libraries, but it would prbly corolate well and be interesting information.
Comment on attachment 356551 [details] [diff] [review]
patch v2

>+++ b/netwerk/dns/src/nsHostResolver.cpp
>+        else {
>+            timeout -= (PRIntervalTime)(now - epoch);
>+            epoch = now;
>+        }

That's needed to handle cases of us staying inside this function long enough that we wrap the counter, right?  But our timeout seems to be small enough that this should never happen...  We could even assert that if desired. 

Do we really need this block?
(Assignee)

Comment 21

10 years ago
(In reply to comment #20)
> (From update of attachment 356551 [details] [diff] [review])
> >+++ b/netwerk/dns/src/nsHostResolver.cpp
> >+        else {
> >+            timeout -= (PRIntervalTime)(now - epoch);
> >+            epoch = now;
> >+        }
> 
> That's needed to handle cases of us staying inside this function long enough
> that we wrap the counter, right?  But our timeout seems to be small enough that
> this should never happen...  We could even assert that if desired. 
> 
> Do we really need this block?

it is needed so that the timeout value is set appropriately for the wait() the next time through the loop in the case of an early return from the wait().
Comment on attachment 356551 [details] [diff] [review]
patch v2

Aha!  Document that, please, and with that looks great.
Attachment #356551 - Flags: review?(bzbarsky) → review+
(Assignee)

Comment 23

10 years ago
Posted patch patch v3Splinter Review
comments updated as per review comment 22
Attachment #356551 - Attachment is obsolete: true
Pushed http://hg.mozilla.org/mozilla-central/rev/a8fae999e836
Status: NEW → RESOLVED
Last Resolved: 10 years ago
Resolution: --- → FIXED
This is still happening with the latest builds and highly visible on Vista. In my case it runs in VmWare as a Bootcamp partition. But even on OS X I see the same timeouts.

Create a fresh profile and load http://www.flickr.com. Each time I do this for a fresh profile I run into a timeout and Minefield says:

Firefox can't find the server at www.flickr.com

Hitting reload mostly helps but sometimes I've to hit it twice. Even clicking on sign in afterward times out again mostly.
Status: RESOLVED → REOPENED
Keywords: fixed1.9.1
OS: Mac OS X → All
Resolution: FIXED → ---

Comment 27

10 years ago
Henrik, since this one already has a patch+branch landings, can you file your issue as a new bug please?
Status: REOPENED → RESOLVED
Last Resolved: 10 years ago10 years ago
Keywords: fixed1.9.1
Resolution: --- → FIXED
Sure. I've filed bug 475603.
Target Milestone: --- → mozilla1.9.2a1
You need to log in before you can comment on or make changes to this bug.