Closed
Bug 467562
Opened 16 years ago
Closed 16 years ago
DNS requests seem to hang sometimes
Categories
(Core :: Networking, defect, P3)
Core
Networking
Tracking
()
RESOLVED
FIXED
mozilla1.9.2a1
People
(Reporter: sdwilsh, Assigned: mcmanus)
References
Details
(Keywords: fixed1.9.1)
Attachments
(1 file, 2 obsolete files)
12.33 KB,
patch
|
Details | Diff | Splinter Review |
About once a day my browser won't load anything in the current tab. This seems to happen most often with my gmail tab. The taskbar says it's looking up the host, and it just sits there. The only way I've been able to resolve this issue and make it actually load is to open up a new tab and just try loading an arbitrary site (usually google.com).
I originally thought it was a flakey connection, but I'm hard-wired now. I asked on irc, and at least one other person was seeing this as well.
I'm thinking this might be because of dns prefetching?
Flags: blocking1.9.1?
Comment 1•16 years ago
|
||
I'm experiencing similar behaviour - though it's hard to know when it's Firefox and when it's my ISP.
Assignee | ||
Comment 2•16 years ago
|
||
hm. I haven't seen that.
Just to be certain - are you saying that tab A freezes, and then you open tab B and initiate a lookup in tab B and that will un-freeze A?
That sounds like a lookup for A got queued but it did not make it out of the queue when a thread opened up for it until the new request was submitted in B. That shouldn't happen :(
Reporter | ||
Comment 3•16 years ago
|
||
(In reply to comment #2)
> Just to be certain - are you saying that tab A freezes, and then you open tab B
> and initiate a lookup in tab B and that will un-freeze A?
>
> That sounds like a lookup for A got queued but it did not make it out of the
> queue when a thread opened up for it until the new request was submitted in B.
> That shouldn't happen :(
Yes, that is what I'm seeing.
Assignee | ||
Comment 4•16 years ago
|
||
ah.
ok, I see a problem with signal accounting and logic. Basically the code is assuming 1 signal to 1 wakeup, which the documentation says you cannot count on, and there is also a problem with timeout detection (which is inherently racy with the nspr api).
I'll make a patch.
Assignee | ||
Comment 5•16 years ago
|
||
as I cannot reproduce this, can you give the attached patch a spin?
Comment 6•16 years ago
|
||
I don't have an up-to-data build tree handy but if someone can point me at a build I'll give it a whirl.
I haven't noticed the interaction between tabs, but that is a possibility. When I see the problems I've been experiencing I have lots of tabs and windows open, with gmail and other web apps that might be opening connections.
I also don't see "freezing" so much as just an hour glass and then time out with "site not found" error. I'm able to hit stop and reload and get response to those actions, and sometimes that seems to work to break the deadlock and get the site to load. Maybe I'm just hitting an open window in the race condition between two tabs or windows when I do this.
Assignee | ||
Comment 7•16 years ago
|
||
(In reply to comment #6)
> I don't have an up-to-data build tree handy but if someone can point me at a
> build I'll give it a whirl.
>
I don't have tryserver access, I do have a mozcentral build from my macbook I can offer...
http://www.ducksong.com/mozilla/misc/firefox-3.2a1pre.en-US.mac.dmg
Comment 8•16 years ago
|
||
I can see similar things all the last days. I get timeout so often that it's really annoying. I did a test with Safari in parallel and it hangs too. So probably it will be mostly an ISP issue. I'll send the patch to the try server as soon as I can open the upload page. Now it seems to be down.
Hardware: PC → All
Comment 9•16 years ago
|
||
I've been testing wth the build from comment 7 since last night. I still see the problem of "server not found" time outs when clicking on links from google search results, but it seems like I'm not seeing the problem as often. Maybe I was seeing a combination of network flakeyness, and the bug being patched, and now I'm just down to the occasional network flakeyness. I'll keep testing to see if I can figure out more about what is going on.
Comment 10•16 years ago
|
||
Curiously the try server wasn't able to build with the patch on OS X. Sadly there is no build error output available. Linux and Windows builds are located here: https://build.mozilla.org/tryserver-builds/2008-12-13_11:00-mozilla@hskupin.info-bug467562/
Comment 11•16 years ago
|
||
probably not related to the patch for this bug, but that debug build in commment 7 crashes a very high pct. of the time when I click in a link in a thunderbird e-mail message, and the browser tries to open a new tab.
Comment 12•16 years ago
|
||
Is this ready for review?
Assignee: nobody → mcmanus
Flags: blocking1.9.1? → blocking1.9.1+
Priority: -- → P3
Assignee | ||
Comment 13•16 years ago
|
||
Comment on attachment 352191 [details] [diff] [review]
fix a race condition with timeouts and singnals reaching non-pri threads
feedback seems positive, if inconclusive. I've been using the patch on various builds without a problem.
Attachment #352191 -
Flags: review?(bzbarsky)
Updated•16 years ago
|
Attachment #352191 -
Flags: superreview?(bzbarsky)
Comment 14•16 years ago
|
||
Patrick, what's the race with timeout detection?
Also, where's the "one signal, one wakeup" assumption, and which documentation is comment 4 referring to?
Assignee | ||
Comment 15•16 years ago
|
||
(In reply to comment #14)
> Patrick, what's the race with timeout detection?
PR_WaitCondVar() does not indicate a timeout when returning so you have to sample 'now' before and after and calculate the elapsed time yourself and compare it to the timeout value. The elapsed is going to be an over-estimate and a value that exceeds the timeout value might not be the reason the wait returned. Without the patch, a thread might wakeup due to a signal but exit without doing anything because it perceives a timeout. This is changed to always checking to make sure there is nothing to do before exiting.
>
> Also, where's the "one signal, one wakeup" assumption,
threads were divided into "can do only high priority" or "can do any priority". When data was enqueued exactly one idle thread was woken, of unknown class. If the woken thread couldn't act on the queued data, it would go back to sleep without doing anything - effectively dropping the notification even though there might be another idle thread with the right classification. Realistically this wouldn't effect anything the browser was blocked on, because that all happens at high priority and all the threads are eligible to process that. The code changed from designating certain threads as "any priority" to just keeping a counter and quota of them.
> and which documentation
> is comment 4 referring to?
http://groups.google.com/group/mozilla.dev.tech.nspr/msg/aca68a5dd34701e9?dmode=source
I'm fairly sure I saw something similar in a more formal location too...
Comment 16•16 years ago
|
||
Comment on attachment 352191 [details] [diff] [review]
fix a race condition with timeouts and singnals reaching non-pri threads
>+++ b/netwerk/dns/src/nsHostResolver.cpp
>+nsHostResolver::GetHostToLookup(nsHostRecord **result)
>+ PRBool TimedOut = PR_FALSE;
timedOut, perhaps?
>+ PRIntervalTime start = 0, now;
So |start| is also used to detect whether we've been through the loop before, right? Please use a boolean for that. 0 is a perfectly valid value for PR_IntervalNow, after all.
>+ if (!start)
>+ now = start = PR_IntervalNow();
>+
>+ PRIntervalTime expirationTime = start + ((mNumIdleThreads >= HighThreadThreshold) ? mShortIdleTimeout : mLongIdleTimeout);
Why are we basing the expiration time on when we entered this function?
>+ PRIntervalTime timeout = expirationTime - now;
>+ if (timeout < 0)
>+ timeout = 0;
Can't happen (and I'd have thought the compiler would warn you about it), since PRIntervalTime is an unsigned type.
So if now > expirationTime, you'll just end up with a huge timeout.
>+ if (now >= expirationTime)
>+ TimedOut = PR_TRUE;
Similar issue here: just because now is later than expirationTime doesn't mean that now >= expirationTime. PR_IntervalNow values happily wrap (every 4 million seconds or so, at least).
In general, you can't compare a PR_IntervalNow() value to another PR_IntervalTime; you can only compare the difference of two PR_IntervalNow values, and only if you know they're "not too far apart". And you need to make sure that you subtract the earlier one from the later one...
So what you really want to check, to see whether you've timed out, is whether |now - start| is greater than the timeout you desire.
Attachment #352191 -
Flags: superreview?(bzbarsky)
Attachment #352191 -
Flags: superreview-
Attachment #352191 -
Flags: review?(bzbarsky)
Attachment #352191 -
Flags: review-
Assignee | ||
Comment 17•16 years ago
|
||
I have now read http://www.mozilla.org/projects/nspr/reference/html/prinrvl.html more closely - you're right that I thought that worked differently. sorry about that.
Attachment #352191 -
Attachment is obsolete: true
Attachment #356551 -
Flags: review?(bzbarsky)
Comment 18•16 years ago
|
||
I've thought about a new way to test this. When I still see the problem (as I still do with Firefox trunk build like Build identifier: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.1b3pre) Gecko/20090107 Shiretoko/3.1b3pre Ubiquity/0.1.4 ) I fire up
ping site_Im_pointing_the_browser_at
ping is showing that it can connect and get reasonable round trips, while Firefox still shows "looking up [site]", and then the time out page. Is that a valid test?
Assignee | ||
Comment 19•16 years ago
|
||
i can thin(In reply to comment #18)
> I've thought about a new way to test this. When I still see the problem (as I
> still do with Firefox trunk build like Build identifier: Mozilla/5.0
> (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.1b3pre) Gecko/20090107
> Shiretoko/3.1b3pre Ubiquity/0.1.4 ) I fire up
>
> ping site_Im_pointing_the_browser_at
>
> ping is showing that it can connect and get reasonable round trips, while
> Firefox still shows "looking up [site]", and then the time out page. Is that a
> valid test?
It might not be perfect, depending on the OS libraries, but it would prbly corolate well and be interesting information.
Comment 20•16 years ago
|
||
Comment on attachment 356551 [details] [diff] [review]
patch v2
>+++ b/netwerk/dns/src/nsHostResolver.cpp
>+ else {
>+ timeout -= (PRIntervalTime)(now - epoch);
>+ epoch = now;
>+ }
That's needed to handle cases of us staying inside this function long enough that we wrap the counter, right? But our timeout seems to be small enough that this should never happen... We could even assert that if desired.
Do we really need this block?
Assignee | ||
Comment 21•16 years ago
|
||
(In reply to comment #20)
> (From update of attachment 356551 [details] [diff] [review])
> >+++ b/netwerk/dns/src/nsHostResolver.cpp
> >+ else {
> >+ timeout -= (PRIntervalTime)(now - epoch);
> >+ epoch = now;
> >+ }
>
> That's needed to handle cases of us staying inside this function long enough
> that we wrap the counter, right? But our timeout seems to be small enough that
> this should never happen... We could even assert that if desired.
>
> Do we really need this block?
it is needed so that the timeout value is set appropriately for the wait() the next time through the loop in the case of an early return from the wait().
Comment 22•16 years ago
|
||
Comment on attachment 356551 [details] [diff] [review]
patch v2
Aha! Document that, please, and with that looks great.
Attachment #356551 -
Flags: review?(bzbarsky) → review+
Assignee | ||
Comment 23•16 years ago
|
||
comments updated as per review comment 22
Attachment #356551 -
Attachment is obsolete: true
Comment 24•16 years ago
|
||
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Comment 25•16 years ago
|
||
Keywords: fixed1.9.1
Comment 26•16 years ago
|
||
This is still happening with the latest builds and highly visible on Vista. In my case it runs in VmWare as a Bootcamp partition. But even on OS X I see the same timeouts.
Create a fresh profile and load http://www.flickr.com. Each time I do this for a fresh profile I run into a timeout and Minefield says:
Firefox can't find the server at www.flickr.com
Hitting reload mostly helps but sometimes I've to hit it twice. Even clicking on sign in afterward times out again mostly.
Comment 27•16 years ago
|
||
Henrik, since this one already has a patch+branch landings, can you file your issue as a new bug please?
Status: REOPENED → RESOLVED
Closed: 16 years ago → 16 years ago
Keywords: fixed1.9.1
Resolution: --- → FIXED
You need to log in
before you can comment on or make changes to this bug.
Description
•