467562 - DNS requests seem to hang sometimes

Reporter

Description

•

16 years ago

About once a day my browser won't load anything in the current tab. This seems to happen most often with my gmail tab. The taskbar says it's looking up the host, and it just sits there. The only way I've been able to resolve this issue and make it actually load is to open up a new tab and just try loading an arbitrary site (usually google.com). I originally thought it was a flakey connection, but I'm hard-wired now. I asked on irc, and at least one other person was seeing this as well. I'm thinking this might be because of dns prefetching?

Flags: blocking1.9.1?

bhearsum@mozilla.com (:bhearsum)

Comment 1

•

16 years ago

I'm experiencing similar behaviour - though it's hard to know when it's Firefox and when it's my ISP.

Patrick McManus [:mcmanus]

Assignee

Comment 2

•

16 years ago

hm. I haven't seen that. Just to be certain - are you saying that tab A freezes, and then you open tab B and initiate a lookup in tab B and that will un-freeze A? That sounds like a lookup for A got queued but it did not make it out of the queue when a thread opened up for it until the new request was submitted in B. That shouldn't happen :(

Shawn Wilsher :sdwilsh

Reporter

Comment 3

•

16 years ago

(In reply to comment #2) > Just to be certain - are you saying that tab A freezes, and then you open tab B > and initiate a lookup in tab B and that will un-freeze A? > > That sounds like a lookup for A got queued but it did not make it out of the > queue when a thread opened up for it until the new request was submitted in B. > That shouldn't happen :( Yes, that is what I'm seeing.

Patrick McManus [:mcmanus]

Assignee

Comment 4

•

16 years ago

ah. ok, I see a problem with signal accounting and logic. Basically the code is assuming 1 signal to 1 wakeup, which the documentation says you cannot count on, and there is also a problem with timeout detection (which is inherently racy with the nspr api). I'll make a patch.

Patrick McManus [:mcmanus]

Assignee

Comment 5

•

16 years ago

Attached patch fix a race condition with timeouts and singnals reaching non-pri threads (obsolete) — Details — Splinter Review

as I cannot reproduce this, can you give the attached patch a spin?

chris hofmann

Comment 6

•

16 years ago

I don't have an up-to-data build tree handy but if someone can point me at a build I'll give it a whirl. I haven't noticed the interaction between tabs, but that is a possibility. When I see the problems I've been experiencing I have lots of tabs and windows open, with gmail and other web apps that might be opening connections. I also don't see "freezing" so much as just an hour glass and then time out with "site not found" error. I'm able to hit stop and reload and get response to those actions, and sometimes that seems to work to break the deadlock and get the site to load. Maybe I'm just hitting an open window in the race condition between two tabs or windows when I do this.

Patrick McManus [:mcmanus]

Assignee

Comment 7

•

16 years ago

(In reply to comment #6) > I don't have an up-to-data build tree handy but if someone can point me at a > build I'll give it a whirl. > I don't have tryserver access, I do have a mozcentral build from my macbook I can offer... http://www.ducksong.com/mozilla/misc/firefox-3.2a1pre.en-US.mac.dmg

Henrik Skupin [:whimboo][⌚️UTC+2]

Comment 8

•

16 years ago

I can see similar things all the last days. I get timeout so often that it's really annoying. I did a test with Safari in parallel and it hangs too. So probably it will be mostly an ISP issue. I'll send the patch to the try server as soon as I can open the upload page. Now it seems to be down.

Hardware: PC → All

chris hofmann

Comment 9

•

16 years ago

I've been testing wth the build from comment 7 since last night. I still see the problem of "server not found" time outs when clicking on links from google search results, but it seems like I'm not seeing the problem as often. Maybe I was seeing a combination of network flakeyness, and the bug being patched, and now I'm just down to the occasional network flakeyness. I'll keep testing to see if I can figure out more about what is going on.

Henrik Skupin [:whimboo][⌚️UTC+2]

Comment 10

•

16 years ago

Curiously the try server wasn't able to build with the patch on OS X. Sadly there is no build error output available. Linux and Windows builds are located here: https://build.mozilla.org/tryserver-builds/2008-12-13_11:00-mozilla@hskupin.info-bug467562/

chris hofmann

Comment 11

•

16 years ago

probably not related to the patch for this bug, but that debug build in commment 7 crashes a very high pct. of the time when I click in a link in a thunderbird e-mail message, and the browser tries to open a new tab.

Benjamin Smedberg

Comment 12

•

16 years ago

Is this ready for review?

Assignee: nobody → mcmanus

Flags: blocking1.9.1? → blocking1.9.1+

Priority: -- → P3

Patrick McManus [:mcmanus]

Assignee

Comment 13

•

16 years ago

Comment on attachment 352191 [details] [diff] [review] fix a race condition with timeouts and singnals reaching non-pri threads feedback seems positive, if inconclusive. I've been using the patch on various builds without a problem.

Attachment #352191 - Flags: review?(bzbarsky)

Reed Loden [:reed]

Updated

•

16 years ago

Attachment #352191 - Flags: superreview?(bzbarsky)

Boris Zbarsky [:bzbarsky]

Comment 14

•

16 years ago

Patrick, what's the race with timeout detection? Also, where's the "one signal, one wakeup" assumption, and which documentation is comment 4 referring to?

Patrick McManus [:mcmanus]

Assignee

Comment 15

•

16 years ago

(In reply to comment #14) > Patrick, what's the race with timeout detection? PR_WaitCondVar() does not indicate a timeout when returning so you have to sample 'now' before and after and calculate the elapsed time yourself and compare it to the timeout value. The elapsed is going to be an over-estimate and a value that exceeds the timeout value might not be the reason the wait returned. Without the patch, a thread might wakeup due to a signal but exit without doing anything because it perceives a timeout. This is changed to always checking to make sure there is nothing to do before exiting. > > Also, where's the "one signal, one wakeup" assumption, threads were divided into "can do only high priority" or "can do any priority". When data was enqueued exactly one idle thread was woken, of unknown class. If the woken thread couldn't act on the queued data, it would go back to sleep without doing anything - effectively dropping the notification even though there might be another idle thread with the right classification. Realistically this wouldn't effect anything the browser was blocked on, because that all happens at high priority and all the threads are eligible to process that. The code changed from designating certain threads as "any priority" to just keeping a counter and quota of them. > and which documentation > is comment 4 referring to? http://groups.google.com/group/mozilla.dev.tech.nspr/msg/aca68a5dd34701e9?dmode=source I'm fairly sure I saw something similar in a more formal location too...

Boris Zbarsky [:bzbarsky]

Comment 16

•

16 years ago

Comment on attachment 352191 [details] [diff] [review] fix a race condition with timeouts and singnals reaching non-pri threads >+++ b/netwerk/dns/src/nsHostResolver.cpp >+nsHostResolver::GetHostToLookup(nsHostRecord **result) >+ PRBool TimedOut = PR_FALSE; timedOut, perhaps? >+ PRIntervalTime start = 0, now; So |start| is also used to detect whether we've been through the loop before, right? Please use a boolean for that. 0 is a perfectly valid value for PR_IntervalNow, after all. >+ if (!start) >+ now = start = PR_IntervalNow(); >+ >+ PRIntervalTime expirationTime = start + ((mNumIdleThreads >= HighThreadThreshold) ? mShortIdleTimeout : mLongIdleTimeout); Why are we basing the expiration time on when we entered this function? >+ PRIntervalTime timeout = expirationTime - now; >+ if (timeout < 0) >+ timeout = 0; Can't happen (and I'd have thought the compiler would warn you about it), since PRIntervalTime is an unsigned type. So if now > expirationTime, you'll just end up with a huge timeout. >+ if (now >= expirationTime) >+ TimedOut = PR_TRUE; Similar issue here: just because now is later than expirationTime doesn't mean that now >= expirationTime. PR_IntervalNow values happily wrap (every 4 million seconds or so, at least). In general, you can't compare a PR_IntervalNow() value to another PR_IntervalTime; you can only compare the difference of two PR_IntervalNow values, and only if you know they're "not too far apart". And you need to make sure that you subtract the earlier one from the later one... So what you really want to check, to see whether you've timed out, is whether |now - start| is greater than the timeout you desire.

Attachment #352191 - Flags: superreview?(bzbarsky)

Attachment #352191 - Flags: superreview-

Attachment #352191 - Flags: review?(bzbarsky)

Attachment #352191 - Flags: review-

Patrick McManus [:mcmanus]

Assignee

Comment 17

•

16 years ago

Attached patch patch v2 (obsolete) — Details — Splinter Review

I have now read http://www.mozilla.org/projects/nspr/reference/html/prinrvl.html more closely - you're right that I thought that worked differently. sorry about that.

Attachment #352191 - Attachment is obsolete: true

Attachment #356551 - Flags: review?(bzbarsky)

chris hofmann

Comment 18

•

16 years ago

I've thought about a new way to test this. When I still see the problem (as I still do with Firefox trunk build like Build identifier: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.1b3pre) Gecko/20090107 Shiretoko/3.1b3pre Ubiquity/0.1.4 ) I fire up ping site_Im_pointing_the_browser_at ping is showing that it can connect and get reasonable round trips, while Firefox still shows "looking up [site]", and then the time out page. Is that a valid test?

Patrick McManus [:mcmanus]

Assignee

Comment 19

•

16 years ago

i can thin(In reply to comment #18) > I've thought about a new way to test this. When I still see the problem (as I > still do with Firefox trunk build like Build identifier: Mozilla/5.0 > (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.1b3pre) Gecko/20090107 > Shiretoko/3.1b3pre Ubiquity/0.1.4 ) I fire up > > ping site_Im_pointing_the_browser_at > > ping is showing that it can connect and get reasonable round trips, while > Firefox still shows "looking up [site]", and then the time out page. Is that a > valid test? It might not be perfect, depending on the OS libraries, but it would prbly corolate well and be interesting information.

Boris Zbarsky [:bzbarsky]

Comment 20

•

16 years ago

Comment on attachment 356551 [details] [diff] [review] patch v2 >+++ b/netwerk/dns/src/nsHostResolver.cpp >+ else { >+ timeout -= (PRIntervalTime)(now - epoch); >+ epoch = now; >+ } That's needed to handle cases of us staying inside this function long enough that we wrap the counter, right? But our timeout seems to be small enough that this should never happen... We could even assert that if desired. Do we really need this block?

Patrick McManus [:mcmanus]

Assignee

Comment 21

•

16 years ago

(In reply to comment #20) > (From update of attachment 356551 [details] [diff] [review]) > >+++ b/netwerk/dns/src/nsHostResolver.cpp > >+ else { > >+ timeout -= (PRIntervalTime)(now - epoch); > >+ epoch = now; > >+ } > > That's needed to handle cases of us staying inside this function long enough > that we wrap the counter, right? But our timeout seems to be small enough that > this should never happen... We could even assert that if desired. > > Do we really need this block? it is needed so that the timeout value is set appropriately for the wait() the next time through the loop in the case of an early return from the wait().

Boris Zbarsky [:bzbarsky]

Comment 22

•

16 years ago

Comment on attachment 356551 [details] [diff] [review] patch v2 Aha! Document that, please, and with that looks great.

Attachment #356551 - Flags: review?(bzbarsky) → review+

Patrick McManus [:mcmanus]

Assignee

Comment 23

•

16 years ago

Attached patch patch v3 — Details — Splinter Review

comments updated as per review comment 22

Attachment #356551 - Attachment is obsolete: true

Boris Zbarsky [:bzbarsky]

Comment 24

•

16 years ago

Pushed http://hg.mozilla.org/mozilla-central/rev/a8fae999e836

Status: NEW → RESOLVED

Closed: 16 years ago

Resolution: --- → FIXED

Boris Zbarsky [:bzbarsky]

Comment 25

•

16 years ago

Pushed http://hg.mozilla.org/releases/mozilla-1.9.1/rev/fbcd8d15953c

Keywords: fixed1.9.1

Henrik Skupin [:whimboo][⌚️UTC+2]

Updated

•

16 years ago

Blocks: 469459

Henrik Skupin [:whimboo][⌚️UTC+2]

Comment 26

•

16 years ago

This is still happening with the latest builds and highly visible on Vista. In my case it runs in VmWare as a Bootcamp partition. But even on OS X I see the same timeouts. Create a fresh profile and load http://www.flickr.com. Each time I do this for a fresh profile I run into a timeout and Minefield says: Firefox can't find the server at www.flickr.com Hitting reload mostly helps but sometimes I've to hit it twice. Even clicking on sign in afterward times out again mostly.

Status: RESOLVED → REOPENED

Keywords: fixed1.9.1

OS: Mac OS X → All

Resolution: FIXED → ---

Benjamin Smedberg

Comment 27

•

16 years ago

Henrik, since this one already has a patch+branch landings, can you file your issue as a new bug please?

Status: REOPENED → RESOLVED

Closed: 16 years ago → 16 years ago

Keywords: fixed1.9.1

Resolution: --- → FIXED

Henrik Skupin [:whimboo][⌚️UTC+2]

Comment 28

•

16 years ago

Sure. I've filed bug 475603.

Target Milestone: --- → mozilla1.9.2a1

fix a race condition with timeouts and singnals reaching non-pri threads 16 years ago Patrick McManus [:mcmanus] 12.15 KB, patch	bzbarsky : review- bzbarsky : superreview-	Details \| Diff \| Splinter Review
patch v2 16 years ago Patrick McManus [:mcmanus] 12.07 KB, patch	bzbarsky : review+	Details \| Diff \| Splinter Review
patch v3 16 years ago Patrick McManus [:mcmanus] 12.33 KB, patch		Details \| Diff \| Splinter Review