Closed Bug 84420 Opened 23 years ago Closed 23 years ago

BeOS and OS/2 should use Unix Async DNS Implementation

Categories

(Core :: Networking, defect, P3)

Other
Other
defect

Tracking

()

VERIFIED FIXED
mozilla0.9.2

People

(Reporter: gordon, Assigned: mkaply)

Details

(Whiteboard: critical for 0.9.2)

Attachments

(3 files)

The DNS cache doesn't evict lookups on BEOS and OS/2.  We need to call 
AddToEvictionQ() after processing the request on new/pending lookups for 
platforms that don't use the dns thread for async lookups.
Mike, do you test on both BEOS and OS/2, or just OS/2?  Could you test a patch 
once I get it done?
Status: NEW → ASSIGNED
Priority: -- → P3
Target Milestone: --- → mozilla0.9.2
afaik mkaply is just a concerned citizen and doesn't have a BeOS box.  I 
know cls builds on BeOS.
Summary: DNS cache doesn't evict on BEOS and OS/2 → DNS cache doesn't evict on BeOS and OS/2
Someone else is going to need to pick up testing this.
Keywords: qawanted
If mkaply and cls can review the patch, I'll get an sr= and an a= and check it 
in.

Another possibility is that OS/2 and BeOS may be able to simply use the UNIX 
implementation.  That would clean up the code a bit as well.  I'll submit an 
optimistic patch so someone can try that as well.
How would I actually test the patch to see if it was working/failing?
The second patch is using PR_GetIPNodeByName() for OS/2.  Wan-Teh, is that 
available on OS/2?
Yes, PR_GetIPNodeByName() is available on OS/2.
Setting target milestone to 0.9.3, but I'll check it in as soon as I get test 
feedback and reviews.
Target Milestone: mozilla0.9.2 → mozilla0.9.3
OK, without any patches applied, DNS works fine.

First patch wouldn't compile - PRCLIST_IS_EMPTY is not defined.

Second patch compiled, but I got this assert coming up:

###!!! ASSERTION: bad DNS shutdown state: 'mState == 
DNS_SHUTTING_DOWN', file D:
/builds/current/mozilla/netwerk/dns/src/nsDnsService.cpp, line 1219

And DNS did not work at all.
I don't understand the problem with the optimal patch.  I looks like the dns 
thread is failing to wait on the condition variable in DequeueLookup() or getting 
wrongly notified when there are no lookups in the queue.  It's not clear to me 
how this could be happening on OS/2, but not Unix.

Wan-Teh, are there any known differences with PR_WaitCondVar() on OS/2?
You MUST ALWAYS call PR_WaitCondVar in a while loop.

The following code is wrong.
 
nsDNSLookup *
nsDNSService::DequeuePendingQ()
{
#if defined(XP_UNIX)
    // Wait for notification of a queued request, unless  we're shutting down.
    if (PR_CLIST_IS_EMPTY(&mPendingQ) && (mState != DNS_SHUTTING_DOWN)) {
        PRStatus  status = PR_WaitCondVar(mDNSCondVar, PR_INTERVAL_NO_TIMEOUT);
        NS_ASSERTION(status == PR_SUCCESS, "PR_WaitCondVar failed.");
    }
#endif

You should say
    while (PR_CLIST_IS_EMPTY(&mPendingQ) && (mState != DNS_SHUTTING_DOWN)) {
Ah, great.  I'll fix it on the trunk as soon as I can push through the review 
process.  I wonder why we haven't seen this as a problem on Unix?
in the original "async dns" XP_UNIX code that i wrote for bug 10733, we had
something similar to:

DNS thread:
  mon.enter();
  if (!pending_lookup && !shutting_down)
      mon.wait();
  if (pending_lookup)
      do_lookup();
  else
      do_shutdown();
  mon.exit();

client thread:
  mon.enter();
  pending_lookup = true;
  mon.notify();
  mon.exit();

in other words, it seemed ok for there not to be a pending_lookup when the
monitor was notified... it meant to shutdown (note: the shutting_down flag was
just used to handle the shutdown case in which the DNS thread had not yet run).
the new code seems to be basically identical.

on OS/2 it seems to be the case that a pending_lookup is added before the
monitor (convar in the new world) is notified, but after the DNS thread resumes
pending_lookup is for some reason not set.

wtc: why is the while loop needed?  i always thought that a while loop was only
need in the case of multiple threads waiting on a condition upon which only one
thread may act.  is this not the case?  thanks!
The OS/2 problem may be not caused by not using a while loop
here.  I am merely pointing out that as a rule, you must call
PR_WaitCondVar or PR_Wait in a while loop.  That's required by
the API.

Two reasons the while loop is required.
1. PR_WaitCondVar or PR_Wait may return due to a spurious wakeup.
   Note that an implementation may not actually generate spurious
   wakeups, but they are allowed by the API specification.
2. If multiple threads may call PR_WaitCondVar, another thread
   may be able to enter the monitor sooner than the awakened thread
   can (a thread awakened from a PR_WaitCondVar call needs to re-enter
   the monitor) and make the boolean expression false again.  This
   condition is called a stolen wakeup.

Again, I am not saying that the OS/2 problem is caused by a
spurious wakeup or a stolen wakeup.  In fact, you may be able
to analyze your code and the PR_WaitCondVar implementation on
OS/2 and determine that these don't happen.  I am just saying
that it is a syntactical requirement (one that can't be enforced
by a compiler) that all PR_WaitCondVar calls must be inside a
while loop.
fair enough... thanks for the thorough explanation :-)

in this case, it is not a stolen wakeup, because we only have one thread 
that may call PR_WaitCondVar.  perhaps it is the spurious wakeup problem,
since we are only seeing this on OS/2 and not any of the unix platforms.

sounds to me like we need to do a scan of all usages of PR_Wait or
PR_WaitCondVar to ensure that they are not subject to similar problems.
I suspect there may be problems with the thread pool implementation.
The simple patch has been checked in.  So I'm morphing this bug to state the
desire to have BeOS and OS/2 use the Unix implementation of async dns lookups
(unless those platforms have a native async gethostbyname API that someone wants
to use).
Summary: DNS cache doesn't evict on BeOS and OS/2 → BeOS and OS/2 should use Unix Async DNS Implementation
Target Milestone: mozilla0.9.3 → mozilla1.0
OK, with the current level of nsDnsService.cpp, Os/2 can follow the 
UNIX path no problem (as per the optimistic patch)

Do we want to assume BEOS can and check in the changes? Or just check 
in the pieces with XP_OS2?
Gordon,

OS/2 can use the Unix implementation. Do you want to assume that BeOS can? Or 
should I create a patch that just fixes this for OS/2?
You know, I'm tempted to change it for both; if it works - great! If it breaks
BeOS and somebody cares, then we should hear from them, which will let us know
who we can work with to resolve the issues.  I suppose we should make an
announcement on netscape.public.mozilla.netlib first.
you could also post to netscape.public.mozilla.beos, ... I just updated my tree 
so i might try a beos build this week.
Sorry to be late, I tested the 'optimistic' patch on BeOS build and it is
working for me so far.
I'm posting this from patched Bezilla:-)
Why is this targeted for 1.0?

I'd like to see this in 0.9.2. Don't we leak without this?
mkaply: gordon is out of town for the next couple of weeks... so, this probably
won't go in for 0.9.2 unless some kind soul takes it... how 'bout you? ;-) 
I'm taking this bug to push for checkin to 0.9.2.

Gordon wrote the code, and I looked at it and said it was OK, so

r=mkaply

Makoto Hamanaka verified it on BeOS, so I am going to take r=VYA04230@nifty.com 
for the BEOS part.

darin: Can I get an sr=darin so I can send this to drivers?
Assignee: gordon → mkaply
Status: ASSIGNED → NEW
Target Milestone: mozilla1.0 → mozilla0.9.2
sr=darin
Fix checked in to trunk and branch (branch approval from Asa by mail)
Status: NEW → RESOLVED
Closed: 23 years ago
Resolution: --- → FIXED
Whiteboard: critical for 0.9.2
Keywords: verifyme
Verify fixed
Status: RESOLVED → VERIFIED
Keywords: verifyme
Keywords: qawanted
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: