Closed Bug 30746 (NT-primordial-thread) Opened 25 years ago Closed 22 years ago

Problems with the primordial thread being converted to a local thread in the MxN thread model

Categories

(NSPR :: NSPR, defect, P1)

x86
Windows NT
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: wtc, Assigned: wtc)

Details

Attachments

(3 files, 1 obsolete file)

In the combined (MxN) thread model, the current implementation
of NSPR converts the primordial thread into a local thread and
uses the underlying native thread as an execution entity to run
local threads and the idle thread (part of the NSPR thread
scheduler).

This implementation strategy is problematic when the primordial
thread calls a native blocking function that blocks the underlying
native thread, thus preventing the other local threads and the
idle thread on that execution entity from running.  On Windows NT,
this problem may manifest itself as follows:
1. A thread calls an NSPR I/O function and hangs in _NT_IO_WAIT.
   This is because on Windows NT NSPR uses the idle threads to
   read the I/O completion port but the idle thread (by default
   only one is created) is blocked.  This can be worked around
   by calling PR_SetConcurrency(n), where n is an integer larger
   than 2, at the beginning of the main() function to create
   additional idle threads (see the sketch after this list).
2. PR_CreateThread() hangs, trying to acquire the internal lock
   _pr_activeLock, which is being held by the internal thread
   TimerManager (prmwait.c).  The TimerManager thread is created
   during NSPR initialization as a local thread, which at the time
   is runnable but cannot be scheduled to run because the native
   blocking function called by the primordial thread blocks the
   underlying execution entity.  I am not sure if PR_SetConcurrency()
   can work around this problem.  One workaround is to create a
   global thread to run the real main function and join with that
   global thread (also sketched below).
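A minimal sketch of the two workarounds described above (the helper
RealMain and the exit-code handling are made up for illustration, not
taken from any client's code):

    #include "nspr.h"

    static int gExitCode = 0;

    /* The real main function, run on a global (native) thread so that
     * the primordial thread never blocks the local-thread scheduler. */
    static void PR_CALLBACK RealMain(void *arg)
    {
        /* ... application code, including native blocking calls ... */
        gExitCode = 0;
    }

    int main(int argc, char **argv)
    {
        PRThread *t;

        /* Workaround for problem 1: extra CPU threads let an idle
         * thread keep reading the NT I/O completion port even if the
         * primordial CPU thread is blocked. */
        PR_SetConcurrency(2);

        /* Workaround for problem 2: run the real main function on a
         * global thread and join with it. */
        t = PR_CreateThread(PR_USER_THREAD, RealMain, NULL,
                            PR_PRIORITY_NORMAL, PR_GLOBAL_THREAD,
                            PR_JOINABLE_THREAD, 0);
        if (t == NULL)
            return 1;
        PR_JoinThread(t);
        return gExitCode;
    }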

This problem has been reported so many times that I think it
needs to be fixed.  I propose that we leave the primordial thread
as a plain native thread and create a new native thread as the
execution entity to run local threads and the idle thread.
This will create one more native thread than the current
implementation does, but it will save us and our clients
from wasting time debugging this problem down the road.
I don't understand the problems described here.

1. When a local thread calls _NT_IO_WAIT, the calling thread yields to another 
local thread (an application thread, the idle thread, etc) by calling 
_PR_MD_WAIT. The calling local thread does not block the underlying native (cpu) 
thread while waiting for IO completion.

2. Since the TimerManager function does not acquire _pr_activeLock, but rather 
the "_PR_UserRunThread" wrapper function acquires and releases _pr_activeLock 
without making a blocking call in between, how is _PR_CreateThread blocked? The 
primordial thread will block only when it directly calls a native blocking 
function; in that case, the application will hang, irrespective of the 
presence/absence of TimerManager.
#1 is a common problem and has been reported
many times.

#2 has only been reported once (recently by
rwalker on nspr20clients@netscape.com).

Regarding #1: by default, there is only one
CPU thread in NSPR.  When a thread (local or
global, doesn't matter) calls PR_Write(),
it blocks in _NT_IO_WAIT().  Only the
idle thread can wake up that thread from
_NT_IO_WAIT().  Now, if the primordial
thread (a fiber) calls a native blocking
function, the CPU thread blocks, so the
idle thread won't be able to run.  A
workaround is to call PR_SetConcurrency()
to create more CPU threads so that even
if the primordial CPU thread is blocked,
the idle threads on other CPU threads can
still call GetQueuedCompletionStatus() and
wake up the thread blocked in _NT_IO_WAIT().

Regarding #2: it is the _PR_UserRunThread()
wrapper function (around TimerManager) that
tries to acquire _pr_activeLock.  Apparently,
it could not acquire _pr_activeLock on first
try and was put on the lock's wait queue.
Then, some other thread released _pr_activeLock
and the lock was assigned to TimerManager (or
rather its _PR_UserRunThread wrapper).  So
TimerManager was made runnable.  But the
primordial CPU thread was blocked because the
primordial thread (a fiber) called the native
blocking function GetMessage().  So even though
TimerManager was runnable, it could not be
scheduled to run.  Then, another thread called
PR_CreateThread() and was hung trying to acquire
_pr_activeLock.  This can only be reproduced on
a multiprocessor system.

1. It wasn't clear initially that the primordial thread was blocked due to a 
direct call to a native blocking function, rather than an NSPR function.
In the more general case, any local thread calling a native blocking function 
can cause the application to hang; the workaround will only solve the problem of 
the primordial thread calling a native blocking function.

2. In this case, the Assign-lock operation is probably a bug; it ignores the 
thread priorities for resource allocation.
Re: #1

The clients that reported this problem use
global threads only.  The only exception is
the primordial thread, which is converted to
a local thread (a fiber on NT) by NSPR.
This is why only the primordial thread may
cause this problem (for those clients) and
why the PR_SetConcurrency() workaround works
for them.
I did manage to modify NSPR so that the primordial
thread is not converted to a local thread in the
combined thread model.  However, on second thought
I realized that this change would have performance
implications for applications using local threads.

The context switch between two local threads is
much faster than the context switch between a
global thread and a local thread.  If the primordial
thread becomes a global thread, what used to be
a local/local context switch will become a global/local
context switch.

Therefore I propose that we implement the solution
that mharmsen suggested:
- Call PR_SetConcurrency(2) at the end of
  _PR_ImplicitInitialization().  This creates an
  additional CPU thread that can read the NT I/O
  completion port in case the primordial CPU thread
  is blocked in a native blocking function called
  by the primordial thread (a fiber).
- Create the TimerManager thread as a global thread
  (rwalker tested this workaround) or force
  _PR_ImplicitInitialization() to wait until the
  TimerManager thread starts to run.

These will work around problems #1 and #2 and will
not affect the performance of applications using
local threads.
Status: NEW → ASSIGNED
Added a new test primblok.c that reproduces problem #1
(NSPR I/O functions hang in _NT_IO_WAIT() when the
primordial thread calls a native blocking function).
Changed the bug summary from
    In the MxN thread model, the primordial thread
    should be a plain native thread
to
    Problems with the primordial thread being converted
    to a local thread in the MxN thread model

I decided to use the alternative fix, for Windows
NT only.  So the primordial thread will still be
converted to a local thread (hence the new bug
summary), but the fix works around the problems.
1. Call PR_SetConcurrency(2) at the end of
   _PR_ImplicitInitialization().
2. Have _PR_ImplicitInitialization() wait
   until the timer manager thread starts to
   run.

This fix has been checked in on the main trunk.
/cvsroot/mozilla/nsprpub/pr/src/io/prmwait.c, revision 3.8
/cvsroot/mozilla/nsprpub/pr/src/misc/prinit.c, revision 3.20
Status: ASSIGNED → RESOLVED
Closed: 24 years ago
Resolution: --- → FIXED
Summary: In the MxN thread model, the primordial thread should be a plain native thread → Problems with the primordial thread being converted to a local thread in the MxN thread model
I improved the comments in prmwait.c (to explain how
I temporarily use a condition variable for a different
purpose during NSPR initialization).
/cvsroot/mozilla/nsprpub/pr/src/io/prmwait.c, revision 3.9
Target Milestone: --- → 4.1
We ran into a problem with the PR_SetConcurrency(2)
workaround.  The primordial thread (which is converted
into a fiber) may migrate to CPU thread #2, and hence
any assumptions made about the underlying native thread
on which the primordial thread runs (such as the window
it owns) are no longer valid.

In the specific problem we ran into, the primordial
thread is running a Windows message loop.  After the
primordial thread migrates to CPU thread #2, the
primordial CPU becomes idle, and the messages posted
to that window are no longer processed.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Reassigned the bug to larryh.
Assignee: wtc → larryh
Status: REOPENED → NEW
This problem showed up on a new server development project. The symptom was a 
hang. Investigation of the server's operation showed that the primordial thread 
was using the Windows message pump to receive I/O completion notifications for 
sockets. Because of PR_SetConcurrency(2), the primordial thread, now a fiber, 
moved to a different CPU thread. The Windows message pump is tied to a specific 
native thread, and the native thread that had hosted the primordial thread was 
now doing nothing but running PAUSE_CPU(), so the message pump was never serviced.

I asked these folks to run with the environment variable 
"NSPR_NATIVE_THREADS_ONLY=1" on the WinNT platform. They did, and the problem 
went away.

For this particular application there are several possible solutions. Setting 
the environment variable "NSPR_NATIVE_THREADS_ONLY=1" is the first. Similarly, 
setting the NSPR variable _native_threads_only to 1 does the same; do this 
before making any other NSPR calls. I also suggested that extracting the 
message-pump code and putting it on a native thread by itself would likely have 
circumvented this problem.

Marking this as won't fix.
Status: NEW → RESOLVED
Closed: 24 years ago
Resolution: --- → WONTFIX
The name of the magic global variable is "nspr_native_threads_only",
not "_native_threads_only".

The magic global variable "nspr_native_threads_only"
is of the type PRBool.  It must be defined in the main
executable using the DLL-export qualifier (e.g., the
PR_IMPLEMENT_DATA macro).

See Bugzilla bug #23694.
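A minimal sketch of such a definition in the main executable (initializing
it to PR_TRUE as the opt-in value is an assumption here; see bug #23694 for
the actual mechanism):

    #include "prtypes.h"

    /* Must be DLL-exported from the main executable so NSPR can find
     * it by name at initialization time. */
    PR_IMPLEMENT_DATA(PRBool) nspr_native_threads_only = PR_TRUE;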
I backed out the PR_SetConcurrency(2) workaround.
/cvsroot/mozilla/nsprpub/pr/src/misc/prinit.c, revision: 3.25
I am fed up with all the reports of this bug.
I decided to make another attempt at fixing it.
Status: RESOLVED → REOPENED
Resolution: WONTFIX → ---
Target Milestone: 4.1 → 4.2
Status: REOPENED → ASSIGNED
Target Milestone: 4.2 → 4.1.2
Set target milestone NSPR 4.2.
Assignee: larryh → wtc
Status: ASSIGNED → NEW
Target Milestone: 4.1.2 → 4.2
I checked in proposed patch v2 on the main trunk.
Status: NEW → RESOLVED
Closed: 24 years ago → 23 years ago
Resolution: --- → FIXED
Attachment #34030 - Attachment is obsolete: true
Julien Pierre told me that the web server does not work
with this fix.  I am reopening this bug.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
I tracked down why the web server doesn't work with
this fix.

The web server initializes NSPR from the DllMain function
of a DLL.  While new threads can be created by DllMain,
they won't be able to run during DLL initialization.
Therefore, the primordial CPU thread created by _PR_InitCPUs
cannot run, but we are waiting for it to initialize the
primordial CPU, hence the deadlock.

The fix is to have _PR_InitCPUs initialize the primordial
CPU, as opposed to relying on the primordial CPU thread to
do that.  Since parts of the primordial CPU must be initialized
by the primordial CPU thread, I modified _PR_CreateCPU so
that it only initializes enough of the CPU to make it usable
by a global thread or other CPU threads.  Then, I added a
new function, _PR_StartCPU, that will be called by the new
CPU thread to complete the initialization of a new CPU.

I will attach a patch.
Status: REOPENED → ASSIGNED
Priority: P3 → P1
Julien reported in bug 126946 that the current fix
does not work if NSPR_NATIVE_THREADS_ONLY is set to 1.
This patch fixed that problem.
I forgot to mark this bug fixed.  This bug is fixed in NSPR 4.2.
Status: ASSIGNED → RESOLVED
Closed: 23 years ago → 22 years ago
Resolution: --- → FIXED
A workaround for problem #2 is to add a PR_Sleep(PR_INTERVAL_NO_WAIT)
call to the main thread (or the thread that implicitly initializes
NSPR) before it calls any other NSPR functions.

The PR_Sleep(PR_INTERVAL_NO_WAIT) call makes the NSPR primordial
thread (a fiber) yield, which gives the TimerManager thread a
chance to run.  At that time, there should be no other NSPR threads
around, so the TimerManager thread should be able to acquire
_pr_activeLock without blocking and release it.
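A minimal sketch of this workaround (the rest of main() is illustrative):

    #include "nspr.h"

    int main(int argc, char **argv)
    {
        /* Implicitly initializes NSPR and yields the primordial thread
         * (a fiber), giving the TimerManager thread a chance to acquire
         * and release _pr_activeLock before any other NSPR calls. */
        PR_Sleep(PR_INTERVAL_NO_WAIT);

        /* ... other NSPR calls ... */
        return 0;
    }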
Alias: NT-primordial-thread