Closed
Bug 30746
(NT-primordial-thread)
Opened 25 years ago
Closed 22 years ago
Problems with the primordial thread being converted to a local thread in the MxN thread model
Categories: NSPR :: NSPR, defect, P1
Tracking: (Not tracked)
Status: RESOLVED FIXED
Target Milestone: 4.2
People: (Reporter: wtc, Assigned: wtc)
Details

Attachments (3 files, 1 obsolete file)
- patch, 4.65 KB
- patch, 10.59 KB
- patch, 2.74 KB
In the combined (MxN) thread model, the current NSPR implementation converts the primordial thread into a local thread and uses the underlying native thread as an execution entity to run local threads and the idle thread (part of the NSPR thread scheduler). This strategy is problematic when the primordial thread calls a native blocking function: the call blocks the underlying native thread and prevents the other local threads and the idle thread on that execution entity from running.

On Windows NT, this problem may manifest itself as follows:

1. A thread calls an NSPR I/O function and hangs in _NT_IO_WAIT. This is because on Windows NT, NSPR uses the idle threads to read the I/O completion port, but the idle thread (by default only one is created) is blocked. This can be worked around by calling PR_SetConcurrency(n), where n is an integer larger than 2, at the beginning of the main() function to create additional idle threads.

2. PR_CreateThread() hangs trying to acquire the internal lock _pr_activeLock, which is being held by the internal thread TimerManager (prmwait.c). The TimerManager thread is created during NSPR initialization as a local thread, which at that time is runnable but cannot be scheduled to run because the native blocking function called by the primordial thread blocks the underlying execution entity. I am not sure whether PR_SetConcurrency() can work around this problem. One workaround is to create a global thread to run the real main function and join with that global thread.

This problem has been reported so many times that I think it needs to be fixed. I propose that we leave the primordial thread as a plain native thread and create a new native thread as the execution entity to run local threads and the idle thread. This creates one more native thread than the current implementation does, but it will save us and our clients from wasting time debugging this problem down the road.
I don't understand the problems described here.

1. When a local thread calls _NT_IO_WAIT, the calling thread yields to another local thread (an application thread, the idle thread, etc.) by calling _PR_MD_WAIT. The calling local thread does not block the underlying native (CPU) thread while waiting for I/O completion.

2. The TimerManager function does not acquire _pr_activeLock; rather, the _PR_UserRunThread wrapper function acquires and releases _pr_activeLock without making a blocking call in between. So how is _PR_CreateThread blocked? The primordial thread will block only when it directly calls a native blocking function; in that case, the application will hang regardless of the presence or absence of TimerManager.
Comment 2 • 25 years ago (Assignee)
#1 is a common problem and has been reported many times. #2 has only been reported once (recently by rwalker on nspr20clients@netscape.com).

Regarding #1: by default, there is only one CPU thread in NSPR. When a thread (local or global, it doesn't matter) calls PR_Write(), it blocks in _NT_IO_WAIT(). Only the idle thread can wake that thread up from _NT_IO_WAIT(). Now, if the primordial thread (a fiber) calls a native blocking function, the CPU thread blocks, so the idle thread cannot run. A workaround is to call PR_SetConcurrency() to create more CPU threads so that even if the primordial CPU thread is blocked, the idle threads on the other CPU threads can still call GetQueuedCompletionStatus() and wake up the thread blocked in _NT_IO_WAIT().

Regarding #2: it is the _PR_UserRunThread() wrapper function (around TimerManager) that tries to acquire _pr_activeLock. Apparently it could not acquire _pr_activeLock on the first try and was put on the lock's wait queue. Then some other thread released _pr_activeLock, and the lock was assigned to TimerManager (or rather its _PR_UserRunThread wrapper), so TimerManager was made runnable. But the primordial CPU thread was blocked because the primordial thread (a fiber) had called the native blocking function GetMessage(). So even though TimerManager was runnable, it could not be scheduled to run. Then another thread called PR_CreateThread() and hung trying to acquire _pr_activeLock. This can only be reproduced on a multiprocessor system.
1. It wasn't clear initially that the primordial thread was blocked due to a direct call to a native blocking function rather than an NSPR function. In the more general case, any local thread calling a native blocking function can cause the application to hang; the workaround only solves the problem of the primordial thread calling a native blocking function.

2. In this case, the assign-lock operation is probably a bug; it ignores thread priorities when allocating the resource.
Comment 4 • 25 years ago (Assignee)
Re #1: the clients that reported this problem use global threads only. The only exception is the primordial thread, which is converted to a local thread (a fiber on NT) by NSPR. This is why only the primordial thread can cause this problem (for those clients) and why the PR_SetConcurrency() workaround works for them.
Comment 5 • 25 years ago (Assignee)
I did manage to modify NSPR so that the primordial thread is not converted to a local thread in the combined thread model. However, on second thought I realized that this change has performance implications for applications using local threads: a context switch between two local threads is much faster than one between a global thread and a local thread. If the primordial thread becomes a global thread, what used to be a local/local context switch becomes a global/local context switch.

Therefore I propose that we implement the solution mharmsen suggested:

- Call PR_SetConcurrency(2) at the end of _PR_ImplicitInitialization(). This creates an additional CPU thread that can read the NT I/O completion port in case the primordial CPU thread is blocked in a native blocking function called by the primordial thread (a fiber).
- Create the TimerManager thread as a global thread (rwalker tested this workaround), or force _PR_ImplicitInitialization() to wait until the TimerManager thread starts to run.

These changes work around problems #1 and #2 without affecting the performance of applications using local threads.
Status: NEW → ASSIGNED
Comment 6 • 25 years ago (Assignee)
Added a new test, primblok.c, that reproduces problem #1 (NSPR I/O functions hang in _NT_IO_WAIT() when the primordial thread calls a native blocking function).
Comment 7 • 24 years ago (Assignee)
Changed the bug summary from "In the MxN thread model, the primordial thread should be a plain native thread" to "Problems with the primordial thread being converted to a local thread in the MxN thread model".

I decided to use the alternative fix, for Windows NT only. The primordial thread will still be converted to a local thread (hence the new bug summary), but the fix works around the problems:

1. Call PR_SetConcurrency(2) at the end of _PR_ImplicitInitialization().
2. Have _PR_ImplicitInitialization() wait until the timer manager thread starts to run.

This fix has been checked in on the main trunk.
/cvsroot/mozilla/nsprpub/pr/src/io/prmwait.c, revision 3.8
/cvsroot/mozilla/nsprpub/pr/src/misc/prinit.c, revision 3.20
Status: ASSIGNED → RESOLVED
Closed: 24 years ago
Resolution: --- → FIXED
Summary: In the MxN thread model, the primordial thread should be a plain native thread → Problems with the primordial thread being converted to a local thread in the MxN thread model
Comment 8 • 24 years ago (Assignee)
I improved the comments in prmwait.c (to explain how I temporarily use a condition variable for a different purpose during NSPR initialization). /cvsroot/mozilla/nsprpub/pr/src/io/prmwait.c, revision 3.9
Updated • 24 years ago (Assignee)
Target Milestone: --- → 4.1
Comment 9 • 24 years ago (Assignee)
We ran into a problem with the PR_SetConcurrency(2) workaround. The primordial thread (which is converted into a fiber) may migrate to CPU thread #2, so any assumptions made about the underlying native thread on which the primordial thread runs (such as the windows it owns) are no longer valid. In the specific problem we ran into, the primordial thread is running a Windows message loop. After the primordial thread migrates to CPU thread #2, the primordial CPU becomes idle, and the messages posted to that thread's windows are no longer processed.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 10 • 24 years ago (Assignee)
Reassigned the bug to larryh.
Assignee: wtc → larryh
Status: REOPENED → NEW
Comment 11 • 24 years ago
This problem showed up on a new server development project. The symptom was a hang. Investigation of the server's operation showed that the primordial thread was using the Windows message pump to receive I/O completion notification for sockets. Because of PR_SetConcurrency(2), the primordial thread, now a fiber, moved to a different CPU thread. The Windows message pump is tied to a specific thread, and the native thread that had been the primordial thread was now running PAUSE_CPU() rather than the message pump.

I asked these folks to run with the environment variable NSPR_NATIVE_THREADS_ONLY=1 on the WinNT platform. They did, and the problem went away.

For this particular application there are several possible solutions, NSPR_NATIVE_THREADS_ONLY=1 being the first. Similarly, setting the NSPR variable _native_threads_only to 1 does the same; do this before making any other NSPR calls. I also suggested that extracting the message-pump code and putting it on a native thread by itself would likely have circumvented this problem.

Marking this as won't fix.
Status: NEW → RESOLVED
Closed: 24 years ago → 24 years ago
Resolution: --- → WONTFIX
Comment 12 • 24 years ago (Assignee)
The name of the magic global variable is nspr_native_threads_only, not _native_threads_only. The magic global variable nspr_native_threads_only is of type PRBool. It must be defined in the main executable with the DLL-export qualifier (e.g., the PR_IMPLEMENT_DATA macro). See Bugzilla bug #23694.
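A minimal sketch of what this comment describes, assuming the NSPR headers are available (this fragment will not build without the NSPR SDK and is shown for shape only):

```c
/* In the main executable (not in a DLL), define the magic variable
 * with DLL-export linkage via PR_IMPLEMENT_DATA so the NSPR DLL can
 * find it at run time. Do this before any NSPR calls are made. */
#include "nspr.h"

PR_IMPLEMENT_DATA(PRBool) nspr_native_threads_only = PR_TRUE;
```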
Comment 13 • 24 years ago (Assignee)
I backed out the PR_SetConcurrency(2) workaround. /cvsroot/mozilla/nsprpub/pr/src/misc/prinit.c, revision: 3.25
Comment 14 • 23 years ago (Assignee)
I am fed up with all the reports of this bug. I decided to make another attempt at fixing it.
Status: RESOLVED → REOPENED
Resolution: WONTFIX → ---
Target Milestone: 4.1 → 4.2
Updated • 23 years ago (Assignee)
Status: REOPENED → ASSIGNED
Target Milestone: 4.2 → 4.1.2
Comment 15 • 23 years ago (Assignee)
Comment 16 • 23 years ago (Assignee)
Set target milestone NSPR 4.2.
Assignee: larryh → wtc
Status: ASSIGNED → NEW
Target Milestone: 4.1.2 → 4.2
Comment 17 • 23 years ago (Assignee)
Comment 18 • 23 years ago (Assignee)
I checked in proposed patch v2 on the main trunk.
Status: NEW → RESOLVED
Closed: 24 years ago → 23 years ago
Resolution: --- → FIXED
Updated • 23 years ago (Assignee)
Attachment #34030 -
Attachment is obsolete: true
Comment 19 • 23 years ago (Assignee)
Julien Pierre told me that the web server does not work with this fix. I am reopening this bug.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 20 • 23 years ago (Assignee)
I tracked down why the web server doesn't work with this fix. The web server initializes NSPR from the DllMain function of a DLL. While new threads can be created by DllMain, they cannot run until DLL initialization finishes. Therefore the primordial CPU thread created by _PR_InitCPUs cannot run, but we are waiting for it to initialize the primordial CPU, hence the deadlock.

The fix is to have _PR_InitCPUs initialize the primordial CPU itself, rather than relying on the primordial CPU thread to do it. Since parts of a CPU must be initialized by its own CPU thread, I modified _PR_CreateCPU so that it initializes only enough of the CPU to make it usable by a global thread or by other CPU threads. Then I added a new function, _PR_StartCPU, which is called by the new CPU thread to complete the initialization of a new CPU. I will attach a patch.
Status: REOPENED → ASSIGNED
Priority: P3 → P1
Comment 21 • 23 years ago (Assignee)
Comment 22 • 23 years ago (Assignee)
Julien reported in bug 126946 that the current fix does not work if NSPR_NATIVE_THREADS_ONLY is set to 1. This patch fixed that problem.
Comment 23 • 22 years ago (Assignee)
I forgot to mark this bug fixed. This bug is fixed in NSPR 4.2.
Status: ASSIGNED → RESOLVED
Closed: 23 years ago → 22 years ago
Resolution: --- → FIXED
Comment 24 • 22 years ago (Assignee)
A workaround for problem #2 is to add a PR_Sleep(PR_INTERVAL_NO_WAIT) call to the main thread (or the thread that implicitly initializes NSPR) before it calls any other NSPR functions. The PR_Sleep(PR_INTERVAL_NO_WAIT) call makes the NSPR primordial thread (a fiber) yield, which gives the TimerManager thread a chance to run. At that time, there should be no other NSPR threads around, so the TimerManager thread should be able to acquire _pr_activeLock without blocking and release it.
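A sketch of this workaround in an application's main function, assuming the NSPR headers are available (this fragment will not build without the NSPR SDK):

```c
#include "nspr.h"

int main(int argc, char **argv)
{
    /* Yield immediately so the TimerManager thread (made runnable
     * during NSPR's implicit initialization) can acquire and
     * release _pr_activeLock before any other NSPR calls. */
    PR_Sleep(PR_INTERVAL_NO_WAIT);

    /* ... the rest of the application's NSPR calls ... */
    return 0;
}
```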
Updated • 22 years ago (Assignee)
Alias: NT-primordial-thread