Open
Bug 255452
Opened 21 years ago
Updated 2 years ago
PR_Cleanup hangs when called from DllMain
Categories
(NSPR :: NSPR, defect, P3)
Tracking
(Not tracked)
NEW
People
(Reporter: julien.pierre, Unassigned)
Details
Attachments
(2 files, 1 obsolete file)
458 bytes,
patch
|
Details | Diff | Splinter Review | |
549 bytes,
patch
|
Details | Diff | Splinter Review |
This bug only happens in the NT version of NSPR .
1. build a DLL which just calls PR_Init in the DllMain DLL_PROCESS_ATTACH, and
PR_Cleanup in the DLL_PROCESS_DETACH
2. write a test program that just calls LoadLibrary and FreeLibrary on this DLL
3. the test will hang in the DllMain of the library in _PR_CleanupCPUs, trying
to join the CPU thread .
This problem appears to be a regression introduced by bug 161998 .
Given comments in the code in prucpu.c, I believe it was the intent that NSPR
should support being initialized in DllMain . There is no comment about shutting
it down in DllMain, but that wouldn't be unreasonable if you also had
initialized it there .
Microsoft has some warnings about certain things not to do inside DllMain, but I
couldn't see any of them applying to this case. Synchronization is specifically
allowed .
Reporter | ||
Comment 1•21 years ago
|
||
Reporter | ||
Comment 2•21 years ago
|
||
Reporter | ||
Comment 3•21 years ago
|
||
It looks like it is impossible to wait for a thread within DllMain using
WaitForSingleObject . This is because when the other thread ends after being
notified, it will itself call into DllMain for DLL_THREAD_DETACH processing !
But that gets serialized by the OS, because DllMain is already in
DLL_PROCESS_DETACH (and calling PR_Cleanup).
See
http://groups.google.com/groups?q=DLLMAIN+waitforsingleobject&hl=en&lr=&ie=UTF-8&selm=01bbbc9d%2451ed1580%244602a8c0%40pc_agra&rnum=2
I think the correct fix for this problem is to use an event object, and post it
right at the end of the CPU thread, before calling _endthreadex .
Then, instead of waiting on the thread handle, PR_CleanupCPUs could wait on the
event handle, and there should be no deadlock .
I will try this and attach a patch if it works.
Reporter | ||
Comment 4•21 years ago
|
||
Instead of joining the CPU thread with WaitForSingleObject, use a Windows event
object that gets notified from the CPU thread when it dies. PR_CleanupCPUs can
successfully wait on that condition variable, but not the thread due to the
DllMain Windows lock.
Reporter | ||
Updated•21 years ago
|
Assignee: wchang0222 → julien.pierre.bugs
Status: NEW → ASSIGNED
Reporter | ||
Updated•21 years ago
|
Attachment #156018 -
Flags: superreview?(wchang0222)
Attachment #156018 -
Flags: review?(bugz)
Comment 5•21 years ago
|
||
Comment on attachment 156018 [details] [diff] [review]
Patch to stop deadlock
I'm going to give this a tentative r+, with the caveat that I really know very
little about this. From what I can gather from the code and the newsgroup
posting, it seems your approach is a viable solution to a known problem. But I
know next to nothing about the Windows API.
One question: are you sure it is OK to change the size of the _MDCPU structure?
Given that there was already an unused field there, why not just replace it
with the HANDLE?
Attachment #156018 -
Flags: review?(bugz) → review+
Comment 6•21 years ago
|
||
Comment on attachment 156018 [details] [diff] [review]
Patch to stop deadlock
If we need to fix this bug, this patch is good although
it may need some fine tuning. (Ian is right that the
'unused' field in the _MDCPU structure should be deleted
when you add the 'die' field.)
But I question whether this bug needs to be fixed. The
reason is that PR_Cleanup does not free everything that
PR_Init created, so each time you load and unload the
test DLL, some memory and resources will be leaked. If
the test DLL is loaded and unloaded repeatedly, the
leak grows with time. Another problem is that PR_Init
and PR_Cleanup are not reference counted. So if another
DLL or the executable also uses NSPR, the test DLL will
shut down NSPR and break the other NSPR users in the
process.
So the use case that is prevented by the bug is
questionable.
Comment 7•21 years ago
|
||
Just to clarify: the reason 'unused' should be deleted
is not to preserve the size of the _MDCPU structure.
It is because the _MDCPU structure has a real field
('die') so the dummy 'unused' field is no longer
necessary.
Comment 8•21 years ago
|
||
Comment on attachment 156018 [details] [diff] [review]
Patch to stop deadlock
mozilla/nsprpub/pr/src/threads/combined/prucpu.c
> #ifdef WINNT
> if (_pr_cpus_exit) {
> /* _PR_CleanupCPUs tells us to exit */
>+ /* Notify _PR_CleanupCPUs that we are done */
>+ SetEvent(cpu->md.die);
> _PR_MD_END_THREAD();
> }
> #endif
One risk of this patch is that the "CPU" threads
are still running after the SetEvent call above,
and they will use NSPR global data structures.
If any of the NSPR global data structures used
by them before their actual termination are
destroyed by PR_Cleanup, they will crash. For
example, _pr_activeLock seems like such a global
variable.
To test my theory, you can add
Sleep(10000);
after the SetEvent call.
Reporter | ||
Comment 9•21 years ago
|
||
Wan-Teh,
About the unused / die field, I'm fine with that change.
We have indeed unfortunately found out in the last week that there is a lot of
cleanup that isn't currently performed by PR_Cleanup, both for threads and for
memory. We know it will be a long time before they can all be resolved.
The reason I presented this particular patch and asked for review and I think it
needs to be fixed is because this part of the cleanup code was added to the NSPR
tree, and it introduced a regression. We had an application that was on NSPR
4.1.6, which was doing init/cleanup in DllMain, and we ran into this hang when
we tried moving up to 4.5 .
That does not mean there aren't plenty of other issues that remain unresolved in
cleanup, but I think this patch is one useful step towards fixing the many
remaining PR_Cleanup issues.
As far as the situation where there are several unrelated DLL users of NSPR in a
non-NSPR main executable, that's a more complex situation and I'm not sure how
to officially support it.
We have found a hack that allows NSPR to remain loaded in a non-NSPR Windows
executable even if the DLL using NSPR is freed. The hack is to just have the DLL
call LoadLibrary("libnspr4") during its initialization. This bumps up the OS
refcount of the NSPR DLL handle in Windows, so even if this DLL user gets freed,
the NSPR code stays in memory until the process terminates :-)). The downside to
this hack is that in this case, none of the DLL users can call PR_Cleanup at
all, or they would be stepping on each other's toes.
We will probably end up using this hack to fix the situation, even with a single
DLL user of NSPR, because that DLL can be loaded and reloaded multiple times; so
hopefully we won't be in a hurry to fix all the NSPR cleanup issues, as this
could take many months.
Reporter | ||
Comment 10•21 years ago
|
||
Wan-Teh,
Doesn't the code in PR_CleanupCPUs wait for all the CPU threads to end ?
My understanding is that was the intent, it just didn't happen successfully in
this particular case of PR_Cleanup being called in DllMain () . I just changed
the code to wait for the event variable instead of the actual thread.
I don't see any NSPR global variable being used between the SetEvent and the
next line, which is _PR_MD_END_THREAD(), and calls _endthreadex, so this should
be safe, unless there is somethig I don't understand about the CPU threads here
(which is quite possible!).
I did try your suggestion of adding a Sleep after the SetEvent() and do see a
crash in the test, often related to _pr_activelock, as part of a thread creation
in an NSPR re-initialization . However, I fail to see how why this happens.
Comment 11•21 years ago
|
||
The hack you described in comment 9 was recently
discussed in the netscape.public.mozilla.nspr
newsgroup. It was to address a similar problem
of loading and unloading an IIS plugin that uses
NSPR. I explained why one should not initialize
and clean up NSPR repeatedly, and proposed the
general idea of workaround by causing NSPR to
stay loaded in the process permanently.
Comment 12•21 years ago
|
||
The "CPU" threads have multiple layers:
a Win32 native thread running
_PR_NativeRunThread (pruthr.c), which runs
thread->startFunc: _PR_RunCPU (prucpu.c), which runs
an NT fiber running
_PR_CPU_Idle (prucpu.c)
The function you modified (added the SetEvent call to) is the
innermost layer (_PR_CPU_Idle). So after the new SetEvent
call, the "CPU" thread will execute a lot of NSPR code before
it truly terminates. In particular, the "CPU" thread will
call PR_Lock(_pr_activeLock) after (*thread->startFunc)(thread->arg)
returns in _PR_NativeRunThread. So if PR_Cleanup (which
destroys _pr_activeLock) finishes before the "CPU" threads
truly terminate, the "CPU" thread will crash. Also, since
NSPR automatically gets initialized, the "CPU" thread may
also cause NSPR to be re-initialized after PR_Cleanup finishes
or when it is in progress. The Sleep call I asked you to add
is meant to cause these race conditions.
Reporter | ||
Comment 13•21 years ago
|
||
Wan-Teh,
I sure wish I had been reading the NSPR newsgroup and known that somebody else
had the very same issue we did (with an IIS plug-in as well) ...
Thanks for the info about the CPU threads. This helps a lot. I think it means
there is more code needed to make the native thread terminate and wait for it,
perhaps with another event semaphore similarly to the fiber.
Comment 14•20 years ago
|
||
Julien: the newsgroup posting I mentioned in comment 11
is by Tom Wocken on 8/3/2004 with the subject
"NSPR DLL access violation". Now I look at his messages
again, I realize that he was not writing an IIS plugin.
Sorry about the confusion.
Reporter | ||
Updated•20 years ago
|
Priority: -- → P3
Reporter | ||
Comment 15•20 years ago
|
||
Wan-Teh,
To fix the remaining part of the problem, the uppermost layers
(PR_NativeRunThread for sure, and perhaps PR_RunCPU) would need to detect that
NSPR has been cleaned up and exit. This seems a hard task to accomplish when the
cleanup is still in progress. The only thing that seems appropriate would be
some process-wise variable set by the OS (eg. system object) that could be
polled to short circuit. However, I'm not sure that would be available on all
platforms for us to fix.
I think what you have uncovered is a pre-existing problem however, which could
happen in other situations even without fibers, on other platforms. For example,
somebody could start a native thread on any platform, which would do a long
sleep . The main thread of the application would call PR_Cleanup(), but not
immediately exit. Then, the other thread would wake up from its sleep and return
to _PR_NativeRunThread, and cause a crash dereferencinng _pr_activeLock.
Reporter | ||
Comment 16•20 years ago
|
||
In response to comment #6, I'm aware of the leaked resources, and I believe
there are bugs open on at least some of them .
The lack of refcounting is only an issue if there are multiple modules doing
initialization / cleanup of NSPR, which may or may not be the case.
I think this should probably be tracked in a separate bug.
Do you agree at least that the current patch is not harmful and should be
checked in (after removing the unused field) ?
Reporter | ||
Comment 17•20 years ago
|
||
Wan-Teh,
Could you please tell me what you think of comment #16 ? I would like to close
this bug if possible.
Reporter | ||
Comment 18•20 years ago
|
||
Wan-Teh,
I would like to come to a closure on this bug . If you think we don't need fix
this case, then at least we should update the documentation for PR_Cleanup and
warn about its limitations.
Comment 19•20 years ago
|
||
The idea of your patch is fine. I'm just saying that the
SetEvent call is done too early; see my comment 12. The
SetEvent call should be done in the _PR_NativeRunThread
function after the last access to any NSPR global data
structures. Doing that will need more complicated code.
Reporter | ||
Updated•18 years ago
|
Assignee: bugzilla → wtchang
Status: ASSIGNED → NEW
Reporter | ||
Updated•18 years ago
|
Attachment #156018 -
Attachment is obsolete: true
Attachment #156018 -
Flags: superreview?(wtchang)
Updated•18 years ago
|
QA Contact: wtchang → nspr
Updated•2 years ago
|
Severity: normal → S3
Comment 20•2 years ago
|
||
The bug assignee is inactive on Bugzilla, so the assignee is being reset.
Assignee: wtc → nobody
You need to log in
before you can comment on or make changes to this bug.
Description
•