Open Bug 255452 Opened 21 years ago Updated 2 years ago

PR_Cleanup hangs when called from DllMain

Categories

(NSPR :: NSPR, defect, P3)

x86
Windows 2000

Tracking

(Not tracked)

People

(Reporter: julien.pierre, Unassigned)

Details

Attachments

(2 files, 1 obsolete file)

This bug only happens in the NT version of NSPR . 1. build a DLL which just calls PR_Init in the DllMain DLL_PROCESS_ATTACH, and PR_Cleanup in the DLL_PROCESS_DETACH 2. write a test program that just calls LoadLibrary and FreeLibrary on this DLL 3. the test will hang in the DllMain of the library in _PR_CleanupCPUs, trying to join the CPU thread . This problem appears to be a regression introduced by bug 161998 . Given comments in the code in prucpu.c, I believe it was the intent that NSPR should support being initialized in DllMain . There is no comment about shutting it down in DllMain, but that wouldn't be unreasonable if you also had initialized it there . Microsoft has some warnings about certain things not to do inside DllMain, but I couldn't see any of them applying to this case. Synchronization is specifically allowed .
It looks like it is impossible to wait for a thread within DllMain using WaitForSingleObject . This is because when the other thread ends after being notified, it will itself call into DllMain for DLL_THREAD_DETACH processing ! But that gets serialized by the OS, because DllMain is already in DLL_PROCESS_DETACH (and calling PR_Cleanup). See http://groups.google.com/groups?q=DLLMAIN+waitforsingleobject&hl=en&lr=&ie=UTF-8&selm=01bbbc9d%2451ed1580%244602a8c0%40pc_agra&rnum=2 I think the correct fix for this problem is to use an event object, and post it right at the end of the CPU thread, before calling _endthreadex . Then, instead of waiting on the thread handle, PR_CleanupCPUs could wait on the event handle, and there should be no deadlock . I will try this and attach a patch if it works.
Attached patch Patch to stop deadlock (obsolete) — Splinter Review
Instead of joining the CPU thread with WaitForSingleObject, use a Windows event object that gets notified from the CPU thread when it dies. PR_CleanupCPUs can successfully wait on that condition variable, but not the thread due to the DllMain Windows lock.
Assignee: wchang0222 → julien.pierre.bugs
Status: NEW → ASSIGNED
Attachment #156018 - Flags: superreview?(wchang0222)
Attachment #156018 - Flags: review?(bugz)
Comment on attachment 156018 [details] [diff] [review] Patch to stop deadlock I'm going to give this a tentative r+, with the caveat that I really know very little about this. From what I can gather from the code and the newsgroup posting, it seems your approach is a viable solution to a known problem. But I know next to nothing about the Windows API. One question: are you sure it is OK to change the size of the _MDCPU structure? Given that there was already an unused field there, why not just replace it with the HANDLE?
Attachment #156018 - Flags: review?(bugz) → review+
Comment on attachment 156018 [details] [diff] [review] Patch to stop deadlock If we need to fix this bug, this patch is good although it may need some fine tuning. (Ian is right that the 'unused' field in the _MDCPU structure should be deleted when you add the 'die' field.) But I question whether this bug needs to be fixed. The reason is that PR_Cleanup does not free everything that PR_Init created, so each time you load and unload the test DLL, some memory and resources will be leaked. If the test DLL is loaded and unloaded repeatedly, the leak grows with time. Another problem is that PR_Init and PR_Cleanup are not reference counted. So if another DLL or the executable also uses NSPR, the test DLL will shut down NSPR and break the other NSPR users in the process. So the use case that is prevented by the bug is questionable.
Just to clarify: the reason 'unused' should be deleted is not to preserve the size of the _MDCPU structure. It is because the _MDCPU structure has a real field ('die') so the dummy 'unused' field is no longer necessary.
Comment on attachment 156018 [details] [diff] [review] Patch to stop deadlock mozilla/nsprpub/pr/src/threads/combined/prucpu.c > #ifdef WINNT > if (_pr_cpus_exit) { > /* _PR_CleanupCPUs tells us to exit */ >+ /* Notify _PR_CleanupCPUs that we are done */ >+ SetEvent(cpu->md.die); > _PR_MD_END_THREAD(); > } > #endif One risk of this patch is that the "CPU" threads are still running after the SetEvent call above, and they will use NSPR global data structures. If any of the NSPR global data structures used by them before their actual termination are destroyed by PR_Cleanup, they will crash. For example, _pr_activeLock seems like such a global variable. To test my theory, you can add Sleep(10000); after the SetEvent call.
Wan-Teh, About the unused / die field, I'm fine with that change. We have indeed unfortunately found out in the last week that there is a lot of cleanup that isn't currently performed by PR_Cleanup, both for threads and for memory. We know it will be a long time before they can all be resolved. The reason I presented this particular patch and asked for review and I think it needs to be fixed is because this part of the cleanup code was added to the NSPR tree, and it introduced a regression. We had an application that was on NSPR 4.1.6, which was doing init/cleanup in DllMain, and we ran into this hang when we tried moving up to 4.5 . That does not mean there aren't plenty of other issues that remain unresolved in cleanup, but I think this patch is one useful step towards fixing the many remaining PR_Cleanup issues. As far as the situation where there are several unrelated DLL users of NSPR in a non-NSPR main executable, that's a more complex situation and I'm not sure how to officially support it. We have found a hack that allows NSPR to remain loaded in a non-NSPR Windows executable even if the DLL using NSPR is freed. The hack is to just have the DLL call LoadLibrary("libnspr4") during its initialization. This bumps up the OS refcount of the NSPR DLL handle in Windows, so even if this DLL user gets freed, the NSPR code stays in memory until the process terminates :-)). The downside to this hack is that in this case, none of the DLL users can call PR_Cleanup at all, or they would be stepping on each other's toes. We will probably end up using this hack to fix the situation, even with a single DLL user of NSPR, because that DLL can be loaded and reloaded multiple times; so hopefully we won't be in a hurry to fix all the NSPR cleanup issues, as this could take many months.
Wan-Teh, Doesn't the code in PR_CleanupCPUs wait for all the CPU threads to end ? My understanding is that was the intent, it just didn't happen successfully in this particular case of PR_Cleanup being called in DllMain () . I just changed the code to wait for the event variable instead of the actual thread. I don't see any NSPR global variable being used between the SetEvent and the next line, which is _PR_MD_END_THREAD(), and calls _endthreadex, so this should be safe, unless there is somethig I don't understand about the CPU threads here (which is quite possible!). I did try your suggestion of adding a Sleep after the SetEvent() and do see a crash in the test, often related to _pr_activelock, as part of a thread creation in an NSPR re-initialization . However, I fail to see how why this happens.
The hack you described in comment 9 was recently discussed in the netscape.public.mozilla.nspr newsgroup. It was to address a similar problem of loading and unloading an IIS plugin that uses NSPR. I explained why one should not initialize and clean up NSPR repeatedly, and proposed the general idea of workaround by causing NSPR to stay loaded in the process permanently.
The "CPU" threads have multiple layers: a Win32 native thread running _PR_NativeRunThread (pruthr.c), which runs thread->startFunc: _PR_RunCPU (prucpu.c), which runs an NT fiber running _PR_CPU_Idle (prucpu.c) The function you modified (added the SetEvent call to) is the innermost layer (_PR_CPU_Idle). So after the new SetEvent call, the "CPU" thread will execute a lot of NSPR code before it truly terminates. In particular, the "CPU" thread will call PR_Lock(_pr_activeLock) after (*thread->startFunc)(thread->arg) returns in _PR_NativeRunThread. So if PR_Cleanup (which destroys _pr_activeLock) finishes before the "CPU" threads truly terminate, the "CPU" thread will crash. Also, since NSPR automatically gets initialized, the "CPU" thread may also cause NSPR to be re-initialized after PR_Cleanup finishes or when it is in progress. The Sleep call I asked you to add is meant to cause these race conditions.
Wan-Teh, I sure wish I had been reading the NSPR newsgroup and known that somebody else had the very same issue we did (with an IIS plug-in as well) ... Thanks for the info about the CPU threads. This helps a lot. I think it means there is more code needed to make the native thread terminate and wait for it, perhaps with another event semaphore similarly to the fiber.
Julien: the newsgroup posting I mentioned in comment 11 is by Tom Wocken on 8/3/2004 with the subject "NSPR DLL access violation". Now I look at his messages again, I realize that he was not writing an IIS plugin. Sorry about the confusion.
Priority: -- → P3
Wan-Teh, To fix the remaining part of the problem, the uppermost layers (PR_NativeRunThread for sure, and perhaps PR_RunCPU) would need to detect that NSPR has been cleaned up and exit. This seems a hard task to accomplish when the cleanup is still in progress. The only thing that seems appropriate would be some process-wise variable set by the OS (eg. system object) that could be polled to short circuit. However, I'm not sure that would be available on all platforms for us to fix. I think what you have uncovered is a pre-existing problem however, which could happen in other situations even without fibers, on other platforms. For example, somebody could start a native thread on any platform, which would do a long sleep . The main thread of the application would call PR_Cleanup(), but not immediately exit. Then, the other thread would wake up from its sleep and return to _PR_NativeRunThread, and cause a crash dereferencinng _pr_activeLock.
In response to comment #6, I'm aware of the leaked resources, and I believe there are bugs open on at least some of them . The lack of refcounting is only an issue if there are multiple modules doing initialization / cleanup of NSPR, which may or may not be the case. I think this should probably be tracked in a separate bug. Do you agree at least that the current patch is not harmful and should be checked in (after removing the unused field) ?
Wan-Teh, Could you please tell me what you think of comment #16 ? I would like to close this bug if possible.
Wan-Teh, I would like to come to a closure on this bug . If you think we don't need fix this case, then at least we should update the documentation for PR_Cleanup and warn about its limitations.
The idea of your patch is fine. I'm just saying that the SetEvent call is done too early; see my comment 12. The SetEvent call should be done in the _PR_NativeRunThread function after the last access to any NSPR global data structures. Doing that will need more complicated code.
Assignee: bugzilla → wtchang
Status: ASSIGNED → NEW
Attachment #156018 - Attachment is obsolete: true
Attachment #156018 - Flags: superreview?(wtchang)
QA Contact: wtchang → nspr
Severity: normal → S3

The bug assignee is inactive on Bugzilla, so the assignee is being reset.

Assignee: wtc → nobody
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: