Closed Bug 98182 Opened 23 years ago Closed 23 years ago

M094 & Trunk crashes exiting browser [@ PK11_GetInternalSlot][turbo]

Categories

(Core Graveyard :: Security: UI, defect, P1)

Other Branch
x86
Windows NT
defect

Tracking

(Not tracked)

VERIFIED DUPLICATE of bug 101329
psm2.1

People

(Reporter: dbaron, Assigned: inactive-mailbox)

References

Details

(Keywords: crash, topcrash, Whiteboard: turbo, PDT-)

Crash Data

Attachments

(3 files, 2 obsolete files)

Starting in the 2001-08-30-05 build there were a bunch of crashes on Windows reported in Talkback with the following stack (reason: Access Violation): PK11_GetInternalSlot() PK11_TokenExists() ssl3_config_match_init() ssl2_HandleServerHelloMessage() ssl2_BeginClientHandshake() ssl_Do1stHandshake() ssl_SecureSend() ssl_SecureWrite() SSL_ImportFD() nsSSLIOLayerWrite [d:\builds\seamonkey\mozilla\security\manager\ssl\src\nsNSSIOLayer.cpp, line 704] (I'm guessing there aren't line numbers because the NSS build system on Windows is doing the Windows equivalent of stripping libraries itself.) I'm not sure what changed around that time, or whether it's in PSM or NSS, or what.
Nelson, can you see anything obviously wrong with this stack trace? If not, I will change the product to Browser for further investigation.
Assignee: wtc → nelsonb
Yes, I see something wrong with this trace. SSL_ImportFD() does not do any I/O. It does not call SSL_SecureWrite(). I'm guessing that this is a developer's personal build, and that he's forgotten that gmake dependencies don't work properly on windows, or that CVS update doesn't change the header files in dist (unlike Unix, where they're simlinked). But that's a wild guess. I'd suggest the developer do a complete clean build of all of NSS, including gmaking the clean, export, and libs targets (assuming coreconf, I don't know how to do the equivalent using mozilla's build system).
It's very unlikely that this is a developer's personal build -- there were a number of these crashes reported by different users, entirely outside Netscape (and therefore with no access to talkback source). (It might be more likely that the symbols for a static function are being omitted and you're seeing the name of the closest previous non-static function.)
If we ignore the SSL_ImportFD item in the stack, the rest of the stack is consistent with a call to SSL_SecureWrite (via PR_Write). PK11_GetInternalSlot() is a one-line function that calls SECMOD_GetInternalModule() and then immediately references the pointer returned by SECMOD_GetInternalModule() without checking it first. If SECMOD_GetInternalModule() returned a null or corrupt pointer, this crash would occur. So, there are two issues here: 1. Why is SECMOD_GetInternalModule() returning a bad value? This should be investigated. It's a problem with the pk11 wrapper layer. 2. PK11_GetInternalSlot() should check the value returned by SECMOD_GetInternalModule() for NULL before using the value, and should return NULL if it gets a NULL from SECMOD_GetInternalModule(). I'll make that change (#2) in the trunk. I suggest this bug be reassigned to relyea to investigate issue #1.
Get internal module returns a static variable which is either the internal module or NULL. If this function is failing it's because NSS was called, but hasn't been initialized yet, or has been shutdown. This fix will most likely just move the crash rather than repair it, as most callers of Get_InternalSlot do not check for failure either. There are literally dozens of places that may go belly-up if NSS is not initialized, so unless we find out the root problem with initialization, the patch won't really prevent any crashes. bob
If indeed the problem is that NSS functions are being called before NSS has been initialized, then this is an application error, not an NSS library error. But we cannot determine that from the information in this bug.
I checked in the test for the NULL internal module pointer. I agree with Bob that the code will probably just crash somewhere else. But testing it here is still the right thing to do, IMO. Given that the application is attempting to do SSL, it seems unlikely to me that the application has made no attempt to initialize NSS. So, the likely explanations are that the attempt failed, but the application ignored the error and went on, or that memory has been corrupted after initialization is complete. Either of these conditions is probably an error outside of NSS. This bug should be reassigned to the PSM group.
Product => PSM.
Assignee: nelsonb → ssaux
Component: Libraries → Client Library
Product: NSS → PSM
QA Contact: sonmi → bsharma
to kai ->2.1 P1 nsenterprise+ Kai, do you think this may have anything to do with the disk full fix?
Assignee: ssaux → kai.engert
Keywords: nsenterprise+
Priority: -- → P1
Target Milestone: --- → 2.1
The "disk full fix" from bug 76915 was checked in on August 22, but according to the report the first problems were seen on August 30. Since August 22 we init NSS in the very early stages of application startup, and exit the app if that doesn't work. When I use Bonsai to query what happened around that date to PSM and NSS, I only find one change that I think could possibly be related. It changes an #ifdef block to some init code. http://bonsai.mozilla.org/cvsview2.cgi?diff_mode=context&whitespace_mode=show&subdir=mozilla/security/nss/lib/softoken&command=DIFF_FRAMESET&file=pk11db.c&rev1=1.1&rev2=1.2&root=/cvsroot All other checkins around that date seem to be unrelated. (The full list is at (slow): http://bonsai.mozilla.org/cvsquery.cgi?treeid=default&module=all&branch=HEAD&branchtype=match&dir=%2Fmozilla%2Fsecurity%2Fnss%2F+%2Fmozilla%2Fsecurity%2Fmanager%2F&file=&filetype=match&who=&whotype=match&sortby=Date&hours=2&date=explicit&mindate=08%2F27%2F2001&maxdate=08%2F30%2F2001&cvsroot=%2Fcvsroot )
Kai, the link you provided showed diffs between revs 1.1 and 1.2 of pk11db.c. But that file is a new file on the trunk, and is not part of the client builds. It doesn't yet even have the NSS_CLIENT_TAG. So, that's not it. Is this problem reproducible?
pk11db.c was in pk11wrap in NSS 3.3 and before. It has been moved to softoken in NSS 3.4. Kai, you aren't pulling from the NSS tip for PSM are you? The NSS tip is definately not ready for general use (we still have QA failures). In fact if PSM is pulling from the NSS tip, it is quite likely that you will run into some of these problems. I suspect the *real* change that was made was the change from NSS 3.2 to NSS 3.3.
False alert I think. I checked, both the trunk and the 094 branch have the file pk11db.c inside pk11wrap. This means they are pulling from NSS 3.3. I found the above by looking for checkins to the NSS and PSM directories, but didn't look at the tags.
I counted 29 TALKBACK reports. All of them are on the trunk. build dates are from 8/29 to 09/06.
per comments, changing to ASSIGNED
Status: NEW → ASSIGNED
I'm unable to reproduce this. In the talkbalk reports, I can't find any helpful comments, besides from the fact that people are in the process of leaving the application. Therefore I have to try to analyze this by looking at the code: The crashing function looks like that: PK11SlotInfo * PK11_GetInternalSlot(void) { return PK11_ReferenceSlot(SECMOD_GetInternalModule()->slots[0]); } As the crash is directly within this function, we know for sure that SECMOD_GetInternalModule() returns. The return value must be either a null (or invalid) value, or the attribute slots is null (or invalid). Function PK11_ReferencesSlot can not be have been called, because that would show up in the stack trace. SECMOD_GetInternalModule() does simply return the value of variable internalModule, which is a pointer, limited to the scope of file pk11util.c. 1) Let's assume that internalModule itself is NULL when we crash. As we know the crash is during application shutdown, it seems likely that the code is arriving at SECMOD_Shutdown, which destroys the object pointed to by internalModule and sets this pointer to NULL. Besides from shutdown, this variable is set only by SECMOD_init (which looks unlikely as a cause for this bug), SECMOD_SetInternalModule and SECMOD_DeleteInternalModule. I can't find code (besides in init code) that calls SetInternalModule. DeleteInternalModule is called from the NSS command line tools and from PSM ToggleFIPSMode() only. Looks unlikely to be the cause, too. The shutdown code is called when the NSS shared library is destructed. Is it possible, that the PIPNSS component get's unloaded, while the network code still needs it? I conclude: The only realistic scenario I can see is: The user selects to quit the application, but the https / ssl code is still active, and while the app shuts down, a SSL handshake is still in progress, which causes the crash as the internal slot has already been killed. Maybe we should make the network library try to obtain a reference to the crypto libraries, and, if successful, delay releasing that reference until the final statement of the network shut down code? 2) Now let's assume that internalModule is valid, but attribute slots is invalid. But I can't find any code in NSS files where the filename contains "pk11", that references an attribute "slots" in a way that modifies the pointer value, besides in the loading code. There is a call to PK11_FreeSlot, called from SECMOD_DestroyValue, which is actually called from the destructor of the PSM PKCS11 module. So again, this could be caused by shutting down of the crypto shared libraries, while there is still network code active. The above is pure theory, and maybe I have missed some code fragments while making my deductions. So I basically hope that my research will give some ideas to people with deeper knowledge of the codebase, and whether my suggestion with the reference counter makes sense or seems useless. I'm looking for help, either by trying to reproduce the bug or by commenting my above statements.
The disassembly of the code around the crash (with the first line being the one on which the crash occurred -- talback doesn't show before, for some reason) is: 608d2af5 8b4028 mov eax,[eax+0x28] <== CRASH HERE 608d2af8 8b08 mov ecx,[eax] 608d2afa 51 push ecx 608d2afb e820e3ffff call 608d0e20 608d2b00 83c404 add esp,0x4 608d2b03 c3 ret the EAX register is 0. This suggests that SECMOD_GetInternalModule is returning null. (I assume that the |slots| member is at offset 0x28.)
Keywords: nsbranch+
nsNSSComponent is declared as: NS_IMPL_THREADSAFE_ISUPPORTS6(nsNSSComponent, nsISecurityManagerComponent, nsISignatureVerifier, nsIEntropyCollector, nsINSSComponent, nsIObserver, nsISupportsWeakReference); The destructor of nsNSSComponent calls NSS_Shutdown. After this has been shut down, no NSS operations are allowed and are prone to crash. When a SSL socket is opened, the code arrives at nsSSLIOLayerAddToSocket / nsSSLIOLayerNewSocket, which will create an object of type nsNSSSocketInfo. This object is referenced by the socket transport code. NS_IMPL_THREADSAFE_ISUPPORTS4(nsNSSSocketInfo, nsITransportSecurityInfo, nsISSLSocketControl, nsIInterfaceRequestor, nsISSLStatusProvider) While an nsNSSSocketInfo is in use by the network transport code, this can result in NSS functions being called - because the SSL code has added itself as a layer to the network code. For my understanding this means, as long as there are nsNSSSocketInfo objects alive, NSS must not be shut down. But I can't see code that couples nsNSSSocketInfo to nsNSSComponent in terms of reference counting. Have I found the cause, or is there a coupling that I don't see?
nsNSSIOLayer.cpp creates the nsNSSService object in InitNSSMethods(), but never releases it. So we leak on shutdown. The refcnt never goes to 0, and so nss_shutdown won't be called. What am I missing?
Bradley: You are right with your last statement. I looked whether NSS_Shutdown is really called and from where. It is called, but when the application shuts down and cleans up, see the stacktrace below. #0 nsNSSComponent::~nsNSSComponent (this=0x81e3590, __in_chrg=3) at ../../../../../mozilla/security/manager/ssl/src/nsNSSComponent.cpp:109 #1 0x416603f7 in nsNSSComponent::Release (this=0x81e3590) at ../../../../../mozilla/security/manager/ssl/src/nsNSSComponent.cpp:550 #2 0x401ccc0d in DeleteEntry (aKey=0x8317598, aData=0x8317580, closure=0x0) at ../../../mozilla/xpcom/components/nsServiceManager.cpp:260 #3 0x4017b22a in _hashEnumerateRemove (he=0x83175b8, i=23, arg=0xbffff570) at ../../../mozilla/xpcom/ds/nsHashtable.cpp:381 #4 0x402af2fa in PL_HashTableEnumerateEntries (ht=0x806d90c, f=0x4017b1ec <_hashEnumerateRemove(PLHashEntry *, int, void *)>, arg=0xbffff570) at ../../../../mozilla/nsprpub/lib/ds/plhash.c:429 #5 0x4017b2eb in nsHashtable::Reset (this=0x806d908, destroyFunc=0x401ccbcc <DeleteEntry(nsHashKey *, void *, void *)>, aClosure=0x0) at ../../../mozilla/xpcom/ds/nsHashtable.cpp:400 #6 0x4017c9fb in nsObjectHashtable::Reset (this=0x806d908) at ../../../mozilla/xpcom/ds/nsHashtable.cpp:850 #7 0x4017c89a in nsObjectHashtable::~nsObjectHashtable (this=0x806d908, __in_chrg=3) at ../../../mozilla/xpcom/ds/nsHashtable.cpp:816 #8 0x401ccd9f in nsServiceManagerImpl::~nsServiceManagerImpl (this=0x806d8f0, __in_chrg=3) at ../../../mozilla/xpcom/components/nsServiceManager.cpp:285 #9 0x401ccf13 in nsServiceManagerImpl::Release (this=0x806d8f0) at ../../../mozilla/xpcom/components/nsServiceManager.cpp:294 #10 0x401cd94d in nsServiceManager::ShutdownGlobalServiceManager (result=0x0) at ../../../mozilla/xpcom/components/nsServiceManager.cpp:544 #11 0x401720cd in NS_ShutdownXPCOM (servMgr=0x0) at ../../../mozilla/xpcom/build/nsXPComInit.cpp:489 #12 0x0805b73d in main (argc=1, argv=0xbffff844) at ../../../mozilla/xpfe/bootstrap/nsAppRunner.cpp:1598 #13 0x405e8e5e in __libc_start_main (main=0x805b4cc <main>, argc=1, ubp_av=0xbffff844, init=0x8054570 <_init>, fini=0x8067260 <_fini>, rtld_fini=0x4000d3c4 <_dl_fini>, stack_end=0xbffff83c) at ../sysdeps/generic/libc-start.c:129 But if this is really the only location where NSS_Shutdown is triggered, as the application code never frees the reference... Is it possible that the socket transport thread is still running when the application shuts down? Maybe the security component is freed before the network component(s)?
nsNSSModule.cpp is seriously messed up: it has multiple entries for a given CID and they're all tied to the nsNSSComponentConstructor but for different contractIDs. The GetService into the global in nsNSSIOLayer.cpp is by contract ID rather than CID -- so it may not work. If that is the problem (I'm not sure how the component manager would deal with the strangeness in nsNSSModule.cpp), then there should be instance counting to release that global when it can be released, rather than just leaking it.
OK, maybe that nsNSSModule.cpp stuff is OK after all.
Might the service manager get confused by having multiple ContractIDs and a single CID for the nsNSSComponent, and create more than one instance?
Where do I find the source for nsNSSComponent.cpp or nsNSSModule.cpp There are a few components that have more than one contractid for a given cid. I dont think that is a problem. That was supposed to work by design.
dp: mozilla/security/manager/ssl/src/
Note that links are internal to netscape. Sorry. This query fails more often that not on climate.mcom.com, so including the saved output allows one to easily (as long as one has access to the internal network) access the individual incidents. There doesn't appear to be any branch builds.
If shiva and jay say this is fixed, then we should put this one on the back burner.
This is showing up in the M094 Talkback topcrash report. Stephane's attached query does show many of these crashes with build 2001091311 on the Netscape6.20 branch (which also happens to be the Mozilla 0.9.4 release). Here is the latest info from Talkback. The stack doesn't look too helpful, but perhaps the comments will help us reproduce: PK11_GetInternalSlot 14 98182 ASSI kai.engert@gmx.de 2.1 BBID range: 35353387 - 35494918 Min/Max Seconds since last crash: 256 - 46396 Min/Max Runtime: 256 - 46396 Crash data range: 2001-09-13 to 2001-09-17 Build ID range: 2001091311 to 2001091311 Stack Trace: PK11_GetInternalSlot() PK11_TokenExists() ssl3_config_match_init() ssl3_HandleRecord() ssl3_HandleRecord() ssl3_HandleRecord() ssl3_HandleRecord() ssl3_GatherCompleteHandshake() ssl_GatherRecord1stHandshake() ssl_Do1stHandshake() ssl_SecureSend() ssl_SecureWrite() SSL_ImportFD() nsSSLIOLayerWrite [d:\builds\seamonkey\mozilla\security\manager\ssl\src\nsNSSIOLayer.cpp line 718] SHA1_Update() RNG_UpdateAndEnd_FIPS186_1() prng_RandomUpdate() SocketRecv [../../../../pr/src/io/prsocket.c line 641] pl_DefRecv [../../../../pr/src/io/prlayer.c line 278] ssl_DefRecv() ssl3_GatherCompleteHandshake() ssl3_GatherCompleteHandshake() 0xffffffff 0x90909090 (35494918) Comments: Tried to close Mozilla completely using Mozilla icon in the task bar. I tried to right click left click double click but no menu appeared. Mozilla opened and I closed it immediately. At this point Mozilla crashed. (35484527) Comments: I just quit got mesage about icon staying on taskbar hit OK. (35483403) URL: webmail.cmg.nl (35483403) Comments: I think the web-mail application tried to close the browser window... It seemed to go slightly wrong.. The app is written in javascript. (35482014) Comments: closed mozilla window while it was receiving data (35451055) URL: http://www.msn.com (35451055) Comments: Exiting Hotmail. The page that follows the checkout page was trying to load after redirect. (35445384) Comments: closed browser (via javascript apparently) (35420953) URL: www.landbobanken.dk (35420953) Comments: Java internet banking
Keywords: crash
Summary: crashes exiting browser [@ PK11_GetInternalSlot] → M094 crashes exiting browser [@ PK11_GetInternalSlot]
Kai. So this is still happening on the 0.9.4 branch. From the comments included, I see two avenues to investigate: 1) using the turbo mode exit. 2) quiting while page is still loading.
I'm confused why the title of this bug was changed to include M094. My impression was, from Stephane's attachment, that it crashes on Mozilla trunk, too. It isn't limited to a branch. The avenue "still loading while exiting" is indeed the only scenario that makes sense to me. I played around on Windows, but I did not see the crash. I did use the turbo mode feature, which worked stable for me. Until I can reproduce it myself, the only thing I can try is to add code that tries to improve thread safetyness of PSM, as well as some other things to test: I want to test whether the singleton concept of the NSS component is actually working. The existing code is designed to work as a singleton only. If there were multiple instances of nsNSSComponent created during runtime, we would indeed see a crash, as during destruction of the first one, NSS would already have been shut down, and still open SSL connection would crash in a way as the above stack trace. However, during my tests so far, I have never seen the component to be created twice, so this might happen only in an edge condition that I have not yet come across. Another, very general, suggestion being made was: the socket information structure should refcount the nss component in some way, because it depends on it. The reason why it currently works: the nsNSSComponent is knowingly leaked. But this makes it even more incomprehensible why we crash, as the destruction and NSS shutdown code should not be reached if the singleton concept worked and there were no multithread issues. This is flawed, I guess we really should shut down NSS on exit. Currently, we don't do it, as soon as SSL code has been called at least once during the Mozilla session.
Kai: Sorry for the confusion, I forgot to add "Trunk" to the summary as well. We (the Talkback team) usually update the summary to show which releases/milestones are showing the crash so we can get a quick idea of when we last saw the crash. Anyway, I've updated the summary with Trunk, so we know that this crash is occurring with both Mozilla 0.9.4 and the latest MozillaTrunk builds.
Summary: M094 crashes exiting browser [@ PK11_GetInternalSlot] → M094 & Trunk crashes exiting browser [@ PK11_GetInternalSlot]
Kai: you should try to instantiate multiple instances of NSS explicitly to test your theory. If that help reproduce the crash and it's in the same place, that would probably be significant.
I made the following change to enforce the crash. I changed the code receiving the entropy collector in nsGlobalWindow, to not use do_GetService, but use do_CreateInstance. This will create additional instances of nsNSSComponent. Inside the nsNSSComponent, there is an assertion check, that tries to detect if the init code is called twice. When I use a debug build, the app quits. However, in an optimized build, I suppose the app will not quit I think. I therefore disabled the assertion check for testing, too. When I open the application, create a second window, and surf over https, I crash very soon. I see the following stack trace, which is at least in the upper levels similar to the reported stack traces. #0 0x416f8cc8 in PK11_GetInternalSlot () at pk11slot.c:2132 #1 0x416f8faa in PK11_TokenExists (type=33) at pk11slot.c:2244 #2 0x416b9f32 in ssl3_config_match_init (ss=0x895da00) at ssl3con.c:419 #3 0x416a80ec in ssl2_ConstructCipherSpecs (ss=0x895da00) at sslcon.c:204 #4 0x416ade3a in ssl2_BeginClientHandshake (ss=0x895da00) at sslcon.c:2964 #5 0x416b17bf in ssl_Do1stHandshake (ss=0x895da00) at sslsecur.c:156 #6 0x416b3764 in ssl_SecureSend (ss=0x895da00, buf=0x0, len=0, flags=0) at sslsecur.c:1075 #7 0x416b3899 in ssl_SecureWrite (ss=0x895da00, buf=0x0, len=0) at sslsecur.c:1109 #8 0x416b6c1c in ssl_Write (fd=0x8938f40, buf=0x0, len=0) at sslsock.c:1232 #9 0x4166e5b7 in nsSSLIOLayerWrite (fd=0x86de738, buf=0x0, amount=0) at ../../../../../mozilla/security/manager/ssl/src/nsNSSIOLayer.cpp:697 #10 0x402c4990 in PR_Write (fd=0x86de738, buf=0x0, amount=0) at ../../../../../mozilla/nsprpub/pr/src/io/priometh.c:141 #11 0x4166d050 in nsNSSSocketInfo::TLSStepUp (this=0x895d968) at ../../../../../mozilla/security/manager/ssl/src/nsNSSIOLayer.cpp:341 #12 0x4166cfcc in nsNSSSocketInfo::ProxyStepUp (this=0x895d968) at ../../../../../mozilla/security/manager/ssl/src/nsNSSIOLayer.cpp:326 #13 0x40ba3e53 in nsHttpConnection::ProxyStepUp (this=0x8963578) at ../../../../../mozilla/netwerk/protocol/http/src/nsHttpConnection.cpp:287 #14 0x40ba38e7 in nsHttpConnection::OnHeadersAvailable (this=0x8963578, trans=0x89825c8, reset=0xbf7ff654) at ../../../../../mozilla/netwerk/protocol/http/src/nsHttpConnection.cpp:192 #15 0x40ba6f80 in nsHttpTransaction::HandleContentStart (this=0x89825c8) at ../../../../../mozilla/netwerk/protocol/http/src/nsHttpTransaction.cpp:387 #16 0x40ba7318 in nsHttpTransaction::HandleContent (this=0x89825c8, buf=0x89df960 "HTTP/1.0 200 Connection established\n\nProxy-agent: Muffin\n\n\n\n", 'Ú' <repeats 140 times>..., count=0, countRead=0xbf7ff7d4) at ../../../../../mozilla/netwerk/protocol/http/src/nsHttpTransaction.cpp:469 #17 0x40ba8016 in nsHttpTransaction::Read (this=0x89825c8, buf=0x89df960 "HTTP/1.0 200 Connection established\n\nProxy-agent: Muffin\n\n\n\n", 'Ú' <repeats 140 times>..., count=0, bytesWritten=0xbf7ff7d4) at ../../../../../mozilla/netwerk/protocol/http/src/nsHttpTransaction.cpp:777 #18 0x401b2d05 in nsReadFromInputStream (outStr=0x8982924, closure=0x89825cc, toRawSegment=0x89df960 "HTTP/1.0 200 Connection established\n\nProxy-agent: Muffin\n\n\n\n", 'Ú' <repeats 140 times>..., offset=0, count=4096, readCount=0xbf7ff7d4) at ../../../mozilla/xpcom/io/nsPipe2.cpp:830 #19 0x401b27ce in nsPipe::nsPipeOutputStream::WriteSegments (this=0x8982924, reader=0x401b2cd0 <nsReadFromInputStream(nsIOutputStream *, void *, char *, unsigned int, unsigned int, unsigned int *)>, closure=0x89825cc, count=16384, writeCount=0xbf7ff89c) at ../../../mozilla/xpcom/io/nsPipe2.cpp:704 #20 0x401b2d59 in nsPipe::nsPipeOutputStream::WriteFrom (this=0x8982924, fromStream=0x89825cc, count=16384, writeCount=0xbf7ff89c) at ../../../mozilla/xpcom/io/nsPipe2.cpp:838 #21 0x40b74361 in nsStreamListenerProxy::OnDataAvailable (this=0x8963508, request=0x89825c8, context=0x0, source=0x89825cc, offset=0, count=16384) at ../../../../mozilla/netwerk/base/src/nsStreamListenerProxy.cpp:288 #22 0x40ba656a in nsHttpTransaction::OnDataReadable (this=0x89825c8, is=0x89391d0) at ../../../../../mozilla/netwerk/protocol/http/src/nsHttpTransaction.cpp:178 #23 0x40ba57e7 in nsHttpConnection::OnDataAvailable (this=0x8963578, request=0x86d6020, context=0x0, inputStream=0x89391d0, offset=0, count=8192) at ../../../../../mozilla/netwerk/protocol/http/src/nsHttpConnection.cpp:702 #24 0x40b68ee4 in nsSocketReadRequest::OnRead (this=0x86d6020) at ../../../../mozilla/netwerk/base/src/nsSocketTransport.cpp:2723 #25 0x40b63894 in nsSocketTransport::doReadWrite (this=0x8805eb8, aSelectFlags=1) at ../../../../mozilla/netwerk/base/src/nsSocketTransport.cpp:999 #26 0x40b62072 in nsSocketTransport::Process (this=0x8805eb8, aSelectFlags=1) at ../../../../mozilla/netwerk/base/src/nsSocketTransport.cpp:480 #27 0x40b6a37a in nsSocketTransportService::Run (this=0x8116820) at ../../../../mozilla/netwerk/base/src/nsSocketTransportService.cpp:458 #28 0x401d6832 in nsThread::Main (arg=0x810fd78) at ../../../mozilla/xpcom/threads/nsThread.cpp:105 #29 0x402ea93a in _pt_root (arg=0x810d530) at ../../../../../mozilla/nsprpub/pr/src/pthreads/ptthread.c:214 #30 0x4030c078 in pthread_start_thread_event (arg=0xbf7ffc00) at manager.c:262
I just saw this for the very first time! I think I know what causes this problem. NSS is indeed shut down and restarted during the session, and it has to do with the -turbo mode on Windows. The reason is: method nsNSSComponent::Observe is called, which I did not expect. The "profile changed" event (causing NSS to shut down) seems to be triggered while there are still connections in progress. This method needs to be redesigned. More to follow.
nice finding kai. Ccing Conrad "Turbo" Carlen.
Blocks: 75599
Summary: M094 & Trunk crashes exiting browser [@ PK11_GetInternalSlot] → M094 & Trunk crashes exiting browser [@ PK11_GetInternalSlot][turbo]
Nominating for turbo short list.
Whiteboard: turbo
Please see also some comments in bug 99566. We have learned that the crash is caused by unloading and loading the profile. PSM tries to shut down and re-init NSS when it is informed by profile change events. While this is done, it is possible that there are still active threads accessing NSS, e.g. the network communication thread. The code necessary to add protection is not a small patch, but might require many changes all over PSM, to safeguard for the "NSS currently unavailble" condition, i.e. checking for this flag, including to add locks for serializing access. While it is a general requirement to have the functionality to switch profiles (including the certificate databases used by Mozilla), this might require too many changes, which may be allowed to be put to the 094 branch. There is a workaround for applications that don't support real profile switching yet. We could patch PSM to completely ignore the profile loading / unloading event, by assuming there won't be a switch to a different profile (with different certificate database files). I'd like to suggest this patch for applications that are created from the 094 branch. If this restriction isn't possible, random crashes are always possible. We could try to minimize the likelyhood for a crash, by delaying NSS_Shutdown until the moment where we need to re-initialize. I have no idea, whether there might still be SSL sockets active when the new profile is being loaded/activated. This could result in a newly activated NSS session accessing NSS library instances from a previous session. Here are some comments for a correct implementation for the trunk: nsNSSComponent methods seem to be called from multiple threads, the UI thread, the observer notifier (not sure which this is), the socket thread. I think we should protect the flag, whether NSS is currently activated or not, using a mutex. We need to behave gracefully. But doing so in nsNSSComponent is not enough. The SSL code adds a function layer to the socket file descriptors. And those obviously are active in the middle of profile change. And those call global functions (e.g. nsNSSIOLayerWrite). In all the layer functions, we must check for NSS being currently in good state (by querying the nsNSSComponent::mNSSInitiliazed flag). However, what happens, if a SSL socket layer function is currently active, while NSS is being shut down? Obviously we have the likelyhood that NSS goes away in the middle of a socket operation. This means that it doesn't matter whether NSS (itself) is thread safe or not. Access to NSS functions is asynchronous. If we want to fail gracefully if NSS is not available, that means, make the check for "NSS is available" right before every single call to any NSS function, and protect the calls using the same Mutex as we must use for NSS shutdown and init, and to protect the member variables of the nsNSSComponent. We might want to discuss whether we really need to protect every single function call to NSS or whether we can reduce this. However, I fear the likelyhood for being required to protect every single function is high, as indicated by my observation, that I saw a "certificate domain mismatch" prompter dialog coming up, nearly at the same moment as the profile observer was called with the "profile unloaded" event.
Oops, in the previous comment, in the fifth paragraph, there is the important word NOT missing: While it is a general requirement to have the functionality to switch profiles (including the certificate databases used by Mozilla), this might require too many changes, which may be *NOT* allowed to be put to the 094 branch.
Since this concerns general architecture of the security subsystems, I'm cc'ing Terry Hayes.
Cant NSS not shutdown, but just forget all cert db etc and reinitialize with the new profile's cert db etc...in this situation ?
*** Bug 100771 has been marked as a duplicate of this bug. ***
I'm a little concerned that the client is not trying to quiesce all its connections before it tries to switch profiles. The Initialization and shutdown code in NSS is not thread safe. This is the code where any global locks which may be needed are acquired. It seems to me that it's a dangerous thing in general to allow operations to continue while the profile is in transition. Mail/News components can't be happy if you are reading a message in the middle of reading a message. Making PSM fail when NSS isn't initialized seems like a hack. The real question why are there outstanding NSS requests out there to begin with. How is the profile switch supposed to happen? Does the profiler shutdown each component first before it starts enabling new components? If so, then you need to make sure PSM is shutdown after all the potential callers of PSM, and that include the networking library! PSM should then be brought up before any new potential callers.
The database and keys are an integral part of NSS, NSS can't run without them defined. Not only that, but what tokens you have defined is stored in your profile. You could have a completely different set of hardware tokens defined. Switching profiles is a big deal. Before this design Netscape switched profiles by Shutting down. The profile switching code should assume that it's has to do most of the work of shutdown. Even if PSM fails on NSS connections in between the shutdown and the new initialize, I don't know what kind of results we are going to get if we have things like SSL connections that start on one profile and finish on the second. All of these connections have to be closed down before you shutdown NSS.
Bob, thanks for the explanation. Maybe this is even the simplest strategy to solve our problem. Can we use code, that really assures that all network activity has been shut down completely (even if this blocks), before we try to switch the profile?
Adding in mscott. Darin/mscott: any way we can tear down network connections completely on a profile switch.
As the first phase of profile switching, all windows are destroyed which cancels any current loads. The canceling, though, is not synchronous. It's hard to know when all activity is stopped and trying to enforce that in something as large as mozilla isn't the right way to go. Whatever a component loads from a profile, it's given notice that that data is going away. It needs to do what it must to put itself in a safe state. localstore, for instance, makes a temporary in-memory data source so it can function without any of the user-specific data in the meanwhile. How can NSS require, or else crash, that it has this data? What if it's internal data structures were just empty, in the state they were before it inernalized any data from the profile? Would it just crash? What if it failed to read the data for some other reason? I think, from another bug, it'll crash then too. This caused a similar problem in the disk cache. There, it was agreed that the disk cache should be able to handle the situation without complete quiesense.
That said, I think having multiple profiles in turbo and therefore switching should be disabled until this can be worked out. Now that it's out of an embedding-only context, too many problems like this are cropping up.
The profile *DEFINES* NSS's internal data structures. What modules are 'loaded'. What slots those modules have, etc are defined by secmod.db. It's not that the data structures are empty, they flat out don't exist! You cannot do anything in NSS without a token defined: not hashing, not PRNG, not anything! If NSS_Initialize fails, the application cannot call any other NSS function without crashing. This has been true since the very beginning, and all NSS applications (including communicator and the servers)live with this restriction. The reason for this is most of these products have not restricted themselves to a small set up public API's. NSS can't make initialization checks on *every* interface that PSM calls currently because it calles to many internal only interfaces. NSS requires it because it has never been a design point that it didn't. When the shutdown feature was request, I clearly stated what the restrictions were. Obviously those restrictions weren't communicated back up. There is a more fundamental problem, though. Even if NSS allow 'changing the profile on the fly' with it's connections, there are security issues. What if a connection was created before you switched profiles, managed not to make any NSS calls until after the profile switch, what is the semantic of that connection? All the security checks were done in reguard to a completely separate connection? These really need to be quiesced before you bring up a new security context.
Reading Conrad's comments more carefully, I would say that NSS should not shutdown on the first phase. We have an issue if we need to access the key or certdb in that time, but NSS will fail, not crash in that case (if the key and cert DB go away after initialization, NSS does handle that case 'gracefully'). The user key and cert Data is already squirrelled away currently. There may be an issue if we have modified the key and cert DB and the dbclose doesn't happen. NSS will still have the files open at this point, I don't know what issue that will cause. The potential db corruption in this case is already an issue we have filed against NSS (it's really a bug in DBM where db_sync doesn't always do what it should). The secmod.db is openned, read, then closed. As long as someone doesn't try to add a module in a middle, it could disappear of the face of the earth and we would never know. In this case, KAI, we should do the shutdown just before we do the initialize. We still need to make sure no one is going to call NSS between the two calls. I am still concerned about the security implications of having connections starting in one profile and living through the profile switch process. bob
i haven't been following all of the discussion on this bug, but in answer to dp's question: it is possible to "go offline" and thereby force all socket connections to close (they will not close synchronously, however).
Bob, you say that there is an issue if dbclose is not called. Well, then we might have another problem. Currently, in most cases, the PSM code will not call NSS_Shutdown on application exit. If there has been at least on SSL connection during an application session, the component responsible for shutting it down, is intentionally leaked. This is a different bug. The network layer is not keeping a reference to the nsNSSComponent. This means, during app shutdown, the destructor of the nsNSSComponent might be called while there is still network activity. This destructor has the call to NSS_Shutdown. Not knowing better, I assume the additional leaking reference count was introduced to prevent a crash. Bob, if you say that calling NSS_Shutdown is wanted/required, to make sure that dbclose is called during application exit, I will open another bug in order to fix this.
As we need a general solution for profile switching applications, I have created the new bug 101329. However, we need a tempory solution for this bug now. Temporary solution for the 094 branch ===================================== All suggestions outlined in the new bug 101329 seem to be too risky for the 094 branch. Conrad suggests to disable profile switching for now. Still, we might encounter profile loading and unloading events, where the actual profile will not change, but unloaded/loaded in the turbo mode situation. To make sure we will not crash if those events occur, I suggest the following patch. It makes PSM simply ignore the profile change events. This patch is absolutely wrong for all other applications, but is a good temporary solution, as it avoids all the problems, by simply not unloading/reloading NSS.
The patch would require that we disable turbo mode for multiple profiles. If that's what we have to do, I'll disable it. r=ccarlen. Before taking this step, can we see if there are any talkback incidents of this on the trunk since bug 99387 was checked in?
Comment on attachment 50519 [details] [diff] [review] Suggested temporary fix for 094 branch ONLY. Kai, The assumption your patch makes, that the application will not switch the profile but only unload and reload the same profile again, should be checked and nsNSSComponent::Observe should return a failure status or even abort (i.e., assert) if the assumption is not true.
Depends on: 101364
Attachment #50519 - Attachment is obsolete: true
Updated patch as suggested by wtc. The profile directory will be different when a different profile is selected. We remember the directory during NSS init, and compare this when we receive a profile change event.
Conrad, you ask whether there are still talkback results for this bug. The answer is yes, I see incident 35726796, which shows the same stack trace.
+ if (lastProfileDirectory != currentProfileDirectory) { + NS_ASSERTION(0, "attempt to switch profile, currently unsupported by PSM"); + // let's crash in optimized builds + *(int*)0 = 1; + } It is a "religious" issue whether we should crash as soon as we realize that we will crash at some point in the future. Based on my experience, people's first reaction is to blame the module that crashed. Therefore, I suggest that you rewrite the code fragment above as something like this. NS_ASSERTION(lastProfileDirectory == currentProfileDirectory, "attempt to switch profile, currently unsupported by PSM"); if (lastProfileDirectory != currentProfileDirectory) { return NS_ERROR_ABORT; } This way it becomes the responsibility of the caller to check your return value and decide what is the appropriate action.
The patch will always cause a crash for anybody who uses the 0.9.4 branch and profile switching. The crash which caused this bug *may* have happened for them. But, if they had managed to stop all async activity before the profile switch, they would not be having this problem. Now, they're guaranteed to have a crash. But, if you don't wan't the profile to be switched for PSM, just don't register to observe these two notifications. Then, the problem code will never be hit. http://lxr.mozilla.org/mozilla/source/security/manager/ssl/src/nsNSSComponent.cpp#801
(I now see that the command line interface of Mozilla supports selecting a different profile at runtime, even in turbo mode, which I was not aware of.) Conrad, you said you want to disable this on the branch. If you do, we will not introduce a crash using my patch, because the Observe method will never be triggered with a different profile. Did I understand something wrong? Using my initial patch, we don't recognize a potential crash, if the patch accidentially lands in a source tree that does support profile switching. If we completely disable listening for the profile switch event, we can not detect at runtime whether NSS switching is required or not, as the application uses profile switching.
> I now see that the command line interface of Mozilla supports selecting a > different profile at runtime. If you mean the -P <profile> arg, that only happens: 1) without turbo - only on initial launch, not at runtime 2) with turbo - only if multiple profiles are enabled
Attachment #50545 - Attachment is obsolete: true
This fix will not be needed if we disable profile switching in Turbo mode. Kai - Pls talk to Conrad.
Whiteboard: turbo → turbo, PDT-
Status: ASSIGNED → RESOLVED
Closed: 23 years ago
Resolution: --- → DUPLICATE
This issue is going to be resolved on the branch by providing a fix to bug 101364, which makes applying the the current patch to this bug unnecessary. This bug should be marked as a duplicate of 101329 which outlines the various options. Kai, let me know if that's not the right thing to do. *** This bug has been marked as a duplicate of 101329 ***
Jaime, Stephane, I agree to your suggestions. If the profile switch is not triggered, we don't need a fix for PSM now, and care for it in bug 101329.
QA Contact: bsharma → junruh
Verified dupe.
Status: RESOLVED → VERIFIED
No longer blocks: 75599
Blocks: 75599
No longer blocks: 75599
Product: PSM → Core
Crash Signature: [@ PK11_GetInternalSlot]
Product: Core → Core Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: