Closed Bug 98182 Opened 23 years ago Closed 23 years ago

M094 & Trunk crashes exiting browser [@ PK11_GetInternalSlot][turbo]

Categories

(Core Graveyard :: Security: UI, defect, P1)

Other Branch
x86
Windows NT
defect

Tracking

(Not tracked)

VERIFIED DUPLICATE of bug 101329
psm2.1

People

(Reporter: dbaron, Assigned: inactive-mailbox)

References

Details

(Keywords: crash, topcrash, Whiteboard: turbo, PDT-)

Crash Data

Attachments

(3 files, 2 obsolete files)

Starting in the 2001-08-30-05 build there were a bunch of crashes on Windows
reported in Talkback with the following stack (reason: Access Violation):

PK11_GetInternalSlot()
PK11_TokenExists()
ssl3_config_match_init()
ssl2_HandleServerHelloMessage()
ssl2_BeginClientHandshake()
ssl_Do1stHandshake()
ssl_SecureSend()
ssl_SecureWrite()
SSL_ImportFD()
nsSSLIOLayerWrite
[d:\builds\seamonkey\mozilla\security\manager\ssl\src\nsNSSIOLayer.cpp, line 704]

(I'm guessing there aren't line numbers because the NSS build system on Windows
is doing the Windows equivalent of stripping libraries itself.)

I'm not sure what changed around that time, or whether it's in PSM or NSS, or what.
Nelson, can you see anything obviously wrong with this
stack trace?  If not, I will change the product to Browser
for further investigation.
Assignee: wtc → nelsonb
Yes, I see something wrong with this trace. 
SSL_ImportFD() does not do any I/O.  
It does not call SSL_SecureWrite().

I'm guessing that this is a developer's personal build, and that
he's forgotten that gmake dependencies don't work properly on windows,
or that CVS update doesn't change the header files in dist (unlike Unix,
where they're simlinked).  But that's a wild guess.

I'd suggest the developer do a complete clean build of all of NSS,
including gmaking the clean, export, and libs targets (assuming coreconf,
I don't know how to do the equivalent using mozilla's build system).
It's very unlikely that this is a developer's personal build -- there were a
number of these crashes reported by different users, entirely outside Netscape
(and therefore with no access to talkback source).  (It might be more likely
that the symbols for a static function are being omitted and you're seeing the
name of the closest previous non-static function.)
If we ignore the SSL_ImportFD item in the stack, the rest of the stack is
consistent with a call to SSL_SecureWrite (via PR_Write).  

PK11_GetInternalSlot() is a one-line function that calls 
SECMOD_GetInternalModule() and then immediately references the pointer
returned by SECMOD_GetInternalModule() without checking it first.
If SECMOD_GetInternalModule() returned a null or corrupt pointer, 
this crash would occur.

So, there are two issues here:
1. Why is SECMOD_GetInternalModule() returning a bad value? 
   This should be investigated.  It's a problem with the pk11 wrapper layer.

2. PK11_GetInternalSlot() should check the value returned by 
SECMOD_GetInternalModule() for NULL before using the value, and should 
return NULL if it gets a NULL from SECMOD_GetInternalModule().

I'll make that change (#2) in the trunk.  
I suggest this bug be reassigned to relyea to investigate issue #1.
Get internal module returns a static variable which is either the internal
module or NULL.

If this function is failing it's because NSS was called, but hasn't been
initialized yet, or has been shutdown.

This fix will most likely just move the crash rather than repair it, as most
callers of Get_InternalSlot do not check for failure either.

There are literally dozens of places that may go belly-up if NSS is not
initialized, so unless we find out the root problem with initialization, the
patch won't really prevent any crashes.

bob
If indeed the problem is that NSS functions are being called before NSS
has been initialized, then this is an application error, not an NSS 
library error.  But we cannot determine that from the information in
this bug.
I checked in the test for the NULL internal module pointer.
I agree with Bob that the code will probably just crash somewhere else.
But testing it here is still the right thing to do, IMO.

Given that the application is attempting to do SSL, it seems unlikely
to me that the application has made no attempt to initialize NSS.
So, the likely explanations are that the attempt failed, but the 
application ignored the error and went on, or that memory has been 
corrupted after initialization is complete.  Either of these conditions
is probably an error outside of NSS.  

This bug should be reassigned to the PSM group.
Product => PSM.
Assignee: nelsonb → ssaux
Component: Libraries → Client Library
Product: NSS → PSM
QA Contact: sonmi → bsharma
to kai
->2.1
P1
nsenterprise+
Kai, do you think this may have anything to do with the disk full fix?
Assignee: ssaux → kai.engert
Keywords: nsenterprise+
Priority: -- → P1
Target Milestone: --- → 2.1
The "disk full fix" from bug 76915 was checked in on August 22, but according to
the report the first problems were seen on August 30.

Since August 22 we init NSS in the very early stages of application startup, and
exit the app if that doesn't work.

When I use Bonsai to query what happened around that date to PSM and NSS, I only
find one change that I think could possibly be related. It changes an #ifdef
block to some init code.

http://bonsai.mozilla.org/cvsview2.cgi?diff_mode=context&whitespace_mode=show&subdir=mozilla/security/nss/lib/softoken&command=DIFF_FRAMESET&file=pk11db.c&rev1=1.1&rev2=1.2&root=/cvsroot

All other checkins around that date seem to be unrelated. (The full list is at
(slow):
http://bonsai.mozilla.org/cvsquery.cgi?treeid=default&module=all&branch=HEAD&branchtype=match&dir=%2Fmozilla%2Fsecurity%2Fnss%2F+%2Fmozilla%2Fsecurity%2Fmanager%2F&file=&filetype=match&who=&whotype=match&sortby=Date&hours=2&date=explicit&mindate=08%2F27%2F2001&maxdate=08%2F30%2F2001&cvsroot=%2Fcvsroot
)
Kai,  
the link you provided showed diffs between revs 1.1 and 1.2 of pk11db.c.
But that file is a new file on the trunk, and is not part of the client
builds.  It doesn't yet even have the NSS_CLIENT_TAG.  So, that's not it.

Is this problem reproducible?
pk11db.c was in pk11wrap in NSS 3.3 and before.

It has been moved to softoken in NSS 3.4.

Kai, you aren't pulling from the NSS tip for PSM are you? The NSS tip is
definately not ready for general use (we still have QA failures). In fact if PSM
is pulling from the NSS tip, it is quite likely that you will run into some of
these problems.

I suspect the *real* change that was made was the change from NSS 3.2 to NSS 3.3.
False alert I think. I checked, both the trunk and the 094 branch have the file
pk11db.c inside pk11wrap. This means they are pulling from NSS 3.3.

I found the above by looking for checkins to the NSS and PSM directories, but
didn't look at the tags.
I counted 29 TALKBACK reports.  All of them are on the trunk. build dates are
from  8/29 to 09/06.
per comments, changing to ASSIGNED
Status: NEW → ASSIGNED
I'm unable to reproduce this.
In the talkbalk reports, I can't find any helpful comments, besides from the
fact that people are in the process of leaving the application.

Therefore I have to try to analyze this by looking at the code:

The crashing function looks like that:

  PK11SlotInfo *
  PK11_GetInternalSlot(void) 
  {
    return PK11_ReferenceSlot(SECMOD_GetInternalModule()->slots[0]);
  }

As the crash is directly within this function, we know for sure that
SECMOD_GetInternalModule() returns. The return value must be either a null (or
invalid) value, or the attribute slots is null (or invalid). Function
PK11_ReferencesSlot can not be have been called, because that would show up in
the stack trace.

SECMOD_GetInternalModule() does simply return the value of variable
internalModule, which is a pointer, limited to the scope of file pk11util.c.

1) Let's assume that internalModule itself is NULL when we crash.

As we know the crash is during application shutdown, it seems likely that the
code is arriving at SECMOD_Shutdown, which destroys the object pointed to by
internalModule and sets this pointer to NULL.

Besides from shutdown, this variable is set only by SECMOD_init (which looks
unlikely as a cause for this bug), SECMOD_SetInternalModule and
SECMOD_DeleteInternalModule.

I can't find code (besides in init code) that calls SetInternalModule.

DeleteInternalModule is called from the NSS command line tools and from PSM
ToggleFIPSMode() only. Looks unlikely to be the cause, too.

The shutdown code is called when the NSS shared library is destructed. Is it
possible, that the PIPNSS component get's unloaded, while the network code still
needs it?

I conclude: The only realistic scenario I can see is: The user selects to quit
the application, but the https / ssl code is still active, and while the app
shuts down, a SSL handshake is still in progress, which causes the crash as the
internal slot has already been killed.

Maybe we should make the network library try to obtain a reference to the crypto
libraries, and, if successful, delay releasing that reference until the final
statement of the network shut down code?


2) Now let's assume that internalModule is valid, but attribute slots is invalid.

But I can't find any code in NSS files where the filename contains "pk11", that
references an attribute "slots" in a way that modifies the pointer value,
besides in the loading code.

There is a call to PK11_FreeSlot, called from SECMOD_DestroyValue, which is
actually called from the destructor of the PSM PKCS11 module.

So again, this could be caused by shutting down of the crypto shared libraries,
while there is still network code active.


The above is pure theory, and maybe I have missed some code fragments while
making my deductions. So I basically hope that my research will give some ideas
to people with deeper knowledge of the codebase, and whether my suggestion with
the reference counter makes sense or seems useless.


I'm looking for help, either by trying to reproduce the bug or by commenting my
above statements.
The disassembly of the code around the crash (with the first line being the one
on which the crash occurred -- talback doesn't show before, for some reason) is:

608d2af5 8b4028           mov     eax,[eax+0x28]     <== CRASH HERE
608d2af8 8b08             mov     ecx,[eax]
608d2afa 51               push    ecx
608d2afb e820e3ffff       call    608d0e20
608d2b00 83c404           add     esp,0x4
608d2b03 c3               ret

the EAX register is 0.  This suggests that SECMOD_GetInternalModule is returning
null.  (I assume that the |slots| member is at offset 0x28.)
Keywords: nsbranch+
nsNSSComponent is declared as:

NS_IMPL_THREADSAFE_ISUPPORTS6(nsNSSComponent,
                              nsISecurityManagerComponent,
                              nsISignatureVerifier,
                              nsIEntropyCollector,
                              nsINSSComponent,
                              nsIObserver,
                              nsISupportsWeakReference);

The destructor of nsNSSComponent calls NSS_Shutdown.
After this has been shut down, no NSS operations are allowed and are prone to crash.


When a SSL socket is opened, the code arrives at nsSSLIOLayerAddToSocket /
nsSSLIOLayerNewSocket, which will create an object of type nsNSSSocketInfo.
This object is referenced by the socket transport code.

NS_IMPL_THREADSAFE_ISUPPORTS4(nsNSSSocketInfo,
                              nsITransportSecurityInfo,
                              nsISSLSocketControl,
                              nsIInterfaceRequestor,
                              nsISSLStatusProvider)

While an nsNSSSocketInfo is in use by the network transport code, this can
result in NSS functions being called - because the SSL code has added itself as
a layer to the network code.
For my understanding this means, as long as there are nsNSSSocketInfo objects
alive, NSS must not be shut down.


But I can't see code that couples nsNSSSocketInfo to nsNSSComponent in terms of
reference counting.

Have I found the cause, or is there a coupling that I don't see?
nsNSSIOLayer.cpp creates the nsNSSService object in InitNSSMethods(), but never
releases it. So we leak on shutdown. The refcnt never goes to 0, and so
nss_shutdown won't be called.

What am I missing?
Bradley: You are right with your last statement.

I looked whether NSS_Shutdown is really called and from where. It is called, but
when the application shuts down and cleans up, see the stacktrace below.

#0  nsNSSComponent::~nsNSSComponent (this=0x81e3590, __in_chrg=3) at
../../../../../mozilla/security/manager/ssl/src/nsNSSComponent.cpp:109
#1  0x416603f7 in nsNSSComponent::Release (this=0x81e3590) at
../../../../../mozilla/security/manager/ssl/src/nsNSSComponent.cpp:550
#2  0x401ccc0d in DeleteEntry (aKey=0x8317598, aData=0x8317580, closure=0x0) at
../../../mozilla/xpcom/components/nsServiceManager.cpp:260
#3  0x4017b22a in _hashEnumerateRemove (he=0x83175b8, i=23, arg=0xbffff570) at
../../../mozilla/xpcom/ds/nsHashtable.cpp:381
#4  0x402af2fa in PL_HashTableEnumerateEntries (ht=0x806d90c, f=0x4017b1ec
<_hashEnumerateRemove(PLHashEntry *, int, void *)>, arg=0xbffff570) at
../../../../mozilla/nsprpub/lib/ds/plhash.c:429
#5  0x4017b2eb in nsHashtable::Reset (this=0x806d908, destroyFunc=0x401ccbcc
<DeleteEntry(nsHashKey *, void *, void *)>, aClosure=0x0) at
../../../mozilla/xpcom/ds/nsHashtable.cpp:400
#6  0x4017c9fb in nsObjectHashtable::Reset (this=0x806d908) at
../../../mozilla/xpcom/ds/nsHashtable.cpp:850
#7  0x4017c89a in nsObjectHashtable::~nsObjectHashtable (this=0x806d908,
__in_chrg=3) at ../../../mozilla/xpcom/ds/nsHashtable.cpp:816
#8  0x401ccd9f in nsServiceManagerImpl::~nsServiceManagerImpl (this=0x806d8f0,
__in_chrg=3) at ../../../mozilla/xpcom/components/nsServiceManager.cpp:285
#9  0x401ccf13 in nsServiceManagerImpl::Release (this=0x806d8f0) at
../../../mozilla/xpcom/components/nsServiceManager.cpp:294
#10 0x401cd94d in nsServiceManager::ShutdownGlobalServiceManager (result=0x0) at
../../../mozilla/xpcom/components/nsServiceManager.cpp:544
#11 0x401720cd in NS_ShutdownXPCOM (servMgr=0x0) at
../../../mozilla/xpcom/build/nsXPComInit.cpp:489
#12 0x0805b73d in main (argc=1, argv=0xbffff844) at
../../../mozilla/xpfe/bootstrap/nsAppRunner.cpp:1598
#13 0x405e8e5e in __libc_start_main (main=0x805b4cc <main>, argc=1,
ubp_av=0xbffff844, init=0x8054570 <_init>, fini=0x8067260 <_fini>,
rtld_fini=0x4000d3c4 <_dl_fini>, stack_end=0xbffff83c) at
../sysdeps/generic/libc-start.c:129


But if this is really the only location where NSS_Shutdown is triggered, as the
application code never frees the reference... Is it possible that the socket
transport thread is still running when the application shuts down? Maybe the
security component is freed before the network component(s)?
nsNSSModule.cpp is seriously messed up:  it has multiple entries for a given CID
and they're all tied to the nsNSSComponentConstructor but for different
contractIDs.  The GetService into the global in nsNSSIOLayer.cpp is by contract
ID rather than CID -- so it may not work.  If that is the problem (I'm not sure
how the component manager would deal with the strangeness in nsNSSModule.cpp),
then there should be instance counting to release that global when it can be
released, rather than just leaking it.
OK, maybe that nsNSSModule.cpp stuff is OK after all.
Might the service manager get confused by having multiple ContractIDs and a
single CID for the nsNSSComponent, and create more than one instance?
Where do I find the source for nsNSSComponent.cpp or nsNSSModule.cpp

There are a few components that have more than one contractid for a given cid. I
dont think that is a problem. That was supposed to work by design.
dp: mozilla/security/manager/ssl/src/
Note that links are internal to netscape. Sorry.
This query fails more often that not on climate.mcom.com, so including the saved
output allows one to easily (as long as one has access to the internal network)
access the individual incidents.

There doesn't appear to be any branch builds.

If shiva and jay say this is fixed, then we should put this one on the back burner.
This is showing up in the M094 Talkback topcrash report.  Stephane's attached
query does show many of these crashes with build 2001091311 on the Netscape6.20
branch (which also happens to be the Mozilla 0.9.4 release).  Here is the latest
info from Talkback.  The stack doesn't look too helpful, but perhaps the
comments will help us reproduce:

PK11_GetInternalSlot   14
			 98182 	 ASSI 	  	 kai.engert@gmx.de 2.1 
BBID range: 35353387 - 35494918
Min/Max Seconds since last crash: 256 - 46396
Min/Max Runtime: 256 - 46396
Crash data range: 2001-09-13 to 2001-09-17
Build ID range: 2001091311 to 2001091311

Stack Trace: 

	 PK11_GetInternalSlot()
	 PK11_TokenExists()
	 ssl3_config_match_init()
	 ssl3_HandleRecord()
	 ssl3_HandleRecord()
	 ssl3_HandleRecord()
	 ssl3_HandleRecord()
	 ssl3_GatherCompleteHandshake()
	 ssl_GatherRecord1stHandshake()
	 ssl_Do1stHandshake()
	 ssl_SecureSend()
	 ssl_SecureWrite()
	 SSL_ImportFD()
	 nsSSLIOLayerWrite
[d:\builds\seamonkey\mozilla\security\manager\ssl\src\nsNSSIOLayer.cpp  line 718]
	 SHA1_Update()
	 RNG_UpdateAndEnd_FIPS186_1()
	 prng_RandomUpdate()
	 SocketRecv
[../../../../pr/src/io/prsocket.c  line 641]
	 pl_DefRecv
[../../../../pr/src/io/prlayer.c  line 278]
	 ssl_DefRecv()
	 ssl3_GatherCompleteHandshake()
	 ssl3_GatherCompleteHandshake()
	 0xffffffff
	 0x90909090     (35494918)	Comments: Tried to close Mozilla completely using Mozilla icon in
the task bar. I tried to right click  left click double click  but no menu
appeared. Mozilla opened and I closed it immediately. At this point Mozilla crashed.
     (35484527)	Comments: I just quit  got mesage about icon staying on taskbar  hit OK.
     (35483403)	URL: webmail.cmg.nl
     (35483403)	Comments: I think the web-mail application tried to close the browser
window... It seemed to go slightly wrong.. The app is written in javascript.
     (35482014)	Comments: closed mozilla window while it was receiving data
     (35451055)	URL: http://www.msn.com
(35451055)
Comments: Exiting Hotmail. The page that follows the checkout page was trying to
load after redirect.
     (35445384)	Comments: closed browser (via javascript  apparently)
     (35420953)	URL: www.landbobanken.dk
(35420953)
Comments: Java internet banking
Keywords: crash
Summary: crashes exiting browser [@ PK11_GetInternalSlot] → M094 crashes exiting browser [@ PK11_GetInternalSlot]
Kai. So this is still happening on the 0.9.4 branch. From the comments included,
I see two avenues to investigate:
1) using the turbo mode exit.
2) quiting while page is still loading.
I'm confused why the title of this bug was changed to include M094. My
impression was, from Stephane's attachment, that it crashes on Mozilla trunk,
too. It isn't limited to a branch.

The avenue "still loading while exiting" is indeed the only scenario that makes
sense to me.

I played around on Windows, but I did not see the crash. I did use the turbo
mode feature, which worked stable for me.

Until I can reproduce it myself, the only thing I can try is to add code that
tries to improve thread safetyness of PSM, as well as some other things to test:

I want to test whether the singleton concept of the NSS component is actually
working. The existing code is designed to work as a singleton only. If there
were multiple instances of nsNSSComponent created during runtime, we would
indeed see a crash, as during destruction of the first one, NSS would already
have been shut down, and still open SSL connection would crash in a way as the
above stack trace. However, during my tests so far, I have never seen the
component to be created twice, so this might happen only in an edge condition
that I have not yet come across.

Another, very general, suggestion being made was: the socket information
structure should refcount the nss component in some way, because it depends on
it.

The reason why it currently works: the nsNSSComponent is knowingly leaked. But
this makes it even more incomprehensible why we crash, as the destruction and
NSS shutdown code should not be reached if the singleton concept worked and
there were no multithread issues.

This is flawed, I guess we really should shut down NSS on exit. Currently, we
don't do it, as soon as SSL code has been called at least once during the
Mozilla session.
Kai: Sorry for the confusion, I forgot to add "Trunk" to the summary as well. 
We (the Talkback team) usually update the summary to show which
releases/milestones are showing the crash so we can get a quick idea of when we
last saw the crash.  Anyway, I've updated the summary with Trunk, so we know
that this crash is occurring with both Mozilla 0.9.4 and the latest MozillaTrunk
builds.
Summary: M094 crashes exiting browser [@ PK11_GetInternalSlot] → M094 & Trunk crashes exiting browser [@ PK11_GetInternalSlot]
Kai: you should try to instantiate multiple instances of NSS explicitly to test
your theory.  If that help reproduce the crash and it's in the same place, that
would probably be significant.
I made the following change to enforce the crash.

I changed the code receiving the entropy collector in nsGlobalWindow, to not use
do_GetService, but use do_CreateInstance. This will create additional instances
of nsNSSComponent.

Inside the nsNSSComponent, there is an assertion check, that tries to detect if
the init code is called twice. When I use a debug build, the app quits. However,
in an optimized build, I suppose the app will not quit I think.
I therefore disabled the assertion check for testing, too.

When I open the application, create a second window, and surf over https, I
crash very soon. I see the following stack trace, which is at least in the upper
levels similar to the reported stack traces.


#0  0x416f8cc8 in PK11_GetInternalSlot () at pk11slot.c:2132
#1  0x416f8faa in PK11_TokenExists (type=33) at pk11slot.c:2244
#2  0x416b9f32 in ssl3_config_match_init (ss=0x895da00) at ssl3con.c:419
#3  0x416a80ec in ssl2_ConstructCipherSpecs (ss=0x895da00) at sslcon.c:204
#4  0x416ade3a in ssl2_BeginClientHandshake (ss=0x895da00) at sslcon.c:2964
#5  0x416b17bf in ssl_Do1stHandshake (ss=0x895da00) at sslsecur.c:156
#6  0x416b3764 in ssl_SecureSend (ss=0x895da00, buf=0x0, len=0, flags=0) at
sslsecur.c:1075
#7  0x416b3899 in ssl_SecureWrite (ss=0x895da00, buf=0x0, len=0) at sslsecur.c:1109
#8  0x416b6c1c in ssl_Write (fd=0x8938f40, buf=0x0, len=0) at sslsock.c:1232
#9  0x4166e5b7 in nsSSLIOLayerWrite (fd=0x86de738, buf=0x0, amount=0) at
../../../../../mozilla/security/manager/ssl/src/nsNSSIOLayer.cpp:697
#10 0x402c4990 in PR_Write (fd=0x86de738, buf=0x0, amount=0) at
../../../../../mozilla/nsprpub/pr/src/io/priometh.c:141
#11 0x4166d050 in nsNSSSocketInfo::TLSStepUp (this=0x895d968) at
../../../../../mozilla/security/manager/ssl/src/nsNSSIOLayer.cpp:341
#12 0x4166cfcc in nsNSSSocketInfo::ProxyStepUp (this=0x895d968) at
../../../../../mozilla/security/manager/ssl/src/nsNSSIOLayer.cpp:326
#13 0x40ba3e53 in nsHttpConnection::ProxyStepUp (this=0x8963578) at
../../../../../mozilla/netwerk/protocol/http/src/nsHttpConnection.cpp:287
#14 0x40ba38e7 in nsHttpConnection::OnHeadersAvailable (this=0x8963578,
trans=0x89825c8, reset=0xbf7ff654) at
../../../../../mozilla/netwerk/protocol/http/src/nsHttpConnection.cpp:192
#15 0x40ba6f80 in nsHttpTransaction::HandleContentStart (this=0x89825c8) at
../../../../../mozilla/netwerk/protocol/http/src/nsHttpTransaction.cpp:387
#16 0x40ba7318 in nsHttpTransaction::HandleContent (this=0x89825c8,
buf=0x89df960 "HTTP/1.0 200 Connection established\n\nProxy-agent:
Muffin\n\n\n\n", 'Ú' <repeats 140 times>..., count=0, countRead=0xbf7ff7d4) at
../../../../../mozilla/netwerk/protocol/http/src/nsHttpTransaction.cpp:469
#17 0x40ba8016 in nsHttpTransaction::Read (this=0x89825c8, buf=0x89df960
"HTTP/1.0 200 Connection established\n\nProxy-agent: Muffin\n\n\n\n", 'Ú'
<repeats 140 times>..., count=0, bytesWritten=0xbf7ff7d4) at
../../../../../mozilla/netwerk/protocol/http/src/nsHttpTransaction.cpp:777
#18 0x401b2d05 in nsReadFromInputStream (outStr=0x8982924, closure=0x89825cc,
toRawSegment=0x89df960 "HTTP/1.0 200 Connection established\n\nProxy-agent:
Muffin\n\n\n\n", 'Ú' <repeats 140 times>..., offset=0, count=4096,
readCount=0xbf7ff7d4) at ../../../mozilla/xpcom/io/nsPipe2.cpp:830
#19 0x401b27ce in nsPipe::nsPipeOutputStream::WriteSegments (this=0x8982924,
reader=0x401b2cd0 <nsReadFromInputStream(nsIOutputStream *, void *, char *,
unsigned int, unsigned int, unsigned int *)>, closure=0x89825cc, count=16384,
writeCount=0xbf7ff89c) at ../../../mozilla/xpcom/io/nsPipe2.cpp:704
#20 0x401b2d59 in nsPipe::nsPipeOutputStream::WriteFrom (this=0x8982924,
fromStream=0x89825cc, count=16384, writeCount=0xbf7ff89c) at
../../../mozilla/xpcom/io/nsPipe2.cpp:838
#21 0x40b74361 in nsStreamListenerProxy::OnDataAvailable (this=0x8963508,
request=0x89825c8, context=0x0, source=0x89825cc, offset=0, count=16384) at
../../../../mozilla/netwerk/base/src/nsStreamListenerProxy.cpp:288
#22 0x40ba656a in nsHttpTransaction::OnDataReadable (this=0x89825c8,
is=0x89391d0) at
../../../../../mozilla/netwerk/protocol/http/src/nsHttpTransaction.cpp:178
#23 0x40ba57e7 in nsHttpConnection::OnDataAvailable (this=0x8963578,
request=0x86d6020, context=0x0, inputStream=0x89391d0, offset=0, count=8192) at
../../../../../mozilla/netwerk/protocol/http/src/nsHttpConnection.cpp:702
#24 0x40b68ee4 in nsSocketReadRequest::OnRead (this=0x86d6020) at
../../../../mozilla/netwerk/base/src/nsSocketTransport.cpp:2723
#25 0x40b63894 in nsSocketTransport::doReadWrite (this=0x8805eb8,
aSelectFlags=1) at ../../../../mozilla/netwerk/base/src/nsSocketTransport.cpp:999
#26 0x40b62072 in nsSocketTransport::Process (this=0x8805eb8, aSelectFlags=1) at
../../../../mozilla/netwerk/base/src/nsSocketTransport.cpp:480
#27 0x40b6a37a in nsSocketTransportService::Run (this=0x8116820) at
../../../../mozilla/netwerk/base/src/nsSocketTransportService.cpp:458
#28 0x401d6832 in nsThread::Main (arg=0x810fd78) at
../../../mozilla/xpcom/threads/nsThread.cpp:105
#29 0x402ea93a in _pt_root (arg=0x810d530) at
../../../../../mozilla/nsprpub/pr/src/pthreads/ptthread.c:214
#30 0x4030c078 in pthread_start_thread_event (arg=0xbf7ffc00) at manager.c:262
I just saw this for the very first time! I think I know what causes this
problem. NSS is indeed shut down and restarted during the session, and it has to
do with the -turbo mode on Windows.

The reason is: method nsNSSComponent::Observe is called, which I did not expect.
The "profile changed" event (causing NSS to shut down) seems to be triggered
while there are still connections in progress. This method needs to be redesigned.

More to follow.
nice finding kai. Ccing Conrad "Turbo" Carlen.
Blocks: 75599
Summary: M094 & Trunk crashes exiting browser [@ PK11_GetInternalSlot] → M094 & Trunk crashes exiting browser [@ PK11_GetInternalSlot][turbo]
Nominating for turbo short list.
Whiteboard: turbo
Please see also some comments in bug 99566. 

We have learned that the crash is caused by unloading and loading the profile.
PSM tries to shut down and re-init NSS when it is informed by profile change
events. 
While this is done, it is possible that there are still active threads accessing
NSS, e.g. the network communication thread.

The code necessary to add protection is not a small patch, but might require
many changes all over PSM, to safeguard for the "NSS currently unavailble"
condition, i.e. checking for this flag, including to add locks for serializing
access.

While it is a general requirement to have the functionality to switch profiles
(including the certificate databases used by Mozilla), this might require too
many changes, which may be allowed to be put to the 094 branch.

There is a workaround for applications that don't support real profile switching
yet. We could patch PSM to completely ignore the profile loading / unloading
event, by assuming there won't be a switch to a different profile (with
different certificate database files). I'd like to suggest this patch for
applications that are created from the 094 branch. If this restriction isn't
possible, random crashes are always possible. We could try to minimize the
likelyhood for a crash, by delaying NSS_Shutdown until the moment where we need
to re-initialize. I have no idea, whether there might still be SSL sockets
active when the new profile is being loaded/activated. This could result in a
newly activated NSS session accessing NSS library instances from a previous session.

Here are some comments for a correct implementation for the trunk:

nsNSSComponent methods seem to be called from multiple threads, the UI thread,
the observer notifier (not sure which this is), the socket thread. I think we
should protect the flag, whether NSS is currently activated or not, using a mutex.

We need to behave gracefully. But doing so in nsNSSComponent is not enough. The
SSL code adds a function layer to the socket file descriptors. And those
obviously are active in the middle of profile change. And those call global
functions (e.g. nsNSSIOLayerWrite). In all the layer functions, we must check
for NSS being currently in good state (by querying the
nsNSSComponent::mNSSInitiliazed flag).

However, what happens, if a SSL socket layer function is currently active, while
NSS is being shut down? Obviously we have the likelyhood that NSS goes away in
the middle of a socket operation. This means that it doesn't matter whether NSS
(itself) is thread safe or not. Access to NSS functions is asynchronous. If we
want to fail gracefully if NSS is not available, that means, make the check for
"NSS is available" right before every single call to any NSS function, and
protect the calls using the same Mutex as we must use for NSS shutdown and init,
and to protect the member variables of the nsNSSComponent.

We might want to discuss whether we really need to protect every single function
call to NSS or whether we can reduce this.

However, I fear the likelyhood for being required to protect every single
function is high, as indicated by my observation, that I saw a "certificate
domain mismatch" prompter dialog coming up, nearly at the same moment as the
profile observer was called with the "profile unloaded" event.

Oops, in the previous comment, in the fifth paragraph, there is the important
word NOT missing:


While it is a general requirement to have the functionality to switch profiles
(including the certificate databases used by Mozilla), this might require too
many changes, which may be *NOT* allowed to be put to the 094 branch.

Since this concerns general architecture of the security subsystems, I'm cc'ing
Terry Hayes. 
Cant NSS not shutdown, but just forget all cert db etc and reinitialize with the
new profile's cert db etc...in this situation ?
*** Bug 100771 has been marked as a duplicate of this bug. ***
I'm a little concerned that the client is not trying to quiesce all its
connections before it tries to switch profiles. The Initialization and shutdown
code in NSS is not thread safe. This is the code where any global locks which
may be needed are acquired.

It seems to me that it's a dangerous thing in general to allow operations to
continue while the profile is in transition. Mail/News components can't be happy
if  you are reading a message in the middle of reading a message.

Making PSM fail when NSS isn't initialized seems like a hack. The real question
why are there outstanding NSS requests out there to begin with. How is the
profile switch supposed to happen? Does the profiler shutdown each component
first before it starts enabling new components? If so, then you need to make
sure PSM is shutdown after all the potential callers of PSM, and that include
the networking library! PSM should then be brought up before any new potential
callers.
The database and keys are an integral part of NSS, NSS can't run without them
defined. Not only that, but what tokens you have defined is stored in your
profile. You could have a completely different set of hardware tokens defined.
Switching profiles is a big deal. Before this design Netscape switched profiles
by Shutting down. The profile switching code should assume that it's has to do
most of the work of shutdown.

Even if PSM fails on NSS connections in between the shutdown and the new
initialize, I don't know what kind of results we are going to get if we have
things like SSL connections that start on one profile and finish on the second.
All of these connections have to be closed down before you shutdown NSS.
Bob, thanks for the explanation.

Maybe this is even the simplest strategy to solve our problem.

Can we use code, that really assures that all network activity has been shut
down completely (even if this blocks), before we try to switch the profile?
Adding in mscott.

Darin/mscott: any way we can tear down network connections completely on a
profile switch.
As the first phase of profile switching, all windows are destroyed which cancels
any current loads. The canceling, though, is not synchronous. It's hard to know
when all activity is stopped and trying to enforce that in something as large as
mozilla isn't the right way to go. Whatever a component loads from a profile,
it's given notice that that data is going away. It needs to do what it must to
put itself in a safe state. localstore, for instance, makes a temporary
in-memory data source so it can function without any of the user-specific data
in the meanwhile. How can NSS require, or else crash, that it has this data?
What if it's internal data structures were just empty, in the state they were
before it inernalized any data from the profile? Would it just crash? What if it
failed to read the data for some other reason? I think, from another bug, it'll
crash then too.

This caused a similar problem in the disk cache. There, it was agreed that the
disk cache should be able to handle the situation without complete quiesense.
That said, I think having multiple profiles in turbo and therefore switching
should be disabled until this can be worked out. Now that it's out of an
embedding-only context, too many problems like this are cropping up.
The profile *DEFINES* NSS's internal data structures. What modules are 'loaded'.
What slots those modules have, etc are defined by secmod.db. It's not that the
data structures are empty, they flat out don't exist! You cannot do anything in
NSS without a token defined: not hashing, not PRNG, not anything!

If NSS_Initialize fails, the application cannot call any other NSS function
without crashing. This has been true since the very beginning, and all NSS
applications (including communicator and the servers)live with this restriction.
The reason for this is most of these products have not restricted themselves to
a small set up public API's. NSS can't  make initialization checks on *every*
interface that PSM calls currently because it calles to many internal only
interfaces.

NSS requires it because it has never been a design point that it didn't. When
the shutdown feature was request, I clearly stated what the restrictions were.
Obviously those restrictions weren't communicated back up.

There is a more fundamental problem, though. Even if NSS allow 'changing the
profile on the fly' with it's connections, there are security issues. What if a
connection was created before you switched profiles, managed not to make any NSS
calls until after the profile switch, what is the semantic of that connection?
All the security checks were done in reguard to a completely separate
connection?  These really need to be quiesced before you bring up a new security
context.

Reading Conrad's comments more carefully, I would say that NSS should not
shutdown on the first phase. We have an issue if we need to access the key or
certdb in that time, but NSS will fail, not crash in that case (if the key and
cert DB go away after initialization, NSS does handle that case 'gracefully').
The user key and cert Data is already squirrelled away currently. There may be
an issue if we have modified the key and cert DB and the dbclose doesn't happen.
NSS will still have the files open at this point, I don't know what issue that
will cause. The potential db corruption in this case is already an issue we have
filed against NSS (it's really a bug in DBM where db_sync doesn't always do what
it should). The secmod.db is openned, read, then closed. As long as someone
doesn't try to add a module in a middle, it could disappear of the face of the
earth and we would never know.

In this case, KAI, we should do the shutdown just before we do the initialize.
We still need to make sure no one is going to call NSS between the two calls. I
am still concerned about the security implications of having connections
starting in one profile and living through the profile switch process.

bob
i haven't been following all of the discussion on this bug, but in answer to
dp's question: it is possible to "go offline" and thereby force all socket
connections to close (they will not close synchronously, however).
Bob, you say that there is an issue if dbclose is not called.

Well, then we might have another problem. Currently, in most cases, the PSM code
will not call NSS_Shutdown on application exit.

If there has been at least on SSL connection during an application session, the
component responsible for shutting it down, is intentionally leaked.

This is a different bug. The network layer is not keeping a reference to the
nsNSSComponent. This means, during app shutdown, the destructor of the
nsNSSComponent might be called while there is still network activity. This
destructor has the call to NSS_Shutdown. Not knowing better, I assume the
additional leaking reference count was introduced to prevent a crash.

Bob, if you say that calling NSS_Shutdown is wanted/required, to make sure that
dbclose is called during application exit, I will open another bug in order to
fix this.
As we need a general solution for profile switching applications, I have created
the new bug 101329.

However, we need a tempory solution for this bug now.


Temporary solution for the 094 branch
=====================================

All suggestions outlined in the new bug 101329 seem to be too risky for the 094
branch.

Conrad suggests to disable profile switching for now.

Still, we might encounter profile loading and unloading events, where the actual
profile will not change, but unloaded/loaded in the turbo mode situation.

To make sure we will not crash if those events occur, I suggest the following
patch. It makes PSM simply ignore the profile change events.

This patch is absolutely wrong for all other applications, but is a good
temporary solution, as it avoids all the problems, by simply not
unloading/reloading NSS.

The patch would require that we disable turbo mode for multiple profiles. If
that's what we have to do, I'll disable it. r=ccarlen.

Before taking this step, can we see if there are any talkback incidents of this
on the trunk since bug 99387 was checked in?
Comment on attachment 50519 [details] [diff] [review]
Suggested temporary fix for 094 branch ONLY.

Kai,

The assumption your patch makes, that the application
will not switch the profile but only unload and reload the same profile again, should be
checked and nsNSSComponent::Observe
should return a failure status or even abort (i.e., assert) if the assumption is not true.
Depends on: 101364
Attachment #50519 - Attachment is obsolete: true
Updated patch as suggested by wtc. The profile directory will be different when
a different profile is selected. We remember the directory during NSS init, and
compare this when we receive a profile change event.
Conrad, you ask whether there are still talkback results for this bug.
The answer is yes, I see incident 35726796, which shows the same stack trace.
+    if (lastProfileDirectory != currentProfileDirectory) {
+      NS_ASSERTION(0, "attempt to switch profile, currently unsupported by PSM");
+      // let's crash in optimized builds
+      *(int*)0 = 1;
+    }

It is a "religious" issue whether we should crash as soon as
we realize that we will crash at some point in the future.
Based on my experience, people's first reaction is to blame
the module that crashed.  Therefore, I suggest that you
rewrite the code fragment above as something like this.

    NS_ASSERTION(lastProfileDirectory == currentProfileDirectory,
        "attempt to switch profile, currently unsupported by PSM");
    if (lastProfileDirectory != currentProfileDirectory) {
        return NS_ERROR_ABORT;
    }

This way it becomes the responsibility of the caller to check your
return value and decide what is the appropriate action.
The patch will always cause a crash for anybody who uses the 0.9.4 branch and
profile switching. The crash which caused this bug *may* have happened for them.
But, if they had managed to stop all async activity before the profile switch,
they would not be having this problem. Now, they're guaranteed to have a crash.

But, if you don't wan't the profile to be switched for PSM, just don't register
to observe these two notifications. Then, the problem code will never be hit.

http://lxr.mozilla.org/mozilla/source/security/manager/ssl/src/nsNSSComponent.cpp#801
(I now see that the command line interface of Mozilla supports selecting a
different profile at runtime, even in turbo mode, which I was not aware of.)

Conrad, you said you want to disable this on the branch. If you do, we will not
introduce a crash using my patch, because the Observe method will never be
triggered with a different profile. Did I understand something wrong?

Using my initial patch, we don't recognize a potential crash, if the patch
accidentially lands in a source tree that does support profile switching.

If we completely disable listening for the profile switch event, we can not
detect at runtime whether NSS switching is required or not, as the application
uses profile switching.
> I now see that the command line interface of Mozilla supports selecting a
> different profile at runtime.

If you mean the -P <profile> arg, that only happens:
1) without turbo - only on initial launch, not at runtime
2) with turbo - only if multiple profiles are enabled

Attachment #50545 - Attachment is obsolete: true
This fix will not be needed if we disable profile switching in Turbo mode. 

Kai - Pls talk to Conrad.
Whiteboard: turbo → turbo, PDT-
Status: ASSIGNED → RESOLVED
Closed: 23 years ago
Resolution: --- → DUPLICATE
This issue is going to be resolved on the branch by providing a fix to bug
101364, which makes applying the the current patch to this bug unnecessary.

This bug should be marked as a duplicate of 101329 which outlines the various
options.

Kai, let me know if that's not the right thing to do.


*** This bug has been marked as a duplicate of 101329 ***
Jaime, Stephane, I agree to your suggestions. If the profile switch is not
triggered, we don't need a fix for PSM now, and care for it in bug 101329.
QA Contact: bsharma → junruh
Verified dupe.
Status: RESOLVED → VERIFIED
No longer blocks: 75599
Blocks: 75599
No longer blocks: 75599
Product: PSM → Core
Crash Signature: [@ PK11_GetInternalSlot]
Product: Core → Core Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: