Closed Bug 73018 Opened 19 years ago Closed 19 years ago

SSL fails on Mac

Categories

(Core Graveyard :: Security: UI, defect, blocker)

1.0 Branch
PowerPC
Mac System 9.x
defect
Not set
blocker

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: junruh, Assigned: sfraser_bugs)

Details

(Keywords: smoketest)

I cannot reach a secure site with the 2001032204 commercial trunk build.
Yep, just tested this on the first reported Mac build.  It endlessly attempts to
load the page; clicking Stop freezes the app (must force quit to escape).

adding smoketest keyword
Keywords: smoketest
cc-ing necko folks in case this is cache related.

John Tracy - could you try removing any shared libraries with "cache" in their
names from the "components" directory?  I think it is called nkcache, but I am
not sure.
Is this commercial only?  Or does this also happen on Mozilla builds?

This may be related to the patch I made yesterday that copies the
security-prefs.js file to the default prefs directory.

I'm building right now to investigate.  Probably another hour or so before my
build finishes.
Also, is this with a new profile or an existing one?
New profile or old, Mozilla or commercial trunk build, there is no reaching an 
ssl site.
It's doing it with both old and new profiles.
If it's doing it with both old and new profiles, then it probably isn't related
to my check-in yesterday, as that would only affect new profiles.

Still waiting on build.  It's gonna be longer than I thought because I forgot to
disable pulling the tree when I ran the build script.
Someone from necko is gonna have to help me on this one.

Here's what's happening:

 The Mac tries to create an SSL socket via PSM. The PSM client libraries (CMT_*)
call through to PSM to create the SSL socket. With PSM 1.x, this happens via
TCP/IP sockets on the Mac.  There is a PSM shim layer that wraps calls to
send/read data on sockets.  The request is sent over to the PSM shared library
to create a connection.  PSM creates a connection object.  It sends back a reply
that the connection object has been created and it waits to receive the nonce
authenticating that this SSL request is from a client it has established a
connection with.

Everything up to this point is normal.  The NSPR socket created by the PSM shim
layer never wakes up to read the reply, so the PSM SSL thread is blocking
waiting on a thread that never gets woken up.

The problem could be that:
1) necko changed something with the way it expects sockets to be set up
2) NSPR sockets somehow changed in a way that causes our shim layer to no
longer work
3) PSM shim layer isn't setting the correct bit on the socket for reading and
only worked before by luck.
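
To illustrate hypothesis 3 only: a minimal sketch, using POSIX poll() on a pipe
as an analogy rather than NSPR's PR_Poll or the actual shim layer code.  The
helper name PollPendingPipe is hypothetical.  It shows that a descriptor with
data waiting is not reported as ready when the "read" bit is missing from the
requested events, so a caller waiting on it would never wake.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from PSM or NSPR): creates a pipe, writes one
 * byte so the read end has data pending, then polls the read end once with
 * the given event mask and a zero timeout.
 * Returns poll()'s result, or -1 if the pipe could not be set up.
 */
int PollPendingPipe(short events)
{
    int fds[2];
    struct pollfd pfd;
    char byte = 'x';
    int ready;

    if (pipe(fds) != 0)
        return -1;

    if (write(fds[1], &byte, 1) != 1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    pfd.fd = fds[0];
    pfd.events = events;   /* the "bit" the caller asks to be woken for */
    pfd.revents = 0;

    ready = poll(&pfd, 1, 0);   /* timeout 0: report state, never block */
    assert(ready <= 1);         /* only one descriptor was passed in */

    close(fds[0]);
    close(fds[1]);
    return ready;
}
```

Calling PollPendingPipe(POLLIN) reports one ready descriptor, while
PollPendingPipe(0) reports none even though the same byte is waiting to be
read.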

Anyone else have ideas?  Cause I don't know why this is.

http://lxr.mozilla.org/mozilla/source/extensions/psm-glue/src/nsPSMShimLayer.c
Have you tried backing out mozilla/nsprpub/pr/src/md/mac/macsockotpt.c two 
revisions? It changed a bit recently, ask gordon or sfraser for the background.
I reverted macsockotpt.c to rev 3.15.2.7 and I still get the same behavior.

Setting up the control connection and passing the prefs over the control
connection works.  The difference being that those sockets aren't handed back up
to necko for layering.  Those sockets exist solely in the CMT layer for PSM
setup.  It seems that the necko layer is forgetting to set a bit or sets a bad
bit on the socket returned such that it never wakes from a poll.  Similar to
what we were seeing with the SSL proxy bug before 0.8.
Do we know exactly when this broke?
My PSM 1 build from yesterday loads https pages OK, but I do see some scary 
assertions:

  11453980    PPC  3DC6CF04  _PR_UserRunThread+000C4
  11453900    PPC  3CC02488  SSM_FrontEndThread+00308
  11453880    PPC  3CC00220  SSMControlConnection_ProcessMessage+00154
  11453830    PPC  3CBFE498  SSMControlConnection_ProcessHello+00084
  114537B0    PPC  3CBFE36C  SSMControlConnection_SetupNSS+00114
  11453740    PPC  3CBFDFF0  SSM_InitNSS+00080
  114536C0    PPC  3CBFDE70  ssm_OpenSecModDB+00084
  11453680    PPC  3CC66670  SECMOD_init+00168
  11453630    PPC  3CC6E120  SECMOD_LoadModule+0010C
  11453580    PPC  3B95E87C  MODULE_NAMEC_Initialize+0001C
  11453540    PPC  3B96C754  NSSCKFWC_Initialize+00090
  114534F0    PPC  3B9641A4  nssCKFWInstance_Create+00024
  11453490    PPC  3B9618A8  NSSArena_Create+00014
  11453450    PPC  3B961990  nssArena_Create+000AC
  11453410    PPC  3B961630  arena_add_pointer+0001C
  114533D0    PPC  3B962AA4  nssPointerTracker_initialize+00024
  11453390    PPC  3B962908  call_once+000AC
  11453340    PPC  3DC6B464  PR_NotifyAllCondVar+0004C
  114532E0    PPC  3DC5C654  PR_Assert+00048

and my own assertions in the OT Notifier, which assert that if we have a thread, 
its io_pending is true, fail.
That assertion has been there for a while.  I haven't had time to track it down
and fix it.  It's way in the depths of the PKCS11 module that contains the root
certificates.
OK, now I'm really confused.

I'm not sure what I did, but it's working for me now.  I didn't find any lurkers
in nsprpub, netwerk, psm-glue, or security.

No clue as to why it started working.
It turns out I hadn't properly re-built NSPR when I updated macsockotpt.c

Reverting to version 3.15.2.7 does make SSL work again.
I tested https before landing the PR_Poll stuff, and it did work, so I'm not sure 
what's going on here.

What are the exact steps to show the failure? For example, in a build from 
yesterday, I can still load https://www.verisign.com. (Why is the URL in this bug 
an http url?).
I updated my tree this morning.

After that going to an https:// site didn't work.

After reverting macsockotpt.c as beard suggested (and re-building correctly),
https sites started working again.  When I updated macsockotpt.c to the latest
version on NSPRPUB_CLIENT_BRANCH, https sites didn't work anymore.

That's where I am.
Does 3.15.2.8 work?
we could use some help from QA in helping us track down when this regressed. 
No it doesn't
The 3-21-18 Mac commercial installer build works with SSL sites.
The 3-22-04 build fails.
Correcting the URL to be https.
Well, my build from yesterday did PSM ok, and my build from today doesn't. So 
that suggests that there is something outside of Mac NSPR that also is 
contributing to this problem.
Is there any code in PSM that is Mac-only, and perhaps makes assumptions about 
whether pollable events are available, or anything like that?
There are no mac only places where poll happens, but in psm-glue the following
code does run which polls for "events" from the PSM daemon.

http://lxr.mozilla.org/mozilla/source/security/psm/lib/client/cmtevent.c#253
it's stalling because it fails to acquire the lock at:

 CMT_LOCK(cm_control->mutex);

in CMT_ProcessEvent(), after the first poll returns. I don't know why.
That means that another thread (probably the necko thread trying to read the
response sent back by the PSM daemon) is waiting to read/write on that socket.  

This is actually expected since CMT_ProcessEvent should only ever process UI
events, not read responses to queries.
> 1) necko changed something with the way it expects sockets to be set up

For what it's worth, nsSocketTransport has not been touched since 3/16.
This is all so complex as to be not understandable by any normal human.
So at last count, running Mozilla with PSM opens 7 different sockets:
3 (1 closed) for necko pollable events
2 for SSL I/O
2 for SSM
2 for CMT
and that's before we've started reading data off the net. There are also 4-5 
threads involved here. Things are extremely difficult to debug.

With regard to that assertion above, this is because NSS seems to provide some 
stubs for NSPR routines (in nss/lib/ckfw/nsprstub.c) like PR_Lock, that have 
their own implementations. The assertion is caused by this PR_Lock returning a 
NSSCKFWMutex* cast to a PRLock*, which is then used in a PRCondVar* (a real 
NSPR one this time). We assert later because NSPR is expecting that PRLock to 
have a certain layout, which it doesn't (because it's not really a PRLock). This 
seems very wrong. I'm unwilling to waste more time poking around in this rat's 
nest until that is resolved.
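
For readers following along, a minimal sketch of the kind of mismatch described
above.  None of these types are the real PRLock or NSSCKFWMutex; RealLock,
ForeignMutex, OpaqueLock, the magic tag, and all function names are
hypothetical, and the tag check merely stands in for whatever layout
assumption the NSPR assertion actually makes.

```c
#include <assert.h>
#include <string.h>

#define REAL_LOCK_MAGIC 0x4C4F434Bu   /* hypothetical tag, not an NSPR value */

/* Hypothetical layouts; the real structures are different. */
typedef struct { unsigned int magic; int owner; } RealLock;
typedef struct { int depth; int owner; } ForeignMutex;

/* Callers only ever see an opaque handle type, as with PRLock. */
typedef struct OpaqueLock OpaqueLock;

/* The "real" library hands out a handle to its own structure. */
OpaqueLock *RealNewLock(RealLock *storage)
{
    storage->magic = REAL_LOCK_MAGIC;
    storage->owner = 0;
    return (OpaqueLock *)storage;
}

/* A stub with the same role hands out a differently laid out structure,
 * cast to the same handle type. */
OpaqueLock *StubNewLock(ForeignMutex *storage)
{
    storage->depth = 0;
    storage->owner = 0;
    return (OpaqueLock *)storage;
}

/* Stand-in for the library's later sanity check: returns 1 if the handle
 * starts with the expected tag, 0 otherwise.  The header is read with
 * memcpy so the sketch itself does not depend on type punning. */
int LooksLikeRealLock(const OpaqueLock *lock)
{
    unsigned int magic;

    assert(lock != NULL);
    memcpy(&magic, lock, sizeof magic);
    return magic == REAL_LOCK_MAGIC;
}
```

A handle from RealNewLock passes the check; a handle from StubNewLock has the
same pointer type but fails it, which is the shape of the failure described in
the comment above.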
I've made little further progress on this. I've verified that PR_Poll is working 
as it should, as far as I can tell, but also that reverting to the older PR_Poll 
version fixes the problem.

My best guess right now is that we fail because the OT notifier routine is waking 
up the wrong thread (perhaps the poll thread instead of the read/write thread).
So that's in fact what it was. Fix:

Index: mozilla/nsprpub/pr/src/md/mac/macsockotpt.c
===================================================================
RCS file: /cvsroot/mozilla/nsprpub/pr/src/md/mac/macsockotpt.c,v
retrieving revision 3.15.2.9
diff -b -u -2 -r3.15.2.9 macsockotpt.c
--- macsockotpt.c	2001/03/16 21:25:19	3.15.2.9
+++ macsockotpt.c	2001/03/23 08:26:49
@@ -456,5 +456,6 @@
 	if (pollThread)
 		WakeUpNotifiedThread(pollThread, kOTNoError);
-	else
+	
+	if (thread && (thread != pollThread))
 	WakeUpNotifiedThread(thread, result);
 	
This fix needs r= and sr=, and gordon, beard or sdagley will have to check it 
in.
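
The control-flow change in the patch above can be modeled in isolation.  This
is a sketch only: FakeThread, WakeUp, NotifyOld and NotifyNew are hypothetical
stand-ins, not the real PRThread or the code in macsockotpt.c.  Only the
if/else structure is taken from the diff.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a thread that can be woken by the notifier. */
typedef struct {
    int wakeCount;    /* how many times the thread was woken */
    int lastResult;   /* result code passed on the last wake-up */
} FakeThread;

/* Stand-in for WakeUpNotifiedThread: just records the wake-up. */
static void WakeUp(FakeThread *t, int result)
{
    assert(t != NULL);
    t->wakeCount++;
    t->lastResult = result;
}

/* Before the patch: if a poll thread exists, only it is woken, and the
 * thread blocked in read/write is skipped. */
void NotifyOld(FakeThread *pollThread, FakeThread *thread, int result)
{
    if (pollThread)
        WakeUp(pollThread, 0);
    else if (thread)   /* NULL guard added for this model only */
        WakeUp(thread, result);
}

/* After the patch: the poll thread is woken, and the I/O thread is woken
 * as well when it exists and is a different thread. */
void NotifyNew(FakeThread *pollThread, FakeThread *thread, int result)
{
    if (pollThread)
        WakeUp(pollThread, 0);

    if (thread && (thread != pollThread))
        WakeUp(thread, result);
}
```

With both a poll thread and a separate I/O thread present, NotifyOld leaves the
I/O thread asleep, while NotifyNew wakes both; when the two pointers are the
same thread, NotifyNew wakes it only once.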
ok, this fix looks reasonable, r=pink. 

Who is going to do something about the rat's nest of cast structures that 
sfraser pointed out? Is that addressed in psm2? If not....
I have an sr=gordon, and will make sure that he checks this in.
pinkerton -- PSM2 uses the same NSS code as PSM1.
This fix has been checked into the NSPRPUB_CLIENT_BRANCH and the nspr tip.
Mine
Assignee: javi → sfraser
Fix checked in; thanks, gordon
Status: NEW → RESOLVED
Closed: 19 years ago
Resolution: --- → FIXED
verified with mac commercial build 2001-03-26-12-trunk
Status: RESOLVED → VERIFIED
Keywords: qawanted
Mass changing Security:Crypto to PSM
Component: Security: Crypto → Client Library
Product: Browser → PSM
Version: other → 2.1
Mass changing Security:Crypto to PSM
Product: PSM → Core
Version: psm2.1 → 1.0 Branch
Product: Core → Core Graveyard