Closed Bug 73018 Opened 19 years ago Closed 19 years ago
SSL fails on Mac
I cannot reach a secure site with the 2001032204 commercial trunk build.
Yep, just tested this on the first reported Mac build. It endlessly attempts to load the page, and clicking Stop freezes the app (must force quit to escape). Adding smoketest keyword.
cc-ing necko folks in case this is cache related. John Tracy - could you try removing any shared libraries with "cache" in them from the "components" directory? I think it is called nkcache, but I am not sure.
Is this commercial only? Or does this also happen on Mozilla builds? This may be related to the patch I made yesterday that copies the security-prefs.js file to the default prefs directory. I'm building right now to investigate. Probably another hour or so before my build finishes.
Also, is this with a new profile or an existing one?
New profile or old, Mozilla or commercial trunk build, there is no reaching an ssl site.
It's doing it with both old and new profiles.
If it's doing it with both old and new profiles, then it probably isn't related to my check-in yesterday, as that would only affect new profiles. Still waiting on build. It's gonna be longer than I thought because I forgot to disable pulling the tree when I ran the build script.
Someone from necko is gonna have to help me on this one. Here's what's happening: The Mac tries to create an SSL socket via PSM. The PSM client libraries (CMT_*) call through to PSM to create the SSL socket. With PSM 1.x, this happens via TCP/IP sockets on the Mac. There is a PSM shim layer that wraps calls to send/read data on sockets. The request is sent over to the PSM shared library to create a connection. PSM creates a connection object. It sends back a reply that the connection object has been created, and it waits to receive the nonce authenticating that this SSL request is from a client it has established a connection with. Everything up to this point is normal. The NSPR socket created by the PSM shim layer never wakes up to read the reply, so the PSM SSL thread is blocked waiting on a thread that never gets woken up. The problem could be that:
1) necko changed something with the way it expects sockets to be set up
2) NSPR sockets somehow changed in a way that causes our shim layer to no longer work
3) the PSM shim layer isn't setting the correct bit on the socket for reading, and it worked before by luck.
Anyone else have ideas? Cause I don't know why this is.
http://lxr.mozilla.org/mozilla/source/extensions/psm-glue/src/nsPSMShimLayer.c
Have you tried backing out mozilla/nsprpub/pr/src/md/mac/macsockotpt.c two revisions? It changed a bit recently, ask gordon or sfraser for the background.
I reverted macsockotpt.c to rev 18.104.22.168 and I still get the same behavior. Setting up the control connection and passing the prefs over the control connection works. The difference is that those sockets aren't handed back up to necko for layering. Those sockets exist solely in the CMT layer for PSM setup. It seems that the necko layer is forgetting to set a bit, or sets a bad bit, on the socket returned, such that it never wakes from a poll. This is similar to what we were seeing with the SSL proxy bug before 0.8.
Do we know exactly when this broke?
My PSM 1 build from yesterday loads https pages OK, but I do see some scary assertions:
11453980 PPC 3DC6CF04 _PR_UserRunThread+000C4
11453900 PPC 3CC02488 SSM_FrontEndThread+00308
11453880 PPC 3CC00220 SSMControlConnection_ProcessMessage+00154
11453830 PPC 3CBFE498 SSMControlConnection_ProcessHello+00084
114537B0 PPC 3CBFE36C SSMControlConnection_SetupNSS+00114
11453740 PPC 3CBFDFF0 SSM_InitNSS+00080
114536C0 PPC 3CBFDE70 ssm_OpenSecModDB+00084
11453680 PPC 3CC66670 SECMOD_init+00168
11453630 PPC 3CC6E120 SECMOD_LoadModule+0010C
11453580 PPC 3B95E87C MODULE_NAMEC_Initialize+0001C
11453540 PPC 3B96C754 NSSCKFWC_Initialize+00090
114534F0 PPC 3B9641A4 nssCKFWInstance_Create+00024
11453490 PPC 3B9618A8 NSSArena_Create+00014
11453450 PPC 3B961990 nssArena_Create+000AC
11453410 PPC 3B961630 arena_add_pointer+0001C
114533D0 PPC 3B962AA4 nssPointerTracker_initialize+00024
11453390 PPC 3B962908 call_once+000AC
11453340 PPC 3DC6B464 PR_NotifyAllCondVar+0004C
114532E0 PPC 3DC5C654 PR_Assert+00048
My own assertions in the OT Notifier, which check that if we have a thread its io_pending is true, also fail.
That assertion has been there for a while. I haven't had time to track it down and fix it. It's way in the depths of the PKCS11 module that contains the root certificates.
OK, now I'm really confused. I'm not sure what I did, but it's working for me now. I didn't find any lurkers in nsprpub, netwerk, psm-glue, or security. No clue as to why it started working.
It turns out I hadn't properly re-built NSPR when I updated macsockotpt.c. Reverting to version 22.214.171.124 does make SSL work again.
I tested https before landing the PR_Poll stuff, and it did work, so I'm not sure what's going on here. What are the exact steps to show the failure? For example, in a build from yesterday, I can still load https://www.verisign.com. (Why is the URL in this bug an http url?).
I updated my tree this morning. After that, going to an https:// site didn't work. After reverting macsockotpt.c like beard suggested (and re-building correctly), https sites started working again. When I updated macsockotpt.c to the latest version on NSPRPUB_CLIENT_BRANCH, https sites didn't work anymore. That's where I am.
Does 126.96.36.199 work?
we could use some help from QA in helping us track down when this regressed.
No it doesn't
The 3-21-18 Mac commercial installer build works with SSL sites. The 3-22-04 build fails.
Well, my build from yesterday did PSM ok, and my build from today doesn't. So that suggests that there is something outside of Mac NSPR that also is contributing to this problem.
Is there any code in PSM that is Mac-only, and perhaps makes assumptions about whether pollable events are available, or anything like that?
There are no mac only places where poll happens, but in psm-glue the following code does run which polls for "events" from the PSM daemon. http://lxr.mozilla.org/mozilla/source/security/psm/lib/client/cmtevent.c#253
it's stalling because it fails to acquire the lock at: CMT_LOCK(cm_control->mutex); in CMT_ProcessEvent(), after the first poll returns. I don't know why.
That means that another thread (probably the necko thread trying to read the response sent back by the PSM daemon) is waiting to read/write on that socket. This is actually expected, since CMT_ProcessEvent should only ever process UI events, not read responses to queries.
> 1) necko changed something with the way it expects sockets to be set up For what it's worth, nsSocketTransport has not been touched since 3/16.
This is all so complex as to be not understandable by any normal human. So at last count, running Mozilla with PSM opens 7 different sockets: 3 (1 closed) for necko pollable events, 2 for SSL I/O, 2 for SSM, and 2 for CMT, and that's before we've started reading data off the net. There are also 4-5 threads involved here. Things are extremely difficult to debug.
With regards to that assertion above, this is because NSS seems to provide some stubs for NSPR routines (in nss/lib/ckfw/nsprstub.c), like PR_Lock, that have their own implementations. The assertion is caused by this PR_Lock returning an NSSCKFWMutex* cast to a PRLock*, which is then used in a PRCondVar* (a real NSPR one this time). We assert later because NSPR is expecting that PRLock to have a certain layout, which it doesn't (because it's not really a PRLock). This seems very wrong. I'm unwilling to waste more time poking around in this rat's nest until that is resolved.
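To make the layout problem described above concrete, here is a minimal sketch of why handing one struct to code that expects another trips a sanity check. Everything in it is hypothetical: SketchLock, SketchStubMutex, and the magic value are stand-ins invented for illustration, and nothing here is claimed to match the real PRLock or NSSCKFWMutex definitions.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical magic value; real NSPR locks are not claimed to carry one. */
#define SKETCH_LOCK_MAGIC 0x4C4F434Bu

/* Hypothetical stand-in for the lock layout the condvar code expects. */
typedef struct SketchLock {
    uint32_t magic;  /* sanity marker checked before use */
    uint32_t owner;  /* id of the owning thread, 0 if unowned */
} SketchLock;

/* Hypothetical stand-in for a stub mutex with an unrelated layout. */
typedef struct SketchStubMutex {
    uint32_t refcount;
    uint32_t flags;
} SketchStubMutex;

/* Keep the byte copy below in bounds for either type. */
_Static_assert(sizeof(SketchLock) == sizeof(SketchStubMutex),
               "sketch assumes equally sized structs");

/* Build a lock with the layout the checking code expects. */
SketchLock sketch_lock_create(void) {
    SketchLock lock = { SKETCH_LOCK_MAGIC, 0 };
    return lock;
}

/* Mimics a layout assertion: interpret the bytes as a SketchLock and
 * verify the marker. memcpy is used instead of a pointer cast so the
 * sketch itself stays well defined. Returns 1 if the marker matches. */
int sketch_lock_layout_ok(const void *opaque) {
    SketchLock view;
    memcpy(&view, opaque, sizeof view);
    return view.magic == SKETCH_LOCK_MAGIC;
}
```

Passing a SketchStubMutex to sketch_lock_layout_ok fails the check even though the pointer is valid, which is the same kind of failure as an assertion firing on a lock that is not really a lock.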
I've made little further progress on this. I've verified that PR_Poll is working as it should, as far as I can tell, but also that reverting to the older PR_Poll version fixes the problem. My best guess right now is that we fail because the OT notifier routine is waking up the wrong thread (perhaps the poll thread instead of the read/write thread).
So that's in fact what it was. Fix:
Index: mozilla/nsprpub/pr/src/md/mac/macsockotpt.c
===================================================================
RCS file: /cvsroot/mozilla/nsprpub/pr/src/md/mac/macsockotpt.c,v
retrieving revision 188.8.131.52
diff -b -u -2 -r184.108.40.206 macsockotpt.c
--- macsockotpt.c 2001/03/16 21:25:19 220.127.116.11
+++ macsockotpt.c 2001/03/23 08:26:49
@@ -456,5 +456,6 @@
     if (pollThread)
         WakeUpNotifiedThread(pollThread, kOTNoError);
-    else
+
+    if (thread && (thread != pollThread))
         WakeUpNotifiedThread(thread, result);
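The behavioral difference in that patch can be sketched in isolation. This is only an illustration under stated assumptions: SketchThread, wake, notify_old, and notify_new are hypothetical names, SKETCH_NO_ERROR stands in for kOTNoError, and the real notifier in macsockotpt.c does considerably more than this.

```c
#include <stddef.h>

/* Hypothetical stand-in for kOTNoError in this sketch. */
#define SKETCH_NO_ERROR 0

/* Hypothetical stand-in for a thread; only what the sketch needs. */
typedef struct SketchThread {
    int wakeups;      /* how many times the notifier woke this thread */
    int last_result;  /* result code delivered with the last wake-up */
} SketchThread;

/* Records a wake-up. Ignoring NULL is an assumption of this sketch,
 * not a statement about WakeUpNotifiedThread. */
void wake(SketchThread *t, int result) {
    if (t == NULL)
        return;
    t->wakeups++;
    t->last_result = result;
}

/* Pre-patch shape: if a poll thread exists, the thread waiting on the
 * actual read/write is never woken. */
void notify_old(SketchThread *pollThread, SketchThread *thread, int result) {
    if (pollThread)
        wake(pollThread, SKETCH_NO_ERROR);
    else
        wake(thread, result);
}

/* Post-patch shape: wake the poll thread, then also wake the I/O
 * thread when there is one and it is a different thread. */
void notify_new(SketchThread *pollThread, SketchThread *thread, int result) {
    if (pollThread)
        wake(pollThread, SKETCH_NO_ERROR);

    if (thread && (thread != pollThread))
        wake(thread, result);
}
```

With both a poller and a separate reader present, notify_old wakes only the poller, so the reader stays blocked, while notify_new wakes both. When the two pointers refer to the same thread, notify_new wakes it once.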
This fix needs r= and sr=, and gordon, beard or sdagley will have to check it in.
OK, this fix looks reasonable, r=pink. Who is going to do something about the rat's nest of cast structures that sfraser pointed out? Is that addressed in psm2? If not....
I have an sr=gordon, and will make sure that he checks this in.
pinkerton -- PSM2 uses the same NSS code as PSM1.
This fix has been checked into the NSPRPUB_CLIENT_BRANCH and the nspr tip.
Assignee: javi → sfraser
Fix checked in; thanks gordon
Status: NEW → RESOLVED
Closed: 19 years ago
Resolution: --- → FIXED
verified with mac commercial build 2001-03-26-12-trunk
Status: RESOLVED → VERIFIED
Mass changing Security:Crypto to PSM
Component: Security: Crypto → Client Library
Product: Browser → PSM
Version: other → 2.1