Closed Bug 240956 Opened 21 years ago Closed 20 years ago

OS/2 VACPP and gcc builds of NSS use different PKCS#11 calling conventions

Categories

(NSS :: Libraries, defect)

Version: 3.9.1
Platform: x86 OS/2
Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: julien.pierre, Assigned: wtc)

Details

When NSS was ported from the IBM VisualAge C++ compiler to the gcc compiler, no attention was paid to the calling convention used for loading PKCS#11 modules, and it changed. Unfortunately, the two calling conventions use the same symbol naming but different semantics for passing arguments and returning values. There is no obvious way to find out whether nssckbi.dll was built with one calling convention or the other until it's too late (i.e. one of the C_ functions is called and crashes). The consequence is that if NSS code built with gcc tries to load a PKCS#11 module built with VACPP, or vice versa, a crash results. This affects a lot of OS/2 browser users, because secmod.db contains an absolute reference to nssckbi.dll. The 2 common cases are:
a) users upgrading from a Mozilla VACPP build to a Mozilla gcc build and using their existing browser profile;
b) users switching back and forth between the two types of builds. This is actually not uncommon. For example, the still-maintained Mozilla 1.4 branch is built with VACPP, but since 1.5 (or 1.6, I forget) Mozilla is built with gcc.
The known workaround is to delete secmod.db. I think a necessary part of the fix is to reconcile the calling convention for PKCS#11 in NSS on OS/2 regardless of the compiler, so that long-term, this problem just disappears. Fixing the browser issue that users are seeing in the short term is more complicated. I feel that at least the upgrade case should be fixed. I don't feel that supporting switching back & forth is important, as we don't do that with the other DBs either. Here is a proposed fix for the upgrade case:
#ifdef XP_OS2_EMX
1) In SECMOD_LoadModule, check if the name of the DLL to load is nssckbi.
2) If so, check the DLLs it depends on (this requires some OS/2-specific code).
3) If one of them is the VACPP runtime DLL bundled with the browser, don't try to load it, for fear it would crash due to the calling convention mismatch. Simply return an error. I believe the PSM code will then find that it doesn't have a root cert module loaded, and automatically load the DLL from the location where Mozilla is running.
#endif
Other ideas welcome!
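Julien's proposed check (steps 1 to 3 above) can be sketched as a pure decision function. This is a minimal sketch under stated assumptions, not NSS code: should_refuse_module, path_basename and name_equals_ci are hypothetical helpers; enumerating a DLL's imports is the OS/2-specific part and is assumed to be done by the caller (Mike's later reply notes that letting the loader resolve MOZRMI36 can itself crash, so the list would have to be obtained without loading the DLL); MOZRMI36 is the VACPP runtime DLL named later in this bug, and accepting both "MOZRMI36" and "MOZRMI36.DLL" is an assumption about how import names are reported.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Case-insensitive ASCII comparison (OS/2 file names are case-insensitive). */
static bool name_equals_ci(const char *a, const char *b)
{
    while (*a != '\0' && *b != '\0') {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return false;
        a++;
        b++;
    }
    return *a == '\0' && *b == '\0';
}

/* Return the part of a path after the last '\\', '/' or ':'. */
static const char *path_basename(const char *path)
{
    const char *base = path;
    for (const char *p = path; *p != '\0'; p++) {
        if (*p == '\\' || *p == '/' || *p == ':')
            base = p + 1;
    }
    return base;
}

/* Hypothetical: should a gcc build refuse to load the module at 'path',
 * given the names of the DLLs it imports (collected by the caller)? */
bool should_refuse_module(const char *path,
                          const char *const *imports, size_t n_imports)
{
    assert(path != NULL);

    /* Step 1: only the builtin roots module is checked. */
    if (!name_equals_ci(path_basename(path), "nssckbi.dll"))
        return false;

    /* Steps 2-3: a dependency on the VACPP runtime marks a VACPP build. */
    for (size_t i = 0; i < n_imports; i++) {
        if (name_equals_ci(imports[i], "MOZRMI36") ||
            name_equals_ci(imports[i], "MOZRMI36.DLL"))
            return true;
    }
    return false;
}
```

A caller inside the #ifdef XP_OS2_EMX branch would return an error when this yields true, leaving PSM to load nssckbi.dll from the Mozilla directory as described in step 3.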
Anything we try to do to fix this will make things worse than they are now. Your assertions are incorrect: installing a GCC build over a VACPP build works fine, and the profile migrates fine. The ONLY problem here is sharing profiles between GCC and VACPP builds, which is unsupported, or installing a GCC build alongside a VACPP build and trying to use the same profile. Your suggestion about "detecting MOZRMI36" will still cause VACPP builds to fail accessing security. We can't go back in time and fix this, and any calling convention change we make will break even more builds. GCC works, VACPP is in the past; this decision isn't worth revisiting now.
Status: NEW → RESOLVED
Closed: 21 years ago
Resolution: --- → WONTFIX
Mike, even the upgrade process isn't reliable. I never upgrade Mozilla by overwriting the Mozilla code. Instead, I always install a new version of Mozilla to a new directory and use the profile manager to point to the old profile, which is in yet a third location. When I went from a VACPP to a gcc build this way, I did get the crash. The migration to a new version only works if you overwrite the Mozilla code in place, which is not necessarily what everybody does. So I consider the upgrade case still broken. You are incorrect that we can only make the situation worse. We can actually improve it if my proposed fix for this bug and my proposed fix in http://bugzilla.mozilla.org/show_bug.cgi?id=176501#c21 are both implemented. If you do that, you will only have to incorporate the fixes into the newer gcc build; you will not have to touch any VAC code. At least as far as security is concerned, that new gcc build would be able to share its profile with existing VAC builds. Also, you could upgrade from existing gcc builds to the newest gcc build without problems. The one case that wouldn't work is if you try to share your profile among 3 builds: an existing VAC build, an existing gcc build, and a new gcc build. The first 2 types just don't play together, and that part can't be fixed.
Your idea about figuring out whether the DLL loads MOZRMI36 won't work most of the time. If the user has MOZRMI36 on their LIBPATH, the DLL failure will happen with SYS2070 when PLC4 is loaded. These failures cannot be handled with DosLoadModule; your app will crash. Believe it or not, if you don't have MOZRMI36 in your path, everything works just like you want: loading the 1.4 NSSCKBI.DLL fails gracefully, and we load the one from GCC. Works fine. The issue here is that if MOZRMI36 is available, the system tries to load the NSPR DLLs, all heck breaks loose, and there is no way to prevent it. Period.
This would be an issue if any vendors tried to support OS/2 for their tokens. In that case they would have to ship 2 different PKCS #11 modules, depending on which compiler your application happens to use. If a common binary API is not defined for PKCS #11, it would guarantee that you would never convince hardware vendors to support their tokens on OS/2 (including IBM). nssckbi is not the only issue here. Maybe it's not a concern because no token vendors support or plan on supporting OS/2, and no one cares if they ever do. bob
Can someone tell me what APIs I should be looking at if I wanted to change the calling convention? It looks like NSS uses PR_CALLBACK and these are _Optlink on VACPP. What would I actually need to change?
Bob, in your comment 4, I think you overlooked nssckbi. This is an issue even for OS/2 users who have no hardware tokens but who have two versions of their browser, one built with gcc and the other not. Having said that, as I wrote in bug 176501, the application that calls NSS ultimately controls what filenames get into secmod.db, and it controls where secmod.db itself gets installed. Some applications control this very well; others apparently do not.
I think Bob and I agree that we should resolve the calling convention issue, even if that only means standardizing on the C calling convention currently used in gcc builds of Mozilla/NSS, which would not affect the browser. There are two PKCS#11 modules that come with NSS, nssckbi.dll and softokn3.dll, and this change would affect both. The pathname issue is irrelevant for softokn3.dll since it doesn't get loaded from secmod.db, so changing its calling convention (or, more accurately, the way the calling convention is selected at build time) will not create additional problems. In response to comment #5: the PKCS#11 header definitions are under lib/softoken. But it doesn't look like we currently use any macro similar to PR_CALLBACK for the exportable symbols in the headers :-( There are some traces of CK_EXTERN, which could do the job, but the macro isn't used. This means the issue may exist on other platforms as well when switching compilers for NSS. I think we haven't run into it because most compilers on other platforms don't fool around with changing C calling conventions; mostly they have their own C++ calling conventions. Bob, can you take a look at this? If this is confirmed, we should at least define macros for calling conventions and use them on all the exportable PKCS#11 symbols.
Status: RESOLVED → REOPENED
Resolution: WONTFIX → ---
If you fix this, you will break migration of older GCC profiles...
Mike, if we stick to the same calling convention used now in gcc and just formalize it in the header files with explicit macros, it won't break anything with existing or future gcc builds. It would only affect any new NSS code built with VACPP, which would suddenly switch over to the new calling convention. I understand you won't be building any new VACPP NSS code, so that shouldn't be a problem for you. As Bob said, we need to set the calling convention in stone so that any future smartcard vendors know what to do when building their PKCS#11 libraries. This is the main value of the proposed change. The other value of this change for me is that I could share my security DBs (which are not necessarily browser DBs, but server DBs) between NSS command-line tools built with the 2 different compilers, for example when doing performance tests. Right now, I must always make two different copies of the DBs and erase secmod.db in those tests. I should be able to just point to the same directory on the command line with either version of the tools.
I don't understand why more people don't use extern "C" when creating interfaces like this. It sure would make things nice and compatible :)
Nelson: I wasn't ignoring nssckbi.dll; I just felt Julien had covered it pretty well. Mike, these functions are more than extern "C". They are "C" functions called from "C" code (both NSS and PKCS #11 are C APIs, not C++). If there is a calling convention problem, it's because GCC and VAC have different "C" calling conventions. PKCS #11 has special defines that allow us to put in whatever magic keywords we need to make gcc and VAC++ generate the same calling sequence (if it's possible for them to use the same calling sequence) for the PKCS #11 functions themselves. Those magic defines need to be specified for the OS/2 platform. The PKCS #11 declarations are defined in nss/softoken/pkcs11t.h. This file is a modified version of the file supplied by RSA on the PKCS #11 site. There's a long comment in pkcs11.h describing how these various macros are used and how they should be defined for certain platforms. NSPR makes some basic defines (which for most platforms match the PKCS #11 definitions) that nss/softoken/pkcs11t.h uses:
    #define CK_CALLBACK_FUNCTION(rv,func) rv (PR_CALLBACK * func)
    #define CK_DECLARE_FUNCTION(rv,func) PR_EXTERN(rv) func
    #define CK_DECLARE_FUNCTION_POINTER(rv,func) rv (PR_CALLBACK * func)
These are meant to be compiler-invariant values for defining C-function APIs across shared libraries. If we decide that these values are not correct for OS/2, we could put an #ifdef OS2 around them and supply the correct macros. bob
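Because the macros above expand through PR_CALLBACK, the convention they select is whatever the compiler's NSPR configuration says (_Optlink for VACPP, per the next comment). A minimal sketch of the alternative being discussed, pinning one convention per platform instead, might look like this. CK_CALLCONV, Example_Check and p_ExampleCheck are hypothetical; the macro names only follow the PKCS #11 style and are not the NSS definitions; choosing _System, and where the keyword sits inside the macros, are assumptions that would have to be checked against what both VACPP and EMX gcc accept. On other compilers CK_CALLCONV expands to nothing, so the block compiles as plain C.

```c
/* Hypothetical: one explicit convention for every exported PKCS #11 symbol
 * on OS/2, independent of each compiler's default.  _System is an assumed
 * choice, not a decision recorded in this bug. */
#if defined(XP_OS2)
#define CK_CALLCONV _System
#else
#define CK_CALLCONV              /* platform default C convention */
#endif

/* Modelled on the PKCS #11 macro names; not the actual NSS definitions. */
#define CK_DECLARE_FUNCTION(rv, func)         extern rv CK_CALLCONV func
#define CK_DEFINE_FUNCTION(rv, func)          rv CK_CALLCONV func
#define CK_DECLARE_FUNCTION_POINTER(rv, func) rv (CK_CALLCONV * func)
#define CK_CALLBACK_FUNCTION(rv, func)        rv (CK_CALLCONV * func)

typedef unsigned long CK_RV;             /* simplified stand-in */
#define CKR_OK            0UL
#define CKR_ARGUMENTS_BAD 7UL

#ifdef __cplusplus
extern "C" {  /* only turns off C++ name mangling; argument passing is */
#endif        /* governed by CK_CALLCONV, not by this block            */

/* Header side: a hypothetical exported function. */
CK_DECLARE_FUNCTION(CK_RV, Example_Check)(unsigned long flags);

#ifdef __cplusplus
}
#endif

/* Module side: the definition carries the same convention. */
CK_DEFINE_FUNCTION(CK_RV, Example_Check)(unsigned long flags)
{
    return flags == 0 ? CKR_OK : CKR_ARGUMENTS_BAD;
}

/* Application side: a function pointer declared with the same convention. */
CK_DECLARE_FUNCTION_POINTER(CK_RV, p_ExampleCheck)(unsigned long) = Example_Check;
```

The design point is that the convention is tied to the platform (one #if per OS) rather than to the compiler, which is what would let a VACPP-built and a gcc-built module be called the same way.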
There's no such thing as different "C" calling conventions. If you are using PR_CALLBACK, it is definitely different between the two versions. PR_CALLBACK does NOT equal the "C" calling convention; PR_CALLBACK can be set to anything you want. For VACPP it is _Optlink; for GCC I believe it is nothing. To make things 100% compatible, the "C" calling convention is the right thing to use here, NOT PR_CALLBACK. I repeat, PR_CALLBACK is NOT compiler-invariant; it was never intended to be, as far as I know.
Mike, _Optlink, __cdecl, _Pascal, __Far32 and _System are all different C calling conventions, i.e. different calling conventions that can be used to invoke C functions. For PKCS#11, we need to choose a C calling convention that is the same regardless of the compiler. extern "C" is only a means for C++ to not mangle the function name; as far as argument passing goes, it uses the compiler's default C calling convention. The problem comes from the fact that VAC and gcc have different opinions of what the default C calling convention is. For VACPP, I believe there is actually a compiler flag to make the default C calling convention _System instead of _Optlink. I don't know what calling convention is used by gcc.
One way to look at this problem is that on OS/2 there are apparently two builds of Mozilla, built with different compilers that use different calling conventions by default, and that cannot share each other's libraries (unless something is done to force them to use a common calling convention). But perhaps the only unusual aspect of that is that there are two commonly used builds of Mozilla on the same box. Other platforms potentially have similar problems. For example, Solaris supports 3 different ABIs on the same CPU. NSS's makefiles support separate 32-bit and 64-bit builds (the 32-bit build covers both 32-bit ABIs). If someone built a 64-bit build of the Mozilla browser, including a 64-bit NSS, and then tried to use it with a profile that had previously been set up by a 32-bit Mozilla, I think they'd have the same problem that OS/2 is having.
Nelson, I would consider Solaris 32-bit and 64-bit to be two different platforms. I'm not sure the problem is the same as here. Can a 32-bit Solaris .so even be loaded by a 64-bit Solaris process (or vice versa)? I'd like to test this; how do I force the NSS Solaris build to be 64-bit? I'm not familiar with the two 32-bit ABIs on Solaris. Can you expand a little bit? Thanks.
Julien, imagine that a user builds Mozilla from source using the V9 ABI (LP64 model), including the whole browser and NSS. The user has previously run the normal 32-bit version of Mozilla on his UltraSPARC box, and now has a profile with cert and key DBs and a secmod DB that names the 32-bit nssckbi.so in some directory. Then the user runs his 64-bit build; it reads secmod.db and tries to load the 32-bit nssckbi.so. It will either fail in the load, or fail later when the first attempt is made to call a function in that .so. Perhaps another solution is to have separate secmod.db files for different ABIs (or calling conventions). To build NSS in 64-bit mode, set USE_64=1 in the environment (setenv USE_64 1 in csh, or export USE_64=1 in sh), then build normally.
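Nelson's idea of separate secmod.db files per ABI (or calling convention) amounts to deriving the module database name from properties of the build. A minimal sketch, assuming a made-up naming scheme: abi_secmod_name and EXAMPLE_CONV_TAG are hypothetical, pointer width stands in for the 32/64-bit ABI split, and __IBMC__/__IBMCPP__ (predefined by IBM's compilers) stand in for "VACPP build". The resulting name could be passed as the secmodName argument of NSS_Initialize, which lets the application choose the module database file name.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical convention tag: IBM compilers (VACPP on OS/2, whose default
 * C linkage is _Optlink) get their own tag; everything else is "default". */
#if defined(__IBMC__) || defined(__IBMCPP__)
#define EXAMPLE_CONV_TAG "vac"
#else
#define EXAMPLE_CONV_TAG "default"
#endif

/* Write a name such as "secmod-64-default.db" into buf.
 * Returns 0 on success, -1 if buf is too small (output is truncated). */
int abi_secmod_name(char *buf, size_t len)
{
    /* Pointer width separates 32-bit and 64-bit builds of the same code. */
    unsigned bits = (unsigned)(sizeof(void *) * 8);
    int n = snprintf(buf, len, "secmod-%u-%s.db", bits, EXAMPLE_CONV_TAG);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

With such a scheme, a 32-bit and a 64-bit build sharing one profile directory would each read their own module list, so neither would be handed a path to the other's nssckbi.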
Nelson, I did in fact propose, as one of the solutions, renaming secmod.db to solve this problem, which can be done by simply changing one argument to NSS_Init. I understand your case of loading 32-bit and 64-bit versions on Solaris. However, I think there is a big difference between not loading the .so and loading the .so and then crashing. If the .so load fails, PSM will actually look for the module somewhere else, and will likely find the correct version. On the other hand, if the load succeeds and there is a crash, it would be the same as the OS/2 situation. I don't think that is the case on Solaris, but I'm trying to confirm it. I just tried to build with USE_64=1 on my Blade 2000, but it failed in NSPR in pr/src/md/unix/os_SunOS_sparcv9.s. Might have to file another bug for that.
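Julien's distinction between "fails to load" and "loads, then crashes" can be made concrete with a loader that walks a list of candidate locations. A minimal sketch, not PSM's actual search logic: load_first_available, module_loader_fn and stub_loader are hypothetical, and a real caller would wrap dlopen() on POSIX or DosLoadModule() on OS/2 instead of the stub.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical loader: returns an opaque handle, or NULL if the load fails. */
typedef void *(*module_loader_fn)(const char *path);

/* Try each candidate in order and return the first handle that loads,
 * storing its index in *chosen (if chosen is non-NULL).  Returns NULL if
 * every candidate fails.  A module that loads but uses the wrong calling
 * convention is NOT caught here; it only fails when a function is called. */
void *load_first_available(module_loader_fn load,
                           const char *const *candidates, size_t n,
                           size_t *chosen)
{
    for (size_t i = 0; i < n; i++) {
        void *handle = load(candidates[i]);
        if (handle != NULL) {
            if (chosen != NULL)
                *chosen = i;
            return handle;
        }
    }
    return NULL;
}

/* Stand-in loader for illustration only: pretends that only paths
 * containing "/gcc/" load successfully. */
static int stub_handle;
void *stub_loader(const char *path)
{
    return strstr(path, "/gcc/") != NULL ? (void *)&stub_handle : NULL;
}
```

The fallback only helps when the mismatch surfaces at load time, as in the 32-bit versus 64-bit case; the OS/2 problem is that a VACPP-built nssckbi.dll loads successfully into a gcc build, so a search like this would pick it and the crash would come later.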
The crash should not be a normal situation. Most people would not have MOZRMI36 in their LIBPATH for the GCC build. If there is no MOZRMI36, we won't crash.
This problem is much worse than we think. It also appears to happen when we switch LIBC versions (from LIBC04 to LIBC05) See: http://bugzilla.mozilla.org/show_bug.cgi?id=241722
Mike, am I correct that when you switch libc versions, you have to recompile Mozilla with an updated GCC that puts in references to libc05.dll instead of libc04.dll? Is it possible that gcc for OS/2 itself switched the default calling convention it uses for libc between two different compiler updates? That would be a really good case for fixing this problem and selecting the OS/2 PKCS#11 calling convention once and for all. And even if that hasn't actually happened yet and the problem is something else, you want to prevent it from popping up again and set the OS/2 PKCS#11 convention in stone now. Please see Bob's message in comment #11 for more details on how to do that. Basically, you need to settle on a calling convention, find the right keywords for it that work with both gcc and VACPP, and update the macros. As long as the C calling convention is the same between the PKCS#11 application (here, the Mozilla browser and NSS) and the PKCS#11 library (nssckbi.dll), it should not matter which runtime the PKCS#11 library is built against, as the principle of PKCS#11 is that you can load libraries from many sources, including third-party vendors. So you should not be affected by any C runtime DLL naming changes.
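Julien's point that only the calling convention (not the C runtime) must match follows from how a PKCS #11 module is used: the application resolves one entry point, C_GetFunctionList, and then makes every call through function pointers in the returned table. A minimal single-file sketch of that pattern, with a drastically simplified table: EXAMPLE_FUNCTION_LIST, Example_GetFunctionList and app_use_module are hypothetical stand-ins for CK_FUNCTION_LIST and C_GetFunctionList, not NSS code, and the real CK_FUNCTION_LIST starts with a version field followed by one pointer per PKCS #11 function.

```c
#include <stddef.h>

typedef unsigned long CK_RV;         /* simplified stand-ins for pkcs11t.h */
#define CKR_OK            0UL
#define CKR_ARGUMENTS_BAD 7UL

/* Drastically simplified function table. */
typedef struct {
    CK_RV (*Initialize)(void *init_args);
    CK_RV (*Finalize)(void *reserved);
} EXAMPLE_FUNCTION_LIST;

/* ---- module side (e.g. what nssckbi would provide) ---- */
static CK_RV mod_Initialize(void *init_args)
{
    (void)init_args;                 /* no options supported in this sketch */
    return CKR_OK;
}

static CK_RV mod_Finalize(void *reserved)
{
    /* PKCS #11 requires the reserved argument to be NULL. */
    return reserved == NULL ? CKR_OK : CKR_ARGUMENTS_BAD;
}

static EXAMPLE_FUNCTION_LIST module_list = { mod_Initialize, mod_Finalize };

/* The one symbol the application looks up by name. */
CK_RV Example_GetFunctionList(EXAMPLE_FUNCTION_LIST **list_out)
{
    if (list_out == NULL)
        return CKR_ARGUMENTS_BAD;
    *list_out = &module_list;
    return CKR_OK;
}

/* ---- application side (e.g. NSS) ---- */
/* Every call crosses the module boundary through a function pointer, so
 * both sides must agree on the calling convention (and struct layout);
 * which C runtime the module links against never enters the picture. */
CK_RV app_use_module(CK_RV (*get_list)(EXAMPLE_FUNCTION_LIST **))
{
    EXAMPLE_FUNCTION_LIST *fl = NULL;
    CK_RV rv = get_list(&fl);
    if (rv != CKR_OK)
        return rv;
    rv = fl->Initialize(NULL);
    if (rv != CKR_OK)
        return rv;
    return fl->Finalize(NULL);
}
```

If the module had been compiled with a different default convention than the application, each fl->Initialize(...) call would pass its argument in the wrong place; that is the crash described at the top of this bug.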
We're not going to fix this. VACPP build is dead. Everyone's using GCC and we're not going to change things now.
Status: REOPENED → RESOLVED
Closed: 21 years ago → 20 years ago
Resolution: --- → WONTFIX