Closed Bug 1202413 Opened 4 years ago Closed 2 years ago

because PK11_CreateGenericObject() leaks by design, add non-leaking PK11_CreateManagedGenericObject() API

Categories

(NSS :: Libraries, defect, P2)

3.20
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: john.haxby, Assigned: rrelyea)

References

Details

Attachments

(1 file, 2 obsolete files)

User Agent: Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:40.0) Gecko/20100101 Firefox/40.0
Build ID: 20150827075926

Steps to reproduce:

Using curl compiled with nss (on Fedora), the process grows a little each time a certificate is read.  Briefly, curl uses PK11_CreateGenericObject() to load a certificate which does something like this:

PK11_CreateGenericObject()
  PK11_CreateNewObject
     PK11_GETTAB(slot)->C_CreateObject
  PK11_ReferenceSlot

The object is later deleted with PK11_DestroyGenericObject which does something like this:

PK11_DestroyGenericObject
  PK11_UnlinkGenericObject
  if (object->slot)
    PK11_FreeSlot
  PORT_Free

PK11_UnlinkGenericObject() removes the object from its linked list; PK11_FreeSlot(), the mirror of PK11_ReferenceSlot(), drops the reference count on the (containing) slot; and PORT_Free() frees the object itself.  What's missing is a call to PK11_DestroyObject() and thence PK11_GETTAB(slot)->C_DestroyObject()


Actual results:

The object created by PK11_CreateGenericObject() was just orphaned; it was not cleaned up by PK11_DestroyGenericObject().


Expected results:

PK11_DestroyGenericObject() should have destroyed the created object.
This is related to and possibly the same as bug 585066.

I have a reproducer and possible patch attached to https://bugzilla.redhat.com/show_bug.cgi?id=1057388 (comment 20 onwards).

Basically, if PK11_DestroyGenericObject() is amended to include a call to PK11_DestroyObject() like this, the leak goes away:

SECStatus 
PK11_DestroyGenericObject(PK11GenericObject *object)
{
    if (object == NULL) {
	return SECSuccess;
    }

    PK11_UnlinkGenericObject(object);
    if (object->slot) {
	PK11_DestroyObject(object->slot, object->objectID);
	PK11_FreeSlot(object->slot);
    }
    PORT_Free(object);
    return SECSuccess;
}

In bug 585066 it was pointed out that SECKEY_DestroyPublicKey() makes a similar call, but only when PK11_IsPermObject() is false.  I tried that, but PK11_IsPermObject() returned true for the objects created.

I did wonder whether something else should be called to destroy the object, but I couldn't find any suitable callers of PK11_DestroyObject().  I was worried that PK11_DestroyGenericObject() is not supposed to actually destroy the object referred to by obj->objectID, but the more I think about it, the less likely that seems.


The leak isn't particularly large, but when a program (rsyslog) uses curl tens of thousands of times, the leak becomes much more painful.

Please let me know if you need any more information.
'The object is later deleted with PK11_DestroyGenericObject' - what object? Where is it released - NSS or Curl?

In NSS's usage, PK11_DestroyGenericObject is always assumed to behave as it's implemented - leaving the underlying token object in place, but removing it from the linked list of objects. This can be seen in code like http://mxr.mozilla.org/nss/source/lib/pk11wrap/pk11obj.c#1412 , which is absolutely not trying to 'delete' (from the token/slot) the |firstObj| parameter - it's just trying to remove it from the linked list if the allocator fails.

The only other place this is called is in the rsapoptst.c code, and there, yes, the code is possibly bugged.

But I'm not sure how you end up with an object from a certificate that's supposed to be destroyed.
Flags: needinfo?(john.haxby)
This could be an artifact of the way nss-pem (which is used in Fedora) is working, but I don't really see how.

Let me try to explain.

In curl's lib/vtls/nss.c, nss_create_object() goes something like this:

nss_create_object(struct ssl_connect_data *ssl,
                                  CK_OBJECT_CLASS obj_class,
                                  const char *filename, bool cacert)
{
   slot = PK11_FindSlotByName(aprintf("PEM Token #%d", cacert ? 0 : 1));
   PK11_SETATTRS(attrs, ..., CKA_CLASS, &obj_class, ...);
   PK11_SETATTRS(attrs, ..., CKA_TOKEN, &cktrue, ...);
   PK11_SETATTRS(attrs, ..., CKA_LABEL, filename, ...);
   PK11_SETATTRS(attrs, ..., CKA_TRUST, cacert? &cktrue: &ckfalse);
   obj = PK11_CreateGenericObject(slot, attrs, ..., PR_FALSE);
   PK11_FreeSlot(slot);
   Curl_llist_insert_next(ssl->obj_list, ssl->obj_list->tail, obj);
   return CURLE_OK;
}

(I've elided detail in the interests of readability.)

The object is destroyed with

static void nss_destroy_object(void *user, void *ptr)
{
  PK11GenericObject *obj = (PK11GenericObject *)ptr;
  (void) user;
  PK11_DestroyGenericObject(obj);
}

So, PK11_DestroyGenericObject() simply unlinks obj from its list, drops the slot reference, and frees the PK11GenericObject (three pointers and an integer).

Chasing down through PK11_CreateGenericObject() to PK11_CreateNewObject() the work done to create, in this case, a trusted certificate from a file (cacert is true) is done by

  PK11_GETTAB(slot)->C_CreateObject()

which, for curl on Fedora, then goes through NSSCKFWC_CreateObject(), nssCKFWSession_CreateObject() and finally pem_createObject().

When curl has done its thing and is closing the session, it calls (eventually) Curl_nss_close(), which in turn calls the destructor (nss_destroy_object) for everything on the linked list.

From what you're saying, PK11_DestroyGenericObject() shouldn't destroy the underlying object, but the underlying object is nonetheless still there.  It has been orphaned from its slot and there is no way to ever access it again.

I did wonder if the SSL_invalidate_session() that is called when a client certificate is used by curl would clear things up, but that actually made it even worse: both the cacert and the client cert objects are orphaned and the calling process grows even more each time curl_easy_perform() fetches the URL.

I'm perfectly willing to believe that either curl or nss-pem is doing something wrong, but I'm at a complete loss to know what.  I can't see a way of removing the underlying certificate objects without PK11_DestroyGenericObject() calling PK11_DestroyObject() (obviously I can't call PK11_DestroyObject() from the curl code because struct PK11GenericObjectStr is not defined outside the NSS library).

Does that make sense?
Flags: needinfo?(john.haxby)
Thanks for clarifying. So it's definitely in libcurl code that the actual call to PK11_DestroyGenericObject exists.

PK11_DestroyGenericObject not calling C_DestroyObject is again working as intended. The object created via C_CreateObject can be reached again within the slot via C_FindObjects, and can then be deleted (via C_DestroyObject). Having the object persist is 'intentional' (in as much as PK11_DestroyGenericObject is used within NSS), although I can see that from libcurl's POV it's not.

From an API stability point of view, I don't know if NSS 'supports' how libcurl is doing things, but Bob would be able to comment more from the libpem POV as that's something Red Hat's developed, AIUI.

From my own POV, libcurl's nss_create_object() seems weird and unnecessary - if it's really just about loading certs, there are far 'simpler' ways to do it that won't incur leaks or a dependency on libpem - but I think we should figure out whether NSS wants to expose a method to fully nuke a PKCS#11 object (there are implications for the various in-memory caches NSS maintains in letting a caller reach directly into the slot)
Flags: needinfo?(rrelyea)
(In reply to Ryan Sleevi from comment #4)
> From an API stability point of view, I don't know if NSS 'supports' how
> libcurl is doing things, but Bob would be able to comment more from the
> libpem POV as that's something Red Hat's developed, AIUI.

See bug #402712 where libpem was proposed to be merged into NSS.  Unfortunately, the code has not yet reached that level of quality.  It currently lives here:

https://git.fedorahosted.org/cgit/nss-pem.git

> From my own POV, libcurl's nss_create_object() seems weird and unnecessary -
> if it's really just about loading certs, there's far 'simpler' ways to do it
> that won't incur leaks or a dependency on libpem - but I think we should
> figure out if NSS wants to expose a method to fully nuke a PKCS#11 object
> (there's implications for the various in-memory caches NSS maintains to let
> a caller directly reach into the slot)

I am afraid it will not be that easy.  Unlike Firefox, libcurl is a _library_ that needs to maintain its own API/ABI, can be loaded together with other libraries that also use NSS into a single memory image, etc.  libcurl was originally built on top of OpenSSL, so its API is to some extent inspired by OpenSSL.  Nowadays, libcurl supports OpenSSL, GnuTLS, NSS, yassl, axTLS, PolarSSL, WinSSL, Secure Transport, and GSKit.  Still it provides a unified API/ABI, no matter which TLS library was chosen at build time.

Nevertheless, if you are proposing a better way of loading certs/keys in libcurl running on top of NSS (without changing libcurl's API/ABI), I would be more than happy to decommission libpem!
(In reply to Kamil Dudka from comment #5)
> Nevertheless, if you are proposing a better way of loading certs/keys in
> libcurl running on top of NSS (without changing libcurl's API/ABI), I would
> be more than happy to decommission libpem!

Right, the question being posed was whether or not libcurl's needs can be met without the dependency on libpem. The API in particular would be CERT_DecodeCertPackage, although from a quick scan its non-DER PEM handling only copes with BEGIN CERTIFICATE blocks, not private keys. It's unclear from libcurl's usage whether client private keys are also meant to be imported.

But I think it's fair to keep this bug focused on whether or not NSS should surface a function to delete objects. It would presumably need to be a new function - I think changing PK11_DestroyGenericObject to actually nuke from a token, when it historically has not, would be an API-breaking change.
(In reply to Ryan Sleevi from comment #6)
> Right, the question being posed was whether or not libcurl's needs can be
> met w/o the dependency on libpem. The API in particular would be
> CERT_DecodeCertPackage, although from quick scan of it's "not-der" PEM
> handling is that it only handles BEGIN CERTIFICATE and not private keys.
> It's unclear from libcurl's usage whether or not client private keys are
> also meant to be imported.

libcurl supports loading CA certs, client certs and _keys_ from files per connection, which is heavily based on libpem.  Already loaded objects are re-used to reduce memory consumption.
(In reply to Kamil Dudka from comment #7)
> libcurl supports loading CA certs, client certs and _keys_ from files per
> connection, which is heavily based on libpem.  Already loaded objects are
> re-used to reduce memory consumption.

That is certainly the intent, but in practice that's not what happens.  Each connection creates a new CA cert (for example) and the old one is left to languish.   This is easy to demonstrate: a test program using libcurl to fetch the same https URL 10,000 times grows from ~9MB RSS to ~20MB RSS.   I got involved with this because rsyslog (I think) eventually got so large the OOM killer had to step in.

I think, though, that libcurl still needs to be able to unload, for example, a self-signed cert or a client cert which at present does not seem to be possible.
(In reply to John Haxby from comment #8)
> That is certainly the intent, but in practice that's not what happens.  Each
> connection creates a new CA cert (for example) and the old one is left to
> languish.   This is easy to demonstrate: a test program using libcurl to
> fetch the same https URL 10,000 times grows from ~9MB RSS to ~20MB rss.   I
> got involved with this because rsyslog (I think) eventually got so large the
> oom killer had to step in.

Without the patch for re-using already loaded objects, libcurl would never manage 10000 subsequent TLS handshakes.  It ran out of memory after ~200 iterations of loading the default CA bundle prior to implementing the feature.  Note that each CA root loaded by libpem occupies twice as much memory as necessary, but that is another bug that has yet to be fixed.

> I think, though, that libcurl still needs to be able to unload, for example,
> a self-signed cert or a client cert which at present does not seem to be
> possible.

Do you mean a self-signed _server_ cert?  Those are not loaded by libpem nor PK11_CreateGenericObject().
In the test program attached to https://bugzilla.redhat.com/show_bug.cgi?id=1057388 I load the self-signed cert the test server is using with curl_easy_setopt(c, CURLOPT_CAINFO, cert).  That's only a single cert, of course.  Using the default bundle, the process gets much bigger much more quickly.  I wonder if I'm missing a patch?   Although the 10,000 requests process "only" got up to 182MB.
I meant this patch (applied in 2009):

https://git.fedorahosted.org/cgit/nss-pem.git/commit/?id=271ab3cf

You can comment out the following "if" block to see what actually happens if objects are not re-used:

https://git.fedorahosted.org/cgit/nss-pem.git/tree/nss/lib/ckfw/pem/pinst.c?id=015ae754#n484
(In reply to John Haxby from comment #8)
> I think, though, that libcurl still needs to be able to unload, for example,
> a self-signed cert or a client cert which at present does not seem to be
> possible.

The way softoken addresses this is via the slot/tokens. If you want to load a user database (or, I would presume in libpem's case, a file), it's treated as a slot insertion event via the SECMOD_OpenUserDB/SECMOD_CloseUserDB functions. Of course, if the goal is no leaks, that doesn't help, since Softoken itself will grow unbounded ( Bug 781950 )

To circle back on this issue, we have:
1) An open question as to how libpem should best handle load/unload events
2) An open question as to whether NSS should expose an event to destructively modify a slot/token.

If we can solve 1 (in libpem), we may be able to avoid 2, but I'd like Bob's thoughts on exposing 2. As I mentioned in comment #6, because PK11_DestroyGenericObject is an exported function that doesn't actually modify the token/slot in question, I think it'd be an API-breaking change with potentially dire consequences to make it start doing that. So I'd lean on a new API function.
> PK11_DestroyGenericObject not calling C_DestroyObject is again working as intended. The object created via 
> C_CreateObject can be reached again within the slot via C_FindObjects, and can then be deleted (via 
> C_DeleteObject). Having the object persist is 'intentional' (in as much as PK11_DestroyGenericObject is used
>  within NSS), although I can see from libcurl's POV it's not.

So the answer is 'it depends'. If you create a session object with C_CreateObject, you should destroy it once you have finished with it. This is where PK11_CreateGenericObject()/PK11_DestroyGenericObject() fall down. If you create a token object with C_CreateObject, you shouldn't destroy it, because the point of a token object is that it stays around even after the system reboots. If you merely find an object, you should not destroy it.

In the original redhat bug, I attached some info on how this should be fixed, I'll go find it and transfer it to here.

libcurl is using the interface reasonably, and in its case the object should be destroyed, assuming libcurl only frees the object when it is clearly through with it.
Flags: needinfo?(rrelyea)
Here's my comments in the redhat bug: https://bugzilla.redhat.com/show_bug.cgi?id=1057388

You have something here, John, but it's a bit more complex. We need to keep an 'owner' tag on the generic object. If we create it and it's a session object, then PK11_DestroyGenericObject should destroy it. If we find it with PK11_FindGenericObjects, or we create it as a token object (which means we mean it to be persistent), then PK11_DestroyGenericObject should not destroy it.

We need to add a new field to struct PK11GenericObjectStr: a PRBool called owner.

owner needs to be set in PK11_CreateGenericObject as:
   PR_TRUE if token == PR_FALSE
   PR_FALSE if token == PR_TRUE
and in PK11_FindGenericObjects it needs to be set to PR_FALSE 

Then we call PK11_DestroyObject() in PK11_DestroyGenericObject if owner == PR_TRUE.

--------------------------------------------
For completeness, there is one other place that creates PK11GenericObjects: pk11_newMergeLog() in pk11merge.c. Those should have owner set to PR_FALSE.
--------------------------------------------

bob
Apologies that I don't have time for a more thorough analysis, but Chrome's experience with private keys is that such a solution would be quite impractical for those consuming NSS.

The reason is that session objects are reported via FindGeneric. So if you import a session object and later acquire a (new) handle to it, one of the object handles you have has the owner bit set (the original), while the new one doesn't. Because these are distinct object handles, if the first one goes out of scope, it will destroy the object, immediately invalidating the second.

This was a repeated source of issues for Chrome (and one of the reasons why I was thrilled to move a lot of core crypto to BoringSSL), in particular the interaction with PK11_FindPrivateKeyForCert and friends. The private key does this ( http://mxr.mozilla.org/nss/source/lib/cryptohi/keythi.h#232 ) and the symmetric key has the concept of owner, and both are similarly error prone.

My conclusion was that we were "using NSS wrong", but to be honest, it seems like it should have been a perfectly valid use case.
If you create a session object, then it's perfectly acceptable for you to free it when you are through. If someone else found a handle to it with C_FindObjects, the object will no longer be functional for them. The object has a handle, so there isn't a crash (this is why PKCS #11 objects have handles, not pointers). Heck, the object will go away if the session it was created on is ever freed. All of this is expected behavior:

 Session objects appear and disappear all the time, while still being findable by C_FindObjects. The CreateGenericObject interface isn't unique in that: if you use PK11_KeyGen or PK11_GenerateKeyPair, those have the same semantics. It's just a bug in CreateGenericObject(). You may notice we have an owner field in the symkey for this reason. The private and public keys either check the pkcs11IsTemp field (private keys) or the actual bit in the PKCS #11 object (public keys).

bob
Bob: My comment wasn't that it wasn't correct behaviour from a PKCS#11 layer - there were no crashes in that. Rather, the crashes came from Chrome acquiring a key, expecting that key to be there, and then seeing it mysteriously disappear when some other C++ class went out of scope. In short, the NSS global scope problem :)

If I were to offer any advice, it would be that "owner" is often quite insufficient, and what you may want is "refCount"; while that is more complex, it behaves more intuitively as one would expect from a (soft) token.
You can't ref-count because you don't have enough information. Besides, PKCS #11 semantics are that handles can go away at any time.

Your application should be tolerant of a 'found' key going away. Session objects go away even if you don't do anything special (PKCS #11 makes these objects go away when the session closes, and if the object is in a removable token it can go away asynchronously). If you actually crash because an object went away, you need to fix your application.

Owner is the best we have. Basically, if you created the temp object, you should free it. If you create a temp object that others may depend on, then you should arrange for it to stay around for those others. If you don't, NSS and PKCS #11 won't crash, but the relevant key may disappear on you.

In any case, the base issue here is that PK11_CreateGenericObject has a bug. It's not functioning like the other NSS PKCS #11 object structures, so we just need to fix it.

bob
(In reply to Robert Relyea from comment #18)
> You can't ref-count because you don't have enough information. Besides PKCS
> #11 semantics are handles can go away at any time.

Again, I think that's irrelevant for an application only using softoken, in which the only keys that will exist are the ones it creates. That's all - it's a completely non-intuitive flow where you can 'import' a temporary key only to find out you're not the owner.

> Your application should be tolerant of a 'found' key going away. Session
> objects go away even if you don't do anything special (PKCS #11 makes these
> objects go away when the session closes, or if the object is in a removable
> token it can go away asynchronously). If you actually crash because an
> object went away you need to fix your application.

It's mostly irrelevant now, because this behaviour - that in a pure softoken world a returned object can go away - was extremely undesirable and unintuitive to everyone who worked with NSS+Chrome. I can understand and appreciate the semantics when smart cards are involved, but when a pure softoken database is involved, it doesn't match the expectations of a reasonable programmer. Given that softoken only ever has one session, there's no way for two "properly isolated" (from a software engineering and architecture point of view) pieces of code to reliably use NSS without cooperating through some shared, global state (which was ultimately the resolution). The idea of needing to use Singletons in our code to synchronize something as basic as 'in-memory keys' was and is highly undesirable.

In any event, it sounds like you've got a path for resolving the PK11_CreateGenericObject, and while I've tried to impart our experience - and ultimately one of the reasons contributing to NSS being rejected for Chrome - it certainly shouldn't block resolution of this. I just wanted to make sure to elaborate the concerns as to why the pattern of "isOwner" encourages practices that are problematic for downstream consumers. Perhaps it's an intrinsic necessity of dealing with PKCS#11, which we know is the only way to get at softoken, but the behaviour defies both intuition and experience with other cryptographic libraries.
> That's all - it's a completely non-intuitive flow where you can 'import' a
> temporary key only to find out you're not the owner.

Now I'm confused. If you import the key, you are the owner.

I think we are done with this debate here. If you want to talk philosophy with me somewhere other than a bug, that's fine, but just because you don't like the fact that keys can go away (even in softoken) doesn't change the fact that that is the semantic, and the solution to *THIS BUG* is to handle that semantic.
Dear members,

I am writing as part of the FTS service (File Transfer Service) used in the context of WLCG to transfer files between tiers on the Grid. We are currently using the NSS library and we are suffering memory leaks because it seems that NSS is not properly freeing the certificates. I can see that this bug is related to that, and I was wondering whether you plan to fix this issue, and when.

Thanks,
Maria
Hi Maria, do you have a stack trace or anything to confirm it's really the same problem? I don't know this issue, but let's needinfo Bob to see if there is a fix in sight.
Flags: needinfo?(rrelyea)
Flags: needinfo?(mariayh1)
Hi Franziskus,

in fact we were seeing memory leaks, continuously, with this trace from the call to PK11_CreateGenericObject:

==20823== 32 bytes in 1 blocks are possibly lost in loss record 675 of 4,212
==20823==    at 0x4A057BB: calloc (vg_replace_malloc.c:593)
==20823==    by 0xCB3E6A9: ??? (in /usr/lib64/libnsspem.so)
==20823==    by 0xCB2E6DF: ??? (in /usr/lib64/libnsspem.so)
==20823==    by 0xCB33B98: ??? (in /usr/lib64/libnsspem.so)
==20823==    by 0xCB3A7F1: ??? (in /usr/lib64/libnsspem.so)
==20823==    by 0x329FC487C9: ??? (in /usr/lib64/libnss3.so)
==20823==    by 0x329FC48A62: PK11_CreateGenericObject (in /usr/lib64/libnss3.so)
==20823==    by 0x32A14401DF: nss_create_object (nss.c:358)
==20823==    by 0x32A1440366: nss_load_cert (nss.c:390)
==20823==    by 0x32A1441F4A: nss_connect_common (nss.c:1129)
==20823==    by 0x32A14391E4: Curl_ssl_connect (sslgen.c:185)
==20823==    by 0x32A14179FA: Curl_http_connect (http.c:1801)
==20823==
==20823== 32 bytes in 1 blocks are possibly lost in loss record 676 of 4,212
==20823==    at 0x4A057BB: calloc (vg_replace_malloc.c:593)
==20823==    by 0xCB3E6A9: ??? (in /usr/lib64/libnsspem.so)
==20823==    by 0xCB2E6DF: ??? (in /usr/lib64/libnsspem.so)
==20823==    by 0xCB33B98: ??? (in /usr/lib64/libnsspem.so)
==20823==    by 0xCB3A7F1: ??? (in /usr/lib64/libnsspem.so)
==20823==    by 0x329FC487C9: ??? (in /usr/lib64/libnss3.so)
==20823==    by 0x329FC48A62: PK11_CreateGenericObject (in /usr/lib64/libnss3.so)
==20823==    by 0x32A14401DF: nss_create_object (nss.c:358)
==20823==    by 0x32A1440366: nss_load_cert (nss.c:390)
==20823==    by 0x32A14420A0: nss_connect_common (nss.c:556)
==20823==    by 0x32A14391E4: Curl_ssl_connect (sslgen.c:185)
==20823==    by 0x32A14179FA: Curl_http_connect (http.c:1801)
Flags: needinfo?(mariayh1)
We were also seeing memory leaks classified as definitely lost when calling NSS_InitContext, SECMOD_LoadModule and PK11_FindCertFromDERCertItem.

==20823== 72 bytes in 1 blocks are definitely lost in loss record 2,229 of 4,212
==20823==    at 0x4A057BB: calloc (vg_replace_malloc.c:593)
==20823==    by 0x329FC781E6: ??? (in /usr/lib64/libnss3.so)
==20823==    by 0x329FC78258: ??? (in /usr/lib64/libnss3.so)
==20823==    by 0x329FC78128: ??? (in /usr/lib64/libnss3.so)
==20823==    by 0x329FC6CCAC: ??? (in /usr/lib64/libnss3.so)
==20823==    by 0x329FC7268C: ??? (in /usr/lib64/libnss3.so)
==20823==    by 0x329FC1AD85: ??? (in /usr/lib64/libnss3.so)
==20823==    by 0x329FC1BAB0: NSS_InitContext (in /usr/lib64/libnss3.so)
==20823==    by 0x32A144127A: nss_init (nss.c:913)
==20823==    by 0x32A1441469: nss_connect_common (nss.c:1280)
==20823==    by 0x32A14391E4: Curl_ssl_connect (sslgen.c:185)
==20823==    by 0x32A14179FA: Curl_http_connect (http.c:1801)
==20823==
==20823== 72 bytes in 1 blocks are definitely lost in loss record 2,230 of 4,212
==20823==    at 0x4A057BB: calloc (vg_replace_malloc.c:593)
==20823==    by 0xCB3EAA6: ??? (in /usr/lib64/libnsspem.so)
==20823==    by 0xCB3EB18: ??? (in /usr/lib64/libnsspem.so)
==20823==    by 0xCB3E9E8: ??? (in /usr/lib64/libnsspem.so)
==20823==    by 0xCB31D1E: ??? (in /usr/lib64/libnsspem.so)
==20823==    by 0xCB3C3FF: ??? (in /usr/lib64/libnsspem.so)
==20823==    by 0x329FC38342: ??? (in /usr/lib64/libnss3.so)
==20823==    by 0x329FC38D4C: ??? (in /usr/lib64/libnss3.so)
==20823==    by 0x329FC4C7E9: SECMOD_LoadModule (in /usr/lib64/libnss3.so)
==20823==    by 0x329FC4C9A7: SECMOD_LoadUserModule (in /usr/lib64/libnss3.so)
==20823==    by 0x32A14418DD: nss_connect_common (nss.c:1294)
==20823==    by 0x32A14391E4: Curl_ssl_connect (sslgen.c:185)
==20823==
==20823== 78 (48 direct, 30 indirect) bytes in 1 blocks are definitely lost in loss record 2,246 of 4,212
==20823==    at 0x4A057BB: calloc (vg_replace_malloc.c:593)
==20823==    by 0x329FC77DE9: ??? (in /usr/lib64/libnss3.so)
==20823==    by 0x329FC75FC3: ??? (in /usr/lib64/libnss3.so)
==20823==    by 0x329FC72C6D: ??? (in /usr/lib64/libnss3.so)
==20823==    by 0x329FC737C1: ??? (in /usr/lib64/libnss3.so)
==20823==    by 0x329FC738AE: ??? (in /usr/lib64/libnss3.so)
==20823==    by 0x329FC73DD1: ??? (in /usr/lib64/libnss3.so)
==20823==    by 0x329FC35763: PK11_FindCertFromDERCertItem (in /usr/lib64/libnss3.so)
==20823==    by 0x32A14406F2: SelectClientCert (nss.c:791)
==20823==    by 0x32A0418A1A: ??? (in /usr/lib64/libssl3.so)
==20823==    by 0x32A0419781: ??? (in /usr/lib64/libssl3.so)
==20823==    by 0x32A041A66E: ??? (in /usr/lib64/libssl3.so)
Any news regarding this? Thanks! 
Maria
Hi Maria, still waiting on input from Bob. Unfortunately I can't really reproduce this without major efforts. But maybe you could tell us in the meantime what OS and version of curl and nss you're using? Sorry again for the delays.
I've asked Elio to look at the RH bug and push any patch he produces here.

bob
Flags: needinfo?(rrelyea)
(In reply to Robert Relyea from comment #27)
> I've asked Elio to look at the RH bug and push any patch he produces here.

Bob, did anything come of that?
Flags: needinfo?(rrelyea)
Sorry, it's now on my to-do list for next week.
Flags: needinfo?(rrelyea)
Hi, the NSS versions are nss-util-3.19.1-5.el6_7.x86_64 (on SCL6) and nss-util-3.19.1-9.el7_2.x86_64 (on CentOS 7).

Cheers,
maria
Do you need assistance reproducing this (listed as unconfirmed on this bug report)?  I am able to reproduce on CentOS 7 (and others).

From CentOS 7 with a recent update: rpm -q nss libcurl
nss-3.21.0-9.el7_2.x86_64
libcurl-7.29.0-25.el7.centos.x86_64

I have simple code and a memory-profile report if it will help confirm the bug.
Attached patch Generic_object_leak_fix.patch (obsolete) — Splinter Review
Elio is off doing other things, here's the patch to fix the leak.
Assignee: nobody → rrelyea
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Attachment #8802305 - Flags: review?(wtc)
Attachment #8802305 - Flags: review?(martin.thomson)
Comment on attachment 8802305 [details] [diff] [review]
Generic_object_leak_fix.patch

Review of attachment 8802305 [details] [diff] [review]:
-----------------------------------------------------------------

I can't pretend to understand the code here, but this patch correctly implements Bob's suggested fix.
Attachment #8802305 - Flags: review?(martin.thomson) → review+
Target Milestone: --- → 3.28
Patch checked in.
remote:   https://hg.mozilla.org/projects/nss/rev/d405c74dfab8841f197844183efa93a3664902e6
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Comment on attachment 8802305 [details] [diff] [review]
Generic_object_leak_fix.patch

clearing old review flag.
Attachment #8802305 - Flags: review?(wtc)
This doesn't seem to be right. It breaks a lot in Firefox [1]. The problem seems to be this destroy [2].
I backed out the changeset [3]. I think this needs a test before landing again to make sure it doesn't break anything.

[1] https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=a1b7ab570480370c82b0df7ab60c7e02acf43e08
[2] http://searchfox.org/nss/rev/a008c0d05b71118758104ea532e5e398e70a1f95/lib/pk11wrap/pk11obj.c#1582
[3] https://hg.mozilla.org/projects/nss/rev/eb8c96563cd70aa69d48b01516085f7d2fb46178
Status: RESOLVED → REOPENED
Flags: needinfo?(rrelyea)
Resolution: FIXED → ---
Breakage happens.  As for the patch, has anybody confirmed that it actually solves the problem?  Is there any known way to observe a (positive) effect of the patch on existing applications?
Tests for regressions can probably be written along the lines of the failing tests in Firefox, e.g. implementing something similar to the WebCrypto tests that generate ECDH keys and try to derive keys from them, or using SGN_NewContext. I'm not sure, though, how to implement a test that checks that the actual problem is fixed.
I'll take a look at the failures. Hopefully they are bugs in the patch and not something like Firefox creating temp objects and then just freeing them. If it's the latter, I'll have to add a 'real' object-free function and mark the old one as deprecated.
Flags: needinfo?(rrelyea)
Bob, do you have an update on this?
Flags: needinfo?(rrelyea)
No, we should just back it out for now. If we are running into Firefox failures, my guess is we have an existing bug in Firefox, probably working around this bug. We may have to fix this by creating a separate free API that has the correct semantics.
Flags: needinfo?(rrelyea)
Priority: -- → P2
Franziskus,

Is there a way I can test a fix to make sure it doesn't crash the mozilla tests?

bob
Flags: needinfo?(franziskuskiefer)
> Is there a way I can test a fix to make sure it doesn't crash the mozilla tests?

The best way is to get mozilla-central [1], update NSS manually (copy your working copy of NSS with your patch over to security/nss in mozilla-central) and push it to try [2][3].

[1] https://hg.mozilla.org/mozilla-central/
[2] https://hg.mozilla.org/try/
[3] https://treeherder.mozilla.org/#/jobs?repo=try
Flags: needinfo?(franziskuskiefer)
Thanks Franziskus.
This patch improves the old patch in two ways:

1) The objects are only fully managed if they are created with PK11_CreateManagedGenericObject(). If they are created with the old interface, we will continue to leak the key material. This semantic is important because the webcrypto interface in the browser depends on it. It uses PK11_CreateGenericObject() to create a new key, then frees all our references to that key. It later looks the key up with PK11_FindObject and creates an NSS SECKEYPrivateKey object from it. (If it can't find that object, webcrypto crashes because it passes a NULL key to NSS -- that bug should be fixed in webcrypto, but even if it were fixed, any tests would fail because we would have destroyed the key.)

2) I've added tests for both PK11_CreateGenericObject and PK11_CreateManagedGenericObject (I've piggybacked on the RSApopulate tests, which seem to have been turned off for some reason).


Webcrypto no longer crashes in try:

 https://treeherder.mozilla.org/#/jobs?repo=try&revision=5c6e1bb0682d47c24fd9c6c7cde3d7e11be9e271
Attachment #8802305 - Attachment is obsolete: true
Attachment #8922120 - Flags: review?(martin.thomson)
Attachment #8922120 - Flags: review?(kaie)
Comment on attachment 8922120 [details] [diff] [review]
Fix leak in generic object. Only fix the leak if the new PK11_CreateManagedGenericObject() is called.

Review of attachment 8922120 [details] [diff] [review]:
-----------------------------------------------------------------

You need to run clang-format for this to land.  (Or apply the patch that try gives you on the clang-format target you get with -t clang-format.)

I also see a lot of errors in your try run:
./cipher.sh: line 119: /home/worker/dist/Debug/bin/rsapoptst: No such file or directory

That suggests to me that you need to update .gyp files somewhere so that rsapoptst is created.

The MAC builds seem to be busted, but I can't work out how.

::: cmd/rsapoptst/rsapoptst.c
@@ +158,1 @@
>                  for (j = 0; j < template[i].ulValueLen; j++) {

I think that you have a hexdump function that you could extract.  This loop and the loop in dumpItem both use it.

@@ +467,5 @@
> +            }
> +            fprintf(stderr, "id = { ");
> +            dumpItem(stderr,&id);
> +            fprintf(stderr, "};");
> +            if (id.data[1] == 4) {

I think that you should use the leak_expected variable here, and compare id_data against cka_id.  This is brittle as it is.

@@ +483,5 @@
> +            }
> +            SECITEM_FreeItem(&id,PR_FALSE);
> +        }
> +    }
> +    if ((mask & 0x10) && !leak_found) {

I think that you should create a leak_expected variable.

::: lib/pk11wrap/pk11obj.c
@@ +1679,5 @@
> +}
> +
> +/* Use this interface. It will automatically destroy any temporary objects when
> + * the PK11GenericObject is freed. Permanant objects still need to be destroyed 
> + * by hand with PK11_DestroyTokenObject.

typo Permanant

What does it mean for an object to be permanent?  Is this just the case that if the object is created in a token, it becomes bound to that token and is not managed by NSS?
Attachment #8922120 - Flags: review?(martin.thomson)
Thanks for the review.


> You need to run clang-format for this to land.  (Or apply the patch that try gives you on the clang-format target you get with -t clang-format.)

Is that a mozilla-central try or an nss-try option? When I pushed an nss-try, it didn't actually give me any tests like mozilla-central did. I can't log in to the try server for the same reasons I can't log into phabricator ;(.


> I also see a lot of errors in your try run:
> ./cipher.sh: line 119: /home/worker/dist/Debug/bin/rsapoptst: No such file or directory
> 
> That suggests to me that you need to update .gyp files somewhere so that rsapoptst is created.

<digressive rant>I'm really starting to hate the dual build system, and particularly .gyp builds. I think we need to get together at some point and figure out how we can fix those issues. Having two places to change to add files to builds is not sustainable.</digressive rant>

> The MAC builds seem to be busted, but I can't work out how.

I'll try my mac system directly. I didn't see any MAC bustage in the mozilla-central builds....

> I think that you have a hexdump function that you could extract.  This loop and the loop in dumpItem both use it.

I'll take a look.

> I think that you should use the leak_expected variable here, and compare id_data against cka_id.  This is brittle as it is.
> I think that you should create a leak_expected variable.

Sounds reasonable.

> What does it mean for an object to be permanent?  Is this just the case that if the object is created in a token, 
> it becomes bound to that token and is not managed by NSS?

Yes. token == permanent, session == ephemeral. When NSS closes, session objects disappear; permanent objects can be found in the future. In softoken it means the object gets stored in the database. The normal NSS semantic is that when you create token objects, you need to explicitly delete them if you don't want them to persist (since you presumably wanted a persistent object; otherwise you wouldn't have made it a token object).

bob
(In reply to Robert Relyea from comment #47)
> > You need to run clang-format for this to land.  (Or apply the patch that try gives you on the clang-format target you get with -t clang-format.)
> 
> Is that a mozilla central try or an nss-try option. When I pushed an
> nss-try, it didn't actually give me any tests like mozilla-central did. I
> can't login to the try server for the same reasons I can't log into
> phabricator;(.


I'm willing to help you with that. You can leave it as the last step, and I'll convert your patch to a clang-formatted patch.

I'm not sure what you mean with "login to the try server". You don't have to login. When you do a try build, you simply have to ensure that clang-format isn't disabled. In your yesterday's nss-try build, you had used syntax "-t none", which disabled the clang-format tool.

You should probably run with "-t all". When you do, the list of executed tests will include the required fixes to the style/whitespace formatting.

For example, look at this recent try build:
  https://treeherder.mozilla.org/#/jobs?repo=nss-try&revision=91db5a882d94fc16a3b0ca034abc537599b3503f

Click on the orange clang-format. A popup with a black bar will appear. The bar contains multiple icons. It starts from the left with a "log" icon. The second icon is for "raw log". Click that second icon, which opens a log in a new window. Scroll down until you see a line starting with "diff ...". Everything from there until the end of the page (minus trailing taskcluster lines) is a patch, which you should copy/paste to a local file and then apply to your local tree. This will change your local tree to have all the formatting changes that are mandatory for the commit to be accepted.
(In reply to Robert Relyea from comment #47)
> > The MAC builds seem to be busted, but I can't work out how.
> 
> I'll try my mac system directly. I didn't see any MAC bustage in the
> mozilla-central builds....

Bob, this is the link to your try run from yesterday:
  https://treeherder.mozilla.org/#/jobs?repo=nss-try&revision=91db5a882d94fc16a3b0ca034abc537599b3503f

The failed lines are:
  mac opt - B (in red)
  mac debug - B (in red)

It means the (B)uild has failed (red).

If you click the B, a bottom popup will appear. It sometimes can display the interesting failure lines immediately. In your case, that doesn't work. Click one of the "log" or "raw log" icons to get the full log of what happened.
> In your yesterday's nss-try build, you had used syntax "-t none", which disabled the clang-format tool.

Thanks Kai. The information I was looking for was 'nss-try' versus mozilla-central try, though the rest of the information is useful as well (I didn't even know to ask for it).

> Bob, this is the link to your try run from yesterday:
>   https://treeherder.mozilla.org/#/jobs?repo=nss-try&revision=91db5a882d94fc16a3b0ca034abc537599b3503f

Thanks Kai, so when I submitted the try build I only got the patch link:

https://hg.mozilla.org/projects/nss-try/rev/64bfdef1dc85

not the treeherder link. How do I get from the patch link to the treeherder link? That's why I thought I needed to log in.
Yes, for some reason, nss-try only reports the commits, and not the treeherder link.  I normally just go to https://treeherder.mozilla.org/#/jobs?repo=nss-try and scroll down.  There aren't that many builds there usually.

The details for nss-try can be found here: https://wiki.mozilla.org/NSS:TryServer
Attachment #8922120 - Attachment is obsolete: true
Attachment #8922120 - Flags: review?(kaie)
Comment on attachment 8923032 [details] [diff] [review]
Fix leak in generic object. Only fix the leak if the new PK11_CreateManagedGenericObject() is called.

Code and API design look good to me. I see you have addressed Martin's requests. Your try-run is green.

r=kaie


There's just one minor nit, the comment here:
+    int expect_leak = 0; /* did we find the expected leak */

The comment should obviously be something like
  "are we expecting a leak?"

Because we want to declare API freeze today, I'll check it in for you, with the comment changed.
Attachment #8923032 - Flags: review?(kaie) → review+
Summary: objects created by PK11_CreateGenericObject() aren't destroyed by PK11_DestroyGenericObject() → because PK11_CreateGenericObject() leaks by design, add non-leaking PK11_CreateManagedGenericObject() API
Updated the bug subject to better describe the situation and the chosen solution.
https://hg.mozilla.org/projects/nss/rev/32c9bbad2655
Status: REOPENED → RESOLVED
Closed: 3 years ago → 2 years ago
Resolution: --- → FIXED
Target Milestone: 3.28 → 3.34
> Because we want to declare API freeze today, I'll check it in for you, with the comment changed.

Thanks Kai.
Attachment #8923032 - Flags: review?(martin.thomson)
You need to log in before you can comment on or make changes to this bug.