Open
Bug 936407
Opened 12 years ago
Updated 3 years ago
Firefox freezes when removing Gemalto smartcard
Categories
(NSS :: Libraries, defect, P3)
Tracking
(Not tracked)
NEW
People
(Reporter: roberto.viola, Unassigned, NeedInfo)
References
Details
(Keywords: regression)
Attachments
(5 files)
User Agent: Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36
Steps to reproduce:
1. Insert a Gemalto smartcard
2. Remove the smartcard
Note: this bug is present in versions 10 through 25. Versions 3 through 9 do not have this issue.
I would like to try nightly builds between version 9 and version 10 to identify which build introduced this bug, but I can't find a nightly build repository for these versions.
Actual results:
Firefox freezes until I reinsert the smartcard. After that it works flawlessly.
Expected results:
Firefox shouldn't freeze at all
| Reporter | ||
Updated•12 years ago
|
OS: Windows 8 → Windows 7
Hardware: x86_64 → x86
As nobody has a Gemalto smartcard to test, you need to install the tool mozregression (see http://harthur.github.com/mozregression/ for details). You can use it with a testing profile (read the FAQ).
Nightly builds are available on the FTP: http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/ (by year since 2004, use the repository mozilla-central)
FF10 nightlies started in Sept. 2011 (mozregression --good=2011-09-01).
Flags: needinfo?(roberto.viola)
| Reporter | ||
Comment 2•12 years ago
|
||
Thank you Loic for your quick answer.
I've tried and this is the result:
2011-10-30 is the last good nightly
2011-10-31 is the first bad nightly
If you need more information, I'm here :)
Have a good weekend
Flags: needinfo?(roberto.viola)
| Reporter | ||
Comment 4•12 years ago
|
||
I'm AFK.
You could run mozregression --good=2011-10-30 --bad=2011-10-31 to see the changelog URL.
I don't know this Gemalto smartcard, but I guess you need to load a Gemalto encryption module in Firefox, as described here:
http://docs.epoline.org/doc/epoline/firefox/GemSAFE_Firefox_EN.pdf
If that's the case, the suspected bug could be:
Brian Smith — Bug 675702 - Remove XPCOM Proxies: Do PSM client certificate processing on the main thread, r=kaie
What do you think? Can you explain how this product works with Firefox (encryption, settings etc)?
Flags: needinfo?(roberto.viola)
| Reporter | ||
Comment 7•12 years ago
|
||
Yes, I load the Gemalto library as described in the document (versions < 2011-10-30 work with that library).
The Gemalto smartcard contains some certs that are sent to a web server in order to authorize the user with a unique ID.
I think it could be related to your link. How could I run the current version (25) without that patch?
Flags: needinfo?(roberto.viola)
CC'ing bsmith who is aware of this kind of stuff.
Blocks: 675702
Status: UNCONFIRMED → NEW
Component: Untriaged → Security: PSM
Ever confirmed: true
Flags: needinfo?(brian)
Keywords: regression
Product: Firefox → Core
Summary: Freeze removing smartcard → Firefox freezes when removing Gemalto smartcard
| Reporter | ||
Comment 10•12 years ago
|
||
Brian, would it be useful if I recompiled version 10 and/or version 25 without your patch, to check whether Gemalto cards work fine without it?
Flags: needinfo?(brian)
Comment 11•12 years ago
|
||
With Firefox 25, please follow the instructions at https://developer.mozilla.org/en-US/docs/How_to_Report_a_Hung_Firefox to crash Firefox while it is hung and paste the crash report ID in this bug.
Recompiling old versions is probably not useful at this point.
Flags: needinfo?(brian) → needinfo?(roberto.viola)
| Reporter | ||
Comment 12•12 years ago
|
||
Flags: needinfo?(roberto.viola) → needinfo?(brian)
Comment 13•12 years ago
|
||
Socket transport thread:
nsSocketOutputStream::Write(char const *,unsigned int,unsigned int *)
nsSSLIOLayerWrite
...
nsNSS_SSLGetClientAuthData(void *,PRFileDesc *,CERTDistNamesStr *,CERTCertificateStr * *,SECKEYPrivateKeyStr * *)
mozilla::psm::SyncRunnableBase::DispatchToMainThreadAndWait()
Main thread:
ClientAuthDataRunnable::RunOnTargetThread()
nsNSSCertificateDB::FindCertByDBKey(char const *,nsISupports *,nsIX509Cert * *)
CERT_FindCertByIssuerAndSN
PK11_FindCertByIssuerAndSN
STAN_GetCERTCertificateOrRelease
stan_GetCERTCertificate
nssPKIObject_Unlock
PR_ExitMonitor <-- this is hanging?
Roberto, when Firefox hangs like this is it using CPU or not?
Flags: needinfo?(brian) → needinfo?(roberto.viola)
| Reporter | ||
Comment 14•12 years ago
|
||
Yes, it takes a whole core (25% cpu on a quadcore).
Flags: needinfo?(roberto.viola)
| Reporter | ||
Comment 15•12 years ago
|
||
Benjamin, do you think that the stack trace is related to Bug 675702?
Flags: needinfo?(brian)
Comment 16•12 years ago
|
||
It's in similar code, so it could be. Given that this is spinning and not hanging, we'll probably need a profile in order to get more data.
| Reporter | ||
Comment 17•12 years ago
|
||
A profile? So you need a smartcard to try? If you want, I could give you access to my system via TeamViewer.
| Reporter | ||
Comment 18•12 years ago
|
||
Benjamin, I've tried removing patch 79658 from rev 158125.
It compiles fine, but the error remains.
| Reporter | ||
Comment 19•12 years ago
|
||
Update: I'm trying to solve it myself. I'm debugging Firefox with VS2012.
When it freezes, it is stuck in this while loop in pk11cert.c:
    do {
        /* free the old cert on retry. Associated slot was not present */
        if (rvCert) {
            CERT_DestroyCertificate(rvCert);
            rvCert = NULL;
        }
        cert = NSSTrustDomain_FindCertificateByIssuerAndSerialNumber(
            STAN_GetDefaultTrustDomain(),
            &issuer,
            &serial);
        if (!cert) {
            break;
        }
        rvCert = STAN_GetCERTCertificateOrRelease(cert);
        if (rvCert == NULL) {
            break;
        }
        /* Check to see if the cert's token is still there */
    } while (!PK11_IsPresent(rvCert->slot));
rvCert is valid and PK11_IsPresent always returns false, so it stays in this loop forever!
In your opinion, what should the behavior be when the smart key is unplugged?
Flags: needinfo?(brian) → needinfo?(benjamin)
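The loop quoted in comment 19 can only end if the lookup result changes after a failed presence check. A minimal model of that dependency is sketched below. Everything in it is a hypothetical stand-in (`model_cert`, `find_cert`, `slot_present` and `retry_lookup` are not NSS functions), and the iteration cap exists only so that the sketch terminates; the real loop has no such cap.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration only; these are not NSS types. */
typedef struct {
    int slot_id;
} model_cert;

/* One cached certificate that still points at the removed card's slot. */
static model_cert stale_cert = { 1 };

/* Models a cache that is never updated: it always returns the same entry. */
static model_cert *find_cert(void)
{
    return &stale_cert;
}

/* Models the presence check: slot 1 is the card that was pulled out. */
static int slot_present(int slot_id)
{
    return slot_id != 1;
}

/* Same shape as the quoted do/while, plus a cap so the model stops.
 * Returns the number of iterations performed. */
static unsigned retry_lookup(unsigned max_iterations)
{
    unsigned iterations = 0;
    model_cert *cert;

    assert(max_iterations > 0);
    do {
        cert = find_cert();
        iterations++;
        if (cert == NULL) {
            break;
        }
    } while (!slot_present(cert->slot_id) && iterations < max_iterations);

    return iterations;
}
```

In this model the loop always runs until the cap, which matches the symptom reported above: a valid cert, a presence check that keeps failing, and one CPU core kept busy.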
Comment 20•12 years ago
|
||
That is an NSS question which I can't answer.
Flags: needinfo?(benjamin) → needinfo?(brian)
Updated•12 years ago
|
Assignee: nobody → nobody
Component: Security: PSM → Libraries
Flags: needinfo?(brian)
Product: Core → NSS
Version: 10 Branch → trunk
Comment 21•12 years ago
|
||
Bob, see comment 19. Based on that comment, this seems to be an NSS bug. However, I know very little about how the smart card insertion/removal detection is supposed to work in NSS. Can you help verify whether the code mentioned in comment 19 is correct and/or give us some pointers for things we should be looking for? Thanks.
Flags: needinfo?(rrelyea)
Comment 22•12 years ago
|
||
No, the loop is correct. If PK11_IsPresent returns false for the returned slot, then the cert should be updated so that either we no longer find the cert in NSSTrustDomain_FindCertificateByIssuerAndSerialNumber, or the slot returned for that certificate is a new slot. This loop should only ever execute twice (well, unless you have multiple tokens with the same cert that all get removed at one time):
PK11_IsPresent -> pk11_IsPresentCertLoad -> nssToken_IsPresent -> nssSlot_IsTokenPresent -> nssToken_NotifyCertsNotVisible -> nssTrustDomain_RemoveTokenCertsFromCache ->
either: nssTrustDomain_RemoveCertFromCacheLOCKED() (which should cause NSSTrustDomain_FindCertificateByIssuerAndSerialNumber to fail to find the cert),
or: STAN_ForceCERTCertificateUpdate -> stan_GetCERTCertificate -> fill_CERTCertificateFields (which should fill in a different slot than the
Places where this chain can go wrong: 1) The token field in the PK11SlotInfo structure (the slot passed to PK11_IsPresent) has been corrupted (read: set to NULL) before the certs have been removed. 2) The token structure (the one pointed to by the PK11SlotInfo) had its name field corrupted before the certs have been removed. 3) The certs weren't properly registered in the cache.
If you are looping multiple times, something has definitely messed with the NSS internal state.
bob
Flags: needinfo?(rrelyea)
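The expected behaviour described in comment 22 (the failed presence check either evicts the cert from the cache or moves it to another slot, so the loop runs at most twice) can be modelled as follows. This is only a sketch under stated assumptions: `model_cert`, `find_cert`, `slot_present`, `lookup_iterations` and the slot constants are hypothetical names, not NSS API, and the side effect is folded into the presence check for brevity.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model; names are illustrative, not NSS API. */
enum { CARD_SLOT = 1, SOFT_SLOT = 2 };

typedef struct {
    int slot_id;       /* slot the cert currently reports */
    int in_cache;      /* 1 while the lookup can still find it */
    int has_soft_copy; /* 1 if another token also holds the cert */
} model_cert;

/* Lookup succeeds only while the cert is still cached. */
static model_cert *find_cert(model_cert *c)
{
    return c->in_cache ? c : NULL;
}

/* Presence check with the side effect described above: when the card is
 * gone, the cert is either moved to another slot or evicted. */
static int slot_present(model_cert *c)
{
    if (c->slot_id != CARD_SLOT) {
        return 1;
    }
    if (c->has_soft_copy) {
        c->slot_id = SOFT_SLOT; /* cert now reports a different slot */
    } else {
        c->in_cache = 0;        /* cert can no longer be found */
    }
    return 0;
}

/* Retry loop in the same shape as the one in comment 19.
 * Stores the final lookup result in *out and returns the iteration count. */
static unsigned lookup_iterations(model_cert *c, model_cert **out)
{
    unsigned n = 0;
    model_cert *found;

    assert(c != NULL && out != NULL);
    do {
        found = find_cert(c);
        n++;
        if (found == NULL) {
            break;
        }
    } while (!slot_present(found));

    *out = found;
    return n;
}
```

With either side effect in place the model needs exactly two passes: the first one notices the missing card, the second one sees the updated state.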
Comment 23•12 years ago
|
||
Hmm, I'm running ESR 17 with NSS 3.15.1 with no problems (insertion/removal seems to work just fine). What version of NSS does 'version 10' mean? Is it Firefox 10, or some build number on the current trunk?
bob
| Reporter | ||
Comment 24•12 years ago
|
||
Thank you Robert.
I will run more tests next week (I'm out of office right now).
I will let you know if I find something useful.
| Reporter | ||
Comment 25•12 years ago
|
||
Hi Robert, I'm finally in front of the problem.
I'll try to explain the behaviour:
- The loop truly never ends (it does not run only twice).
- When I detach the key, SmartCardMonitoringThread::Execute runs PK11_IsPresent, which correctly calls nssTrustDomain_RemoveTokenCertsFromCache.
- After that, we enter PK11_FindCertByIssuerAndSN: rvCert is always valid and PK11_IsPresent always returns false without calling nssTrustDomain_RemoveTokenCertsFromCache, because the name buffer was blanked by the SmartCardMonitoringThread.
If you want, we could use TeamViewer (and maybe Skype) to debug the issue together.
Flags: needinfo?(rrelyea)
Comment 26•12 years ago
|
||
Which path does nssTrustDomain_RemoveTokenCertsFromCache take? Does it call nssTrustDomain_RemoveCertFromCacheLOCKED() or does it call STAN_ForceCERTCertificateUpdate()?
Also, do you have a copy of the cert in your NSS database (possibly without the key)?
Flags: needinfo?(rrelyea)
| Reporter | ||
Comment 27•12 years ago
|
||
| Reporter | ||
Comment 28•12 years ago
|
||
| Reporter | ||
Comment 29•12 years ago
|
||
nssTrustDomain_RemoveTokenCertsFromCache calls STAN_ForceCERTCertificateUpdate.
I have attached the 2 certs available on the Gemalto card to this bug.
Flags: needinfo?(rrelyea)
Comment 30•12 years ago
|
||
Hi Robert, I think you misunderstood my second question. In addition to the cert being stored on your Gemalto card, is the cert also stored in the NSS database (can you find the certificate under either 'My certs' or 'Peer certs' in the certificate manager even after your Gemalto card has been removed)?
The call to STAN_ForceCERTCertificateUpdate should have changed the slot value from the gem slot to some other slot. This happens in fill_CERTCertificateFields.
Hmm, one thing that looks like it could happen: if instance == NULL, then we may miss updating the cert. I wonder if the cert has somehow reverted to the temp cache because someone has it open. If that's the case, I think we can fix the issue in fill_CERTCertificateFields. Can you check whether either context != NULL or instance == NULL?
bob
Flags: needinfo?(rrelyea)
| Reporter | ||
Comment 31•12 years ago
|
||
I guess you're talking about certutil? If so, I can't run it from my Mozilla build directory. This is the output:
C:\mozilla-source\mozilla-central\obj-i686-pc-mingw32\dist\bin>certutil.exe -U
certutil.exe: function failed: SEC_ERROR_LEGACY_DATABASE: The certificate/key database is in an old, unsupported format.
I checked inside fill_CERTCertificateFields, which is called when I remove the key.
This is the stack trace:
> nss3.dll!fill_CERTCertificateFields(NSSCertificateStr * c=0x09c2cf50, CERTCertificateStr * cc=0x19218768, int forced=1) Line 843 C
nss3.dll!stan_GetCERTCertificate(NSSCertificateStr * c=0x09c2cf50, int forceUpdate=1) Line 890 C
nss3.dll!STAN_ForceCERTCertificateUpdate(NSSCertificateStr * c=0x09c2cf50) Line 914 C
nss3.dll!nssTrustDomain_RemoveTokenCertsFromCache(NSSTrustDomainStr * td=0x192185e8, NSSTokenStr * token=0x1922b360) Line 448 C
nss3.dll!nssToken_NotifyCertsNotVisible(NSSTokenStr * tok=0x1922b360) Line 303 C
nss3.dll!nssSlot_IsTokenPresent(NSSSlotStr * slot=0x1922bde8) Line 172 C
nss3.dll!nssToken_IsPresent(NSSTokenStr * token=0x1922b360) Line 1441 C
nss3.dll!pk11_IsPresentCertLoad(PK11SlotInfoStr * slot=0x19227358, int loadCerts=1) Line 1435 C
nss3.dll!PK11_IsPresent(PK11SlotInfoStr * slot=0x19227358) Line 1483 C
xul.dll!SmartCardMonitoringThread::Execute() Line 284 C++
nss3.dll!_PR_NativeRunThread(void * arg=0x19231d30) Line 419 C
nss3.dll!pr_root(void * arg=0x19231bd0) Line 90 C
msvcr110.dll!_callthreadstartex() Line 354 C
msvcr110.dll!_threadstartex(void * ptd=0x1923ac88) Line 332 C
I checked the instance and context variables:
- context, which takes its value from c->object.cryptoContext in fill_CERTCertificateFields, is NULL;
- instance, which takes its value from get_cert_instance, has the right value.
Roberto
Flags: needinfo?(rrelyea)
Comment 32•12 years ago
|
||
Roberto, by 'the right value' do you mean softoken or the Gemalto token? We just removed the Gemalto token (or should have just removed it), so the Gemalto token should not be the current instance.
Flags: needinfo?(rrelyea)
Comment 33•12 years ago
|
||
Hmm, I wonder if we are incorrectly getting two instances of the Gemalto token added to the certificate. Our instance is being removed in 'remove_token_certs', which is a callback function from nssHash_Iterate() iterating over our cert cache and called from nssTrustDomain_RemoveTokenCertsFromCache. We know that we are finding this cert in 'remove_token_certs', that we are removing at least one instance in the list, and that there is more than one object instance.
Comment 34•12 years ago
|
||
Oh, I wonder... do you have two copies of the exact same cert on your token? That could cause us to have more than one instance for the same token, because there would be the same cert with multiple PKCS #11 object IDs.
Comment 35•12 years ago
|
||
Also, an even more pathological case would be 2 different certs with the same issuer and serial number (which would be interpreted by NSS as a single cert).
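Both theories come down to the same mechanism: certs are identified by issuer and serial number, so two token objects that carry the same pair end up as two instances on one cert. A rough sketch of that idea is given below. `model_cert`, `model_instance`, `attach_object` and `MAX_INSTANCES` are hypothetical names chosen for illustration and are not NSS data structures, and the issuer and serial strings in the usage are made up.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical model; these are not NSS data structures. */
#define MAX_INSTANCES 4

typedef struct {
    int token_id;  /* which token holds the object */
    int object_id; /* per-token object handle, PKCS #11 style */
} model_instance;

typedef struct {
    const char *issuer;
    const char *serial;
    model_instance instances[MAX_INSTANCES];
    unsigned num_instances;
} model_cert;

/* Records a token object on the cert when issuer and serial both match.
 * Returns 1 if attached, 0 if it is a different cert or the list is full. */
static int attach_object(model_cert *cert, const char *issuer,
                         const char *serial, int token_id, int object_id)
{
    assert(cert != NULL && issuer != NULL && serial != NULL);

    if (strcmp(cert->issuer, issuer) != 0 ||
        strcmp(cert->serial, serial) != 0) {
        return 0;
    }
    if (cert->num_instances == MAX_INSTANCES) {
        return 0;
    }
    cert->instances[cert->num_instances].token_id = token_id;
    cert->instances[cert->num_instances].object_id = object_id;
    cert->num_instances++;
    return 1;
}
```

Attaching objects 10 and 11 from the same token to a cert with a matching issuer and serial leaves `num_instances` at 2, with both entries pointing at that one token.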
| Reporter | ||
Comment 36•12 years ago
|
||
I've got some problems inspecting some symbols while debugging.
For example, when I'm in fill_CERTCertificateFields I can't view the value of instance after the line "instance = get_cert_instance(c);": Visual Studio 2012 says "identifier not recognized". Do you know why?
Anyway, I followed the code into get_cert_instance -> nssCryptokiObject_Clone and checked rvObject: the "label" field of this variable refers to the Gemalto token that has just been removed! Is this the problem? If so, what should I look at to investigate further?
I moved on: with your help I put a breakpoint in remove_token_certs. If you look at the screenshot, you will see that we have only 1 instance, but the tokens are different! So remove_token_certs doesn't remove anything! Have we found the problem?
> do you have two copies of the exact same cert on your token? That could cause us to have more than one instance for the same token because there would be the same cert with multiple PKCS #11 object IDs.
I don't understand this question. I have copied the certs attached to this bug to the Gemalto token. Is this what you want to know?
Flags: needinfo?(rrelyea)
| Reporter | ||
Comment 37•12 years ago
|
||
| Reporter | ||
Comment 38•12 years ago
|
||
Oops, I made a mistake: I had stopped on the first call of remove_token_certs. But PL_HashTableEnumerateEntries calls remove_token_certs many times. In fact, after some calls, remove_token_certs finds the correct token for my Gemalto card (look at the new screenshot). So never mind, it's not related to remove_token_certs, but the issue with the instance still stands, doesn't it?
| Reporter | ||
Comment 39•12 years ago
|
||
| Reporter | ||
Comment 40•12 years ago
|
||
I put in a very dirty workaround for our issue.
I noticed that when remove_token_certs decrements object->numInstances, object->numInstances remains at 1 (it was 2).
So I replaced the line object->numInstances--; with the line object->numInstances=0; and Firefox no longer hangs when I unplug the Gemalto key.
I know it's very dirty, but I guess it could help you understand the issue.
I hope we're near the solution.
Comment 41•12 years ago
|
||
So, given what we are seeing, this is expected. The question is why you have 2 object instances (are both instances for the removed gem token?). My current theory is that your card has two copies of the same cert under 2 PKCS #11 IDs. This creates 2 object instances with the same token. If that's the case, what we need to do in remove_token_certs is continue looping through the objects and remove all the objects for a given token (rather than break).
What we should do is print out the two objects and see if they point to the same token but different object IDs. If that's the case, then I think we've found our issue.
bob
Flags: needinfo?(rrelyea)
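The difference between stopping at the first match and removing every instance for a token can be sketched as below. This is not the NSS code and not a proposed patch: `model_instance`, `remove_first`, `remove_all` and `count_on_token` are hypothetical helpers that model only the array handling, on the assumption that instances are kept in a compact array and removal swaps in the last element.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a cert's instance list; not an NSS structure. */
typedef struct {
    int token_id;
    int object_id;
} model_instance;

/* Removes only the first instance on token_id, then stops (the "break"
 * behaviour). Returns the new instance count. */
static size_t remove_first(model_instance *inst, size_t n, int token_id)
{
    size_t i;

    assert(inst != NULL || n == 0);
    for (i = 0; i < n; i++) {
        if (inst[i].token_id == token_id) {
            inst[i] = inst[n - 1]; /* fill the gap with the last entry */
            n--;
            break;
        }
    }
    return n;
}

/* Removes every instance on token_id. Returns the new instance count. */
static size_t remove_all(model_instance *inst, size_t n, int token_id)
{
    size_t i = 0;

    assert(inst != NULL || n == 0);
    while (i < n) {
        if (inst[i].token_id == token_id) {
            inst[i] = inst[n - 1]; /* fill the gap with the last entry */
            n--;                   /* index i is re-checked next pass */
        } else {
            i++;
        }
    }
    return n;
}

/* Counts how many of the first n instances still refer to token_id. */
static size_t count_on_token(const model_instance *inst, size_t n,
                             int token_id)
{
    size_t i;
    size_t count = 0;

    for (i = 0; i < n; i++) {
        if (inst[i].token_id == token_id) {
            count++;
        }
    }
    return count;
}
```

Starting from two instances on the removed token plus one on another token, `remove_first` leaves one stale instance behind, while `remove_all` leaves only the instance for the other token. Unlike setting the count to 0, the full scan keeps instances that belong to tokens that are still present.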
| Reporter | ||
Comment 42•12 years ago
|
||
You were definitely right.
I'm posting a screenshot of the 2 instances.
What's next? Should I create a patch myself, or do you want to do it?
I have another issue, probably unrelated: in this scenario I see a tremendous memory leak (100 MB in an hour with just a local page open). What do you suggest to help understand the problem?
Flags: needinfo?(rrelyea)
| Reporter | ||
Comment 43•12 years ago
|
||
2 instances
Updated•7 years ago
|
Priority: -- → P3
Updated•3 years ago
|
Severity: normal → S3