Closed Bug 433594 Opened 16 years ago Closed 16 years ago

Crash destroying OCSP Cert ID [@ CERT_DestroyOCSPCertID ]

Categories

(NSS :: Libraries, defect, P1)

x86
Windows XP
defect

Tracking

(Not tracked)

VERIFIED FIXED
3.12.1

People

(Reporter: fittysix, Assigned: alvolkov.bgs)

References

()

Details

(Keywords: crash, regression, Whiteboard: PKIX [RC2+])

Crash Data

Attachments

(2 files, 1 obsolete file)

Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9) Gecko/2008051206 Firefox/3.0
(rc1 candidate 1)

My co-worker is getting this crash every time he visits the URL in the bug report. Crash id: 47c555b0-2127-11dd-b748-001cc45a2ce4

Paypal seems to have no problems, which is the same CA, and thus far this remains the only website that he has experienced this on.
I'll try to get a stack trace some time, but as this is his working machine it might not be possible to get that right away.

Possibly related to Bug 428038
Summary: Crash destroying cert [@ CERT_DestroyOCSPCertID ] → Crash destroying OSCP Cert ID [@ CERT_DestroyOCSPCertID ]
Stack contains 

0  	nss3.dll  	CERT_DestroyOCSPCertID  	 mozilla/security/nss/lib/certhigh/ocsp.c:1519
1 	nss3.dll 	pkix_pl_OcspCertID_Destroy 	mozilla/security/nss/lib/libpkix/pkix_pl_nss/pki/pkix_pl_ocspcertid.c:69
2 	nss3.dll 	PKIX_PL_Object_DecRef 	mozilla/security/nss/lib/libpkix/pkix_pl_nss/system/pkix_pl_object.c:936
3 	nss3.dll 	PKIX_PL_OcspCertID_Create 	mozilla/security/nss/lib/libpkix/pkix_pl_nss/pki/pkix_pl_ocspcertid.c:178
4 	nss3.dll 	pkix_OcspChecker_Check 	mozilla/security/nss/lib/libpkix/pkix/checker/pkix_ocspchecker.c:183
5 	nss3.dll 	pkix_RevCheckCert 	mozilla/security/nss/lib/libpkix/pkix/top/pkix_validate.c:390
6 	nss3.dll 	pkix_CheckChain 	mozilla/security/nss/lib/libpkix/pkix/top/pkix_validate.c:978
7 	nss3.dll 	pkix_Build_ValidateEntireChain 	mozilla/security/nss/lib/libpkix/pkix/top/pkix_build.c:1632
8 	nss3.dll 	pkix_BuildForwardDepthFirstSearch 	mozilla/security/nss/lib/libpkix/pkix/top/pkix_build.c:3047
9 	nss3.dll 	pkix_Build_InitiateBuildChain 	mozilla/security/nss/lib/libpkix/pkix/top/pkix_build.c:4194
10 	nss3.dll 	PKIX_BuildChain 	mozilla/security/nss/lib/libpkix/pkix/top/pkix_build.c:4384
11 	nss3.dll 	CERT_PKIXVerifyCert 	mozilla/security/nss/lib/certhigh/certvfypkix.c:2138
12 	xul.dll 	xul.dll@0x2efb2a 	
13 	xul.dll 	nsNSSCertificate::getValidEVOidTag 	mozilla/security/manager/ssl/src/nsIdentityChecking.cpp:970
14 	xul.dll 	nsNSSCertificate::GetIsExtendedValidation 	mozilla/security/manager/ssl/src/nsIdentityChecking.cpp:996
15 	xul.dll 	AuthCertificateCallback 	mozilla/security/manager/ssl/src/nsNSSCallbacks.cpp:979
16 	ssl3.dll 	ssl3_HandleCertificate 	mozilla/security/nss/lib/ssl/ssl3con.c:7260
17 	ssl3.dll 	ssl3_HandleHandshakeMessage 	mozilla/security/nss/lib/ssl/ssl3con.c:7938
18 	ssl3.dll 	ssl3_HandleHandshake 	mozilla/security/nss/lib/ssl/ssl3con.c:8062
19 	ssl3.dll 	ssl3_HandleRecord 	mozilla/security/nss/lib/ssl/ssl3con.c:8325
20 	ssl3.dll 	ssl3_GatherCompleteHandshake 	mozilla/security/nss/lib/ssl/ssl3gthr.c:206
21 	ssl3.dll 	ssl_GatherRecord1stHandshake 	mozilla/security/nss/lib/ssl/sslcon.c:1258
22 	ssl3.dll 	ssl_Do1stHandshake 	mozilla/security/nss/lib/ssl/sslsecur.c:151
23 	ssl3.dll 	ssl_SecureSend 	mozilla/security/nss/lib/ssl/sslsecur.c:1152
24 	ssl3.dll 	ssl_SecureWrite 	mozilla/security/nss/lib/ssl/sslsecur.c:1197
25 	ssl3.dll 	ssl_Write 	mozilla/security/nss/lib/ssl/sslsock.c:1487
26 	xul.dll 	nsSSLThread::Run 	mozilla/security/manager/ssl/src/nsSSLThread.cpp:1029
Assignee: nobody → alexei.volkov.bugs
Priority: -- → P1
Whiteboard: PKIX
Version: unspecified → trunk
Does Not appear related to Bug 428038.
Ryan, is your co-worker's system clock wrong, perhaps some hours or days 
in the past?  Is his system time zone wrong?  
www.techdata.com is a misconfigured server.  It has a Verisign class 3 
SSL server cert, but IINM, it is not configured to send out the proper 
intermediate CA certificate for the Verisign class 3 Intermediate CA.
The fact that the browser got as far as calling libPKIX means that the 
non-PKIX cert verification passed, which in turn means that the browser
had previously imported and saved that Verisign class 3 Intermediate CA.

None of this excuses the crash.  It's just helpful info to understand the
circumstances in which the crash occurred. 
The system clock is accurate, and synced to the windows internet time server. The time zone appears to be correct as well.
I should probably mention, we tried completely uninstalling firefox, deleting all program directories and removing all appdata/local settings mozilla directories, and the problem remained.
With the attached cert installed in my cert DB, I ran the commands

vfyserv -d $HOME/.netscape -c -o www.techdata.com

NSS_ENABLE_PKIX_VERIFY=1 vfyserv -d $HOME/.netscape -c -o www.techdata.com

At first, I got this error:


Error in function PR_Write: -8071
 - The OCSP server experienced an internal error.

But then it changed, and now I get:

Handshake Complete: SERVER CONFIGURED CORRECTLY

I wonder if the server configuation was changed.

Ryan, can you still reproduce?
Still seems to happen.
One odd note, it crashes with the same signature when he looks at the bookmark properties:
bp-b08e17a8-2140-11dd-b7f3-001321b13766
I cannot reproduce this with any of NSS's test programs.
They all complete with out crashing.
I've just managed to reproduce this on a different machine, this one is a vista machine that had never accessed techdata before. The trace on this one is a little shorter.

bp-eb0e5aee-2149-11dd-8d0e-0013211cbf8a

When I first visit the page in that URL I get Error code: sec_error_unknown_issuer
I can blindly add an exception, but if I click the view button I get the crash.
Hmm, so ignoring the whole sec_error_unknown_issuer thing, which I assume is the misconfiguration on their part. 

I am able to view the cert without crashing on Beta 5, but I do get the crash on a build from 04/23/08, which is the oldest build I happened to conveniently have on my hard drive. There's still about a month of unknown between there, so I'll see if I can narrow down a regression range a little, if that can be found then it might be a little easier to figure out what exactly causes the crash.
Keywords: regression
Only two of the files changed in that huge list of checkins are NSS.

We cannot reproduce this with any of the NSS test tools. 
Not with vfyserv, vfychain, nor ocspclnt.  
Reproducing this bug seems to require PSM.  
While that's not completely conclusive, it suggests strongly that this 
problem involves something that PSM does that NSS does not. 
 
Got it, regression range in comment 11 is a little wrong. Bug 375019

http://bonsai.mozilla.org/cvsblame.cgi?file=mozilla/security/nss/lib/libpkix/pkix_pl_nss/pki/pkix_pl_ocspcertid.c&mark=170,171,178#164

The CERT_CreateOCSPCertID is returning null because of sec_error_unknown_issuer
This throws the cleanup, which calls PKIX_DECREF(cid) which calls the destructor:
http://bonsai.mozilla.org/cvsblame.cgi?file=mozilla/security/nss/lib/libpkix/pkix_pl_nss/pki/pkix_pl_ocspcertid.c&mark=69#52
which tries to destroy the non-existant certID
Ryan,  Here are a few thoughts.

1. Good Job.  You found a real bug, and your description of the sequence
of events that leads to it also suggests a second bug.  

Your results show that if CERT_CreateOCSPCertID fails for ANY reason, 
when called from PKIX_PL_OcspCertID_Create, the result will be a crash.

I'm surprised that our automated testing didn't catch that.  We do testing
that injects allocation failures, simulating out of memory conditions, to 
excersize our error paths.  A memory allocation failure during the execution
of CERT_CreateOCSPCertID would have caused the same result.  But we haven't
encountered that.  Makes me wonder if our allocation failure injection is 
working as well as we think it is.

The fix to the immediate cause of the crash is now obvious.  There are (at least) two ways we could fix it.  I recomend changing CERT_DestroyOCSPCertID 
to check its input argument for NULL.  It should also set an error code when returning SECFailure.

2. I wonder, did you see this in a debugger?  Did you actually see the 
sec_error_unknown_issuer error occur when ocsp_CreateCertID called CERT_FindCertIssuer?  Or is that an inference/deduction?  

I ask this because that should never happen, and if it did happen, then there 
is another bug somewhere else, in the code path that preceded the call to pkix_RevCheckCert.  That error is occuring while we're building the OCSP 
request for a cert to be checked for revocation.  That error code implies
that we're attempting to check the revocation status of a cert for which we
do not have the issuer certificate.  That's a serious no-no.

We should NEVER attempt to do a revocation check on a cert that has not 
already been validated (except for revocation).  Revocation checking should 
be the last step, after the whole chain has passed the test of having valid 
issuers, valid signatures, not expired, etc., in a chain that is anchored by 
a trusted root.  If we were following that rule, we would have already found 
an issuer cert for the cert that we're checking with OCSP.  Consequently, 
we should not do the OCSP check until after we've found a valid issuer for 
the cert.  (This implies that the revocation checks should proceed from root
to leaf, and not in the opposite order).  So, it should be impossible to 
experience an "unknown issuer" error while doing OCSP, because the issuer 
cert should already have been checked and found valid.  

So, I ask if you're sure that the error experienced in the OCSP code was 
SEC_ERROR_UNKNOWN_ISSUER, because if so, we've got a bug in a higher layer
of the code. 

It occurs to me that PSM's new "security exceptions" for certificates may be 
defeating this rule.  I wonder if perhaps you had created a "security 
exception" for this server's certificate, to work around the fact that the 
issuer cert was not being sent by the https server. (?)  I wonder if PSM is 
attempting to do EV certificate checks on certs that have "security exceptions".  That would be another no-no.  If that's what's going on, then
we have a PSM bug here, too.
Attached patch Immediate quick fix (obsolete) — Splinter Review
I'm actually surprised this bug never got hit before considering it happens on all OCSP errors

I did see it when debugging in visual studio.
When I click the view certificate button (easiest way to invoke it) it runs through pkix_pl_ocspcertid.c twice, the first time it goes through completely valid. The second time is when it fails by falling in to this
http://bonsai.mozilla.org/cvsblame.cgi?file=mozilla/security/nss/lib/certhigh/ocsp.c&mark=1665-1667#1660

Which eventually ends up here:
http://bonsai.mozilla.org/cvsblame.cgi?file=mozilla/security/nss/lib/certhigh/certvfy.c&mark=266#260
Comment on attachment 321531 [details] [diff] [review]
Immediate quick fix

It's certainly not true that this "happens on all OCSP errors". 
It happens only in the relatively rare case where OCSP is unable to 
successfully construct the OCSP request to send to the server, because
it is unable to construct the "certID".  

Please fix this in CERT_DestroyOCSPCertID. CERT_DestroyOCSPCertID has 
3 callers.  I'd rather fix it once in that function, than multiple 
times in all the callers of that function.
Attachment #321531 - Flags: review-
Ryan, did you create a "security exception" for this certificate?
Target Milestone: --- → 3.12.1
In comment 15, Ryan wrote:
> When I click the view certificate button (easiest way to invoke it) it 
> runs through pkix_pl_ocspcertid.c twice, the first time it goes through 
> completely valid. The second time is when it fails [...]

Is this when viewing the certificate for https://www.techdata.com/ ?

Are you clicking the view certificate button in the "Add Security Exception"
dialog box?  Or elsewhere?  

Do the two calls to PKIX_PL_OcspCertID_Create (one of which precedes a 
successful test, and one of which precedes a failure) both get the same 
certificate for the first argument?  Or are they testing different 
certificates?

(If I could reproduce this, I wouldn't ask all these questions.  Sorry.)
Actually, I'm willing to take both patches, this and the previous one.
The other patch is not wrong.  It's merely not as complete as I'd like.

I think this bug should be nominated to be fixed in a new NSS 3.12 release
candidate, and get into FF3.
Attachment #321538 - Flags: review?(julien.pierre.boogz)
I'm using techdata for testing, and clicking the view button in the 'add security exception' dialog.

I don't have any security exceptions on the machine that I am debugging on, it is a little strange you can't reproduce it though, since I've managed it on all but 1 windows machine (a machine with a very old profile that had previously been to techdata). It seems to always reproduce on a completely fresh profile, though it might make a difference that I'm in Canada, they could have a different server setup.

the nssCert of the 2 certs (hopefully bugzilla formats these nicely)
pass:
0x086a4eb0 {arena=0x08664ff0 subjectName=0x086a8740 "CN=VeriSign Class 3 Extended Validation SSL CA,OU=Terms of use at https://www.verisign.com/rpa (c)06,OU=VeriSign Trust Network,O="VeriSign, Inc.",C=US" issuerName=0x086addb8 "CN=VeriSign Class 3 Public Primary Certification Authority - G5,OU="(c) 2006 VeriSign, Inc. - For authorized use only",OU=VeriSign Trust Network,O="VeriSign, Inc.",C=US" .

fail:
0x08695dc8 {arena=0x08696f30 subjectName=0x08697448 "CN=www.techdata.com,OU=web1-5,O=Tech Data Corporation,OID.2.5.4.9=5350 Tech Data Drive,L=Clearwater,ST=Florida,postalCode=33760,C=US,serialNumber=465173,OID.2.5.4.15="V1.0, Clause 5.(b)",OID.1.3.6.1.4.1.311.60.2.1.2=Florida,OID.1.3.6.1.4.1.311.60.2.1.3=US" issuerName = 0x08697548 "CN=VeriSign Class 3 Extended Validation SSL CA,OU=Terms of use at https://www.verisign.com/rpa (c)06,OU=VeriSign Trust Network,O="VeriSign, Inc.",C=US"
OK, I am now able to reproduce this crash with a fresh profile when I 
attempt to view the cert in the add exception dialog.  Thanks.  

The information that you provided in comment 20 is very interesting!
I have not yet attempted to reproduce those results.

They show that libPKIX *IS* checking revocation in the desired order,
progressing from the highest level intermediate CA towards the EE (leaf)
cert.  It verifies the certificate status for the issuer certificate 
first, and then attempts to verify the status of the EE (leaf) cert.
But when it attempts to verify the status of the EE cert, it is unable
to find the issuer cert that it just checked in the preceding OCSP check!

So, the question is: how can it fail to find the very same issuer 
certificate that it JUST CHECKED?  

This is really a question for bug 434398.
Comment on attachment 321531 [details] [diff] [review]
Immediate quick fix

I'm willing to commit this patch together with the other one, when that is reviewed.
Attachment #321531 - Flags: review- → review+
might as well throw it in the rc2? pile.
There was a huge spike in these for some reason. Still fairly far down on the list, but reading through some of those reports it looks like there's a few affected sites.
http://crash-stats.mozilla.com/report/list?range_unit=weeks&version=Firefox%3A3.0&range_value=2&signature=CERT_DestroyOCSPCertID
Flags: blocking1.9?
Summary: Crash destroying OSCP Cert ID [@ CERT_DestroyOCSPCertID ] → Crash destroying OCSP Cert ID [@ CERT_DestroyOCSPCertID ]
Whiteboard: PKIX → PKIX [RC2?]
Attachment #321538 - Flags: review?(julien.pierre.boogz) → review+
Is there a way to add a crashtest for this fix?
Comment on attachment 321538 [details] [diff] [review]
proposed patch v1 (checked in on trunk)

Checking in certhigh/ocsp.c; new revision: 1.52; previous revision: 1.51
Attachment #321538 - Attachment description: proposed patch v1 → proposed patch v1 (checked in on trunk)
Do we have data about the parameter value to the destroy function?
Was it always NULL when you crashed?
on pkix_pl_OcspCertID_Destroy?
If so, the PKIX_PL object was there, and appeared to be valid, but the magicheader was 0 IIRC. AFAICT this is normal behavior, since the certid should be null if it returns an error.

For the CERT_DestroyOCSPCertID function, it appeared that the CertID was null, or somehow invalid. MSVC showed the proper data structure of the CertID, but the data was all "???" (same result you get with an invalid ptr).

I can confirm this later, but I don't have access to my dev machine ATM.
Ryan, I'm only interested in the parameter value to function CERT_DestroyOCSPCertID. Thanks!
Comment on attachment 321538 [details] [diff] [review]
proposed patch v1 (checked in on trunk)

This patch is prerequisite to either of the two alternative patches for bug 433386, which are now being considered for a new NSS 3.12.0.x release candidate.  
Bob, please SR for the 3.12.0 branch.
Attachment #321538 - Flags: superreview?(rrelyea)
Comment on attachment 321531 [details] [diff] [review]
Immediate quick fix

Although reviewed, apparently this patch did not get checked in, and it collides with the work in bug 433386, so I'm marking it as obsolete.
Comment on attachment 321538 [details] [diff] [review]
proposed patch v1 (checked in on trunk)

r=kengert
Attachment #321538 - Flags: review+
Comment on attachment 321531 [details] [diff] [review]
Immediate quick fix

Yes, I was about to check in this patch when I realized just how flawed the whole certIDWasConsumed flag really is.  That led to bug 435007.  The patch for bug 435007 conflicts with (and supersedes) this patch, so I agree, this patch is obsolete.
Attachment #321531 - Attachment is obsolete: true
OK, so what's the status here, then? Do we have a patch checked in on a NSS branch? Is this to be fixed by one of those other bugs? If we're to consider this for RC2, we'll need to know:

 - status and ETA to fix being ready
 - risk of regression
My understanding is that the patch we actually want is in bug 433386, where it (as well as a broader, post-FF3, patch) have already been reviewed and sr'd.

I can't comment on the risk here, but I hope Kai, Nelson, or Bob can do so, as well as correct me if I mis-stated things above.
Johnathan & Mike,

I believe there are two separate sets of "topcrash" bugs, with two separate
patches, one in this bug, and one in bug 433386, where the attachment is 
named "(minimal) Patch v1 b".  I believe both patches are necessary to fix 
the related top crashers.  I believe both are VERY low risk (just adding 
detection of NULL pointers).

The NSS team is willing to produce another NSS 3.12.0 Release Candidate for
inclusion in FF3 RC2, if MoCo wants it (and it's beginning to sound like you
do, but please confirm).  Please tell us the date on which you would need 
that new NSS release candidate.  

One other crash bug in NSS is bug 432284.  We would also supply a fix for 
that bug if we could reproduce it or knew the cause exactly, but we don't.  
It apparently only happened on one day.  See the comments in that bug for 
info on it. 
Flags: wanted1.9.0.x?
Flags: wanted-next+
Flags: blocking1.9?
Flags: blocking1.9-
Comment on attachment 321538 [details] [diff] [review]
proposed patch v1 (checked in on trunk)

Clearing the request for review from Bob (but please feel free to have a look). We have two reviews already.
Attachment #321538 - Flags: superreview?(rrelyea)
Blocks: 435959
Comment on attachment 321538 [details] [diff] [review]
proposed patch v1 (checked in on trunk)

a=shaver for landing on the Firefox 3 NSS branch thingy place.  Thanks!
Attachment #321538 - Flags: approval1.9+
checked in to 3.12.0 branch for 3.12.0 rc4

Checking in certhigh/ocsp.c;
/cvsroot/mozilla/security/nss/lib/certhigh/ocsp.c,v  <--  ocsp.c
new revision: 1.50.6.1; previous revision: 1.50
done
Flags: wanted1.9.0.x?
Flags: blocking1.9-
Flags: blocking1.9+
Whiteboard: PKIX [RC2?] → PKIX [RC2+]
This is FIXED, aiui.
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
It is fixed on the branch that NSS maintains for FF3, but not yet on the trunk.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Nelson, didn't you say in comment 25 that you have checked it in?
Uh, we're building Firefox 3 RC2 with the understanding that it was checked in on NSS 3.12 RC 4. If that's not the case, someone needs to raise a flag RFN.
Oh, nm, I misinterpreted comments.
D'oh!
Status: REOPENED → RESOLVED
Closed: 16 years ago16 years ago
Resolution: --- → FIXED
verified fixed using Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9) Gecko/2008052906 Firefox/3.0 and the testurl - no crash with the RC2 Candidate Build -> Verified Fixed
Status: RESOLVED → VERIFIED
Keywords: crash
Carsten, than you for verifying this bug. 
Please also check bug 435496, which is conjectured to be another duplicate
of this bug (having the same cause), and see if it is also fixed by the 
fix for this bug.  
Thanks.
Crash Signature: [@ CERT_DestroyOCSPCertID ]
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: