Closed Bug 662557 Opened 9 years ago Closed 8 years ago

OCSP validation errors are wrongly reported as SEC_ERROR_NO_MEMORY errors from CERT_PKIXVerifyCert

Categories

(NSS :: Libraries, defect)

x86
Windows 7
defect
Not set

Tracking

(firefox7-)

RESOLVED FIXED
3.12.11
Tracking Status
firefox7 - ---

People

(Reporter: briansmith, Assigned: kaie)

Details

Attachments

(4 files, 4 obsolete files)

The PSM modal dialog box will occasionally pop up with the sec_error_no_memory error message when I visit a page (the most recent example was https://ssllabs.com, a while ago I think I also experienced this with https://gmail.com or one of the sites it redirects to). However, the browser does not seem to be out of memory at all. Wan-Teh mentioned that it was possible that we're parsing/interpreting a buffer length incorrectly or that some buffer length arithmetic underflowed/overflowed, causing us to try to allocate an extremely large buffer.

I did not have this problem until immediately after I enabled libpkix verification. At that time, I ran into it just a few times but couldn't reproduce it. Then, I didn't run into the problem for several weeks, until just now.
This may be not reproducible outside your box. To gather more information, it may be appropriate to run your Firefox under debugger and trace PORT_SetError(SEC_ERROR_NO_MEMORY ::= -8173) calls. Unfortunately, I am not a Windows programmer, and can not advice how to achieve this.
I just ran into this again, but I was using the stock nightly build. I would like to check this patch into m-c so that I can ask more people to help me debug it. The patch adds a debug break when PORT_SetError(SEC_ERROR_NO_MEMORY) is called and the NSS_DEBUG_SEC_ERROR_NO_MEMORY environment variable is set. After the patch is checked in, I will post a message asking people to add NSS_DEBUG_SEC_ERROR_NO_MEMORY=1 to their runtime configuration and to enable libpkix to help find the bug.
Attachment #539074 - Flags: review?(kaie)
Assignee: nobody → bsmith
(a) How will you guarantee this gets backed out prior to aurora/beta/release? You must find to a way to guarantee this, or my review is void.

(b) What about side effects? PORT_SetError is designed to be minimal, and may be called from anywhere. The call to PR_GetEnv uses a lock. Can this introduce a new risk for a deadlock? For example, CERT_UncacheCRL calls PORT_SetError while holding another lock, also the SSLThread.

(c) This is a full abort on non-Win32 platforms.

(d) If you really do this, you must make a note of the patch to directory mozilla/security/patches so it's not forgotten later when updating NSS.

(e) Instead of using PR_GetEnv and requiring a debugger, could you simply dump a stack to the console? I have seen other debugging output in the past that did this, but I don't know if this is possible on Windows (which is what you probably use).
Comment on attachment 539074 [details] [diff] [review]
[debug-patch] Patch to help debug this for mozilla-central only - not for check-in to NSS CVS

I'm ok with this if you do (a) and (d), and really must try this for a couple of days - but you should backout quickly if it doesn't help or introduces new strange deadlocks. I'd rather recommend to use solution (e).
Attachment #539074 - Flags: review?(kaie) → review+
Here is the patch to m-c. It includes the patch in security/patches.

I talked to some people about getting a stack trace from NSS without aborting but we didn't find a way to do it in non-debug builds.

I am hoping that somebody will help test this and post a stacktrace to this bug really soon and that we can disable this ASAP.
Attachment #541788 - Flags: review+
If this patch survives until the Aurora landing, then we must back it out of Aurora after the merge.
(In reply to comment #3)
> (a) How will you guarantee this gets backed out prior to
> aurora/beta/release? You must find to a way to guarantee this, or my review
> is void.

I have marked it as tracking-firefox7 to track this.

> (b) What about side effects? PORT_SetError is designed to be minimal, and
> may be called from anywhere. The call to PR_GetEnv uses a lock. Can this
> introduce a new risk for a deadlock? For example, CERT_UncacheCRL calls
> PORT_SetError while holding another lock, also the SSLThread.

PORT_SetError is only called from within NSS so this will be limited to affecting NSS functions. I do not see how it is possible to get a deadlock with this patch, but if we start seeing deadlocks we can definitely back it out.
Whiteboard: [inbonud]
Whiteboard: [inbonud] → [inbound]
http://hg.mozilla.org/mozilla-central/rev/ecbf03bb7afe
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Re-opening because we need to back the patch out on Aurora, and eventually on m-c once we get some stack traces.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Whiteboard: [inbound]
Whiteboard: nominated for backout on aurora
tracking-firefox7 and a=me on a straight up backout. Can we get it soon?
Brian, will you do the backout?
Or do people want me to do it?
I believe the error is related to OCSP code paths.

So far I have been unable to ever run into this problem while running a debug build, although I tried, I even attempted to use a test page that continously reloads a mix of https pages, but the error was never hit.

However, I repeatedly ran into this with my primary work profiles (which I do run with libpkix enabled).

When I ran into the problem the last time (without being in the debugger), I played with OCSP settings.

I disabled OCSP (prefs, advanced, encryption, validation). Reloaded the page, and it worked!

I reenabled OCSP (and set to strict), reloaded the page, and the problem reoccurred.

This was happening with my test server at https://ssltls.de/status-of-major-sites.php which uses a certificate (and OCSP server) from Startcom.

My current suspicion is, Startcom's OCSP server might have a problem, when we experience this bug.

I tried to reproduce the bug, by using Firewall settings that block access to the OCSP settings. I couldn't, which means it's not as simple as an unreachable OCSP server (which is expected, because we have separate error codes for that scenario).

Maybe the response we receive from the OCSP is damaged.

It might be, that the binary response that we receive from the OCSP server contains bad data structure. It might be possible that the repsonse says "next piece is a string, and it is e.g. 89789277892389 bytes long". This would be correct with Wan-Teh's statement, that we attempt to allocated more memory than we have available.

Why is this only happening in libpkix mode? I suspect that in the classic mode the code path reports an OCSP error, while in libpkix mode the out-of-memory-error somehow gets the highest preference.
With which build does this happen? I'm running hard-fail exclusively, but never experienced something like this. Of course I'm close to a OCSP responder, but there is one close to your location too (80.237.154.29) if you want to test.
The patch can be backed out of both Aurora and mozilla-central. I will do it tomorrow.

Due to this patch, I have narrowed down the problem to the libpkix memory allocation routines, and we can use the libpkix error log to diagnose this. I could reproduce it reliably when there was a captive portal trying to intercept traffic. I suspect that the OCSP response parser is parsing an HTML response as an OCSP response, and interpreting some sequence of large values as a very large DER length value. Still needs a little more investigation.

Eddy, this bug seems to happen only when security.use_libpkix_verification=true.
Oh well, with FF5 there is no such config yet. Apparently I need a nightly build in order to test that.
Eddy, actually Firefox 5 was the first version where this pref is available for experimental testing. But we made it a super-hidden pref. You can go to about:config - and add that as a new bool pref, and set it to true. Restart, and you should be running the new experimental verification code.
err sorry, I think I misremembered. It might have been version 6. (too many version numbers recently...)
not high visible enough for the release team to track.
If libpkix is unable to send/receive an OCSP response, internally it reports an PKIX_OCSPSERVERERROR, but the error code that is reported to the application is SEC_ERROR_NO_MEMORY.

Somewhere internal to libpkix's error handling (Java exception emulation), the PKIX_OCSPSERVERERROR error becomes a PKIX_ALLOCERROR. Then CERT_PKIXVerifyCert converts the PKIX_ALLOCERROR to SEC_ERROR_NO_MEMORY here:
http://mxr.mozilla.org/security/source/security/nss/lib/certhigh/certvfypkix.c#2304
Summary: unexpected, rare, SEC_ERROR_NO_MEMORY errors when using Firefox with libpkix verification enabled → OCSP validation errors are wrongly reported as SEC_ERROR_NO_MEMORY errors from CERT_PKIXVerifyCert
What bothers me is that sites like  https://ssllabs.com and https://gmail.com suffered from this and specially this statement from Kay:

> This was happening with my test server at 
> https://ssltls.de/status-of-major-sites.php which uses a 
> certificate (and OCSP server) from Startcom.

Something doesn't sound quite right with that.
(In reply to comment #12)
> I tried to reproduce the bug, by using Firewall settings that block access
> to the OCSP settings. I couldn't, which means it's not as simple as an
> unreachable OCSP server (which is expected, because we have separate error
> codes for that scenario).

It could be that the timeout case is handled differently (somehow correctly), while other error cases aren't.

This is triggered with captive portals because of this code:

    if (PORT_Strcasecmp(responseContentType, 
                        "application/ocsp-response")) {
        PKIX_ERROR(PKIX_OCSPSERVERERROR);
    }

The captive portals return a text/html login page.
bsmith's hints in comment 19 and comment 21 allowed me to track
this down.

The enumeration constant PKIX_ALLOCERROR has the value 0, which
is the default value of pkixErrorCode.  Therefore, any function
that fails to set pkixErrorCode to some value on error will report
pkixErrorCode=PKIX_ALLOCERROR.

pkixErrorCode is set by macros like PKIX_CHECK and PKIX_ERROR.
So, any error return that doesn't use those macros may potentially
have this bug.  We will need to review all the code that does
    pkixErrorResult = some_function(...);
and verify that it does the equivalent of PKIX_CHECK if
pkixErrorResult is not NULL.

This patch only fixes the instance of this bug in this bug
report.  I suspect there are other instances of this bug
unrelated to OCSP check failures.
Attachment #539074 - Attachment description: Patch to help debug this for mozilla-central only - not for check-in to NSS CVS → [debug-patch] Patch to help debug this for mozilla-central only - not for check-in to NSS CVS
Attachment #541788 - Attachment description: assert on out-of-memory errors in NSS even in release builds - not for check-in to NSS CVS - must be backed out of Aurora → [debug-patch] assert on out-of-memory errors in NSS even in release builds - not for check-in to NSS CVS - must be backed out of Aurora
Attachment #545853 - Attachment description: Patch to fix only one instance of this bug → [fix v1] Patch to fix only one instance of this bug
Attached patch [fix] Patch v2 (obsolete) — Splinter Review
Wan-Teh, do you think this is a good improvement of your patch?
Attachment #546194 - Flags: review?(wtc)
Comment on attachment 546194 [details] [diff] [review]
[fix] Patch v2

Alexei, I think you have touched this code; if you have comments, please let us know, thanks!
Attachment #546194 - Flags: review?(alvolkov.bgs)
The three temporary bug-finding patches' backout (along with most things committed on Friday afternoon) was backed out of mozilla-inbound in order to clear up orange.
(In reply to comment #14)
> The patch can be backed out of both Aurora and mozilla-central. I will do it
> tomorrow.

Brian, I think you forgot to do this.

I'm about to land a new NSS into mozilla-central, and I'll back out there.

Do you need me to help with the Aurora backout?
Ok, Wan-Teh was quicker, he already upgraded NSS, but he re-applied the debug patch (because it is still mentioned in the mozilla/security/patches directory/ )
Comment on attachment 541788 [details] [diff] [review]
[debug-patch] assert on out-of-memory errors in NSS even in release builds - not for check-in to NSS CVS - must be backed out of Aurora

This backed out of mozilla-central.
http://hg.mozilla.org/mozilla-central/rev/47a266e0a4bd
Comment on attachment 541788 [details] [diff] [review]
[debug-patch] assert on out-of-memory errors in NSS even in release builds - not for check-in to NSS CVS - must be backed out of Aurora

Backed out of aurora.
http://hg.mozilla.org/releases/mozilla-aurora/rev/7bbe5e1dc636
Comment on attachment 546194 [details] [diff] [review]
[fix] Patch v2

Kai, I'm sorry I forgot to review this patch.  When I worked on NSS
in the past few days, I only looked at bugs with target milestone
NSS 3.12.11.  This bug doesn't even have a target milestone!

Your changes to pkix_revocationchecker.c are not equivalent to the
original code with respect to "goto cleanup".  In the original code,
we goto cleanup unconditionally in those two places.  In the new
code, we goto cleanup only if (pkixErrorResult), due to the use of
the PKIX_CHECK_RESULT macro.

There is also a bug in your macro:

>+#define PKIX_CHECK_RESULT(descNum) \
>+    do { \
>+        if (pkixErrorResult) { \
>+            TRACE_CHECK_FAILURE((func), PKIX_ErrorText[descNum]) \
>+            pkixErrorClass = pkixErrorResult->errClass; \
>+            pkixErrorCode = descNum; \
>+            goto cleanup; \
>+        } \
>+    } while (0)

Your macro does not have a 'func' argument, but 'func' is used in
the macro's expansion.
Attachment #546194 - Flags: review?(wtc) → review-
Whiteboard: nominated for backout on aurora
Wan-Teh, thanks for noticing the problems in my patch.

When looking at this today, with a fresh head, I have a better proposal. IMHO we should follow the usual style and always use a macro to call the functions, to ensure all dependent variables get updated immediately.

For this purpose I propose to introduce PKIX_CHECK_NO_GOTO.
Attached patch Patch v3 (obsolete) — Splinter Review
Assignee: bsmith → kaie
Attachment #549082 - Flags: review?(wtc)
I don't know if target milestone 3.12.11 is realistic, but let's try, so Wan-Teh sees this.
Target Milestone: --- → 3.12.11
Attachment #546194 - Attachment is obsolete: true
Attachment #546194 - Flags: review?(alvolkov.bgs)
I am currently at a WiFi Hotspot.
I am able to reproduce the issue consistenly.

In order to reproduce the problem it is require to have two preferences set:
- use pkix
- strict ocsp

i.e. the prefs.js file must contain these prefs:
- user_pref("security.use_libpkix_verification", true);
- user_pref("security.OCSP.require", true);
I can confirm that patch v3 fixes the issue.
With patch v3 applied, we get error code "OCSP server error".

I recommend to apply patch v4, too.
(I tested both v3 and v3+v4.)
One more detail regarding the OCSP server.

Weeks ago I had attempted to reproduce the issue by using firewall rules that block connections to the OCSP server. I wasn't able to.

In order to reproduce this issue, the following behaviour of a OCSP connection attempt triggers this bug:
- connection to ocsp hostname succeeds
- we send out the http request
- the server responds with http code "302 temporarily moved"
Comment on attachment 549082 [details] [diff] [review]
Patch v3

Kai: I may not have the stamina to review your patches quickly.  Sorry.

Your use of "patch v4" and "patch v3" is confusing.  Usually patch v4
should replace patch v3, but apparently the patch v3 and patch v4
in this bug can be applied independently.

In mozilla/security/nss/lib/libpkix/pkix/checker/pkix_revocationchecker.c:

>             if ((!(revFlags & PKIX_REV_MI_TEST_ALL_LOCAL_INFORMATION_FIRST) ||
>                  onlyUseRemoteMethods) &&
>                 chainVerificationState &&
>                 methodStatus[methodNum] == PKIX_RevStatus_NoInfo) {
>                 if (!(methodFlags & PKIX_REV_M_FORBID_NETWORK_FETCHING)) {
>                     PKIX_RevocationStatus revStatus = PKIX_RevStatus_NoInfo;
>-                    pkixErrorResult =
>-                        (*method->externalRevChecker)(cert, issuer, date,
>-                                                      method,
>-                                                      procParams, methodFlags,
>-                                                      &revStatus, pReasonCode,
>-                                                      &nbioContext, plContext);
>+                    PKIX_CHECK_NO_GOTO((*method->externalRevChecker)
>+                                           (cert, issuer, date,
>+                                           method,
>+                                           procParams, methodFlags,
>+                                           &revStatus, pReasonCode,
>+                                           &nbioContext, plContext),
>+                                       PKIX_REVCHECKERCHECKFAILED);
>                     methodStatus[methodNum] = revStatus;
>                     if (revStatus == PKIX_RevStatus_Revoked) {
>                         /* if error was generated use it as final error. */
>                         overallStatus = PKIX_RevStatus_Revoked;
>                         goto cleanup;
>                     }
>                     if (pkixErrorResult) {
>                         /* Disregard errors. Only returned revStatus matters. */
>                         PKIX_PL_Object_DecRef((PKIX_PL_Object*)pkixErrorResult,
>                                               plContext);
>                         pkixErrorResult = NULL;
>                     }

In the last "if" statement, it seems that we should also reset
pkixErrorClass and pkixErrorCode if we reset pkixErrorResult to
NULL.
clarification:

patches v3 and v4 are independent.

patch v3 fixes this bug.

patch v4 is an additional patch to increase correctness.
patch v4 is optional.
patch v4 might fix other scenarios.
(In reply to comment #38)
> >                     if (pkixErrorResult) {
> >                         /* Disregard errors. Only returned revStatus matters. */
> >                         PKIX_PL_Object_DecRef((PKIX_PL_Object*)pkixErrorResult,
> >                                               plContext);
> >                         pkixErrorResult = NULL;
> >                     }
> 
> In the last "if" statement, it seems that we should also reset
> pkixErrorClass and pkixErrorCode if we reset pkixErrorResult to
> NULL.


I originally had the same idea.

However, I can find several places where libpkix doesn't do this cleanup.
Existing code sets pkixErrorResult to NULL, but keeps *Class/*Code as is.

Examples:
- pkix_build.c uses "PKIX_DECREF(pkixErrorResult)"
- pkix_build.c uses "pkixErrorResult = NULL"
- pkix_pl_object.c uses "pkixErrorResult = NULL"
Wan-Teh said, he doesn't like adding more macros.

I'm sad, because I think using another macro is more elegant than duplicating code.

But in order to get this done, I will attach without a new macro.
This is a modified version of v3.

It avoids the new macro.
Instead, it copies the relevant code to two places.

I've added a comment to make it clear that this is a duplicated subset of the PKIX_CHECK macro, that must be adjusted accordingly, if PKIX_CHECK ever changes.
Attachment #549082 - Attachment is obsolete: true
Attachment #549082 - Flags: review?(wtc)
Attachment #550805 - Flags: review?(wtc)
Attachment #550805 - Attachment is patch: true
Attachment #550805 - Attachment mime type: text/x-patch → text/plain
Kai, I cleaned up your original patch v3 and checked it in
on the NSS trunk (NSS 3.13) and NSS_3_12_BRANCH (NSS 3.12.11).

The only two real changes were:

1. For PKIX_List_GetItem failure, I use the error class
PKIX_LISTGETITEMFAILED instead of PKIX_REVCHECKERCHECKFAILED.

2. In PKIX_CHECK_NO_GOTO, I kept the TRACE_CHECK_FAILURE.
I am not sure about this because we will still log a trace
message when we ignore a failure, but it seems that logging
the function failure is useful even if it's ignored.  You
can remove the TRACE_CHECK_FAILURE if you think it's wrong.

Please create an NSS_3_12_11_BETA3 CVS tag and push it to
mozilla-central.  Thanks a lot!

Checking in pkix/checker/pkix_revocationchecker.c;
/cvsroot/mozilla/security/nss/lib/libpkix/pkix/checker/pkix_revocationchecker.c,v  <--  pkix_revocationchecker.c
new revision: 1.11; previous revision: 1.10
done
Checking in pkix/util/pkix_tools.h;
/cvsroot/mozilla/security/nss/lib/libpkix/pkix/util/pkix_tools.h,v  <--  pkix_tools.h
new revision: 1.27; previous revision: 1.26
done

Checking in pkix/checker/pkix_revocationchecker.c;
/cvsroot/mozilla/security/nss/lib/libpkix/pkix/checker/pkix_revocationchecker.c,v  <--  pkix_revocationchecker.c
new revision: 1.10.4.1; previous revision: 1.10
done
Checking in pkix/util/pkix_tools.h;
/cvsroot/mozilla/security/nss/lib/libpkix/pkix/util/pkix_tools.h,v  <--  pkix_tools.h
new revision: 1.26.4.1; previous revision: 1.26
done
Attachment #545853 - Attachment is obsolete: true
Attachment #550805 - Attachment is obsolete: true
Attachment #550805 - Flags: review?(wtc)
moving patch v4 to bug 676858.

marking this one fixed.
Status: REOPENED → RESOLVED
Closed: 8 years ago8 years ago
Resolution: --- → FIXED
Attachment #549090 - Flags: review?(wtc)
You need to log in before you can comment on or make changes to this bug.