Closed Bug 1798053 Opened 3 years ago Closed 2 years ago

Certainly: Serving Bad OCSP Responses

Categories

(CA Program :: CA Certificate Compliance, task)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: wthayer, Assigned: wthayer)

Details

(Whiteboard: [ca-compliance] [ocsp-failure])

On 24-October 2022, Certainly became aware that our OCSP service was returning ‘unauthorized’ for some valid certificates.

  1. How your CA first became aware of the problem.
    While testing a new monitoring tool, a Certainly Engineer found that OCSP queries for the certificate on the www.certainly.com website were returning errors.

  2. A timeline of the actions your CA took in response.

  • 10/17 19:11 UTC Deploy change to Certainly’s serial number prefixes
  • 10/24 13:15 UTC Certainly engineer detects ‘unauthorized’ OCSP response for a valid certificate. Investigation begins.
  • 10/24 14:28 UTC Incident declared
  • 10/24 16:30 UTC Fix applied to active data center
  • 10/24 17:19 UTC Bug in OCSP monitor identified
  • 10/25 21:05 UTC Fix applied to inactive data center
  • 10/26 17:18 UTC Fix for monitoring bug deployed
  • 10/26 23:13 UTC Updated testing checklist deployed
  • 10/28 19:02 UTC Incident report published
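For reference, the intervals implied by the timeline above can be worked out directly from its timestamps. The following is a minimal sketch: the year 2022 is taken from the report, while the helper name `utc` and the interval labels are illustrative and not part of the report.

```python
from datetime import datetime, timezone


def utc(month, day, hour, minute, year=2022):
    """Build a timezone-aware UTC timestamp (year 2022, per the report)."""
    return datetime(year, month, day, hour, minute, tzinfo=timezone.utc)


# Timestamps copied from the timeline above.
deployed = utc(10, 17, 19, 11)      # serial number prefix change deployed
detected = utc(10, 24, 13, 15)      # 'unauthorized' response first noticed
fixed_active = utc(10, 24, 16, 30)  # fix applied to active data center
fixed_all = utc(10, 25, 21, 5)      # fix applied to inactive data center

intervals = {
    "deploy to detection": detected - deployed,
    "detection to fix (active data center)": fixed_active - detected,
    "deploy to fix (active data center)": fixed_active - deployed,
    "deploy to full restoration": fixed_all - deployed,
}

for label, delta in intervals.items():
    print(f"{label}: {delta}")
```

With these inputs, the change went undetected for 6 days, 18:04:00, and the active data center was corrected 3:15:00 after detection.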
  3. Whether your CA has stopped, or has not yet stopped, certificate issuance or the process giving rise to the problem or incident.
    The problem was not related to the certificates themselves. Correct OCSP responses were fully restored for all certificates on 10/25 at 21:05 UTC.

  4. In a case involving certificates, a summary of the problematic certificates. For each problem: the number of certificates, and the date the first and last certificates with that problem were issued. In other incidents that do not involve enumerating the affected certificates (e.g. OCSP failures, audit findings, delayed responses, etc.), please provide other similar statistics, aggregates, and a summary for each type of problem identified. This will help us measure the severity of each problem.
    During the incident, an OCSP query for any unexpired certificate issued before the start of the incident would have returned an ‘unauthorized’ response rather than the correct status of ‘good’ or ‘revoked’. This issue therefore affected OCSP responses for approximately 50,000 certificates issued between 18-September and 17-October 2022.

  5. In a case involving certificates, the complete certificate data for the problematic certificates.
    Not applicable.

  6. Explanation about how and why the mistakes were made or bugs introduced, and how they avoided detection until now.
    Certainly recently decided to modify the construction of the serial numbers in our leaf certificates. Boulder adds a configurable prefix to the random number generated as a certificate’s serial number. Certainly had previously used a fixed prefix of ‘FF’ (hexadecimal) for all certificates, but decided to move to a unique prefix per environment. Prior to deployment, this change was tested, and OCSP responses were observed to be generated properly for both new and existing certificates. We failed to recognize that the ocsp-responder service checks the serial number prefix on incoming requests and responds ‘unauthorized’ if the prefix is not listed in its configuration. To change from one prefix to another, both the old and new prefixes must therefore remain in the active configuration until all certificates using the old prefix have expired.
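The prefix check described above can be illustrated with a minimal sketch. This is not Boulder's code: the function names, the example serials, and the new prefix value `1a` are hypothetical, and only the old prefix ‘FF’ comes from the report. The sketch also assumes the prefix is the first byte (two hex digits) of the serial number.

```python
from typing import Iterable

OLD_PREFIX = "ff"  # prefix previously used for all certificates (from the report)
NEW_PREFIX = "1a"  # hypothetical per-environment prefix, made up for illustration


def serial_prefix_allowed(serial_hex: str, allowed_prefixes: Iterable[str]) -> bool:
    """Return True if the serial's first byte (two hex digits) is an allowed prefix."""
    normalized = serial_hex.replace(":", "").lower()
    return normalized[:2] in {p.lower() for p in allowed_prefixes}


def responder_status(serial_hex: str, allowed_prefixes: Iterable[str]) -> str:
    """Mimic the described behavior: unknown prefixes are answered 'unauthorized'."""
    if not serial_prefix_allowed(serial_hex, allowed_prefixes):
        return "unauthorized"
    # A real responder would now look up and return the signed response.
    return "lookup"


existing_cert = "ff" + "00" * 17  # hypothetical serial issued before the change
new_cert = "1a" + "00" * 17       # hypothetical serial issued after the change

# Configuration listing only the new prefix: existing certificates are rejected.
print(responder_status(existing_cert, [NEW_PREFIX]))              # unauthorized
# Configuration listing both prefixes: old and new certificates are both served.
print(responder_status(existing_cert, [OLD_PREFIX, NEW_PREFIX]))  # lookup
print(responder_status(new_cert, [OLD_PREFIX, NEW_PREFIX]))       # lookup
```

The first call reproduces the failure mode in the report: the stored response may exist, but the request is rejected before any lookup because the old prefix is missing from the allowed list.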
    We have two checks: an internal check that responses are being generated and kept current over the life of a certificate, and an external, full-path check that these responses are actually available from the responder. In this case the responses were still being pregenerated internally, so the internal check passed. On investigation, we also discovered a bug in the external check that prevented it from firing an alert in this specific scenario, so the issue was not identified when the change was first deployed.
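An external, full-path check of the kind described above has to treat any non-successful responder status as a failure when the certificate is known to be valid. The report does not describe the monitor's implementation or the nature of its bug, so the following is only a sketch under stated assumptions: the function names are hypothetical, the status codes are those defined for OCSPResponseStatus in RFC 6960, and fetching the response over HTTP, verifying its signature, and checking the certificate status and validity window are all omitted.

```python
# OCSPResponseStatus values from RFC 6960 (value 4 is not used).
OCSP_STATUS = {
    0: "successful",
    1: "malformedRequest",
    2: "internalError",
    3: "tryLater",
    5: "sigRequired",
    6: "unauthorized",
}


def ocsp_response_status(der: bytes) -> str:
    """Read responseStatus from the start of a DER-encoded OCSPResponse."""
    if len(der) < 2 or der[0] != 0x30:
        raise ValueError("not a DER SEQUENCE")
    # Skip the outer length field: one byte in short form, 1 + n bytes in long form.
    first_len = der[1]
    offset = 2 if first_len < 0x80 else 2 + (first_len & 0x7F)
    if len(der) < offset + 3:
        raise ValueError("truncated OCSP response")
    # responseStatus is an ENUMERATED (tag 0x0a) with a one-byte value.
    if der[offset:offset + 2] != b"\x0a\x01":
        raise ValueError("expected ENUMERATED responseStatus")
    code = der[offset + 2]
    return OCSP_STATUS.get(code, f"unknown({code})")


def should_alert(der: bytes) -> bool:
    """Alert on anything other than 'successful' for a certificate known to be valid."""
    return ocsp_response_status(der) != "successful"


# An error response carries only the status: SEQUENCE { ENUMERATED 6 }.
unauthorized = bytes.fromhex("30030a0106")
print(ocsp_response_status(unauthorized))  # unauthorized
print(should_alert(unauthorized))          # True
```

A check written this way fails closed: an ‘unauthorized’ answer for a certificate that should be ‘good’ raises an alert instead of being counted as a reachable responder.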

  7. List of steps your CA is taking to resolve the situation and ensure that such situation or incident will not be repeated in the future, accompanied with a binding timeline of when your CA expects to accomplish each of these remediation steps.
    Changing serial number prefixes is a rare occurrence and had not been performed before, so there was no documented process for doing so. In our testing the change appeared to function as expected. Our tests did not detect this problem because responses were being generated properly and a second bug was silencing one of the tests. We have added a manual verification of the external check to our checklist and have begun testing for this specific condition prior to deploying changes to Boulder.
    Our last incident (https://bugzilla.mozilla.org/show_bug.cgi?id=1771238) involved publishing expired OCSP responses, and remediation work included configuring several additional monitors. We have monitors for stale responses, for valid responses, and for overall service availability. In this case, one of those monitors should have detected the problem, but had a bug of its own.
    In summary, our remediation plan is as follows.

Action item | Due date
--- | ---
Fix Boulder configuration to permit responses for the old certificates. | Done
Add checks for valid OCSP responses being served for already existing certificates to the testing checklist that is performed before each Boulder deployment. | Done
Fix the check for existing certificates in the external OCSP validity monitors. | Done

Assignee: bwilson → wthayer

Certainly has completed our remediation of this incident. We are monitoring this bug and will respond to any comments.

We're continuing to monitor this bug for questions or feedback.

Product: NSS → CA Program

Remediation is complete and we have not received any questions. We will continue to monitor this bug until it is closed.

Are there any additional questions or issues to be raised by the community? If not, I plan to close this on or about Wed. 23-Nov-2022.

Flags: needinfo?(bwilson)

Certainly representatives continue to monitor this bug for questions or feedback.

Status: NEW → RESOLVED
Closed: 2 years ago
Flags: needinfo?(bwilson)
Resolution: --- → FIXED
Whiteboard: [ca-compliance] [ocsp-failure]