Closed Bug 1786313 Opened 2 years ago Closed 2 years ago

DFN-PKI: OCSP/CRL inconsistencies

Categories

(CA Program :: CA Certificate Compliance, task)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: brauckmann, Assigned: brauckmann)

Details

(Whiteboard: [ca-compliance] [crl-failure] [ocsp-failure])

Attachments

(2 files)

From Friday 2022-08-19 08:34 UTC to Saturday 2022-08-20 11:30 UTC, DFN-PKI suffered loss of revocation state for 6800 non-expired, revoked certificates. The affected CAs are sub-CAs of "DFN-Verein Certification Authority 2" (https://crt.sh/?caid=22818), and are operated by DFN-Verein.

For those certificates, our OCSP service responded with "good", and the corresponding entries did not appear on the CRL.

This is a preliminary report and will be amended.

  1. How your CA first became aware of the problem (e.g. via a problem report submitted to your Problem Reporting Mechanism, a discussion in the MDSP mailing list, a Bugzilla bug, or internal self-audit), and the time and date.

On 2022-08-19 10:04 UTC, internal monitoring reported unexpected results from OCSP responders.

  2. A timeline of the actions your CA took in response. A timeline is a date-and-time-stamped sequence of all relevant events. This may include events before the incident was reported, such as when a particular requirement became applicable, or a document changed, or a bug was introduced, or an audit was done.

2022-08-18 ~20:30 UTC: Network interruption between internal systems due to faulty hardware.
2022-08-19 08:35 UTC: Replacement of hardware and restoration of network connectivity completed.

2022-08-19 10:04 UTC: Internal monitoring reports OCSP inconsistencies. PKI operations manager and admin team start the investigation.
2022-08-19 11:14 UTC: Two particular OCSP responders are taken offline, as the investigation seems to point to problems with those two systems.
2022-08-19 11:19 UTC: Internal monitoring still reports OCSP inconsistencies. Investigation is widened.
2022-08-19 12:07 UTC: Certificate issuance stopped to support data collection and analysis.
2022-08-19 12:44 UTC: Investigation team expanded with IT security officer; more internal stakeholders notified.
2022-08-19 13:49 UTC: Corruption confirmed in a data store where certificate and revocation data is held.
2022-08-20 09:20 UTC: Work to correct and validate the data store finished.
2022-08-20 09:30 UTC: First corrected CRLs issued.
2022-08-20 09:35 UTC: Certificate issuance resumed.
2022-08-20 11:20 UTC: Manual validation of OCSP state finished; all revocation info as expected.

  3. Whether your CA has stopped, or has not yet stopped, certificate issuance or the process giving rise to the problem or incident. A statement that you have stopped will be considered a pledge to the community; a statement that you have not stopped requires an explanation.

We stopped issuance on 2022-08-19 12:07 UTC. Certificate issuance itself was not affected by this incident, but we decided that a stop was necessary to guarantee correct analysis and recovery.

We resumed issuance on 2022-08-20 09:35 UTC, after the database corruption had been repaired and the correct CRL/OCSP status had been restored.

  4. In a case involving certificates, a summary of the problematic certificates. For each problem: the number of certificates, and the date the first and last certificates with that problem were issued. In other incidents that do not involve enumerating the affected certificates (e.g. OCSP failures, audit findings, delayed responses, etc.), please provide other similar statistics, aggregates, and a summary for each type of problem identified. This will help us measure the severity of each problem.

In our current investigation, we identified 6800 revoked, unexpired certificates (out of 923521 certificates) with incorrect revocation status: our OCSP systems returned "good" for them, and they did not appear on the CRL.

  5. In a case involving TLS server certificates, the complete certificate data for the problematic certificates. The recommended way to provide this is to ensure each certificate is logged to CT and then list the fingerprints or crt.sh IDs, either in the report or as an attached spreadsheet, with one list per distinct problem. It is also recommended that you use this form in your list "https://crt.sh/?sha256=[sha256-hash]", unless circumstances dictate otherwise. When the incident being reported involves an SMIME certificate, if disclosure of personally identifiable information in the certificate may be contrary to applicable law, please provide at least the certificate serial number and SHA256 hash of the certificate. In other cases not involving a review of affected certificates, please provide other similar, relevant specifics, if any.

Refer to 4.

  6. Explanation about how and why the mistakes were made or bugs introduced, and how they avoided detection until now.

The current state of the investigation leads us to believe that:

  • Network connectivity problems led to an overload situation on a CA system
  • This in turn led to corruption of a data store holding certificate and revocation data
  • The corrupted data was then propagated to CRLs and OCSP responses

The data store in question had not shown any signs of corruption issues until now. We need to work out why we missed this potential issue in our regular system reviews.

The investigation continues, and the findings may be amended/changed.

  7. List of steps your CA is taking to resolve the situation and ensure that such situation or incident will not be repeated in the future, accompanied with a binding timeline of when your CA expects to accomplish each of these remediation steps.
  • The investigation into what happened is still ongoing. We will amend this report no later than Friday 2022-08-26.

  • A root cause analysis will be carried out and results will be published here by Friday 2022-09-02.

  • In any case, we will implement safeguards against data store corruption. This will involve at least:

    • Development effort to introduce system changes that prevent corruption
    • Enhancements to monitoring/integrity checking mechanisms
      A timeline for both enhancements is not yet established.
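
To give a sense of the kind of integrity check we have in mind, the following is a minimal sketch (in Python, using the cryptography and requests libraries) that cross-checks known-revoked certificates against a published CRL and a live OCSP responder. The URLs, the issuer file name, and the way revoked certificates are fed in on the command line are assumptions made for illustration only, not our actual interfaces.

    # Minimal sketch of an OCSP/CRL consistency check. CRL_URL, OCSP_URL and
    # ISSUER_PEM are placeholder values; feeding the known-revoked certificates
    # in as PEM files on the command line is an assumption for this sketch only.
    import sys
    import requests
    from cryptography import x509
    from cryptography.hazmat.primitives import hashes, serialization
    from cryptography.x509 import ocsp

    CRL_URL = "http://cdp.example.org/ca.crl"      # placeholder CRL distribution point
    OCSP_URL = "http://ocsp.example.org"           # placeholder OCSP responder
    ISSUER_PEM = "issuing-ca.pem"                  # placeholder issuing CA certificate

    def ocsp_status(cert, issuer):
        """Query the OCSP responder for one certificate and return its status."""
        req = (ocsp.OCSPRequestBuilder()
               .add_certificate(cert, issuer, hashes.SHA1())
               .build())
        raw = requests.post(OCSP_URL,
                            data=req.public_bytes(serialization.Encoding.DER),
                            headers={"Content-Type": "application/ocsp-request"},
                            timeout=30).content
        resp = ocsp.load_der_ocsp_response(raw)
        if resp.response_status != ocsp.OCSPResponseStatus.SUCCESSFUL:
            return None
        return resp.certificate_status

    def main(revoked_cert_paths):
        with open(ISSUER_PEM, "rb") as f:
            issuer = x509.load_pem_x509_certificate(f.read())
        crl = x509.load_der_x509_crl(requests.get(CRL_URL, timeout=30).content)
        inconsistent = 0
        for path in revoked_cert_paths:
            with open(path, "rb") as f:
                cert = x509.load_pem_x509_certificate(f.read())
            on_crl = crl.get_revoked_certificate_by_serial_number(cert.serial_number) is not None
            status = ocsp_status(cert, issuer)
            # A certificate we know to be revoked must appear on the CRL and be
            # reported as revoked by OCSP; anything else is flagged.
            if not on_crl or status != ocsp.OCSPCertStatus.REVOKED:
                inconsistent += 1
                print(f"INCONSISTENT serial={cert.serial_number:x} on_crl={on_crl} ocsp={status}")
        return 1 if inconsistent else 0

    if __name__ == "__main__":
        sys.exit(main(sys.argv[1:]))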

5.) In a case involving TLS server certificates, the complete certificate data for the problematic certificates
Refer to 4.

4.) In a case involving certificates, a summary of the problematic certificates

This Section 4 does not contain the complete certificate data required by Section 5. Could you provide a better set of the complete certificate data for the problematic certificates? Considering the number of certificates that are involved, an attachment would be preferred.

Summary: Titel: DFN-PKI: OCSP/CRL inconsistencies → DFN-PKI: OCSP/CRL inconsistencies

(In reply to Matthias from comment #1)

Could you provide a better set of the complete certificate data for the problematic certificates? Considering the number of certificates that are involved, an attachment would be preferred.

Yes, we are preparing certificate data.

I attached two lists of affected certificates, separated into S/MIME and TLS certificates. Please note that our OCSP responders answer "unauthorized" for expired certificates, and some of the affected certificates have expired since the incident.

Current status:

  • We found a bug in our software that may trigger corruption of the affected data store in combination with longer-running (> 5 min) network availability issues. A fix was developed and is currently being tested. If testing completes successfully, the fix will be deployed by 2022-08-31.

  • Until the fix is deployed, network availability issues lasting longer than 5 minutes will (organisationally) trigger a manual integrity check.

  • Root cause analysis is underway.

  • A timeline for development efforts and enhancements to monitoring/integrity checking mechanisms is currently being established.

Assignee: bwilson → brauckmann
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Whiteboard: [ca-compliance]

Activities:

  • After successful testing, we installed an update to our software on 2022-08-31 to fix the bug mentioned in comment #5
  • Enhancements to monitoring/integrity checking are expected to be completed on 2022-09-09
  • The need for and timeline of further development efforts are under investigation

Root Cause Analysis:

  • System test coverage for high-load scenarios was designed for single workload types only and did not adequately consider combinations of different load types. Due to this gap in test coverage for simultaneous loads, the bug in question was not caught.
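
To illustrate what such a combined-load scenario looks like, here is a minimal, self-contained sketch in Python. The InMemoryCA class is a deliberately simplified stand-in so the sketch can run on its own; it is not our CA software, and a real system test would drive the actual CA under test instead.

    # Sketch of a combined-load test: issuance, revocation and CRL generation
    # run concurrently, and the test asserts that every revoked serial ends up
    # on the final CRL. InMemoryCA is a simplified stand-in, not the real CA.
    import itertools
    import threading

    class InMemoryCA:
        def __init__(self):
            self._lock = threading.Lock()
            self._serials = itertools.count(1)
            self.revoked = set()

        def issue(self):
            with self._lock:
                return next(self._serials)

        def revoke(self, serial):
            with self._lock:
                self.revoked.add(serial)

        def generate_crl(self):
            with self._lock:
                return frozenset(self.revoked)

    def combined_load_test(ca, n=5000):
        revoked = []

        def issue_and_revoke():
            for i in range(n):
                serial = ca.issue()
                if i % 3 == 0:          # revoke a third of what we issue
                    ca.revoke(serial)
                    revoked.append(serial)

        def crl_load():
            for _ in range(n):
                ca.generate_crl()       # CRL generation racing with issuance/revocation

        threads = [threading.Thread(target=issue_and_revoke),
                   threading.Thread(target=issue_and_revoke),
                   threading.Thread(target=crl_load)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()

        missing = set(revoked) - ca.generate_crl()
        assert not missing, f"{len(missing)} revoked serials missing from the final CRL"

    if __name__ == "__main__":
        combined_load_test(InMemoryCA())
        print("combined-load test passed")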

Implications:

  • System test coverage will be reviewed for other areas that may also be lacking. This task is planned with an end date of 2022-10-31.
  • Enhancements to monitoring/integrity checking have been completed
  • The investigation into the need for and timeline of further development efforts is estimated to be completed by 2022-09-23

We plan to give the next update on 2022-09-23, and will monitor this bug for further questions/comments.

Whiteboard: [ca-compliance] → [ca-compliance] Next update 2022-09-24

Update on the task "Investigation for need and timeline for further development efforts": Our investigation leads us to the conclusion that the bug fix from 2022-08-31, in combination with the ongoing thorough examination of system test coverage, is sufficient to mitigate the problem. The root cause of this incident is a gap in system test coverage, and we will focus on improvements in that area, as already stated in my comment #6.

The task "System test coverage will be reviewed ..." is ongoing (planned end date of 2022-10-31).

We plan to give the next update on 2022-10-07, and will monitor this bug for further questions/comments.

Whiteboard: [ca-compliance] Next update 2022-09-24 → [ca-compliance]

Unchanged: The task "System test coverage will be reviewed ..." is ongoing, planned end date 2022-10-31.

We plan to give the next update on 2022-10-28, and will monitor this bug for further questions/comments.

We reviewed our system test coverage and found a corner case that needs consideration. We added an appropriate test to our test plan.

We consider this incident to be resolved. We will monitor this bug for further comments/questions, but don't have any further planned updates to this bug.

Can you please share a bit more about the corner case you discovered and what was the additional test you added to help other CAs avoid a similar gap? This is very much appreciated.

Generating CRLs while our system is simultaneously under high issuance and revocation load was not covered in our system load tests until now.

Our system automatically generates a CRL after a block of revocation requests has been processed. Additional requests for generating CRLs (from whatever other source) are queued, and there are protection mechanisms against interference between those processes. However, there were no explicit tests in our test plan for this scenario.
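
As a rough illustration of this queueing approach, the following is a minimal sketch (not our actual implementation) of how CRL generation requests from multiple sources can be serialized through a single worker so that they cannot interleave. The generate_crl callable is hypothetical and stands in for the code that actually builds, signs and publishes a CRL.

    # Sketch: serializing CRL generation requests through a single worker so
    # that concurrent revocation batches and ad-hoc requests cannot interleave.
    import queue
    import threading

    crl_requests = queue.Queue()

    def request_crl(reason):
        """Enqueue a CRL generation request (e.g. after a revocation batch finishes)."""
        crl_requests.put(reason)

    def crl_worker(generate_crl):
        """Single worker drains the queue, so only one CRL is generated at a time."""
        while True:
            reason = crl_requests.get()
            try:
                generate_crl(reason)   # hypothetical: build, sign and publish the CRL
            finally:
                crl_requests.task_done()

    # Example wiring:
    # threading.Thread(target=crl_worker, args=(publish_crl,), daemon=True).start()
    # request_crl("revocation batch finished")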

Thank you for the additional information.

I will close this on or about Wed. 2-Nov-2022 unless further discussion is needed.

Flags: needinfo?(bwilson)
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Flags: needinfo?(bwilson)
Resolution: --- → FIXED
Product: NSS → CA Program
Whiteboard: [ca-compliance] → [ca-compliance] [crl-failure] [ocsp-failure]
