Closed Bug 1806728 Opened 2 years ago Closed 1 year ago

IdenTrust: Bad OCSP Responses

Categories

(CA Program :: CA Certificate Compliance, task)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: roots, Assigned: roots)

Details

(Whiteboard: [ca-compliance] [ocsp-failure])

User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36

Steps to reproduce:

On December 15, 2022, IdenTrust was performing a scheduled 24+ hour test in which OCSP traffic was diverted to a CDN for scalability purposes, and we received reports from some customers of timeout errors when retrieving TLS CRLs. We are continuing to investigate, and a full incident report will be provided no later than December 30, 2022.

Summary: IdenTrust- Bad OCPS Responses → IdenTrust- Bad OCSP Responses
Assignee: nobody → roots
Status: UNCONFIRMED → ASSIGNED
Type: defect → task
Ever confirmed: true
Summary: IdenTrust- Bad OCSP Responses → IdenTrust: Bad OCSP Responses

Full Incident Report:

  1. How your CA first became aware of the problem (e.g. via a problem report submitted to your Problem Reporting Mechanism, a discussion in mozilla.dev.security.policy, a Bugzilla bug, or internal self-audit), and the time and date.

On December 15, 2022, IdenTrust conducted a planned 24-hour test of its OCSP traffic, which was being served from cloud infrastructure (AWS) instead of on-premise infrastructure in order to test the auto-scaling capability. During this test, it was reported that a small number of certificates were mistakenly displaying as "revoked" when checked through OCSP.

Upon further investigation, it was determined that there were problems with the replication of OCSP data to AWS. The replication process missed updates to the status of the source certificates, and as a result, incorrectly mapped the status of the target certificates to a "revoked" state with an empty "revocationTime" field in AWS. A total of 326 certificates were affected, and the incorrect OCSP response was served 4,257 times over a 20-hour period. During this time, a total of 76,298,215 OCSP responses were served correctly.

This issue did not align with IdenTrust's Certificate Practice Statement (CPS), Section 2.2.1, which states that "IdenTrust operates and maintains CRL and OCSP capability with resources sufficient to provide a response time of ten seconds or less."
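For illustration only, the sketch below shows the kind of external OCSP spot check that would surface both symptoms described above: a "revoked" answer whose revocationTime and revocationReason warrant inspection, and a response slower than the ten-second target quoted from the CPS. It is a minimal sketch, not IdenTrust's tooling; the certificate file names and responder URL are placeholders, and it assumes the Python cryptography and requests packages.

```python
# Minimal sketch of an external OCSP spot check, assuming the end-entity and
# issuing-CA certificates are available as local PEM files and the responder
# URL is known (e.g. from the certificate's AIA extension). File names and
# the URL below are placeholders.
import time

import requests
from cryptography import x509
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.serialization import Encoding
from cryptography.x509 import ocsp

CERT_PATH = "leaf.pem"                 # hypothetical end-entity certificate
ISSUER_PATH = "issuer.pem"             # hypothetical issuing CA certificate
OCSP_URL = "http://ocsp.example.com"   # placeholder responder URL

with open(CERT_PATH, "rb") as f:
    cert = x509.load_pem_x509_certificate(f.read())
with open(ISSUER_PATH, "rb") as f:
    issuer = x509.load_pem_x509_certificate(f.read())

# Build a single-certificate OCSP request.
request = (
    ocsp.OCSPRequestBuilder()
    .add_certificate(cert, issuer, hashes.SHA1())
    .build()
)

start = time.monotonic()
http_resp = requests.post(
    OCSP_URL,
    data=request.public_bytes(Encoding.DER),
    headers={"Content-Type": "application/ocsp-request"},
    timeout=10,  # the CPS language quoted above targets ten seconds or less
)
elapsed = time.monotonic() - start
print(f"OCSP response received in {elapsed:.2f}s")

response = ocsp.load_der_ocsp_response(http_resp.content)
if response.response_status == ocsp.OCSPResponseStatus.SUCCESSFUL:
    print("certificate status:", response.certificate_status)
    if response.certificate_status == ocsp.OCSPCertStatus.REVOKED:
        # The failure mode in this incident: a "revoked" status whose
        # revocationTime was not populated from the source data.
        print("revocationTime:", response.revocation_time)
        print("revocationReason:", response.revocation_reason)
else:
    print("responder error:", response.response_status)
```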

  2. A timeline of the actions your CA took in response. A timeline is a date-and-time-stamped sequence of all relevant events. This may include events before the incident was reported, such as when a particular requirement became applicable, or a document changed, or a bug was introduced, or an audit was done.

2022-12-14 16:05 MST: The process of diverting OCSP traffic from on-premise to AWS began.
2022-12-15 09:30 MST: Reports were received from customers experiencing connectivity issues.
2022-12-15 09:41 MST: Investigation and troubleshooting process began. The initial report concerned CRLs and not OCSP.
2022-12-15 11:36 MST: A data issue was identified and the diversion of OCSP traffic was stopped.
2022-12-16 10:41 MST: Confirmation was received from customers that the issue had been resolved.

  3. Whether your CA has stopped, or has not yet stopped, issuing certificates with the problem. A statement that you have will be considered a pledge to the community; a statement that you have not requires an explanation.

This incident did not pertain to the issuance of TLS certificates.

  4. A summary of the problematic certificates. For each problem: number of certs, and the date the first and last certs with that problem were issued.

Not applicable

  5. The complete certificate data for the problematic certificates. The recommended way to provide this is to ensure each certificate is logged to CT and then list the fingerprints or crt.sh IDs, either in the report or as an attached spreadsheet, with one list per distinct problem.

Not applicable

  6. Explanation about how and why the mistakes were made or bugs introduced, and how they avoided detection until now.

During the process of issuing certificates, the internal system workflow updates the status of the certificate multiple times, for instance, from "Requested" to "Approved for Issuance" to "Issuance Initiated" to "Awaiting Activation" to "Awaiting Retrieval" to "Retrieval Completed," and so on. The data replication process uses a "Continuous Query Notification" feature to monitor for changes in status and update the replicated database accordingly.

Two issues were identified that impacted the production system:

  1. Under high production loads, the database's "Continuous Query Notification" feature did not work properly and some status notification events were not generated. Multiple instances of certificates being updated in the source database without corresponding change events being generated were identified.
  2. There was an error in the replication status mapping, where a status that should have been identified as "valid" in OCSP was instead mapped as "revoked." For example, a pre-certificate generated at the "Awaiting Activation" phase should have been marked as "valid," but was instead mapped as "revoked" with an empty "revocationTime" and "revocationReason." Due to the missing change notifications from issue 1, this certificate was never updated to "valid." (A minimal illustration of the intended mapping follows this list.)
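As an illustration of the mapping error in issue 2, the sketch below shows a deliberately simple table from internal workflow states to OCSP statuses. The state names mirror those quoted above, but the code is hypothetical rather than a description of IdenTrust's replicator; the point is that pre-issuance states such as "Awaiting Activation" map to a good status, and that a "revoked" status is refused unless a revocationTime is supplied.

```python
# Hypothetical mapping of internal workflow states to OCSP certificate
# statuses. The state names mirror those quoted in the report; the mapping
# itself is an illustration of the intended behaviour, not IdenTrust's code.
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

OCSP_GOOD = "good"
OCSP_REVOKED = "revoked"

# Every workflow state that is not an actual revocation must be reported as
# good; only a genuine revocation maps to "revoked".
STATUS_MAP = {
    "Requested": OCSP_GOOD,
    "Approved for Issuance": OCSP_GOOD,
    "Issuance Initiated": OCSP_GOOD,
    "Awaiting Activation": OCSP_GOOD,   # the state mis-mapped in this incident
    "Awaiting Retrieval": OCSP_GOOD,
    "Retrieval Completed": OCSP_GOOD,
    "Revoked": OCSP_REVOKED,
}


@dataclass
class ReplicatedStatus:
    ocsp_status: str
    revocation_time: Optional[datetime] = None
    revocation_reason: Optional[str] = None


def map_status(workflow_state: str,
               revocation_time: Optional[datetime] = None,
               revocation_reason: Optional[str] = None) -> ReplicatedStatus:
    ocsp_status = STATUS_MAP[workflow_state]
    if ocsp_status == OCSP_REVOKED and revocation_time is None:
        # Guard against the failure mode seen here: never emit "revoked"
        # without a populated revocationTime.
        raise ValueError("revoked status requires a revocationTime")
    return ReplicatedStatus(ocsp_status, revocation_time, revocation_reason)


assert map_status("Awaiting Activation").ocsp_status == OCSP_GOOD
```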

Data and load testing was conducted prior to the move to production, but the testing process had some flaws. The tests performed included:

  1. Verified that the total number of certificates in the source and target databases matched.

  2. Confirmed that all certificates had the correct issuer and serialNumber replicated.

  3. Issued OCSP requests for all 2.3+ million certificates in the replicated cloud responders, which returned correctly formatted OCSP responses (i.e. not "unauthorized" or "unknown").

  4. Checked a sample of certificates to ensure that the revocationReason and revocationTime matched in the source and target. (A sketch of a full, rather than sampled, comparison follows this list.)
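Because the fourth test only sampled certificates, the mis-mapped entries escaped detection. Below is a minimal sketch of a full (100%) reconciliation of revocation data between the source and the replicated target; the table and column names are hypothetical, and the two connections are assumed to be ordinary DB-API connections to the respective databases.

```python
# Sketch of a 100% (rather than sampled) reconciliation of certificate status
# data between the source CA database and the replicated target. The table and
# column names are hypothetical; the connections are assumed to be DB-API
# compatible (e.g. sqlite3, cx_Oracle, psycopg2).
QUERY = """
    SELECT issuer_dn, serial_number, status, revocation_time, revocation_reason
    FROM certificate_status
"""


def load_status(conn):
    """Return {(issuer, serial): (status, revocation_time, revocation_reason)}."""
    cur = conn.cursor()
    cur.execute(QUERY)
    return {
        (issuer, serial): (status, rev_time, rev_reason)
        for issuer, serial, status, rev_time, rev_reason in cur.fetchall()
    }


def reconcile(source_conn, target_conn):
    """Compare every record and return the full list of discrepancies."""
    source = load_status(source_conn)
    target = load_status(target_conn)
    mismatches = [
        (key, source[key], target.get(key))
        for key in source
        if target.get(key) != source[key]
    ]
    only_in_target = sorted(set(target) - set(source))
    return mismatches, only_in_target
```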

  7. List of steps your CA is taking to resolve the situation and ensure such issuance will not be repeated in the future, accompanied with a timeline of when your CA expects to accomplish these things.

To correct and prevent a recurrence of this issue, the following measures will be implemented:
a) The architecture for the data replication process will no longer use the "Continuous Query Notification" feature, but will instead rely on database triggers and a staging table for any status updates. The staging table will only be updated after a status update has been verified as replicated to the target. (A minimal sketch of this pattern follows this list.)
b) The testing procedure will be modified to ensure that 100% of the source and target revocationTime and revocationReasons match.
c) The source and target system will run concurrently for a period of time, with multiple runs of step b to validate that they remain in sync before traffic is directed to the cloud infrastructure.
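For illustration, here is a minimal sketch of the trigger-plus-staging-table pattern described in item (a), using SQLite only so that the example is self-contained; all table, column, and function names are hypothetical, and the real deployment would use the CA's own database and replication tooling.

```python
# Minimal sketch of the trigger + staging-table pattern described in item (a),
# using SQLite purely for illustration. The essential property is that a
# staging row is marked as replicated only after the corresponding write has
# been confirmed in the target.
import sqlite3

source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")

source.executescript("""
    CREATE TABLE certificate_status (
        serial TEXT PRIMARY KEY,
        status TEXT NOT NULL
    );
    -- Staging table populated by a trigger on every status change (a real
    -- system would also capture inserts and revocation details).
    CREATE TABLE replication_staging (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        serial TEXT NOT NULL,
        status TEXT NOT NULL,
        replicated INTEGER NOT NULL DEFAULT 0
    );
    CREATE TRIGGER capture_status_change
    AFTER UPDATE OF status ON certificate_status
    BEGIN
        INSERT INTO replication_staging (serial, status)
        VALUES (NEW.serial, NEW.status);
    END;
""")
target.execute(
    "CREATE TABLE certificate_status (serial TEXT PRIMARY KEY, status TEXT)"
)


def replicate_pending():
    """Push unreplicated staging rows to the target, then mark them done."""
    pending = source.execute(
        "SELECT id, serial, status FROM replication_staging WHERE replicated = 0"
    ).fetchall()
    for row_id, serial, status in pending:
        target.execute(
            "INSERT OR REPLACE INTO certificate_status (serial, status) VALUES (?, ?)",
            (serial, status),
        )
        target.commit()
        # Verify the write landed before marking the staging row as replicated.
        (confirmed,) = target.execute(
            "SELECT status FROM certificate_status WHERE serial = ?", (serial,)
        ).fetchone()
        if confirmed == status:
            source.execute(
                "UPDATE replication_staging SET replicated = 1 WHERE id = ?",
                (row_id,),
            )
            source.commit()


# Example: a status change on the source is staged by the trigger, then
# replicated and verified on the target.
source.execute("INSERT INTO certificate_status VALUES ('01AB', 'Awaiting Activation')")
source.execute(
    "UPDATE certificate_status SET status = 'Retrieval Completed' WHERE serial = '01AB'"
)
source.commit()
replicate_pending()
```

The design point is that a staging row is flagged as replicated only after the target row has been read back and verified, so a missed or failed replication remains visible as pending work rather than silently disappearing, as the missed change notifications did here.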

We expect to have these changes implemented by March 31, 2023, and will provide a status update by January 31, 2023.

Whiteboard: [ca-compliance]

We confirm that we are still on track to implement the corrective measures outlined in item 7 of comment #1 by March 31, 2023. Our next update on this matter will be provided by February 28, 2023.

Whiteboard: [ca-compliance] → [ca-compliance] [ocsp-failure]

We are still on track to implement these corrective measures by March 31, 2023:
a) The architecture for the data replication process will no longer use the "Continuous Query Notification" feature, but will instead rely on database triggers and a staging table for any status updates. The staging table will only be updated after a status update has been verified as replicated to the target.
b) The testing procedure will be modified to ensure that 100% of the source and target revocationTime and revocationReasons match.
c) The source and target system will run concurrently for a period of time, with multiple runs of step b to validate that they remain in sync before traffic is directed to the cloud infrastructure.

Our next update confirming the implementation will be on or before March 31, 2023.

Whiteboard: [ca-compliance] [ocsp-failure] → [ca-compliance] [ocsp-failure] Next update 2023-03-31

Last Friday (3/24/2023) a code change was made in the pre-production environment to address the certificate status issue originating in the EJBCA Data Replicator. However, during validation we identified additional issues that need to be resolved before we can deploy to the production environment, so we anticipate it will take at least another month before the system is fully validated and stable.

We will provide another update by April 28, 2023.

Whiteboard: [ca-compliance] [ocsp-failure] Next update 2023-03-31 → [ca-compliance] [ocsp-failure] Next update 2023-04-28

Today we confirmed that all changes implemented for this issue are working as expected, and we therefore consider this issue fully resolved.

I will close this on or about Friday, 5-May-2023.

Flags: needinfo?(bwilson)
Whiteboard: [ca-compliance] [ocsp-failure] Next update 2023-04-28 → [ca-compliance] [ocsp-failure]
Status: ASSIGNED → RESOLVED
Closed: 1 year ago
Flags: needinfo?(bwilson)
Resolution: --- → FIXED