Closed Bug 1806728 Opened 2 years ago Closed 1 year ago

IdenTrust: Bad OCSP Responses

Categories

(CA Program :: CA Certificate Compliance, task)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: roots, Assigned: roots)

Details

(Whiteboard: [ca-compliance] [ocsp-failure])

User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36

Steps to reproduce:

On December 15, 2022, IdenTrust was performing a scheduled 24+ hour test in which OCSP traffic was diverted to a CDN for scalability purposes, and we received reports from some customers of timeout errors when retrieving TLS CRLs. We are continuing to investigate, and a full incident report will be provided no later than December 30, 2022.

Summary: IdenTrust- Bad OCPS Responses → IdenTrust- Bad OCSP Responses
Assignee: nobody → roots
Status: UNCONFIRMED → ASSIGNED
Type: defect → task
Ever confirmed: true
Summary: IdenTrust- Bad OCSP Responses → IdenTrust: Bad OCSP Responses

Full Incident Report:

  1. How your CA first became aware of the problem (e.g. via a problem report submitted to your Problem Reporting Mechanism, a discussion in mozilla.dev.security.policy, a Bugzilla bug, or internal self-audit), and the time and date.

On December 15, 2022, IdenTrust conducted a planned 24-hour test of its OCSP traffic, which was being served from cloud infrastructure (AWS) instead of on-premise infrastructure in order to test the auto-scaling capability. During this test, it was reported that a small number of certificates were mistakenly displaying as "revoked" when checked through OCSP.

Upon further investigation, it was determined that there were problems with the replication of OCSP data to AWS. The replication process missed updates to the status of the source certificates, and as a result, incorrectly mapped the status of the target certificates to a "revoked" state with an empty "revocationTime" field in AWS. A total of 326 certificates were affected, and the incorrect OCSP response was served 4,257 times over a 20-hour period. During this time, a total of 76,298,215 OCSP responses were served correctly.

This issue did not align with IdenTrust's Certificate Practice Statement (CPS), Section 2.2.1, which states that "IdenTrust operates and maintains CRL and OCSP capability with resources sufficient to provide a response time of ten seconds or less."
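For illustration only, the sketch below shows the kind of external OCSP spot check that would surface both symptoms described above: a "revoked" answer whose revocationTime and revocationReason warrant inspection, and a response slower than the ten-second target quoted from the CPS. It is a minimal sketch, not IdenTrust's tooling; the certificate file names and responder URL are placeholders, and it assumes the Python cryptography and requests packages.

```python
# Minimal sketch of an external OCSP spot check, assuming the end-entity and
# issuing-CA certificates are available as local PEM files and the responder
# URL is known (e.g. from the certificate's AIA extension). File names and
# the URL below are placeholders.
import time

import requests
from cryptography import x509
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.serialization import Encoding
from cryptography.x509 import ocsp

CERT_PATH = "leaf.pem"                 # hypothetical end-entity certificate
ISSUER_PATH = "issuer.pem"             # hypothetical issuing CA certificate
OCSP_URL = "http://ocsp.example.com"   # placeholder responder URL

with open(CERT_PATH, "rb") as f:
    cert = x509.load_pem_x509_certificate(f.read())
with open(ISSUER_PATH, "rb") as f:
    issuer = x509.load_pem_x509_certificate(f.read())

# Build a single-certificate OCSP request.
request = (
    ocsp.OCSPRequestBuilder()
    .add_certificate(cert, issuer, hashes.SHA1())
    .build()
)

start = time.monotonic()
http_resp = requests.post(
    OCSP_URL,
    data=request.public_bytes(Encoding.DER),
    headers={"Content-Type": "application/ocsp-request"},
    timeout=10,  # the CPS language quoted above targets ten seconds or less
)
elapsed = time.monotonic() - start
print(f"OCSP response received in {elapsed:.2f}s")

response = ocsp.load_der_ocsp_response(http_resp.content)
if response.response_status == ocsp.OCSPResponseStatus.SUCCESSFUL:
    print("certificate status:", response.certificate_status)
    if response.certificate_status == ocsp.OCSPCertStatus.REVOKED:
        # The failure mode in this incident: a "revoked" status whose
        # revocationTime was not populated from the source data.
        print("revocationTime:", response.revocation_time)
        print("revocationReason:", response.revocation_reason)
else:
    print("responder error:", response.response_status)
```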

  2. A timeline of the actions your CA took in response. A timeline is a date-and-time-stamped sequence of all relevant events. This may include events before the incident was reported, such as when a particular requirement became applicable, or a document changed, or a bug was introduced, or an audit was done.

2022-12-14 16:05 MST: The process of diverting OCSP traffic from on-premise to AWS began.
2022-12-15 09:30 MST: Reports were received from customers experiencing connectivity issues.
2022-12-15 09:41 MST: Investigation and troubleshooting process began. The initial report concerned CRLs and not OCSP.
2022-12-15 11:36 MST: A data issue was identified and the diversion of OCSP traffic was stopped.
2022-12-16 10:41 MST: Confirmation was received from customers that the issue had been resolved.

  3. Whether your CA has stopped, or has not yet stopped, issuing certificates with the problem. A statement that you have will be considered a pledge to the community; a statement that you have not requires an explanation.

This incident did not pertain to the issuance of TLS certificates.

  4. A summary of the problematic certificates. For each problem: number of certs, and the date the first and last certs with that problem were issued.

Not applicable

  5. The complete certificate data for the problematic certificates. The recommended way to provide this is to ensure each certificate is logged to CT and then list the fingerprints or crt.sh IDs, either in the report or as an attached spreadsheet, with one list per distinct problem.

Not applicable

  6. Explanation about how and why the mistakes were made or bugs introduced, and how they avoided detection until now.

During the process of issuing certificates, the internal system workflow updates the status of the certificate multiple times, for instance, from "Requested" to "Approved for Issuance" to "Issuance Initiated" to "Awaiting Activation" to "Awaiting Retrieval" to "Retrieval Completed," and so on. The data replication process uses a "Continuous Query Notification" feature to monitor for changes in status and update the replicated database accordingly.

Two issues were identified that impacted the production system:

  1. Under high production loads, the database's "Continuous Query Notification" feature did not work properly and some status notification events were not generated. Multiple instances of certificates being updated in the source database without corresponding change events being generated were identified.
  2. There was an error in the replication status mapping, where a status that should have been identified as "valid" in OCSP was instead mapped as "revoked." For example, a pre-certificate generated at the "Awaiting Activation" phase should have been marked as "valid," but was instead mapped as "revoked" with an empty "revocationTime" and "revocationReason." Due to the missing change notifications from issue 1, this certificate was never updated to "valid." (A minimal illustration of the intended mapping follows this list.)
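As an illustration of the mapping error in issue 2, the sketch below shows a deliberately simple table from internal workflow states to OCSP statuses. The state names mirror those quoted above, but the code is hypothetical rather than a description of IdenTrust's replicator; the point is that pre-issuance states such as "Awaiting Activation" map to a good status, and that a "revoked" status is refused unless a revocationTime is supplied.

```python
# Hypothetical mapping of internal workflow states to OCSP certificate
# statuses. The state names mirror those quoted in the report; the mapping
# itself is an illustration of the intended behaviour, not IdenTrust's code.
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

OCSP_GOOD = "good"
OCSP_REVOKED = "revoked"

# Every workflow state that is not an actual revocation must be reported as
# good; only a genuine revocation maps to "revoked".
STATUS_MAP = {
    "Requested": OCSP_GOOD,
    "Approved for Issuance": OCSP_GOOD,
    "Issuance Initiated": OCSP_GOOD,
    "Awaiting Activation": OCSP_GOOD,   # the state mis-mapped in this incident
    "Awaiting Retrieval": OCSP_GOOD,
    "Retrieval Completed": OCSP_GOOD,
    "Revoked": OCSP_REVOKED,
}


@dataclass
class ReplicatedStatus:
    ocsp_status: str
    revocation_time: Optional[datetime] = None
    revocation_reason: Optional[str] = None


def map_status(workflow_state: str,
               revocation_time: Optional[datetime] = None,
               revocation_reason: Optional[str] = None) -> ReplicatedStatus:
    ocsp_status = STATUS_MAP[workflow_state]
    if ocsp_status == OCSP_REVOKED and revocation_time is None:
        # Guard against the failure mode seen here: never emit "revoked"
        # without a populated revocationTime.
        raise ValueError("revoked status requires a revocationTime")
    return ReplicatedStatus(ocsp_status, revocation_time, revocation_reason)


assert map_status("Awaiting Activation").ocsp_status == OCSP_GOOD
```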

Data and load testing was conducted prior to the move to production, but the testing process had some flaws. The tests performed included:

  1. Verified that the total number of certificates in the source and target databases matched.

  2. Confirmed that all certificates had the correct issuer and serialNumber replicated.

  3. Issued OCSP requests for all 2.3+ million certificates in the replicated cloud responders, which returned correctly formatted OCSP responses (i.e. not "unauthorized" or "unknown").

  4. Checked a sample of certificates to ensure that the revocationReason and revocationTime matched in the source and target. (A sketch of a full, rather than sampled, comparison follows this list.)
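Because the fourth test only sampled certificates, the mis-mapped entries escaped detection. Below is a minimal sketch of a full (100%) reconciliation of revocation data between the source and the replicated target; the table and column names are hypothetical, and the two connections are assumed to be ordinary DB-API connections to the respective databases.

```python
# Sketch of a 100% (rather than sampled) reconciliation of certificate status
# data between the source CA database and the replicated target. The table and
# column names are hypothetical; the connections are assumed to be DB-API
# compatible (e.g. sqlite3, cx_Oracle, psycopg2).
QUERY = """
    SELECT issuer_dn, serial_number, status, revocation_time, revocation_reason
    FROM certificate_status
"""


def load_status(conn):
    """Return {(issuer, serial): (status, revocation_time, revocation_reason)}."""
    cur = conn.cursor()
    cur.execute(QUERY)
    return {
        (issuer, serial): (status, rev_time, rev_reason)
        for issuer, serial, status, rev_time, rev_reason in cur.fetchall()
    }


def reconcile(source_conn, target_conn):
    """Compare every record and return the full list of discrepancies."""
    source = load_status(source_conn)
    target = load_status(target_conn)
    mismatches = [
        (key, source[key], target.get(key))
        for key in source
        if target.get(key) != source[key]
    ]
    only_in_target = sorted(set(target) - set(source))
    return mismatches, only_in_target
```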

  7. List of steps your CA is taking to resolve the situation and ensure such issuance will not be repeated in the future, accompanied with a timeline of when your CA expects to accomplish these things.

To correct and prevent a recurrence of this issue, the following measures will be implemented:
a) The architecture for the data replication process will no longer use the "Continuous Query Notification" feature, but will instead rely on database triggers and a staging table for any status updates. The staging table will only be updated after a status update has been verified as replicated to the target. (A minimal sketch of this pattern follows this list.)
b) The testing procedure will be modified to ensure that 100% of the source and target revocationTime and revocationReasons match.
c) The source and target system will run concurrently for a period of time, with multiple runs of step b to validate that they remain in sync before traffic is directed to the cloud infrastructure.
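For illustration, here is a minimal sketch of the trigger-plus-staging-table pattern described in item (a), using SQLite only so that the example is self-contained; all table, column, and function names are hypothetical, and the real deployment would use the CA's own database and replication tooling.

```python
# Minimal sketch of the trigger + staging-table pattern described in item (a),
# using SQLite purely for illustration. The essential property is that a
# staging row is marked as replicated only after the corresponding write has
# been confirmed in the target.
import sqlite3

source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")

source.executescript("""
    CREATE TABLE certificate_status (
        serial TEXT PRIMARY KEY,
        status TEXT NOT NULL
    );
    -- Staging table populated by a trigger on every status change (a real
    -- system would also capture inserts and revocation details).
    CREATE TABLE replication_staging (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        serial TEXT NOT NULL,
        status TEXT NOT NULL,
        replicated INTEGER NOT NULL DEFAULT 0
    );
    CREATE TRIGGER capture_status_change
    AFTER UPDATE OF status ON certificate_status
    BEGIN
        INSERT INTO replication_staging (serial, status)
        VALUES (NEW.serial, NEW.status);
    END;
""")
target.execute(
    "CREATE TABLE certificate_status (serial TEXT PRIMARY KEY, status TEXT)"
)


def replicate_pending():
    """Push unreplicated staging rows to the target, then mark them done."""
    pending = source.execute(
        "SELECT id, serial, status FROM replication_staging WHERE replicated = 0"
    ).fetchall()
    for row_id, serial, status in pending:
        target.execute(
            "INSERT OR REPLACE INTO certificate_status (serial, status) VALUES (?, ?)",
            (serial, status),
        )
        target.commit()
        # Verify the write landed before marking the staging row as replicated.
        (confirmed,) = target.execute(
            "SELECT status FROM certificate_status WHERE serial = ?", (serial,)
        ).fetchone()
        if confirmed == status:
            source.execute(
                "UPDATE replication_staging SET replicated = 1 WHERE id = ?",
                (row_id,),
            )
            source.commit()


# Example: a status change on the source is staged by the trigger, then
# replicated and verified on the target.
source.execute("INSERT INTO certificate_status VALUES ('01AB', 'Awaiting Activation')")
source.execute(
    "UPDATE certificate_status SET status = 'Retrieval Completed' WHERE serial = '01AB'"
)
source.commit()
replicate_pending()
```

The design point is that a staging row is flagged as replicated only after the target row has been read back and verified, so a missed or failed replication remains visible as pending work rather than silently disappearing, as the missed change notifications did here.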

We expect to have these changes implemented by March 31, 2023, and will provide a status update by January 31, 2023.

Whiteboard: [ca-compliance]

We confirm that we are still on track to implement the corrective measures outlined in item 7 of comment #1 by March 31, 2023. Our next update on this matter will be provided by February 28, 2023.

Whiteboard: [ca-compliance] → [ca-compliance] [ocsp-failure]

We are still on track to implement these corrective measures by March 31, 2023:
a) The architecture for the data replication process will no longer use the "Continuous Query Notification" feature, but will instead rely on database triggers and a staging table for any status updates. The staging table will only be updated after a status update has been verified as replicated to the target.
b) The testing procedure will be modified to ensure that 100% of the source and target revocationTime and revocationReasons match.
c) The source and target system will run concurrently for a period of time, with multiple runs of step b to validate that they remain in sync before traffic is directed to the cloud infrastructure.

Our next update confirming the implementation will be on or before March 31, 2023.

Whiteboard: [ca-compliance] [ocsp-failure] → [ca-compliance] [ocsp-failure] Next update 2023-03-31

Last Friday (3/24/2023) a code change was made in the pre-production environment to address the certificate status issue originating in the EJBCA Data Replicator. However, during validation we identified additional issues that need to be resolved before we can deploy to the production environment, so we anticipate it will take at least another month before the system is fully validated and stable.

We will provide another update by April 28, 2023.

Whiteboard: [ca-compliance] [ocsp-failure] Next update 2023-03-31 → [ca-compliance] [ocsp-failure] Next update 2023-04-28

Today we confirmed that all changes implemented for this issue are working as expected, and we therefore consider this issue fully resolved.

I will close this on or about Friday, 5-May-2023.

Flags: needinfo?(bwilson)
Whiteboard: [ca-compliance] [ocsp-failure] Next update 2023-04-28 → [ca-compliance] [ocsp-failure]
Status: ASSIGNED → RESOLVED
Closed: 1 year ago
Flags: needinfo?(bwilson)
Resolution: --- → FIXED