SSL.com: "unknown" OCSP response for issued certificates
Categories
(CA Program :: CA Certificate Compliance, task)
Tracking
(Not tracked)
People
(Reporter: secauditor, Assigned: secauditor)
Details
(Whiteboard: [ca-compliance] [ocsp-failure])
Attachments
(2 files)
Preliminary Incident Report
Summary
- Incident description: OCSP responders for 67 Certificates were returning the following error: "OCSP responder does not know this certificate".
- Relevant policies: This constitutes a violation of Section 4.9.9 "On-line revocation/status checking availability" of our CPS. For the status of a Subscriber Certificate or its corresponding Precertificate: An authoritative OCSP response MUST be available (i.e. the responder MUST NOT respond with the "unknown" status) starting no more than 15 minutes after the Certificate or Precertificate is first published or otherwise made available.
- Source of incident disclosure: Third Party Reported: A Certificate Problem Report was submitted to notify us of the OCSP errors at 2025-03-25 22:08 UTC.
As an immediate action, we resolved the issue with the reported certificates, and the OCSP service now returns a valid response for them.
Our investigation into this issue continues. We will post a full incident report on or before 2025-04-08.
Updated•18 days ago
Comment 1•10 days ago
URLs of affected certificates.
Comment 2•10 days ago
URLs of affected pre-certificates.
Comment 3•10 days ago
Full Incident Report
Summary
- CA Owner CCADB unique ID: A002038
- Incident description: An issue with our CA software prevented a proper OCSP response for a subset of TLS certificates and pre-certificates. We determined that the issue stems from intermittent failures to insert newly issued certificates into the CA database, specifically during high-rate certificate requests. A mitigating control has been put in place to fix the issue. However, due to the technical complexity of the issue, our investigation has not yet concluded. We will provide updates to this report as we continue to work on identifying the root cause.
- Timeline summary:
  - Non-compliance start date: 2024-09-20
  - Non-compliance identified date: 2025-03-26 19:20 UTC
  - Non-compliance end date: 2025-04-02 20:04 UTC
- Relevant policies:
  - CP/CPS Section 4.9.9 "On-line revocation/status checking availability": For the status of a Subscriber Certificate or its corresponding Precertificate: […] An authoritative OCSP response MUST be available (i.e. the responder MUST NOT respond with the "unknown" status) starting no more than 15 minutes after the Certificate or Precertificate is first published or otherwise made available.
  - CP/CPS Section 4.10.2 "Service Availability": SSL.com shall maintain an online 24x7 Repository that application software can use to automatically check the current status of all unexpired Certificates issued by SSL.com.
- Source of incident disclosure: Third Party Reported
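For readers less familiar with the Section 4.9.9 requirement quoted above, the rule can be written as a small check. This is only an illustrative sketch: the function name `violates_section_4_9_9` and the plain-string status values are assumptions made for this example, and a real monitor would parse the status from an actual OCSP response rather than receive it as a string.

```python
from datetime import datetime, timedelta, timezone

# Grace period taken from the CP/CPS Section 4.9.9 text quoted above.
GRACE_PERIOD = timedelta(minutes=15)


def violates_section_4_9_9(published_at, checked_at, ocsp_status):
    """Return True if "unknown" is still served once the grace period is over.

    ocsp_status is a hypothetical plain string ("good", "revoked", "unknown").
    """
    if ocsp_status != "unknown":
        return False
    # An authoritative response must be available starting no more than
    # 15 minutes after publication, so the boundary itself counts.
    return checked_at - published_at >= GRACE_PERIOD


# Example timestamps are made up for illustration only.
published = datetime(2025, 3, 25, 12, 0, tzinfo=timezone.utc)
print(violates_section_4_9_9(published, published + timedelta(minutes=10), "unknown"))
print(violates_section_4_9_9(published, published + timedelta(minutes=20), "unknown"))
```

The first call falls inside the grace period and the second falls outside it, so only the second is flagged.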
Impact
- Total number of certificates: 58,129 Certificates and 53,116 pre-certificates
- Total number of "remaining valid" certificates: 38,652 Certificates (2025-03-27 00:00 UTC)
- Affected certificate types: DV TLS
- Incident heuristic: The full corpus of affected certificates is disclosed in the Appendix.
- Was issuance stopped in response to this incident, and why or why not?: No. This incident did not produce any invalid certificates, so we did not stop issuance. We were able to immediately resolve the issue with the reported certificates, and the OCSP service now returns a valid response for them.
- Analysis: N/A - No revocation delay.
- Additional considerations: Our investigation has not yet concluded. We will provide updates to this report as we continue to identify the root cause.
Timeline
2024-09-20:
- First instance of a certificate missing from the CA database.
2025-03-25:
- 22:08: SSL.com receives a Certificate Problem Report (CPR) that our OCSP service does not produce the correct response for the certificate https://crt.sh/?opt=ocsp&sha256=c9cf1b1f6944b0458bd7371f60d76af880c74408211f34def202f22430ea8009. The CPR also reports that this issue affects 67 certificates.
2025-03-26:
- 13:58: Internal ticket registered by the Validation Team to handle the CPR.
- 14:03: A preliminary report is sent to the reporter to inform that we have received the report and that an internal investigation has been launched.
- 14:05: The Compliance Business Unit (CBU) is notified of a potential issue with the OCSP service.
- 17:52: Internal ticket registered by the CBU in accordance with our Incident Management Policy.
- 19:20: CBU confirms this as a violation of section 4.9.9 of our CPS and declares an incident.
- 19:38: A notification is sent out to the Subscriber to notify of the problem report.
- 19:41: The missing 67 certificates are imported to the CA database.
- 19:48: New OCSP responses are generated
- 20:04: Problems reported by OCSP watch (https://sslmate.com/labs/ocsp_watch/) cleared
2025-03-27:
- 12:01: Configured OCSP Watch monitoring alerts to be sent to a designated Slack channel.
- 12:34: Investigation continues. The engineering and compliance teams collaborate to gather and analyze records to identify the issue.
- 14:31: Investigation identified a preliminary population of certificates missing from the CA DB
- 17:55: Drafting of the Preliminary Incident Report
2025-03-28:
- 12:29: All certificates from preliminary population are imported into CA DB.
- 19:34: Preliminary Incident Report posted to Bugzilla (https://bugzilla.mozilla.org/show_bug.cgi?id=1957140#c0)
- 20:08: A mechanism has been put in place to automatically detect certificates missing from the CA DB and import them
2025-03-31:
- 09:48: Investigation revealed 3,893 pre-certs, for which no certificate was issued, missing from the DB
- 17:10: Mitigation to automatically detect certificates missing from the CA DB and import them updated to include pre-certificates.
2025-04-01:
- 12:22: Total number of missing pre-certs is 53,052
2025-04-02:
- 11:27: All missing pre-certs have been imported
- 12:25: Investigation concludes that there are no more certificates or pre-certs missing from the CA DB
2025-04-03:
- 15:20: Investigation confirms that the CA application issues ALTER TABLE statements upon every restart. Investigation continues to confirm other contributing factors.
- 18:00: Drafting of the Full Incident Report
2025-04-07:
- 18:02: Debug logs were enabled on several CA nodes to help discover other contributing factors.
2025-04-08:
- 08:03: ALTER privilege was revoked from the application's DB user as a temporary mitigation.
- 21:47: Full Incident Report posted to Bugzilla
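The mitigation mentioned in the 2025-03-28 20:08 and 2025-03-31 17:10 entries (automatically detecting certificates and pre-certificates missing from the CA DB and importing them) is not described in detail in this report. A minimal sketch of that kind of reconciliation is shown below, purely for illustration. It assumes a hypothetical `certificates` table, a hypothetical independent issuance record held as a dict of serial to DER bytes, and uses an in-memory SQLite database as a stand-in for the real CA database.

```python
import sqlite3


def find_missing_serials(issued_serials, conn):
    """Return serials that were issued but are absent from the CA DB."""
    rows = conn.execute("SELECT serial FROM certificates").fetchall()
    known = {row[0] for row in rows}
    return sorted(set(issued_serials) - known)


def import_missing(issued, conn):
    """Insert every issued (pre-)certificate the CA DB does not know about."""
    missing = find_missing_serials(issued.keys(), conn)
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT INTO certificates (serial, der) VALUES (?, ?)",
            [(serial, issued[serial]) for serial in missing],
        )
    return missing


# Stand-in database and issuance record; all values are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE certificates (serial TEXT PRIMARY KEY, der BLOB)")
conn.execute("INSERT INTO certificates VALUES ('01', x'00')")
conn.commit()

issued = {"01": b"\x00", "02": b"\x01", "03": b"\x02"}
print(import_missing(issued, conn))  # serials imported on the first pass
print(import_missing(issued, conn))  # nothing left to import on the second pass
```

Running the reconciliation repeatedly is harmless in this sketch, because only serials that are still absent are inserted.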
Related Incidents
We are aware there has been a surge of OCSP response related issues recently and while our investigation is ongoing, none of the prior
incidents appear to share the same potential root cause.
Root Cause Analysis
As mentioned in our summary, due to the technical complexity of the issue, we are still investigating all possible contributing factors and
will continue to update the Root Cause Analysis as new factors are confirmed. At this time, we have confirmed one contributing factor and
one failed control as examined below.
Contributing Factor #1: When the CA application (re)starts, it causes ALTER TABLE statements to be issued to the database.
- Description: In a clustered setup, where multiple application instances access the DB backend simultaneously, a race condition can occur: the DB thread handling the ALTER TABLE operation tries to lock the table and conflicts with another DB thread that is trying to INSERT data into the same table, which results in the second thread being terminated.
- Timeline:
  - 2025-02-27: Testing for implementation of the most recent CA software release. Testing discovered the ALTER TABLE behavior, but it was deemed non-harmful.
  - 2025-03-13: Deployment of the most recent CA software release.
  - 2025-04-02: Identified DB queries being cancelled due to ALTER TABLE statements locking the table.
  - 2025-04-08: ALTER privilege was revoked from the application's DB user as a temporary mitigation.
- Detection: While we were investigating the cause of the missing certificates, the kernel on one node killed the application server because it ran out of memory. When the application server was started again, the application issued the ALTER statements, which conflicted with a table write operation and caused the write to be dropped. Although testing had initially discovered this behavior, this type of interaction was not expected to occur or to cause loss of data.
- Interaction with other factors: Although our investigation is ongoing, this is currently considered an isolated factor.
- Root Cause Analysis methodology used: 5-Whys
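To illustrate why a terminated INSERT of the kind described above leads to silent data loss, and one generic way an application could notice it, here is a hedged sketch of a verify-after-write pattern with retry. It is not the CA vendor's code and not the mitigation SSL.com applied (that was revoking the ALTER privilege). SQLite stands in for the unnamed DB backend, and `FlakyConnection`, `store_certificate`, and `WriteLost` are hypothetical names introduced only for this example.

```python
import sqlite3


class WriteLost(Exception):
    """Raised when a row cannot be confirmed after all attempts."""


class FlakyConnection(sqlite3.Connection):
    """Test double: fails the first INSERT, as a cancelled statement would."""

    fail_next_insert = True

    def execute(self, sql, *args):
        if self.fail_next_insert and sql.lstrip().upper().startswith("INSERT"):
            self.fail_next_insert = False
            raise sqlite3.OperationalError("simulated: cancelled by table lock")
        return super().execute(sql, *args)


def store_certificate(conn, serial, der, attempts=3):
    """Insert a certificate and confirm the row exists before returning."""
    for _ in range(attempts):
        try:
            with conn:  # commit on success, roll back on error
                conn.execute(
                    "INSERT OR IGNORE INTO certificates (serial, der) VALUES (?, ?)",
                    (serial, der),
                )
        except sqlite3.OperationalError:
            continue  # the write was cancelled; try again
        found = conn.execute(
            "SELECT 1 FROM certificates WHERE serial = ?", (serial,)
        ).fetchone()
        if found is not None:
            return True
    raise WriteLost(serial)


conn = sqlite3.connect(":memory:", factory=FlakyConnection)
conn.execute("CREATE TABLE certificates (serial TEXT PRIMARY KEY, der BLOB)")
print(store_certificate(conn, "0A", b"\x30"))
```

In this sketch the first insert attempt is cancelled and the second succeeds; without the retry and the read-back check, the caller would have no row and no error to act on.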
Contributing Factor #2: Monitoring alerts are part of other general systems alerts without proper categorization.
- Description: An alert mechanism was created to monitor SSLMate OCSP Watch for early detection of any OCSP problems. These alerts were delivered alongside other general system alerts without proper categorization and were not prioritized accordingly.
- Timeline:
  - 2023-02-01: Implementation of OCSP Watch monitoring
  - 2025-02-26: First email alert regarding OCSP error from this incident.
  - 2025-03-25: SSL.com receives a Certificate Problem Report (CPR) that our OCSP service does not produce the correct response for 67 certificates
- Detection: Although email alerts were sent out as configured, it was not until the investigation of the CPR that we looked back and searched for the email alerts.
- Interaction with other factors: Increased the time until detection.
- Root Cause Analysis methodology used: 5-Whys
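As an illustration of what categorizing alerts with a per-category SLA could look like, here is a small sketch. The category names, keyword matching, and SLA durations are all assumptions made for this example; the report does not state the actual categories or SLAs that will be used.

```python
from dataclasses import dataclass
from datetime import timedelta

# Hypothetical categories and SLAs; the report does not specify real values.
SLA_BY_CATEGORY = {
    "ocsp": timedelta(hours=1),
    "crl": timedelta(hours=4),
    "general": timedelta(hours=24),
}


@dataclass
class Alert:
    source: str
    subject: str


def categorize(alert):
    """Assign a category from simple keyword matching on source and subject."""
    text = f"{alert.source} {alert.subject}".lower()
    if "ocsp" in text:
        return "ocsp"
    if "crl" in text:
        return "crl"
    return "general"


def triage(alerts):
    """Return (category, sla, alert) tuples, tightest SLA first."""
    tagged = [(categorize(a), a) for a in alerts]
    tagged.sort(key=lambda pair: SLA_BY_CATEGORY[pair[0]])
    return [(cat, SLA_BY_CATEGORY[cat], a) for cat, a in tagged]


# Made-up alerts for illustration.
alerts = [
    Alert("disk-monitor", "Volume usage above 80%"),
    Alert("OCSP Watch", "Responder returned unknown for issued certificate"),
    Alert("crl-monitor", "CRL nearing nextUpdate"),
]
for category, sla, alert in triage(alerts):
    print(category, sla, alert.subject)
```

With a scheme like this, an OCSP alert arriving in a shared mailbox would be ordered ahead of routine system notices instead of being mixed in with them.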
Lessons Learned
- What went well: A mitigating control was quickly put in place.
- What didn't go well:
  - The issue was not detected by internal controls.
  - Our testing failed to realize all the implications of the database behavior that was identified.
  - The complexity of the issue has prevented us from uncovering all of the root causes.
- Where we got lucky: OCSP Responses were initially produced at the time of issuance and were available for serving requests.
- Additional: N/A
Action Items
Due to the technical complexity of the issue, we are still investigating all possible contributing factors and will continue to update our Action
Items as new factors are discovered.
Action Item | Kind | Corresponding Root Cause(s) | Evaluation Criteria | Due Date | Status |
---|---|---|---|---|---|
A mechanism to automatically detect certificates missing from the CA DB and import them | Mitigate | Root Cause # 1 | Monitor OCSP Watch for any OCSP errors | Completed | |
ALTER privilege was revoked from the application’s DB user | Prevent | Root Cause # 1 | Tested behavior in staging and production environment | Completed | |
Review and categorize OCSP/CRL Watch alerts to actionable priorities | Detect / Prevent | Root Cause # 2 | Set an SLA for each Category and track it | 2025-05-15 | Open |
Appendix
Comment 4•3 days ago
SSL.com continued its investigation after enabling debug logs on several CA nodes, but no information was discovered that helps identify any new contributing factors. We are currently working with our CA vendor to help identify all root causes. We will provide an update as we continue working with our CA vendor.