Closed Bug 1957140 Opened 5 months ago Closed 8 days ago

SSL.com: "unknown" OCSP response for issued certificates

Categories

(CA Program :: CA Certificate Compliance, task)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: secauditor, Assigned: secauditor)

Details

(Whiteboard: [ca-compliance] [ocsp-failure])

Attachments

(2 files)

Preliminary Incident Report

Summary

  • Incident description:
    OCSP responders for 67 Certificates were returning the following error: "OCSP responder does not know this certificate".

  • Relevant policies:
    This constitutes a violation of Section 4.9.9 "On-line revocation/status checking availability" of our CPS.

    For the status of a Subscriber Certificate or its corresponding Precertificate:

    An authoritative OCSP response MUST be available (i.e. the responder MUST NOT respond with the "unknown" status) starting no more than 15 minutes after the Certificate or Precertificate is first published or otherwise made available.

  • Source of incident disclosure:
    Third Party Reported: A Certificate Problem Report was submitted to notify us of the OCSP errors at 2025-03-25 22:08 UTC.

As an immediate action, we resolved the issue with the reported certificates, and the OCSP service now returns a valid response for them.

Our investigation into this issue continues. We will post a full incident report on or before 2025-04-08.

Assignee: nobody → secauditor
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Whiteboard: [ca-compliance] [ocsp-failure]
Attached file cert_urls.txt

URLs of affected certificates.

Attached file precert_urls.txt

URLs of affected pre-certificates.

Full Incident Report

Summary

  • CA Owner CCADB unique ID: A002038

  • Incident description: An issue with our CA software prevented a proper OCSP response for a subset of TLS certificates and pre-certificates. We determined that the issue stems from intermittent failures to insert newly issued certificates into the CA database, specifically during high-rate certificate requests. A mitigating control has been put in place to address the issue. However, due to the technical complexity of the issue, our investigation has not yet concluded. We will provide updates to this report as we continue to work on identifying the root cause.

  • Timeline summary:

    • Non-compliance start date: 2024-09-20

    • Non-compliance identified date: 2025-03-26 19:20 UTC

    • Non-compliance end date: 2025-04-02 20:04 UTC

    • Relevant policies:

    • CP/CPS Section 4.9.9 "On-line revocation/status checking availability": For the status of a Subscriber Certificate or its corresponding Precertificate: […] An authoritative OCSP response MUST be available (i.e. the responder MUST NOT respond with the "unknown" status) starting no more than 15 minutes after the Certificate or Precertificate is first published or otherwise made available.

    • CP/CPS Section 4.10.2 "Service Availability": SSL.com shall maintain an online 24x7 Repository that application software can use to automatically check the current status of all unexpired Certificates issued by SSL.com.

  • Source of incident disclosure: Third Party Reported
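The 15-minute requirement quoted from Section 4.9.9 can be expressed as a small check. The following is a minimal sketch, not SSL.com's monitoring code: the function name `violates_4_9_9`, the status strings, and the example timestamps are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone

# Grace period quoted from CP/CPS Section 4.9.9.
GRACE_PERIOD = timedelta(minutes=15)

# "good" and "revoked" are authoritative certificate statuses; "unknown" is not.
AUTHORITATIVE = {"good", "revoked"}


def violates_4_9_9(published_at: datetime, checked_at: datetime, cert_status: str) -> bool:
    """Return True when a non-authoritative status is seen after the grace period."""
    if cert_status in AUTHORITATIVE:
        return False
    return checked_at - published_at > GRACE_PERIOD


# Made-up timestamps for illustration (not taken from the incident data).
published = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
print(violates_4_9_9(published, published + timedelta(minutes=10), "unknown"))  # False
print(violates_4_9_9(published, published + timedelta(minutes=16), "unknown"))  # True
print(violates_4_9_9(published, published + timedelta(minutes=16), "good"))     # False
```

An "unknown" answer inside the first 15 minutes is tolerated; the same answer afterwards is the condition this incident describes.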

Impact

  • Total number of certificates: 58,129 Certificates and 53,116 pre-certificates

  • Total number of "remaining valid" certificates: 38,652 Certificates (2025-03-27 00:00 UTC)

  • Affected certificate types: DV TLS

  • Incident heuristic: The full corpus of affected certificates is disclosed in the Appendix.

  • Was issuance stopped in response to this incident, and why or why not?: No. This incident did not produce any invalid certificates,
    so we did not stop issuance. We were able to immediately resolve the issue with the reported certificates, and the OCSP service now
    returns a valid response for them.

  • Analysis: N/A - No revocation delay.

  • Additional considerations: Our investigation has not yet concluded. We will provide updates to this report as we continue to identify the root cause.

Timeline

2024-09-20:

  • First instance of a certificate missing from the CA database.

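2025-03-25:

  • 22:08: A Certificate Problem Report (CPR) is submitted, reporting OCSP errors for 67 certificates.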

2025-03-26:

  • 13:58: Internal ticket registered by the Validation Team to handle the CPR.

  • 14:03: A preliminary report is sent to the reporter confirming that we have received the report and launched an internal investigation.

  • 14:05: The Compliance Business Unit (CBU) is notified of a potential issue with the OCSP service.

  • 17:52: Internal ticket registered by the CBU in accordance with our Incident Management Policy.

  • 19:20: CBU confirms this as a violation of section 4.9.9 of our CPS and declares an incident.

  • 19:38: A notification of the problem report is sent to the Subscriber.

  • 19:41: The missing 67 certificates are imported to the CA database.

  • 19:48: New OCSP responses are generated

  • 20:04: Problems reported by OCSP watch (https://sslmate.com/labs/ocsp_watch/) cleared

2025-03-27:

  • 12:01: Configured OCSP Watch monitoring alerts to be sent to a designated Slack channel.

  • 12:34: Investigation continues. The engineering and compliance teams collaborate to gather and analyze records to identify the issue.

  • 14:31: Investigation identified a preliminary population of certificates missing from the CA DB

  • 17:55: Drafting of the Preliminary Incident Report

2025-03-28:

  • 12:29: All certificates from preliminary population are imported into CA DB.

  • 19:34: Preliminary Incident Report posted to Bugzilla
    (https://bugzilla.mozilla.org/show_bug.cgi?id=1957140#c0)

  • 20:08: A mechanism has been put in place to automatically detect certificates missing from the CA DB and import them

2025-03-31:

  • 09:48: Investigation revealed 3,893 pre-certs, for which no certificate was issued, missing from the DB

  • 17:10: Mitigation to automatically detect certificates missing from the CA DB and import them updated to include pre-certificates.

2025-04-01:

  • 12:22: Total number of missing pre-certs is 53,052

2025-04-02:

  • 11:27: All missing pre-certs have been imported

  • 12:25: Investigation concludes that there are no more certificates or pre-certs missing from the CA DB

2025-04-03:

  • 15:20: Investigation confirms that the CA application issues ALTER TABLE statements upon every restart. Investigation continues to
    confirm other contributing factors.

  • 18:00: Drafting of the Full Incident Report

2025-04-07:

  • 18:02: Debug logs were enabled on several CA nodes to help discover other contributing factors.

2025-04-08:

  • 08:03: ALTER privilege was revoked from the application's DB user as a temporary mitigation.

  • 21:47: Full Incident Report posted to Bugzilla

Related Incidents

We are aware there has been a surge of OCSP response related issues recently and while our investigation is ongoing, none of the prior
incidents appear to share the same potential root cause.

Root Cause Analysis

As mentioned in our summary, due to the technical complexity of the issue, we are still investigating all possible contributing factors and
will continue to update the Root Cause Analysis as new factors are confirmed. At this time, we have confirmed one contributing factor and
one failed control as examined below.

Contributing Factor #1: When the CA application (re)starts, it causes ALTER TABLE statements to be issued to the database.

  • Description: In a clustered setup, where multiple application instances access the DB backend simultaneously, a race condition
    could occur: the DB thread that handles the ALTER TABLE operation tries to lock the table and conflicts with another DB
    thread that is trying to INSERT data into the same table, resulting in the second thread being terminated.

  • Timeline:

    • 2025-02-27: Testing for implementation of the most recent CA software release. Testing discovered the ALTER TABLE behavior, but it was
      deemed non-harmful.

    • 2025-03-13: Deployment of most recent CA software release.

    • 2025-04-02: Identified DB queries being cancelled due to ALTER TABLE statements locking the table.

    • 2025-04-08: ALTER privilege was revoked from the application's DB user as a temporary mitigation.

  • Detection: While investigating to identify the cause of the missing certificates, the kernel in one node killed the application server because it ran out of memory. When the application server was started again, the application issued the ALTER statements, causing a conflict with a table write operation, causing the write to be dropped. Although the tests initially discovered this behavior, it was not thought that this type of interaction would occur and cause loss of data.

  • Interaction with other factors: Although our investigation is ongoing, it is currently determined that this is an isolated factor.

  • Root Cause Analysis methodology used: 5-Whys
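The lock conflict described under Contributing Factor #1 can be illustrated with a toy model. The sketch below is not the CA software or a real database: `ToyTable`, its `insert` method, the serial numbers, and the timeout value are all hypothetical, and an in-process lock stands in for the table lock taken by ALTER TABLE. It only shows how a write that cannot obtain the lock is silently lost, leaving a certificate that was issued but is absent from the table.

```python
import threading


class ToyTable:
    """In-memory stand-in for a certificate table guarded by one table-level lock."""

    def __init__(self):
        self.lock = threading.Lock()  # non-reentrant, like an exclusive table lock
        self.rows = []

    def insert(self, serial, lock_timeout=0.05):
        """Insert a row, or give up (dropping the write) if the lock is unavailable."""
        if not self.lock.acquire(timeout=lock_timeout):
            return False
        try:
            self.rows.append(serial)
            return True
        finally:
            self.lock.release()


def simulate():
    table = ToyTable()
    issued = []
    for serial in ("01", "02", "03"):
        issued.append(serial)  # the certificate is issued regardless of the DB write
        if serial == "02":
            # A restart holds the table lock (the ALTER TABLE stand-in)
            # while the INSERT for certificate "02" arrives.
            with table.lock:
                table.insert(serial)
        else:
            table.insert(serial)
    missing = [s for s in issued if s not in table.rows]
    return issued, table.rows, missing


issued, stored, missing = simulate()
print(stored)   # ['01', '03']
print(missing)  # ['02']
```

In this model, an OCSP responder that answers only from the stored rows would have no record of certificate "02".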

Contributing Factor #2: Monitoring alerts are part of other general systems alerts without proper categorization.

  • Description: An alert mechanism was created to monitor SSLMate OCSP Watch for early detection of any OCSP problems. Alerts are part
    of other general systems alerts without proper categorization and were not prioritized accordingly.

  • Timeline:

    • 2023-02-01: Implementation of OCSP Watch monitoring

    • 2025-02-26: First email alert regarding OCSP error from this incident.

    • 2025-03-25: SSL.com receives a Certificate Problem Report (CPR) that our OCSP service does not produce the correct response for
      67 certificates

  • Detection: Although email alerts were sent out as configured, it was not until the investigation of the CPR that we looked back and
    searched for the email alerts.

  • Interaction with other factors: Increased the time until detection.

  • Root Cause Analysis methodology used: 5-Whys
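The categorization gap described under Contributing Factor #2 can be pictured as a routing step that separates revocation-status alerts from general system alerts. This is a minimal sketch under stated assumptions: the category names, channel names, keywords, and SLA values below are hypothetical and are not SSL.com's actual configuration.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Route:
    category: str
    channel: str
    sla: timedelta


# Hypothetical routing table: keywords found in an alert subject select a route.
ROUTES = [
    (("ocsp", "crl"), Route("revocation-status", "#ca-compliance-alerts", timedelta(hours=1))),
]

# Hypothetical fallback for everything that matches no keyword.
DEFAULT_ROUTE = Route("general", "#ops-general", timedelta(hours=24))


def route_alert(subject: str) -> Route:
    """Pick a category, channel, and SLA based on keywords in the alert subject."""
    text = subject.lower()
    for keywords, route in ROUTES:
        if any(keyword in text for keyword in keywords):
            return route
    return DEFAULT_ROUTE


print(route_alert("OCSP Watch: responder returned unknown").category)  # revocation-status
print(route_alert("Disk usage above 80% on node 3").category)          # general
```

Giving each category its own channel and SLA makes it possible to track whether an alert was acted on in time, rather than leaving it among unrelated notifications.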

Lessons Learned

  • What went well: A mitigating control was quickly put in place.

  • What didn't go well:

    • The issue was not detected by internal controls.

    • Our testing failed to realize all the implications of the database behavior that was identified.

    • The complexity of the issue has prevented us from uncovering all of the root causes.

  • Where we got lucky: OCSP Responses were initially produced at the time of issuance and were available for serving requests.

  • Additional: N/A

Action Items

Due to the technical complexity of the issue, we are still investigating all possible contributing factors and will continue to update our Action
Items as new factors are discovered.

| Action Item | Kind | Corresponding Root Cause(s) | Evaluation Criteria | Due Date | Status |
|---|---|---|---|---|---|
| A mechanism to automatically detect certificates missing from the CA DB and import them | Mitigate | Root Cause # 1 | Monitor OCSP Watch for any OCSP errors | | Completed |
| ALTER privilege was revoked from the application’s DB user | Prevent | Root Cause # 1 | Tested behavior in staging and production environment | | Completed |
| Review and categorize OCSP/CRL Watch alerts to actionable priorities | Detect / Prevent | Root Cause # 2 | Set an SLA for each Category and track it | 2025-05-15 | Open |
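The first action item, detecting certificates missing from the CA DB and importing them, amounts to a reconciliation between what was issued and what is stored. The report does not describe how the mechanism is implemented, so the following is only a sketch: `find_missing`, `reconcile`, the dictionary-based stores, and the `import_cert` callback are hypothetical, and the source of the "issued" set (for example issuance logs) is an assumption.

```python
from typing import Callable, Dict, List


def find_missing(issued: Dict[str, bytes], ca_db: Dict[str, bytes]) -> List[str]:
    """Return serial numbers that were issued but have no entry in the CA DB."""
    return sorted(set(issued) - set(ca_db))


def reconcile(
    issued: Dict[str, bytes],
    ca_db: Dict[str, bytes],
    import_cert: Callable[[str, bytes], None],
) -> List[str]:
    """Import every missing certificate and return the serials that were imported."""
    missing = find_missing(issued, ca_db)
    for serial in missing:
        import_cert(serial, issued[serial])
    return missing


# Made-up data: three issued certificates, only one present in the DB.
issued = {"0a": b"cert-a", "0b": b"cert-b", "0c": b"cert-c"}
ca_db = {"0a": b"cert-a"}

print(reconcile(issued, ca_db, ca_db.__setitem__))  # ['0b', '0c']
print(reconcile(issued, ca_db, ca_db.__setitem__))  # []
```

Running the reconciliation a second time imports nothing, so a job like this can be scheduled repeatedly without creating duplicates.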

Appendix

Cert_URLs.txt

Precert_URLs.txt

SSL.com continued its investigation after enabling debug logs on several CA nodes, but no information was discovered to help identify any new contributing factors. We are currently working with our CA vendor to help identify all root causes. We will provide an update as we continue working with our CA vendor.

SSL.com continues to work with our CA vendor to identify any additional root causes. We continue to monitor this bug for any questions or comments.

I checked OCSP Watch, and it seems that the issue "error parsing OCSP response: ocsp: error from server: unauthorized" occurred with several certificates 13 days ago.
https://sslmate.com/labs/ocsp_watch/
Are you aware of this issue, and are you working to resolve it?
I was concerned because there was no mention of it in Comment 5, so I decided to contact you.
Thank you.

(In reply to James from comment #6)

I checked OCSP Watch, and it seems that the issue "error parsing OCSP response: ocsp: error from server: unauthorized" occurred with several certificates 13 days ago.
https://sslmate.com/labs/ocsp_watch/
Are you aware of this issue, and are you working to resolve it?
I was concerned because there was no mention of it in Comment 5, so I decided to contact you.
Thank you.

Thank you for your comment and bringing this to our attention.

SSL.com has fixed these ten (10) newly reported certificates and five (5) additional ones discovered during our follow-up investigation. Initially, the issue seemed to affect only one of our CA clusters; out of caution, we started mitigating all CA clusters (where these newly identified certificates were issued). Because significant changes need to be adapted to the other clusters, this fix has not been deployed yet. For now, we have deployed a partial mitigation and are working to update it soon to cover these cases more effectively. In parallel, we have escalated this CA software bug with our vendor in hopes of a quicker resolution.

Below are the newly identified certificates:

https://crt.sh/?id=17758060151
https://crt.sh/?id=17758356316
https://crt.sh/?id=17758525615
https://crt.sh/?id=17757899318
https://crt.sh/?id=17758362600
https://crt.sh/?id=17758362336
https://crt.sh/?id=17758525734
https://crt.sh/?id=17758539456
https://crt.sh/?id=17758384093
https://crt.sh/?id=17758551902
https://crt.sh/?id=17758551327
https://crt.sh/?id=17758060834
https://crt.sh/?id=18000526791
https://crt.sh/?id=18000636747
https://crt.sh/?id=18001286930

Whiteboard: [ca-compliance] [ocsp-failure] → [ca-compliance] [ocsp-failure] Next update 2025-05-15

Hi Luis - can you provide information on how the 15 certs were missed? Was it the same issue? Looks like it was but I wanted to confirm.

(In reply to Jeremy from comment #8)

Hi Luis - can you provide information on how the 15 certs were missed? Was it the same issue? Looks like it was but I wanted to confirm.

Hi Jeremy,

Yes, the case is the same. As Comment 7 stated, we first thought the issue affected only one of our CA clusters, but we discovered this was not true. We are working with our vendor and hope to have a resolution soon.

This is an update to report our progress with remediation actions.

The following action item has been completed:

| Action Item | Kind | Corresponding Root Cause(s) | Evaluation Criteria | Due Date | Status |
|---|---|---|---|---|---|
| Review and categorize OCSP/CRL Watch alerts to actionable priorities | Detect / Prevent | Root Cause # 2 | Set an SLA for each Category and track it | 2025-05-15 | Completed |

As we continue our investigation alongside our CA software vendor, we have not been able to confirm any other contributing factors at the moment. We will continue monitoring this bug and ask that our next update be set for 2025-05-29.

Whiteboard: [ca-compliance] [ocsp-failure] Next update 2025-05-15 → [ca-compliance] [ocsp-failure] Next update 2025-05-29

Our CA software vendor provided a possible fix and we have begun testing. We kindly ask for our next update to be 2025-06-12.

Whiteboard: [ca-compliance] [ocsp-failure] Next update 2025-05-29 → [ca-compliance] [ocsp-failure] Next update 2025-06-12

After testing and further research, SSL.com and our CA vendor concluded that the proposed configuration change did not fix the issue and that further log reviews and testing are required. The issue proved to be quite complex because it manifests randomly. As we conclude our research, we are shifting our focus to improving and solidifying the mitigation controls (described as action item #1) so that this issue is handled properly and automatically regardless of the underlying technical cause. We ask that our next update be set for 2025-06-26.

Whiteboard: [ca-compliance] [ocsp-failure] Next update 2025-06-12 → [ca-compliance] [ocsp-failure] Next update 2025-06-26

SSL.com continues to work on solidifying our mitigation controls. We will provide an update next week.

Whiteboard: [ca-compliance] [ocsp-failure] Next update 2025-06-26 → [ca-compliance] [ocsp-failure]

During this week, our engineers were able to complete another cycle of improvements and re-tests of the said mitigation controls. In parallel, we continue to work alongside our CA vendor on possible preventive controls. We will provide an update in the next couple of weeks and ask for our next update to be set to 2025-07-17.

Whiteboard: [ca-compliance] [ocsp-failure] → [ca-compliance] [ocsp-failure] Next update 2025-07-17

SSL.com was notified by our CA Vendor that they have identified the problem and will provide a fix in their next software release. In the meantime, SSL.com has made improvements to its controls to mitigate this issue while we wait for our CA vendor's next update release. SSL.com has completed all the above action items and will post our Incident Closure Summary on or before 2025-07-31 as we consider this bug completed.

Whiteboard: [ca-compliance] [ocsp-failure] Next update 2025-07-17 → [ca-compliance] [ocsp-failure] Next update 2025-07-31

Is it possible to name the CA vendor so other CAs can see if they have the same problem? I would guess this is EJBCA, which other CAs also use.

(In reply to JR Moir from comment #16)

Is it possible to name the CA vendor so other CAs can see if they have the same problem? I would guess this is EJBCA, which other CAs also use.

Hi JR - SSL.com uses Keyfactor (EJBCA) as our CA vendor.

Report Closure Summary

  • Incident description: An issue with our CA software prevented a proper OCSP response for a subset of TLS certificates and pre-certificates. It was caused by intermittent failures to insert newly issued certificates into the CA database, specifically during high-rate certificate requests.

  • Incident Root Cause(s): A race condition could occur in which the DB thread that handles the ALTER TABLE operation tries to lock the table and conflicts with another DB thread that is trying to INSERT data into the same table, resulting in the second thread being terminated. Also, our monitoring for OCSP problems on OCSP Watch failed because alerts were part of other general systems alerts without proper categorization and were not prioritized accordingly.

  • Remediation description: As an immediate response ALTER privilege was revoked from the application’s DB user. We then engaged with our CA software vendor to determine the cause and possible software fix to prevent this from happening. After extensive collaboration, our CA vendor informed us that a fix will become available in their next software update. In the meantime, SSL.com has implemented a mitigating control that automatically detects certificates missing from the CA DB and imports them, along with enhancing our detective controls by improving the process by which OCSP/CRL Watch alerts are converted to actionable priorities.

  • Commitment summary: SSL.com will adopt a more efficient monitoring and alerting system by investing in collaborative tools that allow for better identification of high-severity alerts. We will also continue to work with our CA software vendor to test and implement their next software update, which will include a fix for this issue.

All Action Items disclosed in this report have been completed as described, and we request its closure.

This is a final call for comments or questions on this Incident Report.

Otherwise, this bug will be closed on approximately 2025-08-08.

Flags: needinfo?(incident-reporting)
Whiteboard: [ca-compliance] [ocsp-failure] Next update 2025-07-31 → [close on 2025-08-08] [ca-compliance] [ocsp-failure]
Status: ASSIGNED → RESOLVED
Closed: 8 days ago
Flags: needinfo?(incident-reporting)
Resolution: --- → FIXED
Whiteboard: [close on 2025-08-08] [ca-compliance] [ocsp-failure] → [ca-compliance] [ocsp-failure]