Bug 1878106 (Closed): Opened 7 months ago, Closed 6 months ago

HARICA: Anomaly in OCSP services after CA software upgrade

Categories

(CA Program :: CA Certificate Compliance, task)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jimmy, Assigned: jimmy)

Details

(Whiteboard: [ca-compliance] [ocsp-failure])

Incident Report

HARICA detected an anomaly in its OCSP services after a CA software upgrade, which resulted in newly issued certificates having pre-signed OCSP responses with a non-compliant nextUpdate value.

Summary

HARICA was alerted by one of our Subscribers that wrong OCSP responses were being served for their certificate. Investigation revealed that, after a CA software upgrade, 161 TLS certificates issued from Issuing CAs configured for pre-signed OCSP responses did not receive proper OCSP responses. The problematic OCSP responses had a nextUpdate more than 10 days after the thisUpdate, in violation of TLS BRs section 4.9.10. In addition, the OCSP response refresh service did not update these OCSP responses in time, missing the four-day mark.
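
For illustration, here is a minimal sketch of the two checks these responses failed, assuming the Python cryptography library; response.der is a hypothetical file holding a DER-encoded OCSP response.

```python
# Minimal sketch: validate an OCSP response against the 10-day maximum
# validity (TLS BRs 4.9.10) and the 4-day refresh deadline noted above.
# "response.der" is a placeholder file name.
from datetime import datetime, timedelta, timezone

from cryptography.x509 import ocsp

MAX_VALIDITY = timedelta(days=10)  # nextUpdate - thisUpdate must not exceed 10 days
REFRESH_BY = timedelta(days=4)     # a fresh response must be issued within 4 days

with open("response.der", "rb") as f:
    resp = ocsp.load_der_ocsp_response(f.read())

# cryptography returns these fields as naive datetimes in UTC.
this_update = resp.this_update.replace(tzinfo=timezone.utc)
next_update = resp.next_update.replace(tzinfo=timezone.utc)

if next_update - this_update > MAX_VALIDITY:
    print("violation: nextUpdate is more than 10 days after thisUpdate")

if datetime.now(timezone.utc) - this_update > REFRESH_BY:
    print("violation: response was not refreshed within 4 days of thisUpdate")
```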

During that time, CRLs were issued properly, and Relying Parties had an alternative channel to receive revocation information related to those certificates.

HARICA is using Enterprise CA software from an external Software Vendor but has access to the source code. This allowed HARICA to quickly detect the problematic code and fix it.

The problematic code was fixed the same day; new, proper OCSP responses were generated, and the problematic responses were purged from the front-end certificate status servers.

The Subscribers affected by this incident were able to get the proper OCSP responses within 1 hour after the issue was fixed in production.

Impact

Relying Parties using web browsers that check OCSP responses were unable to connect to the websites of affected Subscribers because the browsers considered the OCSP response stale.

Timeline

All timestamps in EET (GMT+2)

  • 2024-01-29 18:42 We were alerted that old OCSP responses were being served for the certificate with serial number 1D9A3527F779457416829A88CAC8961D
  • 2024-01-29 18:46 Investigation began
  • 2024-01-29 18:50 It was determined that the OCSP response had been generated on 2024-01-18 and that its nextUpdate was set more than 10 days after the thisUpdate, exceeding the maximum allowed by the TLS BRs.
  • 2024-01-29 19:00 Investigation revealed that the issue began on 2024-01-16. The CA software had been upgraded that day.
  • 2024-01-29 19:11 A configuration audit was initiated to identify the cause of the issue
  • 2024-01-29 19:27 The configuration audit did not reveal any issues. A code audit was initiated.
  • 2024-01-29 19:37 The code audit identified a potentially problematic line
  • 2024-01-29 19:42 The investigation revealed that up to 161 TLS certificates could potentially be affected
  • 2024-01-29 20:04 Certificate issuance was suspended until a fix became available
  • 2024-01-29 20:20 Investigation identified and confirmed 161 TLS certificates being affected
  • 2024-01-29 20:29 The problematic OCSP responses were purged from the CA database
  • 2024-01-29 20:36 The fix was confirmed to be effective in the staging environment
  • 2024-01-29 20:52 The fix was deployed in the production environment and the CA issuing service was resumed. Refresh of the OCSP responses was initiated
  • 2024-01-29 21:21 OCSP responses were refreshed
  • 2024-01-29 21:24 The problematic OCSP responses were purged from the certificate status servers' databases
  • 2024-01-29 21:31 Work began to identify web servers that were serving stapled OCSP responses
  • 2024-01-29 21:50 Confirmed OCSP stapling is refreshed by default every 3600 seconds in Apache and nginx
  • 2024-01-29 22:13 Out of all the web servers that could be accessed on port 443, only one was still serving a problematic stapled OCSP response
  • 2024-01-29 22:22 We analyzed the behavior of popular web server software (Apache, nginx) and determined that, by default, they fetch new OCSP responses within the hour regardless of the nextUpdate OCSP response field (a sketch for checking a server's stapled response follows this timeline)
  • 2024-01-31 11:10 Investigation continued to collect information and notify the software vendor
  • 2024-02-01 13:15 Software vendor notified
  • 2024-02-01 15:00 Started drafting incident report
  • 2024-02-01 21:20 Bugzilla bug opened
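
To check which servers were still serving a problematic stapled response (the 22:13 entry above), a client has to request OCSP stapling during the TLS handshake. Here is a minimal sketch assuming pyOpenSSL and cryptography; example.com is a placeholder host.

```python
# Minimal sketch: fetch and inspect the OCSP response a server staples.
# "example.com" is a placeholder; a real check would iterate over the
# affected Subscribers' hosts.
import socket

from OpenSSL import SSL
from cryptography.x509 import ocsp

def ocsp_callback(conn, ocsp_bytes, data):
    # Invoked during the handshake with the raw stapled response, if any.
    if ocsp_bytes:
        resp = ocsp.load_der_ocsp_response(ocsp_bytes)
        print("thisUpdate:", resp.this_update, "nextUpdate:", resp.next_update)
    else:
        print("no stapled OCSP response")
    return True  # do not abort the handshake

ctx = SSL.Context(SSL.TLS_CLIENT_METHOD)
ctx.set_ocsp_client_callback(ocsp_callback)

sock = socket.create_connection(("example.com", 443))
conn = SSL.Connection(ctx, sock)
conn.set_tlsext_host_name(b"example.com")  # SNI
conn.request_ocsp()                        # ask the server to staple a response
conn.set_connect_state()
conn.do_handshake()
conn.close()
```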

Root Cause Analysis

Here we present the results of the “5 whys” root cause methodology that was followed:

  • Why was there a problem?

Because we were serving stale OCSP responses

  • Why were the OCSP responses stale?

Because the OCSP response refresh service did not update them in time, violating the 4-day requirement

  • Why did the OCSP response refresh service not update them in time?

Because the nextUpdate was set so far in the future that the refresh service did not consider the response due for renewal; the value also violated the 10-day maximum validity

  • Why was the nextUpdate not set according to the CA configuration?

Because of a bug in the CA software

  • Why was a faulty version of the CA software deployed in the production environment?

Because testing of the software upgrade failed to identify that the CA was bypassing the configured nextUpdate value for the OCSP responses generated upon certificate issuance

  • Bonus why: Why did we not detect this earlier?

Because there were gaps in the monitoring of actual OCSP response validity and freshness; we relied only on monitoring the system configuration for these parameters. Additionally, OCSP monitoring focused on checking that responses were being served and that their nextUpdate had not elapsed.

This root cause analysis helped us identify areas for improvement, such as implementing additional monitoring of the actual OCSP responses being served, and extending those checks to CRLs as well. Mitigations are described in the Action Items section, and a sketch of such a monitoring check follows.
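
As an illustration of what monitoring the actual responses could look like, the following builds an OCSP request for a sample certificate and submits it to the responder, assuming the Python cryptography and requests libraries; the file names and the responder URL are placeholders (in practice the URL would come from the certificate's Authority Information Access extension).

```python
# Minimal sketch: fetch a live OCSP response for a sample certificate so
# its thisUpdate/nextUpdate can be checked against the limits above.
# File names and the responder URL are placeholders.
import requests
from cryptography import x509
from cryptography.hazmat.primitives import hashes, serialization
from cryptography.x509 import ocsp

with open("subscriber.pem", "rb") as f:
    cert = x509.load_pem_x509_certificate(f.read())
with open("issuing_ca.pem", "rb") as f:
    issuer = x509.load_pem_x509_certificate(f.read())

req = ocsp.OCSPRequestBuilder().add_certificate(cert, issuer, hashes.SHA1()).build()

http_resp = requests.post(
    "http://ocsp.example.test",  # placeholder responder URL
    data=req.public_bytes(serialization.Encoding.DER),
    headers={"Content-Type": "application/ocsp-request"},
    timeout=10,
)
resp = ocsp.load_der_ocsp_response(http_resp.content)
print("status:", resp.response_status)
print("thisUpdate:", resp.this_update, "nextUpdate:", resp.next_update)
# A monitoring job would alert when the validity window exceeds 10 days
# or when the response has not been refreshed within 4 days.
```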

Lessons Learned

What went well

  • HARICA has access to the source code of the CA Software Vendor, which allowed for code review, quick detection of the problem, and a quick fix
  • HARICA was able to demonstrate good knowledge and understanding of the CA software code
  • HARICA responded swiftly to the issue once the alert was fired

What didn't go well

  • Although CA software upgrades are tested before deployment to the production environment, our internal testing did not include a detailed analysis of the OCSP responses produced after the upgrade
  • We did not detect the problem sooner

Where we got lucky

  • This issue only affected new certificates from CAs without delegated OCSP responders
  • Because the refresh service failed to renew the problematic OCSP responses, the issue surfaced instead of being masked
  • The most popular web servers fetch new OCSP responses regularly by default, so the corrected responses propagated quickly

Action Items

  • Request an investigation and Root Cause Analysis from the CA Software Vendor about the problematic code (Kind: Informative; Completed)
  • Update the testing instructions to include issuance of new certificates and explicit checking of OCSP responses and CRLs (Kind: Prevent; due 2024-02-09)
  • Implement additional OCSP monitoring controls (Kind: Detect; due 2024-02-16)
  • Implement OCSP linter (Kind: Detect; due 2024-02-23)
  • Implement CRL linter (Kind: Prevent; due 2024-02-23)

Appendix

Details of affected certificates

Based on Incident Reporting Template v. 2.0

Assignee: nobody → jimmy
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Whiteboard: [ca-compliance] [ocsp-failure]

(In reply to Dimitris Zacharopoulos from comment #0)

HARICA detected an anomaly in its OCSP services after a CA software upgrade, which resulted in newly issued certificates having pre-signed OCSP responses with a non-compliant nextUpdate value.

What is the CA software that HARICA is using? That would help other CAs investigate whether they are similarly affected.

(In reply to Mathew Hodson from comment #1)

(In reply to Dimitris Zacharopoulos from comment #0)

HARICA detected an anomaly in its OCSP services after a CA software upgrade, which resulted in newly issued certificates having pre-signed OCSP responses with a non-compliant nextUpdate value.

What is the CA software that HARICA is using? That would help other CAs investigate whether they are similarly affected.

We are using EJBCA Enterprise. The issue is triggered in version 8.2.0. As mentioned in our timeline, we have already contacted Keyfactor to alert them to the issue, and they confirmed they would be contacting their customers running EJBCA to inform them about the problem and the conditions that trigger it.

Keyfactor is also preparing a public announcement. We will add it to this bug as soon as we see it.

Here is the announcement from Keyfactor.

Here is an update to our action items.

  • Request an investigation and Root Cause Analysis from the CA Software Vendor about the problematic code (Kind: Informative; Completed)
  • Update the testing instructions to include issuance of new certificates and explicit checking of OCSP responses and CRLs (Kind: Prevent; completed 2024-02-07)
  • Implement additional OCSP monitoring controls (Kind: Detect; due 2024-02-16)
  • Implement OCSP linter (Kind: Detect; due 2024-02-23)
  • Implement CRL linter (Kind: Prevent; due 2024-02-23)

Another update to our action items.

  • Request an investigation and Root Cause Analysis from the CA Software Vendor about the problematic code (Kind: Informative; Completed)
  • Update the testing instructions to include issuance of new certificates and explicit checking of OCSP responses and CRLs (Kind: Prevent; completed 2024-02-07)
  • Implement additional OCSP monitoring controls (Kind: Detect; completed 2024-02-13)
  • Implement OCSP linter (Kind: Detect; due 2024-02-23)
  • Implement CRL linter (Kind: Prevent; due 2024-02-23)

All action items have been completed.

  • Request an investigation and Root Cause Analysis from the CA Software Vendor about the problematic code (Kind: Informative; Completed)
  • Update the testing instructions to include issuance of new certificates and explicit checking of OCSP responses and CRLs (Kind: Prevent; completed 2024-02-07)
  • Implement additional OCSP monitoring controls (Kind: Detect; completed 2024-02-13)
  • Implement OCSP linter (Kind: Detect; completed 2024-02-21)
  • Implement CRL linter (Kind: Prevent; completed 2024-02-21)

We would like to share some notes from our testing of pkilint for OCSP response checking. With the current implementation of pkilint, it seems very challenging to run the OCSP linter in real time (i.e., at the issuance of every OCSP response): linting a single OCSP response takes about 250 ms on average, faster or slower depending on the CPU. We decided to run the OCSP linter in batch mode after parallelizing the execution, and scheduled the check to run every few hours. A sketch of this batch approach follows.
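
As a rough illustration of the batch approach, here is a minimal sketch, assuming pkilint is installed and exposes the lint_ocsp_response command-line tool; the "lint" subcommand and the responses/ directory of DER files are assumptions about our setup, and exact flags may differ across pkilint versions.

```python
# Minimal sketch: batch-mode OCSP linting with pkilint, parallelized.
# Assumes pkilint's lint_ocsp_response CLI is on PATH; "responses/" is
# a hypothetical directory of DER-encoded OCSP responses.
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def lint(path: Path) -> tuple[Path, str]:
    # Each invocation is a separate OS process, so a thread pool is
    # enough to run several linter processes in parallel.
    result = subprocess.run(
        ["lint_ocsp_response", "lint", str(path)],
        capture_output=True, text=True,
    )
    return path, result.stdout.strip()

paths = sorted(Path("responses").glob("*.der"))
with ThreadPoolExecutor(max_workers=8) as pool:
    for path, findings in pool.map(lint, paths):
        if findings:
            print(f"{path}: {findings}")
```

At roughly 250 ms per response, eight parallel workers get through a few hours' worth of responses in well under a minute, which is consistent with running the check every few hours.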

Flags: needinfo?(bwilson)

We appreciate this good and thorough incident report. Since all action items have been completed, I believe this matter can be closed and I will do so on Friday, 8-March-2024.

Status: ASSIGNED → RESOLVED
Closed: 6 months ago
Flags: needinfo?(bwilson)
Resolution: --- → FIXED