Closed Bug 1914067 Opened 3 months ago Closed 1 month ago

IdenTrust: Expired CRLs

Categories

(CA Program :: CA Certificate Compliance, task)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: roots, Assigned: roots)

Details

(Whiteboard: [ca-compliance] [crl-failure])

Preliminary Incident Report

Summary

On 2024-08-18 05:46:09 UTC, we received an alert from our monitoring tools indicating that expired CRLs were being served publicly. SSLMate’s monitoring tool flagged the same CRLs.

This is a violation of Section 4.10.2 of the TLS BRs on Service Availability:

_The CA SHALL maintain an online 24x7 Repository that application software can use to automatically check the current status of all unexpired Certificates issued by the CA._

The new CRLs were issued at 2024-08-18 12:58:41 UTC; we therefore served expired CRLs for about 8 hours, which affected certificate validation.

We believe that the root cause was a scheduled software change control that was completed on 2024-08-17 around 22:35 UTC.

Due to these failures, the software change control was reverted on 2024-08-18 at 13:12 UTC.

We are still gathering the details on the root cause and will provide a complete incident report no later than August 30, 2024.

Status: RESOLVED → REOPENED
Component: General → CA Certificate Compliance
Ever confirmed: true
Product: Invalid Bugs → CA Program
Resolution: INVALID → ---
Assignee: nobody → roots
Status: REOPENED → ASSIGNED
Type: defect → task
Whiteboard: [ca-compliance] [crl-failure]

Complete Incident Report

Summary

On 2024-08-18, we deployed a server OS update, and later in the day we discovered that two public trust CRLs had expired without being renewed.
This is a violation of Section 4.10.2 of the TLS BRs on Service Availability:
The CA SHALL maintain an online 24x7 Repository that application software can use to automatically check the current status of all unexpired Certificates issued by the CA.

Impact

The expiration of these CRLs potentially affected the validation process for certificates issued under these trust anchors. Our team promptly renewed the CRLs and reconfigured the alert system.

Timeline

(All times are UTC)
2024-08-05 06:16 Deployed server OS update in pre-production environment and tested it successfully
2024-08-17 03:00 Started deployment of scheduled server OS update in production
2024-08-17 09:00 Completed server OS update
2024-08-18 04:01:19 Server OS crashed
2024-08-18 05:42:12 The CRL for Trustidevcodesigning3.crl expired
2024-08-18 05:52:13 The CRL for Timestamping3.crl expired
2024-08-18 05:56:09 Received alert that public CRLs checks failed
2024-08-18 09:00 Started troubleshooting
2024-08-18 13:14 Rolled back the server OS update
2024-08-18 13:58:41 The Trustidevcodesigning3.crl CRL was recreated successfully
2024-08-18 13:58:42 The Timestamping3.crl CRL was recreated successfully
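
The length of time each expired CRL was being served can be worked out from the timeline entries above. The following is a minimal sketch that uses only the expiry and recreation timestamps listed in this timeline; the names `WINDOWS`, `utc`, and `stale_durations` are illustrative and not part of any tooling described in this report.

```python
from datetime import datetime, timezone

FMT = "%Y-%m-%d %H:%M:%S"


def utc(ts):
    """Parse a 'YYYY-MM-DD HH:MM:SS' timeline entry as a UTC datetime."""
    return datetime.strptime(ts, FMT).replace(tzinfo=timezone.utc)


# (expired, recreated) pairs copied from the timeline above.
WINDOWS = {
    "Trustidevcodesigning3.crl": ("2024-08-18 05:42:12", "2024-08-18 13:58:41"),
    "Timestamping3.crl": ("2024-08-18 05:52:13", "2024-08-18 13:58:42"),
}


def stale_durations(windows):
    """Return how long each CRL was expired before it was recreated."""
    return {name: utc(end) - utc(start) for name, (start, end) in windows.items()}


if __name__ == "__main__":
    for name, duration in stale_durations(WINDOWS).items():
        print(f"{name}: {duration}")
```

With these timestamps, the two CRLs were stale for roughly 8 hours 16 minutes and 8 hours 6 minutes respectively.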

Root Cause Analysis

The server OS update altered certain background processes that affected our certificate management workflow. These changes went unnoticed initially, as they did not impact non-CRL operations, and the CRLs were not due for renewal until several hours later.

Also during the update, the alert system that should have warned us about the impending CRL expirations was misconfigured. As a result, no alerts regarding upcoming CRL renewals were sent to our team members.

The combination of these two factors led to the CRLs not being renewed on time, and they expired.

Lessons Learned

What went well

A contingency plan was in place for a rollback if needed, and it was executed successfully.

What didn’t go well

  • The operating system update, which was successfully tested in our pre-production environment, failed to deploy correctly in the production environment.
  • Due to the OS update failure, the Certificate Revocation Lists (CRLs) did not renew as anticipated.
  • An email misconfiguration prevented the CRL expiration alerting system from notifying relevant parties.
  • The rollback plan, which was designed to mitigate such issues, was initiated too late to prevent CRL expiration.

Where we got lucky

Action Items

| Action Item | Kind | Due Date |
|---|---|---|
| Include a checklist item to verify that CRL alerts are functioning correctly | Prevent | Done |
| Resolve Email Alert Misconfiguration | Prevent | Done |
| Expedite the decision-making process for initiating rollbacks upon detecting issues related to server operating system updates. Implement a more responsive protocol that allows for quicker assessment and action when problems arise post-update | Prevent | Done |
| Improve debugging capabilities to provide enhanced visibility during operating system server upgrades | Prevent | 2024-09-30 |

Appendix

Details of affected certificates

No certificates were affected.

Thank you for the incident report. Could we have some clarity on the monitoring and alerting system and the potential improvements there?
Do you rely solely on email alerts to trigger actions on impending CRL expiry? If so, do you intend to put in place an alert escalation pattern (for example, first an email, followed by an incident and a standby call n minutes ahead of expiry)?

Thanks,
Kumaresh Somi

Our current monitoring and alerting system is more comprehensive than email alerts alone.

  • We utilize various tools for event monitoring, with Nagios being a primary component.
  • We have automated processes in place to publish a new Certificate Revocation List (CRL) 12 hours before it expires.
  • Nagios is integrated with PagerDuty, an incident management platform, which should alert us to critical issues detected by Nagios.

The recent incident was caused by a misconfiguration in Nagios, which failed to detect that the CRL was past its signing date and had eventually expired. This led to a breakdown in the alerting process, which we have since fixed and successfully tested.
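
As an illustration only (this is not our actual Nagios configuration), the kind of freshness check described above could be sketched as follows. The exit codes 0, 1, and 2 follow the standard Nagios plugin convention for OK, WARNING, and CRITICAL. The function name `crl_status`, the constant `RENEWAL_LEAD`, and the choice to warn whenever less than 12 hours remain are assumptions, derived from the 12-hour publication lead time mentioned above. Reading `nextUpdate` from a real CRL file is left out of the sketch.

```python
from datetime import datetime, timedelta, timezone

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2

# Assumption: a replacement CRL should already be published once less than
# 12 hours remain before nextUpdate, mirroring the schedule described above.
RENEWAL_LEAD = timedelta(hours=12)


def crl_status(next_update, now, renewal_lead=RENEWAL_LEAD):
    """Classify a CRL by the time left between `now` and its nextUpdate."""
    remaining = next_update - now
    if remaining <= timedelta(0):
        return CRITICAL, f"CRL expired {-remaining} ago"
    if remaining < renewal_lead:
        return WARNING, f"CRL renewal overdue; expires in {remaining}"
    return OK, f"CRL fresh; expires in {remaining}"


# Example using the expiry time of Trustidevcodesigning3.crl from the timeline.
next_update = datetime(2024, 8, 18, 5, 42, 12, tzinfo=timezone.utc)
for now in (
    datetime(2024, 8, 17, 9, 0, 0, tzinfo=timezone.utc),
    datetime(2024, 8, 18, 4, 1, 19, tzinfo=timezone.utc),
    datetime(2024, 8, 18, 5, 56, 9, tzinfo=timezone.utc),
):
    code, message = crl_status(next_update, now)
    print(code, message)
```

A real plugin would print the message and exit with the returned code, so that Nagios could escalate a WARNING before the CRL expires instead of only reporting the expiry itself.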

We have successfully identified and implemented improvements to our debugging capabilities. These enhancements have been merged into our configuration management codebase, resulting in:

  • Enhanced visibility during operating system upgrades.
  • Improved logging for related operations.

We have no additional updates for this issue, which we consider closed/resolved on our side.

Flags: needinfo?(bwilson)

I will close this on Friday, 18-Oct-2024, unless there are additional issues to discuss.

Status: ASSIGNED → RESOLVED
Closed: 3 months ago → 1 month ago
Flags: needinfo?(bwilson)
Resolution: --- → FIXED