Certainly: Serving invalid or incomplete CRLs
Categories
(CA Program :: CA Certificate Compliance, task)
Tracking
(Not tracked)
People
(Reporter: wthayer, Assigned: wthayer)
Details
(Whiteboard: [ca-compliance] [crl-failure])
Incident Report
Summary
The service responsible for updating Certainly's R1 intermediate issuer CRL shards began to store them incorrectly on May 24th, 2024. E1 issuer CRLs continued to be updated, but were sometimes overwritten by the R1 CRLs. OCSP continued to function normally and we did not serve expired CRLs. However, E1 and R1 CRLs became intermixed and we sometimes served the wrong CRL (R1 when it should have been E1) and also served CRLs that, while valid, did not contain recent revocations. This state continued until the issue was discovered on May 29th, 2024. The CRLs were corrected approximately two hours after discovery, with the affected CRL shards manually updated. A change using our automation systems with the corrected configuration was rolled out shortly after.
Impact
The R1 CRL shards were not updated after May 24th, 2024, 2:52:22 AM so they did not contain the very latest revocations, but were accurate and valid for the data they did contain. When requesting an E1 CRL, the requester would sometimes receive an R1 CRL instead.
Some impact-mitigating facts to bear in mind:
- No expired CRLs were served.
- Certainly does not publish CRL information in its end-entity certificates. Only services querying CCADB for CRL information should have been affected.
Timeline
All times are UTC.
May 6th, 2024
- Boulder release-2024-05-06 is released containing PR 7461.
Week of May 12th, 2024
- Weekly Boulder deploy at Certainly is delayed due to questions around impact of implementing the changes in PR 7461.
May 24th, 2024
- 2:42:38 AM - Weekly Boulder update is rolled out with configuration changes to the CRL generation parameters.
- 2:52:22 AM - Both R1 and E1 CRLs begin to be published to the E1 URLs at this time. R1 CRL shards are no longer updated at the expected URLs.
May 25th, 2024
- 2:52:29 AM - The R1 CRLs are now in violation of the BRs, as a certificate revoked more than 24 hours earlier has not yet appeared in a published CRL.
May 29th, 2024
- 6:30 PM - Logs are examined in staging as part of weekly software update testing, and CRL errors are discovered; the typo in the configuration is identified. Production logs are then examined, and the same CRL errors are found to be present.
- 6:32 PM - CRL files are examined - thisUpdate field in R1 issuer CRL has not changed since May 24th, 2024, 02:22:24 AM. R1 issuer CRL shards at the expected URL are found not to have been updated since that time.
- 8:39:09 PM - CRL configurations are manually updated within services. CRL updates are manually forced in production.
- 10:22:46 PM - Change in CRL configuration is rolled out into production to ensure CRLs stay updated. Incident is declared mitigated.
Root Cause Analysis
In the process of updating the Boulder software, we identified changes to how CRL URLs are specified and made the corresponding changes to our configurations. The changes were deployed with the weekly software update, after testing. Release testing includes checking for alerts and errors in the logs and verifying issuance and OCSP responses. A CRL monitor checked for healthy CRLs. All of these tests passed in staging, and close monitoring of the service for approximately an hour after the production deployment showed no errors.
The configuration change contained a copy-paste error that caused the R1 CRLs to use the same URL path as the E1 CRLs. As a result, R1 CRLs were no longer updated at the expected URLs, and some E1 CRLs were overwritten by R1 CRLs, depending on which was published last.
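To make the failure mode concrete, the following is a minimal sketch of a pre-deployment guard against this class of mistake. It is not Boulder's or Certainly's actual configuration schema; the type names, field names, and URLs below are hypothetical.

```go
package main

import (
	"fmt"
	"log"
)

// issuerCRLConfig is a hypothetical per-issuer CRL configuration entry;
// it is not Boulder's real config schema.
type issuerCRLConfig struct {
	IssuerName string // e.g. "E1", "R1"
	CRLBaseURL string // base URL the issuer's CRL shards are published under
}

// validateCRLConfig rejects configurations in which two issuers would
// publish their CRL shards under the same base URL -- the failure mode
// introduced by the copy-paste error described above.
func validateCRLConfig(cfgs []issuerCRLConfig) error {
	seen := make(map[string]string)
	for _, c := range cfgs {
		if other, ok := seen[c.CRLBaseURL]; ok {
			return fmt.Errorf("issuers %q and %q share CRL base URL %q",
				other, c.IssuerName, c.CRLBaseURL)
		}
		seen[c.CRLBaseURL] = c.IssuerName
	}
	return nil
}

func main() {
	// The R1 entry reproduces the copy-paste mistake: the E1 base URL
	// pasted into the R1 entry. URLs are placeholders, not the real paths.
	cfgs := []issuerCRLConfig{
		{IssuerName: "E1", CRLBaseURL: "http://crls.example.com/e1/"},
		{IssuerName: "R1", CRLBaseURL: "http://crls.example.com/e1/"},
	}
	if err := validateCRLConfig(cfgs); err != nil {
		log.Fatalf("refusing to deploy CRL configuration: %v", err)
	}
}
```

A check like this, run in CI against a proposed configuration or at updater startup, turns a silent misrouting of CRL shards into a hard deployment failure.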
The first unsuccessful safeguard was code review: the copy-paste mistake was not caught by reviewers.
The second unsuccessful safeguard was the human review of errors during testing and after deployment. No CRLs were published using the problematic configuration during release testing or the post-deploy observation period, so there were no errors to observe.
The third unsuccessful safeguard was the CRL monitor configuration. The monitor had been set to alert only when a CRL got too close to the end of its validity period; because the stale CRLs were still within their validity period, no alert fired.
The fourth and final unsuccessful safeguard was alerting on the errors in the logs. We have alerts for errors in the logs of Boulder services, but the error reported by the CRL updater did not match any of the current alerts.
During standard testing of the following week's Boulder release, manual review of the logs surfaced the errors from the CRL updater, and the team was able to quickly identify the typo and make the needed changes.
Lessons Learned
What went well
- Team communication and collaboration on the issue.
- Quick response to mitigate the incident.
- As part of standard testing for the weekly software update, the operator discovered errors related to the misconfiguration.
- Redundant safeguards were in place to prevent such an error.
What didn't go well
- Processes were insufficient to prevent a human error from making it into production.
- The current configuration for CRL management was not fully vetted during release testing.
- The existing CRL monitor checked for an expired CRL rather than verifying that CRLs were being updated in a timely manner (the CRLs had not expired at the time of discovery). Since we have a predictable cadence of certificate revocations, the monitor can alert on a much shorter interval; a sketch of such a freshness check follows this list.
- While alerts existed for several adjacent failures, no alerts existed for the error that the CRL update service was outputting.
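To illustrate the freshness check referenced above, here is a minimal sketch using only Go's standard library. The shard URL, the 6-hour threshold, and the function name are assumptions for illustration; the actual monitor is based on boulder-observer rather than this code.

```go
package main

import (
	"crypto/x509"
	"fmt"
	"io"
	"log"
	"net/http"
	"time"
)

// checkCRLFreshness fetches a CRL shard and reports an error if its
// thisUpdate field is older than maxAge -- catching a stale-but-not-yet-
// expired CRL, which an expiry-only monitor would miss.
func checkCRLFreshness(url string, maxAge time.Duration) error {
	resp, err := http.Get(url)
	if err != nil {
		return fmt.Errorf("fetching %s: %w", url, err)
	}
	defer resp.Body.Close()

	der, err := io.ReadAll(resp.Body)
	if err != nil {
		return fmt.Errorf("reading %s: %w", url, err)
	}

	crl, err := x509.ParseRevocationList(der)
	if err != nil {
		return fmt.Errorf("parsing %s: %w", url, err)
	}

	if age := time.Since(crl.ThisUpdate); age > maxAge {
		return fmt.Errorf("%s is stale: thisUpdate %s is %s old (max %s)",
			url, crl.ThisUpdate.Format(time.RFC3339), age.Round(time.Minute), maxAge)
	}
	return nil
}

func main() {
	// Hypothetical threshold: alert if a shard has not been re-signed in
	// 6 hours, well before it approaches the end of its validity period.
	err := checkCRLFreshness("http://crls.certainly.com/17182262453002514/1.crl", 6*time.Hour)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("CRL is fresh")
}
```

Run on a short schedule against each shard, a check like this would likely have started failing on May 24th, long before the stale R1 CRLs approached the end of their validity period.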
Where we got lucky
- The only certificates revoked during the timeline of the incident were certificates requested by internal Certainly tooling.
Action Items
Action Item | Kind | Due Date |
---|---|---|
Manual or automated test cases that definitively confirm CRLs are both currently valid and being properly generated during release testing. | Prevent | 1-July 2024 |
Tune CRL monitoring to alert when updates have not occurred in a typical timeframe. | Detect | done |
Add additional alerts to catch CRL-related errors in logs, including updates and related failures. | Detect | 1-July 2024 |
Appendix
Affected CRL endpoints
http://crls.certainly.com/17182262453002514/1.crl
http://crls.certainly.com/17182262453002514/2.crl
http://crls.certainly.com/17182262453002514/3.crl
http://crls.certainly.com/17182262453002514/4.crl
http://crls.certainly.com/17182262453002514/5.crl
http://crls.certainly.com/17182262453002514/6.crl
http://crls.certainly.com/17182262453002514/7.crl
http://crls.certainly.com/17182262453002514/8.crl
http://crls.certainly.com/17182262453002514/9.crl
http://crls.certainly.com/17182262453002514/10.crl
http://crls.certainly.com/11083093410258787/1.crl
http://crls.certainly.com/11083093410258787/2.crl
http://crls.certainly.com/11083093410258787/3.crl
http://crls.certainly.com/11083093410258787/4.crl
http://crls.certainly.com/11083093410258787/5.crl
http://crls.certainly.com/11083093410258787/6.crl
http://crls.certainly.com/11083093410258787/7.crl
http://crls.certainly.com/11083093410258787/8.crl
http://crls.certainly.com/11083093410258787/9.crl
http://crls.certainly.com/11083093410258787/10.crl
Comment 1•1 year ago
Thanks for this report! One question: does Certainly use the boulder-observer tool for monitoring CRLs? If so, it might be useful to add a check to its crl prober that confirms that the CRL it retrieved declares the same issuingDistributionPoint
as the URL from which it was retrieved. This would have caught cases where an E1 URL was serving an R1 CRL.
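A rough sketch of that check, assuming the fetched CRL carries an issuingDistributionPoint extension whose distributionPoint is a fullName URI, as Boulder's sharded CRLs do. This is not boulder-observer code; the function names are illustrative and the parser handles only the fullName/URI case.

```go
package main

import (
	"crypto/x509"
	"encoding/asn1"
	"fmt"
	"io"
	"log"
	"net/http"
)

// idpOID is id-ce-issuingDistributionPoint (RFC 5280, section 5.2.5).
var idpOID = asn1.ObjectIdentifier{2, 5, 29, 28}

// issuingDistributionPoint models only the first field of the IDP extension;
// encoding/asn1 tolerates the trailing fields we omit.
type issuingDistributionPoint struct {
	DistributionPoint asn1.RawValue `asn1:"optional,tag:0"`
}

// idpURIs extracts the fullName URIs from a CRL's issuingDistributionPoint
// extension, returning nil if the extension is absent.
func idpURIs(crl *x509.RevocationList) ([]string, error) {
	var extValue []byte
	for _, ext := range crl.Extensions {
		if ext.Id.Equal(idpOID) {
			extValue = ext.Value
			break
		}
	}
	if extValue == nil {
		return nil, nil
	}
	var idp issuingDistributionPoint
	if _, err := asn1.Unmarshal(extValue, &idp); err != nil {
		return nil, fmt.Errorf("parsing IDP: %w", err)
	}
	// DistributionPoint wraps a DistributionPointName CHOICE; the fullName
	// alternative is a [0] constructed element containing GeneralName values.
	var fullName asn1.RawValue
	if _, err := asn1.Unmarshal(idp.DistributionPoint.Bytes, &fullName); err != nil {
		return nil, fmt.Errorf("parsing DistributionPointName: %w", err)
	}
	var uris []string
	rest := fullName.Bytes
	for len(rest) > 0 {
		var gn asn1.RawValue
		var err error
		if rest, err = asn1.Unmarshal(rest, &gn); err != nil {
			return nil, fmt.Errorf("parsing GeneralName: %w", err)
		}
		// uniformResourceIdentifier is GeneralName tag [6].
		if gn.Class == asn1.ClassContextSpecific && gn.Tag == 6 {
			uris = append(uris, string(gn.Bytes))
		}
	}
	return uris, nil
}

// checkIDP fetches a CRL and confirms it declares the URL it was served from,
// which would flag an R1 CRL being returned from an E1 URL.
func checkIDP(fetchURL string) error {
	resp, err := http.Get(fetchURL)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	der, err := io.ReadAll(resp.Body)
	if err != nil {
		return err
	}
	crl, err := x509.ParseRevocationList(der)
	if err != nil {
		return err
	}
	uris, err := idpURIs(crl)
	if err != nil {
		return err
	}
	for _, u := range uris {
		if u == fetchURL {
			return nil
		}
	}
	return fmt.Errorf("CRL fetched from %s declares IDP %v", fetchURL, uris)
}

func main() {
	if err := checkIDP("http://crls.certainly.com/17182262453002514/1.crl"); err != nil {
		log.Fatal(err)
	}
	fmt.Println("issuingDistributionPoint matches fetch URL")
}
```

Because the misconfigured updater was writing R1 shards to E1 URLs, a probe like this would have detected the divergence between the declared IDP and the fetch URL shortly after the bad deploy.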
Comment 2•1 year ago
We are currently using the boulder-observer for CRL monitoring; however, I don't see anything in the documentation that mentions the CRL DP as a returned value. Is this a feature to be implemented, or is there a way to do this with the current release?
Comment 3•1 year ago
You're right that it does not check the CRL's Issuing Distribution Point today. I was suggesting that this could be a useful feature to add to the boulder-observer as part of the remediation items for this incident. Boulder PRs are always welcome!
Comment 4•1 year ago
(In reply to Wayne Thayer from comment #0)
> Where we got lucky
> - The only certificates revoked during the timeline of the incident were certificates requested by internal Certainly tooling.
On May 29, before you published the correct CRLs, how many entries were missing?
Other than those details that could have been in the impact section, I think this was an excellent report that others should seek to emulate.
Comment 5•1 year ago
(In reply to Mathew Hodson from comment #4)
> (In reply to Wayne Thayer from comment #0)
> > Where we got lucky
> > - The only certificates revoked during the timeline of the incident were certificates requested by internal Certainly tooling.
> On May 29, before you published the correct CRLs, how many entries were missing?
Regarding the number of revocations while the CRLs were not being updated correctly, all of the revocations were produced by automated tooling. This tooling issues both an E1 and an R1 certificate then revokes them. Since this is a primary monitoring tool for the health of our issuance system, this happens very frequently.
Over the period from 2024-05-24 02:52:22 to 2024-05-29 20:39:10, 98,784 certificates were revoked, evenly split between R1 and E1. All were correctly reported via OCSP. No updated CRLs were accessible at the expected locations for R1, and the CRLs at the E1 URLs were inconsistently either an E1 or an R1 CRL.
> Other than those details that could have been in the impact section, I think this was an excellent report that others should seek to emulate.
Thank you for the kind words. We tried to make the report as complete, yet as concise, as we could.
Dan
Comment 6•1 year ago
(In reply to Aaron Gable from comment #3)
> You're right that it does not check the CRL's Issuing Distribution Point today. I was suggesting that this could be a useful feature to add to the boulder-observer as part of the remediation items for this incident. Boulder PRs are always welcome!
We've added this to the backlog and there is interest from the team in taking this on. We'd definitely like to engage more with Boulder development in the future.
Comment 7•1 year ago
(In reply to Daniel Jeffery from comment #6)
> We've added this to the backlog and there is interest from the team in taking this on. We'd definitely like to engage more with Boulder development in the future.
Fantastic! I've filed https://github.com/letsencrypt/boulder/issues/7527 to track this improvement.
Comment 8•1 year ago
(In reply to Wayne Thayer from comment #0)
> Incident Report
> ...
Action Items Updates
The changes to our systems and release testing processes needed to close the two remaining action items were completed on 2024-06-06.
Action Items
Action Item | Kind | Due Date |
---|---|---|
Manual or automated test cases that definitively confirm CRLs are both currently valid and being properly generated during release testing. | Prevent | Done |
Tune CRL monitoring to alert when updates have not occurred in a typical timeframe. | Detect | Done |
Add additional alerts to catch CRL-related errors in logs, including updates and related failures. | Detect | Done |
This concludes our reporting and remediation of this issue. We will continue to monitor for any further questions.
Comment 9•11 months ago
All action items are complete. We continue to monitor this bug for questions and feedback.
Comment 10•11 months ago
If there are no further questions or concerns, please consider closing this bug. Meanwhile, we will continue to monitor this bug.
Comment 11•11 months ago
I intend to close this next Wed. 26-June-2024 unless there are additional questions or comments.