Certainly: Serving invalid or incomplete CRLs
Categories
(CA Program :: CA Certificate Compliance, task)
Tracking
(Not tracked)
People
(Reporter: wthayer, Assigned: wthayer)
Details
(Whiteboard: [ca-compliance] [crl-failure])
Incident Report
Summary
The service responsible for updating Certainly's R1 intermediate issuer CRL shards began to store them incorrectly on May 24th, 2024. E1 issuer CRLs continued to be updated, but were sometimes overwritten by the R1 CRLs. OCSP continued to function normally and we did not serve expired CRLs. However, E1 and R1 CRLs became intermixed and we sometimes served the wrong CRL (R1 when it should have been E1) and also served CRLs that, while valid, did not contain recent revocations. This state continued until the issue was discovered on May 29th, 2024. The CRLs were corrected approximately two hours after discovery, with the affected CRL shards manually updated. A change using our automation systems with the corrected configuration was rolled out shortly after.
Impact
The R1 CRL shards were not updated after May 24th, 2024, 2:52:22 AM so they did not contain the very latest revocations, but were accurate and valid for the data they did contain. When requesting an E1 CRL, the requester would sometimes receive an R1 CRL instead.
Some impact-mitigating facts to bear in mind:
- No expired CRLs were served.
- Certainly does not publish CRL information in its end-entity certificates. Only services querying CCADB for CRL information should have been affected.
Timeline
All times are UTC.
May 6th, 2024
- Boulder release-2024-05-06 is released containing PR 7461.
Week of May 12th, 2024
- Weekly Boulder deploy at Certainly is delayed due to questions around impact of implementing the changes in PR 7461.
May 24th, 2024
- 2:42:38 AM - Weekly Boulder update is rolled out with configuration changes to the CRL generation parameters.
- 2:52:22 AM - Both R1 and E1 CRLs begin to be published to the E1 URLs at this time. R1 CRL shards are no longer updated at the expected URLs.
May 25th, 2024
- 2:52:29 AM - The R1 CRLs are now in violation of the BRs, as a certificate revoked more than 24 hours earlier has not yet appeared in a published CRL.
May 29th, 2024
- 6:30 PM - Logs are examined in staging as part of weekly software update testing, and CRL errors are discovered; the typo in the configuration is identified. Production logs are then examined, and the same CRL errors are found to be present.
- 6:32 PM - CRL files are examined - thisUpdate field in R1 issuer CRL has not changed since May 24th, 2024, 02:22:24 AM. R1 issuer CRL shards at the expected URL are found not to have been updated since that time.
- 8:39:09 PM - CRL configurations are manually updated within services. CRL updates are manually forced in production.
- 10:22:46 PM - Change in CRL configuration is rolled out into production to ensure CRLs stay updated. Incident is declared mitigated.
Root Cause Analysis
In the process of updating the Boulder software, we identified changes to how CRL URLs are specified and made the corresponding changes to our configurations. The changes were deployed with the weekly software update, after testing. Release testing includes checking for alerts and errors in the logs and verifying issuance and OCSP responses. A CRL monitor checked for healthy CRLs. All of these tests passed in staging, and close monitoring of the service for approximately an hour after the production deployment showed no errors.
The configuration change contained a copy-paste error that caused the R1 CRLs to use the same URL path as the E1 CRLs. As a result, R1 CRLs were no longer updated at the expected URLs, and some E1 CRLs were overwritten by R1 CRLs, depending on which was published last.
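To make the failure mode concrete, the following is a minimal sketch of a pre-deployment guard against this class of mistake. It is not Boulder's or Certainly's actual configuration schema; the type names, field names, and URLs below are hypothetical.

```go
package main

import (
	"fmt"
	"log"
)

// issuerCRLConfig is a hypothetical per-issuer CRL configuration entry;
// it is not Boulder's real config schema.
type issuerCRLConfig struct {
	IssuerName string // e.g. "E1", "R1"
	CRLBaseURL string // base URL the issuer's CRL shards are published under
}

// validateCRLConfig rejects configurations in which two issuers would
// publish their CRL shards under the same base URL -- the failure mode
// introduced by the copy-paste error described above.
func validateCRLConfig(cfgs []issuerCRLConfig) error {
	seen := make(map[string]string)
	for _, c := range cfgs {
		if other, ok := seen[c.CRLBaseURL]; ok {
			return fmt.Errorf("issuers %q and %q share CRL base URL %q",
				other, c.IssuerName, c.CRLBaseURL)
		}
		seen[c.CRLBaseURL] = c.IssuerName
	}
	return nil
}

func main() {
	// The R1 entry reproduces the copy-paste mistake: the E1 base URL
	// pasted into the R1 entry. URLs are placeholders, not the real paths.
	cfgs := []issuerCRLConfig{
		{IssuerName: "E1", CRLBaseURL: "http://crls.example.com/e1/"},
		{IssuerName: "R1", CRLBaseURL: "http://crls.example.com/e1/"},
	}
	if err := validateCRLConfig(cfgs); err != nil {
		log.Fatalf("refusing to deploy CRL configuration: %v", err)
	}
}
```

A check like this, run in CI against a proposed configuration or at updater startup, turns a silent misrouting of CRL shards into a hard deployment failure.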
The first unsuccessful safeguard was code review: the copy-paste mistake was not caught by reviewers.
The second unsuccessful safeguard was the human review of errors during testing and after deployment. No CRLs were published using the problematic configuration during release testing or the post-deploy observation period, so there were no errors to observe.
The third unsuccessful safeguard was the CRL monitor configuration. The monitor had been set to alert only when a CRL got too close to the end of its validity period; because the stale CRLs were still within their validity period, no alert fired.
The fourth and final unsuccessful safeguard was alerting on the errors in the logs. We have alerts for errors in the logs of Boulder services, but the error reported by the CRL updater did not match any of the current alerts.
During standard testing of the following week's Boulder release, manual review of the logs surfaced the errors from the CRL updater, and the team was able to quickly identify the typo and make the needed changes.
Lessons Learned
What went well
- Team communication and collaboration on the issue.
- Quick response to mitigate the incident.
- As part of standard testing for the weekly software update, the operator discovered errors related to the misconfiguration.
- Redundant safeguards were in place to prevent such an error.
What didn't go well
- Processes were insufficient to prevent a human error from making it into production.
- The current configuration for CRL management was not fully vetted during release testing.
- The existing CRL monitor checked for an expired CRL rather than verifying that CRLs were being updated in a timely manner (the CRLs had not expired at the time of discovery). Since we have a predictable cadence of certificate revocations, the monitor can alert on a much shorter interval; a sketch of such a freshness check follows this list.
- While alerts existed for several adjacent failures, no alerts existed for the error that the CRL update service was outputting.
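To illustrate the freshness check referenced above, here is a minimal sketch using only Go's standard library. The shard URL, the 6-hour threshold, and the function name are assumptions for illustration; the actual monitor is based on boulder-observer rather than this code.

```go
package main

import (
	"crypto/x509"
	"fmt"
	"io"
	"log"
	"net/http"
	"time"
)

// checkCRLFreshness fetches a CRL shard and reports an error if its
// thisUpdate field is older than maxAge -- catching a stale-but-not-yet-
// expired CRL, which an expiry-only monitor would miss.
func checkCRLFreshness(url string, maxAge time.Duration) error {
	resp, err := http.Get(url)
	if err != nil {
		return fmt.Errorf("fetching %s: %w", url, err)
	}
	defer resp.Body.Close()

	der, err := io.ReadAll(resp.Body)
	if err != nil {
		return fmt.Errorf("reading %s: %w", url, err)
	}

	crl, err := x509.ParseRevocationList(der)
	if err != nil {
		return fmt.Errorf("parsing %s: %w", url, err)
	}

	if age := time.Since(crl.ThisUpdate); age > maxAge {
		return fmt.Errorf("%s is stale: thisUpdate %s is %s old (max %s)",
			url, crl.ThisUpdate.Format(time.RFC3339), age.Round(time.Minute), maxAge)
	}
	return nil
}

func main() {
	// Hypothetical threshold: alert if a shard has not been re-signed in
	// 6 hours, well before it approaches the end of its validity period.
	err := checkCRLFreshness("http://crls.certainly.com/17182262453002514/1.crl", 6*time.Hour)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("CRL is fresh")
}
```

Run on a short schedule against each shard, a check like this would likely have started failing on May 24th, long before the stale R1 CRLs approached the end of their validity period.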
Where we got lucky
- The only certificates revoked during the timeline of the incident were certificates requested by internal Certainly tooling.
Action Items
Action Item | Kind | Due Date |
---|---|---|
Manual or automated test cases that definitively confirm CRLs are both currently valid and being properly generated during release testing. | Prevent | 1-July 2024 |
Tune CRL monitoring to alert when updates have not occurred in a typical timeframe. | Detect | done |
Add additional alerts to catch CRL-related errors in logs, including updates and related failures. | Detect | 1-July 2024 |
Appendix
Affected CRL endpoints
http://crls.certainly.com/17182262453002514/1.crl
http://crls.certainly.com/17182262453002514/2.crl
http://crls.certainly.com/17182262453002514/3.crl
http://crls.certainly.com/17182262453002514/4.crl
http://crls.certainly.com/17182262453002514/5.crl
http://crls.certainly.com/17182262453002514/6.crl
http://crls.certainly.com/17182262453002514/7.crl
http://crls.certainly.com/17182262453002514/8.crl
http://crls.certainly.com/17182262453002514/9.crl
http://crls.certainly.com/17182262453002514/10.crl
http://crls.certainly.com/11083093410258787/1.crl
http://crls.certainly.com/11083093410258787/2.crl
http://crls.certainly.com/11083093410258787/3.crl
http://crls.certainly.com/11083093410258787/4.crl
http://crls.certainly.com/11083093410258787/5.crl
http://crls.certainly.com/11083093410258787/6.crl
http://crls.certainly.com/11083093410258787/7.crl
http://crls.certainly.com/11083093410258787/8.crl
http://crls.certainly.com/11083093410258787/9.crl
http://crls.certainly.com/11083093410258787/10.crl
Comment 1•1 year ago
Thanks for this report! One question: does Certainly use the boulder-observer tool for monitoring CRLs? If so, it might be useful to add a check to its crl prober that confirms that the CRL it retrieved declares the same issuingDistributionPoint
as the URL from which it was retrieved. This would have caught cases where an E1 URL was serving an R1 CRL.
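A rough sketch of that check, assuming the fetched CRL carries an issuingDistributionPoint extension whose distributionPoint is a fullName URI, as Boulder's sharded CRLs do. This is not boulder-observer code; the function names are illustrative and the parser handles only the fullName/URI case.

```go
package main

import (
	"crypto/x509"
	"encoding/asn1"
	"fmt"
	"io"
	"log"
	"net/http"
)

// idpOID is id-ce-issuingDistributionPoint (RFC 5280, section 5.2.5).
var idpOID = asn1.ObjectIdentifier{2, 5, 29, 28}

// issuingDistributionPoint models only the first field of the IDP extension;
// encoding/asn1 tolerates the trailing fields we omit.
type issuingDistributionPoint struct {
	DistributionPoint asn1.RawValue `asn1:"optional,tag:0"`
}

// idpURIs extracts the fullName URIs from a CRL's issuingDistributionPoint
// extension, returning nil if the extension is absent.
func idpURIs(crl *x509.RevocationList) ([]string, error) {
	var extValue []byte
	for _, ext := range crl.Extensions {
		if ext.Id.Equal(idpOID) {
			extValue = ext.Value
			break
		}
	}
	if extValue == nil {
		return nil, nil
	}
	var idp issuingDistributionPoint
	if _, err := asn1.Unmarshal(extValue, &idp); err != nil {
		return nil, fmt.Errorf("parsing IDP: %w", err)
	}
	// DistributionPoint wraps a DistributionPointName CHOICE; the fullName
	// alternative is a [0] constructed element containing GeneralName values.
	var fullName asn1.RawValue
	if _, err := asn1.Unmarshal(idp.DistributionPoint.Bytes, &fullName); err != nil {
		return nil, fmt.Errorf("parsing DistributionPointName: %w", err)
	}
	var uris []string
	rest := fullName.Bytes
	for len(rest) > 0 {
		var gn asn1.RawValue
		var err error
		if rest, err = asn1.Unmarshal(rest, &gn); err != nil {
			return nil, fmt.Errorf("parsing GeneralName: %w", err)
		}
		// uniformResourceIdentifier is GeneralName tag [6].
		if gn.Class == asn1.ClassContextSpecific && gn.Tag == 6 {
			uris = append(uris, string(gn.Bytes))
		}
	}
	return uris, nil
}

// checkIDP fetches a CRL and confirms it declares the URL it was served from,
// which would flag an R1 CRL being returned from an E1 URL.
func checkIDP(fetchURL string) error {
	resp, err := http.Get(fetchURL)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	der, err := io.ReadAll(resp.Body)
	if err != nil {
		return err
	}
	crl, err := x509.ParseRevocationList(der)
	if err != nil {
		return err
	}
	uris, err := idpURIs(crl)
	if err != nil {
		return err
	}
	for _, u := range uris {
		if u == fetchURL {
			return nil
		}
	}
	return fmt.Errorf("CRL fetched from %s declares IDP %v", fetchURL, uris)
}

func main() {
	if err := checkIDP("http://crls.certainly.com/17182262453002514/1.crl"); err != nil {
		log.Fatal(err)
	}
	fmt.Println("issuingDistributionPoint matches fetch URL")
}
```

Because the misconfigured updater was writing R1 shards to E1 URLs, a probe like this would have detected the divergence between the declared IDP and the fetch URL shortly after the bad deploy.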
Comment 2•1 year ago
We are currently using the boulder-observer for CRL monitoring; however, I don't see anything in the documentation that mentions the CRL DP as a returned value. Is this a feature to be implemented, or is there a way to do this with the current release?
Comment 3•1 year ago
You're right that it does not check the CRL's Issuing Distribution Point today. I was suggesting that this could be a useful feature to add to the boulder-observer as part of the remediation items for this incident. Boulder PRs are always welcome!
Comment 4•1 year ago
(In reply to Wayne Thayer from comment #0)
> Where we got lucky
> - The only certificates revoked during the timeline of the incident were certificates requested by internal Certainly tooling.
On May 29, before you published the correct CRLs, how many entries were missing?
Other than those details that could have been in the impact section, I think this was an excellent report that others should seek to emulate.
Comment 5•1 year ago
(In reply to Mathew Hodson from comment #4)
> (In reply to Wayne Thayer from comment #0)
> > Where we got lucky
> > - The only certificates revoked during the timeline of the incident were certificates requested by internal Certainly tooling.
> On May 29, before you published the correct CRLs, how many entries were missing?
Regarding the number of revocations while the CRLs were not being updated correctly, all of the revocations were produced by automated tooling. This tooling issues both an E1 and an R1 certificate then revokes them. Since this is a primary monitoring tool for the health of our issuance system, this happens very frequently.
Over the period from 2024-05-24 02:52:22 to 2024-05-29 20:39:10, 98,784 certificates were revoked, evenly split between R1 and E1. All were correctly reported via OCSP. No updated CRLs were accessible at the expected locations for R1, and the CRLs at the E1 URLs were inconsistently either an E1 or an R1 CRL.
> Other than those details that could have been in the impact section, I think this was an excellent report that others should seek to emulate.
Thank you for the kind words. We tried to make the report as complete, yet as concise, as we could.
Dan
Comment 6•1 year ago
(In reply to Aaron Gable from comment #3)
> You're right that it does not check the CRL's Issuing Distribution Point today. I was suggesting that this could be a useful feature to add to the boulder-observer as part of the remediation items for this incident. Boulder PRs are always welcome!
We've added this to the backlog and there is interest from the team in taking this on. We'd definitely like to engage more with Boulder development in the future.
Comment 7•1 year ago
(In reply to Daniel Jeffery from comment #6)
> We've added this to the backlog and there is interest from the team in taking this on. We'd definitely like to engage more with Boulder development in the future.
Fantastic! I've filed https://github.com/letsencrypt/boulder/issues/7527 to track this improvement.
Comment 8•1 year ago
(In reply to Wayne Thayer from comment #0)
> Incident Report
> ...
Action Items Updates
The changes to our systems and release testing processes needed to close the two remaining action items were completed on 2024-06-06.
Action Items
Action Item | Kind | Due Date |
---|---|---|
Manual or automated test cases that definitively confirm CRLs are both currently valid and being properly generated during release testing. | Prevent | Done |
Tune CRL monitoring to alert when updates have not occurred in a typical timeframe. | Detect | Done |
Add additional alerts to catch CRL-related errors in logs, including updates and related failures. | Detect | Done |
This concludes our reporting and remediation of this issue. We will continue to monitor for any further questions.
Comment 9•11 months ago
All action items are complete. We continue to monitor this bug for questions and feedback.
Comment 10•11 months ago
If there are no further questions or concerns, please consider closing this bug. Meanwhile, we will continue to monitor this bug.
Comment 11•11 months ago
I intend to close this next Wed. 26-June-2024 unless there are additional questions or comments.