Let's Encrypt: Early CRL Removal Incident
Categories
(CA Program :: CA Certificate Compliance, task)
Tracking
(Not tracked)
People
(Reporter: inahga, Assigned: inahga)
Details
(Whiteboard: [ca-compliance] [crl-failure])
Attachments
(3 files)
Preliminary Incident Report
Summary
- Incident description: Two revoked certificates (with serials 053258de6a60305075be223306a211f5328e and 06e748585b1ba76887bf7b0cce58a94361de) were removed from our CRLs (partitions 15 and 95, respectively) before the certificates had expired.
- Relevant policies: This is a violation of RFC 5280, Section 3.3, which says “An entry MUST NOT be removed from the CRL until it appears on one regularly scheduled CRL issued beyond the revoked certificate's validity period.”
- Source of incident disclosure: An internal email alert notified us of the situation at 2025-03-18 15:06 UTC.
We have developed, tested, and deployed a fix for this error, and the missing entries have been restored to the CRLs.
We will post a full incident report on or before 2025-03-25.
Comment 1•7 months ago
Full Incident Report
Summary
- CA Owner CCADB unique ID: Internet Security Research Group
- Incident description: Two revoked certificates (with serials 053258de6a60305075be223306a211f5328e and 06e748585b1ba76887bf7b0cce58a94361de) were removed from our CRLs (partitions 15 and 95, respectively) before the certificates had expired.
- Timeline summary:
- Non-compliance start date: 2025-03-18 15:05:05 UTC
- Non-compliance identified date: 2025-03-18 15:28:00 UTC
- Non-compliance end date: 2025-03-18 22:19:38 UTC
- Relevant policies: This is a violation of RFC 5280, Section 3.3, which says “An entry MUST NOT be removed from the CRL until it appears on one regularly scheduled CRL issued beyond the revoked certificate's validity period.”
- Source of incident disclosure: An internal email alert notified us of the situation at 2025-03-18 15:06 UTC.
Impact
- Total number of certificates: 2
- Total number of "remaining valid" certificates: 0
- Affected certificate types: DV certificates
- Incident heuristic: Any revoked certificate issued after 2025-03-12 21:15 and due to expire before 2025-03-19 22:00.
- Was issuance stopped in response to this incident, and why or why not?: No. This incident did not produce any invalid certificates, so we did not stop issuance.
- Analysis: N/A, no revocations were delayed.
- Additional considerations: This incident affected 2 CRL partitions. We did not halt issuance of CRLs because our analysis of the incident made us confident that a third revoked certificate entry would not be affected for several days, and because halting publication of revocation information can lead to further incidents (e.g. missing a revocation deadline).
Timeline
All times are UTC.
Throughout, “Cert A” refers to the certificate with serial 06e748585b1ba76887bf7b0cce58a94361de, and “Cert B” refers to the certificate with serial 053258de6a60305075be223306a211f5328e.
2025-01-27
- 18:11 Change introducing the bug merged, but the buggy code was not yet reachable
2025-02-04
- 16:45 Change making the bug reachable merged, but gated behind a feature flag
2025-03-12
- 21:15 Feature flag flipped in production, causing the buggy codepath to be reachable
- 21:20:06 Cert A issued
- 21:54:35 Cert B issued
- 22:19:26 Cert A revoked
- 22:37:52 Cert A first appears on CRL partition 95
- 22:53:26 Cert B revoked
- 23:09:35 Cert B first appears on CRL partition 15
2025-03-17
- 13:10 Staging crl-monitor detects faulty non-production CRL and sends email alert
2025-03-18
- 15:05:05 CRL partition 15 signed with entry for Cert B missing (Incident Begins)
- 15:05:25 crl-monitor detects early removal of Cert B from CRL partition 15, and sends an email alert
- 15:11:17 CRL partition 95 signed with entry for Cert A missing
- 15:11:31 Production crl-monitor detects early removal of Cert A from CRL partition 95 and sends email alert
- 15:28 SRE and Dev teams acknowledge the incident and begin investigation
- 15:33 Investigation identifies all affected certificates and CRLs
- 16:15 Dev team identifies possible fix
- 16:18 Potential fix developed
- 16:50 Unit tests for potential fix developed
- 17:39 Tested fix approved and merged to main
- 17:44 Fix back-ported to currently-deployed release
- 17:45 Hotfix release containing the fix tagged for deployment
- 17:51 Investigation confirms that the next time an entry could be removed early would be approximately 2025-03-23 11:00:00
- 18:44 Fix deployed to Staging
- 19:00 Confirmed that the entry which had been erroneously removed from Staging CRLs was restored by the fix
- 22:00 Fix deployed to Production
- 22:05:03 Cert B restored to CRL partition 15
- 22:19:38 Cert A restored to CRL partition 95 (Incident Ends)
- 22:29 Preliminary report posted
2025-03-19
- 13:20:05 Cert A expired
- 13:54:34 Cert B expired
2025-03-20
- 15:05:16 Cert B removed from CRL partition 15, as expected.
- 15:18:36 Cert A removed from CRL partition 95, as expected.
Related Incidents
| Bug | Date | Description |
|---|---|---|
| 1793114 | 2022-09-30 | Prior Let's Encrypt incident involving revoked certificate entries moving between CRL partitions. Led to the introduction of crl-monitor, which detected this incident. |
Root Cause Analysis
Fundamentally, this incident was caused by a software bug: the omission of a single minus sign in a single line of code in the component which queries the database for all of the revoked certificate entries that need to be included in a given CRL.
But we have systems in place to prevent such bugs from affecting our issuance environment. Those processes failed. We could have prevented this bug from being merged into the repository at all if our automated tests included a test checking that a revoked certificate entry remains in a CRL after the certificate has expired. We could have prevented this bug from affecting our production environment if we had noticed and investigated the alert which was fired from our staging environment a day earlier.
Contributing Factor 1: Bug in CRL Database Query Logic
- Description: Let’s Encrypt introduced a new CRL partitioning scheme, which changes how certificates are assigned to partitions. This new logic contained a miscalculation when determining which entries to include in the CRL:

  ```go
  // Query for unexpired certificates, with padding to ensure that revoked certificates show
  // up in at least one CRL, even if they expire between revocation and CRL generation.
  expiresAfter := cu.clk.Now().Add(cu.lookbackPeriod)
  saStream, err := cu.sa.GetRevokedCertsByShard(ctx, &sapb.GetRevokedCertsByShardRequest{
      IssuerNameID:  int64(issuerNameID),
      ShardIdx:      int64(shardIdx),
      ExpiresAfter:  timestamppb.New(expiresAfter),
      RevokedBefore: timestamppb.New(atTime),
  })
  ```

  The `lookbackPeriod` is defined as how far back the CRL updater looks for expired revoked certificates; it is set to 24 hours. The value was applied in the wrong direction: for a hypothetical revoked 90-day certificate, this would lead to its entry being removed 89 days into the certificate's lifetime instead of 91 days. The fix was to instead calculate:

  ```go
  expiresAfter := cu.clk.Now().Add(-cu.lookbackPeriod)
  ```

  Note that certificates issued under the old CRL partitioning scheme were not affected, because we did not change their partitioning logic. (A minimal sketch of the effect of this sign error appears after this list.)
- Timeline: The bug was deployed to production on 2025-03-12 21:15. It was remediated on 2025-03-18 22:00.
- Detection: Our CRL monitoring detected an incorrect CRL. The alert and our subsequent log analysis gave us sufficient context to review the code, identify the bug, and identify all affected certificates. The conditions that led to this bug escaping detection are elaborated on in the other contributing factors.
- Interaction with other factors: As described in Contributing Factor 2, we failed to test for the bug, leading to it being promoted to production. As described in Contributing Factor 3, the bug was detected in staging, but went unacknowledged until it recurred in production.
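To make the effect of the sign error concrete, here is a minimal, self-contained sketch. This is not Boulder code; the variable names and values are illustrative. It shows how the buggy cutoff (now plus the lookback period) drops an entry for a revoked certificate that is still unexpired, while the corrected cutoff (now minus the lookback period) keeps it:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Hypothetical values mirroring the incident: a 24-hour lookback period
	// and a revoked certificate whose notAfter is 12 hours from "now".
	now := time.Date(2025, 3, 18, 15, 0, 0, 0, time.UTC)
	lookbackPeriod := 24 * time.Hour
	notAfter := now.Add(12 * time.Hour)

	// Buggy cutoff: now + 24h. Entries for certificates expiring within the
	// next 24 hours are excluded from the CRL even though they are still valid.
	buggyCutoff := now.Add(lookbackPeriod)
	// Corrected cutoff: now - 24h. Entries are retained until the certificate
	// has been expired for up to 24 hours, so each entry appears on at least
	// one CRL issued after the certificate's notAfter date.
	fixedCutoff := now.Add(-lookbackPeriod)

	fmt.Println("included with buggy cutoff:", notAfter.After(buggyCutoff)) // false: removed early
	fmt.Println("included with fixed cutoff:", notAfter.After(fixedCutoff)) // true: still listed
}
```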
Contributing Factor 2: Insufficient Testing
- Description: The Boulder integration tests do not include a test which verifies that a revoked certificate entry is not removed until after it has appeared on at least one CRL following its notAfter date. This is due to the fundamental difficulty of manipulating the Boulder software's concept of what time it is during the integration tests: doing so requires manipulating the environment or memory of a running process, synchronizing that manipulation across the 20+ separate processes running in the integration test environment, and not clobbering any internal clock manipulation those processes have done in the course of testing. Our integration tests currently use a blunt workaround: spin up the whole stack with a fake clock several months in the past, do a small amount of setup work, restart the whole stack at the current time, and then run the main corpus of tests. This system is difficult enough to work with that it discourages writing new tests which require work to be done at multiple different timestamps, such as one confirming inclusion of a certificate on a CRL even after that certificate expires. (A sketch of the kind of clock-controlled check we want is shown after this list.)
- Timeline: The full-stop-and-restart mechanism for changing the integration test clock has existed since 2017-09-07, and we have known we wanted to make it more flexible since at least 2020-07-06. The change which introduced the bug was merged on 2020-01-27, and the change which made the bug reachable was merged on 2020-02-04.
- Detection: We've been aware of this difficulty in our integration tests since 2020, as noted above. We found our tests were missing this particular case during the investigation of this incident.
- Interaction with other factors: Had we caught the bug in testing, we would have avoided the incident entirely.
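As an illustration of the kind of clock-controlled check described above, here is a minimal unit-test-style sketch. It assumes a controllable fake clock such as github.com/jmhodges/clock; the `cutoffAt` helper and test name are hypothetical, not Boulder's actual test code, and simply exercise the corrected cutoff calculation:

```go
package updater_test

import (
	"testing"
	"time"

	"github.com/jmhodges/clock"
)

// cutoffAt computes the ExpiresAfter cutoff the way the fixed code does:
// entries are kept for certificates whose notAfter is later than now minus
// the lookback period. This helper is illustrative only.
func cutoffAt(now time.Time, lookback time.Duration) time.Time {
	return now.Add(-lookback)
}

func TestExpirationCutoff(t *testing.T) {
	fc := clock.NewFake()
	fc.Set(time.Date(2025, 3, 18, 15, 0, 0, 0, time.UTC))
	lookback := 24 * time.Hour

	cutoff := cutoffAt(fc.Now(), lookback)

	// A certificate that expired one hour ago must still fall inside the
	// query window, so its revocation entry appears on at least one CRL
	// issued after its notAfter date.
	expiredRecently := fc.Now().Add(-time.Hour)
	if !expiredRecently.After(cutoff) {
		t.Errorf("recently expired certificate (notAfter %s) fell outside window (cutoff %s)", expiredRecently, cutoff)
	}

	// A certificate that expires 12 hours from now must also be included;
	// with the buggy cutoff (now + lookback) it would not have been.
	expiresSoon := fc.Now().Add(12 * time.Hour)
	if !expiresSoon.After(cutoff) {
		t.Errorf("unexpired certificate (notAfter %s) fell outside window (cutoff %s)", expiresSoon, cutoff)
	}
}
```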
Contributing Factor 3: Alerting Gaps
- Description: As an action item from a prior incident, we created crl-monitor, which performs black-box analysis of all issued CRLs and reports on any anomalies. Since crl-monitor lives outside our primary infrastructure, it is not fully integrated with our standard alerting mechanisms. Instead, it sends its alerts by email to a general-purpose mailing list that is read by our SRE team. When crl-monitor detected this bug in our staging environment, the email it sent was not acknowledged. This was due to three conditions: issues with staging are not compliance incidents, so it was not a waking alert; the email was sent outside of business hours; and the email was not sent to our standard dedicated alerting queue. When crl-monitor detected the same bug in our production environment a day later, during business hours, the email notification it sent caught the attention of several SREs.
- Timeline: crl-monitor was configured to send alerts to the general-purpose mailing list on 2024-01-12. It sent an email about the staging environment at 2025-03-17 13:10, which was not acknowledged. It sent an email about the production environment at 2025-03-18 15:05, and that notification was acknowledged at 2025-03-18 15:28.
- Detection: The production alert fired during business hours, so it was noticed almost immediately by an SRE as it arrived in their email. No SRE had noticed the staging alert until the production one fired. Email is a high-noise, low-signal medium, so it is not suitable for important alerts.
- Interaction with other factors: Had we noticed and acted on the staging alert, we likely would have avoided the incident entirely.
Lessons Learned
What went well:
- crl-monitor worked as designed in detecting CRL anomalies.
- We store our CRLs in a versioned object store, so we were able to view prior versions of each partition to confirm when entries were first added, first removed, and finally restored.
- A fix was identified and deployed to production within 8 hours.
What didn’t go well:
- The bug passed several review and testing cycles, and was still promoted to production.
- The problem was detected in staging approximately one day earlier, but went unacknowledged because the alert was sent through email rather than as a high-priority page.
Where we got lucky:
- We happened to be doing the initial deployment of short-lived certificates around the same time as we turned on the new CRL issuance path. Currently, short-lived certificate issuance is allowlisted to our internal accounts and issuance volume is very low. Because short-lived certificates expire sooner, they were affected sooner and triggered our alerting; because their issuance volume is very low, very few certificates were affected at the time of the alert.
Action Items
| Action Item | Kind | Corresponding Root Cause(s) | Evaluation Criteria | Due Date | Status |
|---|---|---|---|---|---|
| Fix the bug | Prevent | Root Cause #1 | CRL monitoring should pass following deployment of fix. Affected certs should re-appear in respective CRLs. The changeset will be public on GitHub. | 2025-03-18 | Complete |
| Improve unit tests to check for correct expiration cutoff calculation | Prevent | Root Cause #2 | Analyze test coverage report for satisfactory coverage. The changeset will be public on GitHub. | 2025-03-18 | Complete |
| Improve integration tests to cover revoked cert removal from CRL after cert expiration. | Prevent | Root Cause #2 | The changeset will be public on GitHub. | 2025-04-28 | Not started |
| Configure crl-monitor to page high priority | Detect | Root Cause #3 | Configure alerting and test it thoroughly in staging for effectiveness. | 2025-04-04 | Not started |
Appendix
Attached file certs.csv is a CSV file containing all affected certificates.
Attached file e6-15.tar.gz includes all versions of CRL partition 15 for intermediate E6 for which the certificate 053258de6a60305075be223306a211f5328e was missing.
Attached file e6-95.tar.gz includes all versions of CRL partition 95 for intermediate E6 for which the certificate 06e748585b1ba76887bf7b0cce58a94361de was missing.
Comment 5•7 months ago
Errata
The statement in Contributing Factor 2, Timeline:
The change which introduced the bug was merged on 2020-01-27, and the change which made the bug reachable was merged on 2020-02-04.
Should instead read:
The change which introduced the bug was merged on 2025-01-27, and the change which made the bug reachable was merged on 2025-02-04.
Comment 6•7 months ago
Thank you for this detailed report.
We’d like to highlight that this report does many things well, and should be considered an example of “good practice” given the recent update to the CCADB Incident Reporting Guidelines. We especially appreciate the level of detail provided by Let’s Encrypt and the amount of candid introspection described in the report, even when doing so was likely uncomfortable.
The Root Cause Analysis provided is both thorough and still very approachable. Combined with the Lessons Learned, we feel these sections offer valuable insights that can benefit the broader community. It's also positive to see that the Action Items directly address what didn’t go well, as well as the contributing factors that led to this issue. This demonstrates a commitment to learning from the incident and taking concrete steps to prevent similar incidents from occurring in the future.
One opportunity for improvement: we note that in the Summary section, the “CA Owner CCADB unique ID” is supposed to be represented by an “A” followed by a six-digit number corresponding to the CA Owner’s “CA Owner/Certificate” record disclosed in the CCADB. This is described on CCADB.org.
Ultimately, these reports are not about placing blame, they are about making things meaningfully better, and we feel this report represents that view very well.
Comment 7•7 months ago
(In reply to chrome-root-program from comment #6)
One opportunity for improvement is that we note that in the Summary Section, the “CA Owner CCADB unique ID” is supposed to be represented by an “A” followed by a six-digit number corresponding to the CA Owner’s “CA Owner/Certificate” record disclosed in the CCADB.
Thank you for the correction! Let's Encrypt's CA Owner CCADB unique ID is A000320. We have updated our incident response procedure and template to pre-populate this value for future reports.
Comment 8•7 months ago
We have completed all outstanding action items.
New integration tests have been implemented. In particular, they cover three new cases that check for proper CRL behavior:
- Shortly before the certificate expires, when the entry must still appear;
- Shortly after the certificate expires, when the entry must still appear; and
- Significantly after the certificate expires, when the entry may be removed.
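A rough outline of those three checks, written against a hypothetical `entryPresentAt` helper that reports whether the relevant CRL shard contains the certificate's entry as of a given time. This is an illustrative sketch, not our actual integration test code:

```go
package integration_test

import (
	"testing"
	"time"
)

// checkCRLEntryLifecycle is a hypothetical outline of the three checks listed
// above; entryPresentAt would fetch or regenerate the relevant CRL shard as of
// the given time and report whether the revoked certificate's entry is present.
func checkCRLEntryLifecycle(t *testing.T, notAfter time.Time, entryPresentAt func(time.Time) bool) {
	// 1. Shortly before the certificate expires: the entry must still appear.
	if !entryPresentAt(notAfter.Add(-time.Hour)) {
		t.Error("entry missing shortly before certificate expiration")
	}
	// 2. Shortly after the certificate expires (within the 24-hour lookback
	//    window): the entry must still appear, so it is published on at least
	//    one CRL issued after notAfter.
	if !entryPresentAt(notAfter.Add(time.Hour)) {
		t.Error("entry missing shortly after certificate expiration")
	}
	// 3. Significantly after the certificate expires: the entry may be
	//    removed, so its absence is acceptable and not asserted against.
	_ = entryPresentAt(notAfter.Add(48 * time.Hour))
}
```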
We have also reconfigured crl-monitor to alert us using our normal alerting mechanisms, rather than email. We have thoroughly tested this functionality.
| Action Item | Kind | Corresponding Root Cause(s) | Evaluation Criteria | Due Date | Status |
|---|---|---|---|---|---|
| Fix the bug | Prevent | Root Cause #1 | CRL monitoring should pass following deployment of fix. Affected certs should re-appear in respective CRLs. The changeset will be public on GitHub. | 2025-03-18 | Complete |
| Improve unit tests to check for correct expiration cutoff calculation | Prevent | Root Cause #2 | Analyze test coverage report for satisfactory coverage. The changeset will be public on GitHub. | 2025-03-18 | Complete |
| Improve integration tests to cover revoked cert removal from CRL after cert expiration. | Prevent | Root Cause #2 | The changeset will be public on GitHub. | 2025-04-28 | Complete |
| Configure crl-monitor to page high priority | Detect | Root Cause #3 | Configure alerting and test it thoroughly in staging for effectiveness. | 2025-04-04 | Complete |
Report Closure Summary
- Incident description: Two revoked certificates (with serials 053258de6a60305075be223306a211f5328e and 06e748585b1ba76887bf7b0cce58a94361de) were removed from our CRLs (partitions 15 and 95, respectively) before the certificates had expired.
- Incident Root Cause(s): This incident was caused by a bug in CRL partitioning logic. It was worsened by insufficient tests to cover CRL entry removal edge cases and incorrect configuration of our CRL monitor alerting.
- Remediation description: We corrected the bug, improved unit and integration tests to cover edge cases, and reconfigured CRL monitoring to alert us properly.
- Commitment summary: Let's Encrypt will continue to improve our secondary controls (i.e. tests and monitoring) to ensure bugs do not reach our issuance environment.
All Action Items disclosed in this report have been completed as described, and we request its closure.
Comment 9•7 months ago
I'll close this on or about Thursday or Friday later this week, unless there are issues to discuss or questions to answer.