Let's Encrypt: Early CRL Removal Incident
Categories
(CA Program :: CA Certificate Compliance, task)
Tracking
(Not tracked)
People
(Reporter: inahga, Assigned: inahga)
Details
(Whiteboard: [ca-compliance] [crl-failure])
Attachments
(3 files)
Preliminary Incident Report
Summary
- Incident description: Two revoked certificates (with serials 053258de6a60305075be223306a211f5328e and 06e748585b1ba76887bf7b0cce58a94361de) were removed from our CRLs (partitions 15 and 95, respectively) before the certificates had expired.
- Relevant policies: This is a violation of RFC 5280, Section 3.3, which says “An entry MUST NOT be removed from the CRL until it appears on one regularly scheduled CRL issued beyond the revoked certificate's validity period.”
- Source of incident disclosure: An internal email alert notified us of the situation at 2025-03-18 15:06 UTC.
We have developed, tested, and deployed a fix for this error, and the missing entries have been restored to the CRLs.
We will post a full incident report on or before 2025-03-25.
Comment 1•7 months ago
Full Incident Report
Summary
- CA Owner CCADB unique ID: Internet Security Research Group
- Incident description: Two revoked certificates (with serials 053258de6a60305075be223306a211f5328e and 06e748585b1ba76887bf7b0cce58a94361de) were removed from our CRLs (partitions 15 and 95, respectively) before the certificates had expired.
- Timeline summary:
- Non-compliance start date: 2025-03-18 15:05:05 UTC
- Non-compliance identified date: 2025-03-18 15:28:00 UTC
- Non-compliance end date: 2025-03-18 22:19:38 UTC
- Relevant policies: This is a violation of RFC 5280, Section 3.3, which says “An entry MUST NOT be removed from the CRL until it appears on one regularly scheduled CRL issued beyond the revoked certificate's validity period.”
- Source of incident disclosure: An internal email alert notified us of the situation at 2025-03-18 15:06 UTC.
Impact
- Total number of certificates: 2
- Total number of "remaining valid" certificates: 0
- Affected certificate types: DV certificates
- Incident heuristic: Any revoked certificate issued after 2025-03-12 21:15 and due to expire before 2025-03-19 22:00.
- Was issuance stopped in response to this incident, and why or why not?: No. This incident did not produce any invalid certificates, so we did not stop issuance.
- Analysis: N/A, no revocations were delayed.
- Additional considerations: This incident affected 2 CRL partitions. We did not halt issuance of CRLs because our analysis of the incident made us confident that a third revoked certificate entry would not be affected for several days, and because halting publication of revocation information can lead to further incidents (e.g. missing a revocation deadline).
Timeline
All times are UTC.
Throughout, “Cert A” refers to the certificate with serial 06e748585b1ba76887bf7b0cce58a94361de, and “Cert B” refers to the certificate with serial 053258de6a60305075be223306a211f5328e.
2025-01-27
- 18:11 Change introducing the bug merged, but the buggy code was not yet reachable
2025-02-04
- 16:45 Change making the bug reachable merged, but gated behind a feature flag
2025-03-12
- 21:15 Feature flag flipped in production, causing the buggy codepath to be reachable
- 21:20:06 Cert A issued
- 21:54:35 Cert B issued
- 22:19:26 Cert A revoked
- 22:37:52 Cert A first appears on CRL partition 95
- 22:53:26 Cert B revoked
- 23:09:35 Cert B first appears on CRL partition 15
2025-03-17
- 13:10 Staging crl-monitor detects faulty non-production CRL and sends email alert
2025-03-18
- 15:05:05 CRL partition 15 signed with entry for Cert B missing (Incident Begins)
- 15:05:25 crl-monitor detects early removal of Cert B from CRL partition 15, and sends an email alert
- 15:11:17 CRL partition 95 signed with entry for Cert A missing
- 15:11:31 Production crl-monitor detects early removal of Cert A from CRL partition 95 and sends email alert
- 15:28 SRE and Dev teams acknowledge the incident and begin investigation
- 15:33 Investigation identifies all affected certificates and CRLs
- 16:15 Dev team identifies possible fix
- 16:18 Potential fix developed
- 16:50 Unit tests for potential fix developed
- 17:39 Tested fix approved and merged to main
- 17:44 Fix back-ported to currently-deployed release
- 17:45 Hotfix release containing the fix tagged for deployment
- 17:51 Investigation confirms that the next time an entry could be removed early would be approximately 2025-03-23 11:00:00
- 18:44 Fix deployed to Staging
- 19:00 Confirmed that the entry which had been erroneously removed from Staging CRLs was restored by the fix
- 22:00 Fix deployed to Production
- 22:05:03 Cert B restored to CRL partition 15
- 22:19:38 Cert A restored to CRL partition 95 (Incident Ends)
- 22:29 Preliminary report posted
2025-03-19
- 13:20:05 Cert A expired
- 13:54:34 Cert B expired
2025-03-20
- 15:05:16 Cert B removed from CRL partition 15, as expected.
- 15:18:36 Cert A removed from CRL partition 95, as expected.
Related Incidents
| Bug | Date | Description |
|---|---|---|
| 1793114 | 2022-09-30 | Prior Let's Encrypt incident involving revoked certificate entries moving between CRL partitions. Led to the introduction of crl-monitor, which detected this incident. |
Root Cause Analysis
Fundamentally, this incident was caused by a software bug: the omission of a single minus sign in a single line of code in the component which queries the database for all of the revoked certificate entries that need to be included in a given CRL.
But we have systems in place to prevent such bugs from affecting our issuance environment. Those processes failed. We could have prevented this bug from being merged into the repository at all if our automated tests included a test checking that a revoked certificate entry remains in a CRL after the certificate has expired. We could have prevented this bug from affecting our production environment if we had noticed and investigated the alert which was fired from our staging environment a day earlier.
Contributing Factor 1: Bug in CRL Database Query Logic
- Description: Let’s Encrypt introduced a new CRL partitioning scheme, which changes how certificates are assigned to partitions. This new logic contained a miscalculation when determining which entries to include in the CRL:

  ```go
  // Query for unexpired certificates, with padding to ensure that revoked certificates show
  // up in at least one CRL, even if they expire between revocation and CRL generation.
  expiresAfter := cu.clk.Now().Add(cu.lookbackPeriod)
  saStream, err := cu.sa.GetRevokedCertsByShard(ctx, &sapb.GetRevokedCertsByShardRequest{
      IssuerNameID:  int64(issuerNameID),
      ShardIdx:      int64(shardIdx),
      ExpiresAfter:  timestamppb.New(expiresAfter),
      RevokedBefore: timestamppb.New(atTime),
  })
  ```

  The `lookbackPeriod` is defined as how far back the CRL updater looks for expired revoked certificates; it is set to 24 hours. The value was applied in the wrong direction: for a hypothetical revoked 90-day certificate, this would lead to its entry being removed 89 days into the certificate's lifetime instead of 91 days. The fix was to instead calculate:

  ```go
  expiresAfter := cu.clk.Now().Add(-cu.lookbackPeriod)
  ```

  Note that certificates issued under the old CRL partitioning scheme were not affected, because we did not change their partitioning logic. (A minimal sketch of the effect of this sign error appears after this list.)
- Timeline: The bug was deployed to production on 2025-03-12 21:15. It was remediated on 2025-03-18 22:00.
- Detection: Our CRL monitoring detected an incorrect CRL. The alert and our subsequent log analysis gave us sufficient context to review the code, identify the bug, and identify all affected certificates. The conditions that led to this bug escaping detection are elaborated on in the other contributing factors.
- Interaction with other factors: As described in Contributing Factor 2, we failed to test for the bug, leading to it being promoted to production. As described in Contributing Factor 3, the bug was detected in staging, but went unacknowledged until it recurred in production.
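To make the effect of the sign error concrete, here is a minimal, self-contained sketch. This is not Boulder code; the variable names and values are illustrative. It shows how the buggy cutoff (now plus the lookback period) drops an entry for a revoked certificate that is still unexpired, while the corrected cutoff (now minus the lookback period) keeps it:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Hypothetical values mirroring the incident: a 24-hour lookback period
	// and a revoked certificate whose notAfter is 12 hours from "now".
	now := time.Date(2025, 3, 18, 15, 0, 0, 0, time.UTC)
	lookbackPeriod := 24 * time.Hour
	notAfter := now.Add(12 * time.Hour)

	// Buggy cutoff: now + 24h. Entries for certificates expiring within the
	// next 24 hours are excluded from the CRL even though they are still valid.
	buggyCutoff := now.Add(lookbackPeriod)
	// Corrected cutoff: now - 24h. Entries are retained until the certificate
	// has been expired for up to 24 hours, so each entry appears on at least
	// one CRL issued after the certificate's notAfter date.
	fixedCutoff := now.Add(-lookbackPeriod)

	fmt.Println("included with buggy cutoff:", notAfter.After(buggyCutoff)) // false: removed early
	fmt.Println("included with fixed cutoff:", notAfter.After(fixedCutoff)) // true: still listed
}
```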
Contributing Factor 2: Insufficient Testing
- Description: The Boulder integration tests do not include a test which verifies that a revoked certificate entry is not removed until after it has appeared on at least one CRL following its notAfter date. This is due to the fundamental difficulty of manipulating the Boulder software's concept of what time it is during the integration tests: doing so requires manipulating the environment or memory of a running process, synchronizing that manipulation across the 20+ separate processes running in the integration test environment, and not clobbering any internal clock manipulation those processes have done in the course of testing. Our integration tests currently use a blunt workaround: spin up the whole stack with a fake clock several months in the past, do a small amount of setup work, restart the whole stack at the current time, and then run the main corpus of tests. This system is difficult enough to work with that it discourages writing new tests which require work to be done at multiple different timestamps, such as one confirming inclusion of a certificate on a CRL even after that certificate expires. (A sketch of the kind of clock-controlled check we want is shown after this list.)
- Timeline: The full-stop-and-restart mechanism for changing the integration test clock has existed since 2017-09-07, and we have known we wanted to make it more flexible since at least 2020-07-06. The change which introduced the bug was merged on 2020-01-27, and the change which made the bug reachable was merged on 2020-02-04.
- Detection: We've been aware of this difficulty in our integration tests since 2020, as noted above. We found our tests were missing this particular case during the investigation of this incident.
- Interaction with other factors: Had we caught the bug in testing, we would have avoided the incident entirely.
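As an illustration of the kind of clock-controlled check described above, here is a minimal unit-test-style sketch. It assumes a controllable fake clock such as github.com/jmhodges/clock; the `cutoffAt` helper and test name are hypothetical, not Boulder's actual test code, and simply exercise the corrected cutoff calculation:

```go
package updater_test

import (
	"testing"
	"time"

	"github.com/jmhodges/clock"
)

// cutoffAt computes the ExpiresAfter cutoff the way the fixed code does:
// entries are kept for certificates whose notAfter is later than now minus
// the lookback period. This helper is illustrative only.
func cutoffAt(now time.Time, lookback time.Duration) time.Time {
	return now.Add(-lookback)
}

func TestExpirationCutoff(t *testing.T) {
	fc := clock.NewFake()
	fc.Set(time.Date(2025, 3, 18, 15, 0, 0, 0, time.UTC))
	lookback := 24 * time.Hour

	cutoff := cutoffAt(fc.Now(), lookback)

	// A certificate that expired one hour ago must still fall inside the
	// query window, so its revocation entry appears on at least one CRL
	// issued after its notAfter date.
	expiredRecently := fc.Now().Add(-time.Hour)
	if !expiredRecently.After(cutoff) {
		t.Errorf("recently expired certificate (notAfter %s) fell outside window (cutoff %s)", expiredRecently, cutoff)
	}

	// A certificate that expires 12 hours from now must also be included;
	// with the buggy cutoff (now + lookback) it would not have been.
	expiresSoon := fc.Now().Add(12 * time.Hour)
	if !expiresSoon.After(cutoff) {
		t.Errorf("unexpired certificate (notAfter %s) fell outside window (cutoff %s)", expiresSoon, cutoff)
	}
}
```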
Contributing Factor 3: Alerting Gaps
- Description: As an action item from a prior incident, we created crl-monitor, which performs black-box analysis of all issued CRLs and reports on any anomalies. Since crl-monitor lives outside our primary infrastructure, it is not fully integrated with our standard alerting mechanisms. Instead, it sends its alerts by email to a general-purpose mailing list that is read by our SRE team. When crl-monitor detected this bug in our staging environment, the email it sent was not acknowledged. This was due to three conditions: issues with staging are not compliance incidents, so it was not a waking alert; the email was sent outside of business hours; and the email was not sent to our standard dedicated alerting queue. When crl-monitor detected the same bug in our production environment a day later, during business hours, the email notification it sent caught the attention of several SREs.
- Timeline: crl-monitor was configured to send alerts to the general-purpose mailing list on 2024-01-12. It sent an email about the staging environment at 2025-03-17 13:10, which was not acknowledged. It sent an email about the production environment at 2025-03-18 15:05, and that notification was acknowledged at 2025-03-18 15:28.
- Detection: The production alert fired during business hours, so it was noticed almost immediately by an SRE as it arrived in their email. No SRE had noticed the staging alert until the production one fired. Email is a high-noise, low-signal medium, so it is not suitable for important alerts.
- Interaction with other factors: Had we noticed and acted on the staging alert, we likely would have avoided the incident entirely.
Lessons Learned
What went well:
- crl-monitor worked as designed in detecting CRL anomalies.
- We store our CRLs in a versioned object store, so we were able to view prior versions of each partition to confirm when entries were first added, first removed, and finally restored.
- A fix was identified and deployed to production within 8 hours.
What didn’t go well:
- The bug passed several review and testing cycles, and was still promoted to production.
- The problem was detected in staging approximately one day earlier, but went unacknowledged because the alert was sent through email rather than as a high-priority page.
Where we got lucky:
- We happened to be doing the initial deployment of short-lived certificates around the same time as we turned on the new CRL issuance path. Currently, short-lived certificate issuance is allowlisted to our internal accounts and issuance volume is very low. Because short-lived certificates expire sooner, they were affected sooner and triggered our alerting; because their issuance volume is very low, very few certificates were affected at the time of the alert.
Action Items
| Action Item | Kind | Corresponding Root Cause(s) | Evaluation Criteria | Due Date | Status |
|---|---|---|---|---|---|
| Fix the bug | Prevent | Root Cause #1 | CRL monitoring should pass following deployment of fix. Affected certs should re-appear in respective CRLs. The changeset will be public on GitHub. | 2025-03-18 | Complete |
| Improve unit tests to check for correct expiration cutoff calculation | Prevent | Root Cause #2 | Analyze test coverage report for satisfactory coverage. The changeset will be public on GitHub. | 2025-03-18 | Complete |
| Improve integration tests to cover revoked cert removal from CRL after cert expiration. | Prevent | Root Cause #2 | The changeset will be public on GitHub. | 2025-04-28 | Not started |
| Configure crl-monitor to page high priority | Detect | Root Cause #3 | Configure alerting and test it thoroughly in staging for effectiveness. | 2025-04-04 | Not started |
Appendix
Attached file certs.csv is a CSV file containing all affected certificates.
Attached file e6-15.tar.gz includes all versions of CRL partition 15 for intermediate E6 for which the certificate 053258de6a60305075be223306a211f5328e was missing.
Attached file e6-95.tar.gz includes all versions of CRL partition 95 for intermediate E6 for which the certificate 06e748585b1ba76887bf7b0cce58a94361de was missing.
Comment 5•7 months ago
Errata
The statement in Contributing Factor 2, Timeline:
The change which introduced the bug was merged on 2020-01-27, and the change which made the bug reachable was merged on 2020-02-04.
Should instead read:
The change which introduced the bug was merged on 2025-01-27, and the change which made the bug reachable was merged on 2025-02-04.
Comment 6•7 months ago
Thank you for this detailed report.
We’d like to highlight that this report does many things well, and should be considered an example of “good practice” given the recent update to the CCADB Incident Reporting Guidelines. We especially appreciate the level of detail provided by Let’s Encrypt and the amount of candid introspection described in the report, even when doing so was likely uncomfortable.
The Root Cause Analysis provided is both thorough and still very approachable. Combined with the Lessons Learned, we feel these sections offer valuable insights that can benefit the broader community. It's also positive to see that the Action Items directly address what didn’t go well, as well as the contributing factors that led to this issue. This demonstrates a commitment to learning from the incident and taking concrete steps to prevent similar incidents from occurring in the future.
One opportunity for improvement: we note that in the Summary section, the “CA Owner CCADB unique ID” is supposed to be represented by an “A” followed by a six-digit number corresponding to the CA Owner’s “CA Owner/Certificate” record disclosed in the CCADB. This is described on CCADB.org.
Ultimately, these reports are not about placing blame, they are about making things meaningfully better, and we feel this report represents that view very well.
Comment 7•7 months ago
(In reply to chrome-root-program from comment #6)
One opportunity for improvement is that we note that in the Summary Section, the “CA Owner CCADB unique ID” is supposed to be represented by an “A” followed by a six-digit number corresponding to the CA Owner’s “CA Owner/Certificate” record disclosed in the CCADB.
Thank you for the correction! Let's Encrypt's CA Owner CCADB unique ID is A000320. We have updated our incident response procedure and template to pre-populate this value for future reports.
Comment 8•7 months ago
We have completed all outstanding action items.
New integration tests have been implemented. In particular, they cover three new cases that check for proper CRL behavior:
- Shortly before the certificate expires, when the entry must still appear;
- Shortly after the certificate expires, when the entry must still appear; and
- Significantly after the certificate expires, when the entry may be removed.
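A rough outline of those three checks, written against a hypothetical `entryPresentAt` helper that reports whether the relevant CRL shard contains the certificate's entry as of a given time. This is an illustrative sketch, not our actual integration test code:

```go
package integration_test

import (
	"testing"
	"time"
)

// checkCRLEntryLifecycle is a hypothetical outline of the three checks listed
// above; entryPresentAt would fetch or regenerate the relevant CRL shard as of
// the given time and report whether the revoked certificate's entry is present.
func checkCRLEntryLifecycle(t *testing.T, notAfter time.Time, entryPresentAt func(time.Time) bool) {
	// 1. Shortly before the certificate expires: the entry must still appear.
	if !entryPresentAt(notAfter.Add(-time.Hour)) {
		t.Error("entry missing shortly before certificate expiration")
	}
	// 2. Shortly after the certificate expires (within the 24-hour lookback
	//    window): the entry must still appear, so it is published on at least
	//    one CRL issued after notAfter.
	if !entryPresentAt(notAfter.Add(time.Hour)) {
		t.Error("entry missing shortly after certificate expiration")
	}
	// 3. Significantly after the certificate expires: the entry may be
	//    removed, so its absence is acceptable and not asserted against.
	_ = entryPresentAt(notAfter.Add(48 * time.Hour))
}
```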
We have also reconfigured crl-monitor to alert us using our normal alerting mechanisms, rather than email. We have thoroughly tested this functionality.
| Action Item | Kind | Corresponding Root Cause(s) | Evaluation Criteria | Due Date | Status |
|---|---|---|---|---|---|
| Fix the bug | Prevent | Root Cause #1 | CRL monitoring should pass following deployment of fix. Affected certs should re-appear in respective CRLs. The changeset will be public on GitHub. | 2025-03-18 | Complete |
| Improve unit tests to check for correct expiration cutoff calculation | Prevent | Root Cause #2 | Analyze test coverage report for satisfactory coverage. The changeset will be public on GitHub. | 2025-03-18 | Complete |
| Improve integration tests to cover revoked cert removal from CRL after cert expiration. | Prevent | Root Cause #2 | The changeset will be public on GitHub. | 2025-04-28 | Complete |
| Configure crl-monitor to page high priority | Detect | Root Cause #3 | Configure alerting and test it thoroughly in staging for effectiveness. | 2025-04-04 | Complete |
Report Closure Summary
- Incident description: Two revoked certificates (with serials 053258de6a60305075be223306a211f5328e and 06e748585b1ba76887bf7b0cce58a94361de) were removed from our CRLs (partitions 15 and 95, respectively) before the certificates had expired.
- Incident Root Cause(s): This incident was caused by a bug in CRL partitioning logic. It was worsened by insufficient tests to cover CRL entry removal edge cases and incorrect configuration of our CRL monitor alerting.
- Remediation description: We corrected the bug, improved unit and integration tests to cover edge cases, and reconfigured CRL monitoring to alert us properly.
- Commitment summary: Let's Encrypt will continue to improve our secondary controls (i.e. tests and monitoring) to ensure bugs do not reach our issuance environment.
All Action Items disclosed in this report have been completed as described, and we request its closure.
Comment 9•7 months ago
I'll close this on or about Thursday or Friday later this week, unless there are issues to discuss or questions to answer.