Certainly: Early CRL Entry Removal
Categories
(CA Program :: CA Certificate Compliance, task)
Tracking
(Not tracked)
People
(Reporter: wthayer, Assigned: wthayer)
Details
(Whiteboard: [ca-compliance] [crl-failure])
Attachments
(1 file)
6.50 MB,
text/plain
Preliminary Incident Report
Summary
- Incident description: On 13-Feb 2025, Certainly deployed a version of Boulder containing a logic bug that likely resulted in entries for some revoked certificates being removed from the CRL before the certificates had expired.
- Relevant policies: Removing an entry from a CRL before the revoked certificate expires would violate section 4.10.1 of the Baseline Requirements for the Issuance and Management of Publicly‐Trusted TLS Server Certificates, and section 3.3 of RFC 5280.
- Source of incident disclosure: We became aware of this issue at 16:23 UTC on 18-Mar 2025 when a member of the Let’s Encrypt team directed us to a Boulder pull request in their public code repository containing their fix.
A fix has been deployed. We will post a full incident report within 7 days.
Updated•5 months ago
Comment 1•5 months ago
Full Incident Report
Summary
- CA Owner CCADB unique ID: A007878
- Incident description: On 2025-02-13, Certainly deployed a version of Boulder containing a logic bug that resulted in entries for some revoked certificates being removed from the CRL before the certificates had expired.
- Timeline summary:
- Non-compliance start date: 2025-02-13
- Non-compliance identified date: 2025-03-18
- Non-compliance end date: 2025-03-18
- Relevant policies: Removing an entry from a CRL before the revoked certificate expires would violate section 4.10.1 of the Baseline Requirements for the Issuance and Management of Publicly‐Trusted TLS Server Certificates, and section 3.3 of RFC 5280.
- Source of incident disclosure: We became aware of this issue at 16:23 UTC on 2025-03-18 when a member of the Let's Encrypt team directed us to a Boulder pull request in their public code repository containing their fix.
Impact
- Total number of certificates: 77467
- Total number of "remaining valid" certificates: 0
- Affected certificate types: Certainly only issues DV certificates.
- Incident heuristic: Certificate was affected if it was revoked, its expiration date fell between (CRL time - one CRL generation interval) and (CRL time + lookbackPeriod + one CRL generation interval), and a CRL was generated during the bug period.
- Was issuance stopped in response to this incident, and why or why not?: No. This incident does not affect the issuance process or resulting certificates.
- Analysis: Not applicable.
- Additional considerations: The large number of impacted certificates is the result of automated tests that Certainly runs in order to verify that our revocation systems are working properly.
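The incident heuristic listed above can be sketched as a small predicate. This is only an illustration and not Boulder's code: the function name, the inclusive window bounds, and any values passed for `crl_interval` and `lookback` are assumptions. Only the deployment and fix times are taken from the timeline in this report.

```python
from datetime import datetime, timedelta, timezone

# Bug period, taken from this report's timeline (deploy time to fix time).
BUG_START = datetime(2025, 2, 13, 17, 26, tzinfo=timezone.utc)
BUG_END = datetime(2025, 3, 18, 23, 35, tzinfo=timezone.utc)


def possibly_affected(
    revoked: bool,
    not_after: datetime,
    crl_time: datetime,
    crl_interval: timedelta,
    lookback: timedelta,
) -> bool:
    """Return True if the heuristic flags a certificate for one CRL run.

    Hypothetical helper: the bounds are treated as inclusive, which is an
    assumption, since the report only says "fell between".
    """
    if not revoked:
        return False
    # Only CRLs generated while the buggy release was deployed count.
    if not (BUG_START <= crl_time <= BUG_END):
        return False
    window_start = crl_time - crl_interval
    window_end = crl_time + lookback + crl_interval
    return window_start <= not_after <= window_end
```

A certificate would be counted if this predicate is true for at least one CRL generated during the bug period. The heuristic is deliberately broad, so it can overcount.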
Timeline
2025-01-27
18:11 UTC Upstream commit containing bug in CRL generation code
2025-02-03
09:14 UTC Documented our decision not to deploy the Boulder release tagged release-2025-01-27 due to a change freeze
2025-02-13
17:26 UTC Deployed Boulder release tagged release-2025-02-04 which included the bug
2025-03-18
16:23 UTC Received notification from Let’s Encrypt
16:39 UTC Team began investigation
18:29 UTC Team confirmed that Certainly was susceptible to this bug, and began work to test and deploy the fix
23:35 UTC Fix fully deployed
2025-03-19
04:24 UTC Preliminary incident report published
Related Incidents
Bug | Date | Description |
---|---|---|
1954861 https://bugzilla.mozilla.org/show_bug.cgi?id=1954861 | 2025-03-18 | Both incidents were caused by the same Boulder bug. |
Root Cause Analysis
Contributing Factor #1: software bug
- Description: As part of a significant refactoring of code related to CRL generation, a bug was introduced into Boulder which incorrectly calculated the time at which a certificate had expired, appeared on at least one additional CRL, and was thus safe to remove from future CRLs. This bug was located on line 317 of crl/updater/updater.go.
- Timeline:
2025-01-27 18:11 UTC Boulder bug committed
2025-02-13 17:26 UTC Certainly deploys release containing this bug
2025-03-18 23:35 UTC Certainly deploys patch to fix this bug
- Detection: The bug was detected by Let’s Encrypt
- Interaction with other factors: This bug was the proximate cause of the incident
- Root Cause Analysis methodology used: Drill down
Contributing Factor #2: testing gap
- Description: Certainly relies on the suite of tests included with Boulder to verify proper functioning of the system. These tests did not cover the scenario presented by this bug.
- Timeline:
2025-03-18 16:50 UTC Unit test covering this scenario is committed to Boulder
- Detection: The bug was detected by Let’s Encrypt
- Interaction with other factors: Lack of test coverage allowed the bug to avoid detection
- Root Cause Analysis methodology used: Drill down
Contributing Factor #3: monitoring gap
- Description: Certainly has deployed CRL monitoring and alerting that, for instance, ensures that published CRLs have not expired. This monitoring does not attempt to determine if the complete correct set of revoked serial numbers is represented in CRLs, including maintaining serial numbers in the CRL until at least one CRL has been issued after a certificate expires.
- Timeline:
2025-03-19 15:56 UTC Certainly team determined that our existing CRL monitors would not have detected this incident
- Detection: The bug was detected by Let’s Encrypt
- Interaction with other factors: Lack of monitoring for this condition allowed the bug to remain active for more than a month
- Root Cause Analysis methodology used: Drill down
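A completeness check of the kind described in this contributing factor could look roughly like the sketch below. It is a simplified illustration under stated assumptions, not Certainly's monitor: `RevokedCert`, `missing_serials`, and all parameter names are hypothetical, `crl_serials` is assumed to be the union of serials across every CRL shard, and the previous CRL is assumed to have been complete.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class RevokedCert:
    """Hypothetical record from the CA's own revocation database."""
    serial: str
    revoked_at: datetime
    not_after: datetime


def missing_serials(revoked, crl_serials, this_update, prev_this_update):
    """Return revoked serials that should be on the current CRL but are not.

    A serial is required if the certificate was revoked by the current
    CRL's thisUpdate and had not yet expired when the previous CRL was
    issued, meaning no CRL issued after its expiry has carried it yet.
    """
    missing = []
    for cert in revoked:
        if cert.revoked_at > this_update:
            continue  # revoked after this CRL was generated
        if cert.not_after < prev_this_update:
            continue  # the previous CRL was already issued after expiry
        if cert.serial not in crl_serials:
            missing.append(cert.serial)
    return sorted(missing)
```

A monitor built on this idea would alert whenever the returned list is non-empty. Comparing against the CA's own revocation records, rather than against earlier CRLs alone, is a design choice here and not something the report specifies.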
Lessons Learned
- What went well: Certainly maintains communications channels with Let’s Encrypt, so we were quickly made aware of this issue. We were able to deploy a patch the same day that the issue was discovered.
- What didn’t go well: Certainly’s existing CRL monitors did not detect this problem
- Where we got lucky: We keep a close eye on Boulder changes and would have responded quickly to the incident report filed by Let’s Encrypt, but their outreach significantly reduced our time to mitigate and we greatly appreciate the notification.
- Additional:
Action Items
Action Item | Kind | Corresponding Root Cause(s) | Evaluation Criteria | Due Date | Status |
---|---|---|---|---|---|
Bug fixed upstream and applied by Certainly | Prevent | Root Cause # 1 | Tests passing | 2025-03-18 | Complete |
Upstream unit test | Prevent | Root Cause # 2 | Tests passing | 2025-03-18 | Complete |
Add monitoring that will detect revoked serials missing from CRLs | Detect | Root Cause # 3 | Alerts fire when this situation occurs | 2025-04-30 | Ongoing |
Appendix
See attachment
Comment 2•5 months ago
We will continue to monitor for questions and comments and provide updates as we proceed toward expanding our CRL monitoring.
Comment 3•5 months ago
Revised Evaluation of the Incident
Upon further review of the code, and with a clearer understanding of the issue from Let's Encrypt's full incident report, we determined that Certainly was not actually impacted by the CRL generation bug.
Certainly makes the reasonable assumption that any compliance bug in Boulder is likely to affect us. In this case, we determined that we had deployed a version of Boulder containing the bug and from there proceeded to quickly disclose this incident.
In preparing our full incident report, we assumed our monitoring had failed to catch the issue and reviewed the buggy code and the upstream issue describing the problem. Using that understanding, we identified all revoked certs that could have been missing on the CRLs generated during the period we had the bug deployed. We then listed all of those certs and that entire time period in our final report.
After reading Let’s Encrypt’s full incident report, we discovered there were additional conditions that had to be met for the bug to be triggered. We reviewed the code path leading to the bug and discussed it directly with Let’s Encrypt staff. In doing so, we confirmed that we had not issued incomplete CRLs.
Additional Technical Context:
While the CRL generation bug was present and deployed in our environment, our specific system configuration did not set the feature flags necessary to engage the new code pathway that triggered this bug.
Consequently, our CRLs remained complete and compliant throughout the period.
While no certificates were affected, we have still deployed the upstream fix to the affected code and are adding tests to prevent similar issues in the future. Although this turned out not to be an incident, the investigation highlighted a gap in our CRL monitoring, which we are actively addressing so that such anomalies are detected in the future.
Based on this revised evaluation, we request that this bug be closed as invalid.