Mozilla Root Store Policy explicitly requires CAs to follow discussions in dev-security-policy, and this specific class of error was raised on the list on 2020-10-28.
Can you share more about your process for reviewing m.d.s.p. and how this thread was missed?
We did not miss the thread of 2020-10-28 ("EJBCA performs incorrect calculation of validities"). At 2020-10-28 17:57:36 UTC, one of our engineers sent to all staff:
There's a post on mdsp today about EJBCA having an off-by-one-second bug in calculating validity intervals. A good reminder that we should always aim to be well within requirements about timing, rather than right up on the edge.
In other words, we evaluated the issue and concluded that our existing practices more than adequately addressed it. Our validity intervals are shorter than the BR-set limits by approximately 308 days, which more than covers this sort of off-by-one-second bug, as well as similar issues such as leap seconds.
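As a concrete illustration of that margin (a sketch, not Boulder code): RFC 5280 counts both notBefore and notAfter as inside the validity period, so naively setting notAfter to notBefore plus 90 days yields a certificate valid for 90 days plus one second, while a 90-day lifetime measured against the 398-day BR limit leaves roughly 308 days of headroom.

```python
from datetime import datetime, timedelta, timezone

# Illustrative only, not Boulder's implementation. RFC 5280 defines the
# validity period as inclusive of both notBefore and notAfter, so a
# naive "notBefore + 90 days" produces 90 days plus one second.
not_before = datetime(2021, 1, 1, tzinfo=timezone.utc)
not_after = not_before + timedelta(days=90)

# Inclusive of both endpoints: one extra second beyond the intended lifetime.
validity_seconds = int((not_after - not_before).total_seconds()) + 1
print(validity_seconds)  # 7776001, i.e. 90 days + 1 second

# Against the 398-day BR maximum, a 90-day lifetime leaves a margin
# that dwarfs any one-second error.
margin = timedelta(days=398) - timedelta(days=90)
print(margin.days)  # 308
```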
One could argue that even though we believed our configuration of Boulder (our in-house CA software) adequately mitigated this category of issue, we should have checked the implementation for this off-by-one-second bug in case someone runs an instance of Boulder with a different configuration that goes right up to the BR limits. However, we maintain Boulder primarily for our own consumption, informed by our policy decisions. Since it was inconceivable that we would configure Boulder with a validity period within one second of the BR limits, we did not start an investigation into whether Boulder had this off-by-one-second bug.
Another possible response to the EJBCA thread would have been to re-review our CPS to ensure we were not setting requirements for ourselves that reintroduced the sort of "right up against the limit" compliance issue we seek to avoid with respect to BR requirements. We did not do so at that time. Even with the benefit of hindsight, we do not believe that would have been a reasonable response to the thread.
This is consistent with the discussion on the original EJBCA thread. For instance, Mike Kushner posted:
all PrimeKey customers were alerted a week ago and recommended to review their certificate profiles and responder settings to be within thresholds.
Our settings were well within thresholds already, as the result of a series of conscious design choices since launch. Those choices had been informed by even earlier discussions on mdsp about CAs that missed deprecation targets or maximum validities by a second or a day.
Bug 1708965, reported a month ago
You can use component subscriptions to help ensure this, as the majority of CAs have done (judging by the CC lists).
By contrast with the EJBCA thread, if we had read the KIR S.A. incident report in a timely manner, a correct response would have included a review of our CPS for this sort of "right up against the limit" condition. That is because the KIR S.A. incident specifically discusses a disagreement between the drafting of the CPS and the implementation / configuration, which is not a problem our conservative approach to setting validity intervals would mitigate.
Several of our engineers are subscribed to bugzilla for the relevant components. As noted, we don’t have a formalized process for reviewing all bugs in the ca-compliance category. So, one root cause is absence of such a process. We plan to implement such a process as a remediation item.
I think it would benefit Let's Encrypt to re-do the root cause analysis.
Looking deeper into root causes, the main issue seems to be: we preemptively mitigated a large class of boundary condition bugs by setting our validity intervals well within BR requirements. And then we broke that mitigation by setting additional requirements in our CPS that put our actual implementation right up against our CPS limits. How did we introduce that error? How did we miss it in multiple CPS self-reviews?
Looking back through our CPS history at https://letsencrypt.org/repository/, our 1.x series of CPS said that end-entity certificates could have a lifetime of "up to 12 months." In 2017 we extensively revised our CP and CPS to make them more accessible (HTML instead of PDF), more maintainable (versioned history on GitHub), and clearer to read.
On 2017-02-09, during the drafting period, we introduced the "less than 100 days" language in section 6.3.2:
https://github.com/letsencrypt/cp-cps/commit/063e56225885ed216e9ba150bef40d07334d6ffa. This was based on a desire to make our CPS more specific, combined with our understanding of boundary condition problems that had already been documented on mdsp as of 2017.
On 2017-03-08, still in the drafting period, we introduced the "90 days" language in section 7.1: https://github.com/letsencrypt/cp-cps/pull/6. This was a mistake, but it went uncommented at the time.
Our PMA met at 11:35 AM US Central time on 2017-04-13 to review the new CPS. According to our records, the PMA did not remark on the differing language between sections 6.3.2 and 7.1.
At 4:16 PM US Central time on 2017-04-13, we published the new CPS v2.0: https://github.com/letsencrypt/website/commit/aa5c552f63b175fceb79e0c199ce06ef3b92d9c8 / https://letsencrypt.org/documents/isrg-cp-v2.0/. At the moment of publication we were out of compliance, because of the strict "90 days" language we had introduced in section 7.1.
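The conflict between the two sections reduces to simple arithmetic (illustrative values, not our actual tooling): a certificate whose inclusive validity is 90 days plus one second satisfies the section 6.3.2 language ("less than 100 days") but violates the strict section 7.1 language ("90 days").

```python
# Hypothetical check, not part of our tooling: compare the actual issued
# validity (inclusive of notBefore and notAfter) against both CPS clauses.
DAY = 86400
actual_validity = 90 * DAY + 1  # 90 days plus one second, in seconds

within_6_3_2 = actual_validity < 100 * DAY  # "less than 100 days" -> True
within_7_1 = actual_validity <= 90 * DAY    # strict "90 days" -> False

print(within_6_3_2, within_7_1)  # True False
```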
The PMA review (and our subsequent quarterly PMA re-reviews) was when we really should have caught the discrepancy between the two requirements. We will re-review our CP and CPS with regard to the relationships between different sections, to check for similar inconsistencies. We will also update our guidelines for reviewing the CP and CPS to include this type of review in the future.
Can you share more about the nature and structure of your triage team, and whether you've reviewed past CA incidents to examine how such teams may succeed or fail depending on how the CA structures them?
All of our CA staff subscribe to and read mdsp, with the exception of one team member who recently switched from a non-CA role (and who has now subscribed). Instructions for subscribing are part of our onboarding for new staff. A smaller set of staff (six people) subscribes to bugzilla's CA Certificate Compliance category, but does not necessarily read every incident in detail.
As an organization, we have reviewed and learned from many of the bugzilla CA incidents, including ones that started before Let's Encrypt was publicly available. As one publicly documented example, in https://bugzilla.mozilla.org/show_bug.cgi?id=1577652 we revised our OCSP behavior for precertificates and filed an incident based on reading about Apple's related incident. We have a new-engineer reading syllabus that includes several incident reports. We also regularly review Mozilla's "Responding to an Incident" document, which includes exemplary reports to reference.
The missing pieces are (a) more attention to incident reports on bugzilla (we focus most of our attention on mdsp), and (b) a process to ensure that each item is read.
We have reviewed discussions of triage teams in past incidents, and those reviews will inform the structure of our updated processes.
Suggestion: The best thing to do would be to develop a process for reviewing the extant (open and closed) CA incidents
Our plan is to review all bugzilla issues in the "CA Certificate Compliance" category opened between 2019-06-01 and today (439 issues, 5514 comments). We will also establish a formal review rotation in which:
- Each week a staff member will be assigned to review new bugzilla issues and mdsp threads since the last review.
- Each month a staff member will be assigned to review updated bugzilla issues and mdsp threads since the last review.
The output of that review will be (a) a summary of issues that are relevant to Let's Encrypt, to be reviewed at the next quarterly PMA meeting, and (b) an immediate page to on-call engineers if any items require immediate attention. This will be in addition to our current practice of staff reading mdsp as part of their regular workflow and responding to items that require immediate attention.
The reasoning behind separate review cycles for new vs updated issues: Items that require prompt response are most likely to manifest as new threads or new issues, while comprehensive understanding of a long thread or issue is best achieved by reading the whole thread or whole issue after all participants have had time to contribute.
We intend to have the mdsp/bugzilla review rotation in place by 2021-06-15. We intend to complete the retrospective review by 2021-11-12.
We intend to have our re-review of CP and CPS and updated review guidelines done by 2021-06-21.
Question 5, which expects the complete certificate details to be provided.
The list of serial numbers for affected certificates that were unexpired as of the time we declared an incident is available at this URL: https://le-https-stats.s3.amazonaws.com/one-second-incident-affected-serials.txt.gz
Interested parties can obtain the full certificate bodies either from Certificate Transparency logs or from our ACME API, via an HTTP GET request to https://acme-v02.api.letsencrypt.org/acme/cert/<serial in hex>, e.g. https://acme-v02.api.letsencrypt.org/acme/cert/030000033ee153519e6734086f560282082f
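As a minimal sketch of constructing the retrieval URL from a hex serial (the cert_url helper is purely illustrative and not part of our tooling; actually fetching the body requires an HTTP GET against the public endpoint):

```python
# Illustrative helper (not part of Let's Encrypt tooling) that builds the
# ACME certificate-retrieval URL for a given serial number in hex.
BASE = "https://acme-v02.api.letsencrypt.org/acme/cert/"

def cert_url(serial_hex: str) -> str:
    # Serials are lowercase hex with no "0x" prefix; leading zeros are kept.
    s = serial_hex.lower()
    if s.startswith("0x"):
        s = s[2:]
    return BASE + s

print(cert_url("030000033EE153519E6734086F560282082F"))
# https://acme-v02.api.letsencrypt.org/acme/cert/030000033ee153519e6734086f560282082f
```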