Let's Encrypt: Failure to revoke for Certificate Lifetime Incident
Categories
(CA Program :: CA Certificate Compliance, task)
Tracking
(Not tracked)
People
(Reporter: aaron, Assigned: aaron)
References
Details
(Whiteboard: [ca-compliance] [leaf-revocation-delay])
Attachments
(2 files)
1. How your CA first became aware of the problem.
We became aware that we would not be revoking these certificates during our analysis of Bug 1715455.
2. A timeline of the actions your CA took in response.
2021-06-09 02:53 UTC: Incident response for Bug 1715455 begins
2021-06-09 04:52 UTC: ISRG decides not to revoke any certificates for the time being
2021-06-09 17:52 UTC: ISRG begins drafting this incident report to declare the exceptional circumstances leading to this non-revocation
2021-06-09 19:22 UTC: ISRG declares our intent to not revoke in Bug 1715455, Comment #8
3. Whether your CA has stopped, or has not yet stopped, certificate issuance or the process giving rise to the problem or incident.
As per Bug 1715455, we have stopped issuing affected certificates.
All affected certificates will reach their expiration date before 7 September 2021 at 03:41:54 UTC.
We expect the affected unexpired certificates to expire approximately linearly until that time.
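As a rough illustration of the "approximately linear" expectation above, the sketch below estimates the share of affected certificates still unexpired at a given time. It assumes perfectly uniform issuance over the preceding 90 days, and it uses the 2021-06-09 02:53 UTC incident-response start as a stand-in for the moment affected issuance stopped. Both are assumptions made for illustration, not figures from this report.

```python
from datetime import datetime, timezone

# Assumption: stand-in for when affected issuance stopped (incident response start).
ISSUANCE_STOPPED = datetime(2021, 6, 9, 2, 53, 0, tzinfo=timezone.utc)
# From the report: all affected certificates expire before this time.
LAST_EXPIRY = datetime(2021, 9, 7, 3, 41, 54, tzinfo=timezone.utc)


def fraction_unexpired(at):
    """Linear estimate of the share of affected certificates still unexpired."""
    total = (LAST_EXPIRY - ISSUANCE_STOPPED).total_seconds()
    remaining = (LAST_EXPIRY - at).total_seconds()
    # Clamp to [0, 1] for times outside the modelled interval.
    return min(1.0, max(0.0, remaining / total))


if __name__ == "__main__":
    sample = datetime(2021, 7, 24, tzinfo=timezone.utc)
    print(f"{fraction_unexpired(sample):.1%} estimated unexpired on {sample.date()}")
```

Under this model the unexpired share falls by roughly 1/90 per day; real issuance volume varies by day, so the attached expiration chart remains the authoritative source.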
4. A summary of the problematic certificates.
See Bug 1715455.
5. In a case involving certificates, the complete certificate data for the problematic certificates.
See Bug 1715455.
6. Explanation about how and why the mistakes were made or bugs introduced, and how they avoided detection until now.
The following are the exceptional conditions that led to our decision not to revoke the affected certificates:
While we appreciate that reporting and discussing this incident may lead to improvements to our CA and the PKI ecosystem going forward, in this case we do not believe that revoking certificates already issued as part of our response would benefit the Web PKI.
All of the certificates in question have a validity period of approximately 23% of the maximum allowable validity period as per the BRs. Thus, this extra one second of validity is still far short of the Baseline Requirements’ maximum allowable lifespan.
Many certificate consumers (such as Chrome and NSS) treat affected certificates as having a 90-day lifespan, not 90 days plus 1 second.
Finally, based on the implementation of those certificate consumers and based on discussion on both Bug 1715455 and Bug 1708965, it is not clear that there is widespread consensus across the WebPKI ecosystem that (absent the precedent of Bug 1708965) this constitutes misissuance. We intend to raise this discussion on MDSP to provide clarity for future potential incidents.
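The arithmetic behind the figures above can be checked with a short sketch. It assumes the validity period is counted inclusively of both the notBefore and notAfter seconds (as RFC 5280 defines it) and that the Baseline Requirements maximum at the time was 398 days. The notBefore value is hypothetical, chosen so that notAfter matches the latest expiry cited in this report.

```python
from datetime import datetime, timedelta, timezone

# Assumption: Baseline Requirements maximum validity at the time of the incident.
BR_MAX_VALIDITY = timedelta(days=398)


def inclusive_validity(not_before, not_after):
    """Validity period counting both endpoint seconds (notBefore and notAfter)."""
    return (not_after - not_before) + timedelta(seconds=1)


# Hypothetical notBefore; notAfter is set exactly 90 days later.
not_before = datetime(2021, 6, 9, 3, 41, 54, tzinfo=timezone.utc)
not_after = not_before + timedelta(days=90)

validity = inclusive_validity(not_before, not_after)
fraction_of_max = validity / BR_MAX_VALIDITY

if __name__ == "__main__":
    print("validity:", validity)
    print(f"share of BR maximum: {fraction_of_max:.1%}")
```

A consumer that computes `not_after - not_before` without the inclusive extra second would see exactly 90 days, which is the discrepancy discussed here.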
7. List of steps your CA is taking to resolve the situation and ensure that such situation or incident will not be repeated in the future, accompanied with a binding timeline of when your CA expects to accomplish each of these remediation steps.
We will ensure that this issue is included in our next BR audit statement.
The volume of issuance of Let’s Encrypt is such that mass-revocation events are currently painful. Our intended solution is the ACME Renewal Info (ARI) extension, proposed at the IETF ACME Working Group, initially as a response to Bug 1619179. Implementation of this in such a way that it is effective for a large number of subscribers is non-trivial and will take a significant amount of time, but this bug provides further urgency.
We will commit additional effort to bring ARI through the IETF standardization process.
Comment 1•2 years ago
https://wiki.mozilla.org/CA/Responding_To_An_Incident#Revocation states:
> Responses similar to “we do not deem this non-compliant certificate to be a security risk” are not acceptable
While the answer to question 6 doesn't use those exact words, it is very much along the same lines, trying to use the supposed non-severity of the misissuances as a justification to ignore the BRs' revocation requirement.
Also, the behavior of certificate consumers is irrelevant to whether a certificate is misissued and whether the BRs' revocation requirement applies.
As such, I don't believe Comment 0 meets Mozilla's expectations as described in https://wiki.mozilla.org/CA/Responding_To_An_Incident#Revocation
Comment 2•2 years ago
(In reply to Aaron Gable from comment #0)
> 5. In a case involving certificates, the complete certificate data for the problematic certificates.
> See Bug 1715455.
This (and Bug 1715455) don't appear to provide the required detail. This is something that CAs have been expected to provide, which hopefully Let's Encrypt will see through the review of other CA incidents, as mentioned in Bug 1715455.
As captured in the incident template:
> The recommended way to provide this is to ensure each certificate is logged to CT and then list the fingerprints or crt.sh IDs, either in the report or as an attached spreadsheet, with one list per distinct problem. In other cases not involving a review of affected certificates, please provide other similar, relevant specifics, if any.

> While we appreciate that reporting and discussing this incident may lead to improvements to our CA and the PKI ecosystem going forward, in this case we do not believe that revoking certificates already issued as part of our response would benefit the Web PKI.
Have I missed an explanation as to why? This appears to just be a statement of opinion, without supporting details. This seems to border on the unacceptable response, namely:
> The rationale must include an explanation for why the situation is exceptional. Responses similar to “we do not deem this non-compliant certificate to be a security risk” are not acceptable
Again, this is something that Let's Encrypt would be aware of through review of other CA incidents.
> All of the certificates in question have a validity period of approximately 23% of the maximum allowable validity period as per the BRs. Thus, this extra one second of validity is still far short of the Baseline Requirements’ maximum allowable lifespan.
Yes, but the BRs also require revocation if:
> - The CA is made aware that the Certificate was not issued in accordance with these Requirements or the CA’s Certificate Policy or Certification Practice Statement;
In essence, the statement provided here is why Let's Encrypt feels it is OK to not revoke, which may be a factor for consideration, but it's missing the analysis requested by https://wiki.mozilla.org/CA/Responding_To_An_Incident#Revocation . This should be easy to provide, but it's concerning that it's not provided here.
> Many certificate consumers (such as Chrome and NSS) treat affected certificates as having a 90-day lifespan, not 90 days plus 1 second.
"Show, don't tell" :)
I would normally flag this as a particularly problematic statement, because it lacks supporting detail. This is, however, partly mitigated because you did show the work in the related thread at https://groups.google.com/a/mozilla.org/g/dev-security-policy/c/-BogZx_IJyk/m/gHm3l613AgAJ relating to this incident, close to the time of posting this report.
In the future, please ensure incident reports have the relevant supporting details.
> Finally, based on the implementation of those certificate consumers and based on discussion on both Bug 1715455 and Bug 1708965, it is not clear that there is widespread consensus across the WebPKI ecosystem that (absent the precedent of Bug 1708965) this constitutes misissuance. We intend to raise this discussion on MDSP to provide clarity for future potential incidents.
The relevant thread is https://groups.google.com/a/mozilla.org/g/dev-security-policy/c/-BogZx_IJyk/m/gHm3l613AgAJ , and corrections have been provided on that thread.
Note that as part of Let's Encrypt's analysis here, relevant other discussions were overlooked, as captured in Bug 1715455, Comment #19, which show that there has been a historic consensus on the expectation, both from software vendors and CAs.
> We will commit additional effort to bring ARI through the IETF standardization process.
This is not a binding timeline, as expected at https://wiki.mozilla.org/CA/Responding_To_An_Incident#Incident_Report
I realize that trying to get a timeline out of the IETF is like trying to get praise for a CA from me - i.e. incredibly difficult - but the expectation here is that Let's Encrypt is going to provide an objective baseline to measure progress, where their prioritization and goals are (e.g. when THEY would like to have something implemented), and periodic updates on that. Obviously, there is benefit to consensus-based standards development, but the concern here, as has happened with other CAs, is that the desire for "consensus" is a way of making it somebody else's problem. This is because consensus doesn't form in a vacuum, and requires energy and investment, and the goal of this incident report is to understand the energy and investment being placed by Let's Encrypt to prevent these incidents going forward.
Comment 3•2 years ago
It has been suggested offline that “delayed revocation” (which is true for any event that exceeds the BR mandated period) may imply that revocation will happen at some point. As LE have clearly stated that, at present, they have no plans to revoke whatsoever, I am retitling this to avoid such ambiguity, although we track all such incidents generically as “delayed revocation”.
Comment 4•2 years ago
Just an update to say we're planning to update by Tuesday 2021-06-15 6pm PDT with more information and plans.
Comment 5•2 years ago
We intend to deploy a demonstration version of ACME Renewal Info (ARI) or similar technology in our staging environment by 2021-11-12, and engage with client authors and integrators to test it out, provide feedback, and evaluate how a mass revocation event would work in practice.
We're planning another update here with answers to remaining questions above by 2021-06-17 6pm PDT.
Comment 6•2 years ago
> This (and Bug 1715455) don't appear to provide the required detail.
The list of affected certificate serial numbers that were unexpired as of the time we declared an incident is available at this URL: https://le-https-stats.s3.amazonaws.com/one-second-incident-affected-serials.txt.gz
Interested parties can obtain the full certificate bodies either from Certificate Transparency logs or from the ACME API, via an HTTP GET request to https://acme-v02.api.letsencrypt.org/acme/cert/<serial in hex>, e.g. https://acme-v02.api.letsencrypt.org/acme/cert/030000033ee153519e6734086f560282082f
This information has also been provided on Bug 1715455.
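For readers who want to script the retrieval described above, here is a minimal sketch using only the Python standard library. The URL pattern and example serial come from this comment; `cert_url` and `fetch_cert` are hypothetical helper names, and the response is returned as raw bytes because the thread does not state the encoding of the response body.

```python
import urllib.request

# URL pattern as given in this comment: base path followed by the hex serial.
ACME_CERT_BASE = "https://acme-v02.api.letsencrypt.org/acme/cert/"


def cert_url(serial_hex):
    """Build the certificate URL for a hex-encoded serial (hypothetical helper)."""
    serial = serial_hex.strip()
    int(serial, 16)  # raises ValueError if the serial is not valid hex
    return ACME_CERT_BASE + serial


def fetch_cert(serial_hex, timeout=10.0):
    """HTTP GET the certificate body; returns the raw response bytes."""
    with urllib.request.urlopen(cert_url(serial_hex), timeout=timeout) as resp:
        return resp.read()


if __name__ == "__main__":
    # Example serial quoted in this comment; requires network access.
    body = fetch_cert("030000033ee153519e6734086f560282082f")
    print(len(body), "bytes received")
```

When walking the full serial list, pacing requests and handling HTTP errors would be sensible additions; neither is shown here.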
> we do not believe that revoking certificates already issued as part of our response would benefit the Web PKI.
> Have I missed an explanation as to why?
See the updated response below, doing the analysis required by https://wiki.mozilla.org/CA/Responding_To_An_Incident#Revocation.
> missing the analysis requested by https://wiki.mozilla.org/CA/Responding_To_An_Incident#Revocation
> The rationale must include an explanation for why the situation is exceptional.
The situation is exceptional because of the large number of certificates that would be revoked and the current deployed base of the Web PKI, which does not have good software support for automated replacement of revoked and about-to-be-revoked certificates. As we've previously said in https://bugzilla.mozilla.org/show_bug.cgi?id=1619179, causing revocation warnings for large numbers of users on large numbers of sites is likely to result in warning blindness for those users, which harms warning compliance numbers for users and potentially exposes them to increased risk from untrusted certificates.
In addition to that, a large revocation like this one, in the current Web PKI, would reduce site owner confidence in the stability of HTTPS, and is likely to lead site operators to conclude using HTTP is safer for their site's availability, given they have no software support for responding to such a revocation by automatically reissuing.
Part of our mission is to increase the use of HTTPS while also using automation to increase security and agility in the ecosystem. Replacement on revocation is a particular area where that automation is currently lacking, which is why our proposed remediation is to add that support. Our vision is that in most cases, revocation is not a concern for either a site operator or a site visitor, because replacement (where possible) is handled automatically. We recognize that both our rationale and our proposed mitigation bear significant similarity with the previously mentioned incident, and that we should have made better progress on our automation solutions by now. That's why we are now committing to show progress by specific dates, with or without standardization.
> Any decision to not comply with the timeline specified in the Baseline Requirements must also be accompanied by a clear timeline
All affected certificates will expire by 2021-09-07. See attachments for a graph of expirations over time, and a chart of counts on specific days.
> The issue will need to be listed as a finding in your CA’s next BR audit statement.
We will make sure this happens.
> Your CA will work with your auditor (and supervisory body, as appropriate) and the Root Store(s) that your CA participates in to ensure your analysis of the risk and plan of remediation is acceptable.
Will do.
> You will perform an analysis to determine the factors that prevented timely revocation of the certificates, and include a set of remediation actions in the final incident report that aim to prevent future revocation delays.
As noted in Comment #5 above, we commit to having an initial version of ARI (or similar) deployed in our Staging environment by 2021-11-12.
> In the future, please ensure incident reports have the relevant supporting details.
Acknowledged. In this case we consciously chose to not reference specific implementations here as that kind of calling-out did not feel productive. We’ll err on the other side next time.
Comment 7•2 years ago
Comment 8•2 years ago
Comment 9•2 years ago
(In reply to Jacob Hoffman-Andrews from comment #6)
> The situation is exceptional because of the large number of certificates that would be revoked and the current deployed base of the Web PKI, which does not have good software support for automated replacement of revoked and about-to-be-revoked certificates. As we've previously said in https://bugzilla.mozilla.org/show_bug.cgi?id=1619179, causing revocation warnings for large numbers of users on large numbers of sites is likely to result in warning blindness for those users, which harms warning compliance numbers for users and potentially exposes them to increased risk from untrusted certificates.
I'm curious what Let's Encrypt sees the thresholds at for this part of the response. In particular, I'm trying to understand how this would differ from, say, a CA that says they cannot revoke a single certificate for a bank because the bank doesn't have a good strategy to replace certificates, and they don't want to cause warning fatigue for their users, or users seeing an error on the bank would cause a loss of trust in HTTPS.
Does Let's Encrypt have any internal guidance about what revocation threshold it sees as acceptable versus unacceptable? If it does, how is that guidance determined?
That said, I also want to acknowledge that the mitigation for this concern - that it seems a blanket statement that can be applied in any/all situations - is that Let's Encrypt has previously publicly made efforts to improve this, and as part of this incident, is publicly committing to make further progress on these efforts, with concrete deliverables and dates. This ultimately is in line with the expectations: the CA must demonstrate what they're doing to solve these (general) problems, and ideally, show a history of work to do so prior to/outside of incidents.
Comment 10•2 years ago
> Does Let's Encrypt have any internal guidance about what revocation threshold it sees as acceptable versus unacceptable? If it does, how is that guidance determined?
We don't have internal guidance about a specific threshold. The decision is made by the oncall staff and those they bring in for incident response, based on the specific details of the situation.
> I'm curious what Let's Encrypt sees the thresholds at for this part of the response. In particular, I'm trying to understand how this would differ from, say, a CA that says they cannot revoke a single certificate for a bank because the bank doesn't have a good strategy to replace certificates, and they don't want to cause warning fatigue for their users, or users seeing an error on the bank would cause a loss of trust in HTTPS.
Our evaluation of a claim like that would be: the number of users exposed to such warnings would ultimately be somewhat low, because a single high traffic site with a high dedication to availability would find a way to replace a revoked certificate quickly, even if there were not procedures in place already. Part of our evaluation here is the prospect of large numbers of users seeing large numbers of warnings across large numbers of sites over a long period of time.
And, as you say, the main thing here is to make progress towards ensuring universal, easy, on-demand certificate rotation.
Comment 11•2 years ago
While I'm always uncomfortable with these sorts of responses on principle ("It's exceptional because it will impact our users/your users"), because it's a statement that has previously been applied for as little as one certificate, in this case it applies to "hundreds of millions", and there's a clear plan for forward progress to ensure that, even for "hundreds of millions", there's a story in place.
I'm sending this to Ben. Ben: Comment #8 suggests that the upper-bound for action will be 2021-11-12. This also corresponds to the conclusion of IETF 112.
I'd suggest setting a Next-Update perhaps closer to 2021-09-14: this is two weeks before all the certificates will be expired, as well as roughly 2-3 weeks before the cut-off date for requesting scheduling of meetings at IETF 112 (historically, cut-off is around 6 weeks before the event). This would hopefully give ISRG/Let's Encrypt a chance to provide a status update about the IETF discussions and their own implementation experiences/explorations on ARI, which is their proposed systemic fix for preventing revocation delays.
Comment 12•2 years ago
We are happy to provide an update on or by 2021-09-14.
Comment 13•2 years ago
Providing our update for 2021-09-14.
We are currently in the process of developing and standardizing a mechanism for ACME servers to provide hints to ACME clients as to when those clients should attempt to renew the certificates they manage. This will allow ACME servers to both encourage clients to renew when the server knows it is about to revoke the current certificate, and to encourage clients to renew slightly earlier or later than they normally would to spread out load spikes resulting from such mass revocation events.
The initial draft sent to the ACME WG mailing list is at https://mailarchive.ietf.org/arch/msg/acme/3wDZfTxjDqmhSxwBKjX3uPBzULM/
The proposal will be discussed at the working group's interim meeting later this month (currently in the process of being scheduled).
We have not yet begun implementation of the server side of the above proposal. Our current plan is to develop both the server side (in Boulder) and the client side (in certbot, and perhaps in other clients as well) in parallel, with the experimental API existing only in our Staging environment for the time being, in order to avoid a repetition of the struggle to deprecate the ACMEv1 (pre-standardization) API endpoints.
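To make the idea of renewal hints concrete, here is a minimal client-side sketch. The hint format is an assumption (a start/end window; this comment does not specify one), and `pick_renewal_time` is a hypothetical helper, not part of Boulder, certbot, or any real ACME client. Picking a uniformly random instant inside the window is one simple way for many clients to spread their renewals instead of arriving all at once.

```python
import random
from datetime import datetime, timedelta, timezone


def pick_renewal_time(window_start, window_end, now, rng=random):
    """Choose when to renew, given a server-suggested window (assumed format)."""
    if window_end < window_start:
        raise ValueError("window_end precedes window_start")
    if window_end <= now:
        # The whole window is already in the past: renew immediately.
        return now
    # Never schedule in the past if the window has already opened.
    start = max(window_start, now)
    span = (window_end - start).total_seconds()
    return start + timedelta(seconds=rng.uniform(0, span))


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    # Hypothetical hint: renew some time in the next one to three days.
    when = pick_renewal_time(now + timedelta(days=1), now + timedelta(days=3), now)
    print("renew at", when.isoformat())
```

A server that is about to revoke a certificate could, under this assumed format, return a window that has already opened or passed, which in this sketch makes the client renew right away.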
Comment 14•2 years ago
Providing our update for 2021-10-14, as I will be on vacation on that date.
The first (zeroth) draft of ARI has been published: https://datatracker.ietf.org/doc/draft-aaron-acme-ari/. There has been healthy discussion both on the ACME WG mailing list and during the ACME WG Interim Meeting. A second draft of the document is already in progress, with changes being tracked through GitHub issues and PRs.
The scaffolding for our first implementation of this draft is under review: https://github.com/letsencrypt/boulder/pull/5691. We still intend to have a draft implementation deployed and enabled in our Staging environment by 2021-11-12.
Please set our next-update for 2021-11-12, when we will provide a final report on our findings from the test deployment.
Comment 15•2 years ago
The draft ACME Renewal Information (ARI) spec has received good discussion both on and off the acme@ietf.org mailing list. I have produced a second version of the draft, which was presented to the ACME WG at IETF 112 yesterday. Discussion continues both on the list and in the source repository. Multiple members of the working group have said that they support adoption of the document, so I expect a formal call for adoption soon.
As of 2021-11-04, an initial implementation of the API has been deployed and turned on in our Staging environment. You can test it yourself with
curl -L https://acme-staging-v02.api.letsencrypt.org/directory
to see the new entry in the Directory resource, and with
curl -L https://acme-staging-v02.api.letsencrypt.org/get/draft-aaron-ari/renewalInfo/x/x/FA63A686E6D6FFB30CDC9E0EAF27F0FF4A3F
to fetch the new RenewalInfo resource for the cert with that serial. We are continuing to make improvements to the implementation.
I have begun implementation of the first fully-fledged client code inside Certbot; this code has not yet landed. In both Boulder and Certbot, the intent is that the code will only be active for the Staging environment until the API standard seems stabilized, to prevent version drift.
Obviously, there is more work to be done here, and the standardization process will take some time. This constitutes the final update we have committed to making on this ticket; please instead follow the documents and repositories linked above for further developments.
Comment 16•2 years ago
Per https://wiki.mozilla.org/CA/Responding_To_An_Incident#Keeping_Us_Informed, we consider remediation done on this incident and propose closing if there are no followup questions.
Comment 17•2 years ago
I'll close this tomorrow, 12-Jan-2022, unless there is a need for further discussion.