Open Bug 1715672 Opened 2 months ago Updated 27 days ago

Let's Encrypt: Failure to revoke for Certificate Lifetime Incident

Categories

(NSS :: CA Certificate Compliance, task)

Tracking

(Not tracked)

ASSIGNED

People

(Reporter: aaron, Assigned: aaron)

References

Details

(Whiteboard: [ca-compliance] [delayed-revocation-leaf] Next update 2021-09-14)

Attachments

(2 files)

1. How your CA first became aware of the problem.

We became aware that we would not be revoking these certificates during our analysis of Bug 1715455.

2. A timeline of the actions your CA took in response.

2021-06-09 02:53 UTC: Incident response for Bug 1715455 begins
2021-06-09 04:52 UTC: ISRG decides not to revoke any certificates for the time being
2021-06-09 17:52 UTC: ISRG begins drafting this incident report to declare the exceptional circumstances leading to this non-revocation
2021-06-09 19:22 UTC: ISRG declares our intent to not revoke in Bug 1715455, Comment #8

3. Whether your CA has stopped, or has not yet stopped, certificate issuance or the process giving rise to the problem or incident.

As per Bug 1715455, we have stopped issuing affected certificates.

All affected certificates will reach their expiration date before 7 September 2021 at 03:41:54 UTC.

We expect the affected unexpired certificates to expire approximately linearly until that time.
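
As a rough illustration of the "approximately linearly" expectation above, the sketch below estimates the share of affected certificates still unexpired at a given moment. It assumes the affected population was issued uniformly over the 90 days before the final expiry time stated in this report; the names `WINDOW_START` and `fraction_unexpired` are hypothetical, and the result is an estimate under that assumption, not Let's Encrypt's actual data.

```python
from datetime import datetime, timedelta, timezone

# Final expiry of the affected certificates, as stated in this report.
LAST_EXPIRY = datetime(2021, 9, 7, 3, 41, 54, tzinfo=timezone.utc)

# Assumption: affected certificates were issued uniformly over the 90 days
# before LAST_EXPIRY, so at WINDOW_START all of them are still unexpired.
WINDOW_START = LAST_EXPIRY - timedelta(days=90)


def fraction_unexpired(now: datetime) -> float:
    """Linear estimate of the share of affected certificates not yet expired."""
    total = (LAST_EXPIRY - WINDOW_START).total_seconds()
    left = (LAST_EXPIRY - now).total_seconds()
    # Clamp so times outside the window map to 1.0 (before) or 0.0 (after).
    return min(1.0, max(0.0, left / total))
```

Under this model, half of the affected certificates would have expired 45 days after the start of the window.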

4. A summary of the problematic certificates.

See Bug 1715455.

5. In a case involving certificates, the complete certificate data for the problematic certificates.

See Bug 1715455.

6. Explanation about how and why the mistakes were made or bugs introduced, and how they avoided detection until now.

The following are the exceptional conditions that led to our decision not to revoke the affected certificates:

While we appreciate that reporting and discussing this incident may lead to improvements to our CA and the PKI ecosystem going forward, in this case we do not believe that revoking certificates already issued as part of our response would benefit the Web PKI.
All of the certificates in question have a validity period of approximately 23% of the maximum allowable validity period as per the BRs. Thus, this extra one second of validity is still far short of the Baseline Requirements’ maximum allowable lifespan.

Many certificate consumers (such as Chrome and NSS) treat affected certificates as having a 90-day lifespan, not 90 days plus 1 second.

Finally, based on the implementation of those certificate consumers and based on discussion on both Bug 1715455 and Bug 1708965, it is not clear that there is widespread consensus across the WebPKI ecosystem that (absent the precedent of Bug 1708965) this constitutes misissuance. We intend to raise this discussion on MDSP to provide clarity for future potential incidents.
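
The one-second discrepancy discussed above depends on whether the validity period counts both endpoint seconds. The following sketch is illustrative only and is not drawn from any certificate consumer's code: it computes both readings for a certificate whose notAfter is exactly 90 days after its notBefore, and compares the result with a 398-day maximum. The 398-day figure and the example timestamps are assumptions made for this example.

```python
from datetime import datetime, timedelta, timezone

# Assumption: maximum validity for subscriber certificates under the BRs.
BR_MAX_VALIDITY = timedelta(days=398)


def validity_exclusive(not_before: datetime, not_after: datetime) -> timedelta:
    """Simple difference between the two timestamps."""
    return not_after - not_before


def validity_inclusive(not_before: datetime, not_after: datetime) -> timedelta:
    """Difference that counts both the first and the last second as valid."""
    return not_after - not_before + timedelta(seconds=1)


# Hypothetical certificate: notAfter is exactly 90 days after notBefore.
not_before = datetime(2021, 6, 1, 0, 0, 0, tzinfo=timezone.utc)
not_after = not_before + timedelta(days=90)

exclusive = validity_exclusive(not_before, not_after)  # 90 days
inclusive = validity_inclusive(not_before, not_after)  # 90 days + 1 second
share_of_max = inclusive / BR_MAX_VALIDITY             # roughly 0.226
```

Read inclusively, the certificate is valid for 7,776,001 seconds rather than 7,776,000, which is the source of the disagreement about whether it exceeds 90 days.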

7. List of steps your CA is taking to resolve the situation and ensure that such situation or incident will not be repeated in the future, accompanied with a binding timeline of when your CA expects to accomplish each of these remediation steps.

We will ensure that this issue is included in our next BR audit statement.

The volume of issuance of Let’s Encrypt is such that mass-revocation events are currently painful. Our intended solution is the ACME Renewal Info (ARI) extension, proposed at the IETF ACME Working Group, initially as a response to Bug 1619179. Implementation of this in such a way that it is effective for a large number of subscribers is non-trivial and will take a significant amount of time, but this bug provides further urgency.

We will commit additional effort to bring ARI through the IETF standardization process.

https://wiki.mozilla.org/CA/Responding_To_An_Incident#Revocation states:

Responses similar to “we do not deem this non-compliant certificate to be a security risk” are not acceptable

While the answer to question 6 doesn't use those exact words, it is very much along the same lines, trying to use the supposed non-severity of the misissuances as a justification to ignore the BR's revocation requirement.

Also, the behavior of certificate consumers is irrelevant to whether a certificate is misissued and whether the BR's revocation requirement applies.

As such, I don't believe Comment 0 meets Mozilla's expectations as described in https://wiki.mozilla.org/CA/Responding_To_An_Incident#Revocation

(In reply to Aaron Gable from comment #0)

5. In a case involving certificates, the complete certificate data for the problematic certificates.

See Bug 1715455.

This (and Bug 1715455) don't appear to provide the required detail. This is something that CAs have been expected to provide, which hopefully Let's Encrypt will see through the review of other CA incidents, as mentioned in Bug 1715455.

As captured in the incident template:

The recommended way to provide this is to ensure each certificate is logged to CT and then list the fingerprints or crt.sh IDs, either in the report or as an attached spreadsheet, with one list per distinct problem. In other cases not involving a review of affected certificates, please provide other similar, relevant specifics, if any.

While we appreciate that reporting and discussing this incident may lead to improvements to our CA and the PKI ecosystem going forward, in this case we do not believe that revoking certificates already issued as part of our response would benefit the Web PKI.

Have I missed an explanation as to why? This appears to just be a statement of opinion, without supporting details. This seems to border on the unacceptable response, namely:

The rationale must include an explanation for why the situation is exceptional. Responses similar to “we do not deem this non-compliant certificate to be a security risk” are not acceptable

Again, this is something that Let's Encrypt would be aware of through review of other CA incidents.

All of the certificates in question have a validity period of approximately 23% of the maximum allowable validity period as per the BRs. Thus, this extra one second of validity is still far short of the Baseline Requirements’ maximum allowable lifespan.

Yes, but the BRs also require revocation if:

  1. The CA is made aware that the Certificate was not issued in accordance with these Requirements or the CA’s Certificate Policy or Certification Practice Statement;

In essence, the statement provided here is why Let's Encrypt feels it is OK to not revoke, which may be a factor for consideration, but it's missing the analysis requested by https://wiki.mozilla.org/CA/Responding_To_An_Incident#Revocation . This should be easy to provide, but it's concerning that it's not provided here.

Many certificate consumers (such as Chrome and NSS) treat affected certificates as having a 90-day lifespan, not 90 days plus 1 second.

"Show, don't tell" :)

I would normally flag this as a particularly problematic statement, because it lacks supporting detail. This is, however, partly mitigated because you did show the work in the related thread at https://groups.google.com/a/mozilla.org/g/dev-security-policy/c/-BogZx_IJyk/m/gHm3l613AgAJ relating to this incident, close to the time of posting this report.

In the future, please ensure incident reports have the relevant supporting details.

Finally, based on the implementation of those certificate consumers and based on discussion on both Bug 1715455 and Bug 1708965, it is not clear that there is widespread consensus across the WebPKI ecosystem that (absent the precedent of Bug 1708965) this constitutes misissuance. We intend to raise this discussion on MDSP to provide clarity for future potential incidents.

The relevant thread is https://groups.google.com/a/mozilla.org/g/dev-security-policy/c/-BogZx_IJyk/m/gHm3l613AgAJ , and corrections have been provided on that thread.

Note that as part of Let's Encrypt's analysis here, relevant other discussions were overlooked, as captured in Bug 1715455, Comment #19 , that show that there has been a historic consensus on the expectation, both from software vendors and CAs.

We will commit additional effort to bring ARI through the IETF standardization process.

This is not a binding timeline, as expected at https://wiki.mozilla.org/CA/Responding_To_An_Incident#Incident_Report

I realize that trying to get a timeline out of the IETF is like trying to get praise for a CA from me - i.e. incredibly difficult - but the expectation here is that Let's Encrypt is going to provide an objective baseline to measure progress, what their prioritization and goals are (e.g. when THEY would like to have something implemented), and periodic updates on that. Obviously, there is benefit to consensus-based standards development, but the concern here, as has happened with other CAs, is that the desire for "consensus" is a way of making it somebody else's problem. This is because consensus doesn't form in a vacuum, and requires energy and investment, and the goal of this incident report is to understand the energy and investment being placed by Let's Encrypt to prevent these incidents going forward.

Flags: needinfo?(aaron)
Assignee: bwilson → aaron
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
See Also: → 1715455
Summary: Let’s Encrypt: Certificate Lifetime Non-Revocation Incident → Let’s Encrypt: Delayed revocation for Certificate Lifetime Incident
Whiteboard: [ca-compliance] [delayed-revocation-leaf]

It has been suggested offline that “delayed revocation” (which is true for any event that exceeds the BR-mandated period) may imply that revocation will happen at some point. As LE have clearly stated that, at present, they have no plans to revoke whatsoever, I am retitling this to avoid such ambiguity, although we track all such incidents generically as “delayed revocation”.

Summary: Let’s Encrypt: Delayed revocation for Certificate Lifetime Incident → Let’s Encrypt: Failure to revoke for Certificate Lifetime Incident
Type: defect → task

Just an update to say we're planning to update by Tuesday 2021-06-15 6pm PDT with more information and plans.

We intend to deploy a demonstration version of ACME Renewal Info (ARI) or similar technology in our staging environment by 2021-11-12, and engage with client authors and integrators to test it out, provide feedback, and evaluate how a mass revocation event would work in practice.

We're planning another update here with answers to remaining questions above by 2021-06-17 6pm PDT.

Flags: needinfo?(aaron)
Summary: Let’s Encrypt: Failure to revoke for Certificate Lifetime Incident → Let's Encrypt: Failure to revoke for Certificate Lifetime Incident

This (and Bug 1715455) don't appear to provide the required detail.

The list of affected certificate serial numbers that were unexpired as of the time we declared an incident is available at this URL: https://le-https-stats.s3.amazonaws.com/one-second-incident-affected-serials.txt.gz

Interested parties can obtain the full certificate bodies either from Certificate Transparency logs, or from the ACME API via an HTTP GET request to the endpoint https://acme-v02.api.letsencrypt.org/acme/cert/<serial in hex>, e.g. https://acme-v02.api.letsencrypt.org/acme/cert/030000033ee153519e6734086f560282082f
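
The lookup described above can be sketched as follows. The URL pattern and the example serial come from this comment; the helper names `cert_url` and `fetch_cert` are hypothetical, and the sketch makes no assumption about the format of the response body beyond returning the raw bytes.

```python
import urllib.request

# Endpoint prefix taken from the comment above.
ACME_CERT_BASE = "https://acme-v02.api.letsencrypt.org/acme/cert/"


def cert_url(serial_hex: str) -> str:
    """Build the certificate URL for a serial number given in hex."""
    serial = serial_hex.strip().lower()
    if not serial or any(c not in "0123456789abcdef" for c in serial):
        raise ValueError(f"not a hex serial: {serial_hex!r}")
    return ACME_CERT_BASE + serial


def fetch_cert(serial_hex: str, timeout: float = 30.0) -> bytes:
    """Fetch the certificate body with a plain HTTP GET (performs network I/O)."""
    with urllib.request.urlopen(cert_url(serial_hex), timeout=timeout) as resp:
        return resp.read()
```

For example, `cert_url("030000033ee153519e6734086f560282082f")` yields the example URL given above. Anyone iterating over the full serial list should pace their requests rather than fetching in bulk.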

This information has also been provided on Bug 1715455.

we do not believe that revoking certificates already issued as part of our response would benefit the Web PKI.
Have I missed an explanation as to why?

See the updated response below, doing the analysis required by https://wiki.mozilla.org/CA/Responding_To_An_Incident#Revocation.

missing the analysis requested by https://wiki.mozilla.org/CA/Responding_To_An_Incident#Revocation

The rationale must include an explanation for why the situation is exceptional.

The situation is exceptional because of the large number of certificates that would be revoked and the current deployed base of the Web PKI, which does not have good software support for automated replacement of revoked and about-to-be-revoked certificates. As we've previously said in https://bugzilla.mozilla.org/show_bug.cgi?id=1619179, causing revocation warnings for large numbers of users on large numbers of sites is likely to result in warning blindness for those users, which harms warning compliance numbers for users and potentially exposes them to increased risk from untrusted certificates.

In addition to that, a large revocation like this one, in the current Web PKI, would reduce site owner confidence in the stability of HTTPS, and is likely to lead site operators to conclude using HTTP is safer for their site's availability, given they have no software support for responding to such a revocation by automatically reissuing.

Part of our mission is to increase the use of HTTPS while also using automation to increase security and agility in the ecosystem. Replacement on revocation is a particular area where that automation is currently lacking, which is why our proposed remediation is to add that support. Our vision is that in most cases, revocation is not a concern for either a site operator or a site visitor, because replacement (where possible) is handled automatically. We recognize that both our rationale and our proposed mitigation bear significant similarity with the previously mentioned incident, and that we should have made better progress on our automation solutions by now. That's why we are now committing to show progress by specific dates, with or without standardization.
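
To make the idea of automated replacement concrete, here is a minimal sketch of the client-side decision that a renewal-information mechanism could enable: the server suggests a renewal window, and the client picks a random moment inside it, so that a CA could pull renewals forward ahead of a revocation without every client renewing at once. The window shape and the function names are assumptions for illustration; this is not the wire format of the ARI proposal.

```python
import random
from datetime import datetime, timedelta
from typing import Optional


def pick_renewal_time(window_start: datetime, window_end: datetime,
                      rng: Optional[random.Random] = None) -> datetime:
    """Pick a uniformly random time inside a server-suggested renewal window."""
    if window_end <= window_start:
        raise ValueError("window_end must be after window_start")
    rng = rng or random.Random()
    span = (window_end - window_start).total_seconds()
    # Spreading clients across the window avoids a thundering herd at the CA.
    return window_start + timedelta(seconds=rng.uniform(0, span))


def should_renew_now(now: datetime, renewal_time: datetime) -> bool:
    """A client renews once its chosen time has arrived."""
    return now >= renewal_time
```

In this model, a CA planning a revocation would move the suggested window earlier, and clients that poll regularly would replace their certificates before the revocation takes effect.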

Any decision to not comply with the timeline specified in the Baseline Requirements must also be accompanied by a clear timeline

All affected certificates will expire by 2021-09-07. See attachments for a graph of expirations over time, and a chart of counts on specific days.

The issue will need to be listed as a finding in your CA’s next BR audit statement.

We will make sure this happens.

Your CA will work with your auditor (and supervisory body, as appropriate) and the Root Store(s) that your CA participates in to ensure your analysis of the risk and plan of remediation is acceptable.

Will do.

You will perform an analysis to determine the factors that prevented timely revocation of the certificates, and include a set of remediation actions in the final incident report that aim to prevent future revocation delays.

As noted in Comment #5 above, we commit to having an initial version of ARI (or similar) deployed in our Staging environment by 2021-11-12.

In the future, please ensure incident reports have the relevant supporting details.

Acknowledged. In this case we consciously chose to not reference specific implementations here as that kind of calling-out did not feel productive. We’ll err on the other side next time.

(In reply to Jacob Hoffman-Andrews from comment #6)

The situation is exceptional because of the large number of certificates that would be revoked and the current deployed base of the Web PKI, which does not have good software support for automated replacement of revoked and about-to-be-revoked certificates. As we've previously said in https://bugzilla.mozilla.org/show_bug.cgi?id=1619179, causing revocation warnings for large numbers of users on large numbers of sites is likely to result in warning blindness for those users, which harms warning compliance numbers for users and potentially exposes them to increased risk from untrusted certificates.

I'm curious what Let's Encrypt sees the thresholds at for this part of the response. In particular, I'm trying to understand how this would differ from, say, a CA that says they cannot revoke a single certificate for a bank because the bank doesn't have a good strategy to replace certificates, and they don't want to cause warning fatigue for their users, or users seeing an error on the bank would cause a loss of trust in HTTPS.

Does Let's Encrypt have any internal guidance about what revocation threshold it sees as acceptable versus unacceptable? If it does, how is that guidance determined?

That said, I also want to acknowledge that the mitigation for this concern - that it seems a blanket statement that can be applied in any/all situations - is that Let's Encrypt has previously publicly made efforts to improve this, and as part of this incident, is publicly committing to make further progress on these efforts, with concrete deliverables and dates. This ultimately is in line with the expectations: the CA must demonstrate what they're doing to solve these (general) problems, and ideally, show a history of work to do so prior to/outside of incidents.

Flags: needinfo?(aaron)

Does Let's Encrypt have any internal guidance about what revocation threshold it sees as acceptable versus unacceptable? If it does, how is that guidance determined?

We don't have internal guidance about a specific threshold. The decision is made by the oncall staff and those they bring in for incident response, based on the specific details of the situation.

I'm curious what Let's Encrypt sees the thresholds at for this part of the response. In particular, I'm trying to understand how this would differ from, say, a CA that says they cannot revoke a single certificate for a bank because the bank doesn't have a good strategy to replace certificates, and they don't want to cause warning fatigue for their users, or users seeing an error on the bank would cause a loss of trust in HTTPS.

Our evaluation of a claim like that would be: the number of users exposed to such warnings would ultimately be somewhat low, because a single high traffic site with a high dedication to availability would find a way to replace a revoked certificate quickly, even if there were not procedures in place already. Part of our evaluation here is the prospect of large numbers of users seeing large numbers of warnings across large numbers of sites over a long period of time.

And, as you say, the main thing here is to make progress towards ensuring universal, easy, on-demand certificate rotation.

While I'm always uncomfortable with these sorts of responses on principle ("It's exceptional because it will impact our users/your users"), because it's a statement that has previously been applied for as little as one certificate, in this case it applies to "hundreds of millions", and there's a clear plan for forward progress to ensure that, even for "hundreds of millions", there's a story in place.

I'm sending this to Ben. Ben: Comment #8 suggests that the upper bound for action will be 2021-11-12. This also corresponds to the conclusion of IETF 112.

I'd suggest setting a Next-Update perhaps closer to 2021-09-14: this is two weeks before all the certificates will be expired, as well as roughly 2-3 weeks before the cut-off date for requesting scheduling of meetings at IETF 112 (historically, cut-off is around 6 weeks before the event). This would hopefully give ISRG/Let's Encrypt a chance to provide a status update about the IETF discussions and their own implementation experiences/explorations on ARI, which is their proposed systemic fix for preventing revocation delays.

Flags: needinfo?(aaron) → needinfo?(bwilson)

We are happy to provide an update on or by 2021-09-14.

Flags: needinfo?(bwilson)
Whiteboard: [ca-compliance] [delayed-revocation-leaf] → [ca-compliance] [delayed-revocation-leaf] Next update 2021-09-14