Closed Bug 1619179 Opened 8 months ago Closed 5 months ago

Let's Encrypt: Incomplete revocation for CAA rechecking bug

Categories: NSS :: CA Certificate Compliance (task; Not set; normal)
Tracking: Not tracked
Status: RESOLVED FIXED
People: Reporter: jaas, Assigned: jaas
Whiteboard: [ca-compliance] [delayed-revocation-leaf]
Attachments: 3 files

Let’s Encrypt recently reported the bug/incident described here:

https://bugzilla.mozilla.org/show_bug.cgi?id=1619047

A fix was quickly deployed and we reported the issue publicly within hours of having learned about it.

It’s not likely that there was any significant mis-issuance as a result of this incident because the following unlikely sequence of events would have to have occurred:

  1. Someone successfully issues a Let’s Encrypt certificate containing multiple domains, with CAA not blocking issuance.

  2. After successful initial issuance, a CAA record preventing issuance from Let’s Encrypt is put in place for at least one of the domains in the certificate. If this was a legitimate domain owner/agent responding to previous unauthorized issuance (possible but not common) we’d expect the domain owner to revoke via API or at least file a report.

  3. More than 8 hours and less than 30 days after the initial issuance, someone attempts to obtain another certificate from Let’s Encrypt and the domain for which Let’s Encrypt did check CAA is not the domain with the new CAA block.
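The recheck that step 3 relies on can be sketched as follows. This is a minimal illustration, not Boulder's code: the 8-hour freshness window comes from the scenario above, while the function names and the `caa_allows_issuance` callable are hypothetical stand-ins for a real DNS CAA lookup. The point it shows is that every domain with a stale validation gets its own CAA check, rather than only one domain from the certificate.

```python
from datetime import datetime, timedelta

# Assumption: a CAA result counts as fresh for 8 hours after validation,
# matching the "more than 8 hours" window in the scenario above.
CAA_FRESHNESS = timedelta(hours=8)


def domains_needing_caa_recheck(validated_at, now):
    """Return every domain whose validation is older than the freshness window.

    `validated_at` maps each domain name to the time it was validated.
    """
    return sorted(
        domain
        for domain, validated in validated_at.items()
        if now - validated > CAA_FRESHNESS
    )


def recheck_caa(validated_at, now, caa_allows_issuance):
    """Recheck CAA once per stale domain and return the domains that block issuance.

    `caa_allows_issuance` is a hypothetical callable (domain -> bool).
    """
    stale = domains_needing_caa_recheck(validated_at, now)
    return [domain for domain in stale if not caa_allows_issuance(domain)]
```

With this shape, a CAA block added to any one of the stale domains stops the new issuance, which is the outcome the three-step scenario assumes did not happen.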

Given how unlikely it is that a scenario like this would arise, we’d like to ask for an exemption from the requirement that we mass revoke certificates in response.

The Baseline Requirements require us to revoke any certificates issued without proper CAA checks. Having to revoke millions of certificates in this particular situation would cause a huge amount of unnecessary disruption. In some cases mass revocation is appropriate, but we don’t believe this is one of those cases.

Note that we only offer certificates with 90 day lifetimes and we only cache authorizations for 30 days, so certificates issued during the affected period will leave the ecosystem relatively quickly.

We still have time to meet the revocation deadline, and we intend to do so unless we are granted exemptions from all relevant root programs. We’d appreciate a prompt response if possible so that we can revoke on time if we need to.

Depends on: 1619047
Whiteboard: [ca-compliance] [delayed-revocation-leaf]

Josh: please refer to Mozilla's guidance on revocation at https://wiki.mozilla.org/CA/Responding_To_An_Incident#Revocation which states the following:

Mozilla recognizes that in some exceptional circumstances, revoking misissued certificates within the prescribed deadline may cause significant harm, such as when the certificate is used in critical infrastructure and cannot be safely replaced prior to the revocation deadline, or when the volume of revocations in a short period of time would result in a large cumulative impact to the web. However, Mozilla does not grant exceptions to the BR revocation requirements. It is our position that your CA is ultimately responsible for deciding if the harm caused by following the requirements of BR section 4.9.1 outweighs the risks that are passed on to individuals who rely on the web PKI by choosing not to meet this requirement.

It then goes on to explain what Mozilla's expectations are should a CA decide to violate the BR revocation requirements.

I hope this answers your question. Feel free to use this bug to provide the additional information that is requested in the event that Let's Encrypt should choose not to revoke.

Assignee: wthayer → jaas
Status: NEW → ASSIGNED
Summary: revocation exemption request for Let's Encrypt CAA Rechecking bug → Let's Encrypt: Revocation exemption request for CAA Rechecking bug

More than 8 hours and less than 30 days after the initial issuance, someone attempts to obtain another certificate from Let’s Encrypt and the domain for which Let’s Encrypt did check CAA is not the domain with the new CAA block.

It was potentially up to 37 days for older ones, given https://github.com/letsencrypt/boulder/issues/4617.

(That in itself is fine, since the BR rule is 825 days.)

Summary: Let's Encrypt: Revocation exemption request for CAA Rechecking bug → Let's Encrypt: Incomplete revocation for CAA rechecking bug

After learning about and remediating a bug in our CAA checking code [1] on 2020-02-29 UTC (the evening of Friday February 28, U.S. Eastern time), we announced that we would be revoking approximately 2.6% of our active certificates that were potentially affected by the bug, totalling approximately 3 million certificates [2].

We announced the plan to revoke because even though the vast majority of the certificates in question do not pose a security risk, industry rules require that we revoke certificates not issued in full compliance with specific standards. These rules exist for good reasons. We work hard to comply with them and have an excellent track record for doing so.

Since that announcement we have worked with subscribers around the world to replace affected certificates as quickly as possible. More than 1.7 million affected certificates have been replaced in less than 48 hours. We'd like to thank everyone who helped with the effort. Our focus on automation has allowed us, and our subscribers, to make great progress in a short amount of time. We’ve also learned a lot about how we can do even better in the future.

Unfortunately, we believe it's likely that more than 1 million certificates will not be replaced before the compliance deadline for revocation is upon us at 2020-03-05 03:00 UTC (9pm U.S. ET tonight). Rather than potentially break so many sites and cause concern for their visitors, we have determined that it is in the best interest of the health of the Internet for us to not revoke those certificates by the deadline.

Let’s Encrypt only offers certificates with 90 day lifetimes, so potentially affected certificates that we may not revoke will leave the ecosystem relatively quickly.

The following certificates have been, or will be, revoked prior to the compliance deadline at 2020-03-05 03:00 UTC (9pm U.S. ET tonight):

  • 1,706,505 certificates that we are confident were replaced during the incident period
  • 445 certificates that we treated as highest priority for revocation because, at the time we found the bug, they had CAA records that forbid issuance by Let’s Encrypt.

We plan to revoke more certificates as we become confident that doing so will not be needlessly disruptive to Web users.

I would like to thank the Let’s Encrypt team for tirelessly working to resolve this situation in the best way possible. It involved incredible effort and I couldn’t be more proud of what we have been able to get done in such a short amount of time.

[1] https://community.letsencrypt.org/t/2020-02-29-caa-rechecking-bug/114591
[2] https://community.letsencrypt.org/t/revoking-certain-certificates-on-march-4/114864

Josh: Thanks for the update.

While this is understandably complex, and I appreciate the level of detail in Comment #3, I want to highlight that it's missing essential details as covered in the expectations for Responding to an incident.

Relevant bits from that section are included below for emphasis:

  • The rationale must include an explanation for why the situation is exceptional. Responses similar to “we deem this misissuance not to be a security risk” are not acceptable.
  • Any decision to not comply with the timeline specified in the Baseline Requirements must also be accompanied by a clear timeline describing if and when the problematic certificates will be revoked or expire naturally, and supported by the rationale to delay revocation.
  • That you will perform an analysis to determine the factors that prevented timely revocation of the certificates, and include a set of remediation actions in the final incident report that aim to prevent future revocation delays.

It may be helpful to review some of the still open reports from some CAs, as well as some of the past incidents.

Flags: needinfo?(jaas)

The decision and rationale for delaying revocation will be disclosed to Mozilla in the form of a preliminary incident report immediately.

We believe the large volume of revocations, if we had revoked the full set of affected certificates by the deadline, would have resulted in a large cumulative impact to the web. We revoked 1,711,396 certificates by the deadline (56% of the total affected), based on our evaluation that they had been replaced, were not in use, or currently had CAA records forbidding issuance to Let's Encrypt. Of the remaining 1,336,893 certificates, most (65%) were still in use based on our Internet scans, and the remainder were of undetermined status based on initial scans.
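As a quick check on the figures in this paragraph, the arithmetic can be reproduced as below. The counts are taken from the comment; the 65% in-use share is a rounded figure, so the in-use count is only an estimate.

```python
revoked_by_deadline = 1_711_396
remaining = 1_336_893
total_affected = revoked_by_deadline + remaining

revoked_share = revoked_by_deadline / total_affected

# The 65% figure is rounded in the comment, so this is an approximation.
estimated_in_use = round(remaining * 0.65)

print(f"total affected:         {total_affected:,}")
print(f"revoked by deadline:    {revoked_share:.1%}")
print(f"estimated still in use: about {estimated_in_use:,}")
```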

We believe that revoking those actively in-use certificates would have harmed the web because many users, upon encountering the revocation errors, would look up instructions on how to bypass revocation checks. For instance, the top Google result for [SEC_ERROR_REVOKED_CERTIFICATE firefox] right now is https://support.mozilla.org/en-US/questions/856276, which says 'You can uncheck "Use the OCSP to confirm the current validity"'. Those users are unlikely to re-enable revocation checks once they are done using the affected sites. This would prevent those users from receiving future warnings about revoked certificates.

Also, the experience of bypassing many revoked certificate warnings would likely contribute to such users' “warning blindness,” causing them to ignore future errors. The class of errors users start ignoring could extend beyond revocation errors to include similar types of errors, like certificate mismatch or certificate expiration. That would expose such users to a risk of their communications being intercepted with trivial attacks using non-browser-trusted certificates.

We acknowledge that this reasoning applies to future instances where millions of certificates need to be revoked at once. As such, we plan to develop systemic improvements so that millions of certificates can be automatically replaced by Subscribers within the BR-mandated deadlines. See below for details.

Any decision to not comply with the timeline specified in the Baseline Requirements must also be accompanied by a clear timeline

Since the deadline, we have revoked an additional 295,799 certificates, for a total of 2,007,195 revoked, plus 37,499 that expired before we revoked them. That leaves 1,003,596 still to be revoked or expire.

Over the next 83 days we will continue to work with our Subscribers to get certificates replaced, and will continue to revoke certificates as they are replaced. Specifically, we will check at least twice a week for certificates that have been replaced, and revoke those that have. Additionally, when subscribers with large numbers of certificates notify us that their replacement process is complete, we will revoke those certificates.
After 83 days, all affected certificates will have expired, due to our 90-day certificate lifetime.
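The 83-day figure follows from the 90-day lifetime. A small sketch of the date arithmetic, under two labeled assumptions: that no affected certificate was issued after the fix on 2020-02-29 UTC, and that the count starts on 2020-03-07 UTC, the date these numbers were taken.

```python
from datetime import date, timedelta

CERT_LIFETIME = timedelta(days=90)

# Assumption: the last affected certificate was issued on the day of the fix.
last_affected_issuance = date(2020, 2, 29)
last_expiry = last_affected_issuance + CERT_LIFETIME

# Assumption: the 83-day count starts on the date of this update.
update_date = date(2020, 3, 7)
days_until_last_expiry = (last_expiry - update_date).days

print(last_expiry.isoformat(), days_until_last_expiry)
```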

I've attached a file listing, for each of the next 83 days, how many currently-unrevoked certificates will expire on that day. Dates are in UTC. Note that the number for 2020-03-07 is slightly higher because it represents that value for 2020-03-07 00:00 UTC, while the numbers described above are as of 01:10 UTC.

The issue will need to be listed as a finding in your CA's next BR audit statement.

We will make sure this happens.

Your CA will work with your auditor (and supervisory body, as appropriate) and the Root Store(s) that your CA participates in to ensure your analysis of the risk and plan of remediation is acceptable.

We will also do this.

That you will perform an analysis to determine the factors that prevented timely revocation of the certificates, and include a set of remediation actions in the final incident report that aim to prevent future revocation delays.

Having reviewed previous incident reports and analyzed our current situation, we find that a common root cause of failure to revoke on time is that Subscribers are not able to replace certificates on the BR-mandated timelines (24 hours or 5 days, depending on the issue).

Most Subscribers are not able to field round-the-clock incident response, so improving the speed of manual replacement processes cannot be the answer. Increasing public acceptance of revoked certificate errors also cannot be the answer, because that would undermine public faith in the web PKI. Reducing the incidence and scope of CA errors is an important part of the solution, and we have laid out some plans to that effect at https://bugzilla.mozilla.org/show_bug.cgi?id=1619047. However, responsible systems design requires layered responses, and it is possible that we, or another CA, will have a similar-sized incident in the future despite our best practices and best efforts.

Therefore, our conclusion is that we need to develop a protocol to notify Subscribers' systems of imminent certificate revocation, so those Subscribers can automate the process of replacing affected certificates before the deadline. We plan to design this protocol publicly, in collaboration with the PKI community, so that any CA and any Subscriber can implement it. We will also collaborate directly with popular ACME clients to integrate and test such automated replacement.
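To make the idea concrete, here is a purely hypothetical sketch of the client-side decision such a protocol could enable. No protocol is specified in this bug; the `RevocationNotice` message, its fields, and the action names are all invented for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class RevocationNotice:
    """Hypothetical notice a CA could publish for one certificate."""
    serial: str
    revoke_at: datetime  # planned revocation time, UTC


def next_action(notice: Optional[RevocationNotice], now: datetime) -> str:
    """Decide what an automated client should do with its certificate."""
    if notice is None:
        return "keep"  # no revocation planned
    if now < notice.revoke_at:
        return "replace-before-deadline"  # still time to swap without an outage
    return "replace-immediately"  # deadline passed, certificate may be revoked
```

A client polling for such notices on its normal renewal schedule could replace an affected certificate without any operator being awake, which is the gap the paragraph above identifies.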

Jacob: Could you provide an update to Comment #7? Per Keeping us informed:

You should also provide updates at least every week giving your progress, and confirm when the remediation steps have been completed - unless Mozilla representatives agree to a different schedule by setting a “Next Update” date in the “Whiteboard” field of the bug.

In Comment #7, you stated:

Specifically, we will check at least twice a week for certificates that have been replaced, and revoke those that have. Additionally, when subscribers with large numbers of certificates notify us that their replacement process is complete, we will revoke those certificates.

Has this been done? Have the numbers substantially changed? Noting how other large revocations have provided regular progress (examine past delayed-revocation bugs), it'd be good to hear the same.

Flags: needinfo?(jsha)

Specifically, we will check at least twice a week for certificates that have been replaced, and revoke those that have. Additionally, when subscribers with large numbers of certificates notify us that their replacement process is complete, we will revoke those certificates.

Has this been done? Have the numbers substantially changed? Noting how other large revocations have provided regular progress (examine past delayed-revocation bugs), it'd be good to hear the same.

Yes, we've been doing these check-and-revoke cycles twice a week as planned. The remaining unrevoked, unexpired certificate count as of March 27 is 685,234. That's an improvement over the 870,074 that our original analysis indicated would remain at this point through natural expiry alone, without our twice-weekly checks. I'm attaching a fresh TSV indicating the timeline for remaining certificates.
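One check-and-revoke cycle might look roughly like the sketch below. It is an illustration only: `is_replaced` and `revoke` are hypothetical callables standing in for whatever scanning and revocation tooling is actually used.

```python
def check_and_revoke(affected, is_replaced, revoke):
    """Run one cycle over the affected, still-valid certificate serials.

    Revokes every certificate that has been replaced and returns the set of
    serials that are still outstanding.
    """
    outstanding = set()
    for serial in sorted(affected):
        if is_replaced(serial):
            revoke(serial)  # safe to revoke: a replacement is already in use
        else:
            outstanding.add(serial)  # leave in place until replaced or expired
    return outstanding
```

Running this twice a week and feeding each cycle's result into the next would shrink the outstanding set faster than natural expiry alone.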

Per Keeping us informed

Thanks for the reminder of this, and apologies for not providing timely updates so far. We will provide weekly updates from here on.

Flags: needinfo?(jsha)
Flags: needinfo?(jaas)

Giving our update for this week: The unrevoked, unexpired count as of April 3 is 634,120. That's ahead of the 819,688 we originally projected for this date. The difference, as before, is due to our twice-weekly scans and revocations.

The unrevoked, unexpired count as of April 13 is 496,523. That's ahead of the 726,022 we originally projected for this date. The difference, as before, is due to our twice-weekly scans and revocations.

The unrevoked, unexpired count as of April 20 is 449,634. That's ahead of the 668,200 we originally projected for this date.

The unrevoked, unexpired count as of April 27 is 327,352. That's ahead of the 604,812 we originally projected for this date.

The unrevoked, unexpired count as of May 4 is 124,940. That's ahead of the 532,821 we originally projected for this date.

The unrevoked, unexpired count as of May 11 is 92,887. That's ahead of the 432,238 we originally projected for this date.

The unrevoked, unexpired count as of May 18 is 65,646. That's ahead of the 364,409 we originally projected for this date.

The unrevoked, unexpired count as of May 26 is 18,111. That's ahead of the 145,672 we originally projected for this date.
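For reference, the weekly figures reported above can be collected and compared in one place. The numbers are copied from the updates in this thread; the year 2020 is taken from the surrounding dates.

```python
# (date, actual unrevoked and unexpired count, originally projected count)
updates = [
    ("2020-03-27", 685_234, 870_074),
    ("2020-04-03", 634_120, 819_688),
    ("2020-04-13", 496_523, 726_022),
    ("2020-04-20", 449_634, 668_200),
    ("2020-04-27", 327_352, 604_812),
    ("2020-05-04", 124_940, 532_821),
    ("2020-05-11", 92_887, 432_238),
    ("2020-05-18", 65_646, 364_409),
    ("2020-05-26", 18_111, 145_672),
]

# How far ahead of the natural-expiry projection each update was.
gaps = [(day, projected - actual) for day, actual, projected in updates]

for (day, actual, projected), (_, ahead) in zip(updates, gaps):
    print(f"{day}  actual={actual:>7,}  projected={projected:>7,}  ahead={ahead:>7,}")
```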

As of 2020-05-29 02:10:21 all affected certificates have expired or been revoked.

Flags: needinfo?(wthayer)

It appears that all questions have been answered and remediation is complete.

Status: ASSIGNED → RESOLVED
Closed: 5 months ago
Flags: needinfo?(wthayer)
Resolution: --- → FIXED