Closed Bug 1886876 Opened 9 months ago Closed 8 months ago

Let's Encrypt: keyCompromise key blocking deviation from CP/CPS

Categories

(CA Program :: CA Certificate Compliance, task)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jcj, Assigned: jcj)

Details

(Whiteboard: [ca-compliance] [policy-failure])

Preliminary Incident Report

During quarterly review of our CP/CPS document on 2024-03-20, Let’s Encrypt’s Policy Management Authority discovered a discrepancy between the revocation behavior for reason keyCompromise committed to in that document and the actual behavior of our software systems. This incident does not affect the issuance of certificates, so we have not stopped issuance. We will provide a full incident report on or before 2024-03-28.

Status: NEW → ASSIGNED
Type: defect → task
Whiteboard: [ca-compliance] [policy-failure]

Incident Report

Summary

During PMA review of our CP/CPS we noted that Section 4.9.12 stated that “Successful revocation requests with a reason code of keyCompromise will result in the affected key being blocked for future issuance and all currently valid certificates with that key will be revoked.” However, this does not accurately describe Let’s Encrypt’s behavior.

The ACME protocol supports three different kinds of revocation requests: those signed by the ACME account key of the Subscriber who originally requested the certificate, those signed by the ACME account key of a different Subscriber who has demonstrated control over all identifiers in the certificate, and those signed by the private key corresponding to the public key in the certificate itself. Only the last of these actually demonstrates that the key has been compromised, so we only block the key and revoke other certificates sharing that key when the revocation request was signed by the certificate key itself. This behavior was and remains a deliberate choice, to prevent a potential DoS vector detailed below.

We have changed the language in our CP/CPS to reflect our actual behavior, and are filing this incident to reflect the period of time in which our behavior was in violation of our CPS.

Impact

We have completed an investigation of the issuance and revocation information in our authoritative database and onsite audit logs. From that investigation we found:

  • 18,333 successful keyCompromise revocation requests where compromise was not demonstrated and therefore the key was not blocked, in violation of our CPS. Most of the corresponding keys have been blocked; see details in the Action Items section.
  • 9 unexpired unrevoked certificates which should have been revoked or should not have been issued due to one of the requests above. These certificates have been revoked.

However, our onsite logs only cover revocation events after 2022-12-17, approximately 84% of the incident period. We are continuing our investigation using audit logs held off-site on archival tape, and will provide another update to this incident report when that work is complete.

Timeline

All times are UTC. Some documents (such as the Baseline Requirements and the Mozilla Root Store Policy) do not give their publication or effective dates with HH:MM granularity, so only dates are provided for those entries.

2020-05-28

  • Let’s Encrypt’s CP/CPS is updated to include the keyCompromise block-and-revoke language quoted in the Summary above

2021-07-19

2021-10-04

  • 06:05 Let’s Encrypt receives a request to revoke one certificate with reason keyCompromise, and subsequently revokes 134,405 other certificates which shared the same public key

2021-10-14

  • 21:39 Boulder is changed to restrict keyCompromise revocation to only those requests which demonstrate compromise

2021-11-30

  • 21:43 Mozilla begins discussion on introducing requirements regarding revocation reasons

2022-03-14

  • 15:58 The way Boulder handles revocation requests is overhauled

2022-03-17

  • 20:13 The change above is deployed to production, but the portion of the change relevant to this incident is gated behind a disabled-by-default feature flag

2022-06-29

  • Version 2.8 of Mozilla’s Root Store Policy is published

2022-08-16

  • 16:29 The upcoming changes to how LE processes revocation requests are publicly announced

2022-09-21

  • 20:14 The feature flag is enabled in production (Incident Begins)

2022-10-01

  • The new Mozilla Root Store Policy requirements regarding revocation reason codes go into effect

2023-07-15

2023-09-01

  • Version 2.9 of Mozilla’s Root Store Policy goes into effect

2024-03-20

  • 18:30 The incorrect language in the CP/CPS is discovered during Policy Management Authority (PMA) review of the document
  • 20:20 Incident declared and incident response procedures begun

2024-03-21

  • 19:11 Preliminary investigation of production database identifies 453 revocation requests for unexpired certificates which were not processed per our CP/CPS
  • 20:14 Investigation identifies 1 unexpired unrevoked certificate which should have been revoked as a result of one of the requests above
  • 20:16 The 1 unexpired unrevoked certificate is revoked
  • 20:45 Preliminary incident report filed in Bugzilla (this issue)

2024-03-22

  • 20:50 CP/CPS v5.3 is published, updating the language to reflect correct behavior (Incident Ends)

2024-03-25

  • 16:58 Investigation of onsite audit logs commences

2024-03-26

  • 23:31 Keys from the 453 revocation requests above are blocked from future issuance

2024-03-27

  • 19:58 Investigation of onsite audit logs identifies 18,333 revocation requests which were not processed per our CP/CPS
  • 21:05 Investigation identifies 18,260 keys which should have been blocked as a result of the requests above
  • 23:24 Investigation identifies 446 unexpired certificates containing one of the keys above (most of which are already revoked)

2024-03-28

  • 00:29 All 18,260 keys are blocked
  • 00:31 All 8 additional unexpired unrevoked certificates are revoked

Root Cause Analysis

Background

Before diving into the root cause analysis itself, we would like to provide additional details on how Let’s Encrypt has handled keyCompromise revocation requests historically, and when and why that behavior changed. All events described below have precise datestamps and links to external sources in the timeline above.

In 2020, we updated our CPS to include the language quoted in the Summary above. At the time, this language reflected Let’s Encrypt’s actual behavior: all successful revocation requests which specified the keyCompromise revocation reason resulted in the corresponding key being added to our “blockedKeys” table. Inclusion in the blockedKeys table has two consequences: all future issuance requests which specify that same key will be rejected, and a background process searches our database for all other unexpired certificates sharing that same key and revokes them with reason keyCompromise.
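
For illustration only, the following Go sketch models those two consequences; the table names, columns, and helper signatures here are hypothetical stand-ins rather than Boulder's actual schema or code.

```go
// Illustrative sketch only: hypothetical schema and helper names, not
// Boulder's actual implementation.
package sketch

import (
	"context"
	"database/sql"
	"fmt"
)

// isKeyBlocked is consulted on every issuance request: if the hash of the
// requested public key appears in blockedKeys, issuance is rejected.
func isKeyBlocked(ctx context.Context, db *sql.DB, keyHash []byte) (bool, error) {
	var n int
	err := db.QueryRowContext(ctx,
		"SELECT COUNT(*) FROM blockedKeys WHERE keyHash = ?", keyHash).Scan(&n)
	return n > 0, err
}

// revokeCertsSharingKey models the background process: it finds every
// unexpired, unrevoked certificate whose key matches the blocked hash and
// revokes it with reason keyCompromise (CRL reason code 1).
func revokeCertsSharingKey(ctx context.Context, db *sql.DB, keyHash []byte,
	revoke func(serial string, reasonCode int) error) error {
	rows, err := db.QueryContext(ctx,
		"SELECT serial FROM certificates WHERE keyHash = ? AND expires > NOW() AND status != 'revoked'",
		keyHash)
	if err != nil {
		return err
	}
	defer rows.Close()
	for rows.Next() {
		var serial string
		if err := rows.Scan(&serial); err != nil {
			return err
		}
		if err := revoke(serial, 1); err != nil {
			return fmt.Errorf("revoking %s: %w", serial, err)
		}
	}
	return rows.Err()
}
```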

This system seemed to serve us well, until late 2021 when we received an ACME request from a Subscriber to revoke a certificate with reason keyCompromise. The request was from a Subscriber who controlled all of the identifiers in the certificate, so it was accepted and the certificate was revoked. Per our CPS, our CA software automatically then blocked that key and revoked all 134,405 other certificates which shared that key. This resulted in a different Subscriber being very understandably frustrated and confused as to why all of their certificates had been revoked when their key had not in fact been compromised. We realized that we should only do the block-and-cascading-revoke routine when key compromise has been demonstrated, not just asserted.

As noted in the summary above, only one of the ACME protocol’s revocation methods actually demonstrates key compromise: the JSON Web Signature which accompanies the request to the revocation endpoint can be signed by the certificate’s private key, rather than an ACME account private key. Because the JWS in all ACME requests binds an anti-replay nonce, this kind of revocation request is a definitive demonstration of compromise.
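
As a rough sketch of that check (not Boulder's actual code), a keyCompromise request can be treated as demonstrating compromise only if its JWS verifies under the certificate's own public key; a real ACME implementation would also validate the nonce, request URL, and algorithm in the protected header.

```go
// Minimal illustration, not Boulder's actual code: a keyCompromise
// revocation request demonstrates compromise only if its JWS verifies
// under the certificate's own public key.
package sketch

import (
	"crypto/x509"

	jose "gopkg.in/square/go-jose.v2"
)

// compromiseDemonstrated reports whether the raw JWS from the revocation
// request is signed by the key in the certificate being revoked. A real
// ACME server would also check the anti-replay nonce, the request URL,
// and the signature algorithm in the protected header.
func compromiseDemonstrated(rawJWS string, cert *x509.Certificate) bool {
	jws, err := jose.ParseSigned(rawJWS)
	if err != nil {
		return false
	}
	_, err = jws.Verify(cert.PublicKey)
	return err == nil
}
```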

So we changed our software to require all revocation requests with a reasonCode of keyCompromise to be signed by the certificate key. This change was still in compliance with our CPS, because the behavior described there was limited to “successful revocation requests with a reason code of keyCompromise” (emphasis added). The change simply reduced the set of possible requests which would be successful, but continued to have the same block-and-cascading-revoke behavior for those that did succeed.

Shortly thereafter, the Mozilla Root Program circulated their intent to add requirements on which revocation reasons can be used under which circumstances. We actively participated in this discussion, and urged that proof of possession be required before a CA revokes with reason keyCompromise merely at a Subscriber’s request. The end result of the conversation was that CAs would be obligated to revoke with reason keyCompromise at the word of the Subscriber, but that they would not be required to block the key and cascade-revoke if the Subscriber had not proved possession of the key. This language was eventually incorporated into Version 2.8 of the MRSP.

In parallel to those discussions, we overhauled our revocation code paths to make them simpler to reason about. While doing so, we preserved the current “accept keyCompromise only if compromise is demonstrated” behavior, but added a feature flag that would change the behavior to comply with the upcoming MRSP and allow keyCompromise revocations to succeed without an accompanying block-and-cascading-revoke. We then turned this feature flag on a few days before the new MRSP requirements went into effect, thus beginning this incident.
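
A simplified sketch of the resulting flag-gated decision logic follows; the flag, type, and function names are invented for illustration and do not correspond to Boulder's real identifiers.

```go
// Rough sketch of the flag-gated decision; all names here are hypothetical
// and chosen for illustration, not taken from Boulder.
package sketch

import "errors"

var errMustSignWithCertKey = errors.New(
	"keyCompromise revocation must be signed by the certificate key")

type revocationRequest struct {
	ReasonIsKeyCompromise  bool
	SignedByCertificateKey bool // true when the JWS verified under the cert's own key
}

// outcome records what the CA does with a successful request.
type outcome struct {
	Revoke        bool // revoke the named certificate
	BlockKey      bool // add the key hash to blockedKeys
	CascadeRevoke bool // background job revokes all other unexpired certs sharing the key
}

// handleRevocation shows how the feature flag changed behavior. With the
// flag off (2021-10-14 through 2022-09-21), undemonstrated keyCompromise
// requests were rejected. With the flag on (the MRSP 2.8 behavior, and the
// start of this incident), they succeed but no longer trigger the
// block-and-cascading-revoke that the CPS still described.
func handleRevocation(req revocationRequest, allowUnprovenKeyCompromise bool) (outcome, error) {
	if !req.ReasonIsKeyCompromise {
		// Authorization checks for account-key-signed requests omitted here.
		return outcome{Revoke: true}, nil
	}
	if req.SignedByCertificateKey {
		// Compromise demonstrated: revoke, block the key, cascade-revoke.
		return outcome{Revoke: true, BlockKey: true, CascadeRevoke: true}, nil
	}
	if !allowUnprovenKeyCompromise {
		return outcome{}, errMustSignWithCertKey
	}
	// Honor the Subscriber's asserted reason without blocking the key.
	return outcome{Revoke: true}, nil
}
```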

Analysis

The root cause here is that we did not have effective mechanisms for ensuring that changes to CA behavior which would otherwise violate the CPS are preceded by changes to the CPS permitting the new behavior.

We have mature mechanisms for reviewing changes to the BRs and individual root program requirements, and incorporating those changes into our behavior. But a CPS is fundamentally different from other requirements documents: it is descriptive, rather than prescriptive, and changes need to flow into it from the CA rather than the other way around. Although our existing PMA review of the document did catch this discrepancy, it is not positioned to proactively prevent a similar issue: high-level, whole-document, general review is not well suited to finding small technical discrepancies, and by definition can only find them after the fact, at which point an incident has already occurred.

Instead, we need a process which prompts us to review targeted sections of the CPS when certain code changes, particularly feature flags, are introduced.

Lessons Learned

What went well

  • ACME’s fully automated approach makes it possible for ACME clients to easily use a new key for every renewal, making key reuse relatively rare.
  • Our existing PMA document review caught the discrepancy.
  • Our incident response procedures kicked in quickly and effectively after PMA discovered the issue.
  • Our PMA procedures enabled us to quickly review and publish an updated CP/CPS.

What didn't go well

  • Our PMA document review did not detect the issue in the first review after the discrepancy was introduced, increasing the length of the incident period.
  • Our onsite log retention period was not long enough to cover all revocation events within the incident period, requiring us to retrieve archived logs from offsite storage.

Where we got lucky

  • None of the key compromise revocation requests we have identified were for keys which are used across a large set of certificates, like the one on 2021-10-04. This enabled us to block the keys and revoke the affected certificates without concern for impact on the internet at large.

Action Items

Due to our on-site log rotation, we have not yet identified 100% of revocation requests which were not fully processed per our CPS. In particular, we estimate we have identified only 84% of the affected revocation requests. Additionally, we have blocked only 98% of the keys contained in the certificates revoked by those revocation requests, because some of those requests were for certificates that were issued in the 90 days prior to our earliest onsite logs, and we only log the public key at issuance time (not at revocation time).

Our highest priority at this time is completing our investigation using the logs stored off-site on archival tape. When we have processed the archived audit logs, we will perform another round of key-blocking and report any additional unexpired certificates we revoke.

To prevent incidents like this from recurring, we intend to establish automation which requires a review of our CP/CPS to be included with each proposed code change that introduces a new feature flag. Requiring this review at the time that feature flags are proposed will ensure that the review is narrowly targeted and occurs early, before issues like this actually arise.
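
For illustration only, such a check might look roughly like the sketch below; the file path, flag pattern, and commit trailer are hypothetical and do not describe our actual tooling.

```go
// Hypothetical CI check, sketched for illustration; the file path, the flag
// pattern, and the acknowledgement trailer are all assumptions.
package main

import (
	"fmt"
	"os"
	"os/exec"
	"regexp"
	"strings"
)

func main() {
	// Diff the feature-flag definitions against the merge base.
	diff, err := exec.Command("git", "diff", "origin/main...HEAD", "--", "features/features.go").Output()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// Added lines that look like a new boolean flag definition.
	newFlag := regexp.MustCompile(`^\+\s*[A-Za-z0-9_]+\s+bool\b`)
	var introducesFlag bool
	for _, line := range strings.Split(string(diff), "\n") {
		if newFlag.MatchString(line) {
			introducesFlag = true
			break
		}
	}
	// Require an explicit CP/CPS review acknowledgement (e.g. a commit
	// trailer) whenever a new flag is introduced.
	if introducesFlag && !strings.Contains(os.Getenv("COMMIT_MESSAGE"), "CPS-Reviewed:") {
		fmt.Fprintln(os.Stderr, "new feature flag introduced without a recorded CP/CPS review")
		os.Exit(1)
	}
}
```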

These action items and their target completion dates are summarized in the table below.

Remediation | Kind | Status | Date
Update CP/CPS section 4.9.3 | Prevent | Complete | 2024-03-22
Block all keys which should have been blocked per our CPS | Mitigate | In Progress | 2024-04-05
Revoke all certificates which should have been revoked or blocked from issuance per our CPS | Mitigate | In Progress | 2024-04-05
Increase onsite log target retention period to ease investigation | Mitigate | Not Yet Started | 2024-04-12
Establish automation requiring CP/CPS review for all new feature flags | Prevent | Not Yet Started | 2024-04-19

Appendix

Details of affected certificates

Links to the 9 unexpired unrevoked certs which have been revoked are below:

Establish automation requiring CP/CPS review for all new feature flags

It seems like re-reviewing the CA's CP and CPS on every "feature flag" is a recipe for that task to become a rubber stamp, unless "feature flag" changes are very rare. Maybe you should consider something more like publishing all of these policy decisions to a transparency database?

because some of those requests were for certificates that were issued in the 90 days prior to our earliest onsite logs, and we only log the public key at issuance time (not at revocation time).

This seems like a remediation item as well: you should have all your relevant data in each audit record you store.

"The request was from a Subscriber who controlled all of the identifiers in the certificate, so it was accepted and the certificate was revoked. Per our CPS, our CA software automatically then blocked that key and revoked all 134,405 other certificates which shared that key. This resulted in a different Subscriber being very understandably frustrated"

This makes it sound like there was a single key that was used by different Subscribers, and that a single key was used for >134K certs. Is there (or should there be) a condition alerting on e.g., keys being shared by different subscribers or being reused above a certain threshold?

It seems like re-reviewing the CA's CP and CPS on every "feature flag" is a recipe for that task to become a rubber stamp, unless "feature flag" changes are very rare.

Changes to our feature flags are quite rare in the grand scheme of things. We agree that solutions which impose additional human review steps are imperfect, but we don't currently have a better solution that we can commit to implementing. Even things like a "requirements traceability matrix" ultimately boil down to using human review to ensure that the requirements and artifacts identified in the matrix actually correspond.

This seems like a remediation item as well. That you should have all your relevant data in each audit record you store.

When we block a key from future issuance, our audit logs do include that key. These keys were not included in the revocation audit logs because the key was not relevant, as it was not being blocked.

This makes it sound like there was a single key that was used by different Subscribers

Not quite. As explained in the report, the revocation request came from a Subscriber who had demonstrated control over all of the identifiers in the to-be-revoked certificate. They were not the Subscriber who had originally requested the certificate in question, they had not included that public key in any certificate of their own, and there is no evidence that they ever possessed the private key.

and that a single key was used for >134K certs. Is there (or should there be) a condition alerting on e.g., keys being shared by different subscribers or being reused above a certain threshold?

Correct. We agree that it is not good hygiene for a single key to be shared across so many certificates. However, we cannot alert on a key being re-used across too many different accounts or certificates: doing so would be analogous to alerting on HTTP 4XX client errors. And as explained above, we have already implemented other mechanisms to prevent a non-demonstrated keyCompromise revocation request from resulting in a mass revocation event.

(In reply to Aaron Gable from comment #1)

What went well

  • ACME’s fully automated approach makes it possible for ACME clients to easily use a new key for every renewal, making key reuse relatively rare.
  • Our existing PMA document review caught the discrepancy.
  • Our incident response procedures kicked in quickly and effectively after PMA discovered the issue.
  • Our PMA procedures enabled us to quickly review and publish an updated CP/CPS.

What didn't go well

  • Our PMA document review did not detect the issue in the first review after the discrepancy was introduced, increasing the length of the incident period.

Your PMA document review process seems useful and probably something other CAs should emulate. Could you give some details of how the review is done? You mentioned that this wasn't caught at the first review after the incident started; when was that? How often do you do a PMA document review?

(In reply to Mathew Hodson from comment #5)

Hi Mathew, addressing your questions in reverse order:

How often do you do a PMA document review?

Quarterly (i.e. once each quarter, but not necessarily exactly 3 months apart).

You mentioned that this wasn't caught at the first review after the incident started; when was that?

The first PMA review after the start of the incident was 2022-09-28, coincidentally just one week later.

Could you give some details of how the review is done?

The PMA meets via video chat, and covers many topics other than the CP/CPS. For that document specifically, the members of the group each individually read the entire document and take notes. Once everyone has completed their read-through, the group compares notes and files tickets for any desired updates or changes.

Your PMA document review process seems useful and probably something other CAs should emulate.

To be honest, we don't think it is a wonderful process. As demonstrated in this incident, our PMA review process is capable of missing discrepancies like this one. This is why our root cause analysis and final remediation item focus on performing more narrowly-focused review of the CP/CPS at shorter intervals (when feature flags are introduced), rather than performing whole-document review on a longer time scale.

In fact, if any other CAs have better processes for ensuring correspondence between behavior documented in their CPS and actual CA practices, we'd love to hear those suggestions.

We have completed our investigation of the offsite archived audit logs. Below is an update to our timeline and action items. No additional revocations were necessary.

Timeline

All times are UTC.

2024-04-01

  • 19:29 Archived audit logs on magnetic storage media reconnected to log host

2024-04-02

  • 16:23 All relevant log files successfully restored from tapes

2024-04-03

  • 16:50 Investigation of restored audit logs identifies 493 additional revocation requests which were not processed per our CP/CPS, and 523 additional keys which should have been blocked as a result of such revocation requests

2024-04-04

  • 22:47 All 523 keys are blocked (0 additional certificates revoked)

Action Items

Remediation | Kind | Status | Date
Update CP/CPS section 4.9.3 | Prevent | Complete | 2024-03-22
Block all keys which should have been blocked per our CPS | Mitigate | Complete | 2024-04-04
Revoke all certificates which should have been revoked or blocked from issuance per our CPS | Mitigate | Complete | 2024-04-04
Increase onsite log target retention period to ease investigation | Mitigate | In Progress | 2024-04-12
Establish automation requiring CP/CPS review for all new feature flags | Prevent | Not Yet Started | 2024-04-19

We intend to provide our next update on or before 2024-04-12.

We have completed the action item to increase our onsite log target retention period, which would have eased this investigation had it been in place at the time. Below are our updated action items.

Action Items

Remediation | Kind | Status | Date
Update CP/CPS section 4.9.3 | Prevent | Complete | 2024-03-22
Block all keys which should have been blocked per our CPS | Mitigate | Complete | 2024-04-04
Revoke all certificates which should have been revoked or blocked from issuance per our CPS | Mitigate | Complete | 2024-04-04
Increase onsite log target retention period to ease investigation | Mitigate | Complete | 2024-04-12
Establish automation requiring CP/CPS review for all new feature flags | Prevent | Not Yet Started | 2024-04-19

We intend to provide our next update on or before 2024-04-19.

We have completed the action item to establish automation requiring CP/CPS review when new feature flags are introduced.

Remediation | Kind | Status | Date
Update CP/CPS section 4.9.3 | Prevent | Complete | 2024-03-22
Block all keys which should have been blocked per our CPS | Mitigate | Complete | 2024-04-04
Revoke all certificates which should have been revoked or blocked from issuance per our CPS | Mitigate | Complete | 2024-04-04
Increase onsite log target retention period to ease investigation | Mitigate | Complete | 2024-04-12
Establish automation requiring CP/CPS review for all new feature flags | Prevent | Complete | 2024-04-12

With this, all of our remediation items are complete, and we do not intend to provide any further updates. If there are no further questions, we ask that this ticket be closed.

Increase onsite log target retention period to ease investigation

How long has this been increased to?

What caused the 24 hour delay from knowing the keys that need to be blocked to blocking them?

(In reply to amir from comment #10)

How long has this been increased to?

918 days (two years plus two three-month issuance cycles: 2 × 366 + 2 × (3 × 31) = 732 + 186 = 918, where a month is 31 days and a year is 366 days).

What caused the 24 hour delay from knowing the keys that need to be blocked to blocking them?

Each stage of the log investigations was executed overnight (note that 16:50 UTC is 9:50am Pacific). We used the following day to confirm the findings (ensuring that the numbers made sense based on other indicators, were reproducible, etc.). We also used that time to confirm that blocking these keys would not result in a mass revocation event like the one on 2021-10-04. Once we had completed that investigation, we added the keys to the blockedKeys table, and allowed our automated systems to determine exactly which unexpired certificates were affected (it turned out to be 0) and revoke them if necessary.

Thank you for the update! I have no additional questions.

If there are no other questions, then I will close this on or about Wed. 17-Apr-2024.

Flags: needinfo?(bwilson)
Status: ASSIGNED → RESOLVED
Closed: 8 months ago
Flags: needinfo?(bwilson)
Resolution: --- → FIXED