Closed Bug 1639794 Opened 5 years ago Closed 5 years ago

Let's Encrypt: Failure to revoke key-compromised certificate within 24 hours

Categories

(CA Program :: CA Certificate Compliance, task)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mpalmer, Assigned: jsha)

Details

(Whiteboard: [ca-compliance] [leaf-revocation-delay])

Steps to reproduce:

At 2020-05-05 06:02:23 UTC, a certificate problem report was delivered to aspmx.l.google.com (108.177.15.27) on behalf of cert-prob-reports@letsencrypt.org, stating that a private key with SPKI 416b41f2c9f1dfab97b952c4bda877dba9775334eddc643cc44d3fb927f671f0 had been compromised, and requesting revocation of all certificates issued by Let's Encrypt using that SPKI. The URL of a CSR attesting to the compromise of the private key, signed by the compromised private key, was provided.

Actual results:

Revocation was effected at 2020-05-06 19:23:39 UTC, based on the timestamp contained within signed OCSP responses for certificates using the specified SPKI.

Expected results:

The certificate should have been revoked within 24 hours of the certificate problem report being sent.

Type: defect → task
Assignee: bwilson → jsha
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Whiteboard: [ca-compliance]

Thanks for the report. We'll investigate and provide a report shortly.

Summary:

Let’s Encrypt received a report of a compromised private key and associated valid certificates at our cert-prob-reports@letsencrypt.org email address. Let’s Encrypt staff processed the report within 24 hours but missed a step in the procedure, so the certificate was not actually revoked. The key had been added to our blocked-keys list, so the next day the certificate was found and revoked as part of our sub-daily remediation check, as described in https://bugzilla.mozilla.org/show_bug.cgi?id=1625322

Incident Report:

1. How your CA first became aware of the problem.

  • On 2020-05-21: Bugzilla bug id=1639794 was assigned to a member of Let’s Encrypt for review.

2. A timeline of the actions your CA took in response.

  • 2020-05-05T06:02Z: Report received via cert-prob-reports email.
  • 2020-05-05T18:00Z: Let’s Encrypt staff responded that the key had been blocked and the certificate associated with that key revoked. The manual process for revocation was incorrectly followed at this point.
  • 2020-05-06T19:23Z: In a routine daily check, Let’s Encrypt staff found and revoked a certificate matching a key listed in the blocked keys list. There was nothing abnormal about this; it had been expected that, until automation was in place, certificates revoked via the API with reason keyCompromise would occasionally trigger the need to revoke additional certificates.
  • Time to revocation from initial cert-prob-reports email: 37h 21m
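
The elapsed-time figure above can be re-derived from the two timestamps given in the original report. The following is a minimal sketch using only the Python standard library; the helper name hours_minutes is illustrative and not part of any Let's Encrypt tooling.

```python
from datetime import datetime, timedelta, timezone

# Timestamps taken from the original report in this bug (both UTC).
reported_at = datetime(2020, 5, 5, 6, 2, 23, tzinfo=timezone.utc)
revoked_at = datetime(2020, 5, 6, 19, 23, 39, tzinfo=timezone.utc)

# The 24-hour revocation window at issue in this bug.
DEADLINE = timedelta(hours=24)

elapsed = revoked_at - reported_at
overrun = elapsed - DEADLINE


def hours_minutes(delta: timedelta) -> str:
    """Format a non-negative timedelta as 'Hh MMm', dropping leftover seconds."""
    total_minutes = int(delta.total_seconds()) // 60
    return f"{total_minutes // 60}h {total_minutes % 60:02d}m"


print(hours_minutes(elapsed))  # 37h 21m
print(hours_minutes(overrun))  # 13h 21m
```

The second value is how far past the 24-hour window the revocation landed.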

3. Whether your CA has stopped, or has not yet stopped, issuing certificates with the problem.

  • The key was blocked for future certificate issuance within 24 hours of when the original key compromise report was delivered. This prevented any new certificates from being issued with the problem.
  • The certificate with the same SubjectPublicKeyInfo as the report was revoked at 37 hours and 21 minutes after the report.

4. A summary of the problematic certificates.

  • One certificate issued on 2020-04-21 was affected.

5. The complete certificate data for the problematic certificates.

6. Explanation about how and why the mistakes were made or bugs introduced, and how they avoided detection until now.

  • Although the manual revocation procedure had been successfully completed many times before and after this instance, the procedure being used at the time of the report was complex, and a human error in the process caused two of the steps to be missed: the actual revocation and the verification of the revocation.

7. List of steps your CA is taking to resolve the situation and ensure such issuance will not be repeated in the future, accompanied with a timeline of when your CA expects to accomplish these things.

  • At the time of the report the process to manually revoke required multiple steps.
    • Verify that the CSR proof of key compromise matches the SPKI hash of certificates targeted for revocation.
    • Create, merge, and deploy pull request to our blocked keys file managed in configuration management.
    • Search the database for any other active certificates matching the reported SPKI hash that may have been issued up to the point of deploying blocked keys file.
    • Search the database for any subscriber email addresses for detected certificates needing revocation.
    • Revoke any certificate(s).
    • Verify revocation(s).
    • Send emails to subscribers (if contact information was provided).
    • Respond to the email report confirming revocation had been completed.
  • On 2020-05-19 we deployed a Boulder update into production which handles many of the manual steps listed above. The remaining manual process for a cert-prob-report email is as follows:
    • Verify the CSR was signed with the same private key as the certificates targeted for revocation.
    • Revoke any certificate(s).
    • Verify revocation(s).
    • Respond to the email report confirming revocation had been completed.
  • In the current procedure all the steps we were previously doing are automated. In addition, we have automated the process of key blocking, subscriber email, and revoking of additional certificates matching the SPKI when the revocation occurs via our ACME API. We realize that we still need to support revocation via manual reports to our cert-prob-reports email address, and our process has been simplified to help mitigate human error.
  • We have audited all keyCompromise reports to find any additional certificates that were not revoked within 24 hours of being reported. We found none that have not been already disclosed in this or a previous Bugzilla report.
  • We have updated our CPS to clearly state that keys will be blocked and affected certificates revoked when the ACME API is used to revoke certificates with reason keyCompromise. This should reduce the risk of errors from the manual process by driving a greater fraction of key compromise reports to our automated process. https://letsencrypt.org/documents/isrg-cps-v2.8/#493-procedure-for-revocation-request
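
The SPKI-matching part of the first manual step listed above can be sketched as follows. This is an illustration only, not Let's Encrypt's tooling. It assumes that the "SPKI hash" is the lowercase hex SHA-256 digest of the DER-encoded SubjectPublicKeyInfo (the 64-character value in the original report is consistent with that, but the thread does not say so explicitly), and that the SubjectPublicKeyInfo bytes have already been extracted from the CSR and from the certificate by some other tool. Checking the CSR's signature is a separate step and is not shown. Both function names are hypothetical.

```python
import hashlib


def spki_sha256_hex(spki_der: bytes) -> str:
    """Return the hex SHA-256 digest of DER-encoded SubjectPublicKeyInfo bytes."""
    return hashlib.sha256(spki_der).hexdigest()


def matches_reported_spki(spki_der: bytes, reported_hash_hex: str) -> bool:
    """Compare a key's SPKI digest with the hash quoted in a problem report.

    The reported value is normalised (whitespace stripped, lowercased) so that
    copy-and-paste differences in an email do not cause a false mismatch.
    """
    return spki_sha256_hex(spki_der) == reported_hash_hex.strip().lower()


# Placeholder bytes stand in for a real SubjectPublicKeyInfo (assumption).
csr_spki = b"placeholder SubjectPublicKeyInfo bytes"
cert_spki = b"placeholder SubjectPublicKeyInfo bytes"
reported = spki_sha256_hex(csr_spki).upper() + "\n"

print(matches_reported_spki(cert_spki, reported))
```

A certificate would be targeted for revocation only if both the CSR's key and the certificate's key hash to the reported value.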

(In reply to Andrew Gabbitas from comment #2)

> 6. Explanation about how and why the mistakes were made or bugs introduced, and how they avoided detection until now.
>
>   • Although the manual revocation procedure had been successfully completed many times before and after this instance, the procedure being used at the time of the report was complex and a human error in the process caused two of the steps to be missed. (The actual revocation and verification of the revocation).
>
> (snip)
>
>   • On 2020-05-19 we deployed a Boulder update into production which handles many of the manual steps listed above. The remaining manual process for a cert-prob-report email is as follows:
>     • Verify the CSR was signed with the same private key as the certificates targeted for revocation.
>     • Revoke any certificate(s).
>     • Verify revocation(s).
>     • Respond to the email report confirming revocation had been completed.

From what I can see, it seems that the two steps that weren't followed that caused this violation are still manual in the revised procedure. Are there any additional controls that have been (or are planned to be) implemented to compensate for the remaining potential for human error?

> From what I can see, it seems that the two steps that weren't followed that caused this violation are still manual in the revised procedure. Are there any additional controls that have been (or are planned to be) implemented to compensate for the remaining potential for human error?

By reducing the number of steps in processing key compromise emails we are reducing the chance that a staff member can leave a revocation in a “half-done” state, as was the case here.

In addition, we have modified our verification procedure to include the output of an OCSP response of the revoked certificate(s) in our internal ticketing system before considering the process complete.

We now encourage folks to use the API for keyCompromise revocations, if possible, to relieve our security team of the burden of manual revocations via reports to our cert-prob-reports email address.

Is the requirement for an OCSP response indicating that the certificate is revoked a technical control, or an administrative or procedural one? It would certainly remove a lot of potential human error if the ticketing system refused to allow a problem report to be closed as "completed" without an OCSP response "in hand", as it were (/me glances meaningfully at other CAs).
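
To make the distinction concrete, a technical control of the kind described above might look like the following sketch: a ticket object that refuses to be closed until a "revoked" OCSP result has been recorded for every certificate it covers. Everything here is hypothetical (the class names, the fields, and the idea that statuses are entered as simple enum values); it does not describe any CA's actual ticketing system, and a real control would need to verify a signed OCSP response rather than trust a recorded status.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class CertStatus(Enum):
    GOOD = "good"
    REVOKED = "revoked"
    UNKNOWN = "unknown"


@dataclass
class ProblemReport:
    """Hypothetical ticket covering one or more certificate serials."""
    serials: List[str]
    ocsp_evidence: Dict[str, CertStatus] = field(default_factory=dict)
    closed: bool = False

    def attach_ocsp_result(self, serial: str, status: CertStatus) -> None:
        """Record the status taken from an OCSP response for one serial."""
        if serial not in self.serials:
            raise ValueError(f"serial {serial} is not part of this report")
        self.ocsp_evidence[serial] = status

    def close_as_completed(self) -> None:
        """Close the ticket only if every serial has 'revoked' evidence."""
        missing = [
            s for s in self.serials
            if self.ocsp_evidence.get(s) is not CertStatus.REVOKED
        ]
        if missing:
            raise RuntimeError(f"no 'revoked' OCSP evidence for: {missing}")
        self.closed = True


report = ProblemReport(serials=["serial-a"])
try:
    report.close_as_completed()
except RuntimeError as err:
    print("refused:", err)

report.attach_ocsp_result("serial-a", CertStatus.REVOKED)
report.close_as_completed()
print("closed:", report.closed)
```

With a gate like this, a revocation left "half-done" would leave the ticket open instead of being reported as complete.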

> Is the requirement for an OCSP response indicating that the certificate is revoked a technical control, or an administrative or procedural one? It would certainly remove a lot of potential human error if the ticketing system refused to allow a problem report to be closed as "completed" without an OCSP response "in hand", as it were

Adding the OCSP response to our issue tracking system is a management/administrative control. Our ticketing system is not enforcing this property.

We’re focusing on driving users toward the ACME API, which allows us to apply a more robust set of technical controls. To that end, we've built a technical control into our Boulder CA software that completely removes human involvement on our end. We welcome all technically capable users to utilize this feature of our ACME API described below.

When a user revokes a certificate with reason keyCompromise, Boulder now ensures the following: any certificates matching the key are revoked, the key is blocked for future issuance, the subscriber receives an email notification (provided they supplied an email address), and a success/failure message is displayed to the reporter. These changes were introduced in https://github.com/letsencrypt/boulder/pull/4788. We address this procedure in section 4.9.3 of version 2.8 of our CPS: https://letsencrypt.org/documents/isrg-cps-v2.8/#493-procedure-for-revocation-request.
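
The four outcomes listed above can be illustrated with a toy in-memory model. This is a sketch under stated assumptions, not Boulder's implementation (Boulder is written in Go, and the linked pull request is the authoritative source). The class names, fields, message strings, and the order in which the actions run are all invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Cert:
    serial: str
    spki_hash: str
    subscriber_email: Optional[str] = None
    revoked: bool = False


@dataclass
class ToyCA:
    """Hypothetical stand-in for a CA's certificate store and blocklist."""
    certs: List[Cert]
    blocked_keys: Set[str] = field(default_factory=set)
    outbox: List[str] = field(default_factory=list)

    def revoke_for_key_compromise(self, serial: str) -> str:
        """Revoke one certificate and apply the key-wide follow-up actions."""
        target = next((c for c in self.certs if c.serial == serial), None)
        if target is None:
            return "failure: unknown serial"

        # Block the key so no further certificates can be issued for it.
        self.blocked_keys.add(target.spki_hash)

        # Revoke every unrevoked certificate that shares the compromised key.
        revoked_serials = []
        for cert in self.certs:
            if cert.spki_hash == target.spki_hash and not cert.revoked:
                cert.revoked = True
                revoked_serials.append(cert.serial)
                # Notify the subscriber only if an address was supplied.
                if cert.subscriber_email:
                    self.outbox.append(cert.subscriber_email)

        # Report the outcome back to whoever requested the revocation.
        return f"success: revoked {len(revoked_serials)} certificate(s)"


ca = ToyCA(certs=[
    Cert("s1", "key-1", "admin@example.com"),
    Cert("s2", "key-1"),
    Cert("s3", "key-2"),
])
print(ca.revoke_for_key_compromise("s1"))
```

In this model, revoking s1 also revokes s2 because both share key-1, while s3 is left untouched.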

> Adding the OCSP response to our issue tracking system is a management/administrative control. Our ticketing system is not enforcing this property.

That's disappointing. What exactly is the control, though? That is, what "closes the loop" on the addition of the OCSP response to the issue tracking system, such that the risk of a failure of the manual process is mitigated?

> We’re focusing on driving users toward the ACME API, which allows us to apply a more robust set of technical controls.

That's all well and good, but by analogy with "the Web PKI's weakest link is the least well-managed CA", a CA's revocation weakest link is the least well-managed method of reporting and processing certificate problem reports. So, it really doesn't matter that Let's Encrypt's ACME API is the bee's knees if manual problem report processing, which is certain to be a requirement for the foreseeable future, isn't capable of effectively managing revocation.

I have reviewed the incident report filed by Let's Encrypt and subsequent conversation with the reporter of this bug and believe this bug can be closed. I will close it on or after 10-July-2020 unless other questions or issues are raised.

Flags: needinfo?(bwilson)

Ben: I do want to register my concern about whether or not the proposed changes in Comment #6 meet Mozilla's requirements. It appears to be adding an explicit barrier to reporting key compromise, and an explicit disclaimer about the ability of Let's Encrypt to follow the BRs as required by 4.9.1.

Specifically, I have trouble squaring how this statement meets either Mozilla Policy or the BRs:

> All other requests for revocation must be made by emailing cert-prob-reports@letsencrypt.org. ISRG will respond to such requests within 24 hours, though an investigation into the legitimacy of the request may take longer.

Ryan makes a good point. After re-reading our v2.8 CPS update, it does read as though we exclude reporting keyCompromise via email to our cert-prob-reports email address. That was not our intent and we are in the process of updating our CPS to clarify. Our intent was to encourage the use of our API for these reports, but not to exclude reports via email.

It's not yet published as an official CPS update, but here is the relevant PR: https://github.com/letsencrypt/cp-cps/pull/27

> Ben: I do want to register my concern about whether or not the proposed changes in Comment #6 meet Mozilla's requirements. It appears to be adding an explicit barrier to reporting key compromise, and an explicit disclaimer about the ability of Let's Encrypt to follow the BRs as required by 4.9.1.
>
> Specifically, I have trouble squaring how this statement meets either Mozilla Policy or the BRs:
>
> > All other requests for revocation must be made by emailing cert-prob-reports@letsencrypt.org. ISRG will respond to such requests within 24 hours, though an investigation into the legitimacy of the request may take longer.

The change to our CPS is now live with the update to make it clear that reports via email will still be attended to.
https://cps.letsencrypt.org/

I'm satisfied with the revisions that Let's Encrypt has made to its CPS and the resolution of this bug.

Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Flags: needinfo?(bwilson)
Resolution: --- → FIXED
Product: NSS → CA Program
Whiteboard: [ca-compliance] → [ca-compliance] [leaf-revocation-delay]