Closed Bug 1693343 Opened 3 years ago Closed 3 years ago

DigiCert: Failure to find and revoke key-compromised certificates within 24 hours

Categories

(CA Program :: CA Certificate Compliance, task)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: rob, Assigned: jeremy.rowley)

Details

(Whiteboard: [ca-compliance] [leaf-revocation-delay])

I recently found some private keys that correspond to leaf certificates issued by DigiCert, and I wanted to report them to DigiCert. DigiCert's CPS (https://www.digicert.com/wp-content/uploads/2020/10/DigiCert-CPS-V.5.4.1.pdf) says:

As of October 15th, 2020 to request that one or more certificates be revoked due to key compromise, the request must be submitted via http://problemreport.digicert.com/key-compromise providing all of the necessary information as outlined in section 4.9.

I browsed to http://problemreport.digicert.com/key-compromise, which redirected to https://problemreport.digicert.com/key-compromise. Under "Report a compromised private key", I clicked "Use this form". I was then prompted to provide either (1) a Private Key or (2) a CSR containing "CN=Proof of Key Compromise for DigiCert".

I tried submitting several private keys in PKCS#1 format, one private key in PKCS#8 format, and a couple of CSRs containing "CN=Proof of Key Compromise for DigiCert". After each submission, I received the following email (with a different case number each time) from DigiCert:

Thank you for submitting a report of suspected key compromise via DigiCert’s Compromised Key Service. Your case number is xxxx-xxxx-xxxx.

We have added the confirmed compromised key(s) to our blocklist, and we have scanned our database in search of valid certificates using the key(s). We did not find any matching certificates.

To submit another report of key compromise, visit https://problemreport.digicert.com.

If you have any questions or believe our findings are incorrect, please email us at revoke@digicert.com.

Thank you,
The DigiCert Team

BR 4.9.1.1 says:
"The CA SHALL revoke a Certificate within 24 hours if ... The CA obtains evidence that the Subscriber’s Private Key corresponding to the Public Key in the Certificate suffered a Key Compromise;"

Revocation did not occur within 24 hours. "We did not find any matching certificates" seems to point to the root cause.
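BR 4.9.1.1's 24-hour window is simple arithmetic, but it is easy to get wrong when timestamps lack a time zone. The Python sketch below computes the deadline from the moment evidence is obtained and checks whether a revocation was late. The function names `revocation_deadline` and `is_revocation_late` are hypothetical and not part of any DigiCert or Mozilla tooling, and the timestamps in the example are made up.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# BR 4.9.1.1: revoke within 24 hours of obtaining evidence of Key Compromise.
REVOCATION_WINDOW = timedelta(hours=24)


def revocation_deadline(evidence_received: datetime) -> datetime:
    """Return the latest time by which the certificate must be revoked."""
    if evidence_received.tzinfo is None:
        # Naive datetimes are ambiguous across data centers and time zones.
        raise ValueError("evidence_received must be timezone-aware")
    return evidence_received + REVOCATION_WINDOW


def is_revocation_late(evidence_received: datetime,
                       revoked_at: Optional[datetime],
                       now: datetime) -> bool:
    """True if the revocation missed (or has already missed) the deadline."""
    deadline = revocation_deadline(evidence_received)
    if revoked_at is None:
        # Still unrevoked: late as soon as the current time passes the deadline.
        return now > deadline
    return revoked_at > deadline


# Hypothetical example: evidence received at 09:00 UTC, still unrevoked days later.
received = datetime(2021, 2, 10, 9, 0, tzinfo=timezone.utc)
print(revocation_deadline(received))  # 2021-02-11 09:00:00+00:00
print(is_revocation_late(received, None, received + timedelta(days=7)))  # True
```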

Whiteboard: [ca-compliance]

Hey Rob - can you send me the CSR that you submitted?

Assignee: bwilson → jeremy.rowley
Status: NEW → ASSIGNED

Hi Jeremy. I just emailed revoke@digicert.com (with you on CC) to provide some of the case numbers. Will that suffice?

Thanks Rob - that was great. We discovered the issue and are working on a resolution. I'll post a preliminary incident report this week.

Flags: needinfo?(jeremy.rowley)
  1. How your CA first became aware of the problem (e.g. via a problem report submitted to your Problem Reporting Mechanism, a discussion in mozilla.dev.security.policy, a Bugzilla bug, or internal self-audit), and the time and date.

Rob Stradling posted an incident report on Feb 17, 2021. He submitted several keys to the key compromise reporting tool. The tool failed to find any certificates associated with the keys and reported back to Rob that no matching certificates were found, so none of the certificates corresponding to the submitted keys were revoked.

  2. A timeline of the actions your CA took in response. A timeline is a date-and-time-stamped sequence of all relevant events. This may include events before the incident was reported, such as when a particular requirement became applicable, or a document changed, or a bug was introduced, or an audit was done.

Feb 6, 2021 - DigiCert migrated our CA to a new data center. As part of the migration, applications connecting to the CA required configuration variables to be updated to point to the CA’s new URL. This configuration was not updated and caused a connectivity issue between the CA and the revocation service.
Feb 17, 2021 - Rob posts on the bug about the failures and provides the relevant case numbers for investigation.
Feb 18, 2021 – The team corrects the URL and deploys the change. We identify certificates that should have been revoked and revoke them. We also add an alert for issues connecting to the CA.
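The timeline traces the outage to a configuration variable that still pointed at the CA's old URL, which went unnoticed until an outside report. A periodic connectivity probe with an alert threshold is one common way to surface that class of failure quickly. This Python sketch is illustrative only: `ConnectivityMonitor`, the `probe` callable, and the threshold of three consecutive failures are assumptions, not DigiCert's implementation.

```python
from typing import Callable, List


class ConnectivityMonitor:
    """Hypothetical health check that alerts when the CA cannot be reached.

    `probe` is any callable that raises when the CA endpoint is unreachable
    (e.g. a request to the configured CA URL); `alert` notifies the team.
    """

    def __init__(self, probe: Callable[[], None], alert: Callable[[str], None],
                 failure_threshold: int = 3) -> None:
        self.probe = probe
        self.alert = alert
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    def run_once(self) -> bool:
        """Run one probe; return True if the CA was reachable."""
        try:
            self.probe()
        except Exception as exc:
            self.consecutive_failures += 1
            # Alert once when the threshold is crossed, not on every failure.
            if self.consecutive_failures == self.failure_threshold:
                self.alert(f"CA unreachable {self.consecutive_failures} "
                           f"times in a row: {exc}")
            return False
        self.consecutive_failures = 0  # reachable again: reset the counter
        return True


# Hypothetical usage: a probe that always fails, as with a stale CA URL.
def stale_url_probe() -> None:
    raise ConnectionError("CA endpoint not reachable at configured URL")


alerts: List[str] = []
monitor = ConnectivityMonitor(stale_url_probe, alerts.append)
for _ in range(5):
    monitor.run_once()
print(len(alerts))  # 1: one alert, not one per failed probe
```

Alerting on a run of consecutive failures rather than on every single one trades a few minutes of detection latency for fewer false alarms from transient network blips.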
  3. Whether your CA has stopped, or has not yet stopped, issuing certificates with the problem. A statement that you have will be considered a pledge to the community; a statement that you have not requires an explanation.

We corrected the connectivity issue between the key compromise service and the CA. We pulled the list of key compromise reports from Splunk and revoked all associated certificates. We’ve also fixed the alerts from Splunk to the team about connectivity issues.
  4. A summary of the problematic certificates. For each problem: number of certs, and the date the first and last certs with that problem were issued.
Eight certificates should have been revoked as a result of certificate problem reports submitted between the URL change and when Rob reported the issue via Mozilla. We revoked all eight certificates. Each certificate was affected by the same failure: the connectivity issue with the CA caused zero certificates to be returned for revocation after the key compromise service had accepted evidence of key compromise.
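The core failure here is that a failed connection to the CA looked the same as a search that found nothing, so reporters were told "We did not find any matching certificates." A minimal Python sketch of keeping those two outcomes distinct follows. `query_ca`, `find_certificates_for_key`, `CALookupError`, `response_message`, and the wording of the failure message are hypothetical; a real service would also queue the report for retry.

```python
from typing import Callable, List


class CALookupError(Exception):
    """The CA could not be queried. This is NOT the same as 'no matches'."""


def find_certificates_for_key(spki_sha256: str,
                              query_ca: Callable[[str], List[str]]) -> List[str]:
    """Return serial numbers of valid certificates using the given key.

    `query_ca` is a hypothetical CA client call. Connectivity errors are
    re-raised as CALookupError instead of being turned into an empty list.
    """
    try:
        return query_ca(spki_sha256)
    except (ConnectionError, TimeoutError) as exc:
        raise CALookupError(f"CA lookup failed for key {spki_sha256}") from exc


def response_message(spki_sha256: str,
                     query_ca: Callable[[str], List[str]]) -> str:
    try:
        serials = find_certificates_for_key(spki_sha256, query_ca)
    except CALookupError:
        # Do not claim "no matches" when the search never completed.
        return "We could not complete the certificate search for this key."
    if not serials:
        return "We did not find any matching certificates."
    return "Certificates to be revoked within 24 hours: " + ", ".join(serials)
```

With this split, an unreachable CA produces a distinct outcome that can be alerted on and retried, while an empty result still means the search actually ran.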

  5. The complete certificate data for the problematic certificates. The recommended way to provide this is to ensure each certificate is logged to CT and then list the fingerprints or crt.sh IDs, either in the report or as an attached spreadsheet, with one list per distinct problem.
    https://crt.sh/?id=3380445754
    https://crt.sh/?id=3834096677
    https://crt.sh/?id=4022903877
    https://crt.sh/?id=2835520037
    https://crt.sh/?id=4022899715
    https://crt.sh/?id=3790750917
    https://crt.sh/?id=2693016376
    https://crt.sh/?id=1573644023

  6. Explanation about how and why the mistakes were made or bugs introduced, and how they avoided detection until now.
    During a data center move, one of the key compromise application's configuration variables was not updated, creating a connectivity issue between the CA and the revocation service. The revocation requests were logged in Splunk, but the notifications from Splunk to the engineering team were not configured properly. Given the low volume of issues, the connectivity errors raised by the system went unnoticed. We’re reviewing the entire workflow to identify other potential gaps in alerting. Before the incident, we were already working on this and on improving the queuing mechanism; both will help us recover from problems like this in a more automated fashion. We are scheduled to complete this work by March 5th.
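An improved queuing mechanism helps with failures like this one when each submission is persisted before processing and marked done only after processing succeeds, so an unreachable CA leaves work pending rather than lost. The sketch below uses Python's built-in `sqlite3` purely as a stand-in durable store; the report does not say what queuing system DigiCert uses, and `SubmissionQueue` and its schema are hypothetical.

```python
import sqlite3
from typing import Callable


class SubmissionQueue:
    """Hypothetical durable queue for key compromise submissions."""

    def __init__(self, path: str = ":memory:") -> None:
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS submissions ("
            " id INTEGER PRIMARY KEY AUTOINCREMENT,"
            " evidence TEXT NOT NULL,"
            " attempts INTEGER NOT NULL DEFAULT 0,"
            " done INTEGER NOT NULL DEFAULT 0)"
        )
        self.db.commit()

    def enqueue(self, evidence: str) -> int:
        """Persist a submission before any processing is attempted."""
        cur = self.db.execute(
            "INSERT INTO submissions (evidence) VALUES (?)", (evidence,))
        self.db.commit()
        return cur.lastrowid

    def process_pending(self, handler: Callable[[str], None]) -> int:
        """Try every pending submission once; return how many succeeded."""
        pending = self.db.execute(
            "SELECT id, evidence FROM submissions WHERE done = 0 ORDER BY id"
        ).fetchall()
        succeeded = 0
        for row_id, evidence in pending:
            try:
                handler(evidence)
            except Exception:
                # Leave the row pending for a later retry; count the attempt.
                self.db.execute(
                    "UPDATE submissions SET attempts = attempts + 1 WHERE id = ?",
                    (row_id,))
            else:
                self.db.execute(
                    "UPDATE submissions SET done = 1, attempts = attempts + 1"
                    " WHERE id = ?", (row_id,))
                succeeded += 1
            self.db.commit()
        return succeeded

    def pending_count(self) -> int:
        return self.db.execute(
            "SELECT COUNT(*) FROM submissions WHERE done = 0").fetchone()[0]
```

For example, if the handler raises `ConnectionError` while the CA is unreachable, `process_pending` returns 0 and both submissions remain pending; once connectivity is restored, the next call processes them.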

One issue is that our CPS lacked clear instructions on how to submit key compromise reports if the system is unavailable. Although the CPS does include contact information (as does the website), it specifies that all reports must be submitted through the tool. We’re adding language to our website and the CPS to clarify that, if there is an outage or there are questions about the tool, we accept reports through a support email address. This language already exists in the email response to a query for active certificates with a compromised key:
“If you have any questions or believe our findings are incorrect, please email us at revoke@digicert.com.” The language we are adding will be similar to this. We did not receive any emails from key compromise reporters in response to the system’s failure to find certificates.

  7. List of steps your CA is taking to resolve the situation and ensure such issuance will not be repeated in the future, accompanied with a timeline of when your CA expects to accomplish these things.

We’re currently reviewing the entire workflow to identify additional potential gaps in our alert systems across the CA services. Prior to this incident, we’d already been working on improved queuing mechanisms, which will help us recover from problems like this more systematically. This should be completed by March 5th.

As mentioned above, we’re also working on a CPS update to clarify how to report this information in case of a system issue. The information is provided in the response email, but not the CPS.

Thanks Jeremy. I can confirm that your compromised key reporting tool does seem to be fixed. I just submitted another DigiCert subscriber private key to https://problemreport.digicert.com/key-compromise/report; the automated response email said "Valid DigiCert certificates using the compromised key(s) will be revoked within 24 hours" and quoted the certificate serial number that I expected to see.

We're working on the CPS update. A new one should be published before the end of March. We're making several updates (not related to this bug) in addition to adding the backup reporting mechanism.

We've already deployed the fix to get notifications from Splunk for outages related to this service. During our investigation of similar potential issues, we noticed a few other health checks we needed to add.

  1. We added an updated health check for whenever the CA cannot communicate with the database.
  2. We added a high-speed queuing system to ingest submissions and ensure they are not lost even if faults occur elsewhere.
  3. We added additional logging and notification for logic paths when an error occurs in:
    a) Evidence submission
    b) Evidence processing
    c) Contact with the CA service for information retrieval and certificate revocation
    d) Storage of evidence and revocation information within the database
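The additions above amount to "log and notify on every error path." A decorator is one compact way to apply that uniformly to each stage (submission, processing, CA contact, storage). This Python sketch uses the standard `logging` module; `monitored_stage`, the stage labels, and the `notify` callback are hypothetical stand-ins for DigiCert's actual Splunk-based alerting, which the report does not describe in detail.

```python
import functools
import logging
from typing import Callable, List

logger = logging.getLogger("key_compromise")  # hypothetical logger name


def monitored_stage(stage: str, notify: Callable[[str], None]):
    """Decorator: on any error in the named stage, log it, notify, re-raise."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception as exc:
                logger.exception("stage %s failed", stage)
                notify(f"[{stage}] {type(exc).__name__}: {exc}")
                # Re-raise so the caller cannot mistake the failure for success.
                raise
        return wrapper
    return decorator


# Hypothetical usage for two of the stages listed above.
notifications: List[str] = []


@monitored_stage("ca_contact", notifications.append)
def revoke_on_ca(serial: str) -> None:
    raise ConnectionError("CA endpoint unreachable")


@monitored_stage("storage", notifications.append)
def store_evidence(evidence: str) -> str:
    return f"stored {evidence}"
```

Re-raising after notifying matters here: swallowing the exception is exactly how a connectivity failure can turn into a silent "no certificates found."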

That should cover the critical comms path. We're expecting to deploy the code changes in the next sprint, meaning we should have full remediation (and then some) in two weeks.

We've deployed the code referenced in the previous comment and updated our CPS to include the instructions already in the service. Any additional questions before we close this bug?

Flags: needinfo?(jeremy.rowley) → needinfo?(bwilson)

I will schedule this to be closed on or about 19-March-2021.

Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Flags: needinfo?(bwilson)
Resolution: --- → FIXED
Product: NSS → CA Program
Whiteboard: [ca-compliance] → [ca-compliance] [leaf-revocation-delay]