Closed Bug 1624527 Opened 4 years ago Closed 4 years ago

DigiCert: Issuance of Cert with Compromised Key

Categories

(CA Program :: CA Certificate Compliance, task)

RESOLVED FIXED

People

(Reporter: jeremy.rowley, Assigned: jeremy.rowley)

Details

(Whiteboard: [ca-compliance] [ov-misissuance])

User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36

Steps to reproduce:

How your CA first became aware of the problem (e.g. via a problem report submitted to your Problem Reporting Mechanism, a discussion in mozilla.dev.security.policy, a Bugzilla bug, or internal self-audit), and the time and date.
We noticed the issue during an escape analysis after deploying a SEV1 storefront fix unrelated to validation. The issue was originally missed during testing: the patch applied to a storefront caused issuance to skip the new domain validation system if the certificate had never been seen before and was org validated. We originally thought the issue related to how domain validation evidence was stored, but during the investigation we realized that the storefront skipped domain validation. This led to mis-issuance of 123 OV certs and 36 EV certs. We have been monitoring certificate issuance for problems like this since we deployed the domain consolidation, which is why we caught it during the escape analysis.

A timeline of the actions your CA took in response. A timeline is a date-and-time-stamped sequence of all relevant events. This may include events before the incident was reported, such as when a particular requirement became applicable, or a document changed, or a bug was introduced, or an audit was done.
2019-11-04 – A SEV1 outage was reported for a storefront. The fix was deployed after hours with targeted testing instead of regression testing. This is standard for our SEV1 issues. Unfortunately, the patch for the SEV1 caused an issue where the storefront sent the certificate information straight to issuance with the evidence of domain validation rather than to validation.
2019-11-07 – The problem was discovered during an escape analysis. From the initial investigation, it looked like the validation evidence storage was at issue. We rolled back the patch while investigating further.
2019-11-08 – We realized the issue was with domain validation but were not sure of the impact. We continued to investigate the certificates impacted and conditions for missing validation.
2019-11-11 – A final list of impacted certificates was reported, and an incident report was written.
2019-11-12 – All impacted certificates were revoked within 24 hours of knowing which certificates were impacted.

Whether your CA has stopped, or has not yet stopped, issuing certificates with the problem. A statement that you have will be considered a pledge to the community; a statement that you have not requires an explanation.
A deployment on November 7th, as detailed above, reverted the patch and removed the bad code.

A summary of the problematic certificates. For each problem: number of certs, and the date the first and last certs with that problem were issued.
123 OV and 36 EV certs. I’m working on getting crt.sh links and will post them as an attachment.

The complete certificate data for the problematic certificates. The recommended way to provide this is to ensure each certificate is logged to CT and then list the fingerprints or crt.sh IDs, either in the report or as an attached spreadsheet, with one list per distinct problem.
See above.

Explanation about how and why the mistakes were made or bugs introduced, and how they avoided detection until now.
We had a SEV1 issue that was escalated. The team fixed the issue after hours. We performed targeted testing but no regression testing on how the change would impact other systems. Unfortunately, the system impact was that the storefront started providing certificate requests for issuance, skipping the validation system. Despite having good unit tests on applications, we lack good cross-system automated tests, mostly because of the number of storefronts.

List of steps your CA is taking to resolve the situation and ensure such issuance will not be repeated in the future, accompanied with a timeline of when your CA expects to accomplish these things.
The immediate fix is to add a report to our canary platform that will identify issues in this integration point on an ongoing basis. This will provide alerts in an out of band process, while further system consolidations are performed that will provide even better testing around these integration points. The addition to our canary platform will be in place by 2019-11-16.

In addition, we need to provide better automated system tests. These are more complicated because of the number of storefronts, but we plan to work on them more in parallel with the system shut down.

Actual results:

  1. How your CA first became aware of the problem (e.g. via a problem report submitted to your Problem Reporting Mechanism, a discussion in mozilla.dev.security.policy, a Bugzilla bug, or internal self-audit), and the time and date.

On March 23, 2020, Matt Palmer posted to the Mozilla forum alerting us that a cert had been issued with a compromised private key.

  2. A timeline of the actions your CA took in response. A timeline is a date-and-time-stamped sequence of all relevant events. This may include events before the incident was reported, such as when a particular requirement became applicable, or a document changed, or a bug was introduced, or an audit was done.

The timeline is:
2020-03-20 02:05:49 UTC - Matt reports SPKI 4310b6bc0841efd7fcec6ba0ed1f36e7a28bf9a707ae7f7771e2cd4b6f31b5af, associated with https://crt.sh/?id=1760024320, as compromised
2020-03-20 06:30 UTC - DigiCert kicks off a scan to identify all certs associated with the SPKI
2020-03-21 01:56:31 UTC - DigiCert issues https://crt.sh/?id=2606438724 with that same SPKI
2020-03-21 02:09:12 UTC - DigiCert revokes https://crt.sh/?id=1760024320
2020-03-22 05:07:12 UTC - Matt notifies DigiCert of https://crt.sh/?id=2606438724
2020-03-23 03:16:18 UTC - DigiCert revokes https://crt.sh/?id=2606438724

  3. Whether your CA has stopped, or has not yet stopped, issuing certificates with the problem. A statement that you have will be considered a pledge to the community; a statement that you have not requires an explanation.

The way our system works is that the keys are blacklisted upon revocation of a previous cert. So anything with the same keys would indeed be blocked. However, we have not stopped issuing certs with potentially compromised keys because we don't have a way to track keys independent of a certificate. We are working on figuring that out.
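
For illustration only, a minimal sketch of that check, assuming the Python `cryptography` package; the helper names are hypothetical and this is not our actual code. The idea is to hash the SubjectPublicKeyInfo of each incoming request and refuse issuance when that hash is already on the revocation-time blacklist.

    # Hypothetical sketch of a pre-issuance key-blacklist check.
    import hashlib

    from cryptography import x509
    from cryptography.hazmat.primitives.serialization import (
        Encoding, PublicFormat,
    )

    # Populated whenever a certificate is revoked for key compromise.
    blocked_spki_hashes: set[str] = set()


    def spki_sha256(public_key) -> str:
        """SHA-256 of the DER-encoded SubjectPublicKeyInfo (the value crt.sh shows)."""
        der = public_key.public_bytes(Encoding.DER, PublicFormat.SubjectPublicKeyInfo)
        return hashlib.sha256(der).hexdigest()


    def may_issue(csr_pem: bytes) -> bool:
        """Reject a CSR whose key matches a previously revoked (compromised) key."""
        csr = x509.load_pem_x509_csr(csr_pem)
        return spki_sha256(csr.public_key()) not in blocked_spki_hashes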

  4. A summary of the problematic certificates. For each problem: number of certs, and the date the first and last certs with that problem were issued.

There is one cert with the issue that we know of.
https://crt.sh/?id=2606438724

There could be others, but I'll need to figure out how to pull a cross-section between active certs and compromised keys to figure that out.

  5. The complete certificate data for the problematic certificates. The recommended way to provide this is to ensure each certificate is logged to CT and then list the fingerprints or crt.sh IDs, either in the report or as an attached spreadsheet, with one list per distinct problem.

https://crt.sh/?id=2606438724

  6. Explanation about how and why the mistakes were made or bugs introduced, and how they avoided detection until now.

Mistakes were made because we were only accepting cert problem reports for keys and blacklisting the keys when the cert is revoked for key compromise. The basic flow is: a) an email comes in to the revoke at digicert.com alias, b) the CSR provided is used to identify the certs requiring revocation, c) an email is kicked off to the cert subscriber notifying them of the revocation and that they need to change their key pair, and d) the certs are revoked (effectively adding them to a blacklist). The break here is part b, since the system doesn't do anything to prevent reuse of the keys until the revocation actually occurs.
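
To make the gap concrete, a hedged pseudocode sketch of that flow (all helper names are hypothetical; spki_sha256 and blocked_spki_hashes reuse the earlier sketch). The key only reaches the blacklist at step d), so nothing stops reuse of the key between a) and d):

    # Illustrative sketch of the reported-key flow a)-d); not actual DigiCert code.
    def handle_key_compromise_report(report):
        csr = report.csr                                       # a) report arrives via the revoke@ alias
        spki = spki_sha256(csr.public_key())
        for cert in find_certs_by_spki(spki):                  # b) CSR identifies the certs to revoke
            notify_subscriber(cert, reason="key compromise")   # c) subscriber told to re-key
            revoke(cert)                                       # d) revoke; only now does the key
            blocked_spki_hashes.add(spki)                      #    land on the blacklist
        # Gap: a brand-new order using the same key between a) and d) is still
        # accepted, because the blacklist is populated only at revocation time.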

This leaves two issues: 1) How do we track compromised keys before a cert is issued for it? and 2) How do we prevent a cert from being issued with a compromised key between the time when we scan our database for other certs with the same key and when we revoke?

  7. List of steps your CA is taking to resolve the situation and ensure such issuance will not be repeated in the future, accompanied with a timeline of when your CA expects to accomplish these things.

I don't have a good answer for #1 yet. For #2, the short-term solution is to scan the database both before and after we revoke the reported cert. The longer-term solution is to improve the process we have, turning the entire revocation event into an automated workflow. The short-term fix should allow us to revoke all certs issued between notification and revocation going forward until I can finish the automated method.
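
A minimal sketch of that short-term double scan (hypothetical helper names, reusing the SPKI helper above): scan once before revoking, revoke, then scan again and catch anything issued in the window.

    # Hypothetical sketch of the short-term "scan before and after" mitigation.
    def revoke_for_compromised_key(spki_hash: str):
        for cert in find_certs_by_spki(spki_hash):      # scan before revocation
            revoke(cert)
        for cert in find_certs_by_spki(spki_hash):      # scan again afterwards
            if not cert.is_revoked:
                revoke(cert)                            # catch certs issued in between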

The two risks I've identified that I need to solve in the longer-term solution are:

  1. How do we prevent essentially a DDOS by people mass uploading keys not associated with our certs?
  2. How do we search through all of the certs if a private key itself is provided? I think this problem is easier to solve.
Assignee: wthayer → jeremy.rowley
Status: UNCONFIRMED → ASSIGNED
Type: defect → task
Ever confirmed: true
Whiteboard: [ca-compliance]

Jeremy: It appears that the incident report for bug 1595921 is prepended to the actual report you intended to file? Or are these somehow inter-related?

Flags: needinfo?(jeremy.rowley)

(In reply to Jeremy Rowley from comment #0)

The two risks I've identified that i need to solve on the longer term solution are:

  1. How do we prevent essentially a DDOS by people mass uploading keys not associated with our certs?

Some of this was captured on the thread, but I'm not sure I understand this as a DDOS scenario?

If folks demonstrate compromise of private key, it seems like it's reasonable to add it to a blocklist. If you end up with a giant database of compromised keys, that... seems like a net-positive?

I think you've basically identified what sounds like a reasonable process:

  1. Receive notification of compromised private key
  2. Verify compromise
  3. Blocklist key from new issuance
  4. Scan database for any existing certificates with said key
  5. Notify subscribers
  6. Revoke

Where Steps 1&2 of that might be automated or manual (or, practically, a combination of both being valid paths), while Steps 3-6 can be entirely automated.

Is the worry about having humans in the loop for Step 1 causing the DoS, especially if Step 4 returns no certs (i.e. it's a key you've not dealt with)?

  2. How do we search through all of the certs if a private key itself is provided? I think this problem is easier to solve.

Could you be a bit more descriptive here? Which algorithms are you concerned about, and what key representation formats?

Re: Wayne's Comment. I'm not sure how the bugs got mixed. Copy/paste error? I was grabbing the section headings from the bug. The two are unrelated.

How your CA first became aware of the problem (e.g. via a problem report submitted to your Problem Reporting Mechanism, a discussion in mozilla.dev.security.policy, a Bugzilla bug, or internal self-audit), and the time and date.
On March 23, 2020, Matt Palmer posted to the Mozilla forum alerting us that a cert had been issued with a compromised private key.

A timeline of the actions your CA took in response. A timeline is a date-and-time-stamped sequence of all relevant events. This may include events before the incident was reported, such as when a particular requirement became applicable, or a document changed, or a bug was introduced, or an audit was done.
The timeline is:
2020-03-20 02:05:49 UTC - Matt reports SPKI 4310b6bc0841efd7fcec6ba0ed1f36e7a28bf9a707ae7f7771e2cd4b6f31b5af, associated with https://crt.sh/?id=1760024320, as compromised
2020-03-20 06:30 UTC - DigiCert kicks off a scan to identify all certs associated with the SPKI
2020-03-21 01:56:31 UTC - DigiCert issues https://crt.sh/?id=2606438724 with that same SPKI
2020-03-21 02:09:12 UTC - DigiCert revokes https://crt.sh/?id=1760024320
2020-03-22 05:07:12 UTC - Matt notifies DigiCert of https://crt.sh/?id=2606438724
2020-03-23 03:16:18 UTC - DigiCert revokes https://crt.sh/?id=2606438724

Whether your CA has stopped, or has not yet stopped, issuing certificates with the problem. A statement that you have will be considered a pledge to the community; a statement that you have not requires an explanation.
The way our system works is that the keys are blacklisted upon revocation of a previous cert. So anything with the same keys would indeed be blocked. However, we have not stopped issuing certs with potentially compromised keys because we don't have a way to track keys independent of a certificate. We are working on figuring that out.

A summary of the problematic certificates. For each problem: number of certs, and the date the first and last certs with that problem were issued.
There is one cert with the issue that we know of.
https://crt.sh/?id=2606438724

There could be others, but I'll need to figure out how to pull a cross-section between active certs and compromised keys to figure that out.

The complete certificate data for the problematic certificates. The recommended way to provide this is to ensure each certificate is logged to CT and then list the fingerprints or crt.sh IDs, either in the report or as an attached spreadsheet, with one list per distinct problem.
https://crt.sh/?id=2606438724

Explanation about how and why the mistakes were made or bugs introduced, and how they avoided detection until now.
Mistakes were made because we were only accepting cert problem reports for keys and blacklisting the keys when the cert is revoked for key compromise. The basic flow is: a) an email comes in to the revoke at digicert.com alias, b) the CSR provided is used to identify the certs requiring revocation, c) an email is kicked off to the cert subscriber notifying them of the revocation and that they need to change their key pair, and d) the certs are revoked (effectively adding them to a blacklist). The break here is part b, since the system doesn't do anything to prevent reuse of the keys until the revocation actually occurs.

This leaves two issues: 1) How do we track compromised keys before a cert is issued for it? and 2) How do we prevent a cert from being issued with a compromised key between the time when we scan our database for other certs with the same key and when we revoke?

List of steps your CA is taking to resolve the situation and ensure such issuance will not be repeated in the future, accompanied with a timeline of when your CA expects to accomplish these things.
I don't have a good answer for #1 yet. For #2, the short-term solution is to scan the database both before and after we revoke the reported cert. The longer-term solution is to improve the process we have, turning the entire revocation event into an automated workflow. The short-term fix should allow us to revoke all certs issued between notification and revocation going forward until I can finish the automated method.

The two risks I've identified that I need to solve in the longer-term solution are:

Additional information based on Ryan's approach:

  1. How do we prevent abuse by people mass uploading keys not associated with our certs? If we allow keys not related to certs at all, then someone could overload the system by triggering mass numbers of cert searches. Or someone could generate hundreds of thousands of keys and submit them. This isn't an objection; we just need to figure out the proper rate limit on the number of keys that can be submitted at a time to trigger a search.

  2. How do we search through all of the certs if a private key itself is provided? What I haven't thought through is a good way to search our database for all of the public keys automatically if a private key is submitted (not a CSR). The best way is probably just to generate a CSR automatically and then search for the SPKI, but I need to plan it with the engineers.
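
As a hedged illustration of that idea (assuming the Python `cryptography` package; this is a sketch, not the planned implementation): the public key, and therefore the SPKI hash to search on, can be derived directly from a submitted private key, so the intermediate CSR is optional.

    # Sketch: derive the SPKI hash to search on from a bare private key.
    # Assumes a PEM-encoded, unencrypted key; names are illustrative only.
    import hashlib

    from cryptography.hazmat.primitives.serialization import (
        Encoding, PublicFormat, load_pem_private_key,
    )


    def spki_hash_from_private_key(pem: bytes) -> str:
        private_key = load_pem_private_key(pem, password=None)
        spki_der = private_key.public_key().public_bytes(
            Encoding.DER, PublicFormat.SubjectPublicKeyInfo
        )
        return hashlib.sha256(spki_der).hexdigest()

    # The returned hex digest can then be matched against the SPKI hashes
    # stored for issued certificates (the same value crt.sh displays).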

Flags: needinfo?(jeremy.rowley)

Responding specifically to Ryan's questions:

If folks demonstrate compromise of private key, it seems like it's reasonable to add it to a blocklist. If you end up with a giant database of compromised keys, that... seems like a net-positive?

Agreed. I was more thinking through the logistics of how to prevent continuous searches on keys from taking down the system, but that's already an existing problem with the manual steps we have. If someone submitted 100k keys without referencing any certs, I'm not sure the system would be able to search for all of them in 24 hours. I'll probably just implement a rate limit of some kind.
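
For illustration, that rate limit could be as simple as a token bucket; a purely hypothetical sketch with invented parameters:

    # Hypothetical token-bucket rate limit on key-compromise submissions.
    import time


    class SubmissionRateLimiter:
        def __init__(self, max_keys_per_hour: int = 1000):
            self.capacity = float(max_keys_per_hour)
            self.tokens = self.capacity
            self.refill_per_sec = max_keys_per_hour / 3600.0
            self.last = time.monotonic()

        def allow(self, n_keys: int = 1) -> bool:
            now = time.monotonic()
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.refill_per_sec)
            self.last = now
            if self.tokens >= n_keys:
                self.tokens -= n_keys
                return True
            return False    # defer or queue the batch instead of searching immediately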

I think you've basically identified what sounds like a reasonable process:

Receive notification of compromised private key
Verify compromise
Blocklist key from new issuance
Scan database for any existing certificates with said key
Notify subscribers
Revoke
Where Steps 1&2 of that might be automated or manual (or, practically, a combination of both being valid paths), while Steps 3-6 can be entirely automated.

That (automation) is my goal. I want to automate 1&2 as well, but still allow a manual process so we can accommodate a variety of key compromise notifications. If I automate 1 and 2, it'll allow people like Matt to submit mass keys more easily while still giving us the flexibility for people who find one or two compromised certs.
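
Sketching that end state (purely illustrative; every helper name here is hypothetical and nothing is committed to yet), the automated path would chain the six steps:

    # Hypothetical end-to-end pipeline for a key-compromise report.
    def process_compromise_report(report):
        key = parse_submitted_key_or_csr(report.payload)         # 1. receive notification
        if not proves_possession(report, key):                   # 2. verify compromise
            return reject(report)
        spki = spki_sha256(key.public_key())
        blocked_spki_hashes.add(spki)                            # 3. blacklist before anything else
        for cert in find_certs_by_spki(spki):                    # 4. scan for existing certs
            notify_subscriber(cert, reason="key compromise")     # 5. notify subscribers
            revoke(cert)                                         # 6. revoke within 24 hours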

Now I need to figure out the dev resources and priority. We were going to start the Quovadis migration effort next, but this may be more important.

Any updates/progress on where things stand?

Flags: needinfo?(jeremy.rowley)

Yes - we've gotten into dev for speccing. We're right now targeting mid-May to have the blacklist system in place. In the meantime, we've already implemented a process change where we scan for additional certs right after the revocation to see if any new ones were issued between when we scan for all certs and when we revoke all certs. I don't like this process since it's very manual. However, that'll go away once we have the blacklist implemented. The blacklist is specc'ed as a service so Quovadis can hook into it, as can the sMIME issuing systems.

Flags: needinfo?(jeremy.rowley)

I hate having to ask, but given the ever-growing complexity of "DigiCert": Can you clarify which system(s)/brand(s)/frontend(s) this system applies to? e.g. does it only apply to certain CP/CPSes?

Flags: needinfo?(jeremy.rowley)

It's a good question considering the confusion on the similar Quovadis bug and the fact that there are about 4 different ways key compromise is treated within DigiCert itself (same alias for all of them so it's not confusing to reporters).

The system we're building is a service that all systems and brands will tie into to check for key compromise before allowing a cert to be issued. It'll also allow us to add a compromised key to the blacklist before we revoke a cert so we won't have people requesting new certs with a compromised key. It'll also let us add keys reported by Matt to the general community so we don't issue one that is revoked by another CA (assuming that we have the key reported to us as well).
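
A hedged sketch of what that shared service interface could look like (the interface is invented for illustration; the real design may differ):

    # Hypothetical interface for a shared compromised-key service that each
    # issuing platform (CertCentral, Quovadis, the sMIME systems) would call.
    from typing import Protocol


    class CompromisedKeyService(Protocol):
        def is_blocked(self, spki_sha256_hex: str) -> bool:
            """Called by every platform immediately before issuance."""
            ...

        def block(self, spki_sha256_hex: str, source: str) -> None:
            """Add a key reported as compromised, independent of any revocation."""
            ...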

The complete automation will need to come later though. The fix we're looking to implement in May will just be the blacklist. Quovadis will be asked to implement shortly after. Tying the end-to-end process together seamlessly will happen after that, but I think that's more of a DigiCert efficiency issue than solving the compliance part of this bug. If you disagree, I can have the team look at what that additional effort will take (while still rolling out the blacklist in the May timeframe).

Not really part of the bug, but this may be of interest. The DigiCert systems are:

  1. CertCentral (newest DigiCert portal)
  2. Direct (legacy DigiCert) - Being retired as soon as I can get the functionality onto CertCentral
  3. About 9 legacy Symantec TLS issuing systems that are tied to Geotrust, Thawte, and Symantec brands. We've shut down a handful already (for example, Encryption Everywhere) with the rest being retired by EoY. We're currently in full migration mode from those systems to CertCentral
  4. MPKI8 (sMIME only)
  5. MPKI7 (sMIME only - legacy system and in the process of being moved to MPKI8)
  6. Quovadis
Flags: needinfo?(jeremy.rowley)

We're still tracking mid-late May for this bug. We've got the design mostly done and are looking at which team could develop it in a way that is consumable by the platforms mentioned above.

We've got it specc'ed out and are about ready to start developing. Expecting to have the service ready by start of June.

Whiteboard: [ca-compliance] → [ca-compliance] Next Update - 1-June 2020

We found more keys that were reused after the key search occurred but the certs were not revoked.

0645f02d763f0f0cb5cbfc015d189471
0c249ef89456b17d6a05f01d8ce9377e
0db8e45c571de60575727fcf700cbda8
066dbb149c6d4b85698759ceefc54458
065b4c47d38a23d16f0409c26158ccdb
08adf9acf766f65abb6553aa7422aed5
01d9c743413fcf8e596e56fdc6b0cae8
02f178792f007498e96098eeee7ea862
0585da3f4f98b40cfa12f30bbdbb74f5

We revoke them as soon as the post-revocation search runs.

Matt reported this on a separate bug (same issue):

Certificates https://crt.sh/?id=2386157426 and https://crt.sh/?id=2399185578 were revoked within 24 hours of receiving the certificate problem report, based on validly signed OCSP responses. However, as of the time of submission of this bug, validly signed OCSP responses for certificate https://crt.sh/?id=2782826035 are showing the certificate as being valid.

We have the short-term fix in place where we are scanning for additional keys. This is manual, so it doesn't work great. Still targeting mid-June for the blacklist check.

Flags: needinfo?(jeremy.rowley)

The blacklist key checker is still under development. We had a couple of delays in finishing it up. We're currently tracking June 30th for deployment. At that point, we can feed keys into the blacklist and it will prevent any further use. Phase 2 is to turn this into a service for anyone to use, including Quovadis and any sMIME platforms.

Based on the update provided by DigiCert in Bug 1639801, I think we can close this if DigiCert could describe an instance where the blacklist checker has now worked in a real-world situation, as an example that the new process works as intended.

Here is an example revoked cert where we tried to issue a new one with the compromised key. It was blocked from issuance.

https://crt.sh/?id=2976315260

Can we close this bug? The key blocklist tool is live across DigiCert and will be live with Quovadis on the 28th. We're tracking that on the Quovadis bug.

Flags: needinfo?(jeremy.rowley)

Closing per notes above.

Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Product: NSS → CA Program
Whiteboard: [ca-compliance] Next Update - 1-June 2020 → [ca-compliance] [ov-misissuance]