Open Bug 1711147 Opened 7 months ago Updated 2 months ago

Microsoft PKI Services: Malformed ICAs (missing certificate policy extensions)

Categories

(NSS :: CA Certificate Compliance, task)

Tracking

(Not tracked)

ASSIGNED

People

(Reporter: johnmas, Assigned: johnmas)

Details

(Whiteboard: [ca-compliance] Next update 2021-12-31)

User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36 Edg/90.0.818.56

Type: defect → task

Initial Incident Report - More Detailed Root Cause Analysis Will Be Provided Soon


  1. How your CA first became aware of the problem.

Microsoft PKI Services has identified eight (8) Intermediate CAs that were mis-issued because they are missing the certificatePolicies extension required by Section 7.1.2.2 of the Baseline Requirements. We became aware of this issue on 4 May 2021 at 03:28 PM (Pacific Time), when our WebTrust auditor notified us by email.

  2. A timeline of the actions your CA took in response. A timeline is a date-and-time-stamped sequence of all relevant events. This may include events before the incident was reported, such as when a particular requirement became applicable, or a document changed, or a bug was introduced, or an audit was done.

Note: Times are listed in the Pacific time zone.
• 4 May 2021 03:28 PM – Issue reported to Microsoft PKI Services by our auditor.
• 4 May 2021 03:40 PM – Issue acknowledged by Microsoft PKI Services and confirmation that investigation was underway.
• 6 May 2021 01:04 PM – Revocation for all eight (8) ICAs completed (CRL files created).
• 7 May 2021 02:10 PM – Confirmation that all eight (8) ICAs had been revoked and the CRLs published.

  3. Whether your CA has stopped, or has not yet stopped, certificate issuance or the process giving rise to the problem or incident. A statement that you have stopped will be considered a pledge to the community; a statement that you have not stopped requires an explanation.

We stopped issuance via our offline CA systems and processes, as this is where the issue arose. We have added a new step at the end of our issuance ceremony from our Root CAs that uses ZLint to confirm there are no issues with the ICA certificates. We have not issued any new certificates from either of these two Root CAs since 11 Mar 2021.

  4. In a case involving certificates, a summary of the problematic certificates. For each problem: the number of certificates, and the date the first and last certificates with that problem were issued. In other incidents that do not involve enumerating the affected certificates (e.g. OCSP failures, audit findings, delayed responses, etc.), please provide other similar statistics, aggregates, and a summary for each type of problem identified. This will help us measure the severity of each problem.

There are eight (8) certificates that we have found with this issue. All eight (8) were created on 11 Mar 2021 and have now been revoked. None of the eight (8) CAs had been used to issue subscriber certificates. No other ICA certificates with this issue have been issued since.

https://crt.sh/?id=4258098919&opt=zlint,cablint,x509lint
https://crt.sh/?id=4271615882&opt=zlint,cablint,x509lint
https://crt.sh/?id=4271615883&opt=zlint,x509lint,cablint
https://crt.sh/?id=4271615887&opt=zlint,x509lint,cablint
https://crt.sh/?id=4262920409&opt=zlint,x509lint,cablint
https://crt.sh/?id=4271615885&opt=zlint,x509lint,cablint
https://crt.sh/?id=4271615886&opt=zlint,x509lint,cablint
https://crt.sh/?id=4271615884&opt=zlint,x509lint,cablint

  5. In a case involving certificates, the complete certificate data for the problematic certificates. The recommended way to provide this is to ensure each certificate is logged to CT and then list the fingerprints or crt.sh IDs, either in the report or as an attached spreadsheet, with one list per distinct problem. In other cases not involving a review of affected certificates, please provide other similar, relevant specifics, if any.

See the above for links to certificates.

  6. Explanation about how and why the mistakes were made or bugs introduced, and how they avoided detection until now.

We are still investigating the root cause of this incident; we expect to provide a more detailed Root Cause Analysis next week (21 May 2021). The certificate policies extension was specified in the CA ceremony documents and configured, but it appears that a syntax error (or something like it) kept it out of the issued certificates. Additionally, it is clear that the offline post-issuance checks we had in place also failed to catch the omission. We will provide more details as our investigation proceeds.

  7. List of steps your CA is taking to resolve the situation and ensure that such a situation or incident will not be repeated in the future, accompanied with a binding timeline of when your CA expects to accomplish each of these remediation steps.

This is a preliminary list of items and will be updated when we report more detail on root cause.

Completed Remediation:
• All eight (8) certificates were revoked within 72 hours of us discovering the error.
• We updated our offline CA/Root issuance process to include a post issuance check, using Zlint, to confirm there are no issues with certificates that are issued.

Open Remediation:
• We will review all TLS ICA certificates that Microsoft PKI Services has issued with our offline CA/Root processes for issues with certificate policy extensions (21 May 2021).
• We will provide a more detailed Root Cause Analysis next week (21 May 2021). This work will identify additional remediation steps.

Assignee: bwilson → johnmas
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Whiteboard: [ca-compliance]

More Detailed Root Cause


The eight (8) Intermediate CAs that were mis-issued were all created on 11 March 2021 using the same offline CA/Root processes.

Our offline CA/Root processes consist of three high-level steps. The first step involves creating the CA, generating the key pair, and creating a Certificate Signing Request (CSR). The second step involves signing the CSR file with the Root or Parent CA. The third step completes the CA creation process.

Our processes require a minimum of two Microsoft PKI Services personnel to be present and at some points, additional full-time employees are needed to provide oversight. The first and third step involve specific Ceremony documents for each CA/Root that are created and are verified by our personnel during each step of the ceremony.

The facts found are:
• Certificate policies extensions were specified in the CA ceremonies to be present.
• The certificate policies were missing from the CSRs (Certificate Signing Requests) generated on the new subordinate CAs. The CSRs were created by a combination of a CAPolicy.inf file and the ADCS (Active Directory Certificate Services) installers in Windows Server 2019. The extensions were present in the CAPolicy.inf file but were not applied because of a syntax error: multiple policy OIDs were placed within a single policy section.
• Here is the exact syntax:

  [PolicyStatementExtension]
  Policies=Policy

  [Policy]
  OID=2.23.140.1.2.1 ;DV OID
  OID=2.23.140.1.2.2 ;OV OID
  URL=http://www.microsoft.com/pkiops/docs/repository.htm

• We found that multiple OIDs are not supported in a single policy section under PolicyStatementExtension. Instead of raising an error, ADCS skips the section of the INF that it cannot use but still applies the rest of the instructions in the CAPolicy.inf file. Because the other expected attributes set by CAPolicy.inf were present, we assumed that all fields were correct, and the omission was missed in multiple manual inspections.
• Testing showed that the following adjusted syntax would have produced the desired results:

  [PolicyStatementExtension]
  Policies=DVPolicy, OVPolicy, MSPKI

  [DVPolicy]
  OID=2.23.140.1.2.1 ;DV OID

  [OVPolicy]
  OID=2.23.140.1.2.2 ;OV OID

  [MSPKI]
  OID=1.3.6.1.4.1.311.76.509.1.1
  URL=http://www.microsoft.com/pkiops/docs/repository.htm

• In previous runs of this process we did not run into this syntax problem because we had not needed multiple certificate policies in a CA. In those cases the CAPolicy.inf file included only one certificate policy, and it was applied successfully in the CSR.
• The certificate policies were not applied to any of the eight (8) mis-issued ICA certificates.
• The current step in the CA ceremony that manually checks the certificate for completeness does not specifically call out the certificate policies extension. The step does call out some controls with which we have had problems in the past, but the check is general in nature.
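
The silent-skip behaviour described above suggests that an automated check on the CAPolicy.inf file itself could help before the CSR is generated. The following is a minimal sketch, not part of the process described in this bug: `find_repeated_keys` is a hypothetical helper, and it assumes (per the finding above) that a repeated `OID=` key inside one policy section is the pattern to flag. It only understands section headers, key=value lines, and `;` comments.

```python
from collections import Counter

# Sample taken from the failing CAPolicy.inf fragment quoted in this bug.
BROKEN_INF = """\
[PolicyStatementExtension]
Policies=Policy

[Policy]
OID=2.23.140.1.2.1 ;DV OID
OID=2.23.140.1.2.2 ;OV OID
URL=http://www.microsoft.com/pkiops/docs/repository.htm
"""


def find_repeated_keys(inf_text, watched=("OID",)):
    """Return {section: {key: count}} for watched keys that repeat in a section.

    Hypothetical helper: handles only [section] headers, key=value lines,
    and ';' comments, which is all the fragment above uses.
    """
    watched_upper = {k.upper() for k in watched}
    counts = {}      # section name -> Counter of watched keys seen there
    section = None
    for raw in inf_text.splitlines():
        line = raw.split(";", 1)[0].strip()  # drop trailing ';' comments
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1].strip()
            counts.setdefault(section, Counter())
        elif section is not None and "=" in line:
            key = line.split("=", 1)[0].strip().upper()
            if key in watched_upper:
                counts[section][key] += 1
    repeated = {}
    for name, counter in counts.items():
        dupes = {k: n for k, n in counter.items() if n > 1}
        if dupes:
            repeated[name] = dupes
    return repeated


if __name__ == "__main__":
    # Any non-empty result would be a reason to stop before CSR generation.
    print(find_repeated_keys(BROKEN_INF))
```

Run against the corrected syntax (one OID per section), the same function returns an empty dictionary. A check like this complements, rather than replaces, inspecting the generated CSR.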

This is an update on Completed and Open Remediation for this issue.

Completed Remediation:

• All eight (8) certificates were revoked within 72 hours of us discovering the error (7 May 2021).
• We updated our offline CA/Root issuance process to include a post issuance check, using Zlint, to confirm there are no issues with certificates that are issued (12 May 2021).
• We reviewed all TLS ICA certificates that Microsoft PKI Services has issued, with our offline CA/Root issuance processes, for issues with certificate policy extensions (21 May 2021). No additional issues were found.
• We completed a more detailed Root Cause Analysis for this issue and updated this bug (21 May 2021).

Open Remediation:

• We will add specific manual vetting steps to our Ceremony document, after CSR generation and before certificate fulfillment, in our offline CA/Root issuance processes to validate that the following fields in the CSRs exist and are correctly set (4 June 2021):
• The Certificate Policies extension contains at least the OV OID, 2.23.140.1.2.2, and a link to our repository, http://www.microsoft.com/pkiops/docs/repository.htm, and is not marked critical.
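
The vetting criteria above can also be expressed as a small check. This is only a sketch under stated assumptions: `check_certificate_policies` is a hypothetical function, and it takes values that have already been decoded from the CSR or certificate by some other tool (it does not parse DER or PEM itself).

```python
OV_OID = "2.23.140.1.2.2"
REPOSITORY_URL = "http://www.microsoft.com/pkiops/docs/repository.htm"


def check_certificate_policies(policy_oids, urls, critical):
    """Return a list of problems; an empty list means the check passed.

    policy_oids: dotted OID strings from the extension, or None if absent.
    urls: repository/CPS URLs found in the policy qualifiers.
    critical: True if the extension is marked critical.
    (All three inputs are assumed to come from a separate decoding step.)
    """
    if policy_oids is None:
        return ["certificatePolicies extension is missing"]
    problems = []
    if OV_OID not in policy_oids:
        problems.append("OV policy OID %s is missing" % OV_OID)
    if REPOSITORY_URL not in urls:
        problems.append("repository URL is missing")
    if critical:
        problems.append("extension is marked critical")
    return problems


if __name__ == "__main__":
    # The mis-issued ICAs in this bug correspond to the "extension absent" case.
    print(check_certificate_policies(None, [], False))
```

Returning a list of problems instead of a single boolean lets a ceremony script record every deviation at once.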

Thanks John.

Could you describe more about your issuance ceremony? I understand you introduced a post-lint step, but this also seems like it could have been detected during a "dry run" with a test key, to ensure the process and procedures resulted in the same configuration.

It seems like your current ceremony goes

  1. Live key and CSR generation.
  2. Live certificate generation
  3. Live ... (not quite sure? activation of the CA)?

It seems like generating a (test) key and signing with a (test) root would help verify that the procedural controls are working as intended. This would then also let you perform the requisite linting (post-lint) on the certificate.

If memory serves, ADCS doesn't support the ability to export the tbsCertificate it produces from the CSR and "hold" it (e.g. to allow pre-issuance linting), but the above seems to be a standard practice for CAs, typically as part of the ceremony itself (to make sure all the tools and infrastructure are working as intended; otherwise, you could run into TOCTOU issues between lab and production environments).

I'm hoping a bit more expansion on the playbook, such as whether these procedures are already part of your playbooks, would be useful for understanding if there are still opportunities to prevent the wrong thing from happening. Mostly, the post-issuance check doesn't seem like it goes to prevention, and the act of transforming the CSR to a TBSCertificate still has a lot that can go wrong, which wouldn't be caught here.

Flags: needinfo?(johnmas)

Great feedback (Comment 3), thanks Ryan.

You are correct; the current ceremony goes:

  1. Live key and CSR generation.
  2. Live certificate generation.
  3. Live certificate installation and configuration of the CA

Our team does have test CAs and certificate issuance in our playbook and frequently uses them to test features and functions that are new to us or have not been exercised in some time. However, we do not do this for every certificate that we issue using this process. Given the importance of these processes and the need to improve their quality, we like this idea, and we are making plans to implement it as part of our process to issue offline CA/Roots. Our plan is to use the identical ceremonies to generate a (test) key and sign it with a (test) root immediately before we issue the CA (from a live Root), and to ensure that the post-issuance lint on the test CA certificate is as expected prior to issuing the production/live CA.

This new process will take a little time for the team to prototype, test, document and implement into production. We will not create any new live certificates until the additional test processes are in place and the committed date below may be accelerated if we have demand for a live certificate using this process before then.

This is an update on Completed and Open Remediations for this issue.

Completed Remediations:
• All eight (8) certificates were revoked within 72 hours of us discovering the error (7 May 2021).
• We updated our offline CA/Root issuance process to include a post issuance check, using Zlint, to confirm there are no issues with certificates that are issued (12 May 2021).
• We reviewed all TLS ICA certificates that Microsoft PKI Services has issued, with our offline CA/Root issuance processes, for issues with certificate policy extensions (21 May 2021). No additional issues were found.
• We completed a more detailed Root Cause Analysis for this issue and updated this bug (21 May 2021).
• We added specific manual vetting steps to our Ceremony document, after CSR generation and before certificate fulfillment, in our offline CA/Root issuance processes to validate that the following fields in the CSRs exist and are correctly set (1 June 2021):
o The Certificate Policies extension contains at least the OV OID, 2.23.140.1.2.2, and a link to our repository, http://www.microsoft.com/pkiops/docs/repository.htm, and is not marked critical.

Open Remediations:
• Add to our playbook additional steps, each time and immediately prior to live key and CSR generation, to issue (test) keys and sign them with a (test) Root. This will allow us to double check all the parameters of the cert are correct (lint the test certificate). (10 July 2021)

Flags: needinfo?(johnmas)
Summary: Microsoft PKI Services, Malformed ICAs (missing certificate policy extensions) → Microsoft PKI Services: Malformed ICAs (missing certificate policy extensions)

From https://wiki.mozilla.org/CA/Responding_To_An_Incident#Keeping_Us_Informed

You should also provide updates at least every week giving your progress, and confirm when the remediation steps have been completed - unless Mozilla representatives agree to a different schedule by setting a “Next Update” date in the “Whiteboard” field of the bug.

This is an issue that has been raised with a number of CAs recently, and in particular, Google Trust Services (Bug 1708516), so I think it's useful to understand how Microsoft missed this. Could you share your process for reviewing CA incidents (from other CAs), in addition to sharing where Microsoft is at with the playbook changes?

Flags: needinfo?(johnmas)

Ryan, thank you for the question. This is a topic where we clearly need to continue to improve. We are aware of the requirement and have done our best to keep up with it. Honestly, I had not clearly understood the subtlety between the NeedInfo agreement and a committed date until now: that is, if we had set a committed date and there were no comments or questions, but no agreed-upon NeedInfo, we still needed to keep up the weekly updates. We will ensure from now on that we meet this standard. We review all of our open Bugzilla items a couple of times a week as a team and have a formal roll-up to our Leadership team every other week, specific to open Bugzilla items.

Our process for reviewing CA incidents is to monitor both the MDSP Mailing List discussions and Open Bugzilla bugs several times a week. Specifically, we have two folks who make it a point to spend time each week for this review. This information is then used as feedback to the larger team when pertinent discussions come up during planning, strategic or operational team meetings.

As for an update on this specific bug, we continue to work on the Open Remediation discussed above:

Open Remediations:
• Add to our playbook additional steps, each time and immediately prior to live key and CSR generation, to issue (test) keys and sign them with a (test) Root. This will allow us to double check all the parameters of the cert are correct (lint the test certificate). (10 July 2021)

We anticipate this may be completed sooner than the committed date and we will have an update next week on progress of implementing this process change.

Flags: needinfo?(johnmas)

On 14 June 2021 we updated the ceremony to include the following steps during Certificate Generation:

  1. Test certificate generation.
    1. Manual inspection of issued certificate
    2. Linting using zlint of issued certificate
    3. If all went well then proceed to issue live certificate
  2. Live certificate generation
    1. Enhanced manual inspection of generated CSRs.
    2. Enhanced manual inspection of issued certificates
    3. Linting using zlint of issued certificate
    4. If all went well then proceed to live certificate installation and configuration of the CA
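
The "if all went well then proceed" gates in the steps above could be backed by a mechanical check of the linter output. The sketch below is illustrative only and is not the tooling used in the ceremony. It assumes the zlint command-line tool's JSON report, which maps each lint name to an object with a `result` field; the helper names, the sample report, and the choice to block on `error` and `fatal` results are assumptions.

```python
import json

# Result levels treated as blocking in this sketch (an assumption; a stricter
# policy could also block on "warn").
BLOCKING_RESULTS = {"error", "fatal"}

# Illustrative report for a test certificate; the lint names are examples only.
SAMPLE_REPORT = json.dumps({
    "e_sub_ca_certificate_policies_missing": {"result": "error"},
    "e_basic_constraints_not_critical": {"result": "pass"},
    "w_sub_ca_aia_does_not_contain_issuing_ca_url": {"result": "NA"},
})


def blocking_findings(report_json):
    """Return the sorted names of lints whose result should stop the ceremony."""
    report = json.loads(report_json)
    return sorted(
        name
        for name, outcome in report.items()
        if isinstance(outcome, dict) and outcome.get("result") in BLOCKING_RESULTS
    )


def may_proceed(report_json):
    """True only if the report contains no blocking findings."""
    return not blocking_findings(report_json)


if __name__ == "__main__":
    print(blocking_findings(SAMPLE_REPORT))
    print(may_proceed(SAMPLE_REPORT))
```

The same gate can be applied twice: once to the test certificate before live issuance, and once to the live certificate before installation and configuration of the CA.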

During our live ceremony on June 24 the process worked well to produce four (4) new ICAs from our ECC Root. Once we fixed the template issue noted below, four (4) new ICAs were also generated from our RSA root.

We ran into issues with the live certificate issuance of four (4) new ICAs from our RSA Root. We had generated ICAs using this process in a test environment that looked good, and we had confidence going into the live certificate issuance.

However, immediately following the issuance of the four (4) new RSA ICA certificates, the team noticed during enhanced manual inspection that the certificates were malformed. We have created a new Bugzilla bug that explores the root cause of that mis-issuance (https://bugzilla.mozilla.org/show_bug.cgi?id=1718991) and we believe it is a separate problem from the issues that we addressed in this bug.

The new processes that we added as a result of this bug helped us fail fast during our ceremony on 24 June and quickly resolve the root cause.

This is an update on Completed and Open Remediations for this issue.

Completed Remediations:
• All eight (8) certificates were revoked within 72 hours of us discovering the error (7 May 2021).
• We updated our offline CA/Root issuance process to include a post issuance check, using Zlint, to confirm there are no issues with certificates that are issued (12 May 2021).
• We reviewed all TLS ICA certificates that Microsoft PKI Services has issued, with our offline CA/Root issuance processes, for issues with certificate policy extensions (21 May 2021). No additional issues were found.
• We completed a more detailed Root Cause Analysis for this issue and updated this bug (21 May 2021).
• We added specific manual vetting steps in our Ceremony document post-CSR generation, and before certificate fulfillment, in our offline CA/Root issuance processes to validate that the Certificate Policies extension contains at least the OV OID, 2.23.140.1.2.2, and a link to our repository http://www.microsoft.com/pkiops/docs/repository.htm and not set critical (1 June 2021).
• We added to our playbook additional steps, each time and immediately prior to live key and CSR generation, to issue (test) keys and sign them with a (test) Root. This allows us to double check that all the parameters of the cert are correct (lint the test certificate) (14 June 2021).

Open Remediations:
• None

John: Pre-issuance linting would have caught both of these issues, correct?

Is there any work being planned on that, even for manual ceremonies?

Flags: needinfo?(johnmas)

More specifically, Comment #3 appears to have predicted the issue with Bug 1718991, and while I don't like to toot my own Nostradamus, I am hoping this may lead to a re-evaluation and roadmap about more systemic controls, since while they caught things after the fact, the goal is to catch them beforehand.

Thanks for your comments, Ryan.

Yes, pre-issuance linting would be especially useful in these cases and is a feature that we plan to implement in both our offline and online systems in the future.

Our focus as of late has been on detection and ensuring we fail fast and prevent bad certificates from being sent to customers. And that is why we have been tuning and implementing our own internal linting tools and industry linters (specifically zlint in our case). At the same time, we have been fixing any issues that have come up during the linting process. Then we are working back upstream from there.

We feel this is still the best approach, as it has let us put measures in place to discover much more quickly when our systems are failing (and to revoke much more quickly). In our online system we have had post-issuance zlint in place for all issuances for more than two months, and we are happy to report that we have not had any certificates fail linting thus far.

Post issuance linting allows us to implement additional controls as necessary to prevent those flaws from re-occurring. As you point out this is still less than ideal but puts us in an improved position from where we have been.

Implementing pre-issuance linting is going to be an architectural challenge for some of our systems and we are committed to making those changes and improving the process as far upstream as possible. To be clear this roadmap will take some time to implement and prepare for, so for the near term we will continue to focus on pushing our controls more and more upstream as we work toward eventually detecting the issues with TBS certificates during pre-linting.

Flags: needinfo?(johnmas)

While I'm realizing I'm asking you to make your own forward-thinking predictions here, which is challenging (per Comment #9), do you have any sense of what sort of timeframes we're talking about until Microsoft has pre-issuance linting in place?

And while ZLint is fantastic, I want to call out that it doesn't handle a lot of invalid DER/invalid ASN.1 cases, which certlint is much more tailored to (by the use of automatically generated validators from the ASN.1 schema, by way of mixing asn1c with Ruby). So as part of that planning, I'm hoping pre-issuance linting is being thought of generically, and not just specific to a particular implementation/language (ZLint/Golang)

Flags: needinfo?(johnmas)

Ryan, as described above we have been working to improve the controls that we have in place to prevent mis-issued certificates. And we have made great progress with our automated tools over the past six months. We do agree that we still need to improve further and we are committed to doing so.

We do our sprint planning in quarters and our existing roadmaps through the end of the 3rd quarter are still focused on the controls that we have talked about in this bug and others that Microsoft PKI Services have had to open recently.

As acknowledged above, we agree that pre-issuance linting is a feature that we will work toward, but we still do not have the planning completed to predict when we might get it implemented into our systems. We have committed ourselves to do that planning in the 4th quarter and by the end of this calendar year we should have a much better idea of how and when we will implement pre-issuance linting into our systems.

And yes, we can confirm that we are thinking of pre-issuance linting, and in fact any linting that we have, generically and considering the best strategies for covering as many implementations/languages as practical.

Flags: needinfo?(johnmas)

Ben: I realize you're OOO, so this probably won't be touched until the 28th.

  • Comment #7 captures that the immediate actions to this have been completed.
  • Comments #8 - Comment #12 explore how this is a class of errors (invalid CA certificates) that has affected Microsoft previously, and a move towards pre-issuance linting for CA certificates, and not just subscriber certificates, offers a stronger mitigation.
  • Comment #12 suggests 2021-Q4 as the start of planning for that effort, with 2021-EOY being for when they can have a more meaningful timeline for pre-issuance linting for CAs.

My own inclination would be to set NextUpdate to Q4 - we've seen enough issues with Microsoft's CA issuance practices (Bug 1718991, this Bug 1711147, Bug 1644936, Bug 1586847, and Bug 1598390) that I think there's good reason to want to make sure this is delivered on. However, that's also a significant delay, presumably in part due to the use of Microsoft's Active Directory Certificate Services, which is both used by Microsoft and a customer-facing product.

I'd be curious about your take, as an alternative option would be to close this issue out for now, and instead consider more significant action if Microsoft further misissues intermediate CAs in the time it takes them to implement the necessary functionality.

Like I said, I can go either way, but I don't have further questions, and hopefully the above summary is useful for you.

Flags: needinfo?(bwilson)
Whiteboard: [ca-compliance] → [ca-compliance] Next update 2021-10-01

We will continue to improve these processes and come back by 2021-10-01 and provide an update on where we are with our Q4 planning process for pre-issuance linting for CAs.

We are actively working on an architectural approach to incorporate pre-issuance linting into our processes. We are evaluating our options ranging from adding custom code to our existing architecture to a certificate management re-architecture. Based on progress to date, we are forecasting that we will have a plan for pre-issuance linting, as committed, by the end of December.

We ask, with this update on our progress, that the "Next Update" flag please be reset to the end of December this year.

Whiteboard: [ca-compliance] Next update 2021-10-01 → [ca-compliance] Next update 2021-12-31