Closed Bug 1670337 Opened 10 months ago Closed 2 months ago

Microsoft PKI Services: Certificate Mis-Issuance, DNSNames must have a valid TLD

Categories

(NSS :: CA Certificate Compliance, task)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: johnmas, Assigned: johnmas)

Details

(Whiteboard: [ca-compliance] Next Update 2021-04-19)

Attachments

(9 files)

Type: defect → task
  1. How your CA first became aware of the problem.

We were notified by a partner via email, to our "CentralPKI@microsoft.com" alias, on 10/8/20 at 10:08 PM PDT that there was an issue with public TLS certificates that we had issued. We opened an internal incident at approximately 1:53 AM PDT on 10/9/20 and have been actively managing it since then.

  2. A timeline of actions your CA took in response.

    A. Microsoft was notified about an issue with these certificates by an outside partner on 10/8/20 at 10:08 PM PDT. These certificates were issued by Microsoft CAs that are cross-signed by DigiCert, and DigiCert was part of the original notification.
    B. An internal MS incident was created at 1:53 AM PDT on 10/9/20 and our teams engaged.
    C. Identified issue was linked to one second level domain that was onboarded for a customer that should not have been, because it is not public, at around 7:30 AM PDT on 10/9/20.
    D. The “bad” domain was removed from the production system at approximately 11:00 AM PDT on 10/9/20. This removed the domain as an option in our tooling, effectively preventing any new “bad” certificates from being created.
    E. Customers were notified at 11:42 AM PDT on 10/9/20 that certificates issued to the domain would be revoked at approximately 2:00 PM PDT on 10/9/20.
    F. At approximately 2:30 PM PDT on 10/9/20 the 2 initial certificates that were identified and 6 more (that we subsequently identified), were revoked (see attached certificates). We will continue to investigate for more certificates (and precertificates) to revoke.
    G. Created this Bug in Bugzilla.

  3. Whether your CA has stopped, or has not yet stopped, issuing certificates with the problem.

As noted in the timeline we have stopped issuing certificates with this problem.

Preliminary understanding of the root cause is that the problem is limited to one second level Domain that was mistakenly added to our production system for issuing valid public TLS certificates, as it was not a public domain (ap.gbl). Certificates that were issued to this domain as public were mis-issued. At approximately 11:00 AM on 10/9/20 we removed this domain from our system and certificates can no longer be issued.

  4. A summary of the problematic certs.

The certificates were issued to a domain that was not public and thus failed the DNSNames must have a Valid Top-Level Domain (TLD) check (see attached certs for all known impacted certificates).

  5. The complete certificate data for the problematic certificates.

Attached to this bug (as known at this time).

  6. Explanation about how the mistakes were made or bugs introduced, and how they avoided detection until now.

Microsoft PKI Services began issuing Public TLS certificates in July 2018, but to a relatively small number of public domains. In April 2020 we began to onboard many more users, and thus many more domains. Preliminary indications are that the process we used to onboard domains in April 2020 was insufficient and did not adequately verify that all the domains that were onboarded were public. The “bad” domain mentioned above was onboarded during the April 2020 upload of new domains. Since the April 2020 onboarding process, in July 2020, we have more completely and better documented this process to include improved public domain checks.

The mis-issued certificates escaped detection after creation because there is an issue with our internal Linting tool and TLD detection. The Linting tool did not detect the problematic Top-Level Domain (TLD). Now that we are aware of this issue with the Linting tool, we have taken this as a repair item and will include updates on this repair shortly.

  7. List of steps your CA is taking to resolve the situation and ensure such issuance will not be repeated in the future, accompanied with a timeline of when your CA expects to accomplish these things.

As this incident is still young, we are working through the repair items and mitigation steps currently. We will report back to this bug as they are identified and addressed over the next week or so.

Our initial list at this point includes:
+ We will revalidate that all domains in our tools are public (approximately 10/16/20)
+ We will ensure that we have identified and revoked all impacted certificates (approximately 10/16/20).
+ We are continuing to automate the domain validation process to remove any manual errors. We expect this to be in place by the end of next week (approximately 10/16/20).
+ Addressing issues with our Linting tool and TLD detection (TBD).

Assignee: bwilson → johnmas
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Whiteboard: [ca-compliance]

John,

Thanks for filing this. While Microsoft's response here is more commendable than most, I'd like to highlight that a failure to adequately do domain control validation is one of, if not the, single most troublesome incidents a CA can have (ignoring, of course, intentional actions). So I can hope you appreciate just how worrisome this is, and how essential it is to make sure the public has all the facts, and CAs have all the knowledge, to take appropriate actions.

I'd like to focus on the following snippets:

(In reply to John Mason from comment #9)

C. Identified issue was linked to one second level domain that was onboarded for a customer that should not have been, because it is not public, at around 7:30 AM PDT on 10/9/20.
D. The “bad” domain was removed from the production system at approximately 11:00 AM PDT on 10/9/20. This removed the domain as an option in our tooling, effectively preventing any new “bad” certificates from being created.

<snip>

Preliminary understanding of the root cause is that the problem is limited to one second level Domain that was mistakenly added to our production system for issuing valid public TLS certificates, as it was not a public domain (ap.gbl). Certificates that were issued to this domain as public were mis-issued. At approximately 11:00 AM on 10/9/20 we removed this domain from our system and certificates can no longer be issued.

<snip>

Microsoft PKI Services began issuing Public TLS certificates in July 2018, but to a relatively small number of public domains. In April 2020 we began to onboard many more users, and thus many more domains. Preliminary indications are that the process we used to onboard domains in April 2020 was insufficient and did not adequately verify that all the domains that were onboarded were public. The “bad” domain mentioned above was onboarded during the April 2020 upload of new domains. Since the April 2020 onboarding process, in July 2020, we have more completely and better documented this process to include improved public domain checks.

There's a lot here that I see worrying, and I'd like to call out a few specific details:

  1. Given this domain is not public, how does Microsoft perform domain control validation? That is, this should not have met any of the BR requirements for domain control validation, and so that is the first control that should have caught this.

  2. What was the process to onboard domains in April 2020?

  3. What was deficient about the process?

  4. What controls did you have in place when onboarding domains, in addition to domain control validation?

  5. What were the old public domain checks, since they were improved?

  6. The response here implies it was a lack of documentation ("more completely and better documented this process"), but provides no insight into the existing documentation or how it's been improved. This is touched on in Responding to an Incident, as this response feels very similar to the example given in there:

    For example, it’s not sufficient to say that “human error” or “lack of training” was a root cause for the incident, nor that “training has been improved” as a solution.

    Can you attach the old documentation and process and provide the new documentation/process, to help better understand the changes?

Flags: needinfo?(johnmas)

(In reply to John Mason from comment #9)

As noted in the timeline we have stopped issuing certificates with this problem.

How did you do CAA checks for all these certificates? How can you be sure now that your CAA checks for all your other certificates were correctly done?

This incident and Microsoft's response are deeply concerning. Not performing domain validation is one of the most serious failures a CA can experience. Incidents like this, coupled with poor incident response and other compliance issues (of which Microsoft has had several), have led to the distrust of other CAs such as Symantec and Certinomis. Unfortunately, Microsoft's incident report in Comment 9 identifies the problem not as a failure to perform domain validation (or check CAA) but as a failure to check that the TLD is public, which calls into question Microsoft's awareness of the most important rules that a CA must follow.

In addition to the questions already posed by Comment 10 and Comment 11, I would ask Microsoft to answer:

  1. Once a domain is "onboarded", for how long does the domain remain in the production system and able to be used in new certificates?

  2. Based on Comment 9, my assumption is that the only validation that Microsoft performs prior to issuance is checking that the domains in the certificate were previously onboarded. Is this correct? If not, what steps, precisely, does Microsoft perform to validate the DNS names in the certificate request? Given the serious nature of this incident, the more detail that can be provided the better.

The mis-issued certificates escaped detection after creation because there is an issue with our internal Linting tool and TLD detection. The Linting tool did not detect the problematic Top-Level Domain (TLD). Now that we are aware of this issue with the Linting tool, we have taken this as a repair item and will include updates on this repair shortly.

Is the linting tool that Microsoft uses an internally developed/proprietary solution? As a follow-up item beyond repairing your implementation have you considered augmenting that linting tool with additional linters? ZLint, as one example, regularly consumes up-to-date TLD information as part of the normal release cycle and flags this class of problem (e.g. this error level finding).
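
As a rough illustration of the class of check being discussed, the sketch below flags DNS names whose rightmost label is not a known public TLD. It is neither ZLint nor Microsoft's tool. `KNOWN_TLDS` is a tiny hypothetical sample, and `has_valid_tld` and `lint_dns_names` are illustrative names; a real linter would load the full, current IANA TLD list.

```python
# Hypothetical sample only; a real check would load the full, current IANA TLD list.
KNOWN_TLDS = {"com", "net", "org"}


def has_valid_tld(dns_name: str, known_tlds: set[str] = KNOWN_TLDS) -> bool:
    """Return True if the rightmost label of dns_name is a known public TLD."""
    labels = dns_name.rstrip(".").lower().split(".")
    # Reject bare single labels and names containing empty labels.
    if len(labels) < 2 or not all(labels):
        return False
    return labels[-1] in known_tlds


def lint_dns_names(dns_names: list[str]) -> list[str]:
    """Return the names that fail the TLD check."""
    return [name for name in dns_names if not has_valid_tld(name)]
```

Under these assumptions, a name under `ap.gbl` would be reported because `gbl` is not in the list, while a name under `.com` would pass.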

Ryan (Comment 10), Paul (Comment 11), Andrew (Comment 12), and Daniel (Comment 13) thanks for your questions. We will add more context and then answer the questions.

In July 2018 Microsoft began issuing TLS certificates under a new Root CA program (using the following two Roots, http://www.microsoft.com/pkiops/certs/Microsoft%20RSA%20Root%20Certificate%20Authority%202017.crt and http://www.microsoft.com/pkiops/certs/Microsoft%20ECC%20Root%20Certificate%20Authority%202017.crt) from Microsoft PKI Services [MS PKI]. Currently the program scope is limited to issuing TLS certificates for Microsoft owned name spaces (public and private). The vast majority of public TLS certificates are issued under a separate MS program under a Baltimore root, from Microsoft DSRE PKI [MS DSRE].

In July 2020 the Delegated OCSP Responder incident impacted roughly 7 million certificates (https://bugzilla.mozilla.org/show_bug.cgi?id=1649951), all of which needed to be replaced. Both root programs were called upon to service high volumes of replacement certificates, which included process changes for the new MS root CA program. This required an update to MS PKI's Domain Validation processes in order to add automation and scale. Additional updates to this process are under way to close gaps that led to the previous certificate mis-issuance.

To answer Ryan's questions (Comment 10):

 1.  Given this domain is not public, how does Microsoft perform domain control validation? That is, this should not have met any of the BR requirements for domain control validation, and so that is the first control that should have caught this. 

[MS PKI] In April 2020 when this "bad" domain was added to our domain validation cache (for 825 days), we were using domain control validation as stated in CAB Forum BR 3.2.2.4.2 (Email, Fax, SMS, or Postal Mail to Domain Contact). Unfortunately, while manually validating ~400 domains, this "bad" domain was missed. (More details below)

 2.  What was the process to onboard domains in April 2020? 

[MS PKI] In April 2020 our process allowed any internal Microsoft user to request domains from our service, with the requirement that the domains were owned by Microsoft. Our process was as follows:

 1.  (After receiving request to add domains, via email from internal Microsoft customers) Validate the owner of the Domain through at least 1 of 2 options, a manual ICANN Registration lookup or WhoIs Domain lookup.   

 2.  Send Domains list for Domain Validation to Domain Contact inside Microsoft via email and await results. 

 3.  Validate the domain results email from the Domains Contact and update MS PKI Toolset domain validation cache. 

MS PKI manually confirmed ~400 domains using the above process on April 14 and we missed the "bad" domain in question (the team did reject 20 other domains during this particular process but missed the one). By July 2020 we had only used this April manual process and the July manual process (described below) about 10 times to validate a total of ~500 domains. The root cause of this issue happened during the April 14 domain validation process.

 3.  What was deficient about the process? 

[MS PKI] Our process lacked the ability to scale and for the team to manually process many domain validations at once. As we iterated the process manually between April and July, we added a Tracking Database that helped the team manage the domains as they progressed through each step in the domain validation process, this improved our ability to ensure that each domain did not miss a step. As we improved the process, we added redundant controls and checks that resulted in better results (see July process described below, like the DNS Operator check and the CAA check during domain validation). At this time, we believe that the issues in our initial domain validation cache of ~500 resulted from problems our team had during the April 14 domain validation check.

By July 2020 we had iterated to the following manual process for checking domain control validation when adding domains to, or removing them from, the MS PKI Toolset domain validation cache (our internal users would create a ticket to add or remove a domain):

  1. Query for open Domain Validation requests (or removes)

  2. Acknowledge Open tickets

  3. Check the list of Validated Domains (or removes) against the Tracking database. Update Tracking Database as appropriate (add/remove).

  4. Look up registered owners of the new domain(s) (2 Options, ICANN Registration lookup or WhoIs Domain Lookup). Domain MUST be Owned by Microsoft. Update Tracking Database as needed.

  5. Look up DNS Operator using https://dns.google.com/. Update Tracking Database as needed.

  6. Perform a CAA Record Check to ensure that we can issue for that domain. Update Tracking Database as needed.

  7. Perform a Domain Validation request for all new Domains in Tracking Database with the Registered Domain Contact. Update Tracking Database with results from domain contact.

  8. Use Tracking Database updates to add new domains and remove appropriate domains from the domain validation cache in MS PKI Toolset.

  9. Upon Confirmation that Toolset is updated, update the Tracking Database to reflect confirmed changes to Domains.

  10. Update each ticket and resolve.

In hindsight our April process did not have enough redundancy or controls to ensure domains were validated thoroughly (with redundant checks that would help spot oversights).
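
The idea of a tracking database that keeps a domain from skipping a step can be sketched as follows. This is an illustration under stated assumptions, not MS PKI's tooling: the step identifiers paraphrase steps 4 to 7 of the July checklist, and `DomainRecord` is a hypothetical name.

```python
from dataclasses import dataclass, field

# Step names paraphrase the July checklist; the identifiers themselves are hypothetical.
REQUIRED_STEPS = (
    "owner_lookup",       # step 4: registered owner must be Microsoft
    "dns_operator",       # step 5: DNS operator lookup
    "caa_check",          # step 6: CAA record check
    "contact_validated",  # step 7: response from the registered domain contact
)


@dataclass
class DomainRecord:
    name: str
    completed: list[str] = field(default_factory=list)

    def complete(self, step: str) -> None:
        """Record a step, refusing to skip ahead of any earlier required step."""
        done = len(self.completed)
        expected = REQUIRED_STEPS[done] if done < len(REQUIRED_STEPS) else None
        if step != expected:
            raise ValueError(f"{self.name}: expected step {expected!r}, got {step!r}")
        self.completed.append(step)

    def ready_for_cache(self) -> bool:
        """Only a domain that passed every step may enter the validation cache."""
        return tuple(self.completed) == REQUIRED_STEPS
```

Enforcing the order in the record itself, rather than relying on the operator, is one way to make a missed owner lookup fail loudly instead of silently.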

 4.  What controls did you have in place when onboarding domains, in addition to domain control validation? 

[MS PKI] The request for initial Domain Validations were required to come in through our request intake site which requires valid Microsoft credentials and second authentication factor to access. In addition, at certificate issuance our toolset checks that the domain is still in our domain validation cache, that the domain validation date is within 825 days, ensures that the certificate request adheres to compliant certificate profiles and that all domains in the request are well-formed.
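
A minimal sketch of the issuance-time checks described above (cache membership, the 825-day window, and well-formedness) might look like this. The function name `may_issue`, the simplified label pattern, and the rule that a name matches a cached domain when it equals it or is a subdomain of it are all assumptions for illustration; wildcard names and certificate profile checks are not modeled.

```python
import re
from datetime import date, timedelta

MAX_VALIDATION_AGE = timedelta(days=825)  # reuse period stated in the thread

# Simplified well-formedness pattern (an assumption): dot-separated LDH labels.
_LABEL = r"[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?"
_DNS_NAME = re.compile(rf"^{_LABEL}(\.{_LABEL})+$")


def may_issue(dns_name: str, cache: dict[str, date], today: date) -> bool:
    """Check one requested name against the domain validation cache."""
    name = dns_name.lower()
    if not _DNS_NAME.match(name):
        return False
    # The name must equal, or fall under, a cached domain whose validation is still fresh.
    return any(
        (name == domain or name.endswith("." + domain))
        and today - validated_on <= MAX_VALIDATION_AGE
        for domain, validated_on in cache.items()
    )
```

In this sketch none of the checks asks whether the domain is public, so an entry that was wrongly added to the cache would pass at issuance time.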

 5.  What were the old public domain checks, since they were improved?

[MS PKI] The team would manually check the domains via an ICANN lookup or WHOIS to ensure that the domains belonged to Microsoft (we are currently only issuing certificates to internal Microsoft teams for Microsoft owned domains). Once the validators confirmed that the domains were owned by Microsoft, they would then send an email to the Microsoft domains team with a unique code.

One of the details that the April process lacked was that instead of sending the mail to the registered domain contact inside of Microsoft, our process was using the central domains contact at Microsoft. For the April 14 check, the central domains contact (domains@microsoft.com) is the registered domain contact for all but 5 of the domains. Our updated July process was improved to ensure the mail goes to the registered domain contact.

Most importantly it looks like step #1 (from April 2020) was missed, or only partially completed during the domain validation of ~400 domains on April 14. In addition, step #2 (from April 2020 process) returned a false positive for the “bad” domains.

 6.  The response here implies it was a lack of documentation ("more completely and better documented this process"), but provides no insight into the existing documentation or how it's been improved. This is touched on in Responding to an Incident, as this response feels very similar to the example given in there: 

  For example, it’s not sufficient to say that “human error” or “lack of training” was a root cause for the incident, nor that “training has been improved” as a solution.

 Can you attach the old documentation and process and provide the new documentation/process, to help better understand the changes? 

[MS PKI] The old (April) manual process was as described above in the response to question #2. The newer (July) manual process is described above in the response to question #3. These processes accounted for ~500 domains being added to our domain validation cache.

The root cause of this incident is the lack of redundant controls in place to catch a bad domain before it is improperly validated. We believe the improvements made to the process between April and July were significant and provided the redundancy needed and diligence required to manually validate domains prior to adding them to our domains cache.

We believe this incident was limited to the improper validation that was performed on April 14 with the April manual process.

As described above, in early July the Delegated OCSP Responder incident impacted the CAs that our sister team, MS DSRE, was using. As a result of the Microsoft incident response, during the month of July, Microsoft made the decision to source Public TLS certificates for Microsoft owned/managed properties from MS PKI instead of MS DSRE. This transition is planned to take place from August 2020 through February 2021.

This made it clear that MS PKI needed to quickly scale the number of domains we allowed from ~500 to more than ~55k. Our existing processes, personnel and tools were not up to ramping that quickly. Instead we adopted the tools that our sister team, MS DSRE, had been using and refining for years in operating a CAB Forum BR compliant domain validation process.

At this point, based on what we now know, we should have cleared our initial domain validation cache of ~500 domains and started over. Instead, we retained the ~500 domains that were in the cache and added the additional ~55K domains as described below (the additional ~55k domains were validated in three large batches between August and September).

To validate the large batches between August and September, MS PKI developed and executed domain validation tooling that leveraged the MS DSRE domain validation tools. The MS DSRE team had a domain validation cache that they had already validated using at least one of the following CAB Forum BR methods: 3.2.2.4.2, 3.2.2.4.3, 3.2.2.4.4, 3.2.2.4.6 and/or 3.2.2.4.7. So, the MS PKI team leveraged that validation to populate the domain validation cache in the MS PKI Toolset (between August and September).

The updated August process, and the more automated domain validation process that is currently in place, for changes to our domain validation cache is as follows:

Background: Microsoft currently has four Registrars that provide a daily feed of all Public Domains Microsoft owns (or used to own). Changes in ownership are denoted by flags that our tools use to remove previously validated domains from the MS PKI domain validation cache. Additionally, we stopped taking new domain requests from internal Microsoft teams. Instead we rely on our daily updates to our domain validation cache.

  1. Tools automatically pull daily Registrar feeds into MS PKI Toolset and identify Domains that are new or need to be removed.

  2. Check the list of new Microsoft domains against the existing MS PKI domain validation cache.

    a. If the domain is no longer owned/managed by Microsoft it is removed from the domain validation cache.

    b. If the domain is new and not in our existing cache, then proceed.

  3. Perform a Domain Validation request for all new Domains in daily registrar feed, by emailing the Registered Domain Contact with a unique code.

  4. Update domain validation cache with email results from registered domain contact’s response.
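
Steps 1 and 2 above amount to a set comparison between the daily feed and the cache. The sketch below shows one way that could be expressed. The feed shape (a mapping from domain to a "still owned" flag) and the name `reconcile` are hypothetical stand-ins for the ownership flags mentioned in the background, and treating a cached domain that is absent from the feed as removable is an assumption.

```python
def reconcile(feed: dict[str, bool], cache: set[str]) -> tuple[set[str], set[str]]:
    """Compare a daily registrar feed with the validation cache.

    feed maps each domain to True if it is still owned (hypothetical shape).
    Returns (to_remove, to_validate).
    """
    owned = {d.lower() for d, still_owned in feed.items() if still_owned}
    cache_lc = {d.lower() for d in cache}
    # Step 2a: cached domains that are no longer owned, or absent from the feed, are removed.
    to_remove = cache_lc - owned
    # Step 2b: owned domains not yet cached go on to email validation (step 3).
    to_validate = owned - cache_lc
    return to_remove, to_validate
```

With this shape, a cached entry such as `ap.gbl` that never appears in a registrar feed of public domains would land in `to_remove`.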

To answer Paul's question (Comment 11):

 How did you do CAA checks for all these certificates? How can you be sure now that your CAA checks for all your other certificates were correctly done? 

[MS PKI] Per CAB Forum BR 3.2.2.8 Microsoft is the DNS Operator, and we were using the DNS operator's exception when we performed our process. As described above we added a manual CAA check in our manual July process. CAA check is a part of our post-July automated process during domain validation. CAA checks are not currently a part of our certificate issuance process, but we are now committed to adding it soon (Timing is TBD as we work with our engineers to plan this work).
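
For context, the decision an issuance-time CAA check has to make for a non-wildcard name can be sketched roughly as below, following the rules in RFC 8659. This is not Microsoft's implementation. It assumes the relevant CAA record set has already been retrieved (DNS lookup and tree climbing are not shown), and `ISSUER_DOMAIN` and `caa_permits` are hypothetical names.

```python
ISSUER_DOMAIN = "ca.example"  # hypothetical issuer domain name for the CA


def caa_permits(records: list[tuple[int, str, str]], issuer: str = ISSUER_DOMAIN) -> bool:
    """Decide whether a relevant CAA record set permits issuance for a non-wildcard name.

    Each record is (flags, tag, value). An empty record set places no restriction.
    """
    known_tags = {"issue", "issuewild", "iodef"}
    issue_values = []
    for flags, tag, value in records:
        tag = tag.lower()
        # An unrecognized property with the critical bit (128) set blocks issuance.
        if flags & 0x80 and tag not in known_tags:
            return False
        if tag == "issue":
            # Drop any parameters after ';' and compare the issuer domain name only.
            issue_values.append(value.split(";", 1)[0].strip().lower())
    if not issue_values:
        return True  # no "issue" property: issuance is not restricted
    return issuer.lower() in issue_values
```

A record with the value `;` yields an empty issuer name, which matches no CA and therefore forbids issuance.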

To answer Andrew's questions (Comment 12):

 1. Once a domain is "onboarded", for how long does the domain remain in the production system and able to be used in new certificates? 

[MS PKI] It will remain in our domain validation cache for 825 days, unless the domain ownership information changes via our domain registrars' feed (no longer owned by Microsoft).

 2. Based on Comment 9, my assumption is that the only validation that Microsoft performs prior to issuance is checking that the domains in the certificate were previously onboarded. Is this correct? If not, what steps, precisely, does Microsoft perform to validate the DNS names in the certificate request? Given the serious nature of this incident, the more detail that can be provided the better. 

[MS PKI] As stated above (and pasted below) in the response to Comment 10 question #6.

The updated August process, and the more automated domain validation process that is currently in place, for changes to our domain validation cache is as follows:

Background: Microsoft currently has four Registrars that provide a daily feed of all Public Domains Microsoft owns (or used to own). Changes in ownership are denoted by flags that our tools use to remove previously validated domains from the MS PKI domain validation cache. Additionally, we stopped taking new domain requests from internal Microsoft teams. Instead we rely on our daily updates to our domain validation cache.

  1. Tools automatically pull daily Registrar feeds into MS PKI Toolset and identify Domains that are new or need to be removed.

  2. Check the list of new Microsoft domains against the existing MS PKI domain validation cache.

    a. If the domain is no longer owned/managed by Microsoft it is removed from the domain validation cache.

    b. If the domain is new and not in our existing cache, then proceed.

  3. Perform a Domain Validation request for all new Domains in daily registrar feed, by emailing the Registered Domain Contact with a unique code.

  4. Update domain validation cache with email results from registered domain contact’s response.

To answer Daniel's question (Comment 13):

 Is the linting tool that Microsoft uses an internally developed/proprietary solution? As a follow-up item beyond repairing your implementation have you considered augmenting that linting tool with additional linters? ZLint, as one example, regularly consumes up to date TLD information as part of the normal release cycle and flags this class of problem (e.g. this error level finding). 

[MS PKI] The tool we currently use is proprietary. Part of the reason for that decision is that we have strict controls around code that can run in our high-security environments; it is difficult to gain approval to operate third-party code in these environments, and a challenge to keep that code updated. This incident highlights the importance of using an industry standard linting tool, and we are reexamining ways in which we can add one to our toolset.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

UPDATE on Findings, Repair Items and Mitigation Steps:

We have discovered that of the 8 certificates that were mis-issued and revoked on Oct 9 and disclosed in this bug that 4 certificates also included one public and non-Microsoft owned domain in the Subject Name (specifically akadns.net). 2 of these certificates were issued on 9/9/20 and 2 were issued on 9/14/20. This error originated during the same April 14 domain validation process and that is when the akadns.net domain was incorrectly added to our domain validation cache. We will reach out to the registered domain owner to explain the error and provide all the information that we can.

Completed:

  • We used our daily Registrar feed from Oct 16 and ensured that all the domains in our validation cache, including the ~500 from the manual processes, were consistent with the Registrar’s feed (owned by Microsoft and public) and removed the “bad” domains discussed in this bug.

  • We examined all certificates that MS PKI has issued from all of our issuing CAs. We confirmed that the 8 certificates originally disclosed in this bug are the only certificates that were issued for “bad” domains disclosed in this bug.

Open/In Process:

  • Reach out to registered domain owner for AKADNS.NET for follow up on error (approximately 10/23/20).

  • We are continuing to automate the Aug 2020 domain validation process to remove any manual errors. The August 2020 process step #4 will be updated to automatically ingest the registered domain contacts response (approximately 10/30/20).

  • Addressing issues with our Linting tool and TLD detection (TBD).

  • Reexamine ways in which to add an industry standard linting tool to our toolset (TBD).

  • We will use an industry standard linting tool to lint all certificates that we have issued and report back on any errors that are discovered (TBD).
Flags: needinfo?(johnmas)

Thanks, John, for the level of detail in Comment 14.

One of the details that the April process lacked was that instead of sending the mail to the registered domain contact inside of Microsoft, our process was using the central domains contact at Microsoft. For the April 14 check, the central domains contact (domains@microsoft.com) is the registered domain contact for all but 5 of the domains.

If I'm understanding this correctly, you're saying that for 5 domains, you sent the domain validation email to an address that isn't allowed by BRs 3.2.2.4.2, 3.2.2.4.4, 3.2.2.4.13, or 3.2.2.4.14? If so, the certificates for those domains are misissued and need to be revoked and disclosed here.

It's a bit hard to follow the different processes which MS PKI has used to validate domains. Is it correct that all certificates issued by MS PKI have validated domains using one of the following 5 processes?

  1. The original April manual process

  2. The improved July manual process

  3. The "updated August process"

  4. By importing domains which had been validated by MS DSRE

  5. The automated system currently in place

Could you provide more details about the "updated August process"?

Regarding CAA, what is your process for determining that you are the DNS Operator of a domain? For domains validated using the July process, is it done by consulting https://dns.google.com as described in step 5? How is it done for domains which were validated using the other 4 processes?

Flags: needinfo?(johnmas)

[MS PKI] Per CAB Forum BR 3.2.2.8 Microsoft is the DNS Operator, and we were using the DNS operator's exception when we performed our process. As described above we added a manual CAA check in our manual July process. CAA check is a part of our post-July automated process during domain validation. CAA checks are not currently a part of our certificate issuance process, but we are now committed to adding it soon (Timing is TBD as we work with our engineers to plan this work).

BR 3.2.2.8 only gives you the right to define an exception in your CP/CPS.

Where in your CP/CPS did you define this exception? If you did not, all of your certificates are, in fact, misissued.

(In reply to John Mason from comment #14)

We have discovered that of the 8 certificates that were mis-issued and revoked on Oct 9 and disclosed in this bug that 4 certificates also included one public and non-Microsoft owned domain in the Subject Name (specifically akadns.net).

For convenience, the 8 certificates currently attached to this bug (+ respective precertificates) are tracked in https://misissued.com/batch/186/ . If Microsoft had followed the recommendation of providing crt.sh IDs for the certificates, these certificates would have been easier to spot.

Andrew (Comment 15) and Paul (Comment 16 and 17) thanks for your questions.

To Answer Andrew’s questions (Comment 15):

If I'm understanding this correctly, you're saying that for 5 domains, you sent the domain validation email to an address that isn't allowed by BRs 3.2.2.4.2, 3.2.2.4.4, 3.2.2.4.13, or 3.2.2.4.14? If so, the certificates for those domains are misissued and need to be revoked and disclosed here.

[MS PKI] Yes, we did identify that for 5 domains during the April 14 check we sent the email to the improper domain contact. All the certificates that were mis-issued have been revoked and disclosed here (we mis-issued for 2 of these domains and we did not issue for the other 3 domains) and we have removed those domains from our domain validation cache and/or revalidated them and added them back to the domain validation cache.

It's a bit hard to follow the different processes which MS PKI has used to validate domains. Is it correct that all certificates issued by MS PKI have validated domains using one of the following 5 processes?

  1. The original April manual process
  2. The improved July manual process
  3. The "updated August process"
  4. By importing domains which had been validated by MS DSRE
  5. The automated system currently in place
Could you provide more details about the "updated August process"?

[MS PKI] Yes, those are the 5 processes used to validate domains. The “updated August process” is covered in Comment 14 and I will copy it here as well.

The updated August process for changes to our domain validation cache (the more automated domain validation process that is currently in place), is as follows:

Background: Microsoft currently has four Registrars that provide a daily feed of all Public Domains Microsoft owns (or used to own). Changes in ownership are denoted by flags that our tools use to remove previously validated domains from the MS PKI domain validation cache. Additionally, we stopped accepting new domain requests from internal Microsoft teams. Instead, we rely on our daily updates to our domain validation cache.

  1. Tools automatically pull daily Registrar feeds into MS PKI Toolset and identify Domains that are new or need to be removed.
  2. Check the list of new Microsoft domains against the existing MS PKI domain validation cache.
    a. If the domain is no longer owned/managed by Microsoft it is removed from the domain validation cache.
    b. If the domain is new and not in our existing cache, then proceed.
  3. Perform a Domain Validation request for all new Domains in daily registrar feed, by emailing the Registered Domain Contact with a unique code.
  4. Update domain validation cache with email results from registered domain contact’s response.

Steps 1-3 of this process are already automated, and we are working to add automation to Step #4 (see repair items, target is to complete that by approximately 10/30/20).
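For readers following the thread, the reconciliation in steps 1-4 can be sketched roughly as below. This is only an illustration, not MS PKI's tooling: the `FeedEntry` type, the `owned` flag, and the function names are hypothetical, and the sketch assumes that only domains explicitly flagged as no longer owned are removed from the cache.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeedEntry:
    """One row of a daily registrar feed (hypothetical shape)."""
    domain: str
    owned: bool  # assumed flag: False once ownership has changed


def reconcile(feed, cache):
    """Steps 1-2: return (updated_cache, domains_needing_validation).

    `cache` is a set of lowercase domain names that were validated earlier.
    """
    owned = {e.domain.lower() for e in feed if e.owned}
    released = {e.domain.lower() for e in feed if not e.owned}
    # Step 2a: drop domains flagged as no longer owned.
    updated = {d for d in cache if d not in released}
    # Step 2b: owned domains not yet cached must be validated (step 3).
    to_validate = sorted(owned - updated)
    return updated, to_validate


def record_response(cache, domain, expected_code, returned_code):
    """Step 4: cache the domain only if the contact returned the unique code."""
    if returned_code == expected_code:
        return cache | {domain.lower()}
    return set(cache)
```

A domain reaches the cache only through `record_response`, so a new feed entry never becomes issuable without a matching reply from the registered domain contact.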

Regarding CAA, what is your process for determining that you are the DNS Operator of a domain? For domains validated using the July process, is it done by consulting https://dns.google.com as described in step 5? How is it done for domains which were validated using the other 4 processes?

[MS PKI] MS PKI only issues certificates for Microsoft owned domains. For all our processes we have been using the DNS Operator exception at certificate issuance as defined in our CPS. For the July manual process, we did perform the check on ~100 domain validations. Working with our domains team we have verified that for all cases where Microsoft owns the domain that we are the DNS Operator.

However, we see the value in adding the CAA check for all certificate issuances, as an additional control to ensure compliance in the case of a bad domain validation. We plan to have this check implemented by approximately 11/6/20.
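As an illustration of what a CAA check at issuance involves, here is a minimal sketch of the RFC 8659 lookup: climb from the requested name toward the root, take the first non-empty CAA record set, and compare its `issue` / `issuewild` values with the CA's identifier. The `lookup` callable, the tuple layout `(flags, tag, value)`, and the identifier `ca.example` are assumptions made for this example; CNAME handling and DNSSEC are left out.

```python
def relevant_caa(name, lookup):
    """Return the first non-empty CAA record set found while climbing the tree."""
    labels = name.rstrip(".").lower().split(".")
    for i in range(len(labels)):
        records = lookup(".".join(labels[i:]))
        if records:
            return records
    return []


def caa_permits(name, ca_domain, lookup):
    """True if CAA records allow `ca_domain` to issue for `name`."""
    wildcard = name.startswith("*.")
    if wildcard:
        name = name[2:]
    records = relevant_caa(name, lookup)
    known = {"issue", "issuewild", "iodef"}
    # An unrecognized property with the critical bit (128) set blocks issuance.
    if any(flags & 128 and tag.lower() not in known for flags, tag, _ in records):
        return False
    issue = [v for _, t, v in records if t.lower() == "issue"]
    issuewild = [v for _, t, v in records if t.lower() == "issuewild"]
    # For wildcard names, issuewild takes precedence when present.
    relevant = issuewild if (wildcard and issuewild) else issue
    if not relevant:
        return True  # no applicable property: issuance is not restricted
    wanted = ca_domain.lower()
    # The issuer domain is the part of the value before any ';' parameters.
    return any(v.split(";", 1)[0].strip().lower() == wanted for v in relevant)
```

With `lookup` backed by a real resolver, a `False` result would mean the request must be rejected unless a documented exception applies.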

To answer Paul’s question (Comment 16):
BR 3.2.2.8 only gives you the right to define an exception in your CP/CPS.
Where in your CP/CPS did you define this exception? If you did not, all of your certificates are, in fact, misissued.

[MS PKI] We have defined the exception in section 3.2.2.8 of our CPS (https://www.microsoft.com/pkiops/Docs/Content/policy/Microsoft_PKI_Services_CPS_v3.1.6.pdf), where we state “CAA checking is optional if Microsoft PKI Services or associated Affiliate is the DNS Operator (as defined in RFC 7719) of the domain's DNS.”

To answer Paul’s question (Comment 17):

For convenience, the 8 certificates currently attached to this bug (+ respective pre-certificates) are tracked in https://misissued.com/batch/186/ . If Microsoft had followed the recommendation of providing crt.sh IDs for the certificates, these certificates would have been easier to spot.

[MS PKI] Thank you for bringing this to our attention. We do our best to follow the recommendations and regret our oversight. We appreciate the reminder and will do our best to continuously improve our responses.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
UPDATE on Repair Items and Mitigation Steps:

Completed:

  • We used our daily Registrar feed from Oct 16 and ensured that all the domains in our validation cache, including the ~500 from the manual processes, were consistent with the Registrar’s feed (owned by Microsoft and public) and removed the “bad” domains discussed in this bug (Oct 17).
  • We examined all certificates that MS PKI has issued from all our issuing CA’s. We confirmed that the 8 certificates originally disclosed in this bug are the only certificates that were issued for “bad” domains disclosed in this bug (Oct 18).
  • Reached out to the registered domain owner for AKADNS.NET for follow up on this issue and received an acknowledgement of the issue (Oct 20). In addition to the improved domain validation process and additional CAA checks as mentioned above, we are both working with Microsoft teams to clarify where to get certificates in the future.

Open/In Process:

  • We are working to implement CAA checks for all certificate issuances (targeting 11/6/20)
  • We are continuing to automate the Aug 2020 domain validation process to remove any manual errors. The August 2020 process step #4 will be updated to automatically ingest the registered domain contacts response (targeting 10/30/20).
  • Improve capabilities of our Linting tool and TLD detection (TBD).
  • Reexamine ways in which to add an industry standard linting tool to our automated issuance processes (TBD).
  • We will use an industry standard linting tool to lint all certificates that we have issued and report back on any errors that are discovered (TBD)
Flags: needinfo?(johnmas)

(In reply to John Mason from comment #18)

[MS PKI] We have defined the exception in section 3.2.2.8 of our CPS (https://www.microsoft.com/pkiops/Docs/Content/policy/Microsoft_PKI_Services_CPS_v3.1.6.pdf), where we state “CAA checking is optional if Microsoft PKI Services or associated Affiliate is the DNS Operator (as defined in RFC 7719) of the domain's DNS.”

The quoted sentence is only part of several bullet points. Together with the introductory sentence, it reads: "Microsoft PKI Services MAY decide to not rely on any exceptions specified in their CP or CPS unless they are one of the following: [...] CAA checking is optional if Microsoft PKI Services or associated Affiliate is the DNS Operator (as defined in RFC 7719) of the domain's DNS.".

This does not define any exception at all by itself. All of your certificates are thus misissued and have to be revoked within 5 days.

Paul (Comment 19) thanks for your question.

To answer Paul’s question (Comment 19):

The quoted sentence is only part of several bullet points. Together with the introductory sentence, it reads: "Microsoft PKI Services MAY decide to not rely on any exceptions specified in their CP or CPS unless they are one of the following: [...] CAA checking is optional if Microsoft PKI Services or associated Affiliate is the DNS Operator (as defined in RFC 7719) of the domain's DNS.".

This does not define any exception at all by itself. All of your certificates are thus misissued and have to be revoked within 5 days.

[MS PKI] Our existing CPS (https://www.microsoft.com/pkiops/Docs/Content/policy/Microsoft_PKI_Services_CPS_v3.1.6.pdf) does define the exception in section 3.2.2.8 and this does not constitute a mis-issuance.

We agree that the existing document could be improved to make it clearer and more consistent with others in the industry, thus we will be revising our CPS shortly (targeting Nov 6, 2020).

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
UPDATE on Repair Items and Mitigation Steps:

Completed:

  • We used our daily Registrar feed from Oct 16 and ensured that all the domains in our validation cache, including the ~500 from the manual processes, were consistent with the Registrar’s feed (owned by Microsoft and public) and removed the “bad” domains discussed in this bug (Oct 17).
  • We examined all certificates that MS PKI has issued from all our issuing CA’s. We confirmed that the 8 certificates originally disclosed in this bug are the only certificates that were issued for “bad” domains disclosed in this bug (Oct 18).
  • Reached out to the registered domain owner for AKADNS.NET for follow up on this issue and received an acknowledgement of the issue (Oct 20). In addition to the improved domain validation process and additional CAA checks as mentioned above, we are both working with Microsoft teams to clarify where to get certificates in the future.

Open/In Process:

  • We are working to implement CAA checks for all certificate issuances (targeting Nov 6, 2020)
  • We are continuing to automate the Aug 2020 domain validation process to remove any manual errors. The August 2020 process step #4 will be updated to automatically ingest the registered domain contacts response (targeting Oct 30, 2020).
  • Improve capabilities of our Linting tool and TLD detection (TBD).
  • Reexamine ways in which to add an industry standard linting tool to our automated issuance processes (TBD).
  • We will use an industry standard linting tool to lint all certificates that we have issued and report back on any errors that are discovered (TBD)

Posting an update on our Repair Items and Mitigation Steps. We will provide an update weekly until we burn them down.

Completed:
+ We used our daily Registrar feed from Oct 16 and ensured that all the domains in our validation cache, including the ~500 from the manual processes, were consistent with the Registrar’s feed (owned by Microsoft and public) and removed the “bad” domains discussed in this bug (Oct 17).
+ We examined all certificates that MS PKI has issued from all our issuing CA’s. We confirmed that the 8 certificates originally disclosed in this bug are the only certificates that were issued for “bad” domains disclosed in this bug (Oct 18).
+ Reached out to the registered domain owner for AKADNS.NET for follow up on this issue and received an acknowledgement of the issue (Oct 20). In addition to the improved domain validation process and additional CAA checks as mentioned above, we are both working with Microsoft teams to clarify where to get these certificates in the future.

Open/In Process – Short Term:
+ Update our Microsoft PKI Services CPS to clarify CAA Record checks (targeting Nov 6, 2020).
+ We are continuing to automate the Aug 2020 domain validation process to remove any manual errors. The August 2020 process step #4 will be updated to automatically ingest the registered domain contacts response (targeting Nov 16, 2020).
+ We are working to implement CAA checks for all certificate issuances (Running into technical constraints and need to replan a rollout date).
+ We will use an industry standard linting tool to lint all certificates at a point in time that we have issued and report back on any errors that are discovered.

Open/In Process - Longer Term:
+ Improve capabilities of our internal Linting tool, specifically TLD detection.
+ Reexamine ways in which to add an industry standard linting tool to our automated issuance processes.

Posting an update on our Repair Items and Mitigation Steps:

Completed:

  • We used our daily Registrar feed from Oct 16 and ensured that all the domains in our validation cache, including the ~500 from the manual processes, were consistent with the Registrar’s feed (owned by Microsoft and public) and removed the “bad” domains discussed in this bug (Oct 17).
  • We examined all certificates that MS PKI has issued from all our issuing CA’s. We confirmed that the 8 certificates originally disclosed in this bug are the only certificates that were issued for “bad” domains disclosed in this bug (Oct 18).
  • Reached out to the registered domain owner for AKADNS.NET for follow up on this issue and received an acknowledgement of the issue (Oct 20). In addition to the improved domain validation process and additional CAA checks as mentioned above, we are both working with Microsoft teams to clarify where to get these certificates in the future.
  • Updated our Microsoft PKI Services CPS to clarify CAA Record checks (Nov 9). (https://www.microsoft.com/pkiops/Docs/Content/policy/Microsoft_PKI_Services_CPS_v3.1.7.pdf)

Open/In Process – Short Term:

  • We are continuing to automate the Aug 2020 domain validation process to remove any manual errors. The August 2020 process step #4 will be updated to automatically ingest the registered domain contacts response (targeting Nov 16, 2020).
  • We are working to implement CAA checks for all certificate issuances (Running into technical constraints and need to replan a rollout date).
  • We will use an industry standard linting tool to lint all certificates at a point in time that we have issued and report back on any errors that are discovered.

Open/In Process - Longer Term:

  • Improve capabilities of our internal Linting tool, specifically TLD detection.
  • Reexamine ways in which to add an industry standard linting tool to our automated issuance processes.

Posting an update on our Repair Items and Mitigation Steps:

Completed:

  • We used our daily Registrar feed from Oct 16 and ensured that all the domains in our validation cache, including the ~500 from the manual processes, were consistent with the Registrar’s feed (owned by Microsoft and public) and removed the “bad” domains discussed in this bug (Oct 17).
  • We examined all certificates that MS PKI has issued from all our issuing CA’s. We confirmed that the 8 certificates originally disclosed in this bug are the only certificates that were issued for “bad” domains disclosed in this bug (Oct 18).
  • Reached out to the registered domain owner for AKADNS.NET for follow up on this issue and received an acknowledgement of the issue (Oct 20). In addition to the improved domain validation process and additional CAA checks as mentioned above, we are both working with Microsoft teams to clarify where to get these certificates in the future.
  • Updated our Microsoft PKI Services CPS to clarify CAA Record checks (Nov 9). (https://www.microsoft.com/pkiops/Docs/Content/policy/Microsoft_PKI_Services_CPS_v3.1.7.pdf)
  • We continue to automate the Aug 2020 domain validation process to remove any manual errors. The August 2020 process step #4 was updated to automatically ingest the registered domain contact’s response (Nov 16).

Open/In Process – Short Term:

  • We are working to implement CAA checks for all certificate issuances.
  • We will use an industry standard linting tool to lint all certificates at a point in time that we have issued and report back on any errors that are discovered.

Open/In Process - Longer Term:

  • Improve capabilities of our internal Linting tool, specifically TLD detection.
  • Reexamine ways in which to add an industry standard linting tool to our automated issuance processes.

Posting an update on our Repair Items and Mitigation Steps:
Completed:

  • We used our daily Registrar feed from Oct 16 and ensured that all the domains in our validation cache, including the ~500 from the manual processes, were consistent with the Registrar’s feed (owned by Microsoft and public) and removed the “bad” domains discussed in this bug (Oct 17).
  • We examined all certificates that MS PKI has issued from all our issuing CA’s. We confirmed that the 8 certificates originally disclosed in this bug are the only certificates that were issued for “bad” domains disclosed in this bug (Oct 18).
  • Reached out to the registered domain owner for AKADNS.NET for follow up on this issue and received an acknowledgement of the issue (Oct 20). In addition to the improved domain validation process and additional CAA checks as mentioned above, we are both working with Microsoft teams to clarify where to get these certificates in the future.
  • Updated our Microsoft PKI Services CPS to clarify CAA Record checks (Nov 9). (https://www.microsoft.com/pkiops/Docs/Content/policy/Microsoft_PKI_Services_CPS_v3.1.7.pdf)
  • We continue to automate the Aug 2020 domain validation process to remove any manual errors. The August 2020 process step #4 was updated to automatically ingest the registered domain contact’s response (Nov 16).
  • We have implemented CAA checks in our production environment for all certificate issuances (Dec 3). We are currently in “audit” mode, to ensure that we understand all use cases and after the holiday code freezes, we will turn on “enforce” mode for all certificates we issue.

Open/In Process – Short Term:

  • Improve capabilities of our internal Linting tool, specifically TLD detection.

Open/In Process - Longer Term:

  • We will use an industry standard linting tool to lint all certificates at a point in time that we have issued and report back on any errors that are discovered.
  • Reexamine ways in which to add an industry standard linting tool to our automated issuance processes.
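The difference between the “audit” and “enforce” modes mentioned above can be pictured with a small sketch. The names `CaaMode` and `gate_issuance` are hypothetical and are not taken from Microsoft's systems; the sketch only assumes that audit mode logs a failed check without blocking, while enforce mode blocks.

```python
import logging
from enum import Enum


class CaaMode(Enum):
    AUDIT = "audit"      # log CAA failures but let issuance continue
    ENFORCE = "enforce"  # refuse issuance on CAA failure


def gate_issuance(caa_ok, mode, log=logging.getLogger("caa")):
    """Return True if issuance may proceed given the CAA result and mode."""
    if caa_ok:
        return True
    log.warning("CAA check failed (mode=%s)", mode.value)
    # Only audit mode lets a failed check through.
    return mode is CaaMode.AUDIT
```

Running in audit mode first surfaces the requests that would be blocked, so they can be understood before enforcement is switched on.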

Posting the last update for the year on our Repair Items and Mitigation Steps. We should complete all short-term items by January 15. At that point we will continue to work on the long-term items but ask that this bug be resolved after completion of the short-term items.

Completed:

  • We used our daily Registrar feed from Oct 16 and ensured that all the domains in our validation cache, including the ~500 from the manual processes, were consistent with the Registrar’s feed (owned by Microsoft and public) and removed the “bad” domains discussed in this bug (Oct 17).
  • We examined all certificates that MS PKI has issued from all our issuing CA’s. We confirmed that the 8 certificates originally disclosed in this bug are the only certificates that were issued for “bad” domains disclosed in this bug (Oct 18).
  • Reached out to the registered domain owner for AKADNS.NET for follow up on this issue and received an acknowledgement of the issue (Oct 20). In addition to the improved domain validation process and additional CAA checks as mentioned above, we are both working with Microsoft teams to clarify where to get these certificates in the future.
  • Updated our Microsoft PKI Services CPS to clarify CAA Record checks (Nov 9). (https://www.microsoft.com/pkiops/Docs/Content/policy/Microsoft_PKI_Services_CPS_v3.1.7.pdf)
  • We continue to automate the Aug 2020 domain validation process to remove any manual errors. The August 2020 process step #4 was updated to automatically ingest the registered domain contact’s response (Nov 16).
  • We have implemented CAA checks in our production environment for all certificate issuances (Dec 3). We are currently in “audit” mode, to ensure that we understand all use cases and after the holiday code freezes, we will turn on “enforce” mode for all certificates we issue.

Open/In Process – Short Term:

  • Improve capabilities of our internal Linting tool, specifically TLD detection (targeting Jan 15, 2021)

Open/In Process - Longer Term:

  • We will use an industry standard linting tool to lint all certificates at a point in time that we have issued and report back on any errors that are discovered.
  • Reexamine ways in which to add an industry standard linting tool to our automated issuance processes.

We should complete all short-term items by January 29. At that point we will continue to work on the long-term items but ask that this bug be resolved after completion of the short-term items.

Completed:

  • We used our daily Registrar feed from Oct 16 and ensured that all the domains in our validation cache, including the ~500 from the manual processes, were consistent with the Registrar’s feed (owned by Microsoft and public) and removed the “bad” domains discussed in this bug (Oct 17).
  • We examined all certificates that MS PKI has issued from all our issuing CA’s. We confirmed that the 8 certificates originally disclosed in this bug are the only certificates that were issued for “bad” domains disclosed in this bug (Oct 18).
  • Reached out to the registered domain owner for AKADNS.NET for follow up on this issue and received an acknowledgement of the issue (Oct 20). In addition to the improved domain validation process and additional CAA checks as mentioned above, we are both working with Microsoft teams to clarify where to get these certificates in the future.
  • Updated our Microsoft PKI Services CPS to clarify CAA Record checks (Nov 9). (https://www.microsoft.com/pkiops/Docs/Content/policy/Microsoft_PKI_Services_CPS_v3.1.7.pdf)
  • We continue to automate the Aug 2020 domain validation process to remove any manual errors. The August 2020 process step #4 was updated to automatically ingest the registered domain contact’s response (Nov 16).
  • We have implemented CAA checks in our production environment for all certificate issuances (Dec 3). We are currently in “audit” mode, to ensure that we understand all use cases and after the holiday code freezes, we will turn on “enforce” mode for all certificates we issue. Enforce mode was turned on for all production environments (Jan 14, 2021).

Open/In Process – Short Term:

  • Improve capabilities of our internal Linting tool, specifically TLD detection (targeting Jan 29, 2021)

Open/In Process - Longer Term:

  • We will use an industry standard linting tool to lint all certificates at a point in time that we have issued and report back on any errors that are discovered.
  • Reexamine ways in which to add an industry standard linting tool to our automated issuance processes.

I am glad that Microsoft is now enforcing CAA checks for all issuances. When will you remove the DNS Operator exception from your CPS?

Flags: needinfo?(johnmas)

Andrew (Comment 27) thanks for your question.

To answer Andrew's question (Comment 27):

We have been in enforce mode for our CAA checks for almost two weeks now. We have not had to use the DNS Operator exception in that time, but we intend to retain it for exceptional cases, while using the CAA checks as our normal operating procedure.

As a reminder, we only issue certificates for Microsoft owned domains. We can imagine several scenarios where we may need to use the exception. In these cases, we would continue to document that the situation meets all the criteria for the DNS Operators exception.

We will continue to monitor the situation and if the exception is not necessary we will remove it from our CPS.

Flags: needinfo?(johnmas)
Whiteboard: [ca-compliance] → [ca-compliance] Next Update 2021-02-01

All Short-Term repair items have been completed (between October 2020 and January 2021) and we ask that this bug be resolved. We will continue to work on the longer-term items identified but will no longer provide updates to this bug.

Completed:

  • We used our daily Registrar feed from Oct 16, 2020 and ensured that all the domains in our validation cache, including the ~500 from the manual processes, were consistent with the Registrar’s feed (owned by Microsoft and public) and removed the “bad” domains discussed in this bug (Oct 17, 2020).
  • We examined all certificates that MS PKI has issued from all our issuing CA’s. We confirmed that the 8 certificates originally disclosed in this bug are the only certificates that were issued for “bad” domains disclosed in this bug (Oct 18, 2020).
  • Reached out to the registered domain owner for AKADNS.NET for follow up on this issue and received an acknowledgement of the issue (Oct 20, 2020). In addition to the improved domain validation process and additional CAA checks as mentioned above, we are both working with Microsoft teams to clarify where to get these certificates in the future.
  • Updated our Microsoft PKI Services CPS to clarify CAA Record checks (Nov 9, 2020). (https://www.microsoft.com/pkiops/Docs/Content/policy/Microsoft_PKI_Services_CPS_v3.1.7.pdf)
  • We continue to automate the Aug 2020 domain validation process to remove any manual errors. The August 2020 process step #4 was updated to automatically ingest the registered domain contact’s response (Nov 16, 2020).
  • We have implemented CAA checks in our production environment for all certificate issuances (Dec 3). We are currently in “audit” mode, to ensure that we understand all use cases and after the holiday code freezes, we will turn on “enforce” mode for all certificates we issue. Enforce mode was turned on for all production environments (Jan 14, 2021).
  • We have improved the capabilities of our internal Linting tool, specifically for TLD detection. This change was implemented in all production environments (Jan 29, 2021).

Open/In Process - Longer Term:

  • We will use an industry standard linting tool to lint all certificates at a point in time that we have issued and report back on any errors that are discovered.
  • Reexamine ways in which to add an industry standard linting tool to our automated issuance processes.
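To illustrate the kind of TLD detection a linter can perform on dNSName entries, here is a minimal sketch. It is not the internal Microsoft linting tool: the function name is hypothetical and `VALID_TLDS` is a tiny illustrative subset. A real check would presumably load the full list that IANA publishes (for example tlds-alpha-by-domain.txt) and handle internationalized names.

```python
# Illustrative subset only; a real linter would load the complete IANA list.
VALID_TLDS = frozenset({"com", "net", "org"})


def invalid_tld_names(dns_names, valid_tlds=VALID_TLDS):
    """Return the names whose rightmost label is not a known public TLD."""
    bad = []
    for name in dns_names:
        labels = name.rstrip(".").lower().split(".")
        # A single-label name such as "localhost" has no public TLD at all.
        if len(labels) < 2 or labels[-1] not in valid_tlds:
            bad.append(name)
    return bad
```

For example, `invalid_tld_names(["www.example.com", "host.internal"])` returns `["host.internal"]`. Run before issuance, a non-empty result would stop the request.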

When relying on the DNS Operator Exception, what process will you use to determine that Microsoft is the DNS Operator?

Flags: needinfo?(johnmas)

Andrew (Comment 29) thanks for your question.

As described, we have not had to use the exception since we automated the CAA checks in early December, so we expect that these will be one off and infrequent occurrences. We have developed manual procedures for these exceptions.

These manual processes are derived from the ones that we had been using prior to implementing CAA automation. For DNS Operator we currently use one or both of these tools (i) https://dns.google.com/ or (ii) https://toolbox.googleapps.com/apps/dig/.

Flags: needinfo?(johnmas)

With respect to Comment 29, I neglected to add our whole process. I am working on clarifying that now on my end and will update shortly.

To restate the answer to Andrew (Comment 29) regarding DNS Operator.

As described, Microsoft PKI Services have not had to use the exception since we automated the CAA check, so we expect that these will be one off and infrequent occurrences. Microsoft PKI Services have developed manual procedures for these exceptions in our processes.
When going through the manual exception process to issue a certificate with the DNS Operator exception, Microsoft PKI Services performs the following:

  1. Acknowledge Emergency Request from internal customer about failing certificate automation ticket(s).
  2. Manually look up registered owners of the domain(s) using our Domain Registrar feed or WhoIs Domain Lookup.
  3. Contact the registered domain owner (per CAB Forum BR method 3.2.2.4.2, 3.2.2.4.3, 3.2.2.4.4 or 3.2.2.4.7) and verify that Microsoft owns the Domain and operates the DNS servers for the Domain.
    In addition, we also check with a third party using one or both tools (i) https://dns.google.com/ or (ii) https://toolbox.googleapps.com/apps/dig/.
    At this point we establish that we can take the DNS Operator exception (if Microsoft owns the Domain and Operates the DNS server), or we will deny the certificate request.
  4. Manually perform a CAA Record Check to ensure that we can issue for that domain. This step is used to investigate why the automated CAA check was failing, so that the requestor can be notified to update CAA records to allow automation for future requests.
  5. If step 3 determined we will take the DNS Operator exception, issue the certificate.
    This step includes a pre and post issuance lint.
  6. Update each ticket with outcome, and evidence as necessary and resolve.

Are there any other items to review/discuss, or can this bug be closed?

Thanks, John, for your answers. I understand that Microsoft will only use the DNS Operator exception infrequently, but history has shown that manual/infrequent issuance processes are often prone to human error, e.g. Bug 1525710, Bug 1561013, Bug 1567659, Bug 1569266, Bug 1574594, Bug 1674561, so this does not assuage the concerns with using the DNS Operator exception. I also acknowledge that Microsoft intends to issue only for Microsoft-owned domains, but as this incident shows, mistakes are possible despite best intentions, which is why CAs need to have solid processes to prevent mistakes.

Unfortunately the process described by Comment 33 is not compliant with the BRs. First, BR method 3.2.2.4.3 has been forbidden since May 31, 2019. Second, web-based DNS tools are Delegated Third Parties, and using them to perform the requirements of BR section 3.2, which encompasses CAA and the DNS Operator exception, is subject to strict requirements by BR 1.3.2 which I assume are not met by https://dns.google.com/ or https://toolbox.googleapps.com/apps/dig/. A similar incident was https://bugzilla.mozilla.org/show_bug.cgi?id=1651026#c16 (note that in that case, the CA revoked all certificates whose validation relied on https://toolbox.googleapps.com/apps/dig/).

Considering that the misissuances in this incident can be partially attributed to a failure to correctly implement the DNS Operator exception, I don't believe this incident will be remediated until Microsoft either removes the DNS Operator exception from their CPS, or develops a BR-compliant, mistake-proofed process for evaluating whether they are the DNS Operator for a domain.

Andrew thanks for your comments and questions (Comment 35) regarding DNS Operator.

We believe Microsoft PKI Services has a BR compliant process and we do our best to prevent mistakes and continuously improve. We appreciate the dialogue and feedback from forums like this that help us to do just that. Microsoft PKI Services has incorporated your feedback and updated the process below.

Microsoft PKI Services does not use the DNS Operator exception process as a normal course of business as we have implemented automated CAA checks for all certificate issuances. However, unforeseen circumstances may arise for which we would need to use the DNS Operator exception and it would be irresponsible of us to remove it entirely as an option.

To address the two inadequacies that you identified in this process: we agree that 3.2.2.4.3 is no longer used; this was an oversight on my part, and it has been removed from the documented process. Regarding the Delegated Third-Party tool use in step 3, we were aware of the concerns around this, which is why we had it as a secondary check, but we are content to remove this check from our process.

Updated process for clarification:

When going through the manual exception process to issue a certificate with the DNS Operator exception, Microsoft PKI Services performs the following:

  1. Acknowledge Emergency Request from internal customer about failing certificate automation ticket(s).
  2. Manually look up registered owners of the domain(s) using our Domain Registrar feed or WhoIs Domain Lookup.
  3. Contact the registered domain owner (per CAB Forum BR method 3.2.2.4.2, 3.2.2.4.4 or 3.2.2.4.7) and verify that Microsoft owns the Domain and operates the DNS servers for the Domain.
    At this point we establish that we can take the DNS Operator exception (if Microsoft owns the Domain and Operates the DNS server), or we will deny the certificate request.
  4. Manually perform a CAA Record Check to ensure that we can issue for that domain. This step is used to investigate why the automated CAA check was failing, so that the requestor can be notified to update CAA records to allow automation for future requests.
  5. If step 3 determined we will take the DNS Operator exception, issue the certificate.
    This step includes a pre and post issuance lint.
  6. Update each ticket with outcome, and evidence as necessary and resolve.
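The decision at the heart of steps 3 and 5 can be written down as a short sketch. The field and function names are hypothetical, and treating a failed pre-issuance lint as a denial is an assumption of this example; the process above only says that step 5 includes a pre and post issuance lint.

```python
from dataclasses import dataclass


@dataclass
class ExceptionRequest:
    """Facts gathered in steps 2-3 for one certificate request (hypothetical)."""
    microsoft_owns_domain: bool
    microsoft_operates_dns: bool
    pre_issuance_lint_ok: bool


def decide(req):
    """Return "issue" or "deny" for a DNS Operator exception request."""
    # Step 3: both conditions must hold to take the exception.
    if not (req.microsoft_owns_domain and req.microsoft_operates_dns):
        return "deny"
    # Step 5: assumed here that a failed pre-issuance lint also blocks issuance.
    if not req.pre_issuance_lint_ok:
        return "deny"
    return "issue"
```

Whatever the outcome, step 6 still applies: the result and supporting evidence are recorded on the ticket.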

Thank you, John.

I don't think that using a third party DNS tool as a secondary check would violate the BRs, but if the third party tool is just a secondary check, then what is the primary check? Comment 36 says that Microsoft will verify that it is the DNS operator, but doesn't explain how.

It's concerning that a forbidden domain validation method was part of a documented process. I notice that 3.2.2.4.3 is also listed in your CPS (https://www.microsoft.com/pkiops/Docs/Content/policy/Microsoft_PKI_Services_CPS_v3.1.7.pdf). This appears to be a failure to keep processes up-to-date with BR changes, which I believe is a distinct incident worthy of an incident report that examines the root cause for the failure.

Flags: needinfo?(johnmas)

Andrew, thanks for your comments and questions (Comment 37) following up on DNS Operator.

Microsoft PKI Services is happy to retain a secondary check with the Google tools as specified in Step 3 below and will include it back into our process.

To be more specific about our method of checking DNS Operator: for every instance in which Microsoft PKI Services uses the DNS Operator exception, we will follow the process outlined below at certificate issuance. We updated Step 3 below to be more specific: the registered domain owner for Microsoft verifies that they have edit rights for each DNS zone specified in the certificate request.

Regarding references to Sections 3.2.2.4.3 and 3.2.2.4.6 in our CPS, we do not believe there is any problem with these remaining in our CPS. We specifically left them in the document to clarify the processes used to validate domains of certificates that were issued while these methods were still allowed. To be clear, our processes stopped using these methods to issue certificates on or before the dates that the BRs specify that they were no longer allowed (3.2.2.4.3 was not used after May 31, 2019, and 3.2.2.4.6 was not used after June 3, 2020). We do agree that we should be more explicit in explaining this in our CPS, and we are in the process of updating the document now; version 3.1.8 should be published soon (certainly before the end of February and likely next week, the week of February 22).

Updated process for clarification:
When going through the manual exception process to issue a certificate with the DNS Operator exception, Microsoft PKI Services performs the following at certificate issuance:

  1. Acknowledge Emergency Request from internal customer about failing certificate automation ticket(s).
  2. Manually look up registered owners of the domain(s) using our Domain Registrar feed or WhoIs Domain Lookup.
  3. Contact the registered domain owner (per CAB Forum BR method 3.2.2.4.2, 3.2.2.4.4 or 3.2.2.4.7) and ask them to verify that Microsoft owns and operates the domains that are listed in the certificate SAN. Specifically, the registered domain owner for Microsoft verifies that they have edit rights for each DNS zone specified in the request.
    At this point we establish that we can take the DNS Operator exception (if Microsoft owns the Domain and Operates the DNS server), or we will deny the certificate request.
    In addition, we also check with a third party using one or both tools (i) https://dns.google.com/ or (ii) https://toolbox.googleapps.com/apps/dig/. We do this to ensure that the larger ecosystem is up to date and there are no discrepancies. If discrepancies are found, the team works to resolve them on a separate thread.
  4. Manually perform a CAA Record Check to ensure that we can issue for that domain. This step is used to investigate why the automated CAA check was failing, so that the requestor can be notified to update CAA records to allow automation for future requests.
  5. If step 3 determined we will take the DNS Operator exception, issue the certificate. This step includes a pre and post issuance lint.
  6. Update each ticket with outcome and evidence as necessary and resolve.
Flags: needinfo?(johnmas)

(In reply to John Mason from comment #38)

  3. Contact the registered domain owner (per CAB Forum BR method 3.2.2.4.2, 3.2.2.4.4 or 3.2.2.4.7) and ask them to verify that Microsoft owns and operates the domains that are listed in the certificate SAN. Specifically, the registered domain owner for Microsoft verifies that they have edit rights for each DNS zone specified in the request.

Can you provide more detail about this process?

In particular, do you group multiple Authorization Domain Names/Fully Qualified Domain Names together to verify, or do you verify each individually?

I ask, because one (BR-violating) example that I've seen from CAs in the past is to group all such verifications together, into a single e-mail/challenge, and require that they get "a" response indicating the request is authorized. The problem with this approach is that if Domain 1 has Operators (A, B) and Domain 2 has Operators (B, C), it potentially allows A to approve Domain 2 / C to approve Domain 1, both of which would be misissuances.
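To make the grouping problem concrete, here is a minimal sketch. The domain names, operator labels, and both functions are hypothetical and exist only to illustrate the scenario described above; they do not represent any CA's actual tooling.

```python
# Hypothetical data: which parties are authorized for which domain.
OPERATORS = {
    "domain1.example": {"A", "B"},
    "domain2.example": {"B", "C"},
}


def grouped_approval(domains, responder):
    """Flawed approach: one response approves the whole batch as long as the
    responder is authorized for at least one domain in it."""
    if any(responder in OPERATORS[d] for d in domains):
        return set(domains)
    return set()


def per_domain_approval(domains, responder):
    """Each domain is approved only if the responder is authorized for that
    specific domain."""
    return {d for d in domains if responder in OPERATORS[d]}


request = ["domain1.example", "domain2.example"]
print(grouped_approval(request, "A"))     # A ends up approving domain2.example too
print(per_domain_approval(request, "A"))  # only domain1.example is approved
```

With the grouped check, a response from A covers domain2.example even though A has no authority over it, which is the mis-issuance case described above.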

Note that the term "DNS Operator" is used in a precise form, and simply using a BR validation method should not be presumed to be equivalent as confirming DNS operator. In particular, I do not believe it can be concluded on the basis of 3.2.2.4.7 that the entity is the "DNS Operator" in the precise sense of the RFC 7719 language, as a general rule.

As with Andrew, I agree that a more formal incident for the failure to update the CPS is warranted. As Relying Parties rely on CPSes to understand how a CA operates, and Certificate Policy OIDs explicitly exist to track these changes over time, I'm a little concerned with the "left it in" side. However, I think it's useful for the community to understand how Microsoft manages and reviews its CP/CPSes annually, to better understand if there's risk of similar inclusions of not-actually-practiced behaviors.

Flags: needinfo?(johnmas)

Ryan thanks for your comments and questions (Comment 39) following up on DNS Operator and our CPS.

Microsoft PKI Services does not plan to use the DNS Operator Exception as a normal course of business. We would only use it as a last-ditch effort to try to avoid certificate outages in unforeseen circumstances. However, if the BRs allow this exception, we must reserve the right to use it and help our customers maintain their business. To get to your specific question on grouping multiple Authorizations, we would perform the check individually for every SAN in the certificate request (during Step 3).

Regarding your comments on 3.2.2.4.7 we are not using this method, or the other methods, to verify DNS operator. The way our internal domain management tools function is that the registered domain owner for Microsoft can only edit the DNS zone if Microsoft is the DNS operator of that zone.

Regarding the CPS update, we have created a new incident for the failure to update the Domain Validation Method, that reference is https://bugzilla.mozilla.org/show_bug.cgi?id=1693932. Thank you both for the recommendation.

Flags: needinfo?(johnmas)

It has been a frustrating sequence of events trying to ascertain what process Microsoft has for the DNS Operator exception:

Comment 14: Microsoft states that the July domain validation process used https://dns.google.com/ for determining the DNS Operator

Comment 15: I ask for clarification about the DNS Operator check. I explicitly ask how the check was implemented for the other (non-July) domain validation processes

Comment 18: Microsoft does not answer how the check was implemented for the other domain validation processes

Comment 30: I ask again about the process for the DNS Operator check

Comment 31: Microsoft states that the DNS Operator check is implemented with https://dns.google.com/ or https://toolbox.googleapps.com

Comment 35: I express concern that using https://dns.google.com/ or https://toolbox.googleapps.com is a Delegated Third Party

Comment 36: Microsoft states that https://dns.google.com/ and https://toolbox.googleapps.com are only a secondary check, but does not explain what the primary check is

Comment 37: I ask what the primary check is

Comment 38: Microsoft seems to say that BR methods 3.2.2.4.2, 3.2.2.4.4 or 3.2.2.4.7 are used to implement the DNS Operator check

Comment 39: Ryan Sleevi expresses concern about using 3.2.2.4.7 for the DNS Operator check

Comment 40: Microsoft states that 3.2.2.4.7 is not actually used to implement the DNS Operator check

It remains unclear what process, if any, Microsoft has for the DNS Operator exception. I think it would be helpful if Microsoft replied with a comment that focused just on the process for validating that a particular domain's DNS Operator is Microsoft, rather than the overall process for exceptional issuances. The comment should exclude secondary/superfluous checks like consulting dns.google.com and focus just on the steps necessary to comply with the BRs. The comment should provide enough detail for others to assess its correctness. I think this would help provide assurance that this incident will not be repeated in the future.

Flags: needinfo?(johnmas)

(In reply to John Mason from comment #38)

Specifically, the registered domain owner for Microsoft verifies that they have edit rights for each DNS zone specified in the request.

(In reply to John Mason from comment #40)

Regarding your comments on 3.2.2.4.7 we are not using this method, or the other methods, to verify DNS operator. The way our internal domain management tools function is that the registered domain owner for Microsoft can only edit the DNS zone if Microsoft is the DNS operator of that zone.

I think there are still concerning gaps in the explanation, or at least an incomplete picture, which continue to raise concerns about potential or likely non-compliance here. That would be a critical security failure, as this issue already is: a failure in one of the most important parts of being a CA.

That is, I can see a very problematic implementation using the description so far, in which, say, johndoe@microsoft.com requests a certificate for google.com. You're unable to verify the CAA record (naturally), and so using a method like 3.2.2.4.7, you e-mail johndoe@microsoft.com and ask if they can edit the records for google.com on Microsoft's DNS servers (i.e. not the records listed in the canonical NS/SOA). If they say they can, well, you trust that they are domain administrator.

Part of the reason for this challenge is it feels like you're giving clearly conflicting answers. For example, Comment #40 states:

Regarding your comments on 3.2.2.4.7 we are not using this method, or the other methods, to verify DNS operator.

While Comment 38 states:

Contact the registered domain owner (per CAB Forum BR method 3.2.2.4.2, 3.2.2.4.4 or 3.2.2.4.7) and ask them to verify that Microsoft owns and operates the domains that are listed in the certificate SAN

Which of these is correct and true? And, to my example above, the problem still exists with using .4.2 or .4.4 if, for example, Microsoft were to configure the MX records for google.com to point to a Microsoft domain. Now, we both might agree that would be silly and completely and totally BR violating, but that's the risk as described for your process of determining the Domain Operator. That's why Andrew has had to ask multiple times now for you to describe your process, and why I have to reiterate the request for more concrete technical detail. If I understand Andrew's concern correctly, then I share it as well: there is a subtle amount of complexity here that we both expect the CA to understand and to implement correctly, hence the desire for detail on the specific process used.

Regarding this comment:

However, if the BRs allow this exception, we must reserve the right to use it and help our customers maintain their business.

Ultimately, yes, to the extent that it's not (yet) forbidden, you are free to use it. I think the concern I, and others, would share, is that when a CA demonstrates improper judgement or implementation that leads them to cause an incident, and the incident report doesn't really provide much assurance that it won't be repeated, one of the things that the CA can do to remove any uncertainty is to commit to not using error-prone or risky actions, even if "technically" they're allowed to. The BRs represent a floor for operation, not a ceiling, and we look for all CAs to recognize risks and mitigate them. These incident reports are the chance for the CA to assure the community that "Even if we technically can, we recognize the risk, so we commit not to use it, and that's how you know we take the risk seriously." You can choose to keep using it (again, until it's forbidden), but you should realize that does not instill confidence in the judgement of risk in this case, or in general.

Ryan and Andrew thanks for your comments and questions (Comments 41 and 42).

We apologize for the frustration that our answers have caused, and we agree with Andrew that we will focus solely on the determination that an FQDN is operated by Microsoft DNS Servers (Step 3 of Comment 38 above). We believe this confusion was created because, when we determine owner and operator, we work with the same team at Microsoft, our Corporate Domains Team. For all Microsoft-owned domains this team represents the registered domain owner, typically Microsoft Corporation. They have developed a service to manage Microsoft-owned domains. This Microsoft-internal service is used to provide confirmation that Microsoft owns and operates the domain in question.

The procedure to determine DNS Operator starts with an email to Microsoft’s DNS operator, the Corporate Domains Team in our case. This is all done on a per FQDN basis for each domain within the certificate request SAN.

  1. For each FQDN, the Microsoft Corporate Domains Team determines the zone file using the proprietary service. They use a DNS client and query for the authoritative name servers for the zone.
  2. They then verify the zone is operated by a Microsoft owned name server platform by comparing the query results against a dynamically maintained list of Microsoft operated DNS Servers.
  3. They verify that the authoritative zone exists on an internal Microsoft Azure subscription.
  4. If all these steps are successful, then we consider the FQDN to be operated by Microsoft DNS Servers.
  5. They then send an email confirmation back to Microsoft PKI Services to indicate whether or not Microsoft is the DNS Operator for each FQDN.
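As a rough sketch only, the decision in steps 1 through 4 could be expressed as below. This is not the actual Microsoft service: every function name and parameter is hypothetical, the lookups are injected as callables so no real DNS library is assumed, and requiring every authoritative name server to appear on the operated list is an assumption, since the steps above only say the results are compared against the list.

```python
from typing import Callable, Iterable, Optional, Set


def normalize(host: str) -> str:
    """Lower-case a host name and drop any trailing dot."""
    return host.rstrip(".").lower()


def is_operated_internally(
    fqdn: str,
    find_zone: Callable[[str], Optional[str]],         # hypothetical: FQDN -> enclosing zone
    authoritative_ns: Callable[[str], Iterable[str]],  # hypothetical: zone -> NS host names
    operated_servers: Set[str],                        # hypothetical list of operated servers
    zone_is_hosted: Callable[[str], bool],             # hypothetical: zone exists internally?
) -> bool:
    # Step 1: determine the zone for the FQDN and its authoritative name servers.
    zone = find_zone(fqdn)
    if zone is None:
        return False
    name_servers = {normalize(h) for h in authoritative_ns(zone)}

    # Step 2: compare against the list of operated name servers.
    # Assumption: ALL name servers must be on the list, and an empty answer fails.
    known = {normalize(h) for h in operated_servers}
    if not name_servers or not name_servers <= known:
        return False

    # Step 3: the zone must also exist on the internal hosting platform.
    # Step 4: only if every check passes is the FQDN treated as internally operated.
    return zone_is_hosted(zone)
```

The check is deliberately written per FQDN and fails closed: a missing zone, an empty NS answer, or a single unrecognized name server all yield a negative result.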

Per Ryan’s example of the google.com domain, we will perform a new domain validation check for each certificate issuance. This example request would fail our check for Microsoft ownership of the domain, prior to the DNS operator check described in the steps above. In Comment 38 we describe the full process and in Step 3, on a per FQDN basis, we work with the Corporate Domains Team to determine owner first and then operator.

Flags: needinfo?(johnmas)

John, thanks for the additional detail in Comment 43. I have a few questions:

  1. Is the Corporate Domains Team in scope of Microsoft PKI Services' audits, and are the personnel who perform this procedure trained in accordance with BR 5.3.3?

  2. Could you provide more detail about the DNS queries in step 1? What domain name(s) and resource record type(s) are queried? Is the DNS client a stub resolver or a full recursive resolver? If it's a stub resolver, what recursive resolver is used?

  3. How is the list of Microsoft-operated DNS Servers described in step 2 maintained? Is there a process to prevent BygoneSSL-style problems where a particular DNS server is still provisioned internally, but in the public DNS, the DNS server's domain name points somewhere else?

Flags: needinfo?(johnmas)

One week has passed with no response to Comment 44. As a reminder, https://wiki.mozilla.org/CA/Responding_To_An_Incident#Keeping_Us_Informed states:

Once the report is posted, you should respond promptly to questions that are asked, and in no circumstances should a question linger without a response for more than one week, even if the response is only to acknowledge the question and provide a later date when an answer will be delivered

Note that one week is just the upper bound, and I would hope to see more prompt responses going forward.

Andrew, with respect to your additional questions (Comment 44):

  1. Is the Corporate Domains Team in scope of Microsoft PKI Services' audits, and are the personnel who perform this procedure trained in accordance with BR 5.3.3?
  • All Microsoft PKI Services processes are in scope of internal and external audits. Regarding Validation Specialists training, the Corporate Domains Team are in scope of that training.
  2. Could you provide more detail about the DNS queries in step 1? What domain name(s) and resource record type(s) are queried? Is the DNS client a stub resolver or a full recursive resolver? If it's a stub resolver, what recursive resolver is used?
  • They query the FQDN that is in the SAN, and the NS Record type. It is a full recursive resolver.
  3. How is the list of Microsoft-operated DNS Servers described in step 2 maintained? Is there a process to prevent BygoneSSL-style problems where a particular DNS server is still provisioned internally, but in the public DNS, the DNS server's domain name points somewhere else?
  • Every 24 hours a query is run against the list of Microsoft-owned domains to verify that the list is accurate.

Microsoft PKI Services believes it is prudent to set up our systems to allow us to use the DNS Operator exception to protect our customers and provide business continuity in case of emergency. We see very few use cases where we might ever use this and expect it will only be under extraordinary circumstances. We are delving into areas that are not defined in the BRs and believe it would be more constructive to have these conversations and work to define them in the BR working group.

Flags: needinfo?(johnmas)

They query the FQDN that is in the SAN, and the NS Record type. It is a full recursive resolver.

Although this description omits many details, based on the description it does not work. The resolver will follow CNAME records, which means that if example.com's DNS is operated by ACME Corp, and www.example.com is a CNAME to microsoft.com, the NS record lookup on www.example.com will return Microsoft name servers rather than ACME name servers. This would permit Microsoft to skip CAA checking for a SAN of www.example.com even though Microsoft is not the DNS Operator.

Furthermore, NS records only exist at the zone apex, so if the SAN is not a zone apex, the answer to an NS query will be empty. This means that this procedure would only work in a narrow case. This raises a concern that during an "emergency" situation, Microsoft would be forced to improvise a different procedure.
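Both problems can be shown with a toy model. The record set and both functions below are made up for illustration and do not talk to real DNS; `naive_ns_query` mimics the described lookup by following CNAMEs, and `zone_apex_ns` is one hypothetical alternative that walks up the labels of the original name instead.

```python
# Toy DNS data, entirely hypothetical.
RECORDS = {
    ("example.com", "NS"): ["ns1.acme.example"],
    ("www.example.com", "CNAME"): ["microsoft.com"],
    ("microsoft.com", "NS"): ["ns1.msft.example"],
    ("api.example.com", "A"): ["192.0.2.10"],
}


def naive_ns_query(name):
    """NS query as described: CNAMEs are followed before answering."""
    seen = set()
    while (name, "CNAME") in RECORDS and name not in seen:
        seen.add(name)
        name = RECORDS[(name, "CNAME")][0]
    return RECORDS.get((name, "NS"), [])


def zone_apex_ns(name):
    """Walk up the labels of the original name, without following CNAMEs,
    until a name with an NS record set is found."""
    labels = name.split(".")
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if (candidate, "NS") in RECORDS:
            return candidate, RECORDS[(candidate, "NS")]
    return None, []


print(naive_ns_query("www.example.com"))  # name servers of the CNAME target
print(naive_ns_query("api.example.com"))  # empty: not a zone apex
print(zone_apex_ns("www.example.com"))    # apex example.com and ACME's name server
```

In the toy data the naive query attributes www.example.com to the CNAME target's name servers and returns nothing for api.example.com, while the label walk finds example.com and ACME's name server in both cases. A real implementation would need to locate the zone cut through actual DNS queries rather than a dictionary lookup.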

Every 24 hours a query is run against the list of Microsoft-owned domains to verify that the list is accurate.

There is no detail in this answer to evaluate whether it's sufficient to prevent the BygoneSSL-style problems which motivated the question.

Microsoft PKI Services believes it is prudent to set up our systems to allow us to use the DNS Operator exception to protect our customers and provide business continuity in case of emergency. We see very few use cases where we might ever use this and expect it will only be under extraordinary circumstances.

Microsoft can certainly choose to use the exception as long as it's not forbidden, but as Ryan said in Comment 42 this does not instill confidence in Microsoft. The continued use of the exception, combined with Microsoft's decision to use ADCS, despite ADCS lacking critical functionality (domain validation and CAA checking) for a publicly-trusted certificate authority, suggest that Microsoft is bad at judging risk when it comes to running a CA. Ceasing use of the DNS Operator exception would be an excellent way for Microsoft to restore confidence.

We are delving into areas that are not defined in the BRs and believe it would be more constructive to have these conversations and work to define them in the BR working group.

While the BRs don't say how to implement the DNS Operator exception, DNS Operator is a defined term and the lack of implementation guidance in the BRs is not a blank check for a CA to do whatever they want. Rather, it means the CA has to do extra work to develop a correct implementation. As always, the CA should be prepared to describe in detail their implementation, both to assure the community that it's correct, and to help other CAs avoid mistakes in their implementations. Quoting Ryan Sleevi in Bug 1695786 Comment 3, "a good incident report provides sufficient detail that an independent implementation of 'the thing' (e.g. the automatic control validation) could be implemented by another CA. The reason this is the target level of detail is because it helps all CAs examine their systems to see if they've taken a similar design and suffer similar issues - but the only way to achieve that is by providing sufficient detail."

Sadly, Microsoft has not taken the opportunity in this bug to do so. It does seem, based on the delays and lack of detail in Microsoft's comments, that this bug is becoming unproductive and as such I have no questions at this time, except for the ones I've already asked and received vague answers to. Should this bug be closed as-is, Microsoft's (non-)response should be noted for the record and factor in to the question of continued trust should Microsoft have further incidents in the future. Meanwhile, Mozilla should forbid the DNS Operator exception as it has done with other parts of the BRs that contribute to risky CA behavior.

As with Andrew, I do feel that we've exhausted all productive engagement, and we're unlikely to see an appropriately improved and detailed incident report from Microsoft, which is profoundly unfortunate. I have no further questions, because I don't feel like we're at a point where they would be answered sufficiently to address the underlying concerns.

While this incident has done significant harm to the confidence and trust in Microsoft's CAs, ultimately, these incident reports cannot continue indefinitely; as the proverb goes, you can lead a horse to water, but you can't make them drink. I can understand and appreciate that operational and organizational maturity takes time to build, but there are certain minimum levels expected of public CAs, as necessary to ensure users are safe.

I'm setting N-I for Ben to see if Mozilla has further questions. The conclusion Andrew has reached seems consistent with my impressions thus far: that the minimum necessary to protect users is to immediately work to forbid this DNS Operator exception via root programs, and then potentially the CA/Browser Forum.

Flags: needinfo?(bwilson)

Microsoft PKI Services very much appreciates and values the constructive feedback that we have received during this process and agrees that there should be a high quality bar for Public CAs. Although we thought it was prudent to retain the option for our customers' sake, we agree that the potential use of the DNS Operator Exception is problematic, and we will remove it from our CPS. We could have done a better job of explaining our process along the way, and in the end we were swayed by the feedback and will remove it. We are committed to the use of the CAA check, have no plans to use the DNS Operator Exception, and tremendously regret the consternation this discussion has caused. We will follow up on this bug when our CPS has been updated (we anticipate that it will be in about three weeks).

Flags: needinfo?(bwilson)
Whiteboard: [ca-compliance] Next Update 2021-02-01 → [ca-compliance] Next Update 2021-04-19

Hello,
Could you please explain why https://crt.sh/?id=3516925334&opt=ocsp is OCSP Unknown?

Attached file 3516925334_ocsp.txt

Michel: Questions like Comment #50 are best addressed as new incident reports. However, it's worth noting that the certificate you're querying for is expired, and so this may not be an incident.

Regarding Michel’s question in Comment #50: this is expected for the certificate in question. OCSP responses for a TLS certificate will return "revoked" for revoked certificates that have NOT expired. OCSP responses for a TLS certificate will return "unknown" once the certificate HAS expired.

Additionally, as noted in our previous response, we have removed the DNS Operator exception from our CPS and updated our repository (https://www.microsoft.com/pkiops/docs/repository.htm). The updated CPS is called Microsoft PKI Services Certification Practices Statement v3.1.9.

The Microsoft PKI Services team is not using the DNS Operator exception anymore and has removed the allowance from our CPS.

This issue (along with others from our partner team in Microsoft) has caused us to re-evaluate the entire verification process and we will continue to focus in this area with automation.

With that said, is there anything else needed before closing this bug?

I will close this bug on or about Friday, 4-June-2021, unless there are any reasons to keep it open.

Flags: needinfo?(bwilson)
Status: ASSIGNED → RESOLVED
Closed: 2 months ago
Flags: needinfo?(bwilson)
Resolution: --- → FIXED