Open Bug 1670337 Opened 4 months ago Updated 4 days ago

Microsoft PKI Services: Certificate Mis-Issuance, DNSNames must have a valid TLD

Categories

(NSS :: CA Certificate Compliance, task)

Tracking

(Not tracked)

ASSIGNED

People

(Reporter: johnmas, Assigned: johnmas)

Details

(Whiteboard: [ca-compliance])

Attachments

(8 files)

Type: defect → task
  1. How your CA first became aware of the problem.

We were notified by a partner via email, to our "CentralPKI@microsoft.com" alias, on 10/8/20 at 10:08 PM PDT that there was an issue with public TLS certificates that we had issued. We opened an internal Incident at approximately 1:53 AM PDT on 10/9/20 and have been actively managing the incident since then.

  2. A timeline of actions your CA took in response.

    A. Microsoft was notified about an issue with these certificates by an outside partner on 10/8/20 at 10:08 PM PDT. These certificates were issued by Microsoft CAs that are cross-signed by DigiCert, and DigiCert was a part of the original notification.
    B. An internal MS incident was created at 1:53 AM PDT on 10/9/20 and our teams engaged.
    C. Identified issue was linked to one second level domain that was onboarded for a customer that should not have been, because it is not public, at around 7:30 AM PDT on 10/9/20.
    D. The “bad” domain was removed from the production system at approximately 11:00 AM PDT on 10/9/20. This removed the domain as an option in our tooling, effectively preventing any new “bad” certificates from being created.
    E. Customers were notified at 11:42 AM PDT on 10/9/20 that certificates issued to the domain would be revoked at approximately 2:00 PM PDT on 10/9/20.
    F. At approximately 2:30 PM PDT on 10/9/20 the 2 initial certificates that were identified and 6 more (that we subsequently identified), were revoked (see attached certificates). We will continue to investigate for more certificates (and precertificates) to revoke.
    G. Created this Bug in Bugzilla.

  3. Whether your CA has stopped, or has not yet stopped, issuing certificates with the problem.

As noted in the timeline we have stopped issuing certificates with this problem.

Preliminary understanding of the root cause is that the problem is limited to one second level Domain that was mistakenly added to our production system for issuing valid public TLS certificates, as it was not a public domain (ap.gbl). Certificates that were issued to this domain as public were mis-issued. At approximately 11:00 AM on 10/9/20 we removed this domain from our system and certificates can no longer be issued.

  4. A summary of the problematic certs.

The certificates were issued to a domain that was not public and thus failed the DNSNames must have a Valid Top-Level Domain (TLD) check (see attached certs for all known impacted certificates).

  5. The complete certificate data for the problematic certificates.
    Attached to this bug. (as known at this time)

  6. Explanation about how the mistakes were made or bugs introduced, and how they avoided detection until now.

Microsoft PKI Services began issuing Public TLS certificates in July 2018, but to a relatively small number of public domains. In April 2020 we began to onboard many more users, and thus many more domains. Preliminary indications are that the process we used to onboard domains in April 2020 was insufficient and did not adequately verify that all the domains that were onboarded were public. The “bad” domain mentioned above was onboarded during the April 2020 upload of new domains. Since the April 2020 onboarding process, in July 2020, we have more completely and better documented this process to include improved public domain checks.

The mis-issued certificates escaped detection after creation because there is an issue with our internal Linting tool and TLD detection. The Linting tool did not detect the problematic Top-Level Domain (TLD). Now that we are aware of this issue with the Linting tool, we have taken this as a repair item and will include updates on this repair shortly.
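To make the check under discussion concrete, here is a minimal sketch of a "DNSNames must have a valid TLD" lint. It is not Microsoft's linting tool; the function names and the tiny `KNOWN_TLDS` set are assumptions for illustration, and a real linter would load the full, current IANA root zone list.

```python
# Hypothetical, deliberately tiny snapshot of the public TLD list.
# A real linter would load the complete, up-to-date IANA list instead.
KNOWN_TLDS = {"com", "net", "org", "ms"}


def has_valid_tld(dns_name: str, known_tlds=KNOWN_TLDS) -> bool:
    """Return True if the right-most label of dns_name is a known public TLD."""
    labels = dns_name.rstrip(".").lower().split(".")
    # Reject bare labels and names containing empty labels.
    if len(labels) < 2 or not all(labels):
        return False
    return labels[-1] in known_tlds


def lint_dns_names(dns_names):
    """Return the dNSName values that fail the TLD check."""
    return [name for name in dns_names if not has_valid_tld(name)]
```

With this sketch, any name ending in `.gbl` (such as a hypothetical `host.ap.gbl`) is reported, because `gbl` is not in the TLD set.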

  7. List of steps your CA is taking to resolve the situation and ensure such issuance will not be repeated in the future, accompanied with a timeline of when your CA expects to accomplish these things.

As this incident is still young, we are working through the repair items and mitigation steps currently. We will report back to this bug as they are identified and addressed over the next week or so.

Our initial list at this point includes:
+ We will revalidate that all domains in our tools are public (approximately 10/16/20)
+ We will ensure that we have identified and revoked all impacted certificates (approximately 10/16/20).
+ We are continuing to automate the domain validation process to remove any manual errors. We expect this to be in place by end of next week (approximately 10/16/20).
+ Addressing issues with our Linting tool and TLD detection (TBD).

Assignee: bwilson → johnmas
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Whiteboard: [ca-compliance]

John,

Thanks for filing this. While Microsoft's response here is more commendable than most, I'd like to highlight that a failure to adequately do domain control validation is one of, if not the, single most troublesome incidents a CA can have (ignoring, of course, intentional actions). So I can hope you appreciate just how worrisome this is, and how essential it is to make sure the public has all the facts, and CAs have all the knowledge, to take appropriate actions.

I'd like to focus on the following snippets:

(In reply to John Mason from comment #9)

C. Identified issue was linked to one second level domain that was onboarded for a customer that should not have been, because it is not public, at around 7:30 AM PDT on 10/9/20.
D. The “bad” domain was removed from the production system at approximately 11:00 AM PDT on 10/9/20. This removed the domain as an option in our tooling, effectively preventing any new “bad” certificates from being created.

<snip>

Preliminary understanding of the root cause is that the problem is limited to one second level Domain that was mistakenly added to our production system for issuing valid public TLS certificates, as it was not a public domain (ap.gbl). Certificates that were issued to this domain as public were mis-issued. At approximately 11:00 AM on 10/9/20 we removed this domain from our system and certificates can no longer be issued.

<snip>

Microsoft PKI Services began issuing Public TLS certificates in July 2018, but to a relatively small number of public domains. In April 2020 we began to onboard many more users, and thus many more domains. Preliminary indications are that the process we used to onboard domains in April 2020 was insufficient and did not adequately verify that all the domains that were onboarded were public. The “bad” domain mentioned above was onboarded during the April 2020 upload of new domains. Since the April 2020 onboarding process, in July 2020, we have more completely and better documented this process to include improved public domain checks.

There's a lot here that I see worrying, and I'd like to call out a few specific details:

  1. Given this domain is not public, how does Microsoft perform domain control validation? That is, this should not have met any of the BR requirements for domain control validation, and so that is the first control that should have caught this.

  2. What was the process to onboard domains in April 2020?

  3. What was deficient about the process?

  4. What controls did you have in place when onboarding domains, in addition to domain control validation?

  5. What were the old public domain checks, since they were improved?

  6. The response here implies it was a lack of documentation ("more completely and better documented this process"), but provides no insight into the existing documentation or how it's been improved. This is touched on in Responding to an Incident, as this response feels very similar to the example given in there:

    For example, it’s not sufficient to say that “human error” or “lack of training” was a root cause for the incident, nor that “training has been improved” as a solution.

    Can you attach the old documentation and process and provide the new documentation/process, to help better understand the changes?

Flags: needinfo?(johnmas)

(In reply to John Mason from comment #9)

As noted in the timeline we have stopped issuing certificates with this problem.

How did you do CAA checks for all these certificates? How can you be sure now that your CAA checks for all your other certificates were correctly done?

This incident and Microsoft's response are deeply concerning. Not performing domain validation is one of the most serious failures a CA can experience. Incidents like this, coupled with poor incident response and other compliance issues (of which Microsoft has had several), have led to the distrust of other CAs such as Symantec and Certinomis. Unfortunately, Microsoft's incident report in Comment 9 identifies the problem not as a failure to perform domain validation (or check CAA) but as a failure to check that the TLD is public, which calls into question Microsoft's awareness of the most important rules that a CA must follow.

In addition to the questions already posed by Comment 10 and Comment 11, I would ask Microsoft to answer:

  1. Once a domain is "onboarded", for how long does the domain remain in the production system and able to be used in new certificates?

  2. Based on Comment 9, my assumption is that the only validation that Microsoft performs prior to issuance is checking that the domains in the certificate were previously onboarded. Is this correct? If not, what steps, precisely, does Microsoft perform to validate the DNS names in the certificate request? Given the serious nature of this incident, the more detail that can be provided the better.

The mis-issued certificates escaped detection after creation because there is an issue with our internal Linting tool and TLD detection. The Linting tool did not detect the problematic Top-Level Domain (TLD). Now that we are aware of this issue with the Linting tool, we have taken this as a repair item and will include updates on this repair shortly.

Is the linting tool that Microsoft uses an internally developed/proprietary solution? As a follow-up item beyond repairing your implementation have you considered augmenting that linting tool with additional linters? ZLint, as one example, regularly consumes up-to-date TLD information as part of the normal release cycle and flags this class of problem (e.g. this error level finding).

Ryan (Comment 10), Paul (Comment 11), Andrew (Comment 12), and Daniel (Comment 13) thanks for your questions. We will add more context and then answer the questions.

In July 2018 Microsoft began issuing TLS certificates under a new Root CA program (using the following two Roots, http://www.microsoft.com/pkiops/certs/Microsoft%20RSA%20Root%20Certificate%20Authority%202017.crt and http://www.microsoft.com/pkiops/certs/Microsoft%20ECC%20Root%20Certificate%20Authority%202017.crt) from Microsoft PKI Services [MS PKI]. Currently the program scope is limited to issuing TLS certificates for Microsoft owned name spaces (public and private). The vast majority of public TLS certificates are issued under a separate MS program under a Baltimore root, from Microsoft DSRE PKI [MS DSRE].

In July 2020 the Delegated OCSP Responder incident impacted roughly 7 million certificates (https://bugzilla.mozilla.org/show_bug.cgi?id=1649951), all of which needed to be replaced. Both root programs were called upon to service high volumes of replacement certificates, which included process changes for the new MS root CA program. This required an update to MS PKI's Domain Validation processes in order to add automation and scale. Additional updates to this process are under way to close gaps that led to the previous certificate mis-issuance.

To answer Ryan's questions (Comment 10):

 1.  Given this domain is not public, how does Microsoft perform domain control validation? That is, this should not have met any of the BR requirements for domain control validation, and so that is the first control that should have caught this. 

[MS PKI] In April 2020 when this "bad" domain was added to our domain validation cache (for 825 days), we were using domain control validation as stated in CAB Forum BR 3.2.2.4.2 (Email, Fax, SMS, or Postal Mail to Domain Contact). Unfortunately, while manually validating ~400 domains, this "bad" domain was missed. (More details below)

 2.  What was the process to onboard domains in April 2020? 

[MS PKI] In April 2020 our process allowed any internal Microsoft user to request domains from our service with the requirement that the domains were owned by Microsoft, our process was as follows:

 1.  (After receiving request to add domains, via email from internal Microsoft customers) Validate the owner of the Domain through at least 1 of 2 options, a manual ICANN Registration lookup or WhoIs Domain lookup.   

 2.  Send Domains list for Domain Validation to Domain Contact inside Microsoft via email and await results. 

 3.  Validate the domain results email from the Domains Contact and update MS PKI Toolset domain validation cache. 

MS PKI manually confirmed ~400 domains using the above process on April 14 and we missed the "bad" domain in question (the team did reject 20 other domains during this particular process but missed the one). By July 2020 we had only used this April manual process and the July manual process (described below) about 10 times to validate a total of ~500 domains. The root cause of this issue happened during the April 14 domain validation process.

 3.  What was deficient about the process? 

[MS PKI] Our process lacked the ability to scale and did not allow the team to manually process many domain validations at once. As we iterated on the process manually between April and July, we added a Tracking Database that helped the team manage the domains as they progressed through each step in the domain validation process. This improved our ability to ensure that no domain missed a step. As we improved the process, we also added redundant controls and checks that produced better results (see the July process described below, for example the DNS Operator check and the CAA check during domain validation). At this time, we believe that the issues in our initial domain validation cache of ~500 resulted from problems our team had during the April 14 domain validation check.

By July 2020 we had iterated to the following process to manually check domain control validation while adding domains to, or removing them from, the MS PKI Toolset domain validation cache (our internal users would create a ticket to add or remove a domain):

  1. Query for open Domain Validation requests (or removes)

  2. Acknowledge Open tickets

  3. Check the list of Validated Domains (or removes) against the Tracking database. Update Tracking Database as appropriate (add/remove).

  4. Look up registered owners of the new domain(s) (2 Options, ICANN Registration lookup or WhoIs Domain Lookup). Domain MUST be Owned by Microsoft. Update Tracking Database as needed.

  5. Look up DNS Operator using https://dns.google.com/. Update Tracking Database as needed.

  6. Perform a CAA Record Check to ensure that we can issue for that domain. Update Tracking Database as needed.

  7. Perform a Domain Validation request for all new Domains in Tracking Database with the Registered Domain Contact. Update Tracking Database with results from domain contact.

  8. Use Tracking Database updates to add new domains and remove appropriate domains from the domain validation cache in MS PKI Toolset.

  9. Upon Confirmation that Toolset is updated, update the Tracking Database to reflect confirmed changes to Domains.

  10. Update each ticket and resolve.

In hindsight our April process did not have enough redundancy or controls to ensure domains were validated thoroughly (with redundant checks that would help spot oversights).
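The Tracking Database idea (no domain reaches the cache unless every step was recorded, in order, and passed) can be sketched as follows. This is only an illustration under assumptions: the step identifiers paraphrase steps 4 to 7 of the July process, and `DomainRecord` is a hypothetical structure, not the schema MS PKI uses.

```python
from dataclasses import dataclass, field

# Hypothetical step identifiers paraphrasing the July manual process.
STEPS = (
    "owner_lookup",         # ICANN/WHOIS lookup: domain must be Microsoft owned
    "dns_operator_lookup",  # who operates DNS for the domain
    "caa_check",            # CAA records permit issuance
    "contact_validation",   # response from the registered domain contact
)


@dataclass
class DomainRecord:
    domain: str
    results: dict = field(default_factory=dict)  # step name -> passed?

    def record(self, step: str, passed: bool) -> None:
        """Record a step result, refusing skipped or out-of-order steps."""
        if len(self.results) == len(STEPS):
            raise ValueError("all steps already recorded")
        expected = STEPS[len(self.results)]
        if step != expected:
            raise ValueError(f"expected step {expected!r}, got {step!r}")
        self.results[step] = passed

    def ready_for_cache(self) -> bool:
        """True only if every step was recorded and every step passed."""
        return len(self.results) == len(STEPS) and all(self.results.values())
```

The design point is that a skipped step is an error at recording time, rather than something a reviewer has to notice afterwards.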

 4.  What controls did you have in place when onboarding domains, in addition to domain control validation? 

[MS PKI] The request for initial Domain Validations were required to come in through our request intake site which requires valid Microsoft credentials and second authentication factor to access. In addition, at certificate issuance our toolset checks that the domain is still in our domain validation cache, that the domain validation date is within 825 days, ensures that the certificate request adheres to compliant certificate profiles and that all domains in the request are well-formed.
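The issuance-time checks described here (name is well-formed, a covering domain is in the validation cache, and the validation is no older than 825 days) might look roughly like the sketch below. The cache layout (domain mapped to validation date), the well-formedness rule, and all function names are assumptions; wildcard names and certificate profile checks are left out.

```python
import re
from datetime import date, timedelta

MAX_VALIDATION_AGE = timedelta(days=825)  # reuse window stated above

# Assumed well-formedness rule: letter/digit/hyphen labels, 1-63 chars each.
_LABEL = re.compile(r"^(?!-)[a-z0-9-]{1,63}(?<!-)$")


def is_well_formed(dns_name: str) -> bool:
    labels = dns_name.lower().split(".")
    return len(labels) >= 2 and all(_LABEL.match(label) for label in labels)


def find_validated_domain(dns_name: str, cache: dict):
    """Return the longest cached domain equal to, or a parent of, dns_name."""
    labels = dns_name.lower().split(".")
    for i in range(len(labels) - 1):  # never test the bare TLD
        candidate = ".".join(labels[i:])
        if candidate in cache:
            return candidate
    return None


def check_request(dns_names, cache: dict, today: date):
    """Return (name, reason) rejections; an empty list means all names pass."""
    problems = []
    for name in dns_names:
        if not is_well_formed(name):
            problems.append((name, "malformed"))
            continue
        domain = find_validated_domain(name, cache)
        if domain is None:
            problems.append((name, "not in validation cache"))
        elif today - cache[domain] > MAX_VALIDATION_AGE:
            problems.append((name, "validation older than 825 days"))
    return problems
```

Note that a check of this shape accepts any name whose parent domain is in the cache. It cannot detect an entry that was added to the cache in error, which is consistent with the incident described in this bug.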

   5.  What were the old public domain checks, since they were improved? 

[MS PKI] The team would manually check the domains via an ICANN lookup or WHOIS to ensure that the domains belonged to Microsoft (we are currently only issuing certificates to internal Microsoft teams for Microsoft owned domains). Once the validators confirmed that the domains were owned by Microsoft, they would then send an email to the Microsoft domains team with a unique code.

One of the details that the April process lacked was that instead of sending the mail to the registered domain contact inside of Microsoft, our process was using the central domains contact at Microsoft. For the April 14 check, the central domains contact (domains@microsoft.com) is the registered domain contact for all but 5 of the domains. Our updated July process was improved to ensure the mail goes to the registered domain contact.

Most importantly it looks like step #1 (from April 2020) was missed, or only partially completed during the domain validation of ~400 domains on April 14. In addition, step #2 (from April 2020 process) returned a false positive for the “bad” domains.

 6.  The response here implies it was a lack of documentation ("more completely and better documented this process"), but provides no insight into the existing documentation or how it's been improved. This is touched on in Responding to an Incident, as this response feels very similar to the example given in there: 

  For example, it’s not sufficient to say that “human error” or “lack of training” was a root cause for the incident, nor that “training has been improved” as a solution. 

 Can you attach the old documentation and process and provide the new documentation/process, to help better understand the changes? 

[MS PKI] The old (April) manual process was as described above in the response to question #2. The newer (July) manual process is described above in the response to question #3. These processes accounted for ~500 domains being added to our domain validation cache.

The root cause of this incident is the lack of redundant controls in place to catch a bad domain before it is improperly validated. We believe the improvements made to the process between April and July were significant and provided the redundancy needed and diligence required to manually validate domains prior to adding them to our domains cache.

We believe this incident was limited to the improper validation that was performed on April 14 with the April manual process.

As described above, in early July the Delegated OCSP Responder incident impacted the CAs that our sister team, MS DSRE, was using. As a result of the Microsoft incident response, during the month of July, Microsoft made the decision to source Public TLS certificates for Microsoft owned/managed properties from MS PKI instead of MS DSRE. This transition is planned to take place from August 2020 through February 2021.

This made it clear that MS PKI needed to quickly scale the number of domains we allowed from ~500 to more than ~55k. Our existing processes, personnel and tools were not up to ramping that quickly. Instead we adopted the tools that our sister team, MS DSRE, had been using and refining for years in operating a CAB Forum BR compliant domain validation process.

At this point, based on what we now know, we should have cleared our initial domain validation cache of ~500 domains and started over. Instead, we retained the ~500 domains that were in the cache and added the additional ~55K domains as described below (the additional ~55k domains were validated in three large batches between August and September).

To validate the large batches between August and September, MS PKI developed and executed domain validation tooling that leveraged the MS DSRE domain validation tools. The MS DSRE team had a domain validation cache that they had already validated using at least one of the following CAB Forum BR methods: 3.2.2.4.2, 3.2.2.4.3, 3.2.2.4.4, 3.2.2.4.6 and/or 3.2.2.4.7. So, the MS PKI team leveraged that validation to populate the domain validation cache in the MS PKI Toolset (between August and September).

The updated August process, and the more automated domain validation process that is currently in place, for changes to our domain validation cache is as follows:

Background: Microsoft currently has four Registrars that provide a daily feed of all Public Domains Microsoft owns (or used to own). Changes in ownership are denoted by flags that our tools use to remove previously validated domains from MS PKI domain validation cache. Additionally, we stopped taking new domain requests from internal Microsoft teams. Instead we rely on our daily updates to our domain validation cache.

  1. Tools automatically pull daily Registrar feeds into MS PKI Toolset and identify Domains that are new or need to be removed.

  2. Check the list of new Microsoft domains against the existing MS PKI domain validation cache.

    a. If the domain is no longer owned/managed by Microsoft it is removed from the domain validation cache.

    b. If the domain is new and not in our existing cache, then proceed.

  3. Perform a Domain Validation request for all new Domains in daily registrar feed, by emailing the Registered Domain Contact with a unique code.

  4. Update domain validation cache with email results from registered domain contact’s response.
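Steps 1 and 2 above amount to a set comparison between the daily registrar feed and the validation cache. A minimal sketch, assuming the feed can be reduced to two sets of domain names (the parameter names and this representation are hypothetical, not the actual feed format):

```python
def plan_cache_changes(feed_owned: set, feed_released: set, cache: set):
    """Split one day's registrar feed into removals, validations, and leftovers.

    feed_owned:    domains the feed reports as currently owned
    feed_released: domains the feed flags as no longer owned
    cache:         domains currently in the validation cache
    """
    # Step 2a: cached domains that are no longer owned must be removed.
    to_remove = cache & feed_released
    # Step 2b: new domains are not added directly; they only become
    # candidates for the email challenge to the registered domain contact.
    to_validate = feed_owned - cache - feed_released
    # Cached domains the feed does not mention at all deserve a manual look.
    unaccounted = cache - feed_owned - feed_released
    return to_remove, to_validate, unaccounted
```

The third return value is an addition of this sketch rather than part of the described process. It shows how a cached name that no registrar reports, such as a non-public domain, would surface when the cache is reconciled against the feed.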

To answer Paul's question (Comment 11):

 How did you do CAA checks for all these certificates? How can you be sure now that your CAA checks for all your other certificates were correctly done? 

[MS PKI] Per CAB Forum BR 3.2.2.8 Microsoft is the DNS Operator, and we were using the DNS operator's exception when we performed our process. As described above we added a manual CAA check in our manual July process. A CAA check is a part of our post-July automated process during domain validation. CAA checks are not currently a part of our certificate issuance process, but we are now committed to adding it soon (timing is TBD as we work with our engineers to plan this work).
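For readers unfamiliar with what an issuance-time CAA check involves, the following is a simplified sketch of the tree-climbing lookup described in RFC 8659. It is an illustration only: `lookup` is a hypothetical stand-in for real DNS queries (a dict from domain name to a list of `(tag, value)` pairs), and the sketch ignores `issuewild`, record parameters, and critical flags.

```python
def relevant_caa_set(fqdn: str, lookup: dict) -> list:
    """Climb from fqdn toward the root; return the first non-empty CAA set."""
    labels = fqdn.rstrip(".").lower().split(".")
    for i in range(len(labels)):
        records = lookup.get(".".join(labels[i:]), [])
        if records:
            return records
    return []


def caa_permits(fqdn: str, issuer_domain: str, lookup: dict) -> bool:
    """Simplified check: only the 'issue' property is evaluated."""
    issue_values = [
        value.split(";")[0].strip().lower()
        for tag, value in relevant_caa_set(fqdn, lookup)
        if tag == "issue"
    ]
    # No "issue" records in the relevant set means issuance is unrestricted.
    return not issue_values or issuer_domain.lower() in issue_values
```

An `issue` value of `";"` yields an empty issuer name in this sketch, so it matches no CA and issuance is refused.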

To answer Andrew's questions (Comment 12):

 1. Once a domain is "onboarded", for how long does the domain remain in the production system and able to be used in new certificates? 

[MS PKI] It will remain in our domain validation cache for 825 days, unless the domain ownership information changes via our domain registrars feed (no longer owned by Microsoft).

 2. Based on Comment 9, my assumption is that the only validation that Microsoft performs prior to issuance is checking that the domains in the certificate were previously onboarded. Is this correct? If not, what steps, precisely, does Microsoft perform to validate the DNS names in the certificate request? Given the serious nature of this incident, the more detail that can be provided the better. 

[MS PKI] As stated above (and pasted below) in the response to Comment 10 question #6.

The updated August process, and the more automated domain validation process that is currently in place, for changes to our domain validation cache is as follows:

Background: Microsoft currently has four Registrars that provide a daily feed of all Public Domains Microsoft owns (or used to own). Changes in ownership are denoted by flags that our tools use to remove previously validated domains from MS PKI domain validation cache. Additionally, we stopped taking new domain requests from internal Microsoft teams. Instead we rely on our daily updates to our domain validation cache.

  1. Tools automatically pull daily Registrar feeds into MS PKI Toolset and identify Domains that are new or need to be removed.

  2. Check the list of new Microsoft domains against the existing MS PKI domain validation cache.

    a. If the domain is no longer owned/managed by Microsoft it is removed from the domain validation cache.

    b. If the domain is new and not in our existing cache, then proceed.

  3. Perform a Domain Validation request for all new Domains in daily registrar feed, by emailing the Registered Domain Contact with a unique code.

  4. Update domain validation cache with email results from registered domain contact’s response.

To answer Daniel's question (Comment 13):

 Is the linting tool that Microsoft uses an internally developed/proprietary solution? As a follow-up item beyond repairing your implementation have you considered augmenting that linting tool with additional linters? ZLint, as one example, regularly consumes up to date TLD information as part of the normal release cycle and flags this class of problem (e.g. this error level finding). 

[MS PKI] The tool we currently use is proprietary. Part of the reason for that decision is that we have strict controls around code that can run in our high-security environments. It is difficult to gain approval to operate third-party code in these environments, and a challenge to keep that code updated. This incident highlights the importance of using an industry standard linting tool and we are reexamining ways in which we can add one to our toolset.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

UPDATE on Findings, Repair Items and Mitigation Steps:

We have discovered that, of the 8 certificates that were mis-issued and revoked on Oct 9 and disclosed in this bug, 4 certificates also included one public and non-Microsoft owned domain in the Subject Name (specifically akadns.net). 2 of these certificates were issued on 9/9/20 and 2 were issued on 9/14/20. This error originated during the same April 14 domain validation process, which is when the akadns.net domain was incorrectly added to our domain validation cache. We will reach out to the registered domain owner to explain the error and provide all the information that we can.

Completed:

  • We used our daily Registrar feed from Oct 16 and ensured that all the domains in our validation cache, including the ~500 from the manual processes, were consistent with the Registrar’s feed (owned by Microsoft and public) and removed the “bad” domains discussed in this bug.

  • We examined all certificates that MS PKI has issued from all of our issuing CAs. We confirmed that the 8 certificates originally disclosed in this bug are the only certificates that were issued for “bad” domains disclosed in this bug.

Open/In Process:

  • Reach out to registered domain owner for AKADNS.NET for follow up on error (approximately 10/23/20).

  • We are continuing to automate the Aug 2020 domain validation process to remove any manual errors. The August 2020 process step #4 will be updated to automatically ingest the registered domain contacts response (approximately 10/30/20).

  • Addressing issues with our Linting tool and TLD detection (TBD).

  • Reexamine ways in which to add an industry standard linting tool to our toolset (TBD).

  • We will use an industry standard linting tool to lint all certificates that we have issued and report back on any errors that are discovered (TBD).
Flags: needinfo?(johnmas)

Thanks, John, for the level of detail in Comment 14.

One of the details that the April process lacked was that instead of sending the mail to the registered domain contact inside of Microsoft, our process was using the central domains contact at Microsoft. For the April 14 check, the central domains contact (domains@microsoft.com) is the registered domain contact for all but 5 of the domains.

If I'm understanding this correctly, you're saying that for 5 domains, you sent the domain validation email to an address that isn't allowed by BRs 3.2.2.4.2, 3.2.2.4.4, 3.2.2.4.13, or 3.2.2.4.14? If so, the certificates for those domains are misissued and need to be revoked and disclosed here.

It's a bit hard to follow the different processes which MS PKI has used to validate domains. Is it correct that all certificates issued by MS PKI have validated domains using one of the following 5 processes?

  1. The original April manual process

  2. The improved July manual process

  3. The "updated August process"

  4. By importing domains which had been validated by MS DSRE

  5. The automated system currently in place

Could you provide more details about the "updated August process"?

Regarding CAA, what is your process for determining that you are the DNS Operator of a domain? For domains validated using the July process, is it done by consulting https://dns.google.com as described in step 5? How is it done for domains which were validated using the other 4 processes?

Flags: needinfo?(johnmas)

[MS PKI] Per CAB Forum BR 3.2.2.8 Microsoft is the DNS Operator, and we were using the DNS operator's exception when we performed our process. As described above we added a manual CAA check in our manual July process. A CAA check is a part of our post-July automated process during domain validation. CAA checks are not currently a part of our certificate issuance process, but we are now committed to adding it soon (timing is TBD as we work with our engineers to plan this work).

BR 3.2.2.8 only gives you the right to define an exception in your CP/CPS.

Where in your CP/CPS did you define this exception? If you did not, all of your certificates are, in fact, misissued.

(In reply to John Mason from comment #14)

We have discovered that, of the 8 certificates that were mis-issued and revoked on Oct 9 and disclosed in this bug, 4 certificates also included one public and non-Microsoft owned domain in the Subject Name (specifically akadns.net).

For convenience, the 8 certificates currently attached to this bug (+ respective precertificates) are tracked in https://misissued.com/batch/186/ . If Microsoft had followed the recommendation of providing crt.sh IDs for the certificates, these certificates would have been easier to spot.

Andrew (Comment 15) and Paul (Comment 16 and 17) thanks for your questions.

To Answer Andrew’s questions (Comment 15):

If I'm understanding this correctly, you're saying that for 5 domains, you sent the domain validation email to an address that isn't allowed by BRs 3.2.2.4.2, 3.2.2.4.4, 3.2.2.4.13, or 3.2.2.4.14? If so, the certificates for those domains are misissued and need to be revoked and disclosed here.

[MS PKI] Yes, we identified that for 5 domains during the April 14 check we sent the email to an improper domain contact. All the certificates that were mis-issued have been revoked and disclosed here (we mis-issued for 2 of these domains and did not issue for the other 3 domains). We have removed those domains from our domain validation cache and/or revalidated them and added them back to the domain validation cache.

It's a bit hard to follow the different processes which MS PKI has used to validate domains. Is it correct that all certificates issued by MS PKI have validated domains using one of the following 5 processes?

  1. The original April manual process
  2. The improved July manual process
  3. The "updated August process"
  4. By importing domains which had been validated by MS DSRE
  5. The automated system currently in place

Could you provide more details about the "updated August process"?

[MS PKI] Yes, those are the 5 processes used to validate domains. The “updated August process” is covered in Comment 14 and I will copy it here as well.

The updated August process for changes to our domain validation cache (the more automated domain validation process that is currently in place) is as follows:

Background: Microsoft currently has four Registrars that provide a daily feed of all Public Domains Microsoft owns (or used to own). Changes in ownership are denoted by flags that our tools use to remove previously validated domains from the MS PKI domain validation cache. Additionally, we stopped accepting new domain requests from internal Microsoft teams. Instead, we rely on our daily updates to our domain validation cache.

  1. Tools automatically pull daily Registrar feeds into MS PKI Toolset and identify Domains that are new or need to be removed.
  2. Check the list of new Microsoft domains against the existing MS PKI domain validation cache.
    a. If the domain is no longer owned/managed by Microsoft it is removed from the domain validation cache.
    b. If the domain is new and not in our existing cache, then proceed.
  3. Perform a Domain Validation request for all new Domains in daily registrar feed, by emailing the Registered Domain Contact with a unique code.
  4. Update domain validation cache with email results from registered domain contact’s response.

Steps 1-3 of this process are already automated, and we are working to add automation to Step #4 (see repair items; the target is to complete that by approximately 10/30/20).
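
As an illustration only (not MS PKI's actual tooling), the four steps above could be sketched as follows. The feed layout, the `owned_by_microsoft` flag, and every function and class name here are assumptions made for the sketch; the real feed format is not described in this bug.

```python
from dataclasses import dataclass, field


@dataclass
class FeedEntry:
    """One row of a hypothetical daily registrar feed."""
    domain: str
    owned_by_microsoft: bool  # assumed shape of the ownership flag


@dataclass
class ReconcileResult:
    removed: list = field(default_factory=list)
    to_validate: list = field(default_factory=list)


def reconcile(feed, cache):
    """Apply steps 1-2 to `cache` (a set of validated domains) in place.

    Returns the domains that were removed and the new domains that still
    need the email challenge described in step 3.
    """
    result = ReconcileResult()
    for entry in feed:
        name = entry.domain.lower().rstrip(".")
        if not entry.owned_by_microsoft:
            # Step 2a: ownership changed, so drop any cached validation.
            if name in cache:
                cache.discard(name)
                result.removed.append(name)
        elif name not in cache:
            # Step 2b: new domain, queue it for step 3.
            result.to_validate.append(name)
    return result


def record_response(cache, domain, sent_code, returned_code):
    """Step 4: cache the domain only if the contact returned the unique code."""
    if sent_code == returned_code:
        cache.add(domain)
        return True
    return False
```

In this sketch a domain enters the cache only through `record_response`, so a feed entry alone never counts as a validation.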

Regarding CAA, what is your process for determining that you are the DNS Operator of a domain? For domains validated using the July process, is it done by consulting https://dns.google.com as described in step 5? How is it done for domains which were validated using the other 4 processes?

[MS PKI] MS PKI only issues certificates for Microsoft-owned domains. For all our processes we have been using the DNS Operator exception at certificate issuance as defined in our CPS. For the July manual process, we did perform the check on ~100 domain validations. Working with our domains team, we have verified that in all cases where Microsoft owns the domain, we are the DNS Operator.

However, we see the value in adding the CAA check for all certificate issuances, as an additional control to ensure compliance in the case of a bad domain validation. We plan to have this check implemented by approximately 11/6/20.
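
For readers less familiar with CAA, a simplified sketch of the decision such a check makes is shown here. It is not MS PKI's implementation: it assumes the relevant CAA record set has already been looked up and is passed in as `(flags, tag, value)` tuples, it covers non-wildcard names only, and the function name and the issuer domain `ca.example` are hypothetical.

```python
def caa_permits_issuance(records, issuer_domain):
    """Return True if the CAA record set allows `issuer_domain` to issue.

    `records` is a list of (flags, tag, value) tuples for the relevant
    record set; an empty list means no CAA records were found.
    """
    issue_values = []
    for flags, tag, value in records:
        tag = tag.lower()
        if tag == "issue":
            # Keep only the issuer domain; drop parameters after ';'.
            issue_values.append(value.split(";", 1)[0].strip().lower())
        elif tag not in ("issuewild", "iodef") and flags & 0x80:
            # Unrecognized property with the critical bit set: do not issue.
            return False
    if not issue_values:
        # No "issue" property present, so issuance is not restricted.
        return True
    return issuer_domain.lower() in issue_values
```

For example, `caa_permits_issuance([(0, "issue", "other.example")], "ca.example")` is `False`, while an empty record set permits issuance.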

To answer Paul’s question (Comment 16):
BR 3.2.2.8 only gives you the right to define an exception in your CP/CPS.
Where in your CP/CPS did you define this exception? If you did not, all of your certificates are, in fact, misissued.

[MS PKI] We have defined the exception in our CPS (https://www.microsoft.com/pkiops/Docs/Content/policy/Microsoft_PKI_Services_CPS_v3.1.6.pdf). In section 3.2.2.8 we state “CAA checking is optional if Microsoft PKI Services or associated Affiliate is the DNS Operator (as defined in RFC 7719) of the domain's DNS.”

To answer Paul’s question (Comment 17):

For convenience, the 8 certificates currently attached to this bug (+ respective precertificates) are tracked in https://misissued.com/batch/186/ . If Microsoft had followed the recommendation of providing crt.sh IDs for the certificates, these certificates would have been easier to spot.

[MS PKI] Thank you for bringing this to our attention. We do our best to follow the recommendations and regret our oversight. We appreciate the reminder and will do our best to continuously improve our responses.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
UPDATE on Repair Items and Mitigation Steps:

Completed:

  • We used our daily Registrar feed from Oct 16 and ensured that all the domains in our validation cache, including the ~500 from the manual processes, were consistent with the Registrar’s feed (owned by Microsoft and public) and removed the “bad” domains discussed in this bug (Oct 17).
  • We examined all certificates that MS PKI has issued from all our issuing CAs. We confirmed that the 8 certificates originally disclosed in this bug are the only certificates that were issued for “bad” domains disclosed in this bug (Oct 18).
  • Reached out to the registered domain owner for AKADNS.NET for follow up on this issue and received an acknowledgement of the issue (Oct 20). In addition to the improved domain validation process and additional CAA checks as mentioned above, we are both working with Microsoft teams to clarify where to get certificates in the future.

Open/In Process:

  • We are working to implement CAA checks for all certificate issuances (targeting 11/6/20)
  • We are continuing to automate the Aug 2020 domain validation process to remove any manual errors. The August 2020 process step #4 will be updated to automatically ingest the registered domain contact’s response (targeting 10/30/20).
  • Improve capabilities of our Linting tool and TLD detection (TBD).
  • Reexamine ways in which to add an industry standard linting tool to our automated issuance processes (TBD).
  • We will use an industry standard linting tool to lint all certificates that we have issued and report back on any errors that are discovered (TBD)
Flags: needinfo?(johnmas)

(In reply to John Mason from comment #18)

[MS PKI] We have defined the exception in our CPS (https://www.microsoft.com/pkiops/Docs/Content/policy/Microsoft_PKI_Services_CPS_v3.1.6.pdf) in section 3.2.2.8 we state “CAA checking is optional if Microsoft PKI Services or associated Affiliate is the DNS Operator (as defined in RFC 7719) of the domain's DNS.”

The quoted sentence is only part of several bullet points. Together with the introductory sentence, it reads: "Microsoft PKI Services MAY decide to not rely on any exceptions specified in their CP or CPS unless they are one of the following: [...] CAA checking is optional if Microsoft PKI Services or associated Affiliate is the DNS Operator (as defined in RFC 7719) of the domain's DNS.".

This does not define any exception at all by itself. All of your certificates are thus misissued and have to be revoked within 5 days.

Paul (Comment 19), thanks for your question.

To answer Paul’s question (Comment 19):

The quoted sentence is only part of several bullet points. Together with the introductory sentence, it reads: "Microsoft PKI Services MAY decide to not rely on any exceptions specified in their CP or CPS unless they are one of the following: [...] CAA checking is optional if Microsoft PKI Services or associated Affiliate is the DNS Operator (as defined in RFC 7719) of the domain's DNS.".

This does not define any exception at all by itself. All of your certificates are thus misissued and have to be revoked within 5 days.

[MS PKI] Our existing CPS (https://www.microsoft.com/pkiops/Docs/Content/policy/Microsoft_PKI_Services_CPS_v3.1.6.pdf) does define the exception in section 3.2.2.8 and this does not constitute a mis-issuance.

We agree that the existing document could be improved to make it clearer and more consistent with others in the industry, thus we will be revising our CPS shortly (targeting Nov 6, 2020).

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
UPDATE on Repair Items and Mitigation Steps:

Completed:

  • We used our daily Registrar feed from Oct 16 and ensured that all the domains in our validation cache, including the ~500 from the manual processes, were consistent with the Registrar’s feed (owned by Microsoft and public) and removed the “bad” domains discussed in this bug (Oct 17).
  • We examined all certificates that MS PKI has issued from all our issuing CAs. We confirmed that the 8 certificates originally disclosed in this bug are the only certificates that were issued for “bad” domains disclosed in this bug (Oct 18).
  • Reached out to the registered domain owner for AKADNS.NET for follow up on this issue and received an acknowledgement of the issue (Oct 20). In addition to the improved domain validation process and additional CAA checks as mentioned above, we are both working with Microsoft teams to clarify where to get certificates in the future.

Open/In Process:

  • We are working to implement CAA checks for all certificate issuances (targeting Nov 6, 2020)
  • We are continuing to automate the Aug 2020 domain validation process to remove any manual errors. The August 2020 process step #4 will be updated to automatically ingest the registered domain contact’s response (targeting Oct 30, 2020).
  • Improve capabilities of our Linting tool and TLD detection (TBD).
  • Reexamine ways in which to add an industry standard linting tool to our automated issuance processes (TBD).
  • We will use an industry standard linting tool to lint all certificates that we have issued and report back on any errors that are discovered (TBD)

Posting an update on our Repair Items and Mitigation Steps. We will provide an update weekly until we burn them down.

Completed:
+ We used our daily Registrar feed from Oct 16 and ensured that all the domains in our validation cache, including the ~500 from the manual processes, were consistent with the Registrar’s feed (owned by Microsoft and public) and removed the “bad” domains discussed in this bug (Oct 17).
+ We examined all certificates that MS PKI has issued from all our issuing CAs. We confirmed that the 8 certificates originally disclosed in this bug are the only certificates that were issued for “bad” domains disclosed in this bug (Oct 18).
+ Reached out to the registered domain owner for AKADNS.NET for follow up on this issue and received an acknowledgement of the issue (Oct 20). In addition to the improved domain validation process and additional CAA checks as mentioned above, we are both working with Microsoft teams to clarify where to get these certificates in the future.

Open/In Process – Short Term:
+ Update our Microsoft PKI Services CPS to clarify CAA Record checks (targeting Nov 6, 2020).
+ We are continuing to automate the Aug 2020 domain validation process to remove any manual errors. The August 2020 process step #4 will be updated to automatically ingest the registered domain contact’s response (targeting Nov 16, 2020).
+ We are working to implement CAA checks for all certificate issuances (Running into technical constraints and need to replan a rollout date).
+ We will use an industry-standard linting tool to lint, at a point in time, all certificates that we have issued and report back on any errors that are discovered.

Open/In Process - Longer Term:
+ Improve capabilities of our internal Linting tool, specifically TLD detection.
+ Reexamine ways in which to add an industry standard linting tool to our automated issuance processes.
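
To make the TLD-detection item concrete, a minimal sketch of such a lint is given here. It is an assumption about what the check might look like, not a description of MS PKI's internal linting tool: the function name is hypothetical, and `SAMPLE_TLDS` is a placeholder that a real check would replace with the full, current IANA list of top-level domains.

```python
# Placeholder sample only; a real check would load the full, current
# IANA list of top-level domains rather than this hard-coded set.
SAMPLE_TLDS = {"com", "net", "org"}


def dns_names_with_unknown_tld(dns_names, known_tlds=SAMPLE_TLDS):
    """Return the dNSName entries whose final label is not a known TLD."""
    bad = []
    for name in dns_names:
        labels = name.lower().rstrip(".").split(".")
        # A single-label name is not a fully qualified public name,
        # so it is flagged along with names ending in an unknown label.
        if len(labels) < 2 or labels[-1] not in known_tlds:
            bad.append(name)
    return bad
```

Running the lint against the SAN list before signing, and refusing to issue when it returns any names, would catch a non-public suffix regardless of how the domain entered the validation cache.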

Posting an update on our Repair Items and Mitigation Steps:

Completed:

  • We used our daily Registrar feed from Oct 16 and ensured that all the domains in our validation cache, including the ~500 from the manual processes, were consistent with the Registrar’s feed (owned by Microsoft and public) and removed the “bad” domains discussed in this bug (Oct 17).
  • We examined all certificates that MS PKI has issued from all our issuing CAs. We confirmed that the 8 certificates originally disclosed in this bug are the only certificates that were issued for “bad” domains disclosed in this bug (Oct 18).
  • Reached out to the registered domain owner for AKADNS.NET for follow up on this issue and received an acknowledgement of the issue (Oct 20). In addition to the improved domain validation process and additional CAA checks as mentioned above, we are both working with Microsoft teams to clarify where to get these certificates in the future.
  • Updated our Microsoft PKI Services CPS to clarify CAA Record checks (Nov 9). (https://www.microsoft.com/pkiops/Docs/Content/policy/Microsoft_PKI_Services_CPS_v3.1.7.pdf)

Open/In Process – Short Term:

  • We are continuing to automate the Aug 2020 domain validation process to remove any manual errors. The August 2020 process step #4 will be updated to automatically ingest the registered domain contact’s response (targeting Nov 16, 2020).
  • We are working to implement CAA checks for all certificate issuances (Running into technical constraints and need to replan a rollout date).
  • We will use an industry-standard linting tool to lint, at a point in time, all certificates that we have issued and report back on any errors that are discovered.

Open/In Process - Longer Term:

  • Improve capabilities of our internal Linting tool, specifically TLD detection.
  • Reexamine ways in which to add an industry standard linting tool to our automated issuance processes.

Posting an update on our Repair Items and Mitigation Steps:

Completed:

  • We used our daily Registrar feed from Oct 16 and ensured that all the domains in our validation cache, including the ~500 from the manual processes, were consistent with the Registrar’s feed (owned by Microsoft and public) and removed the “bad” domains discussed in this bug (Oct 17).
  • We examined all certificates that MS PKI has issued from all our issuing CAs. We confirmed that the 8 certificates originally disclosed in this bug are the only certificates that were issued for “bad” domains disclosed in this bug (Oct 18).
  • Reached out to the registered domain owner for AKADNS.NET for follow up on this issue and received an acknowledgement of the issue (Oct 20). In addition to the improved domain validation process and additional CAA checks as mentioned above, we are both working with Microsoft teams to clarify where to get these certificates in the future.
  • Updated our Microsoft PKI Services CPS to clarify CAA Record checks (Nov 9). (https://www.microsoft.com/pkiops/Docs/Content/policy/Microsoft_PKI_Services_CPS_v3.1.7.pdf)
  • We continue to automate the Aug 2020 domain validation process to remove any manual errors. The August 2020 process step #4 was updated to automatically ingest the registered domain contact’s response (Nov 16).

Open/In Process – Short Term:

  • We are working to implement CAA checks for all certificate issuances.
  • We will use an industry-standard linting tool to lint, at a point in time, all certificates that we have issued and report back on any errors that are discovered.

Open/In Process - Longer Term:

  • Improve capabilities of our internal Linting tool, specifically TLD detection.
  • Reexamine ways in which to add an industry standard linting tool to our automated issuance processes.

Posting an update on our Repair Items and Mitigation Steps:
Completed:

  • We used our daily Registrar feed from Oct 16 and ensured that all the domains in our validation cache, including the ~500 from the manual processes, were consistent with the Registrar’s feed (owned by Microsoft and public) and removed the “bad” domains discussed in this bug (Oct 17).
  • We examined all certificates that MS PKI has issued from all our issuing CAs. We confirmed that the 8 certificates originally disclosed in this bug are the only certificates that were issued for “bad” domains disclosed in this bug (Oct 18).
  • Reached out to the registered domain owner for AKADNS.NET for follow up on this issue and received an acknowledgement of the issue (Oct 20). In addition to the improved domain validation process and additional CAA checks as mentioned above, we are both working with Microsoft teams to clarify where to get these certificates in the future.
  • Updated our Microsoft PKI Services CPS to clarify CAA Record checks (Nov 9). (https://www.microsoft.com/pkiops/Docs/Content/policy/Microsoft_PKI_Services_CPS_v3.1.7.pdf)
  • We continue to automate the Aug 2020 domain validation process to remove any manual errors. The August 2020 process step #4 was updated to automatically ingest the registered domain contact’s response (Nov 16).
  • We have implemented CAA checks in our production environment for all certificate issuances (Dec 3). We are currently in “audit” mode, to ensure that we understand all use cases and after the holiday code freezes, we will turn on “enforce” mode for all certificates we issue.
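
The difference between the two modes can be sketched as follows. This assumes that "audit" means a failed CAA check is logged but does not block issuance, and that "enforce" means it does block; the bug does not spell this out, and the names `CaaMode` and `may_issue` are hypothetical.

```python
import logging
from enum import Enum

log = logging.getLogger("caa")


class CaaMode(Enum):
    AUDIT = "audit"
    ENFORCE = "enforce"


def may_issue(caa_allows, mode):
    """Decide whether issuance proceeds given a CAA result and the mode."""
    if caa_allows:
        return True
    if mode is CaaMode.AUDIT:
        # Audit mode: record the failure but let issuance continue.
        log.warning("CAA check failed; issuance allowed in audit mode")
        return True
    # Enforce mode: a failed CAA check blocks issuance.
    return False
```

Under these assumptions, the audit-mode log shows which requests would have been rejected before the switch to enforce mode is made.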

Open/In Process – Short Term:

  • Improve capabilities of our internal Linting tool, specifically TLD detection.

Open/In Process - Longer Term:

  • We will use an industry-standard linting tool to lint, at a point in time, all certificates that we have issued and report back on any errors that are discovered.
  • Reexamine ways in which to add an industry standard linting tool to our automated issuance processes.

Posting the last update for the year on our Repair Items and Mitigation Steps. We should complete all short-term items by January 15. At that point we will continue to work on the long-term items but ask that this bug be resolved after completion of the short-term items.

Completed:

  • We used our daily Registrar feed from Oct 16 and ensured that all the domains in our validation cache, including the ~500 from the manual processes, were consistent with the Registrar’s feed (owned by Microsoft and public) and removed the “bad” domains discussed in this bug (Oct 17).
  • We examined all certificates that MS PKI has issued from all our issuing CAs. We confirmed that the 8 certificates originally disclosed in this bug are the only certificates that were issued for “bad” domains disclosed in this bug (Oct 18).
  • Reached out to the registered domain owner for AKADNS.NET for follow up on this issue and received an acknowledgement of the issue (Oct 20). In addition to the improved domain validation process and additional CAA checks as mentioned above, we are both working with Microsoft teams to clarify where to get these certificates in the future.
  • Updated our Microsoft PKI Services CPS to clarify CAA Record checks (Nov 9). (https://www.microsoft.com/pkiops/Docs/Content/policy/Microsoft_PKI_Services_CPS_v3.1.7.pdf)
  • We continue to automate the Aug 2020 domain validation process to remove any manual errors. The August 2020 process step #4 was updated to automatically ingest the registered domain contact’s response (Nov 16).
  • We have implemented CAA checks in our production environment for all certificate issuances (Dec 3). We are currently in “audit” mode, to ensure that we understand all use cases and after the holiday code freezes, we will turn on “enforce” mode for all certificates we issue.

Open/In Process – Short Term:

  • Improve capabilities of our internal Linting tool, specifically TLD detection (targeting Jan 15, 2021)

Open/In Process - Longer Term:

  • We will use an industry-standard linting tool to lint, at a point in time, all certificates that we have issued and report back on any errors that are discovered.
  • Reexamine ways in which to add an industry standard linting tool to our automated issuance processes.

We should complete all short-term items by January 29. At that point we will continue to work on the long-term items but ask that this bug be resolved after completion of the short-term items.

Completed:

  • We used our daily Registrar feed from Oct 16 and ensured that all the domains in our validation cache, including the ~500 from the manual processes, were consistent with the Registrar’s feed (owned by Microsoft and public) and removed the “bad” domains discussed in this bug (Oct 17).
  • We examined all certificates that MS PKI has issued from all our issuing CAs. We confirmed that the 8 certificates originally disclosed in this bug are the only certificates that were issued for “bad” domains disclosed in this bug (Oct 18).
  • Reached out to the registered domain owner for AKADNS.NET for follow up on this issue and received an acknowledgement of the issue (Oct 20). In addition to the improved domain validation process and additional CAA checks as mentioned above, we are both working with Microsoft teams to clarify where to get these certificates in the future.
  • Updated our Microsoft PKI Services CPS to clarify CAA Record checks (Nov 9). (https://www.microsoft.com/pkiops/Docs/Content/policy/Microsoft_PKI_Services_CPS_v3.1.7.pdf)
  • We continue to automate the Aug 2020 domain validation process to remove any manual errors. The August 2020 process step #4 was updated to automatically ingest the registered domain contact’s response (Nov 16).
  • We have implemented CAA checks in our production environment for all certificate issuances (Dec 3). We first ran in “audit” mode to ensure that we understood all use cases. After the holiday code freezes, “enforce” mode was turned on for all production environments (Jan 14, 2021).

Open/In Process – Short Term:

  • Improve capabilities of our internal Linting tool, specifically TLD detection (targeting Jan 29, 2021)

Open/In Process - Longer Term:

  • We will use an industry-standard linting tool to lint, at a point in time, all certificates that we have issued and report back on any errors that are discovered.
  • Reexamine ways in which to add an industry standard linting tool to our automated issuance processes.