Closed Bug 1740585 Opened 3 years ago Closed 2 years ago

Microsoft: Unrevoked 4 intermediate certificates

Categories

(CA Program :: CA Certificate Compliance, task)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: kathleen.a.wilson, Assigned: johnmas)

Details

(Whiteboard: [ca-compliance] [crl-failure])

It appears that Microsoft revoked the following 4 intermediate certificates on June 24, and then "unrevoked" them a few hours later. According to crt.sh, these were observed in their CRL on June 25, but the CRL was last checked by crt.sh today, November 10, and they are no longer in the CRL.

https://crt.sh/?sha256=ec02314a59a303990772bdf25513d5093581257ad4e242f086f988a98fba8b7d

https://crt.sh/?sha256=3acd6f50d569963ede389e5a3d024fef52cb537dbf497ca1725e9ce710117807

https://crt.sh/?sha256=eb79c04645b9137e67647a7389dac6eb1d3aad8aa74d8994aa8f9c01015ecde0

https://crt.sh/?sha256=3b7d95d4ff780b5ea537d852e24c5485cda83b2b7931c7af1c8feec9c62146db

Section 4.10.1 of the BRs says: Revocation entries on a CRL or OCSP Response MUST NOT be removed until after the Expiry Date of the revoked Certificate.

So I think the CA should provide an Incident Report to explain what happened.
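The Section 4.10.1 requirement quoted above can be expressed as a simple comparison between two CRL snapshots. The following is only a minimal sketch: the function name, the serial numbers, and the expiry dates are hypothetical, and a real check would parse the serials from the actual CRLs rather than take them as plain values.

```python
from datetime import datetime, timezone


def improperly_removed(previous, current_serials, now):
    """Return serials dropped from the CRL on or before their certificate's expiry.

    previous: dict mapping revoked serial -> certificate expiry (notAfter)
    current_serials: set of serials listed on the newer CRL
    """
    return sorted(
        serial
        for serial, not_after in previous.items()
        if serial not in current_serials and now <= not_after
    )


# Hypothetical data, for illustration only.
now = datetime(2021, 11, 10, tzinfo=timezone.utc)
previous = {
    "0a01": datetime(2031, 6, 24, tzinfo=timezone.utc),  # unexpired: must stay listed
    "0a02": datetime(2020, 1, 1, tzinfo=timezone.utc),   # expired: may be dropped
}
violations = improperly_removed(previous, set(), now)
print(violations)
```

An empty result means every entry that disappeared belonged to a certificate that had already expired.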

Assignee: bwilson → johnmas
Status: NEW → ASSIGNED

Acknowledged on behalf of John Mason.

We are actively investigating and will provide an update by Nov 15th 2021.

According to crt.sh, these certificates were observed in the CRL again on November 13, but are no longer in the CRL.

In crt.sh see the table in the Revocation section, which has
Last Observed in CRL: 2021-11-13
Last Checked: 2021-11-15

Incident Report – Preliminary


  1. How your CA first became aware of the problem.

The issue was reported by Kathleen Wilson on 10 Nov 2021, via Bugzilla. This is when we first became aware of the issue.

  2. A timeline of the actions your CA took in response. A timeline is a date-and-time-stamped sequence of all relevant events. This may include events before the incident was reported, such as when a particular requirement became applicable, or a document changed, or a bug was introduced, or an audit was done.

Note: Times are listed in the Pacific time zone.
02 Jul 2021 12:50 PM - Microsoft PKI Services opened a Bugzilla bug (https://bugzilla.mozilla.org/show_bug.cgi?id=1718991) for creating 4 malformed ICAs. That bug explains how the ICAs were created and revoked. The 4 ICAs in the present bug are the same 4 malformed ICAs described in that bug.
07 Jul 2021 5:18 PM - Kathleen Wilson commented in the original bug (1718991), expressing trepidation about adding the PEM files for these malformed certs to CCADB (https://bugzilla.mozilla.org/show_bug.cgi?id=1718991#c7). We did add the certificates to CCADB, but did not append the PEM files.
11 Nov 2021 09:31 AM – Issue acknowledged by Microsoft PKI Services
12 Nov 2021 02:10 PM – Confirmed that the investigation was underway and that we would provide an update by Nov 15th 2021.
13 Nov 2021 10:35 AM - Preliminary investigation completed. Microsoft PKI Services remediated the issue by posting an updated CRL, from the Root, that correctly reflects the revocation of these 4 ICAs.
13 Nov 2021 10:44 AM - MS PKI downloaded the published CRLs and verified that the CRLs were publicly available.
13 Nov 2021 6:35 PM - Corrected the CCADB entries for the CAs from Active status to Revoked status.

  3. Whether your CA has stopped, or has not yet stopped, certificate issuance or the process giving rise to the problem or incident. A statement that you have stopped will be considered a pledge to the community; a statement that you have not stopped requires an explanation.

Microsoft PKI did not issue any certificates from the CAs as the CA certificates were revoked immediately after detecting the issue with the malformed certificates.

  4. In a case involving certificates, a summary of the problematic certificates.
    https://crt.sh/?sha256=ec02314a59a303990772bdf25513d5093581257ad4e242f086f988a98fba8b7d
    https://crt.sh/?sha256=3acd6f50d569963ede389e5a3d024fef52cb537dbf497ca1725e9ce710117807
    https://crt.sh/?sha256=eb79c04645b9137e67647a7389dac6eb1d3aad8aa74d8994aa8f9c01015ecde0
    https://crt.sh/?sha256=3b7d95d4ff780b5ea537d852e24c5485cda83b2b7931c7af1c8feec9c62146db

  5. In a case involving TLS server certificates, the complete certificate data for the problematic certificates.

See list of certificates above.

  6. Explanation about how and why the mistakes were made or bugs introduced, and how they avoided detection until now.

Microsoft PKI is performing a more in-depth Root Cause Analysis for the issue and will provide a more detailed report by 19 Nov 2021.

These 4 ICAs were created and revoked on the same day (24 Jun 2021). The Root's CRL was published manually with the updated revocations (which included these 4 ICAs as revoked). Some days later, our automated system overwrote that manual CRL with an older CRL, from the same Root, that did not include the revocation of these 4 ICAs.

On 13 Nov 2021, we corrected this error, and the Root's CRL now correctly lists these 4 ICAs as revoked. The team is still investigating the root cause of the incident.

We will provide an updated report by 19 Nov 2021.

With regard to Comment #3 from Kathleen: we updated the CRL on Saturday, 13 Nov, at approximately 10:35 AM. Even after your comment on 15 Nov, we still see the published CRL listing the 4 ICAs as revoked. We are not sure why this is reflected differently in crt.sh and will contact you offline to troubleshoot. We will include an update in our more detailed report to follow.

  7. List of steps your CA is taking to resolve the situation and ensure that such a situation or incident will not be repeated in the future.

We will provide detailed remediation steps, after completing the more detailed Root Cause Analysis.

Regarding Comment #3, CCADB verified the revocation via CRL just now, so maybe crt.sh will get the same results tomorrow.

Verified it in crt.sh as well today...
Last Observed in CRL: 2021-11-16
Last Checked: 2021-11-16

Incident Report - Update


This is an update to the Preliminary Report that was submitted earlier this week. There are only updates to sections 2, 6 and 7.

  2. A timeline of the actions your CA took in response. A timeline is a date-and-time-stamped sequence of all relevant events. This may include events before the incident was reported, such as when a particular requirement became applicable, or a document changed, or a bug was introduced, or an audit was done.

Note: Times are listed in the Pacific time zone.

• 02 Jul 2021 12:50 PM - Microsoft PKI Services opened a Bugzilla bug (https://bugzilla.mozilla.org/show_bug.cgi?id=1718991) for creating 4 malformed ICAs. That bug explains how the ICAs were created and revoked. The 4 ICAs in the present bug are the same 4 malformed ICAs described in that bug.
• 07 Jul 2021 5:18 PM - Kathleen Wilson commented in the original bug (1718991), expressing trepidation about adding the PEM files for these malformed certs to CCADB (https://bugzilla.mozilla.org/show_bug.cgi?id=1718991#c7). We did add the certificates to CCADB, but did not append the PEM files.
• 11 Nov 2021 09:31 AM – Issue acknowledged by Microsoft PKI Services
• 12 Nov 2021 02:10 PM – Confirmed that the investigation was underway and that we would provide an update by 15 Nov 2021.
• 13 Nov 2021 10:35 AM - Preliminary investigation completed. Microsoft PKI Services remediated the issue by posting an updated CRL, from the Root, that correctly reflects the revocation of these 4 ICAs.
• 13 Nov 2021 10:44 AM - MS PKI downloaded the published CRLs and verified that the CRLs were publicly available.
• 13 Nov 2021 6:35 PM - Corrected the CCADB entries for the CAs from Active status to Revoked status
• 15 Nov 2021 4:57 PM – Posted Preliminary Incident Report to CCADB

  6. Explanation about how and why the mistakes were made or bugs introduced, and how they avoided detection until now.

The Root CA that issued the four (4) malformed ICAs and the ICAs themselves are all managed “offline”, meaning they are in a secure logical and physical environment that is not connected to a network. Our team has very strict protocols and procedures while operating in this environment designed to maintain the integrity of these assets. Most of the work in this environment is manual but leverages automation (in the form of scripts, primarily) where it is possible to reduce human error and manage toil.

Additionally, our proprietary Automated PKI System is designed to augment this “offline” environment to track and automate processes like the posting of CRLs, as an example. This automation reduces toil and improves the overall quality of our services.

In this case, as the ICAs were being created (on 24 June 2021), the team in the offline environment immediately realized through post-issuance manual checks that the four (4) ICAs were malformed. In addition to reporting the mis-issuance via Bugzilla (https://bugzilla.mozilla.org/show_bug.cgi?id=1718991), the team wanted to revoke the ICAs immediately to speed the posting of the CRL, so they revoked them manually while in the offline lab.

Unfortunately, the team missed a cleanup step to formally “revoke” the certificate in the Automated PKI system upon leaving the lab. Therefore, our Automated system was not aware of the revocation. So, when it came time to update and publish the next CRL for that Root CA (approximately 24 July 2021), a task the Automated PKI system does, it overwrote the manually issued CRL with an older CRL that did not contain the revoked ICAs that were created on 24 June 2021.

  7. List of steps your CA is taking to resolve the situation and ensure that such a situation or incident will not be repeated in the future, accompanied with a binding timeline of when your CA expects to accomplish each of these remediation steps.

The team will update the “playbook” used in events like this. The existing playbook did not have this scenario (where the mis-issuance was detected at the time of creation). The updated playbook will NOT include manually publishing the CRLs, but instead have the team follow the process to update the Automated PKI system so that it can publish the CRLs.

Completed Remediations:
• Corrected CRL to properly reflect the four (4) revoked ICA's (13 Nov 2021) - This included updating our Automated PKI system so that it will publish the correct CRLs in the future.
• Updated CCADB to accurately reflect revocation of these four (4) revoked ICA's (13 Nov 2021)

Open Remediations:
• We will create a playbook specific to this use case: creating a certificate in our offline environment that is immediately identified as mis-issued/malformed. This will include a checklist of all steps and prohibit manual publishing of CRLs (30 Nov 2021).

Hi John,

Can you please help clarify the following:

Unfortunately, the team missed a cleanup step to formally “revoke” the certificate in the Automated PKI system upon leaving the lab. Therefore, our Automated system was not aware of the revocation. So, when it came time to update and publish the next CRL for that Root CA (approximately 24 July 2021), a task the Automated PKI system does, it overwrote the manually issued CRL with an older CRL that did not contain the revoked ICAs that were created on 24 June 2021.

It sounds like Microsoft is staging pre-generated CRLs on the Automated PKI System - and failed to destroy and re-create those pre-generated CRLs made obsolete after the 4 ICA certificate revocations. Is that right?

Thanks,
Ryan
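The staleness described in this comment can be illustrated with a small sketch. This is not Microsoft's system: the `StagedCrl` record, its fields, the CRL numbers, and the 1 June and 25 June generation dates are hypothetical. Only the 24 June revocation date and the approximate 24 July publication date come from the report above.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class StagedCrl:
    crl_number: int
    generated_at: datetime  # when the CRL was signed offline (hypothetical field)
    publish_at: datetime    # when automation is scheduled to post it


def stale_staged_crls(staged, last_revocation_at):
    """A staged CRL signed before the latest revocation cannot list that revocation."""
    return [crl for crl in staged if crl.generated_at < last_revocation_at]


staged = [
    StagedCrl(7, datetime(2021, 6, 1), datetime(2021, 7, 24)),
    StagedCrl(8, datetime(2021, 6, 25), datetime(2021, 8, 24)),
]
last_revocation_at = datetime(2021, 6, 24)

# Anything returned here would need to be destroyed and re-generated.
stale = stale_staged_crls(staged, last_revocation_at)
print([crl.crl_number for crl in stale])
```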

Ryan, with respect to Comment 8: yes, your analysis is correct.

Thanks for confirming, John.

Has Microsoft considered adding a technical control to the Automated PKI system such that it would 1) notify and 2) block a CRL from being hosted if its CRL Number is less than the version currently hosted?
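A control of the kind proposed above could look roughly like the following. It is a sketch under stated assumptions, not a description of any existing system: `check_publish`, `CrlRollbackError`, and the `notify` callback are hypothetical names, and the CRL Numbers are assumed to have already been extracted from the hosted and candidate CRLs.

```python
class CrlRollbackError(Exception):
    """Raised when a candidate CRL is older than the one currently hosted."""


def check_publish(hosted_crl_number, candidate_crl_number, notify=print):
    """Notify and block if the candidate's CRL Number is less than the hosted one."""
    if candidate_crl_number < hosted_crl_number:
        message = (
            f"Blocked CRL rollback: candidate CRL Number {candidate_crl_number} "
            f"is less than hosted CRL Number {hosted_crl_number}"
        )
        notify(message)                  # 1) notify
        raise CrlRollbackError(message)  # 2) block
    return candidate_crl_number


alerts = []
published = check_publish(41, 42, notify=alerts.append)  # newer CRL: allowed
try:
    check_publish(42, 41, notify=alerts.append)          # older CRL: blocked
except CrlRollbackError as exc:
    print(exc)
```

The comparison uses strict less-than, matching the wording of the proposal, so re-posting a CRL with the same CRL Number is not blocked in this sketch.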

Thanks, Ryan. In response to Comment 10: yes, we have considered similar controls that would notify/block a CRL from being hosted if its CRL Number is less than that of the version currently hosted. We have a Product Backlog Item for the analysis and development of such a feature by our engineering team.

In the immediate term, we believe that our proposed remediation is an appropriate response, as it will prohibit the manual publishing of CRLs and prevent such an issue from occurring again.

An update from Microsoft PKI Services: we created another Bugzilla bug related to this incident (https://bugzilla.mozilla.org/show_bug.cgi?id=1742195). That bug concerns the issues around posting the malformed ICAs from this bug to CCADB.

Additionally, we confirm that we have completed all remediations discussed above. Here is the summary:

Completed Remediations:
• Corrected CRL to properly reflect the four (4) revoked ICA's (13 Nov 2021) - This included updating our Automated PKI system so that it will publish the correct CRLs in the future.
• Updated CCADB to accurately reflect revocation of these four (4) revoked ICA's (13 Nov 2021)
• Created a playbook specific to this use case: creating a certificate in our offline environment that is immediately identified as mis-issued/malformed. This includes a checklist of all steps and prohibits manual publishing of CRLs (30 Nov 2021).

If there are no other comments, we ask that this bug please be resolved.

I'll close this on Friday, 3-Dec-2021, unless there is further discussion that needs to take place.

Flags: needinfo?(bwilson)

I would argue that, even with the proposed changes, Microsoft has not implemented checks that would allow them to detect the un-revoking of certificates if it occurs again:

There was no check that would detect, from an outsider's perspective, that certificates are being removed from the CRL endpoint (otherwise Microsoft would have detected this issue back in June), and there is no such check among the proposed and completed changes.
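An outside-in check of the kind described here could be sketched as a monitor that remembers every serial it has seen on the public CRL and reports any that later disappear. This is only an illustration: the class name and serial values are hypothetical, fetching and parsing the CRL is left out, and a real monitor would also stop tracking serials once their certificates expire.

```python
class CrlEndpointMonitor:
    """Tracks serials seen on a public CRL and flags any that later vanish."""

    def __init__(self):
        self.ever_seen = set()

    def observe(self, serials):
        """Record one fetch of the CRL; return previously seen serials now missing."""
        serials = set(serials)
        missing = sorted(self.ever_seen - serials)
        self.ever_seen |= serials
        return missing


monitor = CrlEndpointMonitor()
first = monitor.observe({"0a01", "0a02"})   # initial fetch: nothing to compare against
second = monitor.observe({"0a02"})          # "0a01" has disappeared
third = monitor.observe({"0a01", "0a02"})   # entry restored
print(first, second, third)
```

Because the monitor keeps the full history rather than only the previous fetch, a missing serial keeps being reported on every fetch until it reappears.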

Flags: needinfo?(bwilson) → needinfo?(johnmas)
Flags: needinfo?(bwilson)

Thanks, Matthias, for your comments in Comment 14. We understand that automated checks on our CRL endpoints would further protect our system against un-revoking certificates, and that is why we have a long-term item in our backlog to add that functionality.

Our existing CRL publishing systems and processes have been operating for more than eight (8) years and have proved to be very reliable, as this is the first instance of this problem. We believe the root cause of the problem is the manual intervention that the team took (manually publishing the CRLs) to resolve the original Certificate malformations more expeditiously. Our remediations have addressed this root cause (the manual intervention) and should prevent that from occurring again.

Because we have addressed the root cause, we respectfully believe this bug should be resolved. We will continue to add enhancements to our systems to add additional defense in depth.

Flags: needinfo?(johnmas)
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Flags: needinfo?(bwilson)
Resolution: --- → FIXED
Product: NSS → CA Program
Whiteboard: [ca-compliance] → [ca-compliance] [crl-failure]