Closed Bug 1842121 Opened 1 year ago Closed 11 months ago

Microsoft PKI Services: CRL Publication Failures

Categories

(CA Program :: CA Certificate Compliance, task)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: Dustin.Hollenback, Assigned: Dustin.Hollenback)

Details

(Whiteboard: [ca-compliance] [crl-failure] Next update 2023-08-18)

Preliminary report

How your CA first became aware of the problem (e.g. via a problem report submitted to your Problem Reporting Mechanism, a discussion in the MDSP mailing list, a Bugzilla bug, or internal self-audit), and the time and date.

Internal monitoring notified the Microsoft PKI Services operations team that CRL publication might be delayed beyond an internal threshold. While the investigation was ongoing, DigiCert notified us via email that their monitoring had determined that one of those CRLs was close to its nextUpdate value.

MS PKI Services identified two issues related to meeting the following requirement from the CA/B Forum Baseline Requirements 4.9.7: “If the CA publishes a CRL, then the CA SHALL update and reissue CRLs at least once every seven days, and the value of the nextUpdate field MUST NOT be more than ten days beyond the value of the thisUpdate field.”

  1. One recently deployed CA, Microsoft Azure RSA TLS Issuing CA 07, which has not issued any Subscriber certificates yet, was missing a Scheduled Task to publish more frequently than default values in the CA software. This caused the CA to not "reissue CRLs at least once every seven days". This has since been resolved.

  2. All Microsoft PKI Services Issuing CAs were configured within AD CS to set the nextUpdate value within 10 days of publishing, but the CA software was adding padded time so that the CRL effective validity was 10 days and 20 minutes. This has since been resolved and the CRLs now show a difference of 8 days and 20 minutes.
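For illustration, the two conditions quoted from Baseline Requirements 4.9.7 can be expressed as a small check. This is a minimal sketch and not the tooling used by MS PKI Services; the function name `check_crl` and the sample timestamp are assumptions made for the example.

```python
from datetime import datetime, timedelta

# Limits quoted from CA/B Forum Baseline Requirements 4.9.7.
MAX_REISSUE_INTERVAL = timedelta(days=7)
MAX_VALIDITY = timedelta(days=10)


def check_crl(this_update, next_update, previous_this_update=None):
    """Return a list of findings; an empty list means no violation was seen."""
    findings = []

    # Rule 1: nextUpdate must not be more than ten days beyond thisUpdate.
    validity = next_update - this_update
    if validity > MAX_VALIDITY:
        findings.append(f"validity {validity} exceeds {MAX_VALIDITY}")

    # Rule 2: CRLs must be reissued at least once every seven days.
    if previous_this_update is not None:
        gap = this_update - previous_this_update
        if gap > MAX_REISSUE_INTERVAL:
            findings.append(f"reissue gap {gap} exceeds {MAX_REISSUE_INTERVAL}")

    return findings


# Hypothetical issuance time, used only to show the arithmetic.
issued = datetime(2023, 7, 1, 12, 0, 0)

# Configured 10 days plus the 20 minutes of padding added by the CA software.
padded = check_crl(issued, issued + timedelta(days=10, minutes=20))

# The validity observed after the fix: 8 days and 20 minutes.
fixed = check_crl(issued, issued + timedelta(days=8, minutes=20))

print(padded)  # one finding for the validity window
print(fixed)   # no findings
```

The point of the sketch is that the padded validity fails the ten-day rule by only 20 minutes, which is easy to miss unless the check compares the actual thisUpdate and nextUpdate fields rather than the configured value.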

A timeline of the actions your CA took in response. A timeline is a date-and-time-stamped sequence of all relevant events. This may include events before the incident was reported, such as when a particular requirement became applicable, or a document changed, or a bug was introduced, or an audit was done.

2023-06-27 09:04:05 AM: Last publication date for the Microsoft Azure RSA TLS Issuing CA 07

2023-07-05 06:27:48 PM: Internal monitoring alerted that multiple CRLs were still within the allowed publication timeline, but were exceeding an earlier publication goal. Internal incident created.

2023-07-06 08:46:40 AM: DigiCert notified MS PKI that their monitoring caught a delayed CRL publication for Microsoft Azure RSA TLS Issuing CA 07

2023-07-06 05:13:53 PM: CRL file published for Microsoft Azure RSA TLS Issuing CA 07 with a nextUpdate timestamp less than 10 days from the thisUpdate timestamp. This was published more than 7 days from the previous CRL.

2023-07-06 05:24 PM: Issue mitigated with Microsoft Azure RSA TLS Issuing CA 07 that caused delayed publication beyond 7 days.

2023-07-06 06:28:53 PM: Last remaining CRL was re-published with a nextUpdate timestamp less than 10 days from the thisUpdate timestamp

Whether your CA has stopped, or has not yet stopped, certificate issuance or the process giving rise to the problem or incident. A statement that you have stopped will be considered a pledge to the community; a statement that you have not stopped requires an explanation.

All impacted CRLs (listed below) have been re-published and meet the requirements defined within CA/B Forum Baseline Requirements 4.9.7.

In a case involving certificates, a summary of the problematic certificates. For each problem: the number of certificates, and the date the first and last certificates with that problem were issued. In other incidents that do not involve enumerating the affected certificates (e.g. OCSP failures, audit findings, delayed responses, etc.), please provide other similar statistics, aggregates, and a summary for each type of problem identified. This will help us measure the severity of each problem.

The following CRLs were impacted:

In a case involving TLS server certificates, the complete certificate data for the problematic certificates. The recommended way to provide this is to ensure each certificate is logged to CT and then list the fingerprints or crt.sh IDs, either in the report or as an attached spreadsheet, with one list per distinct problem. It is also recommended that you use this form in your list "https://crt.sh/?sha256=[sha256-hash]", unless circumstances dictate otherwise. When the incident being reported involves an SMIME certificate, if disclosure of personally identifiable information in the certificate may be contrary to applicable law, please provide at least the certificate serial number and SHA256 hash of the certificate. In other cases not involving a review of affected certificates, please provide other similar, relevant specifics, if any.

This only impacted CRL files.

Explanation about how and why the mistakes were made or bugs introduced, and how they avoided detection until now.

Microsoft PKI Services is investigating the root cause and will have an update within 7 days with more details.

List of steps your CA is taking to resolve the situation and ensure that such situation or incident will not be repeated in the future, accompanied with a binding timeline of when your CA expects to accomplish each of these remediation steps.

Microsoft PKI Services is investigating the root cause and will have an update within 7 days with more details.

I misspoke on a previous statement, specifically the point in bold:
"One recently deployed CA, Microsoft Azure RSA TLS Issuing CA 07, which has not issued any Subscriber certificates yet, was missing a Scheduled Task to publish more frequently than default values in the CA software. This caused the CA to not "reissue CRLs at least once every seven days". This has since been resolved."

It was incorrect to state that the CA had not issued Subscriber certificates yet. It has issued a few dozen Subscriber certificates.

Assignee: nobody → Dustin.Hollenback
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Whiteboard: [ca-compliance] [crl-failure]

Final Report

How your CA first became aware of the problem (e.g. via a problem report submitted to your Problem Reporting Mechanism, a discussion in the MDSP mailing list, a Bugzilla bug, or internal self-audit), and the time and date.

DigiCert notified MS PKI Services via email that their monitoring had determined that one of the Microsoft PKI Services CRLs was close to its nextUpdate value.
MS PKI Services identified two issues related to meeting the following requirement from the CA/B Forum Baseline Requirements 4.9.7: “If the CA publishes a CRL, then the CA SHALL update and reissue CRLs at least once every seven days, and the value of the nextUpdate field MUST NOT be more than ten days beyond the value of the thisUpdate field.”

  1. One recently deployed CA, Microsoft Azure RSA TLS Issuing CA 07, was missing a process to publish more frequently than default values in the CA software. This caused the CA to not "reissue CRLs at least once every seven days". This has since been resolved.
  2. All Microsoft PKI Services Issuing CAs were configured within AD CS to set the nextUpdate value within 10 days of publishing, but the CA software was adding padded time so that the CRL effective validity was 10 days and 20 minutes. This has since been resolved and the CRLs now show a difference of 8 days and 20 minutes.

A timeline of the actions your CA took in response. A timeline is a date-and-time-stamped sequence of all relevant events. This may include events before the incident was reported, such as when a particular requirement became applicable, or a document changed, or a bug was introduced, or an audit was done.

Note: All times are in Pacific Time (PT); dates use the YYYY-MM-DD format.

2023-06-27 09:04:05 AM: Last publication date for the Microsoft Azure RSA TLS Issuing CA 07.
2023-07-02 02:20:03 AM: Microsoft Azure RSA TLS Issuing CA 07 published CRL, but could not replicate to CDP.
2023-07-05 ~04:30 PM: Operations team member investigated replication alerts and noticed that some CRL timestamps were older than usual on the public repository, but these were not publicly trusted CAs.
2023-07-05 06:27:48 PM: Internal incident created by Operations team member to investigate CRL replication to CDP, which included replication of multiple CRLs, including those issued by CAs trusted by multiple trusted root stores.
2023-07-05 06:46:56 PM: General replication issue mitigated.
2023-07-06 08:46:40 AM: DigiCert notified MS PKI that their monitoring caught a delayed CRL publication for Microsoft Azure RSA TLS Issuing CA 07.
2023-07-06 05:13:53 PM: CRL file published for Microsoft Azure RSA TLS Issuing CA 07 with a nextUpdate timestamp less than 10 days from the thisUpdate timestamp. This was published more than 7 days from the previous CRL on the public repository.
2023-07-06 05:24 PM: Issue mitigated with Microsoft Azure RSA TLS Issuing CA 07 that caused delayed publication beyond 7 days.
2023-07-06 06:28:53 PM: Last remaining CRL was re-published with a nextUpdate timestamp less than 10 days from the thisUpdate timestamp.
2023-07-12 06:51 AM: CRL monitoring tool updated to include CRLs for recently deployed CAs, including Microsoft Azure RSA TLS Issuing CA 07.
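As a worked example of the timeline arithmetic, the gap between the two publications of the Microsoft Azure RSA TLS Issuing CA 07 CRL on the public repository can be computed directly from the timestamps above. This is an illustrative sketch; the variable names are assumptions, and both timestamps are treated as naive Pacific Time values.

```python
from datetime import datetime, timedelta

# Timestamp format used in the timeline above (12-hour clock with AM/PM).
FMT = "%Y-%m-%d %I:%M:%S %p"

previous = datetime.strptime("2023-06-27 09:04:05 AM", FMT)
republished = datetime.strptime("2023-07-06 05:13:53 PM", FMT)

# Time between the two CRLs visible on the public repository.
gap = republished - previous

# How far the gap exceeds the seven-day reissue requirement.
overrun = gap - timedelta(days=7)

print(gap)      # 9 days, 8:09:48
print(overrun)  # 2 days, 8:09:48
```

The CRL generated on 2023-07-02 is not counted here because it never reached the CDP.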

Whether your CA has stopped, or has not yet stopped, certificate issuance or the process giving rise to the problem or incident. A statement that you have stopped will be considered a pledge to the community; a statement that you have not stopped requires an explanation.

All impacted CRLs (listed below) have been re-published and meet the requirements defined within CA/B Forum Baseline Requirements 4.9.7.

In a case involving certificates, a summary of the problematic certificates. For each problem: the number of certificates, and the date the first and last certificates with that problem were issued. In other incidents that do not involve enumerating the affected certificates (e.g. OCSP failures, audit findings, delayed responses, etc.), please provide other similar statistics, aggregates, and a summary for each type of problem identified. This will help us measure the severity of each problem.

The following CRLs were impacted:

In a case involving TLS server certificates, the complete certificate data for the problematic certificates. The recommended way to provide this is to ensure each certificate is logged to CT and then list the fingerprints or crt.sh IDs, either in the report or as an attached spreadsheet, with one list per distinct problem. It is also recommended that you use this form in your list "https://crt.sh/?sha256=[sha256-hash]", unless circumstances dictate otherwise. When the incident being reported involves an SMIME certificate, if disclosure of personally identifiable information in the certificate may be contrary to applicable law, please provide at least the certificate serial number and SHA256 hash of the certificate. In other cases not involving a review of affected certificates, please provide other similar, relevant specifics, if any.

This only impacted CRL files.

Explanation about how and why the mistakes were made or bugs introduced, and how they avoided detection until now.

There are several issues that contributed to this bug:

  1. A CRL for Microsoft Azure RSA TLS Issuing CA 07 was not updated and reissued within seven days.
    • Build miss for more frequent CRL publishing
      • The CA software has a default CRL publishing cadence that meets the Baseline Requirements. It publishes the CRL file to a temporary file share, which is then replicated to the publicly accessible CDP file repository. This replication was failing at the time the new CRL was published. There was also a bug that deleted the temporary file before confirming that replication to the CDP had completed. Once replication was repaired, it replicated only very recent files and could not replicate the already deleted CRL file for Microsoft Azure RSA TLS Issuing CA 07. The next default CRL publication fell after the seven-day limit.
      • To ensure MS PKI Services is publishing more frequent CRL file updates, we have a process that runs on each CA to publish a CRL update every 4 hours. However, during the build of Microsoft Azure RSA TLS Issuing CA 07, the service was not configured to run.
      • When the replication issue was resolved, all CAs except for Microsoft Azure RSA TLS Issuing CA 07, had a very recent CRL file and those were immediately replicated to the CDP.
      • MS PKI Services has mitigated Microsoft Azure RSA TLS Issuing CA 07 so that the frequent CRL publishing process is running, and has confirmed that it is running for all other ICAs. We do not have new ICA deployments planned, but will update the build process before the next CA is deployed to include additional post-build validation as a secondary check to confirm that settings are set correctly.
    • Build miss for adding CAs to CRL monitoring
      • MS PKI Services uses a standalone tool to monitor the health of CRLs and requires separate onboarding of new CAs to the service. While other monitoring was implemented, the build process was missing the step to onboard to the CRL monitor. This tool alerts if the nextUpdate value is greater than a preset value. MS PKI Services recently deployed new CAs, including Microsoft Azure RSA TLS Issuing CA 07, and the URLs for these new CAs were not added to the CRL monitoring tool.
      • MS PKI Services mitigated the issue by adding the missing URLs to the CRL monitoring tool. We have updated the build process to include the missing step of onboarding new CAs to CRL monitoring.
      • Also, while not included in the mitigation, we have a long-term plan to move the CRL monitoring functionality from the standalone CRL monitoring tool into the main monitoring system by 2023-09-30. When new CAs and new CRL URLs are deployed, they will automatically include CRL monitoring instead of requiring separate onboarding.
  2. The value of the nextUpdate field was more than ten days beyond the value of the thisUpdate field.
    • Added CA Software Time Skew
      • While the CAs were configured to have a CRL validity (time between thisUpdate and nextUpdate) of 10 days, the CA software had an additional padded time skew that added 20 minutes to the effective validity of the CRL.
      • These settings have been in use since all publicly-trusted TLS CAs were signed, and the additional 20 minutes had not received specific scrutiny. It was identified while investigating this incident.
      • MS PKI Services has mitigated the issue for all ICA CRLs, which now have a validity of 8 days and 20 minutes.
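The replication bug described under item 1 (the temporary file was deleted before replication to the CDP was confirmed) can be illustrated with a copy, verify, then delete pattern. This is a minimal sketch under stated assumptions: the function names and paths are hypothetical, and it uses a local file copy to stand in for whatever replication mechanism is actually in place.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def replicate_then_delete(staged: Path, cdp_dir: Path) -> bool:
    """Copy a staged CRL to the CDP directory and delete the staged copy
    only after the replicated file has been verified.

    Returns True if the file was replicated and the staged copy removed,
    False if the staged copy was kept for a later retry.
    """
    target = cdp_dir / staged.name
    try:
        shutil.copy2(staged, target)
    except OSError:
        # Replication failed: keep the staged file so it can be retried.
        return False

    # Confirm the replicated bytes match before removing the source.
    if sha256_of(target) != sha256_of(staged):
        return False

    staged.unlink()
    return True
```

With this ordering, a failed replication leaves the staged CRL in place, so repairing replication later can still pick it up instead of waiting for the next scheduled publication.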

List of steps your CA is taking to resolve the situation and ensure that such situation or incident will not be repeated in the future, accompanied with a binding timeline of when your CA expects to accomplish each of these remediation steps.

MS PKI Services has already implemented mitigations for most of the primary issues (see above). We are still working on a timeline for implementing additional validation automation to confirm that settings expected to be set during the build process are set correctly, and we expect to be able to provide an implementation date within 7 days. During the course of our investigation, we also identified additional process improvements, tracked outside of this bug, that would not have prevented these issues but will make the overall systems more resilient.
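To illustrate what a post-build validation check of this kind could look like, here is a minimal sketch that compares the settings found on a newly built CA against an expected baseline. It is not the automation being developed; the setting names and expected values are hypothetical and chosen only to mirror the build misses described above.

```python
# Hypothetical baseline; names and values are assumptions for illustration.
EXPECTED_SETTINGS = {
    "frequent_crl_publish_enabled": True,  # the every-4-hours publishing process
    "crl_monitor_onboarded": True,         # CRL URLs added to CRL monitoring
    "crl_validity_days": 8,                # configured CRL validity
}


def validate_build(actual: dict) -> list[str]:
    """Return a list of problems; an empty list means the build matches."""
    problems = []
    for name, expected in EXPECTED_SETTINGS.items():
        if name not in actual:
            problems.append(f"{name}: missing")
        elif actual[name] != expected:
            problems.append(
                f"{name}: expected {expected!r}, found {actual[name]!r}"
            )
    return problems


# Example: a build where the frequent publishing process was never enabled
# and the CA was not onboarded to CRL monitoring.
report = validate_build(
    {"frequent_crl_publish_enabled": False, "crl_validity_days": 8}
)
for line in report:
    print(line)
```

Running a check like this as a secondary step after each build would flag both build misses before the CA goes into service.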

Our planning to identify a timeline for implementing additional automated validation of server settings is taking longer than anticipated. I expect to be able to provide a clearer commitment date within the next two weeks.

Whiteboard: [ca-compliance] [crl-failure] → [ca-compliance] [crl-failure] Next update 2023-08-07

We have almost completed development of the automated validation of server settings. We are on track to complete final testing and deployment by 2023-08-14.

Testing completed. We are still on track to complete the deployment by 2023-08-14.

Whiteboard: [ca-compliance] [crl-failure] Next update 2023-08-07 → [ca-compliance] [crl-failure] Next update 2023-08-18

I received confirmation that the final component of the deployment successfully completed on Friday, 2023-08-14. With that completed, can this bug be closed? Thank you.

Correction on date:
I received confirmation that the final component of the deployment successfully completed on Friday, 2023-08-11. With that completed, can this bug be closed? Thank you.

Now that we implemented the remaining remediation steps, is this bug able to be closed?

I will close this on Friday, 29-Sept-2023.

Flags: needinfo?(bwilson)
Status: ASSIGNED → RESOLVED
Closed: 11 months ago
Flags: needinfo?(bwilson)
Resolution: --- → FIXED