Microsoft PKI Services: Failure to Revoke in 5 Days for 1962829
Categories
(CA Program :: CA Certificate Compliance, task)
Tracking
(Not tracked)
People
(Reporter: u654666, Assigned: CentralPKI)
References
(Blocks 1 open bug)
Details
(Whiteboard: [ca-compliance] [leaf-revocation-delay])
Attachments
(2 files)
Preliminary Incident Report
Summary
-
Incident description
This incident is related to Bugzilla bug https://bugzilla.mozilla.org/show_bug.cgi?id=1962829. Microsoft PKI Services revised a previous policy document and introduced a copy-and-paste typo that was missed until after the document had been superseded; certificates issued under that superseded document are still active. While reformatting the document to include various tables, a new detail was added that did not align with how we have been operating since inception. Specifically, CPS Version 3.2.4 incorrectly added that keyEncipherment is not present in Subscriber certificates even though it had always been present and continues to be present.
Microsoft PKI Services understands that the Baseline Requirements 4.9.1.1 require that when “the CA is made aware that the Certificate was not issued in accordance with these Requirements (BRs) or the CA’s Certificate Policy or Certification Practice Statement” that the certificate needs to be revoked within 5 days.
We have not revoked these certificates, and this is the reason we have opened this Failure to Revoke in 5 days bug.
We understand this does not meet the expectations of the BRs and we look forward to discussing and resolving this issue with the industry.
-
Relevant policies:
- TLS Baseline Requirements Section 4.9.1.1 Reasons for Revoking a Subscriber Certificate
-
Source of incident disclosure:
This incident is related to Bugzilla bug https://bugzilla.mozilla.org/show_bug.cgi?id=1962829.
So are we to believe, even with Entrust's removal which started with a similar event, and all of the discussions around delayed revocation and the consequences over the past year - Microsoft are choosing simply not to revoke all the affected certificates and hoping they remain a trusted CA?
From searching I see that Microsoft's CA does not issue directly to other people where Microsoft do not control the private key. This means all keys and certificates are held within one organization. This surely means revocation and issuing of new certificates is 'easy' compared to a CA such as Let's Encrypt? What possible reason is there not to do this?
Questions:
- Are Microsoft really intending not to revoke impacted certificates as they are required to?
- What are the specific, detailed, technical and commercial reasons for not revoking?
- Do Microsoft expect to remain a trusted CA after such an incident?
It makes no sense to revoke certificates because of a clear typo in a document. It's like recalling washing machines because of a typo in the manual. Of course, if the typo was security relevant or the washing machine had a defect, recalling would be the only right thing to do.
(In reply to Max D from comment #1)
So are we to believe, even with Entrust's removal which started with a similar event, and all of the discussions around delayed revocation and the consequences over the past year - Microsoft are choosing simply not to revoke all the affected certificates and hoping they remain a trusted CA?
From searching I see that Microsoft's CA does not issue directly to other people where Microsoft do not control the private key. This means all keys and certificates are held within one organization. This surely means revocation and issuing of new certificates is 'easy' compared to a CA such as Let's Encrypt? What possible reason is there not to do this?
Questions:
- Are Microsoft really intending not to revoke impacted certificates as they are required to?
- What are the specific, detailed, technical and commercial reasons for not revoking?
- Do Microsoft expect to remain a trusted CA after such an incident?
Microsoft PKI Services has not finalized a plan related to revocation yet. We’re planning to provide more information in a Full Incident Report that includes responses to these questions. Until then, we opened a Preliminary Incident Report to acknowledge that we did not revoke certificates within 5 days as expected in Baseline Requirements section 4.9.1.1.
Comment 4•3 months ago
We understand that a full incident report will be provided, but we wanted to point to our related post in 1962829 to reaffirm how we view incident reporting. We will continue to evaluate this incident as more information becomes available and we look forward to a significant commitment to changes that definitively and convincingly resolve the underlying issues.
We have three questions for consideration with developing the full incident report:
(1) From past Chrome Root Program surveys and policy “preflight” processes, we understood Microsoft PKI Services to be highly automated, having a very small percentage of associated domains (~10%) relying on manual certificate issuance and management. Other public references 1, 2, 3, and 4 also led us to believe that automation and mass revocation was of minimal concern to Microsoft, outside of “CRL bloating” as referenced in the last link. What role is Microsoft PKI Services’ current automation solution and past lessons learned playing in responding to this incident?
(2) While acknowledging Microsoft PKI Services' reported level of automation and technical implementation does not rely on ACME, we’d like to ask about Microsoft PKI Service's equivalent of an ARI-like solution. What solution(s) similar to ARI does Microsoft PKI Services have in place to mitigate the impact of future events similar to this incident?
(3) How did CA and community member responses to past large-scale revocation events, many of which resulted in commitments to improved automation solutions and ARI, play a role in Microsoft PKI Services preparation for responding to this incident?
Assignee
Comment 5•3 months ago
Assignee
Comment 6•3 months ago
Response to Comment #4
(1) Automation and Lessons Learned
" From past Chrome Root Program surveys and policy “preflight” processes, we understood Microsoft PKI Services to be highly automated, having a very small percentage of associated domains (~10%) relying on manual certificate issuance and management. Other public references 1, 2, 3, and 4 also led us to believe that automation and mass revocation was of minimal concern to Microsoft, outside of “CRL bloating” as referenced in the last link. What role is Microsoft PKI Services’ current automation solution and past lessons learned playing in responding to this incident?"
We are in a much better position to auto-rotate our certificates and conduct a mass revocation from a CA perspective. However, there is high potential for business impact to subscribers as many still rely on deployments to consume new certificates. Given the CRL bloat issue, we are evaluating revocation options for the ICAs and will share the details as part of our full report.
(2) ARI-like Solutions
"While acknowledging Microsoft PKI Services' reported level of automation and technical implementation does not rely on ACME, we’d like to ask about Microsoft PKI Service's equivalent of an ARI-like solution. What solution(s) similar to ARI does Microsoft PKI Services have in place to mitigate the impact of future events similar to this incident?"
Microsoft has the capability to centrally renew all certificates issued by specific Issuers managed in Key Vault and internal vaults—a process we've successfully executed in the past and can repeat if necessary. While many subscribers adopt renewed certificates within 24–48 hours, some do not. Additionally, despite our guidance, many customers still use certificate pinning. As a result, even though we can renew certificates centrally, immediate revocation would negatively impact those customers.
(3) Preparation for revocation
" How did CA and community member responses to past large-scale revocation events, many of which resulted in commitments to improved automation solutions and ARI, play a role in Microsoft PKI Services preparation for responding to this incident?"
Greater than 90% of the impacted time-valid certificates are no longer in use, however due to the CRL bloat issue it is not possible to revoke them without impacting the certificates which are still in use. We are implementing CRL partitioning to handle this issue in our new CAs. Furthermore, as part of our effort to reduce the certificate lifetime we have already reduced most of our certificate lifetime to 6 months by default, with a goal to meet or exceed the industry lifetime requirements.
Comment 7•3 months ago
(In reply to Microsoft PKI Services from comment #6)
While many subscribers adopt renewed certificates within 24–48 hours, some do not. Additionally, despite our guidance, many customers still use certificate pinning. As a result, even though we can renew certificates centrally, immediate revocation would negatively impact those customers.
Is the intention of this response to indicate that negative impact on customers is a reason to avoid prompt revocation? Given the many discussions in Bugzilla on the topic in the last year, I feel it’s virtually certain that Microsoft is aware that subscriber impact due to missing automation or ill-advised practices like pinning is not considered to be sufficient reason to delay or avoid revocation of mis-issued certificates. But I’m not sure how else to interpret that answer, I confess!
Comment 8•3 months ago
Thank you for the responses in Comment 6. We understand (and hope) that these might be elaborated upon in the delivery of the full incident report, but we’d like to explicitly ask for more information to better understand this statement from Microsoft PKI Services:
Greater than 90% of the impacted time-valid certificates are no longer in use, however due to the CRL bloat issue it is not possible to revoke them without impacting the certificates which are still in use.
(1) What does it mean when you say these certificates are “no longer in use”?
(2) Can you describe how you are measuring “use”?
(3) Why do you suspect the percentage of certificates “no longer in use” is so high compared to the total population?
(4) Can you expand upon the impact of the “not in use” leaf revocations to the smaller percentage of certificates that would be considered “in use”? Is this offered from the assumption/perspective that the systems/user agents relying on the affected “in use” certificates would be polling the bloated CRL, timing out due to size, and failing closed? To be clear, we are not offering commentary on your conclusion, we are trying to better understand its basis.
Assignee
Comment 9•3 months ago
Full Incident Report
Summary
- CA Owner CCADB unique ID: A002577
- Incident description: Microsoft made an erroneous revision to our Microsoft PKI Services policy document, specifically CPS Version 3.2.4, which incorrectly stated that keyEncipherment is not present in RSA Subscriber certificates, even though it has always been present. This error was overlooked until after the document had been superseded. According to Baseline Requirements 4.9.1.1, certificates issued under this erroneous policy need to be revoked within 5 days. While we understand this obligation, revocation is delayed due to the potential negative impact on client-side validation caused by the large size of Microsoft’s Certificate Revocation Lists (CRLs). Microsoft PKI Services plans to revoke the certificates in batches to manage CRL size and avoid client-side validation issues. The revocation process will begin on 5/28/25 and is expected to be completed by 11/15/2025. CRL partitioning will be implemented to prevent similar issues in the future.
- Timeline summary:
  - Non-compliance start date: 2024-07-21 (the non-compliance start date in the preliminary incident report incorrectly stated 2024-07-01)
  - Non-compliance identified date: 2025-04-25
  - Non-compliance end date: 2025-04-21
- Relevant policies: TLS Baseline Requirements Section 4.9.1.1 Reasons for Revoking a Subscriber Certificate
- Source of incident disclosure: This incident was self-reported and related to Bugzilla bug https://bugzilla.mozilla.org/show_bug.cgi?id=1962829 which was third-party reported.
Impact
- Total number of certificates: 100,322,979
- Total number of "remaining valid" certificates: 75,361,465
- Affected certificate types: Organization Validated TLS Subscriber Certificates
- Incident heuristic: This incident impacts all OV Subscriber certificates with RSA keys issued between 2024-07-21 and 2025-04-21.
- Was issuance stopped in response to this incident, and why or why not?: No. Microsoft did not stop issuance as the affected CPS version had already been superseded prior to discovery of the issue and issuance continues under corrected documentation.
- Analysis: We have the capability to bulk revoke certificates and have exercised it in previous bugs requiring revocation (e.g. 1962830). However, revoking tens of millions of certificates at once will create CRLs >600MB in size, and negatively impact client-side validation of certificates. After considering various options, we have decided to take a staged approach to revocation.
- Additional considerations: Most Subscribers of the certificates issued by the CA require support for TLS 1.2, which requires keyEncipherment to be set as per RFC 5246: "keyEncipherment bit MUST be set if the key usage extension is present." While this does not excuse the typographical mistake, it helps reinforce that this was a typo for a setting that was never planned to be changed.
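The mismatch described above (a documented profile that omits keyEncipherment while issued RSA certificates assert it) can be illustrated with a small consistency check. This is a minimal sketch only: the `DOCUMENTED_PROFILE` table and the `profile_mismatches` helper are hypothetical, the listed bits are illustrative rather than a copy of any CPS table, and it does not describe how Microsoft PKI Services or any linter actually performs this comparison.

```python
# Hypothetical, illustrative profile table: which key usage bits a policy
# document says a Subscriber certificate carries, per public key algorithm.
# The contents are assumptions for the example, not taken from any real CPS.
DOCUMENTED_PROFILE = {
    "RSA": {"digitalSignature"},
    "ECC": {"digitalSignature"},
}


def profile_mismatches(key_algorithm, asserted_bits, profile=DOCUMENTED_PROFILE):
    """Compare the key usage bits a certificate asserts with the documented set.

    Returns the bits that are asserted but undocumented ("unexpected") and
    the bits that are documented but not asserted ("missing").
    """
    documented = profile[key_algorithm]
    asserted = set(asserted_bits)
    return {
        "unexpected": sorted(asserted - documented),
        "missing": sorted(documented - asserted),
    }


# An RSA certificate that asserts keyEncipherment against a profile that omits it.
result = profile_mismatches("RSA", ["digitalSignature", "keyEncipherment"])
print(result)
```

A check of this shape, run against sample certificates whenever a profile table is edited, is one way a documentation-only change that contradicts practice could surface before publication.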
Timeline
- 2024-07-21: Public TLS CPS 3.2.4 published with new tables that included a typo in Section 7.1.2.7.11 stating keyEncipherment was not present in Subscriber certificates with RSA keys even though this was being set at the time and continues to be set
- 2025-04-21: Public TLS CPS 3.3.0 published that replaced multiple tables with new Appendix B Certificate Profiles section where keyEncipherment may be set, but did not distinguish between ECC and RSA public keys
- 2025-04-25: Third-party researcher emailed a Certificate Problem Report to Microsoft PKI Services identifying mismatches between Subscriber certificates and CPS document language related to bug 1962829
- 2025-04-29: Public TLS CPS 3.3.1 published that retained language in Appendix B that keyEncipherment may be set, but did not distinguish between ECC and RSA public keys.
- 2025-05-09: This bug was opened as we did not meet the Baseline Requirement guidelines stated in 4.9.1.1 to revoke certificates within 5 days.
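For readers following the dates, the 5-day window can be worked through with a short calculation. This is a sketch under one stated assumption: that the clock starts on the day the Certificate Problem Report was received (2025-04-25). The helper names are hypothetical.

```python
from datetime import date, timedelta

# Revocation window from TLS BR Section 4.9.1.1 as cited in this report.
REVOCATION_WINDOW_DAYS = 5


def revocation_deadline(aware_on: date) -> date:
    """Latest date by which revocation is expected, counting from awareness."""
    return aware_on + timedelta(days=REVOCATION_WINDOW_DAYS)


def days_past_deadline(aware_on: date, as_of: date) -> int:
    """Whole days elapsed after the deadline (0 if the deadline has not passed)."""
    return max(0, (as_of - revocation_deadline(aware_on)).days)


# Assumption: awareness date = receipt of the Certificate Problem Report.
aware = date(2025, 4, 25)
print(revocation_deadline(aware))                   # 2025-04-30
print(days_past_deadline(aware, date(2025, 5, 9)))  # 9
```

Under that assumption, this bug was opened nine days after the window closed.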
Related Incidents
Bug | Date | Description |
---|---|---|
1962829 | 2025-04-25 | Microsoft PKI Services: Policy document bug. Microsoft PKI Services introduced a typo error in CPS Version 3.2.4 while reformatting the document, incorrectly stating that keyEncipherment is not present in Subscriber certificates. This contradicts longstanding practice and affects still-active certificates tied to the superseded document. |
Root Cause Analysis
Contributing Factor 1: CRL bloat risk prevents timely mass revocation due to large number of unexpired subscriber certificates
- Description: There are ~75M impacted, unexpired certificates. Revoking all these certificates at once will create CRLs >600MB in size and negatively impact client-side validation of valid certificates. After considering various options, we have decided to take a staged approach to revocation. Our plan is to revoke certificates in batches on a weekly basis, maintaining a CRL size which does not negatively impact clients, and leaving room for additional revocations in case other incidents occur. Given the volume of certificates and the anticipated space available on the CRL at any given moment, this means many certificates will expire before we are able to revoke them. We expect to begin the revocations on 5/28/25 and complete before 11/15/2025.
- Timeline:
  - Since inception: Known risk associated with lack of CRL partitioning
  - 2025-04-25: Inconsistency between published CPS and issued certificates identified (Bug 1962829)
  - 2025-05-09: Microsoft opens Bug 1965612, acknowledging delay due to CRL size concerns
- Detection: The risk was known in advance from previous revocation planning efforts, and work was already underway on risk mitigations. The problem was reaffirmed during planning of this incident’s revocation response.
- Interaction with other factors: N/A.
- Root Cause Analysis methodology used: 5-Whys
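The trade-off in the staged approach (a capped CRL, weekly batches, and certificates expiring before their turn) can be sketched with a toy model. Everything here is an assumption for illustration: the per-entry size of 40 bytes, the fixed overhead, the rule that an entry leaves the CRL at the end of the week its certificate expires, and the batch ordering. None of it reflects Microsoft PKI Services' actual CRL encoding, targets, or selection criteria.

```python
def estimated_crl_bytes(entries, bytes_per_entry=40, fixed_overhead=1024):
    """Rough CRL size: a fixed header/signature cost plus a per-entry cost.

    Both defaults are illustrative assumptions, not measured values.
    """
    return fixed_overhead + entries * bytes_per_entry


def max_entries_for_budget(budget_bytes, bytes_per_entry=40, fixed_overhead=1024):
    """How many revoked entries fit in a CRL of at most budget_bytes."""
    return max(0, (budget_bytes - fixed_overhead) // bytes_per_entry)


def plan_weekly_batches(pending_by_expiry_week, capacity, longest_lived_first=True):
    """Simulate weekly revocation batches under a CRL entry cap.

    pending_by_expiry_week[w] is the number of affected, unrevoked
    certificates that expire at the end of week w. capacity is the maximum
    number of entries allowed on the CRL (already net of any reserve).
    Returns (entries revoked each week, certificates that expired unrevoked).
    """
    pending = list(pending_by_expiry_week)
    on_crl = [0] * len(pending)  # revoked entries, keyed by expiry week
    batches = []
    expired_unrevoked = 0
    for week in range(len(pending)):
        room = capacity - sum(on_crl[week:])
        order = range(week, len(pending))
        if longest_lived_first:
            order = reversed(order)
        batch = 0
        for w in order:
            take = min(pending[w], room - batch)
            if take <= 0:
                continue
            pending[w] -= take
            on_crl[w] += take
            batch += take
        batches.append(batch)
        # End of week: this week's certificates expire; their CRL entries
        # are dropped (a simplification of real CRL retention rules).
        expired_unrevoked += pending[week]
        pending[week] = 0
        on_crl[week] = 0
    return batches, expired_unrevoked


print(plan_weekly_batches([10, 10, 10], capacity=8))
print(plan_weekly_batches([10, 10, 10], capacity=8, longest_lived_first=False))
```

In this toy model the ordering matters: revoking the longest-lived certificates first keeps entries on the CRL longer and lets more certificates expire unrevoked, while revoking the soonest-expiring first frees space faster but spends capacity on certificates that were about to expire anyway.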
Lessons Learned
- What went well: None
- What didn’t go well:
  - Without CRL partitioning in place, MS PKI could not execute revocation of tens of millions of certificates in a manner that does not negatively impact relying parties. To eliminate CRL size as an issue in the future, we will complete implementation of CRL partitioning which we started before this incident and roll it out before 11/15/2025.
  - We considered an alternate plan to revoke the issuing CAs. However, we are relying on cross-signing of our ICAs so that our subscribers can support legacy devices which do not trust our root, and we do not have warm standby cross-signed ICAs to move subscribers to. We will add warm standby ICAs so that ICA revocation becomes a viable option. We are working out the details for this and will share a target date before 6/14/25.
  - We agree with the observations from the community about the benefits of reducing the volume of publicly trusted certificates. We have started an investigation, and suspect subscriber implementation issues may be a contributing factor to the large number of certificates. We will complete the investigation before 6/27/25. If we identify additional repair items during this investigation, we will append them to this bug.
  - Because there was no change to established certificate profiles, the misstatement in the CPS was initially viewed as a documentation issue rather than mis-issuance, which delayed reporting of this bug.
  - We had a playbook and mechanisms to do revocations, but it needed to be scaled carefully and validated to be able to safely revoke millions of certificates.
- Where we got lucky: The erroneous CPS version had already been superseded: The impacted CPS (v3.2.4) was replaced by v3.3.0 before the issue was detected, eliminating the need to stop issuance upon discovery of the incident (see 1962829 - Microsoft PKI Services: Policy document bug for related action items).
- Additional: None
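As background on the CRL partitioning mentioned above, one common way to partition is to assign every certificate to a shard at issuance and embed that shard's URL in the certificate's CRL Distribution Point, so each CRL only ever lists revocations for its own shard. The sketch below uses a hash of the serial number to pick the shard. The function names, the base URL, and the shard count are hypothetical, and this is not a description of the design Microsoft PKI Services is implementing.

```python
import hashlib
from collections import defaultdict


def shard_for_serial(serial: int, num_shards: int) -> int:
    """Deterministically map a certificate serial number to a CRL shard."""
    digest = hashlib.sha256(serial.to_bytes(20, "big")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def cdp_url(base_url: str, serial: int, num_shards: int) -> str:
    """CRL Distribution Point URL to embed in the certificate at issuance."""
    return f"{base_url}/{shard_for_serial(serial, num_shards)}.crl"


def group_revocations(revoked_serials, num_shards: int):
    """Split a list of revoked serials into per-shard CRL contents."""
    shards = defaultdict(list)
    for serial in revoked_serials:
        shards[shard_for_serial(serial, num_shards)].append(serial)
    return dict(shards)


# Hypothetical base URL and shard count, for illustration only.
print(cdp_url("http://crl.example.test/ica01", 0x1234ABCD, 128))
print({k: len(v) for k, v in group_revocations(range(1, 1001), 8).items()})
```

Because the distribution point is fixed inside the certificate, the shard count has to be chosen before issuance; a relying party then downloads only the shard named in the certificate it is validating, which keeps each download small even during a mass revocation.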
Action Items
Action Item | Kind | Root Cause(s) | Evaluation Criteria | Due Date | Status |
---|---|---|---|---|---|
Revoke impacted certificates (in batches beginning 5/28/2025) | Mitigate | Root Cause 1 | Track % of impacted certificates revoked. | 11/15/2025 | Not Started |
Migrate cert issuance to use partitioned CRLs | Prevent | Root Cause 1 | % certificates confirmed via CT logs and CDP endpoints | 11/15/2025 | In Progress |
Stand up cross-signed warm standby CAs. We are currently in planning stages. We will have the plan ready before 06/14/2025 | Prevent | Root Cause 1 | Publish and disclose standby ICAs in CT logs. Validate readiness through test issuance. | TBD | In Progress |
Create training and TSG Documentation to educate team on revocation expectations | Prevent | Root Cause 1 | TSG documentation will be created and training compliance tracked through internal processes. | TBD | New |
Reduce usage of public PKI | Prevent | Root Cause 1 | % reduction in public trusted certificates, unexpired certificates | TBD | In Progress |
Exercise and refine the mass revocation playbook | Prevent | Root Cause 1 | Playbook validated with multiple rounds of revocations. Tracked internally. | 7/27/2025 | In Progress |
Appendix
- See attached file for the full list of affected certificates.
- Relevant CPS Policy Documents:
Assignee
Comment 10•3 months ago
Update to certificate text file
We are currently unable to upload the impacted certificate text file due to size limitations. We will provide a persistent URI for access in our next update.
Assignee
Comment 11•3 months ago
Response to Comment 7
Is the intention of this response to indicate that negative impact on customers is a reason to avoid prompt revocation? Given the many discussions in Bugzilla on the topic in the last year, I feel it’s virtually certain that Microsoft is aware that subscriber impact due to missing automation or ill-advised practices like pinning is not considered to be sufficient reason to delay or avoid revocation of mis-issued certificates. But I’m not sure how else to interpret that answer, I confess!
Thank you for the follow up. Though being able to safely revoke certificates is a consideration, our primary constraint is CRL sizes: revoking all ~75 million impacted/time-valid certificates at once would result in CRLs exceeding 600MB, which would impair revocation checking for relying parties. As a result, we are proposing revocation in batches, which we will detail in our full incident report.
Assignee
Comment 12•3 months ago
Response to Comment 8
(1) Definition of “No Longer in Use”
“What does it mean when you say these certificates are ‘no longer in use’?”
There are two scenarios. Our subscribers store private keys in vaults such as Azure Key Vault. The first scenario is the subscriber has deleted the certificate and its key from the vault. The second scenario is that the subscriber has enrolled and started using a new version of the certificate, is no longer using the previous version of the certificate, and previous version has not expired yet.
(2) Measuring Certificate “Use”
“Can you describe how you are measuring ‘use’?”
We have an inventory of certificates in subscriber vaults and telemetry of subscriber certificate usage.
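The two "no longer in use" scenarios can be written as a simple predicate over an inventory record. This is only a sketch: the `VaultRecord` fields and both helper functions are hypothetical names invented for illustration, and the real inventory and telemetry schema is not described in this thread.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class VaultRecord:
    """Hypothetical inventory row for one issued certificate."""
    serial: str
    not_after: date
    deleted_from_vault: bool           # scenario 1: certificate and key deleted
    superseded_by_newer_version: bool  # scenario 2: a newer version is in use


def is_in_use(record: VaultRecord, today: date) -> bool:
    """A time-valid certificate is in use unless either scenario applies."""
    if record.not_after < today:
        return False
    return not (record.deleted_from_vault or record.superseded_by_newer_version)


def share_not_in_use(records, today: date) -> float:
    """Fraction of time-valid certificates that are no longer in use."""
    valid = [r for r in records if r.not_after >= today]
    if not valid:
        return 0.0
    return sum(1 for r in valid if not is_in_use(r, today)) / len(valid)
```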
(3) High Percentage of Certificates Not in Use
“Why do you suspect the percentage of certificates ‘no longer in use’ is so high compared to the total population?”
We are investigating why the total population is so high. We suspect, but have not yet confirmed, that the population is high due to subscriber implementation issues, aka a “leak”. If this is the case, it would explain why the percentage “no longer in use” is high compared to the total population.
(4) Impact of Revoking “Not in Use” Certificates
“Can you expand upon the impact of the ‘not in use’ leaf revocations to the smaller percentage of certificates that would be considered ‘in use’? Is this offered from the assumption/perspective that the systems/user agents relying on the affected ‘in use’ certificates would be polling the bloated CRL, timing out due to size, and failing closed? To be clear, we are not offering commentary on your conclusion, we are trying to better understand its basis.”
Yes, the concern is that revoking all these certificates at once will create CRLs >600MB in size and client-side validation will experience delays and/or time-outs that may result in fail closed. Additionally, some clients have technical limits on the size of the CRLs and number of entries that they can process, which will be unable to process these CRLs altogether.
Comment 13•3 months ago
(In reply to Microsoft PKI Services from comment #11)
Response to Comment 7
Is the intention of this response to indicate that negative impact on customers is a reason to avoid prompt revocation? Given the many discussions in Bugzilla on the topic in the last year, I feel it’s virtually certain that Microsoft is aware that subscriber impact due to missing automation or ill-advised practices like pinning is not considered to be sufficient reason to delay or avoid revocation of mis-issued certificates. But I’m not sure how else to interpret that answer, I confess!
Thank you for the follow up. Though being able to safely revoke certificates is a consideration, our primary constraint is CRL sizes: revoking all ~75 million impacted/time-valid certificates at once would result in CRLs exceeding 600MB, which would impair revocation checking for relying parties. As a result, we are proposing revocation in batches, which we will detail in our full incident report.
The proposal of revoking 'the issuing CAs', or rather the affected intermediaries, and moving to new ones would handle the vast majority of your certificates. As it stands all that has been proposed is a massive step backwards for CA standards, all on the basis that a subset of subscribers are relying on the cross-signing for legacy device support. That it is taking a month to propose a potential, and lackluster, revocation plan shows a lack of regard for how the WebPKI as a whole has advanced over the years.
That the proposed revocation plan goes over a 6-month period is telling on how much effort Microsoft PKI have placed on reading any recent incidents. Moreso is the 'Related Incidents' part of the incident report that presumes the incidents must only relate to Microsoft PKI - a gross misunderstanding on incident reporting guidelines.
“Related Incidents” MUST consider incidents beyond those corresponding to the CA Owner subject of this report.
The Let's Encrypt: certificate lifetimes 90 days plus one second incident was 4 years ago. That focused on 185 million certificates, and drastically shifted how CAs operate to stop this being an issue going forward.
Q1: What learnings did Microsoft PKI take from that and similar incidents to make sure they would be capable of a mass-revocation event in keeping with the timelines in CCADB's Incident Reporting Guidelines?
The 'Timeline' section is remarkably quiet on what Microsoft PKI have been doing for 4 weeks. From statements provided it seems no assessment of the corpus of certificates has even occurred, but will start soon. This is called a 'Final' Incident Report because the work should have been completed already.
Q2: Does Microsoft PKI think this is acceptable practice in 2025? What has your CA been doing this entire time?
The Root Cause Analysis section is also lacking in completeness. It seems to have been written to focus solely on why this particular revocation plan is the only feasible way forward. It does not address the complete lack of attention to incidents in the past few years that would introduce best practices to Microsoft PKI that make this a non-issue.
Q3: Can Microsoft PKI talk us through the time it would take to generate new intermediaries and transition as many certificates across as possible? Note that this is not including any cross-signing.
Q4: What are the limitations on cross-signing with a new intermediary in getting this handled in a timely manner?
Q5: Are there any plans currently for dealing with root CAs being rotated and the impact on subscribers leaning on legacy-device use? See Chrome Root Program's Root CA Term-Limit as an example.
Given the entire point of these incidents is to tell the WebPKI ecosystem what has happened, and what you will do to ensure this does not happen again I'm rather baffled at the Action Items. As far as I can see these focus solely on dealing with the current problem, not making sure this can never happen again. There seems to be blinders on the CRL throughout the report, when alternative means of handling this exist but seem to be getting disregarded in favor of a 6-month plan at minimum.
2025-04-25 is when the Certificate Problem Report was sent in.
2025-05-10 is the start of this incident with a preliminary incident report.
2025-05-23 is when the final incident report was published.
Q6: Can Microsoft PKI please explain how this is in-keeping with CCADB's Incident Reporting Guidelines for handling incident reports? Note: 'When are reports expected?'
I strongly advise that Microsoft PKI at the very least read this comment by Mozilla in a recent incident.
Q7: With the above comment in mind, can Microsoft PKI please explain how this plan is showing public trust in adhering to best practices to date?
For those with censys access this should cover the majority of impacted certificates.
(labels="trusted" and validation.nss.has_trusted_path=true and not labels="revoked") and parsed.extensions.extended_key_usage.server_auth="true" and parsed.validity_period.not_before: {`2024-07-21` to `2025-04-21`} and parsed.issuer.organization=`Microsoft Corporation` and parsed.subject_key_info.key_algorithm.name=`RSA`
Results: 70,044,495
I will leave further analysis to other parties, suffice to say Chrome Root Program did hint at issues in question 7 of the other incident.
Comment 14•3 months ago
[In response to Comment 12.]
Thank you for providing answers to the questions posted in Comment 8 and for providing the Full Incident Report in Comment 9.
We have a few additional questions:
(1) Can you please describe the TLS server authentication certificate automation solution(s) in place to help us understand what is and is not considered in scope of the solutions available to subscribers? Statements in Comment 6 indicate “While many subscribers adopt renewed certificates within 24–48 hours, some do not.”
We interpret that to mean while the process of requesting a certificate (which may include key generation and performing domain control validation) is automated for some subscribers, the retrieval and installation of the corresponding certificate might not be in-scope for the automation solution.
(2) Can you help us understand the percent of affected certificates that are relying on “Azure Key Vault” or “internal vaults” where “Microsoft has the capability to centrally renew all certificates issued by specific Issuers”? For example: “XX% of the affected certificates can automatically be renewed and automatically configured for use due to Microsoft certificate lifecycle management solutions.”
(3) The context of a “leak” as presented below is unclear to us. Can you please explain this to us in a different way?
We suspect, but have not yet confirmed, that the population is high due to subscriber implementation issues, aka a "leak”.
Is this referencing scenarios where certificates were requested and issued, but then later abandoned by subscribers without requesting revocation?
(4) Comment 9 states:
Our plan is to revoke certificates in batches on a weekly basis, maintaining a CRL size which does not negatively impact clients, and leaving room for additional revocations in case other incidents occur.
Can you please share:
- (a) The criteria used for determining which certificates will be included in each week’s “batch”?
- (b) The CRL size being targeted to accomplish the stated goal of using this batch strategy?
- (c) How Microsoft PKI Services concluded the target size described immediately above will not negatively impact clients?
(5) We understand that Microsoft PKI Services was aware of its “CRL bloat” concerns related to mass revocation events in February 2025, and presumably earlier. Can you help us understand why, given the existence of this concern and the community’s emphasis on improving response to mass revocation events over the past year, Microsoft PKI Services did not move forward with planning (minimally) or implementing partitioned CRLs sooner?
(6) Comment 6 includes:
Microsoft has the capability to centrally renew all certificates issued by specific Issuers managed in Key Vault and internal vaults—a process we've successfully executed in the past and can repeat if necessary.
Can you share which DCV method(s) is being relied upon during these types of renewals?
(7) Comment 6 includes:
Furthermore, as part of our effort to reduce the certificate lifetime we have already reduced most of our certificate lifetime to 6 months by default, with a goal to meet or exceed the industry lifetime requirements.
Given 90% of the impacted time-valid certificates were found to no longer be in use, and when considered against the degree of automation we understand to be in place, could the default validity be decreased further to reduce the likelihood of “stale” or unused TLS certificates?
(8) Related to the above, has Microsoft considered the use of short-lived certificates, as defined by the TLS BRs, for these subscriber use cases?
Assignee
Comment 15 • 3 months ago
We apologize for the delay in uploading the impacted certificates. The high volume of certificates caused unexpected issues: https://prsspublishingstorage.blob.core.windows.net/public-tls-certs/all/crtshurls.txt
Assignee
Comment 16 • 3 months ago
Response to Comment 14
(1) TLS Server Authentication Certificate Automation Scope
Can you please describe the TLS server authentication certificate automation solution(s) in place to help us understand what is and is not considered in scope of the solutions available to subscribers? Statements in Comment 6 indicate “While many subscribers adopt renewed certificates within 24–48 hours, some do not.” We interpret that to mean while the process of requesting a certificate (which may include key generation and performing domain control validation) is automated for some subscribers, the retrieval and installation of the corresponding certificate might not be in-scope for the automation solution.
Most subscribers use a vault such as Azure Key Vault for certificate management. The vault enrolls the certificate, and the subscriber retrieves the keys and certificate metadata from the vault and puts them into use. Triggering re-enrollment at the vault is fully automated, and vaults support automated distribution of certificates to subscriber nodes. However, not all subscribers have adopted the automated distribution solution yet.
(2) Percentage of Certificates Managed via Vaults
Can you help us understand the percent of affected certificates that are relying on “Azure Key Vault” or “internal vaults” where “Microsoft has the capability to centrally renew all certificates issued by specific Issuers”? For example: “XX% of the affected certificates can automatically be renewed and automatically configured for use due to Microsoft certificate lifecycle management solutions.”
Greater than 99% of impacted certificates are managed through vaults and can be centrally renewed.
(3) Clarification of “Leak”
The context of a “leak” as presented below is unclear to us. Can you please explain this to us in a different way? We suspect, but have not yet confirmed, that the population is high due to subscriber implementation issues, aka a “leak”. Is this referencing scenarios where certificates were requested and issued, but then later abandoned by subscribers without requesting revocation?
Yes, your interpretation is correct. That said, as we mentioned, we are following up with subscribers to understand whether these are implementation issues or valid use cases.
(4) Weekly Revocation Batch Strategy
Comment 9 states: Our plan is to revoke certificates in batches on a weekly basis, maintaining a CRL size which does not negatively impact clients, and leaving room for additional revocations in case other incidents occur. Can you please share:
• (a) The criteria used for determining which certificates will be included in each week’s “batch”?
• (b) The CRL size being targeted to accomplish the stated goal of using this batch strategy?
• (c) How Microsoft PKI Services concluded the target size described immediately above will not negatively impact clients?
(a) The primary criterion is telemetry that tells us if the certificate is in use or not. A secondary criterion is the certificate expiration date, which allows us to demonstrate revocation of larger batch sizes over time while preventing the CRL from exceeding the target size.
(b) Our goal is for the CRL size to not exceed 10MB.
(c) We used the recommended CRL size from the Windows TRP (10MB) and a large known existing CRL (13.3MB) as reference. As a precaution, we will scale up the revocation batch size over time while observing the impact on clients.
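The criteria in (a) through (c) can be read as a greedy fill under a size budget. The sketch below is one hypothetical reading of that strategy, not Microsoft PKI Services' actual tooling. The `Cert` record, the `in_use` telemetry flag, the earliest-expiry-first ordering, and the 40-bytes-per-entry estimate are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date

# Assumed average size of one CRL entry in bytes; not a figure from this thread.
ASSUMED_BYTES_PER_ENTRY = 40


@dataclass(frozen=True)
class Cert:
    serial: str
    not_after: date
    in_use: bool  # hypothetical telemetry signal


def select_batch(certs, current_crl_bytes, target_crl_bytes,
                 bytes_per_entry=ASSUMED_BYTES_PER_ENTRY):
    """Pick certificates to revoke without pushing the CRL past the target size."""
    budget = max(0, target_crl_bytes - current_crl_bytes) // bytes_per_entry
    # Primary criterion: unused certificates first (False sorts before True).
    # Secondary criterion (assumed direction): earliest expiration first.
    ordered = sorted(certs, key=lambda c: (c.in_use, c.not_after))
    return ordered[:budget]


certs = [
    Cert("01", date(2025, 9, 1), in_use=True),
    Cert("02", date(2025, 8, 1), in_use=False),
    Cert("03", date(2025, 7, 1), in_use=False),
]
batch = select_batch(certs, current_crl_bytes=0, target_crl_bytes=80)
print([c.serial for c in batch])  # ['03', '02']
```

With an 80-byte budget and 40 bytes per entry, only two entries fit, so the two unused certificates are chosen and the in-use one waits for a later batch.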
(5) CRL Partitioning Timeline
We understand that Microsoft PKI Services was aware of its “CRL bloat” concerns related to mass revocation events in February 2025, and presumably earlier. Can you help us understand why, given the existence of this concern and the community’s emphasis on improving response to mass revocation events over the past year, Microsoft PKI Services did not move forward with planning (minimally) or implementing partitioned CRLs sooner?
Planning and implementation of CRL partitioning started before this incident and is currently being tested in a non-production environment. However, it has not been completed in time to be a mitigating factor in this incident.
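For readers unfamiliar with partitioned (sharded) CRLs, the general idea is that each certificate is assigned at issuance to one of several smaller CRLs, and its CRL Distribution Point names that shard. The sketch below illustrates this under stated assumptions. The shard count, the URL pattern, and the hash-based assignment are hypothetical and do not describe Microsoft PKI Services' design.

```python
import hashlib

# Hypothetical values for illustration only.
ASSUMED_SHARD_COUNT = 128
ASSUMED_CRL_BASE_URL = "http://crl.example.test/issuing-ca"


def shard_for_serial(serial_hex, shard_count=ASSUMED_SHARD_COUNT):
    """Map a certificate serial (even-length hex string) to a stable shard index."""
    digest = hashlib.sha256(bytes.fromhex(serial_hex)).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def crl_url_for_serial(serial_hex, shard_count=ASSUMED_SHARD_COUNT):
    """Build the shard URL that would be placed in the certificate's CDP."""
    shard = shard_for_serial(serial_hex, shard_count)
    return f"{ASSUMED_CRL_BASE_URL}/{shard}.crl"
```

Because the shard is fixed when the certificate is issued, a mass revocation spreads its entries across many small files instead of growing a single CRL.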
(6) DCV Method for Central Renewals
Comment 6 includes: Microsoft has the capability to centrally renew all certificates issued by specific Issuers managed in Key Vault and internal vaults—a process we've successfully executed in the past and can repeat if necessary. Can you share which DCV method(s) is being relied upon during these types of renewals?
The domain control validation (DCV) method used during these types of certificate renewals is BR Section 3.2.2.4.2 – Email, Fax, SMS, or Postal Mail to Domain Contact. We use the email to the Domain Contact method specifically to validate the domains and this process is automated.
(7) Certificate Validity Period
Comment 6 includes:
Furthermore, as part of our effort to reduce the certificate lifetime we have already reduced most of our certificate lifetime to 6 months by default, with a goal to meet or exceed the industry lifetime requirements.
Given 90% of the impacted time-valid certificates were found to no longer be in use, and when considered against the degree of automation we understand to be in place, could the default validity be decreased further to reduce the likelihood of “stale” or unused TLS certificates?
Our goal is for our default validity to meet or exceed industry requirements. In addition, we are working with subscribers to request shorter lifetime certificates based on their scenarios.
(8) Use of Short-Lived Certificates
Related to the above, has Microsoft considered the use of short-lived certificates, as defined by the TLS BRs, for these subscriber use cases?
Yes, we are evaluating the use of short-lived subscriber certificates.
Assignee
Comment 17 • 3 months ago
Response to Comment 13
(1) Learnings from Past Incidents
What learnings did Microsoft PKI take from that and similar incidents to make sure they would be capable of a mass-revocation event in keeping with the timelines in CCADB's Incident Reporting Guidelines?
The 'Timeline' section is remarkably quiet on what Microsoft PKI have been doing for 4 weeks. From statements provided it seems no assessment of the corpus of certificates has even occurred, but will start soon. This is called a 'Final' Incident Report because the work should have been completed already.
We have completed our internal assessment of the full corpus of impacted certificates. Due to file size limitations, we were unable to upload this directly to Bugzilla. We have provided access via public blob storage.
Two learnings we had from previous incidents were the criticality of CRL partitioning and the value of reducing certificate lifetime. Work had already started on CRL partitioning before this incident occurred but was not complete in time to be a mitigating factor. On certificate lifetime, we took a first step by reducing the lifetime of new certificates for the majority of subscribers from 1 year to 6 months starting in October 2024.
(2) Acceptability of Current Practices
Does Microsoft PKI think this is acceptable practice in 2025? What has your CA been doing this entire time?
The Root Cause Analysis section is also lacking in completeness. It seems to have been written to focus solely on why this particular revocation plan is the only feasible way forward. It does not address the complete lack of attention to incidents in the past few years that would introduce best practices to Microsoft PKI that make this a non-issue.
Thank you for the feedback. We recognize the importance of aligning with evolving best practices in the Web PKI ecosystem. As part of our long-term strategy to eliminate the conditions that led to this issue, we are investing in CRL partitioning, standing up warm standby ICAs, reducing the number of publicly trusted certificates, and exploring short-lived certificate models. These actions are designed to ensure we can support timely and large-scale revocation going forward.
Additionally, as mentioned in Comment 11 of Bug 1962829 we have opened action items to enhance our process to ensure we are adapting best practices from all Bugzilla incidents moving forward.
(3) Time to Transition to New Intermediaries
Can Microsoft PKI talk us through the time it would take to generate new intermediaries and transition as many certificates across as possible? Note that this is not including any cross-signing.
If we had warm standby ICAs, we could start transitioning subscribers immediately. The goal of the repair item to have warm standby CAs is to remove the lag associated with the creation of new ICAs. Subscribers can be transitioned through a combination of automation and an internal campaign on the order of days.
(4) Limitations on Cross-Signing
What are the limitations on cross-signing with a new intermediary in getting this handled in a timely manner?
A new cross-signing arrangement requires negotiation, legal review, and formal execution of a contract with the third-party CA. This process introduces time constraints that make it unsuitable for immediate response actions.
(5) Root CA Rotation and Legacy Devices
Are there any plans currently for dealing with root CAs being rotated and the impact on subscribers leaning on legacy-device use? See Chrome Root Program's Root CA Term-Limit as an example.
Given the entire point of these incidents is to tell the WebPKI ecosystem what has happened, and what you will do to ensure this does not happen again I'm rather baffled at the Action Items. As far as I can see these focus solely on dealing with the current problem, not making sure this can never happen again. There seem to be blinders on the CRL throughout the report, when alternative means of handling this exist but seem to be getting disregarded in favor of a 6-month plan at minimum.
2025-04-25 is when the Certificate Problem Report was sent in.
2025-05-10 is the start of this incident with a preliminary incident report.
2025-05-23 is when the final incident report was published.
We are aware of the Chrome Root Program’s root CA term limits. While customer workloads for our subscribers have legacy device dependencies today, we expect those dependencies to diminish as those devices age out of the ecosystem in the coming years. Our reliance on cross-signing to support those devices will phase out accordingly.
Some of our action items are focused on resolving this incident, but others—such as implementing CRL partitioning and establishing Warm Standby ICAs—are intended to address the root causes that currently limit timely revocation. These forward-looking efforts are critical to ensuring we can respond more quickly and reliably to similar issues in the future.
(6) Incident Reporting Timeliness
Can Microsoft PKI please explain how this is in keeping with CCADB's Incident Reporting Guidelines for handling incident reports? Note: 'When are reports expected?'
I strongly advise that Microsoft PKI at the very least read this comment by Mozilla in a recent incident.
We acknowledge that this bug (1965612) should have been opened earlier, ideally when it became clear that revocation within 5 days would not be feasible. While we filed the Preliminary Report for Bug 1962829 on 2025-05-09, we agree that separate reporting for revocation delays was warranted sooner.
We have noted this delay in the “What Did Not Go Well” section of our Full Incident Report and have committed to a related repair item: improving our internal processes for early scoping and rapid incident triage, including ensuring new bugs are filed in a timely manner when distinct revocation challenges arise.
(7) Demonstrating Public Trust and Best Practices
With the above comment in mind, can Microsoft PKI please explain how this plan is showing public trust in adhering to best practices to date?
For those with censys access this should cover the majority of impacted certificates.
(labels="trusted" and validation.nss.has_trusted_path=true and not labels="revoked") and parsed.extensions.extended_key_usage.server_auth="true" and parsed.validity_period.not_before: {2024-07-21 to 2025-04-21} and parsed.issuer.organization=Microsoft Corporation and parsed.subject_key_info.key_algorithm.name=RSA
Results: 70,044,495
I will leave further analysis to other parties, suffice to say Chrome Root Program did hint at issues in question 7 of the other incident.
We are performing batch revocations to demonstrate our ability to revoke at scale while preserving CRL space for additional revocations if necessary. In parallel, we’re advancing efforts to stand up warm standby CAs, reduce subscriber reliance on publicly trusted certificates, reduce certificate lifetime and investigate short-lived certificates, and implement partitioned CRLs—all aimed at minimizing the risk of delayed revocation going forward.
Assignee
Comment 18 • 3 months ago
Revocation Delay Status Update
- The number of certificates that have been revoked: 11,000
- The number of certificates that have not yet been revoked: 72,070,777
- The number of certificates planned for revocation that have expired: 3,290,688
- An estimate for when all remaining revocations will be completed: We will continue to revoke certificates in batches until 11/15/2025, as mentioned in our Full Incident Report.
Assignee
Comment 19 • 3 months ago
Update to Action Items
We are actively working on all repair items associated with this incident. In addition, we have updated the due dates for several action items to reflect current progress and planning.
Action Item | Kind | Root Cause(s) | Evaluation Criteria | Due Date | Status |
---|---|---|---|---|---|
Revoke impacted certificates (in batches beginning 5/28/2025) | Mitigate | Root Cause 1 | Track % of impacted certificates revoked. | 11/15/2025 | In Progress |
Migrate cert issuance to use partitioned CRLs | Prevent | Root Cause 1 | % certificates confirmed via CT logs and CDP endpoints | 11/15/2025 | In Progress |
Stand up cross-signed warm standby CAs. We are currently in planning stages. We will have the plan ready before 06/14/2025 | Prevent | Root Cause 1 | Publish and disclose standby ICAs in CT logs. Validate readiness through test issuance. | 9/30/2025 | In Progress |
Create training and TSG Documentation to educate team on revocation expectations | Prevent | Root Cause 1 | TSG documentation will be created and training compliance tracked through internal processes. | 7/31/2025 | New |
Reduce usage of public PKI | Prevent | Root Cause 1 | % reduction in public trusted certificates, unexpired certificates | 9/30/2025 | In Progress |
Exercise and refine the mass revocation playbook | Prevent | Root Cause 1 | Playbook validated with multiple rounds of revocations. Tracked internally. | 7/27/2025 | In Progress |
Comment 20 • 3 months ago
(In reply to Microsoft PKI Services from comment #16)
Response to Comment 14
(4) Weekly Revocation Batch Strategy
Comment 9 states: Our plan is to revoke certificates in batches on a weekly basis, maintaining a CRL size which does not negatively impact clients, and leaving room for additional revocations in case other incidents occur. Can you please share:
• (a) The criteria used for determining which certificates will be included in each week’s “batch”?
• (b) The CRL size being targeted to accomplish the stated goal of using this batch strategy?
• (c) How Microsoft PKI Services concluded the target size described immediately above will not negatively impact clients?
(a) The primary criterion is telemetry that tells us if the certificate is in use or not. A secondary criterion is the certificate expiration date, which allows us to demonstrate revocation of larger batch sizes over time while preventing the CRL from exceeding the target size.
(b) Our goal is for the CRL size to not exceed 10MB.
(c) We used the recommended CRL size from the Windows TRP (10MB) and a large known existing CRL (13.3MB) as reference. As a precaution, we will scale up the revocation batch size over time while observing the impact on clients.
Could you provide a link to the "Windows TRP" document referenced here?
Is this referring to Program Requirements - Microsoft Trusted Root Program section 3.a.5 "If an AIA extension with a valid OCSP URL is NOT included, then the resulting CRL File should be <10MB." and repeated in 3.c.3.c "Maximum size of the CRL file (either full CRL or partitioned CRL) should not exceed 10M."? I note that both of those are SHOULD recommendations/requirements, not MUST.
Which CAs' CRLs have you referenced to develop that maximum size goal?
The blog article An analysis of CRL sizes, posted mid-last year, shows some historical data at or above that limit. Following the CRL links in that article allows easy discovery of current examples like Digicert Global G2 TLS RSA SHA256 2020 CA1 (around 12.7MB at 2025-05-31T02:20Z).
(5) CRL Partitioning Timeline
We understand that Microsoft PKI Services was aware of its “CRL bloat” concerns related to mass revocation events in February 2025, and presumably earlier. Can you help us understand why, given the existence of this concern and the community’s emphasis on improving response to mass revocation events over the past year, Microsoft PKI Services did not move forward with planning (minimally) or implementing partitioned CRLs sooner?
Planning and implementation of CRL partitioning started before this incident and is currently being tested in a non-production environment. However, it has not been completed in time to be a mitigating factor in this incident.
Could you share more detail about your timeline for CRL partitioning, such as when planning began and when the implementation was first in a state where it was considered ready for testing?
While trying to find the "Windows TRP" document referenced above I see Microsoft published guidance for deploying PKI on Windows Server 2003 on or before 2014/05/27 which recommends CRL partitioning. I find it surprising that Microsoft PKI Services have not adopted that guidance internally in the intervening decade.
(In reply to Microsoft PKI Services from comment #17)
Response to Comment 13
(1) Learnings from Past Incidents
What learnings did Microsoft PKI take from that and similar incidents to make sure they would be capable of a mass-revocation event in keeping with the timelines in CCADB's Incident Reporting Guidelines?
The 'Timeline' section is remarkably quiet on what Microsoft PKI have been doing for 4 weeks. From statements provided it seems no assessment of the corpus of certificates has even occurred, but will start soon. This is called a 'Final' Incident Report because the work should have been completed already.
We have completed our internal assessment of the full corpus of impacted certificates. Due to file size limitations, we were unable to upload this directly to Bugzilla. We have provided access via public blob storage.
Two learnings we had from previous incidents were the criticality of CRL partitioning and the value of reducing certificate lifetime. Work had already started on CRL partitioning before this incident occurred but was not complete in time to be a mitigating factor. On certificate lifetime, we took a first step by reducing the lifetime of new certificates for the majority of subscribers from 1 year to 6 months starting in October 2024.
You mention that you took learnings from previous incidents; however, you do not reference any incidents from other CAs in the "Related Incidents" section of your Full Incident Report. The CCADB Full Incident Report template and accompanying explanation of this field explicitly state:
“Related Incidents” MUST consider incidents beyond those corresponding to the CA Owner subject of this report.
Which incident(s) did you review when implementing the October 2024 default lifetime changes? Which incident(s) did you review when planning/implementing the CRL partitioning changes?
This would be useful information to include in the timeline for Contributing Factor 1 as part of your Root Cause Analysis.
Which incident(s) did you review when developing the mass revocation plan for this incident?
While you have mentioned that until recently Microsoft PKI Services only informally monitored and reviewed incidents, based on your responses to this question in comment 13 and question 3 of comment 4 it appears you did use other incidents to guide your response/revocation plan in this instance. Explicit references to the other mass revocation incidents and revocation delay incidents you reviewed should also be included in the Related Incidents section.
Comment 21 • 3 months ago
[In response to Comment 16.]
Thank you for providing answers to the questions posted in Comment 14.
We have a few follow-up questions and comments:
Question 1: In its response, Microsoft PKI Services stated:
Greater than 99% of impacted certificates are managed through vaults and can be centrally renewed.
Can you please share:
- a) Approximately what percent of subscribers have adopted the automated distribution solution?
- b) Approximately what percent of affected certificates are represented by those subscribers?
- c) Whether Microsoft has triggered the automatic renewal and replacement for these subscribers’ certificates?
Question 2: In its response, Microsoft PKI Services stated:
The domain control validation (DCV) method used during these types of certificate renewals is BR Section 3.2.2.4.2 – Email, Fax, SMS, or Postal Mail to Domain Contact.
Given this method has been sunset, can you share what Microsoft PKI Services is planning for certificates issued beginning July 15, 2025?
Question 3: In its response, Microsoft PKI Services stated:
Our goal is for our default validity to meet or exceed industry requirements. In addition, we are working with subscribers to request shorter lifetime certificates based on their scenarios.
Can you please share more? This response doesn’t offer actionable detail or directly answer the question posed. Asked differently, considering that 90% of affected certificates were determined not in use at the time of this incident, can that be interpreted to indicate the default validity should be less than six months given observed real-world use of these certificates?
Question 4: Do you have any indication that the not-in-use certificates corresponded to Azure resources that were intentionally short-lived (e.g., someone standing up a test environment for a few days, confirming something, and then deleting it — however in the process the corresponding TLS certificate is orphaned)? Like you, we’re trying to understand the user patterns that resulted in the significant amount of non-use.
Comment 1: The Evaluation Criteria included in subsequent updates can be improved by offering concrete objectives. For example “% reduction in public trusted certificates, unexpired certificates” doesn’t offer sufficient detail to understand how Microsoft is evaluating this Action Item, or how a member of the community can help.
Comment 2: Studying data disclosed to the CCADB, we do observe CA records trusted in Chrome that disclose a full and complete CRL, but NOT a partitioned array and whose corresponding size is larger than 10 MB. We’re not aware of specific issues related to these CAs, though we would be interested to learn if there are, in fact, specific issues.
- http://crl.quovadisglobal.com/hinicag2.crl (99.45 MB)
- http://certificates.godaddy.com/mastergodaddy2issuing.crl (39.39 MB)
- http://httpcrl.trust.telia.com/teliasoneramobileidcav2.crl (21.97 MB)
- https://www.accv.es/fileadmin/Archivos/certificados/accvca120_der.crl (17.45 MB)
- http://www.accv.es/fileadmin/Archivos/certificados/accvca120_der.crl (serves the same file as above)
- http://crl.sectigo.com/SectigoRSADomainValidationSecureServerCA.crl (10.77 MB)
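CRL sizes like those listed above change over time, so readers may want to re-measure them. The following is a minimal sketch that asks the server for the `Content-Length` of a CRL URL using only the Python standard library. It assumes the server answers `HEAD` requests and reports a length, and it assumes "MB" means 10^6 bytes, which this thread does not specify.

```python
import urllib.request


def bytes_to_mb(size_bytes):
    """Convert a byte count to megabytes (assumed 10**6 bytes), two decimals."""
    return round(size_bytes / 1_000_000, 2)


def crl_size_bytes(url, timeout=30):
    """Return the Content-Length reported for a CRL URL, or None if absent."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        length = response.headers.get("Content-Length")
    return int(length) if length is not None else None
```

For example, `bytes_to_mb(crl_size_bytes(url))` with one of the URLs above would give a current figure, provided the server returned a length. Results will differ from the sizes quoted in this comment as entries are added and removed.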
Question 5: Looking at some of the Microsoft PKI Services CRLs relevant to this incident (e.g., http://www.microsoft.com/pkiops/crl/Microsoft%20Azure%20RSA%20TLS%20Issuing%20CA%2007.crl and http://www.microsoft.com/pkiops/crl/Microsoft%20Azure%20RSA%20TLS%20Issuing%20CA%2008.crl ) - the current size (~4KB) is significantly less than the stated 10 MB goal. We understand Microsoft is intending to gradually ramp-up revocations, but the timing of that ramp-up is unclear.
Can you help us understand how Microsoft PKI Services intends to balance its planned revocations and the desire to leave room for additional revocations in case other incidents occur (described in Comment 9)?
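To put the gap between a ~4KB CRL and a 10 MB target in perspective, a rough capacity estimate can help. The sketch below is back-of-the-envelope arithmetic only. The 40-bytes-per-entry figure is an assumption (actual entry size depends on serial number length and whether a reason code is included), and "10 MB" is taken as 10,000,000 bytes.

```python
def max_entries(target_crl_bytes, bytes_per_entry, overhead_bytes=0):
    """Estimate how many revocation entries fit under a CRL size target."""
    usable = max(0, target_crl_bytes - overhead_bytes)
    return usable // bytes_per_entry


def crls_needed(cert_count, entries_per_crl):
    """Estimate how many CRLs of that capacity would hold cert_count entries."""
    return -(-cert_count // entries_per_crl)  # ceiling division


# Assumed 40 bytes per entry; 10 MB taken as 10,000,000 bytes.
capacity = max_entries(10_000_000, 40)
print(capacity)                           # 250000
# 72,070,777 is the not-yet-revoked count reported in Comment 18.
print(crls_needed(72_070_777, capacity))  # 289
```

Under these assumptions a single 10 MB CRL holds roughly 250,000 entries, which is small relative to the population reported in Comment 18 and is consistent with the emphasis on partitioned CRLs elsewhere in this thread.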
Assignee
Comment 22 • 3 months ago
Weekly Status Update
We are actively progressing through all repair items identified in the incident report. All action items are currently in progress. We remain on track to meet the expected due dates outlined in the full report.
In addition, we encountered a new revocation-related issue yesterday that resulted in a non-compliance. We will be reporting this through a separate Bugzilla entry and will link it as a related incident once posted.
Assignee
Comment 23 • 3 months ago
Revocation Delay Status Update
- The number of certificates that have been revoked: 347,644
- The number of certificates that have not yet been revoked: 69,172,142
- The number of certificates planned for revocation that have expired: 2,554,134
- An estimate for when all remaining revocations will be completed: We will continue to revoke certificates in batches until 11/15/2025, as mentioned in our FIR.
Comment 24 • 3 months ago
(In reply to Microsoft PKI Services from comment #23)
Do we have a more detailed revocation plan yet? Currently we seem to be stalling until November when the final certificates will expire instead of being revoked as required.
If there is no intent for these certificates to ever be revoked, why are they listed as 'planned for revocation'?
I am dismayed at the attention this incident is receiving and the lack of proactiveness. We have regular reports late Friday, and no sign that this is being treated with any severity internally. If we are to learn of the plan through weekly questioning then please advise us in advance so the questions can be more thoroughly worded.
Assignee
Comment 25 • 3 months ago
(In reply to Andrew from comment #20)
(In reply to Microsoft PKI Services from comment #16)
Response to Comment 14
(4) Weekly Revocation Batch Strategy
Comment 9 states: Our plan is to revoke certificates in batches on a weekly basis, maintaining a CRL size which does not negatively impact clients, and leaving room for additional revocations in case other incidents occur. Can you please share:
• (a) The criteria used for determining which certificates will be included in each week’s “batch”?
• (b) The CRL size being targeted to accomplish the stated goal of using this batch strategy?
• (c) How Microsoft PKI Services concluded the target size described immediately above will not negatively impact clients?(a) The primary criterion is telemetry that tells us if the certificate is in use or not. A secondary criterion is the certificate expiration date, which allows us to demonstrate revocation of larger batch sizes over time while preventing the CRL from exceeding the target size.
(b) Our goal is for the CRL size to not exceed 10MB.
(c) We used the recommended CRL size from the Windows TRP (10MB) and a large known existing CRL (13.3MB) as reference. As a precaution, we will scale up the revocation batch size over time while observing the impact on clients.Could you provide a link the "Windows TRP" document referenced here?
Is this referring to Program Requirements - Microsoft Trusted Root Program section 3.a.5 "If an AIA extension with a valid OCSP URL is NOT included, then the resulting CRL File should be <10MB." and repeated in 3.c.3.c "Maximum size of the CRL file (either full CRL or partitioned CRL) should not exceed 10M."? I note that both of those are SHOULD recommendations/requirements, not MUST.Which CAs CRLs have you referenced to develop that maximum size goal?
The blog article An analysis of CRL sizes posted mid-last year shows some historical data at or above that limit, following the CRL links in that article allows easy discovery of current examples like Digicert Global G2 TLS RSA SHA256 2020 CA1 (around 12.7MB at 2025-05-31T02:20Z).(5) CRL Partitioning Timeline
We understand that Microsoft PKI Services was aware of its “CRL bloat” concerns related to mass revocation events in February 2025, and presumably earlier. Can you help us understand that given the existence of this concern and the community’s emphasis on improving response to mass revocation events over the past year, Microsoft PKI Services did not move forward with planning (minimally) or implementing partitioned CRLs sooner?
Planning and implementation of CRL partitioning started before this incident and is currently being tested in a non-production environment. However, it has not been completed in time to be a mitigating factor in this incident.
Could you share more detail about your timeline for CRL partitioning, such as when planning began and when the implementation was first in a state where it was considered ready for testing?
While trying to find the "Windows TRP" document referenced above I see Microsoft published guidance for deploying PKI on Windows Server 2003 on or before 2014/05/27 which recommends CRL partitioning. I find it surprising that Microsoft PKI Services have not adopted that guidance internally in the intervening decade.(In reply to Microsoft PKI Services from comment #17)
Response to Comment 13
(1) Learnings from Past Incidents
What learnings did Microsoft PKI take from that and similar incidents to make sure they would be capable of a mass-revocation event in keeping with the timelines in CCADB's Incident Reporting Guidelines?
The 'Timeline' section is remarkably quiet on what Microsoft PKI have been doing for 4 weeks. From statements provided it seems no assessment of the corpus of certificates has even occurred, but will start soon. This is called a 'Final' Incident Report because the work should have been completed already.We have completed our internal assessment of the full corpus of impacted certificates. Due to file size limitations, we were unable to upload this directly to Bugzilla. We have provided access via public blob storage.
Two learnings we had from previous incidents were the criticality of CRL partitioning and the value of reducing certificate lifetime. Work had already started on CRL partitioning before this incident occurred but was not complete in time to be a mitigating factor. On certificate lifetime, we took a first step by reducing the lifetime of new certificates for the majority of subscribers from 1 year to 6 months starting in October 2024.
You mention that you took learnings from previous incidents, however you do not reference any incidents from other CAs in the "Related Incidents" section of your Full Incident Report. The CCADB Full Incident Report template and accompanying explanation of this field explicitly states:
“Related Incidents” MUST consider incidents beyond those corresponding to the CA Owner subject of this report.
Which incident(s) did you review when implementing the October 2024 default lifetime changes? Which incident(s) did you review when planning/implementing the CRL partitioning changes?
This would be useful information to include in the timeline for Contributing Factor 1 as part of your Root Cause Analysis.
Which incident(s) did you review when developing the mass revocation plan for this incident?
While you have mentioned that until recently Microsoft PKI Services only informally monitored and reviewed incidents, based on your responses to this question in comment 13 and question 3 of comment 4 it appears you did use other incidents to guide your response/revocation plan in this instance. Explicit references to the other mass revocation incidents and revocation delay incidents you reviewed should also be included in the Related Incidents section.
(1) Windows TRP link
” Could you provide a link the "Windows TRP" document referenced here? Is this referring to Program Requirements - Microsoft Trusted Root Program section 3.a.5 "If an AIA extension with a valid OCSP URL is NOT included, then the resulting CRL File should be <10MB." and repeated in 3.c.3.c "Maximum size of the CRL file (either full CRL or partitioned CRL) should not exceed 10M."? I note that both of those are SHOULD recommendations/requirements, not MUST.”
Yes, the reference to "Windows TRP" in our response corresponds to the Microsoft Trusted Root Program requirements.
We acknowledge that both of these are "SHOULD" recommendations rather than "MUST" requirements. Our decision to target a CRL size around 10MB aligns with these recommendations and reflects a conservative approach aimed at minimizing potential impact to relying parties during revocation processing. This leaves some space for any revocations that we may need to do for potential problem reports.
(2) Referenced CRLs
"Which CAs CRLs have you referenced to develop that maximum size goal? The blog article An analysis of CRL sizes posted mid-last year shows some historical data at or above that limit, following the CRL links in that article allows easy discovery of current examples like Digicert Global G2 TLS RSA SHA256 2020 CA1 (around 12.7MB at 2025-05-31T02:20Z)."
We referenced CRLs from several widely deployed CAs when evaluating an acceptable maximum size. Specifically, we identified CRLs mentioned from How Big Are CRLs That Are Found In The Wild? | technotes.seastrom.com and from the link you mentioned.
We chose a 10MB target as a conservative threshold, aligning with Windows TRP recommendations and in line with some of the largest CRLs we found in the links mentioned above.
(3) CRL Partitioning Timeline
"Could you share more detail about your timeline for CRL partitioning, such as when planning began and when the implementation was first in a state where it was considered ready for testing? While trying to find the "Windows TRP" document referenced above I see Microsoft published guidance for deploying PKI on Windows Server 2003 on or before 2014/05/27 which recommends CRL partitioning. I find it surprising that Microsoft PKI Services have not adopted that guidance internally in the intervening decade. "
Implementation of CRL partitioning in our CA service started in November 2024. We were already testing the changes in our pre-production environment at the time this bug was reported, but we have identified issues in our testing, which we are working to resolve.
Specific to your question about the guidance from 2014, the method described in that reference specifies rolling the CA key every year to reduce the CRL size. That method does work to limit CRL size but has many other limitations that prohibit it from being a good option for managing our CA infrastructure. This article uses the word “partitioning” but describes a method different from what we have been discussing in this bug recently.
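For readers less familiar with the distinction, a minimal sketch of sharded (partitioned) CRLs follows. It is an illustration only and not a description of Microsoft PKI Services' implementation: the shard count, the URL pattern, and the modulo assignment rule are all assumptions made for the example. The idea is that each certificate is bound at issuance to one CRL shard, named in its CRL Distribution Points extension, so a revocation only grows that one shard rather than a single full CRL.

```python
# Illustrative sketch only: the shard count, URL pattern, and modulo rule
# are hypothetical and not taken from this bug or any CA's deployment.
from collections import defaultdict

NUM_SHARDS = 8  # assumed number of CRL partitions
URL_TEMPLATE = "http://crl.example.test/issuing-ca/{shard}.crl"  # hypothetical


def shard_for_serial(serial: int, num_shards: int = NUM_SHARDS) -> int:
    """Pick a shard deterministically so issuance and revocation agree."""
    return serial % num_shards


def cdp_url_for_serial(serial: int) -> str:
    """URL that would be placed in the certificate's CDP extension."""
    return URL_TEMPLATE.format(shard=shard_for_serial(serial))


def group_revocations(revoked_serials):
    """Map each shard to the serials its CRL would need to list."""
    shards = defaultdict(list)
    for serial in revoked_serials:
        shards[shard_for_serial(serial)].append(serial)
    return dict(shards)
```

With a scheme along these lines, a mass revocation is spread across shards, so each individual file stays smaller than one unpartitioned CRL would be. Rolling the CA key, as in the 2014 guidance, instead bounds CRL size by limiting how many certificates any one CA key ever issues.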
(4) Related Incidents
"You mention that you took learnings from previous incidents, however you do not reference any incidents from other CAs in the "Related Incidents" section of your Full Incident Report. The CCADB Full Incident Report template and accompanying explanation of this field explicitly states: “Related Incidents” MUST consider incidents beyond those corresponding to the CA Owner subject of this report. Which incident(s) did you review when implementing the October 2024 default lifetime changes? Which incident(s) did you review when planning/implementing the CRL partitioning changes? This would be useful information to include in the timeline for Contributing Factor 1 as part of your Root Cause Analysis."
Thank you for the clarification. We acknowledge the requirement to include relevant incidents from other CAs in the “Related Incidents” section of the Full Incident Report, as defined in the CCADB template guidance.
In relation to the default lifetime changes, multiple factors drove that decision: the evolution of Microsoft’s own internal standards, the evolution of industry requirements, and learnings from past incidents like Bugzilla 1715672. Similarly, CRL partitioning was already in our plans prior to this incident, based on our own internal analysis as well as learnings from incidents like Bugzilla 1715672. As we have outlined in the action items for Bug 1962829, we are formalizing the process for Bugzilla bug reviews, which will not only help us learn from other incidents systematically but will also make correlation of incidents easier.
We will update the “Related Incidents” section and the Root Cause Analysis timeline to reflect this.
(5) Incident Review for Mass Revocation Plan
"Which incident(s) did you review when developing the mass revocation plan for this incident? While you have mentioned that until recently Microsoft PKI Services only informally monitored and reviewed incidents, based on your responses to this question in comment 13 and question 3 of comment 4 it appears you did use other incidents to guide your response/revocation plan in this instance. Explicit references to the other mass revocation incidents and revocation delay incidents you reviewed should also be included in the Related Incidents section."
We referenced the following incidents as part of our response planning and will include them in our Related Incidents section:
1890896 - Entrust: CPS typographical (text placement) error
1910805 - DigiCert: Delayed revocation of 1910322
1715672 - Let's Encrypt: Failure to revoke for Certificate Lifetime Incident
Comment 26 (Assignee) • 2 months ago
Response to Comment 21 - Chrome Root Program
Question 1
”In its response, Microsoft PKI Services stated:
Greater than 99% of impacted certificates are managed through vaults and can be centrally renewed. Can you please share:
- a) Approximately what percent of subscribers have adopted the automated distribution solution?
- b) Approximately what percent of affected certificates are represented by those subscribers?
- c) Whether Microsoft has triggered the automatic renewal and replacement for these subscribers’ certificates?”
a) Based on our analysis to date, which covers 50% of the impacted certificate population, we have confirmed 95% are auto distributed within 5 days of renewal. We will continue analyzing the remaining 50% and will provide an update once that is complete. Note that this is true for the population of the affected certificates, and based on how customer workloads for the subscriber services evolve, this mix could change in the future.
b) The subscribers analyzed so far represent approximately 50% of the affected certificate volume.
c) Microsoft has not triggered a rotation of the affected certificates. Of the affected certificates, ~98% have already been deleted, expired or renewed. >99% of the remaining certs will be automatically rotated before July 31st.
Question 2
” In its response, Microsoft PKI Services stated:
The domain control validation (DCV) method used during these types of certificate renewals is BR Section 3.2.2.4.2 – Email, Fax, SMS, or Postal Mail to Domain Contact.
Given this method has been sunset, can you share what Microsoft PKI Services is planning for certificates issued beginning July 15, 2025?”
We will be using the Email to DNS CAA Contact method outlined in section 3.2.2.4.13 of the BRs starting on July 15, 2025. We also support the DNS Change method as outlined in section 3.2.2.4.7 of the BRs.
Question 3
” In its response, Microsoft PKI Services stated:
Our goal is for our default validity to meet or exceed industry requirements. In addition, we are working with subscribers to request shorter lifetime certificates based on their scenarios.
Can you please share more? This response doesn’t offer actionable detail or directly answer the question posed. Asked differently, considering that 90% of affected certificates were determined not in use at the time of this incident, can that be interpreted to indicate the default validity should be less than six months given observed real-world use of these certificates?”
Yes, based on our investigation, 75% of the impacted certificates could have had a 30 day lifetime based on the lifecycle of the underlying resource using the certificate. We plan to work with subscribers with scenarios like this to move them to 30 day certificates.
Question 4
”Do you have any indication that the not-in-use certificates corresponded to Azure resources that were intentionally short-lived (e.g., someone standing up a test environment for a few days, confirming something, and then deleting it — however in the process the corresponding TLS certificate is orphaned)? Like you, we’re trying to understand the user patterns that resulted in the significant amount of non-use.”
There are 2 major categories of workloads that we have found which are driving a high % of not-in-use certificates, which would benefit from moving to short lived certificates:
- Short-lived customer workloads
- Synthetic testing workloads for customer experience
In these cases, endpoints are created and then deleted in a short period, causing the certs to be created but then no longer used even though they remain valid.
Comment 1
” The Evaluation Criteria included in subsequent updates can be improved by offering concrete objectives. For example “% reduction in public trusted certificates, unexpired certificates” doesn’t offer sufficient detail to understand how Microsoft is evaluating this Action Item, or how a member of the community can help.”
Thank you for the suggestion. We have updated the evaluation criteria for our action items and will include them in our weekly update.
Comment 2
”Studying data disclosed to the CCADB, we do observe CA records trusted in Chrome that disclose a full and complete CRL, but NOT a partitioned array and whose corresponding size is larger than 10 MB. We’re not aware of specific issues related to these CAs, though we would be interested to learn if there are, in fact, specific issues.
- http://crl.quovadisglobal.com/hinicag2.crl (99.45 MB)
- http://certificates.godaddy.com/mastergodaddy2issuing.crl (39.39 MB)
- http://httpcrl.trust.telia.com/teliasoneramobileidcav2.crl (21.97 MB)
- https://www.accv.es/fileadmin/Archivos/certificados/accvca120_der.crl (17.45 MB)
- http://www.accv.es/fileadmin/Archivos/certificados/accvca120_der.crl (serves the same file as above)
- http://crl.sectigo.com/SectigoRSADomainValidationSecureServerCA.crl (10.77 MB)”
When setting the initial targets, we researched CRLs from several widely deployed CAs when evaluating an acceptable maximum size. Specifically, we identified CRLs mentioned in the following articles – How Big Are CRLs That Are Found In The Wild? | technotes.seastrom.com and An analysis of CRL sizes. At the time of analysis, the largest CRL we found in these articles was approximately 13MB. We chose a 10MB target as a conservative threshold, aligning with Windows TRP recommendations, in line with some of the largest CRLs we found in the links mentioned above, and leaving room to grow up to 13 MB.
During our recent revocation efforts, we have received an escalation from a Microsoft service regarding the CRL size being too large (5MB at the time). We will continue to follow Windows TRP recommendations and monitor potential impacts closely.
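As a rough sense of scale for these thresholds, the sketch below estimates how many revoked-certificate entries fit under a given CRL size. The per-entry size of 40 bytes and the fixed overhead of 1 KB are assumptions chosen for illustration (real entries vary with serial number length and per-entry extensions), and the function names are hypothetical. None of these figures come from Microsoft.

```python
# Back-of-the-envelope CRL sizing. BYTES_PER_ENTRY and OVERHEAD_BYTES are
# assumed values for illustration, not measurements from any real CRL.
BYTES_PER_ENTRY = 40      # assumed: serial number + revocation date + DER framing
OVERHEAD_BYTES = 1024     # assumed: issuer name, dates, extensions, signature
MB = 1024 * 1024


def max_entries(crl_limit_bytes: int) -> int:
    """Largest entry count whose estimated CRL stays within the limit."""
    return max(0, (crl_limit_bytes - OVERHEAD_BYTES) // BYTES_PER_ENTRY)


def estimated_size(entries: int) -> int:
    """Estimated CRL size in bytes for a given number of revoked entries."""
    return OVERHEAD_BYTES + entries * BYTES_PER_ENTRY


# Compare the sizes discussed in this thread: 5 MB, 10 MB, and 13 MB.
for limit_mb in (5, 10, 13):
    print(f"{limit_mb} MB -> about {max_entries(limit_mb * MB):,} entries")
```

Under these assumptions, a single unpartitioned CRL at 10 MB holds on the order of a few hundred thousand entries, which is why batch size and the number of issuing CAs matter for an affected population in the tens of millions.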
Question 5
” Looking at some of the Microsoft PKI services CRLs relevant to this incident (e.g., http://www.microsoft.com/pkiops/crl/Microsoft%20Azure%20RSA%20TLS%20Issuing%20CA%2007.crl and http://www.microsoft.com/pkiops/crl/Microsoft%20Azure%20RSA%20TLS%20Issuing%20CA%2008.crl ) the current size (~4KB) is significantly less than the stated 10 MB goal. We understand Microsoft is intending to gradually ramp-up revocations, but the timing of that ramp-up is unclear.”
Please see the attached weekly revocation plan, which details the ramp plan for how many certs we plan to revoke on a weekly basis.
Comment 27 (Assignee) • 2 months ago
Revocation Plan
Comment 28 • 2 months ago
(In reply to Microsoft PKI Services from comment #26)
Response to Comment 21 - Chrome Root Program
Question 1
”In its response, Microsoft PKI Services stated:
Greater than 99% of impacted certificates are managed through vaults and can be centrally renewed. Can you please share:
- a) Approximately what percent of subscribers have adopted the automated distribution solution?
- b) Approximately what percent of affected certificates are represented by those subscribers?
- c) Whether Microsoft has triggered the automatic renewal and replacement for these subscribers’ certificates?”
a) Based on our analysis to date, which covers 50% of the impacted certificate population, we have confirmed 95% are auto distributed within 5 days of renewal. We will continue analyzing the remaining 50% and will provide an update once that is complete. Note that this is true for the population of the affected certificates, and based on how customer workloads for the subscriber services evolve, this mix could change in the future.
b) The subscribers analyzed so far represent approximately 50% of the affected certificate volume.
c) Microsoft has not triggered a rotation of the affected certificates. Of the affected certificates, ~98% have already been deleted, expired or renewed. >99% of the remaining certs will be automatically rotated before July 31st.
Q1: By July 31st what percentage of certificates that should have been revoked will Microsoft have revoked as per the plan? Please include certificates that have expired since the start of May in that total as they should have been in the revocation to begin with.
Q2: The sampling is based on 50% of the affected certificates and we're getting results that ~98% are no longer in use. What is the barrier to moving the remaining ~2% to a different intermediary to work around the perceived CRL issue?
Question 3
” In its response, Microsoft PKI Services stated:
Our goal is for our default validity to meet or exceed industry requirements. In addition, we are working with subscribers to request shorter lifetime certificates based on their scenarios.
Can you please share more? This response doesn’t offer actionable detail or directly answer the question posed. Asked differently, considering that 90% of affected certificates were determined not in use at the time of this incident, can that be interpreted to indicate the default validity should be less than six months given observed real-world use of these certificates?”
Yes, based on our investigation, 75% of the impacted certificates could have had a 30 day lifetime based on the lifecycle of the underlying resource using the certificate. We plan to work with subscribers with scenarios like this to move them to 30 day certificates.
That is good to hear.
Q3: Based off of data available so far how many subscribers can be moved to short-lived certificates bypassing the need for revocation entirely?
Q4: Are there any plans in the near future to move these subscribers to short-lived certificates?
Comment 2
”Studying data disclosed to the CCADB, we do observe CA records trusted in Chrome that disclose a full and complete CRL, but NOT a partitioned array and whose corresponding size is larger than 10 MB. We’re not aware of specific issues related to these CAs, though we would be interested to learn if there are, in fact, specific issues.
- http://crl.quovadisglobal.com/hinicag2.crl (99.45 MB)
- http://certificates.godaddy.com/mastergodaddy2issuing.crl (39.39 MB)
- http://httpcrl.trust.telia.com/teliasoneramobileidcav2.crl (21.97 MB)
- https://www.accv.es/fileadmin/Archivos/certificados/accvca120_der.crl (17.45 MB)
- http://www.accv.es/fileadmin/Archivos/certificados/accvca120_der.crl (serves the same file as above)
- http://crl.sectigo.com/SectigoRSADomainValidationSecureServerCA.crl (10.77 MB)”
When setting the initial targets, we researched CRLs from several widely deployed CAs when evaluating an acceptable maximum size. Specifically, we identified CRLs mentioned in the following articles – How Big Are CRLs That Are Found In The Wild? | technotes.seastrom.com and An analysis of CRL sizes. At the time of analysis, the largest CRL we found in these articles was approximately 13MB. We chose a 10MB target as a conservative threshold, aligning with Windows TRP recommendations, in line with some of the largest CRLs we found in the links mentioned above, and leaving room to grow up to 13 MB.
During our recent revocation efforts, we have received an escalation from a Microsoft service regarding the CRL size being too large (5MB at the time). We will continue to follow Windows TRP recommendations and monitor potential impacts closely.
Q5: Could Microsoft elaborate on the service that is being impacted by a 5MB CRL? As elaborated there are multiple CAs pushing well past that boundary, and Microsoft's own data says that 10MB is a conservative threshold.
Q6: Are there any known publicly-used services that would be impacted by a CRL going past 10MB, or is this figure solely reliant on an unsourced figure from an old document?
Question 5
” Looking at some of the Microsoft PKI services CRLs relevant to this incident (e.g., http://www.microsoft.com/pkiops/crl/Microsoft%20Azure%20RSA%20TLS%20Issuing%20CA%2007.crl and http://www.microsoft.com/pkiops/crl/Microsoft%20Azure%20RSA%20TLS%20Issuing%20CA%2008.crl ) the current size (~4KB) is significantly less than the stated 10 MB goal. We understand Microsoft is intending to gradually ramp-up revocations, but the timing of that ramp-up is unclear.”
Please see the attached weekly revocation plan, which details the ramp plan for how many certs we plan to revoke on a weekly basis.
(In reply to Microsoft PKI Services from comment #27)
Created attachment 9494207 [details]
Bug1965612_Microsoft PKI Service_Revocation Plan.xlsx
Revocation Plan
Q7: Is there a public version of that revocation plan? The version that is attached does not seem to be intended for public usage.
Comment 29 (Assignee) • 2 months ago
Revocation Plan CSV
Comment 30 (Assignee) • 2 months ago
Weekly Status Update
We are actively working on all repair items associated with this incident. In addition, we have updated the evaluation criteria per suggestion in Comment 21 to better align with CCADB guidance. After further review, we updated the due date of the last action item.
Action Items
Action Item | Kind | Root Cause(s) | Updated Evaluation Criteria | Due Date | Status |
---|---|---|---|---|---|
Revoke impacted certificates (in batches beginning 5/28/2025) | Mitigate | Root Cause 1 | Percent of impacted certificates revoked will be tracked and published monthly. Verification possible via Certificate Transparency (CT) logs and serial number disclosure via Microsoft’s CRL. | 11/15/2025 | In Progress |
Migrate cert issuance to use partitioned CRLs | Prevent | Root Cause 1 | Percentage of newly issued certificates appearing in CT logs with updated CDP endpoints pointing to partitioned CRLs. Logs and CRL URLs can be independently verified by the public. | 11/15/2025 | In Progress |
Standup cross-signed warm standby CAs. We are currently in planning stages. We will have the plan ready before 06/14/2025 | Prevent | Root Cause 1 | Standby ICAs will be disclosed in CT logs with test certificates. Public can verify issuance and presence of standby ICAs through CT logs and Microsoft’s published CA repository. | 9/30/2025 | In Progress |
Create training and TSG Documentation to educate team on revocation expectations | Prevent | Root Cause 1 | Training completion rates will be tracked internally. Effectiveness will be evaluated through internal audits and inclusion of the training materials in external audit reviews. | 7/31/2025 | In Progress |
Reduce usage of public PKI | Prevent | Root Cause 1 | Publish a monthly percentage reduction of unexpired, publicly trusted certificates issued from impacted hierarchies. Public can track progress using CT log data filtered for affected intermediates. | 9/30/2025 | In Progress |
Exercise and refine the mass revocation playbook | Prevent | Root Cause 1 | Effectiveness will be assessed through internal tracking of simulated revocation scenarios, including coverage and execution timing. The results of these exercises will inform iterative improvements to the playbook. While objective external metrics are limited, Microsoft will evaluate the impact through internal reviews and incorporate this action into relevant audit scopes. | 09/01/2025 | In Progress |
Related Incidents
Additionally, we mentioned in Comment 25 that we will include additional incidents in our “Related Incidents” section. Please see the updated section below:
Bug | Date | Description |
---|---|---|
1962829 | 2025-04-25 | Microsoft PKI Services: Policy document bug. Microsoft PKI Services introduced a typographical error in CPS Version 3.2.4 while reformatting the document, incorrectly stating that keyEncipherment is not present in Subscriber certificates. This contradicts longstanding practice and affects still-active certificates tied to the superseded document. |
1890896 | 2024-04-10 | Entrust CPS policyQualifier error. This incident involved a typographical error in Entrust’s CPS that mistakenly added a policyQualifier (cpsURI) requirement to OV TLS certificates—resulting in over 6,000 misissued certificates. Although the issue stemmed from documentation rather than technical controls, the outdated or inaccurate CPS language led to non-compliance. |
1910805 | 2024-07-30 | DigiCert delayed revocation due to TRO. This incident involved DigiCert’s delayed revocation of certificates originally identified in Bug 1910322 (CNAME validation error). Due to a Temporary Restraining Order (TRO), revocation—which should have occurred within 24 hours under the Baseline Requirements—was delayed by five days. |
1715672 | 2021-06-09 | Let’s Encrypt certificate validity issue. This incident involved Let’s Encrypt (ISRG) issuing certificates valid for 90 days plus one second due to a CPS timestamp calculation issue, in violation of the CA/Browser Forum Baseline Requirements. ISRG opted not to revoke the certificates, as they determined revocation would not benefit the Web PKI. |
Comment 31 (Assignee) • 2 months ago
Response to Comment 24 from Wayne
"Do we have a more detailed revocation plan yet? Currently we seem to be stalling until November when the final certificates will expire instead of being revoked as required.
If there is no intent for these certificate to ever be revoked, why are they listed as 'planned for revocation'?
I am dismayed at the attention this incident is receiving and the lack of pro-activeness. We have regular reports late Friday, and no sign that this is being treated with any severity internally. If we are to learn of the plan through weekly questioning then please advise us in advance so the questions can be more thoroughly worded."
We acknowledge the concerns raised and want to clarify that Microsoft remains fully committed to revoking as many of the affected certificates as we can while managing the CRL size constraints described in our full incident report.
Earlier this week, we published an updated revocation plan with scheduled batches through November. Revocations began on May 28, 2025, and certificates marked as “planned for revocation” are actively queued for upcoming batches. We would also like to acknowledge your asks related to certs that were already expired before we began revocations, and will provide that in the next update.
We recognize that our responses have been largely following the 7-day response window due to the number of bugs we are concurrently managing and hope to shorten this to 3 days in the future.
We appreciate the feedback and will continue improving the clarity of our weekly updates to provide better visibility into progress and planning.
Comment 32 (Assignee) • 2 months ago
Revocation Delay Status Update
- the number of certificates that have been revoked: 400,000
- the number of certificates that have not yet been revoked: 64,687,168
- the number of certificates planned for revocation that have expired: 2,737,202
- an estimate for when all remaining revocations will be completed: we will continue to revoke certificates in batches until 11/15/2025 as mentioned in our FIR
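To put the figures in this status update in proportion, a small worked calculation follows. It assumes the three counts are disjoint categories that together make up the affected population. The update does not state this explicitly, so the total and the shares should be read as approximate.

```python
# Figures copied from the status update; treating them as disjoint
# categories is an assumption, not something the update states.
revoked = 400_000
not_yet_revoked = 64_687_168
expired_before_revocation = 2_737_202

total = revoked + not_yet_revoked + expired_before_revocation


def share(count: int) -> float:
    """Percentage of the assumed total, rounded to two decimals."""
    return round(100 * count / total, 2)


print(f"assumed total:             {total:,}")
print(f"revoked:                   {share(revoked)}%")
print(f"not yet revoked:           {share(not_yet_revoked)}%")
print(f"expired before revocation: {share(expired_before_revocation)}%")
```

On that reading, the certificates revoked at the time of this update are well under one percent of the affected population.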
Comment 33 • 2 months ago
(In reply to Microsoft PKI Services from comment #29)
Created attachment 9494215 [details]
Bug1965612_Microsoft PKI Service_Revocation Plan.csv
Revocation Plan CSV
There is a rather concerning line in this plan that requires far more information:
*Note: There is a company wide change advisory that may impact our ability to revoke this week. We will provide further details once we have that clarity.
This is regarding a revocation period of 2025-10-27 to 2025-11-02.
Q1: Are we to interpret that as Microsoft PKI not being able to handle revocation for a week due to an org-wide freeze? More details would be appreciated, even if absolute clarity is not available yet.
Q2: Has this happened before?
Q3: If this has happened before, where was the inability to handle revocation disclosed in any of your prior audits?
Q4: Given this is the 3rd-last 'revocation week', what exactly is stopping an increase in revocations up to this date to make it irrelevant?
The action items note:
Standup cross-signed warm standby CAs. We are currently in planning stages. We will have the plan ready before 06/14/2025
Q5: Is this plan now ready, and can we see it?
The current plan is showing 15 million certificates will be eventually revoked by November, while 56 million will be left to expire.
Q6: Can Microsoft PKI give examples of any prior incident where this has occurred, never mind where it was considered acceptable practice?
Q7: Will Microsoft PKI be advising the Microsoft Root Store that this is to be considered the high standard to be held against all other CAs they govern?
Q8: Can Microsoft PKI explain why other Root Programs should take this plan in good faith, in spite of CRL evidence to the contrary and no change in plans appearing to date?
Comment 34 • 2 months ago
Thank you for surfacing that out of the spreadsheet and into the discussion, Wayne. I agree that it needs elaboration.
Question: in the event of a “change advisory”, are there other duties of a CA, beyond revocation for this incident, that Microsoft will not be performing? Specifically, will other revocations be performed in keeping with the BRs, and will Microsoft continue to issue certificates during that time?
Comment 35 (Assignee) • 2 months ago
Response to Comment 28 – Wayne
Question 1
“Q1: By July 31st what percentage of certificates that should have been revoked will Microsoft have revoked as per the plan? Please include certificates that have expired since the start of May in that total as they should have been in the revocation to begin with.”
Of the affected certs that would otherwise have expired by August 3rd, we would have revoked 16% (7% of the overall population of affected certificates).
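The two percentages in this answer can be reconciled with one line of arithmetic: if the revoked certificates are 16% of those expiring by August 3rd and also 7% of the whole affected population, then the expiring subset is roughly 7/16 of the population. The sketch below shows the calculation. Both inputs are the rounded figures from the response, so the result is approximate, and the variable names are illustrative.

```python
# Both inputs are the rounded percentages quoted in the response above.
revoked_share_of_subset = 0.16   # revoked / certs expiring by August 3rd
revoked_share_of_total = 0.07    # revoked / all affected certs

# subset / total = (revoked / total) / (revoked / subset)
subset_share_of_total = revoked_share_of_total / revoked_share_of_subset

print(f"certs expiring by August 3rd: about {subset_share_of_total:.0%} of all affected certs")
```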
Question 2
“Q2: The sampling is based on 50% of the affected certificates and we're getting results that ~98% are no longer in use. What is the barrier to moving the remaining ~2% to a different intermediary to work around the perceived CRL issue?”
As previously mentioned in our Full Incident Report – Lessons Learned, we do not have ICAs available to move our subscribers to. As a repair action, we are working on standing up warm standby ICAs for future use in such circumstances.
Question 3
”Question 3"
” In its response, Microsoft PKI Services stated:
Our goal is for our default validity to meet or exceed industry requirements. In addition, we are working with subscribers to request shorter lifetime certificates based on their scenarios.
Can you please share more? This response doesn’t offer actionable detail or directly answer the question posed. Asked differently, considering that 90% of affected certificates were determined not in use at the time of this incident, can that be interpreted to indicate the default validity should be less than six months given observed real-world use of these certificates?”
“Yes, based on our investigation, 75% of the impacted certificates could have had a 30 day lifetime based on the lifecycle of the underlying resource using the certificate. We plan to work with subscribers with scenarios like this to move them to 30 day certificates.
That is good to hear.”
“Q3: Based off of data available so far how many subscribers can be moved to short-lived certificates bypassing the need for revocation entirely?”
To obviate the need for revocation, we will need to move the subscribers to 7-day (or less) validity certificates, for which we do not yet have a defined plan. As shared previously, we are working on identifying and moving workloads to 30-day certificates in the interim.
Question 4
"Q4: Are there any plans in the near future to move these subscribers to short-lived certificates?"
We currently do not have short-term plans to move workloads to 7-day (or less) validity certs. We are working on moving eligible workloads to 30-day certs and then to progressively shorter lifetimes.
Question 5
”Studying data disclosed to the CCADB, we do observe CA records trusted in Chrome that disclose a full and complete CRL, but NOT a partitioned array and whose corresponding size is larger than 10 MB. We’re not aware of specific issues related to these CAs, though we would be interested to learn if there are, in fact, specific issues.
- http://crl.quovadisglobal.com/hinicag2.crl (99.45 MB)
- http://certificates.godaddy.com/mastergodaddy2issuing.crl (39.39 MB)
- http://httpcrl.trust.telia.com/teliasoneramobileidcav2.crl (21.97 MB)
- https://www.accv.es/fileadmin/Archivos/certificados/accvca120_der.crl (17.45 MB)
- http://www.accv.es/fileadmin/Archivos/certificados/accvca120_der.crl (serves the same file as above)
- http://crl.sectigo.com/SectigoRSADomainValidationSecureServerCA.crl (10.77 MB)
When setting the initial targets, we researched CRLs from several widely deployed CAs when evaluating an acceptable maximum size. Specifically, we identified CRLs mentioned in the following articles – How Big Are CRLs That Are Found In The Wild? | technotes.seastrom.com and An analysis of CRL sizes. At the time of analysis, the largest CRL we found in these articles was approximately 13MB. We chose a 10MB target as a conservative threshold, aligning with Windows TRP recommendations, in line with some of the largest CRLs we found in the links mentioned above, and leaving room to grow up to 13 MB.
During our recent revocation efforts, we have received an escalation from a Microsoft service regarding the CRL size being too large (5MB at the time). We will continue to follow Windows TRP recommendations and monitor potential impacts closely.
Q5: Could Microsoft elaborate on the service that is being impacted by a 5MB CRL? As elaborated there are multiple CAs pushing well past that boundary, and Microsoft's own data says that 10MB is a conservative threshold.
One Microsoft service which has clients on memory constrained Android devices has reported end user failures when processing a ~5MB CRL. This incident confirmed the need to carefully manage CRL sizes to avoid end-user impact. These observations do not alter our revocation plans. Rather, they inform our batch sizing to ensure revocation in conformance with our plan while maintaining broad client compatibility.
Question 6
"Q6: Are there any known publicly-used services that would be impacted by a CRL going past 10MB, or this figure solely reliant on unsourced figure from an old document?"
Please see the response to Question 5.
Question 7
(In reply to Microsoft PKI Services from comment #27)
Created attachment 9494207 [details]
Bug1965612_Microsoft PKI Service_Revocation Plan.xlsx
Revocation Plan
Q7: Is there a public version of that revocation plan? The version that is attached does not seem to be intended for public usage.
Thank you for pointing this out. The revocation plan was corrected and republished that same day.
Assignee
Comment 36•2 months ago
Response to Comment 33 - Wayne
Question 1
"Q1: Are we to interpret that as Microsoft PKI not being able to handle revocation for a week due to an org-wide freeze? More details would be appreciated, even if absolute clarity is not available yet."
We understand the concern and appreciate the opportunity to clarify. The comment in the plan — "may impact our ability to revoke this week" — was not intended to indicate that revocation would be paused or unavailable. Revocation remains a critical function, and our systems and teams are equipped to execute it throughout the advisory period.
The company-wide change advisory referenced is part of our internal change management process. These advisories introduce additional oversight to ensure that any changes made during sensitive operational windows are executed safely and deliberately. The note was included out of an abundance of caution while we evaluate the optimal path to proceed without introducing risk to adjacent systems (which may include revoking the targeted certs in the week prior).
Question 2
"Q2: Has this happened before?"
Company-wide change advisories are regularly scheduled events within Microsoft’s change management process. These advisories introduce additional oversight but do not prevent critical operations such as certificate revocation. Revocations have always been permitted during these periods. This is not a new or exceptional situation, and we have not experienced an advisory that has blocked or delayed our ability to revoke certificates.
Question 3
"Q3: If this has happened before, where was the inability to handle revocation disclosed in any of your prior audits?"
There have been no instances where a change advisory prevented or delayed our ability to perform required revocations.
Question 4
"Q4: Given this is the 3rd-last 'revocation week', what exactly is stopping an increase in revocations up to this date to make it irrelevant?"
Thanks for the suggestion. We will consider it as an option.
Question 5
"Q5: Is this plan now ready, and can we see it?"
Outlined below is a high-level plan for setting up Warm Standby Certificate Authorities (CAs) to ensure continuity and rapid response in case of CA revocation. The plan includes the creation, cross-signing, distribution, readiness timeline, and usage policy for the standby CAs.
- CA Creation: Create warm standby RSA and ECC Certificate Authorities (CAs) from the Microsoft G1 root that meet all the expected baseline requirements. CCADB will be updated with the CA entries. (Mid July)
- Cross-Signing: Obtain cross-signatures for the newly created CAs from DigiCert, following the same process as used for existing CAs. CCADB will be updated after cross-signing. (Early August)
- Roll-out CRL partitioning to CAs: Issue a small batch of certs. CRL partitioning can be verified using the CT log entries for that small batch of certs. (September)
- Distribution to MS Fleet: Distribute the CAs through the internal distribution pipeline to ensure availability and integration within the existing Subscriber infrastructure. (Mid August – Early October)
- Microsoft Fleet Ready to Consume Certificates: Ensure that the CAs are fully ready in production to start issuing publicly trusted TLS certificates by 10/15/2025. We plan to adopt the standby practice for all future iterations of CA creation.
- Usage Policy: These CAs are designated solely for standby purposes. They will only be activated in scenarios where existing CAs need to be revoked.
Question 6
"Q6: Can Microsoft PKI give examples of any prior incident where this has occurred, nevermind was considered acceptable practice?"
As previously noted in this bug, we acknowledge that our response plan deviates from the Baseline Requirements, and we are not presenting it as acceptable precedent.
As outlined in the full incident report, having CRL partitioning in place and/or having ready warm standby CAs would have allowed us to meet a more aggressive timeline for revocations, and both of those are part of our repair actions.
Question 7
"Q7: Will Microsoft PKI be advising the Microsoft Root Store that this is to be considered the high standard to be held against all other CAs they govern?"
No. We are not presenting our current revocation approach as a standard or as a benchmark for others. Our focus is on remediating the issue as responsibly and transparently as possible, not redefining expectations for root programs.
The Microsoft Trusted Root Program is operated independently from Microsoft PKI Services. Like other Root Programs, it sets its own requirements and enforcement expectations. We continue to support consistent application of Root Program policies and acknowledge that this incident highlights areas where our internal controls and infrastructure must improve.
Question 8
"Q8: Can Microsoft PKI explain why other Root Programs should take this plan in good faith, in spite of CRL evidence to the contrary and no change in plans appearing to date?"
We started at a lower number of certificates at the start of the revocations, so CRL sizes remained small. Since then, our revocations have been progressively ramping up.
Assignee
Comment 37•2 months ago
Response to Comment 34 - Mike Shaver
"Thank you for surfacing that out of the spreadsheet and into the discussion, Wayne. I agree that it needs elaboration.
Question: in the event of a “change advisory”, are there other duties of a CA, beyond revocation for this incident, that Microsoft will not be performing? Specifically, will other revocations be performed in keeping with the BRs, and will Microsoft continue to issue certificates during that time?"
We appreciate the follow-up. As noted in our response to Comment 33, the change advisory introduces additional oversight—not a freeze—and does not prevent revocation activity related to this incident.
To clarify further: all other CA duties, including unrelated revocations and certificate issuance, will continue during this period in accordance with the Baseline Requirements. The advisory does not limit our ability to meet our obligations as a publicly trusted CA.
Assignee
Comment 38•2 months ago
Weekly Update
We are actively progressing through all repair items identified in the incident report. All action items are currently in progress. We remain on track to meet the expected due dates outlined in the full incident report.
Assignee
Comment 39•2 months ago
Revocation Delay Status Update
- the number of certificates that have been revoked:
- 399,350
- the number of certificates that have not yet been revoked:
- 64,687,168
- the number of certificates planned for revocation that have expired:
- 2,737,852
- Estimate for remaining revocations:
- We will continue to revoke certificates in batches until 11/15/2025
Assignee
Comment 40•2 months ago
Weekly Update
We are actively progressing through all repair items identified in the incident report. All action items are currently in progress. We remain on track to meet the expected due dates outlined in the full incident report.
Assignee
Comment 41•2 months ago
Revocation Delay Status Update
- the number of certificates that have been revoked this week:
- 600,960
- the number of certificates that have not yet been revoked:
- 58,197,100
- the number of certificates planned for revocation that have expired:
- 2,248,735
- Estimate for remaining revocations:
- We will continue to revoke certificates in batches until 11/15/2025
Comment 42•2 months ago
This is the fifth status update and I'm still unsure on the methodology or math involved.
Could Microsoft PKI please talk us through how they arrived at these figures? Comment 32 and 39 are especially odd, but overall I don't see the math quite adding up.
Comment 43•2 months ago
We have a few follow-up questions that will help us manage CRLite effectively in light of this incident:
1. Which issuing CAs and respective quantities were involved?
Please provide the names of the issuing CAs and quantities of affected certificates for those CAs.
2. For each of those issuing CAs, what percentage of certificates will be revoked vs. not revoked?
This information will help us assess CRL size and plan for distribution of revocation information using CRLite.
3. Has Microsoft considered revoking any of the issuing CAs involved?
If so, this could simplify our response because revocation at the ICA level would allow us to manage this incident via OneCRL, avoiding the scalability challenges of enumerating revocation information for millions of certificates through CRLite.
Thanks.
Assignee
Comment 44•2 months ago
Weekly Update
We are actively progressing through all repair items identified in the incident report. All action items are currently in progress. We remain on track to meet the expected due dates outlined in the full incident report.
Assignee
Comment 45•2 months ago
Response to Comment 42 - Wayne
Thanks, Wayne — we appreciate your continued engagement and the opportunity to clarify.
We acknowledge two key issues in our prior reporting which could be contributing to lack of clarity on the math:
- (1) Cumulative vs. Weekly Totals: Our previous updates reported weekly figures, which may have caused confusion. Moving forward, we will report cumulative totals to provide clearer visibility.
- (2) Duplicate Data on 6/20/2025: The numbers shared on that date were inadvertently duplicated.
The corrected figures for 6/20/2025 are:
- Revoked: 399,350
- Total: 61,245,835
- Expired Planned: 3,041,333
- Remaining Active: 57,805,152
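As a quick consistency check on the corrected 6/20/2025 figures, the sketch below adds up the three categories and compares the sum with the reported total. It assumes that "Total" is meant to be the sum of revoked, expired, and remaining active certificates; the thread does not state this explicitly, and the variable names are illustrative only.

```python
# Corrected 6/20/2025 figures as reported in this comment.
revoked = 399_350
expired_planned = 3_041_333
remaining_active = 57_805_152
reported_total = 61_245_835

# Assumption: the reported total is the sum of the three categories.
computed_total = revoked + expired_planned + remaining_active

print(f"computed total: {computed_total:,}")
print(f"matches reported total: {computed_total == reported_total}")
```

Under that assumption the three categories add up exactly to the reported total.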
As of this week (7/3/2025), our cumulative figures are:
- Total certificates revoked (planned to date): 2,558,954 (2,558,644)
- Remaining active certificates (total affected): 55,020,856 (72,070,777)
- Total certificates expired and not revoked: 14,490,967
Assignee
Comment 46•2 months ago
Revocation Delay Status Update
As mentioned in Comment 45, here are the revocation delay status updates:
- Total certificates revoked (planned to date): 2,558,954 (2,558,644)
- Remaining active certificates (total affected): 55,020,856 (72,070,777)
- Total certificates expired and not revoked: 14,490,967
Assignee
Comment 47•2 months ago
Response to Comment 43 - Ben
Question 1:
"Which issuing CAs and respective quantities were involved?
Please provide the names of the issuing CAs and quantities of affected certificates for those CAs."
Issuing and Intermediate CAs | Impacted Certs |
---|---|
Microsoft Azure RSA TLS Issuing CA 04 | 26,342,303 |
Microsoft Azure RSA TLS Issuing CA 07 | 24,014,300 |
Microsoft Azure RSA TLS Issuing CA 03 | 26,328,523 |
Microsoft Azure RSA TLS Issuing CA 08 | 23,637,853 |
Question 2:
"For each of those issuing CAs, what percentage of certificates will be revoked vs. not revoked?
This information will help us assess CRL size and plan for distribution of revocation information using CRLite."
These numbers represent revocations from May 28th through November 15th, as projected in our revocation plan:
Issuing and Intermediate CAs | % Revoked | % Not Revoked |
---|---|---|
Microsoft Azure RSA TLS Issuing CA 04 | 16% | 84% |
Microsoft Azure RSA TLS Issuing CA 07 | 15% | 85% |
Microsoft Azure RSA TLS Issuing CA 03 | 16% | 84% |
Microsoft Azure RSA TLS Issuing CA 08 | 15% | 85% |
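For anyone sizing revocation data per issuer, the two tables above can be combined into approximate revoked-certificate counts. This is only a sketch: the percentages are rounded to whole numbers, so the results are rough estimates, and the dictionary and function names are hypothetical.

```python
# Impacted certificate counts and projected "% Revoked" per issuing CA,
# copied from the two tables above.
issuing_cas = {
    "Microsoft Azure RSA TLS Issuing CA 04": (26_342_303, 16),
    "Microsoft Azure RSA TLS Issuing CA 07": (24_014_300, 15),
    "Microsoft Azure RSA TLS Issuing CA 03": (26_328_523, 16),
    "Microsoft Azure RSA TLS Issuing CA 08": (23_637_853, 15),
}

def projected_revoked(impacted: int, percent_revoked: int) -> int:
    """Approximate revoked count from a whole-number percentage (rounded down)."""
    return impacted * percent_revoked // 100

total_impacted = sum(impacted for impacted, _ in issuing_cas.values())
total_projected = sum(projected_revoked(i, p) for i, p in issuing_cas.values())

for name, (impacted, pct) in issuing_cas.items():
    print(f"{name}: ~{projected_revoked(impacted, pct):,} of {impacted:,}")
print(f"All four CAs: ~{total_projected:,} of {total_impacted:,}")
```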
Question 3:
"Has Microsoft considered revoking any of the issuing CAs involved?
If so, this could simplify our response because revocation at the ICA level would allow us to manage this incident via OneCRL, avoiding the scalability challenges of enumerating revocation information for millions of certificates through CRLite."
Yes, we have considered revoking the ICAs. As mentioned in our Full Incident Report – Lessons Learned, ultimately we do not have a warm standby cross-signed ICA to move subscribers to.
Assignee
Comment 48•1 month ago
Weekly Status Update
We would like to request a change to the cadence of our action item updates for this bug. Several of the action items currently tracked are not due for several months, and as such, we propose to provide our next update on Friday, August 1st, unless action items are completed sooner.
Please note that this change would only apply to the action item updates. We will continue to provide weekly updates on our revocation progress as usual.
Action Item | Kind | Root Cause(s) | Updated Evaluation Criteria | Due Date | Status |
---|---|---|---|---|---|
Revoke impacted certificates (in batches beginning 5/28/2025) | Mitigate | Root Cause 1 | Percent of impacted certificates revoked will be tracked and published monthly. Verification possible via Certificate Transparency (CT) logs and serial number disclosure via Microsoft’s CRL. | 11/15/2025 | In Progress |
Migrate cert issuance to use partitioned CRLs | Prevent | Root Cause 1 | Percentage of newly issued certificates appearing in CT logs with updated CDP endpoints pointing to partitioned CRLs. Logs and CRL URLs can be independently verified by the public. | 11/15/2025 | In Progress |
Stand up cross-signed warm standby CAs. We are currently in planning stages. We will have the plan ready before 06/14/2025 | Prevent | Root Cause 1 | Standby ICAs will be disclosed in CT logs with test certificates. Public can verify issuance and presence of standby ICAs through CT logs and Microsoft’s published CA repository. | 9/30/2025 | In Progress |
Create training and TSG Documentation to educate team on revocation expectations | Prevent | Root Cause 1 | Training completion rates will be tracked internally. Effectiveness will be evaluated through internal audits and inclusion of the training materials in external audit reviews. | 7/31/2025 | In Progress |
Reduce usage of public PKI | Prevent | Root Cause 1 | Publish a monthly percentage reduction of unexpired, publicly trusted certificates issued from impacted hierarchies. Public can track progress using CT log data filtered for affected intermediates. | 9/30/2025 | In Progress |
Exercise and refine the mass revocation playbook | Prevent | Root Cause 1 | Effectiveness will be assessed through internal tracking of simulated revocation scenarios, including coverage and execution timing. The results of these exercises will inform iterative improvements to the playbook. While objective external metrics are limited, Microsoft will evaluate the impact through internal reviews and incorporate this action into relevant audit scopes. | 09/01/2025 | In Progress |
Let us know if there are any concerns with this approach. Thank you.
Assignee
Comment 49•1 month ago
Revocation Delay Status Update
- Total certificates revoked (planned to date):
- 3,358,954 (3,358,644)
- Remaining active certificates (total affected):
- 51,822,781 (72,070,777)
- Total certificates expired and not revoked (to date):
- 16,517,878
- Estimate for remaining revocations:
- We will continue to revoke certificates in batches until 11/15/2025
Comment 50•1 month ago
We’d like to improve our understanding of Microsoft PKI Services' ability to stand up new Issuing CAs given comments made in this Incident Report.
Background:
- This Incident Report was opened on May 09, 2025.
- The high-level plan for setting up "Warm Standby" CAs provided in Comment 36 initially described a project completion date of October 15, 2025. This was subsequently updated in Comment 48 to September 30, 2025.
- Given these dates, it appears it will take Microsoft PKI Services approximately 145 days (from the opening of the report to September 30, 2025) to stand up a fleet of new issuing CAs that are considered usable.
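The day count in the background above can be reproduced with plain date arithmetic. A minimal sketch using the dates quoted in this comment; the helper name is illustrative.

```python
from datetime import date

def days_between(start: date, end: date) -> int:
    """Number of calendar days from start to end."""
    return (end - start).days

report_opened = date(2025, 5, 9)       # incident report opened
updated_target = date(2025, 9, 30)     # due date given in Comment 48
original_target = date(2025, 10, 15)   # readiness date given in Comment 36

print(days_between(report_opened, updated_target))   # 144
print(days_between(report_opened, original_target))  # 159
```

The 144-day result is consistent with the "approximately 145 days" figure used in the questions that follow.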
Questions:
(1) Is Microsoft PKI Services satisfied with the approximately 145-day timeline for standing up a fleet of new issuing CAs, as outlined in this bug?
- (a) If so, how does this align with Microsoft PKI Services' internal targets for operational readiness and community expectations for rapid response in a crisis or for routine CA rotation?
- (b) If not, what specific, measurable plans are being implemented to significantly reduce this timeline for future deployments, and what are the updated target completion dates for these improvements?
(2) The inability to immediately revoke the ICAs subject of this Incident Report (as discussed in Comments 9 and 47) due to the absence of cross-signed "Warm Standby" CAs highlights a critical dependency.
- (a) Can Microsoft PKI Services share more detail on the specific barriers or dependencies that caused this extended ICA deployment timeline?
- (b) What concrete steps are being taken to ensure a significantly more rapid deployment capability in the future, should similar circumstances repeat?
- (c) What is the new expected timeline for standing up cross-signed "Warm Standby" CAs under urgent conditions, and what factors will influence this?
(3) Can you help us quantify the specific risk(s) of moving forward with the use of the new ICAs prior to cross-certification given Microsoft PKI Services’ independent trust status?
(4) Can you explain how Microsoft PKI Services weighed the trade-offs of the above described risks (e.g., impact on subscribers if revocation of the (a) leafs or (b) corresponding ICAs subject of this discussion occurred without cross-signed warm standby CAs) against instead more quickly aligning with ecosystem expectations via revoking ICA certificates, as required by Baseline Requirements 4.9.1.1?
(5) Beyond addressing the current incident, what are Microsoft PKI Services' proactive, ongoing plans for routinely standing up and rotating new issuing CAs to practice continuous improvement in its operational practices while also enhancing cryptographic agility?
(6) This report, when considered with Bug 1974592 raises additional concern regarding the operational rigor and maturity of Microsoft PKI Services' existing ICA creation process. How does Microsoft PKI Services plan to address the combined implications of both incident reports regarding CA agility and operational robustness? If discussion is better scoped to 1974592, that works for us.
Assignee
Comment 51•1 month ago
Response to Comment 50 - Chrome Root Program
Question 1 a & b:
"(1) Is Microsoft PKI Services satisfied with the approximately 145-day timeline for standing up a fleet of new issuing CAs, as outlined in this bug?
• (a) If so, how does this align with Microsoft PKI Services' internal targets for operational readiness and community expectations for rapid response in a crisis or for routine CA rotation?
• (b) If not, what specific, measurable plans are being implemented to significantly reduce this timeline for future deployments, and what are the updated target completion dates for these improvements?"
Thank you for raising this important point. Microsoft PKI Services is not satisfied with the current ~145-day timeline to stand up a fleet of new issuing CAs. The primary driver is the lack of available fit-for-purpose CAs. Fit for purpose here also includes the ability to support trust on legacy devices, which requires us to cross-sign these CAs and adds delays to our current plan. Going forward, our plan is to eliminate these delays by always having warm standby CAs available, to which we can quickly switch our subscribers if an incident requires revocation of the active ICAs.
Question 2 a-c:
"(2) The inability to immediately revoke the ICAs subject of this Incident Report (as discussed in Comments 9 and 47) due to the absence of cross-signed "Warm Standby" CAs highlights a critical dependency.
• (a) Can Microsoft PKI Services share more detail on the specific barriers or dependencies that caused this extended ICA deployment timeline?
• (b) What concrete steps are being taken to ensure a significantly more rapid deployment capability in the future, should similar circumstances repeat?
• (c) What is the new expected timeline for standing up cross-signed "Warm Standby" CAs under urgent conditions, and what factors will influence this?"
Due to a need to support clients on legacy devices which do not trust the Microsoft PKI CAs, we currently rely on ICA-level cross-signing to establish that trust. In this case, the absence of pre-established, cross-signed warm standby ICAs prevented immediate ICA revocation. Further, the need for new cross-signing adds to the time required to stand up the new ICAs. While we are working on getting new ICAs from our existing root cross-signed, for our next generation root we are shifting strategy to cross-sign at the root level. Once we have migrated our subscribers to this new root (target Q2 CY26), this approach will allow us to eliminate the time required for cross-signing when standing up new ICAs in emergency situations. Please note that our migration plan already includes creation of warm standby ICAs for this new root, so as we migrate subscribers to ICAs from this new root, there will always be warm standby ICAs available.
In response to (c), as stated above, our plan is to eliminate the need for cross-signed ICAs in the future. Once we have completed migration of our subscribers to our new cross-signed root, we will no longer need cross-signing at the ICA level.
Question 3:
"(3) Can you help us quantify the specific risk(s) of moving forward with the use of the new ICAs prior to cross-certification given Microsoft PKI Services’ independent trust status?"
We estimate that 4% of traffic to subscriber services originates from legacy devices which do not trust our current root. This necessitates the need to issue certificates from a cross-signed CA at this time.
Question 4:
"(4) Can you explain how Microsoft PKI Services weighed the trade-offs of the above described risks (e.g., impact on subscribers if revocation of the (a) leafs or (b) corresponding ICAs subject of this discussion occurred without cross-signed warm standby CAs) against instead more quickly aligning with ecosystem expectations via revoking ICA certificates, as required by Baseline Requirements 4.9.1.1?"
We considered the following factors when evaluating ICA revocation impacts:
- Lack of Alternatives: At the time of the incident, we did not have any other ICAs available to transition subscribers to.
- Scope of Non-Compliance: The ICAs themselves were not universally non-compliant—only certificates issued during a specific window were affected. Revoking the ICAs would have invalidated both compliant and non-compliant certificates, unnecessarily disrupting active, valid use cases.
For leaf-level revocation, revoking tens of millions of leaf certificates would have resulted in CRLs exceeding 600MB, which many clients cannot process, leading to revocation checking failures and degraded reliability across the ecosystem. Ultimately, we chose a phased leaf revocation plan to contain the impact to affected certificates while preserving service continuity and working toward service and operational improvements which will enable faster, standards-aligned responses in the future.
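To give a feel for how CRL size scales with the number of revoked certificates, the sketch below estimates the DER size of the revoked-entry list and the number of partitions needed to stay under a size cap. It is a rough model and not Microsoft's calculation: the 16-byte serial number length and the choice to omit the reasonCode extension are assumptions, the function names are hypothetical, and CRL header, extension, and signature overhead is ignored.

```python
def crl_entry_bytes(serial_len: int, with_reason_code: bool = False) -> int:
    """Approximate DER size of one revokedCertificates entry (RFC 5280)."""
    serial = 2 + serial_len      # INTEGER tag + length + value bytes
    revocation_date = 2 + 13     # UTCTime tag + length + "YYMMDDHHMMSSZ"
    # crlEntryExtensions carrying only reasonCode:
    # SEQUENCE(2) + SEQUENCE(2) + OID 2.5.29.21 (5) + OCTET STRING wrapping ENUMERATED (5)
    extensions = 14 if with_reason_code else 0
    return 2 + serial + revocation_date + extensions  # outer SEQUENCE header

def estimated_crl_bytes(revoked: int, entry_bytes: int) -> int:
    """Size of the revoked-entry list only, without CRL header or signature."""
    return revoked * entry_bytes

def shards_needed(revoked: int, entry_bytes: int, max_shard_bytes: int) -> int:
    """Partitions required so that no shard exceeds max_shard_bytes (ceiling division)."""
    return -(-(revoked * entry_bytes) // max_shard_bytes)

# Assumption: 16-byte serial numbers, no reasonCode extension.
entry = crl_entry_bytes(serial_len=16)
affected = 72_070_777            # "total affected" figure from the status updates
cap = 10 * 1024 * 1024           # the 10MB target discussed earlier in this bug

print(f"bytes per entry: {entry}")
print(f"single-CRL estimate: {estimated_crl_bytes(affected, entry) / 1e6:.0f} MB")
print(f"10MB partitions needed: {shards_needed(affected, entry, cap)}")
```

The affected certificates are spread across four issuing CAs, each with its own CRL, so per-CA estimates would be smaller than the combined figure this prints.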
Question 5:
"(5) Beyond addressing the current incident, what are Microsoft PKI Services' proactive, ongoing plans for routinely standing up and rotating new issuing CAs to practice continuous improvement in its operational practices while also enhancing cryptographic agility?"
We are adopting a continuous readiness model where warm standby ICAs are always maintained and replaced immediately upon activation. This ensures we are regularly exercising the full lifecycle, from creation to deployment, rather than only reacting to incidents or relying on long ICA validity periods.
Question 6:
"(6) This report, when considered with Bug 1974592 raises additional concern regarding the operational rigor and maturity of Microsoft PKI Services' existing ICA creation process. How does Microsoft PKI Services plan to address the combined implications of both incident reports regarding CA agility and operational robustness? If discussion is better scoped to 1974592, that works for us."
The issue reported in Bug 1974592 was an implementation bug found in a feature we extended to CA creations just prior to the creation of the impacted ICAs, and we have identified the associated repair actions in that bug. Further discussion on those repair actions is likely best suited for that bug.
Assignee
Comment 52•1 month ago
Weekly Status update:
We would like to follow up our earlier request regarding action item update cadence: the remaining action items are mostly targeted for September and beyond, and as such, we propose to provide our next action item update on Friday, August 1, unless any are completed sooner.
This change would apply only to action item updates. We will continue providing weekly updates on revocation progress as usual.
Please let us know if this cadence is acceptable.
Assignee
Comment 53•1 month ago
Revocation Delay Status Update
Total certificates revoked (planned to date):
- 4,108,951 (4,158,644)
Remaining active certificates (total affected):
- 48,251,935 (72,070,777)
Total certificates expired and not revoked (to date):
- 19,288,724
Estimate for remaining revocations:
- We will continue to revoke certificates in batches until 11/15/2025
Comment 54•1 month ago
(In reply to Microsoft PKI Services from comment #52)
Weekly Status update:
We would like to follow up our earlier request regarding action item update cadence: the remaining action items are mostly targeted for September and beyond, and as such, we propose to provide our next action item update on Friday, August 1, unless any are completed sooner.
Mozilla has concerns about the sufficiency of the current action items and intends to propose modifications and/or additions—potentially including topics such as accelerating the timeframe by which Microsoft will shorten certificate lifetimes, improved details to Microsoft's mass revocation planning, implementation of sharded/partitioned CRLs, and redundant use of diversified issuing CAs.
Comment 55•1 month ago
I think Ben is correct. This progress seems slow, like we are supposed to forget and let the issue continue.
Delayed revocations were a big reason other CAs were distrusted. Microsoft are of course too-big-to-fail so Mozilla (or Google or Apple or Microsoft on the 'trust' storing department) will not dis-trust them. Really they should, but it is obvious they will not.
Still, Microsoft could show some fake concern and revoke faster, or, as Ben suggests, move to short-lived certs now. Microsoft control all the keys here, they are all 'managed' by Microsoft - every single one is within Microsoft or part of Microsoft.
Why can Microsoft not issue 47-day certificates now?
Comment 56•1 month ago
(In response to Comment 51...)
General comments:
-
We’re having a hard time identifying direct responses to questions asked in Comment 50. We’d encourage Microsoft PKI Services to more directly address comments and questions from the community going forward.
-
We share the concerns communicated in comments 54 and 55 regarding the delivery schedule of Microsoft PKI Services’ Action Items.
Follow-up questions:
(Q1) The response to “Question 2 a-c” emphasizes Microsoft PKI Services’ plan to change its existing cross-certification relationship with DigiCert from cross-certifying ICAs to instead a next generation root. This change is described as targeted for completion in Q2 2026. Can Microsoft please provide a more specific and measurable plan, to include key milestones, assumptions, and dependencies that if not met, would shift this timeline to a later date? Particularly of interest is when this new root(s) will be established, and at what point there will no longer be a dependency on the existing hierarchies for issuance and validation.
(Q2) How should root store operators and members of the public consider Microsoft PKI Services’ response to this incident as an indicator for how it intends to reliably uphold community expectations going forward?
(Q3) Can Microsoft PKI Services directly confirm that in the absence of the above described root-level cross-certificate, all subsequently created ICAs will have the same risk as those affected by this incident? Said differently, can Microsoft PKI Services acknowledge that the immediate action to stand-up a fleet of cross-certified “warm stand-by” CAs may not reliably meet the intended goal if for some reason it’s later identified that those CAs are flawed in some way?
(Q4) Can Microsoft PKI Services explain why cross-certifying the existing, in-use Microsoft roots was not considered a simpler and more robust solution than continuing to cross-sign leaf-issuing intermediates?
(Q5) In response to Question 3, Microsoft PKI Services stated: “We estimate that 4% of traffic to subscriber services originates from legacy devices which do not trust our current root. This necessitates the need to issue certificates from a cross-signed CA at this time.” Can you please share how you determined 4% of traffic originates from devices that do not trust the current root(s)?
(Q6) If not the 4% described above, can Microsoft PKI Services share the threshold it would otherwise consider acceptable to move forward without the cross-certificate(s)?
(Q7) Can Microsoft PKI Services explain why the delayed revocation of the CA certificates responsible for issuing the misissued TLS server authentication certificates (subject of Bug 1962829) until new ICAs are cross-signed should not be interpreted as Microsoft PKI Services prioritizing the reduction of subscriber impact over its obligations to the TLS Baseline Requirements?
(Q8) The response to Comment 50 Question 5 does not provide sufficient detail to help us understand, in practical terms, how Microsoft PKI Services is planning to establish new ICAs, or how it plans to rotate issuance to new CAs once established. Can you please provide more specificity and directly address the question?
Assignee
Comment 57•28 days ago
Weekly Status Update
We are actively making progress on the action items identified in the full incident report. No changes to status at this time.
Assignee
Comment 58•28 days ago
Response to Comment 55 - JR Moir
In relation to reducing certificate lifetimes, MS PKI Services currently supports 1-month certificates, but the certificate validity period is chosen by subscribers based on their cadence and constraints. Our current plan for enforcing shorter certificate lifetimes follows the timeline outlined in Ballot SC-081v3.
Assignee
Comment 59•28 days ago
Response to Comment 54 - Ben Wilson
We will focus on providing more details for these topics. We would be happy to consider any repair actions that you would like to propose. In the meantime, we will continue to provide the action item updates on a weekly basis.
Assignee
Comment 60•28 days ago
Revocation Delay Status Update
- Total certificates revoked (planned to date):
- 4,676,112 (4,958,644)
- Remaining active certificates (total affected):
- 44,451,262 (72,070,777)
- Total certificates expired and not revoked (to date):
- 22,289,397
- Estimate for remaining revocations:
- We will continue to revoke certificates in batches until 11/15/2025
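For readers tracking progress against the plan, the cumulative figures from Comments 45/46, 49, 53, and this update can be compared directly. The sketch below assumes the parenthesized number in each update is the cumulative count planned to date, as the field label suggests; the list and function names are illustrative.

```python
# (source, cumulative revoked, cumulative planned to date) from the status updates.
reports = [
    ("Comment 45/46", 2_558_954, 2_558_644),
    ("Comment 49", 3_358_954, 3_358_644),
    ("Comment 53", 4_108_951, 4_158_644),
    ("Comment 60", 4_676_112, 4_958_644),
]

def shortfall(revoked: int, planned: int) -> int:
    """Positive when revocations lag the plan, negative when ahead of it."""
    return planned - revoked

shortfalls = [shortfall(revoked, planned) for _, revoked, planned in reports]

for (source, revoked, planned), gap in zip(reports, shortfalls):
    print(f"{source}: revoked {revoked:,} vs planned {planned:,} (gap {gap:,})")
```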
Assignee
Comment 61•28 days ago
Response to Comment 56 - Chrome Root Program
Question 1
"(Q1) The response to “Question 2 a-c” emphasizes Microsoft PKI Services’ plan to change its existing cross-certification relationship with DigiCert from cross-certifying ICAs to instead a next generation root. This change is described as targeted for completion in Q2 2026. Can Microsoft please provide a more specific and measurable plan, to include key milestones, assumptions, and dependencies that if not met, would shift this timeline to a later date? Particularly of interest is when this new root(s) will be established, and at what point there will no longer be a dependency on the existing hierarchies for issuance and validation."
- We have already created the cross-signed root (published in CCADB), and are in the process of creating and deploying ICAs from this new root. This is expected to complete in early August.
- We have a dependency on CRL partitioning being available on these ICAs before we make them available for enrollment. CRL partitioning is targeted to be available by late October, at which point the ICAs will be made available for enrollment to our subscribers. Technically we do not have to wait for CRL sharding to start enrollment, but issuing certificates from the new CAs without it would leave them vulnerable to the same issues as the existing CAs.
- Subscriber migration is expected to complete by April 2026, at which point issuance from existing ICAs will cease.
Question 2
"(Q2) How should root store operators and members of the public consider Microsoft PKI Services’ response to this incident as an indicator for how it intends to reliably uphold community expectations going forward?"
We discovered gaps in our readiness to deal with a revocation event at this scale. Based on these gaps, we have identified action items to address those gaps (CRL partitioning, stand-by CAs, eliminating long lead time for new ICAs creation by eliminating need for cross-signing the ICAs, and mass revocation playbook as required by the Mozilla root program requirements). Microsoft remains committed to uphold the CAB/F and TRP requirements.
Question 3
"(Q3) Can Microsoft PKI Services directly confirm that in the absence of the above described root-level cross-certificate, all subsequently created ICAs will have the same risk as those affected by this incident? Said differently, can Microsoft PKI Services acknowledge that the immediate action to stand-up a fleet of cross-certified “warm stand-by” CAs may not reliably meet the intended goal if for some reason it’s later identified that those CAs are flawed in some way?"
The major limiting factor in not being able to revoke in a timely manner was lack of CRL partitioning on the existing CAs. The Warm Stand-bys are planned to have CRL partitioning. So will not suffer from the same issues.
Further, our plan is to stop using cross-signed ICAs and move to issuance from ICAs off the newly cross-signed G2 root. Once the ICAs from this new root are available for enrollment, if we discover issues in the future with the current or the newly cross-signed (warm stand-by) ICAs from the G1 root, we will accelerate migration of the workload to the ICAs from the cross-signed G2 root.
Question 4
"(Q4) Can Microsoft PKI Services explain why cross-certifying the existing, in-use Microsoft roots was not considered a simpler and more robust solution than continuing to cross-sign leaf-issuing intermediates?"
Even prior to this incident, our existing plan was to deprecate issuance from the cross signed G1 ICAs (since those are expiring in August 2026) and replace them with the ICAs off the G2 CA. The plan for cross-signing of the G2 CA was already in flight and we chose to rely on that plan as the primary path. That said, the idea of cross signing the G1 Microsoft root has merit and we will consider it for future readiness.
Question 5
"(Q5) In response to Question 3, Microsoft PKI Services stated: “We estimate that 4% of traffic to subscriber services originates from legacy devices which do not trust our current root. This necessitates the need to issue certificates from a cross-signed CA at this time.” Can you please share how you determined 4% of traffic originates from devices that do not trust the current root(s)?"
The 4% estimate is based on analysis of aggregated, non-identifying telemetry data for major Microsoft subscriber services over multi-week periods. We analyzed browser and platform trust data to determine which clients trust the Microsoft Gen1 root hierarchy.
Our methodology included:
- Identifying the earliest versions of major platforms (Windows, macOS, iOS, Android, Firefox, Chrome, Edge) that trust the Gen1 roots.
- Mapping browser traffic to these trust anchors using user-agent strings and platform metadata.
- Categorizing traffic from clients that either do not trust the Gen1 roots or do not disclose trust anchor information.
Question 6
"(Q6) If not the 4% described above, can Microsoft PKI Services share the threshold it would otherwise consider acceptable to move forward without the cross-certificate(s)?"
The decisions related to legacy device support for Microsoft services are business decisions which are owned by the respective services. That said, our plan to cross-sign our G2 CA at the root level, and cease issuance from the G1 cross-signed ICAs will obviate the need for cross signing any additional ICAs in the future.
Question 7
"(Q7) Can Microsoft PKI Services explain why the delayed revocation of the CA certificates responsible for issuing the misissued TLS server authentication certificates (subject of Bug 1962829) until new ICAs are cross-signed should not be interpreted as Microsoft PKI Services prioritizing the reduction of subscriber impact over its obligations to the TLS Baseline Requirements?"
Revocation of the ICAs was considered as an option, but revoking the ICAs would have impacted active subscriber certificates which were not mis-issued (with no alternate available for them).
Question 8
"(Q8)The response to Comment 50 Question 5 does not provide sufficient detail to help us understand, in practical terms, how Microsoft PKI Services is planning to establish new ICAs, or how it plans to rotate issuance to new CAs once established. Can you please provide more specificity and directly address the question?"
In relation to the ICAs from the G2 CAs, we are creating double the number of required ICAs: half of them will be used for issuance, and the other half will be held as Warm Stand-bys for the G2 ICAs. Migrating subscribers to the new CAs will be done in a staged fashion. If an incident requires rotation and revocation, we can run emergency campaigns with all of our subscribers to complete that activity.
We are interpreting this question as “what is our capability to migrate issuance to new CAs for all our subscribers”. If that is not the intent of the question, please clarify.
Comment 62•26 days ago
After reading most of the comments, I am interested in the technical details of the revocations.
According to http://www.microsoft.com/pkiops/crl/Microsoft%20Azure%20RSA%20TLS%20Issuing%20CA%2007.crl, the revocation entry for one certificate needs about 50 bytes, so a 10 MB CRL contains only about 200,000 entries. As mentioned in the file "Bug1965612_Microsoft PKI Service_Revocation Plan.csv", 800,000 certificates will be revoked each week. Since a total of 4 ICAs are involved, in the best case each ICA has 200,000 certificates to revoke per week.
For better understanding, let's consider the revocation of certificates issued by only one ICA, namely 200,000 certificates per week, and assume that 20% of the revoked certificates expire in each CRL issuing period. Then we have the following data:
- Week 1: 200,000 entries in CRL (10 MB)
- Week 2: 40,000 (20% of 200,000) certificates expired, then we have 160,000 remaining entries + 200,000 new entries = 360,000 entries (18 MB)
- Week 3: 72,000 (20% of 360,000) certificates expired, then we have 288,000 remaining entries + 200,000 new entries = 488,000 entries (24.4 MB)
- Week 4: about 590,000 entries in total (29.5 MB)
- and so on.
The maximum size of 10 MB per CRL holds only if all the certificates in the CRL expire within the next week. This raises the question: what is the point of revoking only the certificates that will expire shortly, but not the certificates with longer validity periods?
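The week-by-week growth above can be reproduced with a short simulation. This is only a sketch of the arithmetic in this comment. It assumes about 50 bytes per CRL entry, 200,000 new revocations per week for one ICA, and 20% of the listed certificates expiring each week. The function and constant names are illustrative and are not part of any Microsoft tooling.

```python
BYTES_PER_ENTRY = 50      # assumption: approximate size of one CRL entry
NEW_PER_WEEK = 200_000    # assumption: revocations per ICA per week
EXPIRY_PERCENT = 20       # assumption: share of listed certificates expiring weekly


def crl_entries_by_week(weeks, new_per_week=NEW_PER_WEEK,
                        expiry_percent=EXPIRY_PERCENT):
    """Return the number of CRL entries at the end of each week."""
    entries = 0
    history = []
    for _ in range(weeks):
        # Entries for certificates that expired during the week are dropped.
        expired = entries * expiry_percent // 100
        entries = entries - expired + new_per_week
        history.append(entries)
    return history


def crl_size_mb(entries, bytes_per_entry=BYTES_PER_ENTRY):
    """Approximate CRL size in decimal megabytes."""
    return entries * bytes_per_entry / 1_000_000


for week, entries in enumerate(crl_entries_by_week(4), start=1):
    print(f"Week {week}: {entries:,} entries ({crl_size_mb(entries):.1f} MB)")
```

Under these assumptions the entry count keeps climbing toward 200,000 / 0.2 = 1,000,000 entries (about 50 MB), so the CRL stays near 10 MB only if nearly every listed certificate expires within a week.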
Comment 63•26 days ago
And just for correctness:
"Additional considerations: Most Subscribers of the certificates issued by the CA require support for TLS 1.2, which requires keyEncipherment to be set as per RFC 5246: "keyEncipherment bit MUST be set if the key usage extension is present)." While this does not excuse the typographical mistake, it helps re-enforce that that this was a typo for a setting that was never planned to be changed."
RFC 5246 does not require keyEncipherment to be set in every RSA certificate. The requirement applies only to the key exchange algorithms "RSA and RSA_PSK", not to "DHE_RSA and ECDHE_RSA".
Comment 64•24 days ago
(Responding to Comment 61)
Thank you for your response to our questions in Comment 56. A few additional follow-ups are listed below.
In response to Question 1 of Comment 56, Microsoft PKI Services stated:
“Subscriber migration is expected to complete by April 2026, at which point issuance from existing ICAs will cease.”
(Q1) What stops Microsoft PKI Services from accomplishing this migration sooner?
In response to Question 2 of Comment 56, Microsoft PKI Services stated:
“We discovered gaps in our readiness to deal with a revocation event at this scale. Based on these gaps, we have identified action items to address those gaps (CRL partitioning, stand-by CAs, eliminating long lead time for new ICAs creation by eliminating need for cross-signing the ICAs, and mass revocation playbook as required by the Mozilla root program requirements). Microsoft remains committed to uphold the CAB/F and TRP requirements.”
We struggle to reconcile this statement with prior knowledge and Microsoft's own history.
- Microsoft has been aware of the benefits of CRL partitioning since at least May 2014 when it published guidance recommending it for Windows Server 2003 PKI deployments. Again, it is surprising that Microsoft PKI Services had not adopted this guidance internally in the intervening decade.
- Microsoft PKI Services’ public response to the Mozilla Policy 3.0 Survey focused on “mass revocation” readiness and challenges cited concerns related to CRL bloat.
- The TLS Baseline Requirements have always included expectations for timely certificate revocation, and these expectations have been increasingly emphasized within the community over the past year (e.g., discussions within the CA/Browser Forum, Mozilla’s Mass Revocation Policy and surrounding discussions, and here in Bugzilla). Microsoft's current prolonged revocation plan appears to contradict these long-standing and recently amplified expectations.
From our view and when considering the above, Microsoft PKI Services’ handling of this incident depicts an organization that was operating in a capacity where it was (and seemingly still is) unprepared to take steps necessary to adhere to the expectations described in the TLS Baseline Requirements. If it was not already aware of these shortcomings when Bug 1962829 was disclosed, it raises significant concerns about Microsoft PKI Services’ long-standing operational readiness and reliability when considering the inherent risks posed to the public-trust ecosystem.
(Q2) Can you offer more substantial commentary, or even better, enact more meaningful change that demonstrates Microsoft PKI Services’s commitment to reliably upholding the public-trust requirements? (we offer some examples below)
In Comment 14 of this incident report we asked “We understand that Microsoft PKI Services was aware of its “CRL bloat” concerns related to mass revocation events in February 2025, and presumably earlier. Can you help us understand that given the existence of this concern and the community’s emphasis on improving response to large revocation events over the past year, Microsoft PKI Services did not move forward with planning (minimally) or implementing partitioned CRLs sooner?”
The response was “Planning and implementation of CRL partitioning started before this incident and is currently being tested in a non-production environment. However, it has not been completed in time to be a mitigating factor in this incident.”
(Q3) We’d like to understand why CRL partitioning was “not completed in time to be a mitigating factor in this incident.” Can you please explain this in more detail?
In response to Question 3 of Comment 56, Microsoft PKI Services stated:
“The major limiting factor in not being able to revoke in a timely manner was lack of CRL partitioning on the existing CAs. The Warm Stand-bys are planned to have CRL partitioning. So will not suffer from the same issues.”
(Q4) This only appears true once all leafs are migrated to an ICA with partitioned CRLs. Is there something that we are missing?
In response to Question 7 of Comment 56, Microsoft PKI Services stated:
“Revocation of the ICAs was considered as an option, but revoking the ICAs would have impacted active subscriber certificates which were not mis-issued (with no alternate available for them).”
This does not directly address the question presented to Microsoft PKI Services.
However, the response to Question 6 states:
“The decisions related to legacy device support for Microsoft services are business decisions which are owned by the respective services. That said, our plan to cross-sign our G2 CA at the root level, and cease issuance from the G1 cross-signed ICAs will obviate the need for cross signing any additional ICAs in the future.”
We interpret this to indicate that Microsoft PKI Services is allowing external needs (i.e., “business decisions which are owned by the respective services.”) to take precedence over its obligations to the TLS Baseline Requirements.
This response also appears to ignore that non-Microsoft PKI Services CA service providers could be an option for the affected subscribers.
(Q5) The responses in Comment 56 do not address the (mis?)perception that Microsoft PKI Services is misprioritizing its responsibilities. We will again ask for Microsoft PKI Services to explain why its response to this incident should not be interpreted as prioritizing subscriber needs over its obligations to the TLS Baseline Requirements as a publicly-trusted CA Owner?
In response to Question 8 of Comment 56, Microsoft PKI Services stated:
“We are interpreting this question as “what is our capability to migrate issuance to new CAs for all our subscribers”. If that is not the intent of the question, please clarify.”
(Q6) This question was to understand how you will in practice migrate subscribers across issuing CAs. For example, GlobalSign describes rotating ICAs on a quarterly basis. With this clarification, does your answer change?
In response to Comment 58:
Microsoft PKI Services stated: “In relation to reducing certificate lifetimes, MS PKI Services currently supports 1 month certificates. But the certificate validity period is chosen by the subscribers based on their cadence and constraints. Our current plan for enforcing shorter certificate lifetimes follows the timeline outlined in Ballot SC-081v3.”
Despite supporting 1-month certificates and allowing validity to be chosen by subscribers, approximately 90% of the certificates affected by Bug 1962829 were determined by Microsoft as not in use. That suggests the existing approach could and should be improved.
As one possible alternative, Microsoft PKI Services could issue short-lived certificates (i.e., those that do not need to be revoked) by default, and issue longer-lived certificates only when explicitly requested by the applicant, for the validity requested.
(Comment) Given the circumstances of this report and Microsoft’s response, we strongly encourage Microsoft PKI Services to more aggressively pursue a remedy to this incident that includes a reduction of validity well in advance of the timelines included in SC-081 as a demonstration of its commitment to promoting agility, resilience, and improved security across the ecosystem.
Comment 65•23 days ago
This comment follows up on Comment #54. While we are pleased by MPS’s commitment to implement CRL partitioning and provision standby ICAs, we remain concerned that the current set of action items may not fully address the operational gaps that led to the delayed revocations and their impacts on the broader ecosystem.
MPS has already championed shorter certificate lifetimes by transitioning a number of users to six-month certificates by default, Comment #6. And in its responses, MPS has said that it is evaluating the use of short-lived certificates, Comment #16, and it has also discussed efforts to migrate a large fraction of the certificates it issues to 30-day lifetimes, Comment #26.
Mozilla requests that MPS commit to these efforts as part of its formal Action Items, with a clear timetable.
Specifically, we would like MPS to commit to concrete steps to increase the adoption of 30-day certificates, including:
- adoption targets with clear evaluation criteria for success; and
- specific actions to promote subscriber adoption, such as making 30-day lifetimes the default issuance profile.
In parallel, we would like MPS to make “Short-lived Subscriber Certificates”, as defined in the TLS Baseline Requirements (≤10 days until March 15, 2026; ≤7 days thereafter), available as a profile and, if suitable, the default option for short-lived cloud deployments. As outlined in section 4.9.1.1 of the TLS BRs, MPS wouldn’t need to provide any revocation services for such certificates, thereby improving both scalability and resilience.
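As an illustration of that definition, the check below sketches how a certificate's validity could be classified. It rests on several assumptions that should be verified against the current TLS BR text: a day is counted as 86,400 seconds, the validity period is counted inclusively from notBefore through notAfter, and notBefore is used as a stand-in for the issuance date. The function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone

SECONDS_PER_DAY = 86_400
# Assumption: the 10-day limit applies to certificates issued before this date.
CUTOVER = datetime(2026, 3, 15, tzinfo=timezone.utc)


def is_short_lived(not_before, not_after):
    """Return True if the validity period fits the short-lived limit."""
    # Assumption: validity is inclusive of both endpoints, hence the +1 second.
    validity_seconds = int((not_after - not_before).total_seconds()) + 1
    # Assumption: notBefore approximates the issuance date.
    max_days = 10 if not_before < CUTOVER else 7
    return validity_seconds <= max_days * SECONDS_PER_DAY


issued = datetime(2025, 8, 1, tzinfo=timezone.utc)
ten_days = timedelta(days=10) - timedelta(seconds=1)
print(is_short_lived(issued, issued + ten_days))            # 10-day certificate
print(is_short_lived(issued, issued + timedelta(days=30)))  # 30-day certificate
```

A 30-day certificate does not qualify under either limit, so the 30-day default requested above would still depend on revocation services.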
Given the high volume of unused certificates reported, MPS’s recent informal commitments, and control over its issuance and deployment infrastructure, we believe these steps are not only achievable, but also would significantly enhance agility and reduce reliance on large-scale revocation in the event of future incidents.
Additionally, to help Mozilla assess MPS’s alignment with our expectations and readiness improvements, we’d appreciate learning additional, clarifying details, as follows:
A. Mass Revocation Planning
Mozilla requires that MPS adopt a Mass Revocation Plan on or before September 1, 2025. The newly adopted section 5.7.1.2 of the TLS BRs requires that by December 1 MPS include a statement in its CPS that MPS maintains a Mass Revocation Plan. In addition to these requirements, MPS must perform annual operational testing and incorporate lessons learned into the plan.
The plan must cover plan activation criteria, customer contact mechanisms, differentiation of automated and manual steps, time-based objectives for triage and revocation, subscriber notifications, role assignments, training, testing methods, and post-event or post-test analysis.
Can MPS:
(1) share more detail about the structure, testing approach, and frequency for its mass revocation plan;
(2) confirm that its mass revocation plan includes the foregoing required components;
(3) describe the testing methodology used (e.g., simulations, tabletops);
(4) indicate whether and how the Plan and CPS updates are being adopted before the required deadlines; and
(5) share how it intends to validate its readiness internally or through audit processes?
B. Partitioned/Sharded CRLs and G2 Root-Based Hierarchy Migration
Bug comments indicate that CRL partitioning is a gating factor for deploying the G2-based ICAs into production, and we understand that G2 ICA creation is expected to be completed in early August. That milestone is imminent.
To better understand feasibility and preparedness in meeting MPS’s deadlines:
(1) can MPS share the specific design being used for partitioning (e.g. based on serial number, time, hash, etc.)?
(2) has MPS scheduled its key ceremony, and can a specific target date be shared?
(3) how many ICAs will be created?
(4) what are the target dates in late October for MPS’s deployment under the new G2 hierarchy?
(5) besides CRL partitioning, what other dependencies are there on deploying the ICAs into production/standby?
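To make question (1) concrete, here is a minimal sketch of one possible hash-based partitioning scheme. It is not a description of MPS's design, which has not been disclosed in this bug. The shard count, base URL, and function names are all hypothetical.

```python
import hashlib

NUM_PARTITIONS = 64                          # hypothetical shard count
BASE_URL = "http://crl.example.test/ica01"   # hypothetical CRL location


def partition_for_serial(serial_hex, num_partitions=NUM_PARTITIONS):
    """Map a certificate serial number (hex string) to a CRL shard index."""
    serial = int(serial_hex, 16)
    # Normalize to a minimal big-endian byte string so that "0A1B" and "a1b"
    # land in the same shard.
    length = max(1, (serial.bit_length() + 7) // 8)
    digest = hashlib.sha256(serial.to_bytes(length, "big")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def crl_url_for_serial(serial_hex):
    """Return the shard URL that would be embedded in the certificate."""
    return f"{BASE_URL}/{partition_for_serial(serial_hex)}.crl"


print(crl_url_for_serial("0A1B2C3D4E5F"))
```

With any scheme of this kind the shard is fixed at issuance, because the URL is carried in the certificate's CRL Distribution Points extension. That is why partitioning helps only certificates issued after it is deployed.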
Again, we encourage MPS to update the Action Items section of its incident report to reflect any newly confirmed timelines or improvements, including those in response to community feedback and our current requests.
Comment 66•23 days ago
The Microsoft root program says that all certificates need to have CRL or OCSP information. Even though it would be very nice to have certificates that do not need revocation, it is not possible now.
Assignee | ||
Comment 67•21 days ago
Response to Comment 62 - Lijun Liao
Thank you for the thoughtful analysis Lijun. You are correct that the CRL size and certificate expiration timelines are critical for our revocation strategy for this bug. Our approach to revoking certificates in weekly batches was designed to balance several competing priorities:
- CRL Size Management: As noted in Comment 16, we are aligning with the Windows Trusted Root Program’s recommendation to maintain CRL sizes at or below 10MB.
- Certificate Expiration: By revoking certificates that are nearing expiration, we ensure that they fall off the CRLs to make space of more certificates to be revoked. This allows us to maximize the revocations while operating within the CRL size constraints.
Assignee | ||
Comment 68•21 days ago
Response to Comment 63 - Lijun Liao
We acknowledge and thank you for the correction. You're right that RFC 5246 only requires keyEncipherment for RSA and RSA_PSK key exchange, not for DHE_RSA or ECDHE_RSA.
Assignee | ||
Comment 69•21 days ago
Weekly Status Update
We are actively making progress on the action items identified in the full incident report. Also, action item #4 has been marked as complete.
Action Item | Kind | Root Cause(s) | Updated Evaluation Criteria | Due Date | Status |
---|---|---|---|---|---|
Revoke impacted certificates (in batches beginning 5/28/2025) | Mitigate | Root Cause 1 | Percent of impacted certificates revoked will be tracked and published monthly. Verification possible via Certificate Transparency (CT) logs and serial number disclosure via Microsoft’s CRL. | 11/15/2025 | In Progress |
Migrate cert issuance to use partitioned CRLs | Prevent | Root Cause 1 | Percentage of newly issued certificates appearing in CT logs with updated CDP endpoints pointing to partitioned CRLs. Logs and CRL URLs can be independently verified by the public. | 11/15/2025 | In Progress |
Standup cross-signed warm standby CAs. We are currently in planning stages. We will have the plan ready before 06/14/2025 | Prevent | Root Cause 1 | Standby ICAs will be disclosed in CT logs with test certificates. Public can verify issuance and presence of standby ICAs through CT logs and Microsoft’s published CA repository. | 9/30/2025 | In Progress |
Create training and TSG Documentation to educate team on revocation expectations | Prevent | Root Cause 1 | Training completion rates will be tracked internally. Effectiveness will be evaluated through internal audits and inclusion of the training materials in external audit reviews. | 7/31/2025 | Complete |
Reduce usage of public PKI | Prevent | Root Cause 1 | Publish a monthly percentage reduction of unexpired, publicly trusted certificates issued from impacted hierarchies. Public can track progress using CT log data filtered for affected intermediates. | 9/30/2025 | In Progress |
Exercise and refine the mass revocation playbook | Prevent | Root Cause 1 | Effectiveness will be assessed through internal tracking of simulated revocation scenarios, including coverage and execution timing. The results of these exercises will inform iterative improvements to the playbook. While objective external metrics are limited, Microsoft will evaluate the impact through internal reviews and incorporate this action into relevant audit scopes. | 09/01/2025 | In Progress |
Assignee | ||
Comment 70•21 days ago
Revocation Delay Status Update
- Total certificates revoked (planned to date):
- 5,396,112 (5,758,644)
- Remaining active certificates (total affected):
- 40,401,165 (72,070,777)
- Total certificates expired and not revoked (to date):
- 25,539,494
- Estimate for remaining revocations:
- We will continue to revoke certificates in batches until 11/15/2025
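The change between this update and the one in Comment 60 can be worked out directly from the published figures. The sketch below only performs that subtraction and expresses the totals as a share of all affected certificates. The class and function names are illustrative.

```python
from dataclasses import dataclass

TOTAL_AFFECTED = 72_070_777  # total affected certificates, as reported


@dataclass(frozen=True)
class StatusUpdate:
    revoked: int
    remaining_active: int
    expired_not_revoked: int


comment_60 = StatusUpdate(4_676_112, 44_451_262, 22_289_397)
comment_70 = StatusUpdate(5_396_112, 40_401_165, 25_539_494)


def weekly_delta(previous, current):
    """Difference between two consecutive status updates."""
    return StatusUpdate(
        current.revoked - previous.revoked,
        current.remaining_active - previous.remaining_active,
        current.expired_not_revoked - previous.expired_not_revoked,
    )


def percent_of_total(count, total=TOTAL_AFFECTED):
    """Share of all affected certificates, in percent."""
    return 100 * count / total


delta = weekly_delta(comment_60, comment_70)
print(f"Newly revoked:          {delta.revoked:,}")
print(f"Newly expired:          {delta.expired_not_revoked:,}")
print(f"Revoked so far:         {percent_of_total(comment_70.revoked):.1f}%")
print(f"Expired, never revoked: {percent_of_total(comment_70.expired_not_revoked):.1f}%")
```

On these figures, 720,000 certificates were revoked during the week while 3,250,097 expired without being revoked.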
Comment 71•21 days ago
(In reply to Microsoft PKI Services from comment #67)
Response to Comment 62 - Lijun Liao
Thank you for the thoughtful analysis Lijun. You are correct that the CRL size and certificate expiration timelines are critical for our revocation strategy for this bug. Our approach to revoking certificates in weekly batches was designed to balance several competing priorities:
- CRL Size Management: As noted in Comment 16, we are aligning with the Windows Trusted Root Program’s recommendation to maintain CRL sizes at or below 10MB.
- Certificate Expiration: By revoking certificates that are nearing expiration, we ensure that they fall off the CRLs to make space of more certificates to be revoked. This allows us to maximize the revocations while operating within the CRL size constraints.
If the certificates will expire within one week, no revocation is needed (similar to short-lived certificates). You can skip the revocation, since it serves no real purpose. So my understanding is that this revocation strategy exists just to be able to say "Microsoft is taking the revocation action".
I would rather take a different direction. Since only 10% of the certificates are active, I would use the limited resources to revoke these active certificates. Instead of revoking all certificates that expire within one week, I might have the capacity to revoke the active certificates that expire within the next n weeks (where n seems to be between 4 and 10).
Comment 72•21 days ago
(In reply to Microsoft PKI Services from comment #67)
- Certificate Expiration: By revoking certificates that are nearing expiration, we ensure that they fall off the CRLs to make space of more certificates to be revoked. This allows us to maximize the revocations while operating within the CRL size constraints.
In my understanding, it is incorrect to assume that the expiration of a certificate automatically exempts you from the obligation to track it in a revocation list. The BRs (7.2.2) even recommend (SHOULD) updating revocation entries (including dates) if new information about the compromise becomes known.
Comment 73•17 days ago
At what point do we realize that Microsoft has no intent to meet the revocation timeline and is just slow-walking revocation? More certificates are expiring than are being revoked. CRL size is simply an excuse.
It's also interesting that browsers do not seem to care. Questions are asked and not answered, but no action is taken.
Comment 74•17 days ago
If this is the “Secure Future Initiative”[1][2] that Microsoft promised to its customers and the general public after heavy criticism from the security community, US Senators and others following multiple security disasters, then maybe it should be questioned whether cloud providers and OS/browser vendors should be allowed to run public Web PKI CAs in the first place. There are many apparent conflicts of interest surfacing here.
What is the consensus on this question of principle? This probably has been discussed in the past.
[1] https://www.bleepingcomputer.com/news/microsoft/microsoft-pledges-to-bolster-security-as-part-of-secure-future-initiative/
[2] https://www.microsoft.com/en-us/trust-center/security/secure-future-initiative
Assignee | ||
Comment 75•16 days ago
Response to Comment 64 – Chrome Root Program
Question 1
"In response to Question 1 of Comment 56, Microsoft PKI Services stated:
“Subscriber migration is expected to complete by April 2026, at which point issuance from existing ICAs will cease.”
(Q1) What stops Microsoft PKI Services from accomplishing this migration sooner?"
April ’26 was our high confidence date for completing the migration. We are accelerating work, and our new target is by end of Feb ’26. We will look to accelerate further as we implement that plan. The major publicly trackable milestones are below, and we will provide updates to the community as we hit those milestones:
- Issuance of certificates from G2 ICAs with partitioned CRLs begins
- Issuance of certificates from G1 ICAs with partitioned CRLs begins
- Issuance fully migrated to partitioned CRLs (G1 or G2)
We will also provide regular updates on burndown for G1 to G2 transition.
Question 2
"In response to Question 2 of Comment 56, Microsoft PKI Services stated:
“We discovered gaps in our readiness to deal with a revocation event at this scale. Based on these gaps, we have identified action items to address those gaps (CRL partitioning, stand-by CAs, eliminating long lead time for new ICAs creation by eliminating need for cross-signing the ICAs, and mass revocation playbook as required by the Mozilla root program requirements). Microsoft remains committed to uphold the CAB/F and TRP requirements.”
We struggle to reconcile this statement with prior knowledge and Microsoft's own history.
• Microsoft has been aware of the benefits of CRL partitioning since at least May 2014 when it published guidance recommending it for Windows Server 2003 PKI deployments. Again, it is surprising that Microsoft PKI Services had not adopted this guidance internally in the intervening decade.
• Microsoft PKI Services’ public response to the Mozilla Policy 3.0 Survey focused on “mass revocation” readiness and challenges cited concerns related to CRL bloat.
• The TLS Baseline Requirements have always included expectations for timely certificate revocation, and these expectations have been increasingly emphasized within the community over the past year (e.g., discussions within the CA/Browser Forum, Mozilla’s Mass Revocation Policy and surrounding discussions, and here in Bugzilla). Microsoft's current prolonged revocation plan appears to contradict these long-standing and recently amplified expectations.
From our view and when considering the above, Microsoft PKI Services’ handling of this incident depicts an organization that was operating in a capacity where it was (and seemingly still is) unprepared to take steps necessary to adhere to the expectations described in the TLS Baseline Requirements. If it was not already aware of these shortcomings when Bug 1962829 was disclosed, it raises significant concerns about Microsoft PKI Services’ long-standing operational readiness and reliability when considering the inherent risks posed to the public-trust ecosystem.
(Q2) Can you offer more substantial commentary, or even better, enact more meaningful change that demonstrates Microsoft PKI Services’s commitment to reliably upholding the public-trust requirements? (we offer some examples below)"
In addition to the already committed repairs, we are also committing to the below repair actions –
- Accelerate migration to ICAs with partitioned CRLs (see response to Q1)
- Accelerate the reduction of default certificate lifetimes (see response to last comment)
- Plan for frequent ICA rotations to maintain operational readiness and crypto agility (Q6)
With these changes in place, we will be in a much improved state to be able to respond to revocation events at a scale such as this.
Question 3
"In Comment 14 of this incident report we asked “We understand that Microsoft PKI Services was aware of its “CRL bloat” concerns related to mass revocation events in February 2025, and presumably earlier. Can you help us understand that given the existence of this concern and the community’s emphasis on improving response to large revocation events over the past year, Microsoft PKI Services did not move forward with planning (minimally) or implementing partitioned CRLs sooner?”
The response was “Planning and implementation of CRL partitioning started before this incident and is currently being tested in a non-production environment. However, it has not been completed in time to be a mitigating factor in this incident.”
(Q3) We’d like to understand why CRL partitioning was “not completed in time to be a mitigating factor in this incident.” Can you please explain this in more detail?"
The underlying CA software we use did not support partitioned CRLs without re-keying CAs. Given our current volumes, utilizing that method would have required us to re-key CAs at a frequency which is not operationally viable.
Development and testing of operationally viable CA software features was already in progress prior to the Feb 2025 MRP survey. Testing and bug fixes are now complete, and the CA software update is slated for release by mid-August, after which we will validate it and roll it out to production as a prerequisite to starting the G2 ICA migration.
Question 4
"In response to Question 3 of Comment 56, Microsoft PKI Services stated:
“The major limiting factor in not being able to revoke in a timely manner was lack of CRL partitioning on the existing CAs. The Warm Stand-bys are planned to have CRL partitioning. So will not suffer from the same issues.”
(Q4) This only appears true once all leafs are migrated to an ICA with partitioned CRLs. Is there something that we are missing?"
You are correct that the CRL partitioning risk will only be eliminated once all the leaf certificates have been renewed after CRL partitioning is implemented. To address this issue, per the plan provided in our response to Q1, we will accelerate migration to G2 ICAs so that all new certificates are issued from CAs with partitioned CRLs sooner.
Question 5
"In response to Question 7 of Comment 56, Microsoft PKI Services stated:
“Revocation of the ICAs was considered as an option, but revoking the ICAs would have impacted active subscriber certificates which were not mis-issued (with no alternate available for them).”
This does not directly address the question presented to Microsoft PKI Services.
However, the response to Question 6 states:
“The decisions related to legacy device support for Microsoft services are business decisions which are owned by the respective services. That said, our plan to cross-sign our G2 CA at the root level, and cease issuance from the G1 cross-signed ICAs will obviate the need for cross signing any additional ICAs in the future.”
We interpret this to indicate that Microsoft PKI Services is allowing external needs (i.e., “business decisions which are owned by the respective services.”) to take precedence over its obligations to the TLS Baseline Requirements.
This response also appears to ignore that non-Microsoft PKI Services CA service providers could be an option for the affected subscribers.
(Q5) The responses in Comment 56 do not address the (mis?)perception that Microsoft PKI Services is misprioritizing its responsibilities. We will again ask for Microsoft PKI Services to explain why its response to this incident should not be interpreted as prioritizing subscriber needs over its obligations to the TLS Baseline Requirements as a publicly-trusted CA Owner?"
The inability to precisely direct revocations to only the affected certificates in a revocation event of this scale is the primary driver for the delayed revocations. To address this issue and to reinforce MPS’s commitment to the public TLS requirements, we are making and accelerating significant investments to improve our systems, per the plan provided in our response to Q2.
Question 6
"In response to Question 8 of Comment 56, Microsoft PKI Services stated:
“We are interpreting this question as “what is our capability to migrate issuance to new CAs for all our subscribers”. If that is not the intent of the question, please clarify.”
(Q6) This question was to understand how you will in practice migrate subscribers across issuing CAs. For example, GlobalSign describes rotating ICAs on a quarterly basis. With this clarification, does your answer change?"
Although no BR requirement stipulates ICA rotation schedules, we recognize the benefits of frequent ICA rotations (operational readiness, eliminating CA pinning by subscribers, crypto agility, etc.). By mid-October, we will develop a plan for scheduled ICA rotations at a fixed cadence.
Comment
"In response to Comment 58:
Microsoft PKI Services stated: “In relation to reducing certificate lifetimes, MS PKI Services currently supports 1 month certificates. But the certificate validity period is chosen by the subscribers based on their cadence and constraints. Our current plan for enforcing shorter certificate lifetimes follows the timeline outlined in Ballot SC-081v3.”
Despite supporting 1-month certificates and allowing validity to be chosen by subscribers, approximately 90% of the certificates affected by Bug 1962829 were determined by Microsoft as not in use. That seems to describe that the existing approach could and should be improved.
As one possible alternative, one might imagine that by default Microsoft PKI Services’ could issue short-lived certificates (i.e., those that do not need to be revoked), and instead could issue longer-lived certificates when explicitly requested by the applicant - for the validity requested.
(Comment) Given the circumstances of this report and Microsoft’s response, we strongly encourage Microsoft PKI Services to more aggressively pursue a remedy to this incident that includes a reduction of validity well in advance of the timelines included in SC-081 as a demonstration of its commitment to promoting agility, resilience, and improved security across the ecosystem."
We are committed to reducing the default validity periods of certificates well ahead of the dates required by the BRs, including making 7-day profiles available. We will share the details of the plan by 08/22. We have added an action item for this to the repair actions.
Additional Committed Action Items
Action Item Description | Kind | Corresponding Root Cause(s) | Evaluation Criteria | Due Date | Status |
---|---|---|---|---|---|
Publish a phased plan to reduce the default certificate validity period, with the long-term goal of transitioning to short-lived certificates. | Preventive | Root Cause 1 | Effectiveness will be measured by publication of the plan by 2025-08-22. Public can verify via the published plan and future CPS updates reflecting the proposed changes. | 2025-08-22 | New |
Begin implementation of the phased certificate lifecycle reduction plan, including updates to issuance systems and CPS. | Preventive | Root Cause 1 | Effectiveness will be measured by issuance of certificates with reduced validity periods, visible in CT logs, and updated CPS language. Public can verify through CT data and CPS version history. | TBD (based on plan milestones) | New |
Complete migration of all customers to G2 ICAs with CRL partitioning and eliminate issuance from non-partitioned ICAs. | Mitigate | Root Cause 1 | Effectiveness will be measured by the percentage of certificates issued from G2 ICAs with partitioned CRLs, visible in CT logs. Public can verify through CCADB hierarchy updates and issuance patterns in CT. Internal tracking will confirm deprecation of non-partitioned ICAs. We will provide regular updates on the burndown for G1 to G2 transition. | 2026-02-16 | New |
Develop and publish a plan for regular ICA rotations to maintain operational readiness and crypto agility. | Preventive | Root Cause 1 | Effectiveness will be measured by publication of the ICA rotation plan. ICA rotation can be publicly verified through CCADB and CT logs as we execute the plan. | TBD | New |
Assignee
Comment 76•16 days ago
Response to Comment 65 - Ben Wilson
Reduction in Certificate validity
"This comment follows up on Comment #54. While we are pleased by MPS’s commitment to implement CRL partitioning and provision standby ICAs, we remain concerned that the current set of action items may not fully address the operational gaps that led to the delayed revocations and their impacts on the broader ecosystem.
MPS has already championed shorter certificate lifetimes by transitioning a number of users to six-month certificates by default, Comment #6. And in its responses, MPS has said that it is evaluating the use of short-lived certificates, Comment #16, and it has also discussed efforts to migrate a large fraction of the certificates it issues to 30-day lifetimes, Comment #26.
Mozilla requests that MPS commit to these efforts as part of its formal Action Items, with a clear timetable.
Specifically, we would like MPS to commit to concrete steps to increase the adoption of 30-day certificates, including:
• adoption targets with clear evaluation criteria for success; and
• specific actions to promote subscriber adoption, such as making 30-day lifetimes the default issuance profile.
In parallel, we would like MPS to make “Short-lived Subscriber Certificates”—as defined in the TLS Baseline Requirements (≤10 days until March 15, 2026; ≤7 days thereafter)— available as a profile, and if suitable, the default option for short-lived cloud deployments. As outlined in section 4.9.1.1 of the TLS BRs, MPS wouldn’t need to provide any revocation services for such certificates, thereby improving both scalability and resilience.
Given the high volume of unused certificates reported, MPS’s recent informal commitments, and control over its issuance and deployment infrastructure, we believe these steps are not only achievable, but also would significantly enhance agility and reduce reliance on large-scale revocation in the event of future incidents."
We estimate that we can move 25% of our subscriber certificates to 30-day certificates over the next 9-12 months. To address your questions related to reducing certificate validity periods, please see the action items added as part of Comment 75.
Mass Revocation Plan
"A. Mass Revocation Planning
Mozilla requires that MPS adopt a Mass Revocation Plan on or before September 1, 2025. The newly adopted section 5.7.1.2 of the TLS BRs requires that by December 1 MPS include a statement in its CPS that MPS maintains a Mass Revocation Plan. In addition to these requirements, MPS must perform annual operational testing and incorporate lessons learned into the plan.
The plan must cover plan activation criteria, customer contact mechanisms, differentiation of automated and manual steps, time-based objectives for triage and revocation, subscriber notifications, role assignments, training, testing methods, and post-event or post-test analysis.
Can MPS:
(1) share more detail about the structure, testing approach, and frequency for its mass revocation plan;
(2) confirm that its mass revocation plan includes the foregoing required components;
(3) describe the testing methodology used (e.g., simulations, tabletops);
(4) indicate whether and how the Plan and CPS updates are being adopted before the required deadlines; and
(5) share how it intends to validate its readiness internally or through audit processes?"
This incident highlighted critical areas for improving our mass revocation readiness. We are actively working on the plan to comply with all the MRP requirements by the September 1st deadline.
We will provide a more detailed response to these specific questions before September 5th.
Partitioned/Sharded CRLs and G2 Root-Based Hierarchy Migration
"Bug comments indicate that CRL partitioning is a gating factor for deploying the G2-based ICAs into production, and we understand that G2 ICA creation is expected to be completed in early August. That milestone is imminent.
To better understand feasibility and preparedness in meeting MPS’s deadlines:
(1) can MPS share the specific design being used for partitioning (e.g. based on serial number, time, hash, etc.)?
(2) has MPS scheduled its key ceremony, and can a specific target date be shared?
(3) how many ICAs will be created?
(4) what are the target dates in late October for MPS’s deployment under the new G2 hierarchy?
(5) besides CRL partitioning, what other dependencies are there on deploying the ICAs into production/standby?"
- At issuance, each certificate is randomly assigned a CRL partition number, which determines the specific CRL that will contain its serial number if revoked. The certificate’s CDP extension points to that partition’s CRL, which carries an IDP extension indicating the scope it covers.
- For security reasons, Microsoft does not disclose the exact dates of the key ceremonies. We have already created 7 G2 ICAs which are disclosed in CCADB, and the remaining G2 and G1 CAs will be created by 8/15.
- Microsoft plans to create 12 G2 CAs: 4 RSA and 2 ECC for certificate issuance, and 4 RSA and 2 ECC as warm standbys. Additionally, 6 CAs (4 RSA and 2 ECC) will be created from the G1 root for warm standby purposes.
- Our current target date for G2 CA availability for enrollment is 10/27. However, as mentioned in Comment 75, we are planning to accelerate that availability.
- The primary dependency is CRL partitioning; no other dependencies are known at this time for G2 ICAs. For G1 ICAs, we have an additional dependency on having them cross-signed.
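The random partition assignment described in the first bullet above can be illustrated with a minimal sketch. This is not the CA software's implementation: the partition count, the URL template, and all function names are hypothetical, since the thread does not disclose them.

```python
import secrets
from collections import defaultdict

# Assumptions for illustration only; neither value appears in the thread.
NUM_PARTITIONS = 64
CRL_URL_TEMPLATE = "http://crl.example.test/{ca}/partition-{n}.crl"


def assign_partition(num_partitions: int = NUM_PARTITIONS) -> int:
    """Pick a CRL partition uniformly at random at issuance time."""
    return secrets.randbelow(num_partitions)


def cdp_url(ca_name: str, partition: int) -> str:
    """Build the (hypothetical) CRL URL that the certificate's CDP would carry."""
    return CRL_URL_TEMPLATE.format(ca=ca_name, n=partition)


def group_revocations(revoked):
    """Group revoked serials by partition.

    `revoked` is an iterable of (serial, partition) pairs. Each partition's
    CRL would list only the serials that were assigned to it at issuance.
    """
    by_partition = defaultdict(list)
    for serial, partition in revoked:
        by_partition[partition].append(serial)
    return dict(by_partition)
```

Because the partition is fixed at issuance and recorded with the certificate, a mass revocation spreads its entries roughly evenly across partitions rather than inflating a single CRL.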
Assignee | ||
Comment 77•14 days ago
Response to Comment 71 - Lijun Liao
"If the certificates will expire in one week, no revocation is needed (similar to the short-lived certificates). You can skip the revocation, such revocation does not have any real sense. So my understanding of this revocation strategy is just to say the "Microsoft is taking the revocation action".
For me, I would like to follow other direction. Since only 10% of the certificates are active. I will use the limited resource to revoke these active certificates. Instead of revoking all certificates expired in one week, I may have the capability to revoke the active certificates expired in the next n weeks (where n seems to be between 4-10)."
We acknowledge that revoking certificates close to expiration may have limited operational impact. However, the challenge was not about convenience; it was about avoiding a scenario where revocation itself would destabilize the ecosystem.
At the time of this incident, approximately 4.5M of the impacted certificates were active. Revoking all of them immediately, without partitioned CRLs, would have produced CRLs so large that relying-party software could not process them reliably. This would have caused widespread failures across the ecosystem, including for parties unrelated to the incident.
Given this constraint, we executed the maximum safe revocation possible under the circumstances while accelerating the structural fix, partitioned CRLs, that permanently removes this limitation. Per Comment 75, we are also expediting migration off these CAs.
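A rough back-of-the-envelope estimate shows why CRL size is the limiting factor here. The per-entry size and the partition count below are assumptions chosen for illustration; the thread states only the approximate number of active impacted certificates, and does not say how they are distributed across issuing CAs.

```python
ACTIVE_IMPACTED = 4_500_000  # approximate figure from the comment above
BYTES_PER_ENTRY = 40         # assumption: serial + revocation date + encoding overhead
NUM_PARTITIONS = 64          # assumption: not stated in the thread


def crl_size_bytes(entries: int, bytes_per_entry: int = BYTES_PER_ENTRY) -> int:
    """Approximate the size of the revoked-certificate list of a CRL."""
    return entries * bytes_per_entry


def entries_per_partition(entries: int, partitions: int) -> int:
    """Entries per partition assuming an even spread (ceiling division)."""
    return -(-entries // partitions)


total = crl_size_bytes(ACTIVE_IMPACTED)
per_part = crl_size_bytes(entries_per_partition(ACTIVE_IMPACTED, NUM_PARTITIONS))
print(f"{total / 1e6:.0f} MB if listed in one CRL, {per_part / 1e6:.1f} MB per partition")
```

Under these assumptions the single-CRL case is an upper bound, since in practice the entries are split across several issuing CAs. The point of the comparison is the ratio: partitioning divides the download that each relying party needs by roughly the number of partitions.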
Assignee | ||
Comment 78•14 days ago
Response to Comment 72 - Stephan Verbücheln
"In my understanding, it is an incorrect assumption that the expiration of a certificate automatically exempts you from the obligation to track it in a revocation list. The BR (7.2.2) even recommends (SHOULD) to update revocation entries (including dates) if new information about the compromise gets known."
We appreciate the clarification regarding BR 7.2.2. Our process follows BR 4.10, where we remove CRL entries for expired certificates. To be clear, our strategy does not assume that expiration exempts a certificate from revocation obligations. Rather, our decision to defer revocation for certificates nearing expiration was driven by the need to manage CRL size and avoid ecosystem-wide failures. We remain committed to revoking compromised certificates regardless of expiration status when new compromise information is discovered.
Assignee | ||
Comment 79•14 days ago
Weekly Status Update
We are actively progressing on the full set of action items outlined in the incident report. Below is the complete and updated list, which now includes several newly added items.
Action Item Description | Kind | Corresponding Root Cause(s) | Evaluation Criteria | Due Date | Status |
---|---|---|---|---|---|
Revoke impacted certificates (in batches beginning 5/28/2025) | Mitigate | Root Cause 1 | Percent of impacted certificates revoked will be tracked and published monthly. Verification possible via Certificate Transparency (CT) logs and serial number disclosure via Microsoft’s CRL. | 2025-11-15 | In Progress |
Migrate cert issuance to use partitioned CRLs | Prevent | Root Cause 1 | Percentage of newly issued certificates appearing in CT logs with updated CDP endpoints pointing to partitioned CRLs. Logs and CRL URLs can be independently verified by the public. | 2025-11-15 | In Progress |
Stand up cross-signed warm standby CAs. We are currently in planning stages. We will have the plan ready before 06/14/2025 | Prevent | Root Cause 1 | Standby ICAs will be disclosed in CT logs with test certificates. Public can verify issuance and presence of standby ICAs through CT logs and Microsoft’s published CA repository. | 2025-09-30 | In Progress |
Create training and TSG Documentation to educate team on revocation expectations | Prevent | Root Cause 1 | Training completion rates will be tracked internally. Effectiveness will be evaluated through internal audits and inclusion of the training materials in external audit reviews. | 2025-07-31 | Complete |
Reduce usage of public PKI | Prevent | Root Cause 1 | Publish a monthly percentage reduction of unexpired, publicly trusted certificates issued from impacted hierarchies. Public can track progress using CT log data filtered for affected intermediates. | 2025-09-30 | In Progress |
Exercise and refine the mass revocation playbook | Prevent | Root Cause 1 | Effectiveness will be assessed through internal tracking of simulated revocation scenarios, including coverage and execution timing. The results of these exercises will inform iterative improvements to the playbook. While objective external metrics are limited, Microsoft will evaluate the impact through internal reviews and incorporate this action into relevant audit scopes. | 2025-09-01 | In Progress |
Publish a phased plan to reduce the default certificate validity period, with the long-term goal of transitioning to short-lived certificates. | Preventive | Root Cause 1 | Effectiveness will be measured by publication of the plan by 2025-08-22. Public can verify via the published plan and future CPS updates reflecting the proposed changes. | 2025-08-22 | New |
Begin implementation of the phased certificate lifecycle reduction plan, including updates to issuance systems and CPS. | Preventive | Root Cause 1 | Effectiveness will be measured by issuance of certificates with reduced validity periods, visible in CT logs, and updated CPS language. Public can verify through CT data and CPS version history. | TBD (based on plan milestones) | New |
Complete migration of all customers to G2 ICAs with CRL partitioning and eliminate issuance from non-partitioned ICAs. | Mitigate | Root Cause 1 | Effectiveness will be measured by the percentage of certificates issued from G2 ICAs with partitioned CRLs, visible in CT logs. Public can verify through CCADB hierarchy updates and issuance patterns in CT. Internal tracking will confirm deprecation of non-partitioned ICAs. We will provide regular updates on the burndown for G1 to G2 transition. | 2026-02-16 | New |
Develop and publish a plan for regular ICA rotations to maintain operational readiness and crypto agility. | Preventive | Root Cause 1 | Effectiveness will be measured by publication of the ICA rotation plan. ICA rotation can be publicly verified through CCADB and CT logs as we execute the plan. | TBD | New |
Assignee | ||
Comment 80•14 days ago
Revocation Delay Status Update
- Total certificates revoked (planned to date):
  - 6,006,812 (6,558,644)
- Remaining active certificates (total affected):
  - 36,079,260 (72,070,777)
- Total certificates expired and not revoked (to date):
  - 29,061,399
- Estimate for remaining revocations:
  - We will continue to revoke certificates in batches until 11/15/2025
Assignee | ||
Comment 81•8 days ago
Weekly Status Update
We are actively progressing on the full set of action items outlined in the incident report. We have updated the Due Date for the last repair item. Please see full list below:
Action Item Description | Kind | Corresponding Root Cause(s) | Evaluation Criteria | Due Date | Status |
---|---|---|---|---|---|
Revoke impacted certificates (in batches beginning 5/28/2025) | Mitigate | Root Cause 1 | Percent of impacted certificates revoked will be tracked and published monthly. Verification possible via Certificate Transparency (CT) logs and serial number disclosure via Microsoft’s CRL. | 2025-11-15 | In Progress |
Migrate cert issuance to use partitioned CRLs | Prevent | Root Cause 1 | Percentage of newly issued certificates appearing in CT logs with updated CDP endpoints pointing to partitioned CRLs. Logs and CRL URLs can be independently verified by the public. | 2025-11-15 | In Progress |
Stand up cross-signed warm standby CAs. We are currently in planning stages. We will have the plan ready before 06/14/2025 | Prevent | Root Cause 1 | Standby ICAs will be disclosed in CT logs with test certificates. Public can verify issuance and presence of standby ICAs through CT logs and Microsoft’s published CA repository. | 2025-09-30 | In Progress |
Create training and TSG Documentation to educate team on revocation expectations | Prevent | Root Cause 1 | Training completion rates will be tracked internally. Effectiveness will be evaluated through internal audits and inclusion of the training materials in external audit reviews. | 2025-07-31 | Complete |
Reduce usage of public PKI | Prevent | Root Cause 1 | Publish a monthly percentage reduction of unexpired, publicly trusted certificates issued from impacted hierarchies. Public can track progress using CT log data filtered for affected intermediates. | 2025-09-30 | In Progress |
Exercise and refine the mass revocation playbook | Prevent | Root Cause 1 | Effectiveness will be assessed through internal tracking of simulated revocation scenarios, including coverage and execution timing. The results of these exercises will inform iterative improvements to the playbook. While objective external metrics are limited, Microsoft will evaluate the impact through internal reviews and incorporate this action into relevant audit scopes. | 2025-09-01 | In Progress |
Publish a phased plan to reduce the default certificate validity period, with the long-term goal of transitioning to short-lived certificates. | Preventive | Root Cause 1 | Effectiveness will be measured by publication of the plan by 2025-08-22. Public can verify via the published plan and future CPS updates reflecting the proposed changes. | 2025-08-22 | In Progress |
Begin implementation of the phased certificate lifecycle reduction plan, including updates to issuance systems and CPS. | Preventive | Root Cause 1 | Effectiveness will be measured by issuance of certificates with reduced validity periods, visible in CT logs, and updated CPS language. Public can verify through CT data and CPS version history. | TBD (based on plan milestones) | New |
Complete migration of all customers to G2 ICAs with CRL partitioning and eliminate issuance from non-partitioned ICAs. | Mitigate | Root Cause 1 | Effectiveness will be measured by the percentage of certificates issued from G2 ICAs with partitioned CRLs, visible in CT logs. Public can verify through CCADB hierarchy updates and issuance patterns in CT. Internal tracking will confirm deprecation of non-partitioned ICAs. We will provide regular updates on the burndown for G1 to G2 transition. | 2026-02-28 | New |
Develop and publish a plan for regular ICA rotations to maintain operational readiness and crypto agility. | Preventive | Root Cause 1 | Effectiveness will be measured by publication of the ICA rotation plan. ICA rotation can be publicly verified through CCADB and CT logs as we execute the plan. | 2025-10-17 | New |
Assignee | ||
Comment 82•8 days ago
Revocation Delay Status Update
- Total certificates revoked (planned to date):
  - 6,806,812 (7,358,644)
- Remaining active certificates (total affected):
  - 30,930,510 (72,070,777)
- Total certificates expired and not revoked (to date):
  - 33,410,149
- Estimate for remaining revocations:
  - We will continue to revoke certificates in batches until 11/15/2025
Comment 83•7 days ago
Mozilla appreciates MPS’ progress in improving certificate lifecycle management, its commitment to reducing certificate validity periods ahead of required timelines, and its adoption of a phased implementation plan to issue more shorter-lived certificates.
We note that in Comment #76 MPS estimates that it can move 25% of its subscriber certificates to 30-day certificates over the next 9-12 months, which is a meaningful first step, but in Comment #26 MPS said that approximately 75% of impacted certificates could have had 30-day validity based on the lifecycle of the underlying resources. Given that prior estimate, we would appreciate additional context around the mention of this lower 25% migration target.
Can MPS please explain the rationale and factors involved in the 25% 9-12-month goal? Did MPS identify implementation issues or subscriber constraints that pushed it out? Will the plan to be provided next week include additional adoption targets (e.g. 50%, 75%, 90%) and more aggressive dates?
Understanding the assumptions and barriers informing MPS' staged plan will help the community better assess MPS’ path toward broader adoption of shorter-lived certificates.
Thanks.
Assignee | ||
Comment 84•10 hours ago
Response to Comment 83 - Ben Wilson
"Mozilla appreciates MPS’ progress in improving certificate lifecycle management, its commitment to reducing certificate validity periods ahead of required timelines, and its adoption of a phased implementation plan to issue more shorter-lived certificates.
We note that in Comment #76 MPS estimates that it can move 25% of its subscriber certificates to 30-day certificates over the next 9-12 months, which is a meaningful first step, but in Comment #26 MPS said that approximately 75% of impacted certificates could have had 30-day validity based on the lifecycle of the underlying resources. Given that prior estimate, we would appreciate additional context around the mention of this lower 25% migration target.
Can MPS please explain the rationale and factors involved in the 25% 9-12-month goal? Did MPS identify implementation issues or subscriber constraints that pushed it out? Will the plan to be provided next week include additional adoption targets (e.g. 50%, 75%, 90%) and more aggressive dates?
Understanding the assumptions and barriers informing MPS' staged plan will help the community better assess MPS’ path toward broader adoption of shorter-lived certificates.
Thanks."
In Comment 26 we referred to a 75% migration potential, which reflected the upper bound of what is technically feasible. Since then, we have engaged with the subscriber community and learned of additional constraints: workload durations not being known in advance, upstream dependencies for the subscribers, and safe deployment norms. These are reflected in the more realistic 25% figure mentioned in Comment 76. We are continuing to work with our largest subscribers to understand and address their constraints. As those constraints are resolved, we hope to make faster progress on this front.
Assignee | ||
Comment 85•10 hours ago
Below please find the lifetime reduction plan as part of this action item:
Action Item Description | Kind | Corresponding Root Cause(s) | Evaluation Criteria | Due Date | Status |
---|---|---|---|---|---|
Publish a phased plan to reduce the default certificate validity period, with the long-term goal of transitioning to short-lived certificates. | Preventive | Root Cause 1 | Effectiveness will be measured by publication of the plan by 2025-08-22. Public can verify via the published plan and future CPS updates reflecting the proposed changes. | 2025-08-22 | Complete |
Lifetime Reduction Plan
Our goal is to reduce the default certificate validity period to 47 days by May 2026. Below are the dates when transitions to each default validity period will begin. Note that once a transition begins, it can take up to 6 weeks for it to saturate through our entire subscriber base. Also note that these are defaults, and subscribers can ask for and receive exceptions (the maximum validity periods with an exception are noted in the last column). Beyond this committed plan, we are investigating introducing a 7-day default validity period before the end of CY26.
Certificates issued on or after | Expected saturation date for policy changes | Default validity period | Exception maximums |
---|---|---|---|
~September 15, 2025 | November 1, 2025 | 100 days | 360 days |
March 15, 2026 | May 1, 2026 | 47 days | 200 days |
Though it is difficult to estimate how many of our subscribers may choose to exercise exceptions, we have estimated the bounds of adoption based on historical subscriber behavior. Based on these projections, we estimate that subscribers who account for approximately 65% of the certificates are likely to adopt the defaults (or less) when they are rolled out (e.g. once the change of the default to 100 days is rolled out, we expect ~65% of certificates issued after that date to have a validity of 100 days or less). We expect the subscribers for the remaining 35% to take between 6 and 18 months to make the necessary changes on their end to adopt the shorter validity periods. Note that these are estimates based on historical data. Changes in subscriber behavior (e.g. more subscribers than projected delaying adoption, or changes to a usage pattern for a high-volume customer of a subscriber service) can skew these numbers. Adoption of these changes can be publicly tracked via validity data on certificates in crt.sh.
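The schedule in the table above can be expressed as a small lookup, shown here as a sketch only. The function names are hypothetical, the first transition date is approximate in the plan ("~September 15, 2025"), and the up-to-6-week saturation window is ignored. The sketch also assumes that a subscriber may request a validity shorter than the default without an exception, which the plan implies ("the defaults (or less)") but does not spell out.

```python
from datetime import date

# (issuance date on or after, default validity in days, exception maximum in days)
SCHEDULE = [
    (date(2025, 9, 15), 100, 360),  # start date is approximate in the plan
    (date(2026, 3, 15), 47, 200),
]


def validity_policy(issued_on: date):
    """Return (default_days, exception_max_days), or None before the first transition."""
    policy = None
    for effective, default_days, exception_max in SCHEDULE:
        if issued_on >= effective:
            policy = (default_days, exception_max)
    return policy


def allowed_validity(issued_on: date, requested_days=None, has_exception=False) -> int:
    """Cap a requested validity at the default, or at the exception maximum."""
    policy = validity_policy(issued_on)
    if policy is None:
        raise ValueError("issuance date predates the published schedule")
    default_days, exception_max = policy
    if requested_days is None:
        return default_days
    cap = exception_max if has_exception else default_days
    return min(requested_days, cap)
```

For example, under this sketch a 90-day request made in April 2026 would be capped at 47 days without an exception and granted in full with one.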
Assignee | ||
Comment 86•10 hours ago
Revocation Delay Status Update
- Total certificates revoked (planned to date):
  - 7,606,812 (8,158,644)
- Remaining active certificates (total affected):
  - 26,644,634 (72,070,777)
- Total certificates expired and not revoked (to date):
  - 36,896,025
- Estimate for remaining revocations:
  - We will continue to revoke certificates in batches until 11/15/2025
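For readers tracking progress across updates, the figures reported in Comments 80, 82, and 86 can be tabulated with a short script. This is only a sketch over the published numbers; the class and field names are illustrative. In each update, the revoked, remaining-active, and expired-unrevoked counts sum to 71,147,471, which is 923,306 fewer than the stated total of 72,070,777. The updates do not itemize that remainder, so the script reports it without interpreting it.

```python
from dataclasses import dataclass

TOTAL_AFFECTED = 72_070_777  # "total affected" as stated in each update


@dataclass(frozen=True)
class StatusUpdate:
    comment: int
    revoked: int
    planned: int
    remaining_active: int
    expired_unrevoked: int


UPDATES = [
    StatusUpdate(80, 6_006_812, 6_558_644, 36_079_260, 29_061_399),
    StatusUpdate(82, 6_806_812, 7_358_644, 30_930_510, 33_410_149),
    StatusUpdate(86, 7_606_812, 8_158_644, 26_644_634, 36_896_025),
]


def summarize(u: StatusUpdate) -> dict:
    """Derive progress figures from one published status update."""
    accounted = u.revoked + u.remaining_active + u.expired_unrevoked
    return {
        "comment": u.comment,
        "pct_revoked": round(100 * u.revoked / TOTAL_AFFECTED, 2),
        "behind_plan": u.planned - u.revoked,        # planned minus actual revocations
        "not_itemized": TOTAL_AFFECTED - accounted,  # remainder the updates do not break out
    }


for update in UPDATES:
    print(summarize(update))
```

Two patterns stand out from the derived figures: revocations advance by 800,000 between consecutive updates, and the gap between planned and actual revocations stays constant at 551,832.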
Assignee | ||
Comment 87•10 hours ago
Weekly Status Update
We are actively progressing on the full set of action items outlined in the incident report. We have completed action item #7.
Action Item Description | Kind | Corresponding Root Cause(s) | Evaluation Criteria | Due Date | Status |
---|---|---|---|---|---|
Revoke impacted certificates (in batches beginning 5/28/2025) | Mitigate | Root Cause 1 | Percent of impacted certificates revoked will be tracked and published monthly. Verification possible via Certificate Transparency (CT) logs and serial number disclosure via Microsoft’s CRL. | 2025-11-15 | In Progress |
Migrate cert issuance to use partitioned CRLs | Prevent | Root Cause 1 | Percentage of newly issued certificates appearing in CT logs with updated CDP endpoints pointing to partitioned CRLs. Logs and CRL URLs can be independently verified by the public. | 2025-11-15 | In Progress |
Stand up cross-signed warm standby CAs. We are currently in planning stages. We will have the plan ready before 06/14/2025 | Prevent | Root Cause 1 | Standby ICAs will be disclosed in CT logs with test certificates. Public can verify issuance and presence of standby ICAs through CT logs and Microsoft’s published CA repository. | 2025-09-30 | In Progress |
Create training and TSG Documentation to educate team on revocation expectations | Prevent | Root Cause 1 | Training completion rates will be tracked internally. Effectiveness will be evaluated through internal audits and inclusion of the training materials in external audit reviews. | 2025-07-31 | Complete |
Reduce usage of public PKI | Prevent | Root Cause 1 | Publish a monthly percentage reduction of unexpired, publicly trusted certificates issued from impacted hierarchies. Public can track progress using CT log data filtered for affected intermediates. | 2025-09-30 | In Progress |
Exercise and refine the mass revocation playbook | Prevent | Root Cause 1 | Effectiveness will be assessed through internal tracking of simulated revocation scenarios, including coverage and execution timing. The results of these exercises will inform iterative improvements to the playbook. While objective external metrics are limited, Microsoft will evaluate the impact through internal reviews and incorporate this action into relevant audit scopes. | 2025-09-01 | In Progress |
Publish a phased plan to reduce the default certificate validity period, with the long-term goal of transitioning to short-lived certificates. | Preventive | Root Cause 1 | Effectiveness will be measured by publication of the plan by 2025-08-22. Public can verify via the published plan and future CPS updates reflecting the proposed changes. | 2025-08-22 | Complete |
Begin implementation of the phased certificate lifecycle reduction plan, including updates to issuance systems and CPS. | Preventive | Root Cause 1 | Effectiveness will be measured by issuance of certificates with reduced validity periods, visible in CT logs, and updated CPS language. Public can verify through CT data and CPS version history. | TBD (based on plan milestones) | New |
Complete migration of all customers to G2 ICAs with CRL partitioning and eliminate issuance from non-partitioned ICAs. | Mitigate | Root Cause 1 | Effectiveness will be measured by the percentage of certificates issued from G2 ICAs with partitioned CRLs, visible in CT logs. Public can verify through CCADB hierarchy updates and issuance patterns in CT. Internal tracking will confirm deprecation of non-partitioned ICAs. We will provide regular updates on the burndown for G1 to G2 transition. | 2026-02-28 | In Progress |
Develop and publish a plan for regular ICA rotations to maintain operational readiness and crypto agility. | Preventive | Root Cause 1 | Effectiveness will be measured by publication of the ICA rotation plan. ICA rotation can be publicly verified through CCADB and CT logs as we execute the plan. | 2025-10-17 | In Progress |