Open Bug 1999850 Opened 4 months ago Updated 21 days ago

Microsoft PKI Services: OCSP Non-Compliance

Categories

(CA Program :: CA Certificate Compliance, task)

Tracking

(Not tracked)

ASSIGNED

People

(Reporter: CentralPKI, Assigned: CentralPKI)

Details

(Whiteboard: [ca-compliance] [ocsp-failure] Next update 2026-04-24)

Preliminary Incident Report

Summary

  • Incident description:

We are in the process of migrating our OCSP traffic to a new infrastructure. On November 10th, we discovered an issue in our new infrastructure where the OCSP responses included fractional seconds in the thisUpdate, nextUpdate and producedAt fields of the response; which caused TLS handshakes to fail for some relying parties, and is inconsistent with RFC 5280: Internet X.509 Public Key Infrastructure Certificate and Certificate Revocation List (CRL) Profile (Section 4.1.2.5.2). As a result, we have moved OCSP responder traffic back to our legacy OCSP responder infrastructure. Our legacy OCSP responder infrastructure has not been audited against requirements outlined in NCSSRs v2.0.5, and has known compliance gaps with those requirements, notably network isolation and trusted role management, though some compensating controls related to access control (JIT, MFA) exist. The new infrastructure has already undergone a practice audit against NCSSRs v2.0.5 with a WebTrust qualified consultant. We are working to resolve the issue in the new infrastructure and once resolved, we will resume the migration of the traffic back to the new infrastructure. We will provide an ETA for the fix as part of our full incident report.

Assignee: nobody → CentralPKI
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Whiteboard: [ca-compliance] [ocsp-failure]

Full Incident Report

Summary

  • CA Owner CCADB unique ID:
    A002577

  • Incident description:
    Microsoft PKI Services were in the process of migrating our OCSP traffic to a new infrastructure. On November 10th, we discovered an issue in our new infrastructure where OCSP responses included fractional seconds in the thisUpdate, nextUpdate, revocationTime and producedAt fields of the response. This caused TLS handshakes to fail for some relying parties and is inconsistent with section 4.1.2.5.2 of RFC 5280: Internet X.509 Public Key Infrastructure Certificate and Certificate Revocation List (CRL) Profile. As a result, we moved OCSP responder traffic back to our legacy OCSP responder infrastructure. Our legacy OCSP responder infrastructure has not been audited against requirements outlined in NCSSRs v2.0.5, and has known compliance gaps with those requirements. The new infrastructure has already undergone a practice audit against NCSSRs v2.0.5 with a WebTrust qualified consultant. We are working to resolve the issue in the new infrastructure and once resolved, we will resume the migration of the traffic back to the new infrastructure.

  • Timeline summary:

    • Non-compliance start date (RFC):
      2025-11-03
    • Non-compliance identified date (RFC):
      2025-11-10 13:43 PDT
    • Non- compliance end date (RFC):
      2025-11-11 01:15 PDT
    • NCSSR v2.0.5- Non-compliance Start date:
      2025-11-12 00:00 PDT
    • NCSSR v2.0.5- Non-compliance end date:
      TBD. We will publish the plan for migrating traffic back to compliant infrastructure on 2025-12-12
  • Relevant policies:

    • RFC 5280 – section 4.1.2.5.2 – “GeneralizedTime values MUST NOT include fractional seconds
    • NCSSRs v2.0.5 – Sections 1.1 and 2.1 in relation to Signing of authoritative certificate status (definition of Certificate System - bullet #8).
  • Source of incident disclosure:
    Self Reported

Impact

  • Total number of certificates: N/A

  • Total number of "remaining valid" certificates: N/A

  • Affected certificate types: N/A

  • Incident heuristic: N/A

  • Was issuance stopped in response to this incident, and why or why not?:
    Issuance was not stopped as subscriber certificates were not affected.

  • Analysis: N/A

  • Additional considerations: N/A

Timeline

  • 2025-11-03: Migration to new OCSP infrastructure started
  • 2025-11-10 13:43 PDT: Relying party reported incident related to certificate validation errors
  • 2025-11-10 18:30 PDT: Migrated all traffic back to legacy OCSP infrastructure
  • 2025-11-11 01:15 PDT: Confirmed root cause as presence of fractional seconds

Related Incidents

None

Root Cause Analysis

Contributing Factor # 1: Gaps in testing coverage for new OCSP infrastructure

  • Description: We missed adding validation for fractional seconds in our test suites as part of the development and deployment of the new infrastructure.
  • Timeline:
    • 2025-11-03: Testing of new OCSP infrastructure complete/Prod Rollout starts
    • 2025-11-10 13:43 PDT: First relying party reported incident
    • 2025-11-10 18:30 PDT: OCSP responder traffic moved back to legacy infrastructure
  • Detection:
    Self detected via comparative testing, and via Relying party incidents.
  • Interaction with other factors:
    N/A
  • Root Cause Analysis methodology used:
    5 Whys

Contributing Factor # 2: Migration of OCSP traffic to new infrastructure

  • Description:
    OCSP traffic was being migrated to new infrastructure since the existing infrastructure has known compliance gaps with NCSSR v2.0.5 (effective 2025-11-12).-
    -Timeline:
    • 2025-03: Development of new OCSP infrastructure started
    • 2025-11-03: Testing complete/Prod Rollout starts
    • 2025-11-10 13:43 PDT: First relying party reported incident
    • 2025-11-10 18:30 PDT: OCSP responder traffic moved back to legacy infrastructure

  • Detection:
    N/A

  • Interaction with other factors:
    N/A

  • Root Cause Analysis methodology used:
    5 Whys

Lessons Learned

  • What went well:
  1. Legacy infrastructure was available for fallback which allowed us to limit relying party impact.
  • What didn’t go well:
  1. Platform testing did not capture the timestamp formatting issue; RFC validation checks were incomplete.
  2. Delays in rollout of new OCSP infrastructure led to late discovery of this bug which meant that the only possible mitigation was to rely on the legacy infrastructure.
  • Where we got lucky:
  1. Issue surfaced relatively early in traffic migration, limiting relying party impact.
  • Additional: N/A

Action Items

Action Item Kind Corresponding Root Cause(s) Evaluation Criteria Due Date Status
Close testing and verification gaps. Prevent #1 Effectiveness will be evaluated through internal tracking and verification of repair action item completion during external audits. 2025-12-19 In Progress
Publish date for completing OCSP traffic migration to new infrastructure Mitigate #2 Migration will be tracked internally and verified via external audit reviews. 2025-12-19 Not started

Appendix

Weekly Status Report


We are actively progressing through all repair items identified in the incident report. No major changes at this time.

Weekly Status Update


We are actively progressing through all repair items identified in the incident report. No major changes at this time.

Weekly Status Update


We are actively progressing through all repair items identified in the incident report. No major changes at this time.

Weekly Status Update


We have completed action items #1 and #2. As part of action item #2, we are planning to migrate our OCSP services to a new infrastructure by March 6, 2026. We propose providing our next update on Friday, February 20, 2026, unless action items are completed sooner.

Action Item Kind Corresponding Root Cause(s) Evaluation Criteria Due Date Status
Close testing and verification gaps. Prevent #1 Effectiveness will be evaluated through internal tracking and verification of repair action item completion during external audits. 2025-12-19 Complete
Publish date for completing OCSP traffic migration to new infrastructure Mitigate #2 Migration will be tracked internally and verified via external audit reviews. 2025-12-19 Complete
Migrate OCSP to new infrastructure Mitigate #2 Migration will be tracked internally and verified via external audit reviews. 2026-03-06 New

Weekly Status Update


We are actively working through the full set of action items outlined in the incident report. No changes at this time. We propose providing our next update on Friday, February 20, 2026, unless action items are completed sooner.

Whiteboard: [ca-compliance] [ocsp-failure] → [ca-compliance] [ocsp-failure] Next update 2026-02-20

Weekly Status Update


We continue to work through the full set of action items outlined in the incident report. We are extending the due date for action item #3 to allow additional work to improve resiliency and ensure a durable, long-term solution. Based on learnings from other Microsoft services, we have decided to implement both an internal OCSP solution and a backup external implementation prior to roll-out. This added scope requires an updated timeline.

Action Item Kind Corresponding Root Cause(s) Evaluation Criteria Due Date Status
#1 - Close testing and verification gaps. Prevent #1 Effectiveness will be evaluated through internal tracking and verification of repair action item completion during external audits. 2025-12-19 Complete
#2 - Publish date for completing OCSP traffic migration to new infrastructure Mitigate #2 Migration will be tracked internally and verified via external audit reviews. 2025-12-19 Complete
#3 - Migrate OCSP to new infrastructure Mitigate #2 Migration will be tracked internally and verified via external audit reviews. 2026-05-19 In Progress

Weekly Status Update


We continue to work through the full set of action items outlined in the incident report. As part of the due date extension to action item #3 noted in comment 7 we would like to provide propose providing our next update on Friday, April 24, 2026, unless action items are completed sooner.

Whiteboard: [ca-compliance] [ocsp-failure] Next update 2026-02-20 → [ca-compliance] [ocsp-failure] Next update 2026-04-24
You need to log in before you can comment on or make changes to this bug.