Open Bug 1963778 Opened 19 days ago Updated 4 days ago

FNMT: CP/CPS, Revocation Requests Mechanism, Certificate Problem Report, CRL and OCSP disruption

Categories

(CA Program :: CA Certificate Compliance, task)

Tracking

(Not tracked)

ASSIGNED

People

(Reporter: amaya.espinosa, Assigned: amaya.espinosa)

Details

(Keywords: spain, Whiteboard: [ca-compliance] [policy-failure])

Preliminary Incident Report

Summary

Incident description:
On Monday, April 28th, 2025, Spain experienced a nationwide blackout that caused a complete and sustained loss of electrical power across the entire country. Among other consequences, the incident led to a disruption in communications and widespread chaos in the transportation system for several hours.
As part of FNMT’s contingency plan, services were redirected to the backup Data Processing Center (DPC). However, during the transfer process, a failure occurred that prevented the successful restoration of services from the backup DPC. Our infrastructure experienced a disruption, which resulted in the unavailability of services from April 28th at 2:30 PM until April 29th at 8:20 PM, when all services were finally restored.

Relevant policies:
This incident had implications regarding compliance with several requirements defined in the Baseline Requirements for the Issuance and Management of Publicly Trusted TLS Server Certificates v 2.1.4. In particular:

  • Section 2.2 – Publication of information:
    The CA SHALL publicly disclose its Certificate Policy and/or Certification Practice Statement through an appropriate and readily accessible online means that is available on a 24x7 basis.
  • Section 4.10.2 – Service availability:
    The CA SHALL maintain an online 24x7 Repository that application software can use to automatically check the current status of all unexpired Certificates issued by the CA.
  • Section 4.9.7 – CRL issuance frequency:
    CRLs MUST be available via a publicly-accessible HTTP URL.
  • Section 4.9.3 – Procedure for revocation request:
    The CA SHALL maintain a continuous 24x7 ability to accept and respond to revocation requests and Certificate Problem Reports.

Source of incident disclosure: Self Reported

Assignee: nobody → amaya.espinosa
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Whiteboard: [ca-compliance] [policy-failure]

There are a dozen or more CAs located in Spain and Portugal that are probably looking at filing similar incident reports. Is there guidance from Bugzilla for CAs reporting such a region-wide, everything-out event?

Hi Amaya - In your full incident report, I'd be interested in hearing where your failover systems reside. Are they still in Spain? Also, can you talk about how often you do a DR and BCP exercise? I would be interested in reading about those and whether they prepared you for this incident.

Hi Jeremy,
Thank you for your questions — we will take them into account in the full incident report.
I’d like to mention that we are certified under the ISO 22301 standard for Business Continuity Management. As part of this framework, we conduct annual Business Continuity Plan (BCP) and Disaster Recovery (DR) exercises that simulate a variety of failure scenarios, including infrastructure outages and failover between the primary and backup sites.

Full Incident Report

Summary

  • CA Owner CCADB unique ID: A000274
  • Incident description:
    On Monday, April 28th, 2025, a power system outage occurred across the entire Spanish mainland. As a result of the severity of the outage, the normal functioning of infrastructures, communications, roads, trains, airports, schools, hospitals, and other services was disrupted.
    As part of FNMT's contingency plan, services were redirected to the backup data center. However, following the switchover process, an incident occurred that prevented the full recovery of services from the backup data center and their restoration at the main data center. Both facilities are located in Spain.
    In our preliminary report, we indicated different start and end times. Further investigation has shown that the incident caused service unavailability from April 28, 2025 at 12:06 (UTC) until April 29, 2025 at 18:01 (UTC).
  • Timeline summary:
    • Non-compliance start date: April 28, 2025 at 12:06 (UTC)
    • Non-compliance identified date: April 28, 2025 at 12:23 (UTC)
    • Non-compliance end date: April 29, 2025 at 18:01 (UTC)
  • Relevant policies:
    This incident had implications regarding compliance with several requirements defined in the Baseline Requirements for the Issuance and Management of Publicly Trusted TLS Server Certificates v2.1.4 (an illustrative availability check for the affected endpoints follows this summary). In particular:
    • Section 2.2 – Publication of information: The CA SHALL publicly disclose its Certificate Policy and/or Certification Practice Statement through an appropriate and readily accessible online means that is available on a 24x7 basis.
    • Section 4.10.2 – Service availability: The CA SHALL maintain an online 24x7 Repository that application software can use to automatically check the current status of all unexpired Certificates issued by the CA.
    • Section 4.9.7 – CRL issuance frequency: CRLs MUST be available via a publicly-accessible HTTP URL.
    • Section 4.9.3 – Procedure for revocation request: The CA SHALL maintain a continuous 24x7 ability to accept and respond to revocation requests and Certificate Problem Reports.
  • Source of incident disclosure: Self Reported
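
The requirements listed above share a single operational obligation: the CP/CPS repository, the CRL distribution point, and the OCSP responder must remain reachable from the public internet on a 24x7 basis. As a purely illustrative sketch (the URLs below are placeholders rather than FNMT's actual endpoints, and a plain HTTP request only checks reachability, not the validity of the CRL or OCSP responses), an external availability probe for endpoints of this kind could look like the following:

```python
# Illustrative external availability probe; URLs are placeholders, not FNMT endpoints.
# A plain HTTP request only confirms reachability; a full check would also validate
# CRL freshness and submit a well-formed OCSP request.
import urllib.error
import urllib.request

ENDPOINTS = {
    "CP/CPS repository": "https://repository.example-ca.example/cps.pdf",
    "CRL distribution": "http://crl.example-ca.example/ca.crl",
    "OCSP responder": "http://ocsp.example-ca.example/",
}

def probe(name: str, url: str, timeout: float = 10.0) -> bool:
    """Return True if the endpoint answers an HTTP request within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status = resp.status
    except urllib.error.HTTPError as exc:
        status = exc.code            # the server answered, just with an error status
    except Exception as exc:         # DNS failure, timeout, connection refused, ...
        print(f"{name}: UNREACHABLE ({exc})")
        return False
    print(f"{name}: HTTP {status}")
    return True

if __name__ == "__main__":
    results = [probe(name, url) for name, url in ENDPOINTS.items()]
    # Any sustained failure here corresponds to the unavailability described in this report.
    raise SystemExit(0 if all(results) else 1)
```

Run from a vantage point outside the CA's own network, a probe of this kind would flag an outage window such as the one described below within one polling interval.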

Impact

Timeline

All times are in UTC.
April 28, 2025:

  • 10:33 - A general power outage occurs, affecting all of Spain and triggering the alternative power supply systems at FNMT's data center facilities (main and backup), both of which are located in Spain.
  • 11:30 - It is decided to transfer the services to the backup data center, as it is estimated to have more power supply autonomy. Based on the information available at the time, the outage could have lasted several days.
  • 11:45 - The transfer of the OCSP validation services is completed. The transfer of the remaining services continues.
  • 12:06 - All services are running in the backup data center. Initially, the backup systems operate correctly, maintaining service availability.
  • 12:23 - The monitoring systems begin to warn that the services are not stable. Failures are detected in DNS queries to FNMT domains, causing instability in the effective provision of the OCSP validation service. Work begins on analysing the situation to detect and correct the possible failure.
  • 14:30 - The rest of the systems that are still running are shut down in an orderly manner.
  • 16:00 - Observed traffic and service demand at the backup data center is minimal. At this time, there are severe difficulties across the country in telecommunications, transportation systems and essential sectors such as emergency services.
  • 17:00 - The shutdown is completed. The OCSP validation service continues to operate from the backup data center, although with some instability.
  • Approximately 22:00 - The power supply to the main FNMT data center is restored.

April 29, 2025:

  • 05:00 - It is confirmed that services at the backup data center are not being provided properly; service demand is not being adequately met.
  • 05:10 - With the power supply restored and stable, work begins again on restoring services to the main data center. The systems are started up and it is detected that it is not possible to transfer services back from the backup data center.
  • 05:30 - Work begins on analysing the situation to determine why the transfer of services is not possible.
  • 08:30 - We confirm that the root cause is that the communications equipment of the backup data center's service provider is not operational. We notify the supplier and request resolution.
  • 13:30 - The provider finally identifies the cause of the issue: the communications equipment did not recover automatically and has to be restarted manually.
  • 15:44 - Connectivity to the backup data center is restored. This allows the service recovery process to begin.
  • 16:34 - The processes required for the provision of services are fully restored. Nevertheless, unstable behaviour is observed in the load balancing layer, primarily affecting DNS and service publishing.
  • 16:45 - After the load balancing layer issue is solved, the previously interrupted or failing communications begin functioning correctly and stably again, including the DNS services.
  • 17:10 - The CRL publishing service becomes operational.
  • 17:15 - It is confirmed that the OCSP validation service is being provided correctly.
  • 18:01 - FNMT's websites are restored. All services have been switched back to the main data center and the CP/CPS site is working properly again.

April 30, 2025:

  • Investigation of all events begins

May 1, 2025:

  • 08:21 - A preliminary incident report was posted on Bugzilla.

May 12, 2025:

  • A full incident report has been submitted to Bugzilla.

Related Incidents

| Bug | Date | Description |
| --- | --- | --- |
| 1909203 | 2024-07-22 | The subject bug is related to the referenced bug in that it affected the same set of services (CP/CPS availability, responses to certificate revocation requests and certificate problem reports, CRL distribution, and OCSP validation); however, the root cause was different in each case. |

Root Cause Analysis

Contributing Factor 1: Limited Backup Power Autonomy After Grid Outage

  • Description:
    A general power outage affected the main data center. Although the generator set activated correctly, given the initial estimate of the time without external power, the operations team decided to transfer loads to the backup data center, which offered greater autonomy.
  • Timeline:
    • 10:33 UTC: Total power outage is detected in the main data center.
    • 10:34 UTC: Generators start operating.
    • 11:30 UTC: Remaining autonomy is evaluated and it is decided to start the progressive transfer of services to the backup data center, which at that moment offers more power autonomy.
    • 11:45 UTC: OCSP validation service transfer is completed.
    • 12:06 UTC: All services are running in the backup data center.
  • Detection:
    Self Detected. Internal alarms on power loss and generator consumption. In addition, external confirmation of power outage is received nationwide.
  • Interaction with other factors:
    More autonomy would have allowed the services to remain active in the main data center.
  • Root Cause Analysis methodology used: Drilling Down

Contributing Factor 2: Communication problems between the main DC and the backup DC

  • Description:
    A communications misconfiguration between the main data center and the backup data center: the supplier's communications equipment, physically located in the backup data center, needed to be reconfigured. The equipment did not recover automatically after power restoration and had to be activated manually.
  • Timeline:
    April 29th, 2025:
    • 05:00 - It is confirmed that services at the backup data center are not being provided properly; service demand is not being adequately met.
    • 05:10 - With the power supply restored and stable, work begins again on restoring services to the main data center. The systems are started up and it is detected that it is not possible to transfer services back from the backup data center.
    • 05:30 - Work begins on analysing the situation to determine why the transfer of services is not possible.
    • 08:30 - We confirm that the root cause is that the communications equipment of the backup data center's service provider is not operational. We notify the supplier and request resolution.
    • 13:30 - The provider finally identifies the cause of the issue: the communications equipment did not recover automatically and has to be restarted manually.
    • 15:44 - Connectivity to the backup data center is restored. This allows the service recovery process to begin.
  • Detection: Self Reported.
  • Interaction with other factors:
    It prevented the transfer of services between the main and backup data centers. It also prevented administration of the backup data center from the FNMT facilities, requiring personnel to travel to the physical location of the backup data center.
  • Root Cause Analysis methodology used: Drilling Down

Lessons Learned

  • What went well:
    Our suppliers' technical teams responded quickly and effectively, deploying personnel to help resolve the incident.
  • What didn’t go well:
    • The limited remote management capability of the backup data center delayed the resolution of the communications problem.
    • The lack of a cloud-hosted DNS service, which would have kept name resolution operational in the event of corporate DNS failures (see the sketch after this list).
  • Where we got lucky:
    N/A. We did not identify any aspect in which we got lucky.
  • Additional: N/A
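
Regarding the second lesson above, one lightweight way to distinguish a failure of the corporate DNS layer from a broader resolution problem is to resolve the CA's public names both through the corporate resolvers and through an independent resolver outside the CA's own infrastructure. The sketch below is illustrative only: it assumes the third-party dnspython library, and the domain name and resolver addresses are placeholders rather than real FNMT values.

```python
# Illustrative DNS health comparison; requires the third-party "dnspython" package.
# The domain and resolver addresses are placeholders, not FNMT's real infrastructure.
import dns.resolver

DOMAIN = "www.example-ca.example"
RESOLVERS = {
    "corporate-1": "10.0.0.53",   # placeholder internal resolver
    "corporate-2": "10.0.1.53",   # placeholder internal resolver
    "external": "9.9.9.9",        # any resolver outside the CA's infrastructure
}

def resolve_via(nameserver: str) -> list[str]:
    """Resolve DOMAIN through a single, explicitly chosen nameserver."""
    resolver = dns.resolver.Resolver(configure=False)
    resolver.nameservers = [nameserver]
    resolver.lifetime = 5.0
    answer = resolver.resolve(DOMAIN, "A")
    return sorted(rr.address for rr in answer)

if __name__ == "__main__":
    for label, ip in RESOLVERS.items():
        try:
            print(f"{label} ({ip}): {resolve_via(ip)}")
        except Exception as exc:
            print(f"{label} ({ip}): FAILED ({exc})")
```

If the corporate resolvers fail while the external path still answers, the problem is confined to the internal DNS layer; if both fail, the published names are unreachable from the outside, which is the scenario a cloud-hosted DNS deployment is intended to prevent.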

Action Items

| Action Item | Kind | Corresponding Root Cause(s) | Evaluation Criteria | Due Date | Status |
| --- | --- | --- | --- | --- | --- |
| Review of the level of autonomy of generators and supply plan | Mitigate | Root Cause #1 | Guaranteed minimum capacity of more than 12 hours without external power supply | To be confirmed | Ongoing |
| Cloud DNS deployment to improve service resiliency | Prevent | Root Cause #1 | Operational, tested and in production | 2025/07/31 | Ongoing |
| Provision of a remote administration system for the backup data center from an alternative location to the FNMT main infrastructure | Mitigate | Root Cause #2 | System accessible, periodic connectivity testing (see the sketch below) | 2025/10/31 | Ongoing |
| Business Continuity Plan and Disaster Recovery Plan review and update | Prevent | Continuous improvement | Periodic exercises and audits | 2025/06/30 | Ongoing |
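
The evaluation criterion for the third action item calls for periodic connectivity testing of the remote administration path. A minimal sketch of such a check, assuming nothing more than TCP reachability of hypothetical management endpoints (the hostnames, ports and interval below are placeholders, not FNMT's actual tooling), might look like this:

```python
# Illustrative periodic connectivity check for an out-of-band management path.
# Hostnames, ports and the polling interval are placeholders; real monitoring
# would feed these results into an alerting system rather than print them.
import socket
import time

TARGETS = [
    ("mgmt-backup-dc.example.internal", 22),      # e.g. SSH jump host at the backup DC
    ("console-backup-dc.example.internal", 443),  # e.g. out-of-band console
]
INTERVAL_SECONDS = 300  # check every 5 minutes

def tcp_reachable(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def run_once() -> None:
    timestamp = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
    for host, port in TARGETS:
        state = "OK" if tcp_reachable(host, port) else "FAILED"
        print(f"{timestamp} {host}:{port} {state}")

if __name__ == "__main__":
    while True:
        run_once()
        time.sleep(INTERVAL_SECONDS)
```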

Appendix

N/A. This incident has not led to any misissued certificates.

Keywords: spain