Closed Bug 1947207 Opened 1 year ago Closed 1 year ago

FNMT: Incorrect publication of information for Test Website - Valid

Categories

(CA Program :: CA Certificate Compliance, task)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: amaya.espinosa, Assigned: amaya.espinosa)

Details

(Whiteboard: [ca-compliance] [policy-failure])

Preliminary Incident Report

The FNMT has received a notification regarding the incorrect publication of information for Test Website - Valid. The certificate served at the “Test Website - Valid” URL listed in the CCADB for AC RAIZ FNMT-RCM SERVIDORES SEGUROS appears to be expired.
We have diagnosed and solved the problem, and a full report with the findings and the corrective actions implemented will be issued in the next few days.

Assignee: nobody → amaya.espinosa
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Whiteboard: [ca-compliance] [policy-failure]

Automated monitoring, as suggested in https://bugzilla.mozilla.org/show_bug.cgi?id=1925239#c2, can permanently prevent this type of problem.

Thank you for your suggestion.

We agree that implementing monitoring can effectively prevent these types of issues in the future. We have reviewed and enhanced our monitoring systems to ensure better coverage.

Incident Report

Summary

The FNMT has received a notification regarding the incorrect publication of information for the 'Test Website - Valid'. On review, we found that the 'Test Website - Valid' URLs https://testactivetipo1.cert.fnmt.es and https://testactivetipo2.cert.fnmt.es were protected by expired certificates. This constitutes non-compliance with section 2.2, Publication of information, of the CA/Browser Forum TLS BR.

Impact

Only two certificates were affected: the Extended Validation (EV) and Organization Validation (OV) test website certificates. Because the incident had no impact on the PKI infrastructure, there was no need to stop issuance. The affected certificates are listed in the Appendix.

Timeline

All times are UTC.

2025-01-15:

  • 07:16: New valid test certificates were issued because the existing test certificates were due to expire on January 26.
  • 07:43: The new valid test certificates were installed for the URLs testactivetipo1.cert.fnmt.es and testactivetipo2.cert.fnmt.es, but only on the active web node.

2025-02-04:

  • 17:17: Due to a bug, the TLS termination web cluster balanced to a passive node that had not been updated. As a result, the test websites began serving the old, expired certificates.

2025-02-07:

  • 17:25: Compliance staff processed an email, received through the incident mailbox, reporting an error at the URL testactivetipo1.cert.fnmt.es, the valid test website: the web certificate appeared to be expired.
  • 17:35: Compliance staff reviewed the test websites to confirm the non-compliance and discovered that the URL testactivetipo2.cert.fnmt.es was also affected.
  • 17:40: Compliance staff alerted the technical area.
  • 18:10: Technical staff checked the infrastructure and updated the non-compliant node.
  • 18:36: The cluster was rebalanced.

2025-02-08:

  • 18:17: A reply was sent to the email received the day before.

2025-02-10:

  • 13:40: A monitoring system was deployed to detect future issues.
  • 15:56: A preliminary incident report was posted on Bugzilla.

Root Cause Analysis

Technical staff were following a renewal procedure to update the certificates, so we were confident that the change had gone well. However, when we reviewed the procedure after the incident, we realized that several steps needed improvement. Under normal conditions the renewal would not have caused a problem, because the certificates were installed on the active node. The procedure should nevertheless ensure that certificates are updated on both nodes, so that a failover does not expose an outdated certificate.

Furthermore, although the website certificates were issued in time, we cannot rely solely on human technical expertise. Several websites are already monitored by an external probe, but we need to verify that all relevant websites are monitored. A monitoring system covering the test websites would have been needed to detect this non-compliance.

To summarize, we detected two main root causes:

  • The procedure did not cover all the steps for certificate renewal in the infrastructure.
  • Lack of a monitoring system on the test websites.
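
As an illustration of the second root cause, the kind of expiry check that a test website monitor could run is sketched below in Python. This is a minimal sketch and not the monitoring system the FNMT deployed: the function names, the 14-day warning threshold, and the assumption that the monitoring host's trust store includes the issuing root are all hypothetical choices made for the example.

```python
import socket
import ssl
from datetime import datetime, timezone
from typing import Optional


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days from `now` until a certificate's notAfter time.

    `not_after` uses the format returned by SSLSocket.getpeercert(),
    for example "Jan 26 00:00:00 2025 GMT". Negative means expired.
    """
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).total_seconds() / 86400


def classify(host: str, days_left: float, warn_days: int = 14) -> str:
    """Turn the remaining lifetime into an OK / WARN / FAIL status line."""
    if days_left < 0:
        return f"FAIL {host}: certificate expired {-days_left:.1f} days ago"
    if days_left < warn_days:
        return f"WARN {host}: certificate expires in {days_left:.1f} days"
    return f"OK {host}: certificate valid for {days_left:.1f} more days"


def check_site(host: str, port: int = 443, warn_days: int = 14,
               now: Optional[datetime] = None) -> str:
    """Connect to a site the way a browser would and report its status.

    Assumption: the default trust store on the monitoring host trusts
    the issuing root, so a verification failure points at the site.
    """
    now = now or datetime.now(timezone.utc)
    context = ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=10) as sock:
            with context.wrap_socket(sock, server_hostname=host) as tls:
                not_after = tls.getpeercert()["notAfter"]
    except ssl.SSLCertVerificationError as exc:
        # An expired certificate fails the handshake before we can read it.
        return f"FAIL {host}: {exc.verify_message}"
    except OSError as exc:
        return f"FAIL {host}: {exc}"
    return classify(host, days_until_expiry(not_after, now), warn_days)
```

A scheduler could call `check_site("testactivetipo1.cert.fnmt.es")` and `check_site("testactivetipo2.cert.fnmt.es")` at a fixed interval and alert on any line that does not start with `OK`. Because the check goes through the public URL, it sees whichever node the cluster is currently balancing to.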

Lessons Learned

What went well

  • No other test sites were impacted.
  • New certificates had already been issued, so we could update the website certificates easily.
  • Good communication between the compliance team and technical staff.
  • Dedication of technical personnel to resolve the issue as quickly as possible.

What didn't go well

  • Lack of a monitoring system on the test sites.

Where we got lucky

  • We were notified of the issue just three days after the non-compliance began.

Action Items

| Action Item | Kind | Due Date |
|---|---|---|
| Implement test site monitoring systems to detect possible future problems | Detect | Done |
| Revise the procedure for certificate renewal in the infrastructure to include all steps | Prevent | Done |
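
One step that a revised renewal procedure could include is a per-node check that every node in the cluster, active or passive, serves the newly installed certificate. The Python sketch below is a hedged illustration of that idea and not the FNMT's actual procedure: the node addresses, the function names, and the assumption that each node can be reached directly (bypassing the load balancer) are hypothetical.

```python
import hashlib
import socket
import ssl


def node_fingerprint(node_addr: str, sni_host: str, port: int = 443,
                     timeout: float = 10.0) -> str:
    """SHA-256 fingerprint of the certificate one node serves for `sni_host`.

    Validation is disabled on purpose: the goal is to read whatever
    certificate the node presents, even if it has already expired.
    """
    context = ssl.create_default_context()
    context.check_hostname = False       # must be cleared before CERT_NONE
    context.verify_mode = ssl.CERT_NONE
    with socket.create_connection((node_addr, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=sni_host) as tls:
            der = tls.getpeercert(binary_form=True)
    return hashlib.sha256(der).hexdigest().upper()


def stale_nodes(fingerprints: dict[str, str], expected: str) -> list[str]:
    """Names of nodes whose fingerprint differs from the expected one."""
    expected = expected.upper()
    return sorted(
        node for node, fp in fingerprints.items() if fp.upper() != expected
    )
```

In use, the operator would collect `node_fingerprint(addr, "testactivetipo1.cert.fnmt.es")` for each node address, pass the results to `stale_nodes` together with the SHA-256 fingerprint of the newly issued certificate, and treat the renewal as complete only when the returned list is empty.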

Appendix

These are the two test websites affected by the certificate expiration:

| Root CA | Type | URL |
|---|---|---|
| AC Servidores Seguros | EV | https://testactivetipo1.cert.fnmt.es |
| AC Servidores Seguros | OV | https://testactivetipo2.cert.fnmt.es |

Details of affected certificates

https://crt.sh/?sha256=[8EF3F2979C667612A93E0D90FC6ADD31E9EE3FCA7E7FEACACF1F1F42CF3FEB50]
https://crt.sh/?sha256=[3C9A084FFD7BF0C27D9A6AA19BBAEB848773210BD328A8C7FBA9360E191A5A0B]

Sorry, I have updated the details of affected certificates. I am fixing a typo in the URL of the certificates by removing the [.

Details of affected certificates

https://crt.sh/?sha256=8EF3F2979C667612A93E0D90FC6ADD31E9EE3FCA7E7FEACACF1F1F42CF3FEB50
https://crt.sh/?sha256=3C9A084FFD7BF0C27D9A6AA19BBAEB848773210BD328A8C7FBA9360E191A5A0B

Report Closure Summary

  • Incident description: The valid test websites https://testactivetipo1.cert.fnmt.es and https://testactivetipo2.cert.fnmt.es were protected by two expired certificates, which constitutes non-compliance with section 2.2, Publication of information, of the CA/Browser Forum TLS BR.
    Although new valid test certificates were installed in time, the update was only performed on the active node of the web cluster. Due to a bug, the cluster balanced to a passive node and, as a result, the test websites began serving the expired certificates.

  • Incident Root Cause(s): The root causes of this incident were:

    • The procedure for certificate renewal did not cover all necessary steps: the certificates were updated only on the active node, so the passive node kept the old ones.
    • Inadequate monitoring and detection: the lack of a monitoring system on the test sites prevented technical staff from identifying the issue proactively.
  • Remediation description: The following actions were taken to address the incident:

    • Updated the non-compliant node and rebalanced the cluster.
    • Implemented a monitoring system to detect non-compliance on test websites.
    • Revised the procedure for certificate renewal in the infrastructure to include all necessary steps.
  • Commitment summary: The FNMT commits to:

    • Review and improve our procedures to ensure they cover all necessary steps.
    • Enhance monitoring capabilities to prevent similar incidents from occurring in the future.

All Action Items disclosed in this report have been completed as described, and we request its closure.

I intend to close this on Friday, 28-Feb-2025, unless there are issues or questions to discuss.

Flags: needinfo?(bwilson)
Status: ASSIGNED → RESOLVED
Closed: 1 year ago
Flags: needinfo?(bwilson)
Resolution: --- → FIXED