Open Bug 2052399 Opened 2 days ago

Certainly: Expired certificates on "Valid" and "Revoked" test websites

Categories

(CA Program :: CA Certificate Compliance, task)

Tracking

(Not tracked)

UNCONFIRMED

People

(Reporter: djeffery, Unassigned)

Details

Full Incident Report

Summary

  • CA Owner CCADB unique ID: A007878

  • Incident description: On 2026-06-08, the TLS certificates on Certainly's BR §2.2 test websites expired and were not renewed. The "valid" and "revoked" test sites presented expired certificates to visitors for 23 days. The underlying cause was a combination of a breaking change in an unpinned upstream dependency used to obtain certificates for the test sites, and a misconfiguration in external monitoring that suppressed the TLS validation alert.

  • Timeline summary:

    • Non-compliance start date: 2026-06-08 (~15:03 UTC)
    • Non-compliance identified date: 2026-07-01 (~22:00 UTC)
    • Non-compliance end date: 2026-07-02 04:47 UTC
  • Relevant policies: Section 2.2 of the TLS Baseline Requirements:

    The CA SHALL host test Web pages that allow Application Software Suppliers to test their software with Subscriber Certificates that chain up to each publicly trusted Root Certificate. At a minimum, the CA SHALL host separate Web pages using Subscriber Certificates that are (i) valid, (ii) revoked, and (iii) expired.

  • Source of incident disclosure: Self Reported. Discovered during routine monthly review of crt.sh, CCADB, and external monitors of our CA infrastructure.

Impact

  • Total number of certificates: 0 (no misissuance)
  • Total number of "remaining valid" certificates: N/A
  • Affected certificate types: N/A
  • Incident heuristic: The following 6 test websites were affected: https://valid.root-e1.certainly.com, https://valid.root-r1.certainly.com, https://revoked.root-e1.certainly.com, https://revoked.root-r1.certainly.com, https://expired.root-e1.certainly.com, https://expired.root-r1.certainly.com. The "valid" and "revoked" sites were non-compliant (presenting expired certificates rather than valid and revoked certificates respectively). The "expired" sites continued to serve their intended expired certificates.
  • Was issuance stopped in response to this incident, and why or why not?: No. There was no misissuance. This incident relates solely to the availability and correctness of test websites.
  • Analysis: The test websites were returning HTTP responses throughout the outage, but presenting expired TLS certificates. Browsers rejected the connections with certificate errors, making the sites effectively unreachable for their intended purpose of allowing application software suppliers to test certificate validation behavior.
  • Additional considerations: We are aware of no further impact to the Web PKI community or our internal CA systems. Certificate issuance, revocation, and CRL services were unaffected.

Timeline

All dates and times are UTC.

2025-05-26

  • 17:01 Prior test website incident (Bug #1968836) resolved. Committed monitoring improvements completed within deadlines.

2026-05-09

  • 15:03 Last successful certificate issuance for all 6 test websites.

~2026-05-19 (estimated)

  • An upstream dependency used to obtain certificates for the test sites released a major version with breaking changes to its command-line interface. The test site system referenced this dependency without a pinned version, causing it to silently pull the incompatible release.

2026-06-08

  • ~15:03 All test website certificates expired (30-day validity from May 9 issuance).
  • No alert received. External monitoring was active but a misconfiguration ("ignore SSL errors" enabled on the "valid" site monitors) suppressed the TLS validation alert.

2026-06-25 - 2026-07-01

  • System rebooted on its daily schedule. Each restart attempted certificate issuance, failed due to the incompatible upstream CLI changes, and fell back to the cached expired certificates.

2026-07-01

  • ~22:00 Issue identified during routine monthly review of CCADB, crt.sh and external certificate monitors.
  • ~22:30 Root cause confirmed: upstream dependency CLI breaking change.
  • ~23:00 Fix prepared and tested.

2026-07-02

  • 04:37 Fix deployed via emergency change.
  • 04:45 All 6 certificates issued successfully.
  • 04:47 Sites restored and verified externally.
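
The intervals quoted in this report can be cross-checked from the timeline entries. The sketch below is a worked calculation only. It assumes validity is exactly 30 days from the 15:03 issuance and uses the approximate times given above as if they were exact.

```python
from datetime import datetime, timedelta, timezone

UTC = timezone.utc

# Timestamps copied from the timeline (approximate where the report says "~").
last_issuance = datetime(2026, 5, 9, 15, 3, tzinfo=UTC)
identified = datetime(2026, 7, 1, 22, 0, tzinfo=UTC)
fix_deployed = datetime(2026, 7, 2, 4, 37, tzinfo=UTC)
restored = datetime(2026, 7, 2, 4, 47, tzinfo=UTC)

# Assumption: validity is modelled as exactly 30 days from issuance.
expiry = last_issuance + timedelta(days=30)

time_to_detect = identified - expiry      # expiry until the monthly review
non_compliance = restored - expiry        # expiry until sites were restored
time_to_fix = fix_deployed - identified   # discovery until the fix was deployed

print("certificates expired:", expiry.isoformat())
print("time to detect:      ", time_to_detect)
print("non-compliance span: ", non_compliance)
print("discovery to deploy: ", time_to_fix)
```

Under these assumptions the expiry falls on 2026-06-08 15:03, detection took 23 days and about 7 hours, and the fix was deployed 6 hours 37 minutes after discovery, which is consistent with the "23 days" and "within 7 hours" figures used elsewhere in the report.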

Related Incidents

| Bug | Date | Description |
|---|---|---|
| #1968836 | 2025-05-28 | Certainly: Sample Websites Unavailable (49-hour outage) |

This incident is a regression affecting the same sites. The monitoring improvements committed in Bug #1968836 were completed. However, a subsequent misconfiguration in the external monitoring tool re-introduced a blind spot that allowed this failure to persist undetected.

Root Cause Analysis

Contributing Factor 1: Unpinned upstream dependency

  • Description: The system that provisions certificates for the test sites referenced an upstream dependency without a pinned version. When the upstream project released a major version with breaking changes to its command-line interface, the system silently pulled the incompatible version on the next build.
  • Detection: Not detected until manual investigation on 2026-07-01 (23 days after certificate expiration).
  • Interaction with other factors: Factor #2 (monitoring misconfiguration) and Factor #3 (resilient fallback) allowed this to persist undetected.
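
A build-time check can reject unpinned references before they reach production. The report does not name the dependency or say how it is referenced, so the sketch below assumes an image-style `name:tag` reference. The function name, the list of floating tags, and the example references are all hypothetical.

```python
import re

# Tags that float to whatever upstream published most recently (illustrative list).
FLOATING_TAGS = {"latest", "stable", "edge", "main", "master"}

# Tags that start with a numeric version, e.g. "2", "v2", "2.11", "2.11.3-alpine".
VERSION_TAG = re.compile(r"^v?\d+(\.\d+){0,2}([-+.][0-9A-Za-z.-]+)?$")


def is_pinned(reference: str) -> bool:
    """Return True if a 'name:tag' or 'name@sha256:...' reference is pinned."""
    if "@sha256:" in reference:
        return True  # digest references are immutable
    # Only look after the last slash so a registry port such as
    # "registry.example:5000/tool" is not mistaken for a tag.
    name_part = reference.rsplit("/", 1)[-1]
    if ":" not in name_part:
        return False  # no tag at all resolves to a floating default
    tag = name_part.rsplit(":", 1)[1]
    if tag in FLOATING_TAGS:
        return False
    return bool(VERSION_TAG.match(tag))


# Hypothetical references, for illustration only.
for ref in ["example/acme-client", "example/acme-client:latest", "example/acme-client:2"]:
    print(ref, "->", "pinned" if is_pinned(ref) else "UNPINNED")
```

A major-version tag such as `:2` still moves with minor releases, which matches the "at least major version pinning" trade-off described in this report. A digest pin is stricter but needs a separate process for taking security updates.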

Contributing Factor 2: External monitoring misconfiguration

  • Description: External monitoring was active on all 6 test websites throughout the incident. However, the "ignore SSL errors" setting was enabled on the monitors for the "valid" sites. This setting is correct for the "revoked" and "expired" site monitors, but incorrect for the "valid" sites. This suppressed the alert that should have fired when those certificates expired. It is unclear when this misconfiguration was introduced.
  • Detection: Discovered during incident investigation when reviewing monitor configurations.
  • Interaction with other factors: Factor #1 caused the failure; Factor #2 prevented detection.
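
The rule described above (TLS errors may be ignored only on the "revoked" and "expired" monitors) is simple enough to check mechanically. The following is a minimal sketch of such an audit. The `Monitor` record and its field names are assumptions made for illustration and do not reflect the actual monitoring tool's configuration format.

```python
from dataclasses import dataclass

# Expected value of the "ignore SSL errors" setting per site type,
# following the description in Contributing Factor 2.
IGNORE_TLS_ERRORS_EXPECTED = {
    "valid": False,    # a TLS error here is a real failure and must alert
    "revoked": True,   # validation failure is the intended behaviour
    "expired": True,   # validation failure is the intended behaviour
}


@dataclass(frozen=True)
class Monitor:
    """Hypothetical representation of one external monitor."""
    url: str
    site_type: str
    ignore_tls_errors: bool


def misconfigured(monitors):
    """Return the monitors whose TLS setting does not match their site type."""
    return [
        m for m in monitors
        if m.ignore_tls_errors != IGNORE_TLS_ERRORS_EXPECTED[m.site_type]
    ]


monitors = [
    Monitor("https://valid.root-e1.certainly.com", "valid", True),  # the reported misconfiguration
    Monitor("https://revoked.root-e1.certainly.com", "revoked", True),
    Monitor("https://expired.root-e1.certainly.com", "expired", True),
]

for m in misconfigured(monitors):
    print("misconfigured monitor:", m.url)
```

Run against a version-controlled list of monitors, a check like this would flag the "valid" monitor as soon as the setting is flipped, rather than when a certificate next expires.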

Contributing Factor 3: Resilient fallback masked persistent failure

  • Description: The test site system was improved in 2025 to survive transient certificate issuance failures by caching previously issued certificates and deploying them as fallback. The system also restarts daily and retries automatically. These changes ensure the sites remain online during brief outages — but when combined with the monitoring misconfiguration (Factor #2), they converted what would previously have been a visible site outage into a silent certificate expiration.
  • Detection: Error messages were logged on every system restart but were not surfaced to any alerting system.
  • Interaction with other factors: Factor #1 caused issuance to fail; Factor #3 kept the sites online with stale certificates; Factor #2 prevented detection of the expired TLS state. Previously, Factor #1 alone would have triggered a site-down alert, but Factors #2 and #3 together made the failure invisible to automated monitoring.
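
One way to keep the resilience benefit without hiding a persistent failure is to allow the cached fallback only while the cached certificate is still valid, and to raise an error otherwise. The sketch below illustrates that idea. It is not the actual provisioning code; `choose_certificate`, `StaleFallbackError`, and the `issue` callable are hypothetical names.

```python
from datetime import datetime, timezone


class StaleFallbackError(RuntimeError):
    """Issuance failed and the cached certificate has already expired."""


def choose_certificate(issue, cached_not_after, now):
    """Prefer a fresh certificate; use the cache only while it is still valid.

    issue: hypothetical callable that returns the new certificate's expiry
           (a datetime) or raises if issuance fails.
    cached_not_after: expiry of the cached certificate, or None if no cache.
    Returns a (source, not_after) tuple where source is "issued" or "cached".
    """
    try:
        return "issued", issue()
    except Exception as exc:
        if cached_not_after is not None and cached_not_after > now:
            # Transient failure: the cached certificate still covers the gap.
            return "cached", cached_not_after
        # Persistent failure: surface it instead of silently serving a stale cert.
        raise StaleFallbackError(f"issuance failed and cache is expired: {exc}") from exc


def failing_issue():
    # Stand-in for an issuance step broken by an incompatible CLI change.
    raise RuntimeError("unrecognized command-line arguments")


cached_expiry = datetime(2026, 6, 8, 15, 3, tzinfo=timezone.utc)

# Before expiry the fallback is harmless.
print(choose_certificate(failing_issue, cached_expiry, datetime(2026, 6, 1, tzinfo=timezone.utc)))

# After expiry the same failure becomes an error that alerting can pick up.
try:
    choose_certificate(failing_issue, cached_expiry, datetime(2026, 6, 25, tzinfo=timezone.utc))
except StaleFallbackError as err:
    print("ALERT:", err)
```

Whether the site should still serve the stale certificate after raising is a separate design choice. The point of the sketch is only that the failure reaches an alerting path rather than a log file.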

Root Cause Analysis methodology used: 5 Whys

Lessons Learned

  • What went well:

    • Proactive monthly review of CCADB, crt.sh and external monitors detected the issue independently of primary alerting — this redundant check worked as intended.
    • Once identified, root cause was isolated quickly from system logs.
    • Fix was developed, tested, and deployed within 7 hours of discovery at night, before a holiday.
  • What didn't go well:

    • External monitoring was present but misconfigured — suppressed TLS validation prevented the alert that should have detected this within minutes.
    • Resilience improvements that keep sites online during transient failures inadvertently masked a persistent failure when combined with the monitoring misconfiguration.
    • The 23-day detection time is unacceptable for a BR-mandated service.
    • Aggressive security patching should have been tempered with at least major-version pinning on a compliance-critical system such as this one.
  • Where we got lucky:

    • The issue did not affect certificate issuance, revocation, or any security-critical service.
    • Monthly external monitoring review caught this before a third party reported it.
  • Additional:

    • This is the second test website incident in 13 months (Bug #1968836 was a 49-hour outage in May 2025). While the prior incident's committed improvements were completed, this regression demonstrates that the monitoring configuration for these sites requires a more robust approach. We are accelerating our migration away from the existing third-party monitoring service to an internal, configuration-as-code monitoring framework where all settings are more easily version-controlled and peer-reviewed.

Action Items

| Action Item | Kind | Corresponding Root Cause(s) | Evaluation Criteria | Due Date | Status |
|---|---|---|---|---|---|
| Pin upstream dependency to specific major version | Prevent | Contributing Factor #1 | Dependency references explicit version; no unpinned tags in build | 2026-07-02 | Completed |
| Fix monitoring SSL validation setting on "valid" site monitors | Detect | Contributing Factor #2 | TLS validation enabled; confirmed alert fires on expired certificate | 2026-07-01 | Completed |
| Audit all external monitor configurations | Detect | Contributing Factor #2 | Each monitor's settings documented and verified correct for its site type | 2026-07-08 | Partially Complete |
| Complete migration to configuration-as-code monitoring; decommission legacy tool | Detect | Contributing Factor #2 | All test site monitoring defined in version-controlled configuration; legacy tool decommissioned | 2026-07-31 | Not Started |
| Add certificate expiry pre-alerting | Detect | Contributing Factor #3 | Alert when served certificate is approaching expiry | 2026-07-31 | Not Started |
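
The expiry pre-alerting item could be prototyped with Python's standard `ssl` module. The sketch below is one possible shape, not the planned implementation. The 10-day threshold and the function names are assumptions; for 30-day certificates the threshold would need to be tuned against the renewal schedule.

```python
import socket
import ssl


def fetch_not_after(host: str, port: int = 443, timeout: float = 10.0) -> str:
    """Return the served certificate's notAfter string, with validation enabled.

    If the certificate is already expired or otherwise invalid, the handshake
    raises ssl.SSLCertVerificationError, which should itself be treated as an alert.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_remaining(not_after: str, now: float) -> float:
    """Days between `now` (seconds since the epoch) and the notAfter timestamp."""
    expires_at = ssl.cert_time_to_seconds(not_after)
    return (expires_at - now) / 86400.0


def should_pre_alert(not_after: str, now: float, threshold_days: float = 10.0) -> bool:
    """True when the certificate expires in fewer than threshold_days (assumed value)."""
    return days_remaining(not_after, now) < threshold_days


# Offline illustration using the expiry time from this incident.
not_after = "Jun  8 15:03:00 2026 GMT"
now = ssl.cert_time_to_seconds("May 30 15:03:00 2026 GMT")
print(days_remaining(not_after, now), should_pre_alert(not_after, now))
```

In a live check, `fetch_not_after("valid.root-e1.certainly.com")` would supply the `not_after` value and `time.time()` the current time. This applies to the "valid" sites only, since the "revoked" and "expired" sites are expected to fail validation.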