Certainly: Expired certificates on "Valid" and "Revoked" test websites
Category: CA Program :: CA Certificate Compliance (task)
Tracking: Not tracked
Reporter: djeffery
Assignee: Unassigned
Full Incident Report
Summary
- CA Owner CCADB unique ID: A007878
- Incident description: On 2026-06-08, the TLS certificates on Certainly's BR §2.2 test websites expired and were not renewed. The "valid" and "revoked" test sites presented expired certificates to visitors for 23 days. The underlying cause was a combination of a breaking change in an unpinned upstream dependency used to obtain certificates for the test sites, and a misconfiguration in external monitoring that suppressed the TLS validation alert.
- Timeline summary:
  - Non-compliance start date: 2026-06-08 (~15:03 UTC)
  - Non-compliance identified date: 2026-07-01 (~22:00 UTC)
  - Non-compliance end date: 2026-07-02 04:47 UTC
- Relevant policies: Section 2.2 of the TLS Baseline Requirements: "The CA SHALL host test Web pages that allow Application Software Suppliers to test their software with Subscriber Certificates that chain up to each publicly trusted Root Certificate. At a minimum, the CA SHALL host separate Web pages using Subscriber Certificates that are (i) valid, (ii) revoked, and (iii) expired."
- Source of incident disclosure: Self Reported. Discovered during routine monthly review of crt.sh, CCADB, and external monitors of our CA infrastructure.
Impact
- Total number of certificates: 0 (no misissuance)
- Total number of "remaining valid" certificates: N/A
- Affected certificate types: N/A
- Incident heuristic: The following 6 test websites were affected: https://valid.root-e1.certainly.com, https://valid.root-r1.certainly.com, https://revoked.root-e1.certainly.com, https://revoked.root-r1.certainly.com, https://expired.root-e1.certainly.com, https://expired.root-r1.certainly.com. The "valid" and "revoked" sites were non-compliant (presenting expired certificates rather than valid and revoked certificates respectively). The "expired" sites continued to serve their intended expired certificates.
- Was issuance stopped in response to this incident, and why or why not?: No. There was no misissuance. This incident relates solely to the availability and correctness of test websites.
- Analysis: The test websites were returning HTTP responses throughout the outage, but presenting expired TLS certificates. Browsers rejected the connections with certificate errors, making the sites effectively unreachable for their intended purpose of allowing application software suppliers to test certificate validation behavior.
- Additional considerations: We are aware of no further impact to the Web PKI community or our internal CA systems. Certificate issuance, revocation, and CRL services were unaffected.
Timeline
All dates and times are UTC.
2025-05-26
- 17:01 Prior test website incident (Bug #1968836) resolved. Committed monitoring improvements completed within deadlines.
2026-05-09
- 15:03 Last successful certificate issuance for all 6 test websites.
~2026-05-19 (estimated)
- An upstream dependency used to obtain certificates for the test sites released a major version with breaking changes to its command-line interface. The test site system referenced this dependency without a pinned version, causing it to silently pull the incompatible release.
2026-06-08
- ~15:03 All test website certificates expired (30-day validity from May 9 issuance).
- No alert received. External monitoring was active but a misconfiguration ("ignore SSL errors" enabled on the "valid" site monitors) suppressed the TLS validation alert.
2026-06-25 - 2026-07-01
- System rebooted on its daily schedule. Each restart attempted certificate issuance, failed because of the incompatible upstream CLI changes, and fell back to the cached expired certificates.
2026-07-01
- ~22:00 Issue identified during routine monthly review of CCADB, crt.sh and external certificate monitors.
- ~22:30 Root cause confirmed: upstream dependency CLI breaking change.
- ~23:00 Fix prepared and tested.
2026-07-02
- 04:37 Fix deployed via emergency change.
- 04:45 All 6 certificates issued successfully.
- 04:47 Sites restored and verified externally.
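The durations quoted in this report can be cross-checked against the timeline. The following is a minimal sketch in Python; it assumes the approximate times above (~15:03 and ~22:00) are exact, so the results are only as precise as those estimates.

```python
from datetime import datetime, timedelta, timezone

UTC = timezone.utc

# Timestamps copied from the timeline; "~" times are treated as exact here.
issued = datetime(2026, 5, 9, 15, 3, tzinfo=UTC)      # last successful issuance
identified = datetime(2026, 7, 1, 22, 0, tzinfo=UTC)  # found in monthly review
restored = datetime(2026, 7, 2, 4, 47, tzinfo=UTC)    # sites verified externally

# Certificates had a 30-day validity from the last successful issuance.
expired = issued + timedelta(days=30)

time_to_detect = identified - expired
time_to_fix = restored - identified
total_non_compliance = restored - expired

print(expired.isoformat())     # 2026-06-08T15:03:00+00:00
print(time_to_detect)          # 23 days, 6:57:00
print(time_to_fix)             # 6:47:00
print(total_non_compliance)    # 23 days, 13:44:00
```

This agrees with the 23-day detection time and the "within 7 hours" fix time stated elsewhere in the report.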
Related Incidents
| Bug | Date | Description |
|---|---|---|
| #1968836 | 2025-05-28 | Certainly: Sample Websites Unavailable (49-hour outage) |
This incident is a regression affecting the same sites. The monitoring improvements committed in Bug #1968836 were completed. However, a subsequent misconfiguration in the external monitoring tool re-introduced a blind spot that allowed this failure to persist undetected.
Root Cause Analysis
Contributing Factor 1: Unpinned upstream dependency
- Description: The system that provisions certificates for the test sites referenced an upstream dependency without a pinned version. When the upstream project released a major version with breaking changes to its command-line interface, the system silently pulled the incompatible version on the next build.
- Detection: Not detected until manual investigation on 2026-07-01 (23 days after certificate expiration).
- Interaction with other factors: Factor #2 (monitoring misconfiguration) and Factor #3 (resilient fallback) allowed this to persist undetected.
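A build-time check can reject unpinned references of the kind described in this factor. The sketch below is illustrative only: the report does not name the dependency or say how it is referenced, so the `name:tag` reference format, the example names, and the helper functions `is_pinned` and `unpinned` are all assumptions.

```python
import re

# A tag counts as pinned if it is an explicit version such as "4", "4.2",
# "4.2.1" or "v4.2.1". Floating tags such as "latest" do not match.
VERSION_TAG = re.compile(r"^v?\d+(\.\d+){0,2}$")


def is_pinned(reference: str) -> bool:
    """Return True if a hypothetical "name:tag" reference has a version tag."""
    _name, sep, tag = reference.rpartition(":")
    if not sep:
        # No tag at all: the build would pull whatever upstream publishes.
        return False
    return bool(VERSION_TAG.match(tag))


def unpinned(references):
    """Return the references that should fail the build."""
    return [ref for ref in references if not is_pinned(ref)]


# Hypothetical dependency names, for illustration only.
refs = ["example/cert-client:4", "example/cert-client:latest", "example/cert-client"]
print(unpinned(refs))  # ['example/cert-client:latest', 'example/cert-client']
```

Pinning only the major version, as in `example/cert-client:4`, still allows minor and patch updates while blocking the kind of breaking CLI change that caused this incident.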
Contributing Factor 2: External monitoring misconfiguration
- Description: External monitoring was active on all 6 test websites throughout the incident. However, the "ignore SSL errors" setting was enabled on the monitors for the "valid" sites. This setting is correct for the "revoked" and "expired" site monitors, but incorrect for the "valid" sites. This suppressed the alert that should have fired when those certificates expired. It is unclear when this misconfiguration was introduced.
- Detection: Discovered during incident investigation when reviewing monitor configurations.
- Interaction with other factors: Factor #1 caused the failure; Factor #2 prevented detection.
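The per-site-type rule described above (TLS errors ignored for "revoked" and "expired" monitors, but never for "valid" monitors) can be expressed as an automated audit. This is a rough sketch, not the configuration of the actual monitoring tool: the `Monitor` type, the `ignore_tls_errors` field name, and the `audit` function are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Monitor:
    url: str
    ignore_tls_errors: bool  # hypothetical name for the "ignore SSL errors" setting


# Expected setting per site type, as described in this report.
EXPECTED_IGNORE_TLS = {"valid": False, "revoked": True, "expired": True}


def site_type(url: str) -> str:
    """Extract the first host label, e.g. "valid" from valid.root-e1.certainly.com."""
    host = url.split("://", 1)[-1].split("/", 1)[0]
    return host.split(".", 1)[0]


def audit(monitors):
    """Return one finding per monitor whose setting does not match its site type."""
    findings = []
    for m in monitors:
        kind = site_type(m.url)
        expected = EXPECTED_IGNORE_TLS.get(kind)
        if expected is None:
            findings.append(f"{m.url}: unknown site type {kind!r}")
        elif m.ignore_tls_errors != expected:
            findings.append(f"{m.url}: ignore_tls_errors should be {expected}")
    return findings


monitors = [
    Monitor("https://valid.root-e1.certainly.com", ignore_tls_errors=True),
    Monitor("https://revoked.root-e1.certainly.com", ignore_tls_errors=True),
    Monitor("https://expired.root-e1.certainly.com", ignore_tls_errors=True),
]
for finding in audit(monitors):
    print(finding)
```

Run against the misconfiguration described in this factor, the audit reports only the "valid" monitor.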
Contributing Factor 3: Resilient fallback masked persistent failure
- Description: The test site system was improved in 2025 to survive transient certificate issuance failures by caching previously issued certificates and deploying them as fallback. The system also restarts daily and retries automatically. These changes ensure the sites remain online during brief outages. Combined with the monitoring misconfiguration (Factor #2), however, they converted what would previously have been a visible site outage into a silent certificate expiration.
- Detection: Error messages were logged on every system restart but were not surfaced to any alerting system.
- Interaction with other factors: Factor #1 caused issuance to fail; Factor #3 kept the sites online with stale certificates; Factor #2 prevented detection of the expired TLS state. Previously, Factor #1 alone would have triggered a site-down alert, but Factors #2 and #3 together made the failure invisible to automated monitoring.
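One way to keep the resilient fallback without hiding persistent failures is to alert on every fallback, and more loudly when the cached certificate has already expired. The sketch below illustrates that idea under stated assumptions: `choose_certificate`, its parameters, and the `alert` callback are hypothetical and do not describe the actual test site system.

```python
from datetime import datetime, timezone


def choose_certificate(issue, cached_not_after, now, alert):
    """Try to issue a fresh certificate; fall back to the cache on failure.

    `issue` is a callable that raises on failure, `cached_not_after` is the
    expiry time of the cached certificate, and `alert` receives a message for
    every failure so that it reaches an alerting system, not only a log file.
    Returns "fresh" or "cached".
    """
    try:
        issue()
        return "fresh"
    except Exception as exc:
        alert(f"certificate issuance failed: {exc}")
        if cached_not_after <= now:
            # Falling back keeps the site online, but the state is non-compliant.
            alert("serving an expired cached certificate")
        return "cached"


def failing_issue():
    # Stands in for the upstream CLI rejecting an outdated invocation.
    raise RuntimeError("unrecognized command-line option")


alerts = []
result = choose_certificate(
    failing_issue,
    cached_not_after=datetime(2026, 6, 8, 15, 3, tzinfo=timezone.utc),
    now=datetime(2026, 6, 25, 0, 0, tzinfo=timezone.utc),
    alert=alerts.append,
)
print(result, alerts)
```

With this shape, the fallback still absorbs transient failures, but each failed attempt produces a signal independent of external monitoring.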
Root Cause Analysis methodology used: 5 Whys
Lessons Learned
- What went well:
  - Proactive monthly review of CCADB, crt.sh and external monitors detected the issue independently of primary alerting; this redundant check worked as intended.
  - Once identified, root cause was isolated quickly from system logs.
  - Fix was developed, tested, and deployed within 7 hours of discovery at night, before a holiday.
- What didn't go well:
  - External monitoring was present but misconfigured: suppressed TLS validation prevented the alert that should have detected this within minutes.
  - Resilience improvements that keep sites online during transient failures inadvertently masked a persistent failure when combined with the monitoring misconfiguration.
  - The 23-day detection time is unacceptable for a BR-mandated service.
  - Aggressive security patching needed to be tempered with at least major version pinning on such a compliance-critical system.
- Where we got lucky:
  - The issue did not affect certificate issuance, revocation, or any security-critical service.
  - Monthly external monitoring review caught this before a third party reported it.
- Additional:
  - This is the second test website incident in 13 months (Bug #1968836 was a 49-hour outage in May 2025). While the prior incident's committed improvements were completed, this regression demonstrates that the monitoring configuration for these sites requires a more robust approach. We are accelerating our migration away from the existing third-party monitoring service to an internal, configuration-as-code monitoring framework in which all settings are more easily version-controlled and peer-reviewed.
Action Items
| Action Item | Kind | Corresponding Root Cause(s) | Evaluation Criteria | Due Date | Status |
|---|---|---|---|---|---|
| Pin upstream dependency to specific major version | Prevent | Contributing Factor #1 | Dependency references explicit version; no unpinned tags in build | 2026-07-02 | Completed |
| Fix monitoring SSL validation setting on "valid" site monitors | Detect | Contributing Factor #2 | TLS validation enabled; confirmed alert fires on expired certificate | 2026-07-01 | Completed |
| Audit all external monitor configurations | Detect | Contributing Factor #2 | Each monitor's settings documented and verified correct for its site type | 2026-07-08 | Partially Complete |
| Complete migration to configuration-as-code monitoring; decommission legacy tool | Detect | Contributing Factor #2 | All test site monitoring defined in version-controlled configuration; legacy tool decommissioned | 2026-07-31 | Not Started |
| Add certificate expiry pre-alerting | Detect | Contributing Factor #3 | Alert when served certificate is approaching expiry | 2026-07-31 | Not Started |
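The certificate expiry pre-alerting item could take roughly the following shape. This is a sketch rather than the planned implementation: the `expiry_status` function and the 7-day warning window are assumptions, and the `not_after` string is in the format returned by Python's `ssl.SSLSocket.getpeercert()` for a verified connection.

```python
import ssl
from datetime import datetime, timedelta, timezone


def expiry_status(not_after: str, now: datetime,
                  warn_before: timedelta = timedelta(days=7)) -> str:
    """Classify a served certificate as "ok", "expiring", or "expired".

    `not_after` uses the certificate time format, e.g. "Jun  8 15:03:00 2026 GMT".
    The 7-day default is an assumed threshold for 30-day certificates.
    """
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)
    if now >= expires:
        return "expired"
    if expires - now <= warn_before:
        return "expiring"
    return "ok"


# Expiry time of the certificates in this incident.
not_after = "Jun  8 15:03:00 2026 GMT"
print(expiry_status(not_after, datetime(2026, 5, 20, tzinfo=timezone.utc)))  # ok
print(expiry_status(not_after, datetime(2026, 6, 2, tzinfo=timezone.utc)))   # expiring
print(expiry_status(not_after, datetime(2026, 6, 9, tzinfo=timezone.utc)))   # expired
```

A check like this would apply to the "valid" sites only, since the "expired" sites are expected to serve expired certificates.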