Certainly: Sample Websites Unavailable
Categories
(CA Program :: CA Certificate Compliance, task)
Tracking
(Not tracked)
People
(Reporter: djeffery, Assigned: djeffery)
Details
(Whiteboard: [ca-compliance] [policy-failure])
Steps to reproduce:
Preliminary Incident Report
Summary
Incident description: On 23-May 2025, Certainly deployed a configuration change to Boulder that discontinued the inclusion of the OCSP URI in end-entity certificates. The following day, 24-May 2025, the service hosting our sample sites with valid, revoked, and expired certificates attempted its daily restart to reissue and revoke certificates. It failed due to a hardcoded expectation that the OCSP URI could be read from the certificates, and entered a restart loop.
An alert was received by on-call staff at the start of the outage but was mistakenly classified as invalid until another staff member reviewed it on 26-May. After proper triage, the issue was remediated promptly.
Relevant policies:
Section 2.2 of the BR requires that the CA publish sample websites using certificates it has issued that are valid, revoked and expired. No acceptable period of unavailability is defined.
Source of incident disclosure:
This incident was identified by our own internal monitoring solutions and staff in the ordinary course of their duties.
A fix has been deployed. We will post a full incident report within 7 days.
Comment 1•5 months ago
Full Incident Report
Summary
- CA Owner CCADB unique ID: A007878
- Incident description: Certainly stopped including the OCSP URI in end-entity certificates on 23-May 2025. Systems relying on the OCSP URI in the AIA extension were all believed to have been updated, and no issues were observed that day.
The following day, Saturday, 24-May 2025, the container that serves the BR-mandated test web pages, which present valid, revoked, and expired certificates for each root, attempted to restart as part of a daily job that refreshes the valid and revoked certificates. In its startup routine it attempted to verify the OCSP response for each of the certificates using the OCSP URI contained in the certificate. As this was no longer present, the script failed and restarted the container, initiating a crash loop that kept the sites offline. (A sketch of this failure mode appears after this summary.)
When these sites were confirmed unreachable by our monitoring, an alert was sent to the on-call engineer. The on-call engineer received this alert along with a few others and began investigating. Having determined that the causes of the other alerts were short-lived, the engineer incorrectly assumed that this was a brief interruption to the service that would also self-recover. They acknowledged it with the plan to take a closer look at why it had alerted on Tuesday, 27-May, after the holiday weekend.
On Monday, 26-May, another staff member took over the on-call duty, re-reviewed the alert, determined that it was a legitimate failure, and quickly identified the cause in the startup script. The fix was then rapidly committed and pushed to get the broken service back online.
- Timeline summary: All dates and times are UTC
  - Non-compliance start date: [2025-05-24 16:00] Container serving the test web pages restarts.
  - Non-compliance identified date: [2025-05-26 16:25] Alert properly triaged and investigation initiated.
  - Non-compliance end date: [2025-05-26 17:00] Monitoring log confirms sites are reachable.
- Relevant policies: Section 2.2 of the BR requires that the CA publish test web pages using certificates it has issued that are valid, revoked, and expired. No acceptable period of unavailability is defined.
- Source of incident disclosure: Self Reported
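The startup script itself is not included in this report, so the following Go sketch only illustrates the failure mode under stated assumptions: the certificate path (`cert.pem`) and the fallback message are hypothetical, and the real service may well be a shell script rather than Go. The point is the unguarded read of the AIA OCSP URI list, which is empty under the new profile.

```go
package main

import (
	"crypto/x509"
	"encoding/pem"
	"fmt"
	"log"
	"os"
)

func main() {
	// Hypothetical location of the freshly issued end-entity certificate.
	pemBytes, err := os.ReadFile("cert.pem")
	if err != nil {
		log.Fatal(err)
	}
	block, _ := pem.Decode(pemBytes)
	if block == nil {
		log.Fatal("no PEM block found in cert.pem")
	}
	cert, err := x509.ParseCertificate(block.Bytes)
	if err != nil {
		log.Fatal(err)
	}

	// The pre-incident logic effectively read cert.OCSPServer[0]
	// unconditionally; once the profile change stopped populating the
	// AIA OCSP URI, that read fails and the process exits.
	if len(cert.OCSPServer) == 0 {
		fmt.Println("no OCSP URI in AIA; skipping OCSP status verification")
		return
	}
	fmt.Println("would verify status via OCSP responder:", cert.OCSPServer[0])
}
```

With the unguarded read, the process exits on every start, and a supervisor that restarts the container on failure produces exactly the crash loop described above.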
Impact
- Total number of certificates: 0
- Total number of "remaining valid" certificates: 0
- Affected certificate types: None
- Incident heuristic: The following 6 sites were affected (see the reachability sketch after this list): https://valid.root-e1.certainly.com, https://revoked.root-e1.certainly.com, https://expired.root-e1.certainly.com, https://valid.root-r1.certainly.com, https://revoked.root-r1.certainly.com, and https://expired.root-r1.certainly.com.
- Was issuance stopped in response to this incident, and why or why not?: No. There was no misissuance.
- Analysis: Even though the incident did not impact certificate-related functions (e.g. issuance, revocation checking, etc.), it placed us in violation of section 2.2 of the BR. The sites were unavailable for 49 hours.
- Additional considerations: We are aware of no further impact to the Web PKI community or our internal CA systems.
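For illustration, here is a minimal Go probe over the six affected hostnames, in the spirit of (but not taken from) the monitoring described in this report. Because the expired and revoked pages cannot pass ordinary chain verification, the probe deliberately skips certificate validation and reports only whether each site answers.

```go
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"time"
)

func main() {
	sites := []string{
		"https://valid.root-e1.certainly.com",
		"https://revoked.root-e1.certainly.com",
		"https://expired.root-e1.certainly.com",
		"https://valid.root-r1.certainly.com",
		"https://revoked.root-r1.certainly.com",
		"https://expired.root-r1.certainly.com",
	}
	client := &http.Client{
		Timeout: 10 * time.Second,
		Transport: &http.Transport{
			// The expired and revoked pages cannot pass normal chain
			// verification, so this probe checks reachability only,
			// not certificate validity.
			TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
		},
	}
	for _, site := range sites {
		resp, err := client.Get(site)
		if err != nil {
			fmt.Printf("DOWN %s: %v\n", site, err)
			continue
		}
		resp.Body.Close()
		fmt.Printf("UP   %s: %s\n", site, resp.Status)
	}
}
```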
Timeline
All dates and times are UTC
2025-05-23
- 17:41 Boulder configuration change deployed.
- 18:41 Deployment completed on all CA nodes.
2025-05-24
- 16:00 Container serving the test web pages restarts with newly issued certificates.
- 16:00 Monitoring log first reports failure to reach site.
- 17:00 First paging alert reporting website error.
- 17:03 Alert acknowledged by on-call engineer.
2025-05-26
- 16:25 Alert properly triaged and investigation initiated.
- 16:32 Issue identified, along with a fix.
- 16:39 Fix was deployed.
- 16:56 Last failure to reach site reported in monitoring log.
- 17:01 Monitoring log confirms sites are reachable.
- 17:01 Alert automatically resolved.
Related Incidents
| Bug | Date | Description |
|---|---|---|
| 1962809 | 25-April 2025 | Bug relates to maintenance of the test web pages mandated in BR section 2.2. |
Root Cause Analysis
Contributing Factor 1: Alert behavior differs from other alerts.
- Description: Unlike most of our alerts, this alert does not re-alert periodically while the underlying condition persists, and it self-resolves upon recovery. It also lacks a direct link to documentation on how to respond. As a result, an on-call engineer may assume it will self-recover and fail to fully triage it or to verify recovery later (a sketch of the missing re-alert behavior follows this factor).
- Timeline:
[2025-05-24 16:00] Container serving the test web pages restarts.
[2025-05-24 16:00] Monitoring log first reports failure to reach site.
[2025-05-24 17:03] Alert acknowledged by on-call engineer.
[2025-05-26 16:25] Alert properly triaged and investigation initiated.
- Detection: On-call engineer acknowledged the alert and reviewed it within minutes, but reached the mistaken determination that it should be left alone to self-resolve. Determination that the underlying problem was still occurring did not happen until a new engineer took over on-call duty and re-reviewed the acknowledged alert.
- Interaction with other factors: Lack of specific knowledge of this alert behavior from factor #2.
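To make the behavioral gap concrete, here is a small, self-contained Go sketch of re-alerting behavior. This is not our monitoring code: the probe, the repeat interval (compressed to milliseconds so the example terminates), and the paging printout are all illustrative assumptions. A check like this keeps paging for as long as it fails, rather than paging once and going quiet.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// probe stands in for the HTTPS reachability check; it always fails
// here to simulate the ongoing crash loop.
func probe() error { return errors.New("test web pages unreachable") }

func main() {
	// Hypothetical repeat interval, compressed from hours to
	// milliseconds so this example terminates quickly.
	const repeatInterval = 25 * time.Millisecond
	var lastPaged time.Time

	for i := 0; i < 5; i++ {
		if err := probe(); err != nil {
			// Re-page whenever the failure has persisted past the
			// repeat interval, instead of paging once and going quiet.
			if time.Since(lastPaged) >= repeatInterval {
				fmt.Printf("check %d: PAGE on-call: %v\n", i, err)
				lastPaged = time.Now()
			} else {
				fmt.Printf("check %d: still failing, page suppressed\n", i)
			}
		}
		time.Sleep(15 * time.Millisecond)
	}
}
```

Under the one-shot behavior this alert actually had, only the first page would ever fire, which is what allowed a single early misclassification to stand for two days.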
Contributing Factor 2: Inadequate team training and experience with this service.
- Description: Certainly issues relatively short-lived certificates (30 days) and thus has had automated certificate provisioning to our test web pages from the beginning. This service is very efficient and requires little maintenance and no routine interventions. As a result, newer team members have little experience with the service and may be more prone to misunderstanding its behavior.
- Timeline:
[2023-03-01] Last update to the code that generates the test web pages container.
[2025-05-26] Update to remove OCSP dependency.
- Detection: 5 Whys analysis identified lack of experience with this system as an area for improvement.
- Interaction with other factors: The non-standard alert behavior from factor #1.
Contributing Factor 3: Certificate profile update.
- Description: Various tests and communications had been conducted to validate that the profile update would be non-disruptive, but there was a lack of awareness that this service expected OCSP to be available to confirm the status of the revoked certificates (a sketch of the kind of check that could have surfaced this follows this factor).
- Timeline:
[2025-05-23 17:41] Boulder configuration change deployed.
[2025-05-24 16:00] Container serving the test web pages restarts.
- Detection: The profile update was known and expected. Identification as a contributing factor is a result of factors #1 and #2.
- Interaction with other factors: Underlying change that led to both factors #1 and #2.
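The sketch below is hypothetical (Certainly's actual test suite is not public): it builds a throwaway certificate with no AIA OCSP URI, as under the new profile, and asserts that a status-verification routine modeled on the startup check tolerates it. A regression check of this shape, run before the profile change, could have surfaced the dependency.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"errors"
	"fmt"
	"log"
	"math/big"
	"time"
)

// verifyStatus mimics the startup check: it must not fail merely
// because the OCSP URI is absent from the certificate.
func verifyStatus(cert *x509.Certificate) error {
	if len(cert.OCSPServer) == 0 {
		return nil // tolerate profiles without an OCSP URI
	}
	return errors.New("OCSP querying not implemented in this sketch")
}

func main() {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		log.Fatal(err)
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(1),
		Subject:      pkix.Name{CommonName: "valid.root-e1.certainly.com"},
		NotBefore:    time.Now(),
		NotAfter:     time.Now().Add(30 * 24 * time.Hour),
		// Deliberately no OCSPServer entries, matching the new profile.
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		log.Fatal(err)
	}
	cert, err := x509.ParseCertificate(der)
	if err != nil {
		log.Fatal(err)
	}
	if err := verifyStatus(cert); err != nil {
		log.Fatal("status check broke on OCSP-less profile: ", err)
	}
	fmt.Println("status check tolerates certificates without an OCSP URI")
}
```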
Root Cause Analysis methodology used: 5 Whys
Lessons Learned
- What went well:
- Alert triggered in response to the failure.
- Preparation for this certificate profile change succeeded in preventing any other impact.
- Configuration error was quickly and easily resolved, once properly triaged.
- What didn’t go well:
- Failure occurred a day after the change was deployed, by which time it appeared unrelated.
- Alert was initially misidentified as a false alarm.
- Alert does not re-alert when a problem persists, deviating from expected patterns with other alerts.
- Where we got lucky:
- Alert was noticed by another team member, leading to an earlier resolution.
Action Items
| Action Item | Kind | Corresponding Contributing Factor(s) | Evaluation Criteria | Due Date | Status |
|---|---|---|---|---|---|
| Improve documentation and training on test web pages | Prevent | Contributing Factor # 2 | Team reviews and confirms documentation is complete | 2025-06-30 | Ongoing |
| Add documentation link directly to the alert | Prevent | Contributing Factor # 1 | Team reviews and confirms all alerts link to documentation | 2025-06-30 | Ongoing |
| Modify behavior of alert to align with other alerts | Prevent | Contributing Factor # 1 | Team reviews and confirms alert behavior | 2025-06-30 | Ongoing |
Comment 2•5 months ago
Weekly Update 2025-06-09
Action Items
- Improve documentation and training on test web pages
- Team has reviewed the documentation, but not yet formally agreed it is complete.
- Add documentation link directly to the alert
- Some technical challenges exist. Team is reviewing the best approach.
- Modify behavior of alert to align with other alerts
- Further analysis to determine the best approach is still being conducted.
We expect to have more substantive updates next week and are monitoring this bug.
Comment 3•5 months ago
Weekly Update 2025-06-16
Action Items
- Improve documentation and training on test web pages
- We've conducted training based on the reviewed and improved documentation. Team has signed off.
- Add documentation link directly to the alert
- Alert now includes the documentation link.
- Modify behavior of alert to align with other alerts
- Team has identified ways to alter existing alerts to improve the consistency of alert behavior. We expect to have tested and rolled out the new approach by the 30th.
| Action Item | Kind | Corresponding Contributing Factor(s) | Evaluation Criteria | Due Date | Status |
|---|---|---|---|---|---|
| Improve documentation and training on test web pages | Prevent | Contributing Factor # 2 | Team reviews and confirms documentation is complete | 2025-06-30 | Complete |
| Add documentation link directly to the alert | Prevent | Contributing Factor # 1 | Team reviews and confirms all alerts link to documentation | 2025-06-30 | Complete |
| Modify behavior of alert to align with other alerts | Prevent | Contributing Factor # 1 | Team reviews and confirms alert behavior | 2025-06-30 | Ongoing |
We continue to monitor this bug.
Comment 4•5 months ago
Weekly Update 2025-06-23
Action Items
- Modify behavior of alert to align with other alerts
- Further testing and a limited production implementation identified new challenges with the changes to alert behavior. We still expect to roll out by the 30th.
We continue to monitor this bug.
Comment 5•4 months ago
Weekly Update 2025-06-30
Action Items
- Modify behavior of alert to align with other alerts
- We are close to finishing this item but are still ironing out a couple of details. We are moving the completion date back to 2025-07-03.
| Action Item | Kind | Corresponding Contributing Factor(s) | Evaluation Criteria | Due Date | Status |
|---|---|---|---|---|---|
| Modify behavior of alert to align with other alerts | Prevent | Contributing Factor # 1 | Team reviews and confirms alert behavior | 2025-07-03 | Ongoing |
We continue to monitor this bug.
Comment 6•4 months ago
Action Item Update 2025-07-03
Action Items
- Modify behavior of alert to align with other alerts
- The deployment for this action item has been moved to 2025-07-20. The original scope was expanded to include additional technical requirements discovered during implementation. The new date also allows for a safer deployment window outside of the holiday weekend.
| Action Item | Kind | Corresponding Contributing Factor(s) | Evaluation Criteria | Due Date | Status |
|---|---|---|---|---|---|
| Modify behavior of alert to align with other alerts | Prevent | Contributing Factor # 1 | Team reviews and confirms alert behavior | 2025-07-20 | Ongoing |
We continue to monitor this bug.
Comment 7•4 months ago
Weekly Update 2025-07-10
Action Items
- Modify behavior of alert to align with other alerts
- Progress on the additional technical requirements was good this week. We are on track for the new date.
We continue to monitor this bug.
Comment 8•4 months ago
Action Item Update 2025-07-16
Action Items
- Modify behavior of alert to align with other alerts
- The new alert behavior has been deployed and is working as intended.
| Action Item | Kind | Corresponding Contributing Factor(s) | Evaluation Criteria | Due Date | Status |
|---|---|---|---|---|---|
| Modify behavior of alert to align with other alerts | Prevent | Contributing Factor # 1 | Team has reviewed and confirmed alert behavior | 2025-07-20 | Complete |
We continue to monitor this bug.
Comment 9•4 months ago
Report Closure Summary
- Incident description: Certainly stopped including the OCSP URI in end-entity certificates on 23-May 2025. Systems relying on the OCSP URI in the AIA extension were all believed to have been updated, and no issues were observed that day.
The following day, Saturday, 24-May 2025, the container that serves the BR-mandated test web pages, which present valid, revoked, and expired certificates for each root, attempted to restart as part of a daily job that refreshes the valid and revoked certificates. In its startup routine it attempted to verify the OCSP response for each of the certificates using the OCSP URI contained in the certificate. As this was no longer present, the script failed and restarted the container, initiating a crash loop that kept the sites offline.
When these sites were confirmed unreachable by our monitoring, an alert was sent to the on-call engineer. The on-call engineer received this alert along with a few others and began investigating. Having determined that the causes of the other alerts were short-lived, the engineer incorrectly assumed that this was a brief interruption to the service that would also self-recover. They acknowledged it with the plan to take a closer look at why it had alerted on Tuesday, 27-May, after the holiday weekend.
On Monday, 26-May, another staff member took over the on-call duty, re-reviewed the alert, determined that it was a legitimate failure, and quickly identified the cause in the startup script. The fix was then rapidly committed and pushed to get the broken service back online.
- Incident Root Cause(s):
- Alert behavior differs from other alerts.
- Inadequate team training and experience with this service.
- Certificate profile update.
- Remediation description:
- Improve documentation and training on test web pages
- We've conducted training based on the reviewed and improved documentation. Team has signed off.
- Add documentation link directly to the alert
- Alert now includes the documentation link.
- Modify behavior of alert to align with other alerts
- The new alert behavior has been deployed and is working as intended.
- Commitment summary: Certainly seeks to continuously improve our alerting, training, and documentation. This incident highlighted some needed improvements, and going forward we are committed to maintaining and applying these lessons throughout our work.
All Action Items disclosed in this report have been completed as described, and we request its closure.
Comment 10•3 months ago
This is a final call for comments or questions on this Incident Report.
Otherwise, this bug will be closed on approximately 2025-08-08.