Closed Bug 1754593 Opened 2 years ago Closed 2 years ago

IdenTrust: Unavailable CRL and OCSP Responders

Categories

(CA Program :: CA Certificate Compliance, task)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: roots, Assigned: roots)

Details

(Whiteboard: [ca-compliance] [ocsp-failure] [crl-failure])

User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.81 Safari/537.36

Steps to reproduce:

Summary:
On February 4, 2022, we experienced an incident lasting approximately 8 hours during which our CRL and OCSP services could not consistently provide 24x7 online responses within ten seconds, as required by the B.R. Section 4.10.2 Service Availability guidelines.

  1. How your CA first became aware of the problem (e.g. via a problem report submitted to your Problem Reporting Mechanism, a discussion in mozilla.dev.security.policy, a Bugzilla bug, or internal self-audit), and the time and date.

IdenTrust: (All times stated in MST)
IdenTrust engineers first became aware of the problem when system monitoring alerted on connectivity issues. The internal monitoring alert was received on 02/04/2022 at 12:01 MST.
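
As an illustration only, a monitor of this kind can be sketched as a timed probe checked against the ten-second limit from B.R. 4.10.2. IdenTrust's actual monitoring tooling is not described in this report; `probe`, `ProbeResult`, and the injected `fetch` callable are hypothetical names, and the real request to a CRL or OCSP endpoint is left to the caller.

```python
import time
from dataclasses import dataclass
from typing import Callable

# Ten-second limit taken from the B.R. 4.10.2 reference in the summary.
RESPONSE_LIMIT_SECONDS = 10.0


@dataclass
class ProbeResult:
    ok: bool          # True only if a response arrived within the limit
    elapsed: float    # seconds spent waiting for the response
    detail: str       # short human-readable reason


def probe(fetch: Callable[[], bytes],
          limit: float = RESPONSE_LIMIT_SECONDS,
          clock: Callable[[], float] = time.monotonic) -> ProbeResult:
    """Time one request made by `fetch` and compare it with `limit`.

    `fetch` is a hypothetical callable that performs the actual CRL or
    OCSP request and returns the response body, raising on failure.
    """
    start = clock()
    try:
        body = fetch()
    except Exception as exc:  # any failure counts as unavailable
        return ProbeResult(False, clock() - start, f"error: {exc}")
    elapsed = clock() - start
    if not body:
        return ProbeResult(False, elapsed, "empty response")
    if elapsed > limit:
        return ProbeResult(False, elapsed, "response slower than limit")
    return ProbeResult(True, elapsed, "ok")
```

Injecting `fetch` and `clock` keeps the timing logic separate from the network call, so the threshold handling can be exercised without contacting a live responder.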

  2. A timeline of the actions your CA took in response. A timeline is a date-and-time-stamped sequence of all relevant events. This may include events before the incident was reported, such as when a particular requirement became applicable, or a document changed, or a bug was introduced, or an audit was done.

IdenTrust:
02/04/2022 – 12:01 MST, Monitoring alerted IdenTrust engineers to a system impact affecting customers. Staff immediately began troubleshooting and isolated the cause of the problem, which was an extremely high volume of traffic hitting the firewalls. Escalation tickets were opened with third-party vendors, and troubleshooting and restoration efforts continued.
16:36 MST, Following intermittent recoveries of services over the course of about an hour, services were initially believed to be fully operational, but it turned out that only issuance had been restored.
19:59 MST, Services were restarted to pick up the renewed OCSP responder certificates, which resulted in full recovery of validation services.
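
For reference, the interval between the first alert and full recovery can be worked out from the two timestamps above. This is only an arithmetic check; both times are treated as naive local (MST) values on the same day.

```python
from datetime import datetime

# Timestamps from the timeline above, both in MST on 02/04/2022.
first_alert = datetime(2022, 2, 4, 12, 1)
full_recovery = datetime(2022, 2, 4, 19, 59)

outage = full_recovery - first_alert
print(outage)  # 7:58:00, consistent with the roughly 8 hours in the summary
```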

  3. Whether your CA has stopped, or has not yet stopped, issuing certificates with the problem. A statement that you have will be considered a pledge to the community; a statement that you have not requires an explanation.
    IdenTrust: Not applicable

  4. A summary of the problematic certificates. For each problem: number of certs, and the date the first and last certs with that problem were issued.
    IdenTrust: Not applicable

  5. The complete certificate data for the problematic certificates. The recommended way to provide this is to ensure each certificate is logged to CT and then list the fingerprints or crt.sh IDs, either in the report or as an attached spreadsheet, with one list per distinct problem.
    IdenTrust: Not applicable

  6. Explanation about how and why the mistakes were made or bugs introduced, and how they avoided detection until now.

IdenTrust: The bombardment of traffic hitting the firewalls left them unable to pass traffic. Working with the vendor, we identified this as a known problem with the firmware version running on the firewalls. Our options were to upgrade the firewalls immediately or to turn off the offending functionality within the firewalls pending the upgrade. To restore services, we opted to turn off the functionality until the firewalls were upgraded in a change window the next day and confirmed to be OK.

Once traffic was flowing through the firewalls again, we were able to see that a batch of OCSP responder certificates had been renewed two days prior, but the restart of services required for the renewal to take effect had not been done. As a result, the expired certificates remained in use, and they are what caused the high volume of traffic that put the firewalls into a bad state. Services were restored by restarting them so that they recognized the renewed OCSP responder certificates.
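
The failure mode described above, where a renewed certificate exists but the running service still uses the old one, can be detected by comparing fingerprints. The following is a minimal sketch under stated assumptions: `fingerprint`, `needs_restart`, and `stale_services` are hypothetical helpers, the service names are placeholders, and how the DER bytes of the in-use and renewed certificates are obtained is not described in this report.

```python
import hashlib
from typing import Dict, List


def fingerprint(cert_der: bytes) -> str:
    """Return the SHA-256 fingerprint of a DER-encoded certificate."""
    return hashlib.sha256(cert_der).hexdigest()


def needs_restart(in_use_der: bytes, renewed_der: bytes) -> bool:
    """True if the service is using a certificate other than the renewed one."""
    return fingerprint(in_use_der) != fingerprint(renewed_der)


def stale_services(in_use: Dict[str, bytes],
                   renewed: Dict[str, bytes]) -> List[str]:
    """List services whose in-use certificate is not the renewed one.

    Keys are hypothetical service names. Services with no renewed
    certificate in this batch are skipped.
    """
    return sorted(
        name for name, der in in_use.items()
        if name in renewed and needs_restart(der, renewed[name])
    )
```

Running a check like this after each renewal batch would flag any service that still needs a restart before the old certificate expires.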

  7. List of steps your CA is taking to resolve the situation and ensure such issuance will not be repeated in the future, accompanied with a timeline of when your CA expects to accomplish these things.

IdenTrust:
Steps taken to resolve the situation:
• Worked with the vendor to identify and bypass the functionality within the firewalls that resulted in the inability to pass traffic.
• Upgraded the firewall firmware the following day to ensure the problem does not recur.
• Restarted services to pick up the renewed OCSP responder certificates.

Steps that will take place to avoid recurrence:
• Deployed additional monitoring that flags certificates needing renewal in advance of expiration. The goal is to have 7 days of advance notice, with an alert that does not clear until the system automatically sees the certificates as renewed.
• Strengthen procedures for monthly batch certificate renewals to ensure all services are confirmed as operational at both the primary and secondary sites.
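
The monitoring goal in the first bullet can be sketched as follows. This is an illustration under assumptions, not IdenTrust's implementation: `RENEWAL_WINDOW`, `alert_active`, and `open_alerts` are hypothetical names, and the expiry passed in is assumed to be the notAfter of the certificate each service is actually using, so that a renewal which has not taken effect does not clear the alert.

```python
from datetime import datetime, timedelta
from typing import Dict, List

# Seven days of advance notice, per the stated goal.
RENEWAL_WINDOW = timedelta(days=7)


def alert_active(not_after: datetime, now: datetime,
                 window: timedelta = RENEWAL_WINDOW) -> bool:
    """True while the in-use certificate expires within `window` or has expired."""
    return not_after - now <= window


def open_alerts(in_use_expiry: Dict[str, datetime], now: datetime) -> List[str]:
    """Return the (hypothetical) service names whose alert has not cleared.

    The alert state is recomputed from the observed expiry on every run, so
    it clears only when a certificate with a later notAfter is observed.
    """
    return sorted(name for name, exp in in_use_expiry.items()
                  if alert_active(exp, now))
```

Because nothing is acknowledged by hand, the alert cannot be dismissed by marking a renewal as done; it stays open until the observed certificate changes.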

Assignee: bwilson → roots
Status: UNCONFIRMED → ASSIGNED
Type: defect → task
Ever confirmed: true
Whiteboard: [ca-compliance]

Was lack of training and/or documentation a cause for the service not to be restarted? Which service on which device needed to be restarted?

(In reply to Ben Wilson from comment #1)
The individual who caused this issue has been trained and has successfully completed the steps to renew these OCSP responder certificates monthly for a few months, and the step is also documented in our procedures. In this case, the newer resource was working on too many activities in a short period of time and made an honest mistake by missing the step of restarting the OCSP responder services for those PKIs requiring OCSP responder certificate renewal in the monthly batch of renewals. Those services had to be restarted once the secondary issue of high traffic affecting the firewalls was resolved. Services tied to the Commercial Root, where SSL/TLS services are mainly provided, were among those that needed to be restarted to restore service.

I don't have any further questions.

Since the resolution of this issue on 2022-02-04 at 19:59 MST, the CRL and OCSP responders referenced in this incident report have been consistently available without any interruptions. We consider this issue resolved and request formal closure from Mozilla.

We have no pending actions for this Incident Report other than including it in this year's annual WebTrust Audit.

We confirm that we have no other pending activities for this incident other than including it in this year's WebTrust annual audit report.

It appears that this incident can be closed, which I'll do on or about Wed. 13-April-2022.

Flags: needinfo?(bwilson)
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Flags: needinfo?(bwilson)
Resolution: --- → FIXED
Product: NSS → CA Program
Whiteboard: [ca-compliance] → [ca-compliance] [ocsp-failure] [crl-failure]