Closed Bug 1636544 Opened 5 years ago Closed 4 years ago

IdenTrust: OCSP Outage

Categories

(CA Program :: CA Certificate Compliance, task)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: wthayer, Assigned: roots)

Details

(Whiteboard: [ca-compliance] [ocsp-failure])

It has been reported that IdenTrust suffered a multi-day OCSP outage around April 25th, 2020. Please either explain why this report is inaccurate, or provide an incident report in the Mozilla format.

Flags: needinfo?(roots)

We acknowledge receipt of this incident and will start investigating.

Flags: needinfo?(roots)

We have made progress investigating what happened here and are in the process of coordinating a formal response by no later than May 22, 2020.

Summary:
On 2020-04-23 15:01 (all times in GMT), IdenTrust experienced a significant increase in requests to the OCSP responder at http://isrg.trustid.ocsp.identrust.com, peaking at 650k requests per second (compared with a typical rate of 10k rps to this responder). The exact cause of that spike in traffic is unknown. This responder is fronted by a CDN, but almost all of these requests hit the origin. As a result, the ISRG OCSP responder started serving errors, and the firewall at the primary data center serving this responder became unresponsive. The degradation was later confirmed in the logs.

When monitoring alerted IdenTrust to potential system impact at 05:23 on 2020-04-24, IdenTrust began troubleshooting to identify the cause and to determine whether there was any impact. The issue appeared to have cleared as traffic returned to normal levels.

Later on 2020-04-24 the issue recurred and system engineers again began troubleshooting the cause of the increase in traffic. IdenTrust eventually performed a hard boot of the firewall at the primary data center, which required IdenTrust staff to travel to the site. After the firewall was restarted, the request rate dropped back to normal levels and the OCSP responder stopped serving errors.
The outage was lengthened by the hard boot that required air travel to the datacenter; during this time, users of this OCSP responder were receiving errors.

Because the impact was limited to this particular OCSP responder, and because we believed at the time that the traffic spikes had only stressed the infrastructure and that traffic had successfully failed over to the secondary site, we saw no signs of impact and considered that this incident didn’t require further action. On 2020-05-08 we became aware through the community that there was a report of impact, and we have been re-evaluating the incident since then. We have confirmed that service was affected as outlined below.

Affected Service:
OCSP Responder URI: http://isrg.trustid.ocsp.identrust.com
This OCSP responder is responsible for responding to requests about the validity of four Subordinate CA certificates:

  1. How your CA first became aware of the problem (e.g. via a problem report submitted to your Problem Reporting Mechanism, a discussion in mozilla.dev.security.policy, a Bugzilla bug, or internal self-audit), and the time and date.
    IdenTrust:
    Monitoring alerted IdenTrust to a systems issue affecting traffic related to Let’s Encrypt on 2020-04-24 03:23. However, monitoring indicated that the actions taken had resulted in traffic failing over properly from one data center to the other, and we believed there was no impact to customers.
    On 2020-05-08 we were made aware that relying parties may have experienced service disruption via Bugzilla bug https://bugzilla.mozilla.org/show_bug.cgi?id=1636544

  2. A timeline of the actions your CA took in response. A timeline is a date-and-time-stamped sequence of all relevant events. This may include events before the incident was reported, such as when a particular requirement became applicable, or a document changed, or a bug was introduced, or an audit was done.
    IdenTrust:
    2020-04-23 15:01: An increase in OCSP validations at ISRG-OCSP responder was experienced. This was caused by a significant increase in traffic from the CDN.
    2020-04-24 03:23: IdenTrust monitoring service alerted IdenTrust Operations of an unusual spike in traffic to the ISRG-OCSP responder validation server. (Service Status: Degraded)
    2020-04-24 03:46: System Engineers reviewed the connections and switched traffic from the primary ISP to a secondary ISP
    2020-04-24 04:00: Systems team shifted traffic back to the primary ISP. Monitoring indicated no additional problems.
    2020-04-24 13:45: IdenTrust received additional monitoring alerts indicating that the increase in traffic had recurred. We began troubleshooting, which continued together with ISRG.
    2020-04-24 22:06: Notified ISRG that the firewall needed a restart and that the ISRG CDN server needed a configuration update to reroute traffic from the primary responder server to a secondary site.
    2020-04-24 22:23: Firewall restarted but failed to recover.
    2020-04-24 23:00: Planned an on-site visit to the data center to restart the affected firewall.
    2020-04-25 19:15: After coordinating air travel to the primary site, the firewall was successfully restarted and the validation traffic service was configured back to the primary site.
    2020-04-25 19:30: Service Recovery. (Service Status: Normal).
    2020-05-08: The IdenTrust program management team was alerted of these posts:
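
The intervals implied by the timeline can be worked out directly from its timestamps. The following is a minimal sketch, assuming the timestamps are GMT as stated in the summary; the variable names and helper functions are illustrative and not part of the report. The results are elapsed times between reported events, not a measure of continuous unavailability, since traffic appeared to return to normal for part of 2020-04-24.

```python
from datetime import datetime, timedelta, timezone

FMT = "%Y-%m-%d %H:%M"


def at(stamp: str) -> datetime:
    """Parse a timeline timestamp; the report states all times are GMT."""
    return datetime.strptime(stamp, FMT).replace(tzinfo=timezone.utc)


def fmt(delta: timedelta) -> str:
    """Render a timedelta as hours and minutes."""
    minutes = int(delta.total_seconds() // 60)
    return f"{minutes // 60}h {minutes % 60:02d}m"


# Timestamps copied from the timeline entries (names are illustrative).
onset = at("2020-04-23 15:01")
first_alert = at("2020-04-24 03:23")
failed_restart = at("2020-04-24 22:23")
onsite_restart = at("2020-04-25 19:15")
recovery = at("2020-04-25 19:30")

print("traffic increase to first alert:", fmt(first_alert - onset))
print("failed restart to on-site restart:", fmt(onsite_restart - failed_restart))
print("traffic increase to recovery:", fmt(recovery - onset))
```

Under these assumptions the first alert came 12h 22m after the traffic increase began, the on-site restart came 20h 52m after the failed restart, and recovery was declared 52h 29m after the increase began.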

  3. Whether your CA has stopped, or has not yet stopped, issuing certificates with the problem. A statement that you have will be considered a pledge to the community; a statement that you have not requires an explanation.
    IdenTrust: N/A

  4. A summary of the problematic certificates. For each problem: number of certs, and the date the first and last certs with that problem were issued.
    IdenTrust: N/A

  5. The complete certificate data for the problematic certificates. The recommended way to provide this is to ensure each certificate is logged to CT and then list the fingerprints or crt.sh IDs, either in the report or as an attached spreadsheet, with one list per distinct problem.
    IdenTrust: N/A

  6. Explanation about how and why the mistakes were made or bugs introduced, and how they avoided detection until now.
    IdenTrust:
    The service impact can be summarized as two problems:

    1. An increase in traffic to the ISRG OCSP responder resulted in degraded performance and in traffic not reaching the ISRG OCSP responder.

    2. The failover procedure did not work as expected and monitoring failed to accurately detect impact to relying parties of this OCSP service.

The monitoring was not end-to-end, so IdenTrust was unable to detect any external service impact on traffic from the CDN. This led to the belief that the failover procedure was working as intended.

Several days after the described events, we learned that someone had reported experiencing connectivity issues. Our investigation indicates that there was either significant degradation or, at times, possibly complete failure to route traffic from the CDN servicing the IdenTrust ISRG-OCSP responder to the secondary site. We later found that logs showed the OCSP validation service was also degraded during the period when it was running at the primary site.

On 2020-05-13 we confirmed that there was a significant degradation or complete failure of the current failover procedures and corrected these to better handle failovers of traffic to our secondary datacenter.

  7. List of steps your CA is taking to resolve the situation and ensure such issuance will not be repeated in the future, accompanied with a timeline of when your CA expects to accomplish these things.
    IdenTrust:
    IdenTrust identified the problem with the failover process that resulted in unavailability for ISRG traffic. The change involves updating the documented steps for the configuration settings that allow traffic from the CDN to connect to the service.

On 2020-05-16, the failover was successfully tested using the updated steps. This was done by redirecting traffic and improving monitoring to evaluate traffic transitions.

IdenTrust will be implementing additional end-to-end monitoring of http://isrg.trustid.ocsp.identrust.com that would reveal a failed attempt in the future, and would alert earlier on user-facing errors. Completion is expected no later than 2020-06-30.
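
To illustrate what an end-to-end check of this kind might look like, here is a minimal sketch of a probe that sends an OCSP request to the public responder URL and reports whether a relying party would have received a usable answer. It is not IdenTrust's monitoring. The `probe` and `http_post` helpers are hypothetical, and `request_der` is assumed to be a pre-built DER-encoded OCSP request for a certificate known to be covered by the responder. Only the outer `responseStatus` field defined in RFC 6960 is inspected; a production check would presumably also validate the response signature and freshness.

```python
import urllib.request

# responseStatus values defined in RFC 6960, section 4.2.1.
OCSP_STATUS = {
    0: "successful",
    1: "malformedRequest",
    2: "internalError",
    3: "tryLater",
    5: "sigRequired",
    6: "unauthorized",
}


def parse_response_status(der: bytes) -> int:
    """Return the responseStatus of a DER-encoded OCSPResponse."""
    if len(der) < 2 or der[0] != 0x30:
        raise ValueError("body is not a DER SEQUENCE")
    idx = 2
    if der[1] & 0x80:  # long-form length: skip the extra length octets
        idx += der[1] & 0x7F
    # The first element must be ENUMERATED (tag 0x0A) of length 1.
    if len(der) < idx + 3 or der[idx:idx + 2] != b"\x0a\x01":
        raise ValueError("responseStatus ENUMERATED not found")
    return der[idx + 2]


def http_post(url: str, request_der: bytes, timeout: float) -> bytes:
    """POST an OCSP request and return the raw response body."""
    req = urllib.request.Request(
        url,
        data=request_der,
        headers={"Content-Type": "application/ocsp-request"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read()


def probe(url, request_der, fetch=http_post, timeout=10.0):
    """Return (healthy, detail) as seen from the client side."""
    try:
        body = fetch(url, request_der, timeout)
    except OSError as exc:  # URLError, HTTPError and socket timeouts
        return False, f"transport error: {exc}"
    try:
        status = parse_response_status(body)
    except ValueError as exc:
        return False, f"unparseable response: {exc}"
    return status == 0, OCSP_STATUS.get(status, f"unknown({status})")
```

Run from outside the data centers against the public URL, a probe like this follows the same path as relying-party traffic, so a failure between the CDN and the origin shows up as a failed probe rather than going unnoticed.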

Thank you for the incident report.

Is OCSP failover part of IdenTrust's DR procedures? If so, can you explain why the issue was not detected during regular review and testing of the DR plan?

Flags: needinfo?(roots)

(In reply to Wayne Thayer from comment #4)

> Thank you for the incident report.
>
> Is OCSP failover part of IdenTrust's DR procedures? If so, can you explain why the issue was not detected during regular review and testing of the DR plan?

Yes, this and other OCSP validation servers are part of IdenTrust annual DR testing.
The last DR exercise was conducted in October 2019. At that time, switching of all DNS traffic was tested successfully, with the caveat that no traffic spikes were experienced. To determine success during that DR test, we used the same monitoring method mentioned in the incident report, which does not detect external service impact on traffic from the CDN. As noted in the incident report, we will be updating this monitoring method so that in the future it detects issues related to switching traffic between the CDN and our sites.

Whiteboard: [ca-compliance] → [ca-compliance] Next Update - 30-July 2020

In accordance with our communicated plan, we have completed implementation of enhanced end-to-end monitoring of http://isrg.trustid.ocsp.identrust.com. With this monitoring in place we will detect user-facing availability issues for http://isrg.trustid.ocsp.identrust.com, including any communication failures between CDN services and origin OCSP responders.
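
One way to separate a CDN-to-origin communication failure from an origin outage is to compare a probe through the public URL with probes sent directly to each origin site. The sketch below is only an illustration under that assumption. The `localize` function and the site names "primary" and "secondary" are hypothetical, and the boolean inputs stand for the results of whatever health probes are in use.

```python
from typing import Mapping


def localize(via_cdn_ok: bool, origins_ok: Mapping[str, bool]) -> str:
    """Classify an availability problem from probe results.

    via_cdn_ok: result of a probe through the public, CDN-fronted URL.
    origins_ok: result of a direct probe of each origin site, by name.
    """
    healthy = sorted(name for name, ok in origins_ok.items() if ok)
    failing = sorted(name for name, ok in origins_ok.items() if not ok)
    if via_cdn_ok and not failing:
        return "healthy"
    if via_cdn_ok:
        return "user-facing OK, origin(s) failing: " + ", ".join(failing)
    if healthy:
        # Users see errors although an origin is up: the CDN is not
        # reaching a healthy origin, for example after a failed failover.
        return "CDN-to-origin path failing, origin(s) healthy: " + ", ".join(healthy)
    return "user-facing outage, all origins failing"


print(localize(False, {"primary": False, "secondary": True}))
```

The third case is the one that monitoring limited to the origin sites cannot see, because a direct probe of the healthy site still succeeds.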

I intend to close this bug on or after 5-Aug-2020 unless additional issues or questions are raised.

Flags: needinfo?(bwilson)
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Flags: needinfo?(roots)
Flags: needinfo?(bwilson)
Resolution: --- → FIXED
Product: NSS → CA Program
Whiteboard: [ca-compliance] Next Update - 30-July 2020 → [ca-compliance] [ocsp-failure]