Closed Bug 1734906 Opened 2 months ago Closed 1 month ago

IdenTrust: Intermittent interruptions to DNS service

Categories

(NSS :: CA Certificate Compliance, task)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: roots, Assigned: roots)

Details

(Whiteboard: [ca-compliance])

User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.61 Safari/537.36

Steps to reproduce:

  1. How your CA first became aware of the problem (e.g. via a problem report submitted to your Problem Reporting Mechanism, a discussion in mozilla.dev.security.policy, a Bugzilla bug, or internal self-audit), and the time and date.
    IdenTrust:
    Two issues occurred, the first of which triggered the second. Additionally, one issue identified on October 6, 2021 presented anomalous behavior that may or may not be related to the original incident. Although these incidents have no bearing on certificate issuance, IdenTrust is filing a report to ensure that full disclosure is made according to CA/B Forum requirements. This incident report provides details on both issues in order to explain the cause and effect of the interrelated incidents.
    On September 30, 2021 at 8:01 am MT, the IdenTrust Root DST X3 (DST X3) expired as expected. This resulted in an inordinate number of customers that had relied on the DST X3 root connecting to our services hosted in Salt Lake City to download the new certificate chain.
    IdenTrust system engineers were notified of a large volume of network traffic coming into IdenTrust, causing site connectivity to be intermittent. This traffic was a result of the expired DST X3 CA certificate and further impacted IdenTrust services, causing slow response times and timeouts for customers attempting OCSP validations and other certificate lifecycle events.
    Related to the DST X3 expiration, but outside of IdenTrust's control, networks and systems worldwide had issues validating traffic and resolving DNS, causing slow responses and timeouts that compounded the difficulty for customers trying to access our sites to download the new root chains.
    On 10/05/2021, beginning at approximately 5:00 pm MT, intermittent CRL validation failures were logged for traffic routed via a round-robin method to the DR system. As previously mentioned, rerouting to the DR system had been implemented as a temporary stop-gap to help alleviate the timeouts due to the atypical traffic pattern created by the expiration of DST X3.
    While alerts were logged during the night, no notifications were sent to IT personnel via text message. At approximately 6:30 am MT on 10/06/2021, a customer report precipitated escalation of the issue and commencement of a SWAT call to determine root cause and delegate remediation tasks.
    The validation failures were isolated to the DR system only, and the root cause was identified: the CRLs used by the DR system had not been updated according to normal production protocol and were outdated. After conducting testing, new CRLs were pushed to the DR system at 9:18 am MT on 10/06/2021, and issue resolution was confirmed.
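    The stale-CRL condition on the DR system could have been caught before customer impact by a freshness check on each CRL's nextUpdate field (per RFC 5280). The following is a minimal sketch, not IdenTrust's actual monitoring; the timestamps are hypothetical illustrations:

    ```python
    from datetime import datetime, timedelta, timezone

    def crl_is_stale(next_update, now=None, grace=timedelta(0)):
        """Return True if a CRL's nextUpdate has passed (plus an optional grace period)."""
        now = now or datetime.now(timezone.utc)
        return now > next_update + grace

    # Hypothetical example: a CRL whose nextUpdate lapsed while traffic
    # was round-robined to the DR system.
    next_update = datetime(2021, 10, 5, 23, 0, tzinfo=timezone.utc)
    check_time = datetime(2021, 10, 6, 12, 30, tzinfo=timezone.utc)  # ~6:30 am MT
    print(crl_is_stale(next_update, check_time))  # → True
    ```

    Running such a check against every serving node, including DR systems that only receive traffic during failover, would flag outdated CRLs regardless of which node a validation request happens to hit.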
  2. A timeline of the actions your CA took in response. A timeline is a date-and-time-stamped sequence of all relevant events. This may include events before the incident was reported, such as when a particular requirement became applicable, or a document changed, or a bug was introduced, or an audit was done.
    IdenTrust:
    IdenTrust system engineers were troubleshooting and working on plans to remediate the issues and redistribute incoming traffic. As a result of these discussions, approximately 40% of the traffic was moved to our secondary site to alleviate bottlenecks at our SLC site. These actions resolved the issues for OCSP and lifecycle events. The timeline associated with these activities is as follows:
    September 30, 2021
    • 8:01 am MT: Border firewalls reported a large increase in traffic to IdenTrust services
    • 8:10 am MT: Monitoring of website and systems notified system engineers of site connectivity issues due to slow response times and timeouts
    • 8:15 am MT: IdenTrust System Engineers were actively troubleshooting the issues created due to the traffic volumes
    • 8:30 am MT: IdenTrust Systems Engineers verified all network and systems integrity
    • 10:00 am MT: Systems engineers migrated an estimated 40% of traffic from the main site to the secondary site to mitigate slow response times due to the volume of traffic. Also determined that a customer was sending abnormally large volumes of traffic, which was mitigated by blocking their HTTP POSTs.
    • 10:30 am MT: Response times stabilized at both sites and returned to within normal parameters, correcting the website slowness and timeouts. OCSP and certificate lifecycle events confirmed restored
    October 1, 2021
    • 8:30 am MT: Monitoring of firewalls reported a large volume of traffic at both main and secondary sites, causing new OCSP and LRA access issues due to slow response times and timeouts
    • 8:45 am MT: Continued troubleshooting and monitoring showed all response times as healthy
    • 10:15 am MT: Moved all OCSP validation traffic to secondary site
    • 10:25 am MT: Verified that DST X3 root p7c file was published for retrieval
    • 11:00 am MT: Completed move for DST X3 CRL to cloud for retrieval
    • 11:00 am MT: Traffic in both main and secondary sites returned to healthy levels and verified OCSP and certificate lifecycle event access restored
    As previously mentioned, rerouting some traffic to the DR system was implemented as a temporary stop-gap to help alleviate the timeouts due to the atypical traffic pattern created by the expiration of the DST X3 CA certificate. The following details pertain to an interrelated incident that was triggered by the rerouting of traffic to the DR system.
    October 5, 2021
    • 5:01 pm MT: Intermittent CRL validation failures were logged for traffic routed via a round-robin method to the DR system
    October 6, 2021
    • 6:30 am MT: A customer report precipitated escalation of a potential CRL issue
    • 6:30 am MT: A SWAT call was initiated
    • 6:30 am MT to 8:30 am MT: Troubleshooting was conducted, resulting in identification of the root cause
    • 8:30 am MT: Testing of the proposed fix was conducted in the test environment, confirming the remediation plan and that the CRL failures were isolated to requests routed to the DR system
    • 9:18 am MT: New CRLs were pushed to the DR system and the CRL posting script was updated
    • 9:20 am MT: Resolution was confirmed
    Anomalous behavior was identified on October 6, 2021, but it presented for two brief periods of time and then self-resolved. An inordinately high volume of DNS traffic from a wide array of IP addresses caused intermittent service impacts. To fully mitigate these occurrences, IdenTrust is further expanding DNS capacity.
    October 6, 2021
    • 4:18 pm MT: Began receiving alerts related to intermittent service impacts
    • 4:22 pm MT: Traffic normalized without IT intervention
    • 4:41 pm MT: Again, began receiving alerts of the same type
    • 5:13 pm MT: Again, traffic normalized without IT intervention
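    The round-robin distribution described above also explains why the CRL failures were intermittent rather than total: only requests that landed on the DR endpoint hit the stale CRLs. A minimal sketch of strict round-robin selection (the endpoint names are hypothetical, not IdenTrust's actual hosts):

    ```python
    from itertools import cycle

    # Hypothetical endpoints: the primary SLC site and the DR system.
    endpoints = ["slc-primary.example.net", "dr-site.example.net"]
    rotation = cycle(endpoints)

    def next_endpoint():
        """Pick the next endpoint in strict round-robin order."""
        return next(rotation)

    print([next_endpoint() for _ in range(4)])
    # → ['slc-primary.example.net', 'dr-site.example.net',
    #    'slc-primary.example.net', 'dr-site.example.net']
    ```

    With two endpoints in rotation, roughly every other validation request would have reached the DR system and failed, matching the intermittent pattern logged on October 5-6.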
  3. Whether your CA has stopped, or has not yet stopped, issuing certificates with the problem. A statement that you have will be considered a pledge to the community; a statement that you have not requires an explanation.
    IdenTrust: This incident did not impact certificate issuance, only CRL validation.
  4. A summary of the problematic certificates. For each problem: number of certs, and the date the first and last certs with that problem were issued.
    IdenTrust: Not applicable.
  5. The complete certificate data for the problematic certificates. The recommended way to provide this is to ensure each certificate is logged to CT and then list the fingerprints or crt.sh IDs, either in the report or as an attached spreadsheet, with one list per distinct problem.
    IdenTrust: Not applicable
  6. Explanation about how and why the mistakes were made or bugs introduced, and how they avoided detection until now.
    IdenTrust: This was an atypical situation in which expiration of the DST X3 root created a traffic pattern that had not been experienced before. The steps put in place to troubleshoot and mitigate the impacts of the originating incident were also atypical and were implemented in short order to relieve the overall impact on all customers using the service.
    The changes made to divert the customer traffic, as mentioned in the bullet point time-stamped September 30, 2021, 10:00 am above, inadvertently broke the CRL copy script because it was using an HTTP POST rather than a GET. This script had been working effectively without failures for a number of years and was only disrupted when IdenTrust blocked the method by which it runs. The script was updated on October 6, 2021, resolving the DR CRL posting issues.
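    The report does not include the script itself. As an illustration only, the corrected behavior might look like the following sketch, assuming the copy is done over HTTP with Python's urllib; the URL is hypothetical. Using GET keeps the retrieval working even when POSTs from that source are blocked at the border:

    ```python
    from urllib.request import Request

    CRL_URL = "http://crl.example.com/rootca.crl"  # hypothetical URL

    def build_crl_request(url):
        """Build the CRL retrieval request with an explicit GET.
        A POST here would be dropped once POSTs are blocked upstream."""
        return Request(url, method="GET")

    req = build_crl_request(CRL_URL)
    print(req.get_method())  # → GET
    ```

    The actual request would then be issued with `urllib.request.urlopen(req)` and the response body written out for the DR system to serve.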
    In addition, because of the large number of alerts logged during and after the original DST X3 expiration issue, the IdenTrust team was inundated with alerts identifying original and residual impacts to the system. The large quantity of monitoring alerts also produced lagging alerts that made it unclear whether a given alert pertained to the prior or the ongoing issue, ultimately causing the residual issue not to be recognized.
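    One way to reduce this kind of confusion during an alert flood is to classify incoming alerts against the known incident window, with an allowance for delivery lag. This is a hypothetical sketch, not IdenTrust's tooling; the timestamps correspond loosely to the events in this report:

    ```python
    from datetime import datetime, timedelta, timezone

    def classify_alert(alert_time, incident_start, incident_end,
                       lag=timedelta(minutes=30)):
        """Label an alert as part of the known incident window
        (including a lag allowance for delayed delivery) or as a new issue."""
        if incident_start <= alert_time <= incident_end + lag:
            return "known-incident"
        return "new-issue"

    # Hypothetical window for the original expiration incident (UTC).
    start = datetime(2021, 9, 30, 14, 1, tzinfo=timezone.utc)
    end = datetime(2021, 10, 1, 17, 0, tzinfo=timezone.utc)

    # An alert arriving days later (the DR CRL failures) is clearly new.
    late_alert = datetime(2021, 10, 5, 23, 1, tzinfo=timezone.utc)
    print(classify_alert(late_alert, start, end))  # → new-issue
    ```

    Alerts tagged "new-issue" would then bypass the flood-related suppression and page on-call staff directly, which would have surfaced the October 5 failures overnight instead of via a customer report the next morning.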
  7. List of steps your CA is taking to resolve the situation and ensure such issuance will not be repeated in the future, accompanied with a timeline of when your CA expects to accomplish these things.
    IdenTrust: IdenTrust took the following mitigation steps and has no outstanding tasks related to this incident, other than to continuously monitor the situation:
    • Deployed a Content Delivery Network (CDN) to help absorb and redistribute the high volume of traffic
    • Fixed the automated push for CRL distribution to the DR system
Assignee: bwilson → roots
Status: UNCONFIRMED → ASSIGNED
Type: defect → task
Ever confirmed: true
Whiteboard: [ca-compliance]

Status Update:
We have continued monitoring and have not experienced further DNS service interruptions.
No additional tasks are pending relevant to this incident.

Our continued monitoring of this expired root reflects that volumes have subsided, with no further DNS service interruptions experienced. We kindly request that this bug be resolved as "Fixed".

Flags: needinfo?(roots)
Status: ASSIGNED → RESOLVED
Closed: 1 month ago
Flags: needinfo?(roots)
Resolution: --- → FIXED