Open Bug 1895312 Opened 1 month ago Updated 26 days ago

TunTrust: CRL and OCSP unavailable

Categories

(CA Program :: CA Certificate Compliance, task)

Tracking

(Not tracked)

ASSIGNED

People

(Reporter: ndca.pki, Assigned: ndca.pki)

Details

(Whiteboard: [ca-compliance] [crl-failure] [ocsp-failure])

Attachments

(1 file)

This is a preliminary incident report. The complete incident report will be posted before May 17th, 2024.

Incident Report

Summary

Internet connectivity to the website (CRL repository) and the validation authority (OCSP) was unavailable on May 3rd, 2024, from 17:12 to 20:02 UTC.

Impact

Third parties could not access the CRL of “TunTrust Root CA” and the OCSP.

Timeline

All times are UTC.

2024-05-03:

  • 17:12 IT team receives the first SMS alert stating that the website and the OCSP are unavailable. These SMS alerts kept coming in every 5 minutes (a pre-configured value in the monitoring systems).
  • 17:25 IT team checks the availability of the online services (at the main site and the disaster recovery site) and confirms that they are unreachable over the internet.
  • 18:08 IT team checks the internal connection to the systems and confirms that the issue is related to the external network.
  • 18:30 IT team contacts the internet service provider and confirms that the provider has an incident affecting all of its clients, which has made several websites unavailable.
  • 20:02 The last SMS alert was received and the internet connectivity was confirmed to be re-established.
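The figures in the timeline above can be checked with simple arithmetic. The following is a minimal sketch, not part of the original report: it takes the first and last alert times and the 5-minute alert interval from the timeline, and it assumes (this is not stated in the report) that alerts fired at perfectly regular intervals with none dropped.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M"

# Times taken from the timeline above (UTC).
first_alert = datetime.strptime("2024-05-03 17:12", FMT)
last_alert = datetime.strptime("2024-05-03 20:02", FMT)

# Alert interval stated in the timeline.
ALERT_INTERVAL = timedelta(minutes=5)

outage = last_alert - first_alert
outage_minutes = int(outage.total_seconds() // 60)
hours, minutes = divmod(outage_minutes, 60)

# Assumption: alerts were perfectly periodic and none were lost, so the
# count is the number of whole intervals plus the first alert.
estimated_alerts = outage // ALERT_INTERVAL + 1

print(f"Outage: {hours} h {minutes} min ({outage_minutes} minutes)")
print(f"Estimated SMS alerts: {estimated_alerts}")
```

Under those assumptions the outage lasted 2 hours 50 minutes. The alert count is only an estimate, since the report does not give the actual number of SMS messages received.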

Root Cause Analysis

The internet service provider had an incident related to the provision of internet connectivity on May 3rd 2024. We are waiting for the official statement from this entity regarding the incident.
We will provide more information regarding the investigation as soon as we receive the official statement.

Lessons Learned

What went well

  • The alerts that were received by SMS (automatically from the monitoring systems) made it possible to detect the incident at the beginning.
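For readers unfamiliar with this kind of monitoring, the following is a minimal sketch of an external availability probe. It is not TunTrust's monitoring system, whose implementation is not described in this report. The hostnames are the ones named later in this thread; the `http` scheme, the root path, and all function names are assumptions made for illustration.

```python
import urllib.error
import urllib.request

# Hostnames come from later in this thread; scheme and path are assumptions.
ENDPOINTS = {
    "CRL": "http://crl.tuntrust.tn/",
    "OCSP": "http://va.tuntrust.tn/",
}


def http_fetch(url, timeout=10.0):
    """Return True if any HTTP response at all comes back from url."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, even if with an error status, so the
        # network path to it is up.
        return True
    except OSError:
        # URLError, timeouts and connection failures all derive from OSError.
        return False


def probe(endpoints, fetch=http_fetch):
    """Map each endpoint name to True (reachable) or False (unreachable)."""
    return {name: fetch(url) for name, url in endpoints.items()}


def unreachable(results):
    """Return the sorted names of endpoints that failed the probe."""
    return sorted(name for name, ok in results.items() if not ok)
```

Calling `unreachable(probe(ENDPOINTS))` from a scheduler every 5 minutes and sending an alert when the list is non-empty would approximate the behaviour described. Such a probe only detects an ISP-level outage if it runs from a network outside the affected provider.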

What didn't go well

  • The disaster recovery site and the main site both rely on the same internet service provider. Switching to the disaster recovery site would therefore not have helped, since it was also unreachable.

Where we got lucky

Action Items

This section will be filled once the official statement from the internet service provider is received.

Action Item | Kind | Due Date

Appendix

Details of affected certificates

No certificates were affected because of this incident.

Based on Incident Reporting Template v. 2.0

Assignee: nobody → ndca.pki
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Whiteboard: [ca-compliance] [crl-failure] [ocsp-failure]

This is an update to the incident report. All changes are marked in bold.

Summary

Internet connectivity to the website (CRL repository) and the validation authority (OCSP) was unavailable on May 3rd, 2024, from 17:12 to 20:02 UTC.

Impact

Third parties could not access the CRL of “TunTrust Root CA” and the OCSP.

Timeline

All times are UTC.
2024-05-03:

  • 17:12 IT team receives the first SMS alert stating that the website and the OCSP are unavailable. These SMS alerts kept coming in every 5 minutes (a pre-configured value in the monitoring systems).
  • 17:25 IT team checks the availability of the online services (at the main site and the disaster recovery site) and confirms that they are unreachable over the internet.
  • 18:08 IT team checks the internal connection to the systems and confirms that the issue is related to the external network.
  • 18:30 IT team contacts the internet service provider and confirms that the provider has an incident affecting all of its clients, which has made several websites unavailable.
  • 20:02 The last SMS alert was received and the internet connectivity was confirmed to be re-established.

2024-05-06:

  • TunTrust sent a request for a detailed incident report from the internet service provider (ISP).

2024-05-10:

  • 13:45 TunTrust received the incident report from the ISP.

Root Cause Analysis

On May 3rd, 2024, the internet service provider had an incident affecting the provision of internet connectivity, caused by a hardware failure in a piece of equipment in their architecture.

Lessons Learned

What went well

  • The alerts that were received by SMS (automatically from the monitoring systems) made it possible to detect the incident at the beginning.

What didn't go well

  • The disaster recovery site and the main site both rely on the same internet service provider. Switching to the disaster recovery site would therefore not have helped, since it was also unreachable.

Where we got lucky

Action Items

Although the ISP has committed to implementing action items in their architecture to guarantee high availability of internet provision, TunTrust is going to put in place the following action:

Action Item | Kind | Due Date
Reinforce the main site's internet provision with two separate lines from two different telecommunications operators. | Prevent | 31-07-2024

Appendix

Details of affected certificates

No certificates were affected because of this incident.

Blaming the ISP as a root cause does not seem to me like it lives up to the expectations set by the CCADB incident report template. As shown by the action items, you are already aware of the option of redundant internet lines. I would expect the root cause analysis to delve into this and beyond, for example investigating why this redundancy was not already in place and whether you were aware of that lack before this incident. I think it would also be worth updating the timeline with things such as for how long you have been relying on this single uplink.

(In reply to TunTrust from comment #1)

Summary

Internet connectivity to the website (CRL repository) and the validation authority (OCSP) was unavailable on May 3rd, 2024, from 17:12 to 20:02 UTC.

Impact

Third parties could not access the CRL of “TunTrust Root CA” and the OCSP.

The point of the Impact section is to communicate what and how many things were affected. Providing statistics and aggregate helps the reader measure each problem. I would expect more detail on what was impacted especially since CCADB provides guidance with examples.

The Impact section should contain a short description of the size and nature of the incident. For example: how many certificates, OCSP responses, or CRLs were affected; whether the affected objects share features (such as issuance time, signature algorithm, or validation type); and whether the CA Owner had to cease issuance during the incident.

TunTrust Root CA is the only root in the Mozilla store right? Does that mean that all trusted certificates had their CRL and OCSP services impacted? How many certificates is that? What domains were impacted? Were all OCSP endpoints and CRLs hosted on the same domain? What were those OCSP endpoints and CRLs that were unavailable?

(In reply to boet from comment #2)

Blaming the ISP as a root cause does not seem to me like it lives up to the expectations set by the CCADB incident report template. As shown by the action items, you are already aware of the option of redundant internet lines. I would expect the root cause analysis to delve into this and beyond, for example investigating why this redundancy was not already in place and whether you were aware of that lack before this incident. I think it would also be worth updating the timeline with things such as for how long you have been relying on this single uplink.

We do have redundancy in place: two optical fiber (OF) cables for the main site and two for the DR site. Furthermore, the contract with the ISP includes SLAs for high availability. It states that the ISP manages both lines, so that if the first line goes down, they switch to the second. However, the incident occurred within the ISP itself, which is also the DNS registrar in Tunisia. Accessing their platform to perform the DNS-based failover was therefore not possible, since the platform itself was unreachable.

The action item we are suggesting is to have another line with another ISP totally independent from our ISP.

We are also exploring the possibility of having a second CRL endpoint under a domain extension other than “.tn”, hosted outside Tunisia.
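To illustrate how a second CRL endpoint could help a relying party, here is a minimal sketch of a client that tries several CRL URLs in order. It is not a description of how any particular browser or library behaves: a certificate's CRL Distribution Points extension can list more than one URI, but whether a given client tries the later ones varies. Both URLs, the file name, and the function names are hypothetical.

```python
import urllib.request

# Hypothetical URLs: the first uses the CRL hostname named in this thread
# with an invented path; the second stands in for an endpoint outside ".tn".
CRL_URLS = [
    "http://crl.tuntrust.tn/example.crl",
    "http://crl.example.net/example.crl",
]


def http_get(url, timeout=10.0):
    """Download url and return the response body as bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def fetch_first_available(urls, fetch=http_get):
    """Try each URL in order; return (url, body) for the first that works."""
    errors = {}
    for url in urls:
        try:
            return url, fetch(url)
        except OSError as exc:
            # URLError, HTTPError and timeouts all derive from OSError.
            errors[url] = exc
    raise ConnectionError(f"all CRL endpoints failed: {errors}")
```

The fallback only adds resilience if the second endpoint shares no dependency with the first: a different ISP, a different DNS registrar, and a different top-level domain.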

(In reply to Mathew Hodson from comment #3)

The point of the Impact section is to communicate what and how many things were affected. Providing statistics and aggregate helps the reader measure each problem. I would expect more detail on what was impacted especially since CCADB provides guidance with examples.

TunTrust Root CA is the only root in the Mozilla store right? Does that mean that all trusted certificates had their CRL and OCSP services impacted? How many certificates is that? What domains were impacted? Were all OCSP endpoints and CRLs hosted on the same domain? What were those OCSP endpoints and CRLs that were unavailable?

We will provide the requested appendix in attachment.

TunTrust Root CA is indeed the only root CA we have in Mozilla Trusted List. We have one issuing CA under “TunTrust Root CA”, which is “TunTrust Services CA” and it is technically constrained to only issue SSL certificates to “.tn”. This means indeed that all trusted certificates had their CRL and OCSP services impacted. On the 13th of May 2024, there were 59 valid certificates.

Among these certificates:

  • 28 rely on the same ISP (and were therefore unreachable because of the incident),
  • 5 rely on other ISPs,
  • 13 rely on a private network we have between governmental entities, so our CRL and OCSP were available for them,
  • 13 have not given us an answer yet.
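As a quick consistency check on the breakdown above, the following sketch sums the four categories and computes each one's share of the 59 valid certificates. The counts are taken from this comment; the category labels are shortened paraphrases, and the percentages are derived here rather than reported by TunTrust.

```python
# Counts as given in the breakdown above; labels are paraphrased.
counts = {
    "same ISP (unreachable)": 28,
    "other ISPs": 5,
    "private governmental network (CRL/OCSP available)": 13,
    "no answer yet": 13,
}

TOTAL_VALID = 59  # valid certificates on 13 May 2024, per this comment

total = sum(counts.values())

# Percentage share of each category, rounded to one decimal place.
shares = {label: round(100 * n / total, 1) for label, n in counts.items()}

print(f"Sum of categories: {total} (reported total: {TOTAL_VALID})")
for label, pct in shares.items():
    print(f"{label}: {counts[label]} ({pct}%)")
```

The categories do add up to the reported total, and just under half of the certificates fall in the group on the affected ISP.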

OCSP endpoints are hosted on “va.tuntrust.tn” and the CRLs are hosted on “crl.tuntrust.tn” but these two are not on the same server (i.e. not the same IP address).
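The claim that the two hostnames resolve to different addresses can be checked with a short script. This is a sketch under assumptions: the helper names are hypothetical, and the result depends on DNS at the time it is run. Distinct IP addresses do not by themselves imply independent upstream connectivity.

```python
import socket

# Hostnames as given in the comment above.
OCSP_HOST = "va.tuntrust.tn"
CRL_HOST = "crl.tuntrust.tn"


def resolve(host):
    """Return the set of IP addresses that host currently resolves to."""
    infos = socket.getaddrinfo(host, 80, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address string for both IPv4 and IPv6.
    return {info[4][0] for info in infos}


def shared_addresses(host_a, host_b, resolver=resolve):
    """Return the IP addresses the two hosts have in common (may be empty)."""
    return resolver(host_a) & resolver(host_b)
```

An empty result from `shared_addresses(OCSP_HOST, CRL_HOST)` would be consistent with the statement that the services are on separate servers.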

Attached file Appendix.txt