1885568 - VikingCloud: Delayed revocation of TLS certificates in connection to bug #1883779

Andrea Holland

Assignee

Description

•

4 months ago

Steps to reproduce:

Incident Report

This is a preliminary report.

Summary

While investigating incident https://bugzilla.mozilla.org/show_bug.cgi?id=1883779, we identified 3167 OV certificates impacted with the same Subject RDN Attribute order reversal. Based on that incident, the affected certificates should be revoked within 5 days according to Section 4.9.1.1 of the TLS Baseline Requirements (revoke by 2024-03-16 14:53 UTC). However, we will not achieve the 5-day timeline because of several factors that will be provided in the full incident report along with the list of OV certificates that were delayed in revocation.

Ben Wilson

Updated

•

4 months ago

Assignee: nobody → andreaholland

Status: UNCONFIRMED → ASSIGNED

Type: defect → task

Ever confirmed: true

Whiteboard: [ca-compliance] [ov-misissuance]

JR Moir

Comment 1

•

4 months ago

Can you confirm that the 3000-some certificates represent almost every single valid certificate of Viking Cloud?
I believe that to be the case.

If a CA mis-issues theire ENTIRE certificate base and then can't follow guidelines to revoke them in a suitable time...why are they trusted?

Andrea Holland

Assignee

Comment 2

•

4 months ago

Our full report will give details around this delay including the areas of issues and our plans to improve. The full report will be posted no later than March 29, 2024.

Andrea Holland

Assignee

Comment 3

•

3 months ago

Incident Report

Summary

VikingCloud had issued TLS certificates with a reverse Subject RDN attribute order as described in https://bugzilla.mozilla.org/show_bug.cgi?id=1883779. According to TLS BR Section 4.9.1.1 #12, the affected certificates should have been revoked within 5 days.

Our reissuance and revocation process was initially slowed down because our ticketing system caused a delay in our ability to send out email notifications to all relevant Subscribers – ultimately impacting response time. While our system would have allowed us to revoke all the certificates within the 5-day timeframe, Subscribers had not reissued all impacted certificates and we didn’t have confirmation of Subscriber installation within that timeframe. Additionally, continuity of some of the Subscribers' critical infrastructures depended on the affected certificates. While the Subscribers had processes in place to handle emergency situations, in this case it was determined by the Subscribers to be riskier to push the certificate reissuances at an emergency speed, skipping normal installation steps like quality assurance and security testing, which imposes higher risks in this particular situation. Therefore, revocation of the affected certificates was delayed to avoid negatively impacting Subscriber’s services and their end-users.

It is important to note that there have not been any identified impacts in terms of security or use of the misissued certificates as the information contained in the certificates was valid. Regardless, the misissuance event was an opportunity to action our large reissuance and revocation process and interface with Subscribers regarding the revocation requirements in our CPS as well as the TLS Baseline Requirements. Effective preventive actions will be put in place to reduce the likelihood of recurrence.

Impact

3082 certificates were not revoked in the 5-day period.

Even though the non-compliance incident does not pose a security risk, nor was it a misissuance of the Subject information provided in the certificates, industry rules require that we revoke certificates not issued in full compliance with TLS BR standards. We will comply with the standards and revoke the affected certificates. The expected final revocation date is April 30, 2024.

Timeline (All time are UTC)

2024-03-11 14:53 – Confirmation other OV certificates affected.
2024-03-11 17:04 – Notifications to Subscribers began.
2024-03-12 14:50 – Sales organization involved in aiding Subscriber contact and started setting up meetings of largely impacted Subscribers to address a plan of action.
2024-03-12 20:45 – Subscriber notified us of email communication issues.
2024-03-14 17:43 – A second round of email notifications sent out.
2024-03-15 16:20 – Preliminary report for delayed revocation created.
2024-03-16 14:53 – TLS BR required revocation date of the affected OV certificates. – Failed to meet.
Continuing to work with Subscribers on action plans for reissuance, installation, and revocation.

Root Cause Analysis

Root Cause # 1

Why was there a problem?

Because the 5-day revocation requirement for a misissaunce event was not met for several of the impacted certificates.

Why was the 5-day revocation requirement missed?

Because the Subscriber did not take immediate action.

Why did the Subscriber not take immediate action?

Because our ticketing system caused a delay in reaching out to inform Subscribers about this situation and the need for revocation.

Why did our communication plan and system not properly inform Subscribers?

Because the communication didn’t reach the Subscriber who had the ability to take immediate action on the request.

Why did the communication not reach the Subscriber who had the ability to take immediate action on the request?

The following are some, but not all of the specific reasons why this communication did not properly inform the Subscriber:

Subscriber contact information was outdated in the Subscriber Portal due to change in personnel within the Subscriber organization.
Subscriber mailbox rules dropped the communication into a folder which was not viewed as often by the Subscriber.
Subscriber mail systems had labeled the communication as possible SPAM.
The communication system did not send to all entered Subscriber contact email addresses on the messages.
The communication system had an issue that caused support-initiated emails to incorrectly respond as though they were Subscriber created issues.

Given this communication breakdown, this situation was also impacted by another root cause:

Root Cause # 2:

Why was there a problem?

Because the 5-day revocation requirement for a misissaunce event was not met for several of the impacted certificates.

Why was the 5-day revocation requirement missed?

Because Subscribers in critical industries and more heavily impacted Subscribers needed more time for installation of reissued certificates before the revocation of the misissued certificates.

Why did the Subscribers need more time?

Because manual replacement, especially on devices without automation and on critical infrastructure, requires more time and change control due to the risk of error.

Why was the risk of error a factor in manual replacement?

Because the risk of error during rapid replacement, which involves skipping normal installation steps like security testing, was deemed to be too high of a risk for Subscriber organizations in this situation.

Why did the risk of error outweigh the need to rapidly replace?

Because the risk of causing a potential outage by a rapid replacement could directly impact the Subscriber’s end-users and introduce potential security issues, especially in some of the industries and devices that these certificates were installed on.

Lessons Learned

What went well

Once the appropriate Subscriber contact had received the correct email, they reached out to support with questions and for help.

What didn’t go well

Subscriber Notification

The notices sent out to Subscribers took longer than anticipated:
- Some of the initial emails were not sending the misissuance event content and instead sending a “Thank you for contacting support” notice.
- We ran into issues with the notification system only sending to one of the contact email addresses instead of all.
Some of the Subscribers had not been updating contact information on their accounts, so the proper contacts were not aware of the necessity to reissue and requirement for revocation.
Notification back from the Subscriber of installation completion was not properly connected to the original email for reissuance and therefore it was time consuming to link those up prior to revocation.

Subscriber Infrastructure

Subscribers with wildcard certificates in use across multiple devices require more time to install across all the impacted devices.
A large number of our Subscribers do not utilize the automation we currently have in place due to security concerns, system isolation, or device requirements.
Subscribers who were starting to implement automation ran into issues with certificates spread across automation platforms and not in automation platforms.
Subscribers who integrate with multiple divisions and/or organizations needed time to reach out to get an action plan in place.

Where we got lucky

We were lucky that this misissuance event was not a security incident requiring 24-hour revocation.

Action Items

Action Item	Kind	Due Date
Investigate new system for misissuance notifications and responses	Mitigation	10-23-2024
Setup annual request for Subscribers to check contact information provided to us	Prevent	06-26-2024
Annual test of misissuance process in our risk assessment procedure	Prevent	07-31-2024
Update ACME implementation to provide more Subscriber value	Mitigation	12-18-2024
Improve Subscriber adoption of automation	Mitigation	12-18-2024
Communicate process improvements to Subscriber in the event of misissuance	Mitigation	06-26-2024

Appendix

Details of affected certificates

Attached.

Number of revoked certificates

As of 2024-03-29 20:29 UTC 1177 affected certificates have been revoked.

Andrea Holland

Assignee

Comment 4

•

3 months ago

Attached file Affected_Certficates.txt — Details

Andrea Holland

Assignee

Comment 5

•

3 months ago

Update

Additional EV certificates were impacted by the Subject RDN attribute order as described in https://bugzilla.mozilla.org/show_bug.cgi?id=1883779. The affected certificates should be revoked within 5 days according to Section 4.9.1.1 of the TLS Baseline Requirements (revoke by 2024-04-06 15:15 UTC). However, we will not achieve the 5-day timeline. The list of delayed revocation for EV certificates will be provided by Monday, April 8.

Number of revoked certificates

As of 2024-04-05 21:00 UTC 1424 affected OV certificates have been revoked.

Andrea Holland

Assignee

Comment 6

•

3 months ago

Attached file Affected Certficates2.txt — Details

Andrea Holland

Assignee

Comment 7

•

3 months ago

Number of revoked certificates

As of 2024-04-12 21:00 UTC 1615 affected OV certificates have been revoked.
As of 2024-04-12 21:00 UTC 8 affected EV certificates have been revoked.

Andrea Holland

Assignee

Comment 8

•

3 months ago

Number of revoked certificates

As of 2024-04-19 21:00 UTC 1871 affected OV certificates have been revoked.
As of 2024-04-19 21:00 UTC 19 affected EV certificates have been revoked.

Rob Stradling

Updated

•

2 months ago

In checking this the CRL for EV certs: http://crl.vikingcloud.com/VCEVCA_L1.crl has last had an entry added on May 3rd, but says its last update was May 17th at 09:00 UTC.
Looking at another CRL for OV certs: http://crl.securetrust.com/OVCA2_L1.crl shows the last entry was May 15th, and the last update was May 17th at 10:00 UTC.

Flags: needinfo?(andreaholland)

Andrea Holland

Assignee

Comment 15

•

1 month ago

The timing on the Bugzilla last update about the completed revocations did not take into account the CDN CRL caching time. Thank you for bringing this to our attention. Upon seeing your post, we did clear our CDN cache which pushed the latest CRLs to the endpoints.

Flags: needinfo?(andreaholland)

Ben Wilson

Comment 16

•

1 month ago

Hi Andrea,
You mention critical industries / critical infrastructure in your incident report. Can you be more specific about the type of customers you're talking about? I didn't see any categorization in that regard.
Thanks,
Ben

Flags: needinfo?(andreaholland)

Andrea Holland

Assignee

Comment 17

•

1 month ago

The following types of customers were impacted: Government entities, healthcare technology providers (hospitals, labs, virtual medical services, etc.), telecommunications, energy providers, finance (banking, payment services, etc.), transportation, and essential business chains.

Flags: needinfo?(andreaholland)

Ben Wilson

Updated

•

1 month ago

Whiteboard: [ca-compliance] [ov-misissuance] [leaf-revocation-delay] → [ca-compliance] [ov-misissuance] [leaf-revocation-delay] 2024-06-26

Andrea Holland

Assignee

Comment 18

•

28 days ago

If there are no further comments, please set Next Update to 2024-06-26.

Tim Callan

Comment 19

•

20 days ago

(In reply to Andrea Holland from comment #18)

If there are no further comments, please set Next Update to 2024-06-26.

Andrea,

A five-day revocation event took more than two months to complete. It’s difficult to blame this late revocation on a glitch in your ticketing system that according to your timeline was solved in three days. I do not see any discussion on this bug about the true root cause of this massive delay. I do not see any action items to address this root cause. This bug does not meet expectations without these elements.

I will remind you of what has been repeatedly stated on many, other, current Bugzilla incidents, which is that Subscribers’ stated unwillingness to install replacement certificates within 120 hours is not the root cause of your failure to revoke, and encouraging Subscribers to become agile in their certificate management is insufficient as an action item to address this issue.

Also, comment 17 does not meet the requirements of Mozilla’s policy, which states in part,

The decision and rationale for delaying revocation will be disclosed in the form of a preliminary incident report immediately; preferably before the BR-mandated revocation deadline. The rationale must include detailed and substantiated explanations for why the situation is exceptional. Responses similar to “we do not deem this non-compliant certificate to be a security risk” are not acceptable. When revocation is delayed at the request of specific Subscribers, the rationale must be provided on a per-Subscriber basis.

Rather than setting Next Updates that are weeks in the future, I would like to see VikingCloud focus on these gaps in this incident report. Please provide a root cause analysis that addresses the real root cause of this failure and list action items that the community can reasonably expect to rectify this problem. Please provide detailed and substantiated explanations on a per-subscriber basis.

Andrea Holland

Assignee

Comment 20

•

14 days ago

As mentioned in the Root Cause Analysis section, the issues with Subscriber communications were due to multiple factors. In our timeline, the sending of a second round of notifications did not resolve the issues underlying the communication root cause. In many Subscriber situations, several communications (email and telephone) were required to help impacted customers understand the misissuance, the actions that needed to take place, and the people within their organization who had to perform those actions. It is important to rectify the communications issues that occurred for any future mass revocation event communications to improve Subscriber ability to timely address revocations, therefore reducing the impact on relying parties. We have action items around the improvement of communications with Subscribers.

Since this was the first time that most of our Subscribers had been required by us to revoke and replace their certificates within required timeframes, their emergency replacement processes, procedures, and tools had not been tested. This misissuance event also served as an ideal opportunity to interact with impacted Subscribers to push them in the direction of automation, private PKI for non-public TLS use cases, or discuss additional solutions and to be better prepared for the next security-style event which impacts all WebPKI. As an action item, we are reaching out to each impacted Subscriber to discuss improvements to streamline their certificate replacement process to prevent this event in the future.

Another one of our action items is technological improvements by upgrading our ACME implementation to include ARI which addresses some of the shortcomings of non-renewal functionality of our current ACME implementation. Having Subscribers update their processes and interactions with TLS impacts the WebPKI ecosystem as every Subscriber that moves in the direction of automation or private PKI means less impact and risk to relying parties overall.

Andrea Holland

Assignee

Comment 21

•

8 days ago

Action Items Update

Action Item	Kind	Due Date
Investigate new system for misissuance notifications and responses	Mitigation	10-23-2024
Setup annual request for Subscribers to check contact information provided to us	Prevent	Completed
Annual test of misissuance process in our risk assessment procedure	Prevent	07-31-2024
Update ACME implementation to provide more Subscriber value	Mitigation	12-18-2024
Improve Subscriber adoption of automation	Mitigation	12-18-2024
Communicate process improvements to Subscriber in the event of misissuance	Mitigation	Continuous

Mike Shaver (:shaver emeritus)

Comment 22

•

7 days ago

(In reply to Andrea Holland from comment #20)

As mentioned in the Root Cause Analysis section, the issues with Subscriber communications were due to multiple factors. In our timeline, the sending of a second round of notifications did not resolve the issues underlying the communication root cause.

In my opinion, communication issues cannot really be considered a root cause of a delayed revocation, because no subscriber communication is required in order to revoke a certificate. Do you agree with this? There is a legally-binding commitment to acceptance and acknowledgement of the possibility of immediate revocation, which Subscribers must agree to under 9.6.3(8).

There are really only two things that can cause a delayed revocation:

a technical or operational failure of a CA to complete an attempted revocation, for which the root cause will likely be a failure to ensure that those capabilities are always functioning, or;
a deliberate decision by the CA to not revoke a certificate when they are technically capable of it, for which the root cause will virtually always be either an incorrect revocation policy or a truly unforeseeable situation in which intolerable harm to the web ecosystem would result from timely revocation

It has to be “can’t due to literal inability to generate the revocation entry in the CRL” or “won’t because of decision to put disruption avoidance ahead of web PKI integrity due to the exceptional scale of harm that would result”.

Subscriber communication may help your subscribers avoid outages due to limits on their operational capability around certificate management, and I am certainly supportive of you helping them that way, but that is not an element of your commitments to the web PKI under the BRs or MRSP.

This misissuance event also served as an ideal opportunity to interact with impacted Subscribers to push them in the direction of automation, private PKI for non-public TLS use cases, or discuss additional solutions and to be better prepared for the next security-style event which impacts all WebPKI.

This is an excellent step to take, in that it might eliminate an unforeseeable circumstance of exceptional harm.

Is there a reason that you are not also contacting the subscribers who were not affected in this particular incident to push them similarly? Were the subscribers affected by the underlying incident unusually likely, among your population of subscribers, to have these issues of missing automation or misapplication of public web PKI? If this is likely to affect other subscribers who didn’t receive misissued certificates in this incident, then it seems that the proactive response would be to also contact them.

Another one of our action items is technological improvements by upgrading our ACME implementation to include ARI which addresses some of the shortcomings of non-renewal functionality of our current ACME implementation.

As with Subscriber communication, this does not in my opinion serve as an action item to address a root cause for delayed revocation, as lack of automation by either CA or Subscriber is not sufficient justification for delaying revocation of an invalid certificate within the 5/120 limits. Do you agree with this?

While I have questions about this, I do want to say that I appreciate the thoughtfulness and candor evident in this incident report.

Flags: needinfo?(andreaholland)

Affected_Certficates.txt 3 months ago Andrea Holland 267.87 KB, text/plain		Details
Affected Certficates2.txt 3 months ago Andrea Holland 2.17 KB, text/plain		Details